Design and Analysis of Clinical Trials Instructor: Jen-pei Liu, Ph.D. Department of Statistics National Cheng-Kung University Division of Biostatistics, National Health Research Institutes Lecture III: Statistical Principles for Analysis of Clinical Data
Statistical Methods for Biotechnology Products II Statistical Principles for Analysis of Clinical Data Instructor: Jen-pei Liu, Ph.D. Division of Biometry Department of Agronomy National Taiwan University, and Division of Biostatistics and Bioinformatics National Health Research Institutes
Types of Data • Continuous Endpoints • Numerical discrete data • Heartbeats per minute • Total NIHSS • Total Hamilton Rating Scale for Depression • Total Alzheimer’s Disease Assessment Scale
Types of Data • Continuous Endpoints • Numerical continuous data • Age • Weight • ALT • Peak flow rate (liters per minute) • FEV1 (% of predicted value)
Types of Data • Categorical Endpoints • Nominal scale data Classification of patients according to their attributes • Gender • Race • Occurrence of a particular adverse reaction • Occurrence of ALT>3 times upper normal limit
Types of Data • Ordered (ordinal scale) categorical data • A certain order among different categories • Symptom score 0 = no symptom, 1 = mild, 2 = moderate, 3 = severe • Severity of adverse reactions • Severity of disease
Types of Data Censored Endpoints • Time to the occurrence of a pre-defined event • Time (continuous) and occurrence (categorical) • The occurrence of the event may not be observed for some patients; the time to the occurrence of the event for these patients is then censored
Types of Data • Chapman, et al (NEJM 1991; 324: 788-94) The use of prednisone in reduction of relapse within 21 days of the treatment of acute asthma in the emergency room. • Primary endpoint Time to unscheduled visit to clinics because of worsening asthma.
Types of Data Cross-sectional vs. longitudinal data • Cross-sectional data (snapshot at one time point): clinical data are collected and evaluated at a particular time point during the trial • Longitudinal data (snapshots at several time points): clinical data are collected and evaluated over a series of time points during the trial
Example Knapp et al (JAMA 1994; 271: 985-991) • A multi-center trial with 33 centers • Double-blind, randomized, 4 parallel groups • Forced escalation 30 weeks of randomized treatment • 6 visits: the start of randomized treatment (baseline) and 6, 12, 18, 24, and 30 weeks • Cross-sectional data: CIBI and ADAS-cog evaluated at the start of randomized treatment • Longitudinal data: a series of CIBI and ADAS-cog evaluations at the start of the study, the start of randomized treatment, and 6, 12, 18, 24, and 30 weeks
Types of Comparison • Within-group (patient) comparison Comparison of the changes within the same patients at different time points during the trial. • Between-group (patient) comparison Comparison between groups of patients under different treatments.
Example: Major depressive disorder Stark and Hardison (JCP, 1985; 46: 53-58) Cohn and Wilcox (JCP, 1985; 46: 21-31) • Double-blind, randomized, three parallel groups • One-week placebo washout period • Fluoxetine vs. imipramine vs. placebo • 6 weeks of randomized treatment • Primary efficacy endpoint: HAM-D score at the last follow-up visit • Within each group: change from baseline in HAM-D score • Between groups: comparison of the change from baseline in HAM-D score between groups
Endpoints • Raw measurements at a time point. • Change at a time point from baseline. • Percent change at a time point from baseline. • Clinically meaningful targeted value attained at a time point, e.g., sitting DBP ≤ 85 mm Hg • The time points selected should be able to capture the effect of the intervention.
Selection of Endpoints • Endpoints should reflect the change of clinical status caused by the intervention. • Endpoints should be sensitive to the change of clinical status caused by the intervention. • Endpoints should be validated. • Raw measurements at a time point can only measure the static clinical status. • Change at a time point from baseline can measure the magnitude of the change of clinical status caused by the intervention. • Change from baseline has the same unit as the raw measurement
Selection of Endpoints • Percent change at a time point from baseline measures the relative magnitude of the change of clinical status caused by the intervention. • Percent change from baseline is unitless. • The same percent change may reflect different magnitudes of change • 20/100 = 2/10 = 200/1000 = 20%
Selection of Endpoints • One of the key inclusion criteria for clinical trials in the treatment of mild to moderate essential hypertension is a sitting DBP between 95 and 115 mm Hg. • Three changes from baseline: 115 → 105, 105 → 95, 95 → 85. • Percent changes from baseline: 8.7%, 9.5%, 10.5% • Only 95 → 85 reaches the clinically meaningful targeted value.
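The arithmetic on this slide can be checked with a short sketch. The function name `change_summary` is hypothetical, and the target of 85 mm Hg is taken from the sitting DBP example on the Endpoints slide; the same 10 mm Hg drop gives three different percent changes, and only one patient attains the target.

```python
def change_summary(baseline, follow_up, target=85.0):
    """Return (change, percent change, target attained) for one patient.

    baseline, follow_up: sitting DBP in mm Hg.
    target: clinically meaningful targeted value (assumed 85 mm Hg).
    """
    change = baseline - follow_up                 # reduction in mm Hg
    percent = round(100.0 * change / baseline, 1)  # unitless relative change
    return change, percent, follow_up <= target


# The three baseline -> follow-up pairs from the slide.
for baseline, follow_up in [(115, 105), (105, 95), (95, 85)]:
    print(baseline, "->", follow_up, change_summary(baseline, follow_up))
```

All three rows show the same absolute change, which is why the percent change alone cannot say whether a patient reached a clinically meaningful value.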
Selection of Endpoints • Endpoints should reflect clinically meaningful interpretation and applicability. • Clinically meaningful targeted value > change from baseline > percent change from baseline. • Clinical investigators should have responsibility for determination of the efficacy endpoints used in the clinical trials.
Selection of Endpoints
                           LDL            HDL            TG
Targeted value             < 100 mg/dL    40-60 mg/dL    < 150 mg/dL
Bile acid binding resin    15-30%         3-5%           no change
Nicotinic acid             5-25%          15-35%         15-25%
Fibric acid                5-20%          10-20%         20-50%
HMG-CoA inhibitor          18-55%         3-5%           7-30%
Descriptive Statistics All statistics are estimates with sampling errors • Continuous Data • Central tendency: Mean ȳ: the arithmetic average of all observations; Median: the middle observation • Dispersion: Standard deviation s; Minimum: the smallest observation; Maximum: the largest observation; Range: maximum minus minimum • Log-transformation: compute the mean on the log scale; exp(mean on the log scale) = geometric mean on the original scale
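As a minimal sketch of these summaries, the following uses only Python's standard `statistics` module. The function name `describe` and the sample values are hypothetical and are not data from any trial cited in this lecture.

```python
import math
import statistics


def describe(values):
    """Descriptive statistics for one group of positive continuous data."""
    logs = [math.log(v) for v in values]  # log-transform each observation
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),   # sample standard deviation s
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        # Back-transforming the mean of the logs gives the geometric mean.
        "geometric_mean": math.exp(statistics.mean(logs)),
    }


# Hypothetical right-skewed laboratory values (illustration only).
summary = describe([18, 22, 25, 31, 40, 95])
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For skewed data such as these, the geometric mean falls below the arithmetic mean because the log scale reduces the influence of the largest value.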
Descriptive Statistics • Presentation of results • Individual groups • Comparative difference • Example Adkinson, et al (NEJM 1997;336:324-31) Immunotherapy for asthma in allergic children
Categorical Data • Proportion of patients with a certain attribute: the number of patients with the attribute divided by the total number of patients in the group • Present both counts and proportions (m, p) • Chapman, et al (NEJM 1991; 324: 788-94) The use of prednisone in reduction of relapse within 21 days of the treatment of acute asthma in the emergency room
Measures for comparison between groups • Difference in the proportions • Relative risk: the ratio of the proportion in the test group to that in the control group. • Odds ratio: the ratio of the odds in the test group to the odds in the control group. • Odds: the ratio of the number of patients with the attribute to the number without it.
Categorical Endpoints • The difference in proportions provides the absolute magnitude of the difference. • Both the relative risk and the odds ratio give the relative magnitude of the difference. • 50% vs. 25% and 0.05% vs. 0.025% both yield a relative risk of 50%, but the differences in proportions are 25% and 0.025%, respectively. • Relative risk and odds ratio are appropriate when the proportion of the event in the control group is small (<5%). • When the proportion of the event is small (<5%), the relative risk ≈ the odds ratio.
Censored Data • Kaplan-Meier curve (actuarial probabilities): the proportions of patients with occurrence of a pre-defined event over a period of time. • Median survival: the time by which the pre-defined event (e.g. death) has occurred in 50% of the patients. • Hazard ratio: the ratio of the hazard of the pre-defined event in the test group to that in the control group
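The comparison on this slide can be reproduced with a small sketch. The function name `compare_proportions` is hypothetical; the proportions are the ones quoted on the slide, with the control group at 50% (or 0.05%) and the test group at 25% (or 0.025%).

```python
def compare_proportions(p_test, p_control):
    """Difference, relative risk, and odds ratio of test vs. control."""
    odds_test = p_test / (1 - p_test)           # odds = with / without
    odds_control = p_control / (1 - p_control)
    return {
        "difference": p_test - p_control,        # absolute magnitude
        "relative_risk": p_test / p_control,     # relative magnitude
        "odds_ratio": odds_test / odds_control,  # relative magnitude
    }


common = compare_proportions(0.25, 0.50)      # 50% vs. 25%
rare = compare_proportions(0.00025, 0.0005)   # 0.05% vs. 0.025%
print(common)
print(rare)
```

Both calls return a relative risk of 0.5, but the absolute differences are far apart. In the rare-event case the odds ratio is almost equal to the relative risk, while in the common-event case the two diverge.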
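To make the Kaplan-Meier idea concrete, here is a minimal product-limit sketch in plain Python. The function names and the five follow-up times are hypothetical illustrations, not data from any cited trial. The function returns the estimated probability of remaining event-free; one minus that value is the estimated proportion of patients with the event.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the event-free probability.

    times:  follow-up time for each patient.
    events: True if the event was observed, False if censored.
    Returns a list of (time, event-free probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        at_t = [e for tt, e in data if tt == t]  # everyone leaving at time t
        n_events = sum(1 for e in at_t if e)
        if n_events:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= len(at_t)  # events and censored both leave the risk set
        i += len(at_t)
    return curve


def median_survival(curve):
    """First time at which the event-free probability drops to 50% or less."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached during follow-up


# Hypothetical follow-up in months; False marks a censored patient.
curve = kaplan_meier([1, 2, 3, 4, 5], [True, False, True, True, False])
print(curve, median_survival(curve))
```

Censored patients contribute to the number at risk up to their censoring time but never trigger a drop in the curve, which is how the method uses partial follow-up information.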
Example: Crawford, et al (NEJM 1989; 321: 419-24) • A controlled trial of leuprolide with and without flutamide in prostatic carcinoma • Randomized, double-blind, 2 parallel groups • Primary endpoint: overall survival
Kaplan-Meier Estimates of the Risk of Serious CV Events in the APC Trial by Treatment Arm* [Figure: Kaplan-Meier curves by treatment arm] *In this analysis, “serious CV events” include death from CV causes, MI, stroke, or heart failure. Solomon SD, et al: N Engl J Med 352, 2005
Inferential Statistics • Inference from the sample to the target population • A decision process for clinical hypotheses based on the trial objective through statistical testing procedures
Example: Farlow et al (JAMA 1992; 268: 2523-2529) • Randomized, double-blind, parallel groups • Objective To compare the tacrine (20, 40, 80 mg per day) versus placebo for probable Alzheimer’s disease • Null hypothesis No difference in ADAS-cog scale between 80 mg of tacrine and placebo. • Alternative hypothesis There exists a true difference in ADAS-cog scale between 80 mg of tacrine and placebo.
Example: The NINDS rt-PA Stroke Study Group (NEJM 1996; 335: 841-7) • Objective for Part I: to test whether a greater proportion of patients with acute ischemic stroke treated with t-PA, as compared with those given placebo, have early improvement (an improvement of ≥ 4 points from baseline on the NIHSS). • Primary efficacy endpoint: proportion of patients with improvement • Null hypothesis: no difference in the proportions of patients with improvement between t-PA and placebo. • Alternative hypothesis: the difference in the proportions of patients with improvement between t-PA and placebo is at least 24%.
Decision Based on Results • Significance level = the consumer’s risk: the chance that the decision based on the results concludes there is a minimal difference of 24% in improvement between t-PA and placebo when in fact there is no difference. • Power = 1 – producer’s risk: the chance that the decision based on the results concludes there is a minimal difference of 24% in improvement between t-PA and placebo when in fact there is.
Statistical Testing Procedures • Step1 State the null and alternative hypotheses • Null hypothesis: the one to be questioned No difference in the proportions of patients with improvement between t-PA and placebo. • Alternative hypothesis: the one of particular interest to investigators The minimal difference in the proportions of patients with improvement between t-PA and placebo is at least 24%.
Statistical Testing Procedures • Step 2 Choose an appropriate test statistic, such as the two-sample t-statistic. • Step 3 Select the nominal significance level: the risk of type I error you are willing to commit, usually 5%
Statistical Testing Procedures • Step 4 Determine the critical value, rejection region, and decision rule. For large samples, a two-sided alternative, and α = 0.05, the critical value is z(0.025) = 1.96, and the rejection region is the set of values for which the absolute value of the test statistic is greater than 1.96. • Decision rule: reject the null hypothesis if the resulting test statistic is in the rejection region.
Statistical Testing Procedures Step 1 to step 4 should be determined and pre-specified in the Statistical Method section of the protocol before initiation of the study.
Statistical Testing Procedures • Step 5 When the study is completed or the data are available for interim analysis, compute the value of the test statistic specified in Step 2 (protocol). • Step 6 Make the decision based on the resulting value of the test statistic and the decision rule specified in Step 4 (protocol).
Statistical Testing Procedures • Conclusion • Reject the null hypothesis: sampling error is an unlikely explanation of the discrepancy between the null hypothesis and the observed values, and the alternative hypothesis is supported at a 5% risk of type I error. • Fail to reject the null hypothesis: sampling error is a likely explanation, and the data fail to provide sufficient evidence to doubt the validity of the null hypothesis. • Do NOT claim that the null hypothesis is accepted.
P - value • The p-value is the chance of obtaining a mean difference at least as large as the observed mean difference if there is no difference in ADAS-cog between the two groups (i.e., if the null hypothesis is true). • A small p-value implies that the observed difference is unlikely to occur if there is no difference in ADAS-cog scale between 80 mg of tacrine and placebo.
P - value • How small must the p-value be to conclude that there exists a true difference in ADAS-cog scale between 80 mg of tacrine and placebo? • It depends upon the risk of committing a type I error that the investigator is willing to take. • Nominal significance level = risk of type I error (the chance of concluding that a true difference in ADAS-cog exists when in fact there is no difference)
P - value • If the observed p-value < the nominal significance level (i.e., the observed p-value < the risk of type I error), then conclude that there exists a true difference in ADAS-cog. • The nominal significance level is usually 5% or 1% • The p-value for the observed difference in mean ADAS-cog is 0.015. • If the nominal significance level is 5%, then it is concluded that there is a difference in ADAS-cog between 80 mg of tacrine and placebo in the target population of patients with probable Alzheimer’s disease.
P - value • We cannot make the same decision if the nominal significance level is chosen to be 1%. • Always report the observed p-value so that readers and reviewers can judge the strength of evidence for themselves; do not report only “p-value < 0.05”.
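Steps 2 through 6 can be sketched for a comparison of two proportions as follows. This is an illustrative large-sample z-test with a pooled standard error, not the analysis used in any trial cited here; the function name and the counts (60 of 100 versus 40 of 100) are hypothetical. The critical value comes from `statistics.NormalDist` in the Python standard library.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2, alpha=0.05):
    """Two-sided large-sample z-test for the difference of two proportions.

    Returns (z statistic, critical value, p-value, reject null?).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, critical, p_value, abs(z) > critical


# Hypothetical counts of improved patients: test 60/100, control 40/100.
z, critical, p_value, reject = two_proportion_z_test(60, 100, 40, 100)
print(f"z = {z:.3f}, critical = {critical:.3f}, p = {p_value:.4f}, reject = {reject}")
```

The hypotheses, test statistic, `alpha`, and decision rule are all fixed in the function signature and body before any counts are supplied, mirroring the requirement that Steps 1 to 4 be pre-specified in the protocol.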
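The dependence of the decision on the chosen significance level can be shown in a few lines. The function name `decide` is hypothetical; the p-value of 0.015 is the one quoted for the ADAS-cog comparison above.

```python
def decide(p_value, alpha):
    """Apply the decision rule: reject the null when p-value < alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


observed_p = 0.015  # p-value quoted on the slide
for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: {decide(observed_p, alpha)}")
```

The same observed p-value leads to opposite decisions at 5% and 1%, which is the reason for reporting the p-value itself and not just whether it fell below a threshold.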
Confidence Interval • Example Adkinson, et al (NEJM 1997; 336: 324-31) Immunotherapy for asthma in allergic children
Confidence Interval • An interval estimate of the true population difference. • A random interval: it can be different each time the same trial is repeated. • A 95% confidence interval means that, if the same trial were repeated many times, about 95% of the intervals constructed this way would cover the true difference in average PEFR between the two groups; the interval observed in this trial is (-7.8, 0.1). • A 95% confidence interval for the difference will not include 0 if and only if the corresponding two-sided p-value < 0.05.
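The repeated-trial interpretation can be illustrated by simulation. The sketch below is an assumption-laden illustration: it draws hypothetical normally distributed outcomes (true difference -4, standard deviation 10, 50 patients per group, none of which come from the cited trial) and uses a large-sample normal interval for the difference in means. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

Z = NormalDist()  # standard normal distribution


def diff_ci(group_a, group_b, level=0.95):
    """Large-sample CI and two-sided p-value for mean(a) - mean(b)."""
    diff = mean(group_a) - mean(group_b)
    se = sqrt(stdev(group_a) ** 2 / len(group_a)
              + stdev(group_b) ** 2 / len(group_b))
    crit = Z.inv_cdf(1 - (1 - level) / 2)
    p_value = 2 * (1 - Z.cdf(abs(diff) / se))
    return diff - crit * se, diff + crit * se, p_value


def coverage(true_diff=-4.0, sd=10.0, n=50, trials=2000, seed=0):
    """Fraction of simulated trials whose 95% CI covers the true difference."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(true_diff, sd) for _ in range(n)]  # test group
        b = [rng.gauss(0.0, sd) for _ in range(n)]        # control group
        low, high, _ = diff_ci(a, b)
        hits += low <= true_diff <= high
    return hits / trials


print(f"Estimated coverage: {coverage():.3f}")
```

Each simulated trial produces a different interval, and the proportion that cover the true difference should come out close to 95%. Because the interval and the p-value are built from the same standard error and the same normal critical value, the interval excludes 0 exactly when the two-sided p-value is below 0.05.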