Sample Size Issues Involved in Sequential Analysis/Sequential Trials

Jonathan J. Shuster • Dept. of Epidemiology and Health Policy Research, College of Medicine, University of Florida • August 5, 2006

Everyone wants to Peek

Outline of Talk • Motivation for Group Sequential Methods in Clinical Trials • Motivation for Group Sequential Methods in Tissue Bank case-control studies. • A non-technical look at Brownian Motion and its role in Sample Size determination for Group Sequential methods

Outline (Continued) • Sample Reference Designs • Real Example • Brief look at Continuous Monitoring by O’Brien-Fleming Method • Take Home Messages

Motivation • Second International Study of Infarct Survival (ISIS-2)

ISIS #2 (Clot busters) • Lancet 8/88, pp. 349-360. • Second International Study of Infarct Survival • 3 year accrual. Major goal: to prevent early deaths (5 week mortality) • Design: Double-blind 2×2 factorial of Aspirin vs. Placebo and Streptokinase vs. Placebo.

ISIS #2 (Cont) Death Rates (@ 5 weeks) (A = Aspirin, SK = Streptokinase, P = Placebo) • (1) A/SK: 343/4292 (8.0%) • (2) P/SK: 448/4300 (10.4%) • (3) A/P: 461/4295 (10.7%) • (4) P/P: 568/4300 (13.2%)

ISIS #2 (Cont) • Z (Pooled variance) for Double Drug vs. Double Placebo: 7.85, $P = 4.2 \times 10^{-15}$ • Z (Mantel-Haenszel) Aspirin vs. Placebo: 5.23, $P = 1.7 \times 10^{-7}$ • Z (Mantel-Haenszel) Streptokinase vs. Placebo: 5.90, $P = 3.6 \times 10^{-9}$

How would existing Group Sequential Designs have Fared? • O’Brien-Fleming (OF) or Pocock (P) Design with three equally spaced looks and the same operating characteristics. • OF: Double drug vs. Placebo has an average predicted sample size of 8542 (slightly under 50% of the fixed sample size). • P: Double drug vs. Placebo has an average predicted sample size of 6545 (under 40% of the fixed sample size). • Savings: about 18-24 months of accrual, with the public informed earlier.

Tissue Banking Case-Control Studies • Childhood Leukemia Bone Marrow Bank (Children’s Oncology Group). There are about 10,000 patients with available samples for research. Samples cannot be reused. • Is a Genetic Marker (+/-) prognostic for survival in a well defined subgroup (including defined therapy)? • Available material: 1000 patients (with sufficient follow-up).

Planning Parameters • Frequency of Marker: about 20% • Planning occurrence: Long term survivors 15% vs. Failures 25%. (Odds ratio is near 2 (1.9)). • Fixed sample size needs: 248/Group (496 total)
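The fixed sample size and odds ratio quoted above can be reproduced with a short calculation. This is a minimal sketch, assuming the usual unpooled-variance normal approximation for comparing two proportions (the slides do not say which formula was used); the function name `n_per_group` is illustrative only.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided test of two proportions (unpooled variance).

    Assumption: normal approximation, equal group sizes, rounding up.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Marker frequency: 15% in long-term survivors vs. 25% in failures
odds_ratio = (0.25 / 0.75) / (0.15 / 0.85)
print(n_per_group(0.15, 0.25), round(odds_ratio, 2))
```

Under these assumptions the calculation gives 248 per group and an odds ratio of about 1.89, in line with the planning parameters.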

Using a 2-Stage Reference Design (Shuster et al. 2002, from Table 1b) • Stage 1: Take 64% of the single stage study (64% of 496 = 318: 159 cases and 159 controls). Stop for futility if |Z|<1.08 and for significance if |Z|>2.28. • Stage 2: Take 113% of the single stage study (49% more) (560: 280 cases and 280 controls, or 121 more of each). Declare significance if and only if |Z|>2.00.

Properties • The power is 80% at P=.05 (two-sided) • The expected sample size is less than 426 (86% of the fixed), irrespective of the true proportions positive amongst cases and controls (fixed requires 496). (No other 2-stage beats the 426) • Under the null, the expected sample size is 353 (about 71%) • Under the alternative, the expected sample size is 409 (about 82%)

Ingredients Needed • Single Stage Sample Size Requirement • Number of interim looks • Timing of Each Look (we will use equally spaced for 3+ stages, but this is not an absolute) (Expressed relative to Single Stage) • Cutoffs for futility and significance at each look

Group Sequential Designs • Why bother with sequential designs? • Why not fully continuous sequential designs?

Why do Sequential Studies • Concerns about assigning knowingly inferior or more toxic treatment to trial participants • Concerns about getting knowledge to the public sooner • Concerns about conservation of resources (especially in tissue banking).

Why not do Sequential Studies • They may need to be temporarily closed to accrue the data • There may be no safety issues involved, and no need to beat the competition to publish • May be impractical for small studies. • Results may come in too slowly to be of value • Effect sizes are estimated with lower precision. (Sequential nature must be taken into account.) • Multivariate Endpoints add complexity (But can deal with this if needed).

Optimization of the Group Sequential Design • Absolute minimum sample size: No matter what, I want a design that has a maximum sample size very close to the fixed. • Minimize the average sample size under the Null Hypothesis • Minimize the average sample size under the Alternative Hypothesis • Minimize the expected value of the mean of the sample size under null and alternative • Minimax: Minimize the maximum expected sample size over all values of the effect size.

Other Considerations • We shall enforce Uniform Look times (except 2 stage). • We shall impose a maximum number of looks.

Brownian Motion (e.g.) • $S_n = Y_1 + \dots + Y_n$, where the $Y_i$ are iid • $E(S_n) = n\mu$ • $\mathrm{Var}(S_n) = n\sigma^2$ • $Y_i = U_i - V_i$ (Diff in Means) • Normal distribution, independent stationary increments, mean and variance proportional to “time”

Brownian Motion • $X(\tau) \sim N(\delta\tau, \sigma^2\tau)$ • $\tau$ = Time ($\tau = 1$ is the time of the non-sequential study) • $\delta$ = Effect size for the non-sequential study • $\sigma$ is the population standard deviation of the estimate for the non-sequential study, $\sigma = 1$.

Brownian Motion • The process has independent, stationary increments. For example: • $X(\tau_1)$ and $X(\tau_2) - X(\tau_1)$, $\tau_1 < \tau_2$, are independent • This implies that $\mathrm{Cov}[X(\tau_1), X(\tau_2)] = \tau_1\sigma^2$ for $\tau_1 < \tau_2$.
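The covariance property can be checked numerically. The sketch below is illustrative only: it builds the process at two times from independent normal increments and estimates the covariance by Monte Carlo. The drift (2.8), the standard deviation (1), the two times (0.5 and 1.0), and the function name are all assumptions chosen for the example.

```python
import random
from math import sqrt


def simulate_covariance(tau1=0.5, tau2=1.0, delta=2.8, sigma=1.0,
                        n_sims=50_000, seed=0):
    """Monte Carlo estimate of Cov[X(tau1), X(tau2)] for Brownian motion with drift."""
    rng = random.Random(seed)
    xs1, xs2 = [], []
    for _ in range(n_sims):
        # X(tau1) ~ N(delta * tau1, sigma^2 * tau1)
        x1 = rng.gauss(delta * tau1, sigma * sqrt(tau1))
        # Independent increment over (tau1, tau2]
        incr = rng.gauss(delta * (tau2 - tau1), sigma * sqrt(tau2 - tau1))
        xs1.append(x1)
        xs2.append(x1 + incr)
    m1 = sum(xs1) / n_sims
    m2 = sum(xs2) / n_sims
    return sum((a - m1) * (b - m2) for a, b in zip(xs1, xs2)) / (n_sims - 1)


# Should be close to sigma^2 * tau1 = 0.5, regardless of the drift
print(simulate_covariance())
```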

Typical Examples approximating BM • Sum of independent identically distributed (iid) random variables (One sample problem for means and proportions). Time is proportional to sample size. • Differences between partial sums with equal sample sizes for two populations. (Two sample problem for means or proportions.)

Typical Examples approximating BM • Two sample analysis of covariance for randomized study with a completely random covariate. • Mann-Whitney U-statistic • Cox Regression (Logrank) test in survival analysis, under PROPORTIONAL HAZARDS and equal randomization • Matched Proportions (Unconditional version of McNemar’s Test)

Reference Design Appearance • For look times $\tau_1 < \tau_2 < \tau_3 < \dots < \tau_k$: • Reject if $|X(\tau_j)| > \tau_j^{1/2} Z_R(j)$ • Accept if $|X(\tau_j)| < \tau_j^{1/2} Z_A(j)$ • Continue if $\tau_j^{1/2} Z_A(j) < |X(\tau_j)| < \tau_j^{1/2} Z_R(j)$

Optimizing the Design • Suppose we wish to test the null hypothesis • $H_0: \delta = 0$ vs. $H_a: |\delta| = \delta_a$, with Type I error $\alpha$ and Type II error $\beta$. • Can we minimize $E(\tau)$, the “average time required”, defined for us as: • $E(\tau) = 0.5\,E(\tau \mid \delta = 0) + 0.25\,E(\tau \mid \delta = \delta_a) + 0.25\,E(\tau \mid \delta = -\delta_a)$

“Risk Function” • In other words, for symmetric situations, the risk function is the average of the expected sample size under the null and alternative hypotheses. We reward closure for futility and efficacy equally. (Other researchers have used the expected under the null only or expected under the alternative only.)

Characterization of a Group Sequential Design • $\alpha$ = Type I error, $\beta$ = Type II error, $\eta$ = Expected sample size • Let D be the set of all designs, including random combinations of single designs (e.g., use design 1 with 65% probability and design 2 with 35% probability).

Admissible Designs • A design $d_1$ with parameters $(\alpha, \beta, \eta)$ is admissible if there is no design $d_2$ with parameters $(\alpha', \beta', \eta')$ with • $\alpha' \le \alpha$, $\beta' \le \beta$, and $\eta' \le \eta$, with at least one of these inequalities strict.

Backward Induction Method • Optimizing over all designs is a daunting task. However, the admissible designs can be characterized as Bayesian solutions under the previously stated risk function, which allows us to develop a search procedure to optimize the designs. This methodology was adapted to this problem by Myron Chang (1996) from the 1957 work of Kiefer and Weiss.

Example of Optimal 4-Look Design • Type I error: $\alpha = 0.05$ • Type II error: $\beta = 0.20$ • Maximum Look Time: 1.333 times equivalent single stage sample size. • Equally Spaced Look times.

And the Champ is • Look 1: $\tau = 0.33$, Acc if |Z|<0.40, Rej if |Z|>2.59 • Look 2: $\tau = 0.67$, Acc if |Z|<0.93, Rej if |Z|>2.36 • Look 3: $\tau = 1.00$, Acc if |Z|<1.34, Rej if |Z|>2.31 • Look 4: $\tau = 1.33$, Acc if |Z|<2.02, Rej if |Z|>2.02

Properties ($\tau = 1$ is single Stage) • $E(\tau \mid H_0) = 0.677$ [0.666] • $E(\tau \mid H_a) = 0.755$ [0.751] • $0.5\,E(\tau \mid H_0) + 0.5\,E(\tau \mid H_a) = 0.716$ (Optimum) • $\sup_\delta E(\tau \mid \delta) = 0.807$ [0.809] • In [ ] is the champion in a grid search over 300 plausible designs with equally spaced looks and the same operating characteristics. This speaks well to robustness to other optimization standards.
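The operating characteristics of the four-look design can be approximated by simulating Brownian motion at the look times. This is a rough sketch, not the backward induction used to derive the design. It assumes unit variance per unit time, exact look times of 1/3, 2/3, 1, and 4/3, and a drift of 2.80 under the alternative (1.96 + 0.84, the effect size a fixed study detects with 80% power). `LOOKS` and `simulate_design` are illustrative names.

```python
import random
from math import sqrt

# (look time, accept if |Z| is below, reject if |Z| is above), from the slide
LOOKS = [(1 / 3, 0.40, 2.59), (2 / 3, 0.93, 2.36), (1.0, 1.34, 2.31), (4 / 3, 2.02, 2.02)]


def simulate_design(delta, n_sims=40_000, seed=0):
    """Return (rejection rate, mean stopping time) for drift delta, sigma = 1."""
    rng = random.Random(seed)
    rejections = 0
    total_tau = 0.0
    for _ in range(n_sims):
        x, prev_tau = 0.0, 0.0
        for tau, accept, reject in LOOKS:
            dt = tau - prev_tau
            x += rng.gauss(delta * dt, sqrt(dt))  # independent increment
            prev_tau = tau
            z = abs(x) / sqrt(tau)  # standardized statistic at this look
            if z > reject:
                rejections += 1
                break
            if z < accept:
                break
        total_tau += tau  # tau holds the look at which the path stopped
    return rejections / n_sims, total_tau / n_sims


if __name__ == "__main__":
    print("null:", simulate_design(0.0))
    print("alternative:", simulate_design(2.80))
```

With these assumptions the simulated type I error, power, and mean stopping times should land near the 5%, 80%, 0.677, and 0.755 quoted above, up to Monte Carlo error and the rounding of the published cutoffs.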

Numerical Example • Based on historical control data, a group of patients with aneurysms and unstable urinary creatinine had a 50% chance of dying or needing dialysis within 28 days. • Can drug treatment cut this rate in half?

Single Stage Study • Using my AGS program, CLASSZTEST.SAS, we conclude that for $\alpha = 0.05$ (two-sided) and Type II error $\beta = 0.20$ (80% power) we need 55 patients per group (110 total). • $\tau = 1$ (Single Stage Study) corresponds to N=110. • Using the optimal design, we would look after 37, 74, 110, and 147 patients.
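The look schedule follows from scaling the fixed sample size by the look times. A minimal sketch, assuming four equally spaced looks up to 4/3 of the fixed sample size and rounding up to whole patients (the rounding rule and the name `look_schedule` are assumptions):

```python
from fractions import Fraction
from math import ceil


def look_schedule(n_fixed, n_looks=4, max_tau=Fraction(4, 3)):
    """Cumulative number of patients at each equally spaced look, rounded up."""
    return [ceil(n_fixed * max_tau * j / n_looks) for j in range(1, n_looks + 1)]


print(look_schedule(110))  # [37, 74, 110, 147]
```

Exact fractions are used so that the third look comes out at exactly 110 rather than being pushed up by floating-point error.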

Convincing the Skeptic • Step 1: Show them that the O’Brien-Fleming Design is so highly correlated with the Non-Sequential Design with the same operating characteristics that it behooves them to use a group sequential design, if feasible. • Step 2: Convince them to consider more efficient designs.

Amazing Results • 4 Stage Design with O’Brien-Fleming vs. Single Stage (5% two-sided type I error and 80% power) • Null: Sample paths that are significant for both 4.1%, Sample paths non-significant by both 94.1%. Discordance 0.9% in each direction. • Alternate: Sample paths that are significant for both 77.9%. Sample paths not significant for both 17.9%. Discordance 2.1% in each direction. • Max sample size for OBF<105% of Single Stage

Continuous Monitoring via O’Brien-Fleming • [Figure: horizontal boundaries at Z=2.24 and Z=-2.24 on either side of the line Z=0, over time $\tau = 0$ to $\tau = 1$. X marks the first time Brownian Motion (Mean 0) hits +/- 2.24.]

Reflection (Null Hypothesis) • If a path ends up above 2.24, it had to have crossed 2.24 at some time. Place a mirror for any path hitting 2.24, and it is equally likely (under the null hypothesis of zero drift) that it ends up above vs. below 2.24. • P(hits Z=2.24)=2P(ends above Z=2.24)=.025. • P(hits Z=-2.24)=2P(ends below Z=-2.24)=.025. • P(Hits both) is virtually zero.
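The reflection argument reduces to a one-line normal tail calculation. A small sketch, assuming driftless Brownian motion with unit variance at time 1 (the function name is illustrative):

```python
from statistics import NormalDist


def hit_probability(boundary):
    """P(driftless Brownian motion on [0, 1], sigma = 1, ever reaches `boundary`)."""
    p_end_above = 1 - NormalDist().cdf(boundary)
    return 2 * p_end_above  # reflection principle


p_upper = hit_probability(2.24)
# Two-sided type I error, ignoring the negligible chance of hitting both boundaries
print(round(p_upper, 4), round(2 * p_upper, 4))
```

This gives roughly 0.025 per boundary and 0.05 in total, which is why 2.24 serves as the continuous-monitoring cutoff.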

Power Function (Alternate Hypothesis for Z=2.24) • Power implies that the first time Z exceeds 2.24 is before time=1. (Time=1 has, say, 80% power for the fixed study at P=0.05 2-sided.) (First passage distribution in Brownian Motion) • Detectable effect size is 2.80 (1.96+0.84) for the non-sequential study. • Detectable difference for a study of the same duration, continuously monitored, with the same power: 2.88. (Cox and Miller reference provided at end) • Inflation of maximum time for Continuous OBF to have the same power as non-sequential: 6%, since $(2.88/2.80)^2 \approx 1.06$. (Inflate n by 6% and look continuously with OBF.)
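The 2.88 figure can be recovered from the first-passage distribution of Brownian motion with drift. The sketch below uses the standard formula for the probability that the maximum over [0, 1] reaches a level b, namely Φ(δ − b) + exp(2δb)·Φ(−δ − b), and solves for the drift by bisection. It assumes unit variance and ignores the negligible chance of crossing the lower boundary; the function names are illustrative.

```python
from math import exp
from statistics import NormalDist

PHI = NormalDist().cdf


def crossing_power(delta, boundary=2.24):
    """P(Brownian motion with drift delta, sigma = 1, reaches `boundary` by time 1)."""
    return PHI(delta - boundary) + exp(2 * delta * boundary) * PHI(-delta - boundary)


def detectable_delta(power=0.80, boundary=2.24, lo=0.0, hi=6.0, tol=1e-8):
    """Bisection for the drift at which the upper boundary is crossed with the given power."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if crossing_power(mid, boundary) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


delta = detectable_delta()
inflation = (delta / 2.80) ** 2  # sample size scales with the square of the effect size
print(round(delta, 2), round(inflation, 2))
```

Under these assumptions the drift comes out at about 2.88 and the inflation factor at about 1.06.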

Numerical Simulation • Sign Test: We wish to accrue enough subjects to test P=0.50 vs. a two-sided alternative P≠0.50 to have 80% power when |P-.50|>0.10 at P=0.05 two-sided. • Non-sequential sample size • $N = [(Z_{\alpha/2} + Z_\beta)\,\sigma/\Delta]^2 = [(1.96 + 0.84)(0.5)/0.1]^2$ • N=196 (Non-Sequential requirement)

Continuous O’Brien-Fleming • Inflate by 6% (per first passage distribution) • 106% of 196=208. • Type I error at P=0.50: 4.4% (100,000 sim) • Power at P=0.60: 79.2% (100,000 sim) • ASN=206.1 (Null) and 154.0 (Alt)
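A simulation along these lines can be sketched as follows. This is an illustrative reconstruction, not the original program. It assumes the sign-test statistic Z = (2S − n)/√n after n subjects with S successes, and an efficacy boundary of |Z| > 2.24·√(208/n) checked after every subject. The function name and the default of 5,000 replicates are arbitrary.

```python
import random
from math import sqrt


def simulate_continuous_obf(p, n_max=208, z_crit=2.24, n_sims=5_000, seed=0):
    """Return (rejection rate, average sample number) for success probability p."""
    rng = random.Random(seed)
    # |Z_n| > z_crit * sqrt(n_max / n) with Z_n = (2*S_n - n) / sqrt(n)
    # is equivalent to |2*S_n - n| > z_crit * sqrt(n_max).
    bound = z_crit * sqrt(n_max)
    rejections = 0
    total_n = 0
    for _ in range(n_sims):
        walk = 0  # running value of 2*S_n - n
        for n in range(1, n_max + 1):
            walk += 1 if rng.random() < p else -1
            if abs(walk) > bound:
                rejections += 1
                break
        total_n += n  # n is the stopping point, or n_max if never stopped
    return rejections / n_sims, total_n / n_sims


if __name__ == "__main__":
    print("P=0.50:", simulate_continuous_obf(0.50))
    print("P=0.60:", simulate_continuous_obf(0.60))
```

With more replicates, the rejection rates and average sample numbers should approach the values listed above, within Monte Carlo error.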

Adding Futility is Almost Free • (Computer trial and error) • Start with a small number of simulations to zero in on parameters. • “Conditional Power idea”: Calculate the binomial probability of rejection under the alternate hypothesis at the last observation, where rejection means: • Successes at final > $0.5\,N_{final} + 0.5\,Z\sqrt{N_{final}}$. • If the probability of rejection under the alternative is <10%, stop for futility. Manipulate Z and $N_{final}$ in simulations.
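The conditional power idea can be written as a binomial tail probability. The sketch below is a simplified, one-sided version under stated assumptions: it only considers rejection in the upper direction, it uses 210 and 2.20 as example values for the final sample size and Z, and it takes P=0.60 as the alternative. The function name and defaults are illustrative, not the author's code.

```python
from math import comb, floor, sqrt


def conditional_power(successes, n, n_final=210, z=2.20, p_alt=0.60):
    """P(final successes exceed the rejection threshold | data so far, P = p_alt)."""
    threshold = 0.5 * n_final + 0.5 * z * sqrt(n_final)
    remaining = n_final - n
    # Smallest number of further successes that pushes the total above the threshold
    needed = max(floor(threshold - successes) + 1, 0)
    return float(sum(
        comb(remaining, k) * p_alt ** k * (1 - p_alt) ** (remaining - k)
        for k in range(needed, remaining + 1)
    ))


# Example: 45 successes in the first 100 subjects
cp = conditional_power(45, 100)
print(round(cp, 3), "-> stop for futility" if cp < 0.10 else "-> continue")
```

In a simulation, this check would run alongside the efficacy boundary after each subject, with Z and the final sample size tuned by trial and error as described above.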

Continuous Monitoring with Futility • N(Max)=210 (Up from 208 OBF, 196 non-seq) • Stop for efficacy |Z|>2.20*sqrt(210/N) • N=Number sampled to date (Critical value is lowered from 2.24 to 2.20, due to provision for futility). • Empirical Results (based on 100,000 simulations): Type I error 5.1%, Power 80.2% • ASN: Null 144.6(was 206.1) vs. Alternate 142.3(was 154.0). • ASN (Optimal 4 Stage) Null 132.7 vs. Alt 148.0

Continuous Looking • If no sequential monitoring was planned, but the client looked continuously, there is a 41% chance of finding significance at P<0.05, two-sided, at some point in a study of 196 for the sign test, when indeed the success rate is 50% (Null). • When you are asked to do an analysis of a study you did not design, ask if this was the planned sample size. Is this a random high?
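The cost of unplanned peeking is easy to demonstrate by simulation. The sketch below is illustrative only: it assumes the normal-approximation sign test with |Z| > 1.96 checked after every subject, starting at the 10th. Both the test form and the starting point are assumptions, so the result need not match the 41% figure exactly.

```python
import random
from math import sqrt


def peeking_false_positive_rate(n_max=196, z_crit=1.96, first_look=10,
                                n_sims=5_000, seed=0):
    """Share of null studies (P = 0.50) that are 'significant' at some interim look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        walk = 0  # running value of 2*S_n - n
        for n in range(1, n_max + 1):
            walk += 1 if rng.random() < 0.5 else -1
            # Unadjusted test at every look: |Z_n| = |2*S_n - n| / sqrt(n) > z_crit
            if n >= first_look and abs(walk) > z_crit * sqrt(n):
                hits += 1
                break
    return hits / n_sims


if __name__ == "__main__":
    print(peeking_false_positive_rate())
```

Whatever the exact settings, the rate comes out several times the nominal 5%.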

Take Home Message • As statisticians we understand uncertainty. Are you willing to gamble about when a study may be completed? (This affects your choice of a fixed design or of what type of Group Sequential Design to consider.) • Is it important that the study be stopped early for a positive result (efficacy)? • Is it important that a study be stopped early for a negative result (futility)?

Take Home Message • Some knowledge of Sequential Methods is useful when dealing with your response to analyzing data from studies where the design is unclear to you. (Have your colleagues screened for variables based on potential significance? Have they picked a point that is premature so they can present an abstract at a meeting? Have they based the question on a possible random high?)
