Overview of monitoring clinical trials
Download
1 / 137

Overview of Monitoring Clinical Trials - PowerPoint PPT Presentation


  • 168 Views
  • Uploaded on

Overview of Monitoring Clinical Trials. Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington August 5, 2006. Topics. Goals / Disclaimers Clinical trial setting / Study design Stopping rules Sequential inference Adaptive designs. Goals / Disclaimers. Goals.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Overview of Monitoring Clinical Trials' - dean-hood


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Overview of monitoring clinical trials

Overview of Monitoring Clinical Trials

Scott S. Emerson, M.D., Ph.D.

Professor of Biostatistics

University of Washington

August 5, 2006


Topics
Topics

  • Goals / Disclaimers

  • Clinical trial setting / Study design

  • Stopping rules

  • Sequential inference

  • Adaptive designs



Demystification
Demystification

  • All we are doing is statistics

    • Planning a study

    • Gathering data

    • Analyzing it


Sequential studies
Sequential Studies

  • All we are doing is statistics

    • Planning a study

      • Added dimension of considering time required

    • Gathering data

      • Sequential sampling allows early termination

    • Analyzing it

      • The same old inferential techniques

      • The same old statistics

      • But new sampling distribution


Distinctions without differences
Distinctions without Differences

  • Sequential sampling plans

    • Group sequential stopping rules

    • Error spending functions

    • Conditional / predictive power

  • Statistical treatment of hypotheses

    • Superiority / Inferiority / Futility

    • Two-sided tests / bioequivalence


Perpetual motion machines
Perpetual Motion Machines

  • Discerning the hyperbole in much of the recent statistical literature

    • “Self-designing” clinical trials

    • Other adaptive clinical trial designs

      • (But in my criticisms, I do use a more restricted definition than some in my criticisms)


Key issue
Key Issue

You better think (think)

about what you’re

trying to do…

-Aretha Franklin


Evaluation of trial design
Evaluation of Trial Design

  • When planning an expensive (in time or money) clinical trial, there is no substitute for planning

    • Will your clinical trial satisfy (as much as possible) the various collaborating disciplines with respect to the primary endpoint?

      • The major part of this question can be answered at the beginning of the clinical trial


Goals disclaimers1

Goals / Disclaimers

Disclaimers


Conflict of interest
Conflict of Interest

  • Commercially available software for sequential clinical trials

    • S+SeqTrial (Emerson)

    • PEST (Whitehead)

    • EaSt (Mehta)

    • ?SAS


Personal defects
Personal Defects

  • Physical

  • Personality

    • In the perjorative sense of the words

      • Then:

        • A bent twig

      • Now:

        • Old and male

        • University professor


A more descriptive title
A More Descriptive Title

  • All my talks

    The Use of Statistics to Answer

    Scientific Questions

    (Confessions of a Former Statistician)


A more descriptive title1
A More Descriptive Title

  • All my talks

    The Use of Statistics to Answer

    Scientific Questions

    (Confessions of a Former Statistician)

  • Group sequential talks

    The Use of Statistics to Answer

    Scientific Questions Ethically and Efficiently

    (Confessions of a Former Statistician)


Science vs statistics
Science vs Statistics

  • Recognizing the difference between

    • The parameter space

      • What is the true scientific relationship?

    • The sample space

      • What data did you gather?


Science vs statistics1
Science vs Statistics

  • Recognizing the difference between

    • The true scientific relationship

      • Summary measures of the effect

        • Means, medians, geometric means, proportions…

    • The precision with which you know the true effect

      • Power

      • P values, posterior probabilities


Reporting inference
Reporting Inference

  • At the end of the study analyze the data

  • Report three measures (four numbers)

    • Point estimate

    • Interval estimate

    • Quantification of confidence / belief in hypotheses


Reporting frequentist inference
Reporting Frequentist Inference

  • Three measures (four numbers)

    • Consider whether the observed data might reasonably be expected to be obtained under particular hypotheses

      • Point estimate: minimal bias? MSE?

      • Confidence interval: all hypotheses for which the data might reasonably be observed

      • P value: probability such extreme data would have been obtained under the null hypothesis

        • Binary decision: Reject or do not reject the null according to whether the P value is low


Reporting bayesian inference
Reporting Bayesian Inference

  • Three measures (four numbers)

    • Consider the probability distribution of the parameter conditional on the observed data

      • Point estimate: Posterior mean, median, mode

      • Credible interval: The “central” 95% of the posterior distribution

      • Posterior probability: probability of a particular hypothesis conditional on the data

        • Binary decision: Reject or do not reject the null according to whether the posterior probability is low


Parallels between tests cis
Parallels Between Tests, CIs

  • If the null hypothesis not in CI, reject null

    • (Using same level of confidence)

  • Relative advantages

    • Test only requires sampling distn under null

    • CI requires sampling distn under alternatives

    • CI provides interpretation when null is not rejected


  • Scientific information
    Scientific Information

    • “Rejection” uses a single level of significance

      • Different settings might demand different criteria

    • P value communicates statistical evidence, not scientific importance

    • Only confidence interval allows you to interpret failure to reject the null:

      • Distinguish between

        • Inadequate precision (sample size)

        • Strong evidence for null


    Hypothetical example
    Hypothetical Example

    • Clinical trials of treatments for hypertension

      • Screening trials for four candidate drugs

        • Measure of treatment effect is the difference in average SBP at the end of six months treatment

        • Drugs may differ in

          • Treatment effect (goal is to find best)

          • Variability of blood pressure

        • Clinical trials may differ in conditions

          • Sample size, etc.


    Reporting p values
    Reporting P values

    Study P value

    A 0.1974

    B 0.1974

    C 0.0099

    D 0.0099


    Point estimates
    Point Estimates

    Study SBP Diff

    A 27.16

    B 0.27

    C 27.16

    D 0.27


    Point estimates1
    Point Estimates

    Study SBP Diff P value

    A 27.16 0.1974

    B 0.27 0.1974

    C 27.16 0.0099

    D 0.27 0.0099


    Confidence intervals
    Confidence Intervals

    Study SBP Diff 95% CI P value

    A 27.16 -14.14, 68.46 0.1974

    B 0.27 -0.14, 0.68 0.1974

    C 27.16 6.51, 47.81 0.0099

    D 0.27 0.06, 0.47 0.0099


    Interpreting nonsignificance
    Interpreting Nonsignificance

    • Studies A and B are both “nonsignificant”

      • Only study B ruled out clinically important differences

      • The results of study A might reasonably have been obtained if the treatment truly lowered SBP by as much as 68 mm Hg


    Interpreting significance
    Interpreting Significance

    • Studies C and D are both statistically significant results

      • Only study C demonstrated clinically important differences

      • The results of study D are only frequently obtained if the treatment truly lowered SBP by 0.47 mm Hg or less


    Bottom line
    Bottom Line

    • If ink is not in short supply, there is no reason not to give point estimates, CI, and P value

    • If ink is in short supply, the confidence interval provides most information

      • (but sometimes a confidence interval cannot be easily obtained, because the sampling distribution is unknown under the null)


    But impact of three over n
    But: Impact of “Three over n”

    • The sample size is also important

      • The pure statistical fantasy

        • The P value and CI account for the sample size

      • The scientific reality

        • We need to be able to judge what proportion of the population might have been missed in our sample

          • There might be “outliers” in the population

          • If they are not in our sample, we will not have correctly estimated the variability of our estimates

        • The “Three over n” rule provides some guidance


    Full report of analysis
    Full Report of Analysis

    Study n SBP Diff 95% CI P value

    A 20 27.16 -14.14, 68.46 0.1974

    B 20 0.27 -0.14, 0.68 0.1974

    C 80 27.16 6.51, 47.81 0.0099

    D 80 0.27 0.06, 0.47 0.0099


    Interpreting a negative study
    Interpreting a “Negative Study”

    • This then highlights issues related to the interpretation of a study in which no statistically significant difference between groups was found

      • We have to consider the “differential diagnosis” of possible situations in which we might observe nonsignificance


    General approach
    General approach

    • Refined scientific question

      • We compare the distribution of some response variable differs across groups

        • E.g., looking for an association between smoking and blood pressure by comparing distribution of SBP between smokers and nonsmokers

      • We base our decisions on a scientifically appropriate summary measure 

        • E.g., difference of means, ratio of medians, …


    Interpreting a negative study1
    Interpreting a “Negative Study”

    • Possible explanations for no statistically significant difference in 

      • There is no true difference in the distribution of response across groups

      • There is a difference in the distribution of response across groups, but the value of  is the same for both groups

        • (i.e., the distributions differ in some other way)

      • There is a difference in the value of  between the groups, but our study was not precise enough

        • A “type II error” from low “statistical power”


    Interpreting a positive study
    Interpreting a “Positive Study”

    • Analogous interpretations when we do find a statistically significant difference in 

      • There is a true difference in the value of 

      • There is no true difference in , but we were unlucky and observed spuriously high or low results

        • Random chance leading to a “type I error”

          • The p value tells us how unlucky we would have had to have been

        • (Used a statistic that allows other differences in the distn to be misinterpreted as a difference in 

          • E.g., different variances causing significant t test)


    Bottom line1
    Bottom Line

    • I place greatest emphasis on estimation rather than hypothesis testing

    • When doing testing, I take more of a decision theoretic view

      • I argue this is more in keeping with the scientific method

        • (Ask me about The Scientist Game)

    • All these principles carry over to sequential testing


    Group sequential methods
    Group Sequential Methods

    • Scientific measures

      • Decision theoretic

        • Statement of hypotheses

        • Criteria for early stopping based on point and interval estimates

      • Proper inference for estimation

    • Purely statistical measures

      • Focus on type I errors

      • Error spending functions

      • Conditional / predictive power



    Clinical trials
    Clinical Trials

    • Experimentation in human volunteers

      • Investigates a new treatment/preventive agent

        • Safety:

          • Are there adverse effects that clearly outweigh any potential benefit?

      • Efficacy:

        • Can the treatment alter the disease process in a beneficial way?

    • Effectiveness:

      • Would adoption of the treatment as a standard affect morbidity / mortality in the population?


    Clinical trial design
    Clinical Trial Design

    • Finding an approach that best addresses the often competing goals: Science, Ethics, Efficiency

      • Basic scientists: focus on mechanisms

      • Clinical scientists: focus on overall patient health

      • Ethical: focus on patients on trial, future patients

      • Economic: focus on profits and/or costs

      • Governmental: focus on validity of marketing claims

      • Statistical: focus on questions answered precisely

      • Operational: focus on feasibility of mounting trial


    Statistical planning
    Statistical Planning

    • Satisfy collaborators as much as possible

      • Discriminate between relevant scientific hypotheses

        • Scientific and statistical credibility

      • Protect economic interests of sponsor

        • Efficient designs

        • Economically important estimates

      • Protect interests of patients on trial

        • Stop if unsafe or unethical

        • Stop when credible decision can be made

      • Promote rapid discovery of new beneficial treatments


    Refine scientific hypotheses
    Refine Scientific Hypotheses

    • Target population

      • Inclusion, exclusion, important subgroups

    • Intervention

      • Dose, administration (intention to treat)

    • Measurement of outcome(s)

      • Efficacy/effectiveness, toxicity

    • Statistical hypotheses in terms of some summary measure of outcome distribution

      • Mean, geometric mean, median, odds, hazard, etc.

    • Criteria for statistical credibility

      • Frequentist (type I, II errors) or Bayesian


    Statistics to address variability
    Statistics to Address Variability

    • At the end of the study:

      • Frequentist and/or Bayesian data analysis to assess the credibility of clinical trial results

        • Estimate of the treatment effect

          • Single best estimate

          • Precision of estimates

        • Decision for or against hypotheses

          • Binary decision

          • Quantification of strength of evidence


    Statistical sampling plan
    Statistical Sampling Plan

    • Ethical and efficiency concerns are addressed through sequential sampling

      • During the conduct of the study, data are analyzed at periodic intervals and reviewed by the DMC

      • Using interim estimates of treatment effect

        • Decide whether to continue the trial

        • If continuing, decide on any modifications to

          • scientific / statistical hypotheses and/or

          • sampling scheme


    Sample size determination
    Sample Size Determination

    • Based on sampling plan, statistical analysis plan, and estimates of variability, compute

      • Sample size that discriminates hypotheses with desired power, or

      • Hypothesis that is discriminated from null with desired power when sample size is as specified, or

      • Power to detect the specific alternative when sample size is as specified



    When sample size constrained
    When Sample Size Constrained

    • Often (usually?) logistical constraints impose a maximal sample size

      • Compute power to detect specified alternative

      • Compute alternative detected with high power


    Stopping rules

    Stopping Rules

    Monitoring a Trial


    Need for monitoring a trial
    Need for Monitoring a Trial

    • Ethical concerns

      • Patients already on trial

        • Avoid harm; maintain informed consent

      • Patients not on trial

        • Facilitate rapid introduction of treatments

    • Efficiency concerns

      • Minimize costs

        • Number of patients accrued, followed

        • Calendar time


    Working example
    Working Example

    • Fixed sample two-sided tests

      • Test of a two-sided alternative (+ > 0 > - )

        • Upper Alternative: H+:   + (superiority)

        • Null: H0:  = 0 (equivalence)

        • Lower Alternative: H -:   - (inferiority)

      • Decisions:

        • Reject H0 , H - (for H+)  T  cU

        • Reject H+ , H - (for H0)  cL  T  cU

        • Reject H+ , H0 (for H -)  T  cL



    Fixed sample methods wrong
    Fixed Sample Methods Wrong

    • Simulated trials under null stop too often


    Simulated trials pocock
    Simulated Trials (Pocock)

    • Three equally spaced level .05 analyses

      Pattern of Proportion Significant

      Significance 1st 2nd 3rd Ever

      1st only .03046 .03046

      1st, 2nd .00807 .00807 .00807

      1st, 3rd .00317 .00317 .00317

      1st, 2nd, 3rd .00868 .00868 .00868 .00868

      2nd only .01921 .01921

      2nd, 3rd .01426 .01426 .01426

      3rd only .02445 .02445

      Any pattern .05038 .05022 .05056 .10830


    Pocock level 0 05
    Pocock Level 0.05

    • Three equally spaced level .022 analyses

      Pattern of Proportion Significant

      Significance 1st 2nd 3rd Ever

      1st only .01520 .01520

      1st, 2nd .00321 .00321 .00321

      1st, 3rd .00113 .00113 .00113

      1st, 2nd, 3rd .00280 .00280 .00280 .00280

      2nd only .01001 .01001

      2nd, 3rd .00614 .00614 .00614

      3rd only .01250 .01250

      Any pattern .02234 .02216 .02257 .05099


    Unequally spaced analyses
    Unequally Spaced Analyses

    • Level .022 analyses at 10%, 20%, 100% of data

      Pattern of Proportion Significant

      Significance 1st 2nd 3rd Ever

      1st only .01509 .01509

      1st, 2nd .00521 .00521 .00521

      1st, 3rd .00068 .00068 .00068

      1st, 2nd, 3rd .00069 .00069 .00069 .00069

      2nd only .01473 .01473

      2nd, 3rd .00165 .00165 .00165

      3rd only .01855 .01855

      Any pattern .02167 .02228 .02157 .05660


    Varying critical values obf
    Varying Critical Values (OBF)

    • Level 0.10 O’Brien-Fleming (1979); equally spaced tests at .003, .036, .087

      Pattern of Proportion Significant

      Significance 1st 2nd 3rd Ever

      1st only .00082 .00082

      1st, 2nd .00036 .00036 .00036

      1st, 3rd .00037 .00037 .00037

      1st, 2nd, 3rd .00127 .00127 .00127 .00127

      2nd only .01164 .01164

      2nd, 3rd .02306 .02306 .02306

      3rd only .06223 .01855

      Any pattern .00282 .03633 .08693 .09975


    Error spending pocock 0 05
    Error Spending: Pocock 0.05

    Pattern of Proportion Significant

    Significance 1st 2nd 3rd Ever

    1st only .01520 .01520

    1st, 2nd .00321 .00321 .00321

    1st, 3rd .00113 .00113 .00113

    1st, 2nd, 3rd .00280 .00280 .00280 .00280

    2nd only .01001 .01001

    2nd, 3rd .00614 .00614 .00614

    3rd only .01250 .01250

    Any pattern .02234 .02216 .02257 .05099

    Incremental error .02234 .01615 .01250

    Cumulative error .02234 .03849 .05099


    Stopping rules1

    Stopping Rules

    Definitions


    Question
    Question

    • Under what conditions should we stop the study early?


    Scientific reasons
    Scientific Reasons

    • Safety

    • Efficacy

    • Harm

    • Approximate equivalence

    • Futility


    Statistical criteria
    Statistical Criteria

    • Extreme estimates of treatment effect

    • Statistical significance (Frequentist)

      • At final analysis: Curtailment

      • Based on experimentwise error

        • Group sequential rule

        • Error spending function

    • Statistical credibility (Bayesian)

    • Probability of achieving statistical significance / credibility at final analysis

      • Condition on current data and presumed treatment effect


    Sequential sampling issues
    Sequential Sampling Issues

    • Design stage

      • Choosing sampling plan which satisfies desired operating characteristics

        • E.g., type I error, power, sample size requirements

    • Monitoring stage

      • Flexible implementation to account for assumptions made at design stage

        • E.g., adjust sample size to account for observed variance

    • Analysis stage

      • Providing inference based on true sampling distribution of test statistics


    Prespecified stopping plans
    Prespecified Stopping Plans

    • Prior to collection of data, specify

      • Scientific and statistical hypotheses of interest

      • Statistical criteria for credible evidence

      • Rule for determining maximal statistical information

        • E.g., fix power, maximal sample size, or study time

      • Randomization scheme

      • Rule for determining schedule of analyses

        • E.g., according to sample size, statistical information, or calendar time

      • Rule for determining conditions for early stopping

        • E.g., boundary shape function for stopping rule


    Sampling plan general approach
    Sampling Plan: General Approach

    • Perform analyses when sample sizes N1. . . NJ

      • Can be randomly determined

    • At each analysis choose stopping boundaries

      • aj < bj < cj < dj

    • Compute test statistic Tj=T(X1. . . XNj)

      • Stop if Tj < aj (extremely low)

      • Stop if bj < Tj < cj (approximate equivalence)

      • Stop if Tj > dj (extremely high)

      • Otherwise continue (maybe adaptive modification of analysis schedule, sample size, etc.)

        • Boundaries for modification of sampling plan


    Boundary scales
    Boundary Scales

    • Choices for test statistic Tj

      • Sum of observations

      • Point estimate of treatment effect

      • Normalized (Z) statistic

      • Fixed sample P value

      • Error spending function

      • Bayesian posterior probability

      • Conditional probability

      • Predictive probability


    Correspondence among scales
    Correspondence Among Scales

    • Choices for test statistic Tj

      • All of those choices for test statistics can be shown to be transformations of each other

      • Hence, a stopping rule for one test statistic is easily transformed to a stopping rule for a different test statistic

      • We regard these statistics as representing different scales for expressing the boundaries


    Boundary scales notation
    Boundary Scales: Notation

    • One sample inference about means

      • Generalizable to most other commonly used models


    Partial sum scale
    Partial Sum Scale

    • Uses:

      • Cumulative number of events

        • Boundary for 1 sample test of proportion

      • Convenient when computing density


    Mle scale
    MLE Scale

    • Uses:

      • Natural (crude) estimate of treatment effect


    Normalized z statistic scale
    Normalized (Z) Statistic Scale

    • Uses:

      • Commonly computed in analysis routines


    Fixed sample p value scale
    Fixed Sample P Value Scale

    • Uses:

      • Commonly computed in analysis routine

      • Robust to use with other distributions for estimates of treatment effect


    Error spending scale
    Error Spending Scale

    • Uses:

      • Implementation of stopping rules with flexible determination of number and timing of analyses


    Bayesian posterior scale
    Bayesian Posterior Scale

    • Uses:

      • Bayesian inference (unaffected by stopping)

      • Posterior probability of hypotheses


    Conditional power scale
    Conditional Power Scale

    • Uses:

      • Conditional power

        • Probability of significant result at final analysis conditional on data so far (and hypothesis)

    • Futility of continuing under specific hypothesis


    Conditional power scale mle
    Conditional Power Scale (MLE)

    • Uses:

      • Conditional power

      • Futility of continuing under specific hypothesis


    Predictive power scale
    Predictive Power Scale

    • Uses:

      • Futility of continuing study


    Predictive power flat prior
    Predictive Power (Flat Prior)

    • Uses:

      • Futility of continuing study


    Boundary scales1
    Boundary Scales

    • Stopping rule for one test statistic is easily transformed to a rule for another statistic

      • “Group sequential stopping rules”

        • Sum of observations

        • Point estimate of treatment effect

        • Normalized (Z) statistic

        • Fixed sample P value

        • Error spending function

      • Bayesian posterior probability

      • Stochastic Curtailment

        • Conditional probability

        • Predictive probability


    Which scale my view
    Which Scale: My View

    • Statistically

    • Scientifically


    Which scale my view1
    Which Scale: My View

    • Statistically

      • It doesn’t really matter

    • Scientifically

      • You see what a difference it makes


    Spectrum of stopping rules
    Spectrum of Stopping Rules

    • Down columns: Early stopping vs no early stopping

    • Across rows: One-sided vs two-sided decisions


    Unified family spectrum of boundary shapes
    Unified Family:Spectrum of Boundary Shapes

    • All of the rules depicted have the same type I error and power to detect the design alternative


    Error spending functions
    Error Spending Functions

    • My view: Poorly understood even by the researchers who advocate them

      • There is no such thing as THE Pocock or O’Brien-Fleming error spending function

        • Depends on type I or type II error

        • Depends on number of analyses

        • Depends on spacing of analyses



    Stochastic curtailment
    Stochastic Curtailment

    • Stopping boundaries chosen based on predicting future data

      • My objections will be discussed later


    Major issue
    Major Issue

    • Frequentist operating characteristics are based on the sampling distribution

      • Stopping rules do affect the sampling distribution of the usual statistics

        • MLEs are not normally distributed

        • Z scores are not standard normal under the null

          • (1.96 is irrelevant)

        • The null distribution of fixed sample P values is not uniform

          • (They are not true P values)





    Sequential sampling the price
    Sequential Sampling: The Price

    • It is only through full knowledge of the sampling plan that we can assess the full complement of frequentist operating characteristics

      • In order to obtain inference with maximal precision and minimal bias, the sampling plan must be well quantified

      • (Note that adaptive designs using ancillary statistics pose no special problems if we condition on those ancillary statistics.)


    Familiarity and contempt
    Familiarity and Contempt

    • For any known stopping rule, however, we can compute the correct sampling distribution with specialized software

      • From the computed sampling distributions we then compute

        • Bias adjusted estimates

        • Correct (adjusted) confidence intervals

        • Correct (adjusted) P values

      • Candidate designs can then be compared with respect to their operating characteristics


    Inferential methods
    Inferential Methods

    • Just extensions of methods that also work in fixed samples

      • But in fixed samples, many methods converge on the same estimate, unlike in sequential designs


    Point estimates2
    Point Estimates

    • Bias adjusted (Whitehead, 1986)

      • Assume you observed the mean of the sampling distribution

    • Median unbiased (Whitehead, 1983)

      • Assume you observed the median of the sampling distribution

    • Truncation adapted UMVUE (Emerson & Fleming, 1990)

    • (MLE is the naïve estimator: Biased and high MSE)


    Interval estimates
    Interval Estimates

    • Quantile unbiased estimates

      • Assume you observed the 2.5th or 97.5th percentile

    • Orderings of the outcome space

      • Analysis time or Stagewise

        • Tend toward wider CI, but do not need entire sampling distribution

      • Sample mean

        • Tend toward narrower CI

      • Likelihood ratio

        • Tend toward narrower CI, but less implemented


    P values
    P values

    • Orderings of the outcome space

      • Analysis time ordering

        • Lower probability of low p-values

        • Insensitive to late occurring treatment effects

      • Sample mean

        • High probability of lower p-values

      • Likelihood ratio

        • Highest probability of low p-values



    Evaluation of designs1
    Evaluation of Designs

    • Process of choosing a trial design

      • Define candidate design

        • Usually constrain two operating characteristics

          • Type I error, power at design alternative

          • Type I error, maximal sample size

      • Evaluate other operating characteristics

        • Different criteria of interest to different investigators

      • Modify design

      • Iterate


    Which operating characteristics
    Which Operating Characteristics

    • The same regardless of the type of stopping rule

      • Frequentist power curve

        • Type I error (null) and power (design alternative)

      • Sample size requirements

        • Maximum, average, median, other quantiles

        • Stopping probabilities

      • Inference at study termination (at each boundary)

        • Frequentist or Bayesian (under spectrum of priors)

      • (Futility measures

        • Conditional power, predictive power)


    At design stage
    At Design Stage

    • In particular, at design stage we can know

      • Conditions under which trial will continue at each analysis

        • Estimates

          • (Range of estimates leading to continuation)

      • Inference

        • (Credibility of results if trial is stopped)

    • Conditional and predictive power

  • Tradeoffs between early stopping and loss in unconditional power


  • Case study
    Case Study

    • Randomized, placebo controlled Phase III study of antibody to endotoxin

      • Intervention: Single administration

      • Endpoint: Difference in 28 day mortality rates

        • Placebo arm: estimate 30% mortality

        • Treatment arm: hope for 23% mortality

      • Analysis: Large sample test of binomial proportions

        • Frequentist based inference

        • Type I error: one-sided 0.025

        • Power: 90% to detect θ < -0.07

        • Point estimate with low bias, MSE; 95% CI


    Boundaries and power curves
    Boundaries and Power Curves

    • O’Brien-Fleming, Pocock boundary shape functions when J= 4 analyses and maintain power


    Impact of interim analyses
    Impact of Interim Analyses

    • Required increased maximal sample size in order to maintain power

      • Maximal sample size with 4 analyses

        • O’Brien-Fleming: N= 1773 ( 4.3% increase)

        • Pocock : N= 2340 (37.6% increase)

      • Need to consider

        • Average sample size

        • Probability of continuing past 1700 subjects

        • Conditions under which continue past 1700 subjects


    Asn 75th tile of sample size
    ASN, 75th %tile of Sample Size

    • O’Brien-Fleming, Pocock boundary shape functions;J=4 analyses and maintain power



    At design stage example
    At Design Stage: Example

    • With O’Brien-Fleming boundaries having 90% power to detect a 7% absolute decrease in mortality

      • Maximum sample size of 1700

      • Continue past 1275 if crude difference in 28 day mortality is between -2.9% and -5.7%

      • If we just barely stop for efficacy after 425 patients we will report

        • Estimated difference in mortality: -16.3%

        • 95% confidence interval: -8.7% to -22.4%

        • One-sided lower P < 0.0001


    Efficiency unconditional power
    Efficiency / Unconditional Power

    • Futility: Tradeoffs between early stopping and loss of power

      Boundaries Loss of Power Avg Sample Size


    Stochastic curtailment1
    Stochastic Curtailment

    • Boundaries transformed to conditional or predictive power

      • Key issue: Computations are based on assumptions about the true treatment effect

        • Conditional power

          • “Design”: based on hypotheses

          • “Estimate”: based on current estimates

        • Predictive power

          • “Prior assumptions”



    So what
    So What?

    • Why not use stochastic curtailment?

      • Choice of thresholds poorly understood

        • Do not correspond to unconditional power

      • Inefficient designs result


    Key issues
    Key Issues

    • Very different probabilities based on assumptions about the true treatment effect

      • Extremely conservative O’Brien-Fleming boundaries correspond to conditional power of 50% (!) under alternative rejected by the boundary

      • Resolution of apparent paradox: if the alternative were true, there is less than .003 probability of stopping for futility at the first analysis


    Apples with apples
    Apples with Apples

    • Can compare a group sequential rule to a fixed sample test providing

      • Same maximal sample size (N= 1700)

      • Same (worst case) average sample size (N= 1336)

      • Same power under the alternative (N= 1598)

    • Consider probability of “discordant decisions”

      • Conditional probability (conditional power)

      • Unconditional probability (power)


    Ordering of the outcome space
    Ordering of the Outcome Space

    • Choosing a threshold based on conditional power can lead to nonsensical orderings based on unconditional power

      • Decisions based on 19% conditional power may be more conservative than decisions based on 8% conditional power

      • Can result in substantial inefficiency (loss of power)


    Further comments
    Further Comments

    • Neither conditional power nor predictive power have good foundational motivation

      • Frequentists should use Neyman-Pearson paradigm and consider optimal unconditional power across alternatives

        • And conditional/predictive power is not a good indicator in loss of unconditional power

      • Bayesians should use posterior distributions for decisions


    Implementation
    Implementation

    • Methods have been described for flexible implementation of stopping rules when number and timing of analyses is random

      • Error spending function

      • Constrained boundaries



    Sequential sampling strategies
    Sequential Sampling Strategies

    • Two broad categories of sequential sampling

      • Prespecified stopping guidelines

      • Adaptive procedures


    Adaptive sampling plans1
    Adaptive Sampling Plans

    • At each interim analysis, possibly modify

      • Scientific and statistical hypotheses of interest

      • Statistical criteria for credible evidence

      • Maximal statistical information

      • Randomization ratios

      • Schedule of analyses

      • Conditions for early stopping


    Adaptive sampling examples
    Adaptive Sampling: Examples

    • Prespecified on the scale of statistical information

      • E.g., Modify sample size to account for estimated information (variance or baseline rates)

        • No effect on type I error IF

          • Estimated information independent of estimate of treatment effect

            • Proportional hazards,

            • Normal data, and/or

            • Carefully phrased alternatives

          • And willing to use conditional inference

            • Carefully phrased alternatives


    Estimate alternative
    Estimate Alternative

    • If maximal sample size is maintained, the study discriminates between null hypothesis and an alternative measured in units of statistical information


    Estimate sample size
    Estimate Sample Size

    • If statistical power is maintained, the study sample size is measured in units of statistical information


    Adaptive sampling examples1
    Adaptive Sampling: Examples

    • E.g., Proschan & Hunsberger (1995)

      • Modify ultimate sample size based on conditional power

        • Computed under current best estimate (if high enough)

      • Make adjustment to inference to maintain Type I error


    Incremental statistics
    Incremental Statistics

    • Statistic at the j-th analysis a weighted average of data accrued between analyses




    Two stage design
    Two Stage Design

    • Proschan & Hunsberger consider worst case

      • At first stage, choose sample size of second stage

        • N2 = N2(Z1) to maximize type I error

      • At second stage, reject if Z2 > a2

    • Worst case type I error of two stage design

      • Can be more than two times the nominal

        • a2 = 1.96 gives type I error of 0.0616

        • (Compare to Bonferroni results)


    Better approaches
    Better Approaches

    • Proschan and Hunsberger describe adaptations using restricted procedures to maintain experimentwise type I error

      • Must prespecify a conditional error function which would maintain type I error

        • Then find appropriate a2 for second stage based on N2 which can be chosen arbitrarily

      • But still have loss of power


    Motivation for adaptive designs
    Motivation for Adaptive Designs

    • Scientific and statistical hypotheses of interest

      • Modify target population, intervention, measurement of outcome, alternative hypotheses of interest

      • Possible justification

        • Changing conditions in medical environment

          • Approval/withdrawal of competing/ancillary treatments

          • Diagnostic procedures

        • New knowledge from other trials about similar treatments

        • Evidence from ongoing trial

          • Toxicity profile (therapeutic index)

          • Subgroup effects


    Motivation for adaptive designs1
    Motivation for Adaptive Designs

    • Modification of other design parameters may have great impact on the hypotheses considered

      • Statistical criteria for credible evidence

      • Maximal statistical information

      • Randomization ratios

      • Schedule of analyses

      • Conditions for early stopping


    Cost of planning not to plan
    Cost of Planning Not to Plan

    • Major issues with use of adaptive designs

      • What do we truly gain?

        • Can proper evaluation of trial designs obviate need?

      • What can we lose?

        • Efficiency? (and how should it be measured?)

        • Scientific inference?

          • Science vs Statistics vs Game theory

          • Definition of scientific/statistical hypotheses

          • Quantifying precision of inference


    Prespecified modification rules
    Prespecified Modification Rules

    • Adaptive sampling plans exact a price in statistical efficiency

      • Tsiatis & Mehta (2002)

        • A classic prespecified group sequential stopping rule can be found that is more efficient than a given adaptive design

      • Shi & Emerson (2003)

        • Fisher’s test statistic in the self-designing trial provides markedly less precise inference than that based on the MLE

          • To compute the sampling distribution of the latter, the sampling plan must be known


    Conditional predictive power1
    Conditional/Predictive Power

    • Additional issues with maintaining conditional or predictive power

      • Modification of sample size may allow precise knowledge of interim treatment effect

        • Interim estimates may cause change in study population

          • Time trends due to investigators gaining or losing enthusiasm

        • In extreme cases, potential for unblinding of individual patients

          • Effect of outliers on test statistics


    Final comments
    Final Comments

    • Adaptive designs versus prespecified stopping rules

      • Adaptive designs come at a price of efficiency and (sometimes) scientific interpretation

      • With adequate tools for careful evaluation of designs, there is little need for adaptive designs


    Bottom line2
    Bottom Line

    You better think (think)

    about what you’re

    trying to do…

    -Aretha Franklin


    References
    References

    • Available on www.emersonstatistics.com

      • Frequentist evaluation

      • Bayesian evaluation (Stat Med)

      • On the use of stochastic curtailment

      • Issues in the use of adaptive designs (Stat Med)

      • Group sequential P values under non-proportional hazards (Biometrics)

      • Implementation of group sequential rules using constrained boundaries (Biometrics)


    ad