Moving from Correlative Studies to Predictive Medicine

Richard Simon, D.Sc. • Chief, Biometric Research Branch • National Cancer Institute • brb.nci.nih.gov

Disclosure Information • Richard Simon, D.Sc. • I have no financial relationships to disclose. • I will not discuss off label use and/or investigational use in my presentation.

BRB Website: brb.nci.nih.gov • PowerPoint presentations • Reprints & Technical Reports • BRB-ArrayTools software • BRB-ArrayTools Data Archive • 100+ published cancer gene expression datasets with clinical annotations • Sample Size Planning for Targeted Clinical Trials

“Biomarkers” • Prognostic • Pre-treatment measurement to predict long-term outcome • Untreated or treated patients • Outcome not a direct measure of treatment benefit • Predictive • Pre-treatment measurement to predict response or benefit to a particular treatment • Surrogate endpoints • Pre, during and after treatment measurement to determine whether the treatment is working

Literature of Un-used Prognostic Factors • Most prognostic factors are not used because they are not therapeutically relevant • Most prognostic factor studies are poorly designed and not focused on a clear objective; they use a convenience sample of patients for whom tissue is available. Generally the patients are too heterogeneous to support therapeutically relevant conclusions

Prognostic Biomarkers Can be Therapeutically Relevant • 3-5% of node negative ER+ breast cancer patients require or benefit from systemic rx other than endocrine rx • Prognostic biomarker development that focuses on specific therapeutic decision contexts can provide valuable diagnostics for patient management • OncotypeDx

Key Features of OncotypeDx Development • Identification of important therapeutic decision context • Prognostic marker development using data for patients with node negative ER positive breast cancer receiving tamoxifen as only systemic treatment • Staged development and validation • Separation of data used for test development from data used for test validation • Development of robust assay with rigorous analytical validation • 21 gene RTPCR assay for FFPE tissue • Quality assurance by single reference laboratory operation

Predictive Biomarkers • In the past often studied as un-focused post-hoc subset analyses of RCTs. • Numerous subsets examined • Same data used to define subsets for analysis and for comparing treatments within subsets • No control of type I error • Led to conventional wisdom • Only hypothesis generation • Only valid if overall treatment difference is significant

Basic Cancer Research Demonstrates that Most Types of Cancer are Heterogeneous • Molecularly targeted treatments are likely to benefit only the patients whose tumors are driven by de-regulated pathways that are targets of the treatment • Treatment effects for cytotoxic treatments have been limited in broad eligibility clinical trials because only a subset of the patients benefited

Conducting phase III trials in the traditional way, with broad eligibility and the overall comparison as the primary analysis, may result in • A false negative trial • Unless a sufficiently large proportion of the patients have tumors driven by the targeted pathway • A positive trial leading to treatment of many patients who do not benefit

New Phase III Clinical Trials • Focused on patients considered most likely to benefit from new treatment based on predictive biomarker; or • Without pre-selection of patients but with statistical analysis plans that include • planned subset analysis based on a single predictive biomarker as primary analysis • Type I error for any positive claims from the RCT limited to 0.05 • Results are not hypothesis generation and not dependent on overall treatment effect being significant

Predictive Biomarker Classifiers • Single gene or protein based on knowledge of therapeutic target • Single gene or protein culled from set of candidate genes identified based on imperfect knowledge of therapeutic target • Empirically determined multi-gene classifier derived from correlating gene expression profiles to patient outcome after treatment • A classifier is more than a set of genes

Developmental Strategy (I) • Develop a predictive biomarker classifier that identifies the patients likely to benefit from the new drug • Develop a reproducible assay for the classifier (analytical validation) • Conduct phase II studies of unselected patients to demonstrate that classifier result is correlated with response to treatment (clinical validation) • Use the classifier to restrict eligibility to an RCT comparing regimen containing the new drug to a control using a phase III endpoint (medical utility of drug)

Develop Predictor of Response to New Drug • Using phase II data, develop predictor of response to new drug • [Diagram: patients predicted responsive are randomized to new drug vs control; patients predicted non-responsive are off study]

Applicability of Design I • Primarily for settings where the classifier is based on a single gene whose protein product is the target of the drug • e.g., trastuzumab • With a strong biological basis for the classifier, it may be unacceptable to expose classifier negative patients to the new drug • Analytical validation, biological rationale and phase II data provide basis for regulatory approval of the test • Phase III study focused on test + patients to provide data for approving the drug

Evaluating the Efficiency of Strategy (I) • Simon R and Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 10:6759-63, 2004; Correction and supplement 12:3229, 2006 • Maitournam A and Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 24:329-339, 2005. • Reprints and interactive sample size calculations at http://linus.nci.nih.gov

Compared two Clinical Trial Designs • Standard design • Randomized comparison of T to C without screening or selection using classifier • Targeted design • Obtain tissue and evaluate classifier on candidate patients • Randomize only classifier + patients

Relative efficiency of targeted design depends on • proportion of patients test positive • effectiveness of new drug (compared to control) for test negative patients • When less than half of patients are test positive and the drug has little or no benefit for test negative patients, the targeted design requires dramatically fewer randomized patients • The targeted design may require fewer or more screened patients than the standard design
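As a rough illustration of the dependence described above, the sketch below assumes that required sample size scales with the inverse square of the average treatment effect. That is a simplifying approximation (it ignores differences in outcome variance between the two designs), and the function and parameter names are hypothetical; the cited papers give the exact calculations.

```python
def randomized_ratio(prop_positive, neg_benefit_ratio):
    """Approximate n_std / n_targeted for randomized patients.

    prop_positive: fraction of patients who are test positive (0 < value <= 1).
    neg_benefit_ratio: benefit in test-negative patients as a fraction of the
        benefit in test-positive patients (0 = none, 1 = same benefit).
    Assumption: sample size is proportional to 1 / (average effect)^2.
    """
    diluted = prop_positive + (1 - prop_positive) * neg_benefit_ratio
    return 1.0 / diluted ** 2


def screened_ratio(prop_positive, neg_benefit_ratio):
    """Approximate n_std / (patients screened for the targeted design).

    The targeted design must screen about n_targeted / prop_positive patients.
    Values above 1 mean the targeted design screens fewer patients.
    """
    return prop_positive * randomized_ratio(prop_positive, neg_benefit_ratio)


for prop in (0.10, 0.25, 0.50):
    for ratio in (0.0, 0.5):
        print(prop, ratio,
              round(randomized_ratio(prop, ratio), 2),
              round(screened_ratio(prop, ratio), 2))
```

Under this approximation, with 25% of patients test positive and no benefit in test-negative patients, the standard design needs about 16 times as many randomized patients; when test-negative patients benefit half as much, the screened ratio drops below 1, so the targeted design screens more patients than the standard design enrolls.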

[Figure: ratio of randomized patients, n_std / n_targeted, when the treatment benefit for test-negative patients is half that for test-positive patients]

For Trastuzumab, even a relatively poor assay enabled conduct of a targeted phase III trial which was crucial for establishing effectiveness • Recent results with Trastuzumab in early stage breast cancer show dramatic benefits for patients selected for Her-2 expression

Trastuzumab • Metastatic breast cancer • 234 randomized patients per arm • 90% power for 13.5% improvement in 1-year survival • If benefit were limited to the 25% assay + patients, overall improvement in survival would have been 3.375% • 4025 patients/arm would have been required • If assay – patients benefited half as much, 627 patients per arm would have been required
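The dilution arithmetic behind these figures can be checked with a short sketch. The helper name `overall_improvement` is hypothetical; the per-arm sample sizes are taken from the slide as given and are not recomputed here, since the baseline survival rate is not stated.

```python
def overall_improvement(effect_positive, prop_positive, neg_benefit_ratio=0.0):
    """Average improvement (in percentage points) over all eligible patients.

    effect_positive: improvement in assay-positive patients.
    prop_positive: fraction of patients who are assay positive.
    neg_benefit_ratio: benefit in assay-negative patients relative to
        assay-positive patients (0 = no benefit).
    """
    effect_negative = effect_positive * neg_benefit_ratio
    return prop_positive * effect_positive + (1 - prop_positive) * effect_negative


# Benefit limited to the 25% assay-positive patients.
print(overall_improvement(13.5, 0.25))        # 3.375
# Assay-negative patients benefit half as much.
print(overall_improvement(13.5, 0.25, 0.5))   # 8.4375

# Inflation in patients per arm implied by the slide's reported numbers.
print(round(4025 / 234, 1), round(627 / 234, 1))
```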

Randomizing Test Negative Patients • We don’t think that this drug will help you because your tumor is test negative. But we need to show the FDA that a drug we don’t think will help test negative patients actually doesn’t • We don’t think that this drug will help you, but we often find that we don’t know much about the drugs we develop so we want to try the drug on you

Developmental Strategy (II) • Develop predictor of response to new Rx • [Diagram: patients predicted responsive to new Rx are randomized to new Rx vs control; patients predicted non-responsive to new Rx are also randomized to new Rx vs control]

Developmental Strategy (II) • Do not use the diagnostic to restrict eligibility, but to structure a prospective analysis plan • Having a prospective analysis plan is essential • “Stratifying” (balancing) the randomization is useful to ensure that all randomized patients have tissue available but is not a substitute for a prospective analysis plan • The purpose of the study is to evaluate the new treatment overall and for the pre-defined subsets; not to modify or refine the classifier • The purpose is not to demonstrate that repeating the classifier development process on independent data results in the same classifier

Analysis Plan A (confidence in classifier) • Compare the new drug to the control for classifier positive patients • If p+ > 0.05 make no claim of effectiveness • If p+ ≤ 0.05 claim effectiveness for the classifier positive patients and • Compare new drug to control for classifier negative patients using 0.05 threshold of significance
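The sequential logic of Plan A can be written as a small decision function. This is a minimal sketch; the function name and return format are hypothetical, and the p-values are assumed to come from pre-specified treatment comparisons within each classifier subset.

```python
def analysis_plan_a(p_positive, p_negative, alpha=0.05):
    """Return the list of subsets for which effectiveness may be claimed.

    The classifier-negative comparison is examined only if the
    classifier-positive comparison is significant (fixed-sequence testing).
    """
    claims = []
    if p_positive <= alpha:
        claims.append("classifier positive")
        if p_negative <= alpha:
            claims.append("classifier negative")
    return claims


print(analysis_plan_a(0.01, 0.20))   # ['classifier positive']
print(analysis_plan_a(0.20, 0.001))  # []
print(analysis_plan_a(0.01, 0.03))   # ['classifier positive', 'classifier negative']
```

Because the second test is reached only after the first is significant, both can be run at the full 0.05 level without raising the study-wise false positive rate above 5%.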

Sample size for Analysis Plan A • 88 events in classifier + patients needed to detect 50% reduction in hazard at 5% two-sided significance level with 90% power • If 25% of patients are positive, then when there are 88 events in positive patients there will be about 264 events in negative patients • 264 events provides 90% power for detecting 33% reduction in hazard at 5% two-sided significance level
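These event counts are consistent with a standard approximation for the log-rank test under 1:1 randomization (often attributed to Schoenfeld). The slide does not say which formula was used, so treating this as the source of the numbers is an assumption, and `required_events` is a hypothetical helper name.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.90):
    """Approximate total events needed to detect a given hazard ratio.

    Uses events = 4 * (z_{1-alpha/2} + z_{power})^2 / (ln HR)^2,
    assuming equal allocation and a two-sided test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_power) ** 2 / log(hazard_ratio) ** 2)


# 50% reduction in hazard (hazard ratio 0.5).
print(required_events(0.5))  # 88

# Events expected in negative patients when 25% are positive,
# assuming similar event rates in the two subsets.
prop_positive = 0.25
print(88 * (1 - prop_positive) / prop_positive)  # 264.0

# A 33% reduction (hazard ratio about 0.67) needs a similar number of events.
print(required_events(0.67))
```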

Study-wise false positivity rate is limited to 5% • It is not necessary or appropriate to require that the treatment vs control difference be significant overall before doing the analysis within subsets

Analysis Plan B (limited confidence in test) • Compare the new drug to the control overall for all patients ignoring the classifier. • If p_overall ≤ 0.03 claim effectiveness for the eligible population as a whole • Otherwise perform a single subset analysis evaluating the new drug in the classifier + patients • If p_subset ≤ 0.02 claim effectiveness for the classifier + patients.
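Plan B splits the 0.05 type I error into 0.03 for the overall test and 0.02 for the subset test. The sketch below encodes that rule and checks the false positive rate by simulation when the treatment has no effect. The simulation setup is an assumption made for illustration: independent standard normal test statistics in the positive and negative subsets, combined with weights proportional to subset size, and two-sided p-values. All names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def analysis_plan_b(p_overall, p_subset, alpha_overall=0.03, alpha_subset=0.02):
    """Return the population for which effectiveness is claimed, or None."""
    if p_overall <= alpha_overall:
        return "all eligible patients"
    if p_subset <= alpha_subset:
        return "classifier positive patients"
    return None


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))


def null_false_positive_rate(prop_positive=0.25, n_sim=20000, seed=1):
    """Estimate how often Plan B makes any claim when there is no effect."""
    rng = random.Random(seed)
    claims = 0
    for _ in range(n_sim):
        z_pos = rng.gauss(0, 1)   # statistic in classifier-positive patients
        z_neg = rng.gauss(0, 1)   # statistic in classifier-negative patients
        z_all = sqrt(prop_positive) * z_pos + sqrt(1 - prop_positive) * z_neg
        if analysis_plan_b(two_sided_p(z_all), two_sided_p(z_pos)) is not None:
            claims += 1
    return claims / n_sim


print(null_false_positive_rate())
```

Since the two thresholds sum to 0.05, the chance of any false claim cannot exceed 5%; the overlap between the overall and subset tests makes the simulated rate somewhat lower.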

This analysis strategy is designed to not penalize sponsors for having developed a classifier • It provides sponsors with an incentive to develop genomic classifiers

Analysis Plan C (adaptive) • Test for difference (interaction) between treatment effect in test positive patients and treatment effect in test negative patients • If interaction is significant at level α_int then compare treatments separately for test positive patients and test negative patients • Otherwise, compare treatments overall
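The branching in Plan C can be sketched as follows. The default `alpha_int=0.10` is the value used in the simulation results later in the deck; the 0.05 level for the follow-up comparisons is an assumption, and the function name and return format are hypothetical.

```python
def analysis_plan_c(p_interaction, p_positive, p_negative, p_overall,
                    alpha_int=0.10, alpha=0.05):
    """Return which comparisons are made and whether each is significant.

    A significant interaction leads to separate comparisons in test-positive
    and test-negative patients; otherwise only the overall comparison is made.
    """
    if p_interaction <= alpha_int:
        return {
            "test positive": p_positive <= alpha,
            "test negative": p_negative <= alpha,
        }
    return {"overall": p_overall <= alpha}


# Significant interaction: subsets are analyzed separately.
print(analysis_plan_c(0.04, 0.001, 0.60, 0.03))
# No significant interaction: only the overall comparison is reported.
print(analysis_plan_c(0.40, 0.02, 0.08, 0.01))
```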

Sample Size Planning for Analysis Plan C • 88 events in test + patients needed to detect 50% reduction in hazard at 5% two-sided significance level with 90% power • If 25% of patients are positive, when there are 88 events in positive patients there will be about 264 events in negative patients • 264 events provides 90% power for detecting 33% reduction in hazard at 5% two-sided significance level

Simulation Results for Analysis Plan C • Using α_int = 0.10, the interaction test has power 93.7% when there is a 50% reduction in hazard in test positive patients and no treatment effect in test negative patients • A significant interaction and significant treatment effect in test positive patients is obtained in 88% of cases under the above conditions • If the treatment reduces hazard by 33% uniformly, the interaction test is negative and the overall test is significant in 87% of cases

The Roadmap • Develop a completely specified genomic classifier of the patients likely to benefit from a new drug • Establish analytical validity (reproducibility and robustness) of measurement of the classifier • Use phase II data to establish clinical validity of the predictive test • Use the completely specified classifier to design and analyze a new clinical trial to evaluate medical utility of the new treatment in patient populations pre-specified based on the classifier

Guiding Principle • The data used to develop the classifier must be distinct from the data used to test predictive accuracy of the classifier and to test hypotheses about treatment effect in subsets determined by the classifier • Developmental studies are exploratory • And not closely regulated by FDA • Studies on which treatment effectiveness claims are to be based should be definitive studies that test a treatment hypothesis in a patient population completely pre-specified by the classifier

Biomarker Adaptive Threshold Design Wenyu Jiang, Boris Freidlin & Richard Simon JNCI 99:1036-43, 2007

Biomarker Adaptive Threshold Design • Randomized phase III trial comparing new treatment E to control C • Survival or DFS endpoint • Have identified a predictive index B thought to be predictive of patients likely to benefit from E relative to C • Eligibility not restricted by biomarker • No threshold for biomarker determined

Analysis Plan • S(b) = log likelihood ratio statistic measuring effectiveness of treatment versus control in subset of patients with B ≥ b • Compute S(b) for all possible threshold values • Determine T = max{S(b)} • Compute null distribution of T by permuting treatment labels • Permute the labels of which patients are in which treatment group • Re-analyze to determine T for permuted data • Repeat for 10,000 permutations

If the data value of T is significant at 0.05 level using the permutation null distribution of T, then reject null hypothesis that E is ineffective • Compute bootstrap confidence interval for the threshold b
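The permutation procedure above can be sketched in a few lines. To keep the example self-contained, it swaps in a simpler statistic: a squared standardized difference in mean outcome replaces the survival log likelihood ratio, and the subset is taken to be patients with biomarker at or above the threshold. Both choices, the simulated data, and all function names are assumptions for illustration, not the published method. The demo uses 200 permutations for speed, whereas the plan calls for 10,000.

```python
import random
import statistics


def subset_stat(y, treat, biom, b, min_per_arm=5):
    """Squared standardized mean difference (E vs C) among patients with biom >= b."""
    e = [yi for yi, ti, bi in zip(y, treat, biom) if bi >= b and ti == 1]
    c = [yi for yi, ti, bi in zip(y, treat, biom) if bi >= b and ti == 0]
    if len(e) < min_per_arm or len(c) < min_per_arm:
        return 0.0  # too few patients in one arm to compare
    se2 = statistics.variance(e) / len(e) + statistics.variance(c) / len(c)
    if se2 == 0:
        return 0.0
    return (statistics.mean(e) - statistics.mean(c)) ** 2 / se2


def max_stat(y, treat, biom, thresholds):
    """T = max over candidate thresholds b of S(b)."""
    return max(subset_stat(y, treat, biom, b) for b in thresholds)


def adaptive_threshold_pvalue(y, treat, biom, thresholds, n_perm=1000, seed=0):
    """Permutation p-value for T, obtained by shuffling treatment labels."""
    rng = random.Random(seed)
    t_obs = max_stat(y, treat, biom, thresholds)
    labels = list(treat)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if max_stat(y, labels, biom, thresholds) >= t_obs:
            exceed += 1
    return t_obs, (exceed + 1) / (n_perm + 1)


# Simulated data: treatment helps only patients with biomarker >= 0.5.
rng = random.Random(1)
n = 200
biomarker = [rng.random() for _ in range(n)]
treatment = [i % 2 for i in range(n)]
outcome = [rng.gauss(0, 1) + (2.0 if t == 1 and b >= 0.5 else 0.0)
           for t, b in zip(treatment, biomarker)]
thresholds = [k / 10 for k in range(10)]

t_obs, p_value = adaptive_threshold_pvalue(
    outcome, treatment, biomarker, thresholds, n_perm=200)
print(t_obs, p_value)
```

Taking the maximum over thresholds inside every permutation is what accounts for the search over cut-points, so the resulting p-value stays valid even though no threshold was fixed in advance.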

Use of Archived Samples • For developing prognostic or predictive biomarkers • For validating a pre-defined prognostic or predictive biomarker

Use of Archived Samples for Marker Development • From a non-targeted “negative” clinical trial to develop a binary classifier of a subset thought to benefit from treatment • From a control arm of a non-targeted clinical trial to develop a prognostic classifier of patients who do not require additional treatment

Use of Archived Samples for Validation • Clinical validation using specimens from patients on single arm phase II trial • Correlate predictive biomarker to response • Clinical utility using specimens from RCT comparing new treatment to control regimen • “Prospective analysis plan” • Sufficient sample size and percent of patients with adequate archived tissue • Separate analytical and pre-analytical validation of robustness of test to real-time tissue handling and laboratory variation

Developmental Studies vs Validation Studies • Validation studies use prognostic or predictive biomarkers or composite classifiers that have been completely defined in previous developmental studies • Validation studies should not become developmental studies by refining the biomarkers to be validated • Validation does not mean repeating the developmental process on independent data

Types of Validation for Prognostic and Predictive Biomarkers • Analytical validation • Pre-analytical and post-analytical robustness • Clinical validation • Does the biomarker predict what it’s supposed to predict for independent data • Clinical utility • Does use of the biomarker result in patient benefit

Clinical Utility • Benefits patient by improving treatment decisions • Depends on context of use of the biomarker • Treatment options and practice guidelines • Other prognostic factors

Clinical Utility • Prognostic biomarker for identifying patients • for whom practice standards imply cytotoxic chemotherapy • who have good prognosis without chemotherapy • Prospective trial to identify such patients and withhold chemotherapy • TAILORx • “Prospective plan” for analysis of archived specimens from previous clinical trial in which patients did not receive chemotherapy

Flaws in Randomizing Which Patients Get New Test • Select patients with node negative ER+ breast cancer • Randomize the patients to standard of care (SOC) vs classifier determined rx • Compare outcomes of the randomized groups overall • Very inefficient because most patients get same treatment in both arms • Since classifier is not measured in SOC arm, the trial must be sized to detect minuscule overall difference in outcome

Measure classifier for all patients and randomize only those for whom classifier determined therapy differs from standard of care • MINDACT • Primary analysis in MINDACT is single arm evaluation of distant-DFS in randomized patients who receive endocrine therapy alone
