
Stratified Analysis of A Binary Endpoint and “Beyond”


Christy Chuang-Stein

Statistical Research and Consulting Center

Pfizer Inc

ASA Biopharm Section Webinar

May 7 2009

- October 21, 2008
- Devan Mehrotra - Stratified Analyses: Tips for Improving Power (http://www.biopharmnet.com/doc/2008_10_21_webinar.pdf )

- April 3, 2009
- Frank Harrell – Case Study in Parametric Survival Modeling
- First 16 slides or so on “Covariable Adjustment in Randomized Clinical Trials” (http://www.biopharmnet.com/doc/2009_04_03_webinar.pdf )

- Stratified Analysis of a Binary Endpoint
- Inverse vs CMH Weighting
- Simpson’s Paradox and Collapsibility

- Beyond
- Stratified Randomization vs Stratified Analysis
- Stratification and Subgroup Analysis
- Sample Sizing for a Multi-regional Trial
- Regulatory Guidances on Global Trials, Data Extrapolation

- Conclusion

- A confirmatory trial in severe sepsis: a double-blind, placebo-controlled trial of an IV infusion of 96 hours' duration, with randomization stratified by center.
- The primary analysis compared the 28-day mortality rate after treatment onset, stratified by 3 pre-specified covariates: APACHE II score, age and protein C activity.
- The trial was terminated by an independent DSMB for efficacy after the 2nd interim analysis of 1520 patients.
- Many subgroup analyses were conducted, including APACHE II subgroups (4 defined by the observed quartiles), subgroups defined by the components of the APACHE II score, and subgroups defined by 1, or 2, or 3, or at least 4 organ dysfunctions.

- Three measures are commonly used to assess efficacy within the jth APACHE II stratum:
- Risk difference dj : p1j – p2j
- Relative risk rj : p1j / p2j
- Odds ratio oj : { p1j (1 - p2j ) } / { (1 - p1j ) p2j }
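
The three measures above can be sketched for a single stratum as follows. The function name and the counts in the example are hypothetical and are not taken from the sepsis trial; group 1 is the new treatment and group 2 is placebo, following the notation p1j and p2j.

```python
def effect_measures(events1, n1, events2, n2):
    """Risk difference, relative risk and odds ratio for one stratum.

    events1/n1: events and patients in group 1 (new treatment)
    events2/n2: events and patients in group 2 (placebo)
    Assumes both observed rates are strictly between 0 and 1.
    """
    p1 = events1 / n1
    p2 = events2 / n2
    return {
        "risk_difference": p1 - p2,
        "relative_risk": p1 / p2,
        "odds_ratio": (p1 * (1 - p2)) / ((1 - p1) * p2),
    }


# Hypothetical stratum: 30/100 deaths on treatment, 40/100 on placebo
example = effect_measures(30, 100, 40, 100)
print(example)
```

A stratum with a zero or 100% event rate in either group makes the relative risk or odds ratio undefined; this sketch does not guard against that case.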

- Denote the observed rate by p̂ij, where p̂ij = nij1 / nij+.
- We will focus on the risk difference. In each stratum, estimate p1j – p2j by the observed difference d̂j = p̂1j – p̂2j. We will then form an overall treatment effect estimate and construct a test statistic.

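- A common approach is to form a weighted average of the stratum-specific estimates d̂j of the risk differences and construct a test statistic for the overall effect as

$$X^2 = \frac{\left(\sum_j w_j \hat{d}_j\right)^2}{\sum_j w_j^2 \, \widehat{\mathrm{Var}}(\hat{d}_j)}$$

X² has an asymptotic chi-square distribution with 1 degree of freedom if Σj wj dj = 0.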

- Inverse variance – {wj} is equal to the inverse of the sample variance of the estimated risk difference d̂j. In this case, X² will be

$$X^2 = \frac{\left(\sum_j w_j \hat{d}_j\right)^2}{\sum_j w_j}$$

When dj = d (the risk difference is uniform across the strata), the inverse variance weighting produces the minimum variance estimate for the common risk difference d, which is unbiased for large samples. This method is favored by meta-analysts.

- CMH method – {wj} is proportional to the harmonic mean of n1j+ and n2j+, i.e., wj = n1j+ n2j+ / (n1j+ + n2j+). This method produces the X² test by Cochran, which is asymptotically equivalent to a test developed by Mantel and Haenszel. A continuity correction could be applied.

- Let fj represent the relative frequency of patients in the jth stratum in the population. When the study population mimics the target population, the CMH estimate is approximately unbiased for Σj fj dj.
- The above makes CMH weighting attractive when one is not sure if the treatment effect is the same across the strata.
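
A minimal sketch of the two weighting schemes, assuming inverse variance weights 1/Var(d̂j) and CMH weights n1j+ n2j+ / (n1j+ + n2j+), with the unpooled binomial variance for each stratum. The function name and the counts are hypothetical (four equal-sized strata with true differences of 0, -3%, -9% and -12%); they are not the sepsis data.

```python
from math import sqrt


def stratified_risk_difference(strata, weighting="cmh"):
    """Weighted risk difference and its standard error.

    strata: list of (events1, n1, events2, n2), one tuple per stratum.
    weighting: "cmh" (sample-size weights) or "iv" (inverse variance).
    Assumes no stratum has a zero variance estimate when weighting="iv".
    """
    rows = []
    for e1, n1, e2, n2 in strata:
        p1, p2 = e1 / n1, e2 / n2
        d = p1 - p2
        v = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2  # unpooled variance
        w = 1.0 / v if weighting == "iv" else n1 * n2 / (n1 + n2)
        rows.append((w, d, v))
    total_w = sum(w for w, _, _ in rows)
    estimate = sum(w * d for w, d, _ in rows) / total_w
    std_err = sqrt(sum(w * w * v for w, _, v in rows)) / total_w
    return estimate, std_err


# Hypothetical counts: (deaths on treatment, n, deaths on placebo, n)
hypothetical = [(20, 200, 20, 200), (34, 200, 40, 200),
                (52, 200, 70, 200), (76, 200, 100, 200)]
cmh_est, cmh_se = stratified_risk_difference(hypothetical, "cmh")
iv_est, iv_se = stratified_risk_difference(hypothetical, "iv")
print(f"CMH: {cmh_est:.4f} (SE {cmh_se:.4f}); IV: {iv_est:.4f} (SE {iv_se:.4f})")
```

With equal stratum sizes and equal allocation the CMH weights are all equal, so the CMH estimate is the simple average of the stratum differences, while the inverse variance estimate is pulled toward the low-mortality strata, whose differences are estimated more precisely.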

When the mortality rate is low, there is not much room to improve. Most of the benefit is in the high-risk population.

- Weighting by the relative frequency of a stratum within the population leads to an overall treatment effect Σj fj dj of 0.25*(0)+0.25*(3%)+0.25*(9%)+0.25*(12%) = 6%.
- Assume equal allocation within each stratum. The overall treatment effect estimate under the CMH weighting will approach 6% for large samples.
- If we use the inverse variance weighting, we will weight treatment effects in the 1Q, 2Q, 3Q and 4Q by 2.23 : 1.38 : 1.20 : 1.00. The effect estimate will approach 4.5% for large samples.
- The inverse variance weighting will underestimate the parameter Σj fj dj of interest in this case.
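
The arithmetic in this example can be checked directly from the quoted effects and weights. The helper name is hypothetical. Because the inverse variance weights are quoted only to two decimals, the recomputed limit comes out near 4.6% rather than exactly the 4.5% stated above; the qualitative conclusion is unchanged.

```python
def weighted_mean(values, weights):
    """Weighted average with weights normalized to sum to 1."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


effects_pct = [0.0, 3.0, 9.0, 12.0]      # risk differences in 1Q..4Q (percent)
freq_weights = [0.25, 0.25, 0.25, 0.25]  # equal-sized quartiles
iv_weights = [2.23, 1.38, 1.20, 1.00]    # rounded ratios quoted in the example

cmh_limit = weighted_mean(effects_pct, freq_weights)
iv_limit = weighted_mean(effects_pct, iv_weights)
print(f"frequency-weighted: {cmh_limit:.2f}%, inverse-variance-weighted: {iv_limit:.2f}%")
```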

- The CMH test statistic has a value of 7.310 with 1 degree of freedom (no continuity correction). The two-sided P-value is 0.0068. The CMH test statistic computes the variance assuming p1j = p2j for all j.
- A 95% CI for the overall difference in the mortality rate (new treatment – placebo) under the CMH weighting is (-9.8%, -1.6%). The calculation of variance in this case does not assume p1j = p2j.
- The inverse variance approach produces a 95% CI for the difference in the mortality rate (new treatment – placebo) of (-8.1%, -0.1%).

- The differences in the mortality rates (new treatment – placebo) in the 4 APACHE II strata range from 3% to –12%.
- The graph suggests a possible interaction that might be qualitative in nature.
- We will look at an approach proposed by Gail and Simon (1985, Biometrics, 41:361-372) to test for qualitative interaction.

Dmitrienko et al (2005). Analysis of Clinical Trials Using SAS.

- Let O+ = {di ≥ 0} = set of non-negative differences
- Let O- = {di ≤ 0} = set of non-positive differences
- Q > c can be used to test the null hypothesis of no qualitative interaction.
- Q follows a fairly complex distribution based on a weighted sum of chi-square distributions. SAS code is available in the book by Dmitrienko et al.

- Q+ can be used to test the null hypothesis of all differences being negative. Q- can be used to test the null hypothesis of all differences being positive.
- For the sepsis study, the two-sided Gail-Simon test has a P-value of 0.4822.
- The one-sided P-value for H0 of positive differences (new treatment – placebo) is 0.0030. The one-sided P-value for H0 of negative differences is 0.6005.
- Like other interaction tests, G-S test requires strong evidence before we can reject the no qualitative interaction hypothesis.
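
A sketch of the Gail-Simon calculation in Python, offered as an illustration rather than a port of the SAS code in Dmitrienko et al. The formulas for Q+, Q- and Q did not survive in this transcript, so the definitions below are assumptions: Q+ is the sum of (di/si)² over positive differences, Q- the sum over negative differences, Q = min(Q+, Q-), and the P-values use the binomial-weighted chi-square tails from Gail and Simon (1985). The function names and the example inputs are hypothetical.

```python
from math import comb, erfc, exp, pi, sqrt


def chi2_sf(x, df):
    """P(chi-square with integer df >= 1 exceeds x), via finite closed-form series."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term = total = exp(-half)
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return total
    total = erfc(sqrt(half))
    term = sqrt(2.0 * x / pi) * exp(-half)
    for j in range((df - 1) // 2):
        total += term
        term *= x / (2 * j + 3)
    return total


def gail_simon(diffs, std_errs):
    """Gail-Simon statistics and P-values for m independent stratum differences."""
    m = len(diffs)
    q_plus = sum((d / s) ** 2 for d, s in zip(diffs, std_errs) if d > 0)
    q_minus = sum((d / s) ** 2 for d, s in zip(diffs, std_errs) if d < 0)
    q = min(q_plus, q_minus)

    def tail(stat, n_trials):
        # Binomial(n_trials, 1/2) mixture of chi-square upper tails, df = 1..n_trials
        return sum(comb(n_trials, h) * 0.5 ** n_trials * chi2_sf(stat, h)
                   for h in range(1, n_trials + 1))

    return {
        "Q": q,
        "p_two_sided": tail(q, m - 1),                # H0: no qualitative interaction
        "p_null_all_nonpositive": tail(q_plus, m),    # H0: all differences <= 0
        "p_null_all_nonnegative": tail(q_minus, m),   # H0: all differences >= 0
    }


# Hypothetical stratum differences (new treatment - placebo) and standard errors
result = gail_simon([0.03, -0.02, -0.09, -0.12], [0.03, 0.03, 0.035, 0.04])
print(result)
```

Note that when every observed difference has the same sign, Q is 0 and the two-sided P-value under these formulas is 1 - 2^-(m-1) rather than 1.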

- Data from this single study led to the approval of Xigris®
- Xigris® INDICATIONS AND USAGE
Xigris is indicated for the reduction of mortality in adult patients with severe sepsis (sepsis associated with acute organ dysfunction) who have a high risk of death (e.g., as determined by APACHE II).

Safety and efficacy have not been established in adult patients with severe sepsis and lower risk of death.

| APACHE II quartile (score) | Xigris: Total | Xigris: Mortality rate | Placebo: Total | Placebo: Mortality rate |
|---|---|---|---|---|
| 1st + 2nd (3-24) | 436 | 18.8% | 437 | 19.0% |
| 3rd + 4th (25-53) | 414 | 30.9% | 403 | 43.7% |

- Patients who have a high risk for death are represented by an APACHE II score in the 3rd and 4th APACHE II score categories.
- Treatment effects would need to differ more than they do in this case for the Gail-Simon test to conclude that there is an interaction.

- Could one have anticipated this extent of treatment difference before the trial?
- If yes, what would have been a good design and analysis strategy?
- Options
- Specify the high risk population as the primary analysis population and enroll adequate patients in this group.
- Test both the high risk population and the entire population with adjustment for multiplicity.
- Analysis follows the design strategy.

- Losartan Intervention For Endpoint Reduction in Hypertension Study.
- Conducted at 945 sites in 7 countries.
- Enrolled 9193 hypertensive patients with left ventricular hypertrophy (LVH)
- The primary endpoint is a composite endpoint of cardiovascular deaths, stroke, and myocardial infarction.

- Results reviewed by the FDA Cardiovascular and Renal Drugs AC on Jan 6 2003 for a new proposed indication
Cozaar is indicated to reduce the risk of cardiovascular morbidity and mortality as measured by the combined incidence of cardiovascular death, stroke, and myocardial infarction in hypertensive patients with left ventricular hypertrophy.

- Losartan’s then label states that the effect on blood pressure reduction in blacks was somewhat less than that in whites (a common statement for beta-blockers).
- The FDA statistician quoted data from three endpoint studies of other drugs. These studies demonstrated less or no treatment effect in blacks when compared to whites.
- On the primary endpoint, when compared to atenolol, losartan had a hazard ratio of 0.869 (95% CI from 0.772 to 0.979) with a P-value of 0.021. The effect came primarily from the stroke component of the composite.
- The issue of how losartan compared to atenolol in blacks came up.

Hazard Ratio and 95% CIs - Primary Endpoint

- Nominal p-value for Black vs. Non-Black Qualitative Interaction = 0.016.
- Impossible to correctly adjust this p-value for multiple comparisons post hoc.
- 3 subgroups pre-specified for special importance (U.S. region, Diabetics, ISH)
- To do it correctly, the formal analysis plan would need to list all important subgroups and specify a method to correctly adjust for the number of tests.
Source: John Lawrence’s (FDA Statistical Reviewer) slides at the January 6 2003 FDA AC meeting. For more discussion, see http://www.fda.gov/ohrms/dockets/ac/03/slides/3920s1.htm

Indications and Usage

… COZAAR is indicated to reduce the risk of stroke in patients with hypertension and left ventricular hypertrophy, but there is evidence that this benefit does not apply to Black patients. …

Clinical Pharmacology

In the LIFE study, Black patients treated with atenolol were at lower risk of experiencing the primary composite endpoint compared with Black patients treated with COZAAR…. This finding could not be explained on the basis of differences in the populations other than race or on any imbalances between treatment groups… the LIFE study provides no evidence that the benefits of COZAAR on reducing the risk of cardiovascular events in hypertensive patients with left ventricular hypertrophy apply to Black patients.

- In the case of Xigris, subgroups defined by APACHE II score were pre-specified. Statistical significance was not achieved by the Gail-Simon test at the 5% level.
- In the case of COZAAR, race subgroups were not pre-specified. They are, however, among the “usual” demographic subgroups and there is an a priori reason for looking at this subgroup. A post hoc Gail-Simon test produced a P-value less than 0.05.
- The end results (language in the product package insert) are similar – the label describes differential treatment effects in the subgroups.

13% vs 9.5%: a two-sided P-value of 0.023.

95% CI for the diff (A – B) using inverse variance weighting is (-0.017, 0.018) with a point estimate of 0.001. What happens?

The study with the highest AE rates had twice as many subjects on Drug A as on Drug B.

- Within each study, the two groups have the same event rates.
- Study 1 randomized patients 1:1:1:1 to 3 doses and 1 control.
- Study 2 randomized patients 1:1 to one dose and control.

| Treatment | Event | No Event | Combined |
|---|---|---|---|
| New | 240 (48%) | 260 (52%) | 500 |
| Control | 120 (40%) | 180 (60%) | 300 |

- Pooling produces an event rate of 48% for the new treatment and 40% for the control.
- The chi-square statistic has a two-sided P-value = 0.028.
- Conducting un-stratified (un-adjusted) analysis in this case will lead to an erroneous conclusion.
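
The pooling effect can be reproduced numerically. The per-study counts below are not given in this transcript; they are a reconstruction that is consistent with the pooled table (240/500 vs 120/300), a 3:1 new-to-control ratio in Study 1, a 1:1 ratio in Study 2, and identical event rates for the two groups within each study. Treat them as an assumption.

```python
from math import erfc, sqrt

# (events, patients) per arm; reconstructed counts, see the note above
studies = {
    "Study 1 (3 doses : 1 control)": {"new": (180, 300), "control": (60, 100)},
    "Study 2 (1 dose : 1 control)": {"new": (60, 200), "control": (60, 200)},
}

# Within-study risk differences (new - control)
within_diffs = {
    name: arms["new"][0] / arms["new"][1] - arms["control"][0] / arms["control"][1]
    for name, arms in studies.items()
}

# Naive pooling across studies
a = sum(arms["new"][0] for arms in studies.values())      # events, new
n_new = sum(arms["new"][1] for arms in studies.values())
c = sum(arms["control"][0] for arms in studies.values())  # events, control
n_ctl = sum(arms["control"][1] for arms in studies.values())
b, d = n_new - a, n_ctl - c
n = n_new + n_ctl

# Pearson chi-square for the pooled 2x2 table (no continuity correction)
chi2 = n * (a * d - b * c) ** 2 / (n_new * n_ctl * (a + c) * (b + d))
p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df

print(within_diffs)
print(f"pooled rates: {a / n_new:.2f} vs {c / n_ctl:.2f}, chi2 = {chi2:.3f}, P = {p_value:.3f}")
```

Every within-study difference is zero, yet the pooled comparison looks significant because the study with the higher event rate contributes three times as many patients to the new treatment as to control.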

- In this example, the risk difference is not collapsible over the studies (i.e., we can’t ignore “study”).
- Randomization (treatment assignment) is not independent of study in the two-way marginal table of treatment by study.

- When both randomization ratio and risk difference are the same across studies, risk difference is collapsible over studies.
- In this case, the proportion of events for each treatment is a weighted average of the proportions in individual studies with weights proportional to the study sizes.

- If the two treatments have the same effect in all studies (null hypothesis) and in addition, the randomization ratio is the same, then risk difference, risk ratio, and odds ratio are all collapsible across studies.
- In the above case, the risk difference is 0 and the relative risk and odds ratio are 1.
- Otherwise, collapsibility depends on the chosen measure of association (risk difference, risk ratio, odds ratio) - Greenland, 1998, Encyclopedia of Biostatistics.

1:1 randomization, equal risk difference in two studies

- Meta-analysis procedures are frequently used to combine efficacy results.
- One should use meta-analysis (stratified analysis) when summarizing safety data from different studies, especially when studies have different patient populations and/or different randomization ratios.
- If there is no a priori information suggesting different risk differences for different studies, inverse variance weighting would be a good choice.
- Should always consider stratified analysis when covariates are highly correlated with the response.

- Factor defining strata is prognostic of response.
- Allowing comparison within more homogeneous groups.

- Factor defining strata is predictive of treatment effect.
- Issue of interaction
- Evaluating treatment effect with subgroups
- Overall treatment effect might be less meaningful if the interaction between treatment and factor is substantial

- If we employ stratified randomization, the convention is to include the stratifying factor in the analysis (CPMP/EWP/2863/99 on adjustment for baseline covariates).
- When there are >=50 patients in each treatment group, Grizzle found that there was little advantage to using stratified randomization with two strata when the strata are roughly equally represented (Grizzle, Controlled Clinical Trials, 1982).
- The incremental benefit of stratified randomization beyond that due to the stratified analysis is minimal (Permutt, DIJ 2007).

- The above is due to the fact that, for a reasonable sample size, the chance that the randomization will produce the type of imbalance that will substantially affect the inference is low.
- If a stratum is small, stratified randomization could reduce the chance of imbalance.
- If we are forced to treat un-stratified analysis as the primary analysis, stratified randomization could generally give us results close to those from an adjusted analysis.
- Stratified allocation is used to ensure adequate (or even greater) representation of a particular type of patients in the study.

- 100 subjects will be randomized to one of two treatments.
- There are 50 men and 50 women. Gender is a prognostic factor and could be used as a stratifying factor for randomization and/or analysis, resulting in 4 options: stratified randomization and analysis (R&A), stratified randomization only (R Only), stratified analysis only (A Only), Neither.
- Assume the standard deviation is 10 and a treatment effect that will result in 80% power with 25 subjects per group per gender under the R&A option (i.e., Δ = 5.6).
- Assuming no treatment by gender interaction, but gender effect varies between 0 and 20.
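
The stated effect of 5.6 can be checked against the 80% power target with a normal approximation. This sketch assumes a known standard deviation (z-test rather than t-test), a two-sided 5% level, and that the stratified analysis fully removes the gender effect, so that 25 per group per gender behaves like 50 per treatment arm. The variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

sd = 10.0         # within-stratum standard deviation
delta = 5.6       # treatment effect from the slide
n_per_arm = 50    # 25 men + 25 women per treatment

std_normal = NormalDist()
se_diff = sd * sqrt(2 / n_per_arm)     # standard error of the difference in means
z_crit = std_normal.inv_cdf(0.975)     # two-sided 5% critical value
# Power of the upper rejection region; the lower tail is negligible here
power = std_normal.cdf(delta / se_diff - z_crit)
print(f"SE = {se_diff:.2f}, power = {power:.3f}")
```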

- Under “A Only” (stratified analysis without stratified randomization), the power was calculated for each possible (treatment, gender) allocation combination. The power was then averaged using probability under the hypergeometric distribution as the weight.
- Under option “R Only” (stratified randomization without stratified analysis), Type I error could be lower than the nominal level (two-sided 5%) because the reduction in the variance of the estimated treatment effect due to stratified randomization is not properly accounted for in the analysis. (See the original paper.)

- How does the treatment perform in patients with mild disease?
- Do patients with mild/moderate disease respond to the treatment similarly to patients with severe disease?
- This is typically phrased as an interaction between treatment and disease severity at baseline

- If heterogeneous effect (interaction) exists, is it qualitative or quantitative?

- Multiplicity leading to inflated false positive rate
- Lack of statistical power leading to inflated false negative rate
- Treatment groups not comparable because randomization was not done within the subgroups
- Appropriate reporting/interpretation to ensure scientifically defensible and balanced conclusion
We will focus on the first two issues here.

False Positive

- Multiplicity
- With multiple subgroup analyses, the probability of a false positive finding is substantial.
- With 10 independent tests (α=0.05), chance of at least one false positive > 40%.
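
The 40% figure follows from the complement rule for independent tests, as the short calculation below shows. The variable names are hypothetical.

```python
alpha = 0.05
n_tests = 10

# P(at least one false positive) = 1 - P(no false positive in any test)
p_any_false_positive = 1 - (1 - alpha) ** n_tests
print(f"{p_any_false_positive:.4f}")
```

Subgroup tests on the same trial are rarely fully independent, so this figure is a rough guide rather than an exact rate.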

Lagakos (2006) NEJM 354;16

Typical Result

- Hypothetical study
- 4000 patients in 20 countries (200 patients each) with a control arm risk of 20% and an experimental arm risk of 15%
- Homogeneous absolute risk reduction of 5% in all countries.
Marschner (DIA Annual Meeting)

- In 10,000 simulations of similar studies, the largest and smallest treatment effect among the 20 countries was calculated
- On average, the largest treatment effect among the 20 countries was a 15% absolute risk reduction on the experimental therapy
- On average, the smallest treatment effect among the 20 countries was a 5% absolute risk increase on the experimental therapy

- Purely by chance, the observed experimental treatment effect in different countries can be expected to range from extremely beneficial to apparently harmful.
Marschner (DIA Annual Meeting)
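
A small Monte Carlo sketch of this hypothetical study, assuming each country's 200 patients are split 100 per arm (the split is not stated in the slides) and using 1,000 replicates instead of 10,000 to keep the pure-Python run short. The function name, the seed and the defaults are hypothetical.

```python
import random


def simulate_extremes(n_sims=1000, n_countries=20, n_per_arm=100,
                      p_control=0.20, p_exp=0.15, seed=2009):
    """Average largest and smallest country-level absolute risk reduction."""
    rng = random.Random(seed)
    sum_max = sum_min = 0.0
    for _ in range(n_sims):
        reductions = []
        for _ in range(n_countries):
            control_events = sum(rng.random() < p_control for _ in range(n_per_arm))
            exp_events = sum(rng.random() < p_exp for _ in range(n_per_arm))
            # Absolute risk reduction = control rate - experimental rate
            reductions.append((control_events - exp_events) / n_per_arm)
        sum_max += max(reductions)
        sum_min += min(reductions)
    return sum_max / n_sims, sum_min / n_sims


avg_max, avg_min = simulate_extremes()
print(f"average largest reduction: {avg_max:.3f}, average smallest: {avg_min:.3f}")
```

Under these assumptions the averages should land near the values quoted above: roughly a 15% reduction for the best-looking country and roughly a 5% increase for the worst-looking one, even though the true reduction is 5% everywhere.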

Assuming two groups and a continuous endpoint:

- Factors increasing the probability
- Substantial imbalance between treatment groups
- Substantial differences in the subgroup size
- A large number of subgroups

- Factors decreasing the probability
- Balanced treatments and subgroup size
- A large treatment effect size
- A large sample size

- 2-sided α = 0.05
- 1:1 ratio with perfect balance between treatments
- Various scenarios for subgroup size

Li, Chuang-Stein, Hoseyni, DIJ (2007), 41:47-56.

- Each baseline covariate defines 3 subgroups with equal proportions (2 or 5 covariates).
- Probabilities based on simulations (1000 replicates).
- Unconditional on the overall result.

- Each baseline covariate defines 3 subgroups with equal proportions (2 or 5 covariates).
- Probabilities based on simulations (1000 replicates).
- Conditional on a statistically significant overall result.

- The only pivotal trial to assess the efficacy and safety of metoprolol (Toprol XL) as an adjunctive therapy to optimal standard therapy for patients with congestive heart failure.
- There were 3991 patients from several hundred sites in US and 13 European countries.
- The study has two primary endpoints, total mortality and a composite endpoint.
- 27% of the patients (539 on placebo and 532 on metoprolol) were from the US.

[Figure: plot with axis labels “Favors Toprol-XL” and “Favors Placebo”.]

Desire to control (minimize) the probability of observing a negative treatment effect in at least one region when the treatment effect is positive and uniform across all regions in a multi-regional (global) trial.

Bob O’Neill: PhRMA/FDA Workshop on Multi-Regional Trials 2007.

Robert Califf: PhRMA/FDA Workshop 2007

- The Biostatistics and Data Management Group convened a Multi-Regional Clinical Trials (MRCT) Key Issues team after the workshop.
- Bruce Binkowitz (Merck), stat co-chair of the MRCT working group, will present the group’s progress at the Harvard/Schering Plough workshop on May 28-29. The theme of the workshop is “Global Trials: Challenges and Opportunities”.

- PhRMA also has a SGD Committee.
- Its focus is Regulatory, seeking to enable a regulatory framework for allowing global development of therapies that could
- result in simultaneous global submissions with one single global data-set
- expedite global patient access to these products

- Current focus is China, Korea, Taiwan and Japan.

- The 1st China-Japan-Korea Ministerial Meeting on Health was held in Korea in April 2007. The 2nd one took place in Nov 2008 in Beijing.
- They declared in the “Joint Statement of the First Tripartite Health Ministers Meeting (THMM)” to jointly promote cooperation in areas of Clinical Researches, ...
- Cooperation in an investigation on ethnic factors
- MHLW set up study group to investigate differences in PK/PD and safety among Asian populations
- The 1st report on PK difference is targeted for 2Q 2009
- The Goal : Could Asia be regarded as “one population”?

- A continuous endpoint that follows a normal distribution. Large values are desirable.
- Treatment effect within each region is estimated by the difference in the observed means (or observed mean changes from baseline).
- Effect size (Δ/σ) is uniform across regions.
- The one-sided significance level for the primary analysis on the overall treatment effect is 2.5%. Power to detect (Δ/σ) is 1 - β.
- For simplicity, we will work with 3 regions with 1:1 allocation to 2 treatments.

The number N is determined to provide an 80% or 90% power for the primary analysis at the one-sided 2.5% level.
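
Under the stated assumptions (normal endpoint, 1:1 allocation, one-sided 2.5% level), the per-group sample size can be sketched with the usual normal-approximation formula N = 2 (z_{1-α} + z_{1-β})² / (Δ/σ)². This is a standard textbook formula and is assumed here rather than taken from the slides; the function name is hypothetical, and the effect sizes 0.125 and 0.250 are the ones that appear in the figure legends of this section.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, power, alpha=0.025):
    """Per-group N for a one-sided z-test comparing two means (1:1 allocation)."""
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha) + std_normal.inv_cdf(power)
    return ceil(2 * z_sum ** 2 / effect_size ** 2)


for effect_size in (0.125, 0.250):
    for power in (0.80, 0.90):
        print(f"effect size {effect_size}, power {power:.0%}: "
              f"N = {n_per_group(effect_size, power)} per group")
```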

Sample size per group: N, divided among three regions:

- Region 1 [smallest]: proportion p1, estimated treatment effect D1
- Region 2 [2nd smallest]: proportion p2, estimated treatment effect D2
- Region 3 [largest]: proportion p3, estimated treatment effect D3

The proportions satisfy the constraints p1 ≤ p2 ≤ p3 and p1 + p2 + p3 = 1.

We want a high probability (e.g., 80% or 90%) that the point estimates for the treatment effect (new treatment - placebo) in all regions are positive.

PCS = Probability that three regions show consistent results.
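
A sketch of the unconditional Pcs, assuming the regional estimates are independent and normal, the effect size is uniform, and N is chosen exactly to give the target power (ignoring rounding up). Under those assumptions the probability that region i shows a positive estimate is Φ((z_{1-α} + z_{1-β}) √pi), so Pcs does not depend on the effect size itself. The function name and the example splits are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def pcs_unconditional(proportions, power, alpha=0.025):
    """P(all regional point estimates > 0) for regions with the given sample fractions."""
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha) + std_normal.inv_cdf(power)
    prob = 1.0
    for p in proportions:
        # Regional estimate has mean delta and variance 2*sigma^2 / (p * N)
        prob *= std_normal.cdf(z_sum * sqrt(p))
    return prob


equal_split = [1 / 3, 1 / 3, 1 / 3]
uneven_split = [0.1, 0.1, 0.8]   # two small regions and one large region
pcs_equal_80 = pcs_unconditional(equal_split, 0.80)
pcs_equal_90 = pcs_unconditional(equal_split, 0.90)
pcs_uneven_90 = pcs_unconditional(uneven_split, 0.90)
print(f"{pcs_equal_80:.3f} {pcs_equal_90:.3f} {pcs_uneven_90:.3f}")
```

With an equal three-way split this gives a Pcs of about 0.85 at 80% power and about 0.91 at 90% power; splits with small regions give lower values.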

[Figure: Pcs at 80% and 90% power, with reference lines at 0.8 and 0.9. At 80% power, Pcs never reaches 90%. Marked values: 0.151, 0.213, 0.277.]

Worst case with two small regions and a large one.

- In practice, inference concerning regional results (as a secondary analysis) is relevant only if the overall treatment effect in the confirmatory trial is statistically significant.
- The above calls for looking at Pcs conditional on first concluding a significant overall treatment effect at the one-sided 2.5% level.

[Figure: conditional and unconditional Pcs at 80% and 90% power, with reference lines at 0.8 and 0.9. Treatment effect = 0.250, σ = 1. Legend: ○ for Δ/σ = 0.125, + for Δ/σ = 0.250.]

- Basic Principles on Global Clinical Trials (http://www.pmda.go.jp/english/publications/index.html)
- Method 1
  - Look at D_Japan / D_all. Want Pr(D_Japan / D_all > 0.5 | common D) > 80%.
- Method 2
  - The “consistency” approach.
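
A sketch of the Method 1 probability under simplifying assumptions that are not spelled out in the slides: a normal endpoint with a uniform effect, a fraction p of each arm enrolled in Japan, an overall estimate D_all = p D_Japan + (1 - p) D_rest with independent D_Japan and D_rest, and N chosen to give exactly the target power. The event D_Japan / D_all > 0.5 is approximated by D_Japan - 0.5 D_all > 0, which ignores the small chance that D_all is not positive. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prob_method1(p_japan, power, alpha=0.025, threshold=0.5):
    """Approximate Pr(D_Japan / D_all > threshold) under a common treatment effect."""
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha) + std_normal.inv_cdf(power)
    t, p = threshold, p_japan
    # D_Japan - t * D_all = (1 - t*p) * D_Japan - t * (1 - p) * D_rest
    # mean = (1 - t) * delta; variance = (2 * sigma^2 / N) * var_factor
    var_factor = (1 - t * p) ** 2 / p + t ** 2 * (1 - p)
    return std_normal.cdf((1 - t) * z_sum / sqrt(var_factor))


for fraction in (0.10, 0.20, 0.30):
    print(f"Japan fraction {fraction:.0%}: "
          f"probability = {prob_method1(fraction, power=0.90):.3f}")
```

Under these assumptions, with 90% overall power a Japanese fraction of about 20% falls just short of the 80% target, while 30% exceeds it.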

- Released for public comments in January 2009.
- Questions the relevance of some clinical data from emerging regions to support marketing applications in EU due to
- Intrinsic factors including genetic and nature of disease
- Extrinsic factors including medical practice, disease definition and study population

- Includes 5 product areas where extrapolation of study results to European population had been found to be difficult.
- Encourages an in-depth prospective evaluation of factors if the trial is to provide evidence to support EU filing. It is possible that additional clinical trials within EU might be necessary if extrapolation is judged to be problematic.

- When there is no reason to suspect the risk difference to differ across strata, IV weighting produces the minimum variance and asymptotically unbiased estimate. However, when the proportions are in the range of (0.25, 0.75), CMH estimates are generally quite close to the IV estimates.
- When risk difference is suspected to differ across strata, CMH tends to produce more sensible estimates.
- It is critically important to know the studies and where the data came from. Naïve pooling could produce very misleading results and should be avoided.
- Stratification often leads to subgroup analysis. We need to consider the role subgroup analysis will play in reporting and interpreting trial results.

- Califf RM. (2007). Multiregional clinical trials. Presented at the PhRMA-FDA workshop, Oct 29-30, Washington DC.
- Dmitrienko A, Molenberghs G, Chuang-Stein C, and Offen W. (2005) Analysis of Clinical Trials Using SAS: A Practical Guide. Cary, NC: SAS Institute Inc.
- EMEA Points to consider on adjustment for baseline covariates. CPMP/EWP/2863/99 (Nov 2003, coming into operation).
- EMEA Reflection paper on the extrapolation of results from clinical studies conducted outside Europe to the EU-population. CHMP/EWP/692702/2008. Released for public comments, January 2009.
- Greenland S. (1998). Collapsibility. Encyclopedia of Biostatistics, Wiley. 786-788.
- Grizzle JE. (1982). A note on stratifying versus complete random assignment in clinical trials. Controlled Clinical Trials, 3:365-368.
- Kawai N, Chuang-Stein C, Komiyama O, Ii Y. (2008). An approach to rationalize partitioning sample size into individual regions in a multi-regional trial. Drug Information Journal, 42(2):139-147.

- Li Z, Chuang-Stein C, Hoseyni C. (2007). The probability of observing negative subgroup results when the treatment effect is positive and homogeneous across all subgroups. Drug Information Journal, 41(1):47-56.
- Ministry of Health, Labour and Welfare. (2007). Basic Principles on Global Clinical Trials. Available at: http://www.pmda.go.jp/operations/notice/2007/file/0928010-e.pdf
- O’Neill R. (2007). Multi-regional Clinical Trials: Why be concerned? A Regulatory perspective on Issues. Presented at the PhRMA-FDA workshop, Oct 29-30, Washington DC.
- Permutt T. (2007). A note on stratification in clinical trials. Drug Information Journal, 41:719-722.