Estimating Causal Effects from Large Data Sets Using Propensity Scores

Estimating Causal Effects from Large Data Sets Using Propensity Scores Hal V. Barron, MD TICR May 2007

Estimating Causal Effects from Large Data Sets Using Propensity Scores • The aim of many analyses of large databases is to draw causal inferences about the effects of actions, treatments, or interventions • A complication of using large databases to achieve such aims is that their data are almost always observational rather than experimental

Estimating Causal Effects from Large Data Sets Using Propensity Scores Is hospital A better than hospital B in treating metastatic colorectal cancer?

What assumptions are made in the modeling of age-adjusted survival?

RR/OR death

Ideally we would like to compare patients who are similar with respect to all covariates which are observed to influence the outcome

Estimating Causal Effects from Large Data Sets Using Propensity Scores • Standard methods of analysis using available statistical software (such as linear or logistic regression) can be deceptive for these objectives because they provide no warnings about their propriety • Propensity score methods may be a more reliable tool for addressing such objectives because the assumptions needed to make their answers appropriate are more assessable and transparent to the investigator

Propensity Scores • Propensity score technology essentially reduces the entire collection of background characteristics to a single composite characteristic that appropriately summarizes the collection • Thus, the PS is a device for constructing matched pairs or matched sets or strata that balance numerous observed covariates

Propensity Scores • This reduction from many characteristics to one composite characteristic allows the straightforward assessment of whether the treatment and control groups overlap enough with respect to background characteristics to allow a sensible estimation of treatment versus control effects from the data set • Moreover, when such overlap is present, the propensity score approach allows a straightforward estimation of treatment versus control effects that reflects adjustment for differences in all observed background characteristics
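The overlap assessment described above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function names and the scores are hypothetical, and the "region of common support" is taken here simply as the range of scores covered by both groups.

```python
def common_support(ps_treated, ps_control):
    """Return (low, high) for the score range covered by both groups, or None."""
    low = max(min(ps_treated), min(ps_control))
    high = min(max(ps_treated), max(ps_control))
    if low > high:
        return None  # the two groups do not overlap at all
    return low, high


def share_in_support(scores, bounds):
    """Fraction of subjects whose score falls inside the overlap region."""
    low, high = bounds
    return sum(1 for s in scores if low <= s <= high) / len(scores)


# Hypothetical propensity scores, for illustration only.
treated = [0.35, 0.50, 0.62, 0.80, 0.91]
control = [0.05, 0.12, 0.30, 0.45, 0.66]

bounds = common_support(treated, control)
print("overlap region:", bounds)
print("treated inside:", share_in_support(treated, bounds))
print("control inside:", share_in_support(control, bounds))
```

A small share of subjects inside the overlap region is the warning sign: comparisons outside it rest on extrapolation rather than on comparable patients.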

Questions?? • Two subjects have the same propensity score: what does this mean? • Do the two subjects have the same age, gender, etc.? • Do their differences help predict which subject is more likely to receive the treatment?

True or False? • If we pair or group subjects with the same PS, then treated and control subjects in these groups will have similar patterns or distributions of each covariate

Background • The PS approach complements model-based procedures and is not a substitute for them (i.e., it is often used in conjunction with regression or log-linear models)

Sub-classification • Table 1. Comparison of Mortality Rates for Three Smoking Groups in Three Databases* Annals of Internal Medicine, Part 2, 15 October 1997. 127:757-763

Sub-classification Comparison of Mortality Rates for Three Smoking Groups in Three Databases* Annals of Internal Medicine, Part 2, 15 October 1997. 127:757-763.

Sub-classification • A particular statistical model, such as a linear regression (or a logistic regression model, or in other settings, a hazard model) could be used to adjust for age, but sub-classification has three distinct advantages

Sub-classification vs MVA • First, if the treatment or exposure groups do not adequately overlap on the confounding covariate age, the investigator will see it immediately and be warned. In contrast, nothing in the standard output of any regression modeling software will display this critical fact

Sub-classification vs MVA • Second: Sub-classification does not rely on any particular functional form, such as linearity, for the relation between the outcome (death) and the covariate (age) within each treatment group, whereas models do

Sub-classification vs MVA • Third: Small differences in many covariates can accumulate into a substantial overall difference
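Sub-classification on a single covariate such as age can be sketched as follows. The counts are invented for illustration (they are not the figures from the Annals table), and the helper names are assumptions. Each group's death rate is computed within age bands and then averaged using the same weights for both groups; an empty cell raises an error, which mirrors the first advantage listed above.

```python
def crude_rate(strata, group):
    """Overall death rate for one group, ignoring age."""
    deaths = sum(bands[group][0] for bands in strata.values())
    persons = sum(bands[group][1] for bands in strata.values())
    return deaths / persons


def subclass_weights(strata):
    """Weight each age band by its share of the combined population."""
    totals = {band: sum(n for _, n in groups.values())
              for band, groups in strata.items()}
    grand = sum(totals.values())
    return {band: t / grand for band, t in totals.items()}


def adjusted_rate(strata, group, weights):
    """Weighted average of the band-specific death rates for one group."""
    total = 0.0
    for band, w in weights.items():
        deaths, n = strata[band].get(group, (0, 0))
        if n == 0:
            raise ValueError(f"no {group} subjects in age band {band}")
        total += w * deaths / n
    return total


# Hypothetical (deaths, persons) per age band; smokers are older on average.
strata = {
    "<50":   {"nonsmoker": (10, 1000), "smoker": (6, 500)},
    "50-64": {"nonsmoker": (20, 1000), "smoker": (24, 1000)},
    "65+":   {"nonsmoker": (20, 500),  "smoker": (96, 2000)},
}

weights = subclass_weights(strata)
crude_ratio = crude_rate(strata, "smoker") / crude_rate(strata, "nonsmoker")
adjusted_ratio = (adjusted_rate(strata, "smoker", weights)
                  / adjusted_rate(strata, "nonsmoker", weights))
print("crude rate ratio:", round(crude_ratio, 2))
print("age-adjusted rate ratio:", round(adjusted_ratio, 2))
```

In this made-up example part of the crude difference comes from the smokers being older, and no functional form for the age effect had to be assumed.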

Sub-classification • If standard models can be so dangerous, why are they commonly used for such adjustments when large databases are examined for estimates of causal effects?

Sub-classification • Which is easier??? • How do you deal with multiple confounders??

Propensity Scores • Propensity score techniques are very much like sub-classification techniques with more than one covariate

Is there a benefit to early angiography in patients with ST-segment depression myocardial infarction? An observational study • Background: It remains unclear whether an aggressive treatment approach with very early (<6 hours) angiography and revascularization improves outcome over an early conservative approach. We compared the short-term outcome of patients who received very early (<6 hours) angiography with patients who received early conservative therapy for ST-segment depression MI • Methods: Patients seen within 12 hours with ST-segment depression on the initial electrocardiogram (ECG) were identified from the National Registry of Myocardial Infarction 2 (NRMI) database, which collected information from 1994 to 1998. Those who received very early (<6 hours) angiography were compared with those who received early conservative therapy. The short-term outcomes, including major bleeding episodes, cerebral vascular events, recurrent ischemia and angina, MI, and death, were compared on the basis of the initial therapy received (Am Heart J 2002;143:488-96)

Results

Hospital outcome in the very early angiography group versus the early conservative therapy group

Clinical factors associated with increased hospital mortality Very early angiography has an OR of 0.76 with 95% CIs 0.61-0.95

Adjustment With Propensity Score • Because of the substantial differences in baseline characteristics between the treatment groups, we used the propensity score method to attempt to find comparable patients treated with each strategy • In the first step, we identified factors that predicted receiving very early angiography. These were age, male sex, white race, history of MI, history of angina, history of CHF, previous PTCA, previous aortocoronary bypass surgery, diabetes mellitus, current smoker, Killip class I, pulse >100 beats/min, systolic blood pressure <=100 mm Hg, admission diagnosis of MI, chest pain at presentation, and transfer from an outside hospital • A stepwise multivariate logistic regression analysis was performed to predict receiving early angiography • Thus, the propensity score represents the probability that a patient will receive very early angiography. A higher score indicates a higher probability of receiving very early angiography. Similarly, patients with the same propensity score have the same probability of receiving very early angiography
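As a rough sketch of this first step, the code below fits a logistic regression of treatment on covariates and returns each patient's predicted probability of treatment. It is not the study's model: it uses plain gradient ascent with no stepwise selection, only two simulated covariates (a standardized age and a male indicator), and hypothetical function names. Note that the outcome never appears in it.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_propensity_model(X, treated, lr=0.5, epochs=500):
    """Logistic regression of treatment on covariates by gradient ascent.

    Returns [intercept, coef_1, ..., coef_k]. Only covariates and the
    treatment indicator are used; the outcome plays no role.
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for row, t in zip(X, treated):
            p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
            err = t - p
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(X, beta):
    """Predicted probability of receiving the treatment for each subject."""
    return [sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
            for row in X]


# Simulated data (an assumption for illustration): older patients are less
# likely, and men more likely, to receive the treatment.
random.seed(0)
X, treated = [], []
for _ in range(400):
    age_z = random.gauss(0, 1)
    male = 1 if random.random() < 0.6 else 0
    p_true = sigmoid(-0.5 - 1.0 * age_z + 0.5 * male)
    X.append([age_z, male])
    treated.append(1 if random.random() < p_true else 0)

beta = fit_propensity_model(X, treated)
scores = propensity_scores(X, beta)
print("coefficients:", [round(b, 2) for b in beta])
print("score range:", round(min(scores), 3), "to", round(max(scores), 3))
```

In practice a statistics package would be used for the fit; the point here is only that the score is a function of the covariates alone.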

Adjustment With Propensity Score • The predicted probability of receiving very early angiography (the propensity score) was calculated for each patient • Patients receiving very early angiography (cases) were matched to patients receiving early conservative therapy (controls) on propensity score using the nearest available pair matching method. The 4-digit match resulted in 58% of the cases being matched to a control, yielding 1405 patient matches with similar baseline characteristics • After the matched-pair analysis, the original multivariate logistic regression model to predict hospital death was rerun with the propensity score forced in. ORs and 95% CIs were calculated
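Nearest available pair matching can be sketched as a greedy loop: each case takes the closest control that is still unused. The version below is an illustration under stated assumptions, not the study's procedure. In particular, the `caliper` tolerance stands in for the digit-based matching rule, and the scores are made up.

```python
def greedy_match(ps_cases, ps_controls, caliper=0.05):
    """Greedy 1:1 nearest-available matching without replacement.

    Returns a list of (case_index, control_index) pairs. A case is left
    unmatched if its closest remaining control is farther than `caliper`.
    """
    available = set(range(len(ps_controls)))
    pairs = []
    for i, p in enumerate(ps_cases):
        if not available:
            break
        # Closest unused control; ties are broken by the lower index.
        j = min(available, key=lambda c: (abs(ps_controls[c] - p), c))
        if abs(ps_controls[j] - p) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


# Hypothetical propensity scores, for illustration only.
cases = [0.81, 0.42, 0.40, 0.25]
controls = [0.10, 0.415, 0.44, 0.60, 0.79]

pairs = greedy_match(cases, controls)
print("matched pairs:", pairs)
print("share of cases matched:", len(pairs) / len(cases))
```

Greedy matching depends on the order in which cases are processed, and unmatched cases are dropped, so the matched sample describes only patients who have a comparable counterpart.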

Results Comparing patients matched on propensity score showed mortality was similar in both treatment groups (5.6% vs 5.4%, P = .87), with no significant in-hospital mortality benefit of very early angiography in an MVA (OR = 0.89; 95% CI 0.71-1.13)

Summary: Propensity Scores • The basic idea of propensity score methods is to replace the collection of confounding covariates in an observational study with one function of these covariates, called the propensity score (that is, the propensity to receive treatment 1 rather than treatment 2). This score is then used just as if it were the only confounding covariate • Thus, the collection of predictors is collapsed into a single predictor • The propensity score is found by predicting treatment group membership (that is, the indicator variable for being in treatment group 1 as opposed to treatment group 2) from the confounding covariates, for example, by a logistic regression or discriminant analysis • In this prediction of treatment group membership, it is critically important that the outcome variable (for example, death) play no role; the prediction of treatment group must involve only the covariates

Summary: Propensity Scores • Each person in the database then has an estimated propensity score, which is the estimated probability (as determined by that person's covariate values) of being exposed to treatment 1 rather than treatment 2. This propensity score is then the single summarized confounding covariate to be used for sub-classification

Summary: Propensity Scores • If two persons, one exposed to treatment 1 and the other exposed to treatment 2, had the same value of the propensity score, these two persons would then have the same predicted probability of being assigned to treatment 1 or treatment 2. Thus, as far as we can tell from the values of the confounding covariates, a coin was tossed to decide who received treatment 1 and who received treatment 2. Now suppose that we have a collection of persons receiving treatment 1 and a collection of persons receiving treatment 2 and that the distributions of the propensity scores are the same in both groups (as is approximately true within each propensity subclass). In subclass 1, the persons who received treatment 1 were essentially chosen randomly from the pool of all persons in subclass 1, and analogously for each subclass • As a result, within each subclass, the multivariate distribution of the covariates used to estimate the propensity score differs only randomly between the two treatment groups
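Whether a covariate really "differs only randomly" within a subclass can be checked directly. One common diagnostic, used here as an assumption because the slides do not name a specific one, is the standardized difference in means between the two treatment groups. The ages below are invented for illustration.

```python
from statistics import mean, variance


def standardized_difference(x_treated, x_control):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = ((variance(x_treated) + variance(x_control)) / 2) ** 0.5
    return (mean(x_treated) - mean(x_control)) / pooled_sd


# Hypothetical ages in the whole sample, before any sub-classification.
age_treated_all = [52, 55, 58, 61, 64, 70]
age_control_all = [60, 66, 69, 72, 75, 78]

# Hypothetical ages of the subjects falling in one propensity subclass.
age_treated_sub = [58, 61, 64]
age_control_sub = [59, 62, 65]

overall = standardized_difference(age_treated_all, age_control_all)
within = standardized_difference(age_treated_sub, age_control_sub)
print("overall standardized difference:", round(overall, 2))
print("within-subclass standardized difference:", round(within, 2))
```

A within-subclass difference that stays large for some covariate suggests that the propensity model needs to be revised before any outcomes are examined.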

Limitations of Propensity Scores • In observational studies, our confidence in causal conclusions is limited • Propensity score methods can only adjust for observed confounding covariates and not for unobserved ones • Propensity score methods work better in larger samples • A final possible limitation of propensity score methods is that a covariate related to treatment assignment but not to outcome is handled the same as a covariate with the same relation to treatment assignment but strongly related to outcome (potential for over-correcting or including irrelevant covariates)

Conclusion • Large databases have tremendous potential for addressing (although not necessarily settling) important medical questions, including important causal questions involving issues of policy • Addressing these causal questions using standard statistical models can be fraught with pitfalls because of their possible reliance on unwarranted assumptions and extrapolations without any warning • Propensity score methods are more reliable; they generalize the straightforward technique of sub-classification with one confounding covariate to allow simultaneous adjustment for many covariates • One critical advantage of propensity score methods is that they can warn the investigator that, because of inadequately overlapping covariate distributions, a particular database cannot address the causal question at hand without relying on untrustworthy model-dependent extrapolation or restricting attention to the type of person adequately represented in both treatment groups

Clinical Implications A group of Biostatisticians and a group of clinicians were riding together on a train to joint scientific meetings. All the clinicians had tickets, but the Biostatisticians only had one ticket between them. Inquisitive by nature, the clinicians asked the Biostatisticians how they were going to get away with such a small sample of tickets when the conductor came through. The Biostatisticians said, "Easy. We have methods for dealing with that." Later, when the conductor came to punch tickets, all the Biostatisticians slipped quietly into the bathroom. When the conductor knocked on the door, the head Biostatistician slipped their one ticket under the door, thoroughly fooling the layman conductor. After the joint meetings were over, the Biostatisticians and the clinicians again found themselves on the same train. Always quick to catch on, the clinicians had purchased one ticket between them. The Biostatisticians (always on the cutting edge) had purchased NO tickets for the trip home. Confused, the clinicians asked the Biostatisticians, "We understand how your methods worked when you had one ticket, but how can you possibly get away with no tickets?" "Easy," replied the Biostatisticians smugly, "we have different methods for dealing with that situation." Later, when the conductor was in the next car, all the clinicians trotted off to the bathroom with their one ticket and all the Biostatisticians packed into the other bathroom. Shortly, the head Biostatistician crept over to where the clinicians were hiding and knocked authoritatively on the door. As they had been instructed, the clinicians slipped their one ticket under the door. The head Biostatistician took the clinicians' one and only ticket and returned triumphantly to the Biostatistician group. Of course, the clinicians were subsequently discovered and publicly humiliated. Moral: Beware of using statistical techniques that you don't fully understand; it can only lead to trouble

Propensity Sub-classification • The U.S. General Accounting Office used propensity score methods on the SEER database to compare the two treatments for breast cancer • First, approximately 30 potential confounding covariates and interactions were identified • A logistic regression was then used to predict treatment (mastectomy compared with conservation therapy) from these confounding covariates on the basis of data from the 5326 women • Each woman was then assigned an estimated propensity score, which was her probability, on the basis of her covariate values, of receiving breast conservation therapy rather than mastectomy • The group was then divided into five subclasses of approximately equal size on the basis of the women's individual propensity scores • Before examining any outcomes (5-year survival results), the subclasses were checked for balance with respect to the covariates • If important within-subclass differences between treatment groups had been found on some covariates, then the propensity score prediction model would need to be reformulated
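The subclass step can be sketched as follows: rank subjects by propensity score, cut the ranking into five groups of roughly equal size, and compare outcomes within each group. The data are a toy example of ten subjects, and the size-weighted averaging of the within-subclass differences is an assumption made for illustration rather than the analysis reported for the SEER data.

```python
def assign_subclasses(scores, n_classes=5):
    """Label each subject 0..n_classes-1 by the rank of its propensity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = rank * n_classes // len(scores)
    return labels


def stratified_difference(labels, treated, outcome):
    """Size-weighted average of within-subclass rate differences
    (treated minus control)."""
    total = 0.0
    for s in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == s]
        t = [outcome[i] for i in idx if treated[i] == 1]
        c = [outcome[i] for i in idx if treated[i] == 0]
        if not t or not c:
            raise ValueError(f"subclass {s} lacks one treatment group")
        total += len(idx) / len(labels) * (sum(t) / len(t) - sum(c) / len(c))
    return total


# Toy data: scores, treatment indicator, and 5-year survival (1 = alive).
scores = [0.10, 0.15, 0.30, 0.35, 0.50, 0.55, 0.70, 0.75, 0.90, 0.95]
treated = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
survived = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]

labels = assign_subclasses(scores)
effect = stratified_difference(labels, treated, survived)
print("subclass labels:", labels)
print("stratified difference in survival:", round(effect, 2))
```

The error raised for a subclass with only one treatment group is deliberate: such a subclass contains no comparable patients, so no within-subclass comparison is possible there.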

Propensity Sub-classification Table 3. Estimated 5-Year Survival Rates for Node-Negative Patients in the SEER Database within Each of Five Propensity Score Subclasses* Annals of Internal Medicine, Part 2, 15 October 1997. 127:757-763
