
Propensity Score Matching: A technique for Program Evaluation



Presentation Transcript


  1. Propensity Score Matching: A technique for Program Evaluation Aradhna Aggarwal Department of Business Economics, South Campus, University of Delhi Sambodhi International Conference, 29 April 2011

  2. Outline • Overview: Why Propensity Score Matching? • How to use PSM: Choices to be made • Example: Impact evaluation of Yeshasvini health care programme

  3. The best way to evaluate • Randomised experiment • Not always possible • Quasi-experimental designs: • Regression • Matching (direct, PSM, DID)

  4. Regression • Controls for observed differences between participants and non-participants • Cannot address the problem of unobservables • Relies on a parametric relationship • Demanding with respect to the modelling assumptions

  5. Matching • Theory of counterfactuals • The fact is that some people receive treatment. • The counterfactual question is: “What would have happened to those who, in fact, did receive treatment, if they had not received treatment (or the converse)?” • Counterfactuals cannot be seen or heard; we can only create an estimate of them. • Matching on covariates is one technique that creates these counterfactuals and estimates the difference

  6. Creating a counterfactual • means that the outcomes of members are compared with the potential outcomes the same households would have experienced had they not been members of the programme, estimated from comparable non-member households. More specifically, • ATT = E(Y1 | D=1) − E(Y0 | D=1)

  7. Approximating counterfactuals: direct matching • If the number of observable pre-treatment characteristics is large, it is difficult to determine along which dimensions to match units or which weighting scheme to adopt (Dehejia and Wahba, 2002, p. 1). • Matching on the individual characteristics that distinguish the treatment and comparison groups (to try to make them more alike)

  8. Propensity Score Matching • Matching is performed conditioning on the propensity score of X (the probability of participating in the programme conditional on X) rather than on X itself. • The crucial difference between PSM and conventional matching: subjects are matched on a single score rather than on multiple variables: “… the propensity score is a monotone function of the discriminant score” (Rosenbaum & Rubin, 1984). • The probability is usually obtained from a probit/logistic regression and used to construct a counterfactual group • Propensity scores may be used for matching or as covariates, alone or together with other matching variables or covariates.

  9. Average treatment effect • More specifically, if D = 1 for the treated group and D = 0 for the comparison group, then the average treatment effect on the treated (ATT) for an outcome variable Y is • ATT = E(Y1 − Y0 | D=1), • which means • ATT = E(Y1 | D=1) − E(Y0 | D=1) • While data on E(Y1 | D=1) are available from the programme participants, estimation of the counterfactual E(Y0 | D=1) is based on the assumption that, after adjusting for observable differences, the mean potential outcome is the same for D = 1 and D = 0. • The mean effect of treatment can then be calculated as the average difference in outcomes between participants and matched non-participants, and this difference can be attributed to the programme.
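  A compact restatement of this identification argument in LaTeX notation (the ATT equations are the slide's own; the conditional independence assumption is the standard Rosenbaum and Rubin (1983) condition that justifies the substitution):

    ATT = E(Y_1 - Y_0 \mid D = 1) = E(Y_1 \mid D = 1) - E(Y_0 \mid D = 1)

    Y_0 \perp D \mid X \;\Rightarrow\; Y_0 \perp D \mid p(X), \quad \text{where } p(X) = \Pr(D = 1 \mid X)

    \Rightarrow\; E(Y_0 \mid D = 1, p(X)) = E(Y_0 \mid D = 0, p(X))

  so the unobservable counterfactual mean E(Y_0 | D = 1) can be estimated from non-participants with comparable propensity scores.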

  10. PSM: the origin • In 1983, Rosenbaum and Rubin published the seminal paper that first proposed this approach. • From the 1970s, Heckman and his colleagues had focused on the problem of selection bias and on traditional approaches to programme evaluation, including randomized experiments, classical matching, and statistical controls. Heckman and co-authors later developed the difference-in-differences matching method

  11. Match each participant to one or more non-participants on the propensity score • Nearest neighbor matching • Caliper matching • Mahalanobis metric matching in conjunction with PSM • Stratification matching • Difference-in-differences matching (kernel & local linear weights) General procedure for estimation of the ATT (a minimal sketch follows below): • Run a logistic regression: dependent variable Y = 1 if the unit participates, Y = 0 otherwise; choose appropriate conditioning variables. • Obtain the propensity score: the predicted probability (p) or log[p/(1−p)]. • Match on the score and compute the ATT.
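  A minimal Stata sketch of this general procedure. The variable names treated, x1-x3 and yout are hypothetical placeholders, not from the Yeshasvini data; psmatch2 is the user-written command listed on the software slide:

    logit treated x1 x2 x3 // participation equation
    predict ps_hat, pr // propensity score: the predicted probability
    * one-to-one nearest-neighbour matching on the estimated score;
    * psmatch2 reports the ATT for the outcome variable
    psmatch2 treated, pscore(ps_hat) neighbor(1) outcome(yout) common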

  12. The procedure: an illustration using the Yeshasvini impact evaluation

  13. Estimating the PS function: 1. Choice of treatment vs. comparison group • Depends on the objective of the evaluation and the structure of the data. • Treated groups: • Yeshasvini members, • beneficiaries (claimants), • renewing members • Comparison groups: • non-Yeshasvini cooperative households • non-Yeshasvini non-cooperative households • The former (cooperative households) have better economic and social status than the latter

  14. Our models • 6 models: three treatment groups crossed with two comparison groups • Matching with the cooperative group will match better-off sections. • Matching with the non-cooperative group will match poorer sections. • This yields results across different socio-economic strata

  15. Estimating the PS function: 2. Choice of the model: probit vs. logit • In principle, any discrete choice model can be used, so the choice is not too critical (Caliendo and Kopeinig, 2008). • We have used a probit specification (a quick check of the probit/logit equivalence is sketched below)
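  The claim that the link function matters little can be checked directly: probit and logit fit nearly identical propensity scores. A minimal Stata sketch with the same hypothetical variable names as above:

    probit treated x1 x2 x3 // probit participation equation
    predict ps_probit, pr
    logit treated x1 x2 x3 // logit on the same covariates
    predict ps_logit, pr
    corr ps_probit ps_logit // correlation is typically very close to 1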

  16. Estimating PS function : 3. Choice of the variables • Match, as much as possible, on variables that are precisely measured and stable (to avoid extreme baseline scores that will regress toward the mean) • While analysing the factors affecting the demand for health insurance, most studies focus on individuals’ or households’ observable traits, such as income, nature of economic activity, demographic patterns, age structure, health patterns, social status, education, and personal preferences. • The socio-economic contexts within which households live are generally ignored. We have explicitly taken into account village-specific and district-specific attributes along with household-specific characteristics. These include economic conditions, literacy, health infrastructure, distance from the nearest health facility, distance from the nearest Yeshasvini facility, living conditions, poverty, transport facilities and the coverage of cooperative societies.

  17. Estimation of the PS function • Using the user-written pscore command (Becker and Ichino, 2002), which fits a probit of the treatment indicator ydumb3 on the covariates, saves the estimated propensity score in myscore2, and tests the balancing property: • pscore ydumb3 dumchronic1 lock2_i_concen_inc headage headedustatus demodivage hsize block3a_membershg h_sc_grp sh_female lper hholdasset block2_paper block2_tv v_livingcdn v_hlthdistance v_copop d_health_infra v_nature disadv d_panchay_villg d_tpt, pscore(myscore2)

  18. The pre-matching balancing test • Since conditioning is done not on the covariates but only on propensity scores, the matching procedure should be able to balance the distribution of the relevant variables in both the comparison and the treatment group. • Bias arises because Y is related to a variable X whose distribution differs between the two groups. To remove the bias, a few subclasses are created based on the distribution of X. Next, the mean value of Y is calculated separately within each subclass. Finally, a weighted mean of these subclass means is calculated for each group, using the same weights for both groups, where the weights are proportional to the number of subjects in the subclass. • As the number of covariates increases, the number of subclasses grows dramatically. For example, considering only binary covariates, with k variables there will be 2^k subclasses, and it is highly unlikely that every subclass will contain both treated and comparison units. In this case, propensity scores are used instead, and the balancing test must be satisfied. • (Wang-Sheng Lee, “Propensity Score Matching and Variations on the Balancing Test”, Melbourne Institute of Applied Economic and Social Research, The University of Melbourne, 10 March 2006)

  19. Illustration of the pre-matching balancing • Distribution of observations across propensity-score blocks (pscore output; ydumb3 = 0 if hoymem == 0):

    Inferior of block of pscore | ydumb3 = 0 | ydumb3 = 1 |  Total
    0                           |        299 |        312 |    611
    .2                          |         64 |         13 |     77
    .25                         |         59 |         27 |     86
    .3                          |        150 |         79 |    229
    .4                          |        146 |        107 |    253
    .5                          |        116 |        180 |    296
    .6                          |        119 |        206 |    325
    .7                          |         46 |        124 |    170
    .75                         |         24 |        137 |    161
    .8                          |         59 |        370 |    429
    Total                       |      1,082 |      1,555 |  2,637

  • This number of blocks ensures that the mean propensity score is not different for treated and controls in each block • The balancing property is satisfied

  20. Choosing an algorithm for matching (see the sketch below) • Nearest neighbor: randomly order the participants and non-participants, then select the first participant and find the non-participant with the closest propensity score. • Caliper: define a maximum permitted distance between scores (e.g., .01 to .00001) and randomly select one non-participant whose propensity score matches the participant's within that tolerance. • Kernel: each person in the treatment group is matched to a weighted average of individuals who have similar propensity scores, with the greatest weight given to people with closer scores
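  In psmatch2, these algorithms are selected through options on a single command. A sketch with the hypothetical variables from above; the caliper width and bandwidth are illustrative values, not those used in the study:

    * nearest neighbour (one-to-one)
    psmatch2 treated, pscore(ps_hat) neighbor(1) outcome(yout)
    * nearest neighbour within a caliper of 0.01
    psmatch2 treated, pscore(ps_hat) neighbor(1) caliper(0.01) outcome(yout)
    * kernel matching with bandwidth 0.06
    psmatch2 treated, pscore(ps_hat) kernel bwidth(0.06) outcome(yout)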

  21. Other methods • Radius matching • Mahalanobis matching: (1) Mahalanobis metric matching including the propensity score, and (2) nearest available Mahalanobis metric matching within calipers defined by the propensity score • Local linear regression matching • Spline matching …

  22. Greedy vs. optimal • There are basically two types of matching algorithm. • Optimal matching: previous matches are reconsidered before the current match is made. • Greedy matching: frequently used to match cases to controls in observational studies. A set of X cases is matched to a set of Y controls in a sequence of X decisions; once a match is made, it is not reconsidered, each match being the best currently available. • Bias is reduced, but the usable observations are also restricted.

  23. Limitations of matching • If the two groups do not have substantial overlap, substantial error may be introduced: • e.g., if only the worst cases from the untreated comparison group are compared with only the best cases from the treatment group, the result may be regression toward the mean, which • makes the comparison group look better • makes the treatment group look worse

  24. Propensity score histograms: overlap • [Six overlap histograms, one per model: treated YH vs. untreated NYCH; treated YB vs. untreated NYCHB; treated YH3+ vs. untreated NY+3CH; treated YH vs. untreated NYNCH; treated YB vs. untreated NYNCHB; treated YH+3 vs. untreated NY+3NCH]
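  The histograms themselves are not preserved in the transcript. After a psmatch2 run, comparable overlap plots can be drawn with the companion psgraph command (part of the psmatch2 package):

    psmatch2 treated, pscore(ps_hat) kernel outcome(yout) // any matching run
    psgraph // histogram of the propensity score by treatment status, on and off support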

  25. Common support • For the matching, we had to decide whether the test should be performed only on the observations with propensity scores within the common support region, i.e. precisely on the subset of the comparison group most comparable to the treatment group, or on the full comparison group. • Heckman et al. (1997) argue that imposing the common support restriction in the estimation of propensity scores improves the quality of the estimates. Lechner (2001), on the other hand, argues that besides reducing the sample considerably, imposing the restriction may lose high-quality matches at the boundary of the common support region. • The general practice is to impose common support.
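  In psmatch2 the restriction is imposed with the common option; the command also flags each observation's support status. A minimal sketch, again with hypothetical names:

    * impose common support: treated units off support are dropped from the ATT
    psmatch2 treated, pscore(ps_hat) kernel outcome(yout) common
    tab _support _treated // _support and _treated are variables generated by psmatch2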

  26. Cases are excluded at both ends of the propensity score • [Diagram: the range of matched cases lies in the middle of the score distribution, with cases excluded at both extremes]

  27. Incomplete matching or inexact matching? • While trying to maximize exact matches (i.e., matching strictly “nearest” or narrowing the common-support region), cases may be excluded, giving incomplete matching. • While trying to maximize the number of cases (i.e., widening the region), inexact matching may result.

  28. Post-matching balancing test (see the sketch below)
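  The output for this slide is not preserved in the transcript. With psmatch2, post-matching balance is conventionally checked with the companion pstest command, which compares covariate means in the treated and matched comparison groups; a sketch with the hypothetical names used above:

    psmatch2 treated x1 x2 x3, kernel outcome(yout) common
    pstest x1 x2 x3, both // standardised bias and t-tests, before and after matching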

  29. Outcome variables Outcome variables were classified into four broad groups: • health-care utilisation; • financial protection; • treatment outcome (days lost in illness, income lost in illness, perception regarding the level of satisfaction, abnormal deliveries and caesarean deliveries); and • economic well-being (change in income, savings, borrowings, sale and purchase of assets, and total savings and borrowings over the past three years).

  30. Estimation of standard errors • The estimated variance of the treatment effect includes the variance due to the estimation of the propensity score, the imputation of the common support, and possibly also the order in which treated individuals are matched. These estimation steps add variation beyond the normal sampling variation (Heckman et al., 1998). • The most commonly used way to deal with this problem is bootstrapping of standard errors, as suggested by Lechner (2002). Using this technique, we obtained the estimates of standard errors by bootstrapping with 50 replications. • In general, 50 replications are considered good enough to provide a reliable estimate of the standard error (Efron and Tibshirani, 1993).

  31. Illustration: the bootstrap command • bootstrap r(att): psmatch2 ydumb3, kernel pscore(myscore2) common outcome(b41nofacilityvstd)

  32. Illustration of output

  33. Criteria for “good” PSM • Identify treatment and comparison groups with substantial overlap • Use a composite variable, e.g. a propensity score, which minimizes group differences across many covariates

  34. Limitations of propensity scores • Large samples are required • Group overlap must be substantial • Hidden bias may remain, because matching controls only for observed variables (and only to the extent that they are perfectly measured) • The treatment may affect the comparison group as well, which can lead to underestimation of treatment effects (Shadish, Cook, & Campbell, 2002)

  35. A Methodological Overview • Computational software • STATA – PSMATCH2 • SAS SUGI 214-26 “GREEDY” Macro • S-Plus with FORTRAN Routine for difference-in-differences (Petra Todd)
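  For the Stata route, the commands used in this deck are user-written and must be installed once. psmatch2 (which ships with pstest and psgraph) is available from SSC; pscore accompanies Becker and Ichino's 2002 Stata Journal article and can be located with findit:

    ssc install psmatch2 // installs psmatch2, pstest and psgraph
    findit pscore // locate and install the Becker-Ichino pscore package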

  36. Thank you very much. Questions?
