
## Why Propensity Score Matching should be used to Assess Programmatic Effects


NASPA Assessment & Retention Conference, June 11, 2010
Forrest Lane, University of North Texas

**Contact Information**

Forrest Lane
Center for Interdisciplinary Research & Analysis (CIRA)
Department of Educational Psychology
University of North Texas
Forrest.lane@unt.edu

**Program Outline**

- Assessment Practices within Post-Secondary Education
- Challenges to Quasi-Experimental Evaluation and Assessment Methods
- Propensity Score Matching
- Heuristic Example

**Educational Assessment**

- There is increased importance placed on modeling the link between university resources and institutional outcomes.
- From a student development perspective, this is often done by evaluating program effects:
  - First-year programming: orientation, new student camps, freshman seminars
  - Co-curricular activities: Greek life, student activities, community involvement
  - Service-learning initiatives
  - Living-learning communities

**Modeling Effects**

- To assess programmatic effects accurately, cause and effect must be established.
- Experimental designs may use control/comparison groups.
- In quasi-experimental designs, participants have traditionally been matched on demographic or other relevant variables, or matched on some pre-treatment outcome (an examination of baseline differences).

**Example from the Literature**

- The effects of on- and off-campus living arrangements on students' openness to diversity were explored.
- The 13 variables in the model were analyzed using path modeling.
- Spurious effects were modeled on background characteristics.
- Results indicated that living on campus was directly associated with significantly higher levels of openness to diversity than living off campus (Pike, 2009).

**Commonly Reported Limitations**

"The fact that students self selected into different residential communities represents another potential limitation of the research. Females, minority students, and higher-ability students were over-represented in the research sample due to the under-representation of off-campus students. Although background differences were accounted for in the study, the possibility remains that the residence groups might have differed in ways that were not explored" (Pike, 2009, p. 639).

**Empirical Problems with Self-Selection**

- True randomization is rarely an option in educational assessment (Luellen, Shadish, & Clark, 2005; Grunwald & Mayhew, 2008).
- As a result, there is an abundance of, and often an over-reliance on, reported effects that may inadequately address the variables contributing to differences in treatment group selection.
- Non-randomized groups may systematically differ from one another on any number of covariates, which leads to effect size bias when interpreting treatment effects.

**Experimental vs. Quasi-Experimental**

- In true randomization, groups can be compared directly because systematic differences have been controlled through experimental design: the probability of group membership is equal (p = .50).
- In quasi-experimental designs, group differences arise from non-randomization, so the groups cannot be compared directly.
- The probability of group membership is not equal (p ≠ .50).

**The ANCOVA Problem**

- ANCOVA is often used to control for differences on an outcome of interest based on theoretically relevant covariates.
- Controlling for covariates on an outcome is theoretically different from matching participants on their likelihood of being in a treatment group (the independent variable).
- Covariates that control for outcome differences may or may not have anything to do with group membership or self-selection.

**Solution to Quasi-Experimental Designs**

- Propensity score matching (PSM) is used to estimate the true treatment effect and to reduce group bias due to non-randomization.
- Participants are matched across groups on their likelihood of group membership.
- It is a method recommended by the U.S. Department of Education to improve the quality of quasi-experimental research (Glenn, 2005).
- PSM has been used increasingly in medical and economic research since the mid-1980s.

**Defining a Propensity Score**

- A propensity score is the conditional probability of assignment to a particular treatment or control condition given a set of covariates (Rosenbaum & Rubin, 1983b).
- Propensity scores condense the covariates into a single scalar variable ranging from 0 to 1.
- This scalar variable can then be used to match participants across treatment groups.
- Once participants are matched, treatment effects should more closely reflect the true effect, analogous to the interpretation of randomized designs.

**Calculating Propensity Scores**

- The most commonly used method is logistic regression.
- Other methods include classification trees and ensemble methods such as bagging, boosted regression trees, and random forests (Shadish, Luellen, & Clark, 2006).
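As a minimal sketch of the logistic-regression approach, the example below fits a logistic regression by Newton-Raphson on simulated data and saves the fitted probabilities as propensity scores. The covariates and selection mechanism are invented for illustration; any logistic-regression routine (SPSS, R, Stata) would produce equivalent scores.

```python
import numpy as np

def propensity_scores(X, t, iters=25):
    """Fit logistic regression t ~ X by Newton-Raphson and return the
    fitted probabilities of treatment -- the propensity scores."""
    Xd = np.column_stack([np.ones(len(X)), X])      # add intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xd @ beta))            # current predicted probabilities
        W = p * (1 - p)                             # IRLS weights
        H = Xd.T @ (Xd * W[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        beta += np.linalg.solve(H, Xd.T @ (t - p))  # Newton step
    return 1 / (1 + np.exp(-Xd @ beta))

# Simulated data (hypothetical covariates echoing the heuristic example):
rng = np.random.default_rng(0)
n = 300
in_state = rng.integers(0, 2, n).astype(float)  # in-state vs. out-of-state
legacy = rng.integers(0, 2, n).astype(float)    # legacy status
sat_z = rng.normal(0, 1, n)                     # SAT score, z-scored
X = np.column_stack([in_state, legacy, sat_z])
# Simulated self-selection: higher SAT -> more likely to join the LLC
t = (rng.random(n) < 1 / (1 + np.exp(-sat_z))).astype(float)

ps = propensity_scores(X, t)  # one scalar per student, in (0, 1)
```

Note that the covariates enter the model to predict group membership, not the outcome; that is the conceptual difference from the ANCOVA approach above.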
**PSM in the Literature**

- Grunwald & Mayhew (2008) examined the development of moral reasoning in young adults and demonstrated a significant reduction in the overestimation of effects.
- Morgan (2001) used propensity score matching to show that the effect of private school education on math and reading achievement is actually larger than findings from non-matched samples suggest.
- Similar studies have appeared in economics (Dehejia & Wahba, 2002), medicine (Schafer & Kang, 2008), and sociology (Morgan & Harding, 2006).
- Over 1,000 articles using propensity score matching were found in JSTOR across sociology, economics, and medical journals, yet the technique remains virtually absent from educational research and assessment methods.

**PSM in Higher Education Literature**

The following reflects a search for propensity score matching techniques in the literature between the years 1996-2010.

**Heuristic Example**

- College X believes participation in a living-learning community (LLC) contributes to better academic performance (GPA).
- A sample of 30 students was collected.
- Data were examined to determine whether academic performance among LLC participants was statistically and meaningfully different from that of students who do not participate in an LLC.

**Pre-Matching Achievement Scores**

[Figure: GPA scale from 3.0 to 4.0; non-LLC mean = 3.21, LLC mean = 3.43 — a biased treatment effect.]

**Propensity Score Calculation**

- Logistic regression was performed in SPSS 18.0 using the following covariates to predict participation in an LLC: in-state vs. out-of-state status, legacy status, PSAT scores, SAT scores, and gender.
- Predicted probabilities were saved in the analysis.
- Covariates should be theoretically driven variables that contribute to group membership, not the outcome of interest.

**Pre-Matching Propensity Scores**

[Figure: propensity score scale from 0 (unlikely to be in an LLC) to 1 (likely to be in an LLC); non-LLC mean = .380, LLC mean = .565 — the amount of selection bias.]

**Propensity Score Matching**

- Groups can be balanced on covariates through matching, regression adjustment, or stratification.
- Stratification across quintiles is the recommended and most common method; it has been shown to remove approximately 90% of the bias due to the covariates (Rosenbaum & Rubin, 1983b; Rosenbaum & Rubin, 1984; Shadish, Luellen, & Clark, 2005).
- Caliper matching can also substantially reduce bias (Rosenbaum & Rubin, 1985b).
- A caliper of 0.25 standard deviations of the logit transformation of the propensity score works well to reduce bias (Stuart & Rubin, 2008, §4.3.3).
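The caliper approach above can be sketched as a greedy 1:1 nearest-neighbor match on the logit of the propensity score; each matched control is removed from the pool and not reconsidered. The score and treatment arrays here are invented for illustration.

```python
import numpy as np

def caliper_match(ps, treated, caliper_sd=0.25):
    """Greedy 1:1 nearest-neighbor matching on the logit of the
    propensity score, within a caliper of caliper_sd standard
    deviations of the logit. Returns (treated_index, control_index)
    pairs; unmatched treated cases are dropped."""
    logit = np.log(ps / (1 - ps))
    caliper = caliper_sd * logit.std()
    t_idx = np.where(treated == 1)[0]
    pool = list(np.where(treated == 0)[0])   # available control cases
    pairs = []
    for i in t_idx:
        if not pool:
            break
        dists = np.abs(logit[pool] - logit[i])
        j = int(np.argmin(dists))
        if dists[j] <= caliper:
            pairs.append((i, pool.pop(j)))   # control used once, then removed
    return pairs

# Invented example: three treated and three control cases
ps = np.array([0.20, 0.40, 0.60, 0.21, 0.41, 0.90])
treated = np.array([1, 1, 1, 0, 0, 0])
pairs = caliper_match(ps, treated)
# The third treated case (ps = .60) finds no control within the caliper
# (only ps = .90 remains), so it goes unmatched.
```

Because matching is greedy and without replacement, the order of treated cases can affect the result; production implementations such as MatchIt offer more options (with/without replacement, optimal matching).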
**Matching Algorithms**

- MatchIt in R (Ho, Imai, King, & Stuart, 2007)
- PSMATCH2 algorithm in Stata (Leuven & Sianesi, 2004)
- SUGI 214-26 "GREEDY" macro in SAS (D'Agostino, 1998)
- SPSS algorithm (Painter, 2009)
  - Core code written by Raynald Levesque and adapted for use with propensity matching by John Painter, February 2004
  - Program developed and tested with SPSS 11.5
  - The procedure finds the best match for each treatment case from among the control cases
  - Each matched control case is then removed and not reconsidered for subsequent matches

**Assessing Matched Samples**

Some ways of assessing balance (Rubin, 2001):

- The standardized difference in the mean propensity score between the two groups should be near zero (d < .20).
- The ratio of the variances of the propensity score in the two groups should be near one, preferably between 0.80 and 1.25.

**Pre-Matching Propensity Scores**

[Figure: non-LLC mean = .380, LLC mean = .565 on the 0-1 propensity scale.]

**Post-Matching Propensity Scores**

[Figure: LLC mean = .476, non-LLC mean = .487 on the 0-1 propensity scale.]

**Histogram of Post-Matching PS Differences**

[Figure: histogram of propensity score differences between matched pairs.]

**Pre-Matching Achievement Scores**

[Figure: non-LLC mean GPA = 3.21, LLC mean GPA = 3.43 — a biased treatment effect.]

**Post-Matching Achievement Scores**

[Figure: non-LLC mean GPA = 3.32, LLC mean GPA = 3.44 — the true treatment effect.]

**Limitations & Cautions**

- Algorithms across the various platforms make different assumptions about how to treat the data.
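Rubin's (2001) balance checks can be computed directly from the matched groups' propensity scores. The post-matching scores below are invented for illustration and are not the College X data.

```python
import numpy as np

def balance_diagnostics(ps, treated):
    """Rubin's (2001) balance checks on a matched sample: the
    standardized difference in mean propensity scores (want |d| < .20)
    and the ratio of the group variances (want roughly 0.80 to 1.25)."""
    pt, pc = ps[treated == 1], ps[treated == 0]
    pooled_sd = np.sqrt((pt.var(ddof=1) + pc.var(ddof=1)) / 2)
    d = (pt.mean() - pc.mean()) / pooled_sd
    variance_ratio = pt.var(ddof=1) / pc.var(ddof=1)
    return d, variance_ratio

# Invented post-matching scores: groups closely aligned on the 0-1 scale
ps = np.array([0.46, 0.48, 0.50, 0.52,        # LLC (treated)
               0.465, 0.485, 0.505, 0.525])   # non-LLC (matched controls)
treated = np.array([1, 1, 1, 1, 0, 0, 0, 0])
d, ratio = balance_diagnostics(ps, treated)
print(f"d = {d:.3f}, variance ratio = {ratio:.2f}")
```

If either check fails, the propensity model should be revisited (added interactions, different matching method) before treatment effects are interpreted.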
- Matched data sets tend to be more homogeneous than randomized samples.
- The pre-matched sample size and post-matched sample size will not be equal, which should be taken into account with regard to statistical power.
- Propensity score matching typically requires larger sample sizes.

**References**

D'Agostino, R. B. (1998). Tutorial in biostatistics: Propensity score methods for bias reduction in the comparison of treatment to a non-randomized control group. Statistics in Medicine, 17, 2265-2281.

Glenn, D. (2005, March). New federal policy favors randomized trials in education research. The Chronicle of Higher Education. Retrieved December 5, 2009, from http://www.chronicle.com.

Grunwald, H. E., & Mayhew, M. J. (2008). The use of propensity scores in identifying a comparison group in a quasi-experimental design: Moral reasoning development as an outcome. Research in Higher Education, 49(8), 758-775.

Ho, D., Imai, K., King, G., & Stuart, E. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15, 199-236.

Leuven, E., & Sianesi, B. (2004). PSMATCH2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing. Statistical Software Components S432001, Boston College Department of Economics.

Morgan, S. L. (2001). Counterfactuals, causal effect heterogeneity, and the Catholic school effect on learning. Sociology of Education, 74, 341-374.

Morgan, S., & Harding, D. (2006). Matching estimators of causal effects: Prospects and pitfalls in theory and practice. Sociological Methods & Research, 35(1), 3-60. doi:10.1177/0049124106289164

National Research Council (2000). Scientific research in education. Washington, DC: National Academy Press.

Painter, J. (2009). Jordan Institute for Families: Virtual research community. Retrieved from http://ssw.unc.edu/VRC/Lectures/index.htm.

Pike, G. (2009). The differential effects of on- and off-campus living arrangements on students' openness to diversity. Journal of Student Affairs Research & Practice, 46(4), 629-645.

Rosenbaum, P. R., & Rubin, D. B. (1983b). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.

Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387), 516-524.

Rubin, D. B. (2001). Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services & Outcomes Research Methodology, 2, 169-188.

Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13(4), 279-313. doi:10.1037/a0014268

Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W. H., & Shavelson, R. J. (2007). Estimating causal effects using experimental and observational designs (report from the Governing Board of the American Educational Research Association Grants Program). Washington, DC: American Educational Research Association.

Shadish, W. R., Luellen, J. K., & Clark, M. H. (2005). Propensity scores: An introduction and experimental test. Evaluation Review, 29(6), 530-558. doi:10.1177/0193841X0575596

Shadish, W. R., Luellen, J. K., & Clark, M. H. (2006). Propensity scores and quasi-experiments: A testimony to the practical side of Lee Sechrest. In R. R. Bootzin & P. E. McKnight (Eds.), Strengthening research methodology: Psychological measurement and evaluation (pp. 143-157). Washington, DC: American Psychological Association.

Stuart, E. A., & Rubin, D. B. (2008). Matching methods for causal inference: Designing observational studies. In J. Osborne (Ed.), Best practices in quantitative methods. Thousand Oaks, CA: Sage Publishing.
