Missing Data: Analysis and Design

John W. Graham • The Prevention Research Center and Department of Biobehavioral Health • Penn State University

Presentation in Four Parts • (1) Introduction: Missing Data Theory • (2) A brief analysis demonstration • Multiple Imputation with NORM and Proc MI • Amos • ...break... • (3) Attrition Issues • (4) Planned missingness designs • 3-form Design

Recent Papers • Graham, J. W., Cumsille, P. E., & Elek-Fisk, E. (2003). Methods for handling missing data. In J. A. Schinka & W. F. Velicer (Eds.), Research Methods in Psychology (pp. 87-114). Volume 2 of Handbook of Psychology (I. B. Weiner, Editor-in-Chief). New York: John Wiley & Sons. • Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351. • Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177. • jgraham@psu.edu

Part I: A Brief Introduction to Analysis with Missing Data

Problem with Missing Data • Analysis procedures were designed for complete data. . .

Solution 1 • Design new model-based procedures • Missing Data + Parameter Estimation in One Step • Full Information Maximum Likelihood (FIML) • SEM and Other Latent Variable Programs (Amos, Mx, LISREL, Mplus, LTA)

Solution 2 • Data-based procedures • e.g., Multiple Imputation (MI) • Two Steps • Step 1: Deal with the missing data • (e.g., replace missing values with plausible values) • Produce a product • Step 2: Analyze the product as if there were no missing data

FAQ • Aren't you somehow helping yourself with imputation?. . .

NO. Missing data imputation . . . • does NOT give you something for nothing • DOES let you make use of all data you have . . .

FAQ • Is the imputed value what the person would have given?

NO. When we impute a value . . . • We do not impute for the sake of the value itself • We impute to preserve important characteristics of the whole data set . . .

We want . . . • unbiased parameter estimation • e.g., b-weights • Good estimate of variability • e.g., standard errors • best statistical power

Causes of Missingness • Ignorable • MCAR: Missing Completely At Random • MAR: Missing At Random • Non-Ignorable • MNAR: Missing Not At Random

MCAR (Missing Completely At Random) • MCAR 1: Cause of missingness is a completely random process (like a coin flip) • MCAR 2: Cause is uncorrelated with the variables of interest • Example: parents move • No bias if cause omitted

MAR (Missing At Random) • Missingness may be related to measured variables • But no residual relationship with unmeasured variables • Example: reading speed • No bias if you control for measured variables

MNAR (Missing Not At Random) • Even after controlling for measured variables ... • Residual relationship with unmeasured variables • Example: drug use reason for absence
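The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed values that are not from the slides: a measured variable Z correlated 0.5 with Y, roughly 50% missingness on Y, and logistic missingness probabilities. It compares complete-case estimates when missingness is a coin flip (MCAR), depends only on measured Z (MAR), or depends on Y itself (MNAR).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Z: a measured variable; Y: the variable of interest (assumed corr = 0.5).
z = rng.normal(size=n)
y = 0.5 * z + rng.normal(scale=np.sqrt(0.75), size=n)  # mean 0, variance 1


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def complete_case_summary(p_missing):
    """Drop cases whose Y is missing; return mean of Y and slope of Y on Z."""
    observed = rng.random(n) >= p_missing
    slope = np.polyfit(z[observed], y[observed], 1)[0]
    return y[observed].mean(), slope


# Probability that Y is missing under each (hypothetical) mechanism.
mean_mcar, slope_mcar = complete_case_summary(np.full(n, 0.5))    # coin flip
mean_mar, slope_mar = complete_case_summary(sigmoid(1.5 * z))     # depends on Z
mean_mnar, slope_mnar = complete_case_summary(sigmoid(1.5 * y))   # depends on Y

for name, m, s in [("MCAR", mean_mcar, slope_mcar),
                   ("MAR", mean_mar, slope_mar),
                   ("MNAR", mean_mnar, slope_mnar)]:
    print(f"{name}: mean(Y) = {m:.3f}, slope of Y on Z = {s:.3f}")
```

With the full data, the mean of Y is 0 and the slope is 0.5. In this setup the MAR case distorts the raw mean of Y but leaves the slope essentially intact once Z is in the model, while the MNAR case distorts both.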

MNAR Causes • The recommended methods assume missingness is MAR • But what if the cause of missingness is not MAR? • Should these methods be used when MAR assumptions not met? . . .

YES! These Methods Work! • Suggested methods work better than “old” methods • Multiple causes of missingness • Only small part of missingness may be MNAR • Suggested methods usually work very well

Revisit Question: What if THE Cause of Missingness is MNAR? • Example model of interest: X → Y • X = Program (prog vs control) • Y = Cigarette Smoking • Z = Cause of missingness: say, Rebelliousness (or smoking itself) • Factors to be considered: • % Missing (e.g., % attrition) • rYZ • rZ,Ymis

rYZ • Correlation between • cause of missingness (Z) • e.g., rebelliousness (or smoking itself) • and the variable of interest (Y) • e.g., Cigarette Smoking

rZ,Ymis • Correlation between • cause of missingness (Z) • e.g., rebelliousness (or smoking itself) • and missingness on variable of interest • e.g., Missingness on the Smoking variable • Missingness on Smoking (Ymis) • Dichotomous variable: Ymis = 1: Smoking variable not missing Ymis = 0: Smoking variable missing

How Could the Cause of Missingness be Purely MNAR? • rYZ = 1.0 AND rZ,Ymis = 1.0 • We can get rYZ = 1.0 if smoking is the cause of missingness on the smoking variable

How Could the Cause of Missingness be Purely MNAR? • We can get rZ,Ymis = 1.0 like this: • If person is a smoker, smoking variable is always missing • If person is not a smoker, smoking variable is never missing • But is this plausible? ever?
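A short numerical sketch may make rZ,Ymis concrete. The prevalence and missingness rates below are hypothetical, chosen only for illustration. Z is smoking itself, and Ymis is coded as on the earlier slide (1 = not missing, 0 = missing), so a "perfect" relationship shows up as a correlation of -1.0, that is, 1.0 in absolute value.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical population: 30% smokers; Z is smoking itself (0/1).
smoker = rng.random(n) < 0.3
z = smoker.astype(float)

# "Pure" case: smokers always missing, non-smokers never missing.
ymis_pure = np.where(smoker, 0, 1)          # 1 = not missing, 0 = missing
r_pure = np.corrcoef(z, ymis_pure)[0, 1]

# A more plausible case: smokers are only somewhat more likely to be missing
# (assumed rates: 40% for smokers, 20% for non-smokers).
p_missing = np.where(smoker, 0.4, 0.2)
ymis_partial = (rng.random(n) >= p_missing).astype(int)
r_partial = np.corrcoef(z, ymis_partial)[0, 1]

print(f"deterministic missingness: r = {r_pure:.2f}")
print(f"probabilistic missingness: r = {r_partial:.2f}")
```

Under these assumed rates the correlation is far from perfect, even though smoking clearly influences missingness.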

What if the cause of missingness is MNAR? Problems with this statement • MAR & MNAR are widely misunderstood concepts • I argue that the cause of missingness is never purely MNAR • The cause of missingness is virtually never purely MAR either.

MAR vs MNAR: • MAR and MNAR form a continuum • Pure MAR and pure MNAR are just theoretical concepts • Neither occurs in the real world • MAR vs MNAR NOT dimension of interest

MAR vs MNAR: What IS the Dimension of Interest? • Question of Interest: How much estimation bias? • when cause of missingness cannot be included in the model

Bottom Line ... • All missing data situations are partly MAR and partly MNAR • Sometimes it matters ... • bias affects statistical conclusions • Often it does not matter • bias has minimal effects on statistical conclusions (Collins, Schafer, & Kam, Psych Methods, 2001)

Methods: "Old" vs MAR vs MNAR • MAR methods (MI and ML) • are ALWAYS at least as good as, • usually better than "old" methods (e.g., listwise deletion) • Methods designed to handle MNAR missingness are NOT always better than MAR methods

References • Graham, J. W., & Donaldson, S. I. (1993). Evaluating interventions with differential attrition: The importance of nonresponse mechanisms and use of followup data. Journal of Applied Psychology, 78, 119-128. • Graham, J. W., Hofer, S.M., Donaldson, S.I., MacKinnon, D.P., & Schafer, J.L. (1997). Analysis with missing data in prevention research. In K. Bryant, M. Windle, & S. West (Eds.), The science of prevention: methodological advances from alcohol and substance abuse research. (pp. 325-366). Washington, D.C.: American Psychological Association. • Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.

Analysis: Old and New

Old Procedures: Analyze Complete Cases (listwise deletion) • may produce bias • you always lose some power • (because you are throwing away data) • reasonable if you lose only 5% of cases • often lose substantial power

Analyze Complete Cases (listwise deletion) • Missingness pattern for 5 cases on 4 variables (1 = observed, 0 = missing): • 1 1 1 1 • 0 1 1 1 • 1 0 1 1 • 1 1 0 1 • 1 1 1 0 • very common situation • only 20% (4 of 20) data points missing • but discard 80% (4 of 5) of the cases
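The arithmetic on this slide can be checked directly. The sketch below reads the pattern as 5 cases by 4 variables, with 1 meaning observed and 0 meaning missing (an interpretation inferred from the "4 of 20" count).

```python
import numpy as np

# Rows are cases, columns are variables; 1 = observed, 0 = missing.
pattern = np.array([
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
])

n_points = pattern.size
n_points_missing = int((pattern == 0).sum())

# Listwise deletion keeps only cases with every variable observed.
complete = pattern.all(axis=1)
n_cases = pattern.shape[0]
n_cases_discarded = int((~complete).sum())

print(f"{n_points_missing} of {n_points} data points missing")
print(f"{n_cases_discarded} of {n_cases} cases discarded by listwise deletion")
```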

Other "Old" Procedures • Pairwise deletion • May be of occasional use for preliminary analyses • Mean substitution • Never use it • Regression-based single imputation • generally not recommended ... except ...

Recommended Model-Based Procedures • Multiple Group SEM (Structural Equation Modeling) • Latent Transition Analysis (Collins et al.) • A latent class procedure

Recommended Model-Based Procedures • Raw Data Maximum Likelihood SEM, aka Full Information Maximum Likelihood (FIML) • Amos (James Arbuckle) • LISREL 8.5+ (Jöreskog & Sörbom) • Mplus (Bengt Muthén) • Mx (Michael Neale)

Amos 7, Mx, Mplus, LISREL 8.8 • Structural Equation Modeling (SEM) Programs • In Single Analysis ... • Good Estimation • Reasonable standard errors • Windows Graphical Interface

Limitation with Model-Based Procedures • That particular model must be what you want

Recommended Data-Based Procedures • EM Algorithm (ML parameter estimation) • Norm-Cat-Mix, EMcov, SAS, SPSS • Multiple Imputation • NORM, Cat, Mix, Pan (Joe Schafer) • SAS Proc MI • LISREL 8.5+

EM Algorithm • Expectation-Maximization • Alternate between: • E-step: predict missing data • M-step: estimate parameters • Excellent parameter estimates • But no standard errors • must use bootstrap • or multiple imputation
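The E-step/M-step alternation can be sketched for the simplest case: two normally distributed variables, where X is complete and Y has missing values. This is an illustrative toy, not the algorithm as implemented in NORM or any other program named here; the function name `em_bivariate` and the simulated data are assumptions made for the example.

```python
import numpy as np


def em_bivariate(x, y, n_iter=200):
    """EM estimates of the mean vector and covariance matrix of (X, Y)
    when x is complete and y contains NaN for missing values."""
    miss = np.isnan(y)
    # Start from complete-case estimates.
    mu = np.array([x.mean(), y[~miss].mean()])
    cov = np.cov(x[~miss], y[~miss], bias=True)
    for _ in range(n_iter):
        # E-step: expected y and y^2 for missing cases, given current parameters.
        slope = cov[0, 1] / cov[0, 0]
        resid_var = cov[1, 1] - slope * cov[0, 1]
        ey = np.where(miss, mu[1] + slope * (x - mu[0]), y)
        ey2 = np.where(miss, ey ** 2 + resid_var, y ** 2)
        # M-step: re-estimate parameters from the expected sufficient statistics.
        mu = np.array([x.mean(), ey.mean()])
        sxx = np.mean(x ** 2) - mu[0] ** 2
        sxy = np.mean(x * ey) - mu[0] * mu[1]
        syy = ey2.mean() - mu[1] ** 2
        cov = np.array([[sxx, sxy], [sxy, syy]])
    return mu, cov


# Simulated example (assumed values): true mean of Y is 1.0, cov(X, Y) is 0.6.
rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
y_full = 1.0 + 0.6 * x + rng.normal(scale=0.8, size=n)

# MAR missingness: Y is more likely to be missing when X is large.
y = y_full.copy()
y[rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))] = np.nan

mu_em, cov_em = em_bivariate(x, y)
cc_mean = np.nanmean(y)  # complete-case mean of Y, for comparison
print(f"complete-case mean of Y: {cc_mean:.3f}")
print(f"EM mean of Y: {mu_em[1]:.3f}, EM cov(X, Y): {cov_em[0, 1]:.3f}")
```

Note that the E-step adds the residual variance to the expected squared values; filling in predicted values alone would understate the variance of Y. As the slide says, the result is a set of parameter estimates with no standard errors.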

Multiple Imputation • Problem with Single Imputation: Too Little Variability • Because of Error Variance • Because covariance matrix is only one estimate

Too Little Error Variance • Imputed value lies on regression line

Imputed Values on Regression Line

Restore Error . . . • Add random normal residual
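A minimal sketch of this idea, using simulated data with assumed parameters: values imputed on the regression line shrink the variance of Y, and adding a random normal residual (with the residual standard deviation estimated from the complete cases) restores it. Missingness is MCAR here purely to keep the example simple.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)   # true variance of Y is 1.25
miss = rng.random(n) < 0.4               # 40% of Y missing (MCAR for simplicity)

# Fit the regression of Y on X using complete cases only.
slope, intercept = np.polyfit(x[~miss], y[~miss], 1)
resid = y[~miss] - (intercept + slope * x[~miss])
resid_sd = resid.std(ddof=2)

# Imputation on the regression line: no error variance in imputed values.
y_line = y.copy()
y_line[miss] = intercept + slope * x[miss]

# Restore error: add a random normal residual to each imputed value.
y_noise = y.copy()
y_noise[miss] = y_line[miss] + rng.normal(scale=resid_sd, size=miss.sum())

var_true, var_line, var_noise = y.var(), y_line.var(), y_noise.var()
print(f"variance of Y, full data:           {var_true:.3f}")
print(f"variance, imputed on the line:      {var_line:.3f}")
print(f"variance, with residual restored:   {var_noise:.3f}")
```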

Covariance Matrix (Regression Line) only One Estimate • Obtain multiple plausible estimates of the covariance matrix • ideally draw multiple covariance matrices from population • Approximate this with • Bootstrap • Data Augmentation (Norm) • MCMC (SAS 8.2, 9)
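The bootstrap option can be sketched as follows: each imputed data set uses a regression line re-estimated from a bootstrap resample of the complete cases, so the line itself varies from one imputation to the next. The helper name `bootstrap_impute` and the simulated data are hypothetical; data augmentation and MCMC, as used by NORM and Proc MI, reach the same goal by a different route.

```python
import numpy as np


def bootstrap_impute(x, y, m=5, seed=0):
    """Return m imputed copies of y (NaN = missing), each based on a
    regression line fitted to a bootstrap resample of the complete cases."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(y)
    obs_idx = np.flatnonzero(~miss)
    datasets, slopes = [], []
    for _ in range(m):
        # Resample complete cases with replacement and refit the line.
        boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
        slope, intercept = np.polyfit(x[boot], y[boot], 1)
        resid_sd = np.std(y[boot] - (intercept + slope * x[boot]), ddof=2)
        # Impute from this line, adding a random normal residual.
        y_imp = y.copy()
        y_imp[miss] = (intercept + slope * x[miss]
                       + rng.normal(scale=resid_sd, size=miss.sum()))
        datasets.append(y_imp)
        slopes.append(slope)
    return datasets, np.array(slopes)


# Simulated example with assumed parameters.
rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = 2.0 + 0.5 * x + rng.normal(size=500)
y[rng.random(500) < 0.3] = np.nan

imputed, boot_slopes = bootstrap_impute(x, y, m=5)
print("slopes used for the 5 imputations:", np.round(boot_slopes, 3))
```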

Regression Line only One Estimate

Data Augmentation • stochastic version of EM • EM • E (expectation) step: predict missing data • M (maximization) step: estimate parameters • Data Augmentation • I (imputation) step: simulate missing data • P (posterior) step: simulate parameters

Data Augmentation • Parameters from consecutive steps ... • too related • i.e., not enough variability • after 50 or 100 steps of DA ... covariance matrices are like random draws from the population

Multiple Imputation Allows: • Unbiased Estimation • Good standard errors • provided number of imputations is large enough • too few imputations → reduced power with small effect sizes
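The slides do not spell out how results from the imputed data sets are combined. The sketch below uses the standard combining rules (Rubin's rules), which are brought in here as background rather than taken from the slides: the pooled estimate is the average across imputations, and the pooled variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the input numbers are hypothetical.

```python
import numpy as np


def pool_rubin(estimates, std_errors):
    """Combine one parameter's estimates and standard errors across
    m imputed data sets using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2
    m = q.size
    q_bar = q.mean()                 # pooled point estimate
    within = u.mean()                # average within-imputation variance
    between = q.var(ddof=1)          # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, np.sqrt(total)


# Hypothetical b-weights and standard errors from m = 3 imputed data sets.
pooled_b, pooled_se = pool_rubin([0.50, 0.54, 0.46], [0.10, 0.10, 0.10])
print(f"pooled estimate = {pooled_b:.3f}, pooled SE = {pooled_se:.4f}")
```

The (1 + 1/m) factor shows why the number of imputations matters: with small m, the between-imputation variance is both inflated and poorly estimated.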

From Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (in press). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science.

Part II: Illustration of Missing Data Analysis: Multiple Imputation with NORM and Proc MI
