
Working with Missing Values



Presentation Transcript


  1. Working with Missing Values. Alan C. Acock, February 2007. Supporting material is available at www.oregonstate.edu/~acock/missing

  2. Why Are the Values Missing: The reason instructs the solution. By design (completely random): • Missing Completely at Random (MCAR) • 50% of items selected randomly for each interview • 50% randomly selected for follow-up • Effective when there are too many items or high costs. Intentionally missing (researcher controlled): • Boys not asked about age at first menstruation • Drop from analysis • Sometimes unintentionally imputed • Imputing doesn’t necessarily hurt. Alan C. Acock, Working with Missing Values

  3. Why Are the Values Missing. Refusals (we may know the mechanism): • Adjusted for gender, race, education • May be missing at random • Otherwise, bias is likely without auxiliary variables. Missing because of “don’t know” responses: • Between agree and disagree? • Can we impute a better value? • Should we?

  4. Why Are the Values Missing. Missing by researcher error: • May be missing completely at random • May reflect researcher bias • Perceived risk to researcher • A missing observation is worse than a missing value. Code the reason a value is missing: • NLSY97 uses 5 types of missing values • Treat each differently

  5. Why Are the Values Missing • Understand why each value is missing • Delete observations or variables where you do not intend to impute a value: drop the variable or drop the observation

  6. Four Questions • Do I want to have a value for this person? • Is the value missing completely at random, or • Do I have auxiliary variables that explain why it is missing, and • Do I have covariates that predict the score?

  7. Patterns of Missing Values. MISSING DATA PATTERNS (patterns 1 to 10): HLTH x x x x; CHILDS x x x x x x x x x x; HAP_GEN x x x x x; INCOME98 x x x x x x; AGE x x x x x x x x; EDUC x x x x x. What is the problem with HLTH? INCOME98? EDUC?

  8. Patterns of Missing Values. MISSING DATA PATTERN FREQUENCIES

     Pattern  Freq    Pattern  Freq    Pattern  Freq
        1      550       5       27       9        4
        2       81       6        2      10       14
        3       77       7       12
        4       30       8       21

     • Throw out 81 people in pattern 2? We have data on five of the six variables, and income might not be a key predictor • Why is health missing in patterns 5 to 10? Was this by design?
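
A pattern-frequency table like the one on this slide can be tabulated by hand. The sketch below is only an illustration in Python, not the software used for the slides; the records and the three variable names are made-up toy data, with `None` standing in for a missing value.

```python
from collections import Counter

# Hypothetical toy records (not the data from the slides); None = missing.
records = [
    {"hlth": 3, "income98": 40, "educ": 12},
    {"hlth": 2, "income98": None, "educ": 16},
    {"hlth": None, "income98": 55, "educ": None},
    {"hlth": 4, "income98": None, "educ": 12},
    {"hlth": 1, "income98": 30, "educ": 14},
]
variables = ["hlth", "income98", "educ"]


def missing_pattern(record, names):
    """Return a string such as 'x.x': 'x' = observed, '.' = missing."""
    return "".join("x" if record[name] is not None else "." for name in names)


# Count how many observations share each pattern of observed/missing values.
pattern_counts = Counter(missing_pattern(r, variables) for r in records)

for pattern, freq in pattern_counts.most_common():
    print(pattern, freq)
```

Patterns that are missing only one non-essential variable are the ones where dropping the whole observation costs the most information.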

  9. Amount of Missing Values. PROPORTION OF DATA PRESENT

               HLTH  CHILDS  HAP_GEN   INC   AGE   EDUC
     HLTH       .90
     CHILDS     .90   1.00
     HAP_GEN    .77    .82     .82
     INCOME98   .76    .83     .70    .83
     AGE        .90    .99     .81    .82   .99
     EDUC       .77    .82     .82    .70   .81   .822

     • Income coverage is low with educ, hlth, hap_gen • If income is “just” a control variable, find a substitute or impute • Over 50% of cases for all the combinations • Could be worse if you did 3-way (hlth, income, educ)
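
The proportions on this slide are the share of cases in which a variable, or a pair of variables, is observed together. A minimal Python sketch of that calculation follows; the records are hypothetical toy data and the `coverage` helper is an illustrative name, not part of any package.

```python
# Hypothetical toy records (not the data from the slides); None = missing.
records = [
    {"hlth": 3, "income98": 40, "educ": 12},
    {"hlth": 2, "income98": None, "educ": 16},
    {"hlth": None, "income98": 55, "educ": None},
    {"hlth": 4, "income98": None, "educ": 12},
    {"hlth": 1, "income98": 30, "educ": 14},
]


def coverage(rows, *names):
    """Proportion of rows in which every named variable is observed."""
    present = sum(1 for row in rows if all(row[n] is not None for n in names))
    return present / len(rows)


print("hlth:", coverage(records, "hlth"))
print("hlth & income98:", coverage(records, "hlth", "income98"))
print("hlth & income98 & educ:", coverage(records, "hlth", "income98", "educ"))
```

Passing three names gives the 3-way coverage the slide warns about, which can never exceed the smallest pairwise value among those variables.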

  10. Raw Data Missingness

  11. Missing Completely at Random (MCAR) • The missingness is random: the missingness indicators D1, D2, D3 are uncorrelated with anything! • Correlate variables with D1, D2, D3 (or use logistic regression) • Consider race, gender, age, education • None of these should be correlated with D1, D2, or D3 • This is not correlating variables with the raw score!
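
The check described on this slide can be sketched as follows: build a 0/1 indicator for whether a value is missing and correlate it with a fully observed variable. This is a rough illustration in Python, assuming D is coded 1 = missing; the ages and incomes are invented, and a real analysis would also test significance or fit a logistic regression.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data: income is missing more often for older respondents.
age = [25, 31, 38, 44, 52, 59, 63, 70]
income = [30, 42, None, 51, None, 48, None, None]

# D = 1 when income is missing, 0 when it is observed.
d_income = [1 if value is None else 0 for value in income]

r = pearson(age, d_income)
print("correlation of age with missingness of income:", round(r, 3))
```

Note that the correlation is with the indicator, not with the income scores themselves. A clearly nonzero value, as in this toy example, argues against MCAR and marks age as a candidate auxiliary variable.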

  12. Missing at Random (MAR) • The missingness is random after you control for the variables in your analysis and the auxiliary variables • Probability of missingness NOT dependent on unobserved variables • Correlate variables with D1, D2, D3 • Consider auxiliary variables: race, gender, age, education

  13. Missing at Random (MAR) • Include auxiliary variables as mechanisms for missingness • If they are correlated significantly with the missingness indicators D1, D2, D3, the data are MAR after controlling for the auxiliary variables • Auxiliary variables are available in many datasets

  14. Problem with Traditional Approaches. Listwise deletion (standard default) • It excludes many observations (50%?) • An observation may be missing only one variable, and that variable may not be important • In longitudinal program evaluations, you lose those with a low level of implementation • If MCAR, this reduces power but is unbiased • Without MCAR, this is biased • Political Science Journal: 50% deleted

  15. Problem with Traditional Approaches. Mean Substitution • Mean is often a bad estimate • Attenuates variance • Reduces effects for variables with missing data, or • Exaggerates effects for variables with little missing data • Reduces R²
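
The variance attenuation named on this slide is easy to see with a small worked example. The Python sketch below uses made-up scores: five observed values and three missing values filled in with the observed mean.

```python
from statistics import mean, variance

# Hypothetical toy scores: five observed values, three values missing.
observed = [2, 4, 4, 6, 9]
n_missing = 3

# Mean substitution: every missing value is replaced by the observed mean.
filled = observed + [mean(observed)] * n_missing

print("variance of observed values:", variance(observed))
print("variance after mean substitution:", variance(filled))
```

The filled-in values add nothing to the sum of squared deviations but do add to the sample size, so the variance can only shrink, and it shrinks more as the share of missing values grows.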

  16. Problem with Traditional Approaches. Pairwise Deletion (rarely used) • Each correlation is based on a different subsample • Set of correlations, but no single sample • May not be able to invert the matrix • What is the right sample size? • If it works, usually better than mean substitution or listwise deletion

  17. Problem with Traditional Approaches. Ordinary regression imputation • Multiple regression is used to predict the missing score • Predicted value will have no new information if the predictors are in your model (collinearity) • Does nothing about the uncertainty of predictions • If R² = .90, the predicted value is good • If R² = .10, the predicted value has a lot of noise • Thus, predicted values are “too good”

  18. Problem with Traditional Approaches. Single Imputation (SPSS Module) (MAR) • American Statistician article: done incorrectly • Single imputation does not incorporate variability between multiple imputations • Reviewers for many journals are not aware of the limitations of single imputation, so . . . • Easy to implement using SPSS

  19. Modern Approaches. Multiple Imputation (assumes MAR) • Imputation is done 5 to 20 times • Model is estimated 5 to 20 times • Estimates (R’s, B’s, Betas) are averaged • Standard errors incorporate the variance between solutions • Reflects uncertainty of the process • Always better than single imputation

  20. Modern Approaches. Multiple Imputation • Available with the best statistical packages: Stata, SAS • Available with freeware programs that work in conjunction with statistical packages: Norm, Amelia, IVEware, Mice

  21. Modern Approaches. Full Information Maximum Likelihood (FIML) • Assumes MAR • Uses all available information • Assumes patterns same if no missing • Results similar to multiple imputation • Available with SEM programs: Mplus, LISREL, AMOS, EQS

  22. Modern Approaches. Full Information Maximum Likelihood • Easy changes in SEM programs will do this • Researchers rarely include auxiliary variables • Researchers rarely include covariates unless they are in the model • Possible to add auxiliary/predictor variables • Mplus allows for both FIML estimation and multiple imputation, so it is nice to compare results

  23. How Multiple Imputation Works: Non-technical Explanation • All variables may have some missing values, including the DV • Eliminate observations with missing values on all variables • A missing wave of a panel is just missing values • Estimate covariance matrix (listwise) • Regress xi on the remaining variables

  24. How Multiple Imputation Works • Add a residual based on the strength of prediction • R² = .90: add small error • R² = .10: add big error • You now have an actual or imputed value for all observations on all variables • Estimate a covariance matrix • This covariance matrix should be “better” because it utilizes more information
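
The "add a residual" step can be sketched with a single predictor. This Python example is a simplified illustration under stated assumptions, not the algorithm of any particular package: the data are invented, the helper names are hypothetical, and only residual noise is drawn, whereas a full multiple-imputation routine would typically also reflect uncertainty in the regression coefficients.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Least-squares line y = a + b*x plus the residual standard deviation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = sqrt(sum(e * e for e in residuals) / (n - 2))
    return intercept, slope, resid_sd


def impute_with_noise(xs, ys, rng):
    """Fill each missing y with its predicted value plus a random residual."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in pairs], [y for _, y in pairs])
    return [y if y is not None else a + b * x + rng.gauss(0, sd)
            for x, y in zip(xs, ys)]


# Hypothetical toy data: income is missing for two respondents.
educ = [10, 12, 12, 14, 16, 16, 18, 20]
income = [22, 30, None, 35, 41, None, 52, 55]

completed = impute_with_noise(educ, income, random.Random(2007))
print([round(v, 1) for v in completed])
```

A strong prediction leaves a small residual standard deviation, so little noise is added; a weak prediction leaves a large one, which is the R² = .90 versus R² = .10 contrast on the slide.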

  25. How Multiple Imputation Works • If covariance matrices are different • Repeat process until successive covariance matrices are virtually identical • This provides first imputed dataset • Repeat this process m times • Results: m imputed datasets with no missing values

  26. How Multiple Imputation Works • Estimate your model with each of your m imputed datasets • Combine the results using Rubin’s rules • Parameter estimates: mean of their m values • Standard errors: inflate the mean of the standard errors based on how much the solutions vary • Standard errors (hence t-tests) will be unbiased if the data are MAR
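
Rubin's rules for one parameter can be written in a few lines. In the Python sketch below the function name and the five estimates are made up for illustration. The pooled estimate is the mean of the m estimates, and the total variance is the mean squared standard error (within) plus (1 + 1/m) times the variance of the estimates across imputations (between).

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine one parameter's m estimates and standard errors (Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # spread across imputations
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical results from m = 5 imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print("pooled estimate:", round(pooled_estimate, 3))
print("pooled standard error:", round(pooled_se, 4))
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates differ across imputations, which is how the uncertainty of the imputation process enters the t-tests.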

  27. How FIML is Implemented: Mplus
      Title: Missing values including mechanisms
      Data: File is miss_systematic-999.dat ;
      Variables: Names are childs satfin male hap_gen ident income98 educ hlth age ;
        Missing are all (-999) ;
        Usevariables are hlth childs hap_gen income98 age educ satfin male ;
      Analysis: Type = missing ;   ! without this you get listwise deletion

  28. FIML: Mplus Example
      Model:
        hlth on childs hap_gen income98 age educ ;
        satfin on childs hap_gen income98 age educ ;
        male on childs hap_gen income98 age educ ;
      Output:
        standardized ;
      • The “hlth” and “satfin” lines are the model • The “male” line is a nonsense equation that includes any covariates or auxiliary variables

  29. Freeware Dedicated Packages

  30. Commercial Statistical Packages

  31. Commercial FIML Packages

  32. Web Pages for Selected Software • Amelia gking.harvard.edu/amelia/ • IVEware http://www.isr.umich.edu/src/smp/ive/ • Norm http://www.stat.psu.edu/~jls/misoftwa.html#aut • MX www.vcu.edu/mx/ • SPSS www.spss.com • www.mvsoft.com/ • LISREL http://www.ssicentral.com/hlm/index.html • Mplus www.statmodel.com • SAS www.sas.com • Stata www.stata.com
