
Todd D. Little University of Kansas Director, Quantitative Training Program

On the Merits of Planning and Planning for Missing Data* *You’re a fool for not using planned missing data design. Todd D. Little University of Kansas Director, Quantitative Training Program Director, Center for Research Methods and Data Analysis


Presentation Transcript


  1. On the Merits of Planning and Planning for Missing Data*
     *You’re a fool for not using planned missing data design
     Todd D. Little, University of Kansas
     Director, Quantitative Training Program
     Director, Center for Research Methods and Data Analysis
     Director, Undergraduate Social and Behavioral Sciences Methodology Minor
     Member, Developmental Psychology Training Program
     crmda.KU.edu
     Workshop presented 3-7-2012 @ Society for Research in Adolescence
     Special Thanks to: Mijke Rhemtulla & Wei Wu

  2. Road Map
     • Learn about the different types of missing data
     • Learn about ways in which the missing data process can be recovered
     • Understand why imputing missing data is not cheating
     • Learn why NOT imputing missing data is more likely to lead to errors in generalization!
     • Learn about intentionally missing designs
     • Introduce a simple method for significance testing
     • Discuss imputation with large longitudinal datasets

  3. Key Considerations
     • Recoverability
       • Is it possible to recover what the sufficient statistics would have been if there were no missing data? (sufficient statistics = means, variances, and covariances)
       • Is it possible to recover what the parameter estimates of a model would have been if there were no missing data?
     • Bias
       • Are the sufficient statistics/parameter estimates systematically different from what they would have been had there not been any missing data?
     • Power
       • Do we have the same or similar rates of power (1 – Type II error rate) as we would without missing data?

  4. Effects of imputing missing data

  5. Types of Missing Data
     • Missing Completely at Random (MCAR)
       • No association with unobserved variables (selective process) and no association with observed variables
     • Missing at Random (MAR)
       • No association with unobserved variables, but may be related to observed variables
       • Random in the statistical sense of predictable
     • Non-random (Selective) Missing (MNAR)
       • Some association with unobserved variables and maybe with observed variables
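The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed values: the slope of 0.6, the cutoffs of 0.25 and 0.30, the 40% MCAR rate, and the sample size are hypothetical choices, not taken from the slides. It compares the complete-case mean of y (true mean 0) under each mechanism.

```python
import random
import statistics

random.seed(2012)
n = 20000  # hypothetical sample size, large so the pattern is easy to see

# x is always observed; y is the variable that goes missing.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + random.gauss(0, 1) for xi in x]  # true mean of y is 0


def observed_mean(keep):
    """Mean of y over the cases flagged as observed."""
    return statistics.fmean(yi for yi, k in zip(y, keep) if k)


# MCAR: missingness is a coin flip, unrelated to x or y.
mcar = [random.random() > 0.4 for _ in range(n)]
# MAR: missingness depends only on the observed variable x.
mar = [xi <= 0.25 for xi in x]
# MNAR: missingness depends on the (unobserved) value of y itself.
mnar = [yi <= 0.30 for yi in y]

means = {
    name: observed_mean(keep)
    for name, keep in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]
}
for name, value in means.items():
    print(f"{name}: complete-case mean of y = {value:+.3f}")
```

In this setup the complete-case mean stays close to 0 under MCAR and is pulled below 0 under both MAR and MNAR. The difference is that under MAR the cause of missingness (x) is in the data set and can be used to recover the estimate.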

  6. Effects of imputing missing data

  7. Effects of imputing missing data
     Statistical Power: Will always be greater when missing data is imputed!

  8. Modern Missing Data Analysis: MI or FIML
     • In 1978, Rubin proposed Multiple Imputation (MI)
       • An approach especially well suited for use with large public-use databases.
       • First suggested in 1978 and developed more fully in 1987.
       • MI primarily uses the Expectation Maximization (EM) algorithm and/or the Markov Chain Monte Carlo (MCMC) algorithm.
     • Beginning in the 1980s, likelihood approaches developed.
       • Multiple group SEM
       • Full Information Maximum Likelihood (FIML).
       • An approach well suited to more circumscribed models

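  9. Full Information Maximum Likelihood
     • FIML maximizes the casewise log-likelihood (equivalently, minimizes the -2 log-likelihood) of the available data, using an individual mean vector and covariance matrix for every observation.
     • Since each observation’s mean vector and covariance matrix is based on its own unique response pattern, there is no need to fill in the missing data.
     • The individual log-likelihoods are then summed to create a combined likelihood function for the whole data frame.
     • Individual likelihood functions with greater amounts of missing data are given less weight in the final combined likelihood function than those with a more complete response pattern, thus controlling for the loss of information.
     • Formally, the function that FIML is maximizing is
       $\log L = \sum_{i=1}^{N} \log L_i$, where $\log L_i = K_i - \tfrac{1}{2}\log|\Sigma_i| - \tfrac{1}{2}(x_i - \mu_i)^{\top}\Sigma_i^{-1}(x_i - \mu_i)$,
       with $x_i$ the observed scores for case $i$, $\mu_i$ and $\Sigma_i$ the mean vector and covariance matrix restricted to the variables observed for that case, and $K_i$ a constant that depends on the number of observed variables.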
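The casewise idea can be sketched for two variables. This is an illustrative sketch only, assuming bivariate normal data; the function names and the None-for-missing convention are hypothetical. It evaluates the log-likelihood at given values of the mean vector and covariance matrix and does not perform the maximization itself.

```python
import math


def casewise_loglik(x, mu, sigma):
    """Log-likelihood of one case under a bivariate normal model,
    using only the entries of x that are observed (None = missing)."""
    obs = [j for j, v in enumerate(x) if v is not None]
    if not obs:
        return 0.0  # a fully missing case contributes nothing
    if len(obs) == 1:
        # Only one variable observed: univariate normal density.
        j = obs[0]
        var = sigma[j][j]
        d = x[j] - mu[j]
        return -0.5 * (math.log(2 * math.pi) + math.log(var) + d * d / var)
    # Both variables observed: bivariate normal density.
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    a, b, c = sigma[0][0], sigma[0][1], sigma[1][1]
    det = a * c - b * b
    quad = (c * d0 * d0 - 2 * b * d0 * d1 + a * d1 * d1) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)


def fiml_loglik(data, mu, sigma):
    """Sum of the casewise log-likelihoods over all cases."""
    return sum(casewise_loglik(x, mu, sigma) for x in data)


mu = [0.0, 0.0]
sigma = [[1.0, 0.0], [0.0, 1.0]]
complete = casewise_loglik([0.0, 0.0], mu, sigma)
partial = casewise_loglik([0.0, None], mu, sigma)
total = fiml_loglik([[0.0, 0.0], [0.0, None]], mu, sigma)
print(complete, partial, total)
```

A case with one missing score contributes a term based on fewer variables, which is the sense in which incomplete response patterns carry less weight in the combined function.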

  10. Multiple Imputation
      • Multiple imputation involves generating m imputed datasets (usually between 20 and 100), running the analysis model on each of these datasets, and combining the m sets of results to make inferences.
      • By filling in m separate estimates for each missing value, we can account for the uncertainty in that datum’s true population value.
      • Data sets can be generated in a number of ways, but the two most common approaches are through an MCMC simulation technique such as Tanner & Wong’s (1987) Data Augmentation algorithm or through bootstrapping likelihood estimates, such as the bootstrapped EM algorithm used by Amelia II.
      • SAS uses data augmentation to pull random draws from a specified posterior distribution (i.e., stationary distribution of EM estimates).
      • After m data sets have been created and the analysis model has been run on each separately, the resulting estimates are commonly combined with Rubin’s Rules (Rubin, 1987).
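The combining step can be sketched as follows for a single parameter. This is a minimal illustration of Rubin's Rules; the function name and the example numbers are hypothetical, and the inputs are assumed to be the m point estimates and their squared standard errors from the m analyses.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m estimates of one parameter with Rubin's Rules.

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    Returns (pooled estimate, within variance, between variance, total variance).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)   # pooled point estimate
    w = statistics.fmean(variances)       # within-imputation variance
    b = statistics.variance(estimates)    # between-imputation variance (n - 1 divisor)
    t = w + (1 + 1 / m) * b               # total variance
    return q_bar, w, b, t


# Hypothetical results from m = 3 imputed data sets.
q_bar, w, b, t = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(f"estimate = {q_bar:.3f}, SE = {t ** 0.5:.3f}")
```

The pooled standard error is the square root of the total variance, so disagreement among the imputed data sets (the between-imputation variance) widens the interval.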

  11. Fraction Missing
      • Fraction Missing is a measure of efficiency lost due to missing data. It is the extent to which parameter estimates have greater standard errors than they would have had all data been observed.
      • It is a ratio of variances involving:
        • Estimated parameter variance in the complete data set
        • Between-imputation variance

  12. Fraction Missing
      • Fraction of Missing Information (asymptotic formula)
      • Varies by parameter in the model
      • Is typically smaller for MCAR than MAR data
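The formula on the slide is not preserved in this transcript. As a hedged sketch, the code below uses one commonly cited form based on Rubin (1987): the share of the total variance that is due to between-imputation variance, and its limit as the number of imputations grows. Whether this matches the slide's exact expression is an assumption, and the function names are hypothetical.

```python
def fraction_missing(within, between, m):
    """Share of total variance attributable to missing data, for m imputations.

    within:  average within-imputation variance of the estimate
    between: between-imputation variance of the estimate
    """
    total = within + (1 + 1 / m) * between
    return (1 + 1 / m) * between / total


def fraction_missing_asymptotic(within, between):
    """Limit of fraction_missing as the number of imputations grows large."""
    return between / (within + between)


# Hypothetical variance components for one parameter.
fmi_m3 = fraction_missing(within=0.5, between=1.0, m=3)
fmi_inf = fraction_missing_asymptotic(within=0.5, between=1.0)
print(f"m = 3: {fmi_m3:.3f}; large m: {fmi_inf:.3f}")
```

Because the variance components differ from one parameter to the next, the ratio is computed separately for each parameter in the model.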

  13. Estimate Missing Data With SAS

      Obs  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6
        1     65     95     95    100     23     25     25     27
        2     10     10     40     25     25     27     28     27
        3     95    100    100    100     27     29     29     28
        4     90    100    100    100     30     30     27     29
        5     30     80     90    100     23     29     29     30
        6     40     50      .      .     28     27      3      3
        7     40     70    100     95     29     29     30     30
        8     95    100    100    100     28     30     29     30
        9     50     80     75     85     26     29     27     25
       10     55    100    100    100     30     30     30     30
       11     50    100    100    100     30     27     30     24
       12     70     95    100    100     28     28     28     29
       13    100    100    100    100     30     30     30     30
       14     75     90    100    100     30     30     29     30
       15      0      5     10      .      3      3      3      .

  14. PROC MI

      PROC MI data=sample out=outmi
              seed=37851
              nimpute=100;
        EM maxiter=1000;
        MCMC initial=em(maxiter=1000);
        VAR BADL0 BADL1 BADL3 BADL6 MMSE0 MMSE1 MMSE3 MMSE6;
      run;

      • out= Designates output file for imputed data
      • nimpute= # of imputed datasets (default is 5)
      • Var Variables to use in imputation

  15. PROC MI output: Imputed dataset

      Obs  _Imputation_  BADL0  BADL1  BADL3  BADL6  MMSE0  MMSE1  MMSE3  MMSE6
        1             1     65     95     95    100     23     25     25     27
        2             1     10     10     40     25     25     27     28     27
        3             1     95    100    100    100     27     29     29     28
        4             1     90    100    100    100     30     30     27     29
        5             1     30     80     90    100     23     29     29     30
        6             1     40     50     21     12     28     27      3      3
        7             1     40     70    100     95     29     29     30     30
        8             1     95    100    100    100     28     30     29     30
        9             1     50     80     75     85     26     29     27     25
       10             1     55    100    100    100     30     30     30     30
       11             1     50    100    100    100     30     27     30     24
       12             1     70     95    100    100     28     28     28     29
       13             1    100    100    100    100     30     30     30     30
       14             1     75     90    100    100     30     30     29     30
       15             1      0      5     10      8      3      3      3      2

  16. What to Say to Reviewers:
      • I pity the fool who does not impute
        • Mr. T
      • If you compute you must impute
        • Johnnie Cochran
      • Go forth and impute with impunity
        • Todd Little
      • If math is God’s poetry, then statistics are God’s elegantly reasoned prose
        • Bill Bukowski

  17. 3-Form Intentionally Missing Design

  18. Three-form design
      • What goes in the Common Set?

  19. Three-form design: Example
      • 21 questions made up of 7 3-question subtests

  20. Three-form design: Example
      • Common Set (X)

  21. Three-form design: Example
      • Common Set (X)

  22. Three-form design: Example
      • Set A
        • I start conversations.
        • I get stressed out easily.
        • I am always prepared.
        • I have a rich vocabulary.
        • I am interested in people.

  23. Three-form design: Example
      • Set B
        • I am the life of the party.
        • I get irritated easily.
        • I like order.
        • I have excellent ideas.
        • I have a soft heart.

  24. Three-form design: Example
      • Set C
        • I am comfortable around people.
        • I have frequent mood swings.
        • I pay attention to details.
        • I have a vivid imagination.
        • I take time out for others.
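Assignment to forms can be sketched in a few lines. The table that defines the forms is not preserved in this transcript, so the layout below is an assumption based on the usual three-form design (Graham, Taylor, Olchowski, & Cumsille, 2006): every form carries the common set X plus two of the three sets A, B, and C. All names in the code are hypothetical.

```python
import random

# Assumed layout: each form contains X and omits exactly one of A, B, C.
FORMS = {
    1: ("X", "A", "B"),  # set C is missing by design
    2: ("X", "A", "C"),  # set B is missing by design
    3: ("X", "B", "C"),  # set A is missing by design
}
ALL_SETS = ("X", "A", "B", "C")


def assign_forms(n_participants, seed=0):
    """Randomly assign each participant to one of the three forms."""
    rng = random.Random(seed)
    return [rng.choice(sorted(FORMS)) for _ in range(n_participants)]


def missing_set(form):
    """Return the item set that a given form leaves out."""
    return next(s for s in ALL_SETS if s not in FORMS[form])


forms = assign_forms(9)
for pid, form in enumerate(forms, start=1):
    print(f"participant {pid}: form {form}, sets {FORMS[form]}, "
          f"missing {missing_set(form)}")
```

Because assignment is random, the missingness is MCAR by design, and every pair of item sets is still observed together on one form, so all covariances remain estimable.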

  26. Missing Data and Estimation: Missingness by Design
      • Assess all persons, but not all variables at each time of measurement
        • McArdle, Graham
      • Have core battery for all participants, but divide sample into groups and each group has additional measures
      • Control entry into study, to estimate and control for retesting effects
      • Randomly assign participants to their entry into a longitudinal study and to the occasions of assessment
        • Likely to be key in providing unbiased estimates of growth or change

  27. Expansions of 3-Form Design
      • (Graham, Taylor, Olchowski, & Cumsille, 2006)

  28. Expansions of 3-Form Design
      • (Graham, Taylor, Olchowski, & Cumsille, 2006)

  29. 2-Method Planned Missing Design

  30. 2-Method Planned Missing Design
      • Use when you have an ideal (highly valid) measure that is time-consuming or expensive
      • By supplementing this measure with a less expensive or time-consuming measure, it is possible to increase total sample size and get higher power
      • e.g., measuring stress
        • Expensive measure = collect spit samples, measure cortisol
        • Inexpensive measure = survey querying stressful thoughts
      • e.g., measuring intelligence
        • Expensive measure = WAIS IQ scale
        • Inexpensive measure = multiple choice IQ test
      • e.g., measuring smoking
        • Expensive measure = carbon monoxide measure
        • Inexpensive measure = self-report

  31. 2-Method Planned Missing Design
      • Assumptions:
        • expensive measure is unbiased (i.e., valid)
        • inexpensive measure is systematically biased
      • Using both measures (on a subset of participants) enables us to estimate and remove the bias from the inexpensive measure (for all participants)
      • As the inexpensive measure gets more valid, fewer observations are needed on the expensive measure
      • If inexpensive measure is perfectly unbiased, we don’t need the expensive measure at all!

  32. 2-Method Planned Missing Design
      • All participants get the inexpensive measure
      • Only a subset get the expensive measure
      • Cost:

  33. 2-Method Planned Missing Design
      • Holding cost constant, as N_total increases, N_expensive decreases
      • As N_total increases, SEs begin to decrease (power increases); as N_total continues to increase, SEs increase again, driving power back down
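The trade-off between N_total and N_expensive under a fixed budget can be sketched as below. The cost expression from the slide is not preserved, so a simple linear cost is assumed here: everyone receives the inexpensive measure and the remaining budget buys expensive measurements. The budget and unit costs are hypothetical numbers.

```python
def n_expensive(budget, n_total, cost_cheap, cost_expensive):
    """Number of participants who can also receive the expensive measure.

    Assumes total cost = n_total * cost_cheap + n_expensive * cost_expensive.
    """
    remaining = budget - n_total * cost_cheap
    if remaining < 0:
        raise ValueError("budget cannot cover the inexpensive measure for everyone")
    # Cannot give the expensive measure to more people than are in the study.
    return min(n_total, int(remaining // cost_expensive))


# Hypothetical budget of 10,000 with unit costs of 5 (cheap) and 50 (expensive).
for n_total in range(200, 2001, 300):
    print(n_total, n_expensive(10_000, n_total, 5, 50))
```

Running the loop shows N_expensive shrinking as N_total grows. Locating the point where standard errors are smallest additionally requires a model for the two measures, which this sketch does not attempt.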

  34. 2-Method Planned Missing Design
      [Figure: path diagram with nodes labeled Self-Report Bias, Self-Report 1, Self-Report 2, CO, Cotinine, and Smoking]

  35. 2-Method Planned Missing Design
      • Goal: find the sweet spot!

  36. Longitudinal methods
      • Rather than specific items missing, longitudinal planned missing designs tend to focus on whole waves missing for individual participants
      • Researchers have long turned complete data into planned missing data with more time points
        • e.g., data at 3 grades transformed into 8 ages

  37. Age by grade

      student  K     1     2
      1        5;6   6;7   7;3
      2        5;3   6;0   7;4
      3        4;9   5;11  6;10
      4        4;6   5;5   6;4
      5        4;11  5;9   6;10
      6        5;7   6;7   7;5
      7        5;2   6;1   7;3
      8        5;4   6;5   7;6

      Age bins: 4;6-4;11, 5;0-5;5, 5;6-5;11, 6;0-6;5, 6;6-6;11, 7;0-7;5, 7;6-7;11

  38. The same data arranged by age
      • Out of 3 waves, we create 7 waves of data with high missingness
      • Allows for more fine-tuned age-specific growth modeling
      • Even high amounts of missing data are not typically a problem for estimation

      student  4;6-4;11  5;0-5;5   5;6-5;11  6;0-6;5   6;6-6;11  7;0-7;5   7;6-7;11
      1        .         .         5;6       .         6;7       7;3       .
      2        .         5;3       .         6;0       .         7;4       .
      3        4;9       .         5;11      .         6;10      .         .
      4        4;6       5;5       .         6;4       .         .         .
      5        4;11      .         5;9       .         6;10      .         .
      6        .         .         5;7       .         6;7       7;5       .
      7        .         5;2       .         6;1       .         7;3       .
      8        .         5;4       .         6;5       .         .         7;6
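The restructuring from grades to age bins can be sketched as follows. The ages are the eight students' values from the slide; reading "5;6" as years;months and using six-month bins is an interpretation of the column labels, and the function and variable names are hypothetical.

```python
def age_bin(age):
    """Map an age written as 'years;months' to its six-month bin label."""
    years, months = (int(part) for part in age.split(";"))
    start = 0 if months < 6 else 6
    return f"{years};{start}-{years};{start + 5}"


# Ages at grades K, 1, and 2 for the eight students on the slide.
ages_by_student = {
    1: ["5;6", "6;7", "7;3"],
    2: ["5;3", "6;0", "7;4"],
    3: ["4;9", "5;11", "6;10"],
    4: ["4;6", "5;5", "6;4"],
    5: ["4;11", "5;9", "6;10"],
    6: ["5;7", "6;7", "7;5"],
    7: ["5;2", "6;1", "7;3"],
    8: ["5;4", "6;5", "7;6"],
}

# One row per student, keyed by age bin; bins a student never hits are missing.
wide = {
    student: {age_bin(a): a for a in ages}
    for student, ages in ages_by_student.items()
}

# Order the bins by their starting age (years, months).
bins = sorted(
    {b for row in wide.values() for b in row},
    key=lambda b: tuple(int(p) for p in b.split("-")[0].split(";")),
)

for student, row in wide.items():
    print(student, [row.get(b, ".") for b in bins])
```

Each student still has three observed occasions, but they now sit in a seven-column layout where the remaining cells are missing by design.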

  39. Growth-Curve Design

  40. Growth Curve Design II

  41. Growth Curve Design II

  42. Efficiency of Planned Missing Designs

  43. Combined Elements

  44. The Sequential Designs

  45. Transforming to Accelerated Longitudinal

  46. Transforming to Episodic Time

  47. The Impact of Auxiliary Variables
      • Consider the following Monte Carlo simulation:
        • 60% MAR (i.e., Aux1) missing data
        • 1,000 samples of N = 100

  48. Excluding A Correlate of Missingness

  49. Figure 3. Simulation Results Showing the Bias Associated with Omitting a Correlate of Missingness.
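A simulation of this kind can be sketched as below. Only the 60% MAR rate, the 1,000 replications, and N = 100 come from the slides. Everything else is an assumption made for illustration: the slope of 0.5, the rule that the highest Aux1 scores lose their y value, and the use of a simple regression-based estimate as a stand-in for MI or FIML with an auxiliary variable. The estimated quantity here is the mean of y, whose true value is 0.

```python
import random
import statistics


def one_replication(rng, n=100, prop_missing=0.6, slope=0.5):
    """Return (estimate omitting aux, estimate using aux) for the mean of y."""
    aux = [rng.gauss(0, 1) for _ in range(n)]
    y = [slope * a + rng.gauss(0, 1) for a in aux]  # true mean of y is 0

    # MAR: y is missing for the cases with the highest aux scores.
    order = sorted(range(n), key=lambda i: aux[i])
    observed = order[: n - int(round(prop_missing * n))]
    a_obs = [aux[i] for i in observed]
    y_obs = [y[i] for i in observed]

    # Omitting the correlate of missingness: complete-case mean.
    cc_mean = statistics.fmean(y_obs)

    # Including it: regress y on aux among observed cases, then
    # evaluate the fitted line at the mean of aux over all cases.
    a_bar, y_bar = statistics.fmean(a_obs), statistics.fmean(y_obs)
    sxy = sum((a - a_bar) * (v - y_bar) for a, v in zip(a_obs, y_obs))
    sxx = sum((a - a_bar) ** 2 for a in a_obs)
    b = sxy / sxx
    aux_mean = (y_bar - b * a_bar) + b * statistics.fmean(aux)
    return cc_mean, aux_mean


rng = random.Random(48)
results = [one_replication(rng) for _ in range(1000)]
bias_without_aux = statistics.fmean(r[0] for r in results)
bias_with_aux = statistics.fmean(r[1] for r in results)
print(f"bias omitting Aux1:  {bias_without_aux:+.3f}")
print(f"bias including Aux1: {bias_with_aux:+.3f}")
```

Under these assumptions the estimate that ignores Aux1 is biased downward on average, while the estimate that uses Aux1 averages close to the true value, at the price of more sampling variability in any single sample.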
