
Multiple Imputation



  1. Multiple Imputation Julia Kozlitina Steve Robertson April 26, 2006

  2. Outline • Multiple Imputation (MI) • How to impute (i.e. how to fill in values) • How to analyze and draw inferences • How many times to impute • Alternatives to MI • Applications • Software

  3. Multiple Imputation • Idea: replace each missing item with two or more acceptable values, representing a distribution of possibilities (Rubin, 1987). • This results in m complete datasets; each is analyzed using standard methods, and the estimated parameters are averaged. • Imputations can often be generated from simple modifications of existing single-imputation methods such as hot-deck or regression.

  4. Dataset with m imputations • The slide shows an N × k data matrix (N units in the survey, k variables) in which each missing entry is replaced by a row vector of m imputed values; the first component is drawn under the model for the 1st imputation, the second under the model for the 2nd imputation, and so on through the model for the mth imputation. • MI is most useful when the fraction of values missing is not excessive and when m is modest (say 2 to 10).
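A minimal sketch of how such a multiply-imputed dataset might be stored, assuming a NumPy layout of m completed copies of the N × k matrix; the variable names and the naive donor draw are illustrative only, not the deck's method:

```python
import numpy as np

# Toy multiply-imputed dataset: N units, k variables, m imputations.
rng = np.random.default_rng(0)
N, k, m = 5, 3, 3

data = rng.normal(size=(N, k))
data[1, 2] = np.nan                    # two originally missing cells
data[3, 0] = np.nan
missing = np.isnan(data)

# Keep m completed copies; each missing cell thus carries a length-m
# row vector of imputed values (here drawn naively from observed values
# in the same column, purely to illustrate the storage layout).
completed = np.repeat(data[None, :, :], m, axis=0)    # shape (m, N, k)
for j in range(k):
    donors = data[~missing[:, j], j]
    for t in range(m):
        completed[t, missing[:, j], j] = rng.choice(donors, missing[:, j].sum())
```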

  5. Advantages: • Allows the use of standard complete-data methods • Can incorporate the data collector’s knowledge to reflect the uncertainty about imputed values (both sampling variability and uncertainty about the reasons for nonresponse) • Increases the efficiency of estimation • Provides valid inferences (including valid variance estimators) under an assumed model for nonresponse • Allows one to study sensitivity to various nonresponse models

  6. Disadvantages: • More work is needed to generate multiple imputations – though this is often not difficult using an existing single-imputation scheme • More space is needed to store the data • More work is required to analyze the data (not serious when m is modest) – often not difficult with standard statistical programs

  7. How to fill in the values: • BAYESIAN PERSPECTIVE (Rubin, 1987): draw multiple imputations that simulate the Bayesian posterior distribution of the missing values, that is, the conditional distribution of the missing data given the observed data, $P(Y_{mis} \mid Y_{obs}, I)$, where Y_obs = the set of observed values, Y_inc = the set of values for units included in the sample (comprising Y_obs and Y_mis), and I = an indicator for inclusion

  8. How to fill in the values: • Impose a probability model on the complete data and the nonresponse mechanism (e.g., a normal regression or loglinear model) • Create imputations through a two-step Bayesian process: • Specify prior distributions and draw the unknown model parameters from their posterior, and • Simulate m independent draws from the conditional distribution of the missing data given the observed data and the drawn parameters
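The sketch below illustrates this two-step recipe for one simple case, a normal linear regression imputation model with a noninformative prior; the function name, arguments, and prior choice are assumptions made for illustration, not the slide's specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def bayes_regression_imputations(X_obs, y_obs, X_mis, m=5):
    """Two-step Bayesian draws for a normal regression imputation model.

    Step 1: draw (sigma^2, beta) from their joint posterior under a
            noninformative prior.
    Step 2: draw the missing y's from the predictive distribution given
            the drawn parameters.
    Returns a list of m imputation vectors, one per completed dataset.
    """
    n, p = X_obs.shape
    xtx_inv = np.linalg.inv(X_obs.T @ X_obs)
    beta_hat = xtx_inv @ X_obs.T @ y_obs
    resid = y_obs - X_obs @ beta_hat
    s2 = resid @ resid / (n - p)

    draws = []
    for _ in range(m):
        sigma2 = s2 * (n - p) / rng.chisquare(n - p)                 # step 1
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)   # step 1
        y_mis = X_mis @ beta + rng.normal(0.0, np.sqrt(sigma2), len(X_mis))  # step 2
        draws.append(y_mis)
    return draws
```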

  9. How to fill in the values: • This requires deriving the posterior distribution; in simple problems, closed-form solutions exist • In more complex applications, one must rely on special computational techniques such as Markov chain Monte Carlo (MCMC) • Other possibilities: the approximate Bayesian bootstrap (Rubin, 1987), or modeling propensity scores to form sampling groups (Lavori et al., 1995)

  10. Approximate Bayesian Bootstrap (ABB) • Draw n1 values randomly with replacement from Y_obs to form Y*_obs (i.e., create a hot deck) • Draw the n0 = n − n1 components of Y_mis randomly with replacement from Y*_obs • See Rubin (1987) for details on the Bayesian Bootstrap (BB), p. 44, and the Approximate Bayesian Bootstrap (ABB), p. 124
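A short NumPy sketch of these two ABB steps; the array values and the function name are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def abb_imputation(y_obs, n_missing):
    """One Approximate Bayesian Bootstrap draw (Rubin, 1987, p. 124).

    Step 1: resample the n1 observed values with replacement to form the
            donor pool Y*_obs (the hot deck, which propagates parameter
            uncertainty).
    Step 2: draw the n0 missing values with replacement from that pool.
    """
    y_star = rng.choice(y_obs, size=len(y_obs), replace=True)
    return rng.choice(y_star, size=n_missing, replace=True)

# m = 5 independent ABB imputations for n0 = 4 missing values
y_obs = np.array([3.1, 2.7, 3.9, 4.2, 2.5, 3.3])
imputations = [abb_imputation(y_obs, n_missing=4) for _ in range(5)]
```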

  11. Inference on combined estimates: • The combined estimate is the average of the m repeated complete-data estimates, $\bar{Q}_m = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i$ • Let $\bar{U}_m = \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i$ be the average of the m repeated complete-data variances, and $B_m = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{Q}_i - \bar{Q}_m)^2$ the variance between imputations • The total variance is approximately the sum of the two: $T_m = \bar{U}_m + \left(1 + \tfrac{1}{m}\right)B_m$

  12. Inference on combined estimates • Confidence intervals and significance tests can be computed using a t reference distribution with $\nu = (m-1)\left(1 + r_m^{-1}\right)^2$ degrees of freedom, where $r_m = (1 + m^{-1})B_m/\bar{U}_m$ is the relative increase in variance due to nonresponse (Rubin, Ch. 3)
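A minimal sketch of these combining rules for a scalar estimand, assuming SciPy is available for the t quantile; the function name and the example numbers are illustrative:

```python
import numpy as np
from scipy import stats

def pool_rubin(q_hats, u_hats):
    """Combine m complete-data estimates (q_hats) and their complete-data
    variances (u_hats) with Rubin's rules; return the pooled estimate,
    its total variance, and the t degrees of freedom."""
    q_hats, u_hats = np.asarray(q_hats, float), np.asarray(u_hats, float)
    m = len(q_hats)
    q_bar = q_hats.mean()               # combined point estimate
    u_bar = u_hats.mean()               # within-imputation variance
    b = q_hats.var(ddof=1)              # between-imputation variance
    t_var = u_bar + (1 + 1 / m) * b     # total variance
    r = (1 + 1 / m) * b / u_bar         # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2     # Rubin's degrees of freedom
    return q_bar, t_var, df

# 95% interval from five imputed-data analyses (made-up numbers)
q_bar, t_var, df = pool_rubin([1.02, 0.95, 1.10, 0.99, 1.05],
                              [0.040, 0.050, 0.042, 0.047, 0.044])
half = stats.t.ppf(0.975, df) * np.sqrt(t_var)
print(q_bar - half, q_bar + half)
```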

  13. How many imputations are needed? • Rubin (1987, p. 114) shows that the relative efficiency of a finite-m estimator, relative to one based on infinitely many imputations, is $(1 + \lambda/m)^{-1}$ (in units of variance), where $\lambda$ is the rate of missing information for the quantity being estimated. • For small $\lambda$, m = 2 or 3 is nearly fully efficient; a few values are computed in the sketch below.
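The slide's table of efficiencies can be recomputed directly from the formula; the grid of λ and m values below is arbitrary:

```python
# Relative efficiency (1 + lam/m)**-1 of a finite-m MI estimator,
# relative to one based on infinitely many imputations.
for lam in (0.1, 0.3, 0.5, 0.7):
    for m in (2, 3, 5, 10):
        print(f"lambda={lam:.1f}  m={m:2d}  RE={1 / (1 + lam / m):.3f}")
```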

  14. Problems • Difficulties with the MI variance estimator are discussed by Binder & Sun (1996), Fay (1996), and others • It gives inconsistent variance estimates under some simple conditions (improper imputation) • Kott (1995) observes that sampling weights must be used for both point and variance estimation in order to satisfy the conditions for being proper • Wang and Robins (1998) explore large-sample properties of MI estimators

  15. Alternatives • Advances have been made in drawing efficient and asymptotically valid inferences from single imputations • Shao (2002) and Rao (2000, 2005): a jackknife variance estimator for hot-deck imputation in which donors are selected with replacement, with selection probability proportional to the sampling weights • Kalton & Kish (1984), Fay (1996): fractionally weighted imputation – use more than one donor per recipient

  16. Fractionally Weighted Imputation • Idea: reduce imputation variance relative to single imputation • Fractional hot-deck imputation replaces each missing value with a set of imputed values and assigns a weight to each (Kim & Fuller, 2004) • Each imputed value receives a “fraction” of the original observation weight, and the fractions assigned to a given recipient’s donors sum to one
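A toy sketch of the idea for a single recipient, using equal fractions across donors; the function and the weights are illustrative and much simpler than the Kim & Fuller (2004) estimator:

```python
import numpy as np

rng = np.random.default_rng(2)

def fractional_hot_deck(y_obs, recipient_weight, n_donors=3):
    """Replace one missing value with n_donors donor values, each carrying
    an equal fraction of the recipient's original observation weight."""
    donors = rng.choice(y_obs, size=n_donors, replace=False)
    fractions = np.full(n_donors, recipient_weight / n_donors)  # sum to the weight
    return donors, fractions

donors, frac_weights = fractional_hot_deck(
    np.array([3.1, 2.7, 3.9, 4.2, 2.5]), recipient_weight=10.0)
```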

  17. Multiple Imputation Applications • SAS has recently developed a procedure for multiple imputation (first available in Version 8.1) • The approach requires the use of two procedures: PROC MI and PROC MIANALYZE

  18. MI Applications • Multiple imputation inference involves three distinct phases: 1. The missing data are filled in m times to generate m complete data sets (PROC MI) 2. The m complete data sets are analyzed with standard statistical procedures (PROC REG, PROC GLM, etc.) 3. The results from the m complete data sets are combined to produce inferential results (PROC MIANALYZE)
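Since the SAS code itself is in the handout, here is only a rough Python analogue of the same three phases: scikit-learn's IterativeImputer (with sample_posterior=True) stands in for phase 1, an OLS fit per completed dataset for phase 2, and Rubin's pooling rules (see the sketch after slide 12) for phase 3. This is a stand-in illustration, not the SAS procedures, and the function name and arguments are assumptions:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def mi_regression(X, y_col=0, m=5):
    """X: 2-D array with np.nan for missing values; column y_col is the
    response.  Returns per-imputation coefficient estimates and variances,
    ready to be pooled with Rubin's rules (phase 3)."""
    estimates, variances = [], []
    for i in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        completed = imputer.fit_transform(X)                     # phase 1
        y = completed[:, y_col]
        Z = sm.add_constant(np.delete(completed, y_col, axis=1))
        fit = sm.OLS(y, Z).fit()                                 # phase 2
        estimates.append(fit.params)
        variances.append(fit.bse ** 2)
    return np.array(estimates), np.array(variances)
```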

  19. Three Imputation Mechanisms (the choice depends on the missing-data pattern): • Regression Method – a regression model is fitted for each variable with missing values, with the preceding variables as covariates (monotone missing pattern) • Propensity Score Method – observations are grouped on propensity scores, and an approximate Bayesian bootstrap imputation is applied within each group (monotone missing pattern) • MCMC (Markov chain Monte Carlo) Method – constructs a Markov chain long enough for the distribution of its elements to stabilize at the desired posterior (arbitrary missing patterns, assuming MAR)
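Because the regression and propensity-score methods apply only to monotone patterns, one might first check whether the pattern is monotone for a given variable ordering; a small helper (illustrative, not part of SAS) could look like this:

```python
import numpy as np

def is_monotone(missing_mask):
    """True if the missing-data pattern is monotone for the given column
    order: once a variable is missing for a unit, every later variable is
    missing for that unit as well."""
    m = missing_mask.astype(int)
    return bool(np.all(np.diff(m, axis=1) >= 0))

data = np.array([[1.0, 2.0, np.nan],
                 [1.0, np.nan, np.nan],
                 [1.0, 2.0, 3.0]])
print(is_monotone(np.isnan(data)))   # True: monotone, so regression or
                                     # propensity-score imputation applies
```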

  20. Multiple Imputation Applications • See the handout of SAS code and output • The MI procedure is illustrated with a data set containing measurements on men running during a P.E. course at N.C. State University • Three variables of interest: Oxygen (oxygen intake per minute, ml/kg body weight), RunTime (time in minutes to run 1.5 miles), and RunPulse (heart rate while running)

  21. Conclusions: • Multiple imputation is a method of replacing missing values that has theoretical advantages over other methods • Software for multiple imputation is becoming more widely available, and the code is relatively simple

  22. Software Commercial: • SAS PROC MI • SOLAS for Missing Data Analysis (http://www.statsolusa.com/) Free: • MIX - Software for multiple imputation http://www.stat.psu.edu/~jls/misoftwa.html

  23. References
  • Binder, D.A., and Sun, W. (1996). Frequency valid multiple imputation for surveys with a complex design. Proceedings of the Section on Survey Research Methods, ASA, 281-286.
  • Fay, R.E. (1996). Alternative paradigms for the analysis of imputed survey data. JASA, 91, 490-498.
  • Kalton, G., and Kish, L. (1984). Some efficient random imputation methods. Communications in Statistics, A13, 1919-1939.
  • Kott, P.S. (1995). A paradox of multiple imputation. Proceedings, 384-389.
  • Kim, J., and Fuller, W.A. (2004). Fractional hot deck imputation. Biometrika, 91, 559-578.
  • Lavori, P.W., Dawson, R., and Shera, D. (1995). A multiple imputation strategy for clinical trials with truncation of patient data. Statistics in Medicine, 14, 1913-1925.
  • Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, Inc.
  • SAS Manual, Version 8.1, Chapter 11.
  • Shao, J. (2002). Resampling methods for variance estimation in complex surveys with a complex design. In Survey Nonresponse, edited by Groves, R.M., et al. New York: John Wiley & Sons, Inc., 303-314.
