Multiple Imputation

Multiple Imputation Julia Kozlitina Steve Robertson April 26, 2006

Outline • Multiple Imputation (MI) • How to impute (i.e. how to fill in values) • How to analyze and draw inferences • How many times to impute • Alternatives to MI • Applications • Software

Multiple Imputation • Idea: replace each missing item with two or more acceptable values, representing a distribution of possibilities (Rubin, 1987). • This results in m complete datasets; each one is analyzed using standard methods, and the estimated parameters are averaged. • The imputations can often be generated from simple modifications of existing single-imputation methods such as hot-deck or regression.

Dataset with m imputations [Figure: data matrix with N units in the survey and k variables, in which each missing value is replaced by a row vector of m imputations] • Most useful when the fraction of values missing is not excessive and when m is modest (say 2 to 10) • Each row vector of imputations is of length m, where model for 1st imputation = …, model for 2nd imputation = …, …, model for mth imputation = …

Advantages: • Allows the use of standard complete-data methods • Can incorporate the data collector’s knowledge to reflect the uncertainty about imputed values (sampling variability and uncertainty about the reasons for nonresponse) • Increases efficiency of estimation • Provides valid inferences (for variance estimators) under an assumed model for nonresponse • Allows one to study sensitivity to various models

Disadvantages: • More work is needed to generate multiple imputations – Often not difficult to implement using the existing single-imputation scheme • More space is needed to store the data • More work is required to analyze the data (not serious when m is modest) – Often not difficult to implement using standard statistical programs

How to fill in the values: • BAYESIAN PERSPECTIVE (Rubin, 1987): draw multiple imputations to simulate a Bayesian posterior distribution of the missing values, that is, the conditional distribution of the missing data given the observed data • Notation: obs = set of observed values; inc = set of units included in the sample; I = an indicator for inclusion

How to fill in the values: • Impose a probability model on the complete data and the nonresponse mechanism (e.g., a normal regression or loglinear model) • Create imputations through a 2-step Bayesian process: – Specify prior distributions and draw the unknown model parameters from their posterior, and – Simulate m independent draws from the conditional distribution of the missing data given the observed data
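The 2-step process can be sketched for the simplest case: a single normally distributed variable with a noninformative prior on its mean and variance. This is only an illustration under those stated assumptions, not the scheme used by any particular package; the function name `impute_normal` and its arguments are hypothetical.

```python
import numpy as np

def impute_normal(y, m=5, seed=0):
    """Return m completed copies of y, a 1-D array with NaN marking missing values.

    Assumes a univariate normal model with a noninformative prior and an
    ignorable nonresponse mechanism (illustrative assumptions only).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    obs = y[~miss]
    n1 = obs.size
    ybar, s2 = obs.mean(), obs.var(ddof=1)

    completed = []
    for _ in range(m):
        # Step 1: draw the model parameters from their posterior.
        sigma2 = (n1 - 1) * s2 / rng.chisquare(n1 - 1)
        mu = rng.normal(ybar, np.sqrt(sigma2 / n1))
        # Step 2: draw the missing values given the drawn parameters.
        y_new = y.copy()
        y_new[miss] = rng.normal(mu, np.sqrt(sigma2), size=int(miss.sum()))
        completed.append(y_new)
    return completed

completed = impute_normal([1.0, 2.0, np.nan, 4.0, np.nan, 3.5], m=3)
for c in completed:
    print(np.round(c, 2))
```

Drawing the parameters afresh for every imputation, rather than plugging in the estimates, is what carries parameter uncertainty into the imputed values.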

How to fill in the values: • This requires deriving the posterior distribution. In simple problems, closed-form solutions exist • In more complex applications, rely on special computational techniques such as Markov chain Monte Carlo (MCMC) • Other possibilities: – Approximate Bayesian bootstrap (Rubin, 1987) – Modeling propensity scores to form sampling groups (Lavori et al., 1995)

Approx. Bayesian Bootstrap (ABS) • Draw n1 values randomly with replacement from Yobs to form Y*obs (i.e., create a hot deck) • Draw the n0 = n - n1 components of Ymis randomly with replacement from Y*obs • See Rubin (1987) for details on: – Bayesian Bootstrap (BB), p. 44 – Approximate Bayesian Bootstrap (ABS), p. 124
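The two draws of the approximate Bayesian bootstrap can be sketched as follows. This is a minimal illustration for one variable with no covariates or imputation classes; the function name `abb_impute` is hypothetical.

```python
import numpy as np

def abb_impute(y, m=5, seed=0):
    """Approximate Bayesian bootstrap for a 1-D array with NaN marking missing values."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    y_obs = y[~miss]
    n1, n0 = y_obs.size, int(miss.sum())

    completed = []
    for _ in range(m):
        # First draw: resample the n1 observed values with replacement (the hot deck).
        y_star = rng.choice(y_obs, size=n1, replace=True)
        # Second draw: fill the n0 missing values by sampling from that hot deck.
        y_new = y.copy()
        y_new[miss] = rng.choice(y_star, size=n0, replace=True)
        completed.append(y_new)
    return completed

for c in abb_impute([3.0, 5.0, np.nan, 4.0, np.nan, 6.0], m=3):
    print(c)
```

Skipping the first draw and sampling directly from the observed values would give an ordinary random hot deck, which understates the between-imputation variability.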

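Inference on combined estimates: • The estimate is the average of the m repeated complete-data estimates: $\bar{Q}_m = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i$ • Let – $\bar{U}_m = \frac{1}{m}\sum_{i=1}^{m} U_i$ be the average of the m repeated complete-data variances, and – $B_m = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{Q}_i - \bar{Q}_m)^2$ be the variance between imputations • The total variance is approximately the sum of the two, with a finite-m correction on the between-imputation term: $T_m = \bar{U}_m + \left(1 + \frac{1}{m}\right) B_m$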

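Inference on combined estimates • Confidence intervals and significance tests can be computed using a t reference distribution with $\nu = (m-1)\left(1 + \frac{1}{r_m}\right)^2$ degrees of freedom, where $r_m = \left(1 + \frac{1}{m}\right) B_m / \bar{U}_m$ is the relative increase in variance due to nonresponse, $B_m$ is the between-imputation variance, and $\bar{U}_m$ is the average of the complete-data variances (Rubin, Ch. 3)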
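The combining rules can be sketched in a few lines. The sketch assumes a scalar quantity and takes as input the m complete-data point estimates and their m complete-data variances; the function name `pool_rubin` is hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m complete-data estimates and variances for a scalar quantity."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                    # combined point estimate
    u_bar = u.mean()                    # average within-imputation variance
    b = q.var(ddof=1)                   # between-imputation variance
    t = u_bar + (1 + 1 / m) * b         # total variance
    r = (1 + 1 / m) * b / u_bar         # relative increase in variance
    # If the estimates do not vary across imputations, r = 0 and the
    # t reference distribution reduces to a normal (infinite df).
    df = (m - 1) * (1 + 1 / r) ** 2 if r > 0 else float("inf")
    return q_bar, t, df

q_bar, t, df = pool_rubin([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(q_bar, t, df)  # 11.0, about 2.333, 6.125
```

An interval estimate then takes the form q_bar plus or minus a t quantile with df degrees of freedom times the square root of t.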

How many imputations are needed? • Rubin (1987, p. 114) shows that the relative efficiency of a finite-m estimator is $\left(1 + \frac{\gamma}{m}\right)^{-1}$, where $\gamma$ is the rate of missing information for the quantity being estimated. • Values shown below. For small $\gamma$, m = 2 or 3 is nearly fully efficient.
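Relative efficiencies for a few combinations of m and the rate of missing information can be tabulated as below. The sketch assumes the variance-scale form 1 / (1 + γ/m); on the standard-error scale the corresponding value is its square root. The grid of m and γ values is chosen only for illustration.

```python
def relative_efficiency(gamma, m):
    """Relative efficiency (variance scale) of m imputations versus infinitely many."""
    return 1.0 / (1.0 + gamma / m)

gammas = (0.1, 0.3, 0.5, 0.7, 0.9)
print("m   " + "  ".join(f"g={g:.1f}" for g in gammas))
for m in (2, 3, 5, 10, 20):
    row = "  ".join(f"{relative_efficiency(g, m):.3f}" for g in gammas)
    print(f"{m:<3d} {row}")
```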

Problems • Difficulties with MI variance estimator discussed by Binder & Sun (1996), Fay (1996), and others • Gives inconsistent variance estimates under some simple conditions (improper imputation) • Kott (1995) observes that sampling weights must be used for both point and variance estimation in order to satisfy the conditions of being proper • Wang and Robins (1998) explore large-sample properties of MI estimators

Alternatives • Advances have been made on efficient and asymptotically valid inferences from single imputations • Shao (2002) and Rao (2000, 2005): jackknife variance estimator for hot-deck imputation in which donors are selected with replacement, with selection probability proportional to sampling weights • Kalton & Kish (1984), Fay (1996): fractionally weighted imputation, which uses more than one donor per recipient

Fractionally Weighted Imputation • Idea: reduce imputation variance relative to single imputation • Fractional hot-deck imputation replaces each missing value with a set of imputed values and assigns a weight to each (Kim & Fuller, 2004) • That is, each imputed value receives a “fraction” of the original observation weight
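The weight-splitting idea can be sketched as follows. This is a simplified illustration, not the Kim & Fuller (2004) estimator: it assumes a single imputation cell, donors chosen at random without replacement, and equal fractions 1/M for each of the M donors. The function name `fractional_hot_deck` is hypothetical.

```python
import numpy as np

def fractional_hot_deck(y, w, n_donors=3, seed=0):
    """Expand a sample so each missing y contributes n_donors imputed rows.

    y: values with NaN marking missing; w: sampling weights.
    Returns (values, weights) for the expanded data set.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    miss = np.isnan(y)
    donors = y[~miss]

    values = list(y[~miss])      # respondents keep their value and full weight
    weights = list(w[~miss])
    for w_i in w[miss]:
        picks = rng.choice(donors, size=n_donors, replace=False)
        values.extend(picks)
        # Each imputed value receives an equal fraction of the recipient's weight.
        weights.extend([w_i / n_donors] * n_donors)
    return np.array(values), np.array(weights)

vals, wts = fractional_hot_deck([1.0, 2.0, np.nan, 4.0, np.nan, 3.0],
                                [10, 10, 20, 10, 30, 20], n_donors=2)
print(np.sum(wts * vals) / np.sum(wts))  # weighted mean from the expanded data
```

Because the fractions for each recipient sum to one, the total weight of the expanded data set equals the total of the original weights.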

Multiple Imputation Applications • SAS has recently developed a procedure for multiple imputation (first available in version 8.1) • The approach requires the use of both PROC MI and PROC MIANALYZE

MI Applications • Multiple imputation inference involves three distinct phases: 1. The missing data are filled in m times to generate m complete data sets (PROC MI). 2. The m complete data sets are analyzed by standard statistical analyses (PROC REG, PROC GLM, etc.). 3. The results from the m complete data sets are combined to produce inferential results (PROC MIANALYZE).

Three Imputation Mechanisms : (Choice depends on the type of missing data pattern) • Regression Method - A regression model is fitted for each variable with missing values, with previous variables as covariates. (Monotone missing) • Propensity Score Method - Observations are grouped based on propensity scores, and an approximate Bayesian bootstrap imputation is applied to each group. (Monotone missing) • MCMC Method - (Markov Chain Monte Carlo) Constructs a Markov chain long enough for the distribution of the elements to stabilize (MAR)
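The regression method can be sketched for a single incomplete variable y with fully observed covariates x. This is an illustration under a normal linear model with a noninformative prior, written in Python rather than SAS; it is not the PROC MI implementation, and the function name `regression_impute` is hypothetical.

```python
import numpy as np

def regression_impute(x, y, m=5, seed=0):
    """Impute NaN entries of y from covariates x under a normal linear model."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    X = np.column_stack([np.ones(len(y)), x])   # add an intercept column
    Xo, yo = X[~miss], y[~miss]
    n, p = Xo.shape

    # Fit the regression on the cases where y is observed.
    xtx_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = xtx_inv @ Xo.T @ yo
    sse = np.sum((yo - Xo @ beta_hat) ** 2)

    completed = []
    for _ in range(m):
        # Draw the residual variance and coefficients from their posterior.
        sigma2 = sse / rng.chisquare(n - p)
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
        # Impute: predicted value plus a random residual.
        y_new = y.copy()
        noise = rng.normal(0.0, np.sqrt(sigma2), size=int(miss.sum()))
        y_new[miss] = X[miss] @ beta + noise
        completed.append(y_new)
    return completed
```

With a monotone missing pattern, the same step would be applied to each incomplete variable in turn, using the previously completed variables as covariates.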

Multiple Imputation Applications • See handout of SAS code and output • Examples of the MI procedure can be shown using a data set which contains measurements on men running during a P.E. Course at N.C. State University • 3 Variables of interest: Oxygen intake per minute (ml/kg body wt) Runtime (time in minutes to run 1.5 miles) RunPulse (heart rate while running)

Conclusions: • Multiple imputation is a method of replacing missing values that has some theoretical advantages over other methods • Software that handles multiple imputation is becoming more common, and the code is relatively simple

Software Commercial: • SAS PROC MI • SOLAS for Missing Data Analysis (http://www.statsolusa.com/) Free: • MIX - Software for multiple imputation http://www.stat.psu.edu/~jls/misoftwa.html

References • Binder, D.A., and Sun, W. (1996). Frequency valid multiple imputation for surveys with a complex design. Proceedings of the Section on Survey Research Methods, ASA, 281-286. • Fay, R.E. (1996). Alternative paradigms for the analysis of imputed survey data. JASA, 91, 490-498. • Kalton, G., and Kish, L. (1984). Some efficient random imputation methods. Communications in Statistics, A13, 1919-1939. • Kim, J., and Fuller, W.A. (2004). Fractional hot deck imputation. Biometrika, 91, 559-578. • Kott, P.S. (1995). A paradox of multiple imputation. Proceedings, 384-389. • Lavori, P.W., Dawson, R., and Shera, D. (1995). A multiple imputation strategy for clinical trials with truncation of patient data. Statistics in Medicine, 14, 1913-1925. • Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons, Inc. • SAS Manual Version 8.1, Chapter 11. • Shao, J. (2002). Resampling methods for variance estimation in complex surveys with a complex design. In Survey Nonresponse, edited by Groves, R.M., et al. New York: John Wiley & Sons, Inc., 303-314.
