Multiple Imputation

Julia Kozlitina

Steve Robertson

April 26, 2006

Outline
• Multiple Imputation (MI)
• How to impute (i.e. how to fill in values)
• How to analyze and draw inferences
• How many times to impute
• Alternatives to MI
• Applications
• Software
Multiple Imputation
• Idea: replace each missing item with 2 or more acceptable values, representing a distribution of possibilities (Rubin, 1987).
• This results in m complete datasets (each one is analyzed using standard methods, and estimated parameters are averaged).
• Can often be generated from simple modifications of existing single-imputation methods such as hot-deck or regression.
Dataset with m imputations

[Figure: data matrix of N survey units (rows) by k variables (columns), with a vector of m imputations for each missing value]

• Most useful when the fraction of values missing is not excessive and when m is modest (say 2 to 10)
• Each row vector of imputations is of length m, where

model for 1st imputation = …

model for 2nd imputation = …

model for mth imputation = …

Advantages
• Allows the use of standard complete-data methods
• Can incorporate the data collector’s knowledge to reflect the uncertainty about imputed values (sampling variability and uncertainty about the reasons for nonresponse)
• Increases efficiency of estimation
• Provides valid inferences (for variance estimators) under an assumed model for nonresponse
• Allows one to study sensitivity to various models
Disadvantages
• More work is needed to generate multiple imputations (often not difficult to implement using the existing single-imputation scheme)
• More space is needed to store the data
• More work is required to analyze the data (not serious when m is modest; often not difficult to implement using standard statistical programs)

How to fill in the values:
• BAYESIAN PERSPECTIVE (Rubin, 1987): draw multiple imputations to simulate a Bayesian posterior distribution of the missing values, that is, the conditional distribution of the missing data given the observed data,

$\Pr(Y_{mis} \mid X, Y_{obs}, R_{inc}, I)$

where obs = set of observed values

inc = set of units included in the sample

I = an indicator for inclusion

(in Rubin's notation, X denotes the fully observed covariates and R the response indicators)

How to fill in the values:
• Impose a probability model on the complete data and the nonresponse mechanism (e.g., a normal regression or loglinear model)
• Create imputations through a 2-step Bayesian process:
• Specify prior distributions and draw the unknown model parameters from their posterior distribution, and
• Given the drawn parameters, simulate the missing data from their conditional distribution given the observed data; repeating both steps m times yields m independent imputations
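The 2-step process can be sketched for the simplest case. The example below is a minimal illustration, not the presenters' code: it assumes a single normally distributed variable, ignorable nonresponse, and the noninformative prior p(mu, sigma^2) proportional to 1/sigma^2. The function name `impute_normal` and the sample data are hypothetical.

```python
import random
import statistics


def impute_normal(y_obs, n_mis, m, seed=0):
    """Return m imputed vectors of length n_mis under a univariate normal model.

    Assumptions of this sketch: normal data, ignorable nonresponse, and the
    noninformative prior p(mu, sigma^2) proportional to 1 / sigma^2.
    """
    rng = random.Random(seed)
    n1 = len(y_obs)
    ybar = statistics.fmean(y_obs)
    s2 = statistics.variance(y_obs)  # sample variance, divisor n1 - 1
    imputations = []
    for _ in range(m):
        # Step 1: draw the parameters from their posterior distribution.
        # sigma^2 | y_obs ~ (n1 - 1) * s^2 / chi-square(n1 - 1);
        # a chi-square(k) variate is Gamma(shape=k/2, scale=2).
        chi2 = rng.gammavariate((n1 - 1) / 2.0, 2.0)
        sigma2 = (n1 - 1) * s2 / chi2
        # mu | sigma^2, y_obs ~ N(ybar, sigma^2 / n1)
        mu = rng.gauss(ybar, (sigma2 / n1) ** 0.5)
        # Step 2: draw the missing values given the drawn parameters.
        imputations.append([rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_mis)])
    return imputations


y_obs = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.8]  # made-up observed values
imps = impute_normal(y_obs, n_mis=3, m=5)
```

Because the parameters are redrawn for every imputation, the m imputed vectors differ by more than residual noise alone, which is what lets the between-imputation variance reflect parameter uncertainty.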
How to fill in the values :
• This requires deriving the posterior distribution. In simple problems, closed-form solutions exist
• In more complex applications, rely on special computational techniques such as Markov chain Monte Carlo (MCMC)
• Other possibilities: approximate Bayesian bootstrap (Rubin, 1987)
• Modeling propensity scores to form sampling groups (Lavori et al., 1995)
Approx. Bayesian Bootstrap (ABS)
• Draw n1 values randomly with replacement from Yobs to form Y*obs (i.e. create a hot deck)
• Draw the n0 = n - n1 components of Ymis randomly with replacement from Y*obs

See Rubin (1987) for details on:

• Bayesian Bootstrap (BB) - p. 44,
• Approximate Bayesian Bootstrap (ABS) - p. 124
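The two resampling steps listed above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name `abb_impute` and the sample values are hypothetical, and the sketch assumes a single imputation class (no covariates or adjustment cells).

```python
import random


def abb_impute(y_obs, n_mis, m, seed=0):
    """Approximate Bayesian bootstrap: return m imputed vectors of length n_mis."""
    rng = random.Random(seed)
    n1 = len(y_obs)
    imputations = []
    for _ in range(m):
        # Step 1: draw n1 values with replacement from Y_obs to form Y*_obs.
        y_star = rng.choices(y_obs, k=n1)
        # Step 2: draw the n0 missing values with replacement from Y*_obs.
        imputations.append(rng.choices(y_star, k=n_mis))
    return imputations


y_obs = [12, 15, 11, 19, 14, 16]  # made-up respondent values
imps = abb_impute(y_obs, n_mis=4, m=3)
```

Drawing from the resampled Y*obs rather than directly from Yobs is what adds the extra between-imputation variability that a plain hot deck lacks.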
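Inference on combined estimates:
• The estimate is the average of the m repeated complete-data estimates: $\bar{Q}_m = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i$
• Let $\bar{U}_m = \frac{1}{m} \sum_{i=1}^{m} U_i$ be the average of the m repeated complete-data variances, and $B_m = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{Q}_i - \bar{Q}_m)^2$ the variance between imputations
• The total variance is approximately the sum of the two: $T_m = \bar{U}_m + \left(1 + \frac{1}{m}\right) B_m$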
Inference on combined estimates
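• Confidence intervals and significance tests can be computed using a t reference distribution with

$\nu = (m - 1)\left(1 + \frac{1}{r_m}\right)^2$

degrees of freedom, where $r_m = (1 + 1/m) B_m / \bar{U}_m$ is the relative increase in variance due to nonresponse, $B_m$ is the between-imputation variance, and $\bar{U}_m$ is the average complete-data variance (Rubin, Ch. 3)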
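The combining rules can be written as a small function. This is a sketch under the standard form of Rubin's rules (average estimate, within plus inflated between variance, and the t degrees of freedom); the function name `combine` and the input numbers are illustrative only.

```python
import math
import statistics


def combine(estimates, variances):
    """Pool m complete-data estimates and variances with Rubin's rules.

    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # average of the m estimates
    u_bar = statistics.fmean(variances)      # average within-imputation variance
    b_m = statistics.variance(estimates)     # between-imputation variance, divisor m - 1
    t_m = u_bar + (1 + 1 / m) * b_m          # total variance
    if b_m == 0:
        return q_bar, t_m, math.inf          # no between-imputation variability
    r_m = (1 + 1 / m) * b_m / u_bar          # relative increase in variance
    df = (m - 1) * (1 + 1 / r_m) ** 2        # t reference distribution df
    return q_bar, t_m, df


# Made-up results from m = 3 completed data sets.
q_bar, t_m, df = combine([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
# q_bar is 11.0, t_m is about 2.333, df is 6.125
```

An interval estimate is then q_bar plus or minus a t quantile with df degrees of freedom times the square root of t_m; the quantile itself needs a statistics library and is left out of this sketch.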

How many imputations needed?
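• Rubin (1987, p. 114) shows that the relative efficiency (in units of variance) of a finite-m estimator is

$RE = \left(1 + \frac{\gamma}{m}\right)^{-1}$

where $\gamma$ is the rate of missing information for the quantity being estimated.

• For small $\gamma$, m = 2 or 3 is nearly fully efficient; for example, $\gamma = 0.1$ and m = 3 give RE of about 0.97.

[Table: relative efficiency by m and $\gamma$; values not preserved in the transcript]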
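The efficiency values are easy to recompute. The sketch below assumes the variance-scale form RE = 1 / (1 + gamma / m); the grid of m and gamma values is an arbitrary choice for illustration and is not taken from the original slide.

```python
def relative_efficiency(gamma, m):
    """Relative efficiency of m imputations versus infinitely many (variance scale)."""
    return 1.0 / (1.0 + gamma / m)


gammas = [0.1, 0.3, 0.5, 0.7, 0.9]  # rates of missing information (illustrative grid)
ms = [1, 2, 3, 5, 10, 20]           # numbers of imputations (illustrative grid)

# Print one row per m and one column per gamma.
print("m    " + "  ".join(f"g={g:.1f}" for g in gammas))
for m in ms:
    row = "  ".join(f"{relative_efficiency(g, m):5.3f}" for g in gammas)
    print(f"{m:<4d} {row}")
```

On the standard-error scale the corresponding figure is the square root of each entry, so the conclusion that a handful of imputations suffices for small gamma holds on either scale.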
Problems
• Difficulties with MI variance estimator discussed by Binder & Sun (1996), Fay (1996), and others
• Gives inconsistent variance estimates under some simple conditions (improper imputation)
• Kott (1995) observes that sampling weights must be used for both point and variance estimation in order to satisfy the conditions of being proper
• Wang and Robins (1998) explore large-sample properties of MI estimators
Alternatives
• Advances made on making efficient and asymptotically valid inferences from single imputations
• Shao (2002) and Rao (2000, 2005): jackknife variance estimator for hot-deck imputation in which donors are selected with replacement, with selection probability proportional to sampling weights
• Kalton & Kish (1984), Fay (1996): fractionally weighted imputation – use more than one donor for a recipient
Fractionally Weighted Imputation
• Idea: reduce imputation variance relative to single imputation
• Fractional hot-deck imputation replaces each missing value with a set of imputed values and assigns a weight to each (Kim & Fuller, 2004); i.e., each imputed value receives a “fraction” of the original observation weight
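The weight-splitting idea can be illustrated with a deliberately simplified sketch. It assumes equal fractions and donors drawn at random without replacement from one imputation cell, which is simpler than the scheme in Kim & Fuller (2004); the function name `fractional_hot_deck` and the data are hypothetical.

```python
import random


def fractional_hot_deck(donor_values, recipient_weights, n_donors, seed=0):
    """Return (recipient index, imputed value, fractional weight) rows.

    Simplifying assumptions: one imputation cell, donors chosen at random
    without replacement, and equal fractions 1 / n_donors.
    """
    rng = random.Random(seed)
    rows = []
    for unit, weight in enumerate(recipient_weights):
        donors = rng.sample(donor_values, n_donors)
        for value in donors:
            # Each imputed value carries a fraction of the recipient's weight.
            rows.append((unit, value, weight / n_donors))
    return rows


donor_values = [3.0, 4.0, 5.0, 6.0]  # made-up respondent values
recipient_weights = [10.0, 20.0]     # made-up sampling weights of nonrespondents
rows = fractional_hot_deck(donor_values, recipient_weights, n_donors=2)
```

The fractional weights for each recipient sum to that recipient's original weight, so weighted totals over the augmented data set keep the same overall weight.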
Multiple Imputation Applications
• SAS has recently developed a procedure for multiple imputation (first available in version 8.1)
• The procedure requires use of both:

PROC MI

PROC MIANALYZE

MI Applications
• Multiple imputation inference involves three distinct phases

1. The missing data are filled in m times to generate m complete data sets (PROC MI)

2. The m complete data sets are analyzed by standard statistical analyses. (PROC REG, PROC GLM, etc.)

3. The results from the m complete data sets are combined to produce inferential results. (PROC MIANALYZE)
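The three phases can be mimicked outside SAS to show how they fit together. The sketch below is not PROC MI or PROC MIANALYZE; it is a standard-library illustration that assumes one incomplete variable y, one complete covariate x, a normal linear regression imputation model with a noninformative prior, and the mean of y as the quantity of interest. All function names and data are hypothetical.

```python
import random
import statistics


def impute_once(x, y, rng):
    """Phase 1: fill in missing y (None) by regression on x, with parameter draws."""
    xo = [xi for xi, yi in zip(x, y) if yi is not None]
    yo = [yi for yi in y if yi is not None]
    n1 = len(yo)
    xbar, ybar = statistics.fmean(xo), statistics.fmean(yo)
    sxx = sum((xi - xbar) ** 2 for xi in xo)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xo, yo)) / sxx
    sse = sum((yi - ybar - slope * (xi - xbar)) ** 2 for xi, yi in zip(xo, yo))
    # Draw sigma^2 ~ SSE / chi-square(n1 - 2), then the centred intercept and
    # the slope, which are independent given sigma^2 because x is centred.
    sigma2 = sse / rng.gammavariate((n1 - 2) / 2.0, 2.0)
    a = rng.gauss(ybar, (sigma2 / n1) ** 0.5)
    b = rng.gauss(slope, (sigma2 / sxx) ** 0.5)
    return [yi if yi is not None
            else rng.gauss(a + b * (xi - xbar), sigma2 ** 0.5)
            for xi, yi in zip(x, y)]


def analyze(y_complete):
    """Phase 2: complete-data estimate of the mean and its variance."""
    n = len(y_complete)
    return statistics.fmean(y_complete), statistics.variance(y_complete) / n


def pool(results):
    """Phase 3: combine the m results (estimate, variance) with Rubin's rules."""
    m = len(results)
    estimates = [q for q, _ in results]
    variances = [u for _, u in results]
    q_bar = statistics.fmean(estimates)
    total = statistics.fmean(variances) + (1 + 1 / m) * statistics.variance(estimates)
    return q_bar, total


# Made-up data: y is missing (None) for three of ten units.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.2, None, 19.9]

rng = random.Random(1)
results = [analyze(impute_once(x, y, rng)) for _ in range(5)]  # m = 5
estimate, total_var = pool(results)
```

Keeping the phases as separate functions mirrors the SAS workflow: the analysis step knows nothing about missing data, and only the pooling step sees all m results.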

Three Imputation Mechanisms:

(Choice depends on the type of missing data pattern)

• Regression Method - A regression model is fitted for each variable with missing values, with previous variables as covariates. (Monotone missing)
• Propensity Score Method - Observations are grouped based on propensity scores, and an approximate Bayesian bootstrap imputation is applied to each group. (Monotone missing)
• MCMC Method - (Markov chain Monte Carlo) Constructs a Markov chain long enough for the distribution of the elements to stabilize. (Arbitrary missing pattern, assuming MAR)
Multiple Imputation Applications
• See handout of SAS code and output
• Examples of the MI procedure can be shown using a data set which contains measurements on men running during a P.E. Course at N.C. State University
• 3 Variables of interest:

Oxygen (oxygen intake per minute, ml/kg body wt)

Runtime (time in minutes to run 1.5 miles)

RunPulse (heart rate while running)

Conclusions:
• Multiple imputation is a method of replacing missing values which has some theoretical advantages over other methods
• Software for multiple imputation is becoming more common, and the code is relatively simple
Software

Commercial:

• SAS PROC MI
• SOLAS for Missing Data Analysis (http://www.statsolusa.com/)

Free:

• MIX - Software for multiple imputation

http://www.stat.psu.edu/~jls/misoftwa.html

References
• Binder, D.A., and Sun, W. (1996). Frequency valid multiple imputation for surveys with a complex design. Proceedings of the Section on Survey Research Methods, ASA, 281-286.
• Fay, R.E. (1996). Alternative paradigms for the analysis of imputed survey data. JASA, 91, 490-498.
• Kalton, G., and Kish, L. (1984). Some efficient random imputation methods. Communications in Statistics, A13, 1919-1939.
• Kott, P.S. (1995). A paradox of multiple imputation. Proceedings, 384-389.
• Kim, J., and Fuller, W.A. (2004). Fractional hot deck imputation. Biometrika, 91, 559-578.
• Lavori, P.W., Dawson, R., and Shera, D. (1995). A multiple imputation strategy for clinical trials with truncation of patient data. Statistics in Medicine, 14, 1913-1925.
• Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons, Inc.
• SAS Manual Version 8.1, Chapter 11
• Shao, J. (2002). Resampling methods for variance estimation in complex surveys with a complex design. In Survey Nonresponse. Edited by Groves, R.M., et al. New York: John Wiley & Sons, Inc., 303-314.