A REVIEW By Chi-Ming Kam Surajit Ray April 23, 2001


Imputation Techniques Implemented in SOLAS 3.0 • SINGLE IMPUTATION • Hot Decking • Predicted Mean Imputation • Last Value Carried Forward • MULTIPLE IMPUTATIONS • Propensity Score Based Imputation • Predictive Model Based Imputation

Method 1: Propensity Score Based Imputation • This was the only method in Version 1. • The method is similar to Lavori, Dawson, and Shera (1995), “A multiple imputation strategy for clinical trials with truncation of patient data.” • GOAL: To impute missing values under minimal distributional assumptions.

How it Works • Let R be the indicator for the missingness pattern (R = 0 or 1). • Model R from X1, X2, …, XP using logistic regression. • Compute p = Prob(R = 1 | X1, X2, …, XP) for each case, yielding N propensity scores pi.
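The propensity score step above can be sketched in a few lines. This is only an illustration, not SOLAS's implementation: the function names `sigmoid`, `fit_logistic`, and `propensity_scores` are hypothetical, the fit uses plain gradient ascent rather than whatever optimizer SOLAS uses, the covariates are assumed to be roughly standardized, and R = 1 is assumed to mark an observed Y.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def linear_predictor(beta, xi):
    """b0 + b1*x1 + ... + bP*xP for one case."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], xi))

def fit_logistic(X, r, lr=0.5, n_iter=500):
    """Fit a logistic regression of r (0/1) on the rows of X.

    Uses gradient ascent on the mean log-likelihood (an assumption made
    for brevity). Returns [b0, b1, ..., bP].
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, ri in zip(X, r):
            resid = ri - sigmoid(linear_predictor(beta, xi))
            grad[0] += resid
            for j, x in enumerate(xi):
                grad[j + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity_scores(X, beta):
    """Return p_i = Prob(R = 1 | X_i) for every case."""
    return [sigmoid(linear_predictor(beta, xi)) for xi in X]

# Toy data (made up for illustration): one covariate, and R = 1 is
# more likely when the covariate is large.
random.seed(0)
X_demo = [[random.gauss(0, 1)] for _ in range(100)]
r_demo = [1 if random.random() < sigmoid(1.0 * x[0]) else 0 for x in X_demo]
p_demo = propensity_scores(X_demo, fit_logistic(X_demo, r_demo))
```

Each case ends up with one score in (0, 1); those N scores are the only information about the X's that the later steps use.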

How it Works… (Approximate Bayesian bootstrap, Rubin, 1987) • Group the units by the quintiles of p (the grouping is user specified). • Suppose that within a particular group there are n1 observed and n0 missing values.
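A minimal sketch of the grouping step, assuming rank-based quantile groups of near-equal size. The function name `quantile_groups` and the `n_groups` parameter are hypothetical; how SOLAS handles ties or uneven group sizes is not stated on the slides.

```python
def quantile_groups(p, n_groups=5):
    """Label each propensity score with a group 0..n_groups-1.

    Cases are ranked by p and split into n_groups blocks of (nearly)
    equal size; n_groups=5 gives quintiles. Ties are broken by the
    original order of the cases (an assumption).
    """
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])
    groups = [0] * n
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // n
    return groups

# Five cases, one per quintile.
labels = quantile_groups([0.9, 0.1, 0.5, 0.3, 0.7])
```

Within each group, n1 is the number of cases with Y observed and n0 the number with Y missing.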

• Sample n1+n0 units with replacement from the n1 observed values. • From the sampled pool, subsample n0 units with replacement. • Use these n0 units as the imputed values for the n0 missing values. • Repeat the procedure m times to get m imputations. (Diagram: n1 observed → pool of n0+n1, drawn with replacement → n0 imputed values, drawn with replacement.)
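The resampling steps for a single group can be sketched as follows. The names `abb_impute`, `multiple_abb`, and `pool_size` are hypothetical. The default pool size of n1+n0 follows the steps as stated here; Rubin's description of the approximate Bayesian bootstrap is usually given with a pool of n1 draws, so the pool size is left as a parameter.

```python
import random

def abb_impute(observed, n_missing, rng, pool_size=None):
    """One approximate-Bayesian-bootstrap imputation for one group.

    observed  : the n1 observed Y values in the group
    n_missing : n0, the number of missing Y values in the group
    pool_size : size of the intermediate pool; defaults to n1 + n0,
                as described in the steps above (an assumption about
                what SOLAS does internally)
    """
    if not observed:
        raise ValueError("group has no observed values to draw from")
    if pool_size is None:
        pool_size = len(observed) + n_missing
    # Step 1: draw the pool with replacement from the observed values.
    pool = [rng.choice(observed) for _ in range(pool_size)]
    # Step 2: draw the n0 imputed values with replacement from the pool.
    return [rng.choice(pool) for _ in range(n_missing)]

def multiple_abb(observed, n_missing, m, seed=0):
    """Repeat the procedure m times to get m sets of imputed values."""
    rng = random.Random(seed)
    return [abb_impute(observed, n_missing, rng) for _ in range(m)]

# Toy group: four observed values, three missing, five imputations.
imputations = multiple_abb([1.0, 2.0, 3.0, 4.0], n_missing=3, m=5)
```

Drawing the pool first, instead of sampling the n0 values directly from the observed data, is what adds between-imputation variability.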

Theoretical Justification • It produces an imputed distribution of Y that has been corrected for biases due to missingness related to X. • It is similar in spirit to reweighting, but here we have a multiple imputation version of it. • The method produces unbiased estimates for the marginal distribution of Y.

Problems/Drawbacks The method does not preserve the association between Y and the individual Xi’s. Reasoning: the only aspect of the Xi’s that is used here is the linear predictor (b0 + b1X1 + b2X2 + … + bpXp) from the logistic model. This function predicts the missingness of Y (that is, R), not Y itself.

Problems/Drawbacks (Continued…) Suppose X1 is highly correlated with Y but is unrelated to P(R=1). X1 will drop out of the logistic model and will not be used in the imputation. As a result, the imputed data will misrepresent the correlation between X1 and Y. Also, by not using X1 in the imputation, we fail to impute Y efficiently.

Simulation Results Using SOLAS 1.1 • Data generation mechanism: Y = X + Z + e, with e ~ N(0, 1). • Source: Paul D. Allison, “Multiple Imputation for Missing Data: A Cautionary Tale.”

Some Comments About the Propensity Score Based Method • The method can provide valid but possibly inefficient inferences about Y (marginal). • The method can lead to very misleading inferences about the relationships between Y and other variables.

Method 2: Predictive Model Based Multiple Imputation • This method is implemented in SOLAS 2.0 and 3.0. HOW IT WORKS: • Regress Y on X1, X2, …, Xp. • Get the estimates of b0, b1, b2, …, bp and s2. • Draw b0*, b1*, b2*, …, bp*, s2* from an approximate posterior distribution. • Impute Y* = b0* + b1*X1 + b2*X2 + … + bp*Xp + e*, where e* ~ Normal(0, s2*). • Repeat m times to get the m imputed datasets.
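The steps above can be sketched for the simplest case of a single predictor X. This is an illustration under stated assumptions, not SOLAS's code: the function name `draw_imputations` is hypothetical, a noninformative prior is assumed (so s2* is the residual sum of squares divided by a chi-square draw, and the coefficients are normal given s2*), and the regression is written in centered form so that the intercept and slope can be drawn independently.

```python
import math
import random

def draw_imputations(x_obs, y_obs, x_mis, m, seed=0):
    """Return m lists of imputed Y values for the cases in x_mis.

    Model: Y = a + b*(X - xbar) + e, e ~ Normal(0, s2), fitted on the
    cases with Y observed. With b0 = a - b*xbar this is the same model
    as Y = b0 + b1*X + e.
    """
    rng = random.Random(seed)
    n = len(x_obs)
    xbar = sum(x_obs) / n
    ybar = sum(y_obs) / n
    sxx = sum((x - xbar) ** 2 for x in x_obs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(x_obs, y_obs))
    b_hat = sxy / sxx                       # least-squares slope
    sse = sum((y - ybar - b_hat * (x - xbar)) ** 2
              for x, y in zip(x_obs, y_obs))
    df = n - 2                              # residual degrees of freedom
    imputations = []
    for _ in range(m):
        # s2* = SSE / chi-square(df); Gamma(df/2, scale 2) is chi-square(df).
        s2_star = sse / rng.gammavariate(df / 2.0, 2.0)
        # Coefficients drawn around their estimates, given s2*.
        a_star = rng.gauss(ybar, math.sqrt(s2_star / n))
        b_star = rng.gauss(b_hat, math.sqrt(s2_star / sxx))
        # Y* = prediction from the drawn coefficients plus fresh noise e*.
        imputations.append([
            a_star + b_star * (x - xbar) + rng.gauss(0.0, math.sqrt(s2_star))
            for x in x_mis
        ])
    return imputations

# Toy data (made up for illustration): Y is close to 2 + 3X.
xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x + (0.01 if i % 2 else -0.01) for i, x in enumerate(xs)]
imputed_sets = draw_imputations(xs, ys, x_mis=[5.0, 10.0], m=4)
```

Redrawing the parameters for every imputation, not only the residual e*, is what lets the m datasets reflect uncertainty about the regression itself.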

Good points • The method provides correct model-based MI under the regression model and MAR. • It also preserves the correlation between the Xi's and Y. How does it differ from NORM? • NORM does the same thing with MCMC. • Under the multivariate normal model, both methods give the same results.

Which Software is More General? • “I work for arbitrary missingness patterns.” • “I work for non-linear relations of Y on X.” • “But that’s probably very similar to NORM with rounding.”

Concluding Remarks • SOLAS is the first commercial missing data software. • It has a good graphical interface. • Data import from and export to other software is easy. • It performs well under a monotone missingness pattern. • Estimates are not always unbiased.
