Multiple Imputation for Handling Item Nonresponse in Environmental Data

1. Multiple Imputation for Handling Item Nonresponse in Environmental Data. Breda Munoz, Ruben Smith, Virginia Lesser

2. This presentation was supported under STAR Research Assistance Agreement No. CR82-9096-01 awarded by the U.S. Environmental Protection Agency to Oregon State University. It has not been formally reviewed by EPA. The views expressed in this document are solely those of the authors, and EPA does not endorse any products or commercial services mentioned in this presentation.

3. Outline: Introduction; Multiple imputation and Hierarchical Bayesian Mixed Model; Illustration; Summary; Future research

4. Introduction. Researchers using environmental data face problems of missing data: a missing observational unit (unit nonresponse); a few variables missing for an observational unit (item nonresponse). Causes of missing data: failure of the measuring instruments; inaccessibility of the site; data lost or damaged.

5. Introduction. The impact of missing data depends on: the missing data mechanism; the fraction of complete cases; the type of parameter the researcher intends to estimate. Missing data mechanisms: missing completely at random (MCAR); missing at random (MAR); missing not at random (MNAR), also called nonignorable.

6. Introduction. Missing at random (MAR): missingness does not depend on the unobserved response but may depend on observed values and covariates. A model can be formulated and incorporated into the analysis techniques to explain and account for the nonresponse mechanism (Little and Rubin (2002); Lohr (2001); Lessler and Kalsbeek (2000)).

7. Introduction. Analyses of singly imputed data do not reflect the missing-data uncertainty or the consequences of imputation. Schafer and Olsen (1998): analyses based on a single imputation may result in: standard errors that are too small; p-values that are too small.
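The understatement of uncertainty can be seen in a small sketch. The data, the 30% missingness rate, and the use of mean imputation as the single-imputation method are all assumptions chosen for illustration; they are not taken from the presentation.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.0, size=200)   # hypothetical complete data
missing = rng.random(200) < 0.3                # hypothetical MCAR mask, about 30% missing
z_obs = z[~missing]

# Single (mean) imputation: every gap is filled with the observed mean.
z_filled = z.copy()
z_filled[missing] = z_obs.mean()

def std_error(x):
    """Naive standard error of the mean, treating all values as real data."""
    return x.std(ddof=1) / np.sqrt(len(x))

se_observed = std_error(z_obs)    # uses only the observed values
se_single = std_error(z_filled)   # treats imputed values as if they were observed
print(se_observed, se_single)
```

The filled-in data set reports a smaller standard error than the observed cases alone, even though no new information was added.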

8. Multiple Imputation. Multiple imputation (MI) is a well-known methodology for handling nonresponse. It incorporates the uncertainty of the missing data into the inference; replaces each missing value with several values drawn from a distribution of likely values; generates m complete data sets, on which the same analysis procedure is performed. Final inferences are combinations of the individual ones (Rubin, 1987).
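The combination step is commonly done with Rubin's (1987) rules: average the m point estimates, and add the within-imputation variance to an inflated between-imputation variance. A minimal sketch follows; the function name `pool_rubin` and the input numbers are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m complete-data estimates and their variances (requires m >= 2)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                 # pooled point estimate
    u_bar = u.mean()                 # within-imputation variance
    b = q.var(ddof=1)                # between-imputation variance
    t = u_bar + (1 + 1 / m) * b      # total variance of the pooled estimate
    return q_bar, t

# Hypothetical results from m = 5 imputed data sets.
q_bar, t = pool_rubin([1.0, 1.2, 0.8, 1.1, 0.9], [0.04] * 5)
print(q_bar, t)
```

The between-imputation term is what a single imputation cannot supply, which is why its standard errors come out too small.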

9. Multiple Imputation. Advantages of MI: different analyses can be performed with the same collection of m complete data sets while accounting for the missing data problem; high efficiency is achieved for small values of m; data sets can be analyzed using standard techniques and software available for complete data sets (Schafer and Olsen, 1998; Schafer, 1997).
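The efficiency claim is usually quantified with Rubin's (1987) relative efficiency of m imputations compared with infinitely many, 1 / (1 + gamma / m), where gamma is the fraction of missing information. The sketch below tabulates it for a hypothetical gamma of 0.3, a value assumed here for illustration only.

```python
def relative_efficiency(gamma, m):
    """Efficiency of m imputations relative to an infinite number (Rubin, 1987)."""
    return 1.0 / (1.0 + gamma / m)

gamma = 0.3  # hypothetical fraction of missing information
for m in (3, 5, 10, 20):
    print(m, round(relative_efficiency(gamma, m), 4))
```

Under this assumption, m = 5 already gives about 94% efficiency, and further imputations add little.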

10. Multiple Imputation. Let $z(s_1), \ldots, z(s_n)$ be a probabilistic sample of $n$ sites. Missing data occurred at $n_1$ of the $n$ sites. Define the response indicator: $R(s) = 1$ if the value of $z(s)$ was observed at site $s$, $R(s) = 0$ otherwise.

11. Multiple Imputation. Under the MAR assumption, imputations for $z_{mis}$ are obtained from the posterior predictive distribution of the missing data: $P(z_{mis} \mid z_{obs}) = \int P(z_{mis} \mid z_{obs}, \theta)\, P(\theta \mid z_{obs})\, d\theta$. Valid inferences from MI: the imputation model should preserve the same relationships in the data that would be considered at the analysis stage (Schafer (1997); Rubin (1996)).

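12. Multiple Imputation. Note: the posterior of $\theta$ given the observed data is $P(\theta \mid z_{obs}) \propto L(\theta \mid z_{obs})\, \pi(\theta)$, where $L(\theta \mid z_{obs})$ is the observed-data likelihood and $\pi(\theta)$ is some prior for $\theta$ (Schafer (1997) and Little and Rubin (2002)).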
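Drawing from a posterior predictive distribution can be sketched in two steps: draw parameters from their posterior given the observed data, then draw the missing values given those parameters. The example below assumes a much simpler model than the spatial one in this presentation: independent normal observations with the standard noninformative prior proportional to 1/sigma^2. The function name `impute_normal` and the data are hypothetical.

```python
import numpy as np

def impute_normal(z_obs, n_missing, m, rng):
    """Return an (m, n_missing) array of posterior predictive draws."""
    z_obs = np.asarray(z_obs, dtype=float)
    n = len(z_obs)
    z_bar, s2 = z_obs.mean(), z_obs.var(ddof=1)
    draws = []
    for _ in range(m):
        # Step 1: draw the parameters from their posterior given z_obs.
        sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
        mu = rng.normal(z_bar, np.sqrt(sigma2 / n))
        # Step 2: draw the missing values given the drawn parameters.
        draws.append(rng.normal(mu, np.sqrt(sigma2), size=n_missing))
    return np.array(draws)

rng = np.random.default_rng(0)
z_obs = rng.normal(-3.0, 1.0, size=40)   # hypothetical observed values
imputations = impute_normal(z_obs, n_missing=3, m=5, rng=rng)
print(imputations.shape)
```

Because the parameters are redrawn for each of the m data sets, the imputations vary by more than the residual noise alone, which is how parameter uncertainty enters the final inference.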

13. Hierarchical Bayesian Models

14. Illustration. Data: Oregon Stream Habitat Surveys, conducted every year from June through September. The surveys are designed to assess all streams within the range of Coho salmon. Target population: all streams located in watersheds in western Oregon that drain into the Pacific Ocean south of the Columbia River.

15. Illustration. Sites were selected using a Random Tessellation Stratified (RTS) design (Stevens 1997). Variable: average unit gradient (the overall steepness of the stream channel within each habitat unit throughout the reach). Log(Gradient + 0.001) is approximately normal.

17. Illustration

18. Illustration. ODFW habitat surveys, 1998-2002: $n = 647$ observed; $n_1 = 75$ (spawner surveys from year 2000 without habitat variables). $Y(s_i) \mid Z(s_i) \sim$ independent $N(?, \sigma_e^2 I)$; $Z \sim MVN[0, \sigma_z^2 R(?)]$, where $R(s_i, s_j) =$
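The two-level structure (a spatially correlated latent process plus independent measurement error) can be simulated as a rough sketch. Several pieces are assumptions made only for illustration, since the slide does not show them: an exponential correlation exp(-d / phi), a constant mean `mu` added to the latent process, and all site coordinates and parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
coords = rng.uniform(0, 10, size=(n, 2))   # hypothetical site coordinates

# Pairwise distances between sites.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Assumed values; none of these come from the presentation.
phi, sigma2_z, sigma2_e, mu = 2.0, 1.0, 0.25, -3.0
R = np.exp(-d / phi)                       # assumed exponential correlation matrix

# Latent spatial process: Z ~ MVN(0, sigma2_z * R).
Z = rng.multivariate_normal(np.zeros(n), sigma2_z * R)

# Observations: independent normal noise around the assumed mean mu + Z.
Y = mu + Z + rng.normal(0.0, np.sqrt(sigma2_e), size=n)
print(Y[:5])
```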

19. Illustration. Parameter priors: ?, ? ~ Uniform($a_i, b_i$), where $i$ = ?, ?; $\sigma_z^2, \sigma_e^2 \sim$ Inverse Gamma($a_i, b_i$), where $i = z, e$. Joint posterior distribution:

20. Illustration. MCMC methods were used to draw samples from the posterior and marginal distributions: Gibbs sampler; Metropolis-Hastings. The MCMC simulation was run for 15,000 iterations after a 10,000-iteration burn-in period.
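The burn-in scheme can be sketched with a random-walk Metropolis sampler, the symmetric-proposal special case of Metropolis-Hastings. The target here is a toy posterior (a normal mean with known unit variance and a flat prior), not the hierarchical spatial model of the presentation; the function name, step size, and data are hypothetical. Only the iteration counts mirror the slide.

```python
import numpy as np

def metropolis(log_post, start, n_keep=15_000, burn_in=10_000, step=0.3, seed=0):
    """Random-walk Metropolis; discards the first burn_in draws."""
    rng = np.random.default_rng(seed)
    current, current_lp = start, log_post(start)
    kept = np.empty(n_keep)
    for i in range(burn_in + n_keep):
        proposal = current + step * rng.normal()
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        if i >= burn_in:
            kept[i - burn_in] = current
    return kept

y = np.random.default_rng(1).normal(1.5, 1.0, size=50)   # hypothetical data

def log_post(mu):
    """Log posterior of mu up to a constant: unit variance, flat prior."""
    return -0.5 * np.sum((y - mu) ** 2)

samples = metropolis(log_post, start=0.0)
print(samples.mean(), y.mean())
```

For this toy target the exact posterior mean is the sample mean of `y`, so the two printed values should be close.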

23. Illustration. Prediction at location $s_0$: Write expression here

24. Future Research. Implement MI under other distributions, such as Gamma, Poisson, and Bernoulli; incorporate auxiliary variables into the systematic part; explore MI with other methods of geostatistical analysis; explore imputation using the posterior predictive distribution mean.

25. Illustration

26. Thanks to Phil Larson, Steve Jacobs, Kim Jones, Jeff Rodgers and Andy Talavere for providing data, interpretation and useful comments.
