1. Multiple Imputation for Handling Item Nonresponse in Environmental Data
Breda Munoz
Ruben Smith
Virginia Lesser
2. This presentation was supported under STAR Research Assistance Agreement No. CR82-9096-01 awarded by the U.S. Environmental Protection Agency to Oregon State University. It has not been formally reviewed by EPA. The views expressed in this document are solely those of authors and EPA does not endorse any products or commercial services mentioned in this presentation.
3. Outline Introduction
Multiple imputation and Hierarchical Bayesian Mixed Model
Illustration
Summary
Future research
4. Introduction Researchers using environmental data face problems of missing data
Missing observational unit (unit nonresponse)
Few variables missing for an observational unit (item nonresponse)
Causes of missing data:
Failure of the measuring instruments
Inaccessibility of the site
Data lost or damaged
5. Introduction Impact of missing data depends on:
The missing data mechanism
The fraction of complete cases
The type of parameter the researcher intends to estimate
Missing data mechanism
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR), or nonignorable
6. Introduction Missing at random (MAR)
Missingness does not depend on the unobserved response, but may depend on observed values and covariates
A model can be formulated and incorporated into the analysis techniques to explain and account for the nonresponse mechanism
Little and Rubin (2002); Lohr (2001); Lessler and Kalsbeek (2000)
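The three mechanisms can be compared with a small simulation. The sketch below is illustrative only: the data, the covariate `x`, the logistic missingness probabilities, and all names are assumptions, not part of the presentation. It drops cases under each mechanism and computes the mean of the remaining observed responses (the true mean is 0).

```python
import math
import random
import statistics

rng = random.Random(0)
n = 20000

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical data: covariate x is always observed, response z may be missing.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
z = [xi + rng.gauss(0.0, 1.0) for xi in x]

def observed_mean(prob_missing):
    """Mean of z over the cases kept when case i is dropped with prob_missing[i]."""
    kept = [zi for zi, p in zip(z, prob_missing) if rng.random() >= p]
    return statistics.fmean(kept)

mcar_mean = observed_mean([0.5] * n)                      # ignores the data entirely
mar_mean = observed_mean([sigmoid(2 * xi) for xi in x])   # depends on observed x only
mnar_mean = observed_mean([sigmoid(2 * zi) for zi in z])  # depends on z itself

print(round(mcar_mean, 2), round(mar_mean, 2), round(mnar_mean, 2))
```

In this toy setting the complete-case mean stays close to 0 under MCAR but is pulled downward under both MAR and MNAR. Under MAR the bias can be removed by modeling z given the observed x; under MNAR that is not enough, because missingness depends on the unobserved values themselves.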
7. Introduction Results of analyses on singly imputed data do not reflect the uncertainty due to the missing data or the consequences of imputation
Schafer and Olsen (1998): analyses based on a single imputation may result in:
Standard errors that are too small
P-values that are too small
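A minimal sketch of why single imputation understates uncertainty, using mean imputation on simulated data. The data, the 30% missingness rate, and the function names are assumptions chosen for illustration; the presentation does not specify a single-imputation method.

```python
import math
import random
import statistics

rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(200)]
observed = [zi for zi in z if rng.random() >= 0.3]   # about 30% missing, MCAR
n_missing = len(z) - len(observed)

def standard_error(values):
    """Naive standard error of the mean, treating all values as real data."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Single (mean) imputation: every gap is filled with the observed mean.
imputed = observed + [statistics.fmean(observed)] * n_missing

se_complete_case = standard_error(observed)
se_single_imputation = standard_error(imputed)
print(se_complete_case, se_single_imputation)
```

The imputed values add no information, yet they shrink the sample standard deviation and inflate the apparent sample size, so the naive standard error after imputation is always smaller than the complete-case one.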
8. Multiple Imputation Multiple imputation (MI) is a well-known methodology for handling nonresponse
Incorporates uncertainty of the missing data into the inference
Replaces each missing value with several values drawn from a distribution of likely values
Generates m complete data sets, on which the same analysis procedure is performed
Final inferences are combinations of the individual ones (Rubin, 1987)
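The combination step can be sketched with Rubin's (1987) rules for a scalar estimate. The function name `pool` and the example numbers are illustrative assumptions; the formulas are the standard ones for the pooled estimate, total variance, and degrees of freedom.

```python
import statistics

def pool(estimates, variances):
    """Combine m completed-data estimates and their variances (Rubin, 1987).

    Returns the pooled estimate, its total variance, and the degrees of
    freedom of the reference t distribution. Requires m >= 2 and a nonzero
    between-imputation variance.
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # pooled point estimate
    u_bar = statistics.fmean(variances)      # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance
    total = u_bar + (1 + 1 / m) * b          # total variance
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, total, df

# Hypothetical results from m = 3 completed data sets.
q_bar, total, df = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, total, df)
```

The between-imputation term is what carries the missing-data uncertainty into the final standard error.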
9. Multiple Imputation Advantages of MI:
Possibility of performing different analyses with the same collection of m complete data sets while accounting for the missing data problem
High efficiency is achieved even for small values of m
Data sets can be analyzed using standard techniques and software available for complete data sets (Schafer and Olsen, 1998; Schafer, 1997)
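The efficiency claim can be made concrete with the approximation given by Rubin (1987): relative to infinitely many imputations, an estimate based on m imputations has efficiency of about 1 / (1 + γ/m), where γ is the fraction of missing information. The value γ = 0.3 below is an assumed example, not a figure from the presentation.

```python
def relative_efficiency(missing_info, m):
    """Approximate efficiency of m imputations relative to infinitely many."""
    return 1.0 / (1.0 + missing_info / m)

for m in (3, 5, 10, 20):
    print(m, round(relative_efficiency(0.3, m), 3))
```

With 30% missing information, five imputations already reach roughly 94% efficiency, which is why small m is usually considered adequate.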
10. Multiple Imputation Let z(s_1), ..., z(s_n) be a probabilistic sample
Missing data occurred in n1 of the n sites
Define response indicator:
R(s) = 1 if the value of z(s) was observed at site s,
R(s) = 0 otherwise
11. Multiple Imputation Under the MAR assumption:
Imputations for the missing values are obtained from the posterior predictive distribution of the missing data:
Valid inferences from MI: the imputation model should preserve the same relationships in the data that will be considered at the analysis stage (Schafer, 1997; Rubin, 1996)
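The expressions for this slide did not survive in the text. For reference, the standard forms from the missing-data literature (Schafer, 1997; Little and Rubin, 2002) are given here. The notation is an assumption: z_obs and z_mis denote the observed and missing parts of the data, R the response indicators, θ the parameters of the data model, and ξ the parameters of the missingness mechanism.

```latex
% MAR: the response indicators depend on the data only through the observed part
P(R \mid z_{\mathrm{obs}}, z_{\mathrm{mis}}, \xi) = P(R \mid z_{\mathrm{obs}}, \xi)

% Posterior predictive distribution of the missing data
p(z_{\mathrm{mis}} \mid z_{\mathrm{obs}})
  = \int p(z_{\mathrm{mis}} \mid z_{\mathrm{obs}}, \theta)\,
         p(\theta \mid z_{\mathrm{obs}})\, d\theta
```

In practice the integral is handled by simulation: draw θ from its posterior, then draw the missing values given that θ.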
12. Multiple Imputation Note:
the posterior of θ given the observed data is: p(θ | z_obs) ∝ L(θ | z_obs) π(θ)
L(θ | z_obs) is the observed-data likelihood
π(θ) is some prior for θ (Schafer (1997) and Little and Rubin (2002))
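The two-step draw (parameters from their posterior, then missing values given the parameters) can be sketched for a deliberately simple case. This is not the spatial model of the presentation: it assumes i.i.d. normal data with the noninformative prior p(μ, σ²) ∝ 1/σ², for which the posterior draws have closed forms. All names and values are hypothetical.

```python
import math
import random
import statistics

def draw_imputations(z_obs, n_mis, m, rng):
    """Draw m sets of n_mis imputations from the posterior predictive
    distribution of an i.i.d. normal model with prior p(mu, sigma2) ~ 1/sigma2."""
    n = len(z_obs)
    z_bar = statistics.fmean(z_obs)
    s2 = statistics.variance(z_obs)
    imputations = []
    for _ in range(m):
        # sigma2 | z_obs  ~  (n - 1) * s2 / chi-square(n - 1)
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma2, z_obs  ~  N(z_bar, sigma2 / n)
        mu = rng.gauss(z_bar, math.sqrt(sigma2 / n))
        # z_mis | mu, sigma2  ~  N(mu, sigma2)
        imputations.append([rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n_mis)])
    return imputations

rng = random.Random(2)
z_obs = [rng.gauss(5.0, 2.0) for _ in range(100)]   # hypothetical observed data
imputations = draw_imputations(z_obs, n_mis=10, m=5, rng=rng)
print(len(imputations), len(imputations[0]))
```

Redrawing μ and σ² for every imputed data set is what makes the imputations reflect parameter uncertainty as well as sampling noise.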
13. Hierarchical Bayesian Models
14. Illustration Data: Oregon Stream Habitat Surveys
Conducted every year from June through September
Surveys are designed to assess all streams within the range of Coho salmon
Target population: all streams located in watersheds in western Oregon that drain into the Pacific Ocean south of the Columbia River
15. Illustration Sites selected using Random Tessellation Stratified (RTS) design (Stevens 1997)
Variable: Average unit gradient (represents the overall steepness of the stream channel within each habitat unit throughout the reach).
Log(Gradient+0.001) is approximately normal
18. Illustration ODFW habitat surveys, 1998-2002
n = 647 observed
n1 = 75 (spawner surveys from year 2000 without habitat variables)
Y(s_i) | Z(s_i) ~ independent N(µ, σ²_e I)
Z ~ MVN[0, σ²_z R(?)] where R(s_i, s_j) =
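A rough sketch of simulating from a model of this shape. Several pieces are assumptions, because the slide text is incomplete: the correlation function is taken to be exponential, R(s_i, s_j) = exp(-d_ij / phi), the conditional mean of Y(s_i) is taken to be µ + Z(s_i), and all parameter values and site coordinates are made up.

```python
import math
import random

rng = random.Random(3)

def exp_corr(sites, phi):
    """Assumed exponential correlation: exp(-distance / phi)."""
    return [[math.exp(-math.dist(a, b) / phi) for b in sites] for a in sites]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

n = 30
sites = [(rng.random(), rng.random()) for _ in range(n)]   # hypothetical locations
mu, sigma2_z, sigma2_e, phi = 0.0, 1.0, 0.25, 0.3          # illustrative values only

R = exp_corr(sites, phi)
L = cholesky(R)
eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Spatial random effect Z ~ MVN(0, sigma2_z * R), built from independent normals.
Z = [math.sqrt(sigma2_z) * sum(L[i][k] * eps[k] for k in range(i + 1)) for i in range(n)]
# Observations: assumed mean mu + Z(s_i) plus independent measurement error.
Y = [mu + Z[i] + rng.gauss(0.0, math.sqrt(sigma2_e)) for i in range(n)]
```

Nearby sites share similar values of Z, which is the information an imputation model of this kind would borrow when filling in a site with missing habitat variables.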
19. Illustration Parameter priors:
?, ? ~ Uniform(a_i, b_i), where i = ?, ?
σ²_z, σ²_e ~ Inverse Gamma(a_i, b_i), where i = z, e
Joint Posterior distribution:
20. Illustration MCMC methods were used to draw samples from posterior and marginal distributions:
Gibbs sampler
Metropolis-Hastings
The MCMC simulation was run for 15,000 iterations after a burn-in period of 10,000 iterations.
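The burn-in scheme can be illustrated with a toy random-walk Metropolis-Hastings sampler. This is not the sampler used in the study: the target is the posterior of a single normal mean with known unit variance and a flat prior, and the data, starting value, and proposal standard deviation are assumptions. Only the iteration counts follow the slide.

```python
import math
import random
import statistics

rng = random.Random(4)
y = [rng.gauss(2.0, 1.0) for _ in range(50)]   # hypothetical data, variance known to be 1
n, y_bar = len(y), statistics.fmean(y)

def log_posterior(mu):
    # Flat prior on mu: the log posterior equals the log likelihood up to a constant.
    return -0.5 * n * (mu - y_bar) ** 2

burn_in, keep = 10_000, 15_000
mu, draws = 0.0, []
for iteration in range(burn_in + keep):
    proposal = mu + rng.gauss(0.0, 0.3)        # symmetric random-walk proposal
    log_ratio = log_posterior(proposal) - log_posterior(mu)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        mu = proposal                          # accept; otherwise keep the current value
    if iteration >= burn_in:
        draws.append(mu)                       # retain only post-burn-in draws

posterior_mean = statistics.fmean(draws)
print(len(draws), posterior_mean)
```

Discarding the first 10,000 iterations removes the influence of the arbitrary starting value; the retained draws are then summarized as a sample from the posterior.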
23. Illustration Prediction at location s_0:
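The prediction expression is not present in the slide text. As a reference point only, the standard conditional result for a zero-mean Gaussian random effect with covariance σ²_z R is given here. The symbols r_0 (vector of correlations between s_0 and the sampled sites) and Z (random effects at the sampled sites) are assumed notation, and the form the authors actually used may differ.

```latex
% Conditional distribution of the random effect at a new location s_0,
% given the random effects Z at the sampled sites and the model parameters
Z(s_0) \mid Z, \sigma_z^2 \;\sim\;
  N\!\left( r_0^{\top} R^{-1} Z,\;
            \sigma_z^2 \left( 1 - r_0^{\top} R^{-1} r_0 \right) \right)
```

Within an MCMC run, this draw would be repeated for each retained posterior sample of the parameters, so that the predictions at s_0 reflect parameter uncertainty.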
24. Future Research Implement MI under other distributions such as: Gamma, Poisson, Bernoulli
Incorporate auxiliary variables into the systematic part of the model
Explore MI with other methods of geostatistical analysis
Explore imputation using the mean of the posterior predictive distribution.
26. Thanks to Phil Larson, Steve Jacobs, Kim Jones, Jeff Rodgers and Andy Talavere for providing data, interpretation and useful comments.