
Wish You Were Here! Strategies for Handling Missing Data


eydie

Presentation Transcript


  1. Wish You Were Here! Strategies for Handling Missing Data

  2. Agenda • Overview • Types of Missing Data • Strategies for Handling Missing Data • Software Applications and Examples

  3. Overview • Sources of Missing Data • Item non-response • Missing value for any given item • Scale non-response • Missing value for any given scale • Often a result of item non-response • Attrition • Missing value (item and/or scale) for any given time point • Data entry error • Observed value not included

  4. Overview • So I have missing data…what’s the big deal? • Missing data, no matter how minimal, can (and probably do) result in biased results • Statistical power • Validity

  5. Overview • How much missing data is “problematic”? Depends on who you ask… • Answer #1 • ANY • Answer #2 • It’s never “too much” • Optimal methods can easily accommodate 50% missing data • Answer #3 • >5% (Schafer, 1999) • >10% (Bennett, 2001) • >20% (Peng, et al., 2006) • Answer #4 (Widaman, 2006) • 1%-2% (Negligible) • 5%-10% (Minor) • 10%-25% (Moderate) • 25%-50% (High) • >50% (Excessive)

  6. Types of Missing Data • Missing Completely at Random (MCAR) • Missing at Random (MAR) • Not Missing at Random (NMAR)

  7. Types of Missing Data • Missing Completely at Random (MCAR) • Missing values on Y are unrelated to any other variable in the analysis • Cases with missing data can be treated as a random subset of the entire sample • Best case scenario; difficult to ascertain

  8. Types of Missing Data • Missing at Random (MAR) • Missing values on Y are related to X but not to Y • Missing values on Y are random (random effect) after controlling for X (systematic effect) • Can test systematic effect but not random effect

  9. Types of Missing Data • Not Missing at Random (NMAR) • Missing values on Y are related to Y itself • Missing data are “non-ignorable” • Difficult to ascertain; difficult to manage

  10. Determining Type of Missing Data • Testing for MCAR • Little’s Test of MCAR • Omnibus χ² test of all specified variables • If significant, data are not MCAR • May be MAR or NMAR • If not significant, can assume MCAR • Available in SPSS under “Missing Value Analysis” and as a SAS Macro

  11. Determining Type of Missing Data • Testing for MAR • Create a “dummy” variable for not missing/missing on the variable of interest • Conduct statistical tests to see if other relevant variables are associated with values of the new variable • Binomial logistic regression • χ² test of independence • t-tests • If significant relationships are found, then you have MAR; these variables need to be included in any analyses • If no significant relationships are found, then you have more work to do
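The dummy-variable check on this slide can be sketched in a few lines. This is a minimal illustration only, not the presenter's procedure: the data are simulated, the names `x`, `y_obs`, `miss`, and `welch_t` are hypothetical, and a Welch t statistic stands in for the t-test option listed above.

```python
import math
import random
import statistics

random.seed(0)

# Simulated (hypothetical) data: x is fully observed, and y is more
# likely to be missing when x is high, i.e. missingness depends on x.
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]
y_obs = [None if (xi > 0.5 and random.random() < 0.6) else yi
         for xi, yi in zip(x, y)]

# Step 1: dummy variable, 1 = missing on y, 0 = observed on y.
miss = [1 if yi is None else 0 for yi in y_obs]

# Step 2: test whether x differs between the two groups.
def welch_t(a, b):
    """Welch's t statistic for the difference in means of a and b."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b))

x_missing = [xi for xi, m in zip(x, miss) if m == 1]
x_observed = [xi for xi, m in zip(x, miss) if m == 0]
t_stat = welch_t(x_missing, x_observed)
print(f"n missing = {len(x_missing)}, Welch t = {t_stat:.2f}")
```

A large t statistic here says that x predicts missingness on y, so x would need to be carried into any later missing-data analysis.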

  12. Determining Type of Missing Data • If not MCAR or MAR, does that mean it is NMAR? • Not necessarily… • Might still be MAR but you haven’t found the right indicator variable • Consider other potentially relevant variables and test against the missing data “dummy” variable

  13. Determining Type of Missing Data • Patterns of missing data • Monotone pattern • Variables v1-vj can be ordered so that if data are missing on v1, they are missing on all successive variables • VERY common with longitudinal data

  14. Determining Type of Missing Data • Patterns of missing data • Non-monotone pattern • Patterns of missing data are arbitrary
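Whether a pattern is monotone can be checked mechanically. The sketch below is an illustration under stated assumptions: rows are lists with `None` marking a missing value, and `is_monotone` is a hypothetical helper, not a function from any package mentioned in these slides. It relies on the fact that if any valid ordering of the variables exists, ordering them from least to most missing is also valid.

```python
def is_monotone(rows):
    """Return True if the columns can be ordered so that, in every row,
    a missing value (None) is followed only by missing values."""
    n_cols = len(rows[0])
    # Order columns from least to most missing.
    n_missing = [sum(row[j] is None for row in rows) for j in range(n_cols)]
    order = sorted(range(n_cols), key=lambda j: n_missing[j])
    for row in rows:
        seen_missing = False
        for j in order:
            if row[j] is None:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one breaks monotonicity.
                return False
    return True

# Dropout in a three-wave study: once a case is missing, it stays missing.
dropout = [
    [1, 2, 3],
    [1, 2, None],
    [1, None, None],
]

# Arbitrary pattern: no ordering of the columns makes it monotone.
arbitrary = [
    [1, None, 3],
    [None, 2, 3],
    [1, 2, None],
]

print(is_monotone(dropout), is_monotone(arbitrary))
```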

  15. Methods for Handling Missing Data • Deletion Methods • Remove cases with missing values • Non-Stochastic Methods • Replace missing values with “known” values • Stochastic Methods • Replace missing values with estimated values

  16. Deletion Methods • List-Wise Deletion • Mechanism • Deletes cases from analysis with missing data on any variable (even if that variable isn’t part of the analysis) • Only uses “complete cases” • Pros • Easy to implement • Works for any kind of statistical analysis • If data are MCAR, does not introduce any bias in parameter estimates • Standard error estimates are appropriate • Cons • May delete a large proportion of cases, resulting in loss of statistical power • May introduce bias if MAR but not MCAR

  17. Deletion Methods • Pair-Wise Deletion • Mechanism • Deletes cases when missing data on a specific variable involved in parameter estimation • Uses all available information for each estimation, independent of information available for other estimations • Pros • Approximately unbiased if MCAR • Uses all available information • Cons • Standard errors are incorrect
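The difference between the two deletion methods shows up in how many cases each one keeps. The following is a small sketch on made-up records; the variable names and the helpers `listwise` and `pairwise` are hypothetical.

```python
# Hypothetical records; None marks a missing value.
data = [
    {"x": 1.0, "y": 2.0, "z": 5.0},
    {"x": 2.0, "y": None, "z": 4.0},
    {"x": 3.0, "y": 6.0, "z": None},
    {"x": 4.0, "y": 8.0, "z": 2.0},
    {"x": None, "y": 10.0, "z": 1.0},
]

def listwise(rows):
    """Keep only cases that are complete on every variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, a, b):
    """Keep cases that are complete on the two variables being analyzed."""
    return [r for r in rows if r[a] is not None and r[b] is not None]

n_listwise = len(listwise(data))
n_pairwise = {
    (a, b): len(pairwise(data, a, b))
    for a, b in [("x", "y"), ("x", "z"), ("y", "z")]
}

print("list-wise N:", n_listwise)
print("pair-wise N per pair:", n_pairwise)
```

In this toy data set list-wise deletion keeps 2 of 5 cases, while each pair-wise estimate uses 3 cases, and a different 3 each time. That shifting sample is one reason the standard errors from pair-wise deletion are hard to trust.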

  18. Deletion Methods

  19. Non-Stochastic Methods • Mean Imputation • Mechanism • All missing values on a given variable are replaced by the sample mean for that variable • Pros • Leaves sample mean of non-missing values unchanged • Cons • Often leads to biased parameter estimates (e.g., variances) • Usually leads to standard error estimates that are biased downward • Treats imputed data as real data, ignores inherent uncertainty in imputed values.
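A tiny numeric illustration of the pro and the first con listed above, using made-up values: filling in the sample mean leaves the mean unchanged but shrinks the variance.

```python
import statistics

# Hypothetical variable with four observed values and four missing values.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 4

# Mean imputation: every missing value becomes the observed-sample mean.
mean_obs = statistics.mean(observed)
imputed = observed + [mean_obs] * n_missing

var_obs = statistics.variance(observed)  # sample variance, observed only
var_imp = statistics.variance(imputed)   # sample variance after imputation

print("mean before/after:", mean_obs, statistics.mean(imputed))
print("variance before/after:", round(var_obs, 3), round(var_imp, 3))
```

The imputed values add cases without adding any spread, so the variance (and anything built on it, such as standard errors) is understated.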

  20. Non-Stochastic Methods • Individual Mean Imputation • Mechanism • Scale scores are computed by taking the mean of non-missing values • Ex: Respondent answered 8 of 10 questions on Miller Anxiety Scale – Compute Scale score by taking mean of available cases • Pros • All available information for a given individual is used in the estimation of missing values • Cons • Assumes the items with missing values are similar in difficulty or extremity to items with non-missing data • May lead to biased scores
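The scale-score example above (8 of 10 items answered) could be computed as follows. This is a sketch with invented item responses; `scale_score` and its `min_answered` cutoff are hypothetical and not taken from the slides.

```python
def scale_score(responses, min_answered=1):
    """Mean of the answered items; None if too few items were answered.

    `min_answered` is an illustrative cutoff, not a value from the slides.
    """
    answered = [r for r in responses if r is not None]
    if len(answered) < min_answered:
        return None
    return sum(answered) / len(answered)

# Respondent answered 8 of 10 items (None = skipped item).
responses = [3, 4, None, 2, 5, 4, None, 3, 4, 3]
print(scale_score(responses))
```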

  21. Non-Stochastic Methods • Regression • Mechanism • Missing values are replaced by “predicted” values derived from multiple regression (MR) using all relevant variables • Pros • Predicted values maintain relationships among variables • Cons • Predicted values are “perfect” and lead to positively biased estimates

  22. Non-Stochastic Methods

  23. Stochastic Methods • Stochastic Regression (aka “Simple Imputation”) • Mechanism • Similar to non-stochastic regression in that the available data are used to predict missing values • Adds a random value to the predicted value by sampling from a normal distribution with a mean of zero and variance equal to the residual variance of the regression equation • Pros • Improvement over Non-Stochastic methods • Provides unbiased variance estimates • Cons • Only uses a single estimation step and may produce inaccurate or unusual values
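The contrast between non-stochastic and stochastic regression imputation can be sketched on simulated data. Everything below is an assumption for illustration: one predictor, a hand-coded ordinary least squares fit, and values deleted completely at random.

```python
import math
import random
import statistics

random.seed(1)

# Simulated (hypothetical) data: y depends on x; about 30% of y is deleted.
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]
missing = [random.random() < 0.3 for _ in range(n)]

# Fit a simple regression of y on x using the complete cases only.
xc = [xi for xi, m in zip(x, missing) if not m]
yc = [yi for yi, m in zip(y, missing) if not m]
mx, my = statistics.mean(xc), statistics.mean(yc)
slope = (sum((a - mx) * (b - my) for a, b in zip(xc, yc))
         / sum((a - mx) ** 2 for a in xc))
intercept = my - slope * mx
resid = [b - (intercept + slope * a) for a, b in zip(xc, yc)]
resid_sd = math.sqrt(sum(r * r for r in resid) / (len(xc) - 2))

# Non-stochastic: impute the predicted value exactly.
deterministic = [intercept + slope * xi if m else yi
                 for xi, yi, m in zip(x, y, missing)]

# Stochastic: predicted value plus a draw from N(0, residual variance).
stochastic = [intercept + slope * xi + random.gauss(0, resid_sd) if m else yi
              for xi, yi, m in zip(x, y, missing)]

var_det = statistics.variance(deterministic)
var_sto = statistics.variance(stochastic)
print(f"variance, deterministic: {var_det:.2f}; stochastic: {var_sto:.2f}")
```

The deterministic version puts every imputed point exactly on the regression line, which understates the variance of y; the added random draw restores the residual spread.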

  24. Stochastic Methods (Regression)

  25. Stochastic Methods • Expectation Maximization (EM) • Mechanism • 2-step iterative process • Step 1: Expectation • Use parameter values (initially based on complete-case data) to estimate values for missing data • Step 2: Maximization • Use complete-case data and estimated values for missing data to estimate new model parameters • Repeat until results converge (Successive iterations will not yield different parameters) • Pros • Minimizes bias in parameter estimates (larger samples yield less bias) • Ideal for exploratory and reliability analyses • Cons • Initial estimates based on list-wise deletion (doesn’t use all available data) • Biased standard errors (minimized with larger samples) • Less efficient than FIML for hypothesis testing
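The two-step loop can be sketched for the simplest case: two variables, normally distributed, with x fully observed and y partly missing. This is an illustration only, on simulated data with hypothetical names. The E-step follows the standard formulation for normal data, filling in the expected y and y² for each missing case rather than a single imputed value.

```python
import random

random.seed(2)

# Simulated (hypothetical) data; y is missing more often when x > 0 (MAR).
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.8 * xi + random.gauss(0, 0.6) for xi in x]
y_obs = [None if (xi > 0 and random.random() < 0.5) else yi
         for xi, yi in zip(x, y)]

# x is complete, so its mean and variance are estimated once.
mu_x = sum(x) / n
var_x = sum((xi - mu_x) ** 2 for xi in x) / n

# Starting values from the complete cases.
cc = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
mu_y = sum(yi for _, yi in cc) / len(cc)
var_y = sum((yi - mu_y) ** 2 for _, yi in cc) / len(cc)
cov = sum((xi - mu_x) * (yi - mu_y) for xi, yi in cc) / len(cc)
mu_y_cc = mu_y  # kept for comparison with the EM estimate

for iteration in range(500):
    beta = cov / var_x
    resid_var = var_y - cov ** 2 / var_x
    # E-step: expected y and y^2 for each case under current parameters.
    ey, ey2 = [], []
    for xi, yi in zip(x, y_obs):
        if yi is None:
            pred = mu_y + beta * (xi - mu_x)
            ey.append(pred)
            ey2.append(pred ** 2 + resid_var)
        else:
            ey.append(yi)
            ey2.append(yi ** 2)
    # M-step: re-estimate parameters from the completed statistics.
    new_mu_y = sum(ey) / n
    new_var_y = sum(ey2) / n - new_mu_y ** 2
    new_cov = sum(xi * e for xi, e in zip(x, ey)) / n - mu_x * new_mu_y
    change = (abs(new_mu_y - mu_y) + abs(new_var_y - var_y)
              + abs(new_cov - cov))
    mu_y, var_y, cov = new_mu_y, new_var_y, new_cov
    if change < 1e-10:  # converged: parameters no longer move
        break

print(f"complete-case mean of y: {mu_y_cc:.3f}")
print(f"EM mean of y: {mu_y:.3f} after {iteration + 1} iterations")
```

Because high-x cases are more likely to be missing, the complete-case mean of y is pulled downward, while the EM estimate uses x to correct for that. The simulated y has a true mean of 1.0.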

  26. Stochastic Methods (EM)

  27. Stochastic Methods • Full Information Maximum Likelihood (FIML) • Mechanism • Directly estimates parameters using all observed data for every case • Pros • Only requires a single step for imputation and analysis • Uses all available data even if some cases are missing data • Unbiased standard errors • Can be used with smaller samples (N<100) • Cons • All variables related to missing data need to be included in the analysis

  28. Stochastic Methods (FIML)

  29. Stochastic Methods • Multiple Imputation (MI) • Mechanism • Creates multiple data sets using stochastic regression • Minimum of 3-5 recommended, but no limit on maximum (Schafer, 1997) • Each data set will be slightly different because of the random component • Parameters are estimated for each data set and then averaged • Pros • Produces unbiased parameter estimates • Produces unbiased standard errors • Easy to include auxiliary variables • Cons • Labor intensive • Can be difficult to integrate multiple data sets
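The "estimated for each data set and then averaged" step is usually done with Rubin's rules, which also combine the standard errors. The slides do not name a pooling formula, so treat the following as a sketch of that standard approach; the estimates and standard errors are invented, and `pool` is a hypothetical helper.

```python
import math
import statistics

def pool(estimates, std_errors):
    """Pool one parameter across m imputed data sets (Rubin's rules)."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.mean([se ** 2 for se in std_errors])
    # Between-imputation variance: how much the estimates disagree.
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Invented slopes and standard errors from m = 5 imputed data sets.
estimates = [0.52, 0.48, 0.55, 0.50, 0.45]
std_errors = [0.10, 0.11, 0.10, 0.12, 0.10]

pooled_est, pooled_se = pool(estimates, std_errors)
print(f"pooled estimate = {pooled_est:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is larger than the typical single-data-set standard error because it adds the disagreement between imputations. That is how MI accounts for the uncertainty that single imputation ignores.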

  30. Stochastic Methods (MI)

  31. Stochastic Methods • Comparison of Stochastic Methods

  32. Software Applications

  33. Example • Modeling problematic child behavior outcomes • Predictors • Positive Parenting • Social Skills • Interpartner Violence • Child Sex • N=181 • Original data set missing 4 observations (<.5%) • New data set created for purpose of demonstration

  34. Testing for Type of Missing Data • Little’s Test of MCAR can be obtained as part of PASW “Missing Values Analysis” • Little's MCAR test: Chi-Square = 36.014, DF = 18, Sig. = .007 • Conclude that data are not MCAR (not surprising given that I did not delete values in a random manner)

  35. Testing for Type of Missing Data • Test of MAR can be conducted by creating new dichotomous variable for “Not Missing/Missing” and using it as the outcome variable in a logistic regression model • Most interested in missing data on outcome variable in this example, but method is not limited to that • Conclude that pattern of missing data is related to Gender • Little's MCAR test for Boys: Chi-Square = 8.338, DF = 14, Sig. = .871* • Little's MCAR test for Girls: Chi-Square = 13.026, DF = 18, Sig. = .790* *We can conclude that data are MCAR within each group. Gender must be included in any missing data analysis to minimize bias.

  36. Patterns of missing data can be obtained using “Analyze Patterns” option available under “Multiple Imputation”

  37. Results of pattern analysis

  38. Results of pattern analysis Although the pattern is not monotone, these cases only make up a very small %

  39. Missing Values Analysis in PASW • PASW provides several options for handling missing data • The add-on module for “Missing Values Analysis” allows you to implement several different strategies simultaneously • In addition to saving time, comparison output is provided for means, SDs, and correlation/covariance matrices • Available options: • List-wise deletion • Pair-wise deletion • Stochastic regression • EM

  40. Missing Values Analysis in PASW Choose strategies Additional options Enter continuous and categorical variables

  41. Multiple Imputation in PASW • The “Multiple Imputation” option is part of the basic PASW package • Provides numerous options • Choose # of iterations • Choose estimation method • (monotone vs. non-monotone patterns) • Create new data sets

  42. Multiple Imputation in PASW Enter all variables to use in imputation (model + auxiliary) Choose # of iterations Create a new data set with imputed data Note: PASW allows you to run analysis on all imputed sets simultaneously

  43. Multiple Imputation in PASW “Automatic” is the default Can manually select method based on pattern of missing data If your data include interactions, so should your imputation model

  44. Missing Data and LISREL Multiple Imputation available in PreLIS under “Statistics” I have included both model and auxiliary variables Select estimation method EM -> monotone MCMC -> non-monotone Decide how to handle cases when all data are missing Output is a “complete” data set for analysis

  45. Missing Data and LISREL An alternative to MI is to use FIML estimation with the original data set containing missing values LISREL will default to this option if there is missing data

  46. Comparing Results

  47. Comparing Results

  48. Comparing Results

  49. Test • The goal of handling missing data is to find values close to the “real” (but absent) values. (T or F) • FALSE – the goal is to estimate unbiased standard errors and parameter estimates • Which is more important – amount of missing data or type of missing data? • Both are important, but type is more important than amount • List-wise deletion is a good strategy for handling missing data. (T or F) • TRUE – if data are MCAR; if not MCAR, then there are better alternatives • There are no “good” strategies for handling data that are NMAR. (T or F) • TRUE – but FIML is considered to yield the least biased results

  50. Test • Deletion is the only strategy for handling missing categorical data. (T or F) • FALSE – can use both non-stochastic and stochastic methods • If using multiple imputation, it is best to include all available variables. (T or F) • FALSE – only include variables related to those with missing data • Values such as “not applicable”, “not sure”, “I don’t know”, etc. should be treated as missing data. (T or F) • FALSE – if you included these as possible response categories, then they constitute valid responses (i.e., they are not missing) • List-wise deletion is better than non-stochastic imputation. (T or F) • TRUE – if data are MCAR and/or unless using a small sample with minimal power
