
The European Statistical Training Programme (ESTP)

Learn about single imputation techniques for handling item nonresponse in surveys, including deductive imputation, imputation of mean, random imputation, imputation using donor records, and imputation using a model.





Presentation Transcript


  1. The European Statistical Training Programme (ESTP)

  2. Chapter 13: Item nonresponse • Handbook: chapter 14 • How to treat missing values? • Single imputation • Effects of single imputation • Multiple imputation

  3. Introduction • Nonresponse • Unit nonresponse: No information is obtained from a sampled person. • Item nonresponse: The person participated in the survey, but answers to some questions are missing. [Figure: data matrices illustrating full item response and item nonresponse]

  4. Introduction • How to deal with item nonresponse? • Casewise deletion: ignore all cases with missing data. • Pairwise deletion: ignore only those cases with missing data on the variables needed for the analysis. • Imputation: substitute estimates for missing data.
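
The difference between the two deletion strategies can be sketched as follows. This is a minimal illustration; the records, variable names and helper functions are made up for the example and do not come from the course material.

```python
# Toy survey records; None marks item nonresponse (all values are invented).
records = [
    {"age": 34, "income": 2800, "hours": 40},
    {"age": 51, "income": None, "hours": 36},
    {"age": 27, "income": 2100, "hours": None},
    {"age": 63, "income": 3300, "hours": 20},
]

def casewise(rows):
    """Keep only records without any missing item."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, variables):
    """Keep records that are complete on the variables needed for one analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def mean_of(rows, variable):
    values = [r[variable] for r in rows]
    return sum(values) / len(values)

complete = casewise(records)                   # records 1 and 4 survive
income_cases = pairwise(records, ["income"])   # records 1, 3 and 4 survive

print(len(complete), mean_of(complete, "income"))
print(len(income_cases), mean_of(income_cases, "income"))
```

Pairwise deletion keeps more records for the income analysis, but both strategies rely on the assumption that the deleted cases resemble the retained ones.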

  5. Introduction • Estimation under item nonresponse • The effectiveness of deletion and imputation techniques depends on the missing-data patterns. • Casewise and pairwise deletion assume that cases with missing data are on average the same as cases with full data. • Imputed values follow from a model that assumes that, within the model, item respondents and nonrespondents are on average the same. • Imputed data records cannot be treated the same way as non-imputed records. • Missing-data mechanisms are the same as for unit nonresponse • Missing Completely at Random (MCAR). • Missing at Random (MAR). • Not Missing at Random (NMAR).

  6. Introduction • Missing-data mechanisms for item nonresponse • Missing Completely at Random (MCAR) • Missing at Random (MAR) • Not Missing at Random (NMAR) • Examples • MCAR: Respondent accidentally forgets to fill in the back of the questionnaire or overlooks a block of questions. • MAR: Older respondents more often do not want to state their income. • NMAR: Respondent does not want to state real income as it comes partially from moonlighting (untaxed income).
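
A small simulation can make the three mechanisms concrete. The sketch below is only an illustration under assumed settings: the income model, the missingness probabilities and the seed are all hypothetical choices, not figures from the course.

```python
import random
from statistics import mean

rng = random.Random(1)
n = 20000
age = [rng.randint(18, 80) for _ in range(n)]
# Hypothetical income model: income rises with age, plus noise.
income = [1000 + 30 * a + rng.gauss(0, 300) for a in age]

def respondent_mean(prob_missing):
    """Mean income among item respondents, given a missingness probability."""
    kept = [y for y, a in zip(income, age) if rng.random() >= prob_missing(y, a)]
    return mean(kept)

full = mean(income)
# MCAR: the same probability of a missing value for everyone.
mcar = respondent_mean(lambda y, a: 0.3)
# MAR: missingness depends only on an observed variable (age).
mar = respondent_mean(lambda y, a: 0.6 if a >= 50 else 0.1)
# NMAR: missingness depends on the unobserved income itself.
nmar = respondent_mean(lambda y, a: 0.6 if y > 2500 else 0.1)

print(round(full), round(mcar), round(mar), round(nmar))
```

In this setup the respondent mean stays close to the full mean under MCAR and is too low under MAR and NMAR. Under MAR the bias can be reduced by working within age groups; under NMAR it cannot be removed using observed variables alone.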

  7. Single imputation • Single versus multiple imputation • Single imputation: a missing value is replaced by a single (synthetic) value. • Multiple imputation: a missing value is replaced by a set of (synthetic) values. • Imputation techniques • Deductive imputation • Imputation of a mean • Random imputation • Imputation using donor records • Imputation using a model with auxiliary variables

  8. Single imputation • Notation • Sample indicators: $a_1, a_2, \ldots, a_N$ • Item response indicators: $R_1, R_2, \ldots, R_N$ • Target variable: $Y_1, Y_2, \ldots, Y_N$ • Auxiliary variable(s): • Imputed target variable: • Deductive imputation • The value of the missing item can be deduced from the non-missing items. • Example 1: Profits and costs are given while total revenue is missing, so revenue follows as profits plus costs. • Example 2: Respondent is male but does not state how many times he was pregnant, so the answer must be zero.

  9. Single imputation • Imputation of a mean • Imputation of the overall mean: • Imputation of the mean within strata or groups: • Examples: • Imputation of mean income over all households • Imputation of mean profit over persons with the same size and having the same kind of job.
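
Both variants of mean imputation can be sketched in a few lines. The function names and the toy incomes are hypothetical, and a group without any observed value would raise a `KeyError` in this simple version.

```python
from statistics import mean

def impute_overall_mean(values):
    """Replace every missing value (None) by the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_group_mean(values, groups):
    """Replace each missing value by the observed mean within the same group."""
    observed = {}
    for v, g in zip(values, groups):
        if v is not None:
            observed.setdefault(g, []).append(v)
    fills = {g: mean(vs) for g, vs in observed.items()}
    return [fills[g] if v is None else v for v, g in zip(values, groups)]

income = [2000, None, 2400, 5000, None, 5400]
size = ["small", "small", "small", "large", "large", "large"]

overall = impute_overall_mean(income)      # both gaps get 3700
grouped = impute_group_mean(income, size)  # gaps get 2200 and 5200
print(overall)
print(grouped)
```

When the grouping variable is related to the target variable, the group means are much closer to plausible values than the overall mean.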

  10. Single imputation • Random imputation • Impute one of the possible values at random. • Cases with the same missing-data pattern may have different imputed values. • Random imputation can also be employed within strata or groups. • Examples: • If the marital status is missing, randomly sample a value from the set {married, not married, divorced, widowed}. • If the marital status is missing, randomly sample a value from the set {married, not married, divorced, widowed} if the respondent is 16 years or older, and otherwise impute not married. • If income is missing, fit a normal distribution to the non-missing records and sample a value from the fitted distribution.
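
The second and third examples can be sketched as follows. The function names, the seed and the data are hypothetical. Note that a fitted normal distribution can produce implausible (for instance negative) incomes, which this sketch does not guard against.

```python
import random
from statistics import mean, stdev

STATUSES = ["married", "not married", "divorced", "widowed"]

def impute_marital_status(status, age, rng):
    """Random imputation with an age rule: under 16 is always 'not married'."""
    if status is not None:
        return status
    if age < 16:
        return "not married"
    return rng.choice(STATUSES)

def impute_income_normal(incomes, rng):
    """Fit a normal distribution to the observed incomes and draw from it."""
    observed = [v for v in incomes if v is not None]
    mu, sigma = mean(observed), stdev(observed)
    return [rng.gauss(mu, sigma) if v is None else v for v in incomes]

rng = random.Random(2024)
people = [("married", 40), (None, 12), (None, 35)]   # (status, age)
statuses = [impute_marital_status(s, a, rng) for s, a in people]
incomes = impute_income_normal([2100, None, 2600, 3000, None], rng)
print(statuses)
print(incomes)
```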

  11. Single imputation • Imputation using donors • Hot deck imputation: randomly sample from the set of values observed among the item respondents. • Nearest neighbour imputation: define a distance measure, search for the item respondent that is closest to the item nonrespondent and impute the corresponding value. • Examples: • If income is missing, identify all persons with the same age and gender and randomly impute one of their incomes. • If level of education is missing, search for the item respondent whose income is closest in absolute value and impute the corresponding level of education.
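
The two donor examples can be sketched as below. This is a simplified illustration with hypothetical names and data: the hot deck raises a `KeyError` when a class has no donor, and the nearest neighbour search uses a single distance variable.

```python
import random

def hot_deck(records, target, class_vars, rng):
    """Random hot deck within classes defined by class_vars."""
    donors = {}
    for r in records:
        if r[target] is not None:
            key = tuple(r[v] for v in class_vars)
            donors.setdefault(key, []).append(r[target])
    result = []
    for r in records:
        value = r[target]
        if value is None:
            key = tuple(r[v] for v in class_vars)
            value = rng.choice(donors[key])   # KeyError if the class has no donor
        result.append(value)
    return result

def nearest_neighbour(records, target, distance_var):
    """Impute the target value of the donor closest on distance_var."""
    donors = [r for r in records
              if r[target] is not None and r[distance_var] is not None]
    result = []
    for r in records:
        value = r[target]
        if value is None:
            donor = min(donors,
                        key=lambda d: abs(d[distance_var] - r[distance_var]))
            value = donor[target]
        result.append(value)
    return result

people = [
    {"age": 30, "gender": "f", "income": 2500, "education": "middle"},
    {"age": 30, "gender": "f", "income": 2700, "education": "high"},
    {"age": 30, "gender": "f", "income": None, "education": "middle"},
    {"age": 55, "gender": "m", "income": 4100, "education": None},
    {"age": 55, "gender": "m", "income": 4000, "education": "high"},
]

rng = random.Random(7)
incomes = hot_deck(people, "income", ["age", "gender"], rng)
education = nearest_neighbour(people, "education", "income")
print(incomes)
print(education)
```

Because donor methods only reuse observed values, every imputed value automatically lies in the domain of valid answers.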

  12. Single imputation • Imputation using a model • Imputation of the mean within groups and nearest neighbour imputation implicitly use auxiliary information, and thus a model. Select those strata or nearest neighbours for which the corresponding auxiliary variables relate strongly to the missing item. • More sophisticated imputation techniques have been developed that model missing items by non-missing items. • The situation is similar to unit nonresponse: how to build such models and how to select auxiliary variables? • The main difference between item and unit nonresponse is that non-missing items are available in addition to the auxiliary information from administrative data.

  13. Single imputation • Ratio imputation • Impute where • Regression imputation • Impute where • Examples • Model income using size of household and average house value • Model health status using age, gender and employment status.
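
A minimal sketch of both techniques, assuming their standard forms: ratio imputation fills in b * x with b the ratio of the respondent totals of the target and auxiliary variable, and regression imputation fills in the fitted value of a least-squares line with one auxiliary variable. The function names and toy data are hypothetical.

```python
def ratio_impute(y, x):
    """Impute b * x, with b = (sum of observed y) / (sum of matching x)."""
    pairs = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    b = sum(yi for yi, _ in pairs) / sum(xi for _, xi in pairs)
    return [b * xi if yi is None else yi for yi, xi in zip(y, x)]

def regression_impute(y, x):
    """Impute intercept + slope * x from a least-squares fit on respondents."""
    pairs = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    count = len(pairs)
    x_bar = sum(xi for _, xi in pairs) / count
    y_bar = sum(yi for yi, _ in pairs) / count
    sxy = sum((xi - x_bar) * (yi - y_bar) for yi, xi in pairs)
    sxx = sum((xi - x_bar) ** 2 for _, xi in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * xi if yi is None else yi for yi, xi in zip(y, x)]

house_value = [100, 200, 300, 400]   # auxiliary variable (thousands)
income = [20, 40, None, 80]          # target variable (thousands)

by_ratio = ratio_impute(income, house_value)
by_regression = regression_impute(income, house_value)
print(by_ratio)
print(by_regression)
```

In this toy data income is exactly proportional to house value, so both methods impute the same value; with real data they differ whenever the fitted line does not pass through the origin.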

  14. Single imputation • A general model for imputation • Most of the proposed imputation techniques can be put into a general framework. • Let be constants and be a random term, then the general model has the form • Imputation of mean: All terms are zero except which equals the overall mean. • Hot deck imputation: Let the random term take values in the set of item responses. • Ratio and regression imputation: Take corresponding estimated parameters. Random term equals zero. • Exception: nearest neighbour imputation • Benefit of general framework is development of theory to compare different techniques
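
One way the general model can be written is sketched below. The symbols $\hat{Y}_k$ (imputed value), $B_0, \ldots, B_p$ (constants), $X_{kj}$ (auxiliary variables) and $E$ (random term) are assumed names chosen for this illustration and may differ from the notation used on the slide.

```latex
\hat{Y}_k = B_0 + \sum_{j=1}^{p} B_j X_{kj} + E
% Imputation of the mean: B_1 = ... = B_p = 0, E = 0,
%   and B_0 equals the overall mean of the available observations.
% Hot deck imputation: B_0 = ... = B_p = 0,
%   and E is drawn at random from the set of item responses.
% Ratio and regression imputation: the B_j are the estimated parameters, E = 0.
```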

  15. Effects of single imputation – general effects • Imputed value must belong to the domain of valid answers • Qualitative variable: some form of donor imputation. • Quantitative variable: any technique. • Effect on mean: • Deterministic imputation: mean not affected. • Random imputation: mean is affected, but its expected value is not. • Effect on distribution • Deterministic imputation: distribution becomes more ‘peaked’. • Random imputation: distribution is better preserved. • Effect on correlation • Both deterministic and random imputation may affect the value of correlations. • Correlations after imputation will be smaller.

  16. Effects of single imputation – some notation • Target variable (with missing values): $Y_1, Y_2, \ldots, Y_N$ • Sample of size $n$: $a_1, a_2, \ldots, a_N$, where $a_k = 1$ if element $k$ is selected and $a_k = 0$ otherwise. • Missing data: $R_1, R_2, \ldots, R_N$, where $R_k = 1$ if the value of element $k$ is available and $R_k = 0$ otherwise. • Number of available observations: $m = \sum_{k=1}^{N} a_k R_k$ • Mean of available observations: $\frac{1}{m} \sum_{k=1}^{N} a_k R_k Y_k$ • Imputation: the value $Y_k$ is missing if $a_k = 1$ and $R_k = 0$. Then a synthetic value is used

  17. Effects of single imputation – some notation • Mean of imputed values • Estimator after imputation • Expected value • Variance

  18. Effects of single imputation – imputation of the mean • Imputed value, for all missing Yk: • Mean of imputed values: • Estimator • Expected value not affected • Variance:

  19. Effects of single imputation – imputation of the mean • Suppose • A researcher is given the complete (imputed) data set, and • he doesn’t know that imputation of the mean has been carried out. • To determine the standard error of the mean • He computes the sample variance, and • uses it as an estimator of the true population variance $S^2$ • However, the sample variance of the completed data set equals $(m - 1)/(n - 1)$ times the sample variance of the $m$ available observations • And therefore underestimates the population variance • Estimates are less precise than he thinks!
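
The shrinkage of the sample variance can be checked numerically. In this sketch the observed values are invented; `variance` from the standard library uses the usual divisor (number of values minus one).

```python
from statistics import mean, variance

observed = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0]   # m = 6 available values
n_missing = 4                                    # so n = 10 sampled elements

# Imputation of the mean: every missing value gets the respondent mean.
completed = observed + [mean(observed)] * n_missing

m, n = len(observed), len(completed)
s2_observed = variance(observed)     # divisor m - 1
s2_completed = variance(completed)   # divisor n - 1
shrink = (m - 1) / (n - 1)

print(mean(observed), mean(completed))       # the mean is unchanged
print(s2_observed, s2_completed, shrink * s2_observed)
```

The imputed values add nothing to the sum of squared deviations while the divisor grows from m - 1 to n - 1, which is why the variance of the completed data set is too small by exactly that factor.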

  20. Effects of single imputation – imputation of the mean • Example: Population of size $N = 19{,}000$. Sample of size $n = 1{,}000$. Population variance $S^2 = 360{,}000$. • Variance of mean in case of full response: • 10% missing values, available observations $m = 900$. Imputation of mean is carried out. Variance of estimator after imputation: • Standard deviation of all (real and imputed) observations: • (Wrong) estimate of variance of sample mean
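
The quantities in this example can be recomputed under stated assumptions. The sketch assumes simple random sampling without replacement, so that the variance of a sample mean is (1 - n/N) S²/n, that the respondents behave like a simple random sample of size m, and that the variance of the completed data set shrinks by the factor (m - 1)/(n - 1). These formulas are assumptions of the sketch, not a transcription of the slide.

```python
from math import sqrt

N, n, m = 19_000, 1_000, 900
S2 = 360_000

def srs_variance(sample_size, pop_variance):
    """Variance of a sample mean under simple random sampling without replacement."""
    return (1 - sample_size / N) * pop_variance / sample_size

v_full = srs_variance(n, S2)             # full response
v_imputed = srs_variance(m, S2)          # only m real observations carry information
s2_completed = (m - 1) / (n - 1) * S2    # expected variance of real + imputed values
sd_completed = sqrt(s2_completed)
v_wrong = srs_variance(n, s2_completed)  # naive estimate that treats imputed data as real

print(round(v_full, 2), round(v_imputed, 2))
print(round(sd_completed, 2), round(v_wrong, 2))
```

Under these assumptions the naive variance estimate is smaller than the full-response variance, while the actual variance after imputation is larger.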

  21. Effects of single imputation – random imputation • Imputed value is randomly selected from the available observations. • Expected value is not affected: • Variance: • Variance consists of two components: • Normal sampling variance. • Variance introduced by imputation mechanism. • Expected value of standard deviation of all observations: • Variance estimator is asymptotically design unbiased.

  22. Multiple imputation • Disadvantage of single imputation: underestimation of the population variance • Multiple imputation (MI) is a solution to this problem • MI replaces each missing value by m > 1 synthetic values. This leads to m data sets, for each of which we obtain an estimate of the population characteristic. These estimates can be combined to produce estimates and confidence intervals.

  23. Multiple imputation • MI assumes some kind of model, e.g. a linear model like • The effect of imputation depends on the missing data mechanism • If MCAR, we can apply random imputation a number of times • If MAR, we can use the linear model above • If NMAR, no valid imputation model can be used • Consequently, if the missing data mechanism is not modelled properly, analysis of the imputed data sets can be seriously wrong!

  24. Multiple imputation • Let denote the estimator of data set j (for j = 1, 2, …, m) • The overall estimator is then defined by • The variance of this estimator equals which can be seen as the within imputation variance plus the between imputation variance
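
Combining the m estimates can be sketched with Rubin's combining rules, which this example assumes are the rules meant here: the overall estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the input numbers are hypothetical.

```python
from statistics import mean, variance

def combine(estimates, variances):
    """Combine m estimates and their variances using Rubin's rules."""
    m = len(estimates)
    overall = mean(estimates)
    within = mean(variances)        # average of the m variance estimates
    between = variance(estimates)   # spread of the m estimates, divisor m - 1
    total = within + (1 + 1 / m) * between
    return overall, within, between, total

estimates = [10.0, 12.0, 11.0, 13.0, 9.0]   # one estimate per imputed data set
variances = [4.0, 5.0, 4.5, 5.5, 4.0]       # their estimated variances

overall, within, between, total = combine(estimates, variances)
print(overall, within, between, total)
```

The between-imputation component is what single imputation leaves out: it reflects the uncertainty about the imputed values themselves.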

  25. Multiple imputation • What should be the number of imputations m? Rubin (1987) claims that it should not exceed m = 10. • The relative increase in variance is approximately equal to where is the rate of missing information
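
The effect of the number of imputations can be tabulated with a short sketch. It assumes the commonly cited approximation that using m imputations instead of infinitely many inflates the variance by a factor 1 + γ/m, where γ is the rate of missing information; the symbol γ and the function name are assumptions of this example.

```python
def variance_inflation(gamma, m):
    """Approximate variance factor for m imputations relative to infinitely many."""
    return 1 + gamma / m

table = {
    gamma: [round(variance_inflation(gamma, m), 3) for m in (1, 3, 5, 10)]
    for gamma in (0.1, 0.3, 0.5)
}

for gamma, row in table.items():
    print(gamma, row)
```

Even with half of the information missing, the factor for m = 10 is only about 1.05, which illustrates why a small number of imputations is usually regarded as sufficient.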

  26. Multiple imputation
