The European Statistical Training Programme (ESTP)

Chapter 13: Item nonresponse • Handbook: chapter 14 • How to treat missing values? • Single imputation • Effects of single imputation • Multiple imputation

Introduction • Nonresponse • Unit nonresponse: no information is obtained from a sampled person. • Item nonresponse: the person participated in the survey, but answers to some questions are missing.

Introduction • How to deal with item nonresponse? • Case wise deletion: • Ignore all cases with missing data. • Pair wise deletion: • Ignore only those cases with missing data on the variables needed for the analysis • Imputation: • Substitute estimates for missing data.
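The difference between the two deletion strategies can be sketched on a toy data set. The records, variable names and helper functions below are hypothetical and only serve as an illustration; `None` marks a missing item.

```python
# Hypothetical toy records; None marks an item nonresponse.
records = [
    {"age": 34, "income": 2500, "hours": 40},
    {"age": 51, "income": None, "hours": 36},
    {"age": 27, "income": 1900, "hours": None},
    {"age": 45, "income": 3100, "hours": 38},
]

def casewise(records):
    """Keep only records without any missing item."""
    return [r for r in records if all(v is not None for v in r.values())]

def pairwise(records, variables):
    """Keep records that are complete on the variables needed for one analysis."""
    return [r for r in records if all(r[v] is not None for v in variables)]

def mean(values):
    return sum(values) / len(values)

# Casewise deletion: only the two fully observed records remain.
mean_age_casewise = mean([r["age"] for r in casewise(records)])
# Pairwise deletion: each analysis uses every record that is complete for it.
mean_age_pairwise = mean([r["age"] for r in pairwise(records, ["age"])])
mean_income_pairwise = mean([r["income"] for r in pairwise(records, ["income"])])
```

Casewise deletion uses two records for every analysis here, while pairwise deletion uses four records for age and three for income, so different analyses may rest on different subsets.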

Introduction • Estimation under item nonresponse • The effectiveness of deletion and imputation techniques depends on the missing-data-patterns. • Under case wise and pair wise deletion one assumes that cases with missing data are on average the same as cases with full data. • Values that are imputed follow from a model that assumes that within the model item respondents and nonrespondents are on average the same. • Imputed data records cannot be treated the same way as non-imputed records. • Missing-data-mechanisms as for unit nonresponse • Missing Completely at Random (MCAR). • Missing at Random (MAR). • Not Missing at Random (NMAR).

Introduction • Missing-data-mechanisms as for item nonresponse • Missing Completely at Random (MCAR) • Missing at Random (MAR) • Not Missing at Random (NMAR) • Examples • MCAR: Respondent accidentally forgets to fill in the back of the questionnaire or overlooks a block of questions. • MAR: Older respondents more often do not want to state their income. • NMAR: Respondent does not want to state real income as it comes partially from moonlighting (untaxed income).
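A small simulation may help to see how the three mechanisms affect the mean of the observed incomes. Everything in this sketch is an assumption made for illustration: the income model, the missingness probabilities and the seed are hypothetical and are not taken from the course material.

```python
import random

random.seed(1)
n = 20000
people = []
for _ in range(n):
    age = random.uniform(20, 80)
    income = 1000 + 30 * age + random.gauss(0, 500)  # assumed: income rises with age
    people.append((age, income))

def respondent_mean(missing_prob):
    """Mean income of those who answer, given a function returning P(missing)."""
    answered = [inc for age, inc in people
                if random.random() >= missing_prob(age, inc)]
    return sum(answered) / len(answered)

full_mean = sum(inc for _, inc in people) / n

# MCAR: the same missingness probability for everybody.
bias_mcar = respondent_mean(lambda age, inc: 0.3) - full_mean
# MAR: missingness depends on age (observed), not on income itself.
bias_mar = respondent_mean(lambda age, inc: 0.5 if age >= 50 else 0.1) - full_mean
# NMAR: missingness depends on the unobserved income itself.
bias_nmar = respondent_mean(lambda age, inc: 0.6 if inc > 2500 else 0.1) - full_mean
```

In this setup the respondent mean is close to the full-sample mean under MCAR and clearly too low under MAR and NMAR. Under MAR the bias comes from the age composition of the respondents and can be removed by conditioning on age; under NMAR no observed variable explains it.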

Single imputation • Single versus multiple imputation • Single imputation: a missing value is replaced by a single (synthetic) value. • Multiple imputation: a missing value is replaced by a set of (synthetic) values. • Imputation techniques • Deductive imputation • Imputation of a mean • Random imputation • Imputation using donor records • Imputation using a model with auxiliary variables

Single imputation • Notation • Sample indicators: $a_k$ • Item response indicators: $R_k$ • Target variable: $Y_k$ • Auxiliary variable(s): • Imputed target variable: • Deductive imputation • The value of the missing item can be deduced from the non-missing items. • Example 1: Profits and costs are given while total revenue is missing; revenue follows as profits plus costs. • Example 2: Respondent is male but does not state how many times he was pregnant; the answer must be zero.
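The two deductive examples can be written as simple edit rules. This is a minimal sketch; the field names and the accounting identity revenue = profit + costs are assumptions made for the illustration.

```python
def impute_revenue(record):
    """Fill in total revenue when profit and costs are both reported.

    Assumes the identity revenue = profit + costs.
    """
    filled = dict(record)
    known = filled.get("profit") is not None and filled.get("costs") is not None
    if filled.get("revenue") is None and known:
        filled["revenue"] = filled["profit"] + filled["costs"]
    return filled

def impute_pregnancies(record):
    """A male respondent cannot have been pregnant, so a missing count is 0."""
    filled = dict(record)
    if filled.get("pregnancies") is None and filled.get("sex") == "male":
        filled["pregnancies"] = 0
    return filled

firm = impute_revenue({"profit": 20, "costs": 80, "revenue": None})
person = impute_pregnancies({"sex": "male", "pregnancies": None})
```

Records that do not satisfy the condition of a rule are returned unchanged, so the missing value stays missing and can be handled by another technique.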

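Single imputation • Imputation of a mean • Imputation of the overall mean: each missing value is replaced by the mean of all available observations. • Imputation of the mean within strata or groups: each missing value is replaced by the mean of the available observations in the same stratum or group. • Examples: • Imputation of mean income over all households • Imputation of mean profit over persons with the same size and having the same kind of job.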
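Both variants of mean imputation can be sketched in a few lines. The function names and the toy data are hypothetical; `None` marks a missing value.

```python
from collections import defaultdict

def impute_overall_mean(values):
    """Replace every missing value by the mean of the available values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_group_mean(values, groups):
    """Replace every missing value by the mean of the available values in its group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        if v is not None:
            sums[g] += v
            counts[g] += 1
    # Fails with ZeroDivisionError if a group contains no available value at all.
    return [sums[g] / counts[g] if v is None else v for v, g in zip(values, groups)]

income = [1000, None, 3000, 5000, None, 7000]
stratum = ["a", "a", "a", "b", "b", "b"]
overall = impute_overall_mean(income)
within = impute_group_mean(income, stratum)
```

With these toy data the overall mean imputes 4000 for both missing values, while the group means impute 2000 in stratum "a" and 6000 in stratum "b".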

Single imputation • Random imputation • Impute one of the possible values at random. • Cases with the same missing-data-pattern may have different imputed values. • Random imputation can also be employed within strata or groups. • Examples: • If the marital status is missing, sample randomly a value from the set {married, not married, divorced, widowed}. • If the marital status is missing, sample randomly a value from the set {married, not married, divorced, widowed} in case the respondent is 16 years or older, and otherwise impute not married. • If income is missing, fit a normal distribution to the non-missing records and sample a value from the fitted distribution.
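The marital status and income examples can be sketched as follows. The function names are hypothetical, and the seed is fixed only to make the illustration reproducible.

```python
import random
import statistics

STATUSES = ["married", "not married", "divorced", "widowed"]

def impute_marital_status(status, age, rng):
    """Random imputation within age groups: under 16 is always 'not married'."""
    if status is not None:
        return status
    if age < 16:
        return "not married"
    return rng.choice(STATUSES)

def impute_income_normal(incomes, rng):
    """Fit a normal distribution to the available incomes and draw from it."""
    observed = [v for v in incomes if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)  # needs at least two available values
    return [rng.gauss(mu, sigma) if v is None else v for v in incomes]

rng = random.Random(42)
child = impute_marital_status(None, 12, rng)
adult = impute_marital_status(None, 40, rng)
completed = impute_income_normal([1000, None, 3000], rng)
```

A draw from a fitted normal distribution can fall outside the domain of valid answers (for example a negative income), so in practice the imputed value may need an additional check.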

Single imputation • Imputation using donors • Hot deck imputation: sample randomly from the set of values found among the item respondents. • Nearest neighbour imputation: define a distance measure, search for the item respondent that is closest to the item nonrespondent and impute the corresponding value. • Examples: • In case income is missing, identify all persons with the same age and gender and impute randomly one of their incomes. • In case level of education is missing, search for the item respondent whose income is closest in absolute value and impute the corresponding level of education.
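The two donor examples can be sketched as below. The function names, field names and records are hypothetical, and the distance measure is simply the absolute difference on one variable.

```python
import random

def hot_deck(records, target, group_keys, rng):
    """Random hot deck: draw a donor value from item respondents in the same class."""
    donors = {}
    for r in records:
        if r[target] is not None:
            donors.setdefault(tuple(r[k] for k in group_keys), []).append(r[target])
    out = []
    for r in records:
        r = dict(r)
        if r[target] is None:
            # Raises KeyError if the class contains no item respondent.
            r[target] = rng.choice(donors[tuple(r[k] for k in group_keys)])
        out.append(r)
    return out

def nearest_neighbour(records, target, distance_key):
    """Impute the target value of the donor closest on distance_key."""
    donors = [r for r in records
              if r[target] is not None and r[distance_key] is not None]
    out = []
    for r in records:
        r = dict(r)
        if r[target] is None:
            # Assumes distance_key itself is observed for the item nonrespondent.
            nearest = min(donors, key=lambda d: abs(d[distance_key] - r[distance_key]))
            r[target] = nearest[target]
        out.append(r)
    return out

people = [
    {"age": 30, "gender": "f", "income": 2000, "education": "high"},
    {"age": 30, "gender": "f", "income": None, "education": "low"},
    {"age": 30, "gender": "f", "income": 2400, "education": None},
    {"age": 50, "gender": "m", "income": 3000, "education": "middle"},
]
with_income = hot_deck(people, "income", ["age", "gender"], random.Random(0))
with_education = nearest_neighbour(people, "education", "income")
```

Here the missing income is drawn from the two women aged 30 who reported one, and the missing education is copied from the donor whose income (2000) is closest to 2400.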

Single imputation • Imputation using a model • Implicitly imputation of the mean within groups and nearest neighbour imputation use auxiliary information and thus a model. Select those strata or nearest neighbour for which the corresponding auxiliary variables relate strongly to the missing item. • More sophisticated imputation techniques have been developed that model missing items by non-missing items. • Situation similar to unit nonresponse. How to build such models and how to select auxiliary variables? • Main difference between item and unit nonresponse is the availability of non-missing items next to auxiliary information available from administrative data.

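Single imputation • Ratio imputation • Impute the value of the auxiliary variable multiplied by a ratio, where the ratio is the mean of the target variable divided by the mean of the auxiliary variable, both computed over the item respondents. • Regression imputation • Impute the value predicted by a linear regression on the auxiliary variables, where the regression coefficients are estimated from the item respondents. • Examples • Model income using size of household and average house value. • Model health status using age, gender and employment status.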
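A minimal sketch of ratio and regression imputation with a single auxiliary variable follows. It assumes the usual estimators: the ratio of the respondent means for ratio imputation and ordinary least squares for regression imputation. The function names and data are hypothetical, and the auxiliary variable is assumed to be fully observed.

```python
def ratio_impute(y, x):
    """Impute b * x, with b the ratio of the respondent means of y and x."""
    pairs = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    b = sum(yi for yi, _ in pairs) / sum(xi for _, xi in pairs)
    return [b * xi if yi is None else yi for yi, xi in zip(y, x)]

def regression_impute(y, x):
    """Impute b0 + b1 * x, with least-squares estimates from the respondents."""
    pairs = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    n = len(pairs)
    mean_y = sum(yi for yi, _ in pairs) / n
    mean_x = sum(xi for _, xi in pairs) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for yi, xi in pairs)
    sxx = sum((xi - mean_x) ** 2 for _, xi in pairs)  # zero if all x are equal
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return [b0 + b1 * xi if yi is None else yi for yi, xi in zip(y, x)]

x = [1, 2, 3, 4]
y = [12, 22, None, 42]          # respondents lie on the line y = 2 + 10 x
by_ratio = ratio_impute(y, x)
by_regression = regression_impute(y, x)
```

Because the respondents follow a line with a nonzero intercept, regression imputation recovers 32 for the missing value, whereas ratio imputation, which forces the line through the origin, gives about 32.6.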

Single imputation • A general model for imputation • Most of the proposed imputation techniques can be put into a general framework. • The general model writes the imputed value as a constant, plus a linear combination of the auxiliary variables with constant coefficients, plus a random term. • Imputation of mean: all terms are zero except the constant, which equals the overall mean. • Hot deck imputation: let the random term take values in the set of item responses; all other terms are zero. • Ratio and regression imputation: take the corresponding estimated parameters as coefficients; the random term equals zero. • Exception: nearest neighbour imputation. • The benefit of a general framework is that theory can be developed to compare different techniques.
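The general model can be written out as follows. The symbols are assumptions, since the notation of the original slide is not recoverable here: $\hat{Y}_k$ is the imputed value, $X_{k1}, \dots, X_{kp}$ are the auxiliary variables, $B_0, \dots, B_p$ are constants, $E_k$ is the random term, and $\bar{y}_R$ is the mean of the available observations.

```latex
\begin{align*}
  \hat{Y}_k &= B_0 + \sum_{j=1}^{p} B_j X_{kj} + E_k
     && \text{(general model)} \\
  \hat{Y}_k &= \bar{y}_R
     && \text{(mean: } B_0 = \bar{y}_R,\ B_j = 0,\ E_k = 0\text{)} \\
  \hat{Y}_k &= E_k, \quad E_k \in \{\, Y_l : a_l R_l = 1 \,\}
     && \text{(hot deck: } B_0 = B_j = 0\text{)} \\
  \hat{Y}_k &= \hat{B}_0 + \sum_{j=1}^{p} \hat{B}_j X_{kj}
     && \text{(regression: estimated coefficients, } E_k = 0\text{)}
\end{align*}
```

Ratio imputation is the regression case with one auxiliary variable and no constant term.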

Effects of single imputation – general effects • Imputed value must belong to the domain of valid answers. • Qualitative variable: some form of donor imputation. • Quantitative variable: any technique. • Effect on mean: • Deterministic imputation: mean not affected. • Random imputation: mean is affected, but its expected value is not. • Effect on distribution: • Deterministic imputation: distribution becomes more ‘peaked’. • Random imputation: preserves the distribution better. • Effect on correlation: • Both deterministic and random imputation may affect the value of correlations. • Correlations after imputation will be smaller.

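Effects of single imputation – some notation • Target variable (with missing values): $Y_1, Y_2, \dots, Y_N$ • Sample of size $n$: indicators $a_1, a_2, \dots, a_N$, with $a_k = 1$ if element $k$ is selected and $a_k = 0$ otherwise. • Missing data: indicators $R_1, R_2, \dots, R_N$, with $R_k = 1$ if the value of element $k$ is available and $R_k = 0$ otherwise. • Number of available observations: $m = \sum_{k=1}^{N} a_k R_k$ • Mean of available observations: $\frac{1}{m} \sum_{k=1}^{N} a_k R_k Y_k$ • Imputation: the value $Y_k$ is missing if $a_k = 1$ and $R_k = 0$. A synthetic value is then used in its place.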

Effects of single imputation – some notation • Mean of imputed values • Estimator after imputation • Expected value • Variance
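The quantities named on this slide follow directly from the indicators $a_k$ and $R_k$. The symbols $\hat{Y}_k$ (imputed value), $\bar{y}_R$, $\bar{y}_{IMP}$ and $\bar{y}_I$ are assumed notation for this sketch; $m$ denotes the number of available observations.

```latex
\begin{align*}
  \bar{y}_R     &= \frac{1}{m} \sum_{k=1}^{N} a_k R_k Y_k
     && \text{(mean of available observations)} \\
  \bar{y}_{IMP} &= \frac{1}{n-m} \sum_{k=1}^{N} a_k (1 - R_k)\, \hat{Y}_k
     && \text{(mean of imputed values)} \\
  \bar{y}_I     &= \frac{1}{n} \left( \sum_{k=1}^{N} a_k R_k Y_k
                   + \sum_{k=1}^{N} a_k (1 - R_k)\, \hat{Y}_k \right)
                 = \frac{m\, \bar{y}_R + (n-m)\, \bar{y}_{IMP}}{n}
     && \text{(estimator after imputation)}
\end{align*}
```

The expected value and variance of $\bar{y}_I$ depend on the imputation technique and on the missing-data mechanism.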

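Effects of single imputation – imputation of the mean • Imputed value, for all missing $Y_k$: the mean of the available observations. • Mean of imputed values: equal to the mean of the available observations. • Estimator: the mean of the completed data set, which reduces to the mean of the available observations. • Expected value not affected. • Variance: that of a mean based on the $m$ available observations rather than on all $n$ sampled elements, so it is larger than under full response.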

Effects of single imputation – imputation of the mean • Suppose • A researcher is given the complete (imputed) data set, and • he does not know that imputation of the mean has been carried out. • To determine the standard error of the mean • He computes the sample variance of all $n$ values, and • uses it as an estimator of the true population variance $S^2$. • However, the imputed values contribute nothing to the sum of squares, so this sample variance equals $(m - 1)/(n - 1)$ times the sample variance of the $m$ available observations, • and it therefore underestimates the population variance. • Estimates are less precise than he thinks!
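The shrinkage of the sample variance after mean imputation can be checked numerically. The eight observed values below are hypothetical; the factor $(m - 1)/(n - 1)$ follows because every imputed value equals the mean and adds nothing to the sum of squares.

```python
import statistics

# Hypothetical data: m = 8 available values, 2 missing values.
observed = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0, 18.0, 13.0]
n_missing = 2

mean_r = statistics.mean(observed)
completed = observed + [mean_r] * n_missing   # imputation of the mean

m, n = len(observed), len(completed)
var_observed = statistics.variance(observed)    # divisor m - 1
var_completed = statistics.variance(completed)  # divisor n - 1
shrink = (m - 1) / (n - 1)                      # 7/9 in this example
```

The variance computed from the completed data set is exactly `shrink * var_observed`, so a researcher who treats the imputed values as real observations works with a variance that is too small.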

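Effects of single imputation – imputation of the mean • Example (assuming simple random sampling without replacement): Population of size $N$ = 19,000. Sample of size $n$ = 1,000. Population variance $S^2$ = 360,000. • Variance of mean in case of full response: $(1/n - 1/N)\, S^2 \approx 341$. • 10% missing values, available observations $m$ = 900. Imputation of mean is carried out. Variance of estimator after imputation: $(1/m - 1/N)\, S^2 \approx 381$. • Standard deviation of all (real and imputed) observations: about $\sqrt{(m-1)/(n-1)}\, S \approx 569$, against $S = 600$. • (Wrong) estimate of variance of sample mean: $(1/n - 1/N) \cdot 569^2 \approx 307$.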
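The quantities in this example can be computed as follows. The sketch assumes simple random sampling without replacement, response that behaves like a random subsample of size m, and the standard finite-population variance formula for a sample mean; the variable names are chosen for the illustration.

```python
import math

N, n, m = 19_000, 1_000, 900
S2 = 360_000

# Variance of the sample mean under full response.
var_full = (1 / n - 1 / N) * S2            # about 341.05
# True variance after mean imputation: only m real observations carry information.
var_imputed = (1 / m - 1 / N) * S2         # about 381.05
# Expected sample variance of the completed data set (real plus imputed values).
s2_all = (m - 1) / (n - 1) * S2
sd_all = math.sqrt(s2_all)                 # about 569.18, against S = 600
# Variance estimate of a researcher who treats all n values as real observations.
var_naive = (1 / n - 1 / N) * s2_all       # about 306.91
```

The naive estimate (about 307) is well below the true variance (about 381), which is the sense in which the estimates are less precise than the researcher thinks.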

Effects of single imputation – random imputation • Imputed value is randomly selected from the available observations. • Expected value is not affected: • Variance: • Variance consists of two components: • Normal sampling variance. • Variance introduced by the imputation mechanism. • Expected value of standard deviation of all observations: • Variance estimator is asymptotically design unbiased.
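A small Monte Carlo experiment can illustrate the extra variance component that random imputation introduces. All settings here are assumptions for the sketch: values are drawn from a normal distribution (an effectively infinite population), the first m sampled values are treated as available (MCAR), and donors are drawn with replacement.

```python
import random
import statistics

rng = random.Random(7)
n, m, reps = 100, 70, 2000
est_mean_imp, est_random_imp = [], []

for _ in range(reps):
    sample = [rng.gauss(50, 10) for _ in range(n)]
    observed = sample[:m]   # MCAR: which items are missing is unrelated to the values
    # Imputation of the mean: the completed-data mean equals the respondent mean.
    est_mean_imp.append(statistics.mean(observed))
    # Random imputation: draw donor values with replacement from the observed ones.
    donors = [rng.choice(observed) for _ in range(n - m)]
    est_random_imp.append((sum(observed) + sum(donors)) / n)

avg_random_imp = statistics.mean(est_random_imp)
var_mean_imp = statistics.variance(est_mean_imp)
var_random_imp = statistics.variance(est_random_imp)
```

Across replications the estimator after random imputation stays centred on the true mean of 50, but its variance is larger than that of the estimator after mean imputation; the difference is the variance introduced by the imputation mechanism.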

Multiple imputation • Disadvantage of single imputation: underestimation of the population variance. • Multiple imputation (MI) is a solution to this problem. • MI replaces each missing value by m > 1 synthetic values. This leads to m data sets, for each of which we obtain an estimate of the population characteristic. These estimates can be combined to produce overall estimates and confidence intervals.

Multiple imputation • MI assumes some kind of model, e.g. a linear model that predicts the target variable from the auxiliary variables. • The effect of imputation depends on the missing-data mechanism. • If MCAR, we can apply random imputation a number of times. • If MAR, we can use the linear model above. • If NMAR, no valid imputation model can be used. • Consequently, if the missing-data mechanism is not modelled properly, analysis of the imputed data sets can be seriously wrong!

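Multiple imputation • Each data set j (for j = 1, 2, …, m) yields its own estimate of the population characteristic. • The overall estimator is then defined as the average of the m estimates. • The variance of this estimator is estimated by the average of the m within-data-set variance estimates plus (1 + 1/m) times the variance between the m estimates, which can be seen as the within-imputation variance plus the between-imputation variance.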
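The combination of the m estimates is commonly done with Rubin's rules, which can be sketched as below. The function name `pool` and the five estimates and variances are hypothetical.

```python
import statistics

def pool(estimates, variances):
    """Combine m completed-data estimates and their variance estimates (Rubin's rules)."""
    m = len(estimates)
    overall = statistics.mean(estimates)        # average of the m estimates
    within = statistics.mean(variances)         # within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return overall, within, between, total

# Hypothetical results from m = 5 imputed data sets.
estimates = [10.0, 12.0, 11.0, 13.0, 9.0]
variances = [4.0, 4.5, 3.5, 4.0, 4.0]
overall, within, between, total = pool(estimates, variances)
```

In this example the within-imputation variance is 4.0 and the between-imputation variance is 2.5, giving a total variance of 7.0; the between component is what a single imputation cannot reveal.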

Multiple imputation • What should be the number of imputations m? Rubin (1987) claims that it need not exceed m = 10. • Compared with an infinite number of imputations, the variance is inflated by a factor of approximately $1 + \gamma/m$, where $\gamma$ is the rate of missing information.
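The effect of the number of imputations can be tabulated with the approximation usually attributed to Rubin (1987): with m imputations the variance is about (1 + gamma/m) times the variance obtained with infinitely many imputations. The function name and the chosen values of gamma and m are assumptions for this sketch.

```python
def relative_efficiency(gamma, m):
    """Efficiency of m imputations relative to infinitely many: 1 / (1 + gamma / m)."""
    return 1 / (1 + gamma / m)

# Rows: rate of missing information gamma; columns: number of imputations m.
table = {
    gamma: {m: round(relative_efficiency(gamma, m), 3) for m in (3, 5, 10)}
    for gamma in (0.1, 0.3, 0.5)
}
```

Even with half of the information missing, ten imputations already give an efficiency of about 0.95, which is why a small number of imputations is usually considered enough.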
