Prediction and Imputation in ISEE

- Tools for more efficient use of combined data sources

Li-Chun Zhang, Statistics Norway

Svein Nordbotton, University of Bergen

• Combining administrative and survey data

• Model-based prediction or weighting

• Construction of statistical registers

• Uses of a statistical register

• Prediction of (sub-)population totals

• Multiple uses & general database quality => inferential concerns associated with imputation

• How to balance the two types of inferential concerns?

A triple-goal criterion for statistical registers

• Efficient estimation of population totals of interest

• Correct co-variances among survey variables, as well as between survey and auxiliary variables

• Non-stochastic & constant tabulation

• Nearest-neighbor imputation (NNI) as the only feasible approach in terms of preserving co-variances among all the variables. To improve efficiency, introduce restrictions on the imputed totals; these may be obtained separately from imputation, say, through regression prediction. This is referred to as NNI with restrictions (NNI-WR).

• A simultaneous prediction method

• Values are generated outside of the sample

• Efficient for prediction of population totals

• Not the optimal (or best) prediction of each specific unit, but of the ensemble of units, now that attention is given to the co-variances among the variables.

• Separation of prediction of totals from general imputation concerns, allowing full freedom in search of efficient methods

• Solves variance estimation problem at the same time

• Genuine multivariate imputation with realistic imputed values

• Non-parametric nature and mild regularity conditions suggest robustness, compared to standard regression-based approaches

• NNI can be made non-stochastic, yielding constant tabulations on repetition

• An algorithm

• Jump-start phase: to speed up the imputation procedure if desirable

• Fine-tune phase: relaxation to k-nearest neighbor imputation for better agreement with restrictions; consistency remains

• Adjustment between the two phases
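The idea of NNI with restrictions can be illustrated with a small sketch. It assumes a single covariate x, a single restriction on the imputed total of y, and a greedy donor swap within the k nearest neighbors as the fine-tune step. The function name `nni_with_restriction`, the greedy rule, and the toy data are illustrative assumptions, not the authors' algorithm; the jump-start phase and the adjustment between phases are not modelled.

```python
def nni_with_restriction(donors, recipients_x, target_total, k=3):
    """Hypothetical sketch: donors is a list of (x, y) pairs.

    Returns the chosen donor index per recipient and the imputed total.
    """
    # k nearest donors for each recipient, nearest first (ties by index).
    neighbors = [
        sorted(range(len(donors)), key=lambda j: (abs(donors[j][0] - x), j))[:k]
        for x in recipients_x
    ]
    # Start from plain nearest-neighbor imputation.
    choice = [nb[0] for nb in neighbors]

    while True:
        total = sum(donors[j][1] for j in choice)
        gap = abs(total - target_total)
        best = None  # (new_gap, recipient index, donor index)
        # Fine-tune: look for the single donor swap, within the k nearest
        # neighbors, that brings the imputed total closest to the target.
        for i, nb in enumerate(neighbors):
            for j in nb:
                new_total = total - donors[choice[i]][1] + donors[j][1]
                new_gap = abs(new_total - target_total)
                if new_gap < gap and (best is None or new_gap < best[0]):
                    best = (new_gap, i, j)
        if best is None:  # no swap improves agreement with the restriction
            return choice, total
        choice[best[1]] = best[2]


# Toy data (assumed): the target total would come from a separate prediction.
donors = [(1, 10), (2, 14), (3, 15), (4, 22), (5, 30)]
choice, total = nni_with_restriction(donors, [1.2, 2.6, 4.4], target_total=50, k=2)
print(choice, total)
```

Every imputed value is still an observed y-value from a nearby donor, so the imputed records remain realistic while the total is pulled towards the restriction.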

• Current research

• How well does the algorithm perform in real statistical production?

• Effective way of setting up the restrictions, i.e. maximum control with minimum number of explicit restrictions for imputation?

• Evaluation of micro-data quality

Background information:

Some standard methods of prediction and imputation

• Under the general linear model:

• Target parameter T = linear combination of y-values in the population

• Estimation of T <=> Prediction of T outside of the selected sample

• Prediction of individuals: A special case

• Main problems for a statistical register

• Lack of natural variation in data; especially if many units have the same x-values

• Infeasible simultaneously for a large number of variables; impractical as production mode; leading to inconsistency of cross-tabulation
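A minimal sketch of prediction under a linear model, assuming a single covariate x and ordinary least squares; the helper names `fit_line` and `predict_total` and the data are hypothetical. The total is the sum of the observed y-values plus the predicted y-values outside the sample, and units with the same x-value receive the same prediction, which is the lack of natural variation noted above.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x (single covariate)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_total(x_sample, y_sample, x_nonsample):
    """Observed total inside the sample plus predicted total outside it."""
    a, b = fit_line(x_sample, y_sample)
    predicted = [a + b * xi for xi in x_nonsample]
    return sum(y_sample) + sum(predicted), predicted


# Assumed toy data: two non-sampled units share the same x-value.
total, y_hat = predict_total([0, 1, 2], [1, 3, 5], [3, 3])
print(total, y_hat)
```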

• To emulate the natural variation in data: Add a random residual to the best predicted y-value

• Hot-deck as a special case

• Main problems:

• Extra variance of imputed estimator due to random imputation => never fully efficient

• Random imputation not the only means for creating natural variation in data

• Different tabulations on repetition => lack of acceptability and face value in official statistics
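As a rough illustration of adding a random residual to the best predicted y-value, the sketch below fits a one-covariate least-squares line and draws each residual at random from the observed residuals. The function names, the choice of resampling empirical residuals, and the data are assumptions made for the example.

```python
import random


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x (single covariate)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def random_regression_impute(x_obs, y_obs, x_mis, seed):
    """Predicted value plus a residual drawn at random from the observed ones."""
    a, b = fit_line(x_obs, y_obs)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x_obs, y_obs)]
    rng = random.Random(seed)  # the seed stands in for one production run
    return [a + b * xi + rng.choice(residuals) for xi in x_mis]


x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [2.0, 4.5, 5.5, 8.5]
x_mis = [2.0, 2.0, 2.0, 2.0]

# Units with identical x now get varying values, but a rerun with another
# seed will generally give a different imputed total.
for seed in (1, 2):
    imputed = random_regression_impute(x_obs, y_obs, x_mis, seed)
    print(seed, sum(imputed))
```

Fixing the seed makes a single run reproducible, but the imputed total still carries the extra variance of the random draw.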

• Independent random imputations + formulae for combining results

• Bayesian or frequentist approach

• Main problems:

• Removes all the extra imputation variance only with an infinite number of repetitions. Otherwise, still not fully efficient & non-constant tabulations

• A common misunderstanding: only MI can yield acceptable measures of accuracy.
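The slides do not spell out the combining formulae. Assuming they are the usual combining rules for multiple imputation (Rubin's rules), a small sketch looks as follows; the function name `combine_mi` and the numbers are illustrative only.

```python
def combine_mi(estimates, variances):
    """Combine m completed-data estimates and their variance estimates.

    Assumes Rubin's combining rules; needs at least two imputations.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                               # combined point estimate
    u_bar = sum(variances) / m                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b                      # total variance
    return q_bar, u_bar, b, total_var


# Assumed results from m = 3 independent random imputations.
q_bar, u_bar, b, total_var = combine_mi([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
print(q_bar, total_var)
```

The term b / m in the total variance is the part that comes from using only finitely many imputations; it vanishes only as m grows without bound.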

• Find the donor among the observed units who has the same predicted y-value & impute the observed y-value

• Noticeable difference from RRI as the chance of multiple donors decreases; PMM is more efficient due to the removal of imputation variance.

• Essentially a marginal, variable-by-variable approach
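The donor search by predicted value can be sketched as below. The sketch assumes a one-covariate least-squares predictor and takes the donor with the closest predicted value (an exact match is the zero-distance case), breaking ties by the lowest index so that the result is deterministic. The names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x (single covariate)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def pmm_impute(x_obs, y_obs, x_mis):
    """Impute the observed y of the donor whose predicted y is closest."""
    a, b = fit_line(x_obs, y_obs)
    pred_obs = [a + b * xi for xi in x_obs]
    imputed = []
    for xi in x_mis:
        target = a + b * xi
        # Closest predicted value wins; ties go to the lowest donor index.
        donor = min(range(len(x_obs)), key=lambda j: (abs(pred_obs[j] - target), j))
        imputed.append(y_obs[donor])
    return imputed


imputed = pmm_impute([1, 2, 3], [2.0, 4.5, 5.0], [2, 2.9])
print(imputed)
```

Because the model is fitted for one y-variable at a time, each variable may end up with a different donor, which is why the approach is marginal rather than multivariate.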

• Provided a set of covariates and a distance metric, the donor is the ‘nearest’ observed unit.

• A non-parametric generalization of PMM & hot-deck as a special case. More flexible and practical for multivariate imputation than regression models.

• Chen and Shao (2000): consistent estimator of totals as well as finite population distributions, provided the absolute difference in conditional means of y is bounded by the ‘distance’ between two units. Linear models as special cases.

• Can be made non-stochastic by introducing extra seemingly uncorrelated covariates, such as Zip code.

• Main drawback: Usually not efficient (i.e. local smoothing instead of global regression predictor)
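A minimal sketch of nearest-neighbor imputation with a deterministic tie-break, assuming Euclidean distance on the matching covariates and a numeric zip code as the extra covariate that only matters when distances tie. The record layout, field names, and data are invented for the illustration.

```python
import math


def nearest_neighbor_impute(donors, recipients, covariates, tiebreak, targets):
    """Copy all target variables from the single nearest donor record."""
    completed = []
    for rec in recipients:
        def sort_key(donor):
            dist = math.dist(
                [donor[c] for c in covariates], [rec[c] for c in covariates]
            )
            # Ties in distance are resolved by the extra covariate, then by id,
            # so repeated runs always select the same donor.
            return (dist, abs(donor[tiebreak] - rec[tiebreak]), donor["id"])

        donor = min(donors, key=sort_key)
        filled = dict(rec)
        for t in targets:
            filled[t] = donor[t]  # same donor for every variable
        filled["donor_id"] = donor["id"]
        completed.append(filled)
    return completed


donors = [
    {"id": 1, "age": 30, "size": 2, "zip": 5003, "income": 400, "hours": 37},
    {"id": 2, "age": 30, "size": 2, "zip": 5020, "income": 520, "hours": 40},
    {"id": 3, "age": 55, "size": 1, "zip": 5007, "income": 610, "hours": 20},
]
recipients = [
    {"id": 10, "age": 30, "size": 2, "zip": 5018},
    {"id": 11, "age": 50, "size": 1, "zip": 5001},
]
completed = nearest_neighbor_impute(
    donors, recipients, ["age", "size"], "zip", ["income", "hours"]
)
print([c["donor_id"] for c in completed])
```

Taking all target variables from one donor keeps their observed joint values together, and the tie-break rule yields the same tabulations on repetition.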

• Class of functional imputation

• ANN as generalized regression functions (Bishop, 1995)

• No analytic predictor

• Unrealistic imputed values for categorical variables of interest

• Usually not fully efficient