Prediction and Imputation in ISEE

- Tools for more efficient use of combined data sources

Li-Chun Zhang, Statistics Norway

Svein Nordbotton, University of Bergen

Use of a statistical register

  • Combining administrative and survey data

    • Model-based prediction or weighting

    • Construction of statistical registers

  • Uses of a statistical register

    • Prediction of (sub-)population totals

    • Multiple uses & general database quality => inferential concerns associated with imputation

  • How to balance between the two types of inferential concerns?

A triple-goal criterion for statistical registers

  • Efficient prediction of population totals of interest

  • Correct co-variances among survey variables, as well as between survey and auxiliary variables

  • Non-stochastic & constant tabulation

A simultaneous prediction method

  • Nearest neighbor imputation (NNI) is the only feasible approach for preserving co-variances among all the variables. To improve efficiency, introduce restrictions on the imputed totals, which may be obtained separately from imputation, say, through regression prediction. This is referred to as NNI with restrictions (NNI-WR).

  • A simultaneous prediction method

    • Values are generated outside of the sample

    • Efficient for prediction of population totals

    • Not the optimal (or best) prediction for each specific unit, but for the ensemble of units, now that attention is given to the co-variances among the variables.

About NNI-WR

  • Separation of prediction of totals from general imputation concerns, allowing full freedom in search of efficient methods

  • Solves variance estimation problem at the same time

  • Genuine multivariate imputation with realistic imputed values

  • Non-parametric nature and mild regularity condition suggest robustness, compared to standard regression-based approaches

  • NNI can be made non-stochastic, yielding constant tabulations on repetition

An algorithm and current research

  • An algorithm

    • Jump-start phase: to speed up the imputation procedure if desirable

    • Fine-tune phase: relaxation to k-nearest neighbor imputation for better agreement with restrictions; consistency remains

    • Adjustment between the two phases

  • Current research

    • How well does the algorithm perform in real statistical production?

    • Effective way of setting up the restrictions, i.e. maximum control with minimum number of explicit restrictions for imputation?

    • Evaluation of micro-data quality
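The fine-tune idea above (relaxing from the single nearest donor to the k nearest donors so that the imputed total agrees better with a restriction) can be illustrated with a small sketch. This is not the authors' algorithm: the greedy single pass, the single covariate, the single restricted total, and all names (`k_nearest`, `nni_with_restriction`, `target_total`) are assumptions made for illustration. The restriction is assumed to have been obtained separately, for example by regression prediction.

```python
def k_nearest(donor_x, x, k):
    """Indices of the k donors closest to x (ties broken by donor order)."""
    order = sorted(range(len(donor_x)), key=lambda i: (abs(donor_x[i] - x), i))
    return order[:k]


def nni_with_restriction(donor_x, donor_y, recip_x, target_total, k):
    """Illustrative greedy pass: start from plain NNI, then relax to k-NN."""
    # Start from plain NNI: every recipient takes its single nearest donor.
    choice = [k_nearest(donor_x, x, 1)[0] for x in recip_x]
    total = sum(donor_y[i] for i in choice)
    # Switch to another of the k nearest donors only when that moves the
    # imputed total strictly closer to the restriction.
    for r, x in enumerate(recip_x):
        for cand in k_nearest(donor_x, x, k):
            new_total = total - donor_y[choice[r]] + donor_y[cand]
            if abs(new_total - target_total) < abs(total - target_total):
                total = new_total
                choice[r] = cand
    return [donor_y[i] for i in choice], total


# Toy data (hypothetical): four donors, three recipients, one restriction.
donor_x = [1.0, 2.0, 3.0, 4.0]
donor_y = [10.0, 14.0, 15.0, 22.0]
recip_x = [1.1, 2.9, 3.1]

plain_total = sum(donor_y[k_nearest(donor_x, x, 1)[0]] for x in recip_x)
imputed, total = nni_with_restriction(donor_x, donor_y, recip_x,
                                      target_total=44.0, k=2)
```

In this toy case plain NNI gives an imputed total of 40, and one donor switch brings it to the restricted total of 44. Every imputed value is still an observed donor value, and the procedure is deterministic, so repeated runs give the same tabulation.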

Background information:

Some standard methods of prediction and imputation

Basic prediction approach

  • Under the general linear model:

    • Target parameter T = linear combination of y-values in the population

    • Estimation of T <=> Prediction of T outside of the selected sample

    • Prediction of individuals: A special case

  • Main problems for a statistical register

    • Lack of natural variation in data; especially if many units have the same x-values

    • Infeasible simultaneously for a large number of variables; impractical as production mode; leading to inconsistency of cross-tabulation
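A minimal sketch of the prediction approach, assuming a simple linear model with a single covariate x that is known for every unit in the register. The function names and the toy data are illustrative and not taken from the presentation. The total is estimated as the observed sum in the sample plus the predicted sum outside the sample.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (single covariate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_total(sample_x, sample_y, nonsample_x):
    """Observed sum in the sample plus predicted sum outside the sample."""
    a, b = fit_line(sample_x, sample_y)
    predicted = [a + b * x for x in nonsample_x]
    return sum(sample_y) + sum(predicted), predicted


# Toy register (hypothetical): y observed for four sampled units only.
sample_x = [1.0, 2.0, 3.0, 4.0]
sample_y = [2.0, 4.0, 6.0, 8.0]   # exactly y = 2x, so the fit is exact
nonsample_x = [5.0, 5.0, 6.0]

total, predicted = predict_total(sample_x, sample_y, nonsample_x)
```

The two non-sampled units with x = 5 receive identical predicted values, which is the lack of natural variation noted above.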

Random regression imputation (RRI)

  • To emulate the natural variation in data: Add a random residual to the best predicted y-value

  • Hot-deck as a special case

  • Main problems:

    • Extra variance of imputed estimator due to random imputation => never fully efficient

    • Random imputation not the only means for creating natural variation in data

    • Different tabulations on repetition => lack of acceptability and face value in official statistics
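A minimal sketch of RRI with a single covariate. Drawing the random residual from the observed regression residuals is one possible choice and is an assumption here, as are the function names and the toy data.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (single covariate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def random_regression_impute(obs_x, obs_y, miss_x, seed):
    """Best predicted value plus a residual drawn at random (assumed scheme)."""
    a, b = fit_line(obs_x, obs_y)
    residuals = [y - (a + b * x) for x, y in zip(obs_x, obs_y)]
    rng = random.Random(seed)  # explicit seed so a run can be reproduced
    return [a + b * x + rng.choice(residuals) for x in miss_x]


# Toy data (hypothetical): y is missing for three units with known x.
obs_x = [1.0, 2.0, 3.0, 4.0]
obs_y = [2.1, 3.9, 6.2, 7.8]
miss_x = [2.0, 2.0, 3.0]

run_1 = random_regression_impute(obs_x, obs_y, miss_x, seed=1)
run_2 = random_regression_impute(obs_x, obs_y, miss_x, seed=2)
```

Units with equal x can now receive different imputed values, but a run with a different seed will generally produce different imputed values and hence different tabulations.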

Multiple imputation (MI)

  • Independent random imputations + formulae for combining results

  • Bayesian or frequentist approach

  • Main problems:

    • Removes all the extra imputation variance only with an infinite number of repetitions. Otherwise, still not fully efficient & non-constant tabulations

    • A common misunderstanding: only MI can yield acceptable measures of accuracy.
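The slides do not spell out the combining formulae. The sketch below assumes the standard combining rules usually attributed to Rubin: the point estimate is the average over the M imputed data sets, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The function name and the numbers are illustrative.

```python
from statistics import mean, variance


def combine_mi(estimates, within_variances):
    """Combine M completed-data results (assumed: Rubin's rules)."""
    m = len(estimates)
    q_bar = mean(estimates)          # combined point estimate
    w = mean(within_variances)       # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b          # total variance
    return q_bar, t


# Hypothetical results from M = 3 independent random imputations.
estimates = [100.0, 104.0, 102.0]
within_variances = [4.0, 4.0, 4.0]

q_bar, t = combine_mi(estimates, within_variances)
```

The b / m part of the total variance is the extra variance due to using a finite number of imputations; it vanishes only as M grows without bound.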

Predictive mean matching (PMM)

  • Find the donor among the observed units who has the same predicted y-value & impute the donor's observed y-value

  • Noticeable difference from RRI as the chance of multiple donors decreases; PMM is more efficient due to the removal of imputation variance.

  • Essentially a marginal, variable-by-variable approach
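A minimal sketch of PMM for one variable, assuming a simple linear model with a single covariate. When no observed unit has exactly the same predicted value, the sketch takes the closest one, and ties are broken by donor order; both choices, the function names, and the toy data are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (single covariate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def predictive_mean_match(obs_x, obs_y, miss_x):
    """Impute the observed y of the donor with the closest predicted y."""
    a, b = fit_line(obs_x, obs_y)
    obs_pred = [a + b * x for x in obs_x]
    imputed = []
    for x in miss_x:
        target = a + b * x
        # Closest predicted value wins; ties go to the first donor.
        donor = min(range(len(obs_x)),
                    key=lambda i: (abs(obs_pred[i] - target), i))
        imputed.append(obs_y[donor])   # observed value, not the prediction
    return imputed


# Toy data (hypothetical).
obs_x = [1.0, 2.0, 3.0, 4.0]
obs_y = [2.1, 3.9, 6.2, 7.8]
miss_x = [2.0, 3.2]

imputed = predictive_mean_match(obs_x, obs_y, miss_x)
```

Every imputed value is a value actually observed on some donor, and no random draw is involved once the tie-break rule is fixed.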

Nearest neighbor imputation (NNI)

  • Provided a set of covariates and a distance metric, the donor is the ‘nearest’ observed unit.

  • A non-parametric generalization of PMM; hot-deck as a special case. More flexible and practical for multivariate imputation than regression models.

  • Chen and Shao (2000): consistent estimator of totals as well as finite population distributions, provided the absolute difference in conditional means of y is bounded by the ‘distance’ between two units. Linear models as special cases.

  • Can be made non-stochastic by introducing extra seemingly uncorrelated covariates, such as Zip code.

  • Main drawback: Usually not efficient (i.e. local smoothing instead of global regression predictor)
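A minimal sketch of NNI, assuming Euclidean distance on the covariates and using an extra covariate (a zip code here) only to break ties, so that the donor choice is non-stochastic. The record layout, the function name, and the toy data are assumptions. The donor's whole record of survey values is copied, which is what keeps the relationships among the imputed variables realistic.

```python
import math


def nearest_neighbor_impute(donors, recipients):
    """Copy the whole y-record of the nearest donor to each recipient.

    donors:     list of (covariates, zip_code, y_record)
    recipients: list of (covariates, zip_code)
    """
    imputed = []
    for x, zip_code in recipients:
        best = min(
            range(len(donors)),
            key=lambda i: (
                math.dist(x, donors[i][0]),     # primary: covariate distance
                abs(zip_code - donors[i][1]),   # tie-break: extra covariate
                i,                              # final tie-break: donor order
            ),
        )
        imputed.append(donors[best][2])
    return imputed


# Toy data (hypothetical): two covariates, two survey variables per record.
donors = [
    ((30, 1), 5003, (250, 12)),
    ((30, 1), 5020, (310, 15)),
    ((45, 0), 5003, (400, 20)),
]
recipients = [((30, 1), 5018), ((44, 0), 5003)]

imputed = nearest_neighbor_impute(donors, recipients)
```

The first recipient is equally close to the first two donors on the covariates; the zip code settles the choice, so repeated runs give the same imputed records.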

Artificial neural network (ANN)

  • Class of functional imputation

  • ANN as generalized regression functions (Bishop, 1995)

  • No analytic predictor

  • Unrealistic imputed values for categorical variables of interest

  • Usually not fully efficient