
Prediction and Imputation in ISEE

- Tools for more efficient use of combined data sources

Li-Chun Zhang, Statistics Norway

Svein Nordbotton, University of Bergen

Use of a statistical register
  • Combining administrative and survey data
    • Model-based prediction or weighting
    • Construction of statistical registers
  • Uses of a statistical register
    • Prediction of (sub-)population totals
    • Multiple uses & general database quality => inferential concerns associated with imputation
  • How to balance between the two types of inferential concerns?
A triple-goal criterion for statistical registers
  • Efficient prediction of the population totals of interest
  • Correct co-variances among survey variables, as well as between survey and auxiliary variables
  • Non-stochastic & constant tabulation
A simultaneous prediction method
  • Nearest neighbor imputation (NNI) as the only feasible approach in terms of preserving co-variances among all the variables. To improve efficiency: introduce restrictions on the imputed totals. The restricted totals may be obtained separately from the imputation, say, through regression prediction. To be referred to as NNI with restrictions (NNI-WR).
  • A simultaneous prediction method
    • Values are generated outside of the sample
    • Efficient for prediction of population totals
    • Not the optimal (or best) prediction for each specific unit, but for the ensemble of units, since attention is given to the co-variances among the variables.
About NNI-WR
  • Separation of prediction of totals from general imputation concerns, allowing full freedom in search of efficient methods
  • Solves variance estimation problem at the same time
  • Genuine multivariate imputation with realistic imputed values
  • Non-parametric nature and mild regularity conditions suggest robustness, compared to standard regression-based approaches
  • NNI can be made non-stochastic, yielding constant tabulations on repetition
An algorithm and current research
  • An algorithm
    • Jump-start phase: to speed up the imputation procedure if desirable
    • Fine-tune phase: relaxation to k-nearest neighbor imputation for better agreement with restrictions; consistency remains
    • Adjustment between the two phases
  • Current research
    • How well does the algorithm perform in real statistical productions?
    • Effective way of setting up the restrictions, i.e. maximum control with minimum number of explicit restrictions for imputation?
    • Evaluation of micro-data quality
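
The fine-tune idea above (relaxing from the nearest neighbor to the k nearest neighbors so that imputed totals agree better with a restriction) can be illustrated with a minimal sketch. This is not the authors' algorithm: the greedy swap rule, the function name nni_with_restriction, the single scalar covariate and the toy data are all assumptions made for illustration only.

```python
def nni_with_restriction(donors, recipients, target_total, k=3):
    """Hypothetical sketch of NNI with one restriction on the imputed total.

    donors:       list of (x, y) pairs observed in the sample (scalar x and y)
    recipients:   list of x-values for units that need an imputed y
    target_total: restricted total for the imputed y-values, e.g. obtained
                  separately through regression prediction
    """
    # For each recipient, keep its k nearest donors, nearest first.
    candidates = [
        sorted(donors, key=lambda d: abs(d[0] - x))[:k] for x in recipients
    ]
    # Start from plain nearest neighbor imputation.
    chosen = [cands[0] for cands in candidates]
    gap = target_total - sum(d[1] for d in chosen)

    # Greedy relaxation (an assumption, not the published method): switch a
    # recipient to another of its k nearest donors whenever this strictly
    # reduces the absolute gap to the restricted total.
    improved = True
    while improved:
        improved = False
        for i, cands in enumerate(candidates):
            for d in cands:
                new_gap = gap - (d[1] - chosen[i][1])
                if abs(new_gap) < abs(gap):
                    chosen[i], gap = d, new_gap
                    improved = True
    return [d[1] for d in chosen], gap


donors = [(1.0, 10.0), (2.0, 14.0), (3.0, 20.0), (4.0, 26.0)]
recipients = [1.1, 2.9]
values, gap = nni_with_restriction(donors, recipients, target_total=34.0, k=3)
```

In this toy run plain NNI would impute 10 and 20 (total 30); the relaxation moves the first recipient to its second-nearest donor so that the imputed total meets the restriction of 34. Every imputed value is still an observed donor value, which is what keeps the imputed values realistic.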

Background information:

Some standard methods of prediction and imputation

Basic prediction approach
  • Under the general linear model:
    • Target parameter T = linear combination of y-values in the population
    • Estimation of T amounts to prediction of T outside of the selected sample
    • Prediction of individuals: A special case
  • Main problems for a statistical register
    • Lack of natural variation in data; especially if many units have the same x-values
    • Infeasible simultaneously for a large number of variables; impractical as production mode; leading to inconsistency of cross-tabulation
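
As a minimal sketch of the prediction approach, assume a simple linear model with one covariate x known for every unit in the population. The total is then the observed sample sum plus the predicted y-values for the units outside the sample. The function names and the toy data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_total(x_sample, y_sample, x_nonsample):
    """Observed sample sum plus model predictions outside the sample."""
    a, b = fit_line(x_sample, y_sample)
    return sum(y_sample) + sum(a + b * xi for xi in x_nonsample)


# Toy data in which y = 2x holds exactly.
x_s = [1.0, 2.0, 3.0]
y_s = [2.0, 4.0, 6.0]
x_r = [4.0, 5.0]          # covariates of the units outside the sample
total = predict_total(x_s, y_s, x_r)
```

Note that every unit outside the sample with the same x-value receives the same predicted y-value, which is the lack of natural variation mentioned above.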
Random regression imputation (RRI)
  • To emulate the natural variation in data: Add a random residual to the best predicted y-value
  • Hot-deck as a special case
  • Main problems:
    • Extra variance of imputed estimator due to random imputation => never fully efficient
    • Random imputation not the only means for creating natural variation in data
    • Different tabulations on repetition => lack of acceptability and face value in official statistics
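
A minimal sketch of RRI under a simple linear model with one covariate follows. Drawing the residual from the empirical residuals of the observed units is one possible choice and an assumption here; the function names and the data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def random_regression_impute(x_obs, y_obs, x_mis, rng):
    """Best predicted y-value plus a randomly drawn observed residual."""
    a, b = fit_line(x_obs, y_obs)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x_obs, y_obs)]
    return [a + b * xi + rng.choice(residuals) for xi in x_mis]


x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [2.1, 3.9, 6.2, 7.8]
x_mis = [2.5, 2.5, 2.5]   # three units sharing the same x-value

# Two repetitions with different random seeds.
first = random_regression_impute(x_obs, y_obs, x_mis, random.Random(1))
second = random_regression_impute(x_obs, y_obs, x_mis, random.Random(2))
```

Units with identical x-values can now receive different imputed values, but the result depends on the random draws, so repeating the imputation will in general change the tabulations.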
Multiple imputation (MI)
  • Independent random imputations + formulae for combining results
  • Bayesian or frequentist approach
  • Main problems:
    • Removes all the extra imputation variance only with an infinite number of repetitions. Otherwise, still not fully efficient & non-constant tabulations
    • A common misunderstanding: only MI can yield acceptable measures of accuracy.
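
The slide does not say which combining formulae are meant. As a sketch, assume the commonly used rules attributed to Rubin: the combined estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def combine(estimates, variances):
    """Combine m completed-data estimates and their variance estimates."""
    m = len(estimates)
    q_bar = mean(estimates)            # combined point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor m - 1)
    total_var = within + (1 + 1 / m) * between
    return q_bar, total_var


# Three imputed data sets, each yielding an estimate and a variance estimate.
q_bar, total_var = combine([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
```

The term between / m is the extra variance due to using only m random imputations; it vanishes only as m grows without bound.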
Predictive mean matching (PMM)
  • Find the donor among the observed units who has the same predicted y-value & impute the observed y-value
  • Noticeable difference from RRI as the chance of multiple donors decreases; PMM is more efficient due to the removal of imputation variance.
  • Essentially a marginal, variable-by-variable approach
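
A minimal sketch of PMM for a single variable, assuming a simple linear model with one covariate. Since an exactly equal predicted value rarely exists, the sketch takes the donor with the closest predicted value and breaks ties by position, which is an assumption. All names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single covariate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def pmm_impute(x_obs, y_obs, x_mis):
    """Impute the observed y of the donor with the closest predicted y."""
    a, b = fit_line(x_obs, y_obs)
    pred_obs = [a + b * xi for xi in x_obs]
    imputed = []
    for xi in x_mis:
        target = a + b * xi
        # min() returns the first index on ties, so the choice is deterministic.
        donor = min(range(len(x_obs)), key=lambda j: abs(pred_obs[j] - target))
        imputed.append(y_obs[donor])
    return imputed


x_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [2.1, 3.9, 6.2, 7.8]
imputed = pmm_impute(x_obs, y_obs, [2.2, 3.9])
```

The imputed values are actual observed y-values rather than model predictions, and no random residual is added.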
Nearest neighbor imputation (NNI)
  • Provided a set of covariates and a distance metric, the donor is the ‘nearest’ observed unit.
  • A non-parametric generalization of PMM, with hot-deck as a special case. More flexible and practical for multivariate imputation than regression models.
  • Chen and Shao (2000): consistent estimator of totals as well as finite population distributions, provided the absolute difference in conditional means of y is bounded by the ‘distance’ between two units. Linear models as special cases.
  • Can be made non-stochastic by introducing extra seemingly uncorrelated covariates, such as Zip code.
  • Main drawback: usually not efficient (i.e. local smoothing instead of a global regression predictor)
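
A minimal sketch of multivariate NNI, assuming Euclidean distance on the covariates. The whole y-vector of the donor is copied, so the relationships among the imputed variables are those of a real observed unit. The function name and the toy records are hypothetical.

```python
import math


def nni_impute(donors, recipients):
    """donors: list of (x_vector, y_vector); recipients: list of x_vectors.

    Returns, for each recipient, the full y_vector of its nearest donor.
    """
    imputed = []
    for x in recipients:
        # min() returns the first donor on ties; adding an extra covariate to
        # x (the slide suggests Zip code) is another way to make ties rare.
        nearest = min(donors, key=lambda d: math.dist(d[0], x))
        imputed.append(nearest[1])
    return imputed


donors = [
    ((1.0, 0.0), (10, "A")),
    ((5.0, 1.0), (50, "B")),
    ((9.0, 0.0), (90, "A")),
]
recipients = [(2.0, 0.0), (6.0, 1.0)]
imputed = nni_impute(donors, recipients)
```

Because the donor's record is copied as a whole, numeric and categorical variables are imputed together and categorical values stay within the observed categories.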
Artificial neural network (ANN)
  • Class of functional imputation
  • ANN as generalized regression functions (Bishop, 1995)
  • No analytic predictor
  • Unrealistic imputed values for categorical variables of interest
  • Usually not fully efficient