New and Emerging Methods

Maria Garcia and Ton de Waal

UN/ECE Work Session on Statistical Data Editing, 16-18 May 2005, Ottawa


Introduction

  • New methods of data editing and imputation

  • Subdivided into 5 different themes:

    • Automatic editing

    • Imputation

    • E & I for demographic variables

    • Selective editing

    • Software


Invited Papers

  • WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)

  • WP 32: Using a quadratic programming approach to solve simultaneous ratio and balance edit problems (USCB, US)

  • WP 31: Smoothing imputations for categorical data in the linear regression paradigm (USCB, US)


Automatic editing: papers (1/2)

Six papers:

  • WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)

  • WP 32: Using a quadratic programming approach to solve simultaneous ratio and balance edit problems (USCB, US)

  • WP 33: Data editing and logic (Australia)


Automatic editing: papers (2/2)

  • WP 43: Automatic editing system for the case of two short-term business surveys (Republic of Slovenia)

  • WP 44: A variable neighbourhood local search approach for the continuous data editing problem (Spain)

  • WP 46: Implicit linear inequality edits and error localization in the SPEER edit system (USCB, US)


Automatic Editing: main developments

Methods based on Fellegi-Holt model

  • Developments at SORS

    • General system combines error localization with outlier detection

    • Plans for automation of implied edit generation

  • Further improvements of SPEER

    • Preprocessing program for generation of implied edits

    • Improve error localization


Automatic Editing: main developments

  • Framework of Fellegi-Holt theory in propositional logic

    • Generation of implied edits framed as logical deduction

    • Automatic tools that can potentially be used for finding minimal deletion set
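
To make the "minimal deletion set" concrete, here is a minimal brute-force sketch of Fellegi-Holt style error localization. The field names, domains and edits are hypothetical, and exhaustive search is used only for illustration; it is not the logic-based or SAT-based approach of the papers, and it would not scale to real surveys.

```python
from itertools import combinations, product

# Hypothetical categorical domains for a three-field record.
DOMAINS = {
    "age_group": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["no", "yes"],
}

# Each edit returns True when the record FAILS it (a forbidden combination).
EDITS = [
    lambda r: r["age_group"] == "child" and r["marital"] == "married",
    lambda r: r["age_group"] == "child" and r["employed"] == "yes",
]


def passes(record):
    """A record passes when it triggers none of the edits."""
    return not any(edit(record) for edit in EDITS)


def minimal_deletion_sets(record):
    """Return every smallest set of fields that can be changed so that
    the record satisfies all edits (unweighted Fellegi-Holt criterion)."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        found = []
        for subset in combinations(fields, size):
            # Try every assignment of new values to the chosen fields.
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if passes(candidate):
                    found.append(subset)
                    break
        if found:
            return found
    return []


record = {"age_group": "child", "marital": "married", "employed": "yes"}
solutions = minimal_deletion_sets(record)
print(solutions)
```

In this toy record both edits fail, and changing the single field age_group resolves both, so it is the only minimal deletion set.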


Automatic Editing: main developments

Methods based on some other approach

  • Erroneous unit measures

    • Model as cluster analysis problem

  • Ratio and balance constraints

    • Hybrid ratio editing and quadratic programming

    • Controlled rounding

  • Error localization as a combinatorial optimization problem

    • Continuous data

    • Successful on very large data sets
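
As a rough illustration of adjusting data to a balance constraint, the sketch below solves the simplest quadratic program of this kind in closed form: minimize the sum of squared changes subject to one linear balance edit. The variable names and figures are invented, and this is an assumption-laden special case, not the hybrid method of WP 32; ratio edits are inequalities and would need a general QP solver.

```python
def balance_adjust(values, coeffs):
    """Return the adjustment of `values` with the smallest sum of squared
    changes such that sum(coeffs[i] * adjusted[i]) == 0.

    This is the orthogonal projection onto the balance-edit hyperplane.
    """
    residual = sum(c * v for c, v in zip(coeffs, values))
    norm = sum(c * c for c in coeffs)
    if norm == 0:
        raise ValueError("balance edit has no non-zero coefficients")
    return [v - c * residual / norm for c, v in zip(coeffs, values)]


# Hypothetical balance edit: turnover - costs - profit = 0
reported = [100.0, 60.0, 37.0]
coeffs = [1.0, -1.0, -1.0]

adjusted = balance_adjust(reported, coeffs)
print(adjusted)
```

The reported values are out of balance by 3, and the projection spreads that discrepancy equally over the three variables. A weighted objective would instead shift more of the change onto the less reliable variables.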


Imputation: papers (1/2)

Six papers:

  • WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)

  • WP 31: Smoothing imputations for categorical data in the linear regression paradigm (USCB, US)

  • WP 36: Integrated modeling approach to imputation and discussion on imputation variance (Statistics Finland)


Imputation: papers (2/2)

  • WP 40: Imputation of data subject to balance and inequality restrictions using the truncated normal distribution (Statistics Netherlands)

  • WP 41: On the imputation of categorical data subject to edit restrictions using loglinear models (Statistics Netherlands)

  • WP 48: Improving imputation: the plan to examine count, status, vacancy and item imputation in the decennial census (USCB, US)


Imputation: main developments

Model based methods

  • Discrete Data

    • Constrained loglinear model

    • Linear regression model

  • Continuous Data

    • Truncated normal distribution followed by MCEM
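
The idea of drawing imputations from a truncated normal distribution can be sketched with simple rejection sampling. The mean, standard deviation and bounds below are made-up values, and the sketch covers only a single variable with interval restrictions; it does not handle balance restrictions or parameter estimation, which WP 40 addresses.

```python
import random


def sample_truncated_normal(mean, sd, lower, upper, rng, max_tries=10_000):
    """Draw one value from N(mean, sd^2) restricted to [lower, upper]
    by rejecting draws that fall outside the interval."""
    for _ in range(max_tries):
        draw = rng.gauss(mean, sd)
        if lower <= draw <= upper:
            return draw
    raise RuntimeError("no draw fell inside the bounds; interval too unlikely")


# Hypothetical setting: a non-negative variable capped at 80 by an edit.
rng = random.Random(2005)
imputed = [sample_truncated_normal(50.0, 20.0, 0.0, 80.0, rng) for _ in range(5)]
print(imputed)
```

Rejection sampling is easy to verify but becomes slow when the allowed interval lies far in the tail of the distribution; an inverse-CDF method would be a more robust choice there.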


Imputation: main developments

Implementation of imputation methods

  • Use Bayesian networks for imputation of discrete data

  • Development of QUIS for imputation of continuous data

    • written in SAS

    • uses EM algorithm, nearest neighbor, and MI


Imputation: main developments

Implementation of imputation methods

  • Integrated Modeling Approach (IMAI)

    • Summary and analysis of principles of IMAI

    • Estimation of imputation variance

  • U.S. Decennial Census

    • Research on alternative imputation options

    • Administrative records, model based imputation, CANCEIS, hot deck

    • Development of a truth deck for evaluation


E & I for demographic variables: papers

Three papers:

  • WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)

  • WP 35: Edit and imputation for the 2006 Canadian Census (Statistics Canada)

  • WP 38: New procedures for editing and imputation of demographic variables (ISTAT, Italy)


E & I for demographic variables: main developments

  • Further improvement of CANCEIS

    • capability of processing all census variables

    • improved editing and imputation of alphanumeric, discrete, continuous and coded variables

    • improved user interface

  • Development of DIESIS

    • combined use of “data driven” approach (NIM) and “minimum change” approach (Fellegi-Holt)


E & I for demographic variables: main developments

  • Development of DIESIS

    • Use of graph theory to improve quality of sequential imputation

    • Optimization procedure to locate the household reference person

    • New approach for selection of donors

      • based on partitioning passed records into smaller subsets of similar characteristics

      • search for donor records within the smaller clusters
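
The donor-selection idea above (search only within a smaller cluster of similar passed records) might look roughly like the sketch below. The slides do not describe how DIESIS forms clusters or measures similarity, so the cluster key, the absolute-difference distance and all field names are assumptions made for illustration.

```python
def impute_from_donor(recipient, donors, cluster_key, match_fields, target):
    """Return the `target` value of the nearest donor that lies in the
    same cluster as the recipient, or None if that cluster has no donors."""
    pool = [d for d in donors if d[cluster_key] == recipient[cluster_key]]
    if not pool:
        return None

    def distance(donor):
        # Sum of absolute differences over the numeric matching fields.
        return sum(abs(donor[f] - recipient[f]) for f in match_fields)

    return min(pool, key=distance)[target]


# Hypothetical passed (edit-consistent) records acting as donors.
donors = [
    {"region": "north", "age": 34, "hh_size": 3, "marital": "married"},
    {"region": "north", "age": 19, "hh_size": 1, "marital": "single"},
    {"region": "south", "age": 21, "hh_size": 1, "marital": "married"},
]
recipient = {"region": "north", "age": 21, "hh_size": 1, "marital": None}

imputed = impute_from_donor(recipient, donors, "region", ["age", "hh_size"], "marital")
print(imputed)
```

The "south" record matches the recipient exactly on age and household size but is never considered, because the search is restricted to the recipient's own cluster. That restriction is what keeps the donor search cheap on large files.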


Selective editing: papers

Two papers:

  • WP 42: Evaluation of score functions for selective editing of annual structural business statistics (Statistics Netherlands)

  • WP 45: An editing procedure for low pay data in the Annual Survey of Hours and Earnings (Office for National Statistics, UK)


Selective editing: main developments

  • Continued use and development of selective editing

  • Evaluation of selective editing approaches

    • experiments with different sets of score functions

  • Development of “hybrid editing”

    • validate a sample of failed records

    • use associated data to impute remaining records
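
A score function for selective editing can be sketched as follows. The form used here (design weight times the absolute deviation from an anticipated value, scaled by the estimated total) is one common choice and is an assumption; it is not necessarily one of the functions evaluated in WP 42, and the records and threshold are invented.

```python
def local_score(reported, anticipated, weight, estimated_total):
    """Approximate influence of a suspicious value on the estimated total."""
    return weight * abs(reported - anticipated) / estimated_total


# Hypothetical records: anticipated values could come from a previous period.
records = [
    {"id": "A", "reported": 120.0, "anticipated": 100.0, "weight": 10.0},
    {"id": "B", "reported": 5000.0, "anticipated": 500.0, "weight": 2.0},
    {"id": "C", "reported": 98.0, "anticipated": 100.0, "weight": 50.0},
]
estimated_total = sum(r["weight"] * r["anticipated"] for r in records)

THRESHOLD = 0.05  # assumed cut-off; in practice tuned on evaluation data

# Records above the threshold go to manual review; the rest are left
# for automatic treatment.
flagged = [
    r["id"]
    for r in records
    if local_score(r["reported"], r["anticipated"], r["weight"], estimated_total) > THRESHOLD
]
print(flagged)
```

Only record B is sent to manual review here: its deviation is large enough to move the estimated total noticeably, while A and C deviate too little to matter despite C's large weight.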


Software: papers

Four papers:

  • WP 34: The transition from GEIS to BANFF (Statistics Canada)

  • WP 37: Concepts, materials and IT modules for data editing of German statistics (Destatis, Germany)

  • WP 39: SLICE 1.5: a software framework for automatic edit and imputation (Statistics Netherlands)

  • WP 47: Improving an edit and imputation system for the US Census of Agriculture (NASS, US)


Software: main developments

  • Flexibility

    • modules rather than large systems are developed

    • standard statistical packages are used (SAS in BANFF and US Census of Agriculture)

  • Testing and implementation of the software

  • Quality control measures

    • e.g. for (donor) imputation

  • Integration of the edit and imputation software in entire production process

    • process chain: planning, data collection, edit and imputation


General points for discussion

  • Are there any really new approaches?

    • are new approaches extensions of existing ideas?

    • are new approaches combinations of old ones?

  • Develop new approaches or consolidate old approaches?

    • development versus evaluation studies and testing

    • prototype software versus implementation of production software

  • Is our focus shifting?

    • from editing towards imputation?

    • from development towards implementation?

    • from computational aspects towards quality issues?


Automatic editing: points for discussion

  • Can operations research techniques be combined with techniques from mathematical logic?

  • What are the (dis)advantages of using SAT solvers when compared to direct integer programming methods?

  • What is the quality of the imputations when editing data using the quadratic programming approach?


Automatic editing: points for discussion

  • What is the quality of the solutions found by using the combinatorial optimization approach on real survey data? How fast is this approach on realistic data?

  • Can finite mixture models be used for detection of other types of systematic errors?

  • Should we invest in developing generic tools or software tools tailored to a particular application?


Automatic editing: points for discussion

  • Are there other types of surveys for which generating implied edits prior to error localization is worth the effort?

  • What are the most cost-effective methods for edit/imputation in terms of resources, time, clerical intervention, quality of results?


Imputation: points for discussion

  • What are the (dis)advantages of using complex mathematical models for missing data imputation? Are these models too complex for survey practitioners?

  • What are the expected computational difficulties of applying complex models to real survey data?

  • What are the largest (most complex) surveys that can be imputed using these models?


Imputation: points for discussion

  • What is the quality of the imputations produced by model-based methods for filling in missing data?

  • Can we compare the different imputation models?


Imputation: points for discussion

  • Can more guidelines for the IMAI process be developed?

  • To what extent can we develop a systematic way of applying IMAI?

  • Is imputation variance an important issue at the moment, or should we (still) focus on imputation bias?


E & I for demographic variables: points for discussion

  • Can CANCEIS/DIESIS be used for other data besides demographic census data?

  • Can CANCEIS/DIESIS be further developed?

  • Should we use a combination of edit and imputation methods or a single method for demographic variables?


Selective editing: points for discussion

  • Can selective editing be successfully applied to large/complex surveys?

  • Can current methods for selective editing be further developed?

  • Can a general theory for selective editing be developed?

  • How promising is hybrid editing?


Software: points for discussion

  • Should we develop generic software or software tools for particular applications?

  • How can we ensure the flexibility of software?

  • Are the software tools fast enough for large/complex data sets?

  • To what extent should we aim to automate the editing process?

