
Imputation in the 2011 Census

NILS Brownbag Talk – 6 May 2014

Richard Elliott


Overview

  • Background

  • What is imputation

  • How did we impute the 2011 Census

    • Strategy

    • Process

    • Implementation

  • Considerations

  • Information

  • Next steps


Background

  • Legal obligation on the public to complete a Census Questionnaire accurately

  • A minority didn’t provide such information

    • Item non-response

      • Leave questions unanswered

      • Make mistakes (e.g. neglect to follow questionnaire instructions)

      • Provide values that are out of range (e.g. Born in 1791)

    • Item inconsistency

      • Captured values not consistent with other values on the questionnaire (e.g. 6 year old mother)

    • Non-response

      • Don’t fill in the questionnaire at all


Background

  • It is NISRA’s policy to report estimates for the entire population, so imputation was used to:

    • Correct for non-response

      • Estimate the missing persons and households

    • Correct for Item non-response

      • Fill the gaps left by unanswered questions

    • Correct for Item inconsistency

      • Ensure that the information provided is consistent

  • These types of data quality issues apply equally to any data collection exercise

    • Not specific to Census

    • Census Office recognises that users need to be aware of them


Background

  • While imputation was used to “fill the gaps”, its strength comes from the information that was recorded

  • Therefore, it is important to recognise the following:

    • Responses to the Census represent a self-assessment of a respondent’s circumstances

      • Proxy responses given by main householder

    • Respondents didn’t always complete the questionnaire correctly

    • 85% of questionnaires were completed on paper forms

      • Handwriting had to be captured using an electronic character recognition system

      • While service levels were in place for capture, errors were still possible


Two Types of Imputation

  • Item Edit and Imputation

    • Correcting a dataset for inconsistencies and item non-response

    • Making each record “complete and consistent”

  • Record imputation

    • The addition of whole records to a dataset

    • Estimate and adjust for persons missed, duplicated and counted in the wrong place

    • Increases the accuracy of the overall estimates


Item Edit and Imputation Strategy

  • Primary Objective:

    • to produce a complete and consistent database where unobserved distributions were estimated accurately by the imputation process

  • There were three key principles

    • All changes made maintain the quality of the data

    • The number of changes to inconsistent data is minimised

    • As far as possible, missing data should be imputed for all variables to provide a complete and consistent database


Item Edit and Imputation Strategy

  • In adhering to these principles, the following key aims were defined

    • Editing must not introduce bias or distortion in the data

    • Editing facilitates the production of output data that is fit for purpose

    • Editing methods help to ensure that pre-determined levels of data quality are met

      • Highest priority given to variables which define population bases (e.g. Age and Sex)

    • Editing supports the production of the population estimates by ensuring that the basic population estimates are accurate


Item Edit and Imputation Strategy

  • Used a similar but enhanced version of the framework adopted in 2001

    • One Number Census Process

    • Tried and tested in 2001

  • Was undertaken as part of the Downstream Processing (DSP) project at ONS

    • Included both Item and Record Imputation

    • Supplemented by detailed QA at every stage by NISRA Census Office

    • NISRA benefitted from enhancements to the system found through ONS data processing

    • Ultimately, NISRA was responsible for the processing of NI data and any parameter tweaking / re-runs


Imputation Process – 4 Key Stages

1. Cleansing the Data

2. Item Imputation

3. Coverage Assessment and Adjustment (Record Imputation)

4. Post-Coverage Item Imputation




Implementation – Capture and Coding

  • Capture and coding rules

    • Turning tick and text responses into data that could be edited and imputed

      • Complex coding used to assign numerical values to written text and ticked boxes (e.g. occupation and industry coding)

      • Invalid responses flagged for imputation (V, W, Y and Z)

    • Determinations were made to resolve combinations of tick and text responses

      • Ticks that could not be determined set to W (failed multi-tick)

      • Text that was uncodeable set to V (uncodeable text response)

    • Data subject to checks to ensure each question response was within a predefined range (e.g. No year of birth before 1896 or after 2011)

      • Invalid responses set to Z (out of range)

    • Missing data flagged as Y (missing requires imputation)
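
A minimal sketch (in Python; not the production capture system) of the flagging scheme above. The function name, inputs and tick handling are illustrative assumptions; resolvable multi-ticks were in practice handled by determination rules:

```python
def flag(ticks=None, value=None, valid_range=None):
    """Map a captured response to a value or an imputation flag."""
    if ticks is not None:
        if len(ticks) == 0:
            return "Y"        # missing - requires imputation
        if len(ticks) > 1:
            return "W"        # failed (irresolvable) multi-tick
        return ticks[0]       # single tick - accepted as captured
    if value is None:
        return "Y"            # missing - requires imputation
    if valid_range and not (valid_range[0] <= value <= valid_range[1]):
        return "Z"            # out of range (e.g. born in 1791)
    return value

print(flag(ticks=["Male", "Female"]))              # -> 'W'
print(flag(value=1791, valid_range=(1896, 2011)))  # -> 'Z'
print(flag(ticks=[]))                              # -> 'Y'
```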


Implementation – Capture and Coding

  • Determining combinations of ticks

    • Single tick


Implementation – Capture and Coding

  • Determining combinations of ticks

    • Resolvable multi-tick


Implementation – Capture and Coding

  • Determining combinations of ticks

    • Irresolvable multi-tick

This will be assumed missing and imputed.


Implementation – Capture and Coding

  • Missing data

This will be imputed.


Implementation – Capture and Coding

  • Resolving write-ins

    • Numbers: the write-in “1” is captured as “01”


Implementation – Capture and Coding

  • Resolving write-ins

    • Numbers: the write-in “two” is captured as “02”


Implementation – Capture and Coding

  • Resolving write-ins

    • Range Check: the write-in “199” fails the range check, so it will be assumed missing and imputed


Implementation – Capture and Coding

  • Resolving write-ins

    • Codeable response: the write-in “FRANCE” gets coded to 250


Implementation – Capture and Coding

  • Resolving write-ins

    • Uncodeable response: the write-in “SUGAR” is clearly not a country, so it is set to VVV and will be assumed missing and imputed
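
A minimal sketch of this write-in resolution, assuming a made-up number-word table, country-code lookup and flag widths (the real coding system was far richer):

```python
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3}    # illustrative only
COUNTRY_CODES = {"FRANCE": 250, "IRELAND": 372}    # illustrative only

def resolve_number(text, lo, hi):
    """Normalise a numeric write-in and range-check it."""
    value = WORD_NUMBERS.get(text.strip().lower())
    if value is None:
        try:
            value = int(text)
        except ValueError:
            return "VV"       # uncodeable text response
    if not lo <= value <= hi:
        return "ZZ"           # out of range - assumed missing and imputed
    return f"{value:02d}"     # "1" -> "01", "two" -> "02"

def resolve_country(text):
    """Code a country write-in, or flag it as uncodeable."""
    return COUNTRY_CODES.get(text.strip().upper(), "VVV")

print(resolve_number("two", 0, 99))   # -> '02'
print(resolve_number("199", 0, 99))   # -> 'ZZ'
print(resolve_country("FRANCE"))      # -> 250
print(resolve_country("SUGAR"))       # -> 'VVV'
```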




Implementation – RMR

  • Reconcile Multiple Responses (RMR)

    • Removal of false persons

      • Removal of persons generated by capture anomalies

      • For example: strike throughs, inadequately completed questionnaires

    • Removal of duplicates (multiple persons / households)

      • Individuals who included themselves more than once

      • Separated parents who included their children at both addresses

    • Creating households / communals from multiple questionnaires

      • Consolidating H4 / HC4 / I4 etc

    • Validation

      • Renumbering person records within households / communals
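
A minimal sketch, using hypothetical person fields, of the kind of duplicate reconciliation RMR performs; the real matching was more sophisticated than exact key matching:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Person:
    household_id: int
    name: str
    dob: str          # date of birth as captured

def reconcile_duplicates(records):
    """Keep one record per (household, name, date of birth)."""
    seen, kept = set(), []
    for p in records:
        key = (p.household_id, p.name.strip().upper(), p.dob)
        if key not in seen:          # first occurrence wins
            seen.add(key)
            kept.append(p)
    return kept

people = [Person(1, "Ann Smith", "1981-05-01"),
          Person(1, "ANN SMITH", "1981-05-01"),   # duplicate entry
          Person(1, "Bob Smith", "1979-02-14")]
print(len(reconcile_duplicates(people)))          # -> 2
```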




Implementation – FRDVP

  • Filter Rules and Derived Variables for Processing (FRDVP)

    • Apply edits to correct for questionnaire routing errors

    • Apply hard edits to keep individual records consistent

      • Minimal at this stage (mostly applied within imputation system)

    • Information not required set to X

      • No imputation done on any variable set to X

    • Create high level variables to be used within the Item Imputation system

      • Blocking variables for donor searching

      • Makes it easier to find donors


Implementation – FRDVP

  • Surplus information – questionnaire routing

In this scenario the respondent should have skipped question 6 and gone straight to question 7.

Therefore, since the respondent should not have answered question 6, it is set to:

X (not required)


Implementation – FRDVP

  • Surplus information – questionnaire consistency

(Example: a date of birth captured as 01/01/2011, with an address one year ago written in as “9 Lord Wardens Crescent, BT19 1YJ”.)

In this scenario, since the respondent is aged under 1 on Census day, and therefore did not have a usual address one year ago, the captured address information is set to X.
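
A minimal sketch of this kind of routing edit, with hypothetical field names (Census day for the 2011 Census was 27 March 2011):

```python
from datetime import date

CENSUS_DAY = date(2011, 3, 27)

def apply_routing_edits(record):
    """Set 'address one year ago' to X for respondents aged under 1."""
    age_in_days = (CENSUS_DAY - record["dob"]).days
    if age_in_days < 365:
        record["address_one_year_ago"] = "X"   # not required, never imputed
    return record

infant = {"dob": date(2011, 1, 1),
          "address_one_year_ago": "9 LORD WARDENS CRESCENT, BT19 1YJ"}
print(apply_routing_edits(infant)["address_one_year_ago"])   # -> 'X'
```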




Implementation – Item Imputation

  • Achieved using CANCEIS

    • Canadian Census Edit and Imputation System

    • Developed specifically for Census type data

      • i.e. a mix of categorical and numerical variables

    • Donor-based edit and imputation system that can simultaneously:

      • Apply nearest-neighbour donor imputation

      • Apply deterministic edits and maintain consistency

    • Evaluated and endorsed as the 2011 Census imputation tool

      • Faster

      • Less resource intensive

      • Allowed for more joint-imputation


Implementation – CANCEIS

  • How did CANCEIS work in practice

    • The database was divided up into processing units for the purposes of resource management and maximising donor pools

      • Three geographic units

      • Household questions – individual imputation; donor unit = household

      • Household persons 1 to 6 – joint household imputation, with between-person edits and relationships; donor unit = household of the same size

      • Household persons 7 to 30 – individual imputation of person questions, using relationship to Person 1; donor unit = individual person

      • Communal persons – individual imputation of person questions; donor unit = individual person


Delivery Groups (Processing Units)




Implementation – CANCEIS

  • How did CANCEIS work in practice

    • The household questions were imputed within a single module

    • Person data was divided up into 4 modules

      • Aim was to group variables that help predict each other

      • Attempt to maximise the number of donors for a given group

      • Demographics – e.g. Age, Sex, Marital status, Student, Activity last week

      • Culture – e.g. Ethnicity, Country of birth, Language, Passports

      • Health – e.g. General health, Disability, Long-term condition

      • Labour Market – e.g. Economic activity, Hours worked, Qualifications


Implementation – CANCEIS

  • How were the donors selected?

    • Within each module a number of matching variables were used to select donors

    • Matching variables were weighted according to several factors

      • How well they would predict other values and how highly they should be prioritised when resolving inconsistencies

      • For example, age is often a good predictor of other demographic variables

      • Age was given a high weight, therefore observed ages were prioritised over other values if there was an inconsistency and changes were required

    • Northings and Eastings were used to control for geographical differences and find donors from similar areas

      • These were given a small weight compared to demographic characteristics like age, sex and marital status etc
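
A minimal sketch of weighted nearest-neighbour matching with made-up weights; CANCEIS’s real distance function is more elaborate, but the shape is the same: demographics dominate, geography contributes weakly:

```python
WEIGHTS = {"age": 10.0, "sex": 5.0, "marital": 5.0,
           "easting": 0.001, "northing": 0.001}    # illustrative weights

def distance(recipient, donor):
    """Weighted mismatch between a recipient and a candidate donor."""
    total = 0.0
    for var, w in WEIGHTS.items():
        a, b = recipient[var], donor[var]
        if isinstance(a, (int, float)):
            total += w * abs(a - b)      # numeric: scaled difference
        else:
            total += w * (a != b)        # categorical: 0/1 mismatch
    return total

def nearest_donor(recipient, donors):
    """Pick the complete record most similar to the recipient."""
    return min(donors, key=lambda d: distance(recipient, d))
```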


Implementation – CANCEIS

  • Matching variables (example)

    • Suppose someone omitted to fill in their occupation details

    • The record would be flagged for imputation under the Labour Market module

    • Donor pool identified by matching on (for example):

      • Economic Activity

      • Industry

      • Hours worked

      • Qualifications

    • These variables deemed to influence Occupation

    • Occupation information imputed from a donor with similar Labour Market characteristics
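
Continuing the sketch above with this occupation example; the field names, weights and donor records are invented for illustration:

```python
LM_WEIGHTS = {"econ_activity": 8.0, "industry": 6.0,
              "hours_worked": 4.0, "qualifications": 4.0}

def lm_distance(recipient, donor):
    """Weighted categorical mismatch on labour-market variables."""
    return sum(w * (recipient[v] != donor[v]) for v, w in LM_WEIGHTS.items())

recipient = {"econ_activity": "Employee", "industry": "Retail",
             "hours_worked": "38+", "qualifications": "Level 2",
             "occupation": None}                   # flagged for imputation
donors = [{"econ_activity": "Employee", "industry": "Retail",
           "hours_worked": "38+", "qualifications": "Level 2",
           "occupation": "Sales assistant"},
          {"econ_activity": "Self-employed", "industry": "Construction",
           "hours_worked": "38+", "qualifications": "Level 2",
           "occupation": "Joiner"}]

best = min(donors, key=lambda d: lm_distance(recipient, d))
recipient["occupation"] = best["occupation"]       # -> 'Sales assistant'
```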


Implementation – CANCEIS

  • Editing and Imputing was done simultaneously

    • Each record was checked for consistency before imputation

    • Any items that failed the checks were marked for imputation along with the missing items

  • A single donor was selected to resolve inconsistencies and non-response

    • Only values which satisfied the edit constraints were imputed into the recipient record

    • CANCEIS sought to minimise the number of changes required to repair a record when edit constraints were in place

  • There were 31 edit rules which were broadly based on 2001

    • e.g. If aged between 5 and 15 then must be in full-time education

  • Some rules had to be updated to account for changes since 2001

    • e.g. Removal of rule that did not allow same-sex couples

    • Replaced with rules that said married couples had to be opposite-sex and civil partners had to be same-sex
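
A minimal sketch of edit rules as predicates over a record, using the two rules from the worked example on the following slides:

```python
def rule_a(r):
    """Rule A: must be aged 16+ to be married."""
    return r["marital"] != "Married" or r["age"] >= 16

def rule_b(r):
    """Rule B: aged 5 to 15 must be a student."""
    return not (5 <= r["age"] <= 15) or r["student"] == "Yes"

RULES = {"Rule A": rule_a, "Rule B": rule_b}

def failed_rules(record):
    """Names of the edit rules the record violates."""
    return [name for name, rule in RULES.items() if not rule(record)]

print(failed_rules({"age": 10, "marital": "Married", "student": "Yes"}))
# -> ['Rule A']
```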


Implementation – CANCEIS

  • Say we have the following (oversimplified) example: a record with Age 10, Marital Status “Married” and Student missing

  • Student is missing

    • Requires imputation under the demographic module

  • This record is subject to two edit constraints

    • Rule A: must be aged 16+ to be married

    • Rule B: aged 5 to 15 must be a student

  • Fails Rule A since aged 10 and married

    • Therefore, both Age and Marital Status are also flagged for imputation




Implementation – CANCEIS

  • The system searches for potential donors

    • Matching on demographic variables

    • Uses Northings and Eastings to find a donor in the area

  • Two candidate records are returned: Donor1 and Donor2






Implementation – CANCEIS

  • Rule A: must be aged 16+ to be married

  • Rule B: aged 5 to 15 must be a student

  • Donor1

    • Using Donor1 would mean that “Single” is taken as well as “No”

    • The new record fails Rule B

    • Therefore Age is taken from the donor as well

  • Two observed value changes




Implementation – CANCEIS

  • Rule A: must be aged 16+ to be married

  • Rule B: aged 5 to 15 must be a student

  • Donor2

    • Using Donor2 would mean that “Single” is taken as well as “Yes”

    • The new record passes both Rule A and Rule B

  • Only one observed value change

  • Donor2 given a higher probability of selection
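
A minimal sketch reproducing the counting in this example. The flagging sequence and donor values follow the slides; Donor1’s age of 40 is an invented stand-in for any adult age:

```python
def observed_changes(recipient, donor, flagged, missing):
    """All flagged items take the donor's values; count how many
    observed (non-missing) values actually change as a result."""
    return sum(1 for var in flagged
               if var not in missing and donor[var] != recipient[var])

recipient = {"age": 10, "marital": "Married", "student": None}
missing = {"student"}
flagged = {"student", "age", "marital"}   # age & marital flagged via Rule A

donor1 = {"age": 40, "marital": "Single", "student": "No"}   # age 40 invented
donor2 = {"age": 10, "marital": "Single", "student": "Yes"}

print(observed_changes(recipient, donor1, flagged, missing))  # -> 2
print(observed_changes(recipient, donor2, flagged, missing))  # -> 1
# Donor2 repairs the record with fewer changes to observed values,
# so it is given a higher probability of selection.
```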


Implementation – CANCEIS

  • Points to note

    • Variables were imputed in blocks of similar variables (modules)

      • there was no individual model for any one question

    • There is independence between the modules

      • for example, cultural characteristics might come from a different donor to employment characteristics

    • Imputed person data was combined in a way that maintained relationship consistency within a household

    • Given the processing approach, quality was maintained at the geographic unit level




Implementation – Manual Imputation

  • Manual Imputation kept to a minimum but was necessary

  • Manual Imputation – QA checks

    • Quality Assurance at every stage of processing

    • Distributional checks and checks against comparator data sources

    • Edits made through Data File Amendments (DFAs)

    • DFAs not taken lightly

      • Involved detailed questionnaire image analysis

      • Mostly correcting for capture errors

      • e.g. Centenarians

  • Manual Imputation to increase donor pool

    • Temporary changes sometimes required when the donor pool was too small

    • e.g. Postcode matching (would have been done later in processing but was brought forward)




Implementation – Coverage

  • Coverage Assessment and Adjustment (Record Imputation)

    • Estimating wholly missed households and/or missing persons within households

      • Enumerated persons (92%)

      • Census Under-enumeration project (CUE) (4%)

      • Census Coverage Survey (CCS) (5%)

    • Further information can be found at http://www.nisra.gov.uk/Census/pop_QA_2011.pdf




Implementation – Post-Coverage

  • Post-Coverage Item Imputation

    • Making wholly imputed records complete and consistent

    • Using the same methods as the initial Item Imputation

    • Required only basic demographic information to be available for each record

    • Final check for consistency


Considerations

  • Self-completion

    • Incorrect information provided (e.g. a mother putting down the wrong age for her baby)

    • Misunderstanding of a question or its layout (marital status / relationships)

  • Some capture errors exist

    • e.g. a date of birth captured as 1961 instead of 1981 – still valid within the family

    • Strike-throughs

  • Item Imputation assumes data are missing at random (MAR)

    • It has to – no other assumption can be made

    • Attempt to control for dependency by using modules

    • Negligible change to marginal distributions

  • Record Imputation doesn’t assume MAR

    • Designed to correct for under-coverage, which is not uniform

    • This imputation will change variable distributions

    • Extent of change driven by CCS and CUE


Considerations

  • While based on a similar approach to 2001, some differences exist that can affect imputation rates

    • Changes to definitions (e.g. Marital Status)

    • Some questions are quite similar but subtly different

      • e.g. Religion / Qualifications

    • Change in processing ability

      • Workplace postcode matching was much easier in 2011

  • Census QA undertaken at every stage

    • Census assessed against various comparator datasets

    • However, unable to compare Census-to-Census unit records

      • The 2001-to-2011 link was not available during processing


Information

  • Information already available

    • ONS paper on the Item Edit and Imputation process

      • Item Edit and Imputation Process paper

    • ONS evaluation report on Item Edit and Imputation

      • Item Edit and Imputation Process Evaluation paper

    • 2011 NI Census Methodology Overview

      • http://www.nisra.gov.uk/Census/pop_meth_2011.pdf

    • Details on the NI Census Under-enumeration project

      • 2011 Census Under Enumeration Project: Methodology paper

    • 2011 NI Census Quality papers

      • http://www.nisra.gov.uk/Census/pop_QA_2011.pdf

      • http://www.nisra.gov.uk/Census/pop_QA_2_2011.pdf

      • http://www.nisra.gov.uk/Census/key_QA_2011.pdf

    • Census Quality Survey

      • http://www.nisra.gov.uk/archive/census/2011/census_quality_survey.pdf


Next Steps

  • Census imputation rates will be published in due course

  • Change rates are available on the NILS website

    • Note that these are change rates, not imputation rates

      • Imputation rates are expressed as a percentage of expected response rather than total response

  • Most people filled in most of the questionnaire

  • A small proportion didn’t

  • Robust procedures were applied to “fill the gaps”


Questions