Large-scale Microdata workshop: An introduction to the SARs and ESDS Government Surveys

University of Plymouth 15 April 2005

Jo Wathan & Reza Afkhami

Today


  • Introduction to 2001 Individual Licensed SARs

  • Hands-on: Accessing the SARs in Nesstar

  • Lunch 12:30

  • Working with the Individual Licensed SAR: Data quality and analysis issues

  • Hands-on: The SARs in SPSS

  • Coffee 14:45

  • Further SARs Issues – CAMs, Household data, SAMs files, User support

    ESDS Government Data

    End 16:00


Introduction to the 2001 Licensed Individual SAR

Background to data development


Accessing the data

Census Microdata

  • Census outputs have historically been aggregate tables – safe but inflexible

    • Can be obtained from:

      • ONS:

      • Casweb:

    • Well suited to analyses at small geographical detail

  • Microdata permits more flexibility

    • Longitudinal Study links data from 1971: good for studying processes, but access has to be secure

    • Demand for a cross-sectional dataset that can be used on own desktop

The 1991 Samples of Anonymised Records

  • Available for the first time after research into the confidentiality risk

  • Two samples

    • Individual SAR: detailed geography (large LAs), 2% sample

    • Household SAR: hierarchical, linked individuals; detailed occupational information; 1% sample

The Request for the 2001 Individual SAR

  • Request sent in autumn 2001

  • Following consultation with users and a confidentiality assessment, we asked for similar detail to 1991, e.g.:

    • 16 categories of ethnic group (or national equivalent)

    • SOC 2000 minor (81 categories)

  • But with a 3% sample and more LADs

  • ONS had greater concerns over confidentiality

  • A more detailed ‘Controlled Access Microdata Sample’ is available in a safe setting

Safe Data

  • Subject to extensive disclosure control

    • Broad banding

    • Special uniques analysis

    • Further recodes

    • Less detail than 1991 on:

      • Geography

      • Industry/occupation

      • Age

      • Country of birth

    • Released October 2004

Second version of SARs

  • ONS reconsidered confidentiality of SARs

  • Current version of data is version 2: contains more detail than version 1

  • Users must undertake to destroy version 1 before downloading version 2

Licensed file content - geographical

  • Regional Geography

    • GOR Region PLUS

      • Inner/Outer London

      • Northern Ireland

      • Scotland

      • Wales

  • Country of birth

    • 16 categories

    • Increased from version 1

Licensed file contents: demographic

  • Age banded v.2

    • Individual year to 15

    • 16-19; 20-24; 25-29; 30-44;

    • 45-59; 60-64; 65-69;

    • 70-74; 75-94 single years; 95+

  • Ethnic group v.2

    • 16 categories (E and W)

    • 14 Scotland

    • 2 N. Ireland
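The age banding listed above can be reproduced when preparing comparison data from another source. The following is a minimal sketch that takes the slide's listing at face value (single years to 15, the stated bands from 16 to 74, single years from 75 to 94, and a 95+ top code). The function name `age_band_v2` is hypothetical and is not part of the SARs files or documentation.

```python
def age_band_v2(age: int) -> str:
    """Map a single year of age to the version 2 banding as listed on the slide."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= 95:
        return "95+"  # top-coded
    if age <= 15 or age >= 75:
        return str(age)  # single years are retained for 0-15 and 75-94
    bands = [(16, 19), (20, 24), (25, 29), (30, 44),
             (45, 59), (60, 64), (65, 69), (70, 74)]
    for low, high in bands:
        if low <= age <= high:
            return f"{low}-{high}"
    raise AssertionError("unreachable: all ages 16-74 fall in a band")
```

Check the banding against the SARs User Guide before relying on it, since the listing here comes from a slide summary.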

Licensed file content: Socio-economic

  • Occupation

    • 2000 SOC Minor categories

    • NS-SEC 38 valid categories

  • Industry

    • 15 categories A-O, P, Q

  • Hours of work – single hours to 80+

New or Improved Data

  • Improved highest qualification

    • 4 categories

  • Religion – varies considerably by nation v.2

    • 9 categories in England and Wales

    • 7 in Scotland – current only

    • 7 in Northern Ireland, plus religion brought up in

  • General health

    • Good / fairly good / not good

  • Caring

    • Hours caring, 3 bands

    • Number of carers in household

Research value

  • Ability to recode variables as wished

  • Ability to select populations and variables

  • Ability to conduct multivariate analysis

  • Learning and Teaching

  • Preliminary work before using in-house file (CAMS)

The Licence

  • All users need to be licensed

  • Academics complete the licence as part of the Census Registration System process

  • Non-academic users sign the licence as part of the data registration process

  • Cannot pass the data to an unlicensed user

  • Cannot attempt to identify an individual

The licence – good practice

  • Keep your data password protected

  • Destroy your data when you have finished using it

  • Remove SAR files before passing on your PC to someone else

  • Tell CCSR about your publications

  • Tell CCSR if you leave your institution

Access Arrangements

  • Data distributed by CCSR

  • Academics, no charge

    • Register for the data under Census Registration System

    • Access the data online from CCSR website

  • Non-academics

    • Not for profit £500 per file

    • Business users £1000 per file

    • 10 users per application, incl. software

    • Download End User License from web

Accessing the data

  • Non-academic users

    • Data available in NSDstat

    • Other formats available on CD

    • Can arrange direct download

  • Academic users

    • Direct download (SPSS/Stata/tab delimited)

    • Nesstar, explore online and subset (wider range of formats available)

    • NSDstat available


Working with the 2001 Licensed Individual SAR

Coverage and quality

SAR data issues

Analysing SAR data


Census coverage

  • Major effort to improve coverage in 2001

  • One Number Census

  • Use of large Census Coverage Survey to correct census results, 300K households

    • Design independent of census

    • Used matched census and CCS data to estimate total population in each area

    • Adjusted all results for census non-response using imputation of households and individuals

    • Results in final database for UK adjusted for non-response

Census coverage

  • Coverage before imputation:

    • 94% of households returned forms, with another 4% estimated to be in households identified by enumerators.

  • Response rate lowest for

    • Young people in their early 20s (men aged 20-24 resp. rate of 87%)

    • Inner London (resp rate of 78%)

  • Once imputed cases are included, coverage is estimated to be 100%

Population base

  • One population base: usual residents

    • differs from 1991, when the user had to choose either the present or the usual resident base

  • Students enumerated at term time address

  • Communal establishments are included

    Implications for 2001 SARs

    • 1991 SARs selected from 10% sample

      • Did not include imputed households

      • 96% coverage

    • 2001 SARs selected from 100% ONC database

      • 94% response; 6% imputed

      • Imputed individuals/hholds are identified

      • Imputed items are flagged

    Two kinds of imputation

    • Entire individual or household may be imputed as part of ONC

      • Complete records copied from enumerated individuals/hhold

      • Variable oncperim

    • Variables imputed when information missing


    • 13.7 million edit procedures undertaken

      • 28% of the population had 1+ items imputed

      • Common:

        • Missing prof quals set to none

        • Carer set to no where missing (unless economic activity also missing)

        • Travel to work set to ‘work mainly at/from home’ where workplace was ‘mainly at/from home’

      • Others

        • 14k people multi-ticked ‘sex’ (so imputed)

        • 6k children had marital status changed to single

    • impossible values set to missing then imputed

      • Missing values are imputed on the basis of similar local cases

  • does not remove unlikely values

    Item imputation

    For census output database as a whole:

    • One or more items imputed for 28% of the population

    • Employment variables most affected:

      • Industry ever worked: 18%

      • Occupation ever worked: 14%

      • Workplace size: 9%

    • Under-enumerated groups are most imputed, esp. single people

    Can I tell what/who has been imputed?

    • Oncperim records whether an individual has been imputed as part of the ONC

      • Copies entire record from census database

    • ‘z’ variables identify whether individual has imputed information on a specific variable

      • Parallel set of variables

      • zethew, zage0

    Should I use imputed individuals or variables?

    • Imputation of individuals is designed to compensate for under-enumeration

      - using imputed cases will give results comparable with national data

      - will help overcome bias from non-response

    • Imputed variables are generally reported as accurate - in general we advise using imputed information

    Ethnicity

    • But doubt over imputed ethnic group

    • Simpson and Akinwale used Longitudinal Study to compare 1991 ethnic group with imputed 2001 ethnic group

      • Majority of imputed records are ‘wrong’

      • Recommend not using imputed records for minority groups

    • Percentage of each ethnic group imputed in the SARs: 2.5% white; 7.4% black; 11.7% mixed

    PRAMMing

    • PRAMMing is perturbation designed to deal with very unusual cases, e.g. widowed 16-year-olds

    • Avoids additional broad-banding

    • Perturbation is constrained to

      • preserve univariate distributions

      • preserve multivariate distributions on control variables

      • prevent strange results (like 5-year-old widows)

    • Affects 15 variables

      • Primary economic activity – 1% cases

    The z-variables

    • PRAMMed variables are flagged along with imputed variables

      • Cannot distinguish them

    • Imputation flags are stored in variables with z prefix

    • Two versions of the download file

      • use the larger *-impflag-*.extension version if interested in imputation/PRAMMing

    General advice

    • If unsure about impact of PRAMMing and imputation

      • Do a sensitivity test

      • use the z var to exclude cases with imputed variables and then repeat your analysis

      • Use ONCPERIM to exclude imputed individuals and repeat your analysis
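The sensitivity test described above amounts to computing the same estimate twice, once on all cases and once with flagged cases dropped. The following is a minimal sketch in plain Python on toy records. The flag names `zethew` and `oncperim` come from the slides; the substantive variable name `ethew`, the coding of the flags (1 = imputed or PRAMMed, 0 = not), and the helper names are assumptions made for illustration, so check the SARs User Guide for the real coding.

```python
def share(records, var, value):
    """Proportion of records where `var` equals `value` (None if there are no records)."""
    if not records:
        return None
    return sum(1 for r in records if r[var] == value) / len(records)


def sensitivity(records, var, value, flag_var, flagged_code=1):
    """Compare an estimate on all cases with one that excludes flagged cases."""
    kept = [r for r in records if r[flag_var] != flagged_code]
    return {
        "all": share(records, var, value),
        "excluding_flagged": share(kept, var, value),
        "n_all": len(records),
        "n_kept": len(kept),
    }


# Toy data only: four invented records, not real SAR cases.
records = [
    {"ethew": "white", "zethew": 0, "oncperim": 0},
    {"ethew": "white", "zethew": 0, "oncperim": 0},
    {"ethew": "mixed", "zethew": 1, "oncperim": 0},
    {"ethew": "white", "zethew": 0, "oncperim": 1},
]

by_item_flag = sensitivity(records, "ethew", "mixed", "zethew")
by_person_flag = sensitivity(records, "ethew", "white", "oncperim")
```

If the two estimates differ materially, report both and say which cases were excluded. In SPSS or Stata the same comparison is a filter or `if` condition on the flag variable followed by a re-run of the analysis.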

    National variation

    • There is one file for the whole UK

    • Some variables are country specific:

      • Irish language

    • Other variables have national variations

      • educational qualifications

      • ethnicity

      • Watch out for the E,W,S and N suffixes!

    • Slight variation in the sampling fraction for each country:

      • 3.125 in England and Wales;

      • 3.246 in Scotland

      • 3.139 in Northern Ireland
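Because the sampling fraction differs slightly by country, sample counts need country-specific grossing factors before they are added up into UK population estimates. The following is a minimal sketch, assuming the figures above are percentages (consistent with a sample of roughly 3%) and that a simple inverse-fraction weight is appropriate. The names `SAMPLING_FRACTION_PCT`, `grossing_weight` and `estimated_population` are hypothetical.

```python
# Sampling fractions from the slide, read here as percentages (an assumption).
SAMPLING_FRACTION_PCT = {
    "England and Wales": 3.125,
    "Scotland": 3.246,
    "Northern Ireland": 3.139,
}


def grossing_weight(country: str) -> float:
    """Number of people in the population represented by one sampled case."""
    return 100.0 / SAMPLING_FRACTION_PCT[country]


def estimated_population(sample_counts: dict) -> float:
    """Gross up per-country sample counts to a combined population estimate."""
    return sum(n * grossing_weight(country) for country, n in sample_counts.items())
```

For example, `grossing_weight("England and Wales")` gives 32.0, so 100 sampled cases there stand for about 3,200 people.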

    Get to know the data

    • Use the documentation

    • SARs User Guide

      • Use Census schedules to check questions

      • Check univariate frequencies

      • Do exploratory analyses

      • Contact [email protected] if you can’t find the information you need in the online documentation

    • Contact [email protected] if you think there is a problem with the data

    SARs as a LARGE dataset

    • 1.8 Million cases can cause trouble!

    • Use Nesstar to do initial data exploration

    • Extract a subset using NESSTAR or take a subset from the downloaded file

    • For serious analysis, use a syntax (or .do) file to record your commands; this makes re-running easier

      • Create a single syntax file which starts with the original data

      • Use file naming conventions that will enable you to trace versions

      • Keep a record of work done
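Taking a subset from the downloaded tab-delimited file can itself be scripted, so that the extraction step is recorded and repeatable in the same way as a syntax file. The following is a minimal sketch using only Python's standard `csv` module. It streams the file row by row, so the full 1.8 million cases never need to be held in memory. The function name `extract_subset` and the column names `age`, `sex` and `region` are hypothetical and are not the actual SARs variable names.

```python
import csv
import io


def extract_subset(src, dst, keep_vars, keep_row):
    """Copy selected columns of selected rows from one tab-delimited stream to another.

    src and dst are open text streams; keep_row is a function taking a row
    (a dict of strings) and returning True for rows to keep.
    Returns the number of rows written.
    """
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(dst, fieldnames=keep_vars, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        if keep_row(row):
            writer.writerow(row)  # columns not in keep_vars are dropped
            kept += 1
    return kept


# In-memory demonstration with invented values; real use would pass open files.
demo_src = io.StringIO("age\tsex\tregion\n23\t1\t5\n67\t2\t5\n30\t2\t9\n")
demo_dst = io.StringIO()
n_kept = extract_subset(demo_src, demo_dst, ["age", "sex"],
                        lambda row: row["region"] == "5")
```

With real files, open both with `newline=""` as the `csv` documentation recommends, and give the output a name that records the selection rule so versions can be traced.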

    SARs as sample data

    Geographically stratified sample

    • approximates to simple random sample

    • no clustering in Individual file

    • Household file – clustering within households

    • Although the sample is large, you may have small sample sizes when using sub-groups

    • use standard errors and confidence intervals

    Comparisons between 1991 and 2001

    • Population base changed

      • Imputation (no imputed values in 1991 SARs)

      • Students – enumerated at term-time address

      • Residents only (choice in 1991)

    • Variable continuity

      • Variable names have been changed where the variable is not exactly the same

      • Some variables (e.g. age, LLI) are easy to compare by grouping 1991 values

      • Some variables are harder to compare as the question has changed (e.g. qualifications)

    Ethnicity 91/01

    • Different questions asked in 1991 and 2001

    • No agreed and perfect correspondence

    • Simpson and Akinwale use LS to show how 1991 maps on to 2001

    Software options

    • Supported packages

      • Nesstar

      • NSDstat

      • SPSS

      • Stata

    • Other options

      • Import, or use Stat/Transfer to convert, to another package

      • Use Nesstar to save to SAS or Statistica

      • unless you use a v. small subsample the SARs will be too big for most spreadsheets!


    Looking forward: Moving forward

    Controlled Access Microdata Samples

    Household SARs

    Small Area Microdata sample

    Learning and Teaching

    CAMS content

    • Controlled Access Microdata designed for professional researchers:

    • Access in safe setting only

    • Specification on SARs website

    • Individual file and Household file

    Content of CAMs files

    • Files contain much more detail, e.g.

      • Individual year of age (topcoded at 95)

      • Full coding on country of birth

      • SOC Unit Group

      • Local authority geography

      • Index of Deprivation for SOAs

      • Index of Deprivation for migrants’ last address

    Controlled Access

    • CAMS is managed by ONS

    • Data is accessed at London/Titchfield/Newport in Virtual Laboratory setting on a server

    • Virtual lab looks like a standard Windows interface

    • Use SPSS/Stata in usual way

    • Output is checked for confidentiality before release

    • Further information and appropriate forms at

    • Contact [email protected] for more details

    CAMS Good practice

    • Use the licensed SARs...

      • to exhaust the potential of other datasets

      • to write your syntax files

    • check the disclosure guidelines before writing your file

    • Avoid complex tables

      • small cell counts aren’t reliable

      • unique cells will usually be suppressed

    • Do use models

    Household SAR

    • 1% of households and all individuals

    • Allows linkage between individuals in households

    • Similar detail to Individual SAR

      • Continuing discussion over ONS’ confidentiality concerns

        • Large households, less detail on households of 6+

    • Specification of Household SAR on website

    Release of household SAR

    • Discussions are continuing over release of Household SAR

    • One possibility is a dataset with full information on age and all individuals in large households but under more tightly regulated conditions than the Individual SAR

    Small Area Microdata file

    • 5% sample of individuals

    • Full range of variables

    • LA lowest geography

      • Except Isles of Scilly and City of London in E and W; similar exceptions in S and NI

    • Excludes communal establishments

    • Age 11-year bands

    • Ethnicity – 5 groups or 16 with records swapping between LAs

    • Economic activity – 3 categories

  • Summer 2005 for delivery

    Using the SARs in Learning and Teaching

    • SARs provide an easy-to-use dataset

    • Fits well with aggregate data

    • Supported by learning and teaching materials


    • Access managed in same way:

      • use Census Registration System

      • need ATHENS (for data and CHCC)

    User support

    • Web pages are regularly updated

    • Documentation online

    • Resources and links added as we go

    • Seminar invitations welcome!

    • Regional workshop invites welcome!

    • SARs Helpdesk

    • Join email and newsletter lists

    • SARs User Group – July 15th, RSS, London


    Introduction to the key large-scale government surveys

    What is ESDS Government?

    What data are available?

    Why use the data?

    How do you get the data?

    ESDS Government

    • One of four specialist services of ESDS. ESDS is a new national data service (since Jan 03)

      • ESDS Government

      • ESDS Longitudinal

      • ESDS Qualidata

      • ESDS International

    • ESDS Government provides access and user support for key large-scale government surveys such as Labour Force Survey and General Household Survey

    • Access remains via the UKDA

    ESDS Government - some of the things we do!

    • Helpdesk

    • Survey pages incl. how to get started

    • Online guides – SPSS, STATA, Weighting, Employment Research, Health Research

    • User Group seminars (data users and data creators)

    • Publications Database

    • Derived variables: consistent over time and consistent with the Census

  • Teaching datasets

  • Training


    Which surveys?

    • General Household Survey

    • Labour Force Survey

    • Family Resources Survey

    • Expenditure and Food Survey (previously the National Food Survey and Family Expenditure Survey)

    • ONS Omnibus Survey

    • National Travel Survey

    • Time Use Survey

    • British Crime Survey/Scottish Crime Survey

    • British Social Attitudes/Scottish Social Attitudes/Northern Ireland Life & Times/Young People’s Social Attitudes

    • Health Survey for England/Wales/Scotland

    • Survey of English Housing (England only)

    What is the data like?

    • Survey microdata

    • Large sample sizes (but smaller than the SARs)

    • Continuous surveys – always up-to-date

    • Cross-sectional (although the LFS has a 5-quarter panel element)

    • Specialist topic surveys – more depth than the Census

    • Freely available to academics via ESDS

    Quality of Data (1)

    • Two main data collectors:

      • Office for National Statistics (ONS)

      • NatCen

    • Both have considerable experience

      • ONS Social Surveys started in 1941

      • NatCen founded in 1969 (as SCPR)

    • Permanent panels of highly trained field interviewers

    • Management and Quality Checking

    • (Relatively) high response rates – but falling

    • Widespread use by secondary analysts

    Quality of Data (2): Example of GHS data collection

    What would you use the data for?

    • Straightforward secondary analysis

      • To assess theoretical accounts

      • To quantify characteristics or behaviours

      • To challenge official views

      • To apply alternative definitions

    • Context to your own primary research

      • Your research could be quantitative or qualitative

      • To assess the national context of an area study

      • To assess whether your sample is typical

      • To assess the scale of behaviours

    Practical research uses of the data

    • Looking at change over time

    • Look at sub-populations

    • Using the flexibility of the data to look at alternative definitions

    • Looking within households

    Secondary analysis: change over time among sub-populations

    Marmot, M (2003)

    Using successive cross-sectional data over time

    Reasonable amount of comparability

    Can pool years/quarters

    Data is representative at each time point

    Good at looking at impacts on groups


    Limits to continuity in the data (e.g. ethnic)

    Cannot establish individual change

    Looking at small populations

    • Only the Samples of Anonymised Records have larger sample sizes

    • Many surveys with 10k+ respondents

      • Permits minority groups to be represented

      • For rare subpopulations the sample size may be too small; consider combining years if appropriate


    Combining datasets to increase sample size

    • Survey data is subject to sampling error!

    • Example: Pregnancy and Employment

    • Using 1998-99 General Household Survey data alone there are only 168 pregnant women aged 16-49

    • 95% confidence interval for % of pregnant women economically inactive: 34.2 – 49.1%

    • Combined 3 years’ data to obtain sample of 465 pregnant women

    • Confidence interval using 3 years’ data: 34.9 – 43.9%
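The narrowing of the interval when years are pooled follows from the standard error of a proportion shrinking with the square root of the sample size. The following is a minimal sketch using the normal approximation for a simple random sample. The point estimates (41.65% and 39.4%) are not given on the slide; they are inferred here as the midpoints of the two quoted intervals, which is an assumption. The slide's own figures may also reflect design effects or a different method.

```python
import math


def proportion_ci(p: float, n: int, z: float = 1.96):
    """Normal-approximation confidence interval for a proportion p from n cases."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# One year of GHS data: 168 pregnant women aged 16-49 (midpoint estimate assumed).
low_1yr, high_1yr = proportion_ci(0.4165, 168)

# Three pooled years: 465 pregnant women (midpoint estimate assumed).
low_3yr, high_3yr = proportion_ci(0.394, 465)
```

Under these assumptions the one-year interval comes out very close to the quoted 34.2 – 49.1%, and the pooled interval is within a fraction of a percentage point of the quoted 34.9 – 43.9%. The point is the comparison: nearly tripling the sample cuts the interval width by roughly 40%.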

    Using the flexibility of the data to look at alternative definitions

    What are ‘hours worked’?

    • Is it just paid work? Or unpaid as well?

    • Hours usually worked, or actually worked last week?

    • In main job, or in any job?

    • What about students?

    • Overtime – paid?

    • Overtime – unpaid?

    • Lunch hours?

    • Do non-workers work zero hours or should they be excluded?

    Choosing a survey for research

    • Which surveys cover your main topic?

    • Which other topics are you interested in?

    • Measurement over time

    • Geography

    • Respondents – whole household, children?

    • Sample size

    Using the data in teaching

    • Methods courses

      • Using the data in a hands on manner

      • Using substantive exemplars to demonstrate a methodological point

      • Using the surveys as methodological exemplars

    • Substantive courses

      • Making your point using data

      • Integrating methods into substantive courses

    • Teaching datasets

      • General Household Survey

      • Labour Force Survey

      • British Crime Survey

      • Health Survey for England

    Quality of Documentation

    • Questionnaire

    • Code book of Variables

    • Description of Derived Variables

    • Definitions

    • Methodology including

      • Sampling method

      • Response achieved

      • Population base

    • Published reports

    Continuous Population Survey

    • Integrates LFS, GHS, EFS, OMN, APS into single survey

      • Annual achieved sample size = 265,000 households (over half a million adults) GB

      • Common core module of questions to whole sample

      • Topic modules to portions of sample

      • Core and topic modules combined to construct a number of different viable interview combinations

    • January to March 2007 second consultation

    • April 2007 Decision

    • January 2008 Start date

    CPS: Benefits & Risks


    • Builds on harmonisation

    • Increased flexibility while maintaining comparability over time

    • A very large annual sample for core module variables – improves precision

    • An improved, ‘unclustered’ sample design

    • Better representation at local authority district level

    • Improved weighting methodology

    • Greater coherence in official statistics, fewer ‘competing estimates’ between surveys


    • Major change to the government surveys

    • All eggs in one basket!

    Accessing the data

    • Access the documentation and data online in a similar manner to the Licensed SAR data

    • ESDS data stored at UK Data Archive

    • Can be downloaded in SPSS/Stata format

    • Can be explored using Nesstar

    • Users must be registered

    • Registration is managed by the Census Registration System

    • Once registered under CRS, register a project

    • Order a dataset for a particular project