Large-scale Microdata workshop: An introduction to the SARs and ESDS Government Surveys

University of Plymouth 15 April 2005

Jo Wathan & Reza Afkhami

Today



  • Introduction to 2001 Individual Licensed SARs

  • Hands-on: Accessing the SARs in Nesstar

  • Lunch 12:30

  • Working with the Individual Licensed SAR: Data quality and analysis issues

  • Hands-on: The SARs in SPSS

  • Coffee 14:45

  • Further SARs issues – CAMS, Household data, SAM file, User support

  • ESDS Government Data

  • End 16:00

Introduction to the 2001 Licensed Individual SAR

Background to data development


Accessing the data

Census Microdata

  • Census outputs have historically been aggregate tables – safe but inflexible

    • Can be obtained from:

      • ONS:

      • Casweb:

    • Well suited to analyses at small geographical detail

  • Microdata permits more flexibility

    • The Longitudinal Study links census data from 1971 onwards: good for studying processes over time, but it has to be held securely

    • Demand for a cross-sectional dataset that can be used on own desktop

The 1991 Samples of Anonymised Records

  • Available for the first time after research into the confidentiality risk

  • Two samples

    • Individual SAR: 2% sample; detailed geography (large LAs)

    • Household SAR: 1% sample; hierarchical, linked individuals; detailed occupational information

The Request for the 2001 Individual SAR

  • Request sent in autumn 2001

  • Following consultation with users and confidentiality assessment, we asked for similar detail as 1991, e.g:

    • 16 categories of ethnic group (or national equivalent)

    • SOC 2000 minor (81 categories)

  • But with a 3% sample and more LADs

  • ONS had greater concerns over confidentiality

  • A more detailed ‘Controlled Access Microdata Sample’ is available in a safe setting

Safe Data

  • Subject to extensive disclosure control

    • Broad banding

    • Special uniques analysis

    • Further recodes

    • Less detail than 1991 on:

      • Geography

      • Industry/occupation

      • Age

      • Country of birth

    • Released October 2004

Second version of SARs

  • ONS reconsidered confidentiality of SARs

  • Current version of data is version 2: contains more detail than version 1

  • Users must undertake to destroy version 1 before downloading version 2

Licensed file content - geographical

  • Regional Geography

    • GOR Region PLUS

      • Inner/Outer London

      • Northern Ireland

      • Scotland

      • Wales

  • Country of birth

    • 16 categories

    • Increased from version 1

Licensed file contents: demographic

  • Age banded (v.2)

    • Individual year to 15

    • 16-19; 20-24; 25-29; 30-44;

    • 45-59; 60-64; 65-69;

    • 70-74; 75-94 single years; 95+

  • Ethnic group v.2

    • 16 categories (E and W)

    • 14 Scotland

    • 2 N. Ireland

Licensed file content: Socio-economic

  • Occupation

    • 2000 SOC Minor categories

    • NS-SEC 38 valid categories

  • Industry

    • 15 categories A-O, P, Q

  • Hours of work – single hours to 80+

New or Improved Data

  • Improved highest qualification

    • 4 categories

  • Religion – varies considerably by nation v.2

    • 9 categories in England and Wales

    • 7 in Scotland – current only

    • 7 in Northern Ireland, plus religion brought up in

  • General health

    • Good / fairly good / not good

  • Caring

    • Hours caring, 3 bands

    • Number of carers in household

Research value

  • Ability to recode variables as wished

  • Ability to select populations and variables

  • Ability to conduct multivariate analysis

  • Learning and Teaching

  • Preliminary work before using in-house file (CAMS)

The Licence

  • All users need to be licensed

  • Academics complete the licence as part of the Census Registration System process

  • Non-academic users sign the licence as part of the data registration process

  • Cannot pass the data to an unlicensed user

  • Cannot attempt to identify an individual

The licence – good practice

  • Keep your data password protected

  • Destroy your data when you have finished using it

  • Remove SAR files before passing on your PC to someone else

  • Tell CCSR about your publications

  • Tell CCSR if you leave your institution

Access Arrangements

  • Data distributed by CCSR

  • Academics, no charge

    • Register for the data under Census Registration System

    • Access the data online from CCSR website

  • Non-academics

    • Not for profit £500 per file

    • Business users £1000 per file

    • 10 users per application, incl. software

    • Download End User License from web

Accessing the data

  • Non-academic users

    • Data available in NSDstat

    • Other formats available on CD

    • Can arrange direct download

  • Academic users

    • Direct download (SPSS/Stata/tab delimited)

    • Nesstar, explore online and subset (wider range of formats available)

    • NSDstat available

Working with the 2001 Licensed Individual SAR

Coverage and quality

SAR data issues

Analysing SAR data


Census coverage

  • Major effort to improve coverage in 2001

  • One Number Census

  • A large Census Coverage Survey (CCS) of 300K households was used to correct census results

    • Design independent of census;

    • Used matched census and CCS data to estimate total population in each area,

    • adjusted all results for census non-response using imputation of households and individuals

    • Results in final database for UK adjusted for non-response

Census coverage

  • Coverage before imputation:

    • 94% of households returned forms, with another 4% estimated to be in households identified by enumerators.

  • Response rate lowest for

    • Young people in their early 20s (men aged 20-24 resp. rate of 87%)

    • Inner London (resp rate of 78%)

  • Once imputed cases are included, coverage is estimated to be 100%

Population base

  • One population base: usual residents

    • differs from 1991, when users had to choose either the present or the usual resident base

  • Students enumerated at term time address

  • Communal establishments are included

  • Implications for 2001 SARs

    • 1991 SARs selected from 10% sample

      • Did not include imputed households

      • 96% coverage

    • 2001 SARs selected from 100% ONC database

      • 94% response; 6% imputed

      • Imputed individuals/hholds are identified

      • Imputed items are flagged

    Two kinds of imputation

    • Entire individual or household may be imputed as part of ONC

      • Complete records copied from enumerated individuals/hhold

      • Variable oncperim

    • Variables imputed when information missing


    • 13.7 million edit procedures undertaken

      • 28% population had 1+ items imputed

      • Common:

        • Missing prof quals set to none

        • Carer set to no where missing (unless economic activity also missing)

        • Travel to work set to ‘work mainly at/from home’ where workplace was ‘mainly at/from home’

      • Others

        • 14k people multi-ticked ‘sex’ (so imputed)

        • 6k children had marital status changed to single

    • impossible values set to missing then imputed

      • Missing values are imputed on the basis of similar local cases

  • does not remove unlikely values

  • Item imputation

    For census output database as a whole:

    • One or more items imputed for 28% of the population

    • Employment variables most affected:

      • Industry ever worked: 18%

      • Occupation ever worked: 14%

      • Workplace size: 9%

    • Under-enumerated groups are most imputed, esp. single people

    Can I tell what/who has been imputed?

    • Oncperim records whether an individual has been imputed as part of the ONC

      • Copies entire record from census database

    • ‘z’ variables identify whether individual has imputed information on a specific variable

      • Parallel set of variables

      • zethew, zage0

    Crosstab ethnic group (ethew) by imputation flag (zethew)

    Percentage with ethnicity variable imputed, 2001 SARs

    Percentage ONC imputed, 2001 SARs
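The cross-tabulation named on these slides can also be sketched outside SPSS. The Python below is a minimal illustration, not the workshop's own syntax: the records are made-up toy data, and the coding of zethew (1 = imputed or PRAMMed, 0 = as reported) is an assumption that should be checked against the SARs documentation.

```python
from collections import Counter

# Toy records only (not real SARs data). 'ethew' and 'zethew' are the
# variable names from the slides; the 0/1 flag coding is an assumption.
records = [
    {"ethew": "White", "zethew": 0},
    {"ethew": "White", "zethew": 0},
    {"ethew": "White", "zethew": 0},
    {"ethew": "White", "zethew": 1},
    {"ethew": "Mixed", "zethew": 1},
    {"ethew": "Mixed", "zethew": 0},
    {"ethew": "Black", "zethew": 0},
    {"ethew": "Black", "zethew": 0},
]


def percent_flagged(rows, var, flag):
    """Return {category of var: percentage of its cases with flag == 1}."""
    totals = Counter()
    flagged = Counter()
    for row in rows:
        totals[row[var]] += 1
        if row[flag] == 1:
            flagged[row[var]] += 1
    return {cat: 100.0 * flagged[cat] / totals[cat] for cat in totals}


for group, pct in percent_flagged(records, "ethew", "zethew").items():
    print(f"{group}: {pct:.1f}% flagged")
```

The same function works for any variable and its parallel z flag, which makes it easy to see which groups carry the most imputed values before deciding how to treat them.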

    Should I use imputed individuals or variables?

    • Imputation of individuals is designed to compensate for under-enumeration

      • Using imputed cases will give results comparable with national data

      • It will help overcome bias from non-response

    • Imputed variables are generally reported as accurate - in general we advise using imputed information


    • But doubt over imputed ethnic group

    • Simpson and Akinwale used Longitudinal Study to compare 1991 ethnic group with imputed 2001 ethnic group

      • Majority of imputed records are ‘wrong’

      • Recommend not using imputed records for minority groups

    • SARs Percentage ethnic group imputed:

    • 2.5% white; 7.4% black; 11.7% mixed


    • PRAMMing (applying the Post-Randomisation Method, PRAM) is a perturbation designed to deal with very unusual cases, e.g. widowed 16-year-olds

    • Avoids additional broad-banding

    • Perturbation is constrained to

      • preserve univariate distributions

      • preserve multivariate distributions on control variables

      • prevent strange results (like 5-year-old widows)

    • Affects 15 variables

      • Primary economic activity – 1% cases

    The z-variables

    • PRAMMed variables are flagged along with imputed variables

      • Cannot distinguish them

    • Imputation flags are stored in variables with z prefix

    • Two versions of the download file

      • use the larger *-impflag-*.extension version if interested in imputation/PRAMMing

    General advice

    • If unsure about impact of PRAMMing and imputation

      • Do a sensitivity test

      • Use the z variable to exclude cases with imputed variables and then repeat your analysis

      • Use ONCPERIM to exclude imputed individuals and repeat your analysis
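The sensitivity test described above amounts to running the same statistic three times: on all cases, without item-imputed (or PRAMMed) values, and without ONC-imputed individuals. The sketch below is illustrative only. It uses toy records, and it assumes that zethew and oncperim are coded 1 for imputed and 0 otherwise, which should be confirmed in the documentation.

```python
# Toy records only (not real SARs data). Variable names follow the slides;
# the 0/1 coding of 'zethew' and 'oncperim' is an assumption.
ROWS = [
    {"ethew": "White", "zethew": 0, "oncperim": 0},
    {"ethew": "White", "zethew": 0, "oncperim": 0},
    {"ethew": "Mixed", "zethew": 1, "oncperim": 0},
    {"ethew": "Mixed", "zethew": 0, "oncperim": 0},
    {"ethew": "White", "zethew": 0, "oncperim": 1},
    {"ethew": "Mixed", "zethew": 1, "oncperim": 1},
    {"ethew": "White", "zethew": 0, "oncperim": 0},
    {"ethew": "White", "zethew": 0, "oncperim": 0},
]


def share(rows, value, var="ethew"):
    """Proportion of rows whose `var` equals `value`."""
    rows = list(rows)
    if not rows:
        raise ValueError("no cases left after exclusion")
    return sum(1 for r in rows if r[var] == value) / len(rows)


def sensitivity_test(rows, value):
    """Repeat the same estimate with and without imputed cases."""
    return {
        "all cases": share(rows, value),
        "excluding flagged ethew values": share(
            [r for r in rows if r["zethew"] == 0], value
        ),
        "excluding ONC-imputed individuals": share(
            [r for r in rows if r["oncperim"] == 0], value
        ),
    }


results = sensitivity_test(ROWS, "Mixed")
for label, estimate in results.items():
    print(f"{label}: {estimate:.3f}")
```

If the three estimates are close, imputation and PRAMMing are unlikely to be driving the result. If they diverge, it is worth reporting the analysis both ways.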

    National variation

    • There is one file for the whole UK

    • Some variables are country specific:

      • Irish language

    • Other variables have national variations

      • educational qualifications

      • ethnicity

      • Watch out for the E,W,S and N suffixes!

    • Slight variation in the sampling fraction for each country:

      • 3.125 in England and Wales;

      • 3.246 in Scotland

      • 3.139 in Northern Ireland
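Because the sampling fraction differs slightly by country, sample counts need country-specific grossing factors to give population estimates. The sketch below assumes the figures above are percentages (so 3.125 means 3.125% of the population) and that each case is weighted by the inverse of its sampling fraction. The function names are hypothetical and no official weights are implied.

```python
# Sampling fractions as listed on the slide, read here as percentages
# (an assumption: the slide gives the numbers without units).
SAMPLING_PERCENT = {
    "England and Wales": 3.125,
    "Scotland": 3.246,
    "Northern Ireland": 3.139,
}


def grossing_factor(country):
    """Number of people each sampled case stands for in that country."""
    return 100.0 / SAMPLING_PERCENT[country]


def grossed_estimate(sample_counts):
    """Scale per-country sample counts up to an estimated population count."""
    return sum(n * grossing_factor(country) for country, n in sample_counts.items())


for country in SAMPLING_PERCENT:
    print(f"{country}: each case represents about {grossing_factor(country):.1f} people")

# Example with made-up counts of sampled cases in some group of interest.
print(round(grossed_estimate({"England and Wales": 1000, "Scotland": 100})))
```

For analyses within a single country the factor is a constant and does not change percentages. It matters when pooling cases across countries or when quoting population totals.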

    How does the SARs compare to the aggregate data?

    Get to know the data

    • Use the documentation

    • SARs User Guide

      • Use Census schedules to check questions

      • Check univariate frequencies

      • Do exploratory analyses

      • Contact if you can’t find the information you need in the online documentation

    • Contact if you think there is a problem with the data

    SARs as a LARGE dataset

    • 1.8 Million cases can cause trouble!

    • Use Nesstar to do initial data exploration

    • Extract a subset using NESSTAR or take a subset from the downloaded file

    • For serious analysis, use a syntax (or .do) file to record your commands: this makes re-running easier

      • Create a single syntax file which starts with the original data

      • Use file naming conventions that will enable you to trace versions

      • Keep a record of work done

    SARs as sample data

    Geographically stratified sample

    • approximates to simple random sample

    • no clustering in Individual file

    • Household file – clustering within households

    • Although the sample is large, sub-groups may still have small sample sizes

    • use standard errors and confidence intervals

    Comparisons between 1991 and 2001

    • Population base changed

      • Imputation (no imputed values in 1991 SARs)

      • Students – enumerated at term-time address

      • Residents only (choice in 1991)

    • Variable continuity

      • Variable names have been changed where the variable is not exactly the same

      • Some variables (e.g. age, LLI) are easy to compare by grouping 1991 values

      • Some variables are harder to compare as the question has changed (eg qualifications)
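Grouping 1991 values to match 2001 is a simple recode. The sketch below maps a single year of age onto the 2001 licensed-file bands exactly as they are listed on the demographic slide earlier in this deck. It assumes the 1991 age variable is held in single years, and the function name is hypothetical. Check the band boundaries against the SARs codebook before relying on them.

```python
# (low, high, label) bands as listed on the "Licensed file contents:
# demographic" slide; high=None marks the open-ended top band.
BANDS_2001 = [
    (16, 19, "16-19"),
    (20, 24, "20-24"),
    (25, 29, "25-29"),
    (30, 44, "30-44"),
    (45, 59, "45-59"),
    (60, 64, "60-64"),
    (65, 69, "65-69"),
    (70, 74, "70-74"),
    (95, None, "95+"),
]


def band_age(age):
    """Recode a single year of age to the 2001-style band label."""
    if age < 0:
        raise ValueError("age cannot be negative")
    for low, high, label in BANDS_2001:
        if age >= low and (high is None or age <= high):
            return label
    # Ages not covered by a band stay as single years on the slide
    # (0 to 15 and 75 to 94).
    return str(age)


for example_age in (7, 16, 44, 80, 99):
    print(example_age, "->", band_age(example_age))
```

Keeping the bands in one table at the top makes it easy to correct a boundary once and rerun the whole recode.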

    Ethnicity 91/01

    • Different questions asked in 1991 and 2001

    • No agreed and perfect correspondence

    • Simpson and Akinwale use LS to show how 1991 maps on to 2001

    Software options

    • Supported packages

      • Nesstar

      • NSDstat

      • SPSS

      • Stata

    • Other options

      • Import or Stat/transfer to another package

      • Use Nesstar to save to SAS or Statistica

      • unless you use a v. small subsample the SARs will be too big for most spreadsheets!

    Looking forward: Moving forward

    Controlled Access Microdata Samples

    Household SARs

    Small Area Microdata sample

    Learning and Teaching

    CAMS content

    • Controlled Access Microdata designed for professional researchers:

    • Access in safe setting only

    • Specification on SARs website

    • Individual file and Household file

    Content of CAMS files

    • Files contain much more detail, e.g.

      • Individual year of age (topcoded at 95)

      • Full coding on country of birth

      • SOC Unit Group

      • Local authority geography

      • Index of Deprivation for SOAs

      • Index of Deprivation for migrants’ last address

    Controlled Access

    • CAMS is managed by ONS

    • Data is accessed at London/Titchfield/Newport in Virtual Laboratory setting on a server

    • Virtual lab looks like a standard windows interface

    • Use SPSS/Stata in usual way

    • output checked for confidentiality before release

    • Further information and appropriate forms at

    • Contact for more details

    CAMS Good practice

    • Use the licensed SARs...

      • to exhaust the potential of other datasets

      • to write your syntax files

    • check the disclosure guidelines before writing your file

    • Avoid complex tables

      • small cell counts aren’t reliable

      • unique cells will usually be suppressed

    • Do use models

    Household SAR

    • 1% of households and all individuals

    • Allows linkage between individuals in households

    • Similar detail to Individual SAR

      • Continuing discussion over ONS’ confidentiality concerns

        • Large households, less detail on households of 6+

    • Specification of Household SAR on website

    The hierarchy of the household file

    Release of household SAR

    • Discussions are continuing over release of Household SAR

    • One possibility is a dataset with full information on age and all individuals in large households but under more tightly regulated conditions than the Individual SAR

    Small Area Microdata file

    • 5% sample of individuals

    • Full range of variables

    • LA lowest geography

      • Except Isles of Scilly and City of London in E and W; similar exceptions in S and NI

    • Excludes communal establishments

    • Age 11-year bands

    • Ethnicity – 5 groups or 16 with records swapping between LAs

    • Economic activity – 3 categories

  • Delivery expected summer 2005

  • Using the SARs in Learning and Teaching

    • The SARs provide an easy-to-use dataset

    • Fits well with aggregate data

    • Supported by learning and teaching materials


    • Access managed in same way:

      • use Census Registration System

      • need ATHENS (for data and CHCC)

    User support

    • Web pages are regularly updated

    • Documentation online

    • Resources and links added as we go

    • Seminar invitations welcome!

    • Regional workshop invites welcome!

    • SARs Helpdesk


      • (0161) 275 4735

    • Join email and newsletter lists

    • SARs User Group – July 15th, RSS, London

    Introduction to the key large-scale government surveys

    What is ESDS Government?

    What data are available?

    Why use the data?

    How do you get the data?

    ESDS Government

    • One of four specialist services of ESDS. ESDS is a new national data service (since Jan 03)

      • ESDS Government

      • ESDS Longitudinal

      • ESDS Qualidata

      • ESDS International

    • ESDS Government provides access and user support for key large-scale government surveys such as Labour Force Survey and General Household Survey

    • Access remains via the UKDA

    ESDS Government - some of the things we do!

    • Helpdesk

    • Survey pages incl. how to get started

    • Online guides – SPSS, Stata, Weighting, Employment Research, Health Research

    • User Group seminars (data users and data creators)

    • Publications Database

    • Derived variables

      • consistent over time

      • consistent with Census

  • Teaching datasets

  • Training


    Which surveys?

    • General Household Survey

    • Labour Force Survey

    • Family Resources Survey

    • Expenditure and Food Survey (previously the National Food Survey and Family Expenditure Survey)

    • ONS Omnibus Survey

    • National Travel Survey

    • Time Use Survey

    • British Crime Survey/Scottish Crime Survey

    • British Social Attitudes/Scottish Social Attitudes/Northern Ireland Life & Times/Young People’s Social Attitudes

    • Health Survey for England/Wales/Scotland

    • Survey of English Housing (England only)

    What is the data like?

    • Survey microdata

    • Large sample sizes (but smaller than the SARs)

    • Continuous surveys – always up-to-date

    • Cross-sectional (although the LFS has a 5-quarter panel element)

    • Specialist topic surveys – more depth than the Census

    • Freely available to academics via ESDS

    Quality of Data (1)

    • Two main data collectors:

      • Office for National Statistics (ONS)

      • NatCen

    • Both have considerable experience

      • ONS Social Surveys started in 1941

      • NatCen founded in 1969 (as SCPR)

    • Permanent panels of highly trained field interviewers

    • Management and Quality Checking

    • (Relatively) high response rates – but falling

    • Widespread use by secondary analysts

    Quality of Data (2): Example of GHS data collection

    What would you use the data for?

    • Straightforward secondary analysis

      • To assess theoretical accounts

      • To quantify characteristics or behaviours

      • To challenge official views

      • To apply alternative definitions

    • Context to your own primary research

      • Your research could be quantitative or qualitative

      • To assess the national context of an area study

      • To assess whether your sample is typical

      • To assess the scale of behaviours

    Practical research uses of the data

    • Looking at change over time

    • Look at sub-populations

    • Using the flexibility of the data to look at alternative definitions

    • Looking within households

    Change over time

    Secondary analysis: change over time among sub-populations

    Marmot, M (2003)


    Reasonable amount of comparability

    Can pool years/quarters

    Data is representative at each time point

    Good at looking at impacts on groups


    Limits to continuity in the data (e.g. ethnic)

    Cannot establish individual change

    Using successive cross-sectional data over time

    Looking at small populations

    • Only the Samples of Anonymised Records have larger sample sizes

    • Many surveys with 10+k respondents

      • Permits minority groups to be represented

      • For rare subpopulations the sample size may be too small; consider combining years if appropriate

    Combining datasets to increase sample size

    • Survey data is subject to sampling error!

    • Example: Pregnancy and Employment

    • Using 1998-99 General Household Survey data alone there are only 168 pregnant women aged 16-49

    • 95% confidence interval for the % of pregnant women who are economically inactive: 34.2–49.1%

    • Combined 3 years’ data to obtain sample of 465 pregnant women

    • Confidence interval using 3 years’ data: 34.9 – 43.9%
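The narrowing of the interval follows from the usual standard error of a proportion, which shrinks with the square root of the sample size. The sketch below uses the normal approximation for a simple random sample. The point estimates are not given on the slide, so the midpoint of each quoted interval is used as an assumed estimate, and any design effects in the GHS are ignored.

```python
import math


def proportion_ci(p, n, z=1.96):
    """Normal-approximation confidence interval for a proportion p from n cases."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    se = math.sqrt(p * (1.0 - p) / n)  # standard error of the proportion
    return p - z * se, p + z * se


# Assumed point estimates: midpoints of the intervals quoted on the slide.
one_year = proportion_ci(0.4165, 168)
three_years = proportion_ci(0.394, 465)

for label, (low, high) in (("1 year", one_year), ("3 years", three_years)):
    print(f"{label}: {100 * low:.1f}% to {100 * high:.1f}%")
```

With 168 cases this gives roughly 34.2% to 49.1%, matching the slide. With 465 cases it gives about 35.0% to 43.8%, close to but not identical with the quoted 34.9–43.9%, which may reflect rounding or adjustments not shown on the slide.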

    Using the flexibility of the data to look at alternative definitions

    What are ‘hours worked’?

    • Is it just paid work? Or unpaid as well?

    • Hours usually worked, or actually worked last week?

    • In main job, or in any job?

    • What about students?

    • Overtime – paid?

    • Overtime – unpaid?

    • Lunch hours?

    • Do non-workers work zero hours or should they be excluded?

    Choosing a survey for research

    • Which surveys cover your main topic?

    • Which other topics are you interested in?

    • Measurement over time

    • Geography

    • Respondents – whole household, children?

    • Sample size

    Using the data in teaching

    • Methods courses

      • Using the data in a hands on manner

      • Using substantive exemplars to demonstrate a methodological point

      • Using the surveys as methodological exemplars

    • Substantive courses

      • Making your point using data

      • Integrating methods into substantive courses

    • Teaching datasets

      • General Household Survey

      • Labour Force Survey

      • British Crime Survey

      • Health Survey for England

    Quality of Documentation

    • Questionnaire

    • Code book of Variables

    • Description of Derived Variables

    • Definitions

    • Methodology including

      • Sampling method

      • Response achieved

      • Population base

    • Published reports

    Health Survey for England documentation

    Documentation - GHS Questionnaire

    Continuous Population Survey

    • Integrates LFS, GHS, EFS, OMN, APS into a single survey

      • Annual achieved sample size = 265,000 households (over half a million adults) GB

      • Common core module of questions to whole sample

      • Topic modules to portions of sample

      • Core and topic modules combined to construct a number of different viable interview combinations

    • January to March 2007 second consultation

    • April 2007 Decision

    • January 2008 Start date

    CPS: Benefits & Risks

    Benefits


    • Builds on harmonisation

    • Increased flexibility while maintaining comparability over time

    • a very large annual sample for core module variables – improves precision;

    • an improved, ‘unclustered’ sample design;

    • better representation at local authority district level, and

    • improved weighting methodology

    • Greater coherence in official statistics, fewer ‘competing estimates’ between surveys

    Risks


    • Major change to the government surveys

    • All eggs in one basket!

    Accessing the data

    • Access the documentation and data online in a similar manner to the Licensed SAR data

    • ESDS data stored at UK Data Archive

    • Can be downloaded in SPSS/Stata format

    • Can be explored using Nesstar

    • Users must be registered

    • Registration is managed by the Census Registration System

    • Once registered under the CRS, register a project

    • Order a dataset for a particular project

  • Login