Large-scale Microdata workshop: An introduction to the SARs and ESDS Government Surveys


University of Plymouth 15 April 2005

Jo Wathan & Reza Afkhami



Today

SARs:

  • Introduction to 2001 Individual Licensed SARs
  • Hands-on: Accessing the SARs in Nesstar
  • Lunch 12:30
  • Working with the Individual Licensed SAR: Data quality and analysis issues
  • Hands-on: The SARs in SPSS
  • Coffee 14:45
  • Further SARs Issues – CAMs, Household data, SAMs files, User support

ESDS Government Data

End 16:00


Introduction to the 2001 Licensed Individual SAR

Background to data development


Accessing the data

Census Microdata
  • Census outputs have historically been aggregate tables – safe but inflexible
    • Can be obtained from:
      • ONS:
      • Casweb:
    • Well suited to analyses at small geographical detail
  • Microdata permits more flexibility
    • The Longitudinal Study links data from 1971: good for studying processes, but has to be held securely
    • Demand for a cross-sectional dataset that can be used on own desktop
The 1991 Samples of Anonymised Records
  • Available for the first time after research into the confidentiality risk
  • Two samples
    • Individual SAR: detailed geography (large LAs); 2% sample
    • Household SAR: hierarchical, linked individuals; detailed occupational information; 1% sample
The Request for the 2001 Individual SAR
  • Request sent in autumn 2001
  • Following consultation with users and a confidentiality assessment, we asked for similar detail to 1991, e.g.:
    • 16 categories of ethnic group (or national equivalent)
    • SOC 2000 minor (81 categories)
  • But with a 3% sample and more LADs
  • ONS had greater concerns over confidentiality
  • A more detailed ‘Controlled Access Microdata Sample’ is available in a safe setting
Safe Data
  • Subject to extensive disclosure control
    • Broad banding
    • Special uniques analysis
    • Further recodes
    • Less detail than 1991 on:
      • Geography
      • Industry/occupation
      • Age
      • Country of birth
    • Released October 2004
Second version of SARs
  • ONS reconsidered confidentiality of SARs
  • Current version of data is version 2: contains more detail than version 1
  • Users must undertake to destroy version 1 before downloading version 2
Licensed file content - geographical
  • Regional Geography
    • GOR Region PLUS
      • Inner/Outer London
      • Northern Ireland
      • Scotland
      • Wales
  • Country of birth
    • 16 categories
    • Increased from version 1
Licensed file contents: demographic
  • Age banded v.2
    • Individual year to 15
    • 16-19; 20-24; 25-29; 30-44;
    • 45-59; 60-64; 65-69;
    • 70-74; 75-94 single years; 95+
  • Ethnic group v.2
    • 16 categories (E and W)
    • 14 Scotland
    • 2 N. Ireland
Licensed file content: socio-economic
  • Occupation
    • 2000 SOC Minor categories
    • NS-SEC 38 valid categories
  • Industry
    • 15 categories A-O, P, Q
  • Hours of work – single hours to 80+
New or Improved Data
  • Improved highest qualification
    • 4 categories
  • Religion – varies considerably by nation v.2
    • 9 categories in England and Wales
    • 7 in Scotland – current only
    • 7 in Northern Ireland, plus religion brought up in
  • General health
    • Good / fairly good / not good
  • Caring
    • Hours caring, 3 bands
    • Number of carers in household
Research value
  • Ability to recode variables as wished
  • Ability to select populations and variables
  • Ability to conduct multivariate analysis
  • Learning and Teaching
  • Preliminary work before using in-house file (CAMS)
The Licence
  • All users need to be licensed
  • Academics complete the licence as part of the Census Registration System process
  • Non-academic users sign the licence as part of the data registration process
  • Cannot pass the data to an unlicensed user
  • Cannot attempt to identify an individual
The licence – good practice
  • Keep your data password protected
  • Destroy your data when you have finished using it
  • Remove SAR files before passing on your PC to someone else
  • Tell CCSR about your publications
  • Tell CCSR if you leave your institution
Access Arrangements
  • Data distributed by CCSR
  • Academics, no charge
    • Register for the data under Census Registration System
    • Access the data online from CCSR website
  • Non-academics
    • Not for profit £500 per file
    • Business users £1000 per file
    • 10 users per application, incl. software
    • Download End User Licence from web
Accessing the data
  • Non-academic users
    • Data available in NSDstat
    • Other formats available on CD
    • Can arrange direct download
  • Academic users
    • Direct download (SPSS/Stata/tab delimited)
    • Nesstar, explore online and subset (wider range of formats available)
    • NSDstat available

Working with the 2001 Licensed Individual SAR

Coverage and quality

SAR data issues

Analysing SAR data


Census coverage
  • Major effort to improve coverage in 2001
  • One Number Census
  • Use of large Census Coverage Survey to correct census results, 300K households
    • Design independent of census;
    • Used matched census and CCS data to estimate total population in each area,
    • adjusted all results for census non-response using imputation of households and individuals
    • Results in final database for UK adjusted for non-response
Census coverage
  • Coverage before imputation:
    • 94% of households returned forms, with another 4% estimated to be in households identified by enumerators.
  • Response rate lowest for
    • Young people in their early 20s (men aged 20-24 resp. rate of 87%)
    • Inner London (resp rate of 78%)
  • Once imputed cases are included, coverage is estimated to be 100%
Population base
  • One population base: usual residents
      • differs from 1991, when the user had to choose either a present or usual resident base
  • Students enumerated at term time address
  • Communal establishments are included
Implications for 2001 SARs
  • 1991 SARs selected from 10% sample
    • Did not include imputed households
    • 96% coverage
  • 2001 SARs selected from 100% ONC database
    • 94% response; 6% imputed
    • Imputed individuals/hholds are identified
    • Imputed items are flagged
Two kinds of imputation
  • Entire individual or household may be imputed as part of ONC
    • Complete records copied from enumerated individuals/hhold
    • Variable oncperim
  • Variables imputed when information missing
  • 13.7 million edit procedures undertaken
    • 28% population had 1+ items imputed
    • Common:
      • Missing prof quals set to none
      • Carer set to no where missing (unless economic activity also missing)
      • Travel to work set to ‘work mainly at/from home’ where workplace was ‘mainly at/from home’
    • Others
      • 14k people multi-ticked ‘sex’ (so imputed)
      • 6k children had marital status changed to single
  • Impossible values are set to missing and then imputed
      • Missing values are imputed on the basis of similar local cases
  • Imputation does not remove unlikely values
Item imputation

For census output database as a whole:

  • One or more items imputed for 28% of the population
  • Employment variables most affected:
    • Industry ever worked: 18%
    • Occupation ever worked: 14%
    • Workplace size: 9%
  • Under-enumerated groups are most imputed, esp. single people
Can I tell what/who has been imputed?
  • Oncperim records whether an individual has been imputed as part of the ONC
    • Copies entire record from census database
  • ‘z’ variables identify whether individual has imputed information on a specific variable
    • Parallel set of variables
    • zethew, zage0
Should I use imputed individuals or variables?
  • Imputation of individuals is designed to compensate for under-enumeration
    • Using imputed cases will give results comparable with national data
    • It will help overcome bias from non-response
  • Imputed variables are generally reported as accurate - in general we advise using imputed information
  • But doubt over imputed ethnic group
  • Simpson and Akinwale used Longitudinal Study to compare 1991 ethnic group with imputed 2001 ethnic group
      • Majority of imputed records are ‘wrong’
      • Recommend not using imputed records for minority groups

  • Percentage of SARs records with imputed ethnic group:
    • 2.5% white; 7.4% black; 11.7% mixed
  • PRAMMing (the Post-Randomisation Method, PRAM) is a perturbation technique designed to deal with very unusual cases, e.g. widowed 16-year-olds
  • Avoids additional broad-banding
  • Perturbation is constrained to
    • Preserve univariate distributions
    • Preserve multivariate distributions on control variables
    • Prevent strange results (like 5-year-old widows)
  • Affects 15 variables
    • Primary economic activity – 1% cases
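
The constraint that perturbation preserves univariate distributions can be illustrated with a toy version of PRAM. This is only a sketch of the general idea and not the procedure ONS applied to the SARs: the categories, the starting shares and the change rate `alpha` are all invented for illustration. Each record keeps its category with probability `1 - alpha` and is otherwise given a category drawn from the overall distribution, which leaves the expected shares unchanged.

```python
# Toy "invariant PRAM" sketch (illustrative only; not the ONS procedure).
# Each record keeps its category with probability 1 - alpha; otherwise it is
# given a category drawn from the overall distribution of the variable.

def transition_matrix(dist, alpha):
    """Return P where P[i][j] is the probability that category i is released as j."""
    cats = list(dist)
    return {
        i: {j: (1 - alpha) * (1.0 if i == j else 0.0) + alpha * dist[j] for j in cats}
        for i in cats
    }

def expected_distribution(dist, matrix):
    """Expected category shares after perturbation."""
    return {j: sum(dist[i] * matrix[i][j] for i in dist) for j in dist}

# Hypothetical marital-status shares, for illustration only.
shares = {"single": 0.45, "married": 0.40, "widowed": 0.15}
P = transition_matrix(shares, alpha=0.01)
after = expected_distribution(shares, P)

for cat in shares:
    print(f"{cat}: before={shares[cat]:.3f} after={after[cat]:.3f}")
```

The shares are preserved in expectation only; any single perturbed file will differ slightly from the original by chance.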
The z-variables
  • PRAMMed variables are flagged along with imputed variables
    • Cannot distinguish them
  • Imputation flags are stored in variables with z prefix
  • Two versions of the download file
    • use the larger *-impflag-*.extension version if interested in imputation/PRAMMing
General advice
  • If unsure about impact of PRAMMing and imputation
    • Do a sensitivity test
    • Use the relevant z variable to exclude cases with imputed variables, then repeat your analysis
    • Use ONCPERIM to exclude imputed individuals and repeat your analysis
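
A sensitivity test of this kind might look roughly like the sketch below. The names `oncperim` and `zethew` come from the SARs, but everything else is an assumption: the flag coding (1 meaning imputed or perturbed) should be checked against the SARs User Guide, `inactive` is a hypothetical outcome variable, and the five records are invented.

```python
# Sensitivity test sketch: repeat an estimate with and without imputed data.
# ASSUMPTION: a flag value of 1 means "imputed"; check the SARs User Guide
# for the real coding. The records and the `inactive` variable are invented.

records = [
    {"oncperim": 0, "zethew": 0, "inactive": 1},
    {"oncperim": 0, "zethew": 1, "inactive": 0},
    {"oncperim": 1, "zethew": 0, "inactive": 1},
    {"oncperim": 0, "zethew": 0, "inactive": 0},
    {"oncperim": 0, "zethew": 0, "inactive": 1},
]

def share(rows, var):
    """Proportion of rows where `var` equals 1."""
    return sum(r[var] for r in rows) / len(rows)

# Step 1: drop individuals imputed as part of the ONC.
no_imputed_people = [r for r in records if r["oncperim"] != 1]
# Step 2: additionally drop cases whose ethnic group was imputed or perturbed.
no_imputed_items = [r for r in no_imputed_people if r["zethew"] != 1]

results = {
    "all cases": share(records, "inactive"),
    "excluding imputed individuals": share(no_imputed_people, "inactive"),
    "also excluding imputed ethnic group": share(no_imputed_items, "inactive"),
}
for label, value in results.items():
    print(f"{label}: {value:.2f}")
```

If the three estimates are close, imputation and PRAMMing are unlikely to be driving the result; if they diverge, the difference is worth reporting.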
National variation
  • There is one file for the whole UK
  • Some variables are country specific:
    • Irish language
  • Other variables have national variations
    • educational qualifications
    • ethnicity
    • Watch out for the E,W,S and N suffixes!
  • Slight variation in the sampling fraction for each country:
    • 3.125 in England and Wales;
    • 3.246 in Scotland
    • 3.139 in Northern Ireland
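
Because the sampling fraction differs slightly by country, sample counts need a country-specific scaling factor before they can be read as population estimates. The sketch below assumes the figures above are percentages (consistent with a roughly 3% sample); the sample counts are invented.

```python
# Grossing-up sketch. ASSUMPTION: the sampling fractions on the slide are
# percentages (roughly a 3% sample); the sample counts below are invented.

SAMPLING_FRACTION_PCT = {
    "England and Wales": 3.125,
    "Scotland": 3.246,
    "Northern Ireland": 3.139,
}

def grossing_factor(fraction_pct):
    """Number of people each sampled record represents."""
    return 100 / fraction_pct

def estimate_population(sample_counts):
    """Scale per-country sample counts up to population estimates."""
    return {
        country: n * grossing_factor(SAMPLING_FRACTION_PCT[country])
        for country, n in sample_counts.items()
    }

hypothetical_counts = {"England and Wales": 1000, "Scotland": 1000, "Northern Ireland": 1000}
for country, est in estimate_population(hypothetical_counts).items():
    print(f"{country}: about {est:,.0f} people")
```

The same 1,000 sampled records stand for slightly different population totals in each country, which matters when pooling or comparing counts across the UK.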
Get to know the data
  • Use the documentation
  • SARs User Guide
    • Use Census schedules to check questions
    • Check univariate frequencies
    • Do exploratory analyses
    • Contact [email protected] if you can’t find the information you need in the online documentation
  • Contact [email protected] if you think there is a problem with the data
SARs as a LARGE dataset
  • 1.8 Million cases can cause trouble!
  • Use Nesstar to do initial data exploration
  • Extract a subset using NESSTAR or take a subset from the downloaded file
  • For serious analysis, using a syntax (or .do) file to record your commands makes re-running easier
    • Create a single syntax file which starts with the original data
    • Use file naming conventions that will enable you to trace versions
    • Keep a record of work done
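
One way to take a subset from the downloaded tab-delimited file without loading all 1.8 million cases into memory is to read it row by row. The sketch below is only an illustration: the column names and values are hypothetical, and a small in-memory string stands in for the real file.

```python
import csv
import io

# Stand-in for the downloaded tab-delimited file; column names and values
# are hypothetical.
raw = io.StringIO(
    "caseno\tregion\tage\tsex\n"
    "1\tSouth West\t34\tF\n"
    "2\tLondon\t52\tM\n"
    "3\tSouth West\t19\tM\n"
)

def subset(lines, keep_columns, predicate):
    """Yield only the wanted columns for rows that satisfy `predicate`."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:  # one row at a time, so memory use stays small
        if predicate(row):
            yield {col: row[col] for col in keep_columns}

south_west = list(subset(raw, ["caseno", "age"], lambda r: r["region"] == "South West"))
print(south_west)
```

With a real download, `raw` would be replaced by a file opened with `open(path, newline="")`. Keeping this step in a script gives a record of how the subset was made.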
SARs as sample data

Geographically stratified sample

  • approximates to simple random sample
  • no clustering in Individual file
  • Household file – clustering within households
  • Although the sample is large, sample sizes can be small when analysing sub-groups
  • Use standard errors and confidence intervals
Comparisons between 1991 and 2001
  • Population base changed
    • Imputation (no imputed values in 1991 SARs)
    • Students – enumerated at term-time address
    • Residents only (choice in 1991)
  • Variable continuity
    • Variable names have been changed where the variable is not exactly the same
    • Some variables (e.g. age, LLI) are easy to compare by grouping 1991 values
    • Some variables are harder to compare as the question has changed (eg qualifications)
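
Grouping age for comparison can be sketched as a recode of single year of age into the 2001 licensed-file age bands listed on the demographic content slide. This assumes the other file holds age in finer detail, such as single years; the band labels are illustrative and are not the codes used in the data.

```python
# Recode single year of age into the 2001 licensed-file age bands.
# Labels are illustrative; the real file uses its own category codes.

BANDS = [(16, 19), (20, 24), (25, 29), (30, 44), (45, 59), (60, 64), (65, 69), (70, 74)]

def age_band_2001(age):
    """Return the 2001 licensed SAR age category for a single year of age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 15 or 75 <= age <= 94:
        return str(age)  # single years are kept for these ages
    if age >= 95:
        return "95+"
    for low, high in BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"unexpected age: {age}")

print([age_band_2001(a) for a in (7, 18, 44, 80, 99)])
```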
Ethnicity 91/01
  • Different questions asked in 1991 and 2001
  • No agreed and perfect correspondence
  • Simpson and Akinwale use LS to show how 1991 maps on to 2001

Software options
  • Supported packages
    • Nesstar
    • NSDstat
    • SPSS
    • Stata
  • Other options
    • Import, or use Stat/Transfer to convert, into another package
    • Use Nesstar to save to SAS or Statistica
    • Unless you use a very small subsample, the SARs will be too big for most spreadsheets!

Looking forward: Moving forward

Controlled Access Microdata Samples

Household SARs

Small Area Microdata sample

Learning and Teaching

CAMS content
  • Controlled Access Microdata designed for professional researchers:
  • Access in safe setting only
  • Specification on SARs website
  • Individual file and Household file
Content of CAMs files
  • Files contain much more detail, e.g.
    • Individual year of age (topcoded at 95)
    • Full coding on country of birth
    • SOC Unit Group
    • Local authority geography
    • Index of Deprivation for SOAs
    • Index of Deprivation for migrants’ last address
Controlled Access
  • CAMS is managed by ONS
  • Data is accessed at London/Titchfield/Newport in Virtual Laboratory setting on a server
  • Virtual lab looks like a standard windows interface
  • Use SPSS/Stata in usual way
  • output checked for confidentiality before release
  • Further information and appropriate forms at
  • Contact [email protected] for more details
CAMS Good practice
  • Use the licensed SARs...
    • to exhaust the potential of other datasets
    • to write your syntax files
  • check the disclosure guidelines before writing your file
  • Avoid complex tables
    • small cell counts aren’t reliable
    • unique cells will usually be suppressed
  • Do use models
Household SAR
  • 1% of households and all individuals
  • Allows linkage between individuals in households
  • Similar detail to Individual SAR
    • Continuing discussion over ONS’ confidentiality concerns
      • Large households, less detail on households of 6+
  • Specification of Household SAR on website
Release of household SAR
  • Discussions are continuing over release of Household SAR
  • One possibility is a dataset with full information on age and all individuals in large households but under more tightly regulated conditions than the Individual SAR
Small Area Microdata file
  • 5% sample of individuals
  • Full range of variables
  • LA lowest geography
      • Except Isles of Scilly and City of London in E and W; similar exceptions in S and NI
    • Excludes communal establishments
    • Age 11-year bands
    • Ethnicity – 5 groups, or 16 with record swapping between LAs
    • Economic activity – 3 categories
  • Summer 2005 for delivery
Using the SARs in Learning and Teaching
  • The SARs provide an easy-to-use dataset
  • Fits well with aggregate data
  • Supported by learning and teaching materials
  • Access managed in same way:
    • use Census Registration System
    • need ATHENS (for data and CHCC)
User support
  • Web pages are regularly updated
  • Documentation online
  • Resources and links added as we go
  • Seminar invitations welcome!
  • Regional workshop invites welcome!
  • SARs Helpdesk
  • Join email and newsletter lists
  • SARs User Group – July 15th, RSS, London

Introduction to the key large-scale government surveys

What is ESDS Government?

What data are available?

Why use the data?

How do you get the data?

ESDS Government
  • One of four specialist services of ESDS. ESDS is a new national data service (since Jan 03)
    • ESDS Government
    • ESDS Longitudinal
    • ESDS Qualidata
    • ESDS International
  • ESDS Government provides access and user support for key large-scale government surveys such as Labour Force Survey and General Household Survey
  • Access remains via the UKDA

ESDS Government - some of the things we do!
  • Helpdesk
  • Survey pages incl. how to get started
  • Online guides – SPSS, STATA, Weighting, Employment Research, Health Research
  • User Group seminars (data users and data creators)
  • Publications Database
  • Derived variables: consistent over time and consistent with the Census
  • Teaching datasets
  • Training

Which surveys?
  • General Household Survey
  • Labour Force Survey
  • Family Resources Survey
  • Expenditure and Food Survey (previously the National Food Survey and Family Expenditure Survey)
  • ONS Omnibus Survey
  • National Travel Survey
  • Time Use Survey
  • British Crime Survey/Scottish Crime Survey
  • British Social Attitudes/Scottish Social Attitudes/Northern Ireland Life & Times/Young People’s Social Attitudes
  • Health Survey for England/Wales/Scotland
  • Survey of English Housing (England only)
What is the data like?
  • Survey microdata
  • Large sample sizes (but smaller than the SARs)
  • Continuous surveys – always up-to-date
  • Cross-sectional (although the LFS has a 5-quarter panel element)
  • Specialist topic surveys – more depth than the Census
  • Freely available to academics via ESDS
Quality of Data (1)
  • Two main data collectors:
    • Office for National Statistics (ONS)
    • NatCen
  • Both have considerable experience
    • ONS Social Surveys started in 1941
    • NatCen founded in 1969 (as SCPR)
  • Permanent panels of highly trained field interviewers
  • Management and Quality Checking
  • (Relatively) high response rates – but falling
  • Widespread use by secondary analysts
What would you use the data for?
  • Straightforward secondary analysis
    • To assess theoretical accounts
    • To quantify characteristics or behaviours
    • To challenge official views
    • To apply alternative definitions
  • Context to your own primary research
    • Your research could be quantitative or qualitative
    • To assess the national context of an area study
    • To assess whether your sample is typical
    • To assess the scale of behaviours
Practical research uses of the data
  • Looking at change over time
  • Look at sub-populations
  • Using the flexibility of the data to look at alternative definitions
  • Looking within households
Using successive cross-sectional data over time
  • Reasonable amount of comparability
  • Can pool years/quarters
  • Data is representative at each time point
  • Good at looking at impacts on groups

  • Limits to continuity in the data (e.g. ethnic group)
  • Cannot establish individual change
Looking at small populations
  • Only the Samples of Anonymised Records have larger sample sizes
  • Many surveys have 10,000+ respondents
    • Permits minority groups to be represented
    • For rare subpopulations the sample size may still be too small; consider combining years if appropriate
Combining datasets to increase sample size
  • Survey data is subject to sampling error!
  • Example: Pregnancy and Employment
  • Using 1998-99 General Household Survey data alone there are only 168 pregnant women aged 16-49
  • 95% confidence interval for the percentage of pregnant women who are economically inactive: 34.2 – 49.1%
  • Combined 3 years’ data to obtain sample of 465 pregnant women
  • Confidence interval using 3 years’ data: 34.9 – 43.9%
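
The narrowing of the interval can be reproduced approximately with the usual normal-approximation formula for a proportion. This is a sketch under stated assumptions: the point estimates are not given on the slide, so the midpoints of the two intervals are used, and simple random sampling is assumed, whereas a real household survey design would usually widen the intervals.

```python
import math

def proportion_ci(p, n, z=1.96):
    """Normal-approximation 95% CI for a proportion, assuming simple random sampling."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# ASSUMPTION: point estimates are the midpoints of the intervals on the slide.
one_year = proportion_ci(0.4165, 168)
three_years = proportion_ci(0.394, 465)

for label, (low, high) in (("1 year", one_year), ("3 years", three_years)):
    width = 100 * (high - low)
    print(f"{label}: {100 * low:.1f}% to {100 * high:.1f}% (width {width:.1f} points)")
```

Under these assumptions the results land close to the intervals quoted above, and the three-year interval is markedly narrower because the standard error shrinks with the square root of the sample size.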
Using the flexibility of the data to look at alternative definitions

What are ‘hours worked’?

  • Is it just paid work? Or unpaid as well?
  • Hours usually worked, or actually worked last week?
  • In main job, or in any job?
  • What about students?
  • Overtime – paid?
  • Overtime – unpaid?
  • Lunch hours?
  • Do non-workers work zero hours or should they be excluded?
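
The effect of these choices can be made concrete with a small sketch. All field names and values below are hypothetical and do not correspond to variables in any particular survey; the point is only that the same respondents give different average hours under different definitions.

```python
from dataclasses import dataclass

@dataclass
class Respondent:
    # All field names are hypothetical; real surveys use their own variables.
    usual_main: float       # usual weekly hours in main job, excluding overtime
    paid_overtime: float
    unpaid_overtime: float
    second_job: float
    in_work: bool

def hours(r, include_overtime=True, include_second_job=True):
    """Weekly hours for one respondent under the chosen definition."""
    total = r.usual_main
    if include_overtime:
        total += r.paid_overtime + r.unpaid_overtime
    if include_second_job:
        total += r.second_job
    return total

def mean_hours(people, workers_only, **definition):
    """Mean hours; non-workers are either dropped or counted as zero hours."""
    pool = [p for p in people if p.in_work] if workers_only else people
    return sum(hours(p, **definition) if p.in_work else 0.0 for p in pool) / len(pool)

sample = [
    Respondent(37.0, 3.0, 2.0, 0.0, True),
    Respondent(20.0, 0.0, 0.0, 8.0, True),
    Respondent(0.0, 0.0, 0.0, 0.0, False),
]
narrow = mean_hours(sample, workers_only=True, include_overtime=False, include_second_job=False)
broad = mean_hours(sample, workers_only=True)
with_zeros = mean_hours(sample, workers_only=False)
print(narrow, broad, with_zeros)
```

Microdata make this kind of comparison possible because the component variables are available separately, rather than pre-combined in a published table.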
Choosing a survey for research
  • Which surveys cover your main topic?
  • Which other topics are you interested in?
  • Measurement over time
  • Geography
  • Respondents – whole household, children?
  • Sample size
Using the data in teaching
  • Methods courses
    • Using the data in a hands on manner
    • Using substantive exemplars to demonstrate a methodological point
    • Using the surveys as methodological exemplars
  • Substantive courses
    • Making your point using data
    • Integrating methods into substantive courses
  • Teaching datasets
    • General Household Survey
    • Labour Force Survey
    • British Crime Survey
    • Health Survey for England
Quality of Documentation
  • Questionnaire
  • Code book of Variables
  • Description of Derived Variables
  • Definitions
  • Methodology including
    • Sampling method
    • Response achieved
    • Population base
  • Published reports
Continuous Population Survey
  • Integrates LFS, GHS, EFS, OMN, APS into single survey
    • Annual achieved sample size = 265,000 households (over half a million adults) GB
    • Common core module of questions to whole sample
    • Topic modules to portions of sample
    • Core and topic modules combined to construct a number of different viable interview combinations
  • January to March 2007 second consultation
  • April 2007 Decision
  • January 2008 Start date
CPS: Benefits & Risks

Benefits
  • Builds on harmonisation
  • Increased flexibility while maintaining comparability over time
  • A very large annual sample for core module variables, which improves precision
  • An improved, ‘unclustered’ sample design
  • Better representation at local authority district level
  • Improved weighting methodology
  • Greater coherence in official statistics, fewer ‘competing estimates’ between surveys

Risks
  • Major change to the government surveys
  • All eggs in one basket!
Accessing the data
  • Access the documentation and data online in a similar manner to the Licensed SAR data
  • ESDS data stored at UK Data Archive
  • Can be downloaded in SPSS/Stata format
  • Can be explored using Nesstar
  • Users must be registered
  • Registration is managed by the Census Registration System
  • Once registered under CRS, register a project
  • Order a dataset for a particular project