
Data Management for Social Survey Research



Presentation Transcript


  1. Data Management for Social Survey Research Training Workshop, 24-25 August 2009, Univ. Stirling Organised by the ESRC Node ‘Data Management through e-Social Science’ (www.dames.org.uk).

  2. ‘Data Management through e-Social Science’ • DAMES – www.dames.org.uk • ESRC Node funded 2008-2011 • Aim: Useful social science provisions • Specialist data topics – occupations; education qualifications; ethnicity; social care; health • Mainstream packages and accessible resources • Engage with existing provisions (e.g. ESDS; CESSDA) • Programme of case studies and provisions – more later

  3. 1. The significance of data management for social survey research Paul Lambert, 24-25 August 2009 Presented to ‘Data Management for Social Survey Research’, a workshop organised by the ESRC ‘Data Management through e-Social Science’ research Node (www.dames.org.uk).

  4. The significance of..

  5. ‘Data management’ means… • ‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’[…DAMES Node..] • Usually performed by social scientists themselves • Most overt in quantitative survey data analysis • ‘variable constructions’, ‘data manipulations’ • navigating abundance of data – thousands of variables • Usually a substantial component of the work process • Here we differentiate from archiving / controlling data itself

  6. Some components… • Manipulating data • Recoding categories / ‘operationalising’ variables • Linking data • Linking related data (e.g. longitudinal studies) • combining / enhancing data (e.g. linking micro- and macro-data) • Secure access to data • Linking data with different levels of access permission • Detailed access to micro-data cf. access restrictions • Harmonisation standards • Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’) • Recommendations on particular ‘variable constructions’ • Cleaning data • ‘missing values’; implausible responses; extreme values

  7. Example – recoding data
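The recoding example on this slide was shown as a screenshot and is not reproduced in the transcript. A minimal Stata sketch of the kind of documented recode it illustrates, using hypothetical variable names (`age`, `agegroup`):

```stata
* Recode a continuous age variable into labelled bands
* (variable and file names are hypothetical, for illustration only)
use "survey_extract.dta", clear
recode age (16/29 = 1 "16-29") (30/49 = 2 "30-49") ///
           (50/max = 3 "50+"), generate(agegroup)
label variable agegroup "Age group (banded)"
tabulate agegroup
```

Writing the recode in a do-file, rather than through menus, leaves the paper trail that slide 26 argues for.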

  8. Example – Linking data Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk
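A linkage of this kind can be sketched in the old-style (Stata 10 era) merge syntax used elsewhere in this deck; the file names below are hypothetical, with `ojbsoc00` as the occupational key and the score file derived from www.camsis.stir.ac.uk:

```stata
* Link an occupational score file to survey micro-data via ojbsoc00
* (file names are hypothetical; both files must be sorted on the key)
use "bhps_extract.dta", clear
sort ojbsoc00
merge ojbsoc00 using "camsis_soc2000.dta"
tabulate _merge        // check match quality before proceeding
drop if _merge == 2    // discard lookup-table-only records
drop _merge
```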

  9. ‘The significance of data management for social survey research’ • The data manipulations described above are a major component of the social survey research workload • Pre-release manipulations performed by distributors / archivists • Coding measures into standard categories; Dealing with missing records • Post-release manipulations performed by researchers • Re-coding measures into simple categories • All serious researchers perform extended post-release management (and have the scars to show for it) • We do have existing tools, facilities and expert experience to help us…but we don’t make a good job of using them efficiently or consistently • So the ‘significance’ of DM is about how much better research might be if we did things more effectively…

  10. Some provocative examples for the UK… • Social mobility is increasing, not decreasing • Popularity of controversial findings associated with Blanden et al (2004) • Contradicted by wider ranging datasets and/or better measures of stratification position • DM: researchers ought to be able to more easily access wider data and better variables • Degrees, MScs and PhDs are getting easier • {or at least, more people are getting such qualifications} • Correlates with measures of education are changing over time • DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn’t, but should, and could, be widespread • ‘Black-Caribbeans’ are not disappearing • As the 1948-70 immigrant cohort ages, the ‘Black-Caribbean’ group is decreasingly prominent due to return migration and social integration of immigrant descendants • Data collectors under pressure to measure large groups only • DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures

  11. The significance of..

  12. Thoughts on UK survey research • We’re data rich, e.g. thousands of studies at the UKDA • We’re statistics rich: ample analytical methods • Yet we lack social survey analysts…? • Lack of fluency in ‘handling’ or ‘preparing’ data • Contribution to be made by a methodology of ‘data management’

  13. Our own motivation (in DAMES) • DM is a big part of the research process • ..but receives limited methodological attention • Poor practice in soc. sci. DM is easily observed • Not keeping adequate records • Not linking relevant data • Not trying out relevant variable operationalisations • Even though.. • There are plenty of existing resources and standards relevant to data management activities • There are suitable software and internet facilities (e.g. Long 2009) • People are working on DM support (e.g. ESDS, DAMES)

  14. A bit of focus… • Most of the DAMES applications aim to facilitate one of two data management activities and their documentation: • Variable constructions • Coding and re-coding values • Linking datasets • Internal and external linkages

  15. A bit more focus… • The current workshop is concerned with research practices and facilities for social survey data management • To raise for discussion important topics associated with data management • To illustrate effective means of achieving good practice during data management • Software perspectives – e.g. Treiman 2009; Long 2009; Levesque 2008; Sarantakos 2007

  16. So that leads us to… • Our pragmatic interest in effective data management means we’ll concentrate on: • Stata software implementations • Documentation and replicability • Approaches to variable operationalisations and matching files in Stata

  17. Why did Stata suddenly come into this? We see Stata emerging as effective for specific tasks and compatible with generic approaches

  18. The relevance of e-Science • ‘Data management through e-Social Science’ • ‘E-Science’ refers to applying a number of particular approaches and standards from computing science to applied research areas • These approaches include ‘the Grid’; distributed computing; data and computing standardisation; metadata; security; research infrastructures • UK investment in capitalising on these developments • DAMES (2008-11) – developing services / resources using e-Science approaches which will help social scientists in undertaking data management tasks

  19. Some other selected e-Science projects (NCeSS) (concerned with accessing/handling complex data)

  20. E-Science and Data Management E-Science isn’t essential to good DM, but it has the capacity to improve and support the conduct of DM… • Concern with standards setting in communication and enhancement of data • Linking distributed/heterogeneous/dynamic data: coordinating disparate resources; interrogating live resources • Contribution of metadata tools/standards for variable harmonisation and standardisation • Linking data subject to different security levels • The workflow nature of many DM tasks

  21. E.g. of GEODE: Organising and distributing specialist data resources (on occupations)

  22. The contribution of DAMES: 8 project themes

  23. DAMES research Node • Social researchers often spend more time on data management than any other part of the research process • [Diagram: the research lifecycle, with ‘Data Management’ (supported by DAMES, ONS and ESDS) positioned between ‘Data access / collection’ (UK Data Archive, Qualidata, flagship social surveys, Office for National Statistics, administrative data, specialist academic outputs) and ‘Data Analysis’ (NCRM workshops, Essex summer school, ESRC RDI initiatives, CQeSS)]

  24. The significance of..

  25. 4 good habits and principles • 3 Challenges

  26. (a) Good habit: Keep clear records of your DM activities • Reproducible (for self) • Replicable (for all) • Paper trail for whole lifecycle • Cf. Dale 2006; Freese 2007 • In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata) • Syntax examples: www.longitudinal.stir.ac.uk

  27. Stata syntax example (‘do file’)
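The do-file shown on this slide is not reproduced in the transcript. A minimal sketch, under the assumption that it illustrated the annotated, logged style recommended on the previous slide (all paths and variable names hypothetical):

```stata
* --------------------------------------------------------------
* Purpose: derive analysis variables from an archive extract
* (hypothetical example of an annotated, replicable do-file)
* --------------------------------------------------------------
version 10                 // record the Stata version used
capture log close
log using "derive_vars.log", replace text

use "archive_extract.dta", clear

* Derivation 1: banded age (documented recode, paper trail kept)
recode age (16/29 = 1) (30/49 = 2) (50/max = 3), generate(agegroup)
label variable agegroup "Age group (banded)"

save "analysis_file.dta", replace
log close
```

Keeping the log file alongside the do-file gives the reproducible record that slide 26 calls for.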

  28. Software and handling variables – our view • Stata is the superior package for secondary survey data analysis: • Advanced data management and data analysis functionality • Supports easy evaluation of alternative measures (e.g. est store) • Culture of transparency of programming/data manipulation • Cf. Scott Long (2009) • But: Not available to all users

  29. (b) Principle: Use existing standards and previous research • Variable operationalisations Use recognised recodes / standard classifications • NSI harmonisation standards (e.g. ONS) • Cross-national standards [Hoffmeyer-Zlotnik & Wolf 2003; Harkness et al. 2003; Jowell et al. 2007] • Research reviews [e.g. Shaw et al. 2007] • Common vs. best practices (e.g. dichotomisations) Use reproducible recodes / classifications (paper trail) • Other data file manipulations • Missing data treatments • Matching data files (finding the right data)

  30. (c) Principle: Do something, not nothing • We currently put much more effort into data collection and data analysis, and neglect data manipulation • Survey research – the influence of ‘what was on the archive version’ …In my experience, a common reason why people didn’t do more DM was because they were frightened to…

  31. (d) Principle: Learn how to match files (‘deterministic’) Complex data (complex research) is distributed across different files. In surveys, use key linking variables for... • One-to-one matching SPSS: match files /file=“file1.sav” /file=“file2.sav” /by=pid. Stata: merge pid using file2.dta • One-to-many matching (‘table distribution’) SPSS: match files /file=“file1.sav” /table=“file2.sav” /by=pid. Stata: merge pid using file2.dta • Many-to-one matching (‘aggregation’) SPSS: aggregate outfile=“file3.sav” /meaninc=mean(income) /break=pid. Stata: collapse (mean) meaninc=income, by(pid) • Many-to-Many matches • Related cases matching
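In the 2009-era Stata syntax quoted above, both files must be sorted on the key variable before `merge`, and Stata creates a `_merge` variable recording match status, which should always be inspected. A minimal one-to-one sketch with hypothetical file names:

```stata
* One-to-one match of two person-level files on pid
* (old-style Stata 10 merge; file names hypothetical)
use "file2.dta", clear
sort pid
save "file2.dta", replace    // the using file must be sorted on the key
use "file1.dta", clear
sort pid
merge pid using "file2.dta"
tabulate _merge    // 1 = master only, 2 = using only, 3 = matched
```

Checking `_merge` before dropping it is the deterministic-matching equivalent of a paper trail: it documents how well the two files actually lined up.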

  32. Some challenges for data management.. (e) Agreeing about variable constructions • Unresolved debates about optimal measures and variables • Esp. in comparative research such as across time, between countries In DAMES, we have particular interests in comparability for: • Longitudinal comparability (http://www.longitudinal.stir.ac.uk/variables/) • Scaling / scoring categories to achieve ‘meaning equivalence’ or ‘specific measures’

  33. Some challenges for data management.. (f) Worrying about data security • DM activities could challenge data security • Inspecting individual cases • Multiple copies of related data files • Ability to link with other datasets ‘Hands-on’ model of data review • New and exciting data resources • have more individual information • are more likely to be released with stringent conditions • may jeopardize traditional DM approaches

  34. Some routes to secure data • Secure ‘portals’ for direct access to remote data • Secure settings (e.g. safe labs) • Data anonymisation and attenuation • Emphasis on users’ responsibility rather than the data provider’s

  35. Some challenges for data management.. (g) Incentivising documentation / replicability • There is little to press researchers to better document DM, but much to press them not to • Make DM and its documentation easier? • Reward documentation (e.g. citations)?

  36. Conclusions: Practices, services and standards …For deriving variables, handling missing data, and cleaning data… • Practices • Key, or common, features of current approaches • Services • Resources available/conceivable • Standards • Preliminary thoughts on standards setting

  37. Social survey data management practices, services and standards Currently…, • Practices are messy and painful • Lack of replication and consistency in data manipulation tasks with complex survey data • Few people relish data manipulations! • Services exist but are under-exploited • Standards are not agreed • Ignoring standards is no barrier to publication(!)

  38. (i) A brief illustrative example from the UK RAE 2008 • Research Assessment Exercise data published Dec 2008 • Raw data available online: www.rae.ac.uk • Relevant supplementary data: www.hesa.ac.uk ; www.dames.org.uk • Extended reporting on basic data by media/within HE sector, e.g. • Cambridge leads the way • Nursing raises its status • Numerous enhancements/amendments to data & analysis could be easily generated, and often lead to a different story • Lambert, P.S., & Gayle, V. (2008). Data management and standardisation: A methodological comment on using results from the UK Research Assessment Exercise 2008. University of Stirling: Technical Paper 2008-3 of the Data Management through e-Social Science Research Node (www.dames.org.uk).

  39. …Extending analysis of the 2008 RAE using data manipulations... • Deriving variables • Commonly used RAE ‘Grade point average’ • [4×(%4*) + 3×(%3*) + 2×(%2*) + 1×(%1*)] / 100 • Calculate alternative GPA measures • Standardise GPA within Units of Assessment • Rate Units of Assessment by external measures of relative ‘prestige’ • Link with 2001 standard thresholds • Other external data – e.g. Univ. typologies; RAE panel membership • LSE outranks Cambridge • Nursing ranks as the 6th least prestigious UoA of 67 • Cleaning data • Of 159 HEIs, 27 have only 1 UoA • Mean 15 UoAs per HEI, max 53 (Manchester) • The single-UoA HEIs often have outlying GPAs • Handling missing data • Not all HEI staff included within RAE; consider analysis accounting for number of excluded staff..?
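The GPA derivation and within-UoA standardisation described on this slide can each be written as a line or two of Stata; the variable names below (`pc4`-`pc1` for the published percentage profiles, `uoa` for Unit of Assessment) are hypothetical:

```stata
* RAE 2008 grade point average from the published quality profile
* pc4..pc1 = % of outputs rated 4*, 3*, 2*, 1* (hypothetical names)
generate gpa = (4*pc4 + 3*pc3 + 2*pc2 + 1*pc1) / 100

* Standardise GPA within Units of Assessment (uoa, hypothetical)
egen gpa_uoa_mean = mean(gpa), by(uoa)
egen gpa_uoa_sd   = sd(gpa),   by(uoa)
generate gpa_std  = (gpa - gpa_uoa_mean) / gpa_uoa_sd
```

The point of the slide is that this tiny amount of documented derivation work is enough to change the league-table story.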

  40. Practices: apparent trends Deriving variables, handling missing data, cleaning data • More interest in harmonisation and comparability • Longitudinal and cross-national data • Documentation challenges encourage simplifying approaches • New data and analytical opportunities • Increasing opportunities for enhancing data by linking at micro- or aggregate level • Increasing availability of routines for missing values, extreme values • Raising standards in secondary analysis of large scale surveys • Inadequacy of simple analyses which ignore multivariate relations, missing data, multiprocess systems, hierarchical structures • Data manipulations often conducted outside these considerations • Desirability of replication

  41. Services: key challenges Deriving variables, handling missing data, cleaning data • Software issues • Dominance of major proprietary database packages • Other specialist/minority packages (e.g. MLwiN) • Documentation / replication between packages..? • Data security • Few services can offer to let experts take over a dataset • Approaches to reviewing data ought to avoid inspecting cases, duplicate copies • Keeping up-to-date? • Finding data - need for search facilities [via metadata] • Updating specialist advice • E.g. in GEODE, occupational data were out of date before completion • NSIs’ strict focus on contemporary data

  42. Standards: key requirements Deriving variables, handling missing data, cleaning data • Need for documentation for replication • Detailed accounts of process • Citation of sources • DAMES – to facilitate with metadata and process tools • Resolving some difficult debates • Approaches to comparative research (measurement equivalence vs. meaning equivalence) • Necessary standards for analysis/reporting on missing data • Appropriate approaches to extreme values, e.g. robust regressions

  43. Aside – e-Science and data management compared to the Industrial revolution… • Landes (1969) The Unbound Prometheus • Knowledge-based revolution • Importance of standardising technology for cooperation (not just creating it) • Importance of having access to underlying materials – coal, cotton, etc. • Uneven development (nationally) Landes, D. S. (1969). The Unbound Prometheus: Technological Change and Industrial Development in Western Europe from 1750 to the Present. Cambridge: Cambridge University Press.

  44. References • Blanden, J., Goodman, A., Gregg, P., & Machin, S. (2004). Changes in generational mobility in Britain. In M. Corak (Ed.), Generational Income Mobility in North America and Europe. Cambridge: Cambridge University Press. • Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158. • Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2). • Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. New York: Wiley. • Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers. • Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage. • Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS 16.0: A Guide for SPSS and SAS Users. Chicago, IL: SPSS Inc. • Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press. • Sarantakos, S. (2007). A Tool Kit for Quantitative Data Analysis Using SPSS. London: Palgrave Macmillan. • Shaw, M., Galobardes, B., Lawlor, D. A., Lynch, J., Wheeler, B., & Davey Smith, G. (2007). The Handbook of Inequality and Socioeconomic Position: Concepts and Measures. Bristol: Policy Press. • Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey-Bass. • University of Essex, & Institute for Social and Economic Research. (2009). British Household Panel Survey: Waves 1-17, 1991-2008 [computer file], 5th Edition. Colchester, Essex: UK Data Archive [distributor], March 2009, SN 5151.
