Data Management for Longitudinal Data

Longitudinal Studies Seminars: Longitudinal Analyses Using STATA
Stirling University, 20.7.04
Data and Variable Management
Paul Lambert
20.7.04: LSS


The nature of ‘large and complex’ longitudinal resources: complicating the variable by case (VxC) matrix

Large and complex = complexity in:
• Multiple hierarchies of measurement
• Array of variables / operationalisations
• Relations between / subgroups of cases
• Multiple points of measurement
  • Balanced or unbalanced repeated contacts
  • Censored duration data
• Sample collection and weighting

i) Multiple hierarchies (levels) of measurement
• Common examples:
  • Both individuals and households
  • Schools and pupils
  • People and local districts and regions
• Solutions:
  • Separate VxC matrix for each level, eg BHPS
  • Merged VxC matrix at lowest level

Illustration: Hierarchical dataset

ii) Array of variables
• Vast number of variable responses, eg 1,000+
• Recoding multiplies these up, eg dummies
• Multiple response variables (‘all that apply’)
• Categorisations / indexes (eg occupations)
• Implication:
  • Either separate files for separate variable groups
  • Or very long and difficult files…
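As an illustration of how recoding multiplies variables up, the sketch below turns one categorical variable into a set of dummies. The data and the variable names (pid, jobclass, job_) are invented for this example, not taken from the seminar files.

```stata
* Toy data: jobclass is a hypothetical 3-category variable
clear
input pid jobclass
1 1
2 2
3 3
4 2
end

* tabulate with generate() creates one 0/1 indicator per category:
* job_1, job_2 and job_3 are added to the file
tabulate jobclass, generate(job_)

describe job_*
```

One variable has become four, which is why large surveys recoded this way quickly produce very wide files.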

iii) Relations between cases
• All respondents in a household
• Husbands and wives both sampled
• Fellow school pupils sampled
• Longitudinal: differing relations with others at different times
• Outcomes:
  • Link information between related cases

iv) Multiple measurement points
• Longitudinal: information on same cases for multiple time points
• Panel or cohort: several records via repeated contact for each individual
  • Problems of ‘unbalanced’ panels
• Life history / retrospective:
  • Durations in spells: multistate / multiepisode, overlapping spells; time varying covariates
  • Left or right censoring of durations in spells
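A minimal sketch of how an unbalanced panel can be spotted in practice. The data and variable names (pid, wave, income) are invented, and the assumption that three waves were fielded is made only for this example.

```stata
* Toy long-format panel: person 2 misses wave 2, person 3 is seen once
clear
input pid wave income
1 1 100
1 2 110
1 3 120
2 1 90
2 3 95
3 2 70
end

* Declare the panel structure (pid = person, wave = time point)
xtset pid wave

* Summarise the participation patterns across waves
xtdescribe

* Count the records each person contributes
bysort pid: generate n_waves = _N

* Flag people observed at all 3 waves (3 is an assumption of this example)
generate byte balanced = (n_waves == 3)
tabulate n_waves
```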

v) Sample collection / weighting
• Multistage cluster sampling is particularly popular
• Sample may have been clustered, stratified
• Longitudinal: uneven inclusion of cases over time
• Sample weights are designed to correct for this, but:
  • Complex in application
  • Not suited to all applications
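A hedged sketch of declaring a clustered, stratified, weighted design in Stata. All data and variable names (psu, stratum, wt, income) are made up for illustration, and the `svy:` prefix syntax assumes Stata 9 or later; earlier releases used different commands.

```stata
* Toy sample: 2 strata, 2 clusters (PSUs) per stratum, unequal weights
clear
input pid psu stratum wt income
1 1 1 1.5 100
2 1 1 1.5 120
3 2 1 2.0 90
4 2 1 2.0 80
5 3 2 1.0 150
6 3 2 1.0 140
7 4 2 2.5 60
8 4 2 2.5 70
end

* Declare the design once: clusters, sampling weights, strata
svyset psu [pweight = wt], strata(stratum)

* Estimators prefixed with svy: then take the design into account
svy: mean income
```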

STATA data management examples: see datmanag_part1.do
Claim: For data management, STATA is powerful, but not always well designed
• Batch files / interactive syntax / programs
• Data entry / browsing
• Variable labels
• Computing / recoding
• Missing values
• Weighting data
• Survey estimators (svy)
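The sketch below touches three of the tasks listed: labelling, missing values and recoding. It is not taken from datmanag_part1.do; the data, the variable names and the use of -9 as a missing-value code are assumptions made for this example.

```stata
* Toy data: -9 stands in for a survey 'missing' code (an assumption)
clear
input pid sex age
1 1 34
2 2 -9
3 2 51
4 1 17
end

* Variable and value labels
label variable sex "Sex of respondent"
label define sexlbl 1 "Male" 2 "Female"
label values sex sexlbl

* Convert the numeric missing code to Stata's system missing (.)
mvdecode age, mv(-9)

* Recode age into bands, keeping the original variable untouched
recode age (min/17 = 1 "Under 18") (18/44 = 2 "18-44") ///
    (45/max = 3 "45+"), generate(ageband)

tabulate ageband sex, missing
```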

Typology of longitudinal data files
• 3 sets of contrasts:
  • Repeated cross-section / Panel / Cohort / Event History / Time Series
  • Wide vs Long
  • Discrete vs Continuous time
See datmanag_part2.do

Contrast 1, Type A: Repeated cross-sectional data

Contrast 1, Type B: Panel dataset (unbalanced)

Contrast 1, Type C: Event history data analysis
• Alternative data sources:
  • Panel / cohort (more reliable)
  • Retrospective (cheaper, but recall errors)
• Aka: ‘Survival data analysis’; ‘Failure time analysis’; ‘hazards’; ‘risks’; ..
Focus shifts to length of time in a ‘state’: analyses the determinants of time in state

Key to event histories is ‘state space’

Contrast 1, Type D: Time series data
Note: exact equivalence to panel data format
Examples:
• Unemployment rates by year in UK
• University entrance rates by year by country
Statistical summary of one particular concept, collected at repeated time points from one or more subjects

Contrast 2: ‘Wide’ versus ‘Long’ format
Relevant to all types of dataset:
• ‘Wide’ = 1 case per record (person), additional vars for time points:
    Person 1  Sex  YoB  Var1_92  Var1_93  Var1_94 …
    Person 2  …
• ‘Long’ = 1 case per time point within person (as panel data example)
• STATA: ‘reshape’ command allows transfer between the two formats
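A small sketch of the round trip between the two formats with `reshape`. The toy data follow the layout on the slide (Var1_92 to Var1_94); the values themselves are invented.

```stata
* Wide format: one record per person, one var1_ column per year
clear
input pid sex yob var1_92 var1_93 var1_94
1 1 1960 10 12 15
2 2 1971 20 .  25
end

* Wide -> long: the stub var1_ is stacked, the suffix becomes 'year'
reshape long var1_, i(pid) j(year)
list, sepby(pid)

* Long -> wide: with no arguments, reshape reverses the last conversion
reshape wide
list
```

In the long file each person contributes three records (year = 92, 93, 94), with sex and yob repeated on every record.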

Contrast 3: Continuous vs Discrete time
Primarily in terms of event history datasets
• Continuous time (‘spell files’, ‘event oriented’)
  • One episode per case (record); time spent in the episode is a variable
• Discrete time
  • One record per time unit within each episode; type of event and event occurrence as variables
• Analyses: most packages can handle either format comfortably
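One way to see the difference is to convert a spell file into discrete-time records. The sketch below assumes invented data where dur is the spell length in whole time units and event is 1 if the spell ended in the event and 0 if it was right censored.

```stata
* Continuous-time (spell) format: one record per episode.
* This is the layout that -stset dur, failure(event)- expects.
clear
input pid dur event
1 3 1
2 2 0
3 4 1
end

* Discrete-time format: one record per time unit at risk
expand dur
bysort pid: generate t = _n

* The event is recorded only in the last period, and only if uncensored
bysort pid: generate byte y = (_n == _N) & (event == 1)

list pid t y, sepby(pid)
```

The three spells become nine records; the censored spell (pid 2) contributes only zeros on y.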


Matching files
• Complex data inevitably involves more than one related data file
• A vital data analysis skill!
• Link data between files by connecting them according to key linking variable(s)
  • Eg, ‘person identifier’ variable ‘pid’
  • Eg: http://iserwww.essex.ac.uk/bhps/doc/
See datmanag_part3.do

Types of file matching
• Case-to-case matching
  • One-to-one link, eg two files with different sets of variables for same people
  • STATA: merge (append instead stacks files that hold different cases)
• Table distribution
  • One-to-many link, eg one file has individuals, another has households, and match household info to the individuals
  • STATA: merge
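A minimal sketch of table distribution: household information is spread across the individuals who share each household. File contents and variable names (hid, region) are invented, and the `merge m:1` syntax assumes Stata 11 or later; a case-to-case match would use `merge 1:1 pid` in the same way.

```stata
* Household-level file: one record per household
clear
input hid region
10 1
20 2
end
tempfile hh
save `hh'

* Individual-level file: several people can share a household
clear
input pid hid age
1 10 34
2 10 36
3 20 51
end

* Many individuals to one household record, linked on hid
merge m:1 hid using `hh'
list, sepby(hid)
```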

Types of file matching ctd
• Aggregating
  • Summarise over multiple cases then link summaries back to cases
  • STATA: collapse
• Related cases matching
  • Link info from one related case to another case, eg info on spouse put on own case
  • STATA: merge or joinby
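Both ideas can be sketched on one toy file. The data and the variable names (hid, sppid for the spouse's identifier, hh_income, sp_income) are assumptions made for this example, and the merge syntax assumes Stata 11 or later.

```stata
* Toy individual file: persons 1 and 2 are spouses, person 3 has none
clear
input pid hid sppid income
1 10 2 100
2 10 1 300
3 20 . 250
end
tempfile indiv
save `indiv'

* Aggregating: one record per household holding mean income
collapse (mean) hh_income = income, by(hid)
tempfile hhsum
save `hhsum'

* Related cases: a lookup file keyed on the spouse's identifier
use `indiv', clear
keep pid income
rename pid sppid
rename income sp_income
tempfile spouse
save `spouse'

* Link the household summary and the spouse's income back to each case
use `indiv', clear
merge m:1 hid using `hhsum', nogenerate
merge m:1 sppid using `spouse', keep(master match) nogenerate
list
```

The renaming step is the key design choice: each person's own record is relabelled so that it can be matched against someone else's sppid.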

STATA file matching crib:
_merge = indicator of cases present for:
  1 = Master file but not input file
  2 = Input file but not Master file
  3 = Master and input file
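A short sketch that produces all three _merge codes. The two toy files are invented; Stata's own documentation calls the input file the "using" file, and the `merge 1:1` syntax assumes Stata 11 or later.

```stata
* Input ('using') file: persons 2, 3 and 4
clear
input pid income
2 300
3 250
4 400
end
tempfile inputfile
save `inputfile'

* Master file: persons 1, 2 and 3
clear
input pid age
1 34
2 36
3 51
end

merge 1:1 pid using `inputfile'

* pid 1 -> _merge 1, pid 4 -> _merge 2, pids 2 and 3 -> _merge 3
tabulate _merge
list, sepby(_merge)
```

Tabulating _merge straight after every match is a cheap check that the link behaved as intended before any cases are dropped.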
