A SECONDARY ANALYSIS OF DATA MID CAREER FELLOWSHIP

Gopalakrishnan Netuveli, Imperial College London
1 Jan 2007 – 31 March 2008
Leeds, 22 March 2007

Treating longitudinal data longitudinally
• The objective of this fellowship is to gain experience and proficiency in the secondary analysis of longitudinal data sets.
• Motivation
  • Large amount of resources spent on collecting longitudinal data
  • Inadequate utilisation of the “longitudinality” of the data

The substantive research question
The objectives of the fellowship will be pursued by investigating the inter-relationship of trajectories of employment status and health in the British Household Panel Survey (BHPS). The English Longitudinal Study of Ageing will also be used. In this presentation I present results from the first two months of work.

Complexity of longitudinal data Structure of longitudinal data - like an arrow in flight, which ‘is paradoxically at one stage while also pursuing its path to the target’. Challenge - resolve that paradox by utilising the whole information included in the stage and the track of the longitudinal variable.

Importance of trajectories
• In the data, trajectories are represented as an array of time-indexed variables.
• Every point in the trajectory contains information about
  • the current value at the point of measurement
  • the direction
  • how that point was reached
  • where it might go.
Longitudinal data are underused if only the magnitudes of the time-indexed variables are used, without taking into account the possible interrelationships between them.

Examples of research meeting the challenge
Mainland European demographers used linked information from Norwegian national decennial censuses to construct social trajectories through the life course. They used state-order-time models to predict mortality (Wunsch et al. 1996):
State model: $\operatorname{logit} P(Y = 1) = d + \sum_i (a_i A_i + b_i B_i + c_i C_i)$
Order model: $\operatorname{logit} P(Y = 1) = a + \sum_i b_i O_i$

Combining states and orders: sequences
The order of states can be expressed as sequences, which can represent longitudinal trajectories. Sequences are clustered according to their similarity, either in all-to-all comparisons or in comparisons against ideal types. Cluster membership is then used as a dependent or independent variable in analyses. A method becoming popular for this purpose is optimal matching.

Optimal matching: a short primer
A measure of dissimilarity between two sequences is d = (L1 + L2) - 2*LCS, where L1 and L2 are the total lengths of the first and second sequences and LCS is the length of the longest common subsequence they share.
e.g. LONDON and LEEDS: L1 = 6; L2 = 5; LCS (LD) = 2; d = 11 - 4 = 7
The matching is optimal when d has the smallest possible value, which depends on identifying the longest possible LCS.
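The LCS-based dissimilarity above can be sketched in a few lines of Python. This is a minimal illustration, not the Stata routine used in the fellowship: the function names `lcs_length` and `lcs_distance` are hypothetical, and full optimal matching with separate substitution and insertion/deletion costs is not covered.

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence of s1 and s2."""
    # prev[j] is the LCS length of the part of s1 seen so far and s2[:j].
    prev = [0] * (len(s2) + 1)
    for x in s1:
        curr = [0]
        for j, y in enumerate(s2, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(s1, s2):
    """Dissimilarity d = (L1 + L2) - 2 * LCS."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


print(lcs_distance("LONDON", "LEEDS"))  # 7, as in the worked example
```

The dynamic-programming table is kept one row at a time, so memory use grows with the length of the second sequence only.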

A special case
If sequences are made of two states and are of equal length (L):
1. AAAAAA = 111111, Σ1 = 6 = L
2. ABAABA = 101101, Σ2 = 4
3. BBAABB = 001100, Σ3 = 2
Σ is the LCS with sequence 1.
d1→2 = (6+6) - 2*4 = 4 = 2*(L - Σ2)
d1→3 = (6+6) - 2*2 = 8 = 2*(L - Σ3)
d2→3 cannot be extrapolated from these relations.
d is often standardised by dividing by L (or by the longer length when the lengths are unequal).
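For two-state sequences compared against an all-ones reference, the shortcut d = 2*(L - Σ) needs no matching step at all. The sketch below assumes sequences are coded as lists of 0/1 of equal length; `distance_to_reference` and `standardised_distance` are illustrative names, not commands from the original analysis.

```python
def distance_to_reference(seq, reference_state=1):
    """d = 2 * (L - sum), where sum counts positions equal to the reference state."""
    length = len(seq)
    matches = sum(1 for s in seq if s == reference_state)
    return 2 * (length - matches)


def standardised_distance(seq, reference_state=1):
    """Distance divided by the sequence length L (assumes a non-empty sequence)."""
    return distance_to_reference(seq, reference_state) / len(seq)


sequences = {
    "AAAAAA": [1, 1, 1, 1, 1, 1],
    "ABAABA": [1, 0, 1, 1, 0, 1],
    "BBAABB": [0, 0, 1, 1, 0, 0],
}
for label, seq in sequences.items():
    print(label, distance_to_reference(seq), round(standardised_distance(seq), 3))
```

With this standardisation the distance runs from 0 (identical to the reference) to 2 (no position in the reference state).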

Developing methods to compare trajectories of two different variables: progress to date
I used BHPS waves 1 to 14. Two trajectories were selected:
• People in the labour force (1 = in LF)
• People with limiting illness in the previous 12 months (0 = no limiting illness)
Reference sequences were being in the labour force for all waves, and no limiting illness for all waves.
The hypothesis was that people who are ill will not be working.
The research question is: how do trajectories of labour force participation vary within a given pattern of illness?

Methods
Data were restricted to those who had information on both variables in all waves. Stata commands were used to match sequences and produce the standardised distances against the reference sequences. As there are only 14 waves, there were only 14 discrete values for distances (few enough to look at each value separately, but enough to treat the distance as continuous).

Method to describe a pattern graphically
Traditionally, area charts are used to describe patterns. Disadvantage: uncertainty at each time point in the pattern is not reflected.
The information content of a distribution of states at a time point can be calculated using Shannon’s information measure. The information at a position, R, is computed from the frequencies fa, where a is the state (0, 1) and fa is the frequency of state a.
This method is based on ‘sequence logos’, used to describe genetic sequences (Schneider, 1999).
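The formula on the slide did not survive extraction, so the sketch below uses the standard sequence-logo form as an assumption: R = log2(S) - H, where S is the number of possible states and H = -Σ fa log2(fa) is the Shannon entropy at that position. Any small-sample correction is left out, and `information_content` is a hypothetical name.

```python
import math
from collections import Counter


def information_content(states, n_states=2):
    """Information (in bits) at one time point: log2(n_states) minus Shannon entropy.

    Assumes `states` is a non-empty list of observed states at that time point.
    """
    counts = Counter(states)
    n = len(states)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(n_states) - entropy


# One wave where everyone is in the same state, one where states are evenly split.
print(information_content([0, 0, 0, 0]))
print(information_content([0, 1, 0, 1]))
```

Under this form a position where everyone shares one state carries the maximum of 1 bit, and an even split between the two states carries none, which is what lets a logo show uncertainty that an area chart hides.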

Method to analyse variations in patterns
To study variations in the distribution of patterns I used Gini coefficients. The Gini coefficient can be decomposed into between-group, within-group and overlap components. It makes no distributional assumptions, except that the variable should be monotonically increasing. The similarity of this procedure to ANOVA has led to it being called ANoGi (Frick et al. 2004).
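As a reminder of what is being decomposed, the overall Gini coefficient of a set of non-negative values can be computed from their ranks. This is a minimal sketch of the coefficient only; it does not implement the between-group, within-group and overlap decomposition, and the function name `gini` is an assumption.

```python
def gini(values):
    """Gini coefficient of non-negative values, using the sorted-rank formula.

    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending
    and ranks i starting at 1. Returns 0.0 for empty or all-zero input.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([1, 1, 1, 1]))  # 0.0: perfectly equal
print(gini([0, 0, 0, 1]))  # 0.75: all of the total held by one unit
```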

Results
Sample size: 4796
Sequences with only one state:
• Limiting illness: 2924 (60.9%)
• In labour force: 2477 (51.7%)
Number of episodes:

Limiting illness: distribution of patterns according to distance from reference pattern (No illness in all waves)

Limiting illness patterns at distance 1: Traditional graphic representation

Limiting illness patterns at distance 1: Information theoretic (sequence logo) representation
Legend: L = limiting illness; N = no limitations

Pattern for labour force participation: whole sample
Legend: E = in labour force; N = not in labour force

Pattern for labour force participation: in those with no limiting illness
Legend: E = in labour force; N = not in labour force

Pattern for labour force participation: in those with limiting illness in half the waves
Legend: E = in labour force; N = not in labour force

Pattern for labour force participation: in those with limiting illness in all waves
Legend: E = in labour force; N = not in labour force

Analysis of Gini: Patterns of employment grouped by patterns of limiting ill health

Relationship of patterns of labour force participation and patterns of limiting illness
• r = 0.43

In conclusion…
Work in progress.
Need to explore:
• more complex patterns and full optimal matching
• other methods
Fellowship mentored by:
• Professor David Blane, Imperial College
• Professor Mel Bartley, UCL
• Professor Richard Wiggins, City University
• Professor Nicky Best, Imperial College
