SC968: Panel Data Methods for Sociologists



Introduction to survival/event history models

Types of outcome
  • Continuous: OLS (linear regression)
  • Binary: binary regression (logistic or probit regression)
  • Time to event data: survival or event history analysis

Examples of time to event data
  • Time to death
  • Time to incidence of disease
  • Unemployed: time until finding a job
  • Time to birth of first child
  • Smokers: time until quitting smoking
Time to event data
  • A finite set of discrete states
  • Units (individuals, firms, households, etc.) are each in one state
  • Transitions between states
  • Time until a transition takes place
4 key concepts for survival analysis
  • States
  • Events
  • Risk period
  • Duration/time
  • States are categories of the outcome variable of interest
  • Each unit (ex.: person, household etc.) occupies exactly one state at any moment in time
  • Examples
    • alive, dead
    • single, married, divorced, widowed
    • never smoker, smoker, ex-smoker
    • employed, unemployed, inactive
  • The set of possible states is called the state space
  • Event = a transition from one state to another
  • From an origin state to a destination state
  • Examples
    • From smoker to ex-smoker
    • From married to widowed
  • Not all transitions may be possible
    • E.g. from smoker to never smoker
Risk period
  • 2 states: A & B
  • Event: transition from A → B
  • To undergo this transition, one must be in state A (someone already in state B cannot make it)
  • Not all individuals will be in state A at any given time
  • Example
    • can only experience divorce if married
  • The period of time that someone is at risk of a particular event is called the risk period
  • All subjects at risk of an event at a given point in time are called the risk set
  • Time can have various meanings...
  • Calendar time
  • ...but the onset of risk is usually not simultaneous for all units
    • Ex: by age 40, some individuals will have smoked for 20+ years, others for 1 year
  • Duration=time since onset of risk
    • divorce: time since getting married
    • finding a job: time since becoming unemployed
    • death: time since being born
  • Event history analysis: analyzing the length of a duration, i.e. the length of time between the onset of risk and the occurrence of an event
  • Examples
    • Duration of marriage
    • Length of life
  • In practice we model the probability of a transition conditional on being in the risk set
Calendar time

[Figure: timeline with a calendar-time axis running from 1991 to 2009 in 3-year steps]

  • Ideally: observe each individual from the onset of risk until the event has occurred
  • ...very demanding in terms of data collection
    • (ex: risk of death starts when one is born)
  • Usually the data are incomplete → censoring
  • An observation is censored if it has incomplete information, so that its duration cannot be measured accurately
  • Types of censoring
    • Right censoring
    • Left censoring
  • Right censoring: the person did not experience the event during the time that they were studied
  • Common reasons for right censoring
    • the study ends
    • the person drops out of the study
  • We do not know when the person experiences the event but we do know that it is later than a given time T
  • Left censoring: the person became at risk before we started observing her
    • If we do not know when the person entered the risk set: EHA cannot deal with this
    • If we know when the person entered the risk set: condition on the person having survived long enough to enter the study
  • Censoring must be independent of the survival process!
Study time in years

[Figure: timeline with a study-time axis running from 0 to 18 years in 3-year steps]

Why a special set of methods?
  • Duration is a continuous variable, so why not OLS?
  • Censoring
    • If censored cases are excluded → longer durations are more likely to be thrown out
    • If censored cases are treated as complete → duration is mis-measured
  • Non-normality of residuals
  • Time-varying covariates
  • Interested in the probability of a transition at any given time rather than in the length of complete spells
  • Need to simultaneously take into account:
    • Whether the event has taken place or not
    • The length of the period at risk before the event occurred
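The two censoring problems listed above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the spell lengths are drawn from an exponential distribution (constant hazard), and the mean spell length, study length, sample size and seed are all hypothetical values chosen for illustration, not figures from the course data.

```python
import random

random.seed(968)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 10.0    # hypothetical mean spell length, in years
STUDY_LENGTH = 8.0  # hypothetical follow-up; longer spells are right-censored
N = 10_000          # hypothetical sample size

# True spell lengths (assumption: exponential, i.e. a constant hazard)
true_durations = [random.expovariate(1 / TRUE_MEAN) for _ in range(N)]

# What the study actually records: time under observation and an event flag
observed = [min(t, STUDY_LENGTH) for t in true_durations]
event = [1 if t <= STUDY_LENGTH else 0 for t in true_durations]

# Option 1: exclude censored cases (throws out the longest spells)
completed = [t for t, e in zip(observed, event) if e == 1]
mean_excluding_censored = sum(completed) / len(completed)

# Option 2: treat censored spells as if they were complete
mean_treating_as_complete = sum(observed) / N

# Using both pieces of information: events divided by total time at risk.
# Under the constant-hazard assumption, 1 / rate estimates the mean duration.
rate = sum(event) / sum(observed)
mean_from_rate = 1 / rate

print(f"excluding censored:   {mean_excluding_censored:.2f}")
print(f"treating as complete: {mean_treating_as_complete:.2f}")
print(f"from events / time at risk: {mean_from_rate:.2f}")
```

Both naive summaries understate the true mean, while the estimate that uses the event indicator together with the time at risk stays close to it.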
Survival function
  • Length of time (duration) before an event occurs (length of the 'spell'): T
    • probability density function (pdf): f(t)

$$f(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T \le t + \Delta t)}{\Delta t} = \frac{dF(t)}{dt}$$

    • cumulative distribution function (cdf): F(t)

$$F(t) = \Pr(T \le t) = \int_0^t f(u)\,du$$

  • Survival function:
    • S(t) = 1 - F(t)
Survival function

[Figure: three plots, each with Duration (T) on the horizontal axis]

Hazard rate
  • h(t) = f(t) / S(t)
  • The exact definition & interpretation of h(t) differs depending on whether:
    • duration is continuous
    • duration is discrete
  • Conditional on having survived up to t, what is the probability of leaving between t and t+Δt?
  • It is a measure of risk intensity
  • h(t) >= 0
  • In principle h(t) is a rate, not a probability
  • There is a 1-1 relationship between h(t), f(t), F(t), S(t)
  • EHA:
    • h(t) = g(t, Xs)
    • g = parametric & semi-parametric specifications
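The 1-1 relationship between h(t), f(t), F(t) and S(t) can be written out explicitly. The following is a sketch for the continuous-time case only, assuming S(0) = 1 and a differentiable S(t); the integration variable u is introduced here for illustration.

```latex
\begin{aligned}
h(t) &= \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\ln S(t) \\
S(t) &= \exp\!\left(-\int_0^t h(u)\,du\right) \\
F(t) &= 1 - S(t), \qquad f(t) = h(t)\,S(t)
\end{aligned}
```

For example, a constant hazard h(t) = λ gives S(t) = exp(-λt), so knowing any one of the four functions is enough to recover the other three.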
Survival or event history data are characterised by 2 variables:
  • Time or duration of risk period
  • Failure (event)
    • 1 if not survived or event observed
    • 0 if censored or event not yet occurred

Data structure differs depending on whether:
  • Duration is discrete
  • Duration is continuous

Assume: 2 states; 1 transition; no repeated events

Data structure-Discrete time

t records (1 for each unit of time the person is at risk)

Data structure-Discrete time
  • Each row is a person-period
  • An individual has as many rows as the number of periods he or she is observed to be at risk
  • No longer at risk when
    • Experienced event
    • No longer under observation (censored)
  • For each period (row) there is a value of the explanatory variable X → very easy to incorporate time-varying covariates
  • Stata: reshape long
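The person-period layout described above can be sketched in a few lines of Python. This is an illustration rather than the course's own procedure (the slides use Stata's reshape long): the function name to_person_period, the field names and the two example people are all hypothetical.

```python
def to_person_period(persons):
    """Expand one record per person into one record per person-period.

    Each input record needs: id, duration (whole periods at risk) and
    event (1 if the spell ended in the event, 0 if it was censored).
    """
    rows = []
    for person in persons:
        for period in range(1, person["duration"] + 1):
            is_last_period = period == person["duration"]
            rows.append({
                "id": person["id"],
                "period": period,
                # The event can only be recorded in the final period at risk
                "event": 1 if (is_last_period and person["event"] == 1) else 0,
            })
    return rows


# Hypothetical input: person 1 has the event in period 3,
# person 2 is censored after 2 periods.
persons = [
    {"id": 1, "duration": 3, "event": 1},
    {"id": 2, "duration": 2, "event": 0},
]
person_periods = to_person_period(persons)
for row in person_periods:
    print(row)
```

A time-varying covariate would simply be one more field on each row, holding its value in that period.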
Data structure-continuous time

ID | Entry      | Died       | End date   | Duration | Event | X
1  | 01/01/1991 |            | 01/01/2008 | 17.0     | 0     |
2  | 01/01/1991 | 01/01/2002 | 01/01/2002 | 11.0     | 1     | 0
3  | 01/01/1995 |            | 01/01/2000 | 5.0      | 0     | 0
3  | 01/01/2000 | 01/01/2005 | 01/01/2005 | 5.0      | 1     | 1
Data structure-continuous time
  • Each row is a person
  • Indicator for observed events/censored cases
  • Calculate duration = exit date - entry date
  • Exit date =
    • Failure date
    • Censoring date
  • If there are time-varying covariates:
    • Split the period an individual is under observation each time a time-varying X changes
    • If there are many Xs that change often: multiple rows per person
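The duration calculation (exit date minus entry date) can be sketched as follows. The entry and end dates are the ones in the continuous-time table; the function name duration_years, the dd/mm/yyyy reading of the dates and the 365.25-day year are assumptions made for this illustration.

```python
from datetime import datetime


def duration_years(entry, exit_, fmt="%d/%m/%Y"):
    """Years between entry and exit dates (assumption: 365.25 days per year)."""
    days = (datetime.strptime(exit_, fmt) - datetime.strptime(entry, fmt)).days
    return days / 365.25


# (id, entry date, exit date, event) for each row of the table;
# the exit date is the failure date if event == 1, else the censoring date.
spells = [
    (1, "01/01/1991", "01/01/2008", 0),
    (2, "01/01/1991", "01/01/2002", 1),
    (3, "01/01/1995", "01/01/2000", 0),
    (3, "01/01/2000", "01/01/2005", 1),
]

durations = [round(duration_years(entry, end), 1) for _, entry, end, _ in spells]
print(durations)
```

Person 3 appears twice because the observation period is split at the point where the time-varying X changes, giving two spells of 5 years each.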
Worked example
  • Random 20% sample from BHPS
  • Waves 1 – 15
  • One record per person/wave
  • Outcome: Duration of cohabitation
  • Condition on cohabiting
  • Survival time: years from starting cohabitation until the first year living without a partner
The data
  • Duration = 5 years, Event = 1
  • Ignore data after event = 1

The data (continued)
  • Note missing waves before event

Preparing the data
  • Select records after onset of risk
  • Generate new duration variable
  • Declare that you want to set the data to survival time
  • Important to check that you have set data as intended


Checking the data setup
  • 1 if observation is to be used and 0 otherwise
  • time of entry
  • 1 if event, 0 if censoring or event not yet occurred
  • time of exit


Checking the data setup
  • How do we know when this person divorced?

Checking the new data setup
  • Now censored instead of an event

Summarising time to event data
  • Individuals followed up for different lengths of time
  • So can’t use prevalence rates (% people who have an event)
  • Use rates instead that take account of person years at risk
    • Incidence rate per year
    • Death rate per 1000 person years
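A rate that takes account of person-years at risk is the number of events divided by the total follow-up time. The sketch below uses made-up follow-up times and event flags for four people, and the function name incidence_rate is hypothetical; it is not the BHPS sample from the worked example.

```python
def incidence_rate(n_events, person_years, per=1):
    """Events per `per` person-years at risk."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return n_events / person_years * per


# Hypothetical data: years of follow-up and event flag for four people
follow_up_years = [5.0, 2.5, 10.0, 7.5]
event_flags = [1, 0, 1, 0]

rate_per_year = incidence_rate(sum(event_flags), sum(follow_up_years))
rate_per_1000 = incidence_rate(sum(event_flags), sum(follow_up_years), per=1000)

print(f"{rate_per_year:.3f} events per person-year")
print(f"{rate_per_1000:.1f} events per 1000 person-years")
```

Here 2 events over 25 person-years give 0.08 events per person-year, whereas the simple proportion with an event (50%) ignores how long each person was followed.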
Summarising time to event data
  • Number of observations
  • Less than 50% of the sample has experienced the event by the end of the study
  • Rate per year

List the cumulative hazard function

Default is the survivor function

Kaplan-Meier graphs
  • Can read off the estimated probability of surviving a relationship at any time point on the graph
    • E.g. at 5 years 88% are still cohabiting
  • The survival probability only changes when an event occurs
  • So the graph is stepped and not a smooth curve
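The stepped shape follows directly from how the Kaplan-Meier estimate is built: at each event time the current survival probability is multiplied by the share of the risk set that did not have the event. A minimal sketch in plain Python, with a hypothetical function name and made-up data (six spells, two of them censored):

```python
def kaplan_meier(durations, events):
    """Return [(time, estimated survival probability)] at each event time.

    events[i] is 1 if spell i ended in the event and 0 if it was censored.
    Spells censored at time t are counted as still at risk at time t.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everyone whose spell ends (event or censoring) at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # The estimate only changes when an event occurs
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


# Hypothetical data: durations in years and event flags
durations = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(durations, events)
for t, s in km_curve:
    print(t, round(s, 4))
```

Censored spells never produce a step themselves; they only shrink the risk set used at later event times.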
Testing equality of survival curves among groups

The log-rank test

A non-parametric test that assesses the null hypothesis that there are no differences in survival times between groups
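For two groups, the log-rank statistic compares the observed number of events in one group with the number expected if both groups shared the same hazard at every event time. The sketch below is a plain-Python illustration with a hypothetical function name and made-up data; it assumes groups are coded 0/1 and uses the usual chi-squared approximation with 1 degree of freedom.

```python
import math


def logrank_test(durations, events, groups):
    """Two-group log-rank test. Returns (chi-squared statistic, p-value)."""
    records = list(zip(durations, events, groups))
    event_times = sorted({t for t, e, _ in records if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for d, _, _ in records if d >= t)               # risk set
        n1 = sum(1 for d, _, g in records if d >= t and g == 1)   # group 1 at risk
        deaths = sum(1 for d, e, _ in records if d == t and e == 1)
        deaths1 = sum(1 for d, e, g in records if d == t and e == 1 and g == 1)
        # Expected events in group 1 if the two groups had equal hazards
        observed_minus_expected += deaths1 - deaths * n1 / n
        if n > 1:
            variance += (deaths * (n1 / n) * (1 - n1 / n)
                         * (n - deaths) / (n - 1))
    if variance == 0:
        raise ValueError("no event time has both groups at risk")
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical data: group 0 has all the short spells, group 1 the long ones
durations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
events = [1] * 10
groups = [0] * 5 + [1] * 5
chi2, p_value = logrank_test(durations, events, groups)
print(round(chi2, 2), round(p_value, 4))
```

A small p-value leads to rejecting the null hypothesis of no difference in survival between the groups.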

Log-rank test example
  • Significant difference between men and women


Cox regression model

Event History with Cox Model

  • Also known as the Cox proportional hazard model
  • No longer modelling the duration
  • Modelling the hazard rate
  • Hazard rate
    • h(t) = f(t) / S(t)
    • conditional on having survived up to t, what is the probability of leaving between t and t+Δt?
Some hazard shapes
  • Increasing:
    • as time elapses, more likely to experience the event
    • Ex: onset of Alzheimer's
  • Decreasing
    • as time elapses, less likely to experience the event
    • Ex: Survival after surgery
  • U-shaped
    • the hazard rate is highest when duration is low or high
    • Ex: age specific mortality
  • Constant
    • hazard rate does not change with time
    • Ex: time till next email arrives
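Three of these shapes can be produced by a single parametric family. The sketch below uses the Weibull hazard purely as an illustration; the slides do not name a particular distribution, and the shape and scale values are hypothetical. A U-shaped hazard cannot be obtained from a Weibull and would need a more flexible specification.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Weibull hazard: h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


times = [0.5, 1.0, 2.0, 4.0]

increasing = [weibull_hazard(t, shape=1.5) for t in times]  # shape > 1
decreasing = [weibull_hazard(t, shape=0.5) for t in times]  # shape < 1
constant = [weibull_hazard(t, shape=1.0) for t in times]    # shape = 1

for label, values in [("increasing", increasing),
                      ("decreasing", decreasing),
                      ("constant", constant)]:
    print(label, [round(v, 3) for v in values])
```

With shape = 1 the Weibull reduces to the exponential model, whose hazard does not change with time.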
Cox regression model
  • Regression model for survival analysis
  • Can model time invariant and time varying explanatory variables
  • Produces estimated hazard ratios
    • (sometimes called rate ratios or risk ratios)
  • Regression coefficients are on a log scale
    • Exponentiate to get hazard ratio
    • Similar to odds ratios from logistic models
  • Model is semi-parametric:
    • Does not estimate how the hazard rate changes with time
    • Estimates the effect of co-variates in shifting a baseline hazard rate
Cox regression equation (i)

$$h_i(t) = h_0(t)\,\exp(b_1 x_{i1} + b_2 x_{i2} + \dots + b_k x_{ik})$$

  • $h_i(t)$ is the hazard function for individual i
  • $h_0(t)$ is the baseline hazard function and can take any form
  • $x_{i1}, \dots, x_{ik}$ are the covariates
  • $b_1, \dots, b_k$ are the regression coefficients estimated from the data
  • One important assumption: the effect of covariates does not change with time (proportional hazards)
Cox regression equation (ii)
  • If we divide both sides of the equation on the previous slide by h0(t) and take logarithms, we obtain:

$$\ln\left(\frac{h_i(t)}{h_0(t)}\right) = b_1 x_{i1} + b_2 x_{i2} + \dots + b_k x_{ik}$$

  • We call h(t) / h0(t) the hazard ratio
  • The coefficients are estimated by Cox regression, and can be interpreted in a similar manner to those of multiple logistic regression
  • exp(b_j) is the instantaneous relative risk of an event associated with a one-unit increase in covariate j
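The proportional hazards structure can be checked numerically: whatever form the baseline hazard takes, the ratio of the hazards for two individuals who differ by one unit on a covariate equals exp(b) at every time point. In the sketch below the baseline hazard function is entirely hypothetical, and the coefficient is simply chosen so that exp(b) = 0.9 for illustration.

```python
import math


def baseline_hazard(t):
    """Hypothetical baseline hazard h0(t); any non-negative function would do."""
    return 0.02 * math.sqrt(t)


def hazard(t, x, b):
    """Cox-type hazard with a single covariate: h(t) = h0(t) * exp(b * x)."""
    return baseline_hazard(t) * math.exp(b * x)


b = math.log(0.9)  # illustrative coefficient, so that exp(b) = 0.9
times = [1, 5, 10, 15]

# Hazard ratio for x = 1 versus x = 0 at several time points
ratios = [hazard(t, 1, b) / hazard(t, 0, b) for t in times]
print([round(r, 3) for r in ratios])
```

The hazards themselves rise with t in this example, but their ratio does not move, which is exactly what the proportional hazards assumption requires.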
Cox regression assumptions
  • Assumption of proportional hazards
  • No censoring patterns (i.e. censoring process independent of the survival process)
  • No left censoring
  • True starting time (onset of risk unambiguous)
  • Plus assumptions for all modelling
    • Sufficient sample size, proper model specification, independent observations, exogenous covariates, no high multicollinearity, random sampling, and so on
Proportional hazards assumption
  • The ratio of the hazard functions of two groups is constant over time
  • Cox estimates of the βs = change in the hazard rate relative to the baseline hazard (= hazard function of the reference group)
  • If a covariate fails this assumption
    • for hazard ratios that increase over time for that covariate, relative risk is overestimated
    • for ratios that decrease over time, relative risk is underestimated
    • standard errors are incorrect and significance tests lose power
Cox regression in Stata
  • Will first model a time invariant covariate (sex) on risk of partnership ending
  • Then will add a time dependent covariate (age) to the model
Interpreting output from Cox regression
  • Hazard ratios (exp(β))
    • The effect of the covariate on the hazard rate relative to the baseline hazard
    • Baseline hazard function not estimated → no intercept
    • In our example, the baseline hazard is the hazard for sex = 1 (male)
  • The hazard ratio is the ratio of the hazards for a unit change in the covariate
    • HR = 0.9 for women vs. men
    • The risk of partnership breakdown is 10% lower for women than for men
  • Hazard ratio assumed constant over time
    • At any time point, the hazard of partnership breakdown for a woman is 0.9 times the hazard for a man
Interpreting output from Cox regression (ii)
  • The hazard rate may change with time but the hazard ratio is constant
  • So if we know that the probability of a man having a partnership breakdown in the following year is 1.5%, then the probability of a woman having a partnership breakdown in the following year is approximately

0.015 × 0.9 = 0.0135, i.e. 1.35%

  • The (instantaneous) hazard rate cannot be derived from Cox estimates alone
    • Need to estimate the baseline hazard function
  • But... can calculate the probability of experiencing the event first = (hazard ratio) / (1 + hazard ratio)
  • So in our example, a HR of 0.9 corresponds to a probability of 0.47 that a woman will experience a partnership breakdown first
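The arithmetic on this slide can be reproduced directly. The hazard ratio (0.9) and the yearly probability for men (1.5%) come from the example above; the function name probability_event_first is hypothetical.

```python
def probability_event_first(hazard_ratio):
    """Probability that the comparison group has the event before the reference group."""
    return hazard_ratio / (1 + hazard_ratio)


HAZARD_RATIO = 0.9          # women vs. men, from the worked example
MEN_NEXT_YEAR_PROB = 0.015  # 1.5% for men, from the worked example

# Approximate probability for women in the following year
women_next_year_prob = MEN_NEXT_YEAR_PROB * HAZARD_RATIO

# Probability that a woman experiences a partnership breakdown first
p_woman_first = probability_event_first(HAZARD_RATIO)

print(f"{women_next_year_prob:.4f}")  # 0.0135
print(f"{p_woman_first:.2f}")         # 0.47
```

A hazard ratio of exactly 1 would give a probability of 0.5, i.e. neither group is more likely to experience the event first.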

Time dependent covariates
  • Examples
    • Current age group rather than age at baseline
    • GHQ score may change over time and predict break-ups
  • Will use age to predict duration of cohabitation
    • Nonlinear relationship hypothesised
    • Recode age into 8 equally spaced age groups
Testing the proportional hazards assumption
  • Graphical methods
    • Comparison of Kaplan-Meier observed & predicted curves by group. Observed lines should be close to predicted
    • Survival probability plots (cumulative survival against time for each group). Lines should not cross
    • Log minus log plots (minus log cumulative hazard against log survival time). Lines should be parallel
Testing the proportional hazards assumption
  • Formal tests of proportional hazard assumption
    • Include an interaction between the covariate and a function of time. Log time often used but could be any function. If significant then assumption violated
    • Test the proportional hazards assumption on the basis of partial residuals. Type of residual known as Schoenfeld residuals.
When assumptions are not met
  • If categorical covariate, include the variable as a strata variable
    • Allows underlying hazard function to differ between categories and be non proportional
    • Estimates separate underlying baseline hazard for each stratum
When assumptions are not met
  • If a continuous covariate
    • Consider splitting the follow-up time. For example, hazard may be proportional within first 5 years, next 5-10 years and so on
    • Could covariate be included as time dependent covariate?
    • Consider another (parametric) model
Censoring assumptions
  • Censored cases must be independent of the survival distribution. There should be no pattern to these cases, which instead should be missing at random.
  • If censoring is not independent, then censoring is said to be informative
  • You have to judge this for yourself
    • Usually don’t have any data that can be used to test the assumption
    • Think carefully about start and end dates
    • Always check a sample of records
True starting time
  • The ideal model for survival analysis would be where there is a true zero time
  • If the zero point is arbitrary or ambiguous, the data series will be different depending on starting point. The computed hazard rate coefficients could differ, sometimes markedly
  • Conduct a sensitivity analysis to see how coefficients may change according to different starting points
Other extensions to survival analysis
  • Repeated events
  • Multi-state models (more than 1 event type), competing risks
    • Transition from employment to unemployment or leaving labour market
    • Modelling type of exit from cohabiting relationship- separation/divorce/widowhood
Could you use logistic regression instead?
  • May produce similar results for short or fixed follow-up periods
    • Examples
      • everyone followed-up for 7 years
      • maximum follow-up 5 years
  • Results may differ if there are varying follow-up times
  • If dates of entry and dates of events are available then better to use Cox regression
  • This is just an introduction to survival/ event history analysis
  • Only reviewed the Cox regression model
    • Also parametric survival methods
    • But Cox regression likely to suit type of analyses of interest to sociologists
  • Consider an intensive course if you want to use survival analysis in your own work