Discrete and Categorical Data

1 / 253

# Discrete and Categorical Data - PowerPoint PPT Presentation

Discrete and Categorical Data. William N. Evans Department of Economics/MPRC University of Maryland. Part I. Introduction. Introduction. Workhorse statistical model in social sciences is the multivariate regression model Ordinary least squares (OLS)

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' Discrete and Categorical Data' - dyami

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Discrete and Categorical Data

William N. Evans

Department of Economics/MPRC

University of Maryland

### Part I

Introduction

Introduction
• Workhorse statistical model in social sciences is the multivariate regression model
• Ordinary least squares (OLS)
• yi = β0 + x1iβ1+ x2iβ2+… xkiβk+ εi
• yi = xi β + εi
Linear model yi =  + xi + i
•  and  are “population” values – represent the true relationship between x and y
• Unfortunately – these values are unknown
• The job of the researcher is to estimate these values
• Notice that if we differentiate y with respect to x, we obtain
• dy/dx = 
• Increase in income for more education
• Change in crime or bankruptcy when slots are legalized
• Increase in test score if you study more
Put some concretenesson the problem
• State of Maryland budget problems
• Drop in revenues
• Expensive k-12 school spending initiatives
• Short-term solution – raise tax on cigarettes by 34 cents/pack
• Problem – a tax hike will reduce consumption of taxable product
• Question for state – as taxes are raised, how much will cigarette consumption fall?
Simple model: yi =  + xi + i
• Suppose y is a state’s per capita consumption of cigarettes
• x represents taxes on cigarettes
• Question – how much will y fall if x is increased by 34 cents/pack?
• Problem – many reasons why people smoke – cost is but one of them –
Data
• (Y) State per capita cigarette consumption for the years 1980-1997
• (X) tax (State + Federal) in real cents per pack
• “Scatter plot” of the data
• Negative covariance between variables
• When x>, more likely that y<
• When x<, more likely that y>
• Goal: pick values of  and  that “best fit” the data
• Define best fit in a moment
Notation
• True model
• yi =  + xi + i
• We observe data points (yi,xi)
• The parameters  and  are unknown
• The actual error (i)is unknown
• Estimated model
• (a,b) are estimates for the parameters (,)
• ei is an estimate of i where
• ei=yi-a-bxi
• How do you estimate a and b?
Objective: Minimize sum of squared errors
• Min iei2 = i(yi – a – bxi)2
• Minimize sum of squared errors (SSE)
• Treat (+) and (-) errors equally
• Over or under predict by “5” is the same magnitude of error
• The optimal value for a and b are those that make the 1st derivative equal zero
• Functions reach min or max values when derivatives are zero
The model has a lot of nice features
• Statistical properties easy to establish
• Optimal estimates easy to obtain
• Parameter estimates are easy to interpret
• Model maximizes prediction
• If you minimize SSE you maximize R2
• The model does well as a first order approximation to lots of problems
Discrete and Qualitative Data
• The OLS model work well when y is a continuous variable
• Income, wages, test scores, weight, GDP
• Does not has as many nice properties when y is not continuous
• Example: doctor visits
• Integer values
• Low counts for most people
• Mass of observations at zero
Downside of forcing non-standard outcomes into OLS world?
• Can predict outside the allowable range
• e.g., negative MD visits
• Does not describe the data generating process well
• e.g., mass of observations at zero
• Violates many properties of OLS
• e.g. heteroskedasticity
This talk
• Look at situations when the data generating process does not lend itself well to OLS models
• Mathematically describe the data generating process
• Show how we use different optimization procedure to obtain estimates
• Describe the statistical properties
Show how to interpret parameters
• Illustrate how to estimate the models with popular program STATA
Types of data generating processes we will consider
• Dichotomous events (yes or no)
• 1=yes, 0=no
• Graduate high school? work? Are obese? Smoke?
• Ordinal data
• Self reported health (fair, poor, good, excel)
• Strongly disagree, disagree, agree, strongly agree
Count data
• Doctor visits, lost workdays, fatality counts
• Duration data
• Time to failure, time to death, time to re-employment
Recommended Textbooks
• Jeffrey Wooldridge, “Econometric analysis of cross sectional and panel data”
• Lots of insight and mathematical/statistical detail
• Very good examples
• William Greene, “Econometric Analysis”
• more topics
• Somewhat dated examples
Course web page
• www.bsos.umd.edu/econ/evans/jpsm.html
• Contains
• These notes
• All STATA programs and data sets
• A couple of “Introduction to STATA” handouts
• Links to some useful web sites
STATA Resources Discrete Outcomes
• “Regression Models for Categorical Dependent Variables Using STATA”
• J. Scott Long and Jeremy Freese
• Available for sale from STATA website for \$52 (www.stata.com)
• Post-estimation subroutines that translate results
• Do not need to buy the book to use the subroutines
In STATA command line type
• net search spost
• One is

• Click on the link and install the files

### Part II

A brief introduction to STATA

STATA
• Very fast, convenient, well-documented, cheap and flexible statistical package
• Excellent for cross-section/panel data projects, not as great for time series but getting better
• Not as easy to manipulate large data sets from flat files as SAS
• I usually clean data in SAS, estimate models in STATA
Key characteristic of STATA
• All data must be loaded into RAM
• Computations are very fast
• But, size of the project is limited by available memory
• Results can be generated two different ways
• Command line
• Write a program, (*.do) then submit from the command line
Sample program to get you started
• cps87_or.do
• Program gets you to the point where can
• Construct new variables
• Get simple statistics
• Run a basic regression
• Store the results on a disk
Data (cps87_do.dta)
• Random sample of data from 1987 Current Population Survey outgoing rotation group
• Sample selection
• Males
• 21-64
• Working 30+hours/week
• 19,906 observations
Major caveat
• Hardest thing to learn/do: get data from some other source and get it into STATA data set
• We skip over that part
• All the data sets are loaded into a STATA data file that can be called by saying:

use data file name

Housekeeping at the top of the program
• * this line defines the semicolon as the ;
• * end of line delimiter;
• # delimit ;
• * set memork for 10 meg;
• set memory 10m;
• * write results to a log file;
• * the replace options writes over old;
• * log files;
• log using cps87_or.log,replace;
• * open stata data set;
• use c:\bill\stata\cps87_or;
• * list variables and labels in data set;
• desc;
------------------------------------------------------------------------------------------------------------------------------------------------------------
• > -
• storage display value
• variable name type format label variable label
• ------------------------------------------------------------------------------
• > -
• age float %9.0g age in years
• race float %9.0g 1=white, non-hisp, 2=place,
• n.h, 3=hisp
• educ float %9.0g years of education
• unionm float %9.0g 1=union member, 2=otherwise
• smsa float %9.0g 1=live in 19 largest smsa,
• 2=other smsa, 3=non smsa
• region float %9.0g 1=east, 2=midwest, 3=south,
• 4=west
• earnwke float %9.0g usual weekly earnings
• ------------------------------------------------------------------------------
Constructing new variables
• Use ‘gen’ command for generate new variables
• Syntax
• gen new variable name=math statement
• Easily construct new variables via
• Algebraic operations
• Math/trig functions (ln, exp, etc.)
• Logical operators (when true, =1, when false, =0)
From program
• * generate new variables;
• * lines 1-2 illustrate basic math functoins;
• * lines 3-4 line illustrate logical operators;
• * line 5 illustrate the OR statement;
• * line 6 illustrates the AND statement;
• * after you construct new variables, compress the data again;
• gen age2=age*age;
• gen earnwkl=ln(earnwke);
• gen union=unionm==1;
• gen topcode=earnwke==999;
• gen nonwhite=((race==2)|(race==3));
• gen big_ne=((region==1)&(smsa==1));
Getting basic statistics
• desc -- describes variables in the data set
• sum – gets summary statistics
• tab – produces frequencies (tables) of discrete variables
From program
• * get descriptive statistics;
• sum;
• * get detailed descriptics for continuous variables;
• sum earnwke, detail;
• * get frequencies of discrete variables;
• tabulate unionm;
• tabulate race;
• * get two-way table of frequencies;
• tabulate region smsa, row column cell;
Results from sum
• Variable | Obs Mean Std. Dev. Min Max
• -------------+--------------------------------------------------------
• age | 19906 37.96619 11.15348 21 64
• race | 19906 1.199136 .525493 1 3
• educ | 19906 13.16126 2.795234 0 18
• unionm | 19906 1.769065 .4214418 1 2
• smsa | 19906 1.908369 .7955814 1 3
• -------------+--------------------------------------------------------
Detailed summary
• usual weekly earnings
• -------------------------------------------------------------
• Percentiles Smallest
• 1% 128 60
• 5% 178 60
• 10% 210 60 Obs 19906
• 25% 300 63 Sum of Wgt. 19906
• 50% 449 Mean 488.264
• Largest Std. Dev. 236.4713
• 75% 615 999
• 90% 865 999 Variance 55918.7
• 95% 999 999 Skewness .668646
• 99% 999 999 Kurtosis 2.632356
Results for tab
• 1=union |
• member, |
• 2=otherwise | Freq. Percent Cum.
• ------------+-----------------------------------
• 1 | 4,597 23.09 23.09
• 2 | 15,309 76.91 100.00
• ------------+-----------------------------------
• Total | 19,906 100.00
2x2 Table
• 1=east, |
• 2=midwest, | 1=live in 19 largest smsa,
• 3=south, | 2=other smsa, 3=non smsa
• 4=west | 1 2 3 | Total
• -----------+---------------------------------+----------
• 1 | 2,806 1,349 842 | 4,997
• | 56.15 27.00 16.85 | 100.00
• | 38.46 18.89 15.39 | 25.10
• | 14.10 6.78 4.23 | 25.10
• -----------+---------------------------------+----------
• 2 | 1,501 1,742 1,592 | 4,835
• | 31.04 36.03 32.93 | 100.00
• | 20.58 24.40 29.10 | 24.29
• | 7.54 8.75 8.00 | 24.29
• -----------+---------------------------------+----------
• 3 | 1,501 2,542 1,904 | 5,947
• | 25.24 42.74 32.02 | 100.00
• | 20.58 35.60 34.80 | 29.88
• | 7.54 12.77 9.56 | 29.88
• -----------+---------------------------------+----------
• 4 | 1,487 1,507 1,133 | 4,127
• | 36.03 36.52 27.45 | 100.00
• | 20.38 21.11 20.71 | 20.73
• | 7.47 7.57 5.69 | 20.73
• -----------+---------------------------------+----------
• Total | 7,295 7,140 5,471 | 19,906
• | 36.65 35.87 27.48 | 100.00
• | 100.00 100.00 100.00 | 100.00
• | 36.65 35.87 27.48 | 100.00
Running a regression
• Syntax

reg dependent-variable independent-variables

• Example from program

*run simple regression;

reg earnwkl age age2 educ nonwhite union;

Source | SS df MS Number of obs = 19906
• -------------+------------------------------ F( 5, 19900) = 1775.70
• Model | 1616.39963 5 323.279927 Prob > F = 0.0000
• Residual | 3622.93905 19900 .182057239 R-squared = 0.3085
• -------------+------------------------------ Adj R-squared = 0.3083
• Total | 5239.33869 19905 .263217216 Root MSE = .42668
• ------------------------------------------------------------------------------
• earnwkl | Coef. Std. Err. t P>|t| [95% Conf. Interval]
• -------------+----------------------------------------------------------------
• age | .0679808 .0020033 33.93 0.000 .0640542 .0719075
• age2 | -.0006778 .0000245 -27.69 0.000 -.0007258 -.0006299
• educ | .069219 .0011256 61.50 0.000 .0670127 .0714252
• nonwhite | -.1716133 .0089118 -19.26 0.000 -.1890812 -.1541453
• union | .1301547 .0072923 17.85 0.000 .1158613 .1444481
• _cons | 3.630805 .0394126 92.12 0.000 3.553553 3.708057
• ------------------------------------------------------------------------------
Analysis of variance
• R2 = .3085
• Variables explain 31% of the variation in log weekly earnings
• F(5,19900)
• Tests the hypothesis that all covariates (except constant) are jointly zero
Interpret results
• Y = β0 + β1Xi + εi
• dY/dX = β1
• But in this case Y=ln(W) where W weekly wages
• dln(W)/dX = (dW/W)/dX = β1
• Percentage change in wages given a change in x
• Non whites earn 17.2% less than whites
• Union members earn 13% more than nonunion members

### Part III

Continuous Distributions
• Random variables with infinite number of possible values
• Examples -- units of measure (time, weight, distance)
• Many discrete outcomes can be treated as continuous, e.g., SAT scores
How to describe a continuous random variable
• The Probability Density Function (PDF)
• The PDF for a random variable x is defined as f(x), where

f(x) \$ 0

If(x)dx = 1

• Calculus review: The integral of a function gives the “area under the curve”
Cumulative Distribution Function (CDF)
• Suppose x is a “measure” like distance or time
• 0 # x # 4
• We may be interested in the Pr(x#a) ?
CDF

What if we consider all values?

Properties of CDF
• Note that Pr(x # b) + Pr(x>b) =1
• Pr(x>b) = 1 – Pr(x # b)
• Many times, it is easier to work with compliments
General notation for continuous distributions
• The PDF is described by lower case such as f(x)
• The CDF is defined as upper case such as F(a)
Standard Normal Distribution
• Most frequently used continuous distribution
• Symmetric “bell-shaped” distribution
• As we will show, the normal has useful properties
• Many variables we observe in the real world look normally distributed.
• Can translate normal into ‘standard normal’
Examples of variables that look normally distributed
• IQ scores
• SAT scores
• Heights of females
• Average gestation (weeks of pregnancy)
• As we will show in a few weeks – sample means are normally distributed!!!
Standard Normal Distribution
• PDF:
• For -# z #
Notation
• (z) is the standard normal PDF evaluated at z
• [a] = Pr(z  a)
Standard Normal
• Notice that:
• Normal is symmetric: (a) = (-a)
• Normal is “unimodal”
• Median=mean
• Area under curve=1
• Almost all area is between (-3,3)
• Evaluations of the CDF are done with
• Statistical functions (excel, SAS, etc)
• Tables
Standard Normal CDF
• Pr(z  -0.98) = [-0.98] = 0.1635
Pr(0.1  z  1.9)

= Pr(z  1.9) – Pr(z  0.1)

= M(1.9) - M(0.1) = 0.9713 - 0.5398

= 0.4315

Important Properties of Normal Distribution
• Pr(z  A) = [A]
• Pr(z > A) = 1 - [A]
• Pr(z  - A) = [-A]
• Pr(z > -A) = 1 - [-A] = [A]

### Section IV

Maximum likelihood estimation

Maximum likelihood estimation
• Observe n independent outcomes, all drawn from the same distribution
• (y1, y2, y3….yn)
• yi is drawn from f(yi; θ) where θ is an unknown parameter for the PDF f
• Recall definition of indepedence. If a and b and independent, Prob(a and b) = Pr(a)Pr(B)
Because all the draws are independent, the probability these particular n values of Y would be drawn at random is called the ‘likelihood function’ and it equals
• L = Pr(y1)Pr(y2)…Pr(yn)
• L = f(y1; θ)f(y2; θ)…..f(y3; θ)
MLE: pick a value for θ that best represents the chance these n values of y would have been generated randomly
• To maximize L, maximize a monotonic function of L
• Recall ln(abcd)=ln(a)+ln(b)+ln(c)+ln(d)
Max L = ln(L) = ln[f(y1; θ)] +ln[f(y2; θ)] +

….. ln[f(yn; θ) = Σi ln[f(yi; θ)]

• Pick θ so that L is maximized
• dL/dθ = 0

L

θ

θ1

θ2

Example: Poisson
• Suppose y measures ‘counts’ such as doctor visits.
• yi is drawn from a Poisson distribution
• f(yi;λ) =e-λλyi/yi! For λ>0
• E[yi]= Var[yi] = λ
Given n observations, (y1, y2, y3….yn)
• Pick value of λ that maximizes L
• Max L =Σi ln[f(yi; θ)] = Σi ln[e-λλyi/yi!]

= Σi [– λ + yiln(λ) – ln(yi!)]

= -n λ + ln(λ) Σi yi – Σi ln(yi!)

L = -n λ + ln(λ) Σi yi – Σi ln(yi!)
• dL/dθ = -n + (1/ λ )Σi yi = 0
• Solve for λ
• λ = Σi yi /n =  = sample mean of y
In most cases however, cannot find a ‘closed form’ solution for the parameter in ln[f(yi; θ)]
• Must ‘search’ over all possible solutions
• How does the search work?
• Calculate dL/dθ
• If dL/dθ < 0, decreasing θ will increase L so we decrease θ some
• Keep changing θ until dL/dθ = 0
• How far you ‘step’ when you change θ is determined by a number of different factors

L

dL/dθ > 0

θ1

θ

L

dL/dθ < 0

θ3

θ

Properties of MLE estimates
• Sometimes call efficient estimation. Can never generate a smaller variance than one obtained by MLE
• Parameters estimates are distributed as a normal distribution when samples sizes are large
• Therefore, if we divide the parameter by its standard error, should be normally distributed with a mean zero and variance 1 if the null (=0) is correct

### Section 5

Dichotomous outcomes

Dichotomous Data
• Suppose data is discrete but there are only 2 outcomes
• Examples
• Graduate high school or not
• Patient dies or not
• Working or not
• Smoker or not
• In data, yi=1 if yes, yi =0 if no
How to model the data generating process?
• There are only two outcomes
• Research question: What factors impact whether the event occurs?
• To answer, will model the probability the outcome occurs
• Pr(Yi=1) when yi=1 or
• Pr(Yi=0) = 1- Pr(Yi=1) when yi=0
Think of the problem from a MLE perspective
• Likelihood for i’th observation
• Li= Pr(Yi=1)Yi [1 - Pr(Yi=1)](1-Yi)
• When yi=1, only relevant part is Pr(Yi=1)
• When yi=0, only relevant part is [1 - Pr(Yi=1)]
L = Σi ln[Li] =

= Σi {yi ln[Pr(yi=1)] + (1-yi)ln[Pr(yi=0)] }

• Notice that up to this point, the model is generic. The log likelihood function will determined by the assumptions concerning how we determine Pr(yi=1)
Modeling the probability
• There is some process (biological, social, decision theoretic, etc) that determines the outcome y
• Some of the variables impacting are observed, some are not
• Requires that we model how these factors impact the probabilities
• Model from a ‘latent variable’ perspective
Consider a women’s decision to work
• yi* = the person’s net benefit to work
• Two components of yi*
• Characteristics that we can measure
• Education, age, income of spouse, prices of child care
• Some we cannot measure
• How much you like spending time with your kids
• how much you like/hate your job
We aggregate these two components into one equation
• yi* = β0 + x1iβ1+ x2iβ2+… xkiβk+ εi

= xi β + εi

• xi β (measurable characteristics but with uncertain weights)
• εi random unmeasured characteristics
• Decision rule: person will work if yi* > 0

(if net benefits are positive)

yi=1 if yi*>0

yi=0 if yi*≤0

yi=1 if yi*>0
• yi* = xi β + εi > 0 only if
• εi > - xi β
• yi=0 if yi*≤0
• yi* = xi β + εi ≤ 0 only if
• εi ≤ - xi β
How to interpret ε?
• When we look at certain people, we have expectations about whether y should equal 1 or 0
• These expectations do not always hold true
• The error ε represents deviations from what we expect
• Go back to the work example, suppose xi β is ‘big.’ We observe a woman with:
• High wages
• Low husband’s income
• Low cost of child care
We would expect this person to work, UNLESS, there is some unmeasured variable that counteracts this
• For example:
• Suppose a mom really likes spending time with her kids, or she hates her job.
• The unmeasured benefit of working is then a big negative coefficient εi
If we observe them working, there are a certain range of values that εi must have been in excess of
• yi=1 if εi > - xi β
• If we observe someone not working, then Consider the opposite. Suppose we observe someone NOT working.
• Then εi must not have been big or it was a bigger negative number, since
• yi=0 if εi ≤ - xi β
The Probabilities
• The estimation procedure used is determined by the assumed distribution of ε
• What is the probability we observe someone with y=1?
• Use definition of the CDF
• Pr(yi=1) = Pr(yi*>0) = Pr(εi > - xi β)

= 1 – F(-xi β)

What is the probability we observe someone with y=0?
• Use definition of the CDF
• Pr(yi=0) = Pr(yi*≤ 0) = Pr(εi ≤ - xi β) = F(-xi β)
• Two standard models: ε is either
• normal or
• logistic
Normal (probit) Model
• ε is distributed as a standard normal
• Mean zero
• Variance 1
• Evaluate probability (y=1)
• Pr(yi=1) = Pr(εi > - xi β) = 1 – Ф(-xi β)
• Given symmetry: 1 – Ф(-xi β) = Ф(xi β)
• Evaluate probability (y=0)
• Pr(yi=0) = Pr(εi ≤ - xi β) = Ф(-xi β)
• Given symmetry: Ф(-xi β) = 1 - Ф(xi β)
Summary
• Pr(yi=1) = Ф(xi β)
• Pr(yi=0) = 1 -Ф(xi β)
• Notice that Ф(a) is increasing a. Therefore, is one of the x’s increases the probability of observing y, we would expect the coefficient on that variable to be (+)
The standard normal assumption (variance=1) is not critical
• In practice, the variance may be not equal t 1, but given the math of the problem, we cannot identify the variance. It is absorbed into parameter estimates
Logit
• CDF: F(a) = exp(a)/(1+exp(a))
• Symmetric, unimodal distribution
• Looks a lot like the normal
• Incredibly easy to evaluate the CDF and PDF
• Mean of zero, variance > 1 (more variance than normal)
• Evaluate probability (y=1)
• Pr(yi=1) = Pr(εi > - xi β) = 1 – F(-xi β)
• Given symmetry: 1 – F(-xi β) = F(xi β)
• F(xi β) = exp(xi β)/(1+exp(xi β))
Evaluate probability (y=0)
• Pr(yi=0) = Pr(εi ≤ - xi β) = F(-xi β)
• Given symmetry: F(-xi β) = 1 - F(xi β)
• 1 - F(xi β) = 1 /(1+exp(xi β))
• When εi is a logistic distribution
• Pr(yi =1) = exp(xi β)/(1+exp(xi β))
• Pr(yi=0) = 1/(1+exp(xi β))
Example: Workplace smoking bans
• Smoking supplements to 1991 and 1993 National Health Interview Survey
• Asked all respondents whether they currently smoke
• Sample: workers
• Key variables: current smoking and whether they faced by workplace ban
Data: workplace1.dta
• Sample program: workplace1.doc
• Results: workplace1.log
Description of variables in data
• . desc;
• storage display value
• variable name type format label variable label
• ------------------------------------------------------------------------
• > -
• smoker byte %9.0g is current smoking
• worka byte %9.0g has workplace smoking bans
• age byte %9.0g age in years
• male byte %9.0g male
• black byte %9.0g black
• hispanic byte %9.0g hispanic
• somecol byte %9.0g has some college
• college float %9.0g
• -----------------------------------------------------------------------
Summary statistics
• sum;
• Variable | Obs Mean Std. Dev. Min Max
• -------------+--------------------------------------------------------
• smoker | 16258 .25163 .433963 0 1
• worka | 16258 .6851396 .4644745 0 1
• age | 16258 38.54742 11.96189 18 87
• male | 16258 .3947595 .488814 0 1
• black | 16258 .1119449 .3153083 0 1
• -------------+--------------------------------------------------------
• hispanic | 16258 .0607086 .2388023 0 1
• incomel | 16258 10.42097 .7624525 6.214608 11.22524
• hsgrad | 16258 .3355271 .4721889 0 1
• somecol | 16258 .2685447 .4432161 0 1
• college | 16258 .3293763 .4700012 0 1
Running a probit
• probit smoker age incomel male black hispanic hsgrad somecol college worka;
• The first variable after ‘probit’ is the discrete outcome, the rest of the variables are the independent variables
• Includes a constant as a default
Running a logit
• logit smoker age incomel male black hispanic hsgrad somecol college worka;
• Same as probit, just change the first word
Running linear probability
• reg smoker age incomel male black hispanic hsgrad somecol college worka, robust;
• Simple regression.
• Standard errors are incorrect (heteroskedasticity)
• robust option produces standard errors with arbitrary form of heteroskedasticity
Probit Results
• Probit estimates Number of obs = 16258
• LR chi2(9) = 819.44
• Prob > chi2 = 0.0000
• Log likelihood = -8761.7208 Pseudo R2 = 0.0447
• ------------------------------------------------------------------------------
• smoker | Coef. Std. Err. z P>|z| [95% Conf. Interval]
• -------------+----------------------------------------------------------------
• age | -.0012684 .0009316 -1.36 0.173 -.0030943 .0005574
• incomel | -.092812 .0151496 -6.13 0.000 -.1225047 -.0631193
• male | .0533213 .0229297 2.33 0.020 .0083799 .0982627
• black | -.1060518 .034918 -3.04 0.002 -.17449 -.0376137
• hispanic | -.2281468 .0475128 -4.80 0.000 -.3212701 -.1350235
• hsgrad | -.1748765 .0436392 -4.01 0.000 -.2604078 -.0893453
• somecol | -.363869 .0451757 -8.05 0.000 -.4524118 -.2753262
• college | -.7689528 .0466418 -16.49 0.000 -.860369 -.6775366
• worka | -.2093287 .0231425 -9.05 0.000 -.2546873 -.1639702
• _cons | .870543 .154056 5.65 0.000 .5685989 1.172487
• ------------------------------------------------------------------------------
How to measure fit?
• Regression (OLS)
• minimize sum of squared errors
• Or, maximize R2
• The model is designed to maximize predictive capacity
• Not the case with Probit/Logit
• MLE models pick distribution parameters so as best describe the data generating process
• May or may not ‘predict’ the outcome well
Pseudo R2
• LLk log likelihood with all variables
• LL1 log likelihood with only a constant
• 0 > LLk > LL1 so | LLk | < |LL1|
• Pseudo R2 = 1 - |LL1/LLk|
• Bounded between 0-1
• Not anything like an R2 from a regression
Predicting Y
• Let b be the estimated value of β
• For any candidate vector of xi , we can predict probabilities, Pi
• Pi = Ф(xib)
• Once you have Pi, pick a threshold value, T, so that you predict
• Yp = 1 if Pi > T
• Yp = 0 if Pi ≤ T
• Then compare, fraction correctly predicted
Question: what value to pick for T?
• Can pick .5
• Intuitive. More likely to engage in the activity than to not engage in it
• However, when the  is small, this criteria does a poor job of predicting Yi=1
• However, when the  is close to 1, this criteria does a poor job of picking Yi=0
*predict probability of smoking;
• predict pred_prob_smoke;
• * get detailed descriptive data about predicted prob;
• sum pred_prob, detail;
• * predict binary outcome with 50% cutoff;
• gen pred_smoke1=pred_prob_smoke>=.5;
• label variable pred_smoke1 "predicted smoking, 50% cutoff";
• * compare actual values;
• tab smoker pred_smoke1, row col cell;
. sum pred_prob, detail;
• Pr(smoker)
• -------------------------------------------------------------
• Percentiles Smallest
• 1% .0959301 .0615221
• 5% .1155022 .0622963
• 10% .1237434 .0633929 Obs 16258
• 25% .1620851 .0733495 Sum of Wgt. 16258
• 50% .2569962 Mean .2516653
• Largest Std. Dev. .0960007
• 75% .3187975 .5619798
• 90% .3795704 .5655878 Variance .0092161
• 95% .4039573 .5684112 Skewness .1520254
• 99% .4672697 .6203823 Kurtosis 2.149247
Notice two things
• Sample mean of the predicted probabilities is close to the sample mean outcome
• 99% of the probabilities are less than .5
• Should predict few smokers if use a 50% cutoff
| predicted smoking,
• is current | 50% cutoff
• smoking | 0 1 | Total
• -----------+----------------------+----------
• 0 | 12,153 14 | 12,167
• | 99.88 0.12 | 100.00
• | 74.93 35.90 | 74.84
• | 74.75 0.09 | 74.84
• -----------+----------------------+----------
• 1 | 4,066 25 | 4,091
• | 99.39 0.61 | 100.00
• | 25.07 64.10 | 25.16
• | 25.01 0.15 | 25.16
• -----------+----------------------+----------
• Total | 16,219 39 | 16,258
• | 99.76 0.24 | 100.00
• | 100.00 100.00 | 100.00
• | 99.76 0.24 | 100.00
Check on-diagonal elements.
• The last number in each 2x2 element is the fraction in the cell
• The model correctly predicts 74.75 + 0.15 = 74.90% of the obs
• It only predicts a small fraction of smokers
Do not be amazed by the 75% percent correct prediction
• If you said everyone has a  chance of smoking (a case of no covariates), you would be correct Max[(,(1-)] percent of the time
In this case, 25.16% smoke.
• If everyone had the same chance of smoking, we would assign everyone Pr(y=1) = .2516
• We would be correct for the 1 - .2516 = 0.7484 people who do not smoke
• MLE models are not designed to maximize prediction
• Should not be surprised they do not predict well
• In this case, not particularly good measures of predictive capacity
Translating coefficients in probit:Continuous Covariates
• Pr(yi=1) = Φ[β0 + x1iβ1+ x2iβ2+… xkiβk]
• Suppose that x1i is a continuous variable
• d Pr(yi=1) /d x1i = ?
• What is the change in the probability of an event give a change in x1i?
Marginal Effect
• d Pr(yi=1) /d x1i
• = β1φ[β0 + x1iβ1+ x2iβ2+… xkiβk]
• Notice two things. Marginal effect is a function of the other parameters and the values of x.
Translating Coefficients:Discrete Covariates
• Pr(yi=1) = Φ[β0 + x1iβ1+ x2iβ2+… xkiβk]
• Suppose that x2i is a dummy variable (1 if yes, 0 if no)
• Marginal effect makes no sense, cannot change x2i by a little amount. It is either 1 or 0.
• Redefine the variable of interest. Compare outcomes with and without x2i
y1 = Pr(yi=1 | x2i=1)

= Φ[β0 + x1iβ1+ β2 + x3iβ3 +… ]

• y0 = Pr(yi=1 | x2i=0)

= Φ[β0 + x1iβ1+ x3iβ3 … ]

Marginal effect = y1 – y0.

Difference in probabilities with and without x2i?

In STATA
• Marginal effects for continuous variables, and Change in probabilities for dichotomous outcomes, STATA picks sample means for X’s
STATA command for Marginal Effects
• mfx compute;
• Must come after the outcome when estimates are still active in program.
Marginal effects after probit
• y = Pr(smoker) (predict)
• = .24093439
• ------------------------------------------------------------------------------
• variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X
• ---------+--------------------------------------------------------------------
• age | -.0003951 .00029 -1.36 0.173 -.000964 .000174 38.5474
• incomel | -.0289139 .00472 -6.13 0.000 -.03816 -.019668 10.421
• male*| .0166757 .0072 2.32 0.021 .002568 .030783 .39476
• black*| -.0320621 .01023 -3.13 0.002 -.052111 -.012013 .111945
• hispanic*| -.0658551 .01259 -5.23 0.000 -.090536 -.041174 .060709
• hsgrad*| -.053335 .01302 -4.10 0.000 -.07885 -.02782 .335527
• somecol*| -.1062358 .01228 -8.65 0.000 -.130308 -.082164 .268545
• college*| -.2149199 .01146 -18.76 0.000 -.237378 -.192462 .329376
• worka*| -.0668959 .00756 -8.84 0.000 -.08172 -.052072 .68514
• ------------------------------------------------------------------------------
• (*) dy/dx is for discrete change of dummy variable from 0 to 1
Interpret results
• 10% increase in income will reduce smoking by 2.9 percentage points
• 10 year increase in age will decrease smoking rates .4 percentage points
• Those with a college degree are 21.5 percentage points less likely to smoke
• Those that face a workplace smoking ban have 6.7 percentage point lower probability of smoking
Do not confuse percentage point and percent differences
• A 6.7 percentage point drop is 29% of the sample mean of 24 percent.
• Blacks have smoking rates that are 3.2 percentage points lower than others, which is 13 percent of the sample mean
Marginal effects for specific characteristics
• Can generate marginal effects for a specific x
• prchange, x(age=40 black=0 hispanic=0 hsgrad=0 somecol=0 worka=0);
• If an x is not specified, STATA will use the sample mean (e.g., log income in this case)
• Make sure when you specify a particular dummy variable (=1) you set the rest to zero
probit: Changes in Predicted Probabilities for smoker
• min->max 0->1 -+1/2 -+sd/2 MargEfct
• age -0.0323 -0.0005 -0.0005 -0.0056 -0.0005
• incomel -0.1795 -0.0320 -0.0344 -0.0263 -0.0345
• male 0.0198 0.0198 0.0198 0.0097 0.0198
• black -0.0385 -0.0385 -0.0394 -0.0124 -0.0394
• hispanic -0.0804 -0.0804 -0.0845 -0.0202 -0.0847
• hsgrad -0.0625 -0.0625 -0.0648 -0.0306 -0.0649
• somecol -0.1235 -0.1235 -0.1344 -0.0598 -0.1351
• college -0.2644 -0.2644 -0.2795 -0.1335 -0.2854
• worka -0.0742 -0.0742 -0.0776 -0.0361 -0.0777
Testing significance of individual parameters
• In large samples, MLE estimates are normally distributed
• Null hypothesis, βj=0
• If the null is true and the sample is larges, βj is distributed as a normal with variance σj2.
• Using notes from before, if we divide βj by the standard deviation, we get standard normal
βj/se(βj) should be N(0,1)
• βj/se(βj) = z-score
• 95% of the distribution of a N(0,1) is between -1.96, 1.96
• Reject null of the z-score > |1.96|
• Only age is statistically insignificant (cannot reject null)
When will results differ?
• Normal and logit CDF look:
• Similar in the mid point of the distribution
• Different in the tails
• You obtain more observations in the tails of the distribution when
• Samples sizes are large
•  approaches 1 or 0
• These situations will produce more differences in estimates
Some nice properties of the Logit
• Outcome, y=1 or 0
• Treatment, x=1 or 0
• Other covariates, x
• Context,
• x = whether a baby is born with a low weight birth
• x = whether the mom smoked or not during pregnancy
Risk ratio

RR = Prob(y=1|x=1)/Prob(y=1|x=0)

Differences in the probability of an event when x is and is not observed

How much does smoking elevate the chance your child will be a low weight birth

Let Yyx be the probability y=1 or 0 given x=1 or 0
• Think of the risk ratio the following way
• Y11 is the probability Y=1 when X=1
• Y10 is the probability Y=1 when X=0
• Y11 = RR*Y10
Odds Ratio

OR=A/B = [Y11/Y01]/[Y10/Y00]

A = [Pr(Y=1|X=1)/Pr(Y=0|X=1)]

= odds of Y occurring if you are a smoker

B = [Pr(Y=1|X=0)/Pr(Y=0|X=0)]

= odds of y happening if you are not a smoker

What are the relative odds of Y happening if you do or do not experience X

Suppose Pr(Yi =1) = F(βo+ β1Xi + β2Z) and F is the logistic function
• Can show that
• OR = exp(β1) = e β1
• This number is typically reported by most statistical packages
Details
• Y11 = exp(βo+ β1 + β2Z) /(1+ exp(βo+ β1+ β2Z) )
• Y10 = exp(βo+ β2Z)/(1+ exp(βo+β2Z))
• Y01 = 1 /(1+ exp(βo+ β1 + β2Z) )
• Y00 = 1/(1+ exp(βo+β2Z)
• [Y11/Y01] = exp(βo+ β1 + β2Z)
• [Y10/Y00] = exp(βo+ β2Z)
• OR=A/B = [Y11/Y01]/[Y10/Y00]

= exp(βo+ β1 + β2Z)/ exp(βo + β2Z)

= exp(β1)

Suppose Y is rare,  close to 0
• Pr(Y=0|X=1) and Pr(Y=0|X=0) are both close to 1, so they cancel
• Therefore, when  is close to 0
• Odds Ratio = Risk Ratio
• Why is this nice?
Population attributable risk
• Average outcome in the population
•  = (1-) Y10 +  Y11 = (1- )Y10 + (RR)Y10
• Average outcomes are a weighted average of outcomes for X=0 and X=1
• What would the average outcome be in the absence of X (e.g., reduce smoking rates to 0)
• Ya = Y10
Population Attributable Risk
• PAR
• Fraction of outcome attributed to X
• The difference between the current rate and the rate that would exist without X, divided by the current rate
• PAR = ( – Ya)/

= (RR – 1)/[(1-) + RR]

Example: Maternal Smoking and Low Weight Births
• 6% births are low weight
• < 2500 grams (
• Average birth is 3300 grams (5.5 lbs)
• Maternal smoking during pregnancy has been identified as a key cofactor
• 13% of mothers smoke
• This number was falling about 1 percentage point per year during 1980s/90s
• Doubles chance of low weight birth
Natality detail data
• Census of all births (4 million/year)
• Annual files starting in the 60s
• Baby (birth weight, length, date, sex, plurality, birth injuries)
• Demographics (age, race, marital, educ of mom)
• Birth (who delivered, method of delivery)
• Health of mom (smoke/drank during preg, weight gain)
Smoking not available from CA or NY
• ~3 million usable observations
• I pulled .5% random sample from 1995
• Variables: birthweight (grams), smoked, married, 4-level race, 5 level education, mothers age at birth
------------------------------------------------------------------------------------------------------------------------------------------------------------
• > -
• storage display value
• variable name type format label variable label
• ------------------------------------------------------------------------------
• > -
• birthw int %9.0g birth weight in grams
• smoked byte %9.0g =1 if mom smoked during
• pregnancy
• age byte %9.0g moms age at birth
• married byte %9.0g =1 if married
• race4 byte %9.0g 1=white,2=black,3=asian,4=other
• educ5 byte %9.0g 1=0-8, 2=9-11, 3=12, 4=13-15,
• 5=16+
• visits byte %9.0g prenatal visits
• ------------------------------------------------------------------------------
dummy |
• variable, |
• =1 | =1 if mom smoked
• ifBW<2500 | during pregnancy
• grams | 0 1 | Total
• -----------+----------------------+----------
• 0 | 11,626 1,745 | 13,371
• | 86.95 13.05 | 100.00
• | 94.64 89.72 | 93.96
• | 81.70 12.26 | 93.96
• -----------+----------------------+----------
• 1 | 659 200 | 859
• | 76.72 23.28 | 100.00
• | 5.36 10.28 | 6.04
• | 4.63 1.41 | 6.04
• -----------+----------------------+----------
• Total | 12,285 1,945 | 14,230
• | 86.33 13.67 | 100.00
• | 100.00 100.00 | 100.00
• | 86.33 13.67 | 100.00
Notice a few things
• 13.7% of women smoke
• 6% have low weight birth
• Pr(LBW | Smoke) =10.28%
• Pr(LBW |~ Smoke) = 5.36%
• RR

= Pr(LBW | Smoke)/ Pr(LBW |~ Smoke)

= 0.1028/0.0536 = 1.92

Logit results
• Log likelihood = -3136.9912 Pseudo R2 = 0.0330
• ------------------------------------------------------------------------------
• lowbw | Coef. Std. Err. z P>|z| [95% Conf. Interval]
• -------------+----------------------------------------------------------------
• smoked | .6740651 .0897869 7.51 0.000 .4980861 .8500441
• age | .0080537 .006791 1.19 0.236 -.0052564 .0213638
• married | -.3954044 .0882471 -4.48 0.000 -.5683654 -.2224433
• _Ieduc5_2 | -.1949335 .1626502 -1.20 0.231 -.5137221 .1238551
• _Ieduc5_3 | -.1925099 .1543239 -1.25 0.212 -.4949791 .1099594
• _Ieduc5_4 | -.4057382 .1676759 -2.42 0.016 -.7343769 -.0770994
• _Ieduc5_5 | -.3569715 .1780322 -2.01 0.045 -.7059081 -.0080349
• _Irace4_2 | .7072894 .0875125 8.08 0.000 .5357681 .8788107
• _Irace4_3 | .386623 .307062 1.26 0.208 -.2152075 .9884535
• _Irace4_4 | .3095536 .2047899 1.51 0.131 -.0918271 .7109344
• _cons | -2.755971 .2104916 -13.09 0.000 -3.168527 -2.343415
• ------------------------------------------------------------------------------
Odds Ratios
• Smoked
• exp(0.674) = 1.96
• Smokers are twice as likely to have a low weight birth
• _Irace4_2 (Blacks)
• exp(0.707) = 2.02
• Blacks are twice as likely to have a low weight birth
• Logistic y x1 x2;
• In this case
• xi: logistic lowbw smoked age married i.educ5 i.race4;
• ------------------------------------------------------------------------------
• lowbw | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
• -------------+----------------------------------------------------------------
• smoked | 1.962198 .1761796 7.51 0.000 1.645569 2.33975
• age | 1.008086 .0068459 1.19 0.236 .9947574 1.021594
• married | .6734077 .0594262 -4.48 0.000 .5664506 .8005604
• _Ieduc5_2 | .8228894 .1338431 -1.20 0.231 .5982646 1.131852
• _Ieduc5_3 | .8248862 .1272996 -1.25 0.212 .6095837 1.116233
• _Ieduc5_4 | .6664847 .1117534 -2.42 0.016 .4798043 .9257979
• _Ieduc5_5 | .6997924 .1245856 -2.01 0.045 .4936601 .9919973
• _Irace4_2 | 2.028485 .1775178 8.08 0.000 1.70876 2.408034
• _Irace4_3 | 1.472001 .4519957 1.26 0.208 .8063741 2.687076
• _Irace4_4 | 1.362817 .2790911 1.51 0.131 .9122628 2.035893
• ------------------------------------------------------------------------------
PAR
• PAR = (RR – 1)/[(1-) + RR]
• = 0.137
• RR = 1.96
• PAR = 0.116
• 11.6% of low weight births attributed to maternal smoking
Hypothesis Testing in MLE models
• MLE are asymptotically normally distributed, one of the properties of MLE
• Therefore, standard t-tests of hypothesis will work as long as samples are ‘large’
• What ‘large’ means is open to question
• What to do when samples are ‘small’ – table for a moment
Testing a linear combination of parameters
• Suppose you have a probit model
• Φ[β0 + x1iβ1+ x2iβ2 + x3iβ3 +… ]
• Test a linear combination or parameters
• Simplest example, test a subset are zero
• β1= β2 = β3 = β4 =0
• To fix the discussion
• N observations
• K parameters
• J restrictions (count the equals signs, j=4)
Wald Test
• Based on the fact that the parameters are distributed asymptotically normal
• Probability theory review
• Suppose you have m draws from a standard normal distribution (zi)
• M = z12 + z22 + …. zm2
• M is distributed as a Chi-square with m degrees of freedom
Wald test constructs a ‘quadratic form’ suggested by the test you want to perform
• This combination, because it contains squares of the true parameters, should, if the hypothesis is true, be distributed as a Chi square with J degrees of freedom.
• If the test statistic is ‘large’, relative to the degrees of freedom of the test, we reject, because there is a low probability we would have drawn that value at random from the distribution
Reading critical values from a table
• All stats books will report the ‘percentiles’ of a chi-square
• Vertical axis (degrees of freedom)
• Horizontal axis (percentiles)
• Entry is the value where ‘percentile’ of the distribution falls below
Example: Suppose 4 restrictions
• 95% of a chi-square distribution falls below 9.488.
• So there is only a 5% a number drawn at random will exceed 9.488
• If your test statistic is below, cannot reject null
• If your test statistics is above, reject null
Wald test in STATA
• Default test in MLE models
• Easy to do. Look at program
• Does not estimate the ‘restricted’ model
• ‘Lower power’ than other tests, i.e., high chance of false negative
• ( 1) hsgrad = 0
• ( 2) somecol = 0
• ( 3) college = 0
• chi2( 3) = 504.78
• Prob > chi2 = 0.0000
Notice the higher value of the test statistic. There is a low chance that a variable, drawn at random from a ch-square with three degrees of freedom will be this large.
• Reject null
-2 Log likelihood test
• * how to run the same tests with a -2 log like test;
• * estimate the unresticted model and save the estimates ;
• * in urmodel;
• probit smoker age incomel male black hispanic
• estimates store urmodel;
• * estimate the restricted model. save results in rmodel;
• probit smoker age incomel male black hispanic
• worka;
• estimates store rmodel;
• lrtest urmodel rmodel;
I prefer -2 log likelihood test
• Estimates the restricted and unrestricted model
• Therefore, has more power than a Wald test
• In most cases, they give the same ‘decision’ (reject/not reject)

### Section VI

Categorical Data

Ordered Probit
• Many discrete outcomes are to questions that have a natural ordering but no quantitative interpretation:
• Examples:
• Self reported health status
• (excellent, very good, good, fair, poor)
• Do you agree with the following statement
• Strongly agree, agree, disagree, strongly disagree
Can use the same type of model as in the previous section to analyze these outcomes
• Another ‘latent variable’ model
• Key to the model: there is a monotonic ordering of the qualitative responses
Self reported health status
• Excellent, very good, good, fair, poor
• Coded as 1, 2, 3, 4, 5 on National Health Interview Survey
• We will code as 5,4,3,2,1 (easier to think of this way)
• Asked on every major health survey
• Important predictor of health outcomes, e.g. mortality
• Key question: what predicts health status?
Important to note – the numbers 1-5 mean nothing in terms of their value, just an ordering to show you the lowest to highest
• The example below is easily adapted to include categorical variables with any number of outcomes
Model
• yi* = latent index of reported health
• The latent index measures your own scale of health. Once yi* crosses a certain value you report poor, then good, then very good, then excellent health
yi = (1,2,3,4,5) for (fair, poor, VG, G, excel)
• Interval decision rule
• yi=1 if yi* ≤ u1
• yi=2 if u1 < yi* ≤ u2
• yi=3 if u2 < yi* ≤ u3
• yi=4 if u3 < yi* ≤ u4
• yi=5 if yi* > u4
As with logit and probit models, we will assume yi* is a function of observed and unobserved variables
• yi* = β0 + x1i β1 + x2i β2 …. xki βk + εi
• yi* = xi β + εi
The threshold values (u1, u2, u3, u4) are unknown. We do not know the value of the index necessary to push you from very good to excellent.
• In theory, the threshold values are different for everyone
• Computer will not only estimate the β’s, but also the thresholds – average across people
As with probit and logit, the model will be determined by the assumed distribution of ε
• In practice, most people pick nornal, generating an ‘ordered probit’ (I have no idea why)
• We will generate the math for the probit version
Probabilities
• Lets do the outliers, Pr(yi=1) and Pr(yi=5) first
• Pr(yi=1)
• = Pr(yi* ≤ u1)
• = Pr(xi β +εi ≤ u1 )
• =Pr(εi ≤ u1 - xi β)
• = Φ[u1 - xi β] = 1- Φ[xi β – u1]
Pr(yi=5)
• = Pr(yi* > u4)
• = Pr(xi β +εi > u4 )
• =Pr(εi > u4 - xi β)
• = 1 - Φ[u4 - xi β] = Φ[xi β – u4]
Sample one for y=3
• Pr(yi=3) = Pr(u2 < yi* ≤ u3)

= Pr(yi* ≤ u3) – Pr(yi* ≤ u2)

= Pr(xi β +εi≤ u3) – Pr(xi β +εi≤ u2)

= Pr(εi≤ u3- xi β) - Pr(εi≤ u2 - xi β)

= Φ[u3- xi β] - Φ[u2 - xi β]

= 1 - Φ[xi β - u3] – 1 + Φ[xi β - u2]

= Φ[xi β - u2] - Φ[xi β - u3]

Summary
• Pr(yi=1) = 1- Φ[xi β – u1]
• Pr(yi=2) = Φ[xi β – u1] - Φ[xi β – u2]
• Pr(yi=3) = Φ[xi β – u2] - Φ[xi β – u3]
• Pr(yi=4) = Φ[xi β – u3] - Φ[xi β – u4]
• Pr(yi=5) = Φ[xi β – u4]
Likelihood function
• There are 5 possible choices for each person
• Only 1 is observed
• L = Σi ln[Pr(yi=k)] for k
Programming example
• Cancer control supplement to 1994 National Health Interview Survey
• Question: what observed characteristics predict self reported health (1-5 scale)
• 1=poor, 5=excellent
• Key covariates: income, education, age, current and former smoking status
• Programs
• sr_health_status.do, .dta, .log
desc;
• male byte %9.0g =1 if male
• age byte %9.0g age in years
• educ byte %9.0g years of education
• smoke byte %9.0g current smoker
• smoke5 byte %9.0g smoked in past 5 years
• black float %9.0g =1 if respondent is black
• othrace float %9.0g =1 if other race (white is ref)
• sr_health float %9.0g 1-5 self reported health,
• 5=excel, 1=poor
• famincl float %9.0g log family income
tab sr_health;
• 1-5 self |
• reported |
• health, |
• 5=excel, |
• 1=poor | Freq. Percent Cum.
• ------------+-----------------------------------
• 1 | 342 2.65 2.65
• 2 | 991 7.68 10.33
• 3 | 3,068 23.78 34.12
• 4 | 3,855 29.88 64.00
• 5 | 4,644 36.00 100.00
• ------------+-----------------------------------
• Total | 12,900 100.00
In STATA
• oprobit sr_health male age educ famincl black othrace smoke smoke5;
Ordered probit estimates Number of obs = 12900
• LR chi2(8) = 2379.61
• Prob > chi2 = 0.0000
• Log likelihood = -16401.987 Pseudo R2 = 0.0676
• ------------------------------------------------------------------------------
• sr_health | Coef. Std. Err. z P>|z| [95% Conf. Interval]
• -------------+----------------------------------------------------------------
• male | .1281241 .0195747 6.55 0.000 .0897583 .1664899
• age | -.0202308 .0008499 -23.80 0.000 -.0218966 -.018565
• educ | .0827086 .0038547 21.46 0.000 .0751535 .0902637
• famincl | .2398957 .0112206 21.38 0.000 .2179037 .2618878
• black | -.221508 .029528 -7.50 0.000 -.2793818 -.1636341
• othrace | -.2425083 .0480047 -5.05 0.000 -.3365958 -.1484208
• smoke | -.2086096 .0219779 -9.49 0.000 -.2516855 -.1655337
• smoke5 | -.1529619 .0357995 -4.27 0.000 -.2231277 -.0827961
• -------------+----------------------------------------------------------------
• _cut1 | .4858634 .113179 (Ancillary parameters)
• _cut2 | 1.269036 .11282
• _cut3 | 2.247251 .1138171
• _cut4 | 3.094606 .1145781
• ------------------------------------------------------------------------------
Interpret coefficients
• Marginal effects/changes in probabilities are now a function of 2 things
• Point of expansion (x’s)
• Frame of reference for outcome (y)
• STATA
• Picks mean values for x’s
• You pick the value of y
Continuous x’s
• Consider y=5
• d Pr(yi=5)/dxi

= d Φ[xi β – u4]/dxi = βφ[xi β – u4]

• Consider y=3
• d Pr(yi=3)/dxi = βφ[xi β – u3] - βφ[xi β – u4]
Discrete X’s
• xiβ = β0 + x1i β1 + x2i β2 …. xki βk
• X2i is yes or no (1 or 0)
• ΔPr(yi=5) =
• Φ[β0 + x1i β1 + β2 + x3i β3 +.. xki βk]

- Φ[β0 + x1i β1 + x3i β3 …. xki βk]

• Change in the probabilities when x2i=1 and x2i=0
• mfx compute, predict(outcome(5));
mfx compute, predict(outcome(5));
• Marginal effects after oprobit
• y = Pr(sr_health==5) (predict, outcome(5))
• = .34103717
• ------------------------------------------------------------------------------
• variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X
• ---------+--------------------------------------------------------------------
• male*| .0471251 .00722 6.53 0.000 .03298 .06127 .438062
• age | -.0074214 .00031 -23.77 0.000 -.008033 -.00681 39.8412
• educ | .0303405 .00142 21.42 0.000 .027565 .033116 13.2402
• famincl | .0880025 .00412 21.37 0.000 .07993 .096075 10.2131
• black*| -.0781411 .00996 -7.84 0.000 -.097665 -.058617 .124264
• othrace*| -.0843227 .01567 -5.38 0.000 -.115043 -.053602 .04124
• smoke*| -.0749785 .00773 -9.71 0.000 -.09012 -.059837 .289147
• smoke5*| -.0545062 .01235 -4.41 0.000 -.078719 -.030294 .081395
• ------------------------------------------------------------------------------
• (*) dy/dx is for discrete change of dummy variable from 0 to 1
Interpret the results
• Males are 4.7 percentage points more likely to report excellent
• Each year of age decreases chance of reporting excellent by 0.7 percentage points
• Current smokers are 7.5 percentage points less likely to report excellent health
• Wald tests/-2 log likelihood tests are done the exact same was as in PROBIT and LOGIT
• Tests of individual parameters are done the same way (z-score)

prchange, x(age=40 black=0 othrace=0 smoke=0 smoke5=0 educ=16);

• When a variable is NOT specified (famincl), STATA takes the sample mean.
PRCHANGE will produce results for all outcomes
• male
• Avg|Chg| 1 2 3 4
• 0->1 .0203868 -.0020257 -.00886671 -.02677558 -.01329902
• 5
• 0->1 .05096698
age
• Avg|Chg| 1 2 3 4
• Min->Max .13358317 .0184785 .06797072 .17686112 .07064757
• -+1/2 .00321942 .00032518 .00141642 .00424452 .00206241
• -+sd/2 .03728014 .00382077 .01648743 .04910323 .0237889
• MargEfct .00321947 .00032515 .00141639 .00424462 .00206252

### Section VII

Count Data Models

Introduction
• Many outcomes of interest are integer counts
• Doctor visits
• Low work days
• Cigarettes smoked per day
• Missed school days
• OLS models can easily handle some integer models
Example
• SAT scores are essentially integer values
• Few at ‘tails’
• Distribution is fairly continuous
• OLS models well
• In contrast, suppose
• High fraction of zeros
• Small positive values
OLS models will
• Predict negative values
• Do a poor job of predicting the mass of observations at zero
• Example
• Dr visits in past year, Medicare patients(65+)
• 1987 National Medical Expenditure Survey
• Top code (for now) at 10
• 17% have no visits
visits | Freq. Percent Cum.
• ------------+-----------------------------------
• 0 | 915 17.18 17.18
• 1 | 601 11.28 28.46
• 2 | 533 10.01 38.46
• 3 | 503 9.44 47.91
• 4 | 450 8.45 56.35
• 5 | 391 7.34 63.69
• 6 | 319 5.99 69.68
• 7 | 258 4.84 74.53
• 8 | 216 4.05 78.58
• 9 | 192 3.60 82.19
• 10 | 949 17.81 100.00
• ------------+-----------------------------------
• Total | 5,327 100.00
Poisson Model
• yi is drawn from a Poisson distribution
• Poisson parameter varies across observations
• f(yi;λi) =e-λi λiyi/yi! For λi>0
• E[yi]= Var[yi] = λi = f(xi, β)
λi must be positive at all times
• Therefore, we CANNOT let λi = xiβ
• Let λi = exp(xiβ)
• ln(λi) = (xiβ)
d ln(λi)/dxi = β
• Remember that d ln(λi) = dλi/λi
• Interpret β as the percentage change in mean outcomes for a change in x
Problems with Poisson
• Variance grows with the mean
• E[yi]= Var[yi] = λi = f(xi, β)
• Most data sets have over dispersion, where the variance grows faster than the mean
• In dr. visits sample,  = 5.6, s=6.7
• Impose Mean=Var, severe restriction and you tend to reduce standard errors
Negative Binomial Model
• Where γi = exp(xiβ) and δ ≥ 0
• E[yi] = δγi = δexp(xiβ)
• Var[yi] = δ (1+δ) γi
• Var[yi]/ E[yi] = (1+δ)
δ must always be ≥ 0
• In this case, the variance grows faster than the mean
• If δ=0, the model collapses into the Poisson
• Always estimate negative binomial
• If you cannot reject the null that δ=0, report the Poisson estimates
Notice that ln(E[yi]) = ln(δ) + ln(γi), so
• d ln(E[yi]) /dxi = β
• Parameters have the same interpretation as in the Poisson model
In STATA
• POISSON estimates a MLE model for poisson
• Syntax

POISSON y independent variables

• NBREG estimates MLE negative binomial
• Syntax

NBREG y independent variables

Interpret results for Poisson
• Those with CHRONIC condition have 50% more mean MD visits
• Those in EXCELent health have 78% fewer MD visits
• BLACKS have 33% fewer visits than whites
• Income elasticity is 0.021, 10% increase in income generates a 2.1% increase in visits
Negative Binomial
• Interpret results the same was as Poisson
• Look at coefficient/standard error on delta
• Ho: delta = 0 (Poisson model is correct)
• In this case, delta = 5.21 standard error is 0.15, easily reject null.
• Var/Mean = 1+delta = 6.21, Poisson is mis-specificed, should see very small standard errors in the wrong model

### Section VIII

Duration Data

Introduction
• Sometimes we have data on length of time of a particular event or ‘spells’
• Time until death
• Time on unemployment
• Time to complete a PhD
• Techniques we will discuss were originally used to examine lifespan of objects like light bulbs or machines. These models are often referred to as “time to failure”
Notation
• T is a random variable that indicates duration (time til death, find a new job, etc)
• t is the realization of that variable
• f(t) is a PDF that describes the process that determines the time to failure
• CDF is F(t) represents the probability an event will happen by time t
• What is the probability a person will die on or before the 65th birthday?
Survivor function, what is the chance you live past (t)
• S(t) = 1 – F(t)
• If 10% of a cohort dies by their 65th birthday, 90% will die sometime after their 65th birthday
Hazard function, h(t)
• What is the probability the spell will end at time t, given that it has already lasted t
• What is the chance you find a new job in month 12 given that you’ve been unemployed for 12 months already
PDF, CDF (Failure function), survivor function and hazard function are all related
• λ(t) = f(t)/S(t) = f(t)/(1-F(t))
• We focus on the ‘hazard’ rate because its relationship to time indicates ‘duration dependence’
Example: suppose the longer someone is out of work, the lower the chance they will exit unemployment – ‘damaged goods’
• This is an example of duration dependence, the probability of exiting a state of the world is a function of the length
Mathematically
• d λ(t) /dt = 0 then there is no duration dep.
• d λ(t) /dt > 0 there is + duration dependence

the probability the spell will end

increases with time

• d λ(t) /dt < 0 there is – duration dependence

the probability the spell will end

decreases over time

Different Functional Forms
• Exponential
• λ(t)= λ
• Hazard is the same over time, a ‘memory less’ process
• Weibull
• F(t) = 1 – exp(-γtρ) where ρ,γ > 0
• λ(t) = ργρ-1
• if ρ>1, increasing hazard
• if ρ<1, decreasing hazard
• if ρ=1, exponential
Others: Lognormal, log-logistic, Gompertz.
• Little more difficult – can examine when you get comfortable with Weibull
A note about most data sets
• Most data sets have ‘censored’ spells
• All will eventually die, but some do not in your period of analysis
• Incomplete spells or censored data
• Must build into the log likelihood function
Let ti be the duration we observe for all people
• Some people die, and their they lived until period ti
• Others are observed for ti periods, but do not
• Let di=1 if data is complete spell
• di=1 if incomplete
• F(t) is the probability you die by t
• 1-F(t) = the probability you die after (t)
If di=1 then we observe f(ti), someone who died in period ti
• If di=0 then someone lived past period ti and the probability of that is [1-F(ti)]
• L = Σi {di ln[f(ti)] + (1-di)ln[1-F(ti)]}
Introducing covariates
• Look at exponential
• λ(t)= λ
• Allow this to vary across people
• λi(t)= λi
• But like Poisson, λi is always positive
• Let λi = exp(β0 + x1i β1 + x2i β2 …. xki βk)
In the Weibull
• λ(t) = αγtα-1
• Allow it to vary across people
• λi(t) = αγi tα-1
• γi = exp(β0 + x1i β1 + x2i β2 …. xki βk)
Interpreting Coefficients
• This is the same for both Weibull and Exponential
• In Weibull, λ(ti ) = αγitα-1
• Suppose x1i is a dummy variable
• When xi1=1, then
• γi1 = eβ0 + β1 + x2i β2 …. xki βk
• When xi1=0, then
• γi0 = eβ0 + x2i β2 …. xki βk
When you construct the ratio of γi1/ γi0, all the others parameters cancel, so
• (αγi1 tα-1 – αγi0 tα-1 ) / αγi0 tα-1 = eβ1 -1
• Percentage change in the hazard when x1i turns from 0 to 1.
• STATA prints out eβ1, just subtract 1
Suppose x2i is continuous
• Suppose we increase x2i by 1 unit
• γi1 = exp(β0 + β1x1i + x2iβ2 …. xkiβk)
• γi2 = exp(β0 + β1 (x1i+1) + x2iβ2 …. xkiβk)
• Can show that
• (αγi1 tα-1 – αγi0 tα-1 ) / αγi0 tα-1 = eβ1 -1
• = exp(β2) – 1
• Percentage change in the hazard for 1 unit increase in x
NHIS Multiple Cause of Death
• NHIS
• annual survey of 60K households
• Data on individuals
• Self-reported healthm DR visits, lost workdays, etc.
• MCOD
• Linked NHIS respondents from 1986-1994 to National Death Index through Dec 31, 1995
• Identified whether respondent died and of what cause
Our sample
• Males, 50-70, who were married at the time of the survey
• 1987-1989 surveys
• Give everyone 5 years (60 months) of followup
Key Variables
• max_mths maximum months in the survey.
• Diedin5 respondent died during the 5 years of followup
• Note if diedn5=0, the max_mths=60. Diedin5 identifies whether the data is censored or not.
• -------------+--------------------------------------------------------
• age_s_yrs | 26654 59.42586 5.962435 50 70
• max_mths | 26654 56.49077 11.15384 0 60
• black | 26654 .0928566 .2902368 0 1
• hispanic | 26654 .0454716 .20834 0 1
• income | 26654 3.592181 1.327325 1 5
• -------------+--------------------------------------------------------
• educ | 26654 2.766677 .961846 1 4
• diedin5 | 26654 .1226082 .3279931 0 1
Duration Data in STATA
• Need to identify which is the duration data

stset length, failure(failvar)

• Length=duration variable
• Failvar=1 when durations end in failure, =0 for censored values
• If all data is uncensored, omit failure(failvar)
In our case
• stset max_mths, failure(diedin5)
Getting Kaplan-Meier Curves
• Tabular presentation of results

sts list

• Graphical presentation

sts graph

• Results by subgroup

sts graph, by(educ)

MLE of duration model with Covariates
• Basic syntax
• streg covariates,d(distribution)
• streg age_s_yrs black hispanic _Ie* _Ii*, d(weibull);
• In this model, STATA will print out exp(β)
• If you want the coefficients, add ‘nohr’ option (no hazard ratio)
Weibull coefficients
• No. of subjects = 26631 Number of obs = 26631
• No. of failures = 3245
• Time at risk = 1505705
• LR chi2(10) = 595.74
• Log likelihood = -12425.055 Prob > chi2 = 0.0000
• ------------------------------------------------------------------------------
• _t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
• -------------+----------------------------------------------------------------
• age_s_yrs | .0452588 .0031592 14.33 0.000 .0390669 .0514508
• black | .4770152 .0511122 9.33 0.000 .3768371 .5771932
• hispanic | .1333552 .082156 1.62 0.105 -.0276676 .294378
• _Ieduc_2 | .0093353 .0591918 0.16 0.875 -.1066786 .1253492
• _Ieduc_3 | -.072163 .0503131 -1.43 0.151 -.1707748 .0264488
• _Ieduc_4 | -.1301173 .0657131 -1.98 0.048 -.2589126 -.0013221
• _Iincome_2 | -.1867752 .0650604 -2.87 0.004 -.3142914 -.0592591
• _Iincome_3 | -.3268927 .0688635 -4.75 0.000 -.4618627 -.1919227
• _Iincome_4 | -.5166137 .0769202 -6.72 0.000 -.6673747 -.3658528
• _Iincome_5 | -.5425447 .0722025 -7.51 0.000 -.684059 -.4010303
• _cons | -9.201724 .2266475 -40.60 0.000 -9.645945 -8.757503
• -------------+----------------------------------------------------------------
• /ln_p | .1585315 .0172241 9.20 0.000 .1247729 .1922901
• -------------+----------------------------------------------------------------
• p | 1.171789 .020183 1.132891 1.212022
• 1/p | .8533961 .014699 .8250675 .8826974
• ------------------------------------------------------------------------------
The sign of the parameters is informative
• Hazard increasing in age
• Blacks, hispanics have higher mortality rates
• Hazard decreases with income and age
• The parameter ρ= 1.17.
• Check 95% confidence interval (1.13, 1.21). Can reject null p=1 (exponential)
• Hazard is increasing over time
Hazard ratios
• No. of subjects = 26631 Number of obs = 26631
• No. of failures = 3245
• Time at risk = 1505705
• LR chi2(10) = 595.74
• Log likelihood = -12425.055 Prob > chi2 = 0.0000
• ------------------------------------------------------------------------------
• _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
• -------------+----------------------------------------------------------------
• age_s_yrs | 1.046299 .0033055 14.33 0.000 1.03984 1.052797
• black | 1.611258 .082355 9.33 0.000 1.457667 1.781032
• hispanic | 1.142656 .093876 1.62 0.105 .9727116 1.342291
• _Ieduc_2 | 1.009379 .059747 0.16 0.875 .8988145 1.133544
• _Ieduc_3 | .9303792 .0468103 -1.43 0.151 .8430114 1.026802
• _Ieduc_4 | .8779924 .0576956 -1.98 0.048 .7718905 .9986788
• _Iincome_2 | .8296302 .0539761 -2.87 0.004 .7303062 .9424625
• _Iincome_3 | .7211611 .0496617 -4.75 0.000 .6301089 .8253706
• _Iincome_4 | .5965372 .0458858 -6.72 0.000 .5130537 .6936049
• _Iincome_5 | .5812672 .041969 -7.51 0.000 .5045648 .6696297
• -------------+----------------------------------------------------------------
• /ln_p | .1585315 .0172241 9.20 0.000 .1247729 .1922901
• -------------+----------------------------------------------------------------
• p | 1.171789 .020183 1.132891 1.212022
• 1/p | .8533961 .014699 .8250675 .8826974
• ------------------------------------------------------------------------------
Interpret coefficients
• Age: every year hazard increases by 4.6%
• Black: have 61% greater hazard than whites
• Hispanics: 14% greater hazard than non-hispanics
• Educ 2, 3, 4 are some 9-11, 12-15 and 16+ years of school
Educ 3: those with 12-15 years of educ have .93 – 1 = -0.07 or a 7% lower hazard than those with <9 years of school
• Educ 4: those with a college degree have 0.88 – 1 = -0.12 or a 12% lower hazard than those with <9 years of school
Income 2-5 are dummies for people with \$10-\$20K, \$20-\$30K, \$30-\$40K, >\$40K
• Income 2: Those with \$10-\$20K have 0.83 – 1 = -0.17 or a 17% lower hazard than those with income <\$10K
• Income 5, those with >\$40K in income have a 0.58 – 1 = -0.42 or a 42% lower hazard than those with income <\$10K
Topics not covered
• Time varying covariates
• Competing risk models
• Die from multiple causes
• Cox proportional hazard model
• Heterogeneity in baseline hazard