- 64 Views
- Uploaded on
- Presentation posted in: General

Discrete and Categorical Data

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Discrete and Categorical Data

William N. Evans

Department of Economics/MPRC

University of Maryland

Part I

Introduction

- Workhorse statistical model in social sciences is the multivariate regression model
- Ordinary least squares (OLS)
- yi = β0 + x1iβ1+ x2iβ2+… xkiβk+ εi
- yi = xi β + εi

- and are “population” values – represent the true relationship between x and y
- Unfortunately – these values are unknown
- The job of the researcher is to estimate these values
- Notice that if we differentiate y with respect to x, we obtain
- dy/dx =

- represents how much y will change for a fixed change in x
- Increase in income for more education
- Change in crime or bankruptcy when slots are legalized
- Increase in test score if you study more

- State of Maryland budget problems
- Drop in revenues
- Expensive k-12 school spending initiatives

- Short-term solution – raise tax on cigarettes by 34 cents/pack
- Problem – a tax hike will reduce consumption of taxable product
- Question for state – as taxes are raised, how much will cigarette consumption fall?

- Simple model: yi = + xi + i
- Suppose y is a state’s per capita consumption of cigarettes
- x represents taxes on cigarettes
- Question – how much will y fall if x is increased by 34 cents/pack?
- Problem – many reasons why people smoke – cost is but one of them –

- Data
- (Y) State per capita cigarette consumption for the years 1980-1997
- (X) tax (State + Federal) in real cents per pack
- “Scatter plot” of the data
- Negative covariance between variables
- When x>, more likely that y<
- When x<, more likely that y>

- Goal: pick values of and that “best fit” the data
- Define best fit in a moment

- True model
- yi = + xi + i
- We observe data points (yi,xi)
- The parameters and are unknown
- The actual error (i)is unknown

- (a,b) are estimates for the parameters (,)
- ei is an estimate of i where
- ei=yi-a-bxi

- Min iei2 = i(yi – a – bxi)2
- Minimize sum of squared errors (SSE)
- Treat (+) and (-) errors equally
- Over or under predict by “5” is the same magnitude of error
- “Quadratic form”
- The optimal value for a and b are those that make the 1st derivative equal zero
- Functions reach min or max values when derivatives are zero

- The model has a lot of nice features
- Statistical properties easy to establish
- Optimal estimates easy to obtain
- Parameter estimates are easy to interpret
- Model maximizes prediction
- If you minimize SSE you maximize R2

- The model does well as a first order approximation to lots of problems

- The OLS model work well when y is a continuous variable
- Income, wages, test scores, weight, GDP

- Does not has as many nice properties when y is not continuous
- Example: doctor visits
- Integer values
- Low counts for most people
- Mass of observations at zero

- Can predict outside the allowable range
- e.g., negative MD visits

- Does not describe the data generating process well
- e.g., mass of observations at zero

- Violates many properties of OLS
- e.g. heteroskedasticity

- Look at situations when the data generating process does not lend itself well to OLS models
- Mathematically describe the data generating process
- Show how we use different optimization procedure to obtain estimates
- Describe the statistical properties

- Show how to interpret parameters
- Illustrate how to estimate the models with popular program STATA

- Dichotomous events (yes or no)
- 1=yes, 0=no
- Graduate high school? work? Are obese? Smoke?

- Ordinal data
- Self reported health (fair, poor, good, excel)
- Strongly disagree, disagree, agree, strongly agree

- Count data
- Doctor visits, lost workdays, fatality counts

- Duration data
- Time to failure, time to death, time to re-employment

- Jeffrey Wooldridge, “Econometric analysis of cross sectional and panel data”
- Lots of insight and mathematical/statistical detail
- Very good examples

- William Greene, “Econometric Analysis”
- more topics
- Somewhat dated examples

- www.bsos.umd.edu/econ/evans/jpsm.html
- Contains
- These notes
- All STATA programs and data sets
- A couple of “Introduction to STATA” handouts
- Links to some useful web sites

- “Regression Models for Categorical Dependent Variables Using STATA”
- J. Scott Long and Jeremy Freese

- Available for sale from STATA website for $52 (www.stata.com)
- Post-estimation subroutines that translate results
- Do not need to buy the book to use the subroutines

- In STATA command line type
- net search spost

Spostado from http://www.indiana.edu/~jslsoc/stata

Part II

A brief introduction to STATA

- Very fast, convenient, well-documented, cheap and flexible statistical package
- Excellent for cross-section/panel data projects, not as great for time series but getting better
- Not as easy to manipulate large data sets from flat files as SAS
- I usually clean data in SAS, estimate models in STATA

- Key characteristic of STATA
- All data must be loaded into RAM
- Computations are very fast
- But, size of the project is limited by available memory

- Results can be generated two different ways
- Command line
- Write a program, (*.do) then submit from the command line

- cps87_or.do
- Program gets you to the point where can
- Load data into memory
- Construct new variables
- Get simple statistics
- Run a basic regression
- Store the results on a disk

- Random sample of data from 1987 Current Population Survey outgoing rotation group
- Sample selection
- Males
- 21-64
- Working 30+hours/week

- 19,906 observations

- Hardest thing to learn/do: get data from some other source and get it into STATA data set
- We skip over that part
- All the data sets are loaded into a STATA data file that can be called by saying:
use data file name

- * this line defines the semicolon as the ;
- * end of line delimiter;
- # delimit ;
- * set memork for 10 meg;
- set memory 10m;
- * write results to a log file;
- * the replace options writes over old;
- * log files;
- log using cps87_or.log,replace;
- * open stata data set;
- use c:\bill\stata\cps87_or;
- * list variables and labels in data set;
- desc;

- ------------------------------------------------------------------------------
- > -
- storage display value
- variable name type format label variable label
- ------------------------------------------------------------------------------
- > -
- age float %9.0g age in years
- race float %9.0g 1=white, non-hisp, 2=place,
- n.h, 3=hisp
- educ float %9.0g years of education
- unionm float %9.0g 1=union member, 2=otherwise
- smsa float %9.0g 1=live in 19 largest smsa,
- 2=other smsa, 3=non smsa
- region float %9.0g 1=east, 2=midwest, 3=south,
- 4=west
- earnwke float %9.0g usual weekly earnings
- ------------------------------------------------------------------------------

- Use ‘gen’ command for generate new variables
- Syntax
- gen new variable name=math statement

- Easily construct new variables via
- Algebraic operations
- Math/trig functions (ln, exp, etc.)
- Logical operators (when true, =1, when false, =0)

- * generate new variables;
- * lines 1-2 illustrate basic math functoins;
- * lines 3-4 line illustrate logical operators;
- * line 5 illustrate the OR statement;
- * line 6 illustrates the AND statement;
- * after you construct new variables, compress the data again;
- gen age2=age*age;
- gen earnwkl=ln(earnwke);
- gen union=unionm==1;
- gen topcode=earnwke==999;
- gen nonwhite=((race==2)|(race==3));
- gen big_ne=((region==1)&(smsa==1));

- desc -- describes variables in the data set
- sum – gets summary statistics
- tab – produces frequencies (tables) of discrete variables

- * get descriptive statistics;
- sum;
- * get detailed descriptics for continuous variables;
- sum earnwke, detail;
- * get frequencies of discrete variables;
- tabulate unionm;
- tabulate race;
- * get two-way table of frequencies;
- tabulate region smsa, row column cell;

- Variable | Obs Mean Std. Dev. Min Max
- -------------+--------------------------------------------------------
- age | 19906 37.96619 11.15348 21 64
- race | 19906 1.199136 .525493 1 3
- educ | 19906 13.16126 2.795234 0 18
- unionm | 19906 1.769065 .4214418 1 2
- smsa | 19906 1.908369 .7955814 1 3
- -------------+--------------------------------------------------------

- usual weekly earnings
- -------------------------------------------------------------
- Percentiles Smallest
- 1% 128 60
- 5% 178 60
- 10% 210 60 Obs 19906
- 25% 300 63 Sum of Wgt. 19906
- 50% 449 Mean 488.264
- Largest Std. Dev. 236.4713
- 75% 615 999
- 90% 865 999 Variance 55918.7
- 95% 999 999 Skewness .668646
- 99% 999 999 Kurtosis 2.632356

- 1=union |
- member, |
- 2=otherwise | Freq. Percent Cum.
- ------------+-----------------------------------
- 1 | 4,597 23.09 23.09
- 2 | 15,309 76.91 100.00
- ------------+-----------------------------------
- Total | 19,906 100.00

- 1=east, |
- 2=midwest, | 1=live in 19 largest smsa,
- 3=south, | 2=other smsa, 3=non smsa
- 4=west | 1 2 3 | Total
- -----------+---------------------------------+----------
- 1 | 2,806 1,349 842 | 4,997
- | 56.15 27.00 16.85 | 100.00
- | 38.46 18.89 15.39 | 25.10
- | 14.10 6.78 4.23 | 25.10
- -----------+---------------------------------+----------
- 2 | 1,501 1,742 1,592 | 4,835
- | 31.04 36.03 32.93 | 100.00
- | 20.58 24.40 29.10 | 24.29
- | 7.54 8.75 8.00 | 24.29
- -----------+---------------------------------+----------
- 3 | 1,501 2,542 1,904 | 5,947
- | 25.24 42.74 32.02 | 100.00
- | 20.58 35.60 34.80 | 29.88
- | 7.54 12.77 9.56 | 29.88
- -----------+---------------------------------+----------
- 4 | 1,487 1,507 1,133 | 4,127
- | 36.03 36.52 27.45 | 100.00
- | 20.38 21.11 20.71 | 20.73
- | 7.47 7.57 5.69 | 20.73
- -----------+---------------------------------+----------
- Total | 7,295 7,140 5,471 | 19,906
- | 36.65 35.87 27.48 | 100.00
- | 100.00 100.00 100.00 | 100.00
- | 36.65 35.87 27.48 | 100.00

- Syntax
reg dependent-variable independent-variables

- Example from program
*run simple regression;

reg earnwkl age age2 educ nonwhite union;

- Source | SS df MS Number of obs = 19906
- -------------+------------------------------ F( 5, 19900) = 1775.70
- Model | 1616.39963 5 323.279927 Prob > F = 0.0000
- Residual | 3622.93905 19900 .182057239 R-squared = 0.3085
- -------------+------------------------------ Adj R-squared = 0.3083
- Total | 5239.33869 19905 .263217216 Root MSE = .42668
- ------------------------------------------------------------------------------
- earnwkl | Coef. Std. Err. t P>|t| [95% Conf. Interval]
- -------------+----------------------------------------------------------------
- age | .0679808 .0020033 33.93 0.000 .0640542 .0719075
- age2 | -.0006778 .0000245 -27.69 0.000 -.0007258 -.0006299
- educ | .069219 .0011256 61.50 0.000 .0670127 .0714252
- nonwhite | -.1716133 .0089118 -19.26 0.000 -.1890812 -.1541453
- union | .1301547 .0072923 17.85 0.000 .1158613 .1444481
- _cons | 3.630805 .0394126 92.12 0.000 3.553553 3.708057
- ------------------------------------------------------------------------------

- R2 = .3085
- Variables explain 31% of the variation in log weekly earnings

- F(5,19900)
- Tests the hypothesis that all covariates (except constant) are jointly zero

- Y = β0 + β1Xi + εi
- dY/dX = β1
- But in this case Y=ln(W) where W weekly wages
- dln(W)/dX = (dW/W)/dX = β1
- Percentage change in wages given a change in x

- For each additional year of education, wages increase by 6.9%
- Non whites earn 17.2% less than whites
- Union members earn 13% more than nonunion members

Part III

Some notes about probability distributions

- Random variables with infinite number of possible values
- Examples -- units of measure (time, weight, distance)
- Many discrete outcomes can be treated as continuous, e.g., SAT scores

- The Probability Density Function (PDF)
- The PDF for a random variable x is defined as f(x), where
f(x) $ 0

If(x)dx = 1

- Calculus review: The integral of a function gives the “area under the curve”

- Suppose x is a “measure” like distance or time
- 0 # x # 4
- We may be interested in the Pr(x#a) ?

What if we consider all values?

- Note that Pr(x # b) + Pr(x>b) =1
- Pr(x>b) = 1 – Pr(x # b)
- Many times, it is easier to work with compliments

- The PDF is described by lower case such as f(x)
- The CDF is defined as upper case such as F(a)

- Most frequently used continuous distribution
- Symmetric “bell-shaped” distribution
- As we will show, the normal has useful properties
- Many variables we observe in the real world look normally distributed.
- Can translate normal into ‘standard normal’

- IQ scores
- SAT scores
- Heights of females
- Log income
- Average gestation (weeks of pregnancy)
- As we will show in a few weeks – sample means are normally distributed!!!

- PDF:
- For -# z #

- (z) is the standard normal PDF evaluated at z
- [a] = Pr(z a)

- Notice that:
- Normal is symmetric: (a) = (-a)
- Normal is “unimodal”
- Median=mean
- Area under curve=1
- Almost all area is between (-3,3)

- Evaluations of the CDF are done with
- Statistical functions (excel, SAS, etc)
- Tables

- Pr(z -0.98) = [-0.98] = 0.1635

- Pr(z 1.41) = [1.41] = 0.9207

- Pr(x>1.17) = 1 – Pr(z 1.17) = 1- [1.17]
- = 1 – 0.8790 = 0.1210

- Pr(0.1 z 1.9)
= Pr(z 1.9) – Pr(z 0.1)

= M(1.9) - M(0.1) = 0.9713 - 0.5398

= 0.4315

- Pr(z A) = [A]
- Pr(z > A) = 1 - [A]
- Pr(z - A) = [-A]
- Pr(z > -A) = 1 - [-A] = [A]

Section IV

Maximum likelihood estimation

- Observe n independent outcomes, all drawn from the same distribution
- (y1, y2, y3….yn)
- yi is drawn from f(yi; θ) where θ is an unknown parameter for the PDF f
- Recall definition of indepedence. If a and b and independent, Prob(a and b) = Pr(a)Pr(B)

- Because all the draws are independent, the probability these particular n values of Y would be drawn at random is called the ‘likelihood function’ and it equals
- L = Pr(y1)Pr(y2)…Pr(yn)
- L = f(y1; θ)f(y2; θ)…..f(y3; θ)

- MLE: pick a value for θ that best represents the chance these n values of y would have been generated randomly
- To maximize L, maximize a monotonic function of L
- Recall ln(abcd)=ln(a)+ln(b)+ln(c)+ln(d)

- Max L = ln(L) = ln[f(y1; θ)] +ln[f(y2; θ)] +
….. ln[f(yn; θ) = Σi ln[f(yi; θ)]

- Pick θ so that L is maximized
- dL/dθ = 0

L

θ

θ1

θ2

- Suppose y measures ‘counts’ such as doctor visits.
- yi is drawn from a Poisson distribution
- f(yi;λ) =e-λλyi/yi! For λ>0
- E[yi]= Var[yi] = λ

- Given n observations, (y1, y2, y3….yn)
- Pick value of λ that maximizes L
- Max L =Σi ln[f(yi; θ)] = Σi ln[e-λλyi/yi!]
= Σi [– λ + yiln(λ) – ln(yi!)]

= -n λ + ln(λ) Σi yi – Σi ln(yi!)

- L = -n λ + ln(λ) Σi yi – Σi ln(yi!)
- dL/dθ = -n + (1/ λ )Σi yi = 0
- Solve for λ
- λ = Σi yi /n = = sample mean of y

- In most cases however, cannot find a ‘closed form’ solution for the parameter in ln[f(yi; θ)]
- Must ‘search’ over all possible solutions
- How does the search work?
- Start with candidate value of θ.
- Calculate dL/dθ

- If dL/dθ > 0, increasing θ will increase L so we increase θ some
- If dL/dθ < 0, decreasing θ will increase L so we decrease θ some
- Keep changing θ until dL/dθ = 0
- How far you ‘step’ when you change θ is determined by a number of different factors

L

dL/dθ > 0

θ1

θ

L

dL/dθ < 0

θ3

θ

- Sometimes call efficient estimation. Can never generate a smaller variance than one obtained by MLE
- Parameters estimates are distributed as a normal distribution when samples sizes are large
- Therefore, if we divide the parameter by its standard error, should be normally distributed with a mean zero and variance 1 if the null (=0) is correct

Section 5

Dichotomous outcomes

- Suppose data is discrete but there are only 2 outcomes
- Examples
- Graduate high school or not
- Patient dies or not
- Working or not
- Smoker or not

- In data, yi=1 if yes, yi =0 if no

- There are only two outcomes
- Research question: What factors impact whether the event occurs?
- To answer, will model the probability the outcome occurs
- Pr(Yi=1) when yi=1 or
- Pr(Yi=0) = 1- Pr(Yi=1) when yi=0

- Think of the problem from a MLE perspective
- Likelihood for i’th observation
- Li= Pr(Yi=1)Yi [1 - Pr(Yi=1)](1-Yi)
- When yi=1, only relevant part is Pr(Yi=1)
- When yi=0, only relevant part is [1 - Pr(Yi=1)]

- L = Σi ln[Li] =
= Σi {yi ln[Pr(yi=1)] + (1-yi)ln[Pr(yi=0)] }

- Notice that up to this point, the model is generic. The log likelihood function will determined by the assumptions concerning how we determine Pr(yi=1)

- There is some process (biological, social, decision theoretic, etc) that determines the outcome y
- Some of the variables impacting are observed, some are not
- Requires that we model how these factors impact the probabilities
- Model from a ‘latent variable’ perspective

- Consider a women’s decision to work
- yi* = the person’s net benefit to work
- Two components of yi*
- Characteristics that we can measure
- Education, age, income of spouse, prices of child care

- Some we cannot measure
- How much you like spending time with your kids
- how much you like/hate your job

- Characteristics that we can measure

- We aggregate these two components into one equation
- yi* = β0 + x1iβ1+ x2iβ2+… xkiβk+ εi
= xi β + εi

- xi β (measurable characteristics but with uncertain weights)
- εi random unmeasured characteristics

(if net benefits are positive)

yi=1 if yi*>0

yi=0 if yi*≤0

- yi=1 if yi*>0
- yi* = xi β + εi > 0 only if
- εi > - xi β

- yi* = xi β + εi ≤ 0 only if
- εi ≤ - xi β

- When we look at certain people, we have expectations about whether y should equal 1 or 0
- These expectations do not always hold true
- The error ε represents deviations from what we expect
- Go back to the work example, suppose xi β is ‘big.’ We observe a woman with:
- High wages
- Low husband’s income
- Low cost of child care

- We would expect this person to work, UNLESS, there is some unmeasured variable that counteracts this
- For example:
- Suppose a mom really likes spending time with her kids, or she hates her job.
- The unmeasured benefit of working is then a big negative coefficient εi

- If we observe them working, there are a certain range of values that εi must have been in excess of
- yi=1 if εi > - xi β

- yi=0 if εi ≤ - xi β

- The estimation procedure used is determined by the assumed distribution of ε
- What is the probability we observe someone with y=1?
- Use definition of the CDF
- Pr(yi=1) = Pr(yi*>0) = Pr(εi > - xi β)
= 1 – F(-xi β)

- What is the probability we observe someone with y=0?
- Use definition of the CDF
- Pr(yi=0) = Pr(yi*≤ 0) = Pr(εi ≤ - xi β) = F(-xi β)

- Two standard models: ε is either
- normal or
- logistic

- ε is distributed as a standard normal
- Mean zero
- Variance 1

- Evaluate probability (y=1)
- Pr(yi=1) = Pr(εi > - xi β) = 1 – Ф(-xi β)
- Given symmetry: 1 – Ф(-xi β) = Ф(xi β)

- Evaluate probability (y=0)
- Pr(yi=0) = Pr(εi ≤ - xi β) = Ф(-xi β)
- Given symmetry: Ф(-xi β) = 1 - Ф(xi β)

- Summary
- Pr(yi=1) = Ф(xi β)
- Pr(yi=0) = 1 -Ф(xi β)

- Notice that Ф(a) is increasing a. Therefore, is one of the x’s increases the probability of observing y, we would expect the coefficient on that variable to be (+)

- The standard normal assumption (variance=1) is not critical
- In practice, the variance may be not equal t 1, but given the math of the problem, we cannot identify the variance. It is absorbed into parameter estimates

- CDF: F(a) = exp(a)/(1+exp(a))
- Symmetric, unimodal distribution
- Looks a lot like the normal
- Incredibly easy to evaluate the CDF and PDF
- Mean of zero, variance > 1 (more variance than normal)

- Evaluate probability (y=1)
- Pr(yi=1) = Pr(εi > - xi β) = 1 – F(-xi β)
- Given symmetry: 1 – F(-xi β) = F(xi β)
- F(xi β) = exp(xi β)/(1+exp(xi β))

- Evaluate probability (y=0)
- Pr(yi=0) = Pr(εi ≤ - xi β) = F(-xi β)
- Given symmetry: F(-xi β) = 1 - F(xi β)
- 1 - F(xi β) = 1 /(1+exp(xi β))

- When εi is a logistic distribution
- Pr(yi =1) = exp(xi β)/(1+exp(xi β))
- Pr(yi=0) = 1/(1+exp(xi β))

- Smoking supplements to 1991 and 1993 National Health Interview Survey
- Asked all respondents whether they currently smoke
- Asked workers about workplace tobacco policies
- Sample: workers
- Key variables: current smoking and whether they faced by workplace ban

- Data: workplace1.dta
- Sample program: workplace1.doc
- Results: workplace1.log

- . desc;
- storage display value
- variable name type format label variable label
- ------------------------------------------------------------------------
- > -
- smoker byte %9.0g is current smoking
- worka byte %9.0g has workplace smoking bans
- age byte %9.0g age in years
- male byte %9.0g male
- black byte %9.0g black
- hispanic byte %9.0g hispanic
- incomel float %9.0g log income
- hsgrad byte %9.0g is hs graduate
- somecol byte %9.0g has some college
- college float %9.0g
- -----------------------------------------------------------------------

- sum;
- Variable | Obs Mean Std. Dev. Min Max
- -------------+--------------------------------------------------------
- smoker | 16258 .25163 .433963 0 1
- worka | 16258 .6851396 .4644745 0 1
- age | 16258 38.54742 11.96189 18 87
- male | 16258 .3947595 .488814 0 1
- black | 16258 .1119449 .3153083 0 1
- -------------+--------------------------------------------------------
- hispanic | 16258 .0607086 .2388023 0 1
- incomel | 16258 10.42097 .7624525 6.214608 11.22524
- hsgrad | 16258 .3355271 .4721889 0 1
- somecol | 16258 .2685447 .4432161 0 1
- college | 16258 .3293763 .4700012 0 1

- probit smoker age incomel male black hispanic hsgrad somecol college worka;
- The first variable after ‘probit’ is the discrete outcome, the rest of the variables are the independent variables
- Includes a constant as a default

- logit smoker age incomel male black hispanic hsgrad somecol college worka;
- Same as probit, just change the first word

- reg smoker age incomel male black hispanic hsgrad somecol college worka, robust;
- Simple regression.
- Standard errors are incorrect (heteroskedasticity)
- robust option produces standard errors with arbitrary form of heteroskedasticity

- Probit estimates Number of obs = 16258
- LR chi2(9) = 819.44
- Prob > chi2 = 0.0000
- Log likelihood = -8761.7208 Pseudo R2 = 0.0447
- ------------------------------------------------------------------------------
- smoker | Coef. Std. Err. z P>|z| [95% Conf. Interval]
- -------------+----------------------------------------------------------------
- age | -.0012684 .0009316 -1.36 0.173 -.0030943 .0005574
- incomel | -.092812 .0151496 -6.13 0.000 -.1225047 -.0631193
- male | .0533213 .0229297 2.33 0.020 .0083799 .0982627
- black | -.1060518 .034918 -3.04 0.002 -.17449 -.0376137
- hispanic | -.2281468 .0475128 -4.80 0.000 -.3212701 -.1350235
- hsgrad | -.1748765 .0436392 -4.01 0.000 -.2604078 -.0893453
- somecol | -.363869 .0451757 -8.05 0.000 -.4524118 -.2753262
- college | -.7689528 .0466418 -16.49 0.000 -.860369 -.6775366
- worka | -.2093287 .0231425 -9.05 0.000 -.2546873 -.1639702
- _cons | .870543 .154056 5.65 0.000 .5685989 1.172487
- ------------------------------------------------------------------------------

- Regression (OLS)
- minimize sum of squared errors
- Or, maximize R2
- The model is designed to maximize predictive capacity

- Not the case with Probit/Logit
- MLE models pick distribution parameters so as best describe the data generating process
- May or may not ‘predict’ the outcome well

- LLk log likelihood with all variables
- LL1 log likelihood with only a constant
- 0 > LLk > LL1 so | LLk | < |LL1|
- Pseudo R2 = 1 - |LL1/LLk|
- Bounded between 0-1
- Not anything like an R2 from a regression

- Let b be the estimated value of β
- For any candidate vector of xi , we can predict probabilities, Pi
- Pi = Ф(xib)
- Once you have Pi, pick a threshold value, T, so that you predict
- Yp = 1 if Pi > T
- Yp = 0 if Pi ≤ T

- Question: what value to pick for T?
- Can pick .5
- Intuitive. More likely to engage in the activity than to not engage in it
- However, when the is small, this criteria does a poor job of predicting Yi=1
- However, when the is close to 1, this criteria does a poor job of picking Yi=0

- *predict probability of smoking;
- predict pred_prob_smoke;
- * get detailed descriptive data about predicted prob;
- sum pred_prob, detail;
- * predict binary outcome with 50% cutoff;
- gen pred_smoke1=pred_prob_smoke>=.5;
- label variable pred_smoke1 "predicted smoking, 50% cutoff";
- * compare actual values;
- tab smoker pred_smoke1, row col cell;

- . sum pred_prob, detail;
- Pr(smoker)
- -------------------------------------------------------------
- Percentiles Smallest
- 1% .0959301 .0615221
- 5% .1155022 .0622963
- 10% .1237434 .0633929 Obs 16258
- 25% .1620851 .0733495 Sum of Wgt. 16258
- 50% .2569962 Mean .2516653
- Largest Std. Dev. .0960007
- 75% .3187975 .5619798
- 90% .3795704 .5655878 Variance .0092161
- 95% .4039573 .5684112 Skewness .1520254
- 99% .4672697 .6203823 Kurtosis 2.149247

- Notice two things
- Sample mean of the predicted probabilities is close to the sample mean outcome
- 99% of the probabilities are less than .5
- Should predict few smokers if use a 50% cutoff

- | predicted smoking,
- is current | 50% cutoff
- smoking | 0 1 | Total
- -----------+----------------------+----------
- 0 | 12,153 14 | 12,167
- | 99.88 0.12 | 100.00
- | 74.93 35.90 | 74.84
- | 74.75 0.09 | 74.84
- -----------+----------------------+----------
- 1 | 4,066 25 | 4,091
- | 99.39 0.61 | 100.00
- | 25.07 64.10 | 25.16
- | 25.01 0.15 | 25.16
- -----------+----------------------+----------
- Total | 16,219 39 | 16,258
- | 99.76 0.24 | 100.00
- | 100.00 100.00 | 100.00
- | 99.76 0.24 | 100.00

- Check on-diagonal elements.
- The last number in each 2x2 element is the fraction in the cell
- The model correctly predicts 74.75 + 0.15 = 74.90% of the obs
- It only predicts a small fraction of smokers

- Do not be amazed by the 75% percent correct prediction
- If you said everyone has a chance of smoking (a case of no covariates), you would be correct Max[(,(1-)] percent of the time

- In this case, 25.16% smoke.
- If everyone had the same chance of smoking, we would assign everyone Pr(y=1) = .2516
- We would be correct for the 1 - .2516 = 0.7484 people who do not smoke

- MLE models are not designed to maximize prediction
- Should not be surprised they do not predict well
- In this case, not particularly good measures of predictive capacity

- Pr(yi=1) = Φ[β0 + x1iβ1+ x2iβ2+… xkiβk]
- Suppose that x1i is a continuous variable
- d Pr(yi=1) /d x1i = ?
- What is the change in the probability of an event give a change in x1i?

- d Pr(yi=1) /d x1i
- = β1φ[β0 + x1iβ1+ x2iβ2+… xkiβk]
- Notice two things. Marginal effect is a function of the other parameters and the values of x.

- Pr(yi=1) = Φ[β0 + x1iβ1+ x2iβ2+… xkiβk]
- Suppose that x2i is a dummy variable (1 if yes, 0 if no)
- Marginal effect makes no sense, cannot change x2i by a little amount. It is either 1 or 0.
- Redefine the variable of interest. Compare outcomes with and without x2i

- y1 = Pr(yi=1 | x2i=1)
= Φ[β0 + x1iβ1+ β2 + x3iβ3 +… ]

- y0 = Pr(yi=1 | x2i=0)
= Φ[β0 + x1iβ1+ x3iβ3 … ]

Marginal effect = y1 – y0.

Difference in probabilities with and without x2i?

- Marginal effects for continuous variables, and Change in probabilities for dichotomous outcomes, STATA picks sample means for X’s

- mfx compute;
- Must come after the outcome when estimates are still active in program.

- Marginal effects after probit
- y = Pr(smoker) (predict)
- = .24093439
- ------------------------------------------------------------------------------
- variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X
- ---------+--------------------------------------------------------------------
- age | -.0003951 .00029 -1.36 0.173 -.000964 .000174 38.5474
- incomel | -.0289139 .00472 -6.13 0.000 -.03816 -.019668 10.421
- male*| .0166757 .0072 2.32 0.021 .002568 .030783 .39476
- black*| -.0320621 .01023 -3.13 0.002 -.052111 -.012013 .111945
- hispanic*| -.0658551 .01259 -5.23 0.000 -.090536 -.041174 .060709
- hsgrad*| -.053335 .01302 -4.10 0.000 -.07885 -.02782 .335527
- somecol*| -.1062358 .01228 -8.65 0.000 -.130308 -.082164 .268545
- college*| -.2149199 .01146 -18.76 0.000 -.237378 -.192462 .329376
- worka*| -.0668959 .00756 -8.84 0.000 -.08172 -.052072 .68514
- ------------------------------------------------------------------------------
- (*) dy/dx is for discrete change of dummy variable from 0 to 1

- 10% increase in income will reduce smoking by 2.9 percentage points
- 10 year increase in age will decrease smoking rates .4 percentage points
- Those with a college degree are 21.5 percentage points less likely to smoke
- Those that face a workplace smoking ban have 6.7 percentage point lower probability of smoking

- Do not confuse percentage point and percent differences
- A 6.7 percentage point drop is 29% of the sample mean of 24 percent.
- Blacks have smoking rates that are 3.2 percentage points lower than others, which is 13 percent of the sample mean

- Can generate marginal effects for a specific x
- prchange, x(age=40 black=0 hispanic=0 hsgrad=0 somecol=0 worka=0);
- If an x is not specified, STATA will use the sample mean (e.g., log income in this case)
- Make sure when you specify a particular dummy variable (=1) you set the rest to zero

- probit: Changes in Predicted Probabilities for smoker
- min->max 0->1 -+1/2 -+sd/2 MargEfct
- age -0.0323 -0.0005 -0.0005 -0.0056 -0.0005
- incomel -0.1795 -0.0320 -0.0344 -0.0263 -0.0345
- male 0.0198 0.0198 0.0198 0.0097 0.0198
- black -0.0385 -0.0385 -0.0394 -0.0124 -0.0394
- hispanic -0.0804 -0.0804 -0.0845 -0.0202 -0.0847
- hsgrad -0.0625 -0.0625 -0.0648 -0.0306 -0.0649
- somecol -0.1235 -0.1235 -0.1344 -0.0598 -0.1351
- college -0.2644 -0.2644 -0.2795 -0.1335 -0.2854
- worka -0.0742 -0.0742 -0.0776 -0.0361 -0.0777

- In large samples, MLE estimates are normally distributed
- Null hypothesis, βj=0
- If the null is true and the sample is larges, βj is distributed as a normal with variance σj2.
- Using notes from before, if we divide βj by the standard deviation, we get standard normal

- βj/se(βj) should be N(0,1)
- βj/se(βj) = z-score
- 95% of the distribution of a N(0,1) is between -1.96, 1.96
- Reject null of the z-score > |1.96|
- Only age is statistically insignificant (cannot reject null)

- Normal and logit CDF look:
- Similar in the mid point of the distribution
- Different in the tails

- You obtain more observations in the tails of the distribution when
- Samples sizes are large
- approaches 1 or 0

- These situations will produce more differences in estimates

- Outcome, y=1 or 0
- Treatment, x=1 or 0
- Other covariates, x
- Context,
- x = whether a baby is born with a low weight birth
- x = whether the mom smoked or not during pregnancy

- Risk ratio
RR = Prob(y=1|x=1)/Prob(y=1|x=0)

Differences in the probability of an event when x is and is not observed

How much does smoking elevate the chance your child will be a low weight birth

- Let Yyx be the probability y=1 or 0 given x=1 or 0
- Think of the risk ratio the following way
- Y11 is the probability Y=1 when X=1
- Y10 is the probability Y=1 when X=0

- Odds Ratio
OR=A/B = [Y11/Y01]/[Y10/Y00]

A = [Pr(Y=1|X=1)/Pr(Y=0|X=1)]

= odds of Y occurring if you are a smoker

B = [Pr(Y=1|X=0)/Pr(Y=0|X=0)]

= odds of y happening if you are not a smoker

What are the relative odds of Y happening if you do or do not experience X

- Suppose Pr(Yi =1) = F(βo+ β1Xi + β2Z) and F is the logistic function
- Can show that
- OR = exp(β1) = e β1
- This number is typically reported by most statistical packages

- Details
- Y11 = exp(βo+ β1 + β2Z) /(1+ exp(βo+ β1+ β2Z) )
- Y10 = exp(βo+ β2Z)/(1+ exp(βo+β2Z))
- Y01 = 1 /(1+ exp(βo+ β1 + β2Z) )
- Y00 = 1/(1+ exp(βo+β2Z)
- [Y11/Y01] = exp(βo+ β1 + β2Z)
- [Y10/Y00] = exp(βo+ β2Z)
- OR=A/B = [Y11/Y01]/[Y10/Y00]
= exp(βo+ β1 + β2Z)/ exp(βo + β2Z)

= exp(β1)

- Suppose Y is rare, close to 0
- Pr(Y=0|X=1) and Pr(Y=0|X=0) are both close to 1, so they cancel

- Therefore, when is close to 0
- Odds Ratio = Risk Ratio

- Why is this nice?

- Average outcome in the population
- = (1-) Y10 + Y11 = (1- )Y10 + (RR)Y10
- Average outcomes are a weighted average of outcomes for X=0 and X=1
- What would the average outcome be in the absence of X (e.g., reduce smoking rates to 0)
- Ya = Y10

- PAR
- Fraction of outcome attributed to X
- The difference between the current rate and the rate that would exist without X, divided by the current rate
- PAR = ( – Ya)/
= (RR – 1)/[(1-) + RR]

- 6% births are low weight
- < 2500 grams (
- Average birth is 3300 grams (5.5 lbs)

- Maternal smoking during pregnancy has been identified as a key cofactor
- 13% of mothers smoke
- This number was falling about 1 percentage point per year during 1980s/90s
- Doubles chance of low weight birth

- Census of all births (4 million/year)
- Annual files starting in the 60s
- Information about
- Baby (birth weight, length, date, sex, plurality, birth injuries)
- Demographics (age, race, marital, educ of mom)
- Birth (who delivered, method of delivery)
- Health of mom (smoke/drank during preg, weight gain)

- Smoking not available from CA or NY
- ~3 million usable observations
- I pulled .5% random sample from 1995
- About 12,500 obs
- Variables: birthweight (grams), smoked, married, 4-level race, 5 level education, mothers age at birth

- ------------------------------------------------------------------------------
- > -
- storage display value
- variable name type format label variable label
- ------------------------------------------------------------------------------
- > -
- birthw int %9.0g birth weight in grams
- smoked byte %9.0g =1 if mom smoked during
- pregnancy
- age byte %9.0g moms age at birth
- married byte %9.0g =1 if married
- race4 byte %9.0g 1=white,2=black,3=asian,4=other
- educ5 byte %9.0g 1=0-8, 2=9-11, 3=12, 4=13-15,
- 5=16+
- visits byte %9.0g prenatal visits
- ------------------------------------------------------------------------------

- dummy |
- variable, |
- =1 | =1 if mom smoked
- ifBW<2500 | during pregnancy
- grams | 0 1 | Total
- -----------+----------------------+----------
- 0 | 11,626 1,745 | 13,371
- | 86.95 13.05 | 100.00
- | 94.64 89.72 | 93.96
- | 81.70 12.26 | 93.96
- -----------+----------------------+----------
- 1 | 659 200 | 859
- | 76.72 23.28 | 100.00
- | 5.36 10.28 | 6.04
- | 4.63 1.41 | 6.04
- -----------+----------------------+----------
- Total | 12,285 1,945 | 14,230
- | 86.33 13.67 | 100.00
- | 100.00 100.00 | 100.00
- | 86.33 13.67 | 100.00

- Notice a few things
- 13.7% of women smoke
- 6% have low weight birth

- Pr(LBW | Smoke) =10.28%
- Pr(LBW |~ Smoke) = 5.36%
- RR
= Pr(LBW | Smoke)/ Pr(LBW |~ Smoke)

= 0.1028/0.0536 = 1.92

- Log likelihood = -3136.9912 Pseudo R2 = 0.0330
- ------------------------------------------------------------------------------
- lowbw | Coef. Std. Err. z P>|z| [95% Conf. Interval]
- -------------+----------------------------------------------------------------
- smoked | .6740651 .0897869 7.51 0.000 .4980861 .8500441
- age | .0080537 .006791 1.19 0.236 -.0052564 .0213638
- married | -.3954044 .0882471 -4.48 0.000 -.5683654 -.2224433
- _Ieduc5_2 | -.1949335 .1626502 -1.20 0.231 -.5137221 .1238551
- _Ieduc5_3 | -.1925099 .1543239 -1.25 0.212 -.4949791 .1099594
- _Ieduc5_4 | -.4057382 .1676759 -2.42 0.016 -.7343769 -.0770994
- _Ieduc5_5 | -.3569715 .1780322 -2.01 0.045 -.7059081 -.0080349
- _Irace4_2 | .7072894 .0875125 8.08 0.000 .5357681 .8788107
- _Irace4_3 | .386623 .307062 1.26 0.208 -.2152075 .9884535
- _Irace4_4 | .3095536 .2047899 1.51 0.131 -.0918271 .7109344
- _cons | -2.755971 .2104916 -13.09 0.000 -3.168527 -2.343415
- ------------------------------------------------------------------------------

- Smoked
- exp(0.674) = 1.96
- Smokers are twice as likely to have a low weight birth

- _Irace4_2 (Blacks)
- exp(0.707) = 2.02
- Blacks are twice as likely to have a low weight birth

- Logistic y x1 x2;
- In this case
- xi: logistic lowbw smoked age married i.educ5 i.race4;

- Log likelihood = -3136.9912 Pseudo R2 = 0.0330
- ------------------------------------------------------------------------------
- lowbw | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
- -------------+----------------------------------------------------------------
- smoked | 1.962198 .1761796 7.51 0.000 1.645569 2.33975
- age | 1.008086 .0068459 1.19 0.236 .9947574 1.021594
- married | .6734077 .0594262 -4.48 0.000 .5664506 .8005604
- _Ieduc5_2 | .8228894 .1338431 -1.20 0.231 .5982646 1.131852
- _Ieduc5_3 | .8248862 .1272996 -1.25 0.212 .6095837 1.116233
- _Ieduc5_4 | .6664847 .1117534 -2.42 0.016 .4798043 .9257979
- _Ieduc5_5 | .6997924 .1245856 -2.01 0.045 .4936601 .9919973
- _Irace4_2 | 2.028485 .1775178 8.08 0.000 1.70876 2.408034
- _Irace4_3 | 1.472001 .4519957 1.26 0.208 .8063741 2.687076
- _Irace4_4 | 1.362817 .2790911 1.51 0.131 .9122628 2.035893
- ------------------------------------------------------------------------------

- PAR = (RR – 1)/[(1-) + RR]
- = 0.137
- RR = 1.96
- PAR = 0.116
- 11.6% of low weight births attributed to maternal smoking

- MLE are asymptotically normally distributed, one of the properties of MLE
- Therefore, standard t-tests of hypothesis will work as long as samples are ‘large’
- What ‘large’ means is open to question
- What to do when samples are ‘small’ – table for a moment

- Suppose you have a probit model
- Φ[β0 + x1iβ1+ x2iβ2 + x3iβ3 +… ]

- N observations
- K parameters
- J restrictions (count the equals signs, j=4)

- Based on the fact that the parameters are distributed asymptotically normal
- Probability theory review
- Suppose you have m draws from a standard normal distribution (zi)
- M = z12 + z22 + …. zm2
- M is distributed as a Chi-square with m degrees of freedom

- Wald test constructs a ‘quadratic form’ suggested by the test you want to perform
- This combination, because it contains squares of the true parameters, should, if the hypothesis is true, be distributed as a Chi square with J degrees of freedom.
- If the test statistic is ‘large’, relative to the degrees of freedom of the test, we reject, because there is a low probability we would have drawn that value at random from the distribution

- All stats books will report the ‘percentiles’ of a chi-square
- Vertical axis (degrees of freedom)
- Horizontal axis (percentiles)
- Entry is the value where ‘percentile’ of the distribution falls below

- Example: Suppose 4 restrictions
- 95% of a chi-square distribution falls below 9.488.
- So there is only a 5% a number drawn at random will exceed 9.488
- If your test statistic is below, cannot reject null
- If your test statistics is above, reject null

- Default test in MLE models
- Easy to do. Look at program
- test hsgrad somecol college

- . test hsgrad somecol college;
- ( 1) hsgrad = 0
- ( 2) somecol = 0
- ( 3) college = 0
- chi2( 3) = 504.78
- Prob > chi2 = 0.0000

- Notice the higher value of the test statistic. There is a low chance that a variable, drawn at random from a ch-square with three degrees of freedom will be this large.
- Reject null

- * how to run the same tests with a -2 log like test;
- * estimate the unresticted model and save the estimates ;
- * in urmodel;
- probit smoker age incomel male black hispanic
- hsgrad somecol college worka;
- estimates store urmodel;
- * estimate the restricted model. save results in rmodel;
- probit smoker age incomel male black hispanic
- worka;
- estimates store rmodel;
- lrtest urmodel rmodel;

- I prefer -2 log likelihood test
- Estimates the restricted and unrestricted model
- Therefore, has more power than a Wald test

- In most cases, they give the same ‘decision’ (reject/not reject)

Section VI

Categorical Data

- Many discrete outcomes are to questions that have a natural ordering but no quantitative interpretation:
- Examples:
- Self reported health status
- (excellent, very good, good, fair, poor)

- Do you agree with the following statement
- Strongly agree, agree, disagree, strongly disagree

- Self reported health status

- Can use the same type of model as in the previous section to analyze these outcomes
- Another ‘latent variable’ model
- Key to the model: there is a monotonic ordering of the qualitative responses

- Excellent, very good, good, fair, poor
- Coded as 1, 2, 3, 4, 5 on National Health Interview Survey
- We will code as 5,4,3,2,1 (easier to think of this way)
- Asked on every major health survey
- Important predictor of health outcomes, e.g. mortality
- Key question: what predicts health status?

- Important to note – the numbers 1-5 mean nothing in terms of their value, just an ordering to show you the lowest to highest
- The example below is easily adapted to include categorical variables with any number of outcomes

- yi* = latent index of reported health
- The latent index measures your own scale of health. Once yi* crosses a certain value you report poor, then good, then very good, then excellent health

- yi = (1,2,3,4,5) for (fair, poor, VG, G, excel)
- Interval decision rule
- yi=1 if yi* ≤ u1
- yi=2 if u1 < yi* ≤ u2
- yi=3 if u2 < yi* ≤ u3
- yi=4 if u3 < yi* ≤ u4
- yi=5 if yi* > u4

- As with logit and probit models, we will assume yi* is a function of observed and unobserved variables
- yi* = β0 + x1i β1 + x2i β2 …. xki βk + εi
- yi* = xi β + εi

- The threshold values (u1, u2, u3, u4) are unknown. We do not know the value of the index necessary to push you from very good to excellent.
- In theory, the threshold values are different for everyone
- Computer will not only estimate the β’s, but also the thresholds – average across people

- As with probit and logit, the model will be determined by the assumed distribution of ε
- In practice, most people pick nornal, generating an ‘ordered probit’ (I have no idea why)
- We will generate the math for the probit version

- Lets do the outliers, Pr(yi=1) and Pr(yi=5) first
- Pr(yi=1)
- = Pr(yi* ≤ u1)
- = Pr(xi β +εi ≤ u1 )
- =Pr(εi ≤ u1 - xi β)
- = Φ[u1 - xi β] = 1- Φ[xi β – u1]

- Pr(yi=5)
- = Pr(yi* > u4)
- = Pr(xi β +εi > u4 )
- =Pr(εi > u4 - xi β)
- = 1 - Φ[u4 - xi β] = Φ[xi β – u4]

- Pr(yi=3) = Pr(u2 < yi* ≤ u3)
= Pr(yi* ≤ u3) – Pr(yi* ≤ u2)

= Pr(xi β +εi≤ u3) – Pr(xi β +εi≤ u2)

= Pr(εi≤ u3- xi β) - Pr(εi≤ u2 - xi β)

= Φ[u3- xi β] - Φ[u2 - xi β]

= 1 - Φ[xi β - u3] – 1 + Φ[xi β - u2]

= Φ[xi β - u2] - Φ[xi β - u3]

- Pr(yi=1) = 1- Φ[xi β – u1]
- Pr(yi=2) = Φ[xi β – u1] - Φ[xi β – u2]
- Pr(yi=3) = Φ[xi β – u2] - Φ[xi β – u3]
- Pr(yi=4) = Φ[xi β – u3] - Φ[xi β – u4]
- Pr(yi=5) = Φ[xi β – u4]

- There are 5 possible choices for each person
- Only 1 is observed
- L = Σi ln[Pr(yi=k)] for k

- Cancer control supplement to 1994 National Health Interview Survey
- Question: what observed characteristics predict self reported health (1-5 scale)
- 1=poor, 5=excellent
- Key covariates: income, education, age, current and former smoking status
- Programs
- sr_health_status.do, .dta, .log

- desc;
- male byte %9.0g =1 if male
- age byte %9.0g age in years
- educ byte %9.0g years of education
- smoke byte %9.0g current smoker
- smoke5 byte %9.0g smoked in past 5 years
- black float %9.0g =1 if respondent is black
- othrace float %9.0g =1 if other race (white is ref)
- sr_health float %9.0g 1-5 self reported health,
- 5=excel, 1=poor
- famincl float %9.0g log family income

- tab sr_health;
- 1-5 self |
- reported |
- health, |
- 5=excel, |
- 1=poor | Freq. Percent Cum.
- ------------+-----------------------------------
- 1 | 342 2.65 2.65
- 2 | 991 7.68 10.33
- 3 | 3,068 23.78 34.12
- 4 | 3,855 29.88 64.00
- 5 | 4,644 36.00 100.00
- ------------+-----------------------------------
- Total | 12,900 100.00

- oprobit sr_health male age educ famincl black othrace smoke smoke5;

- Ordered probit estimates Number of obs = 12900
- LR chi2(8) = 2379.61
- Prob > chi2 = 0.0000
- Log likelihood = -16401.987 Pseudo R2 = 0.0676
- ------------------------------------------------------------------------------
- sr_health | Coef. Std. Err. z P>|z| [95% Conf. Interval]
- -------------+----------------------------------------------------------------
- male | .1281241 .0195747 6.55 0.000 .0897583 .1664899
- age | -.0202308 .0008499 -23.80 0.000 -.0218966 -.018565
- educ | .0827086 .0038547 21.46 0.000 .0751535 .0902637
- famincl | .2398957 .0112206 21.38 0.000 .2179037 .2618878
- black | -.221508 .029528 -7.50 0.000 -.2793818 -.1636341
- othrace | -.2425083 .0480047 -5.05 0.000 -.3365958 -.1484208
- smoke | -.2086096 .0219779 -9.49 0.000 -.2516855 -.1655337
- smoke5 | -.1529619 .0357995 -4.27 0.000 -.2231277 -.0827961
- -------------+----------------------------------------------------------------
- _cut1 | .4858634 .113179 (Ancillary parameters)
- _cut2 | 1.269036 .11282
- _cut3 | 2.247251 .1138171
- _cut4 | 3.094606 .1145781
- ------------------------------------------------------------------------------

- Marginal effects/changes in probabilities are now a function of 2 things
- Point of expansion (x’s)
- Frame of reference for outcome (y)

- STATA
- Picks mean values for x’s
- You pick the value of y

- Consider y=5
- d Pr(yi=5)/dxi
= d Φ[xi β – u4]/dxi = βφ[xi β – u4]

- Consider y=3
- d Pr(yi=3)/dxi = βφ[xi β – u3] - βφ[xi β – u4]

- xiβ = β0 + x1i β1 + x2i β2 …. xki βk
- X2i is yes or no (1 or 0)

- ΔPr(yi=5) =
- Φ[β0 + x1i β1 + β2 + x3i β3 +.. xki βk]
- Φ[β0 + x1i β1 + x3i β3 …. xki βk]

- Change in the probabilities when x2i=1 and x2i=0

- mfx compute, predict(outcome(5));

- mfx compute, predict(outcome(5));
- Marginal effects after oprobit
- y = Pr(sr_health==5) (predict, outcome(5))
- = .34103717
- ------------------------------------------------------------------------------
- variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X
- ---------+--------------------------------------------------------------------
- male*| .0471251 .00722 6.53 0.000 .03298 .06127 .438062
- age | -.0074214 .00031 -23.77 0.000 -.008033 -.00681 39.8412
- educ | .0303405 .00142 21.42 0.000 .027565 .033116 13.2402
- famincl | .0880025 .00412 21.37 0.000 .07993 .096075 10.2131
- black*| -.0781411 .00996 -7.84 0.000 -.097665 -.058617 .124264
- othrace*| -.0843227 .01567 -5.38 0.000 -.115043 -.053602 .04124
- smoke*| -.0749785 .00773 -9.71 0.000 -.09012 -.059837 .289147
- smoke5*| -.0545062 .01235 -4.41 0.000 -.078719 -.030294 .081395
- ------------------------------------------------------------------------------
- (*) dy/dx is for discrete change of dummy variable from 0 to 1

- Males are 4.7 percentage points more likely to report excellent
- Each year of age decreases chance of reporting excellent by 0.7 percentage points
- Current smokers are 7.5 percentage points less likely to report excellent health

- Wald tests/-2 log likelihood tests are done the exact same was as in PROBIT and LOGIT
- Tests of individual parameters are done the same way (z-score)

- Use PRCHANGE to calculate marginal effect for a specific person
prchange, x(age=40 black=0 othrace=0 smoke=0 smoke5=0 educ=16);

- When a variable is NOT specified (famincl), STATA takes the sample mean.

- PRCHANGE will produce results for all outcomes
- male
- Avg|Chg| 1 2 3 4
- 0->1 .0203868 -.0020257 -.00886671 -.02677558 -.01329902
- 5
- 0->1 .05096698

- age
- Avg|Chg| 1 2 3 4
- Min->Max .13358317 .0184785 .06797072 .17686112 .07064757
- -+1/2 .00321942 .00032518 .00141642 .00424452 .00206241
- -+sd/2 .03728014 .00382077 .01648743 .04910323 .0237889
- MargEfct .00321947 .00032515 .00141639 .00424462 .00206252

Section VII

Count Data Models

- Many outcomes of interest are integer counts
- Doctor visits
- Low work days
- Cigarettes smoked per day
- Missed school days

- OLS models can easily handle some integer models

- Example
- SAT scores are essentially integer values
- Few at ‘tails’
- Distribution is fairly continuous
- OLS models well

- In contrast, suppose
- High fraction of zeros
- Small positive values

- OLS models will
- Predict negative values
- Do a poor job of predicting the mass of observations at zero

- Example
- Dr visits in past year, Medicare patients(65+)
- 1987 National Medical Expenditure Survey
- Top code (for now) at 10
- 17% have no visits

- visits | Freq. Percent Cum.
- ------------+-----------------------------------
- 0 | 915 17.18 17.18
- 1 | 601 11.28 28.46
- 2 | 533 10.01 38.46
- 3 | 503 9.44 47.91
- 4 | 450 8.45 56.35
- 5 | 391 7.34 63.69
- 6 | 319 5.99 69.68
- 7 | 258 4.84 74.53
- 8 | 216 4.05 78.58
- 9 | 192 3.60 82.19
- 10 | 949 17.81 100.00
- ------------+-----------------------------------
- Total | 5,327 100.00

- yi is drawn from a Poisson distribution
- Poisson parameter varies across observations
- f(yi;λi) =e-λi λiyi/yi! For λi>0
- E[yi]= Var[yi] = λi = f(xi, β)

- λi must be positive at all times
- Therefore, we CANNOT let λi = xiβ
- Let λi = exp(xiβ)
- ln(λi) = (xiβ)

- d ln(λi)/dxi = β
- Remember that d ln(λi) = dλi/λi
- Interpret β as the percentage change in mean outcomes for a change in x

- Variance grows with the mean
- E[yi]= Var[yi] = λi = f(xi, β)

- Most data sets have over dispersion, where the variance grows faster than the mean
- In dr. visits sample, = 5.6, s=6.7
- Impose Mean=Var, severe restriction and you tend to reduce standard errors

- Where γi = exp(xiβ) and δ ≥ 0
- E[yi] = δγi = δexp(xiβ)
- Var[yi] = δ (1+δ) γi
- Var[yi]/ E[yi] = (1+δ)

- δ must always be ≥ 0
- In this case, the variance grows faster than the mean
- If δ=0, the model collapses into the Poisson
- Always estimate negative binomial
- If you cannot reject the null that δ=0, report the Poisson estimates

- Notice that ln(E[yi]) = ln(δ) + ln(γi), so
- d ln(E[yi]) /dxi = β
- Parameters have the same interpretation as in the Poisson model

- POISSON estimates a MLE model for poisson
- Syntax
POISSON y independent variables

- Syntax
- NBREG estimates MLE negative binomial
- Syntax
NBREG y independent variables

- Syntax

- Those with CHRONIC condition have 50% more mean MD visits
- Those in EXCELent health have 78% fewer MD visits
- BLACKS have 33% fewer visits than whites
- Income elasticity is 0.021, 10% increase in income generates a 2.1% increase in visits

- Interpret results the same was as Poisson
- Look at coefficient/standard error on delta
- Ho: delta = 0 (Poisson model is correct)
- In this case, delta = 5.21 standard error is 0.15, easily reject null.
- Var/Mean = 1+delta = 6.21, Poisson is mis-specificed, should see very small standard errors in the wrong model

Section VIII

Duration Data

- Sometimes we have data on length of time of a particular event or ‘spells’
- Time until death
- Time on unemployment
- Time to complete a PhD

- Techniques we will discuss were originally used to examine lifespan of objects like light bulbs or machines. These models are often referred to as “time to failure”

- T is a random variable that indicates duration (time til death, find a new job, etc)
- t is the realization of that variable
- f(t) is a PDF that describes the process that determines the time to failure
- CDF is F(t) represents the probability an event will happen by time t

- F(t) represents the probability that the event happens by ‘t’.
- What is the probability a person will die on or before the 65th birthday?

- Survivor function, what is the chance you live past (t)
- S(t) = 1 – F(t)
- If 10% of a cohort dies by their 65th birthday, 90% will die sometime after their 65th birthday

- Hazard function, h(t)
- What is the probability the spell will end at time t, given that it has already lasted t
- What is the chance you find a new job in month 12 given that you’ve been unemployed for 12 months already

- PDF, CDF (Failure function), survivor function and hazard function are all related
- λ(t) = f(t)/S(t) = f(t)/(1-F(t))
- We focus on the ‘hazard’ rate because its relationship to time indicates ‘duration dependence’

- Example: suppose the longer someone is out of work, the lower the chance they will exit unemployment – ‘damaged goods’
- This is an example of duration dependence, the probability of exiting a state of the world is a function of the length

- Mathematically
- d λ(t) /dt = 0 then there is no duration dep.
- d λ(t) /dt > 0 there is + duration dependence
the probability the spell will end

increases with time

- d λ(t) /dt < 0 there is – duration dependence
the probability the spell will end

decreases over time

- Your choice, is to pick values for f(t) that have +, - or no duration dependence

- Exponential
- λ(t)= λ
- Hazard is the same over time, a ‘memory less’ process

- Weibull
- F(t) = 1 – exp(-γtρ) where ρ,γ > 0
- λ(t) = ργρ-1
- if ρ>1, increasing hazard
- if ρ<1, decreasing hazard
- if ρ=1, exponential

- Others: Lognormal, log-logistic, Gompertz.
- Little more difficult – can examine when you get comfortable with Weibull

- Most data sets have ‘censored’ spells
- Follow people over time
- All will eventually die, but some do not in your period of analysis
- Incomplete spells or censored data

- Must build into the log likelihood function

- Let ti be the duration we observe for all people
- Some people die, and their they lived until period ti
- Others are observed for ti periods, but do not
- Let di=1 if data is complete spell
- di=1 if incomplete

- Recall, that f(s) is the PDF for someone who dies at period s
- F(t) is the probability you die by t
- 1-F(t) = the probability you die after (t)

- If di=1 then we observe f(ti), someone who died in period ti
- If di=0 then someone lived past period ti and the probability of that is [1-F(ti)]
- L = Σi {di ln[f(ti)] + (1-di)ln[1-F(ti)]}

- Look at exponential
- λ(t)= λ
- Allow this to vary across people
- λi(t)= λi
- But like Poisson, λi is always positive
- Let λi = exp(β0 + x1i β1 + x2i β2 …. xki βk)

- In the Weibull
- λ(t) = αγtα-1
- Allow it to vary across people
- λi(t) = αγi tα-1
- γi = exp(β0 + x1i β1 + x2i β2 …. xki βk)

- This is the same for both Weibull and Exponential
- In Weibull, λ(ti ) = αγitα-1
- Suppose x1i is a dummy variable
- When xi1=1, then
- γi1 = eβ0 + β1 + x2i β2 …. xki βk
- When xi1=0, then
- γi0 = eβ0 + x2i β2 …. xki βk

- When you construct the ratio of γi1/ γi0, all the others parameters cancel, so
- (αγi1 tα-1 – αγi0 tα-1 ) / αγi0 tα-1 = eβ1 -1
- Percentage change in the hazard when x1i turns from 0 to 1.
- STATA prints out eβ1, just subtract 1

- Suppose we increase x2i by 1 unit
- γi1 = exp(β0 + β1x1i + x2iβ2 …. xkiβk)
- γi2 = exp(β0 + β1 (x1i+1) + x2iβ2 …. xkiβk)
- Can show that
- (αγi1 tα-1 – αγi0 tα-1 ) / αγi0 tα-1 = eβ1 -1
- = exp(β2) – 1
- Percentage change in the hazard for 1 unit increase in x

- NHIS
- annual survey of 60K households
- Data on individuals
- Self-reported healthm DR visits, lost workdays, etc.

- MCOD
- Linked NHIS respondents from 1986-1994 to National Death Index through Dec 31, 1995
- Identified whether respondent died and of what cause

- Our sample
- Males, 50-70, who were married at the time of the survey
- 1987-1989 surveys
- Give everyone 5 years (60 months) of followup

- max_mths maximum months in the survey.
- Diedin5 respondent died during the 5 years of followup
- Note if diedn5=0, the max_mths=60. Diedin5 identifies whether the data is censored or not.

- Variable | Obs Mean Std. Dev. Min Max
- -------------+--------------------------------------------------------
- age_s_yrs | 26654 59.42586 5.962435 50 70
- max_mths | 26654 56.49077 11.15384 0 60
- black | 26654 .0928566 .2902368 0 1
- hispanic | 26654 .0454716 .20834 0 1
- income | 26654 3.592181 1.327325 1 5
- -------------+--------------------------------------------------------
- educ | 26654 2.766677 .961846 1 4
- diedin5 | 26654 .1226082 .3279931 0 1

- Need to identify which is the duration data
stset length, failure(failvar)

- Length=duration variable
- Failvar=1 when durations end in failure, =0 for censored values

- In our case
- stset max_mths, failure(diedin5)

- Tabular presentation of results
sts list

- Graphical presentation
sts graph

- Results by subgroup
sts graph, by(educ)

- Basic syntax
- streg covariates,d(distribution)
- streg age_s_yrs black hispanic _Ie* _Ii*, d(weibull);
- In this model, STATA will print out exp(β)
- If you want the coefficients, add ‘nohr’ option (no hazard ratio)

- No. of subjects = 26631 Number of obs = 26631
- No. of failures = 3245
- Time at risk = 1505705
- LR chi2(10) = 595.74
- Log likelihood = -12425.055 Prob > chi2 = 0.0000
- ------------------------------------------------------------------------------
- _t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
- -------------+----------------------------------------------------------------
- age_s_yrs | .0452588 .0031592 14.33 0.000 .0390669 .0514508
- black | .4770152 .0511122 9.33 0.000 .3768371 .5771932
- hispanic | .1333552 .082156 1.62 0.105 -.0276676 .294378
- _Ieduc_2 | .0093353 .0591918 0.16 0.875 -.1066786 .1253492
- _Ieduc_3 | -.072163 .0503131 -1.43 0.151 -.1707748 .0264488
- _Ieduc_4 | -.1301173 .0657131 -1.98 0.048 -.2589126 -.0013221
- _Iincome_2 | -.1867752 .0650604 -2.87 0.004 -.3142914 -.0592591
- _Iincome_3 | -.3268927 .0688635 -4.75 0.000 -.4618627 -.1919227
- _Iincome_4 | -.5166137 .0769202 -6.72 0.000 -.6673747 -.3658528
- _Iincome_5 | -.5425447 .0722025 -7.51 0.000 -.684059 -.4010303
- _cons | -9.201724 .2266475 -40.60 0.000 -9.645945 -8.757503
- -------------+----------------------------------------------------------------
- /ln_p | .1585315 .0172241 9.20 0.000 .1247729 .1922901
- -------------+----------------------------------------------------------------
- p | 1.171789 .020183 1.132891 1.212022
- 1/p | .8533961 .014699 .8250675 .8826974
- ------------------------------------------------------------------------------

- The sign of the parameters is informative
- Hazard increasing in age
- Blacks, hispanics have higher mortality rates
- Hazard decreases with income and age

- The parameter ρ= 1.17.
- Check 95% confidence interval (1.13, 1.21). Can reject null p=1 (exponential)
- Hazard is increasing over time

- No. of subjects = 26631 Number of obs = 26631
- No. of failures = 3245
- Time at risk = 1505705
- LR chi2(10) = 595.74
- Log likelihood = -12425.055 Prob > chi2 = 0.0000
- ------------------------------------------------------------------------------
- _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
- -------------+----------------------------------------------------------------
- age_s_yrs | 1.046299 .0033055 14.33 0.000 1.03984 1.052797
- black | 1.611258 .082355 9.33 0.000 1.457667 1.781032
- hispanic | 1.142656 .093876 1.62 0.105 .9727116 1.342291
- _Ieduc_2 | 1.009379 .059747 0.16 0.875 .8988145 1.133544
- _Ieduc_3 | .9303792 .0468103 -1.43 0.151 .8430114 1.026802
- _Ieduc_4 | .8779924 .0576956 -1.98 0.048 .7718905 .9986788
- _Iincome_2 | .8296302 .0539761 -2.87 0.004 .7303062 .9424625
- _Iincome_3 | .7211611 .0496617 -4.75 0.000 .6301089 .8253706
- _Iincome_4 | .5965372 .0458858 -6.72 0.000 .5130537 .6936049
- _Iincome_5 | .5812672 .041969 -7.51 0.000 .5045648 .6696297
- -------------+----------------------------------------------------------------
- /ln_p | .1585315 .0172241 9.20 0.000 .1247729 .1922901
- -------------+----------------------------------------------------------------
- p | 1.171789 .020183 1.132891 1.212022
- 1/p | .8533961 .014699 .8250675 .8826974
- ------------------------------------------------------------------------------

- Age: every year hazard increases by 4.6%
- Black: have 61% greater hazard than whites
- Hispanics: 14% greater hazard than non-hispanics
- Educ 2, 3, 4 are some 9-11, 12-15 and 16+ years of school

- Educ 3: those with 12-15 years of educ have .93 – 1 = -0.07 or a 7% lower hazard than those with <9 years of school
- Educ 4: those with a college degree have 0.88 – 1 = -0.12 or a 12% lower hazard than those with <9 years of school

- Income 2-5 are dummies for people with $10-$20K, $20-$30K, $30-$40K, >$40K
- Income 2: Those with $10-$20K have 0.83 – 1 = -0.17 or a 17% lower hazard than those with income <$10K
- Income 5, those with >$40K in income have a 0.58 – 1 = -0.42 or a 42% lower hazard than those with income <$10K

- Time varying covariates
- Competing risk models
- Die from multiple causes

- Cox proportional hazard model
- Heterogeneity in baseline hazard