
Matt Dull, Center for Public Administration & Policy


“Missing” Something? Using the Heckman Selection Model in Policy Research

GWIPP Policy Research Methods Workshop, March 3, 2010

Matt Dull

Center for Public Administration & Policy

- Selection bias is a pervasive problem in policy research.
- This presentation offers a nontechnical introduction to models designed to correct sample selection bias in a regression context.
- I’ll describe how variations on Heckman’s (1976) classic model, designed to correct for bias due to missingness in a regression model’s dependent variable, produce unbiased parameter estimates and yield potentially rich opportunities for (cautious) inference.

- I’ll show how the full maximum likelihood Heckman model is implemented in Stata.
- I’ll describe two applications from my own research, where theory predicts missingness in the dependent variable and variations on the Heckman model yield substantively interesting results.
- The first application comes from analysis of survey data with a large number of “I don’t know” or “No-basis to judge” responses;
- The second looks at the allocation of resources through a federal competitive grant program.

- Anyone who performs statistical analysis eventually encounters problems of missing data.
- In Stata and other statistical packages the default strategy for dealing with missing observations is listwise deletion. Cases with missing values are dropped from the analysis.
- There are advantages to this strategy. It is simple, can be applied to any kind of statistical analysis, and under a range of circumstances yields unbiased estimates (Allison 2001).

- There are also some clear disadvantages to listwise deletion.
- Listwise deletion wastes information, often resulting in the loss of a substantial number of observations.
- If missingness in the dependent variable does not meet fairly strict assumptions for randomness, listwise deletion yields biased parameter estimates.
- The assumption that data on the dependent variable are “missing at random” is defined in quite precise terms in Allison (2001) and Rubin (1976). For today’s purposes it is enough to say that the assumption fails if missingness on Y is related to the value of Y, controlling for other variables in the model.
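The bias from listwise deletion can be seen in a small simulation. The sketch below is illustrative only: the data-generating process (true slope of 2, standard normal errors), the cutoff at Y = 1, and all variable names are assumptions made for the example, not anything taken from the applications in this talk. It fits a bivariate OLS slope on the full data and again after dropping every case whose Y falls at or above the cutoff, an extreme case of missingness that depends on Y itself.

```python
import random

def ols_slope(xs, ys):
    """Slope from a bivariate OLS regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

random.seed(0)
n = 20000

# Assumed data-generating process: y = 1 + 2x + e, with x and e standard normal.
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]

# Slope using every observation.
slope_full = ols_slope(xs, ys)

# Missingness depends on Y: cases with y >= 1 are "missing" and get dropped.
kept = [(x, y) for x, y in zip(xs, ys) if y < 1.0]
slope_listwise = ols_slope([x for x, _ in kept], [y for _, y in kept])

print("slope, full data:        ", round(slope_full, 3))
print("slope, listwise deletion:", round(slope_listwise, 3))
```

Under these assumptions the full-data slope should sit near the true value of 2, while the slope from the retained cases is pulled toward zero.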

- Contemporary survey researchers frame the decision to register “no basis to judge” (NB) or other non-response variants such as “don’t know” or “no opinion” as a function of three factors: cognitive availability or whether a clear answer can be easily retrieved; a judgment about the adequacy of an answer given expectations; and communicative intent or motivation (Beatty and Herrmann 2002).
- NB respondents may feel uncertain or believe they lack necessary information, and in this sense the NB category enhances the validity of the measure.

- Or, an NB response may instead indicate ambivalence; the respondent may feel less uncertain than conflicted about the prospects and usefulness of reform. NB respondents may also wish to avoid sending an undesirable or unflattering signal.
- Or, they may engage in “survey satisficing,” responding NB to avoid the effort of synthesizing opinions for which they have all the necessary ingredients (Krosnick 2002; Krosnick et al. 2002).

tab gpra_answer

gpra_answer |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,064       42.44       42.44
          1 |      1,443       57.56      100.00
------------+-----------------------------------
      Total |      2,507      100.00

probit gpra_answer leadership conflict_index

Iteration 0:   log likelihood = -1614.6834
Iteration 1:   log likelihood = -1597.7919
Iteration 2:   log likelihood = -1597.7897

Probit regression                                 Number of obs   =       2387
                                                  LR chi2(2)      =      33.79
                                                  Prob > chi2     =     0.0000
Log likelihood = -1597.7897                       Pseudo R2       =     0.0105

------------------------------------------------------------------------------
 gpra_answer |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  leadership |   .1476403   .0254997     5.79   0.000     .0976619    .1976187
conflict_i~x |   .0512077   .0220635     2.32   0.020     .0079641    .0944513
       _cons |  -.4254927   .1288101    -3.30   0.001    -.6779558   -.1730296
------------------------------------------------------------------------------

The Heckman technique estimates a two-stage model:

First, a selection equation with a dichotomous dependent variable equaling 1 for observed and 0 for missing values.

Second, an outcome equation predicting the model’s dependent variable. The second stage includes an additional variable – the inverse Mills ratio – derived from the first-stage probit estimates.
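The two stages can be written compactly. The notation below is a common textbook convention and is an assumption here, not taken from the slides: z and gamma for the selection equation, x and beta for the outcome equation, and phi and Phi for the standard normal density and distribution function.

```latex
\[
\begin{aligned}
\text{Selection:}\quad & s_i^{*} = z_i\gamma + u_i, \qquad s_i = 1 \text{ if } s_i^{*} > 0,\ \text{and } 0 \text{ otherwise} \\
\text{Outcome:}\quad & y_i = x_i\beta + \varepsilon_i, \qquad y_i \text{ observed only if } s_i = 1 \\
\text{Errors:}\quad & (\varepsilon_i, u_i) \text{ bivariate normal},\quad
  \operatorname{Var}(\varepsilon_i) = \sigma^{2},\ \operatorname{Var}(u_i) = 1,\ \operatorname{Corr}(\varepsilon_i, u_i) = \rho \\
\text{Implication:}\quad & E[\,y_i \mid s_i = 1\,] = x_i\beta + \rho\sigma\,\lambda(z_i\gamma),
  \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
\end{aligned}
\]
```

The last line shows why the inverse Mills ratio enters the second stage: among observed cases, the outcome error has a nonzero mean whenever rho differs from zero, and that mean is proportional to lambda.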

Kennedy (1998) states that the two-stage Heckman model does not perform well when:

- The errors are not normally distributed;
- The sample size is small;
- The amount of censoring is small;
- The correlation between the errors of the regression and selection equations is small; and
- The degree of collinearity between the explanatory variables in the regression and selection models is high.
NOTE: By default, the heckman and heckprob commands in Stata do not estimate Heckman’s original two-stage model, but full maximum likelihood censored regression and censored probit models. (The heckman command offers a twostep option for the original two-step estimator.)
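For reference, the log likelihood maximized in the full maximum likelihood censored regression model is usually written as follows. The symbols are assumptions of this sketch: s is the selection indicator, z and gamma belong to the selection equation, x and beta to the outcome equation, and weights are omitted.

```latex
\[
\ell_i =
\begin{cases}
\ln \Phi(-z_i\gamma) & \text{if } s_i = 0 \text{ (censored)} \\[6pt]
\ln \Phi\!\left( \dfrac{z_i\gamma + \rho\,(y_i - x_i\beta)/\sigma}{\sqrt{1-\rho^{2}}} \right)
  - \dfrac{1}{2}\left( \dfrac{y_i - x_i\beta}{\sigma} \right)^{2}
  - \ln\!\left( \sigma\sqrt{2\pi} \right) & \text{if } s_i = 1 \text{ (observed)}
\end{cases}
\]
```

Setting rho to zero in the second case leaves a probit term plus a normal regression term, so the two equations no longer share parameters and can be estimated separately.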

Lambda – The first-stage probit estimates generate a new variable, the inverse Mills ratio or lambda, which is included as a control variable in the second-stage equation.

Rho – The correlation between the errors in the two equations. If rho = 0, the likelihood function can be split into two parts: a probit for the probability of being selected and an OLS regression for the expected value of Y in the selected subsample.

Sigma – The standard deviation of the error in the outcome equation.
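To make lambda concrete, here is a minimal sketch of the inverse Mills ratio computed with the Python standard library. The function names and the index values fed to it are hypothetical; in practice the input would be each case's fitted index from the first-stage probit.

```python
import math

def normal_pdf(c):
    """Standard normal density."""
    return math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)

def normal_cdf(c):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

def inverse_mills(c):
    """Inverse Mills ratio: density divided by cumulative probability."""
    return normal_pdf(c) / normal_cdf(c)

# Hypothetical first-stage probit index values (not from the survey data).
index_values = [-2.0, -1.0, 0.0, 1.0, 2.0]
ratios = [inverse_mills(c) for c in index_values]

for c, r in zip(index_values, ratios):
    print(f"index = {c:5.1f}   inverse Mills ratio = {r:.4f}")
```

The ratio falls as the index rises: cases that were very likely to be observed get a small correction, while cases that were unlikely to be observed get a large one.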

Censoring

heckman USE_PM_WMIS conflict_index hcong data_index leadership resources know_gpra employ super_year, select(gpra_answer = conflict_index hcong data_index clim_lead resources know_gpra employ super_year gpra_inv_data gpra_inv_measure gpra_inv_goals head) nshazard(NS_Use) robust

Iteration 0:   log pseudolikelihood = -4059.8985
Iteration 1:   log pseudolikelihood = -4057.3885
Iteration 2:   log pseudolikelihood = -4057.1623
Iteration 3:   log pseudolikelihood = -4057.1623

Heckman selection model                         Number of obs      =      1778
(regression model with sample selection)        Censored obs       =       670
                                                Uncensored obs     =      1108

                                                Wald chi2(8)       =    242.08
Log pseudolikelihood = -4057.162                Prob > chi2        =    0.0000

Outcome Equation

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
USE_PM_WMIS  |
conflict_i~x |   .1928649    .168657     1.14   0.253    -.1376969    .5234266
       hcong |   .3237611   .1536926     2.11   0.035     .0225292     .624993
  data_index |  -1.477338   .1812304    -8.15   0.000    -1.832543   -1.122133
  leadership |   1.622898     .20833     7.79   0.000     1.214578    2.031217
   resources |  -.2301818   .1710368    -1.35   0.178    -.5654079    .1050442
   know_gpra |  -1.000888    .380547    -2.63   0.009    -1.746746   -.2550295
      employ |   .4258168   .1341343     3.17   0.002     .1629184    .6887152
  super_year |   .0006872   .1701473     0.00   0.997    -.3327954    .3341699
       _cons |   23.39185   2.590072     9.03   0.000      18.3154     28.4683

Selection Equation

-------------+----------------------------------------------------------------
gpra_answer  |
conflict_i~x |  -.0245066   .0357184    -0.69   0.493    -.0945134    .0455002
       hcong |   .0300361    .030737     0.98   0.328    -.0302072    .0902794
  data_index |   .1263006   .0402627     3.14   0.002     .0473871    .2052141
   clim_lead |  -.0007559   .0400518    -0.02   0.985     -.079256    .0777442
   resources |   .0215541   .0376913     0.57   0.567    -.0523195    .0954276
   know_gpra |   .7684442   .0442381    17.37   0.000     .6817391    .8551492
      employ |  -.0048647   .0388891    -0.13   0.900    -.0810858    .0713564
  super_year |   .0757006   .0375131     2.02   0.044     .0021764    .1492249
gpra_inv_d~a |   .1376488   .0954819     1.44   0.149    -.0494924    .3247899
gpra_inv_m~e |   .4304551   .1000318     4.30   0.000     .2343964    .6265138
gpra_inv_g~s |   .4039991   .0995959     4.06   0.000     .2087947    .5992034
        head |  -.2251226   .0770209    -2.92   0.003    -.3760809   -.0741644
       _cons |  -3.210954   .3191416   -10.06   0.000     -3.83646   -2.585448
-------------+----------------------------------------------------------------
     /athrho |  -.8394572   .1788731    -4.69   0.000    -1.190042   -.4888723
    /lnsigma |   1.726785   .0363139    47.55   0.000     1.655611    1.797959
-------------+----------------------------------------------------------------
         rho |  -.6855214   .0948136                     -.8305919   -.4533209
       sigma |    5.62255   .2041765                       5.23628    6.037313
      lambda |  -3.854378   .6499205                     -5.128199   -2.580558
------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0): chi2(1) =    22.02   Prob > chi2 = 0.0000
------------------------------------------------------------------------------

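rho is significant! The Wald test of independent equations rejects rho = 0 (chi2(1) = 22.02, Prob > chi2 = 0.0000), so the errors in the selection and outcome equations are correlated.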
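The ancillary rows of the output are linked by simple transformations: Stata estimates /athrho and /lnsigma and derives rho, sigma, and lambda from them. The short check below reproduces that arithmetic from the two estimates reported above. It is a sketch for verifying the output by hand, and the variable names are my own.

```python
import math

# Ancillary parameter estimates copied from the heckman output above.
athrho = -0.8394572   # inverse hyperbolic tangent of rho
lnsigma = 1.726785    # natural log of sigma

rho = math.tanh(athrho)     # back-transform to the correlation scale
sigma = math.exp(lnsigma)   # back-transform to the standard deviation scale
lam = rho * sigma           # lambda is the product of rho and sigma

print("rho    =", round(rho, 4))
print("sigma  =", round(sigma, 4))
print("lambda =", round(lam, 4))
```

The printed values should agree, up to rounding, with the rho, sigma, and lambda rows of the table. Because sigma is always positive, the sign of lambda follows the sign of rho.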

Sweeney notes: “If a variable appears ONLY in the outcome equation the coefficient on it can be interpreted as the marginal effect of a one unit change in that variable on Y. If, on the other hand, the variable appears in both the selection and outcome equations the coefficient in the outcome equation is affected by its presence in the selection equation as well.”
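Sweeney’s point can be stated as a formula. In the standard derivation, sketched here with assumed notation (beta_k and gamma_k are the coefficients on the same variable in the outcome and selection equations, and z_i gamma is the selection index), the marginal effect among observed cases is:

```latex
\[
\frac{\partial E[\,y_i \mid s_i = 1\,]}{\partial x_{ik}}
  = \beta_k - \gamma_k\,\rho\sigma\,\delta_i,
\qquad
\delta_i = \lambda_i\,(\lambda_i + z_i\gamma),
\qquad
\lambda_i = \frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}
\]
```

Because delta lies between 0 and 1, the adjustment vanishes when the variable is absent from the selection equation (gamma_k = 0) or when rho = 0. Otherwise its direction depends on the signs of gamma_k and rho.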