Econometrics Course: Endogeneity & Simultaneity

### Econometrics Course: Endogeneity & Simultaneity

Mark W. Smith

Overview
• Endogeneity
• Sources
• Responses
• Omitted Variables
• Measurement Error
• Proxy Variables
• Method of Instrumental Variables
• Properties
• Validity and strength of instruments

Definition of Endogeneity

Suppose we have a regression equation

y = a + b1x1 + b2x2 + e

The variable x1 is endogenous if it is correlated with e.

Note that this is related to, but not identical to, the heuristic definition that “x1 is determined within the model.”

Sources of Endogeneity

1. Omitted variables

If the true model underlying the data is

y = a + b1x1 + b2x2 + b3x3 + n

but you estimate the model

y = a + b1x1 + b2x2 + e

then variable x1 will be endogenous if it is correlated with x3 (and b3 is not zero). Why? Because the omitted term is absorbed into the error: e = b3x3 + n.
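A small simulation can make the omitted-variable argument concrete. This is a minimal sketch under assumed values: the coefficients (1, 2, 1, 3), the 0.8 link between x1 and x3, and the helper name `ols` are illustrative choices, not taken from the slides.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 100_000

x3 = rng.normal(size=n)                 # the variable that will be omitted
x1 = 0.8 * x3 + rng.normal(size=n)      # x1 is correlated with x3 (assumed link)
x2 = rng.normal(size=n)                 # x2 is unrelated to x3
noise = rng.normal(size=n)              # plays the role of n in the true model
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 3.0 * x3 + noise

const = np.ones(n)
b_full = ols(np.column_stack([const, x1, x2, x3]), y)   # true model
b_short = ols(np.column_stack([const, x1, x2]), y)      # x3 omitted

print("coefficient on x1, full model :", round(b_full[1], 3))
print("coefficient on x1, x3 omitted :", round(b_short[1], 3))
```

With these assumed values the full model should recover a coefficient near 2 on x1, while the short model should land well above it, because x1 picks up part of the effect of x3. The coefficient on x2 is barely affected, since x2 is unrelated to x3 in this setup.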

Sources of Endogeneity

2. Measurement error

Suppose the true model underlying the data is

y = a + b1x1 + b2x2 + e

but you estimate the model

y = a + b1x1 + b2x2* + e

where x2* = x2 + j is the observed (mismeasured) value of x2 and j is the measurement error.

Sources of Endogeneity

2. Measurement error - continued

Variable x2 will be endogenous if j depends on x2.

Example: Suppose that x2 measures hospital size (no. of beds), and that the measurement error is greater for larger hospitals. Then as x2 grows, so does j. Thus e is correlated with x2, causing endogeneity.

Sources of Endogeneity

2. Measurement error - continued

Rearranging the equation, we have

y = a + b1x1 + b2x2* + e

y = a + b1x1 + b2(x2 + j) + e

y = a + b1x1 + b2x2 + (e + b2 j)

If j = f(x2) then the error term is correlated with x2, causing endogeneity.

Sources of Endogeneity

3. Simultaneity

A system of simultaneous equations occurs when two or more left-hand side variables are functions of each other (there are other ways of stating it, too):

y1 = a1 + b1x1 + g1y2 + e1

y2 = a2 + b2x1 + g2y1 + e2

Sources of Endogeneity

3. Simultaneity

With some algebra you can solve these two equations for their “reduced form,” which shows that y2 depends on the error term of the y1 equation. In the equation for y1, y2 is therefore an endogenous regressor.
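The algebra can be sketched as follows. The labeling is an assumption made here for clarity: each equation gets its own symbols (a_1, b_1, g_1, e_1 and a_2, b_2, g_2, e_2). The sketch also assumes that g_1 g_2 ≠ 1 and that x_1 is uncorrelated with both error terms.

```latex
\begin{align*}
y_1 &= a_1 + b_1 x_1 + g_1 y_2 + e_1 \\
y_2 &= a_2 + b_2 x_1 + g_2 y_1 + e_2 \\[4pt]
% substitute the first equation into the second and solve for y_2
y_2 &= \frac{(a_2 + g_2 a_1) + (b_2 + g_2 b_1)\, x_1 + g_2 e_1 + e_2}{1 - g_1 g_2} \\[4pt]
\operatorname{Cov}(y_2, e_1) &= \frac{g_2 \operatorname{Var}(e_1) + \operatorname{Cov}(e_1, e_2)}{1 - g_1 g_2}
\end{align*}
```

Unless g_2 = 0 and the two errors are uncorrelated, the covariance in the last line is non-zero, so y_2 is correlated with the error of the y_1 equation.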

Pretesting for Endogeneity

The most famous test is Hausman (1978). Many others are described in Nakamura and Nakamura (1998).

Idea: the method of instrumental variables (IV) is usually implemented by two-stage least squares (2SLS). If there is no endogeneity, OLS is more efficient than 2SLS. If there is endogeneity, OLS is inconsistent, so 2SLS is preferred.

Pretesting for Endogeneity

Problem: the tests all have low power, particularly when 2SLS would cause a significant loss of efficiency.

In practice, many people use a Hausman test, fail to reject the null hypothesis of no endogeneity, and then use OLS.

A more statistically reliable approach is to base judgments of endogeneity on how the system under study works.

Responses to Endogeneity

What if you are unsure whether a variable is endogenous?

Approach #1: ignore it

Approach #2: use instrumental variables (IV) -- described later -- for every possibly endogenous variable

Approach #3: subtract out the variable using time-series (panel) data

Responses to Endogeneity

Approach #1: ignore it

-- Not advisable: true endogeneity causes OLS to be inconsistent

Approach #2: use IV on every possibly endogenous variable

-- Not advisable: it will cause a loss of efficiency (and hence wider confidence intervals) and may lead to bias.

Responses to Endogeneity

Approach #3: Difference it out

Suppose that the endogeneity is fixed over time, such as measurement error or an omitted variable. Further, suppose that you observe data in two time periods.

A difference-in-difference (DD) model can be used: subtract values at time 1 (“before”) from values at time 2 (“after”) and the time-invariant source of endogeneity will drop out.

Responses to Endogeneity

Approach #3: Difference it out -- continued

Limitations:

- DD models will not eliminate selection bias.

- DD models only eliminate fixed variables; sometimes endogenous variables change values over time
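The differencing idea can be illustrated with a two-period simulation. This is a sketch under assumed values: the unobserved component `c`, its coefficient of 3, and the true slope of 2 are hypothetical, and `c` is constructed to be exactly constant over time, which is the case the method is meant for.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

c = rng.normal(size=n)              # unobserved, fixed over time for each unit
x_t1 = c + rng.normal(size=n)       # regressor at time 1, correlated with c
x_t2 = c + rng.normal(size=n)       # regressor at time 2, correlated with c
y_t1 = 1.0 + 2.0 * x_t1 + 3.0 * c + rng.normal(size=n)
y_t2 = 1.0 + 2.0 * x_t2 + 3.0 * c + rng.normal(size=n)

# Pooled OLS of y on x ignores c, so the slope is inconsistent.
x_pooled = np.concatenate([x_t1, x_t2])
y_pooled = np.concatenate([y_t1, y_t2])
slope_pooled = np.polyfit(x_pooled, y_pooled, 1)[0]

# Differencing removes c because it takes the same value in both periods.
slope_diff = np.polyfit(x_t2 - x_t1, y_t2 - y_t1, 1)[0]

print("pooled slope     :", round(slope_pooled, 3))
print("differenced slope:", round(slope_diff, 3))
```

In this setup the differenced slope should sit near the true value of 2, while the pooled slope is pushed upward by the omitted `c`. If `c` changed between the two periods, the difference would no longer remove it, which is the second limitation listed above.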

Dealing with Omitted Variables

The investigator should have a conceptual model of the process under study. Guided by this understanding, there are a few options for dealing with omitted variables.

1. Find additional data so that every relevant variable is included.

2. Ignore it

- Acceptable only if omitted variable is uncorrelated with all included variables; otherwise the coefficient estimates will be biased up or down.

Dealing with Omitted Variables

3. Find proxy variable

Suppose the following:

y is the outcome

q is the omitted variable

z is the proxy for q

What properties should the proxy z have?

Dealing with Omitted Variables

a. Proxy z should be strongly correlated with q.

b. Proxy z must be redundant (= ignorable)

E (y | x, q, z) = E (y | x, q)

c. Omitted q must be uncorrelated with other regressors conditional on z:

corr (q, xj | z) = 0 for each xj

Dealing with Omitted Variables

The last two mean roughly that q and z provide similar information about the outcome.

You don’t observe q, so how can you prove these conditions are met? Either argue it from theory or test the assumption using other data.

### Dealing with Measurement Error

1. Improve measurement

- DSS improved by refusing extreme outlier values

- NPPD improved by requiring more complete data

2. Argue that the degree of error is small

- Use outside data for validation

3. Argue that error is uncorrelated with included variables

Dealing with Proxy Variables

1. What if proxy variable z is correlated with a regressor x?

OLS is inconsistent, but one can hope and argue that the inconsistency is less than if z is omitted.

Dealing with Proxy Variables

2. Consider using a lagged dependent variable as a proxy variable.

Example: If you believe that omitted variable qt strongly affects outcome yt, then a lagged value of y (such as yt-2) is probably correlated with qt as well.

Problem: yt-2 may be correlated with other x’s as well, leading to inconsistency.

Dealing with Proxy Variables

3. Consider using multiple proxy variables for a single omitted variable.

How? Simply put all proxy variables in the equation.

Note: they all must meet the requirements for proxies.

Dealing with Proxy Variables

4. What if omitted variable q interacts with a regressor x?

y = a + b1x + b2q + b3qx + e

dy/dx = b1 + b3q

The marginal effect of x on y involves q, which is unobserved.

Dealing with Proxy Variables

Replace q with its proxy z and demean z: take every value of z and subtract the grand (overall) average value. Call the result zd.

y = a + b1x + b2zd + b3zdx + e

dy/dx = b1 + b3zd

E[dy/dx] = b1 because E[zd] = 0, so b1 is the average marginal effect of x.
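The effect of demeaning can be checked numerically. The sketch below makes a strong simplifying assumption that is not in the slides: the proxy z is a noiseless linear function of q (z = 1 + 2q). The coefficient values and the helper name `ols` are also hypothetical.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 100_000

q = rng.normal(loc=4.0, scale=1.0, size=n)   # unobserved variable, mean about 4
z = 1.0 + 2.0 * q                            # assumed noiseless linear proxy for q
x = rng.normal(size=n)
y = 1.0 + 1.0 * x + 2.0 * q + 0.5 * q * x + rng.normal(size=n)

const = np.ones(n)

# Interaction with the raw proxy: the coefficient on x is the effect at z = 0.
b_raw = ols(np.column_stack([const, x, z, z * x]), y)

# Interaction with the demeaned proxy: the coefficient on x is the average effect.
zd = z - z.mean()
b_dem = ols(np.column_stack([const, x, zd, zd * x]), y)

avg_effect = 1.0 + 0.5 * q.mean()            # true average of dy/dx in this sample
print("coefficient on x, raw z     :", round(b_raw[1], 3))
print("coefficient on x, demeaned z:", round(b_dem[1], 3))
print("true average marginal effect:", round(avg_effect, 3))
```

Only the regression with the demeaned proxy yields a coefficient on x that can be read directly as the average marginal effect. With the raw proxy, the same coefficient describes the effect at z = 0, which may lie far outside the data.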

Method of Instrumental Variables

Often used to deal with simultaneity.

More generally, IV applies whenever a regressor x is correlated with the error term e.

IV Definition

Model: y = a + b1x1 + b2x2 + e

Suppose that x2 is endogenous (correlated with e). An instrumental variable is one that

(a) is correlated with the endogenous variable x2

(b) is uncorrelated with error term e

(c) does not enter the main equation (i.e., does not explain y directly)

Two-Stage Least Squares

Two-stage least squares (2SLS) approach

Stage 1:

Predict x2 as a function of all other variables plus an IV (call it z):

x2 = a + g1x1 + g2z + n

Create predicted values of x2 – call them x2p

Two-Stage Least Squares

Two-stage least squares (2SLS) approach

Stage 2:

Predict y as a function of x2p and all other variables (but not z):

y = a + b1x1 + b2 x2p + e

Note: adjust the standard errors to account for the fact that x2p is predicted.
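The two stages can be sketched with plain least squares. The data-generating values, the unobserved confounder `u`, and the helper name `ols` are assumptions for illustration. The sketch computes point estimates only and does not apply the standard-error adjustment mentioned in the note.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 100_000

x1 = rng.normal(size=n)                      # exogenous regressor
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder (assumed)
x2 = 0.5 * x1 + 1.0 * z + u + rng.normal(size=n)
e = 2.0 * u + rng.normal(size=n)             # error term, correlated with x2 via u
y = 1.0 + 2.0 * x1 + 3.0 * x2 + e

const = np.ones(n)

# Stage 1: predict x2 from all other variables plus the instrument z.
W = np.column_stack([const, x1, z])
x2p = W @ ols(W, x2)

# Stage 2: regress y on x1 and the predicted x2p (z is left out).
b_2sls = ols(np.column_stack([const, x1, x2p]), y)

# Plain OLS for comparison.
b_ols = ols(np.column_stack([const, x1, x2]), y)

print("coefficient on x2, OLS :", round(b_ols[2], 3))
print("coefficient on x2, 2SLS:", round(b_2sls[2], 3))
```

In this setup the 2SLS coefficient on x2 should be close to the true value of 3, while OLS is pulled upward by the confounder. Standard errors taken from the second-stage regression as run here would be wrong, so in practice a dedicated IV routine is the safer route.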

Two-Stage Residual Inclusion

2SLS is only consistent when the Stage 2 equation is linear.

If Stage 2 is nonlinear, use the two-stage residual inclusion (2SRI) method:

- Stage 1 as in 2SLS, leading to predicted x2p

- Compute residuals v = x2 - x2p

Two-Stage Residual Inclusion

- Stage 2:

Predict y as a function of x1, x2 (not x2p) and the new residuals v:

y = f (a + b1x1 + b2x2 + b3v) + e

where f(.) is a nonlinear function.

Note that if Stage 2 is linear, then 2SRI yields the same results as 2SLS.
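The equivalence in the linear case can be verified numerically. The sketch below uses an assumed data-generating process (the same hypothetical confounder setup as a textbook IV example) and compares the coefficients from the two procedures.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 20_000

x1 = rng.normal(size=n)
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder (assumed)
x2 = 0.5 * x1 + 1.0 * z + u + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 2.0 * u + rng.normal(size=n)

const = np.ones(n)

# Stage 1 (shared by both methods): fitted values and residuals of x2.
W = np.column_stack([const, x1, z])
x2p = W @ ols(W, x2)
v = x2 - x2p

# 2SLS: replace x2 with its prediction.
b_2sls = ols(np.column_stack([const, x1, x2p]), y)

# 2SRI with a linear Stage 2: keep x2 and add the Stage 1 residual v.
b_2sri = ols(np.column_stack([const, x1, x2, v]), y)

print("2SLS (const, x1, x2):", np.round(b_2sls, 4))
print("2SRI (const, x1, x2):", np.round(b_2sri[:3], 4))
```

The first three 2SRI coefficients should match the 2SLS coefficients up to floating-point error, because the residual v is orthogonal to the Stage 1 regressors. With a nonlinear f(.) in Stage 2 this identity no longer holds, which is why the two methods are distinguished.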

Multiple IVs

What if you have multiple endogenous variables?

1. The number of IVs must equal or exceed the number of endogenous variables

2. Estimate a separate 1st-stage regression for each endogenous variable

3. Every 1st-stage regression should contain all IVs
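The three rules can be sketched for a case with two endogenous variables and three IVs. All names and coefficient values below are hypothetical. Passing a two-column target to `np.linalg.lstsq` fits one 1st-stage regression per endogenous variable, each using all IVs.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

x1 = rng.normal(size=n)                      # exogenous regressor
Z = rng.normal(size=(n, 3))                  # three instruments
u = rng.normal(size=n)                       # unobserved confounder (assumed)
x2 = Z[:, 0] + 0.5 * Z[:, 1] + u + rng.normal(size=n)
x3 = Z[:, 1] + Z[:, 2] - u + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 - 1.0 * x3 + 2.0 * u + rng.normal(size=n)

const = np.ones(n)
endog = np.column_stack([x2, x3])

# Rule 1: at least as many IVs as endogenous variables.
assert Z.shape[1] >= endog.shape[1]

# Rules 2 and 3: a separate 1st-stage regression per endogenous variable,
# each containing the exogenous regressors and all IVs.
W = np.column_stack([const, x1, Z])
gamma = np.linalg.lstsq(W, endog, rcond=None)[0]   # one column per regression
endog_p = W @ gamma

# 2nd stage: use the predicted endogenous variables, leave the IVs out.
X_stage2 = np.column_stack([const, x1, endog_p])
b_2sls = np.linalg.lstsq(X_stage2, y, rcond=None)[0]

print("coefficients (const, x1, x2, x3):", np.round(b_2sls, 3))
```

With these assumed values the estimates for x2 and x3 should be close to 3 and -1. As with any hand-rolled second stage, the standard errors would need separate adjustment.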

IV Issues

Two issues plague the IV method:

1. No IV is available

2. A potential IV is found, but its quality is uncertain

IV Issues

What if there is no IV?

State that no IV exists and forge ahead anyway, arguing that any bias in OLS is likely to be small.

• Argue that the endogeneity is weak on theoretical grounds.
• Argue that external data indicate that the bias from OLS is likely to be small.

IV Properties

What if you have an IV of unknown quality?

Two characteristics mark a good IV:

1. Validity

2. Strength

IV Validity

Validity has several components:

a. Non-zero correlation with x2

b. Uncorrelated with error term e

c. Uncorrelated with y except through x2

d. Monotonicity: as z increases, x2 increases

IV Validity

There are several ways to show validity of an IV:

• Non-zero correlation with the endogenous variable can be shown directly.
• Robustness: do alternative IVs yield similar results?
• No direct effect on the outcome variable of the 2nd stage (other than through the endogenous variable). This point must be argued from theory, that is, from an understanding of how the system under study works.

IV Validity

Warning: one cannot simply add a candidate IV to the main model (i.e., the 2nd stage) to see whether it is significant. The result is biased.

BUT

If there are multiple IVs, one can use a test of over-identifying restrictions.

IV Validity

Overidentification: number of candidate IVs exceeds number of endogenous variables.

Suppose that

(a) You have one endogenous variable and three candidate IVs

(b) You know that one of the IVs is truly valid.

Use the known-valid IV in the 1st stage and put the remaining two IVs in the 2nd stage.

IV Validity

Over-identification test, continued

If the two remaining IVs are jointly insignificant in the 2nd stage, then this supports their use as alternative IVs.

Problem: this only works if the IV(s) in the 1st stage are truly valid – and you don’t know that!

IV Validity

Over-identification test, continued

Partial solution: use Sargan’s (1984) test, which assumes only that one or more of your IVs are valid – you don’t have to specify which. This method fails only if none of the IVs is valid.

In the end, you must argue for validity on conceptual grounds at a minimum.
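One common textbook form of the over-identification statistic is n times the R-squared from regressing the 2SLS residuals on all exogenous variables and IVs. It is compared with a chi-square distribution whose degrees of freedom equal the number of IVs minus the number of endogenous variables. The sketch below implements that form on simulated data. The function name `sargan_stat` and the data-generating values are assumptions, and the sketch is not a substitute for a packaged test.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def sargan_stat(y, exog, endog, ivs):
    """n * R^2 from regressing 2SLS residuals on all exogenous variables and IVs."""
    W = np.column_stack([exog, ivs])
    endog_p = W @ ols(W, endog)                        # 1st-stage prediction
    b = ols(np.column_stack([exog, endog_p]), y)       # 2nd-stage coefficients
    resid = y - np.column_stack([exog, endog]) @ b     # residuals use actual endog
    fitted = W @ ols(W, resid)
    r2 = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return len(y) * r2

rng = np.random.default_rng(6)
n = 5_000
x1 = rng.normal(size=n)
Z = rng.normal(size=(n, 3))                  # three candidate IVs
u = rng.normal(size=n)                       # unobserved confounder (assumed)
x2 = Z.sum(axis=1) + u + rng.normal(size=n)  # one endogenous variable
e = 2.0 * u + rng.normal(size=n)
exog = np.column_stack([np.ones(n), x1])

y_valid = 1.0 + 2.0 * x1 + 3.0 * x2 + e      # all three IVs are valid
y_invalid = y_valid + 1.0 * Z[:, 2]          # third IV now affects y directly

stat_valid = sargan_stat(y_valid, exog, x2, Z)
stat_invalid = sargan_stat(y_invalid, exog, x2, Z)
print("statistic, all IVs valid :", round(stat_valid, 2))
print("statistic, one IV invalid:", round(stat_invalid, 2))
```

Here there are 3 IVs and 1 endogenous variable, so the reference distribution has 2 degrees of freedom (5% critical value about 5.99). The statistic for the invalid case should be far above that value. A small statistic does not prove validity, which is why the conceptual argument is still required.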

IV Validity

Conceptual arguments:

1. Explain why z should influence x2

2. Explain why z should not influence y directly

3. Anticipate objections about omitted variables that link z to the error term e. Show that z is not related to those omitted variables, perhaps using outside data. For example, use data on non-veterans to support a claim about how veterans act.

IV Properties

Two characteristics mark a good instrumental variable:

1. Validity

2. Strength

Strong IVs

A strong instrument has a high correlation with the endogenous variable.

How strong a correlation? Staiger & Stock (1997) recommend a first-stage partial F statistic of 10 or greater.

- Run the 1st stage with and without the IV.

- Compute the partial F statistic for the IV from the two fits: a value of 10 or more is commonly taken as sufficient evidence of strength.
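The partial F statistic for the IV can be computed from the sums of squared residuals of the two 1st-stage fits (with and without the IV). The sketch below uses simulated data with an assumed strong instrument. The helper name `ssr` and all coefficient values are hypothetical.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from OLS of y on the columns of X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return float(resid @ resid)

rng = np.random.default_rng(5)
n = 1_000

x1 = rng.normal(size=n)                      # exogenous regressor
z = rng.normal(size=n)                       # candidate instrument
x2 = 0.5 * x1 + 1.0 * z + rng.normal(size=n) # endogenous variable (1st-stage outcome)

const = np.ones(n)
X_restricted = np.column_stack([const, x1])        # 1st stage without the IV
X_unrestricted = np.column_stack([const, x1, z])   # 1st stage with the IV

q = X_unrestricted.shape[1] - X_restricted.shape[1]   # number of IVs tested
k = X_unrestricted.shape[1]                           # parameters in the full fit
ssr_r = ssr(X_restricted, x2)
ssr_u = ssr(X_unrestricted, x2)

partial_f = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print("partial F for the IV:", round(partial_f, 1))
```

With a single IV this statistic equals the squared t statistic on z in the 1st stage. The simulated instrument is deliberately strong, so the value is very large. With a weak instrument (a coefficient on z near zero) it would be small.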

Weak IVs

If the IVs are weak,

• 2SLS and 2SRI are consistent, but there can be considerable bias even in large samples
• standard errors are too small
• 2SLS and 2SRI perform poorly

Weak IVs

What to do if IVs are weak?

If there is a single endogenous variable, use a conditional likelihood ratio (CLR) test:

* perform a regular likelihood ratio test

* available in Stata; see Stata Journal, 3, 57-70 and http://elsa.berkeley.edu/wp/marcelo.pdf by Moreira and Poi

Weak IVs

What if there are multiple endogenous variables and only weak IVs?

A solution has not been developed … yet!

Selected References

JM Wooldridge. Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press, 2002.

A graduate-level econometrics textbook with lengthy textual descriptions of practical issues.

HS Bloom, ed. Learning more from social experiments: evolving analytic approaches. Russell Sage.

A largely non-technical exploration of how instrumental variables are found and used, with examples from welfare reform studies.

Selected References

MP Murray. Avoiding invalid instruments and coping with weak instruments. Journal of Economic Perspectives 2006;20(4): 111-132.

A superb reference with relatively few equations. Has an extensive reference list.

A Nakamura, M Nakamura. Model specification and endogeneity. Journal of Econometrics 1998;83:213-237.

Presents major endogeneity tests, explores approaches to endogeneity testing. Somewhat iconoclastic.

Selected References

M McClellan, B McNeil, J Newhouse. Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality? Analysis using instrumental variables. JAMA 1994;272(11):859-66.

Classic paper using IV in health, but challenging to read.

J Newhouse, M McClellan. Econometrics in outcomes research: the use of instrumental variables. Ann Rev Pub Health 1998; 19:17-34.

Non-technical introduction to IV.

Selected References

J Terza, A Basu, P Rathouz. Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling. Journal of Health Economics 2008;27:531-543.

Explains two-stage residual inclusion models and contrasts them to two-stage least squares. Moderately technical.

Acknowledgements

Much of the content of this presentation is derived from Wooldridge (2002), Murray (2006), and Nakamura and Nakamura (1998).