Econometrics Course: Endogeneity & Simultaneity

Econometrics Course: Endogeneity & Simultaneity
Mark W. Smith

Overview • Endogeneity • Sources • Responses • Omitted Variables • Measurement Error • Proxy Variables • Method of Instrumental Variables • Properties • Validity and strength of instruments

Definition of Endogeneity
Suppose we have a regression equation
y = a + b1x1 + b2x2 + e
The variable x1 is endogenous if it is correlated with e. Note that this is related to, but not identical to, the heuristic definition that “x1 is determined within the model.”

Sources of Endogeneity 1. Omitted variables
If the true model underlying the data is
y = a + b1x1 + b2x2 + b3x3 + n
but you estimate the model
y = a + b1x1 + b2x2 + e
then variable x1 will be endogenous if it is correlated with x3. Why? Because the omitted term is absorbed into the error: e = b3x3 + n, so e = f(n, x3).
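The omitted-variable argument can be checked with a small simulation. The sketch below is illustrative only: the helper ols, the seed, the sample size, and every coefficient value are assumptions chosen for the example, not part of the course material. It fits the full model and the short model (x3 omitted) on data where x1 is correlated with x3.

```python
import random

def ols(columns, y):
    """Return [intercept, slope_1, ...] from least squares of y on the columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(0)
n = 5000
x3 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.8 * v + random.gauss(0, 1) for v in x3]   # x1 is correlated with x3
x2 = [random.gauss(0, 1) for _ in range(n)]       # x2 is unrelated to x3
# Assumed true model: y = 1 + 2*x1 + 1*x2 + 3*x3 + n
y = [1.0 + 2.0 * a + 1.0 * b + 3.0 * c + random.gauss(0, 1)
     for a, b, c in zip(x1, x2, x3)]

full = ols([x1, x2, x3], y)   # includes x3: b1 should be close to 2
short = ols([x1, x2], y)      # omits x3: b1 absorbs part of x3's effect

print("b1 with x3 included:", round(full[1], 3))
print("b1 with x3 omitted: ", round(short[1], 3))
```

In this setup the short-model estimate of b1 drifts well above the true value because x3 sits in the error term and moves with x1.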

Sources of Endogeneity 2. Measurement error
Suppose the true model underlying the data is
y = a + b1x1 + b2x2 + e
but you estimate the model
y = a + b1x1 + b2x2* + e
where x2* = x2 + j.

Sources of Endogeneity 2. Measurement error - continued Variable x2 will be endogenous if j depends on x2. Example: Suppose that x2 measures hospital size (no. of beds), and that the measurement error is greater for larger hospitals. Then as x2 grows, so does j. Thus e is correlated with x2, causing endogeneity.

Sources of Endogeneity 2. Measurement error - continued
Rearranging the equation, we have
y = a + b1x1 + b2x2* + e
y = a + b1x1 + b2(x2 + j) + e
y = a + b1x1 + b2x2 + (e + b2j)
If j = f(x2), then the error term is correlated with x2, causing endogeneity.

Sources of Endogeneity 3. Simultaneity
A system of simultaneous equations occurs when two or more left-hand side variables are functions of each other (there are other ways of stating it, too). Each equation has its own intercept, coefficients, and error term:
y1 = a1 + b1x1 + g1y2 + e1
y2 = a2 + b2x1 + g2y1 + e2

Sources of Endogeneity 3. Simultaneity With some algebra you can rewrite these two equations in “reduced form” as a single equation with an endogenous regressor.
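The algebra can be sketched as follows. The labelling is an assumption made here for illustration: the two structural equations are written with equation-specific symbols (a1, b1, g1, e1 in the first; a2, b2, g2, e2 in the second), x1 is taken to be exogenous, and e1 and e2 are taken to be uncorrelated. Substituting the first equation into the second and solving for y2 shows that y2 contains e1, so y2 is an endogenous regressor in the equation for y1.

```latex
\begin{align*}
y_1 &= a_1 + b_1 x_1 + g_1 y_2 + e_1 \\
y_2 &= a_2 + b_2 x_1 + g_2 y_1 + e_2 \\[4pt]
y_2 &= \frac{(a_2 + g_2 a_1) + (b_2 + g_2 b_1)\,x_1 + g_2 e_1 + e_2}{1 - g_1 g_2},
\qquad g_1 g_2 \neq 1 \\[4pt]
\operatorname{Cov}(y_2, e_1) &= \frac{g_2 \operatorname{Var}(e_1)}{1 - g_1 g_2}
\end{align*}
```

Under these assumptions the covariance is non-zero whenever g2 is non-zero, which is exactly the condition that y1 feeds back into y2.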

Pretesting for Endogeneity The most famous test is Hausman (1978). Many others are described in Nakamura and Nakamura (1998). Idea: the method of instrumental variables (IV) uses two-stage least squares (2SLS). If there is no endogeneity, it is more efficient to use OLS. If there is endogeneity, OLS is inconsistent and so 2SLS is best.

Pretesting for Endogeneity Problem: the tests all have low power, particularly when 2SLS would cause a significant loss of efficiency. In practice, many people use a Hausman test, fail to reject the null hypothesis of no endogeneity, and then use OLS. A more statistically reliable approach is to base judgments of endogeneity on how the system under study works.

Responses to Endogeneity What if you are unsure whether a variable is endogenous? Approach #1: ignore it Approach #2: use instrumental variables (IV) -- described later -- for every possibly endogenous variable Approach #3: subtract out the variable using time-series (panel) data

Responses to Endogeneity Approach #1: ignore it -- Not advisable: true endogeneity causes OLS to be inconsistent Approach #2: use IV on every possibly endogenous variable -- Not advisable: it will cause a loss of efficiency (and hence wider confidence intervals) and may lead to bias.

Responses to Endogeneity Approach #3: Difference it out Suppose that the source of endogeneity is fixed over time, such as measurement error or an omitted variable. Further, suppose that you observe data in two time periods. A difference-in-difference (DD) model can be used: subtract values at time 1 (“before”) from values at time 2 (“after”) and the fixed, unobserved variable will drop out.

Responses to Endogeneity Approach #3: Difference it out -- continued Limitations: - DD models will not eliminate selection bias. - DD models only eliminate fixed variables; sometimes endogenous variables change values over time
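A minimal simulation of the differencing idea, under assumptions made only for this example: each unit has an unobserved variable q that is fixed over two periods and correlated with x, the true slope on x is 2, and the helper ols, the seed, and all coefficient values are hypothetical. Pooled OLS ignores q; the differenced regression subtracts period 1 from period 2, so q cancels.

```python
import random

def ols(columns, y):
    """Return [intercept, slope_1, ...] from least squares of y on the columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(1)
n = 5000
slope = 2.0
q = [random.gauss(0, 1) for _ in range(n)]            # fixed, unobserved
x_t1 = [0.8 * qi + random.gauss(0, 1) for qi in q]    # period 1 ("before")
x_t2 = [0.8 * qi + random.gauss(0, 1) for qi in q]    # period 2 ("after")
y_t1 = [1.0 + slope * x + 3.0 * qi + random.gauss(0, 1) for x, qi in zip(x_t1, q)]
y_t2 = [1.0 + slope * x + 3.0 * qi + random.gauss(0, 1) for x, qi in zip(x_t2, q)]

# Pooled OLS on both periods: q is omitted and correlated with x
pooled = ols([x_t1 + x_t2], y_t1 + y_t2)

# Differenced regression: q is the same in both periods, so it cancels
dx = [after - before for before, after in zip(x_t1, x_t2)]
dy = [after - before for before, after in zip(y_t1, y_t2)]
diff = ols([dx], dy)

print("pooled slope:     ", round(pooled[1], 3))
print("differenced slope:", round(diff[1], 3))
```

The sketch only covers the case the slide describes: if q changed between periods, it would not cancel in dy and the differenced slope would be affected as well.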

Dealing with Omitted Variables

Dealing with Omitted Variables The investigator should have a conceptual model of the process under study. Guided by this understanding, there are a few options for dealing with omitted variables. 1. Find additional data so that every relevant variable is included. 2. Ignore it - Acceptable only if the omitted variable is uncorrelated with all included variables; otherwise the coefficient estimates will be biased up or down.

Dealing with Omitted Variables 3. Find proxy variable Suppose the following: y is the outcome q is the omitted variable z is the proxy for q What properties should the proxy z have?

Dealing with Omitted Variables
a. Proxy z should be strongly correlated with q.
b. Proxy z must be redundant (= ignorable): E(y | x, q, z) = E(y | x, q)
c. Omitted q must be uncorrelated with the other regressors conditional on z: corr(q, xj | z) = 0 for each xj

Dealing with Omitted Variables The last two mean roughly that q and z provide similar information about the outcome. You don’t observe q, so how can you prove these conditions are met? Either argue it from theory or test the assumption using other data.

Dealing with Measurement Error

Dealing with Measurement Error
1. Improve measurement
- DSS improved by refusing extreme outlier values
- NPPD improved by requiring more complete data
2. Argue that the degree of error is small
- Use outside data for validation
3. Argue that error is uncorrelated with included variables

Dealing with Proxy Variables

Dealing with Proxy Variables 1. What if proxy variable z is correlated with a regressor x? OLS is inconsistent, but one can hope and argue that the inconsistency is less than if z is omitted.

Dealing with Proxy Variables 2. Consider using a lagged dependent variable as a proxy variable. Example: If you believe that omitted variable qt strongly affects outcome yt, then a lagged value of y (such as yt-2) is probably correlated with qt as well. Problem: yt-2 may be correlated with other x’s as well, leading to inconsistency.

Dealing with Proxy Variables 3. Consider using multiple proxy variables for a single omitted variable. How? Simply put all proxy variables in the equation. Note: they all must meet the requirements for proxies.

Dealing with Proxy Variables 4. What if omitted variable q interacts with a regressor x?
y = a + b1x + b2q + b3qx + e
so dy/dx = b1 + b3q
The marginal effect of x on y involves q, which is unobserved.

Dealing with Proxy Variables Demean z: take every value of z and subtract out the grand (overall) average value. Call it zd.
y = a + b1x + b2zd + b3zdx + e
so dy/dx = b1 + b3zd, and the average marginal effect is E[dy/dx] = b1 because E[zd] = 0.

Instrumental Variables

Method of Instrumental Variables Often used to deal with simultaneity. More generally, IV applies whenever a regressor x is correlated with the error term e.

IV Definition Model: y = a + b1x1 + b2x2 + e Suppose that x2 is endogenous to y. An instrumental variable is one that (a) is correlated with the endogenous variable x2 (b) is uncorrelated with error term e (c) should not enter the main equation (i.e., does not explain y)

Two-Stage Least Squares Two-stage least squares (2SLS) approach
Stage 1: Predict x2 as a function of all other variables plus an IV (call it z):
x2 = a + g1x1 + g2z + n
Create predicted values of x2 and call them x2p.

Two-Stage Least Squares Two-stage least squares (2SLS) approach
Stage 2: Predict y as a function of x2p and all other variables (but not z):
y = a + b1x1 + b2x2p + e
Note: adjust the standard errors to account for the fact that x2p is predicted.
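The two stages can be sketched by hand on simulated data. Everything specific here is an assumption for illustration: the helper ols, the seed, the unobserved factor u that makes x2 endogenous, and the coefficient values (true b1 = 2, true b2 = 1.5). The sketch reports point estimates only; it does not compute the adjusted standard errors that the note above calls for.

```python
import random

def ols(columns, y):
    """Return [intercept, slope_1, ...] from least squares of y on the columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(2)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]   # exogenous regressor
z = [random.gauss(0, 1) for _ in range(n)]    # instrument
u = [random.gauss(0, 1) for _ in range(n)]    # unobserved; drives x2 and y
x2 = [0.5 * a + 1.0 * b + c + random.gauss(0, 1) for a, b, c in zip(x1, z, u)]
y = [1.0 + 2.0 * a + 1.5 * b + 2.0 * c + random.gauss(0, 1)
     for a, b, c in zip(x1, x2, u)]

# Naive OLS: x2 is correlated with the error (through u), so b2 is off
naive = ols([x1, x2], y)

# Stage 1: regress x2 on x1 and z, then form predicted values x2p
g = ols([x1, z], x2)
x2p = [g[0] + g[1] * a + g[2] * b for a, b in zip(x1, z)]

# Stage 2: regress y on x1 and x2p (z itself is excluded)
tsls = ols([x1, x2p], y)

print("OLS  b2:", round(naive[2], 3))
print("2SLS b2:", round(tsls[2], 3))
```

Running the second stage as a plain regression gives correct coefficients but not correct standard errors, which is why packaged IV routines are normally preferred in applied work.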

Two-Stage Residual Inclusion 2SLS is only consistent when the Stage 2 equation is linear. If Stage 2 is nonlinear, use the two-stage residual inclusion (2SRI) method:
- Stage 1 as in 2SLS, leading to predicted x2p
- Develop residuals v = x2 - x2p

Two-Stage Residual Inclusion
- Stage 2: Predict y as a function of x1, x2 (not x2p) and the new residuals v:
y = f(a + b1x1 + b2x2 + b3v) + e
where f(.) is a nonlinear function. Note that if Stage 2 is linear, then 2SRI yields the same results as 2SLS.
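The equivalence in the linear case can be verified numerically. This is a sketch under assumed values (the helper ols, the seed, the unobserved factor u, and all coefficients are hypothetical), and it covers only the linear special case, not a nonlinear f(.): it computes the 2SLS coefficient on x2p and the 2SRI coefficient on x2 from the same first stage and compares them.

```python
import random

def ols(columns, y):
    """Return [intercept, slope_1, ...] from least squares of y on the columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(3)
n = 3000
x1 = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]    # instrument
u = [random.gauss(0, 1) for _ in range(n)]    # unobserved; makes x2 endogenous
x2 = [0.5 * a + 1.0 * b + c + random.gauss(0, 1) for a, b, c in zip(x1, z, u)]
y = [1.0 + 2.0 * a + 1.5 * b + 2.0 * c + random.gauss(0, 1)
     for a, b, c in zip(x1, x2, u)]

# Stage 1 (shared): predicted values x2p and residuals v = x2 - x2p
g = ols([x1, z], x2)
x2p = [g[0] + g[1] * a + g[2] * b for a, b in zip(x1, z)]
v = [actual - pred for actual, pred in zip(x2, x2p)]

# 2SLS stage 2: replace x2 with x2p
tsls = ols([x1, x2p], y)

# 2SRI stage 2 (linear case): keep the actual x2 and add the residuals v
sri = ols([x1, x2, v], y)

print("2SLS b2:", round(tsls[2], 6))
print("2SRI b2:", round(sri[2], 6))
```

The two coefficients agree up to floating-point error because v is orthogonal to the intercept, x1, and x2p by construction; that argument no longer applies once f(.) is nonlinear.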

Multiple IVs What if you have multiple endogenous variables? 1. The number of IVs must equal or exceed the number of endogenous variables 2. Estimate a separate 1st-stage regression for each endogenous variable 3. Every 1st-stage regression should contain all IVs

IV Issues Two issues plague the IV method: 1. No IV is available 2. A potential IV is found, but its quality is uncertain

IV Issues What if there is no IV? State that no IV exists and forge ahead anyway, arguing that any bias in OLS is likely to be small. • Argue that the endogeneity is weak on theoretical grounds. • Argue that external data indicate that the bias from OLS is likely to be small.

IV Properties What if you have an IV of unknown quality? Two characteristics mark a good IV: 1. Validity 2. Strength

IV Validity Validity has several components: a. Non-zero correlation with x2 b. Uncorrelated with error term e c. Uncorrelated with y except through x2 d. Monotonicity: as z increases, x2 increases

IV Validity There are several ways to show validity of an IV: • Non-zero correlation with the endogenous variable can be shown directly. • Robustness: do alternative IVs yield similar results? • Non-correlation with the outcome variable of the 2nd stage. This point must be argued from theory, an understanding of how the system under study works.

IV Validity Warning: one cannot simply add a candidate IV to the main model (i.e., the 2nd stage) to see whether it is significant. The result is biased. BUT If there are multiple IVs, one can use a test of over-identifying restrictions.

IV Validity Overidentification: number of candidate IVs exceeds number of endogenous variables. Suppose that (a) You have one endogenous variable and three candidate IVs (b) You know that one of the IVs is truly valid. Use the known-valid IV in the 1st stage and put the remaining two IVs in the 2nd stage.

IV Validity Over-identification test, continued If the two remaining IVs are jointly insignificant in the 2nd stage, then this supports their use as alternative IVs. Problem: this only works if the IV(s) in the 1st stage are truly valid – and you don’t know that!

IV Validity Over-identification test, continued Partial solution: use Sargan’s (1984) test, which assumes only that one or more of your IVs are valid; you don’t have to specify which. This method fails only if none of the IVs is valid. In the end, you must argue for validity on conceptual grounds at a minimum.
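One common textbook form of the Sargan statistic is n times the R-squared from regressing the 2SLS residuals on all exogenous variables and instruments, compared with a chi-squared distribution whose degrees of freedom equal the number of instruments minus the number of endogenous variables. The sketch below uses that form on simulated data; the helpers ols, fitted, and sargan, the seed, and all coefficient values are assumptions for illustration. It contrasts a case where both instruments are valid with one where z2 also enters the outcome directly.

```python
import random

def ols(columns, y):
    """Return [intercept, slope_1, ...] from least squares of y on the columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    return [A[i][k] / A[i][i] for i in range(k)]

def fitted(coef, columns):
    """Fitted values for coefficients [intercept, slopes...] and matching columns."""
    return [coef[0] + sum(c * col[i] for c, col in zip(coef[1:], columns))
            for i in range(len(columns[0]))]

def sargan(y, x1, x2, instruments):
    """n * R^2 from regressing 2SLS residuals on x1 and all instruments."""
    n = len(y)
    exog_all = [x1] + instruments
    x2p = fitted(ols(exog_all, x2), exog_all)        # stage 1
    b = ols([x1, x2p], y)                            # stage 2 coefficients
    # 2SLS residuals use the actual x2, not the predicted x2p
    resid = [yi - (b[0] + b[1] * a + b[2] * c) for yi, a, c in zip(y, x1, x2)]
    fit = fitted(ols(exog_all, resid), exog_all)     # auxiliary regression
    mean_r = sum(resid) / n
    ss_tot = sum((r - mean_r) ** 2 for r in resid)
    ss_res = sum((r - f) ** 2 for r, f in zip(resid, fit))
    return n * (1.0 - ss_res / ss_tot)

random.seed(4)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
z1 = [random.gauss(0, 1) for _ in range(n)]
z2 = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]   # unobserved; makes x2 endogenous
x2 = [0.5 * a + b + c + d + random.gauss(0, 1) for a, b, c, d in zip(x1, z1, z2, u)]
y = [1.0 + 2.0 * a + 1.5 * b + 2.0 * c + random.gauss(0, 1)
     for a, b, c in zip(x1, x2, u)]
y_bad = [yi + 1.0 * w for yi, w in zip(y, z2)]   # z2 now affects y directly

stat_valid = sargan(y, x1, x2, [z1, z2])
stat_bad = sargan(y_bad, x1, x2, [z1, z2])
CRITICAL_5PCT_1DF = 3.84   # chi-squared, 1 degree of freedom (2 IVs - 1 endogenous)

print("Sargan statistic, both IVs valid:", round(stat_valid, 2))
print("Sargan statistic, z2 invalid:    ", round(stat_bad, 2))
```

A large statistic signals that at least one instrument is correlated with the error, but as the slide notes, a small statistic is not proof of validity: the test has no bite if all the instruments fail in the same way.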

IV Validity Conceptual arguments: 1. Explain why z should influence x2 2. Explain why z should not influence y directly 3. Anticipate objections about omitted variables that link z to the error term e. Show that z is not related to those omitted variables, perhaps using outside data. For example, use data on non-veterans to support a claim about how veterans act.

IV Properties Two characteristics mark a good instrumental variable: 1. Validity 2. Strength

Strong IVs A strong instrument has a high correlation with the endogenous variable. How strong a correlation? Staiger & Stock (1997) recommend a first-stage partial F statistic of 10 or greater.
- Run the 1st stage with and without the IV(s).
- Compute the partial F statistic for the joint significance of the IV(s) in the 1st stage: a value of 10 or more is taken as sufficient evidence of strength.
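The partial F statistic can be computed from the first-stage sums of squared residuals with and without the instruments. The sketch below is illustrative: the helpers ols, ssr, and partial_f, the seed, and the "strong" and "weak" first-stage coefficients are assumptions, and the resulting values should be compared with whatever rule-of-thumb threshold is adopted.

```python
import random

def ols(columns, y):
    """Return [intercept, slope_1, ...] from least squares of y on the columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                factor = A[r][p] / A[p][p]
                A[r] = [a - factor * b for a, b in zip(A[r], A[p])]
    return [A[i][k] / A[i][i] for i in range(k)]

def ssr(columns, y):
    """Sum of squared residuals from regressing y on the columns."""
    coef = ols(columns, y)
    total = 0.0
    for i, yi in enumerate(y):
        fit = coef[0] + sum(c * col[i] for c, col in zip(coef[1:], columns))
        total += (yi - fit) ** 2
    return total

def partial_f(exog, instruments, endog):
    """F statistic for the joint significance of the instruments in stage 1."""
    n = len(endog)
    k = 1 + len(exog) + len(instruments)      # parameters in the full stage 1
    q = len(instruments)                      # restrictions being tested
    ssr_restricted = ssr(exog, endog)                  # without the IVs
    ssr_unrestricted = ssr(exog + instruments, endog)  # with the IVs
    return ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k))

random.seed(5)
n = 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1.5) for _ in range(n)]
x2_strong = [0.5 * a + 1.00 * b + e for a, b, e in zip(x1, z, noise)]
x2_weak = [0.5 * a + 0.02 * b + e for a, b, e in zip(x1, z, noise)]

f_strong = partial_f([x1], [z], x2_strong)
f_weak = partial_f([x1], [z], x2_weak)

print("partial F, strong instrument:", round(f_strong, 1))
print("partial F, weak instrument:  ", round(f_weak, 1))
```

With a single instrument this F statistic is the square of the instrument's t statistic in the first stage; the restricted-versus-unrestricted form is used here because it extends directly to several instruments.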

Weak IVs If the IVs are weak, • 2SLS and 2SRI are consistent, but there can be considerable bias even in large samples • standard errors are too small • 2SLS and 2SRI perform poorly

Weak IVs What to do if IVs are weak? If there is a single endogenous variable, use a conditional likelihood ratio (CLR) test:
* perform a regular likelihood ratio test
* adjust the critical values
* available in Stata; see Stata Journal, 3, 57-70 and http://elsa.berkeley.edu/wp/marcelo.pdf by Moreira and Poi
