Lecture 8 regression analysis, regression logic & meaning

Lecturer: David Reinstein

What is regression? When should you use one?
- A way of fitting a line (or plane) through a bunch of dots
- In multiple dimensions
- May have a causal interpretation (or not)
CLM: Population model is linear in parameters: y = b0 + b1x1 + b2x2 + … + bkxk + u
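As a minimal sketch of "fitting a plane through a bunch of dots", the snippet below simulates data from a linear population model and recovers the coefficients by least squares. The sample size, the coefficient values and the variable names are assumptions chosen purely for illustration; they are not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
u = rng.normal(scale=0.5, size=n)

# Assumed "population" coefficients, chosen only for illustration
b0, b1, b2 = 1.0, 2.0, -0.5
y = b0 + b1 * x1 + b2 * x2 + u

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# OLS: the coefficients that minimise the sum of squared residuals
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat)  # estimates of (b0, b1, b2)
```

With exogenous regressors, as here by construction, the estimates should land close to the assumed coefficients.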

OLS: Estimating Actual Linear Relationship?
• Best linear approximation; ‘average slopes’
• Causal or not
• Identifying restrictions; CLM model assumptions
• Some coefficients/tests depend on normality; others have an “asymptotic” justification with a large enough sample
• "… Regression coefficients have an 'average derivative' interpretation. In multivariate regression models this interpretation is unfortunately complicated by the fact that the OLS slope vector is a matrix-weighted average of the gradient of the CEF. Matrix-weighted averages are difficult to interpret except in special cases (see Chamberlain and Leamer, 1976)." - AP (Angrist and Pischke)

Some Resources
• A Guide to Econometrics (Peter Kennedy)
• Mostly Harmless Econometrics: An Empirical Companion (Angrist and Pischke)
• ATS guides
• Stata Web Books: Regression with Stata

HOW TO SPECIFY A REGRESSION
• Functional form? (e.g., linear or “loglinear”? Include quadratic terms?)
• Impose restrictions?

WHICH DEPENDENT VARIABLE?
• Meaningful to your question and interpretable
• Relevant to what you are looking for (e.g., available for the right years and countries)
• Reliably collected
• Specify in logs? Linearly? Categorically?
• Aggregated at what level?

Which right-hand side variables? The focal variables and control variables
Typically, you care about the effect of one (or a few) independent variables on the dependent variable, e.g., education on wages. (Although you might have more complicated hypotheses/relationships to test, involving differences between coefficients etc.) You should focus on credibly identifying this relationship.
• Other right-hand side variables are typically controls (e.g., control for parents’ education, control for IQ test scores).
• Be careful not to include potentially “endogenous” variables as controls, as this can bias all coefficients (more on this later).
• Be careful about putting variables on the right-hand side that are determined after the outcome variable (Y, the dependent variable).

ENDOGENEITY
You care about estimating the impact of a variable x1 on y. Consider the example of regressing income at age 30 on years of education to try to get at the effect of education on income.
x1: years of education
x2…xk: set of “control” variables
y: income at age 30
You regress y on x1, x2, …, xk.

ENDOGENEITY
• Suppose the true relationship (which you almost never know for sure in economics) is y = b0 + b1x1 + b2x2 + … + bkxk + u
• For unbiasedness/consistency of all your estimated terms, the key requirement is E(u|x1, x2, …, xk) = 0, which implies that all of the explanatory variables are exogenous.
• Alternatively, the estimates are still ‘consistent’ under the weaker condition E(u) = 0 and Cov(xj, u) = 0, for j = 1, 2, …, k.
• There are various reasons why the above assumption might not hold; various causes of what we call “endogeneity”. Two examples are reverse causality and omitted variable bias.

REVERSE CAUSALITY
Education may affect income at age 30, but could income at age 30 also affect years of education?
This is probably not a problem for this example, because education is usually finished long before age 30 (even I finished at age 30 on the nose). However, in other examples it is an issue (e.g., consider regressing body weight on income, or vice versa).
Also, if the measure of education were determined years later, this might be a problem. For example, if your measure of years of education was based on self-reports at age 30, maybe those with a lower income would under-report, e.g., if they were ashamed to be waiting tables with a Ph.D.

...or a third, omitted factor may affect both
Intelligence may affect both education obtained and income at age 30.
Macro/aggregate: With variation across time, there may be a common trend. E.g., suppose I were to regress “average income” on “average education” for the UK, using only a time series with one observation per year. A “trend term”, perhaps driven by technological growth, may be leading to increases in education as well as increased income.

THE OMITTED VARIABLE BIAS FORMULA; INTERPRETING/SIGNING THE BIAS
You care about estimating the impact of a variable x1 on y, e.g.,
x1: years of education
y: income at age 30
You estimate y = a0 + a1x1 + e
But the true relationship is y = b0 + b1x1 + b2x2 + u
where x2 is an unobserved or unobservable variable, say “intelligence” or “personality”.

THE OMITTED VARIABLE BIAS FORMULA; INTERPRETING/SIGNING THE BIAS
Your estimate of the slope is likely to be biased (and “inconsistent”). The “omitted variable (asymptotic) bias” is:
plim(a1-hat) - b1 = b2 × d1
where d1 = Cov(x1, x2) / Var(x1), i.e., the slope from a regression of x2 on x1.

THE OMITTED VARIABLE BIAS FORMULA; INTERPRETING/SIGNING THE BIAS
In other words, the coefficient you estimate will “converge to” the true coefficient plus a bias term. The bias is the product:
[Effect of the omitted variable on the outcome] x [“effect” of omitted variable on variable of interest]
E.g., [effect of intelligence on income] x [“effect” of intelligence on years of schooling]
This can be helpful in understanding whether your estimates may be biased, and if so, in which direction: the bias is positive when the two terms have the same sign and negative when they have opposite signs.
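The bias logic can be checked numerically. The sketch below simulates an omitted "intelligence" variable and compares the short regression (omitting it) with the long regression (including it). In any sample, the short-regression slope equals the long-regression slope plus the product of the coefficient on the omitted variable and the slope from regressing the omitted variable on x1. All numbers and names here are illustrative assumptions, not estimates from real data.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients for a design matrix X that already includes a constant."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
n = 10_000
x2 = rng.normal(size=n)                            # "intelligence" (unobserved in practice)
x1 = 12 + 1.5 * x2 + rng.normal(size=n)            # years of education, correlated with x2
y = 5 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)   # income in made-up units

const = np.ones(n)
short = ols(np.column_stack([const, x1]), y)       # omits x2
long_ = ols(np.column_stack([const, x1, x2]), y)   # includes x2
d1 = ols(np.column_stack([const, x1]), x2)[1]      # slope from regressing x2 on x1

bias = long_[2] * d1
print(short[1], long_[1] + bias)                   # these two numbers coincide
```

Here both terms of the product are positive, so the short regression overstates the effect of education.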

CONTROL STRATEGIES
• Control for “X2-Xk” variables that have direct effects on Y; this will reduce omitted variable bias (if these variables are correlated with your “X1” of interest).
• Including controls can also make your estimates more precise.
• If you put in an “Xk” variable that doesn’t actually have a true effect on Y, it will typically make your estimates less precise. However, it will only lead to a bias if it is itself endogenous (and correlated with your X1 of interest).

“BAD CONTROL”
"Some variables are bad controls and should not be included in a regression model even when their inclusion might be expected to change the short regression coefficients. Bad controls are variables that are themselves outcome variables in the notional experiment at hand. That is, bad controls might just as well be dependent variables too." (Angrist and Pischke)
- They could also be interpreted as endogenous variables.

“BAD CONTROL”
• "Once we acknowledge the fact that college affects occupation, comparisons of wages by college degree status within occupation are no longer apples to apples, even if college degree completion is randomly assigned."
  - The question here was whether to control for the category of occupation, not the college degree.
• "It is also incorrect to say that the conditional comparison captures the part of the effect of college that is 'not explained by occupation'"
• "so we would do better to control only for variables that are not themselves caused by education."
- AP (Angrist and Pischke)
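A small simulation can illustrate the bad-control problem. In this sketch college is randomly assigned, occupation depends on both college and unobserved ability, and wages depend on college and ability but not directly on occupation. The data-generating process, the threshold and the effect sizes are all assumptions made up for illustration.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients for a design matrix X that already includes a constant."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 200_000
college = rng.integers(0, 2, size=n).astype(float)      # randomly assigned degree
ability = rng.normal(size=n)                            # unobserved
# Occupation is itself an outcome of college (and ability): a "bad control"
white_collar = (college + ability > 0.5).astype(float)
# Assumed true effect of college on wages is 1.0; occupation has no direct effect
wage = 1.0 * college + 2.0 * ability + rng.normal(size=n)

const = np.ones(n)
b_no_control = ols(np.column_stack([const, college]), wage)[1]
b_bad_control = ols(np.column_stack([const, college, white_collar]), wage)[1]
print(b_no_control, b_bad_control)
```

Without the occupation control, random assignment delivers an estimate near the assumed effect. Within an occupation, graduates and non-graduates differ in ability, so the coefficient with the control is pushed well below the assumed effect.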

Fixed effects/difference-in-between
The net effect of omitted variables and the truly random term may have fixed and varying components.
There may be a term “Ci” that is specific to an individual or “unit”, but that does not change over time. For example, an individual may be more capable of earning, a firm may have a particularly good location, and a country may have a particularly high level of trust in institutions.
There may also be a term that varies across units and over time. An individual may experience a particular negative shock to her income, a firm may be hit by a lawsuit, and a country may have a banking scandal.

Fixed effects/difference-in-between
If this Ci part of the “error term” may be correlated with the independent variable of interest, X1, it may help to “difference this out” by doing a Fixed Effects Regression.
This essentially includes a dummy variable for each individual (or “unit”), but these dummies are usually not reported.
The resulting coefficients are the same ones you would get if you “de-meaned” every X and Y variable before running the regression. By “de-meaned” I mean replace each Xit with Xit - Xbar_i and each Yit with Yit - Ybar_i, where the bars indicate “the mean of this variable for individual i”.
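The equivalence between unit dummies and de-meaning can be verified on simulated panel data. The sketch below is illustrative only: the panel dimensions, the assumed true slope of 1.5 and the helper name `demean_by_unit` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_units, n_periods = 50, 6
unit = np.repeat(np.arange(n_units), n_periods)      # unit index for each observation
c = rng.normal(size=n_units)[unit]                   # fixed unit effect Ci
x = 0.8 * c + rng.normal(size=unit.size)             # regressor correlated with Ci
y = 1.5 * x + 2.0 * c + rng.normal(size=unit.size)   # assumed true slope: 1.5

def demean_by_unit(v):
    """Subtract each unit's own mean from its observations."""
    means = np.bincount(unit, weights=v) / np.bincount(unit)
    return v - means[unit]

# (a) Within estimator: OLS on de-meaned data (no constant needed)
x_dm, y_dm = demean_by_unit(x), demean_by_unit(y)
b_within = (x_dm @ y_dm) / (x_dm @ x_dm)

# (b) OLS with one dummy per unit gives the same slope
dummies = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
b_dummies = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)[0][0]

# (c) Pooled OLS that ignores Ci, for comparison
X_pooled = np.column_stack([np.ones(unit.size), x])
b_pooled = np.linalg.lstsq(X_pooled, y, rcond=None)[0][1]
print(b_within, b_dummies, b_pooled)
```

The first two numbers agree up to floating-point error, while the pooled estimate is pulled upward because Ci is positively related to both x and y in this setup.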

INSTRUMENTAL VARIABLES
• A variable Z may be used as an “instrument” if it
  (1) “causes” the X1 variable of interest but has no independent effect on Y, and
  (2) is not correlated with the true error term.
• For example, it might be argued (debatable) that if one’s parents had a job near a good university, this would increase one’s chances of going to a good university. To use “distance to nearest university" as an instrument you would have to argue:
  (i) There is no direct effect of living near a good university on later income.
  (ii) The probability of living near a good university is not caused by a third unobserved factor (e.g., parents’ interest in children’s success) that might also affect later income.

2SLS
One form of instrumental variables (IV) technique is called “two stage least squares” (2SLS).
This essentially involves regressing X1 on Z (and other controls) and obtaining a predicted value of X1 from this equation, X1-hat, and then regressing Y on X1-hat (and the same set of other controls) but “excluding” Z from this second-stage regression.
You should generally report both the first and second stages in a table, and “diagnostics” of this instrument.
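The two stages can be sketched by hand on simulated data. In this hypothetical setup an unobserved confounder `w` drives both X1 and Y, while the instrument `z` shifts X1 only; every coefficient value is an assumption for illustration. Note that the standard errors from a second stage run by hand like this are not correct, so in practice a dedicated IV routine is normally used.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients for a design matrix X that already includes a constant."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(4)
n = 50_000
z = rng.normal(size=n)                         # instrument (hypothetical)
w = rng.normal(size=n)                         # unobserved confounder
x1 = 0.7 * z + w + rng.normal(size=n)          # endogenous regressor
y = 1.0 * x1 + 2.0 * w + rng.normal(size=n)    # assumed true effect of x1: 1.0

const = np.ones(n)
b_ols = ols(np.column_stack([const, x1]), y)[1]    # biased by the confounder

# First stage: regress x1 on the instrument and form the predicted value
Z = np.column_stack([const, z])
x1_hat = Z @ ols(Z, x1)

# Second stage: regress y on the predicted value, with z itself excluded
b_2sls = ols(np.column_stack([const, x1_hat]), y)[1]
print(b_ols, b_2sls)
```

In this setup plain OLS overstates the effect, while the 2SLS point estimate should sit close to the assumed value.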

SOME OTHER ISSUES AND “DIAGNOSTICS”
• Time series (and panel) data: issues of autocorrelation, lag structure, trends, non-stationarity
• Non-normal error terms, small sample
• Categorical dependent variable: consider Logit/Probit if binary, Multinomial logit if categorical, Poisson if ‘count’ data; other variants/models
• Bounded/censored dependent variable: consider Tobit and other models
• Sample selection issues; self-selection, selectivity, etc.
• Missing values/variables … Imputation
• Errors in variables (classical, otherwise)
• The meaning of R-squared; when is it useful/important?

HETEROSKEDASTICITY
OLS coefficients are still unbiased/consistent, but may not be efficient.
The usual estimated standard errors of the estimator (and the tests based on them) are not valid.
Autocorrelation: similar, but it can be a sign of a misspecified dynamic model.

RESPONSES (TO HETEROSKEDASTICITY AND “SIMPLE” AUTOCORRELATION)
• “Feasible” GLS (only consider doing this with lots of data), or
• Regular OLS with robust standard errors (or clustered in a certain way)
- “Test” for heteroskedasticity, and if you fail to reject homoskedasticity say “whew, I can ignore this”? Controversial; I don’t like this because the test may not be powerful enough. So use ‘robust’ anyway.
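To make "robust standard errors" concrete, here is a minimal sketch of the White (HC0) sandwich formula computed by hand and compared with the classical formula. The simulated design, in which the error spread grows with x, is an assumption for illustration; real software usually applies a small-sample correction on top of HC0.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
x = rng.uniform(0, 4, size=n)
u = rng.normal(size=n) * x            # error spread grows with x: heteroskedastic
y = 1.0 + 0.5 * x + u

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                 # OLS coefficients
resid = y - X @ b

# Classical standard errors: assume one common error variance
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 robust standard errors: each observation keeps its own squared residual
meat = (X * resid[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print(se_classical, se_robust)
```

The coefficient estimates are identical under both approaches; only the standard errors change. In this design the classical formula understates the uncertainty about the slope.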

INTERPRETING YOUR RESULTS 1: TEST FOR SIGNIFICANCE
Simple differences (not in a regression): a variety of parametric, nonparametric and “exact” tests
Regression coefficients: t-tests
• Difference from zero (usually 2-sided)
• Difference from some hypothesis (e.g., difference from unity)
• Joint test of coefficients
Evidence for ‘small or no effect’: one-sided t-test of a coefficient b, e.g., H0: b >= 10 vs HA: b < 10, where 10 is a ‘small value’ in this context

JOINT SIGNIFICANCE OF A SET OF COEFFICIENTS: F-TESTS
H0: all tested coefficients are truly = 0
HA: at least one coefficient has a true value ≠ 0
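One common way to compute the F-statistic compares the sum of squared residuals (SSR) of a restricted model, which imposes H0, with that of the unrestricted model. The sketch below assumes homoskedastic errors and uses made-up data in which x2 matters and x3 does not; the helper name `ssr` is hypothetical.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared OLS residuals for design matrix X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

rng = np.random.default_rng(6)
n = 1_000
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.0 * x3 + rng.normal(size=n)

const = np.ones(n)
X_unres = np.column_stack([const, x1, x2, x3])   # unrestricted model
X_res = np.column_stack([const, x1])             # imposes H0: coefficients on x2 and x3 are 0

q = 2                       # number of restrictions tested
k = X_unres.shape[1]        # parameters in the unrestricted model
F = ((ssr(X_res, y) - ssr(X_unres, y)) / q) / (ssr(X_unres, y) / (n - k))
print(F)  # compare with the F(q, n - k) critical value, roughly 3.0 at the 5% level here
```

Because x2 truly enters the simulated model, the statistic should be far above the critical value, so H0 is rejected even though one of the two tested coefficients is zero.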

INTERPRETING RESULTS 2: MAGNITUDES & SIZES OF EFFECTS
In a linear model in levels-on-levels, the coefficients on continuous variables have a simple “slope interpretation”.
Note: this assumes a homogeneous effect; otherwise it gets complicated.
Dummy variables have a “difference in means, all else equal” interpretation.
But be careful to describe, understand and explain the estimated effects (or “linear relationships”) in terms of the units of the variables (e.g., impact of years of education on thousands of pounds of pre-tax salary at age 30).

INTERPRETING RESULTS 2: MAGNITUDES & SIZES OF EFFECTS
Transformed/nonlinear variables
When some variables are transformed, e.g., expressed in logarithms, interpretation is a little more complicated (but not too difficult). Essentially, the impact of/on logged variables represents a “proportional” or “percentage-wise” impact. Look this up and describe the effects correctly.
In nonlinear models (e.g., Logit, Tobit, Poisson/Exponential) the marginal effect of a variable is not constant; it depends on the other variables and the error/unobservable term. But you can express things like the “marginal effect averaged over the observed values” or (for some models) the “proportional percentage effect.”
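As a small worked example of the "percentage-wise" reading, consider a log-level model log(y) = a + b×x. A one-unit rise in x multiplies y by exp(b), so 100×b is only an approximation to the percentage change, and it is a good one when b is small. The function name and the coefficient values below are hypothetical.

```python
import math

def pct_effect_log_level(b):
    """Exact % change in y for a one-unit rise in x when log(y) = a + b*x."""
    return 100 * (math.exp(b) - 1)

# Hypothetical coefficients: the approximation 100*b works well only for small b
for b in (0.02, 0.08, 0.50):
    print(b, round(100 * b, 2), round(pct_effect_log_level(b), 2))
```

In a log-log model, by contrast, the coefficient is read directly as an elasticity: the percentage change in y associated with a one percent change in x.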

INTERPRETING RESULTS 3: INTERACTION TERMS
You may run a regression such as:
INCOME = A + B1×YEARS_EDUC + B2×FEMALE×YEARS_EDUC + B3×FEMALE + U
where FEMALE is a dummy variable that = 1 if the observed individual is a woman and = 0 if he is a man.
How do you interpret each coefficient estimate?
A: A constant “intercept”; fairly meaningless by itself unless the other variables are expressed as differences from the mean, in which case it represents the mean income.
B1: “Effect” of years of education on income (at age 30, say) for males
B2: “Additional effect” of years of education on income for females relative to males
B3: “Effect” of being female on income at zero years of education. Because of the interaction term, the female-male difference at a given level of education is B3 + B2×YEARS_EDUC, so B3 alone is not the gap “holding education constant” at typical education levels.
What about B1 + B2? = “Effect” of years of education on income for females
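The interpretation of the interaction model can be checked on simulated data. The sketch below generates income from assumed values of A, B1, B2 and B3 (made up, in arbitrary units), re-estimates them, and then combines the estimates into the slope for females and the female-male gap at given education levels, which equals B3 + B2×YEARS_EDUC.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
female = rng.integers(0, 2, size=n).astype(float)
years_educ = rng.integers(8, 21, size=n).astype(float)   # 8 to 20 years

# Assumed coefficients, for illustration only
A, B1, B2, B3 = 5.0, 2.0, 0.5, -10.0
income = (A + B1 * years_educ + B2 * female * years_educ + B3 * female
          + rng.normal(scale=3.0, size=n))

X = np.column_stack([np.ones(n), years_educ, female * years_educ, female])
a, b1, b2, b3 = np.linalg.lstsq(X, income, rcond=None)[0]

slope_male = b1                 # "effect" of a year of education for males
slope_female = b1 + b2          # "effect" of a year of education for females
gap_at_12 = b3 + b2 * 12        # female-male gap at 12 years of education
gap_at_16 = b3 + b2 * 16        # female-male gap at 16 years of education
print(slope_male, slope_female, gap_at_12, gap_at_16)
```

The gap differs between 12 and 16 years of education, which is why a single "effect of being female" is not well defined once the interaction is included.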
