Non-Experimental Data II: What Should Be Included in a Regression? Omitted Variables and Measurement Error


Presentation Transcript


  1. Non-Experimental Data II: What Should Be Included in a Regression? Omitted Variables and Measurement Error

  2. Causal Effects in Bog-Standard Non-Experimental Data • Often no clever instrument or natural experiment is available • Just going to run a regression of y on X1 – and what else? • Which variables to include is a basic day-to-day decision for the practising applied economist • Apologies if this is too basic, but it is important • No specific recipe, but some general principles

  3. Think of what question you want to answer • Want to estimate E(y|X,?) • Think of what the `?' should be • Returns to education – should you include or exclude occupation? • If you include it, it will improve R², so occupation is 'relevant' • But you will then be asking 'what is the effect of education on earnings holding occupation constant?' – perhaps not what we want

  4. Will focus on econometric issues • What are the issues we need to worry about: • Omitted variables • Measurement error • Will discuss these issues in turn • Slight change in notation to a more standard form • Run a regression of y on X1, X2, etc. – want the causal effect of X1 on y

  5. Omitted Variable Issues • Basic model is: y = X1β1 + X2β2 + ε • Two issues: • What happens if we include X2 when it is irrelevant (β2 = 0)? • What happens if we exclude X2 when it is relevant (β2 ≠ 0)?

  6. Proposition 4.1: If X2 is irrelevant • OLS estimate of β1 is consistent, so no problem here (not surprising – we are imposing a 'true' restriction on the data) • But there is a cost – lower precision in the estimate of β1

  7. Proof of Proposition 4.1a • Many ways to prove this • Can just read it off from: b = (X'X)⁻¹X'y = β + (X'X)⁻¹X'ε, so plim b = β – with β2 = 0 the long regression is still correctly specified • Or use the result from the partitioned regression model: b1 = (X1'M2X1)⁻¹X1'M2y, where M2 = I − X2(X2'X2)⁻¹X2'

  8. Proof of Proposition 4.1b – Method 1 • Using results from the partitioned regression model, can write the OLS estimate of β1 as: b1 = (X1'M2X1)⁻¹X1'M2y • This is linear in y but generally different from the OLS estimate when X2 is excluded • Can invoke the Gauss-Markov theorem – OLS in the true model (y on X1 alone) is BLUE, and b1 above is linear and unbiased, so it cannot have a smaller variance – note the use of the irrelevance of X2 here

  9. Proof of Proposition 4.1b – Method 2 (X1 and X2 one-dimensional) • This uses results from the notes on experiments • If we exclude X2, the variance of the coefficient on X1 is given by: Var(b1) = σ0² / Σi(X1i − X̄1)² • If we include X2, the variance of the coefficient on X1 is given by: Var(b1) = σ² / [(1 − ρ12²)·Σi(X1i − X̄1)²] • If X2 is irrelevant then σ0² = σ², so including X2 inflates the variance by the factor 1/(1 − ρ12²) ≥ 1

  10. What determines size of loss of precision? • The bigger the correlation between X1 and X2, the greater the likely loss in precision • To see this: if X1 and X2 are uncorrelated then the two estimates are identical • Consider the extreme case of perfect correlation – then there is perfect multicollinearity if X2 is included • Also useful to think of Proposition 4.1b as a specific application of the general principle that if we impose a 'true' restriction on parameters (here β2 = 0) then the precision of estimation of the other parameters improves – a gain in efficiency.
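
A minimal Stata sketch of this precision loss (variable names and parameter values are illustrative, not from the slides): X2 is irrelevant (β2 = 0) but correlated with X1, so including it leaves the estimate on X1 consistent while inflating its standard error by roughly 1/√(1 − ρ12²).

      clear
      set seed 12345
      set obs 1000
      gen x1 = rnormal()
      gen x2 = 0.8*x1 + rnormal()   // correlated with x1, but true beta2 = 0
      gen y = x1 + rnormal()        // true model: beta1 = 1, X2 irrelevant
      reg y x1                      // consistent, smaller s.e. on x1
      reg y x1 x2                   // still consistent, larger s.e. on x1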

  11. Contrast with earlier result on other variables with experimental data • Here, inclusion of irrelevant variables correlated with X reduces precision • Earlier, inclusion of relevant variables uncorrelated with X increases precision • The two results are consistent: • Including relevant variables increases precision • Including variables correlated with X reduces precision • Ambiguous effect on precision of including a relevant variable correlated with X

  12. Excluding Relevant Variables • Leads to omitted variable bias if X1 and X2 are correlated: plim b1 = β1 + β2·Cov(X1, X2)/Var(X1) (one-dimensional case)
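
A companion sketch to the one above (same illustrative design, but now X2 is relevant): omitting the correlated X2 shifts the coefficient on X1 by β2·Cov(X1, X2)/Var(X1), here 1 × 0.8 = 0.8.

      clear
      set seed 12345
      set obs 1000
      gen x1 = rnormal()
      gen x2 = 0.8*x1 + rnormal()   // relevant and correlated with x1
      gen y = x1 + x2 + rnormal()   // true beta1 = 1, beta2 = 1
      reg y x1 x2                   // both coefficients near 1
      reg y x1                      // coefficient on x1 near 1 + 0.8 = 1.8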

  13. Is it better to exclude the relevant or include the irrelevant? • Omitting relevant variables causes bias • Putting in irrelevant variables lowers precision • Might conclude it is better to err on the side of caution and include lots of regressors – the 'kitchen sink' approach • But: • May be prepared to accept some bias for extra precision • Can worsen problems of measurement error

  14. Measurement Error • True value is X* • True model is: y = X*β + ε • But X* is observed with error – the observed value is X • Measurement error has the classical form: X = X* + u, E(u|X*) = 0 • Can write the model in terms of observables as: y = Xβ − uβ + ε • X is correlated with the composite error (ε − uβ), so there is bias in the OLS estimate

  15. Proposition 4.2: With one regressor (with classical measurement error) the plim of the slope coefficient is: plim b = β·σ²X* / (σ²X* + σ²u) • OLS estimate is biased towards zero – this is attenuation bias • Extent of bias related to the importance of the measurement error: σ²X*/(σ²X* + σ²u) is the signal-to-noise ratio, or reliability ratio
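
A sketch of Proposition 4.2 in simulated data (all values illustrative): with σ²X* = σ²u = 1 the reliability ratio is 0.5, so the estimated slope should be roughly half the true one.

      clear
      set seed 12345
      set obs 10000
      gen xstar = rnormal()        // true regressor, variance 1
      gen x = xstar + rnormal()    // classical measurement error, variance 1
      gen y = xstar + rnormal()    // true slope = 1
      reg y x                      // slope close to 1 * 0.5 = 0.5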

  16. The General Case • Have the previous model but now X is more than one-dimensional • Some notation and assumptions: X = X* + u, E(u|X*) = 0, u uncorrelated with ε • Covariance matrix of X* is ΣX*X* • Covariance matrix of u is Σ

  17. Proposition 4.3: The plim of the OLS estimator with many error-ridden regressors is: plim b = (ΣX*X* + Σ)⁻¹ΣX*X*β

  18. Why is this? • plim b = [plim(X'X/N)]⁻¹·plim(X'y/N) • plim(X'X/N) = ΣX*X* + Σ – the noise adds to the variance of the observed regressors • plim(X'y/N) = ΣX*X*β – the noise is uncorrelated with X* and ε, so it does not covary with y

  19. Matrix equivalent of attenuation bias • But, in the general case, hard to say anything about the direction of the bias on any single coefficient • If ΣX*X* and Σ are both diagonal then all coefficients are biased towards zero

  20. An Informative Special Case • Two variables: X1 measured with error, X2 measured without error

  21. Earlier formula leads to: plim b1 = β1·σ²X1*(1 − ρ12²) / [σ²X1*(1 − ρ12²) + σ²u]

  22. Proposition 4.4: Attenuation Bias of an Error-Ridden Variable Worsens when Other Variables are Included • Where ρ12 is the correlation between X1* and X2 • If ρ12 ≠ 0 this attenuation bias is worse than when X2 is excluded • Intuition: X2 soaks up some of the signal in X1, leaving relatively more noise in what remains
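
A sketch of Proposition 4.4 under the same illustrative design (X2 is given a zero true coefficient here so that the fall in the slope on X1 isolates the extra attenuation rather than any omitted variable bias):

      clear
      set seed 12345
      set obs 10000
      gen xstar = rnormal()
      gen x1 = xstar + rnormal()       // measured with error
      gen x2 = 0.8*xstar + rnormal()   // measured without error, correlated with X1*
      gen y = xstar + rnormal()        // true slope on X1* is 1
      reg y x1                         // attenuated slope, near 0.5
      reg y x1 x2                      // attenuation worsens, slope near 0.38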

  23. Proposition 4.5: The Presence of Error-Ridden Variables Causes Inconsistency in the Coefficients of Other Variables: plim b2 = β2 + β1·σ12·σ²u / [σ²2·(σ²X1*(1 − ρ12²) + σ²u)] • This is inconsistent if X1 and X2 are correlated (σ12 ≠ 0) • Mirror image of the previous result – X2 soaks up some of the true variation in X1

  24. An Extreme Case • Observed X1 is all noise, σ²u = ∞ – its coefficient will be zero • Then we get: plim b2 = β2 + β1·σ12/σ²2 • Should recognise this as the formula for omitted variable bias when X1 is excluded

  25. Measurement error in the Dependent Variable • Suppose classical measurement error in y: y = y* + u • Assume u is uncorrelated with y*, X • Then: y = Xβ + ε + u • X is uncorrelated with u, so OLS is consistent • But there is a loss in precision, so still a cost to bad data
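
A sketch of the dependent-variable case (values illustrative): the noise leaves the slope consistent but inflates the residual variance and hence the standard error.

      clear
      set seed 12345
      set obs 10000
      gen x = rnormal()
      gen ystar = x + rnormal()       // true slope = 1
      gen y = ystar + rnormal(0, 2)   // noisy measure of ystar
      reg ystar x                     // slope near 1
      reg y x                         // slope still near 1, larger s.e.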

  26. Example 2: Including Variables at a Higher Level of Aggregation • X* is an individual-level variable • Only observe the average value at some higher level of aggregation (e.g. village, industry, region) – call this X • Model for the relationship between X and X*: X* = X + u, E(u|X) = 0 • Note the change in format: the error is now attached to the true value rather than the observed one

  27. In the regression we have: y = X*β + ε = Xβ + uβ + ε • X and u are uncorrelated, so no inconsistency in the OLS estimate of the coefficient • But not ideal: • Loss in precision – less variation in the regressor • Limits the way other higher-level variables can be modelled • May cause inconsistency in the coefficients on other variables, as E(u|X,Z) will depend on Z
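
A sketch of the aggregated-regressor case (group sizes and names illustrative): only the group mean of X* is observed; OLS on the mean stays consistent but uses only the between-group variation in X.

      clear
      set seed 12345
      set obs 50                          // 50 groups
      gen g = _n
      expand 100                          // 100 individuals per group
      gen xstar = rnormal()               // individual-level X*
      gen y = xstar + rnormal()           // true slope = 1
      bysort g: egen xbar = mean(xstar)   // only the group average is observed
      reg y xstar                         // infeasible benchmark
      reg y xbar                          // consistent, but far less precise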

  28. Summary of results on omitted variables and measurement error • Including irrelevant variables leads to loss in precision • Excluding relevant variables leads to omitted variables bias • Measurement error in X variables typically causes attenuation bias in coefficients • Inclusion of other variables worsens attenuation bias (though may reduce omitted variables bias)

  29. Strategies For Omitted Variables/Measurement Error • One strategy for dealing with omitted variables is to get data on the variable and include it • One strategy for dealing with measurement error is to get better-quality data • These are good strategies but may be easier said than done • IV offers another approach if an instrument can be argued to be correlated with the true value of the variable of interest and uncorrelated with the measurement error / omitted variable
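
One standard way to implement the IV idea for measurement error, sketched with illustrative names: a second, independently mismeasured report of the same X* is correlated with the true value but not with the first report's error, so it is a valid instrument.

      clear
      set seed 12345
      set obs 10000
      gen xstar = rnormal()
      gen x1 = xstar + rnormal()   // first noisy measure
      gen x2 = xstar + rnormal()   // second, independent noisy measure
      gen y = xstar + rnormal()    // true slope = 1
      reg y x1                     // attenuated, near 0.5
      ivregress 2sls y (x1 = x2)   // removes the attenuation, near 1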

  30. Clustered Standard Errors • In many situations individuals are affected by variables that operate at a higher level, e.g. industry, region, economy • Call this higher level a group or cluster • Can include group-level variables in the regression • May be difficult to control for all relevant group-level variables, so common practice is to include a dummy variable for each group • These dummy variables will capture the impact of all group-level variables

  31. Can write this model as: y = Xβ + Dθ + ε • Where D is the (N×G) matrix of group dummies and θ the vector of group-level effects (assume mean zero) • Will often see this, but: • Low precision if the number of groups is large (only exploits within-group variation in X) • Can't identify the effect of a group-level variable X

  32. Let's think some more about this case… • Might think about dropping the group-level dummies and simply estimating: y = Xβ + ε • But this assumes the covariance between the residuals of individuals in the same group is zero – this is very strong • A half-way house is to think of θ not as parameters to be estimated but as 'errors' that operate at the level of the group • Assume θ is uncorrelated with X, ε

  33. An Error Component Model • Error for individual i in group g can be written as: ui = θg + εi • Variance of this error is: Var(ui) = σ²θ + σ²ε • Correlation between errors for individuals in the same group (zero for those not in the same group): ρ = σ²θ / (σ²θ + σ²ε)

  34. Why is this? • For individuals i and j in the same group g: Cov(ui, uj) = Cov(θg + εi, θg + εj) = σ²θ • As they have the same group-level component • For individuals in different groups the covariance is zero, as they have different (and assumed independent) group-level components

  35. Implications • Covariance matrix of the composite errors ui will no longer be diagonal – denote it by σ²Ω • OLS estimate will still be consistent (though not efficient) • Computed standard errors will be inconsistent – should be computed by: Var(b) = σ²(X'X)⁻¹X'ΩX(X'X)⁻¹

  36. With this particular error component model: Var(b) = σ²ε(X'X)⁻¹ + σ²θ(X'X)⁻¹(X'D)(D'X)(X'X)⁻¹ • i.e. the usual formula plus something • The usual formula will be wrong if the second term is non-zero

  37. Can say more… • (X'D) will be a (k×G) matrix whose k-th row, g-th column consists of the sum of the values of Xk for those in group g • Suppose all groups are of equal size, Ng = N/G • Define a (G×k) matrix X̄ of the average values of X in each group: then X'D = Ng·X̄'

  38. Using this in the previous expression: Var(b) = σ²ε(X'X)⁻¹ + σ²θ·Ng²·(X'X)⁻¹X̄'X̄(X'X)⁻¹ • For the case of one regressor the variance of the slope coefficient will be: Var(b1) = [σ²ε + σ²θ·Ng·Var(Xg)/Var(Xi)] / [N·Var(Xi)] • Where Var(Xi) is the variance of X across individuals and Var(Xg) is the variance of the group means of X across groups

  39. Case I: X correlation within and between groups the same • If X has no group structure then Var(Xg) = Var(Xi)/Ng, so: Var(b1) = (σ²ε + σ²θ) / [N·Var(Xi)] = σ² / [N·Var(Xi)] • i.e. the usual formula is correct • Implies no (or only a small) problem with standard errors for variables which do not have much group-level variation

  40. Case II: Group-Level Regressor • If X varies only at the group level then Var(Xg) = Var(Xi), so: Var(b1) = [σ² / (N·Var(Xi))]·[1 + (Ng − 1)ρ] • Standard formula understates the true variance by a factor related to the importance of the group-level shock (ρ) and the size of the groups (Ng)
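
A back-of-envelope illustration of this factor with assumed values Ng = 500 and ρ = 0.05: even a small intra-class correlation can make the true standard error about five times the conventional one.

      display sqrt(1 + (500 - 1)*0.05)   // ≈ 5.09 = true s.e. / conventional s.e.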

  41. An even more special case… • All individuals within groups are clones – ρ = 1 • Then: Var(b1) = σ²·Ng / [N·Var(Xi)] = σ² / [G·Var(Xg)] • Really only have G observations • Simplest to estimate at group level • But group-level estimation generally causes a loss in efficiency, so not the best solution

  42. Dealing with this in practice… • STATA has an option to compute standard errors allowing for clustering: . reg y x1 x2, cl(x3) • Such standard errors are said to be clustered, with the 'cluster' being x3 • So quite easy to do in practice
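
A sketch of the whole mechanism in simulated data (group counts, sizes and variances all illustrative): a purely group-level regressor with a group-level error component, comparing conventional and clustered standard errors.

      clear
      set seed 12345
      set obs 50                       // 50 groups
      gen g = _n
      gen xg = rnormal()               // group-level regressor
      gen theta = rnormal()            // group-level error component
      expand 100                       // 100 individuals per group
      gen y = xg + theta + rnormal()   // true slope = 1
      reg y xg                         // conventional s.e.: much too small
      reg y xg, cl(g)                  // clustered s.e.: much larger, and correct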

  43. An example – the effect of gender and regional unemployment on wages • Data from the UK LFS • Would expect the gender mix not to vary much between regions, so most variation is within-region • The unemployment rate only has variation at the regional level • Would expect clustering to increase the standard error on gender only a little, but that on the u-rate a lot

  44. No clustering

           logwage |      Coef.   Std. Err.        t
      -------------+-------------------------------
               sex |  -.2285092    .0091228   -25.05
             urate |   1.057465    .3928981     2.69
             _cons |   2.447221    .0228265   107.21
      -----------------------------------------------

  45. With clustered standard errors

                   |                 Robust
           logwage |      Coef.   Std. Err.        t
      -------------+-------------------------------
               sex |  -.2285092    .0110932   -20.60
             urate |   1.057465    2.943567     0.36
             _cons |   2.447221    .1494707    16.37
      -----------------------------------------------

      As predicted by theory: clustering barely changes the standard error on sex but increases that on urate by a factor of about seven.

  46. Conclusions • Good practice to cluster the standard errors if not going to include group-level dummies • This is particularly important for group-level regressors – standard errors will otherwise often be much too low
