V. Regression Diagnostics


Regression analysis assumes a random sample of independent observations on the same individuals (i.e. units).

• What are its other basic assumptions? They all concern the residuals (e):

(1) The mean of the probability distribution of e over all possible samples is 0: i.e. the mean of e does not vary with the levels of x.

(2) The variance of the probability distribution of e is constant for all levels of x: i.e. the variance of the residuals does not vary with the levels of x.

(3) The errors associated with any two different y observations are uncorrelated (their covariance is 0): i.e., the error associated with one value of y has no effect on the errors associated with other y values.

(4) The probability distribution of e is normal.

The assumptions are commonly summarized as I.I.D.: independent & identically distributed residuals.

• To the degree that these assumptions are confirmed, the relationship between the outcome variable & independent variables is adequately linear.
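The four assumptions can be written compactly. This is only a restatement of the list above; the symbols β, σ², k & the subscripts i, j are notation introduced here for illustration (the slides themselves use only y, x & e).

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + e_i

\begin{aligned}
&(1)\quad E(e_i \mid x) = 0 \\
&(2)\quad \operatorname{Var}(e_i \mid x) = \sigma^2 \\
&(3)\quad \operatorname{Cov}(e_i, e_j \mid x) = 0 \quad \text{for } i \neq j \\
&(4)\quad e_i \mid x \sim N(0, \sigma^2)
\end{aligned}
```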

What are the implications of these assumptions?

• Assumption 1: ensures that the regression coefficients are unbiased.
• Assumptions 2 & 3: ensure that the standard errors are unbiased & are the lowest possible, making p-values & significance tests trustworthy.
• Assumption 4: ensures the validity of confidence intervals & p-values.

Assumption 4 is by far the least important: even if the distribution of a regression model’s residuals departs from approximate normality, the central limit theorem makes us generally confident that confidence intervals & p-values will be trustworthy approximations if the sample is at least 100-200.

• Problems with assumption 4, however, may be indicative of problems with the other, crucial assumptions: when might these be violated to the degree that the findings are highly biased & unreliable?

Serious violations of assumption 1 result from anything that causes serious bias of the coefficients: examples?

• Violations of assumption 2 occur, e.g., when variance of income increases as the value of income increases, or when variance of body weight increases as body weight itself increases: this pattern is commonplace.

Violations of assumption 3 occur as a result of clustered observations or time-series observations: the errors are not independent from one observation to another but rather are correlated.

• E.g., in a cluster sample of individuals from neighborhoods, schools, or households, the individuals within any such unit tend to be significantly more homogeneous than are individuals in the wider sample. Ditto for panel or time-series observations.

In the real world, the linear model is usually no better than an approximation, & violations to one extent or another are the norm.

• What matters is if the violations surpass some critical threshold.
• Regression diagnostics: procedures to detect violations of the linear model’s assumptions; gauge the severity of the violations; & take appropriate remedial action.

Keep in mind: statistical vs. practical significance in evaluating the findings of diagnostic tests.

• See King et al. for applications of the logic of regression diagnostics to qualitative social science research as well.

Keep in mind that the linear model does not assume that the distribution of a variable’s observations is normal.

• Its assumptions, rather, involve the residuals (which are the sample estimates of the population e).
• While it’s important to inspect univariate & bivariate distributions, & to be careful about extreme outliers, recall that multiple regression expresses the joint, multivariate associations of x’s with y.

Let’s turn our attention to regression diagnostics.

• For the sake of presenting the material, we’ll examine & respond to the diagnostic tests step by step.
• In ‘real life,’ though, we should go through the entire set of diagnostic tests first & then use them fluidly & interactively to address the problems.

Model Specification

• Does a regression model properly account for the relationship between the outcome & explanatory variables?
• See Wooldridge, Introductory Econometrics, pages 289-94; Allison, Multiple Regression, pages 49-52, 123-25; Stata Reference G-M, pages 274-79; N-R, 363-65.

If a model is functionally misspecified, its slope coefficients may be seriously biased, either too low or too high.

• We could then either under- or overestimate the y/x relationship, & conclude incorrectly that a coefficient is insignificant or significant.

If this is a problem, perhaps the outcome variable needs to be redefined to properly account for the y/x relationships (e.g., from ‘miles per gallon’ to ‘gallons per mile’).

• Or perhaps, e.g., ‘wage’ needs to be transformed to ‘log(wage)’.
• And/or maybe not OLS but another kind of regression (e.g., quantile regression) should be used.

Let’s begin by exploring the variables we’ll use.

. use WAGE1, clear

. hist wage, norm

. gr box wage, marker(1, mlab(id))

. su wage, d

• Note that ‘ladder wage’ doesn’t suggest a transformation, but log wage is common for right skewness.
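A minimal sketch of the ladder-of-powers commands behind the note above. It assumes WAGE1.dta sits in the working directory, as in the slides; nothing else is created or changed.

```stata
* Assumption: WAGE1.dta (the data set used in these slides) is in the working directory
use WAGE1, clear

ladder wage      // normality tests for each power transformation of wage
gladder wage     // histograms of each transformation with a normal overlay
qladder wage     // normal quantile plots of each transformation
```

The graphs are often more informative than the test table: a transformation can be the sensible choice on substantive grounds even when no row of ‘ladder’ tests insignificant.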

. gen lwage=ln(wage)

• . su wage lwage
• . hist lwage, norm
• . gr box lwage, marker(1, mlab(id))
• While a log transformation makes wage’s distribution much more normal, it leaves an extreme low-value outlier, id=24.
• Let’s inspect its profile:
• It’s a white female with 12 years of education, but earning a very low wage.
• We don’t know if its wage is an error, so we’ll keep an eye on id=24 for possible problems.
• Let’s examine the independent variables:


. hist educ

. gr box educ, marker(1, mlab(id))

. su educ, d

. sparl lwage educ

. twoway qfitci lwage educ

. gen educ2=educ^2

. su educ educ2

. hist exper, norm

. gr box exper, marker(1, mlab(id))

. su exper, d

. sparl lwage exper

. twoway qfitci lwage exper

. gen exper2=exper^2

. su exper exper2

. hist tenure, norm

• . gr box tenure, marker(1, mlab(id))
• . su tenure, d
• . su tenure if tenure>=25 & tenure<.
• Note that there are only 18 cases of tenure>=25, & increasingly fewer cases with greater tenure.
• . sparl lwage tenure

• Note that there are cases of tenure=0, which must be accommodated in a log transformation.

. gen ltenure=ln(tenure + 1)

. su tenure ltenure

. sparl lwage tenure

. sparl lwage ltenure [i.e. transformed]

. twoway qfitci lwage tenure

• . twoway qfitci lwage ltenure
• What other options could be explored for educ, exper, & tenure?
• The data do not include age, which could be an important factor, including as a control.

. tab female, su(lwage)

. ttest lwage, by(female)

. tab nonwhite, su(lwage)

• . ttest lwage, by(nonwhite)
• Although nonwhite tests insignificant, it might become significant in the model.
• Let’s first test the model omitting the transformed version of the variables:

. reg wage educ exper tenure female nonwhite

A form of model misspecification is omitted variables, which also causes biased slope coefficients (see Allison, Multiple Regression, pages 49-52).

• Leaving key explanatory variables out means that we have not adequately controlled for important x effects on y.
• Thus we may incorrectly detect or fail to detect y/x relationships.

In Stata we use ‘estat ovtest’ (Ramsey’s regression specification error test, RESET) to indicate whether there are important omitted variables or not.

• ‘estat ovtest’ adds powers of the model’s fitted values to the regression & tests them jointly: we want the test to be insignificant so that we fail to reject the null hypothesis that the model has no important omitted variables.
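To see what the test does, here is a hedged sketch that reproduces it by hand. It assumes WAGE1.dta is in the working directory; the names yhat, yhat2, yhat3 & yhat4 are made up for this illustration. As far as I know, ‘estat ovtest’ uses the 2nd through 4th powers of the fitted values (hence the 3 numerator degrees of freedom in its F-test), so the two F statistics should agree up to rounding.

```stata
* Assumption: WAGE1.dta is in the working directory
use WAGE1, clear

regress wage educ exper tenure female nonwhite
estat ovtest                               // built-in RESET

* By hand: add powers of the fitted values & test them jointly
predict double yhat if e(sample), xb       // fitted values (hypothetical name)
generate double yhat2 = yhat^2
generate double yhat3 = yhat^3
generate double yhat4 = yhat^4

regress wage educ exper tenure female nonwhite yhat2 yhat3 yhat4
test yhat2 yhat3 yhat4                     // joint F-test of the added powers
```

If the powers of the fitted values add explanatory power, the original specification is missing something: a nonlinearity, an interaction, or an omitted variable correlated with the predictors.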

. reg wage educ exper tenure female nonwhite

. estat ovtest

Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
F(3, 517) = 9.37
Prob > F = 0.0000
• The model fails. Let’s add the transformed variables.

. reg lwage educ educ2 exper exper2 ltenure female nonwhite

. estat ovtest

Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
F(3, 515) = 2.11
Prob > F = 0.0984
• The model passes. Perhaps it would do better if age were available, and with other variables & other forms of these predictors.

estat ovtest is a decisive hurdle to clear.

• Even so, passing any diagnostic test by no means guarantees that we’ve specified the best possible model, statistically or substantively.
• It could be, again, that other models would be better in terms of statistical fit and/or substance.

Another way to test the model’s functional specification is via ‘linktest’: it tests whether y is properly specified or not.

• linktest’s ‘hatsq’ must test insignificant: we want to fail to reject the null hypothesis that y is specified correctly.

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

• The model fails. Let’s try another model that uses categorized predictors.
• For the new model, estat ovtest gives p = .12, so it passes.
• We’ll stick with this model unless other diagnostics indicate problems. Again, too bad the data don’t include age.
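For reference, a sketch of what linktest computes, using the log-wage model with the transformed predictors built earlier in the slides. It assumes WAGE1.dta is in the working directory; the loop drops any same-named variables the file may already contain, & xb_hat & xb_hatsq are hypothetical names chosen to avoid clashing with the _hat & _hatsq variables that linktest itself leaves behind.

```stata
* Assumption: WAGE1.dta is in the working directory
use WAGE1, clear
foreach v in lwage educ2 exper2 ltenure {
    capture drop `v'                       // in case the file already holds them
}
generate lwage   = ln(wage)
generate educ2   = educ^2
generate exper2  = exper^2
generate ltenure = ln(tenure + 1)

regress lwage educ educ2 exper exper2 ltenure female nonwhite
linktest                                   // built-in version

* By hand: regress y on the fitted values & their square
regress lwage educ educ2 exper exper2 ltenure female nonwhite
predict double xb_hat if e(sample), xb
generate double xb_hatsq = xb_hat^2
regress lwage xb_hat xb_hatsq              // the t-test on xb_hatsq is the linktest
```

A significant squared term says that the fitted values still have a curved relationship with y, i.e. the functional form needs more work.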

Passing ovtest, linktest, or other diagnostic indicators does not mean that we’ve necessarily specified the best possible model, either statistically or substantively.

• It merely means that the model has passed some minimal statistical threshold of data fitting.

At this stage it’s helpful to plot the model’s residuals versus fitted values to obtain a graphic perspective on the model’s fit & problems.

. rvfplot, yline(0) ml(id)

• Problems of heteroscedasticity?

So, even though the model passed linktest & estat ovtest, at least one basic problem remains to be overcome.

• Let’s examine other assumptions.

The model specification tests have to do with biased slope coefficients, & thus with Assumption 1.

• Next, though, let’s examine a potential model problem—multicollinearity—which in fact does not violate any of the regression assumptions.

Multicollinearity

• Multicollinearity: high correlations between the explanatory variables.
• Multicollinearity does not violate any linear model assumptions: if our objective were merely to predict values of y, there would be no need to worry about it.
• But, like small sample or subsample size, it does inflate standard errors.

It thereby makes it tough to find out which of the explanatory variables has a significant effect.

• For confidence intervals, p-values, & hypothesis tests, then, it does cause problems.

Signs of multicollinearity:

(1) High bivariate correlations (say, .80+) between the explanatory variables: but, because multiple regression expresses joint linear effects, such correlations (or their absence) aren’t reliable indicators.

(2) Global F-test is significant, but none of the individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of hypothesized direction (but could be Simpson’s paradox, etc.)

(5) Adding/deleting explanatory variables causes large changes in coefficients, particularly switches between positive & negative.

The following indicators are more reliable because they better gauge the array of joint effects within a model:

(6) Post-model estimation (Stata command ‘vif’): ‘VIF’>10. VIF measures the inflation in a coefficient’s variance due to multicollinearity; the square root of VIF is the factor by which the explanatory variable’s standard error is inflated.

(7) Post-model estimation (Stata command ‘vif’): ‘Tolerance’<.10. Tolerance is the reciprocal of VIF; it measures the share of an explanatory variable’s variance that is independent of the other explanatory variables.

(8) Pre-model estimation (downloadable Stata command ‘collin’): ‘Condition Number’>15 or especially >30.

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
       exper |     15.62    0.064013
      exper2 |     14.65    0.068269
         hsc |      1.88    0.532923
  _Itencat_3 |      1.83    0.545275
        ccol |      1.70    0.589431
        scol |      1.69    0.591201
  _Itencat_2 |      1.50    0.666210
  _Itencat_1 |      1.38    0.726064
      female |      1.07    0.934088
    nonwhite |      1.03    0.971242
-------------+----------------------
    Mean VIF |      4.23
• The seemingly troublesome scores for exper exper2 are an artifact of the quadratic form & pose no problem. The results look fine.
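VIF & tolerance come from an auxiliary regression of each explanatory variable on all the others: tolerance = 1 - R² of that regression, & VIF = 1/tolerance. The sketch below checks this for one predictor. It assumes WAGE1.dta is in the working directory & uses the log-wage model with the transformed predictors built earlier (not the categorized model in the table above), so the numbers will differ from that table.

```stata
* Assumption: WAGE1.dta is in the working directory
use WAGE1, clear
foreach v in lwage educ2 exper2 ltenure {
    capture drop `v'
}
generate lwage   = ln(wage)
generate educ2   = educ^2
generate exper2  = exper^2
generate ltenure = ln(tenure + 1)

regress lwage educ educ2 exper exper2 ltenure female nonwhite
estat vif

* Auxiliary regression for ltenure: how much of it do the other x's explain?
regress ltenure educ educ2 exper exper2 female nonwhite
display "tolerance = " 1 - e(r2)
display "VIF       = " 1/(1 - e(r2))
display "sqrt(VIF) = " sqrt(1/(1 - e(r2)))   // factor by which the std. error is inflated
```

The displayed VIF for ltenure should match its row in the ‘estat vif’ table.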

To address multicollinearity: correct any inappropriate use of explanatory dummy variables or computed quantitative variables.

• ‘Center’ the offending explanatory variables (see Mendenhall/Sincich), perhaps using STATA’s ‘center’ command.
• Eliminate variables—but this might cause specification errors (see, e.g., ovtest).
• Collect additional data—if you have a big bank account & plenty of time!
• Group relevant variables into sets of variables (e.g., an index): how might we do this?

Learn how to do ‘principal components analysis’ or ‘principal factor analysis’ (see Hamilton).

• Or do ‘ridge regression’ (see Mendenhall/ Sincich).
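As an illustration of the centering remedy, here is a sketch that centers exper before squaring it & compares the VIFs. It assumes WAGE1.dta is in the working directory; exper_c & exper_c2 are hypothetical names, & the centering is done by hand rather than with a downloadable command.

```stata
* Assumption: WAGE1.dta is in the working directory
use WAGE1, clear
foreach v in lwage exper2 ltenure {
    capture drop `v'
}
generate lwage   = ln(wage)
generate exper2  = exper^2
generate ltenure = ln(tenure + 1)

* Uncentered quadratic: exper & exper2 are highly correlated
regress lwage educ exper exper2 ltenure female nonwhite
estat vif

* Centered quadratic
summarize exper, meanonly
generate double exper_c  = exper - r(mean)
generate double exper_c2 = exper_c^2
regress lwage educ exper_c exper_c2 ltenure female nonwhite
estat vif
```

Centering only re-expresses the same quadratic: the fitted values & R² are unchanged, while the coefficient on the linear term now describes the slope at the mean of exper rather than at exper=0.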

Normal Distribution of Residuals

• This is necessary for confidence intervals & p-values to be accurate.
• But this is the least worrisome problem in general: if the sample is as large as 100-200, the central limit theorem implies that the coefficient estimates are approximately normally distributed, so confidence intervals & p-values will be good approximations.
• While this is by far the least important assumption, examining the residuals at this stage (as we began to do with rvfplot) can tip us off to other problems.

The most basic way to assess the normality of residuals is simply to plot the residuals via histogram, box or dotplot, kdensity, or normal quantile plot.

• We’ll use studentized (a version of standardized) residuals because we can assess their distribution relative to the normal distribution:

. predict rstu if e(sample), rstu

. su rstu, d

Note: to obtain unstandardized residuals:

. predict e if e(sample), resid

. hist rstu, norm

• Not bad for the assumption of normality, but the low-end outliers correspond to the earlier evidence of heteroscedasticity.
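The other graphs mentioned above can be sketched as follows. The example assumes WAGE1.dta is in the working directory & uses the log-wage model with the transformed predictors built earlier, so the pictures will not be identical to those for the categorized model.

```stata
* Assumption: WAGE1.dta is in the working directory
use WAGE1, clear
foreach v in lwage educ2 exper2 ltenure {
    capture drop `v'
}
generate lwage   = ln(wage)
generate educ2   = educ^2
generate exper2  = exper^2
generate ltenure = ln(tenure + 1)

regress lwage educ educ2 exper exper2 ltenure female nonwhite
predict double rstu if e(sample), rstudent   // studentized residuals

kdensity rstu, normal    // kernel density with a normal curve overlaid
qnorm rstu               // normal quantile plot: points should hug the diagonal
graph box rstu           // box plot: shows outliers on each side
```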

estat imtest (information matrix test) gives us a formal test of the normal distribution of residuals (which they really don’t need to pass); it also leads us into the assessment of non-constant variance:

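Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      21.27     13    0.0677
            Skewness |       4.25      4    0.3733
            Kurtosis |       2.47      1    0.1160
---------------------+-----------------------------
               Total |      27.99     18    0.0621
---------------------------------------------------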

• Normality (skewness) is good, but the model just edges by with respect to non-constant variance (p=.0677). Let’s investigate.

Non-Constant Variance

• If the variance of the residuals changes according to the levels of the explanatory variables (i.e. the spread of the residuals is not constant but rather depends on the values of x), then:
• the OLS estimates are not optimal: alternative approaches such as weighted least squares would give more efficient estimates; &
• the standard errors are biased either up or down, making statistical significance either too hard or too easy to detect.

In STATA we test for non-constant variance by means of:

• tests with estat: hettest or szroeter or imtest; &
• graphs: rvfplot & rvpplot.
• We want the tests to turn out insignificant so that we fail to reject the null hypothesis that there’s no heteroscedasticity.
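A hedged sketch of what the default ‘estat hettest’ computes: it asks whether the squared residuals are related to the fitted values. The example assumes WAGE1.dta is in the working directory & uses the log-wage model with the transformed predictors built earlier; yhat, ehat & g are hypothetical names. To my understanding the default (normal-errors) statistic is half the model sum of squares from regressing the scaled squared residuals on the fitted values, so the two results should agree.

```stata
* Assumption: WAGE1.dta is in the working directory
use WAGE1, clear
foreach v in lwage educ2 exper2 ltenure {
    capture drop `v'
}
generate lwage   = ln(wage)
generate educ2   = educ^2
generate exper2  = exper^2
generate ltenure = ln(tenure + 1)

regress lwage educ educ2 exper exper2 ltenure female nonwhite
estat hettest                                   // built-in Breusch-Pagan / Cook-Weisberg

* By hand
predict double yhat if e(sample), xb
predict double ehat if e(sample), residuals
generate double g = ehat^2 / (e(rss)/e(N))      // squared residuals scaled by RSS/N
regress g yhat
display "chi2(1)     = " e(mss)/2
display "Prob > chi2 = " chi2tail(1, e(mss)/2)
```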

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage
chi2(1) = 15.76
Prob > chi2 = 0.0001
• There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance

---------------------------------------
    Variable |      chi2   df      p
-------------+-------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      1.56    1   0.9077 #
      exper2 |      0.10    1   1.0000 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
-------------+-------------------------
simultaneous |     23.68   10   0.0085
---------------------------------------

• It seems that the problem has to do with ‘tenure.’

. estat szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable

---------------------------------------
    Variable |      chi2   df      p
-------------+-------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      3.20    1   0.5341 #
      exper2 |      3.20    1   0.5341 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
---------------------------------------

• Both hettest & szroeter say that the serious problem is with tenure.

• What measures might be taken to correct or reduce the problem?

What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

• Maybe the small number of observations at the high end of tenure matters as well.
• Other options:
• Categorizing a continuous predictor (multi-level or binary) may work, although at the cost of lost information.
• We also could transform the outcome variable (see qladder, ladder, etc.), though not in this example because we had good reason for creating log(wage).

A more complicated option would be to use weighted least squares regression.

• If nothing else works, we could use robust standard errors. These relax Assumption 2—to the point that we wouldn’t have to check for non-constant variance (& in fact the diagnostics for doing so wouldn’t work).

Whatever strategies we try, we redo the diagnostics & compare the new model’s coefficients to the original model.

• The key question: is there a practically significant difference in the models?
• My own data exploration finds that nothing works, perhaps because tenure is correlated with age, which the data set does not include.

What can we do?

• We can use robust standard errors.

It’s quite common, recommended practice to use robust standard errors routinely, as long as the sample size isn’t small.

• Doing so relaxes Assumption 2, which we then no longer need to check.
• If we do use robust standard errors, lots of our routine diagnostic procedures won’t work because their statistical premises don’t hold.

A reasonable, routine strategy is to use robust standard errors at this point, then re-estimate the model without the robust standard errors & compare the difference.

• Even if we decide that we should stick with robust standard errors, we can do the following: estimate the model without them; conduct the next rounds of diagnostic tests; & then use robust standard errors in our final model.

. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust

. est store m2_robust

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1          m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 |   -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

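There’s no difference at all! The coefficients are identical by construction (robust standard errors change only the standard errors), & the significance levels are unchanged as well.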

• See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
• It’s a good idea, in any case, to specify robust standard errors in a final model.

For now, we won’t use robust standard errors so that we can explore additional diagnostics.

• Our final model, however, will use robust standard errors.
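Because the coefficients cannot differ, the informative comparison is between the two sets of standard errors. A sketch, assuming WAGE1.dta is in the working directory & using the log-wage model with the transformed predictors built earlier; the stored-estimate names ols & robust are made up for this example.

```stata
* Assumption: WAGE1.dta is in the working directory
use WAGE1, clear
foreach v in lwage educ2 exper2 ltenure {
    capture drop `v'
}
generate lwage   = ln(wage)
generate educ2   = educ^2
generate exper2  = exper^2
generate ltenure = ln(tenure + 1)

regress lwage educ educ2 exper exper2 ltenure female nonwhite
estimates store ols

regress lwage educ educ2 exper exper2 ltenure female nonwhite, vce(robust)
estimates store robust

* Same coefficients in both columns; only the standard errors can differ
estimates table ols robust, b(%9.4f) se(%9.4f) stats(N)
```

If the two columns of standard errors are close, non-constant variance is not distorting the inference much, which is the practical-significance question raised above.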

Correlated Errors

• In the case of these data there’s no need to worry about correlated errors: the sample is not a cluster, panel, or time-series sample.
• In general there’s no straightforward way to check for correlated errors.
• If we suspect correlated errors, we respond in one or more of the following three ways:

(1) by using robust standard errors;

(2) if it’s a cluster sample, by using Stata’s cluster option with the sample-cluster variable:

. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)

• But again, our data aren’t based on a cluster sample.

(3) if it’s time-series data, by testing for serial correlation with Stata’s ‘estat bgodfrey’ (the Breusch-Godfrey Lagrange multiplier test).
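Since WAGE1 is neither clustered nor a time series, the sketch below switches to two example data sets that ship with Stata (nlsw88 & sp500) purely to show the syntax. The choice of variables, of industry as the cluster variable, & of the running index t as the time variable are assumptions made for illustration, not substantive models.

```stata
* Cluster-robust standard errors (illustration only)
sysuse nlsw88, clear
regress wage grade ttl_exp tenure, vce(cluster industry)

* Serial correlation tests on time-series data (illustration only)
sysuse sp500, clear
generate t = _n            // simple running time index (hypothetical)
tsset t
regress change volume
estat bgodfrey, lags(1)    // Breusch-Godfrey LM test
estat dwatson              // Durbin-Watson statistic
```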

This model seems to be satisfactory from the perspective of linear regression’s assumptions with the exception of an insignificant problem with non-constant variance.

• But there’s another potential problem: influential outliers.
• Particularly in small samples, OLS slope estimates can be strongly influenced by particular observations.

An observation’s influence on the slope coefficients depends on its discrepancy & leverage:

discrepancy: how far the observation’s y-value falls from the value predicted by the model (i.e. the size of its residual);

leverage: how far the observation on the X-axis falls from the mean for x.

• discrepancy + leverage = influence
• Highly influential observations are most likely to occur in small samples.

Discrepancy is measured by residuals, a standardized version of which is the studentized residual, which behaves like a t or z statistic:

• Studentized residuals of –3 or less or +3 or more usually represent outliers (i.e. y-values with high residuals), which indicate potential influence.
• Outliers with low leverage can affect the equation’s constant, reduce its fit, & increase its standard errors, but they have little influence on the regression coefficients.

Leverage is measured by hat value, a non-negative statistic that summarizes how far the explanatory variables fall from their means: greater hat values are farther from their x-mean.

• Hat values are likely to be greater in small or moderate samples: values of 3*k/n or more are relatively large & indicate potential influence.

Whereas studentized residuals & hat values each measure potential influence, Cook’s Distance & DFITS measure the actual influence of an observation on the overall fit of a model.

• Cook’s Distance & DFITS values of 1 or more, or 4/n or more (in large samples), suggest substantial influence (of 1 or more standard deviations) on the model’s overall fit.
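For reference, the standard textbook definitions of these statistics. The notation is introduced here & is not from the slides: e_i is the ordinary residual, s² the residual variance, s_(i) the residual standard deviation with observation i deleted, b_j(i) the j-th coefficient estimated without observation i, & p the number of estimated coefficients including the constant (the slides’ k may or may not count the constant).

```latex
h_i = x_i^{\top} (X^{\top} X)^{-1} x_i ,
\qquad \sum_{i=1}^{n} h_i = p ,
\qquad \bar{h} = \frac{p}{n}

t_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_i}}
\qquad \text{(studentized residual)}

D_i = \frac{e_i^{2}}{p \, s^{2}} \cdot \frac{h_i}{(1 - h_i)^{2}}
\qquad \text{(Cook's Distance)}

\mathrm{DFBETA}_{ij} = \frac{b_j - b_{j(i)}}{\mathrm{SE}_{(i)}(b_j)}
```

Cook’s Distance makes the ‘discrepancy & leverage’ idea explicit: it is large only when both the residual & the hat value are sizable.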

DFBETAs also measure the actual influence of observations, in this case on particular slope coefficients (e.g., DFeduc, DFexper, & DFtenure).

• DFBETAs, then, provide the most direct measure of the influence of individual observations on slope coefficients.
• A DFBETA of 1 means that the observation shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more in absolute value, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.
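The rules of thumb above can be turned into a quick screening pass. This is a sketch, assuming WAGE1.dta is in the working directory & using the log-wage model with the transformed predictors built earlier. The names obsid, rstu, h, d & dfb_educ are chosen for this example, & k in the 3*k/n cutoff is taken to be the number of estimated coefficients including the constant, which is an assumption about the slides’ notation.

```stata
* Assumption: WAGE1.dta is in the working directory
use WAGE1, clear
foreach v in lwage educ2 exper2 ltenure {
    capture drop `v'
}
generate lwage   = ln(wage)
generate educ2   = educ^2
generate exper2  = exper^2
generate ltenure = ln(tenure + 1)
generate obsid   = _n                           // hypothetical observation identifier

regress lwage educ educ2 exper exper2 ltenure female nonwhite

predict double rstu     if e(sample), rstudent      // discrepancy
predict double h        if e(sample), hat           // leverage
predict double d        if e(sample), cooksd        // influence on overall fit
predict double dfb_educ if e(sample), dfbeta(educ)  // influence on the educ slope

* Flag observations that cross each cutoff
list obsid rstu     if abs(rstu) >= 3 & !missing(rstu)
list obsid h        if h >= 3*(e(df_m) + 1)/e(N) & !missing(h)
list obsid d        if d >= 4/e(N) & !missing(d)
list obsid dfb_educ if abs(dfb_educ) >= 2/sqrt(e(N)) & !missing(dfb_educ)
```

An observation that shows up on several of these lists deserves a closer look; one that appears only on the residual or only on the leverage list usually does not move the coefficients much.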

But before we examine these influence indicators, let’s examine some graphic indicators:

. lvr2plot, ml(id)

. avplots

. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

• There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplot _Itencat_3, ml(id)

• There again is id=24. Why isn’t it a problem?

So far, we see no problems with influential observations.

• Next let’s numerically & graphically examine the studentized residuals (rstu), hat values (h), Cook’s Distance (d), & dfbeta (DF_).

. predict rstu if e(sample), rstu

• . predict h if e(sample), hat
• . predict d if e(sample), cooksd
• . dfbeta
• . su rstu-DF_Itencat_3
• Although rstu has some clear outliers on both the low & high sides, the other diagnostics look good.
• Let’s use d (Cook’s Distance) to illustrate the further analysis of influence diagnostics.

. scatter d id

• Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

• To repeat, there’s no problem of influential outliers.
• If there were a problem, what would we do?

Correct outliers that are coding errors if possible.

• Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
• Other options: try other types of regression—

Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).

• Quantile regression works with y-outliers only, while robust regression works with x-outliers.

One more thing: keep in mind that there are no significance tests associated with these outlier/influence diagnostics.

• A given outlier, then, could result from chance alone—as occurs by definition in a normal distribution.
• For a way—which includes a significance test—to examine if observations are outliers on more than one quantitative variable, see the command ‘hadimvo.’

lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

• In a normal distribution we expect about 5% of the observations to fall more than 2 standard deviations from the mean, so some apparent outliers arise by chance alone.
• Don’t over-fit a model to a sample—remember that there is sample-to-sample variation.

Let’s Wrap It Up

• Using robust standard errors:
• . xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust