
Sociology 602 Martin Week 11, April 16 2002



    1. Sociology 602 (Martin) Week 11, April 16 2002
    Some useful regression diagnostics and remedial measures for multiple regression:
    for nonlinearity of added variables: NKNW 9.1
    for heteroskedasticity: NKNW 10.1
    for influential cases: NKNW 9.4, scan NKNW 9.2
    for multicollinearity: NKNW 9.5, scan NKNW 10.2
    ideas for model validation: NKNW 10.6

    2. The point of today's lecture. In the past, we looked at simple graphic measures for diagnosing problems in simple regression models (models with only one explanatory variable):
    nonlinearity
    nonconstant error variance (heteroskedasticity)
    multicollinearity
    non-normal error distribution
    errors dependent on other errors or on x
    Today, we look at some very practical techniques for identifying problems with multiple regression models.

    3. Why diagnostics and remedies? It is important to find and correct potential problems with a multiple regression model. It is equally important to explain to a skeptical reader some of the potential problems with your model, and the steps you have taken to investigate them. Think of today's topics not as things that go on behind the scenes, but as ways to strengthen the force of your conclusions.

    4. Diagnostics: nonlinearity. For a simple regression model Yhat = b0 + b2X2, we diagnose nonlinearity by plotting X2 against Y. What happens when we want to diagnose whether X1 will be nonlinear when we add it to the model?
    WRONG: plot X1 against Y. (This is wrong because part of Y is already explained by X2 in the model, and part of the variation in X1 is explained if X1 is collinear with X2.)
    RIGHT: plot the residual of Y when regressed on X2 against the residual of X1 when regressed on X2. (That's correct! Treat X1 as an outcome variable.)

    5. Diagnostics: nonlinearity. The diagnostic tool for nonlinearity in multiple regression is called a partial regression plot (also known as an added-variable plot).
    Y axis = e(Y|X2), the residual when regressing Y on X2
    X axis = e(X1|X2), the residual when regressing X1 on X2
    The partial regression plot shows what you are adding when you add X1 to a model that already contains X2. SAS format (see example in handout): proc reg; model Y = X1 X2 / partial; run;
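The course software is SAS, but the plot coordinates are software-neutral. Here is a rough numpy sketch (the data are simulated for illustration) that computes both axes of the partial regression plot by hand; a useful check is that the slope of the resulting scatter equals the coefficient on X1 in the full model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

def resid(target, predictors):
    """Residuals from an OLS regression of target on predictors (with intercept)."""
    X = np.column_stack([np.ones(len(target))] + predictors)
    b, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ b

# Coordinates of the partial regression plot for X1:
e_y_given_x2 = resid(y, [x2])    # y axis: e(Y|X2)
e_x1_given_x2 = resid(x1, [x2])  # x axis: e(X1|X2)

# The slope of the plot equals b1 from the full model Y = b0 + b1*X1 + b2*X2
slope = (e_x1_given_x2 @ e_y_given_x2) / (e_x1_given_x2 @ e_x1_given_x2)
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
print(slope, b_full[1])  # the two numbers agree
```

This equality is exact (it is the Frisch-Waugh-Lovell result), which is why the plot faithfully shows what X1 adds to the model.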

    6. Remedies for nonlinearity. A partial regression plot answers this question: When I add X1 to a regression model already containing X2, does the part of X1 not explained by X2 have a linear relationship with the unexplained portion of Y?
    If X1 has a linear partial regression plot, add it to the model.
    If X1 has a nonlinear partial regression plot, use a transform (log? polynomial? categorical?), then add it to the model.
    If X1 shows no relationship in the partial regression plot, leave it out of the model.

    7. Weaknesses of the partial regression plot You may be tempted to treat a partial regression plot as a simple plot of X against Y, but there are many ways that it is more complex. the relation shown is for Xj adjusted for the other variables in the model, not for Xj alone. the plot can suggest nonlinear patterns, but it does not provide a functional expression for the nonlinearity. If you are serious about using partial regression plots, I recommend trying a partial regression plot for each transformation you are thinking of using.

    8. Diagnostics for heteroskedasticity. We already know a fairly useful SAS statement for testing heteroskedasticity: the SPEC test (White's test). Null hypothesis for a SPEC test: the population error variance is constant, and the different variances at different values of X in the sample are due to random chance alone. For a SPEC test, p < .05 means that if the population has a constant error variance, less than 5% of samples of this size would have such a nonconstant error variance. Rejecting the null means you have concluded that the population has a nonconstant error variance, in violation of the regression model assumptions.
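To see the logic behind a test of this kind, here is a rough numpy sketch of a White-style statistic (simulated data; the cutoff 5.99 is the .05 critical value of a chi-square with 2 df): regress the squared residuals on X and X-squared, and compare n times the auxiliary R-squared to the critical value. This is an illustration of the idea, not SAS's exact SPEC computation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, size=n)
# Error standard deviation grows with x: a heteroskedastic population
y = 1.0 + 2.0 * x + rng.normal(0, 0.5 + 0.5 * x, size=n)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b

# Auxiliary regression: e^2 on x and x^2 (the heart of a White-style test)
Z = np.column_stack([np.ones(n), x, x ** 2])
g, *_ = np.linalg.lstsq(Z, e ** 2, rcond=None)
fitted = Z @ g
r2 = 1 - np.sum((e ** 2 - fitted) ** 2) / np.sum((e ** 2 - np.mean(e ** 2)) ** 2)
lm = n * r2          # compare to a chi-square with 2 df; 5.99 at alpha = .05
print(lm > 5.99)     # True here: we reject constant error variance
```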

    9. Diagnostics for heteroskedasticity. A warning about the SPEC test: if p > .05 for a sample, this does not prove that the error variance is constant. Do you see the following logical fallacy? If a population has constant error variance, then a sample will have a certain type of variance structure. The sample has that variance structure. Therefore, the population has a constant error variance. (If an animal is a polar bear, then it is white. The animal is white. Therefore, it is a polar bear.)

    10. Diagnostics for heteroskedasticity. Another warning about the SPEC test: an influential observation can weaken it. If there is one observation at extreme Y and X values, it will pull the regression line toward itself, and it will appear to have a relatively small variance. Example: SMSA data set for numphys and totalpop.

    11. Remedies for heteroskedasticity One remedy for heteroskedasticity is to transform the Y axis, usually into log form. (I usually recommend this) Another way is to discuss the nonconstant error variance in the conclusion, and use conservative alpha levels. A third way (which I do not like to use) is to reduce the heteroskedasticity by reweighting the cases with large residuals. Decision rule: Any case with a large residual only counts for half a case. (Note that this reduces the average error in a sample!) Draw an example graph.

    12. Remedies for heteroskedasticity: reweighting the sample To run a regression model with a reweighted sample, figure out the absolute residual that corresponds to a large z-score (say, 3). Next, reweight all such cases so that they count less than cases with smaller residuals. proc reg; model numphys=totalpop perccc perchs / spec p; reweight r. ge 2000 or r. le -2000 / weight=.5; run;
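The SAS REWEIGHT statement above halves the weight of large-residual cases. Here is a rough numpy sketch of the same idea (simulated data; the cutoff of two residual standard deviations is my own choice, playing the role of the 2000 cutoff in the SAS code): fit OLS, assign weight .5 where the absolute residual exceeds the cutoff, then refit by weighted least squares, which amounts to scaling each row by the square root of its weight.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.5 * x + rng.normal(0, 1 + 0.4 * x, size=n)

X = np.column_stack([np.ones(n), x])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_ols

cutoff = 2 * np.std(e)                       # large-residual threshold for this example
w = np.where(np.abs(e) > cutoff, 0.5, 1.0)   # such cases count for half a case

# Weighted least squares: scale rows by sqrt(weight), then run ordinary LS
sw = np.sqrt(w)
b_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(b_ols, b_wls)  # downweighted cases pull the fit less
```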

    13. Problems with reweighting the sample. 1.) People often object to reweighting cases based on the outcome of the regression. Isn't it arbitrary to discount a case just because it doesn't fit the model well? 2.) Reweighting produces a new regression equation that might have different problems with error variance! This can make reweighting a repeated exercise. A common procedure for this is robust regression, which is based on iteratively reweighted least squares (IRLS).

    14. Diagnostics: influential observations In earlier lectures, we identified influential observations by their extreme x-scores and/or extreme residuals. In multiple regression with several x-variables, it is much harder to make a plot that shows extreme x-scores. Instead, we rely on several SAS regression outputs that tell us what effect an observation is having on the regression model.

    15. Diagnostics: influential observations To get information on influential observations from SAS, type the following SAS statements: proc reg; model y=x1 x2 x3 / influence; run; SAS will then produce the following information on influential observations: (among other things) residuals studentized residuals hat matrix diagonal elements dffits dfbetas

    16. Influential observations: residuals. Residuals tell you how far a Y-observation is from its predicted point on the regression line: ei = Yi - Yhati. Studentized residuals express the residuals as t-scores; this tells you the probability that an observation would be so far from the predicted value, assuming the errors are normally distributed. Studentized residuals are denoted by ri, where ri = ei / s{ei}.

    17. Influential observations: hat matrix diagonals. The hat matrix is a particularly cool matrix that you can calculate when you solve a regression equation: Yhat = HY. For today, all we need to know is the following:
    each diagonal element hii of the nxn hat matrix corresponds to a single observation i, across all the variables.
    the average value of a diagonal element is hbar = p/n.
    larger diagonal elements correspond to extreme X outliers. (There is no explicit test for influential X values.)
    (Draw a graph of x1 vs x2 with values for diagonal elements)
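A rough numpy sketch of the quantities on this slide and the previous one (simulated data): the hat matrix H = X(X'X)^-1 X', whose diagonal elements average p/n, and the studentized residuals ri = ei / s{ei}, where s{ei} = s * sqrt(1 - hii).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3                      # p = number of parameters (intercept + 2 slopes)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + x1 + 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: Yhat = H @ y
h = np.diag(H)                         # one leverage value per observation

print(h.mean(), p / n)                 # diagonal elements average hbar = p/n

e = y - H @ y                          # residuals
s2 = (e @ e) / (n - p)                 # MSE
r = e / np.sqrt(s2 * (1 - h))          # studentized residuals, ri = ei / s{ei}
```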

    18. Influential observations: DFFITS. A useful question for identifying influential observations is: Would the model be different without this observation? The difference in fitted values of Y (DFFITS) compares Yhati from the model with all observations to Yhati from the model fit without observation i, expressed as a t-score (equation omitted). If you want to think of DFFITS as a t-score, use this approximation: DFFITS ~= t*sqrt(p/n). See SAS output for examples.
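DFFITS can be computed by brute force, which makes the definition concrete. In this numpy sketch (simulated data), each observation is deleted in turn and the model refit; the difference in fitted values is scaled by s(i)*sqrt(hii). A closed-form identity, DFFITSi = ti * sqrt(hii / (1 - hii)) with ti the deleted-residual t-score, gives the same numbers without refitting.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 2
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.8 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)

dffits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    e_i = y[keep] - X[keep] @ b_i
    s_i = np.sqrt((e_i @ e_i) / (n - 1 - p))    # s(i): root MSE without case i
    yhat_full = X[i] @ b_full                    # Yhat_i, all observations
    yhat_loo = X[i] @ b_i                        # Yhat_i(i), case i deleted
    dffits[i] = (yhat_full - yhat_loo) / (s_i * np.sqrt(h[i]))

# Closed-form check: DFFITSi = ti * sqrt(hii / (1 - hii))
e = y - X @ b_full
sse = e @ e
s_loo = np.sqrt((sse - e ** 2 / (1 - h)) / (n - 1 - p))
t = e / (s_loo * np.sqrt(1 - h))
print(np.allclose(dffits, t * np.sqrt(h / (1 - h))))  # True
```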

    19. Influential observations: DFBETAS. The difference in regression coefficients (DFBETAS) compares each bj from the model with all observations to the corresponding bj from the model fit without observation i. If you want to think of DFBETAS as a t-score, use this approximation: DFBETAS ~= t/sqrt(n). See SAS output for examples.
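DFBETAS can likewise be computed by brute force. In this numpy sketch (simulated data, with one outlier deliberately planted at the largest x value), each case is deleted in turn and the change in each coefficient is scaled by s(i) times the root of the corresponding diagonal element of (X'X)^-1; the planted case stands out on the slope.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 2
x = np.sort(rng.uniform(0, 10, size=n))
y = 2.0 + 0.8 * x + rng.normal(size=n)
y[-1] += 10.0                          # plant an outlier at the largest x value

X = np.column_stack([np.ones(n), x])
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
c = np.diag(np.linalg.inv(X.T @ X))    # (X'X)^-1 diagonal, one entry per coefficient

dfbetas = np.empty((n, p))
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    e_i = y[keep] - X[keep] @ b_i
    s_i = np.sqrt((e_i @ e_i) / (n - 1 - p))
    dfbetas[i] = (b_full - b_i) / (s_i * np.sqrt(c))

print(np.abs(dfbetas[:, 1]).round(2))  # slope DFBETAS; the planted case is large
```

A common rule of thumb flags cases with |DFBETAS| above 2/sqrt(n); the planted outlier exceeds that here.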

    20. Influential observations: remedies The safest way to deal with influential observations is to run a full model, then run the same model without the influential observations. If the results are the same, your findings are not affected by influential observations. If the results are different, your findings are strongly affected by influential observations, and you should transform the x-axis and/or the y-axis.

    21. Diagnostics for multicollinearity In earlier lectures, we mentioned two ways to identify multicollinearity between x-variables. Graph the two variables. Calculate the correlation coefficient for the two variables. We now add a third diagnostic tool: the Variance Inflation Factor (VIF) The VIF asks how much the standard error for coefficient bj is inflated by the presence of other variables in the model. A high VIF means that the model is giving us a very uncertain prediction for bj.

    22. Calculating VIF. The SAS syntax for VIF is as follows: proc reg; model y=x1 x2 x3 / vif; run;
    A rough way to calculate VIF is as follows:
    find the standard error for bj in the simple model Yhat = b0 + bjXj.
    find the standard error for bj in the full regression model.
    VIF = (s{bj, full} / s{bj, simple})^2
    (Equivalently, VIFj = 1 / (1 - Rj^2), where Rj^2 comes from regressing Xj on the other X variables.)
    There is no set standard for a bad VIF: the book says 10 may be too much; in general, regression models still hold up if VIF < 30.
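Outside SAS, the VIF is easy to compute directly from the formula VIFj = 1 / (1 - Rj^2). A numpy sketch (simulated data, where x2 is built to be nearly collinear with x1 and x3 is independent):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                # independent of the others

def vif(target, others):
    """VIF = 1 / (1 - R^2) from regressing one X variable on the other X's."""
    Z = np.column_stack([np.ones(len(target))] + others)
    g, *_ = np.linalg.lstsq(Z, target, rcond=None)
    e = target - Z @ g
    r2 = 1 - (e @ e) / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

print(vif(x1, [x2, x3]))  # large: x1 is nearly collinear with x2
print(vif(x3, [x1, x2]))  # near 1: x3 adds independent information
```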

    23. Remedies for high multicollinearity 1.) Ridge regression: a conceptually difficult approach that I do not like. (but see SAS example). 2.) Transforming X variables: this may reduce high multicollinearity in cases where much of the high multicollinearity is due to an outlier. 3.) Leave one of the collinear X-variables out based on theoretical considerations, and discuss the possible problems with multicollinearity.

    24. Model validation. We have looked at many ways to check whether we have the right variables in the model, whether the variables are correctly specified, and whether the sample results are representative of the population. However, there are two important things left to do:
    1.) Validate your findings against other data. (Collect more data, or use a holdout sample.)
    2.) Validate your findings against previous results. (What have people done in the past? What have they found? What have they concluded? Why didn't they already do what you just did?)
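The holdout-sample idea in point 1 can be sketched in a few lines of numpy (simulated data, 50/50 split, both chosen for this example): fit on a training half, then compare the mean squared prediction error on the holdout half to the training MSE. A large gap suggests the model will not generalize to fresh data.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

idx = rng.permutation(n)
train, hold = idx[:100], idx[100:]     # 50/50 split into training and holdout halves

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

mse_train = np.mean((y[train] - X[train] @ b) ** 2)
mspe_hold = np.mean((y[hold] - X[hold] @ b) ** 2)
print(mse_train, mspe_hold)  # similar here: the model validates on fresh data
```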

    25. More on model validation. I don't want you to think that model validation is unimportant just because we spent only one slide on the topic. This section is short because you already know how to do these things; you just need to be reminded to spend the effort to do them. You cannot publish any work unless you validate your results against previous findings. You should not publish any work unless you validate your results against other data.

    26. Summary: We have now finished with OLS multiple regression, except for a few new tricks I might show you during the review session. You are now capable of the following: doing a multiple regression model thinking critically about things that might be wrong with your model doing something to check whether your model or conclusions are wrong describing to the reader some of the things you have done to make sure you are right.
