
Stat 6601 Project: Regression Diagnostics (V&R 6.3)


Presentation Transcript


  1. Stat 6601 Project: Regression Diagnostics (V&R 6.3)
Presenters: Anthony Britto, Kathy Fung, Kai Koo

  2. Basic Definition of Regression Diagnostics
• An older approach to robustness
• Developed to detect possibly erroneous data points, and to reject them iteratively, through analysis of the globally fitted model

  3. Regression Diagnostics
• Goal: detection of possibly wrong data through analysis of the globally fitted model
• Typical approach:
(1) Fit an initial model
(2) Compute the residuals
(3) Identify and reject outliers
(4) Rebuild the model, or track down the source of the errors

  4. Influence and Leverage (1)
• Influence: an observation is influential if the estimates change substantially when that observation is omitted.
• Leverage: the "horizontal" distance of the x-value from the mean of x. The further from the mean, the more leverage an observation has.
• y-discrepancy: the vertical distance between y_obs and y_pred.
Conceptual formula: Influence = Leverage × y-Discrepancy

  5. Influence and Leverage (2)
(Figure: two points with the same y-discrepancy, y_obs − y_pred = 45, on a fitted line.)
• High-influence point (5, 60): (x − x̄)² = 830
• Low-influence point (30, 105): (x − x̄)² = 15
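To see the same contrast numerically, here is a minimal sketch with made-up data (not the slide's dataset): two observations receive the same vertical offset, but only the one far from mean(x) has the leverage to move the fit appreciably.

> # Hypothetical data: equal y-discrepancy at x = 5 and x = 30,
> # but x = 5 lies much further from mean(x)
> x <- c(5, 18, 20, 22, 24, 26, 30)
> y <- 2 + 3*x
> y[c(1, 7)] <- y[c(1, 7)] + 45   # same vertical offset at both points
> fit <- lm(y ~ x)
> hatvalues(fit)                  # leverage: by far the largest at x = 5
> dfbeta(fit)                     # coefficient change when each observation
>                                 # is dropped: the row for x = 5 dominates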

  6. Detecting Outliers
• Distinguish between two types of outliers:
• 1st type: outliers in the response variable represent model failure; such observations are called outliers.
• 2nd type: outliers with respect to the predictors are called leverage points.
• Both types can affect the regression model. However, such observations may almost uniquely determine the regression coefficients, and may also cause the standard errors of the coefficients to be much smaller than they would be were those observations excluded.

  7. Methods to Detect Outliers in R
Outliers in the predictors can often be detected by simply examining the distribution of the predictors (a sketch follows):
• Dot plots
• Stem-and-leaf plots
• Box plots
• Histograms
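A minimal sketch with base R graphics; the data are made up, with one deliberately extreme value:

> x <- c(rnorm(30, mean = 10, sd = 2), 25)   # made-up predictor, one extreme value
> stripchart(x, method = "jitter")           # dot plot
> stem(x)                                    # stem-and-leaf plot
> boxplot(x)                                 # box plot: flags the value 25
> hist(x)                                    # histogram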

  8. Linear Model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

Matrix form: $Y = X\beta + \varepsilon$, where

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

  9. R Functions for Regression Diagnostics

Package  Function               Description
base     plot(model)            Basic diagnostic plots
         ls.diag(lsfit(x, y))   Diagnostic tool for least-squares fits
car      cr.plots(model)        Partial residual plots
         av.plots(model)        Partial regression plots
         hatvalues(model)       Hat values
         outlier.test(model)    Test for largest residual
         dfbetas(model)         DFBETAS measure of influence
         cookd(model)           Cook's D measure of influence
         rstudent(model)        Studentized residuals
         vif(model)             VIF or GVIF for each term in the model
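Several of these names come from an older car release; a hedged sketch with the current equivalents (outlierTest in car, cooks.distance in base R), run on a built-in dataset as a stand-in model:

> fit <- lm(dist ~ speed, data = cars)   # built-in dataset as a stand-in
> plot(fit)                              # basic diagnostic plots
> hatvalues(fit)                         # hat values (now in base R)
> rstudent(fit)                          # Studentized residuals (base R)
> dfbetas(fit)                           # DFBETAS (stats package)
> cooks.distance(fit)                    # Cook's D (replaces car::cookd)
> library(car)                           # assumes car is installed
> outlierTest(fit)                       # current name of outlier.test()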

  10. R Functions for Robust Regression

Package  Function        Description
MASS     rlm(y ~ x)      M-estimation
lqs      ltsreg(y ~ x)   Least trimmed squares
         lmsreg(y ~ x)   Least median of squares regression
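A short sketch on the built-in stackloss data; note that the lqs functions now ship with MASS:

> library(MASS)
> fit.m   <- rlm(stack.loss ~ ., data = stackloss)      # M-estimation
> fit.lts <- ltsreg(stack.loss ~ ., data = stackloss)   # least trimmed squares
> fit.lms <- lmsreg(stack.loss ~ ., data = stackloss)   # least median of squares
> coef(fit.m)   # compare with coef(lm(stack.loss ~ ., data = stackloss))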

  11. Example: Linear Regression (One Independent Variable), Part 1

Matrix form: Y = Xb + e

R / S-plus script:

> xd <- c(rep(1,5), 1,3,4,5,7)
> yd <- c(6,14,10,14,26)
> x <- matrix(xd, 5, 2, byrow=F)
> y <- matrix(yd, 5, 1, byrow=T)
> xtrp <- t(x)                      # matrix transpose
> xxtrp <- xtrp %*% x               # matrix multiplication
> inxxtrp <- solve(xxtrp)           # matrix inversion
> b.hat <- inxxtrp %*% xtrp %*% y   # b.hat = (X'X)^{-1} X'y
> b.hat
     [,1]
[1,]    2
[2,]    3
> H <- x %*% inxxtrp %*% xtrp       # hat matrix
> H
      [,1] [,2] [,3] [,4]  [,5]
[1,]  0.65 0.35  0.2 0.05 -0.25
[2,]  0.35 0.25  0.2 0.15  0.05
[3,]  0.20 0.20  0.2 0.20  0.20
[4,]  0.05 0.15  0.2 0.25  0.35
[5,] -0.25 0.05  0.2 0.35  0.65

  12. Example: Linear Regression (One Independent Variable), Part 2

Extraction of leverages and predicted values. For one independent variable (n = # of observations, p = 1), the leverage of the i-th observation is the i-th diagonal element of H:

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$$

that is, h_ij is the leverage of (x_i, y_i) when i = j.

> n <- 5
> lev <- numeric(n)
> for (i in 1:n) {
+   lev[i] <- H[i,i]
+ }
> lev
[1] 0.65 0.25 0.20 0.25 0.65
> h <- lm.influence(lm(y ~ x))$hat
> h
[1] 0.65 0.25 0.20 0.25 0.65
> ls.diag(lsfit(x[,2], y))$hat
[1] 0.65 0.25 0.20 0.25 0.65
> y1.pred <- 0
> for (i in 1:n) {
+   y1.pred <- y1.pred + H[1,i] * y[i]
+ }
> y1.pred   # y1.pred = intercept(2) + slope(3) * x1(1) = 5
[1] 5
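The formula can be checked directly against the diagonal of H obtained on the previous slide:

> xv <- c(1, 3, 4, 5, 7)                            # the predictor values
> 1/n + (xv - mean(xv))^2 / sum((xv - mean(xv))^2)
[1] 0.65 0.25 0.20 0.25 0.65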

  13. Example: Linear Regression (Measurement of Residuals)

From y-discrepancy to influence:

• Raw residual value (y-discrepancy): $e_i = y_i - \hat{y}_i$
• Standardized residual value (influence): $r_i = \dfrac{e_i}{s\sqrt{1 - h_{ii}}}$, where $s^2 = \dfrac{\sum_j e_j^2}{n - p - 1}$
• Studentized residual value (influence): $t_i = \dfrac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}}$, where $s_{(i)}$ is computed with observation i deleted

  14. Influence, Leverage, and Discrepancy
The influence of an observation is determined jointly by its residual value (discrepancy) and its leverage.

  15. Calculation of Residual Values

# Do it yourself in R
> y.pred <- numeric(n)
> for (i in 1:n) {
+   for (j in 1:n) {
+     y.pred[i] <- y.pred[i] + H[i,j] * yd[j]
+   }
+ }
> res <- yd - y.pred
> Sy <- sqrt(sum(res^2)/(n-2))       # residual standard error (p = 1)
> resstd <- res/(Sy*sqrt(1-lev))     # standardized residuals
> resstd
[1] 0.4413674 0.9045340 -1.1677484 -0.9045340 1.3241022

# Using ls.diag to get residuals
> ls.diag(lsfit(x[,2], y))$std.res    # standardized residuals
[1] 0.4413674 0.9045340 -1.1677484 -0.9045340 1.3241022
> ls.diag(lsfit(x[,2], y))$stud.res   # Studentized residuals
[1] 0.3726780 0.8660254 -1.2909944 -0.8660254 1.6770510
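As a check, the Studentized values can be recovered from the standardized ones without any refitting, via the identity t_i = r_i · sqrt((n − p − 2)/(n − p − 1 − r_i²)); with p = 1 this becomes:

> resstd * sqrt((n - 3) / (n - 2 - resstd^2))
[1] 0.3726780 0.8660254 -1.2909944 -0.8660254 1.6770510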

  16. Example: Multiple Regression

R / S-plus script:

> project.data <- read.csv("projdata.csv")
> model1 <- glm(log10price ~ elevation + date + flood + distance, data=project.data)
> summary(model1)

R output:

Call:
glm(formula = log10price ~ elevation + date + flood + distance, data = project.data)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.22145  -0.09075  -0.04765   0.07475   0.43564

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.226620   0.092763  13.223 4.74e-13 ***
elevation    0.032394   0.007304   4.435 0.000149 ***
date         0.008065   0.001168   6.902 2.50e-07 ***
flood       -0.338254   0.087451  -3.868 0.000659 ***
distance     0.025659   0.007177   3.575 0.001401 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for gaussian family taken to be 0.02453675)

Null deviance:     2.90725 on 30 degrees of freedom
Residual deviance: 0.63796 on 26 degrees of freedom
AIC: -20.414

Number of Fisher Scoring iterations: 2
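A gaussian glm with the identity link is ordinary least squares, so the same fit can be obtained with lm(), which makes the standard plot() diagnostics directly available (a sketch, reusing project.data from the script above):

> model1.lm <- lm(log10price ~ elevation + date + flood + distance, data = project.data)
> plot(model1.lm)   # basic diagnostic plots for the same fit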

  17. Example: Multiple Regression (Measurement of Influence Using R / S-plus)

R / S-plus script:

# Measurement of influence
> y <- matrix(log10price, 31, 1, byrow=T)
> x <- matrix(c(elevation, date, flood, distance), 31, 4, byrow=F)
> lesi <- ls.diag(lsfit(x, y))   # regression diagnostics
> lesi$stud.res                  # extraction of Studentized residuals
> plot(lesi$stud.res, ylab="Studentized residuals", xlab="obs #")
> lesi$cooks                     # extraction of Cook's distances
 [1] 1.392863e-02 3.528960e-01 8.396778e-02 1.518977e-01 1.390608e-01
 [6] 1.145438e-02 2.437453e-03 1.972966e-03 1.705327e-01 9.386767e-02
[11] 7.468621e-03 1.134031e-06 1.945352e-04 1.678359e-03 8.794873e-03
[16] 5.150404e-03 2.257051e-05 4.193730e-03 1.961141e-02 1.120336e-03
[21] 1.075247e-01 1.071167e-02 2.825819e-02 2.193734e-03 5.710213e-02
[26] 7.024345e-02 1.166287e-03 1.322331e-02 2.616666e-03 1.411050e-01
[31] 1.06727e-02

(Figure: Studentized residual plot.)
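The same quantities can also be pulled from the fitted model object itself (a sketch; model1 is the gaussian glm from slide 16):

> rstudent(model1)         # Studentized residuals
> cooks.distance(model1)   # Cook's distances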

  18. Example: Multiple Regression (SAS)

SAS script (note that this model substitutes size for distance, so the estimates differ from the R fit on slide 16):

data land(drop=county sewer);
  infile "c:\stat 6401\projdata.csv" delimiter=',' firstobs=2;
  input price county size elevation sewer date flood distance;
  log10price=log10(price);
run;

proc reg data=land;
  model log10price=elevation size date flood /r;
  plot rstudent.*log10price='+';
  output out=pred pred=phat;
  title 'linear regression for housing prices';
run;

Output:

The REG Procedure
Model: MODEL1
Dependent Variable: log10price

Analysis of Variance
                          Sum of      Mean
Source            DF     Squares    Square   F Value   Pr > F
Model              4     2.00013   0.50003     14.33   <.0001
Error             26     0.90712   0.03489
Corrected Total   30     2.90725

Root MSE         0.18679   R-Square   0.6880
Dependent Mean   0.98126   Adj R-Sq   0.6400
Coeff Var       19.03533

Parameter Estimates
                   Parameter    Standard
Variable    DF      Estimate       Error   t Value   Pr > |t|
Intercept    1       1.38737     0.09877     14.05     <.0001
size         1    0.00012958  0.00011481      1.13     0.2694
elevation    1       0.02820     0.00866      3.26     0.0031
flood        1      -0.23779     0.09837     -2.42     0.0229
date         1       0.00881     0.00150      5.88     <.0001

  19. Example: Multiple Regression (SAS)

Output Statistics
        Dep Var  Predicted     Std Error              Std Error    Student                 Cook's
 Obs  log10price     Value  Mean Predict    Residual   Residual   Residual   -2-1 0 1 2         D
   1      0.6532     0.7276       0.0698     -0.0744      0.140     -0.530   |   *|     |   0.014
   2      1.0253     0.5897       0.0628      0.4356      0.144      3.036   |    |******|  0.353
   3      0.2304     0.3623       0.0850     -0.1319      0.132     -1.002   |  **|     |   0.084
   4      0.6990     0.8682       0.0872     -0.1692      0.130     -1.301   |  **|     |   0.152
   5      0.6990     0.5380       0.0875      0.1609      0.130      1.239   |    |**   |   0.139
   6      0.5185     0.5978       0.0623     -0.0793      0.144     -0.552   |   *|     |   0.011
   7      0.7559     0.8078       0.0474     -0.0519      0.149     -0.348   |    |     |   0.002
   8      0.7924     0.8400       0.0466     -0.0477      0.150     -0.319   |    |     |   0.002
   9      1.2878     1.3972       0.1082     -0.1094      0.113     -0.966   |   *|     |   0.171
  10      0.5051     0.7266       0.0635     -0.2215      0.143     -1.546   | ***|     |   0.094
  11      0.6721     0.7347       0.0634     -0.0626      0.143     -0.437   |    |     |   0.007
  12      0.8388     0.8399       0.0495   -0.001063      0.149    -0.0072   |    |     |   0.000
  13      0.9085     0.9256       0.0416     -0.0171      0.151     -0.113   |    |     |   0.000
  14      1.0645     1.1228       0.0364     -0.0584      0.152     -0.383   |    |     |   0.002
  15      1.2856     1.1825       0.0457      0.1031      0.150      0.688   |    |*    |   0.009
  16      1.0682     1.1578       0.0409     -0.0896      0.151     -0.593   |   *|     |   0.005
  17      1.1239     1.1305       0.0368   -0.006693      0.152    -0.0440   |    |     |   0.000
  18      1.1790     1.0709       0.0315      0.1081      0.153      0.704   |    |*    |   0.004
  19      1.0934     1.2252       0.0519     -0.1317      0.148     -0.891   |   *|     |   0.020
  20      1.1847     1.1382       0.0373      0.0464      0.152      0.305   |    |     |   0.001
  21      1.0864     1.2145       0.0920     -0.1281      0.127     -1.011   |  **|     |   0.108
  22      1.2577     1.0865       0.0318      0.1712      0.153      1.116   |    |**   |   0.011
  23      1.2253     1.0051       0.0393      0.2202      0.152      1.452   |    |**   |   0.028
  24      0.7709     0.7985       0.0728     -0.0277      0.139     -0.200   |    |     |   0.002
  25      0.6021     0.7314       0.0769     -0.1293      0.136     -0.948   |   *|     |   0.057
  26      1.5705     1.2588       0.0431      0.3117      0.151      2.070   |    |**** |   0.070
  27      1.2601     1.2152       0.0391      0.0449      0.152      0.296   |    |     |   0.001
  28      1.1790     1.2709       0.0589     -0.0919      0.145     -0.633   |   *|     |   0.013
  29      1.3598     1.4007       0.0590     -0.0408      0.145     -0.281   |    |     |   0.003
  30      1.1818     1.0540       0.0980      0.1279      0.122      1.047   |    |**   |   0.141
  31      1.3404     1.4003       0.0737     -0.0598      0.138     -0.433   |    |     |   0.011

  20. Example: Multiple Regression (SAS)
(Figures: residual plot and Studentized residual plot.)

  21. Further Studies for Regression Analysis
• Analysis of models:
• Multicollinearity (a vif sketch follows)
• Heteroscedasticity
• Autocorrelation
• Validation of models
• R functions for modern regression:
http://socserv.socsci.mcmaster.ca/andersen/ICPSR/RFunctions.pdf
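For the multicollinearity item, a sketch using car::vif on model1 from slide 16 (assumes the car package is installed):

> library(car)
> vif(model1)   # VIF (or GVIF) per term; values well above the usual
>               # 5-10 rules of thumb flag collinearity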

  22. The End
