
BMTRY 701 Biostatistical Methods II



  1. Lecture 13: Diagnostics in MLR
  • Added variable plots
  • Identifying outliers
  • Variance Inflation Factor
  BMTRY 701 Biostatistical Methods II

  2. Recall the added variable plots
  • These can help check the adequacy of the model: is there curvature between Y and X after adjusting for the other X's?
  • "Refined" residual plots
  • They show the marginal importance of an individual predictor
  • They help figure out a good form for the predictor

  3. Example: SENIC
  • Recall the difficulty determining the form for INFRISK in our regression model
  • Last time, we settled on including one term, INFRISK^2
  • But we could take an added variable plot approach
  • How? We want to know: adjusting for all else in the model, what is the right form for INFRISK?

  4. R code
  # regress logLOS on everything except INFRISK and keep the residuals
  av1 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
  # regress INFRISK on the same predictors and keep the residuals
  av2 <- lm(INFRISK ~ AGE + XRAY + CENSUS + factor(REGION))
  resy <- av1$residuals
  resx <- av2$residuals
  # added variable plot: residual Y versus residual X, with fitted line
  plot(resx, resy, pch = 16)
  abline(lm(resy ~ resx), lwd = 2)

  5. Added Variable Plot

  6. What does that show?
  • The relationship between logLOS and INFRISK if you added INFRISK to the regression
  • But is that what we want to see?
  • How about looking at the residuals versus INFRISK (before including INFRISK in the model)?

  7. R code
  # fit the model without INFRISK
  mlr8 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
  # plot its residuals against INFRISK, with a lowess smoother
  smoother <- lowess(INFRISK, mlr8$residuals)
  plot(INFRISK, mlr8$residuals)
  lines(smoother)

  8. R code
  > infrisk.star <- ifelse(INFRISK > 4, INFRISK - 4, 0)
  > mlr9 <- lm(logLOS ~ INFRISK + infrisk.star + AGE + XRAY +
  +            CENSUS + factor(REGION))
  > summary(mlr9)

  Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
  (Intercept)      1.798e+00  1.667e-01  10.790  < 2e-16 ***
  INFRISK          1.836e-03  1.984e-02   0.093 0.926478
  infrisk.star     6.795e-02  2.810e-02   2.418 0.017360 *
  AGE              5.554e-03  2.535e-03   2.191 0.030708 *
  XRAY             1.361e-03  6.562e-04   2.073 0.040604 *
  CENSUS           3.718e-04  7.913e-05   4.698 8.07e-06 ***
  factor(REGION)2 -7.182e-02  3.051e-02  -2.354 0.020452 *
  factor(REGION)3 -1.030e-01  3.036e-02  -3.391 0.000984 ***
  factor(REGION)4 -2.068e-01  3.784e-02  -5.465 3.19e-07 ***
  ---
  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  Residual standard error: 0.1137 on 104 degrees of freedom
  Multiple R-Squared: 0.6209, Adjusted R-squared: 0.5917
  F-statistic: 21.29 on 8 and 104 DF, p-value: < 2.2e-16

  9. Residual Plots
  [two residual plots: one for the spline for INFRISK, one for INFRISK^2]

  10. Which is better?
  • We cannot compare them via ANOVA because they are not nested!
  • But we can compare fit statistics qualitatively (see the sketch below)
  • R-squared: MLR7 = 0.60, MLR9 = 0.62
  • Partial R-squared: MLR7 = 0.17, MLR9 = 0.19
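A rough sketch in R of where those numbers come from, assuming mlr8 (the reduced model from slide 7) and mlr9 are fitted as shown; mlr7, the INFRISK^2 model, is fit the same way but is not shown in this transcript:

  sse <- function(fit) sum(residuals(fit)^2)   # error sum of squares
  summary(mlr9)$r.squared                      # overall R-squared (about 0.62)
  (sse(mlr8) - sse(mlr9)) / sse(mlr8)          # partial R-squared for the INFRISK terms (about 0.19)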

  11. Identifying Outliers
  • Harder to do in the MLR setting than in the SLR setting
  • Recall two concepts that make outliers important:
  • Leverage is a function of the explanatory variable(s) alone and measures the potential for a data point to affect the model parameter estimates
  • Influence is a measure of how much a data point actually does affect the estimated model
  • Both leverage and influence can be defined in terms of matrices

  12. “Hat” matrix
  • We must do some matrix work to understand this
  • Section 6.2 covers MLR in matrix terms
  • Notation for an MLR with p predictors and data on n patients
  • The data: (Xi1, Xi2, ..., Xip, Yi) for patients i = 1, ..., n

  13. Matrix Format for the MLR model
  • More notation: Y = the response vector, X = the design matrix (a column of 1's plus the p predictors), β = the coefficient vector, ε = the error vector
  • THE MODEL: Y = Xβ + ε
  • What are the dimensions of each? Y is n x 1, X is n x (p+1), β is (p+1) x 1, ε is n x 1

  14. “Transpose” and “Inverse”
  • X-transpose: X' or XT
  • X-inverse: X^(-1)
  • Hat matrix: H = X(X'X)^(-1)X'
  • Why is H important? It transforms Y's to Yhat's: Yhat = HY
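A minimal sketch of this algebra in R, assuming fit is any fitted lm object (e.g., mlr9):

  # build H = X (X'X)^(-1) X' from the design matrix and check
  # that it maps Y onto the fitted values
  X <- model.matrix(fit)                        # n x (p+1) design matrix
  H <- X %*% solve(t(X) %*% X) %*% t(X)         # the hat matrix
  Y <- model.response(model.frame(fit))         # the response vector
  all.equal(as.vector(H %*% Y), unname(fitted(fit)))   # Yhat = HY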

  15. Estimating, based on the fitted model
  • Estimated variance-covariance matrix of the residuals: s^2{e} = MSE (I - H)
  • Estimated variance of the ith residual: s^2{ei} = MSE (1 - hii)
  • Estimated covariance of the ith and jth residuals: s{ei, ej} = -MSE hij

  16. Other uses of H
  • I = identity matrix (n x n)
  • Residuals: e = Y - Yhat = (I - H)Y
  • Variance-covariance matrix of the residuals: Var(e) = σ^2 (I - H)
  • Variance of the ith residual: Var(ei) = σ^2 (1 - hii)
  • Covariance of the ith and jth residuals: Cov(ei, ej) = -σ^2 hij

  17. Property of hij's
  • For each row i: Σj hij = 1
  • This means that each row of H sums to 1
  • And, since H is symmetric, each column of H sums to 1 as well
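Continuing the sketch from slide 14 (H built there), the row-sum property is easy to verify numerically:

  range(rowSums(H))     # both ends should be 1, up to rounding error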

  18. Other use of H
  • Identifies points of leverage
  [scatterplot of Y versus X with four labeled points]

  19. Using the Hat Matrix to identify outliers
  • Look at hii to see if a data point is an outlier
  • Large values of hii imply small values of var(ei); as hii gets close to 1, var(ei) approaches 0
  • Note that Yhat_i = hii Yi + Σ(j ≠ i) hij Yj
  • As hii approaches 1, Yhat_i approaches Yi
  • This gives hii the name "leverage"
  • HIGH HAT VALUE IMPLIES POTENTIAL FOR OUTLIER!

  20. R code
  # reg is a previously fitted lm; hat values measure leverage
  hat <- hatvalues(reg)
  plot(1:102, hat)                  # index plot of the hat values (n = 102 here)
  # flag and highlight the high-leverage points
  highhat <- ifelse(hat > 0.10, 1, 0)
  plot(x, y)
  points(x[highhat == 1], y[highhat == 1], col = 2, pch = 16, cex = 1.5)

  21. Hat values versus index

  22. Identifying points with high hii

  23. Does a high hat mean it has a large residual?
  • No. hii measures leverage, not influence
  • Recall what hii is made of: it depends ONLY on the X's, not on the actual Y value
  • Look back at the plot: which of these is probably most "influential"?
  • Standard cutoffs for "large" hii: 2p/n; hii between 0.2 and 0.5 is high, and above 0.5 is very high (see the sketch below)
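A minimal sketch of the 2p/n rule in R, assuming fit is a fitted lm (e.g., mlr9) and counting p as the number of estimated coefficients:

  h <- hatvalues(fit)
  p <- length(coef(fit))                 # estimated coefficients, intercept included
  n <- nobs(fit)
  which(h > 2 * p / n)                   # cases flagged as high leverage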

  24. Let’s look at our MLR9 • Any outliers?

  25. Using the hat matrix in MLR
  • Studentized residuals
  • Acknowledge that each residual has a different variance, and that the magnitude of a residual should be judged relative to its variance (or sd)
  • Studentized residuals recognize these differences in sampling error

  26. Defining Studentized Residuals
  • From slide 15, s{ei} = sqrt(MSE (1 - hii))
  • We then define ri = ei / s{ei} = ei / sqrt(MSE (1 - hii))
  • Comparing ei and ri:
  • ei have different variances due to sampling variation
  • ri have constant variance
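A sketch in R, assuming fit is a fitted lm; the by-hand calculation should agree with R's built-in rstandard():

  e <- residuals(fit)
  h <- hatvalues(fit)
  MSE <- sum(e^2) / df.residual(fit)
  r <- e / sqrt(MSE * (1 - h))           # each ei scaled by its estimated sd
  all.equal(r, rstandard(fit))           # matches the built-in version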

  27. Deleted Residuals
  • Influence is more intuitively quantified by how things change when an observation is in versus out of the estimation process
  • It would be more useful to have residuals for the situation when the observation is removed
  • Example: if Yi is far out, it may be very influential in the regression, and its residual will be small; but if that case is removed before estimating, and its residual is then calculated from that fit, the residual will be large

  28. Deleted Residuals, di
  • Process: delete the ith case; fit the regression with all other cases; obtain the estimate Yhat(i) of E(Yi) based on its X's and the fitted model
  • The deleted residual is di = Yi - Yhat(i)

  29. Deleted Residuals, di
  • Nice result: you don't actually have to refit without the ith case!
  • di = ei / (1 - hii), where ei is the 'plain' residual from the ith case and hii is the hat value; both come from the regression INCLUDING the case
  • For small hii: ei and di will be similar
  • For large hii: ei and di will be different (see the sketch below)
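A minimal sketch in R, assuming fit is a fitted lm:

  e <- residuals(fit)
  h <- hatvalues(fit)
  d <- e / (1 - h)                       # deleted residuals, no refitting
  head(cbind(e, h, d))                   # d pulls away from e where hii is large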

  30. Studentized Deleted Residuals
  • Recall the need to standardize, based on knowledge of the variance:
  • ti = di / s{di} = ei / sqrt(MSE(i) (1 - hii))
  • The difference between ti and ri? ti uses MSE(i), estimated with case i removed; ri uses the MSE from the full fit

  31. Another nice result
  • You can calculate MSE(i) without refitting the model
  • With p the number of estimated coefficients: (n - p) MSE = (n - p - 1) MSE(i) + ei^2 / (1 - hii)
  • Rearranging, with SSE = Σ ei^2: MSE(i) = (SSE - ei^2 / (1 - hii)) / (n - p - 1)
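A sketch of the identity in R, assuming fit is a fitted lm and counting p as the number of estimated coefficients; the result can be cross-checked against lm.influence():

  e <- residuals(fit); h <- hatvalues(fit)
  n <- nobs(fit); p <- length(coef(fit))
  SSE <- sum(e^2)
  MSE.i <- (SSE - e^2 / (1 - h)) / (n - p - 1)     # leave-one-out MSEs
  all.equal(sqrt(MSE.i), lm.influence(fit)$sigma)  # matches R's sigma(i)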

  32. Testing for outliers
  • Outlier = a Y observation whose studentized deleted residual is large in absolute value
  • ti ~ t with n - p - 1 degrees of freedom
  • Two examples: simulated data and mlr9 (see the sketch below)
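A sketch of the test in R, assuming fit is a fitted lm (e.g., mlr9); the Bonferroni adjustment across all n tests is a common choice, assumed here rather than taken from the slides:

  t.i <- rstudent(fit)                   # studentized deleted residuals
  n <- nobs(fit); p <- length(coef(fit))
  crit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)   # Bonferroni cutoff, alpha = 0.05
  which(abs(t.i) > crit)                 # cases flagged as outliers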
