
Chapter 6: Regression Diagnostics



  1. Chapter 6: Regression Diagnostics

  2. Chapter 6: Regression Diagnostics

  3. Objectives • Identify common problems in regression. • Review the assumptions of linear regression. • Examine the assumptions with scatterplots and residual plots. • Evaluate models where assumptions are met. • Evaluate models where assumptions are not met and examine possible alternatives.

  4. Common Problems • Five common problems with regression are as follows: • nonconstant variance • nonnormal errors • correlated errors • influential observations • collinearity

  5. Assumptions for Linear Regression • The variables are related linearly. • The errors are normally distributed with a mean of zero. • The errors have a constant variance. • The errors are independent.

  6. Verifying Assumptions

  7. Examining Residual Plots

  8. Examining Residuals for Normality
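
A minimal sketch of these checks (in Python with statsmodels and matplotlib on simulated data, rather than the JMP demonstration used in the course): a residuals-versus-fitted plot and a normal quantile plot of the residuals.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data: one predictor, normally distributed errors
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)

# Fit the linear regression model
fit = sm.OLS(y, sm.add_constant(x)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted values: curvature suggests nonlinearity,
# a funnel shape suggests nonconstant variance
ax1.scatter(fit.fittedvalues, fit.resid)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Normal quantile (Q-Q) plot of the residuals: points near the
# reference line support the normality assumption
sm.qqplot(fit.resid, line="45", fit=True, ax=ax2)

plt.tight_layout()
plt.show()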

  9. Evaluating the Assumptions of Regression • This demonstration illustrates the concepts discussed previously.

  10. Model Assumptions Not Met • The model is more than the interpolating function for the response, based on one or more predictors. • It also addresses the distribution of the errors, which are estimated by the residuals. • Various diagnostic methods, based on the residuals, might indicate a violation of the model assumptions. • Some of these problems can be corrected by transforming the response or predictor variables.

  11. Transforming Variables • Variable transformations can be useful when one or more of the assumptions of linear regression are violated. They are typically used to • linearize the relationship between X and Y • stabilize the variance of the residuals • normalize the residuals.

  12. Ad Hoc Transformations • Data analysts have used empirical trials to find a mathematical function to serve as a transformation of the predictor or response. • Square root • Logarithm

  13. Systematic Transformation • Tukey devised a ladder of re-expressions to organize the transformations as common powers. • -3, -2, -1, -½, 0, ½, 1, 2, 3 (with the power 0 standing for the logarithm) • The nature of the curvature in the response directed the kind and strength of the power. • This ladder is captured in the bulging rule and a graphic.
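
A minimal sketch of walking the ladder (in Python on simulated data; the power 0 is taken as the log, following the usual convention) refits the model for each candidate power of the response. Comparing R-squared values across different response scales is only a rough screen, in keeping with the ad hoc spirit of these transformations.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = np.exp(0.4 * x + rng.normal(0, 0.2, 200))  # curved relationship

# Candidate powers from the ladder; 0 denotes the log transform
powers = [-2, -1, -0.5, 0, 0.5, 1, 2]

X = sm.add_constant(x)
for p in powers:
    y_t = np.log(y) if p == 0 else y ** p
    r2 = sm.OLS(y_t, X).fit().rsquared
    print(f"power {p:>4}: R-squared = {r2:.3f}")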

  14. The Bulging Rule • Examine the curvature of the relationship. [Figure: bulging-rule quadrant diagram with axes labeled Y power and X power]

  15. The Box-Cox Power Transformation • Box and Cox generalized the ladder of re-expression into a function of a continuous power, λ. • Transformation is applied over a reasonable range of powers. The best power minimizes the error sum of squares. • The final choice of power usually affords a physical interpretation.
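
A minimal sketch of the Box-Cox search (in Python on simulated data; the transformed response is scaled by the geometric mean so that error sums of squares are comparable across powers, which is the usual convention):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = (2.0 + 0.8 * x + rng.normal(0, 0.5, 200)) ** 2  # positive response

X = sm.add_constant(x)
lambdas = np.linspace(-2, 2, 81)

# Geometric mean of y, used to put all powers on a comparable scale
gm = np.exp(np.mean(np.log(y)))

best = None
for lam in lambdas:
    if abs(lam) < 1e-8:
        y_t = gm * np.log(y)                        # lambda = 0 is the log
    else:
        y_t = (y ** lam - 1) / (lam * gm ** (lam - 1))
    sse = sm.OLS(y_t, X).fit().ssr                  # error sum of squares
    if best is None or sse < best[1]:
        best = (lam, sse)

print(f"lambda minimizing the error sum of squares: {best[0]:.2f}")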

  16. Variable Transformations Using Fit Special and Box-Cox • This demonstration illustrates the concepts discussed previously.

  17. Exercise • This exercise reinforces the concepts discussed previously.

  18. 6.01 Quiz • Match each residual plot (1, 2, and 3, shown on the left) to the transformation that would be used to address the assumption violation: • a. Linearizing • b. Variance stabilizing • c. Normalizing • d. No transformation

  19. 6.01 Quiz – Correct Answer • Match each residual plot (1, 2, and 3) to the transformation that would be used to address the assumption violation: • a. Linearizing • b. Variance stabilizing • c. Normalizing • d. No transformation • Answer: 1-D, 2-B, 3-A

  20. Chapter 6: Regression Diagnostics

  21. Objectives • Generate graphs to examine the influence of individual observations. • Use statistics to identify potential outliers and influential observations.

  22. Multivariate Outlier Analysis • Mahalanobis distance

  23. Mahalanobis and Jackknife Distance Statistics • This demonstration illustrates the concepts discussed previously.

  24. 6.02 Quiz Why are points more likely to be detected as multivariate outliers using jackknife distances instead of ordinary Mahalanobis distances?

  25. 6.02 Quiz – Correct Answer Why are points more likely to be detected as multivariate outliers using jackknife distances instead of ordinary Mahalanobis distances? The jackknife distance excludes the point in question from the calculation of the centroid, so it is more likely to appear unusual when compared to the centroid. The Mahalanobis distance includes all points in the calculation of the centroid.
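
A minimal sketch of the two distances (in plain NumPy on simulated data; the jackknife version here simply leaves the point in question out of the centroid and covariance calculation, illustrating the idea rather than reproducing JMP's exact formula):

import numpy as np

def mahalanobis_distances(X):
    # Ordinary Mahalanobis distance of each row from the overall centroid
    center = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diffs = X - center
    return np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))

def jackknife_distances(X):
    # Distance of each row from a centroid and covariance computed
    # with that row excluded
    n = X.shape[0]
    d = np.empty(n)
    for i in range(n):
        rest = np.delete(X, i, axis=0)
        center = rest.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(rest, rowvar=False))
        diff = X[i] - center
        d[i] = np.sqrt(diff @ cov_inv @ diff)
    return d

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
X[0] = [4.0, 4.0, 4.0]  # plant one multivariate outlier

print("Mahalanobis distance of the outlier:", mahalanobis_distances(X)[0])
print("Jackknife distance of the outlier:  ", jackknife_distances(X)[0])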

  26. Influential Observations

  27. Leverage Plots • Producing scatterplots of the response (Y) versus each of the possible predictor variables (the Xs) is recommended. • In the multiple regression situation, however, these plots can be somewhat misleading because Y might depend on the other Xs not accounted for in the plot. • Leverage plots compensate for this limitation of scatterplots.

  28. Example of a Scatterplot

  29. Example of a Leverage Plot

  30. Leverage Plots • Consider a multiple linear regression model with Y as the dependent variable and X1, X2, and X3 as the independent variables. • To create a leverage plot for one of the independent variables, for example X1 (see the sketch after this slide): • Regress Y on X2 and X3. These residuals form the vertical axis of the leverage plot. • Regress X1 on X2 and X3. These residuals form the horizontal axis of the leverage plot.
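
A minimal sketch of this construction (in Python with statsmodels and matplotlib on simulated data), building the leverage plot for X1 from the two sets of residuals described above. The slope of the least squares line through this scatter equals the coefficient of X1 in the full model, which is why the plot isolates the contribution of X1.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(0, 1, n)

others = sm.add_constant(np.column_stack([x2, x3]))

# Residuals of Y regressed on X2 and X3: vertical axis
y_resid = sm.OLS(y, others).fit().resid
# Residuals of X1 regressed on X2 and X3: horizontal axis
x1_resid = sm.OLS(x1, others).fit().resid

plt.scatter(x1_resid, y_resid)
plt.xlabel("Residuals of X1 on X2, X3")
plt.ylabel("Residuals of Y on X2, X3")
plt.title("Leverage plot for X1")
plt.show()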

  31. Leverage Plots • This demonstration illustrates the concepts discussed previously.

  32. Summary of Leverage Plots • No strong patterns indicating influential observations are obvious in any of the plots. Consequently, it appears that the model fits the data well. • Gracie appears to have some strong influence on the slopes of RunPulse and MaxPulse. • Gracie might have some influence on the whole model. Sammy might have some influence on the slope of Age.

  33. Cook’s D: An Influence Statistic • Cook’s D statistic is a measure of the simultaneous change in the parameter estimates when an observation is deleted from the analysis. • A suggested cutoff for deciding whether an observation might have an adverse effect on the analysis is Di > 4/n, where n is the sample size.
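
A minimal sketch of computing Cook's D and applying the 4/n cutoff (in Python with statsmodels on simulated data):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 60
x = rng.normal(size=(n, 2))
y = 1.0 + x @ np.array([2.0, -1.0]) + rng.normal(0, 1, n)
y[0] += 8.0  # make one observation badly fit

fit = sm.OLS(y, sm.add_constant(x)).fit()

# cooks_distance returns (distances, p-values); keep the distances
cooks_d = fit.get_influence().cooks_distance[0]

cutoff = 4 / n  # suggested cutoff from the slide above
flagged = np.where(cooks_d > cutoff)[0]
print("Observations with Cook's D above 4/n:", flagged)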

  34. Looking for Influential Observations • This demonstration illustrates the concepts discussed previously.

  35. How to Handle Influential Observations • Recheck the data to ensure that no transcription or data entry errors have occurred. • If the data are valid, one possible explanation is that the model is not adequate. A model with higher-order terms, such as polynomials and interactions between the variables, might be necessary to fit the data well. • If the data are valid and no higher-order terms are necessary, either collect additional data, redefine the population, or report the model results with and without the influential observation.

  36. Exercise • This exercise reinforces the concepts discussed previously.

  37. 6.03 Quiz • Cook’s D for a particular ith data point examines the square of the difference between the model’s least squares parameter estimates using all n data points (call it bn) and the model’s least squares estimates using all but the one point of interest (call it bn-i). • (bn − bn-i)² • Why is a Cook’s D value close to zero indicative that the point is not influential?

  38. 6.03 Quiz – Correct Answer • Cook’s D for a particular ith data point examines the square of the difference between the model’s least squares parameter estimates using all n data points (call it bn) and the model’s least squares estimates using all but the one point of interest (call it bn-i). • (bn − bn-i)² • Why is a Cook’s D value close to zero indicative that the point is not influential? • If the difference is close to zero, then the parameter estimates for the model hardly changed at all when the ith point was removed from the data.

  39. Chapter 6: Regression Diagnostics

  40. Objectives • Understand how collinearity causes problems in multiple regression. • Detect collinearity and resolve it.

  41. Collinear Predictors • Consider two highly correlated predictors. X1 ≈ X2

  42. Collinear Predictors and Interpretation • If larger Ys are good and smaller Ys are bad, each relationship gives a strikingly different interpretation.

  43. Illustration of Collinearity [Figure: Y plotted against the collinear predictors X1 and X2]

  44. Illustration of Collinearity [Figure: Y plotted against X1 and X2, continued]

  45. Illustration of Collinearity [Figure: Y plotted against X1 and X2, continued]

  46. Effects of Collinearity on Parameter Estimates (Optional) • This demonstration illustrates the concepts discussed previously.

  47. Variance Inflation Factor (VIF) • VIF indicates how much the variance of the parameter estimates will increase due to the multiple correlation between one effect and all of the other effects in the model. • R²: Y = β0 + β1X1 + β2X2 + β12X1X2 • R1²: X1 = θ0 + θ2X2 + θ12X1X2 (X1 regressed on the other effects in the model)

  48. VIF Formula • VIFi = 1 / (1 − Ri²) • Ideally Ri² = 0, so the VIFi is 1. • VIF > 10 is cause for concern. It means that Ri² > 0.9.
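
A minimal sketch of computing VIFs (in Python with statsmodels, on simulated data in which x1 and x2 are nearly collinear):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.1, n)   # nearly collinear with x1
x3 = rng.normal(size=n)

# Design matrix with an intercept in column 0
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
# predictor i on all of the other predictors
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, round(variance_inflation_factor(X, i), 1))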
