Lecture 25

  • Regression diagnostics for the multiple linear regression model

  • Dealing with influential observations for multiple linear regression

  • Interaction variables

Assumptions of Multiple Linear Regression Model

  • Assumptions of multiple linear regression:

    • For each subpopulation defined by the explanatory variables (x1, …, xp):

      • (A-1A) μ{y | x1, …, xp} = β0 + β1x1 + … + βpxp (the mean of y is a linear function of the x's)

      • (A-1B) SD{y | x1, …, xp} = σ (constant spread about the regression surface)

      • (A-1C) The distribution of y given (x1, …, xp) is normal

        [The distribution of the residuals should not depend on (x1, …, xp)]


    • (A-2) The observations are independent of one another

Checking/Refining Model

  • Tools for checking (A-1A) and (A-1B)

    • Residual plots versus predicted (fitted) values

    • Residual plots versus explanatory variables

    • If model is correct, there should be no pattern in the residual plots

  • Tool for checking (A-1C)

    • Histogram of residuals

  • Tool for checking (A-2)

    • Residual plot versus time or spatial order of observations
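All of the checks above start from the fitted values and residuals. As a rough sketch of how those quantities arise (this is not the lecture's JMP workflow, and the data below are simulated, not the pollution data):

```python
import numpy as np

# Simulated stand-in data (not the pollution data from the lecture).
rng = np.random.default_rng(1)
n = 60
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 5, n)
y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 1, n)

# Least-squares fit: design matrix with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta_hat          # predicted values for the residual plots
residuals = y - fitted         # plot these vs. fitted, vs. x1, vs. x2, vs. time order

# With an intercept in the model, least-squares residuals average to zero.
print(abs(float(residuals.mean())) < 1e-8)
```

Plotting `residuals` against `fitted`, each explanatory variable, and observation order gives exactly the four diagnostic plots listed above.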

Model Building (Display 9.9)

  • Make a scatterplot matrix of the variables (using Analyze, Multivariate in JMP). Decide whether to transform any of the explanatory variables. Check for obvious outliers.

  • Fit tentative model.

  • Check residual plots for whether assumptions of multiple regression model are satisfied. Look for outliers and influential points.

  • Consider fitting richer model with interactions or curvature. See if extra terms can be dropped.

  • Make changes to model and repeat steps 2-4 until an adequate model is found.

Multiple Regression, Modeling and Outliers, Leverage and Influential Points: Pollution Example

  • The data set pollutionhc.JMP provides information about the relationship between pollution and mortality for 60 cities during 1959–1961.

  • The variables are:

    • y (MORT) = total age-adjusted mortality in deaths per 100,000 population

    • PRECIP = mean annual precipitation (in inches)

    • EDUC = median number of school years completed for persons 25 and older

    • NONWHITE = percentage of the 1960 population that is nonwhite

    • HC = relative pollution potential of hydrocarbons (tons emitted per day per square kilometer, times a factor correcting for SMSA dimension and exposure)

Transformations for Explanatory Variables

  • In deciding whether to transform an explanatory variable x, we consider two features of the plot of the response y vs. the explanatory variable x.

  • Is there curvature in the relationship between y and x? This suggests a transformation chosen by Tukey’s Bulging rule.

  • Are most of the x values “crunched together” and a few very spread apart? This will lead to several points being very influential. When this is the case, it is best to transform x to make the x values more evenly spaced and less influential. If the x values are positive, the log transformation is a good idea.

  • For the pollution data, reason 2 suggests transforming HC to log HC.
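A quick way to see why the log helps with "crunched together" x values is to compare skewness before and after the transform. This is a hypothetical sketch; the values below are made up, merely standing in for a skewed variable like HC:

```python
import numpy as np

# Made-up, strongly right-skewed positive values standing in for HC.
hc = np.array([1, 2, 3, 4, 5, 6, 8, 10, 15, 20, 30, 50, 100, 300, 600.0])

def skewness(v):
    """Standardized third moment: large positive => long right tail."""
    v = np.asarray(v, dtype=float)
    return float(np.mean((v - v.mean()) ** 3) / v.std() ** 3)

# After the log transform the values are far more evenly spaced, so no
# handful of extreme points dominates the fit.
print(skewness(hc) > skewness(np.log(hc)))
```

On the raw scale the few largest values would carry almost all the influence; on the log scale the spacing is much more even.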

Residual vs. Predicted Plot

  • Useful for detecting nonconstant variance; look for fan or funnel pattern.

  • Plot of residuals res_i = y_i − ŷ_i versus predicted (fitted) values ŷ_i.

  • For pollution data, no strong indication of nonconstant variance.

Residual Plots vs. Each Explanatory Variable

  • Make plot of residuals vs. an explanatory variable by using Fit Model, clicking red triangle next to response, selecting Save Columns and selecting save residuals. This creates a column of residuals. Then click Analyze, Fit Y by X and put residuals in Y and the explanatory variable in X.

  • Use these residual plots to check for pattern in the mean of residuals (suggests that we need to transform x or use a polynomial in x) or pattern in the variance of the residuals.

Residual plots look fine. No strong indication of nonlinearity or nonconstant variance.

Check of Normality/Outliers

Normality looks okay. One residual outlier, Lancaster.
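The screen behind a statement like "one residual outlier" can be sketched by standardizing the residuals and flagging large values. The data are simulated, with one deliberately shifted response standing in for Lancaster; the |z| > 3 cutoff is an assumption for illustration, not the lecture's rule:

```python
import numpy as np

# Simulated data; observation 10 gets a gross shift to play the outlier.
rng = np.random.default_rng(5)
n = 60
x = rng.uniform(0, 10, n)
y = 2 + 1.5 * x + rng.normal(0, 1, n)
y[10] += 8.0                              # one planted residual outlier

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat (projection) matrix
e = y - H @ y                             # least-squares residuals
z = e / e.std()                           # standardized residuals

flagged = np.where(np.abs(z) > 3)[0].tolist()   # assumed cutoff of 3
print(flagged)                                  # the planted outlier shows up
```

A histogram of `e` is the normality check described above; the standardization just makes "how far out is this residual?" comparable across data sets.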

Influential Observations

  • As in simple linear regression, one or two observations can strongly influence the estimates.

  • Harder to immediately see the influential observations in multiple regression.

  • Use Cook’s distances (JMP’s “Cook’s D Influence” column) to look for influential observations. An observation has large influence if its Cook’s distance is greater than 1.

  • Can use Table, Sort to sort observations by Cook’s Distance or Leverage.

  • For pollution data: no observation has high influence.
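The lecture reads Cook's distances off JMP's saved column, but the statistic itself just combines the residual and the leverage: D_i = e_i² / (p·s²) × h_i / (1 − h_i)², with p the number of regression coefficients. A sketch on simulated, well-behaved data:

```python
import numpy as np

# Simulated well-behaved data (no planted influential point).
rng = np.random.default_rng(2)
n = 60
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([2.0, 1.5, -0.8]) + rng.normal(0, 1, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
h = np.diag(H)                            # leverages h_i
e = y - H @ y                             # residuals
p = X.shape[1]                            # number of coefficients
s2 = float(e @ e) / (n - p)               # estimate of error variance

# Cook's distance: large residual AND large leverage => large influence.
cooks_d = (e**2 / (p * s2)) * (h / (1 - h) ** 2)

# Clean data: nothing above the rule-of-thumb cutoff of 1.
print(int((cooks_d > 1).sum()))
```

Sorting observations by `cooks_d` mirrors the Table, Sort step described above.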

Strategy for Dealing with Influential Observations

  • Use Display 11.8

  • Leverage of point: measure of distance between point’s explanatory variable values and explanatory variable values in entire data set.

  • Two sources of influence: leverage, magnitude of residual.

  • General approach: If an influential point has high leverage, omit point and report conclusions for reduced range of explanatory variables. If an influential point does not have high leverage, then the point cannot just be removed. We can report results with and without point.

Leverage

  • Obtaining leverages from JMP: After Fit Model, click red triangle next to Response, select Save Columns, Hats.

  • Leverages are between 1/n and 1. Average leverage is p/n.

  • An observation is considered to have high leverage if its leverage is greater than 2p/n, where p = number of explanatory variables. For the pollution data, 2p/n = (2 × 4)/60 = 0.133.
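The hat values JMP saves are the diagonal of X(X′X)⁻¹X′, and the 2p/n rule (using the slide's convention p = number of explanatory variables) can be applied to them directly. A sketch with simulated data, where one deliberately extreme x value gets flagged:

```python
import numpy as np

# Simulated x values, with one point pushed far from the rest.
rng = np.random.default_rng(3)
n = 60
x = np.append(rng.uniform(0, 10, n - 1), 100.0)   # last point is far out
X = np.column_stack([np.ones(n), x])

# Leverages are the diagonal of the hat matrix X (X'X)^{-1} X'.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

p = 1                                    # slide's p = # explanatory variables
flagged = np.where(h > 2 * p / n)[0]     # high-leverage rule of thumb
print(n - 1 in flagged)                  # the far-out point is flagged
```

Note each leverage lies between 1/n and 1, as the slide states, so a point near 1 is essentially determining its own fitted value.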

Specially Constructed Explanatory Variables

  • Interaction variables

  • Squared and higher polynomial terms for curvature

  • Dummy variables for categorical variables.

Interaction

  • Interaction is a three-variable concept. One of these is the response variable (Y) and the other two are explanatory variables (X1 and X2).

  • There is an interaction between X1 and X2 if the impact of an increase in X2 on Y depends on the level of X1.

  • To incorporate interaction in the multiple regression model, we add the explanatory variable X1·X2 (the product of the two explanatory variables). There is evidence of an interaction if the coefficient on X1·X2 is significant (t-test has p-value < .05).
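Building the interaction column by hand shows what the extra term is: just the product of the two explanatory variables. A sketch with simulated data containing a genuine interaction (the lecture does this through JMP rather than code; the true coefficients below are made up):

```python
import numpy as np

# Simulated data with a real interaction: the effect of x2 on y grows with x1.
rng = np.random.default_rng(4)
n = 200
x1, x2 = rng.uniform(0, 5, n), rng.uniform(0, 5, n)
y = 1 + 2 * x1 + 3 * x2 + 0.5 * x1 * x2 + rng.normal(0, 1, n)

# Add the product x1*x2 as an extra column of the design matrix.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The estimated interaction coefficient should land near the true 0.5.
print(float(beta_hat[3]))
```

A t-test on that coefficient (as the slide describes) is what decides whether the interaction term stays in the model.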

Interaction Variables in JMP

  • To add an interaction variable in Fit Model in JMP, add the usual explanatory variables first, then highlight X1 in the Select Columns box and X2 in the Construct Model Effects box. Then click Cross in the Construct Model Effects box.

  • JMP creates the crossed explanatory variable X1*X2 (by default, JMP centers each variable about its mean before taking the product).