
# Lecture 18: Thurs., March 18






• The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable.

• Unitless measure of strength of relationship between x and y

• Total sum of squares $= \sum_{i=1}^{n}(y_i-\bar{y})^2$. This is the best (smallest) sum of squared prediction errors attainable without using x, i.e., when every response is predicted by $\bar{y}$.

• Residual sum of squares $= \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$, the sum of squared prediction errors from the least squares line.

• $R^2 = \dfrac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}} = 1 - \dfrac{\text{RSS}}{\text{TSS}}$

• R2=.6501. Read as “65.01 percent of the variation in car prices was explained by the linear regression on odometer.”

• R2 takes on values between 0 and 1, with higher R2 indicating a stronger linear association.

• For simple linear regression, R2 is the square of the correlation between the predictor and the response.

• If the residuals are all zero (a perfect fit), then R2 is 1. If the least squares line has slope 0, R2 will be 0.

• R2 is useful as a unitless summary of the strength of linear association.

• R2 is not useful for assessing model adequacy, i.e., whether the simple linear regression model holds (use residual plots), or whether there is an association at all (use the test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$).

• A good R2 depends on the context. In precise laboratory work, R2 values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in a response, an R2 of 50% may be considered remarkably good.
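
The definitions above can be checked numerically. Below is a minimal Python sketch, using made-up (x, y) values loosely in the spirit of the odometer/price example (the numbers are invented, not the data behind R2 = .6501), that computes R2 as 1 − RSS/TSS and confirms that, for simple linear regression, it equals the squared correlation between x and y:

```python
# Hypothetical data: x could be odometer reading (thousands of miles),
# y the sale price (thousands of dollars). Values are made up.
xs = [36.0, 48.0, 20.0, 55.0, 41.0, 30.0]
ys = [9.9, 9.1, 10.8, 8.7, 9.5, 10.2]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Least squares slope and intercept.
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Total and residual sums of squares, then R^2 = 1 - RSS/TSS.
tss = sum((y - ybar) ** 2 for y in ys)
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r2 = 1 - rss / tss

# For simple linear regression, R^2 is the square of the correlation.
r = sxy / (sxx * tss) ** 0.5
assert abs(r2 - r ** 2) < 1e-9
```

Note that the final assertion is an algebraic identity (it holds for any data), whereas R2 itself depends on how tightly the points cluster around the fitted line.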

• A high R2 means that x has a strong linear relationship with y: there is a strong association between x and y. It does not imply that x causes y.

• Alternative explanations for a high R2:

• The reverse is true: y causes x.

• There may be a lurking (confounding) variable that is a common cause of both x and y.

• No cause-and-effect relationship can be inferred unless x is randomly assigned to units in a randomized experiment.

• A researcher measures the number of television sets per person X and the average life expectancy Y for the world’s nations. The regression line has a positive slope – nations with many TV sets have higher life expectancies. Could we lengthen the lives of people in Rwanda by shipping them TV sets?

• A community in the Philadelphia area is interested in how crime rates affect property values. If low crime rates increase property values, the community may be able to cover the costs of increased police protection by gains in tax revenues from higher property values. Data on the average housing price and crime rate (per 1000 population) for communities in Pennsylvania near Philadelphia for 1996 are shown in housecrime.JMP.

• Can you deduce a cause-and-effect relationship from these data? What are other explanations for the association between housing prices and crime rate other than that high crime rates cause low housing prices?

• Does the ideal simple linear regression model appear to hold?

• Assumptions of ideal simple linear regression model

• There is a normally distributed subpopulation of responses for each value of the explanatory variable (Normality)

• The means of the subpopulations fall on a straight-line function of the explanatory variable (Linearity)

• The subpopulation standard deviations are all equal (to $\sigma$) (constant variance)

• The selection of an observation from any of the subpopulations is independent of the selection of any other observation (independence)
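
To make the four assumptions concrete, here is a small Python sketch (all parameter values invented for illustration) that generates data exactly satisfying the ideal model: each response is an independent draw from a normal subpopulation whose mean is a straight-line function of x and whose standard deviation is the common $\sigma$:

```python
import random

random.seed(0)

# Hypothetical parameter values for illustration only.
beta0, beta1, sigma = 100.0, -0.3, 5.0   # intercept, slope, common sd

xs = [10.0 * i for i in range(1, 21)]

# One independent draw per x from a Normal(beta0 + beta1 * x, sigma)
# subpopulation: normality, linearity, constant variance, and
# independence all in one line.
ys = [random.gauss(beta0 + beta1 * x, sigma) for x in xs]
```

Data generated this way are exactly what the diagnostics below are checking for: if the real data could plausibly have arisen from such a mechanism, the residual plot should show only random scatter.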

• The conditions required for inference from simple linear regression must be checked:

• Linearity. Diagnostic: Residual plot

• Constant variance. Diagnostic: Residual plot.

• Normality. Diagnostic: Histogram of residuals/Normal probability plot of residuals.

• Independence. Diagnostic: Residual plot.

• Outliers and influential points. Diagnostic: Scatterplot, Cook’s Distances.

• Residual plot: Scatterplot of residuals versus x (or some other variable).

• JMP implementation: After Fit Y by X, fit line, click red triangle next to Linear Fit and click Plot Residuals.

• Use residual plot to look for nonlinearity and nonconstant variance

• If the ideal simple linear regression model holds, the residual plot should look like random scatter – there should be no pattern in the residual plot

• A pattern in the mean of the residuals, e.g., a mean below zero over some range of x and above zero over another range, indicates nonlinearity.

• A pattern in the variance of the residuals, e.g., greater variance over some range of x and less over another range, indicates nonconstant variance.

• Look for marked patterns. No residual plot looks perfectly like random scatter.

• The mean of the residuals appears to be greater than zero for crime rates above 60, and also greater than zero for crime rates below 10, indicating nonlinearity.
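
The "pattern in the mean of the residuals" diagnostic can also be checked numerically. The Python sketch below (synthetic data, not the housecrime.JMP data) fits a least squares line to a deliberately curved trend and compares the mean residual over low, middle, and high ranges of x; means of opposite signs at the ends versus the middle are the numeric signature of nonlinearity:

```python
# Synthetic curved trend, noise-free so the pattern is easy to see.
xs = list(range(1, 31))
ys = [0.05 * x * x - 0.5 * x + 3.0 for x in xs]

# Fit the least squares line.
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar
resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def mean_resid(lo, hi):
    """Mean residual over the x-range [lo, hi)."""
    vals = [r for x, r in zip(xs, resid) if lo <= x < hi]
    return sum(vals) / len(vals)

low, mid, high = mean_resid(1, 11), mean_resid(11, 21), mean_resid(21, 31)

# For a convex (curved-up) trend fit by a straight line, residuals are
# positive at both ends of the x-range and negative in the middle --
# exactly the kind of mean pattern the residual plot reveals visually.
assert low > 0 and high > 0 and mid < 0
```

In practice you would compute `resid` from your own fitted model and bin on whatever x-ranges the residual plot suggests, as was done informally for the crime-rate data above.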