
STATS 330: Lecture 5


Presentation Transcript


  1. STATS 330: Lecture 5 The Multiple Regression Model

  2. Tutorials • These will cover computing details • Held in the basement-floor tutorial lab, Building 303 South (303S B75) • Tutorial 1: Wednesday 11-12 • Tutorial 2: Friday 10-11 • Tutorial 3: Friday 2-3 • Start this Thursday

  3. My office hours • Tuesday 10:30 – 12:00 • Thursday 10:30 – 12:00 • Room 265, Building 303S • Come up to the second floor in the Computer Science lifts, through the doors and turn right

  4. Tutor office hours • Tuesday 1-4 pm • Thursday 12-2, 3-4 pm • Friday 1-3 pm • All in Room 303.133 • (First floor, Building 303)

  5. Next Week • Monday, Thursday lectures by Arden Miller • No lecture Tuesday • Don’t forget the assignment due Aug 4

  6. Today’s lecture: The multiple regression model Aim of the lecture: • To describe the type of data suitable for multiple regression • To review the multiple regression model • To describe some plots that check the suitability of the model

  7. Suitable data Suppose our data frame contains • A numeric “response” variable Y • One or more numeric “explanatory” variables X1, …, Xk (also called covariates or independent variables) and we want to “explain” (model, predict) Y in terms of the explanatory variables, e.g. • explain the volume of cherry trees in terms of height and diameter • explain NOx in terms of C and E

  8. The regression model The model assumes • The responses are normally distributed with means μ (each response has a different mean) and constant variance σ² • The mean response μ of a typical observation depends on the covariates through a linear relationship μ = β0 + β1x1 + … + βkxk • The responses are independent

  9. The regression plane The relationship μ = β0 + β1x1 + … + βkxk can be visualized as a plane. The plane is determined by the coefficients β0, β1, …, βk, e.g. for k = 2 (k = number of covariates)

  10. [Figure: the regression plane for k = 2, showing the intercept β0 and the slopes β1 and β2 in the x1 and x2 directions]

  11. The regression model (cont) • The data are scattered above and below the plane • The size of the vertical “sticks” is random, controlled by σ², and doesn’t depend on x1, x2

  12. Alternative form of model Y = μ + ε = β0 + β1x1 + … + βkxk + ε, where ε is normally distributed with mean zero and variance σ²

  13. Checking the data • How can we tell when this model is suitable? • Suitable = data randomly scattered about a plane (“planar” for short) • k = 1: draw a scatter plot of y vs x • k = 2: use a spinner, look for an “edge” • k ≥ 2: use coplots; if the data are planar the panels should be parallel: coplot(y~x1|x2*x3) — see the sketch below
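
A quick illustration, as a sketch (assumed code, not from the lecture): simulate data that really are planar in three covariates and check that the coplot panels are parallel.

set.seed(330)                              # assumed seed, for reproducibility
x1 <- runif(200); x2 <- runif(200); x3 <- runif(200)
y  <- 2 + 3*x1 - 2*x2 + x3 + rnorm(200, sd = 0.3)   # planar mean plus noise
coplot(y ~ x1 | x2 * x3)                   # panels should show parallel scatters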

  14. [Figure: coplot of y versus x1 given x2 and x3; parallel panel slopes indicate a planar relationship]

  15. Interpretation of coefficients • The regression coefficient β0 gives the average response when all covariates are zero • The regression coefficient β1 • gives the slope of the plane in the x1 direction • measures the average increase in the response for a unit increase in x1 when all other covariates are held constant • is the slope of the plots of y versus x1 in a coplot • is not necessarily the slope of y versus x1 in a scatter plot

  16. Example Consider data from the model y = 5 − x1 + 4x2 + ε, ε ~ N(0, 1), with x1 and x2 highly correlated, r = 0.95. A regression of y on x1 alone has slope 2.767:
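
A minimal sketch (assumed code; the lecture’s exact simulation isn’t shown) that reproduces this effect. With cor(x1, x2) = 0.95 the marginal slope of y on x1 is roughly −1 + 4(0.95) = 2.8, not −1:

set.seed(42)                                  # assumed seed
n  <- 100
x1 <- rnorm(n)
x2 <- 0.95*x1 + sqrt(1 - 0.95^2)*rnorm(n)     # cor(x1, x2) is close to 0.95
y  <- 5 - x1 + 4*x2 + rnorm(n)                # the model on the slide
coef(lm(y ~ x1))                              # slope on x1 alone: about 2.8
coef(lm(y ~ x1 + x2))                         # slopes about -1 and 4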

  17. [Figure: scatter plot of y versus x1; the fitted line has slope 2.767]

  18. [Figure: coplot of y versus x1 given x2; the panel slopes are all about −1]

  19. Points to note • The regression coefficients β1 and β2 refer to the conditional mean of the response, given the covariates (in this case conditional on x1 and x2) • β1 is the slope in the coplot, conditioning on x2 • The coefficient in the plot of y versus x1 refers to a different conditional distribution (conditional on x1 only) • The two are the same if and only if x1 and x2 are uncorrelated

  20. Estimation of the coefficients • We estimate the (unknown) regression plane by the “least squares plane” (best-fitting plane) • Best-fitting plane = the plane that minimizes the sum of squared vertical deviations from the plane • That is, minimize the least squares criterion Σi (yi − β0 − β1xi1 − … − βkxik)²

  21. Estimation of the coefficients (2) • The R command lm calculates the coefficients of the best-fitting plane • This function solves the normal equations, a set of linear equations derived by differentiating the least squares criterion with respect to the coefficients

  22. Math Stuff (only if you like it) Arrange the data into a response vector y = (y1, …, yn)ᵀ and an n × (k + 1) matrix X whose first column is all 1s (for the intercept) and whose remaining columns contain the covariate values

  23. Math Stuff: Normal equations Estimates are solutions of XᵀX β̂ = Xᵀy
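
As an illustration, a short sketch (not in the slides) that solves the normal equations directly and checks the result against lm, reusing y, x1 and x2 from the simulated example above:

X <- cbind(1, x1, x2)            # design matrix: intercept column plus covariates
solve(t(X) %*% X, t(X) %*% y)    # solves (X'X) b = X'y for the estimates
coef(lm(y ~ x1 + x2))            # lm returns the same coefficients

(Internally lm uses a QR decomposition rather than forming XᵀX, which is numerically more stable, but the answer is the same.)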

  24. Calculation of best fitting plane

> lm(y ~ x1 + x2)

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
(Intercept)           x1           x2
      4.988       -1.055        4.060

  25. How well does the plane fit? • Judge this by examining the residuals and fitted values. Each observation has a fitted value: • yi: response for observation i; xi1, xi2: values of the explanatory variables for observation i • The fitted plane is ŷ = β̂0 + β̂1x1 + β̂2x2 • The fitted value for observation i is ŷi = β̂0 + β̂1xi1 + β̂2xi2 (the height of the fitted plane at (xi1, xi2))

  26. How well does the plane fit? (cont) • The residual is the difference between the response and the fitted value: ei = yi − ŷi
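
A small sketch (assumed code) extracting fitted values and residuals with the base-R accessor functions, again using the simulated y, x1 and x2:

fit  <- lm(y ~ x1 + x2)
yhat <- fitted(fit)              # fitted values: heights of the fitted plane
e    <- resid(fit)               # residuals
all.equal(e, y - yhat)           # TRUE: residual = response - fitted value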

  27. How well does the plane fit? (cont) • For a good fit, the residuals must • be small relative to the y’s • have no pattern (not depend on the fitted values, the x’s, etc.) • For the model to be useful, we should have a strong relationship between the response and the explanatory variables

  28. Measuring goodness of fit • How can we measure the relative size of the residuals and the strength of the relationship between y and the x’s? • The “ANOVA identity” is a way of answering this

  29. Measuring goodness of fit (cont) • Consider the “residual sum of squares” RSS = Σi ei² = Σi (yi − ŷi)² • It provides an overall measure of the size of the residuals

  30. Measuring goodness of fit (cont) • Consider the “total sum of squares” TSS = Σi (yi − ȳ)², where ȳ is the mean response • It provides an overall measure of the variability of the data

  31. Math stuff: ANOVA identity • Math result (the ANOVA identity): TSS = RegSS + RSS, where RegSS is the regression sum of squares • Since RegSS is non-negative, RSS can be no larger than TSS, so RSS/TSS is between 0 and 1

  32. Interpretation • The smaller RSS is compared to TSS, the better the fit • If RSS is 0 and TSS is positive, we have a perfect fit • If RegSS = 0, then RSS = TSS and the fitted plane is flat (all coefficients except the constant are zero), so the x’s don’t help predict y

  33. R² • Another way to express this is the coefficient of determination R² • This is defined as 1 − RSS/TSS = RegSS/TSS • R² is always between 0 and 1 • 1 means a perfect fit • 0 means a flat plane • R² is the square of the correlation between the observations and the fitted values

  34. Estimate of σ² • Recall that σ² controls the scatter of the observations about the regression plane: • the bigger σ², the more scatter • the bigger σ², the smaller R² • σ² is estimated by s² = RSS/(n − k − 1), where n is the number of observations and k the number of covariates
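
Putting the last few slides together, a sketch (assumed code) that computes RSS, TSS, R² and s by hand for the fit above and checks them against summary:

RSS <- sum(resid(fit)^2)                 # residual sum of squares
TSS <- sum((y - mean(y))^2)              # total sum of squares
1 - RSS/TSS                              # R-squared; matches summary(fit)$r.squared
sqrt(RSS/(length(y) - 2 - 1))            # s, with k = 2; matches summary(fit)$sigma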

  35. Calculations

cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)
summary(cherry.lm)

• The lm function produces an “lm object” that contains all the information from fitting the regression • lm = “linear model” (spelt with the letter “l”, not “I”!)
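
Once cherry.lm has been fitted, the standard extractor functions apply to it; a brief sketch (the new-tree values in predict are hypothetical):

coef(cherry.lm)                          # estimated coefficients
head(fitted(cherry.lm))                  # fitted values
head(resid(cherry.lm))                   # residuals
summary(cherry.lm)$sigma                 # the estimate s of sigma
predict(cherry.lm, newdata = data.frame(diameter = 12, height = 75))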

  36. Cherries

Call:
lm(formula = volume ~ diameter + height, data = cherry.df)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
diameter      4.7082     0.2643  17.816  < 2e-16 ***
height        0.3393     0.1302   2.607   0.0145 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-Squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16

(Annotations on the slide point out the residual summary, the coefficient table, the estimate of σ, and R².)
