
Applied Linear Regression



Presentation Transcript


  1. Applied Linear Regression CSTAT Workshop March 16, 2007 Vince Melfi

  2. References • “Applied Linear Regression,” Third Edition by Sanford Weisberg. • “Linear Models with R,” by Julian Faraway. • Countless other books on Linear Regression, statistical software, etc.

  3. Statistical Packages • Minitab (we’ll use this today) • SPSS • SAS • R • S-PLUS • JMP • and many more!

  4. Outline • Simple linear regression review • Multiple Regression: Adding predictors • Inference in Regression • Regression Diagnostics • Model Selection

  5. Savings Rate Data Data on Savings Rate and other variables for 50 countries. Want to explore the effect of variables on savings rate. • SaveRate: Aggregate Personal Savings divided by disposable personal income. (Response variable.) • Pop>75: Percent of the population over 75 years old. (One of the predictors.)

  6. Regression Output
The regression equation is
SaveRate = 7.152 + 1.099 pop>75

S = 4.29409   R-Sq = 10.0%   R-Sq(adj) = 8.1%

Analysis of Variance
Source       DF        SS        MS     F      P
Regression    1    98.545   98.5454  5.34  0.025
Error        48   885.083   18.4392
Total        49   983.628

The output gives the fitted model, R² (the coefficient of determination), and the F test of the model.
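For readers following along in R rather than Minitab: the savings data here appear to match R's built-in LifeCycleSavings data set (sr = SaveRate, pop75 = Pop>75); assuming that, a minimal sketch of the same fit is

    fit1 <- lm(sr ~ pop75, data = LifeCycleSavings)  # SaveRate vs. pop>75
    summary(fit1)  # coefficients, S, and R-squared
    anova(fit1)    # the analysis of variance table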

  7. Importance of Plots • Four data sets • All have • Regression line Y = 3 + 0.5x • R² = 66.7% • S = 1.24 • Same t statistics, and so on • Without looking at plots, the four data sets would seem similar.
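These four data sets have the hallmarks of Anscombe's quartet (same line Y = 3 + 0.5x, same R² of about 66.7%), which ships with base R as anscombe; assuming that is the source, the identical summaries are easy to verify:

    data(anscombe)
    for (i in 1:4) {
      fit <- lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
      print(round(c(coef(fit), R2 = summary(fit)$r.squared), 3))
    }
    # every fit: intercept about 3, slope about 0.5, R2 about 0.667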

  8. Importance of Plots (1)

  9. Importance of Plots (2)

  10. Importance of Plots (3)

  11. Importance of Plots (4)

  12. The model • Yi = β0 + β1xi + ei, for i = 1, 2, …, n • “Errors” e1, e2, …, en are assumed to be independent. • Usually e1, e2, …, en are assumed to have the same standard deviation, σ. • Often e1, e2, …, en are assumed to be normally distributed.

  13. Least Squares • The regression line (line of best fit) is based on “least squares.” • The regression line is the line that minimizes the sum of squared vertical deviations of the data from the line. • The least squares line has certain optimality properties. • The least squares line is denoted Ŷ = β̂0 + β̂1x, where β̂0 and β̂1 are the least squares estimates of β0 and β1.

  14. Residuals • The residuals represent the difference between the data and the least squares line: êi = Yi − Ŷi = Yi − (β̂0 + β̂1xi), for i = 1, 2, …, n.

  15. Checking assumptions • Residuals are the main tool for checking model assumptions, including linearity and constant variance. • Plotting the residuals versus the fitted values is always a good idea, to check linearity and constant variance. • Histograms and Q-Q plots (normal probability plots) of residuals can help to check the normality assumption.
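In R, each of these checks is about one line; a sketch using fit1 from the earlier savings-rate fit:

    plot(fitted(fit1), resid(fit1)); abline(h = 0)  # linearity, constant variance
    hist(resid(fit1))                               # rough check of normality
    qqnorm(resid(fit1)); qqline(resid(fit1))        # normal probability plot
    par(mfrow = c(2, 2)); plot(fit1)                # R's analogue of Minitab's four-in-one display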

  16. “Four in one” plot from Minitab

  17. Coefficient of determination (R²) Residual sum of squares, aka sum of squares for error: RSS = Σ(Yi − Ŷi)². Total sum of squares: TSS = Σ(Yi − Ȳ)². Coefficient of determination: R² = 1 − RSS/TSS.

  18. R² • The coefficient of determination, R², measures the proportion of the variability in Y that is explained by the linear relationship with X. • It’s also the square of the Pearson correlation coefficient between X and Y.
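Both characterizations of R² can be verified numerically; a sketch, again assuming the LifeCycleSavings names and the fit1 from earlier:

    rss <- sum(resid(fit1)^2)                                        # residual sum of squares
    tss <- sum((LifeCycleSavings$sr - mean(LifeCycleSavings$sr))^2)  # total sum of squares
    1 - rss / tss                                       # R-squared, about 0.10
    cor(LifeCycleSavings$sr, LifeCycleSavings$pop75)^2  # squared correlation: same value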

  19. Adding a predictor • Recall: Fitted model was SaveRate = 7.152 + 1.099 pop>75 (p-value for test of whether pop>75 is significant was 0.025.) • Another predictor: DPI (per-capita income) • Fitted model: SaveRate = 8.57 + 0.000996 DPI (p-value for DPI: 0.124)

  20. Adding a predictor (2) • Model with both pop>75 and DPI is SaveRate = 7.06 + 1.30 pop>75 - 0.00034 DPI • p-values are 0.100 and 0.738 for pop>75 and DPI • The sign of the coefficient of DPI has changed! • pop>75 was significant alone, but neither it nor DPI is significant when both are in the model!

  21. Adding a predictor (3) • What happened?? • The predictors pop>75 and DPI are highly correlated

  22. Added variable plots and partial correlation • (1) Residuals from a fit of SaveRate versus pop>75 give the variability in SaveRate that’s not explained by pop>75. • (2) Residuals from a fit of DPI versus pop>75 give the variability in DPI that’s not explained by pop>75. • A fit of the residuals from (1) versus the residuals from (2) gives the relationship between SaveRate and DPI after adjusting for pop>75. This is called an “added variable plot.” • The correlation between the residuals from (1) and the residuals from (2) is the “partial correlation” between SaveRate and DPI adjusted for pop>75.
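A sketch of steps (1) and (2) and the resulting plot in R, assuming the LifeCycleSavings names as before:

    r1 <- resid(lm(sr ~ pop75, data = LifeCycleSavings))   # (1) SaveRate, adjusted for pop>75
    r2 <- resid(lm(dpi ~ pop75, data = LifeCycleSavings))  # (2) DPI, adjusted for pop>75
    plot(r2, r1)          # the added variable plot for DPI
    coef(lm(r1 ~ r2))[2]  # matches DPI's slope in the two-predictor model
    cor(r1, r2)           # partial correlation of SaveRate and DPI given pop>75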

  23. Added variable plot Note that the slope term, -0.000341, is the same as the slope term for DPI in the two-predictor model.

  24. Scatterplot matrices (Matrix Plots) • With one predictor X, a scatterplot of Y vs. X is very informative. • With more than one predictor, scatterplots of Y vs. each of the predictors, and of each of the predictors vs. each other, are needed. • A scatterplot matrix (or matrix plot) is just an organized display of these plots.

  25. Changes in R² • Consider adding a predictor X2 to a model that already contains the predictor X1 • Let R²₁ be the R² value for the fit of Y vs. X1, and let R²₂ be the R² value for the fit of Y vs. X2

  26. Changes in R² (2) • The R² value for the multiple regression fit of Y versus X1 and X2 is always at least as large as each of R²₁ and R²₂ • It may be • less than R²₁ + R²₂ (if the two predictors are explaining the same variation) • equal to R²₁ + R²₂ (if the two predictors measure different things) • more than R²₁ + R²₂ (e.g., the response is the area of a rectangle, and the two predictors are its length and width; a small demonstration follows)
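The "more than" case can be surprising. Here is a small constructed illustration (a suppression example, not the rectangle data itself) in which two nearly canceling predictors each explain almost nothing alone but nearly everything together:

    set.seed(1)
    x1 <- rnorm(200)
    x2 <- -x1 + 0.3 * rnorm(200)        # nearly cancels x1
    y  <- x1 + x2 + 0.1 * rnorm(200)
    summary(lm(y ~ x1))$r.squared       # near 0
    summary(lm(y ~ x2))$r.squared       # small
    summary(lm(y ~ x1 + x2))$r.squared  # near 0.9: far more than the sum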

  27. Multiple regression model • Response variable Y • Predictors X1, X2, …, Xp • Model: Yi = β0 + β1xi1 + β2xi2 + … + βpxip + ei, for i = 1, 2, …, n • Same assumptions on errors ei (independent, constant variance, normality)

  28. Inference in regression • Most inference procedures assume independence, constant variance, and normality of the errors. • Most are “robust” to departures from normality, meaning that the p-values, confidence levels, etc. are approximately correct even if normality does not hold. • In general, techniques like the bootstrap can be used when normality is suspect.
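As one concrete version of the bootstrap idea, a minimal case-resampling sketch for the slope in the savings-rate fit (assumes the LifeCycleSavings data as earlier):

    set.seed(1)
    slopes <- replicate(2000, {
      i <- sample(nrow(LifeCycleSavings), replace = TRUE)    # resample cases
      coef(lm(sr ~ pop75, data = LifeCycleSavings[i, ]))[2]  # refit, keep the slope
    })
    quantile(slopes, c(0.025, 0.975))  # percentile bootstrap interval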

  29. New data set • Response variable: • Fuel = per-capita fuel consumption (times 1000) • Predictors: • Dlic = proportion of the population who are licensed drivers (times 1000) • Tax = gasoline tax rate • Income = per person income in thousands of dollars • logMiles = base 2 log of federal-aid highway miles in the state

  30. t tests
Regression Analysis: Fuel versus Tax, Dlic, Income, logMiles
The regression equation is
Fuel = 154 - 4.23 Tax + 0.472 Dlic - 6.14 Income + 18.5 logMiles

Predictor     Coef   SE Coef      T      P
Constant     154.2     194.9   0.79  0.433
Tax         -4.228     2.030  -2.08  0.043
Dlic        0.4719    0.1285   3.67  0.001
Income      -6.135     2.194  -2.80  0.008
logMiles    18.545     6.472   2.87  0.006

The T column holds the t statistics and the P column the p-values.

  31. t tests (2) • Each t statistic tests the hypothesis that a particular slope parameter is zero. • The formula is t = (coefficient estimate)/(standard error) • Degrees of freedom are n - (p+1) • p-values given are for the two-sided alternative • This is just like simple linear regression
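For example, the t statistic and two-sided p-value for Income can be reproduced by hand from the output above:

    t_stat <- -6.135 / 2.194       # coefficient estimate / standard error, about -2.80
    2 * pt(-abs(t_stat), df = 46)  # two-sided p-value with n - (p+1) = 46 df; about 0.008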

  32. F tests • General structure: • Ha: Large model • H0: Smaller model, obtained by setting some parameters in the large model to zero, or equal to each other, or equal to a constant • RSS_AH = residual sum of squares after fitting the large (alternative hypothesis) model • RSS_NH = residual sum of squares after fitting the smaller (null hypothesis) model • df_NH and df_AH are the corresponding degrees of freedom

  33. F tests (2) • Test statistic: F = [(RSS_NH - RSS_AH)/(df_NH - df_AH)] / (RSS_AH/df_AH) • Null distribution: F distribution with df_NH - df_AH numerator and df_AH denominator degrees of freedom

  34. F test example • Can the “economic” variables tax and income be dropped from the model with all four predictors? • AH model includes all predictors • NH model includes only Dlic and logMiles • Fit both models and get RSS and df values

  35. F test example (2) • RSS_AH = 193700; df_AH = 46 • RSS_NH = 243006; df_NH = 48 • F = [(243006 - 193700)/2] / (193700/46) ≈ 5.85 • P-value is the area to the right of 5.85 under an F(2, 46) distribution, approx. 0.0054 • There’s pretty strong evidence that removing both Tax and Income is unwise
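The arithmetic, reproduced in R:

    f_stat <- ((243006 - 193700) / (48 - 46)) / (193700 / 46)
    f_stat                 # about 5.85
    1 - pf(f_stat, 2, 46)  # about 0.0054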

  36. Another F test example • Question: Does it make sense that the two “economic” predictors should have the same coefficient? • Ha: Y = β0 + β1 Tax + β2 Dlic + β3 Income + β4 logMiles + error • H0: Y = β0 + β1 Tax + β2 Dlic + β1 Income + β4 logMiles + error • Note that H0 can be rewritten as Y = β0 + β1 (Tax + Income) + β2 Dlic + β4 logMiles + error

  37. Another F test example (2) • Fit full model (AH) • Create new predictor “TI” by adding Tax and Income, and fit a model with TI, Dlic, and logMiles (NH) • P-value is the area to the right of the F statistic under an F(1, 46) distribution, approx. 0.518 • This suggests that the simpler model with the same coefficient for Tax and Income fits well.
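A sketch of the two fits and their comparison, assuming a data frame fuel with the column names used above:

    full  <- lm(Fuel ~ Tax + Dlic + Income + logMiles, data = fuel)
    small <- lm(Fuel ~ I(Tax + Income) + Dlic + logMiles, data = fuel)  # the TI predictor
    anova(small, full)  # F test of the common-coefficient restriction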

  38. Removing one predictor • We have two ways to test whether one predictor can be removed from the model: • t test • F test • The tests are equivalent, in the sense that t² = F, and the p-values are identical.

  39. Confidence regions • Confidence intervals for one parameter use the familiar t interval. • For example, a 95% confidence interval for the parameter of Income in the context of the full (four-predictor) model is -6.135 ± (2.013)(2.194) = -6.135 ± 4.417, where 2.013 is the 0.975 quantile of the t distribution with 46 df, and the estimate -6.135 and standard error 2.194 come from the Minitab output.
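The same interval, computed two ways in R (full is the four-predictor fit sketched earlier):

    -6.135 + c(-1, 1) * qt(0.975, df = 46) * 2.194  # by hand: -6.135 +/- 4.417
    confint(full)["Income", ]                       # directly from the fitted model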

  40. Joint confidence regions • Joint confidence regions for two or more parameters are more complex, and use the F distribution in place of the t distribution. • Minitab (and SPSS, and …) can’t draw these easily • On the next slide is a joint confidence region for the parameters of Dlic and Tax, drawn in R.

  41. [Figure: joint confidence region for the parameters of Dlic and Tax, drawn in R. Dotted lines mark the individual confidence intervals for the two parameters, and the point (0, 0) is shown for reference.]
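One way to draw such a region is with the car package's confidenceEllipse function; a sketch, assuming the full fuel fit from earlier (coefficient order: intercept, Tax, Dlic, Income, logMiles):

    library(car)
    confidenceEllipse(full, which.coef = c(3, 2))  # Dlic and Tax
    abline(v = confint(full)["Dlic", ], lty = 2)   # individual CI for Dlic
    abline(h = confint(full)["Tax", ], lty = 2)    # individual CI for Tax
    points(0, 0, pch = 4)                          # mark (0, 0)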

  42. Prediction • Given a new set of predictor values x1, x2, …, xp, what’s the predicted response? • It’s easy to answer this: just plug the new predictor values into the fitted regression model: Ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂pxp • But how do we assess the uncertainty in the prediction? How do we form a confidence interval?

  43. Predicted Values for New Observations
New Obs     Fit   SE Fit            95% CI            95% PI
      1  613.39    12.44  (588.34, 638.44)  (480.39, 746.39)

Values of Predictors for New Observations
New Obs  Dlic  Income  logMiles   Tax
      1   900    28.0      15.0  17.0

The 95% CI is a confidence interval for the average fuel consumption for states with Dlic = 900, Income = 28, logMiles = 15, and Tax = 17; the 95% PI is a prediction interval for the fuel consumption of a single such state.
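In R, predict() gives both intervals directly from the fitted model (again assuming the full fit and column names above):

    new <- data.frame(Dlic = 900, Income = 28, logMiles = 15, Tax = 17)
    predict(full, newdata = new, interval = "confidence")  # CI for the mean response
    predict(full, newdata = new, interval = "prediction")  # PI for a single new state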

  44. Diagnostics • Want to look for points that have a large influence on the fitted model • Want to look for evidence that one or more model assumptions are untrue. • Tools: • Residuals • Leverage • Influence and Cook’s Distance
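The corresponding R tools, sketched for the full fuel fit from earlier:

    h <- hatvalues(full)       # leverage of each case
    d <- cooks.distance(full)  # Cook's distance: overall influence on the fit
    r <- rstandard(full)       # standardized residuals
    plot(h, r)                 # look for high-leverage, large-residual cases
    which(d > 4 / length(d))   # one common rough cutoff for Cook's distance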
