
Regression Model Building



Presentation Transcript


  1. Regression Model Building LPGA Golf Performance - 2008

  2. Data Description
     • Response: log(Prize Winnings/Round); the raw winnings data are skewed
     • Potential Predictors:
        • Average Drive Distance
        • Percentage of Drives Reaching Fairway
        • Percentage of Greens Reached in Regulation
        • Average Putts per Hole
        • Average Number of Sand Traps Hit per Round (Sandshot)
        • Percentage of Sand Saves
     • Samples:
        • Training Sample: 100 Randomly Sampled Golfers
        • Validation Sample: 57 Remaining Golfers, used to assess fit
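A minimal sketch of the training/validation split described above, written in Python with only the standard library. The total of 157 golfers is inferred from 100 + 57; the function name `split_sample`, the seed, and the use of integer golfer indices are assumptions for illustration, not the procedure actually used for the slides.

```python
import random


def split_sample(n_total=157, n_train=100, seed=2008):
    """Randomly split golfer indices 1..n_total into training and validation sets.

    n_total=157 is inferred from the slide (100 + 57); seed is arbitrary.
    """
    rng = random.Random(seed)
    ids = list(range(1, n_total + 1))
    train = sorted(rng.sample(ids, n_train))   # sampling without replacement
    train_set = set(train)
    valid = [i for i in ids if i not in train_set]
    return train, valid


train_ids, valid_ids = split_sample()
print(len(train_ids), len(valid_ids))
```

The validation indices are simply the complement of the training draw, so every golfer lands in exactly one of the two samples.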

  3. Modeling Strategies
     • Select Training Sample
     • Select the “best” subset of predictors using Backward Elimination, Forward Selection, Stepwise Regression, and/or All Possible Regressions, each minimizing a selection criterion (the slides that follow use AIC)
     • Identify any Influential Observations (based on Outliers, Leverage Values, DFFITS, DFBETAS, Cook’s D)
     • Test Model Assumptions: Normality (Shapiro-Wilk), Constant Variance (Brown-Forsythe and Breusch-Pagan)
     • Determine the validity of the model by obtaining prediction errors for the validation sample

  4. Top of Entire Sample (First 20 Golfers)

  5. Backward Elimination (RSS = SSE)

      Step 1:
      Start:  AIC=-200.22
      logprz ~ drive + fairway + green + putts + sandshot + sandsave

                 Df Sum of Sq    RSS      AIC
      - fairway   1     0.010 11.750 -202.132
      <none>                  11.740 -200.216
      - drive     1     0.397 12.138 -198.887
      - sandsave  1     0.405 12.145 -198.827
      - sandshot  1     1.030 12.770 -193.806
      - green     1    24.960 36.700  -88.238
      - putts     1    35.360 47.100  -63.289

      Step 2:  AIC=-202.13
      logprz ~ drive + green + putts + sandshot + sandsave

                 Df Sum of Sq    RSS      AIC
      <none>                  11.750 -202.132
      - sandsave  1     0.400 12.150 -200.784
      - drive     1     0.537 12.287 -199.665
      - sandshot  1     1.034 12.784 -195.698
      - green     1    32.091 43.841  -72.461
      - putts     1    35.688 47.438  -64.575

     • At Step 1, fairway is eliminated because dropping it gives the smallest AIC (-202.132 < -200.216)
     • At Step 2, no further variables are removed (no candidate model has AIC < -202.132)
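The AIC values in the backward-elimination output can be re-derived from the RSS column. The sketch below assumes the form n·ln(RSS/n) + 2p that R's stepwise output reports (AIC up to an additive constant), with n = 100 training golfers and p counting the intercept plus the predictors. It is written in Python as a plain calculator; the function name `aic_from_rss` is illustrative.

```python
import math


def aic_from_rss(rss, n, n_params):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * (number of parameters)."""
    return n * math.log(rss / n) + 2 * n_params


n = 100                                      # training sample size from the slides
full = aic_from_rss(11.740, n, 7)            # intercept + 6 predictors
no_fairway = aic_from_rss(11.750, n, 6)      # intercept + 5 predictors (fairway dropped)
print(round(full, 2), round(no_fairway, 2))
```

Dropping fairway raises RSS by only 0.010 but saves one parameter (a penalty of 2), which is why the AIC falls from about -200.22 to about -202.13.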

  6. Forward Selection (RSS = SSE)

      Step 1:
      Start:  AIC=-6.61
      logprz ~ 1

                 Df Sum of Sq    RSS      AIC
      + green     1    38.599 53.150  -59.206
      + putts     1    33.043 58.706  -49.263
      + drive     1    11.622 80.126  -18.156
      + sandshot  1     8.951 82.798  -14.876
      + sandsave  1     3.118 88.631   -8.069
      <none>                  91.749   -6.611
      + fairway   1     0.409 91.340   -5.058

      Step 2:  AIC=-59.21
      logprz ~ green

                 Df Sum of Sq    RSS      AIC
      + putts     1    39.514 13.636 -193.246
      + sandsave  1     4.859 48.291  -66.793
      <none>                  53.150  -59.206
      + fairway   1     0.635 52.514  -58.408
      + drive     1     0.361 52.788  -57.888
      + sandshot  1     0.004 53.146  -57.214

      Step 3:  AIC=-193.25
      logprz ~ green + putts

                 Df Sum of Sq    RSS      AIC
      + sandshot  1   0.73688 12.899  -196.80
      + sandsave  1   0.66486 12.971  -196.25
      + drive     1   0.31495 13.321  -193.58
      <none>                  13.636  -193.25
      + fairway   1   0.09401 13.542  -191.94

      Step 4:  AIC=-196.8
      logprz ~ green + putts + sandshot

                 Df Sum of Sq    RSS      AIC
      + drive     1   0.74905 12.150  -200.78
      + sandsave  1   0.61234 12.287  -199.66
      <none>                  12.899  -196.80
      + fairway   1   0.25056 12.649  -196.76

      Step 5:  AIC=-200.78
      logprz ~ green + putts + sandshot + drive

                 Df Sum of Sq    RSS      AIC
      + sandsave  1   0.40005 11.750  -202.13
      <none>                  12.150  -200.78
      + fairway   1   0.00524 12.145  -198.83

      Step 6:  AIC=-202.13
      logprz ~ green + putts + sandshot + drive + sandsave

                 Df Sum of Sq    RSS      AIC
      <none>                   11.75  -202.13
      + fairway   1 0.0099086  11.74  -200.22

  7. Model: green, putts, sandshot, sandsave, drive

      Call:
      lm(formula = logprz ~ green + putts + sandshot + sandsave + drive,
          data = lpga.cv.in)

      Residuals:
           Min       1Q   Median       3Q      Max
      -0.72852 -0.20634  0.01067  0.22439  0.72316

      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
      (Intercept) 14.272879   1.580975   9.028 2.14e-14 ***
      green        0.210379   0.013130  16.023  < 2e-16 ***
      putts       -0.625367   0.037011 -16.897  < 2e-16 ***
      sandshot     0.790771   0.274937   2.876  0.00498 **
      sandsave     0.008334   0.004658   1.789  0.07684 .
      drive       -0.009563   0.004615  -2.072  0.04098 *
      ---
      Residual standard error: 0.3536 on 94 degrees of freedom
      Multiple R-squared: 0.8719, Adjusted R-squared: 0.8651
      F-statistic: 128 on 5 and 94 DF, p-value: < 2.2e-16
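As a consistency check, the summary statistics of this fit can be recovered from two numbers in the stepwise output: the intercept-only RSS (91.749, the total sum of squares) and the RSS of the five-predictor model (11.750). The Python sketch below uses the standard definitions of residual standard error, R-squared, adjusted R-squared, and the overall F statistic. The variable names are illustrative, and the results agree with the printed summary only up to rounding of the inputs.

```python
import math

n, k = 100, 5                 # training golfers, predictors in the final model
sst = 91.749                  # RSS of the intercept-only model (total sum of squares)
sse = 11.750                  # RSS of the five-predictor model

df_resid = n - k - 1                                  # 94 residual degrees of freedom
mse = sse / df_resid
resid_se = math.sqrt(mse)                             # residual standard error
r_squared = 1 - sse / sst
adj_r_squared = 1 - mse / (sst / (n - 1))
f_stat = ((sst - sse) / k) / mse                      # overall F on (k, df_resid) df

print(round(resid_se, 4), round(r_squared, 4), round(adj_r_squared, 4), round(f_stat, 1))
```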

  8. Influence Measures (n=100, p’=6)

  9. Summary of Influence Measures - I
     • Studentized Residuals (Exceed 3.607 in absolute value)
        • Extreme values (in absolute value): -2.172 and +2.112
     • Leverage Values (Exceed 0.12)
        • Golfers 111 (h=0.1543), 127 (0.1263), 113 (0.1213) (No big problem)
     • DFFITS (Exceed 0.49 in absolute value)
        • Three Golfers between -0.61 and -0.49 (Golfers 142, 91, and 117)
        • One Golfer between 0.49 and 0.59 (Golfer 59)
     • Cook’s D (Exceed 1, sometimes suggested to exceed 0.5)
        • Max value is 0.0626. None come close to 1 (or the sometimes suggested 0.5)

  10. Summary of Influence Measures - II
     • DFBETAS (Exceed 0.20 in absolute value)
        • Intercept: Golfers 117 (-0.54), 28 (0.24), 45 (0.29), 59 (0.34), 142 (0.45)
        • Greens: Golfers 132 (-0.25), 91 (0.24), 110 (0.25), 142 (0.33)
        • Putts: Golfers 142 (-0.41), 25 (0.24), 117 (0.43)
        • Sandshots: Golfers 132 (-0.25), 111 (0.23), 39 (0.23), 110 (0.24)
        • Sandsaves: Golfers 59 (-0.43), 22 (-0.31), 91 (-0.30), 102 (-0.25), 115 (0.23), 47 (0.43)
        • Drive: Golfers 142 (-0.49), 59 (-0.24), 56 (0.28), 117 (0.29), 48 (0.30)
     • Note that while some of these exceed the “threshold”, none seem excessive. However, golfers 142 and 117 appear repeatedly and should be checked.
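The leverage, DFFITS, and DFBETAS cutoffs quoted in the influence summaries match the common rules of thumb 2p'/n, 2·sqrt(p'/n), and 2/sqrt(n) with n = 100 and p' = 6. The Python sketch below recomputes them and tallies how often each golfer appears among the flagged DFBETAS values listed on the slide. The dictionary is transcribed from the slide text, the variable names are illustrative, and the studentized-residual cutoff (3.607) is left out because it needs a t-distribution quantile that the standard library does not provide.

```python
import math
from collections import Counter

n, p = 100, 6                              # sample size and number of parameters (p')

leverage_cut = 2 * p / n                   # rule of thumb: 2p'/n
dffits_cut = 2 * math.sqrt(p / n)          # rule of thumb: 2*sqrt(p'/n)
dfbetas_cut = 2 / math.sqrt(n)             # rule of thumb: 2/sqrt(n)
print(round(leverage_cut, 2), round(dffits_cut, 2), round(dfbetas_cut, 2))

# Golfers flagged for each coefficient, transcribed from the DFBETAS slide.
flagged = {
    "intercept": [117, 28, 45, 59, 142],
    "green": [132, 91, 110, 142],
    "putts": [142, 25, 117],
    "sandshot": [132, 111, 39, 110],
    "sandsave": [59, 22, 91, 102, 115, 47],
    "drive": [142, 59, 56, 117, 48],
}
counts = Counter(g for golfers in flagged.values() for g in golfers)
print(counts.most_common(3))
```

A tally like this makes it easy to see which golfers recur across coefficients and are therefore worth a closer look.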

  11. Residuals appear to be (reasonably) approximately normal. The Shapiro-Wilk test does not reject the hypothesis of normal errors.

      > shapiro.test(residuals(lpga.mod1))

              Shapiro-Wilk normality test

      data:  residuals(lpga.mod1)
      W = 0.9833, p-value = 0.2390

  12. No Evidence of non-constant error variance (Data had been transformed prior to fitting model)

  13. Equal (Homogeneous) Variance - I
     • No evidence to reject the null hypothesis of equal variance among errors

  14. Equal (Homogeneous) Variance - II
     • There is no evidence of unequal variance, based on either the Brown-Forsythe or the Breusch-Pagan test

              Breusch-Pagan test

      data:  logprz ~ green + putts + sandshot + sandsave + drive
      BP = 1.9306, df = 5, p-value = 0.8587
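The reported Breusch-Pagan p-value can be checked by hand, since the statistic is referred to a chi-square distribution with 5 degrees of freedom (one per predictor). The Python sketch below uses the closed-form upper-tail probability that exists for odd degrees of freedom, so it needs only the standard library. The function name `chi2_sf_df5` is illustrative and the formula is specific to df = 5.

```python
import math


def chi2_sf_df5(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 5 df.

    Closed form from the incomplete-gamma recurrence for half-integer shape:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * (1 + x/3)
    """
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3))


bp_stat = 1.9306                       # Breusch-Pagan statistic from the slide
p_value = chi2_sf_df5(bp_stat)
print(round(p_value, 4))
```

A p-value this large is far above any conventional significance level, which is consistent with the slide's conclusion of no evidence against constant error variance.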
