
Class 26



Presentation Transcript


  1. Pfeifer note: Section 6 Class 26 Model Building Philosophy

  2. Assignment 26 • 1. Two-sample t-test ≡ regression with a dummy variable: t = ±6.2/2.4483 (from Data Analysis's complicated formula, or from a regression with a dummy) • 2. Single-factor ANOVA ≡ regression with p−1 dummies (see next slide) • 3. Better predictor? The one with the lower regression standard error (or higher adjusted R²) • Not the one with the higher coefficient. • 4. Will they charge less than $4,500? • Use the regression's standard error and T.DIST to calculate the probability.
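Point 1 above can be demonstrated numerically. The sketch below uses made-up group data (not the assignment's), and shows that the t-statistic from a pooled two-sample t-test is identical to the t-statistic on the dummy's coefficient in a regression of y on a 0/1 group dummy.

```python
# Demo (hypothetical data): two-sample t-test == regression with a dummy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(50, 10, 30)   # group 0
b = rng.normal(56, 10, 30)   # group 1

# Pooled (equal-variance) two-sample t-test
t_test, _ = stats.ttest_ind(b, a, equal_var=True)

# Same comparison as a regression: y = b0 + b1 * dummy
y = np.concatenate([a, b])
d = np.concatenate([np.zeros(len(a)), np.ones(len(b))])
X = np.column_stack([np.ones_like(d), d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
n, k = X.shape
resid = y - X @ beta
s2 = resid @ resid / (n - k)            # regression error variance
cov = s2 * np.linalg.inv(X.T @ X)       # coefficient covariance matrix
t_reg = beta[1] / np.sqrt(cov[1, 1])    # t-stat on the dummy coefficient

print(t_test, t_reg)                    # the two t-statistics agree
```

The dummy's coefficient is exactly the difference in group means, so its t-statistic reproduces the two-sample test.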

  3. Ready for ANOVA Ready for Regression

  4. Agenda • IQ demonstration • What you can do with lots of data • What you should do with not much data • Practice using the Oakland A's case

  5. Remember the Coal Pile! • Model Building involves more than just selecting which of the available X’s to include in the model. • See section 9 of the Pfeifer note to learn about transforming X’s. • We won’t do much in this regard…

  6. With lots of data (big data?) • 1. Split the data into two sets. • 2. Use the training set to build several models. Stats like "std error" and adj R-square only measure FIT. • 3. Use the hold-out sample to test/compare the models. Performance on a hold-out sample measures how well each model will FORECAST. Use the best-performing model.

  7. With lots of data (big data?) • Computer Algorithms do a very good job of finding a model • They guard against “over-fitting” • Once you own the software, they are fast and cheap • They won’t claim, however, to do better than a professional model builder • Remember the coal pile!

  8. Without Much Data • You will not be able to use a training set / hold-out sample • You get "one shot" to find a GOOD model • Regression and all its statistics can tell you which model FIT the data the best. • Regression and all its statistics CANNOT tell you which model will perform (forecast) the best. • Not to mention… regression has no clue about what causes what…

  9. Remember… • The model that does a spectacular job of fitting the past will do worse at predicting the future than a simpler model that more accurately captures the way the world works. • Better fit leads to poorer forecasts! • Instead of forecasting 100 for the next IQ, the over-fit model will sometimes predict 110 and other times predict 90!

  10. Requiring low p-values for all coefficients does not protect against over-fitting. • If there are 100 X's that are of NO help in predicting Y, • we expect 5 of them to be statistically significant (at the 0.05 level). • And we'll want to use all 5 to predict the future. • And the model will be over-fit • We won't know it, perhaps • Our predictions will be WORSE as a result.
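The "5 out of 100" claim is easy to simulate. Below, Y is pure noise, unrelated to all 100 X's, yet roughly 5% of the slopes come out "significant" at the 0.05 level by chance alone (illustrative simulation, not the slides' data).

```python
# Simulate 100 useless predictors; count spuriously significant slopes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_x = 200, 100
y = rng.normal(size=n)             # Y is unrelated to every X
X = rng.normal(size=(n, n_x))      # 100 pure-noise predictors

sig = 0
for j in range(n_x):
    _, p = stats.pearsonr(X[:, j], y)   # p-value for "slope = 0"
    if p < 0.05:
        sig += 1
print(sig, "of", n_x, "useless X's look significant at p < 0.05")
```

Keeping those spuriously significant X's in the model is exactly the over-fitting the slide warns about: they fit this sample's noise and degrade forecasts.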

  11. Modeling Balancing Act • Usable (do we know the X's?) • Simple • Makes sense • Use your judgment, given you can't rely solely on the stats/data • Signs of coefficients should make sense • Significant (low-p) coefficients • Except for sets of dummies • Low standard error • Consistent with high adjusted R-square • Meets all four assumptions • Linearity (most important) • Homoskedasticity (equal variance) • Independence • Normality (least important)

  12. Oakland A's (A)

  13. Case Facts • Despite making only $40K, pitcher Mark Nobel had a great year for Oakland in 1980. • Second in the league in ERA (2.53), complete games (24), innings (284-1/3), and strikeouts (180) • Gold Glove winner (best-fielding pitcher) • Second in Cy Young Award voting.

  14. Nobel Wants a Raise • "I'm not saying anything against Rick Langford or Matt Keough (fellow A's pitchers)… but I filled the stadium last year against Tommy John (star pitcher for the Yankees)." • Nobel's agent argued: • Avg. home attendance for Nobel's 16 starts was 12,663.6 • Avg. home attendance for the remaining home games was only 10,859.4 • Nobel should get "paid" for the difference: • 1,804.2 extra tickets per start.
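The agent's figure is just the difference of the two averages from the slide:

```python
# Check of the agent's arithmetic (figures from the slide above).
nobel_avg = 12663.6    # avg home attendance, Nobel's 16 starts
others_avg = 10859.4   # avg attendance, remaining home games
extra = round(nobel_avg - others_avg, 1)
print(extra)           # extra tickets per start claimed by the agent
```

Whether that raw difference is attributable to Nobel, rather than to the opponents and dates he happened to start against, is exactly what the regression models that follow are meant to sort out.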

  15. Data from 1980 Home Games

  16. TASK • Be ready to report on the model assigned to your table (1 to 7) • What is the model? (succinct) • Critique it (succinctly) • Ignore "Durbin-Watson" • "Standard deviation of residuals" aka the regression's standard error • The output gives just the t-stat. A t of ±2 corresponds to a p-value of about 0.05.
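The last bullet's rule of thumb (t of ±2 ↔ p ≈ 0.05) can be checked directly. The degrees of freedom below (24) are an illustrative assumption, not taken from the case output; with more degrees of freedom the p-value at t = 2 moves closer to 0.046.

```python
# Convert a coefficient's t-statistic to a two-sided p-value.
from scipy import stats

def two_sided_p(t_stat, df):
    """Two-sided p-value for a regression coefficient's t-statistic."""
    return 2 * stats.t.sf(abs(t_stat), df)

p = two_sided_p(2.0, df=24)   # df = 24 is a hypothetical example
print(round(p, 3))            # close to the 0.05 rule of thumb
```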

  17. What does it mean that the coefficient of NOBEL is negative in most of the models? Why was the coefficient of NOBEL positive in model 1?
