1 / 27

STATS 330: Lecture 16

STATS 330: Lecture 16. Case Study. Case study. Aim of today’s lecture To illustrate the modelling process using the evaporation data. The Evaporation data. Data in data frame evap.df Aims of the analysis: Understand relationships between explanatory variables and the response

arleen
Download Presentation

STATS 330: Lecture 16

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. STATS 330: Lecture 16 Case Study 330 lecture 16

  2. Case study Aim of today’s lecture To illustrate the modelling process using the evaporation data. 330 lecture 16

  3. The Evaporation data • Data in data frame evap.df • Aims of the analysis: • Understand relationships between explanatory variables and the response • Be able to predict evaporation loss given the other variables 330 lecture 16

  4. Case Study: Evaporation data Recall from Lecture 15: variables are evap: the amount of moisture evaporating from the soil in the 24 hour period (response) maxst: maximum soil temperature over the 24 hour period minst: minimum soil temperature over the 24 hour period avst: average soil temperature over the 24 hour period maxat: maximum air temperature over the 24 hour period minat: minimum air temperature over the 24 hour period avat: average air temperature over the 24 hour period maxh: maximum humidity over the 24 hour period minh: minimum humidity over the 24 hour period avh: average humidity over the 24 hour period wind: average wind speed over the 24 hour period. 330 lecture 16

  5. Choose Model Plots, theory Fit model Examine residuals Transform Good fit Bad fit Use model Modelling cycle 330 lecture 16

  6. Modelling cycle (2) Our plan of attack: • Graphical check • Suitability for regression • Gross outliers • Preliminary fit • Model selection (for prediction) • Transforming if required • Outlier check • Use model for prediction 330 lecture 16

  7. Step 1: Plots • Preliminary plots • Want to get an initial idea of suitability of data for regression modelling • Check for linear relationships, outliers • Pairs plots, coplots • Data looks OK to proceed, but evap/maxh plot looks curved 330 lecture 16

  8. 330 lecture 16

  9. Points to note • Avh has very few values • Strong relationships between response and some variables (particularly maxh, avst) • Not much relationship between response and minst, minat, wind • strong relationships between min, av and max • No obvious outliers 330 lecture 16

  10. Step 2: preliminary fit Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -54.074877 130.720826 -0.414 0.68164 avst 2.231782 1.003882 2.223 0.03276 * minst 0.204854 1.104523 0.185 0.85393 maxst -0.742580 0.349609 -2.124 0.04081 * avat 0.501055 0.568964 0.881 0.38452 minat 0.304126 0.788877 0.386 0.70219 maxat 0.092187 0.218054 0.423 0.67505 avh 1.109858 1.133126 0.979 0.33407 minh 0.751405 0.487749 1.541 0.13242 maxh -0.556292 0.161602 -3.442 0.00151 ** wind 0.008918 0.009167 0.973 0.33733 Residual standard error: 6.508 on 35 degrees of freedom Multiple R-squared: 0.8463, Adjusted R-squared: 0.8023 F-statistic: 19.27 on 10 and 35 DF, p-value: 2.073e-11 330 lecture 16

  11. 330 lecture 16

  12. Findings • Plots OK, normality dubious • Gam plots indicated no transformations • Point 31 has quite high Cooks distance but removing it doesn’t change regression much • Model is OK. • Could interpret coefficients, but variables highly correlated. 330 lecture 16

  13. Step 3: Model selection • Use APR • Model selected was evap ~ maxat + maxh + wind However, this model does not fit all that well (outliers, non-normality) Try “best AIC” model evap ~ avst + maxst + maxat + minh+maxh • Now proceed to step 4 330 lecture 16

  14. Step 4: Diagnostic checks For a quick check, plot the regression object produced by lm model1.lm<-lm(evap ~ avst + maxst + maxat + minh+maxh, data=evap.df) plot(model1.lm) 330 lecture 16

  15. Outliers ? Non-normal? 330 lecture 16

  16. Conclusions? • No real evidence of non-linearity, but will check further with gams • Normal plot looks curved • Some largish outliers • Points 2, 41 have largish Cooks D 330 lecture 16

  17. Checking linearity • Check for linearity with gams > library(mgcv) >plot(gam(evap ~ s(avst) + s(maxst) + s(maxat) + s(maxh) + s(wind), data=evap.df)) 330 lecture 16

  18. Transform avst, maxh ? 330 lecture 16

  19. Remedy • Gam plots for avst and maxh are curved • Try cubics in these variables • Plots look better • Cubic terms are significant 330 lecture 16

  20. 330 lecture 16

  21. > model2.lm<-lm(evap ~ poly(avst,3) + maxst + maxat + minh+poly(maxh,3), data=evap.df) > summary(model2.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 74.6521 25.4308 2.935 0.00577 ** poly(avst, 3)1 83.0106 27.3221 3.038 0.00441 ** poly(avst, 3)2 21.4666 8.3097 2.583 0.01399 * poly(avst, 3)3 14.1680 7.2199 1.962 0.05749 . maxst -0.8167 0.1697 -4.814 2.65e-05 *** maxat 0.4175 0.1177 3.546 0.00111 ** minh 0.4580 0.3253 1.408 0.16766 poly(maxh, 3)1 -89.0809 20.0297 -4.447 8.02e-05 *** poly(maxh, 3)2 -10.7374 7.9265 -1.355 0.18398 poly(maxh, 3)3 15.1172 6.3209 2.392 0.02212 * --- Residual standard error: 5.276 on 36 degrees of freedom Multiple R-squared: 0.8961, Adjusted R-squared: 0.8701 F-statistic: 34.49 on 9 and 36 DF, p-value: 4.459e-15 330 lecture 16

  22. New model Lets now adopt model lm(evap~poly(avst,3)+maxst+maxat+poly(maxh,3) + wind Outliers are not too bad but lets check > influenceplots(model2.lm) 330 lecture 16

  23. 330 lecture 16

  24. 330 lecture 16

  25. Points 2, 6, 7, 41 are affecting the fitted values, some coefficients. Removing these one at a time and refitting indicates that the cubics are not very robust, so we revert to the non-polynomial model The coefficients of the non-polynomial model are fairly stable when we delete these points one at a time, so we decide to retain them. Deletion of points 330 lecture 16

  26. Normality? However, the normal plot for the non-polynomial model is not very straight – WB test has p-value 0. Normality of polynomial model is better Try predictions with both 330 lecture 16

  27. predict.df = data.frame(avst = mean(evap.df$avst), maxst = mean(evap.df$maxst), maxat = mean(evap.df$maxat), maxh = mean(evap.df$maxh), minh = mean(evap.df$minh)) rbind(predict(model1.lm, predict.df,interval="p" ), predict(model2.lm, predict.df,interval="p" )) fit lwr upr 1 34.67391 21.75680 47.59103 1 32.38471 21.39857 43.37084 CV fit: fit lwr upr 1 34.67391 21.02628 48.32154 330 lecture 16

More Related