205 Views

Download Presentation
##### STATS 330: Lecture 16

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**STATS 330: Lecture 16**Case Study 330 lecture 16**Case study**Aim of today’s lecture To illustrate the modelling process using the evaporation data. 330 lecture 16**The Evaporation data**• Data in data frame evap.df • Aims of the analysis: • Understand relationships between explanatory variables and the response • Be able to predict evaporation loss given the other variables 330 lecture 16**Case Study: Evaporation data**Recall from Lecture 15: variables are evap: the amount of moisture evaporating from the soil in the 24 hour period (response) maxst: maximum soil temperature over the 24 hour period minst: minimum soil temperature over the 24 hour period avst: average soil temperature over the 24 hour period maxat: maximum air temperature over the 24 hour period minat: minimum air temperature over the 24 hour period avat: average air temperature over the 24 hour period maxh: maximum humidity over the 24 hour period minh: minimum humidity over the 24 hour period avh: average humidity over the 24 hour period wind: average wind speed over the 24 hour period. 330 lecture 16**Choose Model**Plots, theory Fit model Examine residuals Transform Good fit Bad fit Use model Modelling cycle 330 lecture 16**Modelling cycle (2)**Our plan of attack: • Graphical check • Suitability for regression • Gross outliers • Preliminary fit • Model selection (for prediction) • Transforming if required • Outlier check • Use model for prediction 330 lecture 16**Step 1: Plots**• Preliminary plots • Want to get an initial idea of suitability of data for regression modelling • Check for linear relationships, outliers • Pairs plots, coplots • Data looks OK to proceed, but evap/maxh plot looks curved 330 lecture 16**Points to note**• Avh has very few values • Strong relationships between response and some variables (particularly maxh, avst) • Not much relationship between response and minst, minat, wind • strong relationships between min, av and max • No obvious outliers 330 lecture 16**Step 2: preliminary fit**Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -54.074877 130.720826 -0.414 0.68164 avst 2.231782 1.003882 2.223 0.03276 * minst 0.204854 1.104523 0.185 0.85393 maxst -0.742580 0.349609 -2.124 0.04081 * avat 0.501055 0.568964 0.881 0.38452 minat 0.304126 0.788877 0.386 0.70219 maxat 0.092187 0.218054 0.423 0.67505 avh 1.109858 1.133126 0.979 0.33407 minh 0.751405 0.487749 1.541 0.13242 maxh -0.556292 0.161602 -3.442 0.00151 ** wind 0.008918 0.009167 0.973 0.33733 Residual standard error: 6.508 on 35 degrees of freedom Multiple R-squared: 0.8463, Adjusted R-squared: 0.8023 F-statistic: 19.27 on 10 and 35 DF, p-value: 2.073e-11 330 lecture 16**Findings**• Plots OK, normality dubious • Gam plots indicated no transformations • Point 31 has quite high Cooks distance but removing it doesn’t change regression much • Model is OK. • Could interpret coefficients, but variables highly correlated. 330 lecture 16**Step 3: Model selection**• Use APR • Model selected was evap ~ maxat + maxh + wind However, this model does not fit all that well (outliers, non-normality) Try “best AIC” model evap ~ avst + maxst + maxat + minh+maxh • Now proceed to step 4 330 lecture 16**Step 4: Diagnostic checks**For a quick check, plot the regression object produced by lm model1.lm<-lm(evap ~ avst + maxst + maxat + minh+maxh, data=evap.df) plot(model1.lm) 330 lecture 16**Outliers ?**Non-normal? 330 lecture 16**Conclusions?**• No real evidence of non-linearity, but will check further with gams • Normal plot looks curved • Some largish outliers • Points 2, 41 have largish Cooks D 330 lecture 16**Checking linearity**• Check for linearity with gams > library(mgcv) >plot(gam(evap ~ s(avst) + s(maxst) + s(maxat) + s(maxh) + s(wind), data=evap.df)) 330 lecture 16**Transform avst, maxh ?**330 lecture 16**Remedy**• Gam plots for avst and maxh are curved • Try cubics in these variables • Plots look better • Cubic terms are significant 330 lecture 16**> model2.lm<-lm(evap ~ poly(avst,3) + maxst + maxat +**minh+poly(maxh,3), data=evap.df) > summary(model2.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 74.6521 25.4308 2.935 0.00577 ** poly(avst, 3)1 83.0106 27.3221 3.038 0.00441 ** poly(avst, 3)2 21.4666 8.3097 2.583 0.01399 * poly(avst, 3)3 14.1680 7.2199 1.962 0.05749 . maxst -0.8167 0.1697 -4.814 2.65e-05 *** maxat 0.4175 0.1177 3.546 0.00111 ** minh 0.4580 0.3253 1.408 0.16766 poly(maxh, 3)1 -89.0809 20.0297 -4.447 8.02e-05 *** poly(maxh, 3)2 -10.7374 7.9265 -1.355 0.18398 poly(maxh, 3)3 15.1172 6.3209 2.392 0.02212 * --- Residual standard error: 5.276 on 36 degrees of freedom Multiple R-squared: 0.8961, Adjusted R-squared: 0.8701 F-statistic: 34.49 on 9 and 36 DF, p-value: 4.459e-15 330 lecture 16**New model**Lets now adopt model lm(evap~poly(avst,3)+maxst+maxat+poly(maxh,3) + wind Outliers are not too bad but lets check > influenceplots(model2.lm) 330 lecture 16**Points 2, 6, 7, 41 are affecting the fitted values, some**coefficients. Removing these one at a time and refitting indicates that the cubics are not very robust, so we revert to the non-polynomial model The coefficients of the non-polynomial model are fairly stable when we delete these points one at a time, so we decide to retain them. Deletion of points 330 lecture 16**Normality?**However, the normal plot for the non-polynomial model is not very straight – WB test has p-value 0. Normality of polynomial model is better Try predictions with both 330 lecture 16**predict.df = data.frame(avst = mean(evap.df$avst),**maxst = mean(evap.df$maxst), maxat = mean(evap.df$maxat), maxh = mean(evap.df$maxh), minh = mean(evap.df$minh)) rbind(predict(model1.lm, predict.df,interval="p" ), predict(model2.lm, predict.df,interval="p" )) fit lwr upr 1 34.67391 21.75680 47.59103 1 32.38471 21.39857 43.37084 CV fit: fit lwr upr 1 34.67391 21.02628 48.32154 330 lecture 16