STATS 330: Lecture 16

STATS 330: Lecture 16 Case Study 330 lecture 16

Case study Aim of today’s lecture To illustrate the modelling process using the evaporation data. 330 lecture 16

The Evaporation data • Data in data frame evap.df • Aims of the analysis: • Understand relationships between explanatory variables and the response • Be able to predict evaporation loss given the other variables 330 lecture 16

Case Study: Evaporation data Recall from Lecture 15: variables are evap: the amount of moisture evaporating from the soil in the 24 hour period (response) maxst: maximum soil temperature over the 24 hour period minst: minimum soil temperature over the 24 hour period avst: average soil temperature over the 24 hour period maxat: maximum air temperature over the 24 hour period minat: minimum air temperature over the 24 hour period avat: average air temperature over the 24 hour period maxh: maximum humidity over the 24 hour period minh: minimum humidity over the 24 hour period avh: average humidity over the 24 hour period wind: average wind speed over the 24 hour period. 330 lecture 16

Choose Model Plots, theory Fit model Examine residuals Transform Good fit Bad fit Use model Modelling cycle 330 lecture 16

Modelling cycle (2) Our plan of attack: • Graphical check • Suitability for regression • Gross outliers • Preliminary fit • Model selection (for prediction) • Transforming if required • Outlier check • Use model for prediction 330 lecture 16

Step 1: Plots • Preliminary plots • Want to get an initial idea of suitability of data for regression modelling • Check for linear relationships, outliers • Pairs plots, coplots • Data looks OK to proceed, but evap/maxh plot looks curved 330 lecture 16

330 lecture 16

Points to note • Avh has very few values • Strong relationships between response and some variables (particularly maxh, avst) • Not much relationship between response and minst, minat, wind • strong relationships between min, av and max • No obvious outliers 330 lecture 16

Step 2: preliminary fit Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -54.074877 130.720826 -0.414 0.68164 avst 2.231782 1.003882 2.223 0.03276 * minst 0.204854 1.104523 0.185 0.85393 maxst -0.742580 0.349609 -2.124 0.04081 * avat 0.501055 0.568964 0.881 0.38452 minat 0.304126 0.788877 0.386 0.70219 maxat 0.092187 0.218054 0.423 0.67505 avh 1.109858 1.133126 0.979 0.33407 minh 0.751405 0.487749 1.541 0.13242 maxh -0.556292 0.161602 -3.442 0.00151 ** wind 0.008918 0.009167 0.973 0.33733 Residual standard error: 6.508 on 35 degrees of freedom Multiple R-squared: 0.8463, Adjusted R-squared: 0.8023 F-statistic: 19.27 on 10 and 35 DF, p-value: 2.073e-11 330 lecture 16

330 lecture 16

Findings • Plots OK, normality dubious • Gam plots indicated no transformations • Point 31 has quite high Cooks distance but removing it doesn’t change regression much • Model is OK. • Could interpret coefficients, but variables highly correlated. 330 lecture 16

Step 3: Model selection • Use APR • Model selected was evap ~ maxat + maxh + wind However, this model does not fit all that well (outliers, non-normality) Try “best AIC” model evap ~ avst + maxst + maxat + minh+maxh • Now proceed to step 4 330 lecture 16

Step 4: Diagnostic checks For a quick check, plot the regression object produced by lm model1.lm<-lm(evap ~ avst + maxst + maxat + minh+maxh, data=evap.df) plot(model1.lm) 330 lecture 16

Outliers ? Non-normal? 330 lecture 16

Conclusions? • No real evidence of non-linearity, but will check further with gams • Normal plot looks curved • Some largish outliers • Points 2, 41 have largish Cooks D 330 lecture 16

Checking linearity • Check for linearity with gams > library(mgcv) >plot(gam(evap ~ s(avst) + s(maxst) + s(maxat) + s(maxh) + s(wind), data=evap.df)) 330 lecture 16

Transform avst, maxh ? 330 lecture 16

Remedy • Gam plots for avst and maxh are curved • Try cubics in these variables • Plots look better • Cubic terms are significant 330 lecture 16

330 lecture 16

> model2.lm<-lm(evap ~ poly(avst,3) + maxst + maxat + minh+poly(maxh,3), data=evap.df) > summary(model2.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 74.6521 25.4308 2.935 0.00577 ** poly(avst, 3)1 83.0106 27.3221 3.038 0.00441 ** poly(avst, 3)2 21.4666 8.3097 2.583 0.01399 * poly(avst, 3)3 14.1680 7.2199 1.962 0.05749 . maxst -0.8167 0.1697 -4.814 2.65e-05 *** maxat 0.4175 0.1177 3.546 0.00111 ** minh 0.4580 0.3253 1.408 0.16766 poly(maxh, 3)1 -89.0809 20.0297 -4.447 8.02e-05 *** poly(maxh, 3)2 -10.7374 7.9265 -1.355 0.18398 poly(maxh, 3)3 15.1172 6.3209 2.392 0.02212 * --- Residual standard error: 5.276 on 36 degrees of freedom Multiple R-squared: 0.8961, Adjusted R-squared: 0.8701 F-statistic: 34.49 on 9 and 36 DF, p-value: 4.459e-15 330 lecture 16

New model Lets now adopt model lm(evap~poly(avst,3)+maxst+maxat+poly(maxh,3) + wind Outliers are not too bad but lets check > influenceplots(model2.lm) 330 lecture 16

330 lecture 16

Points 2, 6, 7, 41 are affecting the fitted values, some coefficients. Removing these one at a time and refitting indicates that the cubics are not very robust, so we revert to the non-polynomial model The coefficients of the non-polynomial model are fairly stable when we delete these points one at a time, so we decide to retain them. Deletion of points 330 lecture 16

Normality? However, the normal plot for the non-polynomial model is not very straight – WB test has p-value 0. Normality of polynomial model is better Try predictions with both 330 lecture 16

predict.df = data.frame(avst = mean(evap.df$avst), maxst = mean(evap.df$maxst), maxat = mean(evap.df$maxat), maxh = mean(evap.df$maxh), minh = mean(evap.df$minh)) rbind(predict(model1.lm, predict.df,interval="p" ), predict(model2.lm, predict.df,interval="p" )) fit lwr upr 1 34.67391 21.75680 47.59103 1 32.38471 21.39857 43.37084 CV fit: fit lwr upr 1 34.67391 21.02628 48.32154 330 lecture 16

STATS 330: Lecture 16

STATS 330: Lecture 16

Presentation Transcript

STATS 330: Lecture 4

Stats 330: Lecture 23

STATS 330: Lecture 4

Stats 330: Lecture 26

STATS 330: Lecture 13

Stats 330: Lecture 22

STATS 330: Lecture 2

STATS 330: Lecture 16

STATS 330: Lecture 1

Stats 330: Lecture 20

Stats 330: Lecture 29

STATS 330: Lecture 7

Stats 330: Lecture 21

Stats 330: Lecture 31

Stats 330: Lecture 24

Stats 330: Lecture 30

Stats 330: Lecture 19

STATS 330: Lecture 8

Stats 330: Lecture 25

Stats 330: Lecture 18