
Practical Sheet 6 Solutions



  1. Practical Sheet 6 Solutions
The R data frame “whiteside”, which deals with gas consumption, is made available in R by
> data(whiteside, package="MASS")
It records the weekly gas consumption and average external temperature at a house in south-east England during two heating seasons, one before and one after cavity-wall insulation was installed. The variables are:
Gas - weekly gas consumption
Temp - average external temperature during the week
Insul - (binary factor) Before (insulation) or After

  2. We check whether b and ρ are significantly different from 0. (This is clear from the print-out in the case of b, so the manual calculation that follows is not normally required.)
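
The print-out referred to here plausibly comes from regressing Gas on Temp for the pre-insulation weeks only (26 points, matching the 24 degrees of freedom quoted below). A minimal sketch under that assumption; the object names gasB and fitB are purely illustrative:
> data(whiteside, package="MASS")
> gasB = subset(whiteside, Insul == "Before")   # assumed: the 26 "Before" weeks
> fitB = lm(Gas ~ Temp, data = gasB)
> summary(fitB)                                 # coefficient, standard error, t value and significance stars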

  3. The value of b̂ is -0.3932. Now carry out a hypothesis test.
H0: b = 0
H1: b ≠ 0
The standard error of b̂ is s/√(Σ(xi − x̄)²), where s is the residual standard error. This is calculated in R as 0.01959.

  4. The test statistic is t = (b̂ − 0)/SE(b̂), which under H0 has a t distribution with n − 2 = 24 degrees of freedom. This calculates as (-0.3932 − 0)/0.01959 = -20.071.
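
The same calculation can be done in R from the fitted model; a quick sketch, reusing the assumed fitB object from above (coef() and vcov() are the standard accessors):
> b.hat = coef(fitB)["Temp"]                 # estimated slope
> se.b = sqrt(vcov(fitB)["Temp", "Temp"])    # its standard error
> (b.hat - 0) / se.b                         # test statistic, approximately -20.07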

  5. [Sketch of the t density with 24 degrees of freedom, with 2.5% in each tail beyond ±2.064.] t tables using 24 degrees of freedom (there are 26 points) give a cut-off point of 2.064 for 2.5% in each tail.
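
The cut-off points can also be checked directly with the t quantile function in base R; a brief sketch:
> qt(0.975, df = 24)    # upper 2.5% point, approximately 2.064
> qt(0.025, df = 24)    # lower 2.5% point, approximately -2.064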

  6. Since -20.071 is less than -2.064, we reject H0 in favour of H1. There is evidence at the 5% level of a significant (negative) relationship between Gas and Temp. In fact, the t values associated with significance levels of 1% and 0.1% are 2.492 and 3.497, and so b is also significant at the 0.1% level (“very highly significant”). This corresponds to the three stars on the R output.

  7. We now check the significance of r. The computer output gives R² = 0.9438. The magnitude of r is the square root of this, i.e. 0.9714; since the fitted slope is negative, r = -0.9714. It is fairly clear that this will be significantly different from 0, but we test anyway.

  8. Let the true correlation coefficient be ρ. We know that, under H0: ρ = 0,
t = r√(n − 2) / √(1 − r²)
has a t distribution with n − 2 degrees of freedom. In this case the test statistic calculates as -0.9714 × √24 / √(1 − 0.9438) ≈ -20.07, agreeing (as it must) with the t statistic for b.
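
The same test is carried out directly by cor.test(); a short sketch, again assuming the Before-only data frame gasB defined earlier:
> with(gasB, cor.test(Gas, Temp))   # reports r, the t statistic on 24 df and a p-value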

  9. H0: ρ = 0
H1: ρ ≠ 0
As seen previously, the cut-off points for the t distribution with 24 degrees of freedom, for 2.5% in each tail, are ±2.064.

  10. The t value of -20.07, lying well beyond these cut-off points, implies that H0 is rejected in favour of H1. There is evidence of a non-zero correlation between Gas and Temp.

  11. Fisher’s Transformation
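
Fisher’s transformation z = atanh(r) = ½ log((1 + r)/(1 − r)) is approximately normally distributed with mean atanh(ρ) and variance 1/(n − 3), which gives an approximate confidence interval for ρ. A minimal sketch of that calculation, assuming r = -0.9714 and n = 26 as above:
> r = -0.9714; n = 26
> z = atanh(r)                                        # Fisher's z transform of r
> z.ci = z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3)    # approximate 95% interval on the z scale
> tanh(z.ci)                                          # back-transform to an interval for rho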

  12. Use of Weighted Least Squares

  13. In fitting models of the form yi = f(xi) + εi, i = 1, …, n, least squares is optimal under the condition that ε1, …, εn are i.i.d. N(0, σ²), and is a reasonable fitting method when this condition is at least approximately satisfied. (Most importantly, we require here that there should be no significant outliers.)

  14. In the case where we have instead that ε1, …, εn are independent N(0, σi²), it is natural to use weighted least squares: choose f̂ from within the permitted class of functions to minimise Σ wi (yi − f̂(xi))², where we take wi proportional to 1/σi² (clearly only relative weights matter).
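
In R the weights argument of lm() fits exactly this criterion; a tiny generic sketch (the data x, y and the variance pattern are made up purely for illustration):
> x = 1:20
> y = 2 + 3 * x + rnorm(20, sd = 0.5 * x)   # errors whose spread grows with x (illustrative)
> w = 1 / x^2                               # weights proportional to 1 / (assumed variance)
> fitw = lm(y ~ x, weights = w)             # minimises sum of w_i * (y_i - fitted_i)^2
> summary(fitw)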

  15. Example: Scottish hill races data. These data are made available in R by
> data(hills, package="MASS")
They give record times (minutes) in 1984 of 35 Scottish hill races, against distance (miles) and total height climbed (feet). We regard time as the response variable, and seek to model how its conditional distribution depends on the explanatory variables distance and climb.

  16. The R code pairs(hills) produces the plots shown.

  17. The fitted model is:
time = 5.62×distance + 0.0323×(distance)² + 0.000262×climb + 0.00000180×(climb)² + ε
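
This fit plausibly comes from an unweighted call of the same form as the weighted one shown on the next slide (no intercept, quadratic in both variables, race 18 dropped); a sketch under that assumption:
> library(MASS)                      # provides the hills data frame
> model2 = lm(time ~ -1 + dist + I(dist^2) + climb + I(climb^2), data = hills[-18,])
> summary(model2)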

  18. For the hill races data, it is natural to assume greater variability in the times for the longer races, with the variability (standard deviation) perhaps proportional to the distance. We therefore try refitting the quadratic model with weights proportional to 1/distance²:
> model2w = lm(time ~ -1 + dist + I(dist^2) + climb + I(climb^2), data = hills[-18,], weights = 1/dist^2)

  19. The fitted model is now:
time = 4.94×distance + 0.0548×(distance)² + 0.00349×climb + 0.00000134×(climb)² + ε

  20. Note that the residual summary above is on a “reweighted” scale, and cannot be directly compared with the earlier residual summaries.

  21. While the coefficients here appear to have changed somewhat from those in the earlier, unweighted, fit of Model 2, the fitted model is not really very different.

  22. This is confirmed by the plot of the residuals from the weighted fit against those from the unweighted fit, produced by
> plot(resid(model2w) ~ resid(model2))

  23. Resistant Regression

  24. As already observed, least squares fitting is very sensitive to outlying observations. However, there are also a large number of resistant fitting techniques available. One such is least trimmed squares: choose f̂ from within the permitted class of functions to minimise the sum of the k smallest of the squared residuals (yi − f̂(xi))², where k is typically a little over half of n, so that the largest squared residuals (potential outliers) do not influence the fit.

  25. Example: phones data. The R dataset phones in the package MASS gives the annual number of phone calls (millions) in Belgium over the period 1950-73. Consider the model calls = a + b×year. The following two graphs plot the data and show the result of fitting the model by least squares and then by least trimmed squares.

  26. These graphs are achieved by the following code:
> library(MASS)          # provides the phones data and, in current R, the lqs() function
> attach(phones)         # makes calls and year available by name
> plot(calls ~ year)
> phonesls = lm(calls ~ year)
> abline(phonesls)
> plot(calls ~ year)
> phoneslts = lqs(calls ~ year)
> abline(phoneslts)

  27. The explanation for the outlying points is that, for a period of time, the total length of all phone calls in each year was accidentally recorded instead of the number of calls.

  28. Nonparametric Regression

  29. Sometimes we simply wish to fit a smooth model without specifying any particular functional form for f. Again there are very many techniques here. One such is called loess. This constructs the fitted value f̂(xi) for each observation i by performing a local regression using only those observations with x values in the neighbourhood of xi (and attaching most weight to the closest observations).

  30. Example: cars data. The R data frame cars (in the base package) records 50 observations of speed (mph) and stopping distance (ft). These observations were collected in the 1920s! We treat stopping distance as the response variable and seek to model its dependence on speed.

  31. We try to fit a model using loess. Possible R code is:
> data(cars)
> attach(cars)
> plot(cars)
> carslo = loess(dist ~ speed)     # loess() now lives in the standard stats package (formerly modreg)
> lines(fitted(carslo) ~ speed)

  32. An optional argument span can be increased from its default value of 0.75 to give more smoothing:
> plot(cars)
> carslo2 = loess(dist ~ speed, span = 1)
> lines(fitted(carslo2) ~ speed)
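
For a smoother plotted curve, the fit can also be evaluated on a regular grid of speeds using predict(); a small sketch (the grid and object name are just illustrative):
> speed.grid = seq(min(speed), max(speed), length.out = 100)
> plot(cars)
> lines(speed.grid, predict(carslo2, newdata = data.frame(speed = speed.grid)))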
