
MT2004


Presentation Transcript


  1. MT2004 Olivier GIMENEZ Telephone: 01334 461827 E-mail: olivier@mcs.st-and.ac.uk Website: http://www.creem.st-and.ac.uk/olivier/OGimenez.html

  2. 12. Regression • Objective here: investigate the relationship between two (or more) variables, e.g. height and weight • [Figure: scatterplot of y against x]

  3. 12. Regression • Objective here: investigate the relationship between two (or more) variables, e.g. height and weight • We will distinguish the response variable Y, which is random, and the explanatory variable x, which takes fixed values. • We can consider that Y varies in response to x • The relationship between the response variable and the explanatory variable is known as a regression problem • More formally, the regression problem is concerned with finding the regression relationship between the response variable Y and the explanatory variable x, on the basis of n pairs of observations (x1,y1), …, (xn,yn)

  4. 12. Regression • Objective here: investigate the relationship between two (or more) variables, e.g. height and weight • [Figure: scatterplot with the point (x1, y1) highlighted]

  5. 12. Regression • Objective here: investigate the relationship between two (or more) variables, e.g. height and weight • [Figure: scatterplot with the points (x1, y1) and (x5, y5) highlighted]

  6. 12.1 Linear regression • The simplest form of regression is linear regression: Y is related to the explanatory variable x by • E(Y) = α + βx • where α and β are unknown parameters. • It means that Y is linearly related to x, and the line y = α + βx is called the (population) regression line; α and β are called the regression parameters.

  7. 12.1 Linear regression • To better understand the relationship between Y and x, we write Yi for the random variable Y associated with the value xi of x, for i = 1,…, n • Then we have that: • E(Yi) = α + βxi, i = 1,…, n • Which can be re-written as: • Yi = α + βxi + εi, i = 1,…, n • where ε1,…, εn are independent random variables with mean zero.
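To make the model concrete, here is a small R sketch (not part of the original slides) that simulates data from Yi = α + βxi + εi; the values α = 2, β = 0.5 and normal errors with standard deviation 1 are illustrative assumptions only.

# Simulate n = 20 observations from Y_i = alpha + beta * x_i + eps_i
set.seed(1)                          # reproducibility
n     <- 20
alpha <- 2                           # assumed intercept (illustrative)
beta  <- 0.5                         # assumed slope (illustrative)
x     <- seq(1, 10, length = n)      # fixed values of the explanatory variable
eps   <- rnorm(n, mean = 0, sd = 1)  # independent errors with mean zero
y     <- alpha + beta * x + eps      # the response
plot(x, y)                           # scatterplot of the simulated data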

  8. 12.1.1 Least squares • We want to estimate the regression parameters α and β • In other words, we’d like to fit the model to data • A reasonable way of doing this is by least squares • We consider the vertical distances ei = yi – (α + βxi) between the observed values yi and the corresponding values (α + βxi) given by the model

  9. 12. Regression • We consider the model first, which is the regression line y = α + βx • [Figure: the regression line y = α + βx]

  10. 12. Regression • We then consider the vertical distances ei = yi – (α + βxi) • [Figure: the regression line with the point (x1, y1) and its fitted value α + βx1 marked]

  11. 12. Regression • We consider the vertical distances ei = yi – (α + βxi) • [Figure: the vertical distance e1 between y1 and α + βx1]

  12. 12. Regression • We consider the vertical distances ei = yi – (α + βxi) • [Figure: the regression line with the point (x6, y6) and its fitted value α + βx6 marked]

  13. 12. Regression • We consider the vertical distances ei = yi – (α + βxi) • [Figure: the vertical distance e6 between y6 and α + βx6]

  14. 12.1.1 Least squares • We want to estimate the regression parameters α and β • In other words, we’d like to fit the model to data • A reasonable way of doing this is by least squares • We consider the vertical distances ei = yi – (α + βxi) between the observed values yi and the corresponding values (α + βxi) given by the model • A way of measuring the difference between the data and the model is by the sum of squares: S(α, β) = Σi (yi − α − βxi)²

  15. 12.1.1 Least squares • We want to estimate the regression parameters α and β • A way of measuring the difference between the data and the model is by the sum of squares: S(α, β) = Σi (yi − α − βxi)² • The method of least squares estimates α and β by the values which minimise S(α, β) • Your turn: obtain the least squares estimates for α and β • (hint: differentiate S(α, β) first w.r.t. α)

  16. 12.1.1 Least squares • The sum of squares is: S(α, β) = Σi (yi − α − βxi)² • The method of least squares estimates α and β by the values which minimise S(α, β) • We first differentiate S(α, β) w.r.t. α • Setting this to zero gives • So:

  17. 12.1.1 Least squares • We then differentiate S(α, β) w.r.t. β • Setting this to zero gives • Plugging the estimate for α in the equation above gives:

  18. 12.1.1 Least squares • Finally, using the expression for the estimate of β in the equation for α above • We end up with:
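For reference, the algebra sketched on slides 16–18 can be written out in LaTeX; this is the standard least squares derivation starting from S(α, β) = Σi (yi − α − βxi)², reconstructed from the surrounding text rather than copied from the slides.

\frac{\partial S}{\partial \alpha} = -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0
\quad\Longrightarrow\quad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x},

\frac{\partial S}{\partial \beta} = -2 \sum_{i=1}^{n} x_i (y_i - \alpha - \beta x_i) = 0
\quad\Longrightarrow\quad
\hat{\beta} = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2}
            = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
            = \frac{S_{xy}}{S_{xx}}.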

  19. 12.1.1 Least squares • The least squares estimates of α and β are often expressed as: β̂ = Sxy / Sxx and α̂ = ȳ − β̂ x̄ • where: Sxy = Σi (xi − x̄)(yi − ȳ) and Sxx = Σi (xi − x̄)²
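As a quick numerical check of these formulas, the estimates can be computed by hand in R; the chlorine data used in Section 12.2 below serve as a convenient worked example here.

# Least squares estimates computed directly from the formulas above
time     <- c(2, 4, 6, 8, 10, 12)
chlorine <- c(1.8, 1.5, 1.4, 1.1, 1.1, 0.9)
Sxy <- sum((time - mean(time)) * (chlorine - mean(chlorine)))
Sxx <- sum((time - mean(time))^2)
beta_hat  <- Sxy / Sxx                                # slope estimate: -0.08571
alpha_hat <- mean(chlorine) - beta_hat * mean(time)   # intercept estimate: 1.9
c(alpha = alpha_hat, beta = beta_hat)                 # matches the lm() fit in Section 12.2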

  20. 12.1.1 Least squares • Note that the estimation by least squares requires no assumption about the distribution of Y1,…,Yn • As usual, the least squares estimators of α and β are functions of random variables, so are random variables • Let’s see whether these estimators are unbiased

  21. 12.1.1 Least squares • Let’s see whether the least squares estimators of α and β are unbiased. First, β̂ • Then, α̂
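The unbiasedness can also be illustrated by simulation; in the sketch below the true values α = 2, β = 0.5 and normal errors with standard deviation 1 are arbitrary illustrative choices, and the averaged estimates should come out close to them.

# Empirical check that the least squares estimators are (approximately) unbiased
set.seed(2)
x <- seq(1, 10, length = 20)       # fixed design
alpha <- 2; beta <- 0.5            # assumed true parameter values
estimates <- replicate(10000, {
  y <- alpha + beta * x + rnorm(length(x))   # simulate a new data set
  coef(lm(y ~ x))                            # least squares estimates
})
rowMeans(estimates)                # averages close to c(2, 0.5)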

  22. 12.1.2 The normal linear regression model • We’d like to test hypotheses regarding parameters α and β and construct confidence intervals, in other words, to do inference • To do this, we need to make assumptions about the distribution of Y1,…,Yn • We will assume that Y1,…,Yn are independent, normally distributed with the same variance • And we wish to calculate maximum likelihood estimators for α and β

  23. 12.1.2 The normal linear regression model • We’d like to test hypotheses regarding parameters α and β and construct confidence intervals, in other words, to do inference • To do this, we need to make assumptions about the distribution of Y1,…,Yn • We will assume that Y1,…,Yn are independent, normally distributed with the same variance • And we wish to calculate maximum likelihood estimators for α and β • 1. Form the likelihood and then the corresponding log-likelihood • 2. Maximise the log-likelihood w.r.t. α, β and σ² and obtain their MLEs

  24. 12.1.2 The normal linear regression model • We will assume that Y1,…,Yn are independent, normally distributed with the same variance • Form the likelihood and then the corresponding log-likelihood: • The log-likelihood is then:
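The likelihood and log-likelihood referred to here can be written out as follows (a standard reconstruction in LaTeX, assuming Yi ~ N(α + βxi, σ²) independently, not a verbatim copy of the slide):

L(\alpha, \beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left\{ -\frac{(y_i - \alpha - \beta x_i)^2}{2\sigma^2} \right\},

\ell(\alpha, \beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 .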

  25. 12.1.2 The normal linear regression model • 2. Maximise the log-likelihood w.r.t. α, β and σ² and obtain their MLEs • Note that maximising the log-likelihood is equivalent to minimising the sum of squares • It means that, for the normal linear regression model, the maximum likelihood estimates are equal to the least squares estimates
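A quick numerical illustration of this equivalence: maximise the log-likelihood with R's general-purpose optimiser optim and compare with the lm fit. The chlorine data from Section 12.2 below and the log-sigma parameterisation are choices made for this sketch.

# Maximise the normal log-likelihood numerically; the result matches least squares
time     <- c(2, 4, 6, 8, 10, 12)
chlorine <- c(1.8, 1.5, 1.4, 1.1, 1.1, 0.9)
negloglik <- function(par) {
  alpha <- par[1]; beta <- par[2]; sigma <- exp(par[3])   # exp() keeps sigma positive
  -sum(dnorm(chlorine, mean = alpha + beta * time, sd = sigma, log = TRUE))
}
mle <- optim(c(1, 0, 0), negloglik)$par
mle[1:2]                      # approximately 1.9 and -0.0857
coef(lm(chlorine ~ time))     # least squares estimates, essentially the same values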

  26. 12.1.2 The normal linear regression model • 2. Maximise the log-likelihood w.r.t. α and β and obtain their MLEs • Also, we see that these estimators are linear combinations of the normal random variables Y1,…,Yn, so they are normally distributed:

  27. 12.1.2 The normal linear regression model • 2. Maximise the log-likelihood w.r.t. σ² and obtain its MLE • so: the MLE of σ² is Σi (yi − ŷi)² / n • But this estimator is biased • So we will use: s² = Σi (yi − ŷi)² / (n − 2) • where ŷi = α̂ + β̂xi is the fitted value corresponding to xi • It can be shown that s² is an unbiased estimator of σ²
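In R, s² can be computed directly from the residuals of a fitted model; the sketch below uses the chlorine data introduced in Section 12.2 and checks the result against the residual standard error that summary() reports.

# Unbiased estimate of sigma^2: residual sum of squares divided by n - 2
time     <- c(2, 4, 6, 8, 10, 12)
chlorine <- c(1.8, 1.5, 1.4, 1.1, 1.1, 0.9)
fit <- lm(chlorine ~ time)
s2  <- sum(residuals(fit)^2) / (length(chlorine) - 2)
sqrt(s2)               # 0.08018, the residual standard error
summary(fit)$sigma     # the same value, as reported by summary()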

  28. 12.1.2 The normal linear regression model • We can also show that • and • but

  29. 12.1.2 The normal linear regression model • Remember that: • and • So we have that: • and • To be used to test hypotheses about the regression parameters
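For reference, the standard distributional results behind slides 28–30, written out in LaTeX (a reconstruction under the normal model, not a verbatim copy of the slides):

\hat{\beta} \sim N\!\left(\beta, \frac{\sigma^2}{S_{xx}}\right), \qquad
\hat{\alpha} \sim N\!\left(\alpha, \sigma^2\!\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\right),
\qquad \frac{(n-2)s^2}{\sigma^2} \sim \chi^2_{n-2},

\frac{\hat{\beta} - \beta}{s / \sqrt{S_{xx}}} \sim t_{n-2}, \qquad
\frac{\hat{\alpha} - \alpha}{s \sqrt{\tfrac{1}{n} + \tfrac{\bar{x}^2}{S_{xx}}}} \sim t_{n-2}.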

  30. 12.1.2 The normal linear regression model • The results of the previous slide also allow confidence intervals to be constructed • E.g. 95% confidence intervals for α and β are: • You do not need to memorise the expressions under the square root signs as they are provided by R

  31. 12.2 Regression using R • Regression can be carried out using R with the lm command (linear modelling) • The major advantage is that R provides, among other things, confidence intervals and p-values for the test of hypotheses regarding the regression parameters • Example: the following measurements give the concentration of chlorine (in parts per million) in a swimming pool at various times after chlorination treatment: • time (hours) 2 4 6 8 10 12 • chlorine (p.p.m.) 1.8 1.5 1.4 1.1 1.1 0.9

  32. 12.2 Regression using R > time <- c(2,4,6,8,10,12) > chlorine <- c(1.8,1.5,1.4,1.1,1.1,0.9) > pool <- rbind(time,chlorine) # combine by rows (cbind combines by columns) > swim <- lm(chlorine~time) # linear model in which chlorine is regressed on time, i.e. chlorine = α + β time > swim Call: lm(formula = chlorine ~ time) Coefficients: (Intercept) time 1.90000 -0.08571 (the least squares estimates, or MLEs)

  33. 12.2 Regression using R • More details can be obtained using the command summary • > summary(swim) • Call: • lm(formula = chlorine ~ time) • Residuals: • 1 2 3 4 5 6 • 0.07143 -0.05714 0.01429 -0.11429 0.05714 0.02857 • Coefficients: • Estimate Std. Error t value Pr(>|t|) • (Intercept) 1.900000 0.074642 25.455 1.41e-05 *** • time -0.085714 0.009583 -8.944 0.000864 *** • --- • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 • Residual standard error: 0.08018 on 4 degrees of freedom • Multiple R-Squared: 0.9524, Adjusted R-squared: 0.9405 • F-statistic: 80 on 1 and 4 DF, p-value: 0.0008642

  34. 12.2 Regression using R • More details can be obtained using the command summary • > summary(swim) • Call: • lm(formula = chlorine ~ time) • Residuals: • 1 2 3 4 5 6 • 0.07143 -0.05714 0.01429 -0.11429 0.05714 0.02857 • Recall the R command used to create the swim object • The residuals ri are, by definition, the differences between the observed and fitted values, i.e. ri = yi − ŷi where ŷi = α̂ + β̂xi, e.g. r1 = 1.8 − (1.9 − 0.08571 × 2) = 0.07143

  35. 12.2 Regression using R • More details can be obtained using the command summary • > summary(swim) • Coefficients: • Estimate Std. Error t value Pr(>|t|) • (Intercept) 1.900000 0.074642 25.455 1.41e-05 *** • time -0.085714 0.009583 -8.944 0.000864 *** • --- • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 • The first column gives a name for the parameters ((Intercept) for α and time for β)

  36. 12.2 Regression using R • More details can be obtained using the command summary • > summary(swim) • Coefficients: • Estimate Std. Error t value Pr(>|t|) • (Intercept) 1.900000 0.074642 25.455 1.41e-05 *** • time -0.085714 0.009583 -8.944 0.000864 *** • --- • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 • The estimates of α and β

  37. 12.2 Regression using R • More details can be obtained using the command summary • > summary(swim) • Coefficients: • Estimate Std. Error t value Pr(>|t|) • (Intercept) 1.900000 0.074642 25.455 1.41e-05 *** • time -0.085714 0.009583 -8.944 0.000864 *** • --- • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  38. 12.2 Regression using R • More details can be obtained using the command summary • > summary(swim) • Coefficients: • Estimate Std. Error t value Pr(>|t|) • (Intercept) 1.900000 0.074642 25.455 1.41e-05 *** • time -0.085714 0.009583 -8.944 0.000864 *** • --- • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 • the values of the t-statistics in tests of H0: α = 0 vs. H1: α ≠ 0 and H0: β = 0 vs. H1: β ≠ 0 • Note: these are 2-sided tests!

  39. 12.2 Regression using R • More details can be obtained using the command summary • > summary(swim) • Coefficients: • Estimate Std. Error t value Pr(>|t|) • (Intercept) 1.900000 0.074642 25.455 1.41e-05 *** • time -0.085714 0.009583 -8.944 0.000864 *** • --- • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 -0.085714/0.009583 = -8.944 to be compared with a t4;0.025 1.9/0.074642 = 25.455 to be compared with a t4;0.025

  40. 12.2 Regression using R • More details can be obtained using the command summary • > summary(swim) • Coefficients: • Estimate Std. Error t value Pr(>|t|) • (Intercept) 1.900000 0.074642 25.455 1.41e-05 *** • time -0.085714 0.009583 -8.944 0.000864 *** • --- • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 the p-values of these t-statistics, with a ‘star rating’: If the p-value is ‘***’, it means that it lies within (0,0.001) If the p-value is ‘**’, it means that it lies within (0.001,0.01) … If the p-value is ‘’, it means that it lies within (0.1,1)
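These p-values can be reproduced by hand from the t-statistics: with n − 2 = 4 degrees of freedom, the two-sided p-value is twice the tail area beyond the observed t value (a short sketch using the values printed above).

# Reproduce the two-sided p-values from the t statistics in the table
t_time <- -0.085714 / 0.009583       # t value for the slope
2 * pt(-abs(t_time), df = 4)         # 0.000864, matching Pr(>|t|) for time
t_int  <- 1.9 / 0.074642             # t value for the intercept
2 * pt(-abs(t_int), df = 4)          # 1.41e-05, matching Pr(>|t|) for (Intercept)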

  41. 12.2 Regression using R • More details can be obtained using the command summary • > summary(swim) • Residual standard error: 0.08018 on 4 degrees of freedom • The residual standard error is s, where s² is an unbiased estimator of σ², defined as s² = Σi (yi − ŷi)² / (n − 2)

  42. 12.2 Regression using R • More details can be obtained using the command summary • > summary(swim) • Multiple R-Squared: 0.9524, Adjusted R-squared: 0.9405 • F-statistic: 80 on 1 and 4 DF, p-value: 0.0008642 • The Multiple R-squared measures the quality/goodness of fit of the model; it is the squared (sample) correlation coefficient r² between x and Y, defined by r² = Sxy² / (Sxx Syy), where Syy = Σi (yi − ȳ)². It can be shown that 0 ≤ r² ≤ 1, and the multiple R-squared can be interpreted as the proportion of the variance in y1,…, yn which is explained by the model we use. The closer r² is to 1, the better the model explains the variation in the data (100% when r² = 1).
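For simple linear regression the multiple R-squared equals the squared sample correlation, so it can be checked in two equivalent ways (a short sketch using the chlorine fit from above):

# Two equivalent computations of the multiple R-squared
fit <- lm(chlorine ~ time)
cor(time, chlorine)^2                                            # 0.9524, squared sample correlation
1 - sum(residuals(fit)^2) / sum((chlorine - mean(chlorine))^2)   # the same value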

  43. 12.2 Regression using R • Now, how to obtain 95% confidence intervals for α and β? • We know that: • So we use the R commands: > alphalimits <- c(1.9+qt(0.025,6-2)*0.074642,1.9-qt(0.025,6-2)*0.074642) > alphalimits [1] 1.692761 2.107239 > betalimits<-c(-0.085714+qt(0.025,6-2)*0.009583,-0.085714-qt(0.025,6-2)*0.009583) > betalimits [1] -0.11232067 -0.05910733
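The same intervals (up to rounding of the hand-typed estimates and standard errors) can be obtained in one step with R's confint function:

confint(swim, level = 0.95)   # 95% confidence intervals for the intercept and slope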

  44. 12.2.1 Regression through the origin • Sometimes, we need to assume that the regression line passes through the origin, i.e. that α = 0 • To do this with R, you need to amend the code, and use -1 in the lm R command to show that the intercept should be removed > swimnoc <- lm(chlorine~time-1) > swimnoc Call: lm(formula = chlorine ~ time - 1) Coefficients: time 0.1335

  45. 12.2.1 Regression through the origin • Graphical Comparison between the standard regression and regression through the origin • > plot(time,chlorine,xlab="time",ylab="chlorine")

  46. 12.2.1 Regression through the origin • Graphical Comparison between the two alternatives, standard regression and regression through the origin • > abline(reg=swim) # add line corresp. to the standard regression

  47. 12.2.1 Regression through the origin • Graphical Comparison between the two alternatives, standard regression and regression through the origin • > abline(reg=swimnoc,lty=2) # add line corresp. to regression through origin

  48. 12.3 Confidence intervals and prediction intervals • One of the main objectives of regression is to make predictions, i.e. to find the value Y0 of the response variable corresponding to any new value x0 of the explanatory variable • Note that x0 may be one of x1,…, xn, but not necessarily • In other words, using the past, i.e. the observations y1,…, yn, and the fitted regression model, we’d like to predict the future Y0 given that x = x0 • We have two types of interval of interest • Confidence intervals for the mean E(Y0) • Prediction intervals for Y0

  49. 12.3 Confidence intervals and prediction intervals • Given a new value of the explanatory variable, x0, what is the predicted response? Easy – just ŷ0 = α̂ + β̂x0 • However, we need to distinguish between predictions of the future mean response E(Y0) and predictions of future observations Y0. • To make the distinction, suppose we have built a regression model that predicts the selling price of homes in a given area based on predictors like the number of bedrooms, closeness to a major highway, etc. • There are two kinds of predictions that can be made for a given x0.

  50. 12.3 Confidence intervals and prediction intervals • 1. Suppose we ask the question: “What would a house with characteristics x0 sell for on average?” This selling price is α + βx0 and is again predicted by α̂ + β̂x0, but now only the variance in the estimated regression parameters α̂ and β̂ needs to be taken into account. • 2. Suppose a new house comes on the market with characteristics x0. Its selling price will be α + βx0 + ε. Since E(ε) = 0, the predicted price is again α̂ + β̂x0, but in assessing the variance of this prediction, we must include the variance of ε.
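In R, both kinds of interval are produced by the predict function; a minimal sketch for the chlorine example, where the new value x0 = 7 hours is an arbitrary illustrative choice:

# Confidence interval for the mean response and prediction interval for a new observation
new <- data.frame(time = 7)                              # a new value x0 of the explanatory variable
predict(swim, newdata = new, interval = "confidence")    # interval for E(Y0), using the fit from Section 12.2
predict(swim, newdata = new, interval = "prediction")    # wider interval for Y0 itself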
