
Regression



Presentation Transcript


  1. Regression

  2. Georg Simon Ohm, 1789-1854

  3. [Slide content garbled in the transcript: a small data table whose values cannot be reliably recovered.]

  4. [The same data table, repeated.]

  5. [Plot residue; nothing recoverable beyond the axis origin.]

  6. Observed (Obs.) vs. expected (Expec.) values

  7. Gas-mileage efficiency

  8. Francis Galton (1822 – 1911)

  9. How can you get Jack’s beanstalk?

  10. Simple (linear) regression: the dependent variable decomposes into a genetic character (the model part) plus an acquired character (the residual, or noise). In symbols, y = β0 + β1·x + ε, where y is the dependent variable, x is the independent variable, β1 is the slope of the line, and ε is the noise (residual).
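      A minimal R sketch of this decomposition on simulated data (the coefficient values and noise level are illustrative, not from the slides):

      # dependent variable = model part + noise
      set.seed(1)                         # reproducible simulation
      x <- runif(50, 0, 10)               # independent variable
      y <- 2 + 0.5*x + rnorm(50, sd = 1)  # true line plus noise
      fit <- lm(y ~ x)                    # simple linear regression
      coef(fit)                           # estimated intercept and slope (near 2 and 0.5)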

  11. Question: what happens if all the xj’s are 1?
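      One way to see the answer in R (a sketch): a constant predictor is collinear with the intercept, so the slope is not estimable and the fit reduces to the sample mean.

      x <- rep(1, 10)      # all x_j equal to 1
      y <- rnorm(10)
      coef(lm(y ~ x))      # slope for x is NA (rank-deficient design matrix)
      mean(y)              # the fitted intercept equals the sample mean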

  12. Noise vs. residual: the model error ε is the (unobserved) noise; the estimated residual is e = y − ŷ.

  13. Simple (linear) regression

  14. Sums of squares: SST = SSR + SSE. The coefficient of determination is R² = SSR/SST = 1 − SSE/SST, summarized in the ANOVA table.
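      These quantities can be checked directly from any fitted model; a sketch using the setosa rows of the iris data (the same fit appears on the following slides):

      fit <- lm(Sepal.Length ~ Sepal.Width, data = iris[1:50, ])
      y   <- iris[1:50, "Sepal.Length"]
      SST <- sum((y - mean(y))^2)             # total sum of squares
      SSE <- sum(resid(fit)^2)                # residual (error) sum of squares
      SSR <- sum((fitted(fit) - mean(y))^2)   # regression sum of squares
      c(SST = SST, SSR.plus.SSE = SSR + SSE)  # identical: SST = SSR + SSE
      SSR / SST                               # coefficient of determination R^2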

  15. Fisher’s Iris data. [Scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species (setosa, versicolor, virginica).]

          S.Length S.Width P.Length P.Width Species
      1        5.1     3.5      1.4     0.2 setosa
      2        4.9     3.0      1.4     0.2 setosa
      ……………….
      49       5.3     3.7      1.5     0.2 setosa
      50       5.0     3.3      1.4     0.2 setosa
      51       7.0     3.2      4.7     1.4 versicolor
      52       6.4     3.2      4.5     1.5 versicolor
      ………………….
      99       6.2     2.9      4.3     1.3 versicolor
      100      5.7     2.8      4.1     1.3 versicolor
      101      6.3     3.3      6.0     2.5 virginica
      …………………
      150      5.9     3.0      5.1     1.8 virginica
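      The scatterplot matrix on this slide can be reproduced with one call (a sketch; the point colors are a choice, not taken from the slide):

      # one panel per pair of measurements, colored by species
      pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 16)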

  16. For setosa of Iris data

  17. For setosa of Iris data

      > setosa <- iris[1:50, 1:4]
      > head(setosa)
        Sepal.Length Sepal.Width Petal.Length Petal.Width
      1          5.1         3.5          1.4         0.2
      ….
      6          5.4         3.9          1.7         0.4
      > names(setosa) <- c("sl","sw","pl","pw")
      > head(setosa)
         sl  sw  pl  pw
      1 5.1 3.5 1.4 0.2
      ….
      6 5.4 3.9 1.7 0.4
      > plot(setosa$sw, setosa$sl, pch=16,
      +      xlab="sepal.width", ylab="sepal.length")
      > ( rout <- lm(sl ~ sw, data=setosa) )

      Call:
      lm(formula = sl ~ sw, data = setosa)

      Coefficients:
      (Intercept)           sw
           2.6390       0.6905

      > abline(rout, col="red")

  18. For setosa of Iris data

      > summary(rout)

      Call:
      lm(formula = sl ~ sw, data = setosa)

      Residuals:
           Min       1Q   Median       3Q      Max
      -0.52476 -0.16286  0.02166  0.13833  0.44428

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)   2.6390     0.3100   8.513 3.74e-11 ***
      sw            0.6905     0.0899   7.681 6.71e-10 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 0.2385 on 48 degrees of freedom
      Multiple R-squared: 0.5514, Adjusted R-squared: 0.542
      F-statistic: 58.99 on 1 and 48 DF, p-value: 6.71e-10
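      The numbers in this output can also be extracted programmatically (a sketch; rout is the fit from the previous slide):

      coef(summary(rout))       # coefficient table: estimates, SEs, t and p values
      sigma(rout)               # residual standard error (0.2385 above)
      summary(rout)$r.squared   # multiple R-squared (0.5514 above)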

  19. For setosa of Iris data

      > anova(rout)
      Analysis of Variance Table

      Response: sl
                Df Sum Sq Mean Sq F value   Pr(>F)
      sw         1 3.3569  3.3569  58.994 6.71e-10 ***
      Residuals 48 2.7313  0.0569
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      [Plot residue: only axis ticks survive in the transcript.]

  20. Linear regression / non-linear regression / linear after data transformation
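      For example, an exponential relationship y = a·exp(b·x) becomes linear after taking logs, log y = log a + b·x. A sketch with simulated data (all values illustrative):

      set.seed(2)
      x <- seq(1, 10, by = 0.5)
      y <- 3 * exp(0.4 * x) * exp(rnorm(length(x), sd = 0.1))  # multiplicative noise
      fit <- lm(log(y) ~ x)   # linear after the log transformation
      coef(fit)               # intercept near log(3), slope near 0.4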

  21. J. Kepler, 1571-1630; Tycho Brahe, 1546-1601

  22. James D. Forbes (1857)

  23. [Plot: log(Air Pressure) vs. Boiling Temperature (BT), Forbes data.]

      > install.packages("forward")
      > library(forward)
      > forbes
      > lm(log(pres) ~ bp, data=forbes)

      Call:
      lm(formula = log(pres) ~ bp, data = forbes)

      Coefficients:
      (Intercept)           bp
         -0.97087      0.02062

  24. No-intercept model example: Lawn roller data

      > lm(y ~ 0 + x)    # no intercept
      > lm(y ~ -1 + x)   # no intercept
      > lm(y ~ 1 + x)    # with intercept
      > lm(y ~ x)        # with intercept (default)

      > install.packages("DAAG")
      > install.packages("randomForest")
      > library(DAAG)
      > roller
      > rout0 <- lm(depression ~ -1 + weight, data=roller)   # no intercept
      > rout1 <- lm(depression ~ weight, data=roller)        # with intercept
      > with(roller, plot(weight, depression, pch=16, xlim=c(-1,14)))
      > abline(rout1)
      > abline(rout0, col="red")
      > points(0, 0, pch=16, col="red")

  25. No-intercept model example: Lawn roller data

      > anova(rout0)
      Analysis of Variance Table

      Response: depression
                Df  Sum Sq Mean Sq F value    Pr(>F)
      weight     1 2637.32  2637.3  63.862 2.233e-05 ***
      Residuals  9  371.68    41.3

      > anova(rout1)
      Analysis of Variance Table

      Response: depression
                Df Sum Sq Mean Sq F value   Pr(>F)
      weight     1 657.97  657.97  14.503 0.005175 **
      Residuals  8 362.93   45.37

      > roller
         weight depression
      1     1.9          2
      2     3.1          1
      3     3.3          5
      4     4.8          5
      5     5.3         20
      6     6.1         20
      7     6.4         23
      8     7.6         10
      9     9.8         30
      10   12.4         25

      The residual df for the no-intercept model is n−1 (here n=10); for the intercept model it is n−2. The SSE of the no-intercept model is always at least as large as the SSE of the intercept model, since the intercept model nests the no-intercept model. In many cases the intercept model has the smaller MSE, but for this lawn roller data the MSE of the no-intercept model is smaller, which suggests the no-intercept model may be the more suitable one.
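      The SSE and MSE comparison can be verified directly from the two fits (a sketch; for an lm object, deviance() returns the residual sum of squares):

      deviance(rout0)                        # SSE, no-intercept model (371.68)
      deviance(rout1)                        # SSE, intercept model (362.93)
      deviance(rout0) / df.residual(rout0)   # MSE with df = n-1 = 9 (41.3)
      deviance(rout1) / df.residual(rout1)   # MSE with df = n-2 = 8 (45.37)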

  26. No-intercept model example: Lawn roller data

      > summary(rout0)

      Call:
      lm(formula = depression ~ -1 + weight, data = roller)

      Coefficients:
             Estimate Std. Error t value Pr(>|t|)
      weight   2.3919     0.2993   7.991 2.23e-05 ***

      Residual standard error: 6.426 on 9 degrees of freedom
      Multiple R-squared: 0.8765, Adjusted R-squared: 0.8628
      F-statistic: 63.86 on 1 and 9 DF, p-value: 2.233e-05

      > summary(rout1)

      Call:
      lm(formula = depression ~ weight, data = roller)

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)  -2.0871     4.7543  -0.439  0.67227
      weight        2.6667     0.7002   3.808  0.00518 **

      Residual standard error: 6.735 on 8 degrees of freedom
      Multiple R-squared: 0.6445, Adjusted R-squared: 0.6001
      F-statistic: 14.5 on 1 and 8 DF, p-value: 0.005175

      The p-value of the no-intercept model is smaller than that of the intercept model for this lawn roller data.
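      A caveat worth adding (not on the slide): for a no-intercept model, R's summary() computes R-squared around zero rather than around the mean of y, so the two R-squared values above are not directly comparable. A sketch of what summary() actually computes:

      y <- roller$depression
      1 - deviance(rout0) / sum(y^2)              # 0.8765, sum of squares about zero
      1 - deviance(rout1) / sum((y - mean(y))^2)  # 0.6445, sum of squares about the mean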

  27. Multiple (linear) regression

  28. For versicolor of Iris data

      > vscolor <- iris[51:100, 1:4]
      > names(vscolor) <- c("sl","sw","pl","pw")
      > rout <- lm(sl ~ sw + pl, data=vscolor)
      > summary(rout)
      ….
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)   2.1164     0.4943   4.282 9.06e-05 ***
      sw            0.2476     0.1868   1.325    0.191
      pl            0.7356     0.1248   5.896 3.87e-07 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 0.3399 on 47 degrees of freedom
      Multiple R-squared: 0.5841, Adjusted R-squared: 0.5664
      F-statistic: 33.01 on 2 and 47 DF, p-value: 1.11e-09

  29. For versicolor of Iris data

      > summary( lm(sl ~ sw + pl + pw, data=vscolor) )

      Call:
      lm(formula = sl ~ sw + pl + pw, data = vscolor)

      Residuals:
          Min      1Q  Median      3Q     Max
      -0.7248 -0.2406 -0.0321  0.2958  0.5594

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)   1.8955     0.5071   3.738 0.000511 ***
      sw            0.3869     0.2045   1.891 0.064890 .
      pl            0.9083     0.1654   5.491 1.67e-06 ***
      pw           -0.6792     0.4354  -1.560 0.125599
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 0.3348 on 46 degrees of freedom
      Multiple R-squared: 0.605, Adjusted R-squared: 0.5793
      F-statistic: 23.49 on 3 and 46 DF, p-value: 2.28e-09
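      Once fitted, the model can be used for prediction on new measurements (a sketch; the values in newdata are made up for illustration):

      fit <- lm(sl ~ sw + pl + pw, data = vscolor)
      newdata <- data.frame(sw = 2.9, pl = 4.3, pw = 1.3)  # hypothetical flower
      predict(fit, newdata)                                # point prediction of sl
      predict(fit, newdata, interval = "prediction")       # with 95% prediction interval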

  30.
      > irs <- iris
      > names(irs) <- c("sl","sw","pl","pw","sp")
      > setosa <- irs[1:50,]; vscolor <- irs[51:100,]
      > vginica <- irs[101:150,]
      > with(irs, plot(sw, sl, xlab="sepal.width",
      +      ylab="sepal.length"))
      > with(setosa,  points(sw, sl, pch=16, col="red"))
      > with(vscolor, points(sw, sl, pch=16, col="blue"))
      > with(vginica, points(sw, sl, pch=16, col="sienna2"))
      > rout1 <- lm(sl ~ sw, data=setosa)
      > rout2 <- lm(sl ~ sw, data=vscolor)
      > rout3 <- lm(sl ~ sw, data=vginica)
      > cfx <- coef(lm(sl ~ sw + sp, data=irs))
      > cf1 <- c( cfx[1],          cfx[2] )
      > cf2 <- c( cfx[1] + cfx[3], cfx[2] )
      > cf3 <- c( cfx[1] + cfx[4], cfx[2] )
      > lines(c(2.5,4.3), c(sum(cf1*c(1,2.5)), sum(cf1*c(1,4.3))),
      +       lwd=4, col="pink")
      > lines(c(2.0,3.5), c(sum(cf2*c(1,2.0)), sum(cf2*c(1,3.5))),
      +       lwd=4, col="skyblue")
      > lines(c(2.2,3.9), c(sum(cf3*c(1,2.2)), sum(cf3*c(1,3.9))),
      +       lwd=4, col="tan1")

      The same slopes, but different intercepts.

  31. sp is a factor (categorical) variable with three levels: setosa, versicolor, virginica. R encodes it with dummy variables: one dummy for versicolor (1 for versicolor, 0 otherwise) and one for virginica (1 for virginica, 0 otherwise); setosa is the baseline level.
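      R constructs these dummy variables automatically; a sketch inspecting the coding (irs is the renamed iris data from the previous slide):

      contrasts(irs$sp)                     # treatment coding: setosa is the baseline
      head(model.matrix(~ sp, data = irs))  # intercept + dummies for versicolor, virginica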

  32. sp is a factor variable for 3 groups.

      > is.factor(irs$sp)
      [1] TRUE
      > is.factor(irs$sw)
      [1] FALSE
      > rout <- lm(sl ~ sw + sp, data=irs)
      > coef(rout)
       (Intercept)           sw spversicolor  spvirginica
         2.2513932    0.8035609    1.4587431    1.9468166
      > summary(rout)

      Call:
      lm(formula = sl ~ sw + sp, data = irs)

      Residuals:
           Min       1Q   Median       3Q      Max
      -1.30711 -0.25713 -0.05325  0.19542  1.41253

      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
      (Intercept)    2.2514     0.3698   6.089 9.57e-09 ***
      sw             0.8036     0.1063   7.557 4.19e-12 ***
      spversicolor   1.4587     0.1121  13.012  < 2e-16 ***
      spvirginica    1.9468     0.1000  19.465  < 2e-16 ***

      Residual standard error: 0.438 on 146 degrees of freedom
      Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
      F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16

      > anova(rout)
      Analysis of Variance Table

      Response: sl
                 Df Sum Sq Mean Sq  F value   Pr(>F)
      sw          1  1.412   1.412   7.3628  0.00746 **
      sp          2 72.752  36.376 189.6512  < 2e-16 ***
      Residuals 146 28.004   0.192

      The sp row has df = K − 1 = 2 because the number of groups is K = 3 (3 species).

  33. Regression for several groups in one command, using a factor variable

      > coef( lm(sl ~ sp*sw, data=irs) )
         (Intercept)    spversicolor     spvirginica              sw
           2.6390012       0.9007335       1.2678352       0.6904897
      spversicolor:sw  spvirginica:sw
           0.1745880       0.2110448
      > ( csetosa <- coef( lm(sl ~ sw, data=setosa) ) )
      (Intercept)          sw
        2.6390012   0.6904897
      > coef( lm(sl ~ sw, data=vscolor) ) - csetosa
      (Intercept)          sw
        0.9007335   0.1745880
      > coef( lm(sl ~ sw, data=vginica) ) - csetosa
      (Intercept)          sw
        1.2678352   0.2110448

      The interaction model sl ~ sp*sw reproduces the per-group fits: the sp dummies shift the intercept, and the sp:sw terms shift the slope, relative to the setosa baseline.
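      Whether the separate slopes are really needed can be tested by comparing the nested models with an F test (a sketch):

      fit_common <- lm(sl ~ sw + sp, data = irs)  # common slope, different intercepts
      fit_sep    <- lm(sl ~ sp * sw, data = irs)  # different slopes and intercepts
      anova(fit_common, fit_sep)                  # F test for the interaction terms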

  34. Thank you!!
