
Regression



Presentation Transcript


  1. Regression

  2. Georg Simon Ohm, 1789-1854

  3. [Slide content garbled in the transcript: a small data table whose values cannot be reliably recovered.]

  4. [The same data table, repeated.]

  5. [Plot residue; nothing recoverable beyond the axis origin.]

  6. Observed (Obs.) vs. expected (Expec.) values

  7. Gas-mileage efficiency

  8. Francis Galton (1822 – 1911)

  9. How can you get Jack’s beanstalk?

  10. Simple (linear) regression: the dependent variable decomposes into a genetic character (the model part) plus an acquired character (the residual, or noise). In symbols, y = β0 + β1·x + ε, where y is the dependent variable, x is the independent variable, β1 is the slope of the line, and ε is the noise (residual).
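      A minimal R sketch of this decomposition on simulated data (the coefficient values and noise level are illustrative, not from the slides):

      # dependent variable = model part + noise
      set.seed(1)                         # reproducible simulation
      x <- runif(50, 0, 10)               # independent variable
      y <- 2 + 0.5*x + rnorm(50, sd = 1)  # true line plus noise
      fit <- lm(y ~ x)                    # simple linear regression
      coef(fit)                           # estimated intercept and slope (near 2 and 0.5)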

  11. Question: what happens if all the xj’s are 1?
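      One way to see the answer in R (a sketch): a constant predictor is collinear with the intercept, so the slope is not estimable and the fit reduces to the sample mean.

      x <- rep(1, 10)      # all x_j equal to 1
      y <- rnorm(10)
      coef(lm(y ~ x))      # slope for x is NA (rank-deficient design matrix)
      mean(y)              # the fitted intercept equals the sample mean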

  12. Noise vs. residual: the model error ε is the (unobserved) noise; the estimated residual is e = y − ŷ.

  13. Simple (linear) regression

  14. Sums of squares: SST = SSR + SSE. The coefficient of determination is R² = SSR/SST = 1 − SSE/SST, summarized in the ANOVA table.
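      These quantities can be checked directly from any fitted model; a sketch using the setosa rows of the iris data (the same fit appears on the following slides):

      fit <- lm(Sepal.Length ~ Sepal.Width, data = iris[1:50, ])
      y   <- iris[1:50, "Sepal.Length"]
      SST <- sum((y - mean(y))^2)             # total sum of squares
      SSE <- sum(resid(fit)^2)                # residual (error) sum of squares
      SSR <- sum((fitted(fit) - mean(y))^2)   # regression sum of squares
      c(SST = SST, SSR.plus.SSE = SSR + SSE)  # identical: SST = SSR + SSE
      SSR / SST                               # coefficient of determination R^2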

  15. Fisher’s Iris data. [Scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species (setosa, versicolor, virginica).]

          S.Length S.Width P.Length P.Width Species
      1        5.1     3.5      1.4     0.2 setosa
      2        4.9     3.0      1.4     0.2 setosa
      ……………….
      49       5.3     3.7      1.5     0.2 setosa
      50       5.0     3.3      1.4     0.2 setosa
      51       7.0     3.2      4.7     1.4 versicolor
      52       6.4     3.2      4.5     1.5 versicolor
      ………………….
      99       6.2     2.9      4.3     1.3 versicolor
      100      5.7     2.8      4.1     1.3 versicolor
      101      6.3     3.3      6.0     2.5 virginica
      …………………
      150      5.9     3.0      5.1     1.8 virginica
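      The scatterplot matrix on this slide can be reproduced with one call (a sketch; the point colors are a choice, not taken from the slide):

      # one panel per pair of measurements, colored by species
      pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 16)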

  16. For setosa of Iris data

  17. For setosa of Iris data

      > setosa <- iris[1:50, 1:4]
      > head(setosa)
        Sepal.Length Sepal.Width Petal.Length Petal.Width
      1          5.1         3.5          1.4         0.2
      ….
      6          5.4         3.9          1.7         0.4
      > names(setosa) <- c("sl","sw","pl","pw")
      > head(setosa)
         sl  sw  pl  pw
      1 5.1 3.5 1.4 0.2
      ….
      6 5.4 3.9 1.7 0.4
      > plot(setosa$sw, setosa$sl, pch=16,
      +      xlab="sepal.width", ylab="sepal.length")
      > ( rout <- lm(sl ~ sw, data=setosa) )

      Call:
      lm(formula = sl ~ sw, data = setosa)

      Coefficients:
      (Intercept)           sw
           2.6390       0.6905

      > abline(rout, col="red")

  18. For setosa of Iris data

      > summary(rout)

      Call:
      lm(formula = sl ~ sw, data = setosa)

      Residuals:
           Min       1Q   Median       3Q      Max
      -0.52476 -0.16286  0.02166  0.13833  0.44428

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)   2.6390     0.3100   8.513 3.74e-11 ***
      sw            0.6905     0.0899   7.681 6.71e-10 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 0.2385 on 48 degrees of freedom
      Multiple R-squared: 0.5514, Adjusted R-squared: 0.542
      F-statistic: 58.99 on 1 and 48 DF, p-value: 6.71e-10
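      The numbers in this output can also be extracted programmatically (a sketch; rout is the fit from the previous slide):

      coef(summary(rout))       # coefficient table: estimates, SEs, t and p values
      sigma(rout)               # residual standard error (0.2385 above)
      summary(rout)$r.squared   # multiple R-squared (0.5514 above)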

  19. For setosa of Iris data

      > anova(rout)
      Analysis of Variance Table

      Response: sl
                Df Sum Sq Mean Sq F value   Pr(>F)
      sw         1 3.3569  3.3569  58.994 6.71e-10 ***
      Residuals 48 2.7313  0.0569
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      [Plot residue: only axis ticks survive in the transcript.]

  20. Linear regression / non-linear regression / linear after data transformation
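      For example, an exponential relationship y = a·exp(b·x) becomes linear after taking logs, log y = log a + b·x. A sketch with simulated data (all values illustrative):

      set.seed(2)
      x <- seq(1, 10, by = 0.5)
      y <- 3 * exp(0.4 * x) * exp(rnorm(length(x), sd = 0.1))  # multiplicative noise
      fit <- lm(log(y) ~ x)   # linear after the log transformation
      coef(fit)               # intercept near log(3), slope near 0.4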

  21. J. Kepler, 1571-1630; Tycho Brahe, 1546-1601

  22. James D. Forbes (1857)

  23. [Plot: log(Air Pressure) vs. Boiling Temperature (BT), Forbes data.]

      > install.packages("forward")
      > library(forward)
      > forbes
      > lm(log(pres) ~ bp, data=forbes)

      Call:
      lm(formula = log(pres) ~ bp, data = forbes)

      Coefficients:
      (Intercept)           bp
         -0.97087      0.02062

  24. No-intercept model example: Lawn roller data

      > lm(y ~ 0 + x)    # no intercept
      > lm(y ~ -1 + x)   # no intercept
      > lm(y ~ 1 + x)    # with intercept
      > lm(y ~ x)        # with intercept (default)

      > install.packages("DAAG")
      > install.packages("randomForest")
      > library(DAAG)
      > roller
      > rout0 <- lm(depression ~ -1 + weight, data=roller)   # no intercept
      > rout1 <- lm(depression ~ weight, data=roller)        # with intercept
      > with(roller, plot(weight, depression, pch=16, xlim=c(-1,14)))
      > abline(rout1)
      > abline(rout0, col="red")
      > points(0, 0, pch=16, col="red")

  25. No-intercept model example: Lawn roller data

      > anova(rout0)
      Analysis of Variance Table

      Response: depression
                Df  Sum Sq Mean Sq F value    Pr(>F)
      weight     1 2637.32  2637.3  63.862 2.233e-05 ***
      Residuals  9  371.68    41.3

      > anova(rout1)
      Analysis of Variance Table

      Response: depression
                Df Sum Sq Mean Sq F value   Pr(>F)
      weight     1 657.97  657.97  14.503 0.005175 **
      Residuals  8 362.93   45.37

      > roller
         weight depression
      1     1.9          2
      2     3.1          1
      3     3.3          5
      4     4.8          5
      5     5.3         20
      6     6.1         20
      7     6.4         23
      8     7.6         10
      9     9.8         30
      10   12.4         25

      The residual df for the no-intercept model is n−1 (here n=10); for the intercept model it is n−2. The SSE of the no-intercept model is always at least as large as the SSE of the intercept model, since the intercept model nests the no-intercept model. In many cases the intercept model has the smaller MSE, but for this lawn roller data the MSE of the no-intercept model is smaller, which suggests the no-intercept model may be the more suitable one.
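      The SSE and MSE comparison can be verified directly from the two fits (a sketch; for an lm object, deviance() returns the residual sum of squares):

      deviance(rout0)                        # SSE, no-intercept model (371.68)
      deviance(rout1)                        # SSE, intercept model (362.93)
      deviance(rout0) / df.residual(rout0)   # MSE with df = n-1 = 9 (41.3)
      deviance(rout1) / df.residual(rout1)   # MSE with df = n-2 = 8 (45.37)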

  26. No-intercept model example: Lawn roller data

      > summary(rout0)

      Call:
      lm(formula = depression ~ -1 + weight, data = roller)

      Coefficients:
             Estimate Std. Error t value Pr(>|t|)
      weight   2.3919     0.2993   7.991 2.23e-05 ***

      Residual standard error: 6.426 on 9 degrees of freedom
      Multiple R-squared: 0.8765, Adjusted R-squared: 0.8628
      F-statistic: 63.86 on 1 and 9 DF, p-value: 2.233e-05

      > summary(rout1)

      Call:
      lm(formula = depression ~ weight, data = roller)

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)  -2.0871     4.7543  -0.439  0.67227
      weight        2.6667     0.7002   3.808  0.00518 **

      Residual standard error: 6.735 on 8 degrees of freedom
      Multiple R-squared: 0.6445, Adjusted R-squared: 0.6001
      F-statistic: 14.5 on 1 and 8 DF, p-value: 0.005175

      The p-value of the no-intercept model is smaller than that of the intercept model for this lawn roller data.
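      A caveat worth adding (not on the slide): for a no-intercept model, R's summary() computes R-squared around zero rather than around the mean of y, so the two R-squared values above are not directly comparable. A sketch of what summary() actually computes:

      y <- roller$depression
      1 - deviance(rout0) / sum(y^2)              # 0.8765, sum of squares about zero
      1 - deviance(rout1) / sum((y - mean(y))^2)  # 0.6445, sum of squares about the mean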

  27. Multiple (linear) regression

  28. For versicolor of Iris data

      > vscolor <- iris[51:100, 1:4]
      > names(vscolor) <- c("sl","sw","pl","pw")
      > rout <- lm(sl ~ sw + pl, data=vscolor)
      > summary(rout)
      ….
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)   2.1164     0.4943   4.282 9.06e-05 ***
      sw            0.2476     0.1868   1.325    0.191
      pl            0.7356     0.1248   5.896 3.87e-07 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 0.3399 on 47 degrees of freedom
      Multiple R-squared: 0.5841, Adjusted R-squared: 0.5664
      F-statistic: 33.01 on 2 and 47 DF, p-value: 1.11e-09

  29. For versicolor of Iris data

      > summary( lm(sl ~ sw + pl + pw, data=vscolor) )

      Call:
      lm(formula = sl ~ sw + pl + pw, data = vscolor)

      Residuals:
          Min      1Q  Median      3Q     Max
      -0.7248 -0.2406 -0.0321  0.2958  0.5594

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)   1.8955     0.5071   3.738 0.000511 ***
      sw            0.3869     0.2045   1.891 0.064890 .
      pl            0.9083     0.1654   5.491 1.67e-06 ***
      pw           -0.6792     0.4354  -1.560 0.125599
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 0.3348 on 46 degrees of freedom
      Multiple R-squared: 0.605, Adjusted R-squared: 0.5793
      F-statistic: 23.49 on 3 and 46 DF, p-value: 2.28e-09
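      Once fitted, the model can be used for prediction on new measurements (a sketch; the values in newdata are made up for illustration):

      fit <- lm(sl ~ sw + pl + pw, data = vscolor)
      newdata <- data.frame(sw = 2.9, pl = 4.3, pw = 1.3)  # hypothetical flower
      predict(fit, newdata)                                # point prediction of sl
      predict(fit, newdata, interval = "prediction")       # with 95% prediction interval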

  30.
      > irs <- iris
      > names(irs) <- c("sl","sw","pl","pw","sp")
      > setosa <- irs[1:50,]; vscolor <- irs[51:100,]
      > vginica <- irs[101:150,]
      > with(irs, plot(sw, sl, xlab="sepal.width",
      +      ylab="sepal.length"))
      > with(setosa,  points(sw, sl, pch=16, col="red"))
      > with(vscolor, points(sw, sl, pch=16, col="blue"))
      > with(vginica, points(sw, sl, pch=16, col="sienna2"))
      > rout1 <- lm(sl ~ sw, data=setosa)
      > rout2 <- lm(sl ~ sw, data=vscolor)
      > rout3 <- lm(sl ~ sw, data=vginica)
      > cfx <- coef(lm(sl ~ sw + sp, data=irs))
      > cf1 <- c( cfx[1],          cfx[2] )
      > cf2 <- c( cfx[1] + cfx[3], cfx[2] )
      > cf3 <- c( cfx[1] + cfx[4], cfx[2] )
      > lines(c(2.5,4.3), c(sum(cf1*c(1,2.5)), sum(cf1*c(1,4.3))),
      +       lwd=4, col="pink")
      > lines(c(2.0,3.5), c(sum(cf2*c(1,2.0)), sum(cf2*c(1,3.5))),
      +       lwd=4, col="skyblue")
      > lines(c(2.2,3.9), c(sum(cf3*c(1,2.2)), sum(cf3*c(1,3.9))),
      +       lwd=4, col="tan1")

      The same slopes, but different intercepts.

  31. sp is a factor (categorical) variable with three levels: setosa, versicolor, virginica. R encodes it with dummy variables: one dummy for versicolor (1 for versicolor, 0 otherwise) and one for virginica (1 for virginica, 0 otherwise); setosa is the baseline level.
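      R constructs these dummy variables automatically; a sketch inspecting the coding (irs is the renamed iris data from the previous slide):

      contrasts(irs$sp)                     # treatment coding: setosa is the baseline
      head(model.matrix(~ sp, data = irs))  # intercept + dummies for versicolor, virginica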

  32. sp is a factor variable for 3 groups.

      > is.factor(irs$sp)
      [1] TRUE
      > is.factor(irs$sw)
      [1] FALSE
      > rout <- lm(sl ~ sw + sp, data=irs)
      > coef(rout)
       (Intercept)           sw spversicolor  spvirginica
         2.2513932    0.8035609    1.4587431    1.9468166
      > summary(rout)

      Call:
      lm(formula = sl ~ sw + sp, data = irs)

      Residuals:
           Min       1Q   Median       3Q      Max
      -1.30711 -0.25713 -0.05325  0.19542  1.41253

      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
      (Intercept)    2.2514     0.3698   6.089 9.57e-09 ***
      sw             0.8036     0.1063   7.557 4.19e-12 ***
      spversicolor   1.4587     0.1121  13.012  < 2e-16 ***
      spvirginica    1.9468     0.1000  19.465  < 2e-16 ***

      Residual standard error: 0.438 on 146 degrees of freedom
      Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
      F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16

      > anova(rout)
      Analysis of Variance Table

      Response: sl
                 Df Sum Sq Mean Sq  F value   Pr(>F)
      sw          1  1.412   1.412   7.3628  0.00746 **
      sp          2 72.752  36.376 189.6512  < 2e-16 ***
      Residuals 146 28.004   0.192

      The sp row has df = K − 1 = 2 because the number of groups is K = 3 (3 species).

  33. Regression for several groups in one command, using a factor variable

      > coef( lm(sl ~ sp*sw, data=irs) )
         (Intercept)    spversicolor     spvirginica              sw
           2.6390012       0.9007335       1.2678352       0.6904897
      spversicolor:sw  spvirginica:sw
           0.1745880       0.2110448
      > ( csetosa <- coef( lm(sl ~ sw, data=setosa) ) )
      (Intercept)          sw
        2.6390012   0.6904897
      > coef( lm(sl ~ sw, data=vscolor) ) - csetosa
      (Intercept)          sw
        0.9007335   0.1745880
      > coef( lm(sl ~ sw, data=vginica) ) - csetosa
      (Intercept)          sw
        1.2678352   0.2110448

      The interaction model sl ~ sp*sw reproduces the per-group fits: the sp dummies shift the intercept, and the sp:sw terms shift the slope, relative to the setosa baseline.
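      Whether the separate slopes are really needed can be tested by comparing the nested models with an F test (a sketch):

      fit_common <- lm(sl ~ sw + sp, data = irs)  # common slope, different intercepts
      fit_sep    <- lm(sl ~ sp * sw, data = irs)  # different slopes and intercepts
      anova(fit_common, fit_sep)                  # F test for the interaction terms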

  34. Thank you!!
