1. SM339 Spring 05 - Simple Regr: Regression
So far, we have considered single populations or have compared 2 or more populations
Now we consider relationships between different variables
2. Regression
The simplest relationship is a linear relation between 2 variables
Y = b0 + b1*X + e
e is normal, with mean 0 and unknown SD
We need to estimate b0 and b1 from the data
3. Regression
The usual criterion for “best fit” is “least squares”
Find b0, b1 to minimize SUM(y-(b0+b1*x))^2
Note that this is the vertical sum of squares
Same as min SUM e^2
4. Regression
Matlab has this calculation built in
Suppose we have two column vectors, x and y
XX=[ones(size(x)) x];
puts a column of ones on the left so that XX has 2 cols
b=XX\y will be the (two) coefficients
yh=XX*b will be the model values
plot(x,y,'o',x,yh,'-');grid
will plot the data and the line
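The backslash step solves the least-squares problem. A minimal pure-Python sketch of the same fit, via the closed-form slope b1 = Sxy/Sxx (Python standing in for Matlab; data and the helper name `ls_fit` are made up for illustration):

```python
# Least-squares fit of y = b0 + b1*x, mirroring Matlab's b = XX\y,
# using the closed-form solution of the normal equations.

def ls_fit(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    Sxx = sum((xi - xbar) ** 2 for xi in x)
    Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = Sxy / Sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
b0, b1 = ls_fit(x, y)
yh = [b0 + b1 * xi for xi in x]     # model (fitted) values
```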
5. Regression
The LS line minimizes SUM(y-yh)^2
Call this quantity SSE
For a column vector x, SS = x'*x
We would like to know if SSE is small
Can compare it to the horizontal line (slope=0) where yh=yavg
SSR = SUM(yh-yavg)^2
For SSE, df=N-2
For SSR, df=1
Calculate MS and F and test
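The steps above can be sketched in pure Python (again standing in for the deck's Matlab, with made-up illustrative data):

```python
# ANOVA-style test of the regression: compare SSE (scatter about the
# LS line) with SSR (improvement over the horizontal line yh = ybar).
x = [1, 2, 3, 4, 5]
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

yh = [b0 + b1 * xi for xi in x]
SSE = sum((yi - yhi) ** 2 for yi, yhi in zip(y, yh))   # df = n - 2
SSR = sum((yhi - ybar) ** 2 for yhi in yh)             # df = 1
MSE = SSE / (n - 2)
F = (SSR / 1) / MSE    # compare to the F(1, n-2) distribution
```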
6. Regression
Note that F measures the relative sizes of SSR and SSE
SSR+SSE = SSTotal = SUM(y-Yavg)^2
So we might want to know how much of SSTotal is SSR and how much is SSE
R^2 = Index of determination = SSR/SSTotal
Large is good because we want SSE to be small
Not simple to interpret, because it is the fraction of the squared variation that the line explains
(We’ll be back to R^2)
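Continuing the same illustrative sketch, R^2 falls out of the sums of squares directly:

```python
# R^2 as the share of SSTotal explained by the regression.
x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
SSTotal = sum((yi - ybar) ** 2 for yi in y)
R2 = 1 - SSE / SSTotal    # equivalently SSR / SSTotal, since SSR + SSE = SSTotal
```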
7. Regression
We can also test the slope directly
The slope is a linear combination of the data and so has a normal distn
SD(slope) = SD(data)/sqrt(Sxx)
Sxx=sum(x-xavg)^2
Use sqrt(MSE)=RMSE to estimate SD(data)
Since we estimated SD, use t distn with df=N-2
Note that t^2=F
Pattern for df is df=N-# parameters estimated
For regression, we estimate slope and intercept
For t test, we only estimate the mean
8. Regression
We can also find confidence bounds for the slope
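A sketch of the interval, b1 ± t* × RMSE/sqrt(Sxx), using the slope SD from slide 7. The data are made up, and the critical value t* = 3.182 (two-sided 95%, df = n-2 = 3) is taken from a standard t table:

```python
# 95% confidence bounds for the slope.
import math

x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
RMSE = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                     for xi, yi in zip(x, y)) / (n - 2))

tstar = 3.182                       # t critical value (table), df = 3, 95% two-sided
half_width = tstar * RMSE / math.sqrt(Sxx)
lo, hi = b1 - half_width, b1 + half_width
```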
9. Regression
The precise values of slope and intercept depend on the sample that we use
If we used different samples (from the same population), then we would get some variation in slope and intercept
(We have already found the SD(slope))
10. Regression
For a given value of X, what would Y be?
Two answers
(1) What about the mean of the Y’s for this particular X?
(2) What about a particular Y for this value of X?
11. Regression
We would use the regression line to estimate the mean value of Y for a particular value of X
But the line is based on a sample. We need to account for the variability of our estimates.
When we let X=Xi, then the SD of the mean Y is:
SD*sqrt( 1/N + (Xi-Xavg)^2/Sxx )
If our Xi is near the middle of our X’s, then our estimate is less variable
If Xi is near the extremes, then our estimate is more variable
12. Regression
This only accounts for the variability in our estimate of the line
If we had a situation where X=Xi, what might Y be?
We know that Y might be above or below the line
SD(predicted) = SD*sqrt( 1 + 1/N + (Xi-Xavg)^2/Sxx )
The “1+” is the additional variation about the line
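Both SDs can be sketched at a chosen Xi (made-up data; the formulas include the 1/N term of the standard expressions, with RMSE estimating the error SD):

```python
# SD of the estimated mean response vs. SD of a new predicted Y at X = Xi.
import math

x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
RMSE = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                     for xi, yi in zip(x, y)) / (n - 2))

Xi = 5.0    # near the extreme of the x's, so both estimates are more variable
se_mean = RMSE * math.sqrt(1 / n + (Xi - xbar) ** 2 / Sxx)
se_pred = RMSE * math.sqrt(1 + 1 / n + (Xi - xbar) ** 2 / Sxx)  # the "1+" adds
                                                                # scatter about the line
```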
13. Correlation
Suppose we consider X to be a linear function of Y
Y = b0 + b1 X
X = c0 + c1 Y
NOT true that c1=1/b1
Because the two lines are fit by different criteria
The first line minimizes squared differences in the Y direction
The second line minimizes squared differences in the X direction
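This is easy to check numerically (made-up data; Python standing in for Matlab):

```python
# The slope of x-on-y is not 1/(slope of y-on-x): each line minimizes
# squared errors in a different direction.
x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx    # slope of y on x
c1 = Sxy / Syy    # slope of x on y
# b1*c1 = r^2 <= 1, so c1 equals 1/b1 only when the points are exactly collinear
```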
14. Correlation
What if we don’t know which model to use?
Correlation measures the degree to which X and Y are related
But not necessarily X as a fn of Y or Y as a fn of X
Can think of correlation as a generalization of slope which does not have units
15-16. Correlation
Define Sxy = SUM (x-xavg)(y-yavg), and Syy as we did Sxx
Then b1 = Sxy/Sxx
Note the units are y/x
c1 = Sxy/Syy (again, note the units)
Define the (Pearson) correlation r = Sxy/sqrt(Sxx Syy)
Note R^2 = b1*c1
(Here, R^2 is either the correlation squared or the index of determination. They are the same value.)
So, if the slopes are reciprocals, then R^2 = 1
R^2 = 1 means that SSE = 0, so the points fall exactly on a line
17. Correlation
We can also do tests on r
SD(r) = sqrt((1-r^2)/(n-2))
But this depends on our data
So need a t distn
Df=N-2
Same as for slope
Same as for SSE
18. Correlation
To test if r=0, we could compute
(r - 0) / sqrt((1-r^2)/(n-2))
This is the same as the test for slope=0
(EITHER slope)
And if we square it, we get F
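Both routes to the statistic can be sketched side by side (made-up data, same illustrative numbers as earlier sketches):

```python
# Test of r = 0: t = (r - 0)/sqrt((1 - r^2)/(n - 2)); this equals the
# t statistic for slope = 0, and squaring it recovers F.
import math

x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = Sxy / math.sqrt(Sxx * Syy)
t = r / math.sqrt((1 - r ** 2) / (n - 2))       # test statistic for r = 0

# Same statistic via the slope: t_slope = b1 / (RMSE/sqrt(Sxx))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
t_slope = b1 / (math.sqrt(SSE / (n - 2)) / math.sqrt(Sxx))
```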
19. Other models
Suppose we leave X out of the model
Y = b0
The estimate is b0 = yavg
The test for b0 is the same as the t test for the mean
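A short sketch of this reduction (made-up data; with one parameter estimated, df = N-1, matching the pattern on slide 7):

```python
# With X dropped, the LS estimate of b0 is the sample mean of y,
# and testing b0 = 0 is the one-sample t test for the mean.
import math

y = [2.0, 2.9, 4.1, 4.9, 6.2]           # illustrative data
n = len(y)
b0 = sum(y) / n                         # LS estimate for the model Y = b0
SSE = sum((yi - b0) ** 2 for yi in y)   # df = n - 1: one parameter estimated
s = math.sqrt(SSE / (n - 1))            # sample SD
t = b0 / (s / math.sqrt(n))             # one-sample t statistic for mean = 0
```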