1. SM339 Spring 05 - Simple Regr: Regression
So far, we have considered single populations or have compared 2 or more populations
Now we consider relationships between different variables
2. Regression
The simplest relationship is a linear relation between 2 variables
Y = b0 + b1*X + e
e is normal, with mean 0 and unknown SD
We need to estimate b0 and b1 from the data
3. Regression
The usual criterion for “best fit” is “least squares”
Find b0, b1 to minimize SUM(y-(b0+b1*x))^2
Note that this is the vertical sum of squares
Same as min SUM e^2
4. Regression
Matlab has this calculation built in
Suppose we have two column vectors, x and y
XX=[ones(size(x)) x];
puts a column of ones on the left so that XX has 2 cols
b=XX\y will be the (two) coefficients
yh=XX*b will be the model values
plot(x,y,'o',x,yh,'-');grid
will plot the data and the line
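The backslash step solves the least-squares problem. A minimal pure-Python sketch of the same fit, via the closed-form slope b1 = Sxy/Sxx (Python standing in for Matlab; data and the helper name `ls_fit` are made up for illustration):

```python
# Least-squares fit of y = b0 + b1*x, mirroring Matlab's b = XX\y,
# using the closed-form solution of the normal equations.

def ls_fit(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    Sxx = sum((xi - xbar) ** 2 for xi in x)
    Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = Sxy / Sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
b0, b1 = ls_fit(x, y)
yh = [b0 + b1 * xi for xi in x]     # model (fitted) values
```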
5. Regression
The LS line minimizes SUM(y-yh)^2
Call this quantity SSE
For a column vector x, SS = x'*x
We would like to know if SSE is small
Can compare it to the horizontal line (slope=0) where yh=yavg
SSR = SUM(yh-yavg)^2
For SSE, df=N-2
For SSR, df=1
Calculate MS and F and test
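The steps above can be sketched in pure Python (again standing in for the deck's Matlab, with made-up illustrative data):

```python
# ANOVA-style test of the regression: compare SSE (scatter about the
# LS line) with SSR (improvement over the horizontal line yh = ybar).
x = [1, 2, 3, 4, 5]
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

yh = [b0 + b1 * xi for xi in x]
SSE = sum((yi - yhi) ** 2 for yi, yhi in zip(y, yh))   # df = n - 2
SSR = sum((yhi - ybar) ** 2 for yhi in yh)             # df = 1
MSE = SSE / (n - 2)
F = (SSR / 1) / MSE    # compare to the F(1, n-2) distribution
```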
6. Regression
Note that F measures the relative sizes of SSR and SSE
SSR+SSE = SSTotal = SUM(y-Yavg)^2
So we might want to know how much of SSTotal is SSR and how much is SSE
R^2 = Index of determination = SSR/SSTotal
Large is good because we want SSE to be small
Not simple to interpret, because it is the fraction of the squared variation that the line explains
(We’ll be back to R^2)
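Continuing the same illustrative sketch, R^2 falls out of the sums of squares directly:

```python
# R^2 as the share of SSTotal explained by the regression.
x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
SSTotal = sum((yi - ybar) ** 2 for yi in y)
R2 = 1 - SSE / SSTotal    # equivalently SSR / SSTotal, since SSR + SSE = SSTotal
```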
7. Regression
We can also test the slope directly
The slope is a linear combination of the data and so has a normal distn
SD(slope) = SD(data)/sqrt(Sxx)
Sxx=sum(x-xavg)^2
Use sqrt(MSE)=RMSE to estimate SD(data)
Since we estimated SD, use t distn with df=N-2
Note that t^2=F
Pattern for df is df=N-# parameters estimated
For regression, we estimate slope and intercept
For t test, we only estimate the mean
8. Regression
We can also find confidence bounds for the slope
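A sketch of the interval, b1 ± t* × RMSE/sqrt(Sxx), using the slope SD from slide 7. The data are made up, and the critical value t* = 3.182 (two-sided 95%, df = n-2 = 3) is taken from a standard t table:

```python
# 95% confidence bounds for the slope.
import math

x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
RMSE = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                     for xi, yi in zip(x, y)) / (n - 2))

tstar = 3.182                       # t critical value (table), df = 3, 95% two-sided
half_width = tstar * RMSE / math.sqrt(Sxx)
lo, hi = b1 - half_width, b1 + half_width
```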
9. Regression
The precise values of slope and intercept depend on the sample that we use
If we used different samples (from the same population), then we would get some variation in slope and intercept
(We have already found the SD(slope))
10. Regression
For a given value of X, what would Y be?
Two answers
(1) What about the mean of the Y’s for this particular X?
(2) What about a particular Y for this value of X?
11. Regression
We would use the regression line to estimate the mean value of Y for a particular value of X
But the line is based on a sample. We need to account for the variability of our estimates.
When we let X=Xi, then the SD of the mean Y is:
SD*sqrt( 1/N + (Xi-Xavg)^2/Sxx )
If our Xi is near the middle of our X’s, then our estimate is less variable
If Xi is near the extremes, then our estimate is more variable
12. Regression
This only accounts for the variability in our estimate of the line
If we had a situation where X=Xi, what might Y be?
We know that Y might be above or below the line
SD(predicted) = SD*sqrt( 1 + 1/N + (Xi-Xavg)^2/Sxx )
The “1+” is the additional variation about the line
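Both SDs can be sketched at a chosen Xi (made-up data; the formulas include the 1/N term of the standard expressions, with RMSE estimating the error SD):

```python
# SD of the estimated mean response vs. SD of a new predicted Y at X = Xi.
import math

x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
RMSE = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                     for xi, yi in zip(x, y)) / (n - 2))

Xi = 5.0    # near the extreme of the x's, so both estimates are more variable
se_mean = RMSE * math.sqrt(1 / n + (Xi - xbar) ** 2 / Sxx)
se_pred = RMSE * math.sqrt(1 + 1 / n + (Xi - xbar) ** 2 / Sxx)  # the "1+" adds
                                                                # scatter about the line
```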
13. Correlation
Suppose we consider X to be a linear function of Y
Y = b0 + b1 X
X = c0 + c1 Y
NOT true that c1=1/b1
Because the two lines are fit by different criteria
The first line minimizes squared differences in the Y direction
The second line minimizes squared differences in the X direction
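This is easy to check numerically (made-up data; Python standing in for Matlab):

```python
# The slope of x-on-y is not 1/(slope of y-on-x): each line minimizes
# squared errors in a different direction.
x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx    # slope of y on x
c1 = Sxy / Syy    # slope of x on y
# b1*c1 = r^2 <= 1, so c1 equals 1/b1 only when the points are exactly collinear
```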
14. Correlation
What if we don’t know which model to use?
Correlation measures the degree to which X and Y are related
But not necessarily X as a fn of Y or Y as a fn of X
Can think of correlation as a generalization of slope which does not have units
15-16. Correlation
Define Sxy = SUM (x-xavg)(y-yavg), and Syy as we did Sxx
Then b1 = Sxy/Sxx
Note the units are y/x
c1 = Sxy/Syy (again, note the units)
Define the (Pearson) correlation r = Sxy/sqrt(Sxx Syy)
Note R^2 = b1*c1
(Here, R^2 is either the correlation squared or the index of determination. They are the same value.)
So, if the slopes are reciprocals, then R^2 = 1
R^2 = 1 means that SSE = 0, so the points fall exactly on a line
17. Correlation
We can also do tests on r
SD(r) = sqrt((1-r^2)/(n-2))
But this depends on our data
So need a t distn
Df=N-2
Same as for slope
Same as for SSE
18. Correlation
To test if r=0, we could compute
(r - 0) / sqrt((1-r^2)/(n-2))
This is the same as the test for slope=0
(EITHER slope)
And if we square it, we get F
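Both routes to the statistic can be sketched side by side (made-up data, same illustrative numbers as earlier sketches):

```python
# Test of r = 0: t = (r - 0)/sqrt((1 - r^2)/(n - 2)); this equals the
# t statistic for slope = 0, and squaring it recovers F.
import math

x = [1, 2, 3, 4, 5]                 # illustrative data
y = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = Sxy / math.sqrt(Sxx * Syy)
t = r / math.sqrt((1 - r ** 2) / (n - 2))       # test statistic for r = 0

# Same statistic via the slope: t_slope = b1 / (RMSE/sqrt(Sxx))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
t_slope = b1 / (math.sqrt(SSE / (n - 2)) / math.sqrt(Sxx))
```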
19. Other models
Suppose we leave X out of the model
Y = b0
The estimate is b0 = yavg
The test for b0 is the same as the t test for the mean
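A short sketch of this reduction (made-up data; with one parameter estimated, df = N-1, matching the pattern on slide 7):

```python
# With X dropped, the LS estimate of b0 is the sample mean of y,
# and testing b0 = 0 is the one-sample t test for the mean.
import math

y = [2.0, 2.9, 4.1, 4.9, 6.2]           # illustrative data
n = len(y)
b0 = sum(y) / n                         # LS estimate for the model Y = b0
SSE = sum((yi - b0) ** 2 for yi in y)   # df = n - 1: one parameter estimated
s = math.sqrt(SSE / (n - 1))            # sample SD
t = b0 / (s / math.sqrt(n))             # one-sample t statistic for mean = 0
```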