Regression Analysis. W&W Chapter 12 & 15 (1-3). Correlation and Regression I. Bivariate regression shows us how variables are linearly related. Correlational analysis tells us the degree to which variables are related. Recall how to calculate correlation, or r :
$$r = \frac{\sum (X - M_x)(Y - M_y)}{\sqrt{\sum (X - M_x)^2 \sum (Y - M_y)^2}} = \frac{\text{covariance}(X, Y)}{s_x s_y}$$
Compare this calculation to the one for a bivariate slope coefficient, b:
$$b = \frac{\sum (X - M_x)(Y - M_y)}{\sum (X - M_x)^2} = \frac{\text{covariance}(X, Y)}{\text{variance}(X)}$$
With correlation, we do not make a distinction between a dependent variable, Y, and an independent variable, X. Thus correlation takes into account the individual variance of X and Y in the denominator.
What is correlation intuitively?
We are subtracting the mean from each X and from each Y, which tells us how far a pair (X,Y) is from the point of means (Mx, My).
If r is positive, then X and Y tend to be both above or both below their means.
If r is negative, then when one variable is above its mean, the other tends to be below its mean.
We have already seen that b and r are similar. In fact, we can write b in terms of r as:
b = r (sy/sx)
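The step from the two formulas above to b = r (sy/sx) can be written out as follows. This is a short sketch that assumes sx and sy are the sample standard deviations computed with the same divisor (n-1 is used here), so the divisor cancels.

```latex
\[
b = \frac{\sum (X - M_x)(Y - M_y)}{\sum (X - M_x)^2}
  = \frac{\sum (X - M_x)(Y - M_y)}{\sqrt{\sum (X - M_x)^2 \sum (Y - M_y)^2}}
    \cdot \sqrt{\frac{\sum (Y - M_y)^2}{\sum (X - M_x)^2}}
  = r \cdot \frac{\sqrt{\sum (Y - M_y)^2 / (n-1)}}{\sqrt{\sum (X - M_x)^2 / (n-1)}}
  = r \, \frac{s_y}{s_x}
\]
```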
Suppose we collect the following data on a sample (N=8) of students’ math (X) and verbal scores (Y) on a college entrance exam.
Math (X)   Verbal (Y)   Math (X)   Verbal (Y)
80         65           72         48
50         60           60         44
36         35           56         48
58         39           68         61
Let’s plot this data in a scatterplot. We could calculate the regression line using the formulas for slope and intercept we learned.
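As a minimal sketch, the correlation, slope and intercept for these eight students can be computed directly from the formulas above. The variable names (Sxy, Sxx, Syy) are chosen here for illustration, and the intercept is assumed to be the usual a = My - b·Mx.

```python
from math import sqrt

# Math (X) and verbal (Y) scores for the eight students in the table.
X = [80, 50, 36, 58, 72, 60, 56, 68]
Y = [65, 60, 35, 39, 48, 44, 48, 61]

n = len(X)
Mx = sum(X) / n
My = sum(Y) / n

# Sums of squares and cross-products of deviations from the means.
Sxy = sum((x - Mx) * (y - My) for x, y in zip(X, Y))
Sxx = sum((x - Mx) ** 2 for x in X)
Syy = sum((y - My) ** 2 for y in Y)

r = Sxy / sqrt(Sxx * Syy)  # correlation coefficient
b = Sxy / Sxx              # slope
a = My - b * Mx            # intercept (assumed formula: a = My - b*Mx)

# Sample standard deviations, used to check b = r * (sy / sx).
sx = sqrt(Sxx / (n - 1))
sy = sqrt(Syy / (n - 1))

print(f"Mx = {Mx}, My = {My}")
print(f"r = {r:.3f}, b = {b:.3f}, a = {a:.3f}")
print(f"r * sy / sx = {r * sy / sx:.3f}")
```

The last line should reproduce the slope, which is the relationship b = r (sy/sx) stated earlier.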
Suppose we want to predict a student’s verbal score if their math score is unknown. What would be our best guess?
Our best guess would be the most typical value, or the mean (My). The prediction error for a given student would be the distance between their verbal score and the mean, or (Y-My).
But we can do better with information on X by using the regression line (Yp) for our prediction.
$$\underbrace{(Y - M_y)}_{\text{total deviation}} = \underbrace{(Y_p - M_y)}_{\text{explained deviation}} + \underbrace{(Y - Y_p)}_{\text{unexplained deviation}}$$
The same is true for their sum of squares:
$$\underbrace{\sum (Y - M_y)^2}_{\text{SS(total)}} = \underbrace{\sum (Y_p - M_y)^2}_{\text{SSR}} + \underbrace{\sum (Y - Y_p)^2}_{\text{SSE}}$$
ANOVA table for Linear Regression
Source of variation   df        Sum of squares            Mean square
Regression            k         $\sum (Y_p - M_y)^2$      MSR = $\sum (Y_p - M_y)^2 / k$
Error                 n-k-1     $\sum (Y - Y_p)^2$        MSE = $\sum (Y - Y_p)^2 / (n-k-1)$
Total                 n-1       $\sum (Y - M_y)^2$
k = the number of independent variables
We can calculate the F-statistic like we did for ANOVA previously.
$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{variance explained by regression}}{\text{unexplained variance}}$$
With k, n-k-1 numerator and denominator degrees of freedom
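The ANOVA quantities can be sketched for the math and verbal scores as follows. This is an illustration only: it fits the bivariate line (k = 1) with the slope and intercept formulas, then checks that the sums of squares add up before forming F.

```python
# Math (X) and verbal (Y) scores from the example data.
X = [80, 50, 36, 58, 72, 60, 56, 68]
Y = [65, 60, 35, 39, 48, 44, 48, 61]

n, k = len(X), 1  # k = number of independent variables
Mx = sum(X) / n
My = sum(Y) / n

# Least-squares slope and intercept, then the predicted values Yp.
b = sum((x - Mx) * (y - My) for x, y in zip(X, Y)) / sum((x - Mx) ** 2 for x in X)
a = My - b * Mx
Yp = [a + b * x for x in X]

SST = sum((y - My) ** 2 for y in Y)                # total sum of squares
SSR = sum((yp - My) ** 2 for yp in Yp)             # explained (regression)
SSE = sum((y - yp) ** 2 for y, yp in zip(Y, Yp))   # unexplained (error)

MSR = SSR / k
MSE = SSE / (n - k - 1)
F = MSR / MSE

print(f"SST = {SST:.3f}, SSR + SSE = {SSR + SSE:.3f}")
print(f"MSR = {MSR:.3f}, MSE = {MSE:.3f}")
print(f"F = {F:.3f} with {k} and {n - k - 1} degrees of freedom")
```

The computed F would then be compared with the critical value of the F distribution for k and n-k-1 degrees of freedom.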
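For the bivariate model:
$Y = \alpha + \beta X + e$
$H_0: \beta = 0$
For a model with three independent variables:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + e$
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_A$: at least one of $\beta_1, \beta_2, \beta_3$ is not equal to 0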
We will see later on that we can also use F-tests to compare various models against each other.
For example, in the previous model we might want to compare a model with all variables, to one with just X1 and X2.
$$R^2 = \frac{\text{explained (regression) sum of squares}}{\text{total sum of squares}} = \frac{\sum (Y_p - M_y)^2}{\sum (Y - M_y)^2} = \frac{\text{SSR}}{\text{SS(total)}}$$
$R^2$ measures the proportion of the total variation in Y explained by the regression model.
$0 \le R^2 \le 1$
Other things being equal, better-fitting models have higher $R^2$.
$R^2 = 1$ means $Y_p = Y_i$ for all i
$R^2 = 0$ means no linear relationship, $Y_p = M_y$
In the bivariate model, $R^2$ is the square of the correlation coefficient, r.
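A quick numerical check of this last statement, using the math and verbal scores from the earlier example. It is a sketch rather than a proof: it computes R² from the sums of squares and r from its own formula, then compares the two.

```python
from math import sqrt

X = [80, 50, 36, 58, 72, 60, 56, 68]
Y = [65, 60, 35, 39, 48, 44, 48, 61]

n = len(X)
Mx = sum(X) / n
My = sum(Y) / n

Sxy = sum((x - Mx) * (y - My) for x, y in zip(X, Y))
Sxx = sum((x - Mx) ** 2 for x in X)
Syy = sum((y - My) ** 2 for y in Y)  # this is SS(total)

# Fitted line and regression sum of squares.
b = Sxy / Sxx
a = My - b * Mx
Yp = [a + b * x for x in X]
SSR = sum((yp - My) ** 2 for yp in Yp)

R2 = SSR / Syy             # proportion of variation explained
r = Sxy / sqrt(Sxx * Syy)  # correlation coefficient

print(f"R^2 = {R2:.4f}")
print(f"r^2 = {r ** 2:.4f}")
```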
In order to estimate a regression model, we must make several simplifying assumptions. Let's go back to the problem of determining if fertilizer affects crop yield. Suppose (in Figure 12-1a) that we set fertilizer at level X1 for many, many plots. The resulting yields will not all be the same; the weather might be better for some plots, the soil might be better for others, etc. Thus we would get a distribution of Y1 given X1, or p(Y1|X1). There will similarly be a distribution of Y2 at X2 and so forth. We can visualize a whole set of Y populations such as those shown in Figure 12-1a.
To analyze such populations, we make three assumptions about the regularity of these Y distributions.
1) Homogeneous Variance: All the Y distributions have the same spread. Formally this means that the probability distribution p(Y1|X1) has the same variance $\sigma^2$ as p(Y2|X2) and so on. This assumption is often called homoskedasticity.
2) Linearity: For each Y distribution, the mean E(Y1|X1), or just $\mu_1$, lies on a straight line known as the true (population) regression line:
$E(Y_i) = \mu_i = \alpha + \beta X_i$
The population parameters $\alpha$ and $\beta$ are estimated with the sample data as a and b.
3) Independence: The random variables Y1, Y2 …are statistically independent. For example, if Y1 happens to be large, there is no reason to expect Y2 to be large. Their mean and variance are given by:
Mean = $\alpha + \beta X_i$
Variance = $\sigma^2$
Or $Y_i = \alpha + \beta X_i + e_i$
where $e_1, e_2, \ldots, e_i$ are independent errors with mean = 0 and variance = $\sigma^2$.
For example, in Figure 12-1b, the first observed value Y1 is shown along with the corresponding error term e1.
It is often assumed additionally that the errors are normally distributed:
$e_i \sim N[0, \sigma^2]$
A violation of the constant-variance assumption is called heteroskedasticity (the error variance is not the same across the range of X values).
We also assume no autocorrelation, or that $\text{Cov}[e_i, e_j] = 0$ if $i \neq j$. A typical violation occurs with time series data where the errors are related over time.
The error may be regarded as the sum of two components:
To summarize, we can see in Figure 12-2 that we have the true population regression line (black thick line), and since we do not have the population data, we must estimate $\alpha$ and $\beta$ with our sample data as a and b. Because of error, sometimes the values of Y observed in the sample will be a little too low (Y1) and sometimes they will be a little too high (Y2 and Y3). The question then is how close did we come to the true population line? The best we can hope for is something reasonably close, and we go back to the notion of a sampling distribution to figure out how close we are.
How is the slope estimate b distributed around its target $\beta$? Statisticians have been able to derive the theoretical sampling distribution: it is normally distributed with an expected value of b = $\beta$ and a standard error of b = $\sigma / \sqrt{\sum (X - M_x)^2}$. Here $\sigma$ represents the standard deviation of the Y observations about the population line.
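The sampling distribution of b can be illustrated with a small simulation. This is a sketch under stated assumptions: the population values ALPHA, BETA and SIGMA are hypothetical numbers chosen for illustration, the X values are held fixed at the eight math scores from the earlier example, and the errors are drawn as independent normals.

```python
import random
from math import sqrt
from statistics import mean, stdev

# Hypothetical population parameters, chosen for illustration only.
ALPHA, BETA, SIGMA = 20.0, 0.5, 9.0
X = [80, 50, 36, 58, 72, 60, 56, 68]  # X held fixed across samples


def slope(xs, ys):
    """Least-squares slope b for one sample."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the run is repeatable
slopes = []
for _ in range(5000):
    # Draw a new sample of Y values around the population line.
    ys = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in X]
    slopes.append(slope(X, ys))

Mx = mean(X)
theoretical_se = SIGMA / sqrt(sum((x - Mx) ** 2 for x in X))

print(f"mean of simulated slopes: {mean(slopes):.3f} (target {BETA})")
print(f"sd of simulated slopes:   {stdev(slopes):.3f}")
print(f"theoretical SE of b:      {theoretical_se:.3f}")
```

The mean of the simulated slopes should sit close to the hypothetical target, and their standard deviation should be close to the theoretical standard error.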
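An easier way to express the standard error of b is as follows:
$$\text{Standard error} = \frac{\sigma}{\sqrt{n}} \cdot \frac{1}{s_X}$$
where $s_X$ is the standard deviation of the X values, computed here with divisor n so that $\sum (X - M_x)^2 = n s_X^2$.
There are three ways that the standard error can be reduced to produce a more accurate estimate b:
1) Reduce $\sigma$, the variability of the Y observations about the line.
2) Increase n, the sample size.
3) Increase $s_X$, the spread of the X values.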
The third point is particularly interesting because increasing the variance of our independent variable increases our leverage to explain the variance of Y. We can see this clearly in Figure 12-4 where a larger range of values allows us to fit a more accurate regression line.
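The effect of the spread of X can be sketched by comparing two designs with the same sample size and the same error standard deviation. Both sets of X values and the value of SIGMA are hypothetical, made up purely for illustration, and `se_of_slope` is a helper name introduced here.

```python
from math import sqrt


def se_of_slope(xs, sigma):
    """Theoretical standard error of b: sigma / sqrt(sum of squared X deviations)."""
    mx = sum(xs) / len(xs)
    return sigma / sqrt(sum((x - mx) ** 2 for x in xs))


SIGMA = 9.0  # hypothetical error standard deviation

# Two hypothetical designs with the same n and the same mean of X.
narrow_X = [55, 57, 59, 61, 63, 65]
wide_X = [35, 45, 55, 65, 75, 85]

se_narrow = se_of_slope(narrow_X, SIGMA)
se_wide = se_of_slope(wide_X, SIGMA)

print(f"SE of b with narrow X: {se_narrow:.3f}")
print(f"SE of b with wide X:   {se_wide:.3f}")
print(f"ratio: {se_narrow / se_wide:.1f}")
```

In this sketch the wide design spreads X five times as far from its mean, and the standard error of b shrinks by the same factor.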
Because we do not generally know $\sigma$, we estimate the standard error as:
$$SE = \frac{s}{\sqrt{\sum (X - M_x)^2}},$$
where $s^2 = \frac{1}{n-k-1} \sum (Y - Y_p)^2$
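For the math and verbal scores from the earlier example, the estimated standard error of b can be sketched as follows, with k = 1 independent variable. Variable names are chosen here for illustration.

```python
from math import sqrt

X = [80, 50, 36, 58, 72, 60, 56, 68]
Y = [65, 60, 35, 39, 48, 44, 48, 61]

n, k = len(X), 1
Mx = sum(X) / n
My = sum(Y) / n

Sxx = sum((x - Mx) ** 2 for x in X)
b = sum((x - Mx) * (y - My) for x, y in zip(X, Y)) / Sxx
a = My - b * Mx
Yp = [a + b * x for x in X]

# Residual variance s^2 estimates the unknown sigma^2.
s2 = sum((y - yp) ** 2 for y, yp in zip(Y, Yp)) / (n - k - 1)
s = sqrt(s2)

SE = s / sqrt(Sxx)  # estimated standard error of b

print(f"s^2 = {s2:.3f}, s = {s:.3f}")
print(f"b = {b:.3f}, SE = {SE:.3f}")
```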