Regression Analysis
W&W, Chapter 11

Introduction

ANOVA allows us to analyze the difference in means among various populations. We will extend this analysis today to regression, which will allow us to examine the impact of one or more independent variables on a dependent variable.
Today we will focus on the calculation of regression coefficients for the bivariate model.
Suppose we collect the following data from various fields where we have applied fertilizer and measured the subsequent crop yield.

X (fertilizer, lbs)   Y (yield, bushels)
100                   40
200                   50
300                   50
400                   70
500                   65
600                   65
700                   80

We can plot this data in a scatterplot.
We want to find the regression line that best fits the scatterplot of data. We can represent the regression line for a bivariate regression model generally as:
Y = a + bX + e, and the predicted value is Yp = a + bX

Where Yp = predicted value of Y
X = independent variable
a = Y-intercept
b = slope
e = error (e = Y – Yp)
The form of the model presented above is the regression line calculated on the basis of the sample data. What are trying to make inferences about is the (usually) unobservable relationship between X and Y in the population (such as all crop fields).
Y = α + βX + ε
Once again, we use Greek letters to denote the population parameters we are trying to estimate.
Statisticians have discovered that the best fitting line is the one that minimizes the squared deviations from the line to each Y value on the scatterplot. This is called the OLS line, or Ordinary Least Squares line.
The goal is to minimize Σ(Y – Yp)²
Y = actual sample values of the dependent variable
Yp = predicted values of Y based on the regression line
The quantity Σ(Y – Yp)² is the sum of squares for error, or SSE.
Y = a + bX + e
We will assume that E(e)=0.
Calculating the slope coefficient:
b = Σ(X – Mx)(Y – My) / Σ(X – Mx)² = covariance(X, Y) / variance(X)
Mx = sample mean for X, My = sample mean for Y
X     Y     (X-Mx)(Y-My)                (X-Mx)²
100   40    (100-400)(40-60) = 6,000    (100-400)² = 90,000
200   50    (200-400)(50-60) = 2,000    (200-400)² = 40,000
300   50    (300-400)(50-60) = 1,000    (300-400)² = 10,000
400   70    (400-400)(70-60) = 0        (400-400)² = 0
500   65    (500-400)(65-60) = 500      (500-400)² = 10,000
600   65    (600-400)(65-60) = 1,000    (600-400)² = 40,000
700   80    (700-400)(80-60) = 6,000    (700-400)² = 90,000

Σ(X – Mx)(Y – My) = 16,500, Σ(X – Mx)² = 280,000
b = 16,500/280,000 = .059
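The slope calculation above can be sketched in a few lines of Python, using the fertilizer data from the tables in these notes:

```python
# Sketch: OLS slope b = sum((X - Mx)(Y - My)) / sum((X - Mx)^2),
# using the fertilizer/yield data from the notes.
X = [100, 200, 300, 400, 500, 600, 700]   # fertilizer (lbs)
Y = [40, 50, 50, 70, 65, 65, 80]          # yield (bushels)

Mx = sum(X) / len(X)   # sample mean of X = 400
My = sum(Y) / len(Y)   # sample mean of Y = 60

num = sum((x - Mx) * (y - My) for x, y in zip(X, Y))   # covariance numerator = 16,500
den = sum((x - Mx) ** 2 for x in X)                    # variance numerator = 280,000
b = num / den
print(round(b, 3))   # 0.059
```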
For a bivariate regression model, we calculate the Y-intercept, a, as:
a = My – bMx
My = sample mean for Y
Mx = sample mean for X
b = slope coefficient
a = 60 – (.059)(400) = 36.4
We can now plug these values into our regression equation.
Yp = a + bX, or Yp = 36.4 + .059X
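A minimal sketch of the full fit, computing a = My – bMx and the predicted values Yp = a + bX for the same data:

```python
# Sketch: intercept a = My - b*Mx and predictions Yp = a + b*X,
# using the fertilizer/yield data from the notes.
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
Mx, My = sum(X) / n, sum(Y) / n
b = sum((x - Mx) * (y - My) for x, y in zip(X, Y)) / sum((x - Mx) ** 2 for x in X)
a = My - b * Mx                  # 60 - (.059)(400) = 36.4
Yp = [a + b * x for x in X]      # predicted yields for each field
print(round(a, 1))   # 36.4
```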
We can plot this line on our scatterplot.
1) Interpreting the slope (b): when we increase fertilizer by one unit (lbs), we increase our yield by .059 (bushels).
2) Interpreting the y-intercept (a): the value of Y when X=0. If we applied no fertilizer, we would have a yield of 36.4 bushels.
Y = β0 + β1X1 + β2X2 + β3X3 + ε
Each slope coefficient (βi) measures the responsiveness of the dependent variable to a one-unit change in the associated independent variable when the other independent variables are held constant.
We will talk about calculating the coefficients for multivariate regression next week.
Y: body weight (lbs)
X1: food intake (average daily calories)
X2: exercise (average daily expenditure, calories)
X3: gender (1=male, 0=female)
Intercept: A female who eats no food and does not exercise weighs 152 pounds.
FOOD: A one calorie increase in daily average food intake increases a person’s weight by .028 pounds. A 100 calorie increase results in a 2.8 pound increase in weight (100x.028).
EXERCISE: A one calorie increase in calories expended through exercise decreases a person’s weight by 0.045 pounds.
The coefficient can be interpreted as the difference in the expected value of Y between a case for which X=0 and a case for which X=1 (holding all other independent variables constant).
For two individuals with identical food intake and exercise, a man can expect to weigh 35 pounds more than a woman.
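The interpretations above can be checked with a small sketch; the function name and the sample calorie values below are illustrative, not from the notes, while the coefficients (152, .028, –.045, 35) are the ones reported:

```python
# Sketch: applying the reported coefficients. The function name and the
# example calorie values (2000, 300) are illustrative assumptions.
def predicted_weight(food, exercise, male):
    # intercept + FOOD + EXERCISE + MALE dummy
    return 152 + 0.028 * food - 0.045 * exercise + 35 * male

# Holding food and exercise constant, the male/female difference
# is exactly the dummy coefficient: 35 pounds.
diff = predicted_weight(2000, 300, 1) - predicted_weight(2000, 300, 0)
print(diff)   # 35.0
```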
We have estimated the relationship between X and Y in our sample (b). We want to make an inference, however, about the relationship in the population (β).
Under the null hypothesis, we will assume that X and Y have no relationship; in other words the slope coefficient equals zero under the null hypothesis.
Two-tailed test (no theoretical expectation):
H0: β = 0
HA: β ≠ 0
One-tailed test (such as fertilizer increases crop yield):
H0: β = 0
HA: β > 0
β = b ± tα/2(seb)

b = estimated slope coefficient
tα/2 = two-tailed critical t value, df = n – 2

seb = √[Σ(Y – Yp)² / (n – 2)] / √[Σ(X – Mx)²] = s / √[Σ(X – Mx)²]

We will denote the numerator of this equation s; s is called the standard error of the estimate. seb is the standard error of the slope coefficient b.
In our previous fertilizer example, we found that b=.059. Let’s test the null hypothesis that fertilizer has no impact on crop yield.
X     Y     Yp     (Y-Yp)²
100   40    42.3   (-2.3)² = 5.29
200   50    48.2   (1.8)² = 3.24
300   50    54.1   (-4.1)² = 16.81
400   70    60.0   (10)² = 100
500   65    65.9   (-.9)² = 0.81
600   65    71.8   (-6.8)² = 46.24
700   80    77.7   (2.3)² = 5.29
s² = Σ(Y – Yp)²/(n – 2) = 177.68/(7 – 2) = 35.5
s = 5.96
seb = s / √[Σ(X – Mx)²] = 5.96 / √280,000 = .011
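The standard errors can be verified with a short sketch, recomputing the fit from the data in the notes:

```python
import math

# Sketch: standard error of the estimate (s) and of the slope (seb)
# for the fertilizer regression, recomputed from the data in the notes.
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
Mx, My = sum(X) / n, sum(Y) / n
b = sum((x - Mx) * (y - My) for x, y in zip(X, Y)) / sum((x - Mx) ** 2 for x in X)
a = My - b * Mx

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))   # sum of squared errors
s = math.sqrt(sse / (n - 2))                              # standard error of estimate
seb = s / math.sqrt(sum((x - Mx) ** 2 for x in X))        # standard error of slope
print(round(s, 2), round(seb, 3))   # 5.96 0.011
```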
β = b ± tα/2(seb)
β = .059 ± 2.57(.011)
β = .059 ± .029

Our estimate of β falls between .030 and .088. Because our confidence interval does not contain the hypothesized null value (zero), we can reject the null hypothesis.
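As a sketch, the interval can be built in Python; the critical value 2.571 (t with df = 5, α/2 = .025) comes from a t table, and the endpoints differ from the hand calculation only in rounding:

```python
# Sketch: 95% confidence interval for beta, using b and seb from the notes
# and the table critical value t(.025, df=5) = 2.571.
b, seb = 0.059, 0.011
t_crit = 2.571
lo, hi = b - t_crit * seb, b + t_crit * seb
# The interval excludes zero, so we reject the null hypothesis.
print(round(lo, 3), round(hi, 3))
```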
We can conclude that fertilizer has a statistically significant effect on crop yield.
Suppose we want to test the hypothesis that fertilizer increases crop yield.
HA: β > 0
First, we will calculate our test statistic, t:
t = b/ seb = .059/.011 = 5.36
This is our calculated t, which we compare to our critical t of 2.02 for a one-tailed test (df = n – 2 = 5).
Because 5.36 > 2.02, we can reject the null hypothesis and conclude that fertilizer significantly increases crop yield.
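The one-tailed test above is a one-line comparison in code, using the estimates and the critical value from the notes:

```python
# Sketch: one-tailed t test for the slope, with b, seb, and the
# critical value 2.02 taken from the notes.
b, seb = 0.059, 0.011
t_stat = b / seb          # calculated t
t_crit = 2.02             # critical t, one-tailed, df = 5
print(round(t_stat, 2), t_stat > t_crit)   # 5.36 True
```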