Regression Analysis

Regression Analysis W&W, Chapter 11

Introduction ANOVA allows us to analyze the difference in means among various populations. We will extend this analysis today to regression, which will allow us to examine the impact of one or more independent variables on a dependent variable.

Examples we will use today • The effect of fertilizer (X) on yield observations (Y): this is an example of a bivariate regression model • The effect of food (X1), exercise (X2), and gender (X3) on body weight (from Berry and Sanders): this is an example of a multivariate regression model. Today we will focus on the calculation of regression coefficients for the bivariate model.

Fertilizer and Crop yield Suppose we collect the following data from various fields where we have applied fertilizer and measured the subsequent crop yield. We can plot this data in a scatterplot.
X (fertilizer, lbs)   Y (yield, bushels)
100                   40
200                   50
300                   50
400                   70
500                   65
600                   65
700                   80

Objective We want to find the regression line that best fits the scatterplot of data. We can represent the regression line for a bivariate regression model generally as: Yp = a + bX, so that each observed value is Y = a + bX + e = Yp + e. Where Yp = predicted value of Y, X = independent variable, a = Y-intercept, b = slope, e = error (the difference between the actual and the predicted value of Y)

Bivariate Regression Model The form of the model presented above is the regression line calculated on the basis of the sample data. What we are trying to make inferences about is the (usually) unobservable relationship between X and Y in the population (such as all crop fields). $Y = \alpha + \beta X + \varepsilon$ Once again, we use Greek letters to denote the population parameters we are trying to estimate.

Finding the best fitting line By the least squares criterion, the best fitting line is the one that minimizes the sum of the squared vertical deviations between the line and each Y value on the scatterplot. This is called the OLS line, or Ordinary Least Squares line. The goal is to minimize $\sum (Y - Y_p)^2$, where Y = actual sample values of the dependent variable and Yp = predicted values of Y based on the regression line. The quantity $\sum (Y - Y_p)^2$ is the sum of squares for error, or SSE.
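The least squares criterion can be illustrated with a short Python sketch that computes SSE for any candidate line on the crop data. The function name `sse` and the two candidate lines are assumptions chosen only for illustration; they are not part of the lecture.

```python
# Fertilizer (X) and yield (Y) values from the crop example.
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]


def sse(a, b, xs, ys):
    """Sum of squared errors for the candidate line Yp = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Two hypothetical candidate lines, picked only to compare their fit.
sse_flat = sse(60.0, 0.0, X, Y)     # horizontal line at Y = 60
sse_sloped = sse(36.0, 0.06, X, Y)  # line that follows the upward trend

print(round(sse_flat, 2), round(sse_sloped, 2))
```

The sloped line has a much smaller SSE than the flat one, which is the sense in which it fits the scatterplot better. The OLS line is the candidate with the smallest SSE of all.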

Calculating the Regression Line $Y = a + bX + e$, with predicted value $Y_p = a + bX$. We will assume that E(e)=0. Calculating the slope coefficient:
$$b = \frac{\sum (X - M_x)(Y - M_y)}{\sum (X - M_x)^2} = \frac{\text{covariance}(X, Y)}{\text{variance}(X)}$$
Mx = sample mean for X, My = sample mean for Y

Calculating b for our crop example
X     Y     (X-Mx)(Y-My)        (X-Mx)^2
100   40    (100-400)(40-60)    (100-400)^2 = 90,000
200   50    (200-400)(50-60)    (200-400)^2 = 40,000
. .   . .   . .                 . .
700   80    (700-400)(80-60)    (700-400)^2 = 90,000
$\sum (X - M_x)(Y - M_y) = 16{,}500$, $\sum (X - M_x)^2 = 280{,}000$
$b = 16{,}500 / 280{,}000 = .059$
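The slope calculation above can be reproduced with a minimal Python sketch. The variable names (`mx`, `my`, `sxy`, `sxx`) are assumptions introduced here for readability.

```python
# Fertilizer (X) and yield (Y) values from the crop example.
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]

n = len(X)
mx = sum(X) / n  # sample mean of X (Mx)
my = sum(Y) / n  # sample mean of Y (My)

# Numerator: sum of cross-products of deviations from the means.
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
# Denominator: sum of squared deviations of X from its mean.
sxx = sum((x - mx) ** 2 for x in X)

b = sxy / sxx
print(mx, my, sxy, sxx, round(b, 3))
```

The sums match the table (16,500 and 280,000), and the slope rounds to .059.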

Calculating the Y-intercept, a For a bivariate regression model, we calculate the Y-intercept, a, as: a = My – bMx My = sample mean for Y Mx = sample mean for X b = slope coefficient

Calculating a for the crop example a = 60 – (.059)(400) = 36.4 We can now plug these values into our regression equation. Yp = a + bX, or Yp = 36.4 + .059X We can plot this line on our scatterplot.
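As a rough check on the intercept, the sketch below fits the line in Python and uses it for prediction. The helper names `fit_line` and `predict` are hypothetical. Because the code keeps the unrounded slope, its intercept (about 36.43) differs slightly from the 36.4 obtained with b rounded to .059.

```python
# Fertilizer (X) and yield (Y) values from the crop example.
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]


def fit_line(xs, ys):
    """Return (a, b) for the least squares line Yp = a + b*X."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx  # intercept: a = My - b*Mx
    return a, b


a, b = fit_line(X, Y)


def predict(x):
    """Predicted yield Yp for a given amount of fertilizer x."""
    return a + b * x


print(round(a, 2), round(b, 4))  # 36.43 0.0589
print(round(predict(400), 1))    # 60.0, the line passes through (Mx, My)
```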

Interpreting the regression coefficients 1) Interpreting the slope (b): the slope coefficient represents the change in Y that accompanies a one unit change in X. Thus when we increase fertilizer by one unit (lbs), we increase our yield by .059 (bushels). 2) Interpreting the y-intercept (a): the value of Y when X=0. If we applied no fertilizer, we would have a yield of 36.4 bushels.

Multivariate Regression $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$ Each slope coefficient ($\beta_i$) measures the responsiveness of the dependent variable to a one-unit change in the associated independent variable when the other independent variables are held constant. We will talk about calculating the coefficients for multivariate regression next week.

Example of Multivariate Regression Y: body weight (lbs) X1: food intake (average daily calories) X2: exercise (average daily expenditure, calories) X3: gender (1=male, 0=female)

Table 3.1 Regression Model of Body Weight
Variable    Coefficient
Intercept   152.0
FOOD        0.028
EXERCISE    -0.045
MALE        35.00

Interpretation Intercept: A female who eats no food and does not exercise weighs 152 pounds. FOOD: A one calorie increase in daily average food intake increases a person’s weight by .028 pounds. A 100 calorie increase results in a 2.8 pound increase in weight (100x.028). EXERCISE: A one calorie increase in calories expended through exercise decreases a person’s weight by 0.045 pounds.

Interpretation (continued) MALE (dichotomous) The coefficient can be interpreted as the difference in the expected value of Y between a case for which X=0 and a case for which X=1 (holding all other independent variables constant). For two individuals with identical food intake and exercise, a man can expect to weigh 35 pounds more than a woman.
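The interpretations above can be checked by plugging values into the fitted equation. The sketch below uses the Table 3.1 coefficients. The function name `predict_weight` and the example inputs (2,000 calories of food, 300 calories of exercise) are hypothetical values chosen only for illustration.

```python
# Coefficients from Table 3.1.
INTERCEPT = 152.0
B_FOOD = 0.028       # pounds per calorie of average daily food intake
B_EXERCISE = -0.045  # pounds per calorie of average daily exercise
B_MALE = 35.0        # pounds, male (1) relative to female (0)


def predict_weight(food, exercise, male):
    """Predicted body weight (lbs) from the Table 3.1 regression model."""
    return INTERCEPT + B_FOOD * food + B_EXERCISE * exercise + B_MALE * male


# Hypothetical individuals with identical food intake and exercise.
woman = predict_weight(2000, 300, 0)
man = predict_weight(2000, 300, 1)

print(round(woman, 1), round(man, 1), round(man - woman, 1))
```

Holding food and exercise fixed, the two predictions differ by exactly the MALE coefficient, 35 pounds.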

Something to note • In the special case where Y has no relation to X (b=0), the regression line reduces to the mean of Y: because a = My - bMx, setting b=0 gives a = My, and therefore Yp = a = My. This means that the best prediction we can make for Y with no other information is its mean.
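One way to see this special case is to substitute the intercept formula into the regression line. The steps below are a short worked derivation using only the definitions already given for a, b, Mx, and My.

```latex
\begin{aligned}
Y_p &= a + bX \\
    &= (M_y - bM_x) + bX \\
    &= M_y + b\,(X - M_x) \\
\text{so if } b = 0:\quad Y_p &= M_y
\end{aligned}
```

The third line also shows that the fitted line always passes through the point (Mx, My), whatever the value of b.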

Hypothesis Testing for the Regression Slope Coefficient (b) We have estimated the relationship between X and Y in our sample (b). We want to make an inference, however, about the relationship in the population ($\beta$). Under the null hypothesis, we will assume that X and Y have no relationship; in other words the slope coefficient equals zero under the null hypothesis.

Hypothesis Testing for the Regression Slope Coefficient Two-tailed test (no theoretical expectation): $H_0: \beta = 0$, $H_A: \beta \neq 0$. One-tailed test (such as fertilizer increases crop yield): $H_0: \beta \le 0$, $H_A: \beta > 0$.

Testing via a Confidence Interval $\beta = b \pm t_{\alpha/2}\,(se_b)$, where b = estimated slope coefficient, $t_{\alpha/2}$ = two-tailed critical t value with df = n-2, and
$$se_b = \frac{\sqrt{\sum (Y - Y_p)^2 / (n-2)}}{\sqrt{\sum (X - M_x)^2}} = \frac{s}{\sqrt{\sum (X - M_x)^2}}$$
We will denote the numerator of this equation s. $se_b$ is called the standard error of the slope estimate.

Testing via a Confidence Interval In our previous fertilizer example, we found that b=.059. Let’s test the null hypothesis that fertilizer has no impact on crop yield.
X     Y    Yp     (Y-Yp)^2
100   40   42.3   (-2.3)^2 = 5.29
200   50   48.2   3.24
300   50   54.1   16.81
400   70   60     100
500   65   65.9   0.81
600   65   71.8   46.24
700   80   77.7   5.29
$\sum (Y - Y_p)^2 = 177.68$

Testing via a Confidence Interval
$s = \sqrt{\sum (Y - Y_p)^2 / (n-2)} = \sqrt{177.68 / (7-2)} = \sqrt{35.5} = 5.96$
$se_b = s / \sqrt{\sum (X - M_x)^2} = 5.96 / \sqrt{280{,}000} = .011$
$\beta = b \pm t_{\alpha/2}\,(se_b)$
$\beta = .059 \pm 2.57(.011)$
$\beta = .059 \pm .029$
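The confidence interval arithmetic can be retraced with a small Python sketch. The critical value 2.57 is copied from the calculation above rather than computed, and the variable names are assumptions. The code keeps unrounded intermediate values, so individual figures may differ from the hand calculation in the last digit.

```python
import math

# Fertilizer (X) and yield (Y) values from the crop example.
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]

n = len(X)
mx = sum(X) / n
my = sum(Y) / n
sxx = sum((x - mx) ** 2 for x in X)
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sxx
a = my - b * mx

# Sum of squared errors around the fitted line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
s = math.sqrt(sse / (n - 2))  # numerator of the se_b formula
se_b = s / math.sqrt(sxx)     # standard error of the slope

T_CRIT = 2.57  # two-tailed critical t for df = 5, taken from the text
lower = b - T_CRIT * se_b
upper = b + T_CRIT * se_b

print(round(s, 2), round(se_b, 3), round(lower, 3), round(upper, 3))
```

The interval excludes zero if `lower > 0`, which is the condition for rejecting the null hypothesis of no relationship.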

Decision With 95% confidence, $\beta$ falls between .030 and .088. Because our confidence interval does not contain the hypothesized null value (zero), we can reject the null hypothesis. We can conclude that fertilizer has a statistically significant effect on crop yield.

Testing via a t-test Suppose we want to test the hypothesis that fertilizer increases crop yield. $H_0: \beta \le 0$, $H_A: \beta > 0$. First, we will calculate our test statistic, t: $t = b / se_b = .059 / .011 = 5.36$

Testing via a t-test This is our calculated t, which we compare to our critical t, which is 2.02 for a one-tailed test at the 5% significance level with df = n-2 = 5. Because 5.36 > 2.02, we can reject the null hypothesis and conclude that fertilizer significantly increases crop yield.
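The one-tailed test can be sketched in Python as follows. The helper name `slope_and_se` is hypothetical, and the critical value 2.02 is taken from the text rather than computed. With unrounded b and se_b the statistic comes out near 5.23 rather than 5.36, which was obtained from the rounded values; the decision is the same either way.

```python
import math

# Fertilizer (X) and yield (Y) values from the crop example.
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]


def slope_and_se(xs, ys):
    """Return the OLS slope b and its standard error se_b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    return b, s / math.sqrt(sxx)


b, se_b = slope_and_se(X, Y)
t_stat = b / se_b

T_CRIT_ONE_TAILED = 2.02  # one-tailed critical t for df = 5, from the text
reject_null = t_stat > T_CRIT_ONE_TAILED

print(round(t_stat, 2), reject_null)
```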
