
Multiple Regression


Presentation Transcript


  1. Multiple Regression W&W, Chapter 13, 15(3-4)

  2. Introduction • Multiple regression is an extension of bivariate regression that takes into account more than one independent variable. The simplest multivariate model can be written as: Y = β0 + β1X1 + β2X2 + ε. We make the same assumptions about the error term (ε) that we did in the bivariate case.

  3. Example • Suppose we examine the impact of fertilizer on crop yields, but this time we want to control for another factor that we think affects yield levels: rainfall. • We collect the following data.

  4. Data
  Y (yield)   X1 (fertilizer)   X2 (rainfall)
  40          100               10
  50          200               20
  50          300               10
  70          400               30
  65          500               20
  65          600               20
  80          700               30
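
To make the formulas that follow concrete, here is a minimal sketch (plain Python with numpy, not part of W&W) that keys the three columns into arrays and prints the sample means used repeatedly below.

    import numpy as np

    # The seven observations from the table above.
    Y  = np.array([40, 50, 50, 70, 65, 65, 80], dtype=float)         # yield
    X1 = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)  # fertilizer
    X2 = np.array([10, 20, 10, 30, 20, 20, 30], dtype=float)         # rainfall

    # Sample means used throughout: M_Y = 60, M_X1 = 400, M_X2 = 20.
    print(Y.mean(), X1.mean(), X2.mean())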

  5. Multiple Regression: Partial Slope Coefficients β1 is interpreted geometrically as the marginal effect of fertilizer (X1) on yield (Y), holding rainfall (X2) constant. The OLS model is estimated as: Yp = b0 + b1X1 + b2X2 + e. Solving for b0, b1, and b2 is more complicated than in the bivariate model because we have to consider the relationships between X1 and Y, X2 and Y, and X1 and X2.
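
As a rough illustration of estimating this OLS model, the sketch below builds the design matrix [1, X1, X2] from the slide-4 data and solves for b0, b1, b2 by least squares; the variable names are my own, not W&W's.

    import numpy as np

    Y  = np.array([40, 50, 50, 70, 65, 65, 80], dtype=float)
    X1 = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
    X2 = np.array([10, 20, 10, 30, 20, 20, 30], dtype=float)

    # Design matrix: a column of ones (intercept) plus the two regressors.
    X = np.column_stack([np.ones_like(X1), X1, X2])

    # Ordinary least squares: minimizes the sum of squared residuals.
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    b0, b1, b2 = b
    print(b0, b1, b2)   # roughly 28.10, 0.038, 0.83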

  6. Finding the Slopes We would solve the following equations simultaneously for this problem:
  Σ(X1 − Mx1)(Y − My) = b1 Σ(X1 − Mx1)² + b2 Σ(X1 − Mx1)(X2 − Mx2)
  Σ(X2 − Mx2)(Y − My) = b1 Σ(X1 − Mx1)(X2 − Mx2) + b2 Σ(X2 − Mx2)²
  b0 = My − b1 Mx1 − b2 Mx2
  These are called the normal or estimating equations.
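
A quick numerical check, assuming the slide-4 data: the two normal equations are just a 2x2 linear system in b1 and b2, which the sketch below sets up from the deviation sums and solves, recovering b0 from the means.

    import numpy as np

    Y  = np.array([40, 50, 50, 70, 65, 65, 80], dtype=float)
    X1 = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
    X2 = np.array([10, 20, 10, 30, 20, 20, 30], dtype=float)

    # Deviations from the sample means.
    y, x1, x2 = Y - Y.mean(), X1 - X1.mean(), X2 - X2.mean()

    # Left-hand-side matrix and right-hand-side vector of the normal equations.
    A = np.array([[np.sum(x1**2), np.sum(x1*x2)],
                  [np.sum(x1*x2), np.sum(x2**2)]])
    c = np.array([np.sum(x1*y), np.sum(x2*y)])

    b1, b2 = np.linalg.solve(A, c)
    b0 = Y.mean() - b1*X1.mean() - b2*X2.mean()
    print(b0, b1, b2)   # same estimates as the least-squares fit above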

  7. Solution
  b1 = [Σ(X1 − Mx1)(Y − My) · Σ(X2 − Mx2)² − Σ(X2 − Mx2)(Y − My) · Σ(X1 − Mx1)(X2 − Mx2)] / [Σ(X1 − Mx1)² · Σ(X2 − Mx2)² − (Σ(X1 − Mx1)(X2 − Mx2))²]
  b2 = [Σ(X2 − Mx2)(Y − My) · Σ(X1 − Mx1)² − Σ(X1 − Mx1)(Y − My) · Σ(X1 − Mx1)(X2 − Mx2)] / [Σ(X1 − Mx1)² · Σ(X2 − Mx2)² − (Σ(X1 − Mx1)(X2 − Mx2))²]
  Good thing we have computers to calculate this for us!
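
The same estimates can be reproduced straight from these closed-form expressions; the sketch below is only an illustration of the algebra on the slide-4 data, not the textbook's own code.

    import numpy as np

    Y  = np.array([40, 50, 50, 70, 65, 65, 80], dtype=float)
    X1 = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
    X2 = np.array([10, 20, 10, 30, 20, 20, 30], dtype=float)

    y, x1, x2 = Y - Y.mean(), X1 - X1.mean(), X2 - X2.mean()
    s11, s22, s12 = np.sum(x1**2), np.sum(x2**2), np.sum(x1*x2)
    s1y, s2y = np.sum(x1*y), np.sum(x2*y)

    denom = s11*s22 - s12**2              # the shared denominator
    b1 = (s1y*s22 - s2y*s12) / denom      # about 0.038
    b2 = (s2y*s11 - s1y*s12) / denom      # about 0.83
    print(b1, b2)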

  8. Hypothesis Testing for β We can calculate a confidence interval: β = b ± t(α/2) · se_b, with df = n − k − 1, where k = # of regressors. We can also use a t-test for each independent variable to test the following hypotheses (as one- or two-tailed tests), where t = bi / se_bi:
  H0: β1 = 0    H0: β2 = 0
  HA: β1 ≠ 0    HA: β2 ≠ 0
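
For a hand-rolled check of these tests, the sketch below computes se_b, the t-ratios, two-tailed p-values, and 95% confidence intervals for the fertilizer example with numpy and scipy; any regression package reports the same quantities, and the variable names here are mine.

    import numpy as np
    from scipy import stats

    Y  = np.array([40, 50, 50, 70, 65, 65, 80], dtype=float)
    X1 = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
    X2 = np.array([10, 20, 10, 30, 20, 20, 30], dtype=float)

    X = np.column_stack([np.ones_like(X1), X1, X2])   # intercept + k = 2 regressors
    n = len(Y)
    df = n - 2 - 1                                    # n - k - 1 = 4

    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ b
    s2 = resid @ resid / df                           # estimated error variance
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

    t_stats = b / se                                  # t = b_i / se(b_i)
    t_crit = stats.t.ppf(0.975, df)                   # two-tailed, alpha = 0.05
    ci = np.column_stack([b - t_crit*se, b + t_crit*se])
    p_vals = 2 * stats.t.sf(np.abs(t_stats), df)
    print(t_stats, p_vals, ci, sep="\n")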

  9. Dropping Regressors • We may be tempted to throw out variables that are insignificant, but we might bias the remaining coefficients in the model. Such an omission of important variables is called omitted variable bias. If you have a strong theoretical reason to include a variable, then you should keep it in the model. One way to minimize such bias is to use randomized assignment of the treatment variables.

  10. Interpreting the Coefficients In the bivariate regression model, the slope (b) represents a change in Y that accompanies a one unit change in X. In the multivariate regression model, each slope coefficient (bi) represents the change in Y that accompanies a one unit change in the regressor (Xi) if all other regressors remain constant. This is like taking a partial derivative in calculus, which is why we refer to these as partial slope coefficients.

  11. Partial Correlation Partial correlation calculates the correlation between Y and Xi with the other regressors held constant. With t = b / se_b:
  Partial r = (b / se_b) / √[(b / se_b)² + (n − k − 1)] = t / √[t² + (n − k − 1)]
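
As a small illustration, the helper below (a hypothetical function name of my choosing) converts a coefficient's t-ratio into the partial correlation using this formula.

    import numpy as np

    def partial_r_from_t(t, n, k):
        # Partial correlation recovered from a coefficient's t-ratio:
        # r = t / sqrt(t^2 + (n - k - 1))
        return t / np.sqrt(t**2 + (n - k - 1))

    # Hypothetical example: a t-ratio of 2.5 with n = 7 observations, k = 2 regressors.
    print(partial_r_from_t(2.5, n=7, k=2))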

  12. Calculating Adjusted R² R² = SSR / SS_total. Problem: R² never decreases (and usually increases) as k increases, so some people advocate the use of the adjusted R²:
  R²_A = [(n − 1)R² − k] / (n − k − 1)
  We subtract k in the numerator as a “penalty” for increasing the size of k (# of regressors).
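
A short sketch, again using the slide-4 data, that computes R² and then applies the adjusted formula above; the variable names are mine, not W&W's.

    import numpy as np

    Y  = np.array([40, 50, 50, 70, 65, 65, 80], dtype=float)
    X1 = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
    X2 = np.array([10, 20, 10, 30, 20, 20, 30], dtype=float)

    X = np.column_stack([np.ones_like(X1), X1, X2])
    n, k = len(Y), 2                                  # k = number of regressors
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    fitted = X @ b

    ss_total = np.sum((Y - Y.mean())**2)
    ss_error = np.sum((Y - fitted)**2)
    r2 = 1 - ss_error / ss_total                      # equivalently SSR / SS_total
    r2_adj = ((n - 1)*r2 - k) / (n - k - 1)           # the penalized version above
    print(r2, r2_adj)                                 # roughly 0.98 and 0.97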

  13. Stepwise Regression • W&W discuss stepwise regression (pages 499-500). This is an atheoretical procedure that selects variables on the basis of how much they increase R². Don't use this technique: it is not theoretically driven, and R² is a very problematic statistic (as you will learn later).

  14. Standard error of the estimate A better measure of model fit is the standard error of the estimate:
  s = √[Σ(Y − Yp)² / (n − k − 1)]
  This is just the square root of the SSE divided by its degrees of freedom. A model with a smaller standard error of the estimate is better. See Chris Achen’s Sage monograph on regression for a good discussion of this measure.
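
The sketch below computes this standard error of the estimate for the fertilizer example, using the same n − k − 1 degrees of freedom; it is my own illustration, not W&W's code.

    import numpy as np

    Y  = np.array([40, 50, 50, 70, 65, 65, 80], dtype=float)
    X1 = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
    X2 = np.array([10, 20, 10, 30, 20, 20, 30], dtype=float)

    X = np.column_stack([np.ones_like(X1), X1, X2])
    n, k = len(Y), 2
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ b

    # s = sqrt( SSE / (n - k - 1) ), in the same units as Y.
    s = np.sqrt(np.sum(resid**2) / (n - k - 1))
    print(s)   # roughly 2.3 for this example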

  15. Multicollinearity • An additional assumption we must make in multiple regression is that none of the independent variables are perfectly correlated with each other. • In the simple multivariate model, for example: Yp = b0 + b1X1 + b2X2 + e, this assumption requires r12 ≠ ±1.

  16. Multicollinearity With perfect multicollinearity, you cannot estimate the partial slope coefficients. To see why this is so, rewrite the estimate for b1 in the model with two independent variables as:
  b1 = [(ry1 − r12 · ry2) / (1 − r12²)] · (sy / s1)
  where ry1 = correlation between Y and X1, r12 = correlation between X1 and X2, ry2 = correlation between Y and X2, sy = standard deviation of Y, and s1 = standard deviation of X1.
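
To confirm the rewritten formula numerically, the sketch below computes the three pairwise correlations and the two standard deviations from the slide-4 data and reproduces the same b1 as before (illustration only).

    import numpy as np

    Y  = np.array([40, 50, 50, 70, 65, 65, 80], dtype=float)
    X1 = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
    X2 = np.array([10, 20, 10, 30, 20, 20, 30], dtype=float)

    r_y1 = np.corrcoef(Y, X1)[0, 1]
    r_y2 = np.corrcoef(Y, X2)[0, 1]
    r_12 = np.corrcoef(X1, X2)[0, 1]
    s_y, s_1 = Y.std(ddof=1), X1.std(ddof=1)

    # b1 = [(r_y1 - r_12 * r_y2) / (1 - r_12**2)] * (s_y / s_1)
    b1 = (r_y1 - r_12 * r_y2) / (1 - r_12**2) * (s_y / s_1)
    print(b1)     # matches the earlier estimate of about 0.038
    print(r_12)   # about 0.66 here, well short of perfect collinearity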

  17. Multicollinearity We can see that if r12 = 1 or −1, we are dividing by zero, which is impossible. Often, if r12 is high but not equal to one, you will get a good overall model fit (high R², significant F-statistic) but insignificant t-ratios. You should always examine the correlations between your independent variables to determine whether this might be an issue.

  18. Multicollinearity Multicollinearity does not bias our estimates, but it inflates the variance, and thus the standard error, of the parameters (that is, it increases inefficiency). This is why we get insignificant t-ratios: because t = b / se_b, inflating se_b depresses the t-ratio, making it less likely that we will reject the null hypothesis.

  19. Standard error for bi We can calculate the standard error for b1, for example, as:
  se1 = s / √[Σ(X1 − Mx1)² · (1 − R1²)]
  where R1 = the multiple correlation of X1 with all the other regressors. As R1 increases, our standard error increases. Note that for bivariate regression, the term (1 − R1²) drops out.
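
A numerical check of this expression, assuming the slide-4 data: regress X1 on the remaining regressor to obtain R1², then plug into the formula above. The code is my own sketch, not the textbook's.

    import numpy as np

    Y  = np.array([40, 50, 50, 70, 65, 65, 80], dtype=float)
    X1 = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
    X2 = np.array([10, 20, 10, 30, 20, 20, 30], dtype=float)

    # Standard error of the estimate, s, from the full regression.
    X = np.column_stack([np.ones_like(X1), X1, X2])
    n, k = len(Y), 2
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    s = np.sqrt(np.sum((Y - X @ b)**2) / (n - k - 1))

    # R1^2: how well X1 is explained by the other regressor(s), here just X2.
    Z = np.column_stack([np.ones_like(X2), X2])
    g = np.linalg.lstsq(Z, X1, rcond=None)[0]
    r1_sq = 1 - np.sum((X1 - Z @ g)**2) / np.sum((X1 - X1.mean())**2)

    # se(b1) = s / sqrt( sum((X1 - M_X1)^2) * (1 - R1^2) )
    se_b1 = s / np.sqrt(np.sum((X1 - X1.mean())**2) * (1 - r1_sq))
    print(se_b1)   # roughly 0.006, giving the t-ratio b1 / se_b1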
