70-208 Regression Analysis Week 3 Dielman: Ch 4 (skip Sub-Sec 4.4.2, 4.6.2, and 4.6.3 and Sec 4.7), Sec 7.1
Multiple Independent Variables • We believe that both education and experience affect the salary you earn. Can linear regression still be used to capture this idea? • Yes, of course • The “linear” part of “linear regression” means that the regression coefficients cannot enter the eq’n in a nonlinear way (such as β1² * x1)
Multiple Independent Variables • Salary_i = β0 + β1 * Educ_i + β2 * Exper_i + μ_i • Graphing this equation requires the use of 3 dimensions, so the usefulness of graphical methods such as scatterplots and best-fit lines is somewhat limited now • As the number of explanatory variables increases, the formulas for computing the estimates of the regression coefficients become increasingly complex • So we will not cover how to solve them by hand
Multiple Independent Variables • Equation that “best” describes the relationship btwn a dependent variable y and K independent variables x1, x2, … , xK can be written as: • y = β0 + β1 * x1 + β2 * x2 + … + βK * xK + μ • Note that I will mostly drop the “i” subscript moving fwd • The criterion for “best” is the same as it was for simple (i.e. K = 1) regression – the sum of the squared difference btwn the true values of y and the values predicted yhat should be as small as possible • β0,hat, β1,hat, β2,hat, … , βK,hat ensure that the sum of squared errors is minimized
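The slides skip the hand calculation, but the least-squares criterion can be sketched in a few lines. This is a minimal illustration, not the course's method: the function name `ols` and the data are hypothetical, and the data are built noise-free from y = 1 + 2 * x1 + 3 * x2 so the recovered coefficients are easy to check. It solves the normal equations (X'X) b = X'y, which is the standard way to minimize the sum of squared errors.

```python
# Minimal OLS sketch using only the standard library.
# Hypothetical, noise-free data: y = 1 + 2*x1 + 3*x2 exactly.

def ols(rows, y):
    """Return [b0, b1, ..., bK] that minimize the sum of squared errors."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y] of the normal equations.
    A = [[sum(xi[a] * xi[b] for xi in X) for b in range(p)]
         + [sum(xi[a] * yi for xi, yi in zip(X, y))]
         for a in range(p)]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    # Back substitution.
    b = [0.0] * p
    for r in reversed(range(p)):
        tail = sum(A[r][c] * b[c] for c in range(r + 1, p))
        b[r] = (A[r][p] - tail) / A[r][r]
    return b

x = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5), (6, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in x]
b_hat = ols(x, y)
print([round(b, 6) for b in b_hat])
```

In practice Excel (or any statistics package) does this step for you; the sketch only shows that nothing changes conceptually when K goes from 1 to 2 or more.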
Labeling β • Sometimes we just use β0, β1, β2, … , βK to label the coefficients • Other times, it is useful to be more specific. For example, if x1 represents “education level”, it is better to write β1 as βeduc. • β0 is always written the same • The first regression below is more helpful in seeing and presenting your work than the second regression, even if we knew that y was salary, x1 was education, etc • Salary = β0 + βeduc * Educ + βexper * Exper + μ • y = β0 + β1 * x1 + β2 * x2 + μ • I will go back and forth with my labeling throughout the course. I just wanted you to understand the difference and why one way might be better in practice.
Multiple Independent Variables • Ceteris paribus – all else equal • In the case of simple regression, we interpreted the regression coefficient estimate as meaning how much the dependent variable increased when the independent variable went up one unit • Implicit was the concept that the error terms for any two individuals were equally distributed, in other words, that all else was equal
Multiple Independent Variables • It is very possible that that is a bad implicit assumption • That is one reason we like to add multiple explanatory variables. Once they are added, they are not part of the error term and can be explicitly accounted for when we interpret coefficient estimates • What the hell do I mean by all of this?
Multiple Independent Variables • Go back to the salary example • Hopefully you all agree that education and experience are both highly likely to explain salary in statistically significant ways • But what if we didn’t have experience data, so we just ran the regression on salary and education?
Multiple Independent Variables • What we would like to run: • Salary = β0 + β1 * Educ + β2 * Exper + μ • What we do run: • Salary = β0 + β1 * Educ + μ • Which means that experience has now been sucked into the error term. If experience levels (conditional on education) differ in our sample data set, the implicit assumption that the errors are equally distributed across all observations is wrong! • If we ran the 2nd regression written above, we would interpret β1,hat as the amount by which salary increases when education increases by one unit (implicitly saying all else, i.e. the “errors”, are equal, which I just argued is probably a poor assumption)
Multiple Independent Variables • So now say we have the experience data and we can run the regression with 2 explanatory variables • Now we would interpret β1,hat as the amount by which salary increases when education increases by one unit AND EXPERIENCE IS THE SAME (plus the remaining information captured by the errors is the same across all observations) • So we explicitly take experience out of the error term and can now condition on it being the same when we interpret the education coefficient
Multiple Independent Variables • But how well does the implicit, ceteris paribus, “error” assumption hold up even when both educ and exper are included? • Maybe still not very well. Everything you can think of is still being captured by the error terms except for education and experience levels. If these somehow differ systematically across observations, the assumption of equal error distributions is still wrong!
Multiple Independent Variables • What do I mean by “everything you can think of”? Very simply, anything else that might (or might not!) affect salary. • Years of experience at current company • Number of extended family members that work at same company • Intelligence • How many sick days you took over the past 5 years • How many kids you have • How many siblings you have • How many different cities you’ve lived in • How many hot dogs you eat each year • Etc, etc, etc, blah, blah, blah
Multiple Independent Variables • Let’s look at those more closely • Years of experience at current company – Probably would have a significant effect on salary. We should include this in the regression if we can get the data. • Number of extended family members that work at same company – Might or might not have an effect on salary. • Intelligence – Tough to measure, but could proxy for it using an IQ score. Very likely to affect salary, so it should be included in the regression, too. • How many sick days you took over the past 5 years – Kind of a measure of effort, so I think it would matter. • How many kids you have – Could matter, especially for women. • How many siblings you have – Doubtful it would be significant. • How many different cities you’ve lived in – Very unlikely to be significant. • How many hot dogs you eat each year – I’m literally just making stuff up at this point, so I doubt this would affect salary (unless we are measuring the salaries of competitive eaters, so note that context can matter when “answering” these questions)
Multiple Independent Variables • So what happens if we think intelligence matters but it wasn’t included in the regression as a separate explanatory variable? • Then intelligence is rolled up into the error term. But if education and intelligence are highly correlated (smarter people have more years of education), then the errors are not the same across the individuals in the sample (E(μi|X) ≠ 0). In fact, those with higher education have “higher” error, by which I mean one component of the error term is systematically bigger for some individuals • This would make our ceteris paribus assumption false and we would end up with biased estimators!
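A small simulation can make the bias concrete. Everything here is an assumption made for illustration: the variable names, the sample size, and the "true" model salary = 1 * educ + 2 * intel + noise, with educ built to be correlated with intel. The short regression leaves intel in the error term; the long regression includes it.

```python
import random

random.seed(0)
n = 5000

# Hypothetical data-generating process (not from the slides):
# education is correlated with intelligence, and both raise salary.
intel = [random.gauss(0, 1) for _ in range(n)]
educ = [q + random.gauss(0, 1) for q in intel]
salary = [1.0 * e + 2.0 * q + random.gauss(0, 1) for e, q in zip(educ, intel)]

def centered(v):
    """Subtract the mean so the intercept drops out of the formulas."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

e, q, s = centered(educ), centered(intel), centered(salary)

# Short regression: salary on educ only (intel is stuck in the error term).
b_short = dot(e, s) / dot(e, e)

# Long regression: salary on educ and intel (closed form for two regressors).
det = dot(e, e) * dot(q, q) - dot(e, q) ** 2
b_long = (dot(q, q) * dot(e, s) - dot(e, q) * dot(q, s)) / det

print("true educ coefficient: 1.0")
print("short regression estimate:", round(b_short, 2))
print("long regression estimate: ", round(b_long, 2))
```

Under these assumptions the short-regression estimate lands near 2 rather than the true 1, because education picks up credit for the intelligence it is correlated with. The long regression recovers a value close to 1.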
Multiple Independent Variables • What if we include insignificant variables because we are afraid of getting biased estimates if we don’t throw everything in? • Not really a problem. We will see how to evaluate whether there are any relevant gains to including additional variables. If there are, they should be kept in the regression. If the gains are negligible or even negative, drop those insignificant variables and fear not the repercussions of bias.
Multiple Independent Variables – Output • Look at and interpret output • Sales are dependent on Advertising and Bonus • Run the regression: • Saleshat = -516.4 + 2.47 * Adv + 1.86 * Bonus • This equation can be interpreted as providing an estimate of mean sales for a given level of advertising and bonus payment. • If advertising is held constant, mean sales tend to rise by $1860 (1.86 thousands of dollars) for each unit increase in Bonus. If bonus is held fixed, mean sales tend to rise by $2470 (2.47 thousands of dollars) for each unit increase in Adv.
Multiple Independent Variables – Output • Notice in the Excel output that the dof of the Regression is now 2 (always used to be 1). This is because there are 2 explanatory variables. The SSE, MSR, F, etc are calculated basically the same way as before, which we will go over very soon. • Look at Fig 4.7b on pg 141 of Dielman to see how Excel outputs all the regression information when multiple independent variables are included
Multiple Independent Variables – Prediction • As in simple regression, when we run a multiple regression we can then predict, or estimate, values for y when we have values for every explanatory variable by solving for yhat • Back to sales example with Adv and Bonus only • Saleshat = -516.4 + 2.47 * Adv + 1.86 * Bonus • Say Adv = 200 and Bonus = 150. What would we predict for Sales (i.e. what is Saleshat)? • Plug in Adv = 200 and Bonus = 150 • Saleshat = -516.4 + 2.47 * 200 + 1.86 * 150 = 256.6
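The plug-in step above can be written as a tiny helper. The function name `predict` and the dictionary layout are illustrative choices; the coefficients and inputs are the ones on the slide.

```python
def predict(intercept, coefs, values):
    """Compute yhat = b0 + sum of (coefficient * value) over the explanatory variables."""
    return intercept + sum(coefs[name] * values[name] for name in coefs)

# Estimated sales equation from the slide.
sales_coefs = {"Adv": 2.47, "Bonus": 1.86}
sales_hat = predict(-516.4, sales_coefs, {"Adv": 200, "Bonus": 150})
print(round(sales_hat, 1))  # 256.6
```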
Confidence Intervals and Hypothesis Testing • The confidence interval for βk when K explanatory variables are included is • (βk,hat – tα/2,N-K-1 * sβk, βk,hat + tα/2,N-K-1 * sβk) • Notice the dof change on the t-value • Hypothesis testing on any one independent variable is the same as before. The default Excel test is shown below. • H0 : βk = 0 • Ha : βk≠ 0
Hypothesis Testing • If the null on the previous slide is not rejected, then the conclusion is that, once the effects of all other variables in the regression are included, xk is not linearly related to y. In other words, adding xk to the regression eq’n is of no help in explaining any additional variation in y left unexplained by the other explanatory variables. You can drop xk from the regression and still have the same “fit”.
Hypothesis Testing • If the null is rejected, then there is evidence that xk and y are linearly related and that xk does help explain some of the variation in y not accounted for by the other variables
Hypothesis Testing • Are Sales and Bonus linearly related? • Use t-test • H0 : βBON = 0 • Ha : βBON ≠ 0 • Dec rule → reject null if test stat more extreme than t-value and do not reject otherwise • βBON,hat = 1.856 and sβBON = 0.715, so test stat = 1.856 / 0.715 = 2.593 • The t value with 22 dof (from N-K-1) for a two-tailed test with α = 0.05 is 2.074. • Since 2.593 > 2.074, reject null • Yes, they are linearly related (even when Advertising is also accounted for)
Hypothesis Testing • Could have used p-value or CI to answer the question on previous slide • Would have reached same conclusion • Don’t use full F when testing just one variable (more explanation later)
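As a check that the t-test and the confidence interval agree, here is a short sketch using the Bonus numbers from the earlier slide. The critical value 2.074 is taken from the slide rather than computed. Note that the rounded inputs give a test statistic of about 2.596 instead of the 2.593 quoted earlier; that gap is presumably rounding in the reported estimate and standard error, and it does not change the conclusion.

```python
b_bon = 1.856    # estimated coefficient on Bonus (from the slide)
se_bon = 0.715   # its standard error (from the slide)
t_crit = 2.074   # two-tailed t value, alpha = 0.05, 22 dof (from the slide)

# t-test of H0: beta_BON = 0
t_stat = b_bon / se_bon
reject_by_t = abs(t_stat) > t_crit

# 95% confidence interval: estimate minus/plus t_crit * standard error
ci = (b_bon - t_crit * se_bon, b_bon + t_crit * se_bon)
reject_by_ci = not (ci[0] <= 0 <= ci[1])  # reject if 0 lies outside the interval

print(round(t_stat, 3), reject_by_t)
print(tuple(round(v, 3) for v in ci), reject_by_ci)
```

The interval excludes 0 exactly when |t| exceeds the critical value, so the two approaches cannot disagree.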
Assessing the Fit • Recall SST, SSR, and SSE • SST = ∑ (yi – ybar)² • SSR = ∑ (yi,hat – ybar)² • SSE = ∑ (yi – yi,hat)² • For SSR, dof is equal to number of explanatory variables K • For SSE, dof is N – K – 1 • So SST has N – 1 dof
Assessing the Fit • Recall that R² = SSR / SST = 1 – (SSE / SST) • It was a measure of the goodness of fit of the regression line and ranged from 0 to 1. If R² was multiplied by 100, it represented the percentage of the variation in y explained by the regression. • Drawback to R² in multiple regression → As more explanatory variables are added, the value of R² will never decrease even if the additional variables are explaining an insignificant proportion of the variation in y
Assessing the Fit • From R² = 1 – (SSE / SST), you can see that R² gets increasingly closer to 1 since SSE falls any time any little tiny bit more variation in y is explained • Addition of unnecessary explanatory variables, which add little, if anything, to the explanation of the variation in y, is not desirable • An alternative measure is called adjusted R², or R²adj • “Adjusted” because it adjusts for the dof
Assessing the Fit • R²adj = 1 – (SSE / (N – K – 1)) / (SST / (N – 1)) • Now suppose an explanatory variable is added to the regression model that produces only a very small decrease in SSE. The divisor N-K-1 also falls since K has been increased by 1. It is possible that the numerator of the ratio, SSE / (N – K – 1), may increase if the decrease in SSE from the addition of another variable is not great enough to overcome the decrease in N-K-1. When that happens, R²adj falls.
Assessing the Fit • R²adj no longer represents the proportion of variation in y explained by the regression (that is still captured only by R²), but it is useful when comparing two regressions with different numbers of explanatory variables. A decrease in R²adj from the addition of one or more explanatory variables signals that the added variable(s) were of little importance in the regression, so they can be dropped.
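To see the two measures move in opposite directions, here is a sketch with made-up numbers. SST = 100, N = 25, and both SSE values are hypothetical, chosen so that adding two variables barely lowers SSE. Only the formulas come from the slides.

```python
def r2(sse, sst):
    """Plain R-squared: share of the variation in y explained by the regression."""
    return 1 - sse / sst

def adjusted_r2(sse, sst, n, k):
    """Adjusted R-squared: SSE and SST are each divided by their dof."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# Hypothetical values for illustration only.
SST, N = 100.0, 25
models = {
    "reduced (K=2)": {"sse": 40.0, "k": 2},
    "full (K=4)": {"sse": 39.5, "k": 4},  # two extra variables, tiny drop in SSE
}

for name, m in models.items():
    plain = r2(m["sse"], SST)
    adj = adjusted_r2(m["sse"], SST, N, m["k"])
    print(name, round(plain, 3), round(adj, 3))
```

With these numbers the plain measure rises slightly for the full model while the adjusted measure falls, which is the signal that the extra variables are not earning their keep.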
Assessing the Fit • F = MSR / MSE • MSR = SSR / K • MSE = SSE / (N – K – 1) • Full F statistic is used to test the following hypothesis: • H0 : β1 = β2 = … = βK = 0 • Ha : At least one coefficient above is not equal to 0
Assessing the Fit • Decision rule → reject null if F > fcrit(α; K, N-K-1) and do not rej otherwise • Failing to reject the null implies that the explanatory variables in the regression equation are of little or no use in explaining the variation in y. Rejection of the null implies that at least one (but not necessarily all) of the explanatory variables helps explain the variation in y.
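The full F calculation and decision rule can be sketched as follows. The sums of squares are hypothetical (SSR = 60, SSE = 40, N = 25, K = 2), and the critical value 3.44 for fcrit(0.05; 2, 22) is assumed from a standard F table rather than computed.

```python
def full_f(ssr, sse, n, k):
    """Full F statistic: MSR / MSE, with K and N-K-1 degrees of freedom."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse

# Hypothetical sums of squares for illustration only.
f_stat = full_f(ssr=60.0, sse=40.0, n=25, k=2)
f_crit = 3.44  # assumed table value for alpha = 0.05 with (2, 22) dof

reject_null = f_stat > f_crit
print(round(f_stat, 2), reject_null)
```

A rejection here says only that at least one coefficient is nonzero; it does not say which one.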
Assessing the Fit • Rejection of the null does not mean that all pop’n regression coefficients are different from 0 (though this may be true), just that the regression is useful overall in explaining y. • The full F test can be thought of as a global test designed to assess the overall fit of the model. • That’s why full F cannot be used for hypothesis testing on a single variable in multiple regression, but it could be used for the hypothesis testing on the single explanatory variable in simple regression (since that variable was the whole, “global” model)
Sales Example • Show the calculation of F on the Excel sheet • Using SSE and SSR • Using MSE and MSR • Would we reject the null that all coefficients are equal to 0? • YES
Comparing Two Regression Models • Remember that the t-test can check whether each individual regression coefficient is significant and the full F test can check the overall fit of the regression by asking whether any coefficient is significant • Partial F test is in between – it answers the question of whether some subset of coefficients are significant or not
Comparing Two Regression Models • Want to test whether variables xL+1, … , xK are useful in explaining any variation in y after taking into account variation already explained by x1, … , xL variables • Full model has all K variables: • y = β0 + β1 * x1 + β2 * x2 + … + βL * xL + βL+1 * xL+1 + … + βK * xK + μ • Reduced model only has L variables: • y = β0 + β1 * x1 + β2 * x2 + … + βL * xL + μ
Comparing Two Regression Models • Is the full model significantly better than the reduced model at explaining the variation in y? • H0 : βL+1 = … = βK = 0 • Ha : at least one of them isn’t equal to 0 • If null is not rejected, choose the reduced model • If null is rejected, xL+1, … , xK contribute to explaining y, so use the full model
Comparing Two Regression Models • To test the hypothesis, use the following partial F statistic • Fpart = ((SSER – SSEF) / (K – L)) / ((SSEF) / (N – K – 1)), where the “R” stands for reduced model and “F” stands for full model • SSER – SSEF is always greater than or equal to 0 • Full model includes K – L extra variables which, at worst, explain none of variation in y and in all likelihood explain at least a little of it, so SSE falls • This difference represents the additional amount of variation in y explained by adding xL+1, … , xK to the regression
Comparing Two Regression Models • This measure of improvement is then divided by the number of additional variables included, K – L • Thus the numerator of Fpart is the additional variation in y explained per additional explanatory variable used • Reject null if Fpart > fcrit(α; K – L, N – K – 1) and do not reject otherwise
Sales Example Revisited • Example 4.4, pg 152 of Dielman • Let’s add two more variables to the sales example from earlier • x3 is mkt share held by company in each territory and x4 is largest competitor’s sales in each territory • So the “reduced” model results we already have. They were shown earlier when just x1 (Adv) and x2 (Bonus) were included
Sales Example Revisited • We need to see the full model results • Notice that R² is higher for the full model (remember, R² can never fall when more variables are added) but R²adj is actually lower • This should be a clue that we will probably not reject the null on β3 and β4 when comparing the full and reduced models
Sales Example Revisited • SSER = 181176, SSEF = 175855, K – L = 2, N – K – 1 = 20 • Note that this last value is the dof of SSE in the full model • So Fpart = ((181176 – 175855) / 2) / (175855 / 20) = 0.303 • fcrit(0.05; 2, 20) = 3.49 • Since 0.303 < 3.49, do not reject null • We cannot reject β3 = β4 = 0, so x3 and x4 should not be included in the regression
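The partial F arithmetic for the sales example can be reproduced directly. The function name `partial_f` is an illustrative choice; all numbers come from the slides, including the 22 dof for the reduced model used in the earlier t-test. The last lines also show why the adjusted R² fell: SST / (N – 1) is the same for both models, so the adjusted measure drops whenever SSE divided by its dof rises.

```python
def partial_f(sse_reduced, sse_full, extra_vars, dof_full):
    """Partial F: extra variation explained per added variable, over the full-model MSE."""
    return ((sse_reduced - sse_full) / extra_vars) / (sse_full / dof_full)

f_part = partial_f(sse_reduced=181176, sse_full=175855, extra_vars=2, dof_full=20)
f_crit = 3.49  # fcrit(0.05; 2, 20), from the slide
print(round(f_part, 3), f_part > f_crit)  # 0.303 False

# SSE per degree of freedom in each model.
mse_reduced = 181176 / 22  # reduced model: N - L - 1 = 22
mse_full = 175855 / 20     # full model:    N - K - 1 = 20
print(round(mse_reduced, 1), round(mse_full, 1))
```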
Sales Example Revisited • Notice that the values for β0,hat, βADV,hat, and βBON,hat changed when we added additional variables • Saleshat = -516.4 + 2.47 * Adv + 1.86 * Bonus • Saleshat = -593.5 + 2.51 * Adv + 1.91 * Bonus + 2.65 * Mkt_Shr – 0.121 * Compet • This should not surprise you. Some of what was previously rolled up into μ has now been explicitly accounted for, and that changes the way the initial set of explanatory variables relate to Sales. • Note that the inclusion of additional observations (i.e. we gather more data) could also adjust the estimates of β0,hat, etc • Every regression is different! (like snowflakes.......)
Sales Example Revisited • If we chose to stick with the “full” sales model, we would include the x3 and x4 variables in predicting Saleshat • Even though they are insignificant, because the β0,hat, βADV,hat, and βBON,hat values changed with their inclusion, it would be wrong to make predictions without them (unless we re-ran the original regression where they were not even included) • So what is Saleshat for Adv = 500, Bonus = 150, Mkt_Shr = 0.5, and Compet = 100? • Saleshat = -593.5 + 2.51 * 500 + 1.91 * 150 + 2.65 * 0.5 – 0.121 * 100 = 937.2
Limits to K? • There are K + 1 coefficients that need to be estimated (β0, β1, … , βK) • We need at least K + 1 observations to estimate that many coefficients • Normally written as K ≤ N – 1 • This is similar to a concept from an algebra class you’d have taken in middle school, where we needed at least M equations to solve for X unknowns (i.e. M ≥ X) • Here, you can think of N being similar to the number of equations and K + 1 being the number of unknowns to be solved for
Multicollinearity • For a regression of y on K explanatory variables, it is hoped that the explanatory variables are highly correlated with the dependent variable • However, it is not desirable for strong relationships to exist among the explanatory variables themselves • When explanatory variables are correlated with one another, the problem of multicollinearity is said to exist
Multicollinearity • Seriousness of problem depends on degree of correlation • Some books list an additional assumption of OLS that the sample data X is not all the same value, and a follow-up assumption that X1 cannot directly determine X2 • The first point made in the last bullet hardly ever happens. As long as X varies in the population, the sample data will almost always vary unless the pop’n variation is minimal or the sample size is very small. • The second point made in the last bullet expressly forbids perfect multicollinearity to occur between any 2 explanatory variables
Biggest Problem for MultiC • The std errors of regression coefficients are large when there is high multicollinearity among explanatory variables • The null hypo that the coefficients are 0 may not be rejected even when the associated variable is important in explaining variation in y • Summary: Perfect collinearity is fatal for a regression. Any small degree of multicollinearity increases std errors and is thus somewhat undesirable, though basically unavoidable. • We will look at one strategy for investigating multicollinearity and using it to inform our regression choices next (free preview: Fpart is useful)
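To put rough numbers on how std errors grow, here is a sketch for the special case of exactly two explanatory variables. It relies on a standard textbook result that is not stated in these slides: with two regressors whose sample correlation is r, each slope's standard error is scaled by 1 / sqrt(1 – r²) relative to the uncorrelated case, holding everything else fixed. The function name is hypothetical.

```python
import math

def se_inflation(r):
    """Factor by which a slope's std error grows when the two regressors have correlation r.

    Assumes exactly two explanatory variables and |r| < 1.
    """
    return 1.0 / math.sqrt(1.0 - r ** 2)

for r in (0.0, 0.5, 0.9, 0.99):
    print(r, round(se_inflation(r), 2))
```

The factor stays close to 1 for moderate correlation and grows without bound as r approaches 1, which matches the summary above: a little multicollinearity is tolerable, while perfect collinearity is fatal.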
Baseball Example • Example comes from the Wooldridge text • I believe baseball player salaries are determined by years in the league, avg games played per year, career batting average, avg home runs per year, and avg RBIs per year • So the following regression is run: • log(salary) = β0 + β1 * years + β2 * games_yr + β3 * cavg + β4 * hr_yr + β5 * rbi_yr + μ • Ignore the log for now, that’s for next week. I just wanted to stay kosher with the example from my other book. Just think of it as “salary” if it really bothers you.
Baseball Example • Results • Plus N = 353 and SSEF = 183.186
Baseball Example • Simple t-test on the last three coefficients would say they are insignificant in explaining log(salary) • But any baseball fan knows that batting avg, home runs, and RBIs definitely are big factors in determining player salaries (and team performance for that matter) • So let’s run the reduced model where we drop out those three variables and check to see what the partial F statistic reveals
Baseball Example • Results • Plus N = 353 and SSER = 198.311
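With both SSE values in hand, the partial F statistic for dropping cavg, hr_yr, and rbi_yr can be sketched as below. N, K = 5, and the two SSE values come from the slides. The comparison value of roughly 2.6 for the 5% critical value with 3 and 347 dof is an assumption based on standard F tables, not a number from the slides.

```python
n, k_full, k_reduced = 353, 5, 2
sse_full = 183.186
sse_reduced = 198.311

extra_vars = k_full - k_reduced  # 3 variables dropped: cavg, hr_yr, rbi_yr
dof_full = n - k_full - 1        # 347

f_part = ((sse_reduced - sse_full) / extra_vars) / (sse_full / dof_full)
approx_f_crit = 2.6              # assumed approximate table value, alpha = 0.05

print(round(f_part, 2), f_part > approx_f_crit)
```

Under these assumptions the three variables are jointly significant even though each looked insignificant in its own t-test, which is the pattern multicollinearity tends to produce.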