Lecture 7 Multiple Regression & Matrix Notation

Quantitative Methods 2
Edmund Malesky, Ph.D., UCSD

Order of Presentation
1. Review of Variance of Beta Hat
2. Review of T-Tests
3. Review of Quadratic Equations
4. Introduction to Multiple Regression
5. The Role of Control Variables
6. Interpreting Regression Output

What does the variance of beta hat tell us?
• Remember, we are working with a sample of the true population.
• We are using that sample as a way to estimate the true relationships between variables (the regression parameters in the population).
• As in QM1, we must always remember that our estimates will be slightly different each time we sample from the population.
• We know that, over repeated sampling, the mean of our estimates will equal the population parameter, but we still want some sense of the variance.
• The smaller the variance, the more efficient the estimate.
• As a result, we need some sense of the range that would occur after repeated sampling.
• The confidence interval, derived from the standard error (SE) of the regression parameter, is the way we estimate that range.

The Estimated Variance of β1hat
Remember how we did this in STATA in Lecture 4: we divided the Root Mean Squared Error of the model by the square root of the total variation in the independent variable, $\sqrt{\sum_i (x_i - \bar{x})^2}$, which equals the standard deviation of x times $\sqrt{n-1}$. This gave us the standard error (SE) of Beta Hat, which is the square root of its estimated variance:
$\widehat{Var}(\hat\beta_1) = \hat\sigma_u^2 / \sum_i (x_i - \bar{x})^2$
• Var(β1hat) has nice intuitive qualities
• As the size of the errors decreases, Var(β1hat) decreases
  • The line fits tightly through the data. Few other lines could fit as well
• As the variation in x increases, Var(β1hat) decreases
  • Few lines will fit without large errors for extreme values of x

The Estimated Variance of β1hat • Because the variance of the estimated errors has n in the denominator, as n increases, the variance of β1hat decreases • The more data points we must fit to the line, the smaller the number of lines that fit with few errors • We have more information about where the line must go.

Variance of β1hat is Critical for Hypothesis Testing • T-test – tests that individual coefficients are not zero. • This is the central task for testing most policy theories

T-Tests • In general, our theories give us hypotheses that β0 >0 or β1 <0, etc. • We can estimate β1hat , but we need a way to assess the validity of statements that β1 is positive or negative, etc. • We can rely on our estimate of β1hat and its variance to use probability theory to test such statements.

Z-Scores & Hypothesis Tests
• We know that β1hat ~ N(β1, σβ²), where σβ is the standard deviation of β1hat
• Subtracting β1 from both sides, we can see that (β1hat - β1) ~ N(0, σβ²)
• Then, if we divide by the standard deviation, we can see that: (β1hat - β1) / σβ ~ N(0, 1)
• To test the “Null Hypothesis” that β1 = 0, we can see that under the null: β1hat / σβ ~ N(0, 1)

Z-Scores & Hypothesis Tests
• This variable is a “z-score” based on the standard normal distribution.
• 95% of cases are within 1.96 standard deviations of the mean.
• If |β1hat / σβ| > 1.96, then a value this far from zero would occur in fewer than 5% of random draws if β1 were truly 0, so we reject the null at the 5% level.
• The key problem is that we don’t actually know σβ, the true population parameter.
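
The 1.96 figure on this slide can be checked directly. The following is a minimal sketch using Python's standard library (the variable names are illustrative, not from the lecture):

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

# Share of draws falling within 1.96 standard deviations of the mean.
share_within = z.cdf(1.96) - z.cdf(-1.96)

# The two-tailed 5% critical value, recovered from the inverse CDF.
critical_value = z.inv_cdf(0.975)

print(f"P(|Z| <= 1.96) = {share_within:.4f}")
print(f"two-tailed 5% critical value = {critical_value:.3f}")
```

The same inverse-CDF call with 0.95 in place of 0.975 would give the critical value for a two-tailed test at the 10% level.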

Z-Scores and t-scores
• Obvious solution is to substitute σβhat, our sample estimate (the standard error), in place of σβ
• Problem: β1hat / σβhat is the ratio of two random variables, and this will not be normally distributed
• Fortunately, an employee of Guinness Brewery (William Sealy Gosset, writing as “Student”) figured out this distribution in 1908

The t-statistic
• The statistic is called “Student’s t,” and the t-distribution looks similar to a normal distribution
• Thus β1hat / σβhat ~ t(n-2) for bivariate regression.
• More generally β1hat / σβhat ~ t(n-k)
  • where k is the # of parameters estimated

The t-statistic • Note the addition of a “degrees of freedom” constraint • Thus the more data points we have relative to the number of parameters we are trying to estimate, the more the t distribution looks like the z distribution. • When n>100 the difference is negligible

Limited Information in Statistical Significance Tests • Results often illustrative rather than precise • Only tests “not zero” hypothesis – does not measure the importance of the variable (look at confidence interval) • Generally reflects confidence that results are robust across multiple samples

T-distribution: The Statistical Workhorse
As the degrees of freedom increase, the t-distribution approaches the normal distribution.
[Figure: t-distribution densities for df=2, df=4, and df=6, plotted from -3 to 3 and centered at 0.]

Quick Review: Hypothesis Testing
• In STATA, the null hypothesis for a two-tailed t-test is: H0: βj=0

Quick Review: Hypothesis Testing
• To test the hypothesis, I need a rejection rule. That is, I will reject the null hypothesis if the absolute value of t is greater than some critical value (c) of the t distribution. You may know this in Excel lingo as tcrit.
• c is up to me to some extent: I must determine what level of significance I am willing to accept.
• For instance, if my t-value is 1.85 with 40 df and I am willing to reject only at the 5% level, my c equals 2.021 and I do not reject the null.
• On the other hand, if I am willing to reject at the 10% level, my c is 1.684, and I reject the null hypothesis.

t-distribution: 5% rejection rule for H0: βj=0 with 25 degrees of freedom
• Looking at table G-2, I find the critical value for a two-tailed test is 2.06
[Figure: t-distribution centered at 0, with rejection regions of area .025 in each tail, beyond -2.06 and 2.06.]

Quick Review:
• But this operation hides some very useful information.
• STATA instead reports the smallest level of significance at which the null hypothesis would be rejected. This is known as the p-value.
• In the previous example (t = 1.85 with 40 df), we know that .05 < p < .10.
• To calculate p, STATA computes the area under the probability density function of the t distribution in the tails beyond |t|.

T-distribution: Obtaining the p-value against a two-sided alternative, when t=1.85 and df=40
• P-value = P(|T| > t)
• In this case, P(|T| > 1.85) = 2P(T > 1.85) = 2(.0359) = .0718
[Figure: t-distribution centered at 0; central area = .9282, with a rejection region of area .0359 in each tail.]
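
The tail-area calculation on this slide can be reproduced numerically. The sketch below is not how STATA computes it internally; it simply integrates the Student's t density with Simpson's rule, using only Python's standard library. The function names `t_pdf` and `two_sided_p_value` are illustrative.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log(1 + t * t / df))

def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| > |t_stat|), using Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 == 1 else 2
        total += weight * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # The density is symmetric, so each tail holds 0.5 minus that area.
    return 2 * (0.5 - area_zero_to_t)

p = two_sided_p_value(1.85, 40)
print(f"p-value = {p:.4f}")

# Rejection rule with the critical values quoted for 40 df in the lecture.
print("reject at 5%: ", abs(1.85) > 2.021)
print("reject at 10%:", abs(1.85) > 1.684)
```

The p-value carries more information than the two yes/no decisions: it shows how close the result is to each threshold.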

For Example… Presidential Approval and the CPI

. reg approval cpi

  Source |       SS       df       MS                  Number of obs =     148
---------+------------------------------               F(  1,   146) =    9.76
   Model |  1719.69082     1  1719.69082               Prob > F      =  0.0022
Residual |  25731.4061   146  176.242507               R-squared     =  0.0626
---------+------------------------------               Adj R-squared =  0.0562
   Total |  27451.0969   147  186.742156               Root MSE      =  13.276

------------------------------------------------------------------------------
approval |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     cpi |  -.1348399   .0431667    -3.124   0.002      -.2201522   -.0495277
   _cons |   60.95396   2.283144    26.697   0.000       56.44168    65.46624
------------------------------------------------------------------------------

. sum cpi

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
     cpi |     148    46.45878   25.36577       23.5        109
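
As a check on how the pieces of this output fit together, the sketch below rebuilds the standard error, the t statistic, and the critical value behind the confidence interval from the reported numbers. It is plain Python arithmetic on figures copied from the output, not a re-estimation; the variable names are illustrative.

```python
import math

# Figures copied from the Stata output for `reg approval cpi` and `sum cpi`.
coef = -0.1348399
std_err = 0.0431667
root_mse = 13.276
sd_cpi = 25.36577
n_obs = 148
ci_lower, ci_upper = -0.2201522, -0.0495277

# SE(beta1hat) = Root MSE / sqrt(sum of squared deviations of x),
# and that sum equals (n - 1) * sd(x)^2.
se_rebuilt = root_mse / (sd_cpi * math.sqrt(n_obs - 1))

# The t statistic for H0: beta1 = 0 is the coefficient over its SE.
t_stat = coef / std_err

# The interval is coef +/- c * SE, so half its width divided by the SE
# recovers the critical value c that Stata used (t with 146 df).
implied_critical = (ci_upper - ci_lower) / (2 * std_err)

print(f"rebuilt SE       = {se_rebuilt:.5f}")
print(f"t statistic      = {t_stat:.3f}")
print(f"implied critical = {implied_critical:.3f}")
```

The implied critical value sits slightly above 1.96, as expected for a t distribution with 146 degrees of freedom.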

So the distribution of β1hat is:

Now Let's Look at Approval and the Unemployment Rate

. reg approval unemrate

  Source |       SS       df       MS                  Number of obs =     148
---------+------------------------------               F(  1,   146) =    0.85
   Model |  159.716707     1  159.716707               Prob > F      =  0.3568
Residual |  27291.3802   146  186.927262               R-squared     =  0.0058
---------+------------------------------               Adj R-squared = -0.0010
   Total |  27451.0969   147  186.742156               Root MSE      =  13.672

------------------------------------------------------------------------------
approval |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
unemrate |  -.5973806   .6462674    -0.924   0.357      -1.874628    .6798672
   _cons |   58.05901   3.814606    15.220   0.000       50.52003    65.59799
------------------------------------------------------------------------------

. sum unemrate

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
unemrate |     148    5.640541   1.744879        2.6       10.7
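
The summary statistics in the header of this output can be rebuilt from the sums of squares in the ANOVA table. A short sketch, again using only numbers copied from the output (variable names are illustrative):

```python
import math

# Figures copied from the Stata output for `reg approval unemrate`.
model_ss, resid_ss, total_ss = 159.716707, 27291.3802, 27451.0969
model_df, resid_df = 1, 146
coef, std_err = -0.5973806, 0.6462674

# R-squared is the share of total variation explained by the model.
r_squared = model_ss / total_ss

# Root MSE is the square root of the residual mean square.
resid_ms = resid_ss / resid_df
root_mse = math.sqrt(resid_ms)

# F compares the model mean square with the residual mean square.
f_stat = (model_ss / model_df) / resid_ms

# With a single regressor, F equals the square of the slope's t statistic.
t_stat = coef / std_err

print(f"R-squared = {r_squared:.4f}")
print(f"Root MSE  = {root_mse:.3f}")
print(f"F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
```

The match between F and t squared holds only in the bivariate case; with several regressors, F tests all slopes jointly.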

Now the Distribution of β1hat is:

Quadratic Review

Quadratic Review
[Figure: curve of y plotted against x1, with x-axis ticks at 0 and 1.]

Quadratic Review
• β0hat is the intercept, as in the linear equation
• β1hat is the slope at x = 0, which approximates the change in y from x = 0 to the first unit of x.
• β2hat is used to calculate the slope at other points on the curve.
• A positive β2hat means the curve turns upward.
• A negative β2hat means the curve turns downward.
• Use equation 1 to get the predicted value for each point on the curve.
• Use equation 2 to get the slope for each point on the curve.
• Use equation 3 to isolate the point where the slope is equal to 0.
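
The three numbered equations the bullets refer to did not survive in this copy of the slides. Assuming the standard quadratic specification, with the numbering chosen to match the bullets, they would read roughly as follows:

```latex
\begin{align}
\hat{y} &= \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2 \tag{1}\\
\frac{\partial \hat{y}}{\partial x} &= \hat\beta_1 + 2\hat\beta_2 x \tag{2}\\
x^{*} &= -\frac{\hat\beta_1}{2\hat\beta_2} \tag{3}
\end{align}
```

Equation 3 follows from setting equation 2 equal to zero and solving for x; at x = 0, equation 2 reduces to β1hat, which is why β1hat is the slope at the origin.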

Figure 1: Kuznets Predictions and Actual Relationship between Growth and Inequality

Dealing with a Complicated World • Multiple regression to address multiple causes

Multiple Regression: What if y has more than just ONE cause?
• We have found an estimator for the relationship between x and y
• We have developed methods to use the estimator to test hypotheses derived from theories about x and y.
• But we have only 1 x (and only 1 β)
• The world is more complicated than that!

Multiple Regression Analysis • We can make a simple extension of the bivariate model to the multivariate case • Instead of a two dimensional space (x and y axes) we move into multi-dimensional space • If we have x1 and x2, then we are fitting a two dimensional plane through points in space.

The Bivariate Regression
[Figure: scatter plot with fitted line. y: Votes for Candidate A in 2004; x1: Expenditures of Candidate A in $1000s (2000-2003).]

Now, we add another variable (x2)
[Figure: three-dimensional plot. y: Votes for Candidate A in 2004; x1: Expenditures of Candidate A in $1000s (2000-2003); x2: 100s of new jobs created (2000-2003).]

Explanation of Multivariate Analysis Because multiple linear regression includes more than a single independent variable, the result of an analysis is best visualized as a plane rather than as the line of a bivariate regression analysis. This plane is defined by a series of slopes and a y-intercept value, and oriented such that deviations between the observed data points and the plane are minimized in the direction of the dependent variable.

Two Dimensional Plane in 3D Space
[Figure: regression plane. y: Votes for Candidate A in 2004; x1: Expenditures of Candidate A in $1000s (2000-2003); x2: 100s of new jobs created (2000-2003).]

Interpretation of Multiple Regression
Q1. If we modeled only an equation with expenditures, where would the impact of Job Growth show up in our results?
A1. It would show up in a larger residual size.
Q2. How do I interpret the coefficient β1hat in my STATA output?
A2. β1hat is the ceteris paribus effect of expenditures on vote changes, controlling for job growth.
Q3. What does “controlling for” mean?
A3. It means that Job Growth is held fixed or constant: ΔJob Growth = 0.

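Another way to think of Partial Effects
$\hat\beta_1 = \frac{\sum_i \hat{r}_{i1}\, y_i}{\sum_i \hat{r}_{i1}^2}$, where $\hat{r}_{i1}$ are the residuals from a regression of x1 on x2.
In other words, I regress x1 on x2 and keep the residuals. Basically, β1hat measures the sample relation between x1 and y, after x2 has been partialled out.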
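
The partialling-out idea can be checked on a small example. The sketch below uses invented data (the numbers are hypothetical, chosen only for illustration) and plain Python: it first regresses x1 on x2, then relates y to the residuals, and finally compares the result with the slope from the two-regressor normal equations.

```python
def mean(values):
    return sum(values) / len(values)

def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on a single x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))

# Hypothetical data, invented for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [5.1, 5.9, 11.2, 11.8, 17.1, 18.0]

# Step 1: regress x1 on x2 and keep the residuals (x1 with x2 partialled out).
a0, a1 = simple_ols(x2, x1)
r1 = [x1i - (a0 + a1 * x2i) for x1i, x2i in zip(x1, x2)]

# Step 2: relate y to those residuals.
beta1_partial = sum(r * yi for r, yi in zip(r1, y)) / sum(r * r for r in r1)

# Cross-check: the slope on x1 from the two-regressor normal equations.
s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)
beta1_direct = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

print(f"partialled-out slope: {beta1_partial:.6f}")
print(f"direct slope:         {beta1_direct:.6f}")
```

The two numbers agree up to floating-point error, which is the sense in which β1hat uses only the part of x1 that x2 cannot explain.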

Illustrating this approach to β1hat
• Coefficient β1hat is calculated based on the area in the yellow circle that overlaps with blue, but NOT with red.
• Center area discarded: we can’t say which variable accounts for it.
[Figure: Venn diagram of three overlapping circles labeled x1, y, and x2; the overlap of x1 and x2 is marked “Covariance between x1 and x2”.]

Relationship between Multiple Regression and Bivariate World.

An Example from HW1

Change in Gauss-Markov Assumption 4 – Zero Conditional Mean
• Before, we summarized GM4 as “the population error (u) has an expected value of 0 for any value of the explanatory variable (x).” Essentially, this meant that other factors having a direct impact on y (i.e. changes in votes) are unrelated on average to x (expenditures). The equation also implies that we have correctly specified the functional form between the independent and dependent variables!
• Now, GM4 becomes “the population error (u) has an expected value of 0 for any combination of x1 & x2.” Other factors having a direct impact on y (i.e. changes in votes) are unrelated on average to x1 (expenditures) and x2 (job growth).

Change in Gauss-Markov Assumption 3 – No Perfect Collinearity
• Before, GM3 was that there must be sample variation in the explanatory variable: the xi’s are not all the same value.
• Now GM3 reads: none of the independent variables is constant, and there are no exact linear relationships among the independent variables.
• Essentially, if any one of our x’s is perfectly explained by the others, it will drop out of our model.

Multiple Regression Analysis
• Above 3 dimensions MR becomes difficult to visualize.
• Logic of the process is the same. We are fitting SETS of x’s to each point on a y dimension.
• β0hat remains the intercept, and β1hat, β2hat, ... are called slope estimates.
• Though in a quadratic function, calling both coefficients slope estimates is slightly incorrect. Why?
• The basic equation of the true population model in scalar notation is:
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + u_i$

And the Slope Coefficients?
• In scalar terms the equation for βhat of variable k becomes:
$\hat\beta_k = \frac{\sum_i (x_{ik} - \hat{x}_{ik})\, y_i}{\sum_i (x_{ik} - \hat{x}_{ik})^2}$
• where $\hat{x}_{ik}$ = the linear prediction of xik based on the other x’s
• Similar to the bivariate estimator. Then we used $\bar{x}$ because we lacked any better expectation about x
Please note that Wooldridge indexes observations by “t” instead of “i” in the matrix algebra discussions. We use “i” to maintain the analogy with scalar algebra.

Shifting to Matrix Notation
• Writing out these terms and multiplying them in scalar notation is clumsy.
• Represented in simpler terms through linear (matrix) algebra
• The basic equation becomes: $y = X\beta + u$

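The Multiple Regression Equation
• The vectors and matrices in $y = X\beta + u$ are: y, an n×1 vector of outcomes; X, an n×k matrix of regressors whose first column is all ones for the intercept; β, a k×1 vector of parameters; and u, an n×1 vector of errors, where k is the # of parameters estimated.
• Note that we post-multiply X by β since this order makes them conformable.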

Math Tools With Matrices • To derive our vector of coefficients βhat, we will need to do some math with matrices • Multiplying matrices • Taking the transpose of a matrix • Inverting a matrix

We Can Multiply Matrices • Multiplication • Where
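
The worked example from this slide is not recoverable here, so the following is a stand-in sketch: a plain-Python matrix product applied to a small, hypothetical X and β (both invented for illustration). It also shows why the order matters for conformability.

```python
def matmul(a, b):
    """Multiply matrix a (m x n) by matrix b (n x p), both lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("not conformable: columns of a must equal rows of b")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Hypothetical 3 x 2 regressor matrix: a column of ones plus one x variable.
X = [[1, 2],
     [1, 4],
     [1, 6]]

# Hypothetical 2 x 1 coefficient vector (intercept 10, slope 0.5).
beta = [[10.0],
        [0.5]]

# X (3 x 2) times beta (2 x 1) is conformable and gives a 3 x 1 vector.
fitted = matmul(X, beta)
print(fitted)

# Reversing the order fails: beta is 2 x 1 and X is 3 x 2.
try:
    matmul(beta, X)
except ValueError as err:
    print("beta times X:", err)
```

Each entry of the product is a row of the first matrix multiplied element by element with a column of the second and summed, which is exactly the scalar expression β0 + β1x for each observation.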

The Transpose of a Matrix A' • Taking the transpose is an operation that creates a new matrix based on an existing one. • The rows of A = the columns of A' • Hold upper left and lower right corners and rotate 180 degrees.

Example of a transpose
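
The slide's own example is missing from this copy. As a stand-in, here is a minimal sketch of a transpose in plain Python, using a hypothetical 2 x 3 matrix A:

```python
def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*a)]

# Hypothetical 2 x 3 matrix, invented for illustration.
A = [[1, 2, 3],
     [4, 5, 6]]

# The rows of A become the columns of A', so A' is 3 x 2.
A_t = transpose(A)

for row in A_t:
    print(row)
```

Transposing twice returns the original matrix, which is a quick way to check the operation.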
