Regression Analysis

Regression Analysis W&W Chapter 12 & 15 (1-3)

Correlation and Regression I • Bivariate regression shows us how variables are linearly related. Correlational analysis tells us the degree to which variables are related. • Recall how to calculate correlation, or r: $r = \frac{\sum (X - M_x)(Y - M_y)}{\sqrt{\sum (X - M_x)^2 \sum (Y - M_y)^2}} = \frac{\text{covariance}(X, Y)}{s_x s_y}$

Correlation and Regression II Compare this calculation to the one for a bivariate slope coefficient, b: $b = \frac{\sum (X - M_x)(Y - M_y)}{\sum (X - M_x)^2} = \frac{\text{covariance}(X, Y)}{\text{variance}(X)}$ With correlation, we do not make a distinction between a dependent variable, Y, and an independent variable, X. Thus the denominator of r takes into account the variation of both X and Y, while the denominator of b uses only the variation of X.

Correlation and Regression III What is correlation intuitively? We subtract the mean from each X and from each Y, which tells us how far a pair (X,Y) is from the point of means (Mx, My). If r is positive, then X and Y tend to be both above or both below their means. If r is negative, then when one variable is above its mean, the other tends to be below its mean.

Correlation and Regression IV We have already seen that b and r are similar. In fact, we can write b in terms of r as: $b = r \frac{s_y}{s_x}$
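The relation between b and r follows from the two covariance formulas. A short derivation, assuming $s_x$, $s_y$ and the covariance are all computed with the same divisor (n or n-1):

```latex
\begin{aligned}
b &= \frac{\operatorname{cov}(X, Y)}{s_x^2}
   = \frac{\operatorname{cov}(X, Y)}{s_x s_y} \cdot \frac{s_y}{s_x}
   = r \, \frac{s_y}{s_x}
\end{aligned}
```

Because $s_y / s_x$ is positive, b and r always have the same sign.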

Example Suppose we collect the following data on a sample (N=8) of students’ math (X) and verbal (Y) scores on a college entrance exam.
X    Y    |  X    Y
80   65   |  72   48
50   60   |  60   44
36   35   |  56   48
58   39   |  68   61
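As a rough sketch, the quantities discussed so far can be computed for these eight students with plain Python. The variable names are illustrative, and the intercept formula a = My - b·Mx is the standard least-squares intercept (assumed here, since the slides only refer to it as previously learned).

```python
import math

# Math (X) and verbal (Y) scores from the example, N = 8.
X = [80, 50, 36, 58, 72, 60, 56, 68]
Y = [65, 60, 35, 39, 48, 44, 48, 61]

n = len(X)
mx = sum(X) / n                      # Mx
my = sum(Y) / n                      # My

# Sums of cross-products and squared deviations.
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
sxx = sum((x - mx) ** 2 for x in X)
syy = sum((y - my) ** 2 for y in Y)

r = sxy / math.sqrt(sxx * syy)       # correlation
b = sxy / sxx                        # slope
a = my - b * mx                      # intercept (standard formula, assumed)

# Sample standard deviations, to check b = r * (sy / sx).
sx = math.sqrt(sxx / (n - 1))
sy = math.sqrt(syy / (n - 1))

print(f"Mx = {mx}, My = {my}")
print(f"r = {r:.3f}, b = {b:.3f}, a = {a:.3f}")
print(f"r * sy / sx = {r * sy / sx:.3f}")
```

For these data the means are 60 and 50, the slope is about 0.50, and the correlation is about 0.63.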

Example Let’s plot these data in a scatterplot. We could calculate the regression line using the formulas for slope and intercept we learned. But suppose we want to predict a student’s verbal score when their math score is unknown. What would be our best guess?

Total Deviation Our best guess would be the most typical value, or the mean (My). The prediction error for a given student would be the distance between their verbal score and the mean, or (Y-My). But we can do better with information on X by using the regression line (Yp) for our prediction.

Total Deviation $(Y - M_y) = (Y_p - M_y) + (Y - Y_p)$, that is, Total Deviation = Explained Deviation + Unexplained Deviation. The same is true for their sums of squares: $\sum (Y - M_y)^2 = \sum (Y_p - M_y)^2 + \sum (Y - Y_p)^2$, that is, SS(total) = SSR + SSE.
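The decomposition can be checked numerically on the exam-score example. This is a minimal sketch with illustrative variable names; it fits the least-squares line and confirms that the explained and unexplained sums of squares add up to the total.

```python
# Math (X) and verbal (Y) scores from the example, N = 8.
X = [80, 50, 36, 58, 72, 60, 56, 68]
Y = [65, 60, 35, 39, 48, 44, 48, 61]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Least-squares slope and intercept.
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
a = my - b * mx

# Predicted values Yp on the regression line.
Yp = [a + b * x for x in X]

ss_total = sum((y - my) ** 2 for y in Y)               # total
ssr = sum((yp - my) ** 2 for yp in Yp)                 # explained
sse = sum((y - yp) ** 2 for y, yp in zip(Y, Yp))       # unexplained

print(f"SS(total) = {ss_total:.2f}")
print(f"SSR + SSE = {ssr:.2f} + {sse:.2f} = {ssr + sse:.2f}")
```

The identity holds exactly only when Yp comes from the least-squares line; for an arbitrary line the cross-product term does not vanish.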

Analysis of Variance ANOVA table for Linear Regression
Source of variation | df | Sum of squares | Mean squares
Regression | k | $\sum (Y_p - M_y)^2$ | MSR = $\sum (Y_p - M_y)^2 / k$
Error | n-k-1 | $\sum (Y - Y_p)^2$ | MSE = $\sum (Y - Y_p)^2 / (n-k-1)$
Total | n-1 | $\sum (Y - M_y)^2$ |
k = the number of independent variables

Analysis of Variance We can calculate the F-statistic like we did for ANOVA previously. $F = \frac{MSR}{MSE} = \frac{\text{variance explained by regression}}{\text{unexplained variance}}$, with k and n-k-1 numerator and denominator degrees of freedom.
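A sketch of the ANOVA calculation for the exam-score example, where k = 1 because math score is the only independent variable. Variable names are illustrative.

```python
# Math (X) and verbal (Y) scores from the example, N = 8.
X = [80, 50, 36, 58, 72, 60, 56, 68]
Y = [65, 60, 35, 39, 48, 44, 48, 61]

n, k = len(X), 1                      # one independent variable
mx, my = sum(X) / n, sum(Y) / n

b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
a = my - b * mx
Yp = [a + b * x for x in X]

ssr = sum((yp - my) ** 2 for yp in Yp)              # regression sum of squares
sse = sum((y - yp) ** 2 for y, yp in zip(Y, Yp))    # error sum of squares

msr = ssr / k                         # regression mean square
mse = sse / (n - k - 1)               # error mean square
F = msr / mse

print(f"Regression: df = {k}, SS = {ssr:.2f}, MS = {msr:.2f}")
print(f"Error:      df = {n - k - 1}, SS = {sse:.2f}, MS = {mse:.2f}")
print(f"F = {F:.3f} on ({k}, {n - k - 1}) degrees of freedom")
```

The resulting F (about 3.87) would then be compared with the critical value of the F distribution with 1 and 6 degrees of freedom.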

F-test • For a bivariate regression model, the null hypothesis for the F-test is equivalent to the null hypothesis for a t-test of the slope coefficient. $Y = \alpha + \beta X + e$, $H_0: \beta = 0$, $H_A: \beta \neq 0$
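The equivalence can be seen numerically: in the bivariate case the squared t-statistic for the slope equals the F-statistic. The sketch below assumes the usual estimated standard error of b, $s / \sqrt{\sum (X - M_x)^2}$ with $s^2$ = MSE, and uses the exam-score data.

```python
import math

# Math (X) and verbal (Y) scores from the example, N = 8.
X = [80, 50, 36, 58, 72, 60, 56, 68]
Y = [65, 60, 35, 39, 48, 44, 48, 61]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxx = sum((x - mx) ** 2 for x in X)

b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sxx
a = my - b * mx
Yp = [a + b * x for x in X]

ssr = sum((yp - my) ** 2 for yp in Yp)
sse = sum((y - yp) ** 2 for y, yp in zip(Y, Yp))
mse = sse / (n - 2)                   # n - k - 1 with k = 1

F = (ssr / 1) / mse                   # F-statistic from the ANOVA table
se_b = math.sqrt(mse / sxx)           # estimated standard error of b
t = b / se_b                          # t-statistic for H0: beta = 0

print(f"t = {t:.3f}, t^2 = {t ** 2:.3f}, F = {F:.3f}")
```

This equality is specific to one independent variable; with several slopes, the F-test and the individual t-tests answer different questions.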

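F-test • For a multivariate regression model, the F-test determines whether all of the slope coefficients are jointly zero, against the alternative that at least one of them is not. $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + e$, $H_0: \beta_1 = \beta_2 = \beta_3 = 0$, $H_A$: at least one $\beta_j \neq 0$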

F-test We will see later on that we can also use F-tests to compare various models against each other. For example, we might want to compare the previous model with all three variables to one with just X1 and X2.

Calculating R2 • Another way to evaluate the overall fit of our regression model (in addition to the F test) is to calculate $R^2$, which is the proportion of total variance (in Y) that our regression model can explain. $R^2 = \frac{\text{explained (regression) sum of squares}}{\text{total sum of squares}}$

Calculating R2 $R^2 = \frac{\sum (Y_p - M_y)^2}{\sum (Y - M_y)^2} = \frac{SSR}{SS(\text{total})}$ $R^2$ measures the proportion of the total variation in Y explained by the regression model. $0 \le R^2 \le 1$. A higher $R^2$ means the model explains more of the variation in Y.

Interpreting R2 $R^2 = 1$ means $Y_p = Y_i$ for all i (every observation lies on the regression line). $R^2 = 0$ means no linear relationship, so $Y_p = M_y$ for all i. In the bivariate model, $R^2$ is the square of the correlation coefficient, r.
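As a quick check of the last point, the sketch below computes $R^2$ from the sums of squares and compares it with the squared correlation for the exam-score data. Names are illustrative.

```python
import math

# Math (X) and verbal (Y) scores from the example, N = 8.
X = [80, 50, 36, 58, 72, 60, 56, 68]
Y = [65, 60, 35, 39, 48, 44, 48, 61]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
sxx = sum((x - mx) ** 2 for x in X)
syy = sum((y - my) ** 2 for y in Y)   # this is SS(total)

b = sxy / sxx
a = my - b * mx
ssr = sum((a + b * x - my) ** 2 for x in X)   # explained sum of squares

r = sxy / math.sqrt(sxx * syy)
R2 = ssr / syy

print(f"R^2 = {R2:.3f}, r^2 = {r ** 2:.3f}")
```

Here math scores account for roughly 39% of the variation in verbal scores.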

Some Assumptions In order to estimate a regression model, we must make several simplifying assumptions. Let's go back to the problem of determining if fertilizer affects crop yield. Suppose (in Figure 12-1a) that we set fertilizer at level X1 for many, many plots. The resulting yields will not all be the same; the weather might be better for some plots, the soil might be better for others, etc. Thus we would get a distribution of Y1 given X1, or $p(Y_1 \mid X_1)$. There will similarly be a distribution of Y2 at X2 and so forth. We can visualize a whole set of Y populations such as those shown in Figure 12-1a.

Some Assumptions To analyze such populations, we make three assumptions about the regularity of these Y distributions. 1) Homogeneous Variance: All the Y distributions have the same spread. Formally this means that the probability distribution $p(Y_1 \mid X_1)$ has the same variance $\sigma^2$ as $p(Y_2 \mid X_2)$ and so on. This assumption is often called homoskedasticity.

Some Assumptions 2) Linearity: For each Y distribution, the mean $E(Y_1 \mid X_1)$, or just $\mu_1$, lies on a straight line known as the true (population) regression line: $E(Y_i) = \mu_i = \alpha + \beta X_i$ The population parameters $\alpha$ and $\beta$ are estimated with the sample data as a and b.

Some Assumptions 3) Independence: The random variables Y1, Y2, … are statistically independent. For example, if Y1 happens to be large, there is no reason to expect Y2 to be large. Their mean and variance are given by: Mean = $\alpha + \beta X_i$, Variance = $\sigma^2$. Or: $Y_i = \alpha + \beta X_i + e_i$, where $e_1, e_2, \ldots, e_n$ are independent errors with mean 0 and variance $\sigma^2$. For example, in Figure 12-1b, the first observed value Y1 is shown along with the corresponding error term e1.

Some Assumptions It is often assumed additionally that the errors are normally distributed: $e_i \sim N(0, \sigma^2)$. A violation of the constant-variance assumption is called heteroskedasticity (the error variance is not the same across the range of X values). We also assume no autocorrelation, or that $\text{Cov}[e_i, e_j] = 0$ if $i \neq j$. A typical violation occurs with time series data where the errors are related over time.

The Nature of the Error Term The error may be regarded as the sum of two components: • Measurement Error: Sometimes we might measure things incorrectly; this can create a larger error. • Inherent variability: Sometimes we draw a sample that is not typical, or particular values may be far away from their expectation (like getting 10 heads in a row in 10 flips; possible but unlikely).

The Nature of the Error Term To summarize, we can see in Figure 12-2 that we have the true population regression line (black thick line), and since we do not have the population data, we must estimate $\alpha$ and $\beta$ with our sample data as a and b. Because of error, sometimes the values of Y observed in the sample will be a little too low (Y1) and sometimes they will be a little too high (Y2 and Y3). The question then is: how close did we come to the true population line? The best we can hope for is something reasonably close, and we go back to the notion of a sampling distribution to figure out how close we are.

Sampling distribution of b How is the slope estimate b distributed around its target $\beta$? Statisticians have been able to derive the theoretical sampling distribution: b is normally distributed with expected value $E(b) = \beta$ and standard error of b $= \sigma / \sqrt{\sum (X - M_x)^2}$. Here $\sigma$ represents the standard deviation of the Y observations about the population line.
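A small simulation can illustrate this result. The sketch below is not from the text: the population values ALPHA, BETA and SIGMA, the X levels, and the number of repetitions are all hypothetical choices. It draws many samples from $Y = \alpha + \beta X + e$ with normal errors, fits the least-squares slope each time, and compares the spread of the slopes with $\sigma / \sqrt{\sum (X - M_x)^2}$.

```python
import math
import random
import statistics

# Hypothetical population values, chosen only for illustration.
ALPHA, BETA, SIGMA = 2.0, 0.5, 2.0
X = list(range(1, 11))               # fixed X levels, reused in every sample
mx = sum(X) / len(X)
sxx = sum((x - mx) ** 2 for x in X)

rng = random.Random(0)               # seeded so the run is repeatable


def sample_slope():
    """Draw one sample from Y = ALPHA + BETA*X + e and return the slope b."""
    Y = [ALPHA + BETA * x + rng.gauss(0, SIGMA) for x in X]
    my = sum(Y) / len(Y)
    sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    return sxy / sxx


slopes = [sample_slope() for _ in range(5000)]
mean_b = statistics.mean(slopes)     # should be close to BETA
sd_b = statistics.stdev(slopes)      # should be close to the theoretical SE
theory_se = SIGMA / math.sqrt(sxx)

print(f"mean of b = {mean_b:.3f} (target {BETA})")
print(f"sd of b   = {sd_b:.3f} (theory {theory_se:.3f})")
```

The simulated mean and standard deviation of b should land close to $\beta$ and the theoretical standard error, with small differences due to the finite number of repetitions.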

Sampling distribution of b An easier way to express the standard error of b is as follows: Standard error $= \frac{\sigma}{\sqrt{n}} \cdot \frac{1}{s_x}$ (taking $s_x^2 = \sum (X - M_x)^2 / n$, so that $\sum (X - M_x)^2 = n s_x^2$). There are three ways that the standard error can be reduced to produce a more accurate estimate b: • By reducing $\sigma$, the inherent variability of the Y observations • By increasing n, the sample size • By increasing $s_x$, the spread of the X values.
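The three routes to a smaller standard error can be compared with a few lines of arithmetic. The function name and the input values below are hypothetical, chosen so that each change halves the standard error.

```python
import math


def se_slope(sigma, n, sx):
    """Standard error of b in the form sigma / (sqrt(n) * sx)."""
    return sigma / (math.sqrt(n) * sx)


base = se_slope(sigma=10.0, n=25, sx=5.0)
less_noise = se_slope(sigma=5.0, n=25, sx=5.0)     # halve sigma
more_data = se_slope(sigma=10.0, n=100, sx=5.0)    # quadruple n
more_spread = se_slope(sigma=10.0, n=25, sx=10.0)  # double sx

print(f"baseline:         {base:.2f}")
print(f"half the noise:   {less_noise:.2f}")
print(f"four times the n: {more_data:.2f}")
print(f"twice the spread: {more_spread:.2f}")
```

Note the asymmetry: halving $\sigma$ or doubling $s_x$ halves the standard error, but the sample size must be quadrupled to achieve the same effect because n enters through its square root.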

Sampling distribution of b The third point is particularly interesting because increasing the variance of our independent variable increases our leverage to explain the variance of Y. We can see this clearly in Figure 12-4 where a larger range of values allows us to fit a more accurate regression line. Because we do not generally know $\sigma$, we estimate the standard error as: $SE = s / \sqrt{\sum (X - M_x)^2}$, where $s^2 = \frac{1}{n-k-1} \sum (Y - Y_p)^2$
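Applied to the exam-score example, the estimated standard error can be sketched as follows. The helper name `slope_standard_error` is hypothetical, and k = 1 is assumed because there is a single independent variable.

```python
import math


def slope_standard_error(X, Y, k=1):
    """Return (b, s, SE) for a bivariate least-squares fit of Y on X."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sxx = sum((x - mx) ** 2 for x in X)
    b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sxx
    a = my - b * mx
    # s^2 = SSE / (n - k - 1), the estimate of sigma^2.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
    s = math.sqrt(sse / (n - k - 1))
    return b, s, s / math.sqrt(sxx)


# Math (X) and verbal (Y) scores from the example, N = 8.
X = [80, 50, 36, 58, 72, 60, 56, 68]
Y = [65, 60, 35, 39, 48, 44, 48, 61]

b, s, se = slope_standard_error(X, Y)
print(f"b = {b:.3f}, s = {s:.3f}, SE(b) = {se:.3f}")
```

With only eight students the standard error (about 0.25) is large relative to the slope (about 0.50), which is why the estimate is imprecise.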
