Regression Analysis: Part 2

Regression Analysis: Part 2 • Inference • Dummies / Interactions • Multicollinearity / Heteroscedasticity • Residual Analysis / Outliers

Shoe Size Example • Consider the prediction of Shoe Size discussed before • As before, the response variable is Shoe Size, and the explanatory variable is age. • What inferences can we make about the regression coefficients?

Shoe Size and Age

Regression Coefficients • Regression coefficients estimate the true, but unobservable, population coefficients. • The standard error of bi indicates the accuracy of these point estimates. • For example, the average effect on shoe size of a one-unit increase in age is .612. • We are 95% confident that the coefficient is between .306 and .918.
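As a worked check of the interval quoted above, assume it was built in the usual way as the point estimate plus or minus a critical value times the standard error. The slides give neither the critical value t* nor the standard error, so only their product (the margin) can be recovered:

```latex
b_i \pm t^{*}\,\mathrm{se}(b_i): \qquad 0.612 - 0.306 = 0.306, \qquad 0.612 + 0.306 = 0.918
```

Under that assumption the margin t* se(b_i) is 0.306. Because the interval does not contain zero, the coefficient of Age differs significantly from zero at the 5% level.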

Dummy Variables • You wish to check if the class (freshman, sophomore, junior, senior) and SAT score can be used to predict the number of hours per week a college student watches TV. • Data is collected through sampling, and a regression is to be performed. • How would you code the ‘Class’ variable? • How would you interpret the resulting coefficients?
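One common way to code 'Class' is with three 0/1 dummy variables and one omitted reference category. The sketch below assumes freshman is the reference; the function name and the choice of reference are illustrative and do not come from the slides.

```python
# Minimal sketch of dummy coding for a four-level categorical variable.
# Assumption: "freshman" is the reference (omitted) category.
CLASSES = ["freshman", "sophomore", "junior", "senior"]


def dummy_code(class_name, reference="freshman"):
    """Return a 0/1 indicator for every class except the reference."""
    if class_name not in CLASSES:
        raise ValueError(f"unknown class: {class_name!r}")
    return {c: int(c == class_name) for c in CLASSES if c != reference}


for name in CLASSES:
    print(name, dummy_code(name))
```

With this coding, each dummy coefficient is read as the expected difference in weekly TV hours between that class and freshmen, holding SAT score fixed. Using four dummies together with an intercept would make the columns perfectly collinear, which is why one category is left out.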

Interactions • Example: How do gender and lack of sleep affect performance on a standard test? • If Male = 1, and Female = 0, what is the difference between a regression model without interaction and one with? • Y-hat = b0 + b1X1 + b2X2 • Y-hat = b0 + b1X1 + b2X2 + b3X1X2 • How is the coefficient b3 interpreted?
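One way to see the difference between the two models is to write the interaction model separately for each gender. This assumes X1 is the gender dummy (Male = 1) and X2 measures lack of sleep; the slides do not say which variable is which.

```latex
\begin{aligned}
\text{Female } (X_1 = 0):\quad & \hat{Y} = b_0 + b_2 X_2 \\
\text{Male } (X_1 = 1):\quad & \hat{Y} = (b_0 + b_1) + (b_2 + b_3) X_2
\end{aligned}
```

Under that assumption, b3 is the difference between the male and female slopes on X2. In the model without interaction, the two lines are forced to be parallel and differ only by the intercept shift b1.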

Multicollinearity • We want to explain a person’s height by means of foot length. • The response variable is Height, and the explanatory variables are Right and Left, the length of the right foot and the left foot, respectively. • What can occur when we regress Height on both Right and Left?

Multicollinearity • The relationship between the explanatory variable X and the response variable Y is not always accurately reflected in the coefficient of X; it depends on which other X’s are included or not included in the equation. • This is especially true when there is a linear relationship between two or more explanatory variables, in which case we have multicollinearity. • By definition multicollinearity is the presence of a fairly strong linear relationship between two or more explanatory variables, and it can make estimation difficult.

Solution to multicollinearity • Admittedly, there is no need to include both Right and Left in an equation for Height - either one would do - but we include both to make a point. • It is likely that there is a large correlation between height and foot size, so we would expect this regression equation to do a good job. • The R2 value will probably be large. But what about the coefficients of Right and Left? Here is a problem.

Solution -- continued • The coefficient of Right indicates the right foot’s effect on Height in addition to the effect of the left foot. This additional effect is probably minimal. That is, after the effect of Left on Height has already been taken into account, the extra information provided by Right is probably minimal. But it goes the other way also. The extra effect of Left, in addition to that provided by Right, is probably minimal.

Height Data - Correlations • To show what can happen numerically, we generated a hypothetical data set of heights and left and right foot lengths in this file. • We did this so that, except for random error, height is approximately 32 plus 3.2 times foot length (all expressed in inches). As shown in the table to the right, the correlations between Height and either Right or Left in our data set are quite large, and the correlation between Right and Left is very close to 1.

Solution -- continued • The regression output when both Right and Left are entered in the equation for Height appears in this table.

Solution -- continued • This output tells a somewhat confusing story. • The multiple R and the corresponding R2 are about what we would expect, given the correlations between Height and either Right or Left. • In particular, the multiple R is close to the correlation between Height and either Right or Left. Also, the se value is quite good. It implies that predictions of height from this regression equation will typically be off by only about 2 inches.

Solution -- continued • However, the coefficients of Right and Left are not at all what we might expect, given that we generated heights as approximately 32 plus 3.2 times foot length. • In fact, the coefficient of Left has the wrong sign - it is negative! • Besides this wrong sign, the tip-off that there is a problem is that the t-value of Left is quite small and the corresponding p-value is quite large.

Solution -- continued • Judging by this, we might conclude that Height and Left are either not related or are related negatively. But we know from the table of correlations that both of these are false. • In contrast, the coefficient of Right has the “correct” sign, and its t-value and associated p-value do imply statistical significance, at least at the 5% level. • However, this happened mostly by chance; slight changes in the data could change the results completely.

Solution -- continued • Although both Right and Left are clearly related to Height, it is impossible for the least squares method to distinguish their separate effects. • Note that the regression equation does estimate the combined effect fairly well: the sum of the coefficients is 3.178, which is close to the coefficient of 3.2 we used to generate the data. • Therefore, the estimated equation will work well for predicting heights. It just does not have reliable estimates of the individual coefficients of Right and Left.
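The behaviour described above can be imitated with simulated data. The sketch below is not the original data set: the sample size, the foot-length distribution, the noise levels, and the function names are all assumptions chosen only to follow the stated rule that height is roughly 32 plus 3.2 times foot length.

```python
import random


def fit_two_predictors(y, x1, x2):
    """Least squares fit of y = b0 + b1*x1 + b2*x2 (centered normal equations)."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # close to zero when x1 and x2 are nearly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


def simulate(n=200, seed=0):
    """Hypothetical heights and nearly identical right/left foot lengths (inches)."""
    rng = random.Random(seed)
    height, right, left = [], [], []
    for _ in range(n):
        foot = rng.gauss(10.0, 1.0)               # assumed foot-length distribution
        right.append(foot + rng.gauss(0.0, 0.05))  # tiny right/left differences
        left.append(foot + rng.gauss(0.0, 0.05))
        height.append(32.0 + 3.2 * foot + rng.gauss(0.0, 2.0))
    return height, right, left


# Refit on several simulated samples: the individual coefficients move around,
# while their sum stays near the 3.2 used to generate the data.
for seed in range(5):
    height, right, left = simulate(seed=seed)
    b0, b_right, b_left = fit_two_predictors(height, right, left)
    print(f"seed={seed}  Right={b_right:7.3f}  Left={b_left:7.3f}  "
          f"sum={b_right + b_left:6.3f}")
```

Comparing the printed rows across seeds gives a feel for how unstable the separate coefficients are relative to their sum.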

Solution -- continued • To see what happens when either Right or Left is excluded from the regression equation, we show the results of simple regression. • When Right is the only variable in the equation, it becomes Predicted Height = 31.546 + 3.195 Right • The R2 and se values are 81.6% and 2.005, and the t-value and p-value for the coefficient of Right are now 21.34 and 0.000 - very significant.

Solution -- continued • Similarly, when Left is the only variable in the equation, it becomes Predicted Height = 31.526 + 3.197 Left • The R2 and se values are 81.1% and 2.033, and the t-value and p-value for the coefficient of Left are 20.99 and 0.0000 - again very significant. • Clearly, both of these equations tell almost identical stories, and they are much easier to interpret than the equation with both Right and Left included.
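To make "almost identical stories" concrete, the two fitted simple-regression equations can be evaluated at the same foot length. The coefficients below are the ones quoted in the slides; the 10-inch foot length and the function names are illustrative assumptions.

```python
def predict_from_right(right_inches):
    """Predicted Height = 31.546 + 3.195 * Right (equation from the slides)."""
    return 31.546 + 3.195 * right_inches


def predict_from_left(left_inches):
    """Predicted Height = 31.526 + 3.197 * Left (equation from the slides)."""
    return 31.526 + 3.197 * left_inches


foot = 10.0  # hypothetical foot length in inches
print(f"from Right: {predict_from_right(foot):.3f}")
print(f"from Left:  {predict_from_left(foot):.3f}")
```

At a 10-inch foot length both equations predict a height of about 63.5 inches, so for prediction it makes little difference which foot is used.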

Heteroscedasticity • Unequal error variances, typically visible as a fan shape in a residual plot. • Remedy: use log transforms. • Remedy: use weighted least squares.
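As a rough illustration of weighted least squares for a single explanatory variable, the sketch below gives each observation a weight and minimizes the weighted sum of squared residuals. The function name, the toy data, and the choice of weights 1/x² (appropriate only if the error standard deviation is assumed proportional to x) are assumptions, not taken from the slides.

```python
def weighted_least_squares(x, y, w):
    """Fit y = intercept + slope * x by minimizing sum(w * residual**2)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


# Toy data whose scatter grows with x (a fan shape).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.1, 7.8, 11.6, 13.0, 18.9, 17.5]

# Assumed weights: inverse variance, with variance taken proportional to x**2.
w = [1.0 / xi ** 2 for xi in x]

intercept, slope = weighted_least_squares(x, y, w)
print(f"WLS fit: intercept={intercept:.3f}, slope={slope:.3f}")
```

Setting every weight to 1 reduces this to ordinary least squares, which makes it easy to compare the two fits on the same data.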
