
Sociology 602 Martin Week 9, April 2, 2002



**1. **Sociology 602 (Martin) Week 9, April 2, 2002
Comparing predictor variables using Analysis of Variance
ANOVA for the general regression model NKNW 6.5
Extra Sums of Squares: Definitions NKNW 7.1
Extra Sums of Squares: Significance Tests NKNW 7.2, 7.3
Multicollinearity and its effects NKNW 7.6

**2. **Review of the first half of 602: Questions we can (start to) answer
1.) How do you set up a multiple regression model?
2.) What are possible problems with a regression model?
3.) How do you recognize problems with a regression model?
4.) How do you fix problems with a regression model?
5.) What do the “bi” coefficients mean?
6.) How do you determine the statistical significance of a “bi” coefficient?

**3. **Questions for the second half:
(More work on questions we have seen already, especially 2, 3, and 4)
(A new question you might have anticipated…)
7.) What is the relative importance of different predictor variables?
There are many ways to frame question 7. For today, we work with the following:
“Is Xi important enough to be kept in a model?”

**4. **Possible ways to measure the importance of predictor variables
Example: Global Warming Data
regression of surface temperature on explanatory variables
parameter estimates (fictitious)
Variable df parameter s.e. T p > |T|
Intercept 1 70 3.5 20.0 .0000…0001
dayYN(1=Y) 1 22.0 4.4 5.00 .00001
latitude 1 -0.88 .11 -8.00 .00000001
CO2ppm 1 0.66 .33 2.00 .046

**5. **Possible ways to measure the importance of predictor variables
Large b coefficient:
advantage: indicates large change in Y for a change in X
disadvantage: depends on scaling of X, says nothing about error
Small p-value:
advantage: incorporates information about relationship between X and Y and error.
disadvantage: doesn’t really estimate a property of the population
Substantive importance:
advantage: focuses attention on theoretical interests
disadvantage: interesting variables may turn out to be statistically unimportant.
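A minimal sketch of the scaling disadvantage, using the fictitious CO2ppm row from the global warming table (b = 0.66, s.e. = 0.33). The rescaling factor of 100 and the variable names are assumptions made for illustration. Measuring X in units of 100 ppm multiplies both the coefficient and its standard error by 100, so b looks much larger while the t-ratio does not change.

```python
# Fictitious values from the global warming example (CO2ppm row).
b_ppm = 0.66      # change in temperature per 1 ppm of CO2
se_ppm = 0.33     # standard error of that coefficient

# Hypothetical rescaling: measure CO2 in units of 100 ppm instead of 1 ppm.
scale = 100
b_100ppm = b_ppm * scale
se_100ppm = se_ppm * scale

# The t-ratio is the coefficient divided by its standard error.
t_ppm = b_ppm / se_ppm
t_100ppm = b_100ppm / se_100ppm

print(f"per ppm:     b = {b_ppm:.2f}, t = {t_ppm:.2f}")
print(f"per 100 ppm: b = {b_100ppm:.2f}, t = {t_100ppm:.2f}")
```

The size of b reflects the units chosen for X, which is why a large b alone is a weak measure of importance.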

**6. **A new way to measure the importance of predictor variables:
Important variables are variables that explain a lot of the error (variance) in the model.
We will spend today studying the following:
How to measure the variance in the model that is due to Xi
How to decide how much variance is important.
Why measuring the variance in a model still leaves us uncertain about the importance of Xi.

**7. **Measuring error in a regression model
Relevant reading: NKNW section 6.5
Begin with ANOVA for a simple regression model:
Y = b0 + b1X1 + e
$SSR = \sum_i (\hat{Y}_i - \bar{Y})^2$ : The sum of squares for the regression model is measured using the difference between the predicted values of Yi and the overall average for Y. (Draw)
$SSE = \sum_i (Y_i - \hat{Y}_i)^2$ : The sum of squares error is measured using the difference between the observed values of Yi and the predicted values of Yi. (Draw)
$SSTO = \sum_i (Y_i - \bar{Y})^2$ : The sum of squares total is measured using the difference between the observed values of Yi and the overall average for Y. (Draw)
Note that SSR + SSE = SSTO
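The three sums of squares can be sketched in a few lines of plain Python. The five data points below are invented purely for illustration (they are not from the course data), and the fit uses the usual least-squares formulas for a simple regression.

```python
# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates for Y = b0 + b1*X1 + e.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Predicted values (Yi hat).
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares from the slide.
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # predicted vs. mean
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # observed vs. predicted
ssto = sum((yi - y_bar) ** 2 for yi in y)              # observed vs. mean

print(f"SSR = {ssr:.2f}, SSE = {sse:.2f}, SSTO = {ssto:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```

For a least-squares fit with an intercept, SSR + SSE matches SSTO up to floating-point error.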

**8. **Measuring error in a regression model
Continue with ANOVA for a general regression model:
Y = Xb + e
$SSR = \sum_i (\hat{Y}_i - \bar{Y})^2$ , but this is measured for the whole matrix of X-variables, so we don’t know how much is due to any given X-variable.
$SSE = \sum_i (Y_i - \hat{Y}_i)^2$ , but (again) this only tells us about the model with the whole matrix of X-variables.
$SSTO = \sum_i (Y_i - \bar{Y})^2$
SSR + SSE = SSTO (still)
Thus, the ANOVA in a regression output tells us about the whole model, not about the individual x-variables.

**9. **Measuring error in a regression model
Example: see SAS output for data on hospital patient satisfaction, age, severity of illness, and anxiety.
Model 1: anxiety is the explanatory variable
Model 2: anxiety, age, and severity are explanatory variables
How does SSR change from model 1 to model 2?
SSE?
SSTO?
Do you think age belongs in the model?
How about severity?
How about both age and severity?

**10. **R2 in a multiple regression model
In a simple regression model, R2 refers to the proportion of the (squared) variation in Y explained by X1.
In a multiple regression model, R2 refers to the proportion of the (squared) variation in Y explained by all the X-variables together.
R2 = SSR / SSTO = 1 – SSE/SSTO
R2 is called the coefficient of multiple determination
0 <= R2 <= 1
R2 tells us nothing about the individual variables
R = SQRT(R2) is called the coefficient of multiple correlation.

**11. **Adjusted R2 in a regression model
Some people use R2 to make statistical inferences (we tend to use other methods in this class).
This creates a problem: Even if Y is unassociated with X1 and X2 in a population, you tend to see some association between Y, X1, and X2 in a given random sample. (Draw)
Some researchers adjust for this problem using an adjusted R2, which is adjusted for the degrees of freedom in a multiple regression model.
R2a = 1 – (SSE/dfSSE) / (SSTO/dfSSTO)
= 1 – [(n – 1)/(n – p)] × (SSE/SSTO)
This is the adjusted R2 you see in the SAS output. (What happens to the difference between R2 and R2a as p increases?)
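As a rough sketch, both formulas can be applied to the hospital patient model with all three X-variables. SSE = 2011.6 and SSTO = 6145.2 come from later slides in this deck; n = 23 is inferred from the error df (n – 4 = 19) in the modified ANOVA table, and the function names are hypothetical.

```python
def r_squared(sse, ssto):
    """Coefficient of multiple determination: 1 - SSE/SSTO."""
    return 1 - sse / ssto


def adjusted_r_squared(sse, ssto, n, p):
    """Adjusted R2: 1 - [(n - 1)/(n - p)] * (SSE/SSTO).

    n is the number of cases; p is the number of parameters,
    including the intercept.
    """
    return 1 - ((n - 1) / (n - p)) * (sse / ssto)


# Values for the full hospital model (X1, X2, X3).
sse_full = 2011.6
ssto = 6145.2
n = 23   # inferred from dfE = n - 4 = 19
p = 4    # three X-variables plus the intercept

r2 = r_squared(sse_full, ssto)
r2a = adjusted_r_squared(sse_full, ssto, n, p)

print(f"R2 = {r2:.4f}, adjusted R2 = {r2a:.4f}")
```

Changing p while holding SSE and SSTO fixed is a quick way to explore the question posed on the slide.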

**12. **Using error to study individual X-variables
In general, the best way to use ANOVA to study an individual X-variable is as follows:
run a model with the other X-variables but without Xi
run a second model with the same X-variables plus Xi
compare the two models’ SSE, SSR, and SSTO
As we learn the formal procedure for doing this, we need to learn some definitions and terms.
Readings: NKNW 7.1

**13. **Terms for ANOVA modeling
Extra Sum of Squares: the marginal reduction in the error sum of squares (SSE) when one or more predictor variables are added to the regression model.
Example: for the hospital patient data, find the extra sum of squares when X3 (anxiety) is added to a model that already has X2 (severity).
extra sum of squares = SSE(X2) – SSE(X2,X3)
= 4024.6 – 3718.3 = 306.3
Notation: SSR(X3|X2) = SSE(X2) – SSE(X2,X3)
“The extra sum of squares (regression) when X3 is added to a model that already contains X2.”

**14. **Extra Sum of Squares: more examples
Example: the extra sum of squares when X3 (anxiety) is added to a model that already has X2 (severity) and X1 (age)
SSR(X3|X2,X1) = SSE(X1,X2) – SSE(X1,X2,X3)
= 2064.0 – 2011.6 = 52.4
Note that the error explained by X3 depends on the other variables in the model: 306.3 when only X2 is already present, but 52.4 when both X1 and X2 are already present!
Another example: the extra sum of squares when X3 (anxiety) and X2 (severity) are added to a model that already has X1 (age)
SSR(X3,X2|X1) = SSE(X1) – SSE(X1,X2,X3)
= 2466.8 – 2011.6 = 455.2
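The three worked examples reduce to one subtraction each. The sketch below stores the SSE values quoted in the slides in a lookup table keyed by the set of predictors in the model; the dictionary layout and the function name `extra_ss` are hypothetical conveniences.

```python
# SSE for each fitted model, keyed by the predictors it contains.
# The numbers are the ones quoted in the hospital patient example.
sse = {
    frozenset({"X1"}): 2466.8,
    frozenset({"X2"}): 4024.6,
    frozenset({"X1", "X2"}): 2064.0,
    frozenset({"X2", "X3"}): 3718.3,
    frozenset({"X1", "X2", "X3"}): 2011.6,
}


def extra_ss(added, given):
    """SSR(added | given) = SSE(given) - SSE(given plus added)."""
    reduced = frozenset(given)
    full = reduced | frozenset(added)
    return sse[reduced] - sse[full]


ss_x3_given_x2 = extra_ss({"X3"}, {"X2"})
ss_x3_given_x1x2 = extra_ss({"X3"}, {"X1", "X2"})
ss_x2x3_given_x1 = extra_ss({"X2", "X3"}, {"X1"})

print(f"SSR(X3|X2)    = {ss_x3_given_x2:.1f}")
print(f"SSR(X3|X1,X2) = {ss_x3_given_x1x2:.1f}")
print(f"SSR(X2,X3|X1) = {ss_x2x3_given_x1:.1f}")
```

The lookup only works for models whose SSE has already been obtained from a fitted regression; it does not fit anything itself.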

**15. **Taking stock of our situation
Our ultimate goal is to determine whether an x-variable is important by testing whether it should be kept in a regression model.
We are almost (but not quite) ready to make practical use of our new concept of extra sums of squares (regression).
As a preliminary exercise, we will learn how to decompose a multiple regression model into its components.

**16. **Decomposing SSR for a multiple regression model
Imagine you wish to study the effects of X2 (severity) and X3 (anxiety) on Y (satisfaction).
In a model with only X2, we can say that…
SSTO = SSR(X2) + SSE(X2)
We know that when we add X3 to this model, we get an extra sum of squares…
SSR(X3|X2) = SSE(X2) – SSE(X2,X3)
This allows us to rewrite SSTO for the full model with X2 and X3, showing the contribution of each X-variable…
SSTO = SSR(X2) + SSR(X3|X2) + SSE(X2,X3)
6145.2 ≈ 2120.7 + 306.3 + 3718.3 (the right-hand side sums to 6145.3 because each term is rounded)

**17. **Decomposing SSR for a multiple regression model
Imagine you wish to study the effects of X1 (age), X2 (severity), and X3 (anxiety) on Y (satisfaction).
You can rewrite SSTO for the full model with X1, X2, and X3, showing the contribution of each X-variable…
SSTO = SSR(X1) + SSR(X2|X1) + SSR(X3|X2,X1) + SSE(X1,X2,X3)
6145.2 = 3678.4 + 402.8 + 52.4 + 2011.6
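One way to see where this decomposition comes from is to apply the definition of the extra sum of squares one variable at a time. The derivation below is a sketch in the slides' own notation; each line only restates a definition already given.

```latex
\begin{aligned}
SSTO &= SSR(X_1) + SSE(X_1) \\
SSE(X_1) &= SSR(X_2 \mid X_1) + SSE(X_1, X_2) \\
SSE(X_1, X_2) &= SSR(X_3 \mid X_1, X_2) + SSE(X_1, X_2, X_3) \\
\Rightarrow\quad SSTO &= SSR(X_1) + SSR(X_2 \mid X_1) + SSR(X_3 \mid X_1, X_2) + SSE(X_1, X_2, X_3)
\end{aligned}
```

With the hospital numbers, the intermediate steps are 6145.2 = 3678.4 + 2466.8, then 2466.8 = 402.8 + 2064.0, then 2064.0 = 52.4 + 2011.6.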

**18. **Decomposing SSR for a multiple regression model
We are now conceptually ready to answer the question: “Is X3 important enough that we should keep it in the model?”
We put this question into operation by asking: “Is the extra sum of squares SSR(X3|X2,X1) large enough that we know that it explains a nonrandom amount of the variation in Y?”
The hypothesis test used with ANOVA is the F-test, so we will construct our test in stages, starting from the F-test for the whole model. See NKNW 7.2, 7.3

**19. **F-test for ANOVA: the general form
The simplest form of an F-test is the form that SAS does automatically in the regression output. This is a statistical inference for a regression equation, so we set it up as a formal hypothesis test.
Assumptions: all the standard assumptions of a regression model – random sample, $E\{\varepsilon\} = 0$, independently distributed $\varepsilon$, homoskedasticity, linear relationship between all X and Y, etc.
Hypothesis: The null hypothesis is that none of the X-variables has a relationship with Y in the population.
Ho: $\beta_1 = 0$, $\beta_2 = 0$, $\beta_3 = 0$, … $\beta_{p-1} = 0$
Test statistic: F* = MSR / MSE, provided by SAS
p-value is provided by SAS
Conclusion: reject Ho if the p-value shows that a value of F* this large would be very unlikely to occur by chance alone when Ho is true.

**20. **F-test for ANOVA: the general form, an example
We can do an F-test for the full model (X1,X2,X3)
assumptions: all assumptions for regression model
Null hypothesis: none of the variables for age, severity, or anxiety has a linear relationship with patient satisfaction.
Test statistic (from SAS output) F* = 1377.9 / 105.9 = 13.0
p-value prob > F = .0001
Conclusion: reject Ho and conclude that at least one of the X variables has a linear relationship with patient satisfaction.

**21. **F-test for ANOVA for extra sums of squares
Now we are ready to tackle an F-test for the hypothesis that X3 belongs in model (X1,X2,X3)
assumptions: all assumptions for regression model
Null hypothesis: the variable for anxiety has no linear relationship with patient satisfaction when age and severity are controlled.
Ho: $\beta_3 = 0$ in a model that includes X1 and X2.
Test statistic: Here we hit a bump, because SAS does not give the F* statistic automatically (at least, not until next week).

**22. **Calculating F* for extra sums of squares
F* for a general test = MSR/MSE = (SSR/dfR) / (SSE/dfE)
F* for a test of extra sums of squares for X3 in a model that already contains X1 and X2
= MSR(X3|X1,X2) / MSE(X1,X2,X3) = [SSR(X3|X1,X2)/dfR] / [SSE(X1,X2,X3)/dfE]
This brings us to our next problem: what are dfR and dfE when we add one more X-variable to a model that already has two x-variables?
Answer: when we add one variable to a model, dfR = 1.
Answer: a model with 3 x-variables has 4 parameters (including the intercept): dfE = n – p = n – 4.

**23. **Modified ANOVA table for extra sums of squares
Source DF SS MS
Model 4 – 1 = 3 4133.6 1377.9
- X1 1 3678.4 3678.4
- X2|X1 1 402.8 402.8
- X3|X1,X2 1 52.4 52.4
Error n – 4 = 19 2011.6 105.9
F* for this test = MSR(X3|X1,X2) / MSE(X1,X2,X3)
= 52.4 / 105.9 = 0.495
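The same calculation can be written as a small function. This is only a sketch of the arithmetic on the slide; the function name `partial_f` and its argument names are hypothetical, and the inputs are the SSE values and degrees of freedom from the table above.

```python
def partial_f(sse_reduced, sse_full, df_extra, df_error_full):
    """F* for the extra sum of squares gained by moving from the
    reduced model to the full model.

    sse_reduced    SSE of the model without the tested variable(s)
    sse_full       SSE of the model with the tested variable(s)
    df_extra       number of variables added (numerator df)
    df_error_full  n - p for the full model (denominator df)
    """
    msr_extra = (sse_reduced - sse_full) / df_extra
    mse_full = sse_full / df_error_full
    return msr_extra / mse_full


# Test of X3 in a model that already contains X1 and X2.
f_star = partial_f(
    sse_reduced=2064.0,   # SSE(X1, X2)
    sse_full=2011.6,      # SSE(X1, X2, X3)
    df_extra=1,
    df_error_full=19,     # n - 4
)
print(f"F* = {f_star:.3f}")
```

Because the function takes df_extra as an argument, it also covers tests that add more than one variable at a time, such as SSR(X3,X2|X1) with df_extra = 2.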

**24. **F test for extra sums of squares
We now have F* = 0.495
To look up the p-value we need the df for the F-test:
For the numerator, MSR has 1 df.
For the denominator, MSE has 19 df
Thus, the F-test has (1, 19) df.
A conventional alpha level is 5%, which means that .05 of the time when Ho is true for the population, one would get an F-statistic at least as large as the value in a table of F-statistics.
The Table of F-statistics is Table B.4 on pages 1339 to 1345.

**25. **Using Table B.4 to look up p-values: a step-by-step procedure
1.) Find the correct column for the numerator df
in this case, dfR = 1
2.) Find the correct row for the denominator df
in this case, dfE = 19, which is not in the table, so we use the nearest row for dfE = 20 (see page 1342)
3.) Find the correct row for the alpha level of .05.
The text uses A = 1 – α, so 1 – .05 = 0.95.
4.) Look up the F-statistic in the correct cell.
In the text, the tabled value is F = 4.35
5.) Compare the Tabled F* to the obtained F*
If F*table >= F*obtained, then p >= .05 (In this case, 4.35 >= 0.495)
If F*table < F*obtained, then p < .05
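When a table is not at hand, the p-value for a test with one numerator df can be approximated directly. The sketch below relies on the fact that an F statistic with (1, dfE) df is the square of a t statistic with dfE df, and it integrates the t density numerically with Simpson's rule. The function names are hypothetical and the approach only applies when the numerator df is 1; statistical software reports this value directly.

```python
import math


def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def f_pvalue_one_numerator_df(f_stat, df_error, steps=2000):
    """Approximate P(F >= f_stat) for an F(1, df_error) distribution.

    Uses F(1, df) = t(df)^2, so the p-value is the two-tailed t
    probability beyond sqrt(f_stat). steps must be even (Simpson's rule).
    """
    upper = math.sqrt(f_stat)
    h = upper / steps
    total = t_density(0.0, df_error) + t_density(upper, df_error)
    for i in range(1, steps):
        weight = 4 if i % 2 == 1 else 2
        total += weight * t_density(i * h, df_error)
    # Area between -sqrt(f_stat) and +sqrt(f_stat), by symmetry.
    central_area = 2 * (h / 3) * total
    return 1 - central_area


p_value = f_pvalue_one_numerator_df(0.495, 19)
print(f"p-value for F* = 0.495 with (1, 19) df: {p_value:.3f}")
```

For F* = 0.495 the result is far above .05, which agrees with the table-based comparison in step 5.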

**26. **F-test for extra sums of squares
Now we complete the F-test for the hypothesis that X3 belongs in model (X1,X2,X3)
assumptions: all assumptions for regression model
Null hypothesis: the variable for anxiety has no linear relationship with patient satisfaction when age and severity are controlled.
Ho: $\beta_3 = 0$ in a model that includes X1 and X2.
Test statistic: F* = 0.495
p-value: p > .05
Conclusion: do not reject Ho: There is no evidence that X3 has a linear relationship with Y when X1 and X2 are in the model, so we should leave X3 out of the model.

**27. **F-test for extra sums of squares: the basic idea
When you add one (or more) degrees of freedom to an existing model, you can use the F-test to find out whether that extra degree of freedom is justified.
Thus, to find out if a given X-variable belongs in a model, you do an F-test comparing that model to a comparable model without the given X-variable.
Question: why can’t you do an F-test to compare these two models?
Y = b0 + b1X1 + e
Y = b0 + b2X2 + e

**28. **F-test for extra sums of squares: the basic problem
You might think that we have solved the problem of determining whether a given X-variable belongs in a model, but that is premature.
The problem is that you can often justify the full model by adding the X-variables in a different order.
We will discuss this problem in detail next week, so an example will suffice for now.
Q: Is it appropriate to add X1 to a model that already has X2 and X3?
A: yes: F* = 16.12 with (1, 19) df, p < .05
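The order dependence can be checked with the sums of squares already quoted in this deck. The sketch below compares the variation credited to X1 when it enters first with the variation credited to it when it enters last, and then computes the F* for each variable entered last. The variable names are hypothetical; the numbers are from the hospital patient example.

```python
# Quantities quoted in the slides for the hospital patient data.
ssr_x1_first = 3678.4            # SSR(X1): X1 entered first
sse_x2_x3 = 3718.3               # SSE(X2, X3)
sse_x1_x2 = 2064.0               # SSE(X1, X2)
sse_full = 2011.6                # SSE(X1, X2, X3)
mse_full = sse_full / 19         # MSE of the full model, dfE = 19

# Extra sums of squares when each variable is entered last.
ssr_x1_last = sse_x2_x3 - sse_full    # SSR(X1 | X2, X3)
ssr_x3_last = sse_x1_x2 - sse_full    # SSR(X3 | X1, X2)

# Each test adds one variable, so the numerator df is 1.
f_x1_last = ssr_x1_last / mse_full
f_x3_last = ssr_x3_last / mse_full

print(f"SSR(X1) entered first:   {ssr_x1_first:.1f}")
print(f"SSR(X1|X2,X3) entered last: {ssr_x1_last:.1f}, F* = {f_x1_last:.2f}")
print(f"SSR(X3|X1,X2) entered last: {ssr_x3_last:.1f}, F* = {f_x3_last:.3f}")
```

X1 is credited with much less variation when it enters last than when it enters first, yet it still passes the test, while X3 entered last does not.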

**29. **Adding variables in a different order
Q: Why does our decision about whether X3 belongs in the model depend on the order in which we add the variables?
Suppose X1 and X3 explain variation in Y. Some of the variation is different variation, and some is the same variation. X1 explains a lot more variation than X3.
If we put X1 in the model first, it explains all the variation explained by X1 alone, plus all the variation explained jointly by X1 and X3.
Then, the variation explained only by X3 will not be enough to justify adding X3 to a model with X1.
If we put X3 in the model first, it explains all the variation explained by X3 alone, plus all the variation explained jointly by X1 and X3.
This total variation may be enough to justify adding X3, if we add it first. (Use model (X1,X3) as an example)

**30. **How does this problem arise?
The problem of changing the order of variables is partly caused by multicollinearity. See NKNW 7.6
Multicollinearity occurs when the predictor variables are correlated among themselves.
X1 and X3 can only explain the same variation in Y if they are correlated with each other.
For the remaining few minutes, we will investigate some problems and opportunities presented by multicollinearity.

**31. **Thinking about multicollinearity and regression assumptions
Q: If X1 is correlated with Y, is that a violation of regression model assumptions?
A: NO, that is what we are testing for when we calculate b1
Q: If X1 is correlated with X2, is that a violation of regression model assumptions?
A: NO, that is what we are testing for when we use X2 as a control variable.
Multicollinearity is an important part of regression analysis, although it occasionally presents special modeling challenges.

**32. **Ranges of multicollinearity
If X1 and X2 are completely uncorrelated (a rare situation), then there is no multicollinearity. This means that:
b1 will not change when you add X2 to the model.
b2 will not change when you add X1 to the model.
If X1 and X2 are perfectly correlated (another rare situation), then there is complete multicollinearity. This means that:
SAS cannot compute a regression model for both variables, because there is no best model. (Many models can have an excellent fit!)
(Draw lines and planes)
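A small sketch of why there is no single best model under complete multicollinearity. The data are invented for illustration, with X2 set to exactly twice X1. The three coefficient sets are arbitrary choices that all satisfy b1 + 2·b2 = 2; they are not claimed to be least-squares estimates, only to be indistinguishable from one another in fit.

```python
# Hypothetical data: X2 is an exact multiple of X1 (perfect correlation).
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0 * v for v in x1]
y = [3.1, 4.9, 7.2, 8.8]


def sse(b0, b1, b2):
    """Error sum of squares for Y = b0 + b1*X1 + b2*X2 on the data above."""
    return sum((yi - (b0 + b1 * a + b2 * b)) ** 2 for yi, a, b in zip(y, x1, x2))


# Three different (b0, b1, b2) triples with the same value of b1 + 2*b2.
candidates = [
    (1.0, 2.0, 0.0),
    (1.0, 0.0, 1.0),
    (1.0, 4.0, -1.0),
]

sse_values = [sse(*c) for c in candidates]
for coefs, value in zip(candidates, sse_values):
    print(f"b0, b1, b2 = {coefs}: SSE = {value:.4f}")
```

Because every triple produces the same predicted values, the data cannot distinguish between them, which is the situation the slide describes.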

**33. **Ranges of multicollinearity
If X1 and X2 are partly correlated (the most common situation), then several things happen:
Model building depends strongly on the order in which you add the variables. The last variable added tends to “lose out”.
It is still possible to have good model fit.
Standard errors for b1 and b2 will be inflated to the extent that X1 and X2 are correlated. In one sample, the data may line up showing that X1 is the best predictor of Y, while in another sample, the data may show that X2 is the best predictor of Y.
b1 may change considerably when you add X2 to the model.
It becomes less conceptually feasible to think of predicting the effect of X1 while X2 is “held constant”. (Example: using thigh circumference and midarm circumference to predict body fat.)
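To get a feel for how much the standard errors are inflated, the sketch below uses a standard result for a model with exactly two predictors: compared with the uncorrelated case, the standard error of each slope is multiplied by 1 / sqrt(1 – r²), where r is the correlation between X1 and X2. This factor is not stated in the slides and is brought in here as an assumption for illustration; the function name and the r values are hypothetical.

```python
import math


def se_inflation(r):
    """Factor by which the standard error of b1 (or b2) grows in a
    two-predictor model when X1 and X2 have correlation r, relative
    to the case r = 0."""
    return 1 / math.sqrt(1 - r * r)


# Hypothetical correlations between X1 and X2.
correlations = [0.0, 0.5, 0.9, 0.99]
factors = {r: se_inflation(r) for r in correlations}

for r, factor in factors.items():
    print(f"r = {r:.2f}: standard error multiplied by {factor:.2f}")
```

The factor stays close to 1 for moderate correlations and grows rapidly as r approaches 1, which is the complete-multicollinearity case where no unique estimate exists.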

**34. **Ranges of multicollinearity: examples
Look at the SAS output for correlation coefficients for the X-variables in the hospital patient example.
Which X-variables are most strongly correlated?
How do the values of one b-coefficient change when a correlated X-variable is added to the regression model?
How do the standard errors of one b-coefficient change when a correlated X-variable is added to the regression model?
Is it realistic to think of the effect of one variable, holding the other variable constant?

**35. **Summary
This week we picked up just one statistical tool, but it was a critical and difficult tool:
Using extra sums of squares to test whether a variable belongs in a model.
Next week, we put this tool to use in practical model-building exercises.