II. Multiple Regression
Recall that for regression analysis: • The data must be from a probability sample. • The univariate distributions need not be normal, but the usual warnings about pronounced skewness & outliers apply. • The key evidence about the distributions & outliers is provided not by the univariate graphs but by the y/x bivariate scatterplots.
Even so, bivariate scatterplots & correlations do not necessarily predict whether explanatory variables will test significant in a multiple regression model. • That’s because a multiple regression model expresses the joint, linear effects of a set of explanatory variables on an outcome variable.
On matters of causality in multiple regression, see Agresti/Finlay (chap. 10); King et al.; McClendon; and Berk. • To reiterate, when might regression analysis be useful even when causal order isn’t clear?
Let’s turn our attention now to multiple regression, in which the outcome variable y is a function of k explanatory variables. • For every one-unit increase in x, y increases/decreases by … units on average, holding the other explanatory variables fixed.
Hence slope (i.e. regression) coefficients in multiple regression are commonly called ‘partial coefficients.’ • They indicate the independent effect of a given explanatory variable x on y, holding the other explanatory variables constant.
Some other ways of saying ‘holding the other variables constant’: • holding the other variables fixed • adjusting for the other variables • net of the other variables
Statistical controls mimic experimental controls. • The experimental method, however, is unparalleled in its ability to isolate the effects of explanatory ‘treatment’ variables on an outcome variable, holding other variables constant.
A Multiple Regression Example • What’s the effect of the daily amount of Cuban coffee persons drink on their levels of displayed anger, holding constant income, education, gender, race-ethnicity, body weight, health, mental health, diet, exercise, & so on?
Here’s an example we’ll be using. What should we look at? How do we interpret it?

. reg science read write math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   65.32
       Model |  9752.65806     3  3250.88602           Prob > F      =  0.0000
    Residual |  9754.84194   196  49.7696017           R-squared     =  0.4999
-------------+------------------------------           Adj R-squared =  0.4923
       Total |    19507.50   199  98.0276382           Root MSE      =  7.0548

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3015317   .0686815     4.39   0.000     .1660822    .4369813
       write |   .2065257   .0707644     2.92   0.004     .0669683    .3460831
        math |   .3190094   .0766753     4.16   0.000      .167795    .4702239
       _cons |   8.407353   3.192799     2.63   0.009     2.110703      14.704
------------------------------------------------------------------------------
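Reading the Coef. column of the output above, the estimated prediction equation is:

\[
\widehat{science} = 8.407 + 0.302\,read + 0.207\,write + 0.319\,math
\]

Holding write & math constant, each additional point on read is associated with a 0.302-point increase in predicted science, on average (& likewise for the other coefficients).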
Here’s a standard deviation interpretation:

. su science read write math

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     science |       200       51.85    9.900891         26         74
        read |       200       52.23    10.25294         28         76
       write |       200      52.775    9.478586         31         67
        math |       200      52.645    9.368448         33         75

. pcorr science read write math

Partial correlation of science with

    Variable |   Corr.     Sig.
-------------+------------------
        read |   0.2992    0.000
       write |   0.2041    0.004
        math |   0.2849    0.000
Or easier:

. listcoef, help

regress (N=200): Unstandardized and Standardized Estimates

Observed SD: 9.9008908
SD of Error: 7.0547574

-------------------------------------------------------------------------------
     science |        b         t     P>|t|    bStdX    bStdY   bStdXY     SDofX
-------------+-----------------------------------------------------------------
        read |   0.30153     4.390    0.000   3.0916   0.0305   0.3123   10.2529
       write |   0.20653     2.918    0.004   1.9576   0.0209   0.1977    9.4786
        math |   0.31901     4.161    0.000   2.9886   0.0322   0.3019    9.3684
-------------------------------------------------------------------------------
       b = raw coefficient
       t = t-score for test of b=0
   P>|t| = p-value for t-test
   bStdX = x-standardized coefficient
   bStdY = y-standardized coefficient
  bStdXY = fully standardized coefficient
   SDofX = standard deviation of X
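The listcoef columns follow directly from the raw coefficients & the standard deviations reported by su. For read, for example:

\[
bStdX = b \times SD_x = 0.3015 \times 10.2529 \approx 3.09, \qquad
bStdXY = \frac{b \times SD_x}{SD_y} = \frac{3.09}{9.9009} \approx 0.31
\]

That is, a one-standard-deviation increase in read predicts about a 3.1-point (roughly 0.31 SD) increase in science, holding write & math constant.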
Although multiple regression is linear in its parameters, we’ll see that it readily accommodates non-linearity in y/x relationships. • What would be possible non-linearity in the relationship between daily amounts of Cuban coffee people drink & their levels of displayed anger? • What about the preceding regression example?
(The output of . reg science read write math is shown above.)
We’ll later see that multiple regression coefficients are ‘partial coefficients.’ • That is, the value (i.e. slope) of a given estimated regression coefficient may vary according to which other particular explanatory variables are included in the model. • Why?
. reg read write

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |     .64553   .0616832    10.47   0.000     .5238896    .7671704

. reg read write math

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .3283984   .0695792     4.72   0.000     .1911828    .4656141
        math |   .5196538   .0703972     7.38   0.000      .380825    .6584826

. reg read write math science

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .2376706   .0696947     3.41   0.001     .1002227    .3751184
        math |   .3784015   .0746339     5.07   0.000     .2312129    .5255901
     science |   .2969347   .0676344     4.39   0.000     .1635501    .4303192
Sources of Error in Regression Analysis • What are the three basic sources of error in regression analysis?
The three basic sources of error in regression analysis are: • Sampling error • Measurement error (& other types of non-sampling error) • Omitted variables Source: Allison (pages 14-16) • How do we evaluate error in a model?
We evaluate error in a model by means of: • Evaluating the sample for sampling & non-sampling error • Our substantive knowledge about the topic (e.g., what are the most important variables; how they should be defined & measured) • Confidence intervals & hypothesis tests • F-test of global utility • Confidence intervals & hypothesis tests for regression coefficients • Post-model diagnostics of residuals (i.e. ‘error’)
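One way to carry out the post-model residual diagnostics in the last bullet, as a minimal do-file sketch for the example above (the variable name resid is arbitrary, not part of the example output):

regress science read write math
predict resid, residuals      // store the residuals (y - yhat); "resid" is a placeholder name
histogram resid, normal       // check the residual distribution for skewness & outliers
rvfplot, yline(0)             // plot residuals against fitted values to look for patterns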
The evaluation of error involves assumptions about the random error component. • These assumptions are the same for multiple regression as for simple linear regression. • What are these assumptions? Why are they important?
How large must sample size n be for multiple regression? • There should be at least 10 observations per explanatory variable estimated (not counting the constant, or y-intercept). • In reality the sample size n should be even larger (quite large, in fact). Why? (See R2 below.)
As with simple linear regression, multiple regression fits a least squares line that minimizes the sum of squared errors (i.e. the sum of squared deviations between predicted and observed y values). • The formulas for multiple regression, however, are more complex than those for simple regression.
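In symbols, the fitted coefficients b0, b1, …, bk minimize the sum of squared errors:

\[
SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
    = \sum_{i=1}^{n}\left(y_i - (b_0 + b_1 x_{1i} + \cdots + b_k x_{ki})\right)^2
\]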
Estimating the variance of the random error e (i.e. of the residuals, y − yhat) with k explanatory variables: degrees of freedom = n – (# explanatory variables estimated + 1 for the constant) = n – (k + 1)
How has the denominator changed for estimating the error variance in multiple regression? • In multiple regression, the Mean Squares for Residual (Error), Model & Total are computed by dividing each component’s sum of squares by its own degrees of freedom: n – (k + 1) for Residual, k for Model, & n – 1 for Total.
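For the regression example above (n = 200, k = 3):

\[
s^2 = \frac{SSE}{n-(k+1)} = \frac{9754.84}{196} \approx 49.77, \qquad
s = \sqrt{49.77} \approx 7.05 \;\;(\text{the Root MSE in the output})
\]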
Testing an individual parameter coefficient in multiple regression: Ho: βj = 0 versus Ha: βj ≠ 0 (or a one-sided test in either direction). • Assumptions about the sample; assumptions about e: I.I.D.
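For example, for read in the output above:

\[
t = \frac{\hat{\beta}_{read}}{se(\hat{\beta}_{read})} = \frac{0.3015}{0.0687} \approx 4.39,
\qquad df = n-(k+1) = 196, \qquad p < 0.001
\]

so we reject Ho at the usual significance levels.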
Multiple coefficient of determination, R2: R2 = 1 − SSE/SSyy = SSModel/SSTotal, where SSE is computed from the predicted values of y (yhat) given by the fitted model.
R2: fraction of the sample variation of the y values (measured by SSyy) that is attributable to the regression model (i.e. to the explanatory variables). • Note: r2 versus R2
From the ANOVA table of the regression output above: • R2 = SSModel/SSTotal = 9752.65806/19507.50 ≈ 0.4999 (the R-squared in the output).
Caution: R2 for a regression model may vary considerably from sample to sample (due to chance associations); i.e. it does not necessarily reveal the model’s fit for the population. • Caution: R2 will be overestimated if a sample doesn’t contain substantially more data points than the number of explanatory variables (rule of thumb: at least 30 observations per explanatory variable; an overall sample size of at least 400).
Caution: in view of the preceding, R2 never decreases (& typically increases) as more explanatory variables are added, even if they contribute little. • Adjusted R2 adjusts both for the sample size n & for the number of explanatory variables; thus it gives a more stable & conservative estimate. • R2 & Adj R2, however, are sample statistics that do not have associated hypothesis tests.
From the regression output above: • R2 = 0.4999 versus Adj R2 = 0.4923.
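Adjusted R2 penalizes for the number of explanatory variables by using mean squares rather than sums of squares. From the ANOVA table above:

\[
\bar{R}^2 = 1 - \frac{SSE/(n-k-1)}{SS_{yy}/(n-1)} = 1 - \frac{49.77}{98.03} \approx 0.4923
\]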
Neither R2 nor Adj R2, then, should be the sole or primary measure for judging a model’s usefulness. • The first, & most basic, test for judging a model’s usefulness is the Analysis of Variance F Test.
Analysis of Variance F-Test for the overall utility of a multiple regression model: Ho: β1 = β2 = … = βk = 0; Ha: at least one βj differs from 0.
From the ANOVA table of the regression output above: • F = MSModel/MSResidual = 3250.88602/49.7696017 ≈ 65.32, with Prob > F = 0.0000.
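In general, with k explanatory variables & n observations:

\[
F = \frac{SS_{Model}/k}{SSE/(n-k-1)} = \frac{MS_{Model}}{MS_{Residual}}
  = \frac{3250.89}{49.77} \approx 65.32, \qquad df = (3,\ 196)
\]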
We either reject or fail to reject Ho for the F-test. • If we fail to reject Ho, then we don’t bother assessing the other indicators of model usefulness & fit: instead we go back to the drawing board, revise the model, & try again.
Regarding the F-test, note that the formula expresses the parsimony of explanation that’s fundamental to the culture of ‘scientific explanation’ (see King et al., page 20). • That is, too many explanatory variables relative to the number of observations decreases the degrees of freedom & thus makes statistical significance more difficult to obtain.
Why not assess the model’s overall utility by doing hypothesis tests based on t-values? • Because running many separate t-tests inflates the probability of a Type I error. • Why not use R2 or Adj R2? • Because there’s no hypothesis test for R2 or Adj R2.
If, on the other hand, the F-test does reject Ho, then do go on to conduct the t-value hypothesis tests. • But watch out for Type I errors.
In any case, rejecting Ho based on the F-test does not necessarily imply that this is the best model for predicting y. • Another model might also pass the F-test & prove even more useful in providing estimates & predictions.
Before going on with multiple regression, let’s review some basic issues of causality (see Agresti & Finlay, chapter 10; King et al., chapter 3). • In causal relations, one variable influences the other, but not vice versa. • We never definitively prove causality, but we can (more or less) disprove causality (over the long run of accumulated evidence).
A relationship must satisfy three criteria to be (tentatively) considered causal: • An association between the variables • An appropriate time order • The elimination of alternative explanations
Association & time order do not necessarily amount to causality. • Alternative explanations must be eliminated.
But first, does a finding of statistical insignificance necessarily mean that there’s no causal relationship between y & x? • It could be that: • The y/x relationship is nonlinear: perhaps linearize it via transformation (see the do-file sketch after this list). • The y/x relationship is contingent on controlling another explanatory variable that has been omitted, or else on the level of another variable (i.e. interaction). • The sample size is inadequate. • There’s sampling error. • There’s non-sampling error (including measurement error).
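A rough do-file sketch of the first two possibilities above (y, x1 & x2 are placeholder variable names, not variables from the example dataset):

gen x1sq = x1^2
regress y x1 x1sq x2          // curvature: is the squared term significant?
gen lny = ln(y)
regress lny x1 x2             // or try a log transformation of y
gen x1_x2 = x1*x2
regress y x1 x2 x1_x2         // interaction: does x1's effect depend on the level of x2?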
When there is statistical significance, an apparently causal relationship may in fact be: • Dependent on one or more lurking variables (i.e. a spurious relationship: x2 causes x1 & x2 causes y, so x1 has no causal relationship with y). • Mediated by an intervening variable (i.e. a chain: x1 indirectly causes y via x2). • Conditional on the level of another variable (i.e. interaction). • An artifact of an extreme value on the sampling distribution of the sample mean.
On such complications of causality, see Agresti/Finlay, chapter 10; McClendon; Berk; and King et al.
How to detect at least some of the possible lurking variables: • Add suspected lurking variables (x2-xk) to the model. • Does the originally hypothesized y/ x1 relationship remain significant or not?
E.g., for pre-adolescents, there’s a strong relationship between height & math achievement scores. • Does the relationship between height & math scores remain when we control for age? If not, the height/math relationship is spurious, because both math scores & height depend on age (see the sketch below). • What other explanatory variables might be relevant?
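A minimal do-file sketch of this check (mathscore, height & age are hypothetical variable names for this example, not variables in the dataset used above):

regress mathscore height          // bivariate: height appears to 'affect' math scores
regress mathscore height age      // does height remain significant once age is controlled?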
The case of spurious associations (i.e. lurking variables) is one example of a multivariate relationship. • Another kind is a chain relationship. E.g.: • Race affects arrest rate, but controlling for specific levels of family income makes the unequal white/nonwhite arrest rate diminish. Race’s effect is mediated by family income.