II. Multiple Regression
Recall that for regression analysis: • The data must be from a probability sample. • The univariate distributions need not be normal, but the usual warnings about pronounced skewness & outliers apply. • The key evidence about the distributions & outliers is provided not by the univariate graphs but by the y/x bivariate scatterplots.
Even so, bivariate scatterplots & correlations do not necessarily predict whether explanatory variables will test significant in a multiple regression model. • That’s because a multiple regression model expresses the joint, linear effects of a set of explanatory variables on an outcome variable.
On matters of causality in multiple regression, see Agresti/Finlay (chap. 10); King et al.; McClendon; and Berk. • To reiterate, when might regression analysis be useful even when causal order isn’t clear?
Let’s turn our attention now to multiple regression, in which the outcome variable y is a function of k explanatory variables. • For every one-unit increase in x, y increases/decreases by … units on average, holding the other explanatory variables fixed.
Hence slope (i.e. regression) coefficients in multiple regression are commonly called ‘partial coefficients.’ • They indicate the independent effect of a given explanatory variable x on y, holding the other explanatory variables constant.
Some other ways of saying ‘holding the other variables constant’: • holding the other variables fixed • adjusting for the other variables • net of the other variables
Statistical controls mimic experimental controls. • The experimental method, however, is unparalleled in its ability to isolate the effects of explanatory ‘treatment’ variables on an outcome variable, holding other variables constant.
A Multiple Regression Example • What’s the effect of the daily amount of Cuban coffee persons drink on their levels of displayed anger, holding constant income, education, gender, race-ethnicity, body weight, health, mental health, diet, exercise, & so on?
Here’s an example we’ll be using. What should we look at? How do we interpret it?

. reg science read write math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   65.32
       Model |  9752.65806     3  3250.88602           Prob > F      =  0.0000
    Residual |  9754.84194   196  49.7696017           R-squared     =  0.4999
-------------+------------------------------           Adj R-squared =  0.4923
       Total |    19507.50   199  98.0276382           Root MSE      =  7.0548

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .3015317   .0686815     4.39   0.000     .1660822    .4369813
       write |   .2065257   .0707644     2.92   0.004     .0669683    .3460831
        math |   .3190094   .0766753     4.16   0.000      .167795    .4702239
       _cons |   8.407353   3.192799     2.63   0.009     2.110703      14.704
------------------------------------------------------------------------------
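Reading the Coef. column of the output above, the estimated prediction equation is:

\[
\widehat{science} = 8.407 + 0.302\,read + 0.207\,write + 0.319\,math
\]

Holding write & math constant, each additional point on read is associated with a 0.302-point increase in predicted science, on average (& likewise for the other coefficients).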
Here’s a standard deviation interpretation:

. su science read write math

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     science |       200       51.85    9.900891         26         74
        read |       200       52.23    10.25294         28         76
       write |       200      52.775    9.478586         31         67
        math |       200      52.645    9.368448         33         75

. pcorr science read write math

Partial correlation of science with

    Variable |   Corr.     Sig.
-------------+------------------
        read |   0.2992    0.000
       write |   0.2041    0.004
        math |   0.2849    0.000
Or easier:

. listcoef, help

regress (N=200): Unstandardized and Standardized Estimates

Observed SD: 9.9008908
SD of Error: 7.0547574

-------------------------------------------------------------------------------
     science |        b         t     P>|t|    bStdX    bStdY   bStdXY     SDofX
-------------+-----------------------------------------------------------------
        read |   0.30153     4.390    0.000   3.0916   0.0305   0.3123   10.2529
       write |   0.20653     2.918    0.004   1.9576   0.0209   0.1977    9.4786
        math |   0.31901     4.161    0.000   2.9886   0.0322   0.3019    9.3684
-------------------------------------------------------------------------------
       b = raw coefficient
       t = t-score for test of b=0
   P>|t| = p-value for t-test
   bStdX = x-standardized coefficient
   bStdY = y-standardized coefficient
  bStdXY = fully standardized coefficient
   SDofX = standard deviation of X
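The listcoef columns follow directly from the raw coefficients & the standard deviations reported by su. For read, for example:

\[
bStdX = b \times SD_x = 0.3015 \times 10.2529 \approx 3.09, \qquad
bStdXY = \frac{b \times SD_x}{SD_y} = \frac{3.09}{9.9009} \approx 0.31
\]

That is, a one-standard-deviation increase in read predicts about a 3.1-point (roughly 0.31 SD) increase in science, holding write & math constant.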
Although multiple regression is linear in its parameters, we’ll see that it readily accommodates non-linearity in y/x relationships. • What would be possible non-linearity in the relationship between daily amounts of Cuban coffee people drink & their levels of displayed anger? • What about the preceding regression example?
(The output of . reg science read write math is shown above.)
We’ll later see that multiple regression coefficients are ‘partial coefficients.’ • That is, the value (i.e. slope) of a given estimated regression coefficient may vary according to which other particular explanatory variables are included in the model. • Why?
. reg read write

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |     .64553   .0616832    10.47   0.000     .5238896    .7671704

. reg read write math

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .3283984   .0695792     4.72   0.000     .1911828    .4656141
        math |   .5196538   .0703972     7.38   0.000      .380825    .6584826

. reg read write math science

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .2376706   .0696947     3.41   0.001     .1002227    .3751184
        math |   .3784015   .0746339     5.07   0.000     .2312129    .5255901
     science |   .2969347   .0676344     4.39   0.000     .1635501    .4303192
Sources of Error in Regression Analysis • What are the three basic sources of error in regression analysis?
The three basic sources of error in regression analysis are: • Sampling error • Measurement error (& other types of non-sampling error) • Omitted variables Source: Allison (pages 14-16) • How do we evaluate error in a model?
We evaluate error in a model by means of: • Evaluating the sample for sampling & non-sampling error • Our substantive knowledge about the topic (e.g., what are the most important variables; how they should be defined & measured) • Confidence intervals & hypothesis tests • F-test of global utility • Confidence intervals & hypothesis tests for regression coefficients • Post-model diagnostics of residuals (i.e. ‘error’)
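One way to carry out the post-model residual diagnostics in the last bullet, as a minimal do-file sketch for the example above (the variable name resid is arbitrary, not part of the example output):

regress science read write math
predict resid, residuals      // store the residuals (y - yhat); "resid" is a placeholder name
histogram resid, normal       // check the residual distribution for skewness & outliers
rvfplot, yline(0)             // plot residuals against fitted values to look for patterns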
The evaluation of error involves assumptions about the random error component. • These assumptions are the same for multiple regression as for simple linear regression. • What are these assumptions? Why are they important?
How large must sample size n be for multiple regression? • There should be at least 10 observations per explanatory variable estimated (not counting the constant, or y-intercept). • In reality the sample size n should be even larger (quite large, in fact). Why? (See R2 below.)
As with simple linear regression, multiple regression fits a least squares line that minimizes the sum of squared errors (i.e. the sum of squared deviations between predicted and observed y values). • The formulas for multiple regression, however, are more complex than those for simple regression.
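In symbols, the fitted coefficients b0, b1, …, bk minimize the sum of squared errors:

\[
SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
    = \sum_{i=1}^{n}\left(y_i - (b_0 + b_1 x_{1i} + \cdots + b_k x_{ki})\right)^2
\]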
Estimating the variance of the random error e (i.e. of the residuals, y − yhat) with k explanatory variables: degrees of freedom = n – (# explanatory variables estimated + 1 for the constant) = n – (k + 1)
How has the denominator changed for estimating the error variance in multiple regression? • In multiple regression, the Mean Squares for Residual (Error), Model & Total are computed by dividing each component’s sum of squares by its own degrees of freedom: n – (k + 1) for Residual, k for Model, & n – 1 for Total.
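For the regression example above (n = 200, k = 3):

\[
s^2 = \frac{SSE}{n-(k+1)} = \frac{9754.84}{196} \approx 49.77, \qquad
s = \sqrt{49.77} \approx 7.05 \;\;(\text{the Root MSE in the output})
\]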
Testing an individual parameter coefficient in multiple regression: Ho: βj = 0 versus Ha: βj ≠ 0 (or a one-sided test in either direction). • Assumptions about the sample; assumptions about e: I.I.D.
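For example, for read in the output above:

\[
t = \frac{\hat{\beta}_{read}}{se(\hat{\beta}_{read})} = \frac{0.3015}{0.0687} \approx 4.39,
\qquad df = n-(k+1) = 196, \qquad p < 0.001
\]

so we reject Ho at the usual significance levels.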
Multiple coefficient of determination, R2: R2 = 1 − SSE/SSyy = SSModel/SSTotal, where SSE is computed from the predicted values of y (yhat) given by the fitted model.
R2: fraction of the sample variation of the y values (measured by SSyy) that is attributable to the regression model (i.e. to the explanatory variables). • Note: r2 versus R2
From the ANOVA table of the regression output above: • R2 = SSModel/SSTotal = 9752.65806/19507.50 ≈ 0.4999 (the R-squared in the output).
Caution: R2 for a regression model may vary considerably from sample to sample (due to chance associations); i.e. it does not necessarily reveal the model’s fit for the population. • Caution: R2 will be overestimated if a sample doesn’t contain substantially more data points than the number of explanatory variables (rule of thumb: at least 30 observations per explanatory variable; an overall sample size of at least 400).
Caution: in view of the preceding, R2 never decreases (& typically increases) as more explanatory variables are added, even if they contribute little. • Adjusted R2 adjusts both for the sample size n & for the number of explanatory variables; thus it gives a more stable & conservative estimate. • R2 & Adj R2, however, are sample statistics that do not have associated hypothesis tests.
From the regression output above: • R2 = 0.4999 versus Adj R2 = 0.4923.
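Adjusted R2 penalizes for the number of explanatory variables by using mean squares rather than sums of squares. From the ANOVA table above:

\[
\bar{R}^2 = 1 - \frac{SSE/(n-k-1)}{SS_{yy}/(n-1)} = 1 - \frac{49.77}{98.03} \approx 0.4923
\]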
Neither R2 nor Adj R2, then, should be the sole or primary measure for judging a model’s usefulness. • The first, & most basic, test for judging a model’s usefulness is the Analysis of Variance F Test.
Analysis of Variance F-Test for the overall utility of a multiple regression model: Ho: β1 = β2 = … = βk = 0; Ha: at least one βj differs from 0.
From the ANOVA table of the regression output above: • F = MSModel/MSResidual = 3250.88602/49.7696017 ≈ 65.32, with Prob > F = 0.0000.
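In general, with k explanatory variables & n observations:

\[
F = \frac{SS_{Model}/k}{SSE/(n-k-1)} = \frac{MS_{Model}}{MS_{Residual}}
  = \frac{3250.89}{49.77} \approx 65.32, \qquad df = (3,\ 196)
\]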
We either reject or fail to reject Ho for the F-test. • If we fail to reject Ho, then we don’t bother assessing the other indicators of model usefulness & fit: instead we go back to the drawing board, revise the model, & try again.
Regarding the F-test, note that the formula expresses the parsimony of explanation that’s fundamental to the culture of ‘scientific explanation’ (see King et al., page 20). • That is, too many explanatory variables relative to the number of observations decreases the degrees of freedom & thus makes statistical significance more difficult to obtain.
Why not assess the model’s overall utility by doing hypothesis tests based on t-values? • Because running many separate t-tests inflates the probability of a Type I error. • Why not use R2 or Adj R2? • Because there’s no hypothesis test for R2 or Adj R2.
If, on the other hand, the F-test does reject Ho, then do go on to conduct the t-value hypothesis tests. • But watch out for Type I errors.
In any case, rejecting Ho based on the F-test does not necessarily imply that this is the best model for predicting y. • Another model might also pass the F-test & prove even more useful in providing estimates & predictions.
Before going on with multiple regression, let’s review some basic issues of causality (see Agresti & Finlay, chapter 10; King et al., chapter 3). • In causal relations, one variable influences the other, but not vice versa. • We never definitively prove causality, but we can (more or less) disprove causality (over the long run of accumulated evidence).
A relationship must satisfy three criteria to be (tentatively) considered causal: • An association between the variables • An appropriate time order • The elimination of alternative explanations
Association & time order do not necessarily amount to causality. • Alternative explanations must be eliminated.
But first, does a finding of statistical insignificance necessarily mean that there’s no causal relationship between y & x? • It could be that: • The y/x relationship is nonlinear: perhaps linearize it via transformation (see the do-file sketch after this list). • The y/x relationship is contingent on controlling another explanatory variable that has been omitted, or else on the level of another variable (i.e. interaction). • The sample size is inadequate. • There’s sampling error. • There’s non-sampling error (including measurement error).
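A rough do-file sketch of the first two possibilities above (y, x1 & x2 are placeholder variable names, not variables from the example dataset):

gen x1sq = x1^2
regress y x1 x1sq x2          // curvature: is the squared term significant?
gen lny = ln(y)
regress lny x1 x2             // or try a log transformation of y
gen x1_x2 = x1*x2
regress y x1 x2 x1_x2         // interaction: does x1's effect depend on the level of x2?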
When there is statistical significance, an apparently causal relationship may in fact be: • Dependent on one or more lurking variables (i.e. a spurious relationship: x2 causes x1 & x2 causes y, so x1 has no causal relationship with y). • Mediated by an intervening variable (i.e. a chain: x1 indirectly causes y via x2). • Conditional on the level of another variable (i.e. interaction). • An artifact of an extreme value on the sampling distribution of the sample mean.
On such complications of causality, see Agresti/Finlay, chapter 10; McClendon; Berk; and King et al.
How to detect at least some of the possible lurking variables: • Add suspected lurking variables (x2-xk) to the model. • Does the originally hypothesized y/ x1 relationship remain significant or not?
E.g., for pre-adolescents, there’s a strong relationship between height & math achievement scores. • Does the relationship between height & math scores remain when we control for age? If not, the height/math relationship is spurious, because both math scores & height depend on age (see the sketch below). • What other explanatory variables might be relevant?
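A minimal do-file sketch of this check (mathscore, height & age are hypothetical variable names for this example, not variables in the dataset used above):

regress mathscore height          // bivariate: height appears to 'affect' math scores
regress mathscore height age      // does height remain significant once age is controlled?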
The case of spurious associations (i.e. lurking variables) is one example of a multivariate relationship. • Another kind is a chain relationship. E.g.: • Race affects arrest rate, but controlling for specific levels of family income makes the unequal white/nonwhite arrest rate diminish. Race’s effect is mediated by family income.