Lecture 11 Multicollinearity

Lecture 11: Multicollinearity
BMTRY 701 Biostatistical Methods II

Multicollinearity Introduction
• Some common questions we ask in MLR
  • what is the relative importance of the effects of the different covariates?
  • what is the magnitude of effect of a given covariate on the response?
  • can any covariate be dropped from the model because it has little effect or no effect on the outcome?
  • should any covariates not yet included in the model be considered for possible inclusion?

Easy answers?
• If the candidate covariates are uncorrelated with one another: yes, these are simple questions
• If the candidate covariates are correlated with one another: no, these are not easy
• Most commonly:
  • observational studies have correlated covariates
  • we need to adjust for these when assessing relationships
  • “adjusting” for confounders
• Experimental designs?
  • less problematic
  • patients are randomized in common designs
  • no confounding exists because factors are ‘balanced’ across arms

Multicollinearity
• Also called “intercorrelation”
• refers to the situation when the covariates are related to each other and to the outcome of interest
• like confounding, but a statistical terminology for it because of the effects it has on regression modeling

No Multicollinearity Example: Mouse experiment

Linear modeling
• Interested in seeing which factors influence tumor size in mice
• Notice that the experiment is perfectly balanced.
• What does that mean?

Dose of Drug A on Tumor

> reg.a <- lm(Tumor.size ~ Dose.A, data=data)
> summary(reg.a)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.50000   12.29041   2.644   0.0246 *
Dose.A      -0.05250    0.05689  -0.923   0.3779
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.09 on 10 degrees of freedom
Multiple R-Squared: 0.07847, Adjusted R-squared: -0.01368
F-statistic: 0.8515 on 1 and 10 DF, p-value: 0.3779

Dose of Drug B on Tumor

> reg.b <- lm(Tumor.size ~ Dose.B, data=data)
> summary(reg.b)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  58.0000     9.4956   6.108 0.000114 ***
Dose.B       -0.9600     0.2402  -3.996 0.002533 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.4 on 10 degrees of freedom
Multiple R-Squared: 0.6149, Adjusted R-squared: 0.5764
F-statistic: 15.97 on 1 and 10 DF, p-value: 0.002533

Diet on Tumor

> reg.diet <- lm(Tumor.size ~ Diet, data=data)
> summary(reg.diet)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   28.000      6.296   4.448  0.00124 **
Diet         -12.000      8.903  -1.348  0.20745
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.42 on 10 degrees of freedom
Multiple R-Squared: 0.1537, Adjusted R-squared: 0.06911
F-statistic: 1.817 on 1 and 10 DF, p-value: 0.2075

All in the model together

> reg.all <- lm(Tumor.size ~ Dose.A + Dose.B + Diet, data=data)
> summary(reg.all)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  74.50000    8.72108   8.543 2.71e-05 ***
Dose.A       -0.05250    0.02591  -2.027 0.077264 .
Dose.B       -0.96000    0.16921  -5.673 0.000469 ***
Diet        -12.00000    4.23035  -2.837 0.021925 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.327 on 8 degrees of freedom
Multiple R-Squared: 0.8472, Adjusted R-squared: 0.7898
F-statistic: 14.78 on 3 and 8 DF, p-value: 0.001258

Correlation matrix of predictors and outcome

> cor(data[,-1])
               Dose.A     Dose.B       Diet Tumor.size
Dose.A      1.0000000  0.0000000  0.0000000 -0.2801245
Dose.B      0.0000000  1.0000000  0.0000000 -0.7841853
Diet        0.0000000  0.0000000  1.0000000 -0.3920927
Tumor.size -0.2801245 -0.7841853 -0.3920927  1.0000000

Result
• For perfectly balanced designs, adjusting does not affect the coefficients
• However, it can affect the significance
• Why?
  • the residual sum of squares is affected
  • if you explain more of the variance in the outcome, less is left to chance/error
  • when you adjust for another factor that is related to the outcome, the standard errors shrink and you will likely improve the significance
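The mouse data are not reproduced in these slides, so the following is only a sketch on simulated data. The dose levels, effect sizes, and noise level are invented for illustration. It builds a full factorial (balanced) design with 12 runs and checks that adding the other factors leaves the Dose.A slope unchanged while reducing its standard error.

```r
# Hypothetical balanced design: every combination of levels appears exactly once.
# Levels and true effects below are assumptions, not the lecture's data.
design <- expand.grid(Dose.A = c(0, 200),
                      Dose.B = c(0, 25, 50),
                      Diet   = c(0, 1))
set.seed(701)
design$Tumor.size <- 75 - 0.05 * design$Dose.A - 1.0 * design$Dose.B -
  12 * design$Diet + rnorm(nrow(design), sd = 5)

# In a full factorial the predictors are mutually uncorrelated
pred.cor <- cor(design[, c("Dose.A", "Dose.B", "Diet")])
round(pred.cor, 2)

fit.a   <- lm(Tumor.size ~ Dose.A, data = design)
fit.all <- lm(Tumor.size ~ Dose.A + Dose.B + Diet, data = design)

# Compare the Dose.A estimate and its standard error in the two models
est.a   <- summary(fit.a)$coefficients["Dose.A", c("Estimate", "Std. Error")]
est.all <- summary(fit.all)$coefficients["Dose.A", c("Estimate", "Std. Error")]
rbind(alone = est.a, adjusted = est.all)
```

The slope is the same in both rows because the predictors are orthogonal. Only the residual variance, and hence the standard error, changes.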

The other extreme: perfect collinearity

The model has infinitely many solutions
• Too much flexibility
• What happens?
  • The fitting algorithm usually gives you some indication of this; depending on the software, it either
    • will not fit the model and gives an error, or
    • drops one of the predictors
• “perfectly collinear” = “perfect confounding”
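As a small illustration of both behaviors in R, the sketch below uses simulated data in which x2 is an exact linear function of x1 (the variable names and values are hypothetical). By default lm() keeps the model but reports NA for the redundant predictor; with singular.ok = FALSE it stops with an error instead.

```r
set.seed(11)
n  <- 20
x1 <- rnorm(n)
x2 <- 2 * x1 + 3          # exact linear function of x1: perfect collinearity
y  <- 1 + x1 + rnorm(n)

# Default behavior: the aliased predictor is dropped (its coefficient is NA)
fit <- lm(y ~ x1 + x2)
coef(fit)
fit$rank                  # 2, not 3: only intercept and x1 are estimable

# Stricter behavior: ask lm() to refuse a rank-deficient fit
strict <- tryCatch(lm(y ~ x1 + x2, singular.ok = FALSE),
                   error = function(e) conditionMessage(e))
strict
```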

Effects of Multicollinearity
• Most common result
  • two covariates are each associated with Y in separate simple linear regression models
  • in the MLR model with both covariates, one or both are insignificant
  • the magnitude of the regression coefficients is attenuated
• Why?
  • recall the adjusted variable plot
  • if the two are related, removing the systematic part of one from Y may leave too little for the other to explain

Effects of Multicollinearity
• Other situations
  • Neither is significant alone, but they are both significant together (somewhat rare)
  • Both are significant alone and both retain significance in the model
  • The regression coefficient for one of the covariates may change direction
  • Magnitude of coefficient may increase (in absolute value)
• It is usually hard to predict exactly what will happen when both are in the model

Implications in inference
• the usual interpretation of a regression coefficient, as the change in the expected value of Y when the covariate is increased while all others are held constant, is not quite applicable
• It may be conceptually feasible to think of ‘holding all constant’
• but, practically, it may not be possible if the covariates are related.
• Example: amount of rainfall and hours of sunshine

Implications in inference
• multicollinearity tends to inflate the standard errors on the regression coefficients
• when multicollinearity is present, the coefficient of partial determination for the added collinear covariate tends to be small: it explains little beyond what is already in the model
• Predictions tend to be relatively unaffected, for better or worse, when a highly collinear covariate is added to the model.
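These three points can be seen in a small simulation. This is a sketch on made-up data, not an analysis from the lecture: x2 is built to be highly correlated with x1, and the sketch compares the standard error of x1 with and without x2 in the model, then computes the coefficient of partial determination for x2 given x1.

```r
set.seed(37)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + 0.2 * rnorm(n)        # hypothetical covariate, highly correlated with x1
y  <- 1 + x1 + x2 + rnorm(n)
cor(x1, x2)

fit.1    <- lm(y ~ x1)
fit.both <- lm(y ~ x1 + x2)

# Standard error of the x1 coefficient, alone and adjusted for x2
se.1    <- summary(fit.1)$coefficients["x1", "Std. Error"]
se.both <- summary(fit.both)$coefficients["x1", "Std. Error"]
c(alone = se.1, with.x2 = se.both)

# Coefficient of partial determination for x2 given x1:
# the share of the variation left after x1 that x2 accounts for
rss.1      <- sum(resid(fit.1)^2)
rss.both   <- sum(resid(fit.both)^2)
partial.r2 <- (rss.1 - rss.both) / rss.1
partial.r2

# Fitted values change little when the collinear covariate is added
summary(abs(fitted(fit.both) - fitted(fit.1)))
```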

Implications in Inference
• Recall the interpretation of the t-statistics in MLR
• They represent the significance of a variable, adjusting for all else in the model
• If two covariates are highly correlated, then both are likely to end up insignificant
• Marginal nature of t-tests: each one tests a single covariate given everything else in the model
• ANOVA can be more useful due to the conditional (sequential) nature of the tables.
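A minimal sketch of this contrast, again on simulated data with invented names: when x1 and x2 are nearly collinear, the individual t-tests in summary() will often both look weak, while the sequential ANOVA table and a joint F-test of the two covariates together show a clear association.

```r
set.seed(39)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + 0.05 * rnorm(n)       # hypothetical, nearly collinear with x1
y  <- 1 + x1 + x2 + rnorm(n)

fit.null <- lm(y ~ 1)
fit.both <- lm(y ~ x1 + x2)

# t-tests: each covariate adjusted for the other
summary(fit.both)$coefficients

# Sequential ANOVA: x1 first, then x2 given x1
seq.tab <- anova(fit.both)
seq.tab

# Joint F-test: do x1 and x2 together explain anything?
joint <- anova(fit.null, fit.both)
joint
```

The order of terms matters in the sequential table: fitting y ~ x2 + x1 instead would attribute the shared variation to x2.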

So, which is the ‘correct’ variable?
• Almost impossible to tell
• Usually, people choose the one that is ‘more’ significant.
  • but that does not mean it is the correct choice
  • it could be the correct choice
  • it could be the one that is less associated
• why might it be less associated?
  • measurement issues
• the correct ‘culprit’ could be a variable that is related to the ones in the model but not in the model itself.

Example
• Let’s look at our classic example of logLOS
• What variables are associated with logLOS?
• What variables have the potential to create multicollinearity?

SENIC

> data <- read.csv("senicfull.csv")
> data$logLOS <- log(data$LOS)
> data$nurse2 <- data$NURSE^2
> data$ms <- ifelse(data$MEDSCHL==2,0,data$MEDSCHL)
>
> data.cor <- data[,-1]
> round(cor(data.cor),2)
          LOS   AGE INFRISK  CULT  XRAY  BEDS MEDSCHL REGION CENSUS NURSE  FACS logLOS nurse2    ms
LOS      1.00  0.19    0.53  0.33  0.38  0.41   -0.30  -0.49   0.47  0.34  0.36   0.98   0.25  0.30
AGE      0.19  1.00    0.00 -0.23 -0.02 -0.06    0.15  -0.02  -0.05 -0.08 -0.04   0.17  -0.04 -0.15
INFRISK  0.53  0.00    1.00  0.56  0.45  0.36   -0.23  -0.19   0.38  0.39  0.41   0.55   0.26  0.23
CULT     0.33 -0.23    0.56  1.00  0.42  0.14   -0.24  -0.31   0.14  0.20  0.19   0.35   0.15  0.24
XRAY     0.38 -0.02    0.45  0.42  1.00  0.05   -0.09  -0.30   0.06  0.08  0.11   0.39   0.04  0.09
BEDS     0.41 -0.06    0.36  0.14  0.05  1.00   -0.59  -0.11   0.98  0.92  0.79   0.42   0.86  0.59
MEDSCHL -0.30  0.15   -0.23 -0.24 -0.09 -0.59    1.00   0.10  -0.61 -0.59 -0.52  -0.32  -0.56 -1.00
REGION  -0.49 -0.02   -0.19 -0.31 -0.30 -0.11    0.10   1.00  -0.15 -0.11 -0.21  -0.52  -0.07 -0.10
CENSUS   0.47 -0.05    0.38  0.14  0.06  0.98   -0.61  -0.15   1.00  0.91  0.78   0.48   0.84  0.61
NURSE    0.34 -0.08    0.39  0.20  0.08  0.92   -0.59  -0.11   0.91  1.00  0.78   0.37   0.95  0.59
FACS     0.36 -0.04    0.41  0.19  0.11  0.79   -0.52  -0.21   0.78  0.78  1.00   0.38   0.66  0.52
logLOS   0.98  0.17    0.55  0.35  0.39  0.42   -0.32  -0.52   0.48  0.37  0.38   1.00   0.28  0.32
nurse2   0.25 -0.04    0.26  0.15  0.04  0.86   -0.56  -0.07   0.84  0.95  0.66   0.28   1.00  0.56
ms       0.30 -0.15    0.23  0.24  0.09  0.59   -1.00  -0.10   0.61  0.59  0.52   0.32   0.56  1.00

Let’s try an example with serious multicollinearity
• To anticipate multicollinearity, ALWAYS good to look at scatterplots and correlation matrices of potential covariates
• What covariates would give rise to a good example?
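One way to do this screening is sketched below. The helper high.cor.pairs() and its cutoff of 0.8 are hypothetical choices, not part of the lecture. Because senicfull.csv is not included here, the example runs on simulated stand-in columns that mimic several hospital-size measures moving together; with the real data one would pass the candidate covariate columns of the data frame instead.

```r
# Hypothetical helper: list pairs of columns whose |correlation| meets a cutoff
high.cor.pairs <- function(df, cutoff = 0.8) {
  r   <- cor(df)
  idx <- which(upper.tri(r) & abs(r) >= cutoff, arr.ind = TRUE)
  data.frame(var1 = rownames(r)[idx[, "row"]],
             var2 = colnames(r)[idx[, "col"]],
             r    = round(r[idx], 2),
             stringsAsFactors = FALSE)
}

# Simulated stand-in data (NOT the SENIC data): three size-related measures
# driven by one underlying factor, plus one unrelated measure
set.seed(49)
n    <- 100
size <- rnorm(n)
sim  <- data.frame(beds   = size + 0.1 * rnorm(n),
                   census = size + 0.1 * rnorm(n),
                   nurse  = size + 0.3 * rnorm(n),
                   xray   = rnorm(n))

flagged <- high.cor.pairs(sim, cutoff = 0.8)
flagged

# Scatterplot matrix of the same candidate covariates
pairs(sim)
```

A fixed cutoff is only a screening device. Pairs just under it can still cause trouble, and a covariate can be nearly a linear combination of several others without any single pairwise correlation being large.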
