Lecture 11 multicollinearity
This presentation is the property of its rightful owner.
Sponsored Links
1 / 25

Lecture 11 Multicollinearity PowerPoint PPT Presentation


  • 58 Views
  • Uploaded on
  • Presentation posted in: General

Lecture 11 Multicollinearity. BMTRY 701 Biostatistical Methods II. Multicollinearity Introduction. Some common questions we ask in MLR what is the relative importance of the effects of the different covariates? what is the magnitude of effect of a given covariate on the response?

Download Presentation

Lecture 11 Multicollinearity

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Lecture 11 multicollinearity

Lecture 11Multicollinearity

BMTRY 701Biostatistical Methods II


Multicollinearity introduction

Multicollinearity Introduction

  • Some common questions we ask in MLR

    • what is the relative importance of the effects of the different covariates?

    • what is the magnitude of effect of a given covariate on the response?

    • can any covariate be dropped from the model because it has little effect or no effect on the outcome?

    • should any covariates not yet included in the model be considered for possible inclusion?


Easy answers

Easy answers?

  • If the candidate covariates are uncorrelated with one another: yes, these are simple questions

  • If the candidate covariates are correlated with one another: no, these are not easy.

  • Most commonly:

    • observational studies have correlated covariates

    • we need to adjust for these when assessing relationships

    • “adjusting” for confounders

  • Experimental designs?

    • less problematic

    • patients are randomized in common designs

    • no confounding exists because factors are ‘balanced’ across arms


Multicollinearity

Multicollinearity

  • Also called “intercorrelation”

  • refers to the situation when the covariates are related to each other and to the outcome of interest

  • like confounding, but a statistical terminology for it because of the effects it has on regression modeling


No multicollinearity example mouse experiment

No Multicollinearity Example: Mouse experiment


Linear modeling

Linear modeling

  • Interested in seeing which factors influence tumor size in mice

  • Notice that the experiment is perfectly balanced.

  • What does that mean?


Dose of drug a on tumor

Dose of Drug A on Tumor

> reg.a <- lm(Tumor.size ~ Dose.A, data=data)

> summary(reg.a)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 32.50000 12.29041 2.644 0.0246 *

Dose.A -0.05250 0.05689 -0.923 0.3779

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.09 on 10 degrees of freedom

Multiple R-Squared: 0.07847, Adjusted R-squared: -0.01368

F-statistic: 0.8515 on 1 and 10 DF, p-value: 0.3779

>


Dose of drug b on tumor

Dose of Drug B on Tumor

> reg.b <- lm(Tumor.size ~ Dose.B, data=data)

> summary(reg.b)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 58.0000 9.4956 6.108 0.000114 ***

Dose.B -0.9600 0.2402 -3.996 0.002533 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.4 on 10 degrees of freedom

Multiple R-Squared: 0.6149, Adjusted R-squared: 0.5764

F-statistic: 15.97 on 1 and 10 DF, p-value: 0.002533

>


Diet on tumor

Diet on Tumor

> reg.diet <- lm(Tumor.size ~ Diet, data=data)

> summary(reg.diet)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 28.000 6.296 4.448 0.00124 **

Diet -12.000 8.903 -1.348 0.20745

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.42 on 10 degrees of freedom

Multiple R-Squared: 0.1537, Adjusted R-squared: 0.06911

F-statistic: 1.817 on 1 and 10 DF, p-value: 0.2075


All in the model together

All in the model together

> reg.all <- lm(Tumor.size ~ Dose.A + Dose.B + Diet, data=data)

> summary(reg.all)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 74.50000 8.72108 8.543 2.71e-05 ***

Dose.A -0.05250 0.02591 -2.027 0.077264 .

Dose.B -0.96000 0.16921 -5.673 0.000469 ***

Diet -12.00000 4.23035 -2.837 0.021925 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.327 on 8 degrees of freedom

Multiple R-Squared: 0.8472, Adjusted R-squared: 0.7898

F-statistic: 14.78 on 3 and 8 DF, p-value: 0.001258


Correlation matrix of predictors and outcome

Correlation matrix of predictors and outcome

> cor(data[,-1])

Dose.A Dose.B Diet Tumor.size

Dose.A 1.0000000 0.0000000 0.0000000 -0.2801245

Dose.B 0.0000000 1.0000000 0.0000000 -0.7841853

Diet 0.0000000 0.0000000 1.0000000 -0.3920927

Tumor.size -0.2801245 -0.7841853 -0.3920927 1.0000000

>


Result

Result

  • For perfectly balanced designs, adjusting does not affect the coefficients

  • However, it can affect the significance

  • Why?

    • residual sum of squares is affected

    • if you explain more of the variance in the outcome, less is left to chance/error

    • when you adjust for another related factor, you will likely improve the significance


The other extreme perfect collinearity

The other extreme: perfect collinearity


The model has infinitely many solutions

The model has infinitely many solutions

  • Too much flexibility

  • What happens?

  • The fitting algorithm usually gives you some indication of this

    • will not fit the model and gives an error

    • drops one of the predictors

  • “perfectly collinear” = “perfect confounding”


Effects of multicollinearity

Effects of Multicollinearity

  • Most common result

    • two covariates are independently associated with Y in simple linear regression models

    • in MLR model with both covariates, one or both is insignificant

    • the magnitude of the regression coefficients is attenuated

    • why?

      • recall the adjusted variable plot

      • if the two are related, removing the systematic part of one from Y may leave too little left to explain


Effects of multicollinearity1

Effects of Multicollinearity

  • Other situations

    • Neither is significant alone, but they are both significant together (somewhat rare)

    • Both are significant alone and both retain signficance in the model

    • The regression coefficient for one of the covariates may change direction

    • Magnitude of coefficient may increase (in absolute value)

  • It is usually hard to predict exactly what will happen when both are in the model


Implications in inference

Implications in inference

  • the interpretation of a regression coefficient measuring the change in the expected value of Y when the covariate is increased while all other are held constant is not quite applicable

  • It may be conceptually feasible to think of ‘holding all constant’

  • but, practically, it may not be possible if the covariates are related.

  • Example: amount of rainfall and hours of sunshine


Implications in inference1

Implications in inference

  • multicollinearity tends to inflate the standard errors on the regression coefficients

  • when multicollinearity is present, you will see that coefficient of partial determination will have little increase with the addition of the collinear covariate

  • Predictions tend to be relatively unaffected for better or worse when a highly collinear covariate is added to the model.


Implications in inference2

Implications in Inference

  • Recall the interpretation of the t-statistics in MLR

  • The represent the significance of a variable, adjusting for all else in the model

  • If two covariates are highly correlated, then both are likely to end up insignificant

  • Marginal nature of t-tests!

  • ANOVA can be more useful due to conditional nature of tables.


So which is the correct variable

So, which is the ‘correct’ variable?

  • Almost impossible to tell

  • Usually, people choose the one that is ‘more’ significant.

  • but that does not mean it is the correct choice

    • it could be the correct choice

    • it could be the one that is less associated

      • why might it be less associated?

      • measurement issues

    • the correct ‘culprit’ could be a variable that is related to the ones in the model but not in the model itself.


Example

Example

  • Let’s look at our classic example of logLOS

  • What variables are associated with logLOS?

  • What variables have the potential to create multicollinearity?


Senic

SENIC


Lecture 11 multicollinearity

> data <- read.csv("senicfull.csv")

> data$logLOS <- log(data$LOS)

> data$nurse2 <- data$NURSE^2

> data$ms <- ifelse(data$MEDSCHL==2,0,data$MEDSCHL)

>

> data.cor <- data[,-1]

> round(cor(data.cor),2)

LOS AGE INFRISK CULT XRAY BEDS MEDSCHL REGION CENSUS NURSE FACS logLOS nurse2 ms

LOS 1.00 0.19 0.53 0.33 0.38 0.41 -0.30 -0.49 0.47 0.34 0.36 0.98 0.25 0.30

AGE 0.19 1.00 0.00 -0.23 -0.02 -0.06 0.15 -0.02 -0.05 -0.08 -0.04 0.17 -0.04 -0.15

INFRISK 0.53 0.00 1.00 0.56 0.45 0.36 -0.23 -0.19 0.38 0.39 0.41 0.55 0.26 0.23

CULT 0.33 -0.23 0.56 1.00 0.42 0.14 -0.24 -0.31 0.14 0.20 0.19 0.35 0.15 0.24

XRAY 0.38 -0.02 0.45 0.42 1.00 0.05 -0.09 -0.30 0.06 0.08 0.11 0.39 0.04 0.09

BEDS 0.41 -0.06 0.36 0.14 0.05 1.00 -0.59 -0.11 0.98 0.92 0.79 0.42 0.86 0.59

MEDSCHL -0.30 0.15 -0.23 -0.24 -0.09 -0.59 1.00 0.10 -0.61 -0.59 -0.52 -0.32 -0.56 -1.00

REGION -0.49 -0.02 -0.19 -0.31 -0.30 -0.11 0.10 1.00 -0.15 -0.11 -0.21 -0.52 -0.07 -0.10

CENSUS 0.47 -0.05 0.38 0.14 0.06 0.98 -0.61 -0.15 1.00 0.91 0.78 0.48 0.84 0.61

NURSE 0.34 -0.08 0.39 0.20 0.08 0.92 -0.59 -0.11 0.91 1.00 0.78 0.37 0.95 0.59

FACS 0.36 -0.04 0.41 0.19 0.11 0.79 -0.52 -0.21 0.78 0.78 1.00 0.38 0.66 0.52

logLOS 0.98 0.17 0.55 0.35 0.39 0.42 -0.32 -0.52 0.48 0.37 0.38 1.00 0.28 0.32

nurse2 0.25 -0.04 0.26 0.15 0.04 0.86 -0.56 -0.07 0.84 0.95 0.66 0.28 1.00 0.56

ms 0.30 -0.15 0.23 0.24 0.09 0.59 -1.00 -0.10 0.61 0.59 0.52 0.32 0.56 1.00

>


Let s try an example with serious multicollinearity

Let’s try an example with serious multicollinearity

  • To anticipate multicollinearity, ALWAYS good to look at scatterplots and correlation matrices of potential covariates

  • What covariates would give rise to a good example?


  • Login