Stats 330: Lecture 25

1 / 44

# Stats 330: Lecture 25 - PowerPoint PPT Presentation

Stats 330: Lecture 25. Multiple Logistic Regression: Prediction and a case study. Plan of the day. In today’s lecture we discuss prediction and present a logistic regression case study. Topics covered will be Prediction in logistic regression In-sample and out-of-sample error rates

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Stats 330: Lecture 25' - gabe

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Stats 330: Lecture 25

Multiple Logistic

Regression:

Prediction and a case study

Plan of the day

In today’s lecture we discuss prediction and present a logistic regression case study.

Topics covered will be

Prediction in logistic regression

In-sample and out-of-sample error rates

Cross-validation and bootstrap estimates of error rates

Sensitivity and specificity

ROC curves

Then, a case study

Housekeeping
• Error in slide 34 in lecture 23:

Function is now influenceplots

Prediction
• Suppose we have fitted a logistic model and we want to use the model to predict new cases. If a new case presents with explanatory variables x, how do we predict the y-value, 0 or 1?
• Work out the estimated log-odds for the case
• Work out probability: Prob = exp(log-odds)/(1+exp(log.odds))
• Predict
• Y=1 if prob >= 0.5 (equivalently log.odds >=0)
• Y=0 if prob < 0.5 (equivalently log.odds <0)
Estimating the prediction error
• Prediction error is the probability of a wrong classification ( 0’s predicted as 1’s, 1’s predicted as 0’s)
• As in linear regression, using the training data to estimate these proportions tends to give an optimistic estimate
• We can use cross-validation or the bootstrap to improve the estimate –see the case study
Sensitivity and specificity
• Sensitivity: probability of predicting a 1 when the case is truly a 1: the “true positive rate”
• Specificity: probability of predicting a 0 when the case is truly a 0: the “true negative rate” (1-specificity is called the “false positive rate”)
• Ideally, want both to be close to 1
• We would like to know what these would be for new data – use cross-validation and the bootstrap as for normal regression
Calculating sensitivity and specificity

Specificity = 100/(100+200) = 33%

Sensitivity = 600/(600+250) = 70%

In-sample error rate = (200+250)/1150

ROC curves
• We have predicted a “success” (Y=1) if the log-odds are positive.
• We can generalize this to predict a success if log-odds >=c for some constant c
• If c is large and –ve, almost every case will be predicted as a success (1)
• Sensitivity close to 1, specificity close to 0
• If c is large and +ve, almost every case will be predicted as a failure (0)
• Sensitivity close to 0, specificity close to 1
• Allows a trade-off between sensitivity and specificity
• As c varies, the sensitivity and specificity change.
• ROC curve is a plot of the points (1-specificity, sensitivity)

as c changes.

ROC curves - cont

Perfect prediction

Worst case prediction

True positive rate

True positive rate

False positive rate

Predictor no help

False positive rate

True positive rate

False positive rate

Area under the curve
• For a perfect predictor, the area under the ROC curve (AUC) is 1.
• If the predictor is independent of the response, the sensitivity and specificity are both 0.5.
• AUC curve serves as a measure of how good the model is at predicting.
Case study

The data comes from the University of Massachusetts AIDS Research Unit IMPACT study, amedical study performed in the US in the early 90’s.

The study aimed to evaluate two different treatments for drug addiction.

Reference: Hosmer and Lemeshow, Applied Logistic Regression (2nd Ed), p28

List of variables

Variable Description Codes/ValuesName

Identification Code 1-575 ID

Age at Enrollment Years AGE

Beck Depression Score 0-54 BECK

IV Drug Use History 1 = Never, IVHX

at Admission 2 = Previous, 3 = Recent

No of prior Treatments 0-40 NDRUGTX

Subject's Race 0 = White , RACE

1 = Other

Treatment Duration 0=short, 1 = Long TREAT

Treatment Site 0 = A, 1 = B SITE

Remained Drug Free 1 = Yes, 0 = No DFREE

The variables
• The response DFREE is binary: records if subject is drug-free after conclusion of treatment.
• There is a mix of categorical and continuous explanatory variables
• Categorical: IVHX, RACE, TREAT, SITE
• Continuous: AGE, BECK, NDRUGTX.
Questions
• Is the longer treatment more effective?
• Did Site A deliver the program more effectively than site B?
• What other variables have an effect on successful rehabilitation of addicts?
• Can we predict who is likely to be drug-free in 12 months?
Analysis strategy
• Preliminary plots, tables
• Variable selection
• Model fitting
• Interpretation of coefficients
• Evaluation as a predictor of recovery from addiction.
Preliminary plots (3)
• Seems like number of previous drug treatments have an effect
• Seems like factors IVHX (Recent IV drug use), SITE (Site A or Site B) and TREAT (short or long treatment) have an effect
Preliminary fits (1)

Call:

glm(formula = DFREE ~ . - IVHX - ID + factor(IVHX), family = binomial,

data = drug.df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.3465 -0.8091 -0.6326 1.1834 2.4231

Don’t include ID!

Preliminary fits (2)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.4111283 0.5983427 -4.030 5.59e-05 ***

AGE 0.0504143 0.0174057 2.896 0.00377 **

BECK 0.0002759 0.0107982 0.026 0.97961

NDRUGTX -0.0615329 0.0256441 -2.399 0.01642 *

RACE 0.2260262 0.2233685 1.012 0.31159

TREAT 0.4424802 0.1992922 2.220 0.02640 *

SITE 0.1489209 0.2176062 0.684 0.49375

factor(IVHX)2 -0.6036962 0.2875974 -2.099 0.03581 *

factor(IVHX)3 -0.7336591 0.2549893 -2.877 0.00401 **

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 653.73 on 574 degrees of freedom

Residual deviance: 619.25 on 566 degrees of freedom

AIC: 637.25

Number of Fisher Scoring iterations: 4

Preliminary conclusions
• Important variables seem to be AGE, NDRUGTX, TREAT, IVHX
• Data are ungrouped, can’t assess goodness of fit with residual deviance
• No extremely large residuals
Hosmer-Lemeshow test

>HLstat(drug.glm)

Value of HL statistic = 5.05

P-value = 0.752

No evidence of a bad fit using this test

Variable selection (1)

> anova(drug.glm, test="Chisq")

Analysis of Deviance Table

Response: DFREE

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev P(>|Chi|)

NULL 574 653.73

AGE 1 1.40 573 652.33 0.24

BECK 1 0.57 572 651.76 0.45

NDRUGTX 1 14.07 571 637.69 0.0001760

RACE 1 3.06 570 634.630.08

TREAT 1 4.96 569 629.67 0.03

SITE 1 1.07 568 628.60 0.30

factor(IVHX) 2 9.35 566 619.25 0.01

Variable selection (2)

Step: AIC= 632.59

DFREE ~ NDRUGTX + IVHX + AGE + TREAT

Call: glm(formula = DFREE ~ NDRUGTX + IVHX + AGE + TREAT,

family = binomial, data = drug.df)

Degrees of Freedom: 574 Total (i.e. Null); 569 Residual

Null Deviance: 653.7

Residual Deviance: 620.6 AIC: 632.6

Sub-model

> sub.glm<-glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, family=binomial, data=drug.df)

> summary(sub.glm)

Call:

glm(formula = DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, family = binomial, data = drug.df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.2598 -0.8051 -0.6284 1.1401 2.4574

Sub-model (ii)

All variables significant,

but use caution

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.33276 0.54838 -4.254 2.1e-05 ***

NDRUGTX -0.06376 0.02563 -2.488 0.012844 *

factor(IVHX)2 -0.62366 0.28470 -2.191 0.028484 *

factor(IVHX)3 -0.80561 0.24453 -3.294 0.000986 ***

AGE 0.05259 0.01721 3.056 0.002244 **

TREAT 0.45134 0.19860 2.273 0.023048 *

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 653.73 on 574 degrees of freedom

Residual deviance: 620.59 on 569 degrees of freedom

AIC: 632.59

Number of Fisher Scoring iterations: 4

Do we need interaction terms?

> sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT , family=binomial, data=drug.df)

> sub2.glm<-glm(DFREE ~ NDRUGTX*IVHX + AGE*IVHX + AGE*TREAT + NDRUGTX*TREAT , family=binomial, data=drug.df)

>

> anova(sub.glm, sub2.glm, test="Chisq")

Analysis of Deviance Table

Model 1: DFREE ~ NDRUGTX + IVHX + AGE + TREAT

Model 2: DFREE ~ NDRUGTX * IVHX + AGE * IVHX + AGE * TREAT + NDRUGTX * TREAT

Resid. Df Resid. Dev Df Deviance P(>|Chi|)

1 569 620.59

2 563 616.16 6 4.42 0.62

Model with interactions

Big p-value so interactions not required

Do we need to transform?

par(mfrow=c(1,2))

sub.gam<-gam(DFREE ~ s(NDRUGTX) + factor(IVHX) + s(AGE) + TREAT , family=binomial, data=drug.df)

plot(sub.gam)

Transforming
• Suggests a possible quadratic in NDRUGTX:

> subq.glm<-glm(DFREE ~ poly(NDRUGTX,2) + IVHX + AGE + TREAT, family=binomial, data=drug.df)

> summary(subq.glm)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.70763 0.56715 -4.774 1.80e-06 ***

poly(NDRUGTX, 2)1 -7.24501 2.93274 -2.470 0.01350 *

poly(NDRUGTX, 2)2 4.21319 2.69624 1.563 0.11814

IVHX2 -0.59018 0.28608 -2.063 0.03912 *

IVHX3 -0.76015 0.24725 -3.074 0.00211 **

AGE 0.05458 0.01730 3.154 0.00161 **

TREAT 0.44379 0.19904 2.230 0.02577 *

But term is not significant, so we stick with no transformation

Diagnostics

Pt 85

7, 471

7, 471

Influence of 7, 471, 85

Effect on coefficients of removing cases:

None 7 85 471 All

(Intercept) -2.333 -2.295 -2.222 -2.447 -2.293

NDRUGTX -0.064 -0.084 -0.065 -0.075 -0.100

IVHX2 -0.624 -0.595 -0.635 -0.680 -0.662

IVHX3 -0.806 -0.783 -0.785 -0.795 -0.747

AGE 0.053 0.053 0.049 0.057 0.054

TREAT 0.451 0.434 0.441 0.479 0.450

None seem particularly influential! We will not delete them

Over-dispersion

qsub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=quasibinomial, data=drug.df)

> summary(qsub.glm)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.33276 0.55435 -4.208 2.99e-05 ***

NDRUGTX -0.06376 0.02591 -2.461 0.01414 *

IVHX2 -0.62366 0.28780 -2.167 0.03065 *

IVHX3 -0.80561 0.24720 -3.259 0.00118 **

AGE 0.05259 0.01740 3.023 0.00262 **

TREAT 0.45134 0.20076 2.248 0.02495 *

---

(Dispersion parameter for quasibinomial family taken to be 1.021892)

Very close to 1 so no overdispersion

Interpretation

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.33276 0.54838 -4.254 2.1e-05 ***

NDRUGTX -0.06376 0.02563 -2.488 0.012844 *

IVHX2 -0.62366 0.28470 -2.191 0.028484 *

IVHX3 -0.80561 0.24453 -3.294 0.000986 ***

AGE 0.05259 0.01721 3.056 0.002244 **

TREAT 0.45134 0.19860 2.273 0.023048 *

• As the number of prior treatments goes up, the probability of a drug-free recovery goes down
• The probability of a drug-free recovery for persons with no IV drug use is more than for persons with previous IV drug use
• The probability of a drug-free recovery for persons with previous IV drug use is more than for persons with recent IV drug use.
• The probability of a drug-free recovery goes up with age
• The probability of a drug-free recovery is higher for the long treatment
Interpreting p-values after model selection
• We have seen that this is not valid, as model selection changes the distribution of the estimated coefficients.
• We can use the bootstrap to examine the revised distribution
• Leave TREAT in the model, use forward selection to select the other variables.
Procedure
• Draw a bootstrap sample
• Do forward selection, record value of regression coef for TREAT (forced to be in every model)
• Repeat 200 times, draw histogram of the results
R code

# bootstrap sample

n = dim(drug.df)[1]

B=200

beta.boot = numeric(B)

for(b in 1:B){

ni = rmultinom(1, n ,prob=rep(1/n,n))

newdata = drug.df[rep(1:n,ni),]

drug.boot.glm = glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + BECK + TREAT + RACE + SITE,

family=binomial, data= newdata)

chosen = step(drug.boot.glm, list(lower = DFREE ~ TREAT, upper= formula(drug.boot.glm)),

direction = “forward", trace=0)

k = match("TREAT",names(coef(chosen)))

beta.boot[b] = summary(chosen)\$coefficients[k,1]

}

Histogram

> mean(beta.boot)

[1] 0.4540209

> sd(beta.boot)

[1] 0.2019222

> z.val =

mean(beta.boot)/

sd(beta.boot)

> 2*(1-pnorm(z.val))

[1] 0.02454468

Compare

Beta = 0.45134

SE = 0.19860

P-value = 0.023048

Prediction
• Sensitivity: chance the model predicts a successful recovery (drug-free at end of program), when one will actually occur
• Specificity: chance the model predicts a failure (return to drug use before end of program), when one actually will occur
R code

sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT , family=binomial, data=drug.df)

> pred = predict(sub.glm, type="response")

> predcode = ifelse(pred<0.5, 0,1)

> table(drug.df\$DFREE,predcode)

predicted

0 1

Actual 0 426 2

1 144 3

Sensitivity = 3/147 = 0.02040816

Specificity = 426/428 = 0.9953271

Error rate = 146/575 = 0.2539130

Proportion correctly classified = 429/575 = 0.746087

ROC curve

ROC.curve(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df)

# in the R330 package

Prediction (2)
• Use 10-fold cross-validation
• Split data into 10 parts
• Calculate sensitivity and specificity for each part, using model fitted to the remaining parts
• Average results
• Repeat for different splits, average repeats
• E.g. for one part
CV and bootstrap Results

> cross.val(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, drug.df)

Mean Specificity = 0.9908424

Mean Sensitivity = 0.02854005

Mean Correctly classified = 0.7446491

> err.boot(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df)

\$err

[1] 0.2539130

\$Err

[1] 0.2552974

A poor classifier, but this doesn’t mean

that the model fits poorly – there are very few

cases with fitted probs over 0.5, and many

with fitted probabilities between 0.2 and 0.5.

We expect a moderate number of these to

be misclassified, as some events (being drug free)

with probs 0.2 to 0.5 have occurred.

Bootstrap estimate

Training set estimate

Overall conclusions
• Model seems to fit well
• Strong evidence that longer treatments are better
• No apparent difference between sites
• Age and prior IV drug use affect recovery
• Model predicts poorly for the covariates in the data set – effectively always predicts patients will not be drug free