Stats 330: Lecture 25

Stats 330: Lecture 25 Multiple Logistic Regression: Prediction and a case study

Plan of the day In today’s lecture we discuss prediction and present a logistic regression case study. Topics covered will be Prediction in logistic regression In-sample and out-of-sample error rates Cross-validation and bootstrap estimates of error rates Sensitivity and specificity ROC curves Then, a case study

Housekeeping • Error in slide 34 in lecture 23: Function is now influenceplots • Bug in ROC.curve – download replacement from web page

Prediction • Suppose we have fitted a logistic model and we want to use the model to predict new cases. If a new case presents with explanatory variables x, how do we predict the y-value, 0 or 1? • Work out the estimated log-odds for the case • Work out probability: Prob = exp(log-odds)/(1+exp(log.odds)) • Predict • Y=1 if prob >= 0.5 (equivalently log.odds >=0) • Y=0 if prob < 0.5 (equivalently log.odds <0)

Estimating the prediction error • Prediction error is the probability of a wrong classification ( 0’s predicted as 1’s, 1’s predicted as 0’s) • As in linear regression, using the training data to estimate these proportions tends to give an optimistic estimate • We can use cross-validation or the bootstrap to improve the estimate –see the case study

Sensitivity and specificity • Sensitivity: probability of predicting a 1 when the case is truly a 1: the “true positive rate” • Specificity: probability of predicting a 0 when the case is truly a 0: the “true negative rate” (1-specificity is called the “false positive rate”) • Ideally, want both to be close to 1 • We would like to know what these would be for new data – use cross-validation and the bootstrap as for normal regression

Calculating sensitivity and specificity Specificity = 100/(100+200) = 33% Sensitivity = 600/(600+250) = 70% In-sample error rate = (200+250)/1150

ROC curves • We have predicted a “success” (Y=1) if the log-odds are positive. • We can generalize this to predict a success if log-odds >=c for some constant c • If c is large and –ve, almost every case will be predicted as a success (1) • Sensitivity close to 1, specificity close to 0 • If c is large and +ve, almost every case will be predicted as a failure (0) • Sensitivity close to 0, specificity close to 1 • Allows a trade-off between sensitivity and specificity • As c varies, the sensitivity and specificity change. • ROC curve is a plot of the points (1-specificity, sensitivity) as c changes.

ROC curves - cont Perfect prediction Worst case prediction True positive rate True positive rate False positive rate Predictor no help False positive rate True positive rate False positive rate

Area under the curve • For a perfect predictor, the area under the ROC curve (AUC) is 1. • If the predictor is independent of the response, the sensitivity and specificity are both 0.5. • AUC curve serves as a measure of how good the model is at predicting.

Case study The data comes from the University of Massachusetts AIDS Research Unit IMPACT study, amedical study performed in the US in the early 90’s. The study aimed to evaluate two different treatments for drug addiction. Reference: Hosmer and Lemeshow, Applied Logistic Regression (2nd Ed), p28

List of variables Variable Description Codes/ValuesName Identification Code 1-575 ID Age at Enrollment Years AGE Beck Depression Score 0-54 BECK IV Drug Use History 1 = Never, IVHX at Admission 2 = Previous, 3 = Recent No of prior Treatments 0-40 NDRUGTX Subject's Race 0 = White , RACE 1 = Other Treatment Duration 0=short, 1 = Long TREAT Treatment Site 0 = A, 1 = B SITE Remained Drug Free 1 = Yes, 0 = No DFREE

The variables • The response DFREE is binary: records if subject is drug-free after conclusion of treatment. • There is a mix of categorical and continuous explanatory variables • Categorical: IVHX, RACE, TREAT, SITE • Continuous: AGE, BECK, NDRUGTX.

Questions • Is the longer treatment more effective? • Did Site A deliver the program more effectively than site B? • What other variables have an effect on successful rehabilitation of addicts? • Can we predict who is likely to be drug-free in 12 months?

Analysis strategy • Preliminary plots, tables • Variable selection • Model fitting • Interpretation of coefficients • Evaluation as a predictor of recovery from addiction.

Preliminary plots

Preliminary plots (2)

Preliminary plots (3) • Seems like number of previous drug treatments have an effect • Seems like factors IVHX (Recent IV drug use), SITE (Site A or Site B) and TREAT (short or long treatment) have an effect

Preliminary fits (1) Call: glm(formula = DFREE ~ . - IVHX - ID + factor(IVHX), family = binomial, data = drug.df) Deviance Residuals: Min 1Q Median 3Q Max -1.3465 -0.8091 -0.6326 1.1834 2.4231 Don’t include ID!

Preliminary fits (2) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.4111283 0.5983427 -4.030 5.59e-05 *** AGE 0.0504143 0.0174057 2.896 0.00377 ** BECK 0.0002759 0.0107982 0.026 0.97961 NDRUGTX -0.0615329 0.0256441 -2.399 0.01642 * RACE 0.2260262 0.2233685 1.012 0.31159 TREAT 0.4424802 0.1992922 2.220 0.02640 * SITE 0.1489209 0.2176062 0.684 0.49375 factor(IVHX)2 -0.6036962 0.2875974 -2.099 0.03581 * factor(IVHX)3 -0.7336591 0.2549893 -2.877 0.00401 ** (Dispersion parameter for binomial family taken to be 1) Null deviance: 653.73 on 574 degrees of freedom Residual deviance: 619.25 on 566 degrees of freedom AIC: 637.25 Number of Fisher Scoring iterations: 4

Preliminary conclusions • Important variables seem to be AGE, NDRUGTX, TREAT, IVHX • Data are ungrouped, can’t assess goodness of fit with residual deviance • No extremely large residuals

Hosmer-Lemeshow test >HLstat(drug.glm) Value of HL statistic = 5.05 P-value = 0.752 No evidence of a bad fit using this test

Variable selection (1) > anova(drug.glm, test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: DFREE Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 574 653.73 AGE 1 1.40 573 652.33 0.24 BECK 1 0.57 572 651.76 0.45 NDRUGTX 1 14.07 571 637.69 0.0001760 RACE 1 3.06 570 634.630.08 TREAT 1 4.96 569 629.67 0.03 SITE 1 1.07 568 628.60 0.30 factor(IVHX) 2 9.35 566 619.25 0.01

Variable selection (2) Step: AIC= 632.59 DFREE ~ NDRUGTX + IVHX + AGE + TREAT Call: glm(formula = DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family = binomial, data = drug.df) Degrees of Freedom: 574 Total (i.e. Null); 569 Residual Null Deviance: 653.7 Residual Deviance: 620.6 AIC: 632.6

Sub-model > sub.glm<-glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, family=binomial, data=drug.df) > summary(sub.glm) Call: glm(formula = DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, family = binomial, data = drug.df) Deviance Residuals: Min 1Q Median 3Q Max -1.2598 -0.8051 -0.6284 1.1401 2.4574

Sub-model (ii) All variables significant, but use caution Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.33276 0.54838 -4.254 2.1e-05 *** NDRUGTX -0.06376 0.02563 -2.488 0.012844 * factor(IVHX)2 -0.62366 0.28470 -2.191 0.028484 * factor(IVHX)3 -0.80561 0.24453 -3.294 0.000986 *** AGE 0.05259 0.01721 3.056 0.002244 ** TREAT 0.45134 0.19860 2.273 0.023048 * (Dispersion parameter for binomial family taken to be 1) Null deviance: 653.73 on 574 degrees of freedom Residual deviance: 620.59 on 569 degrees of freedom AIC: 632.59 Number of Fisher Scoring iterations: 4

Do we need interaction terms? > sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT , family=binomial, data=drug.df) > sub2.glm<-glm(DFREE ~ NDRUGTX*IVHX + AGE*IVHX + AGE*TREAT + NDRUGTX*TREAT , family=binomial, data=drug.df) > > anova(sub.glm, sub2.glm, test="Chisq") Analysis of Deviance Table Model 1: DFREE ~ NDRUGTX + IVHX + AGE + TREAT Model 2: DFREE ~ NDRUGTX * IVHX + AGE * IVHX + AGE * TREAT + NDRUGTX * TREAT Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 569 620.59 2 563 616.16 6 4.42 0.62 Model with interactions Big p-value so interactions not required

Do we need to transform? par(mfrow=c(1,2)) sub.gam<-gam(DFREE ~ s(NDRUGTX) + factor(IVHX) + s(AGE) + TREAT , family=binomial, data=drug.df) plot(sub.gam)

Transforming • Suggests a possible quadratic in NDRUGTX: > subq.glm<-glm(DFREE ~ poly(NDRUGTX,2) + IVHX + AGE + TREAT, family=binomial, data=drug.df) > summary(subq.glm) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.70763 0.56715 -4.774 1.80e-06 *** poly(NDRUGTX, 2)1 -7.24501 2.93274 -2.470 0.01350 * poly(NDRUGTX, 2)2 4.21319 2.69624 1.563 0.11814 IVHX2 -0.59018 0.28608 -2.063 0.03912 * IVHX3 -0.76015 0.24725 -3.074 0.00211 ** AGE 0.05458 0.01730 3.154 0.00161 ** TREAT 0.44379 0.19904 2.230 0.02577 * But term is not significant, so we stick with no transformation

Diagnostics Pt 85 7, 471 7, 471

Influence of 7, 471, 85 Effect on coefficients of removing cases: None 7 85 471 All (Intercept) -2.333 -2.295 -2.222 -2.447 -2.293 NDRUGTX -0.064 -0.084 -0.065 -0.075 -0.100 IVHX2 -0.624 -0.595 -0.635 -0.680 -0.662 IVHX3 -0.806 -0.783 -0.785 -0.795 -0.747 AGE 0.053 0.053 0.049 0.057 0.054 TREAT 0.451 0.434 0.441 0.479 0.450 None seem particularly influential! We will not delete them

Over-dispersion qsub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT, family=quasibinomial, data=drug.df) > summary(qsub.glm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.33276 0.55435 -4.208 2.99e-05 *** NDRUGTX -0.06376 0.02591 -2.461 0.01414 * IVHX2 -0.62366 0.28780 -2.167 0.03065 * IVHX3 -0.80561 0.24720 -3.259 0.00118 ** AGE 0.05259 0.01740 3.023 0.00262 ** TREAT 0.45134 0.20076 2.248 0.02495 * --- (Dispersion parameter for quasibinomial family taken to be 1.021892) Very close to 1 so no overdispersion

Interpretation Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.33276 0.54838 -4.254 2.1e-05 *** NDRUGTX -0.06376 0.02563 -2.488 0.012844 * IVHX2 -0.62366 0.28470 -2.191 0.028484 * IVHX3 -0.80561 0.24453 -3.294 0.000986 *** AGE 0.05259 0.01721 3.056 0.002244 ** TREAT 0.45134 0.19860 2.273 0.023048 * • As the number of prior treatments goes up, the probability of a drug-free recovery goes down • The probability of a drug-free recovery for persons with no IV drug use is more than for persons with previous IV drug use • The probability of a drug-free recovery for persons with previous IV drug use is more than for persons with recent IV drug use. • The probability of a drug-free recovery goes up with age • The probability of a drug-free recovery is higher for the long treatment

Interpreting p-values after model selection • We have seen that this is not valid, as model selection changes the distribution of the estimated coefficients. • We can use the bootstrap to examine the revised distribution • Leave TREAT in the model, use forward selection to select the other variables.

Procedure • Draw a bootstrap sample • Do forward selection, record value of regression coef for TREAT (forced to be in every model) • Repeat 200 times, draw histogram of the results

R code # bootstrap sample n = dim(drug.df)[1] B=200 beta.boot = numeric(B) for(b in 1:B){ ni = rmultinom(1, n ,prob=rep(1/n,n)) newdata = drug.df[rep(1:n,ni),] drug.boot.glm = glm(DFREE ~ NDRUGTX + factor(IVHX) + AGE + BECK + TREAT + RACE + SITE, family=binomial, data= newdata) chosen = step(drug.boot.glm, list(lower = DFREE ~ TREAT, upper= formula(drug.boot.glm)), direction = “forward", trace=0) k = match("TREAT",names(coef(chosen))) beta.boot[b] = summary(chosen)$coefficients[k,1] }

Histogram > mean(beta.boot) [1] 0.4540209 > sd(beta.boot) [1] 0.2019222 > z.val = mean(beta.boot)/ sd(beta.boot) > 2*(1-pnorm(z.val)) [1] 0.02454468 Compare Beta = 0.45134 SE = 0.19860 P-value = 0.023048

Prediction • Sensitivity: chance the model predicts a successful recovery (drug-free at end of program), when one will actually occur • Specificity: chance the model predicts a failure (return to drug use before end of program), when one actually will occur

R code sub.glm<-glm(DFREE ~ NDRUGTX + IVHX + AGE + TREAT , family=binomial, data=drug.df) > pred = predict(sub.glm, type="response") > predcode = ifelse(pred<0.5, 0,1) > table(drug.df$DFREE,predcode) predicted 0 1 Actual 0 426 2 1 144 3 Sensitivity = 3/147 = 0.02040816 Specificity = 426/428 = 0.9953271 Error rate = 146/575 = 0.2539130 Proportion correctly classified = 429/575 = 0.746087

ROC curve ROC.curve(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df) # in the R330 package

Prediction (2) • Use 10-fold cross-validation • Split data into 10 parts • Calculate sensitivity and specificity for each part, using model fitted to the remaining parts • Average results • Repeat for different splits, average repeats • E.g. for one part

CV and bootstrap Results > cross.val(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, drug.df) Mean Specificity = 0.9908424 Mean Sensitivity = 0.02854005 Mean Correctly classified = 0.7446491 > err.boot(DFREE ~ NDRUGTX + factor(IVHX) + AGE + TREAT, data= drug.df) $err [1] 0.2539130 $Err [1] 0.2552974 A poor classifier, but this doesn’t mean that the model fits poorly – there are very few cases with fitted probs over 0.5, and many with fitted probabilities between 0.2 and 0.5. We expect a moderate number of these to be misclassified, as some events (being drug free) with probs 0.2 to 0.5 have occurred. Bootstrap estimate Training set estimate

Overall conclusions • Model seems to fit well • Strong evidence that longer treatments are better • No apparent difference between sites • Age and prior IV drug use affect recovery • Model predicts poorly for the covariates in the data set – effectively always predicts patients will not be drug free

Stats 330: Lecture 25