
Logistic Regression






Presentation Transcript


  1. Logistic Regression

  2. Example: Survival of Titanic passengers
  • We want to know if the probability of survival is higher among children
  • Outcome (y) = survived/not
  • Explanatory (x) = age at death
  > titan<-read.table("D:/RShortcourse/titanic.txt",sep=",",header=T,na.strings=".")
  > titan2<-na.omit(titan)
  > log1<-glm(titan2$Survived~titan2$Age,family=binomial("logit"))
  Since "logit" is the default link, you can simply use:
  > log1<-glm(titan2$Survived~titan2$Age,binomial)
  > summary(log1)

  3. Example in R
  Call:
  glm(formula = titan2$Survived ~ titan2$Age, family = binomial("logit"))
  Coefficients:
                Estimate Std. Error z value Pr(>|z|)
  (Intercept) -0.081428   0.173862  -0.468   0.6395
  titan2$Age  -0.008795   0.005232  -1.681   0.0928 .
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  (Dispersion parameter for binomial family taken to be 1)
      Null deviance: 1025.6 on 755 degrees of freedom
  Residual deviance: 1022.7 on 754 degrees of freedom
  AIC: 1026.7

  4. Plot
  > plot(fitted(log1)~titan2$Age, xlab="Age", ylab="P(Survival)", pch=15)

  5. Example: Categorical x
  • Sometimes, it's easier to interpret logistic regression output if the x variables are categorical
  • Suppose we categorize passenger age into 4 categories:
  > titan2$agecat4 <- ifelse(titan2$Age>35, 1, 0)
  > titan2$agecat3 <- ifelse(titan2$Age>=18 & titan2$Age<=35, 1, 0)
  > titan2$agecat2 <- ifelse(titan2$Age>=6 & titan2$Age<=17, 1, 0)
  > titan2$agecat1 <- ifelse(titan2$Age<6, 1, 0)
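An alternative to writing one ifelse() per dummy is cut(), which builds a factor in one step; a factor on the right-hand side of a glm() formula is expanded into dummies automatically, with the first level as the reference. A minimal sketch on hypothetical ages (variable names here are illustrative, not from the Titanic data):

```r
# Categorize a numeric age vector with cut(); right-closed intervals
# (-Inf,5], (5,17], (17,35], (35,Inf] match the <6 / 6-17 / 18-35 / >35
# grouping for whole-number ages.
age <- c(2, 10, 25, 40, 70, 5, 30)
agecat <- cut(age, breaks = c(-Inf, 5, 17, 35, Inf),
              labels = c("under6", "6to17", "18to35", "over35"))
table(agecat)  # counts per category; glm(y ~ agecat, binomial) would
               # then create the dummy variables for you
```

Using relevel() on the factor lets you pick which category serves as the reference, instead of deciding which dummy to leave out by hand.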

  6. Example in R
  > log2<-glm(titan2$Survived~titan2$agecat2 + titan2$agecat3 + titan2$agecat4, binomial)
  > summary(log2)
  • Remember: with a set of dummy variables, you always put in one less variable than the number of categories (the omitted category is the reference)

  7. Example in R
  Call:
  glm(formula = titan2$Survived ~ titan2$agecat1 + titan2$agecat2 + titan2$agecat3, family = binomial)
  Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
  (Intercept)     -0.2924     0.1284  -2.278 0.022734 *
  titan2$agecat1   1.5452     0.4209   3.671 0.000242 ***
  titan2$agecat2   0.2924     0.2883   1.014 0.310573
  titan2$agecat3  -0.2758     0.1643  -1.679 0.093172 .
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Null deviance: 1025.57 on 755 degrees of freedom
  Residual deviance: 999.07 on 752 degrees of freedom
  AIC: 1007.1
  • Note that this fit includes the dummies for the three younger groups, so the reference category is passengers >35 years

  8. Interpretation: odds ratios
  > exp(cbind(OR=coef(log2),confint(log2)))
  Waiting for profiling to be done...
                        OR      2.5 %    97.5 %
  (Intercept)    3.5000000 1.67176790 8.2301627
  titan2$agecat2 0.2857143 0.10674004 0.7040320
  titan2$agecat3 0.1618685 0.06737731 0.3485255
  titan2$agecat4 0.2132797 0.08772147 0.4663720
  • This tells us that:
  • The reference category is passengers <6 years (agecat1), whose estimated odds of survival are 3.5
  • All three older age groups have odds ratios below 1, with 95% confidence intervals that exclude 1, so each has significantly lower odds of survival than passengers <6 years
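The OR column above is simply exp() applied to the model coefficients. A minimal sketch on simulated data (all names hypothetical), showing that exponentiating a logistic-regression coefficient recovers the odds ratio for that predictor:

```r
# Simulate a binary outcome whose true log-odds depend on a binary x,
# then check that exp(coef) estimates the odds ratio.
set.seed(1)
x <- rbinom(500, 1, 0.5)        # binary predictor
p <- plogis(-0.5 + 1.2 * x)     # true model: log-odds = -0.5 + 1.2*x
y <- rbinom(500, 1, p)
fit <- glm(y ~ x, family = binomial)
or  <- unname(exp(coef(fit)["x"]))  # estimated odds ratio; true value exp(1.2)
```

The same exp() transform applied to confint() endpoints, as on the slide, gives the confidence interval on the odds-ratio scale.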

  9. Now let’s make it more complicated
  • Do sex and passenger class predict survival?
  • Let’s try a saturated model:
  > log3<-glm(titan$Survived~titan$Sex + titan$PClass + titan$Sex:titan$PClass, binomial)
  > summary(log3)
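As a side note, R's formula language offers a shorthand for this pattern: a*b expands to a + b + a:b. A minimal sketch on simulated factors (names hypothetical) confirming the two spellings fit the same model:

```r
# Two equivalent ways to write main effects plus their interaction.
set.seed(5)
sex    <- factor(sample(c("female", "male"), 200, replace = TRUE))
pclass <- factor(sample(c("1st", "2nd", "3rd"), 200, replace = TRUE))
y      <- rbinom(200, 1, 0.5)                      # placeholder outcome
m_long  <- glm(y ~ sex + pclass + sex:pclass, family = binomial)
m_short <- glm(y ~ sex * pclass, family = binomial)
all.equal(coef(m_long), coef(m_short))             # same coefficients
```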

  10. Example in R
  Call:
  glm(formula = titan$Survived ~ titan$Sex + titan$PClass + titan$Sex:titan$PClass, family = binomial)
  Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
  (Intercept)                     2.7006     0.3443   7.843 4.41e-15 ***
  titan$Sexmale                  -3.4106     0.3793  -8.992  < 2e-16 ***
  titan$PClass2nd                -0.7223     0.4540  -1.591    0.112
  titan$PClass3rd                -3.2014     0.3724  -8.598  < 2e-16 ***
  titan$Sexmale:titan$PClass2nd  -0.3461     0.5274  -0.656    0.512
  titan$Sexmale:titan$PClass3rd   1.8827     0.4283   4.396 1.10e-05 ***
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Null deviance: 1688.1 on 1312 degrees of freedom
  Residual deviance: 1155.9 on 1307 degrees of freedom
  AIC: 1167.9

  11. What variables can we consider dropping?
  > anova(log3,test="Chisq")
  Analysis of Deviance Table
  Response: titan$Survived
  Terms added sequentially (first to last)
                         Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
  NULL                                    1312     1688.1
  titan$Sex               1   332.53      1311     1355.5  < 2.2e-16 ***
  titan$PClass            2   156.48      1309     1199.0  < 2.2e-16 ***
  titan$Sex:titan$PClass  2    43.19      1307     1155.9  4.176e-10 ***
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Small p-values indicate that all variables are needed to explain the variation in y

  12. Goodness of fit statistics
  Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
  (Intercept)                     2.7006     0.3443   7.843 4.41e-15 ***
  titan$Sexmale                  -3.4106     0.3793  -8.992  < 2e-16 ***
  titan$PClass2nd                -0.7223     0.4540  -1.591    0.112
  titan$PClass3rd                -3.2014     0.3724  -8.598  < 2e-16 ***
  titan$Sexmale:titan$PClass2nd  -0.3461     0.5274  -0.656    0.512
  titan$Sexmale:titan$PClass3rd   1.8827     0.4283   4.396 1.10e-05 ***
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Null deviance: 1688.1 on 1312 degrees of freedom
  Residual deviance: 1155.9 on 1307 degrees of freedom
  AIC: 1167.9
  • -2 log L (the residual deviance)
  • AIC

  13. Binned residual plot
  > install.packages("arm")
  > library(arm)
  > x<-predict(log3)
  > y<-resid(log3)
  > binnedplot(x,y)
  • Plots the average residual against the average fitted (predicted) value for each bin, or category
  • Bins are based on the fitted values
  • 95% of all values should fall within the dotted lines

  14. Poisson Regression Using count data

  15. What is a Poisson distribution?
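The defining property used throughout the next slides is that a Poisson(λ) distribution has mean and variance both equal to λ. A minimal simulation sketch:

```r
# Draw from Poisson(4) and check that the sample mean and the sample
# variance are both close to lambda = 4.
set.seed(42)
draws <- rpois(1e5, lambda = 4)
m <- mean(draws)   # near 4
v <- var(draws)    # also near 4 -- mean equals variance for a Poisson
```

This equality is exactly what overdispersion (discussed later) violates: count data whose variance exceeds the mean.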

  16. Example: children ever born
  • The dataset has 70 rows representing group-level data on the number of children ever born to women in Fiji:
  • Number of children ever born (y)
  • Number of women in the group (n)
  • Duration of marriage: 1=0-4, 2=5-9, 3=10-14, 4=15-19, 5=20-24, 6=25-29
  • Residence: 1=Suva (capital city), 2=Urban, 3=Rural
  • Education: 1=none, 2=lower primary, 3=upper primary, 4=secondary+
  > ceb<-read.table("D:/RShortcourse/ceb.dat",header=T)

  17. Poisson regression in R
  • Need to account for the different population sizes in each area/group (unless data are from same-size populations), hence the offset:
  > ceb1<-glm(y ~ educ + res, offset=log(n), family = "poisson", data = ceb)
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept)  1.43029    0.01795  79.691   <2e-16 ***
  educnone     0.21462    0.02183   9.831   <2e-16 ***
  educsec+    -1.00900    0.05217 -19.342   <2e-16 ***
  educupper   -0.40485    0.02956 -13.696   <2e-16 ***
  resSuva     -0.05997    0.02819  -2.127   0.0334 *
  resurban     0.06204    0.02442   2.540   0.0111 *
  ---
  Null deviance: 3731.5 on 69 degrees of freedom
  Residual deviance: 2646.5 on 64 degrees of freedom
  AIC: Inf

  18. Assessing model fit
  1. Examine the AIC score – smaller is better
  2. Examine the deviance as an approximate goodness-of-fit test
  • Expect residual deviance/degrees of freedom to be approximately 1
  > ceb1$deviance/ceb1$df.residual
  [1] 41.35172
  3. Compare the residual deviance to a χ² distribution
  > pchisq(2646.5, 64, lower=F)
  [1] 0
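For contrast with the badly fitting model above (ratio ≈ 41), here is a minimal sketch on simulated data that really are Poisson, where the deviance/df ratio lands near 1 (all names hypothetical):

```r
# Fit a Poisson model to counts generated from a true Poisson model;
# the residual deviance per degree of freedom should be close to 1.
set.seed(7)
x <- runif(200)
y <- rpois(200, exp(0.5 + 1.0 * x))
fit   <- glm(y ~ x, family = poisson)
ratio <- fit$deviance / fit$df.residual          # near 1 for a good fit
pval  <- pchisq(fit$deviance, fit$df.residual,
                lower.tail = FALSE)              # not extreme here
```

A ratio far above 1, as with ceb1, signals that the Poisson variance assumption is violated.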

  19. Model fitting: analysis of deviance
  • As in logistic regression, we want to compare the difference in residual deviance between models
  > ceb1<-glm(y~educ, family="poisson", offset=log(n), data=ceb)
  > ceb2<-glm(y~educ+res, family="poisson", offset=log(n), data=ceb)
  > 1-pchisq(deviance(ceb1)-deviance(ceb2), df.residual(ceb1)-df.residual(ceb2))
  [1] 0.0007124383
  • Since the p-value is small, there is evidence that adding res explains a significant amount of additional deviance
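The same nested-model comparison can be sketched end to end on simulated data (all names hypothetical): the deviance drop between the smaller and larger model is referred to a χ² distribution with degrees of freedom equal to the number of added parameters.

```r
# Likelihood-ratio comparison of nested Poisson models.
set.seed(11)
x1 <- runif(300)
x2 <- rbinom(300, 1, 0.5)
y  <- rpois(300, exp(0.2 + 0.8 * x1 + 0.5 * x2))  # x2 truly matters
m1 <- glm(y ~ x1,      family = poisson)
m2 <- glm(y ~ x1 + x2, family = poisson)
pval <- 1 - pchisq(deviance(m1) - deviance(m2),
                   df.residual(m1) - df.residual(m2))
# small pval: adding x2 significantly reduces the deviance
```

anova(m1, m2, test = "Chisq") performs the same test and prints it as a table.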

  20. Overdispersion in Poisson models
  • A characteristic of the Poisson distribution is that its mean is equal to its variance
  • Sometimes the observed variance is greater than the mean – this is known as overdispersion
  • Another common problem with Poisson regression is excess zeros: there are more zeros in the data than a Poisson model would predict
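A quick simulation sketch of what overdispersed counts look like, using negative binomial draws (which have variance mu + mu²/size, larger than the mean):

```r
# Negative binomial counts with mu = 4, size = 1: true variance is
# 4 + 16 = 20, five times the mean -- clearly overdispersed relative
# to a Poisson, whose variance would equal the mean of 4.
set.seed(3)
over <- rnbinom(1e4, mu = 4, size = 1)
m <- mean(over)   # near 4
v <- var(over)    # far larger than the mean
```

Comparing the sample mean and variance of the raw counts like this is a rough first diagnostic before turning to quasipoisson or negative binomial models.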

  21. Overdispersion
  • Use family="quasipoisson" instead of "poisson" to estimate the dispersion parameter
  • This doesn't change the coefficient estimates, but it may change their standard errors
  > ceb2<-glm(y~educ+res, family="quasipoisson", offset=log(n), data=ceb)

  22. Poisson vs. quasipoisson
  Family = "poisson":
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept)  1.43029    0.01795  79.691   <2e-16 ***
  educnone     0.21462    0.02183   9.831   <2e-16 ***
  educsec+    -1.00900    0.05217 -19.342   <2e-16 ***
  educupper   -0.40485    0.02956 -13.696   <2e-16 ***
  resSuva     -0.05997    0.02819  -2.127   0.0334 *
  resurban     0.06204    0.02442   2.540   0.0111 *
  ---
  (Dispersion parameter for poisson family taken to be 1)
  Null deviance: 3731.5 on 69 degrees of freedom
  Residual deviance: 2646.5 on 64 degrees of freedom

  Family = "quasipoisson":
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  1.43029    0.10999  13.004  < 2e-16 ***
  educnone     0.21462    0.13378   1.604  0.11358
  educsec+    -1.00900    0.31968  -3.156  0.00244 **
  educupper   -0.40485    0.18115  -2.235  0.02892 *
  resSuva     -0.05997    0.17277  -0.347  0.72965
  resurban     0.06204    0.14966   0.415  0.67988
  ---
  (Dispersion parameter for quasipoisson family taken to be 37.55359)
  Null deviance: 3731.5 on 69 degrees of freedom
  Residual deviance: 2646.5 on 64 degrees of freedom

  23. Models for overdispersion
  • When overdispersion is a problem, use a negative binomial model
  • This adjusts both the β estimates and their standard errors
  > install.packages("MASS")
  > library(MASS)
  > ceb.nb <- glm.nb(y~educ+res+offset(log(n)), data=ceb)
  OR
  > ceb.nb<-glm.nb(ceb2)
  > summary(ceb.nb)

  24. NB model in R
  glm.nb(formula = ceb2, x = T, init.theta = 3.38722121141125, link = log)
  Coefficients:
               Estimate Std. Error z value Pr(>|z|)
  (Intercept)  1.490043   0.160589   9.279  < 2e-16 ***
  educnone     0.002317   0.183754   0.013  0.98994
  educsec+    -0.630343   0.200220  -3.148  0.00164 **
  educupper   -0.173138   0.184210  -0.940  0.34727
  resSuva     -0.149784   0.165622  -0.904  0.36580
  resurban     0.055610   0.165391   0.336  0.73670
  ---
  (Dispersion parameter for Negative Binomial(3.3872) family taken to be 1)
  Null deviance: 85.001 on 69 degrees of freedom
  Residual deviance: 71.955 on 64 degrees of freedom
  AIC: 740.55
  Theta: 3.387, Std. Err.: 0.583
  2 x log-likelihood: -726.555
  > ceb.nb$deviance/ceb.nb$df.residual
  [1] 1.124297

  25. What if your data looked like…

  26. Zero-inflated Poisson model (ZIP)
  • If you have a large number of 0 counts…
  > install.packages("pscl")
  > library(pscl)
  > ceb.zip <- zeroinfl(y~educ+res, offset=log(n), data=ceb)
