
Logistic Regression






Presentation Transcript


  1. Logistic Regression

  2. Example: Survival of Titanic passengers
  • We want to know if the probability of survival is higher among children
  • Outcome (y) = survived/not
  • Explanatory (x) = age at death
  > titan<-read.table("D:/RShortcourse/titanic.txt",sep=",",header=T,na.strings=".")
  > titan2<-na.omit(titan)
  > log1<-glm(titan2$Survived~titan2$Age,family=binomial("logit"))
  Since "logit" is the default link, you can simply use:
  > log1<-glm(titan2$Survived~titan2$Age,binomial)
  > summary(log1)

  3. Example in R
  Call:
  glm(formula = titan2$Survived ~ titan2$Age, family = binomial("logit"))
  Coefficients:
                Estimate Std. Error z value Pr(>|z|)
  (Intercept) -0.081428   0.173862  -0.468   0.6395
  titan2$Age  -0.008795   0.005232  -1.681   0.0928 .
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  (Dispersion parameter for binomial family taken to be 1)
      Null deviance: 1025.6 on 755 degrees of freedom
  Residual deviance: 1022.7 on 754 degrees of freedom
  AIC: 1026.7

  4. Plot
  > plot(fitted(log1)~titan2$Age, xlab="Age", ylab="P(Survival)", pch=15)

  5. Example: Categorical x
  • Sometimes, it's easier to interpret logistic regression output if the x variables are categorical
  • Suppose we categorize passenger age into 4 categories:
  > titan2$agecat4 <- ifelse(titan2$Age>35, 1, 0)
  > titan2$agecat3 <- ifelse(titan2$Age>=18 & titan2$Age<=35, 1, 0)
  > titan2$agecat2 <- ifelse(titan2$Age>=6 & titan2$Age<=17, 1, 0)
  > titan2$agecat1 <- ifelse(titan2$Age<6, 1, 0)
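An alternative to writing one ifelse() per dummy is cut(), which builds a factor in one step; a factor on the right-hand side of a glm() formula is expanded into dummies automatically, with the first level as the reference. A minimal sketch on hypothetical ages (variable names here are illustrative, not from the Titanic data):

```r
# Categorize a numeric age vector with cut(); right-closed intervals
# (-Inf,5], (5,17], (17,35], (35,Inf] match the <6 / 6-17 / 18-35 / >35
# grouping for whole-number ages.
age <- c(2, 10, 25, 40, 70, 5, 30)
agecat <- cut(age, breaks = c(-Inf, 5, 17, 35, Inf),
              labels = c("under6", "6to17", "18to35", "over35"))
table(agecat)  # counts per category; glm(y ~ agecat, binomial) would
               # then create the dummy variables for you
```

Using relevel() on the factor lets you pick which category serves as the reference, instead of deciding which dummy to leave out by hand.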

  6. Example in R
  > log2<-glm(titan2$Survived~titan2$agecat2 + titan2$agecat3 + titan2$agecat4, binomial)
  > summary(log2)
  • Remember: with a set of dummy variables, you always put in one less variable than the number of categories (the omitted category is the reference)

  7. Example in R
  Call:
  glm(formula = titan2$Survived ~ titan2$agecat1 + titan2$agecat2 + titan2$agecat3, family = binomial)
  Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
  (Intercept)     -0.2924     0.1284  -2.278 0.022734 *
  titan2$agecat1   1.5452     0.4209   3.671 0.000242 ***
  titan2$agecat2   0.2924     0.2883   1.014 0.310573
  titan2$agecat3  -0.2758     0.1643  -1.679 0.093172 .
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Null deviance: 1025.57 on 755 degrees of freedom
  Residual deviance: 999.07 on 752 degrees of freedom
  AIC: 1007.1
  • Note that this fit includes the dummies for the three younger groups, so the reference category is passengers >35 years

  8. Interpretation: odds ratios
  > exp(cbind(OR=coef(log2),confint(log2)))
  Waiting for profiling to be done...
                        OR      2.5 %    97.5 %
  (Intercept)    3.5000000 1.67176790 8.2301627
  titan2$agecat2 0.2857143 0.10674004 0.7040320
  titan2$agecat3 0.1618685 0.06737731 0.3485255
  titan2$agecat4 0.2132797 0.08772147 0.4663720
  • This tells us that:
  • The reference category is passengers <6 years (agecat1), whose estimated odds of survival are 3.5
  • All three older age groups have odds ratios below 1, with 95% confidence intervals that exclude 1, so each has significantly lower odds of survival than passengers <6 years
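The OR column above is simply exp() applied to the model coefficients. A minimal sketch on simulated data (all names hypothetical), showing that exponentiating a logistic-regression coefficient recovers the odds ratio for that predictor:

```r
# Simulate a binary outcome whose true log-odds depend on a binary x,
# then check that exp(coef) estimates the odds ratio.
set.seed(1)
x <- rbinom(500, 1, 0.5)        # binary predictor
p <- plogis(-0.5 + 1.2 * x)     # true model: log-odds = -0.5 + 1.2*x
y <- rbinom(500, 1, p)
fit <- glm(y ~ x, family = binomial)
or  <- unname(exp(coef(fit)["x"]))  # estimated odds ratio; true value exp(1.2)
```

The same exp() transform applied to confint() endpoints, as on the slide, gives the confidence interval on the odds-ratio scale.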

  9. Now let’s make it more complicated
  • Do sex and passenger class predict survival?
  • Let’s try a saturated model:
  > log3<-glm(titan$Survived~titan$Sex + titan$PClass + titan$Sex:titan$PClass, binomial)
  > summary(log3)
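As a side note, R's formula language offers a shorthand for this pattern: a*b expands to a + b + a:b. A minimal sketch on simulated factors (names hypothetical) confirming the two spellings fit the same model:

```r
# Two equivalent ways to write main effects plus their interaction.
set.seed(5)
sex    <- factor(sample(c("female", "male"), 200, replace = TRUE))
pclass <- factor(sample(c("1st", "2nd", "3rd"), 200, replace = TRUE))
y      <- rbinom(200, 1, 0.5)                      # placeholder outcome
m_long  <- glm(y ~ sex + pclass + sex:pclass, family = binomial)
m_short <- glm(y ~ sex * pclass, family = binomial)
all.equal(coef(m_long), coef(m_short))             # same coefficients
```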

  10. Example in R
  Call:
  glm(formula = titan$Survived ~ titan$Sex + titan$PClass + titan$Sex:titan$PClass, family = binomial)
  Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
  (Intercept)                     2.7006     0.3443   7.843 4.41e-15 ***
  titan$Sexmale                  -3.4106     0.3793  -8.992  < 2e-16 ***
  titan$PClass2nd                -0.7223     0.4540  -1.591    0.112
  titan$PClass3rd                -3.2014     0.3724  -8.598  < 2e-16 ***
  titan$Sexmale:titan$PClass2nd  -0.3461     0.5274  -0.656    0.512
  titan$Sexmale:titan$PClass3rd   1.8827     0.4283   4.396 1.10e-05 ***
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Null deviance: 1688.1 on 1312 degrees of freedom
  Residual deviance: 1155.9 on 1307 degrees of freedom
  AIC: 1167.9

  11. What variables can we consider dropping?
  > anova(log3,test="Chisq")
  Analysis of Deviance Table
  Response: titan$Survived
  Terms added sequentially (first to last)
                         Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
  NULL                                    1312     1688.1
  titan$Sex               1   332.53      1311     1355.5  < 2.2e-16 ***
  titan$PClass            2   156.48      1309     1199.0  < 2.2e-16 ***
  titan$Sex:titan$PClass  2    43.19      1307     1155.9  4.176e-10 ***
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Small p-values indicate that all variables are needed to explain the variation in y

  12. Goodness of fit statistics
  Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
  (Intercept)                     2.7006     0.3443   7.843 4.41e-15 ***
  titan$Sexmale                  -3.4106     0.3793  -8.992  < 2e-16 ***
  titan$PClass2nd                -0.7223     0.4540  -1.591    0.112
  titan$PClass3rd                -3.2014     0.3724  -8.598  < 2e-16 ***
  titan$Sexmale:titan$PClass2nd  -0.3461     0.5274  -0.656    0.512
  titan$Sexmale:titan$PClass3rd   1.8827     0.4283   4.396 1.10e-05 ***
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Null deviance: 1688.1 on 1312 degrees of freedom
  Residual deviance: 1155.9 on 1307 degrees of freedom
  AIC: 1167.9
  • -2 log L (the residual deviance)
  • AIC

  13. Binned residual plot
  > install.packages("arm")
  > library(arm)
  > x<-predict(log3)
  > y<-resid(log3)
  > binnedplot(x,y)
  • Plots the average residual against the average fitted (predicted) value for each bin, or category
  • Bins are based on the fitted values
  • 95% of all values should fall within the dotted lines

  14. Poisson Regression Using count data

  15. What is a Poisson distribution?
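The defining property used throughout the next slides is that a Poisson(λ) distribution has mean and variance both equal to λ. A minimal simulation sketch:

```r
# Draw from Poisson(4) and check that the sample mean and the sample
# variance are both close to lambda = 4.
set.seed(42)
draws <- rpois(1e5, lambda = 4)
m <- mean(draws)   # near 4
v <- var(draws)    # also near 4 -- mean equals variance for a Poisson
```

This equality is exactly what overdispersion (discussed later) violates: count data whose variance exceeds the mean.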

  16. Example: children ever born
  • The dataset has 70 rows representing group-level data on the number of children ever born to women in Fiji:
  • Number of children ever born (y)
  • Number of women in the group (n)
  • Duration of marriage: 1=0-4, 2=5-9, 3=10-14, 4=15-19, 5=20-24, 6=25-29
  • Residence: 1=Suva (capital city), 2=Urban, 3=Rural
  • Education: 1=none, 2=lower primary, 3=upper primary, 4=secondary+
  > ceb<-read.table("D:/RShortcourse/ceb.dat",header=T)

  17. Poisson regression in R
  • Need to account for the different population sizes in each area/group (unless data are from same-size populations), hence the offset:
  > ceb1<-glm(y ~ educ + res, offset=log(n), family = "poisson", data = ceb)
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept)  1.43029    0.01795  79.691   <2e-16 ***
  educnone     0.21462    0.02183   9.831   <2e-16 ***
  educsec+    -1.00900    0.05217 -19.342   <2e-16 ***
  educupper   -0.40485    0.02956 -13.696   <2e-16 ***
  resSuva     -0.05997    0.02819  -2.127   0.0334 *
  resurban     0.06204    0.02442   2.540   0.0111 *
  ---
  Null deviance: 3731.5 on 69 degrees of freedom
  Residual deviance: 2646.5 on 64 degrees of freedom
  AIC: Inf

  18. Assessing model fit
  1. Examine the AIC score – smaller is better
  2. Examine the deviance as an approximate goodness-of-fit test
  • Expect residual deviance/degrees of freedom to be approximately 1
  > ceb1$deviance/ceb1$df.residual
  [1] 41.35172
  3. Compare the residual deviance to a χ² distribution
  > pchisq(2646.5, 64, lower=F)
  [1] 0
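For contrast with the badly fitting model above (ratio ≈ 41), here is a minimal sketch on simulated data that really are Poisson, where the deviance/df ratio lands near 1 (all names hypothetical):

```r
# Fit a Poisson model to counts generated from a true Poisson model;
# the residual deviance per degree of freedom should be close to 1.
set.seed(7)
x <- runif(200)
y <- rpois(200, exp(0.5 + 1.0 * x))
fit   <- glm(y ~ x, family = poisson)
ratio <- fit$deviance / fit$df.residual          # near 1 for a good fit
pval  <- pchisq(fit$deviance, fit$df.residual,
                lower.tail = FALSE)              # not extreme here
```

A ratio far above 1, as with ceb1, signals that the Poisson variance assumption is violated.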

  19. Model fitting: analysis of deviance
  • As in logistic regression, we want to compare the difference in residual deviance between models
  > ceb1<-glm(y~educ, family="poisson", offset=log(n), data=ceb)
  > ceb2<-glm(y~educ+res, family="poisson", offset=log(n), data=ceb)
  > 1-pchisq(deviance(ceb1)-deviance(ceb2), df.residual(ceb1)-df.residual(ceb2))
  [1] 0.0007124383
  • Since the p-value is small, there is evidence that adding res explains a significant amount of additional deviance
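The same nested-model comparison can be sketched end to end on simulated data (all names hypothetical): the deviance drop between the smaller and larger model is referred to a χ² distribution with degrees of freedom equal to the number of added parameters.

```r
# Likelihood-ratio comparison of nested Poisson models.
set.seed(11)
x1 <- runif(300)
x2 <- rbinom(300, 1, 0.5)
y  <- rpois(300, exp(0.2 + 0.8 * x1 + 0.5 * x2))  # x2 truly matters
m1 <- glm(y ~ x1,      family = poisson)
m2 <- glm(y ~ x1 + x2, family = poisson)
pval <- 1 - pchisq(deviance(m1) - deviance(m2),
                   df.residual(m1) - df.residual(m2))
# small pval: adding x2 significantly reduces the deviance
```

anova(m1, m2, test = "Chisq") performs the same test and prints it as a table.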

  20. Overdispersion in Poisson models
  • A characteristic of the Poisson distribution is that its mean is equal to its variance
  • Sometimes the observed variance is greater than the mean – this is known as overdispersion
  • Another common problem with Poisson regression is excess zeros: there are more zeros in the data than a Poisson model would predict
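A quick simulation sketch of what overdispersed counts look like, using negative binomial draws (which have variance mu + mu²/size, larger than the mean):

```r
# Negative binomial counts with mu = 4, size = 1: true variance is
# 4 + 16 = 20, five times the mean -- clearly overdispersed relative
# to a Poisson, whose variance would equal the mean of 4.
set.seed(3)
over <- rnbinom(1e4, mu = 4, size = 1)
m <- mean(over)   # near 4
v <- var(over)    # far larger than the mean
```

Comparing the sample mean and variance of the raw counts like this is a rough first diagnostic before turning to quasipoisson or negative binomial models.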

  21. Overdispersion
  • Use family="quasipoisson" instead of "poisson" to estimate the dispersion parameter
  • This doesn't change the coefficient estimates, but it may change their standard errors
  > ceb2<-glm(y~educ+res, family="quasipoisson", offset=log(n), data=ceb)

  22. Poisson vs. quasipoisson
  Family = "poisson":
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept)  1.43029    0.01795  79.691   <2e-16 ***
  educnone     0.21462    0.02183   9.831   <2e-16 ***
  educsec+    -1.00900    0.05217 -19.342   <2e-16 ***
  educupper   -0.40485    0.02956 -13.696   <2e-16 ***
  resSuva     -0.05997    0.02819  -2.127   0.0334 *
  resurban     0.06204    0.02442   2.540   0.0111 *
  ---
  (Dispersion parameter for poisson family taken to be 1)
  Null deviance: 3731.5 on 69 degrees of freedom
  Residual deviance: 2646.5 on 64 degrees of freedom

  Family = "quasipoisson":
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  1.43029    0.10999  13.004  < 2e-16 ***
  educnone     0.21462    0.13378   1.604  0.11358
  educsec+    -1.00900    0.31968  -3.156  0.00244 **
  educupper   -0.40485    0.18115  -2.235  0.02892 *
  resSuva     -0.05997    0.17277  -0.347  0.72965
  resurban     0.06204    0.14966   0.415  0.67988
  ---
  (Dispersion parameter for quasipoisson family taken to be 37.55359)
  Null deviance: 3731.5 on 69 degrees of freedom
  Residual deviance: 2646.5 on 64 degrees of freedom

  23. Models for overdispersion
  • When overdispersion is a problem, use a negative binomial model
  • This adjusts both the β estimates and their standard errors
  > install.packages("MASS")
  > library(MASS)
  > ceb.nb <- glm.nb(y~educ+res+offset(log(n)), data=ceb)
  OR
  > ceb.nb<-glm.nb(ceb2)
  > summary(ceb.nb)

  24. NB model in R
  glm.nb(formula = ceb2, x = T, init.theta = 3.38722121141125, link = log)
  Coefficients:
               Estimate Std. Error z value Pr(>|z|)
  (Intercept)  1.490043   0.160589   9.279  < 2e-16 ***
  educnone     0.002317   0.183754   0.013  0.98994
  educsec+    -0.630343   0.200220  -3.148  0.00164 **
  educupper   -0.173138   0.184210  -0.940  0.34727
  resSuva     -0.149784   0.165622  -0.904  0.36580
  resurban     0.055610   0.165391   0.336  0.73670
  ---
  (Dispersion parameter for Negative Binomial(3.3872) family taken to be 1)
  Null deviance: 85.001 on 69 degrees of freedom
  Residual deviance: 71.955 on 64 degrees of freedom
  AIC: 740.55
  Theta: 3.387, Std. Err.: 0.583
  2 x log-likelihood: -726.555
  > ceb.nb$deviance/ceb.nb$df.residual
  [1] 1.124297

  25. What if your data looked like…

  26. Zero-inflated Poisson model (ZIP)
  • If you have a large number of 0 counts…
  > install.packages("pscl")
  > library(pscl)
  > ceb.zip <- zeroinfl(y~educ+res, offset=log(n), data=ceb)
