Logistic Regression- Dichotomous Dependent Variables

March 21 & 23, 2011

By the end of this meeting, participants should be able to:

- Explain why OLS regression is inappropriate for dichotomous dependent variables.
- List the assumptions of a logistic regression model.
- Estimate a logistic regression model in R.
- Interpret the results of a logistic regression model.

- With a dichotomous (zero/one or similar) dependent variable, the assumptions of ordinary least squares (OLS) regression are violated.
- OLS assumes a linear relationship between the dependent variable and the independent variables, which cannot hold when the dependent variable has only two categories (more of a conceptual than a technical issue).

- OLS assumes normally distributed error terms, which cannot be true with a dichotomous dependent variable (not a very difficult problem).
- OLS assumes that the variance of the error terms is constant, which cannot be true for dichotomous dependent variables. This makes hypothesis testing very difficult (a major problem).
- More intuitively, OLS regression with a dichotomous dependent variable will frequently predict values outside the actual range of the dependent variable (below 0 or above 1).
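
The out-of-range problem can be seen in a small simulation. This is a minimal sketch on made-up data, not the class data set: the predictor `x`, the slope of 1.5, and the sample size are all assumptions chosen for illustration.

```r
# Simulated data (hypothetical): a 0/1 outcome driven by one predictor.
set.seed(42)
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(1.5 * x))

# OLS fitted directly to the 0/1 outcome.
ols.model <- lm(y ~ x)
ols.pred <- fitted(ols.model)

# Logistic regression fitted to the same data.
logit.model <- glm(y ~ x, family = binomial(link = "logit"))
logit.pred <- fitted(logit.model)

# OLS predictions are not bounded by 0 and 1; logit predictions are.
range(ols.pred)
range(logit.pred)
```

Comparing the two ranges shows the straight line running past 0 and 1 at the extremes of `x`, while the logit predictions stay inside the interval.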

- OLS is computed with a single closed-form formula
- It is a straight-line (linear) model
- It fits a line to the observed points
- The fit will not be perfect, but least squares finds the line with the smallest possible sum of squared differences between the observed and predicted values

- That method cannot work for dichotomous dependent variables

- The way to estimate a regression with a dichotomous dependent variable is through a procedure known as maximum likelihood.
- Maximum likelihood is an iterative process based on probability theory that, in practice, requires a computer.

- Instead of fitting a single line in one step, maximum likelihood is a guided trial-and-error process: a set of coefficients is chosen and the likelihood of the observed data under those coefficients is computed.
- From that starting point, the coefficients are adjusted and the likelihood is recomputed, each time trying to make the observed values of the dependent variable more probable.
- When no adjustment improves the likelihood any further, the process ends (the model has converged).
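
The search described above can be sketched by hand. The example below is an illustration under assumptions, not the exact algorithm `glm()` uses: it writes out the logit log-likelihood for simulated data and lets the general-purpose optimizer `optim()` adjust the coefficients from a starting guess of zero. The data, the true coefficients (-0.5 and 1.2), and the names `neg.log.lik`, `ml.fit`, and `glm.model` are all made up for this sketch.

```r
# Simulated data (hypothetical): one predictor and a 0/1 outcome.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of a logit model for coefficients beta = c(a, b).
# For each case the log-likelihood is y * eta - log(1 + exp(eta)).
neg.log.lik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# Start from a = 0, b = 0 and let the optimizer improve the likelihood step by step.
ml.fit <- optim(par = c(0, 0), fn = neg.log.lik, method = "BFGS")

# The built-in routine, for comparison.
glm.model <- glm(y ~ x, family = binomial(link = "logit"))

ml.fit$par        # coefficients found by the hand-written search
coef(glm.model)   # coefficients reported by glm()
glm.model$iter    # number of iterations glm() needed
```

The two sets of coefficients should agree closely, which is the point: `glm()` is doing the same kind of iterative search, just with a more specialized routine.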

- Logistic regression (logit) is the most commonly used of all of the maximum likelihood estimation methods. The other primary method for estimating models with dichotomous dependent variables is probit. Generally, probit results are comparable to logit results.
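
To see how comparable the two methods are, both can be fitted to the same data by changing only the `link` argument. This is a sketch on simulated data; the sample size and the coefficients used to generate the outcome are assumptions.

```r
# Simulated data (hypothetical).
set.seed(3)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.2 + 1.0 * x))

# Same model, two link functions.
logit.fit <- glm(y ~ x, family = binomial(link = "logit"))
probit.fit <- glm(y ~ x, family = binomial(link = "probit"))

# The coefficients are on different scales (logit coefficients are larger)...
coef(logit.fit)
coef(probit.fit)

# ...but the predicted probabilities are very close.
max.gap <- max(abs(fitted(logit.fit) - fitted(probit.fit)))
max.gap
```

Because the coefficient scales differ, logit and probit coefficients should not be compared directly; compare signs, significance, and predicted probabilities instead.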

- Logistic regression does not assume a linear relationship between the dependent and independent variables (the model is linear only in the log of the odds).
- The dependent variable needs to be binary and coded in a meaningful way. It is standard to code the category of interest as the higher value (for example: voter, Democrat, etc.).
- All the relevant variables need to be included in the model.
- Irrelevant variables should be left out of the model.
- Error terms need to be independent.
- There should be little measurement error in the predictor variables.

- Independent variables should not be highly correlated with each other (multicollinearity). This problem will likely present as high standard errors.
- There should be no major outliers on the independent variables.
- Samples need to be relatively large (such as 10 cases per predictor). If the samples are too small, standard errors can be very large and in some cases very large coefficients will also occur.

- At first glance, the formula for logit is not all that different from that for OLS: $y_i^* = a + b_1 x_{1i} + b_2 x_{2i} + e_i$
- Where Y* is the (transformed) dependent variable, X1 and X2 are the independent variables, a is the constant, b1 and b2 are the coefficients, and e is the error term.
- It is important to note that Y is modeled as a probability rather than as a strict value
- Also key: Y* is a transformation of Y, so you are modeling the probability indirectly. (Technical: Y* is the log of the odds.)
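
The log-of-odds transformation can be checked directly in R. The probabilities below are arbitrary illustrative values; `qlogis()` and `plogis()` are the base R functions for the logit transformation and its inverse.

```r
# A few illustrative probabilities.
p <- c(0.1, 0.25, 0.5, 0.75, 0.9)

# Odds, then log of the odds (this is Y*).
odds <- p / (1 - p)
log.odds <- log(odds)

# qlogis() computes the same transformation in one step.
all.equal(log.odds, qlogis(p))

# plogis() reverses it, turning log odds back into probabilities.
p.back <- plogis(log.odds)
p.back
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0; probabilities below 0.5 give negative log odds and probabilities above 0.5 give positive log odds.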

- The b values play the same role as in OLS, but the intuition behind them is somewhat different.

- a can be thought of as a shift parameter: it shifts the curve to the left or the right.
- a<0 shifts the curve to the right (for a positive b1)
- a>0 shifts the curve to the left (for a positive b1)

- The value of b1 can be thought of as the stretch parameter: it stretches or shrinks the curve. Larger absolute values make the curve steeper.
- The sign of b1 can be thought of as the direction parameter: it determines whether the curve rises or falls.
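
The roles of a and b1 can be made concrete with a one-predictor curve. In this sketch the helper names `logit.curve` and `midpoint` and the coefficient values are hypothetical; the midpoint is the value of x at which the predicted probability equals 0.5, which is -a/b1.

```r
# Predicted probability for a one-predictor logit model.
logit.curve <- function(x, a, b1) plogis(a + b1 * x)

# The x value where the predicted probability is exactly 0.5.
midpoint <- function(a, b1) -a / b1

# Shift: with b1 = 1, a negative a moves the midpoint right, a positive a moves it left.
midpoint(a = -2, b1 = 1)
midpoint(a = 2, b1 = 1)

# Stretch: a larger b1 gives a steeper curve around the midpoint.
x <- seq(-5, 5, by = 1)
steep <- logit.curve(x, a = 0, b1 = 2)
flat <- logit.curve(x, a = 0, b1 = 0.5)

# Direction: a negative b1 makes the curve fall as x increases.
falling <- logit.curve(x, a = 0, b1 = -1)
```

Plotting `steep`, `flat`, and `falling` against `x` (for example with `plot(x, steep, type = "l")`) shows the three shapes side by side.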

- Logit coefficients cannot be interpreted in the same way as in OLS. Since the relationship described is not a linear one, it cannot be said that a unit change in the independent variable leads to a fixed amount of change in the dependent variable.
- Logit coefficients can tell you the direction of the relationship between the dependent and independent variable and whether it is statistically significant, and they give you a general sense of the magnitude.
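
A quick calculation shows why a unit change has no single effect on the probability. The coefficients below (a = 0, b1 = 1) are hypothetical values picked for illustration.

```r
# Hypothetical coefficients.
a <- 0
b1 <- 1

# Predicted probability at a given x.
prob.at <- function(x) plogis(a + b1 * x)

# The same one-unit increase in x, at two different starting points.
change.tail <- prob.at(-2) - prob.at(-3)  # far from the middle of the curve
change.mid <- prob.at(1) - prob.at(0)     # near the middle of the curve

change.tail
change.mid
```

The one-unit change near the middle of the curve moves the probability much more than the same change out in the tail, even though the coefficient is identical in both cases.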

- Statistical significance in logit is based on whether the effect of the independent variable on the dependent variable is statistically different from zero. (Similar to OLS.)
- Since logit coefficients lack the direct interpretation of OLS coefficients, many people prefer to use odds ratios instead.
- Odds ratios show the effect of the independent variable on the odds of the dependent variable occurring.
- Values greater than 1 mean that the predictor makes the dependent variable more likely to occur.
- Values less than 1 mean that the predictor makes the dependent variable less likely to occur.
- Example: Kentucky odds of winning pre & post Kansas

- Coefficients in logit are the effect of the predictor on the log of the odds (for the dependent variable).
- Odds ratios remove the log component of the coefficient (by exponentiating it) and give the effect of the predictor on the odds of the dependent variable occurring.
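
The link between a coefficient and its odds ratio can be verified numerically. The coefficients here (a = -1, b = 0.4) are hypothetical; the helper `odds.at` is defined only for this sketch.

```r
# Hypothetical logit coefficients.
a <- -1
b <- 0.4

# Odds of the outcome at a given value of x.
odds.at <- function(x) {
  p <- plogis(a + b * x)
  p / (1 - p)
}

# A one-unit increase in x multiplies the odds by the same factor wherever it starts.
odds.at(1) / odds.at(0)
odds.at(3) / odds.at(2)

# That factor is the odds ratio: the exponentiated coefficient.
odds.ratio <- exp(b)
odds.ratio

# Expressed as a percent change in the odds.
(odds.ratio - 1) * 100
```

Unlike the change in probability, the multiplicative change in the odds is constant across values of x, which is what makes odds ratios convenient to report.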

- Unlike linear regression, there is no intuitive equivalent of the R² statistic for logit models.
- The desire to create a comparable measure has led to a variety of so-called pseudo-R² measures.
- In general, the findings are that these pseudo-R² measures perform poorly, so R does not report them in its standard model summary.
- If you do calculate pseudo-R² values, you can report them to give a general sense of the fit, but they do not have the same direct interpretation as R² in OLS.
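
For readers who do want a pseudo-R², one common choice is McFadden's measure, which compares the log-likelihood of the fitted model with that of an intercept-only model. This is one of several pseudo-R² measures, chosen here only as an example; the data are simulated and the names `full.model`, `null.model`, and `mcfadden.r2` are made up for the sketch.

```r
# Simulated data (hypothetical).
set.seed(7)
n <- 400
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.3 + 0.8 * x))

# The model of interest and an intercept-only (null) model.
full.model <- glm(y ~ x, family = binomial(link = "logit"))
null.model <- glm(y ~ 1, family = binomial(link = "logit"))

# McFadden's pseudo-R-squared: 1 minus the ratio of the two log-likelihoods.
mcfadden.r2 <- 1 - as.numeric(logLik(full.model)) / as.numeric(logLik(null.model))
mcfadden.r2
```

The value falls between 0 and 1, but it should not be read as a share of variance explained.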

- Data from the 2003 Carolina poll (N=423)
- The dependent variable is a measure of whether the person thinks the country is on the right track (0) or the wrong track (1)
- The predictors are:
- Evaluation of Bush: (1) Excellent to (4) Poor
- Party: (1) Democrat, (2) Independent, (3) Republican
- Ideology: (1) Very Liberal to (5) Very Conservative

- What do the results tell us about the relationship between evaluations of Bush and whether or not a person thinks the country is on the wrong track? How sure are we of this result?
- What do the results tell us about the relationship between partisanship and whether or not a person thinks the country is on the wrong track? How sure are we of this result?
- What do the results tell us about the relationship between ideology and whether or not a person thinks the country is on the wrong track? How sure are we of this result?

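# The foreign package reads SPSS data files
library(foreign)

# Load the class data as a data frame, keeping numeric codes instead of value labels
ps.are<-read.spss('http://j.mp/classdata',
                  use.value.labels=FALSE,to.data.frame=TRUE)

# Dependent variable: 1 if po_4 equals 1, else 0 (missing values stay missing)
ps.are$voted<-as.numeric(ps.are$po_4==1)

# Partisan strength: distance of po_party from 4
ps.are$strength<-abs(ps.are$po_party-4)

# Logistic regression of voting on income and partisan strength
logit.model<-glm(voted~dm_income+strength,
                 data=ps.are, family=binomial(link="logit"))

# Odds ratios are the exponentiated coefficients
odds.ratios<-exp(logit.model$coefficients)

# Coefficients (log odds), standard errors, and significance tests
summary(logit.model)

# Odds ratios, then the percent change in the odds for a one-unit increase
odds.ratios

(odds.ratios-1)*100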

- What do the results tell us about the relationship between partisan strength and whether or not a person voted in 2004? How sure are we of this result?
- What do the results tell us about the relationship between income and whether or not a person voted in 2004? How sure are we of this result?
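
One way to approach the "how sure are we" questions is to put a confidence interval around each odds ratio. The sketch below uses simulated data, not the class data set: the variables `income`, `strength`, and `voted` are stand-ins, and the coefficients used to generate them are invented. It uses Wald intervals from `confint.default()`; other interval methods exist.

```r
# Simulated stand-in data (hypothetical; not the class data).
set.seed(2004)
n <- 600
income <- sample(1:8, n, replace = TRUE)
strength <- sample(0:3, n, replace = TRUE)
voted <- rbinom(n, size = 1, prob = plogis(-1 + 0.2 * income + 0.5 * strength))
sim.data <- data.frame(voted, income, strength)

# Same model form as in the class example.
sim.model <- glm(voted ~ income + strength, data = sim.data,
                 family = binomial(link = "logit"))

# Odds ratios with 95% Wald confidence intervals (exponentiated coefficient intervals).
or.table <- cbind(odds.ratio = exp(coef(sim.model)),
                  exp(confint.default(sim.model)))
or.table
```

An interval that excludes 1 corresponds to a coefficient that is statistically different from zero at roughly the 5 percent level; a wide interval signals that the size of the effect is uncertain even if its direction is clear.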

- If your dependent variable is continuous or has a range of 4 or more values, use OLS regression. It is the simplest method and its findings are the most intuitive. Even when some of the assumptions are violated, OLS tends to be a very robust method.
- If your dependent variable is dichotomous, use logit. It is the simplest and most common of the maximum likelihood estimation methods.

- If your dependent variable is a short ordered scale (i.e., fewer than 4 but more than 2 categories), try to reduce the number of categories to 2 or increase it to 4 or more:
- Drop DK/NA responses
- Drop middle categories (unless they are the categories of interest)
- Combine multiple categories
- Split the data into two parts and run separate analyses
- Create a scale to increase the range of the dependent variable
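
Several of these recoding options take only a line or two in R. The survey item below is hypothetical: the codes (1 = agree, 2 = neutral, 3 = disagree, 8 = don't know, 9 = no answer) and the responses are invented for illustration.

```r
# Hypothetical three-category item with DK/NA codes.
response <- c(1, 2, 3, 8, 1, 3, 9, 2, 1, 3)

# Drop DK/NA responses by setting them to missing.
response[response %in% c(8, 9)] <- NA

# Option 1: drop the middle category, coding disagree as 1 and agree as 0.
binary <- ifelse(response == 2, NA_real_, as.numeric(response == 3))

# Option 2: combine categories, coding neutral or disagree as 1 and agree as 0.
combined <- as.numeric(response >= 2)

table(binary, useNA = "ifany")
table(combined, useNA = "ifany")
```

Either recoded variable can then serve as the dependent variable in a logit model. Which option is appropriate depends on whether the middle category is substantively interesting.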

- Other circumstances: ordered logit (beyond the scope of this course).

Turn in your preliminary data analysis (one copy per group).

Read WKB chapter 15.

Based on your reading of chapter 15, what insight did you find most relevant for your final paper? (Turn in individually.)