
Logistic Regression



  1. STAT E-150 Statistical Methods: Logistic Regression

  2. So far we have considered regression analyses where the response variables are quantitative. What if this is not the case? If a response variable is categorical, a different regression model applies, called logistic regression.

  3. A categorical variable which has only two possible values is called a binary variable. We can represent the two outcomes as 1 for the presence of some condition ("success") and 0 for its absence ("failure"). The logistic regression model describes how the probability of "success" is related to the values of the explanatory variables, which can be categorical or quantitative.

  4. Logistic regression models work with odds rather than proportions. The odds are just the ratio of the proportions for the two possible outcomes: if π is the proportion for one outcome, then 1 − π is the proportion for the second outcome. The odds of the first outcome occurring are odds = π / (1 − π).

  5. Here's an example: Suppose that a coin is weighted so that heads are more likely than tails, with P(heads) = .6. Then P(tails) = 1 − P(heads) = 1 − .6 = .4. The odds of getting heads in a toss of this coin are .6 / .4 = 1.5. The odds of getting tails are .4 / .6 ≈ .67. The odds ratio is 1.5 / (2/3) = 2.25. This tells us that the odds of getting "heads" are 2.25 times the odds of getting "tails".
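A quick Python sketch to verify the arithmetic above:

```python
# Odds for the weighted coin: P(heads) = .6, P(tails) = .4.
p_heads = 0.6
odds_heads = p_heads / (1 - p_heads)        # .6 / .4 = 1.5
odds_tails = (1 - p_heads) / p_heads        # .4 / .6 ≈ .67
odds_ratio = odds_heads / odds_tails        # 1.5 / (2/3) = 2.25

print(odds_heads, odds_tails, odds_ratio)   # 1.5 0.666... 2.25
```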

  6. You can also convert the odds of an event back to the probability of the event: for an event A, P(A) = odds / (1 + odds). For example, if the odds of a horse winning are 9 to 1, then the probability of the horse winning is 9 / (1 + 9) = .9.
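The same conversion as a small Python helper:

```python
# P(A) = odds / (1 + odds)
def odds_to_prob(odds):
    return odds / (1 + odds)

print(odds_to_prob(9))    # 0.9  -- the horse with 9-to-1 odds of winning
print(odds_to_prob(1.5))  # 0.6  -- the weighted coin's odds of heads
```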

  7. The Logistic Regression Model. The relationship between the probability of success and a single quantitative predictor variable follows an S-shaped curve. Here is a plot of p vs. x for different logistic regression models: The points on the curve represent P(Y=1) for each value of x. The associated model is the logistic or logit model: log(π / (1 − π)) = β0 + β1x.

  8. The general logistic regression model is log(π / (1 − π)) = β0 + β1x1 + β2x2 + ... + βkxk, where E(Y) = π, the probability of success. The xi are independent quantitative or qualitative variables.

  9. Odds and log(odds) • Let π = P(Y = 1) be a probability with 0 < π < 1. Then the odds that Y = 1 is the ratio odds = π / (1 − π), and so log(odds) = log(π / (1 − π)).

  10. This transformation from π to log(odds) is called the logistic or logit transformation. • The relationship is one-to-one: • For every value of π (except for 0 and 1) there is one and only one value of log(π / (1 − π)).

  11. The log(odds) can have any value from -∞ to ∞, and so we can use a linear predictor. • That is, we can model the log odds as a linear function of the explanatory variable: • log(odds) = β0 + β1x • (To verify this, solve the probability form for π / (1 − π) and then take the log of both sides.)

  12. For any fixed value of the predictor x, there are four probabilities: the true probability π, the observed proportion p, and the fitted value of each from the model. If the model is exactly correct, then p estimates π and the two fitted values estimate the same number.

  13. To go from log(odds) to odds, use the exponential function e^x: 1. odds = e^log(odds). 2. You can check that if odds = π / (1 − π), then you can solve for π to find that π = odds / (1 + odds). 3. Since log(odds) = β0 + β1x, we have the result π = e^log(odds) / (1 + e^log(odds)) = e^(β0 + β1x) / (1 + e^(β0 + β1x)).

  14. The Logistic Regression Model. The logistic regression model for the probability of success of a binary response variable based on a single predictor x is: Logit form: log(π / (1 − π)) = β0 + β1x. Probability form: π = e^(β0 + β1x) / (1 + e^(β0 + β1x)).
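Both forms are easy to express in Python; this sketch uses arbitrary illustrative coefficients, since no particular data set is in play yet:

```python
import math

# Logit form: log(pi / (1 - pi)) = b0 + b1*x
def logit(pi):
    return math.log(pi / (1 - pi))

# Probability form: pi = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
def prob_of_success(x, b0, b1):
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# Round trip: with b0 = -2 and b1 = .05, x = 40 gives log(odds) = 0,
# so the probability is exactly .5 and logit() recovers the 0.
p = prob_of_success(40, -2.0, 0.05)
print(p, logit(p))  # 0.5 0.0
```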

  15. Example: A study was conducted to analyze behavioral variables and stress in people recently diagnosed with cancer. For our purposes we will look at patients who have been in the study for at least a year, and the dependent variable (Outcome) is coded 1 to indicate that the patient is improved or in complete remission, and 0 if the patient has not improved or has died. The predictor variable is the survival rating assigned by the patient's physician at the time of diagnosis. This is a number between 0 and 100 and represents the estimated probability (in percent) of survival at five years. Out of 66 cases there are 48 patients who have improved and 18 who have not.

  16. The scatterplot shows us that a linear regression analysis is not appropriate for these data. The scatterplot clearly has no linear trend, but it does show that the proportion of people who improve is much higher when the survival rating is high, as would be expected. However, if we transform the response from whether the patient has improved to the odds of improvement, and then consider the log of the odds, we will have a variable that is a linear function of the survival rating, and we will be able to use linear regression.

  17. Let p = the probability of improvement. Then 1 − p is the probability of no improvement. We will look for an equation of the form log(p / (1 − p)) = β0 + β1(SurvRate). Here β1 will be the amount of increase in the log odds for a one-unit increase in SurvRate.

  18. Here are the results of this analysis: We can see that the logistic regression equation is log(odds) = .081(SurvRate) − 2.684.
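The transcript does not include the data, but here is a hedged sketch of fitting the same kind of model outside SPSS with statsmodels, using simulated stand-ins for Outcome and SurvRate built from the fitted equation on this slide:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
survrate = rng.uniform(0, 100, size=66)     # hypothetical predictor values
log_odds = 0.081 * survrate - 2.684         # the slide's fitted equation
outcome = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

X = sm.add_constant(survrate)               # adds the intercept column
fit = sm.Logit(outcome, X).fit()
print(fit.params)  # estimates of (b0, b1); near (-2.684, .081) on average
```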

  19. Assessing the Model. In linear regression, we used the p-values associated with the test statistic t to assess the contribution of each predictor. In logistic regression, we can use the Wald statistic in the same way. Note that in this example, the Wald statistic for the predictor is 17.755, which is significant at the .05 level. This is evidence that SurvRate is a useful predictor in this model.

  20. H0: β1 = 0, Ha: β1 ≠ 0. Since p is close to zero, the null hypothesis is rejected. This indicates that the physician's survival rating is a useful predictor of the patient's outcome. The resulting regression equation is log(odds) = .081(SurvRate) − 2.684.
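The Wald statistic is (B / SE)², referred to a chi-square distribution with 1 degree of freedom. The standard error is not quoted in the transcript, so this sketch backs it out of the reported Wald value:

```python
from scipy import stats

B = 0.081
wald = 17.755
SE = B / wald ** 0.5                # implied standard error, about .019
p_value = stats.chi2.sf(wald, df=1)
print(SE, p_value)                  # p is close to zero, so reject H0: b1 = 0
```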

  21. Here are scatterplots of the data and of the values predicted by the model:

  22. Note how well the results fit the data: The fitted curve is quite close to the points in the lower left, rises rapidly across the points in the center, where the values of SurvRate have roughly equal numbers of patients who improve and who do not, and finally comes close to the cluster of points in the upper right. The fitted values all fall between 0 and 1.

  23. SPSS takes an iterative approach to this solution: it will begin with some starting values for β0 and β1, see how well the estimated log odds fit the data, adjust the coefficients, and then reexamine the fit. This continues until no further adjustment produces a better fit. What do all of the SPSS results tell us?

  24. Starting with Block 0: Beginning Block. The Case Processing Summary tells us that all 66 cases were included:

  25. The Variables in the Equation table shows that in this first iteration only the constant was used. The second table lists the variables that were not included in this model; it indicates that if the second variable were to be included, it would be a significant predictor:

  26. The Iteration History shows the results when only the constant is in the model. Since the second variable, SurvRate, is not included, there is little change from iteration to iteration. The -2 Log likelihood can be used to assess how well a model fits the data. It is based on summing the logs of the probabilities that the fitted model assigns to the observed outcomes. The lower the -2LL value, the better the fit.
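The -2LL value for the constant-only model can be reproduced directly, since with no predictor every fitted probability is just the overall proportion of successes, 48/66:

```python
import numpy as np

# -2 log likelihood: -2 times the sum of the log probabilities the
# fitted model assigns to the observed 0/1 outcomes.
def neg2LL(y, p):
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1] * 48 + [0] * 18)   # 48 improved, 18 did not
p = np.full(66, 48 / 66)            # constant-only fitted probabilities
print(neg2LL(y, p))                 # 77.346, matching Block 0's -2LL
```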

  27. You can see from the Classification Table that, with no predictor in the model yet, every case is classified into the more common outcome category. You can also see that there were 48 patients who improved and 18 who did not.

  28. One way to test the overall model is the Hosmer-Lemeshow goodness-of-fit test, a chi-square test comparing the observed and expected frequencies of the two response categories across groups of cases ordered by predicted probability. Large values of χ² (and the corresponding small p-values) indicate a lack of fit for the model. This table tells us that our model is a good fit, since the p-value is large:
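The Hosmer-Lemeshow statistic is not built into scipy, but the computation is short. Here is a minimal sketch, assuming y holds the 0/1 outcomes and p the fitted probabilities:

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p, groups=10):
    # Sort cases by fitted probability and split them into groups.
    order = np.argsort(p)
    y, p = y[order], p[order]
    chi2 = 0.0
    for idx in np.array_split(np.arange(len(y)), groups):
        obs = y[idx].sum()          # observed successes in the group
        exp = p[idx].sum()          # expected successes in the group
        n = len(idx)
        chi2 += (obs - exp) ** 2 / (exp * (1 - exp / n))
    # Large chi2 (small p) indicates lack of fit.
    return chi2, stats.chi2.sf(chi2, df=groups - 2)
```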

  29. Now consider the next block, Block 1: Method = Enter. The Iteration History table shows the progress as the model is reassessed; the value of the coefficient of SurvRate converges to .081. To assess whether this larger model provides a significantly better fit than the smaller model, consider the difference between the -2LL values: 77.346 for the constant-only model versus 37.323 for the model with SurvRate included. The drop of 40.023 can be compared to a chi-square distribution with 1 degree of freedom, and it far exceeds any reasonable critical value, indicating that the larger model is a significantly better fit.
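This comparison is the likelihood ratio test; the drop in -2LL behaves like a chi-square statistic with degrees of freedom equal to the number of added predictors:

```python
from scipy import stats

drop = 77.346 - 37.323               # = 40.023
p_value = stats.chi2.sf(drop, df=1)  # one predictor was added
print(drop, p_value)                 # p is essentially zero: keep SurvRate
```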

  30. This table now shows that SurvRate is a significant predictor (p is close to 0), and we can find the coefficients in the resulting regression equation, y = log(odds) = .081(SurvRate) − 2.684.

  31. Why is the odds ratio Exp(B)? Suppose that we have the logistic regression equation y = log(odds) = β1x + β0. Then β1 represents the change in y associated with a unit change in x. That is, y will increase by β1 when x increases by 1. But y is log(odds). So log(odds) will increase by β1 when x increases by 1.

  32. Exp(B) is an indicator of the change in odds resulting from a unit change in the predictor. Let's see how this happens: Suppose we start with the regression equation y = β1x + β0. Now if x increases by 1, we have y = β1(x + 1) + β0. How much has y changed? New value − old value = [β1(x + 1) + β0] − [β1x + β0] = [β1x + β1 + β0] − [β1x + β0] = β1. So y has increased by β1. That is, β1 is the change in y associated with a unit change in x. But y = log(odds), so now we know that log(odds) will increase by β1 when x increases by 1.

  33. If log(odds) changes by β1, then the odds are multiplied by e^β1. In other words, the change in odds associated with a unit change in x is a factor of e^β1, which can be denoted as Exp(β1) -- or by Exp(B) in SPSS. In our example, then, with each unit increase in SurvRate, y = log(odds) will increase by .081. That is, the odds of improving will increase by a factor of about 1.085 for each unit increase in SurvRate.
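The factor itself is one line of Python:

```python
import math

# A one-unit increase in SurvRate adds .081 to log(odds),
# so it multiplies the odds by e^.081.
print(math.exp(0.081))  # about 1.084; SPSS reports Exp(B) = 1.085
                        # from the unrounded coefficient
```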

  34. Another example: • The sales director for a chain of appliance stores wants to find out what circumstances encourage customers to purchase extended warranties after a major appliance purchase. • The response variable is an indicator of whether or not a warranty is purchased. • The predictor variables are: • Customer gender • Age of the customer • Whether a gift is offered with the warranty • Price of the appliance • Race of the customer (coded with three indicator variables to represent White, African-American, and Hispanic)

  35. Here are the results with all predictors: The odds ratio is shown in the Exp(B) column. This is the multiplicative change in odds for each unit change in the predictor. For example, the odds of a customer purchasing a warranty are multiplied by 1.096 for each additional year of the customer's age. Which of these variables might be removed from this model? Use α = .10.


  37. If the analysis is rerun with only three predictors, these are the results: In this model, all three predictors are significant. These results indicate that the odds that a customer who is offered a gift will purchase a warranty are more than ten times the corresponding odds for a customer having the same other characteristics but who is not offered a gift. Also, the odds ratio for Age is greater than 1. This tells us that older buyers are more likely to purchase a warranty.

  38. If the analysis is rerun with only three predictors, these are the results: The resulting regression equation is Log(odds) = 2.339(Gift) + .064(Age) − 6.096.

  39. If the analysis is rerun with only three predictors, these are the results: Note: in this example, the coefficient of Price is too small to be expressed in three decimal places. This can be remedied by dividing the price by 100 and creating a new model.

  40. The resulting equation is now Log(odds) = 2.339(Gift) + .064(Age) + .040(Price100) − 6.096.
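As a sketch of how the equation is used, here is the predicted probability for a hypothetical customer; the profile (offered a gift, age 45, appliance price $1,200) is invented for illustration:

```python
import math

gift, age, price100 = 1, 45, 1200 / 100
log_odds = 2.339 * gift + 0.064 * age + 0.040 * price100 - 6.096
prob = math.exp(log_odds) / (1 + math.exp(log_odds))
print(log_odds, prob)  # about -0.40 and .40
```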

  41. To produce the output for your analysis: • Choose Analyze > Regression > Binary Logistic. • Choose the response and predictor variables. • Click on Options and check CI for Exp(B) to create the confidence intervals. Click on Continue. • Click on Save and check the Probabilities box. Click on Continue and then on OK.

  42. To produce the graph of the results, create a simple scatterplot using Predicted Probability as the dependent variable, and the predictor as a covariate.
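Outside SPSS, the same graph can be sketched with matplotlib; this version recreates the fitted curve for the cancer example from the slide's equation rather than from saved predicted probabilities:

```python
import numpy as np
import matplotlib.pyplot as plt

survrate = np.linspace(0, 100, 200)
fitted = 1 / (1 + np.exp(-(0.081 * survrate - 2.684)))

plt.plot(survrate, fitted)
plt.xlabel("SurvRate")
plt.ylabel("Predicted probability of improvement")
plt.show()
```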
