
Logistic Regression


Presentation Transcript


  1. Logistic Regression APKC – STATS AFAC (2016)

  2. What is Logistic Regression?
  • In statistics, logistic regression (also called logit regression or the logit model) is a regression model where the dependent variable (DV) is categorical.
  • It is a form of regression that allows the prediction of discrete variables from a mix of continuous and discrete predictors.
  • It makes no distributional assumptions about the predictors: they do not have to be normally distributed, linearly related, or have equal variance within each group.

  3. What is Logistic Regression?
  • Logistic regression is often used because the relationship between the DV (a discrete variable) and a predictor is non-linear.
  • Example: the occurrence of a thunderstorm vs. any specific atmospheric parameter.

  4. Questions
  • Are there interactions among predictors? Does adding interactions among predictors (continuous or categorical) improve the model?
  • Continuous predictors should be centered before interaction terms are created, in order to avoid multicollinearity.
  • Can parameters be accurately predicted?
  • How good is the model at classifying cases for which the outcome is known?

  5. Assumptions • The only “real” limitation on logistic regression is that the outcome must be discrete.

  6. Assumptions • Linearity in the logit – the regression equation should have a linear relationship with the logit form of the DV. There is no assumption about the predictors being linearly related to each other.

  7. What is a log and an exponent?
  • A log is the power to which a base of 10 must be raised to produce a given number. The log of 1000 is 3, as 10^3 = 1000.
  • The log of an odds ratio of 1.0 is 0, as 10^0 = 1.
  • The exponent e (≈ 2.718) raised to a certain power is the antilog of that number. Thus exp(β) is the antilog of β.
  • The antilog of a log odds of 0 is 2.718^0 = 1.
  • Exponential increases are curvilinear.
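A quick sketch in Python (not part of the original slides) verifying these identities:

```python
import math

# Base-10 log: the power to which 10 must be raised
print(math.log10(1000))   # 3.0, since 10**3 == 1000
print(math.log10(1.0))    # 0.0, since 10**0 == 1 (the log of an odds ratio of 1)

# e (~2.718) raised to a power is the antilog on the natural-log scale
print(math.exp(0))        # 1.0 -- the antilog of a log odds of 0
print(math.exp(1))        # 2.718281828...
```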

  8. Background LOGITS ARE CONTINUOUS, LIKE Z SCORES
  • p = 0.50, then logit = 0
  • p = 0.70, then logit = 0.84
  • p = 0.30, then logit = −0.84
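The logit is the natural log of the odds, ln(p / (1 − p)). A minimal sketch reproducing the numbers above:

```python
import math

def logit(p):
    """Natural log of the odds, ln(p / (1 - p))."""
    return math.log(p / (1 - p))

print(logit(0.50))   #  0.0
print(logit(0.70))   #  0.847 (the slide's 0.84)
print(logit(0.30))   # -0.847 (the slide's -0.84)
```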

  9. Plain old regression
  • Y = a binary response (DV): 1 = positive response (success), with probability P; 0 = negative response (failure), with probability Q = 1 − P
  • MEAN(Y) = P, the observed proportion of successes
  • VAR(Y) = PQ, maximized when P = .50; the variance depends on the mean (P)
  • Xj = any type of predictor: continuous, dichotomous, polytomous
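A small simulation (an illustration, not from the slides) showing that a binary Y has MEAN(Y) = P and VAR(Y) = PQ, with the variance peaking at P = .50:

```python
import random

random.seed(1)
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    y = [1 if random.random() < p else 0 for _ in range(100_000)]
    mean = sum(y) / len(y)                           # tracks P
    var = sum((v - mean) ** 2 for v in y) / len(y)   # tracks P * Q
    print(p, round(mean, 3), round(var, 3), round(p * (1 - p), 3))
# The sample variance matches P*Q and is largest at P = 0.5
```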

  10. Plain old regression • Y = A + B1X1 + … + BkXk + e, and it is assumed that the errors e are normally distributed, with mean = 0 and constant variance (i.e., homogeneity of variance).

  11. Plain old regression • An expected value is a mean, so E(Y | X) = P. The predicted value equals the proportion of observations for which Y = 1 at that X; P is the probability of Y = 1 (a success) given X, and Q = 1 − P (a failure) given X.

  12. Plain old regression • You can’t use plain old regression when you have a discrete outcome because you don’t meet homoscedasticity: VAR(Y) = PQ depends on P, which changes with X, so the error variance cannot be constant.

  13. The logistic function

  14. The logistic function • Ŷi = e^u / (1 + e^u) • where Ŷi (Y-hat) is the estimated probability that the ith case is in a category and u is the regular linear regression equation: u = A + B1X1 + B2X2 + … + BkXk

  15. The logistic function

  16. The logistic function • Change in probability is not constant (linear) with constant changes in X • This means that the probability of a success (Y = 1) given the predictor variable (X) is a non-linear function, specifically a logistic function
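A sketch of the function from slides 14–16, showing that a constant change in u produces very different changes in probability depending on where you start:

```python
import math

def logistic(u):
    """P(Y = 1) = e**u / (1 + e**u), the inverse of the logit."""
    return math.exp(u) / (1 + math.exp(u))

for u in (-4, -2, 0, 2, 4):
    print(u, round(logistic(u), 3))
# -4 0.018, -2 0.119, 0 0.5, 2 0.881, 4 0.982 -- the S-shaped curve:
# equal steps in u give unequal steps in probability
```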

  17. Logistic regression • It can be binomial, ordinal or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types (for example, “Occ” vs. “Non Occ”, or “win” vs. “loss”). Multinomial logistic regression deals with situations where the outcome can have three or more possible types that are not ordered.

  18. An Example
  • Predictors of a treatment intervention.
  • Participants: 113 adults with a medical problem.
  • Outcome: cured (1) or not cured (0).
  • Predictors:
    • Intervention: intervention or no treatment.
    • Duration: the number of days before treatment that the patient had the problem.
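The slides run this model in SPSS. As a point of comparison, here is a minimal sketch of the same model in Python with statsmodels; the dataset below is simulated and the variable names are assumptions, not the original data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Simulated stand-in for the 113-patient dataset (not the original data)
n = 113
df = pd.DataFrame({
    "intervention": rng.integers(0, 2, n),   # 0 = no treatment, 1 = treated
    "duration": rng.integers(1, 10, n),      # days with the problem
})
# Generate outcomes so that treatment roughly triples the odds of cure
u = -0.3 + 1.2 * df["intervention"]
df["cured"] = rng.binomial(1, 1 / (1 + np.exp(-u)))

model = smf.logit("cured ~ intervention + duration", data=df).fit()
print(model.summary())
print(np.exp(model.params))   # Exp(B): the odds ratios
```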

  19. Click Categorical. Click First, then Change. Identify any categorical covariates (predictors). With a categorical predictor that has more than 2 categories, use the highest number to code your control category, then select Last for your indicator contrast. In this data set, 1 is cured, 0 not cured.

  20. Enter Interaction Term(s). You can specify main effects and interactions: highlight both predictors, then click >a*b>. If you don’t have previous literature to guide model building, choose the Forward: LR stepwise method (LR = likelihood ratio).

  21. Save Settings for Logistic Regression

  22. Option Settings for Logistic Regression. Hosmer-Lemeshow assesses how well the model fits the data. Look for outliers beyond ±2 SD. Request the 95% CI for the odds ratio (the odds of Y occurring).

  23. Output for Step 0, Constant Only. Initially the model will always select the option with the highest frequency; in this case it selects the intervention (treated). Large values of −2 log-likelihood (−2LL) indicate a poorly fitting model; the −2LL gets smaller as the fit improves.
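A sketch of how −2LL is computed for the constant-only model. The 65/48 split below is an assumption chosen to match the 57% base rate on the next slide, not the actual output:

```python
import math

def minus_two_ll(p_hat, outcomes):
    """-2 * log-likelihood when every case gets the same fitted probability."""
    ll = sum(math.log(p_hat) if y == 1 else math.log(1 - p_hat) for y in outcomes)
    return -2 * ll

outcomes = [1] * 65 + [0] * 48            # hypothetical 113-case outcome vector
p_hat = sum(outcomes) / len(outcomes)     # ≈ 0.575: the base rate
print(minus_two_ll(p_hat, outcomes))      # baseline -2LL; smaller = better fit
```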

  24. Example of How to Write the Logistic Regression Equation from Coefficients. Using the constant only, the model above predicts a 57% probability of Y occurring.
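With only the constant B0 in the model, u = B0 and the predicted probability is e^B0 / (1 + e^B0). A worked check; B0 = 0.30 is a hypothetical value consistent with the slide's 57%:

```python
import math

b0 = 0.30                                  # hypothetical Step 0 constant
p = math.exp(b0) / (1 + math.exp(b0))      # logistic function with u = b0
print(round(p, 2))                         # 0.57 -- the slide's 57%

# Inverting: the constant is just the log odds of the base rate
print(round(math.log(0.57 / 0.43), 2))     # ≈ 0.28
```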

  25. Output: Step 1

  26. Equation for Step 1. See p. 288 for an example of using the equation to compute an odds ratio. We can say that the odds of a patient who is treated being cured are 3.41 times higher than those of a patient who is not treated, with a 95% CI of 1.561 to 7.480. The important thing about this confidence interval is that it doesn’t cross 1 (both values are greater than 1). This matters because values greater than 1 mean that as the predictor variable(s) increase, so do the odds of (in this case) being cured. Values less than 1 mean the opposite: as the predictor increases, the odds of being cured decrease.
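The odds ratio is Exp(B), so the slide's 3.41 corresponds to a coefficient of B = ln(3.41) ≈ 1.23 on Intervention, and the CI comes from exponentiating B ± 1.96 × SE(B). The SE below is a hypothetical value chosen to roughly reproduce the slide's interval:

```python
import math

B = math.log(3.41)                 # coefficient implied by the reported Exp(B)
print(round(B, 2))                 # ≈ 1.23

se = 0.40                          # hypothetical standard error of B
lo = math.exp(B - 1.96 * se)
hi = math.exp(B + 1.96 * se)
print(round(lo, 3), round(hi, 3))  # ≈ 1.557 to 7.468, close to 1.561-7.480
```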

  27. Output: Step 1. Removing Intervention from the model would have a significant effect on its predictive ability; in other words, it would be very bad to remove it.

  28. Classification Plot. Further away from .5 is better; the .5 line represents a coin toss (a 50/50 chance). If the model fits the data, the histogram should show all of the cases for which the event occurred on the right-hand side (C) and all the cases for which it didn’t occur on the left-hand side (N). This model is better at predicting cured cases than non-cured cases, as the non-cured cases sit closer to the .5 line.
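A minimal sketch of this kind of plot with matplotlib; the predicted probabilities below are invented for illustration, not taken from the model output:

```python
import matplotlib.pyplot as plt

# Hypothetical predicted probabilities of cure from a fitted model
p_cured = [0.82, 0.75, 0.91, 0.68, 0.88, 0.79]   # cases that were cured (C)
p_not   = [0.46, 0.38, 0.55, 0.42, 0.60, 0.35]   # cases not cured (N)

plt.hist([p_not, p_cured], bins=10, range=(0, 1), stacked=True,
         label=["Not cured (N)", "Cured (C)"])
plt.axvline(0.5, linestyle="--")                  # the coin-toss line
plt.xlabel("Predicted probability of cure")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```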

  29. Choose Analyze – Reports – Case Summaries Use the Case Summaries function to create a table of the first 15 cases showing the values of Cured, Intervention, Duration, the predicted probability (PRE_1) and the predicted group membership (PGR_1).

  30. Case Summaries

  31. Summary
  • The overall fit of the final model is shown by the −2 log-likelihood statistic.
  • If the significance of the chi-square statistic is less than .05, then the model is a significant fit to the data.
  • Check the table labelled Variables in the Equation to see which variables significantly predict the outcome.
  • Use the odds ratio, Exp(B), for interpretation.
    • OR > 1: as the predictor increases, the odds of the outcome occurring increase.
    • OR < 1: as the predictor increases, the odds of the outcome occurring decrease.
    • The confidence interval of the OR should not cross 1!
  • Check the table labelled Variables not in the Equation to see which variables did not significantly predict the outcome.
