Logistic regression | Quantitative Statistical Methods | Dr. Szilágyi Roland
Logistic regression • Logistic regression is a method that predicts the classification of cases into groups on the basis of independent variables. The analysis identifies those independent variables (x) that cause a significant difference between the categories of the dependent variable. • Binary (the dependent variable has two categories) • Multinomial (the dependent variable has more than two categories)
Logistic regression in practice • Binary logistic regression is most useful when you want to model the event probability for a categorical response variable with two outcomes. • Market research • Modelling (buy or no buy) • Segmentation reliability • Enterprise analysis (default, non-default) • etc. For example: 1. A catalog company wants to increase the proportion of mailings that result in sales. 2. A loan officer wants to know whether the next customer is likely to default.
General purposes • To create a logistic regression function that best separates the categories of the dependent variable as a linear combination of the independent variables. • To determine whether there is a significant difference among the groups according to the independent variables. • To determine which independent variables explain most of the differences among the groups. • Based on the experience obtained from a known classification, to predict the group membership of new cases by analyzing their independent variables. • To measure the accuracy of the classification.
The Assumptions for Logistic regression 1. Measure of variables • The dependent variable should be categorical with m (at least 2) values (e.g. 1 = good student, 2 = bad student; or 1 = prominent student, 2 = average student, 3 = bad student). • The independent variables can be measured on any scale.
The Assumptions for Logistic regression 2. Independence Not only the explanatory variables but also all cases must be independent. Therefore panel, longitudinal, or pre-test data cannot be used for logistic regression analysis.
The Assumptions for Logistic regression 3. Sample size As a general rule, the larger the sample size, the more reliable the model. The ratio of the number of observations to the number of variables is also important. The results can be better generalized if we have at least 60 observations.
The Assumptions for Logistic regression 4. Multivariate normal distribution In the case of a normal distribution, the estimation of the parameters is easier, because the parameters can be defined from the density or distribution function. Normality can be checked with histograms of the frequency distributions or with hypothesis tests.
The Assumptions for Logistic regression 5. Multicollinearity The independent variables should be correlated with the dependent variable; however, there should be no correlation among the independent variables, because multicollinearity can bias the results of the analysis.
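A common screen for the multicollinearity assumption above is the variance inflation factor (VIF). The sketch below (an illustration, not part of the course material; the variable names and the 0.1 noise scale are assumptions) computes VIF with NumPy least squares: values far above 1 flag predictors that are nearly linear combinations of the others.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n x k).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns via least squares.
    """
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                  # independent predictor
X = np.column_stack([x1, x2, x3])
print(vif(X))  # large VIFs for x1 and x2; x3 stays near 1
```

A rule of thumb often quoted in practice treats VIF above 5 or 10 as a sign of problematic multicollinearity.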
Binary Logistic Regression • For a binary response Y and the explanatory variables $X_1, \dots, X_k$, let $F(x) = P(Y = 1 \mid X_1 = x_1, \dots, X_k = x_k)$. • F(x) is interpreted as the probability of the dependent variable equaling a "success" or "case" rather than a failure or non-case.
Binary Logistic Regression • Since the probability of an event must lie between 0 and 1, it is impractical to model probabilities with linear regression techniques, because the linear regression model allows the dependent variable to take values greater than 1 or less than 0. • The logistic function is a type of generalized linear model and it is useful because it can take any linear combination of the independent variables (Xi) as input, while the output always lies between zero and one and hence is interpretable as a probability. The logistic function is defined as follows:
The Logistic Regression Model is: $P(Y = 1) = \dfrac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}$. Equivalently, the logit (log odds) has the linear relationship: $\ln\dfrac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$.
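The logistic/logit pair above can be checked numerically. The sketch below (illustrative only; the coefficient values are assumptions) shows that the logistic function maps any linear predictor into (0, 1) and that the logit recovers the linear predictor exactly:

```python
import math

def logistic(z):
    # p = 1 / (1 + e^{-z}); maps any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    # log odds: ln(p / (1 - p)); the inverse of the logistic function
    return math.log(p / (1.0 - p))

# linear predictor z = b0 + b1*x1 + b2*x2 (illustrative coefficients)
b0, b1, b2 = -1.5, 0.8, -0.3
x1, x2 = 2.0, 1.0
z = b0 + b1 * x1 + b2 * x2
p = logistic(z)
print(p)         # a probability strictly between 0 and 1
print(logit(p))  # recovers the linear predictor z
```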
Binary Logistic Regression • The "odds" of the dependent variable equaling a case (given some linear combination $x_i$ of the predictors) is equivalent to the exponential function of the linear regression expression: $\dfrac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}$
Binary Logistic Regression • After exponentiating, the odds factor into one multiplicative term per predictor: $\dfrac{p}{1-p} = e^{\beta_0}\, e^{\beta_1 x_1} \cdots e^{\beta_k x_k}$
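The identity that the odds equal the exponential of the linear predictor can be verified directly. A minimal sketch (the intercept and slope values are assumptions for illustration):

```python
import math

# odds(x) = p / (1 - p) = exp(b0 + b1*x) for a single predictor
b0, b1 = -1.0, 0.5  # illustrative coefficients

def prob(x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x):
    p = prob(x)
    return p / (1.0 - p)

print(odds(2.0))                     # matches...
print(math.exp(b0 + b1 * 2.0))       # ...the exponentiated linear predictor
```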
Maximum Likelihood Method • The maximum likelihood method finds the set of coefficients (β), called the maximum likelihood estimates, at which the log-likelihood function attains its maximum. Source: Hajdu Ottó: Többváltozós statisztikai számítások; KSH, Budapest, 2003.
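In practice the maximum likelihood estimates have no closed form and are found iteratively. A minimal sketch of one such iteration (plain gradient ascent; real software such as SPSS uses Newton-type methods, and the data here are synthetic assumptions):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=5000):
    """Maximum likelihood estimates via gradient ascent on the
    log-likelihood  l(b) = sum_i [ y_i*z_i - log(1 + exp(z_i)) ],
    where z_i = b0 + x_i . b.  Gradient: Z^T (y - p)."""
    Z = np.column_stack([np.ones(len(y)), X])  # add intercept column
    b = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Z @ b))
        b += lr * Z.T @ (y - p) / len(y)       # ascend the log-likelihood
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
true_b = np.array([0.5, 1.0, -2.0])            # intercept, b1, b2 (assumed)
p = 1.0 / (1.0 + np.exp(-(true_b[0] + X @ true_b[1:])))
y = (rng.random(500) < p).astype(float)
print(fit_logistic(X, y))  # estimates close to [0.5, 1.0, -2.0]
```

Because the log-likelihood of logistic regression is concave, any such iteration converges to the unique maximum (when it exists).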
Tests of Model Fit • After building a model, you need to determine whether it reasonably approximates the behavior of your data. • Tests of Model Fit: the Binary Logistic Regression procedure reports the Hosmer-Lemeshow goodness-of-fit statistic, which helps you determine whether the model adequately describes the data. H0: the model fits. H1: the model does not fit. • Residual Plots: using variables specified in the Save dialog box, you can construct various diagnostic plots.
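The Hosmer-Lemeshow statistic can be computed by hand: sort cases by predicted probability, split them into (usually 10) groups, and compare observed with expected event counts. A sketch, assuming SciPy is available (the grouping and data are illustrative, not the SPSS implementation):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow goodness-of-fit: chi-square over g groups of
    cases sorted by predicted probability; p-value uses g - 2 df."""
    order = np.argsort(p)
    groups = np.array_split(order, g)
    stat = 0.0
    for idx in groups:
        obs, exp, n = y[idx].sum(), p[idx].sum(), len(idx)
        # pooled over events and non-events in the group
        stat += (obs - exp) ** 2 / exp + ((n - obs) - (n - exp)) ** 2 / (n - exp)
    return stat, chi2.sf(stat, g - 2)

rng = np.random.default_rng(2)
p = rng.uniform(0.05, 0.95, size=1000)          # well-calibrated predictions
y = (rng.random(1000) < p).astype(float)
stat, pval = hosmer_lemeshow(y, p)
print(stat, pval)  # for a well-calibrated model the statistic stays small
```

A large p-value means H0 (the model fits) is not rejected.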
Choosing the Right Model • Based on the residual sum of squares (linear regression) • Based on the likelihood ratio (compare the likelihood of the model with the likelihood of a baseline (minimal) model) • Based on the proportion of good predictions.
Pseudo R2 • Cox and Snell's R2 is based on the log-likelihood for the model compared to the log-likelihood for a baseline model. However, with categorical outcomes, it has a theoretical maximum value of less than 1, even for a "perfect" model. • Nagelkerke's R2 is an adjusted version of the Cox & Snell R-square that rescales the statistic to cover the full range from 0 to 1.
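Both pseudo R2 measures above follow directly from the model and baseline log-likelihoods. A sketch with the standard formulas (the log-likelihood values below are illustrative assumptions, not output from the course's example):

```python
import math

def cox_snell_r2(ll_model, ll_null, n):
    # Cox & Snell: 1 - exp(2*(ll_null - ll_model)/n); maximum is below 1
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_model, ll_null, n):
    # rescales Cox & Snell by its theoretical maximum, 1 - exp(2*ll_null/n)
    cs = cox_snell_r2(ll_model, ll_null, n)
    max_cs = 1.0 - math.exp(2.0 * ll_null / n)
    return cs / max_cs

ll_null, ll_model, n = -480.45, -401.3, 700   # illustrative values
print(cox_snell_r2(ll_model, ll_null, n))
print(nagelkerke_r2(ll_model, ll_null, n))    # always >= Cox & Snell
```

Note that for a "perfect" model (log-likelihood 0), Nagelkerke's R2 equals exactly 1 while Cox & Snell's stays below 1, which is the point of the rescaling.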
Example • If you are a loan officer at a bank, you want to be able to identify characteristics that are indicative of people who are likely to default on loans, and to use those characteristics to identify good and bad credit risks.
Example • Suppose we have information on 850 past and prospective customers. • The first 700 cases are customers who were previously given loans. • We use a random sample of these 700 customers to create a logistic regression model, setting the remaining customers aside to validate the analysis. • We then use the model to classify the 150 prospective customers as good or bad credit risks.
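The workflow above (fit on past customers, validate on a hold-out set, then score prospective customers) can be sketched with scikit-learn. The data here are synthetic stand-ins; the two predictors, their coefficients, and the availability of scikit-learn are all assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# synthetic stand-in for the 700 past customers: two predictors
# (e.g. years employed, debt ratio) and a default indicator
n = 700
X = rng.normal(size=(n, 2))
lin = -1.0 - 0.8 * X[:, 0] + 1.2 * X[:, 1]     # assumed true model
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-lin))).astype(int)

# split off a validation sample, as the slide describes
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
acc = model.score(X_val, y_val)
print(acc)  # proportion of correct classifications on the hold-out set

# score 150 "prospective customers"
X_new = rng.normal(size=(150, 2))
print(model.predict(X_new)[:10])  # 0 = good risk, 1 = bad risk
```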
Outputs [SPSS output table omitted] Source: Help, IBM SPSS Statistics
Outputs: Goodness-of-fit statistic [SPSS output table omitted] Source: Help, IBM SPSS Statistics
Outputs [SPSS output table omitted] Source: Help, IBM SPSS Statistics
Meaning of coefficients • The meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While B is convenient for testing the usefulness of predictors, Exp(B) is easier to interpret. • Exp(B) represents the ratio change in the odds of the event of interest for a one-unit change in the predictor (Xi), all other things being equal. • Or: Exp(B) is the odds ratio: the odds of success at Xi = xi + 1 divided by the odds of success at Xi = xi, all other things being equal. • For example, Exp(B) for "Years with current employer" is equal to 0.781, which means that the odds of default for a person who has been employed at their current job for two years are 0.781 times the odds of default for a person who has been employed at their current job for one year, all other things being equal. Source: Help, IBM SPSS Statistics
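The Exp(B) interpretation above can be reproduced numerically. A sketch (the coefficient B = -0.247 is an assumption chosen so that Exp(B) matches the slide's 0.781, and the intercept is illustrative):

```python
import math

# If the fitted coefficient for "Years with current employer" were
# B = -0.247 (assumed), then Exp(B) = e^B, the odds ratio per extra year
B = -0.247
odds_ratio = math.exp(B)
print(round(odds_ratio, 3))  # 0.781

def odds(x, b0=1.0):
    # odds of default at x years employed (b0 is an illustrative intercept)
    p = 1.0 / (1.0 + math.exp(-(b0 + B * x)))
    return p / (1.0 - p)

# odds at 2 years divided by odds at 1 year equals Exp(B),
# whatever the intercept or the other (held-constant) predictors are
print(round(odds(2.0) / odds(1.0), 3))  # 0.781
```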