Logistic Regression

Remember?

Applications: Prediction vs. Explanatory Analysis

• Prediction: the focus is on producing the model that best predicts future values of Y as a function of the Xs; the terms in the model, the values of their coefficients, and their statistical significance are of secondary importance.
• Explanatory analysis: the focus is on understanding the relationship between the dependent variable and the independent variables; consequently, the statistical significance of the coefficients matters, as do the magnitudes and signs of the coefficients.

Target Marketing

Attrition Prediction

Credit Scoring

Fraud Detection

Example Problems

Regression and Other Models

Types of Logistic Regression

Supervised (binary) Classification

[Table: the modeling data layout, with cases 1 … n as rows, input variables x1, x2, …, xk as columns, and a binary target y]

Problem and Data

Target: did the customer purchase the variable annuity product? (1 = yes, 0 = no)

Inputs: other product usage in a three-month period, plus demographics

Size: ~32,000 observations, 47 variables

Analytical Challenges

Opportunistic Data

Operational / Observational

Massive

• The analytical data preparation step dominates the effort:
• benchmark: 80/20 (preparation vs. modeling)
• [my] experience: 99/1

Errors and Outliers

2+2=5

Missing Values

Mixed Measurement Scales

- Nominal: sales, executive, homemaker, ...
- Interval: 88.60, 3.92, 34890.50, 45.01, ...
- Ordinal: F, D, C, B, A
- Counts: 0, 1, 2, 3, 4, 5, 6, ...
- Binary: M, F
- ZIP codes: 27513, 21737, 92614, 10043, ...

High Dimensionality

Rare Target Event

| Event | No Event |
|---|---|
| respond | not respond |
| churn | stay |
| default | pay off |
| fraud | legitimate |

Nonlinearities and Interactions

[Figure: E(y) plotted as a surface over x1 and x2, comparing a linear surface with a nonlinear one]

Model Selection

[Figure: three fits to the same data, labeled Underfitting, Overfitting, and Just Right]


Why not linear?

• OLS Reg: Yi=0+1X1i+i

Linear Prob. Model: pi=0+1X1i

• Probabilities are bounded, but linear functions can take on any value. (Once again, how do you interpret a predicted value of -0.4 or 1.1?)
• Given the bounded nature of probabilities, can you assume a linear relationship between X and p throughout the possible range of X?
• Can you assume a random error with constant variance?
• What is the observed probability for an observation?
• If the response variable is categorical, then how do you code the response numerically?
• If the response is coded (1=Yes and 0=No) and your regression equation predicts 0.5 or 1.1 or -0.4, what does that mean practically?
• If there are only two (or a few) possible response levels, is it reasonable to assume constant variance and normality?
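These objections can be made concrete: fit a straight line to a 0/1 response by ordinary least squares and the fitted "probabilities" escape [0, 1]. A minimal sketch with made-up data:

```python
# Fit a linear probability model p = b0 + b1*x by least squares on a 0/1
# target and show the fitted line takes values below 0 and above 1.
def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    b0 = my - b1 * mx
    return b0, b1

x = list(range(10))                    # a single input
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]    # binary response
b0, b1 = ols_fit(x, y)

# Linear "probabilities" at the edges of the observed range:
p_low, p_high = b0 + b1 * 0, b0 + b1 * 9
print(round(p_low, 3), round(p_high, 3))   # below 0 and above 1
```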
Functional Form

pi = 1 / (1 + exp(−(β0 + β1x1i + … + βkxki)))

where pi is the posterior probability, the βs are the parameters, and the xs are the inputs.

[Figure: the S-shaped logistic curve, approaching pi = 1 for larger values of the linear predictor and pi = 0 for smaller values]
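The logistic functional form maps any value of the linear predictor into (0, 1). A minimal sketch (the parameter values are illustrative, not from the slides):

```python
import math

# The logistic (sigmoid) functional form: posterior probability as a
# function of a linear predictor in the inputs.
def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

b0, b1 = -2.0, 0.5   # illustrative parameter values (assumed)
for x in (-4, 0, 4, 12):
    p = logistic(b0 + b1 * x)
    print(x, round(p, 3))   # p stays strictly between 0 and 1
```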


The Fitted Surface

LOGISTIC Procedure

proc logistic data=develop plots(only)=
      (effect(clband x=(ddabal depamt checks res))
       oddsratio (type=horizontalstat));
   class res (param=ref ref='S');
   /* model inputs inferred from the PLOTS= list above */
   model ins(event='1') = ddabal depamt checks res;
   units ddabal=1000 depamt=1000 / default=1;
   oddsratio 'Comparisons of Residential Classification' res / diff=all cl=pl;
run;

Properties of the Odds Ratio

odds ratio = 1: no association
odds ratio > 1: the group in the numerator has higher odds of the event
odds ratio < 1: the group in the denominator has higher odds of the event

(the odds ratio ranges over 0 … 1 … ∞)

Estimated logistic regression model:

logit(p) = −0.7567 + 0.4373*(gender)

where females are coded 1 and males are coded 0

Estimated odds ratio (females to males):

odds ratio = e^(−0.7567 + 0.4373) / e^(−0.7567) = e^0.4373 = 1.55
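The slide's arithmetic can be checked directly; the intercept cancels in the ratio, so the odds ratio is simply exp of the gender coefficient:

```python
import math

# With logit(p) = -0.7567 + 0.4373*gender (female = 1), the
# female-to-male odds ratio equals exp(b1), independent of the intercept.
b0, b1 = -0.7567, 0.4373
odds_f = math.exp(b0 + b1)    # odds for females
odds_m = math.exp(b0)         # odds for males
print(round(odds_f / odds_m, 2))   # 1.55, i.e. exp(b1)
```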

Results from ODDSRATIO

oddsratio 'Comparisons of Residential Classification' res / diff=all cl=pl;

Results from PLOTS=(EFFECT(…

plots(only)=(effect(clband x=(ddabal depamt checks res))


Logistic Discrimination

oversampling

Sampling Designs

Joint (random) sampling: draw (x, y) pairs together from the population:

{(x,y), (x,y), (x,y), (x,y), …}

Separate (case-control) sampling: draw the x's separately within the y = 0 stratum and the y = 1 stratum, then attach the labels:

{(x,0), (x,0), (x,1), (x,1), …}

The Effect of Oversampling

Offset

Two ways to correct for oversampling:

1. Include an offset term in the model: model … / offset=X
2. Adjust the probabilities output by the model

(the correction compares the event proportion in reality with the event proportion in the sample)

Adjusting the Probabilities

/* Specify the prior probability */
/* to correct for oversampling */
%let pi1=.02;

/* Correct predicted probabilities */
proc logistic data=develop;
   score data=pmlr.new out=scored priorevent=&pi1;
run;
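The PRIOREVENT= correction can be sketched in plain Python. The formula below is the standard prior-correction identity and is an assumption here, not taken from the slide:

```python
# Rescale an oversampled predicted probability p back to the population
# prior pi1, given the event proportion rho1 in the (oversampled) sample.
def adjust(p, pi1, rho1):
    num = p * pi1 / rho1
    den = num + (1 - p) * (1 - pi1) / (1 - rho1)
    return num / den

pi1, rho1 = 0.02, 0.50       # population prior 2%, balanced sample
print(round(adjust(0.50, pi1, rho1), 6))   # a 50% sample score maps back to the prior
```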

[Slide: a small data table with one value missing, shown as "?"]

Missing values

Does Pr(missing) Depend on the Data?

• No: MCAR (missing completely at random)
• Yes: the missingness may depend on
  • that unobserved value itself
  • other unobserved values
  • other observed values (including the target)
Complete Case Analysis

[Figure: the cases-by-input-variables grid; any case with a missing value is dropped]

New Missing Values

[Slide: a fitted model applied to a new case with a missing input; the predicted value cannot be computed]

Missing Value Imputation

[Table: sample records before and after missing-value imputation]

Imputation + Indicators

| Incomplete data | Completed data | Missing indicator |
|---|---|---|
| 34 | 34 | 0 |
| 63 | 63 | 0 |
| . | 30 | 1 |
| 22 | 22 | 0 |
| 26 | 26 | 0 |
| 54 | 54 | 0 |
| 18 | 18 | 0 |
| . | 30 | 1 |
| 47 | 47 | 0 |
| 20 | 20 | 0 |

Median of the observed values = 30

data develop1;  /* Create missing indicators */
   set develop;
   /* name the missing indicator variables */
   array mi{*} MIAcctAg MIPhone … MICRScor;
   /* select variables with missing values */
   array x{*} acctage phone … crscore;
   do i=1 to dim(mi);
      mi{i}=(x{i}=.);
   end;
run;

proc stdize data=develop1
            reponly
            method=median  /* Impute missing values with the median */
            out=imputed;
   var &inputs;
run;
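The same indicator-plus-median-imputation logic, sketched in Python on the slide's ten-value example (median of the observed values = 30):

```python
# Build a missing indicator, then fill missing values with the median
# of the observed values, mirroring the DATA step + PROC STDIZE pair.
def impute_median(values):
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    median = (observed[n // 2 - 1] + observed[n // 2]) / 2 if n % 2 == 0 \
             else observed[n // 2]
    indicator = [1 if v is None else 0 for v in values]
    completed = [median if v is None else v for v in values]
    return completed, indicator, median

x = [34, 63, None, 22, 26, 54, 18, None, 47, 20]
completed, indicator, median = impute_median(x)
print(median)       # 30.0
print(indicator)    # [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
```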

Cluster Imputation [covered in a later lecture]

Categorical Inputs

Dummy Variables

| X | DA | DB | DC | DD |
|---|---|---|---|---|
| D | 0 | 0 | 0 | 1 |
| B | 0 | 1 | 0 | 0 |
| C | 0 | 0 | 1 | 0 |
| C | 0 | 0 | 1 | 0 |
| A | 1 | 0 | 0 | 0 |
| A | 1 | 0 | 0 | 0 |
| D | 0 | 0 | 0 | 1 |
| C | 0 | 0 | 1 | 0 |
| A | 1 | 0 | 0 | 0 |
| … | | | | |
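The dummy coding can be generated with one indicator column per level; a minimal sketch:

```python
# Expand a nominal input X with levels A-D into indicator columns DA..DD.
levels = ["A", "B", "C", "D"]
X = ["D", "B", "C", "C", "A", "A", "D", "C", "A"]

dummies = {lvl: [1 if x == lvl else 0 for x in X] for lvl in levels}
print(dummies["A"])   # [0, 0, 0, 0, 1, 1, 0, 0, 1]
print(dummies["D"])   # [1, 0, 0, 0, 0, 0, 1, 0, 0]
```

Each case turns on exactly one of the four indicators, which is why one level is usually dropped (or reference-coded) before model fitting.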

Smarter Variables

[Table: each ZIP code (99801, 99622, 99523, …) replaced by derived inputs: Urbanicity, HomeVal, Local]

Quasi-Complete Separation

| Level | y = 0 | y = 1 |
|---|---|---|
| A | 28 | 7 |
| B | 16 | 0 |
| C | 94 | 11 |
| D | 23 | 21 |

Level B has no events (a zero cell), so its effect estimate diverges: quasi-complete separation.

Clustering Levels

Merge levels step by step, at each step keeping as much of the full-table chi-square as possible:

| Level | y = 0 | y = 1 |
|---|---|---|
| A | 28 | 7 |
| B | 16 | 0 |
| C | 94 | 11 |
| D | 23 | 21 |

| Merged | χ² | % of full χ² |
|---|---|---|
| (none) | 31.7 | 100% |
| B & C | 30.7 | 97% |
| A & BC | 28.6 | 90% |
| ABC & D | 0 | 0% |

Greenacre (1988, 1993): PROC MEANS, PROC CLUSTER, PROC TREE, … (HOMEWORK)
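The chi-square values used when collapsing levels can be recomputed; a sketch using the slide's counts (A: 28/7, B: 16/0, C: 94/11, D: 23/21):

```python
# Pearson chi-square of a level-by-target crosstab: the criterion for
# deciding which pair of levels to merge.
def chi_square(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 /
               (rows[i] * cols[j] / n)
               for i in range(len(table)) for j in range(len(table[0])))

counts = [[28, 7], [16, 0], [94, 11], [23, 21]]          # levels A, B, C, D
full = chi_square(counts)
merged_bc = chi_square([[28, 7], [110, 11], [23, 21]])   # B & C merged
print(round(full, 1), round(merged_bc, 1))   # 31.7 30.7: merging B & C keeps 97%
```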

Variable Clustering

PROC VARCLUS [covered in a later lecture]

Checking Deposits

Mortgage Balance

Number of Checks

Teller Visits

Credit Card Balance

Age

Variable Screening

Univariate Screening

Univariate Smoothing

Empirical Logits

where

mi = number of events

Mi = number of cases
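The slide's empirical-logit formula was an image; a commonly used smoothed form (the +1/2 adjustment is an assumption here, not taken from the slide) keeps bins with zero events finite:

```python
import math

# Smoothed empirical logit for a bin: m = number of events,
# M = number of cases. Adding 1/2 to each cell avoids log(0).
def empirical_logit(m, M):
    return math.log((m + 0.5) / (M - m + 0.5))

print(round(empirical_logit(0, 40), 3))    # finite even with zero events
print(round(empirical_logit(12, 40), 3))
```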

1. Hand-Crafted New Input Variables

2. Polynomial Models

3. Flexible Multivariate Function Estimators

4. Do Nothing

Empirical Logit Plots

Subset Selection

Scalability in PROC LOGISTIC

[Figure: run time vs. number of variables (25, 50, 75, 100, 150, 200) for all-subsets, stepwise, and fast backward selection]

Honest Assessment

The Optimism Principle

[Figure: the same classifier over inputs x1 and x2, scored on training data (accuracy = 70%) and on test data (accuracy = 47%)]

Data Splitting

Partition the data into Training, Validation, and Test sets.

Other Approaches

k-fold cross-validation with five partitions A, B, C, D, E:

1) train on BCDE, validate on A
2) train on ACDE, validate on B
3) train on ABDE, validate on C
4) train on ABCE, validate on D
5) train on ABCD, validate on E
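The five train/validate rotations can be generated mechanically; a minimal sketch:

```python
# k-fold rotation: each partition serves once as the validation set
# while the remaining partitions are used for training.
def kfold(partitions):
    for i, valid in enumerate(partitions):
        train = [p for j, p in enumerate(partitions) if j != i]
        yield "".join(train), valid

for train, valid in kfold(list("ABCDE")):
    print("train on", train, "validate on", valid)
```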

Misclassification

Confusion Matrix

|  | Predicted 0 (Negative) | Predicted 1 (Positive) |
|---|---|---|
| Actual 0 (Negative) | True Negative | False Positive |
| Actual 1 (Positive) | False Negative | True Positive |

Sensitivity and Positive Predicted Value

Sensitivity = True Positives / Actual Positives

Positive Predicted Value = True Positives / Predicted Positives
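Both quantities are simple ratios of confusion-matrix counts; a sketch with illustrative counts (not taken from the slides):

```python
# Sensitivity and positive predicted value from confusion-matrix counts.
def sensitivity(tp, fn):
    return tp / (tp + fn)          # true positives / actual positives

def ppv(tp, fp):
    return tp / (tp + fp)          # true positives / predicted positives

tn, fp, fn, tp = 66, 9, 4, 21      # illustrative counts
print(round(sensitivity(tp, fn), 2), round(ppv(tp, fp), 2))
```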

[Tables: confusion matrices computed on the oversampled sample vs. the population]

Oversampled Test Set

Adjusted to the population using the priors π0 and π1 and the sample specificity (Sp) and sensitivity (Se):

| Actual Class | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| 0 | π0·Sp | π0·(1 − Sp) | π0 |
| 1 | π1·(1 − Se) | π1·Se | π1 |

Total Profit

Profit matrix (per case):

|  | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | $0 | −$1 |
| Actual 1 | $0 | $99 |

Confusion matrices at three cutoffs, and the resulting total profit:

| TN | FP | FN | TP | Total profit |
|---|---|---|---|---|
| 70 | 5 | 9 | 16 | 16·99 − 5 = $1579 |
| 66 | 9 | 4 | 21 | 21·99 − 9 = $2070 |
| 57 | 18 | 1 | 24 | 24·99 − 18 = $2358 |
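The per-cutoff totals follow from the profit matrix: $99 per true positive, minus $1 per false positive. A sketch reproducing the three figures:

```python
# Total profit from confusion-matrix counts, given the slide's profit
# matrix: each true positive earns $99, each false positive costs $1.
def total_profit(tp, fp, profit_tp=99, cost_fp=1):
    return tp * profit_tp - fp * cost_fp

cutoffs = [(5, 16), (9, 21), (18, 24)]   # (false positives, true positives)
print([total_profit(tp, fp) for fp, tp in cutoffs])   # [1579, 2070, 2358]
```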

Allocation Rules

Profit Matrix

Decision

Bayes Rule: allocate a case to Decision 1 if its expected profit (the profit-matrix entries for Actual Class 0 and 1, weighted by the posterior probability) exceeds the expected profit of Decision 0.

Classifier Performance

Using Profit to Assess Fit

Overall Predictive Power

Class Separation

Area under the ROC Curve

ROC and ROCCONTRAST Statements

ROC <'label'> <specification> </ options>;

ROCCONTRAST <'label'> <contrast> </ options>;
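The area under the ROC curve equals the probability that a randomly chosen event is scored higher than a randomly chosen non-event (concordance). A small sketch with illustrative scores (ties count one half):

```python
# AUC as the concordance probability over all (event, non-event) pairs.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]   # illustrative model scores
labels = [1,   1,   0,   1,   0,   0]     # illustrative outcomes
print(auc(scores, labels))
```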