
Logistic Regression


Presentation Transcript


  1. Logistic Regression

  2. overview

  3. Remember? Applications: Prediction vs. Explanatory Analysis
     Prediction:
     • The terms in the model, the values of their coefficients, and their statistical significance are of secondary importance.
     • The focus is on producing a model that is the best at predicting future values of Y as a function of the Xs. The predicted value of Y is given by $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k$.
     Explanatory analysis:
     • The focus is on understanding the relationship between the dependent variable and the independent variables.
     • Consequently, the statistical significance of the coefficients is important, as are their magnitudes and signs.

  4. Logistic Regression Example Problems: Target Marketing, Attrition Prediction, Credit Scoring, Fraud Detection

  5. Logistic regression Regression and other models

  6. Logistic regression Types of Logistic Regression 

  7. Logistic regression Supervised (Binary) Classification
     The data form an n × k matrix: each of the n cases (rows) has a binary target y and input variables x1, x2, ..., xk (columns).

  8. Logistic regression Problem and Data
     Target: did the customer purchase a variable annuity product? (1 = yes, 0 = no).
     Inputs: other product usage in a three-month period, plus demographics.
     ~32,000 observations, 47 variables.

  9. Logistic regression Problem and Data

  10. Analytical Challenges

  11. Analytical Challenges Opportunistic Data
      • Operational / observational and massive.
      • The analytical data preparation step dominates the work: the benchmark split is 80/20 (preparation vs. modeling); in [MY] LIFE it is closer to 99/1.
      • Errors and outliers (2 + 2 = 5).
      • Missing values.

  12. Analytical Challenges Mixed Measurement Scales
      • Nominal: sales, executive, homemaker, ...
      • Continuous: 88.60, 3.92, 34890.50, 45.01, ...
      • Ordinal: F, D, C, B, A
      • Counts: 0, 1, 2, 3, 4, 5, 6, ...
      • Binary: M, F
      • Nominal despite looking numeric (ZIP codes): 27513, 21737, 92614, 10043, ...

  13. Analytical Challenges High Dimensionality

  14. Analytical Challenges Rare Target Event
      Event:     respond      churn   default   fraud
      No Event:  not respond  stay    pay off   legitimate

  15. Analytical Challenges Nonlinearities and Interactions
      [Figure: surfaces of E(y) against x1 and x2 — a linear, additive fit vs. a nonlinear, nonadditive one]

  16. Analytical Challenges Model Selection
      [Figure: three fits to the same data — underfitting, just right, overfitting]

  17. THE MODEL & its Interpretation

  18. Logistic REGRESSION Why not linear?
      OLS regression: $Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$. Linear probability model: $p_i = \beta_0 + \beta_1 X_{1i}$.
      • Probabilities are bounded, but linear functions can take on any value. (How do you interpret a predicted value of -0.4 or 1.1?)
      • Given the bounded nature of probabilities, can you assume a linear relationship between X and p throughout the possible range of X?
      • Can you assume a random error with constant variance?
      • What is the observed probability for an observation?
      • If the response variable is categorical, how do you code the response numerically?
      • If the response is coded (1 = Yes, 0 = No) and your regression equation predicts 0.5, 1.1, or -0.4, what does that mean practically?
      • If there are only two (or a few) possible response levels, is it reasonable to assume constant variance and normality?
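
      A one-line check of the constant-variance question above (this derivation is not on the slide, but follows directly from the model): for a binary response the error variance depends on the mean, and hence on X:
      $\operatorname{Var}(Y_i \mid X_{1i}) = p_i(1-p_i) = (\beta_0 + \beta_1 X_{1i})\,(1 - \beta_0 - \beta_1 X_{1i})$,
      so the OLS homoscedasticity assumption cannot hold under the linear probability model.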

  19. Logistic REGRESSION Functional Form
      $p_i = \Pr(y_i = 1 \mid \mathbf{x}_i) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki})}}$
      where $p_i$ is the posterior probability, the $\beta$s are the parameters, and the $x$s are the inputs.

  20. Logistic REGRESSION The Logit Link Function
      $\operatorname{logit}(p_i) = \ln\!\left(\dfrac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}$
      [Figure: sigmoid curve running from $p_i = 0$ at smaller values of the linear predictor to $p_i = 1$ at larger values]
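
      A worked example of evaluating the inverse link (the coefficients are borrowed from the fitted model on slide 23): for gender = 1, $\operatorname{logit}(p) = -0.7567 + 0.4373 = -0.3194$, hence $p = 1/(1 + e^{0.3194}) \approx 0.42$; for gender = 0, $p = 1/(1 + e^{0.7567}) \approx 0.32$.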

  21. Logistic REGRESSION The Fitted Surface

  22. Logistic REGRESSION LOGISTIC Procedure
      proc logistic data=develop
           plots(only)=(effect(clband x=(ddabal depamt checks res))
                        oddsratio (type=horizontalstat));
         class res (param=ref ref='S');
         model ins(event='1') = dda ddabal dep depamt cashbk checks res
               / stb clodds=pl;
         units ddabal=1000 depamt=1000 / default=1;
         oddsratio 'Comparisons of Residential Classification' res
               / diff=all cl=pl;
      run;

  23. Logistic REGRESSION Properties of the Odds Ratio
      • Odds ratio = 1: no association.
      • Odds ratio > 1: the group in the numerator has higher odds of the event.
      • Odds ratio between 0 and 1: the group in the denominator has higher odds of the event.
      Estimated logistic regression model: $\operatorname{logit}(p) = -0.7567 + 0.4373 \cdot \text{gender}$, where females are coded 1 and males are coded 0.
      Estimated odds ratio (females to males): $\text{odds ratio} = e^{-0.7567 + 0.4373} / e^{-0.7567} = 1.55$.
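
      The cancellation above is general. A short derivation (not on the slide) of why the odds ratio for a 0/1 input reduces to $e^{\beta_1}$:
      $\text{OR} = \dfrac{\text{odds}(\text{gender}=1)}{\text{odds}(\text{gender}=0)} = \dfrac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1} = e^{0.4373} \approx 1.55$.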

  24. Logistic REGRESSION Results from oddsratio
      oddsratio 'Comparisons of Residential Classification' res / diff=all cl=pl;

  25. Logistic REGRESSION Results from PLOTS=(EFFECT(...
      plots(only)=(effect(clband x=(ddabal depamt checks res))

  26. Logistic REGRESSION Logistic Discrimination

  27. oversampling

  28. oversampling Sampling Designs
      • Joint sampling: (x, y) pairs are drawn together from the joint distribution — {(x,y),(x,y),(x,y),(x,y)}.
      • Separate sampling: the x's are drawn separately within each target stratum (y = 0 and y = 1) — {(x,0),(x,0),(x,1),(x,1)}.

  29. oversampling The Effect of Oversampling

  30. oversampling Offset
      Two ways to correct for oversampling:
      1. Include an offset parameter in the model: model ... / offset=X.
      2. Adjust the probabilities the model outputs. Adjusted probability:
         $\hat p^{*} = \dfrac{\hat p\,\pi_1/\rho_1}{\hat p\,\pi_1/\rho_1 + (1-\hat p)\,\pi_0/\rho_0}$,
         where $\pi_0, \pi_1$ are the event proportions in reality (the population priors) and $\rho_0, \rho_1$ are the proportions in the sample.
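
      A minimal SAS sketch of the offset approach (not from the deck: the macro variable rho1, the 0.33 sample event rate, and the input list are illustrative; pi1 = 0.02 is the prior used on the next slide):

      %let pi1=0.02;    /* event rate in reality (population prior)      */
      %let rho1=0.33;   /* event rate in the oversampled training sample */

      data develop2;
         set develop;
         /* offset = ln( (rho1*pi0) / (rho0*pi1) ) */
         off = log((&rho1*(1-&pi1)) / ((1-&rho1)*&pi1));
      run;

      proc logistic data=develop2;
         model ins(event='1') = dda ddabal dep depamt cashbk checks
               / offset=off;
      run;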

  31. oversampling Probability Adjustment
      /* Specify the prior probability   */
      /* to correct for oversampling     */
      %let pi1=.02;

      /* Correct predicted probabilities */
      proc logistic data=develop;
         model ins(event='1') = dda ddabal dep depamt cashbk checks;
         score data=pmlr.new out=scored priorevent=&pi1;
      run;
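
      For reference, a hand-rolled equivalent of the priorevent= correction (a sketch: it assumes the model was scored without priorevent=, that P_1 from the SCORE statement holds the unadjusted probability, and a hypothetical sample event rate of 0.33):

      %let pi1=0.02;
      %let rho1=0.33;

      data adjusted;
         set scored;
         /* re-weight P_1 by the ratio of population to sample odds */
         p_adj = (p_1*&pi1/&rho1)
               / (p_1*&pi1/&rho1 + (1-p_1)*(1-&pi1)/(1-&rho1));
      run;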

  32. Preparing the Input Variables

  33. Missing values Does Pr(missing) Depend on the Data?
      [Table: example records with one missing value shown as '?']
      • No: the data are MCAR (missing completely at random).
      • Yes — missingness may depend on:
        – that unobserved value itself;
        – other unobserved values;
        – other observed values (including the target).
      Only the last case is MAR (missing at random); dependence on unobserved values is nonignorable.

  34. Missing values Complete Case Analysis
      [Figure: the cases × input variables data matrix, with missing entries scattered through it]

  35. Missing values Complete Case Analysis
      [Figure: every case containing a missing value in any input is discarded, leaving far fewer usable cases]

  36. Missing values New Missing Values
      Fitted model: $\operatorname{logit}(\hat p) = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k$. A new case that arrives with a missing input has no usable predicted value: the fitted equation cannot be evaluated, so missing values must be handled at scoring time as well as during training.

  37. Missing values Missing Value Imputation
      [Table: example records in which each missing field has been filled in with an imputed value]

  38. Missing values Imputation + Indicators
      Incomplete data:    34 63  . 22 26 54 18  . 47 20   (median of the observed values = 30)
      Completed data:     34 63 30 22 26 54 18 30 47 20
      Missing indicator:   0  0  1  0  0  0  0  1  0  0

  39. Missing values Imputation + Indicators
      data develop1;
         set develop;
         /* create missing indicators: name the indicator variables */
         array mi{*} MIAcctAge MIPhone ... MICRScor;
         /* select the variables with missing values */
         array x{*} acctage phone ... crscore;
         do i=1 to dim(mi);
            mi{i} = (x{i} = .);
         end;
      run;

      /* impute missing values with the median */
      proc stdize data=develop1 reponly method=median out=imputed;
         var &inputs;
      run;
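
      A sketch of how the completed inputs and the indicators might then enter the model together (the indicator list is abbreviated here exactly as in the array statement above):

      proc logistic data=imputed;
         /* imputed inputs plus the MI* missing-value indicators */
         model ins(event='1') = &inputs MIAcctAge MIPhone MICRScor;
      run;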

  40. Missing values Cluster Imputation [at later lectures]
      [Figure: a case with X1 observed and X2 = ? — the missing X2 is imputed from the case's cluster]

  41. Categorical inputs

  42. Categorical Inputs Dummy Variables
      X    DA  DB  DC  DD
      D     0   0   0   1
      B     0   1   0   0
      C     0   0   1   0
      C     0   0   1   0
      A     1   0   0   0
      A     1   0   0   0
      D     0   0   0   1
      C     0   0   1   0
      A     1   0   0   0
      ...
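
      A minimal sketch of building these dummies in a data step (the dataset and the character input x are illustrative); in practice the CLASS statement in PROC LOGISTIC generates equivalent design variables automatically:

      data dummies;
         set develop;
         /* one 0/1 indicator per level of X */
         da = (x = 'A');
         db = (x = 'B');
         dc = (x = 'C');
         dd = (x = 'D');
      run;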

  43. Categorical Inputs Smarter Variables
      ZIP     Urbanicity  HomeVal  Local
      99801       1          75      1
      99622       2         100      1
      99523       1         150      1
      99523       1         150      0
      99737       3         150      1
      99937       3          75      1
      99533       2         100      1
      99523       1         150      0
      99622       3         100      1
      Replace the raw ZIP code with smarter derived variables such as Urbanicity, HomeVal, and Local.

  44. Categorical Inputs Quasi-Complete Separation
      Level (dummy)   y=0   y=1
      A (DA=1)         28     7
      B (DB=1)         16     0
      C (DC=1)         94    11
      D (DD=1)         23    21
      Level B has a zero cell (no y = 1 events), so the dummy DB separates the data quasi-completely: the maximum-likelihood estimate of its coefficient does not exist, and the fit will not converge properly.

  45. Categorical Inputs Clustering Levels
      Full table (A: 28/7, B: 16/0, C: 94/11, D: 23/21): χ² = 31.7 — 100% of the original association retained.

  46. Categorical Inputs Clustering Levels
      Merge B & C → BC (110/11): χ² = 30.7 — 97% retained.

  47. Categorical Inputs Clustering Levels
      Merge A & BC → ABC (138/18): χ² = 28.6 — 90% retained.

  48. Categorical Inputs Clustering Levels
      Merge ABC & D (161/39): χ² = 0 — 0% retained (a single level carries no association).
      Greenacre (1988, 1993). PROC MEANS – PROC CLUSTER – PROC TREE – ... HOME WORK
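
      A sketch of the Greenacre-style pipeline the slide assigns as homework (the categorical input branch and the three-cluster cut are illustrative; ins is the binary target used throughout):

      /* 1. Event proportion and frequency for each level */
      proc means data=develop noprint nway;
         class branch;
         var ins;
         output out=levels mean=prop;
      run;

      /* 2. Ward clustering of the level proportions, weighted by level frequency */
      proc cluster data=levels method=ward outtree=fortree;
         freq _freq_;
         var prop;
         id branch;
      run;

      /* 3. Cut the dendrogram at a chosen number of clusters */
      proc tree data=fortree nclusters=3 out=clus;
         id branch;
      run;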

  49. Variable Clustering

  50. Variable Clustering Redundancy
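
      The standard SAS tool for detecting and pruning redundancy is PROC VARCLUS; a minimal sketch (the maxeigen=0.7 stopping threshold is a common choice, not taken from the slides):

      proc varclus data=imputed maxeigen=0.7 short;
         var &inputs;
      run;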
