Logistic Regression


## PowerPoint Slideshow about 'Logistic Regression' - ashlyn

Remember?

Applications: Prediction vs. Explanatory Analysis

Prediction:

• The terms in the model, the values of their coefficients, and their statistical significance are of secondary importance.
• The focus is on producing a model that best predicts future values of Y as a function of the Xs. The predicted value of Y is given by the fitted model's formula.

Explanatory analysis:

• The focus is on understanding the relationship between the dependent variable and the independent variables.
• Consequently, the statistical significance of the coefficients matters, as do their magnitudes and signs.

Logistic Regression

Example Problems

• Target Marketing
• Attrition Prediction
• Credit Scoring
• Fraud Detection

Logistic regression

Regression and other models

Logistic regression

Types of Logistic Regression

Logistic regression

Supervised (binary) Classification

(Binary) Target

Input Variables

Logistic regression

Problem and Data

• Target: did the customer purchase a variable annuity product? (1 = yes, 0 = no)
• Inputs: other product usage in a three-month period; demographics
• Size: ~32,000 observations, 47 variables


Analytical Challenges

Opportunistic Data

• Operational / observational
• Massive
• Errors and outliers (2+2=5)
• Missing values
• Analytical data preparation step:
  • Benchmark: 80/20
  • [My] life: 99/1

Analytical Challenges

Mixed Measurement Scales

sales, executive, homemaker, ...

88.60, 3.92, 34890.50, 45.01, ...

F, D, C, B, A

0, 1, 2, 3, 4, 5, 6, ...

M, F

27513, 21737, 92614, 10043, ...

Analytical Challenges

High Dimensionality

Analytical Challenges

Rare Target Event

| Event | No Event |
|---|---|
| respond | not respond |
| churn | stay |
| default | pay off |
| fraud | legitimate |

Analytical Challenges

Nonlinearities and Interactions

[Figure: two surfaces of E(y) over the inputs x1 and x2, one labeled Linear and one labeled Nonlinear]

Analytical Challenges

Model Selection

[Figure: three fitted models labeled Underfitting, Overfitting, and Just Right]

Logistic REGRESSION

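Functional Form

$$\operatorname{logit}(p_i) = \ln\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}$$

where $p_i$ is the posterior probability, the $\beta$s are the parameters, and the $x$s are the inputs.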

Logistic REGRESSION

[Figure: the logit scale, running from smaller to larger values as p_i goes from 0 to 1]

Logistic REGRESSION

The Fitted Surface

Logistic REGRESSION

LOGISTIC Procedure

proc logistic data=develop
      plots(only)=(effect(clband x=(ddabal depamt checks res))
                   oddsratio(type=horizontalstat));
   class res (param=ref ref='S');
   /* the list of inputs after "=" is not shown on the slide */
   model ins(event='1') = … ;
   units ddabal=1000 depamt=1000 / default=1;
   oddsratio 'Comparisons of Residential Classification' res / diff=all cl=pl;
run;

Logistic REGRESSION

Properties of the Odds Ratio

Odds ratio scale (0 to ∞):

• Between 0 and 1: the group in the denominator has higher odds of the event
• Equal to 1: no association
• Greater than 1: the group in the numerator has higher odds of the event

Estimated logistic regression model:

$$\operatorname{logit}(p) = -0.7567 + 0.4373 \cdot \text{gender}$$

where females are coded 1 and males are coded 0

Estimated odds ratio (Females to Males):

$$\text{odds ratio} = \frac{e^{-0.7567 + 0.4373}}{e^{-0.7567}} = e^{0.4373} = 1.55$$
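The arithmetic behind the odds ratio can be checked with a few lines of Python. This is only an illustrative sketch, not part of the SAS workflow in the deck; the negative sign on the intercept is an assumption read off the exponent in the odds ratio formula, and it cancels out of the ratio in any case.

```python
import math

# Coefficients as read from the slide. The sign of the intercept is an
# assumption; it cancels out of the odds ratio.
INTERCEPT = -0.7567
BETA_GENDER = 0.4373  # females coded 1, males coded 0


def odds(logit):
    """Convert a logit (log-odds) into odds."""
    return math.exp(logit)


odds_female = odds(INTERCEPT + BETA_GENDER * 1)
odds_male = odds(INTERCEPT + BETA_GENDER * 0)

# The intercept cancels, leaving exp(BETA_GENDER).
odds_ratio = odds_female / odds_male
print(round(odds_ratio, 2))  # 1.55
```

Because the intercept cancels, the odds ratio for a 0/1 input is always the exponentiated coefficient of that input.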

Logistic REGRESSION

Results from oddsratio

oddsratio 'Comparisons of Residential Classification' res / diff=all cl=pl;

Logistic REGRESSION

Results from PLOTS =(EFFECT(…

plots(only)=(effect(clband x=(ddabal depamt checks res))

Logistic REGRESSION

Logistic Discrimination

oversampling

Sampling Designs

• Joint sampling: cases (x,y) are drawn from the population as pairs, giving a sample such as {(x,y),(x,y),(x,y),(x,y)}
• Separate sampling: cases are drawn separately from the y = 0 and y = 1 groups, giving a sample such as {(x,0),(x,0),(x,1),(x,1)}

oversampling

The Effect of Oversampling

oversampling

Offset

Two ways to adjust:

• Include an offset parameter in the model: model … / offset=X
• Adjust the probabilities output by the model

[Legend: event proportion in the population versus in the sample]

oversampling

Adjusting the Probabilities

/* Specify the prior probability to correct for oversampling */
%let pi1=.02;

/* Correct predicted probabilities */
/* (no MODEL statement is shown on the slide) */
proc logistic data=develop;
   score data=pmlr.new out=scored priorevent=&pi1;
run;
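Both adjustments can be sketched numerically. The Python below is a hedged illustration, not the deck's SAS code: the symbols `pi1` (event proportion in the population) and `rho1` (event proportion in the oversampled sample) are assumed names, the offset is taken to be ln(ρ1·π0 / (ρ0·π1)), and the 50% sample proportion is a hypothetical value. Only `pi1 = .02` comes from the slide.

```python
import math


def oversampling_offset(pi1, rho1):
    """Offset ln((rho1 * pi0) / (rho0 * pi1)) for a model fit on an oversampled sample."""
    pi0, rho0 = 1.0 - pi1, 1.0 - rho1
    return math.log((rho1 * pi0) / (rho0 * pi1))


def correct_probability(p_sample, pi1, rho1):
    """Rescale a posterior probability from the sample priors to the population priors."""
    pi0, rho0 = 1.0 - pi1, 1.0 - rho1
    numerator = p_sample * rho0 * pi1
    return numerator / (numerator + (1.0 - p_sample) * rho1 * pi0)


def logit(p):
    return math.log(p / (1.0 - p))


PI1 = 0.02   # population prior, as in %let pi1=.02
RHO1 = 0.50  # hypothetical event proportion in the oversampled data

p_corrected = correct_probability(0.5, PI1, RHO1)
print(round(p_corrected, 4))  # 0.02
```

The two routes agree: subtracting the offset from the logit of the sample-based probability gives the logit of the corrected probability.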

Missing values

Does Pr(missing) Depend on the Data?

• No
  • MCAR (missing completely at random)
• Yes, it depends on
  • that unobserved value
  • other unobserved values
  • other observed values (including the target)

Missing values

Complete Case Analysis

[Figure: data grid with cases as rows and input variables as columns; only complete cases are used]

Missing values

New Missing Values

[Figure: a fitted model, a new case with a missing input value, and the resulting predicted value]

Missing values

Missing Value Imputation

Missing values

Imputation + Indicators

[Figure: incomplete data becomes completed data plus a missing indicator]

data develop1; /* Create missing indicators */
   set develop;
   /* name the missing indicator variables */
   array mi{*} MIAcctAg MIPhone … MICRScor;
   /* select variables with missing values */
   array x{*} acctage phone … crscore;
   do i=1 to dim(mi);
      mi{i}=(x{i}=.);   /* 1 if the input is missing, 0 otherwise */
   end;
run;

proc stdize data=develop1
            reponly        /* replace missing values only */
            method=median  /* Impute missing values with the median */
            out=imputed;
   var &inputs;
run;
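The same two steps (flag the missing values, then replace them with the median) can be sketched outside SAS. The Python below is an illustrative sketch only: the `MI_` prefix, the function name, and the toy rows are assumptions, with `None` standing in for the SAS missing value.

```python
from statistics import median


def impute_with_indicators(rows, columns):
    """Add a 0/1 missing indicator per column, then fill gaps with the column median."""
    medians = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        medians[col] = median(observed)  # raises if a column has no observed values

    completed = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            is_missing = row[col] is None
            new_row["MI_" + col] = 1 if is_missing else 0
            if is_missing:
                new_row[col] = medians[col]
        completed.append(new_row)
    return completed


# Toy data for illustration only (not from the deck's data set).
rows = [
    {"acctage": 2.0, "crscore": 650},
    {"acctage": None, "crscore": 700},
    {"acctage": 6.0, "crscore": None},
    {"acctage": 4.0, "crscore": 720},
]
completed = impute_with_indicators(rows, ["acctage", "crscore"])
```

The indicators are created from the original values before any value is replaced, which matches the order of the two steps above.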

[Figure: X1 observed, X2 = ? (missing)]

Missing values

Cluster Imputation [in later lectures]

Categorical Inputs

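Dummy Variables

| Level | DA | DB | DC | DD |
|---|---|---|---|---|
| A | 1 | 0 | 0 | 0 |
| B | 0 | 1 | 0 | 0 |
| C | 0 | 0 | 1 | 0 |
| D | 0 | 0 | 0 | 1 |

Categorical Inputs

Smarter Variables

[Table: ZIP codes (99801, 99622, 99523, 99737, 99937, 99533, ...) replaced by attributes of the area such as Urbanicity, HomeVal, Local, ...]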

Categorical Inputs

Quasi-Complete Separation

Categorical Inputs

Clustering Levels


Greenacre (1988, 1993): PROC MEANS, PROC CLUSTER, PROC TREE, … (homework)

Variable Clustering

PROC VARCLUS [later lecture]

Checking Deposits

Mortgage Balance

Number of Checks

Teller Visits

Credit Card Balance

Age

Variable Screening

Univariate Screening

Variable Screening

Univariate Smoothing

Empirical Logits

where $m_i$ = number of events and $M_i$ = number of cases
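The formula for the empirical logit did not survive in this transcript. A minimal sketch, assuming one common smoothed form, ln((m_i + √M_i / 2) / (M_i − m_i + √M_i / 2)), is shown below in Python; the exact form used on the slide may differ, and the function name is hypothetical.

```python
import math


def empirical_logit(events, cases):
    """Smoothed log-odds for one bin: events = m_i, cases = M_i.

    The sqrt(M_i)/2 term (an assumed smoothing choice) keeps the value
    finite when a bin contains no events or only events.
    """
    half = math.sqrt(cases) / 2.0
    return math.log((events + half) / (cases - events + half))


print(empirical_logit(50, 100))  # 0.0
```

Plotting this value per bin against the bin's average input value gives an empirical logit plot; a roughly straight line suggests the input can enter the model linearly.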

1. Hand-Crafted New Input Variables

2. Polynomial Models

3. Flexible Multivariate Function Estimators

4. Do Nothing

Empirical Logit Plots

Subset Selection

Scalability in PROC LOGISTIC

[Figure: time versus number of variables (25 to 200) for All Subsets, Stepwise, and Fast Backward selection]

Honest Assessment

The Optimism Principle

[Figure: a classifier separating gray and black cases over x1 and x2; accuracy is 70% on the training data and 47% on the test data]

Honest Assessment

Data Splitting

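[Figure: the data split into Training, Validation, and Test sets]

Honest Assessment

Other Approaches

Five-fold cross-validation on partitions A to E:

| Fold | Train | Validate |
|---|---|---|
| 1) | BCDE | A |
| 2) | ACDE | B |
| 3) | ABDE | C |
| 4) | ABCE | D |
| 5) | ABCD | E |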
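The rotation over partitions A to E (train on four, validate on the fifth) can be sketched in a few lines of Python. The function name is hypothetical and the partitions are represented only by their labels.

```python
def k_fold_splits(parts):
    """Yield (train, validate) pairs, holding out each partition exactly once."""
    for held_out in parts:
        train = [p for p in parts if p != held_out]
        yield train, held_out


# Join the training labels so the result reads like the five-row scheme.
splits = [("".join(train), validate) for train, validate in k_fold_splits("ABCDE")]
print(splits)
# [('BCDE', 'A'), ('ACDE', 'B'), ('ABDE', 'C'), ('ABCE', 'D'), ('ABCD', 'E')]
```

Each case is used for validation once and for training four times, so the five validation results can be averaged into a single assessment.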

Misclassification

Confusion Matrix

| | Predicted Class 0 | Predicted Class 1 | Total |
|---|---|---|---|
| Actual Class 0 | True Negative | False Positive | Actual Negative |
| Actual Class 1 | False Negative | True Positive | Actual Positive |
| Total | Predicted Negative | Predicted Positive | |

Sensitivity and Positive Predicted Value

• Sensitivity = True Positive / Actual Positive
• Positive predicted value = True Positive / Predicted Positive
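As a small illustration, the ratios can be computed from the four cells of a confusion matrix. The Python sketch below uses the sample counts from the oversampled test set elsewhere in this deck (29, 21, 17, 33); the function name and dictionary keys are assumptions.

```python
def classification_metrics(tn, fp, fn, tp):
    """Ratios taken from the four cells of a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives / actual positives
        "specificity": tn / (tn + fp),  # true negatives / actual negatives
        "ppv": tp / (tp + fp),          # true positives / predicted positives
    }


metrics = classification_metrics(tn=29, fp=21, fn=17, tp=33)
print(metrics["sensitivity"], metrics["specificity"])  # 0.66 0.58
```

Sensitivity and specificity are computed within each actual class, so they are unaffected by the class proportions in the sample; the positive predicted value is computed within a predicted class and does depend on them.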

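Oversampled Test Set

Sample:

| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 29 | 21 | 50 |
| Actual 1 | 17 | 33 | 50 |
| Total | 46 | 54 | |

Population:

| | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 56 | 41 | 97 |
| Actual 1 | 1 | 2 | 3 |
| Total | 57 | 43 | |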

| | Predicted Class 0 | Predicted Class 1 | Total |
|---|---|---|---|
| Actual Class 0 | π0·Sp | π0(1 − Sp) | π0 |
| Actual Class 1 | π1(1 − Se) | π1·Se | π1 |
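The adjustment can be checked against the sample and population tables of the oversampled test set. The Python sketch below takes Se = 33/50 and Sp = 29/50 from the sample table and assumes π1 = 0.03, a value inferred here from the population row totals of 97 and 3; the function name is hypothetical.

```python
def adjust_confusion_matrix(sensitivity, specificity, pi1):
    """Cell proportions in the population, given the priors pi1 and pi0 = 1 - pi1."""
    pi0 = 1.0 - pi1
    return {
        "tn": pi0 * specificity,
        "fp": pi0 * (1.0 - specificity),
        "fn": pi1 * (1.0 - sensitivity),
        "tp": pi1 * sensitivity,
    }


population = adjust_confusion_matrix(sensitivity=33 / 50, specificity=29 / 50, pi1=0.03)

# Expressed per 100 cases and rounded, for comparison with the population table.
per_100 = {cell: round(100 * share) for cell, share in population.items()}
print(per_100)  # {'tn': 56, 'fp': 41, 'fn': 1, 'tp': 2}
```

The positive predicted value drops sharply after the adjustment (2 of 43 predicted positives instead of 33 of 54), which is why measures computed on an oversampled test set should be corrected before they are interpreted.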

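Total Profit

Profit matrix (per case):

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | \$0 | -\$1 |
| Actual 1 | \$0 | \$99 |

Three confusion matrices and their total profit:

| Actual 0, Predicted 0 | Actual 0, Predicted 1 | Actual 1, Predicted 0 | Actual 1, Predicted 1 | Total profit |
|---|---|---|---|---|
| 70 | 5 | 9 | 16 | 16*99 - 5 = \$1579 |
| 66 | 9 | 4 | 21 | 21*99 - 9 = \$2070 |
| 57 | 18 | 1 | 24 | 24*99 - 18 = \$2358 |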
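The total profit of each confusion matrix is the cell counts weighted by the profit matrix. A short Python sketch reproduces the three totals; the dictionary keys (`tn`, `fp`, `fn`, `tp`) and the function name are assumptions used only for illustration.

```python
# Profit per case, from the profit matrix: only predicted positives cost or earn money.
PROFIT = {"tn": 0, "fp": -1, "fn": 0, "tp": 99}


def total_profit(counts, profit=PROFIT):
    """Sum of cell count times per-case profit over the four cells."""
    return sum(counts[cell] * profit[cell] for cell in profit)


candidates = [
    {"tn": 70, "fp": 5, "fn": 9, "tp": 16},
    {"tn": 66, "fp": 9, "fn": 4, "tp": 21},
    {"tn": 57, "fp": 18, "fn": 1, "tp": 24},
]
profits = [total_profit(c) for c in candidates]
print(profits)  # [1579, 2070, 2358]
```

The third matrix has the most false positives yet the highest profit, because a false positive costs \$1 while a true positive earns \$99.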

Allocation Rules

Profit Matrix

[Profit matrix: rows are Actual Class (0, 1), columns are Decision (0, 1)]

Bayes Rule: Decision 1 if the expected profit of decision 1 exceeds the expected profit of decision 0.
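Comparing the expected profit of the two decisions turns a profit matrix into a cutoff on the posterior probability. The Python sketch below is an illustration under stated assumptions: it uses the \$0 / -\$1 / \$0 / \$99 values from the Total Profit slide, the function names are hypothetical, and the closed-form cutoff is derived here rather than copied from the slide.

```python
def profit_cutoff(tn, fp, fn, tp):
    """Posterior probability above which decision 1 has the higher expected profit.

    Decision 1 is better when p*tp + (1-p)*fp > p*fn + (1-p)*tn, that is,
    when p > (tn - fp) / ((tn - fp) + (tp - fn)). Assumes the denominator is positive.
    """
    gain_if_event = tp - fn      # extra profit from decision 1 when the event occurs
    loss_if_nonevent = tn - fp   # profit given up by decision 1 when it does not
    return loss_if_nonevent / (loss_if_nonevent + gain_if_event)


def decide(p, cutoff):
    """Return decision 1 when the posterior probability exceeds the cutoff."""
    return 1 if p > cutoff else 0


cutoff = profit_cutoff(tn=0, fp=-1, fn=0, tp=99)
print(cutoff)  # 0.01
```

With these profits the cutoff is far below 0.5, so cases with even a small posterior probability of the event are allocated to decision 1.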

Allocation Rules

Classifier Performance

Allocation Rules

Using Profit to Assess Fit

Overall Predictive Power

Class Separation

Overall Predictive Power

Area under the ROC Curve
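The area under the ROC curve equals the share of (event, non-event) pairs in which the event case receives the higher score, with ties counted as one half. A minimal Python sketch of that pairwise definition follows; the scores are made-up illustration values and the function name is hypothetical.

```python
def auc(scores_events, scores_nonevents):
    """Share of (event, non-event) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for e in scores_events:
        for n in scores_nonevents:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(scores_events) * len(scores_nonevents))


# Made-up predicted probabilities, for illustration only.
events = [0.9, 0.6, 0.4]
nonevents = [0.5, 0.3, 0.4, 0.1]
print(auc(events, nonevents))  # 0.875
```

The double loop is quadratic in the number of cases, which is fine for a sketch; rank-based formulas give the same value on large data sets.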

ROC and ROCCONTRAST Statements

ROC <'label'> <specification> </ options>;

ROCCONTRAST <'label'> <contrast> </ options>;