
Logistic Regression using SAS, prepared by Voytek Grus for the SAS user group, Halifax, February 24, 2006


Presentation Transcript


  1. Logistic Regression using SAS, prepared by Voytek Grus for the SAS user group, Halifax, February 24, 2006

  2. What is Logistic Regression?
     • Regression analysis in which the response variable Y is discrete and represents either categories or counts. There are no restrictions on the predictors.
     • A linear regression equation of the type $y_i = \alpha + \beta x_i + \varepsilon_i$ is not appropriate …
     • … but, as in linear regression analysis, logistic regression is used to
       • test the statistical significance of the relationship between response and predictor variables
       • predict the category of an outcome given its predictors
     • Falls into the category of generalized linear models and either complements or offers a flexible alternative to
       • Multiple linear regression: similar equations and statistical diagnostics
       • Contingency tables (cross tabulation)
       • Loglinear models
       • Discriminant analysis: logistic regression answers similar questions but is less restrictive
     • Relatively new statistical tool for the analysis of categorical data
       • Contingency tables: 1900s
       • Regression analysis: 1970s
       • Loglinear models: 1975
       • Logistic regression: late 1970s and early 1980s, but it became more popular in the 1990s

  3. Fields of application
     • Health sciences: questions about disease, yes or no?
     • Social sciences: deal with many dichotomous variables (employed vs. unemployed, married vs. unmarried, etc.)
       • Attitude to work as based on demographic or behavioral predictors
       • Racial bias in judicial decisions, etc.
     • Political science:
       • Which party will voters vote for, and why?
       • Which voters will vote for a particular party?
       • Public opinion polls
     • Used in economics and marketing to study consumer choice
       • Banks use it to assess the credit rating of customers
       • Some regulators require that utilities submit customer choice studies on energy conservation options
       • Choice of mode of transportation
     • Used in demand forecasting

  4. PART I Conceptual Framework of Logistic Regression

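  5. Why not use OLS to estimate the categorical response equation?
     • Multiple linear regression of a categorical response variable violates two of the linear-model assumptions needed to produce unbiased and efficient coefficients (homoscedasticity and normality of the errors):
       • Linearity in the coefficients (holds): $y_i = \alpha + \beta x_i + \varepsilon_i$
       • Zero-mean errors (holds): $E(\varepsilon_i) = 0$
       • Homoscedasticity (violated): $\mathrm{var}(\varepsilon_i) \neq \sigma^2$
         • $E(y_i) = 1 \cdot P(y_i = 1) + 0 \cdot P(y_i = 0) = p_i = \alpha + \beta x_i$
         • $\mathrm{var}(\varepsilon_i) = \mathrm{var}(y_i) = p_i (1 - p_i) = (\alpha + \beta x_i)(1 - \alpha - \beta x_i)$, which changes with $x_i$
       • Uncorrelated errors (holds): $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$
       • Normality (violated): each $y_i$ is binomial (a single Bernoulli trial), so the errors are not normally distributed
         • The errors take on only two values, $\varepsilon_i = 1 - \alpha - \beta x_i$ or $\varepsilon_i = 0 - \alpha - \beta x_i$, and are bounded (between $-1$ and $1$ whenever $p_i$ lies in $[0, 1]$).
     • As a result
       • Coefficient estimates are no longer efficient
       • Standard error estimates are no longer consistent
       • Estimated values of the response variable Y may be implausible: the binary regression is a linear probability model, $E(y_i) = p_i = \alpha + \beta x_i$, but the linear function is unbounded, so estimates can fall outside the (0, 1) interval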
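The last point on this slide can be checked numerically. The following is a minimal Python sketch, not SAS output: the ten data points are hypothetical, and the fit uses the textbook closed-form OLS solution for one predictor. It shows fitted "probabilities" leaving the (0, 1) interval and an implied error variance $p_i(1 - p_i)$ that changes with $x_i$.

```python
# Hypothetical toy data: x is a predictor, y is a 0/1 response.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form OLS estimates for y = alpha + beta * x + error.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
beta = sxy / sxx
alpha = y_bar - beta * x_bar

# Fitted values of the linear probability model and the variance they imply.
fitted = [alpha + beta * x for x in xs]
implied_var = [p * (1 - p) for p in fitted]

for x, p, v in zip(xs, fitted, implied_var):
    flag = "" if 0 <= p <= 1 else "  <- outside [0, 1]"
    print(f"x={x:2d}  fitted p={p:7.4f}  p(1-p)={v:7.4f}{flag}")
```

At the extremes of x the fitted value is not a valid probability, and the implied variance even turns negative there, which is one way to see why the linear specification is implausible for a binary response.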

  6. Logit transformation: a remedy for the violation of OLS assumptions
     • Instead of estimating the linear equation $y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$, we can apply the logit transformation $\log[p_i / (1 - p_i)] = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}$, where $p_i / (1 - p_i)$ is the odds that the event $y = 1$ will occur.
     • Consequences:
       • $p_i = \exp(\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}) / (1 + \exp(\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}))$, which is the cumulative logistic distribution function.
       • No matter what the coefficients are, $p_i$ is always between 0 and 1.
       • The absence of $\varepsilon_i$ complicates statistical analysis: how should standardized coefficients be defined?
       • The derivative of $p_i$ with respect to $x_i$ is a function of $p_i$: $\partial p_i / \partial x_i = \beta p_i (1 - p_i)$. It reflects the changing slope of the S-curve and makes interpretation of the coefficients difficult, so be cautious when interpreting coefficients from the probability perspective.
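The consequences listed on this slide can be verified with a few lines of Python. This is a sketch under stated assumptions: the coefficients `alpha` and `beta` are hypothetical, there is a single predictor, and the helper names are chosen for illustration only.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Cumulative logistic distribution function."""
    return 1 / (1 + math.exp(-eta))

alpha, beta = -3.0, 0.8  # hypothetical coefficients

def prob(x):
    """P(y = 1 | x) under the logit model."""
    return inv_logit(alpha + beta * x)

def slope(x):
    """Analytic derivative dp/dx = beta * p * (1 - p)."""
    p = prob(x)
    return beta * p * (1 - p)

# Probabilities stay strictly inside (0, 1) for any x.
probs = [prob(x) for x in range(-20, 21)]

# Central finite difference to confirm the analytic slope at one point.
h = 1e-6
x0 = 2.0
numeric = (prob(x0 + h) - prob(x0 - h)) / (2 * h)

print(min(probs), max(probs))
print(slope(x0), numeric)
print(slope(3.75))  # slope where p = 0.5, i.e. beta / 4
```

The slope is largest where $p_i = 0.5$ (there it equals $\beta / 4$) and shrinks toward zero in both tails, which is why a single coefficient cannot be read as a constant change in probability.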

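  7. Alternatives to the logit transformation in the context of latent variables: probit and complementary log-log
     • In a perfect world there is a model for a continuous response variable $z_i$; the dichotomous logit model is only a simplification of it. There is a true equation $z_i = \alpha_0 + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \dots + \alpha_k x_{ik} + \sigma \varepsilon_i$, but it cannot be observed: it is latent. Instead we observe a dichotomous $y_i$ whose value (1 or 0) depends on the latent $z_i$, typically $y_i = 1$ when $z_i$ exceeds a threshold. The relationship of Y with the predictors X depends on the probability distribution of $\varepsilon$: a logistic $\varepsilon$ gives the logit model, a standard normal $\varepsilon$ gives the probit model, and an extreme-value $\varepsilon$ gives the complementary log-log model.
     • The assumed distribution of $\varepsilon$ helps determine standardized coefficients.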
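To make the three alternatives concrete, here is a small Python sketch of the inverse link functions that map a linear predictor to $P(y = 1)$. The function names are illustrative, and the grid of values is arbitrary.

```python
import math

def logit_prob(eta):
    """Logit model: logistic CDF."""
    return 1 / (1 + math.exp(-eta))

def probit_prob(eta):
    """Probit model: standard normal CDF, written with the error function."""
    return 0.5 * (1 + math.erf(eta / math.sqrt(2)))

def cloglog_prob(eta):
    """Complementary log-log model: 1 - exp(-exp(eta))."""
    return 1 - math.exp(-math.exp(eta))

for eta in (-2, -1, 0, 1, 2):
    print(f"eta={eta:2d}  logit={logit_prob(eta):.4f}  "
          f"probit={probit_prob(eta):.4f}  cloglog={cloglog_prob(eta):.4f}")
```

The logit and probit curves are symmetric around a linear predictor of 0, where both give probability 0.5; the complementary log-log curve is asymmetric, approaching 1 faster than it approaches 0.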

  8. Logistic Regression in the context of the generalized linear models.

  9. I Logistic Regression compared to ordinary linear regression

  10. II Logistic Regression compared to ordinary linear regression

  11. PART II SAS Application of Logistic Regression

  12. Summary of SAS procedures for logistic regression analysis
     • Binary logit analysis:
       • PROC LOGISTIC, PROC GENMOD, PROC CATMOD, PROC PROBIT, PROC MDC, PROC NLMIXED
     • Multinomial logit analysis
       • Predictors are characteristics of the individual
       • Nominal (no ordering of the Y categories): PROC LOGISTIC; PROC CATMOD
       • Ordinal (inherent ordering of the Y categories): PROC LOGISTIC; PROC CATMOD; PROC GENMOD
     • Conditional logit analysis
       • Predictors are characteristics of the response alternatives (the choices)
       • Can use PROC MDC and PROC PHREG
     • Logit analysis of clustered data:
       • PROC LOGISTIC (or PROC PHREG)
       • PROC GENMOD (GEE)

  13. Binary Logit Models
     • PROC LOGISTIC at its simplest: main-effects model
       1. Individual-level data:
            PROC LOGISTIC DATA=input;
              FREQ frequency; /* optional */
              MODEL y = X1 X2;
            RUN;
       2. Grouped data:
            PROC LOGISTIC DATA=input;
              MODEL events/trials = X1 X2;
            RUN;
     • PROC LOGISTIC with more features
            PROC LOGISTIC DATA=lrdata.penalty DESCENDING;
              CLASS culp;
              MODEL death = blackd|whitvic|culp / STB LACKFIT AGGREGATE RSQ LINK=logit TECHNIQUE=newton
                    CLODDS=BOTH /* profile-likelihood (PL) and Wald limits */
                    SELECTION=stepwise SCALE=WILLIAMS CORRB INFLUENCE IPLOTS;
              UNITS culp=2 / DEFAULT=1;
              OUTPUT OUT=results PRED=phat LOWER=lb UPPER=up RESCHI=stres DFBETAS=dfs;
            RUN;
     • PROC GENMOD at its simplest
            PROC GENMOD DATA=lrdata.penalty;
              MODEL y = X1 X2 / DIST=binomial;
            RUN;
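The TECHNIQUE=newton option on this slide asks for Newton-Raphson maximization of the likelihood. As a rough illustration of what that iteration does, and not a reproduction of SAS's implementation, here is a Python sketch for one predictor. The data set is hypothetical, and the 2x2 system is solved with the closed-form matrix inverse.

```python
import math

# Hypothetical individual-level data: one predictor x, binary response y.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

def score_and_information(a, b):
    """Gradient of the log-likelihood and the 2x2 information matrix."""
    g_a = g_b = i_aa = i_ab = i_bb = 0.0
    for x, y in zip(xs, ys):
        p = inv_logit(a + b * x)
        w = p * (1 - p)
        g_a += y - p
        g_b += (y - p) * x
        i_aa += w
        i_ab += w * x
        i_bb += w * x * x
    return (g_a, g_b), (i_aa, i_ab, i_bb)

a, b = 0.0, 0.0  # start from the null model
for iteration in range(25):
    (g_a, g_b), (i_aa, i_ab, i_bb) = score_and_information(a, b)
    det = i_aa * i_bb - i_ab ** 2
    # Newton step: solve (information) * delta = gradient.
    d_a = (i_bb * g_a - i_ab * g_b) / det
    d_b = (i_aa * g_b - i_ab * g_a) / det
    a, b = a + d_a, b + d_b
    if abs(d_a) + abs(d_b) < 1e-10:
        break

final_score, _ = score_and_information(a, b)
print(f"intercept={a:.4f}  slope={b:.4f}  iterations={iteration + 1}")
```

For the logit link the observed and expected information coincide, so this Newton step is also a Fisher scoring step. At convergence the gradient is numerically zero, which is the condition the maximum likelihood estimates satisfy.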

  14. Multinomial Logit Models
     • Multinomial logit for a nominal response (generalized logit)
       • Applying the binary logit transformation $\log(p_i / (1 - p_i))$ separately to each of more than two categories does not work, because the resulting category probabilities are not constrained to sum to 1: $\sum_{j=1}^{k} p_{ij} \neq 1$
       • Instead, $k - 1$ equations are estimated: $\log(p_{ij} / p_{ik}) = \alpha_j + \beta_j x_i$, where $j = 1, 2, \dots, k - 1$.
     • Multinomial logit for an ordinal response (cumulative, adjacent categories, continuation ratio)
       • The inherent ordering of the Y responses makes it possible to relax the need for multiple odds equations with separate coefficients.
       • Estimate $k - 1$ equations for the odds of the cumulative probabilities $F_{ij}$
       • $\log(F_{ij} / (1 - F_{ij})) = \alpha_j + \beta x_i$: all coefficients except the intercept stay the same across equations
       • Because there is a hierarchy in the categories of the response variable
         • The model is easier to estimate and interpret
         • Hypothesis tests are more powerful
         • There is one coefficient for each predictor but $k - 1$ intercepts
     • Available tools in SAS:
       1. PROC LOGISTIC
            PROC LOGISTIC DATA=lrdata.wallet;
              MODEL wallet = male business punish explain / LINK=glogit; /* or LINK=clogit */
            RUN;
       2. PROC CATMOD
            PROC CATMOD DATA=lrdata.wallet;
              DIRECT male business punish explain;
              MODEL wallet = male business punish explain / NOITER PRED;
            RUN;
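The two sets of equations on this slide can be turned into category probabilities as follows. This Python sketch assumes a single predictor and three response categories; all intercepts and slopes are hypothetical, and the function names are illustrative.

```python
import math

def generalized_logit_probs(x, alphas, betas):
    """Nominal response: k-1 equations log(p_j / p_k) = alpha_j + beta_j * x,
    with the last category k as the reference."""
    etas = [a + b * x for a, b in zip(alphas, betas)]
    denom = 1 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]
    probs.append(1 / denom)  # reference category
    return probs

def cumulative_logit_probs(x, alphas, beta):
    """Ordinal response: k-1 equations log(F_j / (1 - F_j)) = alpha_j + beta * x,
    with increasing intercepts and one shared slope."""
    cum = [1 / (1 + math.exp(-(a + beta * x))) for a in alphas]
    cum.append(1.0)  # F_k = 1
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical parameters for k = 3 categories.
nominal = generalized_logit_probs(1.0, alphas=[0.5, -0.2], betas=[0.3, -0.4])
ordinal = cumulative_logit_probs(1.0, alphas=[-1.0, 0.5], beta=0.7)

print(nominal, sum(nominal))
print(ordinal, sum(ordinal))
```

In both parameterizations the category probabilities sum to 1 by construction. The nominal model needs a separate slope per equation, while the ordinal model gets by with one slope and $k - 1$ intercepts.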

  15. Conditional Logit Models
     • Consumer choice studies
       • Consumer taste preferences, choice of mode of transportation, locational characteristics for a retail store
       • Conditional logit:
            proc mdc;
              model decision = x1 x2 / type=clogit choice=(mode 1 2 3);
              id pid;
            run;
       • Nested logit:
            proc mdc data=newdata;
              model decision = ttime / type=nlogit choice=(mode 1 2 3) covest=hess;
              id pid;
              utility u(1,) = ttime;
              nest level(1) = (1 2 @ 1, 3 @ 2),
                   level(2) = (1 2 @ 1);
            run;
     • Analysis of clustered data
       • Observations within clusters are often dependent: longitudinal data, students clustered in classrooms or schools, husbands and wives clustered in families, etc.
       • Dependent observations produce underestimated standard errors, overestimated test statistics, and inefficient coefficient estimates.
       • Remedies: use GEE (PROC GENMOD), conditional logit (PROC LOGISTIC or PROC PHREG), or other methods such as mixed models or hybrids of the above.
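For the conditional logit model on this slide, the predictors describe the alternatives, and each alternative's choice probability is its exponentiated utility divided by the sum over all alternatives. The Python sketch below uses the standard textbook formula; the three travel modes, their attributes (travel time, cost), and the coefficients are all hypothetical.

```python
import math

def conditional_logit_probs(alternatives, beta):
    """Choice probabilities P(j) = exp(x_j . beta) / sum_m exp(x_m . beta)."""
    utilities = [sum(b * v for b, v in zip(beta, attrs)) for attrs in alternatives]
    top = max(utilities)  # subtract the maximum for numerical stability
    expu = [math.exp(u - top) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]

# Hypothetical attributes per mode: (travel time in hours, cost).
modes = [(1.0, 5.0), (0.5, 8.0), (2.0, 2.0)]
beta = (-1.2, -0.1)  # hypothetical disutility of time and of cost

probs = conditional_logit_probs(modes, beta)
print(probs)

# The odds between two alternatives do not depend on the third one.
print(probs[0] / probs[1], math.exp(-1.7 - (-1.4)))
```

The last line illustrates the independence of irrelevant alternatives implied by this model: the ratio of two choice probabilities depends only on those two alternatives' utilities. Relaxing that restriction is the usual motivation for the nested logit specification.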

  16. Consumer Choice Modeling: Nested Logit Example
     • Example
            proc mdc data=travel2 maxit=200 outest=a;
              model choice = ttime time cost / type=nlogit choice=(mode);
              id id;
              utility u(1, 1 2 3 @ 1) = ttime time cost,
                      u(1, 4 @ 2) = time cost;
              nest level(1) = (1 2 3 @ 1, 4 @ 2),
                   level(2) = (1 2 @ 1);
            run;
     • [Figure: decision tree showing the Level 2 and Level 1 nests]

  17. Literature
     • Paul D. Allison, Logistic Regression Using the SAS System (4th edition, August 2003)
     • Maura E. Stokes, Charles S. Davis, Gary G. Koch, Categorical Data Analysis Using the SAS System (4th edition, January 2005)
     • B. Tabachnick, Multivariate Statistical Methods (1996)
     • SAS Help examples

  18. Questions?
