Chocolate Cake Seminar Series on Statistical Applications

Today’s Talk: Multinomial Logistic Regression Models, Your Turn! By Dr. Olga Korosteleva

Outline of Presentation
• Review of binary logistic regression model
• Cumulative logit model for ordinal outcome
• Generalized logit model for nominal outcome

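Review: Binary Logistic Regression Model
• A binary logistic regression model is used when the outcome variable $y$ assumes only two values (0 and 1, say).
• The model with predictors $x_1, \dots, x_k$ has the form
$$\ln \frac{P(y=1)}{1-P(y=1)} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,$$
where the fraction $P(y=1)/(1-P(y=1))$ is referred to as the odds in favor of $y=1$, and $\beta_0, \beta_1, \dots, \beta_k$ are unknown regression coefficients.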
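As a small illustration of how probabilities, odds, and logits relate in the binary model, here is a minimal Python sketch. The coefficient values and the single predictor are hypothetical, chosen only to show the arithmetic; they do not come from any fitted model in this talk.

```python
import math

def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds of an event with probability p."""
    return math.log(odds(p))

def inverse_logit(eta):
    """Probability that corresponds to the log-odds eta."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients (assumptions for illustration only).
beta0, beta1 = -1.0, 0.5
x1 = 2.0

eta = beta0 + beta1 * x1   # linear predictor on the log-odds scale
p = inverse_logit(eta)     # P(y = 1) implied by the model
print(eta, p, odds(p))
```

The inverse logit simply solves the model equation for P(y = 1), so applying logit to its result returns the linear predictor.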

Review: Goodness of Model Fit
There are three ways to check how well the model fits the data:
• Pseudo R-square (looks like the regular R-square in linear regression, but cannot be interpreted as the proportion of variation explained by the model).
• Max-rescaled R-square, defined as the pseudo R-square divided by its maximum possible value.
• Hosmer-Lemeshow goodness-of-fit test, with the null hypothesis that the model has a good fit. A p-value above 0.05 is desirable.
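The two R-square quantities can be sketched in a few lines of Python. This assumes the pseudo R-square is the likelihood-based generalized (Cox-Snell) form, 1 - (L0/L1)^(2/n), whose maximum is 1 - L0^(2/n); the slides do not state the formula, so treat that as an assumption. The log-likelihood values below are hypothetical.

```python
import math

def pseudo_r_square(loglik_null, loglik_model, n):
    """Generalized R-square: 1 - (L0 / L1)^(2/n), computed on the log scale."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)

def max_rescaled_r_square(loglik_null, loglik_model, n):
    """Pseudo R-square divided by its maximum, 1 - L0^(2/n)."""
    r2_max = 1.0 - math.exp(2.0 * loglik_null / n)
    return pseudo_r_square(loglik_null, loglik_model, n) / r2_max

# Hypothetical log-likelihoods for an intercept-only and a fitted model.
ll_null, ll_model, n_obs = -20.0, -15.0, 32
r2 = pseudo_r_square(ll_null, ll_model, n_obs)
r2_rescaled = max_rescaled_r_square(ll_null, ll_model, n_obs)
print(round(r2, 4), round(r2_rescaled, 4))
```

Because the maximum is below 1, the rescaled value is always at least as large as the pseudo R-square.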

Review: Interpretation of Beta Coefficients
• When $x_1$ is continuous, the quantity $(e^{\hat\beta_1} - 1) \cdot 100\%$ represents the estimated percent change in odds when $x_1$ is increased by one unit and the other variables are held fixed.
• If $x_1$ is a categorical variable with several levels, then $e^{\hat\beta_1} \cdot 100\%$ represents the estimated ratio, expressed as a percentage, of the odds for a given level of $x_1$ to the odds for the reference level, provided the other variables are unchanged.
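The two interpretations above are simple transformations of an estimated coefficient. A minimal Python sketch, with a hypothetical coefficient value:

```python
import math

def percent_change_in_odds(beta_hat):
    """(exp(beta) - 1) * 100%: change in odds per one-unit increase."""
    return (math.exp(beta_hat) - 1.0) * 100.0

def percent_ratio_of_odds(beta_hat):
    """exp(beta) * 100%: odds for a level relative to the reference level."""
    return math.exp(beta_hat) * 100.0

beta_hat = 0.4  # hypothetical estimate, for illustration only
print(percent_change_in_odds(beta_hat), percent_ratio_of_odds(beta_hat))
```

A coefficient of zero gives a 0% change and a 100% ratio, that is, no association with the odds.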

Multinomial Logistic Regression
• A natural extension of binary logistic regression arises when the outcome variable is categorical with more than two values, e.g., 0, 1, or 2. This model is called a multinomial (polytomous) logistic regression model.
• Two models are distinguished: one for an ordinal outcome and one for a nominal outcome.

Definitions and Examples
• A categorical variable is measured on the ordinal scale if the categories have a natural ordering. For example, size (XS, S, M, L, XL); health (poor, fair, good, excellent); grades (A, B, C, D, F); education (<HS, HSgrad, HS+).
• A categorical variable is measured on the nominal scale if there is no natural ordering to the categories (they can be treated as names). For example, opinion (yes, no, don’t know); political affiliation (Democrat, Republican, independent, other); religion (Protestant, Catholic, etc.); race (white, Hispanic, black, Asian, etc.).

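Cumulative Logit Model for Ordinal Outcome
• Suppose $y$ is an ordinal outcome variable with levels $1, 2, \dots, m$.
• Introduce the cumulative probability $P(y \le j) = P(y=1) + \dots + P(y=j)$, $j = 1, \dots, m$, which represents the probability of the outcome assuming one of the values $1, \dots, j$.
• For example, if $m = 3$, the cumulative probabilities are $P(y \le 1) = P(y=1)$, $P(y \le 2) = P(y=1) + P(y=2)$, and $P(y \le 3) = 1$.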

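Cumulative Logit Model for Ordinal Outcome
• Define the odds of outcome in category $j$ or below as the ratio $P(y \le j) / (1 - P(y \le j)) = P(y \le j) / P(y > j)$. These are termed cumulative odds.
• Define the logits of the cumulative probabilities (called cumulative logits) by
$$\ln \frac{P(y \le j)}{1 - P(y \le j)}, \quad j = 1, \dots, m-1.$$
For instance, if $m = 3$, the cumulative logits are
$$\ln \frac{P(y \le 1)}{1 - P(y \le 1)} = \ln \frac{P(y=1)}{P(y=2) + P(y=3)}$$
and
$$\ln \frac{P(y \le 2)}{1 - P(y \le 2)} = \ln \frac{P(y=1) + P(y=2)}{P(y=3)}.$$
Since $P(y \le 3) = 1$, the logit of $P(y \le 3)$ is not defined.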
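The definitions of cumulative probabilities, cumulative odds, and cumulative logits can be checked numerically. The sketch below uses hypothetical category probabilities for a three-level ordinal outcome; the numbers are assumptions for illustration only.

```python
import math
from itertools import accumulate

# Hypothetical probabilities P(y=1), P(y=2), P(y=3) for an ordinal outcome.
category_probs = [0.2, 0.3, 0.5]

# Cumulative probabilities P(y<=1), P(y<=2), P(y<=3).
cumulative_probs = list(accumulate(category_probs))

# Cumulative odds and logits exist only for j = 1, ..., m-1,
# because P(y<=m) = 1 would put a zero in the denominator.
cumulative_odds = [c / (1.0 - c) for c in cumulative_probs[:-1]]
cumulative_logits = [math.log(o) for o in cumulative_odds]

print(cumulative_probs, cumulative_odds, cumulative_logits)
```

With three levels there are exactly two cumulative logits, matching the two expressions on the slide.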

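Cumulative Logit Model for Ordinal Outcome
• The cumulative logit model for an ordinal outcome $y$ and predictors $x_1, \dots, x_k$ has the form
$$\ln \frac{P(y \le j)}{1 - P(y \le j)} = \alpha_j + \beta_1 x_1 + \dots + \beta_k x_k, \quad j = 1, \dots, m-1.$$
• Note that this model requires a separate intercept parameter $\alpha_j$ for each cumulative probability, while the slope coefficients are shared. SAS uses this model.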

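Cumulative Logit Model for Ordinal Outcome
• Note that some software packages (in particular, SPSS) use the model
$$\ln \frac{P(y \le j)}{1 - P(y \le j)} = \alpha_j - (\beta_1 x_1 + \dots + \beta_k x_k), \quad j = 1, \dots, m-1,$$
so the estimated slope coefficients have the opposite sign.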
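The two parameterizations (intercept plus linear predictor, and intercept minus linear predictor) describe the same family of models; only the signs of the slopes differ. The Python sketch below illustrates this with hypothetical intercepts, slopes, and predictor values, none of which come from the talk's data.

```python
import math

def inverse_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(betas, x):
    return sum(b * v for b, v in zip(betas, x))

def category_probs_plus(alphas, betas, x):
    """Category probabilities when logit P(y<=j) = alpha_j + beta'x."""
    eta = linear_predictor(betas, x)
    cum = [inverse_logit(a + eta) for a in alphas] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

def category_probs_minus(alphas, betas, x):
    """Category probabilities when logit P(y<=j) = alpha_j - beta'x."""
    return category_probs_plus(alphas, [-b for b in betas], x)

# Hypothetical values: increasing intercepts, two predictors.
alphas = [-1.0, 0.5, 2.0]
betas = [0.8, -0.3]
x = [1.0, 2.0]

probs_plus = category_probs_plus(alphas, betas, x)
probs_minus = category_probs_minus(alphas, [-b for b in betas], x)
print(probs_plus)
```

The intercepts must be increasing in j so that the cumulative probabilities are increasing and every category probability is positive.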

Review: Goodness of Model Fit
There are only two quantities that may be used to check the model fit:
• Pseudo R-square
• Max-rescaled R-square
The Hosmer-Lemeshow goodness-of-fit test cannot be performed in the case of multinomial logistic regression.

Interpretation of Beta Coefficients
• When $x_1$ is continuous, the quantity $(e^{\hat\beta_1} - 1) \cdot 100\%$ represents the estimated percent change in cumulative odds when $x_1$ is increased by one unit and the other predictors are held fixed.
• If $x_1$ is a categorical variable with several levels, then $e^{\hat\beta_1} \cdot 100\%$ represents the estimated ratio, expressed as a percentage, of the cumulative odds for a given level of $x_1$ to the cumulative odds for the reference level, controlling for the other predictors.

Examples of Ordinal Outcomes
• Example 1. A marketing research firm wants to investigate what factors influence the size of soda (small, medium, large, or extra large) that people order at a fast-food chain. These factors may include what type of sandwich is ordered (burger or chicken), whether or not fries are also ordered, and the age of the consumer.
• Example 2. A researcher is interested in what factors influence medaling in Olympic swimming (gold, silver, bronze). Relevant predictors include training hours, diet, age, and popularity of swimming in the athlete's home country.
• Example 3. A study looks at factors that influence the decision of whether to apply to graduate school. College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, the outcome variable has three categories. Predictors may be parental educational status, whether the undergraduate institution is public or private, and current GPA.

Numeric Example
Among the variables collected by the California Health Interview Survey (CHIS) were these demographic variables:
• gender (M/F)
• age (in years)
• marital status (Married/Not Married)
• highest educational degree obtained (<HS/HSgrad/HS+)
• health condition (Poor/Fair/Good/Excellent)
The following SAS code runs a cumulative logit model for the ordinal outcome variable health, using the data on 32 respondents.

SAS Application: Code
/* Read 32 respondents; @@ allows several observations per data line */
data CHIS;
  input gender$ age marital$ educ$ health$ @@;
  datalines;
M 46 yes 1 3   M 62 yes 1 1   M 52 yes 2 4   M 50 no 1 2
F 44 no 3 1    F 68 no 2 2    F 50 no 3 2    F 93 no 1 1
M 60 yes 2 4   M 88 no 3 3    M 58 yes 2 4   M 62 yes 2 3
F 64 yes 3 3   F 49 yes 2 3   F 71 yes 3 4   M 32 no 3 3
F 88 no 2 1    F 36 yes 3 4   M 85 no 3 3    F 38 no 3 2
M 49 yes 3 4   F 43 no 1 3    M 61 yes 2 3   M 47 yes 3 4
F 36 yes 1 3   M 44 yes 1 4   M 41 no 2 3    M 55 yes 1 3
M 37 no 3 2    M 58 yes 2 4   F 40 yes 2 3   F 97 no 2 1
;

/* Descriptive labels for the coded values */
proc format;
  value $maritalfmt 'yes'='married' 'no'='not married';
  value $educfmt '1'='<HS' '2'='HSgrad' '3'='HS+';
  value $healthfmt '1'='poor' '2'='fair' '3'='good' '4'='excellent';
run;

/* Cumulative logit model; rsq requests the R-square measures */
proc logistic;
  class gender (ref='M') marital (ref='yes') educ (ref='3') / param=ref;
  model health = gender age marital educ / link=clogit rsq;
run;

The Important Features of the SAS Code
• Ordinal variables should be entered into SAS as numbers 1, 2, etc. Otherwise SAS orders them alphabetically.
• Option link=clogit specifies the cumulative logit link function. Note that by default, link=logit.
• Option lackfit cannot be specified because the Hosmer-Lemeshow goodness-of-fit test cannot be performed in the case of multinomial logistic regression.

Relevant SAS Output

Results
• Gender, marital status, and education are associated with health status. Age is not.
• This model has a reasonably good fit because the pseudo R-square and max-rescaled R-square are pretty large.
• The fitted model is [equation not preserved].

Interpretation of Beta Coefficients
• The estimated odds of worse health for females are 6.363 times those for males (or 636.6%).
• As age increases by one year, the estimated odds of worse health increase by 2.5% = (1.025 - 1) · 100% (not significant).
• The estimated odds of worse health for not married people are 63.501 times those for married people (or 6,350.1%).
• The estimated odds of worse health for <HS are 9.912 times those for HS+ (or 991.2%).
• The estimated odds of worse health for HSgrad are 2.525 times those for HS+ (or 252.2%) (not significant).
• These ratios apply to all three cumulative probabilities: P(poor health), P(poor or fair health), and P(poor, fair, or good health).
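The last point, that one odds ratio applies to every cumulative probability, follows from the shared slope in the cumulative logit model. The Python sketch below illustrates it with the reported odds ratio for gender (6.363); the three intercepts are hypothetical, since the fitted intercepts are not reproduced here, and all other predictors are held at their reference values.

```python
import math

def inverse_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    return p / (1.0 - p)

odds_ratio_female = 6.363                  # reported on the slide
beta_female = math.log(odds_ratio_female)  # slope on the log-odds scale

# Hypothetical intercepts for P(poor), P(poor or fair), P(poor, fair, or good).
alphas = [-3.0, -1.0, 1.0]

ratios = []
for alpha in alphas:
    p_male = inverse_logit(alpha)                  # reference level
    p_female = inverse_logit(alpha + beta_female)  # same intercept, shifted by beta
    ratios.append(odds(p_female) / odds(p_male))

print([round(r, 3) for r in ratios])
```

Whatever intercepts are used, each cumulative odds ratio equals exp(beta), which is why a single number summarizes the effect of gender.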

SPSS Application: Syntax
PLUM health BY gender marital educ WITH age
  /LINK=LOGIT.

Relevant SPSS Output

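Generalized Logit Model for Nominal Outcome
• Suppose $y$ is a nominal outcome with levels $1, 2, \dots, m$, and assume that the $m$th level is the reference.
• Define the generalized logit function as
$$\ln \frac{P(y=j)}{P(y=m)}, \quad j = 1, \dots, m-1.$$
For example, if $m = 3$, the generalized logits are
$$\ln \frac{P(y=1)}{P(y=3)} \quad \text{and} \quad \ln \frac{P(y=2)}{P(y=3)}.$$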

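Generalized Logit Model for Nominal Outcome
The generalized logit model for a nominal outcome $y$ with $m$ levels and predictor variables $x_1, \dots, x_k$ has the form
$$\ln \frac{P(y=j)}{P(y=m)} = \beta_{0j} + \beta_{1j} x_1 + \dots + \beta_{kj} x_k, \quad j = 1, \dots, m-1.$$
Note that ALL the regression coefficients differ for different $j$’s.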
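Given the generalized logits for the non-reference levels, the category probabilities follow by exponentiating and normalizing, with the reference level contributing a logit of zero. A minimal Python sketch with hypothetical coefficients for a three-level outcome and a single predictor:

```python
import math

def generalized_logit_probabilities(etas):
    """Category probabilities from generalized logits.

    etas holds ln(P(y=j) / P(y=m)) for j = 1, ..., m-1;
    the reference level m has logit 0 by construction.
    """
    exp_etas = [math.exp(e) for e in etas]
    denom = 1.0 + sum(exp_etas)
    return [v / denom for v in exp_etas] + [1.0 / denom]

# Hypothetical (intercept, slope) pairs, one pair per non-reference level.
coefficients = [(0.2, -0.5), (-0.4, 0.9)]
x1 = 1.5

etas = [b0 + b1 * x1 for b0, b1 in coefficients]
probs = generalized_logit_probabilities(etas)
print(probs)
```

Each non-reference level has its own intercept and slope, which reflects the note that all coefficients differ across j.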

Interpretation of Beta Coefficients
• When $x_1$ is continuous, the quantity $(e^{\hat\beta_{1j}} - 1) \cdot 100\%$ represents the estimated percent change in odds in favor of $y = j$ as opposed to $y = m$ when $x_1$ is increased by one unit and the other predictors are held fixed.
• If $x_1$ is a categorical variable with several levels, then $e^{\hat\beta_{1j}} \cdot 100\%$ represents the estimated ratio, expressed as a percentage, of the odds in favor of $y = j$ as opposed to $y = m$ for a given level of $x_1$ to those for the reference level, controlling for the other predictors.

Examples of Nominal Outcomes
• Example 1. People's occupational choices might be influenced by their parents' occupations and their own education level. We can study the relationship of one's occupation choice with education level and father's occupation. The occupational choice is the outcome variable, which consists of categories of occupations.
• Example 2. A biologist may be interested in the food choices that alligators make. Adult alligators might have different preferences from young ones. The outcome variable here is the type of food, and the predictor variables might be the size of the alligators and other environmental variables.
• Example 3. Entering high school students choose among a general program, a vocational program, and an academic program. Their choice might be modeled using their writing score and their socioeconomic status.

Numeric Example
Over the course of a school year, third-graders from three different schools are exposed to three different styles of mathematics instruction: a self-paced computer-learning style, a team approach, and a traditional class approach. The students are asked which style they prefer, and their responses, classified by the type of program they are in (a regular school day versus a regular school day supplemented with an afternoon school program), are recorded. The following SAS code runs a generalized logit model for the nominal outcome variable style (self/team/class).

SAS Application: Code
/* Each record is a cell count: school, program, preferred style, frequency */
data school;
  length program$ 9;
  input school program$ style$ count @@;
  datalines;
1 regular self 10     1 regular team 17     1 regular class 26
1 afternoon self 5    1 afternoon team 12   1 afternoon class 50
2 regular self 21     2 regular team 17     2 regular class 26
2 afternoon self 16   2 afternoon team 12   2 afternoon class 36
3 regular self 15     3 regular team 15     3 regular class 16
3 afternoon self 12   3 afternoon team 12   3 afternoon class 20
;

/* Generalized logit model; rsq requests the R-square measures */
proc logistic;
  freq count;
  class school(ref='1') program(ref='afternoon') / param=ref;
  model style(order=data) = school program / link=glogit rsq;
run;

The Important Features of the SAS Code
• The data set contains frequencies of identical observations, so the freq statement has to be used in proc logistic.
• The option order=data orders the levels of the outcome variable by their first appearance in the data set. The last level in that order (here, class) serves as the reference.
• Option link=glogit specifies the generalized logit link function.
• Option lackfit cannot be specified because the Hosmer-Lemeshow goodness-of-fit test cannot be performed in the case of multinomial logistic regression.
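To make the roles of the frequency column and the level ordering concrete, the following Python sketch (an illustration outside SAS, not part of the talk's code) tallies the counts from the data step by preferred style and lists the styles in order of first appearance.

```python
# (school, program, style, count) records copied from the SAS data step.
records = [
    (1, "regular", "self", 10), (1, "regular", "team", 17), (1, "regular", "class", 26),
    (1, "afternoon", "self", 5), (1, "afternoon", "team", 12), (1, "afternoon", "class", 50),
    (2, "regular", "self", 21), (2, "regular", "team", 17), (2, "regular", "class", 26),
    (2, "afternoon", "self", 16), (2, "afternoon", "team", 12), (2, "afternoon", "class", 36),
    (3, "regular", "self", 15), (3, "regular", "team", 15), (3, "regular", "class", 16),
    (3, "afternoon", "self", 12), (3, "afternoon", "team", 12), (3, "afternoon", "class", 20),
]

# Each record stands for `count` identical students, so totals are sums of counts.
style_totals = {}
for _school, _program, style, count in records:
    style_totals[style] = style_totals.get(style, 0) + count

# A dict keeps insertion order, so the keys appear in order of first appearance.
style_order = list(style_totals)
reference_style = style_order[-1]
total_students = sum(style_totals.values())

print(style_totals, reference_style, total_students)
```

The 18 records represent 338 students in total, and the last style to appear, class, is the one used as the reference level.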

Relevant SAS Output

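Results
• This model doesn’t have a very good fit, because both R-square and max-rescaled R-square are pretty small.
• The fitted model consists of two equations, one for self versus class and one for team versus class [equations not preserved].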

Interpretation of Beta Coefficients
• The estimated odds of preferring a self-paced computer-learning style as opposed to a traditional class approach in school 2 are 2.953 times those in school 1 (or 295.3%).
• The estimated odds of preferring a self-paced computer-learning style as opposed to a traditional class approach in school 3 are 3.724 times those in school 1 (or 372.4%).
• The estimated odds of preferring a self-paced computer-learning style as opposed to a traditional class approach in the regular program are 2.112 times those in the afternoon program (or 211.2%).

Interpretation of Beta Coefficients
• The estimated odds of preferring a team learning approach as opposed to a traditional class approach in school 2 are 1.197 times those in school 1 (or 119.7%).
• The estimated odds of preferring a team learning approach as opposed to a traditional class approach in school 3 are 1.926 times those in school 1 (or 192.6%).
• The estimated odds of preferring a team learning approach as opposed to a traditional class approach in the regular program are 2.101 times those in the afternoon program (or 210.1%).

SPSS Application: Syntax
• Only numeric values are allowed as SPSS data.
• Schools were renumbered (1->3, 2->2, 3->1) to make school 1 the reference.

Relevant SPSS Output
