
# Analysis of Categorical Data

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013. Overview. Data Types Contingency Tables Logit Models Binomial Ordinal Nominal. Things not covered (but still fit into the topic). Matched pairs/repeated measures



### Analysis of Categorical Data

Nick Jackson

University of Southern California

Department of Psychology

10/11/2013

• Data Types

• Contingency Tables

• Logit Models

• Binomial

• Ordinal

• Nominal

Things not covered (but still fit into the topic)

• Matched pairs/repeated measures

• McNemar’s Chi-Square

• Reliability

• Cohen’s Kappa

• ROC

• Poisson (Count) models

• Categorical SEM

• Tetrachoric Correlation

• Bernoulli Trials

Discrete/Categorical/Qualitative

Continuous/Quantitative

Nominal/Multinomial:

• Properties:

• Values arbitrary (no magnitude)

• No direction (no ordering)

• Example:

• Race: 1=AA, 2=Ca, 3=As

• Measures:

• Mode, relative frequency

Rank Order/Ordinal:

• Properties:

• Values semi-arbitrary (no magnitude?)

• Have direction (ordering)

• Example:

• Likert Scales (pronounced LICK-ert):

• 1-5, Strongly Disagree to Strongly Agree

• Measures:

• Mode, relative frequency, median

• Mean?

Binary/Dichotomous/Binomial:

• Properties:

• 2 Levels

• Special case of Ordinal or Multinomial

• Examples:

• Gender (Multinomial)

• Disease (Y/N)

• Measures:

• Mode, relative frequency

• Mean?

Contingency Tables

• Often called Two-way tables or Cross-Tab

• Have dimensions I x J

• Can be used to test hypotheses of association between categorical variables

Contingency Tables: Test of Independence

• Chi-Square Test of Independence (χ2)

• Calculate χ2

• Determine DF: (I-1) * (J-1)

• Compare to χ2 critical value for given DF.

Row totals: R1=156, R2=664. Column totals: C1=265, C2=331, C3=264. N=820.

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Where: $O_i$ = Observed Freq

$E_i$ = Expected Freq

$n$ = number of cells in table
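The three steps above (calculate χ2, determine DF, compare to a critical value) can be sketched in plain Python. The counts below are hypothetical and are not the table from the slide; the expected frequency for each cell is assumed to be the usual (row total × column total) / N, and the 0.05 critical values are the standard tabulated ones.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an I x J table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Standard chi-square critical values at alpha = 0.05 for small DF.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

# Hypothetical 2 x 3 table of counts (not the data from the slide).
counts = [[30, 50, 20],
          [40, 30, 30]]

chi2, df = chi_square_independence(counts)
reject_h0 = chi2 > CRITICAL_05[df]
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0 at 0.05: {reject_h0}")
```

With every expected count well above 5, the Pearson statistic is appropriate for this made-up table.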

Contingency Tables: Test of Independence

• Pearson Chi-Square Test of Independence (χ2)

• H0: No Association

• HA: Association….where, how?

• Not appropriate when any expected cell frequency (Ei) is < 5

• Use Fisher’s Exact Test instead


• 2x2

| Risk Factor/Exposure | Disorder (Outcome): Yes | Disorder (Outcome): No | Total |
|---|---|---|---|
| Yes | a | b | a+b |
| No | c | d | c+d |
| Total | a+c | b+d | a+b+c+d |

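Contingency Tables: Measures of Association

| Alcohol Use | Depression: Yes | Depression: No | Total |
|---|---|---|---|
| Yes | a = 25 | b = 10 | 35 |
| No | c = 20 | d = 45 | 65 |
| Total | 45 | 55 | 100 |

Probability: $a/(a+b) = 25/35 = 0.714$ for alcohol users; $c/(c+d) = 20/65 = 0.308$ for nonusers

Contrasting Probability (relative risk): $0.714 / 0.308 = 2.32$

Individuals who used alcohol were 2.32 times as likely to have depression as those who did not use alcohol.

Odds: $a/b = 25/10 = 2.5$ for alcohol users; $c/d = 20/45 = 0.444$ for nonusers

Contrasting Odds (odds ratio): $(a \cdot d)/(b \cdot c) = 1125/200 = 5.625 \approx 5.62$

The odds of depression were 5.62 times as high in alcohol users as in nonusers.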
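The probability and odds contrasts for the alcohol use and depression table can be checked with a few lines of Python. This is a minimal sketch using the cell counts a, b, c, d given above; the variable names are illustrative only.

```python
# Cell counts from the alcohol use x depression table.
a, b = 25, 10   # alcohol users: depressed, not depressed
c, d = 20, 45   # nonusers:      depressed, not depressed

# Probability (risk) of depression in each exposure group.
risk_users = a / (a + b)
risk_nonusers = c / (c + d)
relative_risk = risk_users / risk_nonusers

# Odds of depression in each exposure group.
odds_users = a / b
odds_nonusers = c / d
odds_ratio = odds_users / odds_nonusers

print(f"Risk: users {risk_users:.3f}, nonusers {risk_nonusers:.3f}")
print(f"Relative risk: {relative_risk:.3f}")
print(f"Odds: users {odds_users:.3f}, nonusers {odds_nonusers:.3f}")
print(f"Odds ratio: {odds_ratio:.3f}")
```

Carrying full precision through the division matters here: rounding the two risks before dividing shifts the relative risk in the second decimal place.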

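For i = 1 to 45, with the "No" depression column multiplied by i:

| Alcohol Use | Depression: Yes | Depression: No | Total |
|---|---|---|---|
| Yes | a = 25 | b = 10 × i | 25 + 10 × i |
| No | c = 20 | d = 45 × i | 20 + 45 × i |
| Total | 45 | 55 × i | 45 + 55 × i |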
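The scaled table appears intended to show what happens to the two measures as the non-depressed column grows (that reading is an assumption, since the slide's figure did not survive). A short sketch makes the behaviour concrete: the odds ratio does not change when a column is multiplied by i, while the relative risk drifts toward the odds ratio as depression becomes rarer.

```python
def rr_and_or(a, b, c, d):
    """Relative risk and odds ratio for a 2x2 table with cells a, b, c, d."""
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return rr, odds_ratio

# Multiply the "not depressed" column (b and d) by i = 1..45.
results = {i: rr_and_or(25, 10 * i, 20, 45 * i) for i in range(1, 46)}

for i in (1, 5, 15, 45):
    rr, odds_ratio = results[i]
    print(f"i={i:2d}  RR={rr:.3f}  OR={odds_ratio:.3f}")
```

This is why the odds ratio is often read as an approximation to the relative risk only when the outcome is rare.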

The Generalized Linear Model

• General Linear Model (LM)

• Continuous Outcomes (DV)

• Linear Regression, t-test, Pearson correlation, ANOVA, ANCOVA

• Generalized Linear Model (GLM)

• John Nelder and Robert Wedderburn

• Maximum Likelihood Estimation

• Continuous, Categorical, and Count outcomes.

• Distribution Family and Link Functions

• Error distributions that are not normal

Logistic Regression

• “This is the most important model for categorical response data” –Agresti (Categorical Data Analysis, 2nd Ed.)

• Binary Response

• Predicting Probability (related to the Probit model)

• Assume (the usual):

• Independence

• NOT Homoscedasticity or Normal Errors

• Linearity (in the Log Odds)

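• The Model

• In terms of probability of success π(x): $\pi(x) = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$

• In terms of Logits (Log Odds): $\text{logit}[\pi(x)] = \ln\left(\dfrac{\pi(x)}{1 - \pi(x)}\right) = \alpha + \beta x$

• Logit transform gives us a linear equation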
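A small sketch of the two forms of the model, assuming the standard single-predictor logistic model in which the log odds equal alpha + beta * x. The coefficient values are hypothetical and chosen only to show that the logit is linear in x while the probability is not.

```python
import math

def logit(p):
    """Log odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Probability corresponding to a log odds value eta."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical intercept and slope, for illustration only.
alpha, beta = -2.0, 0.05

for x in (20, 40, 60):
    eta = alpha + beta * x          # linear on the logit scale
    p = inv_logit(eta)              # S-shaped on the probability scale
    print(f"x={x}  logit={eta:+.2f}  probability={p:.3f}")
```

Equal steps in x add equal amounts to the logit, but the change in probability depends on where on the curve you start.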

Logistic Regression: Example

The Output as Logits

• Logits: H0: β=0

• What does H0: β=0 mean?

| | Freq. | Percent |
|---|---|---|
| Not Depressed | 672 | 81.95 |
| Depressed | 148 | 18.05 |

• Conversion to Probability: $e^{-1.51} / (1 + e^{-1.51}) = 0.18$

• Conversion to Odds: $e^{-1.51} = 0.22$

• Also=0.1805/0.8195=0.22

Logistic Regression: Example

• The Output as ORs

• Odds Ratios: H0: OR=1 (equivalent to β=0 on the logit scale)

• Conversion to Probability: 0.220/(1+0.220)=0.18

• Conversion to Logit (log odds!)

• Ln(OR) = logit

• Ln(0.220)=-1.51
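The conversions between probability, odds, and logit for the intercept-only model can be reproduced directly from the frequency table (148 depressed, 672 not depressed). A minimal sketch:

```python
import math

depressed, not_depressed = 148, 672

# Probability -> odds -> logit
p = depressed / (depressed + not_depressed)
odds = p / (1 - p)
log_odds = math.log(odds)

# Logit -> odds -> probability (the reverse direction)
odds_back = math.exp(log_odds)
p_back = odds_back / (1 + odds_back)

print(f"probability = {p:.4f}")
print(f"odds        = {odds:.4f}")
print(f"logit       = {log_odds:.4f}")
print(f"round trip  = {p_back:.4f}")
```

A logit of 0 corresponds to odds of 1 and a probability of 0.5, which is what the null hypothesis tests on each scale.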

Logistic Regression: Example

Logistic Regression w/ Single Continuous Predictor:

AS LOGITS:

Interpretation:

A 1 unit increase in age results in a 0.013 increase in the log-odds of depression.

Hmmmm….I have no concept of what a log-odds is. Interpret as something else.

Logit > 0 so as age increases the risk of depression increases.

OR=e^0.013 = 1.013

For a 1 unit increase in age, the odds of depression are multiplied by 1.013.

We could also say: For a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR-1)*100 = % change].
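The step from the logit coefficient to the odds ratio and the percent change is a one-liner each. The sketch below uses the age coefficient 0.013 from the example; the 10-unit contrast is an added illustration and assumes the effect is linear in the log odds, as the model does.

```python
import math

beta_age = 0.013                      # logit coefficient from the example

odds_ratio = math.exp(beta_age)       # multiplicative change in odds per 1 unit
pct_change = (odds_ratio - 1) * 100   # percent change in odds per 1 unit

# For a k-unit increase the coefficient is multiplied by k before exponentiating.
odds_ratio_10 = math.exp(10 * beta_age)

print(f"OR per 1 unit   = {odds_ratio:.3f}")
print(f"% change        = {pct_change:.1f}%")
print(f"OR per 10 units = {odds_ratio_10:.3f}")
```

Note that the 10-unit odds ratio is the 1-unit odds ratio raised to the 10th power, not 10 times the percent change.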

• Overall Model Likelihood-Ratio Chi-Square

• Omnibus test for the model

• Overall model fit?

• Relative to other models

• Compares specified model with Null model (no predictors)

• Χ2=-2*(LL0-LL1), DF=K parameters estimated

Logistic Regression: GOF (Summary Measures)

• Pseudo-R2

• Not the same meaning as linear regression.

• There are many of them (Cox and Snell/McFadden)

• Only comparable within nested models of the same outcome.

• Hosmer-Lemeshow

• Models with Continuous Predictors

• Do the observed outcomes agree with the model’s predicted probabilities? χ2

• H0: Good Fit for Data, so we want p>0.05

• Order the predicted probabilities, group them (g=10) by quantiles, then compute a Chi-Square of Group * Outcome (observed vs. expected). DF=g-2

• Conservative (rarely rejects the null)

• Pearson Chi-Square

• Models with categorical predictors

• Similar to Hosmer-Lemeshow

• ROC-Area Under the Curve

• Predictive accuracy/Classification
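The Hosmer-Lemeshow steps listed above (order the predicted probabilities, form g groups, compare observed and expected events) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the routine used to produce the slide's output: groups are formed as roughly equal-sized slices of the sorted data, and the data are synthetic.

```python
import math
import random

def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow chi-square statistic and its degrees of freedom (g - 2)."""
    # Sort by predicted probability only; ties keep their original order.
    pairs = sorted(zip(p_hat, y), key=lambda pair: pair[0])
    n = len(pairs)
    stat = 0.0
    for k in range(g):
        group = pairs[k * n // g:(k + 1) * n // g]   # roughly equal-sized groups
        n_k = len(group)
        observed = sum(outcome for _, outcome in group)
        expected = sum(prob for prob, _ in group)
        mean_p = expected / n_k
        stat += (observed - expected) ** 2 / (n_k * mean_p * (1 - mean_p))
    return stat, g - 2

# Synthetic, well-calibrated data (hypothetical): outcomes are drawn from the
# same probabilities that are passed in as the "predictions".
random.seed(0)
p_hat = [1 / (1 + math.exp(-(-2.0 + 0.04 * random.uniform(20, 80))))
         for _ in range(500)]
y = [1 if random.random() < p else 0 for p in p_hat]

hl_stat, hl_df = hosmer_lemeshow(y, p_hat)
print(f"H-L chi2 = {hl_stat:.2f}, df = {hl_df}")
```

Because the statistic depends on how the groups are cut, different software can report slightly different values for the same model.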

Logistic Regression: GOF(Diagnostic Measures)

• Outliers in Y (Outcome)

• Pearson Residuals

• Square root of the contribution to the Pearson χ2

• Deviance Residuals

• Square root of the contribution to the likelihood-ratio test statistic of a saturated model vs fitted model.

• Outliers in X (Predictors)

• Leverage (Hat Matrix/Projection Matrix)

• Maps the influence of observed on fitted values

• Influential Observations

• Pregibon’s Delta-Beta influence statistic

• Similar to Cook’s-D in linear regression

• Detecting Problems

• Residuals vs Predictors

• Leverage vs Residuals

• Boxplot of Delta-Beta

L-R χ2 (df=1): 2.47, p=0.1162

H-L GOF:

Number of Groups: 10

H-L Chi2: 7.12

DF: 8

P: 0.5233
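The p-values in this output can be recovered from the reported statistics using only the standard library. The sketch relies on two closed forms for the chi-square upper tail (one for 1 DF, one for even DF); it is not a general chi-square routine, and small differences from the reported p-values are expected because the statistics above are rounded to two decimals.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with an even number of DF."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms

p_lr = chi2_sf_df1(2.47)          # likelihood-ratio test, df = 1
p_hl = chi2_sf_even_df(7.12, 8)   # Hosmer-Lemeshow test, df = 8

print(f"L-R p-value = {p_lr:.4f}")
print(f"H-L p-value = {p_hl:.4f}")
```

Read together: the omnibus test does not show age improving on the null model at the 0.05 level, while the Hosmer-Lemeshow test gives no evidence of poor fit.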

Logistic Regression: Diagnostics

• Linearity in the Log-Odds

• Use a lowess (loess) plot

• Depressed vs Age

Logistic Regression: Example

Logistic Regression w/ Single Categorical Predictor:

AS OR:

Interpretation:

The odds of depression for males are 0.299 times the odds for females.

We could also say: The odds of depression are (1-0.299=.701) 70.1% lower in males compared to females.

Or…why not just make males the reference so the OR is greater than 1. Or we could just take the inverse and accomplish the same thing. 1/0.299 = 3.34.
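Switching the reference group inverts the odds ratio and flips the sign of the logit, which a short check makes explicit. The sketch uses the OR of 0.299 from the example; the variable names are illustrative.

```python
import math

or_male_vs_female = 0.299

# Percent lower odds for males relative to females.
pct_lower = (1 - or_male_vs_female) * 100

# Same comparison with males as the reference group.
or_female_vs_male = 1 / or_male_vs_female

# On the logit scale the two coefficients differ only in sign.
logit_male = math.log(or_male_vs_female)
logit_female = math.log(or_female_vs_male)

print(f"Odds are {pct_lower:.1f}% lower for males")
print(f"OR (female vs male) = {or_female_vs_male:.2f}")
print(f"logits: {logit_male:+.3f} and {logit_female:+.3f}")
```

The percent change is not symmetric (70.1% lower one way, 234% higher the other), which is one reason the log odds scale is used for estimation.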

Ordinal Logistic Regression

• Also called Ordered Logistic or Proportional Odds Model

• Extension of Binary Logistic Model

• >2 Ordered responses

• New Assumption!

• Proportional Odds

• BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)

• The predictor’s effect on the outcome is the same across levels of the outcome.

• Bmi3grp (1 vs 2,3) = B(age)

• Bmi3grp (1,2 vs 3) = B(age)

• The Model

• A latent variable model (Y*)

• j= number of levels-1

• From the equation we can see that the odds ratio is assumed to be independent of the category j
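The proportional odds idea can be illustrated with a sketch. The slide's equation did not survive, so the parameterization here is an assumption: logit P(Y > j | x) = beta * x - cutpoint_j, a common form in which a positive beta shifts responses toward higher categories. The cutpoints and slope are hypothetical.

```python
import math

def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

def category_probs(x, cutpoints, beta):
    """P(Y = 1), ..., P(Y = J) under the assumed proportional odds form."""
    exceed = [inv_logit(beta * x - k) for k in cutpoints]  # P(Y > j), j = 1..J-1
    upper = [1.0] + exceed                                 # P(Y > 0) = 1
    lower = exceed + [0.0]                                 # P(Y > J) = 0
    return [u - l for u, l in zip(upper, lower)]

def cumulative_odds(x, cutpoint, beta):
    """Odds of being above a given cutpoint."""
    p = inv_logit(beta * x - cutpoint)
    return p / (1 - p)

# Hypothetical values: two cutpoints for a 3-level outcome and one slope.
cutpoints = [0.5, 2.0]
beta = 0.012

probs = category_probs(120, cutpoints, beta)
print("P(Y=1..3 | x=120):", [round(p, 3) for p in probs])

# The odds ratio for a 1 unit increase is the same at every cutpoint.
ratios = [cumulative_odds(121, k, beta) / cumulative_odds(120, k, beta)
          for k in cutpoints]
print("Cumulative ORs:", [round(r, 4) for r in ratios])
```

Only the cutpoints differ between the two cumulative comparisons, which is exactly what the single B(age) in the BMI3GRP bullets expresses.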

Ordinal Logistic Regression Example

AS LOGITS:

For a 1 unit increase in Blood Pressure there is a 0.012 increase in the log-odds of being in a higher BMI category.

AS OR:

For a 1 unit increase in Blood Pressure the odds of being in a higher BMI category are multiplied by 1.012.

Ordinal Logistic Regression: GOF

• Assessing Proportional Odds Assumptions

• Brant Test of Parallel Regression

• H0: Proportional Odds, thus want p >0.05

• Tests each predictor separately and overall

• Score Test of Parallel Regression

• H0: Proportional Odds, thus want p >0.05

• Approx Likelihood-ratio test

• H0: Proportional Odds, thus want p >0.05

Ordinal Logistic Regression: GOF

• Pseudo R2

• Diagnostics Measures

• Performed on the j-1 binomial logistic regressions

Multinomial Logistic Regression

• Also called multinomial logit/polytomous logistic regression.

• Same assumptions as the binary logistic model

• >2 non-ordered responses

• Or the proportional odds assumption of the Ordinal Logistic model was not met

• The Model

$$\log\frac{\pi_j(x)}{\pi_J(x)} = \alpha_j + \beta_j x, \quad j = 1, \ldots, J-1$$

• j = levels of the outcome

• J = reference level

• where x is a fixed setting of an explanatory variable

• Notice how it appears we are estimating a Relative Risk and not an Odds Ratio. It’s actually an OR.

• Similar to conducting separate binary logistic models, but with better type 1 error control
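A sketch of the multinomial model, assuming the baseline-category logit form log(pi_j(x) / pi_J(x)) = alpha_j + beta_j * x with the last level J as the reference. The intercepts and slopes are hypothetical. It shows why exp(beta_j) behaves like an odds ratio: it is the ratio of (pi_j / pi_J) at x + 1 to the same quantity at x.

```python
import math

def baseline_category_probs(x, alphas, betas):
    """Probabilities for J categories; the last category is the reference."""
    # The reference category has alpha = beta = 0, so its linear predictor is 0.
    etas = [a + b * x for a, b in zip(alphas, betas)] + [0.0]
    denom = sum(math.exp(e) for e in etas)
    return [math.exp(e) / denom for e in etas]

# Hypothetical parameters for a 3-level outcome (2 non-reference levels).
alphas = [0.2, -0.5]
betas = [0.197, -0.100]

probs_x = baseline_category_probs(2.0, alphas, betas)
probs_x1 = baseline_category_probs(3.0, alphas, betas)

# Ratio of "level 1 vs reference" before and after a 1 unit increase in x.
ratio_x = probs_x[0] / probs_x[-1]
ratio_x1 = probs_x1[0] / probs_x1[-1]
odds_ratio_level1 = ratio_x1 / ratio_x

print("P at x=2:", [round(p, 3) for p in probs_x])
print("P at x=3:", [round(p, 3) for p in probs_x1])
print(f"exp(beta_1) = {math.exp(betas[0]):.4f}, ratio of ratios = {odds_ratio_level1:.4f}")
```

The change in the probability of any one category still depends on all the other categories, so the percent change applies to the odds against the reference, not to the probability itself.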

Multinomial Logistic Regression Example

Does degree of supernatural belief indicate a religious preference?

AS OR:

For a 1 unit increase in supernatural belief, there is a (OR-1 = % change) 21.8% increase in the odds of being an Evangelical rather than a Catholic.

• Limited GOF tests.

• Look at LR Chi-square and compare nested models.

• “Essentially, all models are wrong, but some are useful” –George E.P. Box

• Pseudo R2

• Similar to Ordinal

• Perform tests on the j-1 binomial logistic regressions

“Categorical Data Analysis” by Alan Agresti

UCLA Stat Computing:

http://www.ats.ucla.edu/stat/