

Analysis of Categorical Data

Nick Jackson

University of Southern California

Department of Psychology

10/11/2013


Overview

  • Data Types

  • Contingency Tables

  • Logit Models

    • Binomial

    • Ordinal

    • Nominal


Things not covered (but still fit into the topic)

  • Matched pairs/repeated measures

    • McNemar’s Chi-Square

  • Reliability

    • Cohen’s Kappa

    • ROC

  • Poisson (Count) models

  • Categorical SEM

    • Tetrachoric Correlation

  • Bernoulli Trials


Data Types (Levels of Measurement)

Discrete/Categorical/Qualitative (vs. Continuous/Quantitative)

Nominal/Multinomial:

  • Properties:

    • Values arbitrary (no magnitude)

    • No direction (no ordering)

  • Example:

    • Race: 1=AA, 2=Ca, 3=As

  • Measures:

    • Mode, relative frequency

Rank Order/Ordinal:

  • Properties:

    • Values semi-arbitrary (no magnitude?)

    • Have direction (ordering)

  • Example:

    • Likert scales (pronounced LICK-urt): 1-5, Strongly Disagree to Strongly Agree

  • Measures:

    • Mode, relative frequency, median

    • Mean?

Binary/Dichotomous/Binomial:

  • Properties:

    • 2 Levels

    • Special case of Ordinal or Multinomial

  • Examples:

    • Gender (Multinomial)

    • Disease (Y/N)

  • Measures:

    • Mode, relative frequency

    • Mean?



Code 1.1

Contingency Tables

  • Often called Two-way tables or Cross-Tab

  • Have dimensions I x J

  • Can be used to test hypotheses of association between categorical variables


Contingency Tables: Test of Independence

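  • Chi-Square Test of Independence (χ²)

    • Calculate χ²

    • Determine DF: (I-1) * (J-1)

    • Compare to the χ² critical value for the given DF.

Example table margins: row totals R1=156, R2=664; column totals C1=265, C2=331, C3=264; N=820

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Where: O_i = observed frequency in cell i

E_i = expected frequency in cell i (row total * column total / N)

n = number of cells in the table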
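The calculation steps above can be sketched in a few lines of Python. This is only an illustration, not the code from the presentation's handout: the cell counts of the 2x3 example are not in this transcript, so the sketch borrows the 2x2 alcohol use by depression table from the Measures of Association slide. The function names are made up for this example, and the p-value helper relies on an identity that holds only for DF = 1.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and DF for an I x J table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

def chi2_pvalue_df1(chi2):
    """Upper-tail p-value for a chi-square with 1 DF (erfc identity, DF = 1 only)."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Rows: alcohol use yes/no; columns: depression yes/no
table = [[25, 10], [20, 45]]
chi2, df = chi_square_independence(table)
p_value = chi2_pvalue_df1(chi2)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.5f}")
```

For tables larger than 2x2 the statistic and DF are computed the same way, but the p-value would need a general chi-square distribution function from a statistics library.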



Code 1.2

Contingency Tables: Test of Independence

  • Pearson Chi-Square Test of Independence (χ2)

    • H0: No Association

    • HA: Association….where, how?

  • Not appropriate when any expected cell frequency (Ei) is < 5

    • Use Fisher’s Exact Test instead

Example table margins: row totals R1=156, R2=664; column totals C1=265, C2=331, C3=264; N=820
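For small expected counts, Fisher's exact test can be written directly from the hypergeometric distribution. The sketch below is a minimal 2x2 version under stated assumptions: the function name is hypothetical, the two-sided p-value uses the common "sum all tables no more probable than the observed one" rule, and the small table is invented for illustration (it is not data from the presentation).

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2 = a + b, c + d          # row totals
    c1 = a + c                     # first column total
    n = r1 + r2
    denom = comb(n, c1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        return comb(r1, k) * comb(r2, c1 - k) / denom

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Sum probabilities of all tables at least as extreme (no more probable)
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Invented small table where every expected count is 2 (< 5)
p_small = fisher_exact_two_sided(3, 1, 1, 3)
print(f"Fisher exact p = {p_small:.4f}")
```

The small tolerance in the comparison guards against floating-point ties between equally probable tables.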


Contingency Tables

  • 2x2

                            Disorder (Outcome)
                            Yes       No        Total
Risk Factor/      Yes       a         b         a+b
Exposure          No        c         d         c+d
                  Total     a+c       b+d       a+b+c+d


Contingency Tables: Measures of Association

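                            Depression
                            Yes       No        Total
Alcohol Use       Yes       a=25      b=10      35
                  No        c=20      d=45      65
                  Total     45        55        100

Probability: P(depression | alcohol use) = a/(a+b) = 25/35 = 0.714; P(depression | no alcohol use) = c/(c+d) = 20/65 = 0.308

Contrasting Probability (relative risk): 0.714/0.308 ≈ 2.32

Individuals who used alcohol were 2.32 times as likely to have depression as those who did not use alcohol.

Odds: a/b = 25/10 = 2.50 for alcohol users; c/d = 20/45 = 0.444 for nonusers

Contrasting Odds (odds ratio): (a/b)/(c/d) = ad/bc = 1125/200 ≈ 5.62

The odds of depression were 5.62 times as great in alcohol users as in nonusers.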
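The probability and odds contrasts for this table can be checked with plain arithmetic. The sketch below uses the a, b, c, d cell labels from the 2x2 layout; the variable names are chosen for this example only.

```python
# Cell counts from the alcohol use by depression table
a, b, c, d = 25, 10, 20, 45

# Probabilities (risks) of depression within each exposure group
risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
relative_risk = risk_exposed / risk_unexposed

# Odds of depression within each exposure group
odds_exposed = a / b
odds_unexposed = c / d
odds_ratio = odds_exposed / odds_unexposed   # same as (a * d) / (b * c)

print(f"risk: {risk_exposed:.3f} vs {risk_unexposed:.3f}, RR = {relative_risk:.3f}")
print(f"odds: {odds_exposed:.3f} vs {odds_unexposed:.3f}, OR = {odds_ratio:.3f}")
```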


Why Odds Ratios?

Multiply the "No" (not depressed) column by i, for i=1 to 45:

                            Depression
                            Yes       No        Total
Alcohol Use       Yes       a=25      b=10*i    (25 + 10*i)
                  No        c=20      d=45*i    (20 + 45*i)
                  Total     45        55*i      (45 + 55*i)
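One way to read this slide (an interpretation, since the accompanying plot is not in the transcript) is that scaling the non-depressed column by i changes the relative risk but leaves the odds ratio untouched. The sketch below recomputes both for i = 1 to 45; the function name is made up for this example.

```python
def rr_and_or(i):
    """Relative risk and odds ratio when the 'No' column is multiplied by i."""
    a, b, c, d = 25, 10 * i, 20, 45 * i
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return rr, odds_ratio

results = {i: rr_and_or(i) for i in range(1, 46)}

for i in (1, 5, 15, 45):
    rr, odds_ratio = results[i]
    print(f"i = {i:2d}: RR = {rr:.3f}, OR = {odds_ratio:.3f}")
```

As i grows the outcome becomes rarer and the relative risk drifts toward the odds ratio, which is why the two are often close for rare outcomes.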


The Generalized Linear Model

  • General Linear Model (LM)

    • Continuous Outcomes (DV)

    • Linear Regression, t-test, Pearson correlation, ANOVA, ANCOVA

  • Generalized Linear Model (GLM)

    • John Nelder and Robert Wedderburn

    • Maximum Likelihood Estimation

    • Continuous, Categorical, and Count outcomes.

    • Distribution Family and Link Functions

      • Error distributions that are not normal


Logistic Regression

  • “This is the most important model for categorical response data” –Agresti (Categorical Data Analysis, 2nd Ed.)

  • Binary Response

  • Predicting Probability (related to the Probit model)

  • Assume (the usual):

    • Independence

    • NOT Homoscedasticity or Normal Errors

    • Linearity (in the Log Odds)

    • Also….adequate cell sizes.


Logistic Regression

  • The Model

    • In terms of probability of success π(x)

    • In terms of Logits (Log Odds)

    • Logit transform gives us a linear equation
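The slide's equations did not survive in this transcript. For a single predictor x, the two forms are usually written as below; the symbols α (intercept) and β (slope) are an assumed notation, chosen to match the β used on the later example slides.

```latex
\[
\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}
\]
\[
\mathrm{logit}[\pi(x)] = \ln\!\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \alpha + \beta x
\]
```

The first line is nonlinear in x; taking the log of the odds π(x)/(1 − π(x)) gives the second line, which is linear in x.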



Code 2.1

Logistic Regression: Example

The Output as Logits

  • Logits: H0: β=0

    • What does H0: β=0 mean?

                   Freq.    Percent
Not Depressed      672      81.95
Depressed          148      18.05

  • Conversion to Probability:

    • probability = e^logit / (1 + e^logit)

  • Conversion to Odds

    • odds = e^logit

    • Also=0.1805/0.8195=0.22
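The conversions between probability, odds, and logit can be checked against the frequency table (148 depressed, 672 not depressed). This is a small sketch with variable names chosen for illustration; in an intercept-only logistic model the fitted logit is simply the log of the observed odds.

```python
import math

n_depressed, n_not_depressed = 148, 672

# Probability -> odds -> logit
p = n_depressed / (n_depressed + n_not_depressed)
odds = p / (1 - p)
logit = math.log(odds)

# Logit -> odds -> probability (the reverse direction)
odds_back = math.exp(logit)
p_back = odds_back / (1 + odds_back)

print(f"p = {p:.4f}, odds = {odds:.3f}, logit = {logit:.3f}")
print(f"back-converted p = {p_back:.4f}")
```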



Code 2.2

Logistic Regression: Example

  • The Output as ORs

    • Odds Ratios: H0: OR=1 (equivalent to β=0)

    • Conversion to Probability:

      • probability = odds / (1 + odds) = 0.220/1.220 = 0.18

    • Conversion to Logit (log odds!)

      • Ln(OR) = logit

      • Ln(0.220)=-1.51

                   Freq.    Percent
Not Depressed      672      81.95
Depressed          148      18.05



Code 2.3

Logistic Regression: Example

Logistic Regression w/ Single Continuous Predictor:

AS LOGITS:

Interpretation:

A 1 unit increase in age is associated with a 0.013 increase in the log-odds of depression.

Hmmmm….I have no concept of what a log-odds is. Interpret as something else.

Logit > 0, so as age increases the risk of depression increases.

OR = e^0.013 = 1.013

For a 1 unit increase in age, the odds of depression are multiplied by 1.013.

We could also say: For a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR-1)*100 = % change].
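The arithmetic behind these interpretations is short enough to verify directly. The sketch below starts from the reported logit coefficient for age (0.013); the 10-unit contrast is an added illustration, not a result from the slides.

```python
import math

beta_age = 0.013                      # logit coefficient reported on the slide

or_per_unit = math.exp(beta_age)      # odds ratio for a 1 unit increase in age
pct_change = (or_per_unit - 1) * 100  # percent change in the odds

# Effects multiply on the odds scale, so 10 units is exp(10 * beta)
or_per_10_units = math.exp(10 * beta_age)

print(f"OR per unit = {or_per_unit:.3f} ({pct_change:.1f}% change in odds)")
print(f"OR per 10 units = {or_per_10_units:.3f}")
```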


Logistic Regression: GOF

  • Overall Model Likelihood-Ratio Chi-Square

    • Omnibus test for the model

    • Overall model fit?

      • Relative to other models

    • Compares specified model with Null model (no predictors)

    • χ² = -2*(LL0-LL1), DF = K, the number of predictor parameters estimated (the difference in parameters between the two models)
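A minimal sketch of the likelihood-ratio test for a model with one predictor (DF = 1). The null log-likelihood is computed from the 148/672 frequency table, which is what an intercept-only model gives. The fitted model's log-likelihood is NOT reported in the slides, so it is set here, as a labeled assumption, to the value that reproduces the χ² of 2.47 shown on the later GOF slide. The function name is hypothetical, and the erfc identity for the p-value holds only for DF = 1.

```python
import math

def lr_test_df1(ll_null, ll_model):
    """Likelihood-ratio chi-square and p-value for a 1-DF comparison."""
    chi2 = -2.0 * (ll_null - ll_model)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # valid for DF = 1 only
    return chi2, p_value

# Intercept-only (null) log-likelihood from the observed frequencies
n1, n0 = 148, 672
p = n1 / (n1 + n0)
ll_null = n1 * math.log(p) + n0 * math.log(1 - p)

# ASSUMPTION: chosen so that chi2 equals the 2.47 reported in the slides
ll_model = ll_null + 2.47 / 2

chi2, p_value = lr_test_df1(ll_null, ll_model)
print(f"LL0 = {ll_null:.2f}, LL1 = {ll_model:.2f}")
print(f"LR chi2 = {chi2:.2f}, p = {p_value:.4f}")
```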



Code 2.4

Logistic Regression: GOF (Summary Measures)

  • Pseudo-R2

    • Not the same meaning as linear regression.

    • There are many of them (Cox and Snell/McFadden)

    • Only comparable within nested models of the same outcome.

  • Hosmer-Lemeshow

    • Models with Continuous Predictors

    • Do the observed outcomes agree with the model's predicted probabilities? (χ²)

    • H0: Good Fit for Data, so we want p>0.05

    • Order the predicted probabilities, group them (g=10) by quantiles, then compute a Chi-Square of Group * Outcome using observed vs. expected counts. DF=g-2

    • Conservative (rarely rejects the null)

  • Pearson Chi-Square

    • Models with categorical predictors

    • Similar to Hosmer-Lemeshow

  • ROC-Area Under the Curve

    • Predictive accuracy/Classification
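The Hosmer-Lemeshow recipe described above can be sketched as follows. This is a simplified illustration, not the routine a statistics package uses: groups are cut by position after sorting (packages may handle ties differently), at least g observations are assumed, and the p-value helper uses a closed form that applies only to even DF. Function names and the toy data are invented for this example.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square p-value via the closed-form series for even DF."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even DF")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def hosmer_lemeshow(y, p_hat, g=10):
    """H-L statistic and DF: compare observed and expected events in g groups."""
    pairs = sorted(zip(p_hat, y))            # order by predicted probability
    n = len(pairs)
    stat = 0.0
    for k in range(g):
        group = pairs[k * n // g:(k + 1) * n // g]
        n_k = len(group)
        expected = sum(p for p, _ in group)  # expected number of events
        observed = sum(obs for _, obs in group)
        mean_p = expected / n_k
        stat += (observed - expected) ** 2 / (n_k * mean_p * (1.0 - mean_p))
    return stat, g - 2

# Toy data that is perfectly calibrated within each of 4 groups
p_hat = [0.2] * 5 + [0.4] * 5 + [0.6] * 5 + [0.8] * 5
y = [1, 0, 0, 0, 0] + [1, 1, 0, 0, 0] + [1, 1, 1, 0, 0] + [1, 1, 1, 1, 0]
toy_stat, toy_df = hosmer_lemeshow(y, p_hat, g=4)
print(f"toy H-L stat = {toy_stat:.4f}, df = {toy_df}")

# p-value for an H-L chi-square of 7.12 on 8 DF
print(f"p = {chi2_sf_even_df(7.12, 8):.4f}")
```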



Code 2.5

Logistic Regression: GOF (Diagnostic Measures)

  • Outliers in Y (Outcome)

    • Pearson Residuals

      • Square root of the contribution to the Pearson χ2

    • Deviance Residuals

      • Square root of the contribution to the likelihood-ratio test statistic of a saturated model vs fitted model.

  • Outliers in X (Predictors)

    • Leverage (Hat Matrix/Projection Matrix)

      • Maps the influence of observed on fitted values

  • Influential Observations

    • Pregibon’s Delta-Beta influence statistic

    • Similar to Cook’s-D in linear regression

  • Detecting Problems

    • Residuals vs Predictors

    • Leverage vs Residuals

    • Boxplot of Delta-Beta
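For a binary outcome with one observation per row, the Pearson and deviance residuals have simple closed forms. The sketch below assumes ungrouped 0/1 data and a fitted probability strictly between 0 and 1; the function names are made up for this example, and leverage and Delta-Beta are omitted because they need the full design matrix.

```python
import math

def pearson_residual(y, p):
    """(observed - fitted) scaled by the binomial standard deviation."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y, p):
    """Signed square root of the observation's contribution to the deviance."""
    log_lik = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return math.copysign(math.sqrt(-2.0 * log_lik), y - p)

# A depressed case (y=1) with a fitted probability of 0.1805
print(f"Pearson:  {pearson_residual(1, 0.1805):.3f}")
print(f"Deviance: {deviance_residual(1, 0.1805):.3f}")
```

Summing the squared Pearson residuals gives the Pearson χ², and summing the squared deviance residuals gives the model deviance.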


Logistic Regression: GOF

L-R χ2 (df=1): 2.47, p=0.1162

H-L GOF:

Number of Groups: 10

H-L Chi2: 7.12

DF: 8

P: 0.5233

McFadden’s R2: 0.0030



Code 2.6

Logistic Regression: Diagnostics

  • Linearity in the Log-Odds

    • Use a lowess (loess) plot

    • Depressed vs Age



Code 2.7

Logistic Regression: Example

Logistic Regression w/ Single Categorical Predictor:

AS OR:

Interpretation:

The odds of depression for males are 0.299 times the odds for females.

We could also say: The odds of depression are (1-0.299=.701) 70.1% lower in males compared to females.

Or…why not just make males the reference so the OR is greater than 1? Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34.
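Switching the reference group inverts the odds ratio and flips the sign of the logit, which a few lines of arithmetic confirm. The variable names below are chosen for illustration.

```python
import math

or_male_vs_female = 0.299                      # reported OR, females as reference

pct_lower = (1 - or_male_vs_female) * 100      # percent reduction in the odds
or_female_vs_male = 1 / or_male_vs_female      # males as reference instead

# On the log-odds scale the two codings differ only in sign
logit_male = math.log(or_male_vs_female)
logit_female = math.log(or_female_vs_male)

print(f"{pct_lower:.1f}% lower odds for males")
print(f"OR (females vs males) = {or_female_vs_male:.2f}")
print(f"logits: {logit_male:.3f} and {logit_female:.3f}")
```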


Ordinal Logistic Regression

  • Also called Ordered Logistic or Proportional Odds Model

  • Extension of Binary Logistic Model

  • >2 Ordered responses

  • New Assumption!

    • Proportional Odds

      • BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)

      • The predictor's effect on the outcome is the same across levels of the outcome.

        • Bmi3grp (1 vs 2,3) = B(age)

        • Bmi3grp (1,2 vs 3) = B(age)


Ordinal Logistic Regression

  • The Model

    • A latent variable model (Y*)

    • j= number of levels-1

    • From the equation we can see that the odds ratio is assumed to be independent of the category j
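The slide's equation is not in this transcript. One common way to write the proportional odds model for an outcome with J ordered levels is shown below; the notation (cutpoints α_j, slope β) and the minus sign are assumptions, chosen so that a positive β means higher categories become more likely, which matches how the coefficients are read in the example that follows.

```latex
\[
\mathrm{logit}\big[P(Y \le j \mid x)\big] = \alpha_j - \beta x,
\qquad j = 1, \dots, J-1
\]
\[
\frac{\mathrm{odds}(Y > j \mid x_1)}{\mathrm{odds}(Y > j \mid x_2)} = e^{\beta (x_1 - x_2)}
\]
```

Only the cutpoints α_j depend on j. The slope β does not, so the odds ratio in the second line is the same for every split of the outcome.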



Code 3.1

Ordinal Logistic Regression Example

AS LOGITS:

For a 1 unit increase in Blood Pressure there is a 0.012 increase in the log-odds of being in a higher bmi category

AS OR:

For a 1 unit increase in Blood Pressure the odds of being in a higher bmi category are 1.012 times greater.
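The proportional odds idea can be illustrated numerically: with one slope shared across splits, a 1 unit increase in blood pressure multiplies the odds of being in a higher category by the same factor at both cutpoints. In this sketch the slope 0.012 comes from the slide, while the cutpoints and blood pressure values are hypothetical, and the sign convention (cutpoint minus beta times x) is an assumption.

```python
import math

def cumulative_probs(x, cutpoints, beta):
    """P(Y <= j | x) for each cutpoint, assuming logit = cutpoint - beta * x."""
    return [1.0 / (1.0 + math.exp(-(cut - beta * x))) for cut in cutpoints]

def odds_of_higher(x, cutpoints, beta):
    """Odds of Y > j versus Y <= j for each split j."""
    return [(1.0 - p) / p for p in cumulative_probs(x, cutpoints, beta)]

beta_bp = 0.012            # slope reported on the slide
cutpoints = [1.0, 2.5]     # HYPOTHETICAL cutpoints (1 vs 2,3 and 1,2 vs 3)

odds_at_120 = odds_of_higher(120, cutpoints, beta_bp)
odds_at_121 = odds_of_higher(121, cutpoints, beta_bp)
ratios = [hi / lo for hi, lo in zip(odds_at_121, odds_at_120)]

print("odds ratios per split:", [round(r, 4) for r in ratios])
```

Both ratios equal exp(0.012), whatever cutpoints are chosen; that equality is what the proportional odds assumption asserts.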



Code 3.2

Ordinal Logistic Regression: GOF

  • Assessing Proportional Odds Assumptions

    • Brant Test of Parallel Regression

      • H0: Proportional Odds, thus want p >0.05

      • Tests each predictor separately and overall

    • Score Test of Parallel Regression

      • H0: Proportional Odds, thus want p >0.05

    • Approx Likelihood-ratio test

      • H0: Proportional Odds, thus want p >0.05



Code 3.3

Ordinal Logistic Regression: GOF

  • Pseudo R2

  • Diagnostics Measures

    • Performed on the j-1 binomial logistic regressions


Multinomial Logistic Regression

  • Also called multinomial logit/polytomous logistic regression.

  • Same assumptions as the binary logistic model

  • >2 non-ordered responses

    • Or you’ve failed to meet the proportional odds (parallel regression) assumption of the Ordinal Logistic model


Multinomial Logistic Regression

  • The Model

    • j= levels for the outcome

    • J=reference level

    • where x is a fixed setting of an explanatory variable

    • Notice how it appears we are estimating a Relative Risk and not an Odds Ratio. It’s actually an OR.

    • Similar to conducting separate binary logistic models, but with better Type I error control
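The slide's equation is missing from this transcript. The baseline-category logit model is usually written as below, with J as the reference level; the intercepts α_j and slopes β_j are an assumed notation.

```latex
\[
\ln\!\left(\frac{\pi_j(x)}{\pi_J(x)}\right) = \alpha_j + \beta_j x,
\qquad j = 1, \dots, J-1
\]
```

The left side is the log of a ratio of two probabilities, which is why it can look like a relative risk. Because categories j and J are compared only with each other, the ratio is the odds of j versus J, and e^{β_j} is an odds ratio.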



Code 4.1

Multinomial Logistic Regression Example

Does degree of supernatural belief indicate a religious preference?

AS OR:

For a 1 unit increase in supernatural belief, there is a 21.8% increase [(OR-1)*100 = % change] in the odds of being an Evangelical compared to Catholic.
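The difference between a change in odds and a change in probability can be seen in a small baseline-category sketch. Everything here is illustrative: the OR of 1.218 is the value implied by the 21.8% figure, while the intercepts, the "Other" category, and its slope are hypothetical, as is the function name.

```python
import math

def baseline_category_probs(x, coefs, reference):
    """Category probabilities; the reference level has linear predictor 0."""
    linear = {name: alpha + beta * x for name, (alpha, beta) in coefs.items()}
    linear[reference] = 0.0
    denom = sum(math.exp(v) for v in linear.values())
    return {name: math.exp(v) / denom for name, v in linear.items()}

# (intercept, slope) per non-reference category; HYPOTHETICAL except OR = 1.218
coefs = {
    "Evangelical": (-1.0, math.log(1.218)),
    "Other": (0.2, -0.1),
}

p_low = baseline_category_probs(2, coefs, reference="Catholic")
p_high = baseline_category_probs(3, coefs, reference="Catholic")

odds_low = p_low["Evangelical"] / p_low["Catholic"]
odds_high = p_high["Evangelical"] / p_high["Catholic"]
odds_ratio = odds_high / odds_low
prob_ratio = p_high["Evangelical"] / p_low["Evangelical"]

print(f"OR (Evangelical vs Catholic) per unit = {odds_ratio:.3f}")
print(f"ratio of Evangelical probabilities    = {prob_ratio:.3f}")
```

The odds ratio is the same at any starting value of belief, while the ratio of probabilities depends on the starting value and on the other categories.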


Multinomial Logistic Regression GOF

  • Limited GOF tests.

    • Look at LR Chi-square and compare nested models.

    • “Essentially, all models are wrong, but some are useful” –George E.P. Box

  • Pseudo R2

  • Similar to Ordinal

    • Perform tests on the j-1 binomial logistic regressions


Resources

“Categorical Data Analysis” by Alan Agresti

UCLA Stat Computing:

http://www.ats.ucla.edu/stat/

