The Analysis of Categorical Data





PowerPoint Slideshow about 'The Analysis of Categorical Data' - butch

Presentation Transcript
Categorical variables
  • Used when both the predictor and the response variables are categorical, for example:
      • Presence or absence
      • Color, etc.
  • The data in such a study are counts (frequencies) of observations in each category
Two way Contingency Tables
  • Contingency tables must be analyzed using the raw counts, not percentages, proportions, or relative frequencies derived from the data
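A minimal sketch of how a two-way table of raw counts is tallied from individual observations. The records below are hypothetical (they are not the data from the slides), and `crosstab` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical raw observations: one (sex, cause of death) pair per animal.
records = [
    ("male", "predation"), ("male", "other"), ("female", "predation"),
    ("male", "predation"), ("female", "other"), ("female", "other"),
]

def crosstab(pairs):
    """Return row labels, column labels, and the table of raw counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = crosstab(records)
print(rows, cols, table)
```

The table keeps the sample size and the marginal totals, which the tests on the following slides need and which are lost once cells are converted to percentages.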
Sex, cause of death, and bone marrow type
  • Sex (males / females)
  • Cause of death (predation / other)
  • Bone marrow type:
      • Solid white fatty (healthy animal)
      • Opaque gelatinous
      • Translucent gelatinous
Contingency table

Sex * Death Crosstabulation

Contingency table

Sex * Marrow Crosstabulation

Contingency table

Death * Marrow Crosstabulation

Are the variables independent?

We want to know, for example, whether males are more likely to die by predation than females

  • Specifying the null hypothesis:
  • The predictor and response variable are not associated with each other. The two variables are independent of each other and the observed degree of association is not stronger than we would expect by chance or random sampling
Calculating the expected values
  • The expected value is the total number of observations (N) times the probability of an individual being both male and killed by predation
The probability of two independent events

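For two independent events the joint probability is the product of the individual probabilities: $P(\text{male} \cap \text{predation}) = P(\text{male}) \times P(\text{predation})$. Because we have no information other than the data, we estimate each of the probabilities on the right-hand side of this equation from the marginal totals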
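As a sketch of this calculation: with row total R_i, column total C_j, and grand total N, the expected count in a cell is N × (R_i / N) × (C_j / N) = R_i × C_j / N. The counts below are hypothetical and `expected_counts` is an illustrative name.

```python
def expected_counts(observed):
    """Expected cell counts under independence, from the marginal totals."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # N * (row_total / N) * (col_total / N) simplifies to row_total * col_total / N
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical 2 x 2 table: rows = sex, columns = cause of death.
observed = [[30, 20],
            [25, 25]]
expected = expected_counts(observed)
print(expected)
```

The expected table has the same row and column totals as the observed table; only the allocation of counts within the margins changes.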

Contingency table

Sex * Death expected values

Testing the hypothesis: Pearson’s Chi-square test

$\chi^2 = 0.0866$, $P = 0.7685$

= 0.0253, P=0.8736

Calculating the P-value
  • We find the probability of obtaining a value of $\chi^2$ at least as large as 0.0866 from a $\chi^2$ distribution with 1 degree of freedom
  • P = 0.769
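The two steps above can be sketched with the standard library only. For 1 degree of freedom the upper-tail probability of a chi-square variable equals erfc(sqrt(x / 2)), which is used here as an assumption in place of a statistics package. The 2 x 2 counts are hypothetical; only the value 0.0866 comes from the slides.

```python
import math

def pearson_chi2(observed):
    """Pearson statistic: sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            stat += (o - e) ** 2 / e
    return stat

def chi2_sf_1df(x):
    """P(X >= x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

chi2_hypothetical = pearson_chi2([[30, 20], [25, 25]])  # hypothetical counts
p_slides = chi2_sf_1df(0.0866)  # statistic reported on the slide
print(round(chi2_hypothetical, 4), round(p_slides, 3))
```

Applied to the reported statistic of 0.0866, this reproduces the P-value of about 0.769.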
An alternative
  • The likelihood ratio test: It compares observed values with the distribution of expected values based on the multinomial probability distribution

= 0.0866
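A hedged sketch of the likelihood ratio statistic in its usual computational form, G² = 2 Σ O ln(O / E), where O and E are the observed and expected counts in each cell. The counts are hypothetical and `g_statistic` is an illustrative name.

```python
import math

def g_statistic(observed):
    """Likelihood ratio statistic: 2 * sum of O * ln(O / E) over all cells."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            if o > 0:  # a cell with zero observations contributes nothing
                e = row_totals[i] * col_totals[j] / n
                total += o * math.log(o / e)
    return 2.0 * total

g_hypothetical = g_statistic([[30, 20], [25, 25]])   # hypothetical counts
g_independent = g_statistic([[10, 20], [20, 40]])    # rows exactly proportional
print(round(g_hypothetical, 4), g_independent)
```

When the observed counts match the expected counts exactly, the statistic is zero; for large samples it is usually close to the Pearson statistic.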

Two way contingency tables
  • Sex * Death Crosstabulation:
  • Sex * Marrow Crosstabulation:
  • Marrow * Death Crosstabulation:
Log-linear models
  • They treat the cell frequencies as counts distributed as a Poisson random variable
  • The expected cell frequencies are modeled against the variables using the log-link and Poisson error term
  • They are fit and parameters estimated using maximum likelihood techniques
Log-linear models
  • Do not distinguish response and predictor variables: all the variables are considered equally as response variables
  • A logit model with categorical variables can be analyzed as a log-linear model
Two way tables
  • For a two-way table (I by J) we can fit two log-linear models
  • The first is a saturated (full) model:
  • $\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$
  • $f_{ij}$ is the expected frequency in cell ij
  • $\lambda_i^X$ is the effect of category i of variable X
  • $\lambda_j^Y$ is the effect of category j of variable Y
  • $\lambda_{ij}^{XY}$ is the effect of any interaction between X and Y
  • This model fits the observed frequencies perfectly
  • An effect does not imply causality; it is just the influence of a variable, or of an interaction between variables, on the log of the expected number of observations in a cell
Two way tables
  • The second log-linear model represents independence of the two variables (X and Y) and is a reduced model:
  • $\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y$
  • The interpretation of this model is that the log of the expected frequency in any cell is the mean of the logs of all the expected frequencies (the constant) plus the effect of variable X and the effect of variable Y. This is an additive linear model with no interaction between the two variables
  • The parameters of the log-linear models are the effects of a particular category of each variable on the expected frequencies:
  • i.e. a larger λ means that the expected frequencies will be larger for that category.
  • These parameters are also deviations from the mean of the logs of all the expected frequencies
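The interpretation above can be illustrated with a small sketch. It assumes sum-to-zero (deviation) coding, so that the constant is the mean of the log expected frequencies and each λ is a deviation from it; other software may use a different parameterization. The 2 x 3 counts are hypothetical and the function name is illustrative.

```python
import math

def independence_parameters(observed):
    """Constant and sum-to-zero effects of the independence log-linear model."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Fitted (expected) frequencies under independence, on the log scale.
    log_f = [[math.log(r * c / n) for c in col_totals] for r in row_totals]
    n_rows, n_cols = len(row_totals), len(col_totals)
    constant = sum(map(sum, log_f)) / (n_rows * n_cols)
    lam_x = [sum(row) / n_cols - constant for row in log_f]        # row effects
    lam_y = [sum(col) / n_rows - constant for col in zip(*log_f)]  # column effects
    return constant, lam_x, lam_y, log_f

# Hypothetical 2 x 3 table: rows = sex, columns = marrow type.
constant, lam_x, lam_y, log_f = independence_parameters([[12, 8, 10], [6, 14, 10]])
print(constant, lam_x, lam_y)
```

Under this coding the effects of each variable sum to zero, and constant + λ for the row + λ for the column reproduces every log expected frequency exactly, because the model has no interaction term.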
Null hypothesis of independence
  • The null hypothesis (H0) is that the sampling or experimental units come from a population of units in which the two variables (rows and columns) are independent of each other in terms of the cell frequencies
  • It is also a test that $\lambda_{ij}^{XY} = 0$:
  • There is NO interaction between the two variables
  • We can test this H0 by comparing the fit of the model without this term to the fit of the saturated model, which includes it
  • We determine the fit of each model by calculating the expected frequencies under each model, comparing the observed and expected frequencies, and calculating the log-likelihood of each model
  • We then compare the fit of the two models with the likelihood ratio test statistic $\Delta$
  • However, the sampling distribution of this ratio ($\Delta$) is not well known, so instead we calculate the $G^2$ statistic:
  • $G^2 = -2 \log \Delta$
  • $G^2$ approximately follows a $\chi^2$ distribution for reasonable sample sizes and can be generalized to
  • $G^2 = -2\,(\text{log-likelihood of reduced model} - \text{log-likelihood of full model})$
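A minimal sketch of this comparison for a two-way table, assuming Poisson cell counts. The saturated model's fitted values are the observed counts, and the independence model's fitted values are row total × column total / N. The counts are hypothetical and the function names are illustrative.

```python
import math

def poisson_loglik(observed, fitted):
    """Poisson log-likelihood of the observed counts given fitted frequencies."""
    total = 0.0
    for obs_row, fit_row in zip(observed, fitted):
        for o, f in zip(obs_row, fit_row):
            if o > 0:  # the term o * log(f) is taken as 0 when o == 0
                total += o * math.log(f)
            total += -f - math.lgamma(o + 1)  # lgamma(o + 1) = log(o!)
    return total

def independence_fit(observed):
    """Fitted frequencies of the reduced (independence) model."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

observed = [[30, 20], [25, 25]]  # hypothetical counts
ll_full = poisson_loglik(observed, observed)                     # saturated model
ll_reduced = poisson_loglik(observed, independence_fit(observed))
g2 = -2.0 * (ll_reduced - ll_full)
print(round(g2, 4))
```

Because both models reproduce the grand total, the log(o!) and fitted-total terms cancel and the result equals 2 Σ O ln(O / E).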
Degrees of freedom
  • The calculated $G^2$ is compared to a $\chi^2$ distribution with (I-1)(J-1) df.
  • This df, (I-1)(J-1), is the difference between the df for the full model (IJ-1) and the df for the reduced model [(I-1)+(J-1)]
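The subtraction can be checked with a few lines of arithmetic; `independence_df` is an illustrative name.

```python
def independence_df(n_rows, n_cols):
    """Degrees of freedom for testing independence in an I x J table."""
    df_full = n_rows * n_cols - 1                 # saturated model: IJ - 1
    df_reduced = (n_rows - 1) + (n_cols - 1)      # independence model
    return df_full - df_reduced                   # equals (I - 1)(J - 1)

print(independence_df(2, 2), independence_df(2, 3))
```

For example, the sex by cause of death table (2 x 2) has 1 df, and the sex by marrow type table (2 x 3) has 2 df.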
Three way interaction
  • Death*Sex*Marrow
  • Models compared 8 vs 9
  • $G^2$ = 7.19
  • df = 2
  • P = 0.027
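The P-value for a model comparison like this one can be checked without a statistics package. The sketch below uses the closed-form upper-tail probability of the chi-square distribution for integer degrees of freedom; `chi2_sf` is an illustrative name, not a library function.

```python
import math

def chi2_sf(x, df):
    """P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2)
    #         * sum_{j=1}^{(df-1)/2} x^(j - 1/2) / (1 * 3 * ... * (2j - 1))
    term, total = math.sqrt(x), 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 / math.pi) * math.exp(-half) * total)

p_three_way = chi2_sf(7.19, 2)  # G2 and df reported for models 8 vs 9
print(round(p_three_way, 3))
```

With $G^2$ = 7.19 and 2 df this gives a P-value of about 0.027, in agreement with the value reported above.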
Conditional independence

Death and marrow have a partial association

Complete independence
  • Models compared 1 vs 8
  • $G^2$ = 35.57
  • df = 5
  • P < 0.001
  • Always fit a saturated model first, containing all the variables of interest and all the interactions involving the (potential) nuisance variables. Only delete from the model the interactions that involve the variables of interest.