Chapter 9: Analysis of Two-Way Tables

Two-way (i.e. contingency) tables: to classify & analyze categorical data:

• Binomial counts: ‘success’ vs. ‘failure’

• Proportions: binomial count divided by total sample size

Here's a two-way table:

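. tab hsci white, cell

           |  Nonwhite      White |     Total
-----------+----------------------+----------
  not hsci |        50        103 |       153
           |     25.0%      51.5% |     76.5%
-----------+----------------------+----------
      hsci |         5         42 |        47
           |      2.5%      21.0% |     23.5%
-----------+----------------------+----------
     Total |        55        145 |       200
           |     27.5%      72.5% |    100.0%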

• The row variable: hsci vs. not hsci. The column variable: white vs. nonwhite.

• Cells: each combination of values for the two variables (50, 103, 5, 42).

• Joint distributions: Each cell’s percentage of the total sample (50/200=.250; 103/200=.515; 5/200=.025; 42/200=.210).

• The marginal frequencies: the row totals (153, 47) & the column totals (55, 145).

• The marginal distributions: each row total/sample total (76.5%, 23.5%). Each column total/sample total (27.5%, 72.5%).
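The joint and marginal distributions above can be recomputed from the four cell counts. What follows is a minimal Python sketch (outside Stata, standard library only). The counts come from the table above; the variable names are illustrative assumptions.

```python
# Observed cell counts from the hsci-by-white table above.
counts = {
    ("not hsci", "nonwhite"): 50,
    ("not hsci", "white"): 103,
    ("hsci", "nonwhite"): 5,
    ("hsci", "white"): 42,
}
n = sum(counts.values())  # total sample size

# Joint distribution: each cell count divided by the total sample size.
joint = {cell: count / n for cell, count in counts.items()}

# Marginal frequencies: row totals and column totals.
row_totals, col_totals = {}, {}
for (row, col), count in counts.items():
    row_totals[row] = row_totals.get(row, 0) + count
    col_totals[col] = col_totals.get(col, 0) + count

# Marginal distributions: each marginal frequency divided by the total.
row_marginal = {row: total / n for row, total in row_totals.items()}
col_marginal = {col: total / n for col, total in col_totals.items()}

for cell, share in joint.items():
    print(cell, f"{share:.1%}")
print("row totals:", row_totals, "column totals:", col_totals)
```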

Here are the same data displayed as column conditional probabilities:

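. tab hsci white, col nofreq

           |  nonwhite      white |     Total
-----------+----------------------+----------
        no |    90.91%     71.03% |    76.50%
       yes |     9.09%     28.97% |    23.50%
-----------+----------------------+----------
     Total |   100.00%    100.00% |   100.00%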

• The conditional distributions (i.e. conditional probabilities): Column: divide each column cell count by its column total count.

Here are the same data displayed as row conditional probabilities:

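. tab hsci white, row nofreq

           |  nonwhite      white |     Total
-----------+----------------------+----------
  not hsci |    32.68%     67.32% |   100.00%
      hsci |    10.64%     89.36% |   100.00%
-----------+----------------------+----------
     Total |    27.50%     72.50% |   100.00%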

• The conditional distributions (i.e. conditional probabilities): Row: divide each row cell count by its row total count.
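The column and row conditional distributions shown above can be reproduced from the same cell counts. This is a minimal Python sketch, not Stata output; the list layout and names are assumptions made for illustration.

```python
# Observed counts; rows are (not hsci, hsci), columns are (nonwhite, white).
counts = [
    [50, 103],  # not hsci
    [5, 42],    # hsci
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Column conditional distribution: cell count / its column total.
col_cond = [[c / col_totals[j] for j, c in enumerate(row)] for row in counts]

# Row conditional distribution: cell count / its row total.
row_cond = [[c / row_totals[i] for c in row] for i, row in enumerate(counts)]

print("column percentages:")
for row in col_cond:
    print([f"{100 * p:.2f}%" for p in row])
print("row percentages:")
for row in row_cond:
    print([f"{100 * p:.2f}%" for p in row])
```

Each column of `col_cond` sums to 1 and each row of `row_cond` sums to 1, which matches the 100% totals in the corresponding Stata tables.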

Tip: It's usually best to compute conditional distributions (i.e. probabilities) across the categories of the explanatory variable.

• E.g., tab hsci white, col: computes the conditional distributions across the categories of the explanatory variable race-ethnicity (i.e. white vs. nonwhite).

• Alternatively, you may want to compare joint distributions (i.e. cell counts/total sample): tab hsci white, cell

We've discussed the following:

• row variables

• column variables

• cells: each combination of values for the two variables.

• joint distributions: each cell’s percentage of the total sample.

• marginal frequencies: the row totals & the column totals

• marginal distributions: each marginal frequency/total sample size

• column conditional distributions: divide each column cell count by its column total count

• row conditional distributions: divide each row cell count by its row total count.

Simpson's Paradox

• An NSF study found that the median salary of newly graduated female engineers & scientists was just 73% of the median salary for males. Here are women’s median salaries in the 16 fields as a percentage of male salaries:

• 94% 96% 98% 95% 85% 85% 84% 100% 103% 100% 107% 93% 104% 93% 106% 100%

• How can it be that, on average, the women earn just 73% of the median salary for males, since no listed % falls below 84%?

• Because women are disproportionately located in the lower-paying fields of engineering & science.

• That is, ‘field of science & engineering’ is a lurking variable (i.e. an unmeasured confounding variable) that influences the observed association between gender & salary.

• Simpson’s Paradox: the reversal of a bivariate relationship due to the influence of a lurking variable.

• Aggregating data has the effect of ignoring one or more lurking variables.

• Another example: comparing hospital mortality rates.

• Yet another: comparing airline on-time rates.
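To make the hospital comparison concrete, here is a small Python sketch. The counts are HYPOTHETICAL, invented purely for illustration (they are not from the NSF study or any real hospital). Patient condition plays the role of the lurking variable.

```python
# HYPOTHETICAL (deaths, patients) counts, invented purely for illustration.
patients = {
    "Hospital A": {"good": (6, 600), "poor": (57, 1500)},
    "Hospital B": {"good": (8, 600), "poor": (8, 200)},
}


def death_rate(deaths, total):
    """Proportion of patients who died."""
    return deaths / total


# Aggregated comparison: ignores patient condition (the lurking variable).
overall = {}
for hospital, groups in patients.items():
    deaths = sum(d for d, _ in groups.values())
    total = sum(t for _, t in groups.values())
    overall[hospital] = death_rate(deaths, total)

# Stratified comparison: holds patient condition constant.
by_condition = {
    condition: {h: death_rate(*patients[h][condition]) for h in patients}
    for condition in ("good", "poor")
}

print("overall:", overall)
for condition, rates in by_condition.items():
    print(condition, rates)
```

With these invented numbers, Hospital A has the higher overall death rate (63/2100 vs. 16/800) yet the lower death rate within each patient condition, because it treats many more patients in poor condition.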

Conclusion from Simpson's Paradox:

• Always be on the lookout for lurking variables with aggregated data!!

• A bivariate relationship may change direction when a third, control variable is introduced.

What's a control variable?

• Holding a variable constant makes it a control variable: doing so removes the part of the bivariate relationship that was caused by the control variable.

• That is, controlling for a variable neutralizes its influence on the observed relationship.

• E.g., controlling for field of science & engineering.

• E.g., controlling for race/ethnicity.

To repeat, holding a variable constant removes its statistical effects from the bivariate association being examined.

• Doing so ensures (more or less) that a bivariate relationship is assessed apart from the influence of the controlled variable: e.g., the relationship between a Montessori school program & student IQ scores, holding constant social class.

• A bivariate association may not appear until a third, control variable is introduced.

• The apparent absence of the bivariate relationship is called spurious non-association.

• E.g., no association between years of education & level of income in post-WW II data, until controlling for age of respondents.

Conclusion from Spurious Non-Association:

• Explore not just bivariate relationships but also multivariate relationships among all the variables of potential practical or theoretical relevance.

Describing Relations in Two-Way Tables

• The original data must be counts.

• Inference for two-way tables: compare the observed cell counts to the expected cell counts; then compute the Chi-square significance test.

We begin by computing the expected cell counts: row total times column total, divided by total sample size.

• Premise: the null hypothesis of ‘statistical independence’ (i.e. no association between the variables) characterizes the data.

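Expected cell counts: row total times column total, divided by total sample size.

           |                nonwhite                     white |  Total
-----------+---------------------------------------------------+-------
  not hsci |   (153*55)/200 = 42.075   (153*145)/200 = 110.925 |    153
      hsci |   (47*55)/200  = 12.925   (47*145)/200  =  34.075 |     47

(Each value can be checked in Stata with display, e.g. . di (153*55)/200)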

• How do the expected cell counts compare to the observed cell counts? Do the conditional probabilities of being in honors science (no vs. yes) appear to be equal for nonwhites & whites?
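The expected-count rule (row total times column total, divided by total sample size) can be applied to every cell at once. This is a minimal Python sketch using the observed counts from this chapter; the names are illustrative assumptions.

```python
# Observed counts; rows are (not hsci, hsci), columns are (nonwhite, white).
observed = [
    [50, 103],
    [5, 42],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for obs_row, exp_row in zip(observed, expected):
    for o, e in zip(obs_row, exp_row):
        print(f"observed {o:>3}   expected {e:>8.3f}   difference {o - e:>7.3f}")
```

In a 2x2 table every observed-minus-expected difference has the same absolute size, because the expected counts reproduce the observed row and column totals.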

• Each expected cell count is based on the proportion of the total sample accounted for by its entire row & by its entire column.

• The Chi-square test's null hypothesis is independence (i.e. no association): the conditional distribution of honors science enrollment is the same for nonwhites & whites.

• That is, each expected cell count reflects the null hypothesis of statistical independence (i.e. no association):

• that the proportion of honors science students who are nonwhite is simply the proportion of nonwhite students in the population.

• that the proportion of honors science students who are white is simply the proportion of white students in the population.

• What’s the alternative hypothesis?

Chi-Square Test Assumptions

• Random sample

• Two categorical variables

• Count data

• Expected counts of at least 5 in at least 80% of the cells & no expected count below 1 (best if all expected counts are at least 5)
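As a rough check, the cell-size rule can be applied to the expected counts, which is the form in which this rule of thumb is usually stated. This is a minimal Python sketch; the function names and default thresholds are assumptions chosen to mirror the rule as worded above.

```python
def expected_counts(observed):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def cell_size_rule_ok(observed, threshold=5, share=0.80, minimum=1):
    """True if at least `share` of expected counts reach `threshold`
    and no expected count falls below `minimum`."""
    cells = [e for row in expected_counts(observed) for e in row]
    large_share = sum(e >= threshold for e in cells) / len(cells)
    return large_share >= share and min(cells) >= minimum


observed = [[50, 103], [5, 42]]  # the hsci-by-white table
print(cell_size_rule_ok(observed))
```

For the hsci-by-white table the smallest expected count is 12.925, so the rule is satisfied.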

• Chi-square statistic: measures how much the observed cell counts in a two-way table diverge from the expected cell counts.

• It’s therefore a test of independence:

Ho: the variables are independent from each other

Ha: they are not independent from each other

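Chi-square statistic: the sum, over all cells, of (observed cell count – expected cell count) squared, divided by the cell's expected count. It takes non-negative values only.

• Has a distinct distribution for each number of degrees of freedom, where df = (number of rows – 1) × (number of columns – 1) (see Moore/McCabe)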

• Two-sided hypothesis test only

Chi-square Test: To Repeat…

Test statistic: (observed cell count – expected cell count) squared, divided by the cell's expected count, summed over all cells.

• Chi-square statistic: measures how much the observed cell counts in a two-way table diverge from the expected cell counts.

• That is, it compares the sample distribution with a hypothesized distribution.

• It’s a test of statistical independence (Ho: no association; Ha: association).

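Hypothesis Test

Test statistic: (observed cell count – expected cell count) squared, divided by the cell's expected count, summed over all cells.

Ho: proportion in hsci among whites = proportion in hsci among nonwhites

Ha: proportion in hsci among whites ~= proportion in hsci among nonwhites (i.e. two-sided alternative)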

• Chi-square test: two-sided alternative hypothesis only.

. tab hsci white, cell chi2

• Pearson chi2(1) = 8.7613 Pr = 0.003

• Conclusion: Reject the null hypothesis.
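The Pearson chi-square value reported by Stata can be recomputed by hand from the observed and expected counts. This is a minimal Python sketch using only the standard library. The p-value shortcut via `math.erfc` is valid only for 1 degree of freedom, as in this 2x2 table; the variable names are assumptions.

```python
import math

# Observed counts; rows are (not hsci, hsci), columns are (nonwhite, white).
observed = [
    [50, 103],
    [5, 42],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail p-value. This closed form holds ONLY when df == 1.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2({df}) = {chi2:.4f}, Pr = {p_value:.3f}")
```

For tables larger than 2x2 the `erfc` shortcut no longer applies, and a chi-square distribution function from a statistics library would be needed instead.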

When we add a control variable to the two-way table:

But when adding a control variable, beware of the consequences for sub-sample sizes.

• If the sub-samples are too small, it may be hard to obtain statistical significance.

• So always check the size of sub-samples.

• For a 2x2 table, we can compare two population proportions either by the chi-square test or by the two-sided two-sample z-test. They give exactly the same p-value, because the chi-square statistic equals the square of the z statistic.

• Chi-square test advantage: can compare more than two populations (e.g., SES by race-ethnicity in hsb2.dta); but the original data must be counts.

• z-test advantage: can test either one-sided or two-sided alternatives; the original data may be counts or proportions.

. prtest hsci, by(white)

Two-sample test of proportion             nonwhite: Number of obs =       55
                                             white: Number of obs =      145
------------------------------------------------------------------------------
    Variable |      Mean   Std. Err.        z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
    nonwhite |  .0909091   .0387638   2.34521   0.0190    .0149335    .1668847
       white |  .2896552   .0376696   7.68936   0.0000    .2158241    .3634863
-------------+----------------------------------------------------------------
        diff | -.1987461   .0540521                      -.3046863   -.0928059
             | under Ho:   .0671451  -2.95995   0.0031
------------------------------------------------------------------------------
Ho: proportion(nonwhite) - proportion(white) = diff = 0

    Ha: diff < 0            Ha: diff ~= 0             Ha: diff > 0
    z = -2.960              z = -2.960                z = -2.960
    P < z = 0.0015          P > |z| = 0.0031          P > z = 0.9985

. tab hsci white, cell chi2

Pearson chi2(1) = 8.7613 Pr = 0.003

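• Pr = .003 for both the Chi-square test & the two-sided z-test; the squared z statistic, (-2.95995)^2, equals the chi-square statistic, 8.7613.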
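The agreement between the two tests can be checked by recomputing the z statistic with the pooled standard error (the "under Ho" line of the prtest output) and squaring it. This is a minimal Python sketch using only the standard library; the variable names are assumptions.

```python
import math

# Honors science counts by group, taken from the two-way table.
x1, n1 = 5, 55     # nonwhite: in hsci, group size
x2, n2 = 42, 145   # white:    in hsci, group size

p1, p2 = x1 / n1, x2 / n2

# Pooled proportion and standard error under Ho: p1 = p2.
pooled = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

z = (p1 - p2) / se_pooled

# Two-sided p-value from the standard normal distribution.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.5f}, z squared = {z ** 2:.4f}, P>|z| = {p_two_sided:.4f}")
```

Only the two-sided p-value matches the chi-square test; the one-sided alternatives shown in the prtest output have no chi-square counterpart.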

Other Useful Stata Commands

• findit tabchi: displays observed & expected frequencies, various types of residuals (raw, pearson, adjusted), & various tests of significance.

• findit tabout: to make publication-style contingency & other tables

• See the following class documents: ‘Making contingency tables in Stata’; ‘Making working & publication-style tables in Stata’.

Summary: Two-Way Tables

• Two-way tables classify categorical data as binomial counts (‘success’ vs. ‘failure’) or proportions (binomial counts divided by the total sample size); for the Chi-square test, the original data must be counts

• Row variables? Column variables? Cells?

Finally, when should we use contingency tables & the Chi-square test?

• As part of bivariate exploratory data analysis, including in preparation for regression analysis

• Or when we don’t have enough observations to do regression analysis (i.e. perhaps categorize the data and do cross-tabs).

. tabulate educ age [freq=years], chi2

. tabulate educ age [freq=years], cell chi2

• The ‘row’ or ‘col’ options may be preferable to use.