
Chapter 9 Analysis of Two-Way Tables


Two-way (i.e. contingency) tables: to classify & analyze categorical data:

- Binomial counts: ‘success’ vs. ‘failure’
- Proportions: binomial count divided by total sample size

- We’ll later see that inference via two-way tables is an alternative—with advantages & disadvantages—to the z-test for comparing two sample proportions:
. prtest hsci, by(white)

- An advantage of two-way tables is that they can compare more than two groups (i.e. variables with more than two categories).
- A disadvantage of two-way tables is that their Chi-square test allows only two-sided hypothesis tests.

- Here’s a two-way table:
. tab hsci white, cell

             Nonwhite      White      Total
not hsci           50        103        153
                25.0%      51.5%      76.5%
hsci                5         42         47
                 2.5%      21.0%      23.5%
Total              55        145        200
                27.5%      72.5%     100.0%

- The row variable: hsci vs. not hsci. The column variable: white vs. nonwhite.

- Cells: each combination of values for the two variables (50, 103, 5, 42).

- Joint distributions: Each cell’s percentage of the total sample (50/200=.250; 103/200=.515; 5/200=.025; 42/200=.210).

- The marginal frequencies: the row totals (153, 47) & the column totals (55, 145).

- The marginal distributions: each row total/sample total (76.5%, 23.5%). Each column total/sample total (27.5%, 72.5%).
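The definitions above can be checked by hand. Here is a minimal sketch in Python (outside Stata, used only to verify the arithmetic) that rebuilds the joint distribution, the marginal frequencies, and the marginal distributions from the four cell counts. The variable names are illustrative and not part of the course materials.

```python
# Observed cell counts from the hsci-by-white table in the text.
counts = {
    ("not hsci", "nonwhite"): 50,
    ("not hsci", "white"): 103,
    ("hsci", "nonwhite"): 5,
    ("hsci", "white"): 42,
}

n = sum(counts.values())  # total sample size

# Joint distribution: each cell count divided by the total sample size.
joint = {cell: count / n for cell, count in counts.items()}

# Marginal frequencies: the row totals and the column totals.
row_totals = {}
col_totals = {}
for (row, col), count in counts.items():
    row_totals[row] = row_totals.get(row, 0) + count
    col_totals[col] = col_totals.get(col, 0) + count

# Marginal distributions: each marginal frequency divided by the total.
row_marginal = {row: total / n for row, total in row_totals.items()}
col_marginal = {col: total / n for col, total in col_totals.items()}

print("n =", n)
print("joint:", joint)
print("row totals:", row_totals, "column totals:", col_totals)
print("row marginal:", row_marginal, "column marginal:", col_marginal)
```

The printed values should agree with the percentages that `tab hsci white, cell` reports.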

- Here are the same data displayed as column conditional probabilities:
. tab hsci white, col nofreq

          nonwhite      white      Total
no          90.91%     71.03%     76.50%
yes          9.09%     28.97%     23.50%
Total      100.00%    100.00%    100.00%

- The column conditional distributions (i.e. conditional probabilities): divide each cell count by its column total.

- Here are the same data displayed as row conditional probabilities:
. tab hsci white, row nofreq

            nonwhite      white      Total
not hsci      32.68%     67.32%    100.00%
hsci          10.64%     89.36%    100.00%
Total         27.50%     72.50%    100.00%

- The row conditional distributions (i.e. conditional probabilities): divide each cell count by its row total.

- Tip: It’s usually best to compute conditional distributions (i.e. probabilities) within each category of the explanatory variable, then compare them across those categories.
- E.g., tab hsci white, col: computes the conditional distribution of hsci within each category of the explanatory variable race-ethnicity (i.e. white vs. nonwhite).
- Alternatively, you may want to compare joint distributions (i.e. cell counts/total sample): tab hsci white, cell
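To see what the col and row options are doing, here is a small Python sketch (a cross-check outside Stata, with illustrative variable names) that computes both sets of conditional percentages from the cell counts.

```python
# Observed counts: rows are not hsci / hsci, columns are nonwhite / white.
counts = [
    [50, 103],  # not hsci
    [5, 42],    # hsci
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Column conditional percentages: each cell divided by its column total
# (what "tab hsci white, col nofreq" reports).
col_pct = [
    [100 * counts[i][j] / col_totals[j] for j in range(2)]
    for i in range(2)
]

# Row conditional percentages: each cell divided by its row total
# (what "tab hsci white, row nofreq" reports).
row_pct = [
    [100 * counts[i][j] / row_totals[i] for j in range(2)]
    for i in range(2)
]

print("column %:", [[round(x, 2) for x in row] for row in col_pct])
print("row %:   ", [[round(x, 2) for x in row] for row in row_pct])
```

With race-ethnicity as the explanatory variable, the column percentages are the ones to compare: about 9% of nonwhites versus about 29% of whites are in honors science.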

We’ve discussed the following:

- row variables
- column variables
- cells: each combination of values for the two variables.
- joint distributions: each cell’s percentage of the total sample.

- marginal frequencies
- marginal distributions: each marginal frequency/total sample size
- column conditional distributions: divide each column cell count by its column total count
- row conditional distributions: divide each row cell count by its row total count.

- And we’ve said that typically it’s best to compute the conditional distributions (i.e. probabilities) across the categories of the explanatory variable.
- Or that it may be preferable to compare joint distributions (i.e. compare the cell probabilities).

- Let’s next consider conceptual problems of two-way tables.

- Simpson’s Paradox
- An NSF study found that the median salary of newly graduated female engineers & scientists was just 73% of the median salary for males. Here are women’s median salaries in the 16 fields as a percentage of male salaries:
- 94% 96% 98% 95% 85% 85% 84% 100% 103% 100% 107% 93% 104% 93% 106% 100%
- How can it be that, on average, the women earn just 73% of the median salary for males, since no listed % falls below 84%?

- Because women are disproportionately located in the lower-paying fields of engineering & science.
- That is, ‘field of science & engineering’ is a lurking variable (i.e. an unmeasured confounding variable) that influences the observed association between gender & salary.

- Simpson’s Paradox: the reversal of a bivariate relationship due to the influence of a lurking variable.
- Aggregating data has the effect of ignoring one or more lurking variables.
- Another example: comparing hospital mortality rates.
- Yet another: comparing airline on-time rates.

- Conclusion from Simpson’s Paradox:
- Always be on the lookout for lurking variables with aggregated data!
- A bivariate relationship may change direction when a third, control variable is introduced.
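Here is a small Python sketch of how such a reversal can arise in the hospital-mortality setting. The counts are hypothetical, invented purely for illustration (they are not from the lecture, the textbook, or any real hospital); patient condition plays the role of the lurking variable.

```python
# Hypothetical (deaths, patients) counts, invented for illustration only.
hospital_a = {"good condition": (1, 100), "poor condition": (45, 900)}
hospital_b = {"good condition": (18, 900), "poor condition": (8, 100)}


def death_rate(deaths, patients):
    """Proportion of patients who died."""
    return deaths / patients


def overall_rate(hospital):
    """Death rate after aggregating over patient condition."""
    deaths = sum(d for d, _ in hospital.values())
    patients = sum(p for _, p in hospital.values())
    return death_rate(deaths, patients)


# Within each condition, hospital A has the lower death rate.
stratum_rates = {}
for condition in hospital_a:
    rate_a = death_rate(*hospital_a[condition])
    rate_b = death_rate(*hospital_b[condition])
    stratum_rates[condition] = (rate_a, rate_b)
    print(condition, "A:", rate_a, "B:", rate_b)

# Aggregated over condition, the comparison reverses.
overall_a = overall_rate(hospital_a)
overall_b = overall_rate(hospital_b)
print("overall A:", overall_a, "overall B:", overall_b)
```

In this made-up example hospital A looks worse overall only because it treats far more patients in poor condition; holding condition constant removes the reversal.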

- What’s a control variable?
- Holding a variable constant makes it a control variable: doing so removes the part of the bivariate relationship that was due to the control variable.
- That is, controlling for a variable neutralizes its influence on the observed relationship.
- E.g., controlling for field of science & engineering.
- E.g., controlling for race/ethnicity.

- To repeat, holding a variable constant removes its statistical effects from the bivariate association being examined.
- Doing so ensures (more or less) that a bivariate relationship is assessed apart from the influence of the controlled variable: e.g., the relationship between a Montessori school program & student IQ scores, holding constant social class.

- What’s better: statistical control or experimental control?
- The answer returns us to the matter of observational study versus experimental study (see Moore/McCabe, chapter 3).

- Good experimental design controls for all possible lurking variables. Why?
- But statistical control cannot do so. Why not?
- Moreover, statistical control is weakened by the imprecision of measurement of variables.
- But we can’t experiment on everything.

- Let’s consider the following variant on Simpson’s Paradox:

- A bivariate association may not appear until a third, control variable is introduced.
- The apparent absence of the bivariate relationship is called spurious non-association.
- E.g., no association between years of education & level of income in post-WW II data, until controlling for age of respondents.

- Conclusion from Spurious Non-Association:
- Explore not just bivariate relationships but also multivariate relationships among all the variables of potential practical or theoretical relevance.

- Here’s how to add a control variable to a two-way table in Stata:
. bys female: tab hsci white, cell chi2

female = male

           nonwhite      white      Total
0                21         41         62
             23.08%     45.05%     68.13%
1                 2         27         29
              2.20%     29.67%     31.87%
Total            23         68         91
             25.27%     74.73%    100.00%

Pearson chi2(1) = 7.6120   Pr = 0.006

female = female

           nonwhite      white      Total
0                29         62         91
             26.61%     56.88%     83.49%
1                 3         15         18
              2.75%     13.76%     16.51%
Total            32         77        109
             29.36%     70.64%    100.00%

Pearson chi2(1) = 1.6744   Pr = 0.196

- This example introduces a test of statistical significance for two-way tables.
- The test is based on the Chi-square statistic.

- Note in the example that the two-way table for females is not statistically significant (Pr = 0.196), whereas the table for males is (Pr = 0.006).
- How do two-way tables & their test of significance evaluate the data?
- They do so by comparing expected & observed cell counts in terms of proportional distributions.

- Back to the two-way table without the control variable:
. tab hsci white, cell

             Nonwhite      White      Total
not hsci           50        103        153
                25.0%      51.5%      76.5%
hsci                5         42         47
                 2.5%      21.0%      23.5%
Total              55        145        200
                27.5%      72.5%     100.0%

- Describing Relations in Two-Way Tables
- The original data must be counts.
- Inference for two-way tables: compare the observed cell counts to the expected cell counts; then compute the Chi-square significance test.

- We begin by computing the expected cell counts: row total times column total, divided by total sample size.
- Premise: the null hypothesis of ‘statistical independence’ (i.e. no association between the variables) characterizes the data.

- Expected cell counts: row total times column total, divided by total sample size.
           nonwhite      white      Total
no               50        103        153
yes               5         42         47
Total            55        145        200

Expected counts, in the order no/nonwhite, no/white, yes/nonwhite, yes/white:

. di (153*55)/200
42.075
. di (153*145)/200
110.925
. di (47*55)/200
12.925
. di (47*145)/200
34.075

- How do the expected cell counts compare to the observed cell counts? Do the conditional probabilities of no-hsci & yes-hsci appear to be equal for nonwhites & whites?

- Expected count for each cell: its row total times its column total, divided by the total sample size.
- Each expected cell count is based on the proportion of the total sample accounted for by its entire row & by its entire column.
- The Chi-square test’s null hypothesis assumes independence (i.e. no association) between race-ethnicity & honors science: the conditional distributions for nonwhites & whites are the same.
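The expected-count rule (row total times column total, divided by total sample size) is easy to verify outside Stata. A minimal Python sketch, with illustrative variable names:

```python
# Observed counts: rows are no / yes (hsci), columns are nonwhite / white.
observed = [
    [50, 103],
    [5, 42],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for obs_row, exp_row in zip(observed, expected):
    print("observed:", obs_row, "expected:", [round(e, 3) for e in exp_row])
```

Note that the expected counts keep the same row & column totals as the observed counts; only the way the totals are spread across the cells changes.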

- That is, each expected cell count reflects the null hypothesis of statistical independence (i.e. no association):
- that the proportion of non-white honors science students is simply the proportion of non-white students in the population.
- that the proportion of white honors science students is simply the proportion of white students in the population.
- What’s the alternative hypothesis?

- Chi-Square Test Assumptions
- Random sample
- Two categorical variables
- Count data
- Expected counts of at least 5 in at least 80% of the cells & an expected count of no less than 1 in any cell (best if all expected counts are at least 5)

- If the assumptions are fulfilled, use the Chi-square test:
tab hsci white, chi2

- If the numbers of observations per cell don’t meet the assumptions, use ‘Fisher’s exact test’ (a non-parametric test, which may be very slow):
tab hsci white, exact

- Chi-square statistic: measures how much the observed cell counts in a two-way table diverge from the expected cell counts.
- It’s therefore a test of independence:
Ho: the variables are independent from each other

Ha: they are not independent from each other

- Step 1: Chi-square = sum over all cells of (observed cell count - expected cell count) squared, divided by the cell’s expected count
- Step 2: df = (# rows - 1) x (# columns - 1)
- Step 3: Chi-square significance test = compare the Chi-square statistic to the Chi-square distribution with df degrees of freedom to get the P-value

- Chi-square statistic: positive values only
- Has a distinct distribution for each degree of freedom (see Moore/McCabe)
- Two-sided hypothesis test only
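The three steps can be traced by hand for the hsci-by-white table. This is a minimal Python sketch (outside Stata, illustrative names). The P-value shortcut used here is an assumption that holds only for df = 1, where the Chi-square statistic is the square of a standard normal variable; larger tables need the general Chi-square distribution.

```python
import math

# Observed counts: rows are no / yes (hsci), columns are nonwhite / white.
observed = [
    [50, 103],
    [5, 42],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Step 1: sum over all cells of (observed - expected)^2 / expected.
chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(observed))
    for j in range(len(observed[0]))
)

# Step 2: degrees of freedom = (# rows - 1) * (# columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Step 3 (df = 1 only): P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print("chi2 =", round(chi2, 4), "df =", df, "p =", round(p_value, 4))
```

The result should match the Pearson chi2(1) line that `tab hsci white, chi2` reports.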

Chi-square Test: To Repeat…

- Chi-square statistic: measures how much the observed cell counts in a two-way table diverge from the expected cell counts.
- That is, it compares the sample distribution with a hypothesized distribution.
- It’s a test of statistical independence (Ho: no association; Ha: association).

- Step 1: Chi-square = sum over all cells of (observed cell count - expected cell count) squared, divided by the cell’s expected count
- Step 2: df = (# rows - 1) x (# columns - 1)
- Step 3: Chi-square significance test = compare the Chi-square statistic to the Chi-square distribution with df degrees of freedom to get the P-value

Hypothesis Test

Ho: proportion in hsci for whites = proportion in hsci for nonwhites

Ha: proportion in hsci for whites ~= proportion in hsci for nonwhites (i.e. two-sided alternative)

- Chi-square test: two-sided alternative hypothesis only.

. tab hsci white, cell chi2

           nonwhite      white      Total
no               50        103        153
              25.0%      51.5%      76.5%
yes               5         42         47
               2.5%      21.0%      23.5%
Total            55        145        200
              27.5%      72.5%     100.0%

Pearson chi2(1) = 8.7613   Pr = 0.003

- Conclusion: Reject the null hypothesis.

- Let’s repeat the earlier example to see what happens when we add a control variable to the two-way table:
. bys female: tab hsci white, cell chi2

female = male

           nonwhite      white      Total
0                21         41         62
             23.08%     45.05%     68.13%
1                 2         27         29
              2.20%     29.67%     31.87%
Total            23         68         91
             25.27%     74.73%    100.00%

Pearson chi2(1) = 7.6120   Pr = 0.006

female = female

           nonwhite      white      Total
0                29         62         91
             26.61%     56.88%     83.49%
1                 3         15         18
              2.75%     13.76%     16.51%
Total            32         77        109
             29.36%     70.64%    100.00%

Pearson chi2(1) = 1.6744   Pr = 0.196

- But when adding a control variable, beware of the consequences for sub-sample sizes.
- If the sub-samples are too small, it may be hard to obtain statistical significance.
- So always check the size of sub-samples.

- Remember: expected cell counts should be >= 5 in at least 80% of the cells, & never less than 1.
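One way to check sub-sample sizes after adding a control variable is to compute the expected counts within each sub-table. The Python sketch below does this for the male & female tables. It assumes the rule of thumb in its usual form, stated in terms of expected counts (at least 5 in at least 80% of cells, never below 1); the function names are illustrative.

```python
# Sub-sample tables from the text: rows are hsci = 0, 1;
# columns are nonwhite, white.
tables = {
    "male": [[21, 41], [2, 27]],
    "female": [[29, 62], [3, 15]],
}


def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def meets_rule_of_thumb(table):
    """True if >= 80% of expected counts are >= 5 and none is below 1."""
    cells = [e for row in expected_counts(table) for e in row]
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return share_at_least_5 >= 0.8 and min(cells) >= 1


results = {}
for name, table in tables.items():
    n = sum(sum(row) for row in table)
    smallest = min(e for row in expected_counts(table) for e in row)
    results[name] = (n, smallest, meets_rule_of_thumb(table))
    print(name, "n =", n, "smallest expected =", round(smallest, 2),
          "ok =", results[name][2])
```

Both sub-tables pass, but the female table only barely: its smallest expected count is just above 5, which is the kind of warning sign to look for before splitting the sample further.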

- For a 2x2 table, we can compare two population proportions either by the Chi-square test or by the two-sided two-sample z-test. They give exactly the same P-value, because the Chi-square statistic equals the square of the z statistic.
- Chi-square test advantage: can compare more than two populations (e.g., SES by race-ethnicity in hsb2.dta); but the original data must be counts.
- z-test advantage: can test either one-sided or two-sided alternatives; the original data may be counts or proportions.

. prtest hsci, by(white)

Two-sample test of proportion              nonwhite: Number of obs =       55
                                              white: Number of obs =      145
------------------------------------------------------------------------------
    Variable |       Mean  Std. Err.          z    P>|z|  [95% Conf. Interval]
-------------+----------------------------------------------------------------
    nonwhite |   .0909091   .0387638    2.34521   0.0190   .0149335   .1668847
       white |   .2896552   .0376696    7.68936   0.0000   .2158241   .3634863
-------------+----------------------------------------------------------------
        diff |  -.1987461   .0540521                      -.3046863  -.0928059
             |  under Ho:   .0671451   -2.95995   0.0031
------------------------------------------------------------------------------
Ho: proportion(nonwhite) - proportion(white) = diff = 0

 Ha: diff < 0             Ha: diff ~= 0             Ha: diff > 0
 z = -2.960               z = -2.960                z = -2.960
 P < z = 0.0015           P > |z| = 0.0031          P > z = 0.9985

. tab hsci white, cell chi2

           nonwhite      white      Total
no               50        103        153
             25.00%     51.50%     76.50%
yes               5         42         47
              2.50%     21.00%     23.50%
Total            55        145        200
             27.50%     72.50%    100.00%

Pearson chi2(1) = 8.7613   Pr = 0.003

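- Pr = .003 for both the Chi-square test & the two-sided z-test; the Chi-square statistic (8.7613) equals the z statistic squared ((-2.95995)^2).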
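The link between the two tests can be verified from the counts alone. Here is a minimal Python sketch (outside Stata, illustrative names) of the pooled two-sample z statistic, its one-sided & two-sided P-values, and its square.

```python
import math

# Successes (hsci = yes) and group sizes from the table in the text.
x_nonwhite, n_nonwhite = 5, 55
x_white, n_white = 42, 145

p_nonwhite = x_nonwhite / n_nonwhite
p_white = x_white / n_white

# Under Ho the two proportions are equal, so the standard error
# uses the pooled proportion.
pooled = (x_nonwhite + x_white) / (n_nonwhite + n_white)
se_under_h0 = math.sqrt(pooled * (1 - pooled) * (1 / n_nonwhite + 1 / n_white))

z = (p_nonwhite - p_white) / se_under_h0

# Standard normal tail areas via the complementary error function.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))   # P(|Z| > |z|)
p_lower = 0.5 * math.erfc(-z / math.sqrt(2))     # P(Z < z), for Ha: diff < 0

print("z =", round(z, 5))
print("z squared =", round(z ** 2, 4))
print("two-sided p =", round(p_two_sided, 4), "lower-tail p =", round(p_lower, 4))
```

The squared z statistic reproduces the Pearson Chi-square value, while only the z-test offers the one-sided P-value.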

- Other Useful Stata Commands
- findit tabchi: displays observed & expected frequencies, various types of residuals (raw, pearson, adjusted), & various tests of significance.
- findit tabout: to make publication-style contingency & other tables
- See the following class documents: ‘Making contingency tables in Stata’; ‘Making working & publication-style tables in Stata’.

- For greater depth concerning contingency tables & their various significance tests, see Agresti & Finlay, Statistical Methods for the Social Sciences, chap. 8.

- Summary: Two-Way Tables
- Two-way tables: categorical data—binomial counts (‘success’ vs. ‘failure’) or proportions (binomial counts divided by the total sample size); but the data must be counts
- Row variables? Column variables? Cells?

- Marginal frequencies? Marginal distributions?
- Joint distributions?
- Row & column conditional distributions?

- How to compute expected cell frequencies? What do they represent?
- Null hypothesis? Alternative hypothesis?
- How to compute the Chi-square test?
- How to compute its degrees of freedom?
- Chi-square assumptions?

- Advantages/disadvantages of inference via two-way tables versus inference via z-test for two sample proportions?
- Chi-square test of significance: equals the square of the z-test for comparing sample proportions, but the Chi-square test requires the original data to be counts.

- Simpson’s Paradox: aggregating the data ignores lurking variables.
- Moral of the story: beware of the relations portrayed in aggregated data (i.e. look out for lurking variables)!

- Spurious non-association: a bivariate association appears only when a third, control variable is introduced.
- Moral of this story: the same as for Simpson’s Paradox.

- Finally, when should we use contingency tables & the Chi-square test?
- As part of bivariate exploratory data analysis, including in preparation for regression analysis
- Or when we don’t have enough observations to do regression analysis (i.e. perhaps categorize the data and do cross-tabs).

Here’s how to do the tables for Moore/McCabe, problem 9.1:

. tabulate educ age [freq=years], chi2

               a1         a2         a3       Total
e1          5,325      9,152     16,035      30,512
e2         14,061     24,070     18,320      56,451
e3         11,659     19,926      9,662      41,247
e4         10,342     19,878      8,005      38,225
Total      41,387     73,026     52,022     166,435

Pearson chi2(6) = 9.6e+03   Pr = 0.000

. tabulate educ age [freq=years], cell nofreq chi2

               a1         a2         a3       Total
e1           3.20       5.50       9.63       18.33
e2           8.45      14.46      11.01       33.92
e3           7.01      11.97       5.81       24.78
e4           6.21      11.94       4.81       22.97
Total       24.87      43.88      31.26      100.00

Pearson chi2(6) = 9.6e+03   Pr = 0.000

- The ‘row’ or ‘col’ options may be preferable to use.
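As a cross-check on the problem 9.1 output, here is a Python sketch (outside Stata, illustrative names) that recomputes the degrees of freedom & the Chi-square statistic for the 4x3 table. The P-value line uses a closed form for the Chi-square tail that is valid only when df is even; that shortcut is an assumption of this sketch, not something Stata relies on.

```python
import math

# Observed counts from the educ-by-age table (rows e1..e4, columns a1..a3).
observed = [
    [5325, 9152, 16035],
    [14061, 24070, 18320],
    [11659, 19926, 9662],
    [10342, 19878, 8005],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence, then the Chi-square sum.
expected = [[r * c / n for c in col_totals] for r in row_totals]
chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(observed))
    for j in range(len(observed[0]))
)

# df = (# rows - 1) * (# columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Even df only: P(chi-square > x) = exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
half = chi2 / 2
p_value = math.exp(-half) * sum(
    half ** k / math.factorial(k) for k in range(df // 2)
)

print("n =", n, "df =", df, "chi2 =", round(chi2, 1), "p =", p_value)
```

With a sample this large the statistic is enormous & the P-value is effectively zero, which is why the row or col percentages are more informative than the test itself.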