Chapter 9: Analysis of Two-Way Tables


Presentation Transcript


Slide1 l.jpg

Chapter 9

Analysis of Two-Way Tables


Slide2 l.jpg

Two-way (i.e. contingency) tables: to classify & analyze categorical data:

  • Binomial counts: ‘success’ vs. ‘failure’

  • Proportions: binomial count divided by total sample size


Slide3 l.jpg

  • We’ll later see that inference via two-way tables is an alternative—with advantages & disadvantages—to the z-test for comparing two sample proportions:

    . prtest hsci, by(white)


Slide4 l.jpg

  • An advantage of two-way tables is that they can compare more than two groups (i.e. variables with more than two categories).

  • A disadvantage of two-way tables is that they can only do two-sided hypothesis tests.


Slide5 l.jpg

  • Here’s a two-way table:

    . tab hsci white, cell

                Nonwhite     White     Total
    not hsci          50       103       153
                   25.0%     51.5%     76.5%
    hsci               5        42        47
                    2.5%     21.0%     23.5%
    Total             55       145       200
                   27.5%     72.5%    100.0%


Slide6 l.jpg

            Nonwhite     White     Total
not hsci          50       103       153
               25.0%     51.5%     76.5%
hsci               5        42        47
                2.5%     21.0%     23.5%
Total             55       145       200
               27.5%     72.5%    100.0%

  • The row variable: hsci vs. not hsci. The column variable: white vs. nonwhite.


Slide7 l.jpg

            Nonwhite     White     Total
not hsci          50       103       153
               25.0%     51.5%     76.5%
hsci               5        42        47
                2.5%     21.0%     23.5%
Total             55       145       200
               27.5%     72.5%    100.0%

  • Cells: each combination of values for the two variables (50, 103, 5, 42).


Slide8 l.jpg

            Nonwhite     White     Total
not hsci          50       103       153
               25.0%     51.5%     76.5%
hsci               5        42        47
                2.5%     21.0%     23.5%
Total             55       145       200
               27.5%     72.5%    100.0%

  • Joint distributions: Each cell’s percentage of the total sample (50/200=.250; 103/200=.515; 5/200=.025; 42/200=.210).


Slide9 l.jpg

            Nonwhite     White     Total
not hsci          50       103       153
               25.0%     51.5%     76.5%
hsci               5        42        47
                2.5%     21.0%     23.5%
Total             55       145       200
               27.5%     72.5%    100.0%

  • The marginal frequencies: the row totals (153, 47) & the column totals (55, 145).


Slide10 l.jpg

            Nonwhite     White     Total
not hsci          50       103       153
               25.0%     51.5%     76.5%
hsci               5        42        47
                2.5%     21.0%     23.5%
Total             55       145       200
               27.5%     72.5%    100.0%

  • The marginal distributions: each row total/sample total (76.5%, 23.5%). Each column total/sample total (27.5%, 72.5%).
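
As a rough illustration (a Python sketch written for this transcript, not part of the Stata workflow in the slides), the marginal frequencies, joint distribution & marginal distributions above can be reproduced from the four cell counts. The variable names are illustrative assumptions.

```python
# Observed counts from the slides.
# Rows: not hsci, hsci. Columns: nonwhite, white.
counts = [[50, 103],
          [5, 42]]

n = sum(sum(row) for row in counts)              # total sample size
row_totals = [sum(row) for row in counts]        # marginal frequencies (rows)
col_totals = [sum(col) for col in zip(*counts)]  # marginal frequencies (columns)

# Joint distribution: each cell count divided by the total sample size.
joint = [[cell / n for cell in row] for row in counts]

# Marginal distributions: each marginal frequency divided by the total sample size.
row_marginal = [total / n for total in row_totals]
col_marginal = [total / n for total in col_totals]

print("joint:", joint)
print("row marginal:", row_marginal)
print("column marginal:", col_marginal)
```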


Slide11 l.jpg

  • Here are the same data displayed as column conditional probabilities:

    . tab hsci white, col nofreq

                nonwhite     white     Total
    no            90.91%    71.03%    76.50%
    yes            9.09%    28.97%    23.50%
    Total        100.00%   100.00%   100.00%

  • The conditional distributions (i.e. conditional probabilities): Column: divide each column cell count by its column total count.


Slide12 l.jpg

  • Here are the same data displayed as row conditional probabilities:

    . tab hsci white, row nofreq

                nonwhite     white     Total
    not hsci      32.68%    67.32%   100.00%
    hsci          10.64%    89.36%   100.00%
    Total         27.50%    72.50%   100.00%

  • The conditional distributions (i.e. conditional probabilities): Row: divide each row cell count by its row total count.
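
The column & row conditional distributions can be checked by hand. The following is a minimal Python sketch (not Stata output; the names are assumptions made for illustration) that divides each cell by its column total & then by its row total.

```python
# Observed counts from the slides.
# Rows: not hsci, hsci. Columns: nonwhite, white.
counts = [[50, 103],
          [5, 42]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Column conditional distribution: cell count / its column total.
col_conditional = [[cell / col_totals[j] for j, cell in enumerate(row)]
                   for row in counts]

# Row conditional distribution: cell count / its row total.
row_conditional = [[cell / row_totals[i] for cell in row]
                   for i, row in enumerate(counts)]

for label, row in zip(["not hsci", "hsci"], col_conditional):
    print("column %:", label, [f"{100 * p:.2f}" for p in row])
for label, row in zip(["not hsci", "hsci"], row_conditional):
    print("row %:   ", label, [f"{100 * p:.2f}" for p in row])
```

The column percentages compare nonwhites with whites, which is the comparison the next slide recommends when race-ethnicity is the explanatory variable.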


Slide13 l.jpg

  • Tip: It’s usually best to compute conditional distributions (i.e. probabilities) across the categories of the explanatory variable.

  • E.g., tab hsci white, col: computes the conditional distributions across the categories of the explanatory variable race-ethnicity (i.e. white vs. nonwhite).

  • Alternatively, you may want to compare joint distributions (i.e. cell counts/total sample): tab hsci white, cell


Slide14 l.jpg

We’ve discussed the following:

  • row variables

  • column variables

  • cells: each combination of values for the two variables.

  • joint distributions: each cell’s percentage of the total sample.


Slide15 l.jpg

  • marginal frequencies

  • marginal distributions: each marginal frequency/total sample size

  • column conditional distributions: divide each column cell count by its column total count

  • row conditional distributions: divide each row cell count by its row total count.


Slide16 l.jpg

  • And we’ve said that typically it’s best to compute the conditional distributions (i.e. probabilities) across the categories of the explanatory variable.

  • Or that it may be preferable to compare joint distributions (i.e. compare the cell probabilities).


Slide17 l.jpg

  • Let’s next consider conceptual problems of two-way tables.


Slide18 l.jpg

  • Simpson’s Paradox

  • An NSF study found that the median salary of newly graduated female engineers & scientists was just 73% of the median salary for males. Here are women’s median salaries in the 16 fields as a percentage of male salaries:

  • 94% 96% 98% 95% 85% 85% 84% 100% 103% 100% 107% 93% 104% 93% 106% 100%

  • How can it be that, on average, the women earn just 73% of the median salary for males, since no listed % falls below 84%?


Slide19 l.jpg

  • Because women are disproportionately located in the lower-paying fields of engineering & science.

  • That is, ‘field of science & engineering’ is a lurking variable (i.e. an unmeasured confounding variable) that influences the observed association between gender & salary.
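
To see how this aggregation effect can arise, here is a small Python sketch. The numbers are HYPOTHETICAL, invented purely for illustration (they are not the NSF data), & the sketch uses mean salaries rather than medians to keep the arithmetic simple.

```python
# HYPOTHETICAL numbers, invented for illustration only (not the NSF data).
# Each entry: field -> (number of graduates, mean salary in $1000s).
men = {"high-paying field": (90, 100.0), "low-paying field": (10, 50.0)}
women = {"high-paying field": (10, 98.0), "low-paying field": (90, 49.0)}

def overall_mean(group):
    """Mean salary over all fields, weighted by the number of graduates."""
    total_pay = sum(n * salary for n, salary in group.values())
    total_n = sum(n for n, _ in group.values())
    return total_pay / total_n

# Within each field, women's mean salary as a share of men's.
within = {field: women[field][1] / men[field][1] for field in men}

# Aggregated over fields, ignoring the lurking variable 'field'.
aggregate_ratio = overall_mean(women) / overall_mean(men)

for field, ratio in within.items():
    print(f"{field}: {ratio:.0%}")
print(f"all fields combined: {aggregate_ratio:.0%}")
```

In this made-up example the within-field ratio is 98% in both fields, yet the aggregate ratio is far lower, because most of the women are in the lower-paying field.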


Slide20 l.jpg

  • Simpson’s Paradox: the reversal of a bivariate relationship due to the influence of a lurking variable.

  • Aggregating data has the effect of ignoring one or more lurking variables.

  • Another example: comparing hospital mortality rates.

  • Yet another: comparing airline on-time rates.


Slide21 l.jpg

  • Conclusion from Simpson’s Paradox

  • Always be on the lookout for lurking variables with aggregated data!!

  • A bivariate relationship may change direction when a third, control variable is introduced.


Slide22 l.jpg

  • What’s a control variable?

  • Holding a variable constant makes it a control variable: doing so removes the part of the bivariate relationship that was caused by the control variable.

  • That is, controlling for a variable neutralizes its influence on the observed relationship.

  • E.g., controlling for field of science & engineering.

  • E.g., controlling for race/ethnicity.


Slide23 l.jpg

  • To repeat, holding a variable constant removes its statistical effects from the bivariate association being examined.

  • Doing so ensures (more or less) that a bivariate relationship is assessed apart from the influence of the controlled variable: e.g., the relationship between a Montessori school program & student IQ scores, holding constant social class.


Slide24 l.jpg

  • What’s better: statistical control or experimental control?

  • The answer returns us to the matter of observational study versus experimental study (see Moore/McCabe, chapter 3).


Slide25 l.jpg

  • Good experimental design controls for all possible lurking variables. Why?

  • But statistical control cannot do so. Why not?

  • Moreover, statistical control is weakened by the imprecision of measurement of variables.

  • But we can’t experiment on everything.


Slide26 l.jpg

  • Let’s consider the following variant on Simpson’s Paradox:


Slide27 l.jpg

  • A bivariate association may not appear until a third, control variable is introduced.

  • The apparent absence of the bivariate relationship is called spurious non-association.

  • E.g., no association between years of education & level of income in post-WW II data, until controlling for age of respondents.


Slide28 l.jpg

  • Conclusion from Spurious Non-Association

  • Explore not just bivariate relationships but also multivariate relationships among all the variables of potential practical or theoretical relevance.


Slide29 l.jpg

  • Here’s how to add a control variable to a two-way table in Stata:

    bys female: tab hsci white, cell chi2


Slide30 l.jpg

male

hsci        nonwhite     white     Total
0                 21        41        62
              23.08%    45.05%    68.13%
1                  2        27        29
               2.20%    29.67%    31.87%
Total             23        68        91
              25.27%    74.73%   100.00%

Pearson chi2(1) = 7.6120   Pr = 0.006

female

hsci        nonwhite     white     Total
0                 29        62        91
              26.61%    56.88%    83.49%
1                  3        15        18
               2.75%    13.76%    16.51%
Total             32        77       109
              29.36%    70.64%   100.00%

Pearson chi2(1) = 1.6744   Pr = 0.196


Slide31 l.jpg

  • This example introduces a test of statistical significance for two-way tables.

  • The test is based on the Chi-square statistic.


Slide32 l.jpg

male

hsci        nonwhite     white     Total
0                 21        41        62
              23.08%    45.05%    68.13%
1                  2        27        29
               2.20%    29.67%    31.87%
Total             23        68        91
              25.27%    74.73%   100.00%

Pearson chi2(1) = 7.6120   Pr = 0.006

female

hsci        nonwhite     white     Total
0                 29        62        91
              26.61%    56.88%    83.49%
1                  3        15        18
               2.75%    13.76%    16.51%
Total             32        77       109
              29.36%    70.64%   100.00%

Pearson chi2(1) = 1.6744   Pr = 0.196


Slide33 l.jpg

  • Note in the example that the two-way table for females tests insignificant (Pr = 0.196), whereas the table for males tests significant (Pr = 0.006).

  • How do two-way tables & their test of significance evaluate the data?

  • They do so by comparing expected & observed cell counts in terms of proportional distributions.


Slide34 l.jpg

  • Back to the two-way table without the control variable:

    . tab hsci white, cell

                Nonwhite     White     Total
    not hsci          50       103       153
                   25.0%     51.5%     76.5%
    hsci               5        42        47
                    2.5%     21.0%     23.5%
    Total             55       145       200
                   27.5%     72.5%    100.0%


Slide35 l.jpg

  • Describing Relations in Two-Way Tables

  • The original data must be counts.

  • Inference for two-way tables: compare the observed cell counts to the expected cell counts; then compute the Chi-square significance test.


Slide36 l.jpg

  • We begin by computing the expected cell counts: row total times column total, divided by total sample size.

  • Premise: the null hypothesis of ‘statistical independence’ (i.e. no association between the variables) characterizes the data.


Slide37 l.jpg

  • Expected cell counts: row total times column total, divided by total sample size.

                nonwhite     white     Total
    no                50       103       153
    yes                5        42        47
    Total             55       145       200


Slide38 l.jpg

            nonwhite     white     Total
no                50       103       153
yes                5        42        47
Total             55       145       200

. di (153*55)/200
42.075

. di (153*145)/200
110.925

. di (47*55)/200
12.925

. di (47*145)/200
34.075

  • How do the expected cell counts compare to the observed cell counts: Do the conditional probabilities appear to be equal for nonwhites & whites across no-hsci & yes hsci?
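
The four `di` calculations can be generalized to any table. The following is a minimal Python sketch (an illustration only, with assumed variable names) that computes every expected count as row total times column total, divided by the total sample size.

```python
# Observed counts from the slides.
# Rows: no hsci, yes hsci. Columns: nonwhite, white.
counts = [[50, 103],
          [5, 42]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Expected count under independence: row total * column total / sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, obs_row, exp_row in zip(["no", "yes"], counts, expected):
    print(label, "observed:", obs_row, "expected:", [round(e, 3) for e in exp_row])
```

Note that the expected counts keep the same row & column totals as the observed counts; only the allocation across cells changes.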


Slide39 l.jpg

  • Expected count for each cell: its row total times its column total, divided by the total sample size.

  • Each expected cell count is based on the proportion of the total sample accounted for by its entire row & by its entire column.

  • The Chi-square test’s null hypothesis is independence (i.e. no association) between race-ethnicity & honors science: the conditional distributions of honors science are the same for nonwhites & whites.


Slide40 l.jpg

  • That is, each expected cell count reflects the null hypothesis of statistical independence (i.e. no association):

  • that the proportion of non-white honors science students is simply the proportion of non-white students in the population.

  • that the proportion of white honors science students is simply the proportion of white students in the population.

  • What’s the alternative hypothesis?


Slide41 l.jpg

  • Chi-Square Test Assumptions

  • Random sample

  • Two categorical variables

  • Count data

  • An expected count of at least 5 in at least 80% of the cells & an expected count of at least 1 in every cell (for a 2×2 table, all four expected counts should be at least 5)
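
The cell-size rule of thumb is usually stated in terms of expected counts, & that is what the following Python sketch checks. The function names `expected_counts` & `chi2_counts_ok` are hypothetical helpers written for this illustration, not Stata commands.

```python
def expected_counts(counts):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi2_counts_ok(counts):
    """Rule of thumb: at least 80% of expected counts >= 5, and all >= 1."""
    cells = [e for row in expected_counts(counts) for e in row]
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return min(cells) >= 1 and share_at_least_5 >= 0.8

# The hsci-by-race table from the slides passes the check.
print(chi2_counts_ok([[50, 103], [5, 42]]))
```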


Slide42 l.jpg

  • If the assumptions are fulfilled, use the Chi-square test:

    tab hsci white, chi2

  • If the numbers of observations per cell don’t meet the assumptions, use ‘Fisher’s exact test’ (a non-parametric test, which may be very slow):

    tab hsci white, exact
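
To give a sense of what an exact test computes, here is a Python sketch of a two-sided Fisher’s exact test for a 2×2 table. It uses one common definition of the two-sided P-value (sum the probabilities of all tables, with the same margins, that are no more likely than the observed table). The function name is a hypothetical helper, & the sketch is not a description of Stata’s internal algorithm.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # holding all row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)   # smallest feasible top-left count
    highest = min(row1, col1)      # largest feasible top-left count
    p_observed = prob(a)

    # Sum the probabilities of tables no more likely than the observed one
    # (the small factor guards against floating-point ties).
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-7))

p_exact = fisher_exact_two_sided(50, 103, 5, 42)
print(f"Fisher's exact test, two-sided P = {p_exact:.4f}")
```

Because every feasible table is enumerated, the computation grows with the sample size & the number of cells, which is one reason exact tests can be slow.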


Slide43 l.jpg

  • Chi-square statistic: measures how much the observed cell counts in a two-way table diverge from the expected cell counts.

  • It’s therefore a test of independence:

    Ho: the variables are independent from each other

    Ha: they are not independent from each other


Slide44 l.jpg

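  • Step 1: Chi-square = the sum over all cells of (observed cell count - expected cell count) squared, divided by the cell’s expected count

  • Step 2: df = (# rows - 1) × (# columns - 1)

  • Step 3: Chi-square significance test: compare the Chi-square statistic to the Chi-square distribution with df degrees of freedom; the P-value is the area to the right of the statistic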
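
The steps can be traced for the hsci-by-race table with a short Python sketch (an illustration under stated assumptions, not Stata code). The statistic is referred to the Chi-square distribution with df degrees of freedom; the closed-form P-value used here, `erfc(sqrt(chi2 / 2))`, is valid only when df = 1.

```python
from math import erfc, sqrt

# Observed counts from the slides.
# Rows: no hsci, yes hsci. Columns: nonwhite, white.
counts = [[50, 103],
          [5, 42]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Step 1: sum over all cells of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(counts, expected)
           for o, e in zip(obs_row, exp_row))

# Step 2: degrees of freedom = (rows - 1) * (columns - 1).
df = (len(counts) - 1) * (len(counts[0]) - 1)

# Step 3: upper-tail area of the chi-square distribution (df = 1 only).
p_value = erfc(sqrt(chi2 / 2))

print(f"chi2({df}) = {chi2:.4f}, P = {p_value:.3f}")
```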


Slide45 l.jpg

  • Chi-square statistic: non-negative values only

  • Has a distinct distribution for each degree of freedom (see Moore/McCabe)

  • Two-sided hypothesis test only


Slide46 l.jpg

Chi-square Test: To Repeat…

  • Chi-square statistic: measures how much the observed cell counts in a two-way table diverge from the expected cell counts.

  • That is, it compares the sample distribution with a hypothesized distribution.

  • It’s a test of statistical independence (Ho: no association; Ha: association).


Slide47 l.jpg

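  • Step 1: Chi-square = the sum over all cells of (observed cell count - expected cell count) squared, divided by the cell’s expected count

  • Step 2: df = (# rows - 1) × (# columns - 1)

  • Step 3: Chi-square significance test: compare the Chi-square statistic to the Chi-square distribution with df degrees of freedom; the P-value is the area to the right of the statistic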


Slide48 l.jpg

Hypothesis Test

Ho: proportion in hsci among whites = proportion in hsci among nonwhites

Ha: proportion in hsci among whites ~= proportion in hsci among nonwhites (i.e. two-sided alternative)

  • Chi-square test: two-sided alternative hypothesis only.


Slide49 l.jpg

    . tab hsci white, cell chi2

                nonwhite     white     Total
    no                50       103       153
                   25.0%     51.5%     76.5%
    yes                5        42        47
                    2.5%     21.0%     23.5%
    Total             55       145       200
                   27.5%     72.5%    100.0%

    Pearson chi2(1) = 8.7613   Pr = 0.003

  • Conclusion: Reject the null hypothesis.


Slide50 l.jpg

  • Let’s repeat the earlier example to see what happens when we add a control variable to the two-way table:

    bys female: tab hsci white, cell chi2


Slide51 l.jpg

female = male

hsci        nonwhite     white     Total
0                 21        41        62
              23.08%    45.05%    68.13%
1                  2        27        29
               2.20%    29.67%    31.87%
Total             23        68        91
              25.27%    74.73%   100.00%

Pearson chi2(1) = 7.6120   Pr = 0.006

female = female

hsci        nonwhite     white     Total
0                 29        62        91
              26.61%    56.88%    83.49%
1                  3        15        18
               2.75%    13.76%    16.51%
Total             32        77       109
              29.36%    70.64%   100.00%

Pearson chi2(1) = 1.6744   Pr = 0.196


Slide52 l.jpg

  • But when adding a control variable, beware of the consequences for sub-sample sizes.

  • If the sub-samples are too small, it may be hard to obtain statistical significance.

  • So always check the size of sub-samples.
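
One way to check the sub-samples is to look at each sub-table’s size & smallest expected count alongside its test. The following Python sketch (illustrative only; `chi2_2x2` is a hypothetical helper, & its P-value formula assumes df = 1) does this for the male & female tables shown earlier.

```python
from math import erfc, sqrt

def chi2_2x2(counts):
    """Return (chi-square, P-value, smallest expected count) for a 2x2 table."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(counts, expected)
               for o, e in zip(obs_row, exp_row))
    p_value = erfc(sqrt(chi2 / 2))  # upper-tail area, valid for df = 1 only
    return chi2, p_value, min(min(row) for row in expected)

# Sub-tables from the slides. Rows: hsci = 0, 1. Columns: nonwhite, white.
strata = {
    "male": [[21, 41], [2, 27]],
    "female": [[29, 62], [3, 15]],
}

results = {name: chi2_2x2(table) for name, table in strata.items()}
for name, (chi2, p_value, smallest) in results.items():
    size = sum(map(sum, strata[name]))
    print(f"{name}: n = {size}, chi2(1) = {chi2:.4f}, "
          f"P = {p_value:.3f}, smallest expected count = {smallest:.2f}")
```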


Slide53 l.jpg

  • Remember: expected cell counts should be >= 5 in at least 80% of the cells, & >= 1 in every cell.


Slide54 l.jpg

  • We can compare two population proportions either by the chi-square test or by the two-sample z-test. For a two-sided alternative they give exactly the same P-value, because the chi-square statistic equals the square of the z statistic.

  • Chi-square test advantage: can compare more than two populations (e.g., SES by race-ethnicity in hsb2.dta); but the original data must be counts.

  • z-test advantage: can test either one-sided or two-sided alternatives; the original data may be counts or proportions.
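
The link between the two tests can be verified numerically. This Python sketch (an illustration, not Stata’s `prtest` implementation) computes the pooled two-sample z statistic for the hsci proportions & squares it; the result should match the Pearson chi-square reported for the same table.

```python
from math import erfc, sqrt

# Honors-science counts from the slides.
x1, n1 = 5, 55     # nonwhite: hsci count, group size
x2, n2 = 42, 145   # white: hsci count, group size

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)  # common proportion under Ho

# Standard error of the difference under Ho (pooled proportion).
se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se_pooled

p_two_sided = erfc(abs(z) / sqrt(2))  # Ha: diff ~= 0
p_lower = 0.5 * erfc(-z / sqrt(2))    # Ha: diff < 0, i.e. P(Z < z)

print(f"z = {z:.5f}, z squared = {z ** 2:.4f}")
print(f"two-sided P = {p_two_sided:.4f}, one-sided (diff < 0) P = {p_lower:.4f}")
```

The one-sided P-value has no counterpart in the chi-square test, which is the z-test advantage noted above.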


Slide55 l.jpg

. prtest hsci, by(white)

Two-sample test of proportion           nonwhite: Number of obs =       55
                                           white: Number of obs =      145
------------------------------------------------------------------------------
Variable |      Mean   Std. Err.         z    P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------------
nonwhite |  .0909091    .0387638   2.34521   0.0190    .0149335    .1668847
   white |  .2896552    .0376696   7.68936   0.0000    .2158241    .3634863
---------+--------------------------------------------------------------------
    diff | -.1987461    .0540521                      -.3046863   -.0928059
         | under Ho:    .0671451  -2.95995   0.0031
------------------------------------------------------------------------------
Ho: proportion(nonwhite) - proportion(white) = diff = 0

    Ha: diff < 0            Ha: diff ~= 0            Ha: diff > 0
    z = -2.960              z = -2.960               z = -2.960
    P < z = 0.0015          P > |z| = 0.0031         P > z = 0.9985


Slide56 l.jpg

. tab hsci white, cell chi2

            nonwhite     white     Total
no                50       103       153
               25.0%     51.5%     76.5%
yes                5        42        47
                2.5%     21.0%     23.5%
Total             55       145       200
               27.5%     72.5%    100.0%

Pearson chi2(1) = 8.7613   Pr = 0.003

  • Pr = .003 for both the Chi-square test & the two-sided z-test (8.7613 is the square of -2.95995).


Slide57 l.jpg

  • Other Useful Stata Commands

  • findit tabchi: displays observed & expected frequencies, various types of residuals (raw, pearson, adjusted), & various tests of significance.

  • findit tabout: to make publication-style contingency & other tables

  • See the following class documents: ‘Making contingency tables in Stata’; ‘Making working & publication-style tables in Stata’.


Slide58 l.jpg

  • For greater depth concerning contingency tables & their various significance tests, see Agresti & Finlay, Statistical Methods for the Social Sciences, chap. 8.


Slide59 l.jpg

  • Summary: Two-Way Tables

  • Two-way tables: categorical data—binomial counts (‘success’ vs. ‘failure’) or proportions (binomial counts divided by the total sample size); but the data must be counts

  • Row variables? Column variables? Cells?


Slide60 l.jpg

  • Marginal frequencies? Marginal distributions?

  • Joint distributions?

  • Row & column conditional distributions?


Slide61 l.jpg

  • How to compute expected cell frequencies? What do they represent?

  • Null hypothesis? Alternative hypothesis?

  • How to compute the Chi-square test?

  • How to compute its degrees of freedom?

  • Chi-square assumptions?


Slide62 l.jpg

  • Advantages/disadvantages of inference via two-way tables versus inference via z-test for two sample proportions?

  • Chi-square test of significance: equals the square of the z-test for comparing sample proportions, but the Chi-square test requires the original data to be counts.


Slide63 l.jpg

  • Simpson’s Paradox: aggregating the data ignores lurking variables.

  • Moral of the story: beware of the relations portrayed in aggregated data (i.e. look out for lurking variables)!


Slide64 l.jpg

  • Spurious non-association: a bivariate association appears only when a third, control variable is introduced.

  • Moral of this story: the same as for Simpson’s Paradox.


Slide65 l.jpg

  • Finally, when should we use contingency tables & the Chi-square test?

  • As part of bivariate exploratory data analysis, including in preparation for regression analysis

  • Or when we don’t have enough observations to do regression analysis (i.e. perhaps categorize the data and do cross-tabs).


Slide66 l.jpg

Here’s how to do the tables for Moore/McCabe, problem 9.1:


Slide67 l.jpg

. tabulate educ age [freq=years], chi2

                  a1        a2        a3     Total
e1             5,325     9,152    16,035    30,512
e2            14,061    24,070    18,320    56,451
e3            11,659    19,926     9,662    41,247
e4            10,342    19,878     8,005    38,225
Total         41,387    73,026    52,022   166,435

Pearson chi2(6) = 9.6e+03   Pr = 0.000
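
The same expected-count & Chi-square steps carry over to tables larger than 2×2. As a rough check on the output above, this Python sketch (illustrative; the variable names are assumptions) recomputes the statistic & its degrees of freedom from the 4×3 table of counts.

```python
# Counts from the table above. Rows: e1..e4 (educ). Columns: a1..a3 (age).
counts = [
    [5325, 9152, 16035],
    [14061, 24070, 18320],
    [11659, 19926, 9662],
    [10342, 19878, 8005],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Expected counts under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(counts, expected)
           for o, e in zip(obs_row, exp_row))

# Degrees of freedom = (rows - 1) * (columns - 1).
df = (len(counts) - 1) * (len(counts[0]) - 1)

print(f"n = {n}, chi2({df}) = {chi2:.1f}")
```

With a sample this large, even modest departures from independence produce a very large statistic, so the percentages are needed to judge how strong the association is.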


Slide68 l.jpg

. tabulate educ age [freq=years], cell chi2

                  a1        a2        a3     Total
e1              3.20      5.50      9.63     18.33
e2              8.45     14.46     11.01     33.92
e3              7.01     11.97      5.81     24.78
e4              6.21     11.94      4.81     22.97
Total          24.87     43.88     31.26    100.00

Pearson chi2(6) = 9.6e+03   Pr = 0.000

  • The ‘row’ or ‘col’ options may be preferable to use.

