Chapter 9
Slide1 l.jpg

Chapter 9

Analysis of Two-Way Tables


Slide2 l.jpg

Two-way (i.e. contingency) tables: to classify & analyze categorical data:

  • Binomial counts: ‘success’ vs. ‘failure’

  • Proportions: binomial count divided by total sample size


Slide5 l.jpg

  • Here’s a two-way table:

    . tab hsci white, cell

                 Nonwhite     White     Total

    not hsci           50       103       153
                    25.0%     51.5%     76.5%

    hsci                5        42        47
                     2.5%     21.0%     23.5%

    Total              55       145       200
                    27.5%     72.5%    100.0%


Slide6 l.jpg

  • The row variable: hsci vs. not hsci. The column variable: white vs. nonwhite.


Slide7 l.jpg

  • Cells: each combination of values for the two variables (50, 103, 5, 42).


Slide8 l.jpg

  • Joint distributions: Each cell’s percentage of the total sample (50/200=.250; 103/200=.515; 5/200=.025; 42/200=.210).


Slide9 l.jpg

  • The marginal frequencies: the row totals (153, 47) & the column totals (55, 145).


Slide10 l.jpg

  • The marginal distributions: each row total/sample total (76.5%, 23.5%). Each column total/sample total (27.5%, 72.5%).
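As a rough illustration of slides 5 to 10, the joint & marginal distributions can be recomputed from the four cell counts. This is a sketch in Python rather than Stata; the variable names are illustrative assumptions, not part of the original slides.

```python
# Observed counts from the hsci-by-white table.
# Rows: not hsci, hsci. Columns: nonwhite, white.
counts = [[50, 103],
          [5, 42]]

n = sum(sum(row) for row in counts)                # total sample size
row_totals = [sum(row) for row in counts]          # marginal frequencies (rows)
col_totals = [sum(col) for col in zip(*counts)]    # marginal frequencies (columns)

# Joint distribution: each cell count divided by the total sample size.
joint = [[c / n for c in row] for row in counts]

# Marginal distributions: each marginal frequency divided by the total sample size.
row_marginal = [t / n for t in row_totals]
col_marginal = [t / n for t in col_totals]

print("n =", n, "row totals =", row_totals, "column totals =", col_totals)
print("joint:", joint)
print("row marginal:", row_marginal, "column marginal:", col_marginal)
```

The printed proportions should line up with the percentages that `tab hsci white, cell` reports.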


Slide11 l.jpg

  • Here are the same data displayed as column conditional probabilities:

    . tab hsci white, col nofreq

              nonwhite      white      Total

    no          90.91%     71.03%     76.50%
    yes          9.09%     28.97%     23.50%

    Total      100.00%    100.00%    100.00%

  • Column conditional distributions (i.e. conditional probabilities): divide each column cell count by its column total count.


Slide12 l.jpg

  • Here are the same data displayed as row conditional probabilities:

    . tab hsci white, row nofreq

                nonwhite      white      Total

    not hsci      32.68%     67.32%    100.00%
    hsci          10.64%     89.36%    100.00%

    Total         27.50%     72.50%    100.00%

  • Row conditional distributions (i.e. conditional probabilities): divide each row cell count by its row total count.
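A minimal Python sketch of both kinds of conditional distribution, using the same four counts. The names `col_cond` and `row_cond` are assumptions made for this example only.

```python
# Rows: not hsci, hsci. Columns: nonwhite, white.
counts = [[50, 103],
          [5, 42]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Column conditional distribution: cell count / its column total.
col_cond = [[counts[i][j] / col_totals[j] for j in range(2)] for i in range(2)]

# Row conditional distribution: cell count / its row total.
row_cond = [[counts[i][j] / row_totals[i] for j in range(2)] for i in range(2)]

for label, table in (("column %", col_cond), ("row %", row_cond)):
    print(label, [[round(100 * p, 2) for p in row] for row in table])
```

Each column of `col_cond` sums to 1, and each row of `row_cond` sums to 1, which is a quick check that the right total was used as the divisor.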


Slide13 l.jpg

  • Tip: It’s usually best to compute conditional distributions (i.e. probabilities) across the categories of the explanatory variable.

  • E.g., tab hsci white, col: computes the conditional distributions across the categories of the explanatory variable race-ethnicity (i.e. white vs. nonwhite).

  • Alternatively, you may want to compare joint distributions (i.e. cell counts/total sample): tab hsci white, cell


Slide14 l.jpg

We’ve discussed the following:

  • row variables

  • column variables

  • cells: each combination of values for the two variables.

  • joint distributions: each cell’s percentage of the total sample.


Slide15 l.jpg

  • marginal frequencies

  • marginal distributions: each marginal frequency/total sample size

  • column conditional distributions: divide each column cell count by its column total count

  • row conditional distributions: divide each row cell count by its row total count.


Slide18 l.jpg

  • Simpson’s Paradox

  • An NSF study found that the median salary of newly graduated female engineers & scientists was just 73% of the median salary for males. Here are women’s median salaries in the 16 fields as a percentage of male salaries:

  • 94% 96% 98% 95% 85% 85% 84% 100% 103% 100% 107% 93% 104% 93% 106% 100%

  • How can it be that, on average, the women earn just 73% of the median salary for males, since no listed % falls below 84%?


Slide19 l.jpg

  • Because women are disproportionately located in the lower-paying fields of engineering & science.

  • That is, ‘field of science & engineering’ is a lurking variable (i.e. an unmeasured confounding variable) that influences the observed association between gender & salary.
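The arithmetic behind this can be sketched with made-up numbers. Everything below is hypothetical (two fields instead of 16, mean salaries instead of medians, invented headcounts) and is NOT the NSF data; it only shows how a within-field ratio near 100% can coexist with a much lower aggregate ratio.

```python
# Hypothetical numbers, NOT the NSF data:
# (mean salary in $1000s, headcount) by field and gender.
fields = {
    "higher-paying field": {"men": (100.0, 80), "women": (98.0, 20)},
    "lower-paying field": {"men": (50.0, 20), "women": (50.0, 80)},
}


def overall_mean(gender):
    """Headcount-weighted mean salary across all fields for one gender."""
    total_pay = sum(f[gender][0] * f[gender][1] for f in fields.values())
    headcount = sum(f[gender][1] for f in fields.values())
    return total_pay / headcount


# Women's salary as a share of men's salary, field by field.
within_field_ratios = {
    name: f["women"][0] / f["men"][0] for name, f in fields.items()
}

# The same ratio after aggregating over fields (field is now ignored).
aggregate_ratio = overall_mean("women") / overall_mean("men")

print(within_field_ratios)
print(round(aggregate_ratio, 3))
```

In this invented example no field has a ratio below 98%, yet the aggregate ratio is about 66%, because most of the women are in the lower-paying field.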


Slide20 l.jpg

  • Simpson’s Paradox: the reversal of a bivariate relationship due to the influence of a lurking variable.

  • Aggregating data has the effect of ignoring one or more lurking variables.

  • Another example: comparing hospital mortality rates.

  • Yet another: comparing airline on-time rates.


Slide21 l.jpg

  • Conclusion from Simpson’s Paradox

  • Always be on the lookout for lurking variables with aggregated data!!

  • A bivariate relationship may change direction when a third, control variable is introduced.


Slide22 l.jpg

  • What’s a control variable?

  • Holding a variable constant makes it a control variable: doing so removes the part of the bivariate relationship that was caused by the control variable.

  • That is, controlling for a variable neutralizes its influence on the observed relationship.

  • E.g., controlling for field of science & engineering.

  • E.g., controlling for race/ethnicity.


Slide23 l.jpg

  • To repeat, holding a variable constant removes its statistical effects from the bivariate association being examined.

  • Doing so ensures (more or less) that a bivariate relationship is assessed apart from the influence of the controlled variable: e.g., the relationship between a Montessori school program & student IQ scores, holding constant social class.


Slide27 l.jpg

  • A bivariate association may not appear until a third, control variable is introduced.

  • The apparent absence of the bivariate relationship is called spurious non-association.

  • E.g., no association between years of education & level of income in post-WW II data, until controlling for age of respondents.


Slide28 l.jpg

  • Conclusion from Spurious Non-Association

  • Explore not just bivariate relationships but also multivariate relationships among all the variables of potential practical or theoretical relevance.


Slide30 l.jpg

male

             nonwhite      white      Total

    0              21         41         62
               23.08%     45.05%     68.13%

    1               2         27         29
                2.20%     29.67%     31.87%

    Total          23         68         91
               25.27%     74.73%    100.00%

    Pearson chi2(1) = 7.6120   Pr = 0.006

female

             nonwhite      white      Total

    0              29         62         91
               26.61%     56.88%     83.49%

    1               3         15         18
                2.75%     13.76%     16.51%

    Total          32         77        109
               29.36%     70.64%    100.00%

    Pearson chi2(1) = 1.6744   Pr = 0.196


Slide35 l.jpg

  • Describing Relations in Two-Way Tables

  • The original data must be counts.

  • Inference for two-way tables: compare the observed cell counts to the expected cell counts; then compute the Chi-square significance test.


Slide36 l.jpg

  • We begin by computing the expected cell counts: row total times column total, divided by total sample size.

  • Premise: the null hypothesis of ‘statistical independence’ (i.e. no association between the variables) characterizes the data.


Slide37 l.jpg

  • Expected cell counts: row total times column total, divided by total sample size.

             nonwhite      white      Total

    no             50        103        153
    yes             5         42         47

    Total          55        145        200


Slide38 l.jpg

             nonwhite      white      Total

    no             50        103        153
    yes             5         42         47

    Total          55        145        200

    . di (153*55)/200 = 42.075      . di (153*145)/200 = 110.925

    . di (47*55)/200  = 12.925      . di (47*145)/200  = 34.075

  • How do the expected cell counts compare to the observed cell counts: Do the conditional probabilities appear to be equal for nonwhites & whites across no-hsci & yes hsci?
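The four `di` calculations can be reproduced in one pass. The following is a small Python sketch (not Stata); the helper name `expected_counts` is an assumption for illustration.

```python
def expected_counts(counts):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: no hsci, yes hsci. Columns: nonwhite, white.
observed = [[50, 103],
            [5, 42]]
expected = expected_counts(observed)

# Observed minus expected shows which cells hold more or fewer cases
# than independence would predict.
difference = [[o - e for o, e in zip(obs_row, exp_row)]
              for obs_row, exp_row in zip(observed, expected)]

print("expected:", expected)
print("observed - expected:", difference)
```

In a 2x2 table the four differences have the same absolute size, here about 7.9 cases: fewer nonwhite students, & more white students, in honors science than independence would predict.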


Slide39 l.jpg

  • Expected count for each cell: its row total times its column total, divided by the total sample size.

  • Each expected cell count is based on the proportion of the total sample accounted for by its entire row & by its entire column.

  • The null hypothesis of the Chi-square test assumes independence (i.e. no association): the conditional distribution of honors science is the same for nonwhites & whites.


Slide40 l.jpg

  • That is, each expected cell count reflects the null hypothesis of statistical independence (i.e. no association):

  • that the proportion of non-white honors science students is simply the proportion of non-white students in the population.

  • that the proportion of white honors science students is simply the proportion of white students in the population.

  • What’s the alternative hypothesis?


Slide41 l.jpg

  • Chi-Square Test Assumptions

  • Random sample

  • Two categorical variables

  • Count data

  • An expected count of at least 5 in 80% of the cells & an expected count of no less than 1 in any cell (best if the expected count is at least 5 in all cells)
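A sketch of how the cell-count rule of thumb might be checked. It applies the 80% / minimum-1 thresholds to the expected cell counts, which is how this rule is usually stated; the function names are hypothetical.

```python
def expected_counts(counts):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def meets_count_rule(counts):
    """Return (ok, share_at_least_5, smallest) for the expected-count rule of thumb."""
    cells = [e for row in expected_counts(counts) for e in row]
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    smallest = min(cells)
    ok = share_at_least_5 >= 0.8 and smallest >= 1
    return ok, share_at_least_5, smallest


# hsci-by-white table: rows no/yes hsci, columns nonwhite/white.
ok, share, smallest = meets_count_rule([[50, 103], [5, 42]])
print(ok, share, round(smallest, 3))
```

For the honors-science table every expected count is well above 5, so the rule is met even though one observed cell holds only 5 cases.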


Slide43 l.jpg

  • Chi-square statistic: measures how much the observed cell counts in a two-way table diverge from the expected cell counts.

  • It’s therefore a test of independence:

    Ho: the variables are independent from each other

    Ha: they are not independent from each other


Slide45 l.jpg

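  • Chi-square statistic: the sum, over all cells, of (observed cell count – expected cell count) squared, divided by the cell’s expected count

  • Never negative: every term is a squared difference divided by a positive expected count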

  • Has a distinct distribution for each degree of freedom (see Moore/McCabe)

  • Two-sided hypothesis test only
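To make the statistic concrete, here is a short Python sketch that sums (observed – expected) squared over expected across all cells & reports the degrees of freedom as (rows – 1) x (columns – 1). The function name `chi_square` is an assumption for this example.

```python
def chi_square(counts):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# hsci-by-white table: rows no/yes hsci, columns nonwhite/white.
stat, df = chi_square([[50, 103], [5, 42]])
print(round(stat, 4), df)
```

For the honors-science table this should agree with the `Pearson chi2(1)` value that Stata reports when `chi2` is added to the `tab` command.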


Slide46 l.jpg

Chi-square Test: To Repeat…

  • Chi-square statistic: measures how much the observed cell counts in a two-way table diverge from the expected cell counts.

  • That is, it compares the sample distribution with a hypothesized distribution.

  • It’s a test of statistical independence (Ho: no association; Ha: association).


Slide48 l.jpg

Hypothesis Test

Ho: hsci whites = hsci nonwhites

Ha: hsci whites ~= hsci nonwhites (i.e. two-sided alternative)

  • Chi-square test: two-sided alternative hypothesis only.


Slide49 l.jpg

    . tab hsci white, cell chi2

             nonwhite      white      Total

    no             50        103        153
                25.0%      51.5%      76.5%

    yes             5         42         47
                 2.5%      21.0%      23.5%

    Total          55        145        200
                27.5%      72.5%     100.0%

    Pearson chi2(1) = 8.7613   Pr = 0.003

  • Conclusion: Reject the null hypothesis.


Slide51 l.jpg

female = male

             nonwhite      white      Total

    0              21         41         62
               23.08%     45.05%     68.13%

    1               2         27         29
                2.20%     29.67%     31.87%

    Total          23         68         91
               25.27%     74.73%    100.00%

    Pearson chi2(1) = 7.6120   Pr = 0.006

female = female

             nonwhite      white      Total

    0              29         62         91
               26.61%     56.88%     83.49%

    1               3         15         18
                2.75%     13.76%     16.51%

    Total          32         77        109
               29.36%     70.64%    100.00%

    Pearson chi2(1) = 1.6744   Pr = 0.196


Slide52 l.jpg

  • But when adding a control variable, beware of the consequences for sub-sample sizes.

  • If the sub-samples are too small, it may be hard to obtain statistical significance.

  • So always check the size of sub-samples.
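One way to check sub-sample sizes is to recompute the test within each gender & look at n & the smallest expected count alongside the chi-square. This Python sketch assumes rows 0/1 of the gender-specific tables are not hsci/hsci (their sums match the overall table); the function & dictionary names are illustrative.

```python
import math


def chi_square_2x2(counts):
    """Pearson chi-square, 1-df p-value, and smallest expected count for a 2x2 table."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((counts[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return {"n": n, "chi2": stat, "p": p_value,
            "min_expected": min(min(row) for row in expected)}


# Assumed layout: rows hsci = 0, 1; columns nonwhite, white.
strata = {
    "male": [[21, 41], [2, 27]],
    "female": [[29, 62], [3, 15]],
}
results = {name: chi_square_2x2(table) for name, table in strata.items()}

for name, r in results.items():
    print(name, r["n"], round(r["chi2"], 4), round(r["p"], 3),
          round(r["min_expected"], 2))
```

Splitting 200 cases by gender leaves 91 & 109 per table, & the smallest expected counts drop to roughly 7 & 5, so a non-significant result in one stratum may partly reflect the reduced sample.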



Slide54 l.jpg

  • We can compare two population proportions either by the chi-square test or by the two-sample z-test. For a 2x2 table the two give exactly the same two-sided result, because the chi-square statistic equals the square of the z statistic.

  • Chi-square test advantage: can compare more than two populations (e.g., SES by race-ethnicity in hsb2.dta); but the original data must be counts.

  • z-test advantage: can test either one-sided or two-sided alternatives; the original data may be counts or proportions.
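The link between the two tests can be checked numerically. Below is a Python sketch of the pooled two-sample z-test for the honors-science proportions (5 of 55 nonwhite students, 42 of 145 white students); the function name is an assumption for illustration.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z statistic and its two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                      # proportion under Ho
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))      # P(|Z| > |z|)
    return z, p_two_sided


# nonwhite: 5 hsci out of 55; white: 42 hsci out of 145.
z, p_two_sided = two_proportion_z(5, 55, 42, 145)
p_lower = p_two_sided / 2 if z < 0 else 1 - p_two_sided / 2   # Ha: diff < 0

print(round(z, 5), round(z * z, 4), round(p_two_sided, 4), round(p_lower, 4))
```

The squared z statistic should match the Pearson chi-square for the same table, & only the z-test offers the one-sided p-value.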


Slide55 l.jpg

. prtest hsci, by(white)

Two-sample test of proportion               nonwhite: Number of obs =       55
                                               white: Number of obs =      145
------------------------------------------------------------------------------
Variable |       Mean   Std. Err.        z     P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
nonwhite |   .0909091   .0387638   2.34521   0.0190     .0149335    .1668847
   white |   .2896552   .0376696   7.68936   0.0000     .2158241    .3634863
---------+--------------------------------------------------------------------
    diff |  -.1987461   .0540521                       -.3046863   -.0928059
         |  under Ho:   .0671451  -2.95995   0.0031
------------------------------------------------------------------------------
Ho: proportion(nonwhite) - proportion(white) = diff = 0

     Ha: diff < 0            Ha: diff ~= 0            Ha: diff > 0
      z = -2.960               z = -2.960              z = -2.960
  P < z = 0.0015          P > |z| = 0.0031         P > z = 0.9985
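The two standard errors in the `diff` rows come from different formulas: the confidence interval uses the unpooled standard error, while the test "under Ho" uses the pooled one. A Python sketch of both, with illustrative variable names:

```python
import math
from statistics import NormalDist

x1, n1 = 5, 55       # nonwhite: hsci count, group size
x2, n2 = 42, 145     # white: hsci count, group size

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Unpooled standard error: used for the confidence interval of the difference.
se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_crit = NormalDist().inv_cdf(0.975)          # about 1.96 for a 95% interval
ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

# Pooled standard error: used for the test statistic under Ho (diff = 0).
pooled = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

print(round(diff, 7), round(se_unpooled, 7), [round(b, 7) for b in ci])
print(round(se_pooled, 7), round(diff / se_pooled, 5))
```

Because the interval lies entirely below zero, it agrees with the two-sided test in rejecting Ho at the 5% level.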


Slide56 l.jpg

. tab hsci white, cell chi2

             nonwhite      white      Total

    no             50        103        153
               25.00%     51.50%     76.50%

    yes             5         42         47
                2.50%     21.00%     23.50%

    Total          55        145        200
               27.50%     72.50%    100.00%

    Pearson chi2(1) = 8.7613   Pr = 0.003

  • Pr = .003 for both the Chi-square test & the two-sided z-test.


Slide57 l.jpg

  • Other Useful Stata Commands

  • findit tabchi: displays observed & expected frequencies, various types of residuals (raw, pearson, adjusted), & various tests of significance.

  • findit tabout: to make publication-style contingency & other tables

  • See the following class documents: ‘Making contingency tables in Stata’; ‘Making working & publication-style tables in Stata’.


Slide59 l.jpg

  • Summary: Two-Way Tables

  • Two-way tables: categorical data, either binomial counts (‘success’ vs. ‘failure’) or proportions (binomial counts divided by the total sample size); but for the chi-square test the original data must be counts

  • Row variables? Column variables? Cells?


Slide65 l.jpg

  • Finally, when should we use contingency tables & the Chi-square test?

  • As part of bivariate exploratory data analysis, including in preparation for regression analysis

  • Or when we don’t have enough observations to do regression analysis (i.e. perhaps categorize the data and do cross-tabs).



Slide67 l.jpg

. tabulate educ age [freq=years], chi2

                 a1         a2         a3        Total

    e1        5,325      9,152     16,035       30,512
    e2       14,061     24,070     18,320       56,451
    e3       11,659     19,926      9,662       41,247
    e4       10,342     19,878      8,005       38,225

    Total    41,387     73,026     52,022      166,435

    Pearson chi2(6) = 9.6e+03   Pr = 0.000


Slide68 l.jpg

. tabulate educ age [freq=years], cell chi2

                 a1         a2         a3        Total

    e1         3.20       5.50       9.63        18.33
    e2         8.45      14.46      11.01        33.92
    e3         7.01      11.97       5.81        24.78
    e4         6.21      11.94       4.81        22.97

    Total     24.87      43.88      31.26       100.00

    Pearson chi2(6) = 9.6e+03   Pr = 0.000

  • The ‘row’ or ‘col’ options may be preferable to use.
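The same calculation extends to tables larger than 2x2. This Python sketch recomputes the chi-square & degrees of freedom for the 4x3 educ-by-age counts & also produces column percentages, assuming (for illustration only) that age is treated as the explanatory variable.

```python
# Weighted counts from the educ-by-age table. Rows: e1..e4. Columns: a1..a3.
observed = [
    [5325, 9152, 16035],
    [14061, 24070, 18320],
    [11659, 19926, 9662],
    [10342, 19878, 8005],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        chi2 += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Column percentages: distribution of educ within each age group.
col_pct = [[100 * observed[i][j] / col_totals[j] for j in range(3)]
           for i in range(4)]

print(n, df, round(chi2, 1))
for row in col_pct:
    print([round(p, 2) for p in row])
```

With more than 160,000 weighted cases even modest differences produce a very large chi-square, so the column percentages are the more informative part of the output.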

