Frequency distributions: Testing of goodness of fit and contingency tables

Chi-square statistics are widely used for the analysis of nominal data. The test was introduced by Karl Pearson in 1900, and its theory and application were expanded by him and R. A. Fisher.

genica

Presentation Transcript


  1. Frequency distributions: Testing of goodness of fit and contingency tables

  2. Chi-square statistics • Widely used for the analysis of nominal data • Introduced by Karl Pearson in 1900 • Its theory and application were expanded by him and R. A. Fisher • This lecture covers the chi-square test, the G test, and the Kolmogorov-Smirnov goodness-of-fit test for continuous data

  3. The $\chi^2$ test: $\chi^2 = \sum (\text{observed freq.} - \text{expected freq.})^2 / \text{expected freq.}$ • Obtain a sample of nominal-scale data and infer whether the population conforms to a certain theoretical distribution, e.g. in a genetic study • Test Ho that the observations (not the variables) are independent of each other for the population • Based on the difference between the actual observed frequencies (not %) and the expected frequencies

  4. The $\chi^2$ test: $\chi^2 = \sum (\text{observed freq.} - \text{expected freq.})^2 / \text{expected freq.}$ • A measure of how far a sample distribution deviates from a theoretical distribution • Ho: no difference between the observed and expected frequencies (HA: they are different) • If Ho is true: the differences, and hence $\chi^2$, are small • If Ho is false: both are large

  5. For Questionnaire Example (1) • In a questionnaire, 259 adults were asked what they thought about cutting air pollution by increasing the tax on vehicle fuel. • 113 people agreed with this idea and the rest disagreed. • Perform a chi-square test to determine the probability of the results being obtained by chance.

  6. For Questionnaire

                  Agree            Disagree
     Observed     113              259 - 113 = 146
     Expected     259/2 = 129.5    259/2 = 129.5

     Ho: Observed = Expected
     $\chi^2 = (113 - 129.5)^2/129.5 + (146 - 129.5)^2/129.5 = 2.102 + 2.102 = 4.204$
     df = k - 1 = 2 - 1 = 1
     From the chi-square table (Table B1 in Zar’s book): $\chi^2_{\alpha = 0.05,\ df = 1} = 3.841$; for $\chi^2 = 4.204$, 0.025 < p < 0.05.
     Therefore, reject Ho. The probability of the results being obtained by chance is between 0.025 and 0.05.
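The questionnaire calculation can be sketched in a few lines of Python. This is only an illustration of the formula on the slide: the function name `chi_square_gof` is made up here, and the critical value 3.841 is copied from the slide rather than computed.

```python
def chi_square_gof(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [113, 259 - 113]      # agree, disagree
expected = [259 / 2, 259 / 2]    # Ho: equal split between the two answers
chi2 = chi_square_gof(observed, expected)
df = len(observed) - 1           # k - 1 categories
critical_05 = 3.841              # table value for alpha = 0.05, df = 1 (from the slide)

print(f"chi-square = {chi2:.3f}, df = {df}, reject Ho at 0.05: {chi2 > critical_05}")
```

The unrounded statistic is about 4.205; the slide's 4.204 comes from rounding each term to 2.102 before adding.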

  7. For Genetics Practical (1) • Calculate the chi-square for a sample of 100 flowers against a hypothesized color ratio of 3:1 (red:green) and test Ho • Ho: the sample data come from a population having a 3:1 ratio of red to green flowers • Observation: 84 red and 16 green • Expected frequency for 100 flowers: 75 red and 25 green • Please do it now

  8. For Genetics Practical (2) • Calculate the chi-square for a sample of 100 flowers against a hypothesized color ratio of 3:1 (red:green) and test Ho • Ho: the sample data come from a population having a 3:1 ratio of red to green flowers • Observation: 67 red and 33 green • Expected frequency for 100 flowers: 75 red and 25 green • Please do it now

  9. For Genetics: for > 2 categories • Ho: The sample of Drosophila comes from a population having a 9:3:3:1 ratio of pale body-normal wing (PNW) to pale-vestigial wing (PVW) to dark-normal wing (DNW) to dark-vestigial wing (DVW) • Student’s observations in the lab:

                  PNW    PVW    DNW    DVW    Total
     Observed     300    77     89     36     502

     Calculate the chi-square and test Ho

  10. For Genetics • Ho: The sample of Drosophila (F2) comes from a population having a 9:3:3:1 ratio of pale body-normal wing (PNW) to pale-vestigial wing (PVW) to dark-normal wing (DNW) to dark-vestigial wing (DVW)

                          PNW      PVW      DNW      DVW      Total
      Observed            300      77       89       36       502
      Exp. proportion     9/16     3/16     3/16     1/16     1
      Expected            282.4    94.1     94.1     31.4     502
      O - E               17.6     -17.1    -5.1     4.6      0
      (O - E)^2           309.8    292.4    26.0     21.2
      (O - E)^2/E         1.1      3.1      0.3      0.7

      $\chi^2 = 1.1 + 3.1 + 0.3 + 0.7 = 5.2$
      df = 4 - 1 = 3
      $\chi^2_{\alpha = 0.05,\ df = 3} = 7.815$; for $\chi^2 = 5.20$, 0.10 < p < 0.25.
      Therefore, accept Ho.
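For more than two categories the only extra step is turning the hypothesized ratio into expected counts. A minimal sketch, assuming the helper names `expected_from_ratio` and `chi_square_gof` (both invented for this example) and taking the critical value 7.815 from the slide:

```python
def expected_from_ratio(ratio, total):
    """Split `total` observations according to a ratio such as 9:3:3:1."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]


def chi_square_gof(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [300, 77, 89, 36]                      # PNW, PVW, DNW, DVW
expected = expected_from_ratio([9, 3, 3, 1], sum(observed))
chi2 = chi_square_gof(observed, expected)
df = len(observed) - 1
critical_05 = 7.815                               # alpha = 0.05, df = 3 (from the slide)

print([round(e, 1) for e in expected])
print(f"chi-square = {chi2:.2f}, df = {df}, reject Ho at 0.05: {chi2 > critical_05}")
```

Without intermediate rounding the statistic is about 5.18, which agrees with the slide's 5.2 and leads to the same decision.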

  11. For Questionnaire: Cross Tabulation or Contingency Tables • Further examination of the data on opinions about increasing fuel tax to cut air pollution (example 1): • Ho: the decision is independent of sex

                   Males      Females
      Agree        13 (a)     100 (b)
      Disagree     116 (c)    30 (d)

      Expected frequency for cell b = (a + b)[(b + d)/N]

                     Males                   Females                 n
      Agree          13                      100                     113
        (expected)   113(129/259) = 56.28    113(130/259) = 56.72
      Disagree       116                     30                      146
        (expected)   146(129/259) = 72.72    146(130/259) = 73.28
      n              129                     130                     259

  12. Cross tabulation or contingency tables: • Ho: the decision is independent of sex

                     Males     Females    n
      Agree          13        100        113
        (expected)   56.28     56.72
      Disagree       116       30         146
        (expected)   72.72     73.28
      n              129       130        259

      $\chi^2 = (13 - 56.28)^2/56.28 + (100 - 56.72)^2/56.72 + (116 - 72.72)^2/72.72 + (30 - 73.28)^2/73.28 = 117.63$
      df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
      $\chi^2_{\alpha = 0.05,\ df = 1} = 3.841$, so p < 0.001.
      Therefore, reject Ho and accept HA that the decision is dependent on sex.
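The expected frequencies and the test of independence generalize to any r x c table. The sketch below is one possible way to write it; the function names are hypothetical and the table is the males/females example from the slide.

```python
def expected_frequencies(table):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return (chi-square, df) for an r x c contingency table."""
    expected = expected_frequencies(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


table = [[13, 100],    # agree:    males, females
         [116, 30]]    # disagree: males, females
expected = expected_frequencies(table)
chi2, df = chi_square_independence(table)

print([[round(e, 2) for e in row] for row in expected])
print(f"chi-square = {chi2:.2f}, df = {df}")
```

The small difference from the slide's 117.63 comes from the slide rounding the expected frequencies to two decimals before squaring.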

  13. Quicker method for 2 x 2 cross tabulation:

                  Class A    Class B    n
      State 1     a          b          a + b
      State 2     c          d          c + d
      n           a + c      b + d      n = a + b + c + d

      $\chi^2 = \frac{n(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)}$

                  Males    Females    n
      Agree       13       100        113
      Disagree    116      30         146
      n           129      130        259

      $\chi^2 = \frac{259(13 \times 30 - 116 \times 100)^2}{(113)(146)(129)(130)} = 117.64$
      $\chi^2_{\alpha = 0.05,\ df = 1} = 3.841$, so p < 0.001; therefore, reject Ho.

  14. Yates’ continuity correction: • The chi-square distribution is continuous, while the frequencies being analyzed are discrete (whole numbers). • To improve the analysis, Yates’ correction is often applied (Yates, 1934): • $\chi^2 = \sum (|\text{observed freq.} - \text{expected freq.}| - 0.5)^2 / \text{expected freq.}$ • For a 2 x 2 contingency table: $\chi^2 = \frac{n(|ad - bc| - 0.5n)^2}{(a + b)(c + d)(a + c)(b + d)}$

  15. Yates’ Correction (example 1): • $\chi^2 = \frac{n(|ad - bc| - 0.5n)^2}{(a + b)(c + d)(a + c)(b + d)}$

                  Males    Females    n
      Agree       13       100        113
      Disagree    116      30         146
      n           129      130        259

      $\chi^2 = \frac{259(|13 \times 30 - 116 \times 100| - 0.5 \times 259)^2}{(113)(146)(129)(130)} = 114.935$ (smaller than 117.64, less bias)
      $\chi^2_{\alpha = 0.05,\ df = 1} = 3.841$, so p < 0.001; therefore, reject Ho.
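Both the quick 2 x 2 formula and its Yates-corrected version fit in one small function. This is a sketch, not a library routine: the name `chi_square_2x2` and the `yates` flag are invented here, and clamping the corrected difference at zero is an added safeguard that the slides do not mention.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction: subtract n/2 from |ad - bc|.
        # Assumption of this sketch: never let the corrected difference go negative.
        diff = max(diff - 0.5 * n, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


uncorrected = chi_square_2x2(13, 100, 116, 30)
corrected = chi_square_2x2(13, 100, 116, 30, yates=True)

print(f"uncorrected = {uncorrected:.2f}, with Yates' correction = {corrected:.3f}")
```

The correction always lowers the statistic, which is why the corrected value sits below the uncorrected one for the same table.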

  16. Practical 3: • $\chi^2 = \frac{n(|ad - bc| - 0.5n)^2}{(a + b)(c + d)(a + c)(b + d)}$ • For a drug test, Ho: The survival of the animals is independent of whether the drug is administered

                     Dead    Alive    n
      Treated        12      30       42
      Not treated    27      31       58
      n              39      61       100

      Use Yates’ correction to calculate $\chi^2$ and test the hypothesis. Please do it at home.

  17. Bias in Chi-square calculations • If values of expected frequency ($f_i$) are very small, the calculated $\chi^2$ is biased: it is larger than the theoretical $\chi^2$ value, so we tend to reject Ho. • Rules: $f_i$ > 1 and no more than 20% of $f_i$ < 5.0. • It may be conservative at significance levels < 5%, especially when the expected frequencies are all equal. • If some $f_i$ are small, (1) increase the sample size if possible or use the G test, or (2) combine categories if possible.

  18. The G test (log-likelihood ratio): $G = 2 \sum O \ln(O/E)$ • Similar to the $\chi^2$ test • Many statisticians believe that the G test is superior to the $\chi^2$ test (although at present it is not as popular) • For 2 x 2 cross tabulation:

                  Class A    Class B
      State 1     a          b
      State 2     c          d

      The expected frequency for cell a = (a + b)[(a + c)/n]

      Practical 3 (expected frequencies in parentheses):

                     Dead          Alive         n
      Treated        12 (16.38)    30 (25.62)    42
      Not treated    27 (22.62)    31 (35.38)    58
      n              39            61            100

  19. $G = 2 \sum O \ln(O/E)$

                     Dead          Alive         n
      Treated        12 (16.38)    30 (25.62)    42
      Not treated    27 (22.62)    31 (35.38)    58
      n              39            61            100

      (1) Calculate G:
          G = 2 [12 ln(12/16.38) + 30 ln(30/25.62) + 27 ln(27/22.62) + 31 ln(31/35.38)]
          G = 2 (1.681) = 3.362
      (2) Calculate Williams’ correction: $1 + \frac{w^2 - 1}{6nd}$, where w is the number of frequency cells, n is the total number of measurements and d is the degrees of freedom, (r - 1)(c - 1):
          $1 + \frac{4^2 - 1}{(6)(100)(1)} = 1.025$
      G (adjusted) = 3.362/1.025 = 3.28 (< 3.31 from the $\chi^2$ test)
      $\chi^2_{\alpha = 0.05,\ df = 1} = 3.841$, so p > 0.05; therefore, accept Ho.

  20. Ho: The sample of Drosophila (F2) comes from a population having a 9:3:3:1 ratio of pale body-normal wing (PNW) to pale-vestigial wing (PVW) to dark-normal wing (DNW) to dark-vestigial wing (DVW)

                      PNW      PVW       DNW      DVW     Total
      Observed        300      77        89       36      502
      Expected        282.4    94.1      94.1     31.4
      O ln(O/E)       18.14    -15.44    -4.96    4.92

      G value: G = 2 (18.14 - 15.44 - 4.96 + 4.92) = 5.32
      Williams’ correction: $1 + \frac{4^2 - 1}{6 (502) (3)} = 1.00166$
      G (adjusted): 5.32/1.00166 = 5.311
      $\chi^2_{\alpha = 0.05,\ df = 3} = 7.815$; for G (adjusted) = 5.311, 0.10 < p < 0.25.
      Therefore, accept Ho.
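The two G-test examples (the drug trial and the Drosophila cross) can be reproduced with a short sketch. The function names are made up for illustration, and `williams_correction` simply encodes the formula as it appears on the slides, not necessarily the form used by any particular statistics package.

```python
import math


def g_statistic(observed, expected):
    """G = 2 * sum(O * ln(O / E)); cells with O = 0 contribute nothing."""
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def williams_correction(w, n, d):
    """Correction factor in the form given on the slides: 1 + (w^2 - 1) / (6 n d)."""
    return 1 + (w ** 2 - 1) / (6 * n * d)


# Practical 3 (2 x 2 table flattened row by row); expected = row total * column total / n
observed_drug = [12, 30, 27, 31]
expected_drug = [42 * 39 / 100, 42 * 61 / 100, 58 * 39 / 100, 58 * 61 / 100]
g_drug = g_statistic(observed_drug, expected_drug)
g_drug_adj = g_drug / williams_correction(w=4, n=100, d=1)

# Drosophila 9:3:3:1 goodness of fit
observed_fly = [300, 77, 89, 36]
expected_fly = [502 * r / 16 for r in (9, 3, 3, 1)]
g_fly = g_statistic(observed_fly, expected_fly)
g_fly_adj = g_fly / williams_correction(w=4, n=502, d=3)

print(f"drug trial: G = {g_drug:.2f}, adjusted G = {g_drug_adj:.2f}")
print(f"Drosophila: G = {g_fly:.2f}, adjusted G = {g_fly_adj:.2f}")
```

The results differ from the slide values in the second or third decimal place because the slides round the expected frequencies and intermediate terms; the conclusions (adjusted G below 3.841 and 7.815 respectively) are unchanged.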

  21. The Kolmogorov-Smirnov goodness of fit test = Kolmogorov-Smirnov one-sample test • Deals with goodness of fit for data in ordered categories (the $\chi^2$ and G tests above apply to nominal-scale data) • Example: 35 cats were tested one at a time and allowed to choose among 5 diets with different moisture content (1 = very moist to 5 = very dry): • Ho: Cats prefer all five diets equally

                  1    2     3     4    5    n
      Observed    2    18    10    4    1    35
      Expected    7    7     7     7    7    35

  22. Kolmogorov-Smirnov one-sample test • Ho: Cats prefer all five diets equally

                      1    2     3     4     5     n
      O               2    18    10    4     1     35
      E               7    7     7     7     7     35
      Cumulative O    2    20    30    34    35
      Cumulative E    7    14    21    28    35
      |d_i|           5    6     9     6     0

      $d_{max} = \max |d_i| = 9$
      $(d_{max})_{\alpha, k, n} = (d_{max})_{0.05, 5, 35} = 7$ (Table B8: k = no. of categories)
      Since 9 > 7, reject Ho. 0.002 < p < 0.005

      • When applicable (i.e. the categories are ordered), the K-S test is more powerful than the $\chi^2$ test when n is small or when values of observed frequencies are small.
      • Note: if the order of the same data is changed to 2, 1, 4, 18 and 10, the $\chi^2$ test gives the same result (it is independent of the order), but the $d_{max}$ calculated for the K-S test will be different.
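The cumulative-difference calculation for ordered categories is easy to check in code. This is a minimal sketch with a hypothetical function name `ks_dmax`; the critical value 7 is the Table B8 entry quoted on the slide.

```python
from itertools import accumulate


def ks_dmax(observed, expected):
    """Largest absolute difference between cumulative observed and expected counts."""
    cum_o = accumulate(observed)
    cum_e = accumulate(expected)
    return max(abs(o - e) for o, e in zip(cum_o, cum_e))


observed = [2, 18, 10, 4, 1]     # diets 1 (very moist) to 5 (very dry)
expected = [7] * 5               # Ho: all five diets equally preferred
d_max = ks_dmax(observed, expected)
critical = 7                     # (d_max) for alpha = 0.05, k = 5, n = 35 (from the slide)

# Same counts in a different category order: d_max changes, unlike chi-square.
reordered = [2, 1, 4, 18, 10]
d_max_reordered = ks_dmax(reordered, expected)

print(f"d_max = {d_max}, reject Ho: {d_max > critical}")
print(f"d_max after reordering = {d_max_reordered}")
```

Running the reordered data illustrates the note on the slide: the statistic depends on the category order, so the test is only meaningful when that order is real.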

  23. Kolmogorov-Smirnov one-sample test for continuous ratio scale data • Example 22.11 (page 479 in Zar) • Ho: Moths are distributed uniformly from ground level to height of 25 m • HA: Moths are not distributed uniformly from ground level to height of 25 m • Use of Table B9

  24. Kolmogorov-Smirnov one-sample test for grouped data (example 22.11) • The power is reduced by grouping the data and therefore grouping should be avoided whenever possible. • K-S test can be used to test normality of data

  25. Recognizing the Normal Distribution

  26. Recognizing the distribution of your data is important • Provides a firm base on which to establish and test hypotheses • If data are normally distributed, you can use parametric tests • Otherwise, transform the data to a normal distribution • Or perform non-parametric tests

  27. For a reliable test for normality of interval data, n must be large enough (e.g. > 15) • Difficult to tell whether a small data set (e.g. 5) is normally distributed

  28. Methods • Inspection of the frequency histogram • Probability plot • Chi-square goodness of fit • Kolmogorov-Smirnov one-sample test • Symmetry and Kurtosis: D’Agostino-Pearson $K^2$ test (Chapters 6 & 7, Zar 99)

  29. Inspection of the frequency histogram • Construct the frequency histogram • Calculate the mean and median (mode as well, if possible) • Check the shape of the distribution and the location of these measurements

  30. Probability plot c.f./61 =NORMSINV(X) e.g. 1

  31. Probability plot e.g. 1

  32. Probability plot e.g. 2

  33. Probability plot e.g. 2 • Obviously, the data do not fall on the line. • Based on the frequency distribution of the data, the distribution is positively skewed (higher frequencies in the lower classes)

  34. Concave curve indicates positive skew, which suggests a log-normal distribution (i.e. log-transformation of the upper class limit is required) • very common, e.g. mortality rates • Convex curve indicates negative skew • less common (e.g. some binomial distributions)

  35. S-shaped curve suggests ‘bad’ kurtosis: a departure from normality even though the mean, median and mode remain equal • Leptokurtic distribution: data bunched around the mean, giving a sharp peak • Platykurtic distribution: a broad summit which falls rapidly in the tails • Bimodal distributions, e.g. toxicity data, produce a sigmoid probability plot • Multi-modal distributions: data from animals with several age-classes; undulating wave-like curve

  36. Chi-Square Goodness of Fit 6.1 Accept Ho: the data are normally distributed

  37. =(345438-(49122/70))/(70-1)

  38. Kolmogorov-Smirnov one-sample test Another method can be found in example 7.14 (Zar 99)

  39. Symmetry (Skewness) and Kurtosis • Skewness • A measure of the asymmetry of a distribution. • The normal distribution is symmetric and has a skewness value of zero. • A distribution with a significant positive skewness has a long right tail. • A distribution with a significant negative skewness has a long left tail. • As a rough guide, a skewness value more than twice its standard error is taken to indicate a departure from symmetry.

  40. Symmetry (Skewness) and Kurtosis • Kurtosis • A measure of the extent to which observations cluster around a central point. • For a normal distribution, the value of the kurtosis statistic is 0. • Positive kurtosis indicates that the observations cluster more and have longer tails than those in the normal distribution (leptokurtic). • Negative kurtosis indicates that the observations cluster less and have shorter tails (platykurtic).
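To make the two measures concrete, here is a rough sketch using the simple moment-based definitions, $g_1 = m_3 / m_2^{3/2}$ and $g_2 = m_4 / m_2^2 - 3$. These are an assumption of this example: Excel and SPSS report bias-corrected versions, so their output will differ somewhat for small samples. The function name and the data are invented.

```python
def moment_skewness_kurtosis(data):
    """Return (g1, g2): moment-based skewness and excess kurtosis."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    g1 = m3 / m2 ** 1.5        # 0 for a symmetric distribution
    g2 = m4 / m2 ** 2 - 3      # 0 for a normal distribution
    return g1, g2


symmetric = [1, 2, 3, 4, 5]          # flat and symmetric: no skew, negative kurtosis
right_skewed = [1, 1, 1, 2, 10]      # long right tail: positive skew

for name, sample in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    g1, g2 = moment_skewness_kurtosis(sample)
    print(f"{name}: skewness = {g1:.2f}, excess kurtosis = {g2:.2f}")
```

Samples this small are only for illustration; as noted earlier in the lecture, judging normality needs a reasonably large n.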

  41. Important Notes • You should read Chapters 1-7 of Zar 1999, which have been covered by the five lectures so far. • The frequency distribution of a sample can often be identified with a theoretical distribution, such as the normal distribution. • Five methods for comparing a sample distribution with the normal distribution: inspection of the frequency histogram, probability plot, chi-square goodness of fit, Kolmogorov-Smirnov one-sample test and D’Agostino-Pearson $K^2$ test. • Probability plots can be used for testing normal and log-normal distributions. • Graphical methods often provide evidence of non-normal distributions, such as skewness and kurtosis (Excel or SPSS can determine the degree of these two measurements). • The chi-square goodness of fit or Kolmogorov-Smirnov one-sample test can also be used to test an unknown distribution against a theoretical distribution other than the normal distribution.

  42. Binomial & Poisson Distributions and their Application (Chapters 24 & 25, Zar 1999)

  43. Binomial • Consider nominal scale data that come from a population with only two categories • members of a mammal litter may be classified as male or female • victims of an epidemic as dead or alive • progeny of a Drosophila cross as white-eyed or red-eyed

  44. Binomial Distributions The proportion of the population belonging to one of the two categories is denoted as: • p, then the other q = 1- p • e.g. if 48% male and 52% female so p = 0.48 and q = 0.52

  45. (Source of photos: BBC)

  46. http://www.mun.ca/biology/scarr/Bird_sexing.htm http://zygote.swarthmore.edu/chap20.html

  47. Binomial Distributions • e.g. if p = 0.4 and q = 0.6: in a random sample of 10 individuals you would expect 4 males and 6 females; however, you might get 1 male and 9 females. • The probability of two independent events both occurring is the product of the probabilities of the two separate events: • (p)(q) = (0.4)(0.6) = 0.24; • (p)(p) = 0.16; and • (q)(q) = 0.36

  48. Binomial Distributions • e.g. if p = 0.4 and q = 0.6: in a random sample of 10 individuals you would expect 4 males and 6 females • The probability of either of two mutually exclusive events occurring is the sum of the probabilities of each event, e.g. for having one male and one female in a sample of two: pq + qp = 2pq = 2(0.4)(0.6) = 0.48 • All male, all female, or both sexes: 0.16 + 0.36 + 0.48 = 1
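The product and sum rules above are the n = 2 case of the binomial formula, $P(X = k) = \binom{n}{k} p^k q^{n-k}$. A small sketch (the function name `binomial_pmf` is made up) reproduces the slide's numbers and also gives the chance of the "1 male and 9 females" outcome mentioned for a sample of 10:

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k 'successes' (here: males) in n independent draws."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


p = 0.4  # proportion of males; q = 1 - p = 0.6

# Sample of two: all male, one of each sex, all female
two_males = binomial_pmf(2, 2, p)
one_each = binomial_pmf(1, 2, p)
two_females = binomial_pmf(0, 2, p)
print(f"{two_males:.2f} + {one_each:.2f} + {two_females:.2f} = "
      f"{two_males + one_each + two_females:.2f}")

# Sample of ten: exactly 1 male and 9 females
one_male_in_ten = binomial_pmf(1, 10, p)
print(f"P(1 male, 9 females) = {one_male_in_ten:.4f}")
```

So an outcome as lopsided as 1 male and 9 females is unlikely but far from impossible when p = 0.4.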
