Frequency distributions: Testing of goodness of fit and contingency tables

Chi-square statistics • Widely used for the analysis of nominal data • Introduced by Karl Pearson in 1900 • Its theory and application were expanded by Pearson and R. A. Fisher • This lecture covers the Chi-square test, the G test, and the Kolmogorov-Smirnov goodness of fit test for continuous data

The χ² test: χ² = Σ (observed freq. - expected freq.)² / expected freq. • Obtain a sample of nominal scale data and infer whether the population conforms to a certain theoretical distribution, e.g. in a genetic study • Test Ho that the observations (not the variables) are independent of each other for the population. • Based on the difference between the actual observed frequencies (not %) and the expected frequencies

The χ² test: χ² = Σ (observed freq. - expected freq.)² / expected freq. • A measure of how far a sample distribution deviates from a theoretical distribution • Ho: no difference between the observed and expected frequencies (HA: they are different) • If Ho is true: the difference and Chi-square → SMALL • If Ho is false: both measurements → LARGE

For Questionnaire Example (1) • In a questionnaire, 259 adults were asked what they thought about cutting air pollution by increasing tax on vehicle fuel. • 113 people agreed with this idea and the rest disagreed. • Perform a Chi-square test to determine the probability of the results being obtained by chance.

For Questionnaire
            Agree            Disagree
Observed    113              259 - 113 = 146
Expected    259/2 = 129.5    259/2 = 129.5
Ho: Observed = Expected
χ² = (113 - 129.5)²/129.5 + (146 - 129.5)²/129.5 = 2.102 + 2.102 = 4.204
df = k - 1 = 2 - 1 = 1
From the Chi-square table (Table B1 in Zar’s book): χ² (α = 0.05, df = 1) = 3.841; for χ² = 4.204, 0.025 < p < 0.05
Therefore, reject Ho. The probability of the results being obtained by chance is between 0.025 and 0.05.
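The questionnaire calculation can be checked with a short Python sketch. It uses only the figures on the slide; the function name `chi_square_gof` is illustrative, and the critical value 3.841 is copied from the slide rather than computed. Summing the unrounded terms gives roughly 4.205, against 4.204 from the rounded terms on the slide.

```python
def chi_square_gof(observed, expected):
    """Pearson chi-square: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [113, 146]          # agree, disagree
n = sum(observed)              # 259 respondents
expected = [n / 2, n / 2]      # Ho: equal split, 129.5 each

chi2 = chi_square_gof(observed, expected)
df = len(observed) - 1
critical_05 = 3.841            # table value for alpha = 0.05, df = 1 (from the slide)

print(f"chi-square = {chi2:.3f}, df = {df}, reject Ho: {chi2 > critical_05}")
```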

For Genetics Practical (1) • Calculate the Chi-square for a sample of 100 flowers against a hypothesized color ratio of 3:1 (red:green) and test Ho • Ho: the sample data come from a population having a 3:1 ratio of red to green flowers • Observation: 84 red and 16 green • Expected frequency for 100 flowers: 75 red and 25 green • Please do it now

For Genetics Practical (2) • Calculate the Chi-square for a sample of 100 flowers against a hypothesized color ratio of 3:1 (red:green) and test Ho • Ho: the sample data come from a population having a 3:1 ratio of red to green flowers • Observation: 67 red and 33 green • Expected frequency for 100 flowers: 75 red and 25 green • Please do it now

For Genetics: more than 2 categories • Ho: the sample of Drosophila comes from a population having a 9:3:3:1 ratio of pale body-normal wing (PNW) to pale-vestigial wing (PVW) to dark-normal wing (DNW) to dark-vestigial wing (DVW) • Student’s observations in the lab:
PNW    PVW    DNW    DVW    Total
300    77     89     36     502
Calculate the Chi-square and test Ho

For Genetics • Ho: the sample of Drosophila (F2) comes from a population having a 9:3:3:1 ratio of pale body-normal wing (PNW) to pale-vestigial wing (PVW) to dark-normal wing (DNW) to dark-vestigial wing (DVW)
                   PNW      PVW      DNW      DVW      Total
Observed           300      77       89       36       502
Exp. proportion    9/16     3/16     3/16     1/16     1
Expected           282.4    94.1     94.1     31.4     502
O - E              17.6     -17.1    -5.1     4.6      0
(O - E)²           309.8    292.4    26.0     21.2
(O - E)²/E         1.1      3.1      0.3      0.7
χ² = 1.1 + 3.1 + 0.3 + 0.7 = 5.2
df = 4 - 1 = 3
χ² (α = 0.05, df = 3) = 7.815; for χ² = 5.20, 0.10 < p < 0.25
Therefore, accept Ho.
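The same table can be rebuilt from the hypothesized ratio. This is a minimal sketch using the slide's counts; the variable names are illustrative and the critical value 7.815 is taken from the slide. Without intermediate rounding the statistic comes out near 5.18, which rounds to the 5.2 shown.

```python
labels = ["PNW", "PVW", "DNW", "DVW"]
ratio = [9, 3, 3, 1]                 # hypothesized 9:3:3:1 ratio
observed = [300, 77, 89, 36]

n = sum(observed)                                   # 502 flies
expected = [n * r / sum(ratio) for r in ratio]      # n times the expected proportion
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

chi2 = sum(contributions)
df = len(observed) - 1
critical_05 = 7.815                  # table value for alpha = 0.05, df = 3 (from the slide)

# One row per phenotype: observed, expected, and its share of chi-square
for label, o, e, c in zip(labels, observed, expected, contributions):
    print(f"{label}: O = {o}, E = {e:.1f}, (O - E)^2/E = {c:.2f}")
print(f"chi-square = {chi2:.2f}, df = {df}, reject Ho: {chi2 > critical_05}")
```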

For Questionnaire: Cross Tabulation or Contingency Tables • Further examination of the data on opinions about increasing fuel tax to cut air pollution (example 1) • Ho: the decision is independent of sex
            Males      Females
Agree       13 (a)     100 (b)
Disagree    116 (c)    30 (d)
Expected frequency for cell b = (a + b)[(b + d)/N]
                        Males                    Females                  n
Agree      observed     13                       100                      113
           expected     113(129/259) = 56.28     113(130/259) = 56.72
Disagree   observed     116                      30                       146
           expected     146(129/259) = 72.72     146(130/259) = 73.28
n                       129                      130                      259

Cross tabulation or contingency tables: • Ho: the decision is independent of sex
                        Males     Females    n
Agree      observed     13        100        113
           expected     56.28     56.72
Disagree   observed     116       30         146
           expected     72.72     73.28
n                       129       130        259
χ² = (13 - 56.28)²/56.28 + (100 - 56.72)²/56.72 + (116 - 72.72)²/72.72 + (30 - 73.28)²/73.28 = 117.63
df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
χ² (α = 0.05, df = 1) = 3.841; p < 0.001
Therefore, reject Ho and accept HA that the decision is dependent on sex.
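The expected frequencies and the Chi-square for a contingency table follow directly from the row and column totals. The sketch below is one way to write that out for any r x c table; the function names are illustrative, not from the lecture.

```python
def expected_table(table):
    """Expected count for each cell = row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_contingency(table):
    """Return (chi-square, df) for an r x c table of observed counts."""
    expected = expected_table(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: agree, disagree; columns: males, females (data from the slide)
opinion = [[13, 100], [116, 30]]
expected = expected_table(opinion)
chi2, df = chi_square_contingency(opinion)
print([[round(e, 2) for e in row] for row in expected])
print(f"chi-square = {chi2:.2f}, df = {df}")
```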

Quicker method for 2 x 2 cross tabulation:
           Class A    Class B    n
State 1    a          b          a + b
State 2    c          d          c + d
n          a + c      b + d      n = a + b + c + d
χ² = n(ad - bc)² / [(a + b)(c + d)(a + c)(b + d)]
            Males    Females    n
Agree       13       100        113
Disagree    116      30         146
n           129      130        259
χ² = 259(13 × 30 - 116 × 100)² / [(113)(146)(129)(130)] = 117.64
χ² (α = 0.05, df = 1) = 3.841; p < 0.001. Therefore, reject Ho.

Yates’ continuity correction: • The Chi-square distribution is continuous, whereas the frequencies being analyzed are discrete (whole numbers). • To improve the analysis, Yates’ correction is often applied (Yates, 1934): • χ² = Σ (|observed freq. - expected freq.| - 0.5)² / expected freq. • For a 2 x 2 contingency table: χ² = n(|ad - bc| - 0.5n)² / [(a + b)(c + d)(a + c)(b + d)]

Yates’ Correction (example 1): • χ² = n(|ad - bc| - 0.5n)² / [(a + b)(c + d)(a + c)(b + d)]
            Males    Females    n
Agree       13       100        113
Disagree    116      30         146
n           129      130        259
χ² = 259(|13 × 30 - 116 × 100| - 0.5 × 259)² / [(113)(146)(129)(130)] = 114.935 (smaller than 117.64, less bias)
χ² (α = 0.05, df = 1) = 3.841; p < 0.001. Therefore, reject Ho.
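The 2 x 2 shortcut, with and without Yates' correction, fits in one small function. This is a sketch under the slide's formula; the name `chi_square_2x2` and the `yates` flag are illustrative, and clamping the corrected difference at zero is an added safeguard that the slide does not mention.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2 x 2 table laid out as [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction: shrink |ad - bc| by n/2 (never below zero)
        diff = max(diff - 0.5 * n, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Example 1: agree/disagree by males/females
plain = chi_square_2x2(13, 100, 116, 30)
corrected = chi_square_2x2(13, 100, 116, 30, yates=True)
print(f"uncorrected = {plain:.2f}, Yates-corrected = {corrected:.3f}")
```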

Practical 3: • χ² = n(|ad - bc| - 0.5n)² / [(a + b)(c + d)(a + c)(b + d)] • For a drug test, Ho: the survival of the animals is independent of whether the drug is administered
               Dead    Alive    n
Treated        12      30       42
Not treated    27      31       58
n              39      61       100
Use Yates’ correction to calculate χ² and test the hypothesis. Please do it at home.

Bias in Chi-square calculations • If values of expected frequency (fi) are very small, the calculated χ² is biased in that it is larger than the theoretical χ² value and we shall tend to reject Ho. • Rules: fi > 1 and no more than 20% of fi < 5.0. • It may be conservative at significance levels < 5%, especially when the expected frequencies are all equal. • If some fi are small: (1) increase the sample size if possible, (2) use the G test, or (3) combine categories if possible.

The G test (log-likelihood ratio): G = 2 Σ O ln(O/E) • Similar to the χ² test • Many statisticians believe that the G test is superior to the χ² test (although at present it is not as popular) • For 2 x 2 cross tabulation:
           Class A    Class B
State 1    a          b
State 2    c          d
The expected frequency for cell a = (a + b)[(a + c)/n]
Practical 3 (expected frequencies in parentheses):
               Dead          Alive         n
Treated        12 (16.38)    30 (25.62)    42
Not treated    27 (22.62)    31 (35.38)    58
n              39            61            100

G = 2 Σ O ln(O/E)
               Dead          Alive         n
Treated        12 (16.38)    30 (25.62)    42
Not treated    27 (22.62)    31 (35.38)    58
n              39            61            100
(1) Calculate G: G = 2 [12 ln(12/16.38) + 30 ln(30/25.62) + 27 ln(27/22.62) + 31 ln(31/35.38)] = 2(1.681) = 3.362
(2) Calculate Williams’ correction: 1 + [(w² - 1)/6nd], where w is the number of frequency cells, n is the total number of measurements and d is the degrees of freedom, (r - 1)(c - 1): 1 + [(4² - 1)/(6)(100)(1)] = 1.025
G (adjusted) = 3.362/1.025 = 3.28 (< 3.31 from the χ² test)
χ² (α = 0.05, df = 1) = 3.841; p > 0.05. Therefore, accept Ho.
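The two steps (G, then the correction factor) can be sketched in Python as follows. The function name `g_test` is illustrative, and the correction uses the formula exactly as stated on the slide, with w taken as the number of cells. Because the logarithms are not rounded, G comes out a few thousandths above the slide's 3.362.

```python
import math

def g_test(table):
    """Return (G, correction factor q, adjusted G, df) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g_sum = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count for this cell
            if o > 0:                               # a zero cell contributes nothing
                g_sum += o * math.log(o / e)
    g = 2 * g_sum
    df = (len(table) - 1) * (len(table[0]) - 1)
    w = len(table) * len(table[0])                  # number of frequency cells
    q = 1 + (w ** 2 - 1) / (6 * n * df)             # correction as given on the slide
    return g, q, g / q, df

# Practical 3: rows treated / not treated, columns dead / alive
g, q, g_adj, df = g_test([[12, 30], [27, 31]])
print(f"G = {g:.3f}, q = {q:.3f}, adjusted G = {g_adj:.2f}, df = {df}")
```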

Ho: the sample of Drosophila (F2) comes from a population having a 9:3:3:1 ratio of pale body-normal wing (PNW) to pale-vestigial wing (PVW) to dark-normal wing (DNW) to dark-vestigial wing (DVW)
              PNW      PVW       DNW      DVW      Total
Observed      300      77        89       36       502
Expected      282.4    94.1      94.1     31.4
O ln(O/E)     18.14    -15.44    -4.96    4.92
G value: G = 2 × (18.14 - 15.44 - 4.96 + 4.92) = 5.32
Williams’ correction: 1 + [(4² - 1)/6(502)(3)] = 1.00166
G (adjusted): 5.32/1.00166 = 5.311
χ² (α = 0.05, df = 3) = 7.815; for G (adjusted) = 5.311, 0.10 < p < 0.25
Therefore, accept Ho.

The Kolmogorov-Smirnov goodness of fit test (Kolmogorov-Smirnov one-sample test) • Deals with goodness of fit for nominal scale data in ordered categories • Example: 35 cats were tested one at a time and allowed to choose among 5 diets with different moisture content (1 = very moist to 5 = very dry) • Ho: Cats prefer all five diets equally
            1    2     3     4    5    n
Observed    2    18    10    4    1    35
Expected    7    7     7     7    7    35

Kolmogorov-Smirnov one-sample test • Ho: Cats prefer all five diets equally
                1    2     3     4     5     n
O               2    18    10    4     1     35
E               7    7     7     7     7     35
Cumulative O    2    20    30    34    35
Cumulative E    7    14    21    28    35
|di|            5    6     9     6     0
dmax = maximum |di| = 9
Critical value (dmax) α, k, n = (dmax) 0.05, 5, 35 = 7 (Table B8; k = no. of categories)
Therefore reject Ho: 0.002 < p < 0.005
• When applicable (i.e. the categories are ordered), the K-S test is more powerful than the χ² test when n is small or when values of observed frequencies are small.
• Note: if the order of the same data is changed to 2, 1, 4, 18 and 10, the χ² test gives the same result (it is independent of the order) but the calculated dmax from the K-S test will be different.
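The dmax calculation, and the point about category order, can be illustrated with a short sketch. The function names are illustrative; the critical value (7) still has to come from Table B8 and is not computed here.

```python
from itertools import accumulate

def ks_dmax(observed, expected):
    """Largest absolute difference between cumulative observed and expected counts."""
    cum_o = accumulate(observed)
    cum_e = accumulate(expected)
    return max(abs(o - e) for o, e in zip(cum_o, cum_e))

def chi_square_gof(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [7, 7, 7, 7, 7]
original = [2, 18, 10, 4, 1]       # diets ordered 1 (very moist) to 5 (very dry)
reordered = [2, 1, 4, 18, 10]      # same counts in the order given in the note

d_original = ks_dmax(original, expected)
d_reordered = ks_dmax(reordered, expected)
print("dmax:", d_original, "vs", d_reordered)
print("chi-square:", round(chi_square_gof(original, expected), 2),
      "vs", round(chi_square_gof(reordered, expected), 2))
```

The K-S statistic changes with the ordering while the Chi-square does not, which is why the K-S test only makes sense when the categories have a natural order.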

Kolmogorov-Smirnov one-sample test for continuous ratio scale data • Example 22.11 (page 479 in Zar) • Ho: Moths are distributed uniformly from ground level to height of 25 m • HA: Moths are not distributed uniformly from ground level to height of 25 m • Use of Table B9

Kolmogorov-Smirnov one-sample test for grouped data (example 22.11) • The power is reduced by grouping the data and therefore grouping should be avoided whenever possible. • K-S test can be used to test normality of data

Recognizing the Normal Distribution

Recognizing the distribution of your data is important • Provides a firm base on which to establish and test hypotheses • If data are normally distributed, you can use parametric tests; • Otherwise transform data to normal distribution • Or non-parametric tests should be performed

For a reliable test for normality of interval data, n must be large enough (e.g. > 15) • Difficult to tell whether a small data set (e.g. 5) is normally distributed

Methods • Inspection of the frequency histogram • Probability plot • Chi-square goodness of fit • Kolmogorov-Smirnov one-sample test • Symmetry and Kurtosis: D’Agostino-Pearson K² test (Chapters 6 & 7, Zar 99)

Inspection of the frequency histogram • Construct the frequency histogram • Calculate the mean and median (mode as well, if possible) • Check the shape of the distribution and the location of these measurements

Probability plot e.g. 1 • Cumulative proportion X = c.f./61 (cumulative frequency divided by 61) • Normal score in Excel: =NORMSINV(X)
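The NORMSINV step converts each cumulative proportion into a standard normal score, which is then plotted against the upper class limit. Below is a minimal Python sketch of that conversion using `statistics.NormalDist().inv_cdf`, the standard-library counterpart of Excel's NORMSINV. The class limits and frequencies are hypothetical, and dividing by n + 1 (here 60 + 1 = 61) is an assumption about where the slide's 61 comes from.

```python
from itertools import accumulate
from statistics import NormalDist

# Hypothetical grouped data: upper class limits and class frequencies
upper_limits = [10, 20, 30, 40, 50, 60]
frequencies = [3, 9, 18, 18, 9, 3]

n = sum(frequencies)                               # 60 observations
cum_freq = list(accumulate(frequencies))           # cumulative frequency (c.f.)
cum_prop = [cf / (n + 1) for cf in cum_freq]       # c.f./(n + 1) keeps every value below 1

std_normal = NormalDist()                          # mean 0, standard deviation 1
z_scores = [std_normal.inv_cdf(p) for p in cum_prop]

# Points for the probability plot: a straight line suggests normality
for limit, p, z in zip(upper_limits, cum_prop, z_scores):
    print(f"upper limit {limit}: cumulative proportion {p:.3f}, z = {z:.2f}")
```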

Probability plot e.g. 1

Probability plot e.g. 2

Probability plot e.g. 2 • Clearly, the data points do not fall on the line. • Based on the frequency distribution of the data, the distribution is positively skewed (higher frequencies in the lower classes)

Concave curve indicates positive skew, which suggests a log-normal distribution (i.e. log-transformation of the upper class limit is required) • very common, e.g. mortality rates • Convex curve indicates negative skew • less common (e.g. some binomial distributions)

S-shaped curve suggests ‘bad’ kurtosis: a departure from normality in which the mean, median and mode remain equal • Leptokurtic distribution: data bunched around the mean, giving a sharp peak • Platykurtic distribution: a broad summit which falls rapidly in the tails • Bimodal distributions (e.g. toxicity data) produce a sigmoid probability plot • Multi-modal distributions (e.g. data from animals with several age-classes) produce an undulating, wave-like curve

Chi-Square Goodness of Fit 6.1 Accept Ho: the data are normally distributed

=(345438-(49122/70))/(70-1)

Kolmogorov-Smirnov one-sample test Another method can be found in example 7.14 (Zar 99)

Symmetry (Skewness) and Kurtosis • Skewness • A measure of the asymmetry of a distribution. • The normal distribution is symmetric, and has a skewness value of zero. • A distribution with a significant positive skewness has a long right tail. • A distribution with a significant negative skewness has a long left tail. • As a rough guide, a skewness value more than twice its standard error is taken to indicate a departure from symmetry.

Symmetry (Skewness) and Kurtosis • Kurtosis • A measure of the extent to which observations cluster around a central point. • For a normal distribution, the value of the kurtosis statistic is 0. • Positive kurtosis indicates that the observations cluster more and have longer tails than those in the normal distribution (leptokurtic). • Negative kurtosis indicates that the observations cluster less and have shorter tails (platykurtic).
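To make the sign conventions concrete, here is a minimal sketch of the simple moment-based versions of the two statistics: skewness = m3/m2^1.5 and kurtosis = m4/m2² - 3, so that both are 0 for a normal distribution. This is an illustration only: Excel and SPSS report bias-adjusted versions, so their values will differ somewhat from these, especially for small n. The two data sets are made up.

```python
def skewness_kurtosis(data):
    """Moment-based skewness and excess kurtosis (no small-sample adjustment)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n    # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n    # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n    # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]             # made-up, symmetric about 3
right_tailed = [1, 1, 2, 2, 3, 10]      # made-up, long right tail

skew_sym, kurt_sym = skewness_kurtosis(symmetric)
skew_right, kurt_right = skewness_kurtosis(right_tailed)
print(f"symmetric:    skewness = {skew_sym:.3f}, kurtosis = {kurt_sym:.3f}")
print(f"right-tailed: skewness = {skew_right:.3f}, kurtosis = {kurt_right:.3f}")
```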

Important Notes • You should read Chapters 1-7 of Zar 1999, which have been covered by the five lectures so far. • The frequency distribution of a sample can often be identified with a theoretical distribution, such as the normal distribution. • Five methods for comparing a sample distribution with a theoretical one: inspection of the frequency histogram; probability plot; Chi-square goodness of fit; Kolmogorov-Smirnov one-sample test; and D’Agostino-Pearson K² test. • Probability plots can be used for testing normal and log-normal distributions. • Graphical methods often provide evidence of non-normal distributions, such as skewness and kurtosis (Excel or SPSS can determine the degree of these two measurements). • The Chi-square goodness of fit or Kolmogorov-Smirnov one-sample test can also be used to test an unknown distribution against a theoretical distribution other than the normal distribution.

Binomial & Poisson Distributions and their Application (Chapters 24 & 25, Zar 1999)

Binomial • Consider nominal scale data that come from a population with only two categories • members of a mammal litter may be classified as male or female • victims of an epidemic as dead or alive • progeny of a Drosophila cross as white-eyed or red-eyed

Binomial Distributions The proportion of the population belonging to one of the two categories is denoted as: • p, then the other q = 1- p • e.g. if 48% male and 52% female so p = 0.48 and q = 0.52

(Source of photos: BBC)

http://www.mun.ca/biology/scarr/Bird_sexing.htm http://zygote.swarthmore.edu/chap20.html

Binomial Distributions • e.g. if p = 0.4 and q = 0.6: in a random sample of 10 individuals you would expect 4 males and 6 females; however, you might get 1 male and 9 females. • The probability of two independent events both occurring is the product of the probabilities of the two separate events: • (p)(q) = (0.4)(0.6) = 0.24; • (p)(p) = 0.16; and • (q)(q) = 0.36

Binomial Distributions • e.g. if p = 0.4 and q = 0.6: in a random sample of 10 individuals you would expect 4 males and 6 females • The probability of either of two mutually exclusive events occurring is the sum of the probabilities of each event, e.g. for having one male and one female in a sample of two: pq + qp = 2pq = 2(0.4)(0.6) = 0.48 • For a sample of two: all male + all female + both sexes = 0.16 + 0.36 + 0.48 = 1
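The probabilities for a sample of two are the terms of the binomial expansion (p + q)², and the same idea extends to the sample of 10. The sketch below uses the standard binomial formula, P(k) = C(n, k) p^k q^(n - k), which goes one step beyond what is written on these slides; the function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k 'successes' (here, males) in a sample of n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p_male = 0.4

# Sample of two: 0, 1 or 2 males
pairs = {k: binomial_pmf(k, 2, p_male) for k in range(3)}
for k, prob in pairs.items():
    print(f"{k} male(s) out of 2: {prob:.2f}")
print("total:", round(sum(pairs.values()), 10))

# Sample of ten: the expected outcome versus an unlikely one
p_four_males = binomial_pmf(4, 10, p_male)
p_one_male = binomial_pmf(1, 10, p_male)
print(f"4 males out of 10: {p_four_males:.4f}; 1 male out of 10: {p_one_male:.4f}")
```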
