Testing statistical hypotheses: The Chi-square test and how Sir R.A. Fisher caught Mendel cheating

enrico

Presentation Transcript


    1. Testing statistical hypotheses: The Chi-square test and how Sir R.A. Fisher caught Mendel cheating

    2. The Chi-Square Test. "How well does it fit the facts?" In many cases this question can be answered by the $\chi^2$-test. The test was invented in 1900 by Karl Pearson. It is used when there are more than two categories of data, like the probabilities of A, C, G, T in two DNA sequences, to check whether these categories are equally likely.

    3. A gambler is accused of using a loaded die, but he pleads innocent. A record has been kept of the last 60 throws:

       4 3 3 1 2 3 4 6 5 6
       2 4 1 3 3 5 3 4 3 4
       3 3 4 5 4 5 6 4 5 1
       6 4 4 2 3 3 2 4 4 5
       6 3 6 2 4 6 4 6 3 2
       5 4 6 3 3 3 5 3 1 4

       If the gambler is innocent, the numbers in the table should be like 60 random drawings with replacement from a box with {1,2,3,4,5,6}. Each number should show up about 10 times: the expected frequency is 10.

    4. Observed Frequencies

       Value   Observed freq.   Expected freq.
       1        4               10
       2        6               10
       3       17               10
       4       16               10
       5        8               10
       6        9               10

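    5. The $\chi^2$-statistic: $\chi^2 = \sum \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}$, summed over all lines of the table. For the die, $\chi^2 = \frac{(4-10)^2}{10} + \frac{(6-10)^2}{10} + \frac{(17-10)^2}{10} + \frac{(16-10)^2}{10} + \frac{(8-10)^2}{10} + \frac{(9-10)^2}{10} = 14.2$.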
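The tally and the statistic can be checked with a short sketch. It assumes the 60 throws recorded on slide 3 and a fair-die expectation of 10 per face; the function name `chi_square_statistic` is only an illustrative choice.

```python
from collections import Counter

# The 60 recorded throws from slide 3, in order (one string per row of 10).
THROWS = [int(ch) for ch in (
    "4331234656" "2413353434" "3345456451"
    "6442332445" "6362464632" "5463335314"
)]

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

counts = Counter(THROWS)
observed = [counts[face] for face in range(1, 7)]   # tallies for faces 1..6
expected = [len(THROWS) / 6] * 6                    # 10 per face for a fair die
chi2 = chi_square_statistic(observed, expected)

print(observed)          # [4, 6, 17, 16, 8, 9]
print(round(chi2, 1))    # 14.2
```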

    6. The P-value: the observed significance level. We need to know the chance that, when a fair die is rolled 60 times and $\chi^2$ is computed from the observed frequencies, its value turns out to be 14.2 or more. The answer: P = 1.4%. That is, if the die is fair, there is a 1.4% chance for the $\chi^2$-statistic to be as big as or bigger than the observed one. Conclusion: the gambler is in trouble!
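One way to make this chance concrete is to simulate it directly. The sketch below rolls a fair die 60 times, repeats that many times, and reports how often the statistic reaches 14.2 or more. The trial count and the seed are arbitrary assumptions, and the result is a Monte Carlo estimate that should only land near the 1.4% quoted above, not reproduce it exactly.

```python
import random

def chi_square_fair_die(rolls):
    """Chi-square statistic of a list of die rolls against a fair die."""
    expected = len(rolls) / 6
    return sum((rolls.count(face) - expected) ** 2 / expected
               for face in range(1, 7))

def simulate_p_value(observed_chi2, n_rolls=60, n_trials=10_000, seed=0):
    """Fraction of simulated fair-die experiments at least as extreme."""
    rng = random.Random(seed)          # fixed seed: an arbitrary choice
    hits = 0
    for _ in range(n_trials):
        rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
        # Small tolerance so floating-point rounding does not drop ties.
        if chi_square_fair_die(rolls) >= observed_chi2 - 1e-9:
            hits += 1
    return hits / n_trials

p_sim = simulate_p_value(14.2)
print(f"simulated P-value: {p_sim:.3f}")
```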

    7. Degrees of freedom. Pearson's $\chi^2$-curves come as a family, one curve for each number of degrees of freedom. In our case the model is fully specified, i.e., there is no parameter to estimate from the data, so degrees of freedom = (number of terms in $\chi^2$) - 1. For the die, that is 6 - 1 = 5.

    8. The $\chi^2$-test P-value. For the $\chi^2$-test, the P-value is approximately equal to the area to the right of the observed value of the $\chi^2$-statistic, under the $\chi^2$-curve with the appropriate number of degrees of freedom.
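The area under the curve can be computed without a statistics library. The sketch below uses the standard power series for the regularized lower incomplete gamma function, which gives the area to the left of a point under the chi-square curve; the P-value is one minus that. The function names are illustrative, and the inputs (14.2 with 5 degrees of freedom) are the die example from the earlier slides.

```python
import math

def chi2_cdf(x, df):
    """Area to the left of x under the chi-square curve with df degrees of
    freedom, via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2, x / 2
    term = total = 1.0 / a
    n = 0
    while term > 1e-16 * total:        # stop once terms no longer matter
        n += 1
        term *= half_x / (a + n)
        total += term
    return total * math.exp(a * math.log(half_x) - half_x - math.lgamma(a))

def chi2_p_value(x, df):
    """Area to the right of x: the P-value of the chi-square test."""
    return 1.0 - chi2_cdf(x, df)

print(f"{chi2_p_value(14.2, 5):.3f}")   # 0.014
```

In practice a library routine for the chi-square distribution would normally replace the hand-written series; it is spelled out here only to show what "area under the curve" means numerically.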

    9. [Figure: the $\chi^2$-curve. P = area under the curve to the right of the observed $\chi^2$ value.]

    10. Rule of thumb. The approximation given by the $\chi^2$-curve can be trusted when the expected frequency in each line of the table is 5 or more.
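As a small illustration, the rule can be written as a one-line check. The helper name `curve_is_trustworthy` is hypothetical, and the 20-throw case is a made-up counterexample, not data from the slides.

```python
def curve_is_trustworthy(expected, minimum=5):
    """Rule of thumb: every expected frequency is `minimum` (5) or more."""
    return all(e >= minimum for e in expected)

print(curve_is_trustworthy([60 / 6] * 6))   # True: 10 expected per face
print(curve_is_trustworthy([20 / 6] * 6))   # False: about 3.3 per face
```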

    11. Is Mendel's experimental data too good to be true? Yes! In 1865 Gregor Mendel published an article in which he provided a scientific explanation for heredity, and eventually caused a revolution in biology. Mendel's experiments were all performed on garden peas. Pea seeds are either yellow or green. Color is a property of the seed. Mendel bred a pure yellow strain, that is, a strain in which every plant in every generation had only yellow seeds; separately, he bred a pure green strain.

    12. Yellow and green peas. He then crossed plants of the pure yellow strain with plants of the pure green strain. The seeds resulting from a yellow-green cross, and the plants grown from them, are called first-generation hybrids. First-generation hybrid seeds are all yellow, indistinguishable from seeds of the pure yellow strain. The green seems to have disappeared completely. These first-generation hybrid seeds grew into first-generation hybrid plants, which Mendel crossed with themselves, producing second-generation hybrid seeds. Some of these second-generation seeds were yellow, but some were green. So the green disappeared for one generation but reappeared in the second. Even more surprising, the green reappeared in a simple proportion: of the second-generation hybrids, about 75% were yellow and about 25% were green.

    13. Factors, aka genes. To explain this, Mendel postulated the existence of factors, later called genes. According to Mendel's theory, there were two different variants of a gene, denoted Y and G, which paired up to control seed color. It is the gene-pair in the seed, not in the parent, which determines what color the seed will be, and all the cells making up a seed contain the same gene-pair.

    14. Y is dominant. There are four different gene-pairs: Y/Y, Y/G, G/Y, G/G. Gene-pairs control seed color by the rule: Y/Y, Y/G and G/Y make yellow; G/G makes green. As geneticists say, Y is dominant and G is recessive.

    15. Randomness. A seed grows and becomes a plant. All cells in this plant also carry the seed's color gene-pair, with one exception: sex cells, either sperm or eggs, contain only one gene of the pair. For example, a plant whose ordinary gene-pair is Y/Y will produce sperm cells each containing the gene Y; similarly, it will produce egg cells each containing the gene Y. A plant whose pair is Y/G will produce half of its sperm cells containing Y and half containing G. The same is true of its egg cells.

    16. First-generation model explanation. Plants of the pure yellow strain have the color pair Y/Y. Plants of the pure green strain have the color pair G/G. Crossing a pure yellow with a pure green produces a fertilized egg with gene-pair Y/G; this cell reproduces itself and eventually becomes a seed, in which all the cells have the gene-pair Y/G and which is yellow in color.

    17. Second-generation model explanation. A first-generation hybrid seed grows into a first-generation hybrid plant with gene-pair Y/G. This plant produces sperm cells of which half will contain the gene Y and the other half will contain the gene G; it also produces eggs of which half will be Y and half G. When two first-generation hybrids are crossed, each resulting second-generation hybrid seed gets one gene at random from each parent, because it is formed by the random combination of a sperm cell and an egg.
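The chance model described in the slides so far can be enumerated exactly. The sketch below assumes each parent passes on either of its two genes with probability 1/2, independently of the other parent, and applies the dominance rule that only G/G gives green. The names `seed_color` and `cross` are illustrative.

```python
from fractions import Fraction
from itertools import product

def seed_color(pair):
    """Dominance rule from the slides: only G/G gives a green seed."""
    return "green" if pair == ("G", "G") else "yellow"

def cross(parent_a, parent_b):
    """Distribution of seed colors when each parent passes on one of its
    two genes with probability 1/2, independently of the other parent."""
    dist = {"yellow": Fraction(0), "green": Fraction(0)}
    # Four equally likely (sperm gene, egg gene) combinations.
    for sperm_gene, egg_gene in product(parent_a, parent_b):
        dist[seed_color((sperm_gene, egg_gene))] += Fraction(1, 4)
    return dist

first_generation = cross(("Y", "Y"), ("G", "G"))
second_generation = cross(("Y", "G"), ("Y", "G"))
print(first_generation["yellow"])    # 1
print(second_generation["yellow"])   # 3/4
print(second_generation["green"])    # 1/4
```

The first cross yields only yellow seeds and the second yields the 75% to 25% split, matching what Mendel observed.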

    18. Mendel's chance model: he was right!

    19. Did Mendel's facts fit his model? "Only too well," answered R. A. Fisher.

    20. How Fisher used the $\chi^2$-test to show that Mendel was cheating. For each of Mendel's experiments, Fisher computed the $\chi^2$-statistic. These experiments were all independent, for they involved different sets of plants, so Fisher pooled the results.

    21. Too good to be true. For example, if one experiment gives $\chi^2 = 5.8$ with 5 degrees of freedom, and another independent experiment gives $\chi^2 = 3.1$ with 2 degrees of freedom, the two together have a pooled $\chi^2 = 5.8 + 3.1 = 8.9$ with $5 + 2 = 7$ degrees of freedom. For Mendel's data, Fisher got a pooled $\chi^2$ under 42, with 84 degrees of freedom. The area to the left of 42 under the $\chi^2$-curve with 84 degrees of freedom is about 4 in 100,000. The agreement between the observed and the expected is too good to be true.
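Both steps on this slide can be sketched in a few lines: pooling adds the statistics and the degrees of freedom, and the "too good" question looks at the left tail rather than the right. The left-tail area is computed here with the standard series for the regularized lower incomplete gamma function; the helper names are illustrative, and the printed area should come out close to the 4 in 100,000 quoted above.

```python
import math

def chi2_cdf(x, df):
    """Area to the left of x under the chi-square curve with df degrees of
    freedom, via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2, x / 2
    term = total = 1.0 / a
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    return total * math.exp(a * math.log(half_x) - half_x - math.lgamma(a))

def pool(results):
    """Pool independent experiments: add the statistics and add the
    degrees of freedom. `results` is a list of (chi2, df) pairs."""
    return sum(c for c, _ in results), sum(d for _, d in results)

pooled_chi2, pooled_df = pool([(5.8, 5), (3.1, 2)])
print(round(pooled_chi2, 1), pooled_df)    # 8.9 7

# Left-tail area for Mendel's pooled result (42 with 84 degrees of freedom).
left_area = chi2_cdf(42, 84)
print(f"{left_area:.1e}")
```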

    22. What does it mean? Suppose millions of scientists were repeating Mendel's experiments. For each scientist, imagine measuring the discrepancy between his observed frequencies and the expected frequencies by the $\chi^2$-statistic. Then, by the laws of chance, about 99,996 out of every 100,000 of these scientists would report a discrepancy between observations and expectations greater than the one reported by Mendel. That leaves two possibilities: (1) either Mendel's data were massaged, (2) or he was pretty lucky. The first is easier to believe.

    23. Using the chi-square test. To test the null hypothesis that the prescribed probabilities for the nucleotides of a sequence are $p_i$ for $i = 1, 2, 3, 4$ (aka A, C, G, T), we apply the $\chi^2$-test with $\chi^2 = \sum_{i=1}^{4} \frac{(Y_i - n p_i)^2}{n p_i}$. If the observed values are such that $\chi^2$ is large, we reject the null hypothesis. Here $Y_i$ is the number of nucleotides in category $i$ in our sequence, and $n$ is the length of the sequence. The formula for $\chi^2$ is a measure of discrepancy between the observed values $Y_i$ and the respective null hypothesis means $n p_i$. When the null hypothesis is true and $n$ is large, $\chi^2$ has approximately a chi-square distribution with $4 - 1 = 3$ degrees of freedom.
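A minimal sketch of this test for a nucleotide sequence follows. It compares the count of each base with the count expected under the null probabilities (sequence length times probability). Several pieces are assumptions rather than part of the slides: the 60-base sequence is made up, the null hypothesis is taken to be equal probabilities, and the cutoff 7.815 is the usual tabulated 5% critical value for 3 degrees of freedom.

```python
from collections import Counter

# Assumed cutoff: tabulated 5% critical value for 3 degrees of freedom.
CRITICAL_5_PERCENT_DF3 = 7.815

def nucleotide_chi_square(sequence, probs):
    """Chi-square statistic comparing nucleotide counts in `sequence`
    with the counts expected under the null probabilities `probs`."""
    n = len(sequence)
    counts = Counter(sequence)
    return sum((counts[base] - n * p) ** 2 / (n * p)
               for base, p in probs.items())

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
sequence = "ACGT" * 10 + "A" * 20     # made-up 60-base sequence, rich in A
chi2 = nucleotide_chi_square(sequence, uniform)
print(round(chi2, 1), chi2 > CRITICAL_5_PERCENT_DF3)   # 20.0 True
```

With 30 A's against an expected 15, the statistic is far above the cutoff, so this made-up sequence would lead to rejecting the equal-probability hypothesis.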
