BIOL 582

BIOL 582 Lecture Set 17 Analysis of frequency and categorical data Part II: Goodness of Fit Tests for Continuous Frequency Distributions; Tests of Independence

The first two examples included one frequency distribution and some known or true expectation.
• The first two examples included categorical data
• There are two different ways we can (and will) go
• 1. Goodness of fit tests for continuous frequency data
• 2. Goodness of fit tests for more than one distribution
• We have to start with one of these, so let’s start with 1.
• Before proceeding, it is important to establish two different kinds of hypotheses that are used as “null” models for frequency expectations. In the previous two examples, the expected frequencies were established by theory (expected genotypes) or by a larger empirical pool of information (species proportions). These are extrinsic hypotheses: the basis for the expected frequencies comes from outside the sample. Intrinsic hypotheses can also be used for estimating frequencies. For example, if we wish to test whether a continuous frequency distribution is normal, we can generate expected frequencies, but we first need to estimate the mean and variance from the sample. Thus, the degrees of freedom for the test are reduced by one for each of these additional parameter estimates.

We have done these types of tests nearly all semester!
• Kolmogorov-Smirnov and Shapiro-Wilk are such tests
• An old-fashioned way to do GOF tests was to break the continuous data into classes, find the expected frequency of each class, and proceed as we just learned.
• For example, we could compare the heights of the columns (observed) to the points on the red line at the center of each column (expected) to test if the continuous frequency distribution for this test statistic is normally distributed. Red notches indicate the height of the curve at the centers of bins, which indicate the expected frequencies/densities.
• This method is no longer considered appropriate (as changing the number of columns can change the outcome)
• We will use the K-S test as a standard, non-parametric GOF test between one distribution and either an intrinsic or extrinsic expectation of its frequency.
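To make the binned approach and its weakness concrete, here is a minimal R sketch. It assumes simulated data in place of a real sample, and the helper name `binned_gof` is hypothetical. Because the mean and sd are estimated from the sample (an intrinsic hypothesis), the sketch uses df = k - 1 - 2 for k classes.

```r
# Sketch only: simulated data stand in for a real sample (assumption).
set.seed(1)
x <- rnorm(200, mean = 10, sd = 2)

# Hypothetical helper: chi-square GOF to a normal curve using k classes
binned_gof <- function(x, k) {
  # k - 1 equally spaced interior breaks; the outer classes absorb the tails
  inner <- seq(min(x), max(x), length.out = k + 1)[2:k]
  breaks <- c(-Inf, inner, Inf)
  obs <- as.vector(table(cut(x, breaks)))
  # Expected counts from a normal curve with mean and sd estimated
  # from the sample (intrinsic hypothesis)
  expd <- length(x) * diff(pnorm(breaks, mean(x), sd(x)))
  chi2 <- sum((obs - expd)^2 / expd)
  df <- k - 1 - 2   # one df lost for each estimated parameter
  c(k = k, chi2 = chi2, df = df,
    p = pchisq(chi2, df, lower.tail = FALSE))
}

# Same data, different numbers of classes
res <- t(sapply(c(6, 10, 15), function(k) binned_gof(x, k)))
print(round(res, 4))
```

Running the same data through several values of k shows how the test statistic, df, and P-value all move with the choice of classes.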

What is the K-S test in a nutshell?
• The K-S test orders data from lowest to highest
• The observed “cumulative relative frequency” distribution is calculated by dividing rank by n (i.e., 1/n, 2/n, 3/n, …, n/n)
• In a stepwise fashion, a cumulative distribution function produces the expected cumulative relative frequency at each of the n steps
• The difference between observed and expected cumulative frequencies is measured at each step
• The largest absolute (vertical) distance is used as the test statistic, D.
• This distance is compared to critical values from a Kolmogorov distribution (you can see what that is on your own)
• There might be some “adjustments” made along the way to estimate the expected frequencies. Just assume that the canned function knows when to make such adjustments.
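The steps above can be sketched by hand and checked against `ks.test`. This is a minimal illustration that assumes simulated data and an extrinsic (fully specified) standard normal expectation. The observed step function jumps from (i-1)/n to i/n at each ordered value, so the sketch checks the vertical gap on both sides of every step.

```r
# Sketch only: simulated values stand in for real data (assumption).
set.seed(42)
x <- sort(rnorm(30))
n <- length(x)

# Expected cumulative relative frequency under a standard normal
e <- pnorm(x, mean = 0, sd = 1)

# Gap above the curve (top of each step) and below it (bottom of each step)
d_upper <- max((1:n) / n - e)
d_lower <- max(e - (0:(n - 1)) / n)
D <- max(d_upper, d_lower)

fit <- ks.test(x, "pnorm", 0, 1)
print(c(manual = D, ks.test = unname(fit$statistic)))
```

Comparing only i/n with the expected value, as in a quick `max(abs(o-e))` check, can miss the largest gap when it occurs at the bottom of a step. Taking the maximum over both sides should always agree with `ks.test`.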

A good way to understand K-S: test normality of residuals from a previous example. This is an intrinsic test.
> # Residuals from an analysis
> snake<-read.csv("snake.data.csv")
> attach(snake)
> Sex<-as.factor(Sex)
>
> lm.snake<-lm(HS ~ log(SVL) + Sex)
> r<-resid(lm.snake)
> r<-r/sd(r) # make residuals into standardized residuals (divide by the sd, not the variance)
> r<-sort(r) # sorts residuals from small to large
> n<-length(r)
>
> # Creating observed and expected frequencies
> o<-array(1:n)/n # observed cumulative relative frequencies
> e<-pnorm(r,mean=mean(r),sd=sd(r)) # expected cumulative relative frequencies

A good way to understand K-S: test normality of residuals from a previous example. This is an intrinsic test.
> # Evaluation
> max(abs(o-e))
[1] 0.1032246
>
> plot(r,o,ylab="Cumulative relative frequency",
+ xlab="Standardized Residuals",
+ main="Circles = observed; Line = expected")
> points(r,e,type="l")
>
> ks.test(r,'pnorm',mean(r),sd(r))
> # 'pnorm' indicates to get the cumulative area under
> # a curve (p) from a normal distribution
One-sample Kolmogorov-Smirnov test
data: r
D = 0.1032, p-value = 0.749
alternative hypothesis: two-sided

Let’s repeat the test with a different model on the same data, but where the residuals are a little less normal
> lm.snake<-lm(HS ~ sqrt(SVL+0.5*SVL^2))
> r<-resid(lm.snake)
> r<-r/sd(r) # standardize by the sd, not the variance
> r<-sort(r)
> n<-length(r)
>
> # Creating observed and expected frequencies
> o<-array(1:n)/n
> e<-pnorm(r,mean=mean(r),sd=sd(r))
>
> # Evaluation
> max(abs(o-e))
[1] 0.1616094
>
> plot(r,o,ylab="Cumulative relative frequency",xlab="Standardized Residuals",
+ main="Circles = observed; Line = expected")
> points(r,e,type="l")
>
> ks.test(r,'pnorm',mean(r),sd(r))
One-sample Kolmogorov-Smirnov test
data: r
D = 0.1616, p-value = 0.2217
alternative hypothesis: two-sided
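One caveat for intrinsic tests like these: the R help page for `ks.test` states that, for a one-sample test, the parameters of the reference distribution should be pre-specified and not estimated from the data. When the mean and sd come from the sample itself, the reported P-value tends to be too large. A minimal sketch of one common workaround follows. It is a Monte Carlo (Lilliefors-type) P-value, it assumes simulated values in place of the snake residuals, and `ks_d` and `n_sim` are hypothetical names.

```r
# Sketch only: simulated residual-like values stand in for real data (assumption).
set.seed(7)
r <- rnorm(40)

# K-S distance to a normal whose mean and sd are estimated from the sample
ks_d <- function(v) {
  unname(ks.test(v, "pnorm", mean(v), sd(v))$statistic)
}

d_obs <- ks_d(r)

# Null distribution of D when the parameters are re-estimated every time
n_sim <- 2000   # hypothetical number of simulated samples
d_null <- replicate(n_sim, ks_d(rnorm(length(r))))

p_mc <- mean(d_null >= d_obs)                          # Monte Carlo P-value
p_ks <- ks.test(r, "pnorm", mean(r), sd(r))$p.value    # unadjusted P-value
print(c(D = d_obs, p_monte_carlo = p_mc, p_ks.test = p_ks))
```

D does not change when the values are shifted or rescaled before the parameters are estimated, so standard normal samples are enough to build the null distribution.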

Let’s look at a process that produces one type of data tested against other distributions
> # Generate data from a log-normal distribution
> y<-rlnorm(50,meanlog=2.5,sdlog=0.6)
> y<-sort(y)
> r<-(y-mean(y))/sd(y)
> n<-length(r)
> o<-array(1:n)/n
>
> # Expected cumulative relative frequencies from three distributions
> e.norm<-pnorm(y,mean=mean(y),sd=sd(y))
> e.poisson<-ppois(y,lambda=mean(y))
> e.log.norm<-plnorm(y,meanlog=mean(log(y)),sdlog=sd(log(y)))
>
> par(mfrow=c(1,3))
> plot(y,o, main ="Compared to Normal",ylab="Cumulative relative frequency")
> points(y,e.norm,type='l')
> plot(y,o, main ="Compared to Poisson",ylab="Cumulative relative frequency")
> points(y,e.poisson,type='l')
> plot(y,o, main ="Compared to Lognormal",ylab="Cumulative relative frequency")
> points(y,e.log.norm,type='l')

Let’s look at a process that produces one type of data tested against other distributions
> ks.test(y,'pnorm',(mean(y)),(sd(y)))
One-sample Kolmogorov-Smirnov test
data: y
D = 0.2198, p-value = 0.01335
alternative hypothesis: two-sided
> ks.test(y,'ppois',(mean(y)))
One-sample Kolmogorov-Smirnov test
data: y
D = 0.4004, p-value = 9.531e-08
alternative hypothesis: two-sided
> ks.test(y,'plnorm',(mean(log(y))),sd(log(y)))
One-sample Kolmogorov-Smirnov test
data: y
D = 0.096, p-value = 0.7096
alternative hypothesis: two-sided

Now let’s consider the case where we have two sets of categorical frequencies, and we wish to compare them to determine if they have the same distributions of proportional outcomes (irrespective of the sample size)
• We have done this already: Contingency Table analysis
• Often contingency tables are called two-way or multi-way tables because the sample size (n) can be partitioned in two or more ways
• Sokal and Rohlf (2011) also describe and recommend the following
• The following examples are also from Sokal and Rohlf (2011)
• * Chi-square tests are also applicable

Example Model I (Box 17.6, Sokal and Rohlf 2011)
• A plant ecologist samples 100 trees of a rare species in a 400 square-mile area
• He records for each tree whether it is rooted in serpentine soil, and whether its leaves are pubescent or smooth
• Question: Do trees grown in serpentine soils have different ratios of smooth:pubescent leaves?
• H0: Ratios are equal

Example Model I (Box 17.6, Sokal and Rohlf 2011)
• Expected values are based on a multinomial distribution:
• For a two-way table
• The probability of observing the cell frequencies, a, b, c, and d, is computed as
• Via some steps reserved for additional reading,
• And G is -2 ln L, which has a computationally easier form
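The equations for this slide did not survive in the transcript. As a hedged reconstruction, the standard computational form of G for a 2 × 2 table with cell frequencies a, b, c, d and n = a + b + c + d is shown below. Its three-part pattern (cell terms, minus marginal terms, plus n ln n) matches the arithmetic shown for the Model II example, but the exact notation on the original slide is an assumption.

```latex
G = 2\Big[\, a\ln a + b\ln b + c\ln c + d\ln d
      \;-\; (a+b)\ln(a+b) - (c+d)\ln(c+d) - (a+c)\ln(a+c) - (b+d)\ln(b+d)
      \;+\; n\ln n \,\Big]
```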

Example Model I (Box 17.6, Sokal and Rohlf 2011)
• Expected values are based on a multinomial distribution:
• Observed
• G components (add these, then add these)

Example Model I (Box 17.6, Sokal and Rohlf 2011)
• Expected values are based on a multinomial distribution:
• Model I two-way tables have type I error rates that are higher than intended. Williams' correction is recommended
• Which for the current example is
• Thus
• The df is equal to (r-1)(c-1) = 1, for two rows and two columns
• The probability of finding a value of 1.30277 or higher from a Chi-square distribution with 1 df is 0.253708; thus do not reject the null hypothesis of same ratios (that is, retain the null hypothesis of independence: leaf type is independent of soil type)
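The whole calculation can be sketched in R. The observed counts did not survive in the transcript, so the counts below are an assumption, chosen because they sum to 100 trees and reproduce the adjusted G of 1.30277. The function name `g_test` is hypothetical, and the Williams correction is written in its general r × c form.

```r
# Sketch only. The counts are an assumption (not in the transcript):
# rows = leaf type, columns = soil type, 100 trees in total.
tab <- matrix(c(12, 22, 16, 50), nrow = 2,
              dimnames = list(leaf = c("pubescent", "smooth"),
                              soil = c("serpentine", "other")))

# Hypothetical helper: G-test of independence with Williams' correction
g_test <- function(tab) {
  n <- sum(tab)
  expd <- outer(rowSums(tab), colSums(tab)) / n   # expected under independence
  G <- 2 * sum(tab * log(tab / expd))             # assumes no zero cells
  df <- (nrow(tab) - 1) * (ncol(tab) - 1)
  # Williams' correction factor for an r x c table
  q <- 1 + (n * sum(1 / rowSums(tab)) - 1) *
           (n * sum(1 / colSums(tab)) - 1) / (6 * n * df)
  G_adj <- G / q
  c(G = G, q = q, G_adj = G_adj, df = df,
    p = pchisq(G_adj, df, lower.tail = FALSE))
}

res <- g_test(tab)
print(round(res, 5))
```

With these assumed counts, the adjusted G and its P-value should agree with the 1.30277 and 0.253708 reported on the slide.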

From now on, the general formula for the G stat is
• Also, unless otherwise stated, this is the same
• If the base is not given, the log is assumed to be natural
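The formula itself is missing from the transcript. A hedged reconstruction, using the usual textbook form of the G statistic with observed frequencies f_ij and expected frequencies under independence, is given below. The symbols R_i and C_j (row and column totals) are assumptions about notation, not taken from the slide.

```latex
G = 2 \sum_{i}\sum_{j} f_{ij} \,\ln\!\left(\frac{f_{ij}}{\hat{f}_{ij}}\right),
\qquad \hat{f}_{ij} = \frac{R_i \, C_j}{n}
```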

Example Model II (Sokal and Rohlf 2011)
• An immunology experiment involved inoculating 111 mice with a pathogenic bacterium
• 57 mice were also given antiserum
• After a sufficient amount of time, the number of dead mice was compared between the two treatments
• This is Model II because the number of mice in each treatment was fixed.
• H0: Ratios are equal

Example Model II (Sokal and Rohlf 2011)
• Observed
• G Components
• G = 2[377.97216 - 897.29087 + 522.75785] = 6.87828
• Gadj = 6.87828/1.15658 = 5.9471
• P-value = 0.01474; reject the null hypothesis; that is, the ratios are different
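The P-values quoted for both examples can be checked directly from the chi-square distribution with 1 df. This is a minimal sketch that uses the adjusted G values from the slides, rounded. The vector names are hypothetical.

```r
# Upper-tail chi-square probabilities for the adjusted G values (rounded)
g_adj <- c(model_I = 1.30277, model_II = 5.947)
p <- pchisq(g_adj, df = 1, lower.tail = FALSE)
print(signif(p, 4))
```

Setting `lower.tail = FALSE` returns the probability of a value this large or larger, which is the tail area the test needs.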

Next time… (Or next two times) • Model III and Fisher’s Exact test • More than 2 rows and columns • Odds-ratios for proportions • Logistic Regression
