Design of experiment: ANOVA and testing hypotheses

Design of experiment: ANOVA and testing hypotheses • ANOVA • Interactions • Contrasts • Multiple hypothesis tests

ANOVA tables Standard ANOVA tables look like Where v1,,,vp are values we want to test if they are 0. df-s is degrees of freedom corresponding to these values. SSh-s are corresponding sum of the squares (h denotes hypothesis). F is F-value used for F distribution. Its degrees of freedom are (di,de). Prob-s are corresponding probability. If a particular probability is very low (less than 0.05) then we reject hypothesis that this value is 0. These values are calculated using likelihood ratio test. Let us say we want to test hypothesis: H0: vi=0 vs H1:vi0 Then we maximise likelihood under null hypothesis to find corresponding variance and then we maximise the likelihood under alternative hypothesis and find corresponding variance. Then we calculate sum of the squares for null and alternative hypotheses and find F-statistics

LR test for ANOVA Suppose variances are: Then mean sum of the squares for the null and alternative hypotheses as: Since the first sum of the squares is 2 with degrees of freedom dfh and the second sum of squares is 2 with degrees of freedom dfe and they are independent then their ratio has F-distribution with degrees of freedom (dfh,dfe). Degrees of freedom of hypothesis is found using the number of levels of the factor minus 1 in the simplest case. Using this type of ANOVA tables we can only tell if there is significant differences between means. It does not tell which one is significantly different. Degrees of freedom of hypothesis is defined by the number of constraints it implies. Degrees of freedom of error is as usual the number of observations minus the number of (estimated) parameters

Example: Two way ANOVA Let us consider an example taken from Box, Hunter and Hunter. Experiment was done on animals. Survival times of the animals for various poisons and treatment was tested. Table is: treatment A B C D poisons I 0.31 0.82 0.43 0.45 0.45 1.10 0.45 0.71 0.46 0.88 0.63 0.66 0.43 0.72 0.76 0.62 II 0.36 0.92 0.44 0.56 0.29 0.61 0.35 1.02 0.40 0.49 0.31 0.71 0.23 1.24 0.40 0.38 III 0.22 0.30 0.23 0.30 0.21 0.37 0.25 0.36 0.18 0.38 0.24 0.31 0.23 0.29 0.22 0.33

ANOVA table ANOVA table produced by R: Df Sum Sq Mean Sq F value Pr(>F) pois 2 1.03828 0.51914 22.5135 4.551e-07 *** treat 3 0.92569 0.30856 13.3814 5.057e-06 *** pois:treat 6 0.25580 0.04263 1.8489 0.1170 Residuals 36 0.83013 0.02306 Most important values are F and Pr(>F). In this table we have tests for main effects - pois. and treat. Moreover we have “interactions” between these two factors. If there are significant interactions then it would be difficult to separate effects of these two factors. They should be considered simultaneously. In this case pr. for interaction is not very small and it is not large enough to discard interaction effects with confidence. In these situations transformation of the observations might help. Let us consider ANOVA table for the transformed observations. Let us use transformation 1/y. Now ANOVA table looks like: Df Sum Sq Mean Sq F value Pr(>F) pois 2 34.903 17.452 72.2347 2.501e-13 *** treat 3 20.449 6.816 28.2131 1.457e-09 *** pois:treat 6 1.579 0.263 1.0892 0.3874 Residuals 36 8.697 0.242

ANOVA table According to this table pr. corresponding to the interaction terms is high. It means that interaction for the transformed observations is not significant. We could remove interaction terms. We can build ANOVA table without the interactions. It will look like: Df Sum Sq Mean Sq F value Pr(>F) pois 2 34.903 17.452 71.326 3.124e-14 *** treat 3 20.449 6.816 27.858 4.456e-10 *** Residuals 42 10.276 0.245 Now we can say that there is significant differences between poisons as well as treatments. Sometimes it is useful to use transformation to reduce the effect of interactions. For this several different transformations (inverse, inverse square, log) could be used.

Interactions For one way ANOVA we need 1) to find out if there is significant effect (difference between means); 2) to analyse and find out where this difference is located For two way or higher levels we need first to consider interactions. We would like to remove interactions if we can. Otherwise tests for the main effects (differences between means of observations corresponding to different levels of factors) may not be reliable. If we see significant interactions then we can try transformations of the data. Box and Cox family of variance stabilisiting transformations is one way of removing interactions. If we find necessary transformation then we can carry out further analysis using the transformed data or use GLM with corresponding link function. If interactions can not be removed then we may need to combine all pairs together and work with them as if we have one way anova with IJ levels, where I is the number of levels for the first factor and J is the number of levels for the second factor.

What to do if there is an effect ANOVA tables will tell us if there is an effect somewhere. But it does not say where is this effect. If we can make conclusions that there is significant differences then we should carry on and find out where are these differences. One way (not recommended) is to look at the confidence intervals using confint. For example for poison data: 2.5 % 97.5 % (Intercept) 0.22201644 0.40631690 poison1 -0.09299276 0.01986776 poison2 -0.13414253 -0.06898247 treatB 0.23217989 0.49282011 treatC -0.05198677 0.20865344 treatD 0.08967989 0.35032011 Since these intervals were calculated after making decision that there are significant main effects they meant to be more reliable (we know that there are effects so effects we see on this table may correspond to actual effects that exist). It is called Fisher least significant differences - Fisher’s LSD. Confidence intervals based directly on t distribution are very optimistic when many tests are performed simultaneously.

Contrasts Let us say we have m parameters -  estimate and we have a vector c with m elements and with the property: ici=0. If we are dealing with unbalanced anova then inici=0 should be considered. Then we can build hypothesis H0: iici=0. iiciis called contrast. If we have a mxk matrix C then we can form k contrasts. Matrix C is called contrasts matrix. Usually this matrix is orthogonal. I.e. jcij =0 jcij ckj =0 In anova we are interested in effects, i.e. differences between between effects of different levels of some factors. They can be written using contrasts matrix. Default contrasts matrix used in R - contr.treatment uses this type of contrast matrix. Note that R uses contrasts by default and during the estimation usually not the parameters themselves but those corresponding to contrasts are estimated. Finding where the effect is becomes multiple hypotheses testing problem.

Multiple comparison • One comparison - use t-test or equivalent • Few comparisons - use Bonferroni • Many comparisons - Tukey’s honest significant differences, Holm, Scheffe or others

Bonferroni correction If there is only one comparison then we can use t-test or intervals based on t distribution. However if the number of tests increases then probability that significant effect will be observed when there is no significant effect becomes very large. It can be calculated using 1-(1-)n, where  is significance level and n is the number of comparisons. For example if the significance level is 0.05 and the number of comparisons (tests) is 10 then the probability that at least one significant effect will be detected by chance is 1-(1-0.05)10=0.40. Bonferroni suggested using /n instead of  for designing simultaneous confidence intervals. It means that the intervals will be calculated using i-j  t/(2n)(se of comparison)0.5 Clearly when n becomes very large these intervals will become very conservative. Bonferroni correction is recommended when only few effects are compared. pairwise.t.test in R can do Bonferroni correction. When finding p-values then p values are multiplied by the number of comparisons (Note that of we are testing effects of I levels of factor then the number of comparisons is I(I-1)/2

Bonferroni correction: Example Let us take the example dataset - poisons and try Bonferroni correction for each factor: pairwise.t.test(poison,treat,”none”,data=poisons) 1 2 2 0.32844 - 3 3.3e-05 0.00074 pairwise.t.test(poison,treat,”bonferroni”,data=poisons) 1 2 2 0.9853 - 3 1e-04 0.0022 As it is seen each p-value is multiplied by the number of comparisons 3*2/2 = 3. If the corresponding adjusted p-value becomes more than one then it is truncated to one. It says that there is significant differences between effects of poisons 1 and 3 and between 2 and 3. Difference between effects of poisons 1 and 2 is not significant. Note: Command in R - pairwise.t.test can be used for one way anova only.

Holm correction Another correction for multiple tests - Holmes correction is less conservation than Bonferroni correction. It is a modification of Bonferroni correction. It works in a sequential manner. Let us say we need to make n comparisons an significant level is . Then we calculate p values for all of them and sort them and apply the following procedure: • set i = 1 • If pi< /(n-i+1) then it is significant, otherwise it is not. • If a comparison number i is significant then increment i by one and if i  n go to the step 2 The number of significant effects will be equal to i where the procedure stops. When reporting p-values Holm correction works similar to Bonferroni but in a sequential manner. If we have m comparisons then the smallest p value is multiplied by m, the second smallest is multiplied by m-1, j-th comparison is multiplied by m-j+1

Holm correction: example Let us take the example - the data set poisons and try Holm correction for each factor: pairwise.t.test(poison,treat,”none”,data=poisons) 1 2 2 0.32844 - 3 3.3e-05 0.00074 pairwise.t.test(poison,treat,”holm”,data=poisons) # this correction is the default in R 1 2 2 0.3284 - 3 1e-04 0.0015 The smallest is multiplied by 3 the second by 2 and the largest by 1 It says that there is significant differences between effects of poisons 1 and 3 and between 2 and 3. Difference between effects of poisons 1 and 2 is not significant.

Tukey’s honest significant difference This test is used to calculate simultaneous confidence intervals for differences of all effects. Tukey’s range distribution. If we have a random sample of size N from normal distribution then the distribution of stundetised range - (maxi(xi)-mini(xi))/sd is called Tukey’s distribution. Let us say we want to test if i- j is 0. For simultaneous 100% confidence intervals we need to calculate for all pairs lower and upper limits of the interval using: difference ± ql, sd (1/Ji+1/Jj)0.5 /2 Where q is the -quantile of Tukey’s distribution, Ji and Jjare the numbers of observations used to calculate I and j, sd is the standard deviation, l is the number of levels to be compared and  is the degree of freedom used to calculate sd.

Tukey’s honest significant difference R command to perform this test is TukeyHSD. It takes an object derived using aov as an input and gives confidence intervals for all possible differences. For example for poison data (if you want to use this command you should use aov for analysis) lm1 = aov(time~poison+treat,data=poisons) TukeyHSD(lm1) $poison diff lwr upr p adj 2-1 -0.073125 -0.2089936 0.0627436 0.3989657 # insignifacnt 3-1 -0.341250 -0.4771186 -0.2053814 0.0000008 # significant 3-2 -0.268125 -0.4039936 -0.1322564 0.0000606 # significant $treat diff lwr upr p adj B-A 0.36250000 0.18976135 0.53523865 0.0000083 #sginificant C-A 0.07833333 -0.09440532 0.25107198 0.6221729 #insgnific D-A 0.22000000 0.04726135 0.39273865 0.0076661 #significant C-B -0.28416667 -0.45690532 -0.11142802 0.0004090 #significant D-B -0.14250000 -0.31523865 0.03023865 0.1380432 #insignific D-C 0.14166667 -0.03107198 0.31440532 0.1416151 #insignific

Scheffe’s simultaneous confidence intervals If we have a parameter vector then a linear combination cT is estimatable if there exists a vector a so that E(aTy) = cT. Scheffe’s theorem states that simultaneous 100(1-)% confidence interval for all estimatable  is:  + (q Fq,n-r() )1/2 (var()1/2 q is the dimension of the dimension of space of all possible contrasts, r is the rank of X (design matrix), n is the number of observations. It can also be applied for regression surface confidence intervals xT + (q Fq,n-r() )1/2 (var(xT(XTX)-1x))1/2

Testing for homogeneity of variances One of the simplest tests for homogeneity is using Levene’s test. It can be done using the algorithm: 1) fit linear model; 2) calculate residuals; 3) fit linear model using absolute values of residuals as a response vector and 4) test for differences. If p-values are very small then variances are not equal (homogenic) otherwise variances are not homogenic. There is also an R command levene.test Another test for homogeneity of variances is Bartlett test. It can be done using bartlett.test in R. Bartlett test is sensitive to violation normality assumption. Levene’s test is less sensitive to such violation.

Testing for homogeneity of variances Example: lm2 = lm(time~poison+treat) summary(lm(abs(lm2$res)~poison+treat)) Residual standard error: 0.09448 on 42 degrees of freedom Multiple R-squared: 0.2533, Adjusted R-squared: 0.1644 F-statistic: 2.849 on 5 and 42 DF, p-value: 0.02649 P value is 0.026 small enough (usually you would look around 0.01). lm2 = lm(1/time~poison+treat) summary(lm(abs(lm2$res)~poison+treat)) Residual standard error: 0.2634 on 42 degrees of freedom Multiple R-squared: 0.1494, Adjusted R-squared: 0.04812 F-statistic: 1.475 on 5 and 42 DF, p-value: 0.2183 Now p-value is sufficiently big. So we can say that variances are homogenic.

References • Stuart, A., Ord, KJ, Arnold, S (1999) Kendall’s advanced theory of statistics, Volume 2A • Box, GEP, Hunter, WG and Hunter, JS (1978) Statistics for experimenters • Berthold, MJ and Hand, DJ. Intelligent Data Analysis • Cox, GM and Cochran WG. Experimental design • Dalgaard, P. Introductory Statistics with R

Design of experiment: ANOVA and testing hypotheses