Chapter 4 Analysis of Variance
Table 4.1: Weight gain • Pg 55 • 2 factors: source and type • 2 sources: Beef and Cereal • 2 types: High and Low • 1 measurement: weight gain
Factor Functions in R • factor(x = character(), levels = sort(unique.default(x), na.last = TRUE), labels = levels, exclude = NA, ordered = is.ordered(x)) • X - a vector of data, usually taking a small number of distinct values. • Levels - an optional vector of the values that x might have taken. The default is the set of values taken by x, sorted into increasing order. • Exclude - a vector of values to be excluded when forming the set of levels. This should be of the same type as x, and will be coerced if necessary. • Ordered - logical flag to determine if the levels should be regarded as ordered (in the order given). • ... - (in ordered(.)): any of the above, apart from ordered itself. • Related functions - is.factor(x), is.ordered(x), as.factor(x), as.ordered(x)
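A minimal sketch of the factor functions listed above. The vector `protein` holds made-up illustrative values, not the weight gain data.

```r
# Hypothetical example values; not taken from Table 4.1.
protein <- c("Low", "High", "High", "Low", "High")

# Default: levels are the sorted unique values, so "High" comes before "Low".
f_default <- factor(protein)

# Explicit level order, regarded as ordered (Low < High).
f_ordered <- factor(protein, levels = c("Low", "High"), ordered = TRUE)

levels(f_default)      # "High" "Low"
levels(f_ordered)      # "Low"  "High"
is.factor(f_default)   # TRUE
is.ordered(f_default)  # FALSE
is.ordered(f_ordered)  # TRUE
table(f_ordered)       # number of observations per level
```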
Design of Analysis • The model formula (the parameter for the aov function) to be used here is the two-way layout with interactions. This partitions the variation of the observations into a part for each factor, the interaction, and the error term. • Yijk = Mu + Gi + Bj + (G:B)ij + Eijk • This formula works well for a balanced data set: a set with the same number of observations for each combination of factor levels.
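In R's formula notation the two-way layout with interaction can be written in a short or an expanded form. The sketch below assumes the column names weightgain, source and type; no data are needed to inspect how the formula is expanded.

```r
# Column names are assumed to match the weight gain data frame.
full     <- weightgain ~ source * type
expanded <- weightgain ~ source + type + source:type

# Both formulas contain the same terms: two main effects and the interaction.
attr(terms(full), "term.labels")
# "source" "type" "source:type"
attr(terms(expanded), "term.labels")
```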
F-tests • For F-tests, the following assumptions are made: • The observations are independent of each other • The observations in each cell arise from a population having a normal distribution • The observations in each cell are from populations having the same variance (homogeneity)
Tapply function (help("tapply")) • tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE) • X - an atomic object, typically a vector. • INDEX - list of factors, each of same length as X. The elements are coerced to factors by as.factor. • FUN - the function to be applied. In the case of functions like +, %*%, etc., the function name must be quoted. If FUN is NULL, tapply returns a vector which can be used to subscript the multi-way array tapply normally produces. • ... - optional arguments to FUN: see the Note section of the help page. • Simplify - If FALSE, tapply always returns an array of mode "list". If TRUE (the default), then if FUN always returns a scalar, tapply returns an array with the mode of the scalar.
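A sketch of tapply computing cell means and standard deviations for a two-factor layout. The built-in warpbreaks data (response breaks, factors wool and tension) is used here as a stand-in, since the weight gain data requires an add-on package.

```r
# warpbreaks ships with R: breaks (numeric), wool (2 levels), tension (3 levels).
cell_means <- tapply(warpbreaks$breaks,
                     list(warpbreaks$wool, warpbreaks$tension), mean)
cell_sds   <- tapply(warpbreaks$breaks,
                     list(warpbreaks$wool, warpbreaks$tension), sd)

cell_means  # 2 x 3 matrix, one mean per wool/tension cell
cell_sds    # similar values across cells support the homogeneity assumption
```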
Interpretation: weightgain • Little difference between the variances of the cells, so the homogeneity assumption looks reasonable
Aov function (Analysis of Variance) • aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...) • Formula - A formula specifying the model. • Data - A data frame in which the variables specified in the formula will be found. If missing, the variables are searched for in the standard way. • Contrasts - A list of contrasts to be used for some of the factors in the formula. These are not used for any Error term, and supplying contrasts for factors only in the Error term will give a warning. • ... - Arguments to be passed to lm, such as subset or na.action.
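A minimal sketch of fitting a two-way layout with interaction using aov and printing the ANOVA table. It uses the built-in, balanced warpbreaks data as a stand-in for the weight gain data.

```r
# Two factors plus their interaction; warpbreaks has 9 observations per cell.
fit <- aov(breaks ~ wool * tension, data = warpbreaks)

# The usual ANOVA table: Df, Sum Sq, Mean Sq, F value and Pr(>F) per term.
summary(fit)
```

Arguments such as subset or na.action can be added to the call and are passed on to lm.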
Plot.design function • plot.design(x, y = NULL, fun = mean, data = NULL, ..., ylim = NULL, xlab = "Factors", ylab = NULL, main = NULL, ask = NULL, xaxt = par("xaxt"), axes = TRUE, xtick = FALSE) • X - either a data frame containing the design factors and optionally the response, or a formula or terms object. • Y - the response, if not given in x. • Fun - a function (or name of one) to be applied to each subset. It must return one number for a numeric (vector) input. • Data - data frame containing the variables referenced by x when x is formula-like. • ... - graphical arguments such as col, see par.
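A sketch of plot.design on a data frame with one numeric response and two factors, again using the built-in warpbreaks data as a stand-in.

```r
# One panel showing the mean response at each level of each factor.
plot.design(warpbreaks)

# The same display with medians instead of means.
plot.design(warpbreaks, fun = median)

# The values marked for the wool factor in the first plot are the level means.
wool_means <- tapply(warpbreaks$breaks, warpbreaks$wool, mean)
wool_means
```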
Summary of Weightgain ANOVA • The summary function displays the usual ANOVA table. • The main effect of type is significant (small p-value) • The interaction is borderline significant, which complicates the interpretation of the main effects
Estimate the intercept (Mu) and effects • To study the effect of the interaction, the coefficients are estimated first, using the constraints G1 = 0 and B1 = 0 • Use coef(wg_aov) • The source coefficient estimates the difference G2 - G1, that is, between Beef and Cereal • The current constraint setting can be checked using options("contrasts") • An alternative restriction is that the sum over i of Gi = 0
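A sketch comparing the two constraint sets on the built-in warpbreaks data (a stand-in; wg_aov itself is not reproduced here). With treatment contrasts the intercept is the mean of the baseline cell; with sum-to-zero contrasts in a balanced design it is the grand mean.

```r
options("contrasts")  # default for unordered factors is contr.treatment

# Treatment contrasts: effects of the first level of each factor are set to 0.
fit_trt <- aov(breaks ~ wool * tension, data = warpbreaks,
               contrasts = list(wool = "contr.treatment",
                                tension = "contr.treatment"))
coef(fit_trt)

# Sum-to-zero contrasts: the effects of each factor sum to 0.
fit_sum <- aov(breaks ~ wool * tension, data = warpbreaks,
               contrasts = list(wool = "contr.sum", tension = "contr.sum"))
coef(fit_sum)

baseline_mean <- with(warpbreaks, mean(breaks[wool == "A" & tension == "L"]))
grand_mean    <- mean(warpbreaks$breaks)
```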
Interpretation • A Low protein, Cereal combination leads to greater weight gain than a Low protein, Beef combination • A High protein, Beef combination leads to the highest weight gain.
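One way to read an interaction of this kind is an interaction plot of the cell means. A sketch with the built-in warpbreaks data as a stand-in: lines that are not parallel suggest an interaction between the two factors.

```r
# x axis: tension; one line per wool type; y axis: mean number of breaks.
with(warpbreaks, interaction.plot(tension, wool, breaks))

# The plotted points are the cell means.
cell_means <- with(warpbreaks, tapply(breaks, list(wool, tension), mean))
cell_means
```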
The data.frame Foster • Substantial differences in mean litter weight for the different genotypes of the mother. • Unbalanced number of observations! • In an unbalanced design, part of the variance of the response variable can be attributed to both factors • This shared portion must either be removed from the analysis or attributed to one factor or the other • The following analysis uses Type I (sequential) sums of squares • The effects are entered in different orders; the factor entered first acquires the portion of the variance that is shared by both factors.
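A sketch of the order dependence of Type I sums of squares. As an assumption for illustration, the built-in warpbreaks data is made unbalanced by dropping five rows; it stands in for the foster data.

```r
# Drop the first 5 rows (all wool A, tension L) to unbalance the design.
wb <- warpbreaks[-(1:5), ]
table(wb$wool, wb$tension)

# Same model, factors entered in a different order.
fit_wt <- aov(breaks ~ wool * tension, data = wb)
fit_tw <- aov(breaks ~ tension * wool, data = wb)
anova(fit_wt)
anova(fit_tw)

# Sequential sum of squares for wool: entered first versus entered second.
ss_wool_first  <- anova(fit_wt)["wool", "Sum Sq"]
ss_wool_second <- anova(fit_tw)["wool", "Sum Sq"]
```

The residual sum of squares is the same in both tables; only the split between the factors changes.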
Interpretation • There are small differences between the two ANOVA tables, so the data must be only slightly unbalanced. • The difference in litter weights between genotypes is significant in both analyses. • Use a multiple comparison procedure (Tukey Honest Significant Differences) for more detail • E.g. the effect of Genotype B on litter weight • Possibly only a difference between the B and J genotypes
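A sketch of Tukey Honest Significant Differences for one factor of a fitted aov model, using the built-in warpbreaks data as a stand-in for the foster data.

```r
fit <- aov(breaks ~ wool + tension, data = warpbreaks)

# All pairwise differences between tension levels, with adjusted
# confidence intervals (lwr, upr) and adjusted p-values (p adj).
hsd <- TukeyHSD(fit, "tension")
hsd

# Intervals that do not cross zero indicate a significant difference.
plot(hsd)
```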
Male Egyptian Skulls • Let's look at the four skull measurements for each of the 5 epochs using the aggregate function • Four dependent variables, so one ANOVA won't work • The aggregate function splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form. Arguments are: • X - an R object. • By - a list of grouping elements, each as long as the variables in x. • FUN - a scalar function to compute the summary statistics which can be applied to all data subsets. • Nfrequency - new number of observations per unit of time; must be a divisor of the frequency of x. • Ndeltat - new fraction of the sampling period between successive observations; must be a divisor of the sampling interval of x. • ts.eps - tolerance used to decide if nfrequency is a sub-multiple of the original frequency
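A sketch of aggregate computing group means for four measurements at once. The built-in iris data (four measurements, grouped by Species) is used as a stand-in for the skulls data.

```r
# Mean of each of the four measurements within each group.
means <- aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
means

# Scatterplot matrix of the group means (first column is the group label).
pairs(means[, -1])
```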
Interpretation • Large differences between epoch means • Graph indicates a correlation between the bl and mb measurements • Little evidence of correlation elsewhere
MANOVA function • The measurements are gathered into a response matrix using the cbind() function in the model formula. The manova function passes the call to the aov function and adds the class "manova" to the result. This class selects a different summary method, which takes a test parameter.
Interpretation • All four MANOVA tests show a small p-value, indicating a significant difference in the measurements between epochs. • Wilks' lambda (Λ) • Pillai-Bartlett trace • Lawley-Hotelling trace • Roy's greatest root
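A sketch of a one-way MANOVA and its four test statistics, with the built-in iris data as a stand-in for the skulls data.

```r
# cbind() builds the response matrix from the four measurements.
fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
              data = iris)

# The summary method for class "manova" takes a test argument.
summary(fit, test = "Pillai")
summary(fit, test = "Wilks")
summary(fit, test = "Hotelling-Lawley")
summary(fit, test = "Roy")

# p-value of the Wilks test for the grouping factor.
p_wilks <- summary(fit, test = "Wilks")$stats[1, "Pr(>F)"]
```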
Further Analysis • Univariate F-tests for each of the four dependent variables suggest that the differences between epochs in maximum breadth (mb) and basialveolar length (bl) are significant, while the others are not. • Comparing just two epochs at a time simplifies the analysis, and reveals that the p-value decreases as the epochs get farther apart, suggesting a greater change in skull measurements over time.
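A sketch of both follow-up steps on the built-in iris data (a stand-in for the skulls data): univariate F-tests per response with summary.aov, and a MANOVA restricted to two groups with the subset argument.

```r
fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
              data = iris)

# One univariate ANOVA table per dependent variable.
uni <- summary.aov(fit)
uni

# Compare only two of the groups.
fit_pair <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
                   data = iris,
                   subset = Species %in% c("setosa", "versicolor"))
summary(fit_pair)
```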
Homework • Please do Exercise 4.2 in the Purple R Book • “You’ll find it filed under purple”