VI. Determining the analysis of variance table

VI. Determining the analysis of variance table VI.A The procedure VI.B The Latin square example VI.C Usage of the procedure VI.D Rules for determining the analysis of variance table — summary VI.E Determining the analysis of variance table — further examples Statistical Modelling Chapter VI

VI.A The procedure The 7 steps • Description of pertinent features of the study • The experimental structure • Sources derived from the structure formulae • Degrees of freedom and sums of squares • The analysis of variance table • Maximal expectation and variation models • The expected mean squares. • Should be done when designing an experiment as it allows you to work out the properties of the experiment: • in particular, what effects will occur in the experiment and how they will affect each other. Statistical Modelling Chapter VI

Works for the vast majority of experimental designs used in practice. Statistical Modelling Chapter VI

a) Description of pertinent features of the study • The first stage in determining the analysis of variance table is to identify the following features: • observational unit • response variable • unrandomized factors • randomized factors • type of study Statistical Modelling Chapter VI

Definitions of pertinent features • Definition VI.1: The observational unit is the native physical entity which is individually measured. • For example, a person in a survey or a run in an experiment. • Definition VI.2: The response variable is the measured variable that the investigator wants to see if the factors affect its response. • For example, the experimenter may want to determine whether or not there are differences in yield, height, and so on for the different treatments — this is the response variable; that is, the variable of interest or for which differences might exist. • Definition VI.3: The unrandomized factors are those factors that would index the observational units if no randomization had been performed. • Definition VI.4: The randomized factors are those factors associated with the observational unit as a result of randomization. • Definition VI.5: The type of study is the name of the experimental design or sampling method; for example, CRD, RCBD, LS, SRS, factorial, and so on. Statistical Modelling Chapter VI

Unrandomized vs randomized factors • To decide whether a factor is unrandomized or randomized, consider what information about the factors would be available if no randomization had been performed. • Unrandomized factors are innate to the observational units, so no need to have performed the randomization to know which of their levels are associated with the different observational units. • Levels of the randomized factors associated with the different observational units can only be known after the randomization has been performed. • Rule VI.1: To determine whether a factor is unrandomized or randomized, ask the following question: • For an observational unit, can I identify the levels of that factor associated with the unit if randomization has not been performed? • If yes then the factor is unrandomized, if no then it is randomized. Statistical Modelling Chapter VI

Features of experiments Example VI.1 Calf diets • In an experiment to investigate differences between two calf diets the progeny of five dams who had twins were taken. • For the two calves of each dam, one was chosen at random to receive diet A and the other diet B. • The weight gained by each calf in the first 6 months was measured. • The observations for the experiment might be: Statistical Modelling Chapter VI

Summary of the features of the study • Observational unit • A calf (10) • Response variable • Weight gain • Unrandomized factors • Dam, Calf • Randomized factors • Diet • Type of study • RCBD Statistical Modelling Chapter VI

Unrandomized factors unique • Dam and Calf uniquely identify the observations as no 2 observational units with the same combination of these two factors • (for example 2,1). Statistical Modelling Chapter VI

Example VI.2 Plant yield • Consider a CRD experiment consisting of 5 observations, each observation being the yield of a single plot which had one of three varieties applied to it. • The results of the experiment are as follows: • What are the features of the study? Statistical Modelling Chapter VI

Features of plant experiment • Observational unit • a plot Variables (including factors) are? • Yield, Variety, Plot • Response variable • Yield • Unrandomized factors • Plot • Randomized factors • Variety • Type of study • CRD • Note that two of the levels of the factor Variety are replicated twice and the third only once. Statistical Modelling Chapter VI

Additional points • Feature reflected in ANOVA table, particularly the unrandomized and randomized factors. • The number of unrandomized factors is a characteristic of each design. Statistical Modelling Chapter VI

Example VI.3 Pollution effects of petrol additives • Consider the experiment to investigate the reduction in the emission of nitrous oxides resulting from the use of four different petrol additives. • Four cars and four drivers are employed in this study with additives being assigned to a particular driver-car combination according to a Latin square. Statistical Modelling Chapter VI

Features of pollution experiment • Observational unit • a car with a driver Variables (including factors) are? • Reduction, Driver, Car, Additive • Response variable • Reduction • Unrandomized factors • Driver, Car • Randomized factors • Additive • Type of study • Latin square Statistical Modelling Chapter VI

Features of surveys Example VI.4 Vineyard sampling • A vineyard of 125 vines is sampled at random with 15 vines being selected at random and the yields measured. • What are the features of the study? • Observational unit • a vine Variables (including factors) are? • Yield, Vines • Response variable • Yield • Unrandomized factors • Vines • Randomized factors • Not applicable • Type of study • Simple random sample Statistical Modelling Chapter VI

Example VI.5 Smoking effect on blood cholesterol • Consider an observational study to investigate the effect of smoking on blood cholesterol by observing 30 patients and recording whether they smoke tobacco and measuring their blood cholesterol. Suppose it happens that 11 patients smoke and 19 patients do not smoke. • What are the features of the study? • Observational unit • a patient Variables (including factors) are? • Blood cholesterol, Smoking, Patients • Response variable • Blood cholesterol • Unrandomized factors • Smoking, Patients • Randomized factors • Not applicable • Type of study • Survey Statistical Modelling Chapter VI

Unrandomized vs randomized factors revisited • In determining the unrandomized and randomized factors it is most important to distinguish between • randomization: random selection to assign • random sampling: random selection to observe a fraction of a wholly observable population • It is not surprising that surveys do not contain randomized factors, since they do not involve randomization. • Remember the crucial question is: • If I take an observational unit, can I tell which level of this factor is associated with that unit without doing the randomization? • If yes then unrandomized, otherwise randomized. Statistical Modelling Chapter VI

b) The experimental structure • Having determined the unrandomized and randomized factors, one next determines the experimental structure. • Rule VI.2: Determine the experimental structure by • describing the nesting and crossing relationships between the unrandomized factors in the experiment, • describing the crossing and nesting relationships between • the randomized factors, and • (the randomized and the unrandomized factors, if any.) • Nos of levels of factors placed in front of names of factors. • Often step ii) not be required. • assume that the effects of the randomized factors are approximately the same for each observational unit • so unrandomized and randomized factors can be treated as independent. Statistical Modelling Chapter VI

Crossing vs nesting • Definition VI.6: Two factors are intrinsically nested if units with the same level of the nested factor, but different levels of the nesting factor, have no apparent characteristic in common. • Definition VI.7: Two factors are intrinsically crossed if units with the same level of one factor, but different levels of the second factor, have a common characteristic associated with the first factor. Statistical Modelling Chapter VI

Examples • Suppose I have 6 heights for 3 students of each sex. Example VI.6 Student height — unknown age • Student (3 levels) is intrinsically nested within Sex (2 levels). • Two Student 1s, have different Sex • However, apparently nothing in common, save inconsequentially labelled with same no. Example VI.7 Student height — known age • Suppose I have 6 students indexed by • Sex, with 2 levels, and • Age, with 3 levels. • Two factors are intrinsically crossed because two 18 year-old students, even though different sex, have in common that they are both 18. Statistical Modelling Chapter VI

Notation for factor relationships • Between two factors: • a slash (‘/’) indicates that they are nested (nested factor on right). • an asterisk (‘*’) indicates that they are crossed • a wedge (‘’) signifies all combinations of the two factors • a plus (+) indicates they are to be considered independently. • Order (high to low) of precedence of operators in a structure formula is ‘’, ‘/’, ‘*’ and ‘+’. • For example, the structure formula A * B / C is the same as A * ( B / C ). Statistical Modelling Chapter VI

Example VI.1 Calf diets (continued) • The factors were designated: • Unrandomized factors – Dam, Calf • Randomized factors – Diet • So are the unrandomized factors nested? • Well, 'Do we have information that connects a calf from one dam with any of the calves from another dam?' • No. So they are nested. Statistical Modelling Chapter VI

Example VI.1 Calf diets (continued) • In fact, Calf is nested within Dam. • This is written symbolically as Dam/Calf. • Thus the experimental structure for this experiment is: Statistical Modelling Chapter VI

Example VI.3 Pollution effects of petrol additives (continued) • The factors were designated: • Unrandomized factors – Driver, Car • Randomized factors – Additive • So are the unrandomized factors nested? • Well, 'Do we have information that connects one of the drivers of a car with a driver from another car?' • Yes, one of the 4 from each of the other cars is the same driver. Statistical Modelling Chapter VI

Example VI.3 Pollution effects of petrol additives (continued) • They are crossed. • This is written symbolically as Driver*Car. • Thus the experimental structure for this experiment is: Statistical Modelling Chapter VI

Respecting intrinsic crossing • In pollution experiment Drivers & Cars intrinsically crossed. • Consider following two designs for the experiment: • LS on left — each Additive once in each row and column • RCBD, with Cars (columns) as blocks, on right. • Each additive is used in a car once and only once. • Same cannot be said of drivers. • Appropriate structure for this design would be Car/Driver. • Use if thought no driver differences – more Residual df • Only LS respects intrinsic crossing by restricting randomization in both directions Statistical Modelling Chapter VI

Crossing & nesting not just intrinsic • A factor will be nested within another either because: • they are intrinsically nested or • because the randomization employed requires that they be so regarded. • Hence, for two factors to be crossed requires that: • they are intrinsically crossed and • that the randomization employed respects this relationship. • A Latin square respects the crossing of its two unrandomized factors, the RCBD does not. Statistical Modelling Chapter VI

Unrandomized factors for the different designs • The set of unrandomized factors will uniquely identify the observations. • Surveys have only unrandomized factors. • CRD — only one unrandomized factor, the only design that does. RCBD (& BIBD) — two unrandomized factors, one of which is nested within the other. LS (& YS) — two unrandomized factors that are crossed. Statistical Modelling Chapter VI

c) Sources derived from the structure formulae • Having determined the experimental structure, the next step is to expand the formulae to obtain the sources that are to be included in the analysis of variance table. • Rule VI.3: The rules for expanding structure formulae involving two factors A and B are: • A*B = A + B + A#B where A#B represents the interaction of A and B • A/B = A + B[A] `where B[A] represents the nested effects of B within A • More generally, if L and M are two formulae • L*M = L + M + L#M where L#M is the sum of all pairwise combinations of a source in L with a source in M • L/M = L + M[gf(L)] where gf(L) is the generalized factor (see definition VI.8 in section D, Degrees of freedom) formed from the combination of all factors in L. Statistical Modelling Chapter VI

Examples Example VI.1 Calf diets (continued) • Dam/Calf = Dam + Calf[Dam] • where Calf[Dam] measures the difference between calves with the same Dam. Example VI.3 Pollution effects of petrol additives (continued) • Driver*Car = Car + Driver + Driver#Car • where Driver#Car measures extent to which Driver differences change from Car to Car. (More about interaction later.) Statistical Modelling Chapter VI

d) Degrees of freedom and sums of squares • The DFs for an ANOVA can be calculated with the aid of Hasse diagrams for Generalized-Factor Marginalities for each structure formula. • Definition VI.8: A generalized factor is the factor formed from several (original) factors and whose levels are the combinations that occur in the experiment of the levels of the constituent factors. • The generalized factor is the "meet" of the constituent factors and it is written as the list of constituent factors separated by "wedges" or "meets" (""). • For convenience we include ordinary factors amongst the set of generalized factors for an experiment. • Generalized factor corresponding to each source obtained from structure formulae —consists of the factors in source. Statistical Modelling Chapter VI

Example VI.3 Pollution effects of petrol additives (continued) • Consider sources Car and Driver#Car from Latin square example. • The two generalized factors corresponding to these sources are Car and DriverCar • DriverCar is a factor with 4  4 = 16 levels, one for each combination of Driver and Car. Statistical Modelling Chapter VI

Marginality of generalized factors • Hasse diagram for Generalized-Factor Marginalities displays marginality relationships between the generalized factors corresponding to the sources from a structure formula. • Have previously discussed marginality for models. • Basically same here, except applies to generalized factors that are potentially single indicator-variable terms in a model. • Definition VI.9: One generalized factor, V say, is marginal to another, Z say, if the factors in the marginal generalized factor are a subset of those in the other and this will occur irrespective of the replication of the levels of the generalized factors. • We write V  Z. • Of course, a generalized factor is marginal to itself (V = V). Example VI.3 Pollution effects of petrol additives (continued) • Car is marginal to DriverCar (Car < DriverCar) as the factor in Car is a subset of those in DriverCar. Statistical Modelling Chapter VI

Constructing Hasse diagram • Rule VI.4: The Hasse diagrams for Generalized-Factor Marginalities for a structure formula is formed by • placing generalized factors above those to which they are marginal • connecting them by an upwards arrow • adding each source alongside its generalized factor (to save space use only 1st letter for each factor). • In constructing a Hasse diagram, • The Universe factor (no named factor), connecting all units that occurred in the experiment, is included at the top of the diagram. • Next consider generalized factors consisting of 1 factor, then 2 factors, and so on. Works because must be less factors in a marginal generalized factor. • Under a generalized factor write its no. of levels. • Under a source write the DF for the corresponding source. • Calculate as difference between the entry for its generalized factor and the sum of the DFs of all sources whose generalized factors are marginal to the current generalized factor. Construct Hasse diagram for Dam/Calf and Driver*Car on board. Statistical Modelling Chapter VI

DF when all factors in formula are crossed • Rule VI.5: When all the factors are crossed, the degrees of freedom of any source can be calculated directly. The rule for doing this is: • For each factor in the source, calculate the number of levels minus one and multiply these together. Example VI.3 Pollution effects of petrol additives (continued) • Both Car and Driver in Latin square example have 4 levels. • The degrees of freedom of Driver#Car, corresponding to DriverCar, is (4-1)(4-1) = 32= 9. Statistical Modelling Chapter VI

Hasse diagrams for some of the experiments Statistical Modelling Chapter VI

Source corresponding to generalized factor BlocksUnits is Units[Blocks]. • BlocksUnits has bt levels • Units[Blocks] has b(t-1) degrees of freedom Statistical Modelling Chapter VI

Source corresponding to the generalized factor RowsColumns is Rows#Columns. • RowsColumns has t2 levels • Rows#Columns has (t-1)2 degrees of freedom. Statistical Modelling Chapter VI

Expression for Qs in terms of Ms • Clear can write SSq as Y'QY— but how to compute SSq • Need expression for Q in terms of Ms • Rule VI.6: There is a mean operator (M) and a quadratic-form (Q) matrix for each generalized factor and source, respectively, obtained from the structure formulae. • To obtain expressions for them, take the Hasse diagram of generalized factors for the formula: • for each generalized factor, replace its number of levels combinations with its M matrix. • for each source, work out the expression for its Q matrix by • taking the M matrix for its generalized factor; • subtract all expressions for Q matrices of sources whose generalized factors are marginal to the generalized factor for the source whose expression you are deriving; • replace the degrees of freedom under the source in the Hasse diagram with the expression. Statistical Modelling Chapter VI

Example VI.3 Pollution effects of petrol additives (continued) Statistical Modelling Chapter VI

Estimators of the quadratic forms or SSqs on which ANOVA based • YQDY, YQCY , YQDCY and YQAY • where • QD=MD-MG • QC=MC-MG • QDC=MDC-MD-MC + MG • QA=MA-MG • Note no YQGY • From this can say what vector to calculate for forming SSq • e.g. YQDCY is the SSq of QDCY where • so element is? hence summation form of SSq is? Statistical Modelling Chapter VI

Example VI.1 Calf diets (continued) • Note use of "d" for Diet in contrast to "D" for Dam. • Thus estimators for SSqs are YQDY, YQDCY and YQdY • where QD=MD-MG, QDC=MDC-MD and Qd=Md-MG. Statistical Modelling Chapter VI

e) The analysis of variance table • At this step an analysis of variance table with the sources, their df and the quadratic-form estimators is formulated. • Use following rule, although it only specifies the quadratic-form estimators for orthogonal experiments. • Rule VI.7: The analysis of variance table is formed by: • Listing down all the unrandomized sources in the Source column, and their degrees of freedom in the df column and the quadratic forms in the SSq column. • Then the randomized sources are placed indented under the unrandomized sources with which they are confounded, along with their df and, if the design is orthogonal, their quadratic forms. • Residual sources are added to account for the left-over portions of unrandomized sources and their dfs and quadratic forms are computed by difference.For orthogonal experiments, matrix of Residual quadratic form is the difference of the matrices of the quadratic forms from which it is computed. Statistical Modelling Chapter VI

Example VI.1 Calf diets (continued) and it can be proven that this matrix is symmetric and idempotent. Statistical Modelling Chapter VI

Example VI.3 Pollution effects of petrol additives (continued) and it can be proven that this matrix is symmetric and idempotent. Statistical Modelling Chapter VI

f) Maximal expectation and variation models • Have been writing down expectation and variation models as the sum of a set of indicator-variable terms, these terms being derived from the generalized factors. • Rule VI.8: To obtain the terms in the expectation and variation model: • Designate each factor in the experiment as either fixed or random. • Determine whether a generalized factor is a potential expectation or variation term as follows: • generalized factors that involve only fixed (original) factors are potential expectation terms • generalized factors that contain at least one random (original) factor will become variation terms. • If there is no unrandomized factor that has been classified as random, the term consisting of all unrandomized factors will be designated as random. • The maximal expectation model is then the sum of all the expectation terms except those that are marginal to a term in the model; if there are no expectation terms, the model consists of a single term for the grand mean. • The maximal variation model is the sum of all the variation terms. Statistical Modelling Chapter VI

Fixed versus random factors • So first step in determining model is to classify all the factors in the experiment as fixed or random. • Definition VI.10: A factor will be designated as random if it is anticipated that the distribution of effects associated with the population set of levels for the factor can be described using a probability distribution function. • Definition VI.11: A factor will be designated as fixed if it is anticipated that a probability distribution function will not provide a satisfactory description the set of effects associated with the population set of levels for the factor. • In practice • Random if • large number of population levels and • random behaviour • Fixed if • small or large number of population levels and • systematic or other non-random behaviour Statistical Modelling Chapter VI

Fixed versus random (continued) • Remember, must always model terms to which other terms have been randomized as random effects. • For example, because Treatments are randomized to Units (within Blocks) in an RCBD, Units must be a random factor. • It often happens, but not always, that: • unrandomized factors = random factors so all terms from unrandomized structure in variation model and • randomized factors = fixed factors so all terms from randomized structure, minus marginal terms, in expectation model. Statistical Modelling Chapter VI

Example VI.1 Calf diets (continued) • For Diet: random or fixed? • it is most unlikely that the 2 observed levels will be thought of as being representative of a very large group of Diets and we anticipate that there will be arbitrary differences between the different Diets and so it is a fixed factor. • For Dam: random or fixed? • we can envisage a very large population of dams of which the 5 are representative and random dam differences so the set of population effects may well be described by a probability distribution and so it is a random factor. • For Calf: random or fixed? • similarly, a large population of calves and random calf differences so population effects described by probability distribution. • anyway Diets randomized to Calf • Thus, appropriate to designate Dam and Calf as random factors. • (diagnostic checking will try to support this assumption). Statistical Modelling Chapter VI

Formulating the model • The generalized factors in the experiment are: • Dam, DamCalf and Diet. • As the first two generalized factors contain at least one random factor, they are variation terms. • The generalized factor Diet consists of only a fixed factor and so it is a potential expectation term. • Hence, the maximal model to be used for this experiment is: • y = E[Y] = Diet • V[Y] = Dam + DamCalf. • fixed = randomized and random = unrandomized Statistical Modelling Chapter VI

VI. Determining the analysis of variance table