Topic 9 – Multiple Comparisons Multiple Comparisons of Treatment Means Reading: 17.7-17.8
Overview • Brief Review of One-Way ANOVA • Pairwise Comparisons of Treatment Means • Multiplicity of Testing • Linear Combinations & Contrasts of Treatment Means
Review: One-Way ANOVA • Analysis of Variance (ANOVA) models provide an efficient way to compare multiple groups. In a single factor ANOVA, • The Model F-test will test the equality of all group means at the same time. • If this test is significant, then our next goal is to identify specific differences. This is our big topic for this lesson.
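Review: Cell Means Model • Basic ANOVA Model is: $Y_{ij} = \mu_i + \varepsilon_{ij}$, where the errors $\varepsilon_{ij}$ are independent $N(0, \sigma^2)$ • Notation: • “i” subscript indicates the level of the factor • “j” subscript indicates observation number within the group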
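Review: Factor Effects Model • $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, where $\mu$ is an overall mean and $\tau_i$ is the effect of level i of the factor • Relationship to Cell Means: $\mu_i = \mu + \tau_i$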
Review: Notation • DOT indicates “sum” • BAR indicates “average” or “divide by cell/sample size” • $\bar{Y}_{..}$ is the mean for all observations • $\bar{Y}_{i.}$ is the mean for the observations in Level i of Factor A. • Sometimes we omit the “dots” for brevity, but the meaning is the same.
Review: Components of Variation • Variation between groups gets “explained” by allowing the groups to have different means. This variation contributes to MSR. • Variation within groups is unexplained, and contributes to MSE. • The ratio F = MSR / MSE forms the basis for testing the hypothesis that all group means are the same.
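Review: Components of Variation • Each deviation from the overall mean splits into two components: $Y_{ij} - \bar{Y}_{..} = (\bar{Y}_{i.} - \bar{Y}_{..}) + (Y_{ij} - \bar{Y}_{i.})$ • Of course the individual components would sum to zero, so we must square them. It turns out that all cross-product terms cancel, and we have: $\sum_i \sum_j (Y_{ij} - \bar{Y}_{..})^2 = \sum_i n_i (\bar{Y}_{i.} - \bar{Y}_{..})^2 + \sum_i \sum_j (Y_{ij} - \bar{Y}_{i.})^2$, where $n_i$ is the size of group i. The first term on the right is the variation BETWEEN GROUPS; the second is the variation WITHIN GROUPS.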
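The decomposition can be checked numerically. The following is a minimal sketch using made-up toy data (the group labels and values are hypothetical, not from the lecture); it computes the between-group, within-group, and total sums of squares and then the ratio F = MSR / MSE.

```python
# Toy data: three groups with hypothetical values (not from the lecture).
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 9.0, 8.0],
    "C": [1.0, 2.0, 3.0],
}

all_values = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_values) / len(all_values)
group_means = {g: sum(ys) / len(ys) for g, ys in groups.items()}

# Between groups: n_i * (group mean - grand mean)^2, summed over groups.
ss_between = sum(
    len(ys) * (group_means[g] - grand_mean) ** 2 for g, ys in groups.items()
)
# Within groups: (observation - its own group mean)^2, summed over everything.
ss_within = sum(
    (y - group_means[g]) ** 2 for g, ys in groups.items() for y in ys
)
# Total: (observation - grand mean)^2.
ss_total = sum((y - grand_mean) ** 2 for y in all_values)

a = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations
msr = ss_between / (a - 1)
mse = ss_within / (n_total - a)
f_stat = msr / mse

print(ss_between, ss_within, ss_total, f_stat)
```

With these toy numbers the between and within pieces add up to the total, which is the cancellation of cross-product terms in action.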
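Review: Model F Test • Null Hypothesis (Cell Means): $H_0: \mu_1 = \mu_2 = \cdots = \mu_a$ • Alternative Hypothesis: $H_a$: not all of the $\mu_i$ are equal • If we conclude the alternative, then it makes sense to try to determine specific differences. • For Factor Effects model: $H_0: \tau_1 = \tau_2 = \cdots = \tau_a = 0$, where $\tau_i$ is the effect of level i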
Further Comparisons The F-test is Significant... ...What Next?
Pairwise Comparisons • Generally our next step is that we want to find out more specifics about the actual differences between treatment groups. • Which groups are actually different? • We can compare two groups by looking at the difference between means.
Pairwise Comparisons (2) • Can rewrite the null hypothesis $H_0: \mu_i = \mu_k$ as $H_0: \mu_i - \mu_k = 0$ and so proceed to look at the difference between means. • Estimate the difference by $\bar{Y}_{i.} - \bar{Y}_{k.}$. (Note that to this point, it’s the same as a two-sample t test) • A critical value and standard error are all we need for a confidence interval.
Variance for Difference • Recall that the variance associated with the mean of any given sample is $\sigma^2 / n_i$. • So if we take the difference in means for two of our samples, the variance will be $\sigma^2 (1/n_i + 1/n_k)$ • Remember we have assumed equal variances across groups, but we don’t know $\sigma^2$.
SE for Difference in Means • Estimate $\sigma^2$ by the MSE and then take the square root in order to get the SE: $SE = \sqrt{MSE \, (1/n_i + 1/n_k)}$ • If the cell sizes happen to be equal (all $n_i = n$): $SE = \sqrt{2 \, MSE / n}$
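As a quick illustration of the standard error calculation, here is a small sketch. The function name `se_difference` and the MSE value are hypothetical choices made for this example only.

```python
import math


def se_difference(mse, n_i, n_k):
    """Standard error of the difference between two group means,
    using MSE as the estimate of the common variance."""
    return math.sqrt(mse * (1.0 / n_i + 1.0 / n_k))


mse = 4.0  # hypothetical MSE, for illustration only

# Equal cell sizes: the general formula agrees with sqrt(2 * MSE / n).
se_equal = se_difference(mse, 5, 5)
se_shortcut = math.sqrt(2 * mse / 5)

# Unequal cell sizes: the smaller group drives the SE up.
se_unequal = se_difference(mse, 5, 2)

print(se_equal, se_shortcut, se_unequal)
```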
Confidence Interval • So the confidence interval will be $(\bar{Y}_{i.} - \bar{Y}_{k.}) \pm t_{crit} \sqrt{MSE \, (1/n_i + 1/n_k)}$ • Is the use of a t critical value appropriate? • What critical value should be used?
Multiple Comparisons • We need to compare all of the treatment means. How many comparisons is this? • Suppose we decide to just look at the “largest” difference? Does this mean we don’t need to adjust for multiple comparisons?
Multiple Comparisons (2) • Doing a large number of pairwise comparisons has consequences: • Each test carries a 5% chance of a Type I error (showing a difference where in reality none exists). • The overall Type I error rate (chance of at least one Type I error) will be much larger than 5%. • Effectively, the testing procedure becomes biased in favor of rejecting at least one H0.
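To get a feel for how quickly the overall error rate grows, the sketch below counts the pairwise comparisons among a treatments and computes the chance of at least one Type I error if the tests were independent. That independence is an assumption made only for illustration: pairwise tests share means, so the true rate differs, but the trend is the same. The function names are hypothetical.

```python
from math import comb


def n_pairs(a):
    """Number of pairwise comparisons among a treatments: a(a - 1)/2."""
    return comb(a, 2)


def overall_rate_if_independent(alpha, m):
    """Chance of at least one Type I error in m tests at level alpha,
    ASSUMING the tests are independent (illustration only)."""
    return 1 - (1 - alpha) ** m


for a in (3, 4, 6, 8):
    m = n_pairs(a)
    rate = overall_rate_if_independent(0.05, m)
    print(f"{a} treatments, {m} pairs: overall rate about {rate:.3f}")
```

Under this simplification, four treatments (6 pairs) already give an overall rate above 25%.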
Valid Approaches • How do we adjust for this multiplicity issue? • Least Significant Differences Procedure (unadjusted!) – Relies on a significant F-test • Bonferroni Adjustment – turns out to be too conservative for all pairwise comparisons • Tukey Adjustment – best for all pairwise comparisons, usually best for our class because we usually will compare all pairs • Dunnett Adjustment – Appropriate for comparing each treatment to a control (fewer tests).
Least Significant Differences: No Adjustment • The LSD procedure goes as follows: • Verify that the model F test is significant to confirm the existence of differences. • Unadjusted differences are used (t tests). At a minimum, the means that are the furthest apart are presumed to be different. • Note: Our textbook mislabels this (what they call LSD is actually Bonferroni adjusted LSD).
LSD (2) • So if we use the LSD procedure, then we are NOT making any formal adjustment • Type I error IS inflated by the number of tests • Some things we can do: • Use strict requirements on the F test: use α=0.005 instead of α=0.05 • Additionally we could strengthen the requirements on the t-tests: using α=0.01 • Neither is a formal adjustment; the Type I error rate is uncontrolled
LSD – Why it Works • Some p-values are stronger than others. When the F-test is “very” significant, we can be more sure that some groups do have different means and LSD will find those • We are informally adjusting for multiplicity by “strengthening” our requirements for alpha. • Works great when we are exploring, maybe to be followed by a more rigorous study • Not too concerned about Type I errors
Example • Treatment means: Treatment 1 = 13, Treatment 2 = 27, Treatment 3 = 14, Treatment 4 = 24 • Overall F-test: significant, p < 0.001 • Pairwise tests (p-values): 1v2: <0.001, 1v3: 0.8721, 1v4: <0.001, 2v3: <0.001, 2v4: 0.0473, 3v4: <0.001
Example (2) • There are two clear groups here (1,3) and (2,4). Between these groups the differences are clear. • Because the p-value for 2v4 is so borderline, we should not consider these to be different.
Lines Plot (Example) • A convenient way to represent this information is via a “lines” plot. Treatments that share a letter are not considered different.

Treatment  Mean  Grouping
TRT 2      27    A
TRT 4      24    A
TRT 3      14    B
TRT 1      13    B
Lines Plot (2) • There can be overlapping groups. For example, we might wind up with something like:

Treatment  Mean  Grouping
TRT 2      27    A
TRT 4      24    A
TRT 5      19    A B
TRT 3      14    B
TRT 1      13    B
TRT 6       1    C
Bonferroni Adjustment • Still uses a t critical value, but we formally adjust our t-tests and use a Bonferroni t • With a treatments there are a(a – 1)/2 pairwise tests. Divide alpha by this number for the pairwise comparisons (can be expensive) • 6 treatments, 15 pairs: effective α = 0.05/15 = 0.00333 • 8 treatments, 28 pairs: effective α = 0.05/28 = 0.00179 • We are formally adjusting the t-critical value to avoid Type I error inflation.
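The per-comparison significance levels quoted above follow from a one-line calculation. This is a small sketch with hypothetical function names; the last line illustrates why a short list of planned comparisons is much cheaper than all pairs.

```python
from math import comb


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after a Bonferroni adjustment."""
    return alpha / n_tests


def all_pairs_alpha(alpha, a):
    """Per-test level when all a(a - 1)/2 pairs are compared."""
    return bonferroni_alpha(alpha, comb(a, 2))


print(round(all_pairs_alpha(0.05, 6), 5))   # 6 treatments, 15 pairs
print(round(all_pairs_alpha(0.05, 8), 5))   # 8 treatments, 28 pairs
print(round(bonferroni_alpha(0.05, 3), 5))  # only 3 planned comparisons
```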
Bonferroni (2) • The advantage here is that you don’t need to worry about the F-test. (It is possible that you can have significant T-tests without a significant F-test!) • Bonferroni works the best when: • you are only interested in a few of the comparisons (not all pairs are being compared, don’t have to break up α as much!) • you have planned your tests in advance (you know which ones you want to compare before the analysis)
Comparison LSD vs. Bonferroni • Control of the Type I Error Rate? • Power?
Tukey’s Method • Concept: The pairwise comparisons are dependent (they involve the same means). We can take advantage of that dependence to get more power than a Bonferroni adjustment (with the same alpha). • The change is in the critical value. Instead of a t-distribution, we use the studentized range distribution (Q) • Critical values in Table A-6 (similar to F-tables); to actually get a usable critical value “Q” we must divide q from the table by $\sqrt{2}$.
Tukey’s Method (2) • Our CI becomes: $(\bar{Y}_{i.} - \bar{Y}_{k.}) \pm \frac{q}{\sqrt{2}} \sqrt{MSE \, (1/n_i + 1/n_k)}$, where q is the studentized range critical value • This CI will be narrower than the Bonferroni intervals, but still wider than the LSD intervals since it does take care of the overall Type I error rate. • The Tukey method can only be used for pairwise comparisons of means • It also works better when cell sizes are equal! • It is best for all pairwise comparisons!
Tukey vs. Bonferroni • Remember the only thing that changes is the critical value! • Tukey is always better if you are doing ALL pairwise comparisons • If you only need a small number (planned in advance), Bonferroni can be superior • So by comparing the critical values you can see which method is advantageous (you’ll do this in the homework) • Bonferroni t vs. Tukey Q crit. values • The smaller critical value gives more power!
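Minimum Significant Differences • Because of the structure of the confidence interval, zero will be included in the interval if and only if the difference in means is less than: $\text{crit} \times \sqrt{MSE \, (1/n_i + 1/n_k)}$, where crit is the critical value for the comparison method in use • Or if the cell sizes are the same (all $n_i = n$): $\text{crit} \times \sqrt{2 \, MSE / n}$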
Minimum Significant Difference (2) • This is the half-width of the CI, and is called the minimum significant difference • Any two means that differ by a larger value will be considered statistically different. • Note that this value will generally be shown in the SAS output and it depends upon the comparison method in use.
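The sketch below computes the minimum significant difference for equal cell sizes under two methods. All numbers are assumptions chosen for illustration: 4 groups of 6 observations (20 error degrees of freedom), a hypothetical MSE of 4, and critical values of 2.086 (t) and 3.96 (studentized range q) as typically tabulated for that setting at α = 0.05. Substitute values from your own table or software.

```python
import math


def msd_equal_n(crit, mse, n):
    """Minimum significant difference (CI half-width) for equal cell sizes."""
    return crit * math.sqrt(2 * mse / n)


mse = 4.0  # hypothetical MSE
n = 6      # hypothetical common cell size (4 groups -> 20 error df)

t_crit = 2.086  # assumed unadjusted t critical value (alpha = 0.05, 20 df)
q_table = 3.96  # assumed studentized range value (4 groups, 20 df)

msd_lsd = msd_equal_n(t_crit, mse, n)
# Tukey: the table value q must be divided by sqrt(2) before use.
msd_tukey = msd_equal_n(q_table / math.sqrt(2), mse, n)

print(f"LSD MSD:   {msd_lsd:.3f}")
print(f"Tukey MSD: {msd_tukey:.3f}")
```

The Tukey value comes out larger than the unadjusted one, which is the price of controlling the overall Type I error rate.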
Example • Suppose that you have six treatment groups and the treatment means are: TRT 1: 52, TRT 2: 76, TRT 3: 58, TRT 4: 54, TRT 5: 83, TRT 6: 46 • Suppose we want to compare all 6 treatments; which adjustment is appropriate? ______ • From this adjustment, we calculate the Minimum Significant Difference as 10. Which groups are significantly different? • Construct a “LINES” plot
Example (2) • First sort the means (increasing or decreasing order):

Treatment  Mean  Grouping
TRT 5      83
TRT 2      76
TRT 3      58
TRT 4      54
TRT 1      52
TRT 6      46
Example (2, continued) • Now, starting at the top, form the first group (remember the Tukey-MSD is 10).

Treatment  Mean  Grouping
TRT 5      83    A
TRT 2      76    A
TRT 3      58    B
TRT 4      54
TRT 1      52
TRT 6      46
Example (3) • Continue down the table (algorithmically):

Treatment  Mean  Grouping
TRT 5      83    A
TRT 2      76    A
TRT 3      58    B
TRT 4      54    B C
TRT 1      52    B C
TRT 6      46    C
Example (4) • Notice that when a group ends, you simply drop down to the next group mean and start comparing again • It is not unusual at all to have some overlap between groups, so you may have to backward check groups above • Remember this process only works for cell sizes that are the same (or very similar). WHY?
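The grouping procedure walked through in this example can be sketched in a few lines. This is one possible implementation, not necessarily what SAS does internally; the function name `lines_plot` is hypothetical, and a single MSD is assumed to apply to every pair (equal cell sizes).

```python
def lines_plot(means, msd):
    """Return (name, mean, letters) rows, sorted by decreasing mean.

    Two means are placed in the same group when they differ by less
    than msd. Assumes one MSD applies to every pair (equal cell sizes).
    """
    ordered = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    letters = [""] * len(ordered)
    last_end = -1            # last row covered by the most recent group
    next_letter = ord("A")
    for start in range(len(ordered)):
        # Extend the group downward while the gap from `start` stays below msd.
        end = start
        while (end + 1 < len(ordered)
               and ordered[start][1] - ordered[end + 1][1] < msd):
            end += 1
        # Keep the group only if it reaches past the previous one;
        # otherwise it is already contained in a group above.
        if end > last_end:
            for row in range(start, end + 1):
                letters[row] += chr(next_letter)
            next_letter += 1
            last_end = end
    return [(name, mean, grp) for (name, mean), grp in zip(ordered, letters)]


means = {"TRT 1": 52, "TRT 2": 76, "TRT 3": 58,
         "TRT 4": 54, "TRT 5": 83, "TRT 6": 46}
for name, mean, grp in lines_plot(means, msd=10):
    print(f"{name}  {mean:>3}  {' '.join(grp)}")
```

Run on the six treatment means with an MSD of 10, this reproduces the A / A / B / B C / B C / C grouping from the table.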
Dunnett’s Method • Specifically designed for comparing each treatment to a control group! Based on another distribution (similar to Tukey) that reflects the dependence between these a-1 tests. • Like Tukey for “all pairwise comparisons”, Dunnett is the most powerful method for “treatment vs. control” comparisons. • Our book does not have these critical values, but it is easy to use Dunnett in SAS (and it will provide you with the minimum significant difference as well).
Example • Suppose in our previous example, treatment 6 was a control. We should have used Dunnett’s instead of Tukey. • We calculate the Dunnett MSD as 7

Treatment  Mean
TRT 5      83
TRT 2      76
TRT 3      58
TRT 4      54
TRT 1      52
CONTROL    46

• Which groups are now different?
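With a control group, only the treatment-versus-control differences are compared against the MSD. Here is a minimal sketch of that check, using the means and the MSD of 7 from this example; the function name is hypothetical.

```python
def differs_from_control(means, control, msd):
    """For each non-control group, report whether its mean differs
    from the control mean by more than the minimum significant difference."""
    control_mean = means[control]
    return {
        name: abs(mean - control_mean) > msd
        for name, mean in means.items()
        if name != control
    }


means = {"TRT 5": 83, "TRT 2": 76, "TRT 3": 58,
         "TRT 4": 54, "TRT 1": 52, "CONTROL": 46}

result = differs_from_control(means, "CONTROL", msd=7)
for name, different in result.items():
    gap = means[name] - means["CONTROL"]
    print(f"{name}: difference {gap:>2}, different from control: {different}")
```

Note that TRT 4, which shared a letter with TRT 6 under the Tukey MSD of 10, is separated from the control under the smaller Dunnett MSD.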
Summary: Pairwise Comps. • For pairwise comparison of treatments: • Dunnett is the most powerful if considering treatments versus control. • Tukey is the most powerful if considering ALL pairwise comparisons. • Bonferroni should only be used if you have a relatively small number of pre-planned comparisons of interest • LSD is appropriate for exploratory studies (to be followed up by a more well-planned study).
SAS Code & Output • MEANS statement is added to PROC GLM in order to compare levels for a variable listed in the CLASS statement.
Other Options / Formatting • BON – use Bonferroni instead of Tukey (will produce full output, but you should want only part of it, right?) • ALPHA = ??? changes your significance level • CLM calls for CI’s for the means (BON would apply) • CLDIFF calls for the CI’s for differences • DUNNETT <‘xxx’> uses Dunnett’s method where xxx is the name of the control group • DUNNETTU / DUNNETTL if you want one-sided comparisons (strictly better or worse than control)
Output (Tukey, Lines) • Blood Type Example
Different Sample Sizes • For illustration, we just delete one of the points from Type B. • Sample sizes are now 3, 2, 3, 3. • What will happen to the CI’s?
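One way to explore this question is to look at the factor $\sqrt{1/n_i + 1/n_k}$ that enters each standard error. The sketch below holds the MSE and the critical value fixed, which is a simplifying assumption (in practice both change when a point is deleted), and uses hypothetical group labels other than B.

```python
import math
from itertools import combinations

# Cell sizes after deleting one point from Type B.
# Labels other than "B" are hypothetical placeholders.
sizes = {"A": 3, "B": 2, "AB": 3, "O": 3}

balanced = math.sqrt(1 / 3 + 1 / 3)  # factor when both cells have n = 3

ratios = {}
for g1, g2 in combinations(sizes, 2):
    factor = math.sqrt(1 / sizes[g1] + 1 / sizes[g2])
    # Ratio > 1 means a wider interval than in the balanced case
    # (holding MSE and the critical value fixed).
    ratios[(g1, g2)] = factor / balanced
    print(f"{g1} vs {g2}: relative CI width {ratios[(g1, g2)]:.3f}")
```

Under these assumptions, only the intervals involving Type B widen, so a single minimum significant difference no longer applies to every pair.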