## Topic 9 – Multiple Comparisons


Multiple Comparisons of Treatment Means

Reading: 17.7-17.8

**Overview**

- Brief Review of One-Way ANOVA
- Pairwise Comparisons of Treatment Means
- Multiplicity of Testing
- Linear Combinations & Contrasts of Treatment Means

**Review: One-Way ANOVA**

- Analysis of Variance (ANOVA) models provide an efficient way to compare multiple groups.
- In a single-factor ANOVA:
  - The model F-test tests the equality of all group means at the same time.
  - If this test is significant, then our next goal is to identify specific differences. This is our big topic for this lesson.

**Review: Cell Means Model**

- Basic ANOVA model is: $Y_{ij} = \mu_i + \varepsilon_{ij}$, where $\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$
- Notation:
  - “i” subscript indicates the level of the factor
  - “j” subscript indicates observation number within the group

**Review: Factor Effects Model**

- Relationship to cell means: $\mu_i = \mu + \tau_i$, so that $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$

**Review: Notation**

- DOT indicates “sum”
- BAR indicates “average” or “divide by cell/sample size”
- $\bar{Y}_{\cdot\cdot}$ is the mean for all observations
- $\bar{Y}_{i\cdot}$ is the mean for the observations in Level i of Factor A.
- Sometimes we omit the “dots” for brevity, but the meaning is the same.

**Review: Components of Variation**

- Variation between groups gets “explained” by allowing the groups to have different means. This variation contributes to MSR.
- Variation within groups is unexplained, and contributes to MSE.
- The ratio F = MSR / MSE forms the basis for testing the hypothesis that all group means are the same.

**Review: Components of Variation (2)**

- Of course the individual components would sum to zero, so we must square them. It turns out that all cross-product terms cancel, and we have:

$$\sum_{i}\sum_{j}\left(Y_{ij}-\bar{Y}_{\cdot\cdot}\right)^2 = \underbrace{\sum_{i} n_i\left(\bar{Y}_{i\cdot}-\bar{Y}_{\cdot\cdot}\right)^2}_{\text{between groups}} + \underbrace{\sum_{i}\sum_{j}\left(Y_{ij}-\bar{Y}_{i\cdot}\right)^2}_{\text{within groups}}$$

**Review: Model F Test**

- Null hypothesis (cell means): $H_0: \mu_1 = \mu_2 = \cdots = \mu_a$
- Alternative hypothesis: $H_a$: not all of the $\mu_i$ are equal
- If we conclude the alternative, then it makes sense to try to determine specific differences.
- For the factor effects model: $H_0: \tau_1 = \tau_2 = \cdots = \tau_a = 0$

**Further Comparisons**

The F-test is Significant...
...What Next?

**Pairwise Comparisons**

- Generally our next step is to find out more specifics about the actual differences between treatment groups.
- Which groups are actually different?
- We can compare two groups by looking at the difference between means.

**Pairwise Comparisons (2)**

- Can rewrite the null hypothesis $H_0: \mu_i = \mu_j$ as $H_0: \mu_i - \mu_j = 0$ and so proceed to look at the difference between means.
- Estimate the difference by $\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}$. (Note that to this point, it’s the same as a two-sample t-test.)
- A critical value and standard error are all we need for a confidence interval.

**Variance for Difference**

- Recall that the variance associated with the mean of any given sample is $\sigma^2 / n_i$.
- So if we take the difference in means for two of our samples, the variance will be $\sigma^2\left(\frac{1}{n_i} + \frac{1}{n_j}\right)$.
- Remember we have assumed equal variances across groups, but we don’t know $\sigma^2$.

**SE for Difference in Means**

- Estimate $\sigma^2$ by the MSE and then take the square root in order to get the SE: $SE = \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$
- If the cell sizes happen to be equal (all $n_i = n$): $SE = \sqrt{\frac{2\,MSE}{n}}$

**Confidence Interval**

- So the confidence interval will be $\left(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}\right) \pm t_{\text{crit}} \times SE$
- Is the use of a t critical value appropriate?
- What critical value should be used?

**Multiple Comparisons**

- We need to compare all of the treatment means. How many comparisons is this?
- Suppose we decide to just look at the “largest” difference. Does this mean we don’t need to adjust for multiple comparisons?

**Multiple Comparisons (2)**

- The fact that we are effectively doing a large number of pairwise comparisons means...
  - Each test carries a 5% chance of making a Type I error (showing a difference where in reality none exists).
  - The overall Type I error rate (chance of at least one Type I error) will be much larger than 5%.
  - Effectively, the testing procedure becomes biased in favor of rejecting at least one $H_0$.

**Valid Approaches**

- How do we adjust for this multiplicity issue?
- Least Significant Differences Procedure (unadjusted!)
– Relies on a significant F-test
- Bonferroni Adjustment – turns out to be too conservative for all pairwise comparisons
- Tukey Adjustment – best for all pairwise comparisons, usually best for our class because we usually will compare all pairs
- Dunnett Adjustment – appropriate for comparing each treatment to a control (fewer tests)

**Least Significant Differences: No adjustment**

- The LSD procedure goes as follows:
  - Verify that the model F-test is significant to confirm the existence of differences.
  - Unadjusted differences are used (t-tests). At a minimum, the means that are the furthest apart are presumed to be different.
- Note: Our textbook mislabels this (what they call LSD is actually Bonferroni adjusted LSD).

**LSD (2)**

- So if we use the LSD procedure, then we are NOT making any formal adjustment.
- Type I error IS inflated by the number of tests.
- Some things we can do:
  - Use strict requirements on the F-test: use α=0.005 instead of α=0.05
  - Additionally we could strengthen the requirements on the t-tests: using α=0.01
- Neither is a formal adjustment; the Type I error rate remains uncontrolled.

**LSD – Why it Works**

- Some p-values are stronger than others. When the F-test is “very” significant, we can be more sure that some groups do have different means, and LSD will find those.
- We are informally adjusting for multiplicity by “strengthening” our requirements for alpha.
- Works great when we are exploring, maybe to be followed by a more rigorous study.
- Not too concerned about Type I errors.

**Example**

Means:

| Treatment | Mean |
|-----------|------|
| 1 | 13 |
| 2 | 27 |
| 3 | 14 |
| 4 | 24 |

Overall F-test: significant, p < 0.001

Pairwise tests:

| Comparison | p-value |
|------------|---------|
| 1v2 | <0.001 |
| 1v3 | 0.8721 |
| 1v4 | <0.001 |
| 2v3 | <0.001 |
| 2v4 | 0.0473 |
| 3v4 | <0.001 |

**Example (2)**

- There are two clear groups here (1,3) and (2,4). Between these groups the differences are clear.
- Because the p-value for 2v4 is so borderline, we should not consider these to be different.

**Lines Plot (Example)**

- A convenient way to represent this information is via a “lines” plot.

| Treatment | Mean | Grouping |
|-----------|------|----------|
| TRT 2 | 27 | A |
| TRT 4 | 24 | A |
| TRT 3 | 14 | B |
| TRT 1 | 13 | B |

**Lines Plot (2)**

- There can be overlapping groups. For example, we might wind up with something like:

| Treatment | Mean | Grouping |
|-----------|------|----------|
| TRT 2 | 27 | A |
| TRT 4 | 24 | A |
| TRT 5 | 19 | A B |
| TRT 3 | 14 | B |
| TRT 1 | 13 | B |
| TRT 6 | 1 | C |

**Bonferroni Adjustment**

- Still uses a t critical value, but we formally adjust our t-tests and use a Bonferroni t.
- There are a(a – 1)/2 pairwise tests. Divide alpha by this number for the pairwise comparisons (can be expensive).
  - 6 treatments, 15 pairs: effective α = 0.05/15 ≈ 0.00333
  - 8 treatments, 28 pairs: effective α = 0.05/28 ≈ 0.00179
- We are formally adjusting the t critical value to avoid Type I error inflation.

**Bonferroni (2)**

- The advantage here is that you don’t need to worry about the F-test. (It is possible to have significant t-tests without a significant F-test!)
- Bonferroni works best when:
  - you are only interested in a few of the comparisons (not all pairs are being compared, so you don’t have to break up α as much!)
  - you have planned your tests in advance (you know which ones you want to compare before the analysis)

**Comparison LSD vs. Bonferroni**

- Control of the Type I Error Rate?
- Power?

**Tukey’s Method**

- Concept: The pairwise comparisons are dependent (they involve the same means). We can take advantage of that dependence to get more power than a Bonferroni adjustment (with the same alpha).
- The change is in the critical value.
Instead of a t distribution, we use the studentized range distribution (Q).
- Critical values are in Table A-6 (similar to F-tables); to actually get a usable critical value “Q” we must divide q from the table by $\sqrt{2}$.

**Tukey’s Method (2)**

- Our CI becomes: $\left(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}\right) \pm Q \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$, where $Q = q/\sqrt{2}$
- This CI will be narrower than the Bonferroni intervals, but still wider than the LSD intervals since it does take care of the overall Type I error rate.
- The Tukey method can only be used for pairwise comparisons of means.
- It also works better when cell sizes are equal!
- It is best for all pairwise comparisons!

**Tukey vs. Bonferroni**

- Remember the only thing that changes is the critical value!
- Tukey is always better if you are doing ALL pairwise comparisons.
- If you only need a small number (planned in advance), Bonferroni can be superior.
- So by comparing the critical values you can see which method is advantageous (you’ll do this in the homework).
- Bonferroni t vs. Tukey Q critical values
- The smaller critical value gives more power!

**Minimum Significant Differences**

- Because of the structure of the confidence interval, zero will be included in the interval if and only if the absolute difference in means is less than: $\text{crit} \times \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$
- Or if the cell sizes are the same: $\text{crit} \times \sqrt{\frac{2\,MSE}{n}}$

**Minimum Significant Difference (2)**

- This is the half-width of the CI, and is called the minimum significant difference.
- Any two means that differ by a larger value will be considered statistically different.
- Note that this value will generally be shown in the SAS output and it depends upon the comparison method in use.

**Example**

Suppose that you have six treatment groups and the treatment means are:

| Treatment | Mean |
|-----------|------|
| TRT 1 | 52 |
| TRT 2 | 76 |
| TRT 3 | 58 |
| TRT 4 | 54 |
| TRT 5 | 83 |
| TRT 6 | 46 |

- Suppose we want to compare all 6 treatments. Which adjustment is appropriate? ______
- From this adjustment, we calculate the Minimum Significant Difference as 10. Which groups are significantly different?
- Construct a “LINES” plot.

**Example (2)**

- First sort the means (increasing or decreasing order):

| Treatment | Mean | Grouping |
|-----------|------|----------|
| TRT 5 | 83 | |
| TRT 2 | 76 | |
| TRT 3 | 58 | |
| TRT 4 | 54 | |
| TRT 1 | 52 | |
| TRT 6 | 46 | |

**Example (3)**

- Now, starting at the top, form the first group (remember the Tukey MSD is 10).

| Treatment | Mean | Grouping |
|-----------|------|----------|
| TRT 5 | 83 | A |
| TRT 2 | 76 | A |
| TRT 3 | 58 | B |
| TRT 4 | 54 | |
| TRT 1 | 52 | |
| TRT 6 | 46 | |

**Example (4)**

- Continue down the table (algorithmically):

| Treatment | Mean | Grouping |
|-----------|------|----------|
| TRT 5 | 83 | A |
| TRT 2 | 76 | A |
| TRT 3 | 58 | B |
| TRT 4 | 54 | B C |
| TRT 1 | 52 | B C |
| TRT 6 | 46 | C |

**Example (5)**

- Notice that when a group ends, you simply drop down to the next group mean and start comparing again.
- It is not unusual at all to have some overlap between groups, so you may have to check backward against the groups above.
- Remember this process only works for cell sizes that are the same (or very similar). WHY?

**Dunnett’s Method**

- Specifically designed for comparing each treatment to a control group! Based on another distribution (similar to Tukey) that reflects the dependence between these $a - 1$ tests.
- Like Tukey for “all pairwise comparisons”, Dunnett is the most powerful method for “treatment vs. control” comparisons.
- Our book does not have these critical values, but it is easy to use Dunnett in SAS (and it will provide you with the minimum significant difference as well).

**Example**

- Suppose in our previous example, treatment 6 was a control. We should have used Dunnett’s instead of Tukey.
- We calculate the Dunnett MSD as 7.

| Treatment | Mean |
|-----------|------|
| TRT 5 | 83 |
| TRT 2 | 76 |
| TRT 3 | 58 |
| TRT 4 | 54 |
| TRT 1 | 52 |
| CONTROL | 46 |

- Which groups are now different?

**Summary: Pairwise Comps.**

- For pairwise comparison of treatments:
  - Dunnett is the most powerful if considering treatments versus control.
  - Tukey is the most powerful if considering ALL pairwise comparisons.
  - Bonferroni should only be used if you have a relatively small number of pre-planned comparisons of interest.
  - LSD is appropriate for exploratory studies (to be followed up by a more well-planned study).

**SAS Code & Output**

- MEANS statement is added to PROC GLM in order to compare levels for a variable listed in the CLASS statement.

**Other Options / Formatting**

- BON – use Bonferroni instead of Tukey (will produce full output, but you should want only part of it, right?)
- ALPHA = ??? changes your significance level
- CLM calls for CIs for the means (BON would apply)
- CLDIFF calls for the CIs for differences
- DUNNETT <‘xxx’> uses Dunnett’s method where xxx is the name of the control group
- DUNNETTU / DUNNETTL if you want one-sided comparisons (strictly better or worse than control)

**Output (Tukey, Lines)**

- Blood Type Example

**Different Sample Sizes**

- For illustration, we just delete one of the points from Type B.
- Sample sizes are now 3, 2, 3, 3.
- What will happen to the CIs?
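
One way to explore the last question is the minimal Python sketch below. It is not part of the original slides. It applies the standard error formula for a difference in means, the square root of MSE(1/n_i + 1/n_j), to the sample sizes 3, 2, 3, 3. The MSE value of 1.0 is hypothetical, the group numbering is an assumption (group 2 is taken to be the reduced Type B group), and the helper name `se_difference` is made up for illustration.

```python
from itertools import combinations
from math import sqrt


def se_difference(mse, n_i, n_j):
    """Standard error of the difference between two treatment means."""
    return sqrt(mse * (1.0 / n_i + 1.0 / n_j))


# Sample sizes from the slide. Assumption: group 2 is the reduced Type B group.
sizes = {1: 3, 2: 2, 3: 3, 4: 3}

# Hypothetical MSE, chosen only so the numbers are easy to read.
mse = 1.0

# One standard error per pair of groups.
se_by_pair = {
    (i, j): se_difference(mse, sizes[i], sizes[j])
    for i, j in combinations(sorted(sizes), 2)
}

for (i, j), se in se_by_pair.items():
    print(f"group {i} vs group {j}: SE = {se:.4f}")

# SE for a pair involving the smaller group, relative to a 3-vs-3 pair.
ratio = se_by_pair[(1, 2)] / se_by_pair[(1, 3)]
print(f"ratio = {ratio:.4f}")
```

For a fixed MSE and critical value, the half-width of a CI scales with this SE, so intervals involving the smaller group come out wider than the 3-vs-3 intervals. In a real analysis, deleting a point also changes the MSE and the error degrees of freedom, so every interval shifts somewhat. Because the half-width now depends on the pair, there is no single minimum significant difference, which is one reason the lines-plot procedure relies on equal (or very similar) cell sizes.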