
# Topic 9 – Multiple Comparisons

Reading: 17.7-17.8


### Topic 9 – Multiple Comparisons

Multiple Comparisons of Treatment Means

Overview
• Brief Review of One-Way ANOVA
• Pairwise Comparisons of Treatment Means
• Multiplicity of Testing
• Linear Combinations & Contrasts of Treatment Means
Review: One-Way ANOVA
• Analysis of Variance (ANOVA) models provide an efficient way to compare multiple groups. In a single factor ANOVA,
• The Model F-test will test the equality of all group means at the same time.
• If this test is significant, then our next goal is to identify specific differences. This is our big topic for this lesson.
Review: Cell Means Model
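• Basic ANOVA Model is:

$$Y_{ij} = \mu_i + \varepsilon_{ij}$$

where the errors $\varepsilon_{ij}$ are independent $N(0, \sigma^2)$.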

• Notation:
• “i” subscript indicates the level of the factor
• “j” subscript indicates observation number within the group
Review: Factor Effects Model
• Factor Effects Model: $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$
• Relationship to Cell Means: $\mu_i = \mu + \tau_i$
Review: Notation
• DOT indicates “sum”
• BAR indicates “average” or “divide by cell/sample size”
• $\bar{Y}_{..}$ is the mean for all observations
• $\bar{Y}_{i.}$ is the mean for the observations in Level i of Factor A.
• Sometimes we omit the “dots” for brevity, but the meaning is the same.
Review: Components of Variation
• Variation between groups gets “explained” by allowing the groups to have different means. This variation contributes to MSR.
• Variation within groups is unexplained, and contributes to MSE.
• The ratio F = MSR / MSE forms the basis for testing the hypothesis that all group means are the same.
Review: Components of Variation
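• Of course the individual components would sum to zero, so we must square them. It turns out that all cross-product terms cancel, and we have:

$$\sum_{i}\sum_{j}\left(Y_{ij}-\bar{Y}_{..}\right)^2 = \underbrace{\sum_{i} n_i\left(\bar{Y}_{i.}-\bar{Y}_{..}\right)^2}_{\text{BETWEEN GROUPS}} + \underbrace{\sum_{i}\sum_{j}\left(Y_{ij}-\bar{Y}_{i.}\right)^2}_{\text{WITHIN GROUPS}}$$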
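The decomposition can be checked numerically. The following is a minimal sketch in plain Python, not the course's SAS workflow; the function name `anova_decomposition` and the data are made up for illustration.

```python
from statistics import mean

def anova_decomposition(groups):
    """Return (SSR, SSE, SSTO, F) for a one-way layout.

    `groups` is a list of lists: one inner list of observations per factor level.
    """
    all_obs = [y for g in groups for y in g]
    grand_mean = mean(all_obs)                    # mean of all observations
    group_means = [mean(g) for g in groups]       # mean within each level

    # Between-groups (explained) sum of squares
    ssr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-groups (unexplained) sum of squares
    sse = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    # Total sum of squares
    ssto = sum((y - grand_mean) ** 2 for y in all_obs)

    a = len(groups)           # number of factor levels
    n_total = len(all_obs)    # total number of observations
    msr = ssr / (a - 1)
    mse = sse / (n_total - a)
    return ssr, sse, ssto, msr / mse

# Hypothetical data: three treatments, three observations each
data = [[12, 14, 13], [26, 28, 27], [15, 13, 14]]
ssr, sse, ssto, f_stat = anova_decomposition(data)
print(ssr, sse, ssto, f_stat)
```

For these made-up data the between and within pieces add up to the total sum of squares, which is the cancellation of cross-product terms described above.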

Review: Model F Test
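• Null Hypothesis (Cell Means): $H_0: \mu_1 = \mu_2 = \cdots = \mu_a$
• Alternative Hypothesis: $H_a$: not all of the $\mu_i$ are equal
• If we conclude the alternative, then it makes sense to try to determine specific differences.
• For Factor Effects model: $H_0: \tau_1 = \tau_2 = \cdots = \tau_a = 0$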

### Further Comparisons

The F-test is Significant...

...What Next?

Pairwise Comparisons
• Generally our next step is that we want to find out more specifics about the actual differences between treatment groups.
• Which groups are actually different?
• We can compare two groups by looking at the difference between means.
Pairwise Comparisons (2)
• Can rewrite the null hypothesis $H_0: \mu_i = \mu_k$ as $H_0: \mu_i - \mu_k = 0$ and so proceed to look at the difference between means.
• Estimate the difference by $\bar{Y}_{i.} - \bar{Y}_{k.}$. (Note that to this point, it’s the same as a two-sample t test.)
• A critical value and standard error are all we need for a confidence interval.
Variance for Difference
• Recall that the variance of the mean of any given sample is $\sigma^2 / n_i$.
• So if we take the difference in means for two of our (independent) samples, the variance will be $\sigma^2 \left( \frac{1}{n_i} + \frac{1}{n_k} \right)$.
• Remember we have assumed equal population variances, but we don’t know $\sigma^2$.
SE for Difference in Means
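• Estimate $\sigma^2$ by the MSE and then take the square root in order to get the SE: $SE = \sqrt{MSE \left( \frac{1}{n_i} + \frac{1}{n_k} \right)}$
• If the cell sizes happen to be equal (all $n_i = n$): $SE = \sqrt{2\,MSE / n}$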
Confidence Interval
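• So the confidence interval will be $(\bar{Y}_{i.} - \bar{Y}_{k.}) \pm (\text{critical value}) \times \sqrt{MSE \left( \frac{1}{n_i} + \frac{1}{n_k} \right)}$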
• Is the use of a t critical value appropriate?
• What critical value should be used?
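The interval has the same shape whichever critical value is chosen. A minimal Python sketch follows; the function name `diff_confidence_interval` and all numbers are hypothetical, and `crit=2.0` is a placeholder rather than a value from any table.

```python
import math

def diff_confidence_interval(mean_i, mean_k, n_i, n_k, mse, crit):
    """CI for the difference of two treatment means.

    `crit` is supplied by the caller; its choice is exactly what the
    multiple-comparison methods differ on.
    """
    se = math.sqrt(mse * (1.0 / n_i + 1.0 / n_k))   # SE of the difference
    diff = mean_i - mean_k
    half_width = crit * se
    return diff - half_width, diff + half_width

# Hypothetical inputs: two means, 4 observations per group, MSE = 8
lower, upper = diff_confidence_interval(27, 24, 4, 4, mse=8.0, crit=2.0)
print(lower, upper)
```

Keeping the critical value as an argument mirrors the point made later in the slides: only the critical value changes from one method to the next.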
Multiple Comparisons
• We need to compare all of the treatment means. How many comparisons is this?
• Suppose we decide to just look at the “largest” difference? Does this mean we don’t need to adjust for multiple comparisons?
Multiple Comparisons (2)
• The fact that we are effectively doing a large number of pairwise comparisons means...
• Each test carries a 5% chance of making a Type I error (showing a difference where in reality none exists).
• The overall Type I error rate (chance of at least one Type I error) will be much larger than 5%
• Effectively, the testing procedure becomes biased in favor of rejecting at least one H0
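To get a feel for the size of the inflation, here is a rough Python sketch. It assumes the tests are independent, which pairwise comparisons are not (they share means), so the numbers are only an illustration; both function names are made up.

```python
def number_of_pairs(a):
    """Number of pairwise comparisons among a treatment means."""
    return a * (a - 1) // 2

def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one Type I error in m tests.

    Assumes the m tests are independent, which is only an approximation
    for pairwise comparisons of the same set of means.
    """
    return 1.0 - (1.0 - alpha) ** m

for a in (3, 4, 6, 8):
    m = number_of_pairs(a)
    print(a, m, round(familywise_error_rate(m), 3))
```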
Valid Approaches
• How do we adjust for this multiplicity issue?
• Least Significant Differences Procedure (unadjusted!) – Relies on a significant F-test
• Bonferroni Adjustment – turns out to be too conservative for all pairwise comparisons
• Tukey Adjustment – best for all pairwise comparisons, usually best for our class because we usually will compare all pairs
• Dunnett Adjustment – Appropriate for comparing each treatment to a control (fewer tests).
LSD
• The LSD procedure goes as follows:
• Verify that the model F test is significant to confirm the existence of differences.
• Unadjusted differences are used (t tests). At a minimum, the means that are the furthest apart are presumed to be different.
• Note: Our textbook mislabels this (what they call LSD is actually Bonferroni adjusted LSD).
LSD (2)
• So if we use the LSD procedure, then we are NOT making any formal adjustment
• Type I error IS inflated by the number of tests
• Some things we can do:
• Use strict requirements on the F test: use α=0.005 instead of α=0.05
• Additionally we could strengthen the requirements on the T-tests: using α=0.01
• Neither is a formal adjustment; the Type I error rate remains uncontrolled
LSD – Why it Works
• Some p-values are stronger than others. When the F-test is “very” significant, we can be more sure that some groups do have different means and LSD will find those
• We are informally adjusting for multiplicity by “strengthening” our requirements for alpha.
• Works great when we are exploring, maybe to be followed by a more rigorous study
• Not too concerned about Type I errors
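Example
• Treatment means:

| Treatment | Mean |
|---|---|
| Treatment 1 | 13 |
| Treatment 2 | 27 |
| Treatment 3 | 14 |
| Treatment 4 | 24 |

• Overall F-test: significant, p < 0.001
• Pairwise tests (p-values):

| Comparison | p-value |
|---|---|
| 1v2 | <0.001 |
| 1v3 | 0.8721 |
| 1v4 | <0.001 |
| 2v3 | <0.001 |
| 2v4 | 0.0473 |
| 3v4 | <0.001 |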
Example (2)
• There are two clear groups here (1,3) and (2,4). Between these groups the differences are clear.
• Because the p-value for 2v4 (0.0473) is so borderline and LSD makes no formal adjustment, we should not consider these two to be different.
Lines Plot (Example)
• A convenient way to represent this information is via a “lines” plot.

| Treatment | Mean | Grouping |
|---|---|---|
| TRT 2 | 27 | A |
| TRT 4 | 24 | A |
| TRT 3 | 14 | B |
| TRT 1 | 13 | B |

Lines Plot (2)
• There can be overlapping groups. For example, we might wind up with something like:

| Treatment | Mean | Grouping |
|---|---|---|
| TRT 2 | 27 | A |
| TRT 4 | 24 | A |
| TRT 5 | 19 | A B |
| TRT 3 | 14 | B |
| TRT 1 | 13 | B |
| TRT 6 | 1 | C |

Bonferroni
• Still uses a t critical value, but we formally adjust our t tests and use a Bonferroni t
• There are (a)(a – 1)/2 pairwise tests. Divide alpha by this number for the pairwise comparisons (can be expensive)
• 6 treatments, 15 pairs: effective α = 0.05/15 ≈ 0.00333
• 8 treatments, 28 pairs: effective α = 0.05/28 ≈ 0.00179
• We are formally adjusting the t-critical value to avoid Type I error inflation.
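The arithmetic behind the effective α is a single division. A small Python sketch, with a made-up function name, that takes the number of comparisons `g` so it also covers the case of a few planned tests:

```python
def per_comparison_alpha(g, alpha=0.05):
    """Bonferroni significance level for each of g comparisons."""
    return alpha / g

# All pairwise comparisons among a treatments: g = a(a - 1)/2
for a in (6, 8):
    g = a * (a - 1) // 2
    print(a, g, round(per_comparison_alpha(g), 5))

# Only three planned comparisons: alpha is split far less
print(round(per_comparison_alpha(3), 5))
```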
Bonferroni (2)
• The advantage here is that you don’t need to worry about the F-test. (It is possible that you can have significant T-tests without a significant F-test!)
• Bonferroni works the best when:
• you are only interested in a few of the comparisons (not all pairs are being compared, don’t have to break up α as much!)
• you have planned your tests in advance (you know which ones you want to compare before the analysis)
Comparison LSD vs. Bonferroni
• Control of the Type I Error Rate?
• Power?
Tukey’s Method
• Concept: The pairwise comparisons are dependent (they involve the same means). We can take advantage of that dependence to get more power than a Bonferroni adjustment (with the same alpha).
• The change is in the critical value. Instead of a T-distribution, we use the studentized range distribution (Q)
• Critical values in Table A-6 (similar to F-tables); to actually get a usable critical value “Q” we must divide q from the table by $\sqrt{2}$.
Tukey’s Method (2)
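• Our CI becomes: $(\bar{Y}_{i.} - \bar{Y}_{k.}) \pm Q \times \sqrt{MSE \left( \frac{1}{n_i} + \frac{1}{n_k} \right)}$, where $Q = q / \sqrt{2}$ and q is the studentized range critical value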
• This CI will be narrower than the Bonferroni intervals, but still wider than the LSD intervals since it does take care of the overall Type I error rate.
• The Tukey method can only be used for pairwise comparisons of means
• It also works better when cell sizes are equal!
• It is best for all pairwise comparisons!
Tukey vs. Bonferroni
• Remember the only thing that changes is the critical value!
• Tukey is always better if you are doing ALL pairwise comparisons
• If you only need a small number (planned in advance), Bonferroni can be superior
• So by comparing the critical values you can see which method is advantageous (you’ll do this in the homework)
• Bonferroni t vs. Tukey Q crit. values
• The smaller critical value gives more power!
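For reference, the two critical values being compared can be written side by side. This is a sketch in commonly used textbook notation, which is an assumption here: $g$ is the number of comparisons, $a$ the number of treatments, and $N$ the total sample size.

$$B = t\left(1 - \frac{\alpha}{2g};\; N - a\right) \qquad \text{versus} \qquad Q = \frac{1}{\sqrt{2}}\, q\left(1 - \alpha;\; a,\; N - a\right)$$

$B$ grows as $g$ grows, while $Q$ depends only on $a$ and the error degrees of freedom, which is why Bonferroni can win only when $g$ is small.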
Minimum Significant Differences
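• Because of the structure of the confidence interval, zero will be included in the interval if and only if the absolute difference in means is less than: $(\text{critical value}) \times \sqrt{MSE \left( \frac{1}{n_i} + \frac{1}{n_k} \right)}$
• Or if the cell sizes are the same (all equal to n): $(\text{critical value}) \times \sqrt{2\,MSE / n}$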
Minimum Significant Difference (2)
• This is the half-width of the CI, and is called the minimum significant difference
• Any two means that differ by a larger value will be considered statistically different.
• Note that this value will generally be shown in the SAS output and it depends upon the comparison method in use.
Example
• Suppose that you have six treatment groups and the treatment means are:

| Treatment | Mean |
|---|---|
| TRT 1 | 52 |
| TRT 2 | 76 |
| TRT 3 | 58 |
| TRT 4 | 54 |
| TRT 5 | 83 |
| TRT 6 | 46 |

• Suppose we want to compare all 6 treatments. Which adjustment is appropriate? ______
• From this adjustment, we calculate the Minimum Significant Difference as 10. Which groups are significantly different? Construct a “LINES” plot.
Example (2)
• First sort the means (increasing or decreasing order):

| Treatment | Mean | Grouping |
|---|---|---|
| TRT 5 | 83 | |
| TRT 2 | 76 | |
| TRT 3 | 58 | |
| TRT 4 | 54 | |
| TRT 1 | 52 | |
| TRT 6 | 46 | |

Example (2)
• Now, starting at the top, form the first group (remember the Tukey-MSD is 10).

| Treatment | Mean | Grouping |
|---|---|---|
| TRT 5 | 83 | A |
| TRT 2 | 76 | A |
| TRT 3 | 58 | B |
| TRT 4 | 54 | |
| TRT 1 | 52 | |
| TRT 6 | 46 | |

Example (3)
• Continue down the table (algorithmically):

| Treatment | Mean | Grouping |
|---|---|---|
| TRT 5 | 83 | A |
| TRT 2 | 76 | A |
| TRT 3 | 58 | B |
| TRT 4 | 54 | B C |
| TRT 1 | 52 | B C |
| TRT 6 | 46 | C |

Example (4)
• Notice that when a group ends, you simply drop down to the next group mean and start comparing again
• It is not unusual at all to have some overlap between groups, so you may have to backward check groups above
• Remember this process only works for cell sizes that are the same (or very similar). WHY?
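The walk down the table can be written as a short routine. This is one possible sketch in Python, not the algorithm SAS uses; the function name `lines_plot` is made up, and it assumes equal cell sizes so that a single MSD applies to every pair.

```python
def lines_plot(means, msd):
    """Assign grouping letters to treatment means.

    `means` maps treatment label -> sample mean. Two means share a letter
    when they fall in a run whose top and bottom differ by less than `msd`.
    """
    ordered = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    values = [v for _, v in ordered]

    runs = []        # (start, end) index ranges forming each group
    last_end = -1
    for start in range(len(values)):
        end = start
        # Extend the group while the next mean is within msd of the top one
        while end + 1 < len(values) and values[start] - values[end + 1] < msd:
            end += 1
        if end > last_end:   # skip runs already contained in an earlier group
            runs.append((start, end))
            last_end = end

    letters = {label: "" for label, _ in ordered}
    for idx, (start, end) in enumerate(runs):
        letter = chr(ord("A") + idx)
        for label, _ in ordered[start:end + 1]:
            letters[label] += letter
    return [(label, value, letters[label]) for label, value in ordered]

example = {"TRT 1": 52, "TRT 2": 76, "TRT 3": 58,
           "TRT 4": 54, "TRT 5": 83, "TRT 6": 46}
table = lines_plot(example, msd=10)
for row in table:
    print(row)
```

With the six means from the example and an MSD of 10, the routine reproduces the A, B, B C, C grouping shown in the table.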
Dunnett’s Method
• Specifically designed for comparing each treatment to a control group! Based on another distribution (similar to Tukey) that reflects the dependence between these a-1 tests.
• Like Tukey for “all pairwise comparisons”, Dunnett is the most powerful method for “treatment vs. control” comparisons.
• Our book does not have these critical values, but it is easy to use Dunnett in SAS (and it will provide you with the minimum significant difference as well).
Example
• Suppose in our previous example, treatment 6 was a control. We should have used Dunnett’s instead of Tukey.
• We calculate the Dunnett MSD as 7

| Treatment | Mean |
|---|---|
| TRT 5 | 83 |
| TRT 2 | 76 |
| TRT 3 | 58 |
| TRT 4 | 54 |
| TRT 1 | 52 |
| CONTROL | 46 |

• Which groups are now different?
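Once the MSD is given, answering the question is a matter of comparing each treatment with the control only. A minimal Python sketch follows; the function name `differs_from_control` is made up, and the MSD of 7 is taken from the slide, not computed from a Dunnett distribution.

```python
def differs_from_control(means, control, msd):
    """Flag each treatment whose mean differs from the control by more than msd."""
    control_mean = means[control]
    return {
        label: abs(m - control_mean) > msd
        for label, m in means.items()
        if label != control
    }

means = {"TRT 5": 83, "TRT 2": 76, "TRT 3": 58,
         "TRT 4": 54, "TRT 1": 52, "CONTROL": 46}
result = differs_from_control(means, control="CONTROL", msd=7)
for label, flag in result.items():
    print(label, flag)
```

Only a – 1 comparisons are made here, which is why the Dunnett MSD can be smaller than the Tukey MSD for the same data.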
Summary: Pairwise Comps.
• For pairwise comparison of treatments:
• Dunnett is the most powerful if considering treatments versus control.
• Tukey is the most powerful if considering ALL pairwise comparisons.
• Bonferroni should only be used if you have a relatively small number of pre-planned comparisons of interest
• LSD is appropriate for exploratory studies (to be followed up by a more well-planned study).
SAS Code & Output
• MEANS statement is added to PROC GLM in order to compare levels for a variable listed in the CLASS statement.
Other Options / Formatting
• BON – use Bonferroni instead of Tukey (will produce full output, but you should want only part of it, right?)
• ALPHA = ??? changes your significance level
• CLM calls for CI’s for the means (BON would apply)
• CLDIFF calls for the CI’s for differences
• DUNNETT <‘xxx’> uses Dunnett’s method where xxx is the name of the control group
• DUNNETTU / DUNNETTL if you want one-sided comparisons (strictly better or worse than control)
Output (Tukey, Lines)
• Blood Type Example
Different Sample Sizes
• For illustration, we just delete one of the points from Type B.
• Sample sizes are now 3, 2, 3, 3.
• What will happen to the CI’s?
Confidence Limits
• Confidence Limits involving Type B are of width 8.55. Those not involving Type B are of width 7.64. Why?

### Beyond Pairwise Comparisons

We may want to compare “groupings” of the means rather than individual means. This involves linear combinations and contrasts of means.

Linear Combination of Means
• A linear combination of means is the sum of means that have been multiplied by constants.
• Constants may be anything you like. Sometimes some of them will be zero.
• If the constants sum to zero – then we call the linear combination a contrast.
• You should note that any pairwise comparison is a contrast.
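Checking whether a set of constants defines a contrast is a one-line sum. A small Python sketch, with made-up function names:

```python
import math

def is_contrast(coefs):
    """A linear combination is a contrast when its constants sum to zero."""
    return math.isclose(sum(coefs), 0.0, abs_tol=1e-12)

def pairwise_coefs(a, i, k):
    """Constants for comparing mean i with mean k among a means (0-based)."""
    coefs = [0] * a
    coefs[i] = 1
    coefs[k] = -1
    return coefs

print(is_contrast(pairwise_coefs(4, 0, 2)))      # a pairwise comparison
print(is_contrast([0.5, 0.5, -0.5, -0.5]))       # two groups vs two groups
print(is_contrast([0.25, 0.25, 0.25, 0.25]))     # average of the means
```

The last case is a perfectly valid linear combination (the average of four means), but since its constants sum to one it is not a contrast.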
Linear Combination (2)
• Consider the fixed effects model
• It is not difficult to conduct a hypothesis test related to any linear combination of means that we choose, for example $H_0: \sum_i c_i \mu_i = 0$ versus $H_a: \sum_i c_i \mu_i \neq 0$
Linear Combinations (Example)
• Take one example:
• Let’s put it in “standard” form
• Do the constants sum to zero? What does this mean?
• Contrasts are “fair” comparisons
• Not all linear combinations are contrasts
Linear Combinations (Examples)
• Which of these are contrasts?
Construction of the t-test
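• Estimate the linear combination by $\hat{L} = \sum_i c_i \bar{Y}_{i.}$, with $SE(\hat{L}) = \sqrt{MSE \sum_i c_i^2 / n_i}$; the test statistic is $t = \hat{L} / SE(\hat{L})$
• Our statistic under H0 has a t distribution with N – a (error) degrees of freedom, where a is the number of groups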
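As a sketch of how the statistic is assembled, the Python below computes the estimate, its standard error, and the t ratio for testing that the linear combination equals zero. The function name, the cell sizes and the MSE are hypothetical; the four means are borrowed from the earlier four-treatment example.

```python
import math

def linear_combination_test(coefs, means, sizes, mse):
    """Return (estimate, standard error, t statistic) for sum(c_i * mu_i) = 0."""
    estimate = sum(c * m for c, m in zip(coefs, means))
    se = math.sqrt(mse * sum(c ** 2 / n for c, n in zip(coefs, sizes)))
    return estimate, se, estimate / se

# Compare the average of treatments 1 and 3 with the average of 2 and 4
coefs = [0.5, -0.5, 0.5, -0.5]
means = [13, 27, 14, 24]
sizes = [4, 4, 4, 4]          # hypothetical cell sizes
estimate, se, t_stat = linear_combination_test(coefs, means, sizes, mse=4.0)
print(estimate, se, t_stat)
```

The resulting ratio would then be compared with whichever critical value suits the situation (unadjusted t, Bonferroni, Scheffé, and so on).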
Why T test instead of overall F test?
• With T tests, you can address specific hypotheses that you are interested in rather than just testing the overall equality of means.
• A note on the F-test: The ANOVA F-test in reality jointly tests all possible contrasts
• It decreases the power that we would get if we only test those of interest to the experiment
• This is why on occasion individual T tests may test significant while the overall F test does not.
• The ANOVA F test just can’t look closely enough to see what is going on!
Multiplicity Issues
• Because we are often looking at multiple tests or confidence intervals, if we use standard t-critical values the Overall Type I Error Rate (as we’ve seen in the past) will not be well controlled.
• Another issue is that not all of the linear combinations you test will be independent.
• This actually turns out to be a good thing, because it is possible to take advantage of the dependencies in developing adjustments (e.g., Tukey or Dunnett).
Multiplicity Issues (2)
• Another issue of particular importance is data snooping. This is often done in an exploratory study where we want to search for differences. In this case, we’ll probably decide what to test after seeing the sample means.
• By doing this, we effectively perform all possible tests, and as we’ve discussed before the testing procedure becomes biased in favor of rejecting the null for at least one test.
Can we data snoop?
• It turns out that in some cases, we can “data snoop” in a fair and reasonable manner.
• We’ve already seen that the Tukey adjustment may be used to perform all pairwise comparisons (we sacrifice a bit of power for control of alpha).
• It is possible to expand this to all-possible-contrasts using a Scheffe adjustment.
Scheffé’s Method
• Scheffe’s method obtains a critical value S that may be used to set up simultaneous CI’s for all contrasts. Again, we sacrifice power for control of the significance level.
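• The critical value is based on the F distribution: $S^2 = (a - 1)\, F(1 - \alpha;\; a - 1,\; N - a)$
• CI is given by: $\hat{L} \pm S \times SE(\hat{L})$, where $\hat{L} = \sum_i c_i \bar{Y}_{i.}$ and $SE(\hat{L}) = \sqrt{MSE \sum_i c_i^2 / n_i}$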
Scheffé’s Method (2)
• Remember, to apply Scheffé you MUST have a contrast. That is, you must have: $\sum_i c_i = 0$
• Chosen when you have unplanned contrasts
• Also chosen AND recommended even for pairwise comparisons if you have vastly different cell sizes
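A Scheffé interval can be sketched in Python as below. The F critical value is passed in by the caller rather than computed; 3.49 is the familiar tabled upper 5% point for (3, 12) degrees of freedom. The function name, cell sizes and MSE are hypothetical.

```python
import math

def scheffe_interval(coefs, means, sizes, mse, f_crit):
    """Simultaneous CI for a contrast, using S = sqrt((a - 1) * F critical value)."""
    if not math.isclose(sum(coefs), 0.0, abs_tol=1e-12):
        raise ValueError("Scheffe's method requires a contrast (constants must sum to zero)")
    a = len(means)                                   # number of treatments
    estimate = sum(c * m for c, m in zip(coefs, means))
    se = math.sqrt(mse * sum(c ** 2 / n for c, n in zip(coefs, sizes)))
    s = math.sqrt((a - 1) * f_crit)                  # Scheffe critical value
    return estimate - s * se, estimate + s * se

# Four treatments, 4 observations each, so error d.f. = 16 - 4 = 12
lower, upper = scheffe_interval(
    coefs=[0.5, -0.5, 0.5, -0.5],
    means=[13, 27, 14, 24],
    sizes=[4, 4, 4, 4],
    mse=4.0,
    f_crit=3.49,
)
print(lower, upper)
```

Because S does not depend on how many contrasts are examined, the same multiplier covers any contrast chosen after looking at the data.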
Comparison Of Methods
• LSD Procedure will always have the most power (but won’t control Type I errors)
• Usually for exploratory studies to be followed by a more well planned experiment
• Bonferroni will be most powerful for a few pre-planned comparisons while controlling the Type I Error Rate
• Tukey will be the most powerful for all pairwise comparisons while controlling the Type I Error Rate
Comparison of Methods
• Dunnett will be the most powerful for comparing treatments to a control while controlling for the Type I error rate.
• Scheffé will usually be the least powerful! But it will control the Type I error rate for ALL CONTRASTS
• Allows data snooping!
• Also useful if cell sizes are vastly different
General Form of Test / CI
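• A confidence interval for any linear combination may be obtained by considering: $\hat{L} \pm (\text{critical value}) \times SE(\hat{L})$, where $\hat{L} = \sum_i c_i \bar{Y}_{i.}$ and $SE(\hat{L}) = \sqrt{MSE \sum_i c_i^2 / n_i}$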
• As long as we make an appropriate choice for the critical value, everything else is identical.
Contrasts in SAS
• Consider testing whether B/AB groups are the same as A/O groups in the blood type example.