
Multiple testing adjustments



  1. Multiple testing adjustments European Molecular Biology Laboratory Predoc Bioinformatics Course 17th Nov 2009 Tim Massingham, tim.massingham@ebi.ac.uk

  2. Motivation We have already come across several cases where we need to correct p-values, for example pairwise comparisons of gene expression data. What happens if we perform several vaccine trials?

  3. Motivation Ten new vaccines are trialled. Declare a vaccine a success if its test has a p-value of less than 0.05. If none of the vaccines work, what is our chance of a success?

  4. Motivation Ten new vaccines are trialled; declare a vaccine a success if its test has a p-value of less than 0.05. If none of the vaccines work, what is our chance of a “success”? Each trial has probability 0.05 of “success” (a false positive) and probability 0.95 of “failure” (a true negative).

    P(at least one false positive) = 1 − P(none) = 1 − 0.95^10 ≈ 0.40

  Rule of thumb: multiply the size of the test by the number of tests.
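  A minimal R sketch of the calculation above (the function name fwer is ours, for illustration):

    # Chance of at least one false positive when all n nulls are true
    # and each test is performed at size alpha
    fwer <- function(n, alpha = 0.05) 1 - (1 - alpha)^n
    fwer(10)       # 0.4012631, close to the rule-of-thumb bound 10 * 0.05 = 0.5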

  5. Motivation A more extreme example: test an entire population for a disease. The population is a mixture: some individuals have the disease, some don’t, and we want to find the ones who do. Cross-tabulating the test report (healthy / diseased) against the true status (healthy / diseased) gives a 2 × 2 table of outcomes.

  Family-Wise Error Rate: control the probability that any false positive occurs.
  False Discovery Rate: control the proportion of false positives among the discoveries,

    FDR = # false positives / # positives = # false positives / (# true positives + # false positives)
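  A sketch of the arithmetic, with counts invented purely for illustration:

    # FDR from hypothetical counts in the 2 x 2 table above
    false_pos <- 38                        # healthy, but reported diseased
    true_pos  <- 721                       # diseased, correctly reported
    false_pos / (true_pos + false_pos)     # FDR = 38 / 759, about 0.05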

  6. Cumulative distribution Simple examination by eye: the cumulative distribution of the p-values should be approximately linear. Rank the data, then plot rank against p-value. The resulting curve starts at (0, 1), ends at (1, n), and never decreases. N.B. Ranks are often scaled to (0, 1] by dividing by the largest rank.
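  In R, the examination by eye might look like this (simulated uniform p-values stand in for real data):

    # Rank the p-values and plot scaled rank against p-value;
    # under the null the points should hug the diagonal
    pvals <- runif(910)                    # simulated null p-values
    n <- length(pvals)
    plot(sort(pvals), (1:n) / n, xlab = "p-value", ylab = "scaled rank",
         xlim = c(0, 1), ylim = c(0, 1))
    abline(0, 1, lty = 2)                  # the line expected under uniformity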

  7. Cumulative distribution [Figure: cumulative-distribution plots for 910 p-values — five sets of uniformly distributed p-values, and one set of non-uniformly distributed data showing an excess of extreme (small) p-values.] A one-sided Kolmogorov test could be used if desired.
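  The one-sided Kolmogorov test mentioned above can be done with R’s built-in ks.test; alternative = "greater" looks for an excess of small p-values (pvals as in the sketch above):

    # Test the p-values against Uniform(0, 1)
    ks.test(pvals, "punif", alternative = "greater")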

  8. A little set theory Represent all possible outcomes of three tests in a Venn diagram; areas are probabilities of events happening. [Figure: three overlapping circles, one per test’s false-positive event; the region outside all circles is “no test gives a false positive”, the central overlap is “all tests give a false positive”.]

  9. A little set theory The area covered by the union of the circles is never more than the sum of their individual areas:

    P(any test gives a false positive) = P(FP1 ∪ FP2 ∪ FP3) ≤ P(FP1) + P(FP2) + P(FP3)

  10. A little set theory The same inequality pictorially: the overlaps are counted more than once in the sum of the individual areas, so the right-hand side over-states the probability unless the false-positive events are disjoint.

  11. Bonferroni adjustment We want to control the probability of any false positive, and we know how to control the size of each individual test. Keep things simple: do all tests at the same size. If we have n tests, each at size α/n, then

    P(any false positive) ≤ n × (α/n) = α

  12. Bonferroni adjustment If we have n tests, each at size α/n, then the Family-Wise Error Rate is at most α. Bonferroni adjustment (correction): for a FWER of less than α, perform all tests at size α/n. Equivalently: multiply the p-values of all tests by n (to a maximum of 1) to give adjusted p-values.
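  The adjustment is one line of R, and p.adjust does the same thing (pvals is any vector of raw p-values, e.g. from the sketch above):

    # Multiply by n and cap at 1
    adj <- pmin(1, length(pvals) * pvals)
    all.equal(adj, p.adjust(pvals, method = "bonferroni"))   # TRUE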

  13. Example 1 Look at deviations from Chargaff’s 2nd parity rule: the A and T content of genomes for 910 bugs. Many show significant deviations.

    First 9 unadjusted p-values:
    3.581291e-66 3.072432e-12 1.675474e-01 6.687600e-01 1.272040e-05
    1.493775e-23 2.009552e-26 1.024890e-14 1.519195e-24

    Unadjusted:   p-value < 0.05: 764   p-value < 0.01: 717   p-value < 1e-5: 559

    First 9 Bonferroni-adjusted p-values:
    3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02
    1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21

    Bonferroni-adjusted:   p-value < 0.05: 582   p-value < 0.01: 560   p-value < 1e-5: 461

  14. Aside: p-values measure evidence We have shown that many bugs deviate substantially from Chargaff’s 2nd rule. The p-values tell us there is significant evidence for a deviation: with lots of bases, a powerful test can detect even small deviations from 50%. The observed proportions are in fact very close to 50%:

    1st Qu. 0.4989   Median 0.4999   3rd Qu. 0.5012

  15. Bonferroni is conservative Conservative: the actual size of the test is less than the bound. • Not too bad for independent tests • Worst when tests are positively correlated, e.g. applying the same test to subsets of the data, or applying similar tests to the same data. There is a more subtle problem: picture a mixture of blue and red circles, with the null hypothesis “is blue”. The red circles are never false positives.
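  A small simulation sketch of the correlation point above (the correlation of 0.5 and the number of replicates are arbitrary choices):

    # Estimate the actual FWER of Bonferroni when the ten test
    # statistics are positively correlated (pairwise correlation 0.5)
    set.seed(1)
    n_tests <- 10; alpha <- 0.05
    any_fp <- replicate(10000, {
      shared <- rnorm(1)
      z <- sqrt(0.5) * (shared + rnorm(n_tests))   # marginally N(0, 1)
      p <- 2 * pnorm(-abs(z))                      # two-sided p-values
      any(p < alpha / n_tests)                     # any Bonferroni rejection?
    })
    mean(any_fp)    # below the nominal level 0.05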

  16. Bonferroni is conservative The number of potential false positives may be less than the number of tests: an experiment that really is different from the null can never contribute a false positive. If only n0 < n of the nulls are true, then

    P(any false positive) ≤ n0 × (α/n) < α

  so the Bonferroni-adjusted p-value is over-adjusted.

  17. Holm’s method Holm (1979) suggests repeatedly applying Bonferroni.

    Initial Bonferroni: split the tests into significant and insignificant sets.
    No false positive? We have been overly strict, so apply Bonferroni to the insignificant set only.
    A false positive? More won’t hurt, so we may as well test again.
    Step 2: re-split into significant and insignificant sets. Step 3: repeat.
    Stop when the “insignificant” set does not shrink any further.
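  The whole procedure collapses to a step-down formula; here is a sketch that matches R’s built-in (the name holm_adjust is ours):

    # Holm: multiply the i-th smallest p-value by (n - i + 1),
    # enforce monotonicity with cummax, and cap at 1
    holm_adjust <- function(p) {
      n <- length(p)
      o <- order(p)
      adj <- pmin(1, cummax((n - seq_len(n) + 1) * p[o]))
      adj[order(o)]                        # back to the original order
    }
    all.equal(holm_adjust(pvals), p.adjust(pvals, method = "holm"))   # TRUE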

  18. Example 2 Return to the Chargaff data: 910 bugs, and more than half are significantly different after adjustment. There is strong evidence that we have over-corrected.

    First 9 Bonferroni-adjusted p-values:
    3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02
    1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21

    Bonferroni:   p-value < 0.05: 582   p-value < 0.01: 560   p-value < 1e-5: 461

    First 9 Holm-adjusted p-values:
    2.915171e-63 1.591520e-09 1.000000e+00 1.000000e+00 4.452139e-03
    9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

    Holm:   p-value < 0.05: 606 (+24)   p-value < 0.01: 574 (+14)   p-value < 1e-5: 472 (+12)

  We gained a couple of percent more, but notice that the gains tail off.

  19. Hochberg’s method Consider a pathological case: apply the same test to the same data multiple times.

    # Ten identical p-values
    pvalues <- rep(0.01, 10)
    # None are significant with Bonferroni
    p.adjust(pvalues, method = "bonferroni")
    0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
    # None are significant with Holm
    p.adjust(pvalues, method = "holm")
    0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
    # Hochberg recovers the correctly adjusted p-values
    p.adjust(pvalues, method = "hochberg")
    0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01

  The Hochberg adjustment is identical to Holm for the Chargaff data … but requires additional assumptions.

    First 9 Hochberg-adjusted p-values:
    2.915171e-63 1.591520e-09 9.972469e-01 9.972469e-01 4.452139e-03
    9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

    Hochberg:   p-value < 0.05: 606   p-value < 0.01: 574   p-value < 1e-5: 472

  20. False Discovery Rates Newer methods, dating back to 1995. Gaining popularity in the literature, but mainly used for large data sets; useful for enriching data sets for further analysis.

  Recap — FWER: control the probability of any false positive occurring. FDR: control the proportion of false positives that occur. The “q-value” is the proportion of significant tests expected to be false positives, so q-value × number significant = expected number of false positives.

  Methods: Benjamini & Hochberg (1995); Benjamini & Yekutieli (2001); Storey (2002, 2003), a.k.a. the “positive false discovery rate”.
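  In R, Benjamini & Hochberg (1995) is the "BH" method of p.adjust (Storey’s q-values need a separate package, e.g. qvalue on Bioconductor):

    # BH-adjusted values play the role of q-values
    qvals <- p.adjust(pvals, method = "BH")
    sum(qvals < 0.05)                      # number significant at FDR 5%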

  21. Example 3 Returning once more to the Chargaff data.

    First 9 FDR q-values:
    3.359768e-65 7.114283e-12 1.891664e-01 6.931340e-01 2.063380e-05
    5.481191e-23 8.350193e-26 2.569283e-14 5.760281e-24

    FDR q-values:   q-value < 0.05: 759   q-value < 0.01: 713   q-value < 1e-5: 547

  Q-values have a different interpretation from p-values: use them to get the expected number of false positives.

    q-value < 0.05: expect 38 false positives (759 × 0.05)
    q-value < 0.01: expect 7 false positives (713 × 0.01)
    q-value < 1e-5: expect about 1/200 of a false positive (547 × 1e-5)
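  The same arithmetic in R, assuming qvals was produced as in the previous sketch:

    # Expected number of false positives at each q-value cut-off
    for (cut in c(0.05, 0.01, 1e-5)) {
      n_sig <- sum(qvals < cut)
      cat(cut, ":", n_sig, "significant, ~", n_sig * cut, "expected FPs\n")
    }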

  22. Summary • Holm is always at least as powerful as Bonferroni • Hochberg can be better still, but requires additional assumptions • FDR is a more powerful approach: it finds more things significant, controls a different criterion, and is more useful for exploratory analyses than for publications.

  A little question: suppose results are published if the p-value is less than 0.01. What proportion of the scientific literature is wrong?
