Aesthetics and power in multiple testing – a contradiction?

MCP 2007, Vienna
Gerhard Hommel

Introduction: Economics and Statistics
Economics: profit is not everything
• Ethical / social component
• Competing interests
• Aesthetics: protection of the environment, industrial art, patronage
Statistics: power is not everything
• Ethics: decisions are logical, conceivable, simple
• Competing interests
• Aesthetics: “beauty of mathematics” (subjective), but also the same points as for ethics

Examples for (non-)aesthetics (SD = step-down, SU = step-up):
• Closure test
  + : the principle is simple to describe
  + : coherence is obtained directly
  – : often very cumbersome to perform
• Bonferroni-Holm: SD(α/n, α/(n-1), … , α/2, α)
• Hochberg: SU(α/n, α/(n-1), … , α/2, α)
• FDP, e.g. control of P(FDP > 0.2): SD(α/n, α/(n-1), α/(n-2), α/(n-3), 2α/(n-3), 2α/(n-4), … , 3α/(n-7), …)
  not beautiful (and not powerful)!
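
The step-down and step-up schemes above can be sketched as follows. This is a minimal illustration, not taken from the slides: the function names are hypothetical, and the closed form used for the FDP constants, (⌊γi⌋+1)α / (n + ⌊γi⌋ + 1 − i) with γ = 0.2, is an assumption chosen because it reproduces the sequence listed above.

```python
from fractions import Fraction
from math import floor


def holm_constants(n, alpha):
    """Critical constants alpha/n, alpha/(n-1), ..., alpha/2, alpha."""
    return [alpha / (n - i) for i in range(n)]


def fdp_constants(n, alpha, gamma):
    """Assumed closed form for the FDP step-down constants (see lead-in).

    For gamma = 1/5 this yields alpha/n, alpha/(n-1), alpha/(n-2),
    alpha/(n-3), 2*alpha/(n-3), 2*alpha/(n-4), ..., 3*alpha/(n-7), ...
    """
    constants = []
    for i in range(1, n + 1):
        k = floor(gamma * i)
        constants.append((k + 1) * alpha / (n + k + 1 - i))
    return constants


def step_down(pvalues, constants):
    """Number of rejections: start at the smallest p-value, stop at the first failure."""
    rejections = 0
    for p, c in zip(sorted(pvalues), constants):
        if p > c:
            break
        rejections += 1
    return rejections


def step_up(pvalues, constants):
    """Number of rejections: the largest j with p(j) <= c_j."""
    rejections = 0
    for j, (p, c) in enumerate(zip(sorted(pvalues), constants), start=1):
        if p <= c:
            rejections = j
    return rejections


# Same constants, different direction: Holm (SD) vs. Hochberg (SU).
p = [0.03, 0.04]
print(step_down(p, holm_constants(2, 0.05)))  # 0 rejections
print(step_up(p, holm_constants(2, 0.05)))    # 2 rejections

# Exact arithmetic avoids rounding problems in floor(gamma * i).
print(fdp_constants(12, Fraction(1, 20), Fraction(1, 5))[:6])
```

Fractions are used for the FDP constants only so that the floor operation and the comparison with the listed sequence are exact.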

Logical decisions: Coherence
Coherence: When a hypothesis (= subset of the parameter space) is rejected, every one of its subsets can be rejected as well.
Closure test: local level α tests for all ∩-hypotheses + coherence ⇒ control of the multiple level (FWER) α.
Closure tests form a complete class within all MTPs controlling the FWER at level α.
But: Bonferroni-Holm is not coherent, in general!
Quasi-coherence: coherence for all index sets forming an intersection.
Quasi-closure test: local level α tests for all index sets + quasi-coherence ⇒ control of the multiple level (FWER) α.

Monotonic decisions
Consider: monotonicity between different hypotheses:
p1, … , pn = p-values
pi ≤ pj and Hj rejected ⇒ Hi rejected.
Not obligatory: weights for hypotheses (from importance or expected power)
• See Benjamini / Hochberg (1997)
• Fixed sequence tests
• Gatekeeping procedures

Monotonic decisions: nested hypotheses
Example: Yi = β0 + β1 xi + β2 xi² + εi
H1: β1 = β2 = 0
H2: β2 = 0
F test of H1: p = .051
t test of H2: p = .024
Bonferroni-Holm (α = .05) rejects only H2.
Logical: reject H1, too (H1 is a subset of H2, so H1 cannot be true when H2 is false).
The size of a p-value is not the only criterion for rejection!
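
The nested example can be checked with a short sketch. The helper names (`holm`, `make_coherent`) and the `subsets` mapping are hypothetical; the p-values are those given above. The second step only adds rejections of hypotheses that are subsets of an already rejected one, so it cannot add a new type I error event.

```python
def holm(pvalues, alpha):
    """Bonferroni-Holm: return the set of rejected hypothesis names."""
    ordered = sorted(pvalues.items(), key=lambda item: item[1])
    n = len(ordered)
    rejected = set()
    for step, (name, p) in enumerate(ordered):
        if p > alpha / (n - step):
            break
        rejected.add(name)
    return rejected


def make_coherent(rejected, subsets):
    """Also reject every hypothesis that is a subset of a rejected one.

    subsets[h] lists the hypotheses contained in h (they imply h).
    """
    result = set(rejected)
    stack = list(rejected)
    while stack:
        h = stack.pop()
        for s in subsets.get(h, ()):
            if s not in result:
                result.add(s)
                stack.append(s)
    return result


pvalues = {"H1": 0.051, "H2": 0.024}
subsets = {"H2": ["H1"]}  # H1: b1 = b2 = 0 is contained in H2: b2 = 0

plain = holm(pvalues, alpha=0.05)
print(sorted(plain))                          # ['H2']
print(sorted(make_coherent(plain, subsets)))  # ['H1', 'H2']
```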

Monotonic decisions: multiple comparisons
Example: Comparison of k = 4 means (ANOVA)
Hij: μi = μj, 1 ≤ i < j ≤ 4
p13 = .0241 < p34 = .0244 (t test; pooled variance)
Closure test rejects H14, H24, H34, but not H13! (same result with regwq)
Non-monotonicity may be reasonable: It is easier to separate group 4 from the cluster of groups 1, 2, 3 than to find differences within the cluster.

Monotonic decisions
My conclusion: Only for equal weights and no logical constraints is it mandatory that
• decisions are monotonic in p-values, and
• decisions are exchangeable.

Monotonicity within the same hypothesis (α-consistency)
Given p-values p1, …, pn; q1, …, qn with qi ≤ pi for i = 1, …, n.
When a hypothesis is rejected based on the pi's, it should also be rejected when based on the qi's.
Counterexample 1 (WAP procedure of Benjamini-Hochberg, 1997):
Stepdown based on p(j) ≤ w(j)α/(w(j)+…+w(n)): controls the FWER, but is not α-consistent.
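
A small numerical illustration of Counterexample 1, under stated assumptions: the step-down rule is implemented exactly as the inequality above, the hypotheses are ordered by their raw p-values (an assumption about the procedure, not confirmed by the slide), and the weights 9 and 1 and the p-values are made up for illustration.

```python
def weighted_stepdown(pvalues, weights, alpha):
    """Step-down with p(j) <= w(j)*alpha / (w(j) + ... + w(n)).

    Hypotheses are ordered by raw p-value (assumption); returns rejected indices.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    remaining = sum(weights)  # w(j) + ... + w(n) at the current step
    rejected = []
    for i in order:
        if pvalues[i] > weights[i] * alpha / remaining:
            break
        rejected.append(i)
        remaining -= weights[i]
    return rejected


weights = [9, 1]
p = [0.03, 0.04]
q = [0.03, 0.02]  # q_i <= p_i for every i

print(weighted_stepdown(p, weights, 0.05))  # [0, 1]: both hypotheses rejected
print(weighted_stepdown(q, weights, 0.05))  # []: nothing rejected
```

Lowering the second p-value moves the lightly weighted hypothesis to the front of the ordering, where it fails its small threshold and stops the procedure.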

Monotonicity within the same hypothesis (α-consistency)
Counterexample 2: Tarone's (1990) MTP
Uses information about the minimum attainable p-values α1*, …, αn*
n = 2, α1* = .03, α2* = .04:
• α = .05: no Hj can be rejected;
• α = .035: H1 can be rejected if p1 ≤ .035.
Hommel/Krummenauer (1998): monotonic improvement of Tarone's procedure (using a "rejection function" b(α))
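
Counterexample 2 can be reproduced with a sketch of Tarone's procedure as it is commonly described: find the smallest K such that at most K hypotheses have a minimum attainable p-value ≤ α/K, then test only those hypotheses at level α/K. This reading of the procedure, the function name and the p-values are assumptions for illustration.

```python
def tarone(pvalues, min_attainable, alpha):
    """Sketch of Tarone's modified Bonferroni procedure; returns rejected indices."""
    n = len(pvalues)
    for k in range(1, n + 1):
        # Hypotheses that could be rejected at all at level alpha / k.
        testable = [i for i in range(n) if min_attainable[i] <= alpha / k]
        if len(testable) <= k:
            return [i for i in testable if pvalues[i] <= alpha / k]
    return []


min_attainable = [0.03, 0.04]
p = [0.03, 0.04]  # each p-value sits at its minimum attainable value

print(tarone(p, min_attainable, alpha=0.05))   # []
print(tarone(p, min_attainable, alpha=0.035))  # [0]
```

With α = .05 both hypotheses are testable at K = 1, so K = 2 is needed and nothing can reach .025; with the smaller α = .035 only H1 is testable and K = 1 suffices.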

The fallback procedure (I)
Wiens (2003): "fixed sequence testing procedure" with the possibility to continue
Dmitrienko, Wiens, Westfall (2005): "fallback procedure"
Wiens + Dmitrienko (2005): proof that the FWER is controlled, suggestion for improvement
Two types of weights:
• sequence of hypotheses;
• "assigned weights" α1′, …, αn′ with Σαi′ = α.

The fallback procedure (II)
Use "assigned weights" α1′, …, αn′ with Σαi′ = α.
Actual significance levels:
α1 = α1′
αi = αi′ + αi-1 if Hi-1 has been rejected
αi = αi′ if Hi-1 has not been rejected.
α1′ = α, α2′ = ... = αn′ = 0 ⇒ fixed sequence test.
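
The recursion for the actual significance levels can be written as a short sketch. The function name `fallback` and the example p-values are illustrative assumptions; the update rule is the one stated above.

```python
def fallback(pvalues, assigned):
    """Fallback procedure for hypotheses in their pre-specified order.

    assigned[i] is the assigned weight of hypothesis i (the weights sum to alpha).
    Returns the indices of the rejected hypotheses.
    """
    carried = 0.0  # level passed on from the previous hypothesis if it was rejected
    rejected = []
    for i, (p, a) in enumerate(zip(pvalues, assigned)):
        level = a + carried
        if p <= level:
            rejected.append(i)
            carried = level
        else:
            carried = 0.0
    return rejected


# Assigned weights .04 and .01 (alpha = .05).
print(fallback([0.030, 0.045], [0.04, 0.01]))  # [0, 1]: H2 is tested at .05
print(fallback([0.045, 0.005], [0.04, 0.01]))  # [1]: H2 is tested at .01

# All weight on the first hypothesis gives the fixed sequence test.
print(fallback([0.03, 0.04, 0.20], [0.05, 0.0, 0.0]))  # [0, 1]
```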

Example for n = 2
• Endpoint 1: Functional capacity of the heart
• Endpoint 2: Mortality
• α = .05, α1′ = .04, α2′ = .01
• p1 ≤ .04: Reject H1 and test H2 with α2 = .05.
• p1 > .04: Retain H1 and test H2 with α2 = .01.
Weighted Bonferroni-Holm with α1′ = .04, α2′ = .01: rejects H1, in addition, when p2 ≤ .01 and .04 < p1 ≤ .05!
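
The difference between the two procedures for n = 2 can be made concrete with the following sketch. Weighted Bonferroni-Holm is written here as a closure test (weighted Bonferroni test for H1 ∩ H2, level α for each single hypothesis); the function names and the p-values .045 and .005 are assumptions for illustration.

```python
def fallback_2(p1, p2, a1, a2):
    """Fallback procedure for two hypotheses with assigned weights a1, a2."""
    rejected = set()
    if p1 <= a1:
        rejected.add("H1")
        level2 = a2 + a1  # H1 rejected: its level is passed on to H2
    else:
        level2 = a2
    if p2 <= level2:
        rejected.add("H2")
    return rejected


def weighted_holm_2(p1, p2, a1, a2):
    """Weighted Bonferroni-Holm for two hypotheses, written as a closure test."""
    alpha = a2 + a1
    if not (p1 <= a1 or p2 <= a2):
        return set()  # intersection hypothesis retained, so nothing is rejected
    rejected = set()
    if p1 <= alpha:
        rejected.add("H1")
    if p2 <= alpha:
        rejected.add("H2")
    return rejected


# p2 <= .01 and .04 < p1 <= .05: only WBH rejects H1.
print(sorted(fallback_2(0.045, 0.005, 0.04, 0.01)))       # ['H2']
print(sorted(weighted_holm_2(0.045, 0.005, 0.04, 0.01)))  # ['H1', 'H2']
```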

Comparison with weighted Bonferroni-Holm (WBH)
• For n = 2: WBH is strictly more powerful than the fallback procedure. The improvement by Wiens + Dmitrienko is identical to WBH.
• For n ≥ 3: There exist situations where fallback rejects and WBH does not, and conversely (⇒ the improvement by W+D is not identical to WBH).

The fallback procedure for n=3: weights for intersection hypotheses
αi′ = wi α, Σ wi = 1 (see W+D)

The fallback procedure for n=3: equal weights
αi′ = wi α, wi = 1/3
Consequence for importance: H2 H3 H1?

The fallback procedure for n=3: equal weights; improvement by W+D
αi′ = wi α, wi = 1/3
Consequence for importance: H2 H3 H1 (remains)

The fallback procedure for n=3: equal weights
The decisions of the fallback procedure (with equal weights) are not exchangeable (and can never become so!).
Example: p(1) = .015, p(2) = .02, p(3) = 1; α = .05. (Bonferroni-Holm: rejects H(1) and H(2))
• p1 < p2 < p3 : reject H1, H2
• p1 < p3 < p2 : reject H1
• p2 < p1 < p3 : reject H2
• p2 < p3 < p1 : reject H2, H3
• p3 < p1 < p2 : reject H3 (, H1)
• p3 < p2 < p1 : reject H3
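
The six cases above can be enumerated mechanically. The sketch below implements only the original fallback procedure with equal assigned weights α/3, not the W+D improvement, so the additional rejection of H1 shown in parentheses does not appear; the function and variable names are illustrative assumptions.

```python
from itertools import permutations


def fallback(pvalues, assigned):
    """Fallback procedure; returns the indices of the rejected hypotheses."""
    carried = 0.0
    rejected = []
    for i, (p, a) in enumerate(zip(pvalues, assigned)):
        level = a + carried
        if p <= level:
            rejected.append(i)
            carried = level
        else:
            carried = 0.0
    return rejected


alpha = 0.05
assigned = [alpha / 3] * 3
ordered_p = (0.015, 0.02, 1.0)

results = {}
for perm in permutations(range(3)):
    # perm[r] is the index of the hypothesis that receives the r-th smallest p-value.
    p = [0.0] * 3
    for rank, index in enumerate(perm):
        p[index] = ordered_p[rank]
    results[perm] = [f"H{i + 1}" for i in fallback(p, assigned)]

for perm, rejected in results.items():
    ordering = " < ".join(f"p{i + 1}" for i in perm)
    print(f"{ordering}: reject {', '.join(rejected)}")
```

The same three p-values lead to one or two rejections depending only on which hypotheses they are attached to, which is the lack of exchangeability stated above.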

The fallback procedure: critical questions
• What is the relation between the two different types of weighting?
• Can it be meaningful to give higher assigned weights to higher indices?
• Can one give "guidelines" on how to choose the weights?
• Equal assigned weights: what is the influence of the ordering? (anyway: the procedure has "aesthetic" drawbacks)
• In which situations can one expect the fallback procedure to be more powerful than WBH?
• Or would it be better to renounce it completely?

Thank you for your attendance! Are there more questions? Or some answers?
