Aesthetics and power in multiple testing – a contradiction?
MCP 2007, Vienna
Gerhard Hommel
Introduction: Economics and Statistics
Economics: profit is not everything
Ethical / social component
Competing interests
Aesthetics: protection of environment, industrial art, patronage
Statistics: power is not everything
+ : the principle is simple to describe
+ : coherence is obtained directly
– : often very cumbersome to perform
SD(α/n, α/(n-1), α/(n-2), α/(n-3), 2α/(n-3), 2α/(n-4), … , 3α/(n-7), …)
not beautiful (and not powerful)!
Coherence: When a hypothesis (= subset of the parameter space) is rejected, every one of its subsets is rejected as well.
Closure test: Local level α tests for all ∩-hypotheses + coherence ⇒ control of multiple level (FWER) α.
Closure tests form a complete class within all MTPs controlling the FWER α.
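The closure principle can be sketched in a few lines. The block below is a minimal illustration, assuming Bonferroni local tests for the intersection hypotheses (an illustrative choice, not prescribed by the slides) and no logical constraints between the hypotheses; the function name `closure_test` is hypothetical. The p-values are those of the three-hypothesis example used later in the talk.

```python
from itertools import combinations


def closure_test(p_values, alpha=0.05):
    """Closed testing with Bonferroni local tests (illustrative assumption)."""
    n = len(p_values)
    indices = range(n)

    def local_reject(subset):
        # Local level-alpha test of the intersection hypothesis over `subset`:
        # reject if the smallest p-value is at most alpha / |subset|.
        return min(p_values[i] for i in subset) <= alpha / len(subset)

    decisions = []
    for i in indices:
        # H_i is rejected only if every intersection containing i is rejected.
        containing_i = (
            subset
            for size in range(1, n + 1)
            for subset in combinations(indices, size)
            if i in subset
        )
        decisions.append(all(local_reject(s) for s in containing_i))
    return decisions


decisions = closure_test([0.015, 0.02, 1.0], alpha=0.05)
```

The loop visits all 2^n - 1 intersections, which illustrates why a closure test can be cumbersome to perform for larger n.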
But: Bonferroni-Holm is not coherent, in general!
Quasi-coherence: coherence for all index sets forming an intersection.
Quasi-closure test: Local level α tests for all index sets + quasi-coherence ⇒ control of multiple level (FWER) α.
Consider: monotonicity between different hypotheses:
p1, …, pn = p-values
pi ≤ pj and Hj rejected ⇒ Hi rejected.
Not obligatory: weights for hypotheses (from importance or expected power)
Example: Yi = β0 + β1 xi + β2 xi² + εi
H1: β1 = β2 = 0;  H2: β2 = 0
F test of H1: p = .051
t test of H2: p = .024
Bonferroni-Holm (α = .05) rejects only H2
Logical: reject H1, too (H1 implies H2, so H1 cannot be true when H2 is false).
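The regression example can be replayed numerically. The following is a minimal sketch, assuming the standard Bonferroni-Holm step-down and a simple one-pass rule that also rejects any hypothesis implying an already rejected one; the helper names `holm` and `enforce_implications` are hypothetical, and the sketch only mirrors the logic of the slide without making any claim about error control.

```python
def holm(p_values, alpha=0.05):
    """Bonferroni-Holm step-down; returns one decision per hypothesis."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    rejected = [False] * n
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (n - step):
            rejected[i] = True
        else:
            break  # stop at the first non-rejection
    return rejected


def enforce_implications(rejected, implies):
    """implies[i] lists the j for which H_i implies H_j (H_i is a subset of H_j).

    One pass only: sufficient here, not a general transitive closure.
    """
    result = list(rejected)
    for i, targets in implies.items():
        if any(rejected[j] for j in targets):
            result[i] = True
    return result


p = [0.051, 0.024]  # index 0: H1 (F test), index 1: H2 (t test)
holm_decisions = holm(p, alpha=0.05)
coherent_decisions = enforce_implications(holm_decisions, {0: [1]})
```

Here `holm_decisions` keeps H1 because .051 > .05 in the second step, while `coherent_decisions` adds H1 on purely logical grounds.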
Size of a p-value is not the only criterion for rejection!
Example: Comparison of k=4 means (ANOVA)
Hij: μi = μj, 1 ≤ i < j ≤ 4
p13 = .0241 < p34 = .0244 (t test; pooled variance)
Closure test rejects H14, H24, H34, but not H13!
(same result with regwq)
Non-monotonicity may be reasonable:
It is easier to separate group 4 from the cluster of groups 1,2,3 than to find differences within the cluster.
Only for equal weights and no logical constraints, it is mandatory that monotonicity between hypotheses holds (pi ≤ pj and Hj rejected ⇒ Hi rejected).
Given p-values p1, …, pn; q1, …, qn
with qi ≤ pi for i=1,…,n.
When a hypothesis is rejected based on the pi's, it should also be rejected when based on the qi's.
Counterexample 1 (WAP procedure of Benjamini-Hochberg, 1997):
Stepdown based on p(j) ≤ w(j)α/(w(j)+…+w(n)):
Controls the FWER, but is not α-consistent.
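A small numerical sketch shows how a stepdown of this form can lose a rejection when a p-value decreases. It assumes that p(j) denotes the ordered raw p-values and w(j) the weight of the hypothesis belonging to p(j); that reading, the function name `ordered_weighted_stepdown`, and all numbers below are assumptions for illustration, not the counterexample from the original slide.

```python
def ordered_weighted_stepdown(p_values, weights, alpha=0.05):
    """Step down through the ordered p-values, comparing p(j) with
    w(j) * alpha / (w(j) + ... + w(n)); stop at the first failure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    rejected = [False] * n
    for step, i in enumerate(order):
        # Total weight of the hypotheses not yet rejected.
        remaining_weight = sum(weights[k] for k in order[step:])
        if p_values[i] <= weights[i] * alpha / remaining_weight:
            rejected[i] = True
        else:
            break
    return rejected


weights = [0.9, 0.1]  # hypothetical weights
before = ordered_weighted_stepdown([0.04, 0.045], weights)
after = ordered_weighted_stepdown([0.04, 0.03], weights)  # second p-value smaller
```

With the larger p-values both hypotheses are rejected. After the second p-value drops from .045 to .03, the lightly weighted hypothesis moves to the front of the ordering, fails its threshold of .005, and nothing is rejected.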
Counterexample 2: Tarone's (1990) MTP
Uses information about minimum attainable p-values α1*, …, αn*
n=2, α1*=.03, α2*=.04:
Hommel/Krummenauer (1998): monotonic improvement of Tarone's procedure (using a "rejection function" b(α))
Wiens (2003): "fixed sequence testing procedure" with possibility to continue
Dmitrienko, Wiens, Westfall (2005): "fallback procedure"
Wiens + Dmitrienko (2005): Proof that FWER is controlled, suggestion for improvement
Two types of weights:
Use "assigned weights" α1′, …, αn′ with Σαi′ = α.
Actual significance levels:
α1 = α1′
αi = αi′ + α_{i-1} if H_{i-1} has been rejected
αi = αi′ if H_{i-1} has not been rejected.
α1′ = α, α2′ = ... = αn′ = 0 ⇒ fixed sequence test.
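The recursion for the actual significance levels translates directly into code. This is a minimal sketch of the rule as stated above; the function name `fallback` and the p-values in the usage lines are hypothetical.

```python
def fallback(p_values, assigned_levels):
    """Fallback procedure: hypotheses are tested in their given order.

    The actual level of H_i is its assigned level, plus the actual level
    of H_(i-1) if H_(i-1) has been rejected.
    """
    decisions = []
    previous_level = 0.0
    previous_rejected = False
    for p, assigned in zip(p_values, assigned_levels):
        level = assigned + (previous_level if previous_rejected else 0.0)
        is_rejected = p <= level
        decisions.append(is_rejected)
        previous_level, previous_rejected = level, is_rejected
    return decisions


# Assigned levels .04 and .01: H2 is tested at .05 once H1 is rejected.
carried_over = fallback([0.03, 0.04], [0.04, 0.01])
# If H1 fails, H2 is still tested, but only at its own level .01.
continued = fallback([0.045, 0.005], [0.04, 0.01])
# All level on H1: the procedure reduces to a fixed sequence test.
fixed_sequence = fallback([0.03, 0.2, 0.01], [0.05, 0.0, 0.0])
```

In the last call H3 is never rejected, because testing stops passing level forward as soon as H2 fails.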
Weighted Bonferroni-Holm with α1′ = .04, α2′ = .01:
Rejects H1, in addition, when p2 ≤ .01 and .04 < p1 ≤ .05!
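The extra rejection can be checked with a short sketch of a weighted Bonferroni-Holm procedure, assuming the usual formulation in which each remaining hypothesis is compared with α times its share of the remaining weight. The function name `weighted_holm` and the p-values are hypothetical; the assigned levels are those of the slide.

```python
def weighted_holm(p_values, assigned_levels):
    """Weighted Bonferroni-Holm with weights proportional to the assigned levels."""
    alpha = sum(assigned_levels)
    active = set(range(len(p_values)))
    rejected = [False] * len(p_values)
    while active:
        total = sum(assigned_levels[i] for i in active)
        if total == 0:
            break  # no level left to distribute
        # Each active hypothesis gets alpha times its share of the remaining weight.
        hits = [
            i for i in active
            if p_values[i] <= alpha * assigned_levels[i] / total
        ]
        if not hits:
            break
        for i in hits:
            rejected[i] = True
            active.remove(i)
    return rejected


levels = [0.04, 0.01]
# p2 <= .01 and .04 < p1 <= .05: H2 is rejected first, then H1 at the full level.
both = weighted_holm([0.045, 0.005], levels)
# With p2 > .01 neither hypothesis passes its initial threshold.
neither = weighted_holm([0.045, 0.02], levels)
```

Rejecting all current hits at once gives the same result as rejecting them one by one, since thresholds only grow when hypotheses are removed.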
[Figure: panels with weights wi = 1 and wi = 1/3; sequences "H2 H3 H1?", "H2 H3 H1?", "H2 H3 H1"]
The decisions of the fallback procedure (with equal weights) are not exchangeable (and can never become so!).
Example: p(1)=.015, p(2)=.02, p(3)=1; α=.05.
(Bonferroni-Holm: rejects H(1) and H(2))
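The lack of exchangeability can be made concrete by running the fallback procedure over every ordering of the three p-values. The sketch below assumes equal assigned levels α/3 and the fallback rule as described earlier in the talk; the function names and the dictionary `outcomes` are hypothetical.

```python
from itertools import permutations


def fallback(p_values, assigned_levels):
    """Fallback procedure: a rejected hypothesis passes its level to the next one."""
    decisions = []
    previous_level, previous_rejected = 0.0, False
    for p, assigned in zip(p_values, assigned_levels):
        level = assigned + (previous_level if previous_rejected else 0.0)
        is_rejected = p <= level
        decisions.append(is_rejected)
        previous_level, previous_rejected = level, is_rejected
    return decisions


def holm(p_values, alpha):
    """Bonferroni-Holm step-down, for comparison."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    rejected = [False] * n
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (n - step):
            rejected[i] = True
        else:
            break
    return rejected


alpha = 0.05
p_sorted = (0.015, 0.02, 1.0)
equal_levels = [alpha / 3] * 3

# For each testing order, record which p-values lead to a rejection.
outcomes = {}
for order in permutations(p_sorted):
    decisions = fallback(order, equal_levels)
    outcomes[order] = sorted(p for p, r in zip(order, decisions) if r)

holm_decisions = holm(p_sorted, alpha)
```

The hypothesis with p = .02 is rejected only in those orderings where it directly follows the rejected hypothesis with p = .015, whereas Bonferroni-Holm rejects both regardless of how the hypotheses are labelled.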