Aesthetics and power in multiple testing – a contradiction?

MCP 2007, Vienna

Gerhard Hommel

Economics: profit is not everything

- Ethical / social component
- Competing interests
- Aesthetics: protection of the environment, industrial art, patronage

Statistics: power is not everything

- Ethics: decisions are logical, comprehensible, simple
- Competing interests
- Aesthetics: “beauty of mathematics” (subjective), but also same points as for ethics

- Closure test
+ : principle is simple to describe

+ : coherence directly obtained

– : often very cumbersome to perform

- Bonferroni-Holm: SD(α/n, α/(n-1), … , α/2, α)
- Hochberg: SU(α/n, α/(n-1), … , α/2, α)
- FDP, e.g. control of P(FDP > 0.2):
SD(α/n, α/(n-1), α/(n-2), α/(n-3), 2α/(n-3), 2α/(n-4), … , 3α/(n-7), …)

not beautiful (and not powerful)!
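The SD(...) and SU(...) notation above can be made concrete with a minimal sketch. The function names `step_down`, `step_up` and `holm_constants` and the three p-values are illustrative assumptions, not part of the talk; the critical constants are those listed for Bonferroni-Holm and Hochberg.

```python
def step_down(pvals, crit):
    """Reject in order of increasing p-value; stop at the first p(j) > crit[j]."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    rejected = set()
    for rank, i in enumerate(order):
        if pvals[i] > crit[rank]:
            break
        rejected.add(i)
    return rejected


def step_up(pvals, crit):
    """Find the largest j with p(j) <= crit[j]; reject H(1), ..., H(j)."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    for rank in range(len(order) - 1, -1, -1):
        if pvals[order[rank]] <= crit[rank]:
            return set(order[: rank + 1])
    return set()


def holm_constants(n, alpha):
    """Critical constants alpha/n, alpha/(n-1), ..., alpha/2, alpha."""
    return [alpha / (n - j) for j in range(n)]


# Illustrative p-values (assumed, not from the slides).
p = [0.02, 0.03, 0.04]
crit = holm_constants(len(p), alpha=0.05)

holm_rejected = step_down(p, crit)      # Bonferroni-Holm: same constants, step-down
hochberg_rejected = step_up(p, crit)    # Hochberg: same constants, step-up
```

With these p-values the step-down version stops immediately (.02 > .05/3), while the step-up version rejects all three hypotheses because the largest p-value is below α.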

Coherence: When a hypothesis (= subset of the parameter space) is rejected, each of its subsets can be rejected as well.

Closure test: Local level α tests for all intersection hypotheses + coherence ⇒ control of multiple level (FWER) α.

Closure tests form a complete class within all MTPs controlling the FWER at level α.

But: Bonferroni-Holm is not coherent, in general!

Quasi-coherence: coherence for all index sets forming an intersection.

Quasi-closure test: Local level α tests for all index sets + quasi-coherence ⇒ control of multiple level (FWER) α.

Consider: monotonicity between different hypotheses:

p1, … , pn = p-values

pi ≤ pj and Hj rejected ⇒ Hi rejected.

Not obligatory: weights for hypotheses (from importance or expected power)

- See Benjamini / Hochberg (1997)
- Fixed sequence tests
- Gatekeeping procedures

Example: Yi = β0 + β1 xi + β2 xi² + εi

H1: β1 = β2 = 0;  H2: β2 = 0

F test of H1: p = .051

t test of H2: p = .024

Bonferroni-Holm (α = .05) rejects only H2

Logical: reject H1, too (H1 is a subset of H2, so coherence demands it).

Size of a p-value is not the only criterion for rejection!
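The regression example can be replayed in a few lines. This is only a sketch: `holm` and `make_coherent` are hypothetical helper names, and the `subsets` mapping simply encodes that H1 (β1 = β2 = 0) is a subset of H2 (β2 = 0).

```python
def holm(pvals, alpha):
    """Bonferroni-Holm on a dict {name: p-value}; returns the rejected names."""
    order = sorted(pvals, key=pvals.get)
    n = len(order)
    rejected = set()
    for j, name in enumerate(order):
        if pvals[name] > alpha / (n - j):
            break
        rejected.add(name)
    return rejected


def make_coherent(rejected, subsets):
    """Add every hypothesis that is a subset of an already rejected one."""
    result = set(rejected)
    changed = True
    while changed:
        changed = False
        for h in list(result):
            for sub in subsets.get(h, ()):
                if sub not in result:
                    result.add(sub)
                    changed = True
    return result


pvals = {"H1": 0.051, "H2": 0.024}   # F test and t test from the example
subsets = {"H2": ["H1"]}             # H1 is a subset of H2

by_holm = holm(pvals, alpha=0.05)             # only H2: .024 <= .025, .051 > .05
coherent = make_coherent(by_holm, subsets)    # H1 follows from rejecting H2
```

Adding H1 costs nothing in terms of FWER: whenever H1 is true, H2 is true as well, so rejecting H2 would already have been an error.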

Example: Comparison of k=4 means (ANOVA)

Hij: μi = μj , 1 ≤ i < j ≤ 4

p13 = .0241 < p34 = .0244 (t test; pooled variance)

Closure test rejects H14, H24, H34, but not H13!

(same result with REGWQ)

Non-monotonicity may be reasonable:

It is easier to separate group 4 from the cluster of groups 1,2,3 than to find differences within the cluster.

My conclusion:

Only with equal weights and no logical constraints is it mandatory that

- decisions are monotonic in p-values, and
- decisions are exchangeable.

Given p-values p1, …, pn; q1, …, qn

with qi ≤ pi for i = 1, …, n.

When a hypothesis is rejected based on the pi, it should also be rejected when based on the qi.

Counterexample 1 (WAP procedure of Benjamini-Hochberg, 1997):

Stepdown based on p(j) ≤ w(j)α/(w(j)+…+w(n)):

Controls the FWER, but is not α-consistent.
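A small numerical sketch shows how this step-down rule can lose rejections when a p-value gets smaller. The weights (.8, .2) and the p-values are invented for illustration, and `weighted_step_down` is a hypothetical name; the rule itself is the one stated above, with hypotheses ordered by their raw p-values.

```python
def weighted_step_down(pvals, weights, alpha):
    """Step-down with p(j) <= w(j) * alpha / (w(j) + ... + w(n)),
    where hypotheses are ordered by increasing p-value."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    rejected = set()
    for rank, i in enumerate(order):
        remaining = sum(weights[k] for k in order[rank:])
        if pvals[i] > weights[i] * alpha / remaining:
            break
        rejected.add(i)
    return rejected


weights = [0.8, 0.2]      # assumed weights for H1, H2
p = [0.03, 0.035]         # H1 comes first: .03 <= .04, then .035 <= .05
q = [0.03, 0.02]          # q2 < p2, but now H2 comes first: .02 > .01

with_p = weighted_step_down(p, weights, alpha=0.05)
with_q = weighted_step_down(q, weights, alpha=0.05)
```

Both hypotheses are rejected for p, none for q, although q is componentwise no larger than p. The change in ordering moves the small weight to the first step, which is what breaks monotonicity.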

Counterexample 2: Tarone’s (1990) MTP

Uses information about minimum attainable p-values α1*, …, αn*

n=2, α1*=.03, α2*=.04:

- α = .05: no Hj can be rejected;
- α = .035: H1 can be rejected if p1 ≤ .035.

Hommel/Krummenauer (1998): monotonic improvement of Tarone’s procedure (using a “rejection function” b(α))
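The n = 2 counterexample for Tarone's procedure can be checked with a short sketch. It assumes the usual statement of the rule: take the smallest k such that at most k hypotheses have a minimum attainable p-value αi* ≤ α/k, then test only those hypotheses at level α/k. The function name `tarone` and the observed p-values are illustrative assumptions.

```python
def tarone(pvals, min_p, alpha):
    """Sketch of Tarone's rule under the assumption stated in the lead-in."""
    n = len(pvals)
    for k in range(1, n + 1):
        # Hypotheses that could be rejected at all at level alpha / k.
        testable = [i for i in range(n) if min_p[i] <= alpha / k]
        if len(testable) <= k:
            level = alpha / k
            return {i for i in testable if pvals[i] <= level}
    return set()


min_p = [0.03, 0.04]      # minimum attainable p-values from the slide
pvals = [0.032, 0.5]      # illustrative observed p-values (assumed)

at_05 = tarone(pvals, min_p, alpha=0.05)     # k = 2, level .025: nothing testable
at_035 = tarone(pvals, min_p, alpha=0.035)   # k = 1, level .035: H1 testable
```

At α = .05 both hypotheses count as testable for k = 1, so the rule moves on to k = 2 and level .025, where neither can reach significance. At the smaller α = .035 only H1 is testable, k = 1 suffices, and H1 is rejected.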

Wiens (2003): “fixed sequence testing procedure” with possibility to continue

Dmitrienko, Wiens, Westfall (2005): “fallback procedure”

Wiens + Dmitrienko (2005): Proof that FWER is controlled, suggestion for improvement

Two types of weights:

- sequence of hypotheses;
- “assigned weights” α1′, …, αn′ with Σαi′ = α.

Use “assigned weights” α1′, …, αn′ with Σαi′ = α.

Actual significance levels:

α1 = α1′

αi = αi′ + αi−1 if Hi−1 has been rejected (αi−1 = actual level used for Hi−1)

αi = αi′ if Hi−1 has not been rejected.

α1′ = α, α2′ = ... = αn′ = 0 ⇒ fixed sequence test.

- Endpoint 1: Functional capacity of heart
- Endpoint 2: Mortality
- α = .05, α1′ = .04, α2′ = .01
- p1 ≤ .04: Reject H1 and test H2 with α2 = .05.
- p1 > .04: Retain H1 and test H2 with α2 = .01.

Weighted Bonferroni-Holm with α1′ = .04, α2′ = .01:

Rejects H1, in addition, when p2 ≤ .01 and .04 < p1 ≤ .05!
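The two-endpoint comparison can be sketched as follows. The fallback rule follows the "actual significance levels" given earlier; the weighted Bonferroni-Holm version assumes the standard weighted Holm step-down (pick the remaining hypothesis with the smallest p/weight ratio). Function names and the p-values are illustrative assumptions.

```python
def fallback(pvals, levels):
    """Fallback procedure: test in the given order; a rejected hypothesis
    passes its actual level on to the next one."""
    rejected = set()
    carry = 0.0
    for i, (p, assigned) in enumerate(zip(pvals, levels)):
        actual = assigned + carry
        if p <= actual:
            rejected.add(i)
            carry = actual
        else:
            carry = 0.0
    return rejected


def weighted_holm(pvals, levels, alpha):
    """Weighted Bonferroni-Holm (assumed standard form); levels must be > 0."""
    remaining = set(range(len(pvals)))
    rejected = set()
    while remaining:
        total = sum(levels[k] for k in remaining)
        i = min(remaining, key=lambda k: pvals[k] / levels[k])
        if pvals[i] > levels[i] * alpha / total:
            break
        rejected.add(i)
        remaining.remove(i)
    return rejected


levels = [0.04, 0.01]     # assigned levels for endpoint 1 and endpoint 2
p = [0.045, 0.005]        # illustrative: .04 < p1 <= .05 and p2 <= .01

by_fallback = fallback(p, levels)                 # H1 retained, H2 rejected at .01
by_wbh = weighted_holm(p, levels, alpha=0.05)     # H2 first, then H1 at level .05
```

Here the fallback procedure rejects only H2, whereas weighted Bonferroni-Holm rejects H2 and then H1 at the full level, which is the extra region described above.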

- For n = 2: WBH is strictly more powerful than the fallback procedure. The improvement by Wiens + Dmitrienko is identical to WBH.
- For n ≥ 3: There exist situations where fallback rejects and WBH does not, and conversely (so the improvement by W+D is not identical to WBH).

αi′ = wi α, Σ wi = 1 (see W+D)

[Figure sequence not recoverable from the extraction; labels: αi′ = wi α, wi = 1/3]

Consequence for importance: H2 H3 H1?

H2 H3 H1 (remains)

The decisions of the fallback procedure (with equal weights) are not exchangeable (and can never be made so!).

Example: p(1)=.015, p(2)=.02, p(3)=1; α=.05.

(Bonferroni-Holm: rejects H(1) and H(2))

- p1 < p2 < p3 : reject H1, H2
- p1 < p3 < p2 : reject H1
- p2 < p1 < p3 : reject H2
- p2 < p3 < p1 : reject H2, H3
- p3 < p1 < p2 : reject H3 (, H1)
- p3 < p2 < p1 : reject H3
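The six cases above can be reproduced by enumerating all assignments of the three p-values to H1, H2, H3. This is a sketch with equal assigned levels α/3 and the plain fallback rule (without the Wiens + Dmitrienko improvement); the function name `fallback` is an assumption.

```python
from itertools import permutations


def fallback(pvals, levels):
    """Fallback procedure: a rejected hypothesis passes its actual level on."""
    rejected = set()
    carry = 0.0
    for i, (p, assigned) in enumerate(zip(pvals, levels)):
        actual = assigned + carry
        if p <= actual:
            rejected.add(i)
            carry = actual
        else:
            carry = 0.0
    return rejected


alpha = 0.05
levels = [alpha / 3] * 3            # equal assigned weights
values = (0.015, 0.02, 1.0)         # p(1), p(2), p(3) from the example

# Map each assignment (p1, p2, p3) to the indices of the rejected hypotheses
# (0 = H1, 1 = H2, 2 = H3).
decisions = {perm: fallback(perm, levels) for perm in permutations(values)}

for perm, rejected in decisions.items():
    names = ", ".join(f"H{i + 1}" for i in sorted(rejected))
    print(perm, "-> reject", names)
```

The same three p-values lead to one or two rejections depending only on which hypothesis they are attached to, which is the lack of exchangeability the slide points out.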

- What are the relations of the two different types of weighting?
- Can it be meaningful to give higher assigned weights for higher indices?
- Can one give “guidelines” on how to choose the weights?
- Equal assigned weights: what is the influence of ordering? (anyway: the procedure has “aesthetic” drawbacks)
- For which situations can one expect that the fallback procedure is more powerful than WBH?
- Or would it be better to abandon it completely?