Help! Statistics! Multiple testing. Problems and some solutions.

Help! Statistics! Multiple testing. Problems and some solutions. Hans Burgerhof, j.g.m.burgerhof@umcg.nl, February 12, 2019

Help! Statistics! Lunchtime Lectures
• What? Frequently used statistical methods and questions in a manageable timeframe for all researchers at the UMCG. No knowledge of advanced statistics is required.
• When? Lectures take place every 2nd Tuesday of the month, 12.00-13.00 hrs.
• Who? Unit for Medical Statistics and Decision Making
Slides can be downloaded from http://www.rug.nl/research/epidemiology/download-area

Program
Today: Multiple testing.
1. What is the problem?
2. (Stochastically) independent tests versus dependent tests
3. Controlling the Familywise Error Rate (FWER)
4. Controlling the False Discovery Rate (FDR)
5. Some references (for finding more solutions)

Type I and Type II errors for a statistical test
H0: effect new treatment = effect standard treatment
H1: effect new treatment > effect standard treatment
The significance level α is generally 0.05: we allow a 5% probability of rejecting H0 while in fact it is true (a type I error). A type II error is not rejecting H0 while in fact H1 is true.

The classical problem of multiple testing
• In statistical testing, we usually set the significance level α at 0.05. This means we accept a probability of 0.05 of rejecting a null hypothesis while in fact the null hypothesis is true.
• This is called the comparison-wise error rate (CWER).
• What can we say about the probability of rejecting at least one true null hypothesis if we have more than one hypothesis to test? This "overall alpha" is the family-wise error rate (FWER). Ignoring it leads to chance capitalisation!

FWER and CWER
If we perform n independent tests, each with CWER = 0.05, then:
• the probability of making a type I error = 0.05 (in one test)
• the probability of not making a type I error (per test) = 1 – 0.05 = 0.95
• the probability of making no type I errors in n independent tests = 0.95^n
So the probability of making at least one type I error equals 1 – 0.95^n.
Number of tests n    Overall alpha (FWER)
3                    0.143
10                   0.401
100                  0.994
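The FWER values above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the n tests are independent and that every null hypothesis is true; the function name `fwer` is chosen here for illustration and does not come from the lecture.

```python
def fwer(n_tests, cwer=0.05):
    """Family-wise error rate for n independent tests, each at level cwer."""
    # P(at least one type I error) = 1 - P(no type I error in any test)
    return 1 - (1 - cwer) ** n_tests


for n in (3, 10, 100):
    print(f"n = {n:3d}   FWER = {fwer(n):.3f}")
```

With a different per-test level, pass it as the second argument, for example `fwer(3, 0.05 / 3)` for three Bonferroni-corrected tests.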

A simple, classical example
We would like to compare three independent groups with respect to a continuous, normally distributed outcome variable. Performing all pairwise comparisons will take three tests (1 to 2, 1 to 3, 2 to 3). Three null hypotheses will be tested: μ1 = μ2, μ1 = μ3 and μ2 = μ3. Are these hypotheses independent of each other? No: if μ1 = μ2 is true and μ1 = μ3 is true, then μ2 = μ3 automatically has to be true!

How to control the FWER at 0.05 in this situation?
A. One-step procedure of Bonferroni. Perform all pairwise comparisons at CWER = 0.05 / c, in which c equals the number of comparisons (in this case c = 3, so take CWER ≈ 0.0167).
B. Two-step procedure: one-way ANOVA followed by post-hoc pairwise comparisons. If the P-value of the one-way ANOVA ≤ α1, choose a suitable post-hoc procedure and perform the pairwise tests at level α2.
What is a good choice for the α's?

Choice of α's for three groups
In the case of three groups, you can take α1 = α2 = 0.05 and your overall α is still 0.05! There are only three possible situations:
• all three means are equal: you are protected because of the overall test
• exactly two means are equal: you can make a type I error only for the one pairwise hypothesis that is true
• all three means differ: you cannot make a type I error at all
Using a Bonferroni correction after a significant ANOVA is too conservative!

Multiple tests on cumulating data (dependent tests)
• Theory is used for interim analyses
• Armitage, McPherson and Rowe (1969)
• Tables with overall alpha after sequential tests for observations from Binomial, Normal and Exponential distributions
• As an illustration we will recalculate an example (n patients are treated with both A and B and have to tell which is better).
• H0: πA = πB = 0.5 (each patient is equally likely to prefer A or B). We will test after each new patient.

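[Figure: tree of possible preferences after each new patient (A or B, then A or B again, etcetera); the number of patients preferring A is X ~ B(n, 0.5)]
Overall alpha increases, but not as extremely as in the case of independent tests (100 independent tests: overall α > 0.99). The formula for independent tests no longer holds.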

Binomial distribution, n = 1, ..., 10 and π = 0.5
• P(k = 0) = P(k = n) = 0.5^n

(1) H0: π = 0.5; α = 0.01 two-sided (per test): total probability to reject if H0 is true: 0.00781. The boundary is hit once.
[Figure: number of successes for A (0 to 10) against the number of patients n (1 to 10), X ~ B(n, 0.5), with the rejection boundary]
• X ~ B(7, 0.5): P(X = 0) = P(X = 7) = 0.5^7 ≈ 0.0078. Two-sided: 0.0156 > α. Do not reject H0.
• X ~ B(8, 0.5): P(X = 0) = P(X = 8) ≈ 0.0039. Two-sided: 0.0078 < α. Reject H0.
• X ~ B(10, 0.5): P(X ≤ 1) = P(X ≥ 9) ≈ 0.0107. Two-sided: 0.0215 > α. Do not reject H0.
Actual overall alpha is 0.0078.

(2) α = 0.03 two-sided (for each test): total probability to reject if H0 is true = 0.02930.
[Figure: number of successes for A (0 to 10) against the number of patients n (1 to 10), with the rejection boundary]
• X ~ B(7, 0.5): P(X = 0) = P(X = 7) ≈ 0.0078. Two-sided: 0.0156 < α. Reject H0.
• X ~ B(10, 0.5): P(X ≤ 1) = P(X ≥ 9) ≈ 0.0107. Two-sided: 0.0215 < α. Reject H0.
• P(X = 1) at n = 10 without an earlier rejection is P(X = 1) at n = 7 followed by three failures: 0.0547 × 0.5^3 ≈ 0.0068
Overall α: 2 × 0.0078 + 2 × 0.0068 ≈ 0.0293 (using unrounded probabilities).
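Both overall alphas can be checked by exact enumeration. The sketch below is our own reconstruction, not code from the lecture. It assumes that the two-sided P-value is twice the smaller tail probability, that H0 is rejected when this P-value is at most α, and that testing stops at the first rejection. The names `two_sided_p` and `overall_alpha` are hypothetical.

```python
from math import comb


def two_sided_p(k, n):
    """Two-sided P-value for k successes in n trials under H0: pi = 0.5,
    taken as twice the smaller tail probability (capped at 1)."""
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * min(lower, upper))


def overall_alpha(n_max, alpha):
    """Probability under H0 of rejecting at least once when testing
    after each of the first n_max patients at per-test level alpha."""
    alive = {0: 1.0}   # P(k successes so far and H0 not yet rejected)
    rejected = 0.0
    for n in range(1, n_max + 1):
        step = {}
        for k, prob in alive.items():
            # the next patient prefers B (k stays) or A (k + 1), each with probability 0.5
            step[k] = step.get(k, 0.0) + 0.5 * prob
            step[k + 1] = step.get(k + 1, 0.0) + 0.5 * prob
        alive = {}
        for k, prob in step.items():
            if two_sided_p(k, n) <= alpha:
                rejected += prob   # boundary hit: this path stops here
            else:
                alive[k] = prob
    return rejected


print(round(overall_alpha(10, 0.01), 5))  # 0.00781
print(round(overall_alpha(10, 0.03), 5))  # 0.0293
```

Tracking only the paths that have not yet hit the boundary is what makes the tests dependent: a path rejected at n = 7 cannot be counted again at n = 10.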

Many independent tests
We are interested in genes possibly related to a certain disease. Example: we have 100 candidate genes and compare their expressions in a group of diseased respondents with the expressions in a group of non-diseased respondents. We perform 100 (more or less) independent tests (H0: no effect). How to correct for multiple testing?

The 10 genes with smallest P-values
• No correction: α = 0.05; 14 genes are significant
• Simple Bonferroni correction: α* = 0.05/100 = 0.0005. Conclusion: only two genes are significant
Can we do better?

The False Discovery Rate (FDR), Benjamini and Hochberg, 1995
• FDR = the expected proportion of all rejected null hypotheses that have been rejected falsely
                         not significant   significant   total
True null hypotheses     U                 V             m0
False null hypotheses    T                 S             m1
Total                    m – R             R             m
Only m is known. Only R can be observed!
FDR = E(V/R)

The FDR
                         not significant   significant   total
True null hypotheses     U                 V             m0
False null hypotheses    T                 S             m1
Total                    m – R             R             m
Benjamini and Hochberg (1995): if all null hypotheses are true, so T = S = m1 = 0, then controlling the FDR equals controlling the FWER (so the overall alpha is smaller than a defined maximum).

About the FDR
• If, in reality, some of the null hypotheses are false, the FDR is smaller than the FWER. Controlling the FDR does not imply control over the FWER, but will give you more power.
• The more null hypotheses are false, the larger the gain in power

Multiple testing according to Benjamini and Hochberg: FDR procedure
• m null hypotheses: H1, H2, … , Hm
• m P-values: P1, P2, … , Pm
• Rank the P-values: P(1) ≤ P(2) ≤ … ≤ P(m)
• Find k = the largest i for which P(i) ≤ (i/m)·q, where q = chosen level of control (e.g. 0.05 or 0.1)
• Reject all H(i), i = 1, 2, … , k
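The procedure can be sketched in a few lines of Python, using the standard Benjamini-Hochberg condition P(i) ≤ (i/m)·q. The function name and the five P-values are made up for illustration; they are not the gene data from the lecture.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices (into p_values) of the null hypotheses rejected
    by the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(p_values)
    # order[0] is the index of the smallest P-value, P(1)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank   # remember the largest rank that meets its bound
    return sorted(order[:k])


# Hypothetical P-values, for illustration only
p = [0.005, 0.025, 0.028, 0.039, 0.30]
print(benjamini_hochberg(p, q=0.05))  # [0, 1, 2, 3]
```

In this made-up example the second smallest P-value (0.025) exceeds its own bound of 0.02, yet its hypothesis is still rejected because the fourth smallest (0.039) meets its bound of 0.04. A Bonferroni correction at 0.05/5 = 0.01 would reject only the first hypothesis.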

Closer look at the FDR
• The sequential FDR procedure is a bit conservative, especially if the number of false null hypotheses is relatively large
• Benjamini et al. (2001): two-step procedure in which the proportion of true null hypotheses (π0) is estimated in the first step and used to determine q in the second
• Storey (2002): direct method to estimate π0

How to estimate π0?
• π0 = m0/m
                         not significant   significant   total
True null hypotheses     U                 V             m0
False null hypotheses    T                 S             m1
Total                    m – R             R             m
What does the distribution of P-values look like if the null hypothesis is true?

H0: µ = 100
[Figure: a sample drawn while H0 is true]
What do you expect for the P-value if H0 is true?

H0: µ = 100
[Figure: distribution under H0 divided into equal areas]
If H0 is true, P(P-value < k) = k for 0 ≤ k ≤ 1: the P-value has a uniform distribution on [0, 1].
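A small simulation makes the uniform distribution visible. The sketch below is an illustration under assumptions of our own: a two-sided z-test of H0: µ = 100 with known σ = 15 and samples of size 25. None of these settings come from the lecture, and the function name is hypothetical.

```python
import random
from statistics import NormalDist


def simulate_p_values(n_sim=20000, n=25, mu=100.0, mu0=100.0, sigma=15.0, seed=1):
    """Two-sided z-test P-values for H0: mu = mu0 (sigma known),
    one per simulated sample of size n drawn from N(mu, sigma)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_sim):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / (sigma / n ** 0.5)
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


p_null = simulate_p_values()   # samples drawn while H0 is true
for k in (0.05, 0.25, 0.50):
    share = sum(p < k for p in p_null) / len(p_null)
    print(f"P(P-value < {k:.2f}) is about {share:.3f}")
```

Each printed share should be close to k. Calling `simulate_p_values(mu=106.0)` draws samples under a false null hypothesis, and the share of small P-values then becomes much larger.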

If the null hypothesis is false …
• … the P-value does not have a uniform distribution on [0, 1]; you will find small P-values relatively more often.
[Figure: histogram of the number of P-values on the interval from 0 to 1, split into P-values from the m1 false and from the m0 true null hypotheses]
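This picture suggests a simple estimate of π0, in the spirit of Storey (2002): P-values above a cut-off λ come almost exclusively from true null hypotheses, and because those are uniform, about m0·(1 – λ) of them are expected above λ. The sketch below uses that idea. The cut-off λ = 0.5, the function name and the simulated mixture are assumptions made for illustration, not values from the lecture.

```python
import random


def estimate_pi0(p_values, lam=0.5):
    """Estimate the proportion of true null hypotheses as
    #{P-values > lam} / (m * (1 - lam)), capped at 1."""
    m = len(p_values)
    n_above = sum(p > lam for p in p_values)
    return min(1.0, n_above / (m * (1 - lam)))


# Hypothetical mixture: 8000 true nulls (uniform P-values) and
# 2000 false nulls (P-values concentrated near 0), so pi0 = 0.8
rng = random.Random(1)
p_mix = [rng.random() for _ in range(8000)]
p_mix += [rng.random() ** 8 for _ in range(2000)]
print(f"estimated pi0: {estimate_pi0(p_mix):.3f}")
```

The estimate tends to come out somewhat above the true 0.8, because a few P-values from false null hypotheses also land above λ. That makes it conservative rather than exact.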

Back to our 100 genes
Find k = the largest i for which P(i) ≤ (i/m)·q.
For example, if q = 0.05: 3 genes will be significant (of which about 5% are expected to be false discoveries)

Back to our 100 genes
If we take q = 0.1? (We are willing to accept that about 10% of the selected genes are in fact false discoveries.)
The FDR procedure is a step-up procedure!

Literature
• Armitage P., McPherson K. and Rowe B. (1969). Journal of the Royal Statistical Society Series A, 132(2), 235-244.
• Austin S., Dialsingh I. and Altman N. (2014). Multiple hypothesis testing: a review. http://personal.psu.edu/nsa1/paperPdfs/Mult_Hyp_Review_final.pdf
• Benjamini Y. and Hochberg Y. (1995). Controlling the False Discovery Rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 289-300.
• Benjamini Y. and Yekutieli D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29, 1165-1188.
• Storey J.D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B, 479-498.

