
Approximate Randomization tests - PowerPoint PPT Presentation

Approximate Randomization tests. February 5th, 2013. Classic t-test. Why AR testing? Classic tests often assume a given distribution (Student's t, normal, …) of the variable. This is ≈ok for recall, but not for precision or F-score.


Approximate Randomization tests

February 5th, 2013

Why AR testing?

• Classic tests often assume a given distribution (Student's t, normal, …) of the variable

• This is ≈ok for recall, but not for precision or F-score

• The set of hypotheses that can be tested with non-parametric tests is limited

Illustration

• 30,000 runs, 1000 instances, 500 of class A

• True positives (TP): 400 (stdev: 80)

• False positives (FP): 60 (stdev: 15)

• Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are restricted by 0 and the number of instances.

Definitions

• Recall = truly predicted A / A in reference = truly predicted A / constant. If truly predicted A is normal, recall is normal.

• Precision = truly predicted A / A in system. A in system = TP + FP, so precision is a non-linear combination of TP and FP. Precision is not normal.

• F-score: non-linear combination of recall and precision. Not normal.
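The illustration and definitions above can be reproduced with a small simulation. This is a minimal sketch, not the original experiment: it draws TP and FP from the normal distributions given on the Illustration slide, does not clip them to the valid range, and uses sample skewness (0 for a symmetric distribution) as a rough indicator of non-normality. The function names `skewness` and `simulate` are chosen here for illustration.

```python
import random
import statistics


def skewness(values):
    """Sample skewness; close to 0 for a symmetric (e.g. normal) sample."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate(runs=30000, n_a=500, seed=0):
    """Draw TP and FP per run and derive recall, precision and F-score."""
    rng = random.Random(seed)
    # Assumption from the slide: TP ~ N(400, 80), FP ~ N(60, 15), unclipped.
    tp = [rng.gauss(400, 80) for _ in range(runs)]
    fp = [rng.gauss(60, 15) for _ in range(runs)]
    recall = [t / n_a for t in tp]                     # TP / constant: linear
    precision = [t / (t + f) for t, f in zip(tp, fp)]  # non-linear in TP, FP
    f_score = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
    return tp, recall, precision, f_score


if __name__ == "__main__":
    tp, recall, precision, f_score = simulate()
    for name, sample in [("TP", tp), ("recall", recall),
                         ("precision", precision), ("F-score", f_score)]:
        print(f"{name:10s} skewness = {skewness(sample):+.3f}")
```

Because recall is TP divided by a constant, its skewness equals that of TP; precision and F-score are not linear in TP and FP, so their skewness is free to differ.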

Approximate randomization test

• No assumption on distribution

• Can handle complicated statistics

• Only assumption: independence between shuffled elements

• References:

• Computer Intensive Methods for Testing Hypotheses, Noreen, 1989.

• More accurate tests for the statistical significance of result differences, Yeh, 2000.

Basic idea

• Exact randomization test

Exact probability

H0: expert is independent of contents

P(ncorrect ≥ 2) = 7/24 = 0.29

Thus, do not reject H0 because the probability is larger than alpha = 0.05.
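The exact probability can be checked by enumerating every permutation. The sketch below assumes a setup that is consistent with the numbers on the slide but not spelled out in the text: 4 glasses (4! = 24 permutations), an expert who assigns each of the 4 contents to exactly one glass, and 2 correct assignments observed. The function name `exact_p` is chosen here for illustration.

```python
from fractions import Fraction
from itertools import permutations


def exact_p(n_glasses, n_correct_observed):
    """Exact P(ncorrect >= observed) under H0, over all n! permutations."""
    total = 0
    at_least = 0
    for perm in permutations(range(n_glasses)):
        total += 1
        # A guess is correct when the permuted label matches the true one.
        n_correct = sum(1 for true, guess in enumerate(perm) if true == guess)
        if n_correct >= n_correct_observed:
            at_least += 1
    return Fraction(at_least, total)


if __name__ == "__main__":
    print(exact_p(4, 2))  # prints 7/24
```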

Approximate probability

• The number of permutations is n! => quick increase of the number of permutations

• If there are too many permutations to compute, use the approximation P = (nge + 1) / (NS + 1)

• nge: number of times the pseudostatistic ≥ the actual statistic

• NS: number of shuffles

• +1: correction for validity
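The approximation P = (nge + 1) / (NS + 1) can be sketched by replacing the full enumeration with NS random shuffles. The example reuses the glass setting under the assumption of 4 glasses and 2 correct guesses (inferred from 7/24, not stated in the text); the function name `approximate_p` and its parameters are hypothetical.

```python
import random


def approximate_p(n_items, observed, ns=10000, seed=0):
    """Approximate randomization p-value: (nge + 1) / (NS + 1)."""
    rng = random.Random(seed)
    labels = list(range(n_items))
    nge = 0  # number of shuffles with pseudostatistic >= actual statistic
    for _ in range(ns):
        rng.shuffle(labels)
        pseudo = sum(1 for true, guess in enumerate(labels) if true == guess)
        if pseudo >= observed:
            nge += 1
    return (nge + 1) / (ns + 1)


if __name__ == "__main__":
    # Should land near the exact value 7/24 for a large number of shuffles.
    print(approximate_p(4, 2, ns=10000))
```

For a problem this small the exact test is preferable; the shuffle-based estimate is only needed when n! is too large to enumerate.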

Translation to instances

• Each glass is an instance

• Contents and expert are two labeling systems

• Contents has an accuracy of 100%, expert has an accuracy of 50%

• Statistic is precision, F-score, recall, … instead of accuracy

Stratified shuffling

• For labeled instances, it makes no sense to shuffle the class label of one instance to another

• Only shuffle labels per instance
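Stratified shuffling for two labeling systems can be sketched as follows: for every instance, the two systems' labels are swapped with probability 0.5, so a label never moves to a different instance. This is an illustrative sketch, not the code of the script referenced in these slides; it uses a two-sided test on the absolute difference in precision for label A, and the names `precision` and `art_two_systems` are hypothetical.

```python
import random


def precision(gold, predicted, label="A"):
    """Precision for one label; defined as 0.0 if the label is never predicted."""
    gold_of_predicted = [g for g, p in zip(gold, predicted) if p == label]
    if not gold_of_predicted:
        return 0.0
    return sum(1 for g in gold_of_predicted if g == label) / len(gold_of_predicted)


def art_two_systems(gold, sys1, sys2, statistic=precision, ns=1000, seed=0):
    """Two-sided approximate randomization test with per-instance swaps."""
    rng = random.Random(seed)
    actual = abs(statistic(gold, sys1) - statistic(gold, sys2))
    nge = 0
    for _ in range(ns):
        pseudo1, pseudo2 = [], []
        for a, b in zip(sys1, sys2):
            if rng.random() < 0.5:  # swap the two labels of this instance only
                a, b = b, a
            pseudo1.append(a)
            pseudo2.append(b)
        pseudo = abs(statistic(gold, pseudo1) - statistic(gold, pseudo2))
        if pseudo >= actual:
            nge += 1
    return (nge + 1) / (ns + 1)


if __name__ == "__main__":
    gold = ["A", "B"] * 50
    sys1 = list(gold)                              # perfect system
    sys2 = ["A"] * 100                             # always predicts A
    print(art_two_systems(gold, sys1, sys2))
```

Any statistic with the same call signature (recall, F-score, accuracy) can be passed in place of `precision`.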

MBT

• Assumption of independence between instances

• Shuffle per sentence rather than per token
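Shuffling per sentence rather than per token can be sketched as below, assuming each system's output is a list of sentences and each sentence is a list of token labels. Tokens within a sentence are not independent, so the whole sentence is swapped as one unit. The data layout and the name `shuffle_per_sentence` are assumptions made for illustration.

```python
import random


def shuffle_per_sentence(sents1, sents2, rng):
    """Swap complete sentences between two outputs with probability 0.5."""
    pseudo1, pseudo2 = [], []
    for s1, s2 in zip(sents1, sents2):
        if rng.random() < 0.5:  # the sentence, not the token, is the unit
            s1, s2 = s2, s1
        pseudo1.append(s1)
        pseudo2.append(s2)
    return pseudo1, pseudo2


if __name__ == "__main__":
    out1 = [["DET", "NOUN"], ["PRON", "VERB", "NOUN"]]
    out2 = [["DET", "VERB"], ["PRON", "VERB", "ADJ"]]
    print(shuffle_per_sentence(out1, out2, random.Random(0)))
```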

Term extraction

• Shuffling extracted terms between the output of two term extraction systems

Script

• http://www.clips.ua.ac.be/~vincent/software.html#art

• http://www.clips.ua.ac.be/scripts/art

• Options:

• Exact and approximate randomization tests

• Instance based, also for MBT

• Term extraction based

• Stratified shuffling

• Two-sided / one-sided (check code!)

Remarks on usage

• It makes no sense to shuffle if exact randomization can be computed

• The value of p depends on NS. The larger NS, the lower p can be

• Validity check

• Sign-test
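The dependence of p on NS follows directly from P = (nge + 1) / (NS + 1): even when no shuffle reaches the actual statistic (nge = 0), p cannot drop below 1 / (NS + 1). A small sketch of that arithmetic, with the hypothetical helper name `lowest_p`:

```python
from fractions import Fraction


def lowest_p(ns):
    """Smallest attainable p-value for NS shuffles (the case nge = 0)."""
    return Fraction(1, ns + 1)


if __name__ == "__main__":
    for ns in (9, 99, 999, 9999):
        print(ns, lowest_p(ns))
```

NS therefore has to be chosen large enough for the significance level of interest to be reachable at all.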

Sign test

• Can be compared with P for accuracy

• H0: correctness is independent of system, i.e. P(green) = 0.5

• Binomial test
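The sign test can be sketched as a two-sided binomial test with success probability 0.5. The sketch assumes that only instances on which exactly one of the two systems is correct are counted, which is the usual form of the sign test; the slides do not spell this out. The function name `sign_test` and its parameters are hypothetical.

```python
from math import comb


def sign_test(n_only_sys1, n_only_sys2):
    """Two-sided binomial test of H0: P(system 1 is the correct one) = 0.5."""
    n = n_only_sys1 + n_only_sys2
    if n == 0:
        return 1.0  # no disagreements, so no evidence against H0
    k = min(n_only_sys1, n_only_sys2)
    # Probability of k or fewer "wins" out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # 3 instances where only system 1 is correct, 15 where only system 2 is.
    print(sign_test(3, 15))
```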

Interpretation (1)

• How much do these two systems differ based on precision for the A label?

• Maximally

• Intermediate

• Minimally

Conclusion

• Approximate randomization testing can be used for many applications.

• The basic idea is that the actual difference between two systems is (im)probable to occur when all possible permutations of the outputs are evaluated.

• The difference can be computed in many ways as long as the shuffled elements are independent.