Approximate randomization tests

Approximate randomization tests. February 5th, 2013. Classic t-test. Why AR testing? Classic tests often assume a given distribution (Student's t, normal, …) of the variable. This is ≈ OK for recall, but not for precision or F-score.

Presentation Transcript


Approximate randomization tests

February 5th, 2013


Classic t-test


Why AR testing?

  • Classic tests often assume a given distribution (Student's t, normal, …) of the variable

  • This is ≈ OK for recall, but not for precision or F-score

  • The set of hypotheses that can be tested with non-parametric tests is limited


Illustration

  • 30,000 runs, 1000 instances, 500 of class A

  • True positives (TP): 400 (stdev: 80)

  • False positives (FP): 60 (stdev: 15)

  • Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are restricted by 0 and the number of instances.


Definitions

  • Recall = correctly predicted A / A in reference = correctly predicted A / constant. If the number of correctly predicted A is normal, recall is normal.

  • Precision = correctly predicted A / A in system. A in system = TP + FP, so precision is a non-linear combination of TP and FP. Precision is not normal.

  • F-score: non-linear combination of recall and precision. Not normal.
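The illustration and definitions above can be checked with a small simulation. The following is a minimal sketch, not the author's code: it draws TP and FP from the normal distributions given in the illustration, derives recall, precision and F-score, and reports the sample skewness of each. The function names, the seed, and the choice to discard draws outside [0, number of instances] are assumptions made for this example. Under these assumptions one would expect a skewness near zero for recall and a clearly non-zero skewness for precision.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment divided by the cubed stdev."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def simulate(n_runs=30_000, n_instances=1000, n_class_a=500, seed=0):
    """Draw TP and FP per run and derive recall, precision and F-score."""
    rng = random.Random(seed)
    recall, precision, f_score = [], [], []
    while len(recall) < n_runs:
        tp = rng.gauss(400, 80)  # true positives, as in the illustration
        fp = rng.gauss(60, 15)   # false positives, as in the illustration
        # Assumption: discard draws outside (0, n_instances]; tp > 0 also
        # keeps the divisions below well defined.
        if not (0 < tp <= n_instances and 0 <= fp <= n_instances):
            continue
        r = tp / n_class_a       # denominator is a constant
        p = tp / (tp + fp)       # denominator varies from run to run
        recall.append(r)
        precision.append(p)
        f_score.append(2 * p * r / (p + r))
    return recall, precision, f_score


recall, precision, f_score = simulate()
for name, values in (("recall", recall),
                     ("precision", precision),
                     ("F-score", f_score)):
    print(f"{name:9s} skewness = {skewness(values):+.3f}")
```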


Approximate randomization test

  • No assumption on distribution

  • Can handle complicated statistics

  • Only assumption: independence between shuffled elements

  • References:

    • Computer Intensive Methods for Testing Hypotheses, Noreen, 1989.

    • More accurate tests for the statistical significance of result differences, Yeh, 2000.


Basic idea

  • Exact randomization test


Exact probability

H0: expert is independent of contents

P(ncorrect ≥ 2) = 7/24 = 0.29

Thus, do not reject H0, because the probability is larger than alpha = 0.05.
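The value 7/24 can be reproduced by enumerating every permutation. The sketch below assumes a setup that the slide's figure no longer shows: four glasses with four distinct contents (so 4! = 24 permutations) and an expert who labels two of them correctly. The labels "a" to "d" and the function name are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def exact_p_value(contents, expert):
    """Exact randomization test on the number of correct labels."""
    actual = sum(c == e for c, e in zip(contents, expert))
    n_total = 0
    n_ge = 0
    # Under H0 every reordering of the expert's labels is equally likely.
    for shuffled in permutations(expert):
        n_total += 1
        pseudo = sum(c == e for c, e in zip(contents, shuffled))
        if pseudo >= actual:
            n_ge += 1
    return Fraction(n_ge, n_total)


contents = ["a", "b", "c", "d"]  # hypothetical true contents of the glasses
expert = ["a", "b", "d", "c"]    # hypothetical expert labels, two correct
print(exact_p_value(contents, expert))  # 7/24
```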


Approximate probability

  • The number of permutations is n!, so it increases quickly with n

  • If there are too many permutations to compute, use the approximation P = (nge + 1) / (NS + 1)

    • nge: number of times the pseudostatistic ≥ the actual statistic

    • NS: number of shuffles

    • +1: correction for validity
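The approximation can be sketched as a loop over random shuffles instead of all n! permutations. This is an illustrative sketch under the same hypothetical four-glass setup (four distinct contents, two correct expert labels); the function name, the default NS and the seed are assumptions. With enough shuffles the result should land close to the exact 7/24.

```python
import random


def approximate_p_value(contents, expert, n_shuffles=10_000, seed=0):
    """Approximate randomization test: P = (nge + 1) / (NS + 1)."""
    rng = random.Random(seed)

    def n_correct(labels):
        return sum(c == e for c, e in zip(contents, labels))

    actual = n_correct(expert)
    shuffled = list(expert)
    nge = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)             # one random permutation
        if n_correct(shuffled) >= actual:
            nge += 1                      # pseudostatistic >= actual statistic
    return (nge + 1) / (n_shuffles + 1)


contents = ["a", "b", "c", "d"]  # hypothetical true contents
expert = ["a", "b", "d", "c"]    # hypothetical expert labels, two correct
print(approximate_p_value(contents, expert))
```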


Different setups


Translation to instances

  • Each glass is an instance

  • Contents and expert are two labeling systems

  • Contents has an accuracy of 100%, expert has an accuracy of 50%

  • Statistic is precision, F-score, recall, … instead of accuracy


Stratified shuffling

  • For labeled instances, it makes no sense to move the class label of one instance to another instance

  • Only shuffle labels per instance, i.e. between the two systems' outputs for the same instance
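Stratified shuffling can be sketched as follows: for every instance, the labels of the two systems are swapped with probability 0.5, and the statistic is recomputed on the shuffled outputs. This is a minimal sketch and not the script referenced in these slides. The function names, the use of the absolute difference (which makes the test two-sided), the small floating-point tolerance and the toy data are all assumptions.

```python
import random


def precision(gold, predicted, label="A"):
    """Precision for one label; 0.0 if the label is never predicted."""
    n_predicted = sum(p == label for p in predicted)
    n_correct = sum(g == label and p == label
                    for g, p in zip(gold, predicted))
    return n_correct / n_predicted if n_predicted else 0.0


def stratified_ar_test(gold, system1, system2, statistic=precision,
                       n_shuffles=999, seed=0):
    """Approximate randomization test with per-instance label swaps."""
    rng = random.Random(seed)
    actual = abs(statistic(gold, system1) - statistic(gold, system2))
    nge = 0
    for _ in range(n_shuffles):
        shuffled1, shuffled2 = [], []
        for label1, label2 in zip(system1, system2):
            # Swap the two systems' labels for this instance only.
            if rng.random() < 0.5:
                label1, label2 = label2, label1
            shuffled1.append(label1)
            shuffled2.append(label2)
        pseudo = abs(statistic(gold, shuffled1) - statistic(gold, shuffled2))
        if pseudo >= actual - 1e-12:  # tolerance for float rounding
            nge += 1
    return (nge + 1) / (n_shuffles + 1)


# Toy data: one system is always right, the other always predicts "A".
gold = ["A"] * 50 + ["B"] * 50
good = list(gold)
bad = ["A"] * 100
print(stratified_ar_test(gold, good, bad))
```

Because labels never move between instances, the reference labels stay aligned with the predictions in every shuffle.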


MBT

  • Assumption of independence between instances

  • Shuffle per sentence rather than per token
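Tokens within a sentence are not independent of each other, so the unit that gets swapped is the whole sentence. A minimal sketch of one such shuffle, assuming each system's output is a list of sentences and each sentence is a list of tags (the data layout, the tags and the function name are hypothetical):

```python
import random


def shuffle_per_sentence(sentences1, sentences2, rng):
    """Swap whole sentences between two tagger outputs with probability 0.5."""
    shuffled1, shuffled2 = [], []
    for sent1, sent2 in zip(sentences1, sentences2):
        if rng.random() < 0.5:
            sent1, sent2 = sent2, sent1  # all tokens of a sentence move together
        shuffled1.append(sent1)
        shuffled2.append(sent2)
    return shuffled1, shuffled2


tagger1 = [["DET", "N", "V"], ["N", "V"]]  # hypothetical output of system 1
tagger2 = [["DET", "V", "V"], ["N", "N"]]  # hypothetical output of system 2
print(shuffle_per_sentence(tagger1, tagger2, random.Random(0)))
```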


Term extraction

  • Shuffling extracted terms between the outputs of two term extraction systems


Script

  • http://www.clips.ua.ac.be/~vincent/software.html#art

  • http://www.clips.ua.ac.be/scripts/art

  • Options:

    • Exact and approximate randomization tests

    • Instance-based, also for MBT

    • Term-extraction-based

    • Stratified shuffling

    • Two-sided / one-sided (check code!)


Remarks on usage

  • It makes no sense to shuffle if the exact randomization test can be computed

  • The value of p depends on NS: the larger NS, the lower p can be (with nge = 0, the minimum is 1 / (NS + 1))

  • Validity check

    • Sign test

    • Re-test: to alleviate bad randomization


Sign test

  • Can be compared with P for accuracy

  • H0: correctness is independent of system, i.e. P(green) = 0.5

  • Binomial test
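The sign test can be sketched as an exact two-sided binomial test. The sketch assumes the usual sign-test convention, which the slides do not spell out: only instances on which exactly one of the two systems is correct are counted, and under H0 each system is equally likely to be the correct one. The function and parameter names are hypothetical.

```python
from math import comb


def sign_test(n_only_first, n_only_second):
    """Two-sided exact binomial test with success probability 0.5.

    n_only_first:  instances where only the first system is correct
    n_only_second: instances where only the second system is correct
    """
    n = n_only_first + n_only_second
    if n == 0:
        return 1.0  # the systems never disagree on correctness
    k = min(n_only_first, n_only_second)
    # Probability of a split at least as uneven as the observed one, one side.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(sign_test(1, 3))  # 0.625
```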


Interpretation (1)

  • How much do these two systems differ based on precision for the A label?

  • Maximally

  • Intermediate

  • Minimally


Interpretation (2)


Conclusion

  • Approximate randomization testing can be used for many applications.

  • The basic idea is to assess how (im)probable the actual difference between two systems is when all possible permutations of the outputs are evaluated.

  • The difference can be computed in many ways, as long as the shuffled elements are independent.

