Approximate Randomization tests



Approximate Randomization tests. February 5th, 2013. Classic t-test. Why AR testing? Classic tests often assume a given distribution (Student's t, normal, …) of the variable. This is ≈ OK for recall, but not for precision or F-score.


PowerPoint Slideshow about 'Approximate Randomization tests' - ashby


Presentation Transcript
Why AR testing?
  • Classic tests often assume a given distribution (Student's t, normal, …) of the variable
  • This is ≈ OK for recall, but not for precision or F-score
  • The set of hypotheses that can be tested with non-parametric tests is limited
Illustration
  • 30,000 runs, 1000 instances, 500 of class A
  • True positives (TP): 400 (stdev: 80)
  • False positives (FP): 60 (stdev: 15)
  • Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are restricted by 0 and the number of instances.
Definitions
  • Recall = truly predicted A / A in reference = truly predicted A / constant. If the number of truly predicted A is normal, recall is normal.
  • Precision = truly predicted A / A in system = TP / (TP + FP). This is a non-linear combination of TP and FP, so precision is not normal.
  • F-score: non-linear combination of recall and precision. Not normal.
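
The illustration and definitions above can be checked with a small simulation. The following is only a minimal sketch, not the presenter's code. It draws TP and FP from the stated normal distributions without clipping them to the valid range (a simplifying assumption) and uses sample skewness as a rough indicator of non-normality. The names `skewness` and `simulate` are illustrative.

```python
import random
import statistics


def skewness(values):
    """Sample skewness; close to 0 for a symmetric (e.g. normal) distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate(runs=30_000, n_class_a=500, seed=0):
    """Draw TP and FP per run and derive recall, precision and F-score."""
    rng = random.Random(seed)
    recall, precision, f_score = [], [], []
    for _ in range(runs):
        tp = rng.gauss(400, 80)  # true positives, mean and stdev from the slide
        fp = rng.gauss(60, 15)   # false positives, mean and stdev from the slide
        if tp + fp <= 0:         # guard against a degenerate draw (assumption)
            continue
        recall.append(tp / n_class_a)                   # linear in TP
        precision.append(tp / (tp + fp))                # non-linear in TP and FP
        f_score.append(2 * tp / (tp + fp + n_class_a))  # harmonic mean of both
    return recall, precision, f_score


recall, precision, f_score = simulate()
for name, values in [("recall", recall), ("precision", precision), ("F-score", f_score)]:
    print(f"{name}: skewness = {skewness(values):+.2f}")
```

Under these assumptions recall should show a skewness near zero, while precision and F-score should be visibly left-skewed.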
Approximate randomization test
  • No assumption on distribution
  • Can handle complicated statistics
  • Only assumption: independence between shuffled elements
  • References:
    • Computer-Intensive Methods for Testing Hypotheses, Noreen, 1989.
    • More accurate tests for the statistical significance of result differences, Yeh, 2000.
Basic idea
  • Exact randomization test
Exact probability

H0: expert is independent of contents

P(ncorrect ≥ 2) = 7/24 ≈ 0.29

Thus, do not reject H0, because the probability is larger than alpha = 0.05.
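
The value 7/24 can be reproduced by enumerating every permutation. This is a minimal sketch under an assumption not spelled out in the transcript: four glasses with distinct contents (since 4! = 24) of which the expert identifies two correctly. The labels `a` to `d` and the function name `exact_p_value` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def n_correct(reference, predicted):
    """Number of positions where the two labelings agree."""
    return sum(r == p for r, p in zip(reference, predicted))


def exact_p_value(reference, predicted):
    """Share of all permutations of `predicted` that score at least as well as observed."""
    observed = n_correct(reference, predicted)
    shuffles = list(permutations(predicted))
    n_ge = sum(1 for s in shuffles if n_correct(reference, s) >= observed)
    return Fraction(n_ge, len(shuffles))


contents = ["a", "b", "c", "d"]  # hypothetical true contents of four glasses
expert = ["a", "b", "d", "c"]    # hypothetical expert answers, two of four correct
print(exact_p_value(contents, expert))  # 7/24
```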

Approximate probability
  • The number of permutations is n!, so it increases quickly with n
  • If there are too many permutations to compute, use the approximation P = (nge + 1) / (NS + 1)
    • nge: number of times the pseudostatistic ≥ the actual statistic
    • NS: number of shuffles
    • +1: correction for validity
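
The approximation can be sketched as follows, reusing a toy case with four distinctly labeled glasses for which the exact value is 7/24. The data, the function name `approximate_p_value`, and the default of 9999 shuffles are assumptions made for illustration only.

```python
import random


def approximate_p_value(reference, predicted, n_shuffles=9999, seed=0):
    """Estimate P = (nge + 1) / (NS + 1) from random shuffles of `predicted`."""
    rng = random.Random(seed)
    observed = sum(r == p for r, p in zip(reference, predicted))  # actual statistic
    shuffled = list(predicted)
    n_ge = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        pseudo = sum(r == p for r, p in zip(reference, shuffled))  # pseudostatistic
        if pseudo >= observed:
            n_ge += 1
    return (n_ge + 1) / (n_shuffles + 1)


contents = ["a", "b", "c", "d"]  # hypothetical true contents
expert = ["a", "b", "d", "c"]    # hypothetical expert answers, two of four correct
print(approximate_p_value(contents, expert))  # should land close to 7/24
```

For a case this small the exact test is preferable; the sketch only shows that the estimate converges to the exact value.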
Translation to instances
  • Each glass is an instance
  • Contents and expert are two labeling systems
  • Contents has an accuracy of 100%, expert has an accuracy of 50%
  • The statistic is precision, F-score, recall, … instead of accuracy
Stratified shuffling
  • For labeled instances, it makes no sense to shuffle the class label of one instance to another
  • Only shuffle labels per instance: the two systems' labels for the same instance may be swapped
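
Stratified shuffling for two labeling systems can be sketched as below. This is an illustrative sketch, not the script referenced in this presentation. It assumes a two-sided test on the absolute difference in precision for one label, and it defines precision as 0.0 when a system never predicts that label. The names `precision` and `art_precision` are hypothetical.

```python
import random


def precision(gold, predicted, label):
    """Precision for `label`; defined as 0.0 when the label is never predicted."""
    n_predicted = sum(1 for p in predicted if p == label)
    if n_predicted == 0:
        return 0.0
    n_correct = sum(1 for g, p in zip(gold, predicted) if p == label and g == label)
    return n_correct / n_predicted


def art_precision(gold, sys1, sys2, label, n_shuffles=999, seed=0):
    """Approximate randomization test on the absolute precision difference."""
    rng = random.Random(seed)
    actual = abs(precision(gold, sys1, label) - precision(gold, sys2, label))
    n_ge = 0
    for _ in range(n_shuffles):
        shuf1, shuf2 = [], []
        for a, b in zip(sys1, sys2):
            if rng.random() < 0.5:  # swap the two outputs for this instance only
                a, b = b, a
            shuf1.append(a)
            shuf2.append(b)
        pseudo = abs(precision(gold, shuf1, label) - precision(gold, shuf2, label))
        if pseudo >= actual:
            n_ge += 1
    return (n_ge + 1) / (n_shuffles + 1)


gold = ["A"] * 50 + ["B"] * 50
perfect = list(gold)   # hypothetical system that is always right
all_a = ["A"] * 100    # hypothetical system that always answers A
print(art_precision(gold, perfect, all_a, "A"))
```

Because labels never move between instances, the gold label of each instance stays aligned with both shuffled outputs.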
MBT
  • Assumption of independence between instances
  • Shuffle per sentence rather than per token
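
Shuffling per sentence can be sketched as a swap of whole sentences, so that tokens within a sentence, which are not independent of each other, always stay together. The function name `shuffle_per_sentence` and the tag values are hypothetical, and the sketch assumes both systems tagged the same sentences in the same order.

```python
import random


def shuffle_per_sentence(sys1_sentences, sys2_sentences, rng):
    """Swap the two systems' outputs for whole sentences, each with probability 0.5."""
    shuf1, shuf2 = [], []
    for tags1, tags2 in zip(sys1_sentences, sys2_sentences):
        if rng.random() < 0.5:
            tags1, tags2 = tags2, tags1
        shuf1.extend(tags1)  # flatten back to token level for scoring
        shuf2.extend(tags2)
    return shuf1, shuf2


sys1 = [["DET", "NOUN"], ["VERB"]]  # hypothetical tagger output, two sentences
sys2 = [["DET", "VERB"], ["NOUN"]]
print(shuffle_per_sentence(sys1, sys2, random.Random(0)))
```

The flattened outputs can then be scored with any token-level statistic inside the usual shuffle loop.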
Term extraction
  • Shuffling extracted terms between the outputs of two term extraction systems
Script
  • http://www.clips.ua.ac.be/~vincent/software.html#art
  • http://www.clips.ua.ac.be/scripts/art
  • Options:
    • Exact and approximate randomization tests
    • Instance based, also for MBT
    • Term extraction based
    • Stratified shuffling
    • Two-sided / one-sided (check code!)
Remarks on usage
  • It makes no sense to shuffle if the exact randomization can be computed
  • The value of p depends on NS. The larger NS, the lower p can be (the minimum attainable value is 1 / (NS + 1))
  • Validity check
    • Sign test
    • Re-test: to alleviate bad randomization
Sign test
  • Can be compared with P for accuracy
  • H0: correctness is independent of system, i.e. P(green) = 0.5
  • Binomial test
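
A sign test of this kind can be sketched with an exact binomial computation. The sketch assumes a two-sided test that counts only the instances where exactly one of the two systems is correct. The function name `sign_test` and the toy data are hypothetical.

```python
from math import comb


def sign_test(gold, sys1, sys2):
    """Two-sided binomial sign test on instances where exactly one system is correct."""
    only1 = sum(1 for g, a, b in zip(gold, sys1, sys2) if a == g and b != g)
    only2 = sum(1 for g, a, b in zip(gold, sys1, sys2) if b == g and a != g)
    n = only1 + only2
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    k = max(only1, only2)
    # Upper tail of Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


gold = ["A"] * 10
sys1 = ["A"] * 8 + ["B"] * 2  # correct on the first eight instances only
sys2 = ["B"] * 8 + ["A"] * 2  # correct on the last two instances only
print(sign_test(gold, sys1, sys2))  # 0.109375
```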
Interpretation (1)
  • How much do these two systems differ based on precision for the A label?
  • Maximally
  • Intermediate
  • Minimally
Conclusion
  • Approximate randomization testing can be used for many applications.
  • The basic idea is to determine how (im)probable the actual difference between two systems is when all possible permutations of the outputs are evaluated.
  • The difference can be computed in many ways, as long as the shuffled elements are independent.