Approximate Randomization tests


February 5th, 2013

- Classic tests often assume a given distribution (Student's t, normal, …) of the variable
- This is ≈ OK for recall, but not for precision or F-score
- The set of hypotheses that can be tested with non-parametric tests is limited

- 30,000 runs, 1000 instances, 500 of class A
- True positives (TP): 400 (stdev: 80)
- False positives (FP): 60 (stdev: 15)
- Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are bounded by 0 and the number of instances.

- Recall = correctly predicted A / A in reference = correctly predicted A / constant. If the number of correctly predicted A is normal, recall is normal.
- Precision = correctly predicted A / A in system. A in system = TP + FP, so precision = TP / (TP + FP) is a non-linear combination of TP and FP. Precision is not normal.
- F-score: non-linear combination of recall and precision. Not normal.
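The argument above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: TP and FP are drawn independently from the normal distributions given on the previous slide, clipped to their valid ranges (FP is capped at the 500 instances that are not of class A, which is an assumption), and sample skewness is used as a rough indicator of departure from normality. The function names `simulate` and `skewness` are hypothetical.

```python
import random
import statistics


def simulate(runs=30_000, n_class_a=500, n_other=500, seed=0):
    """Draw TP and FP per run and derive recall, precision and F-score."""
    rng = random.Random(seed)
    recalls, precisions, f_scores = [], [], []
    for _ in range(runs):
        # Normal draws, clipped because counts cannot leave their valid range.
        tp = min(max(rng.gauss(400, 80), 0.0), n_class_a)
        fp = min(max(rng.gauss(60, 15), 0.0), n_other)
        if tp + fp == 0:
            continue  # precision is undefined when nothing is predicted as A
        recall = tp / n_class_a            # linear in TP
        precision = tp / (tp + fp)         # non-linear in TP and FP
        if precision + recall > 0:
            f_score = 2 * precision * recall / (precision + recall)
        else:
            f_score = 0.0
        recalls.append(recall)
        precisions.append(precision)
        f_scores.append(f_score)
    return recalls, precisions, f_scores


def skewness(values):
    """Sample skewness; roughly 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return sum(((v - mean) / stdev) ** 3 for v in values) / len(values)


if __name__ == "__main__":
    for name, values in zip(("recall", "precision", "F-score"), simulate()):
        print(f"{name}: skewness = {skewness(values):.3f}")
```

Comparing the three skewness values gives a quick impression of how far each measure departs from a symmetric, normal-like shape. A histogram or a formal normality test would be a more thorough check.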

- No assumption on distribution
- Can handle complicated statistics
- Only assumption: independence between shuffled elements
- References:
- Computer Intensive Methods for Testing Hypotheses, Noreen, 1989.
- More accurate tests for the statistical significance of result differences, Yeh, 2000.

- Exact randomization test

H0: expert is independent of contents

P(ncorrect ≥ 2) = 7/24 ≈ 0.29

Thus, do not reject H0, because the probability is larger than alpha = 0.05.
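The value 7/24 can be reproduced by enumerating every permutation. The sketch below assumes four glasses with four distinct contents, of which the expert labels two correctly. The slide's figure is lost, so this setup is an assumption, chosen because it yields 4! = 24 permutations and matches the reported probability. The labels and the function name `exact_randomization_p` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def exact_randomization_p(contents, expert):
    """P(n_correct >= observed) over all permutations of the expert labels."""
    observed = sum(c == e for c, e in zip(contents, expert))
    n_ge = 0
    total = 0
    for shuffled in permutations(expert):
        total += 1
        n_correct = sum(c == s for c, s in zip(contents, shuffled))
        if n_correct >= observed:
            n_ge += 1
    return Fraction(n_ge, total)


# Assumed data: four distinct contents, expert gets two of four right.
contents = ["a", "b", "c", "d"]
expert = ["a", "b", "d", "c"]

p = exact_randomization_p(contents, expert)
print(p)  # 7/24
```

Of the 24 permutations, one matches all four glasses and six match exactly two, which gives the 7 in the numerator.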

- The number of permutations is n!, so it grows quickly with n
- If there are too many permutations to compute, use the approximation P = (nge + 1) / (NS + 1)
- nge: number of times the pseudostatistic ≥ the actual statistic
- NS: number of shuffles
- +1: correction for validity
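A minimal sketch of the approximation, using random shuffles instead of a full enumeration. It reuses the glass example under the same assumption as before (four distinct contents, expert correct on two). The function name and the default number of shuffles are hypothetical choices.

```python
import random


def approximate_randomization_p(contents, expert, n_shuffles=10_000, seed=0):
    """Estimate P(n_correct >= observed) as (nge + 1) / (NS + 1)."""
    rng = random.Random(seed)
    observed = sum(c == e for c, e in zip(contents, expert))
    pseudo = list(expert)
    n_ge = 0
    for _ in range(n_shuffles):
        rng.shuffle(pseudo)  # one random permutation of the expert labels
        pseudo_stat = sum(c == s for c, s in zip(contents, pseudo))
        if pseudo_stat >= observed:
            n_ge += 1
    # The +1 in numerator and denominator keeps p strictly above zero.
    return (n_ge + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    contents = ["a", "b", "c", "d"]
    expert = ["a", "b", "d", "c"]
    print(approximate_randomization_p(contents, expert))
```

With enough shuffles the estimate should land close to the exact 7/24. Because of the +1 correction, the smallest value the estimate can take is 1 / (NS + 1), so a small NS limits how low p can go.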

- Each glass is an instance
- Contents and expert are two labeling systems
- Contents has an accuracy of 100%, expert has an accuracy of 50%
- The statistic can be precision, F-score, recall, … instead of accuracy

- For labeled instances, it makes no sense to move the class label of one instance to another instance
- Only shuffle labels per instance, i.e. between the two systems' outputs for the same instance
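Per-instance shuffling can be sketched as follows: for each instance, the labels of the two systems are exchanged with probability 0.5, and the statistic is recomputed on the two pseudo-outputs. This is an illustrative sketch, not the code of the tool presented in these slides. It assumes a two-sided test on the absolute difference in precision for one label, and the names `precision` and `art_p_value` are hypothetical.

```python
import random


def precision(gold, predicted, label="A"):
    """Precision for one label; defined as 0.0 when the label is never predicted."""
    n_predicted = sum(p == label for p in predicted)
    if n_predicted == 0:
        return 0.0
    n_correct = sum(g == label and p == label for g, p in zip(gold, predicted))
    return n_correct / n_predicted


def art_p_value(gold, sys1, sys2, stat=precision, n_shuffles=1000, seed=0):
    """Approximate randomization test with per-instance label swaps."""
    rng = random.Random(seed)
    observed = abs(stat(gold, sys1) - stat(gold, sys2))
    n_ge = 0
    for _ in range(n_shuffles):
        pseudo1, pseudo2 = [], []
        for label1, label2 in zip(sys1, sys2):
            # Swap the two systems' labels for this instance with probability 0.5.
            if rng.random() < 0.5:
                label1, label2 = label2, label1
            pseudo1.append(label1)
            pseudo2.append(label2)
        pseudo_diff = abs(stat(gold, pseudo1) - stat(gold, pseudo2))
        if pseudo_diff >= observed - 1e-12:  # tolerance for float rounding
            n_ge += 1
    return (n_ge + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    gold = ["A"] * 10 + ["B"] * 10
    system1 = list(gold)        # hypothetical perfect system
    system2 = ["A"] * 20        # hypothetical system that always predicts A
    print(art_p_value(gold, system1, system2))
```

Any other statistic with the same signature (recall, F-score, …) can be passed as `stat`, which is the main attraction of the method.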

- Assumption of independence between instances
- Shuffle per sentence rather than per token
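When tokens within a sentence are not independent, the swap can be applied to whole sentences. The sketch below assumes each system output is a list of sentences, each sentence being a list of token labels, and uses token accuracy as the statistic. Both choices, and all function names, are hypothetical.

```python
import random


def flatten(sentences):
    """Concatenate the token labels of all sentences."""
    return [label for sentence in sentences for label in sentence]


def accuracy(gold_labels, predicted_labels):
    """Fraction of tokens whose predicted label equals the gold label."""
    n_correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return n_correct / len(gold_labels)


def sentence_level_p(gold, sys1, sys2, n_shuffles=1000, seed=0):
    """Approximate randomization with swaps of whole sentences."""
    rng = random.Random(seed)
    gold_flat = flatten(gold)
    observed = abs(accuracy(gold_flat, flatten(sys1))
                   - accuracy(gold_flat, flatten(sys2)))
    n_ge = 0
    for _ in range(n_shuffles):
        pseudo1, pseudo2 = [], []
        for sent1, sent2 in zip(sys1, sys2):
            # All tokens of a sentence move together, never individually.
            if rng.random() < 0.5:
                sent1, sent2 = sent2, sent1
            pseudo1.extend(sent1)
            pseudo2.extend(sent2)
        pseudo_diff = abs(accuracy(gold_flat, pseudo1)
                          - accuracy(gold_flat, pseudo2))
        if pseudo_diff >= observed - 1e-12:  # tolerance for float rounding
            n_ge += 1
    return (n_ge + 1) / (n_shuffles + 1)
```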

- Shuffling extracted terms between the outputs of two term extraction systems

- http://www.clips.ua.ac.be/~vincent/software.html#art
- http://www.clips.ua.ac.be/scripts/art
- Options:
- Exact and approximate randomization tests
- Instance-based, also for MBT
- Term-extraction-based
- Stratified shuffling
- Two-sided / one-sided (check code!)

- It makes no sense to shuffle if the exact randomization can be computed
- The value of p depends on NS: the larger NS, the lower p can be
- Validity check
- Sign test
- Re-test: to alleviate bad randomization

- Can be compared with P for accuracy
- H0: correctness is independent of system, i.e. P(green) = 0.5
- Binomial test
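A binomial (sign) test of this kind can be sketched as follows. The sketch assumes that only instances on which exactly one of the two systems is correct are counted, that under H0 each such instance favours either system with probability 0.5, and that a two-sided p-value is wanted. The function name is hypothetical.

```python
from math import comb


def sign_test_p(gold, sys1, sys2):
    """Two-sided binomial test on instances where exactly one system is correct."""
    only1 = sum(a == g and b != g for g, a, b in zip(gold, sys1, sys2))
    only2 = sum(b == g and a != g for g, a, b in zip(gold, sys1, sys2))
    n = only1 + only2
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    k = max(only1, only2)
    # Upper tail P(X >= k) for X ~ Binomial(n, 0.5), doubled for two sides.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For an accuracy-based comparison, this p-value gives a reference point against which the approximate randomization result can be checked.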

- How much do these two systems differ based on precision for the A label?
- Maximally
- Intermediate
- Minimally

- Approximate randomization testing can be used for many applications.
- The basic idea is to check how (im)probable the actual difference between two systems is among the differences obtained when all possible permutations of the outputs are evaluated.
- The difference can be computed in many ways, as long as the shuffled elements are independent.