EXPERIMENTAL TECHNIQUES & EVALUATION IN NLP



Presentation Transcript


  1. EXPERIMENTAL TECHNIQUES & EVALUATION IN NLP Università di Venezia, 1 October 2003

  2. The rise of empiricism Up until the 1980s, computational linguistics (CL) was primarily a theoretical discipline; experimental methodology now receives much more attention

  3. Empirical methodology & evaluation • Starting with the big US ASR competitions of the 1980s, evaluation has progressively become a central component of work in NLP: the DARPA Speech initiative, MUC, TREC • GOOD: it is much easier for the community (and for researchers themselves) to understand which proposals are real improvements • BAD: too much focus on small improvements; nobody can afford to try an entirely new technique (it may not lead to improvements for a couple of years!)

  4. Typical developmental methodology in CL

  5. Training set and test set Models are estimated / systems are developed using a TRAINING SET. The training set should be: - representative of the task - as large as possible - well-known and understood

  6. The test set Estimated models are evaluated using a TEST SET. The test set should be: - disjoint from the training set - large enough for the results to be reliable - unseen
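
A minimal sketch of how such a disjoint training/test split might be set up in Python; the corpus of (word, tag) pairs, the split ratio, and the function name are illustrative assumptions, not part of the original slides:

    import random

    def train_test_split(examples, test_fraction=0.1, seed=42):
        """Shuffle the data and carve off a disjoint, unseen test set."""
        rng = random.Random(seed)
        indices = list(range(len(examples)))
        rng.shuffle(indices)
        n_test = int(len(examples) * test_fraction)
        test_set = [examples[i] for i in indices[:n_test]]       # held out, never used for estimation
        training_set = [examples[i] for i in indices[n_test:]]   # used to estimate the model
        return training_set, test_set

    # hypothetical corpus of (word, tag) pairs
    corpus = [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ"), ("dogs", "NNS"), ("bark", "VBP")] * 200
    train, test = train_test_split(corpus)
    print(len(train), len(test))   # 900 100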

  7. Possible problems with the training set Too small → performance drops. OVERFITTING can be reduced using: - cross-validation (a large variance across folds may mean the training set is too small) - large priors
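
The cross-validation idea mentioned above could be sketched as follows; `train_model` and `evaluate` are hypothetical stand-ins for whatever estimator and scoring function are in use:

    import statistics

    def cross_validate(examples, train_model, evaluate, k=10):
        """k-fold cross-validation: each fold serves once as the held-out set."""
        folds = [examples[i::k] for i in range(k)]
        scores = []
        for i in range(k):
            held_out = folds[i]
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            model = train_model(training)
            scores.append(evaluate(model, held_out))
        # a large variance across folds may indicate the training set is too small
        return statistics.mean(scores), statistics.stdev(scores)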

  8. Possible problems with the test set Are results obtained on the test set believable? - results might be distorted if the test set is too easy / too hard - the training set and the test set may be too different (language is not stationary)

  9. Evaluation Two types: - BLACK BOX (the system as a whole) - WHITE BOX (components evaluated independently). Typically QUANTITATIVE (but QUALITATIVE evaluation is needed as well)

  10. Simplest quantitative evaluation metrics ACCURACY: percentage correct (against some gold standard) - e.g., a tagger gets 96.7% of tags correct when evaluated on the Penn Treebank. ERROR: percentage wrong - ERROR REDUCTION is the most typical metric in ASR
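
As an illustration of these metrics, accuracy, error rate, and relative error reduction might be computed along the following lines; the tag sequences and the 0.20 reference error rate are invented:

    def accuracy(predicted, gold):
        """Percentage of predictions that match the gold standard."""
        correct = sum(p == g for p, g in zip(predicted, gold))
        return correct / len(gold)

    gold      = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
    predicted = ["DT", "NN", "VB",  "DT", "JJ", "NN"]
    acc = accuracy(predicted, gold)        # 5/6, roughly 0.833
    err = 1 - acc                          # error rate, roughly 0.167
    old_err = 0.20                         # invented error rate of a previous system
    print((old_err - err) / old_err)       # relative error reduction, roughly 0.17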

  11. A more general form of evaluation: precision & recall [figure: a page of text illustrating the task of picking out the few relevant items from a large collection]

  12. Positives and negatives Of the items SELECTED by the system, some are TRUE POSITIVES (TP) and some are FALSE POSITIVES (FP); of the items NOT selected, some are FALSE NEGATIVES (correct items that were missed) and the rest are TRUE NEGATIVES

  13. Precision and recall PRECISION: the proportion of correct items AMONG THE SELECTED ITEMS, i.e. TP / (TP + FP). RECALL: the proportion of all correct items that were selected, i.e. TP / (TP + FN)

  14. The tradeoff between precision and recall Easy to get high precision: select (almost) nothing. Easy to get high recall: return everything. You really need to report BOTH, or the F-measure
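
A small sketch tying slides 12-14 together: precision, recall, and the balanced F-measure computed from hypothetical sets of selected and relevant items:

    def precision_recall_f(selected, relevant):
        """P, R and balanced F-measure from sets of selected and relevant items."""
        tp = len(selected & relevant)          # true positives
        fp = len(selected - relevant)          # false positives
        fn = len(relevant - selected)          # false negatives
        precision = tp / (tp + fp) if selected else 0.0
        recall = tp / (tp + fn) if relevant else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R
        return precision, recall, f

    # hypothetical retrieval run
    selected = {"d1", "d2", "d3", "d7"}
    relevant = {"d1", "d3", "d4", "d5", "d7"}
    print(precision_recall_f(selected, relevant))   # (0.75, 0.6, ~0.667)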

  15. Single vs. multiple runs A single run may just be lucky: - do multiple runs - report averaged results - report the degree of variation - do SIGNIFICANCE TESTING (cf. the t-test, etc.). Many people are lazy and just report single runs.
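
One way the averaging and significance testing could look in practice is sketched below; the per-run accuracies are invented, and SciPy is assumed to be available for the paired t-test:

    import statistics
    from scipy import stats   # assumed available; any suitable significance test could be substituted

    # hypothetical accuracies from 5 matched runs (same splits) of two systems
    system_a = [0.912, 0.905, 0.918, 0.909, 0.914]
    system_b = [0.921, 0.917, 0.925, 0.915, 0.923]

    for name, scores in [("A", system_a), ("B", system_b)]:
        # report the averaged result and the degree of variation
        print(name, round(statistics.mean(scores), 3), round(statistics.stdev(scores), 3))

    # paired t-test: is B's improvement over A more than chance variation across runs?
    t_stat, p_value = stats.ttest_rel(system_b, system_a)
    print(round(t_stat, 2), round(p_value, 4))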

  16. Interpreting results A 97% accuracy may look impressive … but not so much if 98% of the items have the same tag: you need a BASELINE. An F-measure of .7 may not look very high unless you are told that humans only achieve .71 on this task: you need an UPPER BOUND
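
For instance, the majority-class baseline alluded to here could be computed as follows (the gold tags are invented): always predicting the most frequent tag already yields 98% accuracy on such data, so a 97% system is below baseline:

    from collections import Counter

    def majority_baseline_accuracy(gold_tags):
        """Accuracy of always predicting the single most frequent tag."""
        tag, count = Counter(gold_tags).most_common(1)[0]
        return count / len(gold_tags)

    # hypothetical gold data in which 98% of the items share one tag
    gold = ["NN"] * 98 + ["VB", "JJ"]
    print(majority_baseline_accuracy(gold))   # 0.98 -- a 97% system is below this baseline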

  17. Confusion matrices Once you’ve evaluated your model, you may want to do some ERROR ANALYSIS. This is usually done with a CONFUSION MATRIX
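
A confusion matrix for error analysis could be built from something as simple as the following sketch; the tag sequences are made up:

    from collections import Counter, defaultdict

    def confusion_matrix(gold, predicted):
        """Count how often each gold tag is predicted as each (possibly wrong) tag."""
        matrix = defaultdict(Counter)
        for g, p in zip(gold, predicted):
            matrix[g][p] += 1
        return matrix

    # hypothetical tagger output
    gold      = ["NN", "NN", "VB", "JJ", "NN", "VB"]
    predicted = ["NN", "JJ", "VB", "JJ", "NN", "NN"]
    for g, row in confusion_matrix(gold, predicted).items():
        print(g, dict(row))   # off-diagonal counts show the most frequent confusions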

  18. Readings • Manning and Schütze, chapter 8.1
