EXPERIMENTAL TECHNIQUES & EVALUATION IN NLP



Presentation Transcript


  1. EXPERIMENTAL TECHNIQUES & EVALUATION IN NLP Università di Venezia, 1 October 2003

  2. The rise of empiricism Up until the 1980s, computational linguistics (CL) was primarily a theoretical discipline; experimental methodology now receives much more attention

  3. Empirical methodology & evaluation • Starting with the big US ASR competitions of the 1980s, evaluation has progressively become a central component of work in NLP: the DARPA Speech initiative, MUC, TREC • GOOD: it is much easier for the community (and for researchers themselves) to understand which proposals are real improvements • BAD: too much focus on small improvements; nobody can afford to try an entirely new technique (it may not lead to improvements for a couple of years!)

  4. Typical developmental methodology in CL

  5. Training set and test set Models are estimated / systems are developed using a TRAINING SET. The training set should be: - representative of the task - as large as possible - well-known and understood

  6. The test set Estimated models are evaluated using a TEST SET. The test set should be: - disjoint from the training set - large enough for the results to be reliable - unseen
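
A minimal sketch of how such a disjoint training/test split might be set up in Python; the corpus of (word, tag) pairs, the split ratio, and the function name are illustrative assumptions, not part of the original slides:

    import random

    def train_test_split(examples, test_fraction=0.1, seed=42):
        """Shuffle the data and carve off a disjoint, unseen test set."""
        rng = random.Random(seed)
        indices = list(range(len(examples)))
        rng.shuffle(indices)
        n_test = int(len(examples) * test_fraction)
        test_set = [examples[i] for i in indices[:n_test]]       # held out, never used for estimation
        training_set = [examples[i] for i in indices[n_test:]]   # used to estimate the model
        return training_set, test_set

    # hypothetical corpus of (word, tag) pairs
    corpus = [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ"), ("dogs", "NNS"), ("bark", "VBP")] * 200
    train, test = train_test_split(corpus)
    print(len(train), len(test))   # 900 100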

  7. Possible problems with the training set Too small → performance drops. OVERFITTING can be reduced using: - cross-validation (a large variance across folds may mean the training set is too small) - large priors
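
The cross-validation idea mentioned above could be sketched as follows; `train_model` and `evaluate` are hypothetical stand-ins for whatever estimator and scoring function are in use:

    import statistics

    def cross_validate(examples, train_model, evaluate, k=10):
        """k-fold cross-validation: each fold serves once as the held-out set."""
        folds = [examples[i::k] for i in range(k)]
        scores = []
        for i in range(k):
            held_out = folds[i]
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            model = train_model(training)
            scores.append(evaluate(model, held_out))
        # a large variance across folds may indicate the training set is too small
        return statistics.mean(scores), statistics.stdev(scores)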

  8. Possible problems with the test set Are results obtained on the test set believable? - results might be distorted if the test set is too easy / too hard - the training set and the test set may be too different (language is not stationary)

  9. Evaluation Two types: - BLACK BOX (the system as a whole) - WHITE BOX (components evaluated independently). Typically QUANTITATIVE (but QUALITATIVE evaluation is needed as well)

  10. Simplest quantitative evaluation metrics ACCURACY: percentage correct (against some gold standard) - e.g., a tagger gets 96.7% of tags correct when evaluated on the Penn Treebank. ERROR: percentage wrong - ERROR REDUCTION is the most typical metric in ASR
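
As an illustration of these metrics, accuracy, error rate, and relative error reduction might be computed along the following lines; the tag sequences and the 0.20 reference error rate are invented:

    def accuracy(predicted, gold):
        """Percentage of predictions that match the gold standard."""
        correct = sum(p == g for p, g in zip(predicted, gold))
        return correct / len(gold)

    gold      = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
    predicted = ["DT", "NN", "VB",  "DT", "JJ", "NN"]
    acc = accuracy(predicted, gold)        # 5/6, roughly 0.833
    err = 1 - acc                          # error rate, roughly 0.167
    old_err = 0.20                         # invented error rate of a previous system
    print((old_err - err) / old_err)       # relative error reduction, roughly 0.17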

  11. A more general form of evaluation: precision & recall [figure: a page of text illustrating the task of picking out the few relevant items from a large collection]

  12. Positives and negatives Of the items SELECTED by the system, some are TRUE POSITIVES (TP) and some are FALSE POSITIVES (FP); of the items NOT selected, some are FALSE NEGATIVES (correct items that were missed) and the rest are TRUE NEGATIVES

  13. Precision and recall PRECISION: the proportion of correct items AMONG THE SELECTED ITEMS, i.e. TP / (TP + FP). RECALL: the proportion of all correct items that were selected, i.e. TP / (TP + FN)

  14. The tradeoff between precision and recall Easy to get high precision: select (almost) nothing. Easy to get high recall: return everything. You really need to report BOTH, or the F-measure
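
A small sketch tying slides 12-14 together: precision, recall, and the balanced F-measure computed from hypothetical sets of selected and relevant items:

    def precision_recall_f(selected, relevant):
        """P, R and balanced F-measure from sets of selected and relevant items."""
        tp = len(selected & relevant)          # true positives
        fp = len(selected - relevant)          # false positives
        fn = len(relevant - selected)          # false negatives
        precision = tp / (tp + fp) if selected else 0.0
        recall = tp / (tp + fn) if relevant else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R
        return precision, recall, f

    # hypothetical retrieval run
    selected = {"d1", "d2", "d3", "d7"}
    relevant = {"d1", "d3", "d4", "d5", "d7"}
    print(precision_recall_f(selected, relevant))   # (0.75, 0.6, ~0.667)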

  15. Single vs. multiple runs A single run may just be lucky: - do multiple runs - report averaged results - report the degree of variation - do SIGNIFICANCE TESTING (cf. the t-test, etc.). Many people are lazy and just report single runs.
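
One way the averaging and significance testing could look in practice is sketched below; the per-run accuracies are invented, and SciPy is assumed to be available for the paired t-test:

    import statistics
    from scipy import stats   # assumed available; any suitable significance test could be substituted

    # hypothetical accuracies from 5 matched runs (same splits) of two systems
    system_a = [0.912, 0.905, 0.918, 0.909, 0.914]
    system_b = [0.921, 0.917, 0.925, 0.915, 0.923]

    for name, scores in [("A", system_a), ("B", system_b)]:
        # report the averaged result and the degree of variation
        print(name, round(statistics.mean(scores), 3), round(statistics.stdev(scores), 3))

    # paired t-test: is B's improvement over A more than chance variation across runs?
    t_stat, p_value = stats.ttest_rel(system_b, system_a)
    print(round(t_stat, 2), round(p_value, 4))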

  16. Interpreting results A 97% accuracy may look impressive … but not so much if 98% of the items have the same tag: you need a BASELINE. An F-measure of .7 may not look very high unless you are told that humans only achieve .71 on this task: you need an UPPER BOUND
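
For instance, the majority-class baseline alluded to here could be computed as follows (the gold tags are invented): always predicting the most frequent tag already yields 98% accuracy on such data, so a 97% system is below baseline:

    from collections import Counter

    def majority_baseline_accuracy(gold_tags):
        """Accuracy of always predicting the single most frequent tag."""
        tag, count = Counter(gold_tags).most_common(1)[0]
        return count / len(gold_tags)

    # hypothetical gold data in which 98% of the items share one tag
    gold = ["NN"] * 98 + ["VB", "JJ"]
    print(majority_baseline_accuracy(gold))   # 0.98 -- a 97% system is below this baseline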

  17. Confusion matrices Once you’ve evaluated your model, you may want to do some ERROR ANALYSIS. This is usually done with a CONFUSION MATRIX
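
A confusion matrix for error analysis could be built from something as simple as the following sketch; the tag sequences are made up:

    from collections import Counter, defaultdict

    def confusion_matrix(gold, predicted):
        """Count how often each gold tag is predicted as each (possibly wrong) tag."""
        matrix = defaultdict(Counter)
        for g, p in zip(gold, predicted):
            matrix[g][p] += 1
        return matrix

    # hypothetical tagger output
    gold      = ["NN", "NN", "VB", "JJ", "NN", "VB"]
    predicted = ["NN", "JJ", "VB", "JJ", "NN", "NN"]
    for g, row in confusion_matrix(gold, predicted).items():
        print(g, dict(row))   # off-diagonal counts show the most frequent confusions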

  18. Readings • Manning and Schütze, chapter 8.1
