
Evaluation issues in anaphora resolution and beyond



  1. Evaluation issues in anaphora resolution and beyond Ruslan Mitkov University of Wolverhampton Faro, 27 June 2002

  2. Evaluation • Evaluation is a driving force for every NLP task/approach/application • Evaluation is indicative of the performance of a specific approach/application and, no less importantly, of where it stands compared with other approaches/applications • Growing research in evaluation, inspired by the availability of annotated corpora

  3. Major impediments to fulfilling evaluation’s mission • Different approaches evaluated on different data • Different approaches evaluated in different modes • Results not independently confirmed • As a result, no comparison or objective evaluation is possible

  4. Anaphora resolution vs. coreference resolution • Anaphora resolution has to do with tracking down an antecedent of an anaphor • Coreference resolution seeks to identify all coreference classes (chains)

  5. Anaphora resolution • For nominal anaphora which involves coreference, it is logical to regard each of the preceding noun phrases that are coreferential with the anaphor as a legitimate antecedent: Computational Linguists from many different countries attended PorTAL. The participants enjoyed the presentations; they also took an active part in the discussions.

  6. Evaluation in anaphora resolution Two perspectives: • Evaluation of anaphora resolution algorithms • Evaluation of anaphora resolution systems

  7. Recall and Precision • MUC introduced the measures recall and precision for coreference resolution. • These measures, as defined, are not satisfactory in terms of clarity and coverage (Mitkov 2001).
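The MUC measures in question are the link-based, model-theoretic scores of Vilain et al. (1995): recall over a key chain S counts the links missing from the partition p(S) that the response induces on S. A minimal sketch of that computation (the chain representation as sets of mention ids, and the function names, are my own assumptions):

```python
def muc_recall(key_chains, response_chains):
    """Link-based MUC recall: sum(|S| - |p(S)|) / sum(|S| - 1),
    where p(S) is the partition of key chain S induced by the
    response chains (unmatched mentions become singletons)."""
    num = den = 0
    for s in key_chains:
        # pieces of S found inside some response chain
        pieces = [s & r for r in response_chains if s & r]
        covered = set().union(*pieces) if pieces else set()
        # mentions of S absent from the response are singleton pieces
        p_size = len(pieces) + len(s - covered)
        num += len(s) - p_size
        den += len(s) - 1
    return num / den if den else 0.0

def muc_precision(key_chains, response_chains):
    # precision is recall with the roles of key and response swapped
    return muc_recall(response_chains, key_chains)

# Key chain {1,2,3,4} split into two response chains: recall = 2/3
print(muc_recall([{1, 2, 3, 4}], [{1, 2}, {3, 4}]))
```

The well-known shortcomings noted on the slide (e.g. all errors inside a chain counting equally, and singleton entities being ignored) follow directly from this link-counting formulation.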

  8. Evaluation package for anaphora resolution algorithms (Mitkov 1998; 2000) Evaluation package for anaphora resolution algorithms (i) performance measures (ii) comparative evaluation tasks and (iii) component measures.

  9. Performance measures • Success rate • Critical success rate Critical success rate applies only to those ‘tough’ anaphors which still have more than one candidate for antecedent after the gender and number filter

  10. Example • Evaluation data: 100 anaphors • Number of anaphors correctly resolved: 80 • Number of anaphors resolved by the gender and number constraints alone (a single candidate left after the filter): 30 Success rate: 80/100 = 80%; Critical success rate: (80 - 30)/(100 - 30) = 50/70 = 71.4%
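The two performance measures can be computed directly from these counts. A sketch, assuming (as the example's 50/70 figure implies) that the anaphors resolved by the gender and number filter alone are all resolved correctly; function and parameter names are my own:

```python
def success_rate(correct, total):
    """Proportion of all anaphors resolved correctly."""
    return correct / total

def critical_success_rate(correct, total, filter_resolved):
    """Success rate over the 'tough' anaphors only, i.e. those that
    still had more than one candidate after the gender and number
    filter; filter_resolved = anaphors the filter alone resolved."""
    return (correct - filter_resolved) / (total - filter_resolved)

print(success_rate(80, 100))               # 0.8
print(critical_success_rate(80, 100, 30))  # 50/70, about 0.714
```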

  11. Comparative evaluation tasks • Evaluation against baseline models • Comparison to similar approaches • Comparison with well-established approaches Approaches frequently used for comparison: Hobbs (1978), Brennan et al. (1987), Lappin and Leass (1994), Kennedy and Boguraev (1996), Baldwin (1997), Mitkov (1996; 1998)
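A typical baseline model of the kind the slide refers to simply picks the most recent preceding candidate that agrees with the anaphor in gender and number. A sketch (the dict-based candidate representation is an assumption for illustration):

```python
def baseline_most_recent(anaphor, candidates):
    """Baseline model: most recent preceding NP that agrees with the
    anaphor in gender and number; candidates are in textual order."""
    for cand in reversed(candidates):
        if (cand["gender"] == anaphor["gender"]
                and cand["number"] == anaphor["number"]):
            return cand
    return None

cands = [
    {"text": "the presentations", "gender": "neut", "number": "pl"},
    {"text": "the participants", "gender": "masc", "number": "pl"},
]
they = {"text": "they", "gender": "masc", "number": "pl"}
print(baseline_most_recent(they, cands)["text"])  # the participants
```

Outperforming such a baseline is the minimum bar; comparison with the well-established approaches listed above is the stronger test.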

  12. Component measures • Relative importance • Decision power (Mitkov 2001)

  13. Evaluation measures for anaphora resolution systems • Success rate • Critical success rate • Resolution etiquette (Mitkov et al. 2002)

  14. Reliability of evaluation results Evaluation results can be regarded as reliable only if the evaluation covers • All naturally occurring texts, or • Representative samples obtained through sampling procedures

  15. Relative vs. absolute results • Results may be relative to a specific evaluation set or to another approach • More “absolute” figures could be obtained if there existed a measure which quantified the complexity of the anaphors to be resolved

  16. Measures quantifying complexity in anaphora resolution Measures for complexity (Mitkov 2001): • Knowledge required for resolution • Distance between anaphor and antecedent (in NPs, clauses, sentences) • Number of competing candidates
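The latter two indicators are straightforward to aggregate over an evaluation set, giving a rough complexity profile against which success rates can be interpreted. A sketch, assuming each anaphor case records its anaphor-antecedent distance in sentences and its candidate count (field names are my own):

```python
def complexity_profile(cases):
    """Average the complexity indicators over an evaluation set.
    cases: list of dicts with 'sentence_distance' (anaphor to
    antecedent, in sentences) and 'n_candidates' (competing
    candidates for the antecedent)."""
    n = len(cases)
    return {
        "avg_sentence_distance": sum(c["sentence_distance"] for c in cases) / n,
        "avg_candidates": sum(c["n_candidates"] for c in cases) / n,
    }

cases = [
    {"sentence_distance": 0, "n_candidates": 2},
    {"sentence_distance": 1, "n_candidates": 4},
]
print(complexity_profile(cases))  # averages 0.5 and 3.0
```

An evaluation set with a higher average distance and more competing candidates is harder, so the same success rate on it means more.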

  17. Fair evaluation Algorithms should be evaluated on the basis of the same • Evaluation data • Pre-processing tools

  18. Evaluation workbench Evaluation workbench for anaphora resolution (Mitkov 2000; Barbu and Mitkov 2001) • Allows the comparison of approaches sharing common principles or similar pre-processing • Enables the ‘plugging in’ and testing of different anaphora resolution algorithms All algorithms implemented operate in a fully automatic mode
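The ‘plugging in’ idea can be pictured as a common resolver interface plus a single pre-processed corpus shared by all algorithms, so that every approach is scored under identical conditions. A minimal sketch, with invented class and field names (not the actual workbench API):

```python
from abc import ABC, abstractmethod

class AnaphoraResolver(ABC):
    """Interface every plugged-in algorithm implements."""
    @abstractmethod
    def resolve(self, anaphor, candidates):
        """Return the chosen antecedent (or None)."""

class MostRecentBaseline(AnaphoraResolver):
    def resolve(self, anaphor, candidates):
        return candidates[-1] if candidates else None

def evaluate(resolver, corpus):
    """corpus: (anaphor, candidates, gold_antecedent) triples produced
    by the same pre-processing for every algorithm compared."""
    correct = sum(1 for anaphor, cands, gold in corpus
                  if resolver.resolve(anaphor, cands) == gold)
    return correct / len(corpus)

corpus = [("they", ["PorTAL", "the participants"], "the participants"),
          ("it", ["PorTAL", "the talk"], "PorTAL")]
print(evaluate(MostRecentBaseline(), corpus))  # 0.5
```

Because the pre-processing is shared, any difference in scores reflects the algorithms themselves rather than their input pipelines, which is exactly the fairness requirement of the previous slide.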

  19. The need for annotated corpora Annotated corpora are vital for training and evaluation Annotation should cover anaphoric or coreferential chains, not only anaphor-antecedent pairs

  20. Scarce commodity • Lancaster Anaphoric Treebank (100 000 words) • MUC coreference task annotated data (65 000 words) • Part of the Penn Treebank (90 000 words)

  21. Additional issues • Annotation scheme • Annotation tools • Annotation strategy Inter-annotator (dis)agreement is a major issue!

  22. The Wolverhampton coreference annotation project A 500 000-word corpus annotated for anaphoric and coreferential links (identity-of-sense direct nominal anaphora) Less ambitious in terms of coverage, but much more consistent

  23. Watch out for the traps! • Are all annotated data reliable? • Are all original documents reliable? • Are all results reported “honest”?

  24. Morale and motivation are important! If I may offer you my advice... • Do not despair if your first evaluation results are not as high as you wanted them to be • Be prepared to provide considerable input in exchange for minor performance improvements • Work hard • Be transparent ... and you’ll get there!

  25. Anaphora resolution projects Ruslan Mitkov’s home page http://www.wlv.ac.uk/~le1825 Research Group in Computational Linguistics http://clg.wlv.ac.uk
