Using Corpora and Evaluation Tools

Presentation Transcript

  1. Using Corpora and Evaluation Tools
     Diana Maynard, Kalina Bontcheva
     http://gate.ac.uk/ http://nlp.shef.ac.uk/
     March 2004

  2. Corpus structure
     • Located in gatecorpora in CVS
     • Each directory under gatecorpora holds a corpus, e.g. gatecorpora/ace
     • Each corpus can have sub-parts, e.g. ace/bnews
     • Each (sub-)corpus has a clean and a marked directory; both are important
     • clean holds the unannotated versions, while marked holds the human-annotated ones
     • There may also be a processed subdirectory; unlike the other two, this is a datastore
     • Corresponding files in each subdirectory must have the same name
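The same-name rule above can be checked mechanically. The following is a minimal sketch, not part of the GATE tooling: the function name `check_corpus_layout` is hypothetical, and it assumes that clean and marked are plain directories of files inside the (sub-)corpus directory.

```python
from pathlib import Path


def check_corpus_layout(corpus_dir):
    """Compare file names in the clean and marked directories of one (sub-)corpus.

    Returns two sorted lists: names found only in clean, and names found
    only in marked. Both lists are empty when the layout is consistent.
    """
    corpus = Path(corpus_dir)
    clean = {p.name for p in (corpus / "clean").iterdir() if p.is_file()}
    marked = {p.name for p in (corpus / "marked").iterdir() if p.is_file()}
    return sorted(clean - marked), sorted(marked - clean)
```

Calling it on a sub-corpus directory such as ace/bnews would list any file that lacks a counterpart in the other directory.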

  3. Tools for corpus manipulation
     • Many tools are available in gatecorpora/utilities and in subdirectories of each corpus
     • Many of the corpora, e.g. MUC and ACE, come in different formats (e.g. inline vs standoff markup) and have been converted to GATE-style annotations
     • There are also tools for, e.g., counting things and changing annotation names (mostly JAPE grammars)

  4. Corpora available
     • MUC7 (newswires)
     • MUSE (news texts from the web)
     • ACE
     • ACE Chinese
     • ACE Arabic
     • Romanian (news texts; 1984)
     • CMU seminars
     • Jobs
     • CONLL’03 – part of Reuters with NEs
     • Bulgarian – news

  5. MUC 7 corpus
     • Newswires used in the official MUC 7 evaluation
     • Data available in MUC format and GATE format
     • Annotation types: Person, Location, Organization, Money, Percent, Date, Time
     • Division into training and test sets

  6. MUSE corpus
     • News texts from various websites (BBC, Guardian, etc.)
     • Annotation types: Person, Organisation, Location, Date, Time, Money, Percent, Address
     • Slight differences in annotation guidelines from MUC, e.g. people’s titles are included in names
     • Available from gatecorpora/news in various subdirectories

  7. ACE corpus
     • 3 types of text: newswire, broadcast news and newspaper
     • Broadcast news and newspaper available as ground truth and original (degraded) texts
     • Annotation types: Person, Organisation, Location, GPE, Facility
     • Some annotations have roles to indicate metonymous usage
     • Guidelines are different from MUC and MUSE
     • Available from gatecorpora/ace in various subdirectories

  8. Multilingual ACE
     • As for ACE, but in Chinese and Arabic
     • Texts are in UTF-8
     • No degraded versions of these texts
     • Available from gatecorpora/ace/ace03/Chinese/ and gatecorpora/ace/ace03/Arabic/

  9. CMU Seminars & Jobs
     • Corpora frequently used to evaluate relation extraction and wrapper induction systems
     • Available from gatecorpora/jobs-corpus and gatecorpora/cmu-seminars
     • Converted into GATE XML, ready for use

  10. CONLL’03 shared task
     • Corpus used in the CONLL’03 shared task for evaluating NE recognition
     • In English, part of the Reuters corpus
     • Markup is e.g. <I-LOC>, not converted to MUSE tags
     • Use reuterstogate.jape to convert to MUSE tags
     • Available from gatecorpora/ReutersWithNamedEntities
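To illustrate the kind of mapping such a conversion performs, here is a small Python sketch. It is not the logic of reuterstogate.jape, which is not shown in these slides: the table `CONLL_TO_MUSE` and the function `conll_to_muse` are hypothetical, and the choice to leave unlisted labels unmapped is an assumption.

```python
# Hypothetical mapping from CoNLL'03-style entity labels to MUSE type names.
# Labels without an obvious MUSE counterpart are deliberately left out.
CONLL_TO_MUSE = {
    "PER": "Person",
    "ORG": "Organisation",
    "LOC": "Location",
}


def conll_to_muse(tag):
    """Map a tag such as '<I-LOC>' or 'B-PER' to a MUSE type, or None."""
    label = tag.strip("<>")                # drop angle brackets if present
    _, _, entity = label.partition("-")    # 'I-LOC' -> 'LOC'
    return CONLL_TO_MUSE.get(entity)
```

Returning None for unknown labels keeps the caller in charge of deciding whether to skip or report them.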

  11. Annotation Diff: per-document evaluation
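A per-document evaluation of this kind typically compares human-marked (key) annotations with system (response) annotations and reports precision, recall and F-measure. The sketch below is a simplified illustration, not GATE's Annotation Diff implementation: it counts only exact matches, the function name `strict_prf` is hypothetical, and annotations are assumed to be (type, start, end) tuples.

```python
def strict_prf(key, response):
    """Precision, recall and F1 for one document, counting exact matches only.

    key and response are iterables of (type, start_offset, end_offset) tuples;
    a response annotation is correct only if an identical tuple is in the key.
    """
    key, response = set(key), set(response)
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A fuller comparison would also credit partially overlapping annotations; that is left out here to keep the example short.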

  12. Regression Test
     At corpus level – corpus benchmark tool – tracking system’s performance over time

  13. How it works
     • Uses the clean, marked, and processed directories
     • Corpus_tool.properties – must be in the directory from which GATE is executed
     • Specifies configuration information about:
        • Which annotation types are to be evaluated
        • Threshold below which to print out debug info
        • Input set name and key set name
     • Modes:
        • Default – regression testing
        • Human-marked against already stored, processed results
        • Human-marked against current processing results
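The threshold and the regression mode can be pictured with a short Python sketch. This is an illustration under stated assumptions rather than the corpus benchmark tool itself: the function `regression_report` is hypothetical, and scores are assumed to be per-document F-measures keyed by file name.

```python
def regression_report(stored, current, threshold):
    """Compare current per-document scores with previously stored ones.

    stored and current map document names to F-measure values.
    Returns (below, regressed):
      below     - documents whose current score is under the threshold,
                  i.e. candidates for printing debug information
      regressed - documents scoring lower now than in the stored run
    """
    below = sorted(doc for doc, score in current.items() if score < threshold)
    regressed = sorted(
        doc for doc, score in current.items()
        if doc in stored and score < stored[doc]
    )
    return below, regressed
```

Documents that appear only in the current run are checked against the threshold but cannot count as regressions, since there is no stored score to compare with.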

  14. Conclusion
     This talk: http://gate.ac.uk/sale/talks/corpora-tutorial.ppt
     More information: http://gate.ac.uk/