
Using Corpora and Evaluation Tools




  1. Using Corpora and Evaluation Tools
  Diana Maynard, Kalina Bontcheva
  http://gate.ac.uk/
  http://nlp.shef.ac.uk/
  March 2004

  2. Corpus structure
  • Located in gatecorpora in CVS
  • Each directory under gatecorpora holds a corpus, e.g. gatecorpora/ace
  • Each corpus can have sub-parts, e.g. ace/bnews
  • Each (sub-)corpus has a clean and a marked directory; these are the important ones
  • clean holds the unannotated versions, while marked holds the human-marked ones
  • There may also be a processed subdirectory – this is a datastore (unlike the other two)
  • Corresponding files in each subdirectory must have the same name
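The last constraint is easy to check mechanically. A minimal sketch, assuming the clean/marked layout described above (the function name and return convention are my own, not part of the GATE utilities):

```python
# Sketch: verify that a corpus's clean/ and marked/ subdirectories
# contain the same file names, as the layout above requires.
from pathlib import Path

def check_corpus(corpus_dir):
    """Return the set of file names present in only one of clean/ and marked/."""
    clean = {p.name for p in Path(corpus_dir, "clean").iterdir()}
    marked = {p.name for p in Path(corpus_dir, "marked").iterdir()}
    return clean ^ marked  # symmetric difference: mismatched names
```

An empty result means every clean document has a human-marked counterpart.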

  3. Tools for corpus manipulation
  • Lots of tools are available in gatecorpora/utilities and in subdirectories of each corpus
  • Many of the corpora (e.g. MUC, ACE) come in different formats (e.g. inline vs. standoff markup) and have been converted to GATE-style annotations
  • There are also tools for counting things, changing annotation names, etc. (mostly JAPE grammars)
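As an illustration of the "counting things" kind of utility, a sketch that tallies annotation types in a GATE-format XML document; the Annotation element and Type attribute follow the GATE XML document format, but the function itself is hypothetical, not one of the gatecorpora/utilities scripts:

```python
# Sketch: count how many annotations of each Type a GATE XML document holds.
import xml.etree.ElementTree as ET
from collections import Counter

def count_annotation_types(xml_text):
    """Map annotation Type -> number of occurrences in the document."""
    root = ET.fromstring(xml_text)
    return Counter(a.get("Type") for a in root.iter("Annotation"))
```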

  4. Corpora available
  • MUC7 (newswires)
  • MUSE (news texts from the web)
  • ACE
  • ACE Chinese
  • ACE Arabic
  • Romanian (news texts; 1984)
  • CMU seminars
  • Jobs
  • CONLL’03 – part of Reuters with NEs
  • Bulgarian – news

  5. MUC 7 corpus
  • Newswires used in the official MUC 7 evaluation
  • Data available in MUC format and GATE format
  • Annotation types: Person, Location, Organization, Money, Percent, Date, Time
  • Divided into training and test sets

  6. MUSE corpus
  • News texts from various websites (BBC, Guardian, etc.)
  • Annotation types: Person, Organisation, Location, Date, Time, Money, Percent, Address
  • Slight differences from the MUC annotation guidelines, e.g. people’s titles are included in names
  • Available from gatecorpora/news in various subdirectories

  7. ACE corpus
  • 3 types of text: newswire, broadcast news and newspaper
  • Broadcast news and newspaper texts available both as ground truth and as original (degraded) texts
  • Annotation types: Person, Organisation, Location, GPE, Facility
  • Some annotations have roles to indicate metonymous usage
  • Guidelines differ from MUC and MUSE
  • Available from gatecorpora/ace in various subdirectories

  8. Multilingual ACE
  • As for ACE, but in Chinese and Arabic
  • Texts are in UTF-8
  • No degraded versions of these texts
  • Available from gatecorpora/ace/ace03/Chinese/ and gatecorpora/ace/ace03/Arabic/

  9. CMU Seminars & Jobs
  • Corpora frequently used to evaluate relation extraction and wrapper induction systems
  • Located in gatecorpora/jobs-corpus and gatecorpora/cmu-seminars
  • Converted into GATE XML, ready for use

  10. CONLL’03 shared task
  • Corpus used in the CONLL’03 shared task for evaluating NE recognition
  • In English; part of the Reuters corpus
  • Markup is inline, e.g. <I-LOC>, and is not converted to MUSE tags
  • Use reuterstogate.jape to convert it to MUSE tags
  • Located in gatecorpora/ReutersWithNamedEntities
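The conversion that reuterstogate.jape performs amounts to turning token-level IOB tags (I-LOC, B-PER, O, …) into entity spans. A simplified Python sketch of that idea — not the actual JAPE grammar, and the span representation is my own:

```python
# Sketch: collapse CoNLL-style IOB tags into (start, end, type) token spans.
# "O" ends any open entity; "B-X" forces a new entity; "I-X" continues one
# of the same type or starts one if none is open (IOB1 convention).
def iob_to_spans(tags):
    """tags: list of IOB tags; returns (start_token, end_token, type) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag == "O" or tag.startswith("B-") or (etype and tag[2:] != etype):
            if etype is not None:           # close the currently open entity
                spans.append((start, i, etype))
                start, etype = None, None
        if tag != "O" and etype is None:    # open a new entity
            start, etype = i, tag[2:]
    if etype is not None:                   # close an entity open at the end
        spans.append((start, len(tags), etype))
    return spans
```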

  11. Annotation Diff: per-document evaluation
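Annotation Diff compares a key (human-marked) annotation set against a response (system) set for one document. A sketch of the strict scoring it reports, matching annotations on exact span and type (the function itself is illustrative, not the tool's API):

```python
# Sketch: strict precision/recall/F1 for one document, matching on
# (start_offset, end_offset, annotation_type) tuples.
def prf(key, response):
    """key, response: sets of (start, end, type) tuples; returns (P, R, F1)."""
    correct = len(key & response)                       # exact matches
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Annotation Diff also supports a lenient mode that credits partial (overlapping) matches; the strict version above is the simplest case.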

  12. Regression testing
  • At corpus level – the corpus benchmark tool tracks the system’s performance over time
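The regression idea can be sketched as: score every document in the corpus, then flag any whose score fell below a stored baseline. This is an illustration of the concept, not the corpus benchmark tool's actual interface:

```python
# Sketch: flag documents whose current F1 dropped against a stored baseline.
def regressions(current, baseline, tolerance=0.0):
    """current, baseline: dicts mapping document name -> F1 score.

    Returns the sorted names of documents that regressed by more
    than `tolerance`."""
    return sorted(doc for doc, f1 in current.items()
                  if f1 + tolerance < baseline.get(doc, 0.0))
```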

  13. How it works
  • Uses the clean, marked, and processed directories
  • corpus_tool.properties must be in the directory from which GATE is executed
  • It specifies configuration information:
    • Which annotation types are to be evaluated
    • Threshold below which to print out debug info
    • Input set name and key set name
  • Modes:
    • Default – regression testing
    • Human-marked against already stored, processed results
    • Human-marked against current processing results
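corpus_tool.properties is a Java-style properties file. A sketch of reading such a file from Python; the key names in the example are assumptions based on the configuration items listed above, not the tool's real keys:

```python
# Sketch: parse simple "key=value" properties lines, skipping blanks
# and "#" comments.
def read_properties(text):
    """Return a dict of the key=value pairs found in a properties file."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props
```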

  14. Conclusion
  This talk: http://gate.ac.uk/sale/talks/corpora-tutorial.ppt
  More information: http://gate.ac.uk/
