This document provides an overview of the corpora available in the GATE framework, including their structure, annotation types, and tools for manipulating them. It discusses key corpora such as MUC7, MUSE, and ACE and highlights their distinguishing features and annotations. It covers file organization within GATE, including the clean and marked directories, and the Chinese and Arabic versions of ACE. It also presents regression testing and corpus evaluation with the corpus benchmark tool. The material is aimed at researchers in natural language processing and text analysis.
Using Corpora and Evaluation Tools
Diana Maynard, Kalina Bontcheva
http://gate.ac.uk/
http://nlp.shef.ac.uk/
March 2004
Corpus structure
• Located in gatecorpora in CVS
• Each directory under gatecorpora holds a corpus, e.g. gatecorpora/ace
• Each corpus can have sub-parts, e.g. ace/bnews
• Each (sub-)corpus has a clean and a marked directory; both are important
• clean holds the unannotated version, while marked holds the human-annotated version
• There may also be a processed subdirectory; unlike the other two, this is a datastore
• Corresponding files in each subdirectory must have the same name
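The same-name requirement between clean and marked can be checked mechanically. The following is a minimal sketch, not part of the GATE tools: the function name `unmatched_files` is hypothetical, and it assumes both directories contain plain files directly under them.

```python
from pathlib import Path


def unmatched_files(corpus_dir):
    """Return (only_in_clean, only_in_marked) as sorted lists of file names.

    Hypothetical helper: assumes corpus_dir contains 'clean' and 'marked'
    subdirectories whose files should pair up by name.
    """
    corpus = Path(corpus_dir)
    clean = {p.name for p in (corpus / "clean").iterdir() if p.is_file()}
    marked = {p.name for p in (corpus / "marked").iterdir() if p.is_file()}
    # Names that appear on one side only indicate a broken pairing.
    return sorted(clean - marked), sorted(marked - clean)
```

Calling `unmatched_files` on a (sub-)corpus directory such as ace/bnews should return two empty lists when every clean file has a marked counterpart and vice versa.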
Tools for corpus manipulation
• Many tools are available in gatecorpora/utilities and in subdirectories of each corpus
• Many of the corpora, e.g. MUC and ACE, come in different formats (e.g. inline vs. standoff markup) and have been converted to GATE-style annotations
• There are also tools for tasks such as counting annotations and renaming annotation types (mostly JAPE grammars)
Corpora available
• MUC7 (newswires)
• MUSE (news texts from the web)
• ACE
• ACE Chinese
• ACE Arabic
• Romanian (news texts; 1984)
• CMU seminars
• Jobs
• CONLL’03 (part of Reuters with NEs)
• Bulgarian (news)
MUC7 corpus
• Newswires used in the official MUC7 evaluation
• Data available in MUC format and GATE format
• Annotation types: Person, Location, Organization, Money, Percent, Date, Time
• Divided into training and test sets
MUSE corpus
• News texts from various websites (BBC, Guardian, etc.)
• Annotation types: Person, Organisation, Location, Date, Time, Money, Percent, Address
• Slight differences in annotation guidelines from MUC, e.g. people’s titles are included in names
• Available from gatecorpora/news in various subdirectories
ACE corpus
• 3 types of text: newswire, broadcast news and newspaper
• Broadcast news and newspaper are available both as ground truth and as original (degraded) texts
• Annotation types: Person, Organisation, Location, GPE, Facility
• Some annotations have roles to indicate metonymic usage
• Guidelines differ from those of MUC and MUSE
• Available from gatecorpora/ace in various subdirectories
Multilingual ACE
• As for ACE, but in Chinese and Arabic
• Texts are in UTF-8
• No degraded versions of these texts
• Available from gatecorpora/ace/ace03/Chinese/ and gatecorpora/ace/ace03/Arabic/
CMU Seminars & Jobs
• Corpora frequently used to evaluate relation extraction and wrapper induction systems
• Available from gatecorpora/jobs-corpus and gatecorpora/cmu-seminars
• Converted into GATE XML, ready for use
CONLL’03 shared task
• Corpus used in the CONLL’03 shared task for evaluating NE recognition
• In English, part of the Reuters corpus
• Markup uses tags such as <I-LOC>, which have not been converted to MUSE tags
• Use reuterstogate.jape to convert to MUSE tags
• Available from gatecorpora/ReutersWithNamedEntities
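To illustrate the kind of renaming involved in going from tags like <I-LOC> to MUSE-style types, here is a small sketch. It is not the content of reuterstogate.jape: the mapping table, the target spellings, and the function name `muse_type` are assumptions made for illustration, and tags with no obvious MUSE counterpart (such as MISC) are simply left unmapped.

```python
import re

# Assumed mapping for illustration only; the real rules are in reuterstogate.jape.
CONLL_TO_MUSE = {
    "PER": "Person",
    "LOC": "Location",
    "ORG": "Organization",
}

# Accepts 'I-LOC', 'B-PER', or the bracketed form '<I-LOC>'.
_TAG = re.compile(r"<?[IB]-([A-Z]+)>?")


def muse_type(conll_tag):
    """Return the assumed MUSE-style type for a CONLL'03 tag, or None if unmapped."""
    match = _TAG.fullmatch(conll_tag)
    if match is None:
        return None
    return CONLL_TO_MUSE.get(match.group(1))
```

Under these assumptions, `muse_type("<I-LOC>")` gives `"Location"`, while `muse_type("I-MISC")` and `muse_type("O")` give `None`.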
Regression Test
• Performed at corpus level
• Uses the corpus benchmark tool
• Tracks the system’s performance over time
How it works
• Uses the clean, marked, and processed directories
• corpus_tool.properties must be in the directory from which GATE is executed
• It specifies configuration information about:
  • Which annotation types are to be evaluated
  • Threshold below which to print out debug info
  • Input set name and key set name
• Modes
  • Default: regression testing
  • Human-marked against the already stored, processed version
  • Human-marked against current processing results
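The comparison of human-marked annotations against system output, and the role of the threshold, can be sketched as follows. This is an illustration under stated assumptions, not the corpus benchmark tool's implementation: annotations are modelled as (start, end, type) tuples, only exact matches count as correct, and the function names `score` and `below_threshold` are hypothetical.

```python
def score(key, response):
    """Return (precision, recall, f1) for two sets of (start, end, type) tuples.

    key is the human-marked set, response is the system output.
    Only exact span-and-type matches count as correct (an assumption).
    """
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def below_threshold(docs, threshold):
    """Return sorted names of documents whose F1 falls below the threshold.

    docs maps a document name to a (key, response) pair; these are the
    documents for which one would want to print debug information.
    """
    return sorted(
        name
        for name, (key, response) in docs.items()
        if score(key, response)[2] < threshold
    )
```

Running `below_threshold` over the same corpus after each change to the system gives a simple way to see which documents have regressed.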
Conclusion
• This talk: http://gate.ac.uk/sale/talks/corpora-tutorial.ppt
• More information: http://gate.ac.uk/