Explores challenges and solutions in building language resources for information retrieval systems, drawing on multilingual test collections built from newspapers, and emphasizes the importance of basic corpora, tools, and evaluation.
Infrastructures and Evaluation Donna Harman National Institute of Standards and Technology Gaithersburg, Maryland http://trec.nist.gov
Workshop on Cross-Linguistic Information Retrieval, SIGIR 1996 • Paper “Building a Large Multilingual Test Collection from Comparable News Documents” by Páraic Sheridan, Jean Paul Ballerini and Peter Schäuble • Used Swiss news agency (SDA) data in French, German and Italian
TREC-6 Cross-Language Track • In cooperation with the Swiss Federal Institute of Technology (ETH) • Task Summary: retrieval of English, French, and German documents, both in a monolingual and a cross-lingual mode • Guidelines: ad hoc task guidelines, plus all groups had to submit a monolingual baseline • Documents: • Neue Zürcher Zeitung (1994): German (200 MB) • SDA (1988-1990): French (250 MB), German (330 MB) • AP (1988-1990): English (759 MB) • Topics and relevance assessments all done at NIST
Major issues with language resources • No public domain stopword lists, stemmers, etc. for German and French • Jacques Savoy contributed a Porter-like stemmer for French and a stopword list • Martin Braschler and Paul Over from NIST built a simple German stemmer and decompounder • Questions from participants about how much of the final result was based on having access to “better” resources
Major issues in CLIR resources • Major lack of machine-readable bilingual dictionaries • Resulted in the use of limited dictionaries • Resulted in the use of assorted mapped word lists that were found on the web • Major lack of parallel corpora • Resulted in the use of comparable corpora • (Later) resulted in the mining of the web for parallel text • Heavy use of SYSTRAN in query translation
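As an illustration of what query translation with a limited dictionary or mapped word list looks like, here is a minimal sketch. The word list `TOY_DE_EN` and the function `translate_query` are hypothetical, invented for this example; they are not the resources or code used by any TREC-6 participant.

```python
from typing import Dict, List

# Hypothetical toy German-to-English word list (an assumption for this
# sketch); a real system would load a machine-readable bilingual
# dictionary or a mapped word list.
TOY_DE_EN: Dict[str, List[str]] = {
    "wahl": ["election", "choice"],
    "regierung": ["government"],
    "schweiz": ["switzerland"],
}


def translate_query(query: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each source term with all of its listed translations.

    Terms missing from the dictionary are passed through unchanged, so
    proper nouns and cognates can still match target documents.
    """
    translated: List[str] = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


if __name__ == "__main__":
    print(translate_query("Wahl Regierung Bern", TOY_DE_EN))
```

Keeping every listed translation makes the ambiguity problem visible: "Wahl" contributes both "election" and "choice", and a small word list gives no basis for choosing between them.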
Lessons learned from TREC-6 • Importance of basic corpora • Difficulty in locating public domain tools • Problems of building multilingual testing data in the U.S.; this led to European cooperation in later TRECs
Importance of Basic Corpora • The public availability of corpora, including text, speech and other multimedia data, is the most critical infrastructure • Newspapers (and their multimedia counterparts) are particularly valuable • Large volume readily available • Available in most languages • General purpose domain • Other genres are also important
Uses of These Corpora • The basic building block for IR test collections • A rich source of vocabulary and language structure information for many tasks • Use of comparable corpora, e.g. corpora from the same time period, allows statistical mining of cross-language, cross-media “word” pairs
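The statistical mining of word pairs from comparable corpora can be sketched roughly as follows. This assumes documents in two languages have already been paired (for example by publication date), and it scores pairs with the Dice coefficient over document co-occurrence; both the pairing and the choice of Dice are assumptions made for illustration, and the function name `mine_word_pairs` is hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Tuple


def mine_word_pairs(
    aligned_docs: Iterable[Tuple[str, str]], min_score: float = 0.5
) -> Dict[Tuple[str, str], float]:
    """Score (source word, target word) pairs by the Dice coefficient.

    aligned_docs holds (source text, target text) pairs that are assumed
    to cover the same events, e.g. news stories from the same day.
    """
    src_df: Counter = Counter()   # document frequency of source words
    tgt_df: Counter = Counter()   # document frequency of target words
    pair_df: Counter = Counter()  # aligned pairs containing both words

    for src_text, tgt_text in aligned_docs:
        src_words = set(src_text.lower().split())
        tgt_words = set(tgt_text.lower().split())
        src_df.update(src_words)
        tgt_df.update(tgt_words)
        pair_df.update(product(src_words, tgt_words))

    scores: Dict[Tuple[str, str], float] = {}
    for (s, t), both in pair_df.items():
        dice = 2 * both / (src_df[s] + tgt_df[t])
        if dice >= min_score:
            scores[(s, t)] = dice
    return scores


if __name__ == "__main__":
    # Toy data invented for this sketch.
    docs = [
        ("wahl bern", "election bern"),
        ("wahl zuerich", "election zurich"),
        ("regierung bern", "government bern"),
    ]
    for pair, score in sorted(mine_word_pairs(docs).items()):
        print(pair, round(score, 2))
```

On realistic data the candidate pair table grows quickly, so a practical version would filter stopwords and rare words before counting.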
Importance of Basic Tools • For IR – stopword lists, stemmers, decompounders, segmenters, etc. • For other NLP tasks, add parsers, part-of-speech taggers, noun phrase detectors, named entity recognizers, etc. • For MT, add sentence aligners, etc. • These need to be readily available for all languages
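To make the IR tools concrete, here is a minimal sketch of stopword removal combined with a lexicon-based German decompounder. The stopword list, the lexicon, and the function names are hypothetical toy values chosen for illustration; this is not the decompounder built for TREC-6. Linking elements such as the "s" in "Bundesregierung" are handled only by listing "bundes" in the lexicon, which is a simplification.

```python
from typing import List, Set

# Hypothetical toy resources (assumptions for this sketch).
STOPWORDS: Set[str] = {"der", "die", "das", "und", "in"}
LEXICON: Set[str] = {"bundes", "regierung", "wahl", "kampf"}


def split_compound(word: str, lexicon: Set[str], min_len: int = 3) -> List[str]:
    """Split a word into lexicon entries, trying the longest head first.

    Returns [word] unchanged if no full cover by lexicon entries exists.
    """
    if word in lexicon:
        return [word]
    for i in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_compound(tail, lexicon, min_len)
            if all(part in lexicon for part in rest):
                return [head] + rest
    return [word]


def index_terms(text: str) -> List[str]:
    """Lowercase, drop stopwords, and decompound the remaining tokens."""
    terms: List[str] = []
    for token in text.lower().split():
        if token in STOPWORDS:
            continue
        terms.extend(split_compound(token, LEXICON))
    return terms


if __name__ == "__main__":
    print(index_terms("Die Bundesregierung und der Wahlkampf"))
```

Even this small sketch depends on a word list for each language, which is why public availability of such basic resources matters.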
Other Basic Infrastructures • Parallel text • WordNets • Treebanks • Thesauri (often domain specific) • Machine-readable dictionaries • Knowledge bases such as CYC • Gazetteers, etc.
Critical Issues for Infrastructure • Widespread availability of what already exists; this is both an issue of good dissemination and reasonable costs • Serious examination of the cost/benefit ratio of building any new infrastructure by the funding agencies • A clearer relationship between infrastructure, tools, and evaluation
Proposal: Widespread availability • Set up a central worldwide site with links to a site in each country that catalogs publicly available corpora and tools • Be realistic about the costs of corpora; the costs of building corpora should be paid by funding agencies, and the corpora should therefore be available at a TRULY minimal cost
Proposal: Cost/Benefit Model • Look at basic corpora first • Prime target – a worldwide newspaper collection with at least 250 MB per language; look for publishing locations with multiple languages • Look at simple infrastructures also • Examples: lists of proper nouns, “crude” bilingual dictionaries, stemmers • Continue support of basic infrastructures like WordNets
Proposal: Role of Evaluation • Evaluation forums are critical to making progress in language technology • Encourage “friendly” competition; provide a common task focal point for research groups worldwide • Enable identification of good tools for broader dissemination • Identify what the real issues are and which types of new infrastructure would be most useful