Enhancing Multilingual Information Retrieval Infrastructure

Infrastructures and Evaluation • Donna Harman • National Institute of Standards and Technology, Gaithersburg, Maryland • http://trec.nist.gov

TREC Tasks

Workshop on Cross-Linguistic Information Retrieval, SIGIR 1996 • Paper “Building a Large Multilingual Test Collection from Comparable News Documents” by Páraic Sheridan, Jean Paul Ballerini and Peter Schäuble • Used Swiss news agency (SDA) data in French, German and Italian

TREC-6 Cross-Language Track • In cooperation with the Swiss Federal Institute of Technology (ETH) • Task Summary: retrieval of English, French, and German documents, both in a monolingual and a cross-lingual mode • Guidelines: ad hoc task guidelines, plus all groups had to submit a monolingual baseline • Documents: • Neue Zürcher Zeitung (1994): German (200 MB) • SDA (1988-1990): French (250 MB), German (330 MB) • AP (1988-1990): English (759 MB) • Topics and relevance assessments all done at NIST

TREC-6 Cross-Language Results - revised 01/20/98

Major issues with language resources • No public domain stopword lists, stemmers, etc. for German and French • Jacques Savoy contributed a Porter-like stemmer for French and a stopword list • Martin Braschler and Paul Over from NIST built a simple German stemmer and decompounder • Questions from participants about how much of the final result was based on having access to “better” resources
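To make the missing resources concrete, here is a minimal sketch of the three kinds of tools named above (stopword list, suffix-stripping stemmer, decompounder) applied to German text. The word lists, suffix table, and function names are all hypothetical placeholders chosen for illustration; this is not the stemmer or decompounder built for TREC-6, whose details the slides do not give.

```python
# Toy resources: every list below is a hypothetical placeholder, not the
# TREC-6 stopword list, stemmer rules, or decompounding lexicon.
STOPWORDS = {"der", "die", "das", "und", "in", "von"}
SUFFIXES = ("ungen", "ung", "en", "er", "es", "e", "n", "s")  # longest first
LEXICON = {"bund", "bahn", "hof", "haus", "regierung", "wahl", "kampf"}


def stem(word, min_stem=4):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def decompound(word, lexicon=LEXICON, min_part=3):
    """Split a compound into lexicon words, allowing a linking 's'.

    Returns [word] unchanged when no complete split is found.
    """
    if word in lexicon:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        # Treat a trailing 's' on the head as a linking element.
        if head.endswith("s") and head[:-1] in lexicon:
            head = head[:-1]
        if head in lexicon:
            rest = decompound(tail, lexicon, min_part)
            if all(part in lexicon for part in rest):
                return [head] + rest
    return [word]


def index_terms(text):
    """Lowercase, drop stopwords, split compounds, then stem each part."""
    terms = []
    for token in text.lower().split():
        if token in STOPWORDS:
            continue
        for part in decompound(token):
            terms.append(stem(part))
    return terms


print(index_terms("Die Regierungswahl und der Bahnhof"))
```

Even in this toy form, the output depends entirely on the quality of the lexicon and suffix table, which is one way to read the participants' question about how much of a result came from "better" resources.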

Major issues in CLIR resources • Major lack of machine-readable bilingual dictionaries • Resulted in the use of limited dictionaries • Resulted in the use of assorted mapped word lists that were found on the web • Major lack of parallel corpora • Resulted in the use of comparable corpora • (Later) resulted in the mining of the web for parallel text • Heavy use of SYSTRAN in query translation
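The dictionary-based approach mentioned above can be sketched as a word-by-word lookup. The English-to-German entries and the `translate_query` helper are hypothetical, standing in for whatever limited dictionary or mapped word list a group had available; this is an illustration under those assumptions, not any participant's actual method.

```python
# Hypothetical English-to-German word list; a real system would load a
# machine-readable dictionary or a mapped word list instead.
EN_DE = {
    "election": ["wahl", "abstimmung"],
    "government": ["regierung"],
    "train": ["zug", "bahn"],
}


def translate_query(query, dictionary=EN_DE):
    """Replace each source term with all of its listed translations.

    Terms missing from the dictionary are passed through untranslated
    and also reported, so coverage gaps are visible to the caller.
    """
    translated, missing = [], []
    for term in query.lower().split():
        if term in dictionary:
            translated.extend(dictionary[term])
        else:
            translated.append(term)
            missing.append(term)
    return translated, missing


terms, gaps = translate_query("Government election Zurich")
print(terms, gaps)
```

Two weaknesses of limited dictionaries show up directly: ambiguous entries add every listed translation to the query, and out-of-vocabulary terms fall through unchanged.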

Lessons learned from TREC-6 • Importance of basic corpora • Difficulty in locating public domain tools • Problems of building multilingual testing data in the U.S.; this led to European cooperation in later TRECs

Importance of Basic Corpora • The public availability of corpora, including text, speech and other multimedia data, is the most critical infrastructure • Newspapers (and their multimedia counterparts) are particularly valuable • Large volume readily available • Available in most languages • General purpose domain • Other genres are also important

Uses of these Corpora • The basic building block for IR test collections • A rich source of vocabulary and language structure information for many tasks • Use of comparable corpora, e.g. corpora from the same time period, allows statistical mining of cross-language, cross-media “word” pairs
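One simple way to picture the statistical mining of word pairs from comparable corpora is to align documents by publication date and score each cross-language pair by how often the two words appear on the same dates. The sketch below uses the Dice coefficient for that score; the scoring choice, the function names, and the tiny corpora are assumptions made for illustration, not a method described in the slides.

```python
from collections import defaultdict


def dates_by_word(docs):
    """Map each word to the set of dates on which it appears.

    docs is an iterable of (date, text) pairs.
    """
    index = defaultdict(set)
    for date, text in docs:
        for word in set(text.lower().split()):
            index[word].add(date)
    return index


def mine_pairs(source_docs, target_docs, min_score=0.5):
    """Score cross-language word pairs by the Dice coefficient of the
    sets of dates on which each word occurs."""
    src = dates_by_word(source_docs)
    tgt = dates_by_word(target_docs)
    pairs = []
    for s_word, s_dates in src.items():
        for t_word, t_dates in tgt.items():
            shared = len(s_dates & t_dates)
            score = 2 * shared / (len(s_dates) + len(t_dates))
            if score >= min_score:
                pairs.append((s_word, t_word, score))
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical date-aligned German and French news snippets.
german = [("1994-01-01", "wahl regierung"), ("1994-01-02", "bahn"),
          ("1994-01-03", "wahl")]
french = [("1994-01-01", "scrutin gouvernement"), ("1994-01-02", "train"),
          ("1994-01-03", "scrutin")]

for pair in mine_pairs(german, french, min_score=1.0):
    print(pair)
```

With so little data the scores are noisy (words that merely share a date can outscore true translations), which is why the volume of the underlying corpora matters for this kind of mining.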

Importance of Basic Tools • For IR – stopword lists, stemmers, decompounders, segmenters, etc. • For other NLP tasks, add parsers, part-of-speech taggers, noun phrase detectors, named entity recognizers, etc. • For MT, add sentence aligners, etc. • These need to be readily available for all languages

Other Basic Infrastructures • Parallel text • WordNets • Treebanks • Thesauri (often domain specific) • Machine-readable dictionaries • Knowledge bases such as CYC • Gazetteers, etc.

Critical Issues for Infrastructure • Widespread availability of what already exists; this is both an issue of good dissemination and reasonable costs • Serious examination of the cost/benefit ratio of building any new infrastructure by the funding agencies • A clearer relationship between infrastructure, tools, and evaluation

Proposal: Widespread availability • Set up a central worldwide site with links to a site in each country that catalogs publicly available corpora and tools • Be realistic about the costs of corpora; the costs of building corpora should be paid by funding agencies, and the corpora should therefore be available at a TRULY minimal cost

Proposal: Cost/Benefit Model • Look at basic corpora first • Prime target – a worldwide newspaper collection with at least 250 MB per language; look for publishing locations with multiple languages • Look at simple infrastructures also • Examples: lists of proper nouns, “crude” bilingual dictionaries, stemmers • Continue support of basic infrastructures like WordNets

Proposal: Role of Evaluation • Evaluation forums are critical to making progress in language technology • Encourage “friendly” competition; provide a common task focal point for research groups worldwide • Enable identification of good tools for broader dissemination • Identify the real issues and the most useful types of new infrastructure needed
