Some Thoughts on Evaluation of Information Retrieval and its Impacts
Donna Harman, National Institute of Standards and Technology
What is information retrieval?
• It's the techniques behind the Google-like search engines that find information on the web
• It's the techniques behind local search engines that work within our organization to help us search archives
• It's the techniques behind online catalogs that have been available in libraries since the early 1980s
So how is IR evaluated?
• By its moneymaking ability (ad revenues)
• By its interface, cost, and ease of installation
• BUT ALSO
• By how accurate the search is, i.e., what percentage of the documents returned are useful (precision)
• By how complete the search is, i.e., what percentage of the total number of useful documents are actually returned (recall); see the sketch below
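A minimal sketch of how these two measures are computed for a single question (illustrative only; the document ids and counts below are made up, not taken from any real collection):

```python
# Minimal sketch: precision and recall for one question, given the documents
# a search engine returned and the set of documents judged useful (relevant).

def precision_recall(returned_docs, relevant_docs):
    """returned_docs: ranked list of doc ids the engine returned.
    relevant_docs: set of doc ids judged useful for this question."""
    returned = set(returned_docs)
    hits = returned & set(relevant_docs)  # useful documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant_docs) if relevant_docs else 0.0
    return precision, recall

# Example: the engine returns 4 documents, 3 of them useful,
# out of 6 useful documents in the whole collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        {"d1", "d2", "d4", "d7", "d8", "d9"})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

In a full evaluation these per-question numbers are averaged over all the questions in the test collection.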
Test Collections
• Ideally we would like to evaluate search engines by having thousands of users pound away at them
• As a surrogate, we build a test collection:
  • A document set
  • User questions
  • The set of documents that answer those questions
Cranfield test collection (1960)
• 1400 abstracts in the field of aeronautical engineering
• 225 questions (called requests)
• Complete list of those abstracts that answered the questions (created by users reading all 1400 abstracts)
How was this used?
• Researchers used the test collection to develop the early search engines, such as those at Cornell University (SMART) and at Cambridge University
• The first statistical methods for creating ranked lists of documents were tested on the Cranfield test collection, as sketched below
• Additional tools, such as variations on automatic indexing, were also developed
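A minimal sketch (illustrative, not the SMART or Cambridge systems themselves) of the kind of statistical ranking method these groups pioneered: weight each query term by tf-idf and return documents in order of summed weight. The toy document texts and the exact weighting formula here are assumptions for the example.

```python
# Minimal sketch of tf-idf ranked retrieval, the kind of statistical
# method first tested on collections like Cranfield.
import math
from collections import Counter

docs = {
    "d1": "boundary layer flow over a flat plate",
    "d2": "heat transfer in supersonic flow",
    "d3": "flat plate heat transfer experiments",
}

def tokenize(text):
    return text.lower().split()

doc_terms = {d: Counter(tokenize(t)) for d, t in docs.items()}
n_docs = len(docs)
df = Counter()                      # document frequency of each term
for terms in doc_terms.values():
    df.update(terms.keys())

def rank(query):
    """Rank documents by the summed tf * idf weight of the query terms."""
    scores = {}
    for d, terms in doc_terms.items():
        s = 0.0
        for term in tokenize(query):
            if term in terms:
                s += terms[term] * math.log(n_docs / df[term])
        if s > 0:
            scores[d] = s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank("heat transfer on a flat plate"))  # d3 ranks first in this toy example
```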
TREC-1 collection (1992)
• 2 gigabytes of documents from newspapers (Wall Street Journal), newswires (AP), and government documents
• 50 user questions (called topics in TREC)
• The set of documents that answer these questions (called relevant documents); a valid sample set judged by surrogate users
Text REtrieval Conference (TREC)
• Started in 1992 with 25 participating groups and 2 tasks
• Major goal: to encourage research in text retrieval based on large (reusable) test collections
• TREC 2004 had 103 participating groups from over 20 countries and 7 tasks
TREC Tracks (diagram: themes and the tracks that explored them)
• Retrieval in a domain: Genome
• Answers, not docs: Novelty, Q&A
• Web searching, size: Terabyte, Web, VLC
• Beyond text: Video, Speech, OCR
• Beyond just English: X→{X,Y,Z}, Chinese, Spanish
• Human-in-the-loop: Interactive, HARD
• Streamed text: Filtering, Routing
• Static text: Ad Hoc, Robust
TREC-6 Cross Language Task
• Task: starting in one language, search for documents written in a variety of languages
• Document collection
  • SDA: French (257 MB), German (331 MB)
  • Neue Zürcher Zeitung (198 MB, in German)
  • AP (759 MB, in English)
• 25 topics
  • Mostly built in English and translated at NIST into French and German; also translated elsewhere into Spanish and Dutch
• Relevance judgments made at NIST by two tri-lingual surrogate users (who built the topics)
TREC-7 Cross Language Task
• Document collection
  • SDA: French (257 MB), German (331 MB), Italian (194 MB)
  • Neue Zürcher Zeitung (198 MB, in German)
  • AP (759 MB, in English)
• 28 topics (7 each from 4 sites)
• Relevance judgments made independently at each site by native speakers
CLIR Evaluation Issues
• How to ensure that the "translated" topics represent how an information request would be made in a given language
• How to ensure that there is enough common understanding of the topic so that the relevance judgments are consistent
• How to ensure that the sampling techniques for the relevance judging are complete enough (see the pooling sketch below)
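On the last point, a minimal sketch of pooling, the sampling technique commonly used in TREC-style evaluations to keep relevance judging tractable (an illustrative assumption about the general approach, not the exact CLEF procedure): the top-ranked documents from every submitted run are merged, and only that pool is shown to the judges.

```python
# Minimal sketch of pooling for relevance judging (illustrative).

def build_pool(runs, depth=100):
    """runs: {system_name: ranked list of doc ids for one topic}.
    Returns the union of the top `depth` documents from every run;
    only these documents are read and judged by the human assessors."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

# Toy example with three systems and a pool depth of 2.
runs = {
    "sysA": ["d3", "d1", "d9"],
    "sysB": ["d1", "d7", "d3"],
    "sysC": ["d5", "d3", "d1"],
}
print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd3', 'd5', 'd7']
```

The worry raised on the slide is whether such a pool is complete enough: relevant documents that no participating system ranked highly never get judged.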
CLEF 2000
• TREC European CLIR task moved to the new Cross Language Evaluation Forum (CLEF) in 2000; number of participants grew from 12 to 20!!
• Document collection stayed the same, but the topics were created in 8 languages
• The "seriousness" of the experiments also took a big jump
CLEF 2004
• Document collection expanded to include major newspapers in 10 languages!!
• This means that the CLEF test collection is now a cooperative project across at least 10 groups in Europe
• Topics translated into 14 languages
• Fifty-five participating groups
• Six tasks, including question answering and image retrieval
TREC Tracks (track diagram repeated as a section divider; see the list above)
Spoken Document Retrieval
• Documents: TREC-7 had 100 hours of news broadcasts; TREC-8 had 550 hours / 21,500 stories
• Topics: similar to "standard" TREC topics; 23 in TREC-7 and 50 in TREC-8
• "Documents" were available as several baseline recognizer outputs (at different error rates), along with transcripts
Video Retrieval
• Video retrieval is not just speech retrieval, even though that is a major component of many current systems
• TREC 2001 had a video retrieval track with 11 hours of video, 2 tasks (shot boundary and search), and 12 participants
• TRECvid 2004 had 70 hours, 4 tasks (feature extraction and story segmentation added), and 33 participants
TREC Tracks (track diagram repeated as a section divider; see the list above)
TREC-8 Factoid QA
• Document collection was the "regular" TREC-8 collection
• Topics are now questions (198 of them)
  • How many calories are in a Big Mac?
  • Where is the Taj Mahal?
• Task is to retrieve answer strings of 50 or 250 bytes, not a document list
• Humans determine correct answers from what is submitted
Moving beyond factoid QA
• Use of questions from MSN and AskJeeves logs in 2000 and 2001
• Addition of questions with no answers in 2001, and reduction to 50-byte answers
• Requirement of exact answers in 2002
• Addition of "definition"/"who is" questions in 2003
• Expansion of these questions to include exact and list slots in 2004
• Addition of events and a pilot of relationship questions planned for 2005
Cross-language QA
• CLEF started cross-language QA in 2004
• Lots of interest, lots of unique issues
  • Do cultural aspects make this more difficult than cross-language document retrieval?
  • What language should the answers be in?
  • Are there specific types of questions that should be tested?
  • Would it be interesting to test question answering in a specific domain?
For more information
• trec.nist.gov for more on TREC
• www.clef-campaign.org for more on CLEF
• research.nii.ac.jp/ntcir for more on NTCIR (evaluation in Japanese, Chinese, and Korean)