Some Thoughts on Evaluation of Information Retrieval and its Impacts
Donna Harman, National Institute of Standards and Technology
What is information retrieval?
• It's the techniques behind the Google-like search engines that find information on the web
• It's the techniques behind local search engines that work within our organization to help us search archives
• It's the techniques behind online catalogs that have been available in libraries since the early 1980s
So how is IR evaluated?
• By its moneymaking ability (ad revenues)
• By its interface, cost, and ease of installation
• BUT ALSO
• By how accurate the search is, i.e., what percentage of the documents returned are useful (precision)
• By how complete the search is, i.e., what percentage of the total number of useful documents are actually returned (recall); see the sketch below
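A minimal sketch of how these two measures are computed for a single question (illustrative only; the document ids and counts below are made up, not taken from any real collection):

```python
# Minimal sketch: precision and recall for one question, given the documents
# a search engine returned and the set of documents judged useful (relevant).

def precision_recall(returned_docs, relevant_docs):
    """returned_docs: ranked list of doc ids the engine returned.
    relevant_docs: set of doc ids judged useful for this question."""
    returned = set(returned_docs)
    hits = returned & set(relevant_docs)  # useful documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant_docs) if relevant_docs else 0.0
    return precision, recall

# Example: the engine returns 4 documents, 3 of them useful,
# out of 6 useful documents in the whole collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        {"d1", "d2", "d4", "d7", "d8", "d9"})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

In a full evaluation these per-question numbers are averaged over all the questions in the test collection.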
Test Collections
• Ideally we would like to evaluate search engines by having thousands of users pound away at them
• As a surrogate, we build a test collection:
  • A document set
  • User questions
  • The set of documents that answer those questions
Cranfield test collection (1960)
• 1400 abstracts in the field of aeronautical engineering
• 225 questions (called requests)
• Complete list of those abstracts that answered the questions (created by users reading all 1400 abstracts)
How was this used?
• Researchers used the test collection to develop the early search engines, such as those at Cornell University (SMART) and at Cambridge University
• The first statistical methods for creating ranked lists of documents were tested on the Cranfield test collection, as sketched below
• Additional tools, such as variations on automatic indexing, were also developed
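A minimal sketch (illustrative, not the SMART or Cambridge systems themselves) of the kind of statistical ranking method these groups pioneered: weight each query term by tf-idf and return documents in order of summed weight. The toy document texts and the exact weighting formula here are assumptions for the example.

```python
# Minimal sketch of tf-idf ranked retrieval, the kind of statistical
# method first tested on collections like Cranfield.
import math
from collections import Counter

docs = {
    "d1": "boundary layer flow over a flat plate",
    "d2": "heat transfer in supersonic flow",
    "d3": "flat plate heat transfer experiments",
}

def tokenize(text):
    return text.lower().split()

doc_terms = {d: Counter(tokenize(t)) for d, t in docs.items()}
n_docs = len(docs)
df = Counter()                      # document frequency of each term
for terms in doc_terms.values():
    df.update(terms.keys())

def rank(query):
    """Rank documents by the summed tf * idf weight of the query terms."""
    scores = {}
    for d, terms in doc_terms.items():
        s = 0.0
        for term in tokenize(query):
            if term in terms:
                s += terms[term] * math.log(n_docs / df[term])
        if s > 0:
            scores[d] = s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank("heat transfer on a flat plate"))  # d3 ranks first in this toy example
```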
TREC-1 collection (1992)
• 2 gigabytes of documents from newspapers (Wall Street Journal), newswires (AP), and government documents
• 50 user questions (called topics in TREC)
• The set of documents that answer these questions (called relevant documents); a valid sample set judged by surrogate users
Text REtrieval Conference (TREC)
• Started in 1992 with 25 participating groups and 2 tasks
• Major goal: to encourage research in text retrieval based on large (reusable) test collections
• TREC 2004 had 103 participating groups from over 20 countries and 7 tasks
TREC Tracks (diagram: themes and the tracks that explored them)
• Retrieval in a domain: Genome
• Answers, not docs: Novelty, Q&A
• Web searching, size: Terabyte, Web, VLC
• Beyond text: Video, Speech, OCR
• Beyond just English: X→{X,Y,Z}, Chinese, Spanish
• Human-in-the-loop: Interactive, HARD
• Streamed text: Filtering, Routing
• Static text: Ad Hoc, Robust
TREC-6 Cross Language Task
• Task: starting in one language, search for documents written in a variety of languages
• Document collection
  • SDA: French (257 MB), German (331 MB)
  • Neue Zürcher Zeitung (198 MB, in German)
  • AP (759 MB, in English)
• 25 topics
  • Mostly built in English and translated at NIST into French and German; also translated elsewhere into Spanish and Dutch
• Relevance judgments made at NIST by two tri-lingual surrogate users (who built the topics)
TREC-7 Cross Language Task
• Document collection
  • SDA: French (257 MB), German (331 MB), Italian (194 MB)
  • Neue Zürcher Zeitung (198 MB, in German)
  • AP (759 MB, in English)
• 28 topics (7 each from 4 sites)
• Relevance judgments made independently at each site by native speakers
CLIR Evaluation Issues
• How to ensure that the "translated" topics represent how an information request would be made in a given language
• How to ensure that there is enough common understanding of the topic so that the relevance judgments are consistent
• How to ensure that the sampling techniques for the relevance judging are complete enough (see the pooling sketch below)
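On the last point, a minimal sketch of pooling, the sampling technique commonly used in TREC-style evaluations to keep relevance judging tractable (an illustrative assumption about the general approach, not the exact CLEF procedure): the top-ranked documents from every submitted run are merged, and only that pool is shown to the judges.

```python
# Minimal sketch of pooling for relevance judging (illustrative).

def build_pool(runs, depth=100):
    """runs: {system_name: ranked list of doc ids for one topic}.
    Returns the union of the top `depth` documents from every run;
    only these documents are read and judged by the human assessors."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

# Toy example with three systems and a pool depth of 2.
runs = {
    "sysA": ["d3", "d1", "d9"],
    "sysB": ["d1", "d7", "d3"],
    "sysC": ["d5", "d3", "d1"],
}
print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd3', 'd5', 'd7']
```

The worry raised on the slide is whether such a pool is complete enough: relevant documents that no participating system ranked highly never get judged.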
CLEF 2000
• TREC European CLIR task moved to the new Cross Language Evaluation Forum (CLEF) in 2000; number of participants grew from 12 to 20!!
• Document collection stayed the same, but the topics were created in 8 languages
• The "seriousness" of the experiments also took a big jump
CLEF 2004
• Document collection expanded to include major newspapers in 10 languages!!
• This means that the CLEF test collection is now a cooperative project across at least 10 groups in Europe
• Topics translated into 14 languages
• Fifty-five participating groups
• Six tasks, including question answering and image retrieval
TREC Tracks (track diagram repeated as a section divider; see the list above)
Spoken Document Retrieval
• Documents: TREC-7 had 100 hours of news broadcasts; TREC-8 had 550 hours / 21,500 stories
• Topics: similar to "standard" TREC topics; 23 in TREC-7 and 50 in TREC-8
• "Documents" were available as several baseline recognizer outputs (at different error rates), along with transcripts
Video Retrieval
• Video retrieval is not just speech retrieval, even though that is a major component of many current systems
• TREC 2001 had a video retrieval track with 11 hours of video, 2 tasks (shot boundary and search), and 12 participants
• TRECvid 2004 had 70 hours, 4 tasks (feature extraction and story segmentation added), and 33 participants
TREC Tracks (track diagram repeated as a section divider; see the list above)
TREC-8 Factoid QA
• Document collection was the "regular" TREC-8 collection
• Topics are now questions (198 of them)
  • How many calories are in a Big Mac?
  • Where is the Taj Mahal?
• Task is to retrieve answer strings of 50 or 250 bytes, not a document list
• Humans determine correct answers from what is submitted
Moving beyond factoid QA
• Use of questions from MSN and AskJeeves logs in 2000 and 2001
• Addition of questions with no answers in 2001, and reduction to 50-byte answers
• Requirement of exact answers in 2002
• Addition of "definition"/"who is" questions in 2003
• Expansion of these questions to include exact and list slots in 2004
• Addition of events and a pilot of relationship questions planned for 2005
Cross-language QA
• CLEF started cross-language QA in 2004
• Lots of interest, lots of unique issues
  • Do cultural aspects make this more difficult than cross-language document retrieval?
  • What language should the answers be in?
  • Are there specific types of questions that should be tested?
  • Would it be interesting to test question answering in a specific domain?
For more information
• trec.nist.gov for more on TREC
• www.clef-campaign.org for more on CLEF
• research.nii.ac.jp/ntcir for more on NTCIR (evaluation in Japanese, Chinese, and Korean)