Some Thoughts on Evaluation of Information Retrieval and its Impacts

Donna Harman

National Institute of Standards and Technology


What is information retrieval?

  • It’s the techniques behind Google-like search engines that find information on the web

  • It’s the techniques behind local search engines that work within our organization to help us search archives

  • It’s the techniques behind online catalogs that have been available in libraries since the early 1980s


So how is IR evaluated?

  • By its moneymaking ability (ad revenues)

  • By its interface, cost, ease of installation

  • BUT ALSO

    • By how accurate the search is, i.e., what percentage of the documents returned are useful (precision)

    • By how complete the search is, i.e., what percentage of the total number of useful documents are actually returned (recall)
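
Precision and recall can be computed directly from the set of documents a system returns and the set of documents judged useful. A minimal sketch in Python; the document IDs are invented for illustration:

```python
# Minimal sketch of precision and recall for a single query.
# The document IDs are invented; in TREC they would come from a system's
# run file and from the relevance judgments (qrels).

retrieved = {"d1", "d2", "d3", "d4", "d5"}   # documents the engine returned
relevant = {"d2", "d4", "d7", "d9"}          # documents judged useful

hits = retrieved & relevant                  # useful documents actually returned

precision = len(hits) / len(retrieved)       # fraction of returned docs that are useful
recall = len(hits) / len(relevant)           # fraction of useful docs that were returned

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.40, recall = 0.50
```

In an evaluation, these per-query numbers are then averaged over all the questions in a test collection.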


Test Collections

  • Ideally we would like to evaluate search engines by using thousands of users pounding away at them

  • As a surrogate, we build a test collection

    • Document set

    • User questions

    • The set of documents that answer those questions
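
Concretely, the three parts can be pictured as three small data structures. A minimal sketch; every ID and string here is invented:

```python
# The three parts of a test collection, with invented content.

# 1. Document set: an ID for each document plus its text.
documents = {
    "doc1": "Boundary layer transition on a swept wing ...",
    "doc2": "Heat transfer in a supersonic nozzle ...",
}

# 2. User questions (topics).
topics = {
    "q1": "What is known about boundary layer transition on swept wings?",
}

# 3. Relevance judgments: for each question, the documents that answer it.
qrels = {
    "q1": {"doc1"},
}
```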


Cranfield test collection (1960)

  • 1400 abstracts in the field of aeronautical engineering

  • 225 questions (called requests)

  • Complete list of those abstracts that answered the questions (created by users reading all 1400 abstracts)


How was this used?

  • Researchers used the test collection to develop early search engines, such as those at Cornell University (SMART) and at Cambridge University

  • The first statistical methods for creating ranked lists of documents were tested on the Cranfield test collection

  • Additional tools such as variations on automatic indexing were also developed


TREC-1 collection (1992)

  • 2 gigabytes of documents from newspapers (Wall Street Journal), newswires (AP), and government documents

  • 50 user questions (called topics in TREC)

  • The set of documents that answer these questions (called relevant documents); a valid sample of the collection is judged by surrogate users
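
That "valid sample" is built by pooling: rather than reading all 2 gigabytes for every topic, the assessors judge only the union of the top-ranked documents from each submitted system. A minimal sketch with invented run data (TREC's actual pools typically went to depth 100):

```python
# Sketch of TREC-style pooling: judge only the union of the top-k documents
# from each participating system, not the whole collection.
# The run data below is invented for illustration.

runs = {
    "system_A": ["d12", "d07", "d33", "d02", "d98"],
    "system_B": ["d07", "d55", "d12", "d41", "d02"],
}

POOL_DEPTH = 3  # kept tiny here; TREC typically pooled to depth 100

pool = set()
for ranked_list in runs.values():
    pool.update(ranked_list[:POOL_DEPTH])

print(sorted(pool))
# ['d07', 'd12', 'd33', 'd55']  <- only these go to the surrogate users for judging
```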


Text REtrieval Conference (TREC)

  • Started in 1992 with 25 participating groups and 2 tasks

  • Major goal: to encourage research in text retrieval based on large (reusable) test collections

  • TREC 2004 had 103 participating groups from over 20 countries and 7 tasks


TREC Impacts


TREC Tracks

The TREC tracks, grouped by theme:

  • Retrieval in a domain: Genome

  • Answers, not docs: Novelty, Q&A

  • Web searching, size: Terabyte, Web, VLC

  • Beyond text: Video, Speech, OCR

  • Beyond just English: X{X,Y,Z}, Chinese, Spanish

  • Human-in-the-loop: Interactive, HARD

  • Streamed text: Filtering, Routing

  • Static text: Ad Hoc, Robust


TREC-6 Cross-Language Task

  • Task: starting in one language, search for documents written in a variety of languages

    • Document collection

      • SDA: French (257 MB), German (331 MB)

      • Neue Zürcher Zeitung (198 MB, in German)

      • AP (759 MB, in English)

    • 25 topics:

      • Mostly built in English and translated at NIST into French and German; also translated elsewhere into Spanish and Dutch

    • Relevance judgments made at NIST by two trilingual surrogate users (who built the topics)


TREC-7 Cross-Language Task

  • Document collection:

    • SDA: French (257 MB), German (331 MB), Italian (194 MB)

    • Neue Zürcher Zeitung (198 MB, in German)

    • AP (759 MB, in English)

  • 28 topics (7 each from 4 sites)

  • Relevance judgments made independently at each site by native speakers


CLIR Evaluation Issues

  • How to ensure that the “translated” topics represent how an information request would be made in a given language

  • How to ensure that there is enough common understanding of the topic so that the relevance judgments are consistent

  • How to ensure that the sampling techniques for the relevance judging are complete enough


CLEF 2000

  • The TREC European CLIR task moved to the new Cross-Language Evaluation Forum (CLEF) in 2000; the number of participants grew from 12 to 20!!

  • Document collection stayed the same but the topics were created in 8 languages

  • The “seriousness” of the experiments also took a big jump


CLEF 2004

  • Document collection expanded to include major newspapers in 10 languages!!

  • This means that the CLEF test collection is now a cooperative project across at least 10 groups in Europe

  • Topics translated into 14 languages

  • Fifty-five participating groups

  • Six tasks, including question answering and image retrieval


TREC Tracks

[The TREC tracks figure is repeated here; see the grouping by theme above.]


Spoken Document Retrieval

  • Documents: TREC-7 had 100 hours of news broadcasts; 550 hours/21,500 stories in TREC-8.

  • Topics: similar to “standard” TREC topics, 23 in TREC-7 and 50 in TREC-8

  • “Documents” were available as several baseline recognizer outputs (at different error rates), along with transcripts
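
The "error rate" here is word error rate (WER): the number of word substitutions, insertions, and deletions in the recognizer output, divided by the number of words in the reference transcript. A minimal sketch of that computation; the example sentences are invented:

```python
# Minimal word error rate (WER) sketch: edit distance over words between a
# reference transcript and a recognizer hypothesis, divided by the number of
# reference words.  The example sentences are invented.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the budget was approved today", "the budget was proved to day"))
# 0.6  (two substitutions and one insertion over five reference words)
```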


Video Retrieval

  • Video retrieval is not just speech retrieval, even though that is a major component of many current systems

  • TREC 2001 had a video retrieval track with 11 hours of video, 2 tasks (shot boundary and search), and 12 participants

  • TRECVID 2004 had 70 hours, 4 tasks (feature extraction and story segmentation added), and 33 participants


TREC Tracks

[The TREC tracks figure is repeated here; see the grouping by theme above.]


TREC-8 Factoid QA

  • Document collection was “regular” TREC-8

  • Topics are now questions (198 of them)

    • How many calories are in a Big Mac?

    • Where is the Taj Mahal?

  • Task is to retrieve answer strings of 50 or 250 bytes, not a document list

  • Humans determine correct answers from what is submitted
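
At TREC-8 that judging was done entirely by human assessors; the sketch below only approximates it with an answer-pattern match, but it shows the shape of the check: is the submitted string within the byte limit, and does it contain the answer? All strings and patterns here are invented:

```python
# Rough sketch of checking one factoid QA submission.  The byte limit comes
# from the task definition; the regular-expression match is only a stand-in
# for the human judgment actually used at TREC-8.

import re

def acceptable(answer_string: str, answer_pattern: str, limit: int = 50) -> bool:
    within_limit = len(answer_string.encode("utf-8")) <= limit
    contains_answer = re.search(answer_pattern, answer_string, re.IGNORECASE) is not None
    return within_limit and contains_answer

print(acceptable("The Taj Mahal is in Agra, India.", r"\bAgra\b"))    # True
print(acceptable("It is somewhere in northern India.", r"\bAgra\b"))  # False
```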


Moving beyond factoid QA

  • Use of questions from MSN, AskJeeves logs in 2000 and 2001

  • Addition of questions with no answers in 2001 and reduction to 50-byte answers

  • Requirement of exact answers in 2002

  • Addition of “definition”/“who is” questions in 2003

  • Expansion of these questions to include exact and list slots in 2004

  • Addition of events, pilot of relationship questions planned for 2005


Cross-language QA

  • CLEF started Cross-language QA in 2004

  • Lots of interest, lots of unique issues

    • Do cultural aspects make this more difficult than cross-language document retrieval??

    • What language should the answers be in??

    • Are there specific types of questions that should be tested??

    • Would it be interesting to test question answering in a specific domain??


For more information

trec.nist.gov for more on TREC

www.clef-campaign.org for more on CLEF

research.nii.ac.jp/ntcir for more on NTCIR (evaluation in Japanese, Chinese, and Korean)

