Chapter 3: Retrieval Evaluation

Modern Information Retrieval

Ricardo Baeza-Yates

Berthier Ribeiro-Neto

Hsu Yi-Chen, NCU MIS 88423043



Outline
  • Introduction

  • Retrieval Performance Evaluation

    • Recall and precision

    • Alternative measures

  • Reference Collections

    • TREC Collection

    • CACM&ISI Collection

    • CF Collection

  • Trends and Research Issues


Introduction
  • Type of evaluation

    • Functional analysis phase, and Error analysis phase

    • Performance evaluation

  • Performance of the IR system

    • Performance evaluation

      • Response time/space required

    • Retrieval performance evaluation

      • The evaluation of how precise the answer set is

Retrieval performance evaluation for IR system

Goodness of a retrieval strategy S = the similarity between the set of documents retrieved by S and the set of relevant documents provided by specialists, as quantified by an evaluation measure

Retrieval Performance Evaluation (cont.)

  • Evaluates IR systems driven mainly by batch queries

[Figure: Venn diagram of the answer set A and the relevant documents R; their intersection Ra is the set of relevant docs in the answer set, which is sorted by relevance]
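The two quantities in the figure can be sketched in a few lines of Python (a minimal sketch; the document IDs below are hypothetical):

```python
def recall_precision(relevant, answer):
    """Recall = |Ra| / |R| and precision = |Ra| / |A|,
    where Ra is the intersection of the relevant set R
    and the answer set A."""
    ra = relevant & answer
    return len(ra) / len(relevant), len(ra) / len(answer)

# Hypothetical example: 4 of the 10 relevant documents
# appear in an answer set of 8 documents.
R = {3, 5, 9, 25, 39, 44, 56, 89, 123, 129}
A = {3, 6, 9, 25, 48, 56, 84, 250}
recall, precision = recall_precision(R, A)  # 0.4, 0.5
```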

Precision versus recall curve

  • Rq={d3,d5,d9,d25,d39,d44,d56,d89,d123}

Ranking for query q:

[Figure: the ranked answer for query q, with the relevant documents marked]

  • 100% at 10%

  • 66% at 20%

  • 50% at 30%

  • Usually based on 11 standard recall levels: 0%, 10%, …, 100%

Precision versus recall curve

  • For a single query



  • P(r) = Σ Pi(r) / Nq

  • P(r) = the average precision at recall level r

  • Nq = the number of queries used

  • Pi(r) = the precision at recall level r for the i-th query



Interpolated precision

  • Rq={d3,d56,d129}

  • Let rj, j ∈ {0, 1, 2, …, 10}, be a reference to the j-th standard recall level

  • P(rj) = max P(r), for rj ≤ r ≤ rj+1
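The two slides above can be combined into a short sketch: walk the ranking, record a (recall, precision) point at each relevant document, then interpolate to the 11 standard levels. The ranking below is hypothetical, and the interpolation takes the maximum precision at any recall ≥ rj, a common reading of the rule above:

```python
def interpolated_11pt(ranking, relevant):
    """Interpolated precision P(rj) = max P(r) over r >= rj,
    evaluated at the 11 standard recall levels 0%..100%."""
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    return [max((p for r, p in points if r >= j / 10), default=0.0)
            for j in range(11)]

# Rq = {d3, d56, d129} as on the slide; hypothetical ranking in
# which the relevant docs land at positions 1, 3 and 8:
curve = interpolated_11pt(
    ["d56", "d84", "d129", "d6", "d8", "d9", "d511", "d3"],
    {"d3", "d56", "d129"})
# 1.0 at recall levels 0%-30%, then 2/3, then 3/8
```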

Average recall versus precision figures for two different algorithms

Single Value Summaries

  • The previous average precision versus recall figures:

    • compare retrieval algorithms over a set of example queries

  • But performance on individual queries also matters, because:

    • averaging precision may hide anomalous behavior in an algorithm

    • we may want to know how two algorithms compare on one specific query

  • Solution:

    • consider a single precision value for each query

    • The single value should be interpreted as a summary of the corresponding precision versus recall curve

    • Usually, the single value summary is taken to be the precision at a given recall level

Average Precision at Seen Relevant Documents

  • Averaging the precision figures obtained after each new relevant document is observed.

  • Fig. 3.2: (1 + 0.66 + 0.5 + 0.4 + 0.3)/5 = 0.57

  • This measure favors systems that find relevant documents quickly (the higher a relevant document is ranked, the higher the precision value it contributes)
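A sketch of this measure, using a hypothetical ranking in which the five relevant documents appear at positions 1, 3, 6, 10 and 15 (so the average comes out near the slide's 0.57):

```python
def avg_precision_at_seen(ranking, relevant):
    """Average the precision values observed each time a new
    relevant document shows up in the ranking."""
    precisions, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
avg_precision_at_seen(ranking, {"d123", "d56", "d9", "d25", "d3"})
# (1 + 2/3 + 1/2 + 2/5 + 1/3) / 5 = 0.58
```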

R-Precision

  • Computing the precision at the R-th position in the ranking (the proportion of relevant documents among the first R retrieved)

  • R: the total number of relevant documents for the current query (the size of Rq)

  • Fig. 3.2: R = 10, value = 0.4

  • Fig. 3.3: R = 3, value = 0.33

  • Makes it easy to observe an algorithm's performance on each individual query
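R-precision is a one-liner once the ranking and Rq are given (the ranking below is hypothetical, arranged so that one of the three relevant documents lands in the top three, matching the 0.33 on the slide):

```python
def r_precision(ranking, relevant):
    """Precision at position R, where R = |Rq|, the total
    number of relevant documents for the query."""
    R = len(relevant)
    return sum(doc in relevant for doc in ranking[:R]) / R

r_precision(["d56", "d84", "d6", "d129", "d3"],
            {"d3", "d56", "d129"})   # 1/3, i.e. about 0.33
```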

Precision Histograms

  • Use a bar chart to compare the R-precision values of two algorithms across queries

  • RPA/B(i) = RPA(i) − RPB(i)

  • RPA(i), RPB(i): the R-precision values of algorithms A and B for the i-th query

  • Compare the retrieval performance history of two algorithms through visual inspection
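The histogram's bar heights are just per-query differences; a minimal sketch with made-up R-precision values for five queries:

```python
def rp_difference(rp_a, rp_b):
    """RPA/B(i) = RPA(i) - RPB(i) for each query i; a positive
    bar means algorithm A beat algorithm B on that query."""
    return [a - b for a, b in zip(rp_a, rp_b)]

# Hypothetical R-precision values for five queries:
bars = rp_difference([0.4, 0.3, 0.5, 0.7, 0.2],
                     [0.3, 0.5, 0.5, 0.4, 0.1])
# positive entries mark the queries where A wins
```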

Summary Table Statistics

  • Gather the single value summaries for all queries into a table, e.g.:

    • the number of queries,

    • the total number of documents retrieved by all queries,

    • the total number of relevant documents effectively retrieved when all queries are considered,

    • the total number of relevant documents which could have been retrieved by all queries…

Applicability of Precision and Recall

  • Computing maximum recall requires knowledge of all the documents relevant to a query

  • Recall and precision are relative measures; they are best used in combination

  • Measures which quantify the informativeness of the retrieval process might now be more appropriate

  • Recall and precision are easy to define when a linear ordering of the retrieved documents is enforced

Alternative Measures

  • The Harmonic Mean

    • F(j) = 2 / (1/r(j) + 1/P(j)), which lies between 0 and 1

  • The E Measure: adds a user preference weight b

    • E(j) = 1 − (1 + b^2) / (b^2/r(j) + 1/P(j))

    • b = 1: E(j) = 1 − F(j)

    • b > 1: more interested in precision

    • b < 1: more interested in recall
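Both measures can be sketched directly from the formulas above (a minimal sketch; in the b = 1 case E reduces to 1 − F):

```python
def harmonic_mean(r, p):
    """F(j) = 2 / (1/r(j) + 1/P(j)); taken as 0 when either
    recall or precision is 0."""
    return 0.0 if r == 0 or p == 0 else 2 / (1 / r + 1 / p)

def e_measure(r, p, b):
    """E(j) = 1 - (1 + b^2) / (b^2 / r(j) + 1 / P(j)); the slide
    reads b > 1 as favoring precision and b < 1 as favoring recall."""
    return 1 - (1 + b * b) / (b * b / r + 1 / p)

f = harmonic_mean(0.5, 0.5)   # 0.5
e = e_measure(0.5, 0.5, 1.0)  # 0.5, i.e. 1 - F
```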

User-Oriented Measure

  • Assumption: relevance depends on the user; different users have different sets of relevant documents

    • coverage = |Rk| / |U|

    • novelty = |Ru| / (|Ru| + |Rk|)

  • The higher the coverage, the more of the documents the user expected the system has found

  • The higher the novelty, the more relevant documents previously unknown to the user the system has revealed

User-Oriented Measure (cont.)

  • Relative recall: the ratio of the number of relevant documents the system found to the number of relevant documents the user expected to find

    • (|Ru| + |Rk|) / |U|

  • Recall effort: the ratio of the number of relevant documents the user expected to find to the number of documents examined in an attempt to find them

    • |U|/|Rk|
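A sketch of the four user-oriented measures, taking U, Rk and Ru as Python sets named as on the slides; the recall-effort line follows the slide's |U|/|Rk| formula, and the example sets are hypothetical:

```python
def user_oriented(U, Rk, Ru):
    """U: relevant docs known to the user; Rk: retrieved relevant
    docs the user already knew; Ru: retrieved relevant docs that
    were new to the user."""
    coverage = len(Rk) / len(U)
    novelty = len(Ru) / (len(Ru) + len(Rk))
    relative_recall = (len(Ru) + len(Rk)) / len(U)
    recall_effort = len(U) / len(Rk)   # as written on the slide
    return coverage, novelty, relative_recall, recall_effort

# User knows 5 relevant docs; the system retrieves 2 of them
# plus 2 relevant docs the user did not know about.
user_oriented({1, 2, 3, 4, 5}, {1, 2}, {8, 9})
# coverage 0.4, novelty 0.5, relative recall 0.8, effort 2.5
```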

Reference Collection

  • Reference test collections used to evaluate IR systems:

    • TIPSTER/TREC: large; used for experimentation

    • CACM, ISI: of historical significance

    • Cystic Fibrosis: a small collection whose relevant documents were determined by panels of experts

Criticisms of IR systems

  • Lacks a solid formal framework as a basic foundation

    • No real solution: whether a document is relevant to a query is highly subjective!

  • Lacks robust and consistent testbeds and benchmarks

    • Earlier, small experimental test collections were developed

    • After 1990, TREC was established, gathering tens of thousands of documents for research groups to use in evaluating IR systems

TREC (Text REtrieval Conference)

  • Initiated under the National Institute of Standards and Technology (NIST)

  • Goals:

    • Providing a large test collection

    • Uniform scoring procedures

    • Forum

  • 7th TREC conference in 1998:

    • Document collection: test documents, example information requests (topics), and relevant docs

    • The benchmark tasks

The Documents Collection

  • Documents are tagged with SGML



<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>

<author>Janet Guyon (WSJ Staff)</author>

<dateline>New York</dateline>


American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad…



The Example Information Requests(Topics)

  • The information need is described in natural language

  • Topic number: assigned to the different topics


<num> Number: 168

<title> Topic: Financing AMTRAK



<nar> Narrative: A …


The Relevant Documents for Each Example Information Request

  • The set of relevant documents for each topic is obtained from a pool of possible relevant documents

  • Pool: the top K documents, ranked by relevance, found by each of the several participating IR systems

  • K is usually 100

  • Finally, human assessors judge whether each pooled document is relevant

  • -> the pooling method:

    • the relevant documents are obtained from a pool combining several answer sets

    • documents not in the pool are considered non-relevant
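The pooling method itself is easy to sketch: take the union of the top-K documents from each participating run, send only that pool to the assessors, and treat everything outside it as non-relevant (the run contents below are hypothetical):

```python
def build_pool(runs, k=100):
    """Union of the top-k documents from each participating
    system's ranked run; only pooled docs get judged."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:k])
    return pool

runs = [["d1", "d2", "d3"], ["d2", "d4", "d5"]]
build_pool(runs, k=2)   # {"d1", "d2", "d4"}
```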

The (Benchmark) Tasks at the TREC Conferences

  • ad hoc task:

    • Receive new requests and execute them on a pre-specified document collection

  • routing task:

    • Receives test information requests and two document collections

    • First collection: for training and tuning the retrieval algorithm

    • Second collection: for testing the tuned retrieval algorithm

Other tasks:

  • *Chinese

  • Filtering

  • Interactive

  • *NLP (natural language processing)

  • Cross languages

  • High precision

  • Spoken document retrieval

  • Query Task(TREC-7)

Evaluation Measures at the TREC Conferences

  • Summary table statistics

  • Recall-precision

  • Document level averages*

  • Average precision histogram

The CACM Collection

  • A small collection of computer science literature

  • Text of the documents

  • Structured subfields:

    • word stems from the title and abstract sections

    • categories

    • direct references between articles: a list of pairs of documents [da, db]

    • bibliographic coupling connections: a list of triples [d1, d2, ncited]

    • number of co-citations for each pair of articles: [d1, d2, nciting]

  • A unique environment for testing retrieval algorithms which are based on information derived from cross-citing patterns

The ISI Collection

  • The ISI test collection was assembled from documents previously gathered by Small at the Institute of Scientific Information (ISI)

  • Most of the documents were selected from those used in Small's original cross-citation study

  • It supports similarity studies based on both terms and cross-citation patterns

The Cystic Fibrosis Collection

  • Documents about cystic fibrosis

  • Topics and relevant documents were produced by experts with clinical or research experience in the field

  • Relevance scores

    • 0: non-relevant

    • 1: marginally relevant

    • 2: highly relevant

Characteristics of CF collection

  • All relevance scores were assigned by experts

  • A good number of information requests (relative to the collection size)

    • The respective query vectors overlap among themselves

    • Earlier queries can be used to improve retrieval effectiveness

Trends and Research Issues

  • Interactive user interfaces

    • It is generally believed that retrieval with feedback improves effectiveness

    • How should evaluation measures be defined in this interactive setting?

  • Research on evaluation measures other than precision and recall