
Retrieval Evaluation

J. H. Wang, Mar. 18, 2008. Outline: Chap. 3, Retrieval Evaluation; Retrieval Performance Evaluation; Reference Collections.


Presentation Transcript


  1. Retrieval Evaluation J. H. Wang Mar. 18, 2008

  2. Outline • Chap. 3, Retrieval Evaluation • Retrieval Performance Evaluation • Reference Collections

  3. Introduction • Types of evaluation: functional analysis phase and error analysis phase; performance evaluation • Performance evaluation: response time and space required • Retrieval performance evaluation: the evaluation of how precise the answer set is

  4. Retrieval Performance Evaluation • Query in batch mode vs. interactive sessions • [Figure: the collection, the relevant docs R, the answer set A (sorted by relevance), and the relevant docs in the answer set Ra] • Recall = |Ra|/|R| • Precision = |Ra|/|A|
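
The two ratios on this slide can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function name `precision_recall` and the sample document ids are hypothetical.

```python
def precision_recall(answer_set, relevant):
    """Return (precision, recall) for one query.

    answer_set: ids of the documents the system returned (A)
    relevant:   ids of the documents judged relevant (R)
    """
    answer_set, relevant = set(answer_set), set(relevant)
    ra = answer_set & relevant  # Ra: relevant docs in the answer set
    precision = len(ra) / len(answer_set) if answer_set else 0.0
    recall = len(ra) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 docs returned, 3 relevant docs exist, 2 overlap.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
print(f"precision={p:.2f} recall={r:.2f}")
```

Empty sets are mapped to 0.0 here to avoid division by zero; that convention is an assumption of this sketch.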

  5. Precision versus Recall Curve • Rq={d3,d5,d9,d25,d39,d44,d56,d71,d89,d123} • Ranking for query q (* marks a relevant document): 1.d123* 2.d84 3.d56* 4.d6 5.d8 6.d9* 7.d511 8.d129 9.d187 10.d25* 11.d38 12.d48 13.d250 14.d11 15.d3* • P=100% at R=10% • P=66% at R=20% • P=50% at R=30% • Usually based on 11 standard recall levels: 0%, 10%, ..., 100%
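
The recall/precision pairs on this slide can be reproduced by walking down the ranking and recording a point each time a relevant document appears. The sketch below uses the slide's ranking and Rq; the function name `precision_recall_points` is an assumption made for illustration.

```python
def precision_recall_points(ranking, relevant):
    """Return one (recall, precision) pair per relevant document seen."""
    relevant = set(relevant)
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


# Ranking and relevant set taken from the slide.
RANKING = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
RQ = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

for recall, precision in precision_recall_points(RANKING, RQ):
    print(f"R={recall:.0%}  P={precision:.0%}")
```

Only five of the ten relevant documents appear in the ranking, so the observed points stop at 50% recall.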

  6. Precision versus Recall Curve • For a single query (Fig. 3.2)

  7. Average Over Multiple Queries • P(r) = (1/Nq) Σ_{i=1..Nq} Pi(r) • P(r) = average precision at the recall level r • Nq = number of queries used • Pi(r) = the precision at recall level r for the i-th query
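
Averaging over queries amounts to a level-by-level mean of the per-query precision values. A minimal sketch, assuming each query's curve is already given as a list of precisions at the same recall levels; the function name and the two sample curves are hypothetical.

```python
def average_precision_curve(per_query_precisions):
    """Average Pi(r) over all queries, one value per recall level.

    per_query_precisions: list of lists; entry [i][j] is the precision
    of query i at the j-th recall level. All lists share the same length.
    """
    n_q = len(per_query_precisions)
    n_levels = len(per_query_precisions[0])
    return [
        sum(curve[j] for curve in per_query_precisions) / n_q
        for j in range(n_levels)
    ]


# Two made-up queries, three recall levels each.
curves = [[1.0, 0.5, 0.25],
          [0.5, 0.5, 0.0]]
print(average_precision_curve(curves))
```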

  8. Interpolated Precision • Rq={d3,d56,d129} • Ranking for query q (* marks a relevant document): 1.d123 2.d84 3.d56* 4.d6 5.d8 6.d9 7.d511 8.d129* 9.d187 10.d25 11.d38 12.d48 13.d250 14.d11 15.d3* • P=33% at R=33% • P=25% at R=66% • P=20% at R=100% • P(r_j) = max_{r_j ≤ r ≤ r_{j+1}} P(r)

  9. Interpolated Precision • Let r_j, j ∈ {0, 1, 2, …, 10}, be a reference to the j-th standard recall level • P(r_j) = max_{r_j ≤ r ≤ r_{j+1}} P(r) • R=30%: P3(r)~P4(r)=33% • R=40%: P4(r)~P5(r) • R=50%: P5(r)~P6(r) • R=60%: P6(r)~P7(r)=25%
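
A sketch of interpolation at the 11 standard recall levels follows. It assumes the widely used rule that the interpolated precision at level r_j is the highest precision observed at any recall greater than or equal to r_j. That is an assumption of this sketch and a slightly broader window than the interval written on the slide, chosen so that levels with no observed point in their own interval still get a value. The function name is hypothetical; the ranking and Rq come from the previous slide.

```python
def interpolated_precision(ranking, relevant, n_levels=11):
    """Interpolated precision at n_levels evenly spaced recall levels."""
    relevant = set(relevant)
    points = []  # observed (recall, precision) pairs
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    curve = []
    for j in range(n_levels):
        r_j = j / (n_levels - 1)
        # Highest precision at any observed recall >= r_j (0.0 if none).
        candidates = [p for r, p in points if r >= r_j]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


RANKING = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
RQ = {"d3", "d56", "d129"}

for j, p in enumerate(interpolated_precision(RANKING, RQ)):
    print(f"R={j * 10:3d}%  P={p:.0%}")
```

Under this rule the example yields 33% for recall levels 0% to 30%, 25% for 40% to 60%, and 20% for 70% to 100%, which agrees with the values the slide gives at R=30% and R=60%.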

  10. Average Recall vs. Precision Figure

  11. Single Value Summaries • Average precision versus recall • Compare retrieval algorithms over a set of example queries • Sometimes we need to compare performance on individual queries • Averaging precision over many queries might disguise important anomalies in the retrieval algorithms • We might be interested in whether one of them outperforms the other for each query • Need a single value summary • The single value should be interpreted as a summary of the corresponding precision versus recall curve

  12. Single Value Summaries • Average Precision at Seen Relevant Documents • Averaging the precision figures obtained after each new relevant document is observed • Example: Figure 3.2, (1+0.66+0.5+0.4+0.3)/5=0.57 • This measure favors systems which retrieve relevant documents quickly (i.e., early in the ranking) • R-Precision • The precision at the R-th position in the ranking • R: the total number of relevant documents of the current query (number of documents in Rq) • Fig3.2: R=10, value=0.4 • Fig3.3: R=3, value=0.33
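
Both summaries can be sketched directly from a ranking. The function names below are assumptions made for illustration; the two rankings are the ones shown earlier for Fig. 3.2 and Fig. 3.3. With exact fractions the first average comes out at about 0.58, slightly above the slide's 0.57, which uses rounded terms.

```python
def average_precision_at_seen_relevant(ranking, relevant):
    """Mean of the precision values taken at each relevant document seen."""
    relevant = set(relevant)
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def r_precision(ranking, relevant):
    """Precision at position R, where R is the number of relevant docs."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


RANKING = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
RQ_FIG_3_2 = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
RQ_FIG_3_3 = {"d3", "d56", "d129"}

print(average_precision_at_seen_relevant(RANKING, RQ_FIG_3_2))
print(r_precision(RANKING, RQ_FIG_3_2))
print(r_precision(RANKING, RQ_FIG_3_3))
```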

  13. Precision Histograms • Use R-precision measures to compare the retrieval history of two algorithms through visual inspection • RP_{A/B}(i) = RP_A(i) - RP_B(i)
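
The histogram values are per-query differences of R-precision. A minimal sketch with made-up R-precision figures for two hypothetical algorithms A and B; a positive bar means A did better on that query, a negative bar means B did.

```python
def r_precision_differences(rp_a, rp_b):
    """Return RP_A(i) - RP_B(i) for each query i."""
    if len(rp_a) != len(rp_b):
        raise ValueError("both algorithms must be scored on the same queries")
    return [a - b for a, b in zip(rp_a, rp_b)]


# Hypothetical R-precision values for five queries.
rp_a = [0.50, 0.20, 0.75, 0.40, 0.10]
rp_b = [0.25, 0.50, 0.75, 0.10, 0.30]

for i, diff in enumerate(r_precision_differences(rp_a, rp_b), start=1):
    # Crude text histogram: one character per 0.05 of difference.
    bar = ("+" if diff > 0 else "-") * round(abs(diff) / 0.05)
    print(f"query {i}: {diff:+.2f} {bar}")
```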

  14. Summary Table Statistics • Single value measures can be stored in a table regarding the set of all queries • the number of queries • total number of documents retrieved by all queries • total number of relevant documents which were effectively retrieved when all queries are considered • total number of relevant documents which could have been retrieved by all queries • …

  15. Precision and Recall Appropriateness • Proper estimation of maximum recall for a query requires detailed knowledge of all documents in the collection, which is usually unavailable for large collections • Recall and precision are related measures which capture different aspects of the set of retrieved documents • Measures which quantify the informativeness of the retrieval process might be more appropriate • Recall and precision are easy to define when a linear ordering of the retrieved documents is enforced

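  16. Alternative Measures • The Harmonic Mean F(j) of recall and precision at the j-th document in the ranking • Values in [0,1] • The E Measure • The parameter b sets the relative importance of recall and precision • b=1, E(j)=1-F(j) (the complement of the harmonic mean) • b>1, more interested in precision • b<1, more interested in recall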
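
The slide lists the two measures without their formulas. The sketch below assumes the commonly cited forms F = 2 / (1/r + 1/P) and E = 1 - (1 + b²) / (b²/r + 1/P), where r is recall and P is precision at a given rank. Both forms are assumptions of this sketch; how b shifts the emphasis between recall and precision depends on which term b² multiplies, so check the direction against the formula actually in use.

```python
def harmonic_mean(recall, precision):
    """F: harmonic mean of recall and precision, a value in [0, 1]."""
    if recall == 0 or precision == 0:
        return 0.0
    return 2 / (1 / recall + 1 / precision)


def e_measure(recall, precision, b=1.0):
    """E measure under the assumed form 1 - (1 + b^2) / (b^2/r + 1/P)."""
    if recall == 0 or precision == 0:
        return 1.0
    return 1 - (1 + b ** 2) / (b ** 2 / recall + 1 / precision)


r, p = 0.4, 0.8
print(harmonic_mean(r, p))      # high only when both r and p are high
print(e_measure(r, p, b=1.0))   # equals 1 - F when b = 1
```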

  17. User-Oriented Measure • Assumption: different users might have a different interpretation of which document is relevant

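  18. User-Oriented Measure • U: the relevant documents known to the user • Rk: the relevant documents in the answer set that were already known to the user • Ru: the relevant documents in the answer set that were previously unknown to the user • Coverage=|Rk|/|U| • Novelty=|Ru|/(|Ru|+|Rk|) • A high coverage ratio indicates that the system is finding most of the relevant documents that the user expected to see • A high novelty ratio indicates that the system is revealing many new documents which were previously unknown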
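
Coverage and novelty can be sketched with set operations. The function name, argument names, and sample ids are hypothetical; U is taken here to be the relevant documents the user already knows, which is an assumption of this sketch.

```python
def coverage_and_novelty(answer_set, relevant, known_to_user):
    """Return (coverage, novelty) for one query and one user."""
    answer_set = set(answer_set)
    relevant = set(relevant)
    u = relevant & set(known_to_user)        # U: relevant docs the user knows
    retrieved_relevant = answer_set & relevant
    rk = retrieved_relevant & u              # Rk: retrieved and already known
    ru = retrieved_relevant - u              # Ru: retrieved and new to the user
    coverage = len(rk) / len(u) if u else 0.0
    novelty = len(ru) / (len(ru) + len(rk)) if retrieved_relevant else 0.0
    return coverage, novelty


# Hypothetical example: the user knows d1 and d6; the system finds d1, d2, d3.
cov, nov = coverage_and_novelty(
    answer_set={"d1", "d2", "d3", "d4", "d5"},
    relevant={"d1", "d2", "d3", "d6", "d7"},
    known_to_user={"d1", "d6"},
)
print(f"coverage={cov:.2f} novelty={nov:.2f}")
```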

  19. Other Measures • Relative recall: the ratio between the number of relevant documents found and the number of relevant documents the user expected to find • Recall effort: the ratio between the number of relevant documents the user expected to find and the number of documents examined • Others: expected search length, satisfaction, frustration

  20. Reference Collections • Reference test collections for the evaluation of IR systems • TIPSTER/TREC: large size, thorough experimentation • CACM, ISI: historical importance • Cystic Fibrosis: small collections, extensively studied by specialists before generation of relevant documents

  21. Criticisms for IR Research • Lacks a solid formal framework as a basic foundation • This criticism is difficult to dismiss because deciding on the relevance of a document is inherently subjective • Lacks robust and consistent testbeds and benchmarks • Early experimentation was based on relatively small test collections, and there were no widely accepted benchmarks • In the early 1990s, the TREC conference, led by Donna Harman (NIST), was dedicated to experimentation with a large test collection

  22. TREC (Text REtrieval Conference) • Initiated under the National Institute of Standards and Technology (NIST) • Goals: • Providing a large test collection • Uniform scoring procedures • Forum for comparing results • 7th TREC conference in 1998 • Document collection: test collections, example information requests (topics), relevant docs • The benchmark tasks

  23. The Documents Collection • Tagged with SGML to allow easy parsing
      <doc>
      <docno>WSJ880406-0090</docno>
      <hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
      <author>Janet Guyon (WSJ Staff)</author>
      <dateline>New York</dateline>
      <text>
      American Telephone & Telegraph Co. introduced the first of a new generation of phone service with broad…
      </text>
      </doc>
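
To illustrate why the tagging makes parsing easy, here is a minimal regex-based sketch that pulls the fields out of one record shaped like the sample above. It is not a general SGML parser and not the tooling used at TREC; the function name `parse_doc` is hypothetical, and the sketch assumes every field has an explicit closing tag and no nested tags.

```python
import re

SAMPLE_DOC = """<doc>
<docno>WSJ880406-0090</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<dateline>New York</dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new generation of phone service
</text>
</doc>"""


def parse_doc(sgml):
    """Return {tag: text} for the fields inside the first <doc> element."""
    match = re.search(r"<doc>(.*?)</doc>", sgml, flags=re.DOTALL)
    if match is None:
        return {}
    fields = {}
    # \1 requires the closing tag to repeat the opening tag's name.
    for tag, body in re.findall(r"<(\w+)>(.*?)</\1>", match.group(1), flags=re.DOTALL):
        fields[tag.lower()] = body.strip()
    return fields


doc = parse_doc(SAMPLE_DOC)
print(doc["docno"])
print(doc["hl"])
```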

  24. TREC1-6 Documents

  25. The Example Information Requests (Topics) • Each request (topic) is a description of an information need in natural language • Topic number for different topics
      <top>
      <num> Number: 168
      <title> Topic: Financing AMTRAK
      <desc> Description: …..
      <nar> Narrative: A …..
      </top>

  26. TREC~Topics

  27. TREC~Relevance Assessment • Relevance assessment • Pooling Method • The documents in the pool are shown to human assessors, who decide on their relevance • Two assumptions • The vast majority of the relevant documents are collected in the assembled pool • Documents that are not in the pool can be considered not relevant

  28. Pooling Method • The set of relevant documents for each example information request is obtained from a pool of possibly relevant documents • This pool is created by taking the top K documents (usually, K=100) in the rankings generated by the various participating retrieval systems
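
Pool construction is a union of the top K documents from each system's ranking for a topic. A minimal sketch; the function name and the toy rankings are hypothetical, and a small K is used only to keep the example readable.

```python
def build_pool(rankings, k=100):
    """Union of the top-k documents from each system's ranking for one topic."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])
    return pool


# Three hypothetical systems, one topic, K=2 for brevity.
system_rankings = [
    ["d7", "d2", "d9", "d4"],
    ["d2", "d5", "d7", "d1"],
    ["d9", "d7", "d3", "d2"],
]
pool = build_pool(system_rankings, k=2)
print(sorted(pool))  # these documents go to the human assessors
```

Documents outside the pool are never judged, which is why the second assumption on the previous slide (unpooled documents count as not relevant) matters.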

  29. The (Benchmark) Tasks at the TREC Conferences • Ad hoc task • Receive new requests and execute them on a pre-specified document collection • Routing task • Receive test information requests and two document collections • First collection: training and tuning the retrieval algorithm • Second collection: testing the tuned retrieval algorithm

  30. Other Tracks • *Chinese • Filtering • Interactive • *NLP (natural language processing) • Cross languages • High precision • Spoken document retrieval • Query (TREC-7) • Others: Web, Terabyte, SPAM, Blog, Novelty, Question Answering, HARD, …

  31. TREC~Evaluation

  32. Evaluation Measures at the TREC Conferences • Summary table statistics • Recall-precision • Document level averages* • Average precision histogram

  33. The CACM Collection • Small collections about computer science literature (1958-1979) • Text of 3,204 documents • Structured subfields • word stems from the title and abstract sections • Categories • direct references between articles: a list of document pairs [da,db] • Bibliographic coupling connections: a list of triples [d1,d2,ncited] • Number of co-citations for each pair of articles [d1,d2,nciting] • A unique environment for testing retrieval algorithms which are based on information derived from cross-citing patterns
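
The coupling and co-citation counts can in principle be derived from the direct-reference pairs. The sketch below assumes that a pair [da, db] means "da cites db", that the coupling count for two documents is the number of documents both of them cite, and that the co-citation count is the number of documents citing both. These readings, the function name, and the sample pairs are assumptions, not a description of how the CACM files were built.

```python
from collections import defaultdict
from itertools import combinations


def coupling_and_cocitation(direct_refs):
    """Derive pairwise counts from (citing, cited) pairs."""
    cites = defaultdict(set)      # doc -> docs it cites
    cited_by = defaultdict(set)   # doc -> docs that cite it
    for da, db in direct_refs:
        cites[da].add(db)
        cited_by[db].add(da)

    coupling = {}     # (d1, d2) -> number of docs cited by both
    for d1, d2 in combinations(sorted(cites), 2):
        shared = len(cites[d1] & cites[d2])
        if shared:
            coupling[(d1, d2)] = shared

    cocitation = {}   # (d1, d2) -> number of docs citing both
    for d1, d2 in combinations(sorted(cited_by), 2):
        shared = len(cited_by[d1] & cited_by[d2])
        if shared:
            cocitation[(d1, d2)] = shared
    return coupling, cocitation


refs = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "y")]
coupling, cocitation = coupling_and_cocitation(refs)
print(coupling)
print(cocitation)
```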

  34. The CACM collection also includes a set of 52 test information requests • Ex: “What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?” • Each request also includes two Boolean query formulations and a set of relevant documents • Since the requests are fairly specific, the average number of relevant documents for each request is small (around 15) • Precision and recall tend to be low

  35. The ISI Collection • The 1,460 documents in the ISI test collection were selected from a previous collection assembled by Small at ISI (Institute for Scientific Information) • The documents selected were those most cited in a cross-citation study done by Small • The main purpose is to support investigation of similarities based on terms and on cross-citation patterns

  36. The Cystic Fibrosis (CF) Collection • 1,239 documents indexed with the term “cystic fibrosis” in the MEDLINE database • Information requests were generated by an expert in cystic fibrosis • Relevance scores were provided by subject experts • 0: non-relevance • 1: marginal relevance • 2: high relevance

  37. Characteristics of CF collection • Relevance score was generated directly by human experts • It includes a good number of information requests (relative to the collection size) • The respective query vectors present overlap among themselves • This allows experimentation with retrieval strategies which take advantage of past query sessions to improve retrieval performance

  38. Trends and Research Issues • Interactive user interface • A general belief: effective retrieval is highly dependent on obtaining proper feedback from the user • Deciding which evaluation measures are most appropriate in this scenario • Ex: informativeness measure in 1992 • The proposal, the study, the characterization of alternative measures to recall and precision
