
Test Collections and Associated Evaluation Metrics based on Graded Relevance



  1. Test Collections and Associated Evaluation Metrics based on Graded Relevance Kalervo Järvelin kalervo.jarvelin@uta.fi

  2. Introduction • Real-life vs. Cranfield paradigm in IR evaluation • Cranfield, with TREC as an example: • binary relevance with bias • very liberal relevance criterion, system-oriented evaluation • document overlaps not considered • Real-life: • document floods • all documents are not created equal • often strict relevance criteria, little scanning, sessions • document overlaps do matter, bias strong but variable • Evaluate by graded/high relevance? Why? How?

  3. Outline • Introduction • Relevance • degree, types, and dimensions of relevance • topical graded relevance defined for test collections • Building a Test Collection with Graded Relevance • The TREC-UTA Experience • A Glance at Evaluation Metrics based on Graded Relevance • binarization; gMAP; (s)(n)(D)CG • Does it Matter? • Kinds of results graded relevance may offer • Discussion

  4. Relevance • A difficult concept - Saracevic (1975) proposed a generator for relevance concepts: • Relevance is <any measurement instrument>, which measures <any aspect of relevance>, which exists between <any object to be assessed> and <any scope of assessment> as seen by <any assessor>. • For example, relevance is the quantity, which measures the match, which exists between a text and an information need as seen by the inquirer.

  5. Relevance • Relevance is multi-graded (Cuadra & Katter 1967; Rees & Schultz 1967) • grades can be measured (Tang & al. 1999) • Relevance is a multidimensional cognitive concept • it depends on searchers’ perceptions of information and of their situations • Relevance is complex (Schamber & al. 1990) • main categories: objective, subjective • Relevance is dynamic • assessments of objects may change over time

  6. Relevance • Types of relevance (Saracevic 1996; Cosijn & Ingwersen 2000) • algorithmic relevance • topical relevance • pertinence • situational relevance • socio-cognitive relevance • Graded relevance in test collections: topical? • Why?

  7. Outline • Introduction • Relevance • Building a Test Collection with Graded Relevance • The TREC-UTA Experience • A Glance at Evaluation Metrics based on Graded Relevance • binarization; gMAP; (s)(n)(D)CG • Does it Matter? • Kinds of results graded relevance may offer • Discussion

  8. Building Test Collections with Graded Relevance: The UTA-TREC Experience • Goal • to create a sub-collection of TREC-7 and TREC-8 where the capability of IR systems to focus on highly relevant documents could be studied • 38 topics were reassessed with a 4-point scale • a subset of TREC topics 351-450 • # TREC relevant documents required > 30 per topic • cost consideration: the number of relevant docs with a 5% sample of non-relevant docs not to exceed 200 • cultural match requirements: general topics rather than American (Sormunen @ SIGIR 2002)

  9. Building Test Collections: Sample Relevance Grades (0) The doc does not contain information on the topic (1) The doc only points to the topic - no more information than in the topic (marginal) • Typical extent: one sentence or fact (2) The doc is topical but not exhaustive (fair) • Typically: one paragraph, 2-3 sentences or facts (3) The doc is exhaustive on the topic (highly relevant) • Typically: several paragraphs, 4+ sentences or facts
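
A minimal sketch of how such a 4-point scale might be encoded for use in evaluation scripts; the dictionary below merely restates the criteria above and is illustrative, not part of the original collection tooling.

```python
# Hypothetical encoding of the 4-point relevance scale described above.
RELEVANCE_SCALE = {
    0: "non-relevant: the document contains no information on the topic",
    1: "marginal: only points to the topic; typically one sentence or fact",
    2: "fair: topical but not exhaustive; typically one paragraph, 2-3 sentences or facts",
    3: "highly relevant: exhaustive on the topic; typically several paragraphs, 4+ sentences or facts",
}
```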

  10. Building Graded Test Collections: Process • Assessors: paid students, 3 + 3 • Training • guidelines, intro, two training topics • Materials: printed, long ones on screen • Process • initial scanning to establish stable criteria • notes on interpretations, conflicts • partially parallel assessments (quality control) • rough marking of relevant passages • Comparison to original: divergence analysis

  11. Building Graded Test Collections: Results: Reassessments
  Level of      TREC relevant        TREC non-relevant
  relevance     #       %            #       %
  Rel=3         353     13 %         11      0 %
  Rel=2         724     26 %         40      1 %
  Rel=1         1004    36 %         134     5 %
  Rel=0         691     25 %         2780    94 %
  Total         2772    100 %        2965    100 %

  12. Building Graded Test Collections: Results
  Level of      Total graded                  UTA relevant
  relevance     #       %       #/topic       #       %
  Rel=3         364     6 %     10            364     16 %
  Rel=2         764     13 %    20            764     34 %
  Rel=1         1138    20 %    30            1138    50 %
  Rel=0         3471    61 %    91            -       -
  Total         5737    100 %   151           2266    100 %
  Thus among the TREC-relevant, one half were marginal, one third fair, and one sixth highly relevant.
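
A quick arithmetic check of those proportions against the UTA relevant columns: 1138 / 2266 ≈ 50 % (marginal), 764 / 2266 ≈ 34 % (fair), 364 / 2266 ≈ 16 % (highly relevant).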

  13. Building Graded Test Collections: Economics • Judging liberal relevance is fast • In graded assessment, extra work is required to specify the degree of relevance • The total time to assess the slightly over 7000 documents was ~78 weeks, i.e. ~20 person-months • note that about one half of the documents were relevant in this secondary analysis, which slows the process • experience improved judgment speed

  14. Outline • Introduction • Relevance • Building a Test Collection with GR • Evaluation Metrics based on Graded Relevance • binarization; gMAP; (s)(n)(D)CG • Does it Matter? • Kinds of results graded relevance may offer • Discussion

  15. Evaluation Metrics: Binarization • The recall base is binarized by dividing it into relevant / non-relevant at different thresholds on the relevance scores • Which documents are relevant enough? • Supports IR evaluation by traditional metrics • recall-precision curves; MAP; P@10 • Kekäläinen & Järvelin (JASIST 2002)
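
A minimal Python sketch (not from the talk) of the binarization idea: collapse the graded recall base to binary at a chosen threshold, then apply any traditional binary metric such as P@10. The qrels, ranking, and thresholds below are made-up examples.

```python
# Sketch of binarizing graded judgments before applying a traditional binary metric.
# The judgments (qrels), ranking, and thresholds are hypothetical examples.

def binarize(qrels, threshold):
    """Map graded scores (0-3) to binary: relevant iff score >= threshold."""
    return {doc: int(score >= threshold) for doc, score in qrels.items()}

def precision_at_k(ranking, binary_qrels, k=10):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    return sum(binary_qrels.get(doc, 0) for doc in ranking[:k]) / k

qrels = {"d1": 3, "d2": 1, "d3": 0, "d4": 2, "d5": 1}    # graded recall base
ranking = ["d1", "d3", "d4", "d2", "d5"]                 # a system's ranked output

print(precision_at_k(ranking, binarize(qrels, 1), k=5))  # liberal threshold: grades 1-3 count (0/1-3)
print(precision_at_k(ranking, binarize(qrels, 3), k=5))  # strict threshold: only grade 3 counts (0-2/3)
```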

  16. Binarization Example: Structured QE • [Figure] P-R curves of SUM, BOOL, and SSYN-C queries at relevance threshold 0/1-3 • [Figure] P-R curves of SUM, BOOL, and SSYN-C queries at relevance threshold 0-2/3

  17. Evaluation Metrics: Generalized Metrics • Graded assessments are incorporated into traditional metrics through generalization: • generalized precision: gP = Σ_{d∈R} r(d) / |R| • generalized recall: gR = Σ_{d∈R} r(d) / Σ_{d∈D} r(d) • R is the set of retrieved documents (of size |R|), and D = {d1, d2, …, dN} a database, R ⊆ D. The docs di in the database have relevance scores 0.0 ≤ r(di) ≤ 1.0 • similarly, generalized MAP, gMAP • weighting of relevance scores possible • see Kekäläinen & Järvelin (JASIST 2002)
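
A small Python sketch of gP and gR as defined above; the grade-to-score mapping and the documents are hypothetical, with r(d) scaled into [0, 1].

```python
# Sketch of generalized precision and recall over graded relevance scores r(d) in [0, 1].
# The grade-to-score mapping and the documents are hypothetical.

GRADE_SCORE = {0: 0.0, 1: 0.1, 2: 0.4, 3: 1.0}   # e.g. a 0-1-4-10 weighting rescaled to [0, 1]

def generalized_precision(retrieved, scores):
    """gP: sum of r(d) over the retrieved set R, divided by |R|."""
    return sum(scores.get(d, 0.0) for d in retrieved) / len(retrieved)

def generalized_recall(retrieved, scores):
    """gR: sum of r(d) over R, divided by the sum of r(d) over the whole database D."""
    total = sum(scores.values())
    return sum(scores.get(d, 0.0) for d in retrieved) / total if total else 0.0

grades = {"d1": 3, "d2": 2, "d3": 1, "d4": 0, "d5": 2}
scores = {d: GRADE_SCORE[g] for d, g in grades.items()}
retrieved = ["d1", "d4", "d2"]

print(generalized_precision(retrieved, scores))   # (1.0 + 0.0 + 0.4) / 3
print(generalized_recall(retrieved, scores))      # 1.4 / 1.9
```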

  18. Generalization Example: Structured QE • [Figure] P-R curves of SUM, BOOL, and SSYN-C queries at relevance threshold 0/1-3 • [Figure] Generalized P-R curves of SUM, BOOL, and SSYN-C queries (weighting 0-1-4-10)

  19. Evaluation Metrics: (n)(S)(D)CG Family • (n)(D)CG - normalized discounted cumulated gain • document value is discounted by the log of its rank - late-found documents are less valuable than their nominal value • normalization of performance by dividing by ideal performance • sDCG - session-based discounted cumulated gain • documents found by later queries are further discounted by a query-dependent factor • normalization possible

  20. Discounted Cumulated Gain • Cumulate gain and divide it by the log of the rank of each document: • DCG[j] = CG[j], if j < b; DCG[j] = DCG[j-1] + G[j] / log_b j, if j ≥ b • where • j is the rank of each document • G[j] is the gain of the document at rank j • b is the logarithm base for the rank discount • The normalized version is obtained through division by the ideal DCG • nDCG[i] = DCG[i] / IDCG[i]
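
A minimal Python sketch of DCG/nDCG following the definition above (no discount before rank b, division by log_b of the rank thereafter, and normalization by the ideal vector); the gain values are hypothetical.

```python
import math

# Sketch of (n)DCG as defined above: gains at ranks below b are cumulated as such,
# gains at rank j >= b are divided by log_b(j). The gain vector is hypothetical.

def dcg(gains, b=2):
    """Return the DCG vector for gains G[1..n] listed in rank order."""
    result, total = [], 0.0
    for j, g in enumerate(gains, start=1):
        total += g if j < b else g / math.log(j, b)
        result.append(total)
    return result

def ndcg(gains, ideal_gains, b=2):
    """Normalize DCG position by position by the ideal DCG vector."""
    return [x / y if y else 0.0 for x, y in zip(dcg(gains, b), dcg(ideal_gains, b))]

gains = [10, 0, 1, 5, 0]                # e.g. weighting 0-1-5-10 applied to graded judgments
ideal = sorted(gains, reverse=True)     # ideal ordering of the same recall base

print(dcg(gains, b=2))
print(ndcg(gains, ideal, b=2))
```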

  21. TREC-UTA ad hoc TDN (41 topics) • [Figure] nDCG & DCG, weighting 0-1-5-10, discount log base 10 vs. log base 1.5 (panels: log=10, log=1.5)

  22. Evaluation Metrics: Session DCG, sDCG • … as above, but the gain of a later query is further discounted by the log of its query rank q (1 … m) in the session: sDCG(q) = DCG(q) / (1 + log_bq q) • where bq ∈ ℝ is the logarithm base for the query discount; 1 < bq < 1000 • q is the position of the query • Can be normalized • Average these over a set of queries • Järvelin & al. (ECIR 2008).
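
A short Python sketch of the session discount: the DCG of the q-th query's result list is divided by (1 + log_bq q). The per-query DCG values are hypothetical.

```python
import math

# Sketch of sDCG: sum per-query DCG values, each further discounted by the
# position q of the query in the session. The session data is hypothetical.

def sdcg(per_query_dcg, bq=4):
    """Sum query-level DCG values discounted by 1 + log_bq(q), q = 1..m."""
    return sum(d / (1 + math.log(q, bq)) for q, d in enumerate(per_query_dcg, start=1))

session_dcg = [6.2, 3.1, 4.0]     # DCG of each query's (e.g. top-10) result list, in session order
print(sdcg(session_dcg, bq=4))    # later queries contribute less than their nominal DCG
```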

  23. nsDCG, Top-10, Systems A – B, all sessions (b=2; bq=4; 0-1-10-100)

  24. Outline • Introduction • Relevance • Building a Test Collection with GR • A Glance at Evaluation Metrics based on GR • Does it Matter? • Kinds of results graded relevance may offer • Discussion

  25. Does It Matter? – User View • Recognition of relevance by degree of relevance • Vakkari & Sormunen (JASIST 2004) • Searchers are more likely to identify highly relevant documents, and do so more consistently, than marginal ones • They are likely to err on marginal documents, i.e. judge them non-relevant • Should such docs then be used for evaluation?

  26. Does It Matter? Systems Ranking • Ranking systems by graded relevance performance • Kekäläinen (IP&M 2005) compared the rankings of IR systems based on binary and graded relevance (n, m, f, h); TREC 7 and 8 data • 21 topics and 90 systems from TREC 7 • 20 topics and 121 systems from TREC 8 • Binary MAP, CG, DCG, and nDCG • Different weighting schemes tested • Kendall's rank correlations were computed to determine to what extent the rankings produced by different measures are similar • High correlation in the binary end, less with sharp weighting • Voorhees (SIGIR 2001) has similar findings

  27. Correlation of MAP vs. nDCG • Some of Kekäläinen's (2005) findings:
  DCG variation       DCV10   DCV100
  nDCG-0.1.1.1        0.934   0.852
  nDCG-0.1.2.3        0.894   0.828
  nDCG-0.1.5.10       0.859   0.805
  nDCG-0.1.10.100     0.737   0.719
  • Discounting and normalizing affect system ranking along with weighting when compared to MAP – some systems retrieve highly relevant documents earlier than others
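
A sketch of the comparison methodology: rank the same set of systems under two metrics and measure the agreement of the orderings with Kendall's tau (here via scipy.stats.kendalltau). The per-system scores below are invented, not Kekäläinen's data.

```python
from scipy.stats import kendalltau

# Sketch of comparing system rankings under two effectiveness measures with
# Kendall's rank correlation. The per-system scores are hypothetical.

map_scores  = {"sysA": 0.31, "sysB": 0.27, "sysC": 0.25, "sysD": 0.19}
ndcg_scores = {"sysA": 0.52, "sysB": 0.44, "sysC": 0.47, "sysD": 0.30}

systems = sorted(map_scores)   # fixed system order so the two score lists align
tau, p_value = kendalltau([map_scores[s] for s in systems],
                          [ndcg_scores[s] for s in systems])
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```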

  28. Does It Matter: Relevance Feedback • Keskustalo & al (ECIR 2006; IR 2008) • RFB with relevance requirement • Evaluation by relevance weighting • Full document FB key extraction • Liberal RFB in short window effective

  29. Effect of RFB Amount and Quality • [Figure] RFB with CG evaluation, scenarios: baseline, <3,10,10>, <1,10,10>; weighting 0-1-10-100

  30. Does It Matter: Relevance Feedback • Järvelin (CIKM 2009) • Simulated RFB with relevance requirement • Evaluation by relevance weighting • Feedback through sentence extraction & key weighting • Liberal RFB in short window effective

  31. Results: Liberal RFB • [Figure] Effectiveness of simulated RFB runs with liberal RFB • B = browsing depth, F = # RFB docs, E = # QE keys extracted

  32. Does It Matter: Transitive Translation • Lehtokangas & al. (2006, 2008) • Dictionary-based CLIR, structured queries • Direct translation loses highly-relevant docs • Transitive translation loses highly-relevant docs • Both relatively better with liberal relevance • Pseudo-relevance feedback brings transitive translation to the level of direct translation

  33. Does It Matter: Negative Weighting • Keskustalo & al (SIGIR 2008) • What if non-relevant docs are negatively weighted? • Evaluation by different weighting scenarios, including negative weights • Opportunities • Searcher’s stopping becomes understandable • Test collection analysis
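
As an illustration of the negative-weighting idea, a small Python sketch reusing the DCG definition from earlier: judged non-relevant documents are given a negative gain so that scanning them costs the searcher, as in the -1/1 and -2/0/5/10 scenarios on the following slides. The ranked grade list is hypothetical.

```python
import math

# Sketch of DCG under weighting scenarios that give judged non-relevant documents
# negative gain (e.g. -2/0/5/10 vs. traditional 0/0/5/10). Ranked grades are hypothetical.

def dcg_total(grades, weights, b=10):
    """DCG of a ranked list of relevance grades under a grade-to-gain mapping."""
    total = 0.0
    for j, grade in enumerate(grades, start=1):
        gain = weights[grade]
        total += gain if j < b else gain / math.log(j, b)
    return total

ranked = [3, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0]   # grades of retrieved docs in rank order

print(dcg_total(ranked, {0: 0,  1: 0, 2: 5, 3: 10}))   # traditional: non-relevant docs cost nothing
print(dcg_total(ranked, {0: -2, 1: 0, 2: 5, 3: 10}))   # negative weighting: scanning them hurts
```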

  34. DCG - Binary Relevance, Neg Weights Moderately patient user (log base 4) with traditional weighting (0/1) and with negative weighting (-1/1). Average DCG for 50 topics (TREC 8).

  35. DCG – Weighted (Negative) Relevance Patient user (log base 10) with traditional weighting (0/0/5/10) and with negative values (-2/0/5/10). Average DCG for 41 topics (TREC 7 and 8).

  36. Outline • Introduction • Relevance • Building a Test Collection with GR • A Glance at Evaluation Metrics based on GR • Does it Matter? • Discussion and Conclusion

  37. Discussion • Collections with graded relevance can be created • directly or through secondary analysis • Graded relevance matters - HRDs matter • searchers recognize and value it • more real-life flavor to evaluation • affects system ranking by performance • Many ways of using graded relevance • early precision matters - discounting • weighting supports varying use scenarios • sessions with short browsing can be evaluated - important for realistic evaluation

  38. Conclusion • Graded relevance is an opportunity in IR evaluation • brings a bit more realism to evaluation • affects relative system performance • but there is no revolution: best systems tend to be the best under various metrics employing GR • However, bear in mind the questions: • Is result quality the only thing to look at? • Is system performance alone of interest? • GR is an important opportunity if the answer is no

  39. Thanks!
