
Evaluation



Presentation Transcript


  1. Evaluation INST 734 Module 5 Doug Oard

  2. Agenda • Evaluation fundamentals • Test collections: evaluating sets • Test collections: evaluating rankings • Interleaving • User studies

  3. Which is the Best Rank Order? [figure: six candidate rankings, A through F, with the relevant documents in each marked]

  4. Measuring Precision and Recall. Assume there are a total of 14 relevant documents, and evaluate a system that finds 6 of those 14 in the top 20 (so P@10 = 4/10 = 0.4):
     Rank       1     2     3     4     5     6     7     8     9     10
     Precision  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
     Recall     1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14
     Rank       11    12    13    14    15    16    17    18    19    20
     Precision  5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
     Recall     5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14
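A minimal sketch, not from the slides, of how these numbers are computed, assuming the ranking is represented as a list of booleans (True = relevant) and the total number of relevant documents in the collection is known; the helper name is illustrative:

```python
def precision_recall_at_ranks(is_relevant, total_relevant):
    """Return a list of (precision, recall) pairs, one per rank."""
    results = []
    hits = 0
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
        results.append((hits / rank, hits / total_relevant))
    return results

# The slide's example: 6 of 14 relevant documents retrieved in the top 20,
# at ranks 1, 5, 6, 8, 11, and 16.
ranking = [r in {1, 5, 6, 8, 11, 16} for r in range(1, 21)]
for rank, (p, r) in enumerate(precision_recall_at_ranks(ranking, 14), start=1):
    print(f"rank {rank:2d}: precision = {p:.2f}, recall = {r:.2f}")
```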

  5. Uninterpolated Average Precision. Average of precision at each retrieved relevant document; relevant documents not retrieved contribute zero to the score.
     Rank       1     2     3     4     5     6     7     8     9     10
     Precision  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
     Rank       11    12    13    14    15    16    17    18    19    20
     Precision  5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
     The 8 relevant documents not retrieved contribute eight zeros, so AP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 = 0.2307
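A minimal sketch, not from the slides, of uninterpolated average precision under the same assumptions (boolean ranking, known total number of relevant documents); dividing by the total is what makes each of the eight misses count as a zero:

```python
def average_precision(is_relevant, total_relevant):
    """Sum precision at the rank of each retrieved relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents never retrieved add nothing to the numerator.
    return precision_sum / total_relevant

ranking = [r in {1, 5, 6, 8, 11, 16} for r in range(1, 21)]
print(round(average_precision(ranking, 14), 4))  # 0.2307
```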

  6. Some Topics are Easier Than Others (Ellen Voorhees, 1999)

  7. Mean Average Precision (MAP). [figure: three ranked result lists with relevant documents marked R and precision computed at the rank of each relevant document; the per-topic values are AP = 0.31, AP = 0.53, and AP = 0.76, so MAP = (0.31 + 0.53 + 0.76) / 3 = 0.53]
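A minimal sketch, not from the slides: MAP is just the arithmetic mean of the per-topic average precision values.

```python
def mean_average_precision(per_topic_ap):
    """Mean of per-topic average precision values."""
    return sum(per_topic_ap) / len(per_topic_ap)

print(round(mean_average_precision([0.31, 0.53, 0.76]), 2))  # 0.53
```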

  8. Visualizing Mean Average Precision. [figure: bar chart of Average Precision (y-axis, 0.0 to 1.0) by Topic (x-axis)]
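A sketch, not from the slides, of one way to draw a chart like the one described above; the topic labels and AP values are invented for illustration, and matplotlib is assumed to be available.

```python
import matplotlib.pyplot as plt

topics = ["T1", "T2", "T3", "T4", "T5"]      # hypothetical topic labels
ap_values = [0.82, 0.31, 0.05, 0.53, 0.76]   # hypothetical per-topic AP

plt.bar(topics, ap_values)
plt.ylim(0.0, 1.0)
plt.xlabel("Topic")
plt.ylabel("Average Precision")
plt.show()
```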

  9. What MAP Hides (adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999)

  10. Some Other Evaluation Measures • Mean Reciprocal Rank (MRR) • Geometric Mean Average Precision (GMAP) • Normalized Discounted Cumulative Gain (NDCG) • Binary Preference (BPref) • Inferred AP (infAP)
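Sketches, not from the slides, of two of the listed measures, assuming binary relevance for MRR and graded relevance gains for NDCG; this uses one common NDCG formulation (linear gains, log2 discount), and other formulations exist.

```python
import math

def reciprocal_rank(is_relevant):
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings):
    """Mean Reciprocal Rank over a collection of queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

def ndcg_at_k(retrieved_gains, all_judged_gains, k):
    """NDCG@k: DCG of the retrieved ranking divided by DCG of an ideal ranking."""
    def dcg(values):
        return sum(g / math.log2(rank + 1)
                   for rank, g in enumerate(values[:k], start=1))
    ideal = dcg(sorted(all_judged_gains, reverse=True))
    return dcg(retrieved_gains) / ideal if ideal > 0 else 0.0

print(mean_reciprocal_rank([[False, True, False], [True, False]]))  # 0.75
print(round(ndcg_at_k([3, 2, 0, 1], [3, 2, 1, 1, 0], k=4), 3))
```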

  11. Relevance Judgment Strategies • Exhaustive assessment • Usually impractical • Known-item queries • Limited to MRR, requires hundreds of queries • Search-guided assessment • Hard to quantify risks to completeness • Sampled judgments • Good when relevant documents are common • Pooled assessment • Requires cooperative evaluation

  12. Pooled Assessment Methodology • Systems submit top 1000 documents per topic • Top 100 documents for each are judged • Single pool, without duplicates, arbitrary order • Judged by the person who wrote the query • Treat unevaluated documents as not relevant • Compute MAP down to 1000 documents • Average in misses at 1000 as zero
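A sketch, not from the slides, of the pooling arithmetic described above; the run and judgment formats here are assumptions (runs map a system name to its ranked list of document IDs, and `relevant` is the set of IDs judged relevant for a topic).

```python
def build_pool(runs, depth=100):
    """De-duplicated union of each system's top-`depth` documents for one topic."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

def average_precision(ranked_docs, relevant, cutoff=1000):
    """Unjudged documents are simply absent from `relevant`, i.e. non-relevant."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents missed by rank 1000 add zero to the numerator.
    return precision_sum / len(relevant) if relevant else 0.0
```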

  13. Some Lessons From TREC • Incomplete judgments are useful • If sample is unbiased with respect to systems tested • Additional relevant documents are highly skewed across topics • Different relevance judgments change absolute score • But rarely change comparative advantages when averaged • Evaluation technology is predictive • Results transfer to operational settings Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  14. Recap: “Batch” Evaluation • Evaluation measures focus on relevance • Users also want utility and understandability • Goal is typically to compare systems • Values may vary, but relative differences are stable • Mean values obscure important phenomena • Statistical significance tests address generalizability • Failure analysis case studies can help you improve
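A sketch, not from the slides, of a paired significance test over per-topic average precision for two systems evaluated on the same topics; the AP values are invented, scipy is assumed to be available, and a paired t-test is only one common choice (Wilcoxon or randomization tests are also used).

```python
from scipy.stats import ttest_rel

# Hypothetical per-topic AP for two systems on the same five topics.
ap_system_a = [0.31, 0.53, 0.76, 0.12, 0.44]
ap_system_b = [0.28, 0.61, 0.79, 0.15, 0.50]

statistic, p_value = ttest_rel(ap_system_a, ap_system_b)
print(f"paired t-test: t = {statistic:.3f}, p = {p_value:.3f}")
```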

  15. Agenda • Evaluation fundamentals • Test collections: evaluating sets • Test collections: evaluating rankings • Interleaving • User studies
