
Evaluation



Presentation Transcript


  1. Evaluation INST 734 Module 5 Doug Oard

  2. Agenda • Evaluation fundamentals • Test collections: evaluating sets • Test collections: evaluating rankings • Interleaving • User studies

  3. Which is the Best Rank Order? [figure: six candidate rankings, A through F, with the relevant documents in each marked]

  4. Measuring Precision and Recall. Assume there are a total of 14 relevant documents, and evaluate a system that finds 6 of those 14 in the top 20 (so P@10 = 4/10 = 0.4):
     Rank       1     2     3     4     5     6     7     8     9     10
     Precision  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
     Recall     1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14
     Rank       11    12    13    14    15    16    17    18    19    20
     Precision  5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
     Recall     5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14
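A minimal sketch, not from the slides, of how these numbers are computed, assuming the ranking is represented as a list of booleans (True = relevant) and the total number of relevant documents in the collection is known; the helper name is illustrative:

```python
def precision_recall_at_ranks(is_relevant, total_relevant):
    """Return a list of (precision, recall) pairs, one per rank."""
    results = []
    hits = 0
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
        results.append((hits / rank, hits / total_relevant))
    return results

# The slide's example: 6 of 14 relevant documents retrieved in the top 20,
# at ranks 1, 5, 6, 8, 11, and 16.
ranking = [r in {1, 5, 6, 8, 11, 16} for r in range(1, 21)]
for rank, (p, r) in enumerate(precision_recall_at_ranks(ranking, 14), start=1):
    print(f"rank {rank:2d}: precision = {p:.2f}, recall = {r:.2f}")
```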

  5. Uninterpolated Average Precision. Average of precision at each retrieved relevant document; relevant documents not retrieved contribute zero to the score.
     Rank       1     2     3     4     5     6     7     8     9     10
     Precision  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
     Rank       11    12    13    14    15    16    17    18    19    20
     Precision  5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
     The 8 relevant documents not retrieved contribute eight zeros, so AP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 = 0.2307
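A minimal sketch, not from the slides, of uninterpolated average precision under the same assumptions (boolean ranking, known total number of relevant documents); dividing by the total is what makes each of the eight misses count as a zero:

```python
def average_precision(is_relevant, total_relevant):
    """Sum precision at the rank of each retrieved relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents never retrieved add nothing to the numerator.
    return precision_sum / total_relevant

ranking = [r in {1, 5, 6, 8, 11, 16} for r in range(1, 21)]
print(round(average_precision(ranking, 14), 4))  # 0.2307
```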

  6. Some Topics are Easier Than Others (Ellen Voorhees, 1999)

  7. Mean Average Precision (MAP). [figure: three ranked result lists with relevant documents marked R and precision computed at the rank of each relevant document; the per-topic values are AP = 0.31, AP = 0.53, and AP = 0.76, so MAP = (0.31 + 0.53 + 0.76) / 3 = 0.53]
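A minimal sketch, not from the slides: MAP is just the arithmetic mean of the per-topic average precision values.

```python
def mean_average_precision(per_topic_ap):
    """Mean of per-topic average precision values."""
    return sum(per_topic_ap) / len(per_topic_ap)

print(round(mean_average_precision([0.31, 0.53, 0.76]), 2))  # 0.53
```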

  8. Visualizing Mean Average Precision. [figure: bar chart of Average Precision (y-axis, 0.0 to 1.0) by Topic (x-axis)]
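A sketch, not from the slides, of one way to draw a chart like the one described above; the topic labels and AP values are invented for illustration, and matplotlib is assumed to be available.

```python
import matplotlib.pyplot as plt

topics = ["T1", "T2", "T3", "T4", "T5"]      # hypothetical topic labels
ap_values = [0.82, 0.31, 0.05, 0.53, 0.76]   # hypothetical per-topic AP

plt.bar(topics, ap_values)
plt.ylim(0.0, 1.0)
plt.xlabel("Topic")
plt.ylabel("Average Precision")
plt.show()
```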

  9. What MAP Hides (adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999)

  10. Some Other Evaluation Measures • Mean Reciprocal Rank (MRR) • Geometric Mean Average Precision (GMAP) • Normalized Discounted Cumulative Gain (NDCG) • Binary Preference (BPref) • Inferred AP (infAP)
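Sketches, not from the slides, of two of the listed measures, assuming binary relevance for MRR and graded relevance gains for NDCG; this uses one common NDCG formulation (linear gains, log2 discount), and other formulations exist.

```python
import math

def reciprocal_rank(is_relevant):
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings):
    """Mean Reciprocal Rank over a collection of queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

def ndcg_at_k(retrieved_gains, all_judged_gains, k):
    """NDCG@k: DCG of the retrieved ranking divided by DCG of an ideal ranking."""
    def dcg(values):
        return sum(g / math.log2(rank + 1)
                   for rank, g in enumerate(values[:k], start=1))
    ideal = dcg(sorted(all_judged_gains, reverse=True))
    return dcg(retrieved_gains) / ideal if ideal > 0 else 0.0

print(mean_reciprocal_rank([[False, True, False], [True, False]]))  # 0.75
print(round(ndcg_at_k([3, 2, 0, 1], [3, 2, 1, 1, 0], k=4), 3))
```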

  11. Relevance Judgment Strategies • Exhaustive assessment • Usually impractical • Known-item queries • Limited to MRR, requires hundreds of queries • Search-guided assessment • Hard to quantify risks to completeness • Sampled judgments • Good when relevant documents are common • Pooled assessment • Requires cooperative evaluation

  12. Pooled Assessment Methodology • Systems submit top 1000 documents per topic • Top 100 documents for each are judged • Single pool, without duplicates, arbitrary order • Judged by the person who wrote the query • Treat unevaluated documents as not relevant • Compute MAP down to 1000 documents • Average in misses at 1000 as zero
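A sketch, not from the slides, of the pooling arithmetic described above; the run and judgment formats here are assumptions (runs map a system name to its ranked list of document IDs, and `relevant` is the set of IDs judged relevant for a topic).

```python
def build_pool(runs, depth=100):
    """De-duplicated union of each system's top-`depth` documents for one topic."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

def average_precision(ranked_docs, relevant, cutoff=1000):
    """Unjudged documents are simply absent from `relevant`, i.e. non-relevant."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents missed by rank 1000 add zero to the numerator.
    return precision_sum / len(relevant) if relevant else 0.0
```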

  13. Some Lessons From TREC • Incomplete judgments are useful • If sample is unbiased with respect to systems tested • Additional relevant documents are highly skewed across topics • Different relevance judgments change absolute score • But rarely change comparative advantages when averaged • Evaluation technology is predictive • Results transfer to operational settings Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

  14. Recap: “Batch” Evaluation • Evaluation measures focus on relevance • Users also want utility and understandability • Goal is typically to compare systems • Values may vary, but relative differences are stable • Mean values obscure important phenomena • Statistical significance tests address generalizability • Failure analysis case studies can help you improve
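A sketch, not from the slides, of a paired significance test over per-topic average precision for two systems evaluated on the same topics; the AP values are invented, scipy is assumed to be available, and a paired t-test is only one common choice (Wilcoxon or randomization tests are also used).

```python
from scipy.stats import ttest_rel

# Hypothetical per-topic AP for two systems on the same five topics.
ap_system_a = [0.31, 0.53, 0.76, 0.12, 0.44]
ap_system_b = [0.28, 0.61, 0.79, 0.15, 0.50]

statistic, p_value = ttest_rel(ap_system_a, ap_system_b)
print(f"paired t-test: t = {statistic:.3f}, p = {p_value:.3f}")
```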

  15. Agenda • Evaluation fundamentals • Test collections: evaluating sets • Test collections: evaluating rankings • Interleaving • User studies
