
User Performance versus Precision Measures for Simple Search Tasks (Don’t bother improving MAP)


Presentation Transcript


  1. User Performance versus Precision Measures for Simple Search Tasks (Don’t bother improving MAP) Andrew Turpin, Falk Scholer {aht,fscholer}@cs.rmit.edu.au

  2. People in glass houses should not throw stones http://www.hartley-botanic.co.uk/hartley_images/victorian_range/victorian_range_09.jpg

  3. Scientists should not live in glass houses. Nor straw, nor wood… http://www-math.uni-paderborn.de/~odenbach/pics/pigs/pig2.jpg

  4. Scientists should do more than throw stones www.worth1000.com/entries/ 161000/161483INPM_w.jpg

  5. Overview
  • How are IR systems compared?
  • Mean Average Precision: MAP
  • Do metrics match user experience?
  • First grain (Turpin & Hersh, SIGIR 2000)
  • Second pebble (Turpin & Hersh, SIGIR 2001)
  • Third stone (Allan et al., SIGIR 2005)
  • This golf ball (Turpin & Scholer, SIGIR 2006)

  6. [Worked example: two ranked lists of five documents with binary relevance judgments, showing the precision at each rank. P@1 = 0/1 = 0.00 for both lists; P@5 = 1/5 = 0.20 versus 2/5 = 0.40; AP (the average of the precision values at the relevant documents) = 0.25 versus 0.54.]

  7. AP = (sum of the precision values at relevant documents) / (number of relevant documents in the list): (0.25) / 1 = 0.25 for the first list and (0.67 + 0.40) / 2 = 0.54 for the second. If the divisor is instead the total number of relevant documents for the query (here 3): (0.25) / 3 = 0.08 and (0.67 + 0.40) / 3 = 0.36.
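
The arithmetic on the two slides above can be checked with a short Python sketch (not the authors' code). The 5-document list below is hypothetical but consistent with the first list's numbers, with its single relevant document at rank 4.

def precision_at_k(rels, k):
    """Fraction of the top k documents that are relevant (rels is a 0/1 list)."""
    return sum(rels[:k]) / k

def average_precision(rels, num_relevant=None):
    """Average of the precision values at the ranks of relevant documents.

    By default the divisor is the number of relevant documents retrieved in
    this list; pass num_relevant to divide by the total number of relevant
    documents for the query instead (the usual TREC definition)."""
    precisions = [precision_at_k(rels, i + 1) for i, r in enumerate(rels) if r]
    divisor = num_relevant if num_relevant is not None else max(len(precisions), 1)
    return sum(precisions) / divisor

# Hypothetical list with its only relevant document at rank 4:
list_a = [0, 0, 0, 1, 0]
print(precision_at_k(list_a, 1))      # 0.0  (P@1)
print(precision_at_k(list_a, 5))      # 0.2  (P@5)
print(average_precision(list_a))      # 0.25 (divided by relevant docs in the list)
print(average_precision(list_a, 3))   # ~0.08 (divided by 3 relevant docs for the query)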

  8. Mean Average Precision (MAP)
  • Previous example showed precision for one query
  • Ideally need many queries (50 or more)
  • Take the mean of the AP values over all queries: MAP
  • Do a paired t-test, Wilcoxon, Tukey HSD, …
  • Compares systems on the same collection and same queries
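
A minimal sketch of the comparison slide 8 describes, assuming made-up per-query AP scores for two systems and using SciPy for the paired significance tests.

from statistics import mean
from scipy import stats

# Hypothetical per-query AP scores for two systems over the same five queries.
ap_system_a = [0.21, 0.35, 0.10, 0.44, 0.28]
ap_system_b = [0.30, 0.33, 0.18, 0.51, 0.40]

map_a, map_b = mean(ap_system_a), mean(ap_system_b)

# Paired tests are appropriate because both systems answer the same queries
# over the same collection.
t_stat, t_p = stats.ttest_rel(ap_system_a, ap_system_b)
w_stat, w_p = stats.wilcoxon(ap_system_a, ap_system_b)

print(f"MAP A = {map_a:.3f}, MAP B = {map_b:.3f}")
print(f"paired t-test p = {t_p:.3f}, Wilcoxon p = {w_p:.3f}")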

  9. Typical IR empirical systems paper: Turpin & Moffat, SIGIR 1999

  10. Monz et al., SIGIR 2005; Fang et al., SIGIR 2004; Shi et al., SIGIR 2005; Jordan et al., JCDL, June 2006

  11. Implicit assumption: more relevant documents high in the list is good
  • Do users generally want more than one relevant document?
  • Do users read lists top to bottom?
  • Who determines relevance? Binary? Conditional or state-based?
  • While MAP is tractable, does it reflect user experience?
  • Is Yahoo! really better than Google, or vice versa?

  12. General Experiment
  • Get a collection, set of queries, relevance judgments
  • Compare System A and System B using MAP (Cranfield)
  • Get users to do queries with System A or System B (balanced design… a sketch of one such counterbalanced assignment follows this list)
  • Did the users do better with A or B?
  • Did the users prefer A or B?
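
A sketch of the kind of counterbalanced assignment the "balanced design" bullet alludes to; the user count, engine labels, and query orders are illustrative, not the papers' actual protocols.

from itertools import cycle

# Illustrative pool: 24 users, two engines, two query orders, rotated so that
# engines and query orders are spread evenly across users.
users = [f"user{i:02d}" for i in range(1, 25)]
engines = ["A", "B"]

assignments = []
for i, (user, engine) in enumerate(zip(users, cycle(engines))):
    query_order = "forward" if i % 4 < 2 else "reversed"
    assignments.append((user, engine, query_order))

for row in assignments[:4]:
    print(row)   # ('user01', 'A', 'forward'), ('user02', 'B', 'forward'), ...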

  13. Experiment 2000: 24 users, 6 queries. Engine A: MAP 0.275, IR 0.330. Engine B: MAP 0.324, IR 0.390.

  14. Experiment 2001: 32 users, 8 queries. Engine A: MAP 0.270, QA 66%. Engine B: MAP 0.354, QA 60%.

  15. Experiment 2005
  • James Allan et al., UMass, SIGIR 2005
  • Passage retrieval and a recall task
  • Used bpref, which "tracks MAP" (a rough sketch of bpref follows this list)
  • Small benefit to users when bpref goes from 0.50 to 0.60 and from 0.90 to 0.95
  • No benefit in the mid range, 0.60 to 0.90
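
Slide 15 leans on bpref (Buckley & Voorhees, SIGIR 2004). Below is a rough sketch of one common formulation; trec_eval's handling of edge cases differs in detail, so treat it as illustrative only.

def bpref(judgments):
    """Rough bpref over a ranked list: 1 = judged relevant, 0 = judged
    non-relevant, None = unjudged (unjudged documents are ignored)."""
    R = sum(1 for j in judgments if j == 1)   # judged relevant retrieved
    N = sum(1 for j in judgments if j == 0)   # judged non-relevant retrieved
    if R == 0:
        return 0.0
    denom = min(R, N) if min(R, N) > 0 else 1
    score, nonrel_above = 0.0, 0
    for j in judgments:
        if j == 0:
            nonrel_above += 1
        elif j == 1:
            score += 1.0 - min(nonrel_above, denom) / denom
    return score / R

print(bpref([0, 1, None, 1, 0, 1]))   # hypothetical ranked list -> 0.333...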

  16. Experiments 2000, 2001, 2005 [chart of the experiments' results plotted against MAP]

  17. Experiment 2006: 32 users, 50 queries (100 documents); five systems, A to E, with MAP 0.55, 0.65, 0.75, 0.85, 0.95.
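
The 2006 experiment used result lists pinned to specified MAP levels; how the lists were built is not shown on the slide, so here is a toy sketch (my illustration, not the authors' method) that randomly searches for a placement of relevant documents whose AP is close to a target. It reuses the average_precision helper from the earlier sketch.

import random

def list_with_target_ap(target, list_len=100, num_rel=5, tries=20000, seed=0):
    """Randomly place num_rel relevant documents in a list of list_len and
    keep the placement whose AP is closest to the target. Illustrative only."""
    rng = random.Random(seed)
    best, best_gap = None, float("inf")
    for _ in range(tries):
        positions = set(rng.sample(range(list_len), num_rel))
        rels = [1 if i in positions else 0 for i in range(list_len)]
        gap = abs(average_precision(rels, num_rel) - target)
        if gap < best_gap:
            best, best_gap = rels, gap
    return best, best_gap

rels, gap = list_with_target_ap(0.75)
print(round(average_precision(rels, 5), 3))   # close to 0.75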

  18. Our Sheep

  19. Time required to find first relevant document [plot: time in seconds (0–300) against MAP (0.55–0.95)]

  20. Failures [plot: % of queries with no relevant answer against MAP]

  21. “Better” MAP definition

  22. Conclusion
  • MAP does allow us to compare IR systems, but the assumption that an increase in MAP translates into an increase in user performance or satisfaction is not true
  • Supported by 4 different experiments
  • Don't automatically choose MAP as a metric
  • P@1 for Web-style tasks?

  23. P@1 [plot: time in seconds (0–300) against P@1 of 0 or 1]

  24. [chart with results binned into ranges 0–10% through 90–100%]

  25. Rank of saved/viewed docs

  26. Number of relevant found
