User Performance versus Precision Measures for Simple Search Tasks (Don’t bother improving MAP)

Andrew Turpin

Falk Scholer

{aht,[email protected]


People in glass houses should not throw stones

http://www.hartley-botanic.co.uk/hartley_images/victorian_range/victorian_range_09.jpg


Scientists should not live in glass houses.

Nor straw, nor wood…

http://www-math.uni-paderborn.de/~odenbach/pics/pigs/pig2.jpg


Scientists should do more than throw stones

www.worth1000.com/entries/161000/161483INPM_w.jpg


Overview

  • How are IR systems compared?

    • Mean Average Precision: MAP

  • Do metrics match user experience?

  • First grain (Turpin & Hersh SIGIR 2000)

  • Second pebble (Turpin & Hersh SIGIR 2001)

  • Third stone (Allan et al SIGIR 2005)

  • This golf ball (Turpin & Scholer SIGIR 2006)


Worked example: two ranked lists for one query, with binary relevance judgments

  • List 1: P@1 = 0/1 = 0.00, P@5 = 1/5 = 0.20, AP (average of precision at the relevant documents) = 0.25
  • List 2: P@1 = 0/1 = 0.00, P@5 = 2/5 = 0.40, AP (average of precision at the relevant documents) = 0.54

AP = (sum of precision values at relevant documents) / (number of relevant docs in the list)

  • List 1: AP = 0.25 / 1 = 0.25
  • List 2: AP = (0.67 + 0.40) / 2 = 0.54

AP = (sum of precision values at relevant documents) / (total number of relevant docs for the query: 3 in this example)

  • List 1: AP = 0.25 / 3 = 0.08
  • List 2: AP = (0.67 + 0.40) / 3 = 0.36
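To make the arithmetic above concrete, here is a minimal Python sketch of P@k and AP for a single ranked list with binary relevance judgments. The function names and the example list (one relevant document at rank 4, matching List 1 above) are my own illustration, not code from the paper.

```python
def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant; rels is a list of 0/1 judgments."""
    return sum(rels[:k]) / k

def average_precision(rels, num_relevant):
    """Sum of the precision values at each relevant rank, divided by num_relevant
    (whichever count of relevant documents is used as the denominator)."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant

# List 1 from the slide: a single relevant document, at rank 4.
list1 = [0, 0, 0, 1, 0, 0]
print(precision_at_k(list1, 1))      # 0.0
print(precision_at_k(list1, 5))      # 0.2
print(average_precision(list1, 1))   # 0.25  (denominator: relevant docs in the list)
print(average_precision(list1, 3))   # ~0.08 (denominator: all relevant docs for the query)
```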


Mean Average Precision (MAP)

  • Previous example showed precision for one query

  • Ideally need many queries (50 or more)

  • Take the mean of the AP values over all queries: MAP

  • Do a paired t-test, Wilcoxon signed-rank test, Tukey HSD, … (a sketch follows this list)

  • Compares systems on the same collection and same queries
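As a rough illustration of the workflow in the list above, the sketch below averages per-query AP scores into MAP for two systems and runs a paired t-test. The scores are invented for the example; in practice one would use 50 or more queries, as noted above.

```python
from statistics import mean
from scipy.stats import ttest_rel  # paired t-test; scipy.stats.wilcoxon is an alternative

# Hypothetical per-query AP scores for two systems over the same five queries.
ap_system_a = [0.31, 0.12, 0.45, 0.08, 0.27]
ap_system_b = [0.35, 0.10, 0.52, 0.15, 0.30]

map_a, map_b = mean(ap_system_a), mean(ap_system_b)
result = ttest_rel(ap_system_a, ap_system_b)
print(f"MAP A = {map_a:.3f}, MAP B = {map_b:.3f}, paired t-test p = {result.pvalue:.3f}")
```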


Typical IR empirical systems paper

Turpin & Moffat SIGIR 1999


Monz et al SIGIR 2005

Fang et al SIGIR 2004

Shi et al SIGIR 2005

Jordan et al JCDL June 2006


Implicit assumption

More relevant documents high in the list is good

  • Do users generally want more than one relevant document?

  • Do users read lists top to bottom?

  • Who determines relevance? Binary? Conditional or state-based?

  • While MAP is tractable, does it reflect user experience?

  • Is Yahoo! really better than Google, or vice-versa?


General Experiment

  • Get a collection, set of queries, relevance judgments

  • Compare System A and System B using MAP (Cranfield)

  • Get users to do queries with System A or System B (balanced design… sketched after this list)

  • Did the users do better with A or B?

  • Did the users prefer A or B?
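One simple way to realise the balanced design mentioned above is to cross users with systems so that every query is run on System A and System B by equally sized user groups. The sketch below is only an assumed illustration (the user and query counts echo the 2000 experiment); it is not the authors' actual assignment protocol.

```python
# Counterbalanced assignment: each user runs every query, and the system used
# for a given query alternates across users, so A and B are compared on the
# same queries by equally sized user groups.
users = [f"user{u:02d}" for u in range(1, 25)]   # e.g. 24 users, as in Experiment 2000
queries = [f"q{q}" for q in range(1, 7)]         # e.g. 6 queries
systems = ["A", "B"]

assignment = {}
for u, user in enumerate(users):
    for q, query in enumerate(queries):
        # Offsetting by the user index crosses users with systems.
        assignment[(user, query)] = systems[(u + q) % len(systems)]

# Sanity check: every query is run with A and with B by 12 users each.
for query in queries:
    counts = {s: sum(1 for user in users if assignment[(user, query)] == s) for s in systems}
    print(query, counts)
```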


Experiment 2000 (24 users, 6 queries)

  • Engine A: MAP 0.275, IR 0.330
  • Engine B: MAP 0.324, IR 0.390


Experiment 2001 (32 users, 8 queries)

  • Engine A: MAP 0.270, QA 66%
  • Engine B: MAP 0.354, QA 60%


Experiment 2005

  • James Allan et al., UMass, SIGIR 2005

  • Passage retrieval and a recall task

  • Used bpref, which “tracks MAP”

  • Small benefit to users when bpref goes from

    • 0.50 to 0.60 and 0.90 to 0.95

  • No benefit in the mid range 0.60 to 0.90


Experiments 2000, 2001, 2005

[Chart: user performance plotted against MAP for the experiments above.]


Experiment 2006 (32 users, 50 queries, 100 documents)

  • System A: MAP 0.55
  • System B: MAP 0.65
  • System C: MAP 0.75
  • System D: MAP 0.85
  • System E: MAP 0.95
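The five systems above differ only in their MAP level. As a purely illustrative sketch of how a 100-document ranked list with an AP near a chosen target might be built (this greedy demotion scheme is my own, not the construction used in the paper), one can start from a perfect ranking and push relevant documents down one rank at a time:

```python
def average_precision(rels, num_relevant):
    """AP of a binary relevance list, with num_relevant as the denominator."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant

def list_with_target_ap(target, list_len=100, num_relevant=10):
    """Start from the perfect ranking (AP = 1.0) and repeatedly demote the
    lowest movable relevant document by one rank until AP drops to about `target`."""
    rels = [1] * num_relevant + [0] * (list_len - num_relevant)
    while average_precision(rels, num_relevant) > target:
        for i in range(list_len - 2, -1, -1):
            if rels[i] == 1 and rels[i + 1] == 0:
                rels[i], rels[i + 1] = 0, 1
                break
        else:
            break  # all relevant documents already sit at the bottom of the list
    return rels

for target in (0.55, 0.65, 0.75, 0.85, 0.95):
    rels = list_with_target_ap(target)
    print(target, round(average_precision(rels, num_relevant=10), 3))
```

Whether such synthetic lists resemble real system output is a separate question; the sketch only shows that an AP level can be dialled in per query.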


Our Sheep


Time required to find first relevant document

[Chart: time (seconds, 0-300) to find the first relevant document, for each system MAP level (0.55, 0.65, 0.75, 0.85, 0.95).]


Failures

[Chart: % of queries with no relevant answer, plotted against MAP.]



Conclusion

  • MAP does allow us to compare IR systems, but the assumption that an increase in MAP translates into an increase in user performance or satisfaction is not true

    • Supported by 4 different experiments

  • Don’t automatically choose MAP as a metric

    • P@1 for Web-style tasks?


P@1

[Chart: time (seconds, 0-300) against P@1 (0 or 1).]


[Chart with results grouped into 10% bins, from 0-10% to 90-100%.]



