User Performance versus Precision Measures for Simple Search Tasks(Don’t bother improving MAP)

Andrew Turpin

Falk Scholer

{aht,fscholer}@cs.rmit.edu.au

slide2

People in glass houses should not throw stones

http://www.hartley-botanic.co.uk/hartley_images/victorian_range/victorian_range_09.jpg

slide3

Scientists should not live in glass houses.

Nor straw, nor wood…

http://www-math.uni-paderborn.de/~odenbach/pics/pigs/pig2.jpg

slide4

Scientists should do more than throw stones

www.worth1000.com/entries/161000/161483INPM_w.jpg

Overview
  • How are IR systems compared?
    • Mean Average Precision: MAP
  • Do metrics match user experience?
  • First grain (Turpin & Hersh SIGIR 2000)
  • Second pebble (Turpin & Hersh SIGIR 2001)
  • Third stone (Allan et al SIGIR 2005)
  • This golf ball (Turpin & Scholer SIGIR 2006)
slide6

[Figure: two example ranked result lists, with the relevance judgments and the precision at each rank shown alongside.
P@5: 1/5 = 0.20 (list 1) vs 2/5 = 0.40 (list 2)
P@1: 0/1 = 0.00 (list 1) vs 0/1 = 0.00 (list 2)
AP (average of the precision values at the relevant documents): 0.25 (list 1) vs 0.54 (list 2)]
slide7

AP = (sum of the precision values at relevant documents) / (number of relevant docs in the list)

List 1: AP = 0.25 / 1 = 0.25
List 2: AP = (0.67 + 0.40) / 2 = 0.54

AP = (sum of the precision values at relevant documents) / (number of relevant docs in all lists)

List 1: AP = 0.25 / 3 = 0.08
List 2: AP = (0.67 + 0.40) / 3 = 0.36
Mean Average Precision (MAP)
  • Previous example showed precision for one query
  • Ideally need many queries (50 or more)
  • Take the mean of the AP values over all queries: MAP
  • Do a paired t-test, Wilcoxon, Tukey HSD, … (see the sketch after this list)
  • Compares systems on the same collection and same queries
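
The sketch below illustrates the comparison procedure described in this list: MAP as the mean of the per-query AP values, followed by a paired significance test. The per-query AP numbers are invented purely for illustration, and a real evaluation would use 50 or more queries.

from scipy import stats  # assumes SciPy is available for the paired tests

# Invented per-query AP values for two systems over the same five queries.
ap_a = [0.25, 0.40, 0.10, 0.55, 0.30]
ap_b = [0.36, 0.45, 0.13, 0.51, 0.42]

map_a = sum(ap_a) / len(ap_a)  # MAP of system A
map_b = sum(ap_b) / len(ap_b)  # MAP of system B

# Paired tests: each query contributes one AP value per system.
t_stat, t_p = stats.ttest_rel(ap_a, ap_b)
w_stat, w_p = stats.wilcoxon(ap_a, ap_b)

print(f"MAP A = {map_a:.3f}, MAP B = {map_b:.3f}")
print(f"paired t-test p = {t_p:.3f}, Wilcoxon p = {w_p:.3f}")
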
slide9

Typical IR empirical systems paper

Turpin & Moffat SIGIR 1999

slide10

Monz et al SIGIR 2005

Fang et al SIGIR 2004

Shi et al SIGIR 2005

Jordan et al JCDL June 2006

Implicit assumption

Having more relevant documents high in the result list is good

  • Do users generally want more than one relevant document?
  • Do users read lists top to bottom?
  • Who determines relevance? Binary? Conditional or state-based?
  • While MAP is tractable, does it reflect user experience?
  • Is Yahoo! really better than Google, or vice-versa?
General Experiment
  • Get a collection, a set of queries, and relevance judgments
  • Compare System A and System B using MAP (Cranfield-style batch evaluation)
  • Get users to do the queries with System A or System B (balanced design…; one possible scheme is sketched after this list)
  • Did the users do better with A or B?
  • Did the users prefer A or B?
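
The slides do not spell out the balanced design, so the sketch below is only one plausible scheme (an assumption, not necessarily the authors' design): rotate users across systems so that every user sees both systems and every query is run on each system by the same number of users.

# Hypothetical balanced assignment of users to (query, system) pairs.
# A Latin-square-style rotation used here purely for illustration.

def balanced_assignment(num_users, queries, systems=("A", "B")):
    # Each user alternates systems from query to query, and the starting
    # system rotates across users, so each (query, system) cell is covered
    # by num_users / len(systems) users when num_users divides evenly.
    schedule = []
    for user in range(num_users):
        for q_index, query in enumerate(queries):
            system = systems[(user + q_index) % len(systems)]
            schedule.append((user, query, system))
    return schedule

# Example: 4 users and 4 queries -> each query is run on each system by 2 users.
for user, query, system in balanced_assignment(4, ["q1", "q2", "q3", "q4"]):
    print(user, query, system)
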
Experiment 2000

24 users, 6 queries

Engine A: MAP 0.275, instance recall (IR) 0.330
Engine B: MAP 0.324, instance recall (IR) 0.390

slide14

Experiment 2001

32 users, 8 queries

Engine A: MAP 0.270, question answering accuracy (QA) 66%
Engine B: MAP 0.354, question answering accuracy (QA) 60%

Experiment 2005
  • James Allan et al., UMass, SIGIR 2005
  • Passage retrieval and a recall task
  • Used bpref, which “tracks MAP”
  • Small benefit to users when bpref goes from
    • 0.50 to 0.60 and 0.90 to 0.95
  • No benefit in the mid range 0.60 to 0.90
slide16

Experiments 2000, 2001, 2005

[Figure: results of the 2000, 2001, and 2005 experiments summarised against MAP]
Experiment 2006

32 users, 50 queries (100 documents)

System A: MAP 0.55
System B: MAP 0.65
System C: MAP 0.75
System D: MAP 0.85
System E: MAP 0.95
slide20

Time required to find first relevant document

[Figure: time to find the first relevant document, in seconds (0-300), plotted against system MAP (0.55, 0.65, 0.75, 0.85, 0.95)]
Failures

% of queries with no relevant answer

[Figure: failure rate plotted against system MAP]
Conclusion
  • MAP does allow us to compare IR systems, but the assumption that an increase in MAP translates into an increase in user performance or satisfaction is not true
    • Supported by 4 different experiments
  • Don’t automatically choose MAP as a metric
    • P@1 for Web style tasks?
slide24

[Figure: time to find the first relevant document, in seconds (0-300), for queries grouped by P@1 = 0 versus P@1 = 1]
slide25

[Figure: results binned into percentage deciles, 0-10% through 90-100%]
