Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation

SIGIR 2010

Yisong Yue

Cornell University

Joint work with:

Yue Gao, Olivier Chapelle, Ya Zhang, Thorsten Joachims

Retrieval Evaluation Using Click Data
  • Eliciting relative feedback
    • E.g., is A better than B?
  • Evaluation pipeline
    • Online experiment design (example to follow)
    • Collect clicks
    • Use standard statistical tests (e.g., t-test)
  • Contribution: Supervised learning algorithm for training a more efficient test statistic

Team-Game Interleaving (Online Experiment for Search Applications)

(u=thorsten, q=“svm”)

A(u,q) → r1
1. Kernel Machines http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/
3. Support Vector Machine and Kernel ... References http://svm.research.bell-labs.com/SVMrefs.html
4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html
5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk

B(u,q) → r2
1. Kernel Machines http://svm.first.gmd.de/
2. Support Vector Machine http://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machines http://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES ... http://www.jiscmail.ac.uk/lists/SUPPORT...
5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/

Interleaving(r1, r2)
1. Kernel Machines (T2) http://svm.first.gmd.de/
2. Support Vector Machine (T1) http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine (T2) http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines (T1) http://www.support-vector.net/
5. Support Vector Machine and Kernel ... References (T2) http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES ... (T1) http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet (T2) http://svm.research.bell-labs.com/SVT/SVMsvt.html

  • Mix results of A and B
  • Relative feedback
  • More reliable

Interpretation: (r1 > r2) ↔ clicks(r1) > clicks(r2)

[Radlinski, Kurup, Joachims, CIKM 2008]
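
The slides do not spell out the interleaving procedure itself. As a rough illustration of how a team-draft-style interleaving policy can be implemented, here is a minimal Python sketch; the function name, the coin-flip tie-breaking, and the handling of exhausted rankings are my own choices and should not be read as the exact algorithm of the cited paper.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10, seed=None):
    """Minimal team-draft interleaving sketch: rankings are lists of doc ids,
    best first.  Returns the interleaved list and a dict mapping each shown
    doc to the team ('A' or 'B') that contributed it."""
    rng = random.Random(seed)
    interleaved, team = [], {}
    remaining = {"A": list(ranking_a), "B": list(ranking_b)}
    counts = {"A": 0, "B": 0}
    while len(interleaved) < k and (remaining["A"] or remaining["B"]):
        # The team that has contributed fewer results picks next; ties are broken randomly.
        if counts["A"] != counts["B"]:
            side = "A" if counts["A"] < counts["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        if not remaining[side]:            # fall back if that team is exhausted
            side = "B" if side == "A" else "A"
        # Take the team's highest-ranked result that has not been shown yet.
        while remaining[side] and remaining[side][0] in team:
            remaining[side].pop(0)
        if remaining[side]:
            doc = remaining[side].pop(0)
            interleaved.append(doc)
            team[doc] = side
            counts[side] += 1
    return interleaved, team

# Toy usage with made-up document ids.
r1 = ["kernel-machines", "svm-light", "svm-refs", "svm-applet", "rhul-svm"]
r2 = ["kernel-machines", "jbolivar-svm", "svm-intro", "svm-archives", "svm-light"]
print(team_draft_interleave(r1, r2, k=7, seed=0))
```

Clicks on the interleaved list are then credited to whichever team contributed the clicked result, which is what the per-query scores below are computed from.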

Determining Statistical Significance
  • Each q, interleave A(q) and B(q), log clicks
  • t-Test
    • For each q, score: % clicks on A(q)
      • E.g., 3/4 = 0.75
    • Sample mean score (e.g., 0.6)
    • Compute confidence (p value)
      • E.g., want p = 0.05 (i.e., 95% confidence)
    • More data, more confident
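
To make the t-test step concrete, the sketch below runs a one-sample t-test on hypothetical per-query scores (the fraction of a session's clicks that landed on A's results); under the null hypothesis that A and B are tied, the expected score is 0.5. The numbers are made up for illustration.

```python
from scipy import stats

# Hypothetical per-query scores: fraction of clicks in each session that went to A.
# Under the null hypothesis that A and B are tied, the expected score is 0.5.
scores = [0.75, 0.50, 1.00, 0.60, 0.40, 0.80, 0.50, 0.70, 0.65, 0.55]

t_stat, p_value = stats.ttest_1samp(scores, popmean=0.5)
print(f"mean score = {sum(scores) / len(scores):.3f}")
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
# A small p-value (e.g., below 0.05) rejects the null hypothesis that A and B are tied;
# collecting more queries shrinks the standard error and increases confidence.
```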
Determining Statistical Significance
  • Each q, interleave A(q) and B(q), log clicks
  • Other Statistical Tests:
    • z-Test
      • (equivalent to the t-Test for large samples)
    • Rank Test
    • Binomial Test
    • Etc…
    • All similar
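
As one example of the alternatives listed above, a binomial (sign) test simply counts, over all interleaved sessions and ignoring ties, how often A received more clicks than B. A minimal sketch with illustrative counts:

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """One-sided exact binomial (sign) test: probability of seeing at least
    wins_a A-wins out of n = wins_a + wins_b sessions if wins were a fair coin."""
    n = wins_a + wins_b
    return sum(comb(n, k) for k in range(wins_a, n + 1)) / 2 ** n

# Illustrative counts: A won 14 sessions, B won 6, ties dropped.
print(f"p = {sign_test_p_value(14, 6):.4f}")   # about 0.058
```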
Limitation
  • Example: query session with 2 clicks
    • One click at rank 1 (from A)
    • Later click at rank 4 (from B)
    • Normally would count this query session as a tie
    • But second click is probably more informative…
    • …so B should get more credit for this query
Linear Model
  • Feature vector φ(q,c) describes click c in query session q
  • Weight of click is wᵀφ(q,c)
Example
  • wᵀφ(q,c) differentiates last clicks from other clicks
  • Interleave A vs B
    • 3 clicks per session
    • Last click lands on a result from A 60% of the time
    • Other 2 clicks random (50/50)
  • Conventional w = (1,1) counts every click equally and has high variance
  • Counting only the last click, w = (1,0), minimizes variance (illustrated by the simulation sketch below)
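
A quick way to see why w = (1,0) wins in this toy scenario is to simulate it. In the sketch below each counted click contributes +1 if it landed on A's result and -1 if it landed on B's; this per-session scoring is my own encoding of the slide's setup, and the per-query z-score (mean divided by standard deviation) is the quantity the test statistic cares about.

```python
import random
import statistics

def simulate_session(rng):
    """One session from the example: 3 clicks, the last click lands on A with
    probability 0.6, the other two clicks are 50/50 noise.  Each click is scored
    +1 for A and -1 for B; returns (last_click_score, other_clicks_score)."""
    last = 1 if rng.random() < 0.6 else -1
    others = sum(1 if rng.random() < 0.5 else -1 for _ in range(2))
    return last, others

def per_query_z(scores):
    """Mean over standard deviation of the per-session scores (null mean is 0)."""
    return statistics.mean(scores) / statistics.stdev(scores)

rng = random.Random(0)
sessions = [simulate_session(rng) for _ in range(100_000)]

all_clicks = [last + others for last, others in sessions]   # w = (1, 1)
last_only  = [last for last, _ in sessions]                 # w = (1, 0)

print(f"w=(1,1): mean {statistics.mean(all_clicks):+.3f}, per-query z {per_query_z(all_clicks):.3f}")
print(f"w=(1,0): mean {statistics.mean(last_only):+.3f}, per-query z {per_query_z(last_only):.3f}")
# Both weightings have the same expected score (about +0.2, favouring A), but the
# last-click-only statistic has much lower variance, so it reaches a given
# significance level with far fewer queries.
```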
Scoring Query Sessions
  • Feature representation for query session: ψq
  • Weighted score for query: wᵀψq
  • Positive score favors A, negative favors B
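
The feature list and the aggregation formula appear only as figures in the original slides, so the sketch below is a reconstruction: it uses the two-feature example from the talk (last click vs. other click) and assumes ψq is formed by adding φ(q,c) for clicks on A's results and subtracting it for clicks on B's, which gives the stated behaviour that a positive score favours A. The dictionary keys and function names are hypothetical.

```python
import numpy as np

def click_features(query, click):
    """Hypothetical per-click feature map phi(q, c).  The real feature list is in a
    slide figure not reproduced here; these two features follow the talk's example:
    (1 if this is the last click of the session, 1 if it is any other click)."""
    is_last = 1.0 if click["is_last"] else 0.0
    return np.array([is_last, 1.0 - is_last])

def session_features(query, clicks):
    """Session representation psi_q: add phi(q, c) for clicks on A's results and
    subtract it for clicks on B's, so that a positive weighted score favours A
    (my reconstruction of the aggregation the slides leave implicit)."""
    psi = np.zeros(2)
    for c in clicks:
        sign = 1.0 if c["team"] == "A" else -1.0
        psi += sign * click_features(query, c)
    return psi

def session_score(w, query, clicks):
    """Weighted query-session score w^T psi_q."""
    return float(w @ session_features(query, clicks))

# Toy session from the "Limitation" slide: an early click on A, a later (last) click on B.
clicks = [{"team": "A", "is_last": False}, {"team": "B", "is_last": True}]
print(session_score(np.array([1.0, 1.0]), "svm", clicks))   #  0.0 -> counted as a tie
print(session_score(np.array([1.0, 0.0]), "svm", clicks))   # -1.0 -> credit goes to B
```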
Upgraded Test Statistic
  • t-Test:
    • Compute mean score wᵀψq over all queries
      • E.g., 0.2
    • Null hypothesis: mean = 0
    • Can reach statistical significance sooner
  • How to learn w?
Supervised Learning
  • We optimize for the z-Test: the "Inverse z-Test"
    • Approximately equivalent to the t-Test for large samples
    • z-score = mean / standard deviation

(Assumes A > B for the training pairs)
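
Maximizing mean(wᵀψ)/std(wᵀψ) over w is a Rayleigh-quotient-style problem: up to scale, the maximizer is w = Σ⁻¹μ, where μ and Σ are the sample mean and covariance of the session features ψq. The sketch below computes this on synthetic data; the ridge term and the synthetic feature distribution are my own assumptions, not details taken from the paper.

```python
import numpy as np

def inverse_z_test_weights(psi, reg=1e-3):
    """Choose w to maximize mean(w^T psi) / std(w^T psi) over training sessions.

    psi: (n_sessions, n_features) array of session features, oriented so that the
    known-better ranker is A (the slides' assumption A > B).  Up to scale the
    maximizer is w = Sigma^{-1} mu; the ridge term `reg` is an assumption added
    for numerical stability.
    """
    mu = psi.mean(axis=0)
    sigma = np.cov(psi, rowvar=False) + reg * np.eye(psi.shape[1])
    w = np.linalg.solve(sigma, mu)
    return w / np.linalg.norm(w)

def z_score(psi, w):
    """z-score of the weighted session scores against a null mean of 0."""
    s = psi @ w
    return s.mean() / (s.std(ddof=1) / np.sqrt(len(s)))

# Hypothetical training data: feature 0 carries the signal, feature 1 is pure noise.
rng = np.random.default_rng(0)
n = 5000
psi = np.column_stack([0.2 + rng.standard_normal(n), 3.0 * rng.standard_normal(n)])

w = inverse_z_test_weights(psi)
print("learned w:", np.round(w, 3))                     # puts most weight on feature 0
print("z with learned w :", round(z_score(psi, w), 1))
print("z with uniform w :", round(z_score(psi, np.array([1.0, 1.0]) / np.sqrt(2)), 1))
```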

Inverting Other Statistical Tests
  • Most statistical tests use a test statistic
    • E.g., z-score
    • Rank test: % concordant pairs in confidence ranking (ROC Area)
    • We optimized using logistic regression
Recap
  • Collect training data
    • Pairs of retrieval functions A & B, (A > B)
    • Interleave them, collect usage logs
  • Build features
    • For each query session q: ψq
    • Query session score: wᵀψq (positive favors A)
  • Train w to optimize test statistic
    • z-Test: maximize (mean(w) / std(w))
Experiment Setup
  • Data collection
    • Pool of retrieval functions
    • Hash users into partitions
    • Run interleaving of different pairs in parallel
  • Collected on arXiv.org
    • 2 pools of retrieval functions
    • Training Pool (6 pairs): A > B known in advance
    • New Pool (12 pairs)
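
The slides only name the "hash users into partitions" step; the sketch below shows one generic way such an assignment can be made deterministic, so that the same user always sees the same interleaving pair. The hash function and the pairing scheme are illustrative assumptions, not details of the arXiv.org setup.

```python
import hashlib

def assign_partition(user_id, n_partitions):
    """Deterministically hash a user into one of n experiment partitions."""
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions

# Hypothetical pool of interleaving experiments, one (A, B) pair per partition.
pairs = [("rankerA", "rankerB"), ("rankerA", "rankerC"), ("rankerB", "rankerC")]
for user in ["alice", "bob", "carol"]:
    a, b = pairs[assign_partition(user, len(pairs))]
    print(f"{user}: interleave {a} vs {b}")
```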
Experimental Results
  • Inverse z-Test works well
    • Beats the baseline on most of the new interleaving pairs
    • Directions of the tests are all in agreement
    • In 6/12 pairs, for p=0.1, reduces the required sample size by 10%
    • In 4/12 pairs, achieves p=0.05 where the baseline does not
      • 400 to 650 queries per interleaving experiment
  • Weights hard to interpret (features correlated)
  • Largest weight: “1 if single click & rank > 1”
Conclusion
  • Principled, offers practical benefits
  • Should perform better with more training data
  • Can be applied to other application domains
  • Limitations:
    • Treats training data as one sample
    • Might not work well when test data is very different from training data
    • Susceptible to adversarial behavior
Training Logistic Regression
  • Need to mirror training data
  • E.g.,
    • ψq with label 1
    • -ψq with label 0
  • Otherwise, will learn a trivial model that always predicts 1
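
A minimal sketch of this mirroring trick, using scikit-learn's LogisticRegression as a stand-in for whatever solver was actually used; the synthetic session features and the choice to drop the intercept are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical session features psi_q from interleaving pairs where A is known to be
# better, so a positive w^T psi_q should be the "correct" outcome.
rng = np.random.default_rng(0)
psi = np.column_stack([0.3 + rng.standard_normal(500), rng.standard_normal(500)])

# Mirror the data: every psi_q with label 1 is paired with -psi_q labelled 0.
X = np.vstack([psi, -psi])
y = np.concatenate([np.ones(len(psi)), np.zeros(len(psi))])

# Without mirroring, every example carries label 1 and there is nothing to
# discriminate; with mirrored data and no intercept, the learner must find a
# direction w that scores the true sessions above their negations.
model = LogisticRegression(fit_intercept=False)
model.fit(X, y)
print("learned w:", np.round(model.coef_.ravel(), 3))
```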