
Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation


Presentation Transcript


  1. Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation SIGIR 2010 Yisong Yue Cornell University Joint work with: Yue Gao, Olivier Chapelle, Ya Zhang, Thorsten Joachims

  2. Retrieval Evaluation Using Click Data • Eliciting relative feedback • E.g., is A better than B? • Evaluation pipeline • Online experiment design (example to follow) • Collect clicks • Use standard statistical tests (e.g., t-test) • Contribution: Supervised learning algorithm for training a more efficient test statistic

  3. Retrieval Evaluation Using Click Data • Eliciting relative feedback • E.g., is A better than B? • Evaluation pipeline • Online experiment design (example to follow) • Collect clicks • Use standard test statistics (e.g., t-test) • Contribution: Supervised learning algorithm for training a more efficient test statistic

  4. Team-Game Interleaving (Online Experiment for Search Applications) (u=thorsten, q=“svm”) A(u,q) → r1 B(u,q) → r2 1. Kernel Machines http://svm.first.gmd.de/ 2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/ 3. Support Vector Machine and Kernel ... References http://svm.research.bell-labs.com/SVMrefs.html 4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html 5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk 1. Kernel Machines http://svm.first.gmd.de/ 2. Support Vector Machine http://jbolivar.freeservers.com/ 3. An Introduction to Support Vector Machines http://www.support-vector.net/ 4. Archives of SUPPORT-VECTOR-MACHINES ... http://www.jiscmail.ac.uk/lists/SUPPORT... 5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/ [Radlinski, Kurup, Joachims, CIKM 2008]

  5. Team-Game Interleaving (Online Experiment for Search Applications) (u=thorsten, q=“svm”) A(u,q) → r1 B(u,q) → r2 1. Kernel Machines http://svm.first.gmd.de/ 2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/ 3. Support Vector Machine and Kernel ... References http://svm.research.bell-labs.com/SVMrefs.html 4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html 5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk 1. Kernel Machines http://svm.first.gmd.de/ 2. Support Vector Machine http://jbolivar.freeservers.com/ 3. An Introduction to Support Vector Machines http://www.support-vector.net/ 4. Archives of SUPPORT-VECTOR-MACHINES ... http://www.jiscmail.ac.uk/lists/SUPPORT... 5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/ Interleaving(r1, r2) 1. Kernel Machines T2 http://svm.first.gmd.de/ 2. Support Vector Machine T1 http://jbolivar.freeservers.com/ 3. SVM-Light Support Vector Machine T2 http://ais.gmd.de/~thorsten/svm light/ 4. An Introduction to Support Vector Machines T1 http://www.support-vector.net/ 5. Support Vector Machine and Kernel ... References T2 http://svm.research.bell-labs.com/SVMrefs.html 6. Archives of SUPPORT-VECTOR-MACHINES ... T1 http://www.jiscmail.ac.uk/lists/SUPPORT... 7. Lucent Technologies: SVM demo applet T2 http://svm.research.bell-labs.com/SVT/SVMsvt.html • Mix results of A and B • Relative feedback • More reliable [Radlinski, Kurup, Joachims, CIKM 2008]

  6. Team-Game Interleaving (Online Experiment for Search Applications) (u=thorsten, q=“svm”) A(u,q) → r1 B(u,q) → r2 1. Kernel Machines http://svm.first.gmd.de/ 2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/ 3. Support Vector Machine and Kernel ... References http://svm.research.bell-labs.com/SVMrefs.html 4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html 5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk 1. Kernel Machines http://svm.first.gmd.de/ 2. Support Vector Machine http://jbolivar.freeservers.com/ 3. An Introduction to Support Vector Machines http://www.support-vector.net/ 4. Archives of SUPPORT-VECTOR-MACHINES ... http://www.jiscmail.ac.uk/lists/SUPPORT... 5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/ Interleaving(r1, r2) 1. Kernel Machines T2 http://svm.first.gmd.de/ 2. Support Vector Machine T1 http://jbolivar.freeservers.com/ 3. SVM-Light Support Vector Machine T2 http://ais.gmd.de/~thorsten/svm light/ 4. An Introduction to Support Vector Machines T1 http://www.support-vector.net/ 5. Support Vector Machine and Kernel ... References T2 http://svm.research.bell-labs.com/SVMrefs.html 6. Archives of SUPPORT-VECTOR-MACHINES ... T1 http://www.jiscmail.ac.uk/lists/SUPPORT... 7. Lucent Technologies: SVM demo applet T2 http://svm.research.bell-labs.com/SVT/SVMsvt.html • Mix results of A and B • Relative feedback • More reliable Interpretation: (r1 > r2) ↔ clicks(r1) > clicks(r2) [Radlinski, Kurup, Joachims, CIKM 2008]
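The interleaving scheme on these slides can be sketched as team-draft interleaving (after Radlinski, Kurup & Joachims, CIKM 2008). This is a minimal illustration, not the authors' exact implementation; function and variable names are ours:

```python
import random

def team_draft_interleave(r1, r2, rng=None):
    """Team-draft interleaving of two rankings.

    Teams A and B alternate draft picks, with a coin flip breaking
    ties in team size; each pick is that team's highest-ranked
    result not yet shown. Returns the combined ranking plus the
    team credited with each position.
    """
    rng = rng or random.Random(0)
    interleaved, teams, used = [], [], set()
    count_a = count_b = 0
    while True:
        rem_a = [d for d in r1 if d not in used]  # A's unused results
        rem_b = [d for d in r2 if d not in used]  # B's unused results
        if not rem_a and not rem_b:
            break
        a_picks = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        if (a_picks and rem_a) or not rem_b:
            interleaved.append(rem_a[0])
            teams.append("A")
            used.add(rem_a[0])
            count_a += 1
        else:
            interleaved.append(rem_b[0])
            teams.append("B")
            used.add(rem_b[0])
            count_b += 1
    return interleaved, teams
```

Clicks on results credited to team A count for A, and likewise for B, which yields exactly the relative A-vs-B feedback the slides describe.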

  7. Determining Statistical Significance • Each q, interleave A(q) and B(q), log clicks • t-Test • For each q, score: % clicks on A(q) • E.g., 3/4 = 0.75 • Sample mean score (e.g., 0.6)

  8. Determining Statistical Significance • Each q, interleave A(q) and B(q), log clicks • t-Test • For each q, score: % clicks on A(q) • E.g., 3/4 = 0.75 • Sample mean score (e.g., 0.6) • Compute confidence (p-value) • E.g., want p = 0.05 (i.e., 95% confidence) • More data, more confident
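The t-test on these slides reduces to a one-sample test of the per-query scores (fraction of clicks on A) against the no-preference mean of 0.5. A minimal sketch; the helper name is ours:

```python
import math
import statistics

def ttest_statistic(scores, null_mean=0.5):
    """One-sample t statistic for per-query scores (fraction of
    clicks on A's results) against the no-preference null of 0.5.

    Compare |t| with the critical value for the desired p
    (roughly 1.96 for p = 0.05, two-sided, at large n).
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (ddof = 1)
    return (mean - null_mean) / (sd / math.sqrt(n))
```

More queries shrink the standard error sd/√n, which is the "more data, more confident" point above.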

  9. Determining Statistical Significance • Each q, interleave A(q) and B(q), log clicks • Other Statistical Tests: • z-Test • (equal to t-Test for large samples) • Rank Test • Binomial Test • Etc… • All similar
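Of the alternative tests listed, the binomial (sign) test has a compact exact form: count per-query winners, ignore ties, and ask how surprising the split is under a fair coin. A sketch under those assumptions:

```python
from math import comb

def sign_test_pvalue(wins_a, wins_b):
    """Two-sided exact binomial (sign) test on per-query winners.

    Under the null, A wins each non-tied query with probability
    0.5; the p-value doubles the tail P(X >= max wins) for
    X ~ Binomial(n, 0.5).
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```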

  10. Limitation • Example: query session with 2 clicks • One click at rank 1 (from A) • Later click at rank 4 (from B) • Normally would count this query session as a tie

  11. Limitation • Example: query session with 2 clicks • One click at rank 1 (from A) • Later click at rank 4 (from B) • Normally would count this query session as a tie • But second click is probably more informative… • …so B should get more credit for this query

  12. Linear Model • Feature vector φ(q,c): • Weight of click is wᵀφ(q,c)

  13. Example • wᵀφ(q,c) differentiates last clicks and other clicks

  14. Example • wᵀφ(q,c) differentiates last clicks and other clicks • Interleave A vs B • 3 clicks per session • Last click 60% on result from A • Other 2 clicks random

  15. Example • wᵀφ(q,c) differentiates last clicks and other clicks • Interleave A vs B • 3 clicks per session • Last click 60% on result from A • Other 2 clicks random • Conventional w = (1,1) has significant variance • Counting only the last click, w = (1,0), minimizes variance
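The example above is easy to verify by simulation. Assuming one feature fires for the last click and one for the other clicks, w = (1,1) counts every click equally while w = (1,0) counts only the last; the data generator below is our illustration of the slide's numbers:

```python
import random
import statistics

def simulate(n_sessions=20000, seed=0):
    """Simulate the slide's example: 3 clicks per session, the last
    click on A's result with probability 0.6, the other two 50/50.

    A click on A contributes +1 to the session score, on B -1.
    Compares w = (1,1) (count every click) with w = (1,0)
    (count only the last click)."""
    rng = random.Random(seed)
    all_clicks, last_only = [], []
    for _ in range(n_sessions):
        last = 1.0 if rng.random() < 0.6 else -1.0
        others = sum(1.0 if rng.random() < 0.5 else -1.0 for _ in range(2))
        all_clicks.append(last + others)  # w = (1, 1)
        last_only.append(last)            # w = (1, 0)
    return {
        "mean_all": statistics.fmean(all_clicks),
        "var_all": statistics.pvariance(all_clicks),
        "mean_last": statistics.fmean(last_only),
        "var_last": statistics.pvariance(last_only),
    }
```

Both weightings have the same expected score (0.2 in favor of A), but the last-click-only weighting has much lower variance, so it reaches significance with fewer queries.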

  16. Scoring Query Sessions • Feature representation for query session:

  17. Scoring Query Sessions • Feature representation for query session: • Weighted score for query: • Positive score favors A, negative favors B
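One plausible reading of the session representation above (the transcript does not show the exact feature set): sum each click's φ(q,c), signed by which team's result was clicked, so that a positive weighted score favors A. A sketch:

```python
def session_features(clicks):
    """Build the session vector psi_q from per-click features.

    `clicks` is a list of (team, phi) pairs: team is 'A' or 'B',
    phi is that click's feature vector phi(q, c). Clicks on A's
    results add phi, clicks on B's results subtract it.
    """
    dim = len(clicks[0][1])
    psi = [0.0] * dim
    for team, phi in clicks:
        sign = 1.0 if team == "A" else -1.0
        for k in range(dim):
            psi[k] += sign * phi[k]
    return psi

def session_score(w, psi):
    """Weighted session score w . psi_q; positive favors A."""
    return sum(wk * pk for wk, pk in zip(w, psi))
```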

  18. Upgraded Test Statistic • t-Test: • Compute mean score wᵀψq over all queries • E.g., 0.2 • Null hypothesis: mean = 0 • Can reach statistical significance sooner • How to learn w?

  19. Supervised Learning • Will optimize for z-Test: Inverse z-Test • Approximately equal to the t-Test for large samples • z-Score = mean / standard deviation (Assumes A > B)
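The inverse z-test objective, maximize mean over standard deviation of the session scores, can be sketched with a numerical-gradient hill climb. This is only an illustration: the paper's optimizer is more careful, and the data generator and step sizes here are our assumptions:

```python
import random
import statistics

def z_score(w, sessions):
    """z statistic of the session scores w . psi_q: mean / std."""
    scores = [sum(wk * pk for wk, pk in zip(w, psi)) for psi in sessions]
    return statistics.fmean(scores) / statistics.stdev(scores)

def train_inverse_z(sessions, dim, steps=150, lr=0.1, eps=1e-4):
    """Maximize the z-score over w by central-difference gradient
    ascent, starting from the conventional all-ones weights."""
    w = [1.0] * dim
    for _ in range(steps):
        grad = []
        for k in range(dim):
            hi, lo = list(w), list(w)
            hi[k] += eps
            lo[k] -= eps
            grad.append((z_score(hi, sessions) - z_score(lo, sessions)) / (2 * eps))
        w = [wk + lr * g for wk, g in zip(w, grad)]
    return w

def make_sessions(n=2000, seed=1):
    """Synthetic sessions from the earlier example:
    psi_q = (last-click credit, other-clicks credit)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        last = 1.0 if rng.random() < 0.6 else -1.0
        others = sum(1.0 if rng.random() < 0.5 else -1.0 for _ in range(2))
        out.append([last, others])
    return out
```

On the synthetic data, training downweights the noisy non-last clicks and raises the z-score over the conventional equal weighting, which is exactly what lets the learned statistic reach significance sooner.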

  20. Inverting Other Statistical Tests • Most statistical tests use a test statistic • E.g., z-score • Rank test: % concordant pairs in confidence ranking (ROC Area) • We optimized using logistic regression

  21. Recap • Collect training data • Pairs of retrieval functions A & B, (A > B) • Interleave them, collect usage logs • Build features • For each query session q: ψq • Query session score: wᵀψq (positive favors A) • Train w to optimize test statistic • z-Test: maximize mean(w) / std(w)

  22. Experiment Setup • Data collection • Pool of retrieval functions • Hash users into partitions • Run interleaving of different pairs in parallel • Collected on arXiv.org • 2 pools of retrieval functions • Training Pool: (6 pairs) know A > B • New Pool: (12 pairs)

  23. Training Pool – All Sessions Grouped Together

  24. Training Pool – Cross Validation

  25. Experimental Results • Inverse z-Test works well • Beats the baseline on most of the new interleaving pairs • Directions of the tests are all in agreement • In 6/12 pairs, for p=0.1, reduces required sample size by 10% • In 4/12 pairs, achieves p=0.05 where the baseline does not • 400 to 650 queries per interleaving experiment • Weights are hard to interpret (features correlated) • Largest weight: “1 if single click & rank > 1”

  26. Conclusion • Principled, offers practical benefits • Should perform better with more training data • Can be applied to other application domains • Limitations: • Treats training data as one sample • Might not work well when test data is very different from training data • Susceptible to adversarial behavior

  27. Extra Slides

  28. Training Logistic Regression • Need to mirror training data • E.g., • ψq with label 1 • -ψq with label 0 • Otherwise, will learn a trivial model that always predicts 1
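The mirroring step above can be sketched directly; the function name is ours:

```python
def mirror_training_data(psis):
    """Mirror each session vector for logistic regression training.

    Every psi_q (whose label is 1, since training pairs are
    oriented so that A > B) is paired with -psi_q labeled 0.
    Without mirroring, all labels are 1, so a model with an
    intercept can ignore the features and trivially predict 1.
    """
    data = []
    for psi in psis:
        data.append((list(psi), 1))
        data.append(([-x for x in psi], 0))
    return data
```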
