Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation

SIGIR 2010

Yisong Yue

Cornell University

Joint work with:

Yue Gao, Olivier Chapelle, Ya Zhang, Thorsten Joachims

Retrieval Evaluation Using Click Data
  • Eliciting relative feedback
    • E.g., is A better than B?
  • Evaluation pipeline
    • Online experiment design (example to follow)
    • Collect clicks
    • Use standard statistical tests (e.g., t-test)
  • Contribution: Supervised learning algorithm for training a more efficient test statistic

Team-Game Interleaving (Online Experiment for Search Applications)

(u=thorsten, q=“svm”)

A(u,q) → r1
1. Kernel Machines http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/
3. Support Vector Machine and Kernel ... References http://svm.research.bell-labs.com/SVMrefs.html
4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html
5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk

B(u,q) → r2
1. Kernel Machines http://svm.first.gmd.de/
2. Support Vector Machine http://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machines http://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES ... http://www.jiscmail.ac.uk/lists/SUPPORT...
5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/

Interleaving(r1, r2)
1. Kernel Machines (T2) http://svm.first.gmd.de/
2. Support Vector Machine (T1) http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine (T2) http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines (T1) http://www.support-vector.net/
5. Support Vector Machine and Kernel ... References (T2) http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES ... (T1) http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet (T2) http://svm.research.bell-labs.com/SVT/SVMsvt.html

  • Mix results of A and B
  • Relative feedback
  • More reliable

Interpretation: (r1 > r2) ↔ clicks(r1) > clicks(r2)

[Radlinski, Kurup, Joachims, CIKM 2008]
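
The slides do not spell out the interleaving procedure itself. As a rough illustration of how a team-draft-style interleaving policy can be implemented, here is a minimal Python sketch; the function name, the coin-flip tie-breaking, and the handling of exhausted rankings are my own choices and should not be read as the exact algorithm of the cited paper.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10, seed=None):
    """Minimal team-draft interleaving sketch: rankings are lists of doc ids,
    best first.  Returns the interleaved list and a dict mapping each shown
    doc to the team ('A' or 'B') that contributed it."""
    rng = random.Random(seed)
    interleaved, team = [], {}
    remaining = {"A": list(ranking_a), "B": list(ranking_b)}
    counts = {"A": 0, "B": 0}
    while len(interleaved) < k and (remaining["A"] or remaining["B"]):
        # The team that has contributed fewer results picks next; ties are broken randomly.
        if counts["A"] != counts["B"]:
            side = "A" if counts["A"] < counts["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        if not remaining[side]:            # fall back if that team is exhausted
            side = "B" if side == "A" else "A"
        # Take the team's highest-ranked result that has not been shown yet.
        while remaining[side] and remaining[side][0] in team:
            remaining[side].pop(0)
        if remaining[side]:
            doc = remaining[side].pop(0)
            interleaved.append(doc)
            team[doc] = side
            counts[side] += 1
    return interleaved, team

# Toy usage with made-up document ids.
r1 = ["kernel-machines", "svm-light", "svm-refs", "svm-applet", "rhul-svm"]
r2 = ["kernel-machines", "jbolivar-svm", "svm-intro", "svm-archives", "svm-light"]
print(team_draft_interleave(r1, r2, k=7, seed=0))
```

Clicks on the interleaved list are then credited to whichever team contributed the clicked result, which is what the per-query scores below are computed from.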

Determining Statistical Significance
  • Each q, interleave A(q) and B(q), log clicks
  • t-Test
    • For each q, score: % clicks on A(q)
      • E.g., 3/4 = 0.75
    • Sample mean score (e.g., 0.6)
    • Compute confidence (p value)
      • E.g., want p = 0.05 (i.e., 95% confidence)
    • More data, more confident
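
To make the t-test step concrete, the sketch below runs a one-sample t-test on hypothetical per-query scores (the fraction of a session's clicks that landed on A's results); under the null hypothesis that A and B are tied, the expected score is 0.5. The numbers are made up for illustration.

```python
from scipy import stats

# Hypothetical per-query scores: fraction of clicks in each session that went to A.
# Under the null hypothesis that A and B are tied, the expected score is 0.5.
scores = [0.75, 0.50, 1.00, 0.60, 0.40, 0.80, 0.50, 0.70, 0.65, 0.55]

t_stat, p_value = stats.ttest_1samp(scores, popmean=0.5)
print(f"mean score = {sum(scores) / len(scores):.3f}")
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
# A small p-value (e.g., below 0.05) rejects the null hypothesis that A and B are tied;
# collecting more queries shrinks the standard error and increases confidence.
```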
Determining Statistical Significance
  • Each q, interleave A(q) and B(q), log clicks
  • Other Statistical Tests:
    • z-Test
      • (equivalent to the t-Test for large samples)
    • Rank Test
    • Binomial Test
    • Etc…
    • All similar
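
As one example of the alternatives listed above, a binomial (sign) test simply counts, over all interleaved sessions and ignoring ties, how often A received more clicks than B. A minimal sketch with illustrative counts:

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """One-sided exact binomial (sign) test: probability of seeing at least
    wins_a A-wins out of n = wins_a + wins_b sessions if wins were a fair coin."""
    n = wins_a + wins_b
    return sum(comb(n, k) for k in range(wins_a, n + 1)) / 2 ** n

# Illustrative counts: A won 14 sessions, B won 6, ties dropped.
print(f"p = {sign_test_p_value(14, 6):.4f}")   # about 0.058
```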
Limitation
  • Example: query session with 2 clicks
    • One click at rank 1 (from A)
    • Later click at rank 4 (from B)
    • Normally would count this query session as a tie
    • But second click is probably more informative…
    • …so B should get more credit for this query
Linear Model
  • Feature vector φ(q,c) describes click c in query session q
  • Weight of click is wᵀφ(q,c)
Example
  • wᵀφ(q,c) differentiates last clicks from other clicks
  • Interleave A vs B
    • 3 clicks per session
    • Last click lands on a result from A 60% of the time
    • Other 2 clicks random (50/50)
  • Conventional w = (1,1) counts every click equally and has high variance
  • Counting only the last click, w = (1,0), minimizes variance (illustrated by the simulation sketch below)
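
A quick way to see why w = (1,0) wins in this toy scenario is to simulate it. In the sketch below each counted click contributes +1 if it landed on A's result and -1 if it landed on B's; this per-session scoring is my own encoding of the slide's setup, and the per-query z-score (mean divided by standard deviation) is the quantity the test statistic cares about.

```python
import random
import statistics

def simulate_session(rng):
    """One session from the example: 3 clicks, the last click lands on A with
    probability 0.6, the other two clicks are 50/50 noise.  Each click is scored
    +1 for A and -1 for B; returns (last_click_score, other_clicks_score)."""
    last = 1 if rng.random() < 0.6 else -1
    others = sum(1 if rng.random() < 0.5 else -1 for _ in range(2))
    return last, others

def per_query_z(scores):
    """Mean over standard deviation of the per-session scores (null mean is 0)."""
    return statistics.mean(scores) / statistics.stdev(scores)

rng = random.Random(0)
sessions = [simulate_session(rng) for _ in range(100_000)]

all_clicks = [last + others for last, others in sessions]   # w = (1, 1)
last_only  = [last for last, _ in sessions]                 # w = (1, 0)

print(f"w=(1,1): mean {statistics.mean(all_clicks):+.3f}, per-query z {per_query_z(all_clicks):.3f}")
print(f"w=(1,0): mean {statistics.mean(last_only):+.3f}, per-query z {per_query_z(last_only):.3f}")
# Both weightings have the same expected score (about +0.2, favouring A), but the
# last-click-only statistic has much lower variance, so it reaches a given
# significance level with far fewer queries.
```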
Scoring Query Sessions
  • Feature representation for query session: ψq
  • Weighted score for query: wᵀψq
  • Positive score favors A, negative favors B
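
The feature list and the aggregation formula appear only as figures in the original slides, so the sketch below is a reconstruction: it uses the two-feature example from the talk (last click vs. other click) and assumes ψq is formed by adding φ(q,c) for clicks on A's results and subtracting it for clicks on B's, which gives the stated behaviour that a positive score favours A. The dictionary keys and function names are hypothetical.

```python
import numpy as np

def click_features(query, click):
    """Hypothetical per-click feature map phi(q, c).  The real feature list is in a
    slide figure not reproduced here; these two features follow the talk's example:
    (1 if this is the last click of the session, 1 if it is any other click)."""
    is_last = 1.0 if click["is_last"] else 0.0
    return np.array([is_last, 1.0 - is_last])

def session_features(query, clicks):
    """Session representation psi_q: add phi(q, c) for clicks on A's results and
    subtract it for clicks on B's, so that a positive weighted score favours A
    (my reconstruction of the aggregation the slides leave implicit)."""
    psi = np.zeros(2)
    for c in clicks:
        sign = 1.0 if c["team"] == "A" else -1.0
        psi += sign * click_features(query, c)
    return psi

def session_score(w, query, clicks):
    """Weighted query-session score w^T psi_q."""
    return float(w @ session_features(query, clicks))

# Toy session from the "Limitation" slide: an early click on A, a later (last) click on B.
clicks = [{"team": "A", "is_last": False}, {"team": "B", "is_last": True}]
print(session_score(np.array([1.0, 1.0]), "svm", clicks))   #  0.0 -> counted as a tie
print(session_score(np.array([1.0, 0.0]), "svm", clicks))   # -1.0 -> credit goes to B
```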
Upgraded Test Statistic
  • t-Test:
    • Compute mean score wᵀψq over all queries
      • E.g., 0.2
    • Null hypothesis: mean = 0
    • Can reach statistical significance sooner
  • How to learn w?
Supervised Learning
  • We optimize for the z-Test: the "Inverse z-Test"
    • Approximately equivalent to the t-Test for large samples
    • z-score = mean / standard deviation

(Assumes A > B for the training pairs)
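
Maximizing mean(wᵀψ)/std(wᵀψ) over w is a Rayleigh-quotient-style problem: up to scale, the maximizer is w = Σ⁻¹μ, where μ and Σ are the sample mean and covariance of the session features ψq. The sketch below computes this on synthetic data; the ridge term and the synthetic feature distribution are my own assumptions, not details taken from the paper.

```python
import numpy as np

def inverse_z_test_weights(psi, reg=1e-3):
    """Choose w to maximize mean(w^T psi) / std(w^T psi) over training sessions.

    psi: (n_sessions, n_features) array of session features, oriented so that the
    known-better ranker is A (the slides' assumption A > B).  Up to scale the
    maximizer is w = Sigma^{-1} mu; the ridge term `reg` is an assumption added
    for numerical stability.
    """
    mu = psi.mean(axis=0)
    sigma = np.cov(psi, rowvar=False) + reg * np.eye(psi.shape[1])
    w = np.linalg.solve(sigma, mu)
    return w / np.linalg.norm(w)

def z_score(psi, w):
    """z-score of the weighted session scores against a null mean of 0."""
    s = psi @ w
    return s.mean() / (s.std(ddof=1) / np.sqrt(len(s)))

# Hypothetical training data: feature 0 carries the signal, feature 1 is pure noise.
rng = np.random.default_rng(0)
n = 5000
psi = np.column_stack([0.2 + rng.standard_normal(n), 3.0 * rng.standard_normal(n)])

w = inverse_z_test_weights(psi)
print("learned w:", np.round(w, 3))                     # puts most weight on feature 0
print("z with learned w :", round(z_score(psi, w), 1))
print("z with uniform w :", round(z_score(psi, np.array([1.0, 1.0]) / np.sqrt(2)), 1))
```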

Inverting Other Statistical Tests
  • Most statistical tests use a test statistic
    • E.g., z-score
    • Rank test: % concordant pairs in confidence ranking (ROC Area)
    • We optimized using logistic regression
Recap
  • Collect training data
    • Pairs of retrieval functions A & B, (A > B)
    • Interleave them, collect usage logs
  • Build features
    • For each query session q: ψq
    • Query session score: wᵀψq (positive favors A)
  • Train w to optimize test statistic
    • z-Test: maximize (mean(w) / std(w))
Experiment Setup
  • Data collection
    • Pool of retrieval functions
    • Hash users into partitions
    • Run interleaving of different pairs in parallel
  • Collected on arXiv.org
    • 2 pools of retrieval functions
    • Training Pool (6 pairs): A > B known in advance
    • New Pool (12 pairs)
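
The slides only name the "hash users into partitions" step; the sketch below shows one generic way such an assignment can be made deterministic, so that the same user always sees the same interleaving pair. The hash function and the pairing scheme are illustrative assumptions, not details of the arXiv.org setup.

```python
import hashlib

def assign_partition(user_id, n_partitions):
    """Deterministically hash a user into one of n experiment partitions."""
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions

# Hypothetical pool of interleaving experiments, one (A, B) pair per partition.
pairs = [("rankerA", "rankerB"), ("rankerA", "rankerC"), ("rankerB", "rankerC")]
for user in ["alice", "bob", "carol"]:
    a, b = pairs[assign_partition(user, len(pairs))]
    print(f"{user}: interleave {a} vs {b}")
```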
Experimental Results
  • Inverse z-Test works well
    • Beats the baseline on most of the new interleaving pairs
    • Directions of the tests are all in agreement
    • In 6/12 pairs, for p=0.1, reduces the required sample size by 10%
    • In 4/12 pairs, achieves p=0.05 where the baseline does not
      • 400 to 650 queries per interleaving experiment
  • Weights hard to interpret (features correlated)
  • Largest weight: “1 if single click & rank > 1”
Conclusion
  • Principled, offers practical benefits
  • Should perform better with more training data
  • Can be applied to other application domains
  • Limitations:
    • Treats training data as one sample
    • Might not work well when test data is very different from training data
    • Susceptible to adversarial behavior
Training Logistic Regression
  • Need to mirror training data
  • E.g.,
    • ψq with label 1
    • -ψq with label 0
  • Otherwise, will learn a trivial model that always predicts 1
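
A minimal sketch of this mirroring trick, using scikit-learn's LogisticRegression as a stand-in for whatever solver was actually used; the synthetic session features and the choice to drop the intercept are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical session features psi_q from interleaving pairs where A is known to be
# better, so a positive w^T psi_q should be the "correct" outcome.
rng = np.random.default_rng(0)
psi = np.column_stack([0.3 + rng.standard_normal(500), rng.standard_normal(500)])

# Mirror the data: every psi_q with label 1 is paired with -psi_q labelled 0.
X = np.vstack([psi, -psi])
y = np.concatenate([np.ones(len(psi)), np.zeros(len(psi))])

# Without mirroring, every example carries label 1 and there is nothing to
# discriminate; with mirrored data and no intercept, the learner must find a
# direction w that scores the true sessions above their negations.
model = LogisticRegression(fit_intercept=False)
model.fit(X, y)
print("learned w:", np.round(model.coef_.ravel(), 3))
```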