Optimizing Search Engines using Clickthrough Data
by Thorsten Joachims
Presentation by M. Şükrü Kuran

Outline
Search Engines
Clickthrough Data
Learning of Retrieval Functions
Support Vector Machine (SVM) for Learning of Ranking Functions
Experiment Setup
Offline Experiment
Learning of Ranking Functions
Clickthrough Data to find
What is Clickthrough Data?
Why is Clickthrough Data Important?
(Independent of the actual relevance)
Thus, clickthrough data is not an absolute relevance
value for the query but a good relative relevance value
Results for a search for SVM:
1. Kernel Machines
2. Support Vector Machine
3. SVM-Light Support Vector Machine
4. Intr. To Support Vector Machines
5. Support Vector Machine and Kernel Methods Ref.
6. Archives of Support Vector Machines
7. SVM demo Applet
8. Royal Holloway Support Vector Machine
9. Support Vector Machine
10. Lagrangian Support Vector Machine Home Page
Among the 10 results, only links 1, 3 and 7 were clicked (clickthrough data)
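$\text{link}_3 <_{r^*} \text{link}_2$
$\text{link}_7 <_{r^*} \text{link}_2$
$\text{link}_7 <_{r^*} \text{link}_4$
$\text{link}_7 <_{r^*} \text{link}_5$
$\text{link}_7 <_{r^*} \text{link}_6$
$r^*$: ranking preferred by the user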
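We can generalize this preference. Let $C$ be the set of ranks of the clicked links. Then
$\text{link}_i <_{r^*} \text{link}_j$
for all pairs $1 \le j < i$, with $i \in C$ and $j \notin C$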
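The rule for turning clicks into preference pairs can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `extract_preferences` and the representation of clicks as a list of 1-based ranks are assumptions made for the example.

```python
def extract_preferences(clicked_ranks):
    """Return (i, j) pairs meaning "link i is preferred over link j".

    A clicked link i is preferred over every link j that was ranked
    above it (1 <= j < i) and was not clicked.
    Hypothetical helper: ranks are 1-based positions in the result list.
    """
    clicked = set(clicked_ranks)
    pairs = []
    for i in sorted(clicked):
        for j in range(1, i):
            if j not in clicked:
                pairs.append((i, j))
    return pairs


# The SVM query example: the user clicked links 1, 3 and 7.
preferences = extract_preferences([1, 3, 7])
for i, j in preferences:
    print(f"link{i} is preferred over link{j}")
```

For the clicks on links 1, 3 and 7 this yields the same five pairs as listed for the SVM query. Link 1 contributes no pair because nothing is ranked above it.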
We have to find a retrieval function whose rankings are close to $r^*$,
and to measure that closeness we have to use a performance metric
Good Performance Metric
D : Set of documents in a query result
P : # of concordant pairs in D x D
Q : # of discordant pairs in D x D
m : # of documents/links in D
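The quantities P and Q are the ingredients of Kendall's $\tau$, a standard rank correlation measure defined as $\tau = (P - Q)/(P + Q)$, which for strict orderings equals $1 - 2Q/\binom{m}{2}$. The sketch below assumes this is the metric the definitions refer to. The function name `kendall_tau` and the dictionary representation (document to rank position) are illustrative choices, and both rankings are assumed to be strict orderings of the same set of at least two documents.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Count concordant (P) and discordant (Q) pairs and return (P, Q, tau).

    rank_a, rank_b: dicts mapping each document to its position
    (1 = top). Assumes both rank the same documents without ties.
    """
    concordant = 0
    discordant = 0
    for d1, d2 in combinations(list(rank_a), 2):
        # A pair is concordant if both rankings order it the same way.
        if (rank_a[d1] < rank_a[d2]) == (rank_b[d1] < rank_b[d2]):
            concordant += 1
        else:
            discordant += 1
    tau = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, tau


# Two rankings of m = 5 documents that disagree on d1, d2, d3.
r_a = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5}
r_b = {"d3": 1, "d2": 2, "d1": 3, "d4": 4, "d5": 5}
p, q, tau = kendall_tau(r_a, r_b)
print(p, q, tau)
```

In this example the rankings disagree on the three pairs among d1, d2 and d3, so P = 7, Q = 3 and $\tau$ = 0.4. Identical rankings give $\tau$ = 1 and exactly reversed rankings give $\tau$ = -1.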
where $\Pr(q, r^*)$ is the distribution of queries (and their target rankings)
(A document is either related to the query or not)
Selection will be based on minimizing
n : # of queries in the training set
More clicks: in the comparison against Google, users clicked more links from the learned engine
than from Google for 29 of the 88 queries.
Fewer clicks: in the comparison against Google, users clicked fewer links from the learned engine
than from Google for 13 of the 88 queries.
Experiment Results?
Clickthrough Data?
Machine Learning for Retrieval Functions?