Optimizing Search Engines using Clickthrough Data

Thorsten Joachims

Presentation by M. Şükrü Kuran

Outline
  • Search Engines
  • Clickthrough Data
  • Learning of Retrieval Functions
  • Support Vector Machine (SVM) for Learning of Ranking Functions

  • Experiment Setup
  • Offline Experiment
  • Online Experiment
  • Analysis of The Online Experiment
  • Conclusion and Future Work
  • References
  • Questions
Search Engines
  • Search engines utilize ranking systems to list results based on their relevance to the query
  • Current ranking systems are not optimized for relevance
  • As an alternative, we can use clickthrough data to obtain results that are better optimized for relevance

Clickthrough Data

What is Clickthrough Data?

  • Clickthrough data is the set of links that the user selects from the list of links retrieved by the search engine for a user-given query.

Why is Clickthrough Data Important?

  • Clicked links are the ones the user judged relevant among the results they examined
  • It is easier to acquire than explicit user feedback (the data is already in the search engines' logs)
Clickthrough Data (2)
  • Users are less likely to click on a link that has a low ranking (independent of its actual relevance)
  • Users typically scan only the first 10 links in the result set [24]

Thus, clickthrough data is not an absolute relevance judgment for the query, but it is a good relative relevance judgment

Clickthrough Data (3)


Results for a search for "SVM":

1. Kernel Machines
2. Support Vector Machine
3. SVM-Light Support Vector Machine
4. Intr. to Support Vector Machines
5. Support Vector Machine and Kernel Methods Ref.
6. Archives of Support Vector Machines
7. SVM demo Applet
8. Royal Holloway Support Vector Machine
9. Support Vector Machine The Software
10. Lagrangian Support Vector Machine Home Page

Among the 10 results, only links 1, 3 and 7 are clicked (the clickthrough data)

Clickthrough Data (4)

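$\text{link}_3 <_{r^*} \text{link}_2$

$\text{link}_7 <_{r^*} \text{link}_2$

$\text{link}_7 <_{r^*} \text{link}_4$

$\text{link}_7 <_{r^*} \text{link}_5$

$\text{link}_7 <_{r^*} \text{link}_6$

$r^*$ : ranking preferred by the user (a binary relation); $\text{link}_3 <_{r^*} \text{link}_2$ means the user prefers link 3 to link 2

We can generalize this preference. With $C$ the set of ranks of the clicked links,

$\text{link}_i <_{r^*} \text{link}_j$

for all pairs $1 \le j < i$, with $i \in C$ and $j \notin C$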
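The generalized rule can be sketched in a few lines of Python. This is only an illustration: the function name `extract_preferences` and its arguments are hypothetical, and ranks are assumed to be 1-based positions in the presented result list.

```python
def extract_preferences(num_links, clicked_ranks):
    """Return (i, j) pairs meaning: the user prefers link i to link j.

    A pair is emitted for every clicked rank i and every higher-placed
    rank j < i that was NOT clicked.
    """
    clicked = set(clicked_ranks)
    if any(rank < 1 or rank > num_links for rank in clicked):
        raise ValueError("clicked ranks must lie between 1 and num_links")
    preferences = []
    for i in sorted(clicked):
        for j in range(1, i):
            if j not in clicked:
                preferences.append((i, j))
    return preferences


# The SVM query from the slides: 10 results, links 1, 3 and 7 clicked.
for i, j in extract_preferences(10, [1, 3, 7]):
    print(f"link{i} is preferred to link{j}")
```

For the example query this yields exactly the five preferences listed on the slide. Note that link 1 produces no pair, since nothing is ranked above it.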

Learning of Retrieval Functions
  • Goal:

We have to find a retrieval function $f$ whose ranking $r_{f(q)}$ for a query $q$ is close to the user's preferred ranking $r^*$

  • In order to calculate the similarity between any given $r_{f(q)}$ and $r^*$, we have to use a performance metric

    • Average Precision (binary relevance)
    • Kendall’s $\tau$ (a good performance metric for comparing two rankings)

Learning of Retrieval Functions (2)
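  • Kendall’s $\tau$
    • For any two rankings $r_a$ and $r_b$ of the same documents, the similarity is

$$\tau(r_a, r_b) = \frac{P - Q}{P + Q} = 1 - \frac{2Q}{\binom{m}{2}}$$

D : Set of documents in a query result

P : # of concordant pairs in $D \times D$

Q : # of discordant pairs in $D \times D$

m : # of documents/links in D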
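As a worked illustration, the following Python sketch counts concordant and discordant pairs directly. It assumes both rankings are strict orderings of the same documents, given as dictionaries from a document id to its rank position; the function name and the five-document example are illustrative only.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two strict rankings of the same documents.

    rank_a, rank_b: dicts mapping document id -> rank position (1 = top).
    """
    concordant = 0  # P: pairs ordered the same way in both rankings
    discordant = 0  # Q: pairs ordered differently
    for d1, d2 in combinations(list(rank_a), 2):
        agreement = (rank_a[d1] - rank_a[d2]) * (rank_b[d1] - rank_b[d2])
        if agreement > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# r_a orders d1, d2, d3, d4, d5; r_b swaps the order of the first three.
r_a = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5}
r_b = {"d3": 1, "d2": 2, "d1": 3, "d4": 4, "d5": 5}
print(kendall_tau(r_a, r_b))
```

Here m = 5, so there are 10 pairs. Three of them (d1/d2, d1/d3, d2/d3) are discordant, giving P = 7, Q = 3 and a value of (7 - 3) / 10 = 0.4, which agrees with 1 - 2Q / 10.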

Learning of Retrieval Functions (3)
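  • Problem Definition of Learning an Appropriate Retrieval Function
    • For a fixed (but unknown) distribution $\Pr(q, r^*)$ of queries and target (user-preferred) rankings, the goal is to learn a retrieval function $f$ that maximizes the expected Kendall’s $\tau$:

$$\tau_P(f) = \int \tau(r_{f(q)}, r^*) \, d\Pr(q, r^*)$$

where $\Pr(q, r^*)$ is the distribution of queries and target rankings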

Support Vector Machine (SVM) for Learning of Ranking Functions
  • Usually, machine learning in information retrieval is based on binary classification.

(A document is either relevant to the query or not)

  • Since the information gathered from clickthrough data is not absolute relevance information, we cannot use binary classification
Support Vector Machine (SVM) for Learning of Ranking Functions (2)
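  • Using a set of queries and their user-preferred rankings (training data), we will select a ranking function $f$ from a family $F$ of ranking functions

Selection will be based on maximizing the empirical $\tau$ on the training set:

$$\tau_S(f) = \frac{1}{n} \sum_{i=1}^{n} \tau(r_{f(q_i)}, r_i^*)$$

n : # of queries in the training set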
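A minimal sketch of this selection step, assuming rankings are given as ordered lists of document ids and that the family F is just a small list of candidate functions. The candidates `f_good` and `f_bad` and the two-query training set are made up purely for illustration.

```python
from itertools import combinations
from math import comb


def kendall_tau(order_a, order_b):
    """Kendall's tau via 1 - 2Q / C(m, 2) for two orderings of the same items."""
    pos_a = {doc: k for k, doc in enumerate(order_a)}
    pos_b = {doc: k for k, doc in enumerate(order_b)}
    discordant = sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return 1 - 2 * discordant / comb(len(order_a), 2)


def empirical_tau(f, training_set):
    """Average tau between f's ranking and the target ranking over n queries."""
    total = sum(kendall_tau(f(query), target) for query, target in training_set)
    return total / len(training_set)


# Hypothetical training data: (query, user-preferred ranking).
training_set = [("svm", ["a", "b", "c"]), ("kernel", ["x", "y", "z"])]


def f_good(query):
    return {"svm": ["a", "b", "c"], "kernel": ["x", "z", "y"]}[query]


def f_bad(query):
    return {"svm": ["c", "b", "a"], "kernel": ["z", "y", "x"]}[query]


best = max([f_good, f_bad], key=lambda f: empirical_tau(f, training_set))
print(best.__name__, empirical_tau(best, training_set))
```

Enumerating candidates like this only works for a tiny family; for a parameterized family of functions, an optimization procedure is needed instead.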

Support Vector Machine (SVM) for Learning of Ranking Functions (3)
  • Then, we need to find a sound family of ranking functions.
  • How do we find an F that includes an efficient ranking function (f)?
Support Vector Machine (SVM) for Learning of Ranking Functions (4)
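  • A set of functions,

$$(d_i, d_j) \in f_{\vec{w}}(q) \iff \vec{w} \cdot \Phi(q, d_i) > \vec{w} \cdot \Phi(q, d_j)$$

  • Where the $\Phi(q, d)$’s are description-based retrieval functions (feature mappings between query $q$ and document $d$) [10,11]
  • The $\vec{w}$’s are weight vectors (2D) adjusted by learning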
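To make the role of the weight vector concrete, here is a small sketch that ranks documents by the score w · Φ(q, d). The two-dimensional feature values are invented for illustration, and `rank_documents` is a hypothetical helper name.

```python
def rank_documents(weights, features_by_doc):
    """Order documents by descending score w . Phi(q, d).

    features_by_doc: dict mapping document id -> feature vector Phi(q, d)
    for one fixed query q.
    """
    def score(doc):
        return sum(w * x for w, x in zip(weights, features_by_doc[doc]))

    return sorted(features_by_doc, key=score, reverse=True)


# Made-up 2D feature vectors for three documents and one query.
features = {"d1": [0.2, 0.9], "d2": [0.8, 0.1], "d3": [0.5, 0.5]}

print(rank_documents([1.0, 0.0], features))  # weight only on feature 1
print(rank_documents([0.0, 1.0], features))  # weight only on feature 2
```

The two weight vectors produce opposite orderings of the same documents, which is why learning amounts to choosing the weight vector whose ordering agrees best with the observed preferences.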
Support Vector Machine (SVM) for Learning of Ranking Functions (5)
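  • Instead of maximizing our goal function directly, we can minimize Q, the number of discordant pairs, in our performance measure
  • By using classification SVMs [7]:

$$\text{minimize: } V(\vec{w}, \vec{\xi}) = \frac{1}{2} \vec{w} \cdot \vec{w} + C \sum \xi_{i,j,k}$$

subject to

$$\forall k \in \{1, \dots, n\},\ \forall (d_i, d_j) \in r_k^*:\ \vec{w} \cdot \Phi(q_k, d_i) \ge \vec{w} \cdot \Phi(q_k, d_j) + 1 - \xi_{i,j,k}$$

$$\forall i\, \forall j\, \forall k:\ \xi_{i,j,k} \ge 0$$

C : trade-off between margin size and training error; $\xi_{i,j,k}$ : slack variables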
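The constraints depend only on the difference Φ(q, d_i) - Φ(q, d_j), so the problem can be treated as a classification SVM on pairwise difference vectors. The sketch below is not the solver used in the original work: it substitutes plain subgradient descent on the equivalent hinge-loss form, ½ w·w + C Σ max(0, 1 - w·(Φ_i - Φ_j)), and the feature values, function names, and hyperparameters are all assumptions chosen for illustration.

```python
def train_ranking_svm(preference_pairs, dim, c=1.0, epochs=200, lr=0.01):
    """Learn w from (phi_preferred, phi_other) feature-vector pairs.

    Stand-in optimizer: batch subgradient descent on
    0.5 * w.w + C * sum(max(0, 1 - w.(phi_preferred - phi_other))).
    """
    # Each constraint only involves the difference of the two feature vectors.
    diffs = [[a - b for a, b in zip(better, worse)]
             for better, worse in preference_pairs]
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)  # gradient of the 0.5 * w.w term
        for x in diffs:
            margin = sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1.0:  # slack is positive, hinge term is active
                grad = [g - c * xi for g, xi in zip(grad, x)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def score(w, phi):
    """Ranking score w . phi for one document."""
    return sum(wi * xi for wi, xi in zip(w, phi))


# Made-up 2D features: (features of preferred doc, features of other doc).
pairs = [
    ([1.0, 0.2], [0.3, 0.6]),
    ([0.9, 0.5], [0.4, 0.4]),
    ([0.8, 0.1], [0.2, 0.3]),
]
w = train_ranking_svm(pairs, dim=2)
print(w, all(score(w, better) > score(w, worse) for better, worse in pairs))
```

On this toy data the learned weights score every preferred document above its counterpart. A production setup would use a dedicated SVM solver rather than this hand-rolled loop.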

Experiment Setup
  • A baseline meta-search engine called Striver is used for testing purposes
  • Striver forwards a query to MSNSearch, Google, Excite, Altavista and Hotbot
  • It acquires the top 100 results from each search engine
  • Based on the learned retrieval function, it selects the top 50 of the up to 500 candidates (there may be fewer than 500 if more than one engine finds the same document)
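The candidate-set construction described above can be sketched as follows. This is a simplified illustration, not Striver's actual code: `merge_and_rank`, the engine names, and the scoring table are hypothetical, and the scoring function stands in for the learned retrieval function.

```python
def merge_and_rank(results_by_engine, score, per_engine=100, top_k=50):
    """Union the top results of each engine, then keep the best top_k.

    results_by_engine: dict mapping engine name -> ranked list of URLs.
    score: callable giving the learned retrieval score of a URL.
    """
    candidates = []
    seen = set()
    for urls in results_by_engine.values():
        for url in urls[:per_engine]:
            if url not in seen:  # a document found by several engines counts once
                seen.add(url)
                candidates.append(url)
    return sorted(candidates, key=score, reverse=True)[:top_k]


# Tiny made-up example with two engines and a lookup table as the scorer.
results = {"engine_a": ["u1", "u2", "u3"], "engine_b": ["u2", "u4"]}
learned_scores = {"u1": 0.1, "u2": 0.9, "u3": 0.4, "u4": 0.7}
print(merge_and_rank(results, learned_scores.get, top_k=3))
```

Because `u2` is returned by both engines, the candidate set has four documents rather than five, mirroring why the real candidate set can be smaller than 500.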
Offline Experiment
  • Using Striver, 112 queries are recorded
  • A large set of features is used to compute the description-based retrieval functions
  • Testing is done with different numbers of training queries
  • Results from Google and MSNSearch are used for benchmarking purposes
Online Experiment
  • Striver is used by a group of 20 people
  • From these users’ queries, a training set of 260 queries is composed for Striver
  • The results are compared with results from Google, MSNSearch and Toprank (a simple meta-search engine)
Online Experiment (2)

“More clicks” means that, in the comparison with Google, users clicked more links from the learned engine than from Google for 29 queries out of 88.

“Fewer clicks” means that, in the comparison with Google, users clicked fewer links from the learned engine than from Google for 13 queries out of 88.

Analysis of the Online Experiment
  • Since all of the users used the engine for academic searches, the learned function is well suited to searches on academic research topics
  • But it may not give equally good results for other groups of users
  • We can say that the learned engine is customizable, unlike traditional engines
Future Work and Conclusions
  • What is the optimal group size for user customization?
  • Features can be tuned for better performance
  • Can clustering algorithms group WWW users into subgroups based on their clickthrough data?
  • Can malicious users corrupt the learning process by clicking irrelevant links, and how can this be avoided?

References

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, Harlow, UK, May 1999.

[2] B. Bartell, G. Cottrell, and R. Belew. Automatic combination of multiple ranked retrieval systems. In Annual ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), 1994.

[3] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2000.

[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, 1992.

[5] J. Boyan, D. Freitag, and T. Joachims. A machine learning architecture for optimizing web search engines. In AAAI Workshop on Internet Based Information Systems, August 1996.

[6] W. Cohen, R. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10, 1999.

[7] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning Journal, 20:273–297, 1995.

[8] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems (NIPS), 2001.

[9] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In International Conference on Machine Learning (ICML), 1998.

[10] N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3):183–204, 1989.

[11] N. Fuhr, S. Hartmann, G. Lustig, M. Schwantner, K. Tzeras, and G. Knorz. Air/x - a rule-based multistage indexing system for large subject fields. In RIAO, pages 606–623, 1991.

[12] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115–132. MIT Press, Cambridge, MA, 2000.

[13] K. Höffgen, H. Simon, and K. van Horn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50:114–125, 1995.

[14] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press, Cambridge, MA, 1999.

[15] T. Joachims. Learning to Classify Text Using Support Vector Machines – Methods, Theory, and Algorithms. Kluwer, 2002.

[16] T. Joachims. Unbiased evaluation of retrieval quality using clickthrough data. Technical report, Cornell University, Department of Computer Science, 2002. http://www.joachims.org.

[17] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: a tour guide for the world wide web. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), volume 1, pages 770–777. Morgan Kaufmann, 1997.

[18] J. Kemeny and L. Snell. Mathematical Models in the Social Sciences. Ginn & Co, 1962.

[19] M. Kendall. Rank Correlation Methods. Hafner, 1955.

[20] H. Lieberman. Letizia: An agent that assists Web browsing. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI ’95), Montreal, Canada, 1995. Morgan Kaufmann.

[21] A. Mood, F. Graybill, and D. Boes. Introduction to the Theory of Statistics. McGraw-Hill, 3 edition, 1974.

[22] L. Page and S. Brin. Pagerank, an eigenvector based ranking approach for hypertext. In 21st Annual ACM/SIGIR International Conference on Research and Development in Information Retrieval, 1998.

[23] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

[24] C. Silverstein, M. Henzinger, H. Marais, and M. Moricz. Analysis of a very large altavista query log. Technical Report SRC 1998-014, Digital Systems Research Center, 1998.

[25] V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998.

[26] Y. Yao. Measuring retrieval effectiveness based on user preference of documents. Journal of the American Society for Information Science, 46(2):133–145, 1995.



Questions

Experiment Results?

Clickthrough Data?

Machine Learning for Retrieval Functions?

Retrieval Functions?