Optimizing Search Engines using Clickthrough Data

by

Thorsten Joachims

Presentation by M. Şükrü Kuran

- Search Engines
- Clickthrough Data
- Learning of Retrieval Functions
- Support Vector Machine (SVM) for Learning of Ranking Functions

- Experiment Setup
- Offline Experiment
- Online Experiment
- Analysis of The Online Experiment
- Conclusion and Future Work
- References
- Questions

- Search engines utilize ranking systems to list results based on their relevance to the query
- Current ranking systems are not optimized for relevance
- As an alternative solution, we can use clickthrough data to find more relevance-optimized results

What is Clickthrough Data?

- Clickthrough data is the set of links that the user selects from the list of links retrieved by the search engine for a user-given query.

Why is Clickthrough Data Important?

- These are the links the user found most relevant among the query results
- Easier to acquire than explicit user feedback (since the data is already in the logs of the search engines)

- Users are less likely to click on a link that has a low ranking (independent of its actual relevance)

- Users typically scan only the first 10 links in the result set [24]

Thus, clickthrough data is not an absolute relevance value for the query, but it is a good relative relevance value

Example:

Results for a search for SVM:

1. Kernel Machines
2. Support Vector Machine
3. SVM-Light Support Vector Machine
4. Intr. To Support Vector Machines
5. Support Vector Machine and Kernel Methods Ref.
6. Archives of Support Vector Machines
7. SVM demo Applet
8. Royal Holloway Support Vector Machine
9. Support Vector Machine The Software
10. Lagrangian Support Vector Machine Home Page

Among the 10 results, only links 1, 3 and 7 are chosen (clickthrough data)

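$\text{link}_3 <_{r^*} \text{link}_2$

$\text{link}_7 <_{r^*} \text{link}_2$

$\text{link}_7 <_{r^*} \text{link}_4$

$\text{link}_7 <_{r^*} \text{link}_5$

$\text{link}_7 <_{r^*} \text{link}_6$

$r^*$ : ranking preferred by the user (binary relation)

We can generalize this preference information:

$\text{link}_i <_{r^*} \text{link}_j$

for all pairs $1 \le j < i$, where link $i$ was clicked and link $j$ was not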
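The preference extraction illustrated by the SVM example can be sketched in a few lines of Python. This is only an illustration: the function name `extract_preferences` is hypothetical, and it assumes that a pair is produced whenever a clicked link sits below a link that was not clicked, which is what the five example relations show.

```python
def extract_preferences(clicked):
    """Return pairs (i, j) meaning 'link i is preferred over link j'.

    Ranks are 1-based. Assumption: a pair is generated when link i was
    clicked and a higher-ranked link j (j < i) was not clicked.
    """
    clicked = set(clicked)
    pairs = []
    for i in sorted(clicked):
        for j in range(1, i):
            if j not in clicked:
                pairs.append((i, j))
    return pairs


# Clicks on links 1, 3 and 7, as in the SVM example.
print(extract_preferences({1, 3, 7}))
# [(3, 2), (7, 2), (7, 4), (7, 5), (7, 6)]
```

Note that link 1 produces no pair: nothing is ranked above it, so a click on the top result carries no relative information under this scheme.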

- Goal: we have to find a retrieval function whose rankings are close to $r^*$

- In order to calculate the similarity between any two given rankings $r_a$ and $r_b$, we have to use a performance metric

- Average Precision (binary relevance)
- Kendall’s $\tau$

Very Simple

Good Performance Metric

- Kendall’s $\tau$
- Between any two rankings $r_a$ and $r_b$, Kendall’s $\tau$ is

$$\tau(r_a, r_b) = \frac{P - Q}{P + Q} = 1 - \frac{2Q}{\binom{m}{2}}$$

D : Set of documents in a query result

P : # of concordant pairs in D x D

Q : # of discordant pairs in D x D

m : # of documents/links in D
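As a rough illustration of the quantities P, Q and m defined above, the following Python sketch counts concordant and discordant pairs and returns (P - Q) / (P + Q), the standard form of Kendall's τ. It assumes both rankings are strict orderings of the same documents; the function name and the document labels are made up for the example.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two strict orderings of the same documents.

    rank_a, rank_b: lists of document ids, best first (assumed format).
    """
    pos_a = {doc: k for k, doc in enumerate(rank_a)}
    pos_b = {doc: k for k, doc in enumerate(rank_b)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("both rankings must contain the same documents")
    concordant = discordant = 0
    for d1, d2 in combinations(rank_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[d1] - pos_a[d2]) * (pos_b[d1] - pos_b[d2]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


r_a = ["d1", "d2", "d3", "d4", "d5"]
r_b = ["d3", "d2", "d1", "d4", "d5"]
# Discordant pairs: (d1,d2), (d1,d3), (d2,d3), so Q = 3 and P = 7.
print(kendall_tau(r_a, r_b))  # 0.4
```

With m = 5 there are 10 pairs in total, so Q = 3 gives τ = (7 - 3) / 10 = 0.4. Identical rankings give 1 and fully reversed rankings give -1.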

- Problem definition of learning an appropriate retrieval function
- For a fixed (but unknown) distribution of queries and target (user-preferred) rankings, the goal is to maximize the expected Kendall’s $\tau$,

$$\tau_P(f) = \int \tau(r_{f(q)}, r^*) \, d\Pr(q, r^*)$$

where $\Pr(q, r^*)$ is the distribution of queries and target rankings

- Usually machine learning in information retrieval is based on binary classification (a document is either relevant to the query or not)

- Since the information gathered from clickthrough data is not absolute relevance information, we cannot use binary classification

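- Using a set of queries and target rankings (training data), we will select a ranking function from a family (F) of ranking functions

Selection will be based on maximizing the empirical Kendall’s $\tau$ on the training set,

$$\tau_S(f) = \frac{1}{n} \sum_{i=1}^{n} \tau(r_{f(q_i)}, r_i^*)$$

n : # of queries in the training set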

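- Then, we need to find a sound family of ranking functions.
- How do we find an F that includes an efficient ranking function (f)?

- A set of linear functions,

$$(d_i, d_j) \in f_{\vec{w}}(q) \iff \vec{w} \cdot \Phi(q, d_i) > \vec{w} \cdot \Phi(q, d_j)$$

- where the $\Phi(q, d)$ are description-based retrieval functions [10,11]
- and the $\vec{w}$ are weight vectors (2D) adjusted by learning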
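To make the role of the weight vector concrete, here is a small Python sketch of a linear ranking function: each document is scored by the dot product of the weight vector with its feature vector, and documents are sorted by that score. The feature values, document names and the function name `rank_documents` are invented for illustration and do not come from the presentation.

```python
def rank_documents(weights, features_by_doc):
    """Order documents by the linear score w . Phi(q, d), best first.

    features_by_doc maps a document id to its feature vector for one
    fixed query (hypothetical layout, chosen for this sketch).
    """
    def score(doc):
        return sum(w * x for w, x in zip(weights, features_by_doc[doc]))

    return sorted(features_by_doc, key=score, reverse=True)


# Two made-up features per document, matching the 2D weight vector.
features = {
    "d1": [0.2, 0.9],  # score 0.65
    "d2": [0.8, 0.1],  # score 0.85
    "d3": [0.5, 0.5],  # score 0.75
}
print(rank_documents([1.0, 0.5], features))  # ['d2', 'd3', 'd1']
```

Changing the weights changes the ordering, which is why learning the weight vector amounts to learning the ranking function.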

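- Instead of directly maximizing our goal function, we can minimize Q, the number of discordant pairs, in our performance measure
- This can be done by using classification SVMs [7]

minimize

$$V(\vec{w}, \vec{\xi}) = \frac{1}{2} \vec{w} \cdot \vec{w} + C \sum \xi_{i,j,k}$$

subject to

$$\forall (d_i, d_j) \in r_k^* : \ \vec{w} \cdot \Phi(q_k, d_i) \ge \vec{w} \cdot \Phi(q_k, d_j) + 1 - \xi_{i,j,k}, \quad k = 1, \dots, n$$

$$\forall i, j, k : \ \xi_{i,j,k} \ge 0$$

where $C$ trades off margin size against training error and the $\xi_{i,j,k}$ are slack variables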
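The pairwise idea behind this optimization can be sketched as follows. This is not the solver used by the author: plain subgradient descent on the equivalent hinge-loss objective is swapped in for a quadratic-programming solver, and the hyperparameters (`c`, `epochs`, `lr`), the function name and the toy feature vectors are all assumptions made for the example.

```python
def train_ranking_svm(pairs, dim, c=1.0, epochs=200, lr=0.01):
    """Learn w so that w . better > w . worse for each preference pair.

    pairs: list of (better, worse) feature vectors, where 'better' belongs
    to the document the user preferred. Minimizes
    0.5 * w.w + c * sum(max(0, 1 - w.(better - worse)))
    by subgradient descent (a stand-in for a QP solver).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)  # gradient of the regularizer 0.5 * w.w
        for better, worse in pairs:
            diff = [b - o for b, o in zip(better, worse)]
            margin = sum(wk * dk for wk, dk in zip(w, diff))
            if margin < 1.0:  # pair violates the margin, slack is positive
                for k in range(dim):
                    grad[k] -= c * diff[k]
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


# Toy preference pairs: the first feature helps, the second hurts.
example_pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([2.0, 1.0], [1.0, 1.0]),
    ([0.5, 0.2], [0.1, 0.9]),
]
w = train_ranking_svm(example_pairs, dim=2)
for better, worse in example_pairs:
    score_gap = sum(wk * (b - o) for wk, b, o in zip(w, better, worse))
    print(score_gap > 0)  # True when the preferred document scores higher
```

Each preference pair becomes one classification-style constraint on the difference of two feature vectors, which is why a classification SVM can be reused for ranking.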

- A baseline meta-search engine called Striver is used for testing purposes
- Striver forwards a query to MSNSearch, Google, Excite, Altavista and Hotbot
- It acquires the top 100 results from each search engine
- Based on the learned retrieval function, it selects the top 50 of the 500 candidates (there may be fewer than 500 if more than one engine has found the same document)
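The candidate-merging step described above might look roughly like the sketch below. It is not Striver's actual code: the function name `merge_results`, the engine names, the URLs and the score table are placeholders, and the learned retrieval function is represented by any callable that maps a URL to a score.

```python
def merge_results(results_by_engine, score, per_engine=100, top_k=50):
    """Union the top results of several engines and re-rank them.

    results_by_engine: dict mapping an engine name to its ranked URL list.
    score: callable returning the learned retrieval score of a URL
           (placeholder for the learned function).
    """
    candidates = []
    seen = set()
    for engine_results in results_by_engine.values():
        for url in engine_results[:per_engine]:
            if url not in seen:  # a document found by two engines counts once
                seen.add(url)
                candidates.append(url)
    return sorted(candidates, key=score, reverse=True)[:top_k]


# Made-up data: "u2" is returned by both engines.
results = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u4"],
}
learned_scores = {"u1": 0.1, "u2": 0.9, "u3": 0.5, "u4": 0.7}
print(merge_results(results, learned_scores.get, top_k=3))
# ['u2', 'u4', 'u3']
```

Because duplicates are removed before ranking, the candidate set can be smaller than the number of engines times `per_engine`.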

- Using Striver, 112 queries are recorded
- A large set of features is used to calculate the description-based retrieval functions
- The testing is done with different numbers of training queries
- Results from Google and MSNSearch are used for benchmarking purposes

- Striver is used by a group of 20 people
- Based on these people’s queries, the training set of Striver is composed of 260 queries
- The results are compared with results from Google, MSNSearch and Toprank (a simple meta-search engine)

"More clicks" means that (for Google) users clicked more links from the learned engine than from Google; this happened for 29 queries out of 88.

"Less clicks" means that (for Google) users clicked fewer links from the learned engine than from Google; this happened for 13 queries out of 88.

- Since all of the users used the engine for academic searches, the learned function works well for searches on academic research topics
- But it may not give equally good results for different groups of people
- We can say that the learned engine is a customizable engine, unlike traditional engines

- What is the optimal group size for user customization?
- Features can be tuned for better performance
- Can clustering algorithms cluster WWW users into subgroups based on their clickthrough data?
- Can malicious users corrupt the learning process by clicking irrelevant links, and how can this be avoided?

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, Harlow, UK, May 1999.

[2] B. Bartell, G. Cottrell, and R. Belew. Automatic combination of multiple ranked retrieval systems. In Annual ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), 1994.

[3] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2000.

[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, 1992.

[5] J. Boyan, D. Freitag, and T. Joachims. A machine learning architecture for optimizing web search engines. In AAAI Workshop on Internet Based Information Systems, August 1996.

[6] W. Cohen, R. Shapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10, 1999.

[7] C. Cortes and V. N. Vapnik. Support–vector networks. Machine Learning Journal, 20:273–297, 1995.

[8] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems (NIPS), 2001.

[9] Y. Freund, R. Iyer, R. Shapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In International Conference on Machine Learning (ICML), 1998.

[10] N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3):183–204, 1989.

[11] N. Fuhr, S. Hartmann, G. Lustig, M. Schwantner, K. Tzeras, and G. Knorz. Air/x - a rule-based multistage indexing system for large subject fields. In RIAO, pages 606–623, 1991.

[12] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115–132. MIT Press, Cambridge, MA, 2000.

[13] K. Höffgen, H. Simon, and K. van Horn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50:114–125, 1995.

[14] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press, Cambridge, MA, 1999.

[15] T. Joachims. Learning to Classify Text Using Support Vector Machines – Methods, Theory, and Algorithms. Kluwer, 2002.

[16] T. Joachims. Unbiased evaluation of retrieval quality using clickthrough data. Technical report, Cornell University, Department of Computer Science, 2002. http://www.joachims.org.

[17] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: a tour guide for the world wide web. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), volume 1, pages 770–777. Morgan Kaufmann, 1997.

[18] J. Kemeny and L. Snell. Mathematical Models in theSocial Sciences. Ginn & Co, 1962.

[19] M. Kendall. Rank Correlation Methods. Hafner, 1955.

[20] H. Lieberman. Letizia: An agent that assists Web browsing. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI ’95), Montreal, Canada, 1995. Morgan Kaufmann.

[21] A. Mood, F. Graybill, and D. Boes. Introduction to the Theory of Statistics. McGraw-Hill, 3 edition, 1974.

[22] L. Page and S. Brin. Pagerank, an eigenvector based ranking approach for hypertext. In 21st Annual ACM/SIGIR International Conference on Research and Development in Information Retrieval, 1998.

[23] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

[24] C. Silverstein, M. Henzinger, H. Marais, and M. Moricz. Analysis of a very large altavista query log. Technical Report SRC 1998-014, Digital Systems Research Center, 1998.

[25] V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998.

[26] Y. Yao. Measuring retrieval effectiveness based on user preference of documents. Journal of the American Society for Information Science, 46(2):133–145, 1995.

Questions?

- Experiment Results?
- Clickthrough Data?
- Machine Learning for Retrieval Functions?
- Retrieval Functions?