
Presentation Transcript


    Slide 1:Searching Web Better

    Dr Wilfred Ng, Department of Computer Science, The Hong Kong University of Science and Technology

    Slide 2:Outline

    Introduction
    Main Techniques (RSCF): Clickthrough Data; Ranking Support Vector Machine Algorithm; Ranking SVM in Co-training Framework
    The RSCF-based Metasearch Engine: Search Engine Components; Feature Extraction; Experiments
    Current Development

    Slide 3:Search Engine Adaptation

    Search engines: Google, MSNsearch, Wisenut, Overture, … Domains: Computer Science, Finance, Social Science, … Adapt a general-purpose search engine to a domain (e.g. CS terms, product news) by learning from implicit feedback: clickthrough data.

    Slide 4:Clickthrough Data

    Clickthrough data: data that indicates which links in the returned ranking results have been clicked by the users. Formally, it is a triplet (q, r, c): q is the input query, r is the ranking result presented to the user, and c is the set of links the user clicked on. Benefits: it can be obtained in a timely manner, and it requires no intervention in the search activity.
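
    For concreteness, here is one way such a triplet could be represented in code; this sketch is not from the original slides, and the query and link names are made up (the clicks match the deck's running example of l1, l7, and l10).

```python
from dataclasses import dataclass

@dataclass
class ClickthroughRecord:
    q: str         # the input query
    r: list[str]   # the ranking result presented to the user
    c: set[str]    # the set of links the user clicked on

# Hypothetical example: links l1..l10 returned for a query,
# with l1, l7 and l10 clicked.
record = ClickthroughRecord(
    q="support vector machine",
    r=[f"l{i}" for i in range(1, 11)],
    c={"l1", "l7", "l10"},
)
```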

    Slide 5:An Example of Clickthrough Data

    [Figure: a ranking result l1, …, l10 for the user’s input query, with l1, l7, and l10 highlighted as the links clicked by the user.]

    Slide 6:Target Ranking (Preference Pairs Set)

    Slide 7:An Example of Clickthrough Data

    [Figure: the same ranking result, with l1, …, l10 marked as the labelled data set and l11, l12, … as the unlabelled data set.]

    Labelled data set: l1, l2, …, l10. Unlabelled data set: l11, l12, …

    Slide 8:Target Ranking (Preference Pairs Set)
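
    The slides leave the pair-generation rule implicit. Below is a minimal sketch of the Joachims-style rule the deck builds on (a clicked link is preferred over every unclicked link ranked above it), assuming links are plain strings:

```python
def preference_pairs(ranking, clicks):
    """Joachims-style target ranking: a clicked link is preferred
    over every unclicked link that was ranked above it."""
    clicked = set(clicks)
    pairs = []  # (preferred, less_preferred)
    for i, link in enumerate(ranking):
        if link in clicked:
            pairs += [(link, above) for above in ranking[:i]
                      if above not in clicked]
    return pairs

# Running example: l1, l7 and l10 clicked out of l1..l10,
# yielding pairs such as (l7, l2), ..., (l10, l9).
ranking = [f"l{i}" for i in range(1, 11)]
print(preference_pairs(ranking, {"l1", "l7", "l10"}))
```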

    Slide 9:The Ranking SVM Algorithm

    Three links, each described by a feature vector. Target ranking: l1 <r' l2 <r' l3. A weight vector w acts as the ranker: links are projected onto w, and the margin is the distance between the two closest projected links. Con: it needs a large set of labelled data. [Figure: the feature vectors l1, l2, l3 and their projections l1', l2', l3' onto candidate weight vectors.]
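
    A minimal sketch of the standard Ranking SVM reduction, assuming scikit-learn is available: every preference pair becomes a difference vector for a linear SVM, whose weight vector w then scores and orders links. This illustrates the general technique only, not the exact implementation behind the slides.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ranker(pairs):
    """pairs: list of (x_preferred, x_other) feature-vector pairs.
    Each pair yields two training points: the difference vector
    labelled +1 and its negation labelled -1."""
    X, y = [], []
    for x_pref, x_other in pairs:
        X.append(x_pref - x_other); y.append(+1)
        X.append(x_other - x_pref); y.append(-1)
    svm = LinearSVC(C=1.0, fit_intercept=False).fit(np.array(X), y)
    return svm.coef_.ravel()  # the weight vector w

def rank(w, links):
    """Order links (feature vectors) by their projection onto w."""
    return sorted(links, key=lambda x: -float(np.dot(w, x)))
```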

    Slide 10:The Ranking SVM in Co-training Framework

    Divide the feature vector into two subvectors. Two rankers are built over these two feature subvectors. Each ranker chooses several unlabelled preference pairs and adds them to the labelled data set. Rebuild each ranker from the augmented labelled data set. [Diagram: rankers a_A and a_B are trained on the labelled preference feedback pairs P_l, select confident pairs from the unlabelled preference pairs P_u, and feed the augmented pairs back into training.]
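
    The loop can be sketched as follows, reusing the train_ranker helper from the previous sketch; the confidence heuristic (a pair's margin under the current ranker), the number of rounds, and the number of pairs added per round are illustrative placeholders, not the paper's exact criteria.

```python
import numpy as np

def co_training(P_l, P_u, split, rounds=5, n_confident=5):
    """Sketch of the co-training loop. P_l: labelled preference pairs
    (x_preferred, x_other); P_u: unlabelled candidate pairs;
    split(x) -> (x_A, x_B) divides a feature vector into two subvectors."""
    def view(pairs, i):
        return [(split(a)[i], split(b)[i]) for a, b in pairs]

    for _ in range(rounds):
        for i in (0, 1):                        # ranker a_A, then ranker a_B
            w = train_ranker(view(P_l, i))      # train on one feature subvector
            # Illustrative confidence score: the pair's margin under w.
            margins = [w @ (split(a)[i] - split(b)[i]) for a, b in P_u]
            chosen = set(np.argsort(margins)[::-1][:n_confident])
            P_l = P_l + [p for j, p in enumerate(P_u) if j in chosen]
            P_u = [p for j, p in enumerate(P_u) if j not in chosen]
    return train_ranker(P_l)                    # final ranker, full feature vectors
```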

    Slide 11:Some Issues

    Guideline for partitioning the feature vector: after the partition, each subvector must be sufficient for the later ranking. Number of rankers: depends on the number of features. When to terminate the procedure? The prediction difference indicates the ranking difference between the two rankers. After termination, obtain a final ranker on the augmented labelled data set.

    Slide 12:Metasearch Engine

    A metasearch engine receives a query from the user, sends the query to multiple search engines, combines the retrieved results from the underlying search engines, and presents a unified ranking result to the user. [Diagram: the user's query flows to search engines 1 to n, and their retrieved results 1 to n are merged into one unified ranking result.]
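
    A minimal fan-out sketch of this architecture; the engines argument stands for hypothetical per-engine fetch functions, and round-robin interleaving is just one naive combination policy (the RSCF engine instead learns its ranking from clickthrough data).

```python
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, engines, k=30):
    """Send the query to all underlying engines in parallel and
    interleave their top-k result lists round-robin."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda fetch: fetch(query, k), engines))
    unified, seen = [], set()
    for rank in range(k):
        for result_list in results:
            if rank < len(result_list) and result_list[rank] not in seen:
                seen.add(result_list[rank])
                unified.append(result_list[rank])
    return unified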

    Slide 13:Search Engine Components

    MSNsearch: powered by Inktomi, relatively mature; one of the most powerful search engines nowadays. Wisenut: a new but growing search engine. Overture: ranks links based on the prices paid by the sponsors of the links.

    Slide 14:Feature Extraction

    Ranking features (12 binary features): Rank(E, T), where E ∈ {M, W, O} and T ∈ {1, 3, 5, 10} (M: MSNsearch, W: Wisenut, O: Overture). These indicate the ranking of the links in each underlying search engine. Similarity features (4 features): Sim_U(q, l), Sim_T(q, t), Sim_C(q, a), Sim_G(q, a), covering the URL, title, abstract cover, and abstract group. These indicate the similarity between the query and the link.
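
    A small sketch of the 12 binary ranking features, assuming Rank(E, T) = 1 exactly when the link appears within the top T results of engine E (Slide 30 later refines this into rank buckets):

```python
ENGINES = ("M", "W", "O")          # MSNsearch, Wisenut, Overture
THRESHOLDS = (1, 3, 5, 10)

def ranking_features(positions):
    """positions: dict engine -> 1-based rank of the link in that engine,
    or absent if the engine did not return it. Yields the 12 binary
    features Rank(E, T), assumed 1 when the link is in E's top T."""
    return [
        1 if positions.get(e) is not None and positions[e] <= t else 0
        for e in ENGINES
        for t in THRESHOLDS
    ]

# Hypothetical link: rank 2 in MSNsearch, rank 7 in Wisenut, absent in Overture.
print(ranking_features({"M": 2, "W": 7}))
# [0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
```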

    Slide 15:Experiments

    Experiment data: queries within the same domain, computer science. Objectives: offline experiments – compared with RSVM; online experiments – compared with Google.

    Slide 16:Prediction Error

    Prediction error: the difference between the ranker's ranking and the target ranking. Target ranking: l1 <r' l2, l1 <r' l3, l2 <r' l3. Projected ranking: l2 <r' l1, l1 <r' l3, l2 <r' l3. One of the three preference pairs is inverted, so the prediction error is 1/3 ≈ 33%. [Figure: the links l1, l2, l3 and their projections l1', l2', l3'.]
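
    A sketch of the metric as the slide's example defines it, i.e. the fraction of target preference pairs that the learned weight vector w orders incorrectly:

```python
def prediction_error(target_pairs, w, features):
    """Fraction of target preference pairs the ranker gets wrong.
    target_pairs: (preferred, other) link-name pairs;
    features: dict link-name -> feature vector; w: weight vector."""
    wrong = sum(
        1 for pref, other in target_pairs
        if (w @ features[pref]) <= (w @ features[other])
    )
    return wrong / len(target_pairs)
```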

    Slide 17:Offline Experiment (Compared with RSVM)

    Prediction error measured on 10, 30, and 60 queries for three rankers: the ranker trained by the RSVM algorithm on the whole feature vector, and the two rankers trained by the RSCF algorithm, one on each feature subvector. The prediction error rises again beyond a point, and the number of iterations in the RSCF algorithm is about four to five.

    Slide 18:Offline Experiment (Compared with RSVM)

    Overall comparison: the ranker trained by the RSVM algorithm versus the final ranker trained by the RSCF algorithm.

    Slide 19:Online Experiment (Compared with Google)

    Experiment data: CS terms, e.g. radix sort, TREC collection, … Experiment setup: combine the results returned by RSCF and those returned by Google into one shuffled list, present it to the users in a unified way, and record the users' clicks.

    Slide 20:Experimental Analysis

    Slide 21:Experimental Analysis

    Slide 22:Experimental Analysis

    Slide 23:Conclusion on RSCF

    Search engine adaptation. The RSCF algorithm: train on clickthrough data, applying RSVM in the co-training framework. The RSCF-based metasearch engine: offline experiments – better than RSVM; online experiments – better than Google.

    Slide 24:Current Development

    Feature extraction and division. Application in different domains. Search engine personalization. SpyNoby project: a personalized search engine with clickthrough analysis.

    Slide 25:Modified Target Ranking for Metasearch Engines

    If l1 and l7 are from the same underlying search engine, the preference pairs set arising from l1 should be l1 <r l2, l1 <r l3, l1 <r l4, l1 <r l5, l1 <r l6. Advantages: this alleviates the penalty on high-ranked links and gives more credit to the ranking ability of the underlying search engines.

    Labelled data set: l1, l2, …, l10. Unlabelled data set: l11, l12, …
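
    One plausible reading of this rule in code, complementing the earlier Joachims-style sketch: a clicked link is additionally preferred over the unclicked links between it and the next clicked link from the same underlying engine. The engine_of mapping is a hypothetical input.

```python
def modified_preference_pairs(ranking, clicks, engine_of):
    """Illustrative reading of the modified target ranking.
    engine_of: dict link -> name of the engine that contributed it."""
    clicked = set(clicks)
    pairs = []
    for i, link in enumerate(ranking):
        if link not in clicked:
            continue
        for later in ranking[i + 1:]:
            if later in clicked and engine_of[later] == engine_of[link]:
                break                 # stop at the next same-engine click
            if later not in clicked:
                pairs.append((link, later))
    return pairs

# With l1 and l7 from the same engine and l1 clicked, l1 yields
# (l1, l2), (l1, l3), (l1, l4), (l1, l5), (l1, l6), as on the slide.
```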

    Slide 26:Modified Target Ranking

    Slide 27:RSCF-based Metasearch Engine - MEA

    [Diagram: the user sends query q to MEA, which forwards q to its three underlying search engines; their top-30 result lists are combined into one unified ranking result.]

    Slide 28:RSCF-based Metasearch Engine - MEB

    [Diagram: the user sends query q to MEB, which forwards q to its four underlying search engines; their top-30 result lists are combined into one unified ranking result.]

    Slide 29:Generating Clickthrough Data

    Probability of a link being clicked on: Pr(k) = (1/k^α) / H(n, α), where k is the ranking of the link in the metasearch engine, n is the number of all the links in the metasearch engine, α is the skewness parameter in Zipf's law, and H(n, α) = Σ_{i=1}^{n} 1/i^α is the harmonic number. The link's relevance is judged manually: if the link is irrelevant, it is not clicked on; if the link is relevant, it has probability Pr(k) of being clicked on.
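
    A sketch of this click-generation procedure; the formula follows the slide's Zipf setup, while the function names and the boolean relevance encoding are illustrative.

```python
import random

def zipf_click_probability(k, n, alpha=1.0):
    """Pr(k) = (1/k^alpha) / H(n, alpha): Zipf-skewed probability that a
    relevant link at 1-based rank k out of n links is clicked."""
    harmonic = sum(1.0 / i ** alpha for i in range(1, n + 1))
    return (1.0 / k ** alpha) / harmonic

def simulate_clicks(relevance, alpha=1.0):
    """relevance: list of booleans, one per rank. Irrelevant links are
    never clicked; relevant links are clicked with probability Pr(k)."""
    n = len(relevance)
    return [
        relevant and random.random() < zipf_click_probability(k, n, alpha)
        for k, relevant in enumerate(relevance, start=1)
    ]
```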

    Slide 30:Feature Extraction

    Ranking features (binary features): Rank(E, T) indicates whether the link is ranked within S_T in E, where E ∈ {G, M, W, O} and T ∈ {1, 3, 5, 10, 15, 20, 25, 30}, with S1 = {1}, S3 = {2, 3}, S5 = {4, 5}, S10 = {6, 7, 8, 9, 10}, … (G: Google, M: MSNsearch, W: Wisenut, O: Overture). These indicate the ranking of the links in each underlying search engine. Similarity features (4 features): Sim_U(q, l), Sim_T(q, t), Sim_C(q, a), Sim_G(q, a) measure the similarity between the query and the link.
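
    A sketch of the bucketed rank features; S1, S3, S5 and S10 follow the slide, and the later buckets (S15 through S30) are assumed to continue the pattern, each covering the ranks between consecutive thresholds.

```python
THRESHOLDS = (1, 3, 5, 10, 15, 20, 25, 30)

def bucket_features(position):
    """One-hot bucket features for a 1-based rank in one engine:
    S1 = {1}, S3 = {2,3}, S5 = {4,5}, S10 = {6..10}, and (assumed)
    S15 = {11..15}, ..., S30 = {26..30}."""
    features = []
    lower = 0
    for t in THRESHOLDS:
        features.append(1 if position is not None and lower < position <= t else 0)
        lower = t
    return features

print(bucket_features(7))   # rank 7 falls in S10
# [0, 0, 0, 1, 0, 0, 0, 0]
```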

    Slide 31:Experiments

    Experiment data: three different domains – CS terms, news, and e-shopping. Objectives: prediction error – better than RSVM; top-k precision – adaptation ability.

    Slide 32:Top-k Precision

    Advantages: precision is easier to obtain than recall, and users care only about the top-k links (k = 10). Evaluation data: 30 queries in each domain.
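
    A minimal sketch, assuming the usual definition of the metric (relevant links among the top k, divided by k):

```python
def top_k_precision(results, relevant, k=10):
    """Fraction of the top k returned links that are relevant."""
    top = results[:k]
    return sum(1 for link in top if link in relevant) / max(len(top), 1)
```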

    Slide 33:Comparison of Top-k precision

    [Figure: top-k precision compared in each of the three domains: CS terms, news, and e-shopping.]

    Slide 34:Statistical Analysis

    Hypothesis testing (two-sample hypothesis testing about means): used to analyse whether there is a statistically significant difference between the means of two samples.
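
    A minimal example with SciPy; the precision scores below are made-up numbers purely to show the mechanics of the two-sample t-test.

```python
from scipy import stats

# Hypothetical top-10 precision scores for two engines over the same queries.
precision_mea    = [0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.8]
precision_google = [0.6, 0.7, 0.7, 0.5, 0.8, 0.6, 0.7, 0.6]

t_stat, p_value = stats.ttest_ind(precision_mea, precision_google)
if p_value < 0.05:
    print(f"Difference in means is statistically significant (p = {p_value:.3f})")
else:
    print(f"No significant difference (p = {p_value:.3f})")
```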

    Slide 35:Comparison Results

    MEA can produce better search quality than Google, and Google does not excel in every query category. MEA and MEB are able to adapt so as to bring out the strengths of each underlying search engine, and they are better than, or comparable to, all their underlying search engine components in every query category. The RSCF-based metasearch engine: comparison of prediction error – better than RSVM; comparison of top-k precision – adaptation ability.

    Slide 36:Spy Naïve Bayes – Motivation

    The problem with Joachims' method: its strong assumptions, which may not hold, excessively penalize high-ranked links; l1, l2, l3 are apt to appear on the right-hand (less preferred) side of preference pairs, while l7, l10 appear on the left-hand (preferred) side. New interpretation of clickthrough data: clicked links are positive (P); unclicked links are unlabelled (U), containing both positive and negative samples. Goal: identify reliable negatives (RN) from U, then target lp <r ln for lp in P and ln in RN.

    Slide 37:Spy Naïve Bayes: Ideas

    Standard naïve Bayes classifies positive and negative samples. One-step spy naïve Bayes spies out RN from U: put a small number of positive samples into U to act as "spies" (to scout the behaviour of real positive samples in U); take U as negative samples to train a naïve Bayes classifier; samples with lower probabilities of being positive than the spies are assigned to RN. A voting procedure makes the spying more robust: run one-step SpyNB n times to get n sets RN_i; a sample that appears in at least m (m ≤ n) of the sets RN_i appears in the final RN.
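
    A compact sketch of one spying round plus voting, assuming scikit-learn, non-negative bag-of-words feature matrices, and the common spy threshold (the lowest predicted positive probability among the spies); the spy ratio and vote counts are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def one_step_spynb(P, U, spy_ratio=0.15, rng=np.random.default_rng()):
    """One spying round: return the indices into U judged reliably negative."""
    spies = rng.choice(len(P), max(1, int(spy_ratio * len(P))), replace=False)
    keep = np.setdiff1d(np.arange(len(P)), spies)
    # Remaining positives form class 1; U plus the spies form class 0.
    X = np.vstack([P[keep], U, P[spies]])
    y = np.concatenate([np.ones(len(keep)), np.zeros(len(U) + len(spies))])
    clf = MultinomialNB().fit(X, y)
    threshold = clf.predict_proba(P[spies])[:, 1].min()  # spies' lowest P(+)
    return set(np.flatnonzero(clf.predict_proba(U)[:, 1] < threshold))

def spynb_voting(P, U, n=10, m=8):
    """A sample enters the final RN if at least m of n runs select it."""
    votes = np.zeros(len(U))
    for _ in range(n):
        for j in one_step_spynb(P, U):
            votes[j] += 1
    return np.flatnonzero(votes >= m)  # indices into U of the final RN set
```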

    Slide 38:http://dleecpu1.cs.ust.hk:8080/SpyNoby/

    Slide 39:My publications

    Wilfred NG. Book Review: An Introduction to Search Engines and Web Navigation. International Journal of Information Processing & Management, 43(1), pages 290-292, (2007).
    Wilfred NG, Lin DENG and Dik-Lun LEE. Spying Out Real User Preferences in Web Searching. Accepted and to appear: ACM Transactions on Internet Technology, (2006).
    Yiping KE, Lin DENG, Wilfred NG and Dik-Lun LEE. Web Dynamics and their Ramifications for the Development of Web Search Engines. Accepted and to appear: Computer Networks Journal - Special Issue on Web Dynamics, (2005).
    Qingzhao TAN, Yiping KE and Wilfred NG. WUML: A Web Usage Manipulation Language for Querying Web Log Data. International Conference on Conceptual Modeling ER 2004, Lecture Notes in Computer Science Vol. 3288, Shanghai, China, pages 567-581, (2004).
    Lin DENG, Xiaoyong CHAI, Qingzhao TAN, Wilfred NG and Dik-Lun LEE. Spying Out Real User Preferences for Metasearch Engine Personalization. ACM Proceedings of the WEBKDD Workshop on Web Mining and Web Usage Analysis 2004, Seattle, USA, (2004).
    Qingzhao TAN, Xiaoyong CHAI, Wilfred NG and Dik-Lun LEE. Applying Co-training to Clickthrough Data for Search Engine Adaptation. 9th International Conference on Database Systems for Advanced Applications DASFAA 2004, Lecture Notes in Computer Science Vol. 2973, Jeju Island, Korea, pages 519-532, (2004).
    Haofeng ZHOU, Yubo LOU, Qingqing YUAN, Wilfred NG, Wei WANG and Baile SHI. Refining Web Authoritative Resource by Frequent Structures. IEEE Proceedings of the International Database Engineering and Applications Symposium IDEAS 2003, Hong Kong, pages 236-241, (2003).
    Wilfred NG. Capturing the Semantics of Web Log Data by Navigation Matrices. A book chapter in "Semantic Issues in E-Commerce Systems", edited by R. Meersman, K. Aberer and T. Dillon, Kluwer Academic Publishers, pages 155-170, (2003).
