Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure Allen, Zhenjiang LIN CSE, CUHK 13 Dec 2006
Outline • 1. Introduction • 2. Extended Neighborhood Structure Model • 3. Extending Link-based Similarity Measures • 4. Experimental Results • 5. Conclusion and Future Work
1. Introduction • Background • Similarity measures are required in many web applications to evaluate the similarity between web pages. • The “similar pages” service of Web search engines; • Web document classification; • Web community identification. • Problem • Many link-based similarity measures lose accuracy because they consider only part of the structural information.
1. Introduction • Motivation • How can the accuracy of link-based similarity measures be improved by making full use of the structural information? • Contributions • Propose the Extended Neighborhood Structure (ENS) model. • bi-direction • multi-hop • Construct extended link-based similarity measures based on the ENS model. • more flexible and accurate
1. Introduction • Searching the Web • Keyword searching • KEYWORDS: news → Search Engine → http://news.bbc.co.uk/ http://www.cnn.com/ … • Similarity searching • URL: www.cnn.com → Search Engine (similarity measure) → http://news.bbc.co.uk/ http://usnews.com/ …
1. Introduction • Similarity measures • Evaluate how similar or related two objects are. • Approaches to measuring similarity • Text-based • Cosine TFIDF [Joachims97] • Link-based (focus of this talk) • Bibliographic coupling [Kessler63] • Co-citation [Small73] • SimRank [Jeh et al 02], PageSim [Lin et al 06] • Hybrid
2. Extended Neighborhood Structure Model • Extended Neighborhood Structure (ENS) model • Question: what is hidden in hyperlinks? • a similarity relationship between pages, • the similarity relationship decreases along hyperlinks.
2. Extended Neighborhood Structure Model • Extended Neighborhood Structure (ENS) model • The ENS model • bi-direction • in-link • out-link • multi-hop • direct (1-hop) • indirect (2-hop, 3-hop, etc.) • Purpose • Improve the accuracy of link-based similarity measures by helping them make full use of the structural information of the Web.
3. Extending Link-based Similarity Measures • Intuition of similarity • Similar web pages have similar neighbors (to compare two web pages, look at their neighbors). • Notations • G = (V, E), |V| = n: the web graph. • I(a) / O(a): in-link / out-link neighbors of web page a. • path(a_1, a_s): a sequence of vertices a_1, a_2, …, a_s such that (a_i, a_{i+1}) ∈ E (i = 1, …, s-1) and all a_i are distinct. • PATH(a,b): the set of all possible paths from page a to b. • Sim(a,b): similarity score of web pages a and b.
3. Extending Link-based Similarity Measures • Two classical methods • Co-citation: the more common in-link neighbors, the more similar. • Sim(a,b) = |I(a)∩I(b)| • Bibliographic coupling: the more common out-link neighbors, the more similar. • Sim(a,b) = |O(a)∩O(b)| • Extended Co-citation and Bibliographic Coupling (ECBC) • ECBC: the more common neighbors, the more similar. • Sim(a,b) = α|I(a)∩I(b)| + (1-α)|O(a)∩O(b)|, where 0≤α≤1 is a constant.
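The three formulas on this slide translate directly into set operations. The following is a minimal Python sketch; the helper names (`build_neighbors`, `co_citation`, `bibliographic_coupling`, `ecbc`), the default α = 0.5, and the toy graph are illustrative assumptions, not part of the talk.

```python
def build_neighbors(edges):
    """Build I(.) and O(.) as dicts of sets from a list of (source, target) links."""
    in_links, out_links = {}, {}
    for src, dst in edges:
        out_links.setdefault(src, set()).add(dst)
        in_links.setdefault(dst, set()).add(src)
    return in_links, out_links


def co_citation(in_links, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|: number of common in-link neighbors."""
    return len(in_links.get(a, set()) & in_links.get(b, set()))


def bibliographic_coupling(out_links, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|: number of common out-link neighbors."""
    return len(out_links.get(a, set()) & out_links.get(b, set()))


def ecbc(in_links, out_links, a, b, alpha=0.5):
    """Sim(a,b) = α|I(a) ∩ I(b)| + (1-α)|O(a) ∩ O(b)| with 0 ≤ α ≤ 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (alpha * co_citation(in_links, a, b)
            + (1 - alpha) * bibliographic_coupling(out_links, a, b))


# Hypothetical toy graph: p and q both link to a and b; a and b both link to x.
edges = [("p", "a"), ("p", "b"), ("q", "a"), ("q", "b"),
         ("a", "x"), ("b", "x"), ("b", "y")]
in_links, out_links = build_neighbors(edges)
print(co_citation(in_links, "a", "b"))              # common in-link neighbors
print(bibliographic_coupling(out_links, "a", "b"))  # common out-link neighbors
print(ecbc(in_links, out_links, "a", "b", alpha=0.5))
```

Setting α = 1 reduces ECBC to Co-citation and α = 0 reduces it to Bibliographic coupling, so the two classical methods are the endpoints of the same measure.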
3. Extending Link-based Similarity Measures • SimRank “two pages are similar if they are linked to by similar pages” • (1) Sim(u,u)=1; (2) Sim(u,v)=0 if |I(u)| |I(v)| = 0. Recursive definition • C is a constant between 0 and 1. • The iteration starts with Sim(u,u)=1, Sim(u,v)=0 if u≠v.
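The iteration described on this slide can be sketched in Python. The sketch assumes the standard SimRank update from [Jeh et al 02], Sim(u,v) = C / (|I(u)||I(v)|) · Σ_{x∈I(u)} Σ_{y∈I(v)} Sim(x,y), together with the two base cases listed above. The function name `simrank`, the value C = 0.8, the fixed iteration count, and the toy graph are illustrative assumptions.

```python
from itertools import product


def simrank(nodes, in_links, C=0.8, iterations=10):
    """Iterate the SimRank recursion a fixed number of times (naive O(n^2) sketch)."""
    # Start: Sim(u,u) = 1 and Sim(u,v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for u, v in product(nodes, repeat=2):
            if u == v:
                new_sim[(u, v)] = 1.0          # base case (1)
                continue
            iu = in_links.get(u, set())
            iv = in_links.get(v, set())
            if not iu or not iv:
                new_sim[(u, v)] = 0.0          # base case (2): |I(u)||I(v)| = 0
                continue
            total = sum(sim[(x, y)] for x in iu for y in iv)
            new_sim[(u, v)] = C * total / (len(iu) * len(iv))
        sim = new_sim
    return sim


# Hypothetical toy graph: page p links to both a and b.
nodes = ["p", "a", "b"]
in_links = {"a": {"p"}, "b": {"p"}}
scores = simrank(nodes, in_links, C=0.8, iterations=5)
print(scores[("a", "b")])  # a and b share their only in-link neighbor
print(scores[("p", "a")])  # p has no in-links
```

Because only in-links enter the update, a page without in-links scores 0 against every other page, which is the limitation the bi-directional extension targets.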
3. Extending Link-based Similarity Measures • Extended SimRank “two pages are similar if they have similar neighbors” • (1) Sim(u,u)=1; (2) Sim(u,v)=0 if |I(u)| |I(v)| = 0. Recursive definition • C is a constant between 0 and 1. • The iteration starts with Sim(u,u)=1, Sim(u,v)=0 if u≠v.
3. Extending Link-based Similarity Measures • PageSim: a “weighted multi-hop” version of the Co-citation algorithm. It takes into account • (a) multi-hop in-link information, and • (b) importance of web pages. • Importance can be represented by any global scoring system • PageRank scores, or • Authoritative scores of HITS.
3. Extending Link-based Similarity Measures • PageSim (phase 1: feature propagation) • Initially, each web page contains a unique piece of feature information, which is represented by its PageRank score. • The feature information of a web page is propagated along out-link hyperlinks at decay rate d. The PR score of u propagated to v is defined by
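A rough Python sketch of the propagation phase follows. It rests on assumptions that are not stated on the slide: at every hop the score being carried is multiplied by d and split evenly over the out-links of the current page, contributions are summed over all simple paths from u to v, and propagation stops after a hypothetical `max_hops` limit. The uniform starting scores stand in for real PageRank values.

```python
def propagate_features(out_links, scores, d=0.5, max_hops=3):
    """Return fv, where fv[v][u] is the share of u's score that reaches v.

    Assumed rule (illustrative): each hop multiplies the carried score by d
    and divides it evenly among the out-links of the current page.
    """
    # Every page starts with its own feature information.
    fv = {u: {u: s} for u, s in scores.items()}

    def walk(source, node, amount, visited, hops):
        targets = out_links.get(node, set())
        if hops == max_hops or not targets:
            return
        share = d * amount / len(targets)
        for nxt in targets:
            if nxt in visited:        # vertices on a path must be distinct
                continue
            bucket = fv.setdefault(nxt, {})
            bucket[source] = bucket.get(source, 0.0) + share
            walk(source, nxt, share, visited | {nxt}, hops + 1)

    for u, s in scores.items():
        walk(u, u, s, {u}, 0)
    return fv


# Hypothetical toy graph: u -> a -> v and u -> b -> v.
out_links = {"u": {"a", "b"}, "a": {"v"}, "b": {"v"}}
scores = {"u": 1.0, "a": 1.0, "b": 1.0, "v": 1.0}  # placeholder PageRank scores
fv = propagate_features(out_links, scores, d=0.5)
print(fv["v"])  # feature information that reached v, keyed by source page
```

Enumerating simple paths is exponential in the worst case, so the hop limit is what keeps the sketch usable on anything larger than a toy graph.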
3. Extending Link-based Similarity Measures • PageSim (phase 2: similarity computation) • A web page v stores its own feature information and that of other pages in its Feature Vector FV(v). • The similarity between web pages u and v is computed by the Jaccard measure [Jain et al 88]. • Intuition: the more common feature information two web pages contain, the more similar they are.
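The similarity phase can be sketched as a Jaccard-style comparison of two feature vectors. The exact formula is not shown on the slide, so this sketch assumes the weighted (generalized) Jaccard form, sum of component-wise minima over sum of component-wise maxima. The function name `jaccard_similarity` and the sample vectors are hypothetical.

```python
def jaccard_similarity(fv_u, fv_v):
    """Weighted Jaccard between two sparse feature vectors (dicts page -> score).

    Assumed form: sum_i min(FV_i(u), FV_i(v)) / sum_i max(FV_i(u), FV_i(v)).
    """
    keys = set(fv_u) | set(fv_v)
    shared = sum(min(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    total = sum(max(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    return shared / total if total > 0 else 0.0


# Hypothetical feature vectors: both pages received features from p and q.
fv_u = {"u": 1.0, "p": 0.5, "q": 0.25}
fv_v = {"v": 1.0, "p": 0.25, "q": 0.25}
print(jaccard_similarity(fv_u, fv_v))
```

The score is 1 for identical vectors and 0 when no feature information is shared, matching the intuition stated on the slide.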
3. Extending Link-based Similarity Measures • Extended PageSim (EPS) • Propagating feature information of web pages along in-link hyperlinks at decay rate 1-d. • Computing the in-link PS scores. • EPS(u,v) = in-link PS(u,v) + out-link PS(u,v).
3. Extending Link-based Similarity Measures • Properties • CC: Co-citation, BC: Bibliographic Coupling, • ECBC: Extended Co-citation and Bibliographic Coupling, • SR: SimRank, ESR: Extended SimRank, PS: PageSim, EPS: Extended PageSim. • Summary • The extended versions consider more structural information. • ESR and EPS are bi-directional & multi-hop. • In ESR, two web pages are not similar unless there are intermediate pages between them, even if they link to each other (see Figure 1(2)).
3. Extending Link-based Similarity Measures • Case study: Sim(a,b) • Summary • The extended algorithms are more flexible. • EPS is able to handle more cases.
4. Experimental Results • Datasets • CSE Web (CW) dataset: • A set of web pages crawled from http://cse.cuhk.edu.hk. • 22,000 pages, 180,000 hyperlinks. • The average numbers of in-links and out-links are 8.6 and 7.7. • Google Scholar (GS) dataset: • A set of articles crawled from the Google Scholar search engine. • Crawling starts by submitting the keywords “web mining” to GS and then follows the “Cited by” hyperlinks. • 20,000 articles, 154,000 citations.
4. Experimental Results • Evaluation Methods • Cosine TFIDF similarity (for CW dataset) • A commonly used text-based similarity measure. • “Related Articles” (for GS dataset) • A list of related articles to a query article provided by GS. • Can be used as ground truth. • Parameter Settings
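Both evaluation signals are easy to illustrate in Python. The sketch below assumes a plain TF-IDF weighting (raw term frequency times log(N/df)) and precision computed over the top N results against a ground-truth set. The talk does not state which TF-IDF variant was used, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse TF-IDF vector (dict term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]


def cosine(x, y):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked items that appear in the ground-truth set."""
    if n <= 0:
        return 0.0
    return sum(1 for item in ranked[:n] if item in relevant) / n


# Hypothetical tokenized pages.
docs = [["web", "mining", "links"],
        ["web", "mining", "graph"],
        ["cooking", "recipes"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # pages sharing terms
print(cosine(vecs[0], vecs[2]))  # pages with no terms in common
print(precision_at_n(["a", "b", "c", "d"], {"a", "c"}, 2))
```

For the CW dataset the cosine value would be averaged over the top N pages returned by a link-based measure; for the GS dataset the precision would be computed against the “Related Articles” list.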
4. Experimental Results • CC, BC vs ECBC • CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages. • GS data (right): x-axis: top N results; y-axis: average precision of all pages.
4. Experimental Results • SimRank vs Extended SimRank • CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages. • GS data (right): x-axis: top N results; y-axis: average precision of all pages.
4. Experimental Results • PageSim vs Extended PageSim • CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages. • GS data (right): x-axis: top N results; y-axis: average precision of all pages.
4. Experimental Results • Overall Accuracy of Algorithms
5. Conclusion and Future Work • Conclusion • Extended Neighborhood Structure model • Bi-direction and multi-hop • Extend existing link-based similarity measures • Co-citation, Bibliographic coupling, SimRank, PageSim • Experiments • Future Work • Extend link-based algorithms based on the ENS model • Prove the convergence of the Extended SimRank • Integrate link-based with text-based similarity measures
Publications • Z. Lin, M. R. Lyu, and I. King. PageSim: A novel link-based measure of web page similarity. In WWW '06: Proceedings of the 15th International Conference on World Wide Web. Pages 1019-1020, Edinburgh, Scotland, 2006. • Z. Lin, I. King, and M. R. Lyu. PageSim: A novel link-based similarity measure for the World Wide Web. In WI ’06: Proceedings of the 5th International Conference on Web Intelligence. ACM Press. To appear, 2006. • Z. Lin, M. R. Lyu, and I. King. Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure. Submitted to WWW ’07.