
Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure

Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure. Allen, Zhenjiang LIN CSE, CUHK 13 Dec 2006. Outline. 1. Introduction 2. Extended Neighborhood Structure Model 3. Extending Link-based Similarity Measures 4. Experimental Results


Presentation Transcript


  1. Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure Allen, Zhenjiang LIN CSE, CUHK 13 Dec 2006

  2. Outline • 1. Introduction • 2. Extended Neighborhood Structure Model • 3. Extending Link-based Similarity Measures • 4. Experimental Results • 5. Conclusion and Future Work

  3. 1. Introduction • Background • Similarity measures are required in many web applications to evaluate the similarity between web pages. • The “similar pages” service of Web search engines; • Web document classification; • Web community identification. • Problem • Many link-based similarity measures are not so accurate since they consider only part of the structural information.

  4. 1. Introduction • Motivation • How to improve the accuracy of link-based similarity measures by making full use of the structural information? • Contributions • Propose the Extended Neighborhood Structure (ENS) model. • bi-direction • multi-hop • Construct extended link-based similarity measures based on the ENS model. • more flexible and accurate

  5. 1. Introduction • Searching the Web • Keyword searching: the keywords “news” are submitted to a search engine, which returns http://news.bbc.co.uk/, http://www.cnn.com/, … • Similarity searching: the URL www.cnn.com is submitted to a search engine that uses a similarity measure, which returns http://news.bbc.co.uk/, http://usnews.com/, …

  6. Focus of this talk 1. Introduction • Similarity measures • Evaluate how similar or related two objects are. • Approaches to measuring similarity • Text-based • Cosine TFIDF [Joachims97] • Link-based • Bibliographic coupling [Kessler63] • Co-citation [Small73] • SimRank [Jeh et al 02], PageSim [Lin et al 06] • Hybrid

  7. 2. Extended Neighborhood Structure Model • Extended Neighborhood Structure (ENS) model • Question: what is hidden in hyperlinks? • Similarity relationships between pages; • Similarity relationships weaken along hyperlinks.

  8. 2. Extended Neighborhood Structure Model • Extended Neighborhood Structure (ENS) model • The ENS model • bi-direction • in-link • out-link • multi-hop • direct (1-hop) • indirect (2-hop, 3-hop, etc.) • Purpose • Improve the accuracy of link-based similarity measures by helping them make full use of the structural information of the Web.

  9. 3. Extending Link-based Similarity Measures • Intuition of similarity • Similar web pages have similar neighbors (to compare two web pages, look at their neighbors). • Notations • G = (V, E), |V| = n: the web graph. • I(a) / O(a): in-link / out-link neighbors of web page a. • path(a_1, a_s): a sequence of vertices a_1, a_2, …, a_s such that (a_i, a_{i+1}) ∈ E for i = 1, …, s−1 and all a_i are distinct. • PATH(a, b): the set of all possible paths from page a to page b. • Sim(a, b): similarity score of web pages a and b.

  10. 3. Extending Link-based Similarity Measures • Two classical methods • Co-citation: the more common in-link neighbors, the more similar. • Sim(a,b) = |I(a)∩I(b)| • Bibliographic coupling: the more common out-link neighbors, the more similar. • Sim(a,b) = |O(a)∩O(b)| • Extended Co-citation and Bibliographic Coupling (ECBC) • ECBC: the more common neighbors, the more similar. • Sim(a,b) = α|I(a)∩I(b)| + (1-α)|O(a)∩O(b)|, where 0≤α≤1 is a constant.
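
The three formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names (`build_neighbors`, `co_citation`, `bibliographic_coupling`, `ecbc`), the toy edge list, and the default α = 0.5 are all assumptions made for the example.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Return (I, O): in-link and out-link neighbor sets for every page."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)   # src links to dst
        in_nbrs[dst].add(src)    # dst is linked to by src
    return in_nbrs, out_nbrs

def co_citation(in_nbrs, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|"""
    return len(in_nbrs[a] & in_nbrs[b])

def bibliographic_coupling(out_nbrs, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|"""
    return len(out_nbrs[a] & out_nbrs[b])

def ecbc(in_nbrs, out_nbrs, a, b, alpha=0.5):
    """Sim(a,b) = alpha*|I(a) ∩ I(b)| + (1-alpha)*|O(a) ∩ O(b)|, 0 <= alpha <= 1."""
    return (alpha * co_citation(in_nbrs, a, b)
            + (1 - alpha) * bibliographic_coupling(out_nbrs, a, b))

# Hypothetical toy graph: p and q both link to a and b; a and b both link to x.
edges = [("p", "a"), ("p", "b"), ("q", "a"), ("q", "b"),
         ("a", "x"), ("b", "x"), ("b", "y")]
I, O = build_neighbors(edges)
cc = co_citation(I, "a", "b")
bc = bibliographic_coupling(O, "a", "b")
mixed = ecbc(I, O, "a", "b", alpha=0.5)
```

Setting α = 1 reduces ECBC to Co-citation and α = 0 reduces it to Bibliographic coupling, so a single parameter moves between the two classical measures.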

  11. 3. Extending Link-based Similarity Measures • SimRank “two pages are similar if they are linked to by similar pages” • (1) Sim(u,u)=1; (2) Sim(u,v)=0 if |I(u)| |I(v)| = 0. Recursive definition • C is a constant between 0 and 1. • The iteration starts with Sim(u,u)=1, Sim(u,v)=0 if u≠v.
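
The recursive definition is not reproduced in this transcript. The sketch below assumes the standard SimRank recurrence from [Jeh et al 02], Sim(u,v) = C / (|I(u)||I(v)|) · Σ over x ∈ I(u), y ∈ I(v) of Sim(x,y), combined with the two base cases listed on the slide. The function name, the value C = 0.8, the iteration count, and the toy graph are illustrative assumptions.

```python
def simrank(edges, C=0.8, iterations=10):
    """Iterative SimRank over a directed edge list (naive O(n^2) pair loop)."""
    nodes = sorted({v for edge in edges for v in edge})
    in_nbrs = {v: set() for v in nodes}
    for src, dst in edges:
        in_nbrs[dst].add(src)

    # Iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}

    for _ in range(iterations):
        new_sim = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new_sim[(u, v)] = 1.0
                elif not in_nbrs[u] or not in_nbrs[v]:
                    # Base case (2): no in-links on either side means score 0.
                    new_sim[(u, v)] = 0.0
                else:
                    total = sum(sim[(x, y)]
                                for x in in_nbrs[u] for y in in_nbrs[v])
                    new_sim[(u, v)] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
        sim = new_sim
    return sim

# Hypothetical toy graph: p links to both a and b.
scores = simrank([("p", "a"), ("p", "b")], C=0.8, iterations=5)
```

Here a and b share their only in-link neighbor p, so their score is C · Sim(p,p) = C, while p itself has no in-links and scores 0 against every other page.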

  12. 3. Extending Link-based Similarity Measures • Extended SimRank “two pages are similar if they have similar neighbors” • (1) Sim(u,u)=1; (2) Sim(u,v)=0 if |I(u)| |I(v)| = 0. Recursive definition • C is a constant between 0 and 1. • The iteration starts with Sim(u,u)=1, Sim(u,v)=0 if u≠v.

  13. 3. Extending Link-based Similarity Measures • PageSim “weighted multi-hop” version of Co-citation algorithm. • (a) multi-hop in-link information, and • (b) importance of web pages. • Can be represented by any global scoring system • PageRank scores, or • Authoritative scores of HITS.

  14. 3. Extending Link-based Similarity Measures • PageSim (phase 1: feature propagation) • Initially, each web page contains a unique piece of feature information, which is represented by its PageRank score. • The feature information of a web page is propagated along out-link hyperlinks at decay rate d. The PR score of u propagated to v is defined by
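
The propagation formula did not survive in this transcript, so the following is only one plausible reading of phase 1 and should not be taken as the authors' definition. It assumes that at every hop a page splits the score it is forwarding equally among its out-links and multiplies it by d, that propagation follows simple paths only (matching the path notation defined earlier), and that paths are cut off after a hypothetical `max_hops`. The input scores stand in for PageRank values and are made up.

```python
def propagate_features(out_nbrs, scores, d=0.5, max_hops=3):
    """Return FV, where FV[v][u] is the share of u's score that reaches v.

    Assumed rule (not taken from the slides): each hop multiplies the
    forwarded amount by d and divides it by the sender's out-degree.
    """
    # Every page starts with its own feature: its own score.
    fv = {v: {v: scores[v]} for v in scores}

    def walk(source, page, amount, visited, hops):
        targets = out_nbrs.get(page)
        if hops == max_hops or not targets:
            return
        share = d * amount / len(targets)
        for nxt in targets:
            if nxt in visited:          # keep paths simple (distinct vertices)
                continue
            fv[nxt][source] = fv[nxt].get(source, 0.0) + share
            walk(source, nxt, share, visited | {nxt}, hops + 1)

    for u in scores:
        walk(u, u, scores[u], {u}, 0)
    return fv

# Hypothetical toy graph and made-up importance scores.
out_nbrs = {"a": ["b", "c"], "b": ["c"], "c": []}
scores = {"a": 0.4, "b": 0.3, "c": 0.3}
fv = propagate_features(out_nbrs, scores, d=0.5)
```

In this toy graph, c receives a's feature along two paths (a→c and a→b→c), and the two contributions are summed in its feature vector.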

  15. 3. Extending Link-based Similarity Measures • PageSim (phase 2: similarity computation) • A web page v stores its own feature information and that of other pages in its Feature Vector FV(v). • The similarity between web pages u and v is computed by the Jaccard measure [Jain et al 88]. • Intuition: the more common feature information two web pages contain, the more similar they are.
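
The exact similarity formula is not shown in this transcript. As an illustration of the intuition, the sketch below assumes the generalized (weighted) Jaccard form, Σ min(FV_i(u), FV_i(v)) / Σ max(FV_i(u), FV_i(v)); the slides may use a different variant. The function name and the two feature vectors are hypothetical.

```python
def jaccard_similarity(fv_u, fv_v):
    """Weighted Jaccard between two sparse feature vectors (dicts).

    Assumed form: sum of element-wise minima over sum of element-wise maxima.
    A feature missing from a vector counts as 0.
    """
    features = set(fv_u) | set(fv_v)
    overlap = sum(min(fv_u.get(f, 0.0), fv_v.get(f, 0.0)) for f in features)
    union = sum(max(fv_u.get(f, 0.0), fv_v.get(f, 0.0)) for f in features)
    return overlap / union if union else 0.0

# Hypothetical feature vectors: page -> amount of that page's feature stored.
fv_b = {"b": 0.3, "a": 0.1}
fv_c = {"c": 0.3, "a": 0.15, "b": 0.15}
ps_bc = jaccard_similarity(fv_b, fv_c)
```

Two pages that store no common feature information score 0, and a page compared with itself scores 1, which matches the stated intuition.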

  16. 3. Extending Link-based Similarity Measures • Extended PageSim (EPS) • Propagating feature information of web pages along in-link hyperlinks at decay rate 1-d. • Computing the in-link PS scores. • EPS(u,v) = in-link PS(u,v) + out-link PS(u,v).

  17. 3. Extending Link-based Similarity Measures • Properties • CC: Co-citation, BC: Bibliographic Coupling, • ECBC: Extended Co-citation and Bibliographic Coupling, • SR: SimRank, ESR: Extended SimRank, PS: PageSim, EPS: Extended PageSim. • Summary • The extended versions consider more structural information. • ESR and EPS are bi-directional and multi-hop. • In ESR, two web pages are not similar unless there are intermediate pages between them, even if they link to each other (see Figure 1(2)).

  18. 3. Extending Link-based Similarity Measures • Case study: Sim(a,b) • Summary • The extended algorithms are more flexible. • EPS is able to handle more cases.

  19. 4. Experimental Results • Datasets • CSE Web (CW) dataset: • A set of web pages crawled from http://cse.cuhk.edu.hk. • 22,000 pages, 180,000 hyperlinks. • The average numbers of in-links and out-links are 8.6 and 7.7. • Google Scholar (GS) dataset: • A set of articles crawled from the Google Scholar search engine. • Crawling starts by submitting the keywords “web mining” to GS and then follows the “Cited by” hyperlinks. • 20,000 articles, 154,000 citations.

  20. 4. Experimental Results • Evaluation Methods • Cosine TFIDF similarity (for CW dataset) • A commonly used text-based similarity measure. • “Related Articles” (for GS dataset) • A list of related articles to a query article provided by GS. • Can be used as ground truth. • Parameter Settings

  21. 4. Experimental Results • CC, BC vs ECBC • CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages. • GS data (right): x-axis: top N results; y-axis: average precision of all pages.

  22. 4. Experimental Results • SimRank vs Extended SimRank • CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages. • GS data (right): x-axis: top N results; y-axis: average precision of all pages.

  23. 4. Experimental Results • PageSim vs Extended PageSim • CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages. • GS data (right): x-axis: top N results; y-axis: average precision of all pages.

  24. 4. Experimental Results • Overall Accuracy of Algorithms

  25. 5. Conclusion and Future Work • Conclusion • Extended Neighborhood Structure model • Bi-direction and multi-hop • Extend existing link-based similarity measures • Co-citation, Bibliographic coupling, SimRank, PageSim • Experiments • Future Work • Extend link-based algorithms based on the ENS model • Prove the convergence of Extended SimRank • Integrate link-based with text-based similarity measures

  26. Publications • Z. Lin, M. R. Lyu, and I. King. PageSim: A novel link-based measure of web page similarity. In WWW '06: Proceedings of the 15th International Conference on World Wide Web. Pages 1019-1020, Edinburgh, Scotland, 2006. • Z. Lin, I. King, and M. R. Lyu. PageSim: A novel link-based similarity measure for the World Wide Web. In WI '06: Proceedings of the 5th International Conference on Web Intelligence. ACM Press. To appear, 2006. • Z. Lin, M. R. Lyu, and I. King. Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure. Submitted to WWW '07.
