
Web Page Clustering using Heuristic Search in the Web Graph

Web Page Clustering using Heuristic Search in the Web Graph. IJCAI 07. Motivation - 1/2. The reasons for clustering search results are two-fold. Cluster hypothesis: similar documents tend to be relevant to the same requests.

delano

Presentation Transcript


  1. Web Page Clustering using Heuristic Search in the Web Graph IJCAI 07

  2. Motivation - 1/2
     • The reasons for clustering search results are two-fold:
       • Cluster hypothesis: similar documents tend to be relevant to the same requests
       • A ranked list is usually too large and contains many irrelevant documents
     • Successful in both academia and industry (vivisimo.com)
     • Organize search results into groups (clusters)
     • Topical similarity

  3. Motivation - 2/2
     • Clustering problem:
       • There is not enough contextual information on a page
       • For example: savethejaguar.com
     • Web sites can be contextually different but actually refer to the same meaning of the query
       • Michel Décary:
         • a computer scientist (www.zoominfo.com/MichelDecary),
         • a lawyer (www.stikeman.com/cgi-bin/profile.cfm?P ID=366),
         • and a chansonnier (www.decary.com).

  4. Introduction - 1/3
     • Thematic locality of the Web graph:
       • The Web graph is a directed graph in which nodes are Web pages and edges are hyperlinks
       • If page A links to page B, pages A and B tend to be semantically close
     • For example:
       • Michel Décary:
         • a computer scientist (www.zoominfo.com/MichelDecary),
         • and a chansonnier (www.decary.com).
       • cogilex.com

  5. Introduction - 2/3
     • Heuristic search:
       • The goal is to collect as much useful information as possible while crawling the Web
       • Heuristics estimate the amount of information available in a particular Web sub-graph
     • This paper uses heuristics to estimate the utility of expanding the current node in terms of leading to the target node.
     • The heuristics are meant not to reduce the search time but to improve the search accuracy.
     • Heuristics are used as filters to prune branches of search trees that are likely to establish undesired connections between unrelated Web pages.

  6. Introduction - 3/3
     • Multi-agent system:
       • Given n Web pages in the ranked list
       • n collaborative Web agents
       • Initial dataset: each agent is assigned one page
       • Each agent performs heuristic search to traverse the Web graph in order to meet as many other agents as possible.
     • Two applications:
       • Web appearance disambiguation
       • Search result clustering

  7. Multi-agent heuristic search
     • Two multi-agent heuristic search algorithms
     • Sequential Heuristic Search (SHS)
       • Frontier: a list of nodes (URLs) to be expanded (initially, the URL of the agent's source page)
       • Filter: (described later)
       • Initialize:

  8. Multi-agent heuristic search
     • The SHS algorithm
       • Simple and intuitive
     • One crucial drawback
       • There is no way to control the topology of the constructed clusters
     • In the worst case
       • If Page A --> Page B, Page B --> Page C, and Page C --> Page D,
       • pages A and D will be placed in the same cluster even though the semantic relation between them is probably weak
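
The chaining drawback described above can be illustrated with a minimal sketch. This is not the algorithm from the slides (the original pseudocode did not survive in the transcript); it assumes that each agent expands its frontier breadth-first, that a hypothetical `keep` predicate plays the role of the filter, and that two agents are merged into one cluster as soon as one of them reaches a page already reached by the other. The names `sequential_heuristic_search`, `out_links`, `keep`, and `max_hops` are all illustrative.

```python
from collections import deque


def sequential_heuristic_search(sources, out_links, keep, max_hops=3):
    """Illustrative sketch only: one agent per source page, run one after another.

    Assumption: agents whose explored regions share a page end up in one cluster.
    """
    parent = {s: s for s in sources}  # union-find over agents

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    owner = {s: s for s in sources}  # page -> agent that reached it first
    for agent in sources:
        frontier = deque([(agent, 0)])  # initially only the agent's source URL
        seen = {agent}
        while frontier:
            url, hops = frontier.popleft()
            if hops == max_hops:
                continue
            for nxt in out_links.get(url, ()):
                if nxt in seen or not keep(nxt):
                    continue  # the filter prunes this branch
                seen.add(nxt)
                if nxt in owner:
                    # Two agents met: merge their clusters.
                    parent[find(agent)] = find(owner[nxt])
                else:
                    owner[nxt] = agent
                frontier.append((nxt, hops + 1))

    clusters = {}
    for s in sources:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(c) for c in clusters.values())


# Worst case from the slide: A --> B, B --> C, C --> D.
chain = {"A": ["B"], "B": ["C"], "C": ["D"]}
print(sequential_heuristic_search(["A", "B", "C", "D"], chain, lambda u: True, max_hops=1))
```

Even with a one-hop limit, pages A and D land in the same cluster in this sketch, because merges are transitive and nothing constrains the shape of the resulting cluster.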

  9. Multi-agent heuristic search • Incremental Heuristic Search (IHS)

  10. Heuristics - 1/2
     • Two heuristics
       • Topology-driven
         • High-degree node elimination
         • Remove high out-degree pages and high in-degree pages
       • Content-driven
         • Person name heuristic

  11. Heuristics - 2/2
     • To detect high out-degree URLs
       • Using Google’s link: operator
       • Threshold: 1000 in/out hyperlinks
     • Person names consist of two, three, or four words
     • This heuristic excludes person names that are too common (again, we use Google’s link: operator)
     • In many cases, an entity tagged as a person name has millions of Google hits if it is a tagger error.
     • Examples of such entities are Price Range and Mac Os.
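
The two filters above could be sketched roughly as follows. The threshold of 1000 hyperlinks and the two-to-four-word rule come from the slide; everything else is an assumption made for illustration: the function names, the strict `>` comparison, the `max_hits` cutoff of one million (the slide only says "millions"), and the idea that link and hit counts are supplied by the caller rather than fetched from a search engine.

```python
DEGREE_THRESHOLD = 1000  # in/out hyperlink threshold stated on the slide


def is_high_degree(in_links, out_links, threshold=DEGREE_THRESHOLD):
    """Topology-driven filter: flag pages with too many in- or out-links.

    Assumption: counts are obtained elsewhere and passed in as integers.
    """
    return in_links > threshold or out_links > threshold


def plausible_person_name(entity, hit_count, max_hits=1_000_000):
    """Content-driven filter: keep two- to four-word names that are not too common.

    Assumption: max_hits is a hypothetical cutoff standing in for "millions of hits".
    """
    words = entity.split()
    return 2 <= len(words) <= 4 and hit_count < max_hits


print(is_high_degree(in_links=1500, out_links=10))
print(plausible_person_name("Price Range", hit_count=5_000_000))
```

Passing the counts in as arguments keeps the sketch independent of any particular search API, which the slides do not specify beyond naming the link: operator.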

  12. Datasets - disambiguation dataset
     • Web appearance disambiguation dataset
       • www.cs.umass.edu/~ronb
       • It consists of 1,085 Web pages retrieved for 12 names of people from Melinda Gervasio’s social network (mostly SRI engineers and university professors).
       • The dataset is labeled according to the person’s occupation.
     • The process crawled the Web starting with these 1,085 pages (source pages).
       • 7,009 pages at the first hop (a "hop" being one leg of a journey)
       • 69,454 pages at the second hop
       • 592,299 pages at the third hop

  13. One-Cluster

  14. Datasets - Jaguar dataset - 1/2
     • Problem of clustering Web search results
     • Retrieved and labeled the first 100 Google hits obtained for the query jaguar.

  15. Datasets - Jaguar dataset - 2/2
     • Jaguar dataset
       • K = 3 (car, Mac Os, and cats)
       • 883 pages at the first hop
       • 8,548 pages at the second hop
       • 56,287 pages at the third hop

  16. Agglomerative/Conglomerative Distributional Clustering (A/CDC) (Bekkerman and McCallum, 2005)

  17. Conclusion
     • This paper is the first study of heuristic search in the Web graph.
     • Heuristic search:
       • Viable in the vast domain of the WWW
       • Clustering of Web search results
       • Web appearance disambiguation

  18. Introduction - 4/4
     • Topological clustering
       • Keep only the k largest clusters: a set C of k clusters
     • Initial: each document from the original ranked list is placed into one cluster of C’
       • C’ is a set of k’ > k topical clusters
     • For each cluster $c_i \in C$, find its closest cluster $c'_j$ from C’:
       • $j = \arg\max_{j'} |c_i \cap c'_{j'}|$
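
The matching step above, which picks for each cluster in C the cluster in C’ with the largest overlap, can be sketched in a few lines. The function name `closest_clusters`, the representation of clusters as Python sets of document ids, and the tie-breaking rule (the first maximal cluster wins) are assumptions for illustration, not details given in the slides.

```python
def closest_clusters(C, C_prime):
    """For each cluster c_i in C, return the index j of the cluster in C_prime
    that maximizes the overlap |c_i ∩ c'_j|. Ties go to the lowest index."""
    return [
        max(range(len(C_prime)), key=lambda j: len(c & C_prime[j]))
        for c in C
    ]


# Toy example with made-up document ids.
C = [{1, 2, 3}, {7, 8}]
C_prime = [{1, 2}, {3, 4}, {7, 8, 9}]
print(closest_clusters(C, C_prime))  # [0, 2]
```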
