Web Document Clustering

By Sang-Cheol Seok

1. Introduction: Web document clustering? Why?

Two results for the same query ‘amazon’

  • Google: currently the most powerful search engine
  • MetaCrawler: a search engine that clusters the retrieved Web documents
2. Approaches
  • Using contents of documents
  • Using users’ usage logs
  • Using current search engines
  • Using hyperlinks
  • Other classical methods
(1) Using Contents of Documents
  • Clusters are created from the snippets returned by Web search engines.
  • Clusters based on snippets are almost as good as clusters created from the full text of the Web documents.
  • Suffix Tree Clustering (STC): an incremental, O(n)-time algorithm
  • STC has three logical steps: (1) document “cleaning”, (2) identifying base clusters using a suffix tree, and (3) combining these base clusters into clusters
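
The three STC steps above can be sketched roughly as follows. This is a simplified illustration, not the original implementation: a plain dictionary of word n-grams stands in for the suffix tree, so the sketch is not O(n). The stopword list, the `max_len` and `min_docs` parameters, and the 0.5 overlap threshold used for merging are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "for", "on"}

def clean(snippet):
    """Step 1: lowercase, strip punctuation, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", snippet.lower())
    return [w for w in words if w not in STOPWORDS]

def base_clusters(snippets, max_len=3, min_docs=2):
    """Step 2: map each shared phrase to the set of snippets containing it.

    A real STC implementation reads these sets off a suffix tree; a
    dictionary of word n-grams is used here to keep the sketch short.
    """
    index = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for start in range(len(words)):
            for length in range(1, max_len + 1):
                if start + length <= len(words):
                    index[tuple(words[start:start + length])].add(doc_id)
    return {p: docs for p, docs in index.items() if len(docs) >= min_docs}

def merge(bases, threshold=0.5):
    """Step 3: merge base clusters whose document sets overlap heavily."""
    phrases = list(bases)
    parent = list(range(len(phrases)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            a, b = bases[phrases[i]], bases[phrases[j]]
            common = len(a & b)
            if common / len(a) > threshold and common / len(b) > threshold:
                parent[find(i)] = find(j)

    merged = defaultdict(set)
    for i, phrase in enumerate(phrases):
        merged[find(i)] |= bases[phrase]
    return sorted(sorted(docs) for docs in merged.values())
```

Because a snippet can share phrases with several groups, the resulting clusters may overlap, which matches the "Overlap" requirement listed later in the slides.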
(2) Using users’ usage logs
  • Advantage: relevancy information is objectively reflected by the usage logs
  • An experimental result on www.nasa.gov/
(3) Using current Web search engines: MetaCrawler
  • Step 1: When MetaCrawler receives a query, it posts the query to multiple search engines in parallel.
  • Step 2: It performs sophisticated pruning on the returned responses (it prunes 75% of them as irrelevant, outdated, or unavailable).
  • MetaCrawler was developed at the University of Washington.
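
The two steps above can be illustrated with a small sketch. It is not MetaCrawler's actual code: the two engine functions are hypothetical stand-ins that return canned results instead of contacting real search engines, and the pruning rules (drop duplicates, drop dead links, drop hits that do not mention the query) are assumptions chosen to mirror "irrelevant, outdated, or unavailable".

```python
from concurrent.futures import ThreadPoolExecutor

def engine_a(query):
    # Hypothetical engine returning canned results.
    return [
        {"url": "http://a.example/1", "text": "amazon river facts", "alive": True},
        {"url": "http://a.example/2", "text": "cooking recipes", "alive": True},
    ]

def engine_b(query):
    # Hypothetical engine returning canned results.
    return [
        {"url": "http://a.example/1", "text": "amazon river facts", "alive": True},
        {"url": "http://b.example/9", "text": "amazon bookstore", "alive": False},
    ]

def metasearch(query, engines):
    # Step 1: post the query to every engine in parallel.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Step 2: prune duplicates, dead links, and hits lacking the query term.
    seen, kept = set(), []
    for results in result_lists:
        for hit in results:
            if hit["url"] in seen or not hit["alive"]:
                continue
            if query.lower() not in hit["text"].lower():
                continue
            seen.add(hit["url"])
            kept.append(hit)
    return kept
```

Calling `metasearch("amazon", [engine_a, engine_b])` keeps one of the four returned hits in this toy setup; the surviving hits would then be passed on to a clustering step.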
(4) Using hyperlinks
  • Consider Web documents as vertices and the hyperlinks as directed edges in a directed graph.
  • Similarity-based clustering, which has been used successfully in image segmentation, can be applied to this graph.
  • Kleinberg’s HITS algorithm
    • is based purely on hyperlink information.
    • identifies authority and hub documents for a user query.
    • covers only the most popular topics and leaves out the less popular ones.
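
A minimal sketch of the hub and authority iteration in HITS is given below, assuming the query-specific subgraph has already been collected and is passed in as a list of `(source, target)` link pairs. The function name and the fixed iteration count are choices made for this example.

```python
import math

def hits(edges, iterations=50):
    """Return (authority, hub) score dictionaries for a directed link graph."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        # Normalize both score vectors to unit length so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub
```

For the links `[("a", "c"), ("b", "c"), ("a", "d")]`, page `c` ends up with the highest authority score and page `a` with the highest hub score, since `a` points to both linked-to pages.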
(4) Using Hyperlinks: continued
  • Cluster Web documents based on both textual and hyperlink information.
  • The hyperlink structure is used as the dominant factor in the similarity metric.
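
One way to picture such a combined metric is a weighted sum in which the link term carries most of the weight. The sketch below is only an assumption about the general shape: the slides do not give the actual formula, and the choice of cosine similarity for text, Jaccard similarity over shared links, and a `link_weight` of 0.7 are all hypothetical.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def jaccard(links_a, links_b):
    """Fraction of linked pages that two documents have in common."""
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0

def combined_similarity(doc_a, doc_b, link_weight=0.7):
    # link_weight > 0.5 makes the hyperlink structure the dominant factor.
    link_sim = jaccard(doc_a["links"], doc_b["links"])
    text_sim = cosine(doc_a["text"], doc_b["text"])
    return link_weight * link_sim + (1 - link_weight) * text_sim
```

With these weights, two documents that share all their links but no words score 0.7, while two documents with identical text but no shared links score only about 0.3.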
(5) Other classical clustering methods
  • K-means method
  • HAC (hierarchical agglomerative clustering)
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  • Single-link, group-average, complete-link, and single-pass methods, as well as Buckshot and Fractionation, have also been used.
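
As a reminder of how the first of these methods works, here is a minimal k-means sketch. It assumes documents have already been turned into numeric vectors (for example, term weights); the toy 2-D points in the usage note and the deterministic choice of the first `k` points as initial centroids are simplifications made for this example.

```python
def kmeans(points, k, iterations=10):
    """Assign each point to one of k clusters; returns a list of cluster ids."""
    # Simplification: use the first k points as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

For the points `[(0, 0), (0, 1), (10, 10), (10, 11)]` and `k=2`, the two nearby pairs end up in separate clusters. Unlike STC, each document belongs to exactly one cluster here, so the result cannot overlap.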
3. Key requirements and future challenges

(1) Key requirements for Web document clustering methods

  • Relevance
  • Browsable Summaries
  • Overlap
  • Speed
  • Incrementality for some methods.
3. Key requirements and future challenges: continued

(2) Concerns on current methods

  • Each method has pros and cons.
  • Using hyperlinks: the best accuracy, with some room left to improve, but it does not produce overlapping clusters.
  • STC: best for browsing and for incrementality.
  • MetaCrawler: best at pruning.
3. Key requirements and future challenges: continued

Future challenges

  • We cannot take advantage of all the pros of every method at once.
  • Some pros work against other pros.
  • So we have to make trade-offs.
  • Moreover, we need to find further improvements.