
Finding replicated web collections


Presentation Transcript


  1. Finding replicated web collections Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina Stanford University Presented by: William Quach CSCI 572 University of Southern California

  2. Outline • Replication on the web • Importance of de-duplication in today’s Internet • Similarity • Identifying similar collections • Growing similar collections • How is this useful? • Contributions of the paper • Pros/Cons of the paper • Related work

  3. Replication on the web • Some reasons for duplication • Reliability • Performance: caching, load balancing • Archival • Anarchy on the web makes duplicating easy but finding duplicates hard. • The same page can appear under different URLs: protocol, host, domain, etc. [2] • Many aspects of mirrored sites prevent us from identifying replication by exact matching • Freshness, coverage, formats, partial crawls

  4. Importance of de-duplication in today’s Internet • The Internet grows at an extremely fast pace [1]. • Crawling becomes more and more difficult if done by brute force. • Intelligent algorithms can achieve similar results in less time using less memory. • We need these more intelligent algorithms to fully utilize the ever-growing web of information [1].

  5. Similarity • Similarity of pages • Similarity of link structure • Similarity of collections

  6. Similarity of pages • Various metrics for determining page similarity, based on… • Information retrieval • Data mining • Intuition: Textual Overlap • Count the chunks of text that overlap • Requires a threshold based on empirical data

  7. Similarity of pages • The paper uses the Textual Overlap metric • Convert the page into text • Divide the text into obvious chunks (e.g., sentences) • Hash each chunk to produce a “fingerprint” of chunks • Two pages are similar if they share more than some threshold of identical chunks (see the sketch below)
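A minimal sketch of this chunk-and-fingerprint test in Python. The sentence-based chunking, the MD5 hash, and the 0.7 overlap threshold are illustrative assumptions, not values taken from the paper.

```python
import hashlib
import re

def fingerprints(text: str) -> set[str]:
    """Split a page's text into sentence-like chunks and hash each chunk."""
    chunks = [c.strip() for c in re.split(r"[.!?]\s+", text) if c.strip()]
    return {hashlib.md5(c.lower().encode("utf-8")).hexdigest() for c in chunks}

def pages_similar(text_a: str, text_b: str, threshold: float = 0.7) -> bool:
    """Two pages count as similar if enough of their chunk fingerprints match."""
    fa, fb = fingerprints(text_a), fingerprints(text_b)
    if not fa or not fb:
        return False
    overlap = len(fa & fb) / min(len(fa), len(fb))
    return overlap >= threshold
```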

  8. Similarity of link structure • Each page must have at least one matching incoming link, unless it has no incoming links • For each page p in C1, let P1(p) be the set of pages in C1 that link to p • For the corresponding similar page p’ in C2, let P2(p’) be the set of pages in C2 that link to p’ • Then there must be pages p1 ∈ P1(p) and p2 ∈ P2(p’) such that p1 maps to p2, unless P1(p) and P2(p’) are both empty (see the sketch below)
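A minimal sketch of this incoming-link condition. Each collection is modeled here as a dict from a page to the set of pages that link to it, and `mapping` pairs each page in C1 with its similar page in C2; these data structures are illustrative assumptions, not the paper's implementation.

```python
def links_similar(incoming1: dict[str, set[str]],
                  incoming2: dict[str, set[str]],
                  mapping: dict[str, str]) -> bool:
    """Check that every mapped page shares at least one mapped incoming link."""
    for p, p_prime in mapping.items():
        preds1 = incoming1.get(p, set())
        preds2 = incoming2.get(p_prime, set())
        if not preds1 and not preds2:
            continue  # pages with no incoming links are exempt
        # At least one page linking to p must map to a page linking to p'.
        if not any(mapping.get(q) in preds2 for q in preds1):
            return False
    return True
```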

  9. Similarity of collections • Collections are similar if they have similar pages and similar link structure (a combined check is sketched below) • To control complexity, the method in the paper only considers: • Equi-sized collections • One-to-one mappings of similar pages • Terminology: • Collection: a group of linked pages (e.g., a website) • Cluster: a group of collections • Similar cluster: a group of similar collections • It is too expensive to compute the optimal set of similar clusters • Instead, start with trivial clusters and “grow” them
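A minimal sketch combining the two criteria, reusing the `pages_similar` and `links_similar` helpers sketched above. Choosing the one-to-one mapping greedily by first match is an assumption for illustration; the paper does not prescribe this strategy.

```python
def collections_similar(pages1: dict[str, str], pages2: dict[str, str],
                        incoming1: dict[str, set[str]],
                        incoming2: dict[str, set[str]]) -> bool:
    """pages1/pages2: page id -> page text; incoming1/incoming2: page id -> in-links."""
    if len(pages1) != len(pages2):
        return False                      # only equi-sized collections are considered
    mapping: dict[str, str] = {}
    used: set[str] = set()
    for p, text in pages1.items():
        # Greedily pick the first unused page in C2 with similar text (assumption).
        match = next((q for q, t in pages2.items()
                      if q not in used and pages_similar(text, t)), None)
        if match is None:
            return False                  # no one-to-one mapping of similar pages
        mapping[p] = match
        used.add(match)
    return links_similar(incoming1, incoming2, mapping)
```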

  10. Growing Clusters • Trivial clusters: similar clusters whose collections are single pages, essentially clusters of similar pages • Two clusters are merged if they form a similar cluster with larger collections • Continue until no merger can produce a similar cluster (a sketch of this greedy loop follows)
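A minimal sketch of the greedy growing loop. The `try_merge` callback is a placeholder: it must pair up the collections of the two clusters and return the combined cluster only if the result still satisfies the similarity definition (similar pages and similar link structure). It is an assumption for illustration, not the paper's implementation.

```python
from typing import Callable, List, Optional, Tuple

Collection = frozenset                     # a collection: a set of page identifiers
Cluster = Tuple[Collection, ...]           # a cluster: a tuple of collections

def grow_clusters(clusters: List[Cluster],
                  try_merge: Callable[[Cluster, Cluster], Optional[Cluster]]) -> List[Cluster]:
    """Start from trivial clusters and keep merging until no merge succeeds."""
    clusters = list(clusters)
    progress = True
    while progress:
        progress = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = try_merge(clusters[i], clusters[j])
                if merged is not None:
                    # Replace the two clusters with the merged cluster of larger collections.
                    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                    clusters.append(merged)
                    progress = True
                    break
            if progress:
                break
    return clusters
```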

  11. Growing Clusters

  12. How is this useful? • Improving crawling • This is obvious: if the crawler knows which collections are similar, it can avoid crawling the same information. Experimental results in the paper show a 48% drop in the number of similar pages crawled. • Improving querying • Filter search results to “roll up” similar pages so that more distinct pages are visible to the user on the first page (a sketch follows)
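A minimal sketch of the query-side roll-up, assuming the clustering step has already produced a lookup from page to cluster id. The `cluster_of` callback and the rank-ordered result list are assumptions for illustration.

```python
from typing import Callable, List

def roll_up(results: List[str], cluster_of: Callable[[str], int]) -> List[str]:
    """Keep only the highest-ranked page from each similar cluster."""
    seen_clusters = set()
    distinct: List[str] = []
    for url in results:                    # results are assumed to be in rank order
        cid = cluster_of(url)
        if cid not in seen_clusters:
            seen_clusters.add(cid)
            distinct.append(url)
    return distinct
```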

  13. Contributions • Clearly defined the problem and provided a basic solution. • Helps people understand the problem. • Proposed a new algorithm to identify similar collections. • Provided experimental results on the benefits of identifying similar collections to improve crawling and querying. • Proves that it is a worthwhile problem to solve. • Clearly stated trade-offs and assumptions of their algorithm, setting the stage for future work.

  14. Pros • Thoroughly defined the problem. • Presented a concise and effective algorithm to address the problem. • Clearly stated any trade-offs made so that the algorithm can be improved in future work. • Simplifications are made mainly to control complexity and keep the solution comprehensible • Left the de-simplification of their algorithm to future work

  15. Cons • Similar collections must be equi-sized. • Similar collections must have one-to-one mappings of all pages • High probability of break points: collections can become highly fragmented. • The thresholding required to determine page similarity may be a very tedious task

  16. Related Work • “Detecting Near-Duplicates for Web Crawling” (2007) [5] • Takes a lower-level, in-depth approach to determining page similarity • Hashing algorithms • A good supplement • “Do Not Crawl in the DUST: Different URLs with Similar Text” (2009) [6] • Takes a different approach that identifies URLs pointing to the same or similar content • e.g. www.myhomepage.com and www.myhomepage.com/index.html • Does not look into page content • Focuses on the “low-hanging fruit”

  17. Questions?

  18. References • [1] C. Mattmann. Characterizing the Web. CSCI 572 course lecture at USC, May 20, 2010. • [2] C. Mattmann. Deduplication. CSCI 572 course lecture at USC, June 1, 2010. • [3] M. Perkowitz and O. Etzioni. Adaptive web sites: Automatically synthesizing web pages. In Fifteenth National Conference on Artificial Intelligence, 1998. • [4] G. Salton. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983. • [5] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), Banff, Alberta, Canada, May 8-12, 2007. ACM, New York, NY, 141-150. • [6] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: Different URLs with similar text. ACM Trans. Web 3, 1 (Jan. 2009), 1-31.
