
Mercator: A Scalable, Extensible Web Crawler



Presentation Transcript


  1. Mercator: A Scalable, Extensible Web Crawler Allan Heydon and Marc Najork World Wide Web, v.2(4), p.219-229, Dec. 1999. May 23, 2006 Sun Woo Kim

  2. Content • Extensibility • Crawler traps and other hazards • Results of an extended crawl • Conclusions

  3. Extensibility • Extensibility • Extend with new functionality • New protocol and processing modules • Different versions of most of its major components • Ingredients • Interface → an abstract class • Mechanism → a configuration file • Infrastructure
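The "abstract class + configuration file" mechanism on this slide can be sketched as follows. This is an illustrative reconstruction, not Mercator's actual API: the class and config-key names are assumptions. A config entry names a concrete subclass, which is loaded by reflection and used through its abstract interface.

```java
import java.util.Map;

abstract class Analyzer {
    abstract void process(String document);
}

class TagCounter extends Analyzer {
    int tags = 0;
    @Override
    void process(String document) {
        // Count HTML tag openings as a toy stand-in for real analysis.
        for (char c : document.toCharArray()) {
            if (c == '<') tags++;
        }
    }
}

class ComponentLoader {
    // In Mercator the implementation name would come from a
    // configuration file; here the parsed config is passed in directly.
    static Analyzer loadAnalyzer(Map<String, String> config) throws Exception {
        String impl = config.get("analyzer.class");
        return (Analyzer) Class.forName(impl)
                .getDeclaredConstructor()
                .newInstance();
    }
}
```

Swapping in a different `Analyzer` then means editing the config, not the crawler.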

  4. Protocol and processing modules • Abstract Protocol class • fetch method: download the document • newURL method: parse a given string • Abstract Analyzer class • process method: process the document appropriately • Different Analyzer subclasses • GifStats • TagCounter • WebLinter: runs the Weblint program
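A hedged sketch of the abstract Protocol class described on this slide; the exact signatures are assumptions. `fetch` downloads a document, and `newURL` parses a link string relative to a base URL (`java.net.URL` already implements relative-reference resolution).

```java
import java.net.MalformedURLException;
import java.net.URL;

abstract class Protocol {
    // Download the document addressed by the given URL.
    abstract byte[] fetch(URL url) throws Exception;

    // Parse a (possibly relative) link string into an absolute URL.
    URL newURL(URL base, String spec) throws MalformedURLException {
        return new URL(base, spec);
    }
}
```

A new protocol (say, FTP or Gopher) is then just another concrete subclass.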

  5. Alternative URL frontier • Drawback on intranet • Multiple hosts might be assigned to the same thread • Solution • URL frontier component that dynamically assigns hosts to threads • Maximizes the number of busy worker threads • Is well-suited to host-limited crawls
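The idea of dynamic host assignment can be illustrated like this (not Mercator's code): the frontier keeps one FIFO queue per host, and a whole host is handed to whichever worker thread asks next, so threads stay busy even when the crawl covers few hosts.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class DynamicFrontier {
    private final Map<String, Queue<String>> perHost = new HashMap<>();
    private final Queue<String> unassignedHosts = new ArrayDeque<>();

    synchronized void add(String host, String url) {
        Queue<String> q = perHost.get(host);
        if (q == null) {
            q = new ArrayDeque<>();
            perHost.put(host, q);
            unassignedHosts.add(host);   // host becomes claimable
        }
        q.add(url);
    }

    // A worker thread claims the next host with pending URLs,
    // instead of being statically bound to a fixed set of hosts.
    synchronized String claimHost() {
        return unassignedHosts.poll();
    }

    // URLs within a claimed host are served in FIFO order.
    synchronized String nextUrl(String host) {
        Queue<String> q = perHost.get(host);
        return (q == null) ? null : q.poll();
    }
}
```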

  6. As a random walker • Random walker • Starts at a random page taken from a set of seeds • The next page is selected by choosing a random link • Differences • A page may be revisited multiple times • Only one link is followed each time • To support random walking • A new URL frontier • Records only the URLs discovered in the most recently fetched document • Document fingerprint set • Never rejects documents as already having been seen
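The random-walk policy reduces to: follow exactly one randomly chosen link per fetched page, and never reject a page as already seen. A minimal sketch (link extraction is assumed to happen elsewhere):

```java
import java.util.List;
import java.util.Random;

class RandomWalker {
    private final Random rng;

    RandomWalker(long seed) {
        this.rng = new Random(seed);
    }

    // Pick the single link to follow next; null signals a dead end,
    // at which point the walk would restart from a random seed page.
    String nextUrl(List<String> linksOnPage) {
        if (linksOnPage.isEmpty()) return null;
        return linksOnPage.get(rng.nextInt(linksOnPage.size()));
    }
}
```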

  7. URL aliases • Four causes • Host name aliases → canonicalize • coke.com and cocacola.com → 203.134.241.178 • Omitted port numbers → default value: 80 • Alternative paths on the same host → cannot avoid • digital.com/index.html and digital.com/home.html • Replication across different hosts → cannot avoid • Mirror sites • Cannot avoid → content-seen test
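The two causes a crawler can fix mechanically are shown below as a sketch: filling in the default port and normalizing the host's case. Resolving host-name aliases like coke.com vs. cocacola.com additionally requires a DNS lookup (omitted here), and the remaining two causes fall to the content-seen test. This is illustrative, not Mercator's code.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlCanonicalizer {
    static String canonicalize(String spec) throws MalformedURLException {
        URL u = new URL(spec);
        // Omitted port numbers get the scheme's default (80 for HTTP).
        int port = (u.getPort() == -1) ? u.getDefaultPort() : u.getPort();
        // An empty path is equivalent to "/".
        String path = u.getPath().isEmpty() ? "/" : u.getPath();
        return u.getProtocol() + "://" + u.getHost().toLowerCase()
                + ":" + port + path;
    }
}
```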

  8. Session IDs embedded in URLs • Session identifiers • To track the browsing behavior of their visitors • Create a potentially infinite set of URLs • Represent a special case of alternative paths • Handled by the document fingerprinting technique
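Because two session-ID URLs deliver the same bytes, fingerprinting the document content catches them regardless of the URL. A toy version of such a content-seen test (Mercator's actual fingerprints were not SHA-1; the hash here is a stand-in):

```java
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

class ContentSeenTest {
    private final Set<String> fingerprints = new HashSet<>();

    // Returns true only the first time this exact content is seen,
    // no matter which URL (or session ID) it was fetched under.
    boolean firstTimeSeen(byte[] document) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(document)) {
            hex.append(String.format("%02x", b));
        }
        return fingerprints.add(hex.toString());
    }
}
```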

  9. Crawler traps • Crawler trap • Cause a crawler to crawl indefinitely • Unintentional: symbolic link • Intentional: trap using CGI programs • Antispam traps, traps to catch search engine crawlers • Solution • No automatic technique • But traps are easily noticed • Manually exclude the site • Using the customizable URL filter
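The "manually exclude via the customizable URL filter" workflow might look like the sketch below (illustrative; the depth heuristic is an assumption, added because symbolic-link traps typically show up as ever-deeper paths).

```java
import java.util.HashSet;
import java.util.Set;

class UrlFilter {
    private final Set<String> excludedHosts = new HashSet<>();

    // An operator who notices a trap blacklists its host by hand.
    void excludeHost(String host) {
        excludedHosts.add(host);
    }

    boolean accept(String host, String path) {
        if (excludedHosts.contains(host)) return false;
        // Cheap symptom check: reject suspiciously deep paths.
        return path.chars().filter(c -> c == '/').count() <= 20;
    }
}
```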

  10. Performance • Digital Ultimate Workstation • Two 533 MHz Alpha processors • 2 GB of RAM and 118 GB of local disk • Run in May 1999 • 77.4 million HTTP requests in 8 days • 112 docs/sec and 1,682 KB/sec • CPU cycle • 37%: JIT-compiled Java bytecode • 19%: Java runtime • 44%: Unix kernel
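As a quick arithmetic check of this slide's figures: 77.4 million HTTP requests over 8 days works out to the quoted ~112 documents per second.

```java
class ThroughputCheck {
    // requests / (days in seconds), rounded to whole docs/sec
    static long docsPerSec(double requests, int days) {
        return Math.round(requests / (days * 24 * 3600));
    }
}
```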

  11. Selected Web statistics (1) • Relationship between URLs and HTTP requests

  12. Selected Web statistics (2) • Breakdown of HTTP status codes

  13. Selected Web statistics (3) • Size of successfully downloaded documents

  14. Selected Web statistics (4) • Distribution of MIME types

  15. Conclusions • Use of Java • Made implementation easier and more elegant • Threads, garbage collection, objects, exceptions, etc. • Scalability • Extensibility • Fin.
