This presentation covers Mercator, a scalable and extensible web crawler developed by Allan Heydon and Marc Najork. It discusses the crawler's extensibility, crawler traps and other hazards, the results of an extended crawl, and the conclusions drawn from the work. Mercator is designed for flexibility: new functionality, protocols, and processing modules can be added, and different versions of most major components can be swapped in, supported by a configuration-driven infrastructure. The slides also describe an alternative URL frontier that improves performance on host-limited crawls, the challenges of URL aliases, session IDs, and crawler traps, and performance statistics from an eight-day crawl. Overall, implementing the crawler in Java made the work easier and more elegant while remaining scalable.
Mercator: A Scalable, Extensible Web Crawler
Allan Heydon and Marc Najork
World Wide Web, vol. 2, no. 4, pp. 219-229, Dec. 1999.
May 23, 2006, Sun Woo Kim
Content
• Extensibility
• Crawler traps and other hazards
• Results of an extended crawl
• Conclusions
Extensibility
• Extensibility
  • Extend the crawler with new functionality
  • New protocol and processing modules
  • Different versions of most of its major components
• Ingredients (a sketch of how they fit together follows below)
  • Interface: an abstract class per component type
  • Mechanism: a configuration file specifying which components to use
  • Infrastructure for writing and plugging in new modules
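A minimal sketch, with invented class and property names rather than Mercator's actual API, of how the three ingredients could fit together: an abstract class defines the component interface, a configuration file names the concrete subclass, and the infrastructure instantiates it by reflection.

```java
import java.io.FileInputStream;
import java.util.Properties;

// Interface ingredient: an abstract class per component type (hypothetical).
abstract class UrlFilter {
    // Decide whether a discovered URL should be crawled.
    public abstract boolean accept(String url);
}

// One concrete version of the component.
class DefaultUrlFilter extends UrlFilter {
    @Override
    public boolean accept(String url) {
        return true;  // default: accept everything
    }
}

// Infrastructure ingredient: reads the configuration file and instantiates
// whichever subclass it names, so new versions can be plugged in without
// changing the crawler core.
public class ComponentLoader {
    public static UrlFilter loadUrlFilter(String configFile) throws Exception {
        Properties config = new Properties();
        try (FileInputStream in = new FileInputStream(configFile)) {
            config.load(in);               // e.g. urlfilter.class=DefaultUrlFilter
        }
        String className = config.getProperty("urlfilter.class", "DefaultUrlFilter");
        return (UrlFilter) Class.forName(className)
                                .getDeclaredConstructor()
                                .newInstance();
    }
}
```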
Protocol and processing modules
• Abstract Protocol class (see the sketch below)
  • fetch method: downloads the document
  • newURL method: parses a given string into a URL
• Abstract Analyzer class
  • process method: processes the downloaded document appropriately
• Different Analyzer subclasses
  • GifStats
  • TagCounter
  • WebLinter: runs the Weblint program
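A minimal sketch of the two abstract classes named on this slide. Only the class and method names (Protocol.fetch, Protocol.newURL, Analyzer.process) come from the paper; the parameter lists are assumptions made for illustration.

```java
import java.io.InputStream;
import java.net.URL;

// Protocol modules know how to retrieve documents for one URL scheme (such as HTTP).
abstract class Protocol {
    // Download the document addressed by the given URL.
    public abstract InputStream fetch(URL url) throws Exception;

    // Parse a string, possibly relative to a base URL, into a URL object.
    public abstract URL newURL(URL base, String spec) throws Exception;
}

// Analyzer modules process downloaded documents in module-specific ways.
abstract class Analyzer {
    public abstract void process(URL url, String contentType, InputStream doc)
            throws Exception;
}

// Example subclass in the spirit of TagCounter: tallies HTML tags.
class TagCounter extends Analyzer {
    @Override
    public void process(URL url, String contentType, InputStream doc) {
        // ... scan the stream and count occurrences of each tag ...
    }
}
```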
Alternative URL frontier
• Drawback on an intranet crawl
  • Multiple hosts might be assigned to the same worker thread, leaving other threads idle
• Solution (a rough sketch follows below)
  • A URL frontier component that dynamically assigns hosts to worker threads
  • Maximizes the number of busy worker threads
  • Well-suited to host-limited crawls
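A rough sketch, with invented names, of what dynamic host assignment could look like: instead of binding each worker thread to a fixed set of hosts by hashing the host name, an idle thread claims the next host that no other thread is crawling. Returning a finished host to the claimable pool is omitted to keep the sketch short.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class DynamicHostFrontier {
    // One FIFO queue of pending URLs per host.
    private final Map<String, Queue<String>> perHostQueues = new HashMap<>();
    // Hosts that have pending URLs but no worker thread assigned yet.
    private final Queue<String> unassignedHosts = new ArrayDeque<>();

    public synchronized void addUrl(String host, String url) {
        Queue<String> q = perHostQueues.get(host);
        if (q == null) {
            q = new ArrayDeque<>();
            perHostQueues.put(host, q);
            unassignedHosts.add(host);     // host becomes claimable
        }
        q.add(url);
    }

    // Called by an idle worker thread: take ownership of the next free host,
    // or get null if every known host is already being crawled.
    public synchronized String claimHost() {
        return unassignedHosts.poll();
    }

    // The owning thread drains its host's queue one URL at a time.
    public synchronized String nextUrlFor(String host) {
        Queue<String> q = perHostQueues.get(host);
        return (q == null) ? null : q.poll();
    }
}
```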
As a random walker
• Random walker
  • Starts at a random page taken from a set of seeds
  • The next page is selected by choosing a random link from the current page
• Differences from a regular crawl
  • A page may be revisited multiple times
  • Only one link is followed each time
• To support random walking
  • A new URL frontier that records only the URLs discovered in the most recently fetched document
  • A document fingerprint set that never rejects documents as already having been seen
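A minimal sketch of the random-walk frontier described above, again with invented names: it keeps only the links of the most recently fetched page and picks one uniformly at random, restarting from a random seed when the current page has no outgoing links. It assumes a non-empty seed list.

```java
import java.util.List;
import java.util.Random;

public class RandomWalkFrontier {
    private final List<String> seeds;            // starting points of the walk
    private List<String> lastPageLinks = List.of();
    private final Random rng = new Random();

    public RandomWalkFrontier(List<String> seeds) {
        this.seeds = seeds;
    }

    // Replace the frontier's contents with the links of the page just fetched;
    // earlier discoveries are deliberately forgotten.
    public void recordLinks(List<String> linksOfFetchedPage) {
        lastPageLinks = linksOfFetchedPage;
    }

    // Follow exactly one randomly chosen link; restart at a seed if the
    // current page had no links.
    public String nextUrl() {
        List<String> pool = lastPageLinks.isEmpty() ? seeds : lastPageLinks;
        return pool.get(rng.nextInt(pool.size()));
    }
}
```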
URL aliases
• Four causes
  • Host name aliases: canonicalized; coke.com and cocacola.com both resolve to 203.134.241.178
  • Omitted port numbers: filled in with the default value of 80
  • Alternative paths on the same host: cannot be avoided; e.g. digital.com/index.html and digital.com/home.html
  • Replication across different hosts: cannot be avoided; e.g. mirror sites
• The last two cases can only be caught by the content-seen test
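A minimal sketch, using an invented helper, of normalizing away the two avoidable causes: host-name aliases collapse once the host is resolved to its IP address, and an omitted port is filled in with the HTTP default of 80. The other two causes are left to the content-seen test.

```java
import java.net.InetAddress;
import java.net.URL;

public class UrlCanonicalizer {
    public static String canonicalize(String spec) throws Exception {
        URL url = new URL(spec);
        // Host-name aliases: coke.com and cocacola.com resolve to the same
        // address, so the canonical form uses the resolved IP.
        String host = InetAddress.getByName(url.getHost()).getHostAddress();
        // Omitted port numbers: substitute the HTTP default.
        int port = (url.getPort() == -1) ? 80 : url.getPort();
        return url.getProtocol() + "://" + host + ":" + port + url.getFile();
    }
}
```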
Session IDs embedded in URLs
• Session identifiers
  • Used by sites to track the browsing behavior of their visitors
  • Create a potentially infinite set of URLs
  • Represent a special case of alternative paths
• Handled by the document fingerprinting (content-seen) technique
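A minimal sketch of a content-seen test based on document fingerprints, which is what catches session-ID aliases: two URLs differing only in their session identifier return the same content, so their fingerprints collide and the duplicate is discarded. SHA-1 of the body is used here purely as a stand-in; the slide does not specify the fingerprint function.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class ContentSeenTest {
    private final Set<BigInteger> fingerprints = new HashSet<>();

    // Returns true if a document with this exact content was seen before.
    public synchronized boolean seenBefore(byte[] documentBody) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        BigInteger fp = new BigInteger(1, md.digest(documentBody));
        return !fingerprints.add(fp);   // add() is false if already present
    }
}
```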
Crawler traps
• Crawler trap
  • Causes a crawler to crawl indefinitely
  • Unintentional: e.g. a symbolic link that creates a cycle
  • Intentional: traps written using CGI programs
  • Antispam traps, traps to catch search engine crawlers
• Solution
  • No automatic technique for avoiding traps
  • But traps are easily noticed
  • Manually exclude the trap site using the customizable URL filter (see the sketch below)
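A minimal sketch of excluding a manually identified trap site through a customizable URL filter. The class and its exclusion mechanism are illustrative; the slide only says that trap sites are excluded by hand via the URL filter.

```java
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class TrapExcludingUrlFilter {
    private final Set<String> excludedHosts = new HashSet<>();

    // The operator adds a host after noticing a trap in the crawl logs.
    public void excludeHost(String host) {
        excludedHosts.add(host.toLowerCase());
    }

    // Reject URLs on excluded hosts (and malformed URLs).
    public boolean accept(String spec) {
        try {
            URL url = new URL(spec);
            return !excludedHosts.contains(url.getHost().toLowerCase());
        } catch (Exception e) {
            return false;
        }
    }
}
```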
Performance
• Digital Ultimate Workstation
  • Two 533 MHz Alpha processors
  • 2 GB of RAM and 118 GB of local disk
• Crawl run in May 1999
  • 77.4 million HTTP requests in 8 days
  • 112 docs/sec and 1,682 KB/sec
• CPU cycles spent
  • 37%: JIT-compiled Java bytecode
  • 19%: Java runtime
  • 44%: Unix kernel
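As a quick consistency check on these figures: 77.4 million HTTP requests over 8 days is 77,400,000 / (8 × 86,400 s) ≈ 112 requests per second, which matches the reported rate of 112 docs/sec.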
Selected Web statistics (1)
• Relationship between URLs and HTTP requests
Selected Web statistics (2)
• Breakdown of HTTP status codes
Selected Web statistics (3)
• Size of successfully downloaded documents
Selected Web statistics (4)
• Distribution of MIME types
Conclusions
• Use of Java
  • Made the implementation easier and more elegant
  • Threads, garbage collection, objects, exceptions, etc.
• Scalability
• Extensibility
Fin.