
A Brief Look at Web Crawlers


Presentation Transcript


  1. A Brief Look at Web Crawlers Bin Tan 03/15/07

  2. Web Crawlers • “… is a program or automated script which browses the World Wide Web in a methodical, automated manner” • Uses: • Create an archive / index from the visited web pages to support offline browsing / search / mining. • Automating maintenance tasks on a website • Harvesting specific information from web pages

  3. High-level architecture (diagram: seed URLs feed the frontier, the queue of URLs still to be crawled)
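
The diagram boils down to a simple loop: seed URLs are placed in the frontier; the crawler repeatedly takes a URL off the frontier, downloads the page, extracts its links, and appends unseen links back onto the frontier. A minimal breadth-first sketch in Python, with a crude stand-in for link extraction and no error handling (both shortcomings are exactly what the following slides are about):

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def extract_links(base_url, html):
        # Stand-in extractor: a crude regex over href attributes; a real
        # crawler would use an HTML parser (see the appendix on parsers).
        return [urljoin(base_url, href)
                for href in re.findall(r'href="([^"]+)"', html)]

    def crawl(seeds, max_pages=100):
        frontier = deque(seeds)   # URLs waiting to be fetched
        seen = set(seeds)         # URLs already queued, to avoid refetching
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()              # FIFO order = breadth-first
            html = urlopen(url).read().decode("utf-8", errors="replace")
            fetched += 1
            # ... archive / index the page here ...
            for link in extract_links(url, html):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)

    crawl(["http://www.uiuc.edu/"])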

  4. How easy is it to write a program to crawl all uiuc.edu web pages?

  5. All sorts of real problems: • Managing multiple download threads is nontrivial • If you make requests to a server at short intervals, you’ll overload it • Pages may be missing; servers may be down or sluggish • You may get trapped in dynamically generated pages • Web pages may use ill-formed HTML
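
Several of these problems can be contained at the fetch step alone. A hedged sketch (the function and its parameters are my own naming, not taken from any particular crawler) that adds a timeout, catches network errors, skips non-HTML responses, and caps the amount read, so one dead or sluggish server cannot stall the crawl:

    from urllib.request import urlopen
    from urllib.error import URLError, HTTPError

    def fetch(url, timeout=10, max_bytes=1_000_000):
        """Download one page defensively; return its text, or None on failure."""
        try:
            with urlopen(url, timeout=timeout) as resp:
                if "html" not in resp.headers.get("Content-Type", ""):
                    return None              # skip images, PDFs, etc.
                data = resp.read(max_bytes)  # cap size against huge or endless pages
        except (HTTPError, URLError, OSError):
            return None                      # missing page, or server down/unreachable
        # Decode leniently; ill-formed markup is left to the HTML parser later on.
        return data.decode("utf-8", errors="replace")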

  6. This is only a small-scale crawl… • (Shkapenyuk and Suel, 2002): "While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability."

  7. Data characteristics in large-scale crawls • Large volume, fast changes, dynamic page generation: a wide selection of possibly crawlable URLs • Edwards et al.: "Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained."

  8. Selection policy: which page to download • Need to prioritize according to some page-importance metric • Depth-first • Breadth-first • Partial PageRank calculation • OPIC (On-line Page Importance Computation) • Length of per-site queues • In focused crawling, predicted similarity between page text and the query
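
These policies differ mainly in how the frontier orders its URLs, so they can share one mechanism: a priority queue with a pluggable scoring function. A minimal sketch (the class and the metadata fields are illustrative, not taken from any specific crawler); breadth-first falls out of scoring by crawl depth, while a partial-PageRank or OPIC crawler would plug in an importance estimate instead:

    import heapq
    import itertools

    class Frontier:
        """URL frontier ordered by a pluggable priority score (lower = fetched sooner)."""

        def __init__(self, score):
            self.score = score                 # function: (url, metadata) -> number
            self.heap = []                     # entries: (score, tie-breaker, url)
            self.counter = itertools.count()   # tie-breaker keeps ordering stable

        def push(self, url, metadata):
            heapq.heappush(self.heap,
                           (self.score(url, metadata), next(self.counter), url))

        def pop(self):
            return heapq.heappop(self.heap)[2]

    # Breadth-first: the priority is simply the depth at which the URL was found.
    bfs = Frontier(score=lambda url, meta: meta["depth"])

    # Importance-driven: e.g. negated OPIC cash or a partial PageRank estimate,
    # so higher-importance pages come out of the queue first.
    opic = Frontier(score=lambda url, meta: -meta.get("importance", 0.0))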

  9. Revisit policy: when to check for changes to the pages • Pages are frequently updated, created or deleted • Objectives to optimize: • Freshness (0 for stale pages, 1 for fresh pages), to be kept high on average • Age (amount of time for which a page has been stale), to be kept low on average
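
Both quantities are defined per page as functions of time: freshness is 1 while the local copy still matches the live page and drops to 0 once the live page changes; age is 0 while the copy is fresh and otherwise the time elapsed since the live page last changed. A small sketch of those definitions (assuming, unrealistically, that we know when the live page last changed):

    def freshness(last_crawl_time, last_change_time):
        """1 if our stored copy still matches the live page, else 0 (stale)."""
        return 1 if last_crawl_time >= last_change_time else 0

    def age(now, last_crawl_time, last_change_time):
        """0 while the copy is fresh; otherwise time since the live page changed."""
        if last_crawl_time >= last_change_time:
            return 0
        return now - last_change_time

    # Example: the page changed at t=50 but we last crawled it at t=40,
    # so at t=70 the copy is stale and has been out of date for 20 time units.
    assert freshness(40, 50) == 0
    assert age(70, 40, 50) == 20

In practice the change time is not directly observable, which is why the revisit policies on the next slide reason about change rates and expected freshness instead.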

  10. Revisit Policy (cont.) • Uniform policy: revisit all pages in the collection with the same frequency • Proportional policy: revisit pages that change more frequently more often • The optimal policy for keeping average freshness high includes ignoring the pages that change too often; the optimal policy for keeping average age low uses access frequencies that increase monotonically (and sub-linearly) with each page's rate of change • Numerical methods are used for the calculation, based on the distribution of page changes
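
The two simple policies can be written as one-line scheduling rules. A toy sketch (change_rate is an estimated change rate for a page; the constants are arbitrary):

    def uniform_interval(change_rate, base_interval=7.0):
        """Uniform policy: every page is revisited at the same fixed interval."""
        return base_interval

    def proportional_interval(change_rate, budget=7.0):
        """Proportional policy: pages that change faster are revisited more often."""
        return budget / max(change_rate, 1e-9)

    # Under the proportional policy a page that changes twice as often is
    # revisited twice as often; per the slide above, neither rule is optimal.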

  11. Politeness policy: how to avoid overloading websites • Badly behaved crawlers can be a nuisance • Robots exclusion protocol (robots.txt), e.g. Google’s robots.txt • Interval/delay between connections (10 sec – 5 min) • fixed • proportional to page download time
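
Python's standard library covers the robots.txt side, and the interval policy only requires remembering when each host was last contacted. A sketch using urllib.robotparser (the user-agent name and the 10-second default are my own choices; crawl_delay() is available in newer Python versions and returns None when the site sets no Crawl-delay):

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    AGENT = "MyCrawler"   # hypothetical user-agent name
    robots = {}           # host -> parsed robots.txt
    last_hit = {}         # host -> time of the last request to that host

    def robots_for(host):
        """Fetch and cache the host's robots.txt."""
        if host not in robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()
            robots[host] = rp
        return robots[host]

    def polite_fetch_ok(url, default_delay=10):
        """Return True when url may be fetched now, sleeping first if necessary."""
        host = urlparse(url).netloc
        rp = robots_for(host)
        if not rp.can_fetch(AGENT, url):
            return False                                 # robots.txt forbids this URL
        delay = rp.crawl_delay(AGENT) or default_delay   # honor Crawl-delay if present
        wait = delay - (time.time() - last_hit.get(host, 0))
        if wait > 0:
            time.sleep(wait)                             # fixed-interval politeness
        last_hit[host] = time.time()
        return True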

  12. Parallelization policy: how to coordinate distributed web crawlers • Nutch: "A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages"
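
One common coordination scheme is to partition the URL space statically, for example by hashing the host name, so that each crawler process owns a disjoint set of sites and politeness bookkeeping never has to be shared. A generic sketch of that assignment (an illustration of the idea, not Nutch's specific mechanism):

    import hashlib
    from urllib.parse import urlparse

    def assign_crawler(url, num_crawlers):
        """Map a URL to one of num_crawlers processes by hashing its host name."""
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_crawlers

    # All URLs from the same host land on the same crawler, so per-host
    # politeness state stays local to that process.
    assert assign_crawler("http://cs.uiuc.edu/a.html", 4) == \
           assign_crawler("http://cs.uiuc.edu/b.html", 4)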

  13. Crawling the deep web • Many web spiders run by popular search engines ignore URLs with a query string • Google’s Sitemap protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling • Also: mod-oai is an Apache module that allows web crawlers to efficiently discover new, modified, and deleted web resources from a web server by using OAI-PMH, a protocol which is widely used in the digital libraries community
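
A Sitemap is an XML file listing URLs, optionally with metadata such as the last-modification time, under the http://www.sitemaps.org/schemas/sitemap/0.9 namespace, so a crawler can fold it into its frontier with a few lines of parsing. A sketch using the standard library (the sitemap content here is made up for illustration):

    import xml.etree.ElementTree as ET

    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    # Made-up sitemap; a crawler would normally download this from a location
    # advertised by the webmaster or submitted to the search engine.
    sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.edu/catalog?course=cs410</loc>
        <lastmod>2007-03-01</lastmod>
      </url>
    </urlset>"""

    root = ET.fromstring(sitemap_xml)
    for entry in root.findall(NS + "url"):
        loc = entry.findtext(NS + "loc")          # a URL the crawler might otherwise miss
        lastmod = entry.findtext(NS + "lastmod")  # can feed the revisit policy
        print(loc, lastmod)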

  14. Example Web Crawler Software • Wget • Heritrix • Nutch • others

  15. Wget • Command-line tool, non-extensible • Config: recursive downloading • Config: spanning hosts • Breadth-first for HTTP, depth-first for FTP • Config: include/exclude filters • Updates outdated pages based on timestamps • Supports robots.txt protocol • Config: connection delay • Single-threaded

  16. Heritrix • Heritrix is the Internet Archive’s web crawler, designed specifically for web archiving • License: LGPL • Written in Java

  17. Features • Highly modular; easily extensible • Scales to large data volume • Implemented selection policies: • Breadth-first with options to throttle activity against particular hosts and to bias towards finishing hosts in progress or cycling among all hosts with pending URLs • Domain sensitive: allows specifying an upper-bound on the number of pages downloaded per site • Adaptive revisiting: repeatedly visit all encountered URLs (wait time between visits configurable) • Implements fixed / proportional connection delay • Detailed documentation • Web-based UI for crawler administration

  18. Nutch • Nutch is an effort to build an open source search engine based on Lucene for the search and index component. • License: Apache 2.0 • Written in Java

  19. Features • Modular; extensible • Breadth-first • Includes parsing and indexing components • Implements a MapReduce facility and a distributed file system (Hadoop)

  20. Recrawl command lines (assumes $webdb_dir, $segments_dir, $adddays and depth are already set)

     # The generate/fetch/update cycle
     for ((i = 1; i <= depth; i++))
     do
       # select URLs due for fetching into a new segment
       bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
       # the newest segment directory is the one just generated
       segment=`ls -d $segments_dir/* | tail -1`
       # fetch the pages listed in that segment
       bin/nutch fetch $segment
       # merge the fetch results back into the web database
       bin/nutch updatedb $webdb_dir $segment
     done

  21. Appendix: Parsers • HTML: • lynx -dump • Beautiful Soup (Python) • tidylib (C) • PDF • xpdf • Others • Nutch plugins • Office API (Windows)
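
For the HTML case specifically, even the standard library's html.parser tolerates much of the ill-formed markup mentioned earlier; Beautiful Soup and tidylib add more aggressive error recovery on top of the same idea. A sketch of link extraction with html.parser:

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collect href targets of <a> tags, resolved against a base URL."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    parser = LinkExtractor("http://www.uiuc.edu/")
    parser.feed("<p>Broken markup <a href='/admissions'>Admissions</a> <b>unclosed")
    print(parser.links)   # ['http://www.uiuc.edu/admissions']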
