
A Brief Look at Web Crawlers


Presentation Transcript


  1. A Brief Look at Web Crawlers Bin Tan 03/15/07

  2. Web Crawlers • “… is a program or automated script which browses the World Wide Web in a methodical, automated manner” • Uses: • Create an archive / index from the visited web pages to support offline browsing / search / mining. • Automating maintenance tasks on a website • Harvesting specific information from web pages

  3. High-level architecture (diagram: seed URLs feed the frontier, the queue of URLs still to be crawled)
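
The diagram boils down to a simple loop: seed URLs are placed in the frontier; the crawler repeatedly takes a URL off the frontier, downloads the page, extracts its links, and appends unseen links back onto the frontier. A minimal breadth-first sketch in Python, with a crude stand-in for link extraction and no error handling (both shortcomings are exactly what the following slides are about):

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def extract_links(base_url, html):
        # Stand-in extractor: a crude regex over href attributes; a real
        # crawler would use an HTML parser (see the appendix on parsers).
        return [urljoin(base_url, href)
                for href in re.findall(r'href="([^"]+)"', html)]

    def crawl(seeds, max_pages=100):
        frontier = deque(seeds)   # URLs waiting to be fetched
        seen = set(seeds)         # URLs already queued, to avoid refetching
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()              # FIFO order = breadth-first
            html = urlopen(url).read().decode("utf-8", errors="replace")
            fetched += 1
            # ... archive / index the page here ...
            for link in extract_links(url, html):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)

    crawl(["http://www.uiuc.edu/"])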

  4. How easy is it to write a program to crawl all uiuc.edu web pages?

  5. All sorts of real problems: • Managing multiple download threads is nontrivial • If you make requests to a server at short intervals, you’ll overload it • Pages may be missing; servers may be down or sluggish • You may get trapped in dynamically generated pages • Web pages may use ill-formed HTML
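
Several of these problems can be contained at the fetch step alone. A hedged sketch (the function and its parameters are my own naming, not taken from any particular crawler) that adds a timeout, catches network errors, skips non-HTML responses, and caps the amount read, so one dead or sluggish server cannot stall the crawl:

    from urllib.request import urlopen
    from urllib.error import URLError, HTTPError

    def fetch(url, timeout=10, max_bytes=1_000_000):
        """Download one page defensively; return its text, or None on failure."""
        try:
            with urlopen(url, timeout=timeout) as resp:
                if "html" not in resp.headers.get("Content-Type", ""):
                    return None              # skip images, PDFs, etc.
                data = resp.read(max_bytes)  # cap size against huge or endless pages
        except (HTTPError, URLError, OSError):
            return None                      # missing page, or server down/unreachable
        # Decode leniently; ill-formed markup is left to the HTML parser later on.
        return data.decode("utf-8", errors="replace")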

  6. This is only a small-scale crawl… • (Shkapenyuk and Suel, 2002): "While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability."

  7. Data characteristics in large-scale crawls • Large volume, fast changes, dynamic page generation: a wide selection of possibly crawlable URLs • Edwards et al.: "Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained."

  8. Selection policy: which page to download • Need to prioritize according to some page-importance metric • Depth-first • Breadth-first • Partial PageRank calculation • OPIC (On-line Page Importance Computation) • Length of per-site queues • In focused crawling, predicted similarity between page text and the query
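
These policies differ mainly in how the frontier orders its URLs, so they can share one mechanism: a priority queue with a pluggable scoring function. A minimal sketch (the class and the metadata fields are illustrative, not taken from any specific crawler); breadth-first falls out of scoring by crawl depth, while a partial-PageRank or OPIC crawler would plug in an importance estimate instead:

    import heapq
    import itertools

    class Frontier:
        """URL frontier ordered by a pluggable priority score (lower = fetched sooner)."""

        def __init__(self, score):
            self.score = score                 # function: (url, metadata) -> number
            self.heap = []                     # entries: (score, tie-breaker, url)
            self.counter = itertools.count()   # tie-breaker keeps ordering stable

        def push(self, url, metadata):
            heapq.heappush(self.heap,
                           (self.score(url, metadata), next(self.counter), url))

        def pop(self):
            return heapq.heappop(self.heap)[2]

    # Breadth-first: the priority is simply the depth at which the URL was found.
    bfs = Frontier(score=lambda url, meta: meta["depth"])

    # Importance-driven: e.g. negated OPIC cash or a partial PageRank estimate,
    # so higher-importance pages come out of the queue first.
    opic = Frontier(score=lambda url, meta: -meta.get("importance", 0.0))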

  9. Revisit policy: when to check for changes to the pages • Pages are frequently updated, created or deleted • Objectives to optimize: • Freshness (0 for stale pages, 1 for fresh pages), to be kept high on average • Age (amount of time for which a page has been stale), to be kept low on average
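
Both quantities are defined per page as functions of time: freshness is 1 while the local copy still matches the live page and drops to 0 once the live page changes; age is 0 while the copy is fresh and otherwise the time elapsed since the live page last changed. A small sketch of those definitions (assuming, unrealistically, that we know when the live page last changed):

    def freshness(last_crawl_time, last_change_time):
        """1 if our stored copy still matches the live page, else 0 (stale)."""
        return 1 if last_crawl_time >= last_change_time else 0

    def age(now, last_crawl_time, last_change_time):
        """0 while the copy is fresh; otherwise time since the live page changed."""
        if last_crawl_time >= last_change_time:
            return 0
        return now - last_change_time

    # Example: the page changed at t=50 but we last crawled it at t=40,
    # so at t=70 the copy is stale and has been out of date for 20 time units.
    assert freshness(40, 50) == 0
    assert age(70, 40, 50) == 20

In practice the change time is not directly observable, which is why the revisit policies on the next slide reason about change rates and expected freshness instead.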

  10. Revisit Policy (cont.) • Uniform policy: revisit all pages in the collection with the same frequency • Proportional policy: revisit pages that change more frequently more often • The optimal policy for keeping average freshness high includes ignoring the pages that change too often; the optimal policy for keeping average age low uses access frequencies that increase monotonically (and sub-linearly) with each page's rate of change • Numerical methods are used for the calculation, based on the distribution of page changes
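
The two simple policies can be written as one-line scheduling rules. A toy sketch (change_rate is an estimated change rate for a page; the constants are arbitrary):

    def uniform_interval(change_rate, base_interval=7.0):
        """Uniform policy: every page is revisited at the same fixed interval."""
        return base_interval

    def proportional_interval(change_rate, budget=7.0):
        """Proportional policy: pages that change faster are revisited more often."""
        return budget / max(change_rate, 1e-9)

    # Under the proportional policy a page that changes twice as often is
    # revisited twice as often; per the slide above, neither rule is optimal.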

  11. Politeness policy: how to avoid overloading websites • Badly behaved crawlers can be a nuisance • Robots exclusion protocol (robots.txt), e.g. Google’s robots.txt • Interval/delay between connections (10 sec – 5 min) • fixed • proportional to page download time
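
Python's standard library covers the robots.txt side, and the interval policy only requires remembering when each host was last contacted. A sketch using urllib.robotparser (the user-agent name and the 10-second default are my own choices; crawl_delay() is available in newer Python versions and returns None when the site sets no Crawl-delay):

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    AGENT = "MyCrawler"   # hypothetical user-agent name
    robots = {}           # host -> parsed robots.txt
    last_hit = {}         # host -> time of the last request to that host

    def robots_for(host):
        """Fetch and cache the host's robots.txt."""
        if host not in robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()
            robots[host] = rp
        return robots[host]

    def polite_fetch_ok(url, default_delay=10):
        """Return True when url may be fetched now, sleeping first if necessary."""
        host = urlparse(url).netloc
        rp = robots_for(host)
        if not rp.can_fetch(AGENT, url):
            return False                                 # robots.txt forbids this URL
        delay = rp.crawl_delay(AGENT) or default_delay   # honor Crawl-delay if present
        wait = delay - (time.time() - last_hit.get(host, 0))
        if wait > 0:
            time.sleep(wait)                             # fixed-interval politeness
        last_hit[host] = time.time()
        return True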

  12. Parallelization policy: how to coordinate distributed web crawlers • Nutch: "A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages"
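
One common coordination scheme is to partition the URL space statically, for example by hashing the host name, so that each crawler process owns a disjoint set of sites and politeness bookkeeping never has to be shared. A generic sketch of that assignment (an illustration of the idea, not Nutch's specific mechanism):

    import hashlib
    from urllib.parse import urlparse

    def assign_crawler(url, num_crawlers):
        """Map a URL to one of num_crawlers processes by hashing its host name."""
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_crawlers

    # All URLs from the same host land on the same crawler, so per-host
    # politeness state stays local to that process.
    assert assign_crawler("http://cs.uiuc.edu/a.html", 4) == \
           assign_crawler("http://cs.uiuc.edu/b.html", 4)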

  13. Crawling the deep web • Many web spiders run by popular search engines ignore URLs with a query string • Google’s Sitemap protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling • Also: mod-oai is an Apache module that allows web crawlers to efficiently discover new, modified, and deleted web resources from a web server by using OAI-PMH, a protocol which is widely used in the digital libraries community
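
A Sitemap is an XML file listing URLs, optionally with metadata such as the last-modification time, under the http://www.sitemaps.org/schemas/sitemap/0.9 namespace, so a crawler can fold it into its frontier with a few lines of parsing. A sketch using the standard library (the sitemap content here is made up for illustration):

    import xml.etree.ElementTree as ET

    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    # Made-up sitemap; a crawler would normally download this from a location
    # advertised by the webmaster or submitted to the search engine.
    sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.edu/catalog?course=cs410</loc>
        <lastmod>2007-03-01</lastmod>
      </url>
    </urlset>"""

    root = ET.fromstring(sitemap_xml)
    for entry in root.findall(NS + "url"):
        loc = entry.findtext(NS + "loc")          # a URL the crawler might otherwise miss
        lastmod = entry.findtext(NS + "lastmod")  # can feed the revisit policy
        print(loc, lastmod)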

  14. Example Web Crawler Software • Wget • Heritrix • Nutch • others

  15. Wget • Command-line tool, non-extensible • Config: recursive downloading • Config: spanning hosts • Breadth-first for HTTP, depth-first for FTP • Config: include/exclude filters • Updates outdated pages based on timestamps • Supports robots.txt protocol • Config: connection delay • Single-threaded

  16. Heritrix • Heritrix is the Internet Archive’s web crawler, designed specifically for web archiving • License: LGPL • Written in Java

  17. Features • Highly modular; easily extensible • Scales to large data volume • Implemented selection policies: • Breadth-first with options to throttle activity against particular hosts and to bias towards finishing hosts in progress or cycling among all hosts with pending URLs • Domain sensitive: allows specifying an upper-bound on the number of pages downloaded per site • Adaptive revisiting: repeatedly visit all encountered URLs (wait time between visits configurable) • Implements fixed / proportional connection delay • Detailed documentation • Web-based UI for crawler administration

  18. Nutch • Nutch is an effort to build an open source search engine based on Lucene for the search and index component. • License: Apache 2.0 • Written in Java

  19. Features • Modular; extensible • Breadth-first • Includes parsing and indexing components • Implements a MapReduce facility and a distributed file system (Hadoop)

  20. Recrawl command lines (assumes $webdb_dir, $segments_dir, $adddays and depth are already set)

     # The generate/fetch/update cycle
     for ((i = 1; i <= depth; i++))
     do
       # select URLs due for fetching into a new segment
       bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
       # the newest segment directory is the one just generated
       segment=`ls -d $segments_dir/* | tail -1`
       # fetch the pages listed in that segment
       bin/nutch fetch $segment
       # merge the fetch results back into the web database
       bin/nutch updatedb $webdb_dir $segment
     done

  21. Appendix: Parsers • HTML: • lynx -dump • Beautiful Soup (Python) • tidylib (C) • PDF • xpdf • Others • Nutch plugins • Office API (Windows)
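
For the HTML case specifically, even the standard library's html.parser tolerates much of the ill-formed markup mentioned earlier; Beautiful Soup and tidylib add more aggressive error recovery on top of the same idea. A sketch of link extraction with html.parser:

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collect href targets of <a> tags, resolved against a base URL."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    parser = LinkExtractor("http://www.uiuc.edu/")
    parser.feed("<p>Broken markup <a href='/admissions'>Admissions</a> <b>unclosed")
    print(parser.links)   # ['http://www.uiuc.edu/admissions']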
