
Mercator: A scalable, extensible Web crawler

Allan Heydon and Marc Najork, World Wide Web, 1999

2006. 5. 23

Young Geun Han



Contents

  • Introduction

  • Related Work

  • Architecture of a scalable Web crawler

  • Extensibility

  • Crawler traps and other hazards

  • Results of an extended crawl

  • Conclusions



1. Introduction

  • The motivations of this work

    • Due to the competitive nature of the search engine business, Web crawler design is not well-documented in the literature

    • To collect statistics about the Web

  • Mercator, a scalable, extensible Web crawler

    • By scalable

      • Mercator is designed to scale up to the entire Web

      • They achieve scalability by implementing their data structures so that they use a bounded amount of memory, regardless of the size of the crawl

      • The vast majority of their data structures are stored on disk, and small parts of them are stored in memory for efficiency

    • By extensible

      • Mercator is designed in a modular way, with the expectation that new functionality will be added by third parties


2. Related work (1)

  • Web crawlers are almost as old as the Web itself

    • The first crawler, Matthew Gray’s Wanderer, appeared in 1993

      (roughly coinciding with the first release of NCSA Mosaic)

  • Google search engine [Brin and Page 1998; Google]

    • A distributed system that uses multiple machines for crawling

    • The crawler consists of five functional components (shown in the slide’s figure):

      • URL Server: reads URLs and forwards them to multiple crawler processes

      • Crawler: runs on a different machine, uses asynchronous I/O to fetch data from up to 300 Web servers, and transmits the pages to a Store Server

      • Store Server: compresses the pages and stores them to disk (the Repository)

      • Indexer: reads the pages from disk, extracts the links from HTML pages, and saves them (Anchors) to a different disk file

      • URL Resolver: reads the link file, resolves the URLs, and saves the absolute URLs

      (Other structures in the figure: Doc index, Barrels, Sorter)



2. Related work (2)

  • Internet Archive [Burner 1997; InternetArchive]

    • The Internet Archive also uses multiple machines to crawl the Web

    • Each crawler process is assigned up to 64 sites to crawl

    • Each crawler reads a list of seed URLs and uses asynchronous I/O to fetch pages from per-site queues in parallel

    • When a page is downloaded, the crawler extracts the links and adds them to the appropriate site queue

    • Using a batch process, it merges “cross-site” URLs into the site-specific seed sets, filtering out duplicates in the process

  • SPHINX [Miller and Bharat 1998]

    • The SPHINX system provides some customizability features

      (a mechanism for limiting which pages are crawled, and document processing code)

    • SPHINX is targeted towards site-specific crawling, and is therefore not designed to be scalable



3. Architecture of a scalable Web crawler

  • The basic algorithm of any Web crawler takes a list of seed URLs as its input and repeatedly executes the following steps

    • Remove a URL from the URL list

    • Determine the IP address of its host name

    • Download the corresponding document

    • Extract any links contained in the document

    • For each of the extracted links, ensure that it is an absolute URL

    • Add the URL to the list of URLs to download, provided it has not been encountered before

  • Functional components

    • a component (URL frontier) for storing the list of URLs to download

    • a component for resolving host names into IP addresses

    • a component for downloading documents using the HTTP protocol

    • a component for extracting links from HTML documents

    • a component for determining whether a URL has been encountered before
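As a rough illustration only, the following Java sketch ties these steps and components together. The class, method, and helper names are hypothetical (not Mercator's actual code), and the HTTP download and link-extraction components are left as stubs:

// Sketch only: hypothetical names and stubbed-out helpers, not Mercator's actual classes.
import java.net.InetAddress;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {
    private final Queue<String> frontier = new ArrayDeque<>(); // the list of URLs to download
    private final Set<String> seen = new HashSet<>();          // "encountered before?" test

    public void crawl(List<String> seeds) throws Exception {
        seeds.forEach(this::enqueue);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();                                    // remove a URL from the list
            InetAddress ip = InetAddress.getByName(new URL(url).getHost());  // resolve the host name
            String document = download(url, ip);                             // download via HTTP
            for (String link : extractLinks(document)) {                     // extract contained links
                String absolute = new URL(new URL(url), link).toString();    // ensure it is absolute
                enqueue(absolute);                                           // add if not seen before
            }
        }
    }

    private void enqueue(String url) {
        if (seen.add(url)) frontier.add(url);
    }

    // Stubs standing in for the HTTP download and HTML link-extraction components.
    private String download(String url, InetAddress ip) { return ""; }
    private List<String> extractLinks(String document) { return List.of(); }
}

A real crawler replaces the in-memory queue and set with the disk-backed frontier and URL set described in the following slides.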


3.1 Mercator’s components (1)

[Figure 1. Mercator’s main components: the URL Frontier (backed by queue files); the DNS Resolver; the protocol modules (HTTP, FTP, Gopher), which fetch documents from the Internet; the RIS (rewind input stream); the Content Seen? test with its set of document FPs; the processing modules (Link Extractor, Tag Counter, GIF Stats); the URL Filter; the URL Seen? test with its URL Set; and log files. The numbered arrows 1–8 trace one pass of a worker thread through these components.]


3 1 mercator s components 2

3.1 Mercator’s components (2)

  • The first step of this loop is to remove an absolute URL from the shared URL frontier for downloading

  • The protocol module's fetch method downloads the document from the Internet into a per-thread RewindInputStream (RIS)

  • The worker thread invokes the content-seen test to determine whether this document has been seen before

  • Based on the downloaded document's MIME type, the worker invokes the process method of each processing module associated with that MIME type (see the dispatch sketch below)

  • Each extracted link is converted into an absolute URL, and tested against a user-supplied URL filter to determine if it should be downloaded

  • The worker performs the URL-seen test, which checks if the URL has been seen before

  • If the URL is new, it is added to the frontier

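The MIME-type dispatch can be pictured with the following hedged Java sketch. The ProcessingModule interface and the registry class are illustrative assumptions; the slides only state that the process method of each module associated with the document's MIME type is invoked, and that the cached document can be rewound (Section 3.4) so each module reads it from the beginning:

// Sketch only: illustrative types for the MIME-type -> processing-module dispatch.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for Mercator's rewind input stream (see Section 3.4).
interface Rewindable {
    void rewind();
}

interface ProcessingModule {
    void process(Rewindable document, String url) throws Exception;
}

class ProcessingModuleRegistry {
    // e.g. "text/html" -> [link extractor, tag counter], "image/gif" -> [GIF stats]
    private final Map<String, List<ProcessingModule>> byMimeType = new HashMap<>();

    void register(String mimeType, ProcessingModule module) {
        byMimeType.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(module);
    }

    // Invoke the process method of every module associated with the document's MIME type,
    // rewinding the cached document so each module reads it from the beginning.
    void dispatch(String mimeType, Rewindable document, String url) throws Exception {
        for (ProcessingModule module : byMimeType.getOrDefault(mimeType, List.of())) {
            document.rewind();
            module.process(document, url);
        }
    }
}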


3.2 The URL frontier

  • The URL frontier is the data structure that contains all the URLs that remain to be downloaded

  • To implement the politeness constraint, the default version of Mercator’s URL frontier is implemented by a collection of distinct FIFO subqueues

    • There is one FIFO subqueue per worker thread

    • When a new URL is added, the FIFO subqueue in which it is placed is determined by the URL’s canonical host name

[Figure: example with three worker threads, each owning one FIFO subqueue — one holding naver.com URLs (a.html at the head, then b.html, c.html), one holding daum.net URLs (A.html at the head, then B.html), and one holding www.ssu.ac.kr. Each thread’s HTTP protocol module only downloads the URL at the head of its own subqueue, so each Web server sees at most one connection at a time.]
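A minimal sketch of that queueing policy follows. The hash-based mapping from canonical host name to subqueue index is an assumption for illustration, and Mercator's real frontier also keeps its queues on disk, which is omitted here:

// Sketch only: one FIFO subqueue per worker thread, selected by the URL's canonical host name.
import java.net.URL;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class UrlFrontier {
    private final List<Queue<String>> subqueues;   // subqueue i is consumed only by worker thread i

    UrlFrontier(int numWorkerThreads) {
        subqueues = IntStream.range(0, numWorkerThreads)
                .mapToObj(i -> (Queue<String>) new ConcurrentLinkedQueue<String>())
                .collect(Collectors.toList());
    }

    // Politeness: every URL of a given host lands in the same subqueue, so at most one
    // worker thread is downloading from that host at any time.
    void add(String url) throws Exception {
        String host = new URL(url).getHost().toLowerCase();            // canonical host name (simplified)
        int index = Math.floorMod(host.hashCode(), subqueues.size());
        subqueues.get(index).add(url);
    }

    // Worker thread i removes URLs only from its own subqueue (null if it is currently empty).
    String remove(int workerIndex) {
        return subqueues.get(workerIndex).poll();
    }
}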


3.3 The HTTP protocol module

  • The purpose of a protocol module is to fetch the document corresponding to a given URL using the appropriate network protocol

  • Network protocols supported by Mercator include HTTP, FTP, and Gopher

  • Mercator implements the Robots Exclusion Protocol

  • To avoid downloading the Robots Exclusion file (robots.txt) on every request, Mercator's HTTP protocol module maintains a fixed-sized cache mapping host names to their robots exclusion rules

    • The cache holds 2^18 entries and uses an LRU replacement strategy, for example:

      Host name        Robots exclusion rules (User-agent, Disallow)   LRU value (e.g. date)
      www.naver.com    *, /tmp/                                        2006.05.23/09:00 (1)
      www.daum.net     googlebot, /cafe/                               2006.05.23/09:20 (2)
      www.ssu.ac.kr    (none)                                          2006.05.23/10:00 (3)
      www.google.com   *, /calendar/                                   2006.05.23/10:10 (4)

  • Mercator uses its own "lean and mean" HTTP protocol module

    • Its requests time out after 1 minute, and it has minimal synchronization and allocation overhead
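A small sketch of such a cache is below. This version uses a LinkedHashMap in access order as the LRU structure, which is only a stand-in for whatever replacement machinery Mercator actually uses; the 2^18 capacity follows the slide, and the robots.txt fetch is stubbed out:

// Sketch only: a fixed-size cache from host name to robots exclusion rules with LRU eviction.
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RobotsCache {
    private static final int CAPACITY = 1 << 18;   // 2^18 entries, as on the slide

    // accessOrder = true makes the LinkedHashMap evict in least-recently-used order.
    private final Map<String, List<String>> disallowedPrefixes =
            Collections.synchronizedMap(new LinkedHashMap<String, List<String>>(1024, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                    return size() > CAPACITY;
                }
            });

    // Return the cached rules for a host, downloading and parsing /robots.txt only on a miss.
    List<String> rulesFor(String host) {
        return disallowedPrefixes.computeIfAbsent(host, this::fetchRobotsTxt);
    }

    // A URL path may be fetched only if no Disallow prefix for this host matches it.
    boolean allowed(String host, String path) {
        return rulesFor(host).stream().noneMatch(path::startsWith);
    }

    private List<String> fetchRobotsTxt(String host) {
        // Stub: download http://host/robots.txt and collect the Disallow prefixes that
        // apply to this crawler's User-agent.
        return List.of();
    }
}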


3.4 Rewind input stream

[Figure: a worker thread (1) takes a URL from the frontier, (2) the HTTP protocol module initializes a RIS with the downloaded contents, and (3) the RIS is passed to the processing modules (Link Extractor, Tag Counter, GIF Stats), each of which rewinds it to re-read the text, GIF, and link content.]

  • Mercator’s design allows the same document to be processed by multiple processing modules

  • To avoid reading a document over the network multiple times, Mercator caches the document locally using an abstraction called a RewindInputStream

  • A RIS caches small documents (64 KB or less) entirely in memory, while larger documents are temporarily written to a backing file (limit 1 MB)

  • A RIS also provides a method for rewinding its position to the beginning of the stream, and various lexing methods that make it easy to build MIME-type-specific parsers
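A hedged sketch of the RIS idea, with the 64 KB in-memory threshold and the 1 MB overall limit from the slide. The constructor and method names are illustrative, not Mercator's actual API, and the lexing helpers are omitted:

// Sketch only: buffer up to 64 KB in memory, spill larger documents to a backing file,
// stop caching past 1 MB; rewind() lets several processing modules re-read the document.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class RewindInputStream {
    private static final int MEMORY_LIMIT = 64 * 1024;    // documents up to 64 KB stay in memory
    private static final int TOTAL_LIMIT  = 1024 * 1024;  // larger documents are cut off at 1 MB

    private byte[] inMemory;       // used for small documents
    private Path backingFile;      // used for large documents
    private InputStream current;   // stream over the cached copy

    RewindInputStream(InputStream network) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int total = 0, n;
        OutputStream spill = null;
        while (total < TOTAL_LIMIT
                && (n = network.read(chunk, 0, Math.min(chunk.length, TOTAL_LIMIT - total))) != -1) {
            total += n;
            if (spill == null && total > MEMORY_LIMIT) {   // too big: switch to a backing file
                backingFile = Files.createTempFile("ris", ".tmp");
                spill = Files.newOutputStream(backingFile);
                buffer.writeTo(spill);                     // move what was buffered so far
            }
            if (spill != null) spill.write(chunk, 0, n);
            else buffer.write(chunk, 0, n);
        }
        if (spill != null) spill.close();
        else inMemory = buffer.toByteArray();
        rewind();
    }

    // Rewind to the beginning of the stream so another processing module can re-read it.
    final void rewind() throws IOException {
        if (current != null) current.close();
        current = (inMemory != null) ? new ByteArrayInputStream(inMemory)
                                     : Files.newInputStream(backingFile);
    }

    int read() throws IOException { return current.read(); }
}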


3.5 Content-seen test (1)

[Figure: the same document contents available under multiple URLs and on mirrored servers (SERVER A and SERVER B) — e.g. www.ssu.ac.kr/index.html, www3.ssu.ac.kr/index.html, and it.ssu.ac.kr/index.html all serving the same page.]

  • A Web crawler may download the same document contents multiple times

    • Many documents are available under multiple, different URLs

    • There are also many cases in which documents are mirrored on multiple servers

  • To prevent processing a document more than once, a Web crawler may wish to perform a content-seen test to decide if the document has already been processed

  • To save space and time, Mercator uses a data structure called the document fingerprint set that stores a 64-bit checksum of the contents of each downloaded document

  • Mercator computes the checksum using Broder’s implementation [Broder 1993] of Rabin’s fingerprinting algorithm [Rabin 1981]

  • Fingerprints offer provably strong probabilistic guarantees that two different strings will not have the same fingerprint


3.5 Content-seen test (2)

  • Mercator maintains two independent sets of fingerprints

    • A small hash table kept in memory

    • A large sorted list kept in a single disk file, accessed via Java’s random access file API with an in-memory index of the disk file; the structure is protected by a readers-writer lock

[Figure: the content-seen test (1) checks the new document’s fingerprint against the in-memory hash table; (2) if not seen there, it checks the disk file via the in-memory index; (3) if still not seen, the new fingerprint is added to the in-memory table; (4) new fingerprints accumulate until the hash table fills up, at which point (5) its contents are merged with the fingerprints on disk and (6) the in-memory index of the disk file is updated.]
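A simplified sketch of this two-level structure follows. The in-memory table size is arbitrary, the sorted disk file is modelled as a sorted in-memory array, and the readers-writer locking is omitted:

// Sketch only: a small in-memory table of recent fingerprints plus a large sorted list
// (standing in for Mercator's disk file of fingerprints). Locking and I/O are omitted.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DocumentFingerprintSet {
    private static final int MEMORY_LIMIT = 1 << 16;  // illustrative size of the in-memory table

    private final Set<Long> recent = new HashSet<>(); // small hash table kept in memory
    private long[] onDisk = new long[0];              // large sorted list (kept on disk in Mercator)

    // Returns true if the fingerprint was already present (document content seen before).
    boolean containsOrAdd(long fingerprint) {
        if (recent.contains(fingerprint)) return true;                    // 1. check memory
        if (Arrays.binarySearch(onDisk, fingerprint) >= 0) return true;   // 2. check the disk file
        recent.add(fingerprint);                                          // 3. remember the new FP
        if (recent.size() >= MEMORY_LIMIT) merge();                       // 4-5. merge when full
        return false;
    }

    // Merge the in-memory table into the sorted list (6. rebuild the index, implicit here).
    private void merge() {
        long[] merged = new long[onDisk.length + recent.size()];
        System.arraycopy(onDisk, 0, merged, 0, onDisk.length);
        int i = onDisk.length;
        for (long fp : recent) merged[i++] = fp;
        Arrays.sort(merged);
        onDisk = merged;
        recent.clear();
    }
}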


3.6 URL filters

  • The URL filtering mechanism provides a customizable way to control the set of URLs that are downloaded

  • The URL filter class has a single crawl method that takes a URL and returns a boolean value indicating whether or not to crawl that URL

  • The filter sits between the link extractor (RIS) and the URL-seen test, so only links that pass it move on towards the URL frontier

  • Example: a filter restricted to the domains www.ssu.ac.kr and www.naver.com would return true for URLs on those hosts and false for URLs on other hosts such as www.daum.net
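A minimal sketch of this interface together with a domain-restricting filter like the one in the example above (the interface and class names are illustrative, only the URL-in, boolean-out contract comes from the slide):

// Sketch only: a URL filter with a single crawl method that decides whether to download a URL.
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Set;

interface UrlFilter {
    boolean crawl(String url);   // true = download this URL, false = skip it
}

// Example filter: restrict the crawl to a fixed set of hosts.
class DomainUrlFilter implements UrlFilter {
    private final Set<String> allowedHosts;

    DomainUrlFilter(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    @Override
    public boolean crawl(String url) {
        try {
            return allowedHosts.contains(new URL(url).getHost().toLowerCase());
        } catch (MalformedURLException e) {
            return false;        // unparsable URLs are never crawled
        }
    }
}

// Usage: new DomainUrlFilter(Set.of("www.ssu.ac.kr", "www.naver.com")).crawl("http://www.daum.net/") -> false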


3.7 Domain name resolution

  • Before contacting a Web server, a Web crawler must use the DNS to map the host name into an IP address

  • The Java interface to DNS lookups and the DNS interface on most Unix systems are synchronized, so a multi-threaded crawler’s DNS requests are effectively serialized

  • Mercator tried to alleviate the DNS bottleneck by caching DNS results, but that was only partially effective

  • To avoid the DNS bottleneck, Mercator uses its own multi-threaded DNS resolver that can resolve host names much more rapidly than either the Java or Unix resolver

    • DNS lookups had accounted for 87% of each thread’s elapsed time; the multi-threaded resolver reduced that to 25%

[Figure: with the synchronized resolver, worker threads’ DNS requests queue up behind a single outstanding request; with Mercator’s multi-threaded DNS resolver, many requests are issued to the Internet in parallel.]
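A rough sketch of a multi-threaded, caching resolver is shown below. This is not Mercator's actual resolver (which implements its own DNS client); it simply runs standard blocking lookups on a pool of resolver threads so that many lookups can be in flight at once, and modern JDK resolvers may no longer serialize lookups the way the 1999-era interface did:

// Sketch only: a cache plus a pool of resolver threads issuing blocking lookups in parallel.
import java.net.InetAddress;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelDnsResolver {
    private final ConcurrentMap<String, InetAddress> cache = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(50); // e.g. 50 resolver threads

    Future<InetAddress> resolve(String host) {
        InetAddress cached = cache.get(host);
        if (cached != null) return CompletableFuture.completedFuture(cached); // cache hit: no lookup
        return pool.submit(() -> {
            InetAddress addr = InetAddress.getByName(host);                   // blocking DNS lookup
            cache.put(host, addr);
            return addr;
        });
    }
}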


3.8 URL-seen test (1)

  • To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link

  • To perform the URL-seen test, all of the URLs seen by Mercator are stored in canonical form in a large table called the URL set

  • To save space, Mercator doesn’t store the textual representation of each URL in the URL set, but rather a fixed-sized checksum

  • To reduce the number of operations on the backing disk file, Mercator keeps an in-memory cache of popular URLs

[Figure: the URL-seen test sits between the URL filter and the URL frontier; it checks an in-memory cache of popular URLs and a table of recently-added URLs before touching the disk-based URL set. Adding extracted links to the frontier without this test would result in a much larger frontier.]


3.8 URL-seen test (2)

  • Unlike the fingerprints, the stream of URLs has a non-trivial amount of locality (URL locality)

  • Using an in-memory cache of 2^18 entries and the LRU-like clock replacement policy, URL-seen requests are satisfied as follows:

      In-memory cache of popular URLs     66.2%
      Table of recently-added URLs        9.5%
      Buffer of the disk file             8%
      Missed requests (go to disk)        16%

  • Each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set

    (Each membership test on the URL set results in an average of 0.16 seek and 0.17 read kernel calls)



3.8 URL-seen test (3)

  • Host name locality

    • Host name locality arises because many links found in Web pages are to different documents on the same server

    • To preserve the locality, they compute the checksum of a URL by merging two independent fingerprints

      • The fingerprint of the URL’s host name

      • The fingerprint of the complete URL

    • These two fingerprints are merged so that the high-order bits of the checksum derive from the host name fingerprint

    • As a result, checksums for URLs with the same host component are numerically close together

    • The host name locality in the stream of URLs translates into access locality on the URL set’s backing disk file, allowing the kernel’s file system buffers to service read requests from memory more often

    • On extended crawls, this technique results in a significant reduction in disk load, and in a significant performance improvement
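A sketch of this host-aware checksum follows. A generic 64-bit string hash stands in for Broder's Rabin fingerprint implementation, and the 24-bit width reserved for the host fingerprint is an assumed value; the slides only say that the high-order bits derive from the host name:

// Sketch only: high-order bits of the checksum come from the host name fingerprint, low-order
// bits from the fingerprint of the complete URL, so URLs on the same host get numerically
// close checksums (and hence nearby positions in the sorted URL set's disk file).
import java.net.URL;

class HostAwareChecksum {
    private static final int HOST_BITS = 24;   // assumed split; the paper does not give exact widths

    static long checksum(String url) throws Exception {
        long hostFp = fingerprint(new URL(url).getHost());  // fingerprint of the URL's host name
        long urlFp  = fingerprint(url);                     // fingerprint of the complete URL
        return (hostFp << (64 - HOST_BITS)) | (urlFp >>> HOST_BITS);
    }

    // Stand-in for a Rabin fingerprint: any well-mixed 64-bit hash works for illustration.
    private static long fingerprint(String s) {
        long h = 1125899906842597L;              // arbitrary odd seed constant
        for (int i = 0; i < s.length(); i++) h = 31 * h + s.charAt(i);
        return h;
    }
}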



3.9 Synchronous vs. asynchronous I/O

  • Google and Internet Archive crawlers

    • Use single-threaded crawling processes and asynchronous I/O to perform multiple downloads in parallel

    • They are designed from the ground up to scale to multiple machines

  • Mercator

    • Uses a multi-threaded process in which each thread performs synchronous I/O

      (It leads to a much simpler program structure)

  • It would not be too difficult to adapt Mercator to run on multiple machines

[Figure: Google and Internet Archive crawlers — one or more single-threaded crawler processes per machine, spread over multiple machines, each process multiplexing many Web servers via asynchronous I/O. Mercator — a single multi-threaded process on one machine, with each thread talking synchronously to one Web server at a time.]
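A tiny sketch of Mercator's side of this trade-off: a fixed pool of worker threads, each doing a plain blocking download. The single shared BlockingQueue here is only a stand-in for the per-host frontier of Section 3.2, and the processing pipeline is elided:

// Sketch only: many worker threads in one process, each performing synchronous (blocking) I/O,
// instead of one thread multiplexing many connections with asynchronous I/O.
import java.io.InputStream;
import java.net.URL;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SynchronousWorkers {
    void start(int numThreads, BlockingQueue<String> frontier) {
        ExecutorService workers = Executors.newFixedThreadPool(numThreads);
        for (int i = 0; i < numThreads; i++) {
            workers.submit(() -> {
                while (true) {
                    try {
                        String url = frontier.take();                       // blocks until a URL is available
                        try (InputStream in = new URL(url).openStream()) {  // synchronous download
                            // hand the stream to a RewindInputStream and the processing modules...
                        }
                    } catch (Exception e) {
                        // log and continue; one failed download should not kill the worker thread
                    }
                }
            });
        }
    }
}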



3.10 Checkpointing

  • To complete a crawl of the entire Web, Mercator writes regular snapshots of its state to disk

  • An interrupted or aborted crawl can easily be restarted from the latest checkpoint

  • Mercator’s core classes and all user-supplied modules are required to implement the checkpointing interface

  • Checkpointing is coordinated using a global readers-writer lock

  • Each worker thread acquires a read share of the lock while processing a downloaded document

  • Once a day, Mercator’s main thread acquires a write share of the lock; once it has acquired the lock, it arranges for the checkpoint methods of the core classes and the user-supplied modules to be called
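A sketch of this coordination using a standard readers-writer lock. The Checkpointable interface and the coordinator class are illustrative stand-ins for Mercator's checkpointing interface, and the once-a-day scheduling is left to the caller:

// Sketch only: worker threads hold a read share of a global lock while processing a document;
// the main thread takes the write lock to pause them and then writes a checkpoint.
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

interface Checkpointable {
    void checkpoint() throws Exception;   // core classes and user-supplied modules implement this
}

class CheckpointCoordinator {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<Checkpointable> modules;

    CheckpointCoordinator(List<Checkpointable> modules) {
        this.modules = modules;
    }

    // Called by each worker thread around the processing of one downloaded document.
    void processDocument(Runnable work) {
        lock.readLock().lock();
        try {
            work.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Called (e.g. once a day) by the main thread: blocks until no document is in flight,
    // then invokes every module's checkpoint method.
    void checkpointAll() throws Exception {
        lock.writeLock().lock();
        try {
            for (Checkpointable m : modules) m.checkpoint();
        } finally {
            lock.writeLock().unlock();
        }
    }
}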

