WebBase : A repository of web pages

Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, Andreas Paepcke
Computer Science Department, Stanford University
By: Maria Fragouli, Athens 2002




Web repository:

stores, manages large collections of web pages,

is used by applications that access, mine or index up-to-date web content

Basic implementation goals:

Scalability: use of network disks to hold the repository so that it can scale with the growth of the web

Streams: support for a streaming (ordered) access mode, as opposed to random access, to serve requests for pages in bulk rather than requests for individual pages

Large updates: new, updated versions of pages must efficiently replace older ones

Expunging pages: obsolete pages need to be detected and removed


WebBase: prototype repository – Stanford University

We study:

Repository architecture for required functionality – performance

Distribution policies of web pages across network disks

Interaction of crawler-repository

Organization strategies of web pages on system nodes

Experimental results of simulations on prototype

Design Assumptions for the Web repository

  • Incremental crawler: only new or changed web pages are visited at each run
  • Retain only the latest version of each page.
  • Crawl and store only HTML pages
  • Snapshot index construction
WebBase Architecture II: Functional modules and their interaction

Crawler module: retrieves new or updated copies of web pages

Storage module: assigns pages to storage devices,

handles updates of pages,

schedules, services requests, etc.

Metadata-Indexing module: indexes pages and metadata extracted from them

Query engine and Multicast module: handle web content according to the access mode on pages


Access Modes

  • Random access: pages retrieved using their URL
  • Query-based access: pages retrieved as responses to queries on pages metadata or textual content (handled by query engine)
  • Streaming access: pages retrieved and delivered as a data stream to requesting applications (handled by multicast module)
      • Streams available not only locally but to remote applications as well
      • Restartable streams, can be paused and resumed at will

Page Identifier

  • The page URL is first normalized:
      • Removal of the protocol prefix
      • Removal of the port number specification
      • Conversion of the server name to lower case
      • Removal of all trailing slashes ("/")
  • The resulting text string is hashed using a signature computation to yield a 64-bit page identifier (signature collisions are unlikely to occur).
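
The normalization and hashing steps can be illustrated with a minimal sketch. The slides do not name the signature function, so taking the first 8 bytes of a SHA-256 digest is purely an assumption made here to obtain a 64-bit value; the function names `normalize_url` and `page_id` are also hypothetical.

```python
import hashlib
import re


def normalize_url(url: str) -> str:
    """Apply the four normalization steps listed on the slide."""
    # 1. Remove the protocol prefix (e.g. "http://").
    rest = re.sub(r"^[A-Za-z][A-Za-z0-9+.\-]*://", "", url)
    # Split the server part from the path so that only the server is changed.
    server, sep, path = rest.partition("/")
    # 2. Remove the port number specification (e.g. ":80").
    server = re.sub(r":\d+$", "", server)
    # 3. Convert the server name to lower case.
    server = server.lower()
    # 4. Remove all trailing slashes.
    return (server + sep + path).rstrip("/")


def page_id(url: str) -> int:
    """Hash the normalized URL to a 64-bit identifier.

    Assumption: SHA-256 truncated to 8 bytes stands in for the
    unspecified signature computation.
    """
    digest = hashlib.sha256(normalize_url(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```

With this sketch, `HTTP://WWW.Stanford.EDU:80/` and `http://www.stanford.edu` normalize to the same string and therefore receive the same identifier.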
Storage Manager (SM)

[Figure: node management server and storage nodes serving stream requests and random access requests]

For scalability:

  • SM is distributed across a collection of storage nodes
  • Storage nodes are coordinated by a central node management server
  • The server keeps a table of parameters describing the current state of each storage node (node capacity, extent of node fragmentation, state, number of requests)

Stores only latest versions of web pages – provides facilities for their access/update

Consistency of indexes must be dealt with

Expunging of obsolete pages is assisted by the allowed lifetime and lifetime count values associated with each page

Design issues for SM – I. Page Distribution across nodes

Uniform distribution: all nodes are treated identically

Hash distribution: pages are stored on the nodes whose range of identifiers include the page identifier
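
A small sketch may help contrast the two policies. It assumes that, under hash distribution, the 64-bit identifier space is split into equal contiguous ranges, one per node, and that uniform distribution can be modelled as a random choice among nodes. Both choices, and the function names, are illustrative assumptions rather than details from the slides.

```python
import random


def hash_node(page_id: int, num_nodes: int, id_bits: int = 64) -> int:
    """Hash distribution: the node whose identifier range contains page_id.

    Assumption: node k owns the k-th equal slice of the identifier space.
    """
    range_size = (1 << id_bits) // num_nodes
    # min() keeps the last few identifiers on the final node when the
    # space does not divide evenly.
    return min(page_id // range_size, num_nodes - 1)


def uniform_node(num_nodes: int, rng: random.Random) -> int:
    """Uniform distribution: every node is an equally valid target."""
    return rng.randrange(num_nodes)
```

Under the hash policy a page can be located from its identifier alone, whereas the uniform policy needs some record of where each page was placed.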

Uniform vs Hash distribution

Design issues for SM – II. Organization of pages on disk

Hash-based organization

Each disk is considered as a collection of hash buckets

Pages are stored into buckets according to the pageID range they hold

Bucket overflows are handled by allocation of extra overflow buckets

We assume that

-buckets with successive ranges of pageIDs are physically contiguous on disk,

-pages are stored in the buckets in increasing order of their IDs

  • How the fundamental operations are performed
  • Random page access: identify the containing bucket -> read it into memory -> search main memory to locate the page
  • Streaming: read buckets sequentially into memory -> transmit pages to the client
  • Page addition: add pages to buckets in memory (pages may arrive in order or not) -> write the modified buckets to disk
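
The three operations can be sketched with an in-memory model in which each bucket is a sorted list. The class name `HashBucketStore` and its parameters are hypothetical, disk reads and writes are not modelled, and overflow buckets are left out for brevity.

```python
import bisect


class HashBucketStore:
    """In-memory stand-in for a disk organized as hash buckets."""

    def __init__(self, num_buckets: int, id_bits: int = 64):
        # Each bucket covers one contiguous range of page identifiers.
        self.range_size = -(-(1 << id_bits) // num_buckets)  # ceiling division
        # Each bucket holds (page_id, content) pairs sorted by page_id.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, page_id: int) -> list:
        return self.buckets[page_id // self.range_size]

    def add(self, page_id: int, content: str) -> None:
        """Page addition: insert in sorted position, replacing an old copy."""
        bucket = self._bucket(page_id)
        i = bisect.bisect_left(bucket, (page_id,))
        if i < len(bucket) and bucket[i][0] == page_id:
            bucket[i] = (page_id, content)
        else:
            bucket.insert(i, (page_id, content))

    def get(self, page_id: int):
        """Random access: find the bucket, then binary-search inside it."""
        bucket = self._bucket(page_id)
        i = bisect.bisect_left(bucket, (page_id,))
        if i < len(bucket) and bucket[i][0] == page_id:
            return bucket[i][1]
        return None

    def stream(self):
        """Streaming: read buckets in order, yielding pages sorted by ID."""
        for bucket in self.buckets:
            yield from bucket
```

Because buckets are visited in range order and each bucket is sorted, the stream comes out ordered by page identifier, which matches the "sorted" streaming behaviour attributed to the hash-based organization.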
Log-based organization

New pages received are appended at the end of the log.

Basic objects on disk:

  • Log: holds the pages stored on the disk
  • Catalog: contains an entry with useful info (pageID, ptr to physical location of page in log, pagesize, pagestatus, timestamp of page addition) for each page in the log
  • B-tree index, to support random access mode

  • How the fundamental operations are performed
  • Random page access: requires two disk accesses
  • Streaming: read the log sequentially and deliver the valid pages
  • Page addition: pages are appended to the log; catalog and B-tree modifications are periodically flushed to disk
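
A minimal in-memory sketch of the log, catalog and index follows. A Python dict stands in for the B-tree, and marking a superseded copy as invalid at insertion time is an assumption made for illustration; the class and field names are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    page_id: int
    offset: int      # physical location of the page in the log
    size: int
    valid: bool      # page status
    added_at: float  # timestamp of page addition


class LogStore:
    def __init__(self):
        self.log = bytearray()  # append-only page data
        self.catalog = []       # one entry per page, in log order
        self.index = {}         # page_id -> catalog position (B-tree stand-in)

    def add(self, page_id: int, content: bytes) -> None:
        """Page addition: append to the log, then update catalog and index."""
        old = self.index.get(page_id)
        if old is not None:
            self.catalog[old].valid = False  # assumption: old copy is invalidated
        entry = CatalogEntry(page_id, len(self.log), len(content), True, time.time())
        self.log += content
        self.index[page_id] = len(self.catalog)
        self.catalog.append(entry)

    def get(self, page_id: int):
        """Random access: look up the index first, then read from the log."""
        pos = self.index.get(page_id)
        if pos is None:
            return None
        e = self.catalog[pos]
        return bytes(self.log[e.offset:e.offset + e.size])

    def stream(self):
        """Streaming: scan the log sequentially, skipping invalid pages."""
        for e in self.catalog:
            if e.valid:
                yield e.page_id, bytes(self.log[e.offset:e.offset + e.size])
```

Pages stream out in arrival order rather than identifier order, which is why the log-structured streaming figure is reported as unsorted.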
Design issues for SM – III. Update Schemes

Classification of pages in repository:

Class A: includes old versions of pages that will be replaced

Class B: unchanged pages

Class C: unseen pages or new versions of pages that will replace class A pages

General update process:

Receive class C pages from the crawler and add them to the repository.

Rebuild all the indexes using the class B and C pages.

Delete the class A pages.
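
The classification and the three-step update can be sketched as follows, modelling the repository and the crawler output as dicts keyed by page identifier. The function names are hypothetical, and the sorted list of identifiers is only a stand-in for rebuilding the real indexes.

```python
def classify(repository: dict, crawled: dict):
    """Split pages into the three classes used by the update process."""
    # Class A: stored versions that the crawler has superseded.
    class_a = {pid: p for pid, p in repository.items() if pid in crawled}
    # Class B: stored pages the crawler did not touch.
    class_b = {pid: p for pid, p in repository.items() if pid not in crawled}
    # Class C: everything delivered by the crawler (unseen or new versions).
    class_c = dict(crawled)
    return class_a, class_b, class_c


def run_update(repository: dict, crawled: dict):
    class_a, class_b, class_c = classify(repository, crawled)
    # Step 1: add the class C pages to the repository.
    new_repository = {**class_b, **class_c}
    # Step 2: rebuild the indexes from class B and C pages (stand-in index).
    index = sorted(new_repository)
    # Step 3: class A pages are deleted; they are returned here for inspection.
    return new_repository, index, class_a
```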

Suggested update strategies:

i. Batch update

ii. Incremental update


i. Batch update scheme

Two sets of storage nodes: update nodes (hold class C pages), read nodes (hold class A, B pages)

Steps followed:

1. System isolation

2. Page transfer

3. System restart

Examples of page transfer in the batch update scheme

[Figure: 4 update nodes send streams of class C pages to 12 read nodes; pages are distributed by their pID]

1. Log-structured page organization and Hash distribution policy on both sets of nodes

  • Deletion of class A pages requires a separate step

2. Hash-based page organization and Hash distribution policy on both sets

Deletion of class A pages occurs while class C pages are added

This addition is performed using merge sort

Advantages: no conflicts occur, physical location of pages is not changed (compaction operation=part of the update)
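
For the hash-based case, the merge step can be sketched as a single pass over two sequences sorted by page identifier, in the style of the merge phase of merge sort. The function name `merge_update` and the list-of-pairs representation are assumptions for illustration.

```python
def merge_update(read_pages: list, update_pages: list) -> list:
    """Merge sorted class C pages into a sorted bucket.

    Both inputs are lists of (page_id, content) sorted by page_id.
    When the same page_id appears in both, the update-node copy wins,
    which drops the class A page in the same pass.
    """
    merged = []
    i = j = 0
    while i < len(read_pages) and j < len(update_pages):
        rid, uid = read_pages[i][0], update_pages[j][0]
        if rid < uid:
            merged.append(read_pages[i])    # class B page, kept as is
            i += 1
        elif rid > uid:
            merged.append(update_pages[j])  # previously unseen class C page
            j += 1
        else:
            merged.append(update_pages[j])  # new version replaces class A page
            i += 1
            j += 1
    merged.extend(read_pages[i:])
    merged.extend(update_pages[j:])
    return merged
```

The output stays sorted, so no separate deletion or compaction pass is needed in this sketch.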


ii. Incremental update scheme

All nodes are equally responsible for supporting both page update and page access at the same time, so service is provided continuously

Drawbacks of continuous service

Performance penalty: due to conflicts between various operations

Requirement for maintaining local index in a dynamic way

Restartable streams are more complicated

-in batch update systems, the pair (Node-id, Page-id) provides sufficient information about the state of a stream

-in incremental update systems, where physical locations of pages may change, additional stream state information is required
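
A sketch of how a (Node-id, Page-id) pair can be enough to resume a stream in a batch update system. It assumes that nodes are visited in a fixed order and that the order of pages within a node does not change between pause and resume; the function name `stream_from` and the nested-list layout are hypothetical.

```python
def stream_from(nodes: list, state=None):
    """Yield ((node_id, page_id), content) pairs, optionally resuming.

    nodes: one list of (page_id, content) per storage node, in stored order.
    state: (node_id, page_id) of the last page delivered, or None to start over.
    """
    start_node, last_pid = state if state is not None else (0, None)
    for node_id in range(start_node, len(nodes)):
        # On the node where the stream was paused, skip up to and
        # including the last page already delivered.
        skipping = node_id == start_node and last_pid is not None
        for pid, content in nodes[node_id]:
            if skipping:
                if pid == last_pid:
                    skipping = False
                continue
            yield (node_id, pid), content
```

If pages could move while the stream is paused, as in an incremental update system, skipping to the last delivered page would no longer guarantee that every remaining page is seen exactly once.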



  • WebBase prototype SM’s configuration features
        • Batch update strategy
        • Hash page distribution for both update and read nodes
        • Log-structured page organization in both sets of nodes

Implemented on top of a standard Linux FS

  • SM is fed 50-100 pages/sec from an incremental crawler
  • Use of a cluster of PCs connected by a 100 Mbps Ethernet LAN
  • A client module to request access on the repository and a crawler emulator to retrieve/transmit pages accordingly are also implemented
  • Performance Metrics
    • Page addition rate (pages/sec/node)
    • Streaming rate (pages/sec/node)
    • Random access rate (pages/sec/node)
    • Batch update time (in case of batch update systems)



Prototype configuration: Batch[U(hash, log), R(hash, log)]

Choosing a hash bucket size

[Plots: optimal hash bucket size (plot A); space-performance tradeoff]

If on average 16 pages are kept per bucket, a hash bucket size of 64 KB must be chosen, and the average random page access time is then 20.7 ms (the optimal point in plot A).

As buckets grow, space utilization and streaming performance improve, but random access suffers
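
The arithmetic behind these figures can be checked with a short sketch. The 4 KB average page size used below is only inferred from dividing 64 KB by 16 pages; it is not a number stated on the slide, and the helper names are hypothetical.

```python
def bucket_size_kb(pages_per_bucket: int, avg_page_kb: float) -> float:
    """Bucket size needed to hold the given number of average-sized pages."""
    return pages_per_bucket * avg_page_kb


def accesses_per_sec(access_time_ms: float) -> float:
    """Random accesses one disk can serve per second at a given access time."""
    return 1000.0 / access_time_ms


# 16 pages at an inferred 4 KB each fill a 64 KB bucket.
size = bucket_size_kb(16, 4)
# An access time of 20.7 ms corresponds to roughly 48 accesses per second.
rate = accesses_per_sec(20.7)
```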

Comparing different systems

Performance metric | Log-structured (pages/sec) | Hash-based (pages/sec) | Hashed-log (pages/sec)
Streaming rate and ordering | 6300 unsorted | 3900 sorted | 6300 sorted
Random page access rate | | |
Page addition rate (random order, no buffering) | | |
Page addition rate (random order, 10 MB buffer) | | |
Page addition rate (sorted order, 10 MB buffer) | | |

Hashed-log hybrid node organization: the disk contains a number of large logs (8-10MB), each one associated with a range of hash values

Comparing different configurations

System configuration | Page addition rate [pages/sec/node] | Batch update time (update ratio=0.25)
Batch[U(hash, log), R(hash, hash)] | | 11700 secs
Batch[U(hash, hash), R(hash, hash)] | | 1260 secs
Batch[U(hash, hashed-log), R(hash, hash)] | | 1260 secs


25% of the pages on read nodes are replaced by newer versions during the update process (update ratio=0.25)

Experiments on overall system performance of prototype

Performance metric | Observed value
Streaming rate | 2800 pages/sec (per read node)
Page addition rate | 3200 pages/sec (per update node)
Batch update time | 2451 seconds (for update ratio = 0.25)
Random page access rate | 33 pages/sec (per read node)

[Figures: batch update time of prototype; performance of prototype]

Summary - Relative performance of different system configurations

Each system configuration is rated on random access, page addition and update time:

  • Incr [hash, log]
  • Incr [uniform, log]
  • Incr [hash, hash]
  • Batch [U(hash, log), R(hash, log)]
  • Batch [U(hash, log), R(hash, hash)]
  • Batch [U(hash, hash), R(hash, hash)]
  • Batch [U(hash, hashed-log), R(hash, hash)]

  • Ordering of symbols adopted from the most to the least favorable: ++,+,+-,-,--

We provided an overview of:

  • WebBase prototype architecture
  • Performance metrics based on simulation experiments
  • WebBase as a research test-bed for various system configurations

Future enhancements on WebBase include:

  • Implementation of advanced system configurations
  • Development of advanced streaming facilities (e.g. delivering streams for subsets of the web pages in the repository)
  • Integration of a history-maintaining service for old, replaced web pages