
WebBase: A repository of web pages



  1. WebBase: A repository of web pages
     Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, Andreas Paepcke
     Computer Science Department, Stanford University
     Presented by: Maria Fragouli, Athens 2002

  2. Web repository: stores and manages large collections of web pages; it is used by applications that access, mine, or index up-to-date web content.
     Basic implementation goals:
     • Scalability: use of network disks to hold the repository, so that it can scale with web growth
     • Streams: support for a streaming (ordered) access mode, as opposed to random access, for requests of pages in bulk rather than individually
     • Large updates: newly updated versions of pages must efficiently replace older ones
     • Expunging pages: obsolete pages need to be detected and removed

  3. WebBase: a prototype repository built at Stanford University. We study:
     • The repository architecture for the required functionality and performance
     • Distribution policies of web pages across network disks
     • The interaction between the crawler and the repository
     • Organization strategies for web pages on the system nodes
     • Experimental results of simulations on the prototype
     Design assumptions for the web repository:
     • Incremental crawler: only new or changed web pages are visited at each run
     • Retain only the latest version of each page
     • Crawl and store only HTML pages
     • Snapshot index construction

  4. WebBase Architecture I: Functional modules and their interaction
     [Figure: block diagram of the functional modules, detailed on the next slide]

  5. WebBase Architecture II: Functional modules and their interaction (interfaces sketched below)
     • Crawler module: retrieves new or updated copies of web pages
     • Storage module: assigns pages to storage devices; handles updates of pages; schedules and services requests, etc.
     • Metadata-indexing module: indexes pages and the metadata extracted from them
     • Query engine and multicast module: handle web content according to the access mode on pages (queries and streams, respectively)
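To make the module boundaries concrete, here is a minimal Python sketch of the interfaces this slide implies. The class and method names are hypothetical; the paper does not prescribe an API.

```python
from typing import Iterable, Protocol

class Page:
    """A web page plus its metadata (URL, pageID, timestamp, ...)."""

class StorageModule(Protocol):
    def add_page(self, page: Page) -> None: ...      # fed by the crawler
    def get_page(self, page_id: int) -> Page: ...    # random access
    def scan(self) -> Iterable[Page]: ...            # feeds streams and indexing

class QueryEngine(Protocol):
    def query(self, predicate: str) -> Iterable[Page]: ...  # query-based access

class MulticastModule(Protocol):
    def open_stream(self) -> Iterable[Page]: ...     # streaming access, local or remote
```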

  6. Access modes
     • Random access: pages retrieved using their URL
     • Query-based access: pages retrieved as responses to queries on page metadata or textual content (handled by the query engine)
     • Streaming access: pages retrieved and delivered as a data stream to requesting applications (handled by the multicast module)
     • Streams are available not only locally but to remote applications as well
     • Streams are restartable: they can be paused and resumed at will
     Page identifier (see the sketch below)
     • The page URL is first normalized:
       - Removal of the protocol prefix
       - Removal of the port number specification
       - Conversion of the server name to lower case
       - Removal of all trailing slashes ("/")
     • The resulting text string is hashed using a signature computation to yield a 64-bit page identifier (signature collisions are unlikely to occur).
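A minimal Python sketch of the identifier computation follows. `normalize_url` implements only the four steps listed above; since the slide does not name the signature function, a 64-bit truncation of MD5 stands in for it here purely as an illustration.

```python
import hashlib

def normalize_url(url: str) -> str:
    """The four normalization steps from the slide (edge cases such as
    embedded credentials are not handled in this sketch)."""
    if "://" in url:                       # remove the protocol prefix
        url = url.split("://", 1)[1]
    host, _, path = url.partition("/")
    host = host.split(":", 1)[0]           # remove the port specification
    host = host.lower()                    # lowercase the server name
    return (host + "/" + path).rstrip("/") # remove all trailing slashes

def page_id(url: str) -> int:
    """64-bit identifier from a signature over the normalized URL."""
    digest = hashlib.md5(normalize_url(url).encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Both spellings normalize to the same string, hence the same identifier.
assert page_id("HTTP://WWW.Example.COM:80/a/b/") == page_id("http://www.example.com/a/b")
```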

  7. Storage Manager (SM)
     [Figure: the crawler feeds storage nodes over a LAN; a node management server coordinates the nodes, which serve stream requests and random access requests]
     For scalability:
     • The SM is distributed across a collection of storage nodes
     • Storage nodes are coordinated by a central node management server
     • The latter keeps a table of parameters describing the current state of each storage node (node capacity, extent of node fragmentation, state, number of pending requests); see the sketch below
     The SM stores only the latest versions of web pages and provides facilities for their access and update.
     Consistency of the indexes must be dealt with.
     Expunging of obsolete pages is assisted by the allowed-lifetime and lifetime-count values associated with each page.
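The node management server's table can be pictured as below. The field names follow the parameters listed on the slide, but this is a hedged sketch: the coordination policy (`least_loaded`) and all class names are illustrative, not the paper's design.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    """One row of the node management server's table."""
    node_id: str
    capacity_used: float    # fraction of node capacity in use
    fragmentation: float    # extent of node fragmentation
    state: str              # e.g. "reading" or "updating"
    pending_requests: int   # number of outstanding requests

class NodeManagementServer:
    def __init__(self) -> None:
        self.table: dict[str, NodeState] = {}

    def report(self, status: NodeState) -> None:
        """Storage nodes periodically report their current state."""
        self.table[status.node_id] = status

    def least_loaded(self) -> NodeState:
        """Illustrative coordination decision: route new work to the
        node with the fewest pending requests."""
        return min(self.table.values(), key=lambda n: n.pending_requests)
```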

  8. Design issues for the SM – I. Page distribution across nodes
     • Uniform distribution: all nodes are treated identically, and a page may be assigned to any node
     • Hash distribution: pages are stored on the node whose range of identifiers includes the page identifier (see the sketch below)
     [Figure: uniform vs. hash distribution]
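A sketch of the hash distribution policy, assuming the 64-bit page identifiers from slide 6; the cluster size is an illustrative constant, not a value from the paper. Under the uniform policy, by contrast, a page's location cannot be derived from its identifier alone.

```python
NUM_NODES = 16  # illustrative cluster size

def node_for_page(page_id: int, num_nodes: int = NUM_NODES) -> int:
    """Hash distribution: the 64-bit identifier space is split into
    num_nodes contiguous ranges, and a page lives on the node whose
    range includes its pageID, so no lookup directory is needed."""
    range_size = 2**64 // num_nodes
    return min(page_id // range_size, num_nodes - 1)

# Example: the identifier alone determines the owning node.
print(node_for_page(2**63))  # -> 8
```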

  9. Design issues for the SM – II. Organization of pages on disk: hash-based organization
     • Each disk is treated as a collection of hash buckets
     • Pages are stored into buckets according to the pageID range each bucket holds
     • Bucket overflows are handled by allocating extra overflow buckets
     We assume that:
     • buckets with successive ranges of pageIDs are physically contiguous on disk,
     • pages are stored within the buckets in increasing order of their IDs.
     How the fundamental operations are performed (see the sketch below):
     • Random page access: identify the containing bucket -> read it into memory -> search main memory to locate the page
     • Streaming: sequentially read buckets into memory -> transmit pages to the client
     • Page addition: add pages to buckets in memory (in order or not) -> write the modified buckets to disk
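A sketch of random access under the hash-based organization, under stated assumptions: the `Disk` class is an in-memory stand-in for a raw disk where one `read()` models one disk access, and the bucket count is illustrative. The in-bucket record layout, and thus the final in-memory search, is omitted.

```python
BUCKET_SIZE = 64 * 1024   # bytes; matches the 64 KB choice on slide 17
NUM_BUCKETS = 1024        # illustrative

class Disk:
    """Stand-in for a raw disk: one read() costs one disk access."""
    def __init__(self) -> None:
        self.data = bytearray(NUM_BUCKETS * BUCKET_SIZE)

    def read(self, offset: int, size: int) -> bytes:
        return bytes(self.data[offset : offset + size])

def bucket_for(page_id: int, id_lo: int, id_hi: int) -> int:
    """Buckets partition the node's pageID range [id_lo, id_hi)."""
    return (page_id - id_lo) * NUM_BUCKETS // (id_hi - id_lo)

def random_access(disk: Disk, page_id: int, id_lo: int, id_hi: int) -> bytes:
    """Random page access: identify the containing bucket and fetch it
    in a single disk access; the page is then located by an in-memory
    search of the returned bucket (record layout not modeled here)."""
    b = bucket_for(page_id, id_lo, id_hi)
    return disk.read(b * BUCKET_SIZE, BUCKET_SIZE)
```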

  10. Log-based organization
      [Figure: pages appended to a log on disk, alongside a catalog]
      New pages received are appended at the end of the log.
      Basic objects on disk:
      • Log: holds the pages allocated on the disk
      • Catalog: contains an entry with useful information for each page in the log (pageID, pointer to the physical location of the page in the log, page size, page status, timestamp of page addition)
      • A B-tree index, for the random access mode
      How the fundamental operations are performed (see the sketch below):
      • Random page access: requires two disk accesses (index, then log)
      • Streaming: read the log sequentially, returning the valid pages
      • Page addition: pages are appended to the log; catalog and B-tree modifications are periodically flushed to disk
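A minimal in-memory sketch of the log-based organization. The catalog fields follow the slide, but the class and method names are mine, and a Python dict stands in for the on-disk B-tree.

```python
import time
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    """One catalog entry, with the fields listed on the slide."""
    page_id: int
    offset: int      # pointer to the page's physical location in the log
    size: int        # page size
    status: str      # page status, e.g. "valid" / "obsolete"
    added_at: float  # timestamp of page addition

class LogNode:
    def __init__(self) -> None:
        self.log = bytearray()                    # append-only page log
        self.index: dict[int, CatalogEntry] = {}  # stands in for the B-tree

    def add_page(self, page_id: int, body: bytes) -> None:
        """Page addition: append at the end of the log. On disk, catalog
        and B-tree changes would be flushed periodically, not per page."""
        self.index[page_id] = CatalogEntry(page_id, len(self.log),
                                           len(body), "valid", time.time())
        self.log += body

    def get_page(self, page_id: int) -> bytes:
        """Random access: one index lookup plus one log read; on disk
        these are the two accesses the slide mentions."""
        e = self.index[page_id]
        return bytes(self.log[e.offset : e.offset + e.size])
```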

  11. Design issues for the SM – III. Update schemes
      Classification of pages in the repository (see the sketch below):
      • Class A: old versions of pages that will be replaced
      • Class B: unchanged pages
      • Class C: unseen pages, or new versions of pages that will replace class A pages
      General update process:
      1. Receive class C pages from the crawler and add them to the repository.
      2. Rebuild all the indexes using the class B and C pages.
      3. Delete the class A pages.
      Suggested update strategies: i. batch update, ii. incremental update
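The classification can be expressed as a small function. Keying pages by pageID with raw content as values is an illustrative simplification; a real repository would more likely detect changes via checksums or timestamps.

```python
def classify(crawled: dict[int, bytes], stored: dict[int, bytes]):
    """Split pages into the three classes defined above."""
    class_a = {pid for pid in stored
               if pid in crawled and crawled[pid] != stored[pid]}
    class_b = set(stored) - class_a                     # unchanged pages
    class_c = {pid for pid in crawled
               if pid not in stored or pid in class_a}  # unseen or updated
    return class_a, class_b, class_c
```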

  12. i. Batch update scheme
      Two sets of storage nodes: update nodes (receive the class C pages) and read nodes (hold the class A and B pages).
      Steps followed:
      1. System isolation
      2. Page transfer
      3. System restart

  13. Examples of page transfer in the batch update scheme
      [Figure: the crawler streams class C pages to 4 update nodes, which distribute pages by their pageID to 12 read nodes]
      1. Log-structured page organization and hash distribution policy on both sets of nodes
      • Deletion of class A pages requires a separate step

  14. 2. Hash-based page organization and hash distribution policy on both sets of nodes
      • Deletion of class A pages occurs while class C pages are added
      • This addition is performed using a merge sort (see the sketch below)
      • Advantages: no conflicts occur, and the physical location of pages is not changed (the compaction operation becomes part of the update)
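A sketch of the merge-sort addition, assuming both the resident pages (classes A and B) and the incoming class C stream arrive sorted by pageID. `heapq.merge` is stable across its input iterables, so listing the class C stream first makes its version win on an ID collision, which is how the deletion of class A pages folds into the update itself.

```python
import heapq
from typing import Iterable, Iterator, Tuple

Page = Tuple[int, bytes]   # (pageID, content), streams sorted by pageID

def merge_update(resident: Iterable[Page],
                 class_c: Iterable[Page]) -> Iterator[Page]:
    """Write the next generation of a read node in one sequential pass.
    Where the same pageID appears in both inputs, the class C version
    is kept and the superseded class A version is dropped."""
    last = None
    for pid, body in heapq.merge(class_c, resident, key=lambda p: p[0]):
        if pid == last:
            continue          # superseded class A version
        last = pid
        yield pid, body

# Example: page 2 is replaced, page 9 is new, pages 1 and 5 are class B.
old = [(1, b"a"), (2, b"old"), (5, b"c")]
new = [(2, b"new"), (9, b"d")]
assert list(merge_update(old, new)) == [(1, b"a"), (2, b"new"), (5, b"c"), (9, b"d")]
```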

  15. ii. Incremental update scheme
      All nodes are equally responsible for supporting both page update and access at the same time -> continuous service provision.
      Drawbacks of continuous service:
      • Performance penalty, due to conflicts between the various operations
      • Requirement to maintain the local index dynamically
      • Restartable streams are more complicated (see the sketch below):
        - in batch update systems, the pair (node-id, page-id) provides sufficient information for their state
        - in incremental update systems, where the physical locations of pages may change, additional stream state information is required
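A sketch of a restartable-stream token for the batch-update case. Representing each node as a sorted list of pages is a hypothetical simplification; the point is only that (node-id, page-id) suffices as long as pages do not move between a pause and a resume.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

Page = Tuple[int, bytes]   # (pageID, content), sorted by pageID per node

@dataclass
class StreamCursor:
    """Resume token for a paused stream in a batch update system."""
    node_id: str
    page_id: int   # last pageID delivered before the pause

def resume(nodes: Dict[str, List[Page]], cursor: StreamCursor) -> Iterator[Page]:
    """Restart the stream: rejoin the node and skip everything up to
    the cursor. Under incremental updates, physical locations may
    change, so this simple token would no longer suffice."""
    for pid, body in nodes[cursor.node_id]:
        if pid > cursor.page_id:
            yield pid, body
```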

  16. Experiments
      WebBase prototype SM configuration (Batch [U(hash, log), R(hash, log)]):
      • Batch update strategy
      • Hash page distribution for both update and read nodes
      • Log-structured page organization in both sets of nodes
      • Implemented on top of a standard Linux file system
      • The SM is fed 50-100 pages/sec by an incremental crawler
      • Runs on a cluster of PCs connected by a 100 Mbps Ethernet LAN
      • A client module that requests access to the repository, and a crawler emulator that retrieves/transmits pages accordingly, are also implemented
      Performance metrics:
      • Page addition rate (pages/sec/node)
      • Streaming rate (pages/sec/node)
      • Random access rate (pages/sec/node)
      • Batch update time (in the case of batch update systems)

  17. Optimal hash bucket size: a space-performance tradeoff
      Choosing a hash bucket size: if on average 16 pages are kept per bucket, a hash bucket size of 64 KB must be chosen (implying an average page size of about 4 KB), and the average random page access time would then be 20.7 ms (the optimal point, plot A).
      As buckets grow, space utilization and streaming performance improve, but random access suffers. The arithmetic below illustrates the tradeoff.
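The 20.7 ms figure can be sanity-checked with a back-of-the-envelope model. The disk parameters below are illustrative assumptions, not the paper's measured values, yet they land near the number on the slide.

```python
# Random access time = seek + rotational latency + bucket transfer.
PAGES_PER_BUCKET = 16
AVG_PAGE_SIZE = 4 * 1024                        # ~4 KB/page (assumed)
BUCKET_SIZE = PAGES_PER_BUCKET * AVG_PAGE_SIZE  # 65536 bytes = 64 KB

SEEK_MS = 12.0                        # assumed average seek time
ROTATION_MS = 0.5 * 60_000 / 7200     # half a turn at 7200 rpm, ~4.2 ms
TRANSFER_MB_PER_S = 15.0              # assumed sustained transfer rate

transfer_ms = BUCKET_SIZE / (TRANSFER_MB_PER_S * 1024 * 1024) * 1000
access_ms = SEEK_MS + ROTATION_MS + transfer_ms
print(f"estimated random access time: {access_ms:.1f} ms")  # ~20 ms

# Growing the bucket increases only transfer_ms: streaming improves
# (fewer seeks per byte) while random access degrades, the tradeoff
# the slide describes.
```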

  18. Comparing different systems
      Performance metric                               | Log-structured | Hash-based   | Hashed-log
      -------------------------------------------------|----------------|--------------|-------------
      Streaming rate (pages/sec) and ordering          | 6300, unsorted | 3900, sorted | 6300, sorted
      Random page access rate (pages/sec)              | 35             | 51           | 35
      Page addition rate (random order, no buffering)  | 6100           | 23           | 53
      Page addition rate (random order, 10 MB buffer)  | 6100           | 35           | 660
      Page addition rate (sorted order, 10 MB buffer)  | 6100           | 1300         | 1300
      Hashed-log hybrid node organization: the disk contains a number of large logs (8-10 MB), each one associated with a range of hash values.

  19. Comparing different configurations
      System configuration                        | Page addition rate (pages/sec/node) | Batch update time (update ratio = 0.25)
      --------------------------------------------|-------------------------------------|----------------------------------------
      Batch [U(hash, log), R(hash, hash)]         | 6100                                | 11700 secs
      Batch [U(hash, hash), R(hash, hash)]        | 35                                  | 1260 secs
      Batch [U(hash, hashed-log), R(hash, hash)]  | 660                                 | 1260 secs
      Assumption: 25% of the pages on the read nodes are replaced by newer versions during the update process (update ratio = 0.25).

  20. Experiments on overall system performance of the prototype
      Performance metric       | Observed value
      -------------------------|--------------------------------------------
      Streaming rate           | 2800 pages/sec (per read node)
      Page addition rate       | 3200 pages/sec (per update node)
      Batch update time        | 2451 seconds (for update ratio = 0.25)
      Random page access rate  | 33 pages/sec (per read node)

  21. Summary – Relative performance of different system configurations
      System configuration                        | Stream | Random access | Page addition | Update time
      --------------------------------------------|--------|---------------|---------------|-------------
      Incr [hash, log]                            | +      | -             | --            | inapplicable
      Incr [uniform, log]                         | +      | --            | +             | inapplicable
      Incr [hash, hash]                           | +      | +             | -             | inapplicable
      Batch [U(hash, log), R(hash, log)]          | ++     | -             | ++            | +-
      Batch [U(hash, log), R(hash, hash)]         | +      | +             | ++            | --
      Batch [U(hash, hash), R(hash, hash)]        | +      | +             | -             | +
      Batch [U(hash, hashed-log), R(hash, hash)]  | +      | +             | +-            | +
      • Symbols are ordered from the most to the least favorable: ++, +, +-, -, --

  22. Conclusions
      We provided an overview of:
      • The WebBase prototype architecture
      • Performance metrics based on simulation experiments
      • WebBase as a research test-bed for various system configurations
      Future enhancements to WebBase include:
      • Implementation of advanced system configurations
      • Development of advanced streaming facilities (e.g., delivering streams for subsets of the web pages in the repository)
      • Integration of a history-maintaining service for old, replaced web pages
