
Web Archaeology


  1. Web Archaeology
     Raymie Stata
     Compaq Systems Research Center
     raymie.stata@compaq.com
     www.research.compaq.com/SRC

  2. What is Web Archaeology?
  • The study of the content of the Web
    • exploring the Web
    • sifting through data
    • making valuable discoveries
  • Difficult! Because the Web is:
    • Boundless
    • Dynamic
    • Radically decentralized

  3. Some recent results
  • Empirical studies
    • Quality of almost-breadth-first crawling
    • Structure of the Web
    • Random walks (size of search engines)
  • Improving the Web experience
    • Better and more precise search results
    • Surfing assistants and tools
  • Data mining
    • Technologies for "page scraping"

  4. Tools for "Web scale" research (a layered stack, bottom to top)
  • Data collection: download web pages (Mercator, Web Language)
  • Data storage: store and access web pages (Myriad)
  • Feature databases: fast access to subsets of the data (full-text index, shingleprints, connectivity, term vectors)
  • Apps: use the data (search quality, crawl quality, duplicate elimination, Web characterization)

  5. Web-scale crawling: Mercator and Atrax

  6. The Mercator web crawler
  • A high-performance web crawler
    • Downloads and processes web pages
  • Written entirely in Java
    • Runs on any Java-capable platform
  • Extensible
    • Wide range of possible configurations
    • Users can plug in new modules at run time (see the sketch below)
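The transcript does not show Mercator's actual module API; the following is a minimal, hypothetical sketch of what such a run-time plug-in point can look like, with invented names (DocumentProcessor, CrawlerCore):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of a Mercator-style extensibility point;
    // the names are illustrative, not Mercator's actual API.
    interface DocumentProcessor {
        // Called for every downloaded page; implementations can index
        // content, extract links, collect statistics, etc.
        void process(String url, byte[] content);
    }

    class CrawlerCore {
        private final List<DocumentProcessor> processors = new ArrayList<>();

        // Users plug in new modules at run time.
        void register(DocumentProcessor p) { processors.add(p); }

        // The crawl loop hands each fetched page to every registered module.
        void onDownload(String url, byte[] content) {
            for (DocumentProcessor p : processors) p.process(url, content);
        }
    }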

  7. Mercator design points
  • Extensible
    • Well-chosen extensibility points
    • Framework for configuration
  • Multiple threads with synchronous I/O
    • vs. a single thread with asynchronous I/O
  • Checkpointing
    • Allows crawls to be restarted
    • Modules export prepare and commit (see the sketch below)
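The slide only names the prepare/commit exports; a minimal sketch of such a two-phase checkpoint, with invented names, might look like this: every module flushes its state in prepare, and only after all prepares succeed does the coordinator commit, so a crash mid-checkpoint leaves the previous checkpoint intact:

    import java.util.List;

    // Hypothetical two-phase checkpoint protocol; names are illustrative.
    interface Checkpointable {
        void prepare() throws Exception;  // write state to stable storage
        void commit();                    // make the new checkpoint current
    }

    class CheckpointCoordinator {
        void checkpoint(List<Checkpointable> modules) throws Exception {
            for (Checkpointable m : modules) m.prepare();  // all prepare first
            for (Checkpointable m : modules) m.commit();   // then all commit
        }
    }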

  8. System Architecture

  9. Crawl quality

  10. Atrax, a distributed version of Mercator
  • Distributes load across a cluster of crawlers
  • Partitions data structures across the crawlers (see the sketch below)
  • No central bottleneck
    • Network bandwidth is the limiting factor
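The slides do not detail how Atrax partitions its data structures; one standard approach, sketched here with invented names, assigns each URL's host to a crawler by hashing, so every crawler owns a disjoint share of the frontier and no central coordinator is needed:

    import java.net.URI;

    // Hypothetical host-hash partitioning across a cluster of crawlers.
    class UrlPartitioner {
        private final int numCrawlers;
        UrlPartitioner(int numCrawlers) { this.numCrawlers = numCrawlers; }

        // Hashing the host (not the full URL) keeps per-site politeness
        // local to one crawler and avoids any central bottleneck.
        int crawlerFor(String url) {
            String host = URI.create(url).getHost();
            if (host == null) host = url;  // fall back for odd URLs
            return Math.floorMod(host.hashCode(), numCrawlers);
        }
    }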

  11. Performance of Atrax vs Mercator

  12. Myriad, a new project
  • A very large archival storage system
    • Scalable to a petabyte
  • With function shipping
    • Supports data mining

  13. Myriad requirements
  • Large (up to 10K disks)
  • Commodity hardware (low cost)
  • Easy to manage
  • Easy to use (queries vs. code)
  • Fault tolerance and containment
    • No backups, tape or otherwise

  14. Two phases of the Myriad project
  • Define the service-level interface
    • Implemented to run on collections of files
    • Testing and tuning
  • Build a scalable implementation
    • Cluster storage and processing
    • Being designed now, prototype in the summer
    • Not described today

  15. A new service-level interface
  • Diagram: applications sit on top of a storage service; today they reach it through file systems and databases over a block interface
  • Myriad gives applications a new service-level interface to the storage service instead
    • Better suited to this problem and scale
    • Supports "function shipping"

  16. Myriad interface
  • A single-table database
  • Stored vs. virtual columns
    • Virtual columns are computed by injected code
  • Bulk input of new records
  • Management of the code defining virtual columns
  • Output via select/project queries
    • User-defined code is run implicitly
  • Support for repeatable random sampling (sketched after the example query below)

  17. Example Myriad query

     [samplingprob=0.1, samplingseed=321223421332]
     select name, length
     where insertionDate < Date(00/01/01)
        && mimeType == "text/html";
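One way to implement repeatable random sampling (the samplingprob/samplingseed pair above) is to hash each record id together with the seed, so re-running a query with the same seed selects exactly the same records; this is an illustrative assumption, not necessarily Myriad's actual mechanism:

    // Hypothetical repeatable sampler: the same (seed, recordId) pair
    // always yields the same decision, so a query can be re-run later
    // over exactly the same sample.
    class RepeatableSampler {
        private final double prob;
        private final long seed;
        RepeatableSampler(double prob, long seed) { this.prob = prob; this.seed = seed; }

        boolean includes(long recordId) {
            long h = Long.rotateLeft((recordId ^ seed) * 0x9E3779B97F4A7C15L, 31);
            // Map the top 53 bits of the hash to [0, 1) and compare.
            double u = (h >>> 11) / (double) (1L << 53);
            return u < prob;
        }
    }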

  18. Model for large-scale data mining
  • Step 1: make an extract (toy sketch below)
    • Do a data-parallel select and project
    • Don't do any sorts, joins, or groupings
  • Step 2: put the extract into a high-power analysis tool
    • Sorts, joins, and groupings happen there
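As a toy illustration of step 1 (the record type and its fields are invented here), the extract is an embarrassingly parallel select-and-project that defers all sorts, joins, and groupings to the analysis tool:

    import java.util.List;
    import java.util.stream.Collectors;

    // Invented record type standing in for a row of the page table.
    record PageRecord(String name, long length, String mimeType) {}

    class Extractor {
        // Data-parallel filter + projection; no sorts, joins, or groupings.
        static List<String> extract(List<PageRecord> table) {
            return table.parallelStream()
                    .filter(r -> r.mimeType().equals("text/html"))
                    .map(r -> r.name() + "\t" + r.length())
                    .collect(Collectors.toList());
        }
    }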

  19. Feature databases (signature sketch below)
  • URL DB: URL → pgid
  • Host DB: pgid → hostid
  • Link DB
    • out: pgid → pgid*
    • in: pgid → pgid*
  • Term vector DB: pgid → term vector
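Viewed as code, the four feature databases are just lookup functions; a hypothetical sketch of their signatures (the real systems are compressed, disk-backed structures, not plain interfaces):

    import java.util.Map;

    interface UrlDb        { long pgid(String url); }
    interface HostDb       { long hostid(long pgid); }
    interface LinkDb       { long[] out(long pgid); long[] in(long pgid); }
    interface TermVectorDb { Map<String, Integer> termVector(long pgid); }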

  20. URL database: prefix compression

  Sorted URLs:
     http://kiva.net/~markh/surnames.html
     http://kiwi-us.com/~amigo/links/index.htm
     http://kiwi.emse.fr/
     http://kiwi.etri.re.kr/~khshim/internet/bookmark.html
     http://kiwi.etri.re.kr/~ksw/bookmark
     http://kiwi.futuris.net/linen
     http://kiwi.futuris.net/linen/special/backiss.html

  Prefix compressed (length of prefix shared with the previous URL, plus suffix):
     0  http://kiva.net/~markh/surnames.html
     9  wi-us.com/~amigo/links/index.htm
     11 .emse.fr/
     13 tri.re.kr/~khshim/internet/bookmark.html
     25 sw/bookmark
     12 futuris.net/linen
     29 /special/backiss.html
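A minimal sketch of this prefix compression, assuming the URL list is sorted so consecutive entries share long prefixes; each entry stores the shared-prefix length and the remaining suffix, and decompression rebuilds URLs front to back:

    import java.util.ArrayList;
    import java.util.List;

    class PrefixCodec {
        // One compressed entry: chars shared with the previous URL + suffix.
        record Entry(int shared, String suffix) {}

        static List<Entry> compress(List<String> sortedUrls) {
            List<Entry> out = new ArrayList<>();
            String prev = "";
            for (String url : sortedUrls) {
                int n = 0;
                int max = Math.min(prev.length(), url.length());
                while (n < max && prev.charAt(n) == url.charAt(n)) n++;
                out.add(new Entry(n, url.substring(n)));
                prev = url;
            }
            return out;
        }

        static List<String> decompress(List<Entry> entries) {
            List<String> out = new ArrayList<>();
            String prev = "";
            for (Entry e : entries) {
                prev = prev.substring(0, e.shared) + e.suffix;
                out.add(prev);
            }
            return out;
        }
    }

Running compress on the seven URLs above reproduces the slide's shared-prefix lengths (0, 9, 11, 13, 25, 12, 29).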

  21. URL compression
  • Prefix compression
    • 44 → 14.5 bytes/URL
    • Fast to decompress (~10 µs)
  • ZIP compression
    • 14.5 → 9.2 bytes/URL
    • Slow to decompress (~80 µs)

  22. Term vector basics
  • The basic abstraction for information retrieval
  • Useful for measuring the "semantic" similarity of text
  • A page's row in a page-by-term count table is its "term vector" (example table omitted from the transcript)
    • Columns are word stems and phrases
  • Trying to capture "meaning"

  23. Compressing term vectors (sketch below)
  • Sparse representation
    • Only store columns with non-zero counts
  • Lossy representation
    • Only store "important" columns
    • "Importance" is determined by:
      • Count of the term on the page (high ==> important)
      • Number of pages with the term (low ==> important)
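A sketch of this sparse, lossy representation: keep only non-zero counts, score each term by a TF-IDF-style importance (on-page count times the log of inverse document frequency; the exact formula here is an assumption, not necessarily the one used), and retain the top k:

    import java.util.Comparator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    class TermVectorCompressor {
        static Map<String, Integer> compress(Map<String, Integer> counts,
                                             Map<String, Integer> pagesWithTerm,
                                             int totalPages, int k) {
            Map<String, Integer> kept = new LinkedHashMap<>();
            counts.entrySet().stream()
                .filter(e -> e.getValue() > 0)          // sparse: drop zero counts
                .sorted(Comparator.comparingDouble((Map.Entry<String, Integer> e) ->
                    e.getValue() * Math.log((double) totalPages /
                        pagesWithTerm.getOrDefault(e.getKey(), 1))).reversed())
                .limit(k)                                // lossy: top-k terms only
                .forEach(e -> kept.put(e.getKey(), e.getValue()));
            return kept;
        }
    }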

  24. TVDB Builder

  25. Applications
  • Categorizing pages
  • Topic distillation
  • Filtering pages
  • Identifying languages
  • Identifying running text
  • Relevance feedback ("more like this")
  • Abstracting pages

  26. Categorization
  • "Bulls take over"

  27. How to categorize a page
  • Off line:
    • Collect a training set of pages per category (~30K pages)
    • Combine the training pages into category vectors
      • ~10K terms per category vector
  • On line (sketch below):
    • Use the term vector DB to look up the page's vector
    • Find the category vector that best matches this page vector
      • Use a Bayesian classifier to match vectors
    • Give no category if the match is not "definitive"
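A minimal sketch of the on-line step, assuming a naive-Bayes-style match between the page vector and per-category log-probability vectors; the "definitive" test here (the best score must beat the runner-up by a margin) is an invented stand-in for whatever threshold the real classifier used:

    import java.util.Map;

    class Categorizer {
        // categoryLogProbs maps each category name to its log P(term | category).
        static String categorize(Map<String, Integer> pageVector,
                                 Map<String, Map<String, Double>> categoryLogProbs,
                                 double margin) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY, second = Double.NEGATIVE_INFINITY;
            for (var entry : categoryLogProbs.entrySet()) {
                double score = 0;
                for (var t : pageVector.entrySet())
                    score += t.getValue() *
                             entry.getValue().getOrDefault(t.getKey(), -10.0); // smoothed default
                if (score > bestScore) { second = bestScore; bestScore = score; best = entry.getKey(); }
                else if (score > second) { second = score; }
            }
            // Give no category if the match is not "definitive".
            return (bestScore - second >= margin) ? best : null;
        }
    }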

  28. Topic drift in topic distillation
  • Some Web IR algorithms have this structure:
    • Compute a "seed set" for a query
    • Find a neighborhood by following links
    • Rank this neighborhood
  • Topic drift (a problem):
    • The neighborhood graph includes off-topic nodes
    • "Download Microsoft Explorer" → the MS home page

  29. Avoiding topic drift with term vectors
  • Combine the term vectors of the seed set into a topic vector
  • Detecting topic drift in neighboring nodes:
    • Compare the topic vector with the node's term vector
      • An inner product works fine (sketch below)
    • Expunge or down-weight off-topic nodes
  • Integration of the feature databases helps!
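The drift test reduces to a sparse inner product between the topic vector and a node's term vector; a minimal sketch:

    import java.util.Map;

    class DriftDetector {
        // Sparse inner product; a low score flags an off-topic node
        // to expunge or down-weight.
        static double innerProduct(Map<String, Double> topic, Map<String, Integer> node) {
            double sum = 0;
            for (var e : node.entrySet())  // iterate only over non-zero entries
                sum += e.getValue() * topic.getOrDefault(e.getKey(), 0.0);
            return sum;
        }
    }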

  30. Link database
  • Goals
    • Fit the links into RAM (fast lookup)
    • Build it in 24 hours
  • Applications
    • Page ranking
    • Web structure
    • Mirror site detection
    • Related page detection

  31. Link storage: baseline design
  • Two arrays: a Starts array (... 104 105 106 107 108 ...) gives each source page's offset into an Id array of destination page ids (106 115 101 72 208 111 ...)

  32. Link storage: deltas
  • Store small deltas between nearby ids instead of raw destination ids
  • Example: Ids 106 115 101 72 208 111 become deltas 2 9 -4 -31 136 4

  33. Link storage: compression
  • Variable-length encoding of the deltas: 1.7 bytes/link (byte-oriented sketch below)
  • Example: deltas 2 9 -4 -31 136 4 are encoded in 4, 8, 8, 8, 12, and 8 bits respectively
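A byte-oriented sketch of the same idea (the slide's encoding packs at bit granularity, down to 4 bits per delta, which this simpler variant does not reproduce): deltas can be negative, so they are zigzag-mapped to keep small magnitudes small, then written with a variable-length 7-bits-per-byte code:

    import java.io.ByteArrayOutputStream;

    class LinkCodec {
        // Encodes one adjacency list.  Assumes destination pgids are sorted;
        // the first delta is taken against the source pgid, later deltas
        // against the previous destination (an assumption about the layout).
        static byte[] encode(long src, long[] sortedDests) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            long prev = src;
            for (long d : sortedDests) {
                long delta = d - prev;
                long z = (delta << 1) ^ (delta >> 63);   // zigzag: small |delta| -> small code
                while ((z & ~0x7FL) != 0) {              // 7 payload bits per byte,
                    out.write((int) ((z & 0x7F) | 0x80)); // high bit = "more bytes follow"
                    z >>>= 7;
                }
                out.write((int) z);
                prev = d;
            }
            return out.toByteArray();
        }
    }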

  34. LDBng

  35. The future of Web Archaeology
  • Driving applications
    • Web search: "finding things on the web"
    • Page classification (topic, community, type)
    • Purpose-specific search
    • Web "asset management" (what's on my site?)
    • Automated information extraction (price robots)
  • A multi-billion-page web
  • Dynamics
