
Web Archaeology


  1. Web Archaeology
     Raymie Stata
     Compaq Systems Research Center
     raymie.stata@compaq.com
     www.research.compaq.com/SRC

  2. What is Web Archaeology?
  • The study of the content of the Web
    • exploring the Web
    • sifting through data
    • making valuable discoveries
  • Difficult! Because the Web is:
    • Boundless
    • Dynamic
    • Radically decentralized

  3. Some recent results
  • Empirical studies
    • Quality of almost-breadth-first crawling
    • Structure of the Web
    • Random walks (size of search engines)
  • Improving the Web experience
    • Better and more precise search results
    • Surfing assistants and tools
  • Data mining
    • Technologies for "page scraping"

  4. Tools for "Web scale" research (a layered stack, bottom to top)
  • Data collection: download web pages (Mercator, Web Language)
  • Data storage: store and access web pages (Myriad)
  • Feature databases: fast access to subsets of the data (full-text index, shingleprints, connectivity, term vectors)
  • Apps: use the data (search quality, crawl quality, duplicate elimination, Web characterization)

  5. Web-scale crawling: Mercator and Atrax

  6. The Mercator web crawler
  • A high-performance web crawler
    • Downloads and processes web pages
  • Written entirely in Java
    • Runs on any Java-capable platform
  • Extensible
    • Wide range of possible configurations
    • Users can plug in new modules at run time (see the sketch below)
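The transcript does not show Mercator's actual module API; the following is a minimal, hypothetical sketch of what such a run-time plug-in point can look like, with invented names (DocumentProcessor, CrawlerCore):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of a Mercator-style extensibility point;
    // the names are illustrative, not Mercator's actual API.
    interface DocumentProcessor {
        // Called for every downloaded page; implementations can index
        // content, extract links, collect statistics, etc.
        void process(String url, byte[] content);
    }

    class CrawlerCore {
        private final List<DocumentProcessor> processors = new ArrayList<>();

        // Users plug in new modules at run time.
        void register(DocumentProcessor p) { processors.add(p); }

        // The crawl loop hands each fetched page to every registered module.
        void onDownload(String url, byte[] content) {
            for (DocumentProcessor p : processors) p.process(url, content);
        }
    }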

  7. Mercator design points
  • Extensible
    • Well-chosen extensibility points
    • Framework for configuration
  • Multiple threads with synchronous I/O
    • vs. a single thread with asynchronous I/O
  • Checkpointing
    • Allows crawls to be restarted
    • Modules export prepare and commit (see the sketch below)
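The slide only names the prepare/commit exports; a minimal sketch of such a two-phase checkpoint, with invented names, might look like this: every module flushes its state in prepare, and only after all prepares succeed does the coordinator commit, so a crash mid-checkpoint leaves the previous checkpoint intact:

    import java.util.List;

    // Hypothetical two-phase checkpoint protocol; names are illustrative.
    interface Checkpointable {
        void prepare() throws Exception;  // write state to stable storage
        void commit();                    // make the new checkpoint current
    }

    class CheckpointCoordinator {
        void checkpoint(List<Checkpointable> modules) throws Exception {
            for (Checkpointable m : modules) m.prepare();  // all prepare first
            for (Checkpointable m : modules) m.commit();   // then all commit
        }
    }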

  8. System Architecture

  9. Crawl quality

  10. Atrax, a distributed version of Mercator
  • Distributes load across a cluster of crawlers
  • Partitions data structures across the crawlers (see the sketch below)
  • No central bottleneck
    • Network bandwidth is the limiting factor
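The slides do not detail how Atrax partitions its data structures; one standard approach, sketched here with invented names, assigns each URL's host to a crawler by hashing, so every crawler owns a disjoint share of the frontier and no central coordinator is needed:

    import java.net.URI;

    // Hypothetical host-hash partitioning across a cluster of crawlers.
    class UrlPartitioner {
        private final int numCrawlers;
        UrlPartitioner(int numCrawlers) { this.numCrawlers = numCrawlers; }

        // Hashing the host (not the full URL) keeps per-site politeness
        // local to one crawler and avoids any central bottleneck.
        int crawlerFor(String url) {
            String host = URI.create(url).getHost();
            if (host == null) host = url;  // fall back for odd URLs
            return Math.floorMod(host.hashCode(), numCrawlers);
        }
    }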

  11. Performance of Atrax vs Mercator

  12. Myriad, a new project
  • A very large archival storage system
    • Scalable to a petabyte
  • With function shipping
    • Supports data mining

  13. Myriad requirements
  • Large (up to 10K disks)
  • Commodity hardware (low cost)
  • Easy to manage
  • Easy to use (queries vs. code)
  • Fault tolerance and containment
    • No backups, tape or otherwise

  14. Two phases of the Myriad project
  • Define the service-level interface
    • Implemented to run on collections of files
    • Testing and tuning
  • Build a scalable implementation
    • Cluster storage and processing
    • Being designed now, prototype in the summer
    • Not described today

  15. A new service-level interface
  • Diagram: applications sit on top of a storage service; today they reach it through file systems and databases over a block interface
  • Myriad gives applications a new service-level interface to the storage service instead
    • Better suited to this problem and scale
    • Supports "function shipping"

  16. Myriad interface
  • A single-table database
  • Stored vs. virtual columns
    • Virtual columns are computed by injected code
  • Bulk input of new records
  • Management of the code defining virtual columns
  • Output via select/project queries
    • User-defined code is run implicitly
  • Support for repeatable random sampling (sketched after the example query below)

  17. Example Myriad query

     [samplingprob=0.1, samplingseed=321223421332]
     select name, length
     where insertionDate < Date(00/01/01)
        && mimeType == "text/html";
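One way to implement repeatable random sampling (the samplingprob/samplingseed pair above) is to hash each record id together with the seed, so re-running a query with the same seed selects exactly the same records; this is an illustrative assumption, not necessarily Myriad's actual mechanism:

    // Hypothetical repeatable sampler: the same (seed, recordId) pair
    // always yields the same decision, so a query can be re-run later
    // over exactly the same sample.
    class RepeatableSampler {
        private final double prob;
        private final long seed;
        RepeatableSampler(double prob, long seed) { this.prob = prob; this.seed = seed; }

        boolean includes(long recordId) {
            long h = Long.rotateLeft((recordId ^ seed) * 0x9E3779B97F4A7C15L, 31);
            // Map the top 53 bits of the hash to [0, 1) and compare.
            double u = (h >>> 11) / (double) (1L << 53);
            return u < prob;
        }
    }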

  18. Model for large-scale data mining
  • Step 1: make an extract (toy sketch below)
    • Do a data-parallel select and project
    • Don't do any sorts, joins, or groupings
  • Step 2: put the extract into a high-power analysis tool
    • Sorts, joins, and groupings happen there
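As a toy illustration of step 1 (the record type and its fields are invented here), the extract is an embarrassingly parallel select-and-project that defers all sorts, joins, and groupings to the analysis tool:

    import java.util.List;
    import java.util.stream.Collectors;

    // Invented record type standing in for a row of the page table.
    record PageRecord(String name, long length, String mimeType) {}

    class Extractor {
        // Data-parallel filter + projection; no sorts, joins, or groupings.
        static List<String> extract(List<PageRecord> table) {
            return table.parallelStream()
                    .filter(r -> r.mimeType().equals("text/html"))
                    .map(r -> r.name() + "\t" + r.length())
                    .collect(Collectors.toList());
        }
    }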

  19. Feature databases (signature sketch below)
  • URL DB: URL → pgid
  • Host DB: pgid → hostid
  • Link DB
    • out: pgid → pgid*
    • in: pgid → pgid*
  • Term vector DB: pgid → term vector
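Viewed as code, the four feature databases are just lookup functions; a hypothetical sketch of their signatures (the real systems are compressed, disk-backed structures, not plain interfaces):

    import java.util.Map;

    interface UrlDb        { long pgid(String url); }
    interface HostDb       { long hostid(long pgid); }
    interface LinkDb       { long[] out(long pgid); long[] in(long pgid); }
    interface TermVectorDb { Map<String, Integer> termVector(long pgid); }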

  20. URL database: prefix compression

  Sorted URLs:
     http://kiva.net/~markh/surnames.html
     http://kiwi-us.com/~amigo/links/index.htm
     http://kiwi.emse.fr/
     http://kiwi.etri.re.kr/~khshim/internet/bookmark.html
     http://kiwi.etri.re.kr/~ksw/bookmark
     http://kiwi.futuris.net/linen
     http://kiwi.futuris.net/linen/special/backiss.html

  Prefix compressed (length of prefix shared with the previous URL, plus suffix):
     0  http://kiva.net/~markh/surnames.html
     9  wi-us.com/~amigo/links/index.htm
     11 .emse.fr/
     13 tri.re.kr/~khshim/internet/bookmark.html
     25 sw/bookmark
     12 futuris.net/linen
     29 /special/backiss.html
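A minimal sketch of this prefix compression, assuming the URL list is sorted so consecutive entries share long prefixes; each entry stores the shared-prefix length and the remaining suffix, and decompression rebuilds URLs front to back:

    import java.util.ArrayList;
    import java.util.List;

    class PrefixCodec {
        // One compressed entry: chars shared with the previous URL + suffix.
        record Entry(int shared, String suffix) {}

        static List<Entry> compress(List<String> sortedUrls) {
            List<Entry> out = new ArrayList<>();
            String prev = "";
            for (String url : sortedUrls) {
                int n = 0;
                int max = Math.min(prev.length(), url.length());
                while (n < max && prev.charAt(n) == url.charAt(n)) n++;
                out.add(new Entry(n, url.substring(n)));
                prev = url;
            }
            return out;
        }

        static List<String> decompress(List<Entry> entries) {
            List<String> out = new ArrayList<>();
            String prev = "";
            for (Entry e : entries) {
                prev = prev.substring(0, e.shared) + e.suffix;
                out.add(prev);
            }
            return out;
        }
    }

Running compress on the seven URLs above reproduces the slide's shared-prefix lengths (0, 9, 11, 13, 25, 12, 29).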

  21. URL compression
  • Prefix compression
    • 44 → 14.5 bytes/URL
    • Fast to decompress (~10 µs)
  • ZIP compression
    • 14.5 → 9.2 bytes/URL
    • Slow to decompress (~80 µs)

  22. Term vector basics
  • The basic abstraction for information retrieval
  • Useful for measuring the "semantic" similarity of text
  • A page's row in a page-by-term count table is its "term vector" (example table omitted from the transcript)
    • Columns are word stems and phrases
  • Trying to capture "meaning"

  23. Compressing term vectors (sketch below)
  • Sparse representation
    • Only store columns with non-zero counts
  • Lossy representation
    • Only store "important" columns
    • "Importance" is determined by:
      • Count of the term on the page (high ==> important)
      • Number of pages with the term (low ==> important)
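A sketch of this sparse, lossy representation: keep only non-zero counts, score each term by a TF-IDF-style importance (on-page count times the log of inverse document frequency; the exact formula here is an assumption, not necessarily the one used), and retain the top k:

    import java.util.Comparator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    class TermVectorCompressor {
        static Map<String, Integer> compress(Map<String, Integer> counts,
                                             Map<String, Integer> pagesWithTerm,
                                             int totalPages, int k) {
            Map<String, Integer> kept = new LinkedHashMap<>();
            counts.entrySet().stream()
                .filter(e -> e.getValue() > 0)          // sparse: drop zero counts
                .sorted(Comparator.comparingDouble((Map.Entry<String, Integer> e) ->
                    e.getValue() * Math.log((double) totalPages /
                        pagesWithTerm.getOrDefault(e.getKey(), 1))).reversed())
                .limit(k)                                // lossy: top-k terms only
                .forEach(e -> kept.put(e.getKey(), e.getValue()));
            return kept;
        }
    }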

  24. TVDB Builder

  25. Applications
  • Categorizing pages
  • Topic distillation
  • Filtering pages
  • Identifying languages
  • Identifying running text
  • Relevance feedback ("more like this")
  • Abstracting pages

  26. Categorization
  • "Bulls take over"

  27. How to categorize a page
  • Off line:
    • Collect a training set of pages per category (~30K pages)
    • Combine the training pages into category vectors
      • ~10K terms per category vector
  • On line (sketch below):
    • Use the term vector DB to look up the page's vector
    • Find the category vector that best matches this page vector
      • Use a Bayesian classifier to match vectors
    • Give no category if the match is not "definitive"
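A minimal sketch of the on-line step, assuming a naive-Bayes-style match between the page vector and per-category log-probability vectors; the "definitive" test here (the best score must beat the runner-up by a margin) is an invented stand-in for whatever threshold the real classifier used:

    import java.util.Map;

    class Categorizer {
        // categoryLogProbs maps each category name to its log P(term | category).
        static String categorize(Map<String, Integer> pageVector,
                                 Map<String, Map<String, Double>> categoryLogProbs,
                                 double margin) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY, second = Double.NEGATIVE_INFINITY;
            for (var entry : categoryLogProbs.entrySet()) {
                double score = 0;
                for (var t : pageVector.entrySet())
                    score += t.getValue() *
                             entry.getValue().getOrDefault(t.getKey(), -10.0); // smoothed default
                if (score > bestScore) { second = bestScore; bestScore = score; best = entry.getKey(); }
                else if (score > second) { second = score; }
            }
            // Give no category if the match is not "definitive".
            return (bestScore - second >= margin) ? best : null;
        }
    }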

  28. Topic drift in topic distillation
  • Some Web IR algorithms have this structure:
    • Compute a "seed set" for a query
    • Find a neighborhood by following links
    • Rank this neighborhood
  • Topic drift (a problem):
    • The neighborhood graph includes off-topic nodes
    • "Download Microsoft Explorer" → the MS home page

  29. Avoiding topic drift with term vectors
  • Combine the term vectors of the seed set into a topic vector
  • Detecting topic drift in neighboring nodes:
    • Compare the topic vector with the node's term vector
      • An inner product works fine (sketch below)
    • Expunge or down-weight off-topic nodes
  • Integration of the feature databases helps!
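The drift test reduces to a sparse inner product between the topic vector and a node's term vector; a minimal sketch:

    import java.util.Map;

    class DriftDetector {
        // Sparse inner product; a low score flags an off-topic node
        // to expunge or down-weight.
        static double innerProduct(Map<String, Double> topic, Map<String, Integer> node) {
            double sum = 0;
            for (var e : node.entrySet())  // iterate only over non-zero entries
                sum += e.getValue() * topic.getOrDefault(e.getKey(), 0.0);
            return sum;
        }
    }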

  30. Link database
  • Goals
    • Fit the links into RAM (fast lookup)
    • Build it in 24 hours
  • Applications
    • Page ranking
    • Web structure
    • Mirror site detection
    • Related page detection

  31. Link storage: baseline design
  • Two arrays: a Starts array (... 104 105 106 107 108 ...) gives each source page's offset into an Id array of destination page ids (106 115 101 72 208 111 ...)

  32. Link storage: deltas
  • Store small deltas between nearby ids instead of raw destination ids
  • Example: Ids 106 115 101 72 208 111 become deltas 2 9 -4 -31 136 4

  33. Link storage: compression
  • Variable-length encoding of the deltas: 1.7 bytes/link (byte-oriented sketch below)
  • Example: deltas 2 9 -4 -31 136 4 are encoded in 4, 8, 8, 8, 12, and 8 bits respectively
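A byte-oriented sketch of the same idea (the slide's encoding packs at bit granularity, down to 4 bits per delta, which this simpler variant does not reproduce): deltas can be negative, so they are zigzag-mapped to keep small magnitudes small, then written with a variable-length 7-bits-per-byte code:

    import java.io.ByteArrayOutputStream;

    class LinkCodec {
        // Encodes one adjacency list.  Assumes destination pgids are sorted;
        // the first delta is taken against the source pgid, later deltas
        // against the previous destination (an assumption about the layout).
        static byte[] encode(long src, long[] sortedDests) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            long prev = src;
            for (long d : sortedDests) {
                long delta = d - prev;
                long z = (delta << 1) ^ (delta >> 63);   // zigzag: small |delta| -> small code
                while ((z & ~0x7FL) != 0) {              // 7 payload bits per byte,
                    out.write((int) ((z & 0x7F) | 0x80)); // high bit = "more bytes follow"
                    z >>>= 7;
                }
                out.write((int) z);
                prev = d;
            }
            return out.toByteArray();
        }
    }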

  34. LDBng

  35. The future of Web Archaeology
  • Driving applications
    • Web search: "finding things on the web"
    • Page classification (topic, community, type)
    • Purpose-specific search
    • Web "asset management" (what's on my site?)
    • Automated information extraction (price robots)
  • A multi-billion-page web
  • Dynamics
