
Archive-It Architecture Introduction

April 18, 2006. Dan Avery, Internet Archive.


Presentation Transcript


  1. Archive-It Architecture Introduction • April 18, 2006 • Dan Avery • Internet Archive

  2. Archive-It Components • Crawling • User Interface • Storage • Playback • Text Indexing • Integration

  3. Component Integration

  4. Crawling • Heritrix (http://crawler.archive.org/) • Java application • Open source (LGPL) • Crawls for completeness/depth • Highly configurable

  5. Crawling: Distributed Crawling • Heritrix Cluster Controller (HCC) • Java component, open source, developed by IA • http://crawler.archive.org/hcc • Provides proxy access to a pool of Heritrix instances through a JMX interface • Provides crawler control and status • Currently controlling 33 crawler instances on three commodity dual-Opteron machines; upper bound unknown
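The proxy idea on this slide can be sketched roughly as follows, in Python rather than Java and without JMX: a controller holds a pool of crawler handles, hands each new job to one of them, and reports status for the whole pool. The names `CrawlerInstance`, `ClusterController`, `submit`, and `status` are hypothetical and are not the HCC API; the least-loaded selection policy is also an assumption, since the slides do not say how an instance is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlerInstance:
    """Stand-in for one remote Heritrix instance (hypothetical)."""
    name: str
    jobs: list = field(default_factory=list)


class ClusterController:
    """Single point of access to a pool of crawler instances (hypothetical)."""

    def __init__(self, instances):
        self.instances = list(instances)

    def submit(self, job_name):
        # Assumed policy: pick the instance with the fewest active jobs.
        target = min(self.instances, key=lambda c: len(c.jobs))
        target.jobs.append(job_name)
        return target.name

    def status(self):
        # Aggregate per-instance job counts for centralized monitoring.
        return {c.name: len(c.jobs) for c in self.instances}


pool = ClusterController(CrawlerInstance(f"crawler-{i}") for i in range(3))
assigned = [pool.submit(f"job-{n}") for n in range(4)]
```

Callers only ever talk to the controller, which is what lets the pool grow without the web application needing to know how many crawlers exist.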

  6. Archive-It Web Application • User Interface and Crawl Scheduling • Gets seed URLs and crawl parameters from users • Schedules new periodic crawls • Talks to crawler pool through HCC • Provides access, search, and crawl history UI
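As an illustration of what "schedules new periodic crawls" could involve, here is a minimal sketch that expands a first run time and a frequency into upcoming start times. The frequency names, the intervals, and the function `next_crawls` are assumptions made for the example; the slides do not describe the actual scheduling options.

```python
from datetime import datetime, timedelta

# Assumed frequency names and intervals; the slides only say "periodic".
INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_crawls(first_run, frequency, count):
    """Return the next `count` start times for a periodic crawl."""
    step = INTERVALS[frequency]
    return [first_run + i * step for i in range(count)]


schedule = next_crawls(datetime(2006, 4, 18, 2, 0), "weekly", 3)
```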

  7. Storage • archive.org ARC repository • Custom Perl system • Simple storage on primary/backup pairs • Monthly MD5 digest verification • Robust, non-proprietary file format • Alexandria (Egypt)/Amsterdam
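The digest check on primary/backup pairs can be sketched as below. This is Python rather than the custom Perl system the slide mentions, and the function names, file names, and the idea of comparing both copies against a digest recorded at write time are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pair(primary, backup, recorded_digest):
    """Check both copies of a file against the recorded digest."""
    return {
        "primary_ok": md5_of(primary) == recorded_digest,
        "backup_ok": md5_of(backup) == recorded_digest,
    }


with tempfile.TemporaryDirectory() as tmp:
    primary = Path(tmp) / "primary.arc"
    backup = Path(tmp) / "backup.arc"
    primary.write_bytes(b"example ARC payload")
    backup.write_bytes(b"example ARC payload")
    recorded = md5_of(primary)
    backup.write_bytes(b"corrupted")  # simulate damage to the backup copy
    result = verify_pair(primary, backup, recorded)
```

Checking each copy against a stored digest, rather than only against each other, makes it possible to tell which side of the pair is damaged.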

  8. Access • Internet Archive Wayback Machine • Replaying archived web pages since 2001 • Current IA version written in Perl and C, with components distributed across various machines • Current version is not open source, but an open source beta (in Java) is available now

  9. Full-Text Indexing • Nutch (http://nutch.org) • NutchWAX (http://archive-access.sf.net) additions create and search indexes of stored ARC files • Standard text search plus link analysis • Can search by date instead of relevance, useful for individual archives
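To show what ordering by date instead of relevance means in practice, here is a toy inverted index that returns matching captures sorted by capture date. It is not Nutch or NutchWAX code; the sample captures, `build_index`, and `search_by_date` are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical captures: (url, capture date, extracted text).
CAPTURES = [
    ("http://example.org/a", date(2006, 1, 10), "web archive crawl"),
    ("http://example.org/b", date(2006, 3, 5), "archive storage"),
    ("http://example.org/c", date(2006, 2, 1), "crawl schedule"),
]


def build_index(captures):
    """Map each term to the set of capture ids that contain it."""
    index = defaultdict(set)
    for doc_id, (_, _, text) in enumerate(captures):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_by_date(index, captures, term, newest_first=True):
    """Return URLs matching `term`, ordered by capture date, not by score."""
    hits = index.get(term.lower(), set())
    ordered = sorted(hits, key=lambda i: captures[i][1], reverse=newest_first)
    return [captures[i][0] for i in ordered]


INDEX = build_index(CAPTURES)
results = search_by_date(INDEX, CAPTURES, "archive")
```

Within a single partner's archive, where most hits come from the same few sites, date order is often a more useful default than a relevance score.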

  10. Text Indexing Challenges • Some parts are distributable, some are not • Incremental indexing: goal of new crawls in index within 72 hours • Working on an Archive-It-usable map/reduce version (July) • In the meantime, a lot of workarounds

  11. Integration • Group of Perl and bash scripts; planning more complex than the execution • Most components available individually • Decentralized control, centralized monitoring • Each component operates almost entirely independently

  12. The Big Picture

  13. Future Challenges • Crawler trap detection • Scalability • Current setup can accommodate 300 partners at current crawling rates • During pilot we crawled/indexed/stored just over 100,000,000 documents (~4TB) in eight weeks • More machines can be easily added to storage and crawling clusters
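The pilot figures above imply some rough averages, worked out here. The calculation assumes "~4TB" means 4 × 10^12 bytes and that crawling ran continuously for the full eight weeks, neither of which the slides state; it says nothing about how the 300-partner estimate was derived.

```python
DOCS = 100_000_000            # "just over 100,000,000 documents"
BYTES = 4 * 10**12            # "~4TB", assumed decimal terabytes
SECONDS = 8 * 7 * 24 * 3600   # eight weeks, assumed continuous operation

avg_doc_bytes = BYTES / DOCS          # average stored size per document
docs_per_second = DOCS / SECONDS      # sustained document rate
bytes_per_second = BYTES / SECONDS    # sustained storage rate

print(f"{avg_doc_bytes / 1000:.0f} KB per document")
print(f"{docs_per_second:.1f} documents per second")
print(f"{bytes_per_second / 1e6:.2f} MB per second")
```

Under those assumptions the pilot averaged about 40 KB per document and roughly 20 documents per second across the whole pipeline.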

  14. Scalability • Current Nutch is between versions • Old version has some non-distributable pieces • New version is much more distributable and scalable (map/reduce, Hadoop), but not ready for incremental indexing
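For readers unfamiliar with the map/reduce style mentioned here, the following is a minimal single-process sketch of building an inverted index in two phases. It is plain Python, not the Hadoop or Nutch API, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_phase(pairs):
    """Group doc ids by term to form sorted posting lists."""
    postings = defaultdict(list)
    for term, doc_id in pairs:
        postings[term].append(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {1: "archive the web", 2: "crawl the web"}
mapped = chain.from_iterable(map_phase(i, t) for i, t in docs.items())
index = reduce_phase(mapped)
```

Because each call to `map_phase` depends on only one document, the map step can be spread across many machines, which is the property that makes this style more distributable.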

  15. Looking ahead • After basic UI/archiving/indexing... • Time-based search UI • Analyzing archives for research and ongoing collection improvement • Content classification • Rate of change • New site suggestions

  16. http://www.archive-it.org
