Open Source Tools for Archive-It: Unifying, Managing, and Designing Archive Collections

An overview of the open-source tools that power Archive-It: Heritrix for collecting, Wayback for displaying, and NutchWAX for searching, together with the WARC file format for storing. Archive-It unifies these tools so partners can design, manage, and monitor their archive collections. The talk also covers why open source matters: better quality, higher reliability, more flexibility, and lower cost.


Presentation Transcript


  1. Open Inside: The Open Source Tools that Power Archive-It • Archive-It Partners 2009 • Gordon Mohr, Internet Archive • November 4, 2009

  2. Archive-It Unifies Many Tools • Archive-It: managing, designing, monitoring, scheduling, reporting • Integrated Tools: collecting, storing, displaying, searching

  3. Open Source & Standards from IA • 3 open source software projects • Heritrix: collecting • Wayback: displaying • NutchWAX: searching • 1 co-developed ISO standard • WARC File Format: storing

  4. Open Source from Elsewhere • Linux • Apache/Tomcat • MySQL • Lucene-Nutch-Hadoop

  5. Why Open Source? • Open Source Initiative says: “Open source is a development method for software that harnesses the power of distributed peer review and transparency of process. The promise of open source is better quality, higher reliability, more flexibility, lower cost, and an end to predatory vendor lock-in.” • More than access to source code: Right to change, reuse, extend • Wins: • Harmonize formats, practices • Avoid duplication of effort • Reduce costs

  6. Projects Genesis: 2003 • Internet Archive wanted more control over its own software & collections • Discussions with national libraries: USA, Canada, UK, France, Iceland, Sweden, Norway, Finland, Denmark, Italy, Australia • Desire to share tools, formats, and experiences; avoid duplicated effort and closed, inflexible tools • Formed: International Internet Preservation Consortium (IIPC), http://www.netpreserve.org

  7. Heritrix

  8. What is Heritrix? Open-source, extensible, web-scale, archival-quality web crawling software • http://crawler.archive.org

  9. Heritrix Motivations • Deeper, specialized, in-house crawling • Open source • Encourage collaboration on features and best practices • Avoid duplication of work, incompatibilities • Archival-quality • Perfect copies • Keep up with changing web • Meet evolving needs of Internet Archive and International Internet Preservation Consortium

  10. Heritrix Overview • Heritrix means heiress • Java, modular • Project website: http://crawler.archive.org • News, downloads, documentation, issue-tracking • Sourceforge: open source hosting site • Source-code control (SVN) • Official downloads • “Lesser” GPL or Apache license – easy reuse • Outside contributions welcome

  11. Milestones • 1.0 release in March 2004 • Major releases since: • 1.2 new scope options (2004) • 1.4 improved memory use (2005) • 1.6 remote control (2005) • 1.8 scaling (2006) • 1.10 protocols, formats, fixes (2006) • 1.12 “smart” duplicate reduction (2007) • 2.0 “smart” prioritization (2008) • 1.14 WARC, performance (2008-2009)

  12. Archive-It Uses Heritrix 1.14.3+ • AKA “1.15.4” • WARC/1.0 • Many minor fixes • Same as all contract/national crawls • Available as developer build • Will become 1.14.4

  13. Heritrix – future • Next major release: Heritrix 3.0 • Crawl configuration by ‘Spring’ • Scriptable configuration • Web-service remote control • Other upcoming priorities • “Smart” continuous/automatic revisits (3.2) (from change detection to prediction) • Rich media improvements • Spam/trap/mirror suppression • Automate ever-larger crawls

  14. Heritrix – more info • Project website • http://crawler.archive.org • Source code • Sourceforge ‘SVN’ • Discussion • http://tech.groups.yahoo.com/group/archive-crawler/ • Issues/Bugs • http://webarchive.jira.com/browse/HER • Key IA staff: • Steve Sisney, Gordon Mohr

  15. Wayback

  16. What is Wayback? Open source, Java, modular, scalable, customizable web archive access tool • http://archive-access.sourceforge.net/projects/wayback

  17. Wayback – the beginning • Inception in 2005 • Aim: URL-based browsing ‘as if’ at previous dates • Contrasts with classic: • Open source, diverse installs • Java vs. Perl/C • Refactored: • Many extension points • Basis for new features & experiments • First release: “0.2.0” (December 2005) • Now at 1.4.2 (July 2009)

  18. Wayback Features • Starting with a URL: • See list of captures by date • See extension URLs (same site) • View a capture • Once browsing (“replay”): • Browse web ‘as it was’ • Best-match clickthroughs
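
A note on “best-match clickthroughs”: when a replayed page links to a URL that was not captured at exactly the same moment, a reasonable reading is that replay falls back to the capture nearest in time. The following is a minimal sketch of that idea only. The 14-digit timestamp format and the function name `closest_capture` are assumptions for illustration, not Wayback's actual code.

```python
from datetime import datetime

# Assumed capture timestamp layout (YYYYMMDDhhmmss); hypothetical for this sketch.
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def closest_capture(captures, requested):
    """Return the capture timestamp nearest in time to the requested one."""
    if not captures:
        raise ValueError("no captures available for this URL")
    target = datetime.strptime(requested, TIMESTAMP_FORMAT)
    # Compare absolute time differences so earlier and later captures both qualify.
    return min(
        captures,
        key=lambda ts: abs(datetime.strptime(ts, TIMESTAMP_FORMAT) - target),
    )


if __name__ == "__main__":
    captures = ["20050301120000", "20071115083000", "20090704000000"]
    print(closest_capture(captures, "20090101000000"))
```

A real index would hold many captures per URL and would use a sorted structure rather than a linear scan; the linear `min` is kept here for clarity.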

  19. Wayback: Modular Components • Query User Interface • Calendar, Search Engine, XML • Replay User Interface • Archival URL, Timeline, Proxy • Resource Index • CDX, BDB, Remote, Nutch, Aggregated • Resource Store • Local ARC, HTTP 1.1 Remote ARC
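
A note on the “Archival URL” replay mode: the slide does not spell out the URL layout, so the sketch below assumes the commonly seen form of a path prefix, a 14-digit capture timestamp, and then the original URL. The pattern and both helper names are hypothetical and only illustrate how a replay request could be split into “when” and “what”.

```python
import re

# Assumed layout: /<prefix>/<14-digit timestamp>/<original URL>
ARCHIVAL_URL = re.compile(r"^/(?P<prefix>[^/]+)/(?P<timestamp>\d{14})/(?P<url>.+)$")


def parse_archival_url(path):
    """Split a replay path into (timestamp, original_url)."""
    match = ARCHIVAL_URL.match(path)
    if match is None:
        raise ValueError(f"not an archival URL path: {path!r}")
    return match.group("timestamp"), match.group("url")


def make_archival_url(prefix, timestamp, url):
    """Build a replay path; this is what link rewriting would emit for each link."""
    return f"/{prefix}/{timestamp}/{url}"


if __name__ == "__main__":
    path = make_archival_url("web", "20091104000000", "http://www.archive.org/")
    print(path)
    print(parse_archival_url(path))
```

Because the date travels inside the URL, every link on a replayed page can be rewritten to carry the same date, which keeps a browsing session anchored to one point in time.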

  20. Archive-It Uses Wayback 1.4.2+ • UI customized • Adds server-side rewriting-mode • Available from project source-control • Next major release: 1.6.0

  21. Wayback – more info • Website • http://archive-access.sourceforge.net/projects/wayback/ • Source code • Sourceforge ‘SVN’ • Discussion • https://lists.sourceforge.net/lists/listinfo/archive-access-discuss • Issues/Bugs • https://webarchive.jira.com/browse/ACC • Key IA staff: • Brad Tofel

  22. NutchWAX

  23. What is NutchWAX? Open source, Java, full-text indexing and end-user querying for web archives • Built on Lucene/Nutch/Hadoop • http://archive-access.sourceforge.net/projects/nutch

  24. NutchWAX Background • Lucene • Open-source Java full-text indexing • Popular, mature • Nutch • Extensions to Lucene • For web content, access, scale • Hadoop • Spun off from Nutch • Inspired by Google’s MapReduce

  25. NutchWAX • Inception in 2005 • Nutch Web Archive eXtensions • Utilities for using (W)ARCs as Nutch input • Configuration for date dimension • Handle repeated URLs • First release: “0.2.1” (July 2005) • Now at 0.12.8 (September 2009)
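
A note on the “date dimension” and “repeated URLs”: a live-web search index usually holds one document per URL, while an archive holds many captures of the same URL. One way to picture the difference is to key each indexed document by URL plus capture timestamp and let queries filter by date range. The toy in-memory class below illustrates only that idea; `CaptureIndex` and its methods are hypothetical and do not reflect how NutchWAX or Lucene store data.

```python
from collections import defaultdict


class CaptureIndex:
    """Toy full-text index where each (url, timestamp) pair is its own document."""

    def __init__(self):
        # url -> {timestamp: text}; repeated URLs coexist under different dates.
        self._by_url = defaultdict(dict)

    def add(self, url, timestamp, text):
        self._by_url[url][timestamp] = text

    def search(self, term, start="00000000000000", end="99999999999999"):
        """Return sorted (url, timestamp) hits containing term within the date range."""
        needle = term.lower()
        hits = []
        for url, versions in self._by_url.items():
            for timestamp, text in versions.items():
                # 14-digit timestamps compare correctly as plain strings.
                if start <= timestamp <= end and needle in text.lower():
                    hits.append((url, timestamp))
        return sorted(hits)


if __name__ == "__main__":
    index = CaptureIndex()
    index.add("http://example.com/", "20080101000000", "Old news about crawling")
    index.add("http://example.com/", "20090101000000", "Fresh news about archiving")
    print(index.search("news"))
    print(index.search("news", start="20090101000000"))
```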

  26. Archive-It Uses NutchWAX 0.12.8 • Latest official release • Recent changes driven by Archive-It • Caching support • Index maintenance processes (merging) • ‘Reboost’ for reranking

  27. NutchWAX – more info • Website • http://archive-access.sourceforge.net/projects/nutchwax/ • Source code • Sourceforge ‘SVN’ • Discussion • https://lists.sourceforge.net/lists/listinfo/archive-access-discuss • Issues/Bugs • https://webarchive.jira.com/browse/WAX • Key IA staff: • Aaron Binns

  28. WARC

  29. What is WARC? IIPC, ISO standard, flexible, simple format for web archive files • http://tinyurl.com/2eusle (drafts)

  30. WARC Overview • WARC = Web ARChive file format • Next generation of ARC, called for by IIPC • ARC format created by the Internet Archive • Over 1PB of ARCs gathered since 1996

  31. WARC Goals • Store arbitrary metadata (e.g., subject classifier, discovered language, encoding) • Data compression and record integrity • Store all control information from the harvesting protocol (e.g., request headers) • Store the results of data migrations • Store a duplicate detection event • Distinguishable from the legacy ARC • Globally unique record identifiers • Deterministic handling of long records (e.g., truncation, segmentation).
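
A note on how two of these goals look in practice: each WARC record carries a globally unique identifier and can hold arbitrary metadata in its block. The sketch below assembles a simplified `metadata` record with a `urn:uuid` record ID. It is an illustration under stated assumptions, not a validated WARC writer: the function name is hypothetical, only a handful of header fields are emitted, and compression and digests are left out.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, target_uri, block):
    """Assemble one simplified WARC/1.0 record as bytes (illustrative only)."""
    fields = [
        ("WARC-Type", record_type),
        # A fresh UUID per record gives a globally unique identifier.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        # The declared length lets a reader skip or truncate long blocks reliably.
        ("Content-Length", str(len(block))),
    ]
    header = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in fields)
    # A blank line ends the header; two CRLFs close the record.
    return header.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"


if __name__ == "__main__":
    # The block content here is made-up example metadata.
    record = build_warc_record("metadata", "http://example.com/", b"language: en\r\n")
    print(record.decode("utf-8"))
```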

  32. ARC vs. WARC • Both are a simple sequence of content blocks, each introduced by a small text header • ARCs only 1-line header + protocol response • WARCs add: • multi-line header with extensible fields • New record types: • Request, Response, Resource • Metadata, Revisit, Conversion, Warcinfo, Continuation
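
A note on the multi-line header: a WARC record starts with a version line, continues with named fields, and ends its header with a blank line, after which `Content-Length` bytes of content follow. The sketch below reads one uncompressed record from bytes. It is a minimal illustration with a hypothetical function name and a hand-written sample record; it ignores folded header lines and gzip, which a real reader would need to handle.

```python
def parse_warc_record(data):
    """Parse one uncompressed WARC record into (version, fields, block)."""
    header_bytes, separator, rest = data.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line terminating the record header")
    lines = header_bytes.decode("utf-8").split("\r\n")
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError(f"unexpected version line: {version!r}")
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons too.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]


if __name__ == "__main__":
    sample = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
        b"WARC-Date: 2009-11-04T00:00:00Z\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        b"Content-Length: 17\r\n"
        b"\r\n"
        b"HTTP/1.1 200 OK\r\n"
        b"\r\n\r\n"
    )
    version, fields, block = parse_warc_record(sample)
    print(version, fields["WARC-Type"], block)
```

The extensible named fields are what allow new record types and extra metadata without breaking older readers, in contrast to ARC's fixed single-line header.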

  33. What does the future hold?

  34. What does the future hold? Expand and improve toolset • Driven by user requests, contributions, sponsors • Unify access tools • Verify and improve internationalization

  35. What does the future hold? Keep up with the web • New formats, protocols, design techniques • Content challenges: • Deep content • Spam • Interactive applications / AJAX / JavaScript

  36. Thank You Gordon Mohr Internet Archive Web Group gojomo@archive.org
