Archiving and Preserving the Web

Kristine Hanna, Internet Archive
July 2008

Open Source Technology
Primarily developed by Internet Archive and IIPC
• Heritrix: web harvester that captures the content
• Wayback Machine: access tool for rendering and viewing content. Displays archived web pages: surf the web as it was.
• NutchWAX: search engine. Standard full-text search

Heritrix Development
2.0 (2008)
• Duplicate reduction (saving storage)
• Prioritization of seeds, domains, URLs
• Adapting to WARC format
2.2 (September 2008) and 2.4 (2009)
• Adaptive and continuous revisit crawling at a large scale
• Ability to run one never-ending "master crawl" on the same "scope" without breaking up the crawl
• Improved checkpointing for stable long-running crawls: essentially a "snapshot" of the entire state of the crawl, so if anything goes wrong, we can pick up from exactly that snapshot point, with all internal queues and counters in exactly the same state
• Better crawling of web video content
• Improved usability and documentation
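The checkpointing idea (snapshot the entire crawl state, then resume with queues and counters exactly as they were) can be illustrated with a toy sketch. This is not Heritrix's implementation: the CrawlState class, its fields, and the JSON snapshot format are all hypothetical and chosen only to show the principle.

```python
import json
from collections import deque


class CrawlState:
    """Toy crawl state: a frontier queue plus a counter (hypothetical, not a Heritrix class)."""

    def __init__(self, seeds=()):
        self.frontier = deque(seeds)  # URLs still waiting to be fetched
        self.fetched = 0              # number of URLs processed so far

    def step(self):
        """Pretend to fetch the next URL; return it, or None if the frontier is empty."""
        if not self.frontier:
            return None
        url = self.frontier.popleft()
        self.fetched += 1
        return url

    def checkpoint(self):
        """Serialize the whole state (queue and counter) into one snapshot string."""
        return json.dumps({"frontier": list(self.frontier), "fetched": self.fetched})

    @classmethod
    def restore(cls, snapshot):
        """Rebuild a state identical to the one that produced the snapshot."""
        data = json.loads(snapshot)
        state = cls(data["frontier"])
        state.fetched = data["fetched"]
        return state
```

A crawl restored from a snapshot continues with the same next URL and the same counter values, regardless of what happened to the original process after the snapshot was taken.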

NutchWAX Development
0.12 (September)
• De-duplication of archive content during indexing
• Adds support for WARC files
• Addresses high-priority bugs
• Built on most recent versions of Nutch/Hadoop
• Distributed computing system scales to 100 million documents
• OpenSearch interface to integrate with numerous third-party systems
1.0 (December)
• Improve and simplify installation, indexing and service deployment of Nutch
• Provide NutchWAX documentation
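De-duplication during indexing can be pictured as skipping any record whose content has already been seen. The sketch below assumes exact-content matching with a SHA-1 digest; the slides do not say how NutchWAX detects duplicates, and the function name and record shape are hypothetical.

```python
import hashlib


def dedupe_records(records):
    """Yield (url, payload) pairs whose payload digest has not been seen before.

    `records` is any iterable of (url, payload_bytes) tuples. Only the first
    record carrying a given payload is passed on for indexing.
    """
    seen = set()
    for url, payload in records:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            continue  # identical content already indexed, skip it
        seen.add(digest)
        yield url, payload
```

Keying on a digest rather than the URL means the same page captured repeatedly without changes is indexed only once, which is where the storage and index savings come from.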

Wayback Development
1.4 (July)
• Configurable/customizable error messages per website
• Support for exclusions framework, including date ranges
• Anchoring date during replay to prevent "drift" through a replay session
• Anchoring window, to limit embedded content to a defined time range within a replay session
• Index format change to "identity format"
• Proxy mode embedding of timelines, banners, etc.
1.6 (December)
• Performance optimizations and better documentation
• Ability to play back https
• Improved packaging, installation and documentation
• Formal support for Windows platform
• Improved video replay
• Thumbnails and/or document titles in the UI
• In-page difference between two captures (visual comparison as you move through time)
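One way to read the anchoring features: every request in a replay session is resolved against a fixed anchor date rather than the date of the previously viewed page, and captures outside a window around that anchor are rejected. The sketch below illustrates that reading only; pick_capture and its parameters are hypothetical and are not part of the Wayback code base.

```python
from datetime import datetime, timedelta


def pick_capture(captures, anchor, window=None):
    """Return the capture time closest to `anchor`, or None if none qualifies.

    captures: iterable of datetime objects, one per archived copy of a URL.
    anchor:   the fixed date chosen for the replay session.
    window:   optional timedelta; captures farther than this from the anchor
              are ignored, so embedded content stays in a defined time range.
    """
    candidates = [
        c for c in captures
        if window is None or abs(c - anchor) <= window
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(c - anchor))
```

Because the anchor stays the same for the whole session, following a chain of links cannot move the viewer steadily away from the date they started with.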

IA Projects Using Open Source Tools
Collaborating with Partners

National Libraries
Ongoing thematic crawls, event-based harvests, and domain snapshots
• Iceland
• Czech Republic
• Germany
• France
• UK
• Ireland
• Norway
• Australia
• Denmark
• US
• Sweden

Topic/Event Crawls
Library of Congress
• National elections: 2000, 2002, 2004, 2006, 2008
• Supreme Court nomination
• War in Iraq
• Crisis in Darfur
• Egyptian elections
• Olympics
• .gov
• Papal election

Community Web Archiving
• Hurricane Katrina collection
  • Contributors: The Internet Archive, the Library of Congress, CDL, a group of universities, and many individual contributors
  • Spans content generated between September 4 and November 8, 2005
  • 1,700 web sites / 61 million pages, all text searchable
  • Public access at http://websearch.archive.org/katrina/
• Tsunami collection
  • Contributors: The Internet Archive, Singapore Internet Research Centre, Web Archivist
  • 1,500 sites / 4 million pages, all text searchable
  • Public access at http://tsunami.archive.org/

Virginia Tech
University web archiving as a result of crisis and tragedy
• Tragedy at Virginia Tech: 3 million documents, all text searchable, accessible to the public at http://www.dl-vt-416.org/
• Northern Illinois University

World Wide Web of Humanities
• Collaboration between IA, Hanzo Web and Oxford Internet Institute
• Funded by NEH and JISC
• Objective is to support new methodologies for digital humanities research built around large collections of web and digitized data, using automated tools to extract, index, and analyze the data
• Chose a well-rounded set of humanities materials that will allow us to test the tools against a variety of document and resource types
• Will build focused research collections around the topics of World Wars I and II

K-12
• Collaboration with LOC and CDL
• Chose 3 high schools from around the country (California, Illinois and Louisiana)
http://www.archive-it.org/k12

Around the World in Two Billion Pages
• Mellon Award: unique global snapshot of the Web
• Crawled from June 2007 to December 2007
• Over 60 countries participated
• Started with 18,000 seeds (websites)
• Completed with 2 billion pages
http://wa.archive.org/aroundtheworld/

Archive-It
(state archives, state university and public libraries, university libraries and non-governmental nonprofits)
• Web-based application that allows users to harvest, manage and preserve collections of born-digital content
• Own institution's websites, topics/subjects/events and/or government records
• Functions include: setting crawl frequencies, defining scope, cataloging with metadata, managing and analyzing collections, and full-text search
• Includes hosting and storage

Video
• 2007:
  • IA engineers crawled over one million YouTube videos. Broad crawls off of home page links (most popular, most viewed)
  • Started crawling embedded videos for LOC Election '08 collection
• 2008:
  • NDIIPP project with UNC: 8 weekly crawls
  • Broad crawls: 2 weekly crawls from YouTube home page, prioritized based on popularity
  • Focused/topical crawls: 3 weekly crawls with specific IDs or search queries provided by UNC
  • Broad and/or focused: last 3 crawls (TBA)

Video Harvests
• Difficult to interact with YouTube and other proprietary Flash video players
• Configuration is a moving target, since these video hosting sites may change their software at any time
• Highly customized scoping rules need to be added to capture all the URLs relevant to embedded Flash videos
• Replay (through the Wayback Machine) is complicated by some of the same issues we face with Flash in general

What's Next for Internet Archive and Web Archiving
• Collaboration and partnerships
• Continue to act as a technology partner in providing web archiving services
• Continue to develop open source software
• Develop common tools, storage formats and standards through the IIPC, and with our partners
• Multiple copies around the world
  • Within IA's own repository, and with partners such as LC, BnF, Library of Alexandria
