What is the Internet Archive

We are a Digital Library
Mission Statement: Universal access to human knowledge
Founded in 1996 by Brewster Kahle in San Francisco, California
Officially designated a library by the state of California (2007)

Archive-It
www.archive-it.org
First deployed in February 2006
• Web-based application that allows users to create, manage, and preserve collections of digital web content
• Functions include: selection and scoping, harvesting, reports and analysis of captures, cataloging with metadata, full-text search
• Archived content includes: text, HTML, video, audio, images, PDF, online newspapers, social networking, and more
• Includes hosting, access, and storage (primary and backup)
• Archived content is available for viewing 24 hours after a crawl has completed

Open source technology primarily developed by the Internet Archive, the open source community, and the IIPC
The Tools Behind Archive-It
• Heritrix: web crawler that crawls and captures pages
• Wayback Machine: access tool for rendering and viewing archived web pages; surf the web as it was
• NutchWAX: open source search engine providing standard full-text search

Who Uses Archive-It
130 partners in 42 states and 12 countries
• 35% University and College Libraries
• 30% State Archives and Libraries
• 15% Non-Government Nonprofits
• 9% National Libraries/Federal Institutions
• 7% K-12 Schools
• 2% Cities and Public Libraries
• 2% Museums and Art Libraries
http://www.archive-it.org/public/partners

Archive-It Web Application

Why Archive Social Networking Sites?
• State agencies and officials: An increasing number have decided that the content on these sites is a record and needs to be archived.
• University libraries: These sites are used to share information with students and alumni, and they contain important records about a school's culture, student body, and campus events.
• Researchers: Archiving these sites preserves valuable social reactions and changes on topics of interest.
• Currently about 20 Archive-It partners are archiving content from these sites.

North Carolina State Archives & State Library of North Carolina
Purpose: Archive state agency websites and publications
• Includes pages in a variety of formats: text, images, audio, video, and social networking sites
• Archive-It partner since 2005 (pilot partner)


Library of Virginia
• Purpose: Preserve websites relating to Virginia government and elections
• Collection on the current Governor includes Twitter and Flickr sites
• Collection on Twitter, Flickr, and Facebook sites of politicians and political organizations in Virginia

Stanford University, Islamic and Middle Eastern Collection
Purpose: Harvest and preserve Iranian blogs
• Archiving over 300 blogs written by and for Iran and the Iranian people
• Archiving sites from Twitter, Facebook, and YouTube selected by the collection's curators
• Partner since February 2008, funded by the Library of Congress

University of Texas, San Antonio
• Purpose: Archive university websites, student organizations, academic departments, and other local topics important to the university
• Archiving blogs, Facebook, Twitter, Flickr, MySpace
• Partner since 2008

Typical Challenges
• Content behind log-ins cannot be archived.
• Content can be blocked by robots.txt files (which our crawlers respect by default).
• Some parts of sites are not "archive-friendly" (e.g. complex JavaScript, Flash, etc.).
• These sites tend to change both their technical structure and their policies quickly and often.
• The structure of the sites and their URLs means users need to add scoping rules to capture only the content they are interested in.
Each site has its own unique set of challenges.
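The robots.txt and scoping points above can be illustrated with a minimal Python sketch. It is not how Heritrix or Archive-It implements these checks; the function names `robots_allows` and `in_scope`, the user agent string, and the example URLs are all hypothetical, and the scope rule is assumed to be a simple host-plus-path prefix.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def in_scope(url: str, allowed_prefixes: list[str]) -> bool:
    """Return True if url falls under one of the allowed host+path prefixes.

    A prefix such as "twitter.com/exampleagency" matches that page and
    anything below it, but not "twitter.com/exampleagency2".
    """
    parts = urlsplit(url)
    target = parts.netloc.lower() + parts.path.rstrip("/")
    for prefix in allowed_prefixes:
        prefix = prefix.rstrip("/")
        if target == prefix or target.startswith(prefix + "/"):
            return True
    return False


# Hypothetical robots.txt that blocks one directory for every crawler.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"

# Hypothetical scope: capture only one account per social networking site.
SCOPE = ["twitter.com/exampleagency", "www.facebook.com/exampleagency"]

if __name__ == "__main__":
    for candidate in [
        "https://twitter.com/exampleagency/status/123",
        "https://twitter.com/someoneelse",
    ]:
        print(candidate, "in scope:", in_scope(candidate, SCOPE))
```

Matching on whole path segments rather than raw string prefixes keeps a rule for one account from accidentally capturing a similarly named account on the same site.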

Overall Approaches
• Trial and error: Try to harvest with a variety of settings
• Quality review: Review archived content thoroughly
• Collaborate: Compare approaches and results with other Archive-It users
• Document detailed instructions, lessons learned, and best practices for other partners

Thank you!
www.archive-it.org
http://www.facebook.com/ArchiveIt
Kate Odell
Partner Specialist, Internet Archive
kate@archive.org
