
What is the Internet Archive



Presentation Transcript


  1. What is the Internet Archive
     • We are a digital library
     • Mission statement: universal access to human knowledge
     • Founded in 1996 by Brewster Kahle in San Francisco, California
     • Officially designated a library by the state of California (2007)

  2. Archive-It (www.archive-it.org)
     • First deployed in February 2006
     • Web-based application that allows users to create, manage, and preserve collections of digital web content
     • Functions include: selection and scoping, harvesting, reports and analysis of captures, cataloging with metadata, full-text search
     • Archived content includes: text, HTML, video, audio, images, PDF, online newspapers, social networking, and more
     • Includes hosting, access, and storage (primary and back-up)
     • Archived content is available for viewing 24 hours after a crawl has completed

  3. The Tools Behind Archive-It
     Open source technology, primarily developed by the Internet Archive, the open source community, and the IIPC
     • Heritrix: web crawler that crawls and captures pages
     • Wayback Machine: access tool for rendering and viewing archived web pages, letting users surf the web as it was
     • NutchWAX: open source search engine that provides standard full-text search
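As a rough illustration of how the Wayback Machine addresses an archived page, the sketch below builds a replay URL from an original URL and a capture time. It assumes the public Wayback Machine pattern (`https://web.archive.org/web/<14-digit timestamp>/<original URL>`). The function name `wayback_url` is hypothetical, and Archive-It collections may be served from a different address that the slides do not describe.

```python
from datetime import datetime

# Assumption: prefix of the public Wayback Machine, not an Archive-It collection.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a replay URL for the capture closest to `when` (hypothetical helper)."""
    # The Wayback Machine identifies captures by a YYYYMMDDhhmmss timestamp.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(wayback_url("http://www.archive-it.org/", datetime(2010, 5, 1, 12, 0, 0)))
```

When no capture exists at the exact timestamp, the Wayback Machine redirects to the nearest one it holds, so the timestamp acts as a target date, not an exact key.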

  4. Who Uses Archive-It
     130 partners in 42 states and 12 countries
     • 35% University and college libraries
     • 30% State archives and libraries
     • 15% Non-governmental nonprofits
     • 9% National libraries and federal institutions
     • 7% K-12 schools
     • 2% Cities and public libraries
     • 2% Museums and art libraries
     Partner list: http://www.archive-it.org/public/partners

  5. Archive-It Web Application

  6. Why Archive Social Networking Sites?
     • State agencies and officials: an increasing number have decided that the content on these sites is a record and needs to be archived
     • University libraries: these sites are used to share information with students and alumni, and they contain important records about a school's culture, student body, and campus events
     • Researchers: archiving preserves valuable social reactions to, and changes in, topics of interest
     • Currently about 20 Archive-It partners are archiving content from these sites

  7. North Carolina State Archives & State Library of North Carolina
     • Purpose: archive state agency websites and publications
     • Includes pages in a variety of formats: text, images, audio, video, and social networking sites
     • Archive-It partner since 2005 (pilot partner)

  8. North Carolina State Archives & State Library of North Carolina

  9. North Carolina State Archives & State Library of North Carolina

  10. Library of Virginia
      • Purpose: preserve websites relating to Virginia government and elections
      • Collection on the current Governor includes Twitter and Flickr sites
      • Collection on the Twitter, Flickr, and Facebook sites of politicians and political organizations in Virginia

  11. Stanford University, Islamic and Middle Eastern Collection
      • Purpose: harvest and preserve Iranian blogs
      • Archiving over 300 blogs written by and for Iran and the Iranian people
      • Archiving sites from Twitter, Facebook, and YouTube selected by the collection’s curators
      • Partner since February 2008, funded by the Library of Congress

  12. University of Texas, San Antonio
      • Purpose: archive websites of the university, its student organizations, and its academic departments, along with other local topics important to the university
      • Archiving blogs, Facebook, Twitter, Flickr, MySpace
      • Partner since 2008

  13. Typical Challenges
      • Content behind log-ins cannot be archived
      • Content can be blocked by robots.txt files (which our crawlers respect by default)
      • Some parts of sites are not “archive-friendly” (e.g. complex JavaScript, Flash)
      • These sites tend to change both their technical structure and their policies quickly and often
      • The structure of the sites and their URLs means users need to add scoping rules to capture only the content they are interested in
      Each site has its own unique set of challenges.
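The robots.txt and scoping challenges above can be sketched in a few lines of Python. This is a minimal illustration only, not how Heritrix or Archive-It implement either check. The robots.txt content, the `example.org` URLs, the include/exclude patterns, and the crawler name `example-crawler` are all made up for the example.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as a site might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical scoping rules: capture the agency section, skip calendar and session URLs.
INCLUDE = [re.compile(r"^https?://(www\.)?example\.org/agency/")]
EXCLUDE = [re.compile(r"/calendar/"), re.compile(r"[?&]sessionid=")]


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def in_scope(url: str) -> bool:
    """Return True if `url` matches an include rule and no exclude rule."""
    included = any(p.search(url) for p in INCLUDE)
    excluded = any(p.search(url) for p in EXCLUDE)
    return included and not excluded


def should_capture(url: str, user_agent: str = "example-crawler") -> bool:
    """Combine both checks: the URL must be in scope and not blocked by robots.txt."""
    return in_scope(url) and allowed_by_robots(ROBOTS_TXT, user_agent, url)


if __name__ == "__main__":
    for candidate in (
        "https://example.org/agency/reports/2010.html",
        "https://example.org/agency/calendar/2010-05",
        "https://example.org/private/notes.html",
    ):
        print(candidate, should_capture(candidate))
```

Keeping the scope check separate from the robots.txt check mirrors the distinction on the slide: scoping is a choice made by the collection's curator, while robots.txt is a restriction set by the site being crawled.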

  14. Overall Approaches
      • Trial and error: try to harvest with a variety of settings
      • Quality review: review archived content thoroughly
      • Collaborate: compare approaches and results with other Archive-It users
      • Document: record detailed instructions, lessons learned, and best practices for other partners

  15. Thank you!
      www.archive-it.org
      http://www.facebook.com/ArchiveIt
      Kate Odell, Partner Specialist, Internet Archive
      kate@archive.org
