Preserving Web-based digital material



  1. Preserving Web-based digital material Andrea Goethals Harvard University Library Why Books? Site Visit 28 October 2010

  2. Agenda
  • Why preserve Web content?
  • A look at the Web
  • Web archiving
  • Web archiving at Harvard
  • Open challenges in Web archiving
  • Questions?

  3. 1. Why preserve Web content?

  4. Books have moved off the shelves and onto the Web!

  5. A few other things on the Web… TV shows, blogs, images, scholarly papers, stores, discussions, maps, virtual worlds, art exhibits, documents, music, articles, magazines, newspapers, tutorials, software, databases, social networking, advertising, courses, museums, libraries, archives, recipes, data sets, oral history, poetry, broadcasts, wikis, movies, …

  6. But is it valuable?

  7. May be historically significant: White House web site, March 20, 2003

  8. May be the only version: Harvard Magazine, May/June 2009

  9. May document human behavior: World of Warcraft, Fizzcrank realm, Morc the Orc’s view, Oct. 25, 2010

  10. Important to researchers: ABC News, Aug. 2007

  11. Important to researchers
  • Strangers and friends: collaborative play in world of warcraft
  • From tree house to barracks: The social life of guilds in World of Warcraft
  • The life and death of online gaming communities: a look at guilds in World of Warcraft
  • Learning conversations in World of Warcraft
  • The ideal elf: Identity exploration in World of Warcraft
  • Traffic analysis and modeling for world of warcraft
  • E-collaboration and e-commerce in virtual worlds: The potential of second life and world of warcraft
  • Understanding social interaction in world of warcraft
  • Communication, coordination, and camaraderie in World of Warcraft
  • An online community as a new tribalism: The world of warcraft
  • A hybrid cultural ecology: world of warcraft in China
  • … etc.

  12. May be a work of art: YouTube Play. A Biennial of Creative Video (Oct. 2010–)

  13. May be important data for scholarship: NOAA Satellite and Information Service

  14. May be an important reference

  15. May be of personal value

  16. 2. A look at the Web

  17. Remember this? • 1993: “First” graphical Web browser (Mosaic)

  18. Volume of content is immense!
  • 1998: First Google index has 26 million pages
  • 2000: Google index has 1 billion pages
  • 2008: Google processes 1 trillion unique URLs
  • “… and the number of individual Web pages out there is growing by several billion pages per day” (Source: the official Google blog)

  19. Prolific self-publishers
  • “Humanity’s total digital output currently stands at 800,000 petabytes … but is expected to pass 1.2 zettabytes this year. One zettabyte is equal to one million petabytes…”
  • “Around 70 per cent of the world’s digital content is generated by individuals, but it is stored by companies on content-sharing websites such as Flickr and YouTube.”
  • Telegraph.co.uk, May 2010, on an IDC study

  20. Ever-increasing number of web sites: 96 million out of 233 million web sites are active (Netcraft.com)

  21. A moving target • Flickr (Feb 2004) • Facebook (Feb 2004) • YouTube (Feb 2005) • Twitter (2006)

  22. Anatomy of a web page
  • Typically, 1 web page = ~35 files:
  • 1 HTML file
  • 7 text/css
  • 8 image/gif
  • 17 image/jpeg
  • 2 javascript
  • Source: representative samples taken by the Internet Archive
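
To make the file count concrete, here is a minimal sketch (mine, not the Internet Archive's sampling code) that fetches a single page and tallies the MIME types of the resources it embeds. It assumes the third-party requests and beautifulsoup4 packages; the URL is a placeholder:

```python
from collections import Counter
from urllib.parse import urljoin

import requests                      # third-party: pip install requests
from bs4 import BeautifulSoup        # third-party: pip install beautifulsoup4

def tally_page_resources(page_url):
    """Fetch one page and count the MIME types of its embedded resources."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # URLs a browser would also fetch to render the page.
    targets = [(t, "src") for t in soup.find_all(["img", "script"])]
    targets += [(t, "href") for t in soup.find_all("link", rel="stylesheet")]
    resource_urls = {urljoin(page_url, tag.get(attr))
                     for tag, attr in targets if tag.get(attr)}

    counts = Counter({"text/html": 1})           # the page itself
    for url in resource_urls:
        try:
            # HEAD requests keep the sampling cheap.
            resp = requests.head(url, timeout=10, allow_redirects=True)
            counts[resp.headers.get("Content-Type", "unknown").split(";")[0]] += 1
        except requests.RequestException:
            counts["unreachable"] += 1
    return counts

if __name__ == "__main__":
    for mime, n in tally_page_resources("http://example.org/").most_common():
        print(f"{n:3d}  {mime}")
```

A real survey would also follow CSS @import rules and script-inserted resources, which is part of why faithful capture is hard.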

  23. Can’t rely on it always being out there

  24. Web content is transient
  • The average lifespan of a web site is between 44 and 100 days
  • [screenshots: captured April 8, 2009; visited October 13, 2010]

  25. Disappearing web sites
  • 2000 Sydney Olympics: most of the Web record is only held by the National Library of Australia
  • Half of the URLs cited in D-Lib Magazine were inaccessible 10 years after publication (McCown et al., 2005)
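
In the spirit of the McCown et al. study, a small illustrative script (my sketch, not theirs) can measure link rot in a list of cited URLs using only the Python standard library; the input file name is hypothetical:

```python
import urllib.error
import urllib.request

def is_alive(url, timeout=10):
    """True if the URL still answers with an HTTP 2xx status."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, ValueError, OSError):
        return False

if __name__ == "__main__":
    # cited_urls.txt (hypothetical): one URL per line.
    with open("cited_urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    dead = [u for u in urls if not is_alive(u)]
    print(f"{len(dead)} of {len(urls)} cited URLs no longer resolve")
```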

  26. 3. Web archiving

  27. Web archiving 101
  • Web harvesting: select and capture it
  • Preservation of captured Web content (“digital preservation”): keep it safe, and keep it usable to people long-term, despite technological changes
  • [diagram: acquisition and preservation of web content, alongside acquisition and preservation of other digital content]

  28. Web harvesting
  • Download all files needed to reproduce the Web page
  • Try to capture the original form of the Web page as it would have been experienced at the time of capture
  • Also collect information about the capture process
  • There must be some kind of selection…

  29. Types of harvesting
  • Domain harvesting: collect the web space of an entire country (e.g. the French Web, including the .fr domain)
  • Selective harvesting: collect based on a theme, event, individual, organization, etc. (e.g. the London 2012 Olympics, Hurricane Katrina, women’s blogs, President Obama)
  • Any type of regular harvesting results in a large quantity of content to manage.

  30. The crawl: a repeating cycle of document exchange
  • Pick a location (seed URIs)
  • Make a request to the Web server
  • Receive the response from the Web server
  • Examine the document for URI references
  • Repeat, requesting the newly discovered URIs
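
The cycle above is the heart of every harvester. As a toy illustration only (my sketch, not Heritrix or any other IIPC tool, which add scoping rules, politeness delays, robots.txt handling and archival output), the loop can be written as:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                      # third-party: pip install requests
from bs4 import BeautifulSoup        # third-party: pip install beautifulsoup4

def crawl(seed_uris, max_pages=50):
    """Yield (url, payload) pairs for pages reached from the seeds."""
    frontier = deque(seed_uris)      # URIs waiting to be fetched
    seen = set(seed_uris)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)          # request / response
        except requests.RequestException:
            continue
        fetched += 1
        yield url, resp.content                           # hand off for storage
        if "html" not in resp.headers.get("Content-Type", ""):
            continue
        # Examine the document for URI references.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            # Toy scope rule: stay on the seed page's host.
            if urlparse(link).netloc == urlparse(url).netloc and link not in seen:
                seen.add(link)
                frontier.append(link)    # the cycle repeats with new URIs

if __name__ == "__main__":
    for captured_url, _payload in crawl(["http://example.org/"]):
        print("captured", captured_url)
```

The frontier queue and the seen set are what keep the cycle from re-fetching pages; scaling those structures to billions of URIs is much of what makes production crawlers hard.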

  31. Web archiving pioneers, mid-1990s: the Internet Archive (with Alexa Internet) and the national libraries of Sweden, Denmark, Australia, Finland and Norway, plus collecting partners. Adapted from A. Potter’s presentation, IIPC GA 2010

  32. International Internet Preservation Consortium (IIPC), founded 2003. Founding members: Library and Archives Canada, the Internet Archive, the British Library, the Library of Congress, and the national libraries of Sweden, Denmark, France, Norway, Finland, Italy, Iceland and Australia. IIPC: http://netpreserve.org

  33. IIPC goals
  • Facilitate preservation of a rich body of Internet content from around the world
  • Develop common tools, techniques and standards
  • Encourage and support Internet archiving and preservation
  • IIPC: http://netpreserve.org

  34. IIPC in 2010 [membership diagram]: the Internet Archive (with Archive-It partners), the Library of Congress, the British Library (UK), WAC (UK), TNA (UK), GPO (US), Harvard, UNT (US), CDL (US), NYU (US), UIUC (US), AZ AI Lab (US), OCLC, Hanzo Archives, the European Archive, Library and Archives Canada, BAnQ (Canada), and the national libraries of the Netherlands, Scotland, Austria, Israel, Singapore, Spain / Catalunya, Sweden, Denmark, Korea, Japan, Croatia, France (with INA), Poland, Norway, New Zealand, Germany, Finland, Iceland, Australia, Slovenia, Switzerland, Italy and the Czech Republic, plus collecting partners. Adapted from A. Potter’s presentation, IIPC GA 2010

  35. Current methods of harvesting
  • Contract with another party for crawls (e.g. the Internet Archive’s crawls for the Library of Congress)
  • Use a hosted service (e.g. the Internet Archive’s Archive-It, or the California Digital Library’s Web Archiving Service (WAS))
  • Set up an institution-specific web archiving system (e.g. Harvard’s Web Archiving Collection Service (WAX))
  • Most use IIPC tools like the Heritrix web crawler (see the sketch below)
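
Heritrix and the other IIPC harvesters typically package what they capture in WARC container files, together with metadata about the capture process. As a rough illustration (my example; WARC is not discussed on the slide itself), the open-source warcio Python package can walk such a file; the file name here is hypothetical:

```python
from warcio.archiveiterator import ArchiveIterator   # pip install warcio

# crawl-output.warc.gz is a hypothetical file name.
with open("crawl-output.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the archived HTTP payloads.
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            ctype = record.http_headers.get_header("Content-Type", "unknown")
            print(ctype, uri)
```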

  36. Current methods of access
  • Currently dark – no access (e.g. Norway)
  • Only on-site to researchers (e.g. BnF, Finland)
  • Public on-line access (e.g. Harvard, LAC)
  • What kind of access?
  • Most common: browse as it was
  • Sometimes: full text search
  • Very rare: bulk access for research
  • Non-existent: cross-web archive access
  • http://netpreserve.org/about/archiveList.php

  37. 4. Web archiving at Harvard

  38. Web Archiving Collection Service (WAX) • Used by “curators” within Harvard units (departments, libraries, museums, etc.) to collect and preserve Web content • Content selection is a local choice • The content is publicly available to current and future users

  39. WAX workflow
  • A Harvard unit sets up an account (one-time event)
  • On an on-going basis:
  • Curators within that unit specify and schedule content to crawl
  • WAX crawlers capture the content
  • Curators QA the Web harvests
  • Curators organize the Web harvests into collections
  • Curators make the collections discoverable
  • Curators push content to the DRS – it becomes publicly viewable and searchable

  40. WAX architecture [diagram: a back end (WAX temp storage, temp index, back-end services, DRS preservation repository) and a front end (WAXi curator interface, HOLLIS catalog, production index, WAX public interface), connecting curators to archive users]


  43. Back-end services • WAX crawlers • File Movers • Importer • Deleter • Archiver • Indexers


  45. Catalog record
  • Minimally at the collection level
  • Sometimes also at the Web site level

  46. http://wax.lib.harvard.edu

  47. 5. Open challenges in Web archiving

  48. How do we capture…?
  • Streaming media (e.g. videos)
  • Non-http protocols (RTMP, etc.), sometimes proprietary
  • Experiments to capture video content in parallel to regular crawls (e.g. the BL’s One & Other project); see the sketch below
  • Complicates play-back as well
  • Still experimental, non-scalable and time-consuming
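
As a modern stand-in for those experiments (my illustration; not the British Library's actual tooling), video page URLs found during a crawl can be handed to an external downloader that knows how to negotiate streaming protocols. This assumes the youtube-dl command-line tool is installed; the URL is a placeholder:

```python
import subprocess

# URLs that a crawler flagged as video pages (hypothetical examples).
video_urls = [
    "http://www.example.com/watch?v=abc123",
]

for url in video_urls:
    # The downloader negotiates the streaming protocol and saves the media
    # file, which a plain HTTP crawler cannot do for RTMP-style streams.
    subprocess.run(
        ["youtube-dl", "--output", "captures/%(id)s.%(ext)s", url],
        check=False,
    )
```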

  49. How do we capture…?
  • Highly interactive sites (Flash, AJAX)
  • Experiments to launch Web browsers that can simulate Web clicks (INA, European Archive); see the sketch below
  • Still experimental and time-consuming
  • “Walled gardens”: need help from content hosts
  • What’s next? The Web keeps changing
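
A loose illustration of the browser-driven approach (my sketch; the INA and European Archive experiments use their own tooling): drive a real browser so that script-generated content exists before the page is saved. This assumes the selenium package plus a local Firefox and geckodriver; the URL is a placeholder:

```python
import time

from selenium import webdriver        # pip install selenium

driver = webdriver.Firefox()          # needs Firefox + geckodriver locally
try:
    driver.get("http://www.example.com/ajax-heavy-page")   # hypothetical URL
    time.sleep(5)   # crude pause so client-side scripts can populate the DOM
    # The rendered DOM includes content a plain HTTP fetch would never see.
    with open("rendered-capture.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)
finally:
    driver.quit()
```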
