
Thinking Differently About Web Page Preservation


Presentation Transcript


  1. Thinking Differently About Web Page Preservation Michael L. Nelson, Frank McCown, Joan A. Smith Old Dominion University Norfolk VA {mln,fmccown,jsmit}@cs.odu.edu Library of Congress Brown Bag Seminar June 29, 2006 Research supported in part by NSF, Library of Congress and Andrew Mellon Foundation

  2. Background • “We can’t save everything!” • if not “everything”, then how much? • what does “save” mean?

  3. “Women and Children First” HMS Birkenhead, Danger Point, 1852 638 passengers 193 survivors all 7 women &amp; 13 children image from: http://www.btinternet.com/~palmiped/Birkenhead.htm

  4. We should probably save a copy of this…

  5. Or maybe we don’t have to… the Wikipedia link is in the top 10, so we’re ok, right?

  6. Surely we’re saving copies of this…

  7. 2 copies in the UK 2 Dublin Core records That’s probably good enough…

  8. What about the things that we know we don’t need to keep? You DO support recycling, right?

  9. A higher moral calling for pack rats?

  10. Just Keep the Important Stuff!

  11. Preservation metadata is like a David Hockney Polaroid collage: each image is both true and incomplete, and while the result is not faithful, it does capture the “essence” Lessons Learned from the AIHT (Boring stuff: D-Lib Magazine, December 2005) images from: http://facweb.cs.depaul.edu/sgrais/collage.htm

  12. Preservation: Fortress Model Five Easy Steps for Preservation: • Get a lot of $ • Buy a lot of disks, machines, tapes, etc. • Hire an army of staff • Load a small amount of data • “Look upon my archive ye Mighty, and despair!” image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

  13. Alternate Models of Preservation • Lazy Preservation • Let Google, IA et al. preserve your website • Just-In-Time Preservation • Wait for it to disappear first, then a “good enough” version • Shared Infrastructure Preservation • Push your content to sites that might preserve it • Web Server Enhanced Preservation • Use Apache modules to create archival-ready resources image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

  14. Lazy Preservation: “How much preservation do I get if I do nothing?” Frank McCown

  15. Outline: Lazy Preservation • Web Infrastructure as a Resource • Reconstructing Web Sites • Research Focus

  16. Web Infrastructure

  17. [Slide chart: client-view and server-view approaches plotted by publisher’s cost (time, equipment, knowledge) against coverage of the Web, ranging from high to low on both axes: filesystem backups, Furl/Spurl, browser cache, InfoMonitor, LOCKSS, Hanzo:web, iPROXY, TTApache, web archives, and SE caches.]

  18. Outline: Lazy Preservation • Web Infrastructure as a Resource • Reconstructing Web Sites • Research Focus

  19. Research Questions • How much digital preservation of websites is afforded by lazy preservation? • Can we reconstruct entire websites from the WI? • What factors contribute to the success of website reconstruction? • Can we predict how much of a lost website can be recovered? • How can the WI be utilized to provide preservation of server-side components?

  20. Prior Work • Is website reconstruction from the WI feasible? • Web repositories: G, M, Y, IA (Google, MSN, Yahoo, Internet Archive) • Web-repository crawler: Warrick • Reconstructed 24 websites • How long do search engines keep cached content after it is removed?

  21. Timeline of SE Resource Acquisition and Release • Vulnerable resource – not yet cached (tca is not defined) • Replicated resource – available on web server and in SE cache (tca < current time < tr) • Endangered resource – removed from web server but still cached (tr < current time < tcr) • Unrecoverable resource – missing from both web server and cache (tca < tcr < current time) Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006. Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
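The four states on this timeline are just interval checks over three timestamps. A minimal Python sketch (the deck contains no code, so the function and argument names are ours; None stands for "hasn't happened yet"):

```python
def resource_state(now, t_cached=None, t_server_removed=None, t_cache_purged=None):
    """Classify a resource on the slide's timeline.

    t_cached         ~ tca (when the SE cached it)
    t_server_removed ~ tr  (when it left the web server)
    t_cache_purged   ~ tcr (when the SE dropped its copy)
    """
    if t_cached is None or now < t_cached:
        return "vulnerable"        # not yet cached
    if t_server_removed is None or now < t_server_removed:
        return "replicated"        # live on server and in SE cache
    if t_cache_purged is None or now < t_cache_purged:
        return "endangered"        # gone from server, still cached
    return "unrecoverable"         # gone from both
```
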

  22. Cached Image

  23. Cached PDF [Slide figure: the canonical PDF at http://www.fda.gov/cder/about/whatwedo/testtube.pdf alongside the MSN, Yahoo, and Google cached versions.]

  24. Web Repository Characteristics • C – canonical version is stored • M – modified version is stored (modified images are thumbnails; all others are HTML conversions) • ~R – indexed but not retrievable • ~S – indexed but not stored

  25. SE Caching Experiment • Create HTML, PDF, and image files • Place files on 4 web servers • Remove files on a regular schedule • Examine web server logs to determine when each page is crawled and by whom • Query each search engine daily, using a unique identifier, to see if they have cached the page or image Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
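The "examine web server logs" step can be sketched as a small parser for Apache's combined log format. The crawler markers (Googlebot, msnbot, Slurp) are the well-known user-agent substrings for those engines; everything else here is illustrative, not the experiment's actual tooling:

```python
import re

# Apache "combined" log format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"')

# User-agent substrings for the big crawlers; extend as needed.
CRAWLERS = {"Googlebot": "google", "msnbot": "msn", "Slurp": "yahoo"}

def crawler_hits(log_lines):
    """Yield (timestamp, path, engine) for each request from a known SE robot."""
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        for marker, engine in CRAWLERS.items():
            if marker in m.group("agent"):
                yield m.group("ts"), m.group("path"), engine
                break
```
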

  26. Caching of HTML Resources - mln

  27. Reconstructing a Website [Slide diagram: Warrick takes a starting URL, submits the original URL to each web repo, follows the results page to the cached URL, and writes the retrieved resource to the file system.] • Pull resources from all web repositories • Strip off extra header and footer HTML • Store the most recently cached version or the canonical version • Parse HTML for links to other resources
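The four bullets above amount to a standard crawl loop run against repository caches instead of the (lost) live site. A minimal sketch, with hypothetical fetch_cached/save callables standing in for Warrick's per-repository retrieval code:

```python
import re
from urllib.parse import urljoin

def reconstruct(start_url, fetch_cached, save):
    """Walk a lost site by crawling web-repository caches.

    fetch_cached(url) -> cached HTML (header/footer already stripped) or None;
    save(url, html) stores the recovered resource. Both are caller-supplied.
    """
    frontier, seen = [start_url], set()
    while frontier:
        url = frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        html = fetch_cached(url)
        if html is None:
            continue                      # lost from every repository
        save(url, html)
        # parse the recovered page for links to more resources
        for link in re.findall(r'href="([^"]+)"', html):
            frontier.append(urljoin(url, link))
    return seen
```
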

  28. How Much Did We Reconstruct? [Slide diagram comparing the “lost” website (resources A–G) with the reconstructed site: some resources come back identical, B′ and C′ are changed versions, the link to D is missing because it points to an old resource, and F can’t be found.]

  29. Reconstruction Diagram [Slide figure summarizing one reconstruction: 50% identical, 33% changed, 17% missing, 20% added.]

  30. Websites to Reconstruct • Reconstruct 24 sites in 3 categories: 1. small (1–150 resources) 2. medium (150–499 resources) 3. large (500+ resources) • Use Wget to download the current website • Use Warrick to reconstruct it • Calculate the reconstruction vector
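One way the reconstruction vector of slides 28–29 could be computed, assuming each site is represented as a URL-to-content-hash map; the choice of denominators is our reading of the slide's percentages (which is also why they needn't sum to 100):

```python
def reconstruction_vector(original, reconstructed):
    """Return (changed, missing, added) fractions for one site.

    changed/missing are fractions of the original site;
    added is a fraction of the reconstructed site.
    """
    identical = sum(1 for u, h in original.items() if reconstructed.get(u) == h)
    changed = sum(1 for u, h in original.items()
                  if u in reconstructed and reconstructed[u] != h)
    missing = len(original) - identical - changed
    added = sum(1 for u in reconstructed if u not in original)
    return (changed / len(original),
            missing / len(original),
            added / len(reconstructed))
```
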

  31. Results Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.

  32. Aggregation of Websites

  33. Web Repository Contributions

  34. Warrick Milestones • www2006.org – first lost website reconstructed (Nov 2005) • DCkickball.org – first website someone else reconstructed without our help (late Jan 2006) • www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006) • Internet Archive officially “blesses” Warrick (mid Mar 2006)1 1 http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html

  35. Outline: Lazy Preservation • Web Infrastructure as a Resource • Reconstructing Web Sites • Research Focus

  36. Proposed Work • How lazy can we afford to be? • Find factors influencing success of website reconstruction from the WI • Perform search engine cache characterization • Inject server-side components into the WI for complete website reconstruction • Improving the Warrick crawler • Evaluate different crawling policies • Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-repository Crawler, ACM Hypertext 2006. • Development of a web-repository API for inclusion in Warrick

  37. Factors Influencing Website Recoverability from the WI • Previous study did not find statistically significant relationship between recoverability and website size or PageRank • Methodology • Sample large number of websites - dmoz.org • Perform several reconstructions over time using same policy • Download sites several times over time to capture change rates

  38. Evaluation • Use statistical analysis to test for the following factors: • Size • Makeup • Path depth • PageRank • Change rate • Create a predictive model – how much of my lost website do I expect to get back?
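A toy version of such a predictive model: an ordinary least-squares fit of recovered fraction against a single site factor. The slide proposes a multi-factor model; this single-variable stand-in (our own sketch, not the proposed methodology) just shows the shape of the idea:

```python
def fit_recovery_model(factor, recovered_frac):
    """Least-squares fit of recovered fraction vs. one factor (e.g. PageRank).

    Returns a predictor: factor value -> expected recovered fraction.
    """
    n = len(factor)
    mx = sum(factor) / n
    my = sum(recovered_frac) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(factor, recovered_frac))
             / sum((x - mx) ** 2 for x in factor))
    intercept = my - slope * mx
    # clamp: a fraction of a website is always between 0 and 1
    return lambda x: max(0.0, min(1.0, intercept + slope * x))
```
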

  39. Marshall TR Server – running EPrints

  40. We can recover the missing page and PDF, but what about the services?

  41. Recovery of Web Server Components • Recovering the client-side representation is not enough to reconstruct a dynamically produced website • How can we inject the server-side functionality into the WI? • Web repositories favor HTML • Canonical versions stored by all web repos • Text-based • Comments can be inserted without changing the rendered appearance of a page • Injection: use erasure codes to break a server file into chunks and insert the chunks into the HTML comments of different pages
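A much-simplified sketch of the injection idea. It uses plain chunking, so all chunks must be recovered, whereas the proposed (n, r) erasure coding would tolerate losing up to n − r pages; the inject/recover helpers and comment format are ours:

```python
import base64
import re

def inject(server_file, pages):
    """Append one base64 chunk of a server file, as an HTML comment, per page."""
    data = base64.b64encode(server_file).decode("ascii")
    n = len(pages)
    size = -(-len(data) // n)                      # ceiling division
    chunks = [data[i * size:(i + 1) * size] for i in range(n)]
    return [page + "<!--chunk %d/%d:%s-->" % (i, n, chunk)
            for i, (page, chunk) in enumerate(zip(pages, chunks))]

def recover(pages):
    """Reassemble the server file from the comments of recovered pages."""
    found = {}
    for page in pages:
        for i, _, chunk in re.findall(
                r"<!--chunk (\d+)/(\d+):([A-Za-z0-9+/=]*)-->", page):
            found[int(i)] = chunk
    return base64.b64decode("".join(found[i] for i in sorted(found)))
```
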

  42. Recover Server File from WI

  43. Evaluation • Find the most efficient values for n and r (chunks created/recovered) • Security • Develop simple mechanism for selecting files that can be injected into the WI • Address encryption issues • Reconstruct an EPrints website with a few hundred resources

  44. SE Cache Characterization • Web characterization is an active field • Search engine caches have never been characterized • Methodology • Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask • Download cached version and live version from the Web • Examine HTTP headers and page content • Test for overlap with Internet Archive • Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache
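One summary statistic such a characterization might report, sketched under the assumption that the sampled cached and live pages are held as URL-to-content maps (the function and representation are ours):

```python
def staleness(live, cached):
    """Fraction of sampled URLs whose cached copy differs from the live page.

    Both arguments map URL -> page content (or a content hash); only URLs
    present in both samples are compared.
    """
    shared = set(live) & set(cached)
    stale = sum(1 for u in shared if live[u] != cached[u])
    return stale / len(shared) if shared else 0.0
```
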

  45. Summary: Lazy Preservation When this work is completed, we will have… • demonstrated and evaluated the lazy preservation technique • provided a reference implementation • characterized SE caching behavior • provided a layer of abstraction on top of SE behavior (API) • explored how much we store in the WI (server-side vs. client-side representations)

  46. Web Server Enhanced Preservation: “How much preservation do I get if I do just a little bit?” Joan A. Smith
