
Thinking Differently About Web Page Preservation

Michael L. Nelson, Frank McCown, Joan A. Smith

Old Dominion University

Norfolk VA

{mln,fmccown,jsmit}@cs.odu.edu

Library of Congress

Brown Bag Seminar

June 29, 2006

Research supported in part by NSF, the Library of Congress, and the Andrew W. Mellon Foundation

Background
  • “We can’t save everything!”
    • if not “everything”, then how much?
    • what does “save” mean?
“Women and Children First”

HMS Birkenhead, wrecked off Danger Point, Cape Colony, 1852

  • 638 passengers
  • 193 survivors, including all 7 women & 13 children

image from: http://www.btinternet.com/~palmiped/Birkenhead.htm

Slide 4

We should probably save a copy of this…

Slide 5

Or maybe we don’t have to… the Wikipedia link is in the top 10, so we’re ok, right?

Slide 6

Surely we’re saving copies of this…

Slide 7

2 copies in the UK, 2 Dublin Core records. That’s probably good enough…

Slide 8

What about the things that we know we don’t need to keep? You DO support recycling, right?

Lessons Learned from the AIHT

(Boring stuff: D-Lib Magazine, December 2005)

Preservation metadata is like a David Hockney Polaroid collage: each image is both true and incomplete, and while the result is not faithful, it does capture the “essence”.

images from: http://facweb.cs.depaul.edu/sgrais/collage.htm

Preservation: Fortress Model

Five Easy Steps for Preservation:

  • Get a lot of $
  • Buy a lot of disks, machines, tapes, etc.
  • Hire an army of staff
  • Load a small amount of data
  • “Look upon my archive ye Mighty, and despair!”

image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

Alternate Models of Preservation
  • Lazy Preservation
    • Let Google, IA et al. preserve your website
  • Just-In-Time Preservation
    • Wait for it to disappear first, then recover a “good enough” version
  • Shared Infrastructure Preservation
    • Push your content to sites that might preserve it
  • Web Server Enhanced Preservation
    • Use Apache modules to create archival-ready resources

image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

Outline: Lazy Preservation
  • Web Infrastructure as a Resource
  • Reconstructing Web Sites
  • Research Focus
Cost of Preservation

[Figure: preservation approaches (filesystem backups, Furl/Spurl, browser cache, InfoMonitor, LOCKSS, Hanzo:web, iPROXY, TTApache, web archives, SE caches) plotted by publisher’s cost (time, equipment, knowledge), from high to low, against coverage of the Web, with client-view and server-view approaches distinguished.]
Outline: Lazy Preservation
  • Web Infrastructure as a Resource
  • Reconstructing Web Sites
  • Research Focus
Research Questions
  • How much digital preservation of websites is afforded by lazy preservation?
    • Can we reconstruct entire websites from the WI?
    • What factors contribute to the success of website reconstruction?
    • Can we predict how much of a lost website can be recovered?
    • How can the WI be utilized to provide preservation of server-side components?
Prior Work
  • Is website reconstruction from WI feasible?
    • Web repositories: Google, MSN, Yahoo, Internet Archive (G, M, Y, IA)
    • Web-repository crawler: Warrick
    • Reconstructed 24 websites
  • How long do search engines keep cached content after it is removed?
Timeline of SE Resource Acquisition and Release

A resource passes through four states, defined by the time it is cached by a search engine (t_ca), removed from the web server (t_r), and removed from the SE cache (t_cr):

Vulnerable resource – not yet cached (t_ca is not defined)

Replicated resource – available on both the web server and the SE cache (t_ca < current time < t_r)

Endangered resource – removed from the web server but still cached (t_ca < current time < t_cr)

Unrecoverable resource – missing from both the web server and the cache (t_ca < t_cr < current time)

Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.

Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
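To make the four states concrete, here is a minimal Python sketch (not part of the original experiment code) that classifies a resource from the three timestamps, assuming each of t_ca, t_r, and t_cr is None until the corresponding event has happened:

```python
from datetime import datetime

def resource_state(t_ca, t_r, t_cr, now=None):
    """Classify a resource by its cache/removal timeline (sketch only).

    t_ca -- time the resource was cached by the search engine (None if never)
    t_r  -- time the resource was removed from the web server (None if still live)
    t_cr -- time the cached copy was purged from the SE cache (None if still cached)
    Assumes the slide's ordering: removal from the server precedes cache purging.
    """
    now = now or datetime.utcnow()
    if t_ca is None:
        return "vulnerable"      # on the server, but no cached copy exists yet
    if t_cr is not None and t_cr < now:
        return "unrecoverable"   # gone from both the server and the cache
    if t_r is not None and t_r < now:
        return "endangered"      # removed from the server, only the cache remains
    return "replicated"          # live on the server and backed by a cached copy

# example: cached June 1, removed from the server June 10, cache still holds it
print(resource_state(datetime(2006, 6, 1), datetime(2006, 6, 10), None))  # endangered
```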

Cached PDF

http://www.fda.gov/cder/about/whatwedo/testtube.pdf

[Figure: the canonical PDF alongside the MSN, Yahoo, and Google cached versions of it.]

Web Repository Characteristics

  • C – canonical version is stored
  • M – modified version is stored (modified images are thumbnails, all others are HTML conversions)
  • ~R – indexed but not retrievable
  • ~S – indexed but not stored

SE Caching Experiment
  • Create html, pdf, and images
  • Place files on 4 web servers
  • Remove files on regular schedule
  • Examine web server logs to determine when each page is crawled and by whom
  • Query each search engine daily using unique identifier to see if they have cached the page or image

Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, 12(2), February 2006.
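The daily cache check can be scripted in a few lines. The sketch below is a crude illustration only: the search endpoints and the presence test are placeholders, not the query interfaces actually used in the 2005–2006 experiment.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoints -- the real experiment queried Google, MSN, and Yahoo
# through whatever query interface each engine offered at the time.
ENGINES = {
    "google": "https://www.google.com/search?q={q}",
    # "msn": "...", "yahoo": "..."   (placeholders)
}

def is_cached(engine, unique_id):
    """Return True if the engine's result page mentions the unique identifier
    planted in the test page (a crude presence check, not a real cache API)."""
    url = ENGINES[engine].format(q=urllib.parse.quote(unique_id))
    req = urllib.request.Request(url, headers={"User-Agent": "cache-probe/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return unique_id in html

# run once a day (e.g. from cron) and log the result per engine and per test page
```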

Reconstructing a Website

[Figure: Warrick’s interaction with a web repository – starting from an original URL, Warrick queries the repository, follows the results page to the cached URL, retrieves the cached resource, and writes the retrieved resource to the file system.]

  • Pull resources from all web repositories
  • Strip off extra header and footer html
  • Store most recently cached version or canonical version
  • Parse html for links to other resources
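A toy version of that loop might look like the sketch below. The real Warrick crawler is considerably more involved (per-repository query interfaces, request limits, canonical-vs-cached selection), and the three callables passed in here are hypothetical stand-ins for that repository-specific code.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

def reconstruct(start_url, fetch_from_repositories, strip_repository_chrome, save):
    """Breadth-first reconstruction of a lost site from web-repository copies.

    fetch_from_repositories(url) -> list of (repo_name, cache_timestamp, content)
    strip_repository_chrome(html) -> html without the cache banner header/footer
    save(url, content)            -> write the recovered resource to the filesystem
    """
    site = urlparse(start_url).netloc
    queue, seen = deque([start_url]), {start_url}
    while queue:
        url = queue.popleft()
        copies = fetch_from_repositories(url)        # Google, MSN, Yahoo, IA, ...
        if not copies:
            continue                                  # resource not recoverable
        # prefer the most recently cached copy (a canonical copy, if any, could win instead)
        _, _, content = max(copies, key=lambda c: c[1])
        if isinstance(content, str):
            content = strip_repository_chrome(content)
            # parse the HTML for further links belonging to the same site
            for href in re.findall(r'href=["\'](.*?)["\']', content, re.I):
                link = urljoin(url, href)
                if urlparse(link).netloc == site and link not in seen:
                    seen.add(link)
                    queue.append(link)
        save(url, content)
```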
How Much Did We Reconstruct?

[Figure: the “lost” web site and the reconstructed web site as node-and-link diagrams (resources A–G). In the reconstruction, B and C come back changed (B′, C′), the link to D is missing and points instead to an old resource G, and F can’t be found.]

Reconstruction Diagram

[Figure: reconstruction breakdown for an example site – identical 50%, changed 33%, missing 17%, added 20%.]

Websites to Reconstruct
  • Reconstruct 24 sites in 3 categories:

    1. small (1–150 resources)
    2. medium (150–499 resources)
    3. large (500+ resources)

  • Use Wget to download current website
  • Use Warrick to reconstruct
  • Calculate reconstruction vector
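The reconstruction vector can be computed by diffing the Wget download against the Warrick output. The sketch below shows one plausible formulation (identical/changed/missing as fractions of the original site, added as a fraction of the reconstruction); the exact definition in the technical report may differ.

```python
import hashlib

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def reconstruction_vector(original, reconstructed):
    """original, reconstructed: dicts mapping relative URL -> local file path.

    Returns (identical, changed, missing, added) as fractions; identical, changed,
    and missing are relative to the original site, added to the reconstructed one.
    """
    orig_urls, recon_urls = set(original), set(reconstructed)
    common = orig_urls & recon_urls
    identical = sum(1 for u in common
                    if file_hash(original[u]) == file_hash(reconstructed[u]))
    changed = len(common) - identical
    missing = len(orig_urls - recon_urls)
    added = len(recon_urls - orig_urls)
    n, m = len(orig_urls) or 1, len(recon_urls) or 1
    return identical / n, changed / n, missing / n, added / m
```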
Results

Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.

Warrick Milestones
  • www2006.org – first lost website reconstructed (Nov 2005)
  • DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
  • www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
  • Internet Archive officially “blesses” Warrick (mid Mar 2006) [1]

[1] http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html

Outline: Lazy Preservation
  • Web Infrastructure as a Resource
  • Reconstructing Web Sites
  • Research Focus
Proposed Work
  • How lazy can we afford to be?
    • Find factors influencing success of website reconstruction from the WI
    • Perform search engine cache characterization
  • Inject server-side components into WI for complete website reconstruction
  • Improving the Warrick crawler
    • Evaluate different crawling policies
      • Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-repository Crawler, ACM Hypertext 2006.
    • Development of web-repository API for inclusion in Warrick
Factors Influencing Website Recoverability from the WI
  • Previous study did not find statistically significant relationship between recoverability and website size or PageRank
  • Methodology
    • Sample large number of websites - dmoz.org
    • Perform several reconstructions over time using same policy
    • Download sites several times over time to capture change rates
Evaluation
  • Use statistical analysis to test for the following factors:
    • Size
    • Makeup
    • Path depth
    • PageRank
    • Change rate
  • Create a predictive model – how much of my lost website do I expect to get back?
Recovery of Web Server Components
  • Recovering the client-side representation is not enough to reconstruct a dynamically-produced website
  • How can we inject the server-side functionality into the WI?
  • Web repositories like HTML
    • Canonical versions stored by all web repos
    • Text-based
    • Comments can be inserted without changing appearance of page
  • Injection: Use erasure codes to break a server file into chunks and insert the chunks into HTML comments of different pages
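As a rough illustration of the injection idea only: the sketch below splits a server-side file into labeled base64 chunks and wraps them in HTML comments. It omits the erasure coding the proposal actually calls for (where any r of the n chunks suffice to rebuild the file), and the chunk-marker format is invented for the example.

```python
import base64
import textwrap

def make_chunks(path, n):
    """Split a server-side file into n base64 chunks.  (The proposed approach uses
    an erasure code so that any r of the n chunks suffice; plain splitting shown
    here for illustration only.)"""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    size = -(-len(data) // n)            # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]

def as_html_comment(filename, index, total, chunk):
    """Wrap one chunk in an HTML comment that can be appended to a generated page
    without changing how the page renders."""
    body = "\n".join(textwrap.wrap(chunk, 76))
    return f"<!-- warrick-chunk file={filename} part={index}/{total}\n{body}\n-->"

chunks = make_chunks("index.php", 10)    # hypothetical server-side script
comments = [as_html_comment("index.php", i + 1, len(chunks), c)
            for i, c in enumerate(chunks)]
```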
Evaluation
  • Find the most efficient values for n and r (chunks created/recovered)
  • Security
    • Develop simple mechanism for selecting files that can be injected into the WI
    • Address encryption issues
  • Reconstruct an EPrints website with a few hundred resources
SE Cache Characterization
  • Web characterization is an active field
  • Search engine caches have never been characterized
  • Methodology
    • Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
    • Download cached version and live version from the Web
    • Examine HTTP headers and page content
    • Test for overlap with Internet Archive
    • Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache
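The cached-versus-live comparison step might look like the sketch below, which assumes the cached-copy URL for each engine has already been scraped from its results page (those URL formats are engine-specific and not shown) and compares a few headers plus a simple content digest.

```python
import hashlib
import urllib.request

def fetch(url):
    req = urllib.request.Request(url, headers={"User-Agent": "cache-char/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return dict(resp.headers), resp.read()

def compare(live_url, cached_url):
    """Compare the live page with its cached copy: headers of interest plus an
    MD5 digest (real analysis would normalise the cache banner away first)."""
    live_hdrs, live_body = fetch(live_url)
    cache_hdrs, cache_body = fetch(cached_url)
    return {
        "live_last_modified": live_hdrs.get("Last-Modified"),
        "live_content_type": live_hdrs.get("Content-Type"),
        "cached_content_type": cache_hdrs.get("Content-Type"),
        "same_content": hashlib.md5(live_body).hexdigest()
                        == hashlib.md5(cache_body).hexdigest(),
    }
```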
Summary: Lazy Preservation

When this work is completed, we will have…

  • demonstrated and evaluated the lazy preservation technique
  • provided a reference implementation
  • characterized SE caching behavior
  • provided a layer of abstraction on top of SE behavior (API)
  • explored how much we store in the WI (server-side vs. client-side representations)

Web Server Enhanced Preservation: “How much preservation do I get if I do just a little bit?”

Joan A. Smith

Outline: Web Server Enhanced Preservation
  • OAI-PMH
  • mod_oai: complex objects + resource harvesting
  • Research Focus
WWW and DL: Separate Worlds

[Figure: from 1994 to today, the WWW (“Crawlapalooza”) and digital libraries (“Harvester Home Companion”) have evolved as separate worlds.]

The problem is not that the WWW doesn’t work; it clearly does.

The problem is that our (preservation) expectations have been lowered.

Data Providers / Repositories and Service Providers / Harvesters

“A repository is a network accessible server that can process the 6 OAI-PMH requests… A repository is managed by a data provider to expose metadata to harvesters.”

“A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”

Aggregators
  • aggregators allow for:
    • scalability for OAI-PMH
    • load balancing
    • community building
    • discovery

data providers (repositories) → aggregator → service providers (harvesters)

OAI-PMH Data Model

[Figure: a resource is represented by an item, whose OAI-PMH identifier is the entry point to all records pertaining to the resource; each record (identifier + metadataPrefix + datestamp) carries metadata pertaining to the resource, e.g. Dublin Core or MARCXML, and items can be grouped into OAI-PMH sets.]

OAI-PMH Used by Google & AcademicLive (MSN)
  • Why support OAI-PMH?
    • These guys are in business (i.e., for profit)
    • How does OAI-PMH help their bottom line?
    • By improving the search and analysis process
Resource Harvesting with OAI-PMH

[Figure: the same data model used for resource harvesting: an item’s records range from simple Dublin Core metadata, to more expressive MARCXML, to highly expressive complex object formats (MPEG-21 DIDL, METS); the OAI-PMH identifier remains the entry point to all records pertaining to the resource.]

Outline: Web Server Enhanced Preservation
  • OAI-PMH
  • mod_oai: complex objects + resource harvesting
  • Research Focus
Two Problems

The representation problem: machine-readable formats and human-readable formats have different requirements.

The counting problem: there is no way to determine the list of valid URLs at a web site.
mod_oai solution
  • Integrate OAI-PMH functionality into the web server itself…
  • mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
    • written in C
    • respects values in .htaccess, httpd.conf
  • compile mod_oai on http://www.foo.edu/
  • baseURL is now http://www.foo.edu/modoai
    • Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)

http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg

The human-readable web site, prepped for machine-friendly harvesting: “Give me a list of all resources, with Dublin Core metadata, dating from 9/15/2004 through today, that are MIME type video/MPEG.”
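For illustration, the harvester’s side of that example request could be scripted as below. The verb, parameter names, and OAI-PMH namespace are standard; the base URL is the hypothetical www.foo.edu deployment from the slide.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE = "http://www.foo.edu/modoai"        # hypothetical mod_oai baseURL from the slide

params = {
    "verb": "ListIdentifiers",
    "metadataPrefix": "oai_dc",
    "from": "2004-09-15",
    "set": "mime:video:mpeg",             # MIME-type based set exposed by mod_oai
}
url = BASE + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url, timeout=30) as resp:
    tree = ET.fromstring(resp.read())

# each <header> carries the OAI-PMH identifier (here, the resource's URL) and datestamp
for header in tree.iter("{http://www.openarchives.org/OAI/2.0/}header"):
    ident = header.findtext("oai:identifier", namespaces=OAI_NS)
    stamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
    print(stamp, ident)
```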

A Crawler’s View of the Web Site

[Figure: the web root as a tree. Crawled pages are only part of it; other pages are not crawled because they are protected, generated on the fly (e.g. by CGI), excluded by robots.txt or a robots META tag, unadvertised and unlinked, reachable only from a remote web site, or too deep.]

Apache’s View of the Web Site

[Figure: the same web root from Apache’s perspective – resources that require authentication, resources generated on the fly (e.g. CGI), resources tagged “no robots”, and resources that are unknown or not visible to the server.]

The Problem: Defining The “Whole Site”
  • For a given server, there is a set of URLs, U, and a set of files, F
    • Apache maps U → F
    • mod_oai maps F → U
  • Neither mapping is 1-1 nor onto
    • We can easily check whether a single u maps to some f in F, but given F we cannot (easily) generate U
  • Short-term issues:
    • dynamic files
      • exporting unprocessed server-side files would be a security hole
    • IndexIgnore
      • httpd will “hide” valid URLs
    • File permissions
      • httpd will advertise files it cannot read
  • Long-term issues
    • Alias, Location
      • files can be covered up by the httpd
    • UserDir
      • interactions between the httpd and the filesystem
A Webmaster’s Omniscient View

[Figure: the web root as the webmaster sees it, backed by httpd and MySQL and containing files such as Data1, User.abc, Fred.foo, file1, /dir/wwx, Foo.html – with dynamic, authenticated, “no robots”, orphaned, deep, and unknown/not-visible resources all accounted for.]

HTTP GET versus OAI-PMH GetRecord

[Figure: against the same Apache web server running mod_oai, “GET /headlines.html HTTP/1.1” returns the human-readable page, while “GET /modoai/?verb=GetRecord&identifier=headlines.html&metadataPrefix=oai_didl” returns a machine-readable complex object that packages the resource with JHOVE metadata, an MD5 checksum, ls output, and other forensic information.]

OAI-PMH Data Model in mod_oai

[Figure: the data model applied to a web resource, e.g. http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf. The resource is an item whose OAI-PMH identifier (its URL) is the entry point to all records pertaining to it; the records carry HTTP header metadata, Dublin Core metadata, and an MPEG-21 DIDL, and sets are based on MIME type.]

Complex Objects That Tell A Story

Like a Russian nesting doll (first came Lenin, then came Stalin…), the complex object wraps layer after layer around the resource: DC metadata, Jhove metadata, checksum, provenance.

http://foo.edu/bar.pdf encoded as an MPEG-21 DIDL:

<didl>
  <metadata source="jhove">...</metadata>
  <metadata source="file">...</metadata>
  <metadata source="essence">...</metadata>
  <metadata source="grep">...</metadata>
  ...
  <resource mimeType="application/pdf"
            identifier="http://foo.edu/bar.pdf"
            encoding="base64">
    SADLFJSALDJF...SLDKFJASLDJ
  </resource>
</didl>

  • Resource and metadata packaged together as a complex digital object represented via an XML wrapper
  • Uniform solution for simple & compound objects
  • Unambiguous expression of the locator of the datastream
  • Disambiguation between locators & identifiers
  • OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes
  • OAI-PMH semantics apply: “about” containers, set membership
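A harvester that receives such a record can unwrap it with a few lines of XML handling. The sketch below follows the simplified <didl> example above (not the full MPEG-21 DIDL schema) and decodes the base64 datastream back to the original bytes:

```python
import base64
import xml.etree.ElementTree as ET

def unwrap_didl(didl_xml):
    """Return ({metadata source: xml text}, resource bytes, mime type) for a
    simplified <didl> document like the one sketched above."""
    root = ET.fromstring(didl_xml)
    metadata = {m.get("source"): (m.text or "") for m in root.findall("metadata")}
    res = root.find("resource")
    payload = base64.b64decode(res.text.strip()) if res is not None else b""
    mime = res.get("mimeType") if res is not None else None
    return metadata, payload, mime
```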
Resource Discovery: ListIdentifiers

HARVESTER:

  • issues a ListIdentifiers,
  • finds URLs of updated resources
  • does HTTP GETs for the updated resources only
  • can get URLs of resources with specified MIME types
Preservation: ListRecords

HARVESTER:

  • issues a ListRecords,
  • Gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)
  • can get resources with specified MIME types
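A sketch of that harvesting loop, including the resumptionToken paging that OAI-PMH uses for large result sets (the base URL and set name follow the earlier hypothetical example):

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def list_records(base_url, **params):
    """Yield every <record> element, following resumptionTokens page by page."""
    params.setdefault("verb", "ListRecords")
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url, timeout=60) as resp:
            page = ET.fromstring(resp.read())
        for record in page.iter(OAI + "record"):
            yield record
        token = page.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return
        # subsequent requests carry only the verb and the token
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# e.g. harvest updated MPEG video resources as MPEG-21 DIDL complex objects:
# for rec in list_records("http://www.foo.edu/modoai", metadataPrefix="oai_didl",
#                         set="mime:video:mpeg", **{"from": "2004-09-15"}):
#     ...
```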
What does this mean?
  • For an entire web site, we can:
    • serialize everything as an XML stream
    • extract it using off-the-shelf OAI-PMH harvesters
    • efficiently discover updates & additions
  • For each URL, we can:
    • create “preservation ready” version with configurable {descriptive|technical|structural} metadata
      • e.g., Jhove output, datestamps, signatures, provenance, automatically generated summary, etc.

[Figure: the workflow – harvest the resource, extract metadata (Jhove and other pertinent info, lexical signatures, summaries, an index, translations…), and wrap it all together in an XML stream, ready for the future.]

Outline: Web Server Enhanced Preservation
  • OAI-PMH
  • mod_oai: complex objects + resource harvesting
  • Research Focus
Research Contributions

Thesis Question: How well can Apache support web page preservation?

Goal: To make web resources “preservation ready”

  • Support refreshing (“how many URLs at this site?”): the counting problem
  • Support migration (“what is this object?”): the representation problem

How: Using OAI-PMH resource harvesting

  • Aggregate forensic metadata
    • Automate extraction
  • Encapsulate into an object
    • XML stream of information
  • Maximize preservation opportunity
    • Bring DL technology into the realm of WWW
Experimentation & Evaluation
  • Research solutions to the counting problem
    • Different tools yield different results
    • Google sitemap ≠ Apache file list ≠ robot-crawled pages
    • Combine approaches for one automated, full URL listing (see the sketch after this list)
      • Apache logs are detailed history of site activity
      • Compare user page requests with crawlers’ requests
      • Compare crawled pages with actual site tree
  • Continue research on the representation problem
    • Integrate utilities into mod_oai (Jhove, etc.)
    • Automate metadata extraction & encapsulation
  • Serialize and reconstitute
    • complete back-up of site & reconstitution through XML stream
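As a sketch of the kind of comparison proposed above (the item on combining approaches): diff the URL sets seen by different tools. This assumes a Common/Combined Log Format access log and a standard sitemap.xml; the file names are placeholders.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_access_log(path):
    """Request paths seen in a Common/Combined Log Format access log."""
    pat = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')
    paths = set()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = pat.search(line)
            if m:
                paths.add(m.group(1).split("?")[0])
    return paths

def urls_from_sitemap(path):
    """Path components of the <loc> entries in a sitemap.xml file."""
    return {urlparse(loc.text.strip()).path
            for loc in ET.parse(path).iter(SITEMAP_NS + "loc") if loc.text}

logged = urls_from_access_log("access.log")   # placeholder file names
mapped = urls_from_sitemap("sitemap.xml")
print("requested but not in the sitemap:", sorted(logged - mapped)[:20])
print("in the sitemap but never requested:", sorted(mapped - logged)[:20])
```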
Summary: Web Server Enhanced Preservation
  • Better web harvesting can be achieved through:
    • OAI-PMH: structured access to updates
    • Complex object formats: modeled representation of digital objects
  • Address 2 key problems:
    • Preservation (ListRecords) – The Representation Problem
    • Web crawling (ListIdentifiers) – The Counting Problem
  • mod_oai: reference implementation
    • Better performance than wget & crawlers
    • not a replacement for DSpace, Fedora, eprints.org, etc.
  • More info:
    • http://www.modoai.org/
    • http://whiskey.cs.odu.edu/

Automatic harvesting of web resources rich in metadata packaged for the future

Today: manual

Tomorrow: automatic!


Summary

Michael L. Nelson

Summary
  • Digital preservation is not hard, it’s just big.
    • Save the women and children first, of course, but there is room for many more…
  • Using the by-product of SE and WI, we can get a good amount of preservation for free
    • prediction: Google et al. will eventually see preservation as a business opportunity
  • Increasing the role of the web server will solve most of the digital preservation problems
    • complex objects + OAI-PMH = digital preservation solution
Slide 77

“As you know, you preserve the files you have. They’re not the files you might want or wish to have at a later time.”

“If you think about it, you can have all the metadata in the world on a file and a file can be blown up.”

image from: http://www.washingtonpost.com/wp-dyn/articles/A132-2004Dec14.html