
Lazy Preservation: Reconstructing Websites by Crawling the Crawlers



  1. WIDM 2006. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. Frank McCown, Joan A. Smith, Michael L. Nelson, & Johan Bollen, Old Dominion University, Norfolk, Virginia, USA. Arlington, Virginia, November 10, 2006

  2. Outline • Web page threats • Web Infrastructure • Web caching experiment • Web repository crawling • Website reconstruction experiment

  3. [Slide image credits] Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg • Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg • Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

  4. How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)

  5. Cached Image

  6. Cached PDF: http://www.fda.gov/cder/about/whatwedo/testtube.pdf [Figure: the canonical PDF alongside the MSN, Yahoo, and Google cached versions]

  7. Web Repository Characteristics • C: canonical version is stored • M: modified version is stored (modified images are thumbnails; all others are HTML conversions) • ~R: indexed but not retrievable • ~S: indexed but not stored

  8. Timeline of Web Resource

  9. Web Caching Experiment • Create 4 websites composed of HTML, PDF, and image resources • http://www.owenbrau.com/ • http://www.cs.odu.edu/~fmccown/lazy/ • http://www.cs.odu.edu/~jsmit/ • http://www.cs.odu.edu/~mln/lazp/ • Remove pages each day • Query Google, MSN, and Yahoo (GMY) each day using unique identifiers (a sketch of this polling step follows below)
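
The daily polling step lends itself to a small script. Here is a minimal sketch, assuming each test page was seeded with a unique identifier string and using present-day search endpoints (MSN Search is now Bing); the talk does not show the experiment's actual scripts or query syntax, so the endpoints, the identifier, and the presence test are all illustrative assumptions.

```python
# Illustrative sketch of the daily cache-polling step (not the authors' scripts).
# Assumes each test page carries a unique identifier string, and treats the
# identifier appearing in an engine's result page as evidence of indexing.
import requests

SEARCH_ENDPOINTS = {
    "google": "https://www.google.com/search",
    "bing": "https://www.bing.com/search",       # MSN Search's successor
    "yahoo": "https://search.yahoo.com/search",
}

def is_indexed(identifier: str, engine: str) -> bool:
    """Query one engine for the page's unique identifier; crude presence test."""
    resp = requests.get(
        SEARCH_ENDPOINTS[engine],
        params={"q": f'"{identifier}"'},         # exact-phrase query
        headers={"User-Agent": "cache-experiment/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    return identifier in resp.text

if __name__ == "__main__":
    # One hypothetical identifier per test page, polled once per day (e.g. cron).
    for engine in SEARCH_ENDPOINTS:
        print(engine, is_indexed("lazyp-4jx92q-unique-id", engine))
```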

  10. Crawling the Web and web repositories

  11. Warrick • First developed in fall of 2005 • Available for download at http://www.cs.odu.edu/~fmccown/warrick/ • www2006.org – first lost website reconstructed (Nov 2005) • DCkickball.org – first website someone else reconstructed without our help (late Jan 2006) • www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006) • Internet Archive officially endorses Warrick (mid Mar 2006) – a toy repository lookup is sketched below
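
Warrick queries the caches of Google, MSN, and Yahoo plus the Internet Archive, and its real logic is considerably more involved than any single call. As a flavor of the repository-crawling idea only, here is a minimal sketch that recovers one URL via the Internet Archive's present-day Wayback Machine availability API; that endpoint postdates this talk and stands in for Warrick's per-repository queries.

```python
# Minimal sketch of recovering one resource from a web repository.
# Uses the Internet Archive's Wayback Machine "availability" API as a
# stand-in for Warrick's per-repository cache queries.
import requests

def recover_from_wayback(url: str) -> bytes | None:
    """Return the closest archived copy of `url`, or None if none exists."""
    meta = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    ).json()
    snapshot = meta.get("archived_snapshots", {}).get("closest")
    if not snapshot or not snapshot.get("available"):
        return None
    return requests.get(snapshot["url"], timeout=30).content

if __name__ == "__main__":
    body = recover_from_wayback("http://www.dckickball.org/")
    print("recovered" if body else "not archived", len(body or b""), "bytes")
```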

  12. How Much Did We Reconstruct? [Diagram: a “lost” website (resources A–F) beside its reconstructed version] Four categories of recovered resources: 1) Identical: A, E 2) Changed: B, C 3) Missing: D, F 4) Added: G. In the diagram, the link to the missing D still points to the old resource, and F can’t be found.

  13. Reconstruction Diagram [Figure: one reconstruction summarized as identical 50%, changed 33%, missing 17%, added 20%]

  14. Reconstruction Experiment • Crawl and reconstruct 24 sites of various sizes: 1. small (1-150 resources) 2. medium (151-499 resources) 3. large (500+ resources) • Perform 5 reconstructions for each website • One using all four repositories together • Four using each repository separately • Calculate a reconstruction vector for each reconstruction (changed%, missing%, added%) – see the sketch below
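
Given content hashes for the original and reconstructed sites, the four categories from slide 12 and the reconstruction vector fall out mechanically. Below is a sketch under one plausible convention (changed% and missing% taken over the original site, added% over the reconstructed site); the paper defines the exact denominators, so treat these percentages as an assumption.

```python
# Sketch: classify recovered resources and compute a reconstruction vector.
# Inputs are {path: content_hash} maps for the original ("lost") site and the
# reconstructed site; the percentage conventions here are an assumption.
from hashlib import sha256

def classify(original: dict[str, str], reconstructed: dict[str, str]):
    identical = {p for p in original if reconstructed.get(p) == original[p]}
    changed = {p for p in original if p in reconstructed and p not in identical}
    missing = set(original) - set(reconstructed)
    added = set(reconstructed) - set(original)
    return identical, changed, missing, added

def reconstruction_vector(original, reconstructed):
    identical, changed, missing, added = classify(original, reconstructed)
    n, r = len(original), len(reconstructed)
    return (len(changed) / n, len(missing) / n, len(added) / r)

if __name__ == "__main__":
    # Toy example mirroring slide 12: B and C changed, D and F lost, G added.
    h = lambda s: sha256(s.encode()).hexdigest()
    lost = {p: h(p) for p in "ABCDEF"}
    rebuilt = {"A": h("A"), "B": h("B'"), "C": h("C'"), "E": h("E"), "G": h("G")}
    print(reconstruction_vector(lost, rebuilt))  # (changed%, missing%, added%)
```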

  15. Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical Report, arXiv cs.IR/0512069, 2005.

  16. Recovery Success by MIME Type

  17. Repository Contributions

  18. Current & Future Work • Building a web interface for Warrick • Currently crawling & reconstructing 300 randomly sampled websites each week • Move from a descriptive model to a prescriptive & predictive model • Injecting server-side functionality into the WI (web infrastructure) • Recover the PHP code, not just the HTML
