Characterization of Search Engine Caches
280 likes | 611 Views
Characterization of Search Engine Caches. Frank McCown & Michael L. Nelson Old Dominion University Norfolk, Virginia, USA Arlington, Virginia May 22, 2007. Outline. Preserving and caching the Web Lazy preservation Search engine sampling experiment.
Characterization of Search Engine Caches
E N D
Presentation Transcript
Characterization of Search Engine Caches Frank McCown & Michael L. Nelson Old Dominion UniversityNorfolk, Virginia, USA Arlington, VirginiaMay 22, 2007
Outline • Preserving and caching the Web • Lazy preservation • Search engine sampling experiment
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpgVirus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
Preservation: Fortress Model 5 easy steps for preservation: • Get a lot of $ • Buy a lot of disks, machines, tapes, etc. • Hire an army of staff • Load a small amount of data • “Look upon my archive ye Mighty, and despair!” Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
Internet Archive? How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)
Alternative Models of Preservation • Lazy Preservation • Let Google, IA et al. preserve your website • Just-In-Time Preservation • Wait for it to disappear first, then a “good enough” version • Shared Infrastructure Preservation • Push your content to sites that might preserve it • Web Server Enhanced Preservation • Use Apache modules to create archival-ready resources
Cached PDF http://www.fda.gov/cder/about/whatwedo/testtube.pdf canonical MSN version Yahoo version Google version
Frank McCown, Amine Benjelloun, and Michael L. Nelson. Brass: A Queueing Manager for Warrick. 7th International Web Archiving Workshop (IWAW 2007). To appear. • Frank McCown, Norou Diawara, and Michael L. Nelson. Factors Affecting Website Reconstruction from the Web Infrastructure. ACM IEEE Joint Conference on Digital Libraries (JCDL 2007). To appear. • Frank McCown and Michael L. Nelson. Evaluation of Crawling Policies for a Web-Repository Crawler. 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2006) • Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. 8th ACM International Workshop on Web Information and Data Management (WIDM 2006) Available for download at http://www.cs.odu.edu/~fmccown/warrick/
Experiment: Sample Search Engine Caches • Feb 2006 • Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo • Randomly selected 1 result from first 100 • Download resource and cached page • Check for overlap with Internet Archive
976 KB 977 KB 215 KB 1 MB Cached Resource Size Distributions
Cache Freshness Fresh Stale Fresh time crawled and cached changed on web server crawled and cached Staleness = max(0, Last-Modified http header – cached date)
Cache Staleness • 46% of resources had Last-Modified header • 71% also had cached date • 16% were at least 1 day stale
Similarity • Compared live web resource with cached counterpart using shingling • Shingling – ratio of unique, shared, contiguous subsequences of tokens in a document • 19% of all resources have identical shingles • 21% of HTML resources have identical shingles • Resources shared 72% of their shingles on average
Conclusions • Ask is not useful (9% of resources cached) • Approximately 85% of indexed content is available in SE caches • All search engines appear to cache TLDs and different MIME types at the same rate • IA contains only 46% of the resources available in SE caches • Approximately 7% of indexed resources are missing from SE caches and IA
Thank You Frank McCown fmccown@cs.odu.edu http://www.cs.odu.edu/~fmccown/