
Characterization of Search Engine Caches

Characterization of Search Engine Caches. Frank McCown & Michael L. Nelson, Old Dominion University, Norfolk, Virginia, USA. Arlington, Virginia, May 22, 2007. Outline: preserving and caching the Web; lazy preservation; search engine sampling experiment.


Presentation Transcript


  1. Characterization of Search Engine Caches. Frank McCown & Michael L. Nelson, Old Dominion University, Norfolk, Virginia, USA. Arlington, Virginia, May 22, 2007

  2. Outline • Preserving and caching the Web • Lazy preservation • Search engine sampling experiment

  3. Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg; Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg; Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

  4. Preservation: Fortress Model 5 easy steps for preservation: • Get a lot of $ • Buy a lot of disks, machines, tapes, etc. • Hire an army of staff • Load a small amount of data • “Look upon my archive ye Mighty, and despair!” Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

  5. Internet Archive? How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)

  6. Alternative Models of Preservation • Lazy Preservation: let Google, IA et al. preserve your website • Just-In-Time Preservation: wait for it to disappear first, then find a “good enough” version • Shared Infrastructure Preservation: push your content to sites that might preserve it • Web Server Enhanced Preservation: use Apache modules to create archival-ready resources

  7. Cached Image

  8. Cached PDF: http://www.fda.gov/cder/about/whatwedo/testtube.pdf (figure panels: canonical version, MSN version, Yahoo version, Google version)

  9. Crawling the Web and web repositories

  10. • Frank McCown, Amine Benjelloun, and Michael L. Nelson. Brass: A Queueing Manager for Warrick. 7th International Web Archiving Workshop (IWAW 2007). To appear. • Frank McCown, Norou Diawara, and Michael L. Nelson. Factors Affecting Website Reconstruction from the Web Infrastructure. ACM IEEE Joint Conference on Digital Libraries (JCDL 2007). To appear. • Frank McCown and Michael L. Nelson. Evaluation of Crawling Policies for a Web-Repository Crawler. 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2006). • Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. 8th ACM International Workshop on Web Information and Data Management (WIDM 2006). Available for download at http://www.cs.odu.edu/~fmccown/warrick/

  11. Experiment: Sample Search Engine Caches • Feb 2006 • Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo • Randomly selected 1 result from the first 100 • Downloaded the resource and its cached page • Checked for overlap with the Internet Archive
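The selection step on this slide (one result chosen at random from the first 100) can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_result`, the fixed seed, and the placeholder URLs are hypothetical and not taken from the presentation, and uniform selection is assumed because the slide does not say how the random choice was made.

```python
import random

def sample_result(results, rng, top_n=100):
    """Pick one result uniformly at random from the first top_n results.

    `results` is a ranked list of result URLs for a single query
    (hypothetical input; the slide does not describe the data format).
    Returns None when the query produced no results.
    """
    candidates = results[:top_n]
    if not candidates:
        return None
    return rng.choice(candidates)

# Placeholder ranked result list for one query (made-up URLs).
results = [f"http://example.com/page{i}" for i in range(250)]
rng = random.Random(2006)  # fixed seed only so the sketch is repeatable
picked = sample_result(results, rng)
```

Passing the random generator in explicitly keeps the sampling repeatable, which matters when the same sample must later be compared against several caches.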

  12. Web and Cache Overlap

  13. Indexed and Cached Content by Type

  14. Distribution of Top Level Domains

  15. Cached Resource Size Distributions (figure annotations: 976 KB, 977 KB, 215 KB, 1 MB)

  16. Cache Freshness • Timeline: crawled and cached (fresh), then changed on web server (stale), then crawled and cached again (fresh) • Staleness = max(0, Last-Modified HTTP header - cached date)
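The staleness formula on this slide can be written out as a small function. This is a minimal sketch, assuming the Last-Modified value arrives as a standard HTTP date string and the cached date is already a timezone-aware datetime; the function name and the sample dates are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def staleness(last_modified_header, cached_date):
    """Staleness = max(0, Last-Modified - cached date), as a timedelta.

    last_modified_header: HTTP date string, e.g. "Wed, 15 Feb 2006 12:00:00 GMT"
    cached_date: timezone-aware datetime of when the engine cached the page
    """
    last_modified = parsedate_to_datetime(last_modified_header)
    # A page modified before it was cached is fresh, so clamp at zero.
    return max(timedelta(0), last_modified - cached_date)

cached = datetime(2006, 2, 10, 12, 0, tzinfo=timezone.utc)  # made-up cached date
stale = staleness("Wed, 15 Feb 2006 12:00:00 GMT", cached)  # changed after caching
fresh = staleness("Wed, 01 Feb 2006 12:00:00 GMT", cached)  # changed before caching
```

A resource counts as "at least 1 day stale" in this sketch when the returned value is `timedelta(days=1)` or more.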

  17. Cache Staleness • 46% of resources had Last-Modified header • 71% also had cached date • 16% were at least 1 day stale

  18. Distribution of Staleness

  19. Similarity • Compared live web resource with cached counterpart using shingling • Shingling – ratio of unique, shared, contiguous subsequences of tokens in a document • 19% of all resources have identical shingles • 21% of HTML resources have identical shingles • Resources shared 72% of their shingles on average
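The shingle comparison described on this slide can be sketched as below. The slide does not give the shingle length, the tokenizer, or the exact ratio used, so this sketch assumes whitespace tokens, shingles of 3 tokens, and the common resemblance measure (shared unique shingles divided by all unique shingles); the function names and sample sentences are hypothetical.

```python
def shingles(text, w=3):
    """Return the set of unique contiguous w-token subsequences of text."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Short documents yield a single shingle made of all their tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def shingle_similarity(live_text, cached_text, w=3):
    """Ratio of shared unique shingles to all unique shingles (0.0 to 1.0)."""
    a, b = shingles(live_text, w), shingles(cached_text, w)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)

live = "the quick brown fox jumps over the lazy dog"
cached = "the quick brown fox jumps over the sleepy dog"
similarity = shingle_similarity(live, cached)
```

Under these assumptions, "identical shingles" corresponds to a similarity of 1.0, and a single changed word lowers the score by removing every shingle that contains it.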

  20. Similarity vs. Staleness

  21. Overlap with Internet Archive

  22. Overlap with Internet Archive

  23. Distribution of Sampled URLs

  24. Conclusions • Ask is not useful (9% of resources cached) • Approximately 85% of indexed content is available in SE caches • All search engines appear to cache TLDs and different MIME types at the same rate • IA contains only 46% of the resources available in SE caches • Approximately 7% of indexed resources are missing from SE caches and IA

  25. Thank You Frank McCown fmccown@cs.odu.edu http://www.cs.odu.edu/~fmccown/
