Characterization of Search Engine Caches

Frank McCown & Michael L. Nelson
Old Dominion University, Norfolk, Virginia, USA
Arlington, Virginia, May 22, 2007

Outline
• Preserving and caching the Web
• Lazy preservation
• Search engine sampling experiment

Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

Preservation: Fortress Model
5 easy steps for preservation:
• Get a lot of $
• Buy a lot of disks, machines, tapes, etc.
• Hire an army of staff
• Load a small amount of data
• “Look upon my archive ye Mighty, and despair!”
Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt
Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

Internet Archive? How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)

Alternative Models of Preservation
• Lazy Preservation
  • Let Google, IA, et al. preserve your website
• Just-In-Time Preservation
  • Wait for it to disappear first, then locate a “good enough” version
• Shared Infrastructure Preservation
  • Push your content to sites that might preserve it
• Web Server Enhanced Preservation
  • Use Apache modules to create archival-ready resources

Cached Image

Cached PDF: http://www.fda.gov/cder/about/whatwedo/testtube.pdf
(Figure panels: canonical, MSN version, Yahoo version, Google version)

Crawling the Web and web repositories

• Frank McCown, Amine Benjelloun, and Michael L. Nelson. Brass: A Queueing Manager for Warrick. 7th International Web Archiving Workshop (IWAW 2007). To appear.
• Frank McCown, Norou Diawara, and Michael L. Nelson. Factors Affecting Website Reconstruction from the Web Infrastructure. ACM IEEE Joint Conference on Digital Libraries (JCDL 2007). To appear.
• Frank McCown and Michael L. Nelson. Evaluation of Crawling Policies for a Web-Repository Crawler. 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2006).
• Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. 8th ACM International Workshop on Web Information and Data Management (WIDM 2006).
Available for download at http://www.cs.odu.edu/~fmccown/warrick/

Experiment: Sample Search Engine Caches
• Feb 2006
• Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo
• Randomly selected 1 result from the first 100
• Downloaded the resource and its cached page
• Checked for overlap with the Internet Archive
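The selection step above can be sketched in a few lines of Python. This is only an illustration of picking one result uniformly at random from the first 100; the function name `pick_sample`, the `limit` parameter, and the example URLs are hypothetical, and how the result lists were actually obtained from each search engine is not covered by the slides.

```python
import random
from typing import Optional, Sequence


def pick_sample(results: Sequence[str], rng: random.Random,
                limit: int = 100) -> Optional[str]:
    """Pick one result uniformly at random from the first `limit` results.

    `results` is assumed to be the ranked list of result URLs returned
    for a single one-term query. Returns None when the query had no hits.
    """
    candidates = results[:limit]
    if not candidates:
        return None
    return rng.choice(candidates)


# Hypothetical ranked result list for one query.
ranked = [f"http://example.com/page{i}" for i in range(250)]
chosen = pick_sample(ranked, random.Random(2006))
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which helps when the same sample has to be looked up again in several caches.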

Web and Cache Overlap

Indexed and Cached Content by Type

Distribution of Top Level Domains

Cached Resource Size Distributions
(Figure annotations: 976 KB, 977 KB, 215 KB, 1 MB)

Cache Freshness
(Timeline figure: fresh when crawled and cached; stale once changed on the web server; fresh again when re-crawled and cached)
Staleness = max(0, Last-Modified HTTP header – cached date)
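A minimal sketch of the staleness formula in Python, assuming the Last-Modified value arrives as a standard HTTP date string and the cached date is already a timezone-aware `datetime`. The function name `staleness_days` and the choice of days as the unit are assumptions made for illustration, not details from the slides.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def staleness_days(last_modified_header: str, cached_date: datetime) -> float:
    """Staleness = max(0, Last-Modified - cached date), in days.

    A resource modified after it was cached is stale by the size of the gap;
    one cached after its last modification has a staleness of zero.
    """
    last_modified = parsedate_to_datetime(last_modified_header)
    delta = last_modified - cached_date
    return max(0.0, delta.total_seconds() / 86400)


# Hypothetical example: page changed two days after the cached copy was made.
cached = datetime(2006, 2, 13, 12, 0, 0, tzinfo=timezone.utc)
stale = staleness_days("Wed, 15 Feb 2006 12:00:00 GMT", cached)
```

The `max(0, ...)` clamp means a cached copy newer than the last modification is reported as fresh rather than as negatively stale.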

Cache Staleness
• 46% of resources had a Last-Modified header
• 71% also had a cached date
• 16% were at least 1 day stale

Distribution of Staleness

Similarity
• Compared live web resource with cached counterpart using shingling
• Shingling – ratio of unique, shared, contiguous subsequences of tokens in a document
• 19% of all resources have identical shingles
• 21% of HTML resources have identical shingles
• Resources shared 72% of their shingles on average
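The shingling comparison can be sketched as follows. The slides do not give the shingle width, the tokenizer, or the exact ratio used, so this example assumes whitespace tokens, a hypothetical width `w=4`, and the common set-based ratio of shared shingles to all unique shingles (Jaccard resemblance); the names `shingles` and `shingle_similarity` are illustrative.

```python
from typing import Set, Tuple


def shingles(text: str, w: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of unique w-token contiguous subsequences of `text`."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Documents shorter than one shingle are treated as a single shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def shingle_similarity(a: str, b: str, w: int = 4) -> float:
    """Ratio of shared unique shingles to all unique shingles of a and b."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are considered identical
    return len(sa & sb) / len(sa | sb)


live = "search engines cache many of the pages they index"
cached = "search engines cache many of the pages they crawl"
similarity = shingle_similarity(live, cached)
```

Under this definition a score of 1.0 corresponds to "identical shingles", and a single changed word lowers the score by removing every shingle that spans it.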

Similarity vs. Staleness

Overlap with Internet Archive

Distribution of Sampled URLs

Conclusions
• Ask is not useful (9% of resources cached)
• Approximately 85% of indexed content is available in SE caches
• All search engines appear to cache TLDs and different MIME types at the same rate
• IA contains only 46% of the resources available in SE caches
• Approximately 7% of indexed resources are missing from SE caches and IA

Thank You
Frank McCown
fmccown@cs.odu.edu
http://www.cs.odu.edu/~fmccown/
