
Improved Techniques for Result Caching in Web Search Engines


Presentation Transcript


  1. Improved Techniques for Result Caching in Web Search Engines Qingqing Gan Torsten Suel CSE Department Polytechnic Institute of NYU

  2. Content of this Talk Result caching in web search engines (1) The case of weighted caching: some queries more expensive to recompute than others - investigate algorithms for this case - hybrid algorithms, and impact of power laws (2) Feature-based approach to caching - improvements for result and index caching

  3. Caching • Query processing is a major performance bottleneck • Common performance optimizations: caching, index compression, index pruning and early termination, parallel processing • Multi-level caching: result caching vs. index caching • Mostly focus on result caching (but also index)

  4. Query Processing • Inverted index can efficiently identify pages that contain a particular word or set of words • Main challenge for query processing is the significant size of the index data for a query • Need to optimize to scale with users and data • Caching is one of such optimizations • Result caching: has query occurred before? • List caching: has index data for term been accessed before?
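
To make the first bullet concrete, here is a toy inverted index with conjunctive query processing; the names (`build_index`, `conjunctive_query`) are illustrative, not from the talk:

```python
from collections import defaultdict

def build_index(docs):
    """Toy inverted index: term -> sorted list of doc IDs containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in sorted(set(text.split())):
            index[term].append(doc_id)
    return index

def conjunctive_query(index, terms):
    """Docs containing all terms: intersect posting lists,
    starting from the shortest list (cheapest order)."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = set(lists[0]) if lists else set()
    for lst in lists[1:]:
        result &= set(lst)
    return sorted(result)
```

For a query whose terms all appeared before, both the result (result caching) and the posting lists (list caching) may already be in memory, which is exactly the distinction the slide draws.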

  5. Related Work • Markatos (WCW 2000) studies query log distributions and compares several basic caching algorithms • Number of subsequent papers on result caching: • Baeza-Yates et al. (SPIRE 2003, 2007, SIGIR 2003) • Fagni et al. (TOIS 2006) • Lempel/Moran (WWW 2003) • Saraiva et al. (SIGIR 2001) • Xie/O'Hallaron (Infocom 2002) • Fagni et al. propose hybrid methods that combine a dynamic cache with a more static cache • Baeza-Yates et al. (SPIRE 2007) use some features for a cache admission policy

  6. Basics • Sequence of queries q_1 to q_n • LRU: least recently used • LFU: least frequently used • Can be implemented using basic data structures • Score defined as the timestamp of the last occurrence of the same query in LRU, or the frequency of a query in LFU • Evict query with smallest score • Recency (LRU) vs. frequency (LFU) • Various hybrids
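
A minimal sketch of the LRU policy described above, using an ordered dictionary so that dictionary order doubles as recency order (class and method names are illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """Evict the entry whose last occurrence is furthest in the past."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order = recency order

    def access(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)   # refresh: most recently used
            return True                       # cache hit
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[query] = True
        return False                          # cache miss
```

LFU is analogous, except the eviction key is a per-query counter instead of the recency position.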

  7. SDC (Static and Dynamic Caching) • Fagni et al. (TOIS 2006) • Static part: most frequent queries (LFU on a training log) • Dynamic part: LRU • Alpha = 0.7 (fraction of capacity given to the static part)
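
A sketch of SDC under the reading above: the static segment is filled once with the most frequent queries from a training log, the rest of the capacity runs LRU (the class name and `alpha` default are assumptions for illustration):

```python
from collections import OrderedDict

class SDC:
    """Static-Dynamic Caching sketch: a frozen top-frequency segment
    plus an LRU segment; alpha = fraction of capacity held static."""
    def __init__(self, capacity, training_counts, alpha=0.7):
        k = int(alpha * capacity)
        top = sorted(training_counts, key=training_counts.get, reverse=True)
        self.static = set(top[:k])       # filled once, never evicted
        self.dynamic_cap = capacity - k
        self.lru = OrderedDict()

    def access(self, query):
        if query in self.static:
            return True
        if query in self.lru:
            self.lru.move_to_end(query)
            return True
        if len(self.lru) >= self.dynamic_cap:
            self.lru.popitem(last=False)
        self.lru[query] = True
        return False
```

The static part captures the heavy head of the Zipf distribution; the LRU part absorbs the bursty tail.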

  8. Characteristics of Queries • Query frequencies follow Zipf distribution • While a few queries are quite frequent, most queries occur only once or a few times

  9. Characteristics of Queries • Query traces exhibit some amount of burstiness, i.e., occurrences of queries are often clustered • A significant part of this burstiness is due to the same user reissuing a query to the engine.

  10. Contributions • Study result caching as a weighted caching problem - Hit ratio - Cost saving • Hybrid algorithms for weighted caching • Caching and power laws • Feature-based cache eviction policies

  11. Weighted Caching • Assume all cache entries have same size • Standard caching: all entries also same cost • Weighted caching: different costs • Result caching: some queries more expensive to recompute than others • In fact, costs highly skewed • Should keep expensive results longer • Note: throughput vs. latency

  12. Weighted Caching Algorithms • LFU_w: evict entry with smallest value of past frequency * cost (weighted version of LFU) • Landlord • On insertion, give entry a deadline equal to its cost • Evict entry with smallest deadline, and deduct this deadline from all other deadlines in the cache • Weighted version of LRU (Young, Cao/Irani 1998) • Clairvoyant: no poly. time optimal offline known • We cook up an estimate • Assume the system returns the cost of each query it computes
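
The Landlord rule above can be sketched as follows; the global deduction is done lazily via an offset rather than by touching every entry, so each operation stays O(log c) (the hit-handling rule, resetting the deadline back to the cost, is one common variant):

```python
import heapq

class Landlord:
    """Landlord weighted caching (uniform entry sizes): each entry gets a
    deadline equal to its recomputation cost; eviction removes the smallest
    deadline and implicitly deducts it from all remaining entries."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0.0    # total "rent" deducted so far, applied lazily
        self.deadline = {}   # query -> absolute deadline (offset included)
        self.heap = []       # (deadline, query); may hold stale entries

    def access(self, query, cost):
        if query in self.deadline:
            self.deadline[query] = self.offset + cost  # refresh on hit
            heapq.heappush(self.heap, (self.deadline[query], query))
            return True
        if len(self.deadline) >= self.capacity:
            self._evict()
        self.deadline[query] = self.offset + cost
        heapq.heappush(self.heap, (self.deadline[query], query))
        return False

    def _evict(self):
        while True:
            d, q = heapq.heappop(self.heap)
            if self.deadline.get(q) == d:  # skip stale heap entries
                del self.deadline[q]
                self.offset = d            # lazy deduction for the rest
                return
```

Note how an expensive result outlives many cheap ones, which is the point of weighted caching.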

  13. Dataset • 2006 AOL query log with 36 million queries • Queries consisting only of stop words are removed • Requests for further result pages are removed

  14. Hit Ratio of Basic Algorithms

  15. Cost Reduction

  16. New Hybrid Algorithms • SDC • lru_lfu • landlord_lfu_w

  17. Weighted Caching and Power Laws • Problem with weighted caching under high skew • Suppose q_1 has occurred once and has cost 10, and q_2 has occurred 10 times and has cost 1 • LFU_w gives both the same priority. Is that right? • Lottery analogy: • Multiple rounds, one winner per round • Some people buy more tickets than others • But each person buys the same number each week • Given past history, guess future winners • Suppose ticket sales are Zipfian

  18. Weighted Caching and Power Laws • Compare: smoothing techniques in language models • Three solutions: • Good-Turing estimator • Estimator derived from power law • Pragmatic: fit correction factors from real data • Last solution subsumes others

  19. Weighted Zipfian Caching E.g., in LFU_w: priority score = cost * frequency * g(), where g is the correction factor
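
A sketch of the corrected priority. The particular correction function `g` below is purely hypothetical (in the talk it would be fitted from training data, per slide 18); it damps low counts, since under a Zipfian distribution a single past occurrence over-predicts future occurrences relative to many repeats:

```python
def corrected_priority(cost, freq, g):
    """LFU_w priority with a frequency-dependent correction factor g."""
    return cost * freq * g(freq)

# hypothetical correction: damp priorities for rarely seen queries
g = lambda f: f / (f + 1.0)

# slide 17's example: q_1 seen once with cost 10, q_2 seen 10 times with cost 1
p1 = corrected_priority(10, 1, g)   # 10 * 1 * 0.5
p2 = corrected_priority(1, 10, g)   # 1 * 10 * 10/11
```

With plain LFU_w both queries would tie at priority 10; the correction breaks the tie in favor of the frequently repeated query.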

  20. Hybrid Algorithms After Adding Correction

  21. Feature-Based Caching • Most standard algorithms view input as sequence of object IDs • Hides many application details! • E.g., query length, frequency of query terms in query logs or in collection, click behavior, navigational/informational query • But these could be very useful for caching! • So, can/should we use more features in caching? • … and, should we keep using “explicit” algorithms, or rely on machine learning? • Compare: ranking functions in IR • Previous work: Baeza-Yates et al. (SPIRE 2007)

  22. Features • F1: steps to the last occurrence of this query; • F2: steps between the last two occurrences of this query, if it occurs at least twice; • F3: query frequency so far; • F4: query length; • F5: length of the shortest inverted index list over all terms in the query; • F6: frequency of the rarest query term; • F7: number of users who issue this query; • F8: among the users in F7, the gap between the last two issues of the query by the most recently active user; • F9: average number of clicks per query; • F10: query frequency of the rarest pair of terms in the query.

  23. Caching Algorithm • Trivial machine learning approach (i.e., counting) • Split each feature into a few bins, thus placing each cache entry into one bin • For each bin, estimate likelihood of reoccurrence using past queries • During caching (online), can efficiently move entries between bins until eviction • O(lg c) cost per element (c is cache size)
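
A minimal sketch of the counting approach, using just two coarse features to define the bins (the feature choice, bin boundaries, and smoothing are illustrative assumptions, not the talk's exact setup):

```python
from collections import defaultdict

class BinPredictor:
    """Counting-based estimate of re-occurrence likelihood per feature bin.
    Each cached query is mapped to a bin by its feature values; the bin's
    observed hit rate serves as the entry's eviction priority."""
    def __init__(self):
        self.seen = defaultdict(int)  # bin -> entries placed in this bin
        self.hits = defaultdict(int)  # bin -> how many re-occurred

    @staticmethod
    def bin_of(query_len, freq_so_far):
        # coarse bins: short vs. long query x capped frequency bucket
        return (query_len <= 2, min(freq_so_far, 4) // 2)

    def record(self, b, reoccurred):
        # offline/training pass: count outcomes per bin
        self.seen[b] += 1
        self.hits[b] += int(reoccurred)

    def prob(self, b):
        # smoothed likelihood that an entry in bin b is requested again
        return (self.hits[b] + 1) / (self.seen[b] + 2)
```

Online, a cache entry whose features change (e.g., its frequency counter increments) is simply moved to its new bin, which is the O(lg c)-per-element bookkeeping the slide mentions.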

  24. Experimental Results – Hit Ratio

  25. Experimental Results – Hit Ratio (cont.)

  26. Experimental Results – Cost Savings Priority score = probability score * cost of this query

  27. Experimental Results – List Caching

  28. Discussion • A bunch of results on caching, in two parts • Note: the feature-based approach beats the algorithms in the first part! • Open: cache size versus cache freshness • Other applications of the feature-based approach

  29. Questions?
