
The 31st Annual International ACM SIGIR Conference, Singapore, 21 July 2008



Presentation Transcript


  1. ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines • The 31st Annual International ACM SIGIR Conference, Singapore, 21 July 2008

  2. Motivation • Caching – crucial for web search engines (WSE) to save resources • Results caching: • Is efficient with real queries • But its hit rate is limited due to singletons (queries that occur only once) • How to increase the hit rate further? – index pruning

  3. Contents • ResIn architecture • Original query stream vs. query stream after the results cache (misses) • Static pruned index: • Term pruning • Document pruning • A combination of both • Conclusion

  4. ResIn architecture • We study Results Caching and Index Pruning together • … to reduce latency and load on back-end servers • Query processing: 1. from the main index [Diagram: the query goes from the front end to the broker and on to the back ends, each holding a term cache over the main index; each back end returns its top results and the result travels back through the broker and front end]

  5. ResIn architecture • We study Results Caching and Index Pruning together • … to reduce latency and load on back-end servers • Query processing: 2. from the results cache [Diagram: the front end consults the results cache; a hit returns the result directly, a miss sends the query on to the broker and the back ends]

  6. ResIn architecture • We study Results Caching and Index Pruning together • … to reduce latency and load on back-end servers • Query processing: 3. from the pruned index [Diagram: a results cache miss goes to the pruned index; a hit there returns the result, a miss sends the query on to the broker and the back ends holding the main index]
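The three query-processing paths on slides 4 to 6 can be read as a fall-through lookup. The following is a minimal sketch of that reading, not the authors' implementation: the function name `answer`, the use of a plain dict as the results cache (no eviction), and the choice to cache results from both the pruned and the main index are all assumptions made for illustration.

```python
def answer(query, results_cache, pruned_index, main_index):
    """Return (results, tier), trying the cheapest tier first.

    results_cache: dict mapping query -> results (hypothetical; no eviction here).
    pruned_index:  callable returning results, or None when it cannot answer.
    main_index:    callable that always returns results.
    """
    # 1. Results cache: a hit is answered without touching any index.
    if query in results_cache:
        return results_cache[query], "results cache"

    # 2. Pruned index: either the same top-k as the main index, or a miss (None).
    results = pruned_index(query)
    tier = "pruned index"

    # 3. Main index: the fallback that can answer every query.
    if results is None:
        results = main_index(query)
        tier = "main index"

    # Assumption: results from either index are stored in the results cache.
    results_cache[query] = results
    return results, tier
```

Only queries that miss in both the results cache and the pruned index reach the back-end servers, which is where the load reduction comes from.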

  7. Original query stream (all queries) vs. query stream after the results cache (misses)

  8. All queries vs. Misses: Experimental setup • Real query log to test the results cache and generate a “miss-log” • 185M queries from yahoo.co.uk • Results cache policy: LRU • Example: original query log (all queries): Q1: britney spears, Q2: sigir 2007, Q3: britney spears, Q4: sigir 2008 • Q3 is a hit in the results cache; Q1, Q2 and Q4 are misses • “Miss-log” (misses): Q1: britney spears, Q2: sigir 2007, Q4: sigir 2008
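The miss-log generation on slide 8 can be sketched as replaying the query log through an LRU cache and recording every query that misses. This is an illustrative sketch; the function name `miss_log` and the `capacity` parameter (counted in cached queries) are assumptions, and the slides do not say how the real cache was sized or keyed.

```python
from collections import OrderedDict


def miss_log(queries, capacity):
    """Replay a query log through an LRU results cache; return the misses in order."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = []
    for q in queries:
        if q in cache:
            # Hit: mark the query as most recently used.
            cache.move_to_end(q)
        else:
            # Miss: record it, then cache the query.
            misses.append(q)
            cache[q] = True
            if len(cache) > capacity:
                # Evict the least recently used entry.
                cache.popitem(last=False)
    return misses
```

Replaying the four example queries from the slide with any capacity of at least two leaves Q1, Q2 and Q4 in the miss-log, because the second "britney spears" is a hit.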

  9. All queries vs. Misses: Number of terms in a query • Average number of terms: all queries = 2.4, misses = 3.2 • Most single-term queries are hits in the results cache • Queries with many terms are unlikely to be hits

  10. All queries vs. Misses: Query result size distribution • 2000 queries randomly selected from each log (all queries and misses): • Avg. result size for misses is ~100 times smaller than for all queries • Approx. half of the misses return fewer than 5000 results – SMALL! • Similar results with a “small” UK document collection (78M)

  11. All queries vs. Misses: Term popularity distribution • Each point -> avg. popularity of 1000 consecutive terms • The order of terms for misses is the same as for all queries • Terms which were popular before the results cache remain popular after • Log sizes: 185M – all queries, 41M – misses

  12. Static index pruning

  13. Static pruned index • Smaller version of the main index, returns: • the top-k response that is the same as the main index’s, or • a miss otherwise. • Assumes Boolean query processing • Types of pruning: • Term pruning – full posting lists for selected terms • Document pruning – truncated posting lists • Term+Document pruning – combination of both [Diagram: posting lists of terms t1–t4 shown for the full index, term pruning, document pruning and T+D pruning]
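The three pruning types on slide 13 can be illustrated on a toy inverted index. This sketch assumes the index is a dict from term to a posting list already sorted by descending score; the function names and the parameters `keep_terms` and `max_len` are hypothetical and not taken from the slides.

```python
def term_prune(index, keep_terms):
    """Term pruning: keep the full posting lists, but only for selected terms."""
    return {t: pl for t, pl in index.items() if t in keep_terms}


def document_prune(index, max_len):
    """Document pruning: keep every term, but truncate each posting list."""
    return {t: pl[:max_len] for t, pl in index.items()}


def term_document_prune(index, keep_terms, max_len):
    """Term+Document pruning: truncated posting lists for selected terms only."""
    return document_prune(term_prune(index, keep_terms), max_len)
```

Term pruning removes whole rows of the index, document pruning shortens every row, and the combination does both.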

  14. Term Pruning: Performance • Term pruning based on profit(t)=popularity(t)/df(t) • Answers a query if all query terms are in the pruned index • Performs well for all queries • For misses as well: e.g., can process almost 50% of the queries with 25% of the index • Collection: UK document collection, 78M documents
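The profit-based selection on slide 14 can be sketched as follows. The formula profit(t) = popularity(t)/df(t) and the "all query terms present" rule come from the slide; the greedy fill against a postings budget, the names `select_terms` and `can_answer`, and the `budget` parameter are assumptions, since the slides do not describe how terms are packed into the pruned index.

```python
def select_terms(popularity, df, budget):
    """Pick terms by decreasing profit(t) = popularity(t) / df(t).

    popularity: dict term -> how often the term occurs in the query log.
    df:         dict term -> document frequency (posting list length), assumed > 0.
    budget:     maximum total number of postings in the pruned index (assumption).
    """
    ranked = sorted(df, key=lambda t: popularity.get(t, 0) / df[t], reverse=True)
    kept, used = set(), 0
    for t in ranked:
        # Greedy fill: take the term only if its full posting list still fits.
        if used + df[t] <= budget:
            kept.add(t)
            used += df[t]
    return kept


def can_answer(query_terms, kept):
    """A term-pruned index answers a query only if it holds every query term."""
    return all(t in kept for t in query_terms)
```

A term that is popular in queries but has a short posting list is cheap to keep and therefore ranks high, while a very frequent term such as a stop word costs many postings per unit of popularity.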

  15. Results Caching + Term Pruning • Results caching performance is independent of the collection size • Results cache capacity is up to 10% of the full index size

  16. Term pruning: Frequent terms in misses • MinDF (df of the least frequent query term) correlates to the result size • MaxDF (df of the most frequent query term) is high for most of the misses • Many misses contain at least one frequent term • => the term pruned index has to include large posting lists [Diagram: example query “Gleb Flavio Vassilis Ricardo” with one posting list per term; the shortest list gives MinDF, the longest gives MaxDF]
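MinDF and MaxDF from slide 16 are simply the smallest and largest document frequency among the terms of a query. A small sketch, with the hypothetical helper `min_max_df` and a dict `df` standing in for the index statistics:

```python
def min_max_df(query_terms, df):
    """Return (MinDF, MaxDF): df of the least and most frequent query term.

    df: dict term -> document frequency; unknown terms count as 0 (assumption).
    """
    dfs = [df.get(t, 0) for t in query_terms]
    return min(dfs), max(dfs)
```

Under Boolean AND processing the result is an intersection of posting lists, so its size cannot exceed MinDF, which is consistent with the correlation noted on the slide. MaxDF, in turn, is the longest posting list a term-pruned index must store in full to answer the query.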

  17. Document pruning • Based on Fagin’s top-k intersection algorithm • Keeps postings with high scores only: • Sufficient to compute top-k results for some queries • Determining correctness of the result requires computing a score threshold – LATENCY! [Diagram: posting lists for t1, t2 and t3, sorted by score; top-2 results: D1, D2; score threshold: s(D2,t1)+s(D1,t2)+s(D2,t3)]
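The threshold check on slide 17 can be sketched on truncated posting lists. The threshold below is the sum of the last retained score of each list, as in the slide's example. The extra upper bound for documents seen in only some of the truncated lists is a conservative addition in the style of no-random-access top-k algorithms; it is an assumption of this sketch, not something stated on the slides. The function name `top_k_from_pruned`, the additive scoring and the non-empty lists are likewise assumptions.

```python
def top_k_from_pruned(truncated, k):
    """Return the top-k doc ids if provably correct, else None (a miss).

    truncated: dict term -> non-empty list of (doc_id, score) pairs, sorted by
               descending score and cut off at some length.
    The query score of a document is assumed to be the sum of its per-term scores.
    """
    # Last retained score per term: no pruned posting can score higher than this.
    cut = {t: pl[-1][1] for t, pl in truncated.items()}

    # Collect the known per-term scores of every document seen in the pruned lists.
    seen = {}
    for t, pl in truncated.items():
        for doc, score in pl:
            seen.setdefault(doc, {})[t] = score

    # Threshold for documents that appear in none of the truncated lists.
    bound = sum(cut.values())

    complete = []
    for doc, scores in seen.items():
        if len(scores) == len(truncated):
            # Present in every truncated list: the exact score is known.
            complete.append((sum(scores.values()), doc))
        else:
            # Partially seen: bound each missing score by that list's cut score.
            bound = max(bound, sum(scores.get(t, cut[t]) for t in truncated))

    complete.sort(key=lambda pair: (-pair[0], pair[1]))
    if len(complete) >= k and complete[k - 1][0] >= bound:
        return [doc for _, doc in complete[:k]]
    return None
```

The check has to run for every query sent to the document-pruned index, which is the latency cost the slide points out.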

  18. Document pruning: Experimental setup • Scoring function: • pr(d) – query-independent score of the document d (pagerank) • ω, k – normalization constants: • ω ∈ {0, 10, 20} • k = 1 • We try different values of PLLmax (maximum Posting List Length) and choose the one that maximizes the hit rate • We only look at the upper bound for the hit rate: are the original top-10 results found in the top portions of all PLs?
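The upper-bound test on slide 18 can be sketched as a membership check: a query counts as a potential hit if every document of the original top-10 appears within the first PLLmax postings of every query term's list. The function name `is_potential_hit` and the representation of posting lists as lists of doc ids sorted by descending score are assumptions for illustration.

```python
def is_potential_hit(top_results, posting_lists, pll_max):
    """Upper-bound hit test for a document-pruned index.

    top_results:   doc ids of the original top-k answer from the main index.
    posting_lists: one list of doc ids per query term, sorted by descending score.
    pll_max:       maximum posting list length kept by document pruning.
    """
    for pl in posting_lists:
        retained = set(pl[:pll_max])  # the top portion that survives pruning
        if not all(doc in retained for doc in top_results):
            return False
    return True
```

This is only an upper bound because finding the documents in the retained portions does not yet prove that the pruned index could certify the result as correct.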

  19. Document pruning: performance • Doc. pruning needs high pagerank weights • It performs better for All queries than for Misses

  20. Term+Document pruning: performance • T+D pruning is the best but expensive (high latency) • profit2 is better than profit1 • Improvement is marginal for misses unless the pagerank weight is very high

  21. Conclusions • Results caching: • delivers good hit rates with a constant capacity • but hit rate is limited because of singletons • Index pruning: • has no limit on hit rate, • but the pruned index size grows with the doc. collection – more expensive • Static index pruning: addition to results caching, not replacement • Term pruning performs well for misses also => “compatible” with results cache • Document pruning: all queries – OK, misses – only with high pagerank weights • Term+Document pruning slightly improves over document pruning • Lesson learned: Important to consider the interaction between the components

  22. Thank you • Questions?
