
Nov 11, 2006

P2PIR’2006, co-located with CIKM’06, Arlington VA, USA. Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks. Nov 11, 2006. EPFL, Ecole Polytechnique Fédérale de Lausanne, Switzerland.



Presentation Transcript


  1. P2PIR’2006, co-located with CIKM’06, Arlington VA, USA. Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks. Nov 11, 2006. EPFL, Ecole Polytechnique Fédérale de Lausanne, Switzerland. Gleb Skobeltsyn, Karl Aberer. P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer

  2. Problem definition
     • Given a document corpus stored in a DHT-based P2P network
     • Provide an efficient indexing mechanism that finds the matching documents for a multi-term query
     • Traffic consumption is to be minimized
     • The storage space provided by peers is limited
     • Existing solutions: broadcast, naïve indexing of terms, HDK…

  3. How does the naïve approach work? (1)
     • Naïve approach 1: store the terms’ inverted lists in a DHT
     • An inverted list contains document ids
     [Figure: DHT peers holding key/inverted-list (K, I) tables with entries (h(T1), {I1,I2}), (h(T2), {I2,I3}), (h(T3), {I4,I5}). Query: “T1 AND T2”. Messages labeled {I1,I2} and {I2}.]
     This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor

  4. How does the naïve approach work? (2)
     • Naïve approach 2: store the terms’ inverted lists in a DHT
     • An inverted list contains document summaries
     [Figure: DHT peers holding key/inverted-list (K, I) tables with entries (h(T1), {I1,I2}), (h(T2), {I2,I3}), (h(T3), {I4,I5}). Query: “T1 AND T2”. Messages labeled {I2} and {I2}, joined by “OR”.]
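
The naïve term index from slides 3 and 4 can be sketched in a few lines. This is a minimal single-process illustration, not the authors' code: the class `NaiveTermIndex`, the hash-modulo peer assignment, and the `shipped` traffic counter are all assumptions made for the example. It uses the slide's data, where T1 maps to {I1, I2} and T2 to {I2, I3}.

```python
import hashlib


def h(term: str, num_peers: int) -> int:
    """Map a term to a peer number (a stand-in for real DHT routing)."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_peers


class NaiveTermIndex:
    """Hypothetical per-term inverted index spread over simulated peers."""

    def __init__(self, num_peers: int):
        self.num_peers = num_peers
        # One dict per peer: term -> set of document ids.
        self.peers = [dict() for _ in range(num_peers)]

    def publish(self, doc_id: str, terms) -> None:
        for term in terms:
            peer = self.peers[h(term, self.num_peers)]
            peer.setdefault(term, set()).add(doc_id)

    def inverted_list(self, term: str) -> set:
        return set(self.peers[h(term, self.num_peers)].get(term, set()))

    def query_and(self, terms):
        """Intersect inverted lists; needs at least one term.

        Returns the result set and the number of document ids shipped
        between peers (an assumed, simplified traffic measure).
        """
        lists = sorted((self.inverted_list(t) for t in terms), key=len)
        result = set(lists[0])
        shipped = len(result)          # the shortest list is sent first
        for nxt in lists[1:]:
            result &= nxt              # intersect at the next peer
            shipped += len(result)     # forward the intermediate result
        return result, shipped


index = NaiveTermIndex(num_peers=8)
docs = {"I1": ["T1"], "I2": ["T1", "T2"], "I3": ["T2"],
        "I4": ["T3"], "I5": ["T3"]}
for doc_id, doc_terms in docs.items():
    index.publish(doc_id, doc_terms)
result, shipped = index.query_and(["T1", "T2"])
```

The point of the sketch is the traffic term: whole inverted lists travel between peers, which is what the following slides try to avoid.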

  5. Can we do better?
     • Inverted lists can be very large => they consume traffic
     • Indexing all/selected terms in all documents => huge redundancy in the index, space limitations
     • Indexing term combinations => how to choose them?
     • Many index items are never or very rarely used
     • Our idea:
       • Indexing = caching
       • Efficiently fill the available (distributed) storage space with result sets for popular queries
       • Use the stored caches to answer queries

  6. What is our idea?
     • Conventionally, the index is generated purely from the data
     • This leaves a very large number of unused index entries
     Let us use the query popularity distribution by gathering statistics!
     • We try to build an index specifically targeted at the current query log
     • The size of the index is bounded by the available storage provided by peers
     • Everything that is not indexed is searched via broadcast

  7. What is our idea? Another explanation
     • Given a set of documents, each document contains a set of terms
     • We have an inverted index over all extracted terms: {key=h(term)} -> {inverted list}
     • We can monitor query load statistics
     Documents: D1: T1, T2, T3; D2: T1, T4, T5; D3: T1, T2, T6; D4: T5, T6, T7; D5: T1, T8, T9
     Search keys: T1, T2, T3, T4, T5, T6, T7, T8, T9 (for example, the inverted list of T1 is D1, D2, D3, D5)
     Query popularity: T1 & T2: very high; T3: high; T3 & T4: high; T7: low; T8 & T9: very low
     [Figure annotations: index term combinations (queries), e.g. T1&T2 -> D1, D3; delete unused index entries]

  8. Idea: example
     [Figure with two panels: Data, Index]

  9. Query-driven indexing structure
     What are we searching for?
     [Figure labels: Index all data (unused index items?); Cache all queries (query subsumption?)]

  10. Contents
     • Motivation & Idea
     • Query subsumption
     • Optimization problem
     • DCT’s indexing and caching strategy:
     • Meta-index
     • Cache management
     • Top-K caching
     • Load Balancing
     • Evaluations
     • Conclusions

  11. Query subsumption
     • Given a query q, we are interested in locating at least one cache for a query q’ such that RS(q’) contains RS(q)
     • Query subsumption: q’ subsumes q if all terms of q’ are contained in q. It follows that RS(q’) contains RS(q)
     • Subsumption can be demonstrated on a lattice of size 2^m − 1, where m is the number of terms
     [Figure: query subsumption in the lattice if a and cd are cached]
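
The subsumption test and the lattice are easy to check on a toy vocabulary. The sketch below assumes four terms a, b, c, d (the slide does not state the size of the lattice in its figure) and the cached queries a and cd named on the slide; the function names are hypothetical.

```python
from itertools import combinations


def subsumes(cached_query, query) -> bool:
    """q' subsumes q if every term of q' also appears in q."""
    return frozenset(cached_query) <= frozenset(query)


def lattice(terms):
    """All non-empty term combinations: 2^m - 1 queries for m terms."""
    terms = sorted(terms)
    return [frozenset(c)
            for r in range(1, len(terms) + 1)
            for c in combinations(terms, r)]


def answerable(terms, cached):
    """Queries in the lattice that at least one cached query subsumes."""
    return [q for q in lattice(terms)
            if any(subsumes(c, q) for c in cached)]


all_queries = lattice("abcd")
cached = [frozenset("a"), frozenset("cd")]
covered = answerable("abcd", cached)
```

Under these assumptions, two small caches cover 10 of the 15 queries in the lattice: the 8 queries containing a, plus cd and bcd.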

  12. Optimization problem
     • A vocabulary T = {t1, t2, …, tm}: all terms in the query load
     • A query q = {t1, t2, …, tn}: q ∈ 2^T
     • A document d = {t1, t2, …, tr}: d ∈ 2^T
     • A query load L = {q1, q2, …, ql}: qi ∈ 2^T
     • p(qi): probability, |RS(qi)|: result set size for qi in L
     • A cache hit function:
       • cachehit(q) = 1 if there exists a cached query q’ subsuming q
       • cachehit(q) = 0 otherwise
     • Problem: find a set of cached queries Ω such that
       • Ω = argmax Σ_{qi ∈ L} cachehit(qi) · p(qi)
       • subject to the storage constraint S_Ω = Σ_{qi ∈ Ω} |RS(qi)| < S0
     A document d is a valid answer for a query q <=> d contains q
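
To make the objective and the storage constraint concrete, here is a brute-force sketch on a tiny, made-up query load. The probabilities, result set sizes, and the budget `S0 = 20` are illustrative assumptions, and exhaustive search is used only because the instance is tiny; it is not the strategy DCT uses.

```python
from itertools import combinations


def cachehit(q, omega) -> int:
    """1 if some cached query in omega subsumes q, else 0."""
    return 1 if any(c <= q for c in omega) else 0


def expected_hit(omega, load) -> float:
    """Objective: sum of cachehit(qi) * p(qi) over the query load."""
    return sum(cachehit(q, omega) * p for q, p in load.items())


def storage(omega, rs_size) -> int:
    """Storage used: sum of |RS(qi)| over the cached queries."""
    return sum(rs_size[c] for c in omega)


def best_cache_set(load, rs_size, s0):
    """Exhaustively search all cache sets with storage < s0."""
    candidates = list(rs_size)
    best, best_score = frozenset(), 0.0
    for r in range(1, len(candidates) + 1):
        for combo in combinations(candidates, r):
            if storage(combo, rs_size) < s0:
                score = expected_hit(combo, load)
                if score > best_score:
                    best, best_score = frozenset(combo), score
    return best, best_score


Q = frozenset  # shorthand: Q("ab") is the query {a, b}
# Assumed toy query load: query -> probability p(q).
load = {Q("ab"): 0.4, Q("abc"): 0.2, Q("cd"): 0.3, Q("d"): 0.1}
# Assumed result set sizes |RS(q)| for the cache candidates.
rs_size = {Q("a"): 50, Q("ab"): 10, Q("abc"): 3, Q("cd"): 5, Q("d"): 40}
omega, score = best_cache_set(load, rs_size, s0=20)
```

With these numbers the search picks the caches for ab and cd: together they use 15 records and serve ab, abc, and cd, while d is left to broadcast.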

  13. DCT: Indexing and caching strategy
     • DCT caches result sets of certain queries without constraining physical cache locations
     • Each peer runs two services:
       • Meta-index service: stores index items with cache locations
       • Caching service: answers a query from a cache
     • Meta-index: given a query q, finds a list of cache locations capable of answering q
     • Cache service: returns the result set for q from the q’ cache (q’ subsumes q)
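
One way a cache service could answer q from the cache of a subsuming query q’ is to filter the cached result set. The sketch below assumes that each cached result carries the term set of its document, which the slides do not specify; `answer_from_cache` is a hypothetical name. The data is the document table from slide 7.

```python
def answer_from_cache(query, cached_query, cached_results):
    """Filter RS(q') down to RS(q), where q' subsumes q.

    cached_results maps document id -> set of terms in that document
    (an assumed representation of a cached result set).
    """
    q, q_cached = frozenset(query), frozenset(cached_query)
    if not q_cached <= q:
        raise ValueError("cached query does not subsume the query")
    # A document is a valid answer for q if it contains all terms of q.
    return {doc_id for doc_id, terms in cached_results.items() if q <= terms}


# Cache for the query T1, built from the documents on slide 7.
cache_t1 = {
    "D1": {"T1", "T2", "T3"},
    "D2": {"T1", "T4", "T5"},
    "D3": {"T1", "T2", "T6"},
    "D5": {"T1", "T8", "T9"},
}
rs = answer_from_cache({"T1", "T2"}, {"T1"}, cache_t1)
```

Here the cache for T1 yields D1 and D3 for the query T1 & T2, matching the result shown on slide 7.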

  14. DCT: Meta-index
     • The meta-index is based on the standard DHT indexing functionality
     • Index update: if a peer π caches a query q, it advertises the cache availability in the meta-index: it inserts a tuple {q -> address(π)} at the peer responsible for a random term from q
     • Lookup: if a query q = t1 & t2 & … & tn is submitted, every peer responsible for t1, t2, …, tn is asked to provide the set of caches it indexes that subsume q. One of them (if any) is chosen randomly

  15. DCT: Meta-index example
     q = “acd” is submitted at πorig
     • πorig looks up the meta-index: it contacts peers πa, πc and πd*
     • πa, πc and πd respond with the known locations of caches subsuming q
     • πorig randomly selects a cache from the obtained list. Assume “cd” is picked
     • RS(q) is sent to πorig
     * interactions with πd are not shown
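
The meta-index update and lookup can be sketched as follows. This is a single-process illustration under stated assumptions: the dict `by_term` stands in for the DHT peers responsible for each term, and the class name, peer addresses, and the extra cached query bc are made up for the example.

```python
import random


class MetaIndex:
    """Hypothetical in-memory stand-in for the DHT-based meta-index."""

    def __init__(self, rng=None):
        # term -> list of (cached query, peer address); each key plays
        # the role of the peer responsible for that term.
        self.by_term = {}
        self.rng = rng or random.Random()

    def advertise(self, query, address) -> None:
        """Insert {q -> address} at the peer of a random term of q."""
        term = self.rng.choice(sorted(query))
        self.by_term.setdefault(term, []).append((frozenset(query), address))

    def lookup(self, query):
        """Ask the peer of every term of q for caches that subsume q."""
        q = frozenset(query)
        found = []
        for term in sorted(q):
            found.extend((cached, addr)
                         for cached, addr in self.by_term.get(term, [])
                         if cached <= q)
        return found

    def pick(self, query):
        """Choose one subsuming cache at random, or None on a miss."""
        found = self.lookup(query)
        return self.rng.choice(found) if found else None


meta = MetaIndex(random.Random(0))
meta.advertise({"c", "d"}, "peer-7")
meta.advertise({"a"}, "peer-3")
meta.advertise({"b", "c"}, "peer-9")
locations = meta.lookup({"a", "c", "d"})
```

Advertising under a single random term is enough for lookups to be complete: any cached q’ that subsumes q has all its terms in q, so the peer holding its advertisement is always among the peers contacted for q.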

  16. DCT: Cache Management
     • Each peer provides some storage space s0 for caches
     • Caches with low profits are evicted: profit(q) = popularity(q) / (|RS(q)| + 1)
     • Every time a peer has to broadcast a query, it tries to cache it
     • The query q with the result set size |RS(q)| is cached if either:
       • there is enough free space to store |RS(q)|, or
       • there is NOT enough free space, but the least profitable caches can be dropped to fit the q cache
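
The admission and eviction rule can be sketched as below. The profit formula is taken from the slide; the requirement that evicted caches have a strictly lower profit than the incoming one is an assumption of this sketch, since the slide only says that the least profitable caches can be dropped. `PeerCache` and its methods are hypothetical names.

```python
class PeerCache:
    """Hypothetical per-peer cache with profit-based eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity   # storage space s0, in records
        self.entries = {}          # query -> (result set, popularity)

    @staticmethod
    def profit(popularity: float, rs_size: int) -> float:
        return popularity / (rs_size + 1)

    def used(self) -> int:
        return sum(len(rs) for rs, _ in self.entries.values())

    def try_cache(self, query, result_set, popularity) -> bool:
        size = len(result_set)
        if size > self.capacity:
            return False
        new_profit = self.profit(popularity, size)
        free = self.capacity - self.used()
        victims = []
        # Walk existing caches from least to most profitable.
        by_profit = sorted(
            self.entries.items(),
            key=lambda kv: self.profit(kv[1][1], len(kv[1][0])))
        for cached_query, (rs, pop) in by_profit:
            if free >= size:
                break
            # Assumption: never evict a cache at least as profitable.
            if self.profit(pop, len(rs)) >= new_profit:
                return False
            victims.append(cached_query)
            free += len(rs)
        if free < size:
            return False
        for cached_query in victims:
            del self.entries[cached_query]
        self.entries[frozenset(query)] = (set(result_set), popularity)
        return True


cache = PeerCache(capacity=10)
ok_ab = cache.try_cache({"a", "b"}, {f"D{i}" for i in range(6)}, popularity=12)
ok_cd = cache.try_cache({"c", "d"}, {f"D{i}" for i in range(4)}, popularity=2)
ok_ef = cache.try_cache({"e", "f"}, {f"D{i}" for i in range(3)}, popularity=8)
ok_gh = cache.try_cache({"g", "h"}, {f"D{i}" for i in range(5)}, popularity=1)
```

In this run the cache for ef displaces the low-profit cache for cd, while gh is rejected because fitting it would require dropping a more profitable cache.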

  17. DCT: Top-K caching
     • Problem:
       • A popular query q with a large result set might NOT be cached, as its profit is relatively low
     • Solution:
       • Introduce a top-k cache:
       • It can serve only q, no subsumption
       • But it consumes little space and avoids broadcasting the popular q
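
A top-k cache can be sketched as an exact-match table holding only the k best results per query. The relevance scores and the class `TopKCache` are assumptions for illustration; the slides do not say how results are ranked.

```python
import heapq


class TopKCache:
    """Hypothetical exact-match cache storing the k best results."""

    def __init__(self, k: int):
        self.k = k
        self.entries = {}  # query -> list of (doc id, score), best first

    def store(self, query, scored_docs) -> None:
        top = heapq.nlargest(self.k, scored_docs, key=lambda d: d[1])
        self.entries[frozenset(query)] = top

    def answer(self, query):
        # Exact match only: a truncated list cannot be filtered into a
        # complete answer for a more specific query.
        return self.entries.get(frozenset(query))


topk = TopKCache(k=2)
topk.store({"a"}, [("D1", 0.2), ("D2", 0.9), ("D3", 0.5)])
hit = topk.answer({"a"})
miss = topk.answer({"a", "b"})
```

Because the stored list is truncated, it may omit documents that match a more specific query, which is why such a cache cannot take part in subsumption.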

  18. Evaluation

  19. Evaluations: query load and data
     • Source data:
       • English Wikipedia XML dump (6 GB), May 2006
       • Two Wikipedia query traces from August and September 2004
     • Query load properties (August trace):
       • 1.3M unique queries, asked 4.6M times during the month
       • 500K repeated at least twice, 800K asked only once
       • 225K unique terms in both traces (after stemming)
       • Average number of terms in a query = 2.6
     • Java simulation:
       • Simulates a number of virtual peers
       • Each peer provides 200K records of storage space

  20. Evaluations: how much storage do we need?
     [Figure annotations: 98% max cache hit with unlimited storage; 81% max cache hit with unlimited storage but no subsumption]

  21. Evaluations: Traffic consumption
     • 100 peers, 200K each
     • Converges to 85% cache hit with 100 x 200K = 20M records of global cache capacity
     • The naïve approach requires at least 240M records for the term index (if built for query load terms only)

  22. Evaluations: stress test
     • 300 peers, 200K each
     • Converges to 97% cache hit with 300 x 200K = 60M records of capacity
     • Very small drop in cache hit when the query load changes, thanks to subsumption

  23. Evaluations: load balancing
     • Cache imbalance => only a few peers are overloaded
     • Meta-index imbalance => has less impact and can be partially avoided

  24. Conclusions
     • Distributed Cache Table: a (quite) large-scale distributed cache for P2P IR applications based on both:
       • Query load
       • Data distribution
     • Properties:
       • Efficiently utilizes and adapts to the available storage space
       • Trades off a huge index size against extra traffic costs for broadcasting rare queries
       • Subsumption is important: it makes the cache resilient to query load changes
       • Sufficiently load balanced
       • Requires 1-2 orders of magnitude less traffic than the naïve approach
       • Requires substantially less storage than a per-term index

  25. Last slide
     Thank you for your attention! Questions?
