Nov 11, 2006

P2PIR’2006, collocated with CIKM’06, Arlington VA, USA
Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks
Gleb Skobeltsyn, Karl Aberer
EPFL, Ecole Polytechnique Fédérale de Lausanne, Switzerland
Nov 11, 2006
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer

Problem definition
• Given a document corpus stored in a DHT P2P network
• Provide an efficient indexing mechanism to find matching documents given a multi-term query
• Traffic consumption is to be minimized
• The storage space provided by peers is limited
• Solutions: broadcast, naïve indexing of terms, HDK…

How does the naïve approach work? (1)
• Naïve approach 1: store terms’ inverted lists in a DHT
• An inverted list contains document IDs.
[Figure: DHT peers, each holding a K/I table, with the entries (h(T1), {I1,I2}), (h(T2), {I2,I3}) and (h(T3), {I4,I5}). Query: “T1 AND T2”. Messages shown: {I1,I2} and {I2}.]
This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor

How does the naïve approach work? (2)
• Naïve approach 2: store terms’ inverted lists in a DHT
• An inverted list contains document summaries.
[Figure: DHT peers, each holding a K/I table, with the entries (h(T1), {I1,I2}), (h(T2), {I2,I3}) and (h(T3), {I4,I5}). Query: “T1 AND T2”. Labels shown: {I2}, {I2}, OR.]

Can we do better?
• Inverted lists can be very large => consume traffic
• Indexing of all/selected terms in all documents => huge redundancy in the index, space limitations
• Indexing of term combinations => how to choose them?
• Many index items are never or very rarely used.
• Our idea:
  • Indexing = caching
  • Efficiently fill the available (distributed) storage space with result sets for popular queries
  • Use stored caches to answer queries

What is our idea?
• Conventionally, an index is generated purely from the data
• This leaves a very large number of unused index entries
Let us use the query popularity distribution by gathering statistics!
• We try to build an index specifically targeted at the current query log
• The size of the index is bounded by the available storage provided by peers
• Everything that is not indexed is searched via broadcast

What is our idea? Another explanation
• Given a set of documents, each document contains a set of terms
• We have an inverted index over all extracted terms: {key=h(term)} – {inverted list}
• We can monitor Query Load statistics:

Documents:
D1: T1, T2, T3
D2: T1, T4, T5
D3: T1, T2, T6
D4: T5, T6, T7
D5: T1, T8, T9

Query popularity:
T1 & T2: very high
T3: high
T3 & T4: high
T7: low
T8 & T9: very low

[Figure: search keys T1 to T9 with their inverted lists (for example T1: D1, D2, D3, D5). Annotations: "Index term combinations (queries)", with the entry T1&T2: D1, D3; and "Delete unused index entries".]

Idea: example
[Figure with two panels: Data, Index]

Query-driven indexing structure
[Figure labels: What are we searching for?; Index all data; Cache all queries; Query subsumption?; Unused index items?]

Contents
• Motivation & Idea
• Query subsumption
• Optimization problem
• DCT’s indexing and caching strategy:
  • Meta-index
  • Cache management
  • Top-K caching
• Load Balancing
• Evaluations
• Conclusions

Query subsumption
• Given a query q, we are interested in locating at least one cache for a query q’ such that RS(q’) contains RS(q)
• Query subsumption: q’ subsumes q if all terms of q’ are contained in q. That means RS(q’) contains RS(q).
• We can demonstrate subsumption on a lattice of size $2^m - 1$, where m is the number of terms
[Figure: term lattice showing query subsumption if a and cd are cached]
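The subsumption rule can be sketched in a few lines of Python. This is only an illustration: the function names (`subsumes`, `lattice`, `answerable`) and the four-term vocabulary "abcd" are assumptions made for the example, not part of DCT. It enumerates the lattice of non-empty term subsets and marks the nodes that the caches for a and cd can answer.

```python
from itertools import combinations

def subsumes(q_prime, q):
    """q' subsumes q when every term of q' also appears in q."""
    return frozenset(q_prime) <= frozenset(q)

def lattice(terms):
    """All non-empty subsets of `terms`: 2^m - 1 nodes for m terms."""
    terms = sorted(terms)
    return [frozenset(c)
            for r in range(1, len(terms) + 1)
            for c in combinations(terms, r)]

def answerable(cached, terms):
    """Lattice nodes subsumed by at least one cached query."""
    return [q for q in lattice(terms)
            if any(subsumes(c, q) for c in cached)]

# Hypothetical setup: vocabulary {a, b, c, d}, caches for "a" and "cd".
cached = [frozenset("a"), frozenset("cd")]
nodes = lattice("abcd")
hits = answerable(cached, "abcd")
print(len(nodes), len(hits))  # 15 10
```

With four terms the lattice has 15 nodes; two caches cover 10 of them, which is why subsumption raises the hit rate well beyond exact-match caching.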

Optimization problem
• A vocabulary $T = \{t_1, t_2, \ldots, t_m\}$: all terms in the query load.
• A query $q = \{t_1, t_2, \ldots, t_n\}$: $q \in 2^T$
• A document $d = \{t_1, t_2, \ldots, t_r\}$: $d \in 2^T$
• A query load $L = \{q_1, q_2, \ldots, q_l\}$: $q_i \in 2^T$
• $p(q_i)$: probability, $|RS(q_i)|$: result set size for $q_i \in L$
• A cache hit function:
  • $\mathrm{cachehit}(q) = 1$ if there exists a cached query $q'$ subsuming $q$;
  • $\mathrm{cachehit}(q) = 0$ otherwise.
• Problem: find a set of cached queries $\Omega$ such that
  • $\Omega = \arg\max \sum_{q_i \in L} \mathrm{cachehit}(q_i) \cdot p(q_i)$
  • subject to the storage constraint $S_\Omega = \sum_{q_i \in \Omega} |RS(q_i)| < S_0$
A document $d$ is a valid answer for a query $q$ $\Leftrightarrow$ $d$ contains $q$
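To make the objective concrete, the following sketch evaluates it on a toy instance by exhaustive search. It is not how DCT chooses caches (DCT uses the online profit-based strategy described later); the query load, probabilities, result set sizes and the budget `s0` are invented for illustration, and candidate caches are restricted to queries from the load.

```python
from itertools import combinations

def cachehit(q, cached):
    """1 if some cached query q' subsumes q (q' is a subset of q), else 0."""
    return 1 if any(c <= q for c in cached) else 0

def hit_rate(load, cached):
    """Objective: sum over the load of cachehit(q_i) * p(q_i)."""
    return sum(p * cachehit(q, cached) for q, p in load.items())

def best_cache_set(load, rs_size, s0):
    """Brute force over subsets of candidate queries (tiny inputs only)."""
    candidates = list(rs_size)
    best, best_score = frozenset(), 0.0
    for r in range(1, len(candidates) + 1):
        for omega in combinations(candidates, r):
            # Storage constraint from the slide: S_Omega < S_0 (strict).
            if sum(rs_size[q] for q in omega) >= s0:
                continue
            score = hit_rate(load, omega)
            if score > best_score:
                best, best_score = frozenset(omega), score
    return best, best_score

# Hypothetical toy instance: query -> probability, query -> |RS(query)|.
AB, ABC, CD, D = (frozenset(s) for s in ("ab", "abc", "cd", "d"))
load = {AB: 0.5, ABC: 0.2, CD: 0.2, D: 0.1}
rs_size = {AB: 4, ABC: 1, CD: 3, D: 6}

omega, score = best_cache_set(load, rs_size, s0=8)
print(sorted("".join(sorted(q)) for q in omega), round(score, 3))
```

In this instance caching ab also serves abc through subsumption, so the budget is better spent on ab and cd than on the cheap but redundant abc. The search is exponential in the number of candidates, which is one reason a practical system needs a heuristic.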

DCT: Indexing and caching strategy
• DCT caches result sets of certain queries without constraining physical cache locations
• Each peer runs two services:
  • Meta-index service: stores index items with cache locations
  • Caching service: answers a query from a cache
• Meta-index: given a query q, finds a list of cache locations capable of answering q.
• Cache service: returns the result set for q from the q’ cache (q’ subsumes q).

DCT: Meta-index
• The meta-index is based on the standard DHT indexing functionality.
• Index update: if a peer π caches a query q, it advertises the cache availability in the meta-index: it inserts a tuple {q -> address(π)} at the peer responsible for a random term from q.
• Lookup: if a query q = t1 & t2 & … & tn is submitted, every peer responsible for one of t1, t2, …, tn is asked to provide the set of caches it indexes that subsume q. One of them (if any) is chosen randomly.
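A minimal in-memory sketch of the update and lookup steps follows. It makes several simplifying assumptions: a Python dictionary keyed by term stands in for "the peer responsible for a term" (a real DHT would route on h(term)), and the class name `MetaIndex`, its methods and the peer addresses are hypothetical. Note why asking only the peers for the terms of q suffices: any cached q’ that subsumes q contains only terms of q, so the random term it was advertised under is one of them.

```python
import random

class MetaIndex:
    """Toy stand-in for the DHT-based meta-index (one dict entry per term peer)."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.peers = {}  # term -> list of (cached query, cache address)

    def advertise(self, query, address):
        """Index update: store {q -> address} at the peer of a random term of q."""
        q = frozenset(query)
        term = self.rng.choice(sorted(q))
        self.peers.setdefault(term, []).append((q, address))
        return term

    def lookup(self, query):
        """Ask the peer of every term of q for caches subsuming q; pick one."""
        q = frozenset(query)
        found = []
        for term in sorted(q):
            for cached, address in self.peers.get(term, []):
                if cached <= q:  # cached query subsumes q
                    found.append((cached, address))
        return self.rng.choice(found) if found else None

# Hypothetical usage: three caches advertised, then two lookups.
index = MetaIndex(random.Random(0))
index.advertise("cd", "peer-7")
index.advertise("a", "peer-3")
index.advertise("ab", "peer-9")

hit = index.lookup("acd")   # either the "cd" cache or the "a" cache
miss = index.lookup("bd")   # no cached query subsumes "bd"
print(hit, miss)
```

A miss here corresponds to the case where the query must be broadcast.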

DCT: Meta-index example
q = “acd” is submitted at π_orig
• π_orig looks up the meta-index: contacts peers π_a, π_c and π_d*
• π_a, π_c and π_d respond with known locations of caches subsuming q
• π_orig randomly selects a cache from the obtained list. Assume “cd” is picked.
• RS(q) is sent to π_orig
* Interactions with π_d are not shown

DCT: Cache Management
• Each peer provides some storage space s_0 for caches
• Caches with low profits are evicted: profit(q) = popularity(q) / (|RS(q)| + 1)
• Every time a peer has to broadcast a query, it tries to cache it
• The query q with the result set size |RS(q)| is cached if:
  • there is enough free space to store RS(q), or
  • there is NOT enough free space, but the least profitable caches can be dropped to fit the cache for q.
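The admission and eviction rule might look as follows in code. This is a sketch under stated assumptions, not the authors' implementation: the class `PeerCache` and its methods are hypothetical, sizes are counted in records, and the slide does not say which caches may be dropped, so the sketch assumes that only caches with a strictly lower profit than the incoming query are evicted, least profitable first.

```python
class PeerCache:
    """Toy per-peer cache with profit-based eviction (sizes in records)."""

    def __init__(self, capacity):
        self.capacity = capacity  # storage space s_0 offered by the peer
        self.entries = {}         # query -> (popularity, result set size)

    @staticmethod
    def profit(popularity, rs_size):
        """profit(q) = popularity(q) / (|RS(q)| + 1), as on the slide."""
        return popularity / (rs_size + 1)

    def used(self):
        return sum(size for _, size in self.entries.values())

    def try_cache(self, query, popularity, rs_size):
        """Cache the query if it fits, evicting less profitable caches if needed."""
        if rs_size > self.capacity:
            return False
        free = self.capacity - self.used()
        new_profit = self.profit(popularity, rs_size)
        victims = []
        by_profit = sorted(self.entries.items(),
                           key=lambda kv: self.profit(*kv[1]))
        for q, (pop, size) in by_profit:
            if free >= rs_size:
                break
            # Assumption: never evict a cache at least as profitable as q.
            if self.profit(pop, size) >= new_profit:
                break
            victims.append(q)
            free += size
        if free < rs_size:
            return False  # nothing is evicted when q cannot be made to fit
        for q in victims:
            del self.entries[q]
        self.entries[query] = (popularity, rs_size)
        return True

# Hypothetical usage with a 10-record peer.
cache = PeerCache(capacity=10)
r1 = cache.try_cache("x", popularity=5, rs_size=4)  # fits
r2 = cache.try_cache("y", popularity=1, rs_size=5)  # fits
r3 = cache.try_cache("z", popularity=6, rs_size=5)  # evicts the low-profit "y"
r4 = cache.try_cache("w", popularity=1, rs_size=9)  # rejected
print(r1, r2, r3, r4, sorted(cache.entries))
```

Dividing popularity by the result set size favours small, frequently asked result sets, which is what motivates the separate top-k cache for popular queries with large result sets.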

DCT: Top-K caching
• Problem:
  • A popular query q with a large result set might NOT be cached as its profit is relatively low
• Solution:
  • Introduce a top-k cache:
    • Can serve only q, no subsumption;
    • But consumes little space and avoids broadcasting the popular q

Evaluation

Evaluations: query load and data
• Source data:
  • English Wikipedia XML dump (6 GB), 05.2006
  • Two Wikipedia query traces from August and September 2004
• Query load properties (August trace):
  • 1.3M unique queries, asked 4.6M times during the month
  • 500K repeated at least twice, 800K asked only once
  • 225K unique terms in both traces (after stemming)
  • Average number of terms in a query = 2.6
• Java simulation:
  • Simulates a number of virtual peers
  • Each peer provides 200K records of storage space

Evaluations: how much storage do we need?
• 98% max cache hit with unlimited storage
• 81% max cache hit with unlimited storage but no subsumption

Evaluations: Traffic consumption
• 100 peers, 200K each
• Converges to 85% cache hit with 100 x 200K = 20M records global cache capacity
• The naïve approach requires at least 240M records for the term index (if built for query load terms only)

Evaluations: stress test
• 300 peers, 200K each
• Converges to 97% cache hit with 300 x 200K = 60M capacity
• Very small cache hit drop when the load changes, thanks to subsumption

Evaluations: load balancing
• Cache imbalance => only several peers are overloaded
• Meta-index imbalance => has less impact, can be partially avoided

Conclusions
• Distributed Cache Table: a (quite) large-scale distributed cache for P2P IR applications based on both:
  • Query load
  • Data distribution
• Properties:
  • Efficiently utilizes and adapts to the available storage space
  • Trade-off between huge index size and extra traffic costs for broadcasting rare queries
  • Subsumption is important: resilient to query load changes
  • Sufficiently load balanced
  • Requires 1-2 orders of magnitude less traffic than the naïve approach
  • Requires substantially less storage than a per-term index

Thank you for your attention! Questions?
