
Bandwidth-Efficient Continuous Query Processing over DHTs


Presentation Transcript


  1. Bandwidth-Efficient Continuous Query Processing over DHTs Yingwu Zhu

  2. Background • Instantaneous Query • Continuous Query

  3. Instantaneous Query (1) • Documents are indexed • Node responsible for keyword t stores the IDs of documents containing that term (i.e., inverted lists) • Retrieve “one-time” relevant docs • Latency is a top priority • Query Q = t1 ∧ t2 ∧ … • Fetch the lists of doc IDs stored under t1, t2, … • Intersect these lists • E.g.: Google search engine

  4. Instantaneous Query (2) • Example (diagram): nodes A–D store the inverted lists cat: 1,4,7,19,20; dog: 1,5,7,26; cow: 2,4,8,18; bat: 1,8,31 • To answer “cat ∧ dog”, the querying node asks “cat?” and “dog?”, fetches the two lists, intersects them, and sends back the result: Docs 1, 7
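A rough illustration of the intersection step, not taken from the slides: the inverted lists below are the ones in the diagram, and the function name is made up; a real system would fetch each list from the DHT node responsible for the term.

    # Minimal sketch of instantaneous query processing (lists held locally here).
    from functools import reduce

    inverted_lists = {
        "cat": {1, 4, 7, 19, 20},
        "dog": {1, 5, 7, 26},
        "cow": {2, 4, 8, 18},
        "bat": {1, 8, 31},
    }

    def instantaneous_query(*terms):
        """Resolve Q = t1 AND t2 AND ... by intersecting the terms' doc-ID lists."""
        lists = [inverted_lists.get(t, set()) for t in terms]
        return reduce(set.intersection, lists) if lists else set()

    print(instantaneous_query("cat", "dog"))  # {1, 7}, matching the slide's result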

  5. Continuous Query (1) • Reverse the roles of documents and queries • Queries are indexed • Query Q = t1 ∧ t2 ∧ … is stored at one of the terms t1, t2, … • Question 1: How is the index term selected? (query indexing) • “Push” new relevant docs (incrementally) • Enabled by “long-lived” queries • E.g.: the Google News Alerts feature

  6. Continuous Query (2) • Upon insertion of a new doc D = t1 ∧ t2 • Contact the nodes responsible for the inverted query lists of D’s keywords t1 and t2 • Question 2: How to locate the nodes (query nodes QN)? (document announcement) • Resolve the query lists → the final list of queries satisfied by D • Question 3: What is the resolution strategy? (query resolution) • E.g., Term Dialogue, Bloom filters (Infocom’06) • Notify the owners of satisfied queries
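The end-to-end flow on document insertion can be sketched as follows. This is a simplified, hypothetical illustration: dht_lookup, notify_owner, and the in-memory query_nodes dict stand in for the real DHT machinery, and the document’s full term set is handed to the query node, which is exactly the cost that Term Dialogue and Bloom filters (next two slides) try to reduce.

    # Hypothetical sketch of continuous query processing on document insertion.
    # One continuous query, Q1 = cat AND dog, indexed under the term "cat".
    query_nodes = {"cat": [("Q1", {"cat", "dog"})]}

    def dht_lookup(term):
        # Stand-in for a DHT lookup: return the inverted query list stored
        # under `term` (empty if no queries are indexed there).
        return query_nodes.get(term, [])

    def notify_owner(query_name):
        print(f"notify owner of {query_name}")

    def announce_document(doc_id, doc_terms):
        # Contact the node responsible for each of the doc's keywords (Question 2).
        for term in doc_terms:
            for name, q_terms in dht_lookup(term):
                # Resolution (Question 3): D satisfies Q iff every term of Q is in D.
                if q_terms <= doc_terms:
                    notify_owner(name)

    announce_document(42, {"cat", "dog", "cow"})  # prints: notify owner of Q1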

  7. Query Resolution: Term Dialogue • The node holding the inverted list for “cat” stores three queries: Q1 = cat ∧ dog, Q2 = cat ∧ horse ∧ dog, Q3 = cat ∧ horse ∧ cow; other nodes hold the inverted lists for “dog” and “cow” • Dialogue with the doc node (doc contains cat, dog, cow): 1. Document announcement; 2. Query node asks “dog” & “cow”; 3. Doc replies “11” (bit vector); 4. Query node asks “horse”; 5. Doc replies “0” (bit vector) • Only Q1 is satisfied → notify the owner of Q1
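A toy sketch of the dialogue rounds, with an invented helper name: the query node sends the terms it still needs to check, and the doc node replies with a bit vector marking which of them the document contains.

    # Hypothetical helper for one round of a Term Dialogue.
    def ask(doc_terms, queried_terms):
        # Doc node: answer with a bit vector, one bit per queried term.
        return "".join("1" if t in doc_terms else "0" for t in queried_terms)

    doc_terms = {"cat", "dog", "cow"}

    print(ask(doc_terms, ["dog", "cow"]))   # "11"  (round 1 on the slide)
    print(ask(doc_terms, ["horse"]))        # "0"   (round 2: Q2 and Q3 fail)
    # Q1 = cat AND dog is fully satisfied -> notify its owner.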

  8. Query Resolution: Bloom filters • Same setup: the inverted list for “cat” holds Q1 = cat ∧ dog, Q2 = cat ∧ horse ∧ dog, Q3 = cat ∧ horse ∧ cow • 1. The doc announcement carries a Bloom filter of the doc’s terms (“10110”); 2. The query node checks the remaining query terms against the filter, and only “dog” still needs a Term Dialogue; 3. Doc replies “1” (bit vector) • Only Q1 is satisfied → notify the owner of Q1
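A minimal sketch of the Bloom-filter strategy, with made-up parameters (a 64-bit filter with 3 hash functions) rather than the ones used in the paper: query terms that fail the filter are definitely absent from the doc, so their queries are dropped without any dialogue; terms that pass may be false positives and are confirmed with a Term Dialogue.

    import hashlib

    class BloomFilter:
        """Tiny illustrative Bloom filter; size and hash choices are arbitrary."""
        def __init__(self, m=64, k=3):
            self.m, self.k, self.bits = m, k, 0

        def _positions(self, term):
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
                yield int.from_bytes(digest[:4], "big") % self.m

        def add(self, term):
            for p in self._positions(term):
                self.bits |= 1 << p

        def might_contain(self, term):
            return all(self.bits & (1 << p) for p in self._positions(term))

    # The document announcement carries a Bloom filter of the doc's terms.
    doc_terms = {"cat", "dog", "cow"}
    bf = BloomFilter()
    for t in doc_terms:
        bf.add(t)

    # The query node for "cat" pre-filters the remaining terms of its queries.
    for q_name, rest in [("Q1", {"dog"}), ("Q2", {"horse", "dog"}), ("Q3", {"horse", "cow"})]:
        candidate = all(bf.might_contain(t) for t in rest)
        print(q_name, "needs Term Dialogue" if candidate else "filtered out")
    # Typically: Q1 needs a Term Dialogue; Q2 and Q3 are filtered out
    # ("horse" fails the filter; false positives are possible but rare).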

  9. Motivation • Latency is not the primary concern, but bandwidth can be one of the important design issues • Various query indexing schemes incur different costs • Various query resolution strategies cause different costs → Design a bandwidth-efficient continuous query system with “proper” query indexing (Question #1), document announcement (Question #2), and query resolution (Question #3) approaches

  10. Contributions • Novel query indexing schemes → Question #1 • Focus of this talk! • Multicast-based document announcement → Question #2 • In the paper • Adaptive query resolution → Question #3 • Make intelligent decisions in resolving query terms • Minimize the bandwidth cost • In the full tech. report paper

  11. Design • Focus on simple keyword queries, e.g., Q = t1 ∧ t2 ∧ … ∧ tn • Leverage DHTs • Location & storage of documents and continuous queries • Query indexing • How to choose index terms for queries? • Doc. announcement, query resolution • Not covered in this talk!

  12. Current Indexing Schemes • Random Indexing (RI) • Optimal Indexing (OI)

  13. Random Indexing (RI) • Randomly chooses a term as the index term • Q = t1 ∧ … ∧ tm • Index term ti is randomly selected • Q is indexed on the DHT node responsible for ti • Pros: simple • Cons: • Popular terms are more likely to be index terms for queries • Load imbalance • Introduces many irrelevant queries into query resolution, wasting bandwidth

  14. Optimal Indexing (OI) • Q = t1 ∧ … ∧ tm • Index term ti is deterministically chosen as the most selective term, i.e., the one with the lowest frequency • Q is indexed on the DHT node responsible for ti • Pros: • Maximizes load balance & minimizes bandwidth cost • Cons: • Assumes perfect knowledge of term statistics • Impractical, e.g., due to the large number of documents, node churn, continuous doc updates, …

  15. Solution 1: MHI • Minimum Hash Indexing • Order query terms by their hashes • Select the term with the minimum hash as the index term • Q = t1 ∧ … ∧ tm • Index term ti is deterministically chosen, s.t. h(ti) < h(tx) for all x ≠ i • Q is indexed on the DHT node responsible for ti
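To contrast the three selection rules (RI, OI, MHI), here is a small sketch; h is a stand-in for the DHT’s hash function, and term_frequency represents the perfect statistics that OI assumes. Both are illustrative choices, not details from the paper.

    import hashlib
    import random

    def h(term):
        # Stand-in for the DHT's hash function.
        return int.from_bytes(hashlib.sha1(term.encode()).digest(), "big")

    def ri_index_term(query_terms):
        # Random Indexing: pick any term uniformly at random.
        return random.choice(sorted(query_terms))

    def oi_index_term(query_terms, term_frequency):
        # Optimal Indexing: the most selective (least frequent) term.
        # Assumes perfect term statistics, which the slides call impractical.
        return min(query_terms, key=lambda t: term_frequency[t])

    def mhi_index_term(query_terms):
        # Minimum Hash Indexing: the term with the smallest hash value;
        # deterministic, and no term statistics are needed.
        return min(query_terms, key=h)

    q = {"cat", "dog", "horse"}
    print(mhi_index_term(q))  # every node computes the same index term for Q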

  16. RI vs. MHI • D = {t2, t4, t5, t6}; terms t1 t2 t3 t4 t5 t6 t7, where h(ti) < h(tj) for i < j • 3 queries, all irrelevant to D: • Q1 = t1 ∧ t2 ∧ t4 • Q2 = t3 ∧ t4 ∧ t5 • Q3 = t3 ∧ t5 ∧ t6 • (1) RI: Q1, Q2, and Q3 will each be considered in query resolution with probability 67% (two of each query’s three terms appear in D, so terms t1, t2, t3, t4, t5, and t6 would need to be resolved) • (2) MHI: all of them will be filtered out! → bandwidth savings! • How?

  17. MHI: filtering irrelevant queries! • D = {t2, t4, t5, t6}; Q1 = t1 ∧ t2 ∧ t4, Q2 = t3 ∧ t4 ∧ t5, Q3 = t3 ∧ t5 ∧ t6 • D is announced only to the nodes responsible for its own terms: t2 (node B), t4 (node D), t5 (node C), t6 (node E); each stores no queries → no action • Q1 is indexed under t1 (node G) and Q2, Q3 under t3 (node F); these nodes are never contacted → the queries are disregarded in query resolution, saving bandwidth!
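The same filtering can be checked mechanically. In this sketch the hash values are fabricated only to respect the ordering h(t1) < … < h(t7) assumed on the previous slide.

    # Under MHI, a query is reached by the announcement only if its
    # minimum-hash term appears in the document.
    h = {f"t{i}": i for i in range(1, 8)}   # fabricated, just preserves the ordering

    doc = {"t2", "t4", "t5", "t6"}
    queries = {
        "Q1": {"t1", "t2", "t4"},
        "Q2": {"t3", "t4", "t5"},
        "Q3": {"t3", "t5", "t6"},
    }

    for name, terms in queries.items():
        index_term = min(terms, key=lambda t: h[t])   # MHI index term
        reached = index_term in doc                   # D only contacts its own terms' nodes
        print(name, "indexed under", index_term, "->",
              "resolved" if reached else "filtered out")
    # All three queries are filtered out, so no resolution bandwidth is spent on them.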

  18. MHI • Pros: • Simple and deterministic • Does not require term stats • Saves bandwidth over RI (up to 39.3% savings for various query types) • Cons: • Some popular terms can still become index terms when they have the minimum hash within their queries! • Load imbalance & irrelevant queries to process

  19. Solution 2: SAP-MHI • MHI is good but may still index queries under popular terms • SAmPling-based MHI (SAP-MHI) • Sampling (a synopsis of the K popular terms) + MHI • Avoid indexing queries under the K popular terms • Challenge: support duplicate-sensitive aggregates of popular terms, since synopses may be gossiped over multiple DHT overlay links and term frequencies may otherwise be overestimated! • Borrow the idea from duplicate-sensitive aggregation in sensor networks

  20. SAP-MHI • Duplicate-sensitive aggregation • Goal: a synopsis of the K popular terms • Based on a coin-tossing experiment CT(y) • Toss a fair coin until either the first head occurs or y coin tosses end up with no head, and return the number of tosses • Each node a • Produces a local synopsis Sa containing K popular terms (the terms with the highest values of CT(y)) • Gossips Sa to its neighbor nodes • Upon receiving a synopsis Sb from a neighbor b, aggregates Sa and Sb, producing a new synopsis Sa (via max() operations) • Thus, each node has a synopsis of K popular terms after a sufficient number of gossip rounds • Intuition: if a term appears in more documents, then its value produced by CT(y) will be larger than the values of rare terms
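A sketch of the aggregation under one reading of the slide: each node draws a CT(y) value per (term, document) occurrence it stores, keeps the per-term maximum as its local synopsis, and gossips its top-K terms; merging with max() means that the same synopsis arriving over several DHT overlay links cannot inflate a term’s value. The per-document draw and the choice y = 32 are assumptions, not details given in the talk.

    import random

    def CT(y):
        # Coin-tossing experiment from the slide: toss a fair coin until the
        # first head, or until y tosses come up with no head; return the count.
        for tosses in range(1, y + 1):
            if random.random() < 0.5:   # head
                return tosses
        return y

    def local_synopsis(local_docs, K, y=32):
        # One draw per (term, document) occurrence; keep the maximum per term.
        # A term appearing in many documents gets many draws, so its maximum
        # tends to exceed those of rare terms.
        values = {}
        for doc_terms in local_docs:
            for t in doc_terms:
                values[t] = max(values.get(t, 0), CT(y))
        top = sorted(values, key=values.get, reverse=True)[:K]
        return {t: values[t] for t in top}

    def merge_synopses(Sa, Sb, K):
        # max() is idempotent: re-receiving the same synopsis over multiple
        # overlay links never inflates a term's value (no double counting).
        merged = dict(Sa)
        for t, v in Sb.items():
            merged[t] = max(merged.get(t, 0), v)
        top = sorted(merged, key=merged.get, reverse=True)[:K]
        return {t: merged[t] for t in top}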

  21. SAP-MHI: Indexing Example • Query Q = t1 ∧ t2 ∧ t3 ∧ t4 ∧ t5, where h(t1) < h(t2) < h(t3) < h(t4) < h(t5) • Synopsis S = {t1, t2} • Q is indexed on the node which is responsible for t3, instead of t1
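The indexing rule itself is then a one-liner on top of MHI. This sketch reproduces the slide’s example; the hash values are fabricated only to match the stated ordering, and the fallback to plain MHI when every query term is popular is an assumption.

    def sap_mhi_index_term(query_terms, synopsis, h):
        # SAP-MHI: exclude the K popular terms in the synopsis, then apply MHI.
        # If every term of the query is popular, fall back to plain MHI (assumed).
        candidates = [t for t in query_terms if t not in synopsis] or list(query_terms)
        return min(candidates, key=h)

    h = {f"t{i}": i for i in range(1, 6)}            # fabricated: h(t1) < ... < h(t5)
    q = {"t1", "t2", "t3", "t4", "t5"}
    print(sap_mhi_index_term(q, {"t1", "t2"}, lambda t: h[t]))  # t3, not t1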

  22. Simulations

  23. SAP-MHI vs. MHI SAP-MHI improves load balance over MHI with increasing synopsis size K, for Skew queries.

  24. SAP-MHI vs. MHI Bloom filters are used in query resolution.

  25. SAP-MHI vs. MHI Term Dialogue is used in query resolution.

  26. SAP-MHI vs. MHI This shows why SAP-MHI saves bandwidth over MHI!

  27. Summary • Focus on a simple keyword query model • Bandwidth is a top priority • Query indexing impacts bandwidth cost • Goal: sift out as many irrelevant queries as possible! • MHI and SAP-MHI • SAP-MHI is the more viable solution • Load is more balanced, and bandwidth savings are greater! • Sampling cost is controlled • The number of popular terms is relatively low • The membership of the popular-term set does not change rapidly • Document announcement & adaptive query resolution further cut down bandwidth consumption (not covered in this talk)

  28. Thank You!
