1 / 30

Date

XML processing in DHT networks Serge Abiteboul, Ioana Manolescu, Neoklis Polyzotis, Nicoleta Preda, Chong Sun INRIA-Saclay & UC Santa-Cruz. Date. 1. Outline. Topic KadoP System Overview of DHT Query evaluation Optimization techniques DPP: Distributed postings partitioning

oliana
Download Presentation

Date

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. XML processing in DHT networksSerge Abiteboul, Ioana Manolescu, Neoklis Polyzotis, Nicoleta Preda, Chong Sun INRIA-Saclay & UC Santa-Cruz Date 1

  2. Outline • Topic • KadoP System • Overview of DHT • Query evaluation • Optimization techniques • DPP: Distributed postings partitioning • Structural Bloom Filters • Conclusion 2

  3. Topic • Querying large volume of content in a P2P network for a community of users • Focus on indexing • Content = XML • P2P network = structured - around DHT • XML indexing DHT networks 3

  4. Example: Edos distribution system • A system for managing Linux distribution (Mandriva) • System releases • about 10 000 software packages + metadata (XML)‏ • Community of open-source developers: thousands • Functionalities • Publish/update releases • Query the metadata • Retrieve packages 4

  5. The KadoP system 5

  6. [K, Object] DHT: A P2P indexing infrastructure ID15 ID0=x mod 24 ID1=(x+20)mod24 ID2=(x+21)mod24 Pastry ID4=(x+22)mod24 Pointer in the finger table Look-up (K) from client ID0 Look-up (K) from client ID1 ID8=(x+23)mod24 • Which API? • locate (K) → Peer IP • get (K) → Object • put (K, Object) • Use a ring • each peer takes an ID in the space Modulo(2N)‏ • each peer stores (K, Object) pairs, for K satisfying: • ID peer ≤ K < ID next peer 6

  7. Advantages and Disadvantage • Advantages • Availability and reliability • No centralization (bottleneck) and replication • Scalability • Scalable solution for keyword queries • Disadvantage • Difficult to maintain the structure • Not suited for transient population of peers 7

  8. XML query processing in KadoP • Query evaluation: • Step 1. • Given a XQuery Q, decompose Q in tree pattern queries • Evaluate each tree pattern query using the DHT index to identify a set candidates peers P that can provide answers • Step 2. • Ship Q to these peers P and evaluate it there 8

  9. Indexing XML documents Doc.xml 8 X ancestor of Y ↔ start(X) < start(Y) ≤ end(X)‏ 1 A 2 6 8 7 C B X parent of Y↔ X ancestor of Y and level(X) = level(Y) - 1 4 4 4 3 5 6 D E F 4 4 “John” G 6 6 Posting = peer, doc, start, end, level 9

  10. XML indexing in DHT • Publish them via a DHT • put (k,postings), where k is a label or a keyword • Remark: all the postings for author accumulate at the same peer put(author;[p2,d2,start,end,lev])‏ Posting list for author p2 DHT p(author)‏ p1 put(author;[p1,d2,start,end,lev])‏ 10

  11. Some technical issues • Goal: manage millions of documents with thousands of peers • First experiments were a disaster • First works • Replace the index storage of the DHT in a FS by storage in a database (Berkeley DB)‏ • Extend the API of the DHT with Append and not only Read/Write • Extend the API of the DHT with a streaming exchange of postings • With this, KadoP scaled but was slow due to e.g. long postings 11

  12. Optimization 12

  13. Main issue: long postings • Transfer of long posting is hurting performance • Bad response time • Parallelization: Distributed Posting Partitioning (DPP)‏ • Communication load • Bloom filter: Structural Bloom Filter p(Name)‏ long posting for “Name” 13

  14. DPP structure (p,d)‏ p(Name)‏ long posting for “Name” (p3,d3)‏ (p1,d1)‏ (p2,d2)‏ (p4,d4)‏ • DPP structure • Split and distribute postings according to conditions • Each condition is an interval: C1=[(p1,d1),(p2,d2)] • Each two conditions are over disjoint intervals • Some kind of B-tree for postings C1 C2 C3 C4 p(Name)‏ 14

  15. Query processing (no DPP) ∞ 0 article article 0 ∞ QP-peer abstract author author “database” “Ullman” ∞ 0 abstract index-Q 0 0 ∞ ∞ database Ullman Pipeline transfers of postings to query processing peer Holistic twig-join algorithm to compute the result in parallel at QP peer 15

  16. Query processing with DPP At p(client)‏ Conditions sorted according (p,d) p( )‏ C2 p( )‏ C1 abstract C5 p( )‏ C4 p( )‏ C3 “XML” Fetch from p(abstract) and p(XML) the conditions C1-C5 Prune intervals Transfer and compute in parallel the join for each sub-interval 16

  17. Experiments • Platform • Grid5000: P2P platform for research in P2P systems • Distributed geographically across 6 sites in France • KadoP tested on more than 100 machines • 1000 logical peers • Conclusions in brief • Good performance • KadoP scales very nicely • Issue: does not support high churn of peers (index copying)‏ 17

  18. Query response time Q=article//author//Ullman 18

  19. Optimization • (b) Structural Bloom Filters • Ancestor Bloom Filter • Also in paper: Descendant BF 19

  20. ABF(a)‏ Using an Ancestor Bloom Filters Query: a//b Compute the Bloom Filter of the a-postings and send to p(b)‏ Compute the b-postings that have an a-ancestor (and more)‏ Send it to the p(a) that can compute the answer L(a)‏ DHT p(a)‏ L(b)‏ F(b, ABF(a))‏ p(b)‏ 20

  21. Technique: dyadic intervals Dyadic intervals 23 [1 8] 22 [1, 4] [5, 8] 1 2 3 4 5 6 7 21 [1, 2] [3, 4] [5, 6] [7, 8] start end ap [1,1] [2,2] [3,3] [4,4] [5,5] [6,6] [7,7] [8,8] 20 bp • Dyadic covers • D(ap)={[1,4], [5,6], [7,7]} • apisancestor of bp if • D(ap) (start(bp)  )‏ • Here 3 [1,4], so answer is yes! 21

  22. Ancestor Bloom Filter (simplified)‏ • Publication: d, ap in d, D(ap)‏ • Insert a trace in the Bloom Filter • Say T[h(d,)] = 1 for some has function h • Test: for bp in d, • for each dyadic interval  s.t. start(bp) , • test if T[h(d,)] = 1 • If one test is positive, conclude bp in d is a solution • Wrong positives because of Hash collisions 22

  23. Query evaluation strategies p(a)‏ p(a)‏ ABF(a)‏ DBF(b’’)‏ b’’ = F(b, DBF(c)‏ ^ DBF(d)) b’ = F(b, ABF(a))‏ p(b)‏ p(b)‏ ABF(b’)‏ ABF(b’)‏ DBF(c)‏ DBF(c)‏ d’ = F(d, ABF(b’))‏ p(d)‏ p(c)‏ p(d)‏ p(c)‏ c’ = F(c, ABF(b’))‏ Descendant Bloom Reducer Ancestor Bloom Reducer

  24. Postings DB Filter AB Filter Performances

  25. Conclusion 25

  26. Related works • Very active area • DHT-based platforms for XML data management • Locating data sources (Galanis & al. VLDB03)‏ • XPath lookup queries in P2P networks (Bonifati et al. WIDM04)‏ • Other DHT-based systems for data management • PIER query processor (Huebsch & al, CIDR05)‏ • Indexing in P2P networks (Aberer & al, VLDB05)‏ • Dyadic Intervals • Maintenance of dynamic intervals (Gilbert & al, VLDB02)‏ 26

  27. Contribution • Two optimization techniques for index processing • Distributed Posting Partitioning • Structural Bloom Filters • A full system for P2P XML indexing • As opposed to some simulation • Lots of engineering details that are important for performance • Extensively tested for performance • Tested with a real application, EDOS 27

  28. On-going and future work • New indexing techniques • Trading-off precision for performance • Publish summarizations of documents • Index/transfer postings at a coarse level of detail • Index views (query caching)‏ • Query optimizer for KadoP • This is standard distributed query processing • Use standard optimization techniques, e.g., use OptiMax ActiveXML optimizer (demo in ICDE08) • Develop what is specific for KadoP: cost model 28

  29. Merci

  30. Indexing time 30

More Related