1 / 54

Distributed Structural and Value XML Filtering

Distributed Structural and Value XML Filtering. Iris Miliaraki and Manolis Koubarakis Department of Informatics and Telecommunications National and Kapodistrian University of Athens. 9 Ο Ελληνικό Συμπόσιο Διαχείρισης Δεδομένων, Αγία Νάπα, Κύπρος.

kalyca
Download Presentation

Distributed Structural and Value XML Filtering

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Distributed Structural and Value XML Filtering Iris Miliarakiand ManolisKoubarakis Department of Informatics and Telecommunications National and Kapodistrian University of Athens 9Ο Ελληνικό Συμπόσιο Διαχείρισης Δεδομένων, Αγία Νάπα, Κύπρος *Το άρθρο θα παρουσιαστεί στο “4th ACM International Conference on Distributed Event-Based Systems (DEBS 2010)”, Cambridge, UK.

  2. Outline • XML Filtering scenario • Background • DHTs • Structural matching • Value matching • Experiments • Sum up and future work

  3. XML Filtering scenario Centralized Distributed Subscriber Publisher XML Filtering system XPath/XQuery ? Index-Filter Parallel/Hierarchical XTrie ONYX YFilter Miliaraki [WWW 2008] XTrie FiST Gong et al. [ICDE05] Subscriber Publisher XPath/XQuery ? XPush Snoeren [SOSP 2001]

  4. XML Filtering scenario Subscriber Publisher XPath/XQuery ? Subscriber Publisher XPath/XQuery ?

  5. Background: DHTs • Structured overlay networks • Solve the item location problem in a distributed and dynamic network of nodes (in O(log N) hops): • Let x be some data item. Find x! • Distributed version of hash tabledata structure • id=Hash(K) • Main operations: • Put: given a key (for a data item), map the key onto a node. • Get: Find the location of a data item with a given a key.

  6. XML Filtering scenario Subscriber Publisher XPath/XQuery ? DHT Subscriber Publisher XPath/XQuery ?

  7. XML data model - example <bib> <article title=“XML Filtering” conf=“VLDB” year=“2007”> <school> Univ. of Athens </school> <author institure=“Harvard”> John Smith </author> </article> </bib> <bib> <article title=“XML Filtering” conf=“VLDB” year=“2007”> <school> Univ. of Athens </school> <authorinstiture=“Harvard” John Smith </author> </article> </bib> <bib> <article title=“XML Filtering” conf=“VLDB” year=“2007”> <school> Univ. of Athens </school> <authorinstiture=“Harvard” John Smith </author> </article> </bib> Value Matching Structural matching Q1: /bib/*/author[text()="John Smith"] Q2: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Q1: /bib/*/author[text()="John Smith"] Q2: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Q1: /bib/*/author[text()="John Smith"] Q2: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770]

  8. Automata-based approaches • XFilter and YFilter, ONYX, XTrie, IndexFilter, FiST etc. • Main idea • Construct an automaton from a set of XPath/Xquery queries • Use it as a matching engine against the XML documents

  9. bib title Q3 Example NFA (YFilter) Q1: /bib/phdthesis/year = ‘2008’ Q2: /bib/proceedings/school = ‘Univ. of Athens’ Q3: /bib/proceedings/title = ‘XML Dissemination’ Q1 year 3 Q4: /bib/*/author = ‘John Smith’ 2 phdthesis Q5: //*/cite [@id = 12743] Q2 school proceedings 5 1 4 * 6 0 Q4 author 7 8 * ε Q5 cite * 11 9 10

  10. Distributed structural matching • Utilize a distributed version of a state-of-the-art approach YFilter • Instead of a centralized NFA • Distribute the NFA in the DHT Miliaraki, Z. Kaoudi and M. Koubarakis. XML Data Dissemination using automata on top of structured overlay networks. In WWW 2008.

  11. Distributed NFA Structural matching! What about value matching? Miliaraki, Z. Kaoudi and M. Koubarakis. XML Data Dissemination using automata on top of structured overlay networks. In WWW 2008.

  12. What about value matching? • Automata-based approaches efficient for structural matching • Queries apart from defining a structural path also contain value-based predicates /dblp/phdthesis[@year=2005]/author[@nationality=greek] • Our goal: Scale for both the size of the query set and the number of predicates per query

  13. Definitions • Attribute predicates: element[@attr op value]e.g. /bib/phdthesis[@published=2007] • Textual predicates: element[text() op value]e.g. /bib/*/author[text()=“John Smith”]

  14. Direct evaluation with automaton/trie author 3 • Treat predicates as elements! • Lazy DFA [Gupta and Suciu, 2003] • Hope is that only a small set of DFA states will be computed at runtime author Q1: /dblp/phdthesis[@year=2005]/author[@nationality=greek] 5 2 phdthesis year author nationality conference 3 8 10 Huge increase of NFA states! 7 bib * 0 1 4 author text() Q2: /bib/*/author[text()=Michael Smith] 5 9 article Destroy sharing of path expressions! 6 conference text() 7 11 Q3: /bib/article/conference[text()=WWW 2009]

  15. Bottom-up evaluation • Common rule in relational query optimization  apply selections as early as possible • Works well for relational query processing • pFist [Kwon et al. 2005] A lot of effort evaluating predicates while the structure may not be matched

  16. Step-by-step evaluation • XPath queries consist of distinct steps • Each step contains one or more value-based predicates • Perform value matching with structural matching in a stepwise manner • YFilter – Inline [Diao et al. 2003] • process predicates when NFA state is reached Effort spent for evaluating predicates while the structure may not be fully matched

  17. Top-down evaluation • Check predicates after structural matching • YFilter – Selection-Postponed [Diao et al. 2003] • performs predicate evaluation after the execution of the NFA • VA-RoXSum [Vagena et al. 2007] • Focus on message aggregation depending on predicate selectivity number of false positives may be very large

  18. Moving on to details • Parse XML document and generate a set of candidate predicates to perform predicate evaluation Enriched parsing events Candidate predicates CP1:article[@title="XML Filtering"] CP2:article[@conf=VLDB] CP3:article[@year=2007] CP4:author[text()="John Smith"] CP5:author[@institute=Harvard]

  19. Step-by-step evaluation Top-down evaluation Top-down evaluation with pruning Bottom-up evaluation

  20. Step-by-step evaluation Check value-predicates while matching the structure • Associate NFA states with relevant predicate information  organized using a hash index • At each step of the execution • check predicates • update list with partially matched queries Q • continue with expanding state if Q not empty

  21. Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q2: /bib/*/author[text()="John Smith"] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Example • At each step of the execution • check predicates • update list with partially matched queries Q • continue with expanding state if Q not empty Candidate predicates CP1:article[@title="XML Filtering"] CP2:article[@conf=VLDB] CP3:article[@year=2007] CP4:author[text()="John Smith"] CP5:author[@institute=Harvard] author 2 5 phdthesis 6 author bib article 0 1 3 cite * 7 conference 4 8

  22. Step-by-step evaluation • Top-down evaluation Top-down evaluation with pruning Bottom-up evaluation

  23. Top-down evaluation Delay value matching after structural matching • Execute distributed NFA • Only check predicates if an accepting state is reached • Each peer uses a local index mapping predicates to the list of queries that contain them (hash index)

  24. Example Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q2: /bib/*/author[text()="John Smith"] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Candidate predicates CP1:article[@title="XML Filtering"] CP2:article[@conf=VLDB] CP3:article[@year=2007] CP4:author[text()="John Smith"] CP5:author[@institute=Harvard] author 2 5 phdthesis 6 author bib article 0 1 3 cite * 7 conference 4 8

  25. Step-by-step evaluation • Top-down evaluation • Top-down evaluation with pruning Bottom-up evaluation

  26. Top-down evaluation with pruning • At each step of the execution, part of the NFA is revealed • Applies on equality predicates 2 5 IDEA: Use a compact summary of predicate information to stop NFA execution (prune)if we can deduce that no match can be found phdthesis 6 author bib article 0 1 3 cite * 7 conference 4 8

  27. TD with pruning – Details • Each peer responsible for storing many NFA fragments • Each peer keeps one Bloom filter which summarizes predicates of queries indexed in the relevant NFA fragments  Value filter (VF) • Assuming a peer p and a state st, for each query q whose NFA accepting path contains st, we insert one predicate of q in the VF of p

  28. TD with pruning - Main idea cont. • Predicates are inserted as a whole in VFs using their string representation: • element[@attr=value]  element + attr + value • element[text()=value]  element + text + value • VFs are updated during query indexing • Since we traverse the NFA accepting path of a query to index  all relevant VFs will be updated

  29. Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q2: /bib/*/author[text()="John Smith"] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Example: Constructing Value filters 1 0 0 Select 1 predicate per query to insert … author 2 5 1 0 phdthesis 1 6 author bib article 0 1 3 is responsible for cite * 7 conference m-bit filter 4 8 keeps value filter DHT

  30. Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek] Q2: /bib/*/author[text()="John Smith"] Q3: /bib/article[@conf=www] Q4: /bib/article[@year=2009]/author[@degree-from="UOA"] Q5: /bib/article[@year=2009]/cite[@paper-id=2392] Q6: /bib/article/cite[@paper-id=2770] Q7: //article[@year=2007] Example – Querying Value filters Candidate predicates CP1:article[@title="XML Filtering"] CP2:article[@conf=VLDB] CP3:article[@year=2007] CP4:author[text()="John Smith"] CP5:author[@institute=Harvard] author 2 5 phdthesis 6 author 1 1 bib article 0 1 3 0 0 cite * 7 0 0 … … e conference check Value filter 4 8 1 1 0 0 * article 1 1 9 10 Step 1: expanding state 0 DHT Step 2: expanding state 1 Execution stops MATCH! MISS! Execution continues

  31. Online selectivity estimation • TD with pruning  select one of the predicates of each query to insert in the value filter • Randomly • Or…. most selective predicate • Example /bib/article[@year=2009]/author[text()=“John Smith”] • It is no feasible to store the entire set of XML data that have been processed by our system Sampling! 1 2

  32. Step-by-step evaluation • Top-down evaluation • Top-down evaluation with pruning • Bottom-up evaluation

  33. Check values as early as possible Bottom-up evaluation • Queries are indexed in the network using their predicates • For each distinct predicate in query set  select a responsible peer using DHT • peer organizes its queries using a local index mapping predicates to the list of queries that contain them • This indexing model resembles works from area of Information Filtering

  34. Check values as early as possible Bottom-up evaluation cont. • Construct set of candidate predicates • For each candidate predicate contact responsible peer • Peer checks its local index • Performs locally structural matching

  35. Example Q1: /bib/phdthesis[@published=2005]/author[@nationality=greek] Find responsible Find responsible Candidate predicates CP1:article[@title="XML Filtering"] CP2:article[@conf=VLDB] CP3:article[@year=2007] CP4:author[text()="John Smith"] CP5:author[@institute=Harvard] DHT

  36. Experiments • Implemented methods using FreePastry release in Java • Environment • Cluster (http://www.grid.tuc.gr/) – 28 machines (4 peers per machine) • 253 peers from Planetlab network • Queries • Sets of 1000000 queries • 1 to 8 predicates per query • Data • NITF DTD • Bloom filter size • 100K bits

  37. Cluster (2 predicates per query)

  38. Cluster (4 predicates per query)

  39. Cluster (4 predicates per query) Arrival time in msecs

  40. Network traffic

  41. Sum up & future work • Described methods to combine both structural and value XML filtering in a distributed environment • Experimental evaluation of our methods • Future work • Potential improvements for SBS method • More sophisticated methods for selectivity estimation • Range predicates • Textual predicates

  42. Questions?

  43. Planetlab (2 predicates per query)

  44. Performance improvement

  45. Structural vs. value matching (small query set)

  46. Structural vs. value matching (large query set)

More Related