GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher

EDBT’2011. GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher. Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen, Middleware Systems Research Group, University of Toronto. An X-ToPSS Project. http://msrg.org/tags/x-topss.


Presentation Transcript


  1. EDBT’2011 • GPX-Matcher - A Generic Boolean Predicate-based XPath Expression Matcher • Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen • Middleware Systems Research Group, University of Toronto • An X-ToPSS Project • http://msrg.org/tags/x-topss

  2. The Problem in a Nutshell • XML filtering: XML documents and XPath expressions (XPE; millions of XPEs) go in, matched XPEs come out • Pub/sub matching: events/publications and subscriptions (Boolean expressions) go into a pub/sub engine, matched subscriptions come out

  3. Publish/Subscribe Systems • Publishers (stock markets: TSX, NASDAQ, NYSE) send publications to the broker: AMGN=58, IBM=84, ORCL=12, JNJ=58, HON=24, INTC=19, MSFT=27 • Subscribers register subscriptions with the broker: IBM > 85, ORCL < 10, JNJ > 60 • The broker sends notifications to the subscribers X-ToPSS & GPX-Matcher

  4. Pub/Sub Matching Algorithms • Rete algorithm [Forgy, late 70s] • A graph structure to correlate events and process rules (solves a more general problem) • SIFT [Yan et al. TODS‘94] • Predicate counting et al. • Gough algorithm [Gough et al. ACSC‘95] • Based on a finite state representation of subscriptions • Gryphon algorithm [Aguilera, et al. PODC‘99] • Decision tree over predicates • Clustering algorithm [Fabret et al. SIGMOD‘01] • Clusters subscriptions based on common predicates • k-Index [Whang et al. VLDB‘09] • Hardware-based matching acceleration [Sadoghi et al. VLDB‘10] • BE-Tree [Sadoghi & Jacobsen, SIGMOD’2011]

  5. The Key Question • Can XML filtering benefit from the efficient publish/subscribe matching algorithms that have been developed over more than three decades?

  6. XML Filtering Challenges • Input: XML documents and XPath expressions (XPE; millions of XPEs) • Output: matched XPEs • Filter XML according to XPEs efficiently, at Internet scale, for millions of XPEs, and for many XML documents per unit of time

  7. XML Filtering Systems • XML filtering systems are publish/subscribe systems • XPath expressions and XML documents are the subscriptions and publications, respectively • Growing need for XML filtering • Application-level firewalls • Malware detection and prevention • Document routing • RSS aggregators • XML-based messaging and application integration • Selected industry players (XML appliances) • Solace Systems • IBM DataPower • Talerian • Sarvega (Intel)

  8. The Core Problem • XML Document Filtering Problem • Given a set of XPath expressions Q and an XML document d, find all expressions in Q that are matched by d • An expression q is matched by an XML document d if and only if q selects a non-empty set of nodes in d • XPath expressions are used to select entire documents or fragments of documents

  9. Agenda • Supported XPath Language • Mapping XML Filtering to Pub/Sub Matching • XPath encoding • XML encoding • Experimental results • Outlook

  10. XML and XPath • XML fragment: <section> <subsection><figure> … </figure> </subsection><figure> … </figure></section> • XML tree: section has the children subsection and figure; subsection has the child figure • XML paths: section-subsection-figure, section-figure • XPath queries: /section/*/figure, */figure, /section/subsection/figure, section/figure, /section//subsection/figure, section//figure • Callouts in the figure: absolute query, relative query, location step, child operator (/), descendant operator (//), wildcards (*)

  11. XPath 2.0 Subset Considered • Absolute path expressions • /a/b • Relative path expressions • a/b/c • Descendant operators in path expressions • a/b//a/d • Wildcards in path expressions • a/*/*/b • Not discussed, but shown how to address • Filter predicates in path expressions • <path>[@x>1]/<path> • Nested path filters (the XPE becomes a tree) • <path>[a/b]/<path>
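
As a point of reference for the subset listed above, the following is a minimal sketch of a naive matcher that checks one XPE against one root-to-leaf document path. It is not the GPX-Matcher encoding; it only illustrates the intended semantics of `/`, `//`, and `*`. The function names `xpe_to_regex`, `matches`, and `filter_xpes` are hypothetical, and the sketch assumes tag names contain no `/` and that filter predicates are absent.

```python
import re

# Zero or more arbitrary location steps in a path written as "/tag/tag/...".
ANY_STEPS = r"(?:/[^/]+)*"


def xpe_to_regex(xpe):
    """Translate an XPE using only /, // and * into a regex over '/a/b/c' strings."""
    absolute = xpe.startswith("/") and not xpe.startswith("//")
    # parts alternates between steps and operators: step, op, step, op, ...
    parts = re.split(r"(//|/)", xpe.lstrip("/"))
    # A relative expression may start at any depth of the path.
    pattern = "^" if absolute else "^" + ANY_STEPS
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # The child operator adds nothing; the descendant operator
            # allows any number of intermediate steps.
            if part == "//":
                pattern += ANY_STEPS
        elif part == "*":
            pattern += "/[^/]+"          # wildcard: exactly one tag
        else:
            pattern += "/" + re.escape(part)
    # The last step must end at a tag boundary, not in the middle of a tag.
    return re.compile(pattern + r"(?=/|$)")


def matches(xpe, path):
    """True if the XPE selects a node on the given path (a list of tags)."""
    return xpe_to_regex(xpe).match("/" + "/".join(path)) is not None


def filter_xpes(xpes, paths):
    """Return the XPEs matched by at least one root-to-leaf path."""
    return [q for q in xpes if any(matches(q, p) for p in paths)]


doc = [["section", "subsection", "figure"], ["section", "figure"]]
print(filter_xpes(["/section/*/figure", "section/figure", "/a/b"], doc))
```

A matcher of this kind evaluates every expression separately, which is exactly what becomes too slow for millions of XPEs and motivates the predicate-based encoding.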

  12. Agenda • Supported XPath Language • Mapping XML Filtering to Pub/Sub Matching • XPath encoding • XML encoding • Experimental results • Outlook

  13. Our Question(s) • How can we map XPath expressions onto subscriptions? • Conjunctive Boolean formula over predicates • S = (a1 op v1) ∧ (a2 op v2) ∧ … ∧ (an op vn) • How can we map XML documents onto publications? • Set of attribute-value pairs • P = {(a1, v1), (a2, v2), …, (am, vm)}
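
The subscription and publication model above can be sketched in a few lines. This is a minimal illustration, assuming a subscription is a list of (attribute, operator, value) triples and a publication is a dictionary; the helper name `sub_matches` and the operator table `OPS` are hypothetical. The stock values are taken from the publish/subscribe slide; the two-predicate subscription is made up for illustration.

```python
import operator

# Supported comparison operators (an assumption; the slides only say "op").
OPS = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def sub_matches(subscription, publication):
    """A conjunctive subscription matches if every predicate is satisfied."""
    return all(
        attr in publication and OPS[op](publication[attr], value)
        for attr, op, value in subscription
    )


publication = {"IBM": 84, "ORCL": 12, "JNJ": 58}

print(sub_matches([("IBM", ">", 85)], publication))                      # single predicate
print(sub_matches([("IBM", ">", 80), ("ORCL", "<", 15)], publication))   # conjunction
```

A predicate over an attribute that is absent from the publication is treated as unsatisfied here, which is one possible convention.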

  14. Predicate Calculus • Single-tag predicate • Double-tags predicate • End-tag predicate • Length-constraint predicate

  15. Single-tag Predicate Example • XPath expression /b/… • Predicate: Tag b at position 1 • Document path b-a-c, encoded as (b, 1), (a, 2), (c, 3)

  16. Double-tags Predicate Example I • XPath expression … a/b … • Predicate: Distance between Tag a and Tag b is one location step • Document path x-a-b, encoded as (x, 1), (a, 2), (b, 3)

  17. Double-tags Predicate Example II • XPath expression a//b • Predicate: Distance between Tag a and Tag b is at least one location step • Document path a-x-b, encoded as (a, 1), (x, 2), (b, 3)

  18. End-tag Predicate Example • XPath expression /a/*/* • Predicate: Tag a at least two location steps away from path end • Document path a-x-y, encoded as (a, 1), (x, 2), (y, 3), (length, 3)

  19. Length-constraint Predicate Example • XPath expression */*/* • Predicate: Length of the path is at least 3 • Document path x-y-z, encoded as (x, 1), (y, 2), (z, 3), (length, 3)
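
The four predicate types from the examples above can be evaluated directly on the position encoding of a path. The following is a minimal sketch, assuming a path without duplicate tags; the function names and signatures are hypothetical, and the formal predicate notation of the original slides is not reproduced here.

```python
def positions(path):
    """Map each tag to its 1-based position, e.g. b-a-c -> {b: 1, a: 2, c: 3}."""
    return {tag: i for i, tag in enumerate(path, start=1)}


def single_tag(pos, tag, k):
    """Tag occurs at position k (e.g. /b/... requires b at position 1)."""
    return pos.get(tag) == k


def double_tag(pos, first, second, distance, exact=True):
    """Distance from first to second is exactly (a/b) or at least (a//b) `distance`."""
    if first not in pos or second not in pos:
        return False
    gap = pos[second] - pos[first]
    return gap == distance if exact else gap >= distance


def end_tag(pos, length, tag, k):
    """Tag is at least k location steps away from the end of the path."""
    return tag in pos and length - pos[tag] >= k


def length_constraint(length, n):
    """The path has at least n location steps."""
    return length >= n


# The document paths used in the five slide examples.
print(single_tag(positions(["b", "a", "c"]), "b", 1))                    # /b/...
print(double_tag(positions(["x", "a", "b"]), "a", "b", 1))               # a/b
print(double_tag(positions(["a", "x", "b"]), "a", "b", 1, exact=False))  # a//b
print(end_tag(positions(["a", "x", "y"]), 3, "a", 2))                    # /a/*/*
print(length_constraint(3, 3))                                           # */*/*
```

Each call evaluates to `True` for the path given on the corresponding slide.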

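  20. Putting it Together: XPath Query Encoding Example • Queries: Q1: a/b//a, Q2: a//b/d, Q3: a/*/*/*//b/d • With occurrence numbers: Q1: a1/b1//a2, Q2: a1//b1/d1, Q3: a1/*/*/*//b1/d1 • Each query is encoded over the predicates P1 to P5 (the assignment of predicates to queries is shown in the figure) • Our XPath encoding grows linearly in the size of the XPath expression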

  21. XML Document Path Encoding • Document path: a-b-c-d • Without duplicate tags (i.e., all occurrence numbers are 1): a1-b1-c1-d1 • Publication (attribute-value pairs): (length, 4), (a1, 1), (b1, 2), (c1, 3), (d1, 4), (a1, b1, 1), (a1, c1, 2), (a1, d1, 3), (b1, c1, 1), (b1, d1, 2), (c1, d1, 1) • The resulting set of attribute-value “pairs” has O(n²) elements for a path of n tags.
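
The path encoding above can be reproduced with a short sketch. The function name `encode_path` is hypothetical, and numbering repeated tags by order of occurrence is an assumption based on the a1/a2 labels in the query example; it also assumes tag names do not themselves end in digits.

```python
from collections import Counter
from itertools import combinations


def encode_path(path):
    """Encode a root-to-leaf path as position entries and pairwise distance entries."""
    seen = Counter()
    labelled = []
    for tag in path:
        seen[tag] += 1
        labelled.append(f"{tag}{seen[tag]}")   # a -> a1, second a -> a2, ...

    # (length, n) followed by one (tag, position) entry per tag.
    singles = [("length", len(path))]
    singles += [(tag, i) for i, tag in enumerate(labelled, start=1)]

    # One (tag, later tag, distance) entry per ordered pair: n(n-1)/2 entries.
    pairs = [
        (a, b, j - i)
        for (i, a), (j, b) in combinations(enumerate(labelled, start=1), 2)
    ]
    return singles, pairs


singles, pairs = encode_path(["a", "b", "c", "d"])
print(singles)
print(pairs)
```

For a path of n tags the second list holds n(n-1)/2 entries, which is where the quadratic size of the publication comes from.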

  22. Mapping XML Filtering to Pub/Sub Matching • XML documents are mapped to events/publications • XPath expressions (XPE; millions of XPEs) are mapped to subscriptions (Boolean expressions) • The pub/sub engine returns the matched subscriptions, which are mapped back to the matched XPEs

  23. Matching Algorithms • Pick any pub/sub matching algorithm • We used • Counting algorithm [exact origin is unknown] • Clustering algorithm [Fabret, Jacobsen et al., 2001] • Both are two-phased matching algorithms • Step 1, predicate matching: Match all predicates. • Step 2, subscription matching: Match subscriptions using the result from step 1.
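
The two phases can be illustrated with a minimal sketch of a counting-style matcher. This is an illustration of the general idea, not the implementation evaluated in the talk: predicates are restricted to equality tests, and all names (`build_indexes`, `match`, the predicate and subscription ids) are hypothetical. The publication reuses the pairwise entries of the a1-b1-c1-d1 path.

```python
from collections import defaultdict


def build_indexes(predicates, subscriptions):
    """Index predicates by attribute and subscriptions by predicate id."""
    by_attr = defaultdict(list)
    for pid, (attr, value) in predicates.items():
        by_attr[attr].append((pid, value))
    subs_of = defaultdict(list)
    for sid, pids in subscriptions.items():
        for pid in pids:
            subs_of[pid].append(sid)
    return by_attr, subs_of


def match(publication, predicates, subscriptions):
    by_attr, subs_of = build_indexes(predicates, subscriptions)

    # Phase 1: predicate matching, hashing on the attribute of each pair.
    satisfied = {
        pid
        for attr, value in publication.items()
        for pid, wanted in by_attr.get(attr, ())
        if wanted == value
    }

    # Phase 2: count satisfied predicates per subscription; a subscription
    # matches when the count equals its number of (distinct) predicates.
    hits = defaultdict(int)
    for pid in satisfied:
        for sid in subs_of.get(pid, ()):
            hits[sid] += 1
    return sorted(sid for sid, n in hits.items() if n == len(subscriptions[sid]))


# Hypothetical predicates over tag-pair distances.
predicates = {1: (("a1", "b1"), 1), 2: (("b1", "c1"), 1), 3: (("a1", "d1"), 2)}
subscriptions = {"Q1": {1, 2}, "Q2": {1, 3}}
publication = {
    ("a1", "b1"): 1, ("a1", "c1"): 2, ("a1", "d1"): 3,
    ("b1", "c1"): 1, ("b1", "d1"): 2, ("c1", "d1"): 1,
}
print(match(publication, predicates, subscriptions))
```

Because predicates are stored once and shared by id, a predicate that appears in many subscriptions is evaluated only once per publication.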

  24. Predicate Matching: Single Tag • Publication: (length, 4), (a1, 1), (b1, 2), (c1, 3), (d1, 4), (a1, b1, 1), (a1, c1, 2), (a1, d1, 3), (b1, c1, 1), (b1, d1, 2), (c1, d1, 1) • Hash on the tag • Figure labels: predicate value (1 2 3 4), predicate with id i, predicate with id j, predicate bit vector (1 0 0 0)

  25. Subscription Matching: Clustering Algorithm • Cluster queries based on the access predicates • Access predicates are shared by all queries in a cluster • Only check clusters whose access predicates are matched • Open question: how to choose an effective access predicate
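
The clustering idea above can be sketched as follows. This is a simplified illustration, not the algorithm of Fabret et al.: each subscription is assigned to exactly one cluster keyed by a single access predicate, chosen here by a pluggable `choose` function (first predicate by default). All names are hypothetical, and the set of satisfied predicate ids is assumed to come from a separate predicate matching phase.

```python
from collections import defaultdict


def build_clusters(subscriptions, choose=lambda pids: pids[0]):
    """Group subscription ids by their access predicate."""
    clusters = defaultdict(list)
    for sid, pids in subscriptions.items():
        clusters[choose(pids)].append(sid)
    return clusters


def cluster_match(subscriptions, clusters, satisfied):
    """Check only clusters whose access predicate is satisfied."""
    matched = []
    for access_pid, sids in clusters.items():
        if access_pid not in satisfied:
            continue  # the whole cluster is skipped without further checks
        for sid in sids:
            if all(pid in satisfied for pid in subscriptions[sid]):
                matched.append(sid)
    return sorted(matched)


# Subscriptions as ordered lists of predicate ids (hypothetical values).
subscriptions = {"Q1": [1, 2], "Q2": [1, 3], "Q3": [4, 2]}
satisfied = {1, 2}  # result of the predicate matching phase

on_first = build_clusters(subscriptions)
on_last = build_clusters(subscriptions, choose=lambda pids: pids[-1])
print(cluster_match(subscriptions, on_first, satisfied))
print(cluster_match(subscriptions, on_last, satisfied))
```

Both choices return the same matches; they differ only in how many clusters are skipped, which is why the choice of access predicate affects performance.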

  26. Experimental Evaluation • All algorithms implemented in C • GPX – the base encoding with counting • GPX-ap – the base encoding with clustering (access pred.) • YFilter & BPA • DTDs used for generating workloads • NITF DTD (News Industry Data Format) • PSD DTD (Protein Sequence Database) • Total filtering time averaged over 500 XML documents • XML parsing time is negligible in the overall filtering time • Intel Quad-Core 2.66 GHz, 4GB

  27. Scalability in Number of XPEs • All XPEs are distinct • Plot annotations: 1 ms vs. 18 ms; ap on first; ap on last

  28. Scalability in Number of XPEs • XPEs workload contains duplicates

  29. Effect of Path Length

  30. Effect of Wildcards

  31. Conclusions • Novel XML/XPath encoding • Leverages existing matching techniques • Differs significantly from predominantly automata-based related work • Outperforms related approaches by an order of magnitude under many experimental conditions

  32. Thank You! • To learn more about X-ToPSS, please see • http://msrg.org/tags/x-topss

  34. Agenda • XML-based Filtering Systems • Mapping XML Filtering to Pub/Sub Matching • XPath encoding • XML encoding • Experimental results • Outlook

  35. Content-based Publish/Subscribe • Subscription: Boolean expressions over predicates (each predicate is an attribute-operator-value triple) • (subject = news) ∧ (topic = travel) ∧ (date > 21.2.2011) • Publication (a.k.a. event): Sets of attribute-value pairs • (subject, news), (topic, travel), (date, 21.2.2011), …

  36. The Pub/Sub Matching Problem • Given an event, e, and a set of subscriptions, S, determine all subscriptions, s ∈ S, that match e.

  37. Wide Applicability • Selective information dissemination • Location-based services • Personalization, alerting services • Application integration • Service & resource discovery • Network and distributed system management • Monitoring, surveillance, and control • Workforce management • Workload management & job scheduling • Business activity monitoring • Business process management, monitoring, and execution

  38. Matching Algorithm Techniques • Amortized storage & processing • Access predicates • Cost model-driven subscription partitioning • Cache-conscious data structure layout • Asynchronous cache-level pre-fetching • Event queue re-ordering and batch processing • Parallelization of algorithms for SMP & multi-core • FPGA-based acceleration (hardware-level)

  39. eXtensible Markup Language • XML – de facto standard for data exchange • Web Services, data and application integration, information dissemination • XPath – XML query language • Also used as a basis for other query languages (e.g., XQuery, XPointer, XSLT et al.)

  40. XML and XPath • XML fragment: <section> <subsection><figure> … </figure> </subsection><figure> … </figure></section> • XML tree: section has the children subsection and figure; subsection has the child figure • XML paths: section-subsection-figure, section-figure • XPath queries: /section/*/figure, */figure, /section/subsection/figure, section/figure, /section//subsection/figure, section//figure • Callouts in the figure: location step, child operator (/), descendant operator (//)

  43. Our Research Goal • Solve the XML filtering problem using content-based pub/sub matching algorithms. • Why? • Build on and exploit several decades' worth of insights, rather than construct special-purpose solutions.

  44. In a Nutshell • XML tree: section, subsection, figure • XML paths: section-subsection-figure, section-figure • The paths are matched against the encoded XPath expressions

  45. Special-purpose XML/XPath Filtering Algorithms • XFilter [Altinel et al. VLDB‘00] • WebFilter [Pereira et al. VLDB’01] • YFilter [Diao et al. TODS‘03] • XTrie [Chan et al. ICDE‘03] • AFilter [Candan et al. VLDB‘06] • BPA [Huo & Jacobsen, ICDE‘06] • BoXFilter [Moro et al. VLDB‘07] • pFiST [Kwon et al. DKE’08]

  46. From XML Filtering to Publish/Subscribe Matching • XPath expressions are encoded in a predicate calculus • XML documents are expressed as a set of paths from the root to a leaf in the document tree • Each path is translated into sets of attribute-value pairs (tags and their location in the path) • Matching algorithm • The attribute-value pairs are matched against the predicates with traditional pub/sub matching algorithms

  47. Possible Extensions • Extend predicate calculus to encompass other XPath 2.0 features • Alternative encodings • Exploit DTD or schema information • Exploit information about XPath expressions processed

  48. X-ToPSS: XML-based Toronto Publish/Subscribe System • Distributed, content-based publish/subscribe (cf. ICDCS’08) • Exploit DTDs (Document Type Definitions) to optimize subscription routing in distributed pub/sub systems • Explain covering and merging optimizations for XML/XPath • Alternative predicate-based XML/XPath matching algorithm that cannot exploit traditional pub/sub schemes (cf. ICDE’06) • Encoding presented herein, cf. EDBT’2011 (forthcoming) • http://msrg.org/tags/x-topss

  49. Example: XPath Query Encoding • Figure: predicates P1 to P5 over the tags a1, b1, a2, and d1, each stored under a predicate identifier (pid)

  50. That’s Like Database Querying! • Database: a query is evaluated over data tuples (about the past) and returns sets of tuples • Pub/sub: a publication is evaluated over subscriptions (about the future) and returns sets of tuples • Query and subscription are very similar. • Data tuples and publication are very similar. • However, the two problem statements are inverse.
