1 / 20

Query Processing for High-Volume XML Message Brokering

Query Processing for High-Volume XML Message Brokering. Yanlei Diao Michael Franklin University of California, Berkeley. XML Message Brokers. Data exchange in XML: Web services, data and application integration, information dissemination.

emiesner
Download Presentation

Query Processing for High-Volume XML Message Brokering

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Query Processing for High-Volume XML Message Brokering Yanlei Diao Michael Franklin University of California, Berkeley

  2. XML Message Brokers • Data exchange in XML: Web services, data and application integration, information dissemination. • XML message brokers: central exchange points for messages sent between applications/users. • Main functions: For a large set of queries, • Filtering:matches messages to predicates representing interest specifications. • Transformation:restructures matched messages according to recipient-specific requirements. • Routing:delivers the customized data to the recipients.

  3. XML streams user queries Data Source Data Source Data Source query results Personalized Content Delivery Message Broker • User subscriptions: Specification of user interests, written in an XML query language. • XML streams: Continuously arriving XML data items. The message broker matches data items to queries, transforms them, and routes the results.

  4. Q1=/a/b Q2=/a/c Q3=/a/b/c Q4=/a//b/c Q5=/a/*/c Q6=/a//c Q7=/a/*/*/c Q8=/a/b/c • A large set of linear path expressions. {Q1} c {Q3, Q8} * b {Q2} c {Q4} a  {Q6} * c c {Q5} * c {Q7} b c XML Filtering and YFilter XML filtering systems: XFilter, YFilter, XMLTK, XTrie, Index-Filter, MatchMaker YFilter: high-performance shared path matching engine • A single Non-Deterministic Finite Automaton, sharing all the common prefixes. • Path sharingis the key to efficiency and scalability, orders of magnitude performance improvement! • Diao et al. Path sharing and predicate evaluation for high-performance XML filtering.TODS, Dec. 2003 (to appear).

  5. Efficient Transformation • Goal: customized result generation for tens of thousands of queries! • Leverage prior work on shared path matching (i.e.,YFilter) • How, and to what extent can a shared path matching engine be exploited? • Build customization functionality on top of it • What post-processing of path matching output is needed? • How can this be done most efficiently?

  6. Message Broker Architecture

  7. binding path predicate paths return paths Query Specification A query is a FLWR expression enclosed by a constant tag. <sections> { for$s in $doc//section where$s/title = “XML” and$s/figure/title = “XML processing” return<section> { $s//section//title } { $s//figure } </section> } </sections>

  8. parsing events 1 2 3 1 section 4 2 YFilter figure section 1 2 3 3 parsing events Node labeled tree 3 1 4 figure PathTuple Streams <section> <section><figure> … </figure> </section> <figure> … </figure></section> //section//figure /section/section/figure A PathTuple stream for each matched path expression: • PathTuple:A unique path match, one field per location step. • Ordering: PathTuples in a stream are always output in increasing order of node ids in the last field. • Path oriented shredding:query processing = operations on tuple streams.

  9. section_2: {title_6, title_8}, {figure_5, figure_9} section_11: {title_14}, {figure_15, figure_16} Output of Query Processor GroupSequence-ListSequence format for all the nodes selected from the input message. <sections> { for$s in $doc//section where$s/title = “XML” and$s/figure/title = “XML processing” return<section> { $s//section//title } { $s//figure } </section> } </sections>

  10. Basic Approaches • Three query processing approaches exploiting shared path matching. • Post-process path tuple streams to generate results. • Plans consist of relation-style/tree-search based operators. • Differ in the extent they push work down to the path engine. • Tension between shared path matching and result customization! • PathTuples in a stream are returned in a single, fixed order for all queries containing the path. • They can be used differently in post-processing of the queries.

  11. //section for$s in $doc//section where$s/title = “XML” and$s/figure/title = “XML processing” return<section> { $s//section//title } { $s//figure } </section> 2 4 7 11 13 … Alternative 1: PathSharing-F //section Insert part of the binding pathfromthe for clauses into the path engine. An external plan for eachquery: • Selection: value-based comparisons in the binding path (//section[@id <= 2]). • DupElim: when same node is bound multiple times in the stream. • Where-Filter: tests predicate paths in the where clause (tree-search routine). • Return-Select: applies the return clause (tree-search routine).

  12. Return-Select 1 Where-Filter section id = 1 2 section id = 2 DupElim 3 1 1 1 3 3 3 figure 2 2 3 3 id <=2 //section//figure YFilter Duplicate Elimination <figures> {for$f in $doc//section[@id<=2]//figure where … return … } </figures> • Duplicates for the binding path: PathTuples containing the same node id in the last field. • Cause redundant work in later operators and a duplicate result. • DupElimensures that the same node is emitted only once.

  13. 2 2 7 7 11 11 for$s in $doc//section where$s/title = “XML” and$s/figure/title = “XML processing” return<section> { $s//section//title } { $s//figure } </section> … … 2 3 7 8 11 12 2 … … 4 7 11 13 … 4 6 5 2 9 10 11 17 16 … … … //section //section/figure/title > > 2 //section //section/title 11 … Alternative 2: PathSharing-FW //section //section/title //section/figure/title In addition: push predicate paths from the where clause into the path engine. Semijoins: find query matches after paths in the for and the where clause are matched. • order-preserving • hash vs. merge based: hash based joins are more expensive

  14. Alternative 3: PathSharing-FWR Also push return paths from the return clause into the path engine. OuterJoin-Select: generate results. • create a group for each binding path tuple in the leftmost input. • left outer join the binding path tuple with a return stream to create a list. • order preserving • hashvsmergebased • Duplicates for a return path: • Defined on the join field and the last field of the return path stream. • Need DupElim on return paths before outer joins.

  15. Optimizations • Observation: More path sharing  more sophisticated processing plans. • Tension between shared path streams and result customization. • Different notions of duplicates for binding/return paths. • Different stream orders for the inputs of join operators. • Optimizations based on query / DTD inspection: • Removing unnecessary DupElim operators; • Turning hash-based operators to merge/scan-based ones.

  16. Performance Comparison • Three alternatives w./w.o. optimizations, non-recursive data Bib DTD, number of distinct queries = 5000, number of predicate paths = 1, number of return paths = 2, “//” probability = 0.2 Multi-Query Processing Time (MQPT) = wall clock time of processing a message – message parsing time (msec)

  17. Other Results • Three alternatives w./w.o. optimizations, recursive data • Vary number of predicate paths • Vary number of return paths • Vary “//” probability • Summary of the results: • PathSharing-FWR when combined with optimizations based on queries and DTD usually provides the best performance. • It performs rather poorly without optimizations. • Effectiveness of optimizations: • Query inspection improves the performance of all alternatives; • Addition of DTD-based optimizations improves them further. • Recursive data challenges the effectiveness of optimizations.

  18. Shared Post-processing So far, a separate post-processing plan per query. • The best performing approach (PathSharing-FWR) only uses relational style operators. • Sharing techniques similar to shared Continuous Query processing, but highly tailored for XML message brokering. • Query rewriting • Shared group by for outer joins • Selection pullup over semijoins (NiagaraCQ) • Shared selection (TriggerMan, NiagaraCQ, TelegraphCQ) • Shared post-processing can provide great improvement in scalability!

  19. Conclusions Result customization for a large set of queries: • Sharingis key to high-performance. • Can exploit existing path sharing technology, but need to resolve the inherent tension between path sharing and result customization. • Results show that aggressive path sharing performs best when using optimizations. • Relational style operators in post-processing enable use of techniques from the literature (multi-query optimization, CQ processing).

  20. Future work • Extending the range of shared post-processing. • Additional features in result customization • OrderBy, aggregation, nested FLWR expressions, etc. • Customization solutions based on shared tree pattern matching. • Third component of the XML message broker • content-based routing in an overlay network deployment.

More Related