
Efficient Evaluation of XQuery over Streaming Data

Xiaogang Li and Gagan Agrawal, The Ohio State University


Presentation Transcript


  1. Efficient Evaluation of XQuery over Streaming Data • Xiaogang Li, Gagan Agrawal • The Ohio State University

  2. Motivation • Why Stream • Data needs to be analyzed in real time - Stock Market, Climate, Network Traffic • Rapid improvements in networking technologies • Lack of disk space - 101.13 Gbps at SC2004 Bandwidth Challenge - Retrieval from local disk may be much slower than from a remote site

  3. Motivation • Why XML - Standard data exchange format for the Internet - Widely adopted in web-based, distributed and grid computing - Virtual XML is becoming popular • Why XQuery - Widely accepted language for querying XML - Declarative: Easy to use - Powerful: Types, user-defined functions, binary expressions

  4. Current Work: XQuery Over Streaming Data • XPath over Streaming Data • XPath is relatively simple • XQuery over Streaming Data • Limited features handled • Focus on queries that are written for single pass evaluation

  5. Contributions • Can the given query be evaluated correctly on streaming data? - Only a single pass is allowed - Decision made by the compiler, not the user • If not, can it be correctly transformed? • How to generate efficient code for XQuery? - Computations involved in streaming applications are non-trivial - Recursive functions are frequently used - Efficient memory usage is important

  6. Our Approach • For an arbitrary query, can it be evaluated correctly on streaming data? - Construct a data-flow graph for the query - Static analysis based on the data-flow graph • If not, can it be transformed to do so? - Query transformation techniques based on static analysis • How to generate efficient code for XQuery? - Techniques based on static analysis to minimize memory usage and optimize code - Generating imperative code: recursion analysis and aggregation rewrite

  7. Query Evaluation Model • Single input stream • Internal computations - Limited memory - Linked operators • Pipeline operator and Blocking operator (figure: a chain of linked operators Op1, Op2, Op3, Op4)

  8. Pipeline and Blocking Operators • Pipeline Operator: - Each input tuple produces an output tuple independently - Selection, Increment, etc. • Blocking Operator: - Can only compute output after receiving all input tuples - Sort, Join, etc. • Progressive Blocking Operator: (1) |output| << |input|: we can buffer the output (2) Associative and commutative operation: we can discard the input - count(), sum()
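
The three operator kinds on slide 8 can be contrasted with a small Java sketch. It is only an illustration under assumed names (`OperatorKinds`, `RunningSum`, `selectPositive`, `sortAll` are hypothetical and do not come from the authors' system): selection acts on one tuple at a time, sum/count fold each tuple into a constant-size accumulator, and sort has to buffer everything.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the three operator kinds; not the authors' code.
final class OperatorKinds {

    // Pipeline operator (selection): each input is handled independently,
    // so a result can be emitted as soon as the input arrives.
    static boolean selectPositive(double x) {
        return x > 0;
    }

    // Progressive blocking operator (sum/count): the final value is known only
    // after the whole input has been seen, but every input can be folded into
    // a constant-size accumulator and then discarded.
    static final class RunningSum {
        private double total = 0;
        private long count = 0;

        void accept(double x) {
            total += x;
            count++;
        }

        double sum() { return total; }
        long count() { return count; }
    }

    // Blocking operator (sort): all inputs must be buffered before any output
    // can be produced, so memory grows with the length of the input.
    static List<Double> sortAll(List<Double> input) {
        List<Double> buffer = new ArrayList<>(input);
        Collections.sort(buffer);
        return buffer;
    }

    public static void main(String[] args) {
        double[] stream = {3.0, -1.0, 2.0};
        RunningSum agg = new RunningSum();
        for (double x : stream) {
            if (selectPositive(x)) {   // pipeline step
                agg.accept(x);         // progressive blocking step
            }
        }
        System.out.println(agg.count() + " values, sum " + agg.sum());
    }
}
```

On an unbounded stream only the first two kinds keep memory bounded, which is why the later single-pass analysis treats blocking operators separately.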

  9. Single Pass? Check condition 2 in a query • Data: pixels with x and y • Q1: let $i := …/pixel sortby (x) - A blocking operator exists • Q2: let $i := for $p in /pixel where $p/x > .. x = count(/pixel) - A progressive blocking operator is referred to by another pipeline operator or progressive blocking operator

  10. Single-Pass? Challenges • Must analyze data dependence at the expression level • A query may be complex • Need a simplified view of the query to make the decision

  11. Overall Framework • Data Flow Graph Construction • High-level Transformation - Horizontal Fusion - Vertical Fusion • Single-Pass Analysis • Low-level Transformation - GNL Generation - Recursion Analysis - Aggregation Rewrite • Stream Code Generation

  12. Roadmap • Stream Data Flow Graph • High-Level Transformations - Horizontal Fusion - Vertical Fusion • Single Pass Analysis • Low Level Optimization • Experiment and Conclusion

  13. Stream Data Flow Graph (DFG) • Node represents a variable: explicit and implicit, sequence and atomic • Edge: dependence relation, v1->v2 if v2 uses v1; aggregate dependence and flow dependence • A DFG is acyclic • Cardinality inference is required to construct the DFG • Example: S1: stream/pixel[x>0], S2: stream/pixel, V1: count() (figure: DFG with nodes S1, S2, v1, i, b)
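
One possible in-memory form of the graph described on slide 13 is sketched below. All names (`StreamDfg`, `NodeKind`, `EdgeKind`, `aggregateDependentAtoms`) are hypothetical, and the edge added in `main` assumes that v1 is `count()` applied to S1; the transcript does not preserve the exact edges of the figure.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical data structure; names are illustrative, not taken from the authors' compiler.
final class StreamDfg {
    enum NodeKind { SEQUENCE, ATOMIC }
    enum EdgeKind { FLOW, AGGREGATE }

    static final class Edge {
        final String from;
        final String to;
        final EdgeKind kind;

        Edge(String from, String to, EdgeKind kind) {
            this.from = from;
            this.to = to;
            this.kind = kind;
        }
    }

    private final Map<String, NodeKind> nodes = new LinkedHashMap<>();
    private final List<Edge> edges = new ArrayList<>();

    void addNode(String name, NodeKind kind) {
        nodes.put(name, kind);
    }

    // Adds the edge from -> to, meaning "to uses from".
    void addEdge(String from, String to, EdgeKind kind) {
        if (!nodes.containsKey(from) || !nodes.containsKey(to)) {
            throw new IllegalArgumentException("unknown node");
        }
        edges.add(new Edge(from, to, kind));
    }

    // Atomic nodes that are computed by aggregating over a sequence node.
    List<String> aggregateDependentAtoms() {
        List<String> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.kind == EdgeKind.AGGREGATE
                    && nodes.get(e.from) == NodeKind.SEQUENCE
                    && nodes.get(e.to) == NodeKind.ATOMIC
                    && !result.contains(e.to)) {
                result.add(e.to);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        StreamDfg g = new StreamDfg();
        g.addNode("S1", NodeKind.SEQUENCE);        // stream/pixel[x>0]
        g.addNode("v1", NodeKind.ATOMIC);          // count()
        g.addEdge("S1", "v1", EdgeKind.AGGREGATE); // assumed: v1 = count(S1)
        System.out.println(g.aggregateDependentAtoms());
    }
}
```

Keeping the edge kind explicit matters because the single-pass conditions distinguish aggregate dependence from flow dependence.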

  14. High-level Transformation • Goals: (1) Enable single-pass evaluation (2) Simplify the SDFG and single-pass analysis • Horizontal Fusion and Vertical Fusion - Based on SDFG

  15. Horizontal Fusion • Enable single-pass evaluation - Merge sequence nodes with a common prefix • Before: S1: stream/pixel[x>0], S2: stream/pixel/y, V1: count(), V2: sum() • After: S0: /stream/pixel, S1: [x>0], S2: /y, V1: count(), V2: sum() (figure: DFG before and after fusion, with nodes S0, S1, S2, v1, v2, b)
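
The prefix merge on slide 15 can be imitated on plain path strings. The sketch below is a simplification under stated assumptions: it works on text rather than on DFG nodes, it assumes predicates contain no "/" character, and the names `HorizontalFusion`, `tokens`, `commonPrefixLength` and `join` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical string-level sketch of merging two paths that share a prefix.
final class HorizontalFusion {

    // Splits a path into steps and predicates, e.g.
    // "stream/pixel[x>0]" -> ["/stream", "/pixel", "[x>0]"].
    // Assumption: predicates do not contain '/'.
    static List<String> tokens(String path) {
        List<String> out = new ArrayList<>();
        for (String step : path.split("/")) {
            if (step.isEmpty()) {
                continue; // skip the empty piece produced by a leading slash
            }
            int bracket = step.indexOf('[');
            if (bracket < 0) {
                out.add("/" + step);
            } else {
                if (bracket > 0) {
                    out.add("/" + step.substring(0, bracket));
                }
                out.add(step.substring(bracket));
            }
        }
        return out;
    }

    // Number of leading tokens shared by both paths.
    static int commonPrefixLength(List<String> a, List<String> b) {
        int n = 0;
        while (n < a.size() && n < b.size() && a.get(n).equals(b.get(n))) {
            n++;
        }
        return n;
    }

    // Concatenates tokens[from, to) back into a path fragment.
    static String join(List<String> tokens, int from, int to) {
        return String.join("", tokens.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> s1 = tokens("stream/pixel[x>0]");
        List<String> s2 = tokens("stream/pixel/y");
        int n = commonPrefixLength(s1, s2);
        System.out.println("S0: " + join(s1, 0, n));
        System.out.println("S1: " + join(s1, n, s1.size()));
        System.out.println("S2: " + join(s2, n, s2.size()));
    }
}
```

After the merge both aggregates hang off one shared scan of the stream, which is what makes a single pass possible for this pair of paths.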

  16. Vertical Fusion • Simplify DFG and single-pass analysis - Merge a cluster of nodes linked by flow dependence edges (figure: DFG before and after fusion, with nodes S1, S2, S, i, b, v, j)

  17. Roadmap • Stream Data Flow Graph • High-Level Transformations - Horizontal Fusion - Vertical Fusion • Single Pass Analysis • Low Level Optimization • Experiment and Conclusion

  18. Single-pass Analysis • Can a query be evaluated on the fly? THEOREM 1. If a query with dependence graph G=(V,E) contains more than one sequence node after vertical fusion, it cannot be evaluated correctly in a single pass. Reason: A sequence node with infinite length cannot be buffered

  19. Single-pass Analysis (continued) THEOREM 2. Let S be the set of atomic nodes that are aggregate dependent on any sequence node in a stream data flow graph. For any two elements s1 and s2 of S, if there is a path between s1 and s2, the query may not be evaluated correctly in a single pass. Reason: A progressive blocking operator is referred to by another progressive blocking operator Example: count(pixel) where /x > 0.005*sum(/pixel/x)
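
The condition of Theorem 2 amounts to a reachability test between aggregate-dependent atomic nodes. A minimal sketch follows, assuming the graph is given as a successor map and the set S is supplied by the caller; the class name `Theorem2Check`, the method `violates`, and the node names in `main` are hypothetical, and the edges in `main` are an assumed encoding of the slide's count/sum example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical check for the Theorem 2 condition; not the authors' implementation.
final class Theorem2Check {

    // Returns true if some node in `aggregates` reaches a different node in
    // `aggregates` along directed edges of the graph.
    static boolean violates(Map<String, List<String>> successors, Set<String> aggregates) {
        for (String start : aggregates) {
            Deque<String> stack = new ArrayDeque<>(
                    successors.getOrDefault(start, Collections.<String>emptyList()));
            Set<String> seen = new HashSet<>();
            while (!stack.isEmpty()) {
                String node = stack.pop();
                if (!seen.add(node)) {
                    continue; // already explored from this start
                }
                if (aggregates.contains(node) && !node.equals(start)) {
                    return true;
                }
                stack.addAll(successors.getOrDefault(node, Collections.<String>emptyList()));
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed encoding of: count(pixel) where x > 0.005 * sum(/pixel/x)
        Map<String, List<String>> g = new HashMap<>();
        g.put("vSum", Arrays.asList("sFiltered"));   // the filter uses the sum
        g.put("sFiltered", Arrays.asList("vCount")); // the count aggregates the filtered pixels
        Set<String> aggregates = new HashSet<>(Arrays.asList("vSum", "vCount"));
        System.out.println(violates(g, aggregates));
    }
}
```

In the assumed example the count cannot start until the sum over the whole stream is known, so the check reports a violation.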

  20. Single-pass Analysis (continued) THEOREM 3. If there is a cycle in a stream data flow graph G, the corresponding query may not be evaluated correctly using a single pass. Reason: A progressive blocking operator is referred to by a pipeline operator (figure: DFG with nodes S1, S2, i, b, v, j)
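
The condition of Theorem 3 can be tested with a standard depth-first search for cycles. The sketch below assumes the same successor-map encoding of the graph; `CycleCheck` and `hasCycle` are hypothetical names, and the two-node graph in `main` is an assumed example in which a fused sequence node S feeds an aggregate v that S itself uses.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical cycle test for a stream data flow graph given as a successor map.
final class CycleCheck {
    private static final int ON_PATH = 1; // node is on the current DFS path
    private static final int DONE = 2;    // node and all its successors are explored

    static boolean hasCycle(Map<String, List<String>> successors) {
        Map<String, Integer> state = new HashMap<>();
        for (String node : successors.keySet()) {
            if (visit(node, successors, state)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String node,
                                 Map<String, List<String>> successors,
                                 Map<String, Integer> state) {
        Integer s = state.get(node);
        if (s != null) {
            return s == ON_PATH; // meeting a node on the current path closes a cycle
        }
        state.put(node, ON_PATH);
        for (String next : successors.getOrDefault(node, Collections.<String>emptyList())) {
            if (visit(next, successors, state)) {
                return true;
            }
        }
        state.put(node, DONE);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = new HashMap<>();
        g.put("S", Arrays.asList("v")); // v aggregates over S
        g.put("v", Arrays.asList("S")); // S is filtered using v
        System.out.println(hasCycle(g));
    }
}
```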

  21. Single-pass Analysis • Check the conditions corresponding to Theorems 1, 2, and 3 - Stop further processing if any condition is true • Completeness of the analysis - If a query without a blocking operator passes the test, it can be evaluated in a single pass THEOREM 4. If the results of a progressive blocking operator with an unbounded input are referred to by a pipeline operator or a progressive blocking operator with unbounded input, then for the stream data flow graph, at least one of the three conditions holds true

  22. Conservative analysis • Our analysis is conservative - A valid query may be labeled as “cannot be evaluated in a single-pass” Example:

  23. A review of the procedure (figure: a DFG with nodes S1, S2, v1, i, b reduced step by step to nodes S, v, i, b) • Result: Cannot be evaluated in a single pass!

  24. Roadmap • Stream Data Flow Graph • High-Level Transformations - Horizontal Fusion - Vertical Fusion • Single Pass Analysis • Low Level Optimization • Experiment and Conclusion

  25. Low-level Transformations • Use GNL as intermediate representation - GNL is similar to nested loops in Java - Enables efficient code generation for reductions - Enables transformation of recursive functions into iterative operations • From SDFG to GNL - Generate a GNL for each sequence node associated with an XPath expression - Move aggregation into GNL using aggregation rewrite and recursion analysis

  26. GNL Example • Facilitate code generation for any desired platform (figure: DFG with nodes S0, S1, S2, v1, v2, b and the corresponding GNL)

  27. Low-Level Transformations • Recursion Analysis - Extract commutative and associative operations from recursive functions • Aggregation Rewrite - Perform function inlining - Transform built-in and user-defined aggregates into iterative operations
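
The effect of the two transformations on slide 27 can be shown with a hand-written before/after pair. This is only an analogy in Java, not output of the authors' compiler, and the names `AggregationRewrite`, `sumRecursive` and `sumIterative` are hypothetical. The recursive form mirrors a user-defined function that sums a sequence; the iterative form keeps a single accumulator, which is valid because addition is associative and commutative.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical before/after pair for a recursive aggregate; not compiler output.
final class AggregationRewrite {

    // Before: sum(seq) = 0 if seq is empty, else head + sum(tail).
    // The whole sequence must be available, and the call depth grows with its length.
    static double sumRecursive(List<Double> seq) {
        if (seq.isEmpty()) {
            return 0.0;
        }
        return seq.get(0) + sumRecursive(seq.subList(1, seq.size()));
    }

    // After: the extracted "+" is applied to one item at a time, so each item
    // can be discarded as soon as it has been added to the accumulator.
    static double sumIterative(Iterator<Double> stream) {
        double acc = 0.0;
        while (stream.hasNext()) {
            acc += stream.next();
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Double> xs = Arrays.asList(1.0, 2.0, 3.5);
        System.out.println(sumRecursive(xs) + " " + sumIterative(xs.iterator()));
    }
}
```

The iterative form adds the items in a different order than the recursive one, so with floating-point values the two results can differ in the last bits for inputs less benign than this one.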

  28. Code Generation • Using SAX XML stream parser - XML document is parsed as a stream of events - <x> 5 </x>: startElement <x>, content 5, endElement <x> - Event-driven: Need to generate code to handle each event • Using Java JDK - Our compiler generates Java source code

  29. Code Generation: Example • startElement: Insert each referred element into buffer • endElement: Process each element in the buffer, dispatch the buffer
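
To make the event-driven style of slides 28 and 29 concrete, here is a hand-written SAX handler for the query "count the pixels with x > 0". It is a sketch of the general pattern, not code produced by the authors' compiler: the class `PixelCountHandler`, the helper `countPositive`, and the assumed document layout (`<stream>` containing `<pixel>` elements with `<x>` and `<y>` children) are all illustrative. It uses only the standard `javax.xml.parsers` and `org.xml.sax` APIs from the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical hand-written handler for count(/stream/pixel[x>0]).
final class PixelCountHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder(); // buffer for the content of <x>
    private boolean inX = false;
    private double currentX = Double.NaN;
    private long count = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("pixel")) {
            currentX = Double.NaN;  // start a fresh pixel
        } else if (qName.equals("x")) {
            inX = true;             // begin buffering the referred element
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inX) {
            text.append(ch, start, length); // content may arrive in several chunks
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("x")) {
            currentX = Double.parseDouble(text.toString().trim());
            inX = false;
        } else if (qName.equals("pixel") && currentX > 0) {
            count++;                // fold the pixel into the aggregate, then forget it
        }
    }

    long count() {
        return count;
    }

    // Parses the given XML text in one pass and returns the number of pixels with x > 0.
    static long countPositive(String xml) {
        PixelCountHandler handler = new PixelCountHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.count();
    }

    public static void main(String[] args) {
        String xml = "<stream><pixel><x> 5 </x><y>1</y></pixel>"
                + "<pixel><x>-2</x><y>3</y></pixel></stream>";
        System.out.println(countPositive(xml));
    }
}
```

Only the text of the current `<x>` element and one counter are held in memory, regardless of how many pixels the stream contains.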

  30. Roadmap • Stream Data Flow Graph • High-Level Transformations - Horizontal Fusion - Vertical Fusion • Single Pass Analysis • Low Level Optimization • Experimental Results • Conclusions

  31. Experiment • Query Benchmark - Selected Benchmarks from XMARK - Satellite, Virtual Microscope, Frequent Item • Systems compared with - Galax - Saxon - Qizx/Open

  32. Performance: XMARK Benchmark • >25% faster on small dataset • Scales well on very large datasets

  33. Performance: Real Applications • More than one order of magnitude faster on small dataset • Works well for very large datasets • 10-20% performance gain with control-flow optimization

  34. Conclusions • Provide a formal approach for query evaluation on streaming XML - Query transformation to enable correct execution on streams - Formal methods for single-pass analysis - Strategies for efficient low-level code generation - Experimental results show an advantage over other well-known systems
