Integrating automata and algebraic paradigms for enhanced querying of XML data streams. Challenges, approaches, and optimizations explored.
A Unified Model for XQuery Evaluation over XML Data Streams • Jinhui Jian, Hong Su, Elke A. Rundensteiner • Worcester Polytechnic Institute • ER 2003
Need for Stream Processing • New environment • Data sources are everywhere • Data requests are everywhere • New applications • Sensor networks • Analysis of XML web logs • Selective dissemination of XML information (e.g., news)
Specific Challenges for XML Streams • Token-by-token access manner • Pattern retrieval + filtering/restructuring • A token is not a direct counterpart of a tuple Query: FOR $b IN doc(biditems.xml)//book LET $p := $b/price/text(), $t := $b/title WHERE $p < 30 RETURN <Inexpensive>$t</Inexpensive> Token timeline: <biditems> <book> <title> Dream Catcher </title> … Sample stream: <biditems> <book year="2001"> <title>Dream Catcher</title> <author><last>King</last><first>S.</first></author> <publisher>Bt Bound</publisher> <price>20</price> </book> …
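To make the "token, not tuple" point concrete, here is a minimal Python sketch of how the sample fragment breaks into tokens. It is an illustration only: the regex tokenizer, the `tokenize` name, and the shortened `SAMPLE` string are assumptions made for this example, and a real stream processor would read incrementally from a parser rather than from a complete string.

```python
import re
from typing import Iterator, Tuple

# Hypothetical tokenizer: start tags, end tags, and text become separate tokens.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*>|([^<]+)")

def tokenize(xml: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, value) tokens in document order."""
    for m in TOKEN_RE.finditer(xml):
        closing, name, text = m.groups()
        if name is not None:
            yield ("end" if closing else "start", name)
        elif text.strip():
            # Whitespace-only text between tags is skipped.
            yield ("text", text.strip())

# Shortened version of the slide's sample data (author and publisher omitted).
SAMPLE = ('<biditems><book year="2001"><title>Dream Catcher</title>'
          '<price> 20 </price></book></biditems>')

tokens = list(tokenize(SAMPLE))
for tok in tokens:
    print(tok)
```

A single `book` element is spread over many tokens, so an operator that wants one tuple per book has to buffer tokens until the closing `</book>` arrives.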
Two Computation Paradigms • Automata-based [yfilter02, x-scan01, xsm02, xsq03, xpush03…] • Algebraic [niagara00, …] This project intends to integrate both paradigms into one
Automata Paradigm • Auxiliary structures for: • Buffering data • Evaluating predicates • Restructuring buffered data • … Query: FOR $b IN stream(biditems.xml)//book LET $p := $b/price/text(), $t := $b/title WHERE $p < 30 RETURN <Inexpensive>$t</Inexpensive> [Figure: automaton with states 1 to 5 and transitions labeled *, book, title, price, and text(); it recognizes the patterns //book, //book/title, and //book/price/text()]
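The automaton on this slide can be sketched as a small stack-based matcher. This is a hedged illustration, not the authors' implementation: the wiring of states 1 to 5 (1 loops on `*`, 1 to 2 on `book`, 2 to 4 on `title`, 2 to 3 on `price`, 3 to 5 on `text()`) is an assumption read off the figure labels, and the token list is hand-written for the example.

```python
# Assumed state wiring, inferred from the figure labels on the slide.
TRANSITIONS = {1: {"book": 2}, 2: {"title": 4, "price": 3}}
SELF_LOOP = {1}                       # state 1 loops on any tag ("*")
ACCEPT = {2: "//book", 4: "//book/title"}
TEXT_STATE = 3                        # text() leads from state 3 to accepting state 5

def run(tokens):
    """Return (pattern, value) for every pattern match seen in the token stream."""
    stack = [{1}]                     # one set of active states per open element
    matches = []
    for kind, value in tokens:
        if kind == "start":
            nxt = set()
            for s in stack[-1]:
                if s in SELF_LOOP:
                    nxt.add(s)
                target = TRANSITIONS.get(s, {}).get(value)
                if target is not None:
                    nxt.add(target)
            stack.append(nxt)
            for s in sorted(nxt):
                if s in ACCEPT:
                    matches.append((ACCEPT[s], value))
        elif kind == "end":
            stack.pop()               # backtrack to the parent's active states
        elif kind == "text" and TEXT_STATE in stack[-1]:
            matches.append(("//book/price/text()", value))
    return matches

# Hand-written token stream for a one-book document (illustrative data).
TOKENS = [("start", "biditems"), ("start", "book"), ("start", "title"),
          ("text", "Dream Catcher"), ("end", "title"), ("start", "price"),
          ("text", "20"), ("end", "price"), ("end", "book"), ("end", "biditems")]

matches = run(TOKENS)
print(matches)
```

The matcher only reports where patterns occur. Buffering the matched data, testing `$p < 30`, and building `<Inexpensive>` are exactly the auxiliary structures the slide lists.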
Algebraic Computation Query: FOR $b IN doc(biditems.xml)//book LET $p := $b/price/text(), $t := $b/title WHERE $p < 30 RETURN <Inexpensive>$t</Inexpensive> [Figure: XML tree of book elements, each with children title, author (last, first), publisher, and price, and text leaves; Navigate //book, /title is shown on the tree] [Figure: two equivalent plans, bottom-up. Original: Navigate //book, price → Navigate //book, title → Select price < 30 → Tagger. With selection push-down enabled: Navigate //book, price → Select price < 30 → Navigate //book, title → Tagger]
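The two plans on this slide can be mimicked over tuples with a few generator functions. The sketch below is an assumption-laden illustration: the operator names follow the slide, but their signatures, the tuple-as-dict representation, and the second book ("Pricey Book", price 45) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative document; the second book is made up to show the filter at work.
DOC = ET.fromstring(
    "<biditems>"
    "<book><title>Dream Catcher</title><price>20</price></book>"
    "<book><title>Pricey Book</title><price>45</price></book>"
    "</biditems>")

def navigate_books(root):
    """Bind $b to every book element (//book)."""
    for book in root.iter("book"):
        yield {"b": book}

def navigate(tuples, src, path, dst):
    """Extend each tuple with the nodes reached from column src via path."""
    for t in tuples:
        for node in t[src].findall(path):
            yield {**t, dst: node}

def select(tuples, pred):
    return (t for t in tuples if pred(t))

def tagger(tuples, tag, col):
    """Wrap column col in a new element and serialize it."""
    for t in tuples:
        out = ET.Element(tag)
        out.append(t[col])
        yield ET.tostring(out, encoding="unicode")

def cheap(t):
    return float(t["p"].text) < 30

# Original plan: both Navigates run before the Select.
prices = navigate(navigate_books(DOC), "b", "price", "p")
result_a = list(tagger(select(navigate(prices, "b", "title", "t"), cheap),
                       "Inexpensive", "t"))

# Push-down plan: Select runs right after the price Navigate.
prices = navigate(navigate_books(DOC), "b", "price", "p")
result_b = list(tagger(navigate(select(prices, cheap), "b", "title", "t"),
                       "Inexpensive", "t"))

print(result_a)
print(result_b)
```

Both plans return the same answer, but the push-down plan never navigates to the title of a book that fails the price test. Note that both assume the whole document is already available as a tree, which is what a token stream does not give you.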
Observations • Automata paradigm • Well studied and effective for pattern retrieval on tokens • Needs patches for complex filtering and restructuring • Algebraic paradigm • Well studied and effective for expressing and optimizing query plans on sets of tuples • Does not yet accommodate tokenized inputs • Each paradigm has deficiencies • The two paradigms complement each other
Research Challenges • How to integrate the two models? • How to optimize a query within the integrated query model?
Raindrop Approach: Uniform Modeling in an Algebraic Framework
Uniform Algebraic Plan [Figure: XML data stream → algebraic plan → query answer]
Uniform Algebraic Plan [Figure: XML data stream → token-based plan (automata plan) → tuple stream → tuple-based plan → query answer]
Modeling the Automata in an Algebraic Plan: Black Box [xscan] vs. White Box Query: FOR $b IN stream(biditems.xml)//book LET $p := $b/price/text(), $t := $b/title WHERE $p < 30 RETURN <Inexpensive>$t</Inexpensive> • Black box: a single XScan operator computes $b := //book, $p := $b/price, $t := $b/title • White box: SJoin //book over Extract //book/price and Extract //book/title
A Unified Process at the Logical View Query: FOR $b IN doc(biditems.xml)//book LET $p := $b/price/text(), $t := $b/title WHERE $p < 30 RETURN <Inexpensive>$t</Inexpensive> [Figure: tuple-based plan on top of a token-based plan (automata plan)]
A Unified Process at the Logical View Query: FOR $b IN doc(biditems.xml)//book LET $p := $b/price/text(), $t := $b/title WHERE $p < 30 RETURN <Inexpensive>$t</Inexpensive> [Figure: the token-based plan expanded into SJoin //book over Extract $p, //book/price and Extract $t, //book/title, with the tuple-based plan above it]
A Unified Process at the Logical View Query: FOR $b IN doc(biditems.xml)//book LET $p := $b/price/text(), $t := $b/title WHERE $p < 30 RETURN <Inexpensive>$t</Inexpensive> [Figure: full plan, bottom-up: Extract //book/price and Extract //book/title → SJoin //book → Select //book/price < 30 → Navigate //book, //book/title]
The Algebra Core • Relational-like operators • XML-specific operators, including SJ (structural join)
Extract Operator • Extract //book/title [Figure: automaton with transitions *, book, and title, running over the stream <bib> <book> <title> Dream Catcher </title> … </book> …]
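One plausible reading of Extract is "buffer the tokens of every element that matches the pattern and emit them as one unit". The sketch below follows that reading. It is not taken from the Raindrop code: the `extract` name, the pattern-as-list encoding, and the path-suffix test standing in for the automaton are all assumptions, and nested matches are not handled.

```python
def extract(tokens, pattern):
    """Yield the token run of each element whose path ends with pattern.

    pattern is a list of tag names, e.g. ["book", "title"] for //book/title.
    """
    path = []              # currently open elements
    buffer = None          # tokens of the element being collected, if any
    depth_at_match = None
    for kind, value in tokens:
        if kind == "start":
            path.append(value)
            if buffer is None and path[-len(pattern):] == pattern:
                buffer, depth_at_match = [], len(path)
        if buffer is not None:
            buffer.append((kind, value))
        if kind == "end":
            if buffer is not None and len(path) == depth_at_match:
                yield buffer            # the matched element is complete
                buffer = None
            path.pop()

# Hand-written token stream for the slide's <bib> fragment (illustrative).
TOKENS = [("start", "bib"), ("start", "book"), ("start", "title"),
          ("text", "Dream Catcher"), ("end", "title"),
          ("end", "book"), ("end", "bib")]

titles = list(extract(TOKENS, ["book", "title"]))
print(titles)
```

The output is one buffered unit per matching `title` element, which is the form a downstream tuple-based operator can consume.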
Structural Join Operator Query: FOR $b IN doc(biditems.xml)//book LET $p := $b/price/text(), $t := $b/title WHERE $p < 30 RETURN <Inexpensive>$t</Inexpensive> • SJoin //book over Extract //book/title and Extract //book/price [Figure: automaton with states 1 to 4 and transitions *, book, title, and price, running over the stream <biditems> <book> <title> Dream Catcher </title> … </book> …]
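A structural join can be sketched as "pair up the titles and prices that were extracted inside the same book". The version below is a simplified assumption, not the authors' algorithm: it triggers the join when `</book>` arrives, combines the buffered values with a Cartesian product, and ignores nested `book` elements. The second book in the token list is invented sample data.

```python
from itertools import product

def sjoin_book(tokens):
    """Emit one tuple per (title, price) pair found inside the same book."""
    titles, prices, path = [], [], []
    for kind, value in tokens:
        if kind == "start":
            path.append(value)
        elif kind == "text":
            if path[-2:] == ["book", "title"]:
                titles.append(value)
            elif path[-2:] == ["book", "price"]:
                prices.append(value)
        elif kind == "end":
            if value == "book":
                # The end of the binding element triggers the join.
                for t, p in product(titles, prices):
                    yield {"t": t, "p": float(p)}
                titles, prices = [], []
            path.pop()

# Illustrative token stream with two books.
TOKENS = [("start", "biditems"),
          ("start", "book"), ("start", "title"), ("text", "Dream Catcher"),
          ("end", "title"), ("start", "price"), ("text", "20"),
          ("end", "price"), ("end", "book"),
          ("start", "book"), ("start", "title"), ("text", "Pricey Book"),
          ("end", "title"), ("start", "price"), ("text", "45"),
          ("end", "price"), ("end", "book"),
          ("end", "biditems")]

joined = list(sjoin_book(TOKENS))
inexpensive = [row["t"] for row in joined if row["p"] < 30]
print(joined)
print(inexpensive)
```

Once the join has produced tuples, the `WHERE $p < 30` test is an ordinary selection over them.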
In or Out? • Should pattern retrieval run in the token-based plan or in the tuple-based plan? [Figure: XML data stream → token-based plan (automata plan) → tuple stream → tuple-based plan → query answer]
Plan Alternatives • The pull-out plan (bottom-up): Extract //book → Navigate /price → Select price < 30 → Navigate book/title → Tagger • The push-in plan (bottom-up): Extract //book/title and Extract //book/price → SJoin //book → Select price < 30 → Tagger
Pattern Retrieval Alternatives • In automata (/title, /price): one automaton with states 1 to 4 and transitions *, book, title, and price retrieves the books, titles, and prices; SJ combines them • Out of automata (/title, /price): an automaton with states 1 and 2 and transitions * and book retrieves only the books; /title and /price are retrieved afterwards from each extracted book Sample input: <book year="2001"> <title>Dream Catcher</title> <author> <last>King</last> <first>S.</first> </author> <publisher>Bt Bound</publisher> <price>20</price> </book> [Figure: buffered <book>, <title>, and <price> elements for each alternative, marked at tokens t2 and t10]
Experiment [Figure: two result panels, selectivity = 5% and selectivity = 90%]
Camp 1: Complete Automata Model [XSQ, XSM, XPush] Query: FOR $x IN $R/a RETURN FOR $y IN $x/b RETURN <res>$y, $x</res> [Figure: the complete automaton for this query, with states labeled from 0,0,0 to 1,2,1 and transitions of the form condition|actions, for example *r=<a>|w(x,sx),w(x,<a>),r++,x”++]
Camp 1: Complete Automata Model [XSQ, XSM, XPush] • All details are presented at the same (low) level • Hard to understand • Not suitable for optimizing at different levels • Automata have been little studied as a query processing paradigm
Camp 2: Automata-Algebra Loosely Coupled Model [Tukwila, YFilter] • Fixed interface for automata computation (all pattern retrieval pushed down) • No opportunity to push computation into, or pull it out of, the automata • Bloated, black-box operator • Algebraic rewriting is impossible for internal optimization [Figure: automata plan computing $b := //book, $p := //book/price, $t := //book/title and passing $b, $p, $t upward]
Contributions • Combining automata and algebra leads to a powerful query processing model • Modeling: • Uniform, simple logical view: better understandability • Optimization: • Uniform rewriting: more optimization opportunities (e.g., push-in/pull-out) • Experiments verify the need for optimization
Project website: http://davis.wpi.edu/dsrg/raindrop/ (project overview, publications, talks) • Email: suhong@cs.wpi.edu
Experiment 2 [Figure: two result panels, number of patterns = 2 and number of patterns = 20]