A Query Algebra for Fragmented XML Stream Data

Sujoe Bose, Leonidas Fegaras, David Levine, Vamsi Chaluvadi
University of Texas at Arlington

Processing Streamed XML Data
Most web servers are pull-based: a client submits a request and the server returns the requested data. This does not scale well to a large number of clients or to large query results.
Alternative method: push-based dissemination
• The server broadcasts or multicasts data in a continuous stream
• The client connects to multiple streams and evaluates queries locally
• No handshaking, no error correction
• All processing is done at the client side
• The only task performed by the server is slicing, scheduling, and broadcasting data:
  • Critical data may be repeated more often than non-critical data
  • Invalid data may be revoked
  • New updates may be broadcast as soon as they become available

A Framework for Processing XML Streams
• The server slices an XML data source into XML fragments. Each fragment:
  • is a filler that fills a hole
  • may contain holes, which can be filled by other fragments
  • is wrapped with control information, such as its unique hole ID, the path that reaches this fragment, etc.
• The client opens connections to streams and evaluates XQueries against these streams
• For large streams, it is a bad idea to reconstruct the streamed data in the client's memory
  • the client needs to process fragments as soon as they become available from the server
• There are blocking operators that require unbounded memory:
  • Sorting
  • Joins between two streams, or self-joins
  • Group-by with aggregation

The Fragmented Hole-Filler Model
  <commodities>
    <vendor>
      <name> Wal-Mart </name>
      <items>
        <stream:hole id="10" tsid="5"/>
        <stream:hole id="20" tsid="5"/>
        ...
      </items>
    </vendor>
    ...
  </commodities>

  <stream:filler id="10" tsid="5">
    <item>
      <name> PDA </name>
      <make> HP </make>
      <model> PalmPilot </model>
      <price currency="USD">315.25</price>
    </item>
  </stream:filler>

  <stream:filler id="20" tsid="5">
    <item>
      <name> Calculator </name>
      <make> Casio </make>
      <model> FX-100 </model>
      <price currency="USD">50.25</price>
    </item>
  </stream:filler>

An Algebra for Stored XML Data
Based on the nested-relational algebra:
• $\mathrm{source}_{v}(T)$: access the XML data source $T$ using $v$
• $\sigma_{pred}(X)$: select fragments from $X$ that satisfy $pred$
• $\pi_{v_1,\ldots,v_n}(X)$: project
• $X \cup Y$: merge
• $X \bowtie_{pred} Y$: join
• $\mu^{pred}_{v,path}(X)$: unnest (retrieve descendants of elements)
• $\Delta^{pred}_{\oplus,h}(X)$: apply $h$ and reduce by $\oplus$
• $\Gamma^{gs,pred}_{v,\oplus,h}(X)$: group by $gs$, apply $h$ to each group, and reduce each group by $\oplus$

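Semantics
  $\mathrm{source}_{v}(T) = \{\, \langle v = T \rangle \,\}$
  $\sigma_{pred}(X) = \{\, t \mid t \in X,\ pred(t) \,\}$
  $\pi_{v_1,\ldots,v_n}(X) = \{\, \langle v_1 = t.v_1, \ldots, v_n = t.v_n \rangle \mid t \in X \,\}$
  $X \cup Y = X \mathbin{+\!\!+} Y$
  $X \bowtie_{pred} Y = \{\, t_x \circ t_y \mid t_x \in X,\ t_y \in Y,\ pred(t_x, t_y) \,\}$
  $\mu^{pred}_{v,path}(X) = \{\, t \circ \langle v = w \rangle \mid t \in X,\ w \in \mathrm{PATH}(t, path),\ pred(t, w) \,\}$
  $\Delta^{pred}_{\oplus,h}(X) = \oplus / \{\, h(t) \mid t \in X,\ pred(t) \,\}$
  $\Gamma^{gs,pred}_{v,\oplus,h}(X) = \ldots$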
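The set-comprehension semantics above map almost directly onto list comprehensions. The following is a minimal sketch, assuming tuples are modelled as Python dicts and bags as lists; the function names and the `books` data are illustrative and not part of the slides. Unnest and group-by are left out to keep the sketch short.

```python
from functools import reduce as fold

def select(pred, xs):
    # sigma: keep the tuples that satisfy pred
    return [t for t in xs if pred(t)]

def project(names, xs):
    # pi: keep only the listed variables of each tuple
    return [{v: t[v] for v in names} for t in xs]

def merge(xs, ys):
    # union on bags is plain concatenation (X ++ Y)
    return xs + ys

def join(pred, xs, ys):
    # concatenate every pair of tuples that satisfies pred
    return [{**tx, **ty} for tx in xs for ty in ys if pred(tx, ty)]

def reduce_by(op, unit, h, pred, xs):
    # Delta: apply h to each qualifying tuple, then fold the results with op
    return fold(op, (h(t) for t in xs if pred(t)), unit)

# Hypothetical data, for illustration only.
books = [{"title": "A", "year": 1994}, {"title": "B", "year": 1990}]
recent = select(lambda t: t["year"] > 1991, books)
titles = project(["title"], recent)
count = reduce_by(lambda a, b: a + b, 0, lambda t: 1, lambda t: True, books)
```

Here `titles` holds the projected tuples of the books published after 1991, and `count` is a reduction with `+` over all tuples.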

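Example #1
[Figure: the XQuery for this example and its algebraic plan. The surviving operator labels of the plan, from root to leaf:]
• reduce: element("book", $b/title)
• select: $b/publisher = "Addison-Wesley" and $b/@year > 1991
• unnest: $b from $v/bib/book
• source: $v bound to document("http://www.bn.com")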

Example #2
  for $u in document("users.xml")//user_tuple
  return <user>
           { $u/name }
           { for $b in document("bids.xml")//bid_tuple[userid=$u/userid]/itemno,
                 $i in document("items.xml")//item_tuple[itemno=$b]
             return <bid> { $i/description/text() } </bid> sortby(.) }
         </user> sortby(name)
[Figure: algebraic plan for this query. Surviving operator labels:]
• result construction: sort($u/name), elem("user", $u/name++); sort, elem("bid", $i/description/text())
• join predicates: $i/itemno = $b; $c/userid = $u/userid
• unnest: $b from $c/itemno; $i from $is/items/item_tuple; $u from $us/users/user_tuple; $c from $bs/bids/bid_tuple
• sources: $is bound to document("items.xml"); $us bound to document("users.xml"); $bs bound to document("bids.xml")

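XPath Expressions
• Path evaluation is central to the algebra:
  $\mathrm{PATH} : (\text{XML-data},\ \text{simple-XPath}) \rightarrow \mathrm{set}(\text{XML-data})$
• Some rules for stored XML data:
  $\mathrm{PATH}(\texttt{<A>}\,x\,\texttt{</A>},\ A/path) = \mathrm{PATH}(x, path)$
  $\mathrm{PATH}(\texttt{<A>}\,x\,\texttt{</A>},\ A) = \{\, \texttt{<A>}\,x\,\texttt{</A>} \,\}$
  $\mathrm{PATH}(x_1\ x_2,\ path) = \mathrm{PATH}(x_1, path) \cup \mathrm{PATH}(x_2, path)$
  $\mathrm{PATH}(x, path) = \emptyset$ otherwise
• Predicates have existential semantics:
  $\$v/A/B = \text{"text"} \;\equiv\; \exists x \in \mathrm{PATH}(v, A/B) : x = \text{"text"}$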
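The four PATH rules above can be read as a small recursive function. This is a sketch under stated assumptions: an element is modelled as a `(tag, children)` tuple, text as a string, a sequence `x1 x2` as a list, and a simple XPath as a list of tag names. The result is a list rather than a set, and `exists_eq` is a hypothetical helper for the existential predicate.

```python
def path(x, steps):
    """PATH(x, steps) for stored XML data."""
    if isinstance(x, list):
        # PATH(x1 x2, p) = PATH(x1, p) U PATH(x2, p)
        out = []
        for node in x:
            out.extend(path(node, steps))
        return out
    if isinstance(x, tuple) and steps and x[0] == steps[0]:
        if len(steps) == 1:
            # PATH(<A>x</A>, A) = { <A>x</A> }
            return [x]
        # PATH(<A>x</A>, A/p) = PATH(x, p)
        return path(x[1], steps[1:])
    # every other case yields the empty set
    return []

def exists_eq(x, steps, text):
    """$v/A/B = "text" holds if some node reached by the path has that text."""
    return any(node[1] == [text] for node in path(x, steps))

# Hypothetical document <A><B>text</B><B>other</B><C>text</C></A>
doc = ("A", [("B", ["text"]), ("B", ["other"]), ("C", ["text"])])
matches = path(doc, ["A", "B"])
```

With this document, `exists_eq(doc, ["A", "B"], "text")` is true because one of the two `B` elements matches, even though the other does not.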

The Streamed XML Algebra
Much like the stored XML algebra, but works on streams. A stream $s$ takes one of two forms:
• $t ; s'$: a fragment $t$ followed by the rest of the stream $s'$
• $eos$: end of stream
Each stored XML algebraic operator has a streamed counterpart, e.g.:
  $\sigma_{pred}(t ; s) = t ; \sigma_{pred}(s)$ if $pred$ is true for $t$
  $\sigma_{pred}(t ; s) = \sigma_{pred}(s)$ otherwise
  $\sigma_{pred}(eos) = eos$
but we may not be able to validate $pred$ due to holes in $t$.

Streamed Algebra Semantics
• To keep the suspended fragments, each streamed algebraic operator has
  • one state $\Sigma_0$ for the output and
  • optional state(s) $\Sigma_1$ / $\Sigma_2$ for the input(s)
• The result of PATH may now be unspecified:
  $\mathrm{PATH}(\texttt{<hole id="m" …>},\ path) = \mathrm{PATH}(\Sigma_1(m),\ path)$ if $m \in \Sigma_1$
  $\mathrm{PATH}(\texttt{<hole id="m" …>},\ path) = \{\, \bot \,\}$ otherwise
• When in predicates, $\bot$ requires 3-valued logic
• Incomplete fragments are suspended when necessary, e.g.:
  $\sigma_{pred}(t ; s) = t ; \sigma_{pred}(s)$ if $true \in \mathrm{PATH}(t, pred)$
  $\sigma_{pred}(t ; s) = \sigma_{pred}(s)$ otherwise
  $\Sigma_0 \leftarrow \Sigma_0 \cup \{t\}$ if $\bot \in \mathrm{PATH}(t, pred)$
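The suspension rule for selection can be sketched as a small stateful operator. This is an illustration under simplifying assumptions, not the algorithm from the slides: a tuple is a dict, a hole is the pair `("hole", m)`, the predicate tests a single field, and the sentinel `UNKNOWN` stands in for the unspecified value. Re-evaluating suspended tuples when a filler arrives is also an assumption; the slides only state that incomplete fragments are suspended.

```python
UNKNOWN = object()  # stands in for the unspecified value of 3-valued logic

class StreamSelect:
    def __init__(self, field, test):
        self.field, self.test = field, test
        self.fillers = {}    # input state: filler id -> value
        self.suspended = []  # output state: tuples whose predicate is unknown

    def _value(self, t):
        v = t[self.field]
        if isinstance(v, tuple) and v[0] == "hole":
            # follow the hole if its filler has arrived, otherwise unknown
            return self.fillers.get(v[1], UNKNOWN)
        return v

    def on_tuple(self, t):
        """Return the tuples emitted for t; suspend t if the predicate is unknown."""
        v = self._value(t)
        if v is UNKNOWN:
            self.suspended.append(t)
            return []
        return [t] if self.test(v) else []

    def on_filler(self, m, value):
        """Record a filler and retry every suspended tuple (assumed behaviour)."""
        self.fillers[m] = value
        pending, self.suspended = self.suspended, []
        out = []
        for t in pending:
            out.extend(self.on_tuple(t))
        return out

# Hypothetical usage: select items priced above 100.
sel = StreamSelect("price", lambda p: p > 100)
pda = {"name": "PDA", "price": 315.25}
calc = {"name": "Calculator", "price": 50.25}
late = {"name": "Laptop", "price": ("hole", 7)}
first = sel.on_tuple(pda)
second = sel.on_tuple(calc)
third = sel.on_tuple(late)
waiting = list(sel.suspended)
resumed = sel.on_filler(7, 200.0)
```

The tuple with the unfilled hole is neither emitted nor dropped. It waits in the output state until filler 7 arrives, at which point the predicate can be decided.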

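Join
Much like main-memory symmetric join
• states:
  • $\Sigma_0$: all suspended output tuples due to unfilled holes
  • $\Sigma_1$: all tuples from the left stream
  • $\Sigma_2$: all tuples from the right stream
• a tuple from the left stream:
  $(t_1 ; s_1) \bowtie_{pred} s_2 = \{\, t_1 \circ t_2 \mid t_2 \in \Sigma_2,\ true \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\} ; (s_1 \bowtie_{pred} s_2)$
  $\Sigma_1 \leftarrow \Sigma_1 \cup \{t_1\}$
  $\Sigma_0 \leftarrow \Sigma_0 \cup \{\, t_1 \circ t_2 \mid t_2 \in \Sigma_2,\ \bot \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\}$
• a tuple from the right stream:
  $s_1 \bowtie_{pred} (t_2 ; s_2) = \{\, t_1 \circ t_2 \mid t_1 \in \Sigma_1,\ true \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\} ; (s_1 \bowtie_{pred} s_2)$
  $\Sigma_2 \leftarrow \Sigma_2 \cup \{t_2\}$
  $\Sigma_0 \leftarrow \Sigma_0 \cup \{\, t_1 \circ t_2 \mid t_1 \in \Sigma_1,\ \bot \in \mathrm{PATH}(t_1 \circ t_2, pred) \,\}$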
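The three join states can be sketched as follows. This is a nested-loop illustration rather than a tuned implementation, and it assumes that tuples are dicts and that the predicate returns `True`, `False`, or `None` for the unknown case. The class name, the `same_user` predicate, and the sample tuples are hypothetical.

```python
class SymmetricJoin:
    def __init__(self, pred):
        self.pred = pred     # pred(t1, t2) -> True, False, or None (unknown)
        self.left = []       # all tuples seen on the left stream
        self.right = []      # all tuples seen on the right stream
        self.suspended = []  # joined tuples whose predicate is still unknown

    def _probe(self, pairs):
        out = []
        for t1, t2 in pairs:
            verdict = self.pred(t1, t2)
            if verdict is None:
                self.suspended.append({**t1, **t2})  # cannot decide yet
            elif verdict:
                out.append({**t1, **t2})             # emit the joined tuple
        return out

    def on_left(self, t1):
        # remember t1, then probe it against everything seen on the right
        self.left.append(t1)
        return self._probe([(t1, t2) for t2 in self.right])

    def on_right(self, t2):
        # remember t2, then probe it against everything seen on the left
        self.right.append(t2)
        return self._probe([(t1, t2) for t1 in self.left])

def same_user(u, b):
    # None models a value hidden behind an unfilled hole
    if u["userid"] is None or b["bid_userid"] is None:
        return None
    return u["userid"] == b["bid_userid"]

j = SymmetricJoin(same_user)
r1 = j.on_left({"userid": 1, "name": "Ann"})
r2 = j.on_right({"bid_userid": 1, "itemno": 10})
r3 = j.on_right({"bid_userid": None, "itemno": 12})
r4 = j.on_left({"userid": 2, "name": "Bob"})
```

Because both inputs are retained, the result does not depend on which stream delivers its tuple first. The cost is that both states grow with the streams, which is why joins appear among the blocking operators listed earlier.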

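Reconstructing the XML Data
$\rho : \mathrm{set}(\mathrm{int} \times \text{XML-data})$ is an environment that binds filler ids to XML.
$x \triangleright \rho$ replaces holes with fillers in $x$ using the environment $\rho$:
  $\texttt{<A>}\,x\,\texttt{</A>} \triangleright \rho = \texttt{<A>}\,(x \triangleright \rho)\,\texttt{</A>}$
  $(x_1\ x_2) \triangleright \rho = (x_1 \triangleright \rho)\ (x_2 \triangleright \rho)$
  $\texttt{<hole id="m" …>} \triangleright \rho = \rho[m]$ if $m \in \rho$
  $x \triangleright \rho = x$ otherwise
$R(s)$ returns a pair $(a, \rho)$, where $\rho$ is the environment and $a$ is $\rho[0]$ (the reconstructed data). Basically, $R(t ; s) = f(R(s))$:
  $R(eos) = (\emptyset, \emptyset)$
  if $R(s) = (a, \rho)$ then
  $R(\texttt{<filler id="m">}\,x\,\texttt{</filler>} ; s) = \begin{cases} (x \triangleright \rho,\ \rho) & \text{if } m = 0 \\ (a',\ \rho') & \text{if } m \neq 0 \end{cases}$
  where $\rho' = \{(m,\ x \triangleright \rho)\} \cup \rho[m/x]$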
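The hole-replacement step can be illustrated with a batch version of the reconstruction. This sketch first collects every filler into an environment and then substitutes holes starting from filler 0, so it is simpler than the recursive definition of R over the stream. It assumes elements are `(tag, children)` tuples, text is a string, and a hole is the dict `{"hole": m}`; these encodings and the sample fillers are illustrative.

```python
def fill(x, env):
    """Replace every hole in x whose id is bound in env; returns a list of nodes."""
    if isinstance(x, dict):                 # a hole
        m = x["hole"]
        # a bound hole is replaced by its (recursively filled) filler,
        # an unbound hole is left in place
        return fill(env[m], env) if m in env else [x]
    if isinstance(x, tuple):                # an element
        tag, children = x
        return [(tag, [y for c in children for y in fill(c, env)])]
    if isinstance(x, list):                 # a sequence of nodes
        return [y for c in x for y in fill(c, env)]
    return [x]                              # text

def reconstruct(fillers):
    """Bind filler ids to content, then rebuild the document from filler 0."""
    env = {}
    for m, x in fillers:
        env[m] = x                          # a repeated id replaces the earlier filler
    return fill(env[0], env)

# Fillers may arrive out of order; id 0 is the root fragment.
fillers = [
    (10, ("item", [("name", ["PDA"])])),
    (0, ("vendor", [("name", ["Wal-Mart"]),
                    ("items", [{"hole": 10}, {"hole": 20}])])),
    (20, ("item", [("name", ["Calculator"])])),
]
doc = reconstruct(fillers)
partial = reconstruct(fillers[:2])
```

In `partial`, the filler for hole 20 has not arrived, so that hole stays in the output. This mirrors the rule that leaves `x` unchanged when its id is not bound in the environment.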

Equivalence Between Stored & Streamed Algebras
If we reconstruct the XML document from the streamed fragments and evaluate a query using the stored algebra, we get the same result as when we use the equivalent streamed algebra over the streamed XML fragments and reconstruct the result.
[Diagram: XML fragments, reconstruction, XML document, stored XML algebra, result; and XML fragments, streamed XML algebra, XML fragments, reconstruction, result]
Proof sketch: We prove $R(\bar{p}(s)) = p(R(s))$ inductively, where $\bar{p}$ is the stream version of $p$. If $true \in \mathrm{PATH}(t, pred)$, then
  $R(\bar{p}(t ; s)) = R(t ; \bar{p}(s)) = f(R(\bar{p}(s))) = f(p(R(s))) = p(f(R(s))) = p(R(t ; s))$ …

Conclusion
• Fragmented XML data are easier to handle and synchronize than an infinitely long stream
• Associating holes with fillers takes care of out-of-sequence transmission, repetitions, replacements, and removals
• Our streamed algebra has similar operators but different semantics than our stored algebra
• Our algebra can capture most non-recursive XQueries
• Our future work includes
  • the development of main-memory algorithms for processing XML data streams under memory and power constraints
  • the development of a comprehensive approach to optimizing XQueries that utilizes our main-memory algorithms
