Efficient Evaluation of Regular Path Expressions on Streaming XML Data

Efficient Evaluation of Regular Path Expressions on Streaming XML Data • By Zachary G. Ives, Alon Y. Levy, and Daniel S. Weld

Table of Contents • A bit about XML (yes, again) • Our goal, problem and solution • Our XML data model • How to ask questions?

Table of Contents (Cont.) • X-Scan operation and structure • Digging deep into X-Scan • How good is it? – Performance evaluation • Conclusion

A Bit About XML (yes, again) • XML – the eXtensible Markup Language • Has become a standard • Useful for the dissemination and exchange of information

A Bit About XML (yes, again) • Advantages • Simple • Self-describing nature • Flexible • Represents both structured and semi-structured data

XML Structure • Consists of : • Elements – pairs of matching open and close tags. • Elements may enclose additional elements or data values. • Attributes – included in element tags. • Attributes are single-valued and describe the element.

XML Structure (Cont.) • ID is a special attribute that uniquely identifies the element. • An IDREF attribute links the element to other elements in the document. • Combining ID and IDREF forms a graph structure rather than just a tree structure.

XML Example • We will use this example throughout the rest of the lecture
<db>
  <lab ID="baselab" manager="smith1">
    <name>Seattle Bio Lab</name>
    <location>
      <city>Seattle</city>
      <country>USA</country>
    </location>
  </lab>
  <lab ID="lab2">
    <name>PMBL</name>
    <city>Philadelphia</city>
    <country>USA</country>
  </lab>
  <paper ID="Smith991231" source="baselab" biologist="smith1">
    <title>Autocatalysis of Spect…</title>
    …
  </paper>
  <biologist ID="smith1">
    <lastname>Smith</lastname>
    …
  </biologist>
</db>

Our Goal • Our goal is to perform queries and search operations on the XML document. • Several query languages have been proposed. • They represent the XML document as a graph.

Our Goal (Cont.) • They represent the query as a regular path expression to be matched against the XML source. • These regular path expressions describe traversals along edges in the XML graph. • The variables in the query are mapped to XML elements along these paths.

Our Problem • Most XML query processors work in three steps: • Load the data into a local repository • Build indexes on the repository • Process the query • The repository is one of: • A relational database • An object-oriented database • A repository of semi-structured data

Our Problem (Cont.) • Storing and indexing the data locally is expensive. • This is especially true when the query runs over streams of incoming XML. • The streams can come from many sources, some fast and some slow. • Sometimes we want a partial answer, but as soon as possible.

Our Solution • The query can be evaluated while the data streams in. • The XML-Scan (X-Scan) operator does exactly that. • It is used at the lowest level of the query plan and supplies data to the other operators.

The X-Scan Operator • Input: • An XML data stream. • A set of regular path expressions. • Output: • A stream of bindings for the variables occurring in the expressions. • The bindings are produced incrementally, as the XML data is streaming in.

The X-Scan Operator (Cont.) • The entire graph can be constructed in a single pass. • X-Scan simultaneously: • Parses the XML data. • Indexes nodes by their IDs. • Resolves IDREFs. • Returns the nodes that match the path expressions of the query.

The X-Scan Operator (Cont.) • Some issues X-Scan has to handle: • Dealing with possibly cyclic data • Preserving the order of elements • Removing duplicate bindings that are generated by multiple paths to the same element

Data Model for XML • Naturally, the XML data model is a graph. • Each XML tag is an edge labeled with the tag name. • The edge is directed to a node whose label is the element's ID (if the element has no ID, the node gets a number). • A given element node has labeled edges directed to its attribute values, its sub-elements, and any other elements it references via IDREF.

Data Model for XML (Cont.) • Example is always the best way
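A minimal Python sketch of this graph model, under two assumptions that the slides do not state: nodes without an ID are numbered in document order (this happens to reproduce the node numbers used later in the walk-through), and, since no DTD is available, any non-ID attribute whose value equals a known ID is treated as an IDREF. The name `build_edges` and the pseudo-node `root` are illustrative, the elided parts of the example are left out, and data values are not modeled.

```python
import xml.etree.ElementTree as ET
from itertools import count

DOC = """<db>
  <lab ID="baselab" manager="smith1">
    <name>Seattle Bio Lab</name>
    <location><city>Seattle</city><country>USA</country></location>
  </lab>
  <lab ID="lab2">
    <name>PMBL</name><city>Philadelphia</city><country>USA</country>
  </lab>
  <biologist ID="smith1"><lastname>Smith</lastname></biologist>
</db>"""


def build_edges(xml_text):
    """Return (source, label, target) edges of the XML data graph."""
    root = ET.fromstring(xml_text)
    numbers = count(1)
    # Node label: the element's ID, or a running number if it has none.
    name = {el: el.get("ID") or f"#{next(numbers)}" for el in root.iter()}
    ids = {el.get("ID") for el in root.iter() if el.get("ID")}
    edges = [("root", root.tag, name[root])]
    for el in root.iter():
        for attr, value in el.attrib.items():
            if attr != "ID" and value in ids:  # assumed to be an IDREF
                edges.append((name[el], attr, value))
        for child in el:                       # one edge per sub-element
            edges.append((name[el], child.tag, name[child]))
    return edges


EDGES = build_edges(DOC)
```

The `manager` attribute of `baselab` becomes an ordinary labeled edge to `smith1`, which is what turns the tree into a graph.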

How to ask questions? • A variety of query languages have been proposed. • The key feature in all of these languages is the use of regular path expressions over the data. • Most of them also return the answer to the query as an XML document. • X-Scan uses XML-QL.

The XML-QL Syntax • The syntax of XML-QL is:
WHERE pattern1 IN source1, pattern2 IN source2, …
CONSTRUCT result
• Each pattern (pattern1, pattern2, …) is a template that is matched against the XML data graph from its source (source1, source2, …), and the resulting tuples are formatted as described in result.

The XML-QL Syntax (Cont.) • An XML-QL pattern is a set of nested tags with embedded variable names (prefixed by $) that specify bindings of graph nodes to variables. • The CONSTRUCT clause specifies a tree-structured set of edges and nodes to add to the output graph for each tuple of variable bindings.

The XML-QL Syntax (Cont.) • Again, an example is the best way • Let's look at:
WHERE <db>
        <lab>
          <name>$n</>
          <_*><city>$c</></>
        </> ELEMENT_AS $l
      </> IN "fig1.xml"
CONSTRUCT <result>
            <center>
              <name>$n</>
              <location>$c</>
            </>
          </>

The XML-QL Syntax (Cont.) • As we can see, the result will be:
<result>
  <center>
    <name>Seattle Bio Lab</name>
    <location>Seattle</location>
  </center>
  <center>
    <name>PMBL</name>
    <location>Philadelphia</location>
  </center>
</result>
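To make the semantics of the query concrete, here is a rough Python equivalent using the standard `xml.etree.ElementTree` module. It is only a sketch of what the query computes, not of how X-Scan evaluates it: it loads the whole document first, treats it as a tree (so `_*` never follows an IDREF), and uses only the lab portion of the example. The name `run_query` is illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<db>
  <lab ID="baselab" manager="smith1">
    <name>Seattle Bio Lab</name>
    <location><city>Seattle</city><country>USA</country></location>
  </lab>
  <lab ID="lab2">
    <name>PMBL</name><city>Philadelphia</city><country>USA</country>
  </lab>
</db>"""


def run_query(xml_text):
    """Evaluate the WHERE pattern and build the CONSTRUCT result."""
    db = ET.fromstring(xml_text)           # root element, expected to be <db>
    result = ET.Element("result")
    if db.tag != "db":
        return result
    for lab in db.findall("lab"):          # $l: each lab directly under db
        for name in lab.findall("name"):   # $n: name directly under the lab
            for city in lab.iter("city"):  # $c: city at any depth under the lab
                center = ET.SubElement(result, "center")
                ET.SubElement(center, "name").text = name.text
                ET.SubElement(center, "location").text = city.text
    return result


OUTPUT = ET.tostring(run_query(DOC), encoding="unicode")
```

`OUTPUT` holds the same two `center` elements as the result above, serialized without whitespace.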

The XML-QL Syntax (Cont.) • If a variable is bound to a node with sub-elements, the whole sub-graph is inserted into the resulting graph. • We will use dot-notation to describe the X-Scan operation. • The previous example can be rewritten as: • El = root."db"."lab" • En = El."name" • Ec = El._*."city"
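The dot-notation can be read as a small regular expression over edge labels. The sketch below is an assumption about one convenient encoding, not something the slides prescribe: each variable maps to its parent variable and a list of steps, where `"_*"` stands for any sequence of zero or more edges. The names `PATHS` and `matches` are hypothetical.

```python
# variable -> (variable it starts from, steps of its path expression)
PATHS = {
    "l": ("root", ["db", "lab"]),   # El = root."db"."lab"
    "n": ("l", ["name"]),           # En = El."name"
    "c": ("l", ["_*", "city"]),     # Ec = El._*."city"
}


def matches(steps, labels):
    """True if the sequence of edge labels is accepted by the path expression."""
    if not steps:
        return not labels
    head, rest = steps[0], steps[1:]
    if head == "_*":
        # The wildcard matches zero edges, or one edge and then tries again.
        return matches(rest, labels) or (bool(labels) and matches(steps, labels[1:]))
    return bool(labels) and labels[0] == head and matches(rest, labels[1:])
```

For example, `matches(PATHS["c"][1], ["location", "city"])` and `matches(PATHS["c"][1], ["city"])` both hold, which is why the Seattle lab (city nested under location) and PMBL (city directly under the lab) both produce a binding for `$c`.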

The X-Scan Place • The goal of the X-Scan operator is therefore to produce a set of bindings for each pattern in the WHERE clause.

So, What Does X-Scan Do? • Given the XML stream and a set of regular path expressions, it outputs a stream of tuples assigning binding values to each variable in the set of regular path expressions. • The central mechanism is a set of state machines that traverse the XML graph, trying to satisfy the path expressions.

What is it made of ? • The data components of X-Scan are

Where Does the Data Flow? • As the data streams into the system, several structures are created: • The data is parsed and stored locally • A structural index of the XML graph is created • An ID index records the IDs of all elements and their locations in the structural index • A list of references to not-yet-seen element IDs is maintained

Where Does the Data Flow? (Cont.) • In parallel with the creation of those structures, a set of finite state machines performs a depth-first search (DFS) over the partial structural index. • When a machine reaches an accepting state, a new value is added to the binding-value table of that machine. • Those values are later combined to form the complete output.

Example Problems • It sounds easy, yet there are some problems to deal with, for example: • How to handle cycles? • How to prune duplicate bindings as they are created? Remember that X-Scan is an online operator.
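One common way to address both problems at once is to remember which (node, position in the path expression) pairs have already been explored. The sketch below illustrates that idea on a complete in-memory graph; it is an assumption for illustration and not necessarily the mechanism X-Scan itself uses. The function `reach`, the node name `smith-lastname`, and the graph `CYCLIC` are hypothetical.

```python
def reach(graph, start, steps):
    """Nodes reachable from start along a path matching steps, each reported once."""
    found, seen = [], set()

    def dfs(node, pos):
        if (node, pos) in seen:      # explored before: a cycle or a second path
            return
        seen.add((node, pos))
        if pos == len(steps):        # whole expression matched
            found.append(node)
            return
        if steps[pos] == "_*":
            dfs(node, pos + 1)       # wildcard matches zero edges ...
            for _label, target in graph.get(node, []):
                dfs(target, pos)     # ... or one more edge with any label
            return
        for label, target in graph.get(node, []):
            if label == steps[pos]:
                dfs(target, pos + 1)

    dfs(start, 0)
    return found


# Fragment of the example: the paper reaches smith1 along two different paths.
GRAPH = {
    "Smith991231": [("source", "baselab"), ("biologist", "smith1")],
    "baselab": [("manager", "smith1")],
    "smith1": [("lastname", "smith-lastname")],
}
# Hypothetical cyclic graph: a and b point at each other.
CYCLIC = {"a": [("next", "b")], "b": [("next", "a"), ("city", "c")]}
```

`reach(GRAPH, "Smith991231", ["_*", "lastname"])` reports the last-name node once even though two paths lead to it, and `reach(CYCLIC, "a", ["_*", "city"])` terminates despite the cycle.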

The State Machines • As described earlier, we create one regular path expression, in dot-notation, for every variable in the query. • We then build a finite-state machine for each expression. • State transitions correspond to edge traversals in the XML data graph.

The State Machines (Cont.) • The end of a path expression yields an accepting state, which outputs instances of the corresponding variable. • When one variable depends on another, the accepting state of the other variable's machine points to the state machine of the dependent variable.

The State Machines (Cont.) • And back to our example
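A hedged sketch of how such machines could be built from the dot-notation expressions. The class `PathMachine` and its fields are hypothetical; states are numbered so that they line up with the state numbers used in the walk-through (Ml: 1 to 3, Mn: 4 to 5, Mc: 6 to 7 with a self-loop on 6 for `_*`). Because `_*` makes the machine nondeterministic, `step` works on sets of states.

```python
class PathMachine:
    """Finite-state machine for one path expression such as ["_*", "city"]."""

    def __init__(self, var, steps, first_state):
        self.var = var
        self.start = first_state
        self.loops = set()   # states with a wildcard self-loop
        self.edges = {}      # (state, label) -> next state
        state = first_state
        for step in steps:
            if step == "_*":
                self.loops.add(state)
            else:
                self.edges[(state, step)] = state + 1
                state += 1
        self.accept = state  # the end of the expression is the accepting state

    def step(self, states, label):
        """Set of states reached from `states` by traversing one edge."""
        nxt = set()
        for s in states:
            if s in self.loops:
                nxt.add(s)                        # stay on the wildcard loop
            if (s, label) in self.edges:
                nxt.add(self.edges[(s, label)])   # take the labeled transition
        return nxt


Ml = PathMachine("l", ["db", "lab"], first_state=1)
Mn = PathMachine("n", ["name"], first_state=4)
Mc = PathMachine("c", ["_*", "city"], first_state=6)
# Dependency links: when Ml accepts, Mn and Mc are activated.
DEPENDENTS = {"l": [Mn, Mc]}
```

For instance, `Mc.step({6}, "location")` stays in `{6}`, and `Mc.step({6}, "city")` gives `{6, 7}`, which contains the accepting state 7.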

Indexing the XML Graph • The structural index should allow X-Scan to traverse the XML data graph quickly. • Each node in the index contains: • The ID of the element and its offset in the document • Pointers to all the sub-elements, attributes and IDREFs of the element. • Essentially it looks like the graph, except for the leaves.
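As a rough illustration, an index node can be modeled as a small record. The class `IndexNode`, its field names, and the offsets used below are all hypothetical; the slides only state what information a node holds.

```python
from dataclasses import dataclass, field


@dataclass
class IndexNode:
    node_id: str                                # element ID, or an assigned number
    offset: int                                 # position of the element in the document
    edges: list = field(default_factory=list)   # (label, IndexNode) pointers

    def add_edge(self, label, target):
        self.edges.append((label, target))

    def follow(self, label):
        """Targets of all outgoing edges with the given label."""
        return [target for lbl, target in self.edges if lbl == label]


# Offsets are made-up values for illustration only.
db = IndexNode("#1", offset=0)
baselab = IndexNode("baselab", offset=5)
lab2 = IndexNode("lab2", offset=160)
db.add_edge("lab", baselab)
db.add_edge("lab", lab2)
```

Keeping only IDs, offsets and pointers means the data values themselves (the leaves) stay in the locally stored document and are fetched through the offset when needed.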

The Algorithm – Step by Step • X-Scan proceeds by building the structural index and running a set of active state machines in parallel. • The core of the algorithm is the way those state machines run, so let's focus on that by running our example.

The Algorithm – Step by Step • Initially, only the top level machine is active. • When a machine M reaches an accepting state, it produces a binding b for its variable, writes it and the parent value to its table and activates all of its dependent state machines.

The Algorithm – Step by Step • Those machines remain active while x-scan is scanning b or any element accessible by a path from b. • The final output of x-scan is the equi-join of all the appropriate tables.
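The final join can be sketched as follows, assuming each dependent machine's table stores (value, parent value) pairs as described above. The table contents are the bindings produced in the walk-through of the example; the function name `join_bindings` is illustrative.

```python
def join_bindings(l_table, n_table, c_table):
    """Equi-join the dependent tables with the parent table on the parent value."""
    tuples = []
    for l in l_table:
        for n, n_parent in n_table:
            if n_parent != l:
                continue
            for c, c_parent in c_table:
                if c_parent == l:
                    tuples.append({"l": l, "n": n, "c": c})
    return tuples


L_TABLE = ["baselab", "lab2"]
N_TABLE = [("#2", "baselab"), ("#6", "lab2")]
C_TABLE = [("#4", "baselab"), ("#7", "lab2")]
RESULT = join_bindings(L_TABLE, N_TABLE, C_TABLE)
```

`RESULT` contains the two tuples of the example, l/baselab with n/#2 and c/#4, and l/lab2 with n/#6 and c/#7. A streaming implementation would emit each tuple as soon as its last component arrives rather than joining complete tables at the end.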

The Algorithm – By Example • Ml is initialized in state 1 as the only active machine.

The Algorithm – By Example • The root has a "db" edge, so the machine's current state is pushed onto its stack and the machine moves to state 2 with value node #1

The Algorithm – By Example • Next, X-Scan follows the first outgoing edge, pushes the old state and value, and sets Ml to state 3 with value baselab

The Algorithm – By Example • Since Ml is now in an accepting state: • the baselab value is written to the Ml table • Ml is suspended • Mn and Mc are activated

The Algorithm – By Example • The next edge takes Mn from state 4 to 5 • Mc follows the loop back to state 6 • Both machines have #2 as their binding value

The Algorithm – By Example • Since Mn is now in an accepting state, x-scan writes <#2,baselab> into Mn's table. • Since no edges remain for exploration, x-scan pops the stack and backs up the state machines, resetting Mn to state 4 and Mc to state 6

The Algorithm – By Example • The next edge is labeled location, so: • Mn stays in state 4 • Mc also stays in state 6 but advances to node #3 • Then Mc advances to state 7 on the city edge to node #4

The Algorithm – By Example • At this point x-scan writes <#4,baselab> into Mc’s table. • It can also produce the first tuple of bindings <l/baselab,n/#2,c/#4>

The Algorithm – By Example • X-Scan keeps running Mc but no more cities are found • It pops back up to baselab • Running Mc along the IDREF to smith1 gives no more cities

The Algorithm – By Example • Now, Mn and Mc are deactivated and control returns to Ml • X-scan pops back up to node #1 in state 2 • The other lab edge yields another tuple <l/lab2,n/#6,c/#7>

Where should we go? • On occasion x-scan will encounter an IDREF to a node that has not yet been parsed. • Such an unknown node will simply not be in the ID index.

Where should we go? • When X-Scan hits such an unseen reference, it: • Pauses all the relevant state machines • Adds an entry to the list of unresolved IDREFs • <desired ID value, referrer's address> • Continues to parse and build the structural index

Where should we go? • Once the target element is parsed, x-scan: • Fills its address into each referring IDREF in the structural index • Removes the entry from the list of unresolved IDREFs • Awakens the state machines and proceeds
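The bookkeeping for unresolved references can be sketched as below. This is a simplified illustration under stated assumptions: index nodes are plain dicts, the class `IdResolver` and its method names are hypothetical, and pausing or awakening state machines is reduced to return values that a caller would act on.

```python
class IdResolver:
    """ID index plus the list of references to not-yet-seen IDs."""

    def __init__(self):
        self.id_index = {}     # ID -> index node
        self.unresolved = []   # (desired ID, referrer node, edge label)

    def add_reference(self, referrer, label, target_id):
        """Link referrer to target if known; return False if it must wait."""
        target = self.id_index.get(target_id)
        if target is not None:
            referrer["edges"].append((label, target))
            return True
        self.unresolved.append((target_id, referrer, label))
        return False           # caller pauses the relevant state machines

    def register(self, node):
        """Record a newly parsed element; return IDs of referrers to awaken."""
        self.id_index[node["id"]] = node
        awakened, still_waiting = [], []
        for target_id, referrer, label in self.unresolved:
            if target_id == node["id"]:
                referrer["edges"].append((label, node))  # fill in the address
                awakened.append(referrer["id"])
            else:
                still_waiting.append((target_id, referrer, label))
        self.unresolved = still_waiting
        return awakened


# In the example, baselab refers to smith1 before the biologist is parsed.
resolver = IdResolver()
baselab = {"id": "baselab", "edges": []}
smith1 = {"id": "smith1", "edges": []}
resolver.register(baselab)
known = resolver.add_reference(baselab, "manager", "smith1")
pending = len(resolver.unresolved)
awakened = resolver.register(smith1)
```

After the last call, `baselab` has a `manager` edge pointing at the `smith1` node and the unresolved list is empty again.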
