Containment and Equivalence for an XPath Fragment

Containment and Equivalence for an XPath Fragment. By Gerome Miklau and Dan Suciu. Presented by Roy Ionas.

SEMINAR OBJECTIVES • PRESENTING THE PROBLEM OF NON-POLYNOMIAL COMPLEXITY FOR CONTAINMENT AND EQUIVALENCE OF XPATH FRAGMENTS. • PRESENTING TWO ALGORITHMS THAT REDUCE THE COST OF THE XPATH CONTAINMENT AND EQUIVALENCE PROBLEM. • PRESENTING TREE PATTERNS AS AN EFFECTIVE TOOL FOR PROVING PROPERTIES OF XPATH FRAGMENTS.

SO WHAT IS XPath? • A simple language for navigating XML documents and selecting sets of nodes. • With XPath we can query XML data, describe key constraints, express transformations, and reference elements in remote documents. • XPath's influence is visible in other XML query languages and standards such as XQuery, XSLT, XML Schema, XLink, XPointer, and more.

DEFINITIONS • Simple XPath fragment. • Containment between two XPath fragments. • Equivalence between two XPath fragments. • Computability definitions. • Tree patterns as a proof tool for XPath fragments.

Simple XPath fragment • An XPath expression. • Contains the three most important navigation features: • Child and descendant axes: “/” and “//”. • Wildcards: “*”. • Qualifiers: “[]”. • We disregard attributes, conditions... • We identify and compare nodes only by their label. • We disregard order completely. • Example: a//*[b//d][c]

Simple XPath fragment • Are these all the features we have in XPath? NO! • Are these all the features we need for representing navigation in XML documents? YES! At least, they are the ones needed for the proofs in this article.

Containment • XPath fragment A is contained in XPath fragment B if, for every XML document, the result of applying A is contained in the result of applying B. • A result is a set of nodes; order is not considered. • Can we test this containment over the entire world of XML documents? • Is there another way to determine containment between two XPath fragments?

Equivalence • XPath fragments A and B are equivalent if, for every XML document, the result of applying A equals the result of applying B. • The problem of equivalence can be reduced to the problem of containment. • Equivalence = containment in both directions between the patterns. • Conversely, containment can be decided by a polynomial-time procedure that uses an equivalence test. • From now on we discuss only containment; the results are valid for equivalence as well.

Computability Definitions • NP stands for “Nondeterministic Polynomial”. • P class - the class of decision problems that can be solved in polynomial time, i.e. for which an efficient algorithm exists. • NP class - the class of decision problems whose solutions can be verified in polynomial time; for the hardest of them no polynomial-time algorithm has been found (yet), and the best known algorithms take exponential time. • coNP class - the class of problems whose complement is in NP. • NP-hard problem - a problem to which every NP problem can be reduced in polynomial time (at least as hard as anything in NP). • NP-complete problem - a problem that belongs to NP and is itself NP-hard.

Tree Patterns • An unordered tree over the alphabet of the XPath expression. • XPath node tests become nodes of the tree pattern. • Child axes become edges. • Descendant axes become edges drawn with double lines. • A k-tuple of nodes is called the result tuple. • For a tree pattern P, the arity of the result tuple is called the arity of P. • A tree pattern P is Boolean iff its arity is 0.

Tree Patterns • Tree patterns are more elegant and general than XPath fragments. • We can translate from XPath to tree patterns and vice versa quite easily. • Now we can prove properties using graph theory.

Tree Pattern - example • The XPath expression a//*[b//d][c] corresponds to the following tree: • Root: a. • A descendant edge from a to the wildcard *. • Child edges from * to b and to c. • A descendant edge from b to d.
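The translation from an expression to a tree pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `PatternNode`, `parse`, and `render` are hypothetical, qualifiers are always attached with a child edge (a qualifier that itself starts with `//` is not handled), and the result node is taken to be the last step of the main path.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PatternNode:
    label: str  # element name, or "*" for a wildcard
    # Outgoing edges as (axis, child) pairs; axis is "/" or "//".
    edges: List[Tuple[str, "PatternNode"]] = field(default_factory=list)


TOKEN = re.compile(r"//|/|\[|\]|\*|[A-Za-z_][\w.-]*")


def tokenize(expr: str) -> List[str]:
    compact = expr.replace(" ", "")
    tokens = TOKEN.findall(compact)
    if "".join(tokens) != compact:
        raise ValueError(f"unsupported syntax in {expr!r}")
    return tokens


def parse(expr: str) -> Tuple[PatternNode, PatternNode]:
    """Return (root, result_node) for an expression of the simple fragment."""
    tokens = tokenize(expr)

    def parse_step(pos: int) -> Tuple[PatternNode, int]:
        if pos >= len(tokens) or tokens[pos] in ("/", "//", "[", "]"):
            raise ValueError("expected a name or '*'")
        node = PatternNode(tokens[pos])
        pos += 1
        while pos < len(tokens) and tokens[pos] == "[":
            branch, _, pos = parse_path(pos + 1)
            if pos >= len(tokens) or tokens[pos] != "]":
                raise ValueError("missing ']'")
            node.edges.append(("/", branch))  # assumption: qualifier = child branch
            pos += 1
        return node, pos

    def parse_path(pos: int) -> Tuple[PatternNode, PatternNode, int]:
        first, pos = parse_step(pos)
        last = first
        while pos < len(tokens) and tokens[pos] in ("/", "//"):
            axis = tokens[pos]
            nxt, pos = parse_step(pos + 1)
            last.edges.append((axis, nxt))
            last = nxt
        return first, last, pos

    root, result, pos = parse_path(0)
    if pos != len(tokens):
        raise ValueError("trailing input")
    return root, result


def render(node: PatternNode, axis: str = "", depth: int = 0) -> List[str]:
    """Indented text view: each line is the incoming axis plus the label."""
    lines = ["  " * depth + axis + node.label]
    for child_axis, child in node.edges:
        lines.extend(render(child, child_axis, depth + 1))
    return lines


root, result = parse("a//*[b//d][c]")
print("\n".join(render(root)))
```

For the example expression the sketch prints the root a, a `//` edge to `*`, `/` edges from `*` to b and c, and a `//` edge from b to d; the result node is the wildcard.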

Usage of Tree Patterns for navigating in XML trees • An embedding maps the nodes of a tree pattern to nodes of an XML tree. • Think of it as a function that must: • Preserve the root. • Respect node labels (a wildcard matches any label). • Respect edge relationships (child edges map to child edges, descendant edges to descendant paths). • After embedding, return the nodes matched by the pattern's result nodes, together with their subtrees. • For Boolean patterns, return true iff such an embedding exists.
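A minimal sketch of the Boolean case, assuming a bare-bones XML tree of labeled nodes. The classes `XmlNode` and `PatternNode` and the function names are hypothetical and exist only for illustration. The check walks the pattern top-down and asks, for every pattern edge, whether some child (for `/`) or some proper descendant (for `//`) of the current XML node can host the sub-pattern.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class XmlNode:
    label: str
    children: List["XmlNode"] = field(default_factory=list)


@dataclass
class PatternNode:
    label: str  # element name, or "*" for a wildcard
    # Outgoing edges as (axis, child) pairs; axis is "/" or "//".
    edges: List[Tuple[str, "PatternNode"]] = field(default_factory=list)


def proper_descendants(t: XmlNode) -> Iterator[XmlNode]:
    for child in t.children:
        yield child
        yield from proper_descendants(child)


def embeds_at(p: PatternNode, t: XmlNode) -> bool:
    """True if the sub-pattern rooted at p can be embedded with p mapped to t."""
    if p.label != "*" and p.label != t.label:
        return False  # node labels must be respected
    for axis, q in p.edges:
        candidates = t.children if axis == "/" else proper_descendants(t)
        if not any(embeds_at(q, u) for u in candidates):
            return False  # this edge relationship cannot be respected
    return True


def boolean_match(pattern_root: PatternNode, document_root: XmlNode) -> bool:
    """Root-preserving embedding test for a Boolean (arity 0) pattern."""
    return embeds_at(pattern_root, document_root)


# The pattern a//*[b//d][c] written out by hand.
pattern = PatternNode("a", [("//", PatternNode("*", [
    ("/", PatternNode("b", [("//", PatternNode("d"))])),
    ("/", PatternNode("c")),
]))])

# Document a(s(t(b(x(d)), c))): the wildcard can be mapped to t.
doc = XmlNode("a", [XmlNode("s", [XmlNode("t", [
    XmlNode("b", [XmlNode("x", [XmlNode("d")])]),
    XmlNode("c"),
])])])

print(boolean_match(pattern, doc))
```

Because embeddings need not be injective and the pattern is a tree, each branch can be checked independently, which keeps the sketch short.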

Example for embedding • [Figure: a tree pattern embedded into an XML tree; node labels shown are a, s, *, t, b, c, d.]

PROBLEM… • Testing containment between two XPath fragments is a coNP-complete problem. • Hardness can be proven by a reduction from 3-CNF satisfiability to non-containment.

Do we really care about it? • When do we need to test for containment or equivalence between fragments? • In almost all the applications we described so far. • Inference of keys. • Optimization of XPath queries. • I guess we care...

Solving the problem • Finding an algorithm that is both efficient and complete for this problem is quite difficult (it would amount to proving P = NP). • Finding an algorithm that is efficient but not complete. • Finding an algorithm that is complete but not always efficient.

First solution: Pattern homomorphism

Pattern Homomorphisms - definition • A homomorphism h from tree pattern p’ to tree pattern p is a function h: Nodes(p’) -> Nodes(p) that satisfies the following conditions: • Root preserving: h maps the root of p’ to the root of p. • Label preserving: for each node x in p’, the label of x is either * or equal to the label of h(x). • Child and descendant relations are preserved: a child edge maps to a child edge, a descendant edge to a path of one or more edges. • Whether a homomorphism between two patterns exists can be decided by efficient algorithms. • The algorithm is sound: whenever there exists a homomorphism from p’ to p, then p ⊆ p’. • The existence of a homomorphism is always a sufficient condition for containment. • But is it a necessary condition?
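The homomorphism test can be sketched as a simple recursion over Boolean patterns. The sketch assumes the convention that the homomorphism goes from the containing pattern (the source) to the contained pattern (the target), so a wildcard in the source may land on any target node. `PatternNode`, `maps_to`, and `has_homomorphism` are hypothetical names. The recursion is written without a memo table for brevity; caching the results per pair of nodes would keep the work polynomial.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class PatternNode:
    label: str  # element name, or "*" for a wildcard
    # Outgoing edges as (axis, child) pairs; axis is "/" or "//".
    edges: List[Tuple[str, "PatternNode"]] = field(default_factory=list)


def proper_descendants(node: PatternNode) -> Iterator[PatternNode]:
    """All pattern nodes reachable through one or more edges of any kind."""
    for _, child in node.edges:
        yield child
        yield from proper_descendants(child)


def maps_to(x: PatternNode, y: PatternNode) -> bool:
    """True if the sub-pattern rooted at x has a homomorphism sending x to y."""
    if x.label != "*" and x.label != y.label:
        return False  # label condition: x is a wildcard or the labels agree
    for axis, x_child in x.edges:
        if axis == "/":
            # A child edge must land on a child edge of the target.
            targets = (c for a, c in y.edges if a == "/")
        else:
            # A descendant edge may land on any proper descendant.
            targets = proper_descendants(y)
        if not any(maps_to(x_child, t) for t in targets):
            return False
    return True


def has_homomorphism(source_root: PatternNode, target_root: PatternNode) -> bool:
    """Root-preserving homomorphism test from source pattern to target pattern."""
    return maps_to(source_root, target_root)


# p = a[b][c] and p' = a/*: a homomorphism from p' to p exists, so p is contained in p'.
p = PatternNode("a", [("/", PatternNode("b")), ("/", PatternNode("c"))])
p_prime = PatternNode("a", [("/", PatternNode("*"))])
print(has_homomorphism(p_prime, p), has_homomorphism(p, p_prime))
```

A `True` answer certifies containment of the target in the source; a `False` answer, by itself, certifies nothing.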

Example for homomorphism • [Figure: two tree patterns over the nodes a, *, b, c, annotated with the mapping h(a) = a, h(b) = *.]

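Homomorphism is not a complete solution for containment • A homomorphism between two tree patterns may fail to exist even though the patterns are equivalent. • [Figure: two equivalent tree patterns, each over the nodes a, *, b, with no homomorphism between them.]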

Cases where homomorphism is complete • Fragments that use only * and [] (no //). • Fragments that use only // and [] (no *). • Fragments that use all three features but can be rewritten, without changing their semantics, into an expression from one of the classes above.

Conclusion for homomorphism • Sound. • Efficient. • Incomplete. • Next we look for an algorithm that is sound and complete, and efficient in at least some cases.

ALGORITHM FOR CONTAINMENT

Containment between regular languages • We reduce containment between two XPath fragments to containment between two regular tree languages by translating each tree pattern into a tree automaton. • The algorithm is complete: with well-defined rules we can translate from automaton to tree pattern and vice versa.

Automata for XPath fragments • Defined on ranked trees. • Work bottom-up. • Only the root state is accepting. • The run starts at the leaves of the tree. • Transitions have the form (q1,q2,…,qn; a) -> q.

Definitions • FTA - finite tree automaton: an automaton consisting of a set of states and transitions of the form described. • An FTA can be deterministic (DFTA). • Each FTA A with Q states can be translated to a DFTA B with at most 2^Q states. • AFTA - alternating finite tree automaton: extends the FTA by adding “AND transitions” of the form (q1,q2,…,qm) -> qi. • An AFTA can also be converted to a DFTA, at no greater cost than determinizing an FTA.
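To make the transition form (q1,…,qn; a) -> q concrete, here is a minimal sketch of a bottom-up run of a nondeterministic FTA on an ordered, labeled tree. The encoding is an assumption made for illustration: a tree is a `(label, children)` tuple, and transitions are a dictionary from `(child_states, symbol)` to a set of states. At each node the sketch collects every state the automaton can reach, which is also the idea behind the subset construction that yields a DFTA with at most 2^Q states.

```python
from itertools import product
from typing import Dict, Set, Tuple

# A tree is (label, children), where children is a tuple of trees.
Tree = Tuple[str, tuple]
# Transitions map ((q1, ..., qn), symbol) to the set of states q they can produce.
Transitions = Dict[Tuple[Tuple[str, ...], str], Set[str]]


def reachable_states(tree: Tree, transitions: Transitions) -> Set[str]:
    """States the automaton can assign to the root of `tree`, computed bottom-up."""
    label, children = tree
    child_sets = [reachable_states(child, transitions) for child in children]
    result: Set[str] = set()
    # Try every combination of states for the children; a leaf has the single
    # empty combination (), so leaf rules have the form ((), a) -> q.
    for combo in product(*child_sets):
        result |= transitions.get((combo, label), set())
    return result


def accepts(tree: Tree, transitions: Transitions, accepting: Set[str]) -> bool:
    return bool(reachable_states(tree, transitions) & accepting)


# Hypothetical automaton accepting exactly the tree r(a, b).
rules: Transitions = {
    ((), "a"): {"qa"},
    ((), "b"): {"qb"},
    (("qa", "qb"), "r"): {"qr"},
}
final = {"qr"}

print(accepts(("r", (("a", ()), ("b", ()))), rules, final))
print(accepts(("r", (("b", ()), ("a", ()))), rules, final))
```

The second tree is rejected because the automaton runs on ranked, ordered trees, so the order of the children matters here even though the tree patterns themselves are unordered.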

The entire algorithm • Construct the DFTA A accepting the regular tree language of p. • Construct the AFTA A’ accepting the regular tree language of p’. • Compute the AFTA B = A x A’. • Compute the DFTA C = Det(B). • If lang(A) ⊆ lang(C) then return true, else return false.

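[Figure: the running example, asking whether one tree pattern rooted at r is contained in another; node labels shown are r, a, b, *.]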

Step 1: Building FTA A from tree pattern p • States(A) = Nodes(p). • For each node x with children x1,…,xk we add a transition (x1,…,xk; a) -> x, where a is the label of x. • For each descendant edge e from node x to node y we add (y;e) -> x, and a self-loop (y;*) -> y. • The only accepting state is the root.

Example for building FTA • [Figure: a tree pattern rooted at r and the FTA built from it; node labels shown are r, a, b, *.]

Step 2: Building an AFTA A’ from pattern p’ • States(A’) = Nodes(p’) ∪ Edges(p’). • (q,a) -> for every symbol a that has an outgoing edge e; if e is a descendant edge, then we also add a self-loop on the source node. • (e1,e2,e3,…) -> a for every a that has incoming edges.

Example for building AFTA for pattern p’ • [Figure: pattern p’ rooted at r and the AFTA built from it; node labels shown are r, a, b, *.]

Conclusion for the containment algorithm • Sound • Complete. • Not always efficient.

THE END
