Containment and Equivalence for an XPath Fragment

Authors: Gerome Miklau, Dan Suciu
Presented by: Shnaiderman Lila

Presentation Outline
• Introduction
• Final Destination
• Definitions and background
• Canonical models and Match Sets
• Exponential time containment algorithm (complete)
• Homomorphism
• Polynomial time containment algorithm (incomplete)
• co-NP hardness of containment
• Additional topics of interest
• Conclusion

Introduction
• XPath is a simple language for navigating XML documents and selecting a set of nodes.
• With XPath we can query XML, describe key constraints, express transformations and reference elements in remote documents.
• XPath's influence can be found in other XML query languages and features such as XQuery, XSLT, XML Schema, XLink, XPointer and more...

Introduction (continued)
• This article deals with a simple XPath fragment that consists of:
  • Node tests
  • Child axes (/)
  • Descendant axes (//)
  • Wildcards (*)
  • Predicates ([…])
• This class of queries is called XP{[],*,//}.
• Example: a//*[b//d][c]
[Figure: tree pattern for a//*[b//d][c], with nodes a, *, b, c, d and x]

Final Destination
• To show that the containment problem for XP{[],*,//} is co-NP complete (surprising!)
• To present an efficient, sound algorithm that is complete in some cases (this algorithm always runs in PTIME)
• To present a sound and complete algorithm that is efficient in some cases (its worst-case running time is exponential)

Definitions and background
• NP stands for "Nondeterministic Polynomial".
• P: the class of decision problems that can be solved in polynomial time.
• NP: the class of decision problems that a nondeterministic algorithm can solve in polynomial time; equivalently, a proposed solution can be verified in polynomial time. No polynomial-time algorithm is known for the hardest problems in NP.
• NP-hard problem: a problem to which every problem in NP can be reduced in polynomial time (it is at least as hard as any problem in NP).
• NP-complete problem: a problem that belongs to NP and is also NP-hard.
• coNP: the class of problems whose complement is in NP. If L is a coNP problem, there exists a polynomial-time nondeterministic algorithm M such that:
  • If x ∈ L, then M(x) = "yes" on all computation paths.
  • If x ∉ L, then M(x) = "no" on some computation path.

Definitions and background (continued)
• Embedding: given a tree pattern p and a tree t, an embedding from p to t is a function e: NODES(p) → NODES(t) that satisfies the following conditions:
  • Root-preserving: e(ROOT(p)) = ROOT(t)
  • Label-preserving: for each x ∈ NODES(p), LABEL(x) = * or LABEL(x) = LABEL(e(x))
  • Child-edge-preserving: for each (x,y) ∈ EDGES/(p), (e(x), e(y)) ∈ EDGES(t)
  • Descendant-edge-preserving: for each (x,y) ∈ EDGES//(p), (e(x), e(y)) ∈ EDGES+(t)
• EDGES+(t) is the transitive closure of EDGES(t): e(y) is reached from e(x) by a path of at least one edge.
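
The four embedding conditions can be checked by a direct recursion. The sketch below is only an illustration: the tuple encoding of trees and patterns and the names `proper_descendants`, `embeds_at` and `evaluate` are assumptions made for this example, and the recursion is a plain search, not the O(|p||t|) evaluation algorithm.

```python
# Assumed encoding (not from the slides):
#   tree    = (label, (child_tree, ...))
#   pattern = (label, ((axis, child_pattern), ...)), axis is "/" or "//"
# "*" is the wildcard label.

def proper_descendants(t):
    """Yield every node strictly below t (the targets allowed by EDGES+)."""
    for child in t[1]:
        yield child
        yield from proper_descendants(child)

def embeds_at(p, t):
    """True if pattern p embeds into t with the root of p mapped to the root of t."""
    label, edges = p
    if label != "*" and label != t[0]:            # label-preserving
        return False
    for axis, sub in edges:
        if axis == "/":
            candidates = t[1]                      # child-edge-preserving
        else:
            candidates = proper_descendants(t)     # descendant-edge-preserving
        if not any(embeds_at(sub, c) for c in candidates):
            return False
    return True

def evaluate(p, t):
    """Boolean evaluation p(t): is there a root-preserving embedding of p into t?"""
    return embeds_at(p, t)

# Pattern a//*[b//d][c]
d = ("d", ())
b = ("b", (("//", d),))
c = ("c", ())
star = ("*", (("/", b), ("/", c)))
pattern = ("a", (("//", star),))

# Tree a/x with children b/e/d and c: the wildcard can map to x.
t_d = ("d", ())
t_c = ("c", ())
tree_yes = ("a", (("x", (("b", (("e", (t_d,)),)), t_c)),))
# Tree a with children b/d and c: no node below a has both a b child and a c child.
tree_no = ("a", (("b", (t_d,)), t_c))

print(evaluate(pattern, tree_yes), evaluate(pattern, tree_no))
```

Because predicates are checked independently, two pattern nodes may map to the same tree node, which the definition of an embedding allows.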

Definitions and background (continued)
• Example: an embedding of the pattern p = a[a]//*[b]//c into a tree instance t.
[Figure: tree pattern p = a[a]//*[b]//c and tree instance t]

Definitions and background (continued)
• From XPath to tree patterns: every XPath expression in XP{[],*,//} can be translated into a tree pattern of arity 1, and vice versa, while preserving semantics.
• From now on we consider tree patterns only: P{[],*,//} and its fragments.
• Boolean patterns: patterns of arity 0.
• Definition: if p is Boolean, then p(t) = ∅ (false) or p(t) = {()} (true).
• Containment means implication: p ⊆ p’ iff ∀t, p(t) ⇒ p’(t).
• Proposition 1: Let s1,…,sk be k labels that are not in ∑. There is a translation of k-ary patterns over the alphabet ∑ to Boolean patterns over the alphabet ∑ ∪ {s1,…,sk} such that, for any k-ary patterns p, p’ and their translations p0, p0’, we have p ⊆ p’ iff p0 ⊆ p0’.

Definitions and background (continued)
• Example: a tree pattern of arity 3, with the distinguished nodes x1, x2, x3, and its translation to a Boolean pattern p0, as used in Proposition 1. p0 has three extra nodes labeled s1, s2, s3.
[Figure: the arity-3 tree pattern and its Boolean translation p0]
• In the rest of this article we assume all tree patterns to be Boolean, unless otherwise stated.

Definitions and background (continued)
• Mutual reducibility of containment and equivalence: the containment and equivalence problems are mutually reducible in polynomial time. Equivalence is simply two-way containment.
• We will only discuss containment in the remainder of this article.
• Tree pattern evaluation: there is an algorithm that decides, for any tree pattern p and input tree t, whether p(t) is true, and runs in time O(|p||t|).
• |p| and |t| are the sizes of p and t, meaning their numbers of nodes.
• "p(t) is true" means that there is an embedding from p to t.

Canonical models and Match Sets
• Model of a Boolean pattern p: a tree t ∈ T∑ on which p evaluates to true.
• Mod(p) is the set of models: Mod(p) = {t ∈ T∑ | p(t) is true}
• p ⊆ p’ iff Mod(p) ⊆ Mod(p’)
• Witness: a tree t such that p(t) is true and p’(t) is false; a witness shows that p ⊈ p’.
• To find a witness we would have to search an infinite set of trees, so we need to restrict the search.
• Canonical models:
  • First step: eliminate all descendant edges by replacing each edge // with a sequence of wildcards */*/…/*.
  • Second step: replace each wildcard with a symbol z.
• Formally (first step): p has d descendant edges, EDGES//(p) = {r1,…,rd}. Given d numbers û = (u1,…,ud) with u1 ≥ 0,…,ud ≥ 0, the extension p[û] is the pattern obtained by replacing each descendant edge ri with a chain of ui nodes labeled *, connected by child edges.
• Distance: if ri = (x,y), then in p[û] d(x,y) = ui + 1.

Canonical models and Match Sets (continued)
• Example:
[Figure: tree pattern p and its extension p[0,2]; the extension nodes are labeled *]
• Lemma: Let e: p → t be an embedding from the tree pattern p to the tree t. There exist a unique extension p[û] and a unique embedding e’: p[û] → t such that ∀x ∈ NODES(p), e(x) = e’(x).
• Proof: For each i = 1,…,d, e maps the descendant edge ri = (xi,yi) ∈ EDGES//(p) to a pair of nodes (e(xi),e(yi)) ∈ EDGES+(t). Define ui = d(e(xi),e(yi)) − 1 (d is the distance in t), and let û = (u1,…,ud). Extend e to e’: p[û] → t by mapping the extension nodes between xi and yi to the nodes connecting e(xi) to e(yi).

Canonical models and Match Sets (continued)
• Formally (second step): replace the *’s with some symbol z. sz(p) is the tree obtained from p by replacing each * with z.
• Set of canonical models: modz(p) = {sz(p[û]) | û = (u1,…,ud), u1 ≥ 0,…,ud ≥ 0}
• This set is infinite if p has at least one descendant edge.
• Set of bounded canonical models, for n ≥ 0: modzn(p) = {sz(p[û]) | û = (u1,…,ud), 0 ≤ u1 ≤ n,…,0 ≤ ud ≤ n}
• This set is always finite.
• The star length w of a pattern q is the largest number of nodes labeled * that form a chain connected by child edges.
• Need to show: to search for a witness for p ⊈ p’ it is enough to check the finite set modzn(p), where z does not occur in p’ and n depends only on p’.
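
The two steps that build a canonical model can be sketched as follows. This is an illustration under assumptions, not code from the paper: patterns are encoded as nested tuples, the vector û is consumed in preorder over the descendant edges, and the function names are hypothetical.

```python
from itertools import product

# Assumed encoding: pattern = (label, ((axis, child_pattern), ...)), axis "/" or "//";
# the resulting trees are (label, (child_tree, ...)).

def count_desc_edges(p):
    """d: the number of descendant edges of p."""
    return sum((axis == "//") + count_desc_edges(sub) for axis, sub in p[1])

def canonical_model(p, u, z="z"):
    """Build s_z(p[u]): each // edge becomes a chain of u_i nodes, every * becomes z."""
    remaining = iter(u)                     # u_i values, in preorder of the // edges

    def build(node):
        label, edges = node
        kids = []
        for axis, sub in edges:
            k = next(remaining) if axis == "//" else 0
            subtree = build(sub)
            for _ in range(k):              # insert k extension nodes above the child
                subtree = (z, (subtree,))
            kids.append(subtree)
        return (z if label == "*" else label, tuple(kids))

    return build(p)

def bounded_canonical_models(p, n, z="z"):
    """Enumerate the bounded canonical models: all u with 0 <= u_i <= n."""
    d = count_desc_edges(p)
    for u in product(range(n + 1), repeat=d):
        yield canonical_model(p, u, z)

# p = a//*/b has one descendant edge, so n = 1 gives exactly two models.
p = ("a", (("//", ("*", (("/", ("b", ())),))),))
for model in bounded_canonical_models(p, 1):
    print(model)
```

With d descendant edges the enumeration yields (n+1)^d trees, which is where the exponential factor of the naive containment test comes from.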

Canonical models and Match Sets (continued)
• Reminder (Lemma): Let e: p → t be an embedding from the tree pattern p to the tree t. There exist a unique extension p[û] and a unique embedding e’: p[û] → t such that ∀x ∈ NODES(p), e(x) = e’(x).
• Proposition: Let p and p’ be two Boolean tree patterns, z ∈ ∑ a symbol that does not appear in p’, and w’ the star length of p’. Then the following are equivalent: (1) p ⊆ p’; (2) modz(p) ⊆ Mod(p’); (3) modzn(p) ⊆ Mod(p’), where n = w’ + 1.
• Proof: (1) ⇒ (2) ⇒ (3) is obvious (p ⊆ p’ is equivalent to Mod(p) ⊆ Mod(p’), and modzn(p) ⊆ modz(p) ⊆ Mod(p)). This leaves (3) ⇒ (1), which we prove by contraposition.
• Suppose p ⊈ p’, and let t be a witness (p(t) is true and p’(t) is false). p(t) is true ⇒ there exists an embedding e: p → t ⇒ there exists e’: p[û] → t that agrees with e on the nodes of p (by the Lemma).
• t1 = sz(p[û]) ∈ modz(p) is still a witness. To show that p’(t1) is false, suppose p’(t1) were true ⇒ there exists an embedding e1: p’ → t1.
• Let f: NODES(p’) → NODES(t) be the composition of e1: p’ → t1 with e’: p[û] → t (possible because NODES(t1) = NODES(p[û])). Since z does not appear in p’, only nodes of p’ labeled * are mapped to z nodes, so f is an embedding.
• Contradiction: f: p’ → t ⇒ p’(t) is true. Hence p’(t1) is false, and t1 = sz(p[û]) ∈ modz(p) is a witness (p(t1) is true while p’(t1) is false). This shows (2) ⇒ (1).

Canonical models and Match Sets (continued)
• We now construct a canonical model t2 ∈ modzn(p) that is still a witness.
• This follows directly from the next lemma: Let p and p’ be two Boolean tree patterns, z ∈ ∑ a symbol that does not appear in p’, and w’ the star length of p’. Let t1 = sz(p[û]) be a canonical model such that p’(t1) is false. Define v = (v1,…,vd) by vi = min(ui,n) for i = 1,…,d, where n = w’ + 1, and let t2 = sz(p[v]). Then p’(t2) is false.
• Intuition: if p’(t2) were true, we could stretch the chains of extra nodes in t2 to obtain t1, and p’(t1) would still be true.
• Remark: the n in part (3) depends only on p’: n = w’ + 1 (w’ is the star length of p’).
• This concludes the proof: modzn(p) ⊆ Mod(p’) ⇒ p ⊆ p’ (otherwise t2 ∈ modzn(p) would be a witness for p ⊈ p’).

Canonical models and Match Sets (continued)
• Match sets: for a tree t (or a pattern p), each node and each edge defines a subtree.
• x ∈ NODES(t) defines tx, which consists of the node x and its subtree (ROOT(tx) = x; tROOT(t) = t).
• (x,y) ∈ EDGES(t) defines tx,y, which consists of ty, the node x and the edge (x,y).
• S(t) is the set of all subtrees defined by the nodes and edges of t.
[Figure: a pattern p’ and its subpatterns p’x = p’x,y, p’y, p’y,u, p’y,z, p’z, p’u]

Canonical models and Match Sets (continued)
• q* is the pattern obtained by replacing the root label of q with *.
• The match set of a tree t with respect to the pattern p’: ms(t) = {p’x | x ∈ NODES(p’), p’x(t) = true} ∪ {p’x,y | (x,y) ∈ EDGES/(p’), p’x,y(t) = true} ∪ {p’x,y | (x,y) ∈ EDGES//(p’), (p’x,y)*(t) = true}
• MS[p] = {ms(t) | t ∈ modz(p)}
[Figure: example with t1 = /a/b/c and t2 = /a/b/z/c: ms(t1) = {p’x, p’x,y, p’y,u, p’u} and ms(t2) = {p’y,u, p’u}]

Exponential time containment algorithm
• Naive algorithm: to decide whether p ⊆ p’, iterate over all t ∈ modzn(p) with n = w’+1 and check p’(t) (each check requires O(|t||p’|) steps).
• Total running time: O(|p||p’|(w’+2)^(d+1)) (based on the size of sz(p[û]) and the fact that d ≤ |p|).
• Problem: the naive algorithm is not practical, since much of the work in evaluating p’(t) is repeated for the various canonical models t.
• Main idea of the match set algorithm: p ⊈ p’ iff there exists a canonical tree t ∈ modz(p) such that p’(t) is false. So it suffices to compute ms(t) for some t and to check whether p’ROOT(p’) ∉ ms(t).
• Problem: we don’t know for which tree t to compute ms(t). Solution: compute the set of all match sets, MS[p]. It then suffices to check the condition ∃ms ∈ MS[p], p’ROOT(p’) ∉ ms to determine that p ⊈ p’.
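
The naive algorithm can be sketched end to end: enumerate the bounded canonical models of p with n = w’+1 and evaluate p’ on each. Everything below is an illustrative assumption (tuple encoding, function names, the way a fresh symbol z is chosen); it is exponential in the number of descendant edges of p, as the slide states.

```python
from itertools import product

# Assumed encoding: pattern = (label, ((axis, child), ...)), axis "/" or "//";
# tree = (label, (child, ...)); "*" is the wildcard.

def descendants(t):
    for child in t[1]:
        yield child
        yield from descendants(child)

def evaluate(p, t):
    """p(t): root-preserving embedding test (simple recursion, not O(|p||t|))."""
    label, edges = p
    if label != "*" and label != t[0]:
        return False
    return all(
        any(evaluate(sub, c) for c in (t[1] if axis == "/" else descendants(t)))
        for axis, sub in edges
    )

def star_length(p):
    """Longest chain of * nodes connected by child edges."""
    best = 0
    def walk(node, run):
        nonlocal best
        run = run + 1 if node[0] == "*" else 0
        best = max(best, run)
        for axis, sub in node[1]:
            walk(sub, run if axis == "/" else 0)
    walk(p, 0)
    return best

def labels(p):
    yield p[0]
    for _, sub in p[1]:
        yield from labels(sub)

def count_desc_edges(p):
    return sum((axis == "//") + count_desc_edges(sub) for axis, sub in p[1])

def canonical_model(p, u, z):
    remaining = iter(u)
    def build(node):
        kids = []
        for axis, sub in node[1]:
            k = next(remaining) if axis == "//" else 0
            subtree = build(sub)
            for _ in range(k):
                subtree = (z, (subtree,))
            kids.append(subtree)
        return (z if node[0] == "*" else node[0], tuple(kids))
    return build(p)

def contained(p, p2):
    """Naive test of p ⊆ p2 over the bounded canonical models of p."""
    n = star_length(p2) + 1
    z = "z"
    while z in set(labels(p2)):          # z must not occur in p2
        z += "z"
    d = count_desc_edges(p)
    return all(evaluate(p2, canonical_model(p, u, z))
               for u in product(range(n + 1), repeat=d))

a_child_b = ("a", (("/", ("b", ())),))    # a/b
a_desc_b = ("a", (("//", ("b", ())),))    # a//b
print(contained(a_child_b, a_desc_b), contained(a_desc_b, a_child_b))
```

In this sketch `all(...)` stops at the first canonical model on which p’ fails, so a witness for p ⊈ p’ is reported as soon as it is found.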

Exponential time containment algorithm (continued)
• Example:
[Figure: tree pattern p and tree pattern p’, with the nodes x, y, z, u of p’]
• MS[p] = {{p’x, p’x,y, p’y,u, p’u}, {p’y,u, p’u}}
• p ⊈ p’, because ∃ms ∈ MS[p], p’ROOT(p’) ∉ ms: p’x ∉ {p’y,u, p’u}.
• Remark: MS[p] has at most as many elements as there are canonical trees in modzn(p) with n = w’+1 (w’ is the star length of p’). In many cases it is much smaller, because many canonical trees give the same match set (as in the example above) ⇒ the match set algorithm is better than the naive one.
• The full algorithm to check whether p ⊆ p’ (complete):
  • Compute MS[p].
  • Check whether ∃ms ∈ MS[p], p’ROOT(p’) ∉ ms.
  • If such an ms exists, return p ⊈ p’.
  • If it doesn’t, return p ⊆ p’.

Exponential time containment algorithm (continued)
• The running time: O(|p||p’|(w’+2)^d) (based on the size of sz(p[û]) and the fact that d ≤ |p|).
• This algorithm is sound and complete, and in some cases runs in exponential time.
• In the following example, one match set is {p’x, p’x,y1,…, p’x,yn} and the other match sets are subsets of {p’x,y1,…, p’x,yn}, so the answer of the algorithm is false (p ⊈ p’), but it takes exponential time to decide this (because there are 2^n match sets to check).
[Figure: tree pattern p with root a, b nodes and leaves c1, c2,…, cn; tree pattern p’ with root x and children y1, y2, y3,…]

Homomorphism
• A homomorphism h: p’ → p between two tree patterns p, p’ is a function h: NODES(p’) → NODES(p) that satisfies the embedding conditions, with the following strengthening of the child-edge-preservation condition:
• ∀(x,y) ∈ EDGES/(p’) ⇒ (h(x),h(y)) ∈ EDGES/(p) (and not EDGES//(p))
• Reminder of the embedding conditions:
  • Root-preserving: e(ROOT(p)) = ROOT(t)
  • Label-preserving: for each x ∈ NODES(p), LABEL(x) = * or LABEL(x) = LABEL(e(x))
  • Child-edge-preserving: for each (x,y) ∈ EDGES/(p), (e(x), e(y)) ∈ EDGES(t)
  • Descendant-edge-preserving: for each (x,y) ∈ EDGES//(p), (e(x), e(y)) ∈ EDGES+(t)
• Example:
[Figure: two tree patterns P and P’ and a homomorphism between them]

Homomorphism (continued)
• Problem: homomorphism fails in the following case for P{//,*}:
[Figure: tree patterns P and P’ for which no homomorphism exists, and the adorned pattern P’’]
• Solution: adornment, which combines // with * (the adornment m is written //^m):
  • // → //^0
  • //^m * / → //^(m+1)
  • / * //^n → //^(n+1)
  • //^m * //^n → //^(m+n+1)
• Only * nodes with a unique child may be eliminated this way.
• In a homomorphism with adornment, d(h(x),h(y)) ≥ d(x,y), where d is the distance function.
• Example: p’ = a//*/*/b/*/c//d → p’’ = a //^2 b/*/c //^0 d
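
The adornment rules can be tried out on linear patterns (paths without predicates), where every inner node has a unique child. The sketch below is an assumption-laden illustration: the list encoding and the name `adorn` are hypothetical, a child edge is stored as `None` and a descendant edge adorned with m as the integer m.

```python
def adorn(labels, axes):
    """Apply the adornment rules to a linear pattern.

    labels: node labels from root to leaf, e.g. ["a", "*", "b"]
    axes:   the edge between labels[i] and labels[i+1], "/" or "//"
    Returns (labels, edges) where each edge is None for "/" or m for an adorned //.
    """
    labels = list(labels)
    edges = [0 if axis == "//" else None for axis in axes]    # // becomes // adorned 0
    changed = True
    while changed:
        changed = False
        # Only inner nodes are candidates: a leaf * has no outgoing edge.
        for i in range(1, len(labels) - 1):
            incoming, outgoing = edges[i - 1], edges[i]
            if labels[i] == "*" and (incoming is not None or outgoing is not None):
                # A child edge contributes 0; the removed * contributes the +1.
                merged = (incoming or 0) + (outgoing or 0) + 1
                edges[i - 1:i + 1] = [merged]
                del labels[i]
                changed = True
                break
    return labels, edges

# a//*/*/b/*/c//d  becomes  a (//, 2) b / * / c (//, 0) d
result = adorn(["a", "*", "*", "b", "*", "c", "d"],
               ["//", "/", "/", "/", "/", "//"])
print(result)
```

The * between b and c is kept because both of its edges are child edges, which matches the rewritten pattern in the slide's example.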

Homomorphism (continued)
• Problem: in the following case there is no homomorphism:
[Figure: tree pattern p and tree pattern p’; the leaf * of p’ has no outgoing edge, so it can’t be eliminated by adornment]
• Shadowing: to every leaf node in both p and p’ add a shadow leaf with a label that does not occur in p or p’, connected to the original leaf by a descendant edge.

Polynomial time containment algorithm
• The algorithm:
  • Add shadow leaves to p and p’.
  • Apply the rewriting rules (adornment) to p’ to get p’’.
  • Look for a homomorphism from p’’ to p.
  • If one is found, return true.
  • Otherwise return false.
• Properties of the algorithm:
  • The algorithm is sound.
  • The running time is polynomial, O(|p||p’|), and is dominated by the search for a homomorphism.
  • The algorithm is not complete in general.
  • The algorithm is complete in the following 4 cases:
    • p ∈ P{[],*}
    • p’ ∈ P{[],*}
    • p’ ∈ P{[],//}
    • p’ ∈ P{*,//}
• The proof is given in the paper.
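
The homomorphism search at the core of the algorithm can be sketched as below. This is a simplified illustration under assumptions: it omits the shadow-leaf and adornment steps, uses a hypothetical tuple encoding and hypothetical function names, and memoizes with `functools.lru_cache` so each (node of p’, node of p) pair is examined once.

```python
from functools import lru_cache

# Assumed encoding: pattern = (label, ((axis, child), ...)), axis "/" or "//".

def nodes_below(x):
    """All nodes of p strictly below x, reached by edges of either kind."""
    for _, child in x[1]:
        yield child
        yield from nodes_below(child)

@lru_cache(maxsize=None)
def hom_at(q, x):
    """True if the subpattern q of p' maps homomorphically into p with q's root at x."""
    if q[0] != "*" and q[0] != x[0]:                  # label-preserving
        return False
    for axis, sub in q[1]:
        if axis == "/":
            # Strengthened condition: child edges must map to child edges of p.
            targets = [c for a, c in x[1] if a == "/"]
        else:
            targets = nodes_below(x)
        if not any(hom_at(sub, y) for y in targets):
            return False
    return True

def contained_by_homomorphism(p, p2):
    """Sound but incomplete test of p ⊆ p2: look for a homomorphism p2 → p."""
    return hom_at(p2, p)

a_child_b = ("a", (("/", ("b", ())),))                      # a/b
a_desc_b = ("a", (("//", ("b", ())),))                      # a//b
p1 = ("a", (("//", ("*", (("/", ("b", ())),))),))           # a//*/b
p2 = ("a", (("/", ("*", (("//", ("b", ())),))),))           # a/*//b

print(contained_by_homomorphism(a_child_b, a_desc_b))       # a/b ⊆ a//b
print(contained_by_homomorphism(p1, p2))                    # equivalent patterns
```

A `True` answer is always correct. A `False` answer may be wrong: a//*/b and a/*//b are equivalent, yet no plain homomorphism exists in either direction, which is the kind of case the adornment step is meant to repair.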

Polynomial time containment algorithm (continued)
• Example of an incomplete case:
[Figure: tree pattern p and tree pattern p’; every attempt to build a homomorphism runs out of options]
• The algorithm fails even though p ⊆ p’ (which can be shown by case analysis).
• Reminder: in a homomorphism with adornment, d(h(x),h(y)) ≥ d(x,y).

co-NP hardness of containment
• First we show that the problem "given p, p’ ∈ P{[],*,//}, decide whether p ⊆ p’" is in co-NP.
• Reminder: to show that p ⊈ p’ we have to find t ∈ modzn(p) and show that there is no embedding from p’ to t.
• To prove that the problem is in co-NP, we present a nondeterministic algorithm that checks p ⊈ p’: guess d numbers u1,…,ud, each ui ≤ w’+1, where w’ is the star length of p’, construct the canonical model t = sz(p[u1,…,ud]), then check in polynomial time that p’(t) is false ⇒ the problem is in co-NP.
• Another definition of containment: containment of a Boolean pattern p in a union of patterns is defined as follows:
• p ⊆ p1 ∪ … ∪ pk holds if, for all trees t, p(t) ⇒ p1(t) ∨ p2(t) ∨ … ∨ pk(t).
• Lemma: Given the patterns p, p1, p2,…, pk in P{[],*,//}, there exist patterns q, q’ in P{[],*,//} such that p ⊆ p1 ∪ … ∪ pk iff q ⊆ q’.
  • The sizes of q and q’ are polynomial in the sizes of p, p1, p2,…, pk.
  • q and q’ have no more wildcards than those present in p, p1, p2,…, pk.
• Reminder: if L is a coNP problem, there exists a polynomial-time nondeterministic algorithm M such that:
  • If x ∈ L, then M(x) = "yes" on all computation paths.
  • If x ∉ L, then M(x) = "no" on some computation path.

co-NP hardness of containment (continued)
• Proof: to prove the lemma we use the following construction:
[Figure: pattern q with root r and a spine of c nodes: k−1 nodes each carrying a copy of V, then the node carrying p, then k−1 more nodes carrying V; pattern q’ with root r and a spine of k c nodes carrying p1, p2,…, pk]
• V has no * and no //.
• V ⊆ pj for every j: V is obtained by fusing the (common) roots of the pi subtrees, replacing each * in pi with some letter a and each // with /.
• The canonical models of q are completely determined by the choice of a canonical model for q’s subtree p: for each t ∈ modz(q), tp ∈ modz(p) is the subtree corresponding to p.

co-NP hardness of containment (continued)
• Returning to the lemma. p ⊆ p1 ∪ … ∪ pk ⇒ q ⊆ q’ (we show that for every t ∈ modz(q), q’(t) is true):
  • For t ∈ modz(q), p(tp) is true ⇒
  • pi(tp) is true for some i ∈ {1,…,k} ⇒
  • q’(t) is true by the following embedding e: q’ → t: e maps the root of q’ to the root of t, e maps the subpattern pi to tp, and e maps every other pj to a corresponding copy of V (there are enough copies of V below and above p to do so).
• q ⊆ q’ ⇒ p ⊆ p1 ∪ … ∪ pk (we show that for every tp ∈ modz(p), p1(tp) ∨ p2(tp) ∨ … ∨ pk(tp) is true):
  • Given tp ∈ modz(p), extend tp to t ∈ modz(q) by adding the spine and the k−1 copies of V above and below tp.
  • q(t) is true ⇒ q’(t) is true ⇒ there exists an embedding e: q’ → t.
  • This embedding must map the spine of q’ to the spine of t. Let x be the spine node in t that is right above tp. At least one spine node of q’ must be mapped to x (because there are only k−1 spine nodes above x and k−1 below it, while the spine of q’ has k nodes and no descendant edges) ⇒
  • There is some node y in q’ mapped to x ⇒ the pi attached to y satisfies pi(tp) ⇒ p1(tp) ∨ p2(tp) ∨ … ∨ pk(tp) is true.

co-NP hardness of containment (continued)
• Now we are ready to prove co-NP hardness, by reduction from 3-CNF satisfiability.
• Let ψ be a 3-CNF formula with n propositional variables y1, y2,…, yn and k clauses c1, c2,…, ck. We construct patterns A, C1,…, Ck such that ψ is not satisfiable iff A ⊆ C1 ∪ … ∪ Ck. The tree pattern A is constructed so that its canonical models, modz(A), encode truth assignments to the n variables of ψ. The tree pattern Ci is constructed so that the following property holds:
• (*) For every t ∈ modz(A), Ci(t) is true iff the truth assignment encoded by t makes the clause ci false.
• Property (*) is sufficient to prove co-NP hardness because of the following equivalences and the last Lemma: (A ⊆ C1 ∪ … ∪ Ck) ⇔ (for every t ∈ modz(A) there exists i such that Ci(t) is true) ⇔ (for every truth assignment there exists i such that ci is false under that assignment) ⇔ (ψ is not satisfiable).
• Let us show how to construct A, C1,…, Ck so that property (*) is satisfied.

co-NP hardness of containment (continued)
[Figure: tree pattern A with root r and one subpattern Yi per variable yi (a node ai above a node b); the subpatterns T(yi) and F(yi)]
• For t ∈ modz(Yi): if t consists only of ai followed by b, it corresponds to a truth assignment making yi true. If t contains one or more added nodes between ai and b, it corresponds to a truth assignment making yi false.
• We define the tree pattern Ci for each clause of ψ by an example, for a clause ci over the variables yj, yk, yl:
[Figure: tree pattern Ci with root r and the subpatterns T(yj), F(yk), T(yl)]

co-NP hardness of containment (continued)
• What happens if we bound the number of occurrences of //, * or []:
• For //: the containment problem p ⊆ p’ is in PTIME if we bound the number of // edges in p by some d ≥ 0.
  • This follows from the bounded canonical models discussed earlier in the lecture.
• For *: the containment problem p ⊆ p’ remains co-NP hard even if we allow at most two *.
  • Not proved here.
• For []: the containment problem p ⊆ p’ remains co-NP hard even if we allow at most five [] in p and at most three [] in p’.
  • Not proved here.

Additional topics of interest
• Disjunction:
  • Containment for P{[],|} patterns is already co-NP complete.
  • It can be shown that containment for P{//,*,[],|} is also in co-NP.
  • Given the expressions p, p’ ∈ XP{|}, deciding containment is co-NP hard ⇒ it is of course also co-NP hard for XP{//,*,[],|}.
• Finite alphabet:
  • This article’s results do not hold for a finite alphabet of a size other than two.
  • Another article (Neven & Schwentick) shows that, for a finite alphabet, containment is in PSPACE for P{//,*,[],|} and PSPACE-complete for P{[],|}.
• Evaluation on graphs:
  • All results in this article apply directly to an extension of Boolean patterns evaluated on graphs (in this article we deal with trees).
• Application to CTL (computation tree logic):
  • All co-NP completeness results in this article apply to a fragment of CTL (ECTL) as well.

Conclusion
• We have studied the complexity of containment and equivalence for an important core fragment of XPath. Many XML applications benefit from a practical decision procedure for containment of such expressions.
• Our results provide intuition into the factors that contribute to its high complexity. Nevertheless, we show that in some significant special cases containment can be decided efficiently, and we provide an algorithm which does so.
• One direction for future work is to expand this fragment of XPath with additional features, although it is clear that it will be even more challenging to find efficient special cases of the problem.
• Another direction is to study containment of XPath expressions over sets of documents conforming to constraints or schema restrictions. Preliminary work shows that sufficiently expressive constraints make this problem intractable for XPath fragments that otherwise have efficient containment problems.
THE END!
