From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Z.M. Chen, H.V. Jagadish, L.V.S. Lakshmanan, S. Paparizos (VLDB 2003). Fatih Gön (2002701366), Mehmet Şenvar (2003700221), Bogazici University, Department of Computer Engineering

Overview Motivation: Current approaches to XQuery evaluation are inefficient. A concise XQuery model is needed as the basis for generating efficient physical evaluation plans. Main contributions: • Generalized Tree Pattern (GTP) query model • Algorithm for translating function-free XQuery to a GTP • Physical algebra and an algorithm for translating a GTP to a physical plan • Schema-aware optimization of the GTP and the physical plan

Motivation Current approaches: • Navigational plan (NAV): traverses down the path by recursively getting all child nodes and filtering out the unwanted ones before the next iteration. • Baseline plan (BASE): uses TAX operators, which take a tree pattern and a sequence of trees as input; some tree patterns may be evaluated repeatedly. Our approach: • Generalized Tree Pattern (GTP): uses the GTP as the XQuery model to generate an efficient evaluation plan.

Tree pattern query
(a) Tree T: nodes $p (root), $s, $l, $g. Boolean formula F: $p.tag = person & $s.tag = state & $l.tag = profile & $g.tag = age & $g.content > 25 & $s.content != ‘MI’
(b) Tree T: nodes $p (root), $w, $t. Boolean formula F: $p.tag = person & $w.tag = watches & $t.tag = watch

Generalized tree pattern (GTP)
(a) An XQuery example:
FOR $p IN document(“auction.xml”)//person, $l IN $p/profile
WHERE $l/age > 25 AND $p//state != ‘MI’
RETURN <result> {$p//watches/watch} {$l/interest} </result>
(b) Generalized tree pattern: nodes with group numbers $p (0), $s (0), $l (0), $g (0), $w (1), $t (1), $i (2). Formula: $p.tag = person & $s.tag = state & $l.tag = profile & $i.tag = interest & $w.tag = watches & $t.tag = watch & $g.tag = age & $g.content > 25 & $s.content != ‘MI’

Generalized tree pattern (GTP) GTP: a pair G=(T,F), where T is a tree and F is a boolean formula. • Each node of T is labeled by a distinct variable and has an associated group number. • Each edge of T has a pair of associated labels <x,m>, where x specifies the axis (pc or ad) and m specifies the edge status (mandatory or optional). • F is a boolean combination of predicates applicable to nodes. Group: each maximal set of nodes in a GTP connected to each other by paths not involving optional edges. By convention, group 0 includes the GTP root.
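The definition of groups can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the Edge class and assign_groups function are hypothetical names, and marking the two RETURN paths of the example as optional is an assumption chosen to agree with the group numbers shown for the example GTP. Group 0 is fixed by convention; the other group numbers depend on traversal order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    parent: str
    child: str
    axis: str       # "pc" (parent-child) or "ad" (ancestor-descendant)
    optional: bool  # False = mandatory edge, True = optional edge


def assign_groups(root, edges):
    """Map each node to a group number; a new group starts below each optional edge."""
    children = {}
    for e in edges:
        children.setdefault(e.parent, []).append(e)
    group_of = {root: 0}  # by convention the root is in group 0
    next_group = 1
    stack = [root]
    while stack:
        node = stack.pop()
        for e in children.get(node, []):
            if e.optional:
                group_of[e.child] = next_group
                next_group += 1
            else:
                group_of[e.child] = group_of[node]
            stack.append(e.child)
    return group_of


# Example GTP; which edges are optional is an assumption (the RETURN paths).
EXAMPLE_EDGES = [
    Edge("$p", "$s", "ad", False),
    Edge("$p", "$l", "pc", False),
    Edge("$p", "$w", "ad", True),
    Edge("$w", "$t", "pc", False),
    Edge("$l", "$g", "pc", False),
    Edge("$l", "$i", "pc", True),
]

print(assign_groups("$p", EXAMPLE_EDGES))
```

Because T is a tree, each node has exactly one incoming edge, so a single top-down pass is enough to find the maximal sets connected by mandatory edges.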

Pattern match (Formal Description) A pattern match of G into a collection of trees C is a partial mapping h: G → C such that: • h is defined on all group 0 nodes. • If h is defined on a node in a group, then it is necessarily defined on all nodes in that group. • h preserves the structural relationships in G. • h satisfies the boolean formula F.
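The first two conditions concern only which pattern nodes the partial mapping h covers, so they can be checked without looking at the data. A minimal sketch, assuming h is a dictionary from pattern variables to (hypothetical) database node ids and group_of holds the group numbers of the example GTP; respects_groups is an illustrative name, and the structural and formula conditions are not checked here.

```python
def respects_groups(group_of, h):
    """Check that h covers group 0 and covers every other group fully or not at all."""
    mapped_groups = {group_of[node] for node in h}
    for node, group in group_of.items():
        must_be_mapped = group == 0 or group in mapped_groups
        if must_be_mapped and node not in h:
            return False
    return True


# Group numbers of the example GTP; the node ids in h below are made up.
GROUP_OF = {"$p": 0, "$s": 0, "$l": 0, "$g": 0, "$w": 1, "$t": 1, "$i": 2}

h_core = {"$p": 1, "$s": 2, "$l": 3, "$g": 4}
print(respects_groups(GROUP_OF, h_core))                 # group 0 only
print(respects_groups(GROUP_OF, {**h_core, "$w": 5}))    # group 1 half mapped
```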

Pattern match A pattern match is a mapping from the pattern nodes to nodes in an XML database such that the formula associated with the pattern, as well as the structural relationships among pattern nodes, are satisfied.

Universal GTP A universal GTP is a GTP G=(T,F) in which some solid edges may be labeled ‘EVERY’. The ‘SOME’ quantifier is already handled. E.g.:
FOR $o IN document(“auction.xml”)//open_auction
WHERE EVERY $b in $o/bidder SATISFIES $b/increase > 100
RETURN <result> {$o} </result>
Universal GTP: nodes with group numbers $o (0), $b (1), $i (2); the edge to $b is labeled EVERY.
F_L: pc($o,$b) & $b.tag = bidder
F_R: pc($b,$i) & $i.tag = increase & $i.content > 100
∀$b: [F_L → ∃$i: (F_R)]
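The quantified formula reads: every bidder child of the auction must have some increase child with content above 100. A minimal sketch of that check on toy data, assuming each auction is represented as a list holding, per bidder, the list of its increase values (this representation and the name satisfies_every are illustrative only):

```python
def satisfies_every(bidders, threshold=100):
    """True if every bidder has at least one increase value above the threshold.

    bidders: one list of increase values per bidder child of the auction.
    """
    return all(
        any(increase > threshold for increase in increases)
        for increases in bidders
    )


print(satisfies_every([[150], [120, 50]]))  # both bidders qualify
print(satisfies_every([[150], [80]]))       # second bidder fails
print(satisfies_every([]))                  # no bidders: vacuously true
```

An auction without bidders passes the check, which matches the usual reading of a universal quantifier over an empty sequence.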

Grammar for XQuery Fragment Function-free XQuery is captured by the following grammar:
FLWR ::= ForClause LetClause WhereClause ReturnClause.
ForClause ::= FOR $fv1 IN E1, … , $fvn IN En.
LetClause ::= LET $lv1 := E1, … , $lvn := En.
WhereClause ::= WHERE (E1, … , En).
ReturnClause ::= RETURN {E1} … {En}.
Ei ::= FLWR | XPATH.

Algorithm GTP
Input: a FLWR expression Exp, a context group number g
Output: a GTP or GTPs with a join formula
if (g’s last level != 0) let g = g + “.0”;
foreach (“For $fv in E”) do parse(E, g);
let ng = g;
foreach (“Let $lv := E”) do {
    let ng = ng + 1;
    parse(E, ng);
}
foreach predicate p in WHERE do {
    if (p is “every El satisfies Er”) {
        let ng = ng + 1; parse(El, ng);
        let F_L be the formula associated with the pattern resulting from El;
        let ng = ng + 1; parse(Er, ng);
        let F_R be the formula associated with the pattern resulting from Er;
    } else {
        foreach Ei as p’s argument do parse(Ei, g);
    }
}
foreach “{Ei}” do {
    let ng = ng + 1;
    parse(Ei, ng);
}
Procedure parse
Input: FLWR expression or XPath expression E, context group number g
Output: part of GTP resulting from E
if (E is FLWR expression) GTP(E, g);
else buildTPQ(E);
end procedure

Algorithm GTP The GTP can be informally understood as follows: 1) Find matches for all nodes connected to the root by only solid edges. 2) Next, find matches for the remaining nodes (whose path to the GTP root involves one or more dotted edges), if they exist.

Translating GTP Into an Evaluation Plan • Avoid repeated matching of similar tree patterns • Postpone the materialization of nodes as much as possible • Use operators and methods that are available in any XML database system

Physical algebra
• Index Scan ISp(S): outputs each node satisfying the predicate p, using an index over the input trees S.
• Filter Fp(S): given trees S, outputs only the trees satisfying the predicate p. Order is preserved.
• Sort Sb(S): sorts the input sequence of trees S on the sorting basis b.
• Value Join Jp(S1,S2): value-based comparison of the two input sequences of trees via the join predicate p. The output order follows the order of the left input S1.
• Structural Join SJr(S1,S2): the input tree sequences S1 and S2 must be sorted on node id. The operator joins S1 and S2 on the structural relationship r between them for each pair. Output is sorted by S1 or S2 as needed. Variants: Outer Structural Join (OSJ), where all of S1 is included in the output; Semi Structural Join (SSJ), where only S1 is retained in the output.
• Group By Gb(S): the input is sorted on the grouping basis b; groups trees on b.
• Merge M(S1,…,Sn): the Sj’s are assumed to have the same cardinality k. For each 1<=i<=k, merges tree i from each input under an artificial root and produces an output tree. Order is preserved.
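The three structural join variants differ only in what they keep from S1. The sketch below illustrates that difference for the ancestor-descendant relationship. It assumes a region encoding of nodes (start and end positions), which the slides do not specify, and it uses a plain nested loop for readability rather than the sorted, merge-based evaluation a real engine would use. Node and the function names are illustrative.

```python
from typing import NamedTuple


class Node(NamedTuple):
    tag: str
    start: int  # position of the opening tag (assumed region encoding)
    end: int    # position of the closing tag


def contains(anc, desc):
    """True if anc is a proper ancestor of desc under region encoding."""
    return anc.start < desc.start and desc.end < anc.end


def structural_join(s1, s2):
    """SJ: every (ancestor, descendant) pair, in S1 order."""
    return [(a, d) for a in s1 for d in s2 if contains(a, d)]


def outer_structural_join(s1, s2):
    """OSJ: like SJ, but an S1 node without a match is kept, paired with None."""
    out = []
    for a in s1:
        matches = [d for d in s2 if contains(a, d)]
        if matches:
            out.extend((a, d) for d in matches)
        else:
            out.append((a, None))
    return out


def semi_structural_join(s1, s2):
    """SSJ: only the S1 nodes that have at least one match."""
    return [a for a in s1 if any(contains(a, d) for d in s2)]


# Two persons; only the first one contains a profile.
persons = [Node("person", 1, 10), Node("person", 11, 14)]
profiles = [Node("profile", 2, 5)]

print(structural_join(persons, profiles))
print(outer_structural_join(persons, profiles))
print(semi_structural_join(persons, profiles))
```

In the plan for the example query, OSJ is what keeps a person in the result even when an optional part of the pattern has no match, while SSJ suits nodes that are only tested and never returned.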

Translating GTP to Physical Plans • Evaluation algorithm • The plan is a DAG where each node is a physical operator or an input document • Helper functions used: findOrder(SJs, $n), getGroupBasis(g), getGroupEvalOrder(G)

Stages of Evaluation Algorithm (7 steps) • Compute structural joins • Filter based on predicates depending on contents of more than 2 pattern nodes • Compute value joins • Compute aggregation • Filter based on predicates depending on aggregate values (if needed) • Compute value joins based on aggregate values (if needed) • Group return arguments (if any)

Physical plan from the GTP [Figure: operator DAG for the example query. Tag index scans on person, profile, state, age, watches, watch and interest; filters content != ‘MI’ on state and content > 25 on age; structural joins over person//state, profile/age, person/profile, person/watches, watches/watch and profile/interest; sorts and group-bys on (person, profile); a final merge of RETURN ARGUMENT #1 and RETURN ARGUMENT #2.] Legend: F: filter; IS: tag index scan; SSJ: structural semi-join; SJ: structural join; OSJ: outer structural join; S: sort; G: group by; M: merge.

Schema-Aware Optimization • Logical optimization: simplify the GTP by eliminating nodes using a DTD or XML schema • Physical optimization: eliminate redundant operators (e.g. sorting, duplicate elimination)

Schema-Aware Optimization Internal node elimination: a//b//c → a//c (pattern $a, $b, $c becomes $a, $c) if the schema implies every path from a to c passes through b. a/b/c → a//c?
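The condition "every path from a to c passes through b" can be tested on the schema alone. A minimal sketch, assuming the schema is abstracted as a graph that maps each tag to the set of tags allowed as its children (a simplification of a DTD or XML schema), and assuming $b carries no other predicate and is not returned. The names reachable and can_drop_internal are illustrative.

```python
def reachable(schema, src, dst, banned=None):
    """True if dst can be reached from src by one or more child steps, never entering banned."""
    seen, stack = set(), [src]
    while stack:
        tag = stack.pop()
        for child in schema.get(tag, ()):
            if child == banned or child in seen:
                continue
            if child == dst:
                return True
            seen.add(child)
            stack.append(child)
    return False


def can_drop_internal(schema, a, b, c):
    """a//b//c may be rewritten to a//c if c is reachable from a only via b."""
    return reachable(schema, a, c) and not reachable(schema, a, c, banned=b)


# Hypothetical schemas: in the first, every a-to-c path goes through b.
through_b = {"a": {"b"}, "b": {"c", "d"}, "d": {"c"}}
bypass_b = {"a": {"b", "c"}, "b": {"c"}}

print(can_drop_internal(through_b, "a", "b", "c"))
print(can_drop_internal(bypass_b, "a", "b", "c"))
```

Removing b from the graph and asking whether c is still reachable is a direct way to detect a path that bypasses b.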

Schema-Aware Optimization Identifying two nodes with the same tag
FOR $b IN …//book
WHERE $b/title = ‘DB’
RETURN <x> {$b/title} {$b/year} </x>
Pattern before: $b with $t, $t2, $y; after: $b with $t, $y. $t2 can be eliminated if the schema says every book has at most one title child.

Schema-Aware Optimization Eliminate redundant leaves FOR $a IN …./a[b] RETURN {$a/c} Pattern before: $a with $b, $c; after: $a with $c. $b can be eliminated if the schema implies every a has at least one b.

Schema-Aware Optimization Elimination of sorting: SJ person/profile over the inputs person and profile. Given two sorted inputs, the output will be in either person order or profile order, but in general not both. However, if the schema implies no person can have person descendants, the output of the structural join ordered by person node id will also be in profile node id order. Counterexample with persons “p1”, “p2” and profiles “l1”, “l2”: the join result {p1 – l2, p2 – l1} is not in both orders.
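The counterexample can be reproduced with a small script. It assumes a region encoding with levels (start, end, level), which is not given on the slide, and a layout in which p2 is nested inside p1 ahead of p1's own profile l2; both are assumptions chosen to yield the pairs {p1 – l2, p2 – l1}. All names are illustrative.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    start: int
    end: int
    level: int


def is_child(parent, child):
    """Parent-child test under the assumed (start, end, level) encoding."""
    return (parent.start < child.start and child.end < parent.end
            and child.level == parent.level + 1)


def join_sorted_by_parent(parents, children):
    """person/profile join, output ordered by the person node."""
    pairs = [(p, c) for p in parents for c in children if is_child(p, c)]
    return sorted(pairs, key=lambda pair: pair[0].start)


def also_sorted_by_child(pairs):
    """True if the same output happens to be in child (profile) order too."""
    starts = [c.start for _, c in pairs]
    return starts == sorted(starts)


# Nested persons: p2 sits inside p1, before p1's own profile l2.
nested = join_sorted_by_parent(
    [Node("p1", 1, 10, 1), Node("p2", 2, 5, 2)],
    [Node("l1", 3, 4, 3), Node("l2", 6, 7, 2)],
)
# No person inside a person: the two orders coincide.
flat = join_sorted_by_parent(
    [Node("p1", 1, 4, 1), Node("p2", 5, 8, 1)],
    [Node("l1", 2, 3, 2), Node("l2", 6, 7, 2)],
)

print([(p.name, c.name) for p, c in nested], also_sorted_by_child(nested))
print([(p.name, c.name) for p, c in flat], also_sorted_by_child(flat))
```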

Schema-Aware Optimization Elimination of group-by: {$l/interest}. In general, we must group the return argument results for the FOR variable. However, if the schema implies each profile has at most one interest subelement, then grouping on interest can be eliminated.

Schema-Aware Optimization Elimination of duplicate elimination: $p//watches//watch. If the schema implies watches cannot have watches descendants, the duplicate elimination is unnecessary. Example with watches “ws1”, “ws2” and watch “w1”, “w2”: ws1: {w1, w2}, ws2: {w2}, so w2 is produced twice. $p//watches/watch? Note: 1. t cannot have t descendants 2. A can only have one child B
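A short script shows where the duplicate comes from. It assumes nodes are given as (name, start, end) triples under a region encoding and that ws2 is nested inside ws1, which is the layout suggested by ws1: {w1, w2} and ws2: {w2}. The function name is illustrative.

```python
def watches_descendant_watch(watches, watch_nodes):
    """Naive //watches//watch: one output per (watches, watch) ancestor pair."""
    return [
        w_name
        for (_, ws_start, ws_end) in watches
        for (w_name, w_start, w_end) in watch_nodes
        if ws_start < w_start and w_end < ws_end
    ]


# ws2 nested inside ws1: w2 has two watches ancestors.
nested = watches_descendant_watch(
    [("ws1", 1, 10), ("ws2", 4, 9)],
    [("w1", 2, 3), ("w2", 5, 6)],
)
# No watches inside watches: every watch has one watches ancestor.
flat = watches_descendant_watch(
    [("ws1", 1, 6), ("ws2", 7, 12)],
    [("w1", 2, 3), ("w2", 8, 9)],
)

print(nested)  # w2 appears twice, so duplicate elimination is needed
print(flat)    # no duplicates, so the operator can be dropped
```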

GTP Simplification • Algorithm: pruneGTP(G) simplifies a GTP based on child/descendant constraints and avoidance constraints • Steps (4): • Detect emptiness of (sub)queries • Identify nodes with the same tag • Eliminate redundant leaves • Eliminate redundant internal nodes

Theorem 1 (Optimality) Let C be a set of child/descendant constraints and let G be a GTP. There is a unique GTP Hmin equivalent to G under C that has the smallest size among all equivalent GTPs. The GTP simplification algorithm correctly simplifies G to Hmin in polynomial time.

Experiments • TIMBER native XML database • XMark generated documents • P-III 866 MHz • Windows 2000 Professional • TIMBER had a 100 MB buffer pool • 5 executions per query; the max and min are dropped and the rest averaged • 479 MB XML document
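The timing protocol (five runs, drop the largest and smallest, average the rest) is simple arithmetic. A minimal sketch with a hypothetical helper name and made-up timings, not measurements from the paper:

```python
def trimmed_average(times):
    """Average after discarding one maximum and one minimum measurement."""
    if len(times) < 3:
        raise ValueError("need at least three measurements")
    kept = sorted(times)[1:-1]  # drop the smallest and the largest value
    return sum(kept) / len(kept)


# Made-up timings in seconds for five runs of one query.
print(trimmed_average([5.0, 1.0, 3.0, 2.0, 4.0]))
```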

Navigational & Base Plans • NAV • Traverses recursively, getting all children of a node and checking a condition or name before the next iteration • Dependent on path size and the number of children of each node • BASE • Straightforward tree pattern translation approach that uses set-at-a-time processing • Unlike GTP, does not make use of tree pattern reuse

Interesting Cases • Parameters: path length, number of return arguments, query selectivity, data materialization cost • GTP outperforms NAV and BASE on every query by one or two orders of magnitude • All algorithms are affected by path length, NAV the most • Query selectivity and the number of return arguments do not affect all algorithms: NAV performs the same iteration regardless • Data materialization cost affects both GTP and BASE, but has little effect on NAV

CPU Timings

Scalability • Used 24 MB, 47 MB, 239 MB, 479 MB, 2397 MB documents (Factor 1-5). Results: • GTP scales linearly with size of database

Schema-Aware Optimization Results • In some cases it greatly enhances performance, but it helps very little in others. • Works well when data materialization is not the dominating cost. • Beneficial when the path is of the form many/many/many and is converted to many//many.

Related Work • Navigation-based XQuery processing systems: Galax, Natix, Tamino, TIMBER • No optimization and plan generation systems for XQuery as a whole on native systems • GTP is 3-20 times faster than the TIMBER system • Research is ongoing on optimizing XPath expressions using TPQs and schema knowledge

Summary & Future Work • A novel structure called GTP is proposed • GTPs are used as a basis for physical plan generation and query optimization • GTP was compared with other methods in an extensive set of tests and was observed to win by at least an order of magnitude • An algorithm for schema-based simplification of GTPs was presented • Future work: evaluation of GTPs on relational XML systems as well as native systems

Thanks... Questions ?
