From tree patterns to generalized tree patterns: On efficient evaluation of XQuery

Z.M. Chen, H.V. Jagadish, L.V.S. Lakshmanan, S. Paparizos

(VLDB 2003)

Fatih Gön 2002701366

Mehmet Şenvar 2003700221

Bogazici University Department of Computer Engineering

Motivation: Current approaches to XQuery evaluation are not efficient.

A concise XQuery model is needed as the basis for generating an efficient physical evaluation plan.

Main contribution:

- Generalized Tree Patterns query model (GTP)
- Algorithm translating from function-free XQuery to GTP
- Physical algebra and algorithm translating from GTP to physical plan
- Schema-aware optimization of GTP and physical plan

Current approaches

Navigational plan (NAV): traverses down the path by recursively getting all child nodes and filtering out the unwanted ones before the next iteration.

Baseline plan (BASE): uses the TAX operator, which takes a tree pattern and a sequence of trees as input. Some tree patterns may be evaluated repeatedly.

Our approach

Generalized Tree Pattern (GTP): uses GTP as the XQuery model to generate an efficient evaluation plan.

[Figure: two tree patterns, each given as a tree T and a boolean formula F]

(a) Tree T with nodes $p, $s, $l, $g. Boolean formula F: $p.tag = person & $s.tag = state & $l.tag = profile & $g.tag = age & $g.content > 25 & $s.content != ‘MI’

(b) Tree T with nodes $p, $w, $t. Boolean formula F: $p.tag = person & $w.tag = watches & $t.tag = watch

FOR $p IN document(“auction.xml”)//person, $l IN $p/profile

WHERE $l/age > 25 AND $p//state != ‘MI’

RETURN <result> {$p//watches/watch} {$l/interest} </result>

(a) An XQuery example

[Figure: generalized tree pattern for the XQuery example in (a)]

Nodes and group numbers: $p (0), $s (0), $l (0), $g (0), $w (1), $t (1), $i (2)

Boolean formula F: $p.tag = person & $s.tag = state & $l.tag = profile & $i.tag = interest & $w.tag = watches & $t.tag = watch & $g.tag = age & $g.content > 25 & $s.content != ‘MI’

(b) Generalized tree pattern

GTP: A pair G=(T,F), where T is a tree and F is a boolean formula.

- Each node of T is labeled by a distinct variable and has an associated group number.
- Each edge of T has a pair of associated labels <x,m>, where x specifies the axis (pc or ad) and m specifies the edge status (mandatory or optional).
- F is a boolean combination of predicates applicable to nodes.
Group: each maximal set of nodes in a GTP connected to each other by paths not involving optional edges. By convention, group 0 includes the GTP root.
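The definition above can be pictured as a small data structure. The following Python sketch is illustrative only: the class `Node`, the helpers `add_edge` and `assign_groups`, and the flat integer numbering of groups are assumptions made for this example and do not come from the presentation. It builds the GTP of the running example, reading the edge axes off the XQuery paths, and derives each node's group from the optional edges.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    var: str                                    # pattern variable, e.g. "$p"
    edges: list = field(default_factory=list)   # (axis, optional, child) triples


def add_edge(parent, child, axis, optional=False):
    """Attach child under parent; axis is "pc" or "ad" (hypothetical helper)."""
    parent.edges.append((axis, optional, child))
    return child


def assign_groups(root):
    """Nodes linked by mandatory edges share a group; an optional edge opens a new one."""
    groups = {root.var: 0}      # by convention the root belongs to group 0
    next_id = 1
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _axis, optional, child in node.edges:
            if optional:
                groups[child.var] = next_id
                next_id += 1
            else:
                groups[child.var] = groups[node.var]
            queue.append(child)
    return groups


# GTP of the running example.
p = Node("$p")
s = add_edge(p, Node("$s"), "ad")                  # $p//state   (WHERE)
l = add_edge(p, Node("$l"), "pc")                  # $p/profile  (FOR)
g = add_edge(l, Node("$g"), "pc")                  # $l/age      (WHERE)
w = add_edge(p, Node("$w"), "ad", optional=True)   # $p//watches (RETURN)
t = add_edge(w, Node("$t"), "pc")                  # watches/watch
i = add_edge(l, Node("$i"), "pc", optional=True)   # $l/interest (RETURN)

groups = assign_groups(p)
```

Under these assumptions the FOR and WHERE nodes land in group 0 and each RETURN argument gets its own group.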

A pattern match of G into a collection of trees C is a partial mapping h: G → C such that:

- h is defined on all group 0 nodes.
- If h is defined on a node in a group, then it is necessarily defined on all nodes in that group.
- h preserves the structural relationships in G.
- h satisfies the boolean formula F.
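The four conditions can be checked mechanically. The sketch below is a hedged illustration, not the authors' implementation: it assumes data nodes are identified by Dewey-style tuples (so a child id extends its parent id by one component), and the function names, the `groups` and `edges` dictionaries, and the caller-supplied `formula` callable are all hypothetical. How predicates over unmapped variables are treated is left to `formula`.

```python
def is_child(a, d):
    """Assumed Dewey ids as tuples: d is a child of a."""
    return len(d) == len(a) + 1 and d[:len(a)] == a


def is_descendant(a, d):
    """Assumed Dewey ids as tuples: d is a proper descendant of a."""
    return len(d) > len(a) and d[:len(a)] == a


def is_pattern_match(h, groups, edges, formula):
    # 1. h must be defined on every group-0 node.
    if any(g == 0 and v not in h for v, g in groups.items()):
        return False
    # 2. Within a group, h is defined on all nodes or on none.
    for gid in set(groups.values()):
        members = [v for v, g in groups.items() if g == gid]
        mapped = [v for v in members if v in h]
        if mapped and len(mapped) < len(members):
            return False
    # 3. Structural relationships hold wherever both endpoints are mapped.
    for parent, child, axis in edges:
        if parent in h and child in h:
            related = is_child if axis == "pc" else is_descendant
            if not related(h[parent], h[child]):
                return False
    # 4. The boolean formula F, supplied by the caller, must hold.
    return formula(h)


groups = {"$p": 0, "$l": 0, "$i": 1}
edges = [("$p", "$l", "pc"), ("$l", "$i", "pc")]
always = lambda h: True   # stand-in for F

full = is_pattern_match({"$p": (0,), "$l": (0, 1), "$i": (0, 1, 0)}, groups, edges, always)
partial = is_pattern_match({"$p": (0,), "$l": (0, 1)}, groups, edges, always)
missing = is_pattern_match({"$p": (0,)}, groups, edges, always)
broken = is_pattern_match({"$p": (0,), "$l": (0, 1), "$i": (0, 2, 0)}, groups, edges, always)
```

Here `partial` is accepted because the optional group 1 is left entirely unmapped, while `missing` is rejected because a group-0 node has no image.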

A pattern match is a mapping from the pattern nodes to nodes in an XML database such that the formula associated with the pattern, as well as the structural relationships among pattern nodes, are satisfied.

A universal GTP is a GTP G=(T,F) in which some solid edges may be labeled ‘EVERY’.

The ‘SOME’ quantifier is already handled.

E.g. FOR $o IN document(“auction.xml”)//open_auction

WHERE EVERY $b in $o/bidder SATISFIES $b/increase > 100

RETURN <result> {$o} </result>

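[Figure: universal GTP for the query above]

Nodes and group numbers: $o (0), $b (1), $i (2); the edge from $o to $b is labeled EVERY.

F_L: pc($o,$b) & $b.tag = bidder

F_R: pc($b,$i) & $i.tag = increase & $i.content > 100

∀ $b: [F_L → ∃ $i: (F_R)]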
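The meaning of the EVERY edge can be spelled out on toy data. This Python sketch is only an illustration: the dictionary layout of an auction and the sample values are made up for the example, and `every_bidder_increases` is a hypothetical helper. It mirrors the query's condition that for every bidder there is some increase above 100.

```python
def every_bidder_increases(auction, threshold=100):
    """True when every bidder has some increase above the threshold."""
    return all(
        any(inc > threshold for inc in bidder["increase"])
        for bidder in auction["bidder"]
    )


# Made-up sample data, not taken from auction.xml.
auctions = [
    {"id": "a1", "bidder": [{"increase": [150]}, {"increase": [120]}]},
    {"id": "a2", "bidder": [{"increase": [150]}, {"increase": [50]}]},
    {"id": "a3", "bidder": []},   # no bidders: EVERY holds vacuously
]

selected = [a["id"] for a in auctions if every_bidder_increases(a)]
```

Note that an auction without bidders is selected, since a universal condition over an empty sequence is true.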

Function-free XQuery captured by the following grammar

FLWR ::= ForClause LetClause WhereClause ReturnClause.

ForClause ::= FOR $fv1 IN E1, … , $fvn IN En.

LetClause ::= LET $lv1 := E1, … , $lvn := En.

WhereClause ::= WHERE φ(E1, … , En).

ReturnClause ::= RETURN {E1} … {En}.

Ei ::= FLWR | XPATH.

Algorithm GTP

Input: a FLWR expression Exp, a context group number g

Output: a GTP or GTPs with a join formula

if (g’s last level != 0)
    let g = g + “.0”;
foreach (“For $fv in E”) do
    parse(E, g);
let ng = g;
foreach (“Let $lv := E”) do {
    let ng = ng + 1;
    parse(E, ng);
}
foreach predicate p in WHERE do {
    if (p is “every El satisfies Er”) {
        let ng = ng + 1;
        parse(El, ng);
        let F_L be the formula associated with the pattern resulting from El;
        let ng = ng + 1;
        parse(Er, ng);
        let F_R be the formula associated with the pattern resulting from Er;
    }
    else {
        foreach Ei as p’s argument do
            parse(Ei, g);
    }
}
foreach “{Ei}” do {
    let ng = ng + 1;
    parse(Ei, ng);
}

Procedure parse

Input: FLWR expression or XPath expression E, context group number g

Output: Part of GTP resulting from E

if (E is FLWR expression)
    GTP(E, g);
else
    buildTPQ(E);
end procedure
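The group-numbering part of the translation can be traced in a few lines of Python. This is a sketch under stated assumptions, not the authors' code: a FLWR expression is represented as a hypothetical dictionary with `for`, `let`, `where` and `return` lists, XPath expressions are plain strings, WHERE predicates are tuples whose first element is a label, and building the actual tree patterns (buildTPQ) is left out. Only the assignment of group numbers to sub-expressions is shown.

```python
def bump(g):
    """Increment the last level of a dotted group number: '1.0' -> '1.1'."""
    head, _, last = g.rpartition(".")
    nxt = str(int(last) + 1)
    return f"{head}.{nxt}" if head else nxt


def assign_group_numbers(flwr, g="0", out=None):
    """Return (expression, group number) pairs in the order they are parsed."""
    out = [] if out is None else out
    if g.rsplit(".", 1)[-1] != "0":      # g's last level != 0
        g += ".0"

    def parse(expr, grp):
        if isinstance(expr, dict):       # nested FLWR expression
            assign_group_numbers(expr, grp, out)
        else:                            # XPath: record its group only
            out.append((expr, grp))

    for e in flwr.get("for", []):
        parse(e, g)
    ng = g
    for e in flwr.get("let", []):
        ng = bump(ng)
        parse(e, ng)
    for pred in flwr.get("where", []):
        if pred[0] == "every":           # ("every", El, Er)
            _, el, er = pred
            ng = bump(ng)
            parse(el, ng)
            ng = bump(ng)
            parse(er, ng)
        else:                            # (label, E1, ..., En)
            for e in pred[1:]:
                parse(e, g)
    for e in flwr.get("return", []):
        ng = bump(ng)
        parse(e, ng)
    return out


example = {
    "for": ["//person", "$p/profile"],
    "where": [("gt", "$l/age"), ("ne", "$p//state")],
    "return": ["$p//watches/watch", "$l/interest"],
}
nested = {"for": ["//a"], "return": [{"for": ["$a/b"], "return": ["$b/c"]}]}

example_groups = assign_group_numbers(example)
nested_groups = assign_group_numbers(nested)
```

For the running example, the FOR and WHERE expressions stay in group 0 and the two RETURN arguments receive groups 1 and 2; a FLWR nested inside a RETURN argument starts a new level such as 1.0.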


The GTP can be informally understood as follows:

1) Find matches for all nodes connected to the root by only solid edges.

2) Next, find matches for the remaining nodes (whose path to the GTP root involves one or more dotted edges), if they exist.

- Avoid repeated matching of similar tree patterns
- Postpone the materialization of nodes as much as possible
- Operators and methods are available in any XML database system

Index Scan IS_p(S): outputs each node satisfying the predicate p, using an index over the input trees S.

Filter F_p(S): given trees S, outputs only the trees satisfying the predicate p. Order is preserved.

Sort S_b(S): sorts the input sequence of trees S on the sorting basis b.

Value Join J_p(S1,S2): performs a value-based comparison of the two input sequences of trees via the join predicate p. The output order follows the order of the left input S1.

Structural Join SJ_r(S1,S2): the input tree sequences S1 and S2 must be sorted on node id. The operator joins S1 and S2 on the structural relationship r that must hold for each output pair. The output is sorted on S1 or S2 as needed. Variants: Outer Structural Join (OSJ), where every tree of S1 is included in the output, and Structural Semi-Join (SSJ), where only S1 is retained in the output.

Group By G_b(S): the input must be sorted on the grouping basis b. Groups the trees on the grouping basis b.

Merge M(S1,…,Sn): the Sj are assumed to have the same cardinality k. For each 1<=i<=k, merges tree i from each input under an artificial root and produces an output tree. Order is preserved.
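To make the three structural join variants concrete, here is a minimal Python sketch. It makes several assumptions that are not in the slides: nodes carry a (start, end) region encoding, the relationship r is ancestor-descendant, and the join is a naive nested loop rather than the merge-style algorithms a real system would run over sorted inputs. The function names and sample intervals are hypothetical.

```python
def contains(a, d):
    """Assumed region encoding (start, end): a is an ancestor of d."""
    return a[0] < d[0] and d[1] < a[1]


def structural_join(left, right, mode="inner"):
    """Naive nested-loop join; output follows the order of the left input."""
    out = []
    for a in left:
        matches = [d for d in right if contains(a, d)]
        if mode == "semi":
            if matches:
                out.append(a)              # SSJ: only the S1 node is retained
        elif mode == "outer" and not matches:
            out.append((a, None))          # OSJ: unmatched S1 nodes are kept
        else:
            out.extend((a, d) for d in matches)
    return out


# Made-up intervals: the third person has no profile.
persons = [(1, 10), (11, 20), (21, 30)]
profiles = [(2, 5), (12, 15), (16, 19)]

inner = structural_join(persons, profiles)
outer = structural_join(persons, profiles, mode="outer")
semi = structural_join(persons, profiles, mode="semi")
```

The outer variant differs from the inner one only by the extra row for the person without a profile, which is what lets optional GTP groups stay empty without discarding the match.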

- Evaluation Algorithm
- Plan is a DAG where each node is a physical operator or input document
- Helper functions used: findOrder(SJs, $n), getGroupBasis(g), getGroupEvalOrder(G)

- Compute structural joins
- Filter based on predicates depending on contents of more than 2 pattern nodes
- Compute value joins
- Compute aggregation
- Filter based on predicates depending on aggregate values (if needed)
- Compute value joins based on aggregate values (if needed)
- Group return arguments (if any)

[Figure: physical plan for the example query. The plan uses tag index scans on person, profile, state, age, watches, watch and interest; filters content != ‘MI’ and content > 25; structural joins on person//state, profile/age, person/profile, person/watches, watches/watch and profile/interest; and sorts and group-bys on (person, profile). A merge combines return argument #1 and return argument #2.]

F : filter

IS : tag index scan

SSJ : structural semi-join

SJ : structural join

OSJ : outer structural join

S : sort

G : group by

M : merge

- Logical Optimization
- simplify GTP by eliminating nodes using DTD or XML schema

- Physical Optimization
- eliminate duplicate operators (e.g. sorting, duplicate elimination)

Internal node elimination

a//b//c → a//c,

if schema implies every path from a to c passes through b.

[Figure: pattern with nodes $a, $b, $c reduced to $a, $c]

a/b/c → a//c?

Identifying two nodes with same tag

FOR $b IN …//book

WHERE $b/title = ‘DB’

RETURN <x> {$b/title} {$b/year} </x>

[Figure: pattern with nodes $b, $t, $t2, $y reduced to $b, $t, $y]

$t2 can be eliminated if schema says every book has at most one title child.

Eliminate redundant leaves

FOR $a IN …./a[b]

RETURN {$a/c}

[Figure: pattern with nodes $a, $b, $c reduced to $a, $c]

$b can be eliminated if schema implies every a has at least one b.

Elimination of sorting

[Figure: structural join SJ on person/profile over the inputs person and profile; example data with person nodes “p1”, “p2” and profile nodes “l1”, “l2”]

Given two sorted inputs, the output will be in either person order or profile order, but in general not both. In the example, the join result {p1 – l2, p2 – l1} is not in both orders.

However, if schema implies no person can have person descendants, output of the structural join ordered by person node id will also be in profile node id order.
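The ordering problem can be reproduced with a tiny worked example. The Python sketch below assumes a particular layout that the slide does not spell out: person p2 is nested inside person p1, profile l1 is a child of p2, and profile l2 is a child of p1 that comes after p2 in document order. The (start, end, level) triples and the helper `is_child` are hypothetical.

```python
# Assumed (start, end, level) region encoding for the four nodes.
p1, p2 = (1, 10, 1), (2, 5, 2)    # p2 is a person nested inside p1
l1, l2 = (3, 4, 3), (6, 7, 2)     # l1 is a child of p2, l2 a child of p1


def is_child(parent, child):
    """Parent-child test: containment plus a level difference of one."""
    return (parent[0] < child[0] and child[1] < parent[1]
            and child[2] == parent[2] + 1)


pairs = [(p, l) for p in (p1, p2) for l in (l1, l2) if is_child(p, l)]

by_person = sorted(pairs, key=lambda pair: pair[0][0])
by_profile = sorted(pairs, key=lambda pair: pair[1][0])

profile_starts = [l[0] for _, l in by_person]     # profile order within person order
person_starts = [p[0] for p, _ in by_profile]     # person order within profile order
```

Sorting by person leaves the profiles out of order and vice versa, so an extra sort is needed unless the schema rules out nested person elements.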

Elimination of group-by

{$l/interest}

We must group the return argument results for the FOR variable in general.

However, if schema implies each profile has at most one interest subelement, then grouping on interest can be eliminated.

Elimination of duplicate elimination

$p//watches//watch

[Figure: example data with watches nodes “ws1”, “ws2” and watch nodes “w1”, “w2”, where ws1: {w1,w2} and ws2: {w2}]

If schema implies watches cannot have watches descendants, the duplicate elimination is unnecessary.

$p//watches/watch?

Note: 1. t cannot have t descendants

2. A can only have one child B

- Algorithm: pruneGTP(G) simplifies a GTP based on child/descendant constraints and avoidance constraints

- Steps (4)
- Detect emptiness of (sub)queries
- Identify nodes with same tag
- Eliminate redundant leaves
- Eliminate redundant internal nodes
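One of the four steps, eliminating redundant leaves, can be sketched as follows. This is not pruneGTP itself but a simplified illustration under explicit assumptions: the pattern is a hypothetical dictionary mapping each variable to its tag and parent variable, all edges are mandatory parent-child edges, `required_child` lists (parent tag, child tag) pairs for which the schema guarantees at least one such child, and `keep` names the variables that are returned or carry value predicates.

```python
def prune_redundant_leaves(pattern, required_child, keep):
    """Drop leaf nodes whose existence is already guaranteed by the schema."""
    pruned = dict(pattern)
    changed = True
    while changed:
        changed = False
        # Variables that still have children are not leaves.
        parents = {parent for _tag, parent in pruned.values() if parent is not None}
        for var, (tag, parent) in list(pruned.items()):
            if parent is None or var in parents or var in keep:
                continue
            if (pruned[parent][0], tag) in required_child:
                del pruned[var]          # pure existence test, implied by schema
                changed = True
                break                    # recompute the leaf set
    return pruned


# FOR $a IN .../a[b] RETURN {$a/c}, with a schema where every a has a b child.
pattern = {"$a": ("a", None), "$b": ("b", "$a"), "$c": ("c", "$a")}
result = prune_redundant_leaves(pattern, {("a", "b")}, keep={"$a", "$c"})
unchanged = prune_redundant_leaves(pattern, set(), keep={"$a", "$c"})
```

Without the schema constraint the pattern is returned unchanged, since the [b] test then genuinely filters the a elements.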

Let C : set of child/descendant constraints

Let G : GTP

There is a unique GTP Hmin equivalent to G under C that has the smallest size among all GTPs equivalent to G.

The GTP simplification algorithm correctly simplifies G to Hmin in polynomial time.

- TIMBER native XML database
- XMark-generated documents
- P-III 866 MHz
- Windows 2000 Professional
- TIMBER had a 100 MB buffer pool
- Each query was executed 5 times; the maximum and minimum were discarded and the remaining times averaged
- 479 MB XML document
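The timing procedure (five runs, drop the largest and smallest, average the rest) amounts to a trimmed mean. The short Python sketch below shows the arithmetic; the function name `trimmed_mean` and the sample timings are made up for illustration and are not measurements from the presentation.

```python
def trimmed_mean(times):
    """Average after discarding one maximum and one minimum."""
    if len(times) < 3:
        raise ValueError("need at least three measurements")
    ordered = sorted(times)
    kept = ordered[1:-1]              # drop the smallest and the largest
    return sum(kept) / len(kept)


# Hypothetical timings in seconds for five executions of one query.
runs = [4.0, 20.0, 5.0, 7.0, 6.0]
average = trimmed_mean(runs)          # (5.0 + 6.0 + 7.0) / 3
```

Dropping the extremes keeps a single slow outlier, such as the 20.0 above, from dominating the reported time.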

- NAV
- Traverses recursively, getting all children of a node and checking a condition or name before the next iteration
- Dependent on path size and the number of children of each node

- BASE
- Straightforward tree pattern translation approach that uses set-at-a-time processing
- Unlike GTP, does not reuse tree patterns

- Parameters: path length, number of return arguments, query selectivity, data materialization cost
- GTP outperforms NAV and BASE on every query by one or two orders of magnitude
- All algorithms are affected by path length, NAV the most
- Query selectivity and the number of return arguments do not affect all algorithms: NAV performs the same iteration regardless
- Data materialization cost affects both GTP and BASE, but NAV much less

- Used 24 MB, 47 MB, 239 MB, 479 MB, 2397 MB documents (Factor 1-5). Results:
- GTP scales linearly with size of database

- In some cases schema-aware optimization greatly enhances performance, but in others it helps very little.
- It works well when data materialization is not the dominating cost.
- It is beneficial when a path of the form many/many/many is converted to many//many.

- Navigation-based XQuery processing systems: Galax, Natix, Tamino, TIMBER
- No existing work addresses optimization and plan generation for XQuery as a whole on native systems
- GTP is 3-20 times faster than the TIMBER system
- Research is ongoing on optimizing XPath expressions using TPQs and schema knowledge

- A novel structure called GTP is proposed
- GTPs are used as a basis for physical plan generation and query optimization
- GTP was compared with other methods in an extensive set of tests and was observed to win by at least an order of magnitude
- Presented an algorithm for schema-based simplification of GTPs
- Evaluation of GTP on relational XML systems as well as native systems

Questions ?