Optimizing Cursor Movement in Holistic Twig Joins
This study presents TwigOptimal, a holistic twig join algorithm for efficient XQuery processing, with a focus on cursor movement optimization and extraction points. Experimental results show significant performance improvements over existing algorithms.
Optimizing Cursor Movement in Holistic Twig Joins • Marcus Fontoura, Vanja Josifovski, Eugene Shekita (IBM Almaden Research Center), Beverly Yang (Stanford) • CIKM 2005
Motivation • Example query: for $a in //article[year = "2005" or keyword = "XML"] for $s in $a/section return $s/title • In an index-based method, 7 tag and text elements need to be verified to process this query • Running time is dominated by the I/O for manipulating these cursors • Twig join algorithms are not optimized for I/O and do not exploit the query's extraction points • [Figure: query tree with nodes article, AND, OR, section, year, keyword, title, "2005", "XML"]
Our Contributions • TwigOptimal, a new holistic twig join algorithm that supports a large fraction of XQuery (including AND/OR branches) • Description of how extraction points improve query performance • Experimental evaluation that shows how TwigOptimal outperforms current algorithms
Agenda • Background • TwigOptimal algorithm • Experimental results • Conclusions
XML Indexing • Begin/End/Level (BEL) encoding • Begin: preorder position of tag/text • End: preorder position of last descendant • Level: depth • Containment: X contains Y iff X.begin < Y.begin <= X.end (assuming well-formed) • [Figure: example document tree with nodes R, A1, B1, B2, B3, C1, C2, D1, labeled with the (begin, end, level) triples (0,7,0), (1,5,1), (6,7,1), (3,5,2), (7,7,2), (2,2,2), (4,4,3), (5,5,3)]
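The containment rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `BEL` type and the `contains` and `is_parent` helpers are hypothetical names, and the parent test (containment plus a level difference of one) is an assumption added here rather than something the slide states. The triples are taken from the figure without assuming which node name each belongs to.

```python
from typing import NamedTuple


class BEL(NamedTuple):
    """Begin/End/Level position of a tag or text element (hypothetical type)."""
    begin: int
    end: int
    level: int


def contains(x: BEL, y: BEL) -> bool:
    # Slide rule: X contains Y iff X.begin < Y.begin <= X.end.
    return x.begin < y.begin <= x.end


def is_parent(x: BEL, y: BEL) -> bool:
    # Assumption (not on the slide): a parent is a container one level up.
    return contains(x, y) and y.level == x.level + 1


root = BEL(0, 7, 0)
inner = BEL(3, 5, 2)
leaf = BEL(4, 4, 3)

print(contains(root, inner))                  # root spans positions 1..7
print(is_parent(inner, leaf))                 # level 2 directly above level 3
print(contains(BEL(1, 5, 1), BEL(6, 7, 1)))   # disjoint siblings
```

Because the test uses only integer comparisons on the posting itself, containment can be decided without visiting the document.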
Basic Access Path • Inverted lists • Posting: <Token, Location> • Token = <term/tag> • Location = <DocumentID, Position> • Supported method on cursor: CB.forwardTo(Position p) • [Figure: document tree with nodes R, A1, B1, B2, B3, C1, C2, D1, next to the inverted-list entries B1, B2, B3 and C1, C2]
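A cursor over one inverted list might look like the sketch below. The slide names only the forward-seek method, so everything else is an assumption: the class name `Cursor`, the use of plain integers in place of <DocumentID, Position> locations, and the reading of the seek as "advance to the first posting at or after p, never backward".

```python
import bisect


class Cursor:
    """Forward-only cursor over a sorted posting list (hypothetical sketch)."""

    def __init__(self, positions):
        self.positions = sorted(positions)
        self.index = 0

    def at_end(self):
        return self.index >= len(self.positions)

    def current(self):
        # Position under the cursor, or None once the list is exhausted.
        return None if self.at_end() else self.positions[self.index]

    def forward_to(self, p):
        # Assumed semantics: seek to the first posting >= p.
        # Searching from self.index onward keeps the cursor from moving back.
        self.index = bisect.bisect_left(self.positions, p, lo=self.index)
        return self.current()


cb = Cursor([2, 3, 6])
print(cb.forward_to(4))  # lands on the first posting at or after 4
print(cb.forward_to(1))  # a smaller target leaves the cursor where it is
```

In a disk-based index each such seek can cost I/O, which is why the later slides concentrate on avoiding unnecessary cursor moves.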
Joins in XML • Structural (Containment) Joins • Twig Joins • [Figure: twig query A || B with branches B || C and B || D; example joins shown: A || B, B || C, B || D, A || B || C]
LocateExtension • “Extension” (w.r.t. query node q) – a solution for the subquery rooted at q • Input: q • Result: the cursors of all descendants of q point to an extension for q • [Figure: twig query A || B with branches B || C and B || D, over document elements A, B1, B3, C1, C2, D1, D2, X1, X2]
LocateExtension • while (not end(q) && not hasExtension(q)) { (p, c) = PickBrokenEdge(q); ZigZagJoin(p, c); } • [Figure: twig query A || B with branches B || C and B || D, over document elements A, B1, B3, C1, C2, D1, D2, X1, X2]
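The slide does not spell out ZigZagJoin. The sketch below assumes it repairs one broken edge by alternately advancing the parent and child cursors until the parent contains the child. The function name `zigzag_join`, the (begin, end) tuples, and the one-step-at-a-time advance are all illustrative choices; a real implementation would presumably use the cursor's forward seek to skip postings rather than stepping through them.

```python
def zigzag_join(parents, children):
    """Return the first (parent, child) pair where parent contains child.

    Both inputs are lists of (begin, end) tuples sorted by begin.
    Hypothetical sketch of one reading of ZigZagJoin, not the authors' code.
    """
    i = j = 0
    while i < len(parents) and j < len(children):
        p_begin, p_end = parents[i]
        c_begin, _ = children[j]
        if c_begin > p_end:
            # Parent ends before the child starts: advance the parent cursor.
            i += 1
        elif c_begin <= p_begin:
            # Child starts at or before the parent: advance the child cursor.
            j += 1
        else:
            # p_begin < c_begin <= p_end, so the edge is no longer broken.
            return parents[i], children[j]
    return None  # one of the lists is exhausted: no extension on this edge


print(zigzag_join([(1, 2), (6, 9)], [(4, 4), (7, 7)]))
```

Each branch discards a posting that cannot take part in any later match, so the two cursors only ever move forward.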
TwigOptimal Algorithm • Tests whether the cursor with the minimal location has an extension • If not, tries to move cursors virtually until they form an extension • Moves cursors physically only if no further virtual move is possible • A virtual move just sets the begin value of the cursor, so no I/O is involved: • Cq.begin = new begin value for Cq; • Cq.virtual = true; //indicates that the cursor is virtual
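The difference between a virtual and a physical move can be sketched as follows. Only the `begin` and `virtual` fields come from the slide. The class name, the `seeks` counter standing in for I/O, and the rule that a physical move lands on the first posting at or after the virtual begin value are assumptions made for illustration.

```python
import bisect


class Cursor:
    """Cursor that supports virtual moves (hypothetical sketch)."""

    def __init__(self, begins):
        self.begins = sorted(begins)
        self.index = 0
        self.begin = self.begins[0] if self.begins else None
        self.virtual = False
        self.seeks = 0  # stand-in for the I/O cost of touching the list

    def virtual_move(self, new_begin):
        # As on the slide: set the begin value and flag the cursor as virtual.
        # The posting list is not touched, so no I/O is counted.
        self.begin = new_begin
        self.virtual = True

    def physical_move(self):
        # Assumed behaviour: materialize the cursor on the first real posting
        # at or after its (possibly virtual) begin value.
        if self.begin is None:
            return
        self.seeks += 1
        self.index = bisect.bisect_left(self.begins, self.begin, lo=self.index)
        at_end = self.index >= len(self.begins)
        self.begin = None if at_end else self.begins[self.index]
        self.virtual = False


c = Cursor([1, 4, 9, 20])
c.virtual_move(5)
c.virtual_move(10)   # two virtual moves, still no seek
c.physical_move()    # a single seek resolves both
print(c.begin, c.seeks)
```

Several virtual moves collapse into one seek, which is the effect the algorithm relies on to reduce I/O.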
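Checking Extension • We have an extension for cursor q if: • All cursors underneath q are properly aligned • All cursors underneath q have physical locations • [Figure: twig query A || B with branches B || C and B || D, over document elements A, B1, B3, C1, C2, D1, D2, X1, X2; for the cursor positions shown, the check returns false]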
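Checking Extension • We have an extension for cursor q if: • All cursors underneath q are properly aligned • All cursors underneath q have physical locations • [Figure: twig query A || B with branches B || C and B || D, over document elements A, B1, B3, C1, C2, D1, D2, X1, X2; for the cursor positions shown, the check returns true]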
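The two conditions can be sketched as a recursive check over the query tree. This is an illustration under stated assumptions: "properly aligned" is read as every parent cursor containing each of its child cursors, only ancestor-descendant AND branches are handled, and `QueryNode` and `has_extension` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    """Query node together with its cursor's current position (hypothetical)."""
    name: str
    begin: int
    end: int
    virtual: bool = False
    children: list = field(default_factory=list)


def has_extension(q: QueryNode) -> bool:
    for child in q.children:
        # Condition 2: every cursor underneath q must be physical.
        if child.virtual:
            return False
        # Condition 1 (assumed meaning): the parent must contain the child.
        if not (q.begin < child.begin <= q.end):
            return False
        if not has_extension(child):
            return False
    return True


c = QueryNode("C", 3, 3)
d = QueryNode("D", 4, 4)
b = QueryNode("B", 2, 6, children=[c, d])
a = QueryNode("A", 1, 10, children=[b])
print(has_extension(a))  # all edges aligned, all cursors physical

d.virtual = True
print(has_extension(a))  # a virtual cursor underneath A blocks the extension
```

A virtual cursor fails the check even when its begin value would align, since its position has not yet been confirmed against the inverted list.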
Moving Cursors • Two passes over the query tree • Bottom-up: move each parent cursor forward so it contains the children cursors • Top-down: move the children cursors forward so they are contained by their parents
Move Cursors Example • Query = //x[.//y and .//z] • [Figure: cursor moves numbered 1 to 7 over the lists x1, x2; y1, y2, y3, y4, y5; z1, z2; legend: physical move, virtual move]
Comparing with TSGeneric+ • Query = //w//x//y//z • [Figure: lists w1, w2; x1 ... x50; y1 ... y100; z1, z2; legend: current cursor position, physical move, virtual move]
Comparing with TSGeneric+ • Query = //w//x//y//z • [Figure: lists w1, w2; x1 ... x50; y1 ... y100; z1, z2; legend: current cursor position, physical move]
Extraction Points Optimization • If neither q nor its descendants in the query are extraction points, we can virtually move these cursors within q’s parent • [Figure: query A with branches B and C, over the lists A1, A2; B1, B2, B3; C1, C99, C100]
Prototype • Implemented over Berkeley DB B-tree • Inverted lists • Posting: <Token, Location> • Token = <term/tag> • Location = <DocumentID, Position> • Position is BEL
Data Sets • XMark: 10 documents of ~100 MB each • Synthetic: 4 tags (W, X, Y, Z); uncorrelated, no self-nesting; same frequency
Conclusion • TwigOptimal algorithm outperforms existing twig join algorithms by more than 40%, especially for larger queries • Optimized for I/O, which is the performance bottleneck • Extraction points optimization improves performance