Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents
Songting Chen, Hua-Gang Li *, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal and K. Selcuk Candan
NEC Laboratories America
* University of California, Santa Barbara

Background
• XML
  • Hierarchical (tree) structured data
  • Provides flexibility to model semi-structured data
  • Widely accepted as a universal data exchange format
• Query over XML
  • XPath, XQuery [W3C]
  • Extensively used by many applications
  • Adopted by a number of commercial systems
VLDB 2006, Seoul, Korea

State-of-the-art: XML Query Processing
• Algebraic Approach
  • Binary Structure Joins [Timber] – Large intermediate results
  • Optimize multiple path expressions of XQuery [Chen et al.] – Expensive post-processing
• Holistic Approach
  • PathStack [Bruno et al.]
  • TwigStack [Bruno et al.]
  • ? Twig2Stack

Processing Generalized Tree Pattern (GTP) Queries
• XQuery:
    FOR $b in //A[E]/B, $d in $b/$D
    LET $c = $b/C
    RETURN $b, $c, $d
• Operations needed by existing approaches:
  • Structural Joins
  • Structural Outer Joins – Grouping
  • Duplicate Elimination
  • Sort
• Our goal: Avoid ALL these!
[Figure: GTP query tree over the nodes A, B, C, D, with example elements a1, a2, b1 for //A//B and a2, b1, b2 for //A/B]

Motivation: PathStack [Bruno et al.]
• Query: //A//B
• Data: [Figure: elements a1, a2, b1, b2 and the stacks S[A] and S[B]]
• Key observation: minimize intermediate results through a compact representation of path matches, by
  • Inter-node: recording the AD (ancestor-descendant) relationship between elements in different query nodes, e.g., b1→a2, b2→a2
  • Intra-node: recording the AD relationship between elements within the same query node, e.g., b1, b2
• TwigStack [Bruno et al.] minimizes intermediate results by:
  • Outputting only those path matches that are in the final twig results
  • However, such optimality cannot be guaranteed [Choi et al.]
  • Not helpful for processing GTP queries
• Question: can we minimize intermediate results for twig queries through compact result encoding (similar to PathStack)?
  • Useful for processing GTP queries as well?

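The compact path encoding described above can be sketched for the two-node query //A//B. This is a minimal illustration, not the PathStack implementation from [Bruno et al.]: the region labels (start, end) are hypothetical values chosen so that a1 contains a2, which contains b1, which contains b2, and the names `Elem` and `path_stack_ab` are assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical region-encoded elements: a1 > a2 > b1 > b2 (nested chain).
Elem = namedtuple("Elem", "name tag start end")
DOC = [
    Elem("a1", "A", 1, 8),
    Elem("a2", "A", 2, 7),
    Elem("b1", "B", 3, 6),
    Elem("b2", "B", 4, 5),
]

def path_stack_ab(elements):
    """Simplified stack encoding for //A//B.

    Returns the inter-node pointers (B element -> top of S[A] at push time)
    and the path matches obtained by expanding those pointers.
    """
    s_a, s_b = [], []
    pointers, matches = {}, []
    for e in sorted(elements, key=lambda x: x.start):  # document (pre-) order
        # Pop elements that ended before e starts: they cannot contain e.
        for stack in (s_a, s_b):
            while stack and stack[-1].end < e.start:
                stack.pop()
        if e.tag == "A":
            s_a.append(e)
        elif e.tag == "B" and s_a:
            s_b.append(e)
            # One pointer encodes all matches: every entry at or below
            # the pointed-to element in S[A] is an ancestor of e.
            pointers[e.name] = s_a[-1].name
            matches.extend((a.name, e.name) for a in s_a)
    return pointers, matches

pointers, matches = path_stack_ab(DOC)
print(pointers)
print(matches)
```

Two pointers are enough to represent all four (A, B) pairs in this data, which is the saving the slide refers to.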
Hierarchical Stack Encoding
• Inter-node: //A//B
  • Can still use explicit edges
• Intra-node: A
  • Matching elements form a tree structure as well
  • Associate each query node with a hierarchical stack
• Push element e into hierarchical stack HS[E] iff e satisfies the sub-twig query rooted at E
  • Matching can be determined once the entire sub-tree of e has been seen
  • Requires post-order document traversal
[Figure: elements a1, a2, a3, a4 in the document and in HS[A]]

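The intra-node idea can be sketched as follows: elements arrive in post-order, and pushing an element absorbs the previously pushed elements that it contains, so the stack ends up mirroring the document hierarchy. This is a simplified sketch under stated assumptions: the document shape (a1 contains a2, which contains a3 and a4) and the region labels are hypothetical, and each stack-tree node here holds a single element rather than a stack of elements.

```python
from collections import namedtuple

Elem = namedtuple("Elem", "name start end")

def contains(a, d):
    """True if region a strictly contains region d (a is an ancestor of d)."""
    return a.start < d.start and d.end < a.end

class StackTree:
    def __init__(self, elem):
        self.elem = elem
        self.children = []

    def nested(self):
        """Nested-tuple view of the tree, convenient for inspection."""
        return (self.elem.name, [c.nested() for c in self.children])

class HierarchicalStack:
    """Simplified hierarchical stack: a forest ordered by document position."""

    def __init__(self):
        self.roots = []

    def push(self, elem):
        # Elements must be pushed in post-order, so any descendants of elem
        # that were pushed earlier sit at the end of self.roots.
        node = StackTree(elem)
        while self.roots and contains(elem, self.roots[-1].elem):
            node.children.insert(0, self.roots.pop())
        self.roots.append(node)

# Hypothetical regions for a1 > a2 > {a3, a4}.
a1, a2 = Elem("a1", 1, 8), Elem("a2", 2, 7)
a3, a4 = Elem("a3", 3, 4), Elem("a4", 5, 6)

hs_a = HierarchicalStack()
for e in sorted([a1, a2, a3, a4], key=lambda x: x.end):  # post-order
    hs_a.push(e)

print([r.nested() for r in hs_a.roots])
```

Sorting by end position yields post-order because every descendant closes before its ancestor does.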
Twig2Stack: Running Example
[Figure: query tree A, B, with D and C below B; document tree with region labels [start,end], level: a1 [1,20], 1; a2 [2,15], 2; b3 [16,19], 2; b1 [3,14], 3; d3 [17,18], 3; d1 [4,11], 4; c2 [12,13], 4; b2 [5,10], 5; d2 [6,7], 6; c1 [8,9], 6; hierarchical stacks HS[A], HS[B], HS[D], HS[C]; step shown: Merging Stacks]
• TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together.
• Twig2Stack requires neither path joins nor path enumeration!

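The running example can be traced with a small sketch. It is an illustration under assumptions, not the Twig2Stack implementation: the pairing of region labels with element names is reconstructed from the flattened figure, the query dictionary format and the optional C edge (suggested by the LET clause of the GTP query) are assumptions, and each HS[E] is kept as a flat list instead of a hierarchical stack.

```python
from collections import namedtuple

Elem = namedtuple("Elem", "name tag start end level")

# Region labels from the running example (name pairing is reconstructed).
DOC = [
    Elem("a1", "A", 1, 20, 1), Elem("a2", "A", 2, 15, 2),
    Elem("b1", "B", 3, 14, 3), Elem("d1", "D", 4, 11, 4),
    Elem("b2", "B", 5, 10, 5), Elem("d2", "D", 6, 7, 6),
    Elem("c1", "C", 8, 9, 6), Elem("c2", "C", 12, 13, 4),
    Elem("b3", "B", 16, 19, 2), Elem("d3", "D", 17, 18, 3),
]

def is_descendant(d, a):
    return a.start < d.start and d.end < a.end

def is_child(d, a):
    return is_descendant(d, a) and d.level == a.level + 1

AXIS = {"//": is_descendant, "/": is_child}

# Hypothetical query format: tag -> [(axis, child tag, mandatory?)].
QUERY = {
    "A": [("/", "B", True)],
    "B": [("//", "D", True), ("//", "C", False)],  # C assumed optional
    "C": [],
    "D": [],
}

def bottom_up_match(doc, query):
    """Push e into hs[e.tag] iff e satisfies the sub-twig rooted at its tag."""
    hs = {tag: [] for tag in query}
    for e in sorted(doc, key=lambda x: x.end):  # post-order traversal
        edges = query.get(e.tag)
        if edges is None:
            continue
        if all(not mandatory or any(AXIS[axis](c, e) for c in hs[child])
               for axis, child, mandatory in edges):
            hs[e.tag].append(e)
    return hs

def path_matches(doc, leaf_tag):
    """Brute-force enumeration of //A/B//leaf_tag, for comparison only."""
    def by_tag(t):
        return [e for e in doc if e.tag == t]
    return [(a.name, b.name, x.name)
            for a in by_tag("A")
            for b in by_tag("B") if is_child(b, a)
            for x in by_tag(leaf_tag) if is_descendant(x, b)]

hs = bottom_up_match(DOC, QUERY)
print({tag: [e.name for e in elems] for tag, elems in hs.items()})
print(len(path_matches(DOC, "D")), len(path_matches(DOC, "C")))
```

The brute-force counts reproduce the 3 and 2 path matches quoted on the slide, which supports the reconstructed labels. Note that b2 enters HS[B] even though its parent is not an A element: the bottom-up test only looks at the sub-twig rooted at B.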
GTP Result Enumeration
• Bottom-up computation vs. top-down enumeration
  • Visit only those elements that are in the twig matches
• Handling grouping results
  • Automatic grouping through inter-node edges
• Handling duplicates and out-of-order results
  • Problems coming from non-return nodes
  • If D is a return node while B is not:
    • b1 → d1, d2, d3 and b2 → d2, d3 (duplicates)
  • Observation: intra-node hierarchy provides hints
[Figure: hierarchical stacks with elements a4, b1, b2, c1, c2, d1, d2, d3]

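The duplicate problem and one way the intra-node hierarchy can help are sketched below. This is an assumption about how the hint might be used for a descendant edge B//D, not the enumeration algorithm of the paper: the region labels are hypothetical, chosen so that b1 contains d1 and b2, and b2 contains d2 and d3, as in the slide's example.

```python
from collections import namedtuple

Elem = namedtuple("Elem", "name tag start end")

# Hypothetical regions: b1 > {d1, b2}, b2 > {d2, d3}.
DOC = [
    Elem("b1", "B", 1, 10), Elem("d1", "D", 2, 3),
    Elem("b2", "B", 4, 9), Elem("d2", "D", 5, 6), Elem("d3", "D", 7, 8),
]

def is_descendant(d, a):
    return a.start < d.start and d.end < a.end

def naive_enumeration(bs, ds):
    """Follow B -> D edges from every B element: produces duplicates."""
    return [d.name for b in bs for d in ds if is_descendant(d, b)]

def hierarchy_aware_enumeration(bs, ds):
    """Start only from B elements with no B ancestor in the match set.

    Every D below a nested B is also below its top-most B ancestor,
    so the nested B elements can be skipped when B is not returned.
    """
    roots = [b for b in bs if not any(is_descendant(b, o) for o in bs)]
    return [d.name for b in roots for d in ds if is_descendant(d, b)]

bs = sorted((e for e in DOC if e.tag == "B"), key=lambda e: e.start)
ds = sorted((e for e in DOC if e.tag == "D"), key=lambda e: e.start)

print(naive_enumeration(bs, ds))
print(hierarchy_aware_enumeration(bs, ds))
```

Because the top-most B elements are disjoint and visited in document order, the second result needs neither duplicate elimination nor sorting in this example.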
Experiment Setup
• Implementation
  • Twig2Stack: Java 1.4.2
  • TwigStack, TJFast: Java 1.4.2
    • Kindly provided by Jiaheng Lu from National University of Singapore (NUS)
• Datasets
  • XMark, DBLP, TreeBank
• Metrics
  • Query processing time
  • IO time

Processing Full Twig Queries
[Figure: performance charts, annotated "Optimization of Query Processing: TwigStack, Twig2Stack" and "Optimization of IO: TJFast"]

Not yet done: Memory Usage
• Hierarchical Stack Encoding may hold the entire document in memory in the worst case
• Unlike the DOM approach, only matches need to be stored
  • Tag match
  • (Partial) twig match
  • Predicate evaluation
• Early result enumeration dramatically reduces the memory usage
  • Enumerate query results before the end of the document and release the buffer
  • Main idea: hybrid of top-down (PathStack) and bottom-up (Twig2Stack) approaches

Early Result Enumeration (ERM)
• Enumerate results and release the buffer when elements in the top-branch node are popped from PathStack
[Figure: running example with query tree A, B, D, C; document elements a1 [1,20], a2 [2,15], b3 [16,19], d3 [17,18], b1 [3,14], c2 [12,13], d1 [4,11], b2 [5,10], c1 [8,9], d2 [6,7]; PathStack stacks S[A], S[B], S[D], S[C] and hierarchical stacks HS[A], HS[B], HS[D], HS[C]]

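The memory effect of releasing buffers early can be illustrated with a toy event stream. This sketch only counts buffered elements and does no twig matching; the flush rule (release when the outermost open element of the chosen top tag closes), the function names and the DBLP-like stream of article, title and year elements are assumptions made for illustration.

```python
def make_events(n_articles):
    """Hypothetical DBLP-like stream of (kind, tag) start/end events."""
    events = [("start", "dblp")]
    for _ in range(n_articles):
        events += [
            ("start", "article"),
            ("start", "title"), ("end", "title"),
            ("start", "year"), ("end", "year"),
            ("end", "article"),
        ]
    events.append(("end", "dblp"))
    return events

def peak_buffer(events, top_tag, early):
    """Return the peak number of buffered elements.

    open_top plays the role of the top-down stack for top_tag; elements
    are buffered bottom-up when their end event arrives.
    """
    open_top = 0   # currently open elements with tag top_tag
    buffered = 0   # elements held in the (simulated) hierarchical stacks
    peak = 0
    for kind, tag in events:
        if kind == "start":
            if tag == top_tag:
                open_top += 1
            continue
        if open_top > 0:          # only elements inside a top_tag sub-tree
            buffered += 1
            peak = max(peak, buffered)
        if tag == top_tag:
            open_top -= 1
            if early and open_top == 0:
                buffered = 0      # results would be enumerated here, then released
    return peak

events = make_events(3)
print(peak_buffer(events, "article", early=True))
print(peak_buffer(events, "article", early=False))
```

With small repeated sub-trees the peak stays at the size of one sub-tree; a single huge sub-tree under the top tag would leave little to release early.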
Memory Usage
[Figure: memory usage charts. DBLP query over dblp, article, title, year: small sub-tree. XMark query over site, open_auctions, bid, reserve, bidder, increase: huge sub-tree]

Conclusions and Future Work
• Proposed a bottom-up GTP processing solution
  • A twig encoding scheme
  • A GTP enumeration algorithm that avoids any post-processing operations
  • A hybrid scheme to reduce memory usage
• Future directions
  • Handling worst-case memory issues
  • Optimizing IO cost by exploiting indexes
  • Handling other axes, full XQuery, graph input
  • Handling XML streams
  • …

Processing GTP
[Figure: performance charts, annotated "Optimization of non-return nodes" and "Automatic grouping"]