
Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents



Presentation Transcript


  1. Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents
     Songting Chen, Hua-Gang Li *, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal and K. Selcuk Candan
     NEC Laboratories America
     * University of California, Santa Barbara

  2. Background
     • XML
       – Hierarchical (tree) structured data
       – Provides flexibility to model semi-structured data
       – Widely accepted as a universal data exchange format
     • Query over XML
       – XPath, XQuery [W3C]
       – Extensively used by many applications
       – Adopted by a number of commercial systems
     VLDB 2006, Seoul, Korea

  3. State-of-the-art: XML Query Processing
     • Algebraic approach
       – Binary structural joins [Timber]: large intermediate results
       – Optimize multiple path expressions of XQuery [Chen et al.]: expensive post-processing
     • Holistic approach
       – ?
       – TwigStack [Bruno et al.]
       – PathStack [Bruno et al.]
       – Twig2Stack

  4. Processing Generalized Tree Pattern (GTP) Queries
     • Structural joins
     • Structural outer joins: grouping
     • Duplicate elimination
     • Sort
     • Our goal: avoid ALL of these!
     • XQuery: FOR $b in //A[E]/B, $d in $b/$D LET $c = $b/C RETURN $b, $c, $d
     [Figure: GTP with query nodes A, B, C, D; example paths //A//B and //A/B over elements a1, a2, b1, b2]

  5. Motivation: PathStack [Bruno et al.]
     • Query: //A//B; Data: elements a1, a2, b1, b2
     • Key observation: minimize intermediate results through a compact representation of path matches, by
       – Inter-node: record the AD (ancestor-descendant) relationship between elements in different query nodes, e.g., b1→a2, b2→a2
       – Intra-node: record the AD relationship between elements within the same query node, e.g., b1, b2
     • TwigStack [Bruno et al.] minimizes intermediate results by:
       – Outputting only those path matches that are in the final twig results
       – However, such optimality cannot be guaranteed [Choi et al.]
       – Not helpful for processing GTP queries
     • Question: can we minimize intermediate results for twig queries through compact result encoding (similar to PathStack)?
       – Useful for processing GTP queries as well?
     [Figure: data tree over a1, a2, b1, b2 and the stacks S[A] and S[B]]
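The compact encoding described on this slide can be illustrated with a minimal Python sketch. It is not the PathStack implementation: it handles only the two-node query //A//B, it does not model the intra-node stack S[B], and the region codes in `DATA` are hypothetical values chosen so that a1 contains a2, which contains b1, which contains b2. Each B element stores a single inter-node edge to the top of S[A]; every A element below that top is implied.

```python
from typing import Dict, List, NamedTuple, Tuple


class Elem(NamedTuple):
    name: str
    tag: str
    left: int   # hypothetical region code: start position
    right: int  # hypothetical region code: end position


def path_stack(elems: List[Elem]) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Match //A//B over region-encoded elements (simplified sketch)."""
    s_a: List[Elem] = []                # S[A]: currently open A elements, outermost first
    edges: Dict[str, str] = {}          # inter-node edge: B element -> top of S[A]
    matches: List[Tuple[str, str]] = []
    for e in sorted(elems, key=lambda x: x.left):      # document order
        while s_a and s_a[-1].right < e.left:          # pop A elements that already closed
            s_a.pop()
        if e.tag == "A":
            s_a.append(e)
        elif e.tag == "B" and s_a:
            edges[e.name] = s_a[-1].name               # one pointer encodes all ancestors
            matches.extend((a.name, e.name) for a in s_a)  # expanded form of that pointer
    return edges, matches


DATA = [Elem("a1", "A", 1, 8), Elem("a2", "A", 2, 7),
        Elem("b1", "B", 3, 6), Elem("b2", "B", 4, 5)]
EDGES, MATCHES = path_stack(DATA)
print(EDGES)    # compact encoding
print(MATCHES)  # fully enumerated path matches
```

Two stored edges stand in for four enumerated matches here, which is the saving the slide attributes to the compact representation.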

  6. Hierarchical Stack Encoding
     • Inter-node: //A//B
       – Can still use explicit edges
     • Intra-node: A
       – Matching elements form a tree structure as well
     • Associate each query node with a hierarchical stack
     • Push element e into hierarchical stack HS[E] iff e satisfies the sub-twig query rooted at E
       – Matching can be determined once the entire sub-tree of e has been seen
       – Requires post-order document traversal
     [Figure: elements a1, a2, a3, a4 in the document and their arrangement in HS[A]]
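The push rule above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Twig2Stack algorithm itself: all query edges are treated as ancestor-descendant edges, each query node is identified by a distinct tag, and each hierarchical stack is flattened to a plain list in push order. The `Node` class, the `QUERY` dictionary and the sample document are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    name: str
    tag: str
    children: List["Node"] = field(default_factory=list)


def build_stacks(root: Node, query: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Post-order traversal; push e into HS[tag] iff e satisfies the sub-twig rooted there."""
    hs: Dict[str, List[str]] = {q: [] for q in query}

    def visit(e: Node) -> Set[str]:
        below: Set[str] = set()          # query nodes matched somewhere below e
        for child in e.children:
            below |= visit(child)        # children first: post-order
        if e.tag in query and all(c in below for c in query[e.tag]):
            hs[e.tag].append(e.name)     # the whole sub-tree of e has been seen
            below = below | {e.tag}
        return below

    visit(root)
    return hs


# Hypothetical query //A//B and document: a3 and a4 have no B descendant.
QUERY = {"A": ["B"], "B": []}
DOC = Node("a1", "A", [
    Node("a2", "A", [Node("b1", "B")]),
    Node("a3", "A", [Node("a4", "A")]),
])
HS = build_stacks(DOC, QUERY)
print(HS)
```

Elements a3 and a4 carry the right tag but are never pushed, because no B element appears in their sub-trees by the time they are closed.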

  7. Twig2Stack: Running Example
     • Query twig: A with child B; B with children D and C
     • Document elements (region encoding [left, right], level):
       – a1 [1,20], 1
       – a2 [2,15], 2
       – b3 [16,19], 2
       – b1 [3,14], 3
       – d3 [17,18], 3
       – d1 [4,11], 4
       – c2 [12,13], 4
       – b2 [5,10], 5
       – d2 [6,7], 6
       – c1 [8,9], 6
     • TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together.
     • Twig2Stack requires neither path joins nor path enumeration!
     [Figure: hierarchical stacks HS[A], HS[B], HS[C] (c1, c2) and HS[D] (d1, d2, d3); label: Merging Stacks]
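The match counts quoted on this slide can be checked with a small brute-force Python sketch over the region codes. It is only a verification aid and not how TwigStack or Twig2Stack work. The pairing of element names with region codes is read off the flattened figure, so it should be treated as an assumption; the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

# name -> (tag, left, right, level), as read from the slide's figure (assumed pairing)
ELEMS: Dict[str, Tuple[str, int, int, int]] = {
    "a1": ("A", 1, 20, 1), "a2": ("A", 2, 15, 2), "b3": ("B", 16, 19, 2),
    "b1": ("B", 3, 14, 3), "d3": ("D", 17, 18, 3), "d1": ("D", 4, 11, 4),
    "c2": ("C", 12, 13, 4), "b2": ("B", 5, 10, 5), "d2": ("D", 6, 7, 6),
    "c1": ("C", 8, 9, 6),
}


def is_ancestor(x: str, y: str) -> bool:
    """x is an ancestor of y iff x's region strictly contains y's region."""
    return ELEMS[x][1] < ELEMS[y][1] and ELEMS[y][2] < ELEMS[x][2]


def is_parent(x: str, y: str) -> bool:
    return is_ancestor(x, y) and ELEMS[y][3] == ELEMS[x][3] + 1


def by_tag(tag: str) -> List[str]:
    return sorted(n for n, e in ELEMS.items() if e[0] == tag)


def path_matches(leaf_tag: str) -> List[Tuple[str, str, str]]:
    """Enumerate matches of //A/B//<leaf_tag> by brute force."""
    return [(a, b, x)
            for a in by_tag("A") for b in by_tag("B") if is_parent(a, b)
            for x in by_tag(leaf_tag) if is_ancestor(b, x)]


D_PATHS = path_matches("D")
C_PATHS = path_matches("C")
# Joining the two path lists on (a, b) gives the twig matches, grouped per (a, b).
TWIG = {(a, b): ([d for (a2, b2, d) in D_PATHS if (a2, b2) == (a, b)],
                 [c for (a2, b2, c) in C_PATHS if (a2, b2) == (a, b)])
        for (a, b, _) in D_PATHS if any((a, b) == (a3, b3) for (a3, b3, _) in C_PATHS)}
print(len(D_PATHS), len(C_PATHS), TWIG)
```

Under these codes, the path (a1, b3, d3) is enumerated for //A/B//D even though b3 has no C descendant, so it is discarded only at join time.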

  8. GTP Result Enumeration
     • Bottom-up computation vs. top-down enumeration
       – Visit only those elements that are in the twig matches
     • Handling grouping results
       – Automatic grouping through inter-node edges
     • Handling duplicates and out-of-order results
       – Problems come from non-return nodes
       – If D is a return node while B is not: b1 → d1, d2, d3 and b2 → d2, d3 (duplicates)
       – Observation: the intra-node hierarchy provides hints
     [Figure: elements a4, b1, b2, c1, c2, d1, d2, d3]
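The duplicate problem on this slide can be reproduced with a short Python sketch. The edge lists are taken from the slide; the assumption that b2 is nested under b1 in HS[B] is not stated there and is introduced only for illustration, as are the function names. Under that assumption, and with a descendant edge from B to D, every D reachable from b2 is also reachable from b1, so enumerating only from the top-most B elements yields each D once. This is one possible use of the hint, not necessarily the authors' procedure.

```python
from typing import Dict, List, Optional

# Inter-node edges from the slide: B element -> D elements below it
D_EDGES: Dict[str, List[str]] = {"b1": ["d1", "d2", "d3"], "b2": ["d2", "d3"]}
# Assumed intra-node hierarchy in HS[B]: b2 is nested under b1
PARENT_IN_HS: Dict[str, Optional[str]] = {"b1": None, "b2": "b1"}


def enumerate_naive(edges: Dict[str, List[str]]) -> List[str]:
    """Follow the edges of every B element; D elements can repeat."""
    return [d for ds in edges.values() for d in ds]


def enumerate_top_only(edges: Dict[str, List[str]],
                       parent: Dict[str, Optional[str]]) -> List[str]:
    """Follow edges only from B elements with no enclosing B in the stack."""
    return [d for b, ds in edges.items() if parent[b] is None for d in ds]


NAIVE = enumerate_naive(D_EDGES)
TOP_ONLY = enumerate_top_only(D_EDGES, PARENT_IN_HS)
print(NAIVE)
print(TOP_ONLY)
```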

  9. Experiment Setup
     • Implementation
       – Twig2Stack: Java 1.4.2
       – TwigStack, TJFast: Java 1.4.2, kindly provided by Jiaheng Lu from the National University of Singapore (NUS)
     • Datasets
       – XMark, DBLP, TreeBank
     • Metrics
       – Query processing time
       – IO time

  10. Processing Full Twig Queries
     • Optimization of query processing: TwigStack, Twig2Stack
     • Optimization of IO: TJFast

  11. Not yet done: Memory Usage
     • Hierarchical Stack Encoding could hold entire document in memory in the worst case
     • Unlike DOM approach, only matches need to be stored
       – Tag match
       – (Partial) twig match
       – Predicate evaluation
     • Early result enumeration dramatically reduces the memory usage
       – Enumerate query results before the end of document and release buffer
       – Main idea: hybrid of top-down (PathStack) and bottom-up (Twig2Stack) approaches

  12. Early Result Enumeration (ERM)
     • Enumerate results and release the buffer when elements in the top-branch node are popped from PathStack
     [Figure: the running-example query (A, B, C, D) and document tree with region encodings, shown with stacks S[A], S[B], S[C], S[D] and hierarchical stacks HS[A], HS[B], HS[C], HS[D]; buffered elements include b1, b2, c1, c2, d1, d2, d3]
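The idea of releasing the buffer when a top-branch element closes can be illustrated with a deliberately simplified Python sketch. It is not the hybrid PathStack/Twig2Stack scheme from the talk: it handles only a two-node query (a top-branch tag with one descendant leaf tag), reads a hypothetical list of start/end events, and flushes when the outermost open top-branch element ends. All names and the event list are assumptions.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, str, str]  # (kind, tag, element name); kind is "start" or "end"


def early_enumeration(events: List[Event], top_tag: str = "B", leaf_tag: str = "D"):
    """Buffer leaf matches per open top-branch element; flush when the outermost one closes."""
    open_tops: List[str] = []           # currently open top-branch elements
    buffer: Dict[str, List[str]] = {}   # top-branch element -> buffered leaf matches
    results: List[Tuple[str, List[str]]] = []
    peak = 0                            # largest number of buffered leaf entries seen
    for kind, tag, name in events:
        if kind == "start" and tag == top_tag:
            open_tops.append(name)
            buffer[name] = []
        elif kind == "start" and tag == leaf_tag:
            for b in open_tops:         # the leaf lies below every open top-branch element
                buffer[b].append(name)
        elif kind == "end" and tag == top_tag:
            open_tops.pop()
            if not open_tops:           # outermost top-branch element closed
                results.extend((b, ds) for b, ds in buffer.items() if ds)
                buffer.clear()          # release the buffer before the document ends
        peak = max(peak, sum(len(v) for v in buffer.values()))
    return results, peak


# Hypothetical document: b1 contains d1 and b2 (which contains d2); b3 contains d3.
EVENTS: List[Event] = [
    ("start", "B", "b1"), ("start", "D", "d1"), ("end", "D", "d1"),
    ("start", "B", "b2"), ("start", "D", "d2"), ("end", "D", "d2"),
    ("end", "B", "b2"), ("end", "B", "b1"),
    ("start", "B", "b3"), ("start", "D", "d3"), ("end", "D", "d3"), ("end", "B", "b3"),
]
RESULTS, PEAK = early_enumeration(EVENTS)
print(RESULTS, PEAK)
```

In this toy run the buffer is emptied after b1 closes, so the entries for b3 never coexist with those for b1 and b2.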

  13. Memory Usage
     • dblp: small sub-tree ☺ (article, title, year)
     • site, open_auctions: huge sub-tree ☹ (bid, reserve, bidder, increase)

  14. Conclusions and Future Work
     • Proposed a bottom-up GTP processing solution
       – A twig encoding scheme
       – A GTP enumeration algorithm that avoids any post-processing operations
       – A hybrid scheme to reduce memory usage
     • Future directions
       – Handling worst case memory issues
       – Optimizing IO cost by exploiting indexes
       – Handling other axes, full XQuery, graph input
       – Handling XML streams
       – …

  15. Processing GTP
     • Optimization of non-return nodes
     • Automatic grouping
