On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques

Ting Chen, Jiaheng Lu, Tok Wang Ling

Presentation Transcript


  1. On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

  2. Outline • Background • XML Twig Pattern Query • Previous Twig Join algorithms • Limit of the original holistic method TwigStack • Our holistic Twig Pattern Matching algorithms • Two Refined Indexing Schemes: Tag+Level and PPS • A generalized holistic matching theory • iTwigJoin: a generalized holistic matching algorithm • Experiments • Conclusion

  3. Background: XML and Region coding • An XML document is modeled as a tree in our work • Region coding for the XML document tree: each element gets a <start, end, level> label • Containment property: a.start < b.start AND a.end > b.end if and only if a is an ancestor of b
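The containment property can be checked with two integer comparisons. The following is a minimal Python sketch, not code from the presentation: the `Region` type, the function names and the sample labels are hypothetical, and the parent-child test assumes that `level` is the element's depth in the tree, so that a parent sits exactly one level above its child.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical <start, end, level> label of one XML element."""
    start: int
    end: int
    level: int


def is_ancestor(a: Region, b: Region) -> bool:
    # Containment property: a's region strictly encloses b's region.
    return a.start < b.start and a.end > b.end


def is_parent(a: Region, b: Region) -> bool:
    # Assumption: level is tree depth, so a parent is an ancestor
    # located exactly one level above the child.
    return is_ancestor(a, b) and a.level + 1 == b.level


# Assumed labels for the document <a><b><c/></b></a>.
a = Region(start=1, end=6, level=1)
b = Region(start=2, end=5, level=2)
c = Region(start=3, end=4, level=3)

print(is_ancestor(a, c), is_parent(a, c), is_parent(a, b))
```

With these labels, a is an ancestor of c but not its parent, while a is the parent of b.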

  4. Background: XML twig pattern queries • An XML twig query is a small tree whose edges are parent-child or ancestor-descendant relationships. • Given an XML document D and an XML twig query Q, the problem is to find all occurrences of Q in D.
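To make the problem statement concrete, here is a deliberately naive Python sketch that enumerates every combination of elements and keeps those satisfying all query edges. It is a brute-force baseline, not one of the join algorithms discussed in this presentation. The `Elem` type, the edge encoding (`"//"` for ancestor-descendant, `"/"` for parent-child), the sample document and the assumption that every query node has a distinct tag are all illustrative choices.

```python
from itertools import product
from typing import NamedTuple


class Elem(NamedTuple):
    name: str   # hypothetical element identifier, e.g. "a1"
    tag: str
    start: int
    end: int
    level: int


def edge_holds(anc: Elem, desc: Elem, axis: str) -> bool:
    # "//" is ancestor-descendant, "/" is parent-child.
    inside = anc.start < desc.start and anc.end > desc.end
    return inside and (axis == "//" or anc.level + 1 == desc.level)


def naive_twig_matches(doc, root_tag, edges):
    """Return all bindings of query tags to elements that satisfy every edge.

    edges is a list of (parent_tag, child_tag, axis) triples.
    Assumes each query node has a distinct tag.
    """
    tags = [root_tag] + [child for _, child, _ in edges]
    streams = [[e for e in doc if e.tag == t] for t in tags]
    results = []
    for combo in product(*streams):
        binding = dict(zip(tags, combo))
        if all(edge_holds(binding[p], binding[c], ax) for p, c, ax in edges):
            results.append(tuple(e.name for e in combo))
    return results


# Assumed document: a1 contains b1 and a2; a2 contains c1 and b2.
doc = [
    Elem("a1", "A", 1, 10, 1),
    Elem("b1", "B", 2, 3, 2),
    Elem("a2", "A", 4, 9, 2),
    Elem("c1", "C", 5, 6, 3),
    Elem("b2", "B", 7, 8, 3),
]

# Query A[.//C]//B: A with a descendant C and a descendant B.
matches = naive_twig_matches(doc, "A", [("A", "C", "//"), ("A", "B", "//")])
print(matches)
```

The cost of this enumeration grows with the product of the stream sizes, which is the kind of blow-up that structural and holistic join algorithms are designed to avoid.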

  5. Previous XML Twig Join algorithms • Edge based: Binary Structural Join [Al-Khalifa et al ICDE02], Join Order Selection [Wu et al ICDE03] • Path based: BLAS [Chen et al SIGMOD04] • Tree (holistic) based: TwigStack [Bruno et al SIGMOD02], TwigStackList [Lu et al CIKM04] • Index based: B tree [Chien et al VLDB02], XR tree [Jiang et al ICDE02], TSGeneric+ [Jiang et al VLDB03]

  6. Holistic Twig Matching • TwigStack [Bruno et al SIGMOD02]: a holistic twig join algorithm • E.g.: for the query A[.//C]//B there may be many matches to A//B alone, but TwigStack outputs results only for A elements that have both B and C descendants. • No join order selection is required • TwigStack is optimal only for twig patterns whose edges are all ancestor-descendant. • Reordering of elements in a stream does not help. [Choi et al DEXA03]

  7. Sub-optimality of TwigStack • Not optimal for twigs with parent-child edges [Figure: document with elements a1 ... an, b1 ... bn, c1 ... cn; query with nodes A, B, C]

  8. Two Refined Streaming Schemes (1) • To enlarge the class of queries for which TwigStack-style holistic matching is optimal, our paper proposes two refined streaming schemes. • Tag+Level: elements with the same tag and level are grouped together [Figure: document with elements a1 ... an, b1 ... bn, c1 ... cn; query with nodes A, B, C]
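A minimal Python sketch of Tag+Level partitioning, under stated assumptions: elements are hypothetical `(name, tag, start, end, level)` records, each stream is kept in document order (ascending `start`), and the function name and sample data are illustrative and not taken from the paper.

```python
from collections import defaultdict
from typing import NamedTuple


class Elem(NamedTuple):
    name: str   # hypothetical element identifier
    tag: str
    start: int
    end: int
    level: int


def tag_level_streams(elements):
    """Group elements into one stream per (tag, level) pair."""
    streams = defaultdict(list)
    # Assumption: streams are consumed in document order, i.e. by start.
    for e in sorted(elements, key=lambda e: e.start):
        streams[(e.tag, e.level)].append(e.name)
    return dict(streams)


# Assumed document: a1 contains b1, a2 and b3; a2 contains c1 and b2.
doc = [
    Elem("a1", "A", 1, 12, 1),
    Elem("b1", "B", 2, 3, 2),
    Elem("a2", "A", 4, 9, 2),
    Elem("c1", "C", 5, 6, 3),
    Elem("b2", "B", 7, 8, 3),
    Elem("b3", "B", 10, 11, 2),
]

tl_streams = tag_level_streams(doc)
for key in sorted(tl_streams):
    print(key, tl_streams[key])
```

Under plain tag streaming all three B elements would share one stream; here b2 lands in a separate stream because it sits one level deeper than b1 and b3.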

  9. Two Refined Streaming Schemes (1) • For this query, the Tag+Level streaming scheme guarantees optimality. [Figure: same document and query as on the previous slide]

  10. Two Refined Streaming Schemes (1) • But for a more complex query and document, Tag+Level cannot guarantee optimality. For example: [Figure: document with elements a1, a2, b2, e1, d1, d2, d3, b1, c1, c2, with d2 and d3 shown grouped together; query with nodes A, D, B, C]

  11. Two Refined Streaming Schemes (2) • Prefix Path Streaming (PPS): elements with the same root-to-node path are grouped together • In this example, every element in the document is stored in its own stream. [Figure: document D with elements a1, a2, b2, e1, d1, d2, d3, b1, c1, c2]
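Prefix-path streams can be derived from region labels alone by scanning elements in document order while keeping a stack of currently open ancestors. The Python sketch below illustrates that idea under assumptions: the `Elem` record, the `/`-joined path key and the sample document are hypothetical, and this is not presented as the paper's construction.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    name: str   # hypothetical element identifier
    tag: str
    start: int
    end: int
    level: int


def prefix_path_streams(elements):
    """Group elements into one stream per root-to-node tag path."""
    streams = {}
    open_ancestors = []  # stack of elements whose region is still open
    for e in sorted(elements, key=lambda e: e.start):
        # Close every element that ends before e starts.
        while open_ancestors and open_ancestors[-1].end < e.start:
            open_ancestors.pop()
        open_ancestors.append(e)
        path = "/".join(x.tag for x in open_ancestors)
        streams.setdefault(path, []).append(e.name)
    return streams


# Assumed document: a1 contains b1, a2 and b3; a2 contains c1 and b2.
doc = [
    Elem("a1", "A", 1, 12, 1),
    Elem("b1", "B", 2, 3, 2),
    Elem("a2", "A", 4, 9, 2),
    Elem("c1", "C", 5, 6, 3),
    Elem("b2", "B", 7, 8, 3),
    Elem("b3", "B", 10, 11, 2),
]

pps_streams = prefix_path_streams(doc)
for path in sorted(pps_streams):
    print(path, pps_streams[path])
```

Because the stream key is the whole path, a query edge such as A/B can skip every stream whose path cannot contain it, which is one way to read the stream pruning mentioned in the experiments.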

  12. Two Refined Streaming Schemes (2) • PPS is optimal for the following example: d1, d2, c1 and c2 are separated into different streams. [Figure: document with elements a1, a2, b2, e1, d1, d2, d3, b1, c1, c2; query with nodes A, D, B, C]

  14. Two Refined Streaming Schemes (2) • A natural question: can PPS guarantee optimality for all queries and data? • The answer is NO. • For example: c1 and c2 are in the same stream; similarly, e1 and e2 are in the same stream. [Figure: document with elements a1 ... a4, b1 ... b5, c1, c2, d1, d2, e1, e2, head elements marked; query with nodes A, B, C, D, E]

  15. A general algorithm: iTwigJoin • We propose a general algorithm, called iTwigJoin, which can be used with various data streaming schemes. • Our key idea is to classify all current head elements into three classes: • Subtree-matching • Useless • Blocked

  16. Classifying Head Elements • Subtree-matching element: an element e of tag E is a subtree-matching element for query Q if • e is in a match to Q_E (Q_E is the subtree of Q rooted at E); and • e is NOT in any future match to Q_P, where P is the parent of E in Q • Useless element: e is useless if it is not in any future match to Q_E • Blocked element: an element that is neither subtree-matching nor useless

  17. Example: Classifying Head Elements (Tag+Level Streaming) [Figure: document D with elements a1, a2, b2, e1, d1, d2, d3, b1, c1, c2; query Q1 with nodes A, D, B, C; head elements of the Tag+Level streams marked]

  21. Example: Classifying Head Elements (Tag+Level Streaming) [Figure: document D with elements a1, a2, b2, e1, d1, d2, d3, b1, c1, c2; queries Q1 and Q2, each with nodes A, D, B, C; head elements of the Tag+Level streams marked]

  26. Example: Classifying Head Elements (Tag+Level Streaming) [Figure: document with elements a1, a2, b2, e1, d1, d2, d3, b1, c1, c2; two queries over nodes A, D, B, C] • A useless element can be discarded safely • A subtree-matching element is pushed onto the corresponding stack • A blocked element causes problems: • it CANNOT be discarded, because results may be lost • it CANNOT be pushed onto a stack, because useless results may be produced • When all head elements are blocked, optimal holistic matching CANNOT be guaranteed

  29. iTwigJoin • In our algorithm, in order to output all correct answers, we push blocked elements onto the stack, which may produce useless intermediate results in some cases. • Example (Tag+Level streaming, query Q1): since all head elements are blocked, we have to push a1 onto the stack and output one path solution (a1, d1). • If there is no c2, then (a1, d1) is a useless path solution. [Figure: document with elements a1, a2, b2, e1, d1, d2, d3, b1, c1, c2; query Q1 with nodes A, D, B, C]

  30. iTwigJoin • Two main components • Stream Manager: controls the advance operations of the streams and sends elements to temporary storage • Temporary Storage: pushes elements onto stacks and outputs intermediate paths [Figure: Stream Manager feeding Temporary Storage, with labels SA, SB, SC and elements a1, a2, b1, b2, c1, c2, c3, ...]

  31. Flowchart of iTwigJoin • Label the current head elements as subtree-matching, useless or blocked • If a useless element is found: discard the useless elements and label again • Otherwise, if not all streams have ended: select a subtree-matching or blocked element e • Pop some elements from the stack • Push e onto the stack and output intermediate paths if e is a leaf

  35. Optimal classes of iTwigJoin for three streaming schemes • Tag Streaming: A-D only patterns • Tag+Level Streaming: A-D/P-C only patterns • Prefix Path Streaming: A-D/P-C only or 1-branch-node patterns • The more refined the streaming scheme, the larger the optimal class: A-D only ⊂ A-D/P-C only ⊂ A-D/P-C only or 1-branch node

  36. Experiments • Benchmarks • XMark: Synthetic Data • Treebank: Real Data from Wall Street Journal

  37. Experiments: I/O Performance • Queries: Tree1: A-D only; Tree2: P-C only; Tree3: P-C only; Tree4: 1-branch node; Tree5: 1-branch node • By pruning irrelevant streams, PPS usually scans the fewest elements.

  38. Experiments: Number of Intermediate Paths • Queries: Tree1: A-D only; Tree2: P-C only; Tree3: P-C only; Tree4: 1-branch node; Tree5: 1-branch node • For Treebank query 5 there are no matching results, so Tag+Level and PPS do not output any intermediate results.

  39. Experiments: Running Time • Queries: XMark1: path pattern; XMark2: A-D only; XMark3: P-C only; XMark4: 1-branch node; XMark5: non-optimal • Tag+Level and PPS perform better than TwigStack and TwigStackList on XMark data.

  40. Experiments: Summary • Both PPS and Tag+Level help to reduce I/O costs, with PPS saving more. • PPS may result in too many streams for deep XML data; Tag+Level seems to be a good compromise. • PPS and Tag+Level completely avoided the output of redundant intermediate paths in all cases we tested, though they cannot guarantee optimality in theory.

  41. Conclusions • We develop a general algorithm to perform holistic twig joins on the Tag+Level and PPS streaming schemes. • We identify two I/O-optimal classes for the Tag+Level and PPS streaming schemes. • Since our experiments show that the Tag+Level streaming scheme produces very few useless intermediate results in most cases, we recommend using the Tag+Level scheme for efficient XML twig pattern matching.

  42. END • Thank you! • Q & A

  43. Backup: iTwigJoin Algorithm
      While (not all streams end):
        • Label current head elements as either Matching, Useless or Blocked
        • If any head element is Useless, discard it and continue
        • Let e1 be the Matching element with the smallest startPos
        • Let e2 be the Blocked element with the smallest endPos
        • If e2.endPos < e1.startPos, let e be the Blocked element with the smallest startPos; else let e be e1
        • Advance the stream e belongs to
        • Pop from e's stack the elements whose endPos < e.startPos
        • Push e onto its stack if e has a parent/ancestor in the temporary storage
        • Output all paths involving e if the tag of e is a leaf node in Q
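The element-selection and stack-cleaning steps of this backup slide can be sketched in Python as follows. This is a partial illustration only: the labeling of head elements is taken as given input, the `Head` record and function names are hypothetical, and the handling of the cases where no Matching or no Blocked head exists is an assumption, since the slide does not spell it out.

```python
from typing import NamedTuple, Optional


class Head(NamedTuple):
    name: str
    start: int
    end: int
    label: str  # "matching" or "blocked"; useless heads assumed already discarded


def choose_next(heads) -> Optional[Head]:
    """Pick the head element e to process next, following the slide's rule."""
    matching = [h for h in heads if h.label == "matching"]
    blocked = [h for h in heads if h.label == "blocked"]
    if not matching and not blocked:
        return None
    e1 = min(matching, key=lambda h: h.start) if matching else None
    e2 = min(blocked, key=lambda h: h.end) if blocked else None
    # Assumption: with no matching head, fall back to a blocked one;
    # with no blocked head, take e1.
    if e1 is None or (e2 is not None and e2.end < e1.start):
        return min(blocked, key=lambda h: h.start)
    return e1


def clean_stack(stack, e: Head) -> None:
    """Pop elements whose region ends before e starts (they cannot contain e)."""
    while stack and stack[-1].end < e.start:
        stack.pop()


# A blocked head that ends before the first matching head starts forces
# a blocked element to be processed first.
heads = [
    Head("c1", 8, 9, "matching"),
    Head("d1", 2, 3, "blocked"),
    Head("a1", 1, 10, "blocked"),
]
chosen = choose_next(heads)

stack = [Head("a1", 1, 10, "blocked"), Head("b1", 2, 3, "matching")]
clean_stack(stack, Head("b2", 5, 6, "matching"))
print(chosen.name, [h.name for h in stack])
```

In this example d1 ends before c1 starts, so the blocked element with the smallest start position (a1) is selected, and cleaning the stack for b2 removes b1 while keeping a1, whose region still encloses b2.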
