On Boosting Holism in XML Twig Pattern Matching Using Two Data Streaming Techniques Presenter: Lu Jiaheng Supervisor: Prof. Ling Tok Wang Joint work: Chen Ting, Ling Tok Wang
Outline • Background • Define our problem: XML twig pattern matching • Previous two algorithms: TwigStack and TwigStackList • Our holistic Twig Pattern Matching algorithms • Two Refined Indexing Schemes: Tag+Level and PPS • A generalized holistic matching algorithm: iTwigJoin • Experiments • Conclusion On Boosting Holism in XML Twig Pattern Matching using Structural Indexing
XML Twig Pattern Matching • An XML document is commonly modeled as a rooted, ordered and tagged tree. [Figure: example document tree rooted at book, with preface and chapter children; the first chapter contains nested section, title, paragraph and figure elements, with text values “Intro”, “Data” and “XML”.]
Regional Coding • Node label [1]: (startPos: endPos, LevelNum) • E.g. (in document order) book (0:32, 1); preface (1:3, 2); “Intro” (2:2, 3); chapter (4:29, 2); section (5:28, 3); title (6:8, 4); “Data” (7:7, 3); section (9:17, 4); title (10:12, 5); “XML” (11:11, 3); paragraph (13:16, 5); figure (14:15, 6); section (18:23, 4); paragraph (19:22, 5); figure (20:21, 6); paragraph (24:27, 4); figure (25:26, 5); chapter (30:31, 2) [1] M. P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
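The point of region labels is that structural relationships can be decided from two labels alone. The following is a minimal Python sketch of that idea, not code from the talk; the names `Label`, `is_ancestor` and `is_parent` are illustrative, and the labels are taken from the slide above.

```python
from typing import NamedTuple

class Label(NamedTuple):
    """Region label (startPos : endPos, LevelNum)."""
    start: int
    end: int
    level: int

def is_ancestor(a: Label, d: Label) -> bool:
    # a is an ancestor of d when a's region strictly contains d's region.
    return a.start < d.start and d.end < a.end

def is_parent(p: Label, c: Label) -> bool:
    # A parent is an ancestor exactly one level above.
    return is_ancestor(p, c) and c.level == p.level + 1

book = Label(0, 32, 1)
chapter = Label(4, 29, 2)
section = Label(5, 28, 3)
title = Label(6, 8, 4)
figure = Label(14, 15, 6)
```

With these labels, `is_ancestor(book, figure)` and `is_parent(section, title)` hold, while `is_parent(chapter, title)` does not because the levels differ by two.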
What is a Twig Pattern? • A twig pattern is a small tree whose nodes are tags, attributes or text values and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges. • E.g. select Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements that have a child element Title • XPath: Section[Title]/Paragraph//Figure • Twig pattern: Section with children Title and Paragraph (P-C edges); Paragraph with descendant Figure (A-D edge)
XML Twig Pattern Matching • Problem Statement • Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D. • E.g. consider the query and document below: • Query solutions: • (s1, t1, f1) • (s2, t2, f1) • (s1, t2, f1) [Figure: query twig Section with branches title and figure; document tree with elements s1, t1, s2, t2, p1, f1.]
Previous work: TwigStack • TwigStack [2]: a holistic approach • Each element in the document is labeled using the region encoding scheme. • The input is the labels of all elements whose tags occur in the query twig. The output is the set of matching solutions, each an n-tuple, where n is the number of nodes in the query. • Each node in the query has a corresponding input stream. • Each label in a stream is scanned only once: the cursor of a stream never moves backward. [2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
Previous work: TwigStack • TwigStack [2]: a holistic approach • Two-phase algorithm: • Phase 1 (TwigJoin): intermediate root-to-leaf path solutions are output • Phase 2 (Merge): the intermediate paths are merged to produce the final results [2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
Previous work: TwigStack • Each node q in a twig pattern Q is associated with a stack S_q • Insertion and deletion in a stack S_q • Insertion: an element e_q from stream T_q is pushed onto its stack S_q if and only if • e_q has a descendant e_qi in each stream T_qi, where q_i is a child of q • each such element e_qi recursively has the same property • Deletion: an element e_q is popped from its stack once all matches involving it have been output.
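The insertion condition is recursive, which is easy to miss on a slide. The sketch below checks it by brute force over complete streams; this is only an illustration of the condition, not the single-pass cursor logic TwigStack actually uses. The names `Label`, `is_ancestor` and `has_extension` are hypothetical, and the labels form a small Section/title/figure example.

```python
from typing import Dict, List, NamedTuple

class Label(NamedTuple):
    start: int
    end: int
    level: int

def is_ancestor(a: Label, d: Label) -> bool:
    # Region containment: a's interval strictly encloses d's interval.
    return a.start < d.start and d.end < a.end

def has_extension(e: Label, q: str,
                  children: Dict[str, List[str]],
                  streams: Dict[str, List[Label]]) -> bool:
    """True if e has, for every child qi of q, a descendant in the stream
    of qi that itself satisfies the same condition (leaves trivially do)."""
    return all(
        any(is_ancestor(e, c) and has_extension(c, qi, children, streams)
            for c in streams[qi])
        for qi in children.get(q, [])
    )

# Query: Section with branches title and figure (treated as A-D edges here).
children = {"Section": ["title", "figure"]}
streams = {
    "Section": [Label(1, 12, 1), Label(4, 9, 2)],
    "title":   [Label(2, 3, 2), Label(5, 6, 3)],
    "figure":  [Label(7, 8, 3), Label(10, 11, 2)],
}
s1, s2 = streams["Section"]
lonely = Label(20, 25, 1)  # hypothetical section with no descendants
```

Both `s1` and `s2` pass the test, so both would be pushed; `lonely` fails it and would be skipped.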
XML Twig Pattern Matching • Example document (region labels): s1 (1:12,1) with children t1 (2:3,2), s2 (4:9,2) and f2 (10:11,2); s2 with children t2 (5:6,3) and f1 (7:8,3) • Query: Section with branches title and figure • Input streams: Section: (1:12,1), (4:9,2); title: (2:3,2), (5:6,3); figure: (7:8,3), (10:11,2)
XML Twig Pattern Matching • TwigStack phase 1 on this example, step by step: • Push s1 (1:12,1) onto the Section stack • Push t1 (2:3,2) onto the title stack • Output path solution <s1,t1> and pop t1 • Push s2 (4:9,2) onto the Section stack • Push t2 (5:6,3) onto the title stack • Output path solutions <s1,t2>, <s2,t2> • Push f1 (7:8,3) onto the figure stack • Output path solutions <s1,f1>, <s2,f1> • Pop s2; push f2 (10:11,2) onto the figure stack • Output path solution <s1,f2>
XML Twig Pattern Matching • Output path solutions: <s1,t1>, <s1,t2>, <s2,t2>, <s1,f1>, <s2,f1>, <s1,f2> • Merge: <s1,t1,f1>, <s1,t1,f2>, <s1,t2,f1>, <s1,t2,f2>, <s2,t2,f1>
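The merge phase amounts to joining the path solutions on their shared Section element. Here is a minimal sketch assuming the two-branch query of this example; the function name `merge_paths` is illustrative, and a real implementation would use a merge join over sorted path lists.

```python
from collections import defaultdict
from itertools import product

def merge_paths(title_paths, figure_paths):
    """Join <section, title> and <section, figure> path solutions on the
    shared section element, yielding <section, title, figure> tuples."""
    titles = defaultdict(list)
    figures = defaultdict(list)
    for s, t in title_paths:
        titles[s].append(t)
    for s, f in figure_paths:
        figures[s].append(f)
    return [(s, t, f)
            for s in titles if s in figures
            for t, f in product(titles[s], figures[s])]

# Path solutions output in phase 1 of the running example.
title_paths = [("s1", "t1"), ("s1", "t2"), ("s2", "t2")]
figure_paths = [("s1", "f1"), ("s2", "f1"), ("s1", "f2")]
merged = merge_paths(title_paths, figure_paths)
```

For these six path solutions, `merged` contains the five twig solutions listed on the slide.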
Sub-optimality of TwigStack • If the query contains any parent-child relationship, TwigStack may output intermediate path solutions that cannot contribute to any final result. • We say that TwigStack is sub-optimal for queries with parent-child relationships.
Example: sub-optimality of TwigStack • Document (region labels): s1 (1:12,1) with children t1 (2:3,2) and s2 (4:9,2); s2 with children t2 (5:6,3) and f1 (7:8,3) • Query: Section with branches title and figure • Input streams: Section: (1:12,1), (4:9,2); title: (2:3,2), (5:6,3); figure: (7:8,3) • Because f1 and t1 are descendants of s1, s1 is pushed onto the stack. Note that f1 is not a child of s1. • t1 (2:3,2) is then pushed onto the title stack. • Output solution: <s1,t1>. But it is a useless intermediate solution and does not contribute to any final solution.
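The example can be replayed by comparing the relaxed (ancestor-descendant) push test with the true parent-child matches. The sketch below does this by brute force. It assumes both query edges are parent-child; the slide only confirms that for the figure edge, so the title edge is an assumption, and all function names are illustrative.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    start: int
    end: int
    level: int

def anc(a: Node, d: Node) -> bool:
    return a.start < d.start and d.end < a.end

def parent(p: Node, c: Node) -> bool:
    return anc(p, c) and c.level == p.level + 1

sections = [Node("s1", 1, 12, 1), Node("s2", 4, 9, 2)]
titles = [Node("t1", 2, 3, 2), Node("t2", 5, 6, 3)]
figures = [Node("f1", 7, 8, 3)]

def passes_relaxed_test(s: Node) -> bool:
    # Phase 1 view: every query edge is treated as ancestor-descendant.
    return any(anc(s, t) for t in titles) and any(anc(s, f) for f in figures)

def final_matches():
    # True answers under the assumed parent-child edges.
    return [(s.name, t.name, f.name)
            for s in sections for t in titles for f in figures
            if parent(s, t) and parent(s, f)]

s1 = sections[0]
answers = final_matches()
```

Here `s1` passes the relaxed test, yet the only final answer is `("s2", "t2", "f1")`, so any path solution starting at `s1` is wasted work.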
TwigStackList • TwigStack's main weakness is that its first phase treats every edge as an ancestor-descendant edge, so it is inefficient for queries with parent-child relationships. • Alternative: TwigStackList [3] [CIKM 2004] • TwigStackList improves on TwigStack by taking parent-child relationships into account in the first phase, and it is optimal for a larger class of queries than TwigStack. [3] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004.
Optimal class of TwigStack and TwigStackList • O: optimal • S: sub-optimal
Challenges (1) • Although TwigStackList enlarges the optimal query class of TwigStack, it is still sub-optimal for a large class of twig queries. • For example, two twig queries that are sub-optimal for TwigStackList: [Figure: two twig queries, each with root Section and branches title and figure.]
Challenges (2) • To answer a twig query, TwigStack and TwigStackList must read the labels of all elements whose tags occur in the query. • Can we speed up query processing by reading only some of them? • E.g. Query: Section with branches title and figure. Document: Level 1: s1; Level 2: t1; Level 3: f1, f2, …, fn. • The document contains no answer, since there are no figure elements at level 2, yet the previous algorithms still read all figure elements at level 3.
Our solution • We propose two data streaming schemes: tag+level and prefix path streaming. • Basic idea: separate elements with the same tag name into different streams • Tag+level: elements with the same tag and level are grouped together • Prefix path: elements with the same root-to-node path are grouped together
Two Refined Streaming Schemes (1) • Tag + Level: elements with the same tag and level are grouped together. • Document: Level 1: a1; Level 2: a2, a3, b2; Level 3: b1, d1, d2, d3; Level 4: c1, c2 • Streams: a Level 1: a1; a Level 2: a2, a3; b Level 2: b2; b Level 3: b1; c Level 4: c1, c2; d Level 3: d1, d2, d3
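Building tag+level streams is a single grouping pass over the labeled elements. A minimal sketch follows, assuming elements arrive as `(name, tag, level)` triples; that input shape and the function name `tag_level_streams` are illustrative. In practice each stream would also be kept sorted by startPos, which this sketch leaves out.

```python
from collections import defaultdict

def tag_level_streams(elements):
    """Group elements into one stream per (tag, level) pair,
    preserving the order in which elements are supplied."""
    streams = defaultdict(list)
    for name, tag, level in elements:
        streams[(tag, level)].append(name)
    return dict(streams)

# Elements of the example document with their tags and levels.
elements = [
    ("a1", "a", 1), ("a2", "a", 2), ("a3", "a", 2), ("b2", "b", 2),
    ("b1", "b", 3), ("d1", "d", 3), ("d2", "d", 3), ("d3", "d", 3),
    ("c1", "c", 4), ("c2", "c", 4),
]
streams = tag_level_streams(elements)
```

The ten elements fall into the six streams shown on the slide; for instance `streams[("a", 2)]` holds `a2` and `a3`.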
Two Refined Streaming Schemes (2) • Prefix Path Streaming (PPS): elements with the same root-to-node path are grouped together. • Document: Level 1: a1; Level 2: a2, a3, b2; Level 3: b1, d1, d2, d3; Level 4: c1, c2 • Streams: a: a1; a/a: a2, a3; a/b: b2; a/a/b: b1; a/a/d: d1, d2; a/b/d: d3; a/a/b/c: c1; a/b/d/c: c2
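Prefix path streams can be derived by walking each element's parent chain and grouping by the resulting tag path. The sketch below assumes a parent map as input. The slide text does not say whether b1, d1 and d2 hang under a2 or a3, so the assignment used here is a hypothetical choice that is consistent with the listed streams; the function names are illustrative as well.

```python
from collections import defaultdict

def prefix_path(name, nodes):
    """Return the root-to-node tag path of an element, e.g. 'a/b/d'."""
    tags = []
    while name is not None:
        tag, parent = nodes[name]
        tags.append(tag)
        name = parent
    return "/".join(reversed(tags))

def pps_streams(nodes):
    """Group elements into one stream per distinct prefix path."""
    streams = defaultdict(list)
    for name in nodes:
        streams[prefix_path(name, nodes)].append(name)
    return dict(streams)

# name -> (tag, parent). Parents of b1, d1, d2 are ASSUMED (a2, a2, a3).
nodes = {
    "a1": ("a", None),
    "a2": ("a", "a1"), "a3": ("a", "a1"), "b2": ("b", "a1"),
    "b1": ("b", "a2"), "d1": ("d", "a2"), "d2": ("d", "a3"),
    "d3": ("d", "b2"),
    "c1": ("c", "b1"), "c2": ("c", "d3"),
}
streams = pps_streams(nodes)
```

This yields eight streams, two more than tag+level on the same document, because the d elements at level 3 and the c elements at level 4 are split by path.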
Two benefits of refined streaming schemes (1) • (1) Enlarge the optimal query classes • For example, for the document and query shown, the previous algorithms TwigStack and TwigStackList output one useless solution <s1,t1>. • With tag+level streams, <s1,t1> is not output, since we know there are no figure elements at level 2. • Document: Level 1: s1; Level 2: s2, t1; Level 3: t2, f1. Query: Section with branches title and figure. • Tag+level streams: Section Level 1: s1; Section Level 2: s2; title Level 2: t1; title Level 3: t2; figure Level 3: f1
Two benefits of refined streaming schemes (2) • (2) Skip irrelevant elements • For the document and query shown, since there are no title elements at level 3, we can skip all figure elements at level 3 (with parent-child edges, a matching figure at level 3 would need a sibling title at level 3). • Query: Section with branches title and figure. Document: Level 1: s1; Level 2: t1; Level 3: f1, f2, …, fn.
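The skipping argument can be written as a small pruning rule over tag+level streams. The sketch below covers only a twig consisting of a root and its direct children, and it assumes that all edges are parent-child. It is not the pruning procedure from the talk, and `useful_streams` is a hypothetical name.

```python
def useful_streams(streams, root, children):
    """Return the (tag, level) streams that can take part in a match.

    A root stream at level l is useful only if every child tag has a
    non-empty stream at level l + 1; child streams are useful only
    together with such a root stream. Everything else can be skipped."""
    keep = set()
    for (tag, level), elems in streams.items():
        if tag != root or not elems:
            continue
        if all(streams.get((child, level + 1)) for child in children):
            keep.add((root, level))
            keep.update((child, level + 1) for child in children)
    return keep

# Document from the slide: no figure at level 2, many figures at level 3.
streams = {
    ("Section", 1): ["s1"],
    ("title", 2): ["t1"],
    ("figure", 3): ["f1", "f2", "f3"],
}
kept = useful_streams(streams, "Section", ["title", "figure"])

# Variant with one hypothetical figure at level 2.
streams2 = dict(streams)
streams2[("figure", 2)] = ["f0"]
kept2 = useful_streams(streams2, "Section", ["title", "figure"])
```

For the slide's document `kept` is empty, so no stream needs to be read at all. In the variant, the level 2 figure stream is kept while the level 3 figure stream is still skipped.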
A general algorithm: iTwigJoin • We propose a general algorithm, called iTwigJoin, which can be used with various data streaming schemes. • Our key idea is to classify all current head elements into three classes: • Subtree-matching • Useless • Blocked
Classifying Head Elements • Subtree-Matching Element • Element e of tag E is called a subtree-matching element for query Q if • e is in a match to Q_E (Q_E is the sub-tree of Q rooted at E); and • e is NOT in any future match to Q_P, where P is the parent of E in Q • Useless Element • Element e is called a useless element if e is not in any future match to Q_E. • Blocked Element • An element that is neither subtree-matching nor useless
Example 1: Classifying Head Elements (Tag+Level) • Query Q1: root A with branches D and B; C below B • Document D: Level 1: a1; Level 2: a2, a3, b2; Level 3: b1, d1, d2, d3; Level 4: c1, c2 • Tag+level streams: a Level 1: a1; a Level 2: a2, a3; b Level 2: b2; b Level 3: b1; c Level 4: c1, c2; d Level 3: d1, d2, d3 • [Figure: the head element of each stream is marked.]
Example 2: Classifying Head Elements (Tag+Level) • Query Q1: root A with branches D and B; C below B • Document D: Level 1: a1; Level 2: a2, a3, b2; Level 3: b1, d1, d2, d3; Level 4: c1, c2 • Tag+level streams: a Level 1: a1; a Level 2: a2, a3; b Level 2: b2; b Level 3: b1; c Level 4: c1, c2; d Level 3: d1, d2, d3 • [Figure: the head element of each stream is marked.]