
From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching

From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. Jiaheng Lu , Tok Wang Ling , Chee-Yong Chan , Ting Chen National University of Singapore. Outline. Background Define our problem: XML twig pattern matching Previous work and problems


Presentation Transcript


  1. From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting Chen National University of Singapore

  2. Outline • Background • Define our problem: XML twig pattern matching • Previous work and problems • Our new twig matching algorithms • A new labeling scheme: extended Dewey • A new holistic algorithm: TJFast • Experimental results • Conclusion

  3. XML basics • Short for Extensible Markup Language • A language for defining the syntax and semantics of structured data • An XML document is commonly modeled as a rooted, ordered and tagged tree. [Figure: example XML tree rooted at book, with preface, chapter, section, title and paragraph elements and text values such as “Intro”, “Data” and “XML”]

  4. Querying XML Data • Major standards for querying XML data • XPath and XQuery • XML twig pattern matching is a core operation in XPath and XQuery • Definition of XML twig pattern: An XML twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either Parent-Child edges or Ancestor-Descendant edges

  5. An XML twig pattern example • Create a flat list of all the title-author pairs for every book in the bibliography. XQuery: <results> { for $b in doc("bib.xml")/bib//book, $t in $b/title, $a in $b/author return <result> { $t } { $a } </result> } </results> To answer the XQuery, we first need to match the following XML twig pattern: bib//book (ancestor-descendant relationship), with book/title and book/author (parent-child relationships), where $b binds to book, $t to title and $a to author.

  6. Our research problem • Problem Statement • Given an XML twig pattern Q and an XML database D, we need to find ALL the matches of Q on D. • E.g. consider the following twig pattern and document. [Figure: twig pattern with root Section and child nodes Title and Figure; XML tree with elements s1, t1, s2, t2, p1, f1] • Query answers: • (s1, t1, f1) • (s2, t2, f1) • (s1, t2, f1)

  9. Outline • Background • Define our problem: XML twig pattern matching • Previous work and challenge • Our new twig matching algorithms • A new labeling scheme: extended Dewey • A new holistic algorithm: TJFast • Experiments • Conclusion

  10. Related work • TreeMerge and Stack-tree [Al-Khalifa ICDE 2002] • A stack-based binary join algorithm • But produces large intermediate results • TwigStack [Bruno SIGMOD 2002] • A holistic twig join algorithm • Sub-optimal for queries with parent-child relationships • TwigStackList [Lu CIKM 2004] • A new holistic twig join algorithm, which produces fewer useless intermediate results than TwigStack does for queries with parent-child relationships

  11. Our research goal • In this research, we want to design a new holistic twig join algorithm which is more efficient than previous work. • Two aspects to achieve this goal: • (1) Input: reduce the input I/O cost • (2) Output: reduce the size of intermediate results

  12. Outline • Background • Define our problem: XML twig pattern matching • Previous work and challenges • Our new twig matching algorithms • A new labeling scheme: extended Dewey • A new holistic algorithm: TJFast • Experiments • Conclusion

  13. Original Dewey Labeling Scheme • In the Dewey labeling scheme, each element is represented by a vector: • (i) the root is labeled by the empty string ε • (ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s. • For example: [Figure: tree with root s1 (ε); children t1 (1), s2 (2), f2 (3); children of s2: t2 (2.1), f1 (2.2)]
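The two labeling rules can be sketched in a few lines of Python. This is only an illustration: the `(name, children)` tuple representation and the function name `dewey_labels` are assumptions, and the sample tree is a small one shaped like the slide's example (the empty string stands for ε).

```python
from typing import Dict, Tuple

# A node is (name, list of child nodes); this representation is assumed for illustration.
Node = Tuple[str, list]

def dewey_labels(root: Node) -> Dict[str, str]:
    """Original Dewey: root gets the empty label; the x-th child of s gets label(s).x."""
    labels: Dict[str, str] = {}

    def visit(node: Node, label: Tuple[int, ...]) -> None:
        name, children = node
        labels[name] = ".".join(map(str, label))  # "" stands for the empty label
        for x, child in enumerate(children, start=1):
            visit(child, label + (x,))

    visit(root, ())
    return labels

# Small tree shaped like the slide's example.
tree = ("s1", [("t1", []), ("s2", [("t2", []), ("f1", [])]), ("f2", [])])
labels = dewey_labels(tree)
print(labels)
```

Note that these labels say nothing about tag names, which is the limitation the next slides address.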

  14. Main problem of the original Dewey • If we use the original Dewey labeling scheme to answer a twig query, we need to read labels for all query nodes. Thus, we have no performance benefit compared to previous methods. • Our idea: extend the original Dewey labeling scheme so that, given the label of any element e, we can know the path of e from this label alone.

  15. Modulo function • We need to know some schema information: DTD (Document Type Definition) or XML Schema • Given DTD information: book → author, title, chapter* • Our solution: using the modulo function, we create a mapping between an element tag and an integer. • We define X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2, where X_t is the last component of the label of an element with tag t. [Figure: book (ε) with children author (0), title (1), chapter (2), chapter (5). Why is the second chapter not labeled 3, as in the original Dewey?]

  16. Derive element tag • From a label, we can derive its tag name. • book → author, title, chapter* • Recall that we define: X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2. [Figure: book (ε) with children labeled 0, 1, 2 and 5, whose tags are to be derived: 0 → author, 1 → title, 2 → chapter, 5 → chapter]
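A minimal Python sketch of the modulo idea follows. The assignment rule used here (take the smallest integer larger than the previous sibling's component that has the required remainder) is an assumption chosen because it reproduces the components 0, 1, 2, 5 from the slide; the names `CHILD_TAGS`, `assign_components` and `tag_of` are hypothetical.

```python
from typing import List

# Child tags of book, in DTD order: book → author, title, chapter*
CHILD_TAGS = ["author", "title", "chapter"]

def assign_components(child_tags_in_order: List[str]) -> List[int]:
    """Assign the last label component to each child of a book element.

    Assumed rule: smallest integer greater than the previous sibling's
    component whose remainder mod 3 equals the tag's position in CHILD_TAGS.
    """
    n = len(CHILD_TAGS)
    components: List[int] = []
    prev = -1
    for tag in child_tags_in_order:
        k = CHILD_TAGS.index(tag)
        x = prev + 1
        while x % n != k:
            x += 1
        components.append(x)
        prev = x
    return components

def tag_of(component: int) -> str:
    """Recover the tag of a child of book from its last label component."""
    return CHILD_TAGS[component % len(CHILD_TAGS)]

components = assign_components(["author", "title", "chapter", "chapter"])
print(components)                       # [0, 1, 2, 5]
print([tag_of(x) for x in components])  # ['author', 'title', 'chapter', 'chapter']
```

Under this rule the second chapter cannot be 3, because 3 mod 3 = 0 would decode to author; 5 is the next integer with remainder 2.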

  17. Derive the path from a label • By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label. • For example: DTD: book → author, title, chapter*; chapter → (paragraph | section)*; section → (paragraph | section)* • FST: from book, mod 3 = 0 leads to author, mod 3 = 1 to title, mod 3 = 2 to chapter; from chapter and from section, mod 2 = 0 leads to paragraph and mod 2 = 1 to section. [Figure: example document with book, chapter, title, author, section and paragraph elements] • Question: Given a label 5.1.0 for an element, what is the corresponding path?

  18. Derive the path from a label • By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label. • For example: DTD: book → author, title, chapter*; chapter → (paragraph | section)*; section → (paragraph | section)* • FST: from book, mod 3 = 0 leads to author, mod 3 = 1 to title, mod 3 = 2 to chapter; from chapter and from section, mod 2 = 0 leads to paragraph and mod 2 = 1 to section. • Following the red path in the FST (5 mod 3 = 2: chapter; 1 mod 2 = 1: section; 0 mod 2 = 0: paragraph), we get that 5.1.0 denotes book/chapter/section/paragraph.
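The transducer walk can be sketched in Python as a table lookup per label component. The dictionary layout and the name `derive_path` are assumptions made for illustration; the child-tag order is read off the DTD on the slide, with position k meaning "component mod n = k".

```python
from typing import Dict, List

# Child tags per element, in DTD order; index k corresponds to "component mod n == k".
FST: Dict[str, List[str]] = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}

def derive_path(label: str, root_tag: str = "book") -> str:
    """Derive the tag path of an element from its extended Dewey label alone."""
    path = [root_tag]
    if label:  # the empty label denotes the root itself
        for part in label.split("."):
            children = FST[path[-1]]  # outgoing transitions of the current state
            path.append(children[int(part) % len(children)])
    return "/".join(path)

print(derive_path("5.1.0"))  # book/chapter/section/paragraph
```

Tags without child declarations (author, title, paragraph) have no entry in this table, so a label that continues below them raises a `KeyError` in this sketch.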

  19. Two properties of extended Dewey • Find Ancestor Label • From the label of any element, we can derive the labels of all its ancestors. • Find Ancestor Name • From the label of any element, we can derive the tag names of all its ancestors. • These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
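The first property follows from the fact that every proper prefix of a Dewey-style label is the label of an ancestor. A minimal Python sketch, with hypothetical helper names and the empty string standing for ε:

```python
from typing import List, Tuple

def split_label(label: str) -> Tuple[int, ...]:
    """Turn '0.3.2.1' into (0, 3, 2, 1); the empty label becomes ()."""
    return tuple(int(p) for p in label.split(".")) if label else ()

def ancestor_labels(label: str) -> List[str]:
    """All proper prefixes of a label, from the root (empty label) downwards."""
    parts = label.split(".") if label else []
    return [".".join(parts[:i]) for i in range(len(parts))]

def is_ancestor(a: str, d: str) -> bool:
    """True if label a is a proper prefix of label d."""
    pa, pd = split_label(a), split_label(d)
    return len(pa) < len(pd) and pd[: len(pa)] == pa

print(ancestor_labels("0.3.2.1"))      # ['', '0', '0.3', '0.3.2']
print(is_ancestor("0.3", "0.3.2.1"))   # True
print(is_ancestor("0.3", "0.5.0"))     # False
```

Combining these prefixes with the modulo decoding gives the second property, the tag names of all ancestors.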

  20. Outline • Background • Define our problem: XML twig pattern matching • Previous work and challenges • Our new twig matching algorithms • A new labeling scheme: extended Dewey • A new holistic algorithm: TJFast (a Fast Twig Join algorithm) • Experiments • Conclusion

  21. A new algorithm: TJFast • For each node n in the query, there exists a corresponding input stream Tn. • Tn contains the extended Dewey labels of the elements with tag n. These labels are arranged in document order. • For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements that may contribute to query answers. (What is the difference compared to TwigStack?) • At any point during the computation, the size of the set Sb is bounded by the depth of the XML document.
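Document order on Dewey-style labels corresponds to comparing labels component by component as integers, with an ancestor preceding its descendants. A small Python sketch, assuming labels are stored as dotted strings (the function names are hypothetical):

```python
from typing import List, Tuple

def label_key(label: str) -> Tuple[int, ...]:
    """Sort key: the label's components as integers; the empty label sorts first."""
    return tuple(int(p) for p in label.split(".") if p)

def document_order(labels: List[str]) -> List[str]:
    """Arrange labels in document order (component-wise numeric comparison)."""
    return sorted(labels, key=label_key)

print(document_order(["0.5.0", "0.0.1", "0.3.1"]))  # ['0.0.1', '0.3.1', '0.5.0']
print(document_order(["0.10", "0.9"]))              # ['0.9', '0.10']
```

The second call shows why the comparison must be numeric: as plain strings, "0.10" would sort before "0.9".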

  22. A new algorithm: TJFast • Two-phase algorithm: • Phase 1: parts of intermediate root-leaf paths are output • Insert elements that may contribute to query answers into the sets • Output intermediate paths according to the elements in the sets • Phase 2: the intermediate paths are merge-joined to get the final results

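  23. An example for TJFast algorithm • Query: A with descendant D (A//D) and child B (A/B), where B has descendant C (B//C) • DTD: a → a*, d*, b*; b → d*, c*; d → c* • Document (element: extended Dewey label; the figure elides some nodes with …): Root (ε); a1 (0); a2 (0.0), a3 (0.3), b2 (0.5); d1 (0.0.1), d2 (0.3.1), b1 (0.3.2), d3 (0.5.0); c1 (0.3.2.1), c2 (0.5.0.0) • A set for the branching node A: { } • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0 • Why do we not need TA, TB streams?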

  24. An example for TJFast algorithm • [Document, query and streams as on slide 23] • Set for A: { } • By the finite state transducer of the extended Dewey labeling scheme, we derive 0.0.1 → a1/a2/d1 and 0.3.2.1 → a1/a3/b1/c1. • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

  25. An example for TJFast algorithm • [Document, query and streams as on slide 23] • Set for A: { } • Both a1 and a3 may contribute to query answers. (Why not a2?) • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

  26. An example for TJFast algorithm • [Document, query and streams as on slide 23] • Set for A: { } • Then we insert a1 into the set, since a1 is an ancestor of a3. • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

  27. An example for TJFast algorithm • [Document, query and streams as on slide 23] • Set for A: {a1} • Move the cursor of TD from d1 to d2 and output one path solution <a1, d1>. • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

  28. An example for TJFast algorithm • [Document, query and streams as on slide 23] • Set for A: {a1, a3} • We derive 0.3.1 → a1/a3/d2. • We insert a3 into the set, since a3 definitely contributes to query answers. • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

  29. An example for TJFast algorithm • [Document, query and streams as on slide 23] • Set for A: {a1, a3} • Move the cursor of stream TD from d2 to d3 and output <a1, d2> and <a3, d2>. • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

  30. An example for TJFast algorithm • [Document, query and streams as on slide 23] • Set for A: {a1, a3} • Move the cursor of stream TC from c1 to c2 and output the path <a3, b1, c1>. • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

  31. An example for TJFast algorithm • [Document, query and streams as on slide 23] • Set for A: {a1, a3} • Move the cursor of TD to the end and output the path solution <a1, d3>. • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

  32. An example for TJFast algorithm • [Document, query and streams as on slide 23] • Set for A: {a1, a3} • Move the cursor of TC to the end and output <a1, b2, c2>. • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

  33. An example for TJFast algorithm • [Document, query and streams as on slide 23] • Set for A: {a1, a3} • Now all five elements have been scanned; in the second phase we merge-join all output path solutions. • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

  34. An example for TJFast algorithm • Phase 1. Intermediate paths: • A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2> • A/B//C: <a1, b2, c2>, <a3, b1, c1> • Phase 2. Final solutions (join), as <A, D, B, C>: <a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>
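The second phase on this example can be reproduced with a short Python sketch. It joins the two lists of path solutions on the branching node A; note that it uses a hash lookup for brevity rather than the merge-join named on the slides, and the function name `join_on_branching_node` is hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

# Phase 1 output from the slide: path solutions for A//D and A/B//C.
a_d: List[Tuple[str, str]] = [("a1", "d1"), ("a1", "d2"), ("a1", "d3"), ("a3", "d2")]
a_b_c: List[Tuple[str, str, str]] = [("a1", "b2", "c2"), ("a3", "b1", "c1")]

def join_on_branching_node(
    left: List[Tuple[str, str]], right: List[Tuple[str, str, str]]
) -> List[Tuple[str, str, str, str]]:
    """Combine path solutions that agree on the branching node A into <A, D, B, C>."""
    by_a = defaultdict(list)
    for a, b, c in right:
        by_a[a].append((b, c))  # index the A/B//C paths by their A element
    return [(a, d, b, c) for a, d in left for b, c in by_a.get(a, [])]

solutions = join_on_branching_node(a_d, a_b_c)
for s in solutions:
    print(s)
```

The result contains the four final solutions listed on the slide, in the same order.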

  35. Outline • Background • Define our problem: XML twig pattern matching • Previous work and challenges • Our new twig matching algorithms • A new labeling scheme: extended Dewey • A new holistic algorithm: TJFast • Experimental results • Conclusion

  36. Experiments • Benchmarks • XMark: synthetic data • DBLP: real data from the DBLP database • Treebank: real data from the Wall Street Journal

  37. Path query • We compared PathStack [2] and TJFast on the following four path queries on XMark data.

  38. Experiments: Number of elements read and input file size for path queries • Observation: TJFast scans fewer elements than PathStack does. • Explanation: TJFast only scans labels for leaf nodes in queries, but PathStack scans labels for all nodes in the query.

  39. Experiments: Execution time for path queries • Observation: TJFast performs better than PathStack for all four path queries. • Explanation: TJFast reduces I/O cost by reading fewer elements.

  40. Twig queries We compared TwigStack, TwigStackList and TJFast on the following five twig queries on DBLP and TreeBank data.

  41. Experiments: Number of elements read and input file size for twig queries • Observation: TJFast scans far fewer elements than TwigStack and TwigStackList do in two twig queries. • Explanation: TJFast only scans elements for leaf nodes in queries, but TwigStack/TwigStackList needs to scan elements for all nodes, and the number of elements for non-leaf nodes is much larger than that for leaf nodes.

  42. Experiments: Execution time for twig queries • TW-SS and TJ-SS denote the sequential scan time of input data for TwigStack/TwigStackList and TJFast, respectively. • Observation: For DBLP data, TJFast performs much better than TwigStack/TwigStackList. • Explanation: TJFast reduces I/O cost by reading fewer elements.

  43. Outline • Background • Define our problem: XML twig pattern matching • Previous work and challenges • Our new twig matching algorithms • A new labeling scheme: extended Dewey • A new holistic algorithm: TJFast • Experimental results • Conclusion

  44. Conclusions • Efficient processing of twig queries is a core operation in XPath and XQuery • We have proposed a new labeling scheme, extended Dewey, and a new holistic twig pattern matching algorithm, TJFast. • Compared to previous work: • TJFast reduces the input I/O cost • TJFast reduces the output I/O cost for intermediate results.

  45. Reference • [1] S. Al-Khalifa, H. V. Jagadish, J. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In ICDE, pages 141-152, 2002. • Proposes the StackTree algorithm • [2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: Optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002. • Proposes the TwigStack algorithm • [3] T. Chen, J. Lu, and T. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In SIGMOD, 2005. • Proposes two new data streaming techniques • [4] Y. Chen, S. B. Davidson, and Y. Zheng. BLAS: An efficient XPath processing system. In Proc. of SIGMOD, pages 47-58, 2004. • Proposes a new algorithm for XPath queries

  46. Reference • [5] H. Jiang, W. Wang, and H. Lu. Holistic twig joins on indexed XML documents. In VLDB, 2003. • Proposes the TSGeneric algorithm • [6] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: A look-ahead approach. In CIKM, pages 533-542, 2004. • Proposes the TwigStackList algorithm • [7] P. Rao and B. Moon. PRIX: Indexing and querying XML using Prüfer sequences. In ICDE, pages 288-300, 2004. • Proposes the PRIX system • [8] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: A dynamic index method for querying XML data by tree structures. In SIGMOD, 2003. • Proposes the ViST system • [9] B. Yang, M. Fontoura, E. J. Shekita, S. Rajagopalan, and K. S. Beyer. Virtual cursors for XML joins. In CIKM, pages 523-532, 2004. • Proposes the Virtual Cursor algorithm

  47. END • Thank you! • Q & A

  48. Related work • Comparison between Virtual Cursor (VC) [Yang CIKM 2004] and our work • Developed independently • TJFast uses a finite state transducer; VC uses a path table • The size of the path table depends on the number of distinct paths, but that of the FST depends on the number of distinct element types. • TJFast reduces the number of useless intermediate paths for queries with parent-child edges, but VC does not have this property

  49. Backup • [Figure: query with nodes a, b, c, d, e; document with elements a1, b1, f1, a2, c1, d1, f2, c2, e1] • TwigStackList outputs <a1, b1>, but TJFast does not output this path solution.

  50. Labels size
