
Tree-Pattern Queries on a Lightweight XML Processor

MIRELLA M. MORO, Zografoula Vagena, Vassilis J. Tsotras. Research partially supported by CAPES, NSF grant IIS 0339032, UC Micro, and Lotus Interworks.



  1. Tree-Pattern Queries on a Lightweight XML Processor • MIRELLA M. MORO, Zografoula Vagena, Vassilis J. Tsotras • Research partially supported by CAPES, NSF grant IIS 0339032, UC Micro, and Lotus Interworks

  2. Outline • Motivation and Contributions • Background • Method Categorization • Experimental Evaluation • Conclusions

  3. Motivation • XML query languages: selection on both value and structure • “Tree-pattern” queries (TPQ) are very common in XML • Many promising holistic solutions • None in lightweight XML engines • Without an optimization module (e.g. eXist, Galax) • → Need an effective, robust processing method • Reasons: • No systematic comparison of query methods under a common storage model • No integration of all methods under such a storage model • Context: XPath semantics, stored data (indexed at will)

  4. Contributions • TPQ methods over a unified environment • Method categorization: data access patterns and matching algorithm • Common storage model + integration of all methods • Captures the access features • Permits clustering data with off-the-shelf access methods (e.g. B+tree) • Novel variations of methods that use index structures and handle TPQ • Extensive comparative study • Synthetic, benchmark and real datasets • Assessment of applicability, robustness and efficiency

  5. Background • XML database = forest of unranked, ordered, node-labeled trees, one tree per document • TPQ example [figure: document tree Bib (1,20) → article (2,19) → title (3,5), author (6,13), procs (14,18); author → last (7,9), first (10,12); procs → conf (15,17); text leaves t1 (4), DeWitt (8), David J. (11), VLDB (16); query pattern over article, author, last, procs, conf; containment check 2<7<9<19]
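
The containment check 2<7<9<19 on this slide can be sketched in a few lines. This is a minimal illustration, assuming each element carries a (left, right) region as in the figure; the `Node` type and the `is_ancestor` name are hypothetical, not taken from the presentation.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """An element with its region number (hypothetical helper type)."""
    tag: str
    left: int
    right: int


def is_ancestor(a: Node, d: Node) -> bool:
    """a is an ancestor of d iff a's region strictly contains d's region."""
    return a.left < d.left and d.right < a.right


# Regions copied from the slide's figure.
article = Node("article", 2, 19)
title = Node("title", 3, 5)
last = Node("last", 7, 9)

print(is_ancestor(article, last))  # True, because 2 < 7 < 9 < 19
print(is_ancestor(title, last))    # False, the regions are disjoint
```

With this encoding an ancestor/descendant test needs only two integer comparisons and no tree navigation.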

  6. Common Storage Model • Input = sequence (list) of elements • One list per document tag = element list • Node clustering by index structures • Numbering scheme [figure: document tree bib (1,26) with book (2,9), book (10,17), paper (18,25), each containing an author with name and address children; B+ tree on (tag, initial) clustering the element lists: author (3,8) (11,16) (19,24); name (4,5) (12,13) (20,21); address (6,7) (14,15) (22,23)]
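
As a rough sketch of this storage model, a sorted in-memory list can stand in for the B+ tree on (tag, initial). The entries are the ones readable in the figure (taking bib as (1,26)); the `element_list` function and the use of Python's `bisect` module in place of a real B+ tree are assumptions made for illustration only.

```python
import bisect

# (tag, left, right) entries kept sorted on (tag, left), which is how a
# B+ tree on that composite key would cluster them.
ENTRIES = sorted([
    ("bib", 1, 26),
    ("book", 2, 9), ("book", 10, 17), ("paper", 18, 25),
    ("author", 3, 8), ("author", 11, 16), ("author", 19, 24),
    ("name", 4, 5), ("name", 12, 13), ("name", 20, 21),
    ("address", 6, 7), ("address", 14, 15), ("address", 22, 23),
])


def element_list(tag, min_left=0):
    """Return the entries for `tag` with left >= min_left, in document order."""
    start = bisect.bisect_left(ENTRIES, (tag, min_left))
    result = []
    for entry in ENTRIES[start:]:
        if entry[0] != tag:  # ran past the cluster for this tag
            break
        result.append(entry)
    return result


print(element_list("author"))      # all three author elements
print(element_list("author", 10))  # only the authors starting at or after 10
```

Because all elements of one tag are contiguous and ordered on left, each element list can be read sequentially or entered at an arbitrary position with one index probe.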

  7. Method Categorization • Parameters: access pattern and matching algorithm (1) set-based techniques (2) query-driven (3) input-driven (4) structural summaries

  8. Cat 1: Set-based Techniques • Input: sequences of elements, one list per query node element, possibly indexed (set-based) • Major representative: TwigStack • Optimal XML pattern matching algorithm (ancestor/descendant) • Stack-based processing • Set of stacks = compact encoding of partial and total results in linear space (possibly exponential number of answers)
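
To make the stack-based idea concrete, here is a sketch of a binary stack-based structural join for a single ancestor//descendant edge. It is deliberately not TwigStack itself, which coordinates one stack per query node; it only shows how a stack of nested ancestors lets two sorted element lists be joined in one pass. The function name and the sample regions are made up for illustration.

```python
def stack_join(ancestors, descendants):
    """Join two lists of (left, right) regions, both sorted on left.

    Returns every (a, d) pair where region a strictly contains region d.
    """
    results = []
    stack = []  # nested chain of ancestors that are still "open"
    i = 0
    for d in descendants:
        # Push every ancestor candidate that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            while stack and stack[-1][1] < a[0]:
                stack.pop()  # closed before a starts
            stack.append(a)
            i += 1
        # Drop candidates that closed before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Everything left on the stack contains d.
        results.extend((a, d) for a in stack)
    return results


# Made-up regions: (2,9) and (12,19) are nested inside (1,20).
anc = [(1, 20), (2, 9), (12, 19)]
desc = [(3, 4), (10, 11), (13, 14)]
pairs = stack_join(anc, desc)
print(len(pairs))  # 5
```

Each input element is pushed and popped at most once, so the work is linear in the input plus the output size.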

  9. TwigStack + Indexes • B+tree, built on the left attribute • From an ancestor, probe the descendant list: skip initial nodes • Ancestor skipping not effective (only up to the first element that follows) • XB-tree: on (left,right) bounding segment • XR-tree: on (left,right), B+tree with complex index key + stab lists • A comparative study* shows that • Skipping ancestors: XB-tree better (XB-tree size is smaller) • Recursive level of ancestors: XB-tree better again • Searching on stab lists of XR-tree is less efficient • Plain B+tree: skips descendants, BUT not ancestors • XBTwigStack is our choice * H. Li et al. “An Evaluation of XML Indexes for Structural Joins”. SIGMOD Record, 33(3), Sept. 2004
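
The descendant skipping that a plain B+tree on left allows can be sketched as follows. A sorted Python list probed with `bisect` stands in for the index (an assumption; the slides use Berkeley DB B+ trees), and `descendants_in` is a hypothetical name. The sample regions are the name and author elements from the Common Storage Model figure.

```python
import bisect


def descendants_in(region, candidates):
    """Return the candidates strictly inside `region`.

    `candidates` is a list of (left, right) regions sorted on left.
    One probe skips every candidate that starts before the ancestor.
    """
    left, right = region
    # First candidate whose left is greater than the ancestor's left.
    start = bisect.bisect_right(candidates, (left, float("inf")))
    result = []
    for cand in candidates[start:]:
        if cand[0] > right:  # starts after the ancestor ends: stop
            break
        result.append(cand)
    return result


names = [(4, 5), (12, 13), (20, 21)]
print(descendants_in((11, 16), names))  # [(12, 13)]
```

The reverse direction is the weak spot noted on the slide: the ancestors of a node may start anywhere before it, so an index ordered on left alone cannot jump straight to them.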

  10. Cat 2: Query Driven Techniques • Processing: the query defines the way input is probed • Major representatives: ViST and PRIX • Specific details: significantly different • Same strategy • Convert both document and query to sequences • Processing query = subsequence matching

  11. ViST and PRIX • Recursively identify matches = quadratic time • Optimize the naïve solution: • Identify candidate nodes for each matching step • Index structures to cluster those candidates • Subsequence matching process = a plan consisting of index nested-loop joins (INLJ) among relations, each of which groups document nodes with the same label • For a given query, the join sequence is statically defined by the sequencing of the query • INLJ plans are a superset of the static plans that PRIX and ViST use

  12. ViST x PRIX x INLJ • Percentage of nodes processed by each algorithm • INLJ: best plan

  13. INLJ: improved B+tree • Consider b//c, starting from c • TPQ → evaluation of a relational plan • Independence of the ordered XML model • Total avoidance of false positives [figure: document tree with region-numbered nodes, including a (1,52), b (2,31), b (32,41), b (42,51) and nested c nodes, next to the b element list]
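
A minimal sketch of evaluating b//c as an index nested-loop join that starts from c: for every c, the b list is probed for containing regions. The regions below are loosely based on the figure, whose exact nesting is not recoverable from the transcript, so treat them as illustrative. The sorted list with `bisect` stands in for a B+tree on left, and this plain version does not include the improved ancestor skipping the slide refers to.

```python
import bisect

# Illustrative regions, sorted on left (assumed nesting).
B_LIST = [(2, 31), (32, 41), (33, 40), (42, 51)]
C_LIST = [(34, 37), (35, 36), (38, 39)]


def inlj_from_descendants(outer, inner):
    """For each descendant in `outer`, probe `inner` for its ancestors."""
    pairs = []
    for d in outer:
        # Only elements that start before d can contain it.
        end = bisect.bisect_left(inner, (d[0],))
        for a in inner[:end]:
            if a[1] > d[1]:  # a also ends after d: a contains d
                pairs.append((a, d))
    return pairs


result = inlj_from_descendants(C_LIST, B_LIST)
print(len(result))  # 6
```

Every emitted pair is checked against the region numbers, which is why a plan of this kind produces no false positives; the cost is that b (2,31) is still inspected for every c even though it never qualifies.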

  14. Cat 3: Input Driven Techniques • Processing: at each point, the flow of computation is guided entirely by the input through a Finite State Machine (DFA/NFA) • Advantages • Each node processed only once • Simplicity, sequential access pattern • Problem: skipping elements

  15. SingleDFA and IdxDFA • SingleDFA • <element> triggers the DFA, choosing the next state • </element>: execution backtracks to the state in effect when the start tag was processed • TPQ matching: intermediate results compacted on stacks • Experiments show that reading the whole input is not good enough • Speeding up navigation: IdxDFA • Instead of reading sequentially: use indexes and skip descendants
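
The start-tag/end-tag behaviour can be sketched with a small stack automaton. This is a simplified illustration for linear descendant-only paths such as //a//b//d, not the SingleDFA implementation from the talk: it has no branch predicates and no result stacks. The event format and the `match_path` name are assumptions.

```python
def match_path(steps, events):
    """Match a descendant-only path (e.g. ['a', 'b', 'd'] for //a//b//d).

    `events` is a list of ("start", tag) / ("end", tag) pairs in document
    order. Returns the preorder positions of the matching elements.
    """
    n = len(steps)
    stack = [0]  # state = how many leading steps the open ancestors matched
    matches = []
    position = 0
    for kind, tag in events:
        if kind == "start":
            position += 1
            state = stack[-1]
            # The element matches if it is the last step and its ancestors
            # already cover all earlier steps.
            if tag == steps[-1] and state >= n - 1:
                matches.append(position)
            if state < n and tag == steps[state]:
                state += 1
            stack.append(state)
        else:
            # End tag: return to the state in effect before the start tag.
            stack.pop()
    return matches


# <r><a><b><d/></b><c><d/></c></a><d/></r>
EVENTS = [
    ("start", "r"), ("start", "a"), ("start", "b"), ("start", "d"),
    ("end", "d"), ("end", "b"), ("start", "c"), ("start", "d"),
    ("end", "d"), ("end", "c"), ("end", "a"), ("start", "d"),
    ("end", "d"), ("end", "r"),
]
print(match_path(["a", "b", "d"], EVENTS))  # [4]
print(match_path(["a", "d"], EVENTS))       # [4, 6]
```

Every event is handled exactly once and in document order, which shows both the appeal (simple, sequential access) and the problem named on the previous slide: nothing is ever skipped.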

  16. IdxDFA: example [figure: query over a, b, c, d and a document tree with nodes c1, b2, a3, c4, c5, d6, d7, b9, d9, c10, d11, a12, b13, d14, c15, c16, b21, c22]

  17. IdxDFA: example (continued) [figure: the same query and document tree, with the nodes accessed through the indexes highlighted]

  18. Cat 4: Graph Summary Evaluation • Structural summary: index node identifies a group of nodes in the document • Processing: identify index nodes that satisfy the query + post processing filtering • Beneficial: when there is a reasonable structural index, much smaller than document • Problem: graph size comparable/larger than original document

  19. Categories Summary

  20. Experimental Evaluation • Experiments with real datasets • Experiments with synthetic datasets • Further analyze each method • Characterize the methods according to specific features available in each custom dataset • More sets of experiments • Closely compare XBTwigStack versus INLJ

  21. Setup • Algorithms using the same API • Analysis varying structure and selectivity • Performance measure = total time required to compute a query • Number of nodes as secondary information • Intel Pentium 4 2.6 GHz, 1 GB RAM • Berkeley DB: 100 buffers, page size 8 KB, B+ tree • Real/benchmark datasets • XMark (Internet auction, 1.4 GB raw data, about 17 million nodes), Protein Sequence Database

  22. XMark

  23. Custom Data • Goal: isolate important features • Query //a//b[.//c]//d • Simple enough for detailed investigation • Complex enough to provide a large number of different data access possibilities • Vary selectivity of each element separately • Add recursion to key elements (root, leaf) [figure: query tree over a, b, c, d]

  24. Custom Data [figure: query tree over a, b, c, d]

  25. Custom Data [figure: query tree over a, b, c, d]

  26. XBTwigStack x INLJ • On a large dataset: 40 million nodes, 1 GB, 1% selectivity • Difference of 40 s between XBTwigStack and the best INLJ plan

  27. XBTwigStack x INLJ

  28. Conclusions • Categorization of TPQ processing algorithms • Adaptations for processing TPQ • DFA + accessing nodes from B+tree • INLJ + ancestor skipping • Improved DFA-based method (IdxDFA): still not enough • If a structural summary is available and smaller than the document: StrIdx • XBTwigStack: most robust and predictable • INLJ when selectivity is high: no guarantee about the chosen plan without an optimizer module

  29. Questions?

  30. EXTRA SLIDES

  31. Background • Region numbering scheme: (left, right) [figure: document tree Bib (1,36) with two articles, (2,19) and (20,35); first article: title (3,5), author (6,13), procs (14,18), last (7,9), first (10,12), conf (15,17), text leaves t1 (4), DeWitt (8), David J. (11), VLDB (16); second article: title (21,23), author (24,31), procs (32,34), last (25,27), first (28,30), text leaves t2 (22), Lu (26), Hongjun (29), conf (33); query pattern over article, author, last, procs, conf]

  32. TwigStack • 1) Compute solutions to individual root-to-leaf paths • 2) Merge-join those partial solutions • Before adding an element to its stack: (i) the node has a descendant on each of the query children streams (ii) each of those descendant nodes recursively satisfies this property • Optimized by indexes [figure: document and query over a, b, c with stacks Sa, Sb, Sc; results a1 b1 c1, a1 b1 c2, a1 b2 c1, a2 b2 c1]

  33. TwigStack + Indexes • B+-tree: built on the left attribute • Access ancestor, then probe descendant stream to skip unmatchable initial nodes • Ancestor skipping not effective: • Skip only up to the first element following a given one • XB-tree: index on (left,right) bounding segment • Pointer to children (region completely included in parent) • Leaves sorted on left • Region: ancestor access effective • XR-tree: index on (left,right) = B+tree with complex index key + stab lists • Ancestor skipping: elements stabbed by left [figure: small document tree over a, b, c]

  34. ViST, Virtual Suffix Tree • Input: sequence of (symbol, path) pairs (a1,)(b1,a1)(a2,a1b1)(b2,a1b1a2)(c1,a1b1a2b2)(c2,a1b1a2) • Document and query translated • Virtual suffix tree (B+-tree) indexed on left • Processing • Structural query = find (non-contiguous) subsequence matches → suffix tree • Benefit: query as a whole instead of merging parts • One query path at a time • Efficient when the query top defines the results [figure: document tree a1 → b1 → a2 → (b2 → c1), c2]
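
The (symbol, path) sequence on this slide can be reproduced with a short preorder traversal. This is a sketch assuming a nested `(label, children)` tuple representation of the document, which is not from the presentation; the tree shape is read off the example sequence itself.

```python
def vist_sequence(tree, prefix=()):
    """Return the preorder list of (symbol, path-from-root) pairs."""
    label, children = tree
    sequence = [(label, "".join(prefix))]
    for child in children:
        sequence.extend(vist_sequence(child, prefix + (label,)))
    return sequence


# Document from the slide: a1 -> b1 -> a2 -> (b2 -> c1), c2
DOC = ("a1", [("b1", [("a2", [("b2", [("c1", [])]), ("c2", [])])])])

for pair in vist_sequence(DOC):
    print(pair)
```

Since a query is translated the same way, matching it against a document reduces to finding the query sequence as a (non-contiguous) subsequence of the document sequence.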

  35. ViST, index • Virtual suffix tree • B+tree, nodes indexed on the left position • D-Ancestor and S-Ancestor [figure: document tree b (1,13) with children a (2,4), a (5,7), a (8,12), and the B+ index entries (b,) → 1,13; (a,b) → 2,4 5,7 8,12; (b,ba) → 6; (c,ba) → 3 and 9,11; (c,bac) → 10]

  36. ViST: query processing example • Query sequence: (A, ε) (B,A) (C,B) (D,B) • Stages: ViST query, ViST subsequence matching, ViST final filtering [figure: query tree and numbered document tree over A, B, C, D, F]

  37. 1,13 (b,) 2,4 3 (a,b) 5,7 8,12 10 B+ (b,ba) 6 9,11 (c,bac) b 1,13 b (c,ba) a a a 5,7 2,4 8,12 a b c c 6 9,11 3 c c 10 ViST, algorithm Q = (b,) (a,b) (c,ba) Q = q1, … qk, query sequence D-tree B+ index of (symbol, prefix) S-tree B+ index of region labels function Search (region, i) if i < |Q| T = retrieve qi S-tree from D-tree N = retrieve from S-tree all nodes in the range region foreach node c(left,right)  N Search ( (left,right), i+1) else return result Search ( (1,13), (b,) ) Search ( (1,13), (a,b) ) Search ( (2,4), (c,ba) ) Search ( (5,7), (c,ba) ) Search ( (8,12), (c,ba) ) Tree-Pattern Queries on a Lightweight XML Processor

  38. ViST, access order • Q = (b,) (a,b) (c,ba) • Calls: Search((1,13), (b,)), Search((1,13), (a,b)), Search((2,4), (c,ba)), Search((5,7), (c,ba)), Search((8,12), (c,ba)) [figure: the same document tree and B+ index entries, with nodes numbered 1 to 7 in access order and the failed probe marked X]

  39. D1 = (a,) (b,a) (d,ab) (e,ab) (c,a) (f,ac) (d,ac) D2 = (a,) (b,a) (d,ab) (b,a) (e,ab) Q = (a,) (b,a) (d,ab) (e,ab) a a b c c b a a a b c b b b d e f d d e d e ViST, discussion  • Worst-case storage requirement for D-Ancestor is > linear in #elements • E.g. unary tree with n nodes, sequence O(n2) • False alarms • Our implementation: no false alarms • //a[//b]//c unordered • Vist: (a, )(b,a)(c,a) & (a, )(c,a)(b,a) • Our implementation: run the twig query only once   Tree-Pattern Queries on a Lightweight XML Processor

  40. PRIX, PRüfer sequences for Indexing XML • Input: sequence of labels • Document & query mapped by Prüfer’s method • Tree → sequence: remove one node at a time • Processing • Sequence matching against indexed db: filter non-matches • Refinement phases: filter twig matches; the results: • Form a tree, satisfy the twig query, include the leaf nodes • Bottom-up approach • Example: LPS = A C B C C B A C A E E E D A, NPS = 15 3 7 6 6 7 15 9 15 13 13 13 14 15 (any numbering scheme works; here it is post-order)
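
The LPS/NPS example can be reproduced with a small sketch of the construction: number the nodes in post-order, then repeatedly delete the leaf with the smallest number and record its parent's label (LPS) and number (NPS) until only the root remains. The tree shape and inner labels below are derived from the NPS and LPS on the slide; the leaf labels do not appear in either sequence, so the placeholder label "x" is an assumption, as is the `(label, children)` representation.

```python
import heapq


def prufer_sequences(tree):
    """Return (LPS, NPS) for a tree given as nested (label, children) tuples."""
    labels, parent = {}, {}
    counter = 0

    def number(node):
        """Assign post-order numbers; record labels and parent links."""
        nonlocal counter
        label, children = node
        child_ids = [number(child) for child in children]
        counter += 1
        labels[counter] = label
        for child_id in child_ids:
            parent[child_id] = counter
        return counter

    number(tree)
    child_count = {n: 0 for n in labels}
    for p in parent.values():
        child_count[p] += 1
    leaves = [n for n in labels if child_count[n] == 0]
    heapq.heapify(leaves)

    lps, nps = [], []
    while len(nps) < len(labels) - 1:
        leaf = heapq.heappop(leaves)  # smallest-numbered leaf
        p = parent[leaf]
        lps.append(labels[p])
        nps.append(p)
        child_count[p] -= 1
        if child_count[p] == 0 and p in parent:  # p became a leaf (not root)
            heapq.heappush(leaves, p)
    return lps, nps


LEAF = ("x", [])  # placeholder leaf label (assumption)
DOC = ("A", [
    LEAF,
    ("B", [("C", [LEAF]), ("C", [LEAF, LEAF])]),
    ("C", [LEAF]),
    ("D", [("E", [LEAF, LEAF, LEAF])]),
])
lps, nps = prufer_sequences(DOC)
print(" ".join(lps))
print(nps)
```

With post-order numbering every node's children carry smaller numbers than the node itself, so the nodes are deleted in the order 1, 2, 3, and so on, and entry i of the NPS is simply the parent of node i.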

  41. PRIX, Processing • Problems • Complex solution • //a[//b]//c unordered: same problem as ViST • What we do • Region based numbering scheme and XB-tree • Bottom-up traversal of the query + subtwig merging • Access nodes in the same order • Efficient when query bottom defines results

  42. A(k) index • A(k): k is the degree of similarity, the “size of common path” • ≈k: k-bisimilarity 1) for any two nodes u and v, u ≈0 v iff u and v have the same label 2) u ≈k v iff u ≈k-1 v and for every parent u’ of u there is a parent v’ of v s.t. u’ ≈k-1 v’, and vice versa [figure: original document with nodes 1 A; 2 B, 3 C; 4 D, 5 D; 6 E, 7 E, and its A(0), A(1) and A(2) index graphs; A(0) merges 4,5 and 6,7, A(1) merges only 6,7]
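
For tree-shaped documents, where every node has at most one parent, the k-bisimilarity classes can be computed by refining node signatures k times. The sketch below assumes the figure's document has edges 1→2, 1→3, 2→4, 3→5, 4→6 and 5→7, which is the reading consistent with the groupings shown; the function name and the dictionary representation are hypothetical.

```python
def ak_partition(labels, parent, k):
    """Group the nodes of a tree into k-bisimilarity classes.

    labels: node id -> label; parent: node id -> parent id (root absent).
    """
    signature = dict(labels)  # level 0: the label alone
    for _ in range(k):
        # Level j: own level j-1 signature plus the parent's.
        signature = {
            node: (signature[node], signature.get(parent.get(node)))
            for node in labels
        }
    groups = {}
    for node, sig in signature.items():
        groups.setdefault(sig, []).append(node)
    return sorted(sorted(group) for group in groups.values())


LABELS = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
PARENT = {2: 1, 3: 1, 4: 2, 5: 3, 6: 4, 7: 5}

for k in range(3):
    print(k, ak_partition(LABELS, PARENT, k))
```

Larger k gives a more precise but larger index graph: here A(0) has 5 index nodes, A(1) has 6, and A(2) already has one index node per document node.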

  43. Protein
