
On the Optimality of the Holistic Twig Join Algorithm

On the Optimality of the Holistic Twig Join Algorithm. Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson (Upenn), Malika Mahoui (Upenn) and Derick Wood (HKUST). A Scenario. Small Devices. XML Doc. Server. Limited computing resources. Memory. Picking up useful elements on the fly.


Presentation Transcript


  1. On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson (Upenn), Malika Mahoui (Upenn) and Derick Wood (HKUST) DIMACS Streaming Data Working Group II

  2. A Scenario Small Devices XML Doc. Server Limited computing resources Memory Picking up useful elements on the fly Streams of elements Memory is shared by many Concurrent apps.

  3. Background • The Model, Data Representation and Assumptions

  4. The Model • Data Streaming Model • Spend constant time to process each element • An element in a stream is either discarded or stored in the main memory once it is processed • See the element in streams only once

  5. Node Representation • 4-tuple: <preorder #, postorder #, depth, label> • Complexity of Desc, Child, Ances, Parent: O(1) • Desc(n1, n2) = true (n2 is a descendant of n1) if n1.preorder < n2.preorder ∧ n1.postorder > n2.postorder • Child(n1, n2) = true (n2 is a child of n1) if n1.preorder < n2.preorder ∧ n1.postorder > n2.postorder ∧ n1.depth + 1 = n2.depth

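  6. Example Document • Nodes as <preorder #, postorder #, depth, label>: a1 = (1, 9, 1, A), b1 = (2, 7, 2, B), a2 = (3, 6, 3, A), b2 = (4, 4, 4, B), c1 = (5, 5, 4, C), c2 = (8, 8, 2, C) • Tree: a1 has children b1 and c2; b1 has child a2; a2 has children b2 and c1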
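
The node representation of slide 5 and the example document of slide 6 can be tried out directly. The following is a minimal Python sketch, not the authors' code: the `Node` class and the function names `desc` and `child` are illustrative choices, and the pairing of node names with tuples is read off from the labels and depths on the slide.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """<preorder #, postorder #, depth, label> as on slide 5."""
    pre: int
    post: int
    depth: int
    label: str


def desc(n1: Node, n2: Node) -> bool:
    """True if n2 is a descendant of n1 (constant-time comparison)."""
    return n1.pre < n2.pre and n1.post > n2.post


def child(n1: Node, n2: Node) -> bool:
    """True if n2 is a child of n1: a descendant exactly one level deeper."""
    return desc(n1, n2) and n1.depth + 1 == n2.depth


# Tuples copied from the example document (slide 6).
a1 = Node(1, 9, 1, "A")
b1 = Node(2, 7, 2, "B")
a2 = Node(3, 6, 3, "A")
b2 = Node(4, 4, 4, "B")
c1 = Node(5, 5, 4, "C")
c2 = Node(8, 8, 2, "C")

print(desc(a1, c1))   # True: c1 lies inside a1's (pre, post) range
print(child(a1, c1))  # False: c1 is three levels below a1
print(child(a2, c1))  # True
print(desc(b1, c2))   # False: c2 lies outside b1's range
```

Both predicates only compare integers, which is what makes the O(1) claim on slide 5 hold for this representation.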

  7. Twig Queries • Syntax:
       Step ::= / | //
       NodeTest ::= symbol
       Path ::= Step NodeTest | Step NodeTest Path
       Twig ::= Path | Path (Twig, Twig, …, Twig)
     • Example: //A (//B, //C) • In English: find the A nodes that have a B descendant and a C descendant [Figure: query tree with root A and branches B and C]
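
The grammar on slide 7 is small enough to parse by recursive descent. The sketch below is an illustration only: `QNode`, `tokenize`, `parse_twig` and `parse` are hypothetical names, and it assumes that a NodeTest symbol is a run of word characters and that whitespace is insignificant.

```python
import re
from dataclasses import dataclass, field


@dataclass
class QNode:
    axis: str    # "/" is a child edge, "//" is a descendant edge
    label: str   # the NodeTest symbol
    children: list = field(default_factory=list)


TOKEN = re.compile(r"\s*(//|/|\(|\)|,|\w+)")
PUNCT = {"/", "//", "(", ")", ","}


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_twig(tokens, i=0):
    """Twig ::= Path | Path (Twig, ..., Twig); returns (root, next index)."""
    root = last = None
    # Path ::= Step NodeTest | Step NodeTest Path
    while i < len(tokens) and tokens[i] in ("/", "//"):
        if i + 1 >= len(tokens) or tokens[i + 1] in PUNCT:
            raise ValueError("a step must be followed by a node test")
        node = QNode(tokens[i], tokens[i + 1])
        if last is None:
            root = node
        else:
            last.children.append(node)
        last = node
        i += 2
    if root is None:
        raise ValueError("a twig must start with a step")
    # Optional branch list, attached to the last node of the path.
    if i < len(tokens) and tokens[i] == "(":
        i += 1
        while True:
            branch, i = parse_twig(tokens, i)
            last.children.append(branch)
            if i < len(tokens) and tokens[i] == ",":
                i += 1
                continue
            break
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("expected ')'")
        i += 1
    return root, i


def parse(text):
    tokens = tokenize(text)
    root, i = parse_twig(tokens)
    if i != len(tokens):
        raise ValueError("trailing input after twig")
    return root


query = parse("// A (//B, //C)")
print(query.axis, query.label, [(c.axis, c.label) for c in query.children])
```

The resulting tree has one query node per NodeTest, which is the granularity at which TwigStack later associates streams.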

  8. Twig Join Algorithms • Containment Join [Jiang et al.] • Decompose a twig query into a set of steps • Apply relational join algorithms to join the nodes of each step • Use customized traditional indexes and estimation methods [SIGMOD03] • Path Join [Zhang et al.] • Decompose a twig query into a set of paths • Apply relational join algorithms to join the nodes of each path • Holistic Twig Join [Bruno et al.] • Evaluate the twig query as a whole

  9. Twig Join Algorithms (cont’d) • The first two approaches may compute large intermediate results and are not suitable for data streaming • In this talk we will focus on the third approach • The TwigStack Algorithm (Bruno et al. SIGMOD 02)

  10. The TwigStack Algorithm (Overview) • Associate a stream with each NodeTest • The nodes in the stream satisfy the NodeTest • Asymptotically optimal among the algorithms that read the entire input • Scan the streams only once • Spend constant memory only on the nodes that are useful, i.e. participate in at least one solution • Optimality is guaranteed when the query contains descendant edges only • Suboptimal when the query contains some child edges • Memory is spent on possibly useless nodes
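
To make "associate a stream with each NodeTest" concrete, here is a small Python sketch that groups the example document's nodes by label. It assumes each stream is ordered by preorder number (document order), which is consistent with the streams listed on slide 14; `build_streams` and the tuple layout are illustrative names, not part of TwigStack itself.

```python
from collections import defaultdict

# (name, preorder, postorder, depth, label) for the example document,
# deliberately listed out of document order.
document = [
    ("c2", 8, 8, 2, "C"),
    ("a2", 3, 6, 3, "A"),
    ("b2", 4, 4, 4, "B"),
    ("a1", 1, 9, 1, "A"),
    ("c1", 5, 5, 4, "C"),
    ("b1", 2, 7, 2, "B"),
]


def build_streams(nodes):
    """Group nodes by label; each stream is sorted by preorder number."""
    streams = defaultdict(list)
    for node in sorted(nodes, key=lambda n: n[1]):
        streams[node[4]].append(node)
    return dict(streams)


streams = build_streams(document)
for label, stream in sorted(streams.items()):
    print(f"T{label} =", [name for name, *_ in stream])
```

Each stream is then consumed front to back through its own pointer, which is what the anchors pA, pB and pC do in the walk-through.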

  11. Problem Statement • Given a twig query and the associated streams, is it possible to find all solutions … • By using a single forward scan of the streams • By paying constant memory only to the useful nodes • By spending constant time on processing each node in the streams

  12. Main Results So Far • Assume the data streaming model… • There is no optimal holistic twig join algorithm – Theorem 1. • The evaluation of the twig queries is not memory bounded – Theorem 1. • By relaxing some restrictions on the data streaming model, we showed… • The lower bounds of such relaxed models are still quite high – Theorem 2 and Theorem 3.

  13. Outline • TwigStack By Examples • Offline Sorting • Multiple Scans • Discussion • Conclusion

  14. TwigStack By Examples • Query: //A (//B, //C) • Document: [Figure: document tree with nodes a1, b1, c2, a2, b2, c1] • Streams: TA = [a1, a2], TB = [b1, b2], TC = [c1, c2] • pA, pB, pC are the anchors pointing to the “top” of the streams • Useful nodes are stored in the main memory and can be read later

  15. TwigStack By Examples • Step 0 • pA -> a1, pB -> b1, pC -> c1 • a1 is useful, TA is advanced, pA -> a2 • Step 1 • b1 is useful, TB is advanced, pB -> b2 [Figure: document tree and the nodes stored in memory after each step]

  16. TwigStack By Examples • Step 2 • a2 is useful, TA is advanced, pA -> null • Step 3 • b2 is useful, TB is advanced, pB -> null [Figure: document tree and the nodes stored in memory after each step]

  17. TwigStack By Examples • Step 4 • c1 is useful, TC is advanced, pC -> c2 • Step 5 • Printing • Step 6 • c2 is useful, TC is advanced, pC -> null [Figure: document tree and the nodes stored in memory after each step]
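
As a cross-check of the walk-through, the solutions of //A (//B, //C) on the example document can be enumerated by brute force. This is deliberately not TwigStack (it compares every triple instead of scanning the streams once); it is only a reference sketch showing which nodes count as useful. The dictionary layout and the name `desc` are illustrative.

```python
from itertools import product

# (preorder, postorder, depth) per node, from the example document (slide 6).
nodes = {
    "a1": (1, 9, 1), "b1": (2, 7, 2), "a2": (3, 6, 3),
    "b2": (4, 4, 4), "c1": (5, 5, 4), "c2": (8, 8, 2),
}
streams = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}


def desc(ancestor, descendant):
    """True if `descendant` lies strictly inside `ancestor`'s (pre, post) range."""
    a_pre, a_post, _ = nodes[ancestor]
    d_pre, d_post, _ = nodes[descendant]
    return a_pre < d_pre and a_post > d_post


# A solution binds one node per query node and satisfies both // edges.
solutions = [
    (a, b, c)
    for a, b, c in product(streams["A"], streams["B"], streams["C"])
    if desc(a, b) and desc(a, c)
]
useful = sorted({name for solution in solutions for name in solution})

print(solutions)
print(useful)
```

Every node of the document appears in at least one of the five solutions, which agrees with the walk-through marking a1, b1, a2, b2, c1 and c2 as useful.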

  18. TwigStack By Examples • Query: //A (/B, /C) • Document: [Figure: document tree with nodes a1, b1, c2, a2, b2, c1] • Streams: TA = [a1, a2], TB = [b1, b2], TC = [c1, c2]

  19. TwigStack By Examples • Computation 1 • pA -> a1, pB -> b1, pC -> c1 • TA is advanced, pA -> a2, TB is advanced, pB -> b2 • a2 is useful (a1 is discarded) • Computation 2 • TC is advanced, pC -> c2 • a1 is useful • a2 is useless because c1 is discarded [Figure: document tree for each computation]
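
The difficulty with child edges can be seen by comparing what the stream heads show with what the full streams contain. The sketch below is an illustration under the node encoding of slide 6, not a part of TwigStack; `child`, `heads_match` and the list names are hypothetical.

```python
# (preorder, postorder, depth) per node, from the example document (slide 6).
nodes = {
    "a1": (1, 9, 1), "b1": (2, 7, 2), "a2": (3, 6, 3),
    "b2": (4, 4, 4), "c1": (5, 5, 4), "c2": (8, 8, 2),
}
TA, TB, TC = ["a1", "a2"], ["b1", "b2"], ["c1", "c2"]


def child(parent, kid):
    """True if `kid` is a descendant of `parent` exactly one level deeper."""
    p_pre, p_post, p_depth = nodes[parent]
    k_pre, k_post, k_depth = nodes[kid]
    return p_pre < k_pre and p_post > k_post and p_depth + 1 == k_depth


# Local view: only the current heads a1, b1, c1 are visible.
heads_match = child(TA[0], TB[0]) and child(TA[0], TC[0])

# Global view: all solutions of //A (/B, /C), found by brute force.
solutions = [
    (a, b, c)
    for a in TA for b in TB for c in TC
    if child(a, b) and child(a, c)
]

print(heads_match)  # False: c1 is not a child of a1
print(solutions)    # [('a1', 'b1', 'c2'), ('a2', 'b2', 'c1')]
```

The heads alone suggest a1 has no matching C child, yet a1 is useful because of c2, which appears later in TC. Reaching c2 means moving past c1, which a2 needs, so either computation on the slide ends up discarding a node that belongs to a solution unless extra nodes are buffered.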

  20. TwigStack By Examples • The Extreme Case • O(stream size) [Figure: extreme-case example; node labels a1, b1 to b4, c4 to c1]

  21. TwigStack Pseudo Code We’ve only walked through the red boxes

  22. Twig Queries over Streams • Theorem 1 • There is no optimal holistic twig join algorithm, no matter how the nodes are sorted. • Memory must be spent on possibly useless nodes • Given arbitrary streams, the memory requirement of exact algorithms is unbounded.

  23. Proof of Theorem 1 (Sketch) • Fix a document • Issue a few queries: //A//B, /A (/A, /A) and /A/A • Optimality implies certain constraints on the streams • No single stream can satisfy all the constraints

  24. Proof of Theorem 1 (cont’d) • Reduce a twig query to a select-project-join (SPJ) query • The twig query is memory bounded iff the SPJ query is memory bounded (Babcock et al., PODS 02)

  25. Outline • TwigStack By Examples • Offline Sorting • Multiple Scans • Discussion • Conclusion

  26. Variation 1: Offline Sorting • Pre-compute some intermediate results and collect the results in a scan • Allow offline sorting on the nodes and keep all the necessary sorted nodes • Allow the algorithm to scan the nodes in the correct orderings

  27. Motivation • The anchors are performing a depth-first traversal • But why? How about an ordering in which recursions are removed? [Figure: two orderings of the document tree with nodes a1, a2, b1, b2, c1, c2]

  28. The Lower Bound • The number of necessary sortings performed offline is high • Data redundancy • m is the number of structurally recursive labels in the doc. DTD; d is the doc. depth • The lower bound is m^d • We identify a restricted case in which DTDs help to lower the lower bound

  29. Variation 2: Multiple Scans • Massive storage (tapes, disks) naturally produces a stream of items. • Sequential scanning is a vital requirement of such storage • Only a small number of scans can be allowed, due to the high volume of data

  30. The Lower Bound • Allow P scans on the data streams. • The lower bound of P is high • d^t, where d is the doc. depth and t is the number of simple child-edge queries in a twig query

  31. Discussion • Bruno et al. assign memory to possibly useless nodes and illustrate by experiments that such a computation model is practical • No work on approximating the twig queries with provable guarantees • Constraints expressed in DTDs • Our work assumes a certain representation of the nodes: ancestor, descendant, parent, child relationships can be determined in O(1)

  32. Conclusion • The evaluation of twig queries in the data streaming context is tricky. • It is not memory bounded. • The optimal memory constraint cannot be satisfied in a single pass over the streams. • Need to look for other solutions.
