
Evaluating Window Joins over Unbounded Streams

Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas. {jaewoo, naughton, viglas}@cs.wisc.edu. Univ. of Wisconsin-Madison. ICDE’03, Bangalore, India.


Presentation Transcript


  1. Evaluating Window Joins over Unbounded Streams Jaewoo Kang Jeffrey F. Naughton Stratis D. Viglas {jaewoo, naughton, viglas}@cs.wisc.edu Univ. of Wisconsin-Madison ICDE’03 Bangalore, India

  2. Outline of the talk • Introduction: Continuous Queries over Unbounded Streams • Measuring the Cost of Sliding Window Joins • On Maximizing the Efficiency of Processing Joins • Summary

  3. Sliding Windows • Handling internal state is a big challenge. • Approximate answers • Sliding windows – toss out expired tuples • Synopses – resort to reduced answer precision

  4. A Simple Sliding Window Query [Figure: windows A and B; labels λa, λb, λaTa, λbTb] On arrival of a new tuple to window A: • Scan window B and propagate matching tuples • Insert new tuple into window A • Invalidate all expired tuples in window A
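The three steps on slide 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Java implementation: the class name `SlidingWindowJoin`, the time-based window spans, and the equality match on a single `key` are all assumptions made for the example. It also assumes timestamps arrive in nondecreasing order, and it skips already-expired tuples of the opposite window during the scan, a detail the slide does not spell out.

```python
from collections import deque


class SlidingWindowJoin:
    """Hypothetical symmetric nested-loops join over two time-based windows."""

    def __init__(self, span_a, span_b):
        # Each window keeps (timestamp, key) pairs in arrival order.
        self.windows = {"A": deque(), "B": deque()}
        # How long a tuple stays valid in each window (assumed time-based).
        self.spans = {"A": span_a, "B": span_b}

    def on_arrival(self, side, timestamp, key):
        other = "B" if side == "A" else "A"

        # Step 1: scan the opposite window and propagate matching tuples.
        # Tuples that are already too old are skipped (an assumption).
        cutoff = timestamp - self.spans[other]
        matches = [ts for ts, k in self.windows[other]
                   if k == key and ts > cutoff]

        # Step 2: insert the new tuple into its own window.
        own = self.windows[side]
        own.append((timestamp, key))

        # Step 3: invalidate expired tuples in its own window. Because
        # tuples are stored in arrival order, they sit at the front.
        while own and own[0][0] <= timestamp - self.spans[side]:
            own.popleft()

        return matches
```

Calling `on_arrival("A", t, key)` returns the timestamps of the matching tuples currently valid in window B; the same method handles arrivals on stream B by symmetry.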

  5. Some interesting questions • How should we measure the efficiency of a sliding window join evaluation strategy? • Can a sliding window join algorithm take advantage of asymmetries in the two input stream speeds?

  6. Interesting questions (Cont’d) • How should we allocate computing resources between the two windows to maximize join efficiency? • If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?

  7. Interesting questions • How should we measure the efficiency of a sliding window join evaluation strategy? • Can a sliding window join algorithm take advantage of asymmetries in the two input stream speeds?

  8. Outline of the talk • Introduction: Continuous Queries over Unbounded Streams • Measuring the Cost of Sliding Window Joins • On Maximizing the Efficiency of Processing Joins • Summary

  9. Cost Model [Figure: windows A and B; labels λa, λb, λaTa, λbTb] • Unit-time basis cost model • Aggregate cost of processing tuples arriving in each window in a time unit

  10. Cost Model (Cont’d) [Figure: windows A and B; labels λa, λb, λaTa, λbTb] • Cost formula can be divided into two independent groups, one for each input stream • Thus, can evaluate join algorithms for each join direction independently

  11. Cost of One-way NLJ • P(D) - cost of accessing one tuple in data structure D during search operation • I(D) - cost of accessing one tuple in data structure D during update operation • Total number of tuples processed in a time unit multiplied by the tuple access cost
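The slide's formula itself is not preserved in this transcript, so the following is only a sketch of how "tuples processed per time unit multiplied by the tuple access cost" might be computed. The decomposition is an assumption: each arriving tuple scans the whole opposite window at cost P per tuple, and in steady state each arrival triggers one insert and one invalidation at cost I each. The function name and the arrival rate of 100 are hypothetical; P(N) = 3×10⁻⁴ and I(N) = 1×10⁻⁴ are the fitted values reported on slide 15.

```python
def one_way_nlj_cost(arrival_rate, opposite_window_size, probe_cost, update_cost):
    """Hypothetical unit-time cost of one join direction under nested loops.

    This is an assumed decomposition, not the formula from the slide.
    """
    # Search: every arriving tuple scans the entire opposite window.
    search = arrival_rate * opposite_window_size * probe_cost
    # Update: one insert plus one invalidation per arrival (steady state).
    update = 2 * arrival_rate * update_cost
    return search + update


# Hypothetical workload: 100 tuples per time unit probing a 5000-tuple window.
cost = one_way_nlj_cost(arrival_rate=100, opposite_window_size=5000,
                        probe_cost=3e-4, update_cost=1e-4)
print(cost)
```

Under these assumptions the search term dominates by several orders of magnitude, which is why the choice of probe structure for the opposite window matters so much.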

  12. Cost of One-way HJ • |B| -- # of hash buckets in window B • B/|B| -- # of tuples in a hash bucket • Implement each hash bucket so that it preserves tuple arrival order, which avoids invalidation overhead.
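One way to read the last bullet of slide 12: if every bucket keeps its tuples in arrival order, expired tuples always sit at the front of the bucket and can be dropped cheaply while the bucket is being probed, so no separate invalidation pass is needed. The sketch below illustrates that idea. The class `OrderedHashWindow`, the lazy expiry during `probe`, and the time-based span are assumptions for illustration, and timestamps are assumed to be nondecreasing.

```python
from collections import deque


class OrderedHashWindow:
    """Hypothetical hash-partitioned window whose buckets keep arrival order."""

    def __init__(self, num_buckets, span):
        self.buckets = [deque() for _ in range(num_buckets)]
        self.span = span  # how long a tuple stays valid (assumed time-based)

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, timestamp, key):
        # Appending keeps each bucket sorted by arrival time.
        self._bucket(key).append((timestamp, key))

    def probe(self, now, key):
        bucket = self._bucket(key)
        # Expired tuples are at the front, so they are popped on the way in.
        while bucket and bucket[0][0] <= now - self.span:
            bucket.popleft()
        # Only the surviving tuples of this one bucket are scanned.
        return [ts for ts, k in bucket if k == key]
```

With |B| buckets, a probe touches roughly B/|B| tuples instead of the whole window, matching the quantities defined on the slide.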

  13. Cost of One-way T-tree INLJ • N – size of a T-tree node (# of tuples) • B/N – total # of nodes in a T-tree

  14. Implementation • Implemented: • Four join algorithms: NLJ, HJ, BJ, and TJ. • Asymmetric join operator • Stream emulator • System: • Java HotSpot VM 1.4 • AMD Athlon XP 1533 MHz, 1 GB memory • Windows XP Professional

  15. Fitting Parameters in the Model • Process 60 seconds' worth of tuples without intermittent delays, at 20 different points with increasing workload rates. • Then equate the measured values with the cost formula and solve the equation. • Hash bucket size = 10, T-tree node size = 100 used • P(N) = 3×10⁻⁴, P(H) = 5.5×10⁻⁴ • P(BT) = 2.6×10⁻⁴, P(TT) = 2.6×10⁻⁴ • I(N) = 1×10⁻⁴, I(H) = 7.8×10⁻⁴ • I(BT) = 2.6×10⁻⁴, I(TT) = 2.7×10⁻⁴
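The transcript does not say how the equations were solved. One plausible approach, shown here purely as an assumption, is an ordinary least-squares fit of a two-parameter linear cost model, cost = P·x + I·y, where x and y are the numbers of tuples accessed during search and update. The function name, the linear model, and the synthetic noise-free samples below are all hypothetical; the samples are generated from known parameters only to show that the fit recovers them.

```python
def fit_two_costs(samples):
    """Least-squares fit of cost = p * x + i * y (a hypothetical model).

    samples: list of (search_accesses, update_accesses, measured_cost).
    Solves the 2x2 normal equations with Cramer's rule.
    """
    sxx = sum(x * x for x, _, _ in samples)
    sxy = sum(x * y for x, y, _ in samples)
    syy = sum(y * y for _, y, _ in samples)
    sxc = sum(x * c for x, _, c in samples)
    syc = sum(y * c for _, y, c in samples)

    det = sxx * syy - sxy * sxy
    if det == 0:
        # x and y are proportional in every sample: p and i cannot be separated.
        raise ValueError("samples do not vary independently")

    p = (sxc * syy - sxy * syc) / det
    i = (sxx * syc - sxy * sxc) / det
    return p, i


# Synthetic samples: vary both the rate and the window size so that
# search and update accesses are not proportional to each other.
true_p, true_i = 3e-4, 1e-4
samples = []
for rate in (10, 20, 30):
    for window in (1000, 5000):
        x, y = rate * window, 2 * rate
        samples.append((x, y, true_p * x + true_i * y))

p, i = fit_two_costs(samples)
print(p, i)
```

Note that the workload points must vary search and update accesses independently; if only the arrival rate changes, the two parameters are not separately identifiable from this model.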

  16. Outline of the talk • Introduction: Continuous Queries over Unbounded Streams • Measuring the Cost of Sliding Window Joins • On Maximizing the Efficiency of Processing Joins • Summary

  17. Interesting questions • How should we measure the efficiency of a sliding window join evaluation strategy? • Can a sliding window join algorithm take advantage of asymmetries in the two input stream speeds?

  18. Taking Advantage of Asymmetry • There are cases where an asymmetric combination of join algorithms outperforms symmetric counterparts! • E.g. for some A, B

  19. Join Cost Estimation using Cost Model • Size of window A = 5000 • Size of window B = 5000 • Five winning combinations: TN, TH, HH, HT, NT

  25. Measured Join Cost (CPU Time) • A=5000, B=5000 • Memory utilization: HJ (h=10) consumed 5% more than TJ (n=100). • Same five winners: TN, TH, HH, HT, NT • Cost model prediction was accurate for both overall shape and crossover points. • What if we increase window A and decrease window B? • (e.g. A=7000, B=3000 as opposed to current 5000:5000)

  26. Cross-over Point TN-TH [Figure: windows A and B; labels λa, λb, λaTa, λbTb] • The TN-TH crossover point depends only on the size of window B. • TN-TH = 0.0094 (B=500), meaning TNJ will outperform THJ when stream B is more than 106 times faster than stream A. • TN-TH = 0.0555 (B=100), i.e., more than 18 times faster.
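The "106 times" and "18 times" figures on slide 26 are consistent with reading the crossover value as the rate ratio λa/λb, so that the speed factor of stream B over stream A is its reciprocal. That reading is an assumption (the slide does not define the ratio explicitly), and `speed_factor` is a hypothetical helper used only to check the arithmetic.

```python
def speed_factor(crossover_ratio):
    """Reciprocal of the crossover value, assuming it denotes λa/λb."""
    return 1.0 / crossover_ratio


# Crossover values reported on the slide for two sizes of window B.
for window_b, ratio in ((500, 0.0094), (100, 0.0555)):
    print(window_b, round(speed_factor(ratio)))  # 500 -> 106, 100 -> 18
```

A smaller window B moves the crossover toward more balanced rates, which matches the slide's point that the crossover depends only on B.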

  27. Cross-over Point TH-HH • The TH-HH crossover point depends only on the size of window A. • If the size of window A increases, the TH-HH crossover point moves to the left, and vice versa.

  28. Join Performance • A=9500, B=500, λa=2, λb=998 • A=7000, B=3000, λa=800,λb=200

  29. Interesting questions (Cont’d) • How should we allocate computing resources between the two windows to maximize join efficiency? • If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?

  30. Resource Allocation & Join Performance • Focus on cases where system resources are insufficient to fully support queries and workloads. • Input streams are simply too fast to keep up with. • The join operator is expensive to evaluate, so its service rate is lower than the input rates. • System memory cannot hold both windows.

  31. Resource Allocation & Join Performance (Cont’d) • Approximate answers may be acceptable • E.g. a query involving an aggregate (e.g. average) over a join • The question is how to maximize the accuracy of the approximate answers, given the limited resources. • We use the insight that larger samples produce better answers • The goal is to maximize the # of join result tuples • Care must be taken to ensure that the result produced is statistically comparable to a random sample of the full join result.

  32. Limited Computing Resources • λa=800, λb=200 • A=100, B=200 • =0.01, μ=100 • Window join output rate, with effective rates [formula not preserved in the transcript]

  33. Limited Memory Resources • λa=10, λb=50 • M=1000, =0.005 • Window join output rate [formula not preserved in the transcript]

  34. Limited Memory & Computing Resources • μ=10, M=100 • =0.01 • The best performers are the groups that allocate maximum computing resources to one stream and maximum memory to the other.

  35. Summary • Introduced unit-time basis cost model and experimentally validated it. • Extended traditional join framework to include asymmetric combinations of join algorithms. • Investigated resource allocation strategies for improving the accuracy of approximate answers. • Developed powerful optimization framework for sliding window join queries by addressing these issues in a unified manner.
