Evaluating window joins over unbounded streams
Evaluating Window Joins over Unbounded Streams. Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas ({jaewoo, naughton, viglas}@cs.wisc.edu), Univ. of Wisconsin-Madison. ICDE’03, Bangalore, India.


Evaluating Window Joins over Unbounded Streams

Jaewoo Kang Jeffrey F. Naughton

Stratis D. Viglas

{jaewoo, naughton, viglas}@cs.wisc.edu

Univ. of Wisconsin-Madison

ICDE’03 Bangalore, India


Outline of the talk

  • Introduction: Continuous Queries over Unbounded Streams

  • Measuring the Cost of Sliding Window Joins

  • On Maximizing the Efficiency of Processing Joins

  • Summary


Sliding Windows

  • Handling internal state is a big challenge.

  • Approximate answers

    • Sliding windows – toss out expired tuples

    • Synopses – resort to reduced answer precision


A Simple Sliding Window Query

[Figure: join of windows A and B; labels: λa, λb, λaTa, λbTb]

On arrival of a new tuple to window A

  • Scan window B and propagate matching tuples

  • Insert new tuple into window A

  • Invalidate all expired tuples in window A
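The three steps above can be sketched as a one-way nested-loop window join. This is a minimal illustration, not the authors' implementation: the `SlidingWindow` class, the `(timestamp, key, payload)` tuple layout, and the equality predicate are all assumptions, and the timestamp guard during the scan of window B is an added safeguard that the slide does not mention.

```python
from collections import deque


class SlidingWindow:
    """Time-based window holding (timestamp, key, payload) tuples in arrival order."""

    def __init__(self, span):
        self.span = span        # window length in time units (assumed)
        self.tuples = deque()

    def expire(self, now):
        # Tuples arrive in timestamp order, so expired ones sit at the front.
        while self.tuples and self.tuples[0][0] <= now - self.span:
            self.tuples.popleft()


def on_arrival_a(new_tuple, window_a, window_b):
    """Process one new tuple of stream A and return the join results it produces."""
    ts, key, _ = new_tuple
    # Step 1: scan window B and propagate matching tuples.
    # The timestamp check skips B tuples that have expired but not yet been removed.
    matches = [
        (new_tuple, b)
        for b in window_b.tuples
        if b[1] == key and b[0] > ts - window_b.span
    ]
    # Step 2: insert the new tuple into window A.
    window_a.tuples.append(new_tuple)
    # Step 3: invalidate all expired tuples in window A.
    window_a.expire(ts)
    return matches
```

Arrivals on stream B would be handled by the same function with the two windows swapped.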


Some interesting questions

  • How should we measure the efficiency of a sliding window join evaluation strategy?

  • Can a sliding window join algorithm take advantage of asymmetries in the speeds of the two input streams?


Interesting questions (Cont’d)

  • How should we allocate computing resources between the two windows to maximize join efficiency?

  • If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?



Outline of the talk

  • Introduction: Continuous Queries over Unbounded Streams

  • Measuring the Cost of Sliding Window Joins

  • On Maximizing the Efficiency of Processing Joins

  • Summary


Cost Model

[Figure: join of windows A and B; labels: λa, λb, λaTa, λbTb]

  • Unit-time basis cost model

  • Aggregate cost of processing tuples arriving in each window in a time unit


Cost Model (Cont’d)

[Figure: join of windows A and B; labels: λa, λb, λaTa, λbTb]

  • Cost formula can be divided into two independent groups, one for each input stream

  • Thus, can evaluate join algorithms for each join direction independently
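Written as a formula, the split into two independent groups might look as follows. This is a sketch under assumptions: the terms probe, insert, and invalidate are illustrative names for the three steps a newly arrived tuple triggers, not notation recovered from the slides.

```latex
\[
\begin{aligned}
C_{A \bowtie B} &= C_{A \ltimes B} + C_{B \ltimes A} \\
C_{A \ltimes B} &= \lambda_a \bigl(\mathrm{probe}(B) + \mathrm{insert}(A) + \mathrm{invalidate}(A)\bigr) \\
C_{B \ltimes A} &= \lambda_b \bigl(\mathrm{probe}(A) + \mathrm{insert}(B) + \mathrm{invalidate}(B)\bigr)
\end{aligned}
\]
```

Each one-way term depends on a single arrival rate and on the data structures chosen for that direction, which is why the two directions can be evaluated separately.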


Cost of One-way NLJ

  • P(D) - cost of accessing one tuple in data structure D during search operation

  • I(D) - cost of accessing one tuple in data structure D during update operation

  • Total number of tuples processed in a time unit multiplied by the tuple access cost
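As a rough illustration of "tuples processed per time unit times access cost", the function below estimates the one-way NLJ cost. The slide's exact formula is not in the transcript, so the shape used here is an assumption: every arrival scans all of window B at cost P per tuple, and pays one insert plus one invalidation at cost I each. The function and parameter names are hypothetical.

```python
def nlj_one_way_cost(rate_a, window_b, p_cost, i_cost):
    """Estimated cost per time unit of joining stream A's arrivals against window B.

    rate_a   -- arrival rate of stream A (tuples per time unit)
    window_b -- number of tuples held in window B
    p_cost   -- P(D): cost of touching one tuple during a search
    i_cost   -- I(D): cost of touching one tuple during an update
    """
    probe = window_b * p_cost      # nested loop: scan every tuple in window B
    maintain = 2 * i_cost          # assumed: one insert and one invalidation per arrival
    return rate_a * (probe + maintain)


# Illustrative inputs only: 100 tuples per time unit against a 5000-tuple window.
example_cost = nlj_one_way_cost(rate_a=100, window_b=5000, p_cost=3e-4, i_cost=1e-4)
```

The probe term dominates for any sizeable window, which is what makes the slower-growing search structures on the following slides attractive.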


Cost of One-way HJ

  • |B| -- number of hash buckets in window B

  • B/|B| -- average number of tuples in a hash bucket

  • Implement hash bucket to preserve tuple arrival order – avoid invalidation overhead.
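One way to read the last point: if each bucket keeps its tuples in arrival order, expired tuples always sit at the front of the bucket and can be dropped cheaply without a separate invalidation pass. The sketch below illustrates that idea with one deque per bucket; the `HashWindow` class and its lazy expiry during probing are assumptions, not the authors' code.

```python
from collections import deque


class HashWindow:
    """Hash-partitioned window; each bucket preserves tuple arrival order."""

    def __init__(self, num_buckets, span):
        self.buckets = [deque() for _ in range(num_buckets)]
        self.span = span    # window length in time units (assumed)

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, ts, key, payload):
        # Appending keeps the bucket sorted by arrival time.
        self._bucket(key).append((ts, key, payload))

    def probe(self, key, now):
        bucket = self._bucket(key)
        # Expired tuples are at the front, so they are removed in passing.
        while bucket and bucket[0][0] <= now - self.span:
            bucket.popleft()
        # On average a probe touches B/|B| tuples instead of the whole window.
        return [t for t in bucket if t[1] == key]
```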


Cost of One-way T-tree INLJ

  • N – size of a T-tree node (number of tuples)

  • B/N – total number of nodes in a T-tree
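To give a feel for these quantities, the helper below computes the node count B/N and a lower bound on the tree height. It assumes every node is full and the tree is perfectly balanced; a real T-tree is only approximately balanced, so this is an estimate, and the function name is hypothetical.

```python
import math


def t_tree_shape(window_size, node_size):
    """Return (node_count, min_height) for a T-tree over window_size tuples."""
    nodes = math.ceil(window_size / node_size)    # B/N, rounded up
    # A perfectly balanced binary tree with n nodes has ceil(log2(n + 1)) levels.
    min_height = math.ceil(math.log2(nodes + 1))
    return nodes, min_height


nodes, height = t_tree_shape(window_size=5000, node_size=100)
```

A search then visits at most a handful of nodes on the way down, followed by a lookup inside a single node of N tuples.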


Implementation

  • Implemented:

    • Four join algorithms: NLJ, HJ, BJ, and TJ.

    • Asymmetric join operator

    • Stream emulator

  • System:

    • Java HotSpot VM 1.4

    • AMD Athlon XP 1533 MHz, 1 GB memory

    • Windows XP Professional


Fitting Parameters in the Model

  • Process 60 seconds' worth of tuples without intermittent delays, at 20 different points with increasing workload rates.

  • Then, equate the measured values with the cost formula, and solve the equation.

  • Hash bucket size = 10, T-tree node size = 100 used

  • P(N) = 3×10⁻⁴, P(H) = 5.5×10⁻⁴

  • P(BT) = 2.6×10⁻⁴, P(TT) = 2.6×10⁻⁴

  • I(N) = 1×10⁻⁴, I(H) = 7.8×10⁻⁴

  • I(BT) = 2.6×10⁻⁴, I(TT) = 2.7×10⁻⁴
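The "equate and solve" step can be illustrated with an ordinary least-squares fit. The slides do not say how the equations were solved, so least squares is an assumed stand-in, and the model shape (measured cost ≈ x_p·P + x_i·I) and the synthetic measurements below are hypothetical, not the authors' data.

```python
def fit_two_params(samples):
    """Fit P and I in cost = x_p * P + x_i * I by solving the 2x2 normal equations.

    samples -- list of (x_p, x_i, measured_cost) triples
    """
    spp = sum(xp * xp for xp, _, _ in samples)
    sii = sum(xi * xi for _, xi, _ in samples)
    spi = sum(xp * xi for xp, xi, _ in samples)
    spc = sum(xp * c for xp, _, c in samples)
    sic = sum(xi * c for _, xi, c in samples)
    det = spp * sii - spi * spi
    if det == 0:
        raise ValueError("measurements are collinear; vary the workload more")
    p = (spc * sii - spi * sic) / det
    i = (spp * sic - spi * spc) / det
    return p, i


# Synthetic measurements generated from P = 3e-4 and I = 1e-4 (illustration only).
# x_p = rate * window size (tuples probed), x_i = 2 * rate (assumed update touches).
workloads = [(100, 1000), (200, 2000), (300, 500)]
samples = [(r * w, 2 * r, r * w * 3e-4 + 2 * r * 1e-4) for r, w in workloads]
fitted_p, fitted_i = fit_two_params(samples)
```

Under this assumed model the window size has to vary across measurements; with a fixed window the two regressors are proportional and the parameters cannot be separated.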


Outline of the talk

  • Introduction: Continuous Queries over Unbounded Streams

  • Measuring the Cost of Sliding Window Joins

  • On Maximizing the Efficiency of Processing Joins

  • Summary


Interesting questions

  • How should we measure the efficiency of a sliding window join evaluation strategy?

  • Can a sliding window join algorithm take advantage of asymmetries in the speeds of the two input streams?


Taking Advantage of Asymmetry

  • There are cases where an asymmetric combination of join algorithms outperforms symmetric counterparts!

    • E.g. for some A, B


Join Cost Estimation using Cost Model

  • Size of window A = 5000

  • Size of window B = 5000

  • Five winning combinations: TN, TH, HH, HT, NT



Measured Join Cost (CPU Time)

  • A=5000, B=5000

  • Memory utilization: HJ (h=10) consumed 5% more than TJ (n=100).

  • Same five winners: TN, TH, HH, HT, NT

  • Cost model prediction was accurate for both overall shape and crossover points.

  • What if we increase window A and decrease window B?

    • (e.g. A=7000, B=3000 as opposed to current 5000:5000)


Cross-over Point TN-TH

[Figure: join of windows A and B; labels: λa, λb, λaTa, λbTb]

  • The TN-TH crossover point depends only on the size of window B.

  • TN-TH = 0.0094 (B=500), meaning TNJ outperforms THJ only when stream B is more than 106 times faster than stream A.

  • TN-TH = 0.0555 (B=100), meaning stream B must be more than 18 times faster than stream A.
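The "times faster" figures follow from taking the reciprocal of the crossover value, assuming that value is the rate ratio λa/λb at which the two combinations cost the same. A quick check of the arithmetic, with a hypothetical helper name:

```python
def speed_ratio_needed(crossover):
    """Convert a crossover value (assumed to be λa/λb) into the required λb/λa."""
    return 1.0 / crossover


for window_b, crossover in [(500, 0.0094), (100, 0.0555)]:
    print(window_b, round(speed_ratio_needed(crossover), 1))
# prints:
# 500 106.4
# 100 18.0
```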


Cross-over Point TH-HH

  • The TH-HH crossover point depends only on the size of window A.

  • If the size of window A increases, the TH-HH crossover point moves to the left, and vice versa.


Join Performance

  • A=9500, B=500, λa=2, λb=998

  • A=7000, B=3000, λa=800,λb=200


Interesting questions (Cont’d)

  • How should we allocate computing resources between the two windows to maximize join efficiency?

  • If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?


Resource Allocation & Join Performance

  • Focus on cases where system resources are insufficient to fully support queries and workloads.

    • Input streams are simply too fast to keep up with.

    • The join operator is expensive to evaluate, so its service rate is lower than the input rates.

    • System memory cannot hold both windows.


Resource Allocation & Join Performance (Cont’d)

  • Approximate answers may be acceptable

    • E.g. query involving aggregate (e.g. average) over join

  • Question is how to maximize the accuracy of the approximate answers, given the limited resources.

  • We use the insight that larger samples produce better answers

    • Goal is to maximize the number of join result tuples

    • Care must be taken to ensure that the result produced is statistically comparable to a random sample of the full join result.
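The "larger samples produce better answers" insight can be illustrated with the standard error of a mean, which shrinks roughly as one over the square root of the sample size. The example below uses synthetic values standing in for join results; it is a generic statistical illustration, not an experiment from the talk, and it presumes the sample is random, which is exactly the caveat in the last bullet.

```python
import math
import random
import statistics


def standard_error(sample):
    """Standard error of the sample mean: stdev / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


random.seed(0)  # fixed seed so the illustration is repeatable
# Hypothetical attribute values of join result tuples.
values = [random.gauss(50.0, 10.0) for _ in range(10_000)]

small_sample = values[:100]      # few join results produced
large_sample = values            # many join results produced

se_small = standard_error(small_sample)
se_large = standard_error(large_sample)
```

With 100 times more result tuples the standard error of the estimated average drops by about a factor of ten, which is why the number of join results is a reasonable proxy for answer accuracy.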


Limited Computing Resources

  • λa=800, λb=200

  • A=100, B=200

  • =0.01, μ=100

Window Join Output Rate :

w/ Effective Rates =


Limited Memory Resources

  • λa=10, λb=50

  • M=1000, =0.005

Window Join Output Rate =
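A minimal sketch of the memory trade-off, under two assumptions that the slide text does not confirm: the output rate takes the common form selectivity × (λa·|B| + λb·|A|), and the unlabeled value 0.005 above is the join selectivity. The function names and the 100-tuple step size are hypothetical.

```python
def output_rate(rate_a, rate_b, size_a, size_b, selectivity):
    """Assumed output rate: each A arrival probes window B, each B arrival probes window A."""
    return selectivity * (rate_a * size_b + rate_b * size_a)


def best_memory_split(rate_a, rate_b, memory, selectivity, step=100):
    """Try splits of `memory` tuples between the two windows and keep the best one."""
    candidates = range(0, memory + 1, step)
    best_a = max(
        candidates,
        key=lambda a: output_rate(rate_a, rate_b, a, memory - a, selectivity),
    )
    return best_a, memory - best_a


# Parameters from the slide: λa=10, λb=50, M=1000; 0.005 is assumed to be the selectivity.
split = best_memory_split(rate_a=10, rate_b=50, memory=1000, selectivity=0.005)
rate = output_rate(10, 50, split[0], split[1], 0.005)
```

Because the assumed rate is linear in the window sizes, the best split is an extreme one: all memory goes to the window that is probed by the faster stream, here window A.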


Limited Memory & Computing Resources

  • μ=10, M=100

  • =0.01

  • The best performers are the groups that allocate maximum computing resources to one stream and maximum memory to the other.


Summary

  • Introduced a unit-time-basis cost model and validated it experimentally.

  • Extended the traditional join framework to include asymmetric combinations of join algorithms.

  • Investigated resource allocation strategies for improving the accuracy of approximate answers.

  • Developed a powerful optimization framework for sliding window join queries by addressing these issues in a unified manner.

