Evaluating Window Joins over Unbounded Streams

Evaluating Window Joins over Unbounded Streams. Jaewoo Kang Jeffrey F. Naughton Stratis D. Viglas {jaewoo, naughton, viglas}@cs.wisc.edu Univ. of Wisconsin-Madison. ICDE’03 Bangalore, India. Outline of the talk. Introduction: Continuous Queries over Unbounded Streams


Evaluating Window Joins over Unbounded Streams

Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas

{jaewoo, naughton, viglas}@cs.wisc.edu

Univ. of Wisconsin-Madison

ICDE’03 Bangalore, India

Outline of the talk
  • Introduction: Continuous Queries over Unbounded Streams
  • Measuring the Cost of Sliding Window Joins
  • On Maximizing the Efficiency of Processing Joins
  • Summary
Sliding Windows
  • Handling internal states is a big challenge.
  • Approximate answers
    • Sliding windows – toss out expired tuples
    • Synopses – resort to reduced answer precision
A Simple Sliding Window Query

[Figure: windows A and B over the two input streams; labels: λa, λb (arrival rates) and λaTa, λbTb (window sizes)]

On arrival of a new tuple to window A

  • Scan window B and propagate matching tuples
  • Insert new tuple into window A
  • Invalidate all expired tuples in window A
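
The three steps above can be sketched as follows. This is a minimal illustration, assuming time-based windows, an equality join on a key, and non-decreasing timestamps. The class name `SlidingWindowJoin`, the tuple layout `(timestamp, key, payload)`, and the timestamp check during the scan are assumptions made for the example, not details from the talk.

```python
from collections import deque


class SlidingWindowJoin:
    """Symmetric sliding window join with nested-loop probing (illustrative)."""

    def __init__(self, span_a, span_b):
        # Each window keeps tuples in arrival order, oldest first.
        self.windows = {"A": deque(), "B": deque()}
        self.spans = {"A": span_a, "B": span_b}

    def on_arrival(self, side, timestamp, key, payload):
        other = "B" if side == "A" else "A"
        cutoff = timestamp - self.spans[other]
        # 1. Scan the opposite window and propagate matching tuples.
        #    Assumption: tuples that expired but were not yet removed
        #    are skipped by checking their timestamp during the scan.
        matches = [
            (payload, p) if side == "A" else (p, payload)
            for (ts, k, p) in self.windows[other]
            if k == key and ts > cutoff
        ]
        # 2. Insert the new tuple into its own window.
        own = self.windows[side]
        own.append((timestamp, key, payload))
        # 3. Invalidate all expired tuples in its own window.
        while own and own[0][0] <= timestamp - self.spans[side]:
            own.popleft()
        return matches
```

Result pairs are always ordered as (A payload, B payload), regardless of which stream the new tuple arrived on.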
Some interesting questions
  • How should we measure the efficiency of a sliding window join evaluation strategy?
  • Can a sliding window join algorithm take advantage of asymmetries in the two input stream speeds?
Interesting questions (Cont’d)
  • How should we allocate computing resources between the two windows to maximize join efficiency?
  • If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?
Outline of the talk
  • Introduction: Continuous Queries over Unbounded Streams
  • Measuring the Cost of Sliding Window Joins
  • On Maximizing the Efficiency of Processing Joins
  • Summary
Cost Model
  • Unit-time basis cost model
  • Aggregate cost of processing tuples arriving in each window in a time unit
Cost Model (Cont’d)
  • Cost formula can be divided into two independent groups, one for each input stream
  • Thus, can evaluate join algorithms for each join direction independently
Cost of One-way NLJ
  • P(D) - cost of accessing one tuple in data structure D during search operation
  • I(D) - cost of accessing one tuple in data structure D during update operation
  • Total number of tuples processed in a time unit multiplied by the tuple access cost
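
A rough sketch of how such a one-way cost could be computed. The slide's formula is not preserved in the transcript, so the decomposition below (search cost plus update cost) and the function name `one_way_nlj_cost` are assumptions; only P(D) and I(D) come from the slide.

```python
def one_way_nlj_cost(probe_rate, update_rate, window_size, p_cost, i_cost):
    """Unit-time cost of one join direction under an assumed NLJ model.

    probe_rate  -- tuples per time unit arriving on the probing stream
    update_rate -- tuples per time unit inserted into (and, in steady
                   state, expired from) the probed window
    window_size -- number of tuples held in the probed window
    p_cost      -- P(D), cost of accessing one tuple during a search
    i_cost      -- I(D), cost of accessing one tuple during an update
    """
    # Every probing tuple scans the whole window in a nested-loop join.
    tuples_searched = probe_rate * window_size
    # Assumption: each update touches one tuple for the insert and one
    # for the matching invalidation.
    tuples_updated = 2 * update_rate
    return tuples_searched * p_cost + tuples_updated * i_cost
```

Because the function covers a single direction, the cost of a full join under this sketch would be the sum of two calls, one per input stream.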
Cost of One-way HJ
  • |B|: number of hash buckets in window B
  • B/|B|: number of tuples in a hash bucket
  • Implement each hash bucket so that it preserves tuple arrival order, which avoids invalidation overhead.
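
One way to read the last point: if each bucket keeps its tuples in arrival order, the oldest tuple of a bucket is always at the front, so expired tuples can be dropped from the front of the bucket being probed instead of being searched for. A minimal sketch under that reading, assuming time-based windows and non-decreasing timestamps; `HashWindow` and its methods are hypothetical names.

```python
from collections import deque


class HashWindow:
    """Time-based window indexed by hash buckets that keep arrival order."""

    def __init__(self, num_buckets, span):
        self.buckets = [deque() for _ in range(num_buckets)]
        self.span = span

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, timestamp, key, payload):
        # Appending keeps each bucket sorted by arrival time.
        self._bucket(key).append((timestamp, key, payload))

    def probe(self, now, key):
        bucket = self._bucket(key)
        # Expired tuples sit at the front, so expiry is a few pops on
        # the bucket being probed rather than a separate pass.
        while bucket and bucket[0][0] <= now - self.span:
            bucket.popleft()
        return [p for (_, k, p) in bucket if k == key]
```

A deque gives constant-time removal at the front, which is what makes the lazy expiry cheap here.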
Cost of One-way T-tree INLJ
  • N – size of a T-tree node (number of tuples)
  • B/N – total number of nodes in a T-tree
Implementation
  • Implemented:
    • Four join algorithms: NLJ, HJ, BJ, and TJ (nested-loop, hash, B-tree, and T-tree joins).
    • Asymmetric join operator
    • Stream emulator
  • System:
    • Java HotSpot VM 1.4
    • AMD Athlon XP 1533 MHz, 1 GB memory
    • Windows XP Professional
Fitting Parameters in the Model
  • Process 60 seconds' worth of tuples without intermittent delays, at 20 different points with increasing workload rates.
  • Then equate the measured values with the cost formula and solve the resulting equations.
  • Hash bucket size = 10 and T-tree node size = 100 were used.
  • P(N) = 3×10⁻⁴, P(H) = 5.5×10⁻⁴
  • P(BT) = 2.6×10⁻⁴, P(TT) = 2.6×10⁻⁴
  • I(N) = 1×10⁻⁴, I(H) = 7.8×10⁻⁴
  • I(BT) = 2.6×10⁻⁴, I(TT) = 2.7×10⁻⁴
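
The "equate and solve" step can be illustrated as below, assuming a cost formula that is linear in the unknowns: cost = s·P + u·I, where s and u are the numbers of tuples searched and updated per time unit. Two measurements then give a 2×2 linear system. The linear form and the function name `fit_p_and_i` are illustrative assumptions, not the authors' procedure.

```python
def fit_p_and_i(m1, m2):
    """Solve s*P + u*I = cost for (P, I) from two measurements.

    Each measurement is a tuple (s, u, cost): tuples searched per time
    unit, tuples updated per time unit, and the measured cost.
    """
    (s1, u1, c1), (s2, u2, c2) = m1, m2
    det = s1 * u2 - s2 * u1
    if det == 0:
        raise ValueError("measurements are linearly dependent")
    # Cramer's rule for the 2x2 system.
    p = (c1 * u2 - c2 * u1) / det
    i = (s1 * c2 - s2 * c1) / det
    return p, i
```

With more than two workload points, such as the 20 points mentioned on the slide, a least-squares fit over all of them would be the natural generalization of this sketch.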
Outline of the talk
  • Introduction: Continuous Queries over Unbounded Streams
  • Measuring the Cost of Sliding Window Joins
  • On Maximizing the Efficiency of Processing Joins
  • Summary
Interesting questions
  • How should we measure the efficiency of a sliding window join evaluation strategy?
  • Can a sliding window join algorithm take advantage of asymmetries in the two input stream speeds?
Taking Advantage of Asymmetry
  • There are cases where an asymmetric combination of join algorithms outperforms symmetric counterparts!
    • E.g. for some A, B
Join Cost Estimation using Cost Model
  • Size of window A = 5000
  • Size of window B = 5000
  • Five winning combinations: TN, TH, HH, HT, NT
Measured Join Cost (CPU Time)
  • A=5000, B=5000
  • Memory utilization: HJ (h=10) consumed 5% more than TJ (n=100).
  • Same five winners: TN, TH, HH, HT, NT
  • Cost model prediction was accurate for both overall shape and crossover points.
  • What if we increase window A and decrease window B?
      • (e.g. A=7000, B=3000 as opposed to current 5000:5000)
Cross-over Point TN-TH
  • The TN-TH crossover point depends only on the size of window B.
  • TN-TH = 0.0094 (B=500), meaning TNJ will outperform THJ when stream B is more than 106 times faster than stream A.
  • TN-TH = 0.0555 (B=100), i.e. when stream B is more than 18 times faster than stream A.
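
The speed ratios quoted above are the reciprocals of the crossover values. The short check below assumes the crossover point is expressed as the rate ratio λa/λb, an interpretation consistent with the numbers on the slide but not stated on it explicitly; `speed_ratio` is a hypothetical helper name.

```python
def speed_ratio(crossover):
    """Return how many times faster stream B must be than stream A."""
    return 1.0 / crossover


for window_b, crossover in [(500, 0.0094), (100, 0.0555)]:
    ratio = speed_ratio(crossover)
    print(f"B={window_b}: stream B must be over {ratio:.0f}x faster")
```

Rounded to whole numbers, the two ratios come out as 106 and 18, matching the figures on the slide.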
Cross-over Point TH-HH
  • The TH-HH crossover point depends only on the size of window A.
  • If the size of window A increases, the TH-HH crossover point moves to the left, and vice versa.
Join Performance
  • A=9500, B=500, λa=2, λb=998
  • A=7000, B=3000, λa=800,λb=200
Interesting questions (Cont’d)
  • How should we allocate computing resources between the two windows to maximize join efficiency?
  • If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?
Resource Allocation & Join Performance
  • Focus on cases where system resources are insufficient to fully support queries and workloads.
    • Input streams are simply too fast to keep up with.
    • The join operator is expensive to evaluate, so its service rate is lower than the input rates.
    • System memory cannot hold both windows.
Resource Allocation & Join Performance (Cont’d)
  • Approximate answers may be acceptable
    • E.g. query involving aggregate (e.g. average) over join
  • Question is how to maximize the accuracy of the approximate answers, given the limited resources.
  • We use the insight that larger samples produce better answers
    • Goal is to maximize the number of join result tuples
    • Care must be taken to ensure that the result produced is statistically comparable to a random sample of the full join result.
Limited Computing Resources
  • λa=800, λb=200
  • A=100, B=200
  • =0.01, μ=100

Window Join Output Rate :

w/ Effective Rates =

Limited Memory Resources
  • λa=10, λb=50
  • M=1000, =0.005

Window Join Output Rate =

Limited Memory & Computing Resources
  • μ=10, M=100
  • =0.01
  • Best performers are the groups that allocate maximum computing resources to one stream and maximum memory to the other.
Summary
  • Introduced unit-time basis cost model and experimentally validated it.
  • Extended traditional join framework to include asymmetric combinations of join algorithms.
  • Investigated resource allocation strategies for improving the accuracy of approximate answers.
  • Developed powerful optimization framework for sliding window join queries by addressing these issues in a unified manner.