1 / 17

Progressive Merge Join : A Generic and Non-Blocking Sort-Based Join Algorithm

Progressive Merge Join : A Generic and Non-Blocking Sort-Based Join Algorithm. Jens-Peter Dittrich Bernhard Seeger David Scot Taylor Peter Widmayer. Outline. Motivation Genericity Invariants Applications Non-Blocking Run Generation Merge Phase Experiments Conclusions. 1. Motivation.

odessa
Download Presentation

Progressive Merge Join : A Generic and Non-Blocking Sort-Based Join Algorithm

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Progressive Merge Join:A Generic and Non-Blocking Sort-Based Join Algorithm Jens-Peter Dittrich Bernhard Seeger David Scot Taylor Peter Widmayer

  2. Outline • Motivation • Genericity • Invariants • Applications • Non-Blocking • Run Generation • Merge Phase • Experiments • Conclusions

  3. 1. Motivation • Genericity • positive aspects • separating algorithm from the domain problem • reuse of code • increase flexibility • extensible database systems • Non-Blocking algorithms • production of results before the entire input is read • pipelined processing through iterators

  4. Sort-Merge Join • Sort-Merge Join • sketch of the algorithm • sort phase • join phase by merging • frequent usage of sort-merge paradigm • equi joins • temporal joins • spatial joins • similarity joins • blocking operator #results runtime

  5. 2. Generictity Basic idea • based on the sweep-line paradigm known from CG • sweep area refers to the processing window of the join • insertions • queries • deletions • a sweep area for each input Notation • R and S input relations • R‘ and S‘ sorted input relations • Pjoin binary join predicate • Prm binary removal predicate

  6. insert(r) Prm(?,r) Pjoin (?,r) Sweep-area R‘ sweep area of R r s ³ r S‘ s sweep area of S

  7. Use-cases • equi join • order by attribute A • Pjoin (u,v) := u.A == v.A • Prm(u,v) := u.A < v.A • temporal join • data type I = [min, max] • order by I.min • Pjoin (u,v) := u.I intersects v.I • Prm(u,v) := u.I.max < v.I.min • spatial join • data type R = [I0,I1] • order by R[0].min • Pjoin (u,v) := u.R intersects v.R • Prm(u,v) := u.R[0].max < v.R[0].min

  8. Conditions When is the method applicable? • "w ³ v: Prm(u,v) Þ¬Pjoin (u,w) When an element is removed from a sweep area, it does not contribute anymore to the result of the join. • " w ≤ v: Prm(u,v) Þ Prm(u,w) When an item removes an element from a sweep area, it also will remove all smaller elements.

  9. R S S 3. Early results • Basic ideas • sort both inputs simoultaneously in main memory • interleave sorting and joining: join immediately after • initial runs are generated • a bunch of runs are merged

  10. 3. Early results • Basic ideas • sort both inputs simultaneously in main memory • Interleave sorting and joining: join immediately after • initial runs are generated • a bunch of runs are merged

  11. Duplicates Problem • early results are produced again during merge Avoidance of duplicates • a sweep area is used for each input run and an item has to send a query to each of these sweep areas • fits to our generic framework • considerable CPU-time overhead • Each item has to check every sweep area • currently used in our implementation • each item contains the label of the run where it is coming from • requires a special sweep area • moderate CPU-time overhead • each item has to check only one sweep area • early results are retrieved, but identified

  12. Correctness Theorem If the conditions • "w ³v:Prm(u,v) Þ¬Pjoin (u,w) • " w ≤ v: Prm(u,v) Þ Prm(u,w) are satisfied, Progressive Merge Join returns each result of the join at most once.

  13. 4. Experiments • Implementation of PMJ • in XXL (eXtendible and fleXible Library), but not yet in the most recent release • Examined methods • Methods • strict sort-merge join • semi-strict merge join: join during final merge • PMJ • Join Predicates • equi • spatial (Arge et al, VLDB 1998) • similarity (Dittrich & Seeger, KDD 2001)

  14. Results (I/O performance) • Similarity Join • 16-dim. Fourier vectors from CAD data • about 650.000 vectors per set • # results as a function of #I/O • page size: 4K PMJ semi-strict SMJ strict SMJ #results #I/O

  15. 200.000 150.00 #results 100.000 50000 0 0 350 runtime (sec) Results (runtime) • Experimental setting • 700 MHz Athlon, 256 MB memory • Java HotSpot Server Virtual Machine • # results as a function of #I/O PMJ semi-strict SMJ strict SMJ

  16. 5. Related Work • Sort-merge join • Join during final merge: [NP 85] • Plane-sweep paradigm • [PS 85] • Symmetric Hash Join • [RS 86], [WA 91] • Xjoin • [UF 00] • Hash-Ripple Join • [HH 99] • Scalable version [LEHN 02]

  17. 5. Conclusions • PMJ • Generic join processing • is not limited to equi-predicate, but supports also others • necessary conditions for applying PMJ to a join predicate • Non-blocking • join during run generation • join during merge • PMJ is an efficient non-blocking similarity join • Extensions • multi-relation joins • adaptive memory assignment • large main memories • sorted order of output

More Related