A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees


  1. PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees. Shimin Chen*, Phillip B. Gibbons*, Suman Nath+ (*Intel Labs Pittsburgh, +Microsoft Research)
  2. Online Aggregation
     - Data warehouse and business intelligence
       - Fast-growing, multi-billion-dollar market
       - Interactive ad hoc queries on big data
       - Important for detecting new trends
       - Fast response time is hard to achieve
     - One promising approach: Online Aggregation (OLA)
       - Provides early representative results for aggregate queries (sum, count, avg, etc.)
       - For example, “average is 123.4 ± 5.6 with 95% confidence”
     - Essential to OLA: a non-blocking join algorithm [Hellerstein et al. 97]
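To illustrate the kind of running estimate OLA reports, here is a minimal Python sketch. It is not taken from the slides: the function name `running_estimate`, the synthetic data, and the normal-approximation interval over a single randomly ordered column are all assumptions made for the example. Estimators for joins are more involved.

```python
import math
import random


def running_estimate(values, z=1.96):
    """Yield (n, mean, half_width) after each value read so far.

    Assumes the values arrive in random order and uses a normal (CLT)
    approximation for the interval; z=1.96 corresponds to about 95%.
    No finite-population correction is applied.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err


# Hypothetical data: 10,000 values centered near 123.4.
random.seed(0)
data = [random.gauss(123.4, 50.0) for _ in range(10_000)]
estimates = list(running_estimate(data))
```

Each tuple in `estimates` can be read as "average is `mean` ± `half_width`"; the interval narrows as more records are consumed, which is what lets a user stop a query early.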
  3. Non-Blocking Join for OLA
     - OLA assumption: relations are in random order
     - Estimates are based on current results
     - [Figure labels: Relation A, Relation B, Main memory, Temporary storage; arrows: Spill, Read back]
  4. Design Goals of Non-Blocking Joins
     - User may find:
       - Wrong query: stop early
       - Accurate enough: stop early
       - Slow convergence: wait longer (high variance, high selectivity, high group counts, data skews, …)
       - Need the full, accurate result: finish query
     - Design goals:
       - Fast, representative early results
       - Good end-to-end performance
  5. Two Metrics in Algorithm Analysis
     - Join: check all pairs of records from A and B
     - Good end-to-end performance: Total I/Os
     - Fast early results: Result Rate = (newly covered area × selectivity) / (I/Os for covering the new area)
     - Early: before completely reading A and B
     - [Figure: grid with axes "records from A" and "records from B"; the newly covered area is labeled "new"]
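The result-rate metric on this slide can be written as a small function. This is a sketch under assumptions not stated in the slides: the covered area is modeled as a rectangle of A-records by B-records, and all parameter names are hypothetical.

```python
def result_rate(a_old, b_old, a_new, b_new, selectivity, ios):
    """Expected join results produced per I/O while covering new area.

    a_old, b_old: records of A and B covered before this step (assumed rectangle)
    a_new, b_new: records of A and B covered after this step
    selectivity:  fraction of (A, B) record pairs that match
    ios:          I/Os spent covering the new area
    """
    new_area = a_new * b_new - a_old * b_old  # newly checked record pairs
    return new_area * selectivity / ios


# Growing the covered area from 10 x 10 to 20 x 20 pairs at 1% selectivity,
# using 30 I/Os (all numbers are made up for illustration).
rate = result_rate(10, 10, 20, 20, selectivity=0.01, ios=30)
```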
  6. Design Space
     - [Figure: algorithms plotted by early representative result rate (low to high) against total I/O cost (low to high)]
     - Algorithms shown: Ripple [Haas & Hellerstein ’99], DBO [Jermaine et al. ’07], SMS [Jermaine et al. ’05], Hash Ripple [Luo et al. ’02], GRACE [Kitsuregawa et al. ’83]
     - Additional labels in the figure: "Ideal" and "PR-Join targets"
  7. Performance Result Preview
     - Near-optimal total I/O cost
     - Higher early result rate
  8. Outline
     - Introduction
     - PR-Join (Partitioned expanding Ripple Join) Algorithm
     - Evaluation
     - Conclusion
  9. Background: Ripple Join [Haas & Hellerstein’99]
     - For each ripple:
       - Read new records from A and B; check for matches
       - Read spilled records; check for matches with new records
       - Spill new records to disk
     - [Figure: grid with axes "records from A" and "records from B"; regions labeled "spilled" and "new"]
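The per-ripple steps above can be sketched in Python as follows. This is an in-memory illustration only, not the authors' implementation: the lists `seen_a` and `seen_b` stand in for records spilled to disk, and the function name, the `width` parameter, and the tuple record layout are assumptions.

```python
def ripple_join(a, b, width, key=lambda r: r[0]):
    """Yield the list of new matches found by each ripple.

    a, b:  lists of records, assumed to be in random order
    width: number of new records read from each relation per ripple
    """
    seen_a, seen_b = [], []  # stand-ins for records spilled to disk
    pos = 0
    while pos < max(len(a), len(b)):
        # Read new records from A and B; check them against each other.
        new_a = a[pos:pos + width]
        new_b = b[pos:pos + width]
        matches = [(x, y) for x in new_a for y in new_b if key(x) == key(y)]
        # "Read back" spilled records; check them against the new records.
        matches += [(x, y) for x in new_a for y in seen_b if key(x) == key(y)]
        matches += [(x, y) for x in seen_a for y in new_b if key(x) == key(y)]
        # "Spill" the new records.
        seen_a.extend(new_a)
        seen_b.extend(new_b)
        pos += width
        yield matches


# Hypothetical records: (join_key, relation_tag, row_id).
a = [(i % 7, "a", i) for i in range(50)]
b = [(i % 5, "b", i) for i in range(35)]
results = [m for ripple in ripple_join(a, b, width=8) for m in ripple]
```

Every pair of records is compared exactly once: in the ripple where the later of the two records is read. That is why results can be reported after each ripple without duplicates.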
  10. Observations of Ripple Join
      - Total I/Os: O(N^2), where N = total number of input pages in A and B
        - I/Os of ripples form an arithmetic series
      - The result rate of a ripple is higher if the ripple is wider
        - Result Rate = (newly covered area × selectivity) / (I/Os for covering the new area)
        - The numerator grows superlinearly with ripple width; the denominator grows linearly
        - So: increase the ripple width
        - But ripple width is limited by the memory size
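A simplified cost model makes the arithmetic series concrete. The model below is an assumption for illustration, not the analysis from the paper: both relations have the same number of pages, each ripple reads `width` new pages from each relation, reads back every page spilled so far, and then spills the new pages.

```python
def fixed_width_ios(n_pages, width):
    """Total page I/Os for fixed-width ripples under a simplified model.

    n_pages: pages in each of A and B (assumed equal)
    width:   new pages read from each relation per ripple (must be >= 1)
    """
    total = 0
    spilled = 0  # pages currently on temporary storage (both relations)
    read = 0     # pages read so far from each relation
    while read < n_pages:
        new = min(width, n_pages - read)
        total += 2 * new   # read new pages from A and B
        total += spilled   # read back all spilled pages
        total += 2 * new   # spill the new pages
        spilled += 2 * new
        read += new
    return total


small = fixed_width_ios(1000, 10)
large = fixed_width_ios(2000, 10)
```

In this model the read-back cost of successive ripples is 0, 20, 40, ... pages, so doubling the input roughly quadruples the total, which is consistent with the O(N^2) bound on the slide.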
  11. PR-Join Idea 1: Multiplicatively Expanding Ripples
      - Total I/Os: O(N), i.e. linear
        - I/Os of ripples form a geometric series
      - Higher result rate:
        - A wider ripple leads to a higher result rate
        - But must overcome the memory size limitation
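The geometric series can be illustrated with a simplified cost model. This is a sketch under assumptions, not the paper's analysis: both relations have the same number of pages, the cumulative number of pages covered per relation is multiplied by `ratio` at every ripple, each ripple reads back everything spilled so far, and the memory limit on ripple width is ignored.

```python
def expanding_ios(n_pages, first_width, ratio=2):
    """Total page I/Os for multiplicatively expanding ripples (simplified).

    n_pages:     pages in each of A and B (assumed equal)
    first_width: pages covered per relation after the first ripple (>= 1)
    ratio:       growth factor of the cumulative coverage (integer > 1)
    """
    total = 0
    spilled = 0           # pages on temporary storage (both relations)
    read = 0              # pages read so far from each relation
    target = first_width  # cumulative pages per relation after this ripple
    while read < n_pages:
        new = min(target, n_pages) - read
        total += 2 * new   # read new pages from A and B
        total += spilled   # read back all spilled pages
        total += 2 * new   # spill the new pages
        spilled += 2 * new
        read += new
        target *= ratio
    return total


small = expanding_ios(1024, 1)
large = expanding_ios(2048, 1)
```

With `ratio=2`, each ripple costs about twice the previous one, so the total is dominated by the last ripple and stays proportional to the input size: doubling the input roughly doubles the I/Os.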
  12. PR-Join Idea 2: Hash Partitioning
      - Each partition < memory
      - Every join invocation performs a ripple on a partition
      - The estimate is updated after every join invocation
        - Much faster user responses
        - Statistically sound
      - [Figure: join space partitioned on the join key; two regions labeled "empty"]
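A minimal sketch of the partitioning idea follows. It shows only one level of it: records from both relations are split by a hash of the join key, and each partition pair is joined separately, so results become available after every invocation. The function names, the use of Python's built-in `hash`, and the record layout are assumptions; spilling and expanding ripples are left out.

```python
def partition(records, num_partitions, key=lambda r: r[0]):
    """Split records into num_partitions lists by hash of the join key."""
    parts = [[] for _ in range(num_partitions)]
    for r in records:
        parts[hash(key(r)) % num_partitions].append(r)
    return parts


def partitioned_join(a, b, num_partitions, key=lambda r: r[0]):
    """Yield the matches found by each per-partition join invocation."""
    parts_a = partition(a, num_partitions, key)
    parts_b = partition(b, num_partitions, key)
    # Matching records always hash to the same partition, so pairs from
    # different partitions never need to be compared.
    for pa, pb in zip(parts_a, parts_b):
        yield [(x, y) for x in pa for y in pb if key(x) == key(y)]


# Hypothetical records: (join_key, relation_tag, row_id).
a = [(i % 6, "a", i) for i in range(40)]
b = [(i % 4, "b", i) for i in range(30)]
results = [m for part in partitioned_join(a, b, 4) for m in part]
```

Choosing `num_partitions` so that each partition fits in memory is what would let a wide ripple be processed piece by piece.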
  13. Statistical Guarantees
      - Idea: hash partitioning yields disjoint sub-spaces, which corresponds to stratified sampling in statistics
      - Statistical estimate:
        - Ripple join formula for every partition
        - Stratified sampling formula to combine estimates from partitioned ripples
      - [Figure: join space partitioned on the join key; two regions labeled "empty"]
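The combination step can be sketched for a SUM aggregate. This is an illustration of the standard stratified-sampling rule for totals, not the formulas from the paper: it assumes each partition already supplies an estimate of its own sum together with a variance for that estimate (for example from a ripple join estimator, not shown), and that the per-partition estimates are independent. The function name is hypothetical.

```python
import math


def combine_sum_estimates(partition_estimates, z=1.96):
    """Combine per-partition SUM estimates into one estimate and interval.

    partition_estimates: list of (estimate, variance) pairs, one per partition
    z: normal quantile; 1.96 corresponds to about 95% confidence
    Returns (total_estimate, half_width).
    """
    total = sum(est for est, _ in partition_estimates)
    # Variances of independent strata add.
    variance = sum(var for _, var in partition_estimates)
    return total, z * math.sqrt(variance)


# Made-up per-partition (estimate, variance) pairs.
total, half_width = combine_sum_estimates([(10.0, 4.0), (20.0, 9.0), (5.0, 3.0)])
```

The result reads as "sum is `total` ± `half_width`". Because the partitions are disjoint, no join result is counted in more than one stratum.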
  14. Comparing Analytical Performance (parameter setting details in paper)
  15. Outline
      - Introduction
      - PR-Join Algorithm
      - Evaluation
      - Conclusion
  16. Non-Blocking Join for OLA
      - Estimates are based on current results
      - [Figure labels: Relation A and Relation B (hard disks), Main memory, Temporary storage (hard disk or SSD); arrows: Spill, Read back]
  17. Disk as Temp Storage
      - Setup: 10GB joins 10GB, 500MB memory
      - PR-Join achieves much better end-to-end performance than Ripple Join
  18. Marginal Result Rate
      - Disk as temp storage
      - PR-Join achieves an order of magnitude higher result rate than Ripple Join
  19. SSD as Temp Storage
      - Setup: 10GB joins 10GB, 500MB memory
      - Temp I/Os are almost completely overlapped with I/Os to read input
      - Using SSD, PR-Join achieves near-optimal I/O costs
  20. More Details in Paper
      - Joining finite data streams:
        - PR-Join can be easily used for joining finite data streams
        - Compared with the state-of-the-art algorithm (RPJ [Tao et al.’05]), PR-Join achieves better performance
      - Analysis of non-blocking join algorithms for OLA
      - PR-Join parameter choices
      - Handling skews
      - More experimental results (see us at the plenary session)
  21. Conclusions
      - In this paper, we propose a new non-blocking join algorithm: PR-Join (Partitioned expanding Ripple Join)
      - PR-Join for Online Aggregation:
        - Provides statistical guarantees
        - An order of magnitude higher result rate than the prior approach
        - Near-optimal total I/O cost
      - PR-Join for finite data streams:
        - Better performance than the state-of-the-art algorithm
  22. Thank you! shimin.chen@intel.com