A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees

PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees
Shimin Chen*, Phillip B. Gibbons*, Suman Nath+
*Intel Labs Pittsburgh, +Microsoft Research

Online Aggregation
- Data warehouse and business intelligence
  - Fast-growing multi-billion-dollar market
- Interactive ad-hoc queries on big data
  - Important for detecting new trends
  - Fast response time is hard to achieve
- One promising approach: Online Aggregation (OLA)
  - Provides early representative results for aggregate queries (sum, count, avg, etc.)
  - For example, "average is 123.4 ± 5.6 with 95% confidence"
- Essential to OLA: a non-blocking join algorithm [Hellerstein et al. 97]
PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate Shimin Chen, Phillip B. Gibbons, Suman Nath

Non-Blocking Join for OLA
- OLA assumption: relations are in random order
- Estimates are based on current results
[Figure: Relation A and Relation B are read into main memory; records are spilled to temporary storage and read back.]

Design Goals of Non-Blocking Joins
User may find:
- Wrong query: stop early
- Accurate enough: stop early
- Slow convergence: wait longer (high variance, high selectivity, high group counts, data skews, …)
- Need the full, accurate result: finish query
Design goals:
- Fast, representative early results
- Good end-to-end performance

Two Metrics in Algorithm Analysis
- Good end-to-end performance: total I/Os
- Fast early results: result rate
  Result Rate = (newly covered area x selectivity) / (I/Os for covering the new area)
- Join: check all pairs of records from A and B
- Early: before completely reading A and B
[Figure: the join space drawn as a rectangle with records from A on one axis and records from B on the other; the newly covered area is marked "new".]
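The result-rate metric on this slide can be worked through with a small sketch. The function names and the numbers below are illustrative assumptions, not taken from the paper; the "area" is counted in record pairs.

```python
def newly_covered_area(seen_a, seen_b, new_a, new_b):
    """Number of (A, B) record pairs covered for the first time in this step."""
    return (seen_a + new_a) * (seen_b + new_b) - seen_a * seen_b


def result_rate(seen_a, seen_b, new_a, new_b, selectivity, ios):
    """Expected join results produced per I/O while covering the new area."""
    return newly_covered_area(seen_a, seen_b, new_a, new_b) * selectivity / ios


# Hypothetical numbers: 1,000 records of each relation already seen,
# 100 new records read from each, selectivity 0.001, 30 I/Os spent.
rate = result_rate(1000, 1000, 100, 100, selectivity=0.001, ios=30)
print(rate)
```

With these made-up inputs the new area is 210,000 pairs, so the rate comes out at roughly 7 results per I/O.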

Design Space
[Figure: early representative result rate (low to high) plotted against total I/O cost (low to high), with a region marked "Ideal" that PR-Join targets. Algorithms shown: Ripple [Haas & Hellerstein'99], DBO [Jermaine et al.'07], SMS [Jermaine et al.'05], Hash Ripple [Luo et al.'02], GRACE [Kitsuregawa et al.'83].]

Performance Result Preview
- Near-optimal total I/O cost
- Higher early result rate

Outline
- Introduction
- PR-Join (Partitioned expanding Ripple Join) Algorithm
- Evaluation
- Conclusion

Background: Ripple Join [Haas & Hellerstein'99]
For each ripple:
- Read new records from A and B; check for matches
- Read spilled records; check for matches with new records
- Spill new records to disk
[Figure: the join space with records from A and records from B on the two axes, each divided into "spilled" and "new" regions.]
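The three per-ripple steps can be sketched in memory as follows. This is a minimal illustration, not the authors' implementation: Python lists stand in for the spill files on disk, and the `ripple_join` name, the `width` parameter, and the tuple-based record layout are assumptions made for the example.

```python
from itertools import islice


def ripple_join(rel_a, rel_b, width, key=lambda r: r[0]):
    """Yield matching (a, b) pairs ripple by ripple.

    rel_a and rel_b are iterables of records, assumed to be in random order.
    Lists stand in for the on-disk spill files (an assumption of this sketch).
    """
    it_a, it_b = iter(rel_a), iter(rel_b)
    spilled_a, spilled_b = [], []
    while True:
        # Step 1: read new records from A and B and match them with each other.
        new_a = list(islice(it_a, width))
        new_b = list(islice(it_b, width))
        if not new_a and not new_b:
            return
        for a in new_a:
            for b in new_b:
                if key(a) == key(b):
                    yield a, b
        # Step 2: match spilled records against the other relation's new records.
        for a in spilled_a:
            for b in new_b:
                if key(a) == key(b):
                    yield a, b
        for a in new_a:
            for b in spilled_b:
                if key(a) == key(b):
                    yield a, b
        # Step 3: "spill" the new records so later ripples can read them back.
        spilled_a.extend(new_a)
        spilled_b.extend(new_b)


rel_a = [(k % 5, "a%d" % k) for k in range(12)]
rel_b = [(k % 3, "b%d" % k) for k in range(9)]
early_results = list(ripple_join(rel_a, rel_b, width=4))
```

Every pair of records is compared exactly once across the ripples, so consuming the whole generator yields the same result set as a full nested-loop join, while the first results appear after the first ripple.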

Observations of Ripple Join
- Total I/Os: O(N^2), where N = total number of input pages in A and B
  - I/Os of ripples form an arithmetic series
- Result rate of a ripple is higher if the ripple is wider
  Result Rate = (newly covered area x selectivity) / (I/Os for covering the new area)
  - The numerator grows superlinearly; the denominator grows linearly
  - So: increase ripple width
  - But ripple width is limited by the memory size
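The arithmetic-series observation can be checked with a simple page-count model. The cost model here is an assumption made for illustration: each ripple reads its new input pages, reads back every page spilled so far, and writes the new pages to the spill area.

```python
def fixed_width_ripple_ios(total_pages, width):
    """Total page I/Os when every ripple reads `width` new input pages."""
    ios = 0
    spilled = 0
    while spilled < total_pages:
        new = min(width, total_pages - spilled)
        ios += new      # read the new input pages
        ios += spilled  # read back everything spilled so far
        ios += new      # spill the new pages
        spilled += new
    return ios


small = fixed_width_ripple_ios(total_pages=100, width=10)
large = fixed_width_ripple_ios(total_pages=200, width=10)
print(small, large)  # 650 2300
```

The read-back term is 0 + 10 + 20 + ... per ripple, an arithmetic series, so doubling the input from 100 to 200 pages more than triples the total under this model.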

PR-Join Idea 1: Multiplicatively Expanding Ripples
- Total I/Os: O(N), linear
  - I/Os of ripples form a geometric series
- Higher result rate:
  - Wider ripple leads to higher result rate
  - But must overcome the memory size limitation
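The geometric-series claim can be illustrated with a small page-count model. This is a sketch under assumptions, not the paper's analysis: each ripple reads its new pages, reads back all previously spilled pages, and spills the new pages; `first_width` and `growth` are hypothetical parameters; memory limits are ignored.

```python
def expanding_ripple_ios(total_pages, first_width=1, growth=2):
    """Total page I/Os when each ripple multiplies the covered input by `growth`."""
    ios = 0
    seen = 0
    new = min(first_width, total_pages)
    while seen < total_pages:
        ios += new   # read the new input pages
        ios += seen  # read back everything spilled so far
        ios += new   # spill the new pages
        seen += new
        # The next ripple is sized so the covered input grows multiplicatively.
        new = min(max(1, int(seen * (growth - 1))), total_pages - seen)
    return ios


n1 = expanding_ripple_ios(1024)
n2 = expanding_ripple_ios(2048)
print(n1, n2)  # 3071 6143
```

With `growth=2` the per-ripple costs are 3, 6, 12, ... pages, a geometric series, so the total stays below 3N and doubling the input roughly doubles the I/O count.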

PR-Join Idea 2: Hash Partitioning
- Each partition < memory
- Every join invocation performs a ripple on a partition
- Estimation is updated after every join invocation
  - Much faster user responses
  - Statistically sound
[Figure: the join space partitioned on the join key, with regions marked "empty".]
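Why partitioning on the join key leaves most of the join space empty can be seen in a short sketch. The helper names and the tuple-based record layout are assumptions, and a plain nested loop stands in for the per-partition ripple.

```python
from collections import defaultdict


def hash_partition(records, num_partitions, key=lambda r: r[0]):
    """Group records by the hash of their join key."""
    parts = defaultdict(list)
    for rec in records:
        parts[hash(key(rec)) % num_partitions].append(rec)
    return parts


def partitioned_join(rel_a, rel_b, num_partitions, key=lambda r: r[0]):
    """Join partition p of A only with partition p of B."""
    parts_a = hash_partition(rel_a, num_partitions, key)
    parts_b = hash_partition(rel_b, num_partitions, key)
    for p in range(num_partitions):
        # One "join invocation": records with equal keys share a hash value,
        # so pairs from different partitions can never match.
        for a in parts_a.get(p, []):
            for b in parts_b.get(p, []):
                if key(a) == key(b):
                    yield a, b


rel_a = [(k % 7, "a%d" % k) for k in range(20)]
rel_b = [(k % 4, "b%d" % k) for k in range(15)]
results = list(partitioned_join(rel_a, rel_b, num_partitions=3))
```

Only the matching-partition cells of the join space are visited, yet the output equals that of a full join, because equal keys always hash to the same partition.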

Statistical Guarantees
- Idea: hash partitioning produces disjoint sub-spaces, as in stratified sampling in statistics
- Statistical estimate:
  - Ripple join formula for every partition
  - Stratified sampling formula to combine estimates from partitioned ripples
[Figure: the join space partitioned on the join key, with regions marked "empty".]
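A rough sketch of how per-partition estimates might be combined for a SUM or COUNT query follows. It is not the paper's formula: it uses the textbook ripple-join scale-up for each partition and the stratified estimator of a total (the sum of per-stratum estimates). It assumes the partition sizes are known and that every partition has had at least one record of each relation read; confidence intervals are omitted. All names are hypothetical.

```python
def ripple_sum_estimate(observed_sum, seen_a, seen_b, size_a, size_b):
    """Scale the sum observed over the seen pairs up to the whole partition.

    Requires seen_a > 0 and seen_b > 0 (an assumption of this sketch).
    """
    return observed_sum * (size_a * size_b) / (seen_a * seen_b)


def stratified_sum_estimate(partitions):
    """Treat each partition as a stratum and add up the per-stratum estimates."""
    return sum(ripple_sum_estimate(**p) for p in partitions)


# Hypothetical state after a few join invocations on two partitions.
partitions = [
    {"observed_sum": 10, "seen_a": 5, "seen_b": 5, "size_a": 10, "size_b": 10},
    {"observed_sum": 3, "seen_a": 10, "seen_b": 10, "size_a": 10, "size_b": 10},
]
estimate = stratified_sum_estimate(partitions)
print(estimate)  # 43.0
```

The second partition has been read completely, so its contribution is exact; the first is scaled up by a factor of four because only a quarter of its pair space has been covered.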

Comparing Analytical Performance
[Comparison table not recovered.]
(Parameter setting details in paper)

Outline
- Introduction
- PR-Join Algorithm
- Evaluation
- Conclusion

Non-Blocking Join for OLA
- Estimates are based on current results
[Figure: Relation A and Relation B are read from hard disks into main memory; records are spilled to temporary storage (hard disk or SSD) and read back.]

Disk as Temp Storage
- 10GB joins 10GB; 500MB memory
- PR-Join achieves much better end-to-end performance than Ripple Join

Marginal Result Rate
- Disk as temp storage
- PR-Join achieves an order of magnitude higher result rate than Ripple Join

SSD as Temp Storage
- 10GB joins 10GB; 500MB memory
- Temp I/Os are almost completely overlapped with I/Os to read input
- Using SSD, PR-Join achieves near-optimal I/O costs

More Details in Paper
- Joining finite data streams:
  - PR-Join can be easily used for joining finite data streams
  - Compared with the state-of-the-art algorithm (RPJ [Tao et al.'05]), PR-Join achieves better performance
- Analysis of non-blocking join algorithms for OLA
- PR-Join parameter choices
- Handling skews
- More experimental results
(See us at the plenary session)

Conclusions
- In this paper, we propose a new non-blocking join algorithm: PR-Join (Partitioned expanding Ripple Join)
- PR-Join for Online Aggregation:
  - Provides statistical guarantees
  - An order of magnitude higher result rate than the prior approach
  - Near-optimal total I/O cost
- PR-Join for finite data streams:
  - Better performance than the state-of-the-art algorithm

Thank you!
shimin.chen@intel.com
