Revisiting Pipelined Parallelism in Multi-Join Query Processing

Revisiting Pipelined Parallelism in Multi-Join Query Processing Bin Liu and Elke A. Rundensteiner Worcester Polytechnic Institute (binliu|rundenst)@cs.wpi.edu http://www.davis.wpi.edu/dsrg VLDB 2005

Multi-Join Queries • Data Integration over Distributed Data Sources • i.e., Extract Transform Load (ETL) Services Data Source Data Warehouse Data Source … … Data Warehouse Data Source Persistent Storage (1) High IO costs given large intermediate results (2) Disk access undesirable since one time process VLDB 2005

Applying Parallelism • Processed in Main Memory of a PC Cluster • Make use of aggregated resources (main memory, CPU) Network Clusters of Machines VLDB 2005

Three Types of Parallelism Pipelined: Operators be composed into producer and consumer relationship Independent: Independent operators run simultaneously on distinct machines Partitioned: Single operator replicated and run on multiple machines VLDB 2005

key value ID Date … OID Item … 0011 5/13 … 0011 IPC … (2) Probe hash tables and output results 0012 5/14 … 0012 HPS … (1) Build hash tables of Orders based on ID … … … … … … Orders LineItems OID Item … ID Date … 0011 IPC … 0011 5/13 … 0012 HPS … 0012 5/14 … … … … … … … LineItems Orders Basics of Hash Join • Two-Phase Hash Join [SD89, LTS90] • Demonstrated High Performance • Potential High Degree of Parallelism VLDB 2005

key value key value key value (2) Probe hash tables and output results Split (1) Build hash tables of Orders based on ID OID Item … ID Date … 0011 IPC … 0011 5/13 … 0012 HPS … 0012 5/14 … … … … … … … LineItems Orders Partitioned Hash Join • Partition (Inputs) Hash Tables across Processors • Have Each Processing Node Run in Parallel VLDB 2005

Left-Deep Query Tree Steps: B8 P8 (9) Scan R9 – Probe P8 – Output B7 P7 R9 (8) Scan R8 – Probe P7 – Build B8 R8 B2 P2 (3) Scan R3 – Probe P2 – Build B3 B1 P1 (2) Scan R2 – Probe P1 – Build B2 R3 R1 R2 (1) Scan R1 – Build R1 Left-Deep Tree [SD90] Example Join Graph R8 R2 R9 R6 R1 R3 R7 R5 R4 … VLDB 2005

Right-Deep Query Tree B8 P8 R9 B7 P7 (2) Scan R1, Probe P1, Probe P2, … , Probe P8 R8 B2 P2 • Scan R2 – Build R1, • Scan R3 – Build R2, • …, • Scan R9 – Build R8 B1 P1 R3 R2 R1 Right-Deep Tree [SD90] Example Join Graph R8 R2 R9 R6 R1 R3 R7 R5 R4 VLDB 2005

Tradeoffs Between Left and Right Trees • Right-Deep Good potential for pipelined parallelism.  Intermediate results exist only as a stream.  Size of building relations can be predicted accurately.  Large memory consumption. • Left-Deep  Less memory consumption  Less pipelined parallelism VLDB 2005

B8 P8 R9 B7 P7 R8 B4 P4 R5 B3 P3 B2 P2 R4 R3 B1 P1 R1 R2 State-of-the-Art Solutions • Implicit Assumption : Prefer Maximal Pipelined Parallelism !!! VLDB 2005

B8 P8 B8 P8 R9 R9 B7 P7 B7 P7 R8 R8 B4 P4 B4 P4 R5 R5 B3 P3 B3 P3 B2 P2 B2 P2 R4 R4 R3 R3 B1 P1 B1 P1 R1 R1 R2 R2 State-of-the-Art Solutions • Break tree into several pieces, and • Process one piece at a time (as pipeline) • I.e., Static Right-Deep[SD90], • ZigZag [ZZBS93], • Segmented Right-Deep [CLYY92]. • What if : Memory Constrained Environments ? • Strategy : Pipeline ! VLDB 2005

t P12 Computation Machines P12 P13 P14 P22 P23 P24 P32 P33 P34 t R4 Probing Building R3 Partition Partition Partition Partition R2 R1 R1 R2 R3 R4 Pipelined Execution Redirection Cost: The intermediate results generated may need to be partitioned to a different machine. Optimal Degree of Parallelism? I.e., It may not be necessary to partition R2 over a large number of machines if it only has 1000 tuples? VLDB 2005

Pipelined Cost Model • Compute n-way join over k machines • Probing relation R0, building relations, R1, R2, …, Rn • Ii represents the intermediate results after joining with Ri • Total Work (Wb+Wp) & Total Processing Time (Tb+Tp) VLDB 2005

P4 R9 P2 P3 P1 R7 R1 R0 R3 R2 R7 R6 R4 R5 R1 R0 Break Pipelined Parallelism • To Break Long Pipeline and Introduce Independent Parallelism • Large number of small pipelines • High interdependence between pipelined segments • i.e., P1 > P2, P3 > P4, P2 > P4, VLDB 2005

P3 R8 R3 R2 R9 R7 R4 R1 P1 I1 R0 R5 I2 P2 R6 R5 R0 R1 R8 R7 R4 R3 R2 R6 R9 Segmented Bushy Tree • Basic Idea • Compose large pipelined segment • Run pipelined segments independently • Compose bushy tree with minimal interdependency To balance pipelined and independent parallelism VLDB 2005

Heuristics Cost-Based Composing Segmented Tree Input: A connected join graph G with n nodes. Number m specifies maximum number of nodes in each graph. Output: Segmented bushy tree with at least n/m subtrees. completed = false; WHILE (!completed) { Choose node V with largest cardinality that has not yet been grouped as probing relation; Enumerate all subgraphs starting from V with at most m nodes; Choose best subgraph, mark nodes in this group as having been selected in original join graph; IF !(exist K, K is a connected subgraph of G with unselected nodes) && (K.size() >= 2) { completed = true; } } Compose segmented bushy tree from all groups; VLDB 2005

G2 G1 G1 R8 R3 R2 R8 R3 R2 R8 R3 R2 R9 R7 R4 R1 R9 R7 R4 R1 R9 R7 R4 R1 R6 R5 R0 R6 R5 R0 R6 R5 R0 Example (1) R7, R8, R9, R6 (2) R7, R9, R6, R8 (3) R7, R4, R8, R5 ... (1) R1, R0, R2, R3 (2) R1, R2, R0, R3 (3) R1, R2, R3, R4 ... VLDB 2005

G2 G1 R8 R3 R2 R9 R7 R4 R1 I1 R0 R5 I2 R6 R5 R0 G3 R1 R8 R7 R4 R3 R2 R6 R9 Example : Segmented Bushy Tree VLDB 2005

Machine Allocation • Based on building relation sizes of each segment • Nb: total amount of building work. • ki: number of machines allocated to pipeline i k3 Nb = k1 I1 R0 R5 I2 k2 R1 R8 R7 R4 R3 R2 R6 R9 VLDB 2005

Insufficient Main Memory • Break query based on main memory availability • Compose segmented bushy tree for each part R14 R19 R12 R8 R3 R2 R13 R15 R18 R10 R9 R7 R4 R1 R16 R17 R11 R6 R5 R0 VLDB 2005

Experimental Setup • 10 Machine Cluster • Each machine has 2 2.4GHz Xeon CPUs, 2G Memory. • Connect by gigabit ethernet switch PIII 800M Hz PC, 256M Memory Application Oracle 8i 10 Machine Cluster ... 2 PIII 1G CPUs, 1G Memory PIII 800M Hz PC, 256M Memory Controller VLDB 2005

Experimental Setup (cont.) • Generated Data Set with Integer Join Values • Around 40 bytes per tuple • Randomly Generated Join Queries • Acyclic join graph with 8, 12, 16 nodes • Each node represents one join relation • Each edge represents one join condition • Average join ratio is 1 • Cardinality of each relation is from 1K ~ 100K • Up to 600MB per query VLDB 2005

Pipelined vs. Segmented (I) VLDB 2005

Pipelined vs. Segmented (II) VLDB 2005

Insufficient Main Memory VLDB 2005

Related Work • [SD90] Tradeoffs in processing complex join queries via hashing in multiprocessor database machines. VLDB 1990. • [CLYY92] Using segmented right deep trees for execution of pipelined hash joins. VLDB 1992. • [MLD94] Parallel hash based join algorithms for a shared everything environment. TKDE 1994. • [MD97] Data placement in shared nothing parallel database systems. VLDB 1997. • [WFA95] Parallel evaluation of multi-join queries. SIGMOD 1995. • [HCY94] On parallel execution of multiple pipelined hash joins. SIGMOD 1994. • [DNSS92] Practical skew handling in parallel joins. VLDB 1992. • [SHCF03] Flux: an adaptive partitioning operator for continuous query systems. ICDE, 2003. VLDB 2005

Conclusions • Observation: Maximal pipelined hash join processing • Redirection costs? optimal degree of parallelism? • Hypothesis: Worthwhile to incorporate independent parallelism into processing • Both, so several shorter pipelines in parallel • Solution: Segmented bushy tree processing • Heuristics and cost-driven algorithm developed • Validation : Extensive experimental studies • Achieve around 50% improvement over pure pipelined processing VLDB 2005

Revisiting Pipelined Parallelism in Multi-Join Query Processing

Revisiting Pipelined Parallelism in Multi-Join Query Processing

Presentation Transcript

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Query Processing

Revisiting Parallelism

Query Processing

Query processing

Parallelism (Multi-threaded)

Exploiting Vector Parallelism in Software Pipelined Loops

Query Processing

Query Processing

Exploiting Vector Parallelism in Software Pipelined Loops