
Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs

Changkyu Kim, Eric Sedlar, Jatin Chhugani, Tim Kaldewey, Anthony D. Nguyen, Andrea Di Blas, Victor W. Lee, Nadathur Satish and Pradeep Dubey. Intel Corporation and Oracle Corporation.


Presentation Transcript


  1. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs • Changkyu Kim¹, Eric Sedlar², Jatin Chhugani¹, Tim Kaldewey², Anthony D. Nguyen¹, Andrea Di Blas², Victor W. Lee¹, Nadathur Satish¹ and Pradeep Dubey¹ • ¹Throughput Computing Lab, Intel Corporation • ²Special Projects Group, Oracle Corporation

  2. Introduction • Join is a widely used database operation • Two common join algorithms: hash join and sort-merge join • Increasing memory capacity and faster I/O emphasize the importance of exploiting modern CPU features (multi-core + SIMD) • 4-core CPUs with 128-bit wide SIMD are commonplace • Revisit these two join algorithms in the context of modern CPUs and compare them w.r.t. future CPU trends

  3. Summary • In-memory hash join that exploits features of modern CPUs • Fastest reported hash join performance • Stable across a wide range of input sizes and data skews • An analytical model to estimate compute and bandwidth usage • Compare hash join and sort-merge join and project future performance • Sort-merge join has better scaling potential on future CPUs • Wider SIMD, limited per-core memory bandwidth

  4. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Sort-merge join for Modern CPUs • Results • Conclusions

  5. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Sort-merge join for Modern CPUs • Results • Conclusions

  6. CPU Architecture Trends and Optimization • Memory Subsystem • Latency • Cache, TLB, Prefetch • Bandwidth • Cache • Thread-level Parallelism (TLP) • Increasing number of cores and threads • Memory bandwidth • Data-level Parallelism (DLP) • SIMD execution units perform one operation on multiple data elements • Currently 128-bit wide (SSE) and growing wider (256-bit: Intel AVX, 512-bit: Intel Larrabee)

  7. Related Work • Hash Join • Grace hash join [Kitsuregawa et al.], Hybrid hash join [Zeller et al.] • Partitioned hash [Shatdal et al.] • Radix-clustering [Manegold et al.] • Join on New Architectures • Cell [Gedik et al.] • GPU [He et al.] • Various Database Operations on CMP • Aggregation [Cieslewicz et al.] • Sort [Chhugani et al.] • Sort versus Hash [Graefe et al.]

  8. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Sort-merge join for Modern CPUs • Results • Conclusions

  9. Hash Join • Basic operation Q: SELECT … FROM R, S WHERE R.key = S.key • Build the hash table with keys of the inner relation (S) • Iterate over the outer tuples (R) and find matching keys in the hash table • Append matching tuples to the output table • Challenges of operating on a large hash table • Cache-line-unfriendly accesses waste memory bandwidth • TLB-unfriendly accesses increase memory latency => Can we make the problem compute-bound?
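
A minimal sketch of the basic operation on this slide, using a single hash table over the whole inner relation; the tuple layout, names, and std::unordered_multimap are illustrative stand-ins, not the paper's implementation, which instead partitions into cache-resident sub-tables as shown on the next slides:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };       // (4B key, 4B record id)
struct Match { uint32_t r_rid; uint32_t s_rid; };   // joined record-id pair

// Naive hash join: build on the inner relation S, probe with the outer relation R.
// A single large table like this is exactly the cache/TLB-unfriendly case the
// partitioned join on the following slides is designed to avoid.
std::vector<Match> hash_join(const std::vector<Tuple>& R, const std::vector<Tuple>& S) {
    std::unordered_multimap<uint32_t, uint32_t> ht;  // key -> rid of S
    ht.reserve(S.size());
    for (const Tuple& s : S) ht.emplace(s.key, s.rid);   // build phase
    std::vector<Match> out;
    for (const Tuple& r : R) {                            // probe phase
        auto range = ht.equal_range(r.key);
        for (auto it = range.first; it != range.second; ++it)
            out.push_back({r.rid, it->second});
    }
    return out;
}
```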

  10. Overview • Perform join operations inside caches • Partition two input tables into sub-tables (Partition phase) • Partition so that a sub-table pair (Rn, Sn) fits in the L2 cache • Join between the corresponding sub-tables (Join phase) • Build the hash table with the inner sub-table (Build phase) • Probe the outer sub-table against the hash table (Probe phase) • [Diagram: relations R and S partitioned into sub-table pairs (R1, S1) … (Rn, Sn) of (key, rid) tuples; each pair is joined independently]
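
A sketch of the partition-then-join flow described above; the multiplicative hash, the fanout parameter, and the per-partition std::unordered_multimap are assumptions standing in for the hand-tuned phases detailed on the following slides:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };
struct Match { uint32_t r_rid; uint32_t s_rid; };

// Hash-partition a relation into `fanout` sub-tables; fanout is chosen so that
// a corresponding (R_i, S_i) pair fits in the L2 cache.
static std::vector<std::vector<Tuple>> partition(const std::vector<Tuple>& rel,
                                                 uint32_t fanout) {
    std::vector<std::vector<Tuple>> parts(fanout);
    for (const Tuple& t : rel)
        parts[(t.key * 2654435761u) % fanout].push_back(t);  // illustrative hash
    return parts;
}

// Join each corresponding sub-table pair entirely inside the cache:
// build a hash table on S_i, probe it with R_i.
std::vector<Match> partitioned_hash_join(const std::vector<Tuple>& R,
                                         const std::vector<Tuple>& S,
                                         uint32_t fanout) {
    auto Rp = partition(R, fanout);
    auto Sp = partition(S, fanout);
    std::vector<Match> out;
    for (uint32_t i = 0; i < fanout; ++i) {
        std::unordered_multimap<uint32_t, uint32_t> ht;      // Build phase on S_i
        for (const Tuple& s : Sp[i]) ht.emplace(s.key, s.rid);
        for (const Tuple& r : Rp[i]) {                       // Probe phase with R_i
            auto range = ht.equal_range(r.key);
            for (auto it = range.first; it != range.second; ++it)
                out.push_back({r.rid, it->second});
        }
    }
    return out;
}
```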

  11. Handling Various Sizes of Keys/Payloads • Prolog • # of distinct keys and records is less than 4 billion (2³²) • Read keys and compact them into (4B partial key, 4B record id) pairs • Use a hash function to generate partial keys • Join with (4B key, 4B Rid) tuples • Epilog • Read each record-id pair (Rid1, Rid2) and access the original tuples • Compare full keys to filter out false positives • Write the output result (key, payload1, payload2)
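
A rough illustration of the prolog, assuming string keys and using std::hash as a stand-in for the paper's (unspecified) hash function; the epilog would later fetch the original records by rid and compare full keys to discard false positives:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Wide keys are compacted into (4B partial key, 4B record id) pairs before the
// join. Two distinct keys can map to the same partial key, which is why the
// epilog must re-check the full keys.
struct Compact { uint32_t pkey; uint32_t rid; };

std::vector<Compact> make_prolog(const std::vector<std::string>& full_keys) {
    std::vector<Compact> out;
    out.reserve(full_keys.size());
    for (uint32_t rid = 0; rid < full_keys.size(); ++rid) {
        uint32_t pkey = static_cast<uint32_t>(std::hash<std::string>{}(full_keys[rid]));
        out.push_back({pkey, rid});   // 8-byte tuple actually used in the join
    }
    return out;
}
```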

  12. Partition Phase • Step P1: Build a histogram • Step P2: Perform a prefix sum to compute the scatter addresses • Step P3: Permute (scatter) the tuples • How many partitions? • Level 1: 64-entry fully associative TLB • Level 2: 512-entry 4-way set associative TLB • TLB misses slow down the partition phase when more than 128 partitions are used • Multi-pass partitioning is necessary • [Diagram: Steps P1–P3 applied to a small table of (key, rid) tuples, building the histogram, prefix-summing it, and scattering the tuples into their partitions]
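
A single-threaded sketch of one partitioning pass (P1 histogram, P2 prefix sum, P3 scatter); using the low radix bits of the key as the partition index is an assumption:

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };

// One partitioning pass over a (key, rid) table. `hist` is returned holding the
// start offset of every partition in the output (size fanout + 1).
std::vector<Tuple> partition_pass(const std::vector<Tuple>& in,
                                  std::vector<uint32_t>& hist,
                                  unsigned radix_bits) {
    const uint32_t fanout = 1u << radix_bits;
    const uint32_t mask = fanout - 1;
    hist.assign(fanout + 1, 0);

    // Step P1: build the histogram of partition sizes.
    for (const Tuple& t : in) ++hist[(t.key & mask) + 1];
    // Step P2: prefix sum turns sizes into scatter addresses.
    for (uint32_t p = 1; p <= fanout; ++p) hist[p] += hist[p - 1];
    // Step P3: permute (scatter) the tuples into their partitions.
    std::vector<Tuple> out(in.size());
    std::vector<uint32_t> offset(hist.begin(), hist.end() - 1);
    for (const Tuple& t : in) out[offset[t.key & mask]++] = t;
    return out;
}
```

Keeping each pass at around 128 partitions stays TLB-friendly; for example, two such passes compose to the 16K partitions reported on the results slide, which is the multi-pass partitioning mentioned above.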

  13. Build Phase • Build the hash table of the inner sub-table • Collided/duplicated elements in the hash table are located consecutively • Speed up comparing join keys in the probe phase • Similar to the partition phase • [Diagram: inner sub-table permuted by hash value; Hist[] delimits each bucket of the permuted (key, rid) table]
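
A sketch of the build phase under the same idea: the "hash table" is just the inner sub-table permuted bucket by bucket plus the prefix-summed histogram, so collided and duplicated keys end up contiguous; the multiplicative hash and the power-of-two bucket count are assumptions:

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };

// The "hash table": the inner sub-table permuted so that all tuples hashing to
// the same bucket are contiguous, plus the histogram delimiting each bucket.
struct HashTable {
    std::vector<Tuple> tuples;     // permuted inner sub-table
    std::vector<uint32_t> hist;    // bucket j occupies [hist[j], hist[j+1])
};

// Illustrative multiplicative hash; num_buckets is assumed to be a power of two.
static inline uint32_t bucket_of(uint32_t key, uint32_t num_buckets) {
    return (key * 2654435761u) & (num_buckets - 1);
}

HashTable build(const std::vector<Tuple>& inner, uint32_t num_buckets) {
    HashTable ht;
    ht.hist.assign(num_buckets + 1, 0);
    for (const Tuple& t : inner) ++ht.hist[bucket_of(t.key, num_buckets) + 1];
    for (uint32_t j = 1; j <= num_buckets; ++j) ht.hist[j] += ht.hist[j - 1];
    ht.tuples.resize(inner.size());
    std::vector<uint32_t> offset(ht.hist.begin(), ht.hist.end() - 1);
    for (const Tuple& t : inner)
        ht.tuples[offset[bucket_of(t.key, num_buckets)]++] = t;
    return ht;
}
```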

  14. Probe Phase • Probe the outer sub-table against the hash table of the inner sub-table • For each outer tuple, apply the hash function to find the histogram index (j) • Compare the outer key with the inner keys between Hist[j] and Hist[j+1] • Duplicated/collided keys are potential matches • When matching keys are found, append the tuples to the output table • [Diagram: each outer (key, rid) tuple is hashed and compared against the contiguous bucket Hist[j]..Hist[j+1] of the permuted inner sub-table]
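
A matching probe-phase sketch, continuing the build-phase code above (it reuses that sketch's Tuple, HashTable and bucket_of definitions):

```cpp
#include <cstdint>
#include <vector>

struct Match { uint32_t r_rid; uint32_t s_rid; };   // joined record-id pair

// Probe each outer tuple against the contiguous bucket of the permuted inner
// sub-table; duplicated/collided keys are scanned and filtered by exact compare.
void probe(const std::vector<Tuple>& outer, const HashTable& ht,
           uint32_t num_buckets, std::vector<Match>& out) {
    for (const Tuple& r : outer) {
        const uint32_t j = bucket_of(r.key, num_buckets);
        for (uint32_t i = ht.hist[j]; i < ht.hist[j + 1]; ++i)
            if (ht.tuples[i].key == r.key)
                out.push_back({r.rid, ht.tuples[i].rid});
    }
}
```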

  15. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Exploiting TLP • Exploiting DLP • Sort-merge join for Modern CPUs • Results • Conclusions

  16. Parallelizing the Partition Phase • Equally divide the input tuples among threads; each thread maintains a local histogram • Explicit barrier at the end of each partition step • Use task queuing for dynamic load balancing • An idle thread steals tasks from other threads • [Diagram: threads build local histograms over their tuple chunks, then scatter their tuples to per-thread offsets derived from a combined prefix sum, with barriers between steps]
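
A simplified sketch of the parallel partition using std::thread, where joining the threads plays the role of the explicit barriers and the static chunking omits the task-queue load balancing mentioned above:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };

// Step 1: each thread histograms its own chunk. Step 2: a combined prefix sum
// over the local histograms gives every thread private scatter offsets.
// Step 3: each thread scatters its chunk without further synchronization.
// fanout is assumed to be a power of two.
void parallel_partition(const std::vector<Tuple>& in, std::vector<Tuple>& out,
                        uint32_t fanout, unsigned num_threads) {
    const uint32_t mask = fanout - 1;
    const size_t chunk = (in.size() + num_threads - 1) / num_threads;
    std::vector<std::vector<uint32_t>> local(num_threads,
                                             std::vector<uint32_t>(fanout, 0));
    out.resize(in.size());
    std::vector<std::thread> pool;

    // Step 1: per-thread local histograms.
    for (unsigned t = 0; t < num_threads; ++t)
        pool.emplace_back([&, t] {
            const size_t lo = t * chunk, hi = std::min(in.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i) ++local[t][in[i].key & mask];
        });
    for (auto& th : pool) th.join();        // barrier
    pool.clear();

    // Step 2: prefix sum across (partition, thread) -> private scatter offsets.
    std::vector<std::vector<uint32_t>> offset(num_threads,
                                              std::vector<uint32_t>(fanout));
    uint32_t sum = 0;
    for (uint32_t p = 0; p < fanout; ++p)
        for (unsigned t = 0; t < num_threads; ++t) {
            offset[t][p] = sum;
            sum += local[t][p];
        }

    // Step 3: each thread scatters its chunk into its private output ranges.
    for (unsigned t = 0; t < num_threads; ++t)
        pool.emplace_back([&, t] {
            const size_t lo = t * chunk, hi = std::min(in.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i)
                out[offset[t][in[i].key & mask]++] = in[i];
        });
    for (auto& th : pool) th.join();        // barrier
}
```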

  17. Parallelizing the Join Phase • Input distribution can cause severe load imbalance in this phase • Propose a three-phase fine-grained parallelization (see the classification sketch below) • Phase I • Each thread picks a pair of sub-tables and performs the join independently • Sub-tables larger than a threshold are handled in the next phase • Phase II • All threads work together to join a pair of "large" sub-tables • Build the hash table in parallel • Equally divide the outer sub-table and probe in parallel • Tuples with a large number of potential matches are handled in the next phase • Phase III • All threads work together to compare a key against the potentially matching inner tuples • Used when data are heavily skewed (Zipf distribution) • Equally divide the potential matches among threads and compare in parallel
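
A minimal sketch of the Phase I / Phase II split; the size threshold and the SubTablePair representation are assumptions, and the cooperative Phase II/III joins themselves are not shown:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// A sub-table pair to be joined; only the sizes matter for classification here.
struct SubTablePair { std::size_t r_size, s_size; };

// Pairs no larger than `threshold` are joined one-per-thread (Phase I);
// oversized pairs are deferred so that all threads can build and probe them
// cooperatively (Phase II), with heavily skewed keys further split in Phase III.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
classify_pairs(const std::vector<SubTablePair>& pairs, std::size_t threshold) {
    std::vector<std::size_t> phase1, phase2;   // indices of pairs per phase
    for (std::size_t i = 0; i < pairs.size(); ++i) {
        if (pairs[i].r_size + pairs[i].s_size <= threshold)
            phase1.push_back(i);
        else
            phase2.push_back(i);
    }
    return {phase1, phase2};
}
```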

  18. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Exploiting TLP • Exploiting DLP • Sort-merge join for Modern CPUs • Results • Conclusions

  19. Data-level Parallelism • Perform the same operation on K elements (K = 4 in current SSE) • Operate on K consecutive tuples simultaneously • Challenges • Data Gather • Pack elements together from K distinct memory locations • Compare K join keys to check for matches • Data Scatter • Write elements to non-contiguous memory locations • Permute K tuples in the partition phase • SIMD Update Conflict • Elements in different SIMD lanes are written to the same memory location • Histogram update • Current CPU SSE architectures do not support the above features • Negligible SIMD benefit for hash join

  20. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Exploiting TLP • Exploiting DLP • Sort-merge join for Modern CPUs • Results • Conclusions

  21. Sort-merge Join • Based on the merge sort of Chhugani et al. [VLDB'08] • Extended to handle (key, rid) pairs • Exploiting DLP • Use a bitonic merge network • Good SIMD benefit • Exploiting TLP • Perform a parallel merge • Bandwidth-friendly multiway merge • Data is read/written only once from/to main memory
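
A scalar sketch of the sort-merge join's overall structure, with std::sort standing in for the SIMD bitonic merge sort of Chhugani et al. and a standard duplicate-aware merge producing (rid, rid) matches; none of the SIMD or multiway-merge machinery is shown:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };
struct Match { uint32_t r_rid; uint32_t s_rid; };

// Sort both (key, rid) tables by key, then merge them, emitting the cross
// product of each run of equal keys.
std::vector<Match> sort_merge_join(std::vector<Tuple> R, std::vector<Tuple> S) {
    auto by_key = [](const Tuple& a, const Tuple& b) { return a.key < b.key; };
    std::sort(R.begin(), R.end(), by_key);
    std::sort(S.begin(), S.end(), by_key);

    std::vector<Match> out;
    size_t i = 0, j = 0;
    while (i < R.size() && j < S.size()) {
        if (R[i].key < S[j].key) ++i;
        else if (R[i].key > S[j].key) ++j;
        else {
            // Equal keys: emit all pairs from the matching runs in R and S.
            const size_t j0 = j;
            for (; i < R.size() && R[i].key == S[j0].key; ++i)
                for (j = j0; j < S.size() && S[j].key == R[i].key; ++j)
                    out.push_back({R[i].rid, S[j].rid});
        }
    }
    return out;
}
```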

  22. Experimental Setup • Tuples consist of (4B key, 4B rid) pairs • Vary the number of tuples from 64K to 128M • Various distributions • Change the percentage of tuples with matches • Change cardinalities • Heavily skewed data using a Zipf distribution • Intel Core i7 • 3.2GHz, quad-core, 2 SMT threads per core • 32KB L1, 256KB L2, 6MB L3 • 64-entry L1 TLB, 512-entry L2 TLB • Metric: Cycles Per Tuple • 32 cycles per tuple = 100M tuples per second at 3.2GHz
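
As a quick check of the metric conversion quoted above: 3.2×10⁹ cycles/s ÷ 32 cycles/tuple = 1.0×10⁸ ≈ 100M tuples per second.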

  23. Benefit of Partitioning • [Chart: performance of 1-pass, 2-pass, and 3-pass partitioning across partition counts] • 128M tuples, uniform distribution • 16K partitions give the best performance • 7X better than the non-partitioned join • Fastest reported hash join performance • 100M tuples per second • 8X faster than the best reported GPU (8800GTX) number [SIGMOD'08 by He et al.]

  24. Various Input Distributions => Stable across various input distributions

  25. Current Sort vs. Hash • Sort-merge join • 2.3X SIMD speedup • 3.6X parallel speedup • Fastest reported (key, rid) sorting performance • 1.25 (cycles per element per iteration) × N × log N (N: # of elements) • 50M tuples per second (two tables of 128M tuples) • Hash join is still up to 2X faster than sort-merge join
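
As a rough check of the sorting cost (assuming the log is base 2): for N = 128M = 2²⁷ the formula gives 1.25 × 27 ≈ 34 cycles per element, i.e. about 3.2×10⁹ / 34 ≈ 95M elements sorted per second per table; sorting both 128M-tuple inputs plus the merge pass is broadly consistent with the ~50M joined tuples per second quoted above.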

  26. Future Sort VS. Hash (Projected) • Future architecture trends favor sort-merge join • Wider SIMD • 256-bit wide SIMD, 512-bit wide SIMD • Limited per-core memory bandwidth • Merge sort-based join has fewer memory accesses than hash-based join

  27. Conclusions • Optimized hash join and sort-merge join for modern CPUs • Exploit TLP/DLP • Fastest reported hash join and sort-merge join performance • Future architecture trends show better performance potential for sort-merge join • Optimizing join algorithms for the upcoming Intel Larrabee architecture

  28. Thank you! Questions?
