
Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs

Changkyu Kim, Eric Sedlar, Jatin Chhugani, Tim Kaldewey, Anthony D. Nguyen, Andrea Di Blas, Victor W. Lee, Nadathur Satish and Pradeep Dubey. Intel Corporation and Oracle Corporation.


Presentation Transcript


  1. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs • Changkyu Kim¹, Eric Sedlar², Jatin Chhugani¹, Tim Kaldewey², Anthony D. Nguyen¹, Andrea Di Blas², Victor W. Lee¹, Nadathur Satish¹ and Pradeep Dubey¹ • ¹Throughput Computing Lab, Intel Corporation • ²Special Projects Group, Oracle Corporation

  2. Introduction • Join is a widely used database operation • Two common join algorithms: hash join and sort-merge join • Increasing memory capacity and faster I/O emphasize the importance of exploiting modern CPU features (multi-core + SIMD) • 4-core CPUs with 128-bit wide SIMD are commonplace • Revisit these two join algorithms in the context of modern CPUs and compare them w.r.t. future CPU trends

  3. Summary • In-memory hash join that exploits features of modern CPUs • Fastest reported hash join performance • Stable across a wide range of input sizes and data skews • An analytical model to estimate compute and bandwidth usage • Compare hash join and sort-merge join and project future performance • Sort-merge join has better scaling potential on future CPUs • Wider SIMD, limited per-core memory bandwidth

  4. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Sort-merge join for Modern CPUs • Results • Conclusions

  5. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Sort-merge join for Modern CPUs • Results • Conclusions

  6. CPU Architecture Trends and Optimization • Memory Subsystem • Latency • Cache, TLB, Prefetch • Bandwidth • Cache • Thread-level Parallelism (TLP) • Increasing number of cores and threads • Memory bandwidth • Data-level Parallelism (DLP) • SIMD execution units perform one operation on multiple data elements • Currently 128-bit wide (SSE) and growing wider (256-bit: Intel AVX, 512-bit: Intel Larrabee)

  7. Related Work • Hash Join • Grace hash join [Kitsuregawa et al.], Hybrid hash join [Zeller et al.] • Partitioned hash [Shatdal et al.] • Radix-clustering [Manegold et al.] • Join on New Architectures • Cell [Gedik et al.] • GPU [He et al.] • Various Database Operations on CMP • Aggregation [Cieslewicz et al.] • Sort [Chhugani et al.] • Sort versus Hash [Graefe et al.]

  8. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Sort-merge join for Modern CPUs • Results • Conclusions

  9. Hash Join • Basic operation Q: SELECT … FROM R, S WHERE R.key = S.key • Build the hash table with keys of the inner relation (S) • Iterate over the outer tuples (R) and find matching keys in the hash table • Append matching tuples to the output table • Challenges of operating on a large hash table • Cache-line-unfriendly accesses waste memory bandwidth • TLB-unfriendly accesses increase memory latency => Can we make the problem compute-bound?
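
A minimal sketch of the basic operation on this slide, using a single hash table over the whole inner relation; the tuple layout, names, and std::unordered_multimap are illustrative stand-ins, not the paper's implementation, which instead partitions into cache-resident sub-tables as shown on the next slides:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };       // (4B key, 4B record id)
struct Match { uint32_t r_rid; uint32_t s_rid; };   // joined record-id pair

// Naive hash join: build on the inner relation S, probe with the outer relation R.
// A single large table like this is exactly the cache/TLB-unfriendly case the
// partitioned join on the following slides is designed to avoid.
std::vector<Match> hash_join(const std::vector<Tuple>& R, const std::vector<Tuple>& S) {
    std::unordered_multimap<uint32_t, uint32_t> ht;  // key -> rid of S
    ht.reserve(S.size());
    for (const Tuple& s : S) ht.emplace(s.key, s.rid);   // build phase
    std::vector<Match> out;
    for (const Tuple& r : R) {                            // probe phase
        auto range = ht.equal_range(r.key);
        for (auto it = range.first; it != range.second; ++it)
            out.push_back({r.rid, it->second});
    }
    return out;
}
```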

  10. Overview • Perform join operations inside caches • Partition two input tables into sub-tables (Partition phase) • Partition so that a sub-table pair (Rn, Sn) fits in the L2 cache • Join between the corresponding sub-tables (Join phase) • Build the hash table with the inner sub-table (Build phase) • Probe the outer sub-table against the hash table (Probe phase) • [Diagram: relations R and S partitioned into sub-table pairs (R1, S1) … (Rn, Sn) of (key, rid) tuples; each pair is joined independently]
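
A sketch of the partition-then-join flow described above; the multiplicative hash, the fanout parameter, and the per-partition std::unordered_multimap are assumptions standing in for the hand-tuned phases detailed on the following slides:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };
struct Match { uint32_t r_rid; uint32_t s_rid; };

// Hash-partition a relation into `fanout` sub-tables; fanout is chosen so that
// a corresponding (R_i, S_i) pair fits in the L2 cache.
static std::vector<std::vector<Tuple>> partition(const std::vector<Tuple>& rel,
                                                 uint32_t fanout) {
    std::vector<std::vector<Tuple>> parts(fanout);
    for (const Tuple& t : rel)
        parts[(t.key * 2654435761u) % fanout].push_back(t);  // illustrative hash
    return parts;
}

// Join each corresponding sub-table pair entirely inside the cache:
// build a hash table on S_i, probe it with R_i.
std::vector<Match> partitioned_hash_join(const std::vector<Tuple>& R,
                                         const std::vector<Tuple>& S,
                                         uint32_t fanout) {
    auto Rp = partition(R, fanout);
    auto Sp = partition(S, fanout);
    std::vector<Match> out;
    for (uint32_t i = 0; i < fanout; ++i) {
        std::unordered_multimap<uint32_t, uint32_t> ht;      // Build phase on S_i
        for (const Tuple& s : Sp[i]) ht.emplace(s.key, s.rid);
        for (const Tuple& r : Rp[i]) {                       // Probe phase with R_i
            auto range = ht.equal_range(r.key);
            for (auto it = range.first; it != range.second; ++it)
                out.push_back({r.rid, it->second});
        }
    }
    return out;
}
```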

  11. Handling Various Sizes of Keys/Payloads • Prolog • # of distinct keys and records is less than 4 billion (2³²) • Read keys and compact them into (4B partial key, 4B record id) pairs • Use a hash function to generate partial keys • Join with (4B key, 4B Rid) tuples • Epilog • Read each record-id pair (Rid1, Rid2) and access the original tuples • Compare full keys to filter out false positives • Write the output result (key, payload1, payload2)
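
A rough illustration of the prolog, assuming string keys and using std::hash as a stand-in for the paper's (unspecified) hash function; the epilog would later fetch the original records by rid and compare full keys to discard false positives:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Wide keys are compacted into (4B partial key, 4B record id) pairs before the
// join. Two distinct keys can map to the same partial key, which is why the
// epilog must re-check the full keys.
struct Compact { uint32_t pkey; uint32_t rid; };

std::vector<Compact> make_prolog(const std::vector<std::string>& full_keys) {
    std::vector<Compact> out;
    out.reserve(full_keys.size());
    for (uint32_t rid = 0; rid < full_keys.size(); ++rid) {
        uint32_t pkey = static_cast<uint32_t>(std::hash<std::string>{}(full_keys[rid]));
        out.push_back({pkey, rid});   // 8-byte tuple actually used in the join
    }
    return out;
}
```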

  12. Partition Phase • Step P1: Build a histogram • Step P2: Perform a prefix sum to compute the scatter addresses • Step P3: Permute (scatter) the tuples • How many partitions? • Level 1: 64-entry fully associative TLB • Level 2: 512-entry 4-way set associative TLB • TLB misses slow down the partition phase when more than 128 partitions are used • Multi-pass partitioning is necessary • [Diagram: Steps P1–P3 applied to a small table of (key, rid) tuples, building the histogram, prefix-summing it, and scattering the tuples into their partitions]
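
A single-threaded sketch of one partitioning pass (P1 histogram, P2 prefix sum, P3 scatter); using the low radix bits of the key as the partition index is an assumption:

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };

// One partitioning pass over a (key, rid) table. `hist` is returned holding the
// start offset of every partition in the output (size fanout + 1).
std::vector<Tuple> partition_pass(const std::vector<Tuple>& in,
                                  std::vector<uint32_t>& hist,
                                  unsigned radix_bits) {
    const uint32_t fanout = 1u << radix_bits;
    const uint32_t mask = fanout - 1;
    hist.assign(fanout + 1, 0);

    // Step P1: build the histogram of partition sizes.
    for (const Tuple& t : in) ++hist[(t.key & mask) + 1];
    // Step P2: prefix sum turns sizes into scatter addresses.
    for (uint32_t p = 1; p <= fanout; ++p) hist[p] += hist[p - 1];
    // Step P3: permute (scatter) the tuples into their partitions.
    std::vector<Tuple> out(in.size());
    std::vector<uint32_t> offset(hist.begin(), hist.end() - 1);
    for (const Tuple& t : in) out[offset[t.key & mask]++] = t;
    return out;
}
```

Keeping each pass at around 128 partitions stays TLB-friendly; for example, two such passes compose to the 16K partitions reported on the results slide, which is the multi-pass partitioning mentioned above.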

  13. Build Phase • Build the hash table of the inner sub-table • Collided/duplicated elements in the hash table are located consecutively • Speed up comparing join keys in the probe phase • Similar to the partition phase • [Diagram: inner sub-table permuted by hash value; Hist[] delimits each bucket of the permuted (key, rid) table]
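
A sketch of the build phase under the same idea: the "hash table" is just the inner sub-table permuted bucket by bucket plus the prefix-summed histogram, so collided and duplicated keys end up contiguous; the multiplicative hash and the power-of-two bucket count are assumptions:

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };

// The "hash table": the inner sub-table permuted so that all tuples hashing to
// the same bucket are contiguous, plus the histogram delimiting each bucket.
struct HashTable {
    std::vector<Tuple> tuples;     // permuted inner sub-table
    std::vector<uint32_t> hist;    // bucket j occupies [hist[j], hist[j+1])
};

// Illustrative multiplicative hash; num_buckets is assumed to be a power of two.
static inline uint32_t bucket_of(uint32_t key, uint32_t num_buckets) {
    return (key * 2654435761u) & (num_buckets - 1);
}

HashTable build(const std::vector<Tuple>& inner, uint32_t num_buckets) {
    HashTable ht;
    ht.hist.assign(num_buckets + 1, 0);
    for (const Tuple& t : inner) ++ht.hist[bucket_of(t.key, num_buckets) + 1];
    for (uint32_t j = 1; j <= num_buckets; ++j) ht.hist[j] += ht.hist[j - 1];
    ht.tuples.resize(inner.size());
    std::vector<uint32_t> offset(ht.hist.begin(), ht.hist.end() - 1);
    for (const Tuple& t : inner)
        ht.tuples[offset[bucket_of(t.key, num_buckets)]++] = t;
    return ht;
}
```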

  14. Probe Phase • Probe the outer sub-table against the hash table of the inner sub-table • For each outer tuple, apply the hash function to find the histogram index (j) • Compare the outer key with the inner keys between Hist[j] and Hist[j+1] • Duplicated/collided keys are potential matches • When matching keys are found, append the tuples to the output table • [Diagram: each outer (key, rid) tuple is hashed and compared against the contiguous bucket Hist[j]..Hist[j+1] of the permuted inner sub-table]
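
A matching probe-phase sketch, continuing the build-phase code above (it reuses that sketch's Tuple, HashTable and bucket_of definitions):

```cpp
#include <cstdint>
#include <vector>

struct Match { uint32_t r_rid; uint32_t s_rid; };   // joined record-id pair

// Probe each outer tuple against the contiguous bucket of the permuted inner
// sub-table; duplicated/collided keys are scanned and filtered by exact compare.
void probe(const std::vector<Tuple>& outer, const HashTable& ht,
           uint32_t num_buckets, std::vector<Match>& out) {
    for (const Tuple& r : outer) {
        const uint32_t j = bucket_of(r.key, num_buckets);
        for (uint32_t i = ht.hist[j]; i < ht.hist[j + 1]; ++i)
            if (ht.tuples[i].key == r.key)
                out.push_back({r.rid, ht.tuples[i].rid});
    }
}
```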

  15. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Exploiting TLP • Exploiting DLP • Sort-merge join for Modern CPUs • Results • Conclusions

  16. Parallelizing the Partition Phase • Equally divide the input tuples among threads; each thread maintains a local histogram • Explicit barrier at the end of each partition step • Use task queuing for dynamic load balancing • An idle thread steals tasks from other threads • [Diagram: threads build local histograms over their tuple chunks, then scatter their tuples to per-thread offsets derived from a combined prefix sum, with barriers between steps]
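
A simplified sketch of the parallel partition using std::thread, where joining the threads plays the role of the explicit barriers and the static chunking omits the task-queue load balancing mentioned above:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };

// Step 1: each thread histograms its own chunk. Step 2: a combined prefix sum
// over the local histograms gives every thread private scatter offsets.
// Step 3: each thread scatters its chunk without further synchronization.
// fanout is assumed to be a power of two.
void parallel_partition(const std::vector<Tuple>& in, std::vector<Tuple>& out,
                        uint32_t fanout, unsigned num_threads) {
    const uint32_t mask = fanout - 1;
    const size_t chunk = (in.size() + num_threads - 1) / num_threads;
    std::vector<std::vector<uint32_t>> local(num_threads,
                                             std::vector<uint32_t>(fanout, 0));
    out.resize(in.size());
    std::vector<std::thread> pool;

    // Step 1: per-thread local histograms.
    for (unsigned t = 0; t < num_threads; ++t)
        pool.emplace_back([&, t] {
            const size_t lo = t * chunk, hi = std::min(in.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i) ++local[t][in[i].key & mask];
        });
    for (auto& th : pool) th.join();        // barrier
    pool.clear();

    // Step 2: prefix sum across (partition, thread) -> private scatter offsets.
    std::vector<std::vector<uint32_t>> offset(num_threads,
                                              std::vector<uint32_t>(fanout));
    uint32_t sum = 0;
    for (uint32_t p = 0; p < fanout; ++p)
        for (unsigned t = 0; t < num_threads; ++t) {
            offset[t][p] = sum;
            sum += local[t][p];
        }

    // Step 3: each thread scatters its chunk into its private output ranges.
    for (unsigned t = 0; t < num_threads; ++t)
        pool.emplace_back([&, t] {
            const size_t lo = t * chunk, hi = std::min(in.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i)
                out[offset[t][in[i].key & mask]++] = in[i];
        });
    for (auto& th : pool) th.join();        // barrier
}
```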

  17. Parallelizing the Join Phase • Input distribution can cause severe load imbalance in this phase • Propose a three-phase fine-grained parallelization (see the classification sketch below) • Phase I • Each thread picks a pair of sub-tables and performs the join independently • Sub-tables larger than a threshold are handled in the next phase • Phase II • All threads work together to join a pair of "large" sub-tables • Build the hash table in parallel • Equally divide the outer sub-table and probe in parallel • Tuples with a large number of potential matches are handled in the next phase • Phase III • All threads work together to compare a key against the potentially matching inner tuples • Used when data are heavily skewed (Zipf distribution) • Equally divide the potential matches among threads and compare in parallel
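
A minimal sketch of the Phase I / Phase II split; the size threshold and the SubTablePair representation are assumptions, and the cooperative Phase II/III joins themselves are not shown:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// A sub-table pair to be joined; only the sizes matter for classification here.
struct SubTablePair { std::size_t r_size, s_size; };

// Pairs no larger than `threshold` are joined one-per-thread (Phase I);
// oversized pairs are deferred so that all threads can build and probe them
// cooperatively (Phase II), with heavily skewed keys further split in Phase III.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
classify_pairs(const std::vector<SubTablePair>& pairs, std::size_t threshold) {
    std::vector<std::size_t> phase1, phase2;   // indices of pairs per phase
    for (std::size_t i = 0; i < pairs.size(); ++i) {
        if (pairs[i].r_size + pairs[i].s_size <= threshold)
            phase1.push_back(i);
        else
            phase2.push_back(i);
    }
    return {phase1, phase2};
}
```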

  18. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Exploiting TLP • Exploiting DLP • Sort-merge join for Modern CPUs • Results • Conclusions

  19. Data-level Parallelism • Perform the same operation on K elements (K = 4 in current SSE) • Operate on K consecutive tuples simultaneously • Challenges • Data Gather • Pack elements together from K distinct memory locations • Compare K join keys to check for matches • Data Scatter • Write elements to non-contiguous memory locations • Permute K tuples in the partition phase • SIMD Update Conflict • Elements in different SIMD lanes are written to the same memory location • Histogram update • Current CPU SSE architectures do not support the above features • Negligible SIMD benefit for hash join

  20. Outline • CPU Architecture Trends and Related Work • Hash Join for Modern CPUs • Exploiting Parallelism in Hash Join • Exploiting TLP • Exploiting DLP • Sort-merge join for Modern CPUs • Results • Conclusions

  21. Sort-merge Join • Based on the merge sort of Chhugani et al. [VLDB'08] • Extended to handle (key, rid) pairs • Exploiting DLP • Use a bitonic merge network • Good SIMD benefit • Exploiting TLP • Perform a parallel merge • Bandwidth-friendly multiway merge • Data is read/written only once from/to main memory
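
A scalar sketch of the sort-merge join's overall structure, with std::sort standing in for the SIMD bitonic merge sort of Chhugani et al. and a standard duplicate-aware merge producing (rid, rid) matches; none of the SIMD or multiway-merge machinery is shown:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tuple { uint32_t key; uint32_t rid; };
struct Match { uint32_t r_rid; uint32_t s_rid; };

// Sort both (key, rid) tables by key, then merge them, emitting the cross
// product of each run of equal keys.
std::vector<Match> sort_merge_join(std::vector<Tuple> R, std::vector<Tuple> S) {
    auto by_key = [](const Tuple& a, const Tuple& b) { return a.key < b.key; };
    std::sort(R.begin(), R.end(), by_key);
    std::sort(S.begin(), S.end(), by_key);

    std::vector<Match> out;
    size_t i = 0, j = 0;
    while (i < R.size() && j < S.size()) {
        if (R[i].key < S[j].key) ++i;
        else if (R[i].key > S[j].key) ++j;
        else {
            // Equal keys: emit all pairs from the matching runs in R and S.
            const size_t j0 = j;
            for (; i < R.size() && R[i].key == S[j0].key; ++i)
                for (j = j0; j < S.size() && S[j].key == R[i].key; ++j)
                    out.push_back({R[i].rid, S[j].rid});
        }
    }
    return out;
}
```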

  22. Experimental Setup • Tuples consist of (4B key, 4B rid) pairs • Vary the number of tuples from 64K to 128M • Various distributions • Change the percentage of tuples with matches • Change cardinalities • Heavily skewed data using a Zipf distribution • Intel Core i7 • 3.2GHz, quad-core, 2 SMT threads per core • 32KB L1, 256KB L2, 6MB L3 • 64-entry L1 TLB, 512-entry L2 TLB • Metric: Cycles Per Tuple • 32 cycles per tuple = 100M tuples per second at 3.2GHz
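
As a quick check of the metric conversion quoted above: 3.2×10⁹ cycles/s ÷ 32 cycles/tuple = 1.0×10⁸ ≈ 100M tuples per second.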

  23. Benefit of Partitioning • [Chart: performance of 1-pass, 2-pass, and 3-pass partitioning across partition counts] • 128M tuples, uniform distribution • 16K partitions give the best performance • 7X better than the non-partitioned join • Fastest reported hash join performance • 100M tuples per second • 8X faster than the best reported GPU (8800GTX) number [SIGMOD'08 by He et al.]

  24. Various Input Distributions => Stable across various input distributions

  25. Current Sort vs. Hash • Sort-merge join • 2.3X SIMD speedup • 3.6X parallel speedup • Fastest reported (key, rid) sorting performance • 1.25 (cycles per element per iteration) × N × log N (N: # of elements) • 50M tuples per second (two tables of 128M tuples) • Hash join is still up to 2X faster than sort-merge join
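
As a rough check of the sorting cost (assuming the log is base 2): for N = 128M = 2²⁷ the formula gives 1.25 × 27 ≈ 34 cycles per element, i.e. about 3.2×10⁹ / 34 ≈ 95M elements sorted per second per table; sorting both 128M-tuple inputs plus the merge pass is broadly consistent with the ~50M joined tuples per second quoted above.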

  26. Future Sort VS. Hash (Projected) • Future architecture trends favor sort-merge join • Wider SIMD • 256-bit wide SIMD, 512-bit wide SIMD • Limited per-core memory bandwidth • Merge sort-based join has fewer memory accesses than hash-based join

  27. Conclusions • Optimized hash join and sort-merge join for modern CPUs • Exploit TLP/DLP • Fastest reported hash join and sort-merge join performance • Future architecture trends show better performance potential for sort-merge join • Optimizing join algorithms for the upcoming Intel Larrabee architecture

  28. Thank you! Questions?
