
Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs

Changkyu Kim¹, Eric Sedlar², Jatin Chhugani¹,

Tim Kaldewey², Anthony D. Nguyen¹, Andrea Di Blas²,

Victor W. Lee¹, Nadathur Satish¹ and Pradeep Dubey¹

Intel Corporation and Oracle Corporation

  • ¹ Throughput Computing Lab, Intel Corporation

  • ² Special Projects Group, Oracle Corporation


Introduction

  • Join is a widely used database operation

  • Two common join algorithms

    • Hash join, Sort-merge join

  • Increasing memory capacity and faster I/O

    • Emphasize the importance of exploiting modern CPU features (Multi-core + SIMD)

    • 4-core CPUs with 128-bit wide SIMD are commonplace

  • Revisit these two join algorithms in the context of modern CPUs and compare them w.r.t. future CPU trends


Summary

  • In-memory hash join to exploit features of modern CPUs

    • Fastest reported hash join performance

    • Stable across a wide range of input sizes and data skews

  • An analytical model to estimate compute and bandwidth usage

    • Compare hash join and sort-merge join and project future performance

  • Sort-merge join has better scaling potential for future CPUs

    • Wider SIMD, limited per-core memory bandwidth


Outline

  • CPU Architecture Trends and Related Work

  • Hash Join for Modern CPUs

  • Exploiting Parallelism in Hash Join

    • Exploiting TLP

    • Exploiting DLP

  • Sort-merge join for Modern CPUs

  • Results

  • Conclusions





CPU Architecture Trends and Optimization

  • Memory Subsystem

    • Latency

      • Cache, TLB, Prefetch

    • Bandwidth

      • Cache

  • Thread-level Parallelism (TLP)

    • Increasing number of cores and threads

    • Memory bandwidth

  • Data-level Parallelism (DLP)

    • SIMD execution units perform the same operation on multiple data elements

    • Currently 128-bit wide (SSE) and growing wider (256-bit: Intel AVX, 512-bit: Intel Larrabee)



Related Work

  • Hash Join

    • Grace hash join [Kitsuregawa et al.], Hybrid hash join [Zeller et al.]

    • Partitioned hash [Shatdal et al.]

    • Radix-clustering [Manegold et al.]

  • Join on New Architectures

    • Cell [Gedik et al.]

    • GPU [He et al.]

  • Various Database Operations on CMP

    • Aggregation [Cieslewicz et al.]

    • Sort [Chhugani et al.]

  • Sort versus Hash [Graefe et al.]





Hash Join

  • Basic operation

    Q: SELECT … FROM R, S WHERE R.key = S.key

    • Build the hash table with keys of the inner relation (S)

    • Iterate over the outer tuples (R) and find matching keys in the hash table

    • Append matching tuples to the output table

  • Challenges of large hash tables

    • Cache-line-unfriendly access

      • Wastes memory bandwidth

    • TLB-unfriendly access

      • Increases memory latency

=> Can we make the problem compute-bound?
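
For reference, the basic operation above can be sketched in a few lines of C++. This is the textbook algorithm, not the paper's cache-optimized implementation, and the type and function names are illustrative only:

    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Tuple { uint32_t key; uint32_t rid; };

    // Build a hash table on the inner relation S, probe it with every outer
    // tuple of R, and append matching (R.rid, S.rid) pairs to the output.
    std::vector<std::pair<uint32_t, uint32_t>>
    hash_join(const std::vector<Tuple>& R, const std::vector<Tuple>& S) {
        std::unordered_multimap<uint32_t, uint32_t> ht;       // key -> S.rid
        ht.reserve(S.size());
        for (const Tuple& s : S) ht.emplace(s.key, s.rid);    // build phase
        std::vector<std::pair<uint32_t, uint32_t>> out;
        for (const Tuple& r : R) {                            // probe phase
            auto range = ht.equal_range(r.key);
            for (auto it = range.first; it != range.second; ++it)
                out.emplace_back(r.rid, it->second);          // append match
        }
        return out;
    }

On large tables, every probe touches an essentially random cache line and TLB page, which is exactly the memory-bound behavior the partitioning scheme below is designed to avoid.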


Overview

  • Perform join operations inside caches

  • Partition two input tables into sub-tables (Partition phase)

    • Partition so that each sub-table pair (Rn, Sn) fits in the L2 cache

  • Join between the corresponding sub-tables (Join phase)

    • Build the hash table with the inner sub-table (Build phase)

    • Probe the outer sub-table against the hash table (Probe phase)

[Figure: relations R and S are each partitioned into sub-tables R1…Rn and S1…Sn of (key, rid) tuples; each corresponding pair (Ri, Si) is joined independently.]



Handling Various Sizes of Keys/Payloads

  • Prolog

    • Assumes the numbers of distinct keys and of records are less than 4 billion (2^32)

    • Read keys and compact into (4B partial key, 4B record id) pairs

    • Use a hash function to generate partial keys

  • Join with (4B key, 4B Rid) tuples

  • Epilog

    • Read each record-id pair (Rid1, Rid2) and access the original tuples

    • Compare full keys to reject false positives

    • Write output result (key, payload1, payload2)
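
A minimal sketch of the prolog, assuming distinct keys and record counts below 2^32 as stated above; hash32() stands in for whatever hash function generates the 4-byte partial keys, and the Record layout is hypothetical:

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    struct JoinTuple { uint32_t partial_key; uint32_t rid; };

    uint32_t hash32(const void* key, size_t len);   // assumed hash function

    // Compact arbitrary-width keys into fixed (4B partial key, 4B rid) pairs.
    // The join then runs entirely on these 8-byte tuples; the epilog re-reads
    // the original records via rid and compares full keys to reject the false
    // positives that partial-key hashing can introduce.
    template <class Record>
    std::vector<JoinTuple> prolog(const std::vector<Record>& table) {
        std::vector<JoinTuple> out(table.size());
        for (size_t rid = 0; rid < table.size(); ++rid)
            out[rid] = { hash32(&table[rid].key, sizeof(table[rid].key)),
                         static_cast<uint32_t>(rid) };
        return out;
    }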



Partition Phase

  • Step P1: Build a histogram

  • Step P2: Perform a prefix sum to compute the scatter addresses

  • Step P3: Permute (scatter) the tuples

  • The number of partitions?

    • Level1: 64-entry fully associative TLB

    • Level2: 512-entry 4-way set associative TLB

    • TLB misses slow down the partition phase when more than 128 partitions are used

      • Multi-pass partitioning is necessary
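
A single partitioning pass might look like the following sketch (radixing on the key's low bits is an assumption; the paper's code differs in details). With log_parts <= 7, one pass stays within the roughly 128 partitions the TLB tolerates, and a multi-pass scheme chains this routine with increasing shift values:

    #include <cstdint>
    #include <vector>

    struct JoinTuple { uint32_t key; uint32_t rid; };

    // P1: histogram of partition IDs. P2: prefix sum turns counts into
    // scatter offsets. P3: permute each tuple into its partition's region.
    void partition_pass(const std::vector<JoinTuple>& in,
                        std::vector<JoinTuple>& out,
                        unsigned log_parts, unsigned shift) {
        const uint32_t mask = (1u << log_parts) - 1;
        std::vector<uint32_t> hist(1u << log_parts, 0);
        for (const JoinTuple& t : in)                   // Step P1
            ++hist[(t.key >> shift) & mask];
        uint32_t sum = 0;
        for (uint32_t& h : hist) {                      // Step P2
            uint32_t count = h;
            h = sum;                                    // start offset
            sum += count;
        }
        out.resize(in.size());
        for (const JoinTuple& t : in)                   // Step P3
            out[hist[(t.key >> shift) & mask]++] = t;
    }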

[Figure: partitioning a (key, rid) sub-table. Step P1 counts tuples per partition into Hist[], Step P2 prefix-sums Hist[] into scatter offsets, and Step P3 permutes the tuples into their partitions.]



Build Phase

  • Build the hash table of the inner sub-table

  • Colliding/duplicate elements are stored consecutively in the hash table

    • Speeds up join-key comparison in the probe phase

  • Similar to the partition phase
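
Since colliding and duplicate keys must end up consecutive, the build reuses the histogram/prefix-sum/scatter pattern of the partition phase, just indexed by hash bucket. A sketch, with hash() as an assumed bucket hash:

    #include <cstdint>
    #include <vector>

    struct JoinTuple { uint32_t key; uint32_t rid; };

    uint32_t hash(uint32_t key);   // assumed bucket hash function

    // Permute the inner sub-table so each bucket's tuples are consecutive.
    // Afterwards hist[j] .. hist[j+1] delimits bucket j.
    void build(const std::vector<JoinTuple>& inner, uint32_t num_buckets,
               std::vector<JoinTuple>& permuted, std::vector<uint32_t>& hist) {
        hist.assign(num_buckets + 1, 0);
        for (const JoinTuple& t : inner)              // histogram, shifted by 1
            ++hist[hash(t.key) % num_buckets + 1];
        for (uint32_t j = 1; j <= num_buckets; ++j)   // prefix sum
            hist[j] += hist[j - 1];
        std::vector<uint32_t> cursor(hist.begin(), hist.end() - 1);
        permuted.resize(inner.size());
        for (const JoinTuple& t : inner)              // scatter
            permuted[cursor[hash(t.key) % num_buckets]++] = t;
    }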

[Figure: the build phase applies a hash function to the inner sub-table's keys and permutes the (key, rid) tuples so that each bucket's entries are consecutive; Hist[j] records where bucket j begins in the permuted sub-table.]



Probe Phase

  • Probe the outer sub-table against the hash table of the inner sub-table

    • For each outer tuple, apply the hash function to find the histogram index (j)

    • Compare the outer key with the inner keys stored between Hist[j] and Hist[j+1]

    • Duplicate/colliding keys are potential matches

    • When matching keys are found, append the tuples to the output table
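
The matching probe loop for that layout (same assumed hash() and bucket layout as in the build sketch):

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct JoinTuple { uint32_t key; uint32_t rid; };

    uint32_t hash(uint32_t key);   // same assumed bucket hash as in build()

    // For each outer tuple, scan the inner bucket delimited by hist[j] and
    // hist[j+1]; the key comparison filters out hash collisions.
    void probe(const std::vector<JoinTuple>& outer,
               const std::vector<JoinTuple>& permuted,   // built inner table
               const std::vector<uint32_t>& hist, uint32_t num_buckets,
               std::vector<std::pair<uint32_t, uint32_t>>& out) {
        for (const JoinTuple& t : outer) {
            const uint32_t j = hash(t.key) % num_buckets;
            for (uint32_t i = hist[j]; i < hist[j + 1]; ++i)
                if (permuted[i].key == t.key)
                    out.emplace_back(t.rid, permuted[i].rid);   // match
        }
    }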

[Figure: the probe phase hashes an outer key to histogram index j and compares it against the inner tuples stored between Hist[j] and Hist[j+1] in the permuted inner sub-table.]





Parallelizing Partition Phase


  • Equally divide the input tuples among threads; each thread maintains a local histogram

  • Explicit barrier at the end of each partition step

  • Use task queuing for dynamic load balancing

    • An idle thread steals tasks from other threads
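
A sketch of this structure with OpenMP; the paper uses its own task-queue threading, so this only illustrates the local histograms and barrier placement, and part_of() is a hypothetical partition function. The prefix sum over all local histograms hands every thread private, conflict-free output ranges:

    #include <omp.h>
    #include <cstdint>
    #include <vector>

    struct JoinTuple { uint32_t key; uint32_t rid; };

    uint32_t part_of(const JoinTuple& t);   // hypothetical partition function

    void parallel_partition(const std::vector<JoinTuple>& in,
                            std::vector<JoinTuple>& out, int num_parts) {
        const int T = omp_get_max_threads();
        std::vector<uint32_t> local(size_t(T) * num_parts, 0);
        out.resize(in.size());
        #pragma omp parallel
        {
            uint32_t* hist = &local[size_t(omp_get_thread_num()) * num_parts];
            #pragma omp for schedule(static)      // Step 1: local histograms
            for (int64_t i = 0; i < int64_t(in.size()); ++i)
                ++hist[part_of(in[i])];
            // implicit barrier here
            #pragma omp single                    // Step 2: prefix sum over
            {                                     // all local histograms
                uint32_t sum = 0;
                for (int p = 0; p < num_parts; ++p)
                    for (int t = 0; t < T; ++t) {
                        uint32_t c = local[size_t(t) * num_parts + p];
                        local[size_t(t) * num_parts + p] = sum;
                        sum += c;
                    }
            }                                     // implicit barrier again
            // Step 3: scatter. The identical static schedule guarantees each
            // thread revisits exactly the tuples it counted, so the private
            // offsets in `hist` are conflict-free.
            #pragma omp for schedule(static)
            for (int64_t i = 0; i < int64_t(in.size()); ++i)
                out[hist[part_of(in[i])]++] = in[i];
        }
    }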

[Figure: each thread builds a local histogram over its share of the tuples (Step 1); a prefix sum over all local histograms assigns output ranges (Step 2); threads scatter their tuples into partitions P0, P1, … (Step 3), with barriers between steps.]


Parallelizing the Join Phase

  • Input distribution can cause severe load imbalance in this phase

  • Propose a three-phase fine-grained parallelization (sketched after this list)

  • Phase I

    • Each thread picks a pair of sub-tables and performs join independently

    • Sub-tables larger than a threshold are deferred to the next phase

  • Phase II

    • All threads work together to join a pair of “large” sub-tables

    • Build the hash table in parallel

    • Equally divide the outer sub-table and probe in parallel

    • Tuples with a large number of potential matches are deferred to the next phase

  • Phase III

    • All threads work together to compare a key against its potentially matching inner tuples

    • Used when data are heavily skewed (e.g., Zipf distribution)

    • Equally divide the potential matches among threads and compare in parallel
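
A structural sketch of Phases I and II (all helper names are invented; the slides describe these routines but do not name them):

    #include <cstdint>
    #include <vector>

    struct JoinTuple { uint32_t key; uint32_t rid; };
    struct SubTablePair { std::vector<JoinTuple> inner, outer; };

    void join_serial(SubTablePair&);     // hypothetical: one thread, one pair
    void build_parallel(SubTablePair&);  // hypothetical: shared hash build
    void probe_parallel(SubTablePair&);  // hypothetical: outer split evenly

    void join_all(std::vector<SubTablePair>& pairs, size_t threshold) {
        std::vector<SubTablePair*> large;
        #pragma omp parallel for schedule(dynamic)    // Phase I: pair = task
        for (int64_t i = 0; i < int64_t(pairs.size()); ++i) {
            if (pairs[i].inner.size() > threshold) {
                #pragma omp critical
                large.push_back(&pairs[i]);           // defer "large" pairs
            } else {
                join_serial(pairs[i]);                // pair fits in cache
            }
        }
        for (SubTablePair* p : large) {               // Phase II: all threads
            build_parallel(*p);                       // cooperate on one pair
            probe_parallel(*p);
        }
        // Phase III (not shown): probes with very long candidate lists are
        // split again so all threads compare potential matches in parallel.
    }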





Data-level Parallelism

  • Perform the same operation on K elements (K = 4 with current SSE)

    • Operate on K consecutive tuples simultaneously

  • Challenges

    • Data Gather

      • Pack the elements together from K distinct memory locations

      • Compare K join keys to check for matches

    • Data Scatter

      • Write the elements to non-contiguous memory locations

      • Permute K tuples in the partition phase

    • SIMD update conflict

      • Elements in different SIMD lanes are written to the same memory location

      • Histogram update

  • Current CPU SSE architectures do not support the above features

    • Negligible SIMD benefit for hash join
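
To make the gather challenge concrete: SSE offers no gather instruction, so loading four inner keys from four bucket locations degenerates into scalar loads plus packing, as in this sketch. Scatter and histogram updates hit the analogous problems, which is why the SIMD benefit for hash join is negligible:

    #include <cstdint>
    #include <smmintrin.h>   // SSE4.1

    // Emulated 4-wide gather: extract each index, then load each key with a
    // scalar memory access. The memory traffic is identical to scalar code;
    // only the final comparison runs in SIMD.
    static inline __m128i gather4(const int32_t* table, __m128i idx) {
        return _mm_set_epi32(table[_mm_extract_epi32(idx, 3)],
                             table[_mm_extract_epi32(idx, 2)],
                             table[_mm_extract_epi32(idx, 1)],
                             table[_mm_extract_epi32(idx, 0)]);
    }

    // Compare four probe keys against four gathered inner keys at once.
    static inline __m128i match4(__m128i probe_keys, const int32_t* table,
                                 __m128i idx) {
        return _mm_cmpeq_epi32(probe_keys, gather4(table, idx));
    }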





Sort-merge Join

  • Based on the merge sort of Chhugani et al. [VLDB’08]

    • Extended to handle (key, rid) pairs

  • Exploiting DLP

    • Use a bitonic merge network

    • Good SIMD benefit

  • Exploiting TLP

    • Perform a parallel merge

  • Bandwidth-friendly multiway merge

    • Data is read/written only once from/to the main memory
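
For illustration, one 4-wide bitonic merge step with SSE4 intrinsics: two sorted registers of 32-bit keys in, eight sorted values out, built entirely from min/max and shuffles. This is a keys-only sketch; the paper's kernel also carries the 4-byte rids alongside the keys:

    #include <smmintrin.h>   // SSE4.1: _mm_min_epi32 / _mm_max_epi32

    // Merge two ascending-sorted vectors a, b into sorted (lo, hi).
    static inline void bitonic_merge_4x4(__m128i a, __m128i b,
                                         __m128i& lo, __m128i& hi) {
        b = _mm_shuffle_epi32(b, _MM_SHUFFLE(0, 1, 2, 3));  // reverse b ->
                                                            // bitonic sequence
        __m128i mn = _mm_min_epi32(a, b);               // level 1: distance 4
        __m128i mx = _mm_max_epi32(a, b);

        __m128i s1 = _mm_unpacklo_epi64(mn, mx);        // level 2: distance 2
        __m128i s2 = _mm_unpackhi_epi64(mn, mx);
        mn = _mm_min_epi32(s1, s2);
        mx = _mm_max_epi32(s1, s2);

        s1 = _mm_unpacklo_epi32(mn, mx);                // level 3: distance 1
        s2 = _mm_unpackhi_epi32(mn, mx);
        __m128i u = _mm_unpacklo_epi64(s1, s2);
        __m128i v = _mm_unpackhi_epi64(s1, s2);
        mn = _mm_min_epi32(u, v);
        mx = _mm_max_epi32(u, v);

        lo = _mm_unpacklo_epi32(mn, mx);                // interleave back
        hi = _mm_unpackhi_epi32(mn, mx);
    }

Because the network is pure min/max and shuffles, it maps directly onto wider SIMD registers, which underlies the projection later in the deck.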



Experimental Setup

  • Tuples consist of (4B key, 4B rid) pairs

  • Vary the number of tuples from 64K to 128M

  • Various distributions

    • Change the percentage of tuples with matches

    • Change cardinalities

    • Use heavily skewed data drawn from a Zipf distribution

  • Intel Core i7

    • 3.2GHz, Quad-Core, 2 SMT threads per core

    • 32KB L1, 256KB L2, 6MB L3

    • 64-entry L1 TLB, 512-entry L2 TLB

  • Metric: cycles per tuple

    • 32 cycles per tuple = 100M tuples per second at 3.2GHz



Benefit of Partitioning

[Chart: cycles per tuple vs. number of partitions, comparing 1-pass, 2-pass, and 3-pass partitioning.]

  • 128M tuples, Uniform distribution

  • 16K-partition shows the best performance

  • 7X better than a non-partitioned join

  • Fastest reported hash join performance

    • 100M tuples per second

    • 8X faster than the best reported GPU (8800GTX) number [He et al., SIGMOD’08]



Various Input Distributions

=> Stable across various input distributions


Current Sort vs. Hash

  • Sort-merge join

    • 2.3X SIMD speedup

    • 3.6X parallel speedup

  • Fastest reported (key, rid) sorting performance

    • Sorting cost ≈ 1.25 (cycles per element per iteration) × N × log₂N (N: number of elements)

    • 50M tuples per second (two tables of 128M tuples)

  • Hash join is still up to 2X faster than sort-merge join
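
A rough consistency check on those figures: sorting one table of N = 128M = 2^27 tuples at 1.25 cycles per element per iteration costs about 1.25 × 27 ≈ 34 cycles per tuple, so sorting both tables comes to roughly 68 cycles per tuple; at 3.2GHz that is on the order of 50M tuples per second, matching the number above.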


Future Sort vs. Hash (Projected)

  • Future architecture trends favor sort-merge join

    • Wider SIMD

      • 256-bit wide (Intel AVX) and 512-bit wide (Intel Larrabee) SIMD

    • Limited per-core memory bandwidth

      • Merge sort-based join has fewer memory accesses than hash-based join


Conclusions

  • Optimized hash join and sort-merge join for modern CPUs

    • Exploit TLP/DLP

  • Fastest reported hash join and sort-merge join performance

  • Future architecture trends show better performance potential for sort-merge join

  • Ongoing work: optimizing join algorithms for the upcoming Intel Larrabee architecture


Thank you! Questions?

