
Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems


Presentation Transcript


  1. Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems. Martina Albutiu, Alfons Kemper, and Thomas Neumann. Technische Universität München

  2. Hardware trends … • Huge main memory • Massive processing parallelism • Non-uniform Memory Access (NUMA) • Our server: 4 CPUs, 32 cores, 1 TB RAM, 4 NUMA partitions

  3. Main memory database systems • VoltDB, Hana, MonetDB • HyPer: real-time business intelligence queries on transactional data* (* http://www-db.in.tum.de/research/projects/HyPer/)

  4. How to exploit these hardware trends? • Parallelize algorithms • Exploit fast main memory access: Kim, Sedlar, Chhugani: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. VLDB '09; Blanas, Li, Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD '11 • AND be aware of fast local vs. slow remote NUMA access

  5. Ignoring NUMA [diagram: cores 1–8, spread over NUMA partitions 1–4, all accessing one shared hash table]

  6. How much difference does NUMA make? [bar chart: scaled execution time for sort (local vs. remote), partitioning (sequential vs. synchronized), and merge join with sequential read (local vs. remote); the remote/synchronized variants are slower in every case]

  7. The three NUMA commandments. C1 Thou shalt not write thy neighbor's memory randomly -- chunk the data, redistribute, and then sort/work on your data locally. C2 Thou shalt read thy neighbor's memory only sequentially -- let the prefetcher hide the remote access latency. C3 Thou shalt not wait for thy neighbors -- don't use fine-grained latching or locking and avoid synchronization points of parallel threads.
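
In practice, commandment C1 means pinning each worker to one NUMA node and keeping its chunk in node-local memory. A minimal sketch, assuming Linux with libnuma (the talk does not prescribe any particular API); the chunk contents and sizes are placeholders. Build with `g++ -O2 -pthread mpsm_numa.cpp -lnuma`.

```cpp
#include <numa.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

static void worker(int node, size_t chunkTuples) {
    numa_run_on_node(node);                        // pin this worker to its NUMA node
    size_t bytes = chunkTuples * sizeof(uint64_t);
    auto* chunk = static_cast<uint64_t*>(numa_alloc_onnode(bytes, node));
    for (size_t i = 0; i < chunkTuples; ++i)       // placeholder fill for this worker's share
        chunk[i] = (i * 2654435761ULL) ^ static_cast<uint64_t>(node);
    std::sort(chunk, chunk + chunkTuples);         // C1/C3: sort locally, no remote writes, no locks
    numa_free(chunk, bytes);
}

int main() {
    if (numa_available() < 0) return 1;            // NUMA support / libnuma not usable
    int nodes = numa_num_configured_nodes();
    std::vector<std::thread> workers;
    for (int n = 0; n < nodes; ++n)
        workers.emplace_back(worker, n, size_t{1} << 20);
    for (auto& t : workers) t.join();
}
```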

  8. Basic idea of MPSM [diagram: relation R divided into per-worker R chunks, relation S divided into per-worker S chunks]

  9. Basic idea of MPSM • C1: Work locally: sort • C3: Work independently: sort and merge join • C2: Access neighbor's data only sequentially [diagram: each worker sorts its R chunk and its S chunk locally, then merge joins (MJ) its R run with all S runs]
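
A compact per-worker sketch of this idea; the types, the barrier placement, and the helper layout are illustrative assumptions made here, not the authors' implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key, payload; };
using Run = std::vector<Tuple>;                     // one NUMA-local chunk / sorted run
struct JoinResult { Tuple r, s; };

// One worker's view: it owns one R chunk and one S chunk in local memory and can
// see every worker's sorted S run during the join phase.
void mpsmWorker(Run& myR, Run& myS,
                const std::vector<const Run*>& allSortedSRuns,
                std::vector<JoinResult>& out) {
    auto byKey = [](const Tuple& a, const Tuple& b) { return a.key < b.key; };

    // "Sort locally" (C1, C3): both private chunks are sorted in local memory,
    // independently of all other workers.
    std::sort(myS.begin(), myS.end(), byKey);
    std::sort(myR.begin(), myR.end(), byKey);
    // (a barrier is needed here so that every worker's S run is sorted before joining)

    // "Merge join" (C2, C3): join the own R run against every S run, scanning each
    // remote run strictly sequentially so the prefetcher hides the NUMA latency.
    for (const Run* s : allSortedSRuns) {
        size_t i = 0, j = 0;
        while (i < myR.size() && j < s->size()) {
            if (myR[i].key < (*s)[j].key)      ++i;
            else if ((*s)[j].key < myR[i].key) ++j;
            else {
                for (size_t j2 = j; j2 < s->size() && (*s)[j2].key == myR[i].key; ++j2)
                    out.push_back({myR[i], (*s)[j2]});
                ++i;                            // R duplicates re-emit against the same S group
            }
        }
    }
}
```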

  10. Range partitioning of private input R • To constrain merge join work • To provide scalability in the number of parallel workers

  11. Range partitioning of private input R • To constrain merge join work • To provide scalability in the number of parallel workers [diagram: the R chunks are range partitioned into range-partitioned R chunks]

  12. Range partitioning of private input R • To constrain merge join work • To provide scalability in the number of parallel workers  S is implicitly partitioned [diagram: the range-partitioned R chunks and the S chunks are each sorted locally]

  13. Range partitioning of private input R • To constrain merge join work • To provide scalability in the number of parallel workers  S is implicitly partitioned [diagram: each sorted R partition is merge joined (MJ) only with the relevant parts of the sorted S runs]
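
Because worker i's R partition covers one key range [lo, hi), only a contiguous slice of every sorted S run can contain its join partners. A hedged sketch of locating that slice; using std::lower_bound for the narrowing is an assumption made here for illustration, since sorted runs can also be skipped by other means.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Tuple { uint64_t key, payload; };
using Run = std::vector<Tuple>;

// Returns the slice [first, last) of the sorted S run s whose keys lie in [lo, hi),
// i.e. the only part worker i has to merge join against.
std::pair<const Tuple*, const Tuple*>
relevantPart(const Run& s, uint64_t lo, uint64_t hi) {
    auto keyLess = [](const Tuple& t, uint64_t k) { return t.key < k; };
    const Tuple* begin = s.data();
    const Tuple* end   = s.data() + s.size();
    const Tuple* first = std::lower_bound(begin, end, lo, keyLess);
    const Tuple* last  = std::lower_bound(first, end, hi, keyLess);
    return {first, last};
}
```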

  14. Range partitioning of private input • Time efficient: branch-free, comparison-free, and synchronization-free • Space efficient: densely packed, in-place, by using radix-clustering and precomputed target partitions to scatter the data to
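
A sketch of this partitioning pass: each worker builds a histogram over its private chunk from the topmost radix bits of the join key, a prefix sum over all workers' histograms yields each worker's exclusive write offsets, and a second pass scatters the tuples without any comparisons or synchronization. The names and the choice of radix bits are illustrative assumptions, and the sketch scatters into a separate, densely packed target area rather than reproducing the paper's in-place variant.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key, payload; };

// Histogram pass over one worker's chunk; numPartitions must be a power of two,
// shift selects the radix bits of the key.
std::vector<size_t> buildHistogram(const std::vector<Tuple>& chunk,
                                   unsigned shift, size_t numPartitions) {
    std::vector<size_t> hist(numPartitions, 0);
    const size_t mask = numPartitions - 1;
    for (const Tuple& t : chunk) ++hist[(t.key >> shift) & mask];
    return hist;
}

// Exclusive prefix sums over all workers' histograms, laid out partition-major so
// that every partition occupies one contiguous, densely packed region of the
// target area and every worker owns a private sub-range within it.
std::vector<std::vector<size_t>>
computePrefixSums(const std::vector<std::vector<size_t>>& hists) {
    size_t workers = hists.size(), parts = hists[0].size(), offset = 0;
    std::vector<std::vector<size_t>> prefix(workers, std::vector<size_t>(parts));
    for (size_t p = 0; p < parts; ++p)
        for (size_t w = 0; w < workers; ++w) {
            prefix[w][p] = offset;
            offset += hists[w][p];
        }
    return prefix;
}

// Scatter pass: the bucket is computed from the radix bits (comparison-free) and
// every write goes to a precomputed, worker-private slot (synchronization-free).
void radixPartitionChunk(const std::vector<Tuple>& chunk,
                         Tuple* target,                       // global target area
                         const std::vector<size_t>& myPrefix, // this worker's start offsets
                         unsigned shift, size_t numPartitions) {
    const size_t mask = numPartitions - 1;
    std::vector<size_t> pos = myPrefix;                       // running write positions
    for (const Tuple& t : chunk)
        target[pos[(t.key >> shift) & mask]++] = t;
}
```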

  15. Range partitioning of private input [worked example: workers W1 and W2 inspect the high radix bit of each key in their chunks (e.g. 19 = 10011, 7 = 00111, 17 = 10001, 2 = 00010), build per-worker histograms over the buckets <16 and ≥16, and derive prefix sums that give each worker its write offsets in the target partitions]

  16. Range partitioning of private input [worked example continued: using its prefix sums, each worker scatters its tuples into the precomputed target partitions without any synchronization]

  17. Real C hacker at work …

  18. Skew resilience of MPSM • Location skew is implicitly handled • Distribution skew: • Dynamically computed partition bounds • Determined based on the global data distributions of R and S • Cost balancing for sorting R and joining R and S

  19. Skew resilience 1. Global S data distribution • Local equi-height histograms (for free) • Combined to a CDF [chart: the CDF (# tuples over key value) combined from the local equi-height histograms of S1 and S2]
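
One way to read this step: since each S chunk is already sorted, sampling every k-th key gives a local equi-height histogram "for free", and merging all boundary keys yields a global CDF approximation. The sketch below follows that reading; the function name and the exact combination rule are assumptions, not the paper's definition.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One CDF point: roughly `tuples` keys of S are <= `key`.
struct CdfPoint { uint64_t key; size_t tuples; };

std::vector<CdfPoint>
combineToCdf(const std::vector<std::vector<uint64_t>>& sortedSRuns, size_t bucketSize) {
    std::vector<uint64_t> boundaries;
    for (const auto& run : sortedSRuns)              // every bucketSize-th key is a boundary
        for (size_t i = bucketSize; i <= run.size(); i += bucketSize)
            boundaries.push_back(run[i - 1]);
    std::sort(boundaries.begin(), boundaries.end());
    std::vector<CdfPoint> cdf;
    for (size_t i = 0; i < boundaries.size(); ++i)   // each boundary represents bucketSize tuples
        cdf.push_back({boundaries[i], (i + 1) * bucketSize});
    return cdf;
}
```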

  20. Skew resilience 2. Global R data distribution • Local equi-width histograms as before • More fine-grained histograms [example: a fine-grained local histogram over R1 with buckets <8, [8,16), [16,24), ≥24, derived from the radix bits of the keys (e.g. 2 = 00010, 8 = 01000)]

  21. Skew resilience 3. Compute splitters so that the overall workloads are balanced*: greedily combine buckets, thereby balancing the costs of each thread for sorting R and joining R and S [diagram: the S CDF and the fine-grained R histogram are combined to derive the splitter values] * Ross and Cieslewicz: Optimal Splitters for Database Partitioning with Size Bounds. ICDT '09
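
A sketch of the greedy splitter computation: walk over the fine-grained R histogram buckets, estimate each bucket's cost from its R tuples plus the S tuples the CDF places in the same key range, and close a partition whenever the accumulated cost reaches the per-worker budget. The simple additive cost model below is an assumption for illustration, not the paper's cost function.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One fine-grained key-range bucket, assembled from the R histogram and the S CDF.
struct Bucket { uint64_t upperKey; size_t rTuples; size_t sTuples; };

std::vector<uint64_t> computeSplitters(const std::vector<Bucket>& buckets, size_t numWorkers) {
    double total = 0;
    for (const Bucket& b : buckets) total += b.rTuples + b.sTuples;   // simplistic cost: |R| + |S|
    const double budget = total / numWorkers;

    std::vector<uint64_t> splitters;                                  // numWorkers - 1 splitter keys
    double acc = 0;
    for (const Bucket& b : buckets) {
        acc += b.rTuples + b.sTuples;
        if (acc >= budget && splitters.size() + 1 < numWorkers) {     // close this partition
            splitters.push_back(b.upperKey);
            acc = 0;
        }
    }
    return splitters;
}
```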

  22. Performance evaluation • MPSM performance in a nutshell: • 160 million tuples joined per second • 27 billion tuples joined in less than 3 minutes • Scales linearly with the number of cores • Platform HyPer1: • Linux server • 1 TB RAM • 4 CPUs with 8 physical cores each • Benchmark: • Join tables R and S with schema {[joinkey: 64 bit, payload: 64 bit]} • Dataset sizes ranging from 50 GB to 400 GB

  23. Execution time comparison • MPSM, Vectorwise (VW), and the Blanas hash join* • 32 workers • |R| = 1600 million tuples (25 GB), varying size of S. * S. Blanas, Y. Li, and J. M. Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD 2011

  24. Scalability in the number of cores • MPSM and Vectorwise (VW) • |R| = 1600 million tuples (25 GB), |S| = 4 · |R|

  25. Location skew • Location skew in R has no effect because of the repartitioning • Location skew in S: in the extreme case all join partners of Ri are found in only one Sj (either local or remote)

  26. Distribution skew: anti-correlated data [charts: execution without balanced partitioning vs. with balanced partitioning]

  27. Distribution skew: anti-correlated data

  28. Conclusions • MPSM is a sort-based parallel join algorithm • MPSM is NUMA-aware & NUMA-oblivious • MPSM is space efficient (works in-place) • MPSM scales linearly in the number of cores • MPSM is skew resilient • MPSM outperforms Vectorwise (4X) and Blanas et al.'s hash join (18X) • MPSM is adaptable for disk-based processing • See details in the paper

  29. Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems. Martina Albutiu, Alfons Kemper, and Thomas Neumann. Technische Universität München. THANK YOU FOR YOUR ATTENTION!
