Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems
Martina Albutiu, Alfons Kemper, and Thomas Neumann
Technische Universität München
Hardware trends …
• Huge main memory
• Massive processing parallelism
• Non-uniform Memory Access (NUMA)
• Our server:
  • 4 CPUs
  • 32 cores
  • 1 TB RAM
  • 4 NUMA partitions
Main memory database systems
• VoltDB, Hana, MonetDB
• HyPer: real-time business intelligence queries on transactional data*
* http://www-db.in.tum.de/research/projects/HyPer/
How to exploit these hardware trends?
• Parallelize algorithms
• Exploit fast main memory access
  Kim, Sedlar, Chhugani: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. VLDB '09
  Blanas, Li, Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD '11
• AND be aware of fast local vs. slow remote NUMA access
Ignoring NUMA
[Figure: cores 1–8, spread across NUMA partitions 1–4, all probing a single shared hash table across partition boundaries]
How much difference does NUMA make? (scaled execution times)
• sort: 22756 ms remote vs. 7440 ms local (~3x)
• partitioning: 417344 ms synchronized vs. 12946 ms sequential (~32x)
• merge join (sequential read): 1000 ms remote vs. 837 ms local (~1.2x)
The three NUMA commandments
• C1: Thou shalt not write thy neighbor's memory randomly -- chunk the data, redistribute, and then sort/work on your data locally.
• C2: Thou shalt read thy neighbor's memory only sequentially -- let the prefetcher hide the remote access latency.
• C3: Thou shalt not wait for thy neighbors -- don't use fine-grained latching or locking and avoid synchronization points of parallel threads.
Basic idea of MPSM
• C1: Work locally: sort
• C3: Work independently: sort and merge join
• C2: Access neighbor's data only sequentially
[Figure: R is split into per-worker chunks and sorted locally; each sorted R chunk is merge joined (MJ) against the locally sorted S chunks]
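To make the basic idea concrete, here is a minimal single-node sketch of the two MPSM phases, assuming the inputs are already split into per-worker chunks (all names are hypothetical; a real implementation would additionally pin each worker and its chunks to one NUMA node): every worker first sorts its private R and S chunks (C1), and after a single phase barrier independently merge joins its sorted R run against all sorted S runs, scanning the remote ones only sequentially (C2, C3).

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

struct Tuple { uint64_t key, payload; };
using Chunk = std::vector<Tuple>;

// Textbook merge join of two sorted runs; emits all matching pairs.
static void mergeJoin(const Chunk& r, const Chunk& s,
                      std::vector<std::pair<Tuple, Tuple>>& out) {
    size_t i = 0, j = 0;
    while (i < r.size() && j < s.size()) {
        if      (r[i].key < s[j].key) ++i;
        else if (r[i].key > s[j].key) ++j;
        else {  // emit r[i] paired with the whole run of equal S keys
            for (size_t j2 = j; j2 < s.size() && s[j2].key == r[i].key; ++j2)
                out.emplace_back(r[i], s[j2]);
            ++i;
        }
    }
}

// results must contain one (possibly empty) output vector per worker.
void mpsmJoin(std::vector<Chunk>& rChunks, std::vector<Chunk>& sChunks,
              std::vector<std::vector<std::pair<Tuple, Tuple>>>& results) {
    const size_t W = rChunks.size();
    auto byKey = [](const Tuple& a, const Tuple& b) { return a.key < b.key; };
    std::vector<std::thread> pool;
    // Phase 1 (C1): sort all chunks locally, no cross-worker communication.
    for (size_t w = 0; w < W; ++w)
        pool.emplace_back([&, w] {
            std::sort(rChunks[w].begin(), rChunks[w].end(), byKey);
            std::sort(sChunks[w].begin(), sChunks[w].end(), byKey);
        });
    for (auto& t : pool) t.join();  // the only barrier between the phases
    pool.clear();
    // Phase 2 (C2/C3): each worker independently merge joins its R run
    // against every sorted S run; remote runs are read only sequentially.
    for (size_t w = 0; w < W; ++w)
        pool.emplace_back([&, w] {
            for (const Chunk& sRun : sChunks)
                mergeJoin(rChunks[w], sRun, results[w]);
        });
    for (auto& t : pool) t.join();
}
```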
Range partitioning of private input R
• To constrain merge join work
• To provide scalability in the number of parallel workers
• S is implicitly partitioned
[Figure, built up over four slides: the R chunks are range partitioned and then sorted; the S chunks are only sorted locally; each worker then merge joins (MJ) its R partition against only the relevant parts of the sorted S chunks -- see the sketch below]
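As a hedged illustration of this implicit partitioning of S (the helper name is mine, and locating the relevant range by plain binary search is a simplification): once R is range partitioned, a worker that owns keys in [lo, hi) only needs the sub-range of each sorted S run that can possibly match, which it can find with two searches and then scan sequentially. This reuses Tuple, Chunk, and mergeJoin from the sketch above.

```cpp
#include <algorithm>

// Join one worker's range-partitioned R chunk (keys in [lo, hi)) against
// the relevant slice of every sorted S run. S itself is never repartitioned.
void joinRelevantParts(const Chunk& rPart, uint64_t lo, uint64_t hi,
                       const std::vector<Chunk>& sortedSRuns,
                       std::vector<std::pair<Tuple, Tuple>>& out) {
    auto keyLess = [](const Tuple& t, uint64_t k) { return t.key < k; };
    for (const Chunk& s : sortedSRuns) {
        auto first = std::lower_bound(s.begin(), s.end(), lo, keyLess);
        auto last  = std::lower_bound(s.begin(), s.end(), hi, keyLess);
        Chunk slice(first, last);      // conceptually just a view into the run
        mergeJoin(rPart, slice, out);  // sequential scan of the relevant part
    }
}
```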
Range partitioning of private input
• Time efficient: branch-free, comparison-free, synchronization-free
• Space efficient: densely packed, in-place
• by using radix-clustering and precomputed target partitions to scatter data to
Range partitioning of private input
[Figure, built up over two slides: each worker (W1, W2) computes a histogram over the radix bits of its chunk's join keys -- e.g., 19 = 10011 and 7 = 00111 fall into the '≥16' and '<16' buckets by their high bit; the histograms are combined into prefix sums that give every worker disjoint write offsets, and the tuples are then scattered to their precomputed target partitions]
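The figure's histogram/prefix-sum trick can be written down compactly. Below is a hedged sketch (the function name is mine, and where the real implementation scatters in place, this one writes a fresh array for brevity) assuming the partition bounds are radix bounds, so a key's target partition is just its top bits -- with shift = 4, the keys 19 = 10011 and 7 = 00111 land in '≥16' and '<16'. Because every worker's write offsets are precomputed by the prefix sums, the scatter needs no comparisons, no branches, and no synchronization:

```cpp
#include <cstdint>
#include <vector>

std::vector<Tuple> radixPartition(const std::vector<Chunk>& chunks,
                                  unsigned radixBits, unsigned shift) {
    const size_t P = size_t{1} << radixBits;  // number of range partitions
    const size_t W = chunks.size();           // number of workers
    // Pass 1: each worker counts its keys per partition (runs in parallel
    // in the real system; shown sequentially here).
    std::vector<std::vector<size_t>> hist(W, std::vector<size_t>(P, 0));
    size_t total = 0;
    for (size_t w = 0; w < W; ++w) {
        for (const Tuple& t : chunks[w]) ++hist[w][(t.key >> shift) & (P - 1)];
        total += chunks[w].size();
    }
    // Prefix sums over (partition, worker): worker w owns the disjoint
    // output range starting at off[w][p] for partition p.
    std::vector<std::vector<size_t>> off(W, std::vector<size_t>(P));
    size_t run = 0;
    for (size_t p = 0; p < P; ++p)
        for (size_t w = 0; w < W; ++w) { off[w][p] = run; run += hist[w][p]; }
    // Pass 2: branch-free, comparison-free scatter to precomputed targets;
    // the result is densely packed and grouped by partition.
    std::vector<Tuple> out(total);
    for (size_t w = 0; w < W; ++w)
        for (const Tuple& t : chunks[w])
            out[off[w][(t.key >> shift) & (P - 1)]++] = t;
    return out;
}
```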
Skew resilience of MPSM
• Location skew is implicitly handled
• Distribution skew:
  • Dynamically computed partition bounds
  • Determined based on the global data distributions of R and S
  • Cost balancing for sorting R and joining R and S
Skew resilience
1. Global S data distribution
• Local equi-height histograms (for free)
• Combined to CDF
[Figure: the sorted runs S1 and S2 yield equi-height histograms whose bucket boundaries are merged into a CDF of # tuples over the key values]
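A small sketch of how the global CDF could be assembled (my own formulation: it takes every bucketSize-th key of each sorted S run as a boundary -- those equi-height histograms come "for free" from sorting -- and, as a simplification, computes exact ranks by binary search rather than merging histogram counts):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Approximate CDF of S: sample key -> number of S tuples with key <= sample.
std::vector<std::pair<uint64_t, size_t>>
buildCDF(const std::vector<Chunk>& sortedSRuns, size_t bucketSize) {
    // Every bucketSize-th key of a sorted run is an equi-height bucket
    // boundary -- no extra pass over the data is needed to obtain it.
    std::vector<uint64_t> samples;
    for (const Chunk& run : sortedSRuns)
        for (size_t i = bucketSize; i <= run.size(); i += bucketSize)
            samples.push_back(run[i - 1].key);
    std::sort(samples.begin(), samples.end());
    // Rank of each sample = sum of its positions in all sorted runs.
    auto keyLess = [](uint64_t k, const Tuple& t) { return k < t.key; };
    std::vector<std::pair<uint64_t, size_t>> cdf;
    for (uint64_t k : samples) {
        size_t rank = 0;
        for (const Chunk& run : sortedSRuns)
            rank += std::upper_bound(run.begin(), run.end(), k, keyLess)
                    - run.begin();
        cdf.emplace_back(k, rank);
    }
    return cdf;
}
```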
Skew resilience
2. Global R data distribution
• Local equi-width histograms as before
• More fine-grained histograms
[Figure: a worker's chunk R1 with keys such as 2 = 00010 and 8 = 01000 is summarized by a histogram over the buckets <8, [8,16), [16,24), ≥24]
Skew resilience
3. Compute splitters so that the overall workloads are balanced*: greedily combine buckets so that the costs of each thread for sorting R and joining R and S are balanced
[Figure: the fine-grained R histogram and the S CDF are combined to place the splitters]
* Ross and Cieslewicz: Optimal Splitters for Database Partitioning with Size Bounds. ICDT '09
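A hedged sketch of such a greedy splitter computation (the cost model here is deliberately simplified to "R tuples to sort + S tuples to join" per histogram bucket, with the S counts taken as deltas of the CDF; the cited ICDT '09 paper gives the exact algorithm with size bounds):

```cpp
#include <cstdint>
#include <vector>

struct Bucket {
    uint64_t hiKey;   // upper key bound of this histogram bucket
    size_t   rCount;  // R tuples in the bucket (from the R histogram)
    size_t   sCount;  // S tuples in the bucket (delta of the S CDF)
};

// Greedily close a partition once its accumulated cost reaches an even
// share of the total, so each worker sorts and joins about the same amount.
std::vector<uint64_t> computeSplitters(const std::vector<Bucket>& buckets,
                                       size_t workers) {
    double total = 0;
    for (const Bucket& b : buckets) total += double(b.rCount + b.sCount);
    const double share = total / double(workers);
    std::vector<uint64_t> splitters;
    double acc = 0;
    for (const Bucket& b : buckets) {
        acc += double(b.rCount + b.sCount);
        if (acc >= share && splitters.size() + 1 < workers) {
            splitters.push_back(b.hiKey);  // partition boundary
            acc = 0;
        }
    }
    return splitters;  // workers - 1 splitters define the key ranges
}
```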
Performance evaluation
• MPSM performance in a nutshell:
  • 160 million tuples joined per second
  • 27 billion tuples joined in less than 3 minutes
  • scales linearly with the number of cores
• Platform HyPer1:
  • Linux server
  • 1 TB RAM
  • 4 CPUs with 8 physical cores each
• Benchmark:
  • Join tables R and S with schema {[joinkey: 64bit, payload: 64bit]}
  • Dataset sizes ranging from 50 GB to 400 GB
Execution time comparison
• MPSM, Vectorwise (VW), and Blanas hash join*
• 32 workers
• |R| = 1600 million tuples (25 GB), varying size of S
* S. Blanas, Y. Li, and J. M. Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD 2011
Scalability in the number of cores
• MPSM and Vectorwise (VW)
• |R| = 1600 million tuples (25 GB), |S| = 4·|R|
Location skew
• Location skew in R has no effect because of repartitioning
• Location skew in S: in the extreme case all join partners of Ri are found in only one Sj (either local or remote)
Distribution skew: anti-correlated data
[Figure: execution times without balanced partitioning vs. with balanced partitioning]
Conclusions
• MPSM is a sort-based parallel join algorithm
• MPSM is NUMA-aware & NUMA-oblivious
• MPSM is space efficient (works in-place)
• MPSM scales linearly in the number of cores
• MPSM is skew resilient
• MPSM outperforms Vectorwise (4X) and Blanas et al.'s hash join (18X)
• MPSM is adaptable for disk-based processing
• See details in the paper
Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems
Martina Albutiu, Alfons Kemper, and Thomas Neumann
Technische Universität München
THANK YOU FOR YOUR ATTENTION!