Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems

Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems

Martina Albutiu, Alfons Kemper, and Thomas Neumann, Technische Universität München


Hardware trends …

  • Massive processing parallelism

  • Huge main memory

  • Non-uniform Memory Access (NUMA)

  • Our server:

    • 4 CPUs

    • 32 cores

    • 1 TB RAM

    • 4 NUMA partitions



Main memory database systems

MPSM for efficient OLAP query processing

  • VoltDB, SAP HANA, MonetDB

  • HyPer*: real-time business intelligence queries on transactional data

* http://www-db.in.tum.de/HyPer/

ICDE 2011 A. Kemper, T. Neumann


How to exploit these hardware trends?

  • Parallelize algorithms

  • Exploit fast main memory access

     → Kim, Sedlar, Chhugani: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. VLDB '09

     → Blanas, Li, Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD '11

  • AND be aware of fast local vs. slow remote NUMA access


How much difference does NUMA make?

[Bar chart, execution times scaled to 100% for the local/sequential variant: sort takes 7440 ms locally vs. 22756 ms on remote memory (roughly 3x slower); merge join with sequential reads takes 837 ms locally vs. 1000 ms remotely, as the prefetcher hides most of the remote access latency; partitioning takes 12946 ms when done sequentially vs. 417344 ms when synchronized]


Ignoring NUMA …

[Diagram: a single shared hash table spans NUMA partitions 1-4 while cores 1-8 probe it across partition boundaries, causing random remote accesses]


The three NUMA commandments

C1

Thou shalt not write thy neighbor's memory randomly -- chunk the data, redistribute, and then sort/work on your data locally.

C2

Thou shalt read thy neighbor's memory only sequentially -- let the prefetcher hide the remote access latency.

C3

Thou shalt not wait for thy neighbors -- don't use fine-grained latching or locking, and avoid synchronization points of parallel threads.


Basic idea of MPSM

[Diagram: the private input R is chunked into R chunks, and the public input S is chunked into S chunks]


Basic idea of MPSM

  • C1: Work locally: sort

  • C3: Work independently: sort and merge join

  • C2: Access neighbor's data only sequentially

[Diagram: chunk private R into R chunks and sort the R chunks locally; chunk public S into S chunks and sort the S chunks locally; then merge join (MJ) each sorted R chunk with the sorted S chunks]
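The flow above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' C++ implementation; the function names (`mpsm_join`, `merge_join`), the list-based data layout, and the splitter handling are assumptions made for the sketch.

```python
def merge_join(r_run, s_run):
    # Classic merge join of two runs sorted by key; tuples are (key, payload).
    out, i, j = [], 0, 0
    while i < len(r_run) and j < len(s_run):
        if r_run[i][0] < s_run[j][0]:
            i += 1
        elif r_run[i][0] > s_run[j][0]:
            j += 1
        else:
            # Emit the cross product of all R and S tuples sharing this key.
            key, j0 = r_run[i][0], j
            while i < len(r_run) and r_run[i][0] == key:
                j = j0
                while j < len(s_run) and s_run[j][0] == key:
                    out.append((r_run[i], s_run[j]))
                    j += 1
                i += 1
    return out

def mpsm_join(r_chunks, s_chunks, splitters):
    # splitters[i] is the exclusive upper key bound of worker i's R range;
    # the last bound should be +infinity so every key finds a partition.
    workers = len(splitters)
    # Phase 1 (C1): range-redistribute the private R chunks.
    r_parts = [[] for _ in range(workers)]
    for chunk in r_chunks:
        for t in chunk:
            for i, bound in enumerate(splitters):
                if t[0] < bound:
                    r_parts[i].append(t)
                    break
    # Phases 2 and 3: sort R partitions and public S chunks locally.
    r_runs = [sorted(p) for p in r_parts]
    s_runs = [sorted(c) for c in s_chunks]
    # Phase 4 (C2/C3): each worker merge-joins its R run with every S run,
    # scanning remote runs only sequentially and never synchronizing.
    result = []
    for r_run in r_runs:
        for s_run in s_runs:
            result.extend(merge_join(r_run, s_run))
    return result
```

In the real system the four loops of phase 4 run in parallel, one worker per sorted R run; the sketch executes them sequentially for clarity.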


Range partitioning of private input R

  • To constrain merge join work

  • To provide scalability in the number of parallel workers


Range partitioning of private input R

     → S is implicitly partitioned

[Diagram sequence: the R chunks are range-partitioned into range-partitioned R chunks; the range-partitioned R chunks and the S chunks are sorted locally; each worker then merge joins (MJ) only the relevant parts of the sorted S chunks]
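Because R is range-partitioned, a worker never needs a whole sorted S run, only the slice whose keys fall into its own key range. A minimal sketch of that slice lookup, assuming (key, payload) tuples; the helper name `relevant_slice` is invented for illustration:

```python
from bisect import bisect_left

def relevant_slice(s_run, lo, hi):
    # Binary-search the sorted S run for the worker's key range [lo, hi),
    # then return that slice -- which can afterwards be scanned strictly
    # sequentially, in line with commandment C2.
    keys = [k for k, _ in s_run]
    return s_run[bisect_left(keys, lo):bisect_left(keys, hi)]
```

The binary search costs O(log |S run|) per run, after which every remote access is a sequential read that the prefetcher can hide.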


Range partitioning of private input R

  • Time efficient

    • branch-free

    • comparison-free

    • synchronization-free

      and

  • Space efficient

    • densely packed

    • in-place

      → by using radix-clustering and precomputed target partitions to scatter the data


Range partitioning of private input

[Worked example with two workers, radix-partitioning on the key bit that splits the domain at 16 (keys < 16 vs. ≥ 16). Worker W1's chunk contains keys such as 19 (10011), 7 (00111), 17 (10001), and 2 (00010); its histogram counts 4 keys < 16 and 3 keys ≥ 16. Worker W2's chunk contains keys such as 23, 31, 20, and 26; its histogram counts 3 keys < 16 and 4 keys ≥ 16. The prefix sums turn the histograms into write offsets: W1 writes from offset 0 in both target partitions, W2 from offset 4 in the < 16 partition and from offset 3 in the ≥ 16 partition.]
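The histogram-and-prefix-sum scheme can be sketched as follows. This is an illustrative Python rendering of the technique only -- the real algorithm scatters 64-bit tuples in place into shared memory -- and the worker chunks in the test reuse some key values from the example but are otherwise invented.

```python
def radix_partition(chunks, bit=4):
    # Range-partition worker chunks on one radix bit: with bit=4 and keys
    # below 32, partition 0 holds keys < 16 and partition 1 holds keys >= 16.
    workers, parts = len(chunks), 2
    # Pass 1: every worker builds a local histogram -- no synchronization.
    hist = [[0, 0] for _ in range(workers)]
    for w, chunk in enumerate(chunks):
        for key in chunk:
            hist[w][(key >> bit) & 1] += 1
    # Prefix sums over the histograms give each worker a private, precomputed
    # write offset in every target partition.
    prefix = [[0, 0] for _ in range(workers)]
    for p in range(parts):
        offset = 0
        for w in range(workers):
            prefix[w][p] = offset
            offset += hist[w][p]
    # Pass 2: scatter into the precomputed targets -- comparison-free and
    # lock-free, since each worker owns disjoint target slots.
    sizes = [sum(h[p] for h in hist) for p in range(parts)]
    out = [[None] * sizes[p] for p in range(parts)]
    cursor = [row[:] for row in prefix]
    for w, chunk in enumerate(chunks):
        for key in chunk:
            p = (key >> bit) & 1
            out[p][cursor[w][p]] = key
            cursor[w][p] += 1
    return out, hist, prefix
```

Because the offsets are precomputed, the scatter pass needs no comparisons and no latches, which is exactly why the slide calls the partitioning branch-free, comparison-free, and synchronization-free.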


Skew resilience of MPSM

  • Location skew is implicitly handled

  • Distribution skew

    • Dynamically computed partition bounds

    • Determined based on the global data distributions of R and S

    • Cost balancing for sorting R and joining R and S

[Diagram: example chunks R1, R2 and S1, S2 with skewed key-value distributions, motivating dynamically computed partition bounds]


Skew resilience

1. Global S data distribution

  • Local equi-height histograms (for free)

  • Combined into a Cumulative Distribution Function (CDF)

[Chart: the sorted S chunks S1 and S2 are sampled into local equi-height histograms; merging the samples yields a staircase CDF mapping each key value to the cumulative number of S tuples]
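One way to picture the CDF construction in code: after sorting, every k-th key of a run is a free equi-height histogram, and merging the samples of all runs gives the global staircase. This is a hedged sketch -- `global_cdf`, the sampling step, and the example runs are assumptions, not the paper's exact procedure.

```python
def global_cdf(sorted_s_runs, step=2):
    # Each sorted run is sampled at every `step`-th position; each sample
    # therefore represents `step` tuples (a local equi-height histogram,
    # obtained as a by-product of sorting).
    samples = []
    for run in sorted_s_runs:
        samples.extend(run[step - 1::step])
    samples.sort()
    # Merged and sorted, the samples form a staircase CDF:
    # key -> estimated cumulative number of S tuples up to that key.
    return [(key, (i + 1) * step) for i, key in enumerate(samples)]
```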


Skew resilience

2. Global R data distribution

  • Local equi-width histograms as before

  • More fine-grained histograms

[Worked example: a chunk of R1 (keys annotated in binary, e.g. 2 = 00010, 8 = 01000) is summarized by a fine-grained equi-width histogram over the key ranges < 8, [8,16), [16,24), and ≥ 24]


Skew resilience

3. Compute splitters so that the overall workloads are balanced*: greedily combine buckets, thereby balancing each thread's costs for sorting R and joining R and S

[Diagram: the R histogram and the S CDF are combined bucket by bucket to pick splitter keys that balance the per-thread cost]

* Ross and Cieslewicz: Optimal Splitters for Database Partitioning with Size Bounds. ICDT '09
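The greedy combination of buckets can be sketched as below. This is an illustration under stated assumptions: the per-bucket costs stand in for the combined sort(R) + join(R, S) estimate derived from the R histogram and the S CDF, and `compute_splitters` is an invented name, not the paper's API.

```python
def compute_splitters(bucket_bounds, bucket_costs, workers):
    # Walk the histogram buckets in key order and close a partition whenever
    # the accumulated cost reaches the fair share of the total, so every
    # thread ends up with a comparable sort + join workload.
    target = sum(bucket_costs) / workers
    splitters, acc = [], 0
    for bound, cost in zip(bucket_bounds, bucket_costs):
        acc += cost
        if acc >= target and len(splitters) < workers - 1:
            splitters.append(bound)  # bound = upper key limit of this bucket
            acc = 0
    return splitters
```

With skewed costs the splitters move so that a hot key range gets a narrower partition, which is exactly the balancing effect shown on the anti-correlated workload later in the talk.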


Performance evaluation

  • MPSM performance in a nutshell:

    • 160 million records joined per second

    • 27 billion records joined in less than 3 minutes

    • scales linearly with the number of cores

  • Platform:

    • Linux server

    • 1 TB RAM

    • 32 cores

  • Benchmark:

    • Join tables R and S with schema {[joinkey: 64 bit, payload: 64 bit]}

    • Dataset sizes ranging from 50 GB to 400 GB


Execution time comparison

  • MPSM, Vectorwise (VW), and the Blanas hash join*

  • 32 workers

  • |R| = 1600 million tuples (25 GB), varying size of S

* S. Blanas, Y. Li, and J. M. Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD 2011


Scalability in the number of cores

  • MPSM and Vectorwise (VW)

  • |R| = 1600 million tuples (25 GB), |S| = 4 * |R|


Location skew

  • Location skew in R has no effect because of the repartitioning

  • Location skew in S: in the extreme case, all join partners of Ri are found in only one Sj (either local or remote)


Distribution skew: anti-correlated data

R: 80% of the join keys are within the 20% high end of the key domain
S: 80% of the join keys are within the 20% low end of the key domain

[Charts: results without balanced partitioning vs. with balanced partitioning]


Distribution skew: anti-correlated data

[Chart: execution times for the anti-correlated workload]


Conclusions

  • MPSM is a sort-based parallel join algorithm

  • MPSM is NUMA-aware & NUMA-oblivious

  • MPSM is space efficient (works in-place)

  • MPSM scales linearly in the number of cores

  • MPSM is skew resilient

  • MPSM outperforms Vectorwise (4X) and Blanas et al.'s hash join (18X)

  • MPSM is adaptable for disk-based processing

    • See details in the paper


Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems

Martina Albutiu, Alfons Kemper, and Thomas Neumann, Technische Universität München

THANK YOU FOR YOUR ATTENTION!