
Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems

Martina Albutiu, Alfons Kemper, and Thomas Neumann
Technische Universität München


Hardware trends …

  • Massive processing parallelism

  • Huge main memory

  • Non-uniform Memory Access (NUMA)

  • Our server:

    • 4 CPUs

    • 32 cores

    • 1 TB RAM

    • 4 NUMA partitions



Main memory database systems

MPSM for efficient OLAP query processing

  • VoltDB, SAP HANA, MonetDB

  • HyPer*: real-time business intelligence queries on transactional data

* http://www-db.in.tum.de/HyPer/ (Kemper, Neumann: ICDE 2011)


How to exploit these hardware trends?

  • Parallelize algorithms

  • Exploit fast main memory access

     Kim, Sedlar, Chhugani: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. VLDB '09

     Blanas, Li, Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD '11

  • AND be aware of fast local vs. slow remote NUMA access


How much difference does NUMA make?

[Chart: scaled execution times, local/sequential vs. remote/synchronized]

  • merge join (sequential read): 837 ms local vs. 1000 ms remote
  • sort: 7440 ms local vs. 22756 ms remote
  • partitioning: 12946 ms sequential vs. 417344 ms synchronized


Ignoring NUMA …

[Diagram: a single shared hash table spanning NUMA partitions 1-4, built and probed by cores 1-8 across partition boundaries.]


The three NUMA commandments

C1

Thou shalt not write thy neighbor's memory randomly -- chunk the data, redistribute, and then sort/work on your data locally.

C2

Thou shalt read thy neighbor's memory only sequentially -- let the prefetcher hide the remote access latency.

C3

Thou shalt not wait for thy neighbors -- don't use fine-grained latching or locking, and avoid synchronization points of parallel threads.


Basic idea of MPSM

[Diagram: private input R is chunked into R chunks; public input S is chunked into S chunks.]


Basic idea of MPSM

  • C1: Work locally: sort

  • C3: Work independently: sort and merge join

  • C2: Access neighbor's data only sequentially

[Diagram: chunk private R → R chunks → sort R chunks locally; chunk public S → S chunks → sort S chunks locally; each sorted R chunk is then merge joined (MJ) with the sorted S chunks.]
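The phases above can be sketched in plain Python. This is a hypothetical single-process illustration -- the names `mpsm_join` and `merge_join` are ours, not from the paper, and real MPSM runs each chunk's work on a parallel, NUMA-local worker:

```python
# Hypothetical single-process sketch of the MPSM phases; in the real
# algorithm each worker owns one R chunk and one S chunk and runs in
# parallel on NUMA-local memory.
def merge_join(r_run, s_run, key):
    """Classic merge join of two runs sorted on `key`."""
    out, i, j = [], 0, 0
    while i < len(r_run) and j < len(s_run):
        kr, ks = key(r_run[i]), key(s_run[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # emit all S partners for this R tuple, then advance R
            k = j
            while k < len(s_run) and key(s_run[k]) == kr:
                out.append((r_run[i], s_run[k]))
                k += 1
            i += 1
    return out

def mpsm_join(r_chunks, s_chunks, key=lambda t: t[0]):
    # C1: sort every chunk locally (each worker sorts only its own data)
    sorted_s = [sorted(c, key=key) for c in s_chunks]
    sorted_r = [sorted(c, key=key) for c in r_chunks]
    out = []
    # C2/C3: each worker merge-joins its sorted R run against every
    # sorted S run -- remote runs are read only sequentially, and no
    # synchronization between workers is needed.
    for r_run in sorted_r:
        for s_run in sorted_s:
            out.extend(merge_join(r_run, s_run, key))
    return out
```

Without the range partitioning introduced on the next slides, every R run is joined against every full S run; partitioning is what shrinks that work.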


Range partitioning of private input R

  • To constrain merge join work

  • To provide scalability in the number of parallel workers


Range partitioning of private input R

[Diagram: R chunks → range partition R → range-partitioned R chunks]


Range partitioning of private input R

  → S is implicitly partitioned

[Diagram: the range-partitioned R chunks are sorted; the S chunks are sorted as is.]


Range partitioning of private input R

[Diagram: each sorted, range-partitioned R chunk is merge joined (MJ) with only the relevant parts of the sorted S chunks.]
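Because R is range partitioned, a worker only has to scan the slice of each sorted S run that can contain matches, found by binary search. A minimal sketch (the function name `join_partition` is illustrative, and for brevity it assumes keys are unique within the R run):

```python
from bisect import bisect_left

def join_partition(r_run, s_runs, lo, hi):
    """Join a sorted R partition holding keys in [lo, hi) against the
    relevant slice of every sorted S run."""
    out = []
    for s_run in s_runs:
        # binary search restricts the scan to possible join partners
        start, end = bisect_left(s_run, lo), bisect_left(s_run, hi)
        i, j = 0, start
        while i < len(r_run) and j < end:
            if r_run[i] < s_run[j]:
                i += 1
            elif r_run[i] > s_run[j]:
                j += 1
            else:
                k = j
                while k < end and s_run[k] == r_run[i]:
                    out.append((r_run[i], s_run[k]))
                    k += 1
                i += 1  # assumes unique keys within the R run
    return out
```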


Range partitioning of private input R

  • Time efficient

    • branch-free

    • comparison-free

    • synchronization-free

      and

  • Space efficient

    • densely packed

    • in-place

      by using radix-clustering and precomputed target partitions into which the data is scattered


Range partitioning of private input

[Worked example: workers W1 and W2 assign each key of their chunk to a partition by its radix bits (e.g., 19 = 10011 and 17 = 10001 fall into the partition ≥ 16, while 7 = 00111 and 2 = 00010 fall into the partition < 16). The per-worker histograms of these partition counts are combined into prefix sums that give every worker a disjoint write offset per partition, so both workers scatter their tuples without synchronization.]
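The histogram/prefix-sum/scatter steps of the example can be sketched as follows. Illustrative code: the function name `partition` is ours, the split on bit 4 (< 16 vs. ≥ 16, for keys below 32) mirrors the slide's example, and the "workers" run sequentially here:

```python
def partition(chunks, num_parts=2, shift=4):
    """Synchronization-free radix partitioning: per-worker histograms
    plus prefix sums give each worker disjoint write slots, so the
    scatter needs no latches. Keys must fit in shift + log2(num_parts)
    bits (here: < 32)."""
    part_of = lambda key: key >> shift   # high radix bits pick the partition
    # 1. per-worker histograms
    hists = [[0] * num_parts for _ in chunks]
    for w, chunk in enumerate(chunks):
        for key in chunk:
            hists[w][part_of(key)] += 1
    # 2. exclusive prefix sums: start offset of worker w in partition p
    offsets = [[0] * num_parts for _ in chunks]
    for p in range(num_parts):
        run = 0
        for w in range(len(chunks)):
            offsets[w][p] = run
            run += hists[w][p]
    sizes = [sum(h[p] for h in hists) for p in range(num_parts)]
    # 3. scatter: every tuple already has a precomputed target slot
    parts = [[None] * sizes[p] for p in range(num_parts)]
    for w, chunk in enumerate(chunks):
        cur = list(offsets[w])
        for key in chunk:
            p = part_of(key)
            parts[p][cur[p]] = key
            cur[p] += 1
    return parts
```

Because the offsets are disjoint by construction, in a parallel implementation each worker can write its slots concurrently without any locking.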


Skew resilience of MPSM

  • Location skew is implicitly handled

  • Distribution skew:

    • Dynamically computed partition bounds

    • Determined based on the global data distributions of R and S

    • Cost balancing for sorting R and joining R and S

[Figure: example chunks R1, R2 and S1, S2 with skewed key values.]


Skew resilience

1. Global S data distribution

  • Local equi-height histograms (for free)

  • Combined into a Cumulative Distribution Function (CDF)

[Figure: CDF of S (# tuples over key value), built from the sorted runs S1 and S2.]
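One way to sketch this step: every h-th key of a sorted run bounds a bucket of h tuples (the equi-height histogram that falls out of sorting for free), and merging all workers' bucket boundaries yields an approximate global CDF. Illustrative code -- `build_cdf` and the bucket height `h` are our choices, and a run's leftover tuples beyond a multiple of h are ignored here:

```python
def build_cdf(sorted_runs, h=2):
    """Approximate CDF of S from per-worker equi-height histograms:
    each sorted run contributes every h-th key as a bucket boundary
    covering h tuples."""
    points = []
    for run in sorted_runs:
        for i in range(h - 1, len(run), h):
            points.append((run[i], h))   # (upper key bound, bucket height)
    points.sort()
    cdf, acc = [], 0
    for key, count in points:
        acc += count
        cdf.append((key, acc))           # ~#tuples with key <= bound
    return cdf
```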


Skew resilience

2. Global R data distribution

  • Local equi-width histograms as before

  • More fine-grained histograms

[Figure: chunk R1 with keys such as 2 = 00010 and 8 = 01000, bucketed into a fine-grained equi-width histogram over the ranges < 8, [8,16), [16,24), ≥ 24.]


Skew resilience

3. Compute splitters so that the overall workloads are balanced*: greedily combine buckets, thereby balancing each thread's cost for sorting R and joining R and S.

[Figure: R's histogram and S's CDF are combined to determine balanced splitters.]

* Ross and Cieslewicz: Optimal Splitters for Database Partitioning with Size Bounds. ICDT '09
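A hedged sketch of the greedy combination (the name `splitters` and the linear cost weights `w_sort`, `w_join` are illustrative; the paper's actual cost model differs): walk R's histogram buckets in key order, accumulate sort cost from R's counts and join cost from S's CDF, and close a partition whenever the running cost reaches the per-thread budget.

```python
def splitters(bucket_bounds, r_hist, s_cdf, num_workers,
              w_sort=1.0, w_join=1.0):
    """Greedily combine histogram buckets into num_workers partitions.
    bucket_bounds[i] is the upper key bound of bucket i, r_hist[i] its
    R tuple count, and s_cdf(k) the number of S tuples with key <= k."""
    total = w_sort * sum(r_hist) + w_join * s_cdf(bucket_bounds[-1])
    budget = total / num_workers          # target cost per thread
    bounds, cost, prev_s = [], 0.0, 0.0
    for bound, r_count in zip(bucket_bounds, r_hist):
        s_below = s_cdf(bound)
        cost += w_sort * r_count + w_join * (s_below - prev_s)
        prev_s = s_below
        if cost >= budget and len(bounds) < num_workers - 1:
            bounds.append(bound)          # close this partition at the bucket edge
            cost = 0.0
    return bounds                         # splitter keys between partitions
```

With anti-correlated data, a thread whose R partition is dense gets a sparse S range and vice versa, which is exactly the balancing the slide describes.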


Performance evaluation

  • MPSM performance in a nutshell:

    • 160 million records joined per second

    • 27 billion records joined in less than 3 minutes

    • scales linearly with the number of cores

  • Platform:

    • Linux server

    • 1 TB RAM

    • 32 cores

  • Benchmark:

    • Join tables R and S with schema {[joinkey: 64 bit, payload: 64 bit]}

    • Dataset sizes ranging from 50 GB to 400 GB


Execution time comparison

  • MPSM, Vectorwise (VW), and Blanas hash join*

  • 32 workers

  • |R| = 1600 million tuples (25 GB), varying size of S

* S. Blanas, Y. Li, and J. M. Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD 2011


Scalability in the number of cores

  • MPSM and Vectorwise (VW)

  • |R| = 1600 million tuples (25 GB), |S| = 4 · |R|


Location skew

  • Location skew in R has no effect because of repartitioning

  • Location skew in S: in the extreme case all join partners of Ri are found in only one Sj (either local or remote)


Distribution skew: anti-correlated data

R: 80% of the join keys are within the 20% high end of the key domain
S: 80% of the join keys are within the 20% low end of the key domain

[Charts: execution without balanced partitioning vs. with balanced partitioning]




Conclusions

  • MPSM is a sort-based parallel join algorithm

  • MPSM is NUMA-aware & NUMA-oblivious

  • MPSM is space efficient (works in-place)

  • MPSM scales linearly in the number of cores

  • MPSM is skew resilient

  • MPSM outperforms Vectorwise (4X) and Blanas et al.'s hash join (18X)

  • MPSM is adaptable for disk-based processing

    • See details in paper


Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems

Martina Albutiu, Alfons Kemper, and Thomas Neumann
Technische Universität München

THANK YOU FOR YOUR ATTENTION!

