
NUMA-aware algorithms: the case of data shuffling

Yinan Li* Ippokratis Pandis Rene Mueller Vijayshankar Raman Guy Lohman

*University of Wisconsin - Madison

IBM Almaden Research Center


Hardware is a moving target

[Figure: hardware variety: Intel-based and POWER-based servers (2-socket, 4-socket (a), 4-socket (b), 8-socket), plus the cloud.]

Very difficult to optimize & maintain data management code for every HW platform

Different degrees of parallelism, # sockets and memory hierarchies

Different types of CPUs (SSE, out-of-order vs in-order, 2- vs 4- vs 8-way SMT, …), storage technologies …


NUMA effects => underutilize RAM bandwidth

[Figure: four sockets (Socket 0 to Socket 3), each with its own local memory, interconnected by QPI; numbered steps 1 to 4 mark memory accesses across the system.]

Sequential accesses are not the final solution
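
To make the effect concrete, here is a minimal sketch, not from the talk, that compares sequential read bandwidth from a buffer placed on the local NUMA node against one placed on a remote node. It assumes Linux with libnuma (compile with -lnuma); the node IDs, buffer size, and timing loop are illustrative.

    /* Sketch: local vs. remote sequential read bandwidth (assumes Linux + libnuma). */
    #include <numa.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static double read_gbps(const uint64_t *buf, size_t n) {
        struct timespec t0, t1;
        volatile uint64_t sum = 0;                      /* keep the loop from being optimized away */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i++) sum += buf[i];   /* sequential read */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return n * sizeof(uint64_t) / sec / 1e9;
    }

    int main(void) {
        if (numa_available() < 0) return 1;
        size_t n = 1ULL << 27;                          /* 1 GiB of uint64_t, illustrative */
        numa_run_on_node(0);                            /* run this thread on socket 0 */
        uint64_t *local  = numa_alloc_onnode(n * sizeof(uint64_t), 0);   /* socket 0 memory */
        uint64_t *remote = numa_alloc_onnode(n * sizeof(uint64_t), 1);   /* socket 1 memory */
        for (size_t i = 0; i < n; i++) { local[i] = i; remote[i] = i; }  /* touch all pages */
        printf("local:  %.2f GB/s\n", read_gbps(local, n));
        printf("remote: %.2f GB/s\n", read_gbps(remote, n));
        numa_free(local,  n * sizeof(uint64_t));
        numa_free(remote, n * sizeof(uint64_t));
        return 0;
    }

The remote run pays for every cache line crossing QPI, which is exactly the traffic a shuffle generates when data placement is ignored.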


Use case: data shuffling

Ignoring NUMA leaves perf. on the table

  • Each of the N threads needs to send data to the N-1 other threads

  • Common operation:

    • Sort-merge join

    • Partitioned aggregation

    • MapReduce

      • Both Map and Reduce shuffle data

    • Scatter/gather


NUMA-aware data mgmt. operations

There are many different data operations that need similar optimizations

  • Tons of work on SMPs & NUMA 1.0

  • Sort-merge join [Albutiu et al. VLDB 2012]

    • Favor sequential accesses over random probes

  • OLTP on HW Islands [Porobic et al. VLDB 2012]

    • Should we treat multisocket multicores as a cluster?


Need for primitives

  • Kernels used frequently in data management operations

    • E.g. sorting, hashing, data shuffling, …

  • Highly optimized software solutions

    • Similar to BLAS

    • Optimized by skilled devs for each new HW platform

  • Hardware-based solutions

    • Database machines 2.0 (see Bionic DBMSs talk this afternoon)

    • If a kernel is important enough, it can be burnt into HW

    • Expensive, but orders of magnitude more efficient (perf., energy)

    • Companies like IBM and Oracle can do vertical engineering


Outline

  • Introduction

  • NUMA 2.0 and related work

  • Data shuffling

    • Ring shuffling

    • Thread migration

  • Evaluation

  • Conclusions


Data shuffling & naïve implementation

[Figure: data partitions before and after the shuffle.]

  • Naïve implementation: each thread acts autonomously (expanded into a runnable sketch below):

    for (thread = 0; thread < N; thread++)
        readMyPartitionFrom(thread);

How bad can that be?

  • Each of the N threads produces N-1 partitions, one for each of the other threads

  • Each thread needs to read its partitions

    • N * (N-1) transfers

  • Assume uniform sizes of partitions
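
A minimal sketch of this naïve, uncoordinated shuffle; the thread count N, the partition sizes, and the buffer layout are illustrative assumptions rather than the talk's actual code (compile with -lpthread):

    /* Sketch: naive uncoordinated shuffle. Every thread walks the producers in the
     * same order (src = 0, 1, ...), so most threads converge on the same producer's
     * memory channel at the same time. N and PART_BYTES are illustrative. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define N          8              /* threads, and partitions per thread      */
    #define PART_BYTES (1 << 20)      /* 1 MiB per partition, for illustration   */

    static char *partition[N][N];     /* partition[src][dst], written by src     */
    static char *output[N];           /* per-thread destination buffer           */

    static void *shuffle_worker(void *arg) {
        int me = (int)(long)arg;
        char *dst = output[me];
        for (int src = 0; src < N; src++) {               /* uncoordinated order  */
            memcpy(dst, partition[src][me], PART_BYTES);  /* mostly remote reads  */
            dst += PART_BYTES;
        }
        return NULL;
    }

    int main(void) {
        for (int s = 0; s < N; s++) {
            output[s] = malloc((size_t)N * PART_BYTES);
            for (int d = 0; d < N; d++)
                partition[s][d] = calloc(1, PART_BYTES);
        }
        pthread_t tid[N];
        for (long t = 0; t < N; t++)
            pthread_create(&tid[t], NULL, shuffle_worker, (void *)t);
        for (int t = 0; t < N; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }

Nothing here controls where partitions live or which producer each thread reads at a given moment, which is what the coordinated schedules below fix.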


Shuffling naively in a NUMA system

Naïve uncoordinated shuffling

[Figure: threads T0 to T7 shuffling over steps 1, 2, 3, …; the accompanying plot shows the usage of QPI and memory paths against two reference lines: the aggregate bandwidth of all channels and the max memory bandwidth of one channel.]

Need to orchestrate threads/transfers to utilize the rest

BUT we bought 4 memory channels and 6 QPI links


Ring shuffling

[Figure: ring schedule; a stationary inner ring of partitions (s0.p0, s0.p1, s1.p0, …) surrounded by a rotating outer ring of threads (s0.t0, s0.t1, …, s3.t1).]

  • Devise a global schedule and all threads follow it

    • Inner ring: partitions ordered by thread number, socket; stationary

    • Outer ring: threads ordered by socket, thread number; rotates

  • Can be executed in lock-step or loosely

  • Needs (see the sketch below):

    • Thread binding & synchronization

    • Control over the location of memory allocations
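
A minimal sketch of a lock-step ring schedule, assuming Linux with pthreads and libnuma (compile with -lpthread -lnuma); the thread-to-CPU mapping, node assignment, and partition sizes are illustrative assumptions, not the authors' implementation:

    /* Sketch: lock-step ring shuffle. At step k, thread t reads the partition that
     * producer (t + k) mod N wrote for it, so each producer serves one consumer per
     * step and traffic spreads over the memory channels and QPI links. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <numa.h>
    #include <string.h>

    #define N          8
    #define PART_BYTES (1 << 20)

    static char *partition[N][N];             /* partition[src][dst], on src's node */
    static char *output[N];
    static pthread_barrier_t step_barrier;

    static void *ring_worker(void *arg) {
        int me = (int)(long)arg;

        cpu_set_t set;                        /* bind the thread so its node is fixed */
        CPU_ZERO(&set);
        CPU_SET(me, &set);                    /* assumes CPU id == thread id          */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (int step = 0; step < N; step++) {
            int src = (me + step) % N;        /* rotate: one reader per producer      */
            memcpy(output[me] + (size_t)src * PART_BYTES,
                   partition[src][me], PART_BYTES);
            pthread_barrier_wait(&step_barrier);   /* lock-step; could also run loosely */
        }
        return NULL;
    }

    int main(void) {
        if (numa_available() < 0) return 1;
        int nodes = numa_max_node() + 1;
        for (int s = 0; s < N; s++) {
            int node = s % nodes;                      /* illustrative thread->node map */
            output[s] = numa_alloc_onnode((size_t)N * PART_BYTES, node);
            for (int d = 0; d < N; d++)
                partition[s][d] = numa_alloc_onnode(PART_BYTES, node);
        }
        pthread_barrier_init(&step_barrier, NULL, N);
        pthread_t tid[N];
        for (long t = 0; t < N; t++)
            pthread_create(&tid[t], NULL, ring_worker, (void *)t);
        for (int t = 0; t < N; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }

With a fully connected QPI topology, this rotation keeps every link and memory channel busy instead of stalling most threads behind one producer.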


Ring shuffling in action

Ring shuffling

[Figure: threads T0 to T7 shuffling over steps 1, 2, 3, …; the accompanying plot shows the usage of QPI and memory paths relative to the aggregate bandwidth of all channels.]

Orchestrated traffic utilizes the underlying QPI network


Thread migration instead of shuffling


  • Move computation to the data instead of shuffling the data (see the sketch below)

    • Converts remote accesses into local memory reads

  • Choice of migrating only the thread, or the thread plus its state

    • But both are very sensitive to the amount of thread state
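
A minimal sketch of the migration alternative, again assuming libnuma; the partition bookkeeping and the process_partition hook are hypothetical placeholders for whatever operator consumes the data:

    /* Sketch: migrate the computation to the data instead of shuffling the data.
     * The thread re-schedules itself onto the node that owns each partition, so the
     * subsequent reads are local. partition_node[] and process_partition() are
     * hypothetical; only a small amount of thread state travels with the thread. */
    #include <numa.h>
    #include <stddef.h>

    #define N 8

    extern char  *partition[N][N];        /* partition[src][dst]                   */
    extern int    partition_node[N];      /* NUMA node where src's partitions live */
    extern size_t part_size[N][N];

    void process_partition(int me, const char *data, size_t len);  /* consumer hook */

    void migrate_and_process(int me) {
        for (int src = 0; src < N; src++) {
            numa_run_on_node(partition_node[src]);   /* move to the data's node     */
            process_partition(me, partition[src][me], part_size[src][me]);
        }
        numa_run_on_node(-1);                        /* -1: run on any node again   */
    }

If the thread carries a large working state (e.g. a big hash table), that state becomes remote after the move, which is why the approach is sensitive to the amount of thread state.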


Outline

  • Introduction

  • NUMA 2.0 and related work

  • Data shuffling

  • Evaluation

  • Conclusions


Shuffling benchmark – peak bandwidth

[Chart: peak shuffle bandwidth on the two machines below, with ~4x and 3x annotations.]

IBM x3850: 4 sockets x 8 cores, Intel X7650 Nehalem-EX, fully connected QPI

2x IBM x3850: 8 sockets x 8 cores, Intel X7650 Nehalem-EX


Exploiting ring shuffling in joins

  • Implemented the sort-merge-based join algorithm of Albutiu et al.

  • Small overall performance improvement, because the runtime is dominated by the sort


Shuffling vs migration for aggregation

  • Partitioning-based aggregation

  • Thread migration shows potential when the thread state is small


Conclusions

  • Hardware is a moving target

  • Need for primitives for data management operations

    • Highly optimized SW or HW implementations

    • BLAS for DBMSs

  • Data shuffling can be up to 3x faster if NUMA-aware

    • Needs binding of memory allocations, thread scheduling …

    • Potential of thread migration

  • Improved overall performance of optimized joins and aggregations

  • Continue investigating primitives, their implementation and exploitation

  • Looking for motivated summer interns! [email to [email protected]]

Questions???


Backup slides


Shuffling data - scalability

IBM x3850: 4 sockets x 8 cores, fully connected QPI


Shuffling vs migration for aggregation - breakdown

Partitioning-based aggregation


Naïve vs ring shuffling

[Figure: side-by-side comparison of naïve uncoordinated shuffling and coordinated (ring) shuffling; threads T0 to T7 shown over iterations 1, 2, 3, …, with the corresponding usage of QPI and memory paths.]

