NUMA-aware algorithms: the case of data shuffling

Yinan Li* Ippokratis Pandis Rene Mueller Vijayshankar Raman Guy Lohman

*University of Wisconsin - Madison

IBM Almaden Research Center


Hardware is a moving target

[Figure: example platforms (Intel-based, POWER-based, cloud) in 2-socket, 4-socket (a), 4-socket (b) and 8-socket configurations]

Very difficult to optimize & maintain data management code for every HW platform

Different degrees of parallelism, # sockets and memory hierarchies

Different types of CPUs (SSE, out-of-order vs in-order, 2- vs 4- vs 8-way SMT, …), storage technologies …


NUMA effects => underutilize RAM bandwidth

[Figure: four sockets (Socket 0 to Socket 3), each with its own memory, connected by QPI; numbered labels 1 to 4 mark the access paths]

Sequential accesses are not the final solution


Use case: data shuffling

Ignoring NUMA leaves perf. on the table

  • Each of the N threads needs to send data to the N-1 other threads

  • Common operation:

    • Sort-merge join

    • Partitioned aggregation

    • MapReduce

      • Both Map and Reduce shuffle data

    • Scatter/gather


NUMA-aware data mgmt. operations

There are many different data operations that need similar optimizations

  • Tons of work on SMPs & NUMA 1.0

  • Sort-merge join [Albutiu et al. VLDB 2012]

    • Favor sequential accesses over random probes

  • OLTP on HW Islands [Porobic et al. VLDB 2012]

    • Should we treat multisocket multicores as a cluster?


Need for primitives

  • Kernels used frequently in data management operations

    • E.g. sorting, hashing, data shuffling, …

  • Highly optimized software solutions

    • Similar to BLAS

    • Optimized by skilled devs per new HW platform

  • Hardware-based solutions

    • Database machines 2.0 (see Bionic DBMSs talk this afternoon)

    • If a kernel is very important, it can be burnt into HW

    • Expensive, but orders of magnitude more efficient (perf., energy)

    • Companies like IBM and Oracle can do vertical engineering


Outline

  • Introduction

  • NUMA 2.0 and related work

  • Data shuffling

    • Ring shuffling

    • Thread migration

  • Evaluation

  • Conclusions


Data shuffling & naïve implementation

[Figure: partitions before and after the shuffle]

  • Naïve implementation

  • Each thread acts autonomously:

        // every thread pulls its partition from each thread, in index order
        for (thread = 0; thread < N; thread++)
            readMyPartitionFrom(thread);

How bad can that be?

  • Each of the N threads produces N-1 partitions, one for every other thread

  • Each thread needs to read its partitions

    • N * (N-1) transfers

  • Assume uniform sizes of partitions
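
The transfer count for the naïve loop can be checked with a minimal sketch. It is written in Python purely for illustration; `naive_transfers` is a hypothetical helper, not something from the slides, and it assumes that a thread reading its own partition does not count as a transfer.

```python
def naive_transfers(n):
    """Return (reader, source) pairs in the order the naive loop issues them."""
    transfers = []
    for reader in range(n):
        for source in range(n):      # every reader starts at source 0
            if source != reader:     # a thread's own partition needs no transfer
                transfers.append((reader, source))
    return transfers


transfers = naive_transfers(8)
print(len(transfers))  # 8 * 7 = 56 transfers for N = 8
```

Every reader walks the sources in the same order, so the count is N * (N-1) regardless of how the threads are placed on sockets.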


Shuffling naively in a NUMA system

Naïve uncoordinated shuffling

[Figure: threads T0 to T7 across Steps 1, 2, 3, …, showing usage of QPI and memory paths; the plot marks the aggregate bandwidth of all channels and the max memory bandwidth of 1 channel]

BUT we bought 4 memory channels and 6 QPIs

Need to orchestrate threads/transfers to utilize the rest
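
A small sketch of why the uncoordinated loop is limited by a single memory channel. The assumptions are mine, not the slides': all threads advance in lock-step, threads are numbered socket by socket, and `sockets_read_naive` is a hypothetical helper name.

```python
def sockets_read_naive(num_sockets, threads_per_socket, step):
    """Return the set of sockets whose memory is read at `step` by the naive loop."""
    n = num_sockets * threads_per_socket
    # In lock-step, every reader asks thread number `step` for its partition.
    sources = [step % n for _reader in range(n)]
    # Assumption: threads are numbered socket by socket.
    return {source // threads_per_socket for source in sources}


for step in range(8):
    busy = sockets_read_naive(4, 2, step)
    print(f"step {step}: {len(busy)} of 4 sockets serve reads: {sorted(busy)}")
```

Under these assumptions only one socket's memory serves reads at any step, while the other three sit idle.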


Ring shuffling

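[Figure: two concentric rings; the inner ring carries partition labels (such as s0.p0, s1.p0, s2.p0, s3.p0, s0.p1, s1.p1) and the outer ring carries thread labels (s0.t0, s0.t1, s1.t0, s1.t1, s2.t0, s2.t1, s3.t0, s3.t1)]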

  • Devise a global schedule and all threads follow it

    • Inner ring: partitions ordered by thread number, socket; stationary

    • Outer ring: threads ordered by socket, thread number; rotates

  • Can be executed in lock-step or loosely

  • Needs:

    • Thread binding & synchronization

    • Control location of mem. allocations
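
The two rings can be sketched as follows. This is one illustrative reading of the bullets above, not the authors' implementation: `ring_schedule` is a hypothetical name, a label `(s, p)` on the inner ring is assumed to mean the partition source with index `p` on socket `s`, and the steps are assumed to run in lock-step.

```python
def ring_schedule(num_sockets, threads_per_socket):
    """Yield one {thread: partition_source} mapping per step of the rotating ring."""
    n = num_sockets * threads_per_socket
    # Outer ring: threads ordered by socket, then thread number.
    outer = [(s, t) for s in range(num_sockets) for t in range(threads_per_socket)]
    # Inner ring: partition sources ordered by thread number, then socket.
    inner = [(s, p) for p in range(threads_per_socket) for s in range(num_sockets)]
    for step in range(n):
        # Rotating the outer ring by `step` pairs position j with position j + step.
        yield {outer[j]: inner[(j + step) % n] for j in range(n)}


for step, pairing in enumerate(ring_schedule(4, 2)):
    sockets_read = sorted(src[0] for src in pairing.values())
    print(step, sockets_read)
```

In this sketch each step pairs every thread with a distinct source, so every socket's memory serves the same number of readers per step, and after N steps each thread has visited every source once (one of those visits is its own, local partition).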


Ring shuffling in action

[Figure: ring shuffling, threads T0 to T7 across Steps 1, 2, 3, …, showing usage of QPI and memory paths; the plot marks the aggregate bandwidth of all channels]

Orchestrated traffic utilizes underlying QPI network


Thread migration instead of shuffling

[Figure: plot with the aggregate bandwidth of all channels marked]

  • Move computation to the data instead of shuffling the data

    • Convert accesses to local memory reads

  • Choice of migrating only thread or thread + state

    • But both are very sensitive to the amount of thread state
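
The sensitivity to thread state can be illustrated with a toy cost model. It is an assumption made for illustration only, not the evaluation from the talk: it counts only the bytes that cross sockets, and `cheaper_strategy` is a hypothetical helper.

```python
def cheaper_strategy(partition_bytes, state_bytes):
    """Pick the option that moves fewer bytes across sockets in this toy model."""
    # Shuffling pulls the remote partition to the thread's socket.
    shuffle_cost = partition_bytes
    # Migrating with state carries the thread state to the partition's socket,
    # after which the partition is read from local memory.
    migrate_cost = state_bytes
    return "migrate" if migrate_cost < shuffle_cost else "shuffle"


print(cheaper_strategy(partition_bytes=64 << 20, state_bytes=4 << 10))
print(cheaper_strategy(partition_bytes=64 << 20, state_bytes=1 << 30))
```

With a 64 MiB partition, a 4 KiB state favors migration while a 1 GiB state favors shuffling. A real decision would also have to account for migration latency and cache effects, which this sketch ignores.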


Outline

Introduction

NUMA 2.0 and related work

Data shuffling

Evaluation

Conclusions


Shuffling benchmark – peak bandwidth

[Figure: peak shuffling bandwidth on the two machines below; chart annotations: ~4x and 3x]

IBM x3850: 4 sockets x 8 cores Intel X7560 Nehalem-EX, fully connected QPI

2x IBM x3850: 8 sockets x 8 cores Intel X7560 Nehalem-EX


Exploiting ring shuffling in joins

Small overall perf. improvement because the join is dominated by the sort

Implemented the algorithm of Albutiu et al.

Sort-merge-based join implementation


Shuffling vs migration for aggregation

Potential of thread migration when thread state is small

Partitioning-based aggregation


Conclusions

Questions???

  • Hardware is a moving target

  • Need for primitives for data management operations

    • Highly optimized SW or HW implementations

    • BLAS for DBMSs

  • Data shuffling can be up to 3x faster if NUMA-aware

    • Needs binding of memory allocations, thread scheduling …

    • Potential of thread migration

  • Improved overall performance of optimized joins and aggregations

  • Continue investigating primitives, their implementation and exploitation

  • Looking for motivated summer interns! [email to [email protected]]



Shuffling data - scalability

IBM x3850

4 sockets x 8 cores

Fully connected QPI


Shuffling vs migration for aggregation - breakdown

Partitioning-based aggregation


Naïve vs ring shuffling

[Figure: naïve uncoordinated shuffling vs. coordinated shuffling, threads T0 to T7 across Iterations 1, 2, 3, …, showing usage of QPI and memory paths]

