
NUMA-aware algorithms: the case of data shuffling


Presentation Transcript


  1. NUMA-aware algorithms: the case of data shuffling
  Yinan Li*, Ippokratis Pandis, Rene Mueller, Vijayshankar Raman, Guy Lohman
  *University of Wisconsin - Madison; IBM Almaden Research Center

  2. Hardware is a moving target
  [Figure: example platforms, Intel-based and POWER-based, cloud deployments; 2-socket, 4-socket (a), 4-socket (b), and 8-socket configurations]
  • Very difficult to optimize & maintain data management code for every HW platform
  • Different degrees of parallelism, # of sockets, and memory hierarchies
  • Different types of CPUs (SSE, out-of-order vs. in-order, 2- vs. 4- vs. 8-way SMT, …), storage technologies, …

  3. NUMA effects => underutilized RAM bandwidth
  [Figure: four sockets (0-3), each with local memory, connected by QPI links; numbered arrows trace accesses that cross sockets]
  • Sequential accesses are not the final solution

  4. Use case: data shuffling
  • Ignoring NUMA leaves performance on the table
  • Each of the N threads needs to send data to the N-1 other threads
  • Common operation:
    • Sort-merge join
    • Partitioned aggregation
    • MapReduce: both Map and Reduce shuffle data
    • Scatter/gather

  5. NUMA-aware data mgmt. operations
  • There are many different data operations that need similar optimizations
  • Tons of work on SMPs & NUMA 1.0
    • Sort-merge join [Albutiu et al., VLDB 2012]: favor sequential accesses over random probes
    • OLTP on HW Islands [Porobic et al., VLDB 2012]: should we treat multisocket multicores as a cluster?

  6. Need for primitives
  • Kernels used frequently in data management operations
    • E.g. sorting, hashing, data shuffling, …
  • Highly optimized software solutions
    • Similar to BLAS
    • Optimized by skilled devs per new HW platform
  • Hardware-based solutions
    • Database machines 2.0 (see Bionic DBMSs talk this afternoon)
    • If a kernel is important enough, it can be burnt into HW
    • Expensive, but orders of magnitude more efficient (perf., energy)
    • Companies like IBM and Oracle can do this vertical engineering

  7. Outline
  • Introduction
  • NUMA 2.0 and related work
  • Data shuffling
    • Ring shuffling
    • Thread migration
  • Evaluation
  • Conclusions

  8. Data shuffling & naïve implementation
  [Figure: partition layout before and after the shuffle]
  • Naïve implementation: each thread acts autonomously:
      for (thread = 0; thread < N; thread++)
          readMyPartitionFrom(thread);
  • How bad can that be?
    • N threads produce N-1 partitions for the other threads
    • Each thread needs to read its partitions
    • N * (N-1) transfers in total
    • Assume uniform partition sizes
  (A fleshed-out sketch of this loop follows below.)
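A minimal sketch of the naïve, uncoordinated shuffle described above, assuming N threads that exchange fixed-size partitions through shared buffers; the array names, sizes, and memcpy-based transfer are illustrative, not the paper's implementation:

    /* naive_shuffle.c: each thread pulls its partitions with no coordination. */
    #include <string.h>

    #define N 8                     /* number of threads */
    #define PART_BYTES (1 << 16)    /* uniform partition size, as assumed on the slide */

    /* partition[src][dst]: data produced by thread src, destined for thread dst */
    static char partition[N][N][PART_BYTES];
    /* inbox[dst][src]: where thread dst gathers what it receives from src */
    static char inbox[N][N][PART_BYTES];

    /* Executed independently by every thread `me`: it reads its partition from
     * each of the N-1 producers in the same 0..N-1 order, with no coordination
     * about which memory channel or QPI link is in use at any given moment. */
    void naive_shuffle(int me)
    {
        for (int src = 0; src < N; src++) {
            if (src == me)
                continue;           /* the local partition needs no transfer */
            memcpy(inbox[me][src], partition[src][me], PART_BYTES);
        }
    }

Because every thread walks the producers in the same order, many consumers tend to read from the same socket at the same time, which is one way the uncoordinated traffic ends up bottlenecked on a single path, as the next slide illustrates.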

  9. Shuffling naively in a NUMA system
  [Figure: usage of QPI and memory paths, step by step, under naïve uncoordinated shuffling; the aggregate bandwidth of all channels stays close to the max memory bandwidth of a single channel]
  • BUT we bought 4 memory channels and 6 QPIs
  • Need to orchestrate threads/transfers to utilize the rest

  10. Ring shuffling
  [Figure: two concentric rings; the inner ring holds partitions (s0.p0, s0.p1, …, s3.p0, …), the outer ring holds threads (s0.t0, s0.t1, …, s3.t1)]
  • Devise a global schedule and all threads follow it
    • Inner ring: partitions ordered by thread number, socket; stationary
    • Outer ring: threads ordered by socket, thread number; rotates
  • Can be executed in lock-step or loosely
  • Needs:
    • Thread binding & synchronization
    • Control over the location of memory allocations
  (A sketch of the schedule follows below.)
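A minimal sketch of one possible ring schedule, assuming thread IDs are already grouped by socket so that rotating the outer ring pairs every consumer with a producer on a different socket at each step; the identity layout of the two rings below is illustrative and not necessarily the exact ordering used in the talk:

    /* ring_shuffle.c: a global schedule that every thread follows. */
    #include <string.h>

    #define SOCKETS 4
    #define THREADS_PER_SOCKET 2
    #define N (SOCKETS * THREADS_PER_SOCKET)
    #define PART_BYTES (1 << 16)

    static char partition[N][N][PART_BYTES];   /* [producer][consumer] */
    static char inbox[N][N][PART_BYTES];       /* [consumer][producer] */

    /* Outer ring: position of each thread, ordered by socket, then thread number.
     * Inner ring: which producer's partitions sit at each (stationary) position. */
    static int ring_pos[N];
    static int producer_at[N];

    void init_rings(void)
    {
        for (int t = 0; t < N; t++) {
            ring_pos[t] = t;        /* assumes IDs 0..N-1 are already socket-ordered */
            producer_at[t] = t;
        }
    }

    /* Executed by thread `me`. At step s the outer ring has rotated s positions,
     * so `me` reads from the producer s slots ahead of it; a barrier between
     * steps gives lock-step execution, omitting it gives the loose variant. */
    void ring_shuffle(int me)
    {
        int pos = ring_pos[me];
        for (int step = 1; step < N; step++) {
            int src = producer_at[(pos + step) % N];
            memcpy(inbox[me][src], partition[src][me], PART_BYTES);
            /* barrier(step); */    /* enable for strict lock-step execution */
        }
    }

At every step the position-to-producer mapping is a bijection, so each producer is read by exactly one consumer and the transfers are spread across sockets instead of converging on one memory controller or QPI link.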

  11. Ring shuffling in action
  [Figure: usage of QPI and memory paths, step by step, under ring shuffling; aggregate bandwidth of all channels]
  • Orchestrated traffic utilizes the underlying QPI network
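Slide 10 listed two prerequisites for this orchestration: binding threads to sockets and controlling where partition buffers are physically allocated. A minimal sketch using libnuma (link with -lnuma); numa_available, numa_run_on_node, and numa_alloc_onnode are real libnuma calls, while the surrounding structure is illustrative:

    /* bind_and_allocate.c: pin the caller to a node and give it node-local pages. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Pin the calling thread to `node` and return a buffer whose pages live on
     * that node, so the thread's "local" partition reads really are local. */
    void *bind_and_allocate(int node, size_t bytes)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma: NUMA is not available on this system\n");
            return NULL;
        }
        if (numa_run_on_node(node) != 0) {   /* restrict execution to this node's CPUs */
            perror("numa_run_on_node");
            return NULL;
        }
        void *buf = numa_alloc_onnode(bytes, node);   /* pages placed on `node` */
        if (buf == NULL)
            fprintf(stderr, "numa_alloc_onnode failed\n");
        return buf;                          /* release later with numa_free(buf, bytes) */
    }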

  12. Thread migration instead of shuffling
  [Figure: aggregate bandwidth of all channels]
  • Move computation to the data instead of shuffling it
  • Converts remote accesses into local memory reads
  • Choice of migrating only the thread, or the thread + its state
  • But both are very sensitive to the amount of thread state
  (A libnuma-based sketch follows below.)
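A minimal sketch of the "move the thread to the data" alternative, again using libnuma; the partition and partition_node arrays, and the idea of recording each partition's home node at allocation time, are assumptions for illustration rather than the talk's implementation:

    /* migrate_and_consume.c: hop to the data instead of copying it. */
    #include <numa.h>
    #include <string.h>

    #define N 8
    #define PART_BYTES (1 << 16)

    static char *partition[N];       /* partition[src]: data produced by thread src,
                                        filled in elsewhere by the producing threads */
    static int   partition_node[N];  /* NUMA node recorded when it was allocated */

    /* Thread `me` visits each producer's socket in turn and reads the partition
     * as local memory; only its execution state travels across sockets, which is
     * why the technique is so sensitive to the amount of thread state. */
    void migrate_and_consume(int me)
    {
        char local_copy[PART_BYTES];
        for (int src = 0; src < N; src++) {
            if (src == me)
                continue;
            numa_run_on_node(partition_node[src]);          /* migrate to the data */
            memcpy(local_copy, partition[src], PART_BYTES); /* now a local read */
            /* ... aggregate or join the partition here ... */
        }
    }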

  13. Outline
  • Introduction
  • NUMA 2.0 and related work
  • Data shuffling
  • Evaluation
  • Conclusions

  14. Shuffling benchmark – peak bandwidth
  [Figure: peak shuffle bandwidth, with labeled improvements of ~4x and 3x. Platforms: IBM x3850, 4 sockets x 8 cores, Intel X7650 Nehalem-EX, fully connected QPI; and 2x IBM x3850, 8 sockets x 8 cores, Intel X7650 Nehalem-EX]

  15. Exploiting ring shuffling in joins
  [Figure: sort-merge-based join performance]
  • Implemented the algorithm of Albutiu et al. (sort-merge-based join)
  • Small overall performance improvement, because run time is dominated by the sort

  16. Shuffling vs. migration for aggregation
  [Figure: partitioning-based aggregation performance]
  • Thread migration shows potential when the thread state is small

  17. Conclusions
  • Hardware is a moving target
  • Need for primitives for data management operations
    • Highly optimized SW or HW implementations
    • BLAS for DBMSs
  • Data shuffling can be up to 3x faster if NUMA-aware
    • Needs binding of memory allocations, thread scheduling, …
  • Potential of thread migration
  • Improved overall performance of optimized joins and aggregations
  • Continue investigating primitives, their implementation and exploitation
  • Looking for motivated summer interns! [email ipandis@us.ibm.com]
  Questions???

  18. Backup slides

  19. Shuffling data – scalability
  [Figure: shuffling scalability on IBM x3850, 4 sockets x 8 cores, fully connected QPI]

  20. Shuffling vs. migration for aggregation – breakdown
  [Figure: breakdown for partitioning-based aggregation]

  21. Naïve vs. ring shuffling
  [Figure: iteration-by-iteration usage of QPI and memory paths, naïve uncoordinated shuffling vs. coordinated (ring) shuffling]
