
Optimizing Collective Communication for Multicore



  1. Optimizing Collective Communication for Multicore
     By Rajesh Nishtala

  2. What Are Collectives?
     • An operation called by all threads together to perform globally coordinated communication
     • May involve a modest amount of computation, e.g. to combine values as they are communicated
     • Can be extended to teams (or communicators), in which they operate on a predefined subset of the threads
     • Focus on collectives in Single Program Multiple Data (SPMD) programming models

  3. Some Collectives
     • Barrier (MPI_Barrier()): a thread cannot exit a call to the barrier until all other threads have called the barrier
     • Broadcast (MPI_Bcast()): a root thread sends a copy of an array to all the other threads
     • Reduce-To-All (MPI_Allreduce()): each thread contributes an operand to an arithmetic operation across all the threads; the result is then broadcast to all the threads
     • Exchange (MPI_Alltoall()): for all i, j < N, thread i copies the jth piece of its input array into the ith slot of an output array located on thread j
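
     A minimal MPI sketch exercising these four collectives (buffer sizes and values are illustrative, not from the talk):

        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            int rank, nprocs;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            /* Barrier: nobody leaves until everybody has entered. */
            MPI_Barrier(MPI_COMM_WORLD);

            /* Broadcast: the root (rank 0) sends its copy of buf to everyone. */
            int buf = (rank == 0) ? 42 : 0;
            MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);

            /* Reduce-To-All: every rank contributes an operand; all get the sum. */
            double contrib = rank, sum;
            MPI_Allreduce(&contrib, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            /* Exchange: piece j of rank i's input lands in slot i of rank j's output. */
            int *in = malloc(nprocs * sizeof(int)), *out = malloc(nprocs * sizeof(int));
            for (int j = 0; j < nprocs; j++) in[j] = rank * nprocs + j;
            MPI_Alltoall(in, 1, MPI_INT, out, 1, MPI_INT, MPI_COMM_WORLD);

            printf("rank %d: bcast=%d sum=%.0f\n", rank, buf, sum);
            free(in); free(out);
            MPI_Finalize();
            return 0;
        }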

  4. Why Are They Important?
     • Basic communication building blocks
     • Found in many parallel programming languages and libraries
     • Abstraction: if an application is written with collectives, it passes the responsibility for tuning to the runtime
     (Figure: percentage of runtime spent in collectives)

  5. Experimental Setup
     • Platforms
       • Sun Niagara2: 1 socket of 8 multi-threaded cores; each core supports 8 hardware thread contexts, for 64 total threads
       • Intel Clovertown: 2 "traditional" quad-core sockets
       • BlueGene/P: 1 quad-core socket
     • MPI for inter-process communication: shared-memory MPICH2 1.0.7

  6. Threads v. Processes (Niagara2)
     • Barrier performance: perform a barrier across all 64 threads
     • Threads arranged into processes in different ways: one extreme has one thread per process, while the other has one process with 64 threads
     • MPI_Barrier() called between processes; flat barrier amongst the threads within a process
     • 2 orders of magnitude difference in performance!
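
     A sketch of the kind of timing harness behind this comparison (an assumed benchmark loop, not the talk's code): time many back-to-back MPI_Barrier() calls and report the mean per-barrier latency.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            const int trials = 10000;
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            MPI_Barrier(MPI_COMM_WORLD);              /* warm up and align the start */
            double t0 = MPI_Wtime();
            for (int i = 0; i < trials; i++)
                MPI_Barrier(MPI_COMM_WORLD);
            double dt = MPI_Wtime() - t0;

            if (rank == 0)
                printf("average barrier latency: %.2f us\n", 1e6 * dt / trials);
            MPI_Finalize();
            return 0;
        }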

  7. Threads v. Processes (Niagara2), cont.
     • Other collectives see similar scaling issues when using processes
     • MPI collectives are called between processes, while shared memory is leveraged within a process

  8. Intel Clovertown and BlueGene/P
     • Fewer threads per node
     • Differences are not as drastic, but they are non-trivial
     (Results shown for Intel Clovertown and BlueGene/P)

  9. Optimizing Barrier w/ Trees
     • Leveraging shared memory is a critical optimization
     • Flat trees don't scale; use trees to aid parallelism
     • Requires two passes of a tree
       • First (UP) pass indicates that all threads have arrived
         • Signal my parent when all of my children have arrived
         • Once the root gets the signal from all of its children, all threads have reported in
       • Second (DOWN) pass tells every thread that all threads have arrived
         • Wait for my parent to send me a clear signal
         • Propagate the clear signal down to my children
     (Diagram: radix-2 tree over 16 threads rooted at thread 0)
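
     A compact sketch of this two-pass tree barrier using pthreads and C11 atomics. The heap-indexed binary tree (rather than the k-nomial trees on the next slide), the flag names, and the monotonically increasing phase counters are illustrative choices, not the talk's implementation.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define NTHREADS 8

        static atomic_int arrived[NTHREADS];  /* a child sets this when its whole subtree has arrived  */
        static atomic_int go[NTHREADS];       /* the parent bumps this on the DOWN pass (clear signal)  */
        static int phase_of[NTHREADS];        /* per-thread phase counter, so flags are never reused    */

        static void tree_barrier(int me)
        {
            int phase = ++phase_of[me];

            /* UP pass: wait for all of my children, then report to my parent. */
            for (int c = 2 * me + 1; c <= 2 * me + 2 && c < NTHREADS; c++)
                while (atomic_load(&arrived[c]) < phase) ;          /* spin */
            if (me != 0) {
                atomic_store(&arrived[me], phase);
                /* DOWN pass: wait for the clear signal from my parent. */
                while (atomic_load(&go[me]) < phase) ;
            }
            /* Propagate the clear signal down to my children. */
            for (int c = 2 * me + 1; c <= 2 * me + 2 && c < NTHREADS; c++)
                atomic_store(&go[c], phase);
        }

        static void *worker(void *arg)
        {
            int me = (int)(long)arg;
            for (int i = 0; i < 3; i++) {
                tree_barrier(me);
                if (me == 0) printf("all %d threads passed barrier %d\n", NTHREADS, i);
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t[NTHREADS];
            for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
            for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
            return 0;
        }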

  10. Example Tree Topologies
      (Diagrams of 16-thread trees rooted at thread 0:)
      • Radix 2 k-nomial tree (binomial)
      • Radix 4 k-nomial tree (quadnomial)
      • Radix 8 k-nomial tree (octnomial)
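
      The parent and children of a node in a radix-k k-nomial tree can be computed directly from its id. The sketch below (hypothetical helper names, not the talk's code) reproduces the 16-thread trees for radix 2, 4, and 8.

        #include <stdio.h>

        /* Weight of rank's lowest nonzero base-k digit (n for the root). */
        static int lowest_stride(int rank, int radix, int n)
        {
            if (rank == 0) return n;
            int s = 1;
            while ((rank / s) % radix == 0) s *= radix;
            return s;
        }

        static int knomial_parent(int rank, int radix, int n)
        {
            if (rank == 0) return -1;
            int s = lowest_stride(rank, radix, n);
            return rank - ((rank / s) % radix) * s;   /* clear the lowest nonzero digit */
        }

        static void print_children(int rank, int radix, int n)
        {
            int lim = lowest_stride(rank, radix, n);
            for (int s = 1; s < lim; s *= radix)      /* digit positions below rank's lowest */
                for (int c = 1; c < radix; c++)
                    if (rank + c * s < n) printf(" %d", rank + c * s);
        }

        int main(void)
        {
            const int n = 16;
            int radices[] = { 2, 4, 8 };              /* binomial, quadnomial, octnomial */
            for (int r = 0; r < 3; r++) {
                printf("radix-%d k-nomial tree over %d threads:\n", radices[r], n);
                for (int i = 0; i < n; i++) {
                    printf("  thread %2d: parent %2d, children:", i, knomial_parent(i, radices[r], n));
                    print_children(i, radices[r], n);
                    printf("\n");
                }
            }
            return 0;
        }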

  11. Barrier Performance Results
      • Time many back-to-back barriers
      • Flat tree is just one level, with all threads reporting to thread 0
        • Leverages shared memory but is non-scalable
      • Architecture-independent tree (radix = 2)
        • Pick a generic "good" radix that is suitable for many platforms
        • Mismatched to the architecture
      • Architecture-dependent tree
        • Search over all radices to pick the tree that best matches the architecture

  12. Broadcast Performance Results
      • Time a latency-sensitive Broadcast (8 bytes)
      • Time Broadcast followed by Barrier, and subtract the time for the Barrier
        • Yields an approximation of how long it takes for the last thread to get the data
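
      A sketch of that measurement in MPI terms (an assumed harness, not the talk's code): time Broadcast+Barrier pairs, time Barrier alone, and subtract.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            const int trials = 10000;
            char payload[8] = {0};                    /* 8-byte latency-sensitive broadcast */
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            double t0 = MPI_Wtime();
            for (int i = 0; i < trials; i++) {
                MPI_Bcast(payload, sizeof payload, MPI_CHAR, 0, MPI_COMM_WORLD);
                MPI_Barrier(MPI_COMM_WORLD);
            }
            double bcast_plus_barrier = (MPI_Wtime() - t0) / trials;

            t0 = MPI_Wtime();
            for (int i = 0; i < trials; i++)
                MPI_Barrier(MPI_COMM_WORLD);
            double barrier_only = (MPI_Wtime() - t0) / trials;

            if (rank == 0)
                printf("approximate broadcast completion latency: %.2f us\n",
                       1e6 * (bcast_plus_barrier - barrier_only));
            MPI_Finalize();
            return 0;
        }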

  13. Reduce-To-All Performance Results
      • 4 KB (512 doubles) Reduce-To-All
      • In addition to the data movement, we also want to parallelize the computation
        • In the flat approach, the computation gets serialized at the root
        • Tree-based approaches allow us to parallelize the computation amongst all the floating-point units
        • 8 threads share one FPU, thus radix 2, 4, and 8 trees serialize the computation in about the same way
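
      In MPI terms the benchmarked operation is simply the call below (a sketch with arbitrary buffer contents); whether the library serializes the combines at the root or spreads them over a tree of FPUs is exactly the tuning question these results explore.

        #include <mpi.h>

        #define N 512                                 /* 512 doubles = 4 KB payload */

        int main(int argc, char **argv)
        {
            double in[N], out[N];
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            for (int i = 0; i < N; i++) in[i] = rank + i;   /* this thread's operands */

            /* Every rank contributes N doubles; every rank receives the element-wise sum. */
            MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            MPI_Finalize();
            return 0;
        }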

  14. Optimization Summary
      • Relying on flat trees is not enough for most collectives
      • Architecture-dependent tuning is a further and important optimization

  15. Extending the Results to a Cluster
      • Use one rack of BlueGene/P (1024 nodes, or 4096 cores)
      • Perform the Reduce-To-All by having one representative thread per process make the call to the inter-node all-reduce
        • Reduces the number of messages in the network
      • Vary the number of threads per process but use all the cores
      • Relying purely on shared memory doesn't always yield the best performance
        • The number of active cores working on the computation drops
        • Can optimize so that the computation is partitioned across cores, but that is not suitable for a direct call to MPI_Allreduce()
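
      A hedged sketch of this hierarchical shape, here expressed with OpenMP threads inside MPI processes (an assumed combination chosen only for illustration): threads combine their contributions in shared memory, one representative thread makes the inter-node MPI_Allreduce call, and every thread then reads the result through the shared address space.

        #include <mpi.h>
        #include <omp.h>
        #include <string.h>

        #define N 512

        int main(int argc, char **argv)
        {
            int provided;
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

            static double node_sum[N];                /* shared by all threads in this process */
            memset(node_sum, 0, sizeof node_sum);

            #pragma omp parallel
            {
                double mine[N];
                for (int i = 0; i < N; i++) mine[i] = omp_get_thread_num() + i;

                /* Intra-node reduction in shared memory (serialized here for brevity;
                 * the talk partitions this combine work across cores). */
                #pragma omp critical
                for (int i = 0; i < N; i++) node_sum[i] += mine[i];
                #pragma omp barrier

                /* One representative thread per process does the inter-node collective. */
                #pragma omp master
                MPI_Allreduce(MPI_IN_PLACE, node_sum, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
                #pragma omp barrier

                /* All threads can now read the globally reduced node_sum. */
            }

            MPI_Finalize();
            return 0;
        }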

  16. Potential Synchronization Problem
      1. Broadcast variable x from the root
      2. Have proc 1 set a new value for x on proc 4

         broadcast x=1 from proc 0
         if (myid == 1) {
           put x=5 to proc 4
         } else {
           /* do nothing */
         }

      • Proc 1 thinks the collective is done
      • Put of x=5 by proc 1 has been lost
      • Proc 1 observes a globally incomplete collective
      (Diagram: per-process values of x at each step of the broadcast and the put)
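
      The same hazard rendered as a hedged MPI sketch (a one-sided MPI_Put stands in for the slide's "put x=5 to proc 4"; illustration only, not code from the talk; run with at least 5 ranks). Rank 1 can return from its MPI_Bcast and issue the put before rank 4 has finished, or even entered, its own MPI_Bcast, so the broadcast's write of x=1 on rank 4 can overwrite the put's x=5.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, x = 0;
            MPI_Win win;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Win_create(&x, sizeof x, sizeof x, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

            if (rank == 0) x = 1;
            MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);      /* broadcast x=1 from proc 0 */

            /* MPI_Barrier(MPI_COMM_WORLD);  <- the "strict" fix discussed on the next slide:
             * it forces every rank's broadcast to complete before the put is issued. */

            if (rank == 1) {                                   /* put x=5 to proc 4 */
                int five = 5;
                MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 4, 0, win);
                MPI_Put(&five, 1, MPI_INT, 4, 0, 1, MPI_INT, win);
                MPI_Win_unlock(4, win);
            }

            MPI_Barrier(MPI_COMM_WORLD);
            if (rank == 4)    /* without the fix, x may be 1 (put lost) or 5 */
                printf("proc 4 sees x = %d\n", x);

            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }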

  17. Strict v. Loose Synchronization
      • A fix to the problem: add a barrier before/after the collective
        • Enforces a global ordering of the operations
      • Is there a problem? We want to decouple synchronization from data movement
        • Specify the synchronization requirements
        • Potential to aggregate synchronization
        • Done by the user or a smart compiler
      • How can we realize these gains in applications?

  18. Conclusions
      • Processes → Threads is a crucial optimization for single-node collective communication
      • Can use tree-based collectives to realize better performance, even for collectives on one node
        • Picking the tree that best matches the architecture yields the best performance
      • Multicore adds to the (auto)tuning space for collective communication
      • Shared-memory semantics allow us to create new, loosely synchronized collectives

  19. Questions?

  20. Backup Slides

  21. Threads and Processes
      • Threads
        • A sequence of instructions and an execution stack
        • Communication between threads happens through the common, shared address space
        • No OS/network involvement needed
        • Reasoning about inter-thread communication can be tricky
      • Processes
        • A set of threads and an associated memory space
        • All threads within a process share its address space
        • Communication between processes must be managed through the OS
        • Inter-process communication is explicit but may be slow
        • More expensive to switch between processes
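
      A hedged illustration of this contrast (not from the talk): two threads exchange a value through a plain shared variable, while two processes must go through an OS mechanism, here a pipe.

        #include <pthread.h>
        #include <stdio.h>
        #include <sys/wait.h>
        #include <unistd.h>

        static int shared_value;                 /* visible to every thread in the process */

        static void *thread_writer(void *arg)
        {
            shared_value = 42;                   /* a plain store; no OS involvement */
            return NULL;
        }

        int main(void)
        {
            /* Threads: communication through the common address space. */
            pthread_t t;
            pthread_create(&t, NULL, thread_writer, NULL);
            pthread_join(t, NULL);               /* the join also orders the store before the read */
            printf("thread wrote %d into shared memory\n", shared_value);

            /* Processes: separate address spaces, data crosses via the kernel. */
            int fd[2], v = 0;
            pipe(fd);
            if (fork() == 0) {                   /* the child gets a *copy* of the address space */
                int msg = 99;
                write(fd[1], &msg, sizeof msg);  /* explicit, OS-mediated communication */
                _exit(0);
            }
            read(fd[0], &v, sizeof v);
            wait(NULL);
            printf("child process sent %d through a pipe\n", v);
            return 0;
        }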

  22. Experimental Platforms: Clovertown, Niagara2, BG/P

  23. Specs

  24. Details of Signaling
      • For optimum performance, have many readers and one writer
        • Each thread sets a flag (a single word) that the others will read
        • Every reader gets a copy of the cache line and spins on that copy
        • When the writer comes in and changes the value of the variable, the cache-coherency system handles broadcasting/updating the change
        • Avoid atomic primitives
      • On the way up the tree, a child sets a flag indicating that its subtree has arrived
        • The parent spins on that flag for each child
      • On the way down, each child spins on its parent's flag
        • When it is set, it indicates that the parent wants to broadcast the clear signal down
      • Flags must be on different cache lines to avoid false sharing
      • Need to switch back and forth between two sets of flags
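
      A sketch of the flag layout these bullets describe (assumed structure and names, not the talk's code): one writer and many readers per flag, each flag padded onto its own cache line, and two sets of flags that alternate between consecutive barriers so a flag can be reset without racing against late readers.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define CACHE_LINE 64
        #define NTHREADS   64

        /* Each flag lives on its own cache line so spinning readers never false-share. */
        typedef struct {
            atomic_int val;
            char pad[CACHE_LINE - sizeof(atomic_int)];
        } padded_flag_t;

        /* Two alternating sets of flags: barrier i uses set i % 2, so a flag can be
         * reset for reuse without being cleared under readers still spinning on it. */
        static padded_flag_t arrive[2][NTHREADS];
        static padded_flag_t clear_sig[2][NTHREADS];

        static void signal_flag(padded_flag_t *f) { atomic_store(&f->val, 1); }
        static void wait_flag(padded_flag_t *f)   { while (!atomic_load(&f->val)) ; }

        /* Tiny demo: thread 1 spins on its clear flag; thread 0 (its "parent") sets it. */
        static void *child(void *arg)
        {
            signal_flag(&arrive[0][1]);          /* report that my subtree has arrived */
            wait_flag(&clear_sig[0][1]);         /* spin until the parent clears me     */
            puts("child released");
            return NULL;
        }

        int main(void)
        {
            pthread_t t;
            pthread_create(&t, NULL, child, NULL);
            wait_flag(&arrive[0][1]);            /* parent waits for the child's arrival     */
            signal_flag(&clear_sig[0][1]);       /* parent broadcasts the clear signal down  */
            pthread_join(t, NULL);
            return 0;
        }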
