Optimizing Collective Communication for Multicore By Rajesh Nishtala
What Are Collectives • An operation called by all threads together to perform globally coordinated communication • May involve a modest amount of computation, e.g. to combine values as they are communicated • Can be extended to teams (or communicators), in which case they operate on a predefined subset of the threads • Focus on collectives in Single Program Multiple Data (SPMD) programming models Multicore Collective Tuning
Some Collectives • Barrier (MPI_Barrier()) • A thread cannot exit a call to a barrier until all other threads have called the barrier • Broadcast (MPI_Bcast()) • A root thread sends a copy of an array to all the other threads • Reduce-To-All (MPI_Allreduce()) • Each thread contributes an operand to an arithmetic operation across all the threads • The result is then broadcast to all the threads • Exchange (MPI_Alltoall()) • For all i, j < N, thread i copies the jth piece of its input array into the ith slot of an output array located on thread j Multicore Collective Tuning
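A minimal sketch of these four calls in standard MPI, for concreteness; the buffer sizes and the use of MPI_COMM_WORLD are illustrative choices rather than anything prescribed by the slides:

    /* Sketch of the four collectives named above, using standard MPI. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Barrier: no rank exits until every rank has entered. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Broadcast: the root (rank 0) sends a copy of buf to everyone. */
        double buf[8] = {0};
        MPI_Bcast(buf, 8, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Reduce-To-All: each rank contributes an operand; everyone gets the sum. */
        double local = (double)rank, sum = 0.0;
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* Exchange: rank i sends the j-th block of sendbuf to rank j,
         * which stores it in the i-th block of its recvbuf. */
        double *sendbuf = calloc(nprocs, sizeof(double));
        double *recvbuf = calloc(nprocs, sizeof(double));
        MPI_Alltoall(sendbuf, 1, MPI_DOUBLE, recvbuf, 1, MPI_DOUBLE, MPI_COMM_WORLD);

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }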
Why Are They Important? • Basic communication building blocks • Found in many parallel programming languages and libraries • Abstraction • If an application is written with collectives, it passes the responsibility of tuning to the runtime [Chart: percentage of runtime spent in collectives] Multicore Collective Tuning
Experimental Setup • Platforms • Sun Niagara2 • 1 socket of 8 multi-threaded cores • Each core supports 8 hardware thread contexts, for 64 total threads • Intel Clovertown • 2 “traditional” quad-core sockets • BlueGene/P • 1 quad-core socket • MPI for inter-process communication • Shared-memory MPICH2 1.0.7 Multicore Collective Tuning
Threads v. Processes (Niagara2) • Barrier Performance • Perform a barrier across all 64 threads • Threads arranged into processes in different ways • One extreme has one thread per process, while the other has 1 process with 64 threads • MPI_Barrier() called between processes • Flat barrier amongst the threads within a process • 2 orders of magnitude difference in performance! Multicore Collective Tuning
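A hedged sketch of the hybrid arrangement being compared: T threads per process meet at a flat shared-memory barrier, one representative thread per process calls MPI_Barrier(), and the local threads are released afterwards. The pthread_barrier_t and the MPI_THREAD_FUNNELED initialization are assumptions made for illustration, not details given on the slide:

    /* One hybrid barrier step.  Assumes MPI was initialized with
     * MPI_Init_thread(..., MPI_THREAD_FUNNELED, ...) and that flat_barrier
     * is a pthread_barrier_t initialized for the T threads of this process. */
    #include <mpi.h>
    #include <pthread.h>

    extern pthread_barrier_t flat_barrier;   /* shared by this process's threads */

    void hybrid_barrier(int thread_id) {
        /* Phase 1: all local threads arrive. */
        pthread_barrier_wait(&flat_barrier);

        /* Phase 2: one representative thread synchronizes across processes. */
        if (thread_id == 0)
            MPI_Barrier(MPI_COMM_WORLD);

        /* Phase 3: release the local threads only after the global barrier. */
        pthread_barrier_wait(&flat_barrier);
    }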
Threads v. Processes (Niagara2) cont. • Other collectives see similar scaling issues when using processes • MPI collectives are called between processes, while shared memory is leveraged within a process Multicore Collective Tuning
Intel Clovertown and BlueGene/P • Fewer threads per node • Differences are not as drastic, but they are non-trivial Intel Clovertown BlueGene/P Multicore Collective Tuning
Optimizing Barrier w/ Trees • Leveraging shared memory is a critical optimization • Flat trees don’t scale • Use trees to aid parallelism • Requires two passes of a tree • First (UP) pass indicates that all threads have arrived • Signal parent when all your children have arrived • Once the root gets the signal from all of its children, all threads have reported in • Second (DOWN) pass tells every thread that it may leave the barrier • Wait for parent to send me a clear signal • Propagate clear signal down to my children [Diagram: radix-2 tree over 16 threads rooted at thread 0] Multicore Collective Tuning
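A sketch of the two-pass tree barrier described above, using C11 atomics and a sense-reversing clear flag. The per-thread node_t layout and the arrival counter are illustrative simplifications; the tuned version signals through per-child flags, as described in the backup slides:

    #include <stdatomic.h>

    typedef struct {
        int parent;                  /* -1 at the root                               */
        int num_children;
        atomic_int arrived;          /* UP pass: how many children have reported in  */
        atomic_int clear;            /* DOWN pass: sense flag flipped on release     */
    } node_t;

    extern node_t nodes[];           /* one entry per thread, in shared memory       */

    void tree_barrier(int me) {
        node_t *n = &nodes[me];
        int sense = atomic_load(&n->clear);   /* every node starts an episode with the same sense */

        /* UP: wait for my whole subtree, then report to my parent. */
        while (atomic_load(&n->arrived) != n->num_children) /* spin */ ;
        atomic_store(&n->arrived, 0);         /* reset for the next barrier */
        if (n->parent >= 0) {
            atomic_fetch_add(&nodes[n->parent].arrived, 1);
            /* DOWN: wait for my parent to flip its clear flag. */
            while (atomic_load(&nodes[n->parent].clear) == sense) /* spin */ ;
        }
        /* Release my own subtree by flipping my clear flag. */
        atomic_store(&n->clear, 1 - sense);
    }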
Example Tree Topologies [Diagrams: 16-thread trees rooted at thread 0] Radix 2 k-nomial tree (binomial) • Radix 4 k-nomial tree (quadnomial) • Radix 8 k-nomial tree (octnomial) Multicore Collective Tuning
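For illustration, one way to enumerate a node's children in a radix-r k-nomial tree rooted at thread 0; this construction reproduces the radix-2 diagram, where thread 0's children are 1, 2, 4, 8 and thread 8's are 9, 10, 12. The function name and the printing are illustrative:

    /* Children of node `me` are me + m*stride for m = 1..r-1, at every stride
     * r^0, r^1, ... below the lowest nonzero base-r digit of `me`
     * (the root, me == 0, gets children at every level). */
    #include <stdio.h>

    void knomial_children(int me, int N, int r) {
        for (long stride = 1; stride < N; stride *= r) {
            if (me != 0 && (me / stride) % r != 0)
                break;                        /* reached my lowest nonzero digit */
            for (int m = 1; m < r; m++) {
                long child = me + m * stride;
                if (child < N)
                    printf("thread %d -> child %ld\n", me, child);
            }
        }
    }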
Barrier Performance Results • Time many back-to-back barriers • Flat tree is just one level with all threads reporting to thread 0 • Leverages shared memory but is non-scalable • Architecture-Independent Tree (radix=2) • Pick a generic “good” radix that is suitable for many platforms • Mismatched to the architecture • Architecture-Dependent Tree • Search over all radices to pick the tree that best matches the architecture Multicore Collective Tuning
Broadcast Performance Results • Time a latency-sensitive Broadcast (8 bytes) • Time Broadcast followed by Barrier and subtract the time for Barrier alone • Yields an approximation of how long it takes for the last thread to get the data Multicore Collective Tuning
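A sketch of that measurement, written with plain MPI collectives for concreteness (the slides time the tuned shared-memory implementations); ITERS is an illustrative repetition count:

    #include <mpi.h>

    #define ITERS 10000

    double time_broadcast_8bytes(void) {
        double payload = 0.0;                  /* 8-byte broadcast payload */
        double t0, t_bcast_barrier, t_barrier;

        /* Time (Broadcast + Barrier). */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            MPI_Bcast(&payload, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);
        }
        t_bcast_barrier = (MPI_Wtime() - t0) / ITERS;

        /* Time Barrier alone. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        t_barrier = (MPI_Wtime() - t0) / ITERS;

        /* Approximate time until the last thread has the data. */
        return t_bcast_barrier - t_barrier;
    }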
Reduce-To-All Performance Results • 4 kB (512 doubles) Reduce-To-All • In addition to the data movement, we also want to parallelize the computation • In the flat approach, the computation gets serialized at the root • Tree-based approaches allow us to parallelize the computation amongst all the floating point units • On Niagara2, 8 threads share one FPU, thus radix 2, 4, and 8 serialize the computation in about the same way Multicore Collective Tuning
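A sketch of the UP (reduction) half of such a tree, showing how the additions are spread over internal nodes instead of being serialized at thread 0. The shared contrib/ready arrays and the child lists are assumed to be set up elsewhere; VLEN matches the 512-double payload above:

    #include <stdatomic.h>

    #define VLEN 512
    extern double      contrib[][VLEN];   /* each thread's running partial sum    */
    extern atomic_int  ready[];           /* set once a thread's subtree is done  */

    void reduce_up(int me, const int *children, int nchildren) {
        for (int c = 0; c < nchildren; c++) {
            /* Wait for this child's subtree sum, then fold it into mine. */
            while (!atomic_load(&ready[children[c]])) /* spin */ ;
            for (int k = 0; k < VLEN; k++)
                contrib[me][k] += contrib[children[c]][k];
        }
        /* My subtree's sum now lives in contrib[me]; tell my parent.
         * The root then broadcasts contrib[0] back down the same tree. */
        atomic_store(&ready[me], 1);
    }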
Optimization Summary • Relying on flat trees is not enough for most collectives • Architecture-dependent tuning is a further and important optimization Multicore Collective Tuning
Extending the Results to a Cluster • Use one rack of BlueGene/P (1024 nodes or 4096 cores) • Perform the Reduce-To-All by having one representative thread per process make the call to the inter-node allreduce • Reduces the number of messages in the network • Vary the number of threads per process but use all cores • Relying purely on shared memory doesn’t always yield the best performance • The number of active cores working on the computation drops • Can optimize so that the computation is partitioned across cores • Not suitable for a direct call to MPI_Allreduce() Multicore Collective Tuning
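A hedged sketch of that hybrid scheme: the local threads reduce into one buffer in shared memory, a single representative thread per process calls the inter-node MPI_Allreduce, and all local threads read the result afterwards. The pthread barrier, the MPI_THREAD_FUNNELED initialization, and the serialized node-local reduction at thread 0 are simplifications; the tuned version partitions that local computation across cores:

    #include <mpi.h>
    #include <pthread.h>

    #define VLEN 512
    extern pthread_barrier_t node_barrier;        /* the T threads of this process */
    extern double thread_contrib[][VLEN];         /* one slot per local thread     */
    extern double node_result[VLEN];              /* shared result buffer          */

    void hybrid_allreduce(int tid, int nthreads) {
        pthread_barrier_wait(&node_barrier);      /* all contributions written */

        if (tid == 0) {
            double node_sum[VLEN] = {0};
            for (int t = 0; t < nthreads; t++)    /* node-local reduction */
                for (int k = 0; k < VLEN; k++)
                    node_sum[k] += thread_contrib[t][k];

            /* Only one message per node enters the network. */
            MPI_Allreduce(node_sum, node_result, VLEN, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);
        }
        pthread_barrier_wait(&node_barrier);      /* node_result is now valid for all */
    }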
Potential Synchronization Problem 1. Broadcast variable x from root 2. Have proc 1 set a new value for x on proc 4 broadcast x=1 from proc 0 if(myid==1) { put x=5 to proc 4 } else { /* do nothing*/ } • Proc 1 thinks the collective is done • Put of x=5 by proc 1 has been lost • Proc 1 observes a globally incomplete collective [Animation: per-process values of x during the broadcast, showing proc 1’s put of x=5 to proc 4 overwritten by the late broadcast of x=1] Multicore Collective Tuning
Strict v. Loose Synchronization • A fix to the problem • Add a barrier before/after the collective • Enforces a global ordering of the operations • Is there a problem? • We want to decouple synchronization from data movement • Specify the synchronization requirements • Potential to aggregate synchronization • Done by the user or a smart compiler • How can we realize these gains in applications? Multicore Collective Tuning
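To make the fix concrete, here is the slide's scenario translated into MPI terms (the original uses one-sided puts in a PGAS model): a non-blocking broadcast stands in for a loosely synchronized collective, and completing it plus an explicit barrier before the update enforces the strict ordering. Rank numbers and values mirror the example above; this is an illustrative sketch, not the talk's implementation:

    #include <mpi.h>

    /* Requires at least 5 ranks so that ranks 1 and 4 exist. */
    void strict_broadcast_then_update(int myid) {
        int x = (myid == 0) ? 1 : 0;
        MPI_Request req;

        MPI_Ibcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);   /* loosely synced   */
        MPI_Wait(&req, MPI_STATUS_IGNORE);                     /* strict: complete */
        MPI_Barrier(MPI_COMM_WORLD);                           /* ...and globally ordered */

        if (myid == 1) {
            int five = 5;
            MPI_Send(&five, 1, MPI_INT, 4, /*tag=*/0, MPI_COMM_WORLD);
        } else if (myid == 4) {
            MPI_Recv(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* x is now 5 and cannot be overwritten by a late copy of x=1. */
        }
    }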
Conclusions • Replacing processes with threads is a crucial optimization for single-node collective communication • Can use tree-based collectives to realize better performance, even for collectives on one node • Picking the tree that best matches the architecture yields the best performance • Multicore adds to the (auto)tuning space for collective communication • Shared-memory semantics allow us to create new loosely synchronized collectives Multicore Collective Tuning
Questions? Multicore Collective Tuning
Backup Slides Multicore Collective Tuning
Threads and Processes • Threads • A sequence of instructions and an execution stack • Communication between threads goes through a common, shared address space • No OS/network involvement needed • Reasoning about inter-thread communication can be tricky • Processes • A set of threads and an associated memory space • All threads within a process share its address space • Communication between processes must be managed through the OS • Inter-process communication is explicit but may be slow • More expensive to switch between processes Multicore Collective Tuning
Experimental Platforms Clovertown Niagara2 BG/P Multicore Collective Tuning
Specs Multicore Collective Tuning
Details of Signaling • For optimum performance, have many readers and one writer • Each thread sets a flag (a single word) that others will read • Every reader gets a copy of the cache line and spins on that copy • When the writer comes in and changes the value of the variable, the cache-coherency system handles broadcasting/updating the changes • Avoid atomic primitives • On the way up the tree, a child sets a flag indicating that its subtree has arrived • The parent spins on that flag for each child • On the way down, each child spins on its parent’s flag • When it is set, it indicates that the parent wants to broadcast the clear signal down • Flags must be on different cache lines to avoid false sharing • Need to switch back and forth between two sets of flags Multicore Collective Tuning
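A sketch of the flag layout this describes: one single-word flag per thread, padded to its own cache line so readers spin on a local copy and the coherence protocol propagates the writer's update, with two alternating sets of flags so consecutive barriers don't reuse a flag before everyone has left the previous one. The sizes, names, and use of C11 atomics are illustrative assumptions; resetting the idle set before it is reused is omitted:

    #include <stdatomic.h>

    #define CACHE_LINE  64
    #define MAX_THREADS 64

    typedef struct {
        atomic_int val;
        char pad[CACHE_LINE - sizeof(atomic_int)];   /* avoid false sharing */
    } padded_flag_t;

    /* Two sets of arrival/clear flags, alternated between consecutive barriers. */
    static padded_flag_t arrive_flag[2][MAX_THREADS];
    static padded_flag_t clear_flag[2][MAX_THREADS];

    /* UP: a child tells its parent that its whole subtree has arrived;
     * the parent spins on one flag per child. */
    static inline void signal_arrival(int set, int me)      { atomic_store(&arrive_flag[set][me].val, 1); }
    static inline void wait_for_child(int set, int child)   { while (!atomic_load(&arrive_flag[set][child].val)) ; }

    /* DOWN: the parent broadcasts the clear signal by writing its own flag;
     * every child spins on the parent's flag. */
    static inline void signal_clear(int set, int me)        { atomic_store(&clear_flag[set][me].val, 1); }
    static inline void wait_for_parent(int set, int parent) { while (!atomic_load(&clear_flag[set][parent].val)) ; }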