This presentation describes a method for performing collective operations efficiently in clustered wide-area systems without manual configuration. The method adapts to new topologies when processes are added or removed and avoids high-latency and low-bandwidth links, improving the performance of message passing in WANs. It is implemented for the Phoenix Message Passing System. The talk covers the limitations of existing methods, the proposed method, and experimental results.
Collective Operations for Wide-Area Message Passing Systems Using Adaptive Spanning Trees
Hideo Saito, Kenjiro Taura and Takashi Chikayama (Univ. of Tokyo)
Grid 2005 (Nov. 13, 2005)
Message Passing in WANs
• Increase in bandwidth of Wide-Area Networks
• More opportunities to perform parallel computation using multiple clusters connected by a WAN
• Demands to use message passing for parallel computation in WANs
  • Existing programs written using message passing
  • Familiar programming model
Collective Operations
• Communication operations in which all processors participate
  • E.g., broadcast and reduction
• Importance (Gorlatch et al. 2004)
  • Easier to program using collective operations than with just point-to-point operations (i.e., send/receive)
  • Libraries can provide a faster implementation than with user-level send/recv
    • Libraries can take advantage of knowledge of the underlying hardware
[Figure: broadcast from a root process]
Coll. Ops. in LANs and WANs
• Collective Operations for LANs
  • Optimized under the assumption that all links have the same latency/bandwidth
• Collective Operations for WANs
  • Wide-area links are much slower than local-area links
  • Collective operations need to avoid links w/ high latency or low bandwidth
  • Existing methods use static Grid-aware trees constructed using manually-supplied information
Problems of Existing Methods
• Large-scale, long-lasting applications
• More situations in which…
  • Different computational resources are used upon each invocation
  • Computational resources are automatically allocated by middleware
  • Processors are added/removed after app. startup
• Difficult to manually supply topology information
  • Static trees used in existing methods won’t work
Contribution of Our Work
• Method to perform collective operations
  • Efficient in clustered wide-area systems
  • Doesn’t require manual configuration
  • Adapts to new topologies when processes are added/removed
• Implementation for the Phoenix Message Passing System
• Implementation for MPI (work in progress)
Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work
MPICH
• Thakur et al. 2005
• Assumes that the latency and bandwidth of all links are the same
• Short messages: latency-aware algorithm (binomial tree)
• Long messages: bandwidth-aware algorithm (scatter followed by a ring all-gather)
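For illustration, the standalone C sketch below prints the send schedule of a generic binomial-tree broadcast (the short-message case). It is not MPICH source code, and the process count `nprocs` is an arbitrary example.

```c
/* Illustrative binomial-tree broadcast schedule (not MPICH source code).
 * In round k, every rank r < 2^k that already holds the message sends it
 * to rank r + 2^k, so all ranks are covered in ceil(log2(P)) rounds. */
#include <stdio.h>

int main(void)
{
    int nprocs = 12;                      /* example process count */
    for (int round = 0, span = 1; span < nprocs; round++, span *= 2) {
        printf("round %d:", round);
        for (int r = 0; r < span && r + span < nprocs; r++)
            printf("  %d->%d", r, r + span);
        printf("\n");
    }
    return 0;
}
```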
MagPIe
• Kielmann et al. ’99
• Separate static trees for wide-area and local-area communication
• Broadcast
  • Root sends to the “coordinator node” of each cluster
  • Coordinator nodes perform an MPICH-like broadcast within each cluster
Other Works on Coll. Ops. for WANs
• Other works that rely on manually-supplied info.
  • Network Performance-aware Collective Communication for Clustered Wide Area Systems (Kielmann et al. 2001)
  • MPICH-G2 (Karonis et al. 2003)
Content Delivery Networks
• Application-level multicast mechanisms using topology-aware overlay networks
  • Overcast (Jannotti et al. 2000)
  • SelectCast (Bozdog et al. 2003)
  • Banerjee et al. 2002
• Designed for content delivery; don’t work for message passing
  • Data loss
  • Single source
  • Only 1-to-N operations
Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work
Phoenix
• Taura et al. (PPoPP 2003)
• Phoenix Programming Model
  • Message passing model
  • Programs are written using send/receive
  • Messages are addressed to virtual nodes
  • Strong support for addition/removal of processes during execution
• Phoenix Message Passing Library
  • Message passing library based on the Phoenix Programming Model
  • Basis of our implementation
Addition/Removal of Processes
• Virtual node namespace
  • Messages are addressed to virtual nodes instead of to processes
• API to “migrate” virtual nodes supports addition/removal of processes during execution
[Figure: virtual nodes 0–19, 20–29, 30–39 mapped to processes; a joining process takes over virtual nodes 20–39 via migration, and send(30) is then delivered to the new owner]
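The toy C sketch below illustrates the virtual-node idea under simplified assumptions; the table `owner[]` and the functions `send_to_vnode()` and `migrate()` are made-up names for illustration, not the actual Phoenix API.

```c
/* Toy model of the virtual-node idea (not the actual Phoenix API):
 * messages are addressed to virtual nodes, and a table maps each virtual
 * node to the process currently responsible for it.  "Migrating" virtual
 * nodes rewrites that table, so senders need no knowledge of processes
 * joining or leaving. */
#include <stdio.h>

#define NVNODES 40

static int owner[NVNODES];                /* virtual node -> process id */

static void send_to_vnode(int vnode, const char *msg)
{
    printf("deliver \"%s\" to vnode %d on process %d\n",
           msg, vnode, owner[vnode]);     /* stand-in for a real send */
}

static void migrate(int first, int last, int new_proc)
{
    for (int v = first; v <= last; v++)
        owner[v] = new_proc;              /* hand this range to new_proc */
}

int main(void)
{
    for (int v = 0; v < NVNODES; v++)     /* initial mapping: 0-19 on proc 0, */
        owner[v] = v < 20 ? 0 : (v < 30 ? 1 : 2);  /* 20-29 on 1, 30-39 on 2  */

    send_to_vnode(30, "before join");     /* delivered on process 2 */
    migrate(20, 39, 3);                   /* process 3 joins and takes 20-39 */
    send_to_vnode(30, "after join");      /* same call, now delivered on process 3 */
    return 0;
}
```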
Broadcast in Phoenix
[Diagrams: broadcast in MPI vs. broadcast in Phoenix — the root addresses virtual nodes 0–4, and messages to virtual nodes co-located on one process (e.g., after migration) may be delivered together]
Reduction in Phoenix
[Diagrams: reduction in MPI vs. reduction in Phoenix — values for virtual nodes 0–4 are combined with the reduction operation as they flow toward the root]
Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work
Overview of Our Proposal
• Create topology-aware spanning trees at run-time
  • Latency-aware trees (for short messages)
  • Bandwidth-aware trees (for long messages)
• Perform broadcasts/reductions along the generated trees
• Update the trees when processes are added/removed
Spanning Tree Creation
• Create a spanning tree for each process, w/ that process at the root
• Each process autonomously
  • Measures the RTT (or bandwidth) between itself and randomly selected other processes
  • Searches for a suitable parent for each spanning tree
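A rough, self-contained C sketch of this autonomous parent search follows. The hard-coded RTT matrix stands in for on-the-wire measurements, and only the RTT part of the selection rule is applied here; the full rule (next slide) also compares distances to the root.

```c
/* Skeleton of the autonomous parent search (simplified sketch, not the
 * paper's implementation).  A process repeatedly probes a randomly chosen
 * peer and adopts it as its parent if it looks better than the current
 * one.  RTTs come from a hard-coded matrix instead of real measurements. */
#include <stdio.h>
#include <stdlib.h>

#define NPROCS 6
static const double rtt[NPROCS][NPROCS] = {   /* ms; procs 0-2 and 3-5 form two clusters */
    {0, 1, 1, 50, 50, 50}, {1, 0, 1, 50, 50, 50}, {1, 1, 0, 50, 50, 50},
    {50, 50, 50, 0, 1, 1}, {50, 50, 50, 1, 0, 1}, {50, 50, 50, 1, 1, 0},
};

int main(void)
{
    int me = 4, parent = 0;                    /* start with an arbitrary (remote) parent */
    for (int iter = 0; iter < 20; iter++) {
        int cand = rand() % NPROCS;
        if (cand == me) continue;
        if (rtt[me][cand] < rtt[me][parent]) { /* candidate is closer than current parent */
            printf("iter %d: parent %d -> %d (RTT %.0f -> %.0f ms)\n",
                   iter, parent, cand, rtt[me][parent], rtt[me][cand]);
            parent = cand;
        }
    }
    printf("final parent of %d: %d\n", me, parent);
    return 0;
}
```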
Latency-Aware Trees
• Goal
  • Few edges between clusters
  • Moderate fan-out and depth within clusters
• Parent selection: process p adopts candidate cand as its new parent if
  • RTT(p, cand) < RTT(p, parent), and
  • dist(cand, root) < dist(p, root)
[Diagrams: a process first switches to a closer parent within its cluster; the tree settles with few inter-cluster edges]
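The parent-selection rule can be written as a small predicate. The sketch below assumes dist(x, root) is x's current tree distance to the root, as the inequality suggests; the struct and field names are illustrative, not taken from the implementation.

```c
/* Sketch of the latency-aware parent-selection rule from the slide:
 * a candidate replaces the current parent only if it is closer to this
 * process AND closer to the root, which avoids cycles and keeps the
 * tree shallow. */
#include <stdio.h>

struct peer {
    double rtt_to_me;   /* measured RTT between this process and the peer (ms) */
    double dist_root;   /* the peer's advertised distance to the root (ms)     */
};

static int is_better_parent(struct peer cand, struct peer parent, double my_dist_root)
{
    return cand.rtt_to_me < parent.rtt_to_me    /* RTT(p,cand)  < RTT(p,parent)   */
        && cand.dist_root < my_dist_root;       /* dist(cand,root) < dist(p,root) */
}

int main(void)
{
    struct peer parent = { .rtt_to_me = 50.0, .dist_root = 2.0 };  /* remote parent */
    struct peer cand   = { .rtt_to_me =  1.0, .dist_root = 51.0 }; /* local peer    */
    double my_dist_root = 52.0;                  /* via the current (remote) parent */

    printf("switch parent? %s\n",
           is_better_parent(cand, parent, my_dist_root) ? "yes" : "no");
    return 0;
}
```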
Bandwidth-Aware Trees
• Goal
  • Efficient use of bandwidth during a broadcast/reduction
• Bandwidth estimation
  • est(p, cand) = min(est(cand), bw_p2p / (n_children + 1))
  • est(cand): bandwidth cand itself receives; bw_p2p: point-to-point bandwidth between p and cand; n_children: cand’s current number of children
• Parent selection
  • est(p, cand) > est(p, parent)
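Below is a minimal C sketch of this estimate and the resulting parent choice; the bandwidth figures are invented for the example.

```c
/* Sketch of the bandwidth-aware estimate from the slide: the bandwidth
 * this process would see through a candidate parent is limited both by
 * what the candidate itself receives (est_cand) and by the point-to-point
 * bandwidth bw_p2p split among the candidate's children plus this process.
 * A candidate is adopted if its estimate beats the current parent's. */
#include <stdio.h>

static double est_via(double est_cand, double bw_p2p, int nchildren)
{
    double share = bw_p2p / (nchildren + 1);             /* bw_p2p / (nchildren+1) */
    return est_cand < share ? est_cand : share;          /* min(est_cand, share)   */
}

int main(void)
{
    /* candidate: receives 100 Mbit/s itself, 40 Mbit/s link to us, 1 child already  */
    double via_cand   = est_via(100.0, 40.0, 1);          /* = min(100, 20) = 20     */
    /* current parent: receives 8 Mbit/s itself, 90 Mbit/s link to us, no other child */
    double via_parent = est_via(8.0, 90.0, 0);            /* = min(8, 90)   = 8      */

    printf("est via candidate = %.1f, via parent = %.1f -> %s\n",
           via_cand, via_parent,
           via_cand > via_parent ? "switch" : "keep parent");
    return 0;
}
```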
Broadcast
• Short messages (< 128 KB): forward along a latency-aware tree
• Long messages (> 128 KB): pipeline along a bandwidth-aware tree
• Each message header includes the set of virtual nodes to be forwarded via the receiving process
[Diagram: root process 0 sends to its children with forwarding sets {1, 3, 4} and {2, 5}; the sets shrink to {3}, {4}, {5} further down the tree]
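One possible way to compute the forwarding sets is sketched below: each destination virtual node is assigned to the child whose subtree contains the node's current owner. The parent/owner arrays and the tree shape are made up to reproduce the example sets {1, 3, 4} and {2, 5}; the actual implementation may compute the sets differently.

```c
/* Sketch of the forwarding sets carried in broadcast headers (the data
 * structures are invented for illustration).  The sender splits the set
 * of destination virtual nodes among its children: a virtual node goes
 * into the set of the child whose subtree contains the node's owner. */
#include <stdio.h>

#define NPROCS  6
#define NVNODES 6

static const int parent[NPROCS] = { -1, 0, 0, 1, 1, 2 };  /* tree: 0 is the root  */
static const int owner[NVNODES] = {  0, 1, 2, 3, 4, 5 };  /* virtual node -> proc */

/* Walk up the tree from proc; return the child of `node` that proc sits under,
 * or -1 if proc is node itself (or not below node at all). */
static int child_towards(int node, int proc)
{
    while (proc != -1 && parent[proc] != node)
        proc = parent[proc];
    return proc;
}

int main(void)
{
    int me = 0;                                            /* the root, for example   */
    for (int c = 0; c < NPROCS; c++) {
        if (parent[c] != me) continue;                     /* c is one of my children */
        printf("header for child %d: {", c);
        for (int v = 0; v < NVNODES; v++)                  /* vnodes forwarded via c  */
            if (child_towards(me, owner[v]) == c)
                printf(" %d", v);
        printf(" }\n");
    }
    return 0;
}
```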
Reduction
• Each processor
  • Waits for a message from all of its children
  • Performs the specified operation
  • Forwards the result to its parent
• Timeout mechanism
  • To avoid waiting forever for a child that has already sent its message to another process (e.g., after a parent change)
[Diagram: after a parent change, the old parent times out waiting for p while the new parent receives p’s message]
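The timeout rule can be illustrated with a toy simulation: partial results that arrive before a deadline are folded into the reduction, and late children (e.g., ones that re-parented and sent their value elsewhere) are skipped. The arrival times, values, and deadline below are invented; a real implementation receives these messages over the network.

```c
/* Toy simulation of the reduction timeout rule.  A process combines
 * whatever its children deliver before the deadline and then forwards
 * the partial result to its parent, so a child that has already
 * re-parented elsewhere cannot stall the reduction forever. */
#include <stdio.h>

#define NCHILDREN 3

int main(void)
{
    double deadline = 2.0;                               /* seconds to wait for children */
    double arrival[NCHILDREN] = { 0.3, 1.1, 9.9 };       /* child 2 went to a new parent */
    int    value[NCHILDREN]   = { 4, 7, 100 };           /* children's partial results   */

    int my_value = 5, result = my_value;
    for (int c = 0; c < NCHILDREN; c++) {
        if (arrival[c] <= deadline)
            result += value[c];                          /* apply the reduction op (sum) */
        else
            printf("child %d timed out; reducing without it\n", c);
    }
    printf("forwarding partial result %d to parent\n", result);
    return 0;
}
```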
Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work
Broadcast (1-byte)
• MPI-like broadcast
• Mapped 1 virtual node to each process
• 201 processes in 3 clusters
[Graph: our implementation vs. an MPICH-like (Grid-unaware) implementation and a MagPIe-like (static Grid-aware) implementation]
Broadcast (Long)
• MPI-like broadcast
• Mapped 1 virtual node to each process
• 137 processes in 4 clusters
Reduction
• MPI-like reduction
• Mapped 1 virtual node to each process
• 128 processes in 3 clusters
Addition/Removal of Processes
• Repeated 4-MB broadcasts
  • 160 procs. in 4 clusters
  • Added/removed procs. while broadcasting
• t = 0 [s]: 1 virtual node/process
• t = 60 [s]: remove half of the processes
• t = 90 [s]: re-add the removed processes
[Graph: broadcast performance over time, with markers at the removal (“Rm.”) and re-addition (“Add”) of processes]
Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work
Conclusion
• Presented a method to perform broadcasts and reductions in WANs w/out manual configuration
• Experiments
  • Stable-state broadcast/reduction
    • 1-byte broadcast 3+ times faster than MPICH, w/in a factor of 2 of MagPIe
  • Addition/removal of processes
    • Effective execution resumed 8 seconds after adding/removing processes
Future Work
• Optimize broadcast/reduction
  • Reduce the gap between our method and static Grid-enabled methods
• Other collective operations
  • All-to-all
  • Barrier