This presentation describes a method for performing collective operations efficiently in clustered wide-area systems without manual configuration. The method adapts to new topologies when processes are added or removed and avoids high-latency and low-bandwidth links, improving the performance of message passing in WANs. It is implemented for the Phoenix Message Passing System. The talk covers the limitations of existing methods, the proposed method, and experimental results.
Collective Operations for Wide-Area Message Passing Systems Using Adaptive Spanning Trees
Hideo Saito, Kenjiro Taura and Takashi Chikayama (Univ. of Tokyo)
Grid 2005 (Nov. 13, 2005)
Message Passing in WANs
• Increase in bandwidth of Wide-Area Networks
• More opportunities to perform parallel computation using multiple clusters connected by a WAN
• Demands to use message passing for parallel computation in WANs
  • Existing programs written using message passing
  • Familiar programming model
Collective Operations
• Communication operations in which all processors participate
  • E.g., broadcast and reduction
• Importance (Gorlatch et al. 2004)
  • Easier to program using collective operations than with just point-to-point operations (i.e., send/receive)
  • Libraries can provide a faster implementation than with user-level send/recv
    • Libraries can take advantage of knowledge of the underlying hardware
[Figure: broadcast from a root process]
Coll. Ops. in LANs and WANs
• Collective Operations for LANs
  • Optimized under the assumption that all links have the same latency/bandwidth
• Collective Operations for WANs
  • Wide-area links are much slower than local-area links
  • Collective operations need to avoid links w/ high latency or low bandwidth
  • Existing methods use static Grid-aware trees constructed using manually-supplied information
Problems of Existing Methods
• Large-scale, long-lasting applications
• More situations in which…
  • Different computational resources are used upon each invocation
  • Computational resources are automatically allocated by middleware
  • Processors are added/removed after app. startup
• Difficult to manually supply topology information
  • Static trees used in existing methods won’t work
Contribution of Our Work
• Method to perform collective operations
  • Efficient in clustered wide-area systems
  • Doesn’t require manual configuration
  • Adapts to new topologies when processes are added/removed
• Implementation for the Phoenix Message Passing System
• Implementation for MPI (work in progress)
Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work
MPICH
• Thakur et al. 2005
• Assumes that the latency and bandwidth of all links are the same
• Short messages: latency-aware algorithm (binomial tree)
• Long messages: bandwidth-aware algorithm (scatter followed by a ring all-gather)
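For illustration, the standalone C sketch below prints the send schedule of a generic binomial-tree broadcast (the short-message case). It is not MPICH source code, and the process count `nprocs` is an arbitrary example.

```c
/* Illustrative binomial-tree broadcast schedule (not MPICH source code).
 * In round k, every rank r < 2^k that already holds the message sends it
 * to rank r + 2^k, so all ranks are covered in ceil(log2(P)) rounds. */
#include <stdio.h>

int main(void)
{
    int nprocs = 12;                      /* example process count */
    for (int round = 0, span = 1; span < nprocs; round++, span *= 2) {
        printf("round %d:", round);
        for (int r = 0; r < span && r + span < nprocs; r++)
            printf("  %d->%d", r, r + span);
        printf("\n");
    }
    return 0;
}
```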
MagPIe
• Kielmann et al. ’99
• Separate static trees for wide-area and local-area communication
• Broadcast
  • Root sends to the “coordinator node” of each cluster
  • Coordinator nodes perform an MPICH-like broadcast within each cluster
Other Works on Coll. Ops. for WANs
• Other works that rely on manually-supplied info.
  • Network Performance-aware Collective Communication for Clustered Wide Area Systems (Kielmann et al. 2001)
  • MPICH-G2 (Karonis et al. 2003)
Content Delivery Networks
• Application-level multicast mechanisms using topology-aware overlay networks
  • Overcast (Jannotti et al. 2000)
  • SelectCast (Bozdog et al. 2003)
  • Banerjee et al. 2002
• Designed for content delivery; don’t work for message passing
  • Data loss
  • Single source
  • Only 1-to-N operations
Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work
Phoenix
• Taura et al. (PPoPP 2003)
• Phoenix Programming Model
  • Message passing model
  • Programs are written using send/receive
  • Messages are addressed to virtual nodes
  • Strong support for addition/removal of processes during execution
• Phoenix Message Passing Library
  • Message passing library based on the Phoenix Programming Model
  • Basis of our implementation
Addition/Removal of Processes
• Virtual node namespace
  • Messages are addressed to virtual nodes instead of to processes
• API to “migrate” virtual nodes supports addition/removal of processes during execution
[Figure: virtual nodes 0–19, 20–29, 30–39 mapped to processes; a joining process takes over virtual nodes 20–39 via migration, and send(30) is then delivered to the new owner]
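The toy C sketch below illustrates the virtual-node idea under simplified assumptions; the table `owner[]` and the functions `send_to_vnode()` and `migrate()` are made-up names for illustration, not the actual Phoenix API.

```c
/* Toy model of the virtual-node idea (not the actual Phoenix API):
 * messages are addressed to virtual nodes, and a table maps each virtual
 * node to the process currently responsible for it.  "Migrating" virtual
 * nodes rewrites that table, so senders need no knowledge of processes
 * joining or leaving. */
#include <stdio.h>

#define NVNODES 40

static int owner[NVNODES];                /* virtual node -> process id */

static void send_to_vnode(int vnode, const char *msg)
{
    printf("deliver \"%s\" to vnode %d on process %d\n",
           msg, vnode, owner[vnode]);     /* stand-in for a real send */
}

static void migrate(int first, int last, int new_proc)
{
    for (int v = first; v <= last; v++)
        owner[v] = new_proc;              /* hand this range to new_proc */
}

int main(void)
{
    for (int v = 0; v < NVNODES; v++)     /* initial mapping: 0-19 on proc 0, */
        owner[v] = v < 20 ? 0 : (v < 30 ? 1 : 2);  /* 20-29 on 1, 30-39 on 2  */

    send_to_vnode(30, "before join");     /* delivered on process 2 */
    migrate(20, 39, 3);                   /* process 3 joins and takes 20-39 */
    send_to_vnode(30, "after join");      /* same call, now delivered on process 3 */
    return 0;
}
```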
Broadcast in Phoenix
[Diagrams: broadcast in MPI vs. broadcast in Phoenix — the root addresses virtual nodes 0–4, and messages to virtual nodes co-located on one process (e.g., after migration) may be delivered together]
Reduction in Phoenix
[Diagrams: reduction in MPI vs. reduction in Phoenix — values for virtual nodes 0–4 are combined with the reduction operation as they flow toward the root]
Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work
Overview of Our Proposal
• Create topology-aware spanning trees at run-time
  • Latency-aware trees (for short messages)
  • Bandwidth-aware trees (for long messages)
• Perform broadcasts/reductions along the generated trees
• Update the trees when processes are added/removed
Spanning Tree Creation
• Create a spanning tree for each process, w/ that process at the root
• Each process autonomously
  • Measures the RTT (or bandwidth) between itself and randomly selected other processes
  • Searches for a suitable parent for each spanning tree
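A rough, self-contained C sketch of this autonomous parent search follows. The hard-coded RTT matrix stands in for on-the-wire measurements, and only the RTT part of the selection rule is applied here; the full rule (next slide) also compares distances to the root.

```c
/* Skeleton of the autonomous parent search (simplified sketch, not the
 * paper's implementation).  A process repeatedly probes a randomly chosen
 * peer and adopts it as its parent if it looks better than the current
 * one.  RTTs come from a hard-coded matrix instead of real measurements. */
#include <stdio.h>
#include <stdlib.h>

#define NPROCS 6
static const double rtt[NPROCS][NPROCS] = {   /* ms; procs 0-2 and 3-5 form two clusters */
    {0, 1, 1, 50, 50, 50}, {1, 0, 1, 50, 50, 50}, {1, 1, 0, 50, 50, 50},
    {50, 50, 50, 0, 1, 1}, {50, 50, 50, 1, 0, 1}, {50, 50, 50, 1, 1, 0},
};

int main(void)
{
    int me = 4, parent = 0;                    /* start with an arbitrary (remote) parent */
    for (int iter = 0; iter < 20; iter++) {
        int cand = rand() % NPROCS;
        if (cand == me) continue;
        if (rtt[me][cand] < rtt[me][parent]) { /* candidate is closer than current parent */
            printf("iter %d: parent %d -> %d (RTT %.0f -> %.0f ms)\n",
                   iter, parent, cand, rtt[me][parent], rtt[me][cand]);
            parent = cand;
        }
    }
    printf("final parent of %d: %d\n", me, parent);
    return 0;
}
```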
Latency-Aware Trees
• Goal
  • Few edges between clusters
  • Moderate fan-out and depth within clusters
• Parent selection: process p adopts candidate cand as its new parent if
  • RTT(p, cand) < RTT(p, parent), and
  • dist(cand, root) < dist(p, root)
[Diagrams: a process first switches to a closer parent within its cluster; the tree settles with few inter-cluster edges]
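The parent-selection rule can be written as a small predicate. The sketch below assumes dist(x, root) is x's current tree distance to the root, as the inequality suggests; the struct and field names are illustrative, not taken from the implementation.

```c
/* Sketch of the latency-aware parent-selection rule from the slide:
 * a candidate replaces the current parent only if it is closer to this
 * process AND closer to the root, which avoids cycles and keeps the
 * tree shallow. */
#include <stdio.h>

struct peer {
    double rtt_to_me;   /* measured RTT between this process and the peer (ms) */
    double dist_root;   /* the peer's advertised distance to the root (ms)     */
};

static int is_better_parent(struct peer cand, struct peer parent, double my_dist_root)
{
    return cand.rtt_to_me < parent.rtt_to_me    /* RTT(p,cand)  < RTT(p,parent)   */
        && cand.dist_root < my_dist_root;       /* dist(cand,root) < dist(p,root) */
}

int main(void)
{
    struct peer parent = { .rtt_to_me = 50.0, .dist_root = 2.0 };  /* remote parent */
    struct peer cand   = { .rtt_to_me =  1.0, .dist_root = 51.0 }; /* local peer    */
    double my_dist_root = 52.0;                  /* via the current (remote) parent */

    printf("switch parent? %s\n",
           is_better_parent(cand, parent, my_dist_root) ? "yes" : "no");
    return 0;
}
```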
Bandwidth-Aware Trees
• Goal
  • Efficient use of bandwidth during a broadcast/reduction
• Bandwidth estimation
  • est(p, cand) = min(est(cand), bw_p2p / (n_children + 1))
  • est(cand): bandwidth cand itself receives; bw_p2p: point-to-point bandwidth between p and cand; n_children: cand’s current number of children
• Parent selection
  • est(p, cand) > est(p, parent)
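Below is a minimal C sketch of this estimate and the resulting parent choice; the bandwidth figures are invented for the example.

```c
/* Sketch of the bandwidth-aware estimate from the slide: the bandwidth
 * this process would see through a candidate parent is limited both by
 * what the candidate itself receives (est_cand) and by the point-to-point
 * bandwidth bw_p2p split among the candidate's children plus this process.
 * A candidate is adopted if its estimate beats the current parent's. */
#include <stdio.h>

static double est_via(double est_cand, double bw_p2p, int nchildren)
{
    double share = bw_p2p / (nchildren + 1);             /* bw_p2p / (nchildren+1) */
    return est_cand < share ? est_cand : share;          /* min(est_cand, share)   */
}

int main(void)
{
    /* candidate: receives 100 Mbit/s itself, 40 Mbit/s link to us, 1 child already  */
    double via_cand   = est_via(100.0, 40.0, 1);          /* = min(100, 20) = 20     */
    /* current parent: receives 8 Mbit/s itself, 90 Mbit/s link to us, no other child */
    double via_parent = est_via(8.0, 90.0, 0);            /* = min(8, 90)   = 8      */

    printf("est via candidate = %.1f, via parent = %.1f -> %s\n",
           via_cand, via_parent,
           via_cand > via_parent ? "switch" : "keep parent");
    return 0;
}
```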
Broadcast
• Short messages (< 128 KB): forward along a latency-aware tree
• Long messages (> 128 KB): pipeline along a bandwidth-aware tree
• Each message header includes the set of virtual nodes to be forwarded via the receiving process
[Diagram: root process 0 sends to its children with forwarding sets {1, 3, 4} and {2, 5}; the sets shrink to {3}, {4}, {5} further down the tree]
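One possible way to compute the forwarding sets is sketched below: each destination virtual node is assigned to the child whose subtree contains the node's current owner. The parent/owner arrays and the tree shape are made up to reproduce the example sets {1, 3, 4} and {2, 5}; the actual implementation may compute the sets differently.

```c
/* Sketch of the forwarding sets carried in broadcast headers (the data
 * structures are invented for illustration).  The sender splits the set
 * of destination virtual nodes among its children: a virtual node goes
 * into the set of the child whose subtree contains the node's owner. */
#include <stdio.h>

#define NPROCS  6
#define NVNODES 6

static const int parent[NPROCS] = { -1, 0, 0, 1, 1, 2 };  /* tree: 0 is the root  */
static const int owner[NVNODES] = {  0, 1, 2, 3, 4, 5 };  /* virtual node -> proc */

/* Walk up the tree from proc; return the child of `node` that proc sits under,
 * or -1 if proc is node itself (or not below node at all). */
static int child_towards(int node, int proc)
{
    while (proc != -1 && parent[proc] != node)
        proc = parent[proc];
    return proc;
}

int main(void)
{
    int me = 0;                                            /* the root, for example   */
    for (int c = 0; c < NPROCS; c++) {
        if (parent[c] != me) continue;                     /* c is one of my children */
        printf("header for child %d: {", c);
        for (int v = 0; v < NVNODES; v++)                  /* vnodes forwarded via c  */
            if (child_towards(me, owner[v]) == c)
                printf(" %d", v);
        printf(" }\n");
    }
    return 0;
}
```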
Reduction
• Each processor
  • Waits for a message from all of its children
  • Performs the specified operation
  • Forwards the result to its parent
• Timeout mechanism
  • To avoid waiting forever for a child that has already sent its message to another process (e.g., after a parent change)
[Diagram: after a parent change, the old parent times out waiting for p while the new parent receives p’s message]
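The timeout rule can be illustrated with a toy simulation: partial results that arrive before a deadline are folded into the reduction, and late children (e.g., ones that re-parented and sent their value elsewhere) are skipped. The arrival times, values, and deadline below are invented; a real implementation receives these messages over the network.

```c
/* Toy simulation of the reduction timeout rule.  A process combines
 * whatever its children deliver before the deadline and then forwards
 * the partial result to its parent, so a child that has already
 * re-parented elsewhere cannot stall the reduction forever. */
#include <stdio.h>

#define NCHILDREN 3

int main(void)
{
    double deadline = 2.0;                               /* seconds to wait for children */
    double arrival[NCHILDREN] = { 0.3, 1.1, 9.9 };       /* child 2 went to a new parent */
    int    value[NCHILDREN]   = { 4, 7, 100 };           /* children's partial results   */

    int my_value = 5, result = my_value;
    for (int c = 0; c < NCHILDREN; c++) {
        if (arrival[c] <= deadline)
            result += value[c];                          /* apply the reduction op (sum) */
        else
            printf("child %d timed out; reducing without it\n", c);
    }
    printf("forwarding partial result %d to parent\n", result);
    return 0;
}
```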
Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work
Broadcast (1-byte)
• MPI-like broadcast
• Mapped 1 virtual node to each process
• 201 processes in 3 clusters
[Graph: our implementation vs. an MPICH-like (Grid-unaware) implementation and a MagPIe-like (static Grid-aware) implementation]
Broadcast (Long)
• MPI-like broadcast
• Mapped 1 virtual node to each process
• 137 processes in 4 clusters
Reduction
• MPI-like reduction
• Mapped 1 virtual node to each process
• 128 processes in 3 clusters
Addition/Removal of Processes
• Repeated 4-MB broadcasts
  • 160 procs. in 4 clusters
  • Added/removed procs. while broadcasting
• t = 0 [s]: 1 virtual node/process
• t = 60 [s]: remove half of the processes
• t = 90 [s]: re-add the removed processes
[Graph: broadcast performance over time, with markers at the removal (“Rm.”) and re-addition (“Add”) of processes]
Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work
Conclusion
• Presented a method to perform broadcasts and reductions in WANs w/out manual configuration
• Experiments
  • Stable-state broadcast/reduction
    • 1-byte broadcast 3+ times faster than MPICH, w/in a factor of 2 of MagPIe
  • Addition/removal of processes
    • Effective execution resumed 8 seconds after adding/removing processes
Future Work
• Optimize broadcast/reduction
  • Reduce the gap between our method and static Grid-enabled methods
• Other collective operations
  • All-to-all
  • Barrier