
April 15, 2005 Chikayama-Taura Laboratory 46411 Hideo Saito



  1. Collective Operations for Wide-Area Message Passing Systems Using Dynamically Created Spanning Trees April 15, 2005 Chikayama-Taura Laboratory 46411 Hideo Saito

  2. Background • Opportunities to perform message passing in WANs are increasing [Figure: processors communicating across a WAN]

  3. Message Passing in WANs • WAN → more resources • However, systems designed for LANs do not perform well in WANs • In particular, collective operations (broadcast, reduction) designed for LANs perform very poorly in WANs

  4. Collective Operations • Operations in which all processors participate (cf. point-to-point send/receive) • Ex. broadcast, reduction [Figure: a reduction tree in which each node sums its children's values toward the root]
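The reduction pictured above can be sketched as a recursive fold over a tree. This is an illustration only; the `tree`/`values` dictionaries and `tree_reduce` are hypothetical names, not part of any library discussed here.

```python
import operator

def tree_reduce(tree, values, node, op):
    """Reduce `values` along `tree` (a node -> children dict) rooted at
    `node`, applying `op` pairwise; returns the combined result."""
    result = values[node]
    for child in tree.get(node, []):
        result = op(result, tree_reduce(tree, values, child, op))
    return result

# A 4-node tree: each node folds its children's partial sums into its own value.
tree = {"root": ["a", "b"], "a": ["c"], "b": []}
values = {"root": 1, "a": 2, "b": 3, "c": 4}
print(tree_reduce(tree, values, "root", operator.add))  # 10
```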

  5. Collective Operations in WANs • Topology must be considered for high performance • Manual configuration is undesirable • Processors should be able to join/leave [Figure: several LANs connected by wide-area links]

  6. Objective • To design and implement collective operations • w/ high performance in WANs • w/o manual configuration • w/ support for joining/leaving processors

  7. Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work

  8. Collective Operations of MPICH • MPICH (Thakur et al. 2003) • Latency-aware tree for short messages • Bandwidth-aware tree for long messages [Figure: a binomial tree rooted at the root processor]
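A binomial-tree broadcast over p processors completes in ⌈log₂ p⌉ rounds, doubling the number of informed processors each round. A minimal sketch of that send/receive schedule (a common way to express the short-message algorithm; `binomial_broadcast_schedule` is a hypothetical helper, not MPICH code):

```python
import math

def binomial_broadcast_schedule(p, root=0):
    """Return, per round, the (sender, receiver) pairs of a binomial-tree
    broadcast over p processors rooted at `root`."""
    rounds = []
    n_rounds = math.ceil(math.log2(p)) if p > 1 else 0
    for k in range(n_rounds):
        step = 1 << k
        sends = []
        for src in range(step):
            dst = src + step
            if dst < p:
                # Ranks are taken relative to the root, which holds the data first.
                sends.append(((src + root) % p, (dst + root) % p))
        rounds.append(sends)
    return rounds

# 8 processors: 3 rounds, informing 1 -> 2 -> 4 -> 8 processors.
print(binomial_broadcast_schedule(8))
```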

  9. Collective Operations of MPICH [Figure: long-message broadcast performed as a scatter from the root followed by a ring all-gather]

  10. Collective Operations of MPICH • MPICH assumes that latency and bandwidth are uniform • But latency and bandwidth differ by orders of magnitude between local-area and wide-area links • Collective operations designed for LANs do not perform well in WANs

  11. High-Performance Collective Operations for WANs • MagPIe (Kielmann et al. 1999) • Bandwidth-Efficient Collective Operations (Kielmann et al. 2000) • MPICH-G2 (Karonis et al. 2003) • Manual configuration is necessary • Processors cannot join/leave [Figure: several LANs connected by wide-area links]

  12. Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work

  13. Overview of Our Proposal • Dynamically create/maintain 2 spanning trees (latency-aware and bandwidth-aware) for each processor • Perform collective operations along those trees • Provide a mechanism to support joining/leaving processors • Implement as an extension to the Phoenix Message Passing Library

  14. Phoenix (Taura et al. 2003) • Message passing library for Grids • Not an implementation of MPI, but has its own API • Messages are sent to virtual nodes, not processors [Figure: ph_send(3) delivers a message to whichever processor currently hosts virtual node 3, even as the mapping of virtual nodes {0, 1, 2, 3, 4} to Processors A, B, and C changes]

  15. Latency-Aware Spanning Tree Algorithm • Each processor looks for a suitable parent for each spanning tree using RTT measured at runtime • Goals: a tree that is not too deep, with not too big a fan-out, and with a minimal number of wide-area relationships [Figure: a latency-aware tree spanning several LANs]

  16. Parent Selection • Node n changes parents if both RTT(n,c) < RTT(n,p) and RTT(c,r) < RTT(n,r) • Wide-area relationships are quickly replaced by local-area relationships [Figure: node n with current parent p, candidate c, and root r, annotated with RTT(n,c), RTT(n,p), RTT(c,r), and RTT(n,r)]
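The parent-switch rule on this slide can be written down directly: node n adopts candidate c over its current parent p only when c is both closer to n and closer to the root r. A minimal sketch, where `rtt` is a hypothetical table of round-trip times measured at runtime:

```python
def should_switch_parent(rtt, n, p, c, r):
    """Latency-aware parent selection: node n replaces parent p with
    candidate c iff c is closer to n AND closer to the root r.
    `rtt[(a, b)]` is the measured round-trip time between a and b."""
    return rtt[(n, c)] < rtt[(n, p)] and rtt[(c, r)] < rtt[(n, r)]

# A wide-area parent (p) is quickly replaced by a local-area candidate (c)
# that is also closer to the root:
rtt = {("n", "p"): 100.0,  # wide-area link: n -> current parent
       ("n", "c"): 1.0,    # local-area link: n -> candidate
       ("c", "r"): 2.0,    # candidate -> root
       ("n", "r"): 101.0}  # n -> root
print(should_switch_parent(rtt, "n", "p", "c", "r"))  # True
```

Requiring both conditions prevents oscillation: a nearby candidate that is farther from the root than n itself is rejected, so switches always move the node toward a shallower, more local position.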

  17. Tree Creation within a LAN • Will nodes that are placed too deep move up? Will nodes that are placed too shallow move down? • A force that makes the tree shallower and a force that makes it deeper balance to yield a tree that is neither too deep nor too shallow [Figure: opposing forces acting on the tree within a LAN]

  18. Bandwidth-Aware Spanning Tree Algorithm • Each processor looks for a suitable parent using bandwidth measured at runtime • Processors are placed as far away as possible from the root without sacrificing bandwidth [Figure: a bandwidth-aware tree spanning several LANs]

  19. Parent Selection • Find a parent with high bandwidth to the root • Estimate BW(n-c-r) as min(BW(n,c), BW(c,r)) • When a fan-out grows too large, become a child of a sibling if it does not sacrifice bandwidth, to create long pipes [Figure: node n comparing BW(n-p-r) through its parent p against BW(n-c-r) through candidate c toward root r]
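The bottleneck estimate on this slide is just a min over the path's segments. A minimal sketch, where `bw` is a hypothetical table of measured bandwidths and `better_parent` a hypothetical helper:

```python
def path_bandwidth(bw_first_hop, bw_rest):
    """Estimate the bandwidth of a path as the bottleneck of its segments,
    e.g. BW(n-c-r) = min(BW(n,c), BW(c,r))."""
    return min(bw_first_hop, bw_rest)

def better_parent(bw, n, p, c, r):
    """Bandwidth-aware parent selection: node n prefers candidate c over
    its current parent p when the estimated bottleneck bandwidth of the
    path through c to the root r is strictly higher."""
    via_c = path_bandwidth(bw[(n, c)], bw[(c, r)])
    via_p = path_bandwidth(bw[(n, p)], bw[(p, r)])
    return via_c > via_p

# Becoming a child of a sibling does not sacrifice bandwidth when both
# local segments are faster than the shared wide-area bottleneck:
bw = {("n", "c"): 1000.0, ("c", "r"): 10.0,   # via sibling: bottleneck 10
      ("n", "p"): 1000.0, ("p", "r"): 10.0}   # via parent:  bottleneck 10
print(path_bandwidth(bw[("n", "c")], bw[("c", "r")]))  # 10.0
```

Because the bottleneck is unchanged, a node can slide under a sibling to relieve an overloaded fan-out, producing the long pipes the slide describes.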

  20. Broadcast • Stable topology: the broadcast follows the spanning tree • Changing topology: a remapped virtual node (here, virtual node 5) is reached with a point-to-point message [Figure: broadcast over virtual nodes {0} through {5} in stable and changing topologies]

  21. Reduction • Stable topology: partial results flow up the spanning tree • Changing topology: a processor waiting for a missing virtual node (here, virtual node 5) gives up after a timeout [Figure: reduction over virtual nodes {0} through {5} in stable and changing topologies]
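The timeout behavior on this slide can be sketched as follows: a processor combines whatever child contributions have arrived, and once the deadline passes it stops waiting for virtual nodes that have moved away. This is a simplified model under assumed names (`reduce_with_timeout`, the `arrived` map), not the Phoenix implementation:

```python
import operator

def reduce_with_timeout(arrived, expected_children, own_value, op, deadline, now):
    """Combine own_value with the child contributions that have arrived.
    `arrived` maps virtual-node id -> partial result; children missing
    from it are waited on only while `now` is before `deadline`."""
    missing = [c for c in expected_children if c not in arrived]
    if missing and now < deadline:
        return None, missing        # keep waiting for the missing children
    result = own_value
    for c in expected_children:
        if c in arrived:            # after the timeout, skip virtual nodes
            result = op(result, arrived[c])  # that never reported
    return result, missing

# Virtual node 5 never reports; once past the deadline the reduction proceeds.
result, missing = reduce_with_timeout(
    {3: 7, 4: 2}, expected_children=[3, 4, 5],
    own_value=1, op=operator.add, deadline=60.0, now=61.0)
print(result, missing)  # 10 [5]
```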

  22. Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work

  23. Preliminary Experiments • Latency-aware Spanning Tree Creation (Java Applet) • Stable-state short-message broadcast • Stable-state short-message reduction • Transient-state short-message broadcast

  24. Broadcast (Stable-State) • 1-byte broadcast over 201 processors in 3 clusters [Figure: completion times of the topology-unaware implementation, the topology-aware implementation, and our implementation]

  25. Reduction (Stable-State) • Reduction using 128 processors in 3 clusters [Figure: completion times of the topology-unaware implementation, our implementation, and the topology-aware implementation]

  26. Transient-State Behavior • 201 processors in 3 clusters (1 virtual node per processor) • Repeatedly perform broadcasts • 100 processors leave after 60 secs • virtual nodes are remapped to remaining processors • 100 processors re-join after 30 secs • virtual nodes are given back to original processors

  27. Transient-State Behavior [Figure: broadcast performance over time, with the leave and re-join events marked]

  28. Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work

  29. Conclusion • Designed and implemented latency-aware broadcast and reduction for wide-area networks • Showed that they perform reasonably well in stable topologies • Showed that they support joining/leaving processors • Future Work • Implement bandwidth-aware spanning tree

  30. Publications • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for Wide-Area Message Passing Using Dynamic Spanning Trees. In Symposium on Advanced Computing Systems and Infrastructures. May 2005 (poster paper, to appear). • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Expedite: An Operating System Extension to Support Low-Latency Communication in Non-Dedicated Clusters. In IPSJ Transactions on Advanced Computing Systems. October 2004. • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for the Phoenix Programming Model. In Summer United Workshops on Parallel, Distributed, and Cooperative Processing. July 2004.

  31. Broadcast (Stable-State) • Broadcast over 251 processors in 3 clusters [Figure: completion times of the topology-aware implementation, our implementation, and the topology-unaware implementation]
