April 15, 2005 Chikayama-Taura Laboratory 46411 Hideo Saito

Presentation Transcript

  1. Collective Operations for Wide-Area Message Passing Systems Using Dynamically Created Spanning Trees • April 15, 2005 • Chikayama-Taura Laboratory • 46411 Hideo Saito

  2. Background • Opportunities to perform message passing in WANs are increasing [Figure: WAN]

  3. Message Passing in WANs • WAN → more resources • However, systems designed for LANs do not perform well in WANs • In particular, collective operations (broadcast, reduction) designed for LANs perform very poorly in WANs

  4. Collective Operations • Operations in which all processors participate (cf. send/receive) • Ex. broadcast, reduction [Figure: tree of processors holding small integer values, with a sum (∑) reduction toward the root]
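To make the two example operations concrete, here is a minimal single-process sketch of a broadcast and a sum reduction along a tree. The tree shape, the processor numbering, and the local values are hypothetical and are not taken from the slide's figure.

```python
from typing import Dict, List

# Hypothetical 7-processor tree: parent -> children (processor 0 is the root).
TREE: Dict[int, List[int]] = {0: [1, 2], 1: [3, 4], 2: [5, 6]}


def broadcast(tree: Dict[int, List[int]], root: int, value: str) -> Dict[int, str]:
    """Push `value` from the root down the tree; return what each processor holds."""
    received = {root: value}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            received[child] = received[node]  # one send/receive per tree edge
            stack.append(child)
    return received


def reduce_sum(tree: Dict[int, List[int]], root: int, local: Dict[int, int]) -> int:
    """Combine every processor's local value up the tree; the root gets the total."""
    def subtotal(node: int) -> int:
        # A node adds its own value to the partial sums reported by its children.
        return local[node] + sum(subtotal(c) for c in tree.get(node, []))
    return subtotal(root)


local_values = {0: 0, 1: 1, 2: 2, 3: 1, 4: 0, 5: 2, 6: 1}
print(broadcast(TREE, 0, "hello"))
print(reduce_sum(TREE, 0, local_values))  # 7
```

In both directions every tree edge carries exactly one message, which is why the shape of the tree determines the cost of the operation.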

  5. Collective Operations in WANs • Topology must be considered for high performance • Manual configuration is undesirable • Processors should be able to join/leave [Figure: three LANs]

  6. Objective • To design and implement collective operations • w/ high performance in WANs • w/o manual configuration • w/ support for joining/leaving processors

  7. Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work

  8. Collective Operations of MPICH • MPICH (Thakur et al. 2003) • Latency-aware tree for short messages • Bandwidth-aware tree for long messages [Figure: binomial tree with its root marked]
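The binomial tree named on this slide can be sketched with the usual rank arithmetic for a tree rooted at rank 0. This is an illustration only, not MPICH's source code; the function names are hypothetical.

```python
from typing import List, Optional


def binomial_parent(rank: int) -> Optional[int]:
    """Parent of `rank` in a binomial tree rooted at 0 (clear the lowest set bit)."""
    if rank == 0:
        return None
    return rank & (rank - 1)


def binomial_children(rank: int, size: int) -> List[int]:
    """Children of `rank` among `size` processes, largest subtree first."""
    # Find the lowest set bit of `rank`; for the root, run past `size`.
    mask = 1
    while mask < size and not (rank & mask):
        mask <<= 1
    children = []
    mask >>= 1
    while mask > 0:
        if rank + mask < size:
            children.append(rank + mask)
        mask >>= 1
    return children


size = 8
for r in range(size):
    print(r, binomial_parent(r), binomial_children(r, size))
```

With this layout a short message reaches all `size` processes in about log2(size) rounds, because the number of processes holding the message doubles in every round.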

  9. Collective Operations of MPICH [Figure: scatter from the root, followed by a ring all-gather]
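The scatter and ring all-gather labels on this slide correspond to a common long-message broadcast: the root splits the message into one chunk per processor, scatters the chunks, and the processors then pass chunks around a ring until everyone has all of them. The sketch below simulates that in one process; the function name and the chunking scheme are assumptions made for illustration.

```python
from typing import Dict, List


def scatter_ring_allgather(message: bytes, p: int) -> List[bytes]:
    """Simulate a broadcast of `message` to `p` processors via scatter + ring all-gather."""
    chunk = -(-len(message) // p)  # ceiling division: bytes per chunk
    chunks = [message[i * chunk:(i + 1) * chunk] for i in range(p)]

    # Scatter: processor i initially holds only chunk i.
    held: List[Dict[int, bytes]] = [{i: chunks[i]} for i in range(p)]

    # Ring all-gather: in step s, processor i forwards chunk (i - s) mod p
    # (the one it obtained in the previous step) to its right neighbour.
    for step in range(p - 1):
        sends = [((i + 1) % p, (i - step) % p) for i in range(p)]
        payloads = [held[i][idx] for i, (_, idx) in enumerate(sends)]
        for (dest, idx), data in zip(sends, payloads):
            held[dest][idx] = data

    # Every processor reassembles the full message from its chunks.
    return [b"".join(h[k] for k in range(p)) for h in held]


msg = b"collective operations"
print(all(m == msg for m in scatter_ring_allgather(msg, 4)))
```

Each processor sends and receives only about one message's worth of data in total, which is what makes this scheme bandwidth-aware compared with forwarding the whole message down a tree.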

  10. Collective Operations of MPICH • MPICH assumes that latency and bandwidth are uniform • However, latency and bandwidth differ by orders of magnitude between local-area and wide-area links • Collective operations for LANs do not perform well in WANs

  11. High-Performance Collective Operations for WANs • MagPIe (Kielmann et al. 1999) • Bandwidth-Efficient Collective Operations (Kielmann et al. 2000) • MPICH-G2 (Karonis et al. 2003) • Manual configuration necessary • Processors cannot join/leave [Figure: three LANs]

  12. Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work

  13. Overview of Our Proposal • Dynamically create/maintain 2 spanning trees (latency-aware and bandwidth-aware) for each processor • Perform collective operations along those trees • Provide a mechanism to support joining/leaving processors • Implement as an extension to the Phoenix Message Passing Library

  14. Phoenix (Taura et al. 2003) • Message passing library for Grids • Not an impl. of MPI, but has its own API • Messages are sent to virtual nodes, not processors [Figure: ph_send(3) shown under two assignments of virtual nodes: Processor A {0, 1, 2} and Processor B {3, 4}; then Processor A {0, 1, 2}, Processor B {4}, and Processor C {3}]
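The idea of addressing virtual nodes rather than processors can be sketched as a lookup table that is updated when a virtual node moves. This is not the Phoenix API: the slide shows only the call `ph_send(3)`, so the class and method names below are hypothetical, and the example assignment is one reading of the slide's figure labels.

```python
from typing import Dict, Set


class VirtualNodeMap:
    """Hypothetical table mapping each virtual node to the processor holding it."""

    def __init__(self, assignment: Dict[str, Set[int]]) -> None:
        self.owner: Dict[int, str] = {
            vnode: proc for proc, vnodes in assignment.items() for vnode in vnodes
        }

    def deliver(self, vnode: int) -> str:
        """Return the processor that would receive a message sent to `vnode`."""
        return self.owner[vnode]

    def migrate(self, vnode: int, new_proc: str) -> None:
        """Hand `vnode` over to another processor (e.g. one that just joined)."""
        self.owner[vnode] = new_proc


vmap = VirtualNodeMap({"A": {0, 1, 2}, "B": {3, 4}})
print(vmap.deliver(3))   # B
vmap.migrate(3, "C")     # a new processor C takes over virtual node 3
print(vmap.deliver(3))   # C
```

The sender's code is the same before and after the migration, which is what lets processors join and leave without the application changing its destination addresses.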

  15. Latency-Aware Spanning Tree Algorithm • Each processor looks for a suitable parent for each spanning tree using RTT measured at runtime [Figure: root and three LANs; callouts: "Not too deep, not too big a fan-out", "Minimal number of wide-area relationships"]

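  16. Parent Selection • Change parents if both $RTT_{n,c} < RTT_{n,p}$ AND $RTT_{c,r} < RTT_{n,r}$ (in the figure, n is the node, p its current parent, c the candidate parent, and r the root) • Wide-area relationships are quickly replaced by local-area relationships [Figure: nodes r, p, c, n in two LANs, with edges labeled $RTT_{c,r}$, $RTT_{n,r}$, $RTT_{n,p}$, $RTT_{n,c}$]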
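The parent-change rule on this slide can be written as a two-part predicate. The sketch below assumes n is the node, p its current parent, c a candidate parent, and r the root, as the figure labels suggest; the RTT values are made-up numbers for two LANs, and the function names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical RTT measurements in milliseconds (symmetric, so keyed by pair).
RTT_MS: Dict[FrozenSet[str], float] = {
    frozenset(("n", "p")): 10.2,  # wide-area link to the current parent
    frozenset(("n", "c")): 0.1,   # candidate is in the same LAN as n
    frozenset(("n", "r")): 10.0,
    frozenset(("c", "r")): 9.8,   # candidate is slightly closer to the root
    frozenset(("p", "r")): 0.1,
}


def rtt(a: str, b: str) -> float:
    return RTT_MS[frozenset((a, b))]


def should_switch_parent(rtt: Callable[[str, str], float],
                         n: str, p: str, c: str, r: str) -> bool:
    """True if node n should replace parent p with candidate c (root is r)."""
    closer_to_me = rtt(n, c) < rtt(n, p)      # candidate is nearer than current parent
    closer_to_root = rtt(c, r) < rtt(n, r)    # candidate is nearer to the root than n is
    return closer_to_me and closer_to_root


print(should_switch_parent(rtt, "n", "p", "c", "r"))  # True: n adopts c
print(should_switch_parent(rtt, "c", "r", "n", "r"))  # False: c does not adopt n
```

In this example the second condition is what stops c from choosing n in return, so the two nodes cannot become each other's parents; whether that is the authors' stated motivation for the condition is not given on the slide.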

  17. Tree Creation within a LAN • Will nodes that are placed too deep move up? • Will nodes that are placed too shallow move down? • Force that makes tree shallower • Force that makes tree deeper • Tree that is not too deep, not too shallow

  18. Bandwidth-Aware Spanning Tree Algorithm • Each processor looks for a suitable parent using bandwidth measured at runtime • Place processors as far away as possible from the root without sacrificing bandwidth [Figure: three LANs]

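  19. Parent Selection • Find a parent with high bandwidth to the root • Estimate $BW_{n-c-r}$ as $\min(BW_{n-c}, BW_{c-r})$ • Become a child of a sibling if it does not sacrifice bandwidth, to create long pipes [Figure: node n choosing between parent p (path bandwidth $BW_{n-p-r}$) and candidate c (path bandwidth $BW_{n-c-r}$) toward root r; callout: "Fan-out too large!"]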
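The bandwidth estimate on this slide treats a two-hop path as only as fast as its slower hop. A minimal sketch of that estimate and of picking the candidate with the best estimated bandwidth to the root follows; the bandwidth figures (in Mbps) and the function names are hypothetical, and the slide's preference for siblings is left out.

```python
from typing import Dict, Tuple


def path_bandwidth(bw_n_c: float, bw_c_r: float) -> float:
    """Estimate the bandwidth from n to the root via c as the bottleneck hop."""
    return min(bw_n_c, bw_c_r)


def best_parent(candidates: Dict[str, Tuple[float, float]]) -> str:
    """Pick the candidate whose estimated path bandwidth to the root is highest.

    `candidates` maps a candidate name to (bandwidth n to candidate,
    bandwidth candidate to root).
    """
    return max(candidates, key=lambda c: path_bandwidth(*candidates[c]))


measured = {
    "p": (100.0, 1000.0),  # slow link to p, fast link from p to the root
    "c": (900.0, 800.0),   # fast link to c, reasonably fast link onward
}
for name, (first_hop, second_hop) in measured.items():
    print(name, path_bandwidth(first_hop, second_hop))
print(best_parent(measured))  # c
```

Here p is limited to 100 Mbps by its first hop while c sustains 800 Mbps end to end, so c is the better parent even though p's own link to the root is faster.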

  20. Broadcast [Figure, Stable Topology: virtual nodes {0} {1} {2} {3} {4} {5}] [Figure, Changing Topology: virtual-node sets {0} {1, 3, 4} {2, 5} {2, 5} {1} {2} {3, 4} {5} {4} {3,5} {4}; callout: "point-to-point message to virtual node 5"]

  21. Reduction [Figure, Stable Topology: virtual nodes {0} {1} {2} {3} {4} {5}] [Figure, Changing Topology: virtual nodes {0} {1} {2} {3,5} {4}; callouts: "waiting for virtual node 5…", "timeout"]

  22. Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work

  23. Preliminary Experiments • Latency-aware Spanning Tree Creation (Java Applet) • Stable-state short-message broadcast • Stable-state short-message reduction • Transient-state short-message broadcast

  24. Broadcast (Stable-State) • 1-byte broadcast over 201 processors in 3 clusters [Figure: plot comparing the topology-unaware implementation, the topology-aware implementation, and our implementation]

  25. Reduction (Stable-State) • Reduction using 128 processors in 3 clusters [Figure: plot comparing the topology-unaware implementation, our implementation, and the topology-aware implementation]

  26. Transient-State Behavior • 201 processors in 3 clusters (1 virtual node per processor) • Repeatedly perform broadcasts • 100 processors leave after 60 secs • virtual nodes are remapped to remaining processors • 100 processors re-join after 30 secs • virtual nodes are given back to original processors

  27. Transient-State Behavior [Figure: plot with the "Leave" and "Join" events marked]

  28. Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work

  29. Conclusion • Designed and implemented latency-aware broadcast and reduction for wide-area networks • Showed that they perform reasonably well in stable topologies • Showed that they support joining/leaving processors • Future Work • Implement bandwidth-aware spanning tree

  30. Publications • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for Wide-Area Message Passing Using Dynamic Spanning Trees (in Japanese). In Symposium on Advanced Computing Systems and Infrastructures. May 2005 (poster paper, to appear). • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Expedite: An Operating System Extension to Support Low-Latency Communication in Non-Dedicated Clusters. In IPSJ Transactions on Advanced Computing Systems. October 2004. • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for the Phoenix Programming Model. In Summer United Workshops on Parallel, Distributed, and Cooperative Processing. July 2004.

  31. Broadcast (Stable-State) • Broadcast over 251 processors in 3 clusters [Figure: plot comparing the topology-aware implementation, our implementation, and the topology-unaware implementation]