
High Performance Broadcast Support in LA-MPI over Quadrics

This presentation covers the problem statement, design challenges and implementation, performance evaluation, conclusions, and future work for using the Quadrics hardware broadcast to provide efficient and scalable broadcast support in LA-MPI. It emphasizes the goals of end-to-end reliability and reduced buffer-management overhead.


Presentation Transcript


  1. High Performance Broadcast Support in LA-MPI over Quadrics W. Yu, S. Sur, D.K. Panda, R.T. Aulwes† and R.L. Graham† Dept. of Computer Science, The Ohio State University; Advanced Computing Lab†, Los Alamos, NM 87545

  2. Presentation Outline • Problem Statement and Goals • Design Challenges and Implementation • Performance Evaluation • Conclusions and Future Work

  3. LA-MPI • The Los Alamos Message Passing Interface (LA-MPI) • Provide end-to-end reliable message passing • Protect against network errors • Protect against I/O bus errors • Concurrent message passing over multiple interconnects • Message striping over multiple network interface cards • Supported platforms • Operating systems: TRU64, Linux, Irix, MAC-OSX (32 and 64-bit) • Communication protocols: Shared Memory, UDP, HIPPI-800, Quadrics, Myrinet (GM), InfiniBand (ongoing)

  4. LA-MPI Architecture

  5. Point-to-Point Communication [Figure: point-to-point message flow, showing send and receive descriptors bound over multiple network paths, fragments protected by CRC/checksum, ACK/NACK processing with received-fragment records, and timer-driven retransmission]

  6. LA-MPI Broadcast [Figure: generic tree-based broadcast]

  7–9. Quadrics Hardware Broadcast [Figure sequence illustrating the steps of a Quadrics hardware broadcast]

  10. Quadrics Hardware Broadcast • Benefits • Efficient, scalable, and reliable • Limitations • The receive address must be global • Receiving processes must be on contiguous nodes • Existing broadcast implementation that makes use of the hardware broadcast: Elanlib

  11. Research Goals • Can we make use of the hardware broadcast to provide efficient and scalable broadcast support in LA-MPI while achieving the goal of end-to-end reliability? • Acknowledgments from receivers (after verifying the CRC) must be collected to ensure reliability • Reduce the overhead of buffer management • Raw hardware broadcast latency: ~3.3us • Elanlib broadcast latency: ~8.5us • So Elanlib adds ~5us of overhead on top of the raw hardware broadcast • Maintain the high performance and scalability of the hardware broadcast

  12. Presentation Outline • Problem Statement and Goals • Design Challenges and Implementation • Performance Evaluation • Conclusions and Future Work

  13. Challenges • Memory management for global buffers • Broadcast over processes on non-contiguous nodes • Synchronization and acknowledgement • Retransmission and Reliability

  14. Global Buffer Management • Global buffers must be consistent across processes • Option 1: use a global allocator to provide global buffers on demand • Hard to manage, with a low buffer-reuse rate • Can satisfy a large number of requests • Option 2: maintain a static number of fixed-size global channels • Easy to manage, with a high reuse rate • Needs more frequent synchronization on the use of channels
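
  To make the second option concrete, the following C sketch shows what a static pool of fixed-size global channels might look like. All names and sizes here (bcast_channel_t, NUM_CHANNELS, CHANNEL_SIZE) are illustrative assumptions, not LA-MPI's actual data structures.

    #include <stdint.h>

    /* Hypothetical fixed pool of global broadcast channels. Every process
     * maps the same channels at the same global virtual address, so a
     * hardware-broadcast RDMA into channel i lands in the same buffer on
     * every node. Names and sizes are illustrative only. */
    #define NUM_CHANNELS 64           /* static number of channels  */
    #define CHANNEL_SIZE (16 * 1024)  /* fixed payload size (16 KB) */

    typedef struct {
        volatile uint32_t seqno;      /* broadcast sequence number   */
        volatile uint32_t crc;        /* CRC/checksum of the payload */
        volatile uint32_t len;        /* valid payload length        */
        uint8_t payload[CHANNEL_SIZE];
    } bcast_channel_t;

    typedef struct {
        bcast_channel_t channel[NUM_CHANNELS]; /* globally mapped buffers */
        int next;                     /* next channel (round robin)       */
        int since_sync;               /* broadcasts since the last sync   */
    } bcast_channel_pool_t;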

  15. Single Communicator • A communicator must recycle its global channels • Synchronize before the use of a channel • Synchronize after the use of a channel • Synchronize when the global buffers are about to be used up • Reduce the frequency of synchronization • Amortize the cost of synchronization across multiple operations
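
  Continuing the hypothetical pool above, channel recycling with amortized synchronization might look like this sketch; comm_sync() is a stand-in for whatever barrier or ACK-collection mechanism confirms that all ranks have consumed the outstanding channels.

    /* Recycle channels round robin and synchronize only when the pool is
     * about to wrap, so one synchronization is amortized over up to
     * NUM_CHANNELS broadcasts. comm_sync() is a placeholder. */
    extern void comm_sync(void);

    static bcast_channel_t *acquire_channel(bcast_channel_pool_t *pool)
    {
        if (pool->since_sync == NUM_CHANNELS) {
            comm_sync();          /* all ranks agree the pool is reusable */
            pool->since_sync = 0;
        }
        bcast_channel_t *ch = &pool->channel[pool->next];
        pool->next = (pool->next + 1) % NUM_CHANNELS;
        pool->since_sync++;
        return ch;
    }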

  16. Multiple Communicators • Global buffers must be recycled across different communicators • Typically a small number of concurrent communicators • Communicators tend to be disjoint • Our solution: 8 sets of global buffers, one reserved for COMM_WORLD • A new communicator performs an Allreduce() to find the list of buffer sets available on all processes and takes the first available one
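
  One way to implement that selection is an MPI_Allreduce with a bitwise AND over per-rank masks of free buffer sets; the sketch below is an assumption about the mechanism, not LA-MPI's actual code.

    #include <mpi.h>

    /* Each rank contributes a bitmask of the buffer sets it sees as free;
     * a bitwise-AND Allreduce yields the sets free on every rank, and the
     * new communicator takes the lowest one. Returns -1 if none is free
     * (a real implementation could then fall back to the generic
     * tree-based broadcast). */
    static int pick_buffer_set(MPI_Comm comm, unsigned local_free_mask)
    {
        unsigned global_free_mask = 0;

        MPI_Allreduce(&local_free_mask, &global_free_mask, 1,
                      MPI_UNSIGNED, MPI_BAND, comm);

        for (int set = 0; set < 8; set++)   /* 8 sets of global buffers */
            if (global_free_mask & (1u << set))
                return set;                 /* first available set */
        return -1;
    }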

  17. Challenges • Memory management for global buffers • Broadcast over processes on non-contiguous nodes • Synchronization and acknowledgement • Retransmission and Reliability

  18. Broadcast over Non-contiguous Nodes • To make use of hardware broadcast: group processes into sets of contiguous nodes, called broadcast segments • Approach #1, linearly chained broadcast RDMAs: the root performs a broadcast RDMA to each segment • Not scalable • Completely distributed topology: the formation of broadcast segments by one node is transparent to all other nodes
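
  In outline form, approach #1 reduces to a loop over the segments; bcast_rdma() below is a hypothetical stand-in for a Quadrics hardware-broadcast RDMA covering one contiguous node range, not a real elanlib call.

    #include <stddef.h>

    typedef struct { int first_node, last_node; } bcast_segment_t;

    /* Hypothetical hardware-broadcast RDMA over one contiguous node range. */
    extern void bcast_rdma(int first_node, int last_node,
                           const void *buf, size_t len);

    /* Approach #1: the root issues one hardware broadcast per contiguous
     * segment, so the cost grows linearly with the number of segments. */
    static void linear_chained_bcast(const bcast_segment_t *seg, int nsegs,
                                     const void *buf, size_t len)
    {
        for (int i = 0; i < nsegs; i++)
            bcast_rdma(seg[i].first_node, seg[i].last_node, buf, len);
    }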

  19. Tree-Based Chained Broadcast RDMAs • Approach #2 (tree-based chaining): broadcast to the largest broadcast segment; each process that receives the data broadcasts to another broadcast segment • Requires a more sophisticated topology • Different trees are needed for different roots
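
  A minimal sketch of the tree-based variant, reusing the hypothetical bcast_rdma() above: segments are assumed sorted largest first, and in each round every covered segment forwards the data to one uncovered segment, so coverage roughly doubles per round (logarithmic depth rather than the linear chain of approach #1). Completion handling and the choice of one forwarding process per segment are omitted.

    /* Tree-based chaining (approach #2), heavily simplified: the root
     * broadcasts to segment 0 (the largest); then, round by round,
     * segment m forwards to segment m + covered until every segment
     * holds the data. A real implementation would wait for the incoming
     * data before forwarding. */
    static void tree_chained_bcast(const bcast_segment_t *seg, int nsegs,
                                   int my_segment, int is_root,
                                   const void *buf, size_t len)
    {
        if (is_root)
            bcast_rdma(seg[0].first_node, seg[0].last_node, buf, len);

        for (int covered = 1; covered < nsegs; covered *= 2) {
            if (my_segment < covered && my_segment + covered < nsegs) {
                const bcast_segment_t *dst = &seg[my_segment + covered];
                bcast_rdma(dst->first_node, dst->last_node, buf, len);
            }
        }
    }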

  20. Synchronization and Acknowledgments • Delayed synchronization for small messages • Buffer the message at the broadcast channels • Trigger broadcast RDMA(s) to send the message • Synchronize the processes after a number of operations • Amortize the synchronization cost across multiple operations • With delayed synchronization, all nodes must be notified of the final status of the used channels • For large messages (>16KB), synchronize processes at the completion of each broadcast to avoid the cost of buffering the message
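
  Putting the two cases together, a hypothetical send-side policy might look like the following, reusing acquire_channel() and comm_sync() from the earlier sketches; the 16KB threshold is the one named on the slide, while everything else is illustrative.

    #include <string.h>

    #define LARGE_MSG_THRESHOLD (16 * 1024)

    /* Small messages are staged in a broadcast channel with delayed,
     * amortized synchronization; large messages bypass the channels and
     * synchronize at the completion of each broadcast, avoiding the cost
     * of buffering the payload. */
    static void bcast_send(bcast_channel_pool_t *pool,
                           const void *buf, size_t len)
    {
        if (len <= LARGE_MSG_THRESHOLD) {
            bcast_channel_t *ch = acquire_channel(pool); /* delayed sync */
            memcpy(ch->payload, buf, len);
            ch->len = (uint32_t)len;
            /* ... trigger the broadcast RDMA(s) on this channel ... */
        } else {
            /* ... broadcast directly from the user buffer ... */
            comm_sync();   /* per-operation sync for large messages */
        }
    }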

  21. Synchronization Approaches • Hardware barrier • Efficient and scalable • Not available for non-contiguous nodes • May generate too much broadcast traffic • Tree-based synchronization • One process as the manager for a communicator • ACKs are propagated to the manager through chained RDMA • NACKs are generated to the manager directly

  22. Retransmission and Reliability • Reliability against two kinds of errors • I/O bus errors: retransmit the data • Network errors, e.g., card failures: fail over to the tree-based broadcast, which is built on top of point-to-point communication and is end-to-end reliable • Retransmission • A timestamp is created with each broadcast request • Retransmit the data when the timer goes off or a NACK is detected • If a card failure is suspected, fail over to the tree-based broadcast
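
  A rough sketch of the retransmission bookkeeping this implies; the timeout, retry limit, and helper names are illustrative assumptions, not LA-MPI's actual values or API.

    /* Each outstanding broadcast records a timestamp; a progress routine
     * retransmits on timeout or NACK and, after repeated failures that
     * suggest a card failure, falls back to the generic tree-based
     * broadcast, which is end-to-end reliable over point-to-point paths. */
    typedef struct {
        double sent_at;    /* time of the last (re)transmission */
        int    retries;
        int    nacked;     /* set when a NACK has arrived       */
    } bcast_request_t;

    #define RETRANS_TIMEOUT 0.5  /* seconds (illustrative)            */
    #define MAX_RETRIES     4    /* beyond this, suspect card failure */

    extern void retransmit(bcast_request_t *req);           /* re-issue RDMA */
    extern void fallback_tree_bcast(bcast_request_t *req);

    static void progress_bcast(bcast_request_t *req, double now)
    {
        if (req->nacked || now - req->sent_at > RETRANS_TIMEOUT) {
            if (++req->retries > MAX_RETRIES) {
                fallback_tree_bcast(req);
                return;
            }
            retransmit(req);
            req->sent_at = now;
            req->nacked  = 0;
        }
    }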

  23. Broadcast Message Flow Path

  24. Presentation Outline • Problem Statement and Goals • Design Challenges and Implementation • Performance Evaluation • Conclusions and Future Work

  25. Experiment Testbeds • 256-node quad-1.25GHz Alpha TRU64 cluster at LANL • 8-node quad-700MHz Linux cluster at OSU • Both are equipped with Elan3 QM-400 cards • Evaluated MPI implementations • LA-MPI • MPICH • HP's Alaska

  26. Performance Evaluation • Performance tests • Broadcast latency • Broadcast latency with SMP support • Scalability • Impact of the number of broadcast channels • Cost of reliability

  27. Broadcast Latency • Reduce the broadcast latency compared to the generic broadcast implementation • Achieve 4-byte broadcast latency of 3.5us over 8 nodes • Low overhead for buffer recycling and acknowledgments

  28. SMP Support • Achieve 4-byte broadcast latency of 7.1us over 256 processes • Achieve better performance for small messages compared to that of MPICH and HP’s Alaska, without using hardware barrier

  29. Scalability • Achieve better scalability compared to the generic algorithm • Good scalability while achieving high performance

  30. Broadcast Channels • The synchronization cost is about 13us • With a large number of broadcast channels, this cost is amortized across many broadcast operations; amortized over 64 broadcasts, for example, a 13us synchronization adds only about 0.2us per operation

  31. Reliability Cost • A reliability cost of about 1us for small messages • The reliability cost for large messages is largely due to the CRC/checksum computation

  32. Presentation Outline • Problem Statement and Goals • Design Challenges and Implementation • Performance Evaluation • Conclusions and Future Work

  33. Conclusions • Achieve end-to-end reliable broadcast with low performance impact • Achieve efficient and scalable broadcast with Quadrics hardware broadcast • Reduce the overhead of broadcast buffer management

  34. Future Work • Reduce the synchronization cost by using the hardware-based barrier • Implement the tree-based chained broadcast RDMAs for processes over non-contiguous nodes • Dynamically choose broadcast algorithms according to the message pattern • Enhance the broadcast further by making use of multiple Quadrics NICs

  35. More Information • LA-MPI: http://www.acl.lanl.gov/la-mpi/ • NBC: http://nowlab.cis.ohio-state.edu/ • E-mail: {yuw,surs,panda}@cis.ohio-state.edu and {rta,rlgraham,lampi-support}@lanl.gov
