
High Performance Broadcast Support in LA-MPI over Quadrics

This presentation covers the problem statement, design challenges and implementation, performance evaluation, conclusions, and future work for using the Quadrics hardware broadcast to provide efficient and scalable broadcast support in LA-MPI. It emphasizes the goals of end-to-end reliability and reduced buffer-management overhead.


Presentation Transcript


  1. High Performance Broadcast Support in LA-MPI over Quadrics W. Yu, S. Sur, D.K. Panda, R.T. Aulwes† and R.L. Graham† Dept. of Computer Science, The Ohio State University; Advanced Computing Lab†, Los Alamos, NM 87545

  2. Presentation Outline • Problem Statement and Goals • Design Challenges and Implementation • Performance Evaluation • Conclusions and Future Work

  3. LA-MPI • The Los Alamos Message Passing Interface (LA-MPI) • Provide end-to-end reliable message passing • Protect against network errors • Protect against I/O bus errors • Concurrent message passing over multiple interconnects • Message striping over multiple network interface cards • Supported platforms • Operating systems: TRU64, Linux, Irix, MAC-OSX (32 and 64-bit) • Communication protocols: Shared Memory, UDP, HIPPI-800, Quadrics, Myrinet (GM), InfiniBand (ongoing)

  4. LA-MPI Architecture

  5. Point-to-Point Communication [Figure: point-to-point message flow, showing send and receive descriptors bound over multiple network paths, fragments protected by CRC/checksum, ACK/NACK processing with received-fragment records, and timer-driven retransmission]

  6. LA-MPI Broadcast [Figure: generic tree-based broadcast]

  7–9. Quadrics Hardware Broadcast [Figure sequence illustrating the steps of a Quadrics hardware broadcast]

  10. Quadrics Hardware Broadcast • Benefits • Efficient, scalable, and reliable • Limitations • The receive address must be global • Receiving processes must be on contiguous nodes • Existing broadcast implementation that makes use of the hardware broadcast: Elanlib

  11. Research Goals • Can we make use of the hardware broadcast to provide efficient and scalable broadcast support in LA-MPI while achieving the goal of end-to-end reliability? • Acknowledgments from receivers (after verifying the CRC) must be collected to ensure reliability • Reduce the overhead of buffer management • Raw hardware broadcast latency: ~3.3us • Elanlib broadcast latency: ~8.5us • So Elanlib adds ~5us of overhead on top of the raw hardware broadcast • Maintain the high performance and scalability of the hardware broadcast

  12. Presentation Outline • Problem Statement and Goals • Design Challenges and Implementation • Performance Evaluation • Conclusions and Future Work

  13. Challenges • Memory management for global buffers • Broadcast over processes on non-contiguous nodes • Synchronization and acknowledgement • Retransmission and Reliability

  14. Global Buffer Management • Global buffers must be consistent across processes • Option 1: use a global allocator to provide global buffers on demand • Hard to manage, with a low buffer-reuse rate • Can satisfy a large number of requests • Option 2: maintain a static number of fixed-size global channels • Easy to manage, with a high reuse rate • Needs more frequent synchronization on the use of channels
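
  To make the second option concrete, the following C sketch shows what a static pool of fixed-size global channels might look like. All names and sizes here (bcast_channel_t, NUM_CHANNELS, CHANNEL_SIZE) are illustrative assumptions, not LA-MPI's actual data structures.

    #include <stdint.h>

    /* Hypothetical fixed pool of global broadcast channels. Every process
     * maps the same channels at the same global virtual address, so a
     * hardware-broadcast RDMA into channel i lands in the same buffer on
     * every node. Names and sizes are illustrative only. */
    #define NUM_CHANNELS 64           /* static number of channels  */
    #define CHANNEL_SIZE (16 * 1024)  /* fixed payload size (16 KB) */

    typedef struct {
        volatile uint32_t seqno;      /* broadcast sequence number   */
        volatile uint32_t crc;        /* CRC/checksum of the payload */
        volatile uint32_t len;        /* valid payload length        */
        uint8_t payload[CHANNEL_SIZE];
    } bcast_channel_t;

    typedef struct {
        bcast_channel_t channel[NUM_CHANNELS]; /* globally mapped buffers */
        int next;                     /* next channel (round robin)       */
        int since_sync;               /* broadcasts since the last sync   */
    } bcast_channel_pool_t;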

  15. Single Communicator • A communicator must recycle its global channels • Synchronize before the use of a channel • Synchronize after the use of a channel • Synchronize when the global buffers are about to be used up • Reduce the frequency of synchronization • Amortize the cost of synchronization across multiple operations
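
  Continuing the hypothetical pool above, channel recycling with amortized synchronization might look like this sketch; comm_sync() is a stand-in for whatever barrier or ACK-collection mechanism confirms that all ranks have consumed the outstanding channels.

    /* Recycle channels round robin and synchronize only when the pool is
     * about to wrap, so one synchronization is amortized over up to
     * NUM_CHANNELS broadcasts. comm_sync() is a placeholder. */
    extern void comm_sync(void);

    static bcast_channel_t *acquire_channel(bcast_channel_pool_t *pool)
    {
        if (pool->since_sync == NUM_CHANNELS) {
            comm_sync();          /* all ranks agree the pool is reusable */
            pool->since_sync = 0;
        }
        bcast_channel_t *ch = &pool->channel[pool->next];
        pool->next = (pool->next + 1) % NUM_CHANNELS;
        pool->since_sync++;
        return ch;
    }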

  16. Multiple Communicators • Global buffers must be recycled across different communicators • Typically a small number of concurrent communicators • Communicators tend to be disjoint • Our solution: 8 sets of global buffers, one reserved for COMM_WORLD • A new communicator performs an Allreduce() to find the list of buffer sets available on all processes and takes the first available one
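
  One way to implement that selection is an MPI_Allreduce with a bitwise AND over per-rank masks of free buffer sets; the sketch below is an assumption about the mechanism, not LA-MPI's actual code.

    #include <mpi.h>

    /* Each rank contributes a bitmask of the buffer sets it sees as free;
     * a bitwise-AND Allreduce yields the sets free on every rank, and the
     * new communicator takes the lowest one. Returns -1 if none is free
     * (a real implementation could then fall back to the generic
     * tree-based broadcast). */
    static int pick_buffer_set(MPI_Comm comm, unsigned local_free_mask)
    {
        unsigned global_free_mask = 0;

        MPI_Allreduce(&local_free_mask, &global_free_mask, 1,
                      MPI_UNSIGNED, MPI_BAND, comm);

        for (int set = 0; set < 8; set++)   /* 8 sets of global buffers */
            if (global_free_mask & (1u << set))
                return set;                 /* first available set */
        return -1;
    }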

  17. Challenges • Memory management for global buffers • Broadcast over processes on non-contiguous nodes • Synchronization and acknowledgement • Retransmission and Reliability

  18. Broadcast over Non-contiguous Nodes • To make use of hardware broadcast: group processes into sets of contiguous nodes, called broadcast segments • Approach #1, linearly chained broadcast RDMAs: the root performs a broadcast RDMA to each segment • Not scalable • Completely distributed topology: the formation of broadcast segments by one node is transparent to all other nodes
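
  In outline form, approach #1 reduces to a loop over the segments; bcast_rdma() below is a hypothetical stand-in for a Quadrics hardware-broadcast RDMA covering one contiguous node range, not a real elanlib call.

    #include <stddef.h>

    typedef struct { int first_node, last_node; } bcast_segment_t;

    /* Hypothetical hardware-broadcast RDMA over one contiguous node range. */
    extern void bcast_rdma(int first_node, int last_node,
                           const void *buf, size_t len);

    /* Approach #1: the root issues one hardware broadcast per contiguous
     * segment, so the cost grows linearly with the number of segments. */
    static void linear_chained_bcast(const bcast_segment_t *seg, int nsegs,
                                     const void *buf, size_t len)
    {
        for (int i = 0; i < nsegs; i++)
            bcast_rdma(seg[i].first_node, seg[i].last_node, buf, len);
    }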

  19. Tree-Based Chained Broadcast RDMAs • Approach #2 (tree-based chaining): broadcast to the largest broadcast segment; each process that receives the data broadcasts to another broadcast segment • Requires a more sophisticated topology • Different trees are needed for different roots
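
  A minimal sketch of the tree-based variant, reusing the hypothetical bcast_rdma() above: segments are assumed sorted largest first, and in each round every covered segment forwards the data to one uncovered segment, so coverage roughly doubles per round (logarithmic depth rather than the linear chain of approach #1). Completion handling and the choice of one forwarding process per segment are omitted.

    /* Tree-based chaining (approach #2), heavily simplified: the root
     * broadcasts to segment 0 (the largest); then, round by round,
     * segment m forwards to segment m + covered until every segment
     * holds the data. A real implementation would wait for the incoming
     * data before forwarding. */
    static void tree_chained_bcast(const bcast_segment_t *seg, int nsegs,
                                   int my_segment, int is_root,
                                   const void *buf, size_t len)
    {
        if (is_root)
            bcast_rdma(seg[0].first_node, seg[0].last_node, buf, len);

        for (int covered = 1; covered < nsegs; covered *= 2) {
            if (my_segment < covered && my_segment + covered < nsegs) {
                const bcast_segment_t *dst = &seg[my_segment + covered];
                bcast_rdma(dst->first_node, dst->last_node, buf, len);
            }
        }
    }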

  20. Synchronization and Acknowledgments • Delayed synchronization for small messages • Buffer the message at the broadcast channels • Trigger broadcast RDMA(s) to send the message • Synchronize the processes after a number of operations • Amortize the synchronization cost across multiple operations • With delayed synchronization, all nodes must be notified of the final status of the used channels • For large messages (>16KB), synchronize processes at the completion of each broadcast to avoid the cost of buffering the message
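
  Putting the two cases together, a hypothetical send-side policy might look like the following, reusing acquire_channel() and comm_sync() from the earlier sketches; the 16KB threshold is the one named on the slide, while everything else is illustrative.

    #include <string.h>

    #define LARGE_MSG_THRESHOLD (16 * 1024)

    /* Small messages are staged in a broadcast channel with delayed,
     * amortized synchronization; large messages bypass the channels and
     * synchronize at the completion of each broadcast, avoiding the cost
     * of buffering the payload. */
    static void bcast_send(bcast_channel_pool_t *pool,
                           const void *buf, size_t len)
    {
        if (len <= LARGE_MSG_THRESHOLD) {
            bcast_channel_t *ch = acquire_channel(pool); /* delayed sync */
            memcpy(ch->payload, buf, len);
            ch->len = (uint32_t)len;
            /* ... trigger the broadcast RDMA(s) on this channel ... */
        } else {
            /* ... broadcast directly from the user buffer ... */
            comm_sync();   /* per-operation sync for large messages */
        }
    }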

  21. Synchronization Approaches • Hardware barrier • Efficient and scalable • Not available for non-contiguous nodes • May generate too much broadcast traffic • Tree-based synchronization • One process as the manager for a communicator • ACKs are propagated to the manager through chained RDMA • NACKs are generated to the manager directly

  22. Retransmission and Reliability • Reliability against two kinds of errors • I/O bus errors: retransmit the data • Network errors, e.g., card failures: fail over to the tree-based broadcast, which is built on top of point-to-point communication and is end-to-end reliable • Retransmission • A timestamp is created with each broadcast request • Retransmit the data when the timer goes off or a NACK is detected • If a card failure is suspected, fail over to the tree-based broadcast
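
  A rough sketch of the retransmission bookkeeping this implies; the timeout, retry limit, and helper names are illustrative assumptions, not LA-MPI's actual values or API.

    /* Each outstanding broadcast records a timestamp; a progress routine
     * retransmits on timeout or NACK and, after repeated failures that
     * suggest a card failure, falls back to the generic tree-based
     * broadcast, which is end-to-end reliable over point-to-point paths. */
    typedef struct {
        double sent_at;    /* time of the last (re)transmission */
        int    retries;
        int    nacked;     /* set when a NACK has arrived       */
    } bcast_request_t;

    #define RETRANS_TIMEOUT 0.5  /* seconds (illustrative)            */
    #define MAX_RETRIES     4    /* beyond this, suspect card failure */

    extern void retransmit(bcast_request_t *req);           /* re-issue RDMA */
    extern void fallback_tree_bcast(bcast_request_t *req);

    static void progress_bcast(bcast_request_t *req, double now)
    {
        if (req->nacked || now - req->sent_at > RETRANS_TIMEOUT) {
            if (++req->retries > MAX_RETRIES) {
                fallback_tree_bcast(req);
                return;
            }
            retransmit(req);
            req->sent_at = now;
            req->nacked  = 0;
        }
    }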

  23. Broadcast Message Flow Path

  24. Presentation Outline • Problem Statement and Goals • Design Challenges and Implementation • Performance Evaluation • Conclusions and Future Work

  25. Experiment Testbeds • 256-node quad-1.25GHz Alpha TRU64 cluster at LANL • 8-node quad-700MHz Linux cluster at OSU • Both are equipped with Elan3 QM-400 cards • Evaluated MPI implementations • LA-MPI • MPICH • HP's Alaska

  26. Performance Evaluation • Performance tests • Broadcast latency • Broadcast latency with SMP support • Scalability • Impact of the number of broadcast channels • Cost of reliability

  27. Broadcast Latency • Reduce the broadcast latency compared to the generic broadcast implementation • Achieve 4-byte broadcast latency of 3.5us over 8 nodes • Low overhead for buffer recycling and acknowledgments

  28. SMP Support • Achieve 4-byte broadcast latency of 7.1us over 256 processes • Achieve better performance for small messages compared to that of MPICH and HP’s Alaska, without using hardware barrier

  29. Scalability • Achieve better scalability compared to the generic algorithm • Good scalability while achieving high performance

  30. Broadcast Channels • The synchronization cost is about 13us • With a large number of broadcast channels, this cost is amortized across many broadcast operations; amortized over 64 broadcasts, for example, a 13us synchronization adds only about 0.2us per operation

  31. Reliability Cost • A reliability cost of about 1us for small messages • The reliability cost for large messages is largely due to the CRC/checksum computation

  32. Presentation Outline • Problem Statement and Goals • Design Challenges and Implementation • Performance Evaluation • Conclusions and Future Work

  33. Conclusions • Achieve end-to-end reliable broadcast with low performance impact • Achieve efficient and scalable broadcast with Quadrics hardware broadcast • Reduce the overhead of broadcast buffer management

  34. Future Work • Reduce the synchronization cost by using the hardware-based barrier • Implement the tree-based chained broadcast RDMAs for processes over non-contiguous nodes • Dynamically choose broadcast algorithms according to the message pattern • Enhance the broadcast further by making use of multiple Quadrics NICs

  35. More Information • LA-MPI: http://www.acl.lanl.gov/la-mpi/ • NBC: http://nowlab.cis.ohio-state.edu/ • E-mail: {yuw,surs,panda}@cis.ohio-state.edu and {rta,rlgraham,lampi-support}@lanl.gov
