Multiplexing Endpoints of HCA for Scaling MPI Applications

Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and Performance Evaluation with uDAPL
Jasjit Singh, Yogeshwar Sonawane
C-DAC, Pune, India
IEEE Cluster 2010, 21st September 2010
This work has been developed under the project 'National PARAM Supercomputing Facility and Next Generation HPC Technology' sponsored by Government of India's Department of Information Technology (DIT) under Ministry of Communication and Information Technology (MCIT) vide administrative approval No. DIT/R&D/C-DAC/2(2)/2008 dated 26/05/2008.

Presentation outline
• Introduction
• Problem Statement
• Proposed Design
• Performance Evaluation
• Related Work
• Conclusion & Future Work

Introduction
• HPC clusters are increasing in size to address the computational needs of large, challenging problems.
• MPI is the de facto standard for writing parallel applications. It typically uses a fully connected topology.
• The Abstract Device Interface (ADI) provides portability to MPI across multiple networks and network interfaces.

uDAPL Overview
• uDAPL (user Direct Access Programming Library) is proposed by the Direct Access Transport (DAT) Collaborative.
• It defines a lightweight, transport-independent and platform-independent set of user-level APIs to exploit RDMA capabilities, such as those present in InfiniBand, VIA and iWARP.
• It is supported by many MPI implementations, such as MVAPICH2, Intel MPI, Open MPI and HP-MPI.

uDAPL Communication Model
[Figure: two processes, each with memory buffers and an EVD (software) and an endpoint with SQ, RQ and CQ (hardware); labels: Descriptor Posting, Event Completion.]

Reliable Connection
• In Reliable Connection (RC) transport, a connection is formed between every process pair using endpoints (equivalent to queue pairs) at both ends.
• The limited number of endpoints on an HCA restricts the number of connections that an MPI application can establish.
• This in turn limits the number of nodes that can be deployed in a cluster.

Endpoint (EP) requirement
• A cluster has (N * P) processes, where P = number of processes (or cores) per node and N = number of nodes in the cluster.
• Every process needs to establish connections to the remaining (N * P – 1) processes. For simplicity, assume it to be (N * P).
• EP requirement for a process = (N * P)
• EP requirement for a node = (N * P * P)
• Increasing N or P increases the EP requirement.
• Increasing P increases the EP requirement drastically, since the per-node requirement grows with the square of P.
• Nmax = Endpoints on the HCA / (P * P)
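
The arithmetic on this slide can be sketched in Python as follows. The function names are illustrative and do not come from the slides; the sample values (16 processes per node, 4096 endpoints per HCA) are taken from the experimental platform described later in the presentation.

```python
def ep_per_process(n_nodes: int, procs_per_node: int) -> int:
    """Endpoints one process needs: one per process in the cluster (N * P)."""
    return n_nodes * procs_per_node


def ep_per_node(n_nodes: int, procs_per_node: int) -> int:
    """Endpoints one node's HCA must supply: N * P * P."""
    return ep_per_process(n_nodes, procs_per_node) * procs_per_node


def max_nodes(hca_endpoints: int, procs_per_node: int) -> int:
    """Nmax = endpoints on the HCA / (P * P), rounded down."""
    return hca_endpoints // (procs_per_node * procs_per_node)


if __name__ == "__main__":
    # 16 nodes with 16 processes each need 16 * 16 * 16 endpoints per node.
    print(ep_per_node(16, 16))   # 4096
    # A 4096-endpoint HCA with 16 processes per node supports at most 16 nodes.
    print(max_nodes(4096, 16))   # 16
    # Doubling P from 16 to 32 cuts Nmax by a factor of four.
    print(max_nodes(4096, 32))   # 4
```

Because P appears squared in the denominator, adding cores per node shrinks the supportable cluster size much faster than adding nodes does.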

Problem Statement
• A hardware upgrade to meet increased endpoint requirements is costly and time-consuming.
• Can an optimal solution be devised with the existing HCA?

Multiplexing approach
• Extends scalability with existing hardware.
• Maps multiple software connections to fewer hardware connections without incurring any significant performance penalty.
• Thus, the same HCA can support a larger number of nodes in the cluster.

Multiplexing Design: swep & hwep
• We distinguish between a software endpoint (swep) and a hardware endpoint (hwep).
• Multiple sweps use a single hwep for data transfer.
• A hardware connection is between hweps on two nodes. Therefore, only software connections between these two nodes will use this hardware connection.
• One hwep is shared by sweps belonging to different processes on a node.
• Multiplexing should support connection management as well as data transfer routines such as send, receive and RDMA Write.
[Figure: processes P1 to P4 with their sweps (software) and the hweps (hardware) they share.]

Multi-Way Binding Problem
[Figure: processes P1 to P8 on nodes N1 and N2, with hweps labelled H0 to H3, h0 and h1.]

Multiplexing Design: Multi-way binding
• This is an issue related to connection management.
• The processing (issuing or servicing) of a connection request at a node is completely independent of the processing at the remote node.
• A connection between hweps has to be strictly one-to-one.
• Without multiplexing, multi-way binding will not occur, as every connection request sent or received allocates a separate hwep.
• In the figure, two hweps on one side (H1 and H3) are trying to bind to a single remote hwep (H2).
[Figure: processes P1, P2 and P3 with hweps H1, H2 and H3 on nodes N1 and N2.]

Solution with VID
[Figure: processes P1 to P8 on nodes N1 and N2, with a VID attached to each hwep, e.g. "H1, vid 0" and "H2, vid 0".]

Multiplexing Design: Solution with VID
• A Virtual Identifier (VID) acts as a unique identifier for a hwep.
• Hweps with the same VID will be connected to each other.
• For equal sharing, the total number of hweps on an HCA can be divided as N * m, where N is the number of nodes in the cluster. Here m is less than the practical EP requirement of P * P.
• If the range of VIDs for a remote node (0 to m-1) is exhausted, a hwep already in use (preferably the least used) has to be reused.
[Figure: P1 (H1, vid 0), P2 (H2, vid 0) and P3 (H3, vid 1) on nodes N1 and N2.]
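
One possible way to hand out VIDs under these rules is sketched below in Python. The class name, the per-VID usage counter and the tie-breaking order are assumptions made for illustration; the slides do not specify the allocator, and the agreement on a VID between the two nodes is outside the scope of this sketch.

```python
class VidAllocator:
    """Hands out VIDs (0 to m-1) per remote node; hypothetical helper."""

    def __init__(self, m: int) -> None:
        if m < 1:
            raise ValueError("m must be at least 1")
        self.m = m
        # use_count[remote_node][vid] = number of sweps mapped to that hwep
        self.use_count = {}

    def acquire(self, remote_node: int) -> int:
        """Return a VID for a new software connection to remote_node."""
        counts = self.use_count.setdefault(remote_node, [0] * self.m)
        # An unused VID has count 0, so taking the minimum count yields a fresh
        # hwep while any remain and the least used hwep once the range is exhausted.
        vid = min(range(self.m), key=lambda v: counts[v])
        counts[vid] += 1
        return vid

    def release(self, remote_node: int, vid: int) -> None:
        """Drop one software connection from the hwep identified by vid."""
        counts = self.use_count[remote_node]
        if counts[vid] == 0:
            raise ValueError("VID is not in use")
        counts[vid] -= 1


if __name__ == "__main__":
    alloc = VidAllocator(m=2)
    # Five connections to the same remote node with only two hweps available.
    print([alloc.acquire(remote_node=1) for _ in range(5)])  # [0, 1, 0, 1, 0]
```

Selecting by minimum usage covers both cases with one rule, so no separate "range exhausted" branch is needed.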

Multiplexing Design: Endpoint as a Queue-pair
• Generally, one hwep corresponds to one swep.
• A hwep context contains all the information about a single swep or a connection, such as the EVD number and the PZ number.
• In multiplexing, one hwep is used by multiple sweps.
• Fig. (a) is redrawn in fig. (b) to show a hwep as a queue pair (SQ and RQ).
• Either of the queues can own the context information.
• Both queues will inherit the VID.
[Figure: (a) sweps of P1 and P2 (software) on hweps (hardware); (b) the same hweps drawn as SQ and RQ.]

Separating SQ and RQ
• Many MPI libraries use a single EVD, a single PZ and the same memory privileges for a process.
• Hence all sweps of a process use the same EVD and PZ.
• We share an SQ among processes, while an RQ belongs to only one process.
• Thus the RQ owns the information stored in a hwep context, while the same information for the SQ is conveyed as part of the descriptor.
• During connection establishment, only the RQ is selected.
• The remote SQ is chosen automatically: its VID is the same as that of the local RQ.
• SRQ functionality is feasible using the RQ of a hwep.

Static Mapping: Division of sweps
• For a fixed cluster environment, static mapping avoids various multiplexing overheads, such as those of allocating hweps and sweps and maintaining their association.
• LPID = Local Process Identifier, RPID = Remote Process Identifier, RN = Remote Node Number.
[Figure: sweps divided by remote node (RN 0 to RN (N-1)) and by LPID (LPID 0 to LPID (P-1)), with P sweps for each LPID, one per RPID (RPID 0 to RPID (P-1)).]

Static Mapping: Division of hweps
• Similarly, static allocation of hweps is possible.
• The multiplexing ratio is (N * P * P) : (N * P * X), i.e. P : X, where X is less than P.
• P sweps will share X hweps.
• X SQs and X RQs will be used by P sweps.
• The combination of LPID, RPID and RN acts as a VID.
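
As a rough illustration of static mapping, the Python sketch below derives a swep index from (RN, LPID, RPID) and maps it to one of X hweps set aside for the same (RN, LPID) pair. The slides do not give the actual index formulas: the row-major layout and the "RPID modulo X" sharing rule are assumptions, and X = 4 is an arbitrary sample value.

```python
def swep_index(rn: int, lpid: int, rpid: int, p: int) -> int:
    """Index of the swep for (remote node, local process, remote process).

    Assumed layout: sweps are grouped by RN, then by LPID, with P sweps
    (one per RPID) in each LPID group, giving N * P * P sweps in total.
    """
    return (rn * p + lpid) * p + rpid


def hwep_index(rn: int, lpid: int, rpid: int, p: int, x: int) -> int:
    """Index of the hwep shared by this swep, out of N * P * X hweps.

    Assumed rule: each (RN, LPID) group owns X hweps and its P sweps are
    spread over them by RPID modulo X.
    """
    if not 0 < x <= p:
        raise ValueError("X must be between 1 and P")
    return (rn * p + lpid) * x + (rpid % x)


if __name__ == "__main__":
    N, P, X = 48, 16, 4          # N and P from the test platform, X illustrative
    total_sweps = N * P * P
    total_hweps = N * P * X
    used = {hwep_index(rn, lp, rp, P, X)
            for rn in range(N) for lp in range(P) for rp in range(P)}
    print(total_sweps, total_hweps, len(used))  # 12288 3072 3072
```

With a fixed formula of this kind, no lookup table or run-time allocation is needed to find the hwep behind a swep, which is the overhead that static mapping is meant to avoid.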

Performance Evaluation
• We compare results for the following two models:
  • without multiplexing, termed the basic model
  • with multiplexing, termed the scalable model
• We have evaluated the multiplexing design using uDAPL over the PARAMNet-3 (pnet3) interconnect.

Experimental Platform
• Two clusters: Cluster A of 16 nodes, Cluster B of 48 nodes.
• Each node has four quad-core 2.93 GHz Intel Xeon (Tigerton) processors, 64 GB RAM and a PCI Express based pnet3 HCA.
• Intel MPI is used; it provides environment-variable-based control for using only RDMA Write operations.
• Pnet3 is a high-performance cluster interconnect developed by C-DAC. It comprises:
  • a 48-port switch with 10 Gbps full-duplex CX4 connectivity
  • an x4/x8 PCIe HCA having 4096 endpoints
  • a lightweight protocol software stack known as KSHIPRA
• KSHIPRA supports the uDAPL library as well as selected components of the OFED stack, i.e. IPoIB, SDP and iSER.

Multiplexing Ratio (mux-ratio)
• Mux-ratio is the ratio in which multiple sweps use a single hwep.
• It is not possible to run applications using the basic model beyond 16 nodes.
• With multiplexing, increasing the mux-ratio increases the number of nodes that can be deployed in the cluster.
• It brings the hwep requirement down to the number of hweps supported by the HCA.
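
The effect of the mux-ratio on the hwep requirement can be checked with a few lines of Python. This is a sketch under the simplified model of N * P * P sweps per node; the function names are illustrative, and the 4096-endpoint and 16-processes-per-node figures come from the experimental platform slide.

```python
def hweps_required(n_nodes: int, procs_per_node: int, mux_ratio: int) -> int:
    """hweps needed per node when mux_ratio sweps share one hwep."""
    sweps = n_nodes * procs_per_node * procs_per_node
    return (sweps + mux_ratio - 1) // mux_ratio   # integer ceiling


def min_mux_ratio(n_nodes: int, procs_per_node: int, hca_endpoints: int) -> int:
    """Smallest whole mux-ratio (k:1) whose hwep requirement fits on the HCA."""
    sweps = n_nodes * procs_per_node * procs_per_node
    return max(1, (sweps + hca_endpoints - 1) // hca_endpoints)


if __name__ == "__main__":
    # Basic model (1:1) at 48 nodes needs three times what the HCA offers.
    print(hweps_required(48, 16, 1))     # 12288
    # A 4:1 mux-ratio brings the requirement below 4096.
    print(hweps_required(48, 16, 4))     # 3072
    print(min_mux_ratio(48, 16, 4096))   # 3
    # With 16:1, sixteen times as many nodes (256) fit on the same HCA.
    print(hweps_required(256, 16, 16))   # 4096
```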

Intel MPI Benchmarks (IMB)
• Very little variation in readings is observed across all the mux-ratios in nearly all of the benchmarks.
[Figure: IMB Alltoall, 128 processes on 8 nodes]

NAS Parallel Benchmarks (NPB)
• NPB contains computing kernels typical of various CFD scientific applications. Each benchmark has a different communication pattern.
• IS shows a maximum of 5% degradation with 16:1 multiplexing.
[Figure: NAS Class C readings, 256 processes on 16 nodes]

HPL Benchmark
• The 32-node and 48-node runs show successful scalability of MPI applications using the multiplexing technique.
• The marginal improvement is due to the smaller number of hweps to be managed on the HCA.

Related Work
• SRQ-based designs for reducing communication buffer requirements.
• On-demand connection management: a connection is established only when required.
  • In the worst case, an all-to-all pattern may emerge.
• As our work is incorporated into the uDAPL provider, many features of MPI can be used in conjunction with our technique.
• The eXtended Reliable Connection (XRC) transport provides the services of RC transport while providing additional scalability for multi-core clusters.
  • It allows a single connection from one process to an entire node.
• The hybrid programming model (e.g. OpenMP with MPI) uses threads within a node and MPI processes across nodes.
  • All threads running on a node share the same set of connections.
  • For the hybrid model to work, MPI applications should be thread-enabled.
• Our work is part of the transport library, so MPI applications can run seamlessly.

Conclusion and Future Work
• We proposed a multiplexing technique to extend the scalability of MPI applications.
  • The effort is to map the MPI requirement to the available pool of endpoints on the HCA.
• The multiplexing technique can be applied to any transport library that provides a connection-oriented service.
• We can scale the cluster size in the same proportion as the mux-ratio.
  • E.g. with a 16:1 mux-ratio, the number of nodes in the cluster can be 16 times larger with the same HCA.
• No visible performance degradation is observed up to 48 nodes.
• Future work includes evaluation at larger scale and the addition of send-receive support and SRQ support.

Thank you
yogeshwars@cdac.in
www.cdac.in
www.cdac.in/html/htdg/products.asp

Backup slides

uDAPL Communication Model
• Support for both Channel Semantics (Send/Receive) and Memory Semantics (RDMA Write and RDMA Read).
• Reliable, connection-oriented model with endpoints as the source and sink of a communication channel.
• Data Transfer Operations (DTOs), i.e. Work Requests or descriptors, are posted on an endpoint.
• Completion of a DTO is reported as an event on an Event Dispatcher (EVD), which is similar to a CQ.
• Either a polling (dequeue) or a wait model can be used for reaping completions.
• Protection Zone (PZ) and memory privilege flags validate memory access.
• uDAPL defines an SRQ mechanism that provides the ability to share receive buffers among several connections.

Send-Receive Handling Complexities
• During recv DTO processing, a mismatch between receive descriptors and their corresponding send descriptors can occur. This is due to sharing of the hwep RQ.
• A hwep RQ can hold descriptors of varied lengths from different sweps.
• Additional hardware support is required to handle the above complexities.
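
The mismatch can be illustrated with a toy Python model of a shared receive queue. This is not how the HCA or uDAPL is implemented; the class, function and swep names below are hypothetical and only show why first-in-first-out consumption of a shared RQ can pair a message with another swep's descriptor.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RecvDescriptor:
    owner_swep: str   # swep that posted this receive descriptor
    length: int       # size of the receive buffer in bytes


def deliver(shared_rq: deque, dest_swep: str, msg_len: int) -> str:
    """Consume the head of the shared RQ for an incoming send.

    Returns "ok" when the head descriptor belongs to dest_swep and is large
    enough, otherwise a short description of the mismatch.
    """
    desc = shared_rq.popleft()
    if desc.owner_swep != dest_swep:
        return f"mismatch: descriptor of {desc.owner_swep} used for {dest_swep}"
    if desc.length < msg_len:
        return "mismatch: buffer too small"
    return "ok"


if __name__ == "__main__":
    # Two sweps post receive descriptors of different lengths on one hwep RQ.
    rq = deque([RecvDescriptor("swep_a", 64), RecvDescriptor("swep_b", 4096)])
    # A 4096-byte message for swep_b arrives first and hits swep_a's 64-byte buffer.
    print(deliver(rq, "swep_b", 4096))
    # The message for swep_a then finds swep_b's descriptor at the head.
    print(deliver(rq, "swep_a", 64))
```

Both deliveries report a mismatch, even though each swep posted a descriptor of the right size for its own message.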
