
High-Level, One-Sided Models on MPI: A Case Study with Global Arrays and NWChem

James Dinan, Pavan Balaji, Jeff R. Hammond (ANL); Sriram Krishnamoorthy (PNNL); and Vinod Tipparaju (AMD)


Presentation Transcript


Global Arrays (GA) is a popular high-level parallel programming model that provides data and computation management facilities to the NWChem computational chemistry suite. GA's global-view data model is supported by the ARMCI partitioned global address space (PGAS) runtime system, which is traditionally implemented natively on each supported platform in order to provide the best performance. The industry-standard Message Passing Interface (MPI) also provides one-sided functionality and is available on virtually every supercomputing system. We present the first high-performance, portable implementation of ARMCI using MPI one-sided communication. We interface the existing GA infrastructure with ARMCI-MPI and demonstrate that this approach reduces the amount of resources consumed by the runtime system, provides comparable performance, and enhances portability for applications like NWChem.

Global Arrays

GA allows the programmer to create large, multidimensional shared arrays that span the memory of multiple nodes in a distributed-memory system. The programmer interacts with GA through asynchronous, one-sided Get, Put, and Accumulate data movement operations as well as through high-level mathematical routines provided by GA. Accesses to a global array are specified through high-level array indices; GA translates these high-level operations into low-level communication operations that may target multiple nodes depending on the underlying data distribution.

ARMCI

GA's low-level communication is managed by the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication runtime system. Porting and tuning of GA is accomplished through ARMCI and traditionally involves natively implementing its core operations. Because of its sophistication, creating an efficient and scalable implementation of ARMCI for a new system requires expertise in that system and significant effort. In this work, we present the first implementation of ARMCI on MPI's portable one-sided communication interface.

MPI One-Sided Communication

MPI provides an interface for one-sided communication and encapsulates shared regions of memory in window objects. Windows are used to guarantee data consistency and portability across a wide range of platforms, including those with noncoherent memory access. This is accomplished through public and private copies, which are synchronized through access epochs. To ensure portability, windows have several important properties:

• Concurrent, conflicting accesses are erroneous.
• All access to the window data must occur within an epoch.
• Put, get, and accumulate operations are non-blocking and occur concurrently within an epoch.
• Only one epoch is allowed per window at any given time.

[Figure: a lock … Put(X)/Get(Y) … unlock access epoch between Rank 0 and Rank 1, illustrating synchronization of the public and private window copies.]

MPI Global Memory Regions

MPI global memory regions (GMR) align MPI's remote memory access (RMA) window semantics with ARMCI's PGAS model. GMR provides the following functionality:

• Allocation and mapping of shared segments.
• Translation between ARMCI shared addresses and MPI windows and window displacements.
• Translation between ARMCI absolute ranks used in communication and MPI window ranks.
• Management of MPI RMA semantics for direct load/store accesses to shared data. Direct load/store access to shared regions in MPI must be protected with MPI_Win_lock/unlock operations.
• Guarding of shared buffers: when a shared buffer is provided as the local buffer for an ARMCI communication operation, GMR must perform appropriate locking and copy operations in order to preserve several MPI rules, including the constraint that a window cannot be locked for more than one target at any given time.

Allocation metadata:

MPI_Win window;
int size[nproc];
ARMCI_Group grp;
ARMCI_Mutex rmw;
…

Noncontiguous Communication

GA operations frequently access array regions that are not contiguous in memory. These operations are translated into ARMCI strided and generalized I/O vector communication operations. We have provided four implementations of ARMCI's strided interface (an illustrative sketch of the contiguous and Direct strided paths follows at the end of this transcript):

• Direct: MPI subarray datatypes are generated to describe the local and remote buffers; a single operation is issued.
• IOV-Conservative: the strided transfer is converted to an I/O vector, and each operation is issued within its own epoch.
• IOV-Batched: the strided transfer is converted to an I/O vector, and N operations are issued within an epoch.
• IOV-Direct: the strided transfer is converted to an I/O vector, and an MPI indexed datatype is generated to describe both the local and remote buffers; a single operation is issued.

Performance Evaluation

• Experiments were conducted on a cross-section of leadership-class architectures.
• Mature implementations of ARMCI on BG/P, InfiniBand, and XT (GA 5.0.1).
• Development implementation of ARMCI on Cray XE6.
• NWChem performance study: computational modeling of a water pentamer.

Communication Performance

[Figures: contiguous and noncontiguous transfer bandwidth (GA_Put, ARMCI_PutS) on Blue Gene/P, an InfiniBand cluster, and Cray XE.]

• ARMCI-MPI performance is comparable to native, but tuning is needed, especially on Cray XE.
• Contiguous performance of MPI-RMA is nearly twice that of native ARMCI on Cray XE.
• Direct strided is best, except on BG/P, where batched is better due to pack/unpack throughput.

NWChem Performance

• Performance and scaling are comparable on most architectures; MPI tuning would improve overheads.
• Performance of ARMCI-MPI on the Cray XE6 is 30% higher than with native ARMCI, and its scaling is better.
• Native ARMCI on the Cray XE6 is a work in progress; ARMCI-MPI enables early use of GA/NWChem on such new platforms.

Additional Information: J. Dinan, P. Balaji, J. R. Hammond, S. Krishnamoorthy, and V. Tipparaju, "Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication," Preprint ANL/MCS-P1878-0411, April 2011. Online: http://www.mcs.anl.gov/publications/paper_detail.php?id=1535
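
To make the mapping described above concrete, the C sketch below shows how a contiguous ARMCI-style put and the "Direct" strided method could be expressed on MPI RMA: every access to window data happens inside a passive-target lock/unlock epoch on the target rank, and subarray datatypes describe the noncontiguous patches so that a single MPI_Put moves the whole region. This is only an illustration under those assumptions; the function names (gmr_put_contig, gmr_put_strided2d) and their parameters are hypothetical and are not the actual ARMCI-MPI interface.

#include <mpi.h>

/* Hypothetical helper: contiguous put into a GMR-managed window.
 * All access to window data occurs inside a lock/unlock epoch on
 * the target rank. */
void gmr_put_contig(void *src, int nbytes, int target_rank,
                    MPI_Aint target_disp, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win);   /* open epoch */
    MPI_Put(src, nbytes, MPI_BYTE,
            target_rank, target_disp, nbytes, MPI_BYTE, win);
    MPI_Win_unlock(target_rank, win);   /* close epoch; put is now complete */
}

/* Hypothetical helper: "Direct" strided put of a 2-D patch.
 * Subarray datatypes describe the local and remote buffers so the
 * noncontiguous region moves with a single MPI_Put.  target_disp is
 * the window displacement of the remote array's origin. */
void gmr_put_strided2d(double *src, int src_dims[2], int sub_dims[2],
                       int src_start[2], int target_rank,
                       MPI_Aint target_disp, int dst_dims[2],
                       int dst_start[2], MPI_Win win)
{
    MPI_Datatype src_type, dst_type;

    MPI_Type_create_subarray(2, src_dims, sub_dims, src_start,
                             MPI_ORDER_C, MPI_DOUBLE, &src_type);
    MPI_Type_create_subarray(2, dst_dims, sub_dims, dst_start,
                             MPI_ORDER_C, MPI_DOUBLE, &dst_type);
    MPI_Type_commit(&src_type);
    MPI_Type_commit(&dst_type);

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win);
    MPI_Put(src, 1, src_type, target_rank, target_disp, 1, dst_type, win);
    MPI_Win_unlock(target_rank, win);

    MPI_Type_free(&src_type);
    MPI_Type_free(&dst_type);
}

The IOV variants described in the transcript would instead issue one put per contiguous chunk, either each within its own epoch (IOV-Conservative) or batched within a single epoch (IOV-Batched).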
