
A Multi-platform Co-Array Fortran Compiler for High-Performance Computing





Presentation Transcript


  1. A Multi-platform Co-Array Fortran Compiler for High-Performance Computing
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey
{dotsenko, ccristi, johnmc}@cs.rice.edu
Supported by NSF

[Figure: the declaration integer :: a(10,20)[*] creates a 10x20 array a on each of the images 1..N; any image can reference another image's copy.]

integer :: a(10,20)[*]
! Copies from the left neighbor
if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]

[Figure: 2D wave-front parallelism]

Programming Ultra-scale Parallel Systems
• CHALLENGES: high performance and good scalability; programmer productivity
• MPI: a library-based parallel programming model
  – Portable and widely used
  – The developer has explicit control over data and communication placement
  – Difficult and error-prone to program
  – Most of the burden of communication optimization falls on application developers; compiler support is underutilized
• CAF: a promising near-term alternative
  – As expressive as MPI
  – Simpler to program than MPI
  – More amenable to compiler optimizations
  – The user retains control over performance-critical factors

Co-Array Fortran Language
• Parallel extension of Fortran 90
• SPMD programming model
  – fixed number of images during execution
  – images operate asynchronously
• Both private and shared data
  – real a(20,20) — private: a 20x20 array in each image
  – real a(20,20)[*] — shared: a 20x20 array in each image
• Simple one-sided communication (PUT & GET)
  – x(:,j:j+2) = a(r,:)[p:p+2] copies row r of a from images p:p+2 into local columns j:j+2 of x
• Flexible explicit synchronization
  – sync_team(team [,wait])
  – team = a vector of process ids to synchronize with
  – wait = a vector of processes to wait for
• Pointers and dynamic allocation

CAF Model Refinements
• Point-to-point synchronization: sync_notify(p), sync_wait(p)
• Less restrictive memory fences at call sites
• Collective operations

Rice CAF Compiler
• Source-to-source code generation
• Open-source compiler
• Built on the Open64/SL infrastructure
• Support for the core language features
• Code generation:
  – library-based communication: the portable ARMCI and GASNet communication libraries and the CHASM array-descriptor library
  – load/store communication on shared-memory platforms
• Operating systems: Linux IA64/IA32, Alpha Tru64, SGI IRIX64
• Interconnects & platforms: Quadrics QSNet (Elan 3) and QSNet II (Elan 4), Myrinet 2000, Ethernet, SGI Altix 3000, SGI Origin 2000

Current Optimizations
• Procedure splitting
• Hints for non-blocking communication
• Library-based and load/store communication
• Packing of strided communication (a hand-written sketch of this idea appears at the end of the transcript)

Planned Optimizations
• Communication vectorization and aggregation
• Synchronization strength reduction
• Automatic split-phase communication
• Platform-driven communication optimizations
  – transform one-sided communication into two-sided and collective communication, where useful
  – multi-model code for hierarchical architectures
  – convert GETs into PUTs
• Multi-buffer co-arrays for asynchrony tolerance
• Virtualization for latency tolerance
• Interoperability with other programming models

CAF Applications and Benchmarks
• Sweep3D – wave-front parallelism (neutron transport problem); a minimal CAF sketch of this pattern follows the list
• Spark98 – sparse matrix-vector multiply (San Fernando Valley earthquake simulation)
• NAS Parallel Benchmarks 2.3: MG, CG, SP, BT, LU (computational fluid dynamics)
• Random Access, STREAM
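The wave-front pattern used by Sweep3D combines the one-sided PUTs and the point-to-point synchronization primitives (sync_notify/sync_wait) listed above. The fragment below is a minimal, hand-written sketch of a 1D pipeline version of that pattern; it is not taken from the poster, and the array names, sizes, and the simple update are illustrative only.

subroutine pipeline_sweep
  integer, parameter :: n = 10
  real, save :: a(n, n)[*]       ! one block of the domain per image (hypothetical size)
  real, save :: halo(n)[*]       ! boundary buffer filled by the left neighbor
  integer :: me, j

  me = this_image()
  do j = 1, n
     if (me > 1) then
        call sync_wait(me - 1)        ! boundary for step j has arrived
        a(:, j) = a(:, j) + halo(:)   ! consume the received boundary
        call sync_notify(me - 1)      ! left neighbor may reuse the buffer
     end if
     ! ... local computation on column j ...
     if (me < num_images()) then
        if (j > 1) call sync_wait(me + 1)   ! right neighbor is done with step j-1
        halo(:)[me + 1] = a(:, j)           ! one-sided PUT of the boundary
        call sync_notify(me + 1)            ! tell the right neighbor it may proceed
     end if
  end do
end subroutine pipeline_sweep

The "multi-buffer co-arrays for asynchrony tolerance" item under Planned Optimizations would relax the single-buffer handshake used in this sketch by cycling through several halo buffers.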
Performance results (figure panels)
• Computational Fluid Dynamics, cluster platforms: NAS BT class C on Itanium2 + Myrinet 2000; NAS MG class C on Itanium2 + Myrinet 2000
• Computational Fluid Dynamics, SGI Altix 3000: NAS BT class B on SGI Altix 3000; NAS MG class C on SGI Altix 3000
• San Fernando Valley Earthquake Simulation – Spark98: sparse matrix-vector multiply (sf2 traces) on SGI Altix 3000; partitioned mesh
• Neutron transport problem – Sweep3D: 150³ problem on Itanium2 + Quadrics, Itanium2 + Myrinet, and SGI Altix 3000

Observations
• The performance of all CAF versions is comparable to that of MPI, and better at large numbers of CPUs
• The CAF GETs versions are simpler and more "natural" to code, but up to 13% slower (the two styles are contrasted in the sketch below)
• Without considering locality, applications do not scale on NUMA architectures (Hybrid)
• The ARMCI library is more efficient than MPI
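The "CAF GETs" comparison above refers to versions of the benchmarks that pull remote data with reads rather than push it with writes, and the planned "convert GETs into PUTs" optimization targets exactly that difference. Below is a minimal sketch, with hypothetical names, of the same boundary exchange written both ways; it is an illustration, not code from the poster.

subroutine halo_exchange
  real, save :: edge(10)[*]   ! data each image exposes to its right neighbor
  real, save :: halo(10)[*]   ! receives the left neighbor's edge
  integer :: me
  me = this_image()

  ! (a) GET style: the consumer pulls after a barrier guarantees edge(:) is ready.
  call sync_all()
  if (me > 1) halo(:) = edge(:)[me - 1]      ! one-sided remote read

  ! (b) PUT style (the equivalent exchange): the producer pushes and notifies.
  if (me < num_images()) then
     halo(:)[me + 1] = edge(:)               ! one-sided remote write
     call sync_notify(me + 1)
  end if
  if (me > 1) call sync_wait(me - 1)         ! halo(:) is now safe to use
end subroutine halo_exchange

A PUT can be issued as soon as the data is ready and overlapped with other work, whereas a GET stalls the reading image until the data arrives, which is the usual explanation for the gap the poster reports.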

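Finally, "packing of strided communication" from the Current Optimizations list means marshalling a non-contiguous array section into a contiguous buffer so that it travels in one large transfer instead of many small ones. The fragment below is a hand-written sketch of the idea with hypothetical names and sizes; the compiler applies an equivalent transformation automatically.

subroutine strided_put
  real, save :: a(100, 100)[*]   ! distributed data, one block per image
  real, save :: buf(50)[*]       ! contiguous staging buffer
  integer :: me
  me = this_image()

  if (me < num_images()) then
     buf(:) = a(1:100:2, 1)      ! pack the strided section into the buffer
     buf(:)[me + 1] = buf(:)     ! one contiguous PUT instead of many small ones
     call sync_notify(me + 1)
  end if
  if (me > 1) then
     call sync_wait(me - 1)
     a(1:100:2, 100) = buf(:)    ! unpack on the receiving image
  end if
end subroutine strided_put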