
Experiences Building a Multi-platform Compiler for Co-array Fortran



1. Experiences Building a Multi-platform Compiler for Co-array Fortran
John Mellor-Crummey, Cristian Coarfa, Yuri Dotsenko
Department of Computer Science, Rice University
AHPCRC PGAS Workshop, September 2005

2. Goals for HPC Languages
• Expressiveness
• Ease of programming
• Portable performance
• Ubiquitous availability

3. PGAS Languages
• Global address space programming model
  • one-sided communication (GET/PUT) – simpler than message passing
• Programmer has control over performance-critical factors – lacking in OpenMP
  • data distribution and locality control – HPF & OpenMP compilers must get this right
  • computation partitioning
  • communication placement
• Data movement and synchronization as language primitives
  • amenable to compiler-based communication optimization

4. Co-array Fortran Programming Model
• SPMD process images
  • fixed number of images during execution
  • images operate asynchronously
• Both private and shared data
  • real x(20,20) – a private 20x20 array in each image
  • real y(20,20)[*] – a shared 20x20 array in each image
• Simple one-sided shared-memory communication (a complete sketch follows this slide)
  • x(:,j:j+2) = y(:,p:p+2)[r] – copy columns from image r into local columns
• Synchronization intrinsic functions
  • sync_all – a barrier and a memory fence
  • sync_mem – a memory fence
  • sync_team([team members to notify], [team members to wait for])
• Pointers and (perhaps asymmetric) dynamic allocation
• Parallel I/O
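A minimal, self-contained sketch tying these features together. It is illustrative only: the array names, sizes, and neighbor pattern are not from the slides, and it uses just the CAF features listed above (co-arrays, this_image(), num_images(), one-sided GET, sync_all).

    program caf_demo
      implicit none
      real :: x(20,20)        ! private: one copy per image
      real :: y(20,20)[*]     ! co-array: remotely addressable by any image
      integer :: r

      y = real(this_image())  ! initialize the local co-array
      call sync_all()         ! barrier + fence: local writes visible before remote reads

      ! each image GETs three columns from its right neighbor (with wraparound)
      r = mod(this_image(), num_images()) + 1
      x(:,1:3) = y(:,1:3)[r]  ! one-sided GET from image r

      call sync_all()         ! no image may overwrite y while GETs are in flight
    end program caf_demo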

5. One-sided Communication with Co-Arrays

    integer a(10,20)[*]

    if (this_image() > 1) &
      a(1:10,1:2) = a(1:10,19:20)[this_image()-1]

[Figure: each of images 1..N holds a local a(10,20); every image but the first copies the last two columns of its left neighbor's co-array into its own first two columns.]

6. CAF Compilers
• Cray compilers for the X1 & T3E architectures
• Rice Co-Array Fortran Compiler (cafc)

7. Rice cafc Compiler
• Source-to-source compiler
  • source-to-source yields multi-platform portability
• Implements core language features
  • core is sufficient for non-trivial codes
  • preliminary support for derived types
  • support for allocatable components coming soon
• Open source
• Performance comparable to that of hand-tuned MPI codes

8. Implementation Strategy
• Goals
  • portability
  • high performance on a wide range of platforms
• Approach
  • source-to-source compilation of CAF codes
  • use the Open64/SL Fortran 90 infrastructure
  • CAF → Fortran 90 + communication operations
• Communication
  • ARMCI and GASNet one-sided communication libraries for portability
  • load/store communication on shared-memory platforms

9. Key Implementation Concerns
• Fast access to local co-array data
• Fast communication
• Overlap of communication and computation

10. Accessing Co-Array Data: Two Representations
• SAVE and COMMON co-arrays as Fortran 90 pointers
  • F90 pointers to memory allocated outside the Fortran run-time system
  • original reference accessing local co-array data:
      rhs(1,i,j,k,c) = ... + u(1,i-1,j,k,c) - ...
  • transformed reference:
      rhs%ptr(1,i,j,k,c) = ... + u%ptr(1,i-1,j,k,c) - ...
  • for example, a co-array declared as

        real :: a(10,10,10)[*]

    is represented by a descriptor holding an F90 pointer:

        type CAFDesc_real_3
          real, pointer :: ptr(:,:,:)   ! F90 pointer to local co-array data
        end type CAFDesc_real_3
        type(CAFDesc_real_3) :: a

• Procedure co-array arguments as F90 explicit-shape arrays
  • the CAF language requires explicit shape for co-array arguments

11. Performance Challenges
• Problem: the Fortran 90 pointer-based representation does not convey
  • the lack of co-array aliasing
  • the contiguity of co-array data
  • co-array bounds information
• This lack of knowledge inhibits important code optimizations
• Approach: procedure splitting

12. Procedure Splitting: a CAF-to-CAF Optimization

Before splitting:

    subroutine f(...)
      real, save :: c(100)[*]
      ... = c(50) ...
    end subroutine f

After splitting, the body moves into an inner procedure that receives the co-array as an explicit-shape argument:

    subroutine f(...)
      real, save :: c(100)[*]
      interface
        subroutine f_inner(..., c_arg)
          real :: c_arg(100)[*]
        end subroutine f_inner
      end interface
      call f_inner(..., c(1))
    end subroutine f

    subroutine f_inner(..., c_arg)
      real :: c_arg(100)[*]
      ... = c_arg(50) ...
    end subroutine f_inner

• Benefits
  • better alias analysis
  • contiguity of co-array data
  • co-array bounds information
  • better dependence analysis
• Result: the back-end compiler can generate better code

13. Implementing Communication
• Example: x(1:n) = a(1:n)[p] + ...
• General approach: use a buffer to hold off-processor data (sketched after this slide)
  • allocate buffer
  • perform GET to fill buffer
  • perform computation: x(1:n) = buffer(1:n) + ...
  • deallocate buffer
• Optimizations
  • no buffer for co-array-to-co-array copies
  • unbuffered load/store on shared-memory systems
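A sketch of the buffered translation for the GET above, assuming n, p, x, and a are as in the example; caf_get is a hypothetical placeholder for the ARMCI/GASNet call the cafc runtime actually issues, and y stands in for the rest of the right-hand side:

    real, allocatable :: buffer(:)

    allocate(buffer(n))              ! temporary for the off-processor data
    call caf_get(buffer, a, n, p)    ! hypothetical runtime GET of a(1:n) from image p
    x(1:n) = buffer(1:n) + y(1:n)    ! compute using the local copy
    deallocate(buffer)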

14. Strided vs. Contiguous Transfers
• Problem: a CAF remote reference might induce many small data transfers
  • a(i,1:n)[p] = b(j,1:n)
• Solution: pack strided data on the source and unpack it on the destination (see the sketch after this slide)
• Constraints
  • can't express both source-level packing and unpacking for a one-sided transfer
  • two-sided packing/unpacking is awkward for users
• Preferred approach: have the communication layer perform packing/unpacking
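A sketch of why source-level packing is only half a solution for a one-sided PUT: the source image can pack b(j,1:n) into a contiguous buffer and send it in one transfer, but nothing on image p naturally runs to unpack it into a(i,1:n). The names tmp and tmp_remote are illustrative, not from the slides or any library:

    real :: tmp(n)                  ! contiguous send buffer (illustrative)
    real :: tmp_remote(n)[*]        ! staging co-array on every image (illustrative)

    tmp(1:n) = b(j,1:n)             ! pack the strided row locally
    tmp_remote(1:n)[p] = tmp(1:n)   ! one contiguous PUT instead of n tiny ones
    ! ...but image p must still copy tmp_remote(1:n) into a(i,1:n),
    ! and a one-sided transfer gives the programmer no hook to express that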

15. Pragmatics of Packing
Who should implement packing?
• The CAF programmer
  • difficult to program
• The CAF compiler
  • must convert PUTs into two-sided communication to unpack
  • a difficult whole-program transformation
• The communication library
  • the most natural place
  • ARMCI currently performs packing on Myrinet (at least)

16. Synchronization
• Original CAF specification: team synchronization only
  • sync_all, sync_team
  • limits performance on loosely-coupled architectures
• Point-to-point extensions (see the sketch after this slide)
  • sync_notify(q)
  • sync_wait(p)
• Point-to-point synchronization semantics: when a notify from image p is delivered to image q, all communication from p to q issued before the notify has also been delivered to q
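A sketch of the extensions in use: a producer image PUTs into a consumer image's co-array and notifies it, and the consumer waits on that one image instead of joining a barrier. The buffer and the choice of images 1 and 2 are illustrative:

    real :: buf(1024)[*]
    integer, parameter :: p = 1, q = 2   ! illustrative producer/consumer pair

    if (this_image() == p) then
      buf(:)[q] = buf(:)        ! one-sided PUT into image q
      call sync_notify(q)       ! delivered only after the PUT has been delivered
    else if (this_image() == q) then
      call sync_wait(p)         ! wait for p alone; no global barrier
      print *, sum(buf)         ! safe: p's PUT has arrived
    end if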

17. Hiding Communication Latency
Goal: enable communication/computation overlap
• Impediments to generating non-blocking communication
  • use of indexed subscripts in co-dimensions
  • lack of whole-program analysis
• Approach: support hints for non-blocking communication (sketched after this slide)
  • overcome conservative compiler analysis
  • enable sophisticated programmers to achieve good performance today
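A sketch of how such hints might be used; the region markers caf_nb_begin/caf_nb_end are hypothetical placeholders invented for illustration, not the actual cafc hint syntax:

    call caf_nb_begin()     ! hypothetical hint: PUTs below may be issued non-blocking
    a(1:n)[p] = a(1:n)      ! PUT is issued; control returns immediately
    call caf_nb_end()       ! hypothetical hint: end of non-blocking region

    call do_local_work()    ! computation overlapped with the in-flight PUT
    call sync_notify(p)     ! by slide 16's semantics, the PUT completes before p sees the notify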

18. Questions about PGAS Languages
• Performance
  • can performance match hand-tuned message-passing programs?
  • what are the obstacles to top performance?
  • what should be done to overcome them?
    • language modifications or extensions?
    • program implementation strategies?
    • compiler technology?
    • run-time system enhancements?
• Programmability
  • how easy is it to develop high-performance programs?

19. Investigating these Issues
Evaluate CAF, UPC, and MPI versions of the NAS benchmarks
• Performance
  • compare CAF and UPC performance to that of the MPI versions
  • use hardware performance counters to pinpoint differences
  • determine optimization techniques common to both languages as well as language-specific ones
    • language features
    • program implementation strategies
    • compiler optimizations
    • runtime optimizations
• Programmability
  • assess the programmability of the CAF and UPC variants

20. Platforms and Benchmarks
• Platforms
  • Itanium2 + Myrinet 2000 (900 MHz Itanium2)
  • Alpha + Quadrics QSNetI (1 GHz Alpha EV6.8CB)
  • SGI Altix 3000 (1.5 GHz Itanium2)
  • SGI Origin 2000 (R10000)
• Codes
  • NAS Parallel Benchmarks (NPB 2.3) from NASA Ames: MG, CG, SP, BT
  • CAF and UPC versions were derived from the Fortran77+MPI versions

21. MG class A (256³) on Itanium2+Myrinet 2000
[Performance chart; higher is better]
• Intel compiler: restrict yields a factor of 2.3 performance improvement
• CAF: point-to-point synchronization 35% faster than barriers
• UPC: strided communication 28% faster than multiple transfers
• UPC: point-to-point synchronization 49% faster than barriers

22. MG class C (512³) on SGI Altix 3000
[Performance chart; higher is better]
• Intel C compiler: scalar performance
• Intel Fortran compiler: linearized array subscripts cause a 30% slowdown compared to multidimensional subscripts

23. MG class B (256³) on SGI Origin 2000
[Performance chart; higher is better]

24. CG class C (150000) on SGI Altix 3000
[Performance chart; higher is better]
• Intel compiler: sum reductions in C are 2.6 times slower than in Fortran!
• point-to-point synchronization 19% faster than barriers

25. CG class B (75000) on SGI Origin 2000
[Performance chart; higher is better]
• Intrepid compiler (gcc): sum reductions in C are up to 54% slower than with SGI C/Fortran!

26. SP class C (162³) on Itanium2+Myrinet 2000
[Performance chart; higher is better]
• restrict yields an 18% performance improvement

27. SP class C (162³) on Alpha+Quadrics
[Performance chart; higher is better]

28. BT class C (162³) on Itanium2+Myrinet 2000
[Performance chart; higher is better]
• CAF: communication packing 7% faster
• CAF: procedure splitting improves performance 42-60%
• UPC: communication packing 32% faster
• UPC: use of restrict boosts performance 43%

29. BT class B (102³) on SGI Altix 3000
[Performance chart; higher is better]
• use of restrict improves performance 30%

30. Performance Observations
• Achieving highest performance can be difficult
  • need effective optimizing compilers for PGAS languages
• The communication layer is not the problem
  • CAF with ARMCI or GASNet yields equivalent performance
• Scalar code optimization of scientific code is the key!
  • SP + BT: SGI Fortran: unroll-and-jam, software pipelining
  • MG: SGI Fortran: loop alignment, fusion
  • CG: Intel Fortran: optimized sum reduction
• Linearized subscripts for multidimensional arrays hurt!
  • measured 30% performance gap with Intel Fortran

31. Performance Prescriptions
For portable high performance, we need ...
• Better language support for CAF synchronization
  • point-to-point synchronization is an important common case!
  • currently only a Rice extension outside the CAF standard
• Better CAF & UPC compiler support
  • communication vectorization
  • synchronization strength reduction: important for programmability
• Compiler optimization of loops with complex dependences
• Better run-time library support
  • efficient communication support for strided array sections

32. Programmability Observations
• Matching MPI performance required using bulk communication
  • communicating multi-dimensional array sections is natural in CAF
  • library-based primitives are cumbersome in UPC
• Strided communication is problematic for performance
  • tedious programming of packing/unpacking at the source level
• Wavefront computations
  • MPI buffered communication easily decouples sender and receiver
  • in PGAS models, buffering must be explicitly managed by the programmer
