
An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C


Presentation Transcript


  1. An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C • Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey (Rice University) • Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao (George Washington University) • Daniel Chavarria-Miranda (Pacific Northwest National Laboratory)

  2. GAS Languages • Global address space programming model • one-sided communication (GET/PUT) • Programmer has control over performance-critical factors (lacking in OpenMP) • data distribution and locality control (HPF & OpenMP compilers must get this right) • computation partitioning • communication placement • Data movement and synchronization as language primitives (simpler than message passing) • amenable to compiler-based communication optimization

  3. Questions • Can GAS languages match the performance of hand-tuned message passing programs? • What are the obstacles to obtaining performance with GAS languages? • What should be done to ameliorate them? • by language modifications or extensions • by compilers • by run-time systems • How easy is it to develop high performance programs in GAS languages?

  4. Approach • Evaluate CAF and UPC using the NAS Parallel Benchmarks • Compare performance to that of the MPI versions • use hardware performance counters to pinpoint differences • Determine optimization techniques common to both languages, as well as language-specific ones • language features • program implementation strategies • compiler optimizations • runtime optimizations • Assess programmability of the CAF and UPC variants

  5. Outline • Questions and approach • CAF & UPC • Features • Compilers • Performance considerations • Experimental evaluation • Conclusions

  6. CAF & UPC Common Features • SPMD programming model • Both private and shared data • Language-level one-sided shared-memory communication • Synchronization intrinsic functions (barrier, fence) • Pointers and dynamic allocation
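
A minimal UPC sketch of these common features (the identifiers counter and local_val, and the use of at least two threads, are illustrative and not from the talk): SPMD execution, private vs. shared data, a one-sided read of remote shared data, and a barrier intrinsic.

    #include <upc.h>
    #include <stdio.h>

    shared int counter[THREADS];   /* shared: one element has affinity to each thread */
    int local_val;                 /* private: every thread has its own copy */

    int main(void) {
        local_val = MYTHREAD;
        counter[MYTHREAD] = local_val;   /* write to the locally owned shared element */
        upc_barrier;                     /* synchronization intrinsic */
        if (MYTHREAD == 0 && THREADS > 1)
            printf("thread 0 reads counter[1] = %d\n", counter[1]);  /* one-sided remote read */
        return 0;
    }

The CAF counterpart would declare counter as a co-array (integer :: counter[*]) and read it with bracket syntax, as slide 8 shows.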

  7. CAF & UPC Differences I • Multidimensional arrays • CAF: multidimensional arrays, procedure argument reshaping • UPC: linearization, typically using macros • Local accesses to shared data • CAF: Fortran 90 array syntax without brackets, e.g. a(1:M,N) • UPC: shared array reference using MYTHREAD or a C pointer
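
A sketch of the UPC linearization idiom the slide refers to (N, M, IDX, and scale_local are illustrative names): a logically N x M plane per thread is stored as a 1-D shared array, subscripts are hand-linearized with a macro, and local accesses go through a plain C pointer.

    #include <upc.h>

    #define N 64
    #define M 64

    /* One contiguous N*M block per thread stands in for a true
       multidimensional array; IDX hand-linearizes the subscripts. */
    shared [N*M] double a[THREADS][N*M];
    #define IDX(i, j) ((i) * M + (j))

    void scale_local(double s) {
        /* cast the locally owned block to a plain C pointer for fast local access */
        double *mine = (double *)&a[MYTHREAD][0];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                mine[IDX(i, j)] *= s;
    }

In CAF the same data would simply be declared as a multidimensional co-array, e.g. double precision :: a(N,M)[*], and indexed as a(i,j).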

  8. CAF and UPC Differences II • Scalar/element-wise remote accesses • CAF: multidimensional subscripts + bracket syntax, e.g. a(1,1) = a(1,M)[this_image()-1] • UPC: shared (“flat”) array access with linearized subscripts, e.g. a[N*M*MYTHREAD] = a[N*M*MYTHREAD-N] • Bulk and strided remote accesses • CAF: use the natural syntax of Fortran 90 array sections and operations on remote co-array sections (fewer temporaries on SMPs) • UPC: use library functions (and temporary storage to hold a copy)

  9. Bulk Communication • [figure: an N x M array on each of images P1 ... PN; each image fetches the last two columns of its left neighbor] • CAF: integer a(N,M)[*] ; a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1] • UPC: shared int *a; upc_memcpy(&a[N*M*MYTHREAD], &a[N*M*MYTHREAD-2*N], 2*N*sizeof(int));

  10. CAF & UPC Differences III • Synchronization • CAF: team synchronization • UPC: split-phase barrier, locks • UPC: worksharing construct upc_forall • UPC: richer set of pointer types
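
A short sketch of the UPC-specific constructs named above (v, N, and scale are illustrative): upc_forall assigns iterations by affinity, and upc_notify/upc_wait form the split-phase barrier.

    #include <upc.h>

    #define N 1024
    shared double v[N];

    void scale(double s) {
        int i;
        /* worksharing: iteration i runs on the thread with affinity to v[i],
           so every access below is local */
        upc_forall (i = 0; i < N; i++; &v[i])
            v[i] *= s;

        upc_notify;   /* split-phase barrier: signal arrival ...            */
        /* ... independent local work could overlap the wait here ...       */
        upc_wait;     /* ... then block until all threads have notified     */
    }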

  11. Outline • Questions and approach • CAF & UPC • Features • Compilers • Performance considerations • Experimental evaluation • Conclusions

  12. CAF Compilers • Rice Co-Array Fortran Compiler (cafc) • Multi-platform compiler • Implements core of the language • core sufficient for non-trivial codes • currently lacks support for derived type and dynamic co-arrays • Source-to-source translator • translates CAF into Fortran 90 and communication code • uses ARMCI or GASNet as communication substrate • can generate load/store for remote data accesses on SMPs • Performance comparable to that of hand-tuned MPI codes • Open source • Vendor compilers: Cray

  13. UPC Compilers • Berkeley UPC Compiler • Multi-platform compiler • Implements full UPC 1.1 specification • Source-to-source translator • converts UPC into ANSI C and calls to UPC runtime library & GASNet • tailors code to a specific architecture: cluster or SMP • Open source • Intrepid UPC compiler • Based on GCC compiler • Works on SGI Origin, Cray T3E and Linux SMP • Other vendor compilers: Cray, HP

  14. Outline • Motivation and Goals • CAF & UPC • Features • Compilers • Performance considerations • Experimental evaluation • Conclusions

  15. Scalar Performance • Generate code amenable to back-end compiler optimizations • Quality of back-end compilers • poor reduction recognition in the Intel C compiler • Local access to shared data • CAF: use F90 pointers and procedure arguments • UPC: use C pointers instead of UPC shared pointers • Alias and dependence analysis • Fortran vs. C language semantics • multidimensional arrays in Fortran • procedure argument reshaping • Convey lack of aliasing for (non-aliased) shared variables • CAF: use procedure splitting so co-arrays are referenced as procedure arguments • UPC: use the C99 restrict keyword for C pointers used to access shared data
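
A sketch of the restrict idiom from the last bullet (u, w, and axpy_local are illustrative names): the local parts of shared arrays are accessed through restrict-qualified C pointers, so the back-end C compiler can assume they do not alias.

    #include <upc.h>

    #define N 1024
    shared [N] double u[THREADS][N];
    shared [N] double w[THREADS][N];

    void axpy_local(double alpha) {
        /* restrict tells the back-end compiler that up and wp do not alias,
           enabling vectorization and software pipelining of the loop */
        double *restrict up = (double *)&u[MYTHREAD][0];
        double *restrict wp = (double *)&w[MYTHREAD][0];
        for (int i = 0; i < N; i++)
            wp[i] += alpha * up[i];
    }

Slides 20, 25, 27, and 28 quantify what this buys on the Itanium2 and Altix systems.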

  16. Communication • Communication vectorization is essential for high performance on cluster architectures for both languages • CAF: use F90 array sections (the compiler translates them to appropriate library calls) • UPC: use library functions for contiguous transfers • use UPC extensions for strided transfers in the Berkeley UPC compiler • Increase the efficiency of strided transfers by packing/unpacking data at the language level
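
One way to do the language-level packing the last bullet describes, sketched in UPC (send_column and its parameters are illustrative, not from the talk): gather a strided column into a contiguous buffer, then issue a single bulk upc_memput instead of N small transfers.

    #include <upc.h>

    #define N 64
    #define M 64
    shared [N*M] double a[THREADS][N*M];

    /* remote must point at an N-element landing zone with affinity to the
       destination thread; buf is a private scratch buffer of N doubles */
    void send_column(int col, double *buf, shared double *remote) {
        double *mine = (double *)&a[MYTHREAD][0];
        for (int i = 0; i < N; i++)
            buf[i] = mine[i * M + col];               /* pack: stride M -> unit stride */
        upc_memput(remote, buf, N * sizeof(double));  /* one contiguous transfer */
    }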

  17. Synchronization • Barrier-based synchronization • Can lead to over-synchronized code • Use point-to-point synchronization • CAF: proposed language extension (sync_notify, sync_wait) • UPC: language-level implementation
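
A sketch of the language-level point-to-point synchronization mentioned for UPC (the ready flag array and function names are illustrative): strict shared accesses provide the ordering needed for pairwise signaling without a full barrier.

    #include <upc.h>

    /* one flag per thread; strict accesses are strongly ordered,
       which makes the flags safe to use for signaling */
    strict shared int ready[THREADS];

    void notify(int peer) {
        ready[peer] = 1;              /* signal the consumer thread */
    }

    void wait_for_notify(void) {
        while (!ready[MYTHREAD])
            ;                         /* spin until a producer signals us */
        ready[MYTHREAD] = 0;          /* reset the flag for the next phase */
    }

The CAF counterpart is the proposed sync_notify/sync_wait pair from the slide.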

  18. Outline • Questions and approach • CAF & UPC • Experimental evaluation • Conclusions

  19. Platforms and Benchmarks • Platforms • Itanium2+Myrinet 2000 (900 MHz Itanium2) • Alpha+Quadrics QSNetI (1 GHz Alpha EV68CB) • SGI Altix 3000 (1.5 GHz Itanium2) • SGI Origin 2000 (R10000) • Codes • NAS Parallel Benchmarks (NPB 2.3) from NASA Ames • MG, CG, SP, BT • CAF and UPC versions were derived from the Fortran77+MPI versions

  20. MG class A (256³) on Itanium2+Myrinet 2000 (higher is better) • Intel compiler: restrict yields a 2.3x performance improvement • CAF: point-to-point synchronization 35% faster than barriers • UPC: strided communication 28% faster than multiple transfers • UPC: point-to-point synchronization 49% faster than barriers

  21. MG class C (512³) on SGI Altix 3000 (higher is better) • Intel C compiler: scalar performance • Fortran compiler: linearized array subscripts cause a 30% slowdown compared to multidimensional subscripts

  22. MG class B (256³) on SGI Origin 2000 (higher is better)

  23. CG class C (150000) on SGI Altix 3000 (higher is better) • Intel compiler: sum reductions in C are 2.6 times slower than in Fortran! • point-to-point synchronization 19% faster than barriers

  24. CG class B (75000) on SGI Origin 2000 (higher is better) • Intrepid compiler (gcc): sum reductions in C are up to 54% slower than with SGI C/Fortran!

  25. SP class C (162³) on Itanium2+Myrinet 2000 (higher is better) • restrict yields an 18% performance improvement

  26. SP class C (162³) on Alpha+Quadrics (higher is better)

  27. BT class C (162³) on Itanium2+Myrinet 2000 (higher is better) • CAF: communication packing 7% faster • CAF: procedure splitting improves performance by 42-60% • UPC: communication packing 32% faster • UPC: use of restrict boosts performance by 43%

  28. BT class B (102³) on SGI Altix 3000 (higher is better) • use of restrict improves performance by 30%

  29. Conclusions • Matching MPI performance required using bulk communication • library-based primitives are cumbersome in UPC • communicating multi-dimensional array sections is natural in CAF • lack of efficient run-time support for strided communication is a problem • With CAF, can achieve performance comparable to MPI • With UPC, matching MPI performance can be difficult • CG: able to match MPI on all platforms • SP, BT, MG: substantial gap remains

  30. Why the Gap? • The communication layer is not the problem • CAF with ARMCI or GASNet yields equivalent performance • Scalar optimization of the scientific code is the key! • SP+BT: SGI Fortran: unroll-and-jam, software pipelining • MG: SGI Fortran: loop alignment, fusion • CG: Intel Fortran: optimized sum reduction • Linearized subscripts for multidimensional arrays hurt! • measured 30% performance gap with Intel Fortran

  31. Programming for Performance • In the absence of effective optimizing compilers for CAF and UPC, achieving high performance is difficult • To make codes efficient across the full range of architectures, we need • better language support for synchronization • point-to-point synchronization is an important common case! • better CAF & UPC compiler support • communication vectorization • synchronization strength reduction • better compiler optimization of loops with complex dependence patterns • better run-time library support • efficient communication of strided array sections
