Experiences with Co-array Fortran on Hardware Shared Memory Platforms



  1. Experiences with Co-array Fortran on Hardware Shared Memory Platforms
  Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey, Daniel Chavarria-Miranda
  Rice University, Houston, TX

  2. Co-array Fortran
  • Global Address Space (GAS) language
  • SPMD programming model
  • Simple extension of Fortran 90
  • Explicit control over data placement and computation distribution
    • Private data
    • Shared data: both local and remote
  • One-sided communication (PUT and GET)
  • Team and point-to-point synchronization
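  A minimal fragment in CAF syntax illustrating these features, complementing the GET example on the next slide (the array shapes, the temporary t, and the neighbor exchange are illustrative, not from the slides):

      real :: x(100)[*]              ! shared: one copy of x per image
      real :: t(10)                  ! private: ordinary local data
      integer :: right

      right = this_image() + 1
      if (right <= num_images()) then
         x(1:10)[right] = x(91:100)  ! one-sided PUT to the right neighbor
         t = x(1:10)[right]          ! one-sided GET from the same neighbor
      end if
      call sync_all()                ! team synchronization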

  3. Co-array Fortran: Example
  [Figure: co-array a(10,20), one copy per image, on images 1, 2, …, N]

      integer :: a(10,20)[*]
      if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]  ! copies from left neighbor

  4. Compiling CAF
  • Source-to-source translation
  • Prototype: Rice cafc
    • Fortran 90 pointer-based co-array representation
    • ARMCI-based data movement
  • Goal: performance transparency
  • Challenges:
    • Retain CAF source-level information: array contiguity, array bounds, lack of aliasing
    • Exploit efficient fine-grain communication on SMPs

  5. Outline
  • Co-array representation and data access
    • Local data
    • Remote data
  • Experimental evaluation
  • Conclusions

  6. Representation and Access for Local Data
  Efficient local access to SAVE/COMMON co-arrays is crucial to achieving the best performance on a target architecture. Candidate representations:
  • Fortran 90 pointer
  • Fortran 90 pointer to structure
  • Cray pointer
  • Subroutine argument
  • COMMON block (needs support for symmetric shared objects)

  7. Fortran 90 Pointer Representation
  CAF declaration:
      real, save :: a(10,20)[*]
  After translation:
      type T1
         integer(PtrSize) :: handle
         real, pointer :: local(:,:)
      end type T1
      type (T1) :: ca
  Local access: ca%local(2,3)
  • Portable representation
  • Back-end compiler has no knowledge about:
    • Potential aliasing (no-alias flags exist for some compilers)
    • Contiguity
    • Bounds
  • Implemented in cafc
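  The slide does not show how the translated pointer becomes associated with co-array storage; the runnable sketch below illustrates the idea serially (the storage array standing in for runtime-allocated memory, the integer(8) kind for PtrSize, and the handle value are all assumptions):

      program f90_ptr_sketch
        implicit none
        type T1
           integer(8) :: handle                       ! stands in for integer(PtrSize)
           real, pointer :: local(:,:)
        end type T1
        type (T1) :: ca
        real, allocatable, target :: storage(:,:)     ! stands in for memory the CAF runtime allocates
        allocate(storage(10,20))
        storage = 0.0
        ca%local => storage                           ! translator associates the pointer once at startup
        ca%handle = 0                                 ! opaque runtime handle (placeholder value)
        ca%local(2,3) = 1.0                           ! local access, exactly as on the slide
        print *, ca%local(2,3)
      end program f90_ptr_sketch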

  8. Fortran 90 Pointer to Structure Representation
  CAF declaration:
      real, save :: a(10,20)[*]
  After translation:
      type T1
         real :: local(10,20)
      end type T1
      type (T1), pointer :: ca
  • Conveys constant bounds and contiguity
  • Potential aliasing is still a problem

  9. Cray Pointer Representation
  CAF declaration:
      real, save :: a(10,20)[*]
  After translation:
      real :: a_local(10,20)
      pointer (a_ptr, a_local)
  • Conveys constant bounds and contiguity
  • Potential aliasing is still a problem
  • The Cray pointer is not in the Fortran 90 standard

  10. Subroutine Argument Representation
  CAF source:
      subroutine foo(…)
        real, save :: a(10,20)[*]
        a(i,j) = … + a(i-1,j) * …
      end subroutine foo
  After translation:
      subroutine foo(…)
        ! F90 representation for co-array a
        call foo_body(ca%local(1,1), ca%handle, …)
      end subroutine foo

      subroutine foo_body(a_local, a_handle, …)
        real :: a_local(10,20)
        a_local(i,j) = … + a_local(i-1,j) * …
      end subroutine foo_body

  11. Subroutine Argument Representation (cont.)
  • Avoids conservative assumptions about co-array aliasing by the back-end compiler
  • Performance is close to optimal
  • Costs extra procedures and procedure calls
  • Implemented in cafc (a runnable sketch of the transformation follows)
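  To make the transformation concrete, here is a self-contained, runnable sketch; the module, the filled-in loop indices, and the startup code replacing the slide's ellipses are assumptions, not verbatim cafc output:

      module co_array_repr
        implicit none
        type T1
           integer(8) :: handle                 ! opaque runtime handle (placeholder)
           real, pointer :: local(:,:)
        end type T1
        type (T1) :: ca
      end module co_array_repr

      subroutine foo()
        use co_array_repr
        ! the wrapper passes the co-array data as an explicit-shape argument, so
        ! foo_body sees a plain array with known bounds, contiguity, and no aliasing
        call foo_body(ca%local(1,1), ca%handle)
      end subroutine foo

      subroutine foo_body(a_local, a_handle)
        implicit none
        real :: a_local(10,20)
        integer(8) :: a_handle
        integer :: i, j
        i = 2; j = 1                             ! made-up indices for the slide's ellipses
        a_local(i,j) = 1.0 + a_local(i-1,j) * 2.0
      end subroutine foo_body

      program main
        use co_array_repr
        implicit none
        real, allocatable, target :: storage(:,:)  ! stands in for runtime-allocated memory
        allocate(storage(10,20)); storage = 0.0
        ca%local => storage; ca%handle = 0
        call foo()
        print *, ca%local(2,1)
      end program main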

  12. COMMON Block Representation
  CAF declaration:
      real :: a(10,20)[*]
      common /a_cb/ a
  After translation:
      real :: ca(10,20)
      common /ca_cb/ ca
  • Yields the best performance for local accesses
  • The OS must support symmetric data objects

  13. Outline
  • Co-array representation and data access
    • Local data
    • Remote data
  • Experimental evaluation
  • Conclusions

  14. Generating CAF Communication
  • Generic parallel architectures:
    • Library function calls to move data
  • Shared memory architectures (load/store):
    • Fortran 90 pointers
    • Vector of Fortran 90 pointers
    • Cray pointers

  15. Communication Generation for Generic Parallel Architectures
  CAF code:
      a(:) = b(:)[p] + …
  Translated code:
      allocate( b_temp(…) )
      call GET( b, p, b_temp, … )
      a(:) = b_temp(:) + …
      deallocate( b_temp )
  • Portable: works on clusters and SMPs
  • Function call overhead per fine-grain access
  • Uses a temporary to hold off-processor data
  • Implemented in cafc
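  A self-contained sketch of this temporary-buffer pattern; the GET stub and the b_images array standing in for each image's data are assumptions (the real cafc runtime performs a one-sided ARMCI transfer instead):

      module fake_runtime
        implicit none
        real :: b_images(8,4)            ! stand-in for b on each of 4 images
      contains
        subroutine GET(p, dest)
          integer, intent(in) :: p
          real, intent(out) :: dest(:)
          dest = b_images(:, p)          ! a real GET would be a one-sided remote read
        end subroutine GET
      end module fake_runtime

      program generic_comm_sketch
        use fake_runtime
        implicit none
        real :: a(8)
        real, allocatable :: b_temp(:)
        integer :: p
        b_images = 3.0
        p = 2
        allocate( b_temp(8) )
        call GET(p, b_temp)              ! fetch remote data into a temporary
        a(:) = b_temp(:) + 1.0           ! compute with the local copy
        deallocate( b_temp )
        print *, a(1)
      end program generic_comm_sketch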

  16. Communication Generation Using Fortran 90 Pointers
  CAF code:
      do j = 1, N
        C(j) = A(j)[p]
      end do
  Translated code:
      do j = 1, N
        ptrA => A(j)
        call CafSetPtr(ptrA, p, A_handle)
        C(j) = ptrA
      end do
  • Function call overhead for each reference
  • Implemented in cafc

  17. Pointer Initialization Hoisting
  Naïvely translated code:
      do j = 1, N
        ptrA => A(j)
        call CafSetPtr(ptrA, p, A_handle)
        C(j) = ptrA
      end do
  Code with hoisted pointer initialization:
      ptrA => A(1:N)
      call CafSetPtr(ptrA, p, A_handle)
      do j = 1, N
        C(j) = ptrA(j)
      end do
  • Pointer initialization hoisting is not yet implemented in cafc

  18. Communication Generation Using Vector of Fortran 90 Pointers
  CAF code:
      do j = 1, N
        C(j) = A(j)[p]
      end do
  Translated code:
      … initialization …
      do j = 1, N
        C(j) = ptrVectorA(p)%ptrA(j)
      end do
  • Does not require pointer initialization hoisting and avoids function calls
  • Worse performance than hoisted pointer initialization (a sketch of the one-time initialization follows)
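  The slide elides the initialization; the runnable sketch below shows the idea, with a local 2-D array standing in for the per-image copies of A so the code runs serially (the wrapper type and all names here are assumptions):

      program vec_ptr_sketch
        implicit none
        integer, parameter :: N = 8, NIMAGES = 4
        type ptrWrap
           real, pointer :: ptrA(:)
        end type ptrWrap
        type (ptrWrap) :: ptrVectorA(NIMAGES)
        real, target :: A_images(N, NIMAGES)   ! stand-in for A on each image
        real :: C(N)
        integer :: p, j
        A_images = 1.0
        do p = 1, NIMAGES                      ! one-time initialization: one pointer per image
           ptrVectorA(p)%ptrA => A_images(:, p)
        end do
        p = 2
        do j = 1, N                            ! fine-grain reads, no per-iteration function call
           C(j) = ptrVectorA(p)%ptrA(j)
        end do
        print *, C(1)
      end program vec_ptr_sketch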

  19. Communication Generation Using Cray Pointers
  CAF code:
      do j = 1, N
        C(j) = A(j)[p]
      end do
  Translated code:
      integer(PtrSize) :: addrA(:)
      … addrA initialization …
      do j = 1, N
        ptrA = addrA(p)
        C(j) = A_rem(j)
      end do
  • addrA(p) holds the address of co-array A on image p
  • Cray pointer initialization hoisting yields only marginal improvement
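  A runnable approximation of this strategy (Cray pointers and LOC are non-standard extensions; compile with, e.g., gfortran -fcray-pointer; the stand-in array and all names are assumptions):

      program cray_ptr_sketch
        implicit none
        integer, parameter :: N = 8, NIMAGES = 4
        real, target :: A_images(N, NIMAGES)   ! stand-in for per-image copies of A
        real :: C(N)
        real :: A_rem(N)
        pointer (ptrA, A_rem)                  ! Cray pointer (non-standard extension)
        integer(8) :: addrA(NIMAGES)
        integer :: p, j
        A_images = 2.0
        do p = 1, NIMAGES
           addrA(p) = LOC(A_images(1, p))      ! runtime would record each image's base address
        end do
        p = 3
        ptrA = addrA(p)                        ! one integer assignment, hence hoisting buys little
        do j = 1, N
           C(j) = A_rem(j)                     ! direct load from "remote" memory
        end do
        print *, C(1)
      end program cray_ptr_sketch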

  20. Outline
  • Co-array representation and data access
    • Local data
    • Remote data
  • Experimental evaluation
  • Conclusions

  21. Experimental Platforms
  • SGI Altix 3000
    • 128 Itanium2 1.5 GHz processors, 6 MB L3 cache
    • Linux (2.4.21 kernel)
    • Intel Fortran Compiler 8.0
  • SGI Origin 2000
    • 16 MIPS R12000 350 MHz processors, 8 MB L2 cache
    • IRIX64 6.5
    • MIPSpro Compiler 7.3.1.3m

  22. Benchmarks
  • STREAM
  • Random Access
  • Spark98
  • NAS MG and SP

  23. STREAM
  Copy kernel (local and remote variants):
      DO J = 1, N
        C(J) = A(J)
      END DO

      DO J = 1, N
        C(J) = A(J)[p]
      END DO
  Triad kernel (local and remote variants):
      DO J = 1, N
        A(J) = B(J) + s*C(J)
      END DO

      DO J = 1, N
        A(J) = B(J)[p] + s*C(J)[p]
      END DO
  Goal: investigate how much of the architecture's memory bandwidth can be delivered at the language level
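  For reference, since slide 25 mentions the Scale kernel: the two standard STREAM kernels not shown here are Scale and Add; their remote variants below are written by analogy with the Copy and Triad variants above and are an assumption, not slide content:

      DO J = 1, N
        B(J) = s*C(J)             ! Scale, local
      END DO
      DO J = 1, N
        B(J) = s*C(J)[p]          ! Scale, remote (by analogy)
      END DO
      DO J = 1, N
        C(J) = A(J) + B(J)        ! Add, local
      END DO
      DO J = 1, N
        C(J) = A(J)[p] + B(J)[p]  ! Add, remote (by analogy)
      END DO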

  24. STREAM: Local Accesses
  • The COMMON block representation is the best, where the platform allows it
  • The subroutine argument representation performs similarly to the COMMON block representation
  • Pointer-based representations perform within 5% of the best on the Altix (with the no-aliasing flag) and within 15% on the Origin
  • The Fortran 90 pointer representation delivers only 30% of the best performance on the Altix without the flag asserting lack of pointer aliasing
  • Array section statements with the Fortran 90 pointer representation deliver 40-50% of the best performance on the Origin

  25. STREAM: Remote Accesses
  • The COMMON block representation for local accesses plus Cray pointers for remote accesses is the best combination
  • The subroutine argument representation plus Cray pointers for remote accesses performs similarly
  • Remote accesses through a function call per access perform very poorly: 24 times slower than the best on the Altix, five times slower on the Origin
  • The generic strategy (with intermediate temporaries) delivers only 50-60% of the best performance on the Altix and 30-40% on the Origin for vectorized code (except for the Copy kernel)
  • Pointer initialization hoisting is crucial for Fortran 90 pointer remote accesses and desirable for Cray pointers
  • A similarly coded OpenMP version has comparable performance on the Altix (90% for the Scale kernel) and 86-90% on the Origin

  26. Spark98
  • Based on CMU's earthquake simulation code
  • Computes a sparse matrix-vector product
  • Irregular application with fine-grain accesses
  • Matrix distribution and computation partitioning are done offline (sf2 traces)
  • Spark98 computes partial products locally, then assembles the result across processors
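  For readers unfamiliar with the kernel, a generic sparse matrix-vector product in compressed sparse row (CSR) form is sketched below; Spark98's actual data structures and partitioned symmetric kernel differ, and the matrix values are made up:

      program csr_spmv_sketch
        implicit none
        ! 3x3 sparse matrix with 5 nonzeros in CSR form (made-up values)
        integer, parameter :: n = 3
        integer :: rowptr(n+1), colidx(5), i, k
        real :: vals(5), x(n), y(n)
        rowptr = (/ 1, 3, 4, 6 /)
        colidx = (/ 1, 3, 2, 1, 3 /)
        vals   = (/ 2.0, 1.0, 3.0, 4.0, 5.0 /)
        x = 1.0
        do i = 1, n
           y(i) = 0.0
           do k = rowptr(i), rowptr(i+1) - 1   ! fine-grain, irregular accesses to x
              y(i) = y(i) + vals(k) * x(colidx(k))
           end do
        end do
        print *, y   ! expected: 3.0, 3.0, 9.0
      end program csr_spmv_sketch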

  27. Spark98 (cont.)
  • Versions:
    • Serial (Fortran kernel, ported from C)
    • MPI (Fortran kernel, ported from C)
    • Hybrid (best shared-memory threaded version)
  • CAF versions (based on the MPI version):
    • CAF Packed PUTs
    • CAF Packed GETs
    • CAF GETs (computation with remote data accessed "in place")

  28. Spark98 GETs Result Assembly
      v2(:,:) = v(:,:)
      call sync_all()
      do s = 0, subdomains-1
        if (commindex(s) < commindex(s+1)) then
          pos = commindex(s)
          comm_len = commindex(s+1) - pos
          v(:, comm(pos:pos+comm_len-1)) = &
            v(:, comm(pos:pos+comm_len-1)) + &
            v2(:, comm_gets(pos:pos+comm_len-1))[s]
        end if
      end do
      call sync_all()

  29. Spark98 GETs Result Assembly (repeats the code on the previous slide)

  30. Spark98 Performance on Altix
  • Performance of all CAF versions is comparable to that of MPI, and better at large CPU counts
  • CAF GETs is simpler and more "natural" to code, but up to 13% slower
  • Without locality awareness, applications do not scale on NUMA architectures (as the Hybrid version shows)
  • The ARMCI library is more efficient than MPI

  31. NAS MG and SP
  • Versions:
    • MPI (NPB 2.3)
    • CAF (based on MPI NPB 2.3):
      • Generic code generation with the subroutine argument co-array representation (procedure splitting)
      • Shared-memory code generation (Fortran 90 pointers; vectorized source code) with the subroutine argument co-array representation
    • OpenMP (NPB 3.0)
  • Class C

  32. NAS SP Performance on Altix
  • Performance of the CAF versions is comparable to that of MPI
  • CAF-generic outperforms CAF-shm because it uses memcpy, which hides latency by keeping an optimal number of memory operations in flight
  • OpenMP scales poorly

  33. NAS MG Performance on Altix

  34. Conclusions
  • Direct load/store communication improves the performance of fine-grain accesses by a factor of 24 on the Altix 3000 and a factor of five on the Origin 2000
  • Using remote data "in place" in CAF statements incurs acceptable abstraction overhead
  • Performance is comparable to that of MPI codes for both fine-grain and coarse-grain applications
  • We plan to implement optimal, architecture-dependent code generation for local and remote co-array accesses in cafc

  35. www.hipersoft.rice.edu/caf
