
An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications


Presentation Transcript


  1. An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications Daniel Chavarría-Miranda John Mellor-Crummey Dept. of Computer Science Rice University

  2. High-Performance Fortran (HPF) • Industry-standard data-parallel language • Partitioning of data drives partitioning of computation, … [Figure: HPF compilation flow: sequential Fortran program + data partitioning → HPF compiler (partition computation, insert comm / sync, manage storage) → parallel machine, producing the same answers as the sequential Fortran program]
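
  For reference, a minimal HPF sketch (illustrative only, not taken from the benchmarks): the programmer supplies data-distribution directives, and the compiler derives the computation partitioning, communication and storage management from them.

        subroutine jacobi(a, b, n)
        integer n
        real a(n, n), b(n, n)
  !HPF$ PROCESSORS p(4)
  !HPF$ DISTRIBUTE a(*, BLOCK) ONTO p
  !HPF$ ALIGN b(i, j) WITH a(i, j)
        integer i, j
        do j = 2, n - 1
          do i = 2, n - 1
            ! The BLOCK distribution of a decides which processor executes
            ! each (i, j) iteration and what boundary data must be communicated
            b(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) + a(i, j-1) + a(i, j+1))
          end do
        end do
        end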

  3. Motivation Obtaining high performance from applications written in high-level parallel languages has been elusive • Tightly-coupled applications are particularly hard • Data dependences serialize computation, inducing tradeoffs between parallelism and communication granularity & frequency • Traditional HPF partitionings limit scalability and performance • Communication might be needed inside loops

  4. Contributions • A set of compilation techniques that enable us to match hand-coded performance for tightly-coupled applications • An analysis of their performance impact

  5. dHPF Compiler • Based on an abstract equational framework • manipulates sets of processors, array elements, iterations and pairwise mappings between these sets • optimizations and code generation are implemented as operations on these sets and mappings • Sophisticated computation partitioning model • enables partial replication of computation to reduce communication • Support for the multipartitioning distribution • MULTI distribution specifier • suited for line-sweep computations • Innovative optimizations • reduce communication • improve locality
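
  The MULTI distribution specifier mentioned above is a dHPF extension; the following is only a plausible usage sketch, assuming it is written like a standard HPF distribution format (the exact syntax accepted by dHPF may differ).

        real u(102, 102, 102)
  ! dHPF's MULTI specifier requests a multipartitioning of the named
  ! dimensions, so a sweep along either distributed dimension keeps all
  ! processors busy (sketch only; the exact directive form is dHPF-specific)
  !HPF$ DISTRIBUTE u(*, MULTI, MULTI)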

  6. Overview • Introduction • Line Sweep Computations • Performance Comparison • Optimization Evaluation • Partially Replicated Computation • Interprocedural Communication Elimination • Communication Coalescing • Direct Access Buffers • Conclusions

  7. Line-Sweep Computations • 1D recurrences on a multidimensional domain • Recurrences order computation along each dimension • Compiler-based parallelization is hard: loop-carried dependences, fine-grained parallelism
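
  A schematic line sweep (illustrative, not the benchmarks' actual code): the forward 1D recurrence along j carries a dependence that serializes the j loop, leaving only fine-grained parallelism across i and k.

        subroutine sweep_j(a, c, nx, ny, nz)
        integer nx, ny, nz
        real a(nx, ny, nz), c(nx, ny, nz)
        integer i, j, k
        do k = 1, nz
          do j = 2, ny
            do i = 1, nx
              ! Loop-carried dependence along j: a(i, j, k) needs a(i, j-1, k),
              ! so the j loop is serial; only i and k offer parallelism
              a(i, j, k) = a(i, j, k) - c(i, j, k) * a(i, j-1, k)
            end do
          end do
        end do
        end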

  8. Partitioning Choices (Transpose) [Figure: local sweeps along x and z → transpose → local sweep along y → transpose back]

  9. Partitioning Choices (block + CGP) • Partial wavefront-type parallelism [Figure: block partition across Processors 0–3]

  10. Partitioning Choices (multipartitioning) • Full parallelism for sweeping along any partitioned dimension [Figure: multipartitioned tiles across Processors 0–3]

  11. NAS SP & BT Benchmarks • NAS SP & BT benchmarks from NASA Ames • use ADI to solve the Navier-Stokes equation in 3D • forward & backward line sweeps on each dimension, for each time step • SP solves scalar penta-diagonal systems • BT solves block-tridiagonal systems • SP has double communication volume and frequency

  12. Experimental Setup • 2 versions from NASA, each written in Fortran 77 • parallel MPI hand-coded version • sequential version (3500 lines) • dHPF input: sequential version + HPF directives (including MULTI; 2% increase in line count) • Inlined several procedures manually: • enables dHPF to overlap local computation with communication without interprocedural tiling • Platform: SGI Origin 2000 (128 procs. at 250 MHz), SGI’s MPI implementation, SGI’s compilers

  13. Performance Comparison Compare four versions of NAS SP & BT • Multipartitioned MPI hand-coded version from NASA • different executables for each number of processors • Multipartitioned dHPF-generated version • single executable for all numbers of processors • Block-partitioned dHPF-generated version (coarse-grain pipelining over a 2D partition) • single executable for all numbers of processors • Block-partitioned version compiled with PGI’s pghpf from PGI’s HPF source code (full transpose over a 1D partition) • single executable for all numbers of processors

  14. Efficiency for NAS SP (102³ ‘B’ size) [Chart annotations: similar comm. volume, more serialization; > 2x multipartitioning comm. volume]

  15. Efficiency for NAS BT (102³ ‘B’ size) [Chart annotation: > 2x multipartitioning comm. volume]

  16. Overview • Introduction • Line Sweep Computations • Performance Comparison • Optimization Evaluation • Partially Replicated Computation • Interprocedural Communication Elimination • Communication Coalescing • Direct Access Buffers • Conclusions

  17. Evaluation Methodology • All versions are dHPF-generated using multipartitioning • Turn off a particular optimization (“n - 1” approach) • determine overhead without it (% over fully optimized) • Measure its contribution to overall performance • total execution time • total communication volume • L2 data cache misses (where appropriate) • Class A (64³) and class B (102³) problem sizes on two different processor counts (16 & 64 processors)

  18. Partially Replicated Computation • Partial computation replication is used to reduce communication [Figure: SHADOW a(2, 2) on each processor; computation partitionings ON_HOME a(i-2, j) ∪ ON_HOME a(i+2, j) ∪ ON_HOME a(i, j-2) ∪ ON_HOME a(i-1, j+1) ∪ ON_HOME a(i, j), and ON_EXT_HOME a(i, j)]

  19. Impact of Partial Replication • BT: eliminate comm. for 5D arrays fjac and njac in lhs<xyz> • Both: eliminate comm. for six 3D arrays in compute_rhs

  20. Impact of Partial Replication (cont.)

  21. Interprocedural Communication Reduction Extensions to the HPF/JA directives: • REFLECT: placement of near-neighbor communication • LOCAL: communication not needed within a scope • extended ON_HOME: partial computation replication • With these, the compiler does not need full interprocedural communication and availability analyses to determine whether data in overlap regions & comm. buffers is fresh

  22. Interprocedural Communication Reduction (cont.) [Figure: with SHADOW a(2, 1), REFLECT (a(0:0, 1:0), a(1:0, 0:0)) refreshes only the shadow sections received from the top and left neighbors, while REFLECT (a) refreshes the entire shadow region] • The combination of REFLECT, extended ON_HOME and LOCAL reduces communication volume by ~13%, resulting in a ~9% reduction in execution time
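
  A rough sketch of how these directives might appear in source; the directive spellings follow the slide, dHPF's accepted syntax may differ, and the stencil itself is illustrative.

        subroutine smooth(a, b, n)
        integer n
        real a(n, n), b(n, n)
  !HPF$ DISTRIBUTE a(BLOCK, BLOCK)
  !HPF$ ALIGN b(i, j) WITH a(i, j)
  !HPF$ SHADOW a(2, 1)
        integer i, j
  ! Refresh a's shadow region before the stencil below reads it; a section
  ! list such as REFLECT (a(0:0, 1:0), a(1:0, 0:0)) restricts the refresh
  ! to just the shadow sections that are actually needed (spelling per the slide)
  !HPF$ REFLECT (a)
        do j = 2, n
          do i = 3, n
            b(i, j) = a(i-2, j) + a(i, j-1)   ! ON_HOME b(i, j)
          end do
        end do
        end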

  23. Normalizing Communication
      do i = 1, n
        do j = 2, n - 2
          a(i, j) = a(i, j - 2)      ! ON_HOME a(i, j)
          a(i, j + 2) = a(i, j)      ! ON_HOME a(i, j + 2)
        enddo
      enddo
  • Same non-local data needed [Figure: on processors P0 and P1, the reference pairs a(i, j + 2) / a(i, j) and a(i, j) / a(i, j - 2) touch the same off-processor data]

  24. Coalescing Communication [Figure: separate messages for array A coalesced into a single message]

  25. Impact of Normalized Coalescing

  26. Impact of Normalized Coalescing • Key optimization for scalability

  27. Direct Access Buffers Choices for receiving complex coalesced messages • Unpack them into the shadow regions • two simultaneous live copies in cache • unpacking can be costly • uniform access to non-local & local data • Reference them directly out of the receive buffer • introduces two modes of access for data (non-local & interior) • overhead of having a single loop with these two modes is high • loops should be split into non-local & interior portions, according to the data they reference
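
  A hypothetical sketch of the loop-splitting idea; names such as recv_buf and the tile bounds are illustrative, not dHPF's generated code. Interior iterations touch only locally owned data, while the peeled boundary iterations read the coalesced receive buffer directly instead of unpacking it into the shadow region.

        ! Sketch only: (ilo:ihi, jlo:jhi) is this processor's tile and recv_buf
        ! holds the neighbor's boundary column of b (hypothetical names)
        subroutine update_tile(a, b, recv_buf, ilo, ihi, jlo, jhi)
        integer ilo, ihi, jlo, jhi
        real a(ilo:ihi, jlo:jhi), b(ilo:ihi, jlo:jhi), recv_buf(ilo:ihi)
        integer i, j
        ! Interior iterations: every operand is locally owned
        do j = jlo + 1, jhi
          do i = ilo, ihi
            a(i, j) = a(i, j) + b(i, j - 1)
          end do
        end do
        ! Peeled boundary iterations: the non-local column b(:, jlo-1) stays in
        ! the receive buffer and is referenced there directly (no unpack)
        do i = ilo, ihi
          a(i, jlo) = a(i, jlo) + recv_buf(i)
        end do
        end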

  28. Impact of Direct Access Buffers • Use direct access buffers for the main swept arrays • Direct access buffers + loop splitting reduces L2 data cache misses by ~11%, resulting in a reduction of ~11% in execution time

  29. Conclusions • Compiler-generated code can match the performance of sophisticated hand-coded parallelizations • High performance comes from the aggregate benefit of multiple optimizations • Everything affects scalability: good parallel algorithms are only the starting point; excellent resource utilization on the target machine is also needed • Data-parallel compilers should target every potential source of inefficiency in the generated code if they are to deliver the performance scientific users demand

  30. Efficiency for NAS SP (‘A’)

  31. Efficiency for NAS BT (‘A’)

  32. Data Partitioning

  33. Data Partitioning (cont.)

  34. Partially Replicated Computation
      do i = 1, n
        do j = 2, n
          a(i,j) = u(i,j-1) + 1.0        ! ON_HOME a(i,j) ∪ ON_HOME a(i,j+1)
          b(i,j) = u(i,j-1) + a(i,j-1)   ! ON_HOME a(i,j)
        enddo
      enddo
  [Figure: processors p and p + 1, each holding local portions of A, U and B plus shadow regions; arrows mark the replicated computation and the communication]

  35. Using HPF/JA for Comm. Elimination

  36. Using HPF/JA for Comm. Elimination

  37. Normalized Comm. Coalescing (cont.)
      do timestep = 1, T
        do j = 1, n
          do i = 3, n
            a(i, j) = a(i + 1, j) + b(i - 1, j)        ! ON_HOME a(i, j)
          enddo
        enddo
        do j = 1, n
          do i = 1, n - 2
            a(i + 2, j) = a(i + 3, j) + b(i + 1, j)    ! ON_HOME a(i + 2, j)
          enddo
        enddo
        do j = 1, n
          do i = 1, n - 1
            a(i + 1, j) = a(i + 2, j) + b(i + 1, j)    ! ON_HOME b(i + 1, j)
          enddo
        enddo
      enddo
  [Annotation: coalesce communication at this point]

  38. Impact of Direct Access Buffers

  39. Impact of Direct Access Buffers

  40. Direct Access Buffers [Figure: Processor 0 → Processor 1: pack, send, receive & unpack]

  41. Direct Access Buffers [Figure: Processor 0 → Processor 1: pack, send & receive, then use the buffer directly]
