
PGAS Language Update


Presentation Transcript


  1. PGAS Language Update
     Kathy Yelick

  2. PGAS Languages: Why use 2 Programming Models when 1 will do?
     • Global address space: a thread may directly read/write remote data
     • Partitioned: data is designated as local or global
     [Figure: partitioned global address space across threads p0..pn, showing a shared array A[0..p-1] spread over the threads plus per-thread private (l:) and global (g:) pointers]
     • On distributed memory:
       • Remote read/write is one-sided communication; you never have to say "receive"
       • As scalable as MPI (not cache-coherent shared memory)!
     • On shared memory these accesses are just loads and stores
     • Permits sharing, whereas MPI rules it out!
     • UPC, Titanium, and Co-Array Fortran are PGAS languages
       • A version of Co-Array Fortran is now in the Fortran spec
       • UPC has multiple compilers (Cray, Berkeley, gcc, HP, ...)
     (A minimal UPC sketch of this model follows below.)
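
To make the model concrete, here is a minimal UPC sketch (not taken from the slides): a shared array is distributed across threads, each thread writes the element it has affinity to, and thread 0 then reads every element, including remote ones, with ordinary array syntax. The array name and contents are illustrative.

    #include <stdio.h>
    #include <upc.h>

    /* One element of A has affinity to each thread: A[i] lives on thread i. */
    shared int A[THREADS];

    int main(void)
    {
        int i;

        /* Each thread writes the element it owns (a purely local store). */
        A[MYTHREAD] = MYTHREAD + 1;
        upc_barrier;

        /* Thread 0 reads every element, including remote ones; on distributed
           memory each remote read becomes a one-sided get, with no matching
           "receive" posted by the owning thread. */
        if (MYTHREAD == 0)
            for (i = 0; i < THREADS; i++)
                printf("A[%d] = %d\n", i, A[i]);
        return 0;
    }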

  3. Progress
     • Working on Titanium release
       • Various improvements, e.g., new Java 4.0 libraries
       • Focus on InfiniBand clusters, shared-memory machines, and, if possible, BG/P and Cray XT
     • UPC Land Evolution benchmark
     • Generalization of collectives to teams
       • The previous UPC and Titanium languages are pure SPMD (no Ocean + Atmos + Land in parallel without this)
     • Work toward Titanium on XT and BG/P
       • XT requires "registration" for remote memory access
       • Executable size / lack of shared-library support is a problem
     • Multicore using SPMD / PGAS style (not UPC)
       • Autotuning of stencil operators (from the Green Flash project)

  4. Landscape Evolution
     • How to evolve landscapes over time?
       • Erosion, rainfall, river incision
     • Want to parallelize this for memory-footprint reasons
     • Starting point:
       • Serial code, recursive algorithm
       • A series of tiles done separately
       • "Seams" are visible in the output
     • The parallel algorithm uses the global address space to remove the seams (a sketch of the idea follows below)
     [Figure: example elevation grid values]
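
The slides show no code for this, but a minimal UPC sketch of the seam-removal idea might look like the following, assuming a row-distributed shared grid and a simple placeholder smoothing update in place of the real erosion model; all names, sizes, and the update rule are illustrative.

    #include <upc.h>

    #define NROWS 1024              /* illustrative grid size */
    #define NCOLS 1024

    /* One shared elevation grid for the whole domain, one row per block, so
       rows are dealt round-robin to threads.  (Declared without THREADS in
       the dimensions, so this form assumes a compile-time thread count.) */
    shared [NCOLS] double elev[NROWS][NCOLS];
    shared [NCOLS] double next[NROWS][NCOLS];

    /* Placeholder smoothing sweep standing in for the real erosion update. */
    void relax_sweep(void)
    {
        int i, j;

        /* upc_forall assigns each row's iterations to the thread owning it. */
        upc_forall (i = 1; i < NROWS - 1; i++; &elev[i][0]) {
            for (j = 1; j < NCOLS - 1; j++) {
                /* elev[i-1][j] and elev[i+1][j] usually live on other threads;
                   those reads become one-sided gets, so there is no tile
                   boundary, and hence no "seam", anywhere in the algorithm. */
                next[i][j] = 0.25 * (elev[i - 1][j] + elev[i + 1][j] +
                                     elev[i][j - 1] + elev[i][j + 1]);
            }
        }
        upc_barrier;   /* make the sweep visible before swapping/reusing */
    }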

  5. Optimizing PGAS Code (Case Study)
     • The original implementation was very inefficient
       • Remote variables, upc_locks, non-local queues
     • The PPW performance tool from UFL was helpful
       • UPC functions comprised > 60% of the runtime
     • Speedup is inherently limited by topology
       • The small test problem has limited theoretical parallelism (7.1); it achieves 6.1 after the optimizations, compared to 2.8 before
     [Figure: runtime breakdown before/after optimization: get, wait, lock, global heap, other]
     (A sketch of the lock-to-local-queue style of rewrite is shown below.)
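
The case study's actual code is not in the slides; the sketch below only illustrates the general pattern being described, replacing a lock-protected, non-local queue with per-thread queues that are drained in bulk. The queue layout, sizes, and function names are all invented for illustration.

    #include <upc.h>

    #define MAXQ 4096               /* illustrative per-thread queue capacity */

    /* Costly pattern named on the slide: every enqueue grabs a global
       upc_lock and touches a remote shared queue, so lock/get/put time
       dominates the profile.  Sketched alternative: each thread appends to a
       queue that has affinity to itself, and remote traffic happens only when
       the queues are drained in bulk. */
    shared [MAXQ] int queue[THREADS][MAXQ];   /* row t lives on thread t */
    shared int       qlen[THREADS];

    void enqueue_local(int item)
    {
        /* Purely local work: no upc_lock, no remote reference. */
        int n = qlen[MYTHREAD];
        if (n < MAXQ) {
            queue[MYTHREAD][n] = item;
            qlen[MYTHREAD] = n + 1;
        }
    }

    /* Later, thread 0 drains every queue with one bulk one-sided get per
       thread instead of many fine-grained locked accesses. */
    void drain_all(int *dst /* local buffer of THREADS * MAXQ ints */)
    {
        int t;

        upc_barrier;                /* all enqueues are finished */
        if (MYTHREAD == 0)
            for (t = 0; t < THREADS; t++)
                upc_memget(dst + t * MAXQ, &queue[t][0],
                           qlen[t] * sizeof(int));
    }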

  6. Progress on BG/P (Intrepid), IB (Ranger), Cray XT (Franklin et al.)
     [Figure: XT4 (Franklin): GASNet broadcast outperforms MPI]
     [Figure: BG/P: 3D FFT performance]
     • Aside: UPC, CAF, Titanium, and Chapel on all NSF and DOE machines run on top of GASNet, including Cray's!

  7. PGAS on Multicore
     • PGAS offers a single programming model for distributed and shared memory
       • Partitioning for distributed memory (and multi-socket NUMA nodes)
       • Ability to save memory footprint within a multicore or multi-socket shared-memory node
     • New work in the UPC project:
       • Processes with shared-memory support
       • Interoperability with MPI
       • Sometimes this performs better than threads
       • Autotuned shared-memory collectives (reductions on Ranger shown; a reduction sketch follows below)
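
As a small example of the kind of collective involved, here is a hedged UPC sketch of a sum reduction using the standard upc_all_reduceD collective from <upc_collective.h>; the data layout and sizes are illustrative, and whether the underlying implementation is autotuned depends on the runtime, not on this code.

    #include <stdio.h>
    #include <upc.h>
    #include <upc_collective.h>

    #define N_PER_THREAD 1024           /* illustrative */

    /* Block-distributed source data: N_PER_THREAD elements per thread. */
    shared [N_PER_THREAD] double vals[THREADS * N_PER_THREAD];
    shared double sum;                   /* result lands on thread 0 */

    int main(void)
    {
        int i;

        /* Each thread initializes the elements it has affinity to. */
        upc_forall (i = 0; i < THREADS * N_PER_THREAD; i++; &vals[i])
            vals[i] = 1.0;

        /* Standard UPC collective reduction; a shared-memory-aware runtime
           can pick an algorithm tuned for the node.  UPC_IN_ALLSYNC makes the
           collective synchronize on entry, so no extra barrier is needed. */
        upc_all_reduceD(&sum, vals, UPC_ADD,
                        THREADS * N_PER_THREAD, N_PER_THREAD,
                        NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

        if (MYTHREAD == 0)
            printf("sum = %g (expected %d)\n", sum, THREADS * N_PER_THREAD);
        return 0;
    }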

  8. What’s a stencil?
     • Nearest-neighbor computations on structured grids (1D ... ND arrays)
     • Stencils from PDEs are often a weighted linear combination of neighboring values
       • There are cases where the weights vary in space/time
       • A stencil can also result in a table lookup
       • Stencils can be nonlinear operators
     • Caveat: we only examine implementations like Jacobi's method (i.e., separate read and write arrays); a minimal kernel is sketched below
     [Figure: 7-point stencil touching (i,j,k) and its six neighbors (i±1,j,k), (i,j±1,k), (i,j,k±1)]
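
A minimal C sketch of such a kernel: the 7-point Laplacian written Jacobi-style with separate read and write arrays. Grid sizes, weights, and names are illustrative, not taken from the slides.

    #include <stddef.h>

    /* 7-point Jacobi-style Laplacian sweep on an NX x NY x NZ grid with a
       one-cell halo; reads come from `in`, writes go to `out` (separate
       arrays, as in Jacobi's method). */
    #define NX 128
    #define NY 128
    #define NZ 128
    #define IDX(i, j, k) (((size_t)(i) * NY + (j)) * NZ + (k))

    void laplacian_sweep(const double *in, double *out)
    {
        const double c0 = -6.0, c1 = 1.0;   /* standard 7-point weights */
        int i, j, k;

        for (i = 1; i < NX - 1; i++)
            for (j = 1; j < NY - 1; j++)
                for (k = 1; k < NZ - 1; k++)
                    out[IDX(i, j, k)] =
                          c0 * in[IDX(i, j, k)]
                        + c1 * (in[IDX(i + 1, j, k)] + in[IDX(i - 1, j, k)]
                              + in[IDX(i, j + 1, k)] + in[IDX(i, j - 1, k)]
                              + in[IDX(i, j, k + 1)] + in[IDX(i, j, k - 1)]);
    }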

  9. Strategy Engine: Auto-tuning Optimizations
     • The Strategy Engine explores a number of auto-tuning optimizations:
       • Loop unrolling / register blocking
       • Cache blocking
       • Constant propagation / common-subexpression elimination
     • Future work:
       • Cache bypass (e.g., movntpd)
       • Software prefetching
       • SIMD intrinsics
       • Data-structure transformations
     (A cache-blocked variant of the stencil sketch above is shown below.)
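
To illustrate one of the listed transformations, here is a hedged cache-blocked variant of the 7-point sweep sketched earlier (it reuses the NX/NY/NZ and IDX definitions from that sketch). The block sizes are placeholder values of exactly the kind an auto-tuner would search over; register blocking or unrolling would further rewrite the innermost loop, and none of this is the Strategy Engine's actual generated code.

    /* Cache-blocked variant of laplacian_sweep(); BJ and BK are tuning
       parameters chosen here only as placeholders. */
    #define BJ 16
    #define BK 64
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    void laplacian_sweep_blocked(const double *in, double *out)
    {
        const double c0 = -6.0, c1 = 1.0;
        int i, j, k, jj, kk;

        /* Tile the j and k dimensions so each tile's working set stays in cache. */
        for (jj = 1; jj < NY - 1; jj += BJ)
            for (kk = 1; kk < NZ - 1; kk += BK)
                for (i = 1; i < NX - 1; i++)
                    for (j = jj; j < MIN(jj + BJ, NY - 1); j++)
                        for (k = kk; k < MIN(kk + BK, NZ - 1); k++)
                            out[IDX(i, j, k)] =
                                  c0 * in[IDX(i, j, k)]
                                + c1 * (in[IDX(i + 1, j, k)] + in[IDX(i - 1, j, k)]
                                      + in[IDX(i, j + 1, k)] + in[IDX(i, j - 1, k)]
                                      + in[IDX(i, j, k + 1)] + in[IDX(i, j, k - 1)]);
    }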

  10. Laplacian Performance
      • On the memory-bound architecture (Barcelona), auto-parallelization doesn't make a difference
      • Auto-tuning enables scalability
      • Barcelona is bandwidth-proportionally faster than the XT4
      • Nehalem is ~2.5x faster than Barcelona and 4x faster than the XT4
      • Auto-parallelization plus tuning significantly outperforms OpenMP
      [Figure: Laplacian performance and OpenMP comparison; series show serial reference, auto-parallelization, auto-NUMA, and auto-tuning]

  11. Possible Paths Forward
      • Progress on PGAS languages will continue
      • Demonstration of PGAS for climate
        • Continue the UPC landscape work (independent)
        • Resurrect Co-Array Fortran POP on XT4/5
        • Autotune CCSM kernels: CRM (to be added)
      • More possibilities:
        • Halo updates in the CG solver in POP: how much time is spent in this?
        • IE at larger scale
        • Scalability of Ice
        • Do lookup tables in UPC (across a node); a sketch follows below
          • 150-200 MB currently; might get larger
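
The slides give no code for the lookup-table idea, so the following is only a hedged UPC sketch of one way to keep a single copy of a large table in the global address space instead of replicating it per process; the table name, size, and placement (one block, on one thread) are illustrative, and a real version might instead distribute the table or place one copy per node.

    #include <stddef.h>
    #include <upc.h>

    #define TABLE_ENTRIES (20 * 1000 * 1000)   /* ~160 MB of doubles, illustrative */

    /* One copy of the table in shared space instead of one private copy per
       process.  With an indefinite block size the whole table has affinity to
       a single thread; with a shared-memory UPC runtime, other threads on the
       same node read it with plain loads, and remote threads with one-sided
       gets, so no replica is ever needed. */
    shared [] double *table;

    void init_table(void)
    {
        size_t i;

        /* Collective allocation: every thread receives the same pointer. */
        table = (shared [] double *)upc_all_alloc(1, TABLE_ENTRIES * sizeof(double));

        if (MYTHREAD == 0)
            for (i = 0; i < TABLE_ENTRIES; i++)
                table[i] = (double)i;          /* placeholder contents */
        upc_barrier;
    }

    /* Any thread can look up an entry without holding its own copy. */
    double lookup(size_t idx)
    {
        return table[idx];
    }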

  12. Priority List
      • POP halo exchange
        • POP on > 4K processors did not work; sent a parameter fix for Cray Portals
        • The CG solver can benefit from overlap (a one-sided halo-update sketch follows below)
      • LANL released POP (tripole grid?)
      • CICE solver
        • Could benefit current IE runs
        • Also limited by halo updates
      • Lookup table and hybrid programming
        • Pick up the CSU student code
        • One-sided MPI messaging
        • Improving OpenMP code performance
      • Autotuning for CRM
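
The POP code itself is not shown in the slides; the following is only a hedged UPC sketch of the overlapped, one-sided halo-update pattern the bullet refers to, using a 1-D decomposition, placeholder compute stubs, and illustrative sizes.

    #include <upc.h>

    #define NLOC 256                 /* illustrative local rows per thread */
    #define NCOL 256

    /* Each thread's slab plus one ghost row above and below, kept in shared
       space so neighbors can deposit halos with one-sided transfers. */
    shared [(NLOC + 2) * NCOL] double slab[THREADS][NLOC + 2][NCOL];

    /* Placeholder compute stubs standing in for the real solver work. */
    static void compute_interior(void) { /* rows 2 .. NLOC-1: no ghosts needed */ }
    static void compute_boundary(void) { /* rows 1 and NLOC: use fresh ghosts  */ }

    void halo_update_overlapped(void)
    {
        int up   = (MYTHREAD + THREADS - 1) % THREADS;
        int down = (MYTHREAD + 1) % THREADS;

        /* 1. Push my boundary rows into the neighbors' ghost rows with
              one-sided shared-to-shared copies; nobody posts a receive. */
        upc_memcpy(&slab[up][NLOC + 1][0], &slab[MYTHREAD][1][0],
                   NCOL * sizeof(double));
        upc_memcpy(&slab[down][0][0], &slab[MYTHREAD][NLOC][0],
                   NCOL * sizeof(double));

        /* 2. Overlap: update the interior rows, which need no ghost data,
              while the runtime may still be moving the halos. */
        compute_interior();

        /* 3. Synchronize, then update the boundary rows that consume the
              freshly arrived ghost values. */
        upc_barrier;
        compute_boundary();
    }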
