
Presentation Transcript


  1. The RAJA Programming Approach: Toward Architecture Portability for Large Multiphysics Applications Salishan Conference on High-Speed Computing • April 21-24, 2014 • Rich Hornung LLNL-PRES-653032

  2. Hardware trends in the “Extreme-scale Era” will disrupt S/W maintainability & architecture portability • Managing increased hardware complexity and diversity… • Multi-level memory • High-bandwidth (e.g., stacked) memory on-package • High-capacity (e.g., NVRAM) main memory • Deeper cache hierarchies (user managed?) • Heterogeneous processor hierarchy, changing core count configurations • Latency-optimized (e.g., fat cores) • Throughput-optimized (e.g., GPUs, MIC) • Increased importance of vectorization / SIMD • 2 – 8 wide (double precision) on many architectures and growing, 32 wide on GPUs • As # cores/chip increases, cache coherence across the full chip may not exist • …requires pervasive, disruptive, architecture-specific software changes • Data-specific changes • Data structure transformations (e.g., Struct of Arrays vs. Array of Structs; a brief illustration follows this slide) • Need to insert directives and intrinsics (e.g., restrict and align annotations) on individual loops • Algorithm-specific changes • Loop inversion, strip mining, loop fusion, etc. • Which loops and which directives (e.g., OpenMP, OpenACC) may be architecture-specific
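  To make the data-structure point concrete, here is a brief generic illustration (not taken from the presentation) of the Array-of-Structs vs. Struct-of-Arrays layouts mentioned above; the type names are placeholders.

    #include <vector>

    // Array of Structs (AoS): the fields of each node are interleaved in memory.
    struct NodeAoS { double x, y, z; };
    std::vector<NodeAoS> nodes_aos;           // access: nodes_aos[i].x

    // Struct of Arrays (SoA): each field is stored contiguously,
    // which is generally friendlier to SIMD/vector loads.
    struct NodesSoA { std::vector<double> x, y, z; };
    NodesSoA nodes_soa;                       // access: nodes_soa.x[i]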

  3. Applications must enable implementation flexibility without excessive developer disruption • Architecture/Performance portability (our definition): map application memory and functional requirements to memory systems and functional units on a range of architectures, while maintaining a consistent programming style and control of all aspects of execution • The problem is acute for large multi-physics codes • O(10^5) – O(10^6) lines of code; O(10K) loops • Mini-apps do not capture the scale of code changes needed • Manageable portable performance requires separating platform-specific data management and execution concerns from numerical algorithms in our applications… …with code changes that are intuitive to developers

  4. “RAJA” is a potential path forward for our multiphysics applications at LLNL • We need algorithms & programming styles that: • Can express various forms of parallelism • Enable high performance portably • Can be explored in our codes incrementally • There is no clear “best choice” for future PM/language. • RAJA is based on standard C++ (which we already rely on) • It supports constructs & extends concepts used heavily in LLNL codes • It can be added to codes incrementally & used selectively • It allows various PMs “under the covers”; it does not wed a code to a particular technology • It is lightweight and offers developers customizable implementation choices

  5. RAJA centralizes platform-specific code changes by decoupling loop body and traversal

  C-style for-loop:

    double* x ; double* y ;
    double a ;
    // …
    for (int i = begin; i < end; ++i) {
       y[ i ] += a * x[ i ] ;
    }

  RAJA-style loop:

    Real_ptr x ; Real_ptr y ;
    Real_type a ;
    // …
    forall< exec_policy >( IndexSet, [&] (Index_type i) {
       y[ i ] += a * x[ i ] ;
    });

  • Data type encapsulation hides non-portable compiler directives, intrinsics, etc. (not required by RAJA, but a good idea in general) • Traversal templates encapsulate platform-specific scheduling & execution (a sketch of such a template follows this slide) • Index sets encapsulate loop iteration patterns & data placement • C++ lambda functions enable decoupling (this is essential for us!) Important: the loop body is the same. Transformations can be adopted incrementally, and each part can be customized for specific code needs.
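  To illustrate the point about traversal templates, here is a minimal sketch of how an execution-policy-templated forall might be structured. The policy tags (seq_exec, omp_parallel_exec) and the exact signatures are assumptions for illustration, not the actual RAJA API.

    // Policy tags select an implementation at compile time.
    struct seq_exec {};            // sequential execution
    struct omp_parallel_exec {};   // OpenMP parallel-for execution

    typedef int Index_type;

    // Sequential specialization: a plain for-loop.
    template <typename LOOP_BODY>
    void forall_impl(seq_exec, Index_type begin, Index_type end, LOOP_BODY loop_body)
    {
       for (Index_type i = begin; i < end; ++i) loop_body(i) ;
    }

    // OpenMP specialization: same loop body, different scheduling.
    template <typename LOOP_BODY>
    void forall_impl(omp_parallel_exec, Index_type begin, Index_type end, LOOP_BODY loop_body)
    {
       #pragma omp parallel for
       for (Index_type i = begin; i < end; ++i) loop_body(i) ;
    }

    // Front end used in application code; the loop body never changes.
    template <typename EXEC_POLICY, typename LOOP_BODY>
    void forall(Index_type begin, Index_type end, LOOP_BODY loop_body)
    {
       forall_impl(EXEC_POLICY(), begin, end, loop_body) ;
    }

  Application code would then read, e.g., forall<omp_parallel_exec>(0, n, [&](Index_type i) { y[i] += a * x[i]; }); switching execution policies changes only the template argument.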

  6. RAJA index sets enable tuning operations for segments of loop iteration space It is common to define arrays of indices to process; e.g., nodes, elements with a given material, etc.

    int elems[] = {0, 1, 2, 3, 4, 5, 6, 7, 14, 27, 36, 40, 41, 42, 43, 44, 45, 46, 47, 87, 117};

  Create a “Hybrid” index set containing work segments:

    HybridISet segments = createHybridISet( elems, nelems ) ;
    // Range segment 0…7, unstructured segment {14, 27, 36}, range segment 40…47, unstructured segment {87, 117}

  The traversal method dispatches segments according to the execution policy:

    forall< exec_policy >( segments, loop_body ) ;

    // Range segment traversal (stride-1, no indirection):
    for (int i = begin; i < end; ++i) { loop_body( i ) ; }

    // Unstructured segment traversal (through the index array):
    for (int i = 0; i < seg_len; ++i) { loop_body( segment[i] ) ; }

  • Segments can be tailored to architecture features (e.g., SIMD hardware units) • createHybridISet() methods coordinate runtime & compile-time optimizations (a sketch of one such method follows this slide) • Platform-specific header files contain tailored traversal definitions
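  As a rough illustration of what a createHybridISet() method might do, the following sketch partitions an index array into stride-1 range segments and unstructured list segments. The segment types, the length threshold, and the separation of ranges and lists into two containers are assumptions chosen for brevity; this is not the RAJA implementation.

    #include <vector>

    struct RangeSegment { int begin, end; };              // half-open range [begin, end)
    struct ListSegment  { std::vector<int> indices; };    // explicit index list

    struct HybridISet {
       std::vector<RangeSegment> ranges;   // a real index set would also preserve segment order
       std::vector<ListSegment>  lists;
    };

    HybridISet createHybridISet(const int* elems, int nelems, int min_range_len = 4)
    {
       HybridISet iset;
       std::vector<int> pending;                                    // indices awaiting a list segment
       int i = 0;
       while (i < nelems) {
          int j = i + 1;
          while (j < nelems && elems[j] == elems[j-1] + 1) ++j;     // extend a contiguous run
          if (j - i >= min_range_len) {
             if (!pending.empty()) {                                // flush short runs seen so far
                iset.lists.push_back(ListSegment{pending});
                pending.clear();
             }
             iset.ranges.push_back(RangeSegment{elems[i], elems[j-1] + 1});   // stride-1 range
          } else {
             pending.insert(pending.end(), elems + i, elems + j);   // too short: keep as raw indices
          }
          i = j;
       }
       if (!pending.empty()) iset.lists.push_back(ListSegment{pending});
       return iset;
    }

  Applied to the elems array above, this would produce the range segments 0…7 and 40…47 and the unstructured segments {14, 27, 36} and {87, 117}.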

  7. We have studied RAJA in LULESH extensively – results suggest the benefits of converting codes to this model… Code compiled with icpc v14.0.106 and run on one Intel E5-2670 node (TLCC2)

  8. LULESH RAJA thread performance on Intel E5-2670 node Understanding RAJA overhead at small thread counts requires more analysis.

  9. LULESH RAJA OpenMP performance on BG/Q node

  10. LULESH RAJA OpenMP performance on BG/Q node

  11. We are exploring RAJA in ASC codes; e.g., Ares • Define RAJA index sets in routines where domain extents and indirection arrays are set up, in a single code location:

    domain->index_sets->real_zones = ... ;

  • Then, replace loops with calls to API routines that translate to RAJA mechanics (one way such a routine might forward to RAJA is sketched after this slide). Loops like these…

    for (int j = domain->jmin; j < domain->jmax; ++j) {
       for (int i = domain->imin; i < domain->imax; ++i) {
          int zone = i + j * domain->jp;
          // use “zone” as array index
       }
    }

    for (int ii = 0; ii < domain->numRealZones; ++ii) {
       int zone = domain->Zones[ii];
       // use “zone” as array index
    }

  …become this:

    forEachRealZone<exec_policy>(domain, [=](int zone) {
       // use “zone” as array index
    });
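  A minimal sketch (assumed for illustration, not taken from the Ares source) of how a code-specific routine like forEachRealZone could be a thin wrapper over the RAJA traversal, using the index set defined at setup time:

    template <typename EXEC_POLICY, typename LOOP_BODY>
    void forEachRealZone(Domain_t* domain, LOOP_BODY loop_body)
    {
       // Dispatch the real-zone index set with the requested execution policy;
       // the caller supplies only the loop body (Domain_t is the Ares domain type above).
       RAJA::forall<EXEC_POLICY>( *domain->index_sets->real_zones, loop_body ) ;
    }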

  12. Preliminary Ares performance study • Sedov problem • Lagrangian hydro only • 3D, 192^3 = 7,077,888 zones • Run 25 cycles • Time run with “date \n run \n date” in the input file • Evaluate strong scaling with different MPI-OpenMP combinations • “Coarse-grained” threading: thread domain loops • “Fine-grained” threading: thread numerical kernel loops • Note: this problem is not representative of a real problem in terms of computational work and exercises a small portion of the code

  13. Original code on BG/Q using MPI + domain loop threads (coarse-grained) Real problems run here, typically. Number of domains is N x M, where N = # MPI tasks, M = # threads/task

  14. Fine-grained loop threading via RAJA shows speedup over original code using only MPI RAJA version: N MPI tasks x M threads, N x M = 64 Original version: N MPI tasks In either case, N domains

  15. Performance of RAJA version w/ fine-grained threading is comparable to original w/ coarse-grained threading RAJA version: N MPI tasks x M threads, N domains Original version: N MPI tasks x M threads, N x M domains In either case, N x M = 64

  16. What have we learned about RAJA usage in Ares? • Basic integration is straightforward • Hard work is localized • Setting up & manipulating index sets • Defining platform-specific execution policies for loop classes • Converting loops is easy, but tedious • Replace loop header with call to iteration template • Identify loop type (i.e., execution pattern) • Determine whether loop can and should be parallelized • Are other changes needed; e.g., variable scope, thread safety? • What is appropriate execution policy? Platform-specific? • Encapsulation of looping constructs also benefits software consistency, readability, and maintainability

  17. What have we learned about RAJA performance in Ares? • The RAJA version is sometimes faster, sometimes slower. • We have converted only 421 Lagrange hydro loops (327 DPstream, 83 DPwork, 11 Seq). Threading too aggressive? • Other code transformations can expose additional parallelism opportunities and enable compiler optimizations (e.g., SIMD). • We need to overcome the RAJA serial performance hit. • Compiler optimization for template/lambda constructs? • Need better inlining? • Once RAJA is in place, exploration of data layout and execution choices to improve performance is straightforward and centralized (we have done some of this already)

  18. We can’t do this with s/w engineering alone – we need help from compilers too • We have identified specific compiler deficiencies, developed concrete recommendations and evidence of feasibility, and engaged compiler teams • We also created LCALS to study & monitor the issues • A suite of loops implemented with various s/w constructs (Livermore Loops modernized and expanded – this time in C++) • Very useful for dialogue with compiler vendors • Generate test cases for vendors showing optimization/support issues • Try vendor solutions and report findings • Introduce & motivate encapsulation concepts not on vendors’ RADAR • Track version-to-version compiler performance • Available at https://codesign.llnl.gov “A Case for Improved C++ Compiler Support to Enable Performance Portability in Large Physics Simulation Codes”, Rich Hornung and Jeff Keasler, LLNL-TR-653681. (https://codesign.llnl.gov/codesign-papers-presentations.php)

  19. We have worked with Intel to resolve compiler SIMD optimization issues for the C++ template/lambda abstraction layer All runs on an Intel E5-2670 node, compiled with icpc: -O3 -mavx -inline-max-total-size=10000 -inline-forceinline -ansi-alias -std=c++0x

  20. Illustrative RAJA use cases

  21. Use case 1: hybrid index sets enable optimizations when using “indirection” arrays Many multi-physics applications use indirection arrays to traverse unstructured meshes, to access elements with a given material, etc. Since indirection arrays are (re)defined at runtime, many compile-time optimizations cannot be applied (e.g., SIMD vectorization). ALE3D example: a large problem with 10 materials, 512 domains, 16+ million zones (many multi-material); most work is done in “long” stride-1 index ranges. Hybrid index sets can help recover some of the lost performance by exposing traversals of stride-1 ranges without indirection (see the traversal sketch after this slide).
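  To show why exposing stride-1 ranges matters, here is a minimal traversal sketch over the assumed segment types from the earlier createHybridISet() sketch: range segments become plain stride-1 loops the compiler can vectorize, while list segments fall back to indirect accesses.

    template <typename LOOP_BODY>
    void traverse(const HybridISet& iset, LOOP_BODY loop_body)
    {
       // Stride-1 ranges: no indirection, so SIMD vectorization is possible.
       for (int s = 0; s < (int)iset.ranges.size(); ++s) {
          const RangeSegment& rs = iset.ranges[s];
          for (int i = rs.begin; i < rs.end; ++i) loop_body( i ) ;
       }
       // Unstructured segments: gather indices through the index array.
       for (int s = 0; s < (int)iset.lists.size(); ++s) {
          const std::vector<int>& idx = iset.lists[s].indices;
          for (int k = 0; k < (int)idx.size(); ++k) loop_body( idx[k] ) ;
       }
    }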

  22. Use case 2: encapsulate fine-grain fault recovery • No impact on source code & recovery cost is commensurate with the scope of the fault. • It requires: idempotence, an O/S that can signal processor faults, try/catch that can process O/S signals, etc. • These requirements should not be showstoppers!

    template <typename LB>
    void forall(int begin, int end, LB loop_body)
    {
       bool done = false ;
       while (!done) {
          try {
             done = true ;
             for (int i = begin; i < end; ++i) loop_body(i) ;
          }
          catch (Transient_fault) {
             cache_invalidate() ;    // discard possibly-corrupted cached data
             done = false ;          // re-execute the loop (idempotence makes this safe)
          }
       }
    }

  23. Use case 3: simplify loop logic This code uses integer arrays for both control logic and indirection:

    void domixupvar( Domain_t* domain, MixedZone_t* mixzon, double *var, ... )
    {
       for ( int iz = 0 ; iz < nmixz ; iz++ ) {
          if ( mixzon->doflag[ iz ] == 1 ) {
             for ( int i = 0 ; i < domain->numlocalmats ; i++ ) {
                RegMixedZone_t& mzreg = ...;
                if ( mzreg.ndxput[ iz ] >= 0 ) {
                   // ...
                   var[ mzreg.ndxput[ iz ] ] = mzreg.airnew[ iz ];
                   // etc ...

  24. Use case 3: simplify loop logic • Encoding the conditional logic in RAJA index sets simplifies the code and removes two levels of nesting (good for developers and compilers!): • Serial speedup: 1.6x on Intel Sandy Bridge, 1.9x on IBM BG/Q (g++) • Aside: compiling the original code with g++ 4.7.2 (needed for lambdas) gives a 1.99x speedup over XLC. So, going from the original with XLC to RAJA with g++ yields a 3.78x total performance increase.

    void domixupvar( Domain_t* domain, MixedZone_t* mixzon, double *var, ... )
    {
       for (int i = 0; i < domain->numlocalmats; i++) {
          int ir = domain->localmats[i] ;
          RAJA::forall<exec_policy>( *mixzon->reg[ir].ndxput_is, [&] (int iz) {
             // ...
             var[ mzreg.ndxput[ iz ] ] = mzreg.airnew[ iz ];
             // etc ...

  25. Use case 4: reorder loop iterations to enable parallelism [Figure: two colorings of the same zone layout, labeled Option A and Option B; each zone is tagged with its color/group number] • A common operation in staggered-mesh codes sums values to nodes from surrounding zones; i.e., nodal_val[ node ] += zonal_val[ zone ] • Index set segments can be used to define independent groups of computation (colors); a sketch of Option A follows this slide • Option A (~8x speedup w/ 16 threads): • Iterate over groups sequentially (group 1 completes, then group 2, etc.) • Operations within a group execute in parallel • Option B (~17% speedup over Option A): • Zones in a group (row) are processed sequentially • Iterate over the rows of each color in parallel • Note: no source code change is needed to switch between iteration / parallel execution patterns.
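  A minimal sketch of the “Option A” pattern, assuming zones have already been grouped into colors such that no two zones in the same color write to the same nodes; the container types and the 8-nodes-per-zone connectivity are assumptions for illustration.

    #include <vector>

    void sum_to_nodes(const std::vector< std::vector<int> >& color_groups,  // zone ids, grouped by color
                      const int* zone_nodes,       // 8 node ids per zone, zone-major
                      const double* zonal_val,
                      double* nodal_val)
    {
       // Colors execute one after another ...
       for (int c = 0; c < (int)color_groups.size(); ++c) {
          const std::vector<int>& zones = color_groups[c];
          // ... while zones within a color execute in parallel; the coloring
          // guarantees that threads in this loop never update the same node.
          #pragma omp parallel for
          for (int k = 0; k < (int)zones.size(); ++k) {
             int zone = zones[k];
             for (int n = 0; n < 8; ++n) {
                nodal_val[ zone_nodes[8*zone + n] ] += zonal_val[zone];
             }
          }
       }
    }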

  26. Use case 5: HMC memory addressing • A Hybrid Memory Cube has four quadrants. Latency is lowered (by as much as 50%) if access stays within a quadrant. • Addressing rotates quickly through vaults, 4 vaults/quadrant. • Quadrant access is striped in a way that requires non-contiguous array allocation. [Figure: byte assignment of memory space to quadrants (e.g., an array “block” within a quadrant holds 64 doubles)]

  27. Use case 5: HMC memory addressing A specialized traversal template can be written for the HMC to keep memory access within a quadrant; e.g.,

    #define QUADRANT_SIZE   64
    #define QUADRANT_MASK   63
    #define QUADRANT_STRIDE (QUADRANT_SIZE * 4)

    template <typename LOOP_BODY>
    void forall(int begin, int end, LOOP_BODY loop_body)
    {
       int beginQuad   = begin / QUADRANT_SIZE ;
       int endQuad     = (end - 1) / QUADRANT_SIZE ;
       int beginOffset = (beginQuad * QUADRANT_STRIDE) + (begin & QUADRANT_MASK) ;
       int endOffset   = (endQuad * QUADRANT_STRIDE) + ((end - 1) & QUADRANT_MASK) + 1 ;

       do { /* do at most QUADRANT_SIZE iterations per quadrant block */
          int blockEnd = (beginQuad * QUADRANT_STRIDE) + QUADRANT_SIZE ;
          int stop = (beginQuad == endQuad) ? endOffset : blockEnd ;
          for (int ii = beginOffset; ii < stop; ++ii) {
             loop_body(ii) ;
          }
          beginOffset = (beginQuad + 1) * QUADRANT_STRIDE ;  /* next block starts at its base offset */
       } while (beginQuad++ != endQuad) ;
    }

  28. Conclusions • RAJA can encapsulate platform-specific implementation concerns in a large code. • Insertion is not hard, but can be tedious. (Many loops, but few patterns) • We are working with the ROSE team to see what can be done automatically. • What are the benefits? • Application code can be simpler – easier to read, write, and maintain. • Developers can customize the model for code-specific constructs. • Centralized loop-level execution control. Code can be parameterized to run efficiently on different platforms. • What are the concerns? • Improving performance requires detailed analysis and other code changes (not unique to RAJA, and ROSE can help with this…) • Extending the model to other PMs and architectures (we’re optimistic) • Managing “memory spaces” on future architectures (a little help, anybody?) • We can’t “do it all” via s/w engineering. We also need help from compilers, PMs, O/S, language features, etc. (all have portability issues!)

  29. Acknowledgements • Jeff Keasler, my collaborator on RAJA development • Esteban Pauli, “guinea pig” for trying out RAJA in Ares

  30. The end.
