
Heterogeneous Computing in Charm++



  1. Heterogeneous Computing in Charm++ — David Kunzman

  2. Motivations
  • Performance and Popularity of Accelerators
    • Our work currently focuses on Cell (and Larrabee)
  • Difficult to program accelerators
    • Architecture specific code (not portable)
    • Many asynchronous events (data movement, multiple cores)
  • Heterogeneous Clusters Exist Already
    • Roadrunner at LANL (Opterons and Cells)
    • Lincoln at NCSA (Xeons and GPUs)
    • MariCel at BSC (Powers and Cells)

  3. Goals
  • Portability of code
    • Code should be portable between systems with and without accelerators
    • Across homogeneous and heterogeneous clusters
  • Reduce programmer effort
    • Allow various pieces of code to be written independently
    • Pieces of code share the accelerator(s)
    • Scheduled by the runtime system automatically
  • Naturally extend the existing Charm++ model
    • Same programming model for all hosts and accelerators

  4. Approach
  • Make entry methods portable between host and accelerator cores
    • Allows the programmer to write entry method code once and use the same code for all cores
    • Still make use of architecture/core specific features
  • Take advantage of the clear communication boundaries in Charm++
    • Almost all data is encapsulated within chare objects
    • Data is passed between chare objects by invoking entry methods

  5. Extending Charm++
  • SIMD Instruction Abstraction
    • To reach any significant fraction of peak, must use SIMD instructions on modern cores
    • Abstract SIMD instructions so code is portable
  • Accelerated Entry Methods
    • May execute on accelerators
    • Essentially a standard entry method split into two stages
      • Function body (accelerator or host; limited)
      • Callback function (host; not limited)

  6. SIMD Instruction Abstraction
  • Abstract SIMD instructions supported by multiple architectures
  • Currently adding support for: SSE (x86), AltiVec/VMX (PowerPC; PPE), SIMD instructions on SPEs, and Larrabee
  • Generic C implementation when no direct architectural support is present
  • Types: vecf, veclf, veci, ...
  • Operations: vaddf, vmulf, vsqrtf, ...
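The generic fallback mentioned above can be pictured roughly as follows. This is a minimal sketch, assuming a 4-element vector width and a struct-of-floats layout; only the names vecf, vaddf, vmulf, vsqrtf, and vecf_numElems come from the slides, and the real implementation maps these names onto SSE, AltiVec/VMX, SPE, or Larrabee intrinsics when they are available.

    /* Hypothetical generic C fallback for the single-precision SIMD type.
       The 4-wide layout and the helper bodies are assumptions for illustration. */
    #include <math.h>

    #define vecf_numElems 4

    typedef struct { float f[vecf_numElems]; } vecf;

    static inline vecf vaddf(vecf a, vecf b) {        /* element-wise add */
      vecf r;
      for (int i = 0; i < vecf_numElems; ++i) r.f[i] = a.f[i] + b.f[i];
      return r;
    }

    static inline vecf vmulf(vecf a, vecf b) {        /* element-wise multiply */
      vecf r;
      for (int i = 0; i < vecf_numElems; ++i) r.f[i] = a.f[i] * b.f[i];
      return r;
    }

    static inline vecf vsqrtf(vecf a) {               /* element-wise square root */
      vecf r;
      for (int i = 0; i < vecf_numElems; ++i) r.f[i] = sqrtf(a.f[i]);
      return r;
    }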

  7. Example Entry Method

    entry void accum(int inArrayLen, float inArray[inArrayLen]) {
      if (inArrayLen != localArrayLen) return;
      for (int i = 0; i < inArrayLen; ++i)
        localArray[i] = localArray[i] + inArray[i];
    };

  To Invoke:
    myChareObj.accum(someFloatArray_len, someFloatArray_ptr);

  8. Example Entry Method w/ SIMD

    entry void accum(int inArrayLen,
                     align(sizeof(vecf)) float inArray[inArrayLen]) {
      if (inArrayLen != localArrayLen) return;
      vecf *inArrayVec = (vecf*)inArray;
      vecf *localArrayVec = (vecf*)localArray;
      int arrayVecLen = inArrayLen / vecf_numElems;
      for (int i = 0; i < arrayVecLen; ++i)
        localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
      for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
        localArray[i] = localArray[i] + inArray[i];
    };

  To Invoke:
    myChareObj.accum(someFloatArray_len, someFloatArray_ptr);

  9. Accel Entry Method Structure

  Standard:
    Interface File:
      entry void entryName( …passed parameters… );
    Source File:
      void ChareClass::entryName( …passed parameters… ) {
        … function body …
      }

  vs.

  Accelerated:
    Interface File:
      entry [accel] void entryName( …passed parameters… )
                                  [ …local parameters… ]
                                  { … function body … } callback_member_function;

  Invocation (both):
    chareObj.entryName( …passed parameters… );

  10. Example Accelerated Entry Method

    entry [accel] void accum(int inArrayLen,
                             align(sizeof(vecf)) float inArray[inArrayLen])
                            [ readOnly  : int localArrayLen <impl_obj->localArrayLen>,
                              readWrite : float localArray[localArrayLen] <impl_obj->localArray> ]
    {
      if (inArrayLen != localArrayLen) return;
      vecf *inArrayVec = (vecf*)inArray;
      vecf *localArrayVec = (vecf*)localArray;
      int arrayVecLen = inArrayLen / vecf_numElems;
      for (int i = 0; i < arrayVecLen; ++i)
        localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
      for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
        localArray[i] = localArray[i] + inArray[i];
    } accum_callback;

  To Invoke:
    myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
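After the accelerated body completes, the runtime invokes accum_callback on the host. A minimal sketch of what that member function could look like is shown below; the empty signature and the commented-out follow-up call are assumptions for illustration, since the slides only state that the callback runs on the host and is not restricted.

    // Hypothetical host-side callback for the accelerated accum entry method.
    // Runs on the host after the accelerated body finishes, so it may do
    // anything a normal entry method can (send messages, start new work, ...).
    void ChareClass::accum_callback() {
      // e.g., tell a (hypothetical) manager chare that this accumulation is done:
      // managerProxy.accumDone(thisIndex);
    }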

  11. Timeline of Events
  • Runtime system…
    • Directs data movement (messages & DMAs)
    • Schedules accelerated entry methods and callbacks

  12. Communication Overlap • Data movement automatically overlapped with accelerated entry method execution on SPEs and entry method execution on PPE

  13. Handling Host Core Differences
  • Automatic modification of application data at communication boundaries
  • Structure of data is known via parameters and Pack-UnPack (PUP) routines
  • During packing process, add information on how the data is encoded
  • During unpacking, if needed, modify data to match local architecture
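The structural knowledge comes from ordinary Charm++ PUP routines. As a rough illustration, the sketch below shows the kind of PUP routine that tells the runtime which members are ints, floats, or arrays so the packed data can be re-encoded for the receiving core; the Patch class and its members are hypothetical (and its .ci declaration is omitted), while PUP::er, PUParray, and isUnpacking are standard Charm++.

    // Hypothetical chare whose PUP routine exposes its data layout to the runtime.
    class Patch : public CBase_Patch {
      int numParticles;
      float *posX, *posY, *posZ;
     public:
      void pup(PUP::er &p) {
        p | numParticles;
        if (p.isUnpacking()) {            // allocate storage on the receiving side
          posX = new float[numParticles];
          posY = new float[numParticles];
          posZ = new float[numParticles];
        }
        PUParray(p, posX, numParticles);  // each array is declared as floats, so the
        PUParray(p, posY, numParticles);  // packed data can be tagged and converted
        PUParray(p, posZ, numParticles);  // if the receiving host's format differs
      }
    };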

  14. Molecular Dynamics (MD) Code
  • Based on object interaction seen in NAMD’s nonbonded electrostatic force computation (simplified)
    • Coulomb’s Law
    • Single precision floating-point
  • Particles evenly divided between patch objects
    • ~92K particles in 144 patches (similar to ApoA1 benchmark)
  • Compute objects (self and pairwise) compute forces for patch objects
  • Patches integrate combined force data and update particle positions
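For concreteness, the per-pair computation each compute object performs is essentially Coulomb's law, F = k·q1·q2/r². The scalar sketch below, with hypothetical names and a commonly used value of the Coulomb constant, shows that inner step; the actual code operates on whole patches of particles and is written with the vecf SIMD abstraction.

    #include <math.h>

    // Hypothetical particle layout; the real code uses patch-sized arrays.
    struct Particle { float x, y, z, q; };

    // Coulomb constant in kcal·Å/(mol·e²); single precision as in the MD code.
    static const float COULOMB_K = 332.0636f;

    // Accumulate the Coulomb force exerted on particle a by particle b.
    static void addPairForce(const Particle &a, const Particle &b, float f[3]) {
      float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
      float r2   = dx*dx + dy*dy + dz*dz;
      float rInv = 1.0f / sqrtf(r2);
      // F = k * qa * qb / r^2, directed along the separation vector (hence r^-3 * d)
      float s = COULOMB_K * a.q * b.q * rInv * rInv * rInv;
      f[0] += s * dx;  f[1] += s * dy;  f[2] += s * dz;
    }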

  15. MD Code Results
  • Executing on 2 Xeon cores, 8 PPEs, and 56 SPEs
    • 3 ISAs, 3 SIMD instruction extensions, and 2 memory structures
  • Better scaling is achieved when Xeons are present
  • 331.1 GFlop/s (19.82% of peak; serial code limited to 27.7% of peak on one SPE, assuming that SPE has an infinite local store)

  16. Visualizing MD Code Execution

  17. Summary
  • Support for accelerators and heterogeneous execution in Charm++
    • Programming model and runtime system changes
    • Accelerated entry methods
    • SIMD instruction abstraction
    • Automatic modification of application data
    • Visualization support
  • Support
    • Currently supports Cell
    • Adding support for Larrabee
    • Clusters where host cores have different architectures

  18. Future Work
  • Dynamic measurement based load balancing on heterogeneous systems
  • Increase support for more accelerators
    • In the process of adding support for Larrabee
  • Increasing support for existing abstractions and/or developing new abstractions

  19. Questions
