
Heterogeneous Computing in Charm++



  1. Heterogeneous Computing in Charm++ — David Kunzman

  2. Motivations
  • Performance and Popularity of Accelerators
    • Our work currently focuses on Cell (and Larrabee)
  • Difficult to program accelerators
    • Architecture specific code (not portable)
    • Many asynchronous events (data movement, multiple cores)
  • Heterogeneous Clusters Exist Already
    • Roadrunner at LANL (Opterons and Cells)
    • Lincoln at NCSA (Xeons and GPUs)
    • MariCel at BSC (Powers and Cells)

  3. Goals
  • Portability of code
    • Code should be portable between systems with and without accelerators
    • Across homogeneous and heterogeneous clusters
  • Reduce programmer effort
    • Allow various pieces of code to be written independently
    • Pieces of code share the accelerator(s)
    • Scheduled by the runtime system automatically
  • Naturally extend the existing Charm++ model
    • Same programming model for all hosts and accelerators

  4. Approach
  • Make entry methods portable between host and accelerator cores
    • Allows the programmer to write entry method code once and use the same code for all cores
    • Still make use of architecture/core specific features
  • Take advantage of the clear communication boundaries in Charm++
    • Almost all data is encapsulated within chare objects
    • Data is passed between chare objects by invoking entry methods

  5. Extending Charm++
  • SIMD Instruction Abstraction
    • To reach any significant fraction of peak, must use SIMD instructions on modern cores
    • Abstract SIMD instructions so code is portable
  • Accelerated Entry Methods
    • May execute on accelerators
    • Essentially a standard entry method split into two stages
      • Function body (accelerator or host; limited)
      • Callback function (host; not limited)

  6. SIMD Instruction Abstraction
  • Abstract SIMD instructions supported by multiple architectures
  • Currently adding support for: SSE (x86), AltiVec/VMX (PowerPC; PPE), SIMD instructions on SPEs, and Larrabee
  • Generic C implementation when no direct architectural support is present
  • Types: vecf, veclf, veci, ...
  • Operations: vaddf, vmulf, vsqrtf, ...
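The generic fallback mentioned above can be pictured roughly as follows. This is a minimal sketch, assuming a 4-element vector width and a struct-of-floats layout; only the names vecf, vaddf, vmulf, vsqrtf, and vecf_numElems come from the slides, and the real implementation maps these names onto SSE, AltiVec/VMX, SPE, or Larrabee intrinsics when they are available.

    /* Hypothetical generic C fallback for the single-precision SIMD type.
       The 4-wide layout and the helper bodies are assumptions for illustration. */
    #include <math.h>

    #define vecf_numElems 4

    typedef struct { float f[vecf_numElems]; } vecf;

    static inline vecf vaddf(vecf a, vecf b) {        /* element-wise add */
      vecf r;
      for (int i = 0; i < vecf_numElems; ++i) r.f[i] = a.f[i] + b.f[i];
      return r;
    }

    static inline vecf vmulf(vecf a, vecf b) {        /* element-wise multiply */
      vecf r;
      for (int i = 0; i < vecf_numElems; ++i) r.f[i] = a.f[i] * b.f[i];
      return r;
    }

    static inline vecf vsqrtf(vecf a) {               /* element-wise square root */
      vecf r;
      for (int i = 0; i < vecf_numElems; ++i) r.f[i] = sqrtf(a.f[i]);
      return r;
    }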

  7. Example Entry Method

    entry void accum(int inArrayLen, float inArray[inArrayLen]) {
      if (inArrayLen != localArrayLen) return;
      for (int i = 0; i < inArrayLen; ++i)
        localArray[i] = localArray[i] + inArray[i];
    };

  To Invoke:
    myChareObj.accum(someFloatArray_len, someFloatArray_ptr);

  8. Example Entry Method w/ SIMD

    entry void accum(int inArrayLen,
                     align(sizeof(vecf)) float inArray[inArrayLen]) {
      if (inArrayLen != localArrayLen) return;
      vecf *inArrayVec = (vecf*)inArray;
      vecf *localArrayVec = (vecf*)localArray;
      int arrayVecLen = inArrayLen / vecf_numElems;
      for (int i = 0; i < arrayVecLen; ++i)
        localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
      for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
        localArray[i] = localArray[i] + inArray[i];
    };

  To Invoke:
    myChareObj.accum(someFloatArray_len, someFloatArray_ptr);

  9. Accel Entry Method Structure

  Standard:
    Interface File:
      entry void entryName( …passed parameters… );
    Source File:
      void ChareClass::entryName( …passed parameters… ) {
        … function body …
      }

  vs.

  Accelerated:
    Interface File:
      entry [accel] void entryName( …passed parameters… )
                                  [ …local parameters… ]
                                  { … function body … } callback_member_function;

  Invocation (both):
    chareObj.entryName( …passed parameters… );

  10. Example Accelerated Entry Method

    entry [accel] void accum(int inArrayLen,
                             align(sizeof(vecf)) float inArray[inArrayLen])
                            [ readOnly  : int localArrayLen <impl_obj->localArrayLen>,
                              readWrite : float localArray[localArrayLen] <impl_obj->localArray> ]
    {
      if (inArrayLen != localArrayLen) return;
      vecf *inArrayVec = (vecf*)inArray;
      vecf *localArrayVec = (vecf*)localArray;
      int arrayVecLen = inArrayLen / vecf_numElems;
      for (int i = 0; i < arrayVecLen; ++i)
        localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
      for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
        localArray[i] = localArray[i] + inArray[i];
    } accum_callback;

  To Invoke:
    myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
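After the accelerated body completes, the runtime invokes accum_callback on the host. A minimal sketch of what that member function could look like is shown below; the empty signature and the commented-out follow-up call are assumptions for illustration, since the slides only state that the callback runs on the host and is not restricted.

    // Hypothetical host-side callback for the accelerated accum entry method.
    // Runs on the host after the accelerated body finishes, so it may do
    // anything a normal entry method can (send messages, start new work, ...).
    void ChareClass::accum_callback() {
      // e.g., tell a (hypothetical) manager chare that this accumulation is done:
      // managerProxy.accumDone(thisIndex);
    }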

  11. Timeline of Events
  • Runtime system…
    • Directs data movement (messages & DMAs)
    • Schedules accelerated entry methods and callbacks

  12. Communication Overlap • Data movement automatically overlapped with accelerated entry method execution on SPEs and entry method execution on PPE

  13. Handling Host Core Differences
  • Automatic modification of application data at communication boundaries
  • Structure of data is known via parameters and Pack-UnPack (PUP) routines
  • During packing process, add information on how the data is encoded
  • During unpacking, if needed, modify data to match local architecture
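The structural knowledge comes from ordinary Charm++ PUP routines. As a rough illustration, the sketch below shows the kind of PUP routine that tells the runtime which members are ints, floats, or arrays so the packed data can be re-encoded for the receiving core; the Patch class and its members are hypothetical (and its .ci declaration is omitted), while PUP::er, PUParray, and isUnpacking are standard Charm++.

    // Hypothetical chare whose PUP routine exposes its data layout to the runtime.
    class Patch : public CBase_Patch {
      int numParticles;
      float *posX, *posY, *posZ;
     public:
      void pup(PUP::er &p) {
        p | numParticles;
        if (p.isUnpacking()) {            // allocate storage on the receiving side
          posX = new float[numParticles];
          posY = new float[numParticles];
          posZ = new float[numParticles];
        }
        PUParray(p, posX, numParticles);  // each array is declared as floats, so the
        PUParray(p, posY, numParticles);  // packed data can be tagged and converted
        PUParray(p, posZ, numParticles);  // if the receiving host's format differs
      }
    };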

  14. Molecular Dynamics (MD) Code
  • Based on object interaction seen in NAMD’s nonbonded electrostatic force computation (simplified)
    • Coulomb’s Law
    • Single precision floating-point
  • Particles evenly divided between patch objects
    • ~92K particles in 144 patches (similar to ApoA1 benchmark)
  • Compute objects (self and pairwise) compute forces for patch objects
  • Patches integrate combined force data and update particle positions
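For concreteness, the per-pair computation each compute object performs is essentially Coulomb's law, F = k·q1·q2/r². The scalar sketch below, with hypothetical names and a commonly used value of the Coulomb constant, shows that inner step; the actual code operates on whole patches of particles and is written with the vecf SIMD abstraction.

    #include <math.h>

    // Hypothetical particle layout; the real code uses patch-sized arrays.
    struct Particle { float x, y, z, q; };

    // Coulomb constant in kcal·Å/(mol·e²); single precision as in the MD code.
    static const float COULOMB_K = 332.0636f;

    // Accumulate the Coulomb force exerted on particle a by particle b.
    static void addPairForce(const Particle &a, const Particle &b, float f[3]) {
      float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
      float r2   = dx*dx + dy*dy + dz*dz;
      float rInv = 1.0f / sqrtf(r2);
      // F = k * qa * qb / r^2, directed along the separation vector (hence r^-3 * d)
      float s = COULOMB_K * a.q * b.q * rInv * rInv * rInv;
      f[0] += s * dx;  f[1] += s * dy;  f[2] += s * dz;
    }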

  15. MD Code Results
  • Executing on 2 Xeon cores, 8 PPEs, and 56 SPEs
    • 3 ISAs, 3 SIMD instruction extensions, and 2 memory structures
  • Better scaling is achieved when Xeons are present
  • 331.1 GFlop/s (19.82% of peak; serial code limited to 27.7% of peak on one SPE, assuming that SPE has an infinite local store)

  16. Visualizing MD Code Execution

  17. Summary
  • Support for accelerators and heterogeneous execution in Charm++
    • Programming model and runtime system changes
    • Accelerated entry methods
    • SIMD instruction abstraction
    • Automatic modification of application data
    • Visualization support
  • Support
    • Currently supports Cell
    • Adding support for Larrabee
    • Clusters where host cores have different architectures

  18. Future Work
  • Dynamic measurement based load balancing on heterogeneous systems
  • Increase support for more accelerators
    • In the process of adding support for Larrabee
  • Increasing support for existing abstractions and/or developing new abstractions

  19. Questions
