
Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks

Laxmikant Kale, Parallel Programming Laboratory, Dept. of Computer Science, University of Illinois at Urbana-Champaign. http://charm.cs.uiuc.edu





Presentation Transcript


  1. Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks. Laxmikant Kale, http://charm.cs.uiuc.edu, Parallel Programming Laboratory, Dept. of Computer Science, University of Illinois at Urbana-Champaign

  2. Outline
  • Challenges and opportunities: character of the new machines
  • Charm++ and AMPI
    • Basics
    • Capabilities, programming techniques
  • Dice them fine: VPs to the rescue
  • Juggling for overlap
  • Load balancing: scenarios and strategies
  • Case studies
    • Classical molecular dynamics
    • Car-Parrinello ab initio MD
    • Quantum chemistry
    • Rocket simulation
  • Raising the level of abstraction
    • Higher-level, compiler-supported notations
    • Domain-specific “frameworks”
    • Example: unstructured mesh (FEM) framework

  3. Machines: current, planned and future
  • Current:
    • Lemieux: 3000 processors, 750 nodes, full-bandwidth fat-tree network
    • ASCI Q: similar architecture
    • System X: Infiniband
    • Tungsten: Myrinet
    • Thunder
    • Earth Simulator
  • Planned:
    • IBM’s Blue Gene/L: 65k nodes, 3D-torus topology
    • Red Storm (10k procs)
  • Future?
    • BG/L is an example: 1M processors! 0.5 MB per processor
    • HPCS: 3 architectural plans

  4. Some Trends: Communication • Bisection bandwidth: • Can’t scale with the number of processors without becoming expensive • Wire-length delays • Even on Lemieux, messages going through the highest-level switches take longer • Two possibilities: • Grid topologies with near-neighbor connections • High link speed, low bisection bandwidth • Expensive, full-bandwidth networks

  5. Trends: Memory • Memory latency is roughly 100 times the processor cycle time • This will get worse • One solution: put in more processors • To increase the aggregate bandwidth between processors and memory • On-chip DRAM • In other words: a low memory-to-processor ratio • But this can be handled with programming style • Application viewpoint, for physical modeling: • Given a fixed amount of run time (4 hours or 10 days) • Doubling spatial resolution increases CPU needs more than 2-fold (smaller time steps)

  6. Application Complexity is Increasing • Why? • With more FLOPS, we need better algorithms; it is not enough to just do more of the same • Example: dendritic growth in materials • Better algorithms lead to complex structure • Example: gravitational force calculation • Direct all-pairs: O(N²), but easy to parallelize • Barnes-Hut: O(N log N), but more complex • Multiple modules, dual time-stepping • Adaptive and dynamic refinements • Ambitious projects • Projects with new objectives lead to dynamic behavior and multiple components

  7. Specific Programming Challenges • Explicit management of resources • This data on that processor • This work on that processor • Analogy: memory management • We declare arrays, and malloc dynamic memory chunks as needed • Do not specify memory addresses • As usual, Indirection is the key • Programmer: • This data, partitioned into these pieces • This work divided that way • System: map data and work to processors

  8. Virtualization: Object-based Parallelization • Idea: divide the computation into a large number of objects • Let the system map objects to processors • The user is only concerned with the interaction between objects • [Figure: user view of interacting objects vs. the system implementation mapping them to processors]

  9. Virtualization: Charm++ and AMPI • These systems seek an optimal division of labor between the “system” and the programmer: • Decomposition is done by the programmer; everything else is automated • [Figure: spectrum from abstraction to specialization, showing how much of decomposition, mapping, scheduling, and expression is automated in HPF, Charm++, and MPI]

  10. Charm++ and Adaptive MPI
  • Charm++: parallel C++
    • Asynchronous methods
    • Object arrays
    • In development for over a decade
    • Basis of several parallel applications
    • Runs on all popular parallel machines and clusters
  • AMPI: a migration path for legacy MPI codes
    • Gives them the dynamic load-balancing capabilities of Charm++
    • Uses Charm++ object arrays
    • Minimal modifications needed to convert existing MPI programs
    • Automated via AMPizer (collaboration with David Padua)
    • Bindings for C, C++, and Fortran 90
  • Both available from http://charm.cs.uiuc.edu

  11. The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE • Built on parallel objects, an adaptive runtime system, and libraries and tools • Applications: protein folding, quantum chemistry (QM/MM), molecular dynamics, computational cosmology, crack propagation, dendritic growth, space-time meshes, rocket simulation

  12. Message From This Talk • Virtualization is ready and powerful enough to meet the needs of tomorrow’s applications and machines • The virtualization techniques we have been exploring for the past decade are mature enough for high-end parallel computing and for complex, dynamic applications • These techniques are embodied in: • Charm++ • AMPI • Frameworks (structured grids, unstructured grids, particles) • Virtualization of other coordination languages (UPC, GA, ...)

  13. Acknowledgements • Graduate students including: Gengbin Zheng, Orion Lawlor, Milind Bhandarkar, Terry Wilmarth, Sameer Kumar, Jay deSouza, Chao Huang, Chee Wai Lee • Recent funding: NSF (NGS: Frederica Darema), DOE (ASCI: Rocket Center), NIH (Molecular Dynamics)

  14. Charm++: Object Arrays • A collection of data-driven objects (aka chares), • with a single global name for the collection, and • each member addressed by an index • Mapping of element objects to processors is handled by the system • [Figure: user’s view of array elements A[0], A[1], A[2], A[3], ..., A[n]]

  15. Charm++: Object Arrays • A collection of chares, • with a single global name for the collection, and • each member addressed by an index • Mapping of element objects to processors is handled by the system • [Figure: the same user’s view, plus the system view, in which elements such as A[0] and A[3] reside on a particular processor]

  16. Chare Arrays • Elements are data-driven objects • Elements are indexed by a user-defined data type-- [sparse] 1D, 2D, 3D, tree, ... • Send messages to index, receive messages at element. Reductions and broadcasts across the array • Dynamic insertion, deletion, migration-- and everything still has to work!

  17. Charm++ Remote Method Calls
  • To call a method on a remote C++ object foo, use the local “proxy” C++ object CProxy_foo generated from the interface file:

    // Interface (.ci) file
    array[1D] foo {
      entry void foo(int problemNo);
      entry void bar(int x);
    };

    // In a .C file: CProxy_foo is the generated proxy class;
    // someFoo[i] names the i’th object, bar the method, 17 the parameter
    CProxy_foo someFoo = ...;
    someFoo[i].bar(17);

  • This results in a network message, and eventually in a call to the real object’s method:

    // In another .C file
    void foo::bar(int x) {
      ...
    }

  18. Charm++ Startup Process: Main
  • Interface (.ci) file:

    module myModule {
      array[1D] foo {
        entry foo(int problemNo);
        entry void bar(int x);
      }
      // Special startup object
      mainchare myMain {
        entry myMain(int argc, char **argv);
      }
    };

  • In a .C file:

    #include "myModule.decl.h"
    // CBase_myMain is the generated class; the constructor is called at startup
    class myMain : public CBase_myMain {
      myMain(int argc, char **argv) {
        int nElements = 7, i = nElements / 2;
        CProxy_foo f = CProxy_foo::ckNew(2, nElements);
        f[i].bar(3);
      }
    };
    #include "myModule.def.h"

  19. Other Features • Broadcasts and Reductions • Runtime creation and deletion • nD and sparse array indexing • Library support (“modules”) • Groups: per-processor objects • Node Groups: per-node objects • Priorities: control ordering

  20. AMPI: “Adaptive” MPI • MPI interface, for C and Fortran, implemented on Charm++ • Multiple “virtual processors” per physical processor • Implemented as user-level threads • Very fast context switching: about 1 µs • E.g., MPI_Recv blocks only the virtual processor, not the physical one • Supports migration (and hence load balancing) via extensions to MPI

  21. AMPI: • [Figure: the user’s view, 7 MPI processes]

  22. AMPI: the 7 MPI “processes” are implemented as virtual processors (user-level migratable threads) • [Figure: the 7 virtual processors distributed over the real processors]

  23. How to Write an AMPI Program • Write your normal MPI program, and then… • Link and run with Charm++ • Compile and link with charmc • charmc -o hello hello.c -language ampi • charmc -o hello2 hello.f90 -language ampif • Run with charmrun • charmrun hello
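  For concreteness, here is a minimal sketch of such a program: plain MPI code with no AMPI-specific calls, matching the hello.c named in the commands above (the message text is illustrative).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      /* Under AMPI, each "rank" is a virtual processor (a user-level thread) */
      printf("Hello from virtual processor %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }

  It is built with charmc -o hello hello.c -language ampi and launched with charmrun hello, exactly as listed on the slide.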

  24. How to Run an AMPI program • Charmrun • A portable parallel job execution script • Specify number of physical processors: +pN • Specify number of virtual MPI processes: +vpN • Special “nodelist” file for net-* versions
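  Combining these options with the build from the previous slide, a typical launch might look like the following (the processor and VP counts are illustrative):

    charmrun hello +p8 +vp32

  Here 32 virtual MPI processes are multiplexed onto 8 physical processors, which is what gives the runtime the freedom to migrate and balance them later.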

  25. AMPI MPI Extensions • Process Migration • Asynchronous Collectives • Checkpoint/Restart

  26. How to Migrate a Virtual Processor? • Move all application state to new processor • Stack Data • Subroutine variables and calls • Managed by compiler • Heap Data • Allocated with malloc/free • Managed by user • Global Variables

  27. Stack Data • The stack is used by the compiler to track function calls and provide temporary storage • Local Variables • Subroutine Parameters • C “alloca” storage • Most of the variables in a typical application are stack data

  28. Migrate Stack Data • Without compiler support, we cannot change the stack’s address • because we can’t update the stack’s interior pointers (return frame pointer, function arguments, etc.) • Solution: “isomalloc” addresses • Reserve address space on every processor for every thread stack • Use mmap to scatter stacks in virtual memory efficiently • The idea comes from PM2
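  A minimal sketch of the isomalloc idea, assuming a 64-bit POSIX system and an address range that happens to be unused on every node; the constants and the function name are illustrative, not Charm++’s actual implementation.

    #include <sys/mman.h>
    #include <cstddef>

    // Assumed-free region, carved into fixed-size slots so that thread i's
    // stack occupies the same virtual address on every processor.
    static const size_t SLOT_SIZE = 1 << 20;                  // 1 MB per stack
    static char *const  SLOT_BASE = (char *)0x300000000ULL;   // assumes 64-bit

    void *alloc_migratable_stack(int threadIndex) {
      void *addr = SLOT_BASE + (size_t)threadIndex * SLOT_SIZE;
      // MAP_FIXED pins the mapping to the chosen address, so interior pointers
      // in the stack stay valid when the thread migrates to another node.
      return mmap(addr, SLOT_SIZE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    }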

  29. Migrate Stack Data • [Figure: processor A’s and processor B’s address spaces (0x00000000–0xFFFFFFFF), each laid out as code, globals, heap, and a set of thread stacks (threads 1–4); thread 3’s stack migrates from processor A to processor B]

  30. Migrate Stack Data • [Figure: same layout as the previous slide, illustrating thread 3’s stack after the migration from processor A to processor B]

  31. Migrate Stack Data • Isomalloc is a completely automatic solution • No changes needed in application or compilers • Just like a software shared-memory system, but with proactive paging • But has a few limitations • Depends on having large quantities of virtual address space (best on 64-bit) • 32-bit machines can only have a few gigs of isomalloc stacks across the whole machine • Depends on unportable mmap • Which addresses are safe? (We must guess!) • What about Windows? Blue Gene?

  32. Heap Data • Heap data is any dynamically allocated data • C “malloc” and “free” • C++ “new” and “delete” • F90 “ALLOCATE” and “DEALLOCATE” • Arrays and linked data structures are almost always heap data

  33. Migrate Heap Data • Automatic solution: isomalloc all heap data just like stacks! • “-memory isomalloc” link option • Overrides malloc/free • No new application code needed • Same limitations as isomalloc • Manual solution: application moves its heap data • Need to be able to size message buffer, pack data into message, and unpack on other side • “pup” abstraction does all three
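  For the manual route, here is a sketch of what a pup routine might look like for a hypothetical class with heap data; the class and member names are illustrative, while the PUP::er interface shown is the standard Charm++ one.

    #include "pup.h"    // Charm++ PUP framework

    class Particles {
      int n;            // number of particles
      double *coords;   // heap array of 3*n coordinates
    public:
      void pup(PUP::er &p) {
        p | n;                        // the same call sizes, packs, or unpacks
        if (p.isUnpacking())          // on the destination, re-create the heap data
          coords = new double[3 * n];
        PUParray(p, coords, 3 * n);   // then move the array contents
      }
    };

  Because one routine covers sizing, packing, and unpacking, the same code path is reused for migration and for the checkpointing described later.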

  34. Comparison with Native MPI • Performance • Slightly worse without optimization; being improved • Flexibility • Useful when only a small number of PEs is available • Or when the algorithm places special requirements on the processor count • Problem setup: 3D stencil calculation of size 240³ run on Lemieux. AMPI runs on any number of PEs (e.g., 19, 33, 105), while native MPI needs a cube number.

  35. Benefits of Virtualization
  • Software engineering
    • Number of virtual processors can be independently controlled
    • Separate VPs for different modules
  • Message-driven execution
    • Adaptive overlap of communication
    • Modularity
    • Predictability
    • Automatic out-of-core
    • Asynchronous reductions
  • Dynamic mapping
    • Heterogeneous clusters: vacate, adjust to speed, share
    • Automatic checkpointing
    • Change the set of processors used
  • Principle of persistence
    • Enables runtime optimizations
    • Automatic dynamic load balancing
    • Communication optimizations
    • Other runtime optimizations
  • More info: http://charm.cs.uiuc.edu

  36. Data-driven execution • [Figure: each processor runs a scheduler that picks the next message from its message queue and delivers it to the corresponding object]

  37. Adaptive Overlap of Communication • With virtualization, you get data-driven execution • There are multiple entities (objects, threads) on each processor • No single object or thread holds up the processor • Each one is “continued” when its data arrives • No need to guess which is likely to arrive first • So: achieves automatic and adaptive overlap of computation and communication • This kind of data-driven idea can be used in MPI as well, • using wild-card receives • But as the program gets more complex, it gets harder to keep track of all pending communication in all the places that are doing a receive
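  A rough sketch of the wild-card-receive approximation mentioned above, in plain MPI; the tags, buffer size, and handler functions are hypothetical.

    #include <mpi.h>

    enum { TAG_MODULE_A = 1, TAG_MODULE_B = 2, BUF_LEN = 1024 };
    void handleA(const double *buf);   // hypothetical per-module handlers
    void handleB(const double *buf);

    // Process messages from two modules in whatever order they arrive,
    // instead of blocking on one module's receive while the other's data waits.
    void drainMessages(int expected) {
      double buf[BUF_LEN];
      MPI_Status st;
      while (expected-- > 0) {
        MPI_Recv(buf, BUF_LEN, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_MODULE_A) handleA(buf);
        else                            handleB(buf);
      }
    }

  As the slide notes, this bookkeeping has to be repeated at every receive site, which is exactly what the message-driven scheduler automates.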

  38. Why Message-Driven Modules? • [Figure: composing two modules in SPMD style vs. as message-driven modules] (From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr. 1994.)

  39. Checkpoint/Restart • Any long running application must be able to save its state • When you checkpoint an application, it uses the pup routine to store the state of all objects • State information is saved in a directory of your choosing • Restore also uses pup, so no additional application code is needed (pup is all you need)

  40. Checkpointing Job • In AMPI, use MPI_Checkpoint(<dir>); • Collective call; returns when checkpoint is complete • In Charm++, use CkCheckpoint(<dir>,<resume>); • Called on one processor; calls resume when checkpoint is complete • Restarting: • The charmrun option ++restart <dir> is used to restart • Number of processors need not be the same
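  A minimal sketch of periodic checkpointing from an AMPI time-stepping loop, using the MPI_Checkpoint extension named on the slide; the exact prototype may vary across AMPI versions, and the interval and directory here are illustrative.

    #include <mpi.h>

    void runTimesteps(int nSteps) {
      for (int step = 0; step < nSteps; ++step) {
        // ... compute and communicate for this step ...
        if (step > 0 && step % 100 == 0)
          MPI_Checkpoint("ckpt");   // collective: every rank calls it, then resumes
      }
    }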

  41. AMPI’s Collective Communication Support • Communication operation in which all or a large subset participate • For example broadcast • Performance impediment • All to all communication • All to all personalized communication (AAPC) • All to all multicast (AAM)

  42. Communication Optimization • Organize processors in a 2D (virtual) mesh • A message from (x1,y1) to (x2,y2) goes via (x1,y2) • Roughly 2√P messages per processor instead of P−1 • But each byte travels twice on the network
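  A minimal sketch of the routing rule, assuming P processors numbered row-major on a √P × √P virtual mesh; the function name is illustrative.

    // For a message from processor src at (x1,y1) to dst at (x2,y2),
    // return the intermediate processor at (x1,y2): same row as the
    // source, same column as the destination.
    int meshIntermediate(int src, int dst, int side /* = sqrt(P) */) {
      int srcRow = src / side;          // x1
      int dstCol = dst % side;          // y2
      return srcRow * side + dstCol;    // processor at (x1, y2)
    }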

  43. Radix Sort Performance Benchmark • [Figure: measured performance results] • A mystery?

  44. CPU Time vs. Elapsed Time • [Figure: time breakdown of an all-to-all operation using the mesh library] • Computation is only a small proportion of the elapsed time • A number of optimization techniques have been developed to improve collective communication performance

  45. Asynchronous Collectives • [Figure: time breakdown of a 2D FFT benchmark, in ms] • VPs are implemented as threads • Computation is overlapped with the waiting time of collective operations • Total completion time is reduced
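  The pattern being measured can be sketched with a nonblocking all-to-all. AMPI exposed asynchronous collectives as extensions before they were standardized; the sketch below uses the MPI-3 spelling MPI_Ialltoall, and the buffer sizes are illustrative.

    #include <mpi.h>
    #include <vector>

    // Start the transpose-style exchange, do unrelated work, and block only
    // when the received data is actually needed.
    void transposeWithOverlap(int countPerRank, int commSize) {
      std::vector<double> sendBuf(countPerRank * commSize);
      std::vector<double> recvBuf(countPerRank * commSize);
      MPI_Request req;
      MPI_Ialltoall(sendBuf.data(), countPerRank, MPI_DOUBLE,
                    recvBuf.data(), countPerRank, MPI_DOUBLE,
                    MPI_COMM_WORLD, &req);
      // ... computation that does not depend on recvBuf ...
      MPI_Wait(&req, MPI_STATUS_IGNORE);
    }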

  46. Shrink/Expand • Problem: the availability of the computing platform may change • Fit the application onto the platform by object migration • [Figure: time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600]

  47. Projections Performance Analysis Tool

  48. Projections • Projections is designed for use with a virtualized model like Charm++ or AMPI • Instrumentation built into runtime system • Post-mortem tool with highly detailed traces as well as summary formats • Java-based visualization tool for presenting performance information

  49. Trace Generation (Detailed) • Link-time option “-tracemode projections” • In log mode, each event is recorded in full detail (including a timestamp) in an internal buffer • The memory footprint is controlled by limiting the number of log entries • I/O perturbation can be reduced by increasing the number of log entries • Generates a <name>.<pe>.log file for each processor and a <name>.sts file for the entire application • Commonly used run-time options: +traceroot DIR, +logsize NUM
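  Putting these options together with the AMPI build shown earlier, a traced run might look like the following; the program name, processor count, and directory are illustrative.

    charmc -o hello hello.c -language ampi -tracemode projections
    charmrun hello +p8 +traceroot /tmp/traces +logsize 100000

  This produces a hello.<pe>.log file per processor plus hello.sts under /tmp/traces, which the Projections tool then loads.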

  50. Visualization Main Window
