620 likes | 751 Views
How Charm works its magic. Laxmikant Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Dept. of Computer Science University of Illinois at Urbana Champaign. System implementation. User View. Parallel Programming Environment. Charm++ and AMPI
E N D
How Charm works its magic Laxmikant Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Dept. of Computer Science University of Illinois at Urbana Champaign
System implementation User View Parallel Programming Environment • Charm++ and AMPI • Embody the idea of processor virtualization • Processor Virtualization • Divide the computation into a large number of pieces • Independent of number of processors • Typically larger than number of processors • Let the system map objects to processors
Charm++ Parallel C++ “Arrays” of Objects Automatic load balancing Prioritization Mature System Available on all parallel machines we know Several applications: Mol. Dynamics QM/MM Cosmology Materials/processes Operations Research AMPI = MPI + virtualization A migration path for MPI codes Automatic dynamic load balancing for MPI applications Uses Charm++ object arrays and migratable threads Bindings for C, C++, Fortran90 Porting MPI applications Minimal modifications needed Automated via AMPizer AMPI progress Ease of use: automatic packing Asynchronous communication Split-phase interfaces Charm++ and AMPI
Charm++ • Parallel C++ with Data Driven Objects • Object Arrays/ Object Collections • Object Groups: • Global object with a “representative” on each PE • Asynchronous method invocation • Prioritized scheduling • Mature, robust, portable • http://charm.cs.uiuc.edu
7 MPI processes AMPI:
7 MPI “processes” Real Processors AMPI: Implemented as virtual processors (user-level migratable threads)
Software Engineering Number of virtual processors can be independently controlled Separate VPs for different modules Message Driven Execution Adaptive overlap of communication Modularity Predictability: Automatic Out-of-core Asynchronous reductions Dynamic mapping Heterogeneous clusters: Vacate, adjust to speed, share Automatic checkpointing Change the set of processors used Principle of Persistence Enables Runtime Optimizations Automatic Dynamic Load Balancing Communication Optimizations Other Runtime Optimizations Benefits of Virtualization More: http://charm.cs.uiuc.edu We will illustrate: -- An application breakthrough -- Cluster Performance Optimization -- A Communication Optimization
Technology Demonstration • Recent breakthrough in Molecular Dynamics performance • NAMD, implemented using Charm++ • Demonstrates power of techniques, applicable to CSAR • Collection of charged atoms, with bonds • Thousands of atoms (10,000 - 500,000) • 1 femtosecond time-step, millions needed! • At each time-step • Bond forces • Non-bonded: electrostatic and van der Waal’s • Short-distance: every timestep • Long-distance: every 4 timesteps using PME (3D FFT) • Multiple Time Stepping • Calculate velocities and advance positions Collaboration with K. Schulten, R. Skeel, and coworkers
Virtualized Approach to Parallelization using Charm++ 192 + 144 VPs 700 VPs 30,000 VPs These 30,000+ Virtual Processors (VPs) are mapped to real processors by Charm runtime system
Asynchronous reductions, and message-driven execution in Charm allow applications to tolerate random variations
15.6 ms, 0.8 TF Performance: NAMD on Lemieux To be Published in SC2002: Gordon Bell Award Finalist ATPase: 320,000+ atoms including water
Motivation Reduce tedium of parallel programming for commonly used paradigms & parallel data structures Encapsulate parallel data structures and algorithms Provide easy to use interface, Sequential programming style preserved Use adaptive load balancing framework Used to build parallel components Frameworks Unstructured Grids Generalized ghost regions Used in RocFrac version, RocFlu Outside CSAR Fast Collision Detection Multiblock Framework Structured Grids Automates communication AMR Common for both above Particles Multiphase flows MD, Tree codes Component Frameworks
Objective: For commonly used structures, application scientists Shouldn’t have to deal with parallel implementation issues Should be able to reuse code Components and Challenges Unstructured meshes: Unmesh Dynamic refinement support for FEM Solver interfaces Multigrid support Structured meshes: Mblock Multigrid support Study applications Particles Adaptive mesh refinement: Shrinking and growing trees Applicable to the three above Component Frameworks
Application Orchestration / Intergration Support Data transfer Application Components A B C D Framework Components Unmesh MBlock Particles Charm/AMPI Solvers AMR support Parallel Standard Libraries MPI/lower layers
There is magic: Overview of charm capabilities Virtualization paper Summary of charm features: Converse Machine model Scheduler and General Msgs Communication Threads Proxies and generated code Object Groups (BOCs) Support for migration Seed balancing Migration support w reduction How migration is used Vacating workstations and adjusting to speed Adaptive Scheduler Principle of persistence: Measurement based load balancing Centralized strategies, refinement, commlb Distributed strategies (nbr) Collective communication opts Delegation Converse client-server interface Libraries: liveviz, fft, .. Outline
Converse • Converse is a layer on which Charm++ is built • Provides machine-dependent code • Provides “utilities” needed by the RTS of many parallel programming languages • Used for implementing many mini-languages • Main Components of Converse: • Machine model • Scheduler and General Msgs • Communication • Threads • Machine model: • Collections of nodes, each node is a collection processes. • Processes on a node can share memory • Macros for supporting node-level (shared) and processor level globals
Data driven execution Scheduler Scheduler Message Q Message Q
Converse Scheduler • The core of converse is message-driven execution • But scheduled entities are not just “messgese” from remote processors • Genalized notion of messages: any schedulable entity • From scheduler’s point of view: a block of memory, • First few bytes encode a handler function (as an index into a table) • Scheduler, in each iteration: • Polls network, enqueuing messages in a fifo • Selects a message from either the fifo or local-queue • Executes handler of of selected message • This may result in enqueuing of message in the local-queue • Local queue is prioritized lifo/fifo • Priorities may be integers (smaller: higher) or bitvectors (lexicographic)
Converse: communication and threads • Communication support: • “send” a converse message to a remote processor • Message must have handler-index encoded at the beginning • Variety of send-variations (sync/async, memory deallocation) and broadcasts supproted • Threads: bigger topic • User level threads • Migratable threads • Scheduled via converse scheduler • Suspend and awaken : low level thread package
Communication API (Send/Recv) MPI Net Shmem UDP (machine-eth.c) TCP (machine-tcp.c) Myrinet (machine-gm.c) Communication Architecture
Parallel Program Startup • Net version - nodelist Rsh/ssh (IP, port) Rsh/ssh (IP, port) Charmrun node compute node compute node my node (IP, port) Broadcast all nodes (IP, port)
Converse Initialization • ConverseInit • Global variables initialization; • Start worker threads; • ConverseRunPE for each Charm PE • Per thread initialization; • Loop into scheduler: CsdScheduler()
handler xhandler Dgram Header length Message formats • Net version #define CMK_MSG_HEADER_BASIC { CmiUInt2 d0,d1,d2,d3,d4,d5,hdl,d7; } • MPI version #define CMK_MSG_HEADER_BASIC { CmiUInt2 rank, root, hdl,xhdl,info,d3; } rank root handler xhandler info d3
SMP support • MPI-smp as an example • Create threads: CmiStartThreads • Worker threads work cycle • See code in mahcine-smp.c • Communication thread work cycle • See code in machine-smp.c
There is magic: Overview of charm capabilities Virtualization paper Summary of charm features: Converse Machine model Scheduler and General Msgs Communication Threads Proxies and generated code Support for migration Seed balancing Migration support w reduction How migration is used Vacating workstations and adjusting to speed Adaptive Scheduler Principle of persistence: Measurement based load balancing Centralized strategies, refinement, commlb Distributed strategies (nbr) Collective communication opts Delegation Converse client-server interface Libraries: liveviz, fft, .. Outline
Need for Proxies • Consider: • Object x of class A wants to invoke method f of obj y of class B. • x and y are on different processors • what should the syntax be? • y->f( …)? : doesn’t work because y is not a local pointer • Needed: • Instead of “y” we must use an ID that is valid across processors • Method Invocation should use this ID • Some part of the system must pack the parameters and send them • Some part of the system on the remote processor must invoke the right method on the right object with the parameters supplied
Charm++ solution: proxy classes • Classes with remotely invokeable methods • inherit from “chare” class (system defined) • entry methods can only have one parameter: a subclass of message • For each chare class D • which has methods that we want to remotely invoke • The system will automatically generate a proxy class Cproxy_D • Proxy objects know where the real object is • Methods invoked on this class simply put the data in an “envelope” and send it out to the destination • Each chare object has a proxy • CProxy_D thisProxy; // thisProxy inherited from “CBase_D” • Also you can get a proxy for a chare when you create it: • CProxy_D myNewChare = CProxy_D::ckNew(arg);
Generation of proxy classes • How does charm generate the proxy classes? • Needs help from the programmer • name classes and methods that can be remotely invoked • declare this in a special “charm interface” file (pgm.ci) • Include the generated code in your program pgm.ci mainmodule PiMod { mainchare main { entry main(); entry results(int pc); }; chare piPart { entry piPart(void); }; pgm.h #include “PiMod.decl.h” .. Generates PiMod.def.h PiMod.def.h Pgm.c … #include “PiMod.def.h”
Object Groups • A group of objects (chares) • with exactly one representative on each processor • A single proxy for the group as a whole • invoke methods in a branch (asynchronously), all branches (broadcast), or in the local branch • creation: • agroup = Cproxy_C::ckNew(msg) • remote invocation: • p.methodName(msg); // p.methodName(msg, peNum); • p.ckLocalBranch()->f(….);
Information sharing abstractions • Observation: • Information is shared in several specific modes in parallel programs • Other models support only a limited sets of modes: • Shared memory: everything is shared: sledgehammer approach • Message passing: messages are the only method • Charm++: identifies and supports several modes • Readonly / writeonce • Tables (hash tables) • accumulators • Monotonic variables
There is magic: Overview of charm capabilities Virtualization paper Summary of charm features: Converse Machine model Scheduler and General Msgs Communication Threads Proxies and generated code Support for migration Seed balancing Migration support w reduction How migration is used Vacating workstations and adjusting to speed Adaptive Scheduler Principle of persistence: Measurement based load balancing Centralized strategies, refinement, commlb Distributed strategies Collective communication opts Delegation Converse client-server interface Libraries: liveviz, fft, .. Outline
Seed Balancing • Applies (currently) to singleton chares • Not chare array elements • When a new chare is created, the system has freedom to assign it any processor • The “seed” message (containing constructor parameters) may be moved around among procs until it takes root • Use • Tree-structured computations • State-space search, divide-conquer • Early applications of Charm • See papers
Object Arrays • A collection of data-driven objects • With a single global name for the collection • Each member addressed by an index • [sparse] 1D, 2D, 3D, tree, string, ... • Mapping of element objects to procS handled by the system User’s view A[0] A[1] A[2] A[3] A[..]
Object Arrays • A collection of data-driven objects • With a single global name for the collection • Each member addressed by an index • [sparse] 1D, 2D, 3D, tree, string, ... • Mapping of element objects to procS handled by the system User’s view A[0] A[1] A[2] A[3] A[..] System view A[0] A[3]
Object Arrays • A collection of data-driven objects • With a single global name for the collection • Each member addressed by an index • [sparse] 1D, 2D, 3D, tree, string, ... • Mapping of element objects to procS handled by the system User’s view A[0] A[1] A[2] A[3] A[..] System view A[0] A[3]
Migration support • Forwarding of messages • Optimized by hop-counts: • If a message took multiple hops, the receiver sends the current address to sender • Array manager maintains a cache of known addresses • Home processor for each element • Defined by a hash function on the index • Reductions and broadcasts • Must work in presence of migrations! • Also, in presence of deletions/insertions • Communication time proportional to number of processors not objects • Uses spanning tree based on processors, handling migrated “stragglers” separately
There is magic: Overview of charm capabilities Virtualization paper Summary of charm features: Converse Machine model Scheduler and General Msgs Communication Threads Proxies and generated code Support for migration Seed balancing Migration support w reduction How migration is used Vacating workstations and adjusting to speed Adaptive Scheduler Principle of persistence: Measurement based load balancing Centralized strategies, refinement, commlb Distributed strategies Collective communication opts Delegation Converse client-server interface Libraries: liveviz, fft, .. Outline
Vacating workstations If the “owner” starts using it, Migrate objects away Detection: Adjusting to speed: Static: measure speeds at the beginning Use speed ratios in load balancing Dynamic: In a time-shared environment Measure unix “load” Migrate proportional set of objects if machine is loaded Adaptive job scheduler Tells jobs to change the sets of processors used Each job maintains a bit-vector of processors they can use. Scheduler can change bit-vector Jobs obey by migrating objects Forwarding, reductions: Residual process is left behind Flexible mapping: using migratibility
Job Specs Bids Job Specs File Upload File Upload Job Id Job Id Faucets: Optimizing Utilization Within/across Clusters http://charm.cs.uiuc.edu/research/faucets Cluster Job Submission Cluster Job Monitor Cluster
Job Monitoring: Appspector • When you attach to a job: • Live performance data (bottom) • Application data – App. Supplied -- Live
Allocate A 8 processors Conflict ! Job B Job A Job B 10 processors Job A B Queued Inefficient Utilization Within A Cluster Parallel Servers are “profit centers” in faucets: need high utilization 16 Processor system Current Job Schedulers can yield low system utilization.. A competitive problem in Faucets-like systems
B Finishes Shrink A Allocate B ! A Expands ! Allocate A ! Min_pe = 8 Max_pe= 16 Job B Job A Job B Max_pe = 10 Min_pe = 1 Job A Two Adaptive Jobs Adaptive Jobs can shrink or expand the number of processors they use, at runtime: by migrating virtual processors in Charm/AMPI 16 Processor system
AQS: Adaptive Queuing System • AQS: A Scheduler for Clusters • Has the ability to manage adaptive jobs • Currently, those implemented in Charm++ and AMPI • Handles regular (non-adaptive) MPI jobs • Experimental results on CSE Turing Cluster
There is magic: Overview of charm capabilities Virtualization paper Summary of charm features: Converse Machine model Scheduler and General Msgs Communication Threads Proxies and generated code Support for migration Seed balancing Migration support w reduction How migration is used Vacating workstations and adjusting to speed Adaptive Scheduler Principle of persistence: Measurement based load balancing Centralized strategies, refinement, commlb Distributed strategies (nbr) Collective communication opts Delegation Converse client-server interface Libraries: liveviz, fft, .. Outline
IV: Principle of Persistence • Once the application is expressed in terms of interacting objects: • Object communication patterns and computational loads tend to persist over time • In spite of dynamic behavior • Abrupt and large,but infrequent changes (eg:AMR) • Slow and small changes (eg: particle migration) • Parallel analog of principle of locality • Heuristics, that holds for most CSE applications • Learning / adaptive algorithms • Adaptive Communication libraries • Measurement based load balancing
Measurement Based Load Balancing • Based on Principle of persistence • Runtime instrumentation • Measures communication volume and computation time • Measurement based load balancers • Use the instrumented data-base periodically to make new decisions • Many alternative strategies can use the database • Centralized vs distributed • Greedy improvements vs complete reassignments • Taking communication into account • Taking dependences into account (More complex)
Load balancer in action Automatic Load Balancing in Crack Propagation 1. Elements Added 3. Chunks Migrated 2. Load Balancer Invoked
There is magic: Overview of charm capabilities Virtualization paper Summary of charm features: Converse Machine model Scheduler and General Msgs Communication Threads Proxies and generated code Support for migration Seed balancing Migration support w reduction How migration is used Vacating workstations and adjusting to speed Adaptive Scheduler Principle of persistence: Measurement based load balancing Centralized strategies, refinement, commlb Distributed strategies (nbr) Collective communication opts Delegation Converse client-server interface Libraries: liveviz, fft, .. Outline