
Workshop on Charm++ and Applications Welcome and Introduction



Presentation Transcript


  1. Workshop on Charm++ and Applications Welcome and Introduction Laxmikant Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign Charm Workshop 2003

  2. PPL Mission and Approach • To enhance Performance and Productivity in programming complex parallel applications • Performance: scalable to thousands of processors • Productivity: of human programmers • Complex: irregular structure, dynamic variations • Approach: application-oriented yet CS-centered research • Develop enabling technology for a wide collection of apps • Develop, use, and test it in the context of real applications • How? • Develop novel parallel programming techniques • Embody them in easy-to-use abstractions • So that application scientists can use advanced techniques with ease • Enabling technology: reused across many apps Charm Workshop 2003

  3. A Core Idea: Processor Virtualization • Programmer: [over]decomposition into virtual processors (VPs) • Runtime: assigns VPs to real processors and enables adaptive runtime strategies • Implementations: Charm++, AMPI • Benefits • Software engineering: the number of virtual processors can be independently controlled; separate VPs for different modules • Message-driven execution: adaptive overlap of communication; predictability: automatic out-of-core; asynchronous reductions • Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing; change the set of processors used; automatic dynamic load balancing; communication optimization [Figure: MPI processes realized as virtual processors (user-level migratable threads) mapped onto real processors] Charm Workshop 2003
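
The following is a minimal, illustrative Charm++ sketch of the over-decomposition described on this slide: a 1D chare array with many more elements than physical processors, which the runtime is free to map and migrate. The module, class, and method names (blocks, Block, doWork, NUM_VPS) are invented for this example; only the general Charm++ idioms (chare arrays, entry methods, reductions) come from the system itself.

    // blocks.ci (interface file), shown here as a comment:
    //   mainmodule blocks {
    //     readonly CProxy_Main mainProxy;
    //     mainchare Main { entry Main(CkArgMsg* m); entry [reductiontarget] void done(); };
    //     array [1D] Block { entry Block(); entry void doWork(); };
    //   }

    // blocks.C
    #include "blocks.decl.h"

    /*readonly*/ CProxy_Main mainProxy;
    const int NUM_VPS = 1024;                    // far more virtual processors than physical ones

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg* m) {
        delete m;
        mainProxy = thisProxy;
        CProxy_Block blocks = CProxy_Block::ckNew(NUM_VPS);  // runtime assigns VPs to processors
        blocks.doWork();                         // broadcast to all virtual processors
      }
      void done() { CkExit(); }                  // reached once every VP has contributed
    };

    class Block : public CBase_Block {
     public:
      Block() {}
      Block(CkMigrateMessage*) {}                // needed so the runtime can migrate this VP
      void doWork() {
        // ... compute on this block's share of the data ...
        contribute(CkCallback(CkReductionTarget(Main, done), mainProxy));
      }
    };

    #include "blocks.def.h"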

  4. Over the Next 3 Days • Tutorials: Charm++, AMPI, Projections, FEM framework, Faucets • Keynote: Prof. Sadayappan, Electronic Structure Calculations • Applications: Molecular Dynamics, Quantum Chemistry (QM/MM), Space-time meshes, Rocket Simulation, Computational Cosmology • System progress talks: Adaptive MPI, Application Visualization, Performance Visualization, Performance Prediction, Frameworks/Libraries, Load Balancers, Communication Optimizations, Discrete Event Simulation Charm Workshop 2003

  5. Enabling CS technology of parallel objects and intelligent runtime systems (Charm++ and AMPI) has led to several collaborative applications in CSE [Figure: applications built on the Parallel Objects, Adaptive Runtime System, Libraries and Tools layer: Quantum Chemistry (QM/MM), Protein Folding, Molecular Dynamics, Computational Cosmology, Crack Propagation, Space-time meshes, Dendritic Growth, Rocket Simulation] Charm Workshop 2003

  6. Some Active Collaborations • Biophysics: Molecular Dynamics (NIH, ..): long-standing (since 1991), Klaus Schulten, Bob Skeel; Gordon Bell award in 2002; production program used by biophysicists • Quantum Chemistry (NSF): QM/MM via the Car-Parrinello method +; Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom, Josep Torrellas, Laxmikant Kale • Simulation of Plasma • Material simulation (NSF): dendritic growth, quenching, space-time meshes, QM/FEM; R. Haber, D. Johnson, J. Dantzig, + • Rocket simulation (DOE): DOE-funded ASCI center; Mike Heath, +30 faculty • Computational Cosmology (NSF, NASA): simulation and scalable visualization Charm Workshop 2003

  7. Collaborations Represented Today • Biophysics: Molecular Dynamics (NIH, ..) • Long-standing (since 1991), Klaus Schulten, Bob Skeel • Gordon Bell award in 2002 • Production program used by biophysicists • Quantum Chemistry (NSF) • QM/MM via the Car-Parrinello method + • Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, • Nick Nystrom, Josep Torrellas, Laxmikant Kale • Material simulation (NSF) • R. Haber, D. Johnson, J. Dantzig, Jeff • Rocket simulation (DOE) • DOE, Mike Heath; nature of the problem and how Charm++ is useful • Computational Cosmology (NSF, NASA) • Tom Quinn, .. • Simulation and Scalable Visualization Charm Workshop 2003

  8. AMPI Capabilities and Progress • Performance visualization • Automatic checkpointing: robust implementation available • Out-of-core execution: fully automatic, almost complete • Cross communicators • Interoperability with frameworks and with Charm++ • MPI-2 features: by Dec 2003 (most) • MPI I/O • Support for virtualization: processor-level caches • Hybrid programming: shared-memory primitives • Adjust to available processors [Figure: time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600] Charm Workshop 2003

  9. Performance on Large Machines • Problem: • How to predict performance of applications on future machines? (E.g. BG/L) • How to do performance tuning without continuous access to large machine? • Solution: • Leverage virtualization • Develop machine emulator • Simulator: accurate time modeling • Run program on “100,000 processors” using only hundreds of processors • Analysis: • Use performance viz. suite (projections) • Molecular Dynamics Benchmark • ER-GRE: 36573 atoms • 1.6 million objects • 8 step simulation • 16k BG processors Charm Workshop 2003

  10. Principle of Persistence • Once the application is expressed in terms of interacting objects: • Object communication patterns and computational loads tend to persist over time • In spite of dynamic behavior • Abrupt, large, but infrequent changes (e.g., AMR) • Slow and small changes (e.g., particle migration) • Parallel analog of the principle of locality • A heuristic that holds for most CSE applications • Enables learning / adaptive algorithms • Adaptive communication libraries • Measurement-based load balancing Charm Workshop 2003

  11. Measurement-Based Load Balancing • Based on the principle of persistence • Runtime instrumentation • Measures communication volume and computation time • Measurement-based load balancers • Use the instrumented database periodically to make new decisions • Many alternative strategies can use the database • Centralized vs. distributed • Greedy improvements vs. complete reassignments • Taking communication into account • Taking dependencies into account (more complex) Charm Workshop 2003
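
As a concrete illustration of how an application hooks into this measurement-based machinery, here is a minimal Charm++ sketch (not taken from any particular PPL application): an array element enables AtSync-style balancing, the runtime instruments its load and communication automatically, and the element periodically hands control to the balancer. The class name, the data member, and the choice of every 20 steps are arbitrary; the usesAtSync / AtSync / ResumeFromSync / pup idioms are the standard Charm++ interface.

    // Sketch of a migratable array element participating in measurement-based load balancing.
    // (Assumes the usual generated headers, e.g. #include "block.decl.h".)
    #include <vector>

    class Block : public CBase_Block {
      std::vector<double> data;            // element state; must be serialized to migrate
      int iter = 0;
     public:
      Block() { usesAtSync = true; }       // opt in to AtSync load balancing
      Block(CkMigrateMessage*) {}

      void pup(PUP::er& p) {               // lets the runtime move this object between processors
        CBase_Block::pup(p);
        p | data;
        p | iter;
      }

      void step() {                        // entry method declared in the .ci file
        // ... compute and communicate; the runtime records time and message volume per object ...
        if (++iter % 20 == 0) AtSync();    // periodically let the balancer remap objects
        else thisProxy[thisIndex].step();
      }
      void ResumeFromSync() {              // called by the runtime after (possible) migration
        thisProxy[thisIndex].step();
      }
    };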

  12. Optimizing for Communication Patterns • The parallel-objects runtime system can observe, instrument, and measure communication patterns • Communication is from/to objects, not processors • Load balancers use this to optimize object placement • Communication libraries can optimize • By substituting the most suitable algorithm for each operation • Learning at runtime • E.g., each-to-all individualized sends • Performance depends on many runtime characteristics • Library switches between different algorithms • V. Krishnan, MS Thesis, 1996; ongoing work: Sameer Kumar, G. Zheng, and Greg Koenig Charm Workshop 2003

  13. Asynchronous Collectives • Powered by a communication optimization library • 2D FFT optimized using AMPI • Observation: computation is only a small proportion of the elapsed time • AMPI overlaps useful computation with the waiting time of collective operations • Total completion time reduced [Figures: time breakdown of the 2D FFT benchmark; time breakdown of all-to-all using the Mesh library] Charm Workshop 2003
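
A hedged sketch of the overlap pattern this slide describes, written against the standard nonblocking-collective interface (MPI-3 names it MPI_Ialltoall; AMPI offered asynchronous collectives in the same spirit at the time of the talk). The function names transpose_with_overlap and local_work, and the buffer layout, are invented for the example.

    #include <mpi.h>
    #include <vector>

    void local_work();   // computation that does not depend on the transposed data (assumed to exist)

    // Overlap an all-to-all (e.g. the transpose phase of a 2D FFT) with independent computation.
    void transpose_with_overlap(std::vector<double>& sendbuf, std::vector<double>& recvbuf,
                                int blockCount, MPI_Comm comm) {
      MPI_Request req;
      MPI_Ialltoall(sendbuf.data(), blockCount, MPI_DOUBLE,
                    recvbuf.data(), blockCount, MPI_DOUBLE, comm, &req);   // start the collective

      local_work();                          // useful computation instead of idle waiting

      MPI_Wait(&req, MPI_STATUS_IGNORE);     // block only when the transposed data is actually needed
      // ... second 1D FFT pass over recvbuf ...
    }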

  14. “Overhead” of Virtualization Isn’t there significant overhead of virtualization? No! Not in most cases. Charm Workshop 2003

  15. Component Frameworks • Motivation • Reduce the tedium of parallel programming for commonly used paradigms • Encapsulate parallel data structures and algorithms • Provide an easy-to-use interface • Sequential programming style preserved • No alienating invasive constructs • Use the adaptive load balancing framework • Component frameworks • FEM / unstructured grids • Multiblock: structured grids • AMR Charm Workshop 2003

  16. Parallel Collision Detection • Results: 2 µs per polygon; good speedups to 1000s of processors • ASCI Red, 65,000 polygons per processor (scaled problem), up to 100 million polygons • A significant improvement over the state of the art • Made possible by virtualization, and • Asynchronous, as-needed creation of voxels • Localization of communication: a voxel is often on the same processor as the contributing polygon Charm Workshop 2003
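
To make the voxel idea concrete, here is a small, self-contained C++ sketch (not the Charm++ collision library's actual API): each polygon's bounding box is binned into the spatial voxels it overlaps, voxels are created lazily only where polygons exist, and narrow-phase tests then run per voxel. All type and function names here are invented for the illustration.

    #include <cmath>
    #include <map>
    #include <tuple>
    #include <vector>

    struct Box { double lo[3], hi[3]; };       // axis-aligned bounding box of one polygon

    struct VoxelKey {
      int i, j, k;
      bool operator<(const VoxelKey& o) const { return std::tie(i, j, k) < std::tie(o.i, o.j, o.k); }
    };

    // Bin polygons into the voxels their bounding boxes overlap; only co-located polygons
    // need pairwise (narrow-phase) collision tests afterwards.
    std::map<VoxelKey, std::vector<int>> bin_polygons(const std::vector<Box>& boxes, double voxel) {
      std::map<VoxelKey, std::vector<int>> grid;
      for (int p = 0; p < (int)boxes.size(); ++p) {
        const Box& b = boxes[p];
        for (int i = (int)std::floor(b.lo[0] / voxel); i <= (int)std::floor(b.hi[0] / voxel); ++i)
          for (int j = (int)std::floor(b.lo[1] / voxel); j <= (int)std::floor(b.hi[1] / voxel); ++j)
            for (int k = (int)std::floor(b.lo[2] / voxel); k <= (int)std::floor(b.hi[2] / voxel); ++k)
              grid[VoxelKey{i, j, k}].push_back(p);   // voxels are created only on demand
      }
      return grid;   // in the parallel version, each voxel is processed near its contributing polygons
    }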

  17. Frameworks Progress • In regular use at CSAR (Rocket Center) • A finite-volume fluid flow code (Rocflu) was developed using it • An AMPI version of the framework has been developed • Adaptivity • Mesh refinement: 2D done, 3D in progress • Remeshing support: assembly and parallel solution transfer • Extended usage planned for space-time meshes Charm Workshop 2003

  18. Grid: Progress on Faucets • Cluster bartering via Faucets • Allows single-point job submission • Grid-level resource allocation • Quality-of-service contracts • Clusters participate in bartering mode • Cluster scheduler • AMPI jobs can be made to change the number of processors used at runtime • The scheduler exploits such adaptive jobs to maximize throughput and reduce response time (especially for high-priority jobs) • Tutorial on Wednesday Charm Workshop 2003

  19. AQS:Adaptive Queuing System • Multithreaded • Reliable and robust • Supports most features of standard queuing systems • Exploits adaptive jobs • Charm++ and MPI • Shrink/Expand set of procs • Handles regular jobs • i.e. non-adaptive • In routine use on the “architecture cluster” Charm Workshop 2003

  20. Faucets: Optimizing Utilization Within/Across Clusters • http://charm.cs.uiuc.edu/research/faucets [Figure: job submission and monitoring across clusters, with Job Specs, Bids, File Upload, and Job Id exchanged between the job submission point, the job monitor, and the clusters] Charm Workshop 2003

  21. Other Ongoing and Planned System Projects • Fault tolerance • Parallel languages • JADE • Orchestration • Parallel Debugger • Automatic out-of-core execution • Parallel Multigrid Support Charm Workshop 2003

  22. Molecular Dynamics in NAMD • Collection of [charged] atoms, with bonds • Newtonian mechanics • Thousands of atoms (1,000 - 500,000) • 1-femtosecond time-step; millions of steps needed! • At each time-step • Calculate forces on each atom • Bonded forces • Non-bonded: electrostatic and van der Waals • Short-range: every timestep • Long-range: every 4 timesteps using PME (3D FFT) • Multiple time stepping • Calculate velocities and advance positions • Collaboration with K. Schulten, R. Skeel, and coworkers Charm Workshop 2003
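
The per-timestep structure above maps onto a loop like the following sketch (plain C++, not NAMD source): bonded and short-range non-bonded forces every step, the PME long-range contribution only every fourth step, then a velocity and position update. The force routines are placeholders assumed to exist, and the simple explicit update stands in for the symplectic integrator a real MD code uses.

    #include <vector>

    struct Atom { double pos[3], vel[3], force[3], mass; };

    // Placeholder force kernels (assumed to exist elsewhere).
    void add_bonded_forces(std::vector<Atom>& atoms);
    void add_short_range_nonbonded(std::vector<Atom>& atoms);
    void add_long_range_pme(std::vector<Atom>& atoms);       // 3D-FFT-based particle-mesh Ewald

    void md_loop(std::vector<Atom>& atoms, int nSteps, double dt) {
      const int pmePeriod = 4;                                // long-range forces every 4 steps
      for (int step = 0; step < nSteps; ++step) {
        for (Atom& a : atoms) a.force[0] = a.force[1] = a.force[2] = 0.0;

        add_bonded_forces(atoms);                             // every step
        add_short_range_nonbonded(atoms);                     // cutoff electrostatics + van der Waals
        if (step % pmePeriod == 0) add_long_range_pme(atoms); // multiple time stepping

        for (Atom& a : atoms)                                 // advance velocities and positions
          for (int d = 0; d < 3; ++d) {
            a.vel[d] += dt * a.force[d] / a.mass;
            a.pos[d] += dt * a.vel[d];
          }
      }
    }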

  23. NAMD: A Production MD Program • Fully featured program • NIH-funded development • Distributed free of charge (~5000 downloads so far) • Binaries and source code • Installed at NSF centers • User training and support • Large published simulations (e.g., the aquaporin simulation featured in the keynote) Charm Workshop 2003

  24. CPSD: Spacetime Discontinuous Galerkin Method • Collaboration with Bob Haber, Jeff Erickson, Mike Garland, .. (NSF-funded center) • The space-time mesh is generated at runtime • Mesh generation is an advancing-front algorithm • Adds an independent set of elements called patches to the mesh • Each patch depends only on inflow elements (cone constraint) • Completed: sequential mesh generation interleaved with parallel solution • Ongoing: parallel mesh generation • Planned: non-linear cone constraints, adaptive refinement Charm Workshop 2003

  25. CPSD: Dendritic Growth • Studies the evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid • Adaptive refinement and coarsening of the grid involves re-partitioning • Jon Dantzig et al., with O. Lawlor and others from PPL Charm Workshop 2003

  26. Rocket Simulation • Dynamic, coupled physics simulation in 3D • Finite-element solids on an unstructured tet mesh • Finite-volume fluids on a structured hex mesh • Coupling every timestep via a least-squares data transfer • Challenges: • Multiple modules • Dynamic behavior: burning surface, mesh adaptation • Robert Fiedler, Center for Simulation of Advanced Rockets • Collaboration with M. Heath, P. Geubelle, and others Charm Workshop 2003

  27. Computational Cosmology • N-body simulation • N particles (1 million to 1 billion) in a periodic box • Move under gravitation • Organized in a tree (oct, binary (k-d), ..) • Output data analysis in parallel • Particles are read in parallel • Interactive analysis • Issues: • Load balancing, fine-grained communication, tolerating communication latencies • Multiple time stepping • Collaboration with T. Quinn, Y. Staedel, M. Winslett, and others Charm Workshop 2003
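
For readers unfamiliar with tree-based N-body methods, the sketch below shows the core idea behind the tree organization mentioned above: a distant tree node is accepted as a single mass (monopole approximation) when it subtends a small enough angle; otherwise it is opened and its children are visited. This is a generic Barnes-Hut-style walk in plain C++ with illustrative field names and parameter values; it is not taken from the collaboration's code.

    #include <cmath>
    #include <vector>

    struct Node {
      double com[3];                       // center of mass of the particles in this cell
      double mass;                         // total mass in the cell
      double size;                         // cell width
      std::vector<Node*> children;         // empty for a leaf
    };

    // Accumulate the gravitational acceleration at 'pos' (G = 1 units, softened by eps2).
    void accumulate_gravity(const Node* node, const double pos[3], double acc[3],
                            double theta = 0.7, double eps2 = 1e-6) {
      double d[3] = { node->com[0] - pos[0], node->com[1] - pos[1], node->com[2] - pos[2] };
      double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2] + eps2;
      double r  = std::sqrt(r2);

      if (node->children.empty() || node->size / r < theta) {
        double f = node->mass / (r2 * r);                    // treat the whole cell as a single mass
        for (int k = 0; k < 3; ++k) acc[k] += f * d[k];
      } else {
        for (const Node* c : node->children)                 // cell too close: open it and recurse
          accumulate_gravity(c, pos, acc, theta, eps2);
      }
    }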

  28. QM/MM • Quantum Chemistry (NSF) • QM/MM via the Car-Parrinello method + • Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, • Nick Nystrom, Josep Torrellas, Laxmikant Kale • Current steps: • Take the core methods in PinyMD (Martyna/Tuckerman) • Reimplement them in Charm++ • Study effective parallelization techniques • Planned: • LeanMD (classical MD core..) • Full QM/MM • Integrate with other code Charm Workshop 2003

  29. Summary and Messages • We at PPL have advanced parallel-methods technology • We are committed to supporting applications • We grow our base of reusable techniques via such collaborations • Try using our technology: • AMPI, (Charm++), Faucets, FEM framework, .. • Available via the web (http://charm.cs.uiuc.edu) Charm Workshop 2003
