
Scaling Molecular Dynamics to 3000 Processors with Projections: A Performance Analysis Case Study


Presentation Transcript


  1. Scaling Molecular Dynamics to 3000 Processors with Projections: A Performance Analysis Case Study Laxmikant Kale Gengbin Zheng Sameer Kumar Chee Wai Lee http://charm.cs.uiuc.edu Parallel Programming Laboratory Dept. of Computer Science University of Illinois at Urbana-Champaign

  2. Motivation • Performance optimization is increasingly challenging • Modern applications are complex and dynamic • Some may involve only a small amount of computation per step • Performance issues and obstacles change: • As the number of processors changes • Performance tuning on small machines isn't enough • Need very good performance analysis tools • Feedback at the level of the application • Analysis capabilities • Scalable views • Automatic instrumentation

  3. Projections • Many performance analysis systems exist • And are very useful in their own way • ParaGraph, Upshot, Pablo, ... • The system we present is "Projections" • Designed for message-driven programs • Has some unique benefits compared with others, outlined in the next few slides • Outline: • Processor virtualization and message-driven execution • Projections: trace generation • Projections: views • Case study: NAMD, a molecular dynamics program that won a Gordon Bell award at SC'02 by scaling MD for biomolecules to 3,000 processors

  4. Virtualization: Object-Based Parallelization • The user is concerned only with the interaction between objects (the user view); the system implementation maps these objects onto processors

  5. Charm++ and Adaptive MPI: Realizations of the Virtualization Approach • Charm++: parallel C++ with asynchronous methods • In development for over a decade • Basis of several parallel applications • Runs on all popular parallel machines and clusters • AMPI: a migration path for MPI codes • Gives them the dynamic load-balancing capabilities of Charm++ • Minimal modifications to convert existing MPI programs • Bindings for C, C++, and Fortran90 • Both available from http://charm.cs.uiuc.edu
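The "asynchronous methods" above are Charm++ entry methods: invoking one sends a message to the target object and returns immediately, rather than blocking the caller. A minimal sketch of the idea follows; the module, chare, and method names (cell, Cell, addForces) are hypothetical, not NAMD's actual objects.

  // cell.ci -- Charm++ interface file (hypothetical names)
  module cell {
    array [3D] Cell {
      entry Cell();
      entry void addForces(int n, double forces[n]);  // marshalled parameters
    };
  };

  // cell.C -- invoking an entry method is asynchronous: the call below
  // sends a message and returns at once; the runtime later schedules
  // Cell::addForces on whichever processor owns element (x, y, z).
  void sendResults(CProxy_Cell cellProxy, int x, int y, int z,
                   int n, double* forces) {
    cellProxy(x, y, z).addForces(n, forces);
  }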

  6. Benefits of Virtualization • Software engineering: the number of virtual processors can be independently controlled; separate VPs for modules • Message-driven execution: adaptive overlap; modularity; predictability (automatic out-of-core) • Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing; change the set of processors • Principle of persistence: enables runtime optimizations, including automatic dynamic load balancing, communication optimizations, and other runtime optimizations • More info: http://charm.cs.uiuc.edu

  7. Trace Generation • Automatic instrumentation by the runtime system • Detailed (log) mode: each event is recorded in full detail (including a timestamp) in an internal buffer • Summary mode: reduces the size of output files and memory overhead; in the default configuration it produces a few lines of output data per processor, recorded in bins corresponding to intervals of 1 ms by default
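Beyond the automatically recorded entry-method events, Projections can also display user-defined events. Below is a hedged sketch of how an application might mark a phase of interest, assuming the program is linked with the Projections trace mode (the -tracemode projections option of charmc); computeNonbondedForces and the event name are hypothetical.

  #include "charm++.h"

  void computeNonbondedForces();   // hypothetical application work
  int forceEventID;                // filled in at startup

  // Register the event once at startup (e.g., in a mainchare constructor);
  // the returned id identifies the event in the Projections timeline.
  void registerTraceEvents() {
    forceEventID = traceRegisterUserEvent("nonbonded force phase");
  }

  // Bracket a phase of interest so it shows up in the timeline view
  // alongside the automatically recorded entry-method executions.
  void tracedForcePhase() {
    double start = CkWallTimer();
    computeNonbondedForces();
    traceUserBracketEvent(forceEventID, start, CkWallTimer());
  }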

  8. Post-mortem analysis: views • Utilization graph • As a function of time interval or processor • Shows processor utilization • As well as time spent on specific parallel methods • Timeline: • Upshot-like, but more detailed • Pop-up views of method execution, message arrows, user-level events • Profile: stacked graphs • For a given period, a breakdown of the time spent on each processor • Includes idle time and message sending and receiving times

  9. Projections Views: continued • Animation of utilization • Histogram of method execution times • How many method-execution instances had a time of 0-1 ms? 1-2 ms? .. • Performance counters: • Associated with each entry method • Usual counters, interface to PAPI

  10. Case Study: NAMD • We illustrate the use of Projections • Using a case study • Illustrate different “views” • Show performance debugging methodology

  11. NAMD: A Production MD Program • Fully featured program • NIH-funded development • Distributed free of charge (~5000 downloads so far) • Binaries and source code • Installed at NSF centers • User training and support • Large published simulations (e.g., the aquaporin simulation featured in the SC'02 keynote)

  12. Molecular Dynamics in NAMD • Collection of [charged] atoms, with bonds • Newtonian mechanics • Thousands of atoms (10,000 - 500,000) • At each time-step • Calculate forces on each atom • Bonds: • Non-bonded: electrostatic and van der Waals • Short-distance: every timestep • Long-distance: using PME (3D FFT) • Multiple time stepping: PME every 4 timesteps • Calculate velocities and advance positions • Challenge: femtosecond time-step, millions needed! • Collaboration with K. Schulten, R. Skeel, and coworkers
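The per-timestep structure above, with the short-range work done every step and PME only every fourth step, can be sketched as follows. This is a plain C++ illustration with made-up types and function names (System, computeBonded, and so on), not NAMD's actual code.

  #include <vector>

  // Illustrative stand-ins, not NAMD data structures.
  struct System { std::vector<double> x, v, f; };

  void zeroForces(System& s)               { for (double& fi : s.f) fi = 0.0; }
  void computeBonded(System&)              { /* bonds, angles, dihedrals */ }
  void computeShortRangeNonbonded(System&) { /* cutoff electrostatics + vdW */ }
  void computeLongRangePME(System&)        { /* 3D-FFT long-range part */ }
  void integrate(System&, double)          { /* advance velocities, positions */ }

  // Shape of the multiple-time-stepping loop described on the slide:
  // short-range forces every step, PME only every fourth step.
  void simulate(System& sys, int numSteps, double dt /* ~1 femtosecond */) {
    const int pmePeriod = 4;
    for (int step = 0; step < numSteps; ++step) {
      zeroForces(sys);
      computeBonded(sys);
      computeShortRangeNonbonded(sys);
      if (step % pmePeriod == 0) computeLongRangePME(sys);
      integrate(sys, dt);
    }
  }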

  13. F1F0 ATP Synthase (ATPase): The Benchmark • Converts the electrochemical energy of the proton gradient into the mechanical energy of the central stalk rotation, driving ATP synthesis (ΔG = 7.7 kcal/mol) • 327,000 atoms total • 51,000 atoms: protein and nucleotide • 276,000 atoms: water and ions

  14. Spatial Decomposition via Charm++ • Atoms are distributed to cubes (cells, or "patches") based on their location • Size of each cube: just a bit larger than the cut-off radius • Communicate only with neighbors • Work: for each pair of neighboring objects • Communication-to-computation ratio: O(1) • However: • Load imbalance • Limited parallelism • Charm++ is useful for handling this
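A minimal sketch of the decomposition rule: each atom is assigned to the cube (patch) that contains it, with the cube edge chosen a bit larger than the cutoff so that every interaction partner within the cutoff lies in the same or a neighboring cube. The names and the 10% margin are illustrative assumptions, not NAMD's actual values.

  #include <cmath>

  // Patch (cube) index of an atom along each axis.
  struct PatchIndex { int ix, iy, iz; };

  // patchSize is chosen a bit larger than the cutoff radius, so any two
  // atoms within the cutoff are in the same patch or in adjacent patches.
  PatchIndex patchOf(double x, double y, double z,
                     double originX, double originY, double originZ,
                     double cutoff) {
    const double patchSize = cutoff * 1.1;   // illustrative margin
    return { static_cast<int>(std::floor((x - originX) / patchSize)),
             static_cast<int>(std::floor((y - originY) / patchSize)),
             static_cast<int>(std::floor((z - originZ) / patchSize)) };
  }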

  15. Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition • Now we have many objects to load balance: • Each diamond (non-bonded compute object) can be assigned to any processor • Number of diamonds (3D): 14 × number of patches, since each patch gets one self-interaction compute plus one compute shared with each of half of its 26 neighbors (1 + 13 = 14)

  16. Adding PME • PME involves: • A grid of modest size (e.g. 192x144x144) • Need to distribute charge from patches to grids • 3D FFT over the grid • Strategy: • Use a smaller subset (non-dedicated) of processors for PME • Overlap PME with cutoff computation • Use individual processors for both PME and cutoff computations • Multiple timestepping

  17. NAMD Parallelization using Charm++: PME • Figure labels: 192 + 144 VPs (PME), 700 VPs, 30,000 VPs • These 30,000+ virtual processors (VPs) are mapped to real processors by the Charm++ runtime system

  18. Some Challenges • Here, we will focus on a cut-off-only (no PME) simulation • For simplicity of presentation of the performance issues • New parallel machine with faster processors: PSC Lemieux • 1-processor performance (ApoA1): from 57 seconds on ASCI Red to 7.08 seconds on Lemieux • Makes it harder to parallelize: • E.g., larger communication-to-computation ratio • Each timestep is a few milliseconds on 1000s of processors

  19. Grainsize and Amdahl's Law • A variant of Amdahl's law, for objects: • The fastest time can be no shorter than the time for the biggest single object! • Lesson from previous efforts • Splitting computation objects: • 30,000 non-bonded compute objects • Instead of approximately 10,000
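Stated as a bound (a restatement of the slide's claim): if compute object i takes time g_i, then no matter how many processors are available the time per step satisfies

  T_{\text{step}} \;\ge\; \max_i g_i ,

so splitting the roughly 10,000 non-bonded compute objects into about 30,000 smaller ones lowers the largest g_i and with it the floor on the achievable step time.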

  20. Distribution of execution times of non-bonded force computation objects (over 24 steps) • Mode: 700 μs

  21. Message Packing Overhead and Multicast • Effect of the multicast optimization on integration overhead: it eliminates the overhead of message copying and allocation

  22. Measurement-Based Load Balancing • Principle of persistence • Object communication patterns and computational loads tend to persist over time • In spite of dynamic behavior • Abrupt but infrequent changes • Slow and small changes • Runtime instrumentation • Measures communication volume and computation time • Measurement-based load balancers • Use the instrumented database periodically to make new decisions
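A sketch of the kind of decision such a balancer can make from the measured data: sort objects by their recorded execution time and repeatedly place the heaviest remaining object on the currently least-loaded processor. This is a simplified illustration of the greedy strategy referred to on a later slide, not the actual Charm++ load-balancer code, and it ignores the measured communication volume.

  #include <algorithm>
  #include <functional>
  #include <queue>
  #include <utility>
  #include <vector>

  // objLoad[i] = measured execution time of object i (from instrumentation).
  // Returns assignment[i] = processor chosen for object i.
  std::vector<int> greedyAssign(const std::vector<double>& objLoad, int numProcs) {
    std::vector<int> order(objLoad.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    // Heaviest objects first.
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoad[a] > objLoad[b]; });

    // Min-heap of (current load, processor).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::vector<int> assignment(objLoad.size());
    for (int obj : order) {
      auto [load, p] = procs.top();
      procs.pop();
      assignment[obj] = p;                   // place on least-loaded processor
      procs.push({load + objLoad[obj], p});
    }
    return assignment;
  }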

  23. Load Balancing Steps • Figure labels: Regular Timesteps; Instrumented Timesteps; Detailed, aggressive Load Balancing; Refinement Load Balancing

  24. Load Balancing • Processor utilization against time on (a) 128 and (b) 1024 processors, showing the aggressive and refinement load-balancing steps • On 128 processors, a single load-balancing step suffices, but on 1024 processors we need a "refinement" step

  25. Processor utilization across processors after (a) greedy load balancing and (b) refining • Some processors remain overloaded after the greedy step • Note that the underloaded processors are left underloaded (as they don't impact performance); refinement deals only with the overloaded ones

  26. New Challenge: Stretched Computations • Jitter in computes of up to 80 ms • On 1000+ processors, using 4 processors per node • NAMD ATPase on 3000 processors: timesteps of 12 ms • Within that time, each processor sends and receives approximately 60-70 messages of 4-6 KB each • OS context-switch time is 10 ms • The OS and communication layer can have "hiccups" • These hiccups are termed stretches • Stretches can be a large performance impediment

  27. Benefits of Avoiding Barriers • (Figure: timelines with and without a barrier) • Problem with barriers: • Not so much the direct cost of the operation itself • But that it prevents the program from adjusting to small variations • E.g., K phases, separated by barriers (or scalar reductions) • Load is effectively balanced, but • In each phase, there may be slight non-deterministic load imbalance • Let L_{i,j} be the load on the i'th processor in the j'th phase • In NAMD, using Charm++'s message-driven execution: • The energy reductions were made asynchronous • No other global barriers are used in cut-off simulations
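Filling in the arithmetic the L_{i,j} notation sets up (a reconstruction of the intended comparison): with a barrier after every phase, a step pays for the slowest processor of each phase, whereas without barriers it only pays for the slowest total across phases,

  \sum_{j=1}^{K} \max_i L_{i,j} \;\;\ge\;\; \max_i \sum_{j=1}^{K} L_{i,j} ,

and the gap grows when the small per-phase imbalances hit different processors in different phases, which is exactly the non-deterministic jitter described above.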

  28. Handling Stretches • Challenge: • NAMD still did not scale well to 3000 processors with 4 processors per node • Due to stretches: inexplicable increases in compute time or communication gaps at random (but few) points • Stretches caused by: • Operating system, file system, and resource-management daemons interfering with the job • A badly configured network API • Messages waiting for the rendezvous of the previous message to be acknowledged, leading to stretches in the ISends • Managing stretches: • Use blocking receives • Give the OS time to run daemons when the job process is idle • Fine-tune the network layer • See also: a recent analysis of the causes of stretches by Fabrizio Petrini
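One way to read "give the OS time ... when the job process is idle" is that the process yields the CPU instead of spin-waiting, so daemons can run in the idle gaps rather than preempting a compute phase. The sketch below is a generic POSIX illustration of that idea only; it is not NAMD's or Charm++'s actual mechanism, and the messageAvailable predicate is hypothetical.

  #include <sched.h>
  #include <time.h>
  #include <functional>

  // Wait for work without spinning hot: yield the CPU so OS, file-system,
  // and resource-manager daemons can run while we are idle, instead of
  // preempting a compute phase later (the "stretches" above).
  void idleWait(const std::function<bool()>& messageAvailable) {
    while (!messageAvailable()) {
      sched_yield();                 // give up the rest of the timeslice
      timespec ts{0, 100000};        // and sleep ~100 microseconds
      nanosleep(&ts, nullptr);
    }
  }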

  29. (Timeline figure; scale marker: 100 milliseconds)

  30. Stretch Removal • Histogram views: number of function executions as a function of their granularity (note: log scale on the Y-axis) • Before optimizations: over 16 large stretched calls • After optimizations: about 5 large stretched calls, the largest of them much smaller, and almost all calls take less than 3.2 ms

  31. Future Work • Profile (a) and timeline (b) views of a 3000-processor run • Load balancing is still a possible issue • These observations indicate that further performance improvement may be possible!

  32. Summary and Conclusion • Processor virtualization • A useful technique for complex applications • Charm++ and AMPI embody this • Can be downloaded at http://charm.cs.uiuc.edu • Projections: • A performance analysis tool especially suited for processor virtualization • NAMD performance was optimized to scale to 3000 processors using Projections • Future: • Further automation of analysis • On-demand displays (via a query-and-display language)
