Scalability and interoperable libraries in NAMD • Laxmikant (Sanjay) Kale • Theoretical Biophysics group and Department of Computer Science • University of Illinois at Urbana-Champaign
Contributors • PIs: • Laxmikant Kale, Klaus Schulten, Robert Skeel • NAMD 1: • Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson • NAMD2: • M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ..
Middle layers • Applications • “Middle Layers”: Languages, Tools, Libraries • Parallel Machines
Molecular Dynamics • Collection of [charged] atoms, with bonds • Newtonian mechanics • At each time-step • Calculate forces on each atom • bonds: • non-bonded: electrostatic and van der Waals • Calculate velocities and advance positions • 1 femtosecond time-step, millions needed! • Thousands of atoms (1,000 - 100,000)
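The per-time-step loop on this slide can be sketched as follows. This is a minimal illustration, assuming a semi-implicit Euler update in one dimension; the slides do not say which integrator NAMD uses, and `step`, `spring`, and the unit-free numbers are hypothetical.

```python
def step(positions, velocities, masses, forces_fn, dt):
    """Advance one time-step: forces -> velocities -> positions (1-D for brevity)."""
    forces = forces_fn(positions)
    # Newton's second law: a = F / m, applied over the time-step dt.
    new_v = [v + f / m * dt for v, f, m in zip(velocities, forces, masses)]
    # Positions advance with the freshly updated velocities.
    new_x = [x + v * dt for x, v in zip(positions, new_v)]
    return new_x, new_v


def spring(positions, k=1.0, r0=1.0):
    """Hypothetical bond force: two atoms joined by a harmonic spring of rest length r0."""
    stretch = (positions[1] - positions[0]) - r0
    return [k * stretch, -k * stretch]


x, v = step([0.0, 2.0], [0.0, 0.0], [1.0, 1.0], spring, dt=0.1)
```

A real simulation repeats `step` millions of times, which is why the cost of the force calculation dominates.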
Cut-off radius • Use of cut-off radius to reduce work • 8 - 14 Å • Faraway charges ignored! • 80-95 % work is non-bonded force computations • Some simulations need faraway contributions • Periodic systems: Ewald, Particle-Mesh Ewald • Aperiodic systems: FMA • Even so, cut-off based computations are important: • near-atom calculations are part of the above • multiple time-stepping is used: k cut-off steps, 1 PME/FMA
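The two ideas on this slide, the cut-off test and the "k cut-off steps, 1 PME/FMA" schedule, can be sketched as below. This is a brute-force illustration only: `pairs_within_cutoff` and `needs_long_range` are hypothetical names, the all-pairs loop is O(N²) and is not how a production code finds neighbours, and treating every (k+1)-th step as the long-range step is one assumed reading of the slide.

```python
import math
from itertools import combinations


def pairs_within_cutoff(coords, cutoff):
    """Return index pairs (i, j), i < j, whose distance is at most `cutoff`."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) <= cutoff
    ]


def needs_long_range(step_index, k):
    """True on every (k+1)-th step: k cut-off-only steps, then one long-range step."""
    return (step_index + 1) % (k + 1) == 0


# Three atoms on a line (coordinates in Å), 12 Å cut-off: only the first pair interacts.
near = pairs_within_cutoff([(0, 0, 0), (10, 0, 0), (30, 0, 0)], cutoff=12.0)
schedule = [needs_long_range(s, k=3) for s in range(8)]
```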
Scalability • The program should scale up to use a large number of processors. • But what does that mean? • An individual simulation isn’t truly scalable • Better definition of scalability: • If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
Isoefficiency • Quantify scalability • (Work of Vipin Kumar, U. Minnesota) • How much increase in problem size is needed to retain the same efficiency on a larger machine? • Efficiency = Seq. Time / (P · Parallel Time) • Parallel Time = computation + communication + idle
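The efficiency definition above translates directly into a small helper. The function name and the timing values are hypothetical, chosen only to show the arithmetic.

```python
def parallel_efficiency(seq_time, num_procs, computation, communication, idle):
    """Efficiency = sequential time / (P * parallel time)."""
    parallel_time = computation + communication + idle
    return seq_time / (num_procs * parallel_time)


# Hypothetical run: 100 s sequentially; on 10 processors each spends
# 10 s computing, 2 s communicating and 0.5 s idle.
eff = parallel_efficiency(100.0, 10, computation=10.0, communication=2.0, idle=0.5)
```

Isoefficiency asks how fast `seq_time` (the problem size) must grow with `num_procs` for this value to stay constant.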
Traditional Approaches • Replicated Data: • All atom coordinates stored on each processor • Non-bonded forces distributed evenly • Analysis: Assume N atoms, P processors • Computation: O(N/P) • Communication: O(N log P) • Communication/Computation ratio: O(P log P) • Fraction of communication increases with number of processors, independent of problem size! • So, not scalable by this definition
Atom decomposition • Partition the Atoms array across processors • Nearby atoms may not be on the same processor • Communication: O(N) per processor • Communication/Computation: O(N)/(N/P): O(P) • Again, not scalable by our definition
Force Decomposition • Distribute force matrix to processors • Matrix is sparse, non-uniform • Each processor has one block • Communication: • Ratio: • Better scalability in practice • (can use 100+ processors) • Plimpton: • Hwang, Saltz, et al: • 6% on 32 PEs, 36% on 128 processors • Yet not scalable in the sense defined here!
Spatial Decomposition • Allocate close-by atoms to the same processor • Three variations possible: • Partitioning into P boxes, 1 per processor • Good scalability, but hard to implement • Partitioning into fixed size boxes, each a little larger than the cutoff distance • Partitioning into smaller boxes • Communication: O(N/P) • So, scalable in principle
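The communication-to-computation ratios quoted for replicated data, atom decomposition, and spatial decomposition can be compared numerically. The sketch below uses only the asymptotic terms from the slides with all constants dropped; the base-2 logarithm and the name `comm_to_comp_ratio` are assumptions, and force decomposition is left out because its terms are not given here.

```python
import math


def comm_to_comp_ratio(scheme, n_atoms, n_procs):
    """Ratio of per-step communication to computation, constants ignored."""
    computation = n_atoms / n_procs  # O(N/P) in every scheme
    communication = {
        "replicated": n_atoms * math.log2(n_procs),  # O(N log P)
        "atom": n_atoms,                             # O(N)
        "spatial": n_atoms / n_procs,                # O(N/P)
    }[scheme]
    return communication / computation


for scheme in ("replicated", "atom", "spatial"):
    print(scheme, [comm_to_comp_ratio(scheme, 100_000, p) for p in (8, 64, 512)])
```

For the first two schemes the ratio grows with P whatever N is, so adding atoms cannot restore efficiency; for spatial decomposition it stays constant.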
Spatial Decomposition in NAMD • NAMD 1 used spatial decomposition • Good theoretical isoefficiency, but for a fixed size system, load balancing problems • For midsize systems, got good speedups up to 16 processors… • Use the symmetry of Newton’s 3rd law to facilitate load balancing
Spatial Decomposition • But the load balancing problems are still severe
FD + SD • Now, we have many more objects to load balance: • Each diamond can be assigned to any processor • Number of diamonds (3D): • 14·Number of Patches
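The factor of 14 can be checked by enumeration: each patch has 26 neighbours in 3D, Newton's 3rd law lets each neighbouring pair share one force object (13 per patch), and one more object handles interactions within the patch itself. The sketch below assumes a periodic grid with at least 3 patches per dimension; `pair_objects` is a hypothetical name.

```python
from itertools import product


def pair_objects(nx, ny, nz):
    """One force object per unordered pair of neighbouring patches, plus one per patch
    for its own atoms, on a periodic nx x ny x nz grid (each dimension >= 3)."""
    objects = set()
    for p in product(range(nx), range(ny), range(nz)):
        for d in product((-1, 0, 1), repeat=3):
            q = ((p[0] + d[0]) % nx, (p[1] + d[1]) % ny, (p[2] + d[2]) % nz)
            # frozenset makes (p, q) and (q, p) the same object; d == (0, 0, 0) gives {p}.
            objects.add(frozenset((p, q)))
    return objects


num_patches = 4 * 4 * 4
assert len(pair_objects(4, 4, 4)) == 14 * num_patches
```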
Bond Forces • Multiple types of forces: • Bonds (2), Angles (3), Dihedrals (4), .. • Luckily, each involves atoms in neighboring patches only • Straightforward implementation: • Send message to all neighbors, • receive forces from them • 26*2 messages per patch!
Bonded Forces: • Assume one patch per processor: • an angle force involving atoms in patches: • (x1,y1,z1), (x2,y2,z2), (x3,y3,z3) • is calculated in patch: (max{xi}, max{yi}, max{zi})
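The ownership rule on this slide is a coordinate-wise maximum over the patch indices of the atoms involved. A minimal sketch, with `owning_patch` as a hypothetical name and integer patch indices assumed:

```python
def owning_patch(patch_coords):
    """Patch that computes a bonded term, given the patch index of each atom involved."""
    return tuple(max(axis) for axis in zip(*patch_coords))


# Angle term whose three atoms sit in patches (1,2,3), (2,2,2) and (1,3,3).
owner = owning_patch([(1, 2, 3), (2, 2, 2), (1, 3, 3)])  # (2, 3, 3)
```

Because every patch applies the same rule, each bonded term is computed exactly once without any negotiation between patches. Note that the owner need not contain any of the atoms itself, as in the example above.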
Implementation • Multiple objects per processor • Different types: patches, pairwise forces, bonded forces • Each may have its data ready at different times • Need ability to map and remap them • Need prioritized scheduling • Charm++ supports all of these
Charm++ • Parallel C++ with Data Driven Objects • Object Groups: • global object with a “representative” on each PE • Asynchronous method invocation • Prioritized scheduling • Mature, robust, portable • http://charm.cs.uiuc.edu
Data driven execution • [Figure: two processors, each with its own Scheduler and Message Q]
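The data-driven model, with a per-processor scheduler draining a prioritized message queue, can be illustrated with a toy in Python. This is not the Charm++ API: the `Scheduler` class, its `send` and `run` methods, and the convention that a lower number means higher priority are all assumptions made for the sketch.

```python
import heapq
import itertools


class Scheduler:
    """Toy per-processor scheduler: a priority queue of pending method invocations."""

    def __init__(self):
        self._queue = []
        self._order = itertools.count()  # tie-breaker: FIFO within one priority level

    def send(self, priority, method, *args):
        """Asynchronous invocation: enqueue the message and return immediately."""
        heapq.heappush(self._queue, (priority, next(self._order), method, args))

    def run(self):
        """Pick the most urgent message, run its method to completion, repeat."""
        while self._queue:
            _, _, method, args = heapq.heappop(self._queue)
            method(*args)  # the method may itself call send() to enqueue more work


log = []
sched = Scheduler()
sched.send(2, log.append, "bonded")
sched.send(1, log.append, "nonbonded")
sched.send(2, log.append, "integrate")
sched.run()
```

No object ever blocks waiting for data; whichever object has a message ready gets the processor, which is what lets patches and force objects overlap communication with computation.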
Load Balancing • Is a major challenge for this application • especially for a large number of processors • Unpredictable workloads • Each diamond (force object) and patch encapsulates a variable amount of work • Static estimates are inaccurate • Measurement based Load Balancing Framework • Robert Brunner’s recent Ph.D. thesis • Very slow variations across timesteps
Bipartite graph balancing • Background load: • Patches (integration, ..) and bond-related forces: • Migratable load: • Non-bonded forces • bond-related forces involving atoms of the same patch • Bipartite communication graph • between migratable and non-migratable objects • Challenge: • Balance Load while minimizing communication
Load balancing • Collect timing data for several cycles • Run heuristic load balancer • Several alternative ones • Re-map and migrate objects accordingly • Registration mechanisms facilitate migration • Needs a separate talk!
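The "run heuristic load balancer" step can be illustrated with one simple greedy heuristic: take the measured loads of the migratable objects, heaviest first, and place each on the currently least-loaded processor, starting from each processor's non-migratable background load. This is an assumed example, not necessarily one of the balancers used in NAMD, and it ignores communication cost; `greedy_balance` and the sample timings are hypothetical.

```python
import heapq


def greedy_balance(background, object_loads):
    """background: per-processor non-migratable load.
    object_loads: {object_id: measured time} for migratable objects.
    Returns (object -> processor mapping, final per-processor loads)."""
    heap = [(load, pe) for pe, load in enumerate(background)]
    heapq.heapify(heap)
    assignment = {}
    # Heaviest objects first, each onto the least-loaded processor so far.
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        pe_load, pe = heapq.heappop(heap)
        assignment[obj] = pe
        heapq.heappush(heap, (pe_load + load, pe))
    final = [0.0] * len(background)
    for load, pe in heap:
        final[pe] = load
    return assignment, final


assignment, final = greedy_balance([3.0, 1.0], {"a": 4.0, "b": 2.0, "c": 1.0})
```

Feeding the function measured rather than estimated loads is the point of the measurement-based approach described above.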
Performance: size of system • Performance data on Cray T3E
Multi-paradigm programming • Long-range electrostatic interactions • Some simulations require this • Contributions of faraway atoms can be calculated infrequently • PVM-based library, DPMTA • developed at Duke by John Board et al. • Patch life cycle • Better expressed as a thread
Converse • Supports multi-paradigm programming • Provides portability • Makes it easy to implement RTS for new paradigms • Several languages/libraries: • Charm++, threaded MPI, PVM, Java, md-perl, pc++, Nexus, Path, Cid, CC++, DP, Agents,..
NAMD2 • In production use • Internally for about a year • Several simulations completed/published • Fastest MD program? We think so • Modifiable/extensible • Steered MD • Free energy calculations
Real Application for CS research? • Benefits • Subtle and complex research problems uncovered only with real application • Satisfaction of “real” concrete contribution • With careful planning, you can truly enrich the “middle layers” • Bring back a rich variety of relevant CS problems • Apply to other domains: Rockets? Casting?