
Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++


Presentation Transcript


  1. Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++ James Phillips Beckman Institute, University of Illinois http://www.ks.uiuc.edu/Research/namd/ Chao Mei Parallel Programming Lab, University of Illinois http://charm.cs.illinois.edu/

  2. UIUC Beckman Institute is a “home away from home” for interdisciplinary researchers Theoretical and Computational Biophysics Group

  3. Biomolecular simulations are our computational microscope Ribosome: synthesizes proteins from genetic information, target for antibiotics Silicon nanopore: bionanodevice for sequencing DNA efficiently

  4. Our goal for NAMD is practical supercomputing for NIH researchers • 44,000 users can’t all be computer experts. • 11,700 have downloaded more than one version. • 2300 citations of NAMD reference papers. • One program for all platforms. • Desktops and laptops – setup and testing • Linux clusters – affordable local workhorses • Supercomputers – free allocations on TeraGrid • Blue Waters – sustained petaflop/s performance • User knowledge is preserved. • No change in input or output files. • Run any simulation on any number of cores. • Available free of charge to all. Phillips et al., J. Comp. Chem.26:1781-1802, 2005.

  5. NAMD uses a hybrid force-spatial parallel decomposition Kale et al., J. Comp. Phys. 151:283-312, 1999. • Spatially decompose data and communication. • Separate but related work decomposition. • “Compute objects” facilitate an iterative, measurement-based load balancing system.
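A minimal sketch of the spatial half of this decomposition, assuming cutoff-sized cells (“patches”) so that short-range interactions only involve a patch and its immediate neighbors; illustrative C++ only, not NAMD source:

```cpp
// Bin atoms into cutoff-sized "patches" so that short-range forces only
// require a patch and its 26 spatial neighbors. Illustrative structures only.
#include <cmath>
#include <map>
#include <tuple>
#include <vector>

struct Atom { double x, y, z; };

using PatchIndex = std::tuple<int, int, int>;

std::map<PatchIndex, std::vector<int>>
decompose(const std::vector<Atom>& atoms, double cutoff) {
    std::map<PatchIndex, std::vector<int>> patches;
    for (int i = 0; i < (int)atoms.size(); ++i) {
        // Patch edge length >= cutoff, so any interacting partner of atom i
        // lives in the same patch or one of the 26 adjacent patches.
        PatchIndex p{(int)std::floor(atoms[i].x / cutoff),
                     (int)std::floor(atoms[i].y / cutoff),
                     (int)std::floor(atoms[i].z / cutoff)};
        patches[p].push_back(i);
    }
    return patches;
}
```

The separate work decomposition then assigns one “compute object” per interacting patch pair, and these objects are what the measurement-based load balancer migrates.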

  6. Charm++ overlaps NAMD algorithms Phillips et al., SC2002. Objects are assigned to processors, queued as data arrives, and executed in priority order.
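A minimal sketch of message-driven, priority-ordered execution in the spirit of the slide above (illustrative C++; not the actual Charm++ scheduler):

```cpp
// Objects' work is queued as its data arrives and executed in priority order.
#include <functional>
#include <queue>
#include <vector>

struct Task {
    int priority;                  // smaller value = more urgent (e.g., PME work)
    std::function<void()> work;    // the compute object's entry method
};

struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.priority > b.priority;   // min-heap on priority value
    }
};

class Scheduler {
    std::priority_queue<Task, std::vector<Task>, ByPriority> queue_;
public:
    // Called when the data for some object's method has arrived.
    void enqueue(int priority, std::function<void()> work) {
        queue_.push({priority, std::move(work)});
    }
    // Drain the queue, always running the most urgent ready task first.
    void run() {
        while (!queue_.empty()) {
            Task t = queue_.top();
            queue_.pop();
            t.work();
        }
    }
};
```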

  7. NAMD adjusts grainsize to match parallelism to processor count • Tradeoff between parallelism and overhead • Maximum patch size is based on the cutoff • Ideally one or more patches per processor • To double the patch count, split in the x, y, and z dimensions • The number of computes grows much faster! • Hard to automate completely • Also need to select the number of PME pencils • Computes partitioned in the outer atom loop • Old: heuristic based on distance and atom count • New: measurement-based compute partitioning
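A back-of-the-envelope sketch of why the number of pairwise computes grows much faster than the number of patches when patches are split along x, y, and z; the cutoff and patch edge lengths below are assumptions for illustration only:

```cpp
// Splitting patches in x, y, and z gives 8x more patches, but each smaller
// patch must reach further (in patch units) to cover the cutoff, so the
// number of pairwise "compute" objects grows much faster than 8x.
#include <cmath>
#include <cstdio>

long long pairComputes(long long patchesPerDim, double patchEdge, double cutoff) {
    long long patches = patchesPerDim * patchesPerDim * patchesPerDim;
    // A patch interacts with all patches within ceil(cutoff / patchEdge) cells.
    long long reach = (long long)std::ceil(cutoff / patchEdge);
    long long neighbors = (2 * reach + 1) * (2 * reach + 1) * (2 * reach + 1) - 1;
    return patches * neighbors / 2 + patches;   // each pair counted once, plus self
}

int main() {
    // Illustrative numbers: a 12 Angstrom cutoff, patch edge shrinking as we split.
    std::printf("computes = %lld\n", pairComputes(10, 12.0, 12.0)); // 1,000 patches
    std::printf("computes = %lld\n", pairComputes(20, 6.0, 12.0));  // 8,000 patches
}
```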

  8. Measurement-based grainsize tuning enables scalable implicit solvent simulation Before - Heuristic (256 cores) After - Measurement-based (512 cores)

  9. The age of petascale biomolecular simulation is near

  10. Larger machines enable larger simulations

  11. Target is still ~100 atoms per thread • 2002 Gordon Bell Award: ATP synthase, 300K atoms, PSC Lemieux, 3,000 cores • Now: chromatophore, 100M atoms, Blue Waters, 300,000 cores and 1.2M threads

  12. Scale brings other challenges • Limited memory per core • Limited memory per node • Finicky parallel filesystems • Limited inter-node bandwidth • Long load balancer runtimes Which is why we collaborate with PPL!

  13. Challenges in 100M-atom Biomolecule Simulation • How to overcome sequential bottleneck? • Initialization • Output trajectory & restart data • How to achieve good strong-scaling results? • Charm++ Runtime

  14. Loading Data into System (1) • Traditionally done on a single core • Fine when the molecule is small • Result for the 100M-atom system: • Memory: 40.5 GB! • Time: 3301.9 sec!

  15. Loading Data into System (2) • Compression scheme • An atom “signature” represents the common attributes of an atom • Supports more science simulation parameters • However, not enough • Memory: 12.8 GB! • Time: 125.5 sec!
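A minimal sketch of the signature idea: store each unique combination of static per-atom attributes once and keep only a small index per atom; the fields shown are illustrative, not the actual NAMD layout:

```cpp
// Many atoms share identical static attributes (type, charge, mass, bonded
// patterns); deduplicate them into shared "signatures" to shrink memory.
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

struct Signature {                 // attributes common to many atoms (illustrative)
    int    vdwType;
    double charge;
    double mass;
    bool operator<(const Signature& o) const {
        return std::tie(vdwType, charge, mass) < std::tie(o.vdwType, o.charge, o.mass);
    }
};

struct CompressedAtoms {
    std::vector<Signature> signatures;   // unique signatures only
    std::vector<uint32_t>  sigIndex;     // one small index per atom
};

CompressedAtoms compress(const std::vector<Signature>& perAtom) {
    CompressedAtoms out;
    std::map<Signature, uint32_t> seen;
    for (const auto& s : perAtom) {
        auto it = seen.find(s);
        if (it == seen.end()) {
            it = seen.emplace(s, (uint32_t)out.signatures.size()).first;
            out.signatures.push_back(s);
        }
        out.sigIndex.push_back(it->second);
    }
    return out;
}
```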

  16. Loading Data into System (3) • Parallelizing initialization • #input procs: a parameter chosen by the user or auto-computed at runtime • First, each input proc loads 1/N of all atoms • Second, atoms are shuffled with neighbor procs for later spatial decomposition • Good enough: e.g., with 600 input procs • Memory: 0.19 GB • Time: 12.4 sec
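A minimal sketch of the two-phase parallel input: each input processor first reads a contiguous 1/N slice of the atoms, then forwards each atom to the processor that owns its spatial region. MPI and file I/O are omitted, and the 1-D slab ownership below is an assumption for illustration:

```cpp
// Phase 1: who reads which slice of the file. Phase 2: who owns an atom
// after the spatial shuffle. Both are pure index computations here.
#include <cstdint>
#include <utility>

// Atom index range [begin, end) read by input processor `rank` of `nInput`.
std::pair<int64_t, int64_t> readRange(int64_t nAtoms, int rank, int nInput) {
    int64_t per = nAtoms / nInput, extra = nAtoms % nInput;
    int64_t begin = rank * per + (rank < extra ? rank : extra);
    int64_t end   = begin + per + (rank < extra ? 1 : 0);
    return {begin, end};
}

// After reading, an atom at coordinate x (box of length boxLen split into
// nInput slabs) is shuffled to the input processor owning that slab.
int spatialOwner(double x, double boxLen, int nInput) {
    int slab = (int)(x / boxLen * nInput);
    return slab < 0 ? 0 : (slab >= nInput ? nInput - 1 : slab);
}
```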

  17. Output Trajectory & Restart Data (1) • At least 4.8 GB written to the file system per output step • A target of tens of ms per step makes this even more critical • Parallelizing output • Each output proc is responsible for a portion of the atoms • Output to a single file for compatibility
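A minimal sketch of single-file parallel output, assuming MPI-IO purely for illustration (the slide only states that each output processor writes its portion of the atoms into one shared file):

```cpp
// Each output processor writes its atoms' coordinates at a precomputed byte
// offset in one shared file, so the result stays a single compatible file
// without funneling everything through one rank.
#include <mpi.h>
#include <vector>

void writeFrame(MPI_Comm comm, const char* path,
                const std::vector<double>& myCoords,  // x,y,z per owned atom
                long long firstAtom)                  // global index of first owned atom
{
    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each atom contributes 3 doubles; offsets are fixed by global atom index,
    // so writers never overlap and need no coordination during the write.
    MPI_Offset offset = (MPI_Offset)firstAtom * 3 * sizeof(double);
    MPI_File_write_at(fh, offset, myCoords.data(), (int)myCoords.size(),
                      MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```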

  18. Output Issue (1)

  19. Output Issue (2) • Multiple independent files • Post-processing into a single file

  20. Initial Strong Scaling on Jaguar (6,720; 53,760; 107,520; and 224,076 cores)

  21. Multi-threading the MPI-based Charm++ Runtime • Exploits multicore nodes • Portable, since it is based on MPI • On each node: • a “processor” is represented as a thread • N “worker” threads share 1 “communication” thread • Worker threads: handle only computation • Communication thread: handles only network messages
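A minimal sketch of this division of labor using plain C++ threads (not the Charm++ machine layer): workers only compute and enqueue outgoing messages, and one communication thread per node drains the queue and would talk to the network:

```cpp
#include <atomic>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> outbox;        // messages produced by worker threads
std::mutex              outboxLock;
std::atomic<bool>       done{false};

void worker(int id, int steps) {
    for (int s = 0; s < steps; ++s) {
        std::string msg = "forces from worker " + std::to_string(id);
        std::lock_guard<std::mutex> g(outboxLock);
        outbox.push(std::move(msg));   // hand off to the comm thread
    }
}

void commThread() {
    for (;;) {
        std::string msg;
        bool have = false;
        {
            std::lock_guard<std::mutex> g(outboxLock);
            if (!outbox.empty()) {
                msg = std::move(outbox.front());
                outbox.pop();
                have = true;
            } else if (done.load()) {
                return;                // workers finished and queue drained
            }
        }
        if (have) {
            // The real runtime would post the network send here (e.g., MPI_Isend)
            // and also poll for incoming messages.
        }
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(worker, i, 1000);
    std::thread comm(commThread);
    for (auto& w : workers) w.join();
    done = true;
    comm.join();
}
```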

  22. Benefits of SMP Mode (1) • Intra-node communication is faster • Messages are transferred as pointers • Program launch time is reduced • 224K cores: ~6 min → ~1 min • Transparent to application developers • A correct Charm++ program runs in both non-SMP and SMP modes

  23. Benefits of SMP Mode (2) • Reduces memory footprint further • Read-only data structures are shared • Memory footprint of the MPI library is reduced • On average a 7X reduction! • Better cache performance • Enables the 100M-atom run on Intrepid (BlueGene/P, 2 GB/node)
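A minimal sketch of the shared read-only data point, with illustrative table types: large static structures are allocated once per node process and referenced by every worker thread, instead of once per core as in non-SMP mode:

```cpp
#include <memory>
#include <thread>
#include <vector>

struct MoleculeTables {            // large, read-only after startup (illustrative)
    std::vector<double> charges;
    std::vector<int>    vdwTypes;
};

void workerThread(std::shared_ptr<const MoleculeTables> mol, int /*rank*/) {
    // All workers read the same tables; no per-thread copy is made.
    volatile double q = mol->charges.empty() ? 0.0 : mol->charges[0];
    (void)q;
}

int main() {
    auto mol = std::make_shared<MoleculeTables>();
    mol->charges.assign(1000000, -0.5);     // illustrative sizes
    mol->vdwTypes.assign(1000000, 1);

    std::vector<std::thread> workers;
    for (int t = 0; t < 8; ++t) workers.emplace_back(workerThread, mol, t);
    for (auto& w : workers) w.join();
}
```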

  24. Potential Bottleneck on Communication Thread • Computation & Communication Overlap alleviates the problem to some extent

  25. Node-aware Communication • In the runtime: multicast, broadcast, etc. • E.g., a series of broadcasts during startup: 2.78X reduction • In the application: multicast tree • Incorporate knowledge of the computation to guide construction of the tree • Least-loaded node as intermediate node
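A minimal sketch of a node-aware multicast tree: one inter-node message per destination node, with the least-loaded PE on each node chosen as the intermediate that forwards within its node (data structures are assumptions):

```cpp
#include <map>
#include <vector>

struct Dest { int pe; int node; double load; };

struct NodeGroup {
    int intermediatePe = -1;     // receives the single inter-node copy
    std::vector<int> localPes;   // PEs the intermediate forwards to
};

std::map<int, NodeGroup> buildMulticastTree(const std::vector<Dest>& dests) {
    std::map<int, NodeGroup> tree;
    std::map<int, double> bestLoad;   // lightest load seen so far per node
    for (const auto& d : dests) {
        NodeGroup& g = tree[d.node];
        g.localPes.push_back(d.pe);
        auto it = bestLoad.find(d.node);
        if (it == bestLoad.end() || d.load < it->second) {
            bestLoad[d.node] = d.load;
            g.intermediatePe = d.pe;  // least-loaded PE on this node so far
        }
    }
    return tree;
}
```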

  26. Handle Burst of Messages (1) • A global barrier after each timestep due to the constant-pressure algorithm • Amplified further because there is only 1 comm thread per node

  27. Handle Burst of Messages (2) • Workflow of the comm thread • Alternates among send/release/receive modes • Dynamic flow control • Decides when to exit one mode for another • E.g., 12.3% on 4,480 nodes (53,760 cores)
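A minimal sketch of the alternating send/release/receive loop with a simple per-mode budget, so a burst of messages in one mode cannot starve the others; the names and limits here are assumptions, not the actual Charm++ flow control:

```cpp
#include <deque>
#include <string>

std::deque<std::string> sendQ, releaseQ, recvQ;

// Process at most `budget` items from a queue, then leave that mode.
template <typename Handler>
void drainWithBudget(std::deque<std::string>& q, int budget, Handler handle) {
    while (!q.empty() && budget-- > 0) {
        handle(q.front());
        q.pop_front();
    }
}

void commThreadStep(int budgetPerMode) {
    drainWithBudget(sendQ, budgetPerMode, [](const std::string&) {
        /* post network send (e.g., MPI_Isend in the real runtime) */
    });
    drainWithBudget(releaseQ, budgetPerMode, [](const std::string&) {
        /* free buffers of completed sends */
    });
    drainWithBudget(recvQ, budgetPerMode, [](const std::string&) {
        /* hand a received message to a worker's queue */
    });
}
```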

  28. Hierarchical Load Balancer • The centralized balancer consumes too much memory • Processors are divided into groups • Load balancing is done within each group
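A minimal sketch of the hierarchical idea: divide processors into groups and balance each group independently, so no single balancer needs statistics for every processor; the greedy within-group step is illustrative, not the actual Charm++ strategy:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Balance one group in place: repeatedly shift load from the heaviest to the
// lightest processor until they are within `tolerance` of each other.
void balanceGroup(std::vector<double>& load, double tolerance) {
    for (int iter = 0; iter < 1000; ++iter) {
        auto mn = std::min_element(load.begin(), load.end());
        auto mx = std::max_element(load.begin(), load.end());
        if (*mx - *mn <= tolerance) break;
        double shift = (*mx - *mn) / 2.0;
        *mx -= shift;
        *mn += shift;
    }
}

void hierarchicalBalance(std::vector<double>& peLoad, int groupSize, double tol) {
    for (std::size_t start = 0; start < peLoad.size(); start += groupSize) {
        std::size_t end = std::min(peLoad.size(), start + groupSize);
        std::vector<double> group(peLoad.begin() + start, peLoad.begin() + end);
        balanceGroup(group, tol);                       // each group balanced locally
        std::copy(group.begin(), group.end(), peLoad.begin() + start);
    }
}
```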

  29. Improvement due to Load Balancing

  30. Performance Improvement of SMP over non-SMP on Jaguar

  31. Strong Scaling on Jaguar (2) (6,720; 53,760; 107,520; and 224,076 cores)

  32. Weak Scaling on Intrepid (~1,466 atoms/core; systems from 2M to 100M atoms) • The 100M-atom system runs ONLY in SMP mode • Dedicating one core per node to communication in SMP mode (a 25% loss) caused the performance gap

  33. Conclusion and Future Work • I/O bottleneck solved by parallelization • An approach that optimizes both the application and its underlying runtime • SMP mode in the runtime • Continue to improve performance • PME calculation • Integrate and optimize new science codes

  34. Acknowledgement • Gengbin Zheng, Yanhua Sun, Eric Bohm, Chris Harrison, Osman Sarood for the 100M-atom simulation • David Tanner for the implicit solvent work • Machines: Jaguar@NCCS, Intrepid@ANL supported by DOE • Funds: NIH, NSF

  35. Thanks!
