
Optimal Biomolecular Simulations in Heterogeneous Computing Architectures



Presentation Transcript


  1. Optimal Biomolecular Simulations in Heterogeneous Computing Architectures. Scott Hampton, Pratul Agarwal. Funded by: (sponsor logos)

  2. Motivation: Better time-scales. [Figure: molecular dynamics (MD) reach in 2005 vs. 2010 on a time-scale axis from 10^-15 s to 10^0 s.] Protein dynamical events on the same axis: bond vibration; elastic vibration of the globular region (<0.1 Å); side-chain and loop motions (0.1–10 Å); protein breathing motions; folding of 2° and 3° structure (10–100 Å); kBT/h is marked on the axis. Experimental techniques covering these time-scales: neutron scattering, neutron spin echo, NMR (R1, R2 and NOE), NMR residual dipolar coupling, H/D exchange. Agarwal: Biochemistry (2004) 43, 10605-10618; Microbial Cell Factories (2006) 5:2; J. Phys. Chem. B 113, 16669-80.

  3. A few words about how we see biomolecules. [Figure: enzyme with substrate, protein vibrations, hydration shell and bulk solvent] • Solvent is important: explicit solvent needed • Fast motions in the active site • Slow dynamics of the entire molecule. Agarwal, P. K., Enzymes: An integrated view of structure, dynamics and function, Microbial Cell Factories (2006) 5:2

  4. Introduction: MD on future architectures • Molecular biophysics simulations • Molecular dynamics: time-evolution behavior • Molecular docking: protein-ligand interactions • Two alternate needs • High-throughput computing (desktops and clusters) • "Speed of light" (supercomputers) • Our focus is on high-performance architectures • Strong scaling: simulating a fixed system on 100s-1000s of nodes • Longer time-scales: a microsecond or better • Larger simulation sizes (more realistic systems)

  5. A look at history. [Chart: Top 500 computers in the world, historical trends with projections to 2012, 2015 and 2018; performance of the N=1 and N=500 systems from 1 Gflop/s to 100 Pflop/s] Courtesy: Al Geist (ORNL), Jack Dongarra (UTK)

  6. Concurrency • Fundamental assumptions of system software architecture and application design did not anticipate exponential growth in parallelism Courtesy: Al Geist (ORNL)

  7. Future architectures • Fundamentally different architecture • Different from traditional (homogeneous) MPP machines • Very high concurrency: billion-way by 2020 • Increased concurrency on a single node • Increased floating-point (FP) capacity from accelerators • Accelerators (GPUs/FPGAs/Cell/?, etc.) will add heterogeneity. Significant challenges: concurrency, power and resiliency (2012 → 2015 → 2018 …)

  8. MD in heterogeneous environments • “Using FPGA Devices to Accelerate Biomolecular Simulations”, S. R. Alam, P. K. Agarwal, M. C. Smith, J. S. Vetter and D. Caliga, IEEE Computer, 40 (3), 2007. • “Energy Efficient Biomolecular Simulations with FPGA-based Reconfigurable Computing”, A. Nallamuthu, S. S. Hampton, M. C. Smith, S. R. Alam, P. K. Agarwal, in Proceedings of ACM Computing Frontiers 2010, May 2010. • “Towards Microsecond Biological Molecular Dynamics Simulations on Hybrid Processors”, S. S. Hampton, S. R. Alam, P. S. Crozier, P. K. Agarwal, in Proceedings of the 2010 International Conference on High Performance Computing & Simulation (HPCS 2010), June 2010. • “Optimal utilization of heterogeneous resources for biomolecular simulations”, S. S. Hampton, S. R. Alam, P. S. Crozier, P. K. Agarwal, Supercomputing Conference 2010, accepted. Multi-core CPUs & GPUs; FPGAs = low-power solution?

  9. Our target code: LAMMPS • LAMMPS*: highly scalable MD code • Scales to >64K cores • Open-source code • Supports popular force fields: CHARMM & AMBER • Supports a variety of potentials: atomic and meso-scale, chemistry and materials • Very active GPU-LAMMPS community**: NVIDIA GPUs + CUDA, on single workstations and GPU-enabled clusters. *http://lammps.sandia.gov/ **http://code.google.com/p/gpulammps/

  10. Our approach: gain from multi-level parallelism • Off-loading: improving performance in strong scaling • The alternative: run the entire (or most of the) MD step on the GPU • Computations are free, data localization is not • Host-GPU data exchange is expensive • In multi-GPU and multi-node settings this is much worse for an entirely-on-GPU MD • We propose/believe: the best time-to-solution comes not from a single resource but from using most (or all) heterogeneous resources • Keep the parallelism that LAMMPS already has • Spatial decomposition: multi-core processors/MPI. (A small sketch of the per-step host-GPU traffic cost follows.)
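
To make the data-localization point concrete, here is a minimal CUDA sketch (not LAMMPS code; the system size and buffer layout are illustrative assumptions) that times the per-step host-to-device and device-to-host traffic an off-loading scheme pays for positions and forces:

    // Build: nvcc -O2 transfer_cost.cu -o transfer_cost
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const int natoms = 320000;                        // illustrative system size
        const size_t bytes = 3 * natoms * sizeof(double); // x, y, z per atom
        double *h_buf, *d_buf;
        cudaMallocHost((void**)&h_buf, bytes);            // pinned host buffer
        cudaMalloc((void**)&d_buf, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        for (int step = 0; step < 100; ++step) {
            // per-step traffic of an off-load scheme: positions down, forces back
            cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
        }
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("average host<->GPU transfer cost per step: %.3f ms\n", ms / 100.0f);

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }

On a PCIe-attached GPU this cost recurs every timestep, which is why keeping data resident on the device (or overlapping the transfers with useful work) matters for strong scaling.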

  11. Execution schemes: CPU only, GPU as a co-processor, GPU as an accelerator (per-timestep loop, i = 1 to N; {Comm} = communication)
  • Host only (CPU): compute bonded terms; compute the electrostatic and Lennard-Jones interaction terms for energy/forces; collect forces, time integration (update positions), adjust temperature/pressure, print/write output; update the neighbor list if needed. Timing breakdown: {Bond} {Neigh} {Pair} {Other + Outpt}.
  • GPU as a co-processor (v1): the CPU sends atomic positions and the neighbor list to the GPU, which computes the electrostatic and Lennard-Jones terms and returns forces; the CPU then computes bonded terms, collects forces, integrates, adjusts temperature/pressure, writes output, and updates the neighbor list if needed.
  • GPU as an accelerator (v2/v3), with concurrent computations on CPU and GPU: the CPU sends atomic positions; the GPU computes the electrostatic and Lennard-Jones terms and updates the neighbor list if needed, while the CPU computes bonded terms; returned forces are collected for time integration, temperature/pressure adjustment and output. Data locality! (A minimal sketch of one such step follows.)
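
The accelerator scheme can be sketched as below. This is a minimal, compilable CUDA illustration of one timestep, not the GPU-LAMMPS implementation: the non-bonded kernel, the CPU-side bonded routine and the "integration" are placeholders, and the atom count simply mirrors the JAC benchmark.

    // Build: nvcc -O2 accel_step.cu -o accel_step
    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder for the non-bonded (Lennard-Jones + electrostatics) kernel.
    __global__ void nonbonded_forces(const double *pos, double *force, int n3) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n3) force[i] = 0.1 * pos[i];          // stand-in for the pair computation
    }

    // Placeholder for the CPU-side bonded terms (bonds, angles, dihedrals).
    void compute_bonded_cpu(const double *pos, double *force, int n3) {
        for (int i = 0; i < n3; ++i) force[i] = 0.01 * pos[i];
    }

    int main() {
        const int natoms = 23558, n3 = 3 * natoms;     // JAC-sized system
        const size_t bytes = n3 * sizeof(double);
        double *h_pos, *h_fnb, *h_fb = new double[n3];
        cudaMallocHost((void**)&h_pos, bytes);         // pinned, so copies can be async
        cudaMallocHost((void**)&h_fnb, bytes);
        for (int i = 0; i < n3; ++i) h_pos[i] = 1.0;

        double *d_pos, *d_force;
        cudaMalloc((void**)&d_pos, bytes);
        cudaMalloc((void**)&d_force, bytes);
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        for (int step = 0; step < 10; ++step) {
            // ship positions and launch the non-bonded kernel; calls return immediately
            cudaMemcpyAsync(d_pos, h_pos, bytes, cudaMemcpyHostToDevice, stream);
            nonbonded_forces<<<(n3 + 255) / 256, 256, 0, stream>>>(d_pos, d_force, n3);
            cudaMemcpyAsync(h_fnb, d_force, bytes, cudaMemcpyDeviceToHost, stream);

            compute_bonded_cpu(h_pos, h_fb, n3);       // overlaps with the GPU work

            cudaStreamSynchronize(stream);             // wait for the non-bonded forces
            for (int i = 0; i < n3; ++i)               // combine forces and "integrate"
                h_pos[i] += 1e-6 * (h_fnb[i] + h_fb[i]);
        }
        printf("done: x[0] = %f\n", h_pos[0]);
        cudaFree(d_pos); cudaFree(d_force);
        cudaFreeHost(h_pos); cudaFreeHost(h_fnb); delete[] h_fb;
        cudaStreamDestroy(stream);
        return 0;
    }

The point of the structure is step 3 of the slide: because the host-to-device copy and kernel launch are asynchronous, the bonded terms cost essentially nothing extra as long as they finish before the GPU does.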

  12. Results: JAC benchmark (Joint AMBER-CHARMM, 23,558 atoms, cut-off based). [Plot: non-bonded performance on a single workstation, Intel Xeon E5540, 2.53 GHz]

  13. Performance: single node (multi-GPU) • Single workstation with 4 Tesla C1060 cards • 10-50X speed-ups; larger systems benefit from data locality • Super-linear speed-ups for larger systems • Beats 100-200 cores of the ORNL Cray XT5 (#1 in the Top500) • Performance metric: ns/day (see the conversion below)
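
For reference, the ns/day metric follows directly from the step rate and the MD timestep $\Delta t$ (in femtoseconds), since a day has 86,400 s and a nanosecond is $10^6$ fs:

    \text{ns/day} \;=\; \text{steps per second} \times \Delta t\,[\mathrm{fs}] \times \frac{86\,400}{10^{6}}

For example, a 2 fs timestep advanced at about 5.8 steps per second corresponds to roughly 1 ns/day.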

  14. Pipelining: scaling on multi-core/multi-node with GPUs (v4) • Challenge: how do all cores use a single GPU (or a limited number of GPUs)? • Solution: use a pipelining strategy* (a minimal sketch follows) * = Hampton, S. S.; Alam, S. R.; Crozier, P. S.; Agarwal, P. K. (2010), Optimal utilization of heterogeneous resources for biomolecular simulations. Supercomputing 2010.
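
One way to picture the pipelining idea: successive work units flow through copy-in → kernel → copy-out stages in separate CUDA streams, so the transfers of one unit overlap the computation of another and a single GPU stays busy. The sketch below is a single-process CUDA illustration of that pattern under assumed names and sizes, not the multi-core scheme of the paper.

    // Build: nvcc -O2 pipeline.cu -o pipeline
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(const double *in, double *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];        // stand-in for a force computation
    }

    int main() {
        const int nchunks = 4, chunk = 1 << 20;   // e.g. one work unit per CPU core
        const size_t bytes = chunk * sizeof(double);
        double *h_in, *h_out, *d_in, *d_out;
        cudaMallocHost((void**)&h_in,  nchunks * bytes);   // pinned for async copies
        cudaMallocHost((void**)&h_out, nchunks * bytes);
        cudaMalloc((void**)&d_in,  nchunks * bytes);
        cudaMalloc((void**)&d_out, nchunks * bytes);
        for (int i = 0; i < nchunks * chunk; ++i) h_in[i] = 1.0;

        cudaStream_t s[nchunks];
        for (int c = 0; c < nchunks; ++c) cudaStreamCreate(&s[c]);

        // Each chunk is queued in its own stream; the hardware overlaps the H2D copy
        // of chunk c+1 with the kernel of chunk c and the D2H copy of chunk c-1.
        for (int c = 0; c < nchunks; ++c) {
            double *hi = h_in + c * chunk, *ho = h_out + c * chunk;
            double *di = d_in + c * chunk, *dout = d_out + c * chunk;
            cudaMemcpyAsync(di, hi, bytes, cudaMemcpyHostToDevice, s[c]);
            work<<<(chunk + 255) / 256, 256, 0, s[c]>>>(di, dout, chunk);
            cudaMemcpyAsync(ho, dout, bytes, cudaMemcpyDeviceToHost, s[c]);
        }
        cudaDeviceSynchronize();
        printf("out[0] = %f\n", h_out[0]);

        for (int c = 0; c < nchunks; ++c) cudaStreamDestroy(s[c]);
        cudaFreeHost(h_in); cudaFreeHost(h_out); cudaFree(d_in); cudaFree(d_out);
        return 0;
    }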

  15. Results: communication overlap • Need to overlap off-node/on-node communication • Very important in strong-scaling mode. [Plot: 2× quad-core Intel Xeon E5540, 2.53 GHz; configurations with more vs. less off-node communication] Important lesson: overlap off-node and on-node communication (a minimal MPI sketch follows).
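
The standard way to get this overlap is to post the halo (ghost-atom) exchange with non-blocking MPI calls, compute forces on interior atoms while the messages are in flight, and only then finish the boundary work. Below is a minimal sketch of the pattern with illustrative names and message sizes; it is not the LAMMPS communication code.

    // Build: mpicxx -O2 overlap.cpp -o overlap
    #include <cstdio>
    #include <vector>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int nghost = 1000;                   // boundary-atom payload (illustrative)
        std::vector<double> send(nghost, rank), recv(nghost, 0.0);
        int right = (rank + 1) % nprocs, left = (rank - 1 + nprocs) % nprocs;

        // post the exchange without blocking
        MPI_Request req[2];
        MPI_Irecv(recv.data(), nghost, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(send.data(), nghost, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

        // ... compute forces on interior atoms here, while messages are in flight ...
        double interior = 0.0;
        for (int i = 0; i < 1000000; ++i) interior += 1e-6;

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);  // ghost atoms have arrived
        // ... now compute the forces that need the ghost atoms in `recv` ...

        if (rank == 0) printf("interior work: %f, ghost[0] = %f\n", interior, recv[0]);
        MPI_Finalize();
        return 0;
    }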

  16. Results: NVIDIA’s Fermi • Early access to a Fermi card: single vs. double precision • Fermi: ~6X more double-precision capability than the Tesla series • Better and more stable MD trajectories. [Plot: double-precision performance; host Intel Xeon E5520, 2.27 GHz]

  17. Results: GPU cluster • 24-node Linux cluster: 4 quad-core CPUs + 1 Tesla card per node • AMD Opteron 8356 (2.3 GHz), InfiniBand DDR • Pipelining allows up to all 16 cores to off-load to 1 card • Improvement in time-to-solution. [Benchmark: protein in water, 320,000 atoms, long-range electrostatics]

  18. Results: GPU cluster • Optimal use: matching the algorithm to the hardware • The best time-to-solution comes from multi-level parallelism, using CPUs AND GPUs • Data locality makes a significant impact on performance. [Plots: 96,000 atoms (cut-off) and 320,000 atoms (PPPM/PME)] Cut-off = short-range forces only; PME = particle mesh Ewald method for long-range electrostatics.

  19. Performance modeling. [Chart series: theoretical limit (computations only); off-loading non-bonded (data transfer included); entire simulation] • Kernel speedup = time to execute the doubly nested for loop of the compute() function (without the cost of data transfer) • Procedure speedup = time for the compute() function to execute, including data-transfer costs in the GPU setup • Overall speedup = run-time (CPU only) / run-time (CPU + GPU). (These metrics are restated as formulas below.)
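
Written out, the three metrics defined above are simply the ratios

    S_{\text{kernel}} \;=\; \frac{t_{\text{CPU}}(\text{pair loop})}{t_{\text{GPU}}(\text{pair loop})}, \qquad
    S_{\text{procedure}} \;=\; \frac{t_{\text{CPU}}(\texttt{compute()})}{t_{\text{GPU}}(\texttt{compute()}) + t_{\text{transfer}}}, \qquad
    S_{\text{overall}} \;=\; \frac{t_{\text{run}}(\text{CPU only})}{t_{\text{run}}(\text{CPU}+\text{GPU})}

where t_transfer is the host-GPU data movement included in the GPU setup.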

  20. Long-range electrostatics • Gain from the multi-level hierarchy • Matching hardware features with software requirements • On GPUs: Lennard-Jones and short-range terms (direct space); less communication and more computationally dense • Long-range electrostatics: particle mesh Ewald (PME); requires fast Fourier transforms (FFTs); significant communication; keep it on the multi-core CPUs. (The Ewald splitting behind this division of labor is shown below.)
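
The division of labor follows the standard Ewald splitting of the Coulomb interaction, with β the splitting parameter:

    \frac{1}{r} \;=\; \frac{\operatorname{erfc}(\beta r)}{r} \;+\; \frac{\operatorname{erf}(\beta r)}{r}

The erfc term decays rapidly, so it is cut off and evaluated pairwise on the GPU together with Lennard-Jones (direct space); the erf term is smooth and is evaluated on a mesh in reciprocal space with FFTs (PME/PPPM), which is why it stays on the multi-core CPUs.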

  21. Summary • GPU-LAMMPS (open-source) MD software • Highly scalable on HPC machines • Supports popular biomolecular force fields • Accelerated MD simulations • 10-20X: single workstations & GPU clusters • Beats the #1 Cray XT5 supercomputer (for small systems) • Production-quality runs • Wide impact possible • High throughput: large numbers of proteins • Longer time-scales • Ongoing work: heterogeneous architectures • Gain from the multi-level hierarchy • Best time-to-solution: use most (or all) resources

  22. Acknowledgements • Scott Hampton, Sadaf Alam (Swiss Supercomputing Center) • Melissa Smith, Ananth Nallamuthu (Clemson) • Duncan Poole/Peng Wang, NVIDIA • Paul Crozier, Steve Plimpton, SNL • Mike Brown, SNL/ORNL $$$ NIH: R21GM083946 (NIGMS) Questions?
