
Evolution of Computational Approaches in NWChem

Eric J. Bylaska, W. De Jong, K. Kowalski, N. Govind, M. Valiev, and D. Wang

High Performance Software Development

Molecular Science Computing Facility


Outline

Overview of the capabilities of NWChem

Existing terascalepetascale simulations in NWChem

Challenges with developing massively parallel algorithms (that work)

Using embeddable languages (Python) to develop multifaceted simulations

Summary


Overview of NWChem: Background

NWChem is part of the Molecular Science Software Suite

Developed as part of the construction of EMSL

Environmental Molecular Sciences Laboratory, a DOE BER user facility, located at PNNL

Designed and developed to be a highly efficient and portable Massively Parallel computational chemistry package

Provides computational chemistry solutions that are scalable with respect to chemical system size as well as MPP hardware size


Overview of NWChem: Background

More than 2,000,000 lines of code (mostly FORTRAN)

About half of it computer generated (Tensor Contraction Engine)

A new version is released on a yearly basis:

Addition of new modules and bug fixes

Porting of the software to new hardware

Increased performance (serial & parallel)

Freely available after signing a user agreement

World-wide distribution (downloaded by 1900+ groups)

70% academia, the rest government labs and industry


Overview of NWChem: Available on many compute platforms

Originally designed for parallel architectures

Scalability to 1,000s of processors (parts even to 10,000s)

NWChem performs well on small compute clusters

Portable – runs on a wide range of computers

Supercomputer to Mac or PC with Windows

Including Cray XT4, IBM BlueGene

Various operating systems, interconnects, CPUs

Emphasis on modularity and portability



Overview of NWChem: capabilities

NWChem brings a full suite of methodologies to solve large scientific problems

High accuracy methods (MP2/CC/CI)

Gaussian based density functional theory

Plane wave based density functional theory

Molecular dynamics (including QMD)

Standard features found in most computational chemistry codes are not listed here

Brochure with detailed listing available

http://www.emsl.pnl.gov/docs/nwchem/nwchem.html


Overview of NWChem: high accuracy methods

Coupled Cluster

Closed shell coupled cluster [CCSD and CCSD(T)]

Tensor contraction engine (TCE)

Spin-orbital formalism with RHF, ROHF, UHF reference

CCD, CCSDTQ, CCSD(T)/[T]/t/…, LCCD, LCCSD

CR-CCSD(T), LR-CCSD(T)

Multireference CC (available soon)

EOM-CCSDTQ for excited states

MBPT(2-4), CISDTQ, QCISD

8-water cluster on EMSL computer:

1376 bf, 32 elec, 1840 processors achieves 63% of 11 TFlops

CF3CHFO CCSD(T) with TCE:

606 bf, 57 electrons, open-shell



Overview of NWChem: Latest capabilities of high accuracy

TDDFT infty  5.81 eV

EOMCCSD  infty  6.38 eV

Surface Exciton Energy: 6.35 +/- 0.10 eV (expt)

Dipole polarizabilities (in Å3)

C60 molecule in ZM3POL basis set:

1080 basis set functions; 24 correlated

Electrons; 1.5 billion of T2 amplitudes;

240 correlated electrons


NWChem capabilities: DFT

Density functional theory

Wide range of local and non-local exchange-correlation functionals

Truhlar’s M05, M06 (Yan Zhao)

Hartree-Fock exchange

Meta-GGA functionals

Self interaction correction (SIC) and OEP

Spin-orbit DFT

ECP, ZORA, DK

Constrained DFT

Implemented by Qin Wu and Van Voorhis

DFT performance: Si75O148H66, 3554 bf, 2300 elec, Coulomb fitting
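
As a pointer to how these functionals are actually requested, here is a minimal, illustrative Gaussian-basis DFT input fragment (the water geometry is generic and not taken from this talk); the keywords follow the standard NWChem dft input block:

geometry
  O   0.00000000   0.00000000   0.11779000
  H   0.00000000   0.75545300  -0.47116100
  H   0.00000000  -0.75545300  -0.47116100
end

basis
  * library 6-31G*
end

dft
  xc b3lyp        # or, e.g., m05, m06, pbe0 from the list above
  iterations 100
end

task dft energy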


Overview of NWChem: TDDFT

Density functional theory

TDDFT for excited states

Expt: 696 nm ; TDDFT: 730 nm

Expt: 539 nm ; TDDFT: 572 nm

Modeling the λmax (absorption maximum) of optical chromophores with dicyanovinyl and 3-phenylioxazolone groups using TDDFT, B3LYP, COSMO (ε = 8.93 → CH2Cl2), 6-31G**; Andzelm (ARL) et al. (in preparation, 2008)

Development & validation of new Coulomb attenuated (LC) functionals

(Andzelm, Baer, Govind, in preparation 2008)


Overview of NWChem: plane wave DFT

Silica/Water CPMD Simulation

  • Plane wave density functional theory

    • Gamma point pseudopotential and projector augmented wave

    • Band structure (w/ spin orbit)

    • Extensive dynamics functionality with Car-Parrinello

    • CPMD/MM molecular dynamics, e.g. SPC/E, CLAYFF, solid state MD

  • Various exchange-correlation functionals

    • LDA, PBE96, PBE0, (B3LYP)

    • Exact exchange

    • SIC and OEP

  • Can handle charged systems

  • A full range of pseudopotentials and a pseudopotential generator

  • A choice of state-of-the-art minimizers

CCl4^2-/water CPMD/MM Simulation

Spin-Orbit splitting in GaAs


Overview of NWChem: classical molecular dynamics

Molecular dynamics

CHARMM and AMBER force fields

Various types of simulations:

Energy minimization

Molecular dynamics simulation including ab initio dynamics

Free energy calculation

Multiconfiguration thermodynamic integration

Proton transfer through hopping (Q-HOP), i.e. semi-QM in classical MD

Implemented by the group of Volkhard Helms, Saarland University, Germany

Set up and analyze runs with Ecce


Overview of NWChem: QM/MM

Seamless integration of molecular dynamics with coupled cluster and DFT

Optimization and transition state search

QM/MM Potential of Mean Force (PMF)

Modeling properties at finite temperature

Excited States with EOM-CC

Polarizabilities with linear response CC

NMR chemical shift with DFT

[Figure: MM, QM, and combined QM/MM representations of the system]

QM/MM CR-EOM-CCSD Excited State Calculations of cytosine base in DNA, Valiev et al., JCP 125 (2006)


Overview of NWChem: Miscellaneous functionality

Other functionality available in NWChem

Electron transfer

Vibrational SCF and DFT for anharmonicity

COSMO

ONIOM

Relativity through spin-orbit ECP, ZORA, and DK

NMR shielding and indirect spin-spin coupling

Interface with VENUS for chemical reaction dynamics

Interface with POLYRATE, Python


Existing TerascalePetascale capabilities

NWChem has several modules which are scaling to the terascale and work is ongoing to approach the petascale

[Plot: parallel speedup versus number of nodes (0 to 1000) on the Cray T3E/900]

NWPW scalability: UO2^2+ + 122 H2O (Ne = 500, Ng = 96×96×96)

CCSD scalability: C60

1080 basis set functions

MD scalability: Octanol (216,000 atoms)


Existing terascalepetascale Capabilities: Embarrassingly Parallel Algorithms

Some types of problems can be decomposed and executed in parallel with virtually no need for the tasks to share data. These problems are often called embarrassingly parallel because they are so straightforward: very little (or no) inter-task communication is required.

Possible to use embeddable languages such as Python to implement these (a minimal driver sketch follows the list below)

  • Examples:

    • Computing Potential Energy Surfaces

    • Potential of Mean Force (PMF)

    • Monte-Carlo

    • QM/MM Free Energy Perturbation

    • Dynamical Nucleation Theory

    • ONIOM

    • ….
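
As a concrete illustration of the Python route mentioned above, the following hypothetical sketch scans one coordinate of a potential energy surface from an NWChem python block. It reuses the rtdb geometry helpers shown later in this talk (geom_get_coords, geom_set_coords) and assumes the task_energy call of the NWChem Python interface; every grid point is an independent energy evaluation, so the loop could be farmed out with essentially no inter-task communication.

## Hypothetical PES-scan driver illustrating the embarrassingly
## parallel pattern; assumes the NWChem Python interface (task_energy)
## and the geometry helpers from nwmd.py shown later in this talk.
from nwchem import *
from nwmd import geom_get_coords, geom_set_coords

geometry_name = 'geometry'
theory        = 'dft'

coords  = geom_get_coords(geometry_name)
surface = []

# Displace the x coordinate of the first atom in 0.1 a.u. increments.
# Each energy evaluation is independent of all the others, so the
# points could be distributed over processor groups or separate jobs.
for i in range(20):
    trial    = list(coords)
    trial[0] = coords[0] + 0.1*i
    geom_set_coords(geometry_name, trial)
    e = task_energy(theory)
    surface.append((trial[0], e))

for (x, e) in surface:
    print '%12.6f %19.10e' % (x, e)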


Challenges: Tackling Amdahl's Law

There is a limit to the performance gain we can expect from parallelization

[Diagram: a program divided into a serial fraction f and a parallelizable fraction 1-f]

With serial fraction f, the parallel efficiency on P processors is bounded by Amdahl's law: S(P) = 1 / (f + (1-f)/P), so the speedup can never exceed 1/f no matter how many processors are used.

Significant time investment for each 10-fold increase in parallelism!
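
A quick numerical check of the bound above; a minimal sketch assuming the convention that f is the serial fraction:

## Amdahl's law: speedup with serial fraction f on P processors.
def amdahl_speedup(f, P):
    return 1.0 / (f + (1.0 - f)/P)

for P in (10, 100, 1000, 10000):
    # Even with only 1% serial work the speedup saturates near 1/f = 100.
    print 'P = %6d   speedup = %8.2f' % (P, amdahl_speedup(0.01, P))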


Challenges: Gustafson’s Law – make the problem big

Gustafson's Law (also known as Gustafson-Barsis' law) states that any sufficiently large problem can be efficiently parallelized

[Diagram: a parallel program whose serial part a(n) stays fixed while the parallel part b(n) grows with the problem size n]

Assuming the serial fraction a(n) and the parallel fraction b(n) of the scaled workload satisfy a(n) + b(n) = 1, then the scaled speedup on P processors is S(P) = a(n) + P·b(n); as the problem grows and b(n) → 1, the speedup approaches P.
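
The same kind of back-of-the-envelope check for the scaled speedup, assuming a(n) + b(n) = 1:

## Gustafson's scaled speedup: S(P) = a + P*b with a + b = 1.
def gustafson_speedup(a, P):
    return a + P*(1.0 - a)

for a in (0.1, 0.01, 0.001):
    # As the serial share a(n) shrinks with growing problem size,
    # the scaled speedup on 1000 processors approaches 1000.
    print 'a = %6.3f   S(1000) = %8.1f' % (a, gustafson_speedup(a, 1000))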


Challenges: Extending Time – Failure of Gustafson’s Law?

The need for up-scaling in time is especially critical for classical molecular dynamics and ab initio molecular dynamics simulations, where an up-scaling of about 1000 times is needed.

Ab initio molecular dynamics simulations run over several months currently cover only 10 to 100 picoseconds, while the processes of interest are on the order of nanoseconds.

The step length in an ab initio molecular dynamics simulation is on the order of 0.1…0.2 fs/step

1 ns of simulation time → 10,000,000 steps

at 1 second per step → 115 days of computing time

at 10 seconds per step → 3 years

at 30 seconds per step → 9 years

Classical molecular dynamics simulations run over several months can only cover 10 to 100 nanoseconds, but many of the physical processes of interest take at least a microsecond.

The step length in a classical molecular dynamics simulation is on the order of 1…2 fs/step

1 μs of simulation time → 1,000,000,000 steps

at 0.01 second per step → 115 days of computing time

at 0.1 seconds per step → 3 years

at 1 second per step → 30 years
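
The wall-clock estimates above are just the product of the number of steps and the cost per step; a minimal sketch of that arithmetic:

## Wall-clock time = (simulated time / time step) * (seconds per step).
def years_of_computing(simulated_fs, step_fs, seconds_per_step):
    nsteps = simulated_fs / step_fs
    return nsteps * seconds_per_step / (3600.0 * 24 * 365)

# 1 ns of AIMD at 0.1 fs/step and 10 s/step -> about 3 years
print '%.1f years' % years_of_computing(1.0e6, 0.1, 10.0)
# 1 us of classical MD at 1 fs/step and 0.1 s/step -> about 3 years
print '%.1f years' % years_of_computing(1.0e9, 1.0, 0.1)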


Challenges: Need For Communication Tolerant Algorithms

Data motion costs are increasing due to the "processor-memory gap"

Coping strategies

Trade computation for communication [SC03]

Overlap Computation and Communication

[Plot: processor performance improves ~55%/year while DRAM memory performance improves only ~7%/year, producing an ever-widening processor-memory gap]
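
The second coping strategy, overlapping computation with communication, is usually expressed with non-blocking messages. The sketch below is not NWChem code (NWChem uses the Global Arrays toolkit internally); it is only a minimal mpi4py illustration of the pattern:

## Minimal overlap of computation and communication (mpi4py, illustrative).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

send_buf = np.full(1000, float(rank))
recv_buf = np.empty(1000)

# Post the communication first ...
requests = [comm.Isend(send_buf, dest=(rank + 1) % size, tag=7),
            comm.Irecv(recv_buf, source=(rank - 1) % size, tag=7)]

# ... then do local work that does not need the incoming data ...
local = np.dot(send_buf, send_buf)

# ... and wait only when the remote data is actually required.
MPI.Request.Waitall(requests)
total = local + np.sum(recv_buf)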


Basic Parallel Computing: tasks

Break program into tasks

A task is a coherent piece of work (such as reading data from the keyboard, or some part of a computation) that is to be done sequentially; it cannot be divided further into smaller tasks and therefore cannot itself be parallelized.


Basic Parallel Computing: Task dependency graph

[Diagram: successive snapshots of the task dependency graph for the code fragment below, with nodes for a = a + 2 (A), b = b + 3 (B), and the chain of updates C[0] → C[1] → … → C[4]]

a = a + 2;
b = b + 3;
for i = 0..3 do
    c[i+1] += c[i];
b = b * a;
b = b + c[4];
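
To make the graph explicit, here is an illustrative sketch (task names are hand-derived from the fragment above, not generated by any tool); a simple level assignment shows which tasks could run in the same parallel step:

## Hand-derived dependency graph for the fragment above:
## each task maps to the set of tasks it must wait for.
deps = {
    'A:  a = a + 2'    : set(),
    'B:  b = b + 3'    : set(),
    'C1: c[1] += c[0]' : set(),
    'C2: c[2] += c[1]' : {'C1: c[1] += c[0]'},
    'C3: c[3] += c[2]' : {'C2: c[2] += c[1]'},
    'C4: c[4] += c[3]' : {'C3: c[3] += c[2]'},
    'D:  b = b * a'    : {'A:  a = a + 2', 'B:  b = b + 3'},
    'E:  b = b + c[4]' : {'D:  b = b * a', 'C4: c[4] += c[3]'},
}

# Earliest step at which each task can run; tasks that share a step
# have no mutual dependencies and may execute in parallel.
level = {}
while len(level) < len(deps):
    for task, prereqs in deps.items():
        if task not in level and all(p in level for p in prereqs):
            level[task] = 1 + max([level[p] for p in prereqs] or [0])

for task in sorted(deps, key=lambda t: (level[t], t)):
    print 'step %d   %s' % (level[task], task)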


Basic Parallel Computing: Task dependency graph (2)

[Diagram: the same task dependency graph annotated with its synchronisation points and its critical path]


Basic Parallel Computing: Critical path

The critical path determines the time required to execute a program. No matter how many processors are used, the time imposed by the critical path determines an upper limit to performance.

The first step in developing a parallel algorithm is to parallelize along the critical path

What if that is not enough?
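
Continuing the illustrative graph from the previous slides, the critical path is simply the longest chain of dependent tasks; a minimal sketch with unit cost assumed for each task:

## Critical path = longest dependency chain (unit cost per task assumed).
deps = {'A': [], 'B': [], 'C1': [], 'C2': ['C1'], 'C3': ['C2'],
        'C4': ['C3'], 'D': ['A', 'B'], 'E': ['D', 'C4']}

def critical_path_length(deps):
    memo = {}
    def depth(task):
        if task not in memo:
            memo[task] = 1 + max([depth(p) for p in deps[task]] or [0])
        return memo[task]
    return max(depth(t) for t in deps)

# 8 unit-cost tasks with a critical path of length 5: no number of
# processors can push the speedup above 8/5 = 1.6 for this fragment.
print critical_path_length(deps)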


Case Study: Parallel Strategies for Plane-Wave Programs

[Diagram: Ne molecular orbitals, each expanded on the Ng-point plane-wave (FFT) grid]

Three parallelization schemes:

Distribute the basis (i.e. the FFT grid) - slab decomposition: the Ng grid points are split across processors (proc = 1, 2, 3, …)

  • FFT requires communication

  • <i|j> requires communication

Distribute the Ne molecular orbitals (one eigenvector per task) - critical path parallelization

  • Minimal load balancing (a small distribution sketch follows this list)

Distribute the Brillouin Zone

  • Number of k-points is usually small
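
A tiny illustration of the orbital-distribution scheme above: a generic round-robin assignment of Ne molecular orbitals to P processors (illustrative only, not NWChem's actual data distribution):

## Round-robin distribution of Ne molecular orbitals over P processors.
def my_orbitals(Ne, P, rank):
    return [i for i in range(Ne) if i % P == rank]

Ne, P = 500, 8
counts = [len(my_orbitals(Ne, P, r)) for r in range(P)]
# 500 orbitals over 8 processors -> 63 or 62 per processor: the work
# is balanced to within one orbital ("minimal load balancing").
print counts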


Case Study: Speeding up plane-wave DFT

Old parallel distribution (scheme 3 – critical path)

Each box represents a cpu


Case Study: Analyze the algorithm

Collect timings for the important components of the existing algorithm (a minimal timing sketch follows this list)

  • For bottlenecks

    • Remove all-to-all and log(P) global operations beyond 500 CPUs

    • Overlap computation and communication (e.g. pipelining)

    • Duplicate the data?

    • Redistribute the data?

    • ….
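
As announced above, one lightweight way to collect per-component timings from the Python level (a generic standard-library sketch, not NWChem's internal profiling):

## Accumulate wall-clock time per labelled component.
import time

timings = {}

def timed(label, func, *args):
    t0 = time.time()
    result = func(*args)
    timings[label] = timings.get(label, 0.0) + (time.time() - t0)
    return result

# Hypothetical usage around two components of an iteration:
#   psi  = timed('fft',      apply_fft, psi)
#   vpsi = timed('nonlocal', apply_vnl, psi)
for label in sorted(timings):
    print '%-12s %10.3f s' % (label, timings[label])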


Case Study: Try a new strategy

New parallel distribution

Old parallel distribution


Case Study: analyze new algorithm

Can trade efficiency in one component for efficiency in another component if their timings are of different orders of magnitude



Python Example: AIMD Simulation

title "propane aimd simulation"
start propane2-db
memory 600 mb

permanent_dir ./perm
scratch_dir   ./perm

geometry
  C   1.24480654   0.00000000  -0.25583795
  C   0.00000000   0.00000000   0.58345271
  H   1.27764005  -0.87801632  -0.90315343  mass 2.0
  H   2.15111436   0.00000000   0.34795707  mass 2.0
  H   1.27764005   0.87801632  -0.90315343  mass 2.0
  H   0.00000000  -0.87115849   1.24301935  mass 2.0
  H   0.00000000   0.87115849   1.24301935  mass 2.0
  C  -1.24480654   0.00000000  -0.25583795
  H  -2.15111436   0.00000000   0.34795707  mass 2.0
  H  -1.27764005  -0.87801632  -0.90315343  mass 2.0
  H  -1.27764005   0.87801632  -0.90315343  mass 2.0
end

basis
  * library 3-21G
end

python
from nwmd import *
surface = nwmd_run('geometry','dft',10.0, 5000)
end

task python
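
Assuming a working NWChem build, a deck like this is typically run from the command line (the input file name below is illustrative):

  nwchem propane_aimd.nw
  mpirun -np 8 nwchem propane_aimd.nw

The embedded python block is executed when the task python directive is reached.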


Python Example: Header of nwmd.py

## import nwchem specific routines
from nwchem import *

## other libraries you might want to use
from math import *
from numpy import *
from numpy.linalg import *
from numpy.fft import *
import Gnuplot

####### basic rtdb routines to read and write coordinates #######

def geom_get_coords(name):
    #
    # This routine returns a list with the cartesian
    # coordinates in atomic units for the geometry
    # of the given name.
    #
    try:
        actualname = rtdb_get(name)
    except NWChemError:
        actualname = name
    coords = rtdb_get('geometry:' + actualname + ':coords')
    return coords

def geom_set_coords(name, coords):
    #
    # This routine, given a list with the cartesian
    # coordinates in atomic units, sets them in
    # the geometry of the given name.
    #
    try:
        actualname = rtdb_get(name)
    except NWChemError:
        actualname = name
    coords = list(coords)
    rtdb_put('geometry:' + actualname + ':coords', coords)


Python Example: basic part of Verlet loop

# do steps-1 Verlet steps
for s in range(steps-1):
    for i in range(nion3): rion0[i] = rion1[i]
    for i in range(nion3): rion1[i] = rion2[i]
    t += time_step

    ### set coordinates and calculate energy - nwchem specific ###
    geom_set_coords(geometry_name, rion1)
    (v, fion) = task_gradient(theory)

    ### verlet step ###
    for i in range(nion3):
        rion2[i] = 2.0*rion1[i] - rion0[i] - dti[i]*fion[i]

    ### calculate ke ###
    vion1 = []
    for i in range(nion3):
        vion1.append(h*(rion2[i] - rion0[i]))
    ke = 0.0
    for i in range(nion3):
        ke += 0.5*massi[i]*vion1[i]*vion1[i]
    e = v + ke

    print ' '
    print '@@ %5d %9.1f %19.10e %19.10e %14.5e' % (s+2, t, e, v, ke)