
  1. MSc in High Performance Computing, Computational Chemistry Module, Lecture 7: Parallel Approaches to Quantum Chemistry (i) Replicated Data Parallelism. Huub van Dam and Paul Sherwood, STFC Daresbury Laboratory, h.j.j.vandam@daresbury.ac.uk

  2. Outline of the Lecture The GAMESS-UK package • Parallelisation of the SCF process • Expense of computational steps – integral generation and eigensolution • Static load balancing • Dynamic load balancing • Parallelising Linear Algebra • Dealing with I/O requirements, mapping files to disk • Observed performance of the MPI version • Two-level Parallelism • Task Farming in genetic algorithms • QDVE – generation of viable catalysts • Reaction Path methods • Minimisation of reaction path – chorismate mutase enzyme

  3. GAMESS-UK • Generalised Atomic and Molecular Electronic Structure System • Developed and maintained over 20 years as part of CSED’s support for CCP1 • now over 1,200,000 lines of Fortran source • Functionality • HF, MCSCF, MP2, CI wavefunctions • Density Functional Theory (including Hessians) • excitation and ionisation energies (OVGF, 2ph-TDA, RPA...) • a wide range of properties and analysis tools • QM/MM implementations (CHARMM, ChemShell) • Molec. Phys. 103 (2005) 719-747 • Active developments: • excited state energies and forces from time-dependent DFT (jointly with PNNL) • NMR chemical shifts • 261 licensed groups; see: http://www.cfs.dl.ac.uk

  4. Scaling of Molecular Computations The relative computing power required for molecular computations at four levels of theory. In the absence of screening techniques, the formal scaling for configuration interaction, Hartree-Fock, density functional, and molecular dynamics is N^6, N^4, N^3 and N^2, respectively.

  5. Parallelisation of GAMESS-UK • 1985: HF-SCF module initially parallelised using message passing (TCGMSG, later PVM and MPI) – for iPSC class machines • 1990s: use of global memory through the Global Array Tools (PNNL) • Data objects can be distributed and accessed without synchronisation [diagram: physically distributed data presented as a single, shared data structure] • e.g. in-core storage of integrals enables a wider range of algorithms to be tackled, e.g. parallel MP2, mapping I/O to memory access ... • Core of the program still uses a replicated data strategy – node memory limits the maximum system size • 2003: Implementation of a partial distributed data strategy • New F90 matrix module, mapping most of the matrices to global memory • can be re-used for other codes, e.g. Crystal • Standards-based (MPI, ScaLAPACK)

  6. Self Consistent Field Method (SCF) • SCF theory – each electron interacts with a mean potential created by the other (N-1) electrons • The SCF method is derived by assuming a specific form of the solution to the QM equation – the Schrödinger equation (HF theory) or the Kohn-Sham equation (DFT theory) – leading to a set of coupled integro-differential equations that could be solved numerically. • More commonly the solutions are expanded in a finite set of primitive functions (the basis set). The set of equations then becomes a set of coupled homogeneous equations that are usually written in matrix form. • Eigenvalues and eigenvectors of the matrix describing the particle interactions are required; because of the coupling in the matrix (the matrix is defined in terms of its own solutions) a self-consistent solution is sought. • This implies an iterative process, iterating until the Fock matrix remains constant from iteration to iteration.

  7. Schematic SCF Computationally most expensive steps: Integral generation O(N^4) to O(N^2.5); Exchange-correlation quadrature (DFT only) O(N); Fock matrix construction from integrals O(N^4) to O(N^2.5); Diagonalisation O(N^3); Orthogonalisation O(N^3); Matrix multiply O(N^3). MOAO represents the molecular orbitals, P the density matrix and F the Fock or Hamiltonian matrix: F_μν = H0_μν + Σ_λσ P_λσ [ (μν|λσ) − ½ (μλ|νσ) ]
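To make the schematic concrete, the following is a minimal sketch of one closed-shell SCF iteration. It is illustrative rather than GAMESS-UK code: it assumes the two-electron integrals are already available in a full in-memory array eri, and that the basis has been orthogonalised (in practice the generalised problem FC = SCε is solved), so the Fock matrix can be diagonalised directly with LAPACK's dsyev.

      ! Minimal sketch of one closed-shell SCF iteration (illustrative, not GAMESS-UK code).
      ! Assumes: h0 = one-electron Hamiltonian, eri(i,j,k,l) = two-electron integrals (ij|kl),
      ! an orthogonalised basis, nbf basis functions and nocc doubly occupied orbitals.
      subroutine scf_iteration(nbf, nocc, h0, eri, p, f, c, eps)
        implicit none
        integer, intent(in)             :: nbf, nocc
        double precision, intent(in)    :: h0(nbf,nbf), eri(nbf,nbf,nbf,nbf)
        double precision, intent(inout) :: p(nbf,nbf)                 ! density matrix
        double precision, intent(out)   :: f(nbf,nbf), c(nbf,nbf), eps(nbf)
        double precision, allocatable   :: work(:)
        integer :: mu, nu, lam, sig, i, info, lwork

        ! Fock build:  F = H0 + P [ (mu nu|lam sig) - 1/2 (mu lam|nu sig) ]  -- the O(N^4) step
        f = h0
        do mu = 1, nbf
          do nu = 1, nbf
            do lam = 1, nbf
              do sig = 1, nbf
                f(mu,nu) = f(mu,nu) + p(lam,sig) * &
                           ( eri(mu,nu,lam,sig) - 0.5d0*eri(mu,lam,nu,sig) )
              enddo
            enddo
          enddo
        enddo

        ! Diagonalisation -- the O(N^3) step that remains serial in an integral-parallel code
        c = f
        lwork = 3*nbf
        allocate(work(lwork))
        call dsyev('V', 'U', nbf, c, nbf, eps, work, lwork, info)
        deallocate(work)

        ! New density matrix from the nocc lowest molecular orbitals
        p = 0.0d0
        do i = 1, nocc
          do mu = 1, nbf
            do nu = 1, nbf
              p(mu,nu) = p(mu,nu) + 2.0d0*c(mu,i)*c(nu,i)
            enddo
          enddo
        enddo
      end subroutine scf_iteration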

  8. 2-Electron Integral Generation Loop Integrals are computed in a 4-nested loop over basis function shells • Basis functions within a shell share exponents

      do ish = 1, nsh
        do jsh = 1, ish
          do ksh = 1, ish
            do lsh = 1, ksh
              Compute batch of integrals
              Store (conventional SCF) or multiply by P to get F (direct SCF)
            enddo
          enddo
        enddo
      enddo

F_μν = H0_μν + Σ_λσ P_λσ [ (μν|λσ) − ½ (μλ|νσ) ]

  9. Parallelisation with static task allocation Use the node number to assign tasks. Two variants: distribute work over the third shell loop (ksh) or over the innermost shell loop (lsh).

   Variant 1 – distribute over ksh:
      do ish = 1, nshells
        do jsh = 1, ish
          do ksh = 1, ish
            if (mod(ksh, nnodes) .eq. mynode) then
              do lsh = 1, ksh
                Compute batch of integrals
                Multiply integrals by P to get F
              enddo
            endif
          enddo
        enddo
      enddo
      call global_sum(F)

   Variant 2 – distribute over lsh:
      do ish = 1, nshells
        do jsh = 1, ish
          do ksh = 1, ish
            do lsh = 1, ksh
              if (mod(lsh, nnodes) .eq. mynode) then
                Compute batch of integrals
                Multiply integrals by P to get F
              endif
            enddo
          enddo
        enddo
      enddo
      call global_sum(F)

  10. Parallel SCF considerations • For conventional SCF we are creating one integral file on each node. Some logic is needed to support this (GAMESS-UK uses special file names such as ed2000, ed2001, etc.) • Not all tasks are the same size • If ish, jsh, ksh, lsh are indices of shells comprising a single function, there is only a single integral • If ish, jsh, ksh, lsh are indices of shells containing Px, Py, Pz there are potentially 3×3×3×3 = 81 integrals (some equivalent by symmetry) • Even more for d, f, g orbitals • All nodes must wait for the global sum at the end • the “slowest” node controls the speed of execution • Dynamic allocation of tasks can reduce load imbalance

  11. Dynamically Load Balanced SCF

      itask = next_task()
      icount = 0
      do ish = 1, nsh
        do jsh = 1, ish
          do ksh = 1, ish
            do lsh = 1, ksh
              if (icount .eq. itask) then
                Compute batch of integrals
                Multiply integrals by P to get F
                itask = next_task()
              endif
              icount = icount + 1
            enddo
          enddo
        enddo
      enddo
      call global_sum(F)

  12. Implementing dynamic load balancing • Some toolkits provide a “global counter” • e.g. GAMESS-UK was first parallelised using TCGMSG, which provides a NXTVAL() call • The implementation is quite machine dependent • When using MPI-1, an additional task can be assigned to hold the counter and reply to incoming requests • Quite wasteful for small node counts • Basis of the GAMESS-UK dynamic load balanced MPI version (a sketch of such a counter task is shown below) • Dynamic allocation works better if large tasks are encountered before small ones • More efficient to reverse loop orderings in integral generation • generally better to use g, f, d, p, s ordering of shells rather than s, p, d, f, g ...
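A minimal sketch of the MPI-1 scheme described above, in which one rank is dedicated to holding the counter and answering requests; next_task() on a worker rank sends a request to that rank and receives the next task number back. This illustrates the idea rather than the GAMESS-UK implementation; the tag values, termination convention and routine names are assumptions.

      ! Counter server: one dedicated rank hands out consecutive task numbers.
      ! Workers send 0 to request a task and -1 when they have finished (termination).
      subroutine counter_server(comm, nworkers)
        use mpi
        implicit none
        integer, intent(in) :: comm, nworkers
        integer :: counter, ndone, request, status(MPI_STATUS_SIZE), ierr
        counter = 0
        ndone   = 0
        do while (ndone .lt. nworkers)
          call MPI_Recv(request, 1, MPI_INTEGER, MPI_ANY_SOURCE, 1, comm, status, ierr)
          if (request .lt. 0) then
            ndone = ndone + 1
          else
            call MPI_Send(counter, 1, MPI_INTEGER, status(MPI_SOURCE), 2, comm, ierr)
            counter = counter + 1
          endif
        enddo
      end subroutine counter_server

      ! Worker side: fetch the next global task number from the counter rank.
      integer function next_task(comm, counter_rank)
        use mpi
        implicit none
        integer, intent(in) :: comm, counter_rank
        integer :: request, status(MPI_STATUS_SIZE), ierr
        request = 0
        call MPI_Send(request, 1, MPI_INTEGER, counter_rank, 1, comm, ierr)
        call MPI_Recv(next_task, 1, MPI_INTEGER, counter_rank, 2, comm, status, ierr)
      end function next_task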

  13. Further Parallelisation of the SCF Process Parallel Diagonalisation • PeIGS: G.I. Fann, R.J. Littlefield, "Parallel inverse iteration with reorthogonalisation", paper presented at the Conference on Parallel Processing for Scientific Computing, SIAM, pp. 409-413 • Solves dense real symmetric standard (Ax = λx) and generalised (Ax = λBx) eigenproblems • The numerical method used is multisection for eigenvalues and repeated inverse iteration and orthogonalisation for eigenvectors • Guarantees orthogonality of eigenvectors, even for arbitrarily large clusters that span processors • ScaLAPACK: • PDSYEVX, PDSYEVD, PDSYEV – a variety of routines and algorithms

  14. Parallel Linear Algebra Symmetric Eigensolver Routines • ScaLAPACK • drivers for solving standard and generalised dense symmetric or dense Hermitian eigenproblems • PDSYEV (QR method) (ScaLAPACK 1.5) • PDSYEVX (bisection & inverse iteration) (ScaLAPACK 1.5) • PDSYEVD (divide & conquer method) (ScaLAPACK 1.7) • BFG (I. Bush) • block Jacobi method on a full dense symmetric (or Hermitian) matrix • PLAPACK • QR method • MRRR – ‘Multiple Relatively Robust Representations’. A sketch of a typical ScaLAPACK call sequence is shown below.
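To make the ScaLAPACK route concrete, the sketch below shows a typical call sequence for the PDSYEV (QR method) driver on a block-cyclically distributed real symmetric matrix. It is a generic illustration, not the GAMESS-UK interface; the matrix size, block size, 2 x 2 process grid and the fill_local_matrix routine are assumptions (the latter is a hypothetical stub that would fill the local block-cyclic piece of the Fock matrix).

      ! Sketch of a distributed symmetric eigensolve with ScaLAPACK PDSYEV
      ! (assumes the job is run on exactly nprow*npcol = 4 MPI processes).
      program scalapack_eig_sketch
        implicit none
        integer, parameter :: n = 1000, nb = 64       ! global order and block size (assumed)
        integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol
        integer :: np, nq, lld, info, lwork
        integer :: desca(9), descz(9)
        double precision, allocatable :: a(:,:), z(:,:), w(:), work(:)
        integer, external :: numroc

        call blacs_pinfo(iam, nprocs)
        nprow = 2;  npcol = 2
        call blacs_get(-1, 0, ictxt)
        call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
        call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

        np  = numroc(n, nb, myrow, 0, nprow)          ! local row count
        nq  = numroc(n, nb, mycol, 0, npcol)          ! local column count
        lld = max(1, np)
        call descinit(desca, n, n, nb, nb, 0, 0, ictxt, lld, info)
        call descinit(descz, n, n, nb, nb, 0, 0, ictxt, lld, info)

        allocate(a(np,nq), z(np,nq), w(n))
        call fill_local_matrix(a, desca)              ! hypothetical: local piece of the Fock matrix

        ! Workspace query, then the eigensolve itself
        allocate(work(1))
        call pdsyev('V', 'U', n, a, 1, 1, desca, w, z, 1, 1, descz, work, -1, info)
        lwork = int(work(1))
        deallocate(work);  allocate(work(lwork))
        call pdsyev('V', 'U', n, a, 1, 1, desca, w, z, 1, 1, descz, work, lwork, info)

        call blacs_gridexit(ictxt)
        call blacs_exit(0)
      end program scalapack_eig_sketch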

  15. Parallel Diagonalisation - Scalability of Algorithms Real symmetric eigenvalue problems

  16. Parallel Eigensolvers

  17. Further Parallelisation of the SCF Process Algorithm Changes – Alternatives to diagonalisation • The Hartree-Fock (HF) module may also be based on a quadratically convergent SCF (QCSCF) approach [G.B. Bacskay, Chem. Phys. 61 (1981) 385]. • The SCF equations are recast as a non-linear minimisation which bypasses the diagonalisation step. This scheme consists only of data-parallel operations and matrix multiplications, which guarantees high efficiency on parallel machines. • Perhaps more significantly, QCSCF is amenable to several performance enhancements that are not possible in conventional approaches, e.g. orbital-Hessian vector products may be computed approximately, which significantly reduces the computational expense with no effect on the final accuracy.

  18. I/O Considerations • Most ab-initio programs rely heavily on I/O • Integral files (for conventional SCF only) • GAMESS-UK reduces its memory footprint by saving data to disk when not in use (ed3, ed7) • Data saved for restarting and later job steps (ed3) • I/O in Parallel Implementations • All nodes maintain a copy of the file • Good for machines with fast local disk on nodes • A limitation on machines with distributed file systems (e.g. HPCx) • Node 0 maintains a copy of the file on behalf of the parallel job • Keeps I/O to a minimum • Each read operation has to be followed by a broadcast • This is the default for the MPI version of GAMESS-UK • The files can be mapped into memory • Particularly useful when each node maintains a partial copy (e.g. ed2) • Aggregate memory of parallel computer can then be exploited
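A minimal sketch of the "node 0 owns the file" scheme above: only rank 0 performs the read, and the record is then broadcast to all other ranks. The unit number, direct-access record layout and the routine name are illustrative assumptions, not the GAMESS-UK I/O layer.

      ! Read one fixed-length record on rank 0 and broadcast it to every rank.
      ! Assumes iunit was opened on rank 0 with access='direct' and a matching record length.
      subroutine read_and_broadcast(comm, iunit, irec, buf, buflen)
        use mpi
        implicit none
        integer, intent(in)           :: comm, iunit, irec, buflen
        double precision, intent(out) :: buf(buflen)
        integer :: myrank, ierr
        call MPI_Comm_rank(comm, myrank, ierr)
        if (myrank .eq. 0) then
          read(unit=iunit, rec=irec) buf        ! only node 0 touches the disk
        endif
        ! every read operation is followed by a broadcast
        call MPI_Bcast(buf, buflen, MPI_DOUBLE_PRECISION, 0, comm, ierr)
      end subroutine read_and_broadcast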

  19. Characteristics of Integral-Parallel SCF • Very efficient for small node counts • integral cost dominates • Limited by Amdahl's law • Still a lot of serial code in the SCF that needs to be addressed (MxM, diagonalisation ...) • I/O and memory demands of the program have not changed • considerable effort may be needed to make this efficient • As an example, consider the performance of the GAMESS-UK MPI implementation

  20. Parallelisation of GAMESS-UK • Consider the performance of the MPI code • The impact of increasing molecule size and the balance between integral evaluation (parallelised) and the remaining SCF steps (serial) in the MPI code • DFT calculations on: • Morphine (6-31G(d,p)), 410 basis functions • Cyclosporin (6-31G), 1000 basis functions, and • Cyclosporin (6-31G(d,p)), 1855 basis functions • All calculations performed on HPCx (Phase2A, p5-575 nodes)

  21. Morphine (6-31G(d,p)), 410 basis functions [chart: total time (secs) for each computational step]

  22. Morphine (6-31G(d,p)), 410 basis functions [chart: % contribution of computational tasks vs. no. of processors]

  23. Cyclosporin (6-31G), 1000 basis functions [chart: total time (secs) for each computational step]

  24. Cyclosporin (6-31G), 1000 basis functions [chart: % contribution of computational tasks vs. no. of processors]

  25. Cyclosporin (6-31G(d,p)), 1855 basis functions [chart: total time (secs) for each computational step]

  26. Cyclosporin (6-31G(d,p)), 1855 basis functions [chart: % contribution of computational tasks vs. no. of processors]

  27. Task Farming • Trivial Parallelism • processors are divided up into groups • each group works on an independent task • requires scalability to smaller processor counts • e.g. GAMESS-UK • task farm version has been implemented using MPI features (groups, communicators) • used for combinatorial applications, e.g. Genetic algorithm design of catalysts.
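A minimal sketch of the MPI groups/communicators mechanism mentioned above: MPI_COMM_WORLD is split into fixed-size groups, and each group then runs its own independent task inside its sub-communicator. The group size is an assumed parameter, and the print statement stands in for launching a GAMESS-UK calculation within the group.

      ! Split the world communicator into task-farm groups of ngroup_size ranks each.
      program taskfarm_sketch
        use mpi
        implicit none
        integer, parameter :: ngroup_size = 32       ! assumed processors per group
        integer :: world_rank, world_size, colour, group_comm, group_rank, ierr
        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)
        colour = world_rank / ngroup_size            ! which group this rank joins
        call MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, group_comm, ierr)
        call MPI_Comm_rank(group_comm, group_rank, ierr)
        ! each group now works on an independent task inside group_comm,
        ! e.g. one GAMESS-UK energy evaluation per group:
        print '(a,i0,a,i0,a)', 'group ', colour, ' (local rank ', group_rank, &
              ') would run its own job here'
        call MPI_Comm_free(group_comm, ierr)
        call MPI_Finalize(ierr)
      end program taskfarm_sketch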

  28. Task Farming Example (QDVE) • QDVE on HPCx is a collaboration with Marcus C. Durrant, formerly John Innes Centre, now Northumbria University. • Use a genetic algorithm to find the most effective transition metal complex for catalysing the reduction of N2 to N2H4. • For each potential catalyst, reaction energies for each step in the catalytic cycle are calculated and the most successful complexes go through to the next ‘round’. • Complexes are ‘bred’ and ‘mutated’ to create new complexes that will hopefully combine the most successful attributes of their parents. • Process repeated though a number of successive generations until a complex with a desired efficacy is found. • Implementation: • Reaction energies are calculated in ‘taskfarming mode’ i.e. numerous small child jobs are run concurrently and in parallel on a subset of the total no. of processors allocated to the parent job.

  29. The Catalytic Complexes [diagram: eight geometries; M = transition metal, S = reaction substrate] Each complex consists of: • Core structure – metal M plus a set of ligands L1 & L2 • Substrate ligand, representing the different species shown in the cycle. L1 = BH2-, CH3-, NH2-, OH-, AlH2-, SiH3-, PH2-, SH-, GaH2-, GeH3-, AsH2-, SeH-, NH3, OH2, PH3, SH2, AsH3, SeH2. L2 = H-, N3-, O2-, S2-, BH2-, CH3-, NH2-, OH-, F-, AlH2-, SiH3-, PH2-, SH-, Cl-, GaH2-, GeH3-, AsH2-, SeH-, Br-, NH3, OH2, PH3, SH2, AsH3, SeH2.

  30. The Catalytic Cycle Reaction energies for the steps of the cycle:
      D1 = EB − EA − E(N2) + 15
      D2 = EC − EB − E(½H2) − 10
      D3 = EC − ED + 10
      D4 = (EE or EF) − EC − E(½H2) − 10
      D5 = EG − (EE or EF) − E(½H2) − 10
      D6 = EG − EH + 10
      D7 = EI − EG − E(½H2) − 10
      D8 = EA + E(N2H4) − EI + 10

  31. Nanogenes In molecular evolution, genes are transcribed into functional molecules (proteins) – the survival of a gene is determined by the ability of its associated protein to carry out the target chemical reaction. In this project, each complex is described by a nanogene that uniquely identifies each aspect of the complex and also provides a way of automating the breeding and mutation process. [diagram: the example nanogene 21143584D2A with its fields labelled] The fields of a nanogene identify: the transition metal (by specifying the row and column of the periodic table); the primary ligand L1; the secondary ligand L2; the geometry of the substrates and ligand around the metal; the reaction species in the catalytic cycle; the spin state; the charge on the complex; and a unique job identifier.

  32. The Genetic Algorithm 1. Generate an initial population of transition metal complexes by random methods. 2. Use GAMESS-UK to calculate the energy of each step in the catalytic cycle. 3. For each complex calculate an overall score (fitness) by comparing the calculated energies with a theoretical ideal value. 4. Use a set of selection rules to determine which complexes should go through to the next round. 5. Are the selection criteria fully satisfied? If yes, the QDVE process is completed. 6. If no, breed and mutate the survivors to create the next generation of catalytic complexes and return to step 2.

  33. QDVE – Implementation [diagram: a root process connected to the master of each group through a root/master communicator; each group of one master plus several slaves, linked by an intra-group communicator, runs a single GAMESS-UK job] • Execution on large-scale parallel resources requires submission of a small number of large jobs – hence a task-farming harness is needed • dynamically allocate jobs to processors • support for automatic restart if required • Batch processing of many automatically-generated model transition-metal-containing structures presents some problems • Conventional SCF convergence schemes are not typically robust enough to run without intervention (tuning of level shifters, etc.) • Modified driver to automatically choose between convergence schemes based on diagnostics (energy changes, the occupied-virtual Fock matrix block, etc.).

  34. The Replica Path Method • The method involves the definition of a reaction path via replication of a set of macromolecular atoms. • Entails the simultaneous optimisation of a series of geometries of the reacting system, corresponding to a series of points along the reaction pathway. • The replica path approach has been tested on the chorismate / prephenate rearrangement, illustrating how the PMF approach, based on the constraint forces acting on the non-equilibrium path structures can be used to extract a measure of the thermodynamics of the reaction from the active site atoms. • The intermediates can be estimated by interpolation, as in the chorismate mutase reaction:

  35. The Replica Path Method [diagram: replicas P0 ... P36 along the reaction co-ordinate, with the energy E on the vertical axis] • Involves minimisation of the reaction path (end points and e.g. 20 intermediates) at the same time • The target function for the combined minimisation comprises the sum of the configurational energies, together with a series of penalty functions which ensure that the structures represent the reaction path (a schematic form is given below) • We can parallelise over images as well as deploy parallel processing within each image • Effectively exploit 500-1000 processors even with a replicated data code
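Schematically, and only as an illustration of the kind of target function described above (the penalty terms actually used in the CHARMM replica path implementation are given in the reference on the next slide), the combined objective for replicas R_0 ... R_N can be written as the sum of the configurational energies plus harmonic restraints that keep neighbouring replicas evenly spaced along the path:

      W(R_0,\dots,R_N) = \sum_{i=0}^{N} E(R_i) + \sum_{i=1}^{N} \tfrac{1}{2} k_{\mathrm{rms}} \left( d_{i-1,i} - \bar{d} \right)^2

where E(R_i) is the QM/MM energy of replica i, d_{i-1,i} is a best-fit RMS distance between adjacent replicas, \bar{d} is the average of these distances, and k_{rms} is a restraint force constant. Minimising W relaxes every point on the path simultaneously, while the penalty terms prevent the replicas from collapsing into the nearest minima.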

  36. Replica Path Parallelisation • The classical part of the system (both the replicated and non-replicated MM regions) is computed using the standard CHARMM parallel code. • For the QM calculation, however, the CHARMM communication subsystem is switched such that the processors are grouped into independent sets, each set working on one of the points on the pathway. • The converged wavefunction for each point is maintained ready to initialise the next calculation. H.L. Woodcock, M. Hodoscek, P. Sherwood, Y.S. Lee, H.F. Schaefer III, B.R. Brooks, Theor. Chem. Acc. 109 (2003) 140-148.

  37. Test System: Chorismate Mutase • The Chorismate mutase enzyme is well-studied by both theoretical and experimental methods. • The system was solvated, leading to a total of ca 1500 atoms. Only one of the active sites in the trimeric enzyme was treated by the replica approach, the remainder by MM.

  38. Chorismate Mutase Test System • The chorismate/prephenate moiety is the only part treated by QM methods (thus avoiding any bonded QM/MM junctions). • Solvated system, leading to a total of ca 1500 atoms. Only one of the active sites in the trimeric enzyme was treated by the replica approach, the remainder by MM. • The chorismate to prephenate rearrangement was found to have ΔH‡ and ΔHrxn values of 14.9 and −19.5 kcal/mol. The activation enthalpy compares favourably with the experimental value of 12.7±0.4 kcal/mol. • Close agreement between the energy profiles obtained from direct energetic analysis and from the PMF integration approach. • The replicated part of the system (6 Å cutoff) – with a different geometry at each point on the reaction path – is highlighted in the figure.

  39. Computed Energy Profiles

  40. The QM/MM Modelling Approach • Couple QM (quantum mechanics) and MM (molecular mechanics) approaches • QM treatment of the active site • reacting centre • excited state processes (e.g. spectroscopy) • problem structures (e.g. complex transition metal centre) • Classical MM treatment of environment • enzyme structure • zeolite framework • explicit solvent molecules • bulky organometallic ligands

  41. Summary • The GAMESS-UK package • Parallelisation of the SCF process • Static load balancing • Dynamic load balancing • Parallelising Linear Algebra • Dealing with I/O requirements, mapping files to disk • Observed performance of the MPI version • Two-level Parallelism • Task Farming in genetic algorithms • Reaction Path methods
