Computational Chemistry at Daresbury Laboratory


Presentation Transcript


  1. Computational Chemistry at Daresbury Laboratory. Quantum Chemistry Group: Martyn F. Guest, Paul Sherwood, and Huub J. J. van Dam (http://www.dl.ac.uk/CFS, http://www.cse.clrc.ac.uk/Activity/QUASI). Molecular Simulation Group: Bill Smith and Maurice Leslie (http://www.dl.ac.uk/TCSC/Software/DL_POLY).

  2. Outline: Performance of Computational Chemistry Codes
  • Serial applications benchmarks
    • GAMESS-UK and DL_POLY
  • Parallel performance on high-end and commodity-class systems
    • NWChem
      • Global Array (GA) tools
      • Parallel eigensolver (PeIGS)
    • GAMESS-UK
      • SCF, DFT, MP2 and 2nd derivatives
    • DL_POLY
      • Version 2: replicated data
      • Version 3: distributed data (domain decomposition)
    • CHARMM and QM/MM calculations
      • Thrombin and TIM benchmarks

  3. GAMESS-UK and DL_POLY Serial Benchmarks

  GAMESS-UK: 12 typical QC calculations
  No.  Module             Basis (GTOs)   Species
  1    SCF                STO-3G (124)   Morphine
  2    SCF                6-31G (154)    C6H3(NO2)3
  3    ECP geometry       ECP DZ (70)    Na7Mg+
  4    Direct-SCF         6-31G (82)     Cytosine
  5    CAS geometry       TZVP (52)      H2CO
  6    MCSCF              EXT1 (74)      H2CO
  7    Direct-CI          EXT2 (64)      H2CO / H2+CO
  8    MRD-CI (26M)       ECP (59)       TiCl4
  9    MP2 geometry       6-31G* (70)    H3SiNCO
  10   SCF 2nd derivs.    6-31G (64)     C5H5N
  11   MP2 2nd derivs.    6-31G* (60)    C4
  12   Direct-MP2         DZP (76)       C5H5N

  DL_POLY: six typical simulations
  No.  Simulation                                                Atoms   Time steps
  1    Na-K disilicate glass                                     1080    300
  2    Metallic Al with Sutton-Chen potential                    256     8000
  3    Valinomycin in 1223 water molecules                       3837    100
  4    Dynamic shell model water with 1024 sites                 768     1000
  5    Dynamic shell model MgCl2 with 1280 sites                 768     1000
  6    Model membrane: 2 membrane chains, 202 solute and
       2746 solvent molecules                                    3148    1000

  4. The GAMESS-UK Benchmark. Performance relative to the Compaq Alpha ES45/EV68-1000 (3.6 minutes).

  5. The DL_POLY Benchmark. Performance relative to the Compaq Alpha ES45/EV68-1000.

  6. High-End Systems Evaluated
  • Cray T3E/1200E
    • 816-processor system at Manchester (CSAR service)
    • 600 MHz EV56 Alpha processors with 256 MB memory
  • IBM SP/WH2-375 and SP/Regatta-H
    • 32-CPU system at DL: 4-way Winterhawk2 SMP "thin nodes" with 2 GB memory, 375 MHz POWER3-II processors with 8 MB L2 cache
    • IBM Regatta-H (32-way node, 1.3 GHz POWER4 CPUs) at Montpellier
    • IBM SP/Regatta-H (8-way LPAR'd nodes, 1.3 GHz) at ORNL
  • Compaq AlphaServer SC
    • 4-way ES40/667 Alpha 21264A (APAC) and 833 MHz SMP nodes (2 GB RAM)
    • TCS1 system at PSC, comprising 750 4-way ES45 nodes (3,000 EV68 CPUs) with 4 GB memory per node and 8 MB L2 cache
    • Quadrics "fat tree" interconnect (5 μs latency, 250 MB/s bandwidth)
  • SGI Origin 3800
    • SARA (1000 CPUs): NUMAlink with R14k/500 and R12k/400 CPUs
  • Cray Supercluster at Eagan
    • Linux Alpha cluster (96 x API CS20s, dual 833 MHz EV67 CPUs, Myrinet)

  7. Commodity Systems (CSx): Prototype / Evaluation Hardware

  System  Location   CPUs  Configuration
  CS1     Daresbury  32    Pentium III / 450 MHz + fast ethernet (EPSRC)
  CS2     Daresbury  64    Alpha/Linux cluster with QSNet: 24 x dual UP2000/EV67-667 plus 8 x dual CS20/EV67-833 ("loki")
  CS3     RAL        16    Athlon K7 / 850 MHz + Myrinet
  CS4     SARA       32    Athlon K7 / 1.2 GHz + fast ethernet
  CS6     CLiC       528   Pentium III / 800 MHz + fast ethernet (Chemnitzer cluster)
  CS7     Daresbury  64    AMD K7/1000 MP + SCALI/SCI ("ukcp")
  CS8     NCSA       320   160 x dual IBM Itanium/800 + Myrinet 2k ("titan")
  CS9     Bristol    96    Pentium 4 Xeon/2000 + Myrinet 2k ("dirac")

  Prototype systems
  CS0     Daresbury  10    10 CPUs, Pentium II/266
  CS5     Daresbury  16    8 x dual Pentium III/933 + SCALI

  www.cse.clrc.ac.uk/Activity/DisCo

  8. Performance Metrics
  1999-2001: we attempted to quantify the performance delivered by the commodity-based systems against the current MPP (the CSAR Cray T3E/1200E) and ASCI-style SMP-node platforms (e.g. SGI Origin 3800), i.e. the metric is expressed as a % of the 32-node Cray T3E:
  • T(32 nodes, Cray T3E/1200E) / T(32 CPUs, CSx), for example:
    • T(32-node T3E) / T(32-node CS1, Pentium III/450 + FE)
    • T(32-node T3E) / T(32-node CS6, Pentium III/800 + FE)
    • T(32-node T3E) / T(32-CPU CS2, Alpha Linux cluster + Quadrics)
  2002: the metric is expressed as a % of the 32-CPU AlphaServer SC ES45/1000 at PSC:
  • T(32 CPUs, AlphaServer SC ES45/1000) / T(32 CPUs, CSx), for example:
    • T(32-CPU AlphaServer ES45) / T(32-CPU CS9, Pentium 4 Xeon/2000 + Myrinet 2k)
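  Restating the metric as a formula (no new data; T denotes the elapsed benchmark time on the stated processor count):

```latex
\mathrm{metric}(\mathrm{CS}x) \;=\; 100\% \times
  \frac{T_{32}(\text{reference system})}{T_{32}(\mathrm{CS}x)}
```

  with the reference system being the Cray T3E/1200E for 1999-2001 and the AlphaServer SC ES45/1000 for 2002; values above 100% mean the commodity system out-performs the reference on that benchmark.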

  9. Beowulf Comparisons with the T3E and Origin 3800/R14k-500

  CSx (Pentium III + FE) as % of the 32-node Cray T3E/1200E:
  Code                        CS1        CS6
  GAMESS-UK SCF               53-69%     96%
  GAMESS-UK DFT               65-85%     130-178%
  GAMESS-UK DFT (Jfit)        44-77%     65-131%
  GAMESS-UK DFT gradient      90%        130%
  GAMESS-UK MP2 gradient      44%        73%
  GAMESS-UK SCF forces        80%        127%
  NWChem (DFT Jfit)           50-60%
  REALC                       67%
  CRYSTAL                     79-145%
  DL_POLY Ewald-based         95-107%    151-184%
  DL_POLY bond constraints    34-56%     69%
  CHARMM                      96%        172%
  CASTEP                      33%        42%
  CPMD                        62%
  ANGUS                       60%        68%
  FLITE3D                     104%

  CS2 (QSNet Alpha Linux cluster) as % of the 32-node Cray T3E and of the O3800/R14k-500:
  Code                        vs. T3E    vs. O3800
  GAMESS-UK SCF               256%       99%
  GAMESS-UK DFT †             301-361%   99%
  GAMESS-UK DFT (Jfit)        219-379%   89-100%
  GAMESS-UK DFT gradient †    289%       89%
  GAMESS-UK MP2 gradient      228%       87%
  GAMESS-UK SCF forces        154%       86%
  NWChem (DFT Jfit) †         150-288%   74-135%
  CRYSTAL †                   349%
  DL_POLY Ewald-based †       363-470%   95%
  DL_POLY bond constraints    143-260%   82%
  CHARMM †                    404%       78%
  CASTEP                      166%       78%
  ANGUS                       145%
  FLITE3D †                   480%

  (Blank cells: only one value reported.)

  10. High-End Computational Chemistry: The NWChem Software
  • Capabilities (direct, semi-direct and conventional):
    • RHF, UHF, ROHF using up to 10,000 basis functions; analytic 1st and 2nd derivatives.
    • DFT with a wide variety of local and non-local XC potentials, using up to 10,000 basis functions; analytic 1st and 2nd derivatives.
    • CASSCF; analytic 1st and numerical 2nd derivatives.
    • Semi-direct and RI-based MP2 calculations for RHF and UHF wave functions using up to 3,000 basis functions; analytic 1st derivatives and numerical 2nd derivatives.
    • Coupled cluster, CCSD and CCSD(T), using up to 3,000 basis functions; numerical 1st and 2nd derivatives of the CC energy.
    • Classical molecular dynamics and free energy simulations, with the forces obtainable from a variety of sources.

  11. Global Arrays

  12. PeIGS 3.0 Parallel Performance (solution of real symmetric generalized and standard eigensystem problems)
  • Features (not available elsewhere):
    • Inverse iteration using the Dhillon-Fann-Parlett parallel algorithm (fastest uniprocessor performance and good parallel scaling)
    • Guaranteed orthonormal eigenvectors in the presence of large clusters of degenerate eigenvalues
    • Packed storage
    • Smaller scratch space requirements
  • Benchmark: full eigensolution performed on a matrix generated in a charge-density fitting procedure (966 fitting functions for a fluorinated biphenyl).
  [Chart: time (s) vs. number of processors (0-64) on the IBM SP and Cray T3E]
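  PeIGS itself is a Fortran/MPI library whose call sequence is not shown on the slides; purely as an illustration of the problem class it solves, the following serial SciPy sketch sets up a real symmetric generalized eigenproblem H C = S C diag(ε) of the kind produced by a charge-density fitting step (the random matrices are invented; only the dimension 966 comes from the slide).

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative size only: the slide's benchmark matrix has 966 fitting functions.
n = 966
rng = np.random.default_rng(0)

# Build a random symmetric H and a symmetric positive-definite "overlap" S.
A = rng.standard_normal((n, n))
H = 0.5 * (A + A.T)
B = rng.standard_normal((n, n))
S = B @ B.T + n * np.eye(n)          # shifted to guarantee positive definiteness

# Generalized symmetric eigenproblem H C = S C diag(eps): the same problem
# class PeIGS solves in parallel with guaranteed orthonormal eigenvectors
# (here orthonormality is with respect to the S metric).
eps, C = eigh(H, S)

# Check S-orthonormality of the eigenvectors: C^T S C should be the identity.
assert np.allclose(C.T @ S @ C, np.eye(n), atol=1e-6)
```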

  13. Scalability of Numerical Algorithms: real symmetric eigenvalue problems on the SGI Origin 3800/R12k-400 ("green"). [Charts: time (s) vs. number of processors]

  14. Parallel Eigensolvers: real symmetric eigenvalue problems, including ScaLAPACK (PDSYEV), on CS7 (AMD K7/1000 MP + SCALI) and CS9 (Pentium 4 Xeon/2000 + Myrinet). [Charts: measured time (s) vs. number of processors]

  15. Case Studies - Zeolite Fragments
  • DFT calculations with Coulomb fitting
  • Basis (Godbout et al.):
    • DZVP - O, Si
    • DZVP2 - H
  • Fitting basis:
    • DGAUSS-A1 - O, Si
    • DGAUSS-A2 - H
  • NWChem and GAMESS-UK: both codes use an auxiliary fitting basis for the Coulomb energy, with the 3-centre 2-electron integrals held in core
  • Fragments studied: Si8O7H18 (347/832), Si8O25H18 (617/1444), Si26O37H36 (1199/2818), Si28O67H30 (1687/3928)

  16. DFT Coulomb Fit - NWChem: Si8O25H18 (617/1444) and Si8O7H18 (347/832). [Charts: measured time (s) vs. number of CPUs; annotations 88%,104% and 76%,95%]

  17. DFT Coulomb Fit - NWChem: Si26O37H36 (1199/2818) and Si28O67H30 (1687/3928). T(IBM SP/P2SC-120, 256 CPUs) = 1137 s and 2766 s respectively. [Charts: measured time (s) vs. number of CPUs; annotations 85%,227% and 79%,210%]

  18. Memory-driven Approaches: NWChem - DFT (LDA). 1. Performance on the SGI Origin 3800
  Zeolite ZSM-5
  • DZVP basis (DZV_A2) and DGauss A1_DFT fitting basis:
    • AO basis: 3554
    • CD basis: 12713
  • MIPS R14k-500 CPUs (Teras)
  • Wall time (13 SCF iterations):
    • 64 CPUs = 5,242 seconds
    • 128 CPUs = 3,951 seconds
    • Estimated time on 32 CPUs = 40,000 seconds
  • 3-centre 2e-integrals = 1.00 × 10^12
  • After Schwarz screening = 5.95 × 10^10
  • % of 3c 2e-integrals held in core = 100%

  19. Memory-driven Approaches: NWChem - DFT (LDA). 2. Performance on the HP/Compaq AlphaServer SC
  Pyridine in Zeolite ZSM-5
  • DZVP basis (DZV_A2) and DGauss A1_DFT fitting basis:
    • AO basis: 5457
    • CD basis: 12713
  • 256 EV67/667 CPUs (64 Compaq AlphaServer SC nodes)
  • Wall time (10 SCF iterations) on 256 CPUs = 11,960 seconds (60% efficiency)
  • 3-centre 2e-integrals = 3.79 × 10^12
  • After Schwarz screening = 2.81 × 10^11
  • % of 3c 2e-integrals held in core = 1.66%

  20. Parallel Implementations of GAMESS-UK
  • Extensive use of Global Array (GA) tools and parallel linear algebra from the NWChem project (EMSL)
  • SCF and DFT
    • Replicated data, but ...
    • GA tools for caching of I/O for restart and checkpoint files
    • Storage of 2-centre 2-electron integrals in DFT Jfit
    • Linear algebra (via PeIGS, DIIS/MMOs, inversion of the 2c-2e matrix)
  • SCF second derivatives
    • Distribution of <vvoo> and <vovo> integrals via GAs
  • MP2 gradients
    • Distribution of <vvoo> and <vovo> integrals via GAs

  21. GAMESS-UK SCF PerformanceCray T3E/1200E, High-end and Commodity-based Systems Elapsed Time (seconds) T3E128 = 436 95%,135% Impact of Serial Linear Algebra: TIBM-SP(16) = 2656 [1289] TIBM-SP(32) = 2184 [ 821] Number of CPUs Cyclosporin:(3-21G Basis, 1000 GTOS)

  22. GAMESS-UK DFT B3LYP Performance: Cray T3E/1200, High-end and Commodity-based Systems. Cyclosporin, 6-31G basis, 1000 GTOs. [Chart: elapsed time (s) vs. number of CPUs; annotation 70%,117%]

  23. GAMESS-UK DFT B3LYP Performance: the Cray T3E/1200 and High-end Systems. Cyclosporin, 6-31G basis, 1000 GTOs. [Charts: elapsed time (s) and speed-up (S = 106.7) vs. number of CPUs]

  24. DFT BLYP Gradient: Cray T3E/1200, High-end and Commodity-based Systems. Geometry optimisation of the polymerisation catalyst Cl(C3H5O).Pd[(P(CMe3)2)2.C6H4]; basis 3-21G* (446 GTOs); 10 energy + gradient evaluations. [Charts: elapsed time (s) and speed-up (vs. linear) against number of CPUs; annotation 69%,114%]

  25. Auxiliary Basis Coulomb Fit (I)
  The approach is based on the expansion of the charge density in an auxiliary basis of Gaussian functions, where V is the matrix of 2-centre 2-electron repulsion integrals in the charge-density basis and b are the three-centre electron repulsion integrals between the wavefunction basis set and the charge-density basis. As suggested by Dunlap, a variational choice of the fitting coefficients C can be obtained as follows:
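  The equations on the original slide did not survive the transcript; a standard statement of the Dunlap fit, consistent with the definitions of V and b given above, is:

```latex
% Expansion of the density in the auxiliary (charge-density) Gaussians \chi_u
\rho(\mathbf{r}) \;\approx\; \tilde{\rho}(\mathbf{r}) \;=\; \sum_{u} C_{u}\,\chi_{u}(\mathbf{r})

% Variational (Dunlap) choice of the fitting coefficients, with
%   V_{uv} = (u|v)                       (2-centre 2-electron integrals in the fitting basis)
%   b_{u}  = \sum_{pq} D_{pq}\,(pq|u)    (3-centre integrals contracted with the density matrix D)
C \;=\; V^{-1}\, b
```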

  26. Auxiliary Basis Coulomb Fit (II)
  • The number of 3-centre integrals is significantly smaller than the number of 4-centre integrals used in the conventional Coulomb evaluation, but for large molecules additional screening is required.
  • We make use of the Schwarz inequality (see below), where p and q are AO basis functions and u are the fitting functions. Since screening is applied on a shell basis, the maximal integrals for each shell quartet are stored.
  • Using this screening, and exploiting the aggregate memory of a parallel machine, it is possible to hold a significant fraction of the 3-centre integrals in core.
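  The inequality itself is missing from the transcript; the standard Schwarz bound for the three-centre integrals, in the notation above, is:

```latex
\bigl|\,(pq\,|\,u)\,\bigr| \;\le\; \sqrt{(pq\,|\,pq)}\;\sqrt{(u\,|\,u)}
```

  A minimal numpy sketch of how such precomputed bounds are used to skip negligible (pq|u) batches (purely illustrative; the production codes apply the test per shell in Fortran, and all numbers below are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, n_fit, thresh = 5000, 300, 1e-10

# Precomputed Schwarz factors spanning many orders of magnitude:
# q_pq = sqrt((pq|pq)) for AO shell pairs, q_u = sqrt((u|u)) for fitting shells.
q_pq = 10.0 ** rng.uniform(-9, 0, n_pairs)
q_u = 10.0 ** rng.uniform(-9, 0, n_fit)

# A (pq, u) batch is computed and stored only if its upper bound exceeds the threshold.
significant = np.outer(q_pq, q_u) > thresh
print(f"batches kept: {significant.sum()} of {n_pairs * n_fit} "
      f"({100.0 * significant.mean():.1f}%)")
```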

  27. GAMESS-UK: DFT HCTH on Valinomycin - Impact of Coulomb Fitting
  Basis: DZV_A2 (DGauss); A1_DFT fit: 882/3012
  [Charts: measured time (s) vs. number of CPUs for JEXPLICIT and JFIT; annotations 61%,93% and 73%,161%; T(T3E/1200E, 128 CPUs) = 995 s and 2139 s]

  28. GAMESS-UK: DFT HCTH on Valinomycin - Impact of Coulomb Fitting on the Cray T3E/1200, Cray Supercluster/833, Compaq AlphaServer SC/667 and SGI Origin R14k/500. Basis: DZV_A2 (DGauss); A1_DFT fit: 882/3012. [Charts: measured time (s) vs. number of CPUs for JEXPLICIT and JFIT]

  29. GAMESS-UK: DFT HCTH on Valinomycin - Speed-ups for both explicit Coulomb evaluation (JEXPLICIT) and Coulomb fitting (JFIT). [Charts: speed-up vs. number of CPUs; annotation 105+]

  30. Memory-driven Approaches - SCF and DFT
  HF and DFT energy and gradient calculation for NiSiMe2CH2CH2C6F13; basis: Ahlrichs DZ (1485 GTOs). Integrals are written directly to memory, rather than to disk, and are not re-calculated. SGI Origin 3800/R14k-500. [Chart: elapsed time (s) vs. number of CPUs]

  31. MP2 Gradient Algorithms
  Serial:
  • Conventional: integrals written to disk, read back, transformed, written out, re-sorted etc. - heavy I/O demands
  • Direct/semi-direct (Frisch, Head-Gordon & Pople; Haase and Ahlrichs): replace all/some I/O with batched integral recomputation
  Parallel:
  • The poor I/O-to-compute performance of MPPs favours the direct approach
  • Current MPPs have large global memories, so store a subset of the MO integrals:
    • reduces the number of integral recomputations
    • increases the communication overhead
  • The stored subset includes the VOVO, VVOO and VOOO classes; the VVVO class is too large to store, so the VVVO terms are computed in a separate step

  32. Performance of the MP2 Gradient Module: Cray T3E/1200, High-end and Commodity-based Systems. Mn(CO)5H MP2 geometry optimisation; basis: TZVP + f (217 GTOs). [Charts: elapsed time (s) and speed-up vs. number of CPUs; annotation 59%,78%]

  33. SCF Analytic 2nd Derivatives Performance: Cray T3E/1200, High-end and Commodity-based Systems
  (C6H4(CF3))2; basis: 6-31G (196 GTOs)
  • Terms from the MO 2-electron integrals held in GA storage (CPHF and perturbed Fock matrices); the calculation is dominated by CPHF:
    • Gaussian98 - L1002 (CPU time) - 32 nodes: 1181 s; 64 nodes: 1058 s
    • GAMESS-UK (total job time) - 128 nodes: 499 s
  [Charts: elapsed time (s) vs. number of CPUs; annotation 92%,148%; G98 reference times 2271, 1706 and 1490 s]

  34. Performance Analysis of GA-based Applications using Vampir 2.5
  • Vampir: visualization and analysis of MPI programs
  • GAMESS-UK on high-end and commodity-class machines
  • Extensions to handle GA applications

  35. GAMESS-UK / Si8O25H18 on 8 CPUs: one DFT cycle

  36. GAMESS-UK / Si8O25H18 on 8 CPUs: Q†HQ (GAMULT2) and PeIGS

  37. Materials Simulation Codes
  • Local Gaussian basis set codes:
    • CRYSTAL
  • Plane-wave DFT codes:
    • CASTEP
    • VASP
    • CPMD
  These codes have similar functionality, power and problems. CASTEP is the flagship code of UKCP, and hence the subsequent discussion will focus on it. This code presents a different set of problems when considering performance on HPC(x).
  • SIESTA and CONQUEST:
    • O(N)-scaling codes which will be extremely attractive to users
    • Both are currently development rather than production codes

  38. CRYSTAL-2000
  • Distributed-data implementation
  • Benchmark: an acid centre in zeolite-Y (faujasite)
    • Single-point energy
    • 145 atoms/cell, no symmetry, 8 k-points
    • 2208 basis functions (6-21G*)
  • T(256 CPUs, SGI Origin 3800) = 945 s
  [Charts: elapsed time (s) and speed-up vs. number of CPUs]

  39. Materials Simulation - Plane Wave Methods: CASTEP
  • Direct minimisation of the total energy (avoiding diagonalisation)
  • Pseudopotentials must be used to keep the number of plane waves manageable
  • Large number of basis functions, N ~ 10^6 (especially for heavy atoms)
  • The plane-wave expansion means that the bulk of the computation comprises large 3D Fast Fourier Transforms (FFTs) between real and momentum space (see the sketch below)
    • These are distributed across the processors in various ways
    • The actual FFT routines are optimized for the cache size of the processor
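  As a rough illustration of the real-space/momentum-space transforms that dominate plane-wave codes (not CASTEP's distributed routines; the grid size echoes the 54x54x54 mesh quoted on the next slide and the test function is invented):

```python
import numpy as np

# Illustrative grid: slide 40 quotes a 54x54x54 FFT mesh for the chabazite benchmark.
n, L = 54, 10.0                      # grid points per side, box length (arbitrary units)
x = np.linspace(0.0, L, n, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")

# A smooth periodic density-like function on the real-space grid.
rho_real = 1.5 + np.cos(2 * np.pi * X / L) * np.cos(4 * np.pi * (Y + Z) / L)

# Real space -> momentum space and back: the transform pair executed
# repeatedly inside each SCF/minimisation step of a plane-wave code.
rho_recip = np.fft.fftn(rho_real)
rho_back = np.fft.ifftn(rho_recip).real

# The round trip is exact up to floating-point error.
assert np.allclose(rho_back, rho_real)
```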

  40. CASTEP 4.2 - Parallel Benchmark: Chabazite
  Acid sites in a zeolite (Si11O24AlH); Vanderbilt ultrasoft pseudopotential; Pulay density-mixing minimiser scheme; single k-point total energy, 96 bands; 15045 plane waves on a 3D FFT grid of size 54x54x54; convergence in 17 SCF cycles.

  Time in communications (seconds):
  IBM SP/WH2-375           157
  Cray T3E/1200E            90
  CS6 PIII/800 + FE        600
  CS7 AMD K7/1000 + SCI    242
  CS9 P4/2000 + Myrinet    115
  CS2 QSNet Alpha          111
  SGI Origin 3800/R14k      71

  [Chart: measured time (s) vs. number of CPUs; annotation 67%,75%]

  41. CASTEP 4.2 - Parallel Benchmark II: TiN
  A 32-atom TiN slab, 8 k-points, single-point energy calculation with Mulliken analysis. [Chart: measured time (s) vs. number of CPUs; annotation 47%,81%]

  42. CPMD - Car-Parrinello Molecular Dynamics
  CPMD version 3.5.1: Hutter, Alavi, Deutsch, Bernasconi, Goedecker, Marx, Tuckerman and Parrinello (1995-2001). DFT, plane waves, pseudopotentials and FFTs. CPMD is the base code for the new CCP1 flagship project.
  Benchmark example: liquid water (Sprik and Vuilleumier, Cambridge)
  • Physical specification: 32 molecules, simple cubic periodic box of length 9.86 Å, temperature 300 K
  • Electronic structure: BLYP functional, Troullier-Martins pseudopotential, reciprocal-space cutoff 70 Ry = 952 eV
  [Chart: elapsed time (s) vs. number of CPUs; annotation 30%,53%]

  43. DL_POLY Parallel Benchmarks (Cray T3E/1200), V2: Replicated Data
  1. Metallic Al (19,652 atoms, Sutton-Chen)
  2. Peptide in water (neutral groups + SHAKE, 3,993)
  3. Transferrin in water (neutral groups + SHAKE, 27,593)
  4. NaCl (Ewald, 27,000 ions)
  5. NaK-disilicate glass (8,640 atoms, MTS + Ewald)
  6. K/valinomycin in water (SHAKE, AMBER, 3,838)
  7. Gramicidin in water (SHAKE, 12,390)
  8. MgO microcrystal (5,416 atoms)
  9. Model membrane/valinomycin (MTS, 18,886)
  [Charts: speed-up (vs. linear) against number of nodes]

  44. DL_POLY: Cray T3E, High-end and Commodity-based Systems
  Bench 4: NaCl; 27,000 ions, Ewald, 75 time steps, cutoff = 24 Å. T(128-node T3E) = 94 s. [Chart: measured time (s) vs. number of CPUs; annotation 44%,71%]

  45. DL_POLY: Scalability on the T3E, High-end and Commodity Systems
  Bench 5: NaK-disilicate glass; 8,640 atoms, MTS + Ewald, 270 time steps. T(128-node T3E) = 75 s. [Charts: measured time (s) and speed-up vs. number of CPUs; annotation 53%,84%]

  46. DL_POLY: Macromolecular Simulations
  Bench 7: Gramicidin in water; rigid bonds and SHAKE, 12,390 atoms, 500 time steps. T(128-node T3E) = 166 s. [Chart: measured time (s) vs. number of CPUs; annotation 41%,64%]

  47. Migration from Replicated to Distributed Data - DL_POLY-3: Domain Decomposition
  • Distribute atoms and forces across the nodes (see the sketch below)
    • More memory efficient: can address much larger cases (10^5 - 10^7 atoms)
  • SHAKE and short-range forces require only neighbour communication
    • communications scale linearly with the number of nodes
  • Coulombic energy remains global
    • strategy depends on problem and machine characteristics
    • adopt the Particle Mesh Ewald (PME) scheme, which includes a Fourier transform of the smoothed charge density (reciprocal-space grid typically 64x64x64 to 128x128x128)
  [Diagram: simulation cell split into domains A-D]
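  A minimal numpy sketch of the data distribution (not DL_POLY-3's actual code; all sizes are illustrative): atoms in a periodic box are binned into a 2x2x2 grid of domains, so each node stores only ~N/8 atoms instead of the full replicated set, and short-range forces need data only from neighbouring domains.

```python
import numpy as np

rng = np.random.default_rng(2)
n_atoms, box = 27_000, 50.0          # illustrative numbers (cf. the NaCl benchmark size)
ndom = np.array([2, 2, 2])           # 2x2x2 domain grid -> 8 "nodes"

# Random atomic coordinates in a periodic cubic box.
coords = rng.random((n_atoms, 3)) * box

# Assign each atom to the domain whose spatial block contains it.
cell_idx = np.floor(coords / box * ndom).astype(int)    # per-axis index 0 or 1
domain = np.ravel_multi_index(cell_idx.T, ndom)         # flat domain id 0..7

# Each domain holds roughly n_atoms / 8 atoms: the replicated-data cost of
# every node storing all 27,000 atoms becomes ~3,375 atoms per node.
counts = np.bincount(domain, minlength=ndom.prod())
print(dict(enumerate(counts)))
```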

  48. Migration from Replicated to Distributed Data - DL_POLY-3: Coulomb Energy Evaluation
  • Conventional routines (e.g. FFTW) assume plane or column distributions; a global transpose of the data is required to complete the 3D FFT, and additional costs are incurred re-organising the data from the natural block domain decomposition.
  • An alternative FFT algorithm has been designed to reduce communication costs (see the sketch below):
    • the 3D FFT is performed as a series of 1D FFTs, each involving communications only between blocks in a given column
    • more data is transferred, but in far fewer messages
    • rather than all-to-all, the communications are column-wise only
  [Diagram: plane vs. block data distributions]
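  A serial numpy analogue of the decomposition (the column-wise communication itself is not shown): a 3D FFT is mathematically identical to applying 1D FFTs along each axis in turn, which is why each stage of the distributed algorithm needs data only from the blocks sharing a column along that axis.

```python
import numpy as np

# Small illustrative grid; DL_POLY-3's reciprocal-space grids are typically 64^3 - 128^3.
n = 16
rng = np.random.default_rng(3)
rho = rng.standard_normal((n, n, n))   # smoothed charge density on the mesh

# Reference: library 3D FFT in one call.
ref = np.fft.fftn(rho)

# The same transform as three passes of 1D FFTs, one axis at a time; in the
# distributed version each pass only exchanges data within one column of blocks.
staged = np.fft.fft(rho, axis=0)
staged = np.fft.fft(staged, axis=1)
staged = np.fft.fft(staged, axis=2)

assert np.allclose(staged, ref)
```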

  49. Migration from Replicated to Distributed Data - DL_POLY-3: Coulomb Energy Performance
  NaCl simulation: DL_POLY 2.11 with Ewald summation vs. DL_POLY 3 with Particle Mesh Ewald
  • DL_POLY 2.11: 27,000 ions, 500 time steps, cutoff = 24 Å
  • DL_POLY 3: 27,000 ions, 500 time steps, cutoff = 12 Å
  • DL_POLY 3: 216,000 ions, 200 time steps, cutoff = 12 Å
  [Chart: speed-up vs. number of CPUs]

  50. Migration from Replicated to Distributed Data - DL_POLY-3: Macromolecular Simulations
  Gramicidin in water; rigid bonds + SHAKE
  • DL_POLY 2.11: 12,390 atoms, 500 time steps
  • DL_POLY 3: 99,120 atoms, 100 time steps
  • DL_POLY 3: 792,960 atoms, 50 time steps
  [Chart: speed-up vs. number of CPUs]
