PARATEC and the Generation of the Empty States (Starting point for GW/BSE)

Andrew Canning
Computational Research Division, LBNL, and Chemical Engineering and Materials Science Dept., UC Davis

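GW/BSE Method Overview
• DFT Kohn-Sham (SCF and NSCF): $\{\phi^{\mathrm{DFT}}_{n\mathbf{k}}(\mathbf{r}),\ E^{\mathrm{DFT}}_{n\mathbf{k}}\}$
• Compute dielectric function
• GW: quasiparticle properties: $\{\phi^{\mathrm{QP}}_{n\mathbf{k}}(\mathbf{r}),\ E^{\mathrm{QP}}_{n\mathbf{k}}\}$
• BSE: construct kernel (coarse grid): $K(\mathbf{k},c,v,\mathbf{k}',c',v')$
• Interpolate kernel to fine grid / diagonalize BSE Hamiltonian: $\{A^{s}_{cv\mathbf{k}},\ E^{s}_{cv\mathbf{k}}\}$
Experimental data: G.E. Jellison, M.F. Chisholm, S.M. Gorbatkin, Appl. Phys. Lett. 62, 3348 (1993).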

Computational Cost: GW Method for a Nanotube
• 80 carbon atoms, 80 x 80 x 4.6 a.u.
• 160 occupied (valence) bands, 800 unoccupied (conduction) bands
• k-points: 1x1x32 (coarse), 1x1x256 (fine)
• Running on Cray XE6 Hopper
• Generation of empty states is ~30% of the computational cost and the largest part in terms of wall-clock time
• Scaling issues when running DFT codes for a large number of bands (on a relatively small system)

Features of Different Codes for Generation of Empty States (what to use for GW/BSE?)
• SIESTA (Spanish Initiative for Electronic Simulations with Thousands of Atoms)
    • LCAO basis set (Linear Combination of Atomic Orbitals)
    • Less accurate basis allows larger systems to be studied (thousands of atoms)
    • Good for non-periodic systems, large molecules
    • O(N) algorithms implemented in LCAO basis
• PARSEC (Pseudopotential Algorithm for Real-Space Electronic structure Calculations)
    • Grid-based real-space representation, finite-difference approach
    • Easy to implement non-periodic boundary conditions
    • Good for large molecules etc.
• Quantum Espresso
    • Plane-wave basis set (same as BerkeleyGW code)
    • PAW (Projector Augmented Wave) option
    • Hybrid functionals
• PARATEC (PARAllel Total Energy Code)
    • Plane-wave basis set (same as BerkeleyGW code)
    • Good for periodic systems (crystals etc., metallic systems)
    • Hybrid functionals
    • Static COHSEX
    • Hybrid OpenMP/MPI implementation

PARATEC (PARAllel Total Energy Code)
• PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane-wave basis set
• Written in F90 and MPI
• Designed to run on large parallel machines (Cray, IBM etc.) but also runs on PCs
• Uses an all-band CG approach to obtain the electron wavefunctions (blocked communications, specialized 3D FFTs)
• Generally obtains a high percentage of peak performance on different platforms (uses BLAS3 and 1D FFT libraries)
• Developed by the Louie and Cohen groups (UCB, LBNL) in collaboration with CRD, NERSC

Breakdown of Computational Costs for Solving Kohn-Sham LDA/GGA Equations in PARATEC
• N: number of eigenpairs required (lowest in spectrum)
• M: matrix (Hamiltonian) dimension, i.e. basis set size (M ~ 100-200 N)

Load Balancing, Parallel Data Layout
• Wavefunctions stored as spheres of points (due to energy cutoff)
• Data-intensive parts (BLAS) are proportional to the number of Fourier components
• Pseudopotential calculation and orthogonalization scale as $N^3$ (N-atom system)
• FFT part scales as $N^2 \log N$
• Data distribution: load balancing constraints (Fourier space):
    • each processor should have the same number of Fourier coefficients ($N^3$ calculations)
    • each processor should have complete columns of Fourier coefficients (3D FFT)
• Give out sets of columns of data to each processor
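The two load-balancing constraints above can be illustrated with a small sketch. It assumes a greedy heuristic (longest column first, given to the processor that currently holds the fewest points); the slides do not say which heuristic PARATEC uses, so this is an illustration only. The function names `sphere_columns` and `assign_columns` are hypothetical, and the sphere is built on an integer grid rather than from a real energy cutoff.

```python
import heapq
import math


def sphere_columns(radius):
    """Return {(x, y): length} for the z-columns of integer grid points
    that lie inside a sphere of the given integer radius."""
    cols = {}
    r2 = radius * radius
    for x in range(-radius, radius + 1):
        for y in range(-radius, radius + 1):
            rem = r2 - x * x - y * y
            if rem < 0:
                continue  # this (x, y) column does not intersect the sphere
            zmax = math.isqrt(rem)
            cols[(x, y)] = 2 * zmax + 1  # points with z = -zmax .. zmax
    return cols


def assign_columns(cols, nproc):
    """Assumed greedy heuristic: hand out whole columns, longest first,
    each to the processor with the fewest points so far."""
    heap = [(0, p) for p in range(nproc)]  # (current load, processor id)
    heapq.heapify(heap)
    owner = {}
    for xy, length in sorted(cols.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        owner[xy] = p  # the whole column lives on one processor
        heapq.heappush(heap, (load + length, p))
    loads = [0] * nproc
    for xy, p in owner.items():
        loads[p] += cols[xy]
    return owner, loads


if __name__ == "__main__":
    cols = sphere_columns(8)
    owner, loads = assign_columns(cols, 4)
    print("points per processor:", loads)
```

Because columns are never split, each processor can do its 1D FFTs along z without communication, and the greedy rule keeps the per-processor point counts within one column length of each other.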

PARATEC: Performance
• Grid size $252^3$
• All architectures generally achieve high performance due to the computational intensity of the code (BLAS3, FFT)
• ES (Earth Simulator) achieves the highest overall performance: 5.5 Tflop/s on 2048 processors (5.3 Tflop/s on the Cray XT4 on 2048 processors in single-processor-per-node mode)
• The FFT was used as a benchmark for NERSC procurements (run on up to 18K processors on the Cray XT4, weak scaling)
• Vectorization directives and multiple 1D FFTs required for the NEC SX6
Developed with Louie and Cohen's groups (UCB, LBNL); also work with L. Oliker and J. Carter.

Parallelization in PW DFT Codes: Four Levels (k-points, bands, PWs, OpenMP)
• k-point parallelization: divide k-points among groups of nodes (limited for large systems, molecules, nanostructures etc.)
• Band parallelization: n nodes divided into groups
• PW parallelization: each group parallelizes over PWs
• OpenMP, threaded libraries on the node/chip
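As a rough illustration of how the first three levels can nest, the sketch below maps a flat MPI-style rank onto a (k-point group, band group, plane-wave rank) triple. The function `rank_layout` and the row-major ordering are assumptions made for this example; they are not taken from PARATEC or any other code named in these slides. The fourth level (OpenMP threads) would live inside each rank and is not modeled here.

```python
def rank_layout(rank, n_kgroups, n_bandgroups, n_pw):
    """Map a flat rank to (k-point group, band group, PW rank).

    Hypothetical row-major layout: ranks are split first into k-point
    groups, each k-point group into band groups, and each band group
    into n_pw ranks that share the plane-wave coefficients.
    """
    total = n_kgroups * n_bandgroups * n_pw
    if not 0 <= rank < total:
        raise ValueError(f"rank {rank} outside 0..{total - 1}")
    kgroup, rest = divmod(rank, n_bandgroups * n_pw)
    bandgroup, pw_rank = divmod(rest, n_pw)
    return kgroup, bandgroup, pw_rank


if __name__ == "__main__":
    # 2 k-point groups x 3 band groups x 4 PW ranks = 24 ranks in total
    for r in (0, 13, 23):
        print(r, rank_layout(r, 2, 3, 4))
```

For a system with few k-points (a large molecule or nanostructure), `n_kgroups` would be small, which is why the band, PW and threading levels carry most of the parallelism in that regime.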

OpenMP, Threading for On-Node/Chip Parallelism
• Fewer MPI messages avoid communication bottlenecks
• Aggregation of messages per node reduces latency issues
• Smaller memory footprint (from code and MPI buffers)
• No on-node MPI messaging
• Extra level of parallelism to improve scaling to larger core counts
Timing results for the threaded version of the PARATEC code used to generate VB and CB states for input to the GW code: PARATEC (Cray XT5 Jaguar), 686 Si atoms.
Jaguar, Cray XT5 at ORNL (224,162 cores). Node: 2 AMD Istanbul 2.6 GHz 6-core chips (12 cores in total, 2 x 6 cores).

Non-SCF Problem to Generate Empty CB States
• The non-SCF problem is like the simulation of a metallic system (no gap above the top of the spectrum)
    • Slow convergence
    • Requires convergence criteria for empty states
• $N_{VB} + N_{CB}$ can be very large
    • Operations on the subspace matrix can dominate
    • High percentage of eigenpairs calculated compared to the SCF calculation
• Typically almost all the time is spent in the non-SCF calculation
Workflow: solve self-consistently for $N_{VB}$ valence states -> output -> solve non-self-consistently for $N_{VB} + N_{CB}$ states -> output for GW/BSE codes

Breakdown of Computational Costs for Solving Kohn-Sham LDA/GGA Equations in PARATEC: NSCF Calculation for GW/BSE (compared to standard SCF)
• N: number of eigenpairs required; $N = N_{VB} + N_{CB}$ for NSCF ($N = N_{VB}$ for SCF)
• M: matrix (Hamiltonian) dimension, i.e. basis set size; M ~ 10-20 N for NSCF (M ~ 100-200 N for SCF)

PARATEC Features for the Non-SCF Problem
• Efficient distributed implementation of operations on the subspace matrix using ScaLAPACK
• Extra states calculated above the required number to improve convergence of the CG solver
• Option to use a direct solver on the Hamiltonian: when the percentage of eigenpairs required is high (>10%) it can be faster than the iterative CG solver (P. Zhang)
• Scaling of an iterative solver (e.g. CG) is $\propto N^2 M$, compared to $\propto M^3$ for a direct solver (LAPACK, ScaLAPACK), where M = matrix size (basis, number of PWs) and N = number of states
Block-block data layout; block size chosen for optimal performance.
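The scaling statement above can be made concrete with a back-of-the-envelope model. The sketch below only compares the leading-order operation counts $N^2 M$ (per iteration) and $M^3$; prefactors, the iteration count `n_iter`, and all function names are assumptions for illustration, so the numbers indicate trends rather than timings.

```python
def iterative_cost(n_states, m_basis, n_iter=1):
    """Leading-order cost of an iterative solver: n_iter * N^2 * M.
    n_iter is an assumed number of iterations, not a measured value."""
    return n_iter * n_states**2 * m_basis


def direct_cost(m_basis):
    """Leading-order cost of direct diagonalization: M^3."""
    return m_basis**3


def cost_ratio(n_states, m_basis, n_iter=1):
    """Iterative / direct cost; equals n_iter * (N / M)^2."""
    return iterative_cost(n_states, m_basis, n_iter) / direct_cost(m_basis)


if __name__ == "__main__":
    m = 10_000
    for frac in (0.01, 0.05, 0.10):  # fraction of eigenpairs required, N / M
        n = int(frac * m)
        print(f"N/M = {frac:.2f}: ratio per iteration = {cost_ratio(n, m):.4f}")
```

In this model the ratio per iteration is $(N/M)^2$, so at N/M = 0.1 the iterative solver breaks even with the direct one after about 100 iterations, before accounting for prefactors. That is consistent with the slide's point that a direct solver can win once a large fraction of the spectrum is needed.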

PARATEC Summary and Future Developments
• PARATEC is optimized for large parallel machines (Cray, IBM)
• OpenMP/threaded version under development (important for exposing more parallelism, particularly for small systems for GW/BSE; gives faster time to solution)
• Hybrid functionals, static COHSEX (starting point for GW/BSE)
• Some optimization for generation of empty states for GW/BSE
• Direct diagonalization of H for cases where a high percentage of eigenstates is required for GW/BSE (to be in the released version soon)
