PARATEC and the Generation of the Empty States (Starting point for GW/BSE)

Andrew Canning

Computational Research Division, LBNL and Chemical Engineering and Materials Science Dept. UC Davis.

DFT Kohn-Sham (SCF and NSCF)

$\{\phi^{\mathrm{DFT}}_{n\mathbf{k}}(\mathbf{r}),\ E^{\mathrm{DFT}}_{n\mathbf{k}}\}$

Compute Dielectric Function

{ }

GW: Quasiparticle Properties

$\{\phi^{\mathrm{QP}}_{n\mathbf{k}}(\mathbf{r}),\ E^{\mathrm{QP}}_{n\mathbf{k}}\}$

BSE: Construct Kernel (coarse grid)

K(k,c,v,k',c',v')

Interpolate Kernel to Fine Grid / Diagonalize BSE Hamiltonian

$\{A^{s}_{cv\mathbf{k}},\ E^{s}_{cv\mathbf{k}}\}$

Expt. G.E. Jellison, M.F. Chisholm, S.M. Gorbatkin, Appl. Phys. Lett. 62, 3348 (1993).

- 80 carbon atoms, 80x80x4.6 au
- 160 occupied (valence) bands, 800 unoccupied (conduction) bands
- k-points: 1x1x32 (coarse), 1x1x256 (fine)
- Running on Cray XE6 Hopper
- Generation of empty states is ~30% of the computational cost and the largest contribution to wall-clock time
- Scaling issues when running DFT codes for a large number of bands on a relatively small system

- SIESTA (Spanish Initiative for Electronic Simulations with Thousands of Atoms)
- Basis set LCAO (Linear Combination of Atomic Orbitals)
- Less accurate basis allows larger systems to be studied (thousands of atoms)
- Good for non-periodic systems, large molecules
- O(N) algorithms implemented in LCAO basis

- PARSEC (Pseudopotential Algorithm for Real-Space Electronic structure Calculations)
- Grid based real space representation finite-difference approach
- Easy to implement non-periodic boundary conditions
- Good for large molecules etc.

- Quantum Espresso
- Plane Wave basis set (same as BerkeleyGW code)
- PAW (Projector Augmented Wave) option
- Hybrid Functionals

- PARATEC (PARAllel Total Energy Code)
- Plane Wave basis set (same as BerkeleyGW code)
- Good for periodic systems (crystals, metallic systems, etc.)
- Hybrid Functionals
- static-COHSEX
- OpenMP/MPI Hybrid implementation

- PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane wave basis set
- Written in Fortran 90 with MPI
- Designed to run on large parallel machines (Cray, IBM, etc.) but also runs on PCs

- PARATEC uses an all-band CG (conjugate gradient) approach to obtain the electron wavefunctions (blocked communications, specialized 3D FFTs)
- Generally obtains a high percentage of peak performance on different platforms (uses BLAS3 and 1D FFT libraries)
- Developed by Louie and Cohen groups (UCB, LBNL) in collaboration with CRD, NERSC

Breakdown of Computational Costs for Solving Kohn-Sham LDA/GGA Equations in PARATEC

N: number of eigenpairs required (lowest in spectrum)

M: matrix (Hamiltonian) dimension, basis set size (M ~ 100-200N)

Load Balancing, Parallel Data Layout

- Wavefunctions stored as spheres of points (due to energy cutoff)
- Data intensive parts (BLAS) proportional to number of Fourier components
- Pseudopotential calculation and orthogonalization scale as $N^3$ ($N$-atom system)
- FFT part scales as $N^2 \log N$

- Data distribution: load balancing constraints (Fourier Space):
- each processor should have the same number of Fourier coefficients ($N^3$ calculations)
- each processor should have complete columns of Fourier coefficients (3d FFT)
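
The two constraints above can be illustrated with a small sketch. It is not PARATEC's code: it assumes a toy integer grid, a hypothetical cutoff radius, and a greedy "longest column first, to the least-loaded processor" rule, which is one common way to satisfy both constraints at once. The function names `sphere_columns` and `distribute_columns` are made up for this example.

```python
from collections import defaultdict
import heapq


def sphere_columns(radius):
    """Group integer grid points inside a sphere into columns along z.

    Returns a dict mapping (gx, gy) -> number of points in that column.
    """
    r2 = radius * radius
    r = int(radius)
    columns = defaultdict(int)
    for gx in range(-r, r + 1):
        for gy in range(-r, r + 1):
            for gz in range(-r, r + 1):
                if gx * gx + gy * gy + gz * gz <= r2:
                    columns[(gx, gy)] += 1
    return dict(columns)


def distribute_columns(columns, n_procs):
    """Greedy assignment: longest column first, to the least-loaded processor.

    Whole columns are never split, so each processor can do its 1D FFTs
    along z without communication.
    """
    heap = [(0, p) for p in range(n_procs)]  # (points held, processor id)
    heapq.heapify(heap)
    owner = {}
    for col, length in sorted(columns.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        owner[col] = p
        heapq.heappush(heap, (load + length, p))
    loads = [0] * n_procs
    for col, p in owner.items():
        loads[p] += columns[col]
    return owner, loads


columns = sphere_columns(radius=8)          # hypothetical cutoff radius
owner, loads = distribute_columns(columns, n_procs=4)
print("points per processor:", loads)
```

With this rule the difference between the most and least loaded processor is never larger than the longest single column, so the imbalance becomes negligible as the sphere grows.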

FFT

Give out sets of columns of data to each processor

- Grid size $252^3$
- All architectures generally achieve high performance due to computational intensity of code (BLAS3, FFT)
- ES (Earth Simulator) achieves highest overall performance: 5.5 Tflop/s on 2048 procs (5.3 Tflop/s on XT4 on 2048 procs in single-processor-per-node mode)
- FFT used as a benchmark for NERSC procurements (run on up to 18K procs on Cray XT4, weak scaling)
- Vectorisation directives and multiple 1D FFTs required for NEC SX6

Developed with Louie and Cohen’s groups (UCB, LBNL), also work with L. Oliker, J Carter

k-point parallelization: divide k-points among groups of nodes (limited for large systems, molecules, nanostructures etc)

Band parallelization: n nodes divided into groups

PW parallelization: each group parallelizes over PWs
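
The three levels above (k-points, bands, plane waves) can be pictured as a nested processor grid. The sketch below is an illustration only: the function names, the rank ordering and the round-robin assignment of k-points and bands are assumptions, not PARATEC's actual scheme. The k-point and band counts are taken from the 80-atom carbon example earlier in the talk.

```python
def decompose(rank, n_kgroups, n_bandgroups, pw_per_group):
    """Map a flat rank to (k-point group, band group, plane-wave rank)."""
    ranks_per_kgroup = n_bandgroups * pw_per_group
    k_group, rest = divmod(rank, ranks_per_kgroup)
    band_group, pw_rank = divmod(rest, pw_per_group)
    if rank < 0 or k_group >= n_kgroups:
        raise ValueError("rank outside the assumed processor grid")
    return k_group, band_group, pw_rank


def round_robin(n_items, n_groups, group):
    """Indices of the items (k-points or bands) handled by one group."""
    return list(range(group, n_items, n_groups))


# Hypothetical layout: 2 k-point groups x 3 band groups x 4 PW ranks = 24 ranks
n_kgroups, n_bandgroups, pw_per_group = 2, 3, 4
for rank in (0, 5, 17, 23):
    print(rank, decompose(rank, n_kgroups, n_bandgroups, pw_per_group))

# 32 coarse k-points and 160 + 800 bands, as in the carbon example
kpts_for_group_1 = round_robin(32, n_kgroups, 1)
bands_for_group_1 = round_robin(960, n_bandgroups, 1)
print(len(kpts_for_group_1), "k-points,", len(bands_for_group_1), "bands")
```

The k-point level stops helping once there are fewer k-points than groups, which is why the band and plane-wave levels matter for large systems with few k-points.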

OpenMP, Threaded Libs on the node/chip

- fewer MPI messages, which avoids communication bottlenecks
- aggregation of messages per node reduces latency issues
- smaller memory footprint (from code and MPI buffers)
- no on-node MPI messaging
- extra level of parallelism to improve scaling to larger core counts
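
A rough count shows why fewer MPI ranks help. The sketch assumes a full all-to-all exchange among ranks, as in the transpose step of a parallel 3D FFT, and a hypothetical job of 128 twelve-core nodes. It counts messages only and ignores message sizes, which grow as the rank count shrinks.

```python
def alltoall_messages(n_cores, threads_per_rank):
    """Point-to-point messages in one all-to-all among the MPI ranks."""
    if n_cores % threads_per_rank:
        raise ValueError("threads_per_rank must divide n_cores")
    n_ranks = n_cores // threads_per_rank
    return n_ranks * (n_ranks - 1)


n_cores = 1536  # hypothetical job size: 128 nodes x 12 cores
for threads in (1, 6, 12):
    ranks = n_cores // threads
    print(f"{threads:2d} threads/rank: {ranks:4d} ranks, "
          f"{alltoall_messages(n_cores, threads):8d} messages")
```

Going from one rank per core to one rank per 12-core node cuts the message count by roughly a factor of 144 in this model.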

Timing results for the threaded version of PARATEC used to generate VB and CB states for input to the GW code (Cray XT5 Jaguar, 686 Si atoms)

Jaguar Cray XT5 at ORNL (224,162 cores): each node has 2 AMD Istanbul 2.6 GHz 6-core chips (12 cores per node)

- Non-SCF problem is like simulation of metallic system (no gap above top of spectrum)
- Slow convergence
- requires convergence criteria for empty states
- $N_{VB} + N_{CB}$ can be very large
- Operations on subspace matrix can dominate
- High percent of eigenpairs calculated compared to SCF calc.
- Typically almost all the time is for the Non-SCF calc.

Solve self-consistently for $N_{VB}$ valence states

Output

Solve non-self-consistently for $N_{VB} + N_{CB}$ states

Output for GW/BSE codes

Breakdown of Computational Costs for Solving Kohn-Sham LDA/GGA Equations in PARATEC

NSCF calculation for GW/BSE (compared to standard SCF)

$N = N_{VB} + N_{CB}$ (standard SCF: $N = N_{VB}$): number of eigenpairs required

M: matrix (Hamiltonian) dimension, basis set size (NSCF: M ~ 10-20N; standard SCF: M ~ 100-200N)

- Efficient distributed implementation of operations on the subspace matrix using ScaLAPACK
- Extra states calculated above the required number to improve convergence of the CG solver
- Option for using a direct solver on the Hamiltonian: when the percentage of eigenpairs required is high (>10%), this can be faster than the CG iterative solver (P. Zhang)
- Scaling of iterative solver (e.g. CG) ∝ $N^2 M$, compared to direct solver (LAPACK, ScaLAPACK) ∝ $M^3$ (M = matrix size, i.e. basis size or number of PWs; N = number of states)
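
The scaling comparison in the last bullet can be turned into a back-of-the-envelope estimate. The sketch below makes two assumptions that are not in the slides: the N²M cost is treated as the cost of one sweep of the iterative solver, and both methods are given the same prefactor. Under those assumptions the direct solver wins once the iterative solver needs more than (M/N)² sweeps.

```python
def iterative_cost(n_states, m_basis, n_sweeps):
    """Model cost of an iterative solver: n_sweeps * N^2 * M."""
    return n_sweeps * n_states**2 * m_basis


def direct_cost(m_basis):
    """Model cost of direct diagonalization: M^3."""
    return m_basis**3


def break_even_sweeps(n_states, m_basis):
    """Number of sweeps at which the two model costs are equal: (M/N)^2."""
    return (m_basis / n_states) ** 2


# Standard SCF regime, M ~ 150 N (hypothetical sizes)
print("SCF-like  M/N = 150:", break_even_sweeps(100, 15000))

# NSCF regime with 10% of the eigenpairs required, M = 10 N
print("NSCF-like M/N = 10: ", break_even_sweeps(100, 1000))
```

In the SCF-like case the break-even is tens of thousands of sweeps, so the iterative solver is the obvious choice. In the NSCF-like case it drops to about a hundred, which is comparable to what a slowly converging empty-state calculation may need, consistent with the >10% rule of thumb quoted above.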

Block-block data layout

Block size chosen for optimal performance
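
For reference, ScaLAPACK distributes dense matrices in a two-dimensional block-cyclic layout. The sketch below assumes that this is the layout the slide refers to, uses zero-based indices, and assumes the first block lives on process (0, 0). The function names are illustrative and are not ScaLAPACK routines.

```python
def owner(i, j, block, p_rows, p_cols):
    """Process-grid coordinates that own global element (i, j)."""
    return (i // block) % p_rows, (j // block) % p_cols


def local_index(i, block, n_procs):
    """Local index of global index i along one dimension."""
    return (i // (block * n_procs)) * block + i % block


# 8 x 8 subspace matrix, block size 2, 2 x 2 process grid (hypothetical sizes)
n, block, p_rows, p_cols = 8, 2, 2, 2
for i in range(n):
    print(" ".join("{}{}".format(*owner(i, j, block, p_rows, p_cols))
                   for j in range(n)))
```

A small block size spreads the work evenly across processes, while a large block size makes the local BLAS3 calls more efficient, so the best value is a trade-off that depends on the machine.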

- PARATEC optimized for large parallel machines (Cray, IBM)
- OpenMP/Threaded version under development (important to get more parallelism, particularly for small systems for GW/BSE, gives faster time to solution)
- Hybrid Functionals, static-COHSEX (starting point for GW/BSE)
- Some optimization for generation of empty states for GW/BSE
- Direct diagonalization of H for GW/BSE cases where a high percentage of eigenstates is required (to be in the released version soon)