
HFODD for Leadership Class Computers



Presentation Transcript


  1. HFODD for Leadership Class Computers
     N. Schunck, J. McDonnell, Hai Ah Nam

  2. HFODD

  3. HFODD for Leadership Class Computers
     DFT AND HPC COMPUTING

  4. Classes of DFT solvers
  • Coordinate-space: direct integration of the HFB equations
  • Accurate: provides the "exact" result
  • Slow and CPU/memory intensive for 2D-3D geometries
  • Configuration space: expansion of the solutions in a basis (harmonic oscillator, HO)
  • Fast and amenable to beyond-mean-field extensions
  • Truncation effects: source of divergences/renormalization issues
  • Wrong asymptotic behavior unless different bases are used (WS, PTG, Gamow, etc.)
  [Table: resources needed for a "standard HFB" calculation]
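  For reference, both classes of solvers address the same underlying problem. A schematic, textbook form of the HFB equations (standard notation assumed here, not taken from the slides) reads, in LaTeX:

    \begin{pmatrix} h - \lambda & \Delta \\ -\Delta^* & -(h - \lambda)^* \end{pmatrix}
    \begin{pmatrix} U_k \\ V_k \end{pmatrix}
    = E_k
    \begin{pmatrix} U_k \\ V_k \end{pmatrix}

  where h is the mean field, Δ the pairing field, λ the chemical potential, and (U_k, V_k) the quasiparticle amplitudes with energies E_k. Coordinate-space solvers integrate these equations on a spatial mesh; configuration-space solvers such as HFODD expand U_k and V_k on HO basis states and diagonalize the resulting finite matrix.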

  5. Why High Performance Computing?
  Core of DFT: a global theory which averages out individual degrees of freedom
  • From light nuclei to neutron stars
  • Rich physics
  • Fast and reliable
  Open questions:
  • Treatment of correlations?
  • ~100 keV level precision?
  • Extrapolability?
  The ground state of an even-even nucleus can be computed in a matter of minutes on a standard laptop: why bother with supercomputing?
  • Large-scale DFT
  • Static: fission, shape coexistence, etc. – compute > 100k different configurations
  • Dynamics: restoration of broken symmetries, correlations, time-dependent problems – combine > 100k configurations
  • Optimization of extended functionals on larger sets of experimental data

  6. Computational Challenges for DFT
  • Self-consistency = iterative process:
  • Not naturally prone to parallelization (suggests: lots of thinking…)
  • Computational cost: (number of iterations) × (cost of one iteration) + O(everything else)
  • Cost of symmetry breaking: triaxiality, reflection asymmetry, time-reversal invariance
  • Large dense matrices (LAPACK) constructed and diagonalized many times – sizes of the order of (2,000 × 2,000) to (10,000 × 10,000) (suggests: message passing; see the sketch after this slide)
  • Many long loops (suggests: threading)
  • Finite-range forces/non-local functionals: exact Coulomb, Yukawa-like, Gogny-like
  • Many nested loops (suggests: threading)
  • Precision issues
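  To illustrate the "large dense matrices" bullet, here is a minimal sketch of what one dense diagonalization step could look like with LAPACK. DSYEV is a real LAPACK routine, but the wrapper, the choice of routine, and the array names are illustrative and not taken from HFODD.

    ! Sketch: diagonalize a real symmetric n x n matrix with LAPACK DSYEV.
    ! In a self-consistent HFB loop this step is repeated at every iteration,
    ! so its cost is multiplied by the number of iterations.
    subroutine diagonalize(n, hmat, eigenvalues)
      implicit none
      integer, intent(in) :: n
      double precision, intent(inout) :: hmat(n, n)      ! on exit: eigenvectors
      double precision, intent(out)   :: eigenvalues(n)
      double precision, allocatable   :: work(:)
      integer :: lwork, info
      lwork = max(1, 3*n - 1)                            ! minimal DSYEV workspace
      allocate(work(lwork))
      call dsyev('V', 'U', n, hmat, n, eigenvalues, work, lwork, info)
      if (info /= 0) stop 'DSYEV failed'
      deallocate(work)
    end subroutine diagonalize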

  7. HFODD
  • Solves the HFB equations in the deformed, Cartesian HO basis
  • Breaks all symmetries (if needed)
  • Zero-range and finite-range forces coded
  • Additional features: cranking, angular momentum projection, etc.
  • Technicalities:
    • Fortran 77, Fortran 90
    • BLAS, LAPACK
    • I/O with standard input/output + a few files
  Redde Caesari quae sunt Caesaris ("Render unto Caesar the things that are Caesar's")

  8. HFODD for Leadership Class Computers
     OPTIMIZATIONS

  9. Loop reordering
  • Fortran: matrices are stored in memory column-wise, so elements must be accessed first by column index, then by row index (good stride)
  • Cost of bad stride grows quickly with the number of indices and dimensions
  • Example: accessing M(i,j,k) (see the sketch after this slide)
    Bad stride:  do i = 1, N / do j = 1, N / do k = 1, N
    Good stride: do k = 1, N / do j = 1, N / do i = 1, N
  [Plot: time of 10 HF iterations as a function of the model space (Skyrme SLy4, 208Pb, HF, exact Coulomb exchange)]
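  A minimal sketch of the two orderings above (array name, bounds, and the accumulation are illustrative, not HFODD code). In Fortran the leftmost index is contiguous in memory, so it should vary in the innermost loop:

    program stride_demo
      implicit none
      integer, parameter :: N = 128
      double precision, allocatable :: M(:,:,:)
      double precision :: total
      integer :: i, j, k
      allocate(M(N, N, N))
      M = 1.0d0
      total = 0.0d0

      ! Bad stride: the leftmost index i sits in the outermost loop, so
      ! consecutive innermost iterations jump through memory.
      do i = 1, N
        do j = 1, N
          do k = 1, N
            total = total + M(i, j, k)
          end do
        end do
      end do

      ! Good stride: i varies fastest, matching Fortran's column-major storage.
      do k = 1, N
        do j = 1, N
          do i = 1, N
            total = total + M(i, j, k)
          end do
        end do
      end do
      print *, total
    end program stride_demo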

  10. Threading (OpenMP)
  • OpenMP is designed to automatically parallelize loops
  • Example: calculation of the density matrix in the HO basis – a triply nested loop over the two basis indices and the summation index (do j = 1, N / do i = 1, N / do k = 1, N)
  • Solutions:
    • Thread it with OpenMP
    • When possible, replace such manual linear algebra with BLAS/LAPACK calls (threaded versions exist); see the sketch after this slide
  [Plot: time of 10 HFB iterations as a function of the number of threads (Jaguar Cray XT5 – Skyrme SLy4, 152Dy, HFB, 14 full shells)]
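  A hedged sketch of the two solutions listed above. The subroutine, array names, the number of summed states nqp, and the exact contraction rho(i,j) = sum_k V(i,k)*V(j,k) are assumptions for illustration, not the actual HFODD routine; in practice one would keep only one of the two options.

    subroutine build_density(nbasis, nqp, V, rho)
      implicit none
      integer, intent(in) :: nbasis, nqp
      double precision, intent(in)  :: V(nbasis, nqp)
      double precision, intent(out) :: rho(nbasis, nbasis)
      integer :: i, j, k

      ! Option 1: thread the hand-written triple loop with OpenMP
      !$OMP PARALLEL DO PRIVATE(i, j, k)
      do j = 1, nbasis
        do i = 1, nbasis
          rho(i, j) = 0.0d0
          do k = 1, nqp
            rho(i, j) = rho(i, j) + V(i, k) * V(j, k)
          end do
        end do
      end do
      !$OMP END PARALLEL DO

      ! Option 2: replace the manual loops by a (threaded) BLAS call: rho = V * V^T
      call dgemm('N', 'T', nbasis, nbasis, nqp, 1.0d0, V, nbasis, V, nbasis, 0.0d0, rho, nbasis)
    end subroutine build_density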

  11. Parallel Performance (MPI)
  • DFT = naturally parallel: 1 core = 1 configuration (only if the whole calculation fits into one core's memory)
  • HFODD characteristics:
    • Very little communication overhead
    • Lots of I/O per processor (specific to that processor): 3 ASCII files/core
  • Scalability limited by:
    • File system performance
    • Usability of the results (handling of thousands of files)
  • ADIOS library being implemented
  [Plot: time of 10 HFB iterations as a function of the number of cores (Jaguar Cray XT5, no threads – Skyrme SLy4, 152Dy, HFB, 14 full shells)]
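  A minimal sketch of the "one core = one configuration" task model. The program structure, the configuration-to-rank mapping, the file naming, and the run_hfb routine are hypothetical placeholders, not HFODD code; only the MPI calls are real.

    program dft_mass_table
      use mpi
      implicit none
      integer :: ierr, rank, nprocs, iconf
      character(len=32) :: fname
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      ! Each rank owns one configuration (nucleus + constraints): no
      ! communication during the iterations, only rank-private I/O,
      ! which is why the file system becomes the scalability limit.
      iconf = rank + 1
      write(fname, '(a,i6.6,a)') 'hfodd_', iconf, '.out'
      ! call run_hfb(iconf, fname)   ! hypothetical: solve and write results
      call MPI_Finalize(ierr)
    end program dft_mass_table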

  12. ScaLAPACK
  • Multi-threading: more memory available per core
  • How about the scalability of the diagonalization for large model spaces?
  • ScaLAPACK successfully implemented for simplex-breaking HFB calculations (J. McDonnell)
  • Current issues:
    • Needs detailed profiling, as no speed-up is observed: where is the bottleneck?
    • Is the problem size adequate?
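  A hedged sketch of what a distributed diagonalization of this kind typically involves. The matrix size, block size, 1 x nprocs process grid, zero-filled matrix, and choice of PDSYEVD are illustrative assumptions, not the HFODD implementation; the BLACS/ScaLAPACK routines themselves (blacs_pinfo, blacs_gridinit, descinit, numroc, pdsyevd) are real.

    program scalapack_sketch
      implicit none
      integer, parameter :: n = 2000, nb = 64   ! global size and block size (illustrative)
      integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol, info
      integer :: np_loc, nq_loc, lld, lwork, liwork
      integer :: desca(9), descz(9)
      integer, external :: numroc
      double precision, allocatable :: a(:,:), z(:,:), w(:), work(:)
      integer, allocatable :: iwork(:)
      double precision :: wquery(1)
      integer :: iquery(1)

      ! Set up the process grid (1 x nprocs here; a square grid is more common)
      call blacs_pinfo(iam, nprocs)
      nprow = 1
      npcol = nprocs
      call blacs_get(-1, 0, ictxt)
      call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
      call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

      ! Local dimensions of the block-cyclically distributed n x n matrix
      np_loc = numroc(n, nb, myrow, 0, nprow)
      nq_loc = numroc(n, nb, mycol, 0, npcol)
      lld = max(1, np_loc)
      call descinit(desca, n, n, nb, nb, 0, 0, ictxt, lld, info)
      call descinit(descz, n, n, nb, nb, 0, 0, ictxt, lld, info)

      allocate(a(lld, max(1, nq_loc)), z(lld, max(1, nq_loc)), w(n))
      a = 0.0d0    ! here: fill the local blocks of the symmetric HFB matrix

      ! Workspace query (lwork = liwork = -1), then the distributed eigensolver
      call pdsyevd('V', 'U', n, a, 1, 1, desca, w, z, 1, 1, descz, wquery, -1, iquery, -1, info)
      lwork  = int(wquery(1))
      liwork = iquery(1)
      allocate(work(lwork), iwork(liwork))
      call pdsyevd('V', 'U', n, a, 1, 1, desca, w, z, 1, 1, descz, work, lwork, iwork, liwork, info)

      call blacs_gridexit(ictxt)
      call blacs_exit(0)
    end program scalapack_sketch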

  13. Hybrid MPI/OpenMP Parallel Model
  • Spread one HFB calculation across a few cores (< 12-24)
  • MPI for task management across HFB configurations
  • Optional MPI sub-communicator (ScaLAPACK) for very large bases
  • OpenMP threading within each HFB calculation for loop optimization
  [Diagram: task management (MPI) distributes HFB configurations i/N, (i+1)/N, … over groups of cores; within each group, an optional ScaLAPACK (MPI) sub-communicator and OpenMP threads]
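  A minimal sketch of the task-splitting layer of this model. The group size ranks_per_hfb, the color arithmetic, and the run_hfb call are illustrative placeholders, not the HFODD implementation; the MPI calls are real.

    program hybrid_model_sketch
      use mpi
      implicit none
      integer, parameter :: ranks_per_hfb = 12  ! cores per HFB calculation (illustrative)
      integer :: ierr, world_rank, color, key, provided
      integer :: hfb_comm, hfb_rank, hfb_size
      ! Request thread support so OpenMP regions can live inside each MPI rank
      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
      ! Ranks with the same color form one sub-communicator = one HFB calculation;
      ! ScaLAPACK can run on hfb_comm, OpenMP threads run inside each rank.
      color = world_rank / ranks_per_hfb
      key   = mod(world_rank, ranks_per_hfb)
      call MPI_Comm_split(MPI_COMM_WORLD, color, key, hfb_comm, ierr)
      call MPI_Comm_rank(hfb_comm, hfb_rank, ierr)
      call MPI_Comm_size(hfb_comm, hfb_size, ierr)
      ! call run_hfb(color, hfb_comm)   ! hypothetical: configuration index = color
      call MPI_Comm_free(hfb_comm, ierr)
      call MPI_Finalize(ierr)
    end program hybrid_model_sketch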

  14. Conclusions
  • DFT codes are naturally parallel and can easily scale to 1M processors or more
  • High-precision applications of DFT are time- and memory-consuming computations, hence the need for fine-grain parallelization
  • HFODD benefits from HPC techniques and code examination:
    • Loop reordering gives N ≫ 1 speed-ups (exact Coulomb exchange: N ~ 3; Gogny force: N ~ 8)
    • Multi-threading gives an extra factor > 2 (only a few routines have been upgraded)
    • ScaLAPACK implemented: very large bases (Nshell > 25) can now be used (e.g., near scission)
    • Scaling is only average on the standard Jaguar file system because of unoptimized I/O

  15. Year 4-5 Roadmap
  • Year 4:
    • More OpenMP; debugging of the ScaLAPACK routine
    • First tests of the ADIOS library (at scale)
    • First development of a prototype Python visualization interface
    • Tests of large-scale, I/O-bridled, multi-constrained calculations
  • Year 5:
    • Full implementation of ADIOS
    • Set up a framework for automatic restart (at scale)
  • SVN repository (ask Mario for an account): http://www.massexplorer.org/svn/HFODDSVN/trunk
