HFODD for Leadership Class Computers

N. Schunck, J. McDonnell, Hai Ah Nam

Classes of DFT Solvers

  • Coordinate-space: direct integration of the HFB equations

    • Accurate: provide “exact” result

    • Slow and CPU/memory intensive for 2D-3D geometries

  • Configuration space: Expansion of the solutions on a basis (HO)

    • Fast and amenable to beyond mean-field extensions

    • Truncation effects: source of divergences/renormalization issues

    • Wrong asymptotic behavior unless different bases are used (WS, PTG, Gamow, etc.)

Resources needed for a “standard HFB” calculation

Why High Performance Computing?

Core of DFT: Global theory which averages out individual degrees of freedom

  • From light nuclei to neutron stars

  • Rich physics

  • Fast and reliable

  • Treatment of correlations ?

  • ~100 keV level precision ?

  • Extrapolability ?

The ground state of an even nucleus can be computed in a matter of minutes on a standard laptop: why bother with supercomputing?

  • Large-scale DFT

    • Static: fission, shape coexistence, etc. – compute > 100k different configurations

    • Dynamics: restoration of broken symmetries, correlations, time-dependent problems – combine > 100k configurations

    • Optimization of extended functionals on larger sets of experimental data

Computational Challenges for DFT

  • Self-consistency = iterative process:

    • Not naturally prone to parallelization (suggests: lots of thinking…)

    • Computational cost:

      (number of iterations) × (cost of one iteration) + O(everything else)

  • Cost of symmetry breaking: triaxiality, reflection asymmetry, time-reversal invariance

    • Large dense matrices (LAPACK) constructed and diagonalized many times – size of the order of (2,000 × 2,000) – (10,000 × 10,000) (suggests: message passing)

    • Many long loops (suggests: threading)

  • Finite-range forces/non-local functionals: exact Coulomb, Yukawa-, Gogny-like

    • Many nested loops (suggests: threading)

    • Precision issues


HFODD

  • Solve HFB equations in the deformed, Cartesian HO basis

  • Breaks all symmetries (if needed)

  • Zero-range and finite-range forces coded

  • Additional features: cranking, angular momentum projection, etc.

  • Technicalities:

    • Fortran 77, Fortran 90


    • I/O with standard input/output + a few files

Redde Caesari quae sunt Caesaris (“Render unto Caesar what is Caesar’s”)

Loop Reordering

  • Fortran: matrices are stored in memory column-wise ⇒ elements must be accessed first by column index, then by row index (good stride)

  • Cost of bad stride grows quickly with the number of indexes and dimensions

Ex.: accessing M(i,j,k)

Bad stride (leftmost index outermost):

do i = 1, N
  do j = 1, N
    do k = 1, N

Good stride (leftmost index innermost):

do k = 1, N
  do j = 1, N
    do i = 1, N

Time of 10 HF iterations as function of the model space

(Skyrme SLy4, 208Pb, HF, exact Coulomb exchange)

Threading (OpenMP)

  • OpenMP is designed to automatically parallelize loops

  • Ex: calculation of density matrix in HO basis

  • Solutions:

    • Thread it with OpenMP

    • When possible, replace all such manual linear algebra with BLAS/LAPACK calls (threaded versions exist)

do j = 1, N
  do i = 1, N
    do mu = 1, N

Time of 10 HFB iterations as function of the number of threads (Jaguar Cray XT5 – Skyrme SLy4, 152Dy, HFB, 14 full shells)

Parallel Performance (MPI)

  • DFT = naturally parallel

  • 1 core = 1 configuration (only if ‘all’ fits into the core’s memory)

  • HFODD characteristics

    • Very little communication overhead

    • Lots of I/O per processor (specific to that processor): 3 ASCII files/core

  • Scalability limited by:

    • File system performance

    • Usability of the results (handling of thousands of files)

  • ADIOS library being implemented

Time of 10 HFB iterations as function of the number of cores

(Jaguar Cray XT5, no threads – Skyrme SLy4, 152Dy, HFB, 14 full shells)






ScaLAPACK for Large Model Spaces

  • Multi-threading: more memory available per core

  • How about scalability of diagonalization for large model spaces?

  • ScaLAPACK successfully implemented for simplex-breaking HFB calculations (J. McDonnell)

  • Current issues:

    • Needs detailed profiling as no speed-up is observed: bottleneck?

    • Problem size adequate?

Hybrid MPI/OpenMP Parallel Model

  • Spread the HFB calculation across a few cores (<12-24)

  • MPI for task management



[Diagram: task management (MPI) distributes HFB calculations i/N, (i+1)/N, …; within each HFB calculation, threads handle loop optimization, with an optional MPI sub-communicator for very large bases needing ScaLAPACK]


Conclusions

  • DFT codes are naturally parallel and can easily scale to 1M processors or more

  • High-precision applications of DFT are time- and memory-consuming computations ⇒ need for fine-grain parallelization

  • HFODD benefits from HPC techniques and code examination

    • Loop reordering gives N ≫ 1 speed-ups (Coulomb exchange: N ~ 3; Gogny force: N ~ 8)

    • Multi-threading gives an extra factor > 2 (only a few routines have been upgraded)

    • ScaLAPACK implemented: very large bases (Nshell > 25) can now be used (e.g., near scission)

  • Scaling is only average on the standard Jaguar file system because of un-optimized I/O

Year 4 – 5 Roadmap

  • Year 4

    • More OpenMP, debugging of ScaLAPACK routine

    • First tests of ADIOS library (at scale)

    • First development of a prototype python visualization interface

    • Tests of large-scale, I/O-bridled, multi-constrained calculations

  • Year 5

    • Full implementation of ADIOS

    • Set up framework for automatic restart (at scale)

  • SVN repository (ask Mario for account)