
HFODD for Leadership Class Computers

N. Schunck, J. McDonnell, Hai Ah Nam

Classes of DFT solvers
  • Coordinate-space: direct integration of the HFB equations
    • Accurate: provides “exact” results
    • Slow and CPU/memory-intensive for 2D-3D geometries
  • Configuration space: expansion of the solutions in a basis (harmonic oscillator, HO)
    • Fast and amenable to beyond-mean-field extensions
    • Truncation effects: a source of divergences/renormalization issues
    • Wrong asymptotic behavior unless different bases are used (WS, PTG, Gamow, etc.)

Resources needed for a “standard HFB” calculation

Why High Performance Computing?

Core of DFT: a global theory that averages out individual degrees of freedom

  • From light nuclei to neutron stars
  • Rich physics
  • Fast and reliable
  • Treatment of correlations?
  • ~100 keV-level precision?
  • Extrapolability?

The ground state of an even nucleus can be computed in a matter of minutes on a standard laptop: why bother with supercomputing?

  • Large-scale DFT
    • Static: fission, shape coexistence, etc. – compute > 100k different configurations
    • Dynamics: restoration of broken symmetries, correlations, time-dependent problems – combine > 100k configurations
    • Optimization of extended functionals on larger sets of experimental data
Computational Challenges for DFT
  • Self-consistency = iterative process:
    • Not naturally prone to parallelization (suggests: lots of thinking…)
    • Computational cost:

(number of iterations) × (cost of one iteration) + O(everything else)

  • Cost of symmetry breaking: triaxiality, reflection asymmetry, time-reversal invariance
    • Large dense matrices (LAPACK) constructed and diagonalized many times – sizes of the order of (2,000 × 2,000) to (10,000 × 10,000) (suggests: message passing)
    • Many long loops (suggests: threading)
  • Finite-range forces/non-local functionals: exact Coulomb, Yukawa-like, Gogny-like
    • Many nested loops (suggests: threading)
    • Precision issues
HFODD

  • Solves the HFB equations in the deformed, Cartesian HO basis
  • Breaks all symmetries (if needed)
  • Zero-range and finite-range forces coded
  • Additional features: cranking, angular momentum projection, etc.
  • Technicalities:
    • Fortran 77, Fortran 90
    • I/O with standard input/output + a few files

Redde Caesari quae sunt Caesaris (“Render unto Caesar the things that are Caesar’s”)

Loop reordering
  • Fortran: matrices are stored in memory column-wise → elements must be accessed first by column index, then by row index (good stride)
  • The cost of bad stride grows quickly with the number of indices and dimensions

Ex.: Accessing M(i,j,k)

! Bad stride – innermost index k has the largest memory stride:
do i = 1, N
  do j = 1, N
    do k = 1, N

! Good stride – innermost index i is contiguous in memory:
do k = 1, N
  do j = 1, N
    do i = 1, N

Time of 10 HF iterations as a function of the model space (Skyrme SLy4, 208Pb, HF, exact Coulomb exchange)
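The stride rule above can be checked with a short sketch (illustrative Python, not HFODD code; all names are hypothetical): for a column-major array M(N,N,N), list the linear memory offsets each loop nesting visits and compare the strides.

```python
# Illustrative sketch: memory offsets visited by two loop nestings over a
# column-major (Fortran-order) array M(N,N,N).

def offsets(n, order):
    """Offsets of M(i,j,k) in column-major storage, visited with the
    given loop nesting (outermost index first)."""
    visited = []
    for a in range(n):
        for b in range(n):
            for c in range(n):
                idx = dict(zip(order, (a, b, c)))
                # column-major: i varies fastest in memory, then j, then k
                visited.append(idx["i"] + n * idx["j"] + n * n * idx["k"])
    return visited

bad = offsets(3, ("i", "j", "k"))   # innermost k: memory stride n*n = 9
good = offsets(3, ("k", "j", "i"))  # innermost i: memory stride 1
```

Both orders touch every element exactly once; only the memory-access pattern differs, which is where the speed-up comes from.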

Threading (OpenMP)
  • OpenMP is designed to automatically parallelize loops
  • Ex.: calculation of the density matrix in the HO basis
  • Solutions:
    • Thread it with OpenMP
    • When possible, replace all such manual linear algebra with BLAS/LAPACK calls (threaded versions exist)

do j = 1, N
  do i = 1, N
    do mu = 1, N   ! summation index (μ lost in transcription)

Time of 10 HFB iterations as a function of the number of threads (Jaguar Cray XT5 – Skyrme SLy4, 152Dy, HFB, 14 full shells)
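The density-matrix loop above amounts to ρ(i,j) = Σ_μ V(i,μ) V(j,μ), i.e. a single matrix product ρ = V Vᵀ. A minimal Python sketch (hypothetical names, standing in for the Fortran/BLAS versions) shows why swapping the manual loops for one dgemm-style product is safe: both give identical results.

```python
def rho_loops(V):
    """Naive triple loop: rho[i][j] = sum over mu of V[i][mu] * V[j][mu]."""
    n = len(V)
    rho = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            for mu in range(len(V[0])):
                rho[i][j] += V[i][mu] * V[j][mu]
    return rho

def rho_matmul(V):
    """Same result expressed as one matrix product rho = V * V^T --
    the shape of computation a (threaded) BLAS dgemm call performs."""
    n, m = len(V), len(V[0])
    return [[sum(V[i][mu] * V[j][mu] for mu in range(m)) for j in range(n)]
            for i in range(n)]

V = [[1.0, 2.0], [3.0, 4.0]]
```

The loop version is what OpenMP would thread; the product version hands the same work to an already-threaded library.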

Parallel Performance (MPI)
  • DFT = naturally parallel
  • 1 core = 1 configuration (only if ‘all’ fits into one core’s memory)
  • HFODD characteristics:
    • Very little communication overhead
    • Lots of I/O per processor (specific to that processor): 3 ASCII files/core
  • Scalability limited by:
    • File-system performance
    • Usability of the results (handling of thousands of files)
  • ADIOS library being implemented

Time of 10 HFB iterations as a function of the number of cores (Jaguar Cray XT5, no threads – Skyrme SLy4, 152Dy, HFB, 14 full shells)
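The “1 core = 1 configuration” scheme can be sketched as a plain rank-to-configuration mapping (illustrative Python; the function name and the round-robin choice are assumptions, not HFODD’s actual scheme):

```python
def my_configurations(rank, n_ranks, n_configs):
    """Round-robin task farm: rank r works on configurations
    r, r + n_ranks, r + 2*n_ranks, ... with no inter-rank communication."""
    return list(range(rank, n_configs, n_ranks))

# Every configuration is computed exactly once across all ranks:
all_tasks = sorted(c for r in range(4) for c in my_configurations(r, 4, 10))
```

Because each rank’s task list is computed locally from its own rank number, the communication overhead stays negligible; the remaining cost is each rank’s private I/O.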






  • Multi-threading: more memory/core available
  • How about scalability of the diagonalization for large model spaces?
  • ScaLAPACK successfully implemented for simplex-breaking HFB calculations (J. McDonnell)
  • Current issues:
    • Needs detailed profiling, as no speed-up is observed: where is the bottleneck?
    • Is the problem size adequate?
Hybrid MPI/OpenMP Parallel Model
  • Spread the HFB calculation across a few cores (<12-24)
  • MPI for task management



[Diagram: task management (MPI) distributes HFB calculations HFB-i/N, HFB-(i+1)/N, …; within 1 HFB calculation, threads handle loop optimization, plus an optional MPI sub-communicator for very large bases needing ScaLAPACK]
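The grouping described above amounts to splitting the world communicator into sub-groups of a fixed size. The color/key arithmetic an MPI_Comm_split call would use can be sketched in plain Python (illustrative only; names are assumptions):

```python
def split_layout(world_rank, cores_per_group):
    """Map a world rank to (group, local_rank): group identifies which HFB
    calculation this core works on (the 'color' of MPI_Comm_split), and
    local_rank is its rank inside that sub-communicator (the 'key')."""
    return world_rank // cores_per_group, world_rank % cores_per_group
```

With cores_per_group = 12, ranks 0–11 form the sub-communicator for one HFB calculation, ranks 12–23 the next, and so on; group 0’s local rank 0 can then act as that calculation’s task-management contact.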

  • DFT codes are naturally parallel and can easily scale to 1M processors or more
  • High-precision applications of DFT are time- and memory-consuming computations → need for fine-grain parallelization
  • HFODD benefits from HPC techniques and code examination:
    • Loop reordering gives N ≫ 1 speed-ups (Coulomb exchange: N ~ 3; Gogny force: N ~ 8)
    • Multi-threading gives an extra factor > 2 (only a few routines have been upgraded)
    • ScaLAPACK implemented: very large bases (Nshell > 25) can now be used (ex.: near scission)
  • Scaling only average on the standard Jaguar file system because of unoptimized I/O
Year 4 – 5 Roadmap
  • Year 4
    • More OpenMP, debugging of the ScaLAPACK routine
    • First tests of the ADIOS library (at scale)
    • First development of a prototype Python visualization interface
    • Tests of large-scale, I/O-bound, multi-constrained calculations
  • Year 5
    • Full implementation of ADIOS
    • Set up a framework for automatic restart (at scale)
  • SVN repository (ask Mario for an account)