
Presentation Transcript


  1. First principles modeling with Octopus: massive parallelization towards petaflop computing and more A. Castro, J. Alberdi and A. Rubio

  2. Outline Theoretical Spectroscopy The octopus code Parallelization

  3. Outline Theoretical Spectroscopy The octopus code Parallelization

  4. Theoretical Spectroscopy

  5. Theoretical Spectroscopy • Electronic excitations: • Optical absorption • Electron energy loss • Inelastic X-ray scattering • Photoemission • Inverse photoemission • …

  6. Theoretical Spectroscopy Goal: First principles (from electronic structure) theoretical description of the various spectroscopies (“theoretical beamlines”):

  7. Theoretical Spectroscopy Role: interpretation of (complex) experimental findings

  8. Theoretical Spectroscopy • Role: interpretation of (complex) experimental findings • Theoretical atomistic structures, and corresponding TEM images.

  9. Theoretical Spectroscopy

  10. Theoretical Spectroscopy

  11. Theoretical Spectroscopy The European Theoretical Spectroscopy Facility (ETSF)

  12. Theoretical Spectroscopy • The European Theoretical Spectroscopy Facility (ETSF) • Networking • Integration of tools (formalism, software) • Maintenance of tools • Support, service, training

  13. Theoretical Spectroscopy • The octopus code is a member of a family of free software codes developed, to a large extent, within the ETSF: • abinit • octopus • dp

  14. Outline Theoretical Spectroscopy The octopus code Parallelization

  15. The octopus code • Targets: • Optical absorption spectra of molecules, clusters, nanostructures, solids. • Response to lasers (non-perturbative response to high-intensity fields). • Dichroic spectra, and other mixed (electric-magnetic) responses. • Adiabatic and non-adiabatic Molecular Dynamics (for, e.g., infrared and vibrational spectra, or photochemical reactions). • Quantum Optimal Control Theory for molecular processes.

  16. The octopus code • Physical approximations and techniques: • Density-Functional Theory and Time-Dependent Density-Functional Theory to describe the electronic structure. • Comprehensive set of functionals through the libxc library. • Mixed quantum-classical systems. • Both real-time and frequency-domain response (“Casida” and “Sternheimer” formulations).

  17. The octopus code • Numerics: • Basic representation: real space grid. • Usually regular and rectangular, occasionally curvilinear. • Plane waves for some procedures (especially for periodic systems) • Atomic orbitals for some procedures

  18. The octopus code Derivative at a point: sum over neighboring points. The coefficients c_ij depend on the points used: the stencil. More points -> more precision. Semi-local operation.
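
A minimal sketch (in C, not taken from Octopus) of what such a stencil operation looks like for a 1D grid with the 3-point stencil f''(x_i) ≈ (f_{i-1} - 2 f_i + f_{i+1}) / h²; the stencil width and coefficients here are illustrative assumptions:

    #include <stddef.h>

    /* 1D finite-difference Laplacian with a 3-point stencil.
     * More stencil points would give a higher-order approximation. */
    void laplacian_1d(const double *f, double *lap, size_t n, double h)
    {
        const double inv_h2 = 1.0 / (h * h);
        for (size_t i = 1; i + 1 < n; ++i)
            lap[i] = (f[i - 1] - 2.0 * f[i] + f[i + 1]) * inv_h2;
    }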

  19. The octopus code • The key equations • Ground-state DFT: Kohn-Sham equations. • Time-dependent DFT: time-dependent KS eqs:
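
The equations themselves appeared as images on the slide; in atomic units their standard forms (reconstructed here, not copied from the slide) are, in LaTeX:

    \left[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{ext}}(\mathbf{r})
          + v_{\mathrm{H}}[n](\mathbf{r}) + v_{\mathrm{xc}}[n](\mathbf{r})\right]
    \varphi_i(\mathbf{r}) = \varepsilon_i\,\varphi_i(\mathbf{r})

    i\,\frac{\partial}{\partial t}\varphi_i(\mathbf{r},t)
      = \left[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{KS}}[n](\mathbf{r},t)\right]
        \varphi_i(\mathbf{r},t),
    \qquad
    n(\mathbf{r},t) = \sum_i |\varphi_i(\mathbf{r},t)|^2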

  20. The octopus code • Key numerical operations: • Linear systems with sparse matrices. • Eigenvalue systems with sparse matrices. • Non-linear eigenvalue systems. • Propagation of “Schrödinger-like” equations. • The dimension can go up to 10 million points. • The storage needs can go up to 10 GB.
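
For the last item, one common choice in real-time TDDFT codes is a truncated Taylor expansion of the propagator, psi(t+dt) ≈ Σ_{k=0..4} (-i dt H)^k / k! psi(t), applied matrix-free on the grid. A hedged C sketch; the apply_h callback and the array layout are hypothetical, not the Octopus interface:

    #include <complex.h>
    #include <stddef.h>

    /* One propagation step via a 4th-order Taylor expansion of exp(-i H dt).
     * term holds (-i dt H)^k / k! psi; psi accumulates the sum. */
    void propagate_taylor(void (*apply_h)(const double complex *in,
                                          double complex *out, size_t n),
                          double complex *psi, double complex *term,
                          double complex *hterm, size_t n, double dt)
    {
        for (size_t i = 0; i < n; ++i) term[i] = psi[i];
        for (int k = 1; k <= 4; ++k) {
            apply_h(term, hterm, n);                 /* hterm = H * term         */
            for (size_t i = 0; i < n; ++i) {
                term[i] = (-I * dt / k) * hterm[i];  /* next Taylor term         */
                psi[i] += term[i];                   /* accumulate psi(t + dt)   */
            }
        }
    }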

  21. The octopus code • Use of libraries: • BLAS, LAPACK • GNU GSL mathematical library. • FFTW • NetCDF • ETSF input/output library • Libxc exchange and correlation library • Other optional libraries.

  22. www.tddft.org/programs/octopus/

  23. Outline Theoretical Spectroscopy The octopus code Parallelization

  24. Objective Reach petaflop computing with a scientific code. Simulate the light-harvesting step of photosynthesis in chlorophyll.

  25. Multi-level parallelization

  26. Target systems: • Massive number of execution units • Multi-core processors with vectorial FPUs • IBM Blue Gene architecture • Graphical processing units

  27. High-level parallelization: MPI

  28. Parallelization by states/orbitals Assign each processor a group of states. Time propagation is independent for each state. Little communication required. Limited by the number of states in the system.
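
A hedged MPI sketch of this idea (hypothetical names, not the actual Octopus implementation): each rank owns a contiguous block of the nst states and propagates it without communicating inside the loop:

    #include <mpi.h>

    /* Distribute nst states over the ranks of comm; each rank
     * time-propagates only its own block, independently of the others. */
    void propagate_my_states(MPI_Comm comm, int nst, double dt)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int base = nst / size, rem = nst % size;
        int first = rank * base + (rank < rem ? rank : rem);
        int count = base + (rank < rem ? 1 : 0);

        for (int ist = first; ist < first + count; ++ist) {
            /* propagate_state(ist, dt);  -- hypothetical per-state propagation,
             * independent per state, so no communication is needed here */
        }
    }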

  29. Domain parallelization Assign each processor a set of grid points. Partition libraries: Zoltan or METIS.

  30. Main operations in domain parallelization
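
The figure on this slide is not reproduced here; the main operations it illustrates are (i) exchanging boundary (“ghost”) points with neighbouring domains before applying the stencil and (ii) global reductions for integrals and scalar products. A minimal MPI sketch with hypothetical names, assuming the local array layout [left ghosts | n interior points | right ghosts]:

    #include <mpi.h>

    /* Exchange ghost points with the left/right neighbour ranks
     * (pass MPI_PROC_NULL at a physical boundary). */
    void exchange_ghosts(MPI_Comm comm, int left, int right,
                         double *f, int n, int nghost)
    {
        /* send first interior points left, receive right ghosts from the right */
        MPI_Sendrecv(f + nghost,     nghost, MPI_DOUBLE, left,  0,
                     f + nghost + n, nghost, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        /* send last interior points right, receive left ghosts from the left */
        MPI_Sendrecv(f + n,          nghost, MPI_DOUBLE, right, 1,
                     f,              nghost, MPI_DOUBLE, left,  1,
                     comm, MPI_STATUS_IGNORE);
    }

    /* Scalar product of two domain-distributed vectors: local sum + Allreduce. */
    double dot_product(MPI_Comm comm, const double *a, const double *b, int n)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n; ++i) local += a[i] * b[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }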

  31. Low-level parallelization and vectorization: OpenMP and GPU

  32. Two approaches • OpenMP: thread programming based on compiler directives; in-node parallelization; little memory overhead compared to MPI; scaling limited by memory bandwidth; multithreaded BLAS and LAPACK. • OpenCL: hundreds of execution units; high memory bandwidth but with long latency; behaves like a vector processor (length > 16); separate memory: copy from/to main memory.
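
For the OpenMP side, the directive-based approach amounts to little more than annotating the grid loops; a hedged sketch reusing the Laplacian example from above (illustrative only, not Octopus code):

    #include <stddef.h>

    /* In-node thread parallelization over grid points (assumes n >= 2);
     * scaling is typically limited by memory bandwidth. */
    void laplacian_1d_omp(const double *f, double *lap, size_t n, double h)
    {
        const double inv_h2 = 1.0 / (h * h);
        #pragma omp parallel for
        for (size_t i = 1; i < n - 1; ++i)
            lap[i] = (f[i - 1] - 2.0 * f[i] + f[i + 1]) * inv_h2;
    }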

  33. Supercomputers • Corvo cluster • x86_64 • VARGAS (at IDRIS) • Power6 • 67 teraflops • MareNostrum • PowerPC 970 • 94 teraflops • Jugene (pictured) • 1 petaflop

  34. Test Results

  35. Laplacian operator Performance comparison of the finite-difference Laplacian operator. The CPU uses 4 threads. The GPU is 4 times faster. Cache effects are visible.

  36. Time propagation Performance comparison for a time propagation of a fullerene molecule. The GPU is 3 times faster. Limited by copying and non-GPU code.

  37. Multi-level parallelization • Chlorophyll molecule: 650 atoms • Jugene (Blue Gene/P) • Sustained throughput: > 6.5 teraflops • Peak throughput: 55 teraflops

  38. Scaling

  39. Scaling (II) • Comparison of two atomic systems on Jugene

  40. Target system • Jugene, all nodes • 294 912 processor cores = 73 728 nodes • Maximum theoretical performance of about 1 petaflop • 5879-atom chlorophyll system • Complete chlorophyll molecule of spinach

  41. Test systems • Smaller molecules • 180 atoms • 441 atoms • 650 atoms • 1365 atoms • Partitions of the machines • Jugene and Corvo

  42. Profiling • Profiled within the code • Profiled with the Paraver tool • www.bsc.es/paraver

  43. 1 TD iteration

  44. Some “inner” iterations

  45. One “inner” iteration: Ireceive, Isend, Iwait

  46. Poisson solver: Allgather, 2 × Alltoall, Allgather, Scatter
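
These are collective MPI operations; as a hedged illustration (not the actual Octopus solver), an Alltoall of this kind is what a parallel-FFT-based Poisson solver uses to transpose the distributed grid between the 1D FFT passes:

    #include <mpi.h>

    /* Hypothetical sketch: every rank sends one equally sized block to every
     * other rank, effectively transposing a block-distributed array. */
    void transpose_blocks(MPI_Comm comm, const double *send, double *recv,
                          int block_per_rank)
    {
        MPI_Alltoall(send, block_per_rank, MPI_DOUBLE,
                     recv, block_per_rank, MPI_DOUBLE, comm);
    }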

  47. Improvements • Memory improvements in GS: • Split the memory among the nodes • Use of ScaLAPACK • Improvements in the Poisson solver for TD: • Pipeline execution • Execute the Poisson solver while the propagation continues with an approximation • Use new algorithms such as FMM (fast multipole method) • Use of parallel FFTs

  48. Conclusions The Kohn-Sham scheme is inherently parallel. It can be exploited for parallelization and vectorization. Suited to current and future computer architectures. Theoretical improvements for large-system modeling.
