
Parallelization of CPAIMD using Charm++


Presentation Transcript


  1. Parallelization of CPAIMD using Charm++ Parallel Programming Lab

  2. CPAIMD (Car-Parrinello ab initio molecular dynamics) • Collaboration with Glenn Martyna and Mark Tuckerman • MPI code – PINY • Scalability problems when #procs >= #orbitals • Charm++ approach • Better scalability using virtualization • Further subdivide the orbitals

  3. The Iteration

  4. The Iteration (contd.) • Start with 128 “states” • State – spatial representation of electron • FFT each of 128 states • In parallel • Planar decomposition => transpose • Compute densities (DFT) • Compute energies using density • Compute Forces and move electrons • Orthonormalize states • Start over
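
As a rough, sequential sketch of the control flow described on this slide (the real code runs these phases across parallel Charm++ objects), the following C++ stub compiles on its own; every type and function name in it is an illustrative assumption, not the PINY/Charm++ interface:

    #include <vector>

    // Illustrative sketch of one CPAIMD iteration; all names are assumptions.
    struct State   {};   // spatial representation of one electron (planar slices)
    struct Density {};   // electron density on the real-space grid

    void    fftToRealSpace(State&)                             {}  // 3D FFT, transpose-based
    Density computeDensity(const std::vector<State>&)          { return {}; }
    void    computeEnergies(const Density&)                    {}  // DFT energy terms
    void    moveElectrons(std::vector<State>&, const Density&) {}  // forces + update
    void    orthonormalize(std::vector<State>&)                {}  // all-pairs step

    void cpaimdIteration(std::vector<State>& states) {
        for (State& s : states)            // 128 states, FFT'd in parallel in the real code
            fftToRealSpace(s);
        Density rho = computeDensity(states);
        computeEnergies(rho);
        moveElectrons(states, rho);
        orthonormalize(states);            // then start over with the next iteration
    }

    int main() {
        std::vector<State> states(128);    // "128 states" from the slide
        cpaimdIteration(states);
    }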

  5. Parallel View

  6. Optimized Parallel 3D FFT • To perform the 3D FFT: a 1D transform followed by a 2D transform, instead of 2D followed by 1D • Less computation • Less communication
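
A minimal sketch of the transpose-based scheme, assuming a decomposition into lines and planes of the grid; fft1d and fft2d are placeholders for a real FFT library, and the layout types are illustrative only:

    #include <complex>
    #include <vector>
    using cplx  = std::complex<double>;
    using Line  = std::vector<cplx>;             // N points along one axis
    using Plane = std::vector<Line>;             // N x N points

    void fft1d(Line&)  {}                        // placeholder: 1D complex FFT
    void fft2d(Plane&) {}                        // placeholder: 2D complex FFT

    // Phase 1: each chare owns a bundle of lines along one axis; transform them independently.
    void transformLines(std::vector<Line>& myLines) {
        for (Line& l : myLines) fft1d(l);
    }

    // Phase 2: an all-to-all transpose regroups the data so that each chare
    // owns complete planes (in Charm++ this is a message exchange between
    // chare arrays, not shown here).

    // Phase 3: each chare finishes the 3D FFT with a 2D transform per plane.
    void transformPlanes(std::vector<Plane>& myPlanes) {
        for (Plane& p : myPlanes) fft2d(p);
    }

    int main() {
        const int N = 8;                                    // toy grid size
        std::vector<Line>  lines(N * N, Line(N));           // this chare's lines
        std::vector<Plane> planes(N, Plane(N, Line(N)));    // after the transpose
        transformLines(lines);
        transformPlanes(planes);
    }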

  7. Orthonormalization • All-pairs operation • The data of each state has to meet the data of every other state • Our approach (picture follows) • A virtual processor (VP) acts as a meeting point for several pairs of states • Create many such VPs • The number of pairs meeting at a VP: n • Communication decreases as n grows • Computation increases as n grows • A balance is required
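
A small sketch of mapping state pairs to VPs and of the grain-size n; the linear pair-indexing scheme and the round numbers below are assumptions for illustration, not the mapping used in the actual implementation:

    #include <cstdio>

    // Index of pair (i, j), i < j, among all S*(S-1)/2 unordered pairs.
    long pairIndex(int i, int j, int S) {
        return (long)i * S - (long)i * (i + 1) / 2 + (j - i - 1);
    }

    // VP that acts as the meeting point for states i and j, with n pairs per VP.
    // Larger n means fewer VPs (less communication) but more work per VP.
    long vpForPair(int i, int j, int S, int pairsPerVP) {
        return pairIndex(i, j, S) / pairsPerVP;
    }

    int main() {
        const int S = 128, n = 32;                    // 128 states, 32 pairs per VP
        long nPairs = (long)S * (S - 1) / 2;          // 8128 pairs in total
        std::printf("%ld pairs -> %ld VPs\n", nPairs, (nPairs + n - 1) / n);
        std::printf("pair (3, 17) meets on VP %ld\n", vpForPair(3, 17, S, n));
    }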

  8. VP based approach

  9. Performance • Existing MPI code – PINY • Does not scale beyond 128 processors • Best per-iteration: 1.7s • Our performance:

  10. Load balancing • Load imbalance due to the distribution of data in the orbitals • Planes are sections of a sphere, hence the imbalance • Computation – planes with more points do more work • Communication – more data to send
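
A toy model of the imbalance, assuming the per-plane load is roughly the number of points in that plane's cross-section of the sphere; the counting loop below is only illustrative:

    #include <cstdio>

    // Points of an integer grid that fall inside the sphere on plane z.
    long pointsInPlane(int z, int gcut) {
        if (z > gcut || z < -gcut) return 0;
        long count = 0;
        for (int x = -gcut; x <= gcut; ++x)
            for (int y = -gcut; y <= gcut; ++y)
                if (x * x + y * y + z * z <= gcut * gcut)
                    ++count;                       // one data point on this plane
        return count;
    }

    int main() {
        const int gcut = 20;                        // illustrative sphere radius
        std::printf("central plane: %ld points, edge plane: %ld points\n",
                    pointsInPlane(0, gcut), pointsInPlane(19, gcut));
        // More points per plane means more computation and more data to send.
    }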

  11. Load Imbalance Iteration time: 900ms on 1024 procs

  12. Improvement - I Improvement by pairing heavily loaded planes with lightly loaded planes. Iteration time: 590ms
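
A minimal sketch of the pairing idea, assuming a simple sort-and-fold of planes by estimated load (the load numbers are made up for illustration):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<long> load = {980, 120, 870, 450, 300, 760, 640, 90};  // per-plane loads
        std::vector<int> order(load.size());
        for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return load[a] > load[b]; });        // heaviest first

        // Co-locate the heaviest remaining plane with the lightest remaining one,
        // so every pair carries a roughly equal total load.
        for (size_t k = 0; k < order.size() / 2; ++k) {
            int heavy = order[k], light = order[order.size() - 1 - k];
            std::printf("pair planes %d and %d (total load %ld)\n",
                        heavy, light, load[heavy] + load[light]);
        }
    }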

  13. Charm++ Load Balancing Load balancing provided by the Charm++ runtime system. Iteration time: 600ms

  14. Improvement - II Improvement by using a load-vector based scheme to map planes to processors: the number of planes assigned per processor is correspondingly smaller when those planes are “heavy” than when they are “light”. Iteration time: 480ms
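
One way to realize a load-vector based mapping is a greedy assignment of planes, heaviest first, to the currently least-loaded processor; this sketch is an assumption about the scheme, not necessarily the exact algorithm used in the talk:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<long> planeLoad = {980, 870, 760, 640, 450, 300, 120, 90};  // heaviest first
        const int nProcs = 3;
        std::vector<long> procLoad(nProcs, 0);       // the "load vector"
        std::vector<int>  planesOnProc(nProcs, 0);

        for (long w : planeLoad) {
            int p = (int)(std::min_element(procLoad.begin(), procLoad.end()) - procLoad.begin());
            procLoad[p] += w;                        // place plane on the least-loaded processor
            planesOnProc[p]++;
        }
        // Processors that receive heavy planes end up holding fewer planes overall.
        for (int p = 0; p < nProcs; ++p)
            std::printf("proc %d: %d planes, load %ld\n", p, planesOnProc[p], procLoad[p]);
    }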

  15. Scope for Improvement • Load balancing • The Charm++ load balancer shows encouraging results on 512 PEs • Combination of automated and manual load balancing • Avoiding copying when sending messages • In FFTs • When sending large read-only messages • FFTs can be made more efficient • Use double packing (sketched below) • Make assumptions about the data distribution when performing FFTs • Alternative implementation of orthonormalization
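
For the double-packing item, here is a standard sketch of how two real 1D transforms can be obtained from one complex transform; a naive O(N^2) DFT stands in for a real FFT library, and its use here only illustrates the optimization the slide proposes:

    #include <cmath>
    #include <complex>
    #include <cstdio>
    #include <vector>
    using cplx = std::complex<double>;

    std::vector<cplx> dft(const std::vector<cplx>& x) {          // stand-in for an FFT
        const double PI = std::acos(-1.0);
        const size_t N = x.size();
        std::vector<cplx> X(N);
        for (size_t k = 0; k < N; ++k)
            for (size_t n = 0; n < N; ++n)
                X[k] += x[n] * std::polar(1.0, -2.0 * PI * k * n / N);
        return X;
    }

    int main() {
        std::vector<double> a = {1, 2, 3, 4}, b = {4, 3, 2, 1};  // two real signals
        const size_t N = a.size();
        std::vector<cplx> packed(N);
        for (size_t n = 0; n < N; ++n) packed[n] = {a[n], b[n]}; // pack as a + i*b
        std::vector<cplx> C = dft(packed);                       // one complex transform
        for (size_t k = 0; k < N; ++k) {                         // unpack both spectra
            cplx Ck = C[k], Cm = std::conj(C[(N - k) % N]);
            cplx Ak = 0.5 * (Ck + Cm);                           // transform of a
            cplx Bk = cplx(0, -0.5) * (Ck - Cm);                 // transform of b
            std::printf("k=%zu  A=(%.2f,%.2f)  B=(%.2f,%.2f)\n",
                        k, Ak.real(), Ak.imag(), Bk.real(), Bk.imag());
        }
    }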
