
Kernel-assisted MPI Communication on Multi-core Clusters



  1. Kernel-assisted MPI Communication on Multi-core Clusters. Teng Ma, ICL, Knoxville, July 15, 2011

  2. Outline • Motivation • Kernel-assisted collective comm. (linear KNEM collective) • Topology-aware approach • Dodging algorithm • Conclusion

  3. Motivation • Increasing node complexity • Unsatisfactory programming models

  4. Shared-memory approach • Shared-memory region • Copy-in/copy-out • Problems
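A minimal sketch of the copy-in/copy-out pattern this slide refers to: every message crosses a pre-mapped shared region, so it costs two copies, one by the sender and one by the receiver. The static buffer and function names below are illustrative stand-ins; a real implementation would create the region with shm_open/mmap and add synchronization.

    #include <stdio.h>
    #include <string.h>

    #define SLOT_SIZE 4096
    static char shm_slot[SLOT_SIZE];   /* stand-in for a region mapped by both processes */

    /* sender side: first copy, into the shared region ("copy-in") */
    static void sender_copy_in(const char *src, size_t len)  { memcpy(shm_slot, src, len); }

    /* receiver side: second copy, out of the shared region ("copy-out") */
    static void receiver_copy_out(char *dst, size_t len)     { memcpy(dst, shm_slot, len); }

    int main(void)
    {
        const char msg[] = "every message pays two copies";
        char out[sizeof msg];
        sender_copy_in(msg, sizeof msg);
        receiver_copy_out(out, sizeof msg);
        printf("%s\n", out);
        return 0;
    }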

  5. Related work • SMARTMAP (Catamount) • Single-copy large-message comm.: BIP-SMP, Myricom MX, QLogic InfiniPath, and Bull MPI • Kernel-assisted approach: KNEM (Kernel Nemesis), LiMIC

  6. KNEM: inter-process single-copy • MPICH2 since v1.1.1 (Nemesis LMT) • Open MPI since v1.5 (SM/KNEM BTL)

  7. Outline • Motivation • Kernel-assisted collective comm. (linear KNEM collective) • Topology-aware approach • Dodging algorithm • Conclusion

  8. Architecture & Operations • MPI_Bcast, MPI_Gather(v), MPI_Scatter(v), MPI_Allgather(v) and MPI_Alltoall(v)

  9. Linear KNEM Bcast • 1. The root process declares sbuf to the KNEM device to get a cookie back: mca_coll_knem_create_region(buff, count, datatype, knem_module, PROT_READ, &cookie); • 2. The root process distributes this cookie to the non-root processes. • 3. Each non-root process triggers a KNEM copy simultaneously: mca_coll_knem_inline_copy(buff, count, datatype, knem_module, cookie, 0, 0); • 4. Each non-root process sends an ack back to the root process. • 5. The root process deregisters sbuf from the KNEM device.
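A minimal sketch of the five steps above, assuming the mca_coll_knem_* helper signatures shown on the slide, a 64-bit cookie, and MPI point-to-point messages for the cookie/ack exchange; tags, error handling, and the exact deregistration call (not named on the slide) are assumptions.

    #include <mpi.h>
    #include <stdint.h>
    #include <sys/mman.h>   /* PROT_READ */

    /* Helper prototypes as they appear on the slide; exact types are assumed. */
    int mca_coll_knem_create_region(void *buf, int count, MPI_Datatype dtype,
                                    void *knem_module, int prot, uint64_t *cookie);
    int mca_coll_knem_inline_copy(void *buf, int count, MPI_Datatype dtype,
                                  void *knem_module, uint64_t cookie,
                                  uint64_t offset, int flags);

    static int linear_knem_bcast(void *buff, int count, MPI_Datatype dtype,
                                 int root, MPI_Comm comm, void *knem_module)
    {
        int rank, size, ack = 1;
        uint64_t cookie;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            /* 1. declare sbuf to the KNEM device and get a cookie back */
            mca_coll_knem_create_region(buff, count, dtype, knem_module,
                                        PROT_READ, &cookie);
            /* 2. distribute the cookie to the non-root processes */
            for (int p = 0; p < size; p++)
                if (p != root)
                    MPI_Send(&cookie, 1, MPI_UINT64_T, p, 0, comm);
            /* 4. collect one ack per non-root process */
            for (int p = 0; p < size; p++)
                if (p != root)
                    MPI_Recv(&ack, 1, MPI_INT, p, 1, comm, MPI_STATUS_IGNORE);
            /* 5. deregister sbuf from the KNEM device (helper not named on the slide) */
        } else {
            MPI_Recv(&cookie, 1, MPI_UINT64_T, root, 0, comm, MPI_STATUS_IGNORE);
            /* 3. every non-root process triggers its own KNEM copy */
            mca_coll_knem_inline_copy(buff, count, dtype, knem_module, cookie, 0, 0);
            /* 4. send an ack back to the root */
            MPI_Send(&ack, 1, MPI_INT, root, 1, comm);
        }
        return MPI_SUCCESS;
    }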

  10. Linear KNEM Scatter & Gather MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm) • The same as Bcast's 5 steps, except: • 1) mca_coll_knem_create_region(buff, count*N, datatype, knem_module, PROT_READ, &cookie); • 3) mca_coll_knem_inline_copy(buff, count, datatype, knem_module, cookie, rank * count * datatype_size, 0);
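The corresponding non-root copy step for the linear KNEM Scatter, a sketch building on the Bcast example above: the only difference is the per-rank offset into the root's registered send buffer.

    /* Step 3 of linear KNEM Scatter on a non-root rank: pull this rank's
     * slice out of the root's registered sendbuf, at offset
     * rank * recvcnt * datatype_size (helper prototype as assumed above). */
    static void knem_scatter_copy_step(void *recvbuf, int recvcnt,
                                       MPI_Datatype recvtype,
                                       void *knem_module, uint64_t cookie,
                                       int rank)
    {
        int dtsize;
        MPI_Type_size(recvtype, &dtsize);
        mca_coll_knem_inline_copy(recvbuf, recvcnt, recvtype, knem_module, cookie,
                                  (uint64_t)rank * recvcnt * dtsize, 0);
    }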

  11. Linear KNEM Allgather & Alltoall • MPI_Alltoall • Linear KNEM Allgather = KNEM Gather + Bcast
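The composition on this slide, expressed with standard MPI calls for clarity: a linear KNEM Allgather behaves like a KNEM Gather to an (arbitrary) root followed by a KNEM Bcast of the assembled buffer. This is an illustrative sketch, not the component's actual code path.

    #include <mpi.h>

    /* Allgather expressed as Gather + Bcast, mirroring the slide. */
    void allgather_as_gather_plus_bcast(void *sbuf, int scount, MPI_Datatype sdt,
                                        void *rbuf, int rcount, MPI_Datatype rdt,
                                        MPI_Comm comm)
    {
        const int root = 0;                 /* root choice is arbitrary here */
        int size;
        MPI_Comm_size(comm, &size);
        /* step 1: gather every rank's block at the root ... */
        MPI_Gather(sbuf, scount, sdt, rbuf, rcount, rdt, root, comm);
        /* step 2: ... then broadcast the assembled buffer to all ranks */
        MPI_Bcast(rbuf, rcount * size, rdt, root, comm);
    }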

  12. Experiment: platforms • Zoot (4 sockets of quad-core Intel Tigerton) • Saturn (2 sockets of octo-core Intel Nehalem-EX) • Dancer (2 sockets of quad-core Intel Nehalem-EP, an 8-node cluster) • IG (8 sockets of six-core AMD Opteron)

  13. Experiment: software • KNEM version 0.9.2 • IMB 3.2 (with cache_off enabled) • Open MPI trunk version 1.7a1 and MPICH2 1.3.1 • The same mapping between cores and processes is always used. • The number of processes equals the number of cores in each node.

  14. Broadcast & Gather

  15. Alltoallv & Allgather

  16. Outline • Motivation • Kernel-assisted collective comm. (linear KNEM collective) • Topology-aware approach • Dodging algorithm • Conclusion

  17. Kernel-assisted topology-aware hierarchical collectives • Build a collective topology according to the underlying hardware, especially the memory hierarchies inside nodes (via hwloc). • MPI_Bcast (hierarchical tree algorithm) • MPI_Allgather(v) (ring algorithm) • Targets: balance memory accesses across NUMA nodes; minimize remote memory accesses.
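A minimal sketch of the hardware-discovery step, using hwloc as named on the slide: count NUMA nodes and cores and record which NUMA node holds each core, which is the grouping a topology-aware collective uses to place tree or ring neighbors. Object names follow hwloc 1.x (contemporary with this work); hwloc 2.x renames HWLOC_OBJ_NODE to HWLOC_OBJ_NUMANODE and moves NUMA nodes out of the ancestor chain.

    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int nnuma  = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NODE);
        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        printf("%d NUMA nodes, %d cores\n", nnuma, ncores);

        /* group cores by the NUMA node that contains them */
        for (int c = 0; c < ncores; c++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, c);
            hwloc_obj_t numa = hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_NODE, core);
            printf("core %u is in NUMA node %d\n", core->logical_index,
                   numa ? (int)numa->logical_index : -1);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }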

  18. Broadcast on IG (48 cores)

  19. Allgather • T. Ma, T. Herault, G. Bosilca, J. J. Dongarra, Process Distance-aware Adaptive MPI Collective Communications, in Proceedings of IEEE Cluster 2011, Austin, Texas.

  20. Hierarchical effect and tuning the pipeline size (Bcast, IG, 48 cores)

  21. Bcast on IG (48 cores)

  22. Allgather on IG (48 cores)

  23. CPMD on IG (48 cores): CPMD's methan-fd-nosymm test • T. Ma, A. Bouteiller, G. Bosilca and J. J. Dongarra, Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW. In Proceedings of the 18th EuroMPI Conference (2011), Santorini, Greece.

  24. Outline • Motivation • Kernel-assisted collective comm. (linear KNEM collective) • Topology-aware approach • Dodging algorithm • Conclusion

  25. Hierarchical + tuned + KNEM • Tackling MPI on multi-core clusters • Leader-based hierarchical algorithm (2 layers + pipelining) • MPI_Bcast, MPI_Allgather and MPI_Reduce • Offload intra-node communication to non-leader processes via the kernel-assisted approach, overlapping it with inter-node message forwarding. • Target: reduce the scale of the collective from the number of cores to the number of nodes.
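A sketch of the two-layer, leader-based broadcast described above, under stated assumptions: the global root is rank 0 and therefore the leader of its node, MPI_Comm_split_type (an MPI-3 convenience that post-dates this talk) stands in for the node detection, the intra-node MPI_Bcast stands in for the linear KNEM Bcast performed by the non-leader processes, and the chunk size is illustrative of the pipelining.

    #include <mpi.h>

    void hierarchical_bcast(char *buf, int nbytes, MPI_Comm comm)
    {
        /* layer split: one communicator per node, plus one communicator of leaders */
        MPI_Comm node_comm, leader_comm;
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

        const int chunk = 64 * 1024;           /* pipeline unit (illustrative) */
        for (int off = 0; off < nbytes; off += chunk) {
            int len = nbytes - off < chunk ? nbytes - off : chunk;
            /* inter-node layer: only the leaders forward this chunk between nodes */
            if (node_rank == 0)
                MPI_Bcast(buf + off, len, MPI_BYTE, 0, leader_comm);
            /* intra-node layer: deliver the chunk inside the node; here a plain
             * MPI_Bcast stands in for the kernel-assisted (KNEM) broadcast in
             * which the non-leader processes perform the copies themselves */
            MPI_Bcast(buf + off, len, MPI_BYTE, 0, node_comm);
        }

        if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
        MPI_Comm_free(&node_comm);
    }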

  26. Bcast on Dancer (64 cores)

  27. Broadcast on Dancer (64 cores)

  28. Scaling with the number of cores per node (Dancer, 8 nodes)

  29. Reduce on Dancer (64 cores)

  30. Conclusion • The kernel-assisted approach provides a scalable solution for MPI on multi-core nodes. • It can cooperate with other algorithms. • It can be used to dodge the intra-node communication overhead of MPI on multi-core clusters.

  31. T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres, J. J. Dongarra, Kernel Assisted Collective Intra-node Communication Among Multi-core and Many-core CPUs. In Proceedings of the 40th International Conference on Parallel Processing (ICPP 2011), Taipei, Taiwan, 2011. • T. Ma, A. Bouteiller, G. Bosilca and J. J. Dongarra, Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW. In Proceedings of the 18th EuroMPI Conference (2011), Santorini, Greece.
