
Impact of Kernel-assisted MPI Communication on Scientific Applications



Presentation Transcript


  1. Impact of Kernel-assisted MPI Communication on Scientific Applications Teng Ma, Aurelien Bouteiller, George Bosilca, Jack J. Dongarra

  2. Intel micro-architecture (1978-2013)
  • 1978: 8086, 5-10 MHz
  • 1982: 80186, 6-25 MHz
  • 1982: 80286, 6-25 MHz
  • 1985: i386, 12-40 MHz
  • 1989: i486, 16-100 MHz
  • 1993: P5, 60-300 MHz
  • 1995: P6, 200-500 MHz
  • 2000-2005: NetBurst, 1-3.2 GHz
  • 2006: Core / Core 2 Duo, dual-core Xeon, 2.1-3 GHz, 65 nm or 45 nm, 1-6 cores/processor
  • 2008: Nehalem (Core i3/i5/i7), 1.8-3.2 GHz, 45 nm or 32 nm, 2-10 cores/processor
  • 2011: Sandy Bridge (2nd-generation Core i3/i5/i7), 1.8-3.2 GHz, 32 nm or 22 nm, 2-8 cores/processor
  • 2013: Haswell, ? GHz, 22 nm, 10 cores/processor

  3. Introduction • Multi-core clusters, large NUMA machines • Programming model: MPI • Performance portability • Communication is critical to MPI applications' performance (Figure: a server equipped with eight six-core AMD Opteron 8439 SE processors)

  4. Existing Issues in MPI Implementations • Programming model designed for one rank per process, but mostly used as one process per core • Lack of efficient shared-memory message-delivery approaches, mainly due to limitations in the OS • Topology-unaware MPI implementations

  5. MPI Intra-node Communication
  • Shared-memory copy-in/copy-out approach: SM BTL (Open MPI), Nemesis (MPICH2)
    • Pros: low latency, portable
    • Cons: wastes CPU cycles and memory bandwidth, reduces cache reuse
    • Data path: the sender copies the send buffer into a shared buffer, then the receiver copies it out into the receive buffer (two copies; see the sketch below)
  • Inter-process single-copy approach: SMARTMAP (Catamount)
  • Kernel-assisted approach: KNEM (Open MPI and MPICH2), LiMIC (MVAPICH2), XPMEM (Cray MPI and Open MPI), CMA (Open MPI)
    • Data path: the sender declares its send buffer and the receiver copies from it directly (one copy)
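To make the cost of the copy-in/copy-out path concrete, here is a minimal sketch in C, assuming a hypothetical POSIX shared-memory segment name and size (the real SM BTL and Nemesis use lock-free queues and pipelined fragments, which this ignores). The point is simply that every message crosses memory twice:

/* Copy-in/copy-out through a shared segment: a minimal illustrative sketch.
 * SEG_NAME and SEG_SIZE are hypothetical, not the SM BTL's actual layout. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_NAME "/mpi_sm_seg"    /* hypothetical segment name */
#define SEG_SIZE (1 << 20)        /* hypothetical segment size */

/* Sender side: first copy, from the private send buffer into shared memory. */
static void copy_in(const void *send_buf, size_t len)
{
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SEG_SIZE);
    void *shared = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(shared, send_buf, len);   /* copy #1: consumes sender CPU cycles */
    munmap(shared, SEG_SIZE);
    close(fd);
}

/* Receiver side: second copy, from shared memory into the receive buffer. */
static void copy_out(void *recv_buf, size_t len)
{
    int fd = shm_open(SEG_NAME, O_RDWR, 0600);
    void *shared = mmap(NULL, SEG_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    memcpy(recv_buf, shared, len);   /* copy #2: consumes memory bandwidth, evicts cache lines */
    munmap(shared, SEG_SIZE);
    close(fd);
}

The kernel-assisted, single-copy path described next removes the intermediate buffer entirely, so only one copy remains.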

  6. KNEM: inter-process one-sided single-copy
  Processes A and B interact with the kernel device /dev/knem, which tracks registered regions in a hash table of cookies:
  1. Proc A registers its source buffer with the KNEM device.
  2. The kernel returns a cookie for the registered region.
  3. Proc A sends the cookie to Proc B.
  4. Proc B issues ioctl(.., COPY, &icopy) on /dev/knem.
  5. The kernel copies directly from A's source buffer into B's destination buffer (KNEM copy).
  6. Proc B sends an ACK to Proc A.
  7. Proc A deregisters the buffer.
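The same sequence can be written from user space roughly as follows. This is a sketch only: the ioctl command numbers and struct layouts are placeholders standing in for the driver's real knem_io.h definitions, and the cookie/ACK exchange (steps 3 and 6) is assumed to travel over the usual shared-memory queue.

/* User-space view of the KNEM one-sided copy, following the slide's steps.
 * KNEM_REGISTER/KNEM_COPY/KNEM_DEREGISTER and the two structs are placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct knem_region { void *base; size_t len; uint64_t cookie; };   /* hypothetical */
struct knem_icopy  { uint64_t src_cookie; void *dst; size_t len; };/* hypothetical */
#define KNEM_REGISTER   _IOWR('k', 1, struct knem_region)  /* placeholder command */
#define KNEM_COPY       _IOW('k', 2, struct knem_icopy)    /* placeholder command */
#define KNEM_DEREGISTER _IOW('k', 3, uint64_t)             /* placeholder command */

/* Sender (Proc A): steps 1-3, then step 7 once the ACK arrives. */
uint64_t sender_register(int fd, void *src, size_t len)
{
    struct knem_region reg = { .base = src, .len = len };
    ioctl(fd, KNEM_REGISTER, &reg);      /* 1. register buffer; 2. kernel returns a cookie */
    return reg.cookie;                   /* 3. the cookie is sent to B over the SM queue */
}

void sender_cleanup(int fd, uint64_t cookie)
{
    ioctl(fd, KNEM_DEREGISTER, &cookie); /* 7. deregister once B's ACK (step 6) is received */
}

/* Receiver (Proc B): steps 4-6. */
void receiver_copy(int fd, uint64_t cookie, void *dst, size_t len)
{
    struct knem_icopy icopy = { .src_cookie = cookie, .dst = dst, .len = len };
    ioctl(fd, KNEM_COPY, &icopy);        /* 4. copy request; 5. kernel copies A's buffer into dst */
                                         /* 6. B then ACKs A so the region can be released */
}

/* Both processes open the character device once, e.g. int fd = open("/dev/knem", O_RDWR); */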

  7. Development of the kernel-assisted approach in MPI stacks • Intra-node p2p comm.: Open MPI (SM/KNEM BTL, SM/CMA BTL, vader BTL), MPICH2-LMT (KNEM), MVAPICH2 (LiMIC) • Intra-node collective comm.: KNEM Coll. (Open MPI) • Inter- and intra-node collective comm.: HierKNEM Coll. (Open MPI)

  8. Intra-node MPI Collectives (KNEM Collectives)
  Software stack, top to bottom: MPI API → Coll framework (collectives: KNEM Coll, Basic Coll, Tuned Coll, SM Coll) → BTL framework (point-to-point: SM/KNEM BTL, SM BTL) → operating system (KNEM driver, shared memory).
  Collectives covered: MPI_Bcast, MPI_Gather(v), MPI_Scatter(v), MPI_Alltoall(v), and MPI_Allgather(v); see the example below.
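These collectives are reached through the unchanged MPI API; the Coll framework selects the KNEM Coll component at runtime, so application code such as the short example below (buffer sizes arbitrary) needs no modification to benefit.

/* Exercises MPI_Bcast and MPI_Scatter; whether the KNEM Coll, Tuned, or SM
 * component serves them is decided at runtime by the Coll framework. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                      /* 1 Mi ints per rank, arbitrary */
    int *bcast_buf = malloc(n * sizeof(int));
    int *scatter_src = NULL, *scatter_dst = malloc(n * sizeof(int));
    if (rank == 0) {
        scatter_src = malloc((size_t)n * size * sizeof(int));
        for (int i = 0; i < n; i++) bcast_buf[i] = i;
    }

    MPI_Bcast(bcast_buf, n, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(scatter_src, n, MPI_INT, scatter_dst, n, MPI_INT, 0, MPI_COMM_WORLD);

    free(bcast_buf); free(scatter_dst); free(scatter_src);
    MPI_Finalize();
    return 0;
}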

  9. KNEM Collectives
  The same one-sided protocol as slide 6, applied rooted collectives: the root process registers its buffer (steps 1-2) and distributes the cookie to all peers (step 3); each peer (Proc B, Proc C, ...) then performs its own KNEM copy via ioctl (steps 4-5) and acknowledges (step 6), after which the root deregisters (step 7).

  10. Experiment Environment
  • Hardware
    • Zoot (old): quad-socket, quad-core Intel Tigerton, UMA/SMP, L2 cache shared by pairs of cores
    • Dancer (very common): dual-socket, quad-core Intel Nehalem, small NUMA, L3 cache shared per socket
    • IG (large): octo-socket, hexa-core AMD Istanbul, large NUMA, L3 cache shared per socket
  • Software
    • Open MPI trunk
      • KNEM: KNEM collectives
      • OMPI Tuned collectives: OMPI-SM (without KNEM pt2pt), OMPI-KNEM (with KNEM pt2pt)
    • KNEM 0.9.2
    • IMB 3.2 (cache reuse off)
    • MPICH2 1.3.1: MPICH-SM (without KNEM pt2pt), MPICH-KNEM (with KNEM pt2pt)

  11. Bcast Comparison on SMP machine Zoot: Quad-socket Intel quad-core machine Aggregate Bcast Bandwidth on Zoot, 16 processes on 16 cores

  12. Bcast Comparison on small NUMA machine Aggregate Bcast Bandwidth on Dancer, 8 processes on 8 cores Dancer: dual-socket Intel quad-core machine

  13. Bcast Comparison on large NUMA machine IG: 8-socket AMD 6-core Istanbul machine Aggregate Bcast Bandwidth on IG, 48 processes on 48 cores

  14. Alltoallv Comparison on SMP machine Aggregate Alltoallv Bandwidth on Zoot, 16 processes on 16 cores Zoot: Quad-socket Intel quad-core machine

  15. Alltoallv Comparison on small NUMA machine Aggregate Alltoallv Bandwidth on Dancer, 8 processes on 8 cores Dancer: dual-socket Intel quad-core machine

  16. Alltoallv Comparison on large NUMA machine Aggregate Alltoallv Bandwidth on IG, 48 processes on 48 cores IG: 8-socket AMD 6-core Istanbul machine

  17. Inter- and Intra-node Collectives (HierKNEM*) • Tackling MPI on multi-core/many-core clusters • Leader-based hierarchical algorithm • Two layers: inter-node and intra-node • Pipelining strategy between the two layers: the message is split into chunks • Intra-node communication is offloaded to non-leader processes via the kernel-assisted approach and overlapped with inter-node message forwarding (a conceptual sketch follows this slide) • MPI_Bcast, MPI_Allgather and MPI_Reduce • Target: reducing the scale of the collective from the number of cores to the number of nodes. *[Ma, Bosilca, Bouteiller, Dongarra, IPDPS 12]
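The structure can be sketched with plain MPI calls, as below. This is a conceptual illustration only: it assumes the broadcast root is global rank 0 and uses blocking MPI_Bcast at both layers, whereas HierKNEM offloads the intra-node layer to kernel-assisted copies performed by the non-leader processes so the leader can keep forwarding the next chunk between nodes.

/* Leader-based hierarchical broadcast with chunked pipelining: a conceptual sketch. */
#include <mpi.h>

void hier_bcast(char *buf, size_t len, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank, world_rank;

    /* Split by shared-memory node; local rank 0 becomes the node leader. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_rank(comm, &world_rank);
    /* Leaders form their own communicator for the inter-node layer. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, world_rank, &leader_comm);

    const size_t chunk = 64 * 1024;              /* pipeline unit, arbitrary */
    for (size_t off = 0; off < len; off += chunk) {
        size_t cur = (len - off < chunk) ? (len - off) : chunk;
        /* Inter-node layer: only the leaders exchange this chunk. */
        if (node_rank == 0)
            MPI_Bcast(buf + off, (int)cur, MPI_BYTE, 0, leader_comm);
        /* Intra-node layer: the leader hands the chunk to its local ranks.
         * In HierKNEM the local ranks pull it themselves through KNEM, which is
         * what lets this step truly overlap with the inter-node forwarding. */
        MPI_Bcast(buf + off, (int)cur, MPI_BYTE, 0, node_comm);
    }

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}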

  18. Performance Profiling of HierKNEM Broadcast Performance profiling of the HierKNEM broadcast with a 512 KB message on the Dancer cluster: 8 nodes interconnected by a 10G IB network, -bycore binding, 64 processes on 64 cores (8 nodes × 8 cores/node).

  19. Distributed Experimental Environment • Hardware • Stremi & Parapluie clusters, 32 nodes each • Node: two 12-core AMD processors • 1 Gigabit Ethernet (Stremi), 20G IB (Parapluie) • Software • Open MPI trunk, MPICH2-1.4 and MVAPICH2-1.7 • KNEM version 0.9.6, LiMIC 0.5.5 • IMB-3.2 (cache reuse on)

  20. HierKNEM Broadcast Performance (up to 20x speedup) Aggregate Broadcast Bandwidth of Collective Modules on two Multicore Clusters (768 processes, 32 nodes, 24 cores/node).

  21. MPI Applications: CPMD, FFTW & ASP • Car-Parrinello Molecular Dynamics (CPMD) • Fastest Fourier Transform in the West (FFTW) • All-pairs-shortest-path (ASP) • All three are communication-intensive applications dominated by collectives: CPMD (Alltoall), FFTW with verification (Bcast), ASP (Bcast)

  22. Hardware Environment: IG & Stremi Stremi: 1G Ethernet cluster, 32 nodes, each node with two 12-core processors. IG: 48-core AMD large NUMA machine.

  23. Software Environment
  • KNEM (version 0.9.6), Open MPI (trunk r24549), mpiP (version 3.2.1)
  • Shared-memory environment (IG):
    • CPMD (version 3.13.2) configured with LAPACK 3.3.0; vibrational-analysis test with input methan-fd-nosymm.inp
    • mpi-bench in FFTW-3.2.alpha, -input [1500*1500 20 20] and -y (verification)
    • MPI setup: shared-memory setup = SM BTL + Tuned Coll.; KNEM setup = SM/KNEM BTL + KNEM Coll.
  • Distributed environment (Stremi):
    • ASP in MagPie, input: two integer matrices, 16384*16384 and 32768*32768
    • MPI setup: HierKNEM Coll. vs. OMPI_Hier vs. OMPI_Tuned vs. MPICH2 1.4.1

  24. CPMD on IG (48 cores) Strong scaling for CPMD's methan-fd-nosymm over KNEM and shared memory. Processes are bound to IG's cores in a compact way (rank i is bound to core i).

  25. CPMD: MPI Usage (the chart highlights a 4x difference) Sum of all processes' execution time for the 5 most-used MPI functions in CPMD's methan-fd-nosymm using shared memory and KNEM (48 processes on IG's 48 cores).

  26. FFTW on IG; FFTW: MPI Usage (the chart highlights a 6x difference)

  27. Breakdown of ASP runtime on Stremi (768 processes, 32 nodes × 24 cores/node, 1G Ethernet)

  28. Conclusion • Kernel-assisted (KNEM-aware) MPI collective communication greatly improves the throughput of scientific applications • The advantage of the approach grows with the number of cores • The kernel-assisted approach reduces the gap between MPI and other lightweight multi-core programming models.
