
Impact of Kernel-assisted MPI Communication on Scientific Applications



Presentation Transcript


  1. Impact of Kernel-assisted MPI Communication on Scientific Applications Teng Ma, Aurelien Bouteiller, George Bosilca, Jack J. Dongarra

  2. Intel micro-architecture (1978-2013)
  • 1978: 8086, 5-10 MHz
  • 1982: 80186, 6-25 MHz
  • 1982: 80286, 6-25 MHz
  • 1985: i386, 12-40 MHz
  • 1989: i486, 16-100 MHz
  • 1993: P5, 60-300 MHz
  • 1995: P6, 200-500 MHz
  • 2000-2005: NetBurst, 1-3.2 GHz
  • 2006: Core / Core 2 Duo, dual-core Xeon, 2.1-3 GHz, 65 nm or 45 nm, 1-6 cores/processor
  • 2008: Nehalem (Core i3/i5/i7), 1.8-3.2 GHz, 45 nm or 32 nm, 2-10 cores/processor
  • 2011: Sandy Bridge (2nd-generation Core i3/i5/i7), 1.8-3.2 GHz, 32 nm or 22 nm, 2-8 cores/processor
  • 2013: Haswell, ? GHz, 22 nm, 10 cores/processor

  3. Introduction • Multi-core clusters, large NUMA machines • Programming model: MPI • Performance portability • Communication is critical to MPI applications' performance (Figure: a server equipped with eight six-core AMD Opteron 8439 SE processors)

  4. Existing Issues in MPI Implementations • Programming model designed for one rank per process, but mostly used as one process per core • Lack of efficient shared-memory message-delivery approaches, mainly due to limitations in the OS • Topology-unaware MPI implementations

  5. MPI Intra-node Communication
  • Shared-memory copy-in/copy-out approach: SM BTL (Open MPI), Nemesis (MPICH2)
    • Pros: low latency, portable
    • Cons: wastes CPU cycles and memory bandwidth, reduces cache reuse
    • Data path: the sender copies the send buffer into a shared buffer, then the receiver copies it out into the receive buffer (two copies; see the sketch below)
  • Inter-process single-copy approach: SMARTMAP (Catamount)
  • Kernel-assisted approach: KNEM (Open MPI and MPICH2), LiMIC (MVAPICH2), XPMEM (Cray MPI and Open MPI), CMA (Open MPI)
    • Data path: the sender declares its send buffer and the receiver copies from it directly (one copy)
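To make the cost of the copy-in/copy-out path concrete, here is a minimal sketch in C, assuming a hypothetical POSIX shared-memory segment name and size (the real SM BTL and Nemesis use lock-free queues and pipelined fragments, which this ignores). The point is simply that every message crosses memory twice:

/* Copy-in/copy-out through a shared segment: a minimal illustrative sketch.
 * SEG_NAME and SEG_SIZE are hypothetical, not the SM BTL's actual layout. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_NAME "/mpi_sm_seg"    /* hypothetical segment name */
#define SEG_SIZE (1 << 20)        /* hypothetical segment size */

/* Sender side: first copy, from the private send buffer into shared memory. */
static void copy_in(const void *send_buf, size_t len)
{
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SEG_SIZE);
    void *shared = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(shared, send_buf, len);   /* copy #1: consumes sender CPU cycles */
    munmap(shared, SEG_SIZE);
    close(fd);
}

/* Receiver side: second copy, from shared memory into the receive buffer. */
static void copy_out(void *recv_buf, size_t len)
{
    int fd = shm_open(SEG_NAME, O_RDWR, 0600);
    void *shared = mmap(NULL, SEG_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    memcpy(recv_buf, shared, len);   /* copy #2: consumes memory bandwidth, evicts cache lines */
    munmap(shared, SEG_SIZE);
    close(fd);
}

The kernel-assisted, single-copy path described next removes the intermediate buffer entirely, so only one copy remains.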

  6. KNEM: inter-process one-sided single-copy
  Processes A and B interact with the kernel device /dev/knem, which tracks registered regions in a hash table of cookies:
  1. Proc A registers its source buffer with the KNEM device.
  2. The kernel returns a cookie for the registered region.
  3. Proc A sends the cookie to Proc B.
  4. Proc B issues ioctl(.., COPY, &icopy) on /dev/knem.
  5. The kernel copies directly from A's source buffer into B's destination buffer (KNEM copy).
  6. Proc B sends an ACK to Proc A.
  7. Proc A deregisters the buffer.
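The same sequence can be written from user space roughly as follows. This is a sketch only: the ioctl command numbers and struct layouts are placeholders standing in for the driver's real knem_io.h definitions, and the cookie/ACK exchange (steps 3 and 6) is assumed to travel over the usual shared-memory queue.

/* User-space view of the KNEM one-sided copy, following the slide's steps.
 * KNEM_REGISTER/KNEM_COPY/KNEM_DEREGISTER and the two structs are placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct knem_region { void *base; size_t len; uint64_t cookie; };   /* hypothetical */
struct knem_icopy  { uint64_t src_cookie; void *dst; size_t len; };/* hypothetical */
#define KNEM_REGISTER   _IOWR('k', 1, struct knem_region)  /* placeholder command */
#define KNEM_COPY       _IOW('k', 2, struct knem_icopy)    /* placeholder command */
#define KNEM_DEREGISTER _IOW('k', 3, uint64_t)             /* placeholder command */

/* Sender (Proc A): steps 1-3, then step 7 once the ACK arrives. */
uint64_t sender_register(int fd, void *src, size_t len)
{
    struct knem_region reg = { .base = src, .len = len };
    ioctl(fd, KNEM_REGISTER, &reg);      /* 1. register buffer; 2. kernel returns a cookie */
    return reg.cookie;                   /* 3. the cookie is sent to B over the SM queue */
}

void sender_cleanup(int fd, uint64_t cookie)
{
    ioctl(fd, KNEM_DEREGISTER, &cookie); /* 7. deregister once B's ACK (step 6) is received */
}

/* Receiver (Proc B): steps 4-6. */
void receiver_copy(int fd, uint64_t cookie, void *dst, size_t len)
{
    struct knem_icopy icopy = { .src_cookie = cookie, .dst = dst, .len = len };
    ioctl(fd, KNEM_COPY, &icopy);        /* 4. copy request; 5. kernel copies A's buffer into dst */
                                         /* 6. B then ACKs A so the region can be released */
}

/* Both processes open the character device once, e.g. int fd = open("/dev/knem", O_RDWR); */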

  7. Development of the kernel-assisted approach in MPI stacks • Intra-node p2p comm.: Open MPI (SM/KNEM BTL, SM/CMA BTL, vader BTL), MPICH2-LMT (KNEM), MVAPICH2 (LiMIC) • Intra-node collective comm.: KNEM Coll. (Open MPI) • Inter- and intra-node collective comm.: HierKNEM Coll. (Open MPI)

  8. Intra-node MPI Collectives (KNEM Collectives)
  Software stack, top to bottom: MPI API → Coll framework (collectives: KNEM Coll, Basic Coll, Tuned Coll, SM Coll) → BTL framework (point-to-point: SM/KNEM BTL, SM BTL) → operating system (KNEM driver, shared memory).
  Collectives covered: MPI_Bcast, MPI_Gather(v), MPI_Scatter(v), MPI_Alltoall(v), and MPI_Allgather(v); see the example below.
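These collectives are reached through the unchanged MPI API; the Coll framework selects the KNEM Coll component at runtime, so application code such as the short example below (buffer sizes arbitrary) needs no modification to benefit.

/* Exercises MPI_Bcast and MPI_Scatter; whether the KNEM Coll, Tuned, or SM
 * component serves them is decided at runtime by the Coll framework. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                      /* 1 Mi ints per rank, arbitrary */
    int *bcast_buf = malloc(n * sizeof(int));
    int *scatter_src = NULL, *scatter_dst = malloc(n * sizeof(int));
    if (rank == 0) {
        scatter_src = malloc((size_t)n * size * sizeof(int));
        for (int i = 0; i < n; i++) bcast_buf[i] = i;
    }

    MPI_Bcast(bcast_buf, n, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(scatter_src, n, MPI_INT, scatter_dst, n, MPI_INT, 0, MPI_COMM_WORLD);

    free(bcast_buf); free(scatter_dst); free(scatter_src);
    MPI_Finalize();
    return 0;
}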

  9. KNEM Collectives
  The same one-sided protocol as slide 6, applied rooted collectives: the root process registers its buffer (steps 1-2) and distributes the cookie to all peers (step 3); each peer (Proc B, Proc C, ...) then performs its own KNEM copy via ioctl (steps 4-5) and acknowledges (step 6), after which the root deregisters (step 7).

  10. Experiment Environment
  • Hardware
    • Zoot (old): quad-socket, quad-core Intel Tigerton, UMA/SMP, L2 cache shared by pairs of cores
    • Dancer (very common): dual-socket, quad-core Intel Nehalem, small NUMA, L3 cache shared per socket
    • IG (large): octo-socket, hexa-core AMD Istanbul, large NUMA, L3 cache shared per socket
  • Software
    • Open MPI trunk
      • KNEM: KNEM collectives
      • OMPI Tuned collectives: OMPI-SM (without KNEM pt2pt), OMPI-KNEM (with KNEM pt2pt)
    • KNEM 0.9.2
    • IMB 3.2 (cache reuse off)
    • MPICH2 1.3.1: MPICH-SM (without KNEM pt2pt), MPICH-KNEM (with KNEM pt2pt)

  11. Bcast Comparison on SMP machine Zoot: Quad-socket Intel quad-core machine Aggregate Bcast Bandwidth on Zoot, 16 processes on 16 cores

  12. Bcast Comparison on small NUMA machine Aggregate Bcast Bandwidth on Dancer, 8 processes on 8 cores Dancer: dual-socket Intel quad-core machine

  13. Bcast Comparison on large NUMA machine IG: 8-socket AMD 6-core Istanbul machine Aggregate Bcast Bandwidth on IG, 48 processes on 48 cores

  14. Alltoallv Comparison on SMP machine Aggregate Alltoallv Bandwidth on Zoot, 16 processes on 16 cores Zoot: Quad-socket Intel quad-core machine

  15. Alltoallv Comparison on small NUMA machine Aggregate Alltoallv Bandwidth on Dancer, 8 processes on 8 cores Dancer: dual-socket Intel quad-core machine

  16. Alltoallv Comparison on large NUMA machine Aggregate Alltoallv Bandwidth on IG, 48 processes on 48 cores IG: 8-socket AMD 6-core Istanbul machine

  17. Inter- and Intra-node Collectives (HierKNEM*) • Tackling MPI on multi-core/many-core clusters • Leader-based hierarchical algorithm • Two layers: inter-node and intra-node • Pipelining strategy between the two layers: the message is split into chunks • Intra-node communication is offloaded to non-leader processes via the kernel-assisted approach and overlapped with inter-node message forwarding (a conceptual sketch follows this slide) • MPI_Bcast, MPI_Allgather and MPI_Reduce • Target: reducing the scale of the collective from the number of cores to the number of nodes. *[Ma, Bosilca, Bouteiller, Dongarra, IPDPS 12]
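The structure can be sketched with plain MPI calls, as below. This is a conceptual illustration only: it assumes the broadcast root is global rank 0 and uses blocking MPI_Bcast at both layers, whereas HierKNEM offloads the intra-node layer to kernel-assisted copies performed by the non-leader processes so the leader can keep forwarding the next chunk between nodes.

/* Leader-based hierarchical broadcast with chunked pipelining: a conceptual sketch. */
#include <mpi.h>

void hier_bcast(char *buf, size_t len, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank, world_rank;

    /* Split by shared-memory node; local rank 0 becomes the node leader. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_rank(comm, &world_rank);
    /* Leaders form their own communicator for the inter-node layer. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, world_rank, &leader_comm);

    const size_t chunk = 64 * 1024;              /* pipeline unit, arbitrary */
    for (size_t off = 0; off < len; off += chunk) {
        size_t cur = (len - off < chunk) ? (len - off) : chunk;
        /* Inter-node layer: only the leaders exchange this chunk. */
        if (node_rank == 0)
            MPI_Bcast(buf + off, (int)cur, MPI_BYTE, 0, leader_comm);
        /* Intra-node layer: the leader hands the chunk to its local ranks.
         * In HierKNEM the local ranks pull it themselves through KNEM, which is
         * what lets this step truly overlap with the inter-node forwarding. */
        MPI_Bcast(buf + off, (int)cur, MPI_BYTE, 0, node_comm);
    }

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}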

  18. Performance Profiling of HierKNEM Broadcast Performance profiling of the HierKNEM broadcast with a 512 KB message on the Dancer cluster: 8 nodes interconnected by a 10G IB network, -bycore binding, 64 processes on 64 cores (8 nodes × 8 cores/node).

  19. Distributed Experimental Environment • Hardware • Stremi & Parapluie clusters, 32 nodes each • Node: two 12-core AMD processors • 1 Gigabit Ethernet (Stremi), 20G IB (Parapluie) • Software • Open MPI trunk, MPICH2-1.4 and MVAPICH2-1.7 • KNEM version 0.9.6, LiMIC 0.5.5 • IMB-3.2 (cache reuse on)

  20. HierKNEM Broadcast Performance (up to 20x speedup) Aggregate Broadcast Bandwidth of Collective Modules on two Multicore Clusters (768 processes, 32 nodes, 24 cores/node).

  21. MPI Applications: CPMD, FFTW & ASP • Car-Parrinello Molecular Dynamics (CPMD) • Fastest Fourier Transform in the West (FFTW) • All-pairs-shortest-path (ASP) • All three are communication-intensive applications dominated by collectives: CPMD (Alltoall), FFTW with verification (Bcast), ASP (Bcast)

  22. Hardware Environment: IG & Stremi Stremi: 1G Ethernet cluster, 32 nodes, each node with two 12-core processors. IG: 48-core AMD large NUMA machine.

  23. Software Environment
  • KNEM (version 0.9.6), Open MPI (trunk r24549), mpiP (version 3.2.1)
  • Shared-memory environment (IG):
    • CPMD (version 3.13.2) configured with LAPACK 3.3.0; vibrational-analysis test with input methan-fd-nosymm.inp
    • mpi-bench in FFTW-3.2.alpha, -input [1500*1500 20 20] and -y (verification)
    • MPI setup: shared-memory setup = SM BTL + Tuned Coll.; KNEM setup = SM/KNEM BTL + KNEM Coll.
  • Distributed environment (Stremi):
    • ASP in MagPie, input: two integer matrices, 16384*16384 and 32768*32768
    • MPI setup: HierKNEM Coll. vs. OMPI_Hier vs. OMPI_Tuned vs. MPICH2 1.4.1

  24. CPMD on IG (48 cores) Strong scaling for CPMD's methan-fd-nosymm over KNEM and shared memory. Processes are bound to IG's cores in a compact way (rank i is bound to core i).

  25. CPMD: MPI Usage (the chart highlights a 4x difference) Sum of all processes' execution time for the 5 most-used MPI functions in CPMD's methan-fd-nosymm using shared memory and KNEM (48 processes on IG's 48 cores).

  26. FFTW on IG; FFTW: MPI Usage (the chart highlights a 6x difference)

  27. Breakdown of ASP runtime on Stremi (768 processes, 32 nodes × 24 cores/node, 1G Ethernet)

  28. Conclusion • Kernel-assisted (KNEM-aware) MPI collective communication greatly improves the throughput of scientific applications • The advantage of the approach grows with the number of cores • The kernel-assisted approach reduces the gap between MPI and other lightweight multi-core programming models.
