
Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE™ Architecture


Presentation Transcript


  1. Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE™ Architecture
     Kumar 1, G. Senthilkumar 1, M. Krishna 1, N. Jayam 1, P.K. Baruah 1, R. Sarma 1, S. Kapoor 2, A. Srinivasan 3
     1 Sri Sathya Sai University, Prashanthi Nilayam, India
     2 IBM, Austin, skapoor@us.ibm.com
     3 Florida State University, asriniva@cs.fsu.edu
     Goals:
     • Determine the feasibility of intra-Cell MPI
     • Evaluate the impact of different design choices on performance

  2. Cell Architecture
     • A PowerPC core (PPE) with 8 co-processors (SPEs), each with a 256 KB local store
     • Shared 512 MB - 2 GB main memory; SPEs access it through DMA
     • Peak speeds of 204.8 Gflop/s in single precision and 14.64 Gflop/s in double precision for the SPEs
     • 204.8 GB/s EIB bandwidth, 25.6 GB/s memory bandwidth
     • Two Cell processors can be combined to form a Cell blade with global shared memory
     [Figure: DMA put times for memory-to-memory copy, using (i) an SPE local store and (ii) memcpy by the PPE]
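As a concrete illustration of how an SPE moves data, the minimal sketch below (not part of the original slides) uses the Cell SDK's MFC intrinsics from spu_mfcio.h to pull a block from main memory into the local store and wait for the DMA to complete; the buffer size, tag value, and helper name are illustrative.

```c
#include <spu_mfcio.h>   /* MFC DMA intrinsics from the Cell SDK */
#include <stdint.h>

/* Local-store buffer; DMA transfers require 16-byte alignment, and
   128-byte alignment performs best.  A single DMA moves at most 16 KB. */
static char ls_buf[16384] __attribute__((aligned(128)));

/* Illustrative helper: copy 'size' bytes from effective (main-memory)
   address 'ea' into the local store and block until the DMA finishes. */
void ls_get(uint64_t ea, uint32_t size)
{
    const uint32_t tag = 0;                 /* DMA tag group 0              */

    mfc_get(ls_buf, ea, size, tag, 0, 0);   /* enqueue the DMA transfer     */
    mfc_write_tag_mask(1 << tag);           /* select tag group to wait on  */
    mfc_read_tag_status_all();              /* block until it completes     */
}
```

A matching mfc_put moves local-store data back to main memory; the "DMA put times" chart on this slide compares memory-to-memory copies staged through an SPE local store against a plain memcpy on the PPE.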

  3. Intra-Cell MPI Design Choices
     Cell features:
     • In-order execution, but DMAs can complete out of order
     • Over 100 simultaneous DMAs can be in flight
     Constraints:
     • Unconventional, heterogeneous architecture
     • SPEs have limited functionality and can act directly only on their local stores
     • SPEs access main memory through DMA
     • Use of the PPE should be limited to get good performance
     MPI design choices (see the sketch below):
     • Application data in: (i) local store or (ii) main memory
     • MPI metadata in: (i) local store or (ii) main memory
     • PPE involvement: (i) active or (ii) only during initialization and finalization
     • Point-to-point communication mode: (i) synchronous or (ii) buffered
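The placement of MPI metadata drives the implementation structure. The sketch below is purely illustrative and is not the authors' actual data structure: a per sender/receiver pair descriptor that can sit either in an SPE local store or in main memory, where it is polled via DMA.

```c
#include <stdint.h>

/* Purely illustrative per-pair message descriptor; field names, layout,
   and the 128-byte (one DMA-friendly line) sizing are assumptions. */
typedef struct {
    uint64_t          src_ea;  /* effective address of the sender's payload */
    uint32_t          size;    /* payload size in bytes                     */
    uint32_t          tag;     /* MPI tag of the pending message            */
    volatile uint32_t flag;    /* 0 = slot empty, 1 = message posted        */
} __attribute__((aligned(128))) msg_entry_t;

/* With metadata in main memory, an SPE polls such an entry by DMA-ing the
   record into its local store and checking 'flag'; with metadata in local
   store, the sender instead DMAs the record directly into the receiver's
   local store.  In either case the PPE only sets up the regions during
   initialization, keeping its role limited. */
```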

  4. Blocking Point-to-Point Communication Performance
     • Results are from a 3.2 GHz Cell blade at IBM Rochester
     • The final version uses buffered mode for small messages and synchronous mode for long messages
     • The threshold for switching to synchronous mode is set to 2 KB (see the sketch below)
     • In these figures, the default is application data in main memory, MPI metadata in the local store, no congestion, and limited PPE involvement
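A minimal sketch of that mode switch, assuming hypothetical send_buffered/send_synchronous helpers; the 2 KB threshold is from the slide, everything else is illustrative.

```c
#define EAGER_THRESHOLD 2048   /* 2 KB switch point from the slide */

/* Hypothetical helpers standing in for the two protocols. */
int send_buffered(const void *buf, int nbytes, int dest, int tag);
int send_synchronous(const void *buf, int nbytes, int dest, int tag);

int mpi_cell_send(const void *buf, int nbytes, int dest, int tag)
{
    if (nbytes <= EAGER_THRESHOLD)
        /* Buffered mode: copy the payload into an intermediate buffer so
           the sender can return without waiting for the receiver. */
        return send_buffered(buf, nbytes, dest, tag);

    /* Synchronous mode: rendezvous with the receiver, which pulls the
       data straight from the source and avoids the extra copy. */
    return send_synchronous(buf, nbytes, dest, tag);
}
```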

  5. Comparison of MPICELL with MPI on Other Hardware

  6. Collective Communication Example – Broadcast
     • Broadcast on 16 SPEs (2 processors)
     • TREE: pipelined, tree-structured communication based on the local store (LS)
     • TREEMM: tree-structured Send/Recv-type implementation (see the sketch below)
     • AG: each SPE is responsible for a different portion of the data
     • OTA: each SPE copies the data to its own location
     • G: the root copies all the data
     • Broadcast with a good choice of algorithm for each data size and SPE count
     • The maximum main memory bandwidth is also shown
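For reference, a plain binomial-tree broadcast built on Send/Recv, in the spirit of the TREEMM variant listed above, looks roughly like the following; the authors' actual tree shape, pipelining, and local-store staging are not reproduced here, and root 0 is assumed for brevity.

```c
#include <mpi.h>

/* Illustrative binomial-tree broadcast on top of MPI_Send/MPI_Recv.
   At step k, ranks [0, 2^k) send to rank + 2^k, and ranks [2^k, 2^(k+1))
   receive from their parent at rank - 2^k. */
void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank < mask) {
            int dest = rank + mask;                  /* pass data down the tree */
            if (dest < size)
                MPI_Send(buf, count, type, dest, 0, comm);
        } else if (rank < (mask << 1)) {
            MPI_Recv(buf, count, type, rank - mask,  /* receive from parent     */
                     0, comm, MPI_STATUS_IGNORE);
        }
    }
}
```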

  7. Application Performance – Matrix-Vector Multiplication
     • Used a 1-D decomposition (not very efficient)
     • Achieved a peak double-precision throughput of 7.8 Gflop/s for matrices of order 1024
     • The collective used was from an older implementation on the Cell, built on top of Send/Recv using tree-structured communication
     • The Opteron results used LAM MPI
     [Figure: Performance of double-precision matrix-vector multiplication]
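The 1-D decomposition can be sketched as below, with each rank owning a contiguous block of rows; the MPI_Allgather that replicates the result stands in for the older tree-based collective mentioned above, and n is assumed to be divisible by the number of ranks for brevity.

```c
#include <mpi.h>

/* Sketch of a 1-D row decomposition for y = A*x: each rank owns n/p
   contiguous rows of A, computes its slice of y, and an allgather
   replicates the full result vector on every rank.  This illustrates the
   decomposition only, not the authors' Cell-specific code. */
void matvec_1d(const double *A_local, const double *x, double *y,
               int n, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);

    int rows = n / p;                       /* rows owned by this rank  */
    double *y_local = y + rank * rows;      /* this rank's output slice */

    for (int i = 0; i < rows; i++) {        /* local partial product    */
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * x[j];
        y_local[i] = sum;
    }

    /* Replicate the full result vector on all ranks. */
    MPI_Allgather(MPI_IN_PLACE, rows, MPI_DOUBLE,
                  y, rows, MPI_DOUBLE, comm);
}
```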

  8. Conclusions and Future Work
     Conclusions:
     • The Cell processor has good potential for MPI applications
     • The PPE should have a very limited role
     • Very high bandwidths with application data in the local store
     • High bandwidth and low latency even with application data in main memory
     • But the local store should be used effectively, with double buffering to hide latency (see the sketch below); main memory bandwidth is then the bottleneck
     • Good performance for collectives even with two Cell processors
     Current work:
     • Implemented collective communication operations optimized for contiguous data
     • Blocking and non-blocking communication
     Future work:
     • Optimize collectives for derived data types with non-contiguous data
     • Optimize point-to-point communication on a blade with two processors
     • More features, such as topologies, etc.
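A minimal double-buffering sketch for the point above, assuming the data size is a multiple of the 16 KB DMA chunk and using a hypothetical process_chunk() for the computation: while the SPE works on one local-store buffer, the MFC fetches the next chunk into the other, hiding main-memory latency.

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384   /* 16 KB: maximum size of a single DMA transfer */

static char buf[2][CHUNK] __attribute__((aligned(128)));

/* Hypothetical placeholder for the real per-chunk computation. */
void process_chunk(char *data, uint32_t nbytes);

/* Stream 'total' bytes (assumed to be a multiple of CHUNK) from main
   memory at effective address 'ea', overlapping DMA with computation. */
void stream_from_memory(uint64_t ea, uint32_t total)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);        /* prefetch first chunk */

    for (uint32_t off = 0; off < total; off += CHUNK) {
        int nxt = cur ^ 1;
        if (off + CHUNK < total)                    /* start next DMA early  */
            mfc_get(buf[nxt], ea + off + CHUNK, CHUNK, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);               /* wait only for the     */
        mfc_read_tag_status_all();                  /* current buffer's DMA  */
        process_chunk(buf[cur], CHUNK);             /* compute while the     */
        cur = nxt;                                  /* next DMA is in flight */
    }
}
```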
