
High Performance Computing on the Cell Broadband Engine


Presentation Transcript


  1. High Performance Computing on the Cell Broadband Engine Vas Chellappa Electrical & Computer Engineering Carnegie Mellon University Dec 3 2008

  2. Designing “faster” processors • Need for speed • Parallelism: forms • Superscalar • Pipelining • Vector • Multi-core • Multi-node

  3. Designing “faster” processors • Need for speed • Parallelism: forms (limitations) • Superscalar (power density) • Pipelining (latch overhead: frequency scaling, branching) • Vector (programming, only numeric) • Multi-core (memory wall, programming) • Multi-node (interconnects, reliability)

  4. Multi-core Parallelism • The future is definitely multi-core parallelism • But what problems/limitations do multi-cores have? • Increased programming burden • Scaling issues: power, interconnects etc.

  5. The Cell BE Approach • Frequency wall: many simple, in-order cores • Power wall: vectorized, in-order, arithmetic cores • Memory wall: Memory Flow Controller handles programmer-driven DMA in the background [Diagram: Cell BE chip with the PPE and 8 SPEs (each with a Local Store) connected by the EIB to main memory]

  6. Presentation Overview • Cell Broadband Engine: Design • Programming on the Cell • Exercise: implement addition of vectors • Wrap-up

  7. Cell Broadband Engine • Designed for high-density floating-point computation (PlayStation 3, IBM Roadrunner) • Compute: heterogeneous multi-core (1 PPE + 8 SPEs); 204 Gflop/s (SPEs only); high-speed on-chip interconnect (EIB) • Memory system: explicit scratchpad-type “local store”; DMA-based programming • Challenges: parallelization, vectorization, explicit memory • New design: new programming paradigm [Diagram: 8 SPEs with Local Stores around the EIB, connected to main memory]

  8. Cell BE Processor: A Closer Look • Power Processing Element (PPE) • Synergistic Processing Element (SPE) x8 • Local Stores (LS) [Diagram: Cell BE chip with the PPE and 8 SPEs (each with a Local Store) around the EIB, connected to main memory]

  9. Power Processing Element (PPE) • Purpose: operating system, program control • Uses the POWER Instruction Set Architecture • 2-way multithreaded • Cache: 32KB L1-I, 32KB L1-D, 512KB L2 • AltiVec SIMD • System functions: virtualization, address translation/protection, exception handling

  10. Synergistic Processing Element (SPE) • SPU = compute core + LS; SPE = SPU + MFC • Synergistic Processing Unit (SPU) • Local Store (LS) • Memory Flow Controller (MFC)

  11. Synergistic Processing Unit (SPU) • The number cruncher: vectorized (4-way single precision / 2-way double precision) • Peak performance (each SPE): 25.6 Gflop/s single precision = 3.2 GHz x 4-way (vector) x 2 (FMA); <2 Gflop/s double precision (not pipelined); eDP version (PowerXCell 8i) runs double precision at full speed (12.8 Gflop/s) • 128 vector registers, each 128 bits wide (cf. 16 XMM registers on x86-64 SSE) • Even and odd issue pipelines • In-order, shallow pipelines • No branch prediction (branch hinting instead) • Completely deterministic timing

  12. Local Stores (LS) and Memory Flow Controller (MFC) • Local Stores: each SPU contains a 256KB LS (instead of a cache); explicit read/write (the programmer issues DMAs); extremely fast (6-cycle load latency to the SPU) • Memory Flow Controller: co-processor that handles DMAs in the background; 8/16 command-queue entries; handles DMA lists (scatter/gather); barriers, fences, tag groups etc.; mailboxes and signals

  13. Element Interconnect Bus (EIB) • 4 data rings (16B wide each): 2 clockwise, 2 counter-clockwise • Supports multiple concurrent data transfers • Data ports: 25.6 GB/s per direction • 204.8 GB/s sustained peak bandwidth

  14. Direct Memory Access (DMA) • Programmer driven • Packet sizes 1B – 16KB • Several alignment constraints (bus errors if violated!); see the sketch below • Packet size vs. performance • DMA lists • Get, put: SPE-centric view • Mailboxes/signals are also DMAs
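
  A minimal sketch of DMA-safe data layout, with illustrative names and sizes: DMA addresses must be at least 16-byte aligned (128-byte, i.e. cache-line, alignment performs best), and transfer sizes must be 1, 2, 4 or 8 bytes, or a multiple of 16 bytes up to 16KB.

      #include <stdlib.h>

      #define CHUNK 4096   /* bytes per DMA transfer; a multiple of 16 (illustrative) */

      /* SPU side: a statically allocated, cache-line-aligned local-store buffer */
      static float ls_buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

      /* PPU side: align heap data that the SPEs will later DMA from/to */
      float *make_aligned_array(size_t n)
      {
          void *p = NULL;
          if (posix_memalign(&p, 128, n * sizeof(float)) != 0)
              return NULL;
          return (float *)p;
      }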

  15. Systems Using the Cell • Sony PlayStation 3: 6 SPEs available to applications (7th reserved for the hypervisor, 8th disabled for yield); can run Linux (Fedora / Yellow Dog Linux); various PS3-cluster projects • IBM BladeCenter QS20/QS22: two Cell processors, InfiniBand/Ethernet

  16. IBM Roadrunner • Supercomputer at Los Alamos National Lab (NM) • Main purpose: model the decay of the US nuclear arsenal • Performance: world's fastest [TOP500.org]; peak 1.7 petaflop/s; first to top 1.0 petaflop/s on Linpack • Design: hybrid; dual-core 64-bit AMD Opterons at 1.8GHz (6,480 Opterons) with a Cell at 3.2GHz attached to each Opteron core (12,960 Cells) • Design hierarchy: QS22 blade = 2 PowerXCell 8i; TriBlade = LS21 Opteron blade + 2x QS22 Cell blades (PCIe x8); Connected Unit = 180 TriBlades (InfiniBand); cluster = 18 CUs (InfiniBand)

  17. Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up

  18. Programming on the Cell: Philosophy • Major differences from traditional processors: not designed for scalar performance; explicit memory access; heterogeneous multi-core • Using the SPEs: SPMD model (Single Program, Multiple Data) or streaming model

  19. Programming Tips • What kind of code is good or bad for the SPEs? • Avoid branches (no branch prediction); use branch hinting (see the sketch below) • Avoid scalar code (no direct scalar support: scalars occupy slots of the vector registers) • Use intrinsics for vectorization and DMA • Context switches are expensive: program + data reside in the LS and have to be swapped in/out • DMA code: alignment, alignment, alignment! • Libraries are available to emulate a software-managed cache
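
  A minimal branch-hinting sketch (function and variable names are illustrative): spu-gcc turns __builtin_expect likelihood information into SPU branch hints, so marking the rare path as unlikely keeps the common case on the fall-through, hint-friendly route.

      /* Mark the error path as unlikely; the common case stays straight-line. */
      int process(int status)
      {
          if (__builtin_expect(status != 0, 0)) {   /* rare path */
              return -1;
          }
          return 0;                                 /* common, fall-through path */
      }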

  20. DMA Programming • Main idea: hide memory accesses with multibuffering (a double-buffering sketch follows below) • Compute on one buffer in the LS • Write back / read in other batches of data in the background • Like a fully software-controlled cache • Inter-chip communication: mailboxes, signals, DMA
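
  A minimal double-buffering sketch of this idea, assuming the mfc_get/tag-group helpers from spu_mfcio.h, an illustrative chunk size, and a user-supplied compute() kernel: while the SPU works on one local-store buffer, the MFC fills the other, and tag groups let the code wait only for the buffer it is about to use.

      #include <stdint.h>
      #include <spu_mfcio.h>

      #define CHUNK 4096                                  /* bytes per buffer (illustrative) */
      static char buf[2][CHUNK] __attribute__((aligned(128)));

      extern void compute(char *data, int size);          /* user-supplied kernel (assumed) */

      /* Stream 'nchunks' consecutive chunks from main-memory address 'ea',
         overlapping each DMA get with computation on the previous chunk. */
      void stream(uint64_t ea, int nchunks)
      {
          int cur = 0;
          mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);        /* prefetch the first chunk */

          for (int i = 0; i < nchunks; i++) {
              int next = cur ^ 1;
              if (i + 1 < nchunks)                        /* start the next transfer early */
                  mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

              mfc_write_tag_mask(1 << cur);               /* wait only for the current buffer */
              mfc_read_tag_status_all();

              compute(buf[cur], CHUNK);
              cur = next;
          }
      }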

  21. Tools for Cell Programming • IBM's Cell SDK 3.0: spu-gcc, ppu-gcc, xlc compilers; full-system simulator; libspe (SPE runtime management library) • Other tools: assembly visualizer (useful because SPEs are in-order) • Single-source compiler: no OpenMP right now • Other tools from RapidMind, Mercury etc.

  22. Program Design • Use knowledge of the architecture to build a model • Back-of-the-envelope calculations (see the worked example below): cost of processing? cost of communication? trends? limits? • How close is the model? • What programming improvements can be made to fit the architecture better?
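
  An illustrative back-of-the-envelope for the upcoming exercise kernel X[] += Y[] * Z[] on N single-precision elements per SPE: the kernel does 2N flops (one multiply and one add per element) but moves 16N bytes over DMA (reads of X, Y and Z plus a write of X, 4 bytes each). At 25.6 Gflop/s the compute time is roughly 2N / 25.6e9 s, while at 25.6 GB/s per port the transfer time is roughly 16N / 25.6e9 s, about 8x the compute time. The kernel is therefore memory-bound, and overlapping DMA with computation (Part 3 of the exercise) matters far more than squeezing out extra flops.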

  23. Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up

  24. Creating the PPE Program and SPE Threads • Each program consists of PPE and SPE sections • The program starts on the PPE • The PPE creates SPE threads (pthreads-based, though not a full pthreads implementation) • A PPE data structure keeps track of the SPE threads • A shared PPE/SPE data structure passes arguments: X, Y, Z addresses; thread id; returned cycle count • A sketch of this startup path follows below
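
  A hedged sketch of what this startup path might look like with libspe2 and pthreads; the spe_prog handle name, the argument-block layout, and NUM_SPES are illustrative assumptions, not the actual exercise code.

      #include <libspe2.h>
      #include <pthread.h>

      #define NUM_SPES 4

      /* Shared PPE/SPE argument block (layout is illustrative): sized and
         aligned so the SPE can fetch it with a single 128-byte DMA. */
      typedef struct {
          unsigned long long x_addr, y_addr, z_addr;   /* effective addresses of the arrays */
          unsigned int       id;                       /* SPE thread id */
          unsigned int       cycles;                   /* decrementer count returned by the SPE */
          unsigned char      pad[96];                  /* pad to 128 bytes */
      } __attribute__((aligned(128))) spe_args_t;

      extern spe_program_handle_t spe_prog;            /* embedded SPE image (name is an assumption) */

      typedef struct {
          spe_context_ptr_t ctx;
          spe_args_t       *args;
      } thread_arg_t;

      static void *run_spe(void *p)
      {
          thread_arg_t *t = (thread_arg_t *)p;
          unsigned int entry = SPE_DEFAULT_ENTRY;
          /* Blocks until the SPE program exits; argp reaches the SPU main(). */
          spe_context_run(t->ctx, &entry, 0, t->args, NULL, NULL);
          return NULL;
      }

      int start_spes(spe_args_t args[NUM_SPES])        /* caller has filled in addresses and ids */
      {
          pthread_t    tid[NUM_SPES];
          thread_arg_t targ[NUM_SPES];

          for (int i = 0; i < NUM_SPES; i++) {
              targ[i].ctx  = spe_context_create(0, NULL);
              targ[i].args = &args[i];
              spe_program_load(targ[i].ctx, &spe_prog);
              pthread_create(&tid[i], NULL, run_spe, &targ[i]);
          }
          for (int i = 0; i < NUM_SPES; i++) {
              pthread_join(tid[i], NULL);
              spe_context_destroy(targ[i].ctx);
          }
          return 0;
      }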

  25. DMA Access
  /* Issue a DMA get and block until it completes (tag mask here covers all tags).
     For a get, the local-store address is the destination and the 64-bit effective
     address (split into high/low words) is the source. */
  spu_writech(MFC_WrTagMask, -1);
  spu_mfcdma64(local_store_addr, effective_addr_high, effective_addr_low, size_in_bytes, tag_id, MFC_GET_CMD);
  spu_mfcstat(MFC_TAG_UPDATE_ALL);
  Use my DMA_BL_GET, DMA_BL_PUT macros.
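
  The DMA_BL_GET / DMA_BL_PUT macros themselves are not shown in the slides; purely as a guess at their shape, a blocking-get wrapper could bundle the three steps above like this:

      /* Hypothetical sketch only: issue a get for 'size' bytes into local-store
         address 'ls' from effective address (ea_hi, ea_lo), then block on the tag. */
      #define DMA_BL_GET(ls, ea_hi, ea_lo, size, tag)                            \
          do {                                                                   \
              spu_mfcdma64((ls), (ea_hi), (ea_lo), (size), (tag), MFC_GET_CMD);  \
              spu_writech(MFC_WrTagMask, 1 << (tag));                            \
              spu_mfcstat(MFC_TAG_UPDATE_ALL);                                   \
          } while (0)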

  26. Compiling • Compile the PPE and SPE programs separately • Details: specify the SPE program name and call it from the PPE • 32/64 bit (watch out for pointer sizes etc.) • The Cell SDK has sample Makefiles • We will use a simple Makefile (the underlying build flow is sketched below)
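
  A hedged sketch of the build flow such a Makefile typically drives (file and symbol names are illustrative, and exact tool flags vary with the SDK install and 32/64-bit target): compile the SPE program with spu-gcc, embed the resulting SPU ELF into a PPE-side object, then compile and link the PPE program against libspe2 and pthreads.

      spu-gcc -O3 -o add_spu add_spu.c                      # SPE program (standalone SPU ELF)
      ppu-embedspu spe_prog add_spu add_spu_embed.o         # embed it, exporting the 'spe_prog' handle
      ppu-gcc -O3 -o add add_ppu.c add_spu_embed.o -lspe2 -lpthread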

  27. Performance Evaluation: Timing • Performance measures: runtime, Gflop/s • Timing: each SPE has its own decrementer • The decrementer counts down at an independent, lower frequency (the timebase, about 79.8 MHz on the PS3; see cat /proc/cpuinfo) • Reset the counter to its highest value before measuring • Measure on each SPE? Average? Min? Max? Which one fits the real-world scenario best?
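
  A minimal timing sketch using the decrementer channels from spu_mfcio.h (the kernel being measured is a placeholder): reset the counter to its highest value, read it before and after, and remember that ticks are at the timebase frequency, not the 3.2 GHz core clock.

      #include <spu_mfcio.h>

      unsigned int time_kernel(void)
      {
          unsigned int start, end;

          spu_writech(SPU_WrDec, 0xFFFFFFFF);   /* reset counter to its highest value */
          start = spu_readch(SPU_RdDec);

          /* ... kernel under test ... */

          end = spu_readch(SPU_RdDec);
          return start - end;                   /* elapsed decrementer ticks */
      }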

  28. Exercise 1: Add/Mul Two Arrays Goal: X[] += Y[] * Z[] Part 1: Infrastructure, understand skeleton code Part 2: Parallelization and vectorization (easy) Part 3: Hiding memory access costs

  29. Part 1 • Goal: understand the skeleton code; get the infrastructure up and running (compiler, basic code); evaluate scalar, sequential code performance • PPU's tasks: initialize the vectors in main memory; start up a thread for each SPU and let them run; verify/print results and performance; use only a single SPU • SPU's tasks: get (DMA) all 3 arrays from main memory; perform the computation; put (DMA) the result back to main memory; write the time back to the PPU • Your tasks: compile, transform the code, add timer code • A scalar SPU-side sketch follows below
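
  A hedged sketch of the Part 1 SPU side, assuming a fixed per-SPU chunk of N elements and the illustrative argument-block layout used earlier; the real skeleton code will differ in detail (and the timing writeback is omitted here).

      #include <spu_mfcio.h>

      #define N 1024                                    /* elements per SPU chunk (illustrative) */

      typedef struct {                                  /* must match the PPE-side layout */
          unsigned long long x_addr, y_addr, z_addr;
          unsigned int id, cycles;
          unsigned char pad[96];
      } __attribute__((aligned(128))) spe_args_t;

      static spe_args_t args __attribute__((aligned(128)));
      static float x[N] __attribute__((aligned(128)));
      static float y[N] __attribute__((aligned(128)));
      static float z[N] __attribute__((aligned(128)));

      int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
      {
          unsigned int tag = 0;
          (void)speid; (void)envp;

          /* Get the argument block, then all three arrays, from main memory. */
          mfc_get(&args, argp, sizeof(args), tag, 0, 0);
          mfc_write_tag_mask(1 << tag);
          mfc_read_tag_status_all();

          mfc_get(x, args.x_addr, sizeof(x), tag, 0, 0);
          mfc_get(y, args.y_addr, sizeof(y), tag, 0, 0);
          mfc_get(z, args.z_addr, sizeof(z), tag, 0, 0);
          mfc_write_tag_mask(1 << tag);
          mfc_read_tag_status_all();

          for (int i = 0; i < N; i++)                   /* Part 1: plain scalar kernel */
              x[i] += y[i] * z[i];

          mfc_put(x, args.x_addr, sizeof(x), tag, 0, 0);/* write the result back */
          mfc_write_tag_mask(1 << tag);
          mfc_read_tag_status_all();
          return 0;
      }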

  30. Part 2 • Goal: parallelize across 4 SPEs (easy with the skeleton code); vectorize X[] += Y[] * Z[] (easy) • Evaluate: parallel code performance; vectorized parallel code performance • PPU: start up 4 SPU threads; performance evaluation: how? • SPU: DMA-get, compute, DMA-put only its own chunk; 4-way single-precision vectorization (vector float): d = spu_madd(a,b,c); • Your tasks: parallelize, vectorize, measure performance • A vectorized kernel sketch follows below
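
  A minimal vectorization sketch for the inner kernel, assuming the per-SPU chunk length is a multiple of 4 floats and the arrays are 16-byte aligned:

      #include <spu_intrinsics.h>

      /* x[] += y[] * z[], four floats at a time with the fused multiply-add. */
      void madd_vec(vector float *x, const vector float *y,
                    const vector float *z, int n_vectors)
      {
          for (int i = 0; i < n_vectors; i++)
              x[i] = spu_madd(y[i], z[i], x[i]);   /* x = y*z + x */
      }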

  31. Part 3 Goal: hide memory accesses How?

  32. Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up

  33. Exercise Debriefing • How effectively did we use the architecture? • Parallelization and vectorization are mandatory! • Memory overlapping: big difference • Do our optimizations work across a large range of problem sizes? Smaller sizes mean smaller packet sizes • Real-world problems (Fourier transform, WHT) are rarely embarrassingly parallel • Additional complexities?

  34. WHT on the Cell • Vectorization: as before • Parallelization: must be locality-aware! • Explicit memory access (code provided): multibuffering? how? • Inter-SPE data exchange: algorithms that generate large packet sizes? overlap? fast barrier

  35.–38. WHT: Data Exchange [four diagram slides illustrating the inter-SPE data-exchange pattern; figures not reproduced in this transcript]

  39. DMA Issues • External multibuffering (streaming) • Strategies by problem size: small/medium sizes keep the data exchange on-chip (streaming); large sizes are trickier and must be broken down into parts • Use all memory banks

  40. Cell Philosophy • Do the Cell's philosophies extend to other systems? Yes: the fundamental problems are the same • Distributed-memory computing (clusters, supercomputers): processing is faster than the interconnect; larger packets get higher interconnect bandwidth • Multicore processors: trend toward NUMA, even on-chip; locality-aware parallelism

  41. Wrap-Up • Programming the Cell BE for high-performance computing • The Cell is a chip multiprocessor designed for HPC, with applications from video gaming to supercomputers • The programming burden is a major factor in performance: parallelization, vectorization, memory handling; automated tools yield limited performance • Programmers must understand the micro-architecture and its tradeoffs to get performance (especially on the Cell)
