
High Performance Computing on the Cell Broadband Engine


Presentation Transcript


  1. High Performance Computing on the Cell Broadband Engine Vas Chellappa Electrical & Computer Engineering Carnegie Mellon University Dec 3 2008

  2. Designing “faster” processors • Need for speed • Parallelism: forms • Superscalar • Pipelining • Vector • Multi-core • Multi-node

  3. Designing “faster” processors • Need for speed • Parallelism: forms (limitations) • Superscalar (power density) • Pipelining (latch overhead: frequency scaling, branching) • Vector (programming, only numeric) • Multi-core (memory wall, programming) • Multi-node (interconnects, reliability)

  4. Multi-core Parallelism • The future is definitely multi-core parallelism • But what problems/limitations do multi-cores have? • Increased programming burden • Scaling issues: power, interconnects etc.

  5. The Cell BE Approach • Frequency wall: many simple, in-order cores • Power wall: vectorized, in-order, arithmetic cores • Memory wall: Memory Flow Controller handles programmer-driven DMA in the background [Diagram: Cell BE chip with the PPE and 8 SPEs (each with a Local Store) connected by the EIB to main memory]

  6. Presentation Overview • Cell Broadband Engine: Design • Programming on the Cell • Exercise: implement addition of vectors • Wrap-up

  7. Cell Broadband Engine • Designed for high-density floating-point computation (PlayStation 3, IBM Roadrunner) • Compute: heterogeneous multi-core (1 PPE + 8 SPEs); 204 Gflop/s (SPEs only); high-speed on-chip interconnect (EIB) • Memory system: explicit scratchpad-type “local store”; DMA-based programming • Challenges: parallelization, vectorization, explicit memory • New design: new programming paradigm [Diagram: 8 SPEs with Local Stores around the EIB, connected to main memory]

  8. Cell BE Processor: A Closer Look • Power Processing Element (PPE) • Synergistic Processing Element (SPE) x8 • Local Stores (LS) [Diagram: Cell BE chip with the PPE and 8 SPEs (each with a Local Store) around the EIB, connected to main memory]

  9. Power Processing Element (PPE) • Purpose: operating system, program control • Uses the POWER Instruction Set Architecture • 2-way multithreaded • Cache: 32KB L1-I, 32KB L1-D, 512KB L2 • AltiVec SIMD • System functions: virtualization, address translation/protection, exception handling

  10. Synergistic Processing Element (SPE) • SPU = compute core + LS; SPE = SPU + MFC • Synergistic Processing Unit (SPU) • Local Store (LS) • Memory Flow Controller (MFC)

  11. Synergistic Processing Unit (SPU) • The number cruncher: vectorized (4-way single precision / 2-way double precision) • Peak performance (each SPE): 25.6 Gflop/s single precision = 3.2 GHz x 4-way (vector) x 2 (FMA); <2 Gflop/s double precision (not pipelined); eDP version (PowerXCell 8i) runs double precision at full speed (12.8 Gflop/s) • 128 vector registers, each 128 bits wide (cf. 16 XMM registers on x86-64 SSE) • Even and odd issue pipelines • In-order, shallow pipelines • No branch prediction (branch hinting instead) • Completely deterministic timing

  12. Local Stores (LS) and Memory Flow Controller (MFC) • Local Stores: each SPU contains a 256KB LS (instead of a cache); explicit read/write (the programmer issues DMAs); extremely fast (6-cycle load latency to the SPU) • Memory Flow Controller: co-processor that handles DMAs in the background; 8/16 command-queue entries; handles DMA lists (scatter/gather); barriers, fences, tag groups etc.; mailboxes and signals

  13. Element Interconnect Bus (EIB) • 4 data rings (16B wide each): 2 clockwise, 2 counter-clockwise • Supports multiple concurrent data transfers • Data ports: 25.6 GB/s per direction • 204.8 GB/s sustained peak bandwidth

  14. Direct Memory Access (DMA) • Programmer driven • Packet sizes 1B – 16KB • Several alignment constraints (bus errors if violated!); see the sketch below • Packet size vs. performance • DMA lists • Get, put: SPE-centric view • Mailboxes/signals are also DMAs
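
  A minimal sketch of DMA-safe data layout, with illustrative names and sizes: DMA addresses must be at least 16-byte aligned (128-byte, i.e. cache-line, alignment performs best), and transfer sizes must be 1, 2, 4 or 8 bytes, or a multiple of 16 bytes up to 16KB.

      #include <stdlib.h>

      #define CHUNK 4096   /* bytes per DMA transfer; a multiple of 16 (illustrative) */

      /* SPU side: a statically allocated, cache-line-aligned local-store buffer */
      static float ls_buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

      /* PPU side: align heap data that the SPEs will later DMA from/to */
      float *make_aligned_array(size_t n)
      {
          void *p = NULL;
          if (posix_memalign(&p, 128, n * sizeof(float)) != 0)
              return NULL;
          return (float *)p;
      }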

  15. Systems Using the Cell • Sony PlayStation 3: 6 SPEs available to applications (7th reserved for the hypervisor, 8th disabled for yield); can run Linux (Fedora / Yellow Dog Linux); various PS3-cluster projects • IBM BladeCenter QS20/QS22: two Cell processors, InfiniBand/Ethernet

  16. IBM Roadrunner • Supercomputer at Los Alamos National Lab (NM) • Main purpose: model the decay of the US nuclear arsenal • Performance: world's fastest [TOP500.org]; peak 1.7 petaflop/s; first to top 1.0 petaflop/s on Linpack • Design: hybrid; dual-core 64-bit AMD Opterons at 1.8GHz (6,480 Opterons) with a Cell at 3.2GHz attached to each Opteron core (12,960 Cells) • Design hierarchy: QS22 blade = 2 PowerXCell 8i; TriBlade = LS21 Opteron blade + 2x QS22 Cell blades (PCIe x8); Connected Unit = 180 TriBlades (InfiniBand); cluster = 18 CUs (InfiniBand)

  17. Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up

  18. Programming on the Cell: Philosophy • Major differences from traditional processors: not designed for scalar performance; explicit memory access; heterogeneous multi-core • Using the SPEs: SPMD model (Single Program, Multiple Data) or streaming model

  19. Programming Tips • What kind of code is good or bad for the SPEs? • Avoid branches (no branch prediction); use branch hinting (see the sketch below) • Avoid scalar code (no direct scalar support: scalars occupy slots of the vector registers) • Use intrinsics for vectorization and DMA • Context switches are expensive: program + data reside in the LS and have to be swapped in/out • DMA code: alignment, alignment, alignment! • Libraries are available to emulate a software-managed cache
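
  A minimal branch-hinting sketch (function and variable names are illustrative): spu-gcc turns __builtin_expect likelihood information into SPU branch hints, so marking the rare path as unlikely keeps the common case on the fall-through, hint-friendly route.

      /* Mark the error path as unlikely; the common case stays straight-line. */
      int process(int status)
      {
          if (__builtin_expect(status != 0, 0)) {   /* rare path */
              return -1;
          }
          return 0;                                 /* common, fall-through path */
      }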

  20. DMA Programming • Main idea: hide memory accesses with multibuffering (a double-buffering sketch follows below) • Compute on one buffer in the LS • Write back / read in other batches of data in the background • Like a fully software-controlled cache • Inter-chip communication: mailboxes, signals, DMA
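
  A minimal double-buffering sketch of this idea, assuming the mfc_get/tag-group helpers from spu_mfcio.h, an illustrative chunk size, and a user-supplied compute() kernel: while the SPU works on one local-store buffer, the MFC fills the other, and tag groups let the code wait only for the buffer it is about to use.

      #include <stdint.h>
      #include <spu_mfcio.h>

      #define CHUNK 4096                                  /* bytes per buffer (illustrative) */
      static char buf[2][CHUNK] __attribute__((aligned(128)));

      extern void compute(char *data, int size);          /* user-supplied kernel (assumed) */

      /* Stream 'nchunks' consecutive chunks from main-memory address 'ea',
         overlapping each DMA get with computation on the previous chunk. */
      void stream(uint64_t ea, int nchunks)
      {
          int cur = 0;
          mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);        /* prefetch the first chunk */

          for (int i = 0; i < nchunks; i++) {
              int next = cur ^ 1;
              if (i + 1 < nchunks)                        /* start the next transfer early */
                  mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

              mfc_write_tag_mask(1 << cur);               /* wait only for the current buffer */
              mfc_read_tag_status_all();

              compute(buf[cur], CHUNK);
              cur = next;
          }
      }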

  21. Tools for Cell Programming • IBM's Cell SDK 3.0: spu-gcc, ppu-gcc, xlc compilers; full-system simulator; libspe (SPE runtime management library) • Other tools: assembly visualizer (useful because SPEs are in-order) • Single-source compiler: no OpenMP right now • Other tools from RapidMind, Mercury etc.

  22. Program Design • Use knowledge of the architecture to build a model • Back-of-the-envelope calculations (see the worked example below): cost of processing? cost of communication? trends? limits? • How close is the model? • What programming improvements can be made to fit the architecture better?
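
  An illustrative back-of-the-envelope for the upcoming exercise kernel X[] += Y[] * Z[] on N single-precision elements per SPE: the kernel does 2N flops (one multiply and one add per element) but moves 16N bytes over DMA (reads of X, Y and Z plus a write of X, 4 bytes each). At 25.6 Gflop/s the compute time is roughly 2N / 25.6e9 s, while at 25.6 GB/s per port the transfer time is roughly 16N / 25.6e9 s, about 8x the compute time. The kernel is therefore memory-bound, and overlapping DMA with computation (Part 3 of the exercise) matters far more than squeezing out extra flops.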

  23. Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up

  24. Creating the PPE Program and SPE Threads • Each program consists of PPE and SPE sections • The program starts on the PPE • The PPE creates SPE threads (pthreads-based, though not a full pthreads implementation) • A PPE data structure keeps track of the SPE threads • A shared PPE/SPE data structure passes arguments: X, Y, Z addresses; thread id; returned cycle count • A sketch of this startup path follows below
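
  A hedged sketch of what this startup path might look like with libspe2 and pthreads; the spe_prog handle name, the argument-block layout, and NUM_SPES are illustrative assumptions, not the actual exercise code.

      #include <libspe2.h>
      #include <pthread.h>

      #define NUM_SPES 4

      /* Shared PPE/SPE argument block (layout is illustrative): sized and
         aligned so the SPE can fetch it with a single 128-byte DMA. */
      typedef struct {
          unsigned long long x_addr, y_addr, z_addr;   /* effective addresses of the arrays */
          unsigned int       id;                       /* SPE thread id */
          unsigned int       cycles;                   /* decrementer count returned by the SPE */
          unsigned char      pad[96];                  /* pad to 128 bytes */
      } __attribute__((aligned(128))) spe_args_t;

      extern spe_program_handle_t spe_prog;            /* embedded SPE image (name is an assumption) */

      typedef struct {
          spe_context_ptr_t ctx;
          spe_args_t       *args;
      } thread_arg_t;

      static void *run_spe(void *p)
      {
          thread_arg_t *t = (thread_arg_t *)p;
          unsigned int entry = SPE_DEFAULT_ENTRY;
          /* Blocks until the SPE program exits; argp reaches the SPU main(). */
          spe_context_run(t->ctx, &entry, 0, t->args, NULL, NULL);
          return NULL;
      }

      int start_spes(spe_args_t args[NUM_SPES])        /* caller has filled in addresses and ids */
      {
          pthread_t    tid[NUM_SPES];
          thread_arg_t targ[NUM_SPES];

          for (int i = 0; i < NUM_SPES; i++) {
              targ[i].ctx  = spe_context_create(0, NULL);
              targ[i].args = &args[i];
              spe_program_load(targ[i].ctx, &spe_prog);
              pthread_create(&tid[i], NULL, run_spe, &targ[i]);
          }
          for (int i = 0; i < NUM_SPES; i++) {
              pthread_join(tid[i], NULL);
              spe_context_destroy(targ[i].ctx);
          }
          return 0;
      }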

  25. DMA Access
  /* Issue a DMA get and block until it completes (tag mask here covers all tags).
     For a get, the local-store address is the destination and the 64-bit effective
     address (split into high/low words) is the source. */
  spu_writech(MFC_WrTagMask, -1);
  spu_mfcdma64(local_store_addr, effective_addr_high, effective_addr_low, size_in_bytes, tag_id, MFC_GET_CMD);
  spu_mfcstat(MFC_TAG_UPDATE_ALL);
  Use my DMA_BL_GET, DMA_BL_PUT macros.
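
  The DMA_BL_GET / DMA_BL_PUT macros themselves are not shown in the slides; purely as a guess at their shape, a blocking-get wrapper could bundle the three steps above like this:

      /* Hypothetical sketch only: issue a get for 'size' bytes into local-store
         address 'ls' from effective address (ea_hi, ea_lo), then block on the tag. */
      #define DMA_BL_GET(ls, ea_hi, ea_lo, size, tag)                            \
          do {                                                                   \
              spu_mfcdma64((ls), (ea_hi), (ea_lo), (size), (tag), MFC_GET_CMD);  \
              spu_writech(MFC_WrTagMask, 1 << (tag));                            \
              spu_mfcstat(MFC_TAG_UPDATE_ALL);                                   \
          } while (0)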

  26. Compiling • Compile the PPE and SPE programs separately • Details: specify the SPE program name and call it from the PPE • 32/64 bit (watch out for pointer sizes etc.) • The Cell SDK has sample Makefiles • We will use a simple Makefile (the underlying build flow is sketched below)
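
  A hedged sketch of the build flow such a Makefile typically drives (file and symbol names are illustrative, and exact tool flags vary with the SDK install and 32/64-bit target): compile the SPE program with spu-gcc, embed the resulting SPU ELF into a PPE-side object, then compile and link the PPE program against libspe2 and pthreads.

      spu-gcc -O3 -o add_spu add_spu.c                      # SPE program (standalone SPU ELF)
      ppu-embedspu spe_prog add_spu add_spu_embed.o         # embed it, exporting the 'spe_prog' handle
      ppu-gcc -O3 -o add add_ppu.c add_spu_embed.o -lspe2 -lpthread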

  27. Performance Evaluation: Timing • Performance measures: runtime, Gflop/s • Timing: each SPE has its own decrementer • The decrementer counts down at an independent, lower frequency (the timebase, about 79.8 MHz on the PS3; see cat /proc/cpuinfo) • Reset the counter to its highest value before measuring • Measure on each SPE? Average? Min? Max? Which one fits the real-world scenario best?
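
  A minimal timing sketch using the decrementer channels from spu_mfcio.h (the kernel being measured is a placeholder): reset the counter to its highest value, read it before and after, and remember that ticks are at the timebase frequency, not the 3.2 GHz core clock.

      #include <spu_mfcio.h>

      unsigned int time_kernel(void)
      {
          unsigned int start, end;

          spu_writech(SPU_WrDec, 0xFFFFFFFF);   /* reset counter to its highest value */
          start = spu_readch(SPU_RdDec);

          /* ... kernel under test ... */

          end = spu_readch(SPU_RdDec);
          return start - end;                   /* elapsed decrementer ticks */
      }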

  28. Exercise 1: Add/Mul Two Arrays Goal: X[] += Y[] * Z[] Part 1: Infrastructure, understand skeleton code Part 2: Parallelization and vectorization (easy) Part 3: Hiding memory access costs

  29. Part 1 • Goal: understand the skeleton code; get the infrastructure up and running (compiler, basic code); evaluate scalar, sequential code performance • PPU's tasks: initialize the vectors in main memory; start up a thread for each SPU and let them run; verify/print results and performance; use only a single SPU • SPU's tasks: get (DMA) all 3 arrays from main memory; perform the computation; put (DMA) the result back to main memory; write the time back to the PPU • Your tasks: compile, transform the code, add timer code • A scalar SPU-side sketch follows below
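
  A hedged sketch of the Part 1 SPU side, assuming a fixed per-SPU chunk of N elements and the illustrative argument-block layout used earlier; the real skeleton code will differ in detail (and the timing writeback is omitted here).

      #include <spu_mfcio.h>

      #define N 1024                                    /* elements per SPU chunk (illustrative) */

      typedef struct {                                  /* must match the PPE-side layout */
          unsigned long long x_addr, y_addr, z_addr;
          unsigned int id, cycles;
          unsigned char pad[96];
      } __attribute__((aligned(128))) spe_args_t;

      static spe_args_t args __attribute__((aligned(128)));
      static float x[N] __attribute__((aligned(128)));
      static float y[N] __attribute__((aligned(128)));
      static float z[N] __attribute__((aligned(128)));

      int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
      {
          unsigned int tag = 0;
          (void)speid; (void)envp;

          /* Get the argument block, then all three arrays, from main memory. */
          mfc_get(&args, argp, sizeof(args), tag, 0, 0);
          mfc_write_tag_mask(1 << tag);
          mfc_read_tag_status_all();

          mfc_get(x, args.x_addr, sizeof(x), tag, 0, 0);
          mfc_get(y, args.y_addr, sizeof(y), tag, 0, 0);
          mfc_get(z, args.z_addr, sizeof(z), tag, 0, 0);
          mfc_write_tag_mask(1 << tag);
          mfc_read_tag_status_all();

          for (int i = 0; i < N; i++)                   /* Part 1: plain scalar kernel */
              x[i] += y[i] * z[i];

          mfc_put(x, args.x_addr, sizeof(x), tag, 0, 0);/* write the result back */
          mfc_write_tag_mask(1 << tag);
          mfc_read_tag_status_all();
          return 0;
      }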

  30. Part 2 • Goal: parallelize across 4 SPEs (easy with the skeleton code); vectorize X[] += Y[] * Z[] (easy) • Evaluate: parallel code performance; vectorized parallel code performance • PPU: start up 4 SPU threads; performance evaluation: how? • SPU: DMA-get, compute, DMA-put only its own chunk; 4-way single-precision vectorization (vector float): d = spu_madd(a,b,c); • Your tasks: parallelize, vectorize, measure performance • A vectorized kernel sketch follows below
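
  A minimal vectorization sketch for the inner kernel, assuming the per-SPU chunk length is a multiple of 4 floats and the arrays are 16-byte aligned:

      #include <spu_intrinsics.h>

      /* x[] += y[] * z[], four floats at a time with the fused multiply-add. */
      void madd_vec(vector float *x, const vector float *y,
                    const vector float *z, int n_vectors)
      {
          for (int i = 0; i < n_vectors; i++)
              x[i] = spu_madd(y[i], z[i], x[i]);   /* x = y*z + x */
      }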

  31. Part 3 Goal: hide memory accesses How?

  32. Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up

  33. Exercise Debriefing • How effectively did we use the architecture? • Parallelization and vectorization are mandatory! • Memory overlapping: big difference • Do our optimizations work across a large range of problem sizes? Smaller sizes mean smaller packet sizes • Real-world problems (Fourier transform, WHT) are rarely embarrassingly parallel • Additional complexities?

  34. WHT on the Cell • Vectorization: as before • Parallelization: must be locality-aware! • Explicit memory access (code provided): multibuffering? how? • Inter-SPE data exchange: algorithms that generate large packet sizes? overlap? fast barrier

  35.–38. WHT: Data Exchange [four diagram slides illustrating the inter-SPE data-exchange pattern; figures not reproduced in this transcript]

  39. DMA Issues • External multibuffering (streaming) • Strategies by problem size: small/medium sizes keep the data exchange on-chip (streaming); large sizes are trickier and must be broken down into parts • Use all memory banks

  40. Cell Philosophy • Do the Cell's philosophies extend to other systems? Yes: the fundamental problems are the same • Distributed-memory computing (clusters, supercomputers): processing is faster than the interconnect; larger packets get higher interconnect bandwidth • Multicore processors: trend toward NUMA, even on-chip; locality-aware parallelism

  41. Wrap-Up • Programming the Cell BE for high-performance computing • The Cell is a chip multiprocessor designed for HPC, with applications from video gaming to supercomputers • The programming burden is a major factor in performance: parallelization, vectorization, memory handling; automated tools yield limited performance • Programmers must understand the micro-architecture and its tradeoffs to get performance (especially on the Cell)
