CBE Architecture Overview

Martin Kreibe: mk433@drexel.edu Matthew Longley: mrl37@drexel.edu Paul Snyder: plsnyder@drexel.edu CBE Architecture Overview

What is CBE? • A new interpretation of Multi-core processors • Development motivated by heavy graphics based applications • Game Consoles • Graphics Rendering Applications • Developed by a collaboration between Sony, Toshiba, and IBM (known as STI) in 2001.

Architecture Components • PPE • Main processing unit. • Controls SPE units • EIB • Communication Bus • SPEs • Non-control Processor Elements • 8 on chip • BEI • Engine Interface

CBE Endian-ness • The CBE Architecture is big endian Cell Broadband Engine Architecture

Power PC Processor Element • 64bit, Dual-thread PowerPC Architecture • 32KB L1 cache size • 512 KB L2 cache size • Instruction set extensions: • Vector/SIMD multimedia (“Altivec”) • PPU to SPU communication • Classic CPU Architecture

Synergistic Processor Elements • Operations must be allocated by PPU • “[O]ptimized for data-rich operations” • Programming Tutorial (DRAFT) • RISC core • 256 KB Local Store (“LS”, holds both Instructions and Data) • Unified 128-bit, 128-entry register file. • Manual branch hinting • Special SIMD instruction set • Vector operations • DMA control • Interprocessor messaging and synchonization • “[N]ot intended to run an operating system.” • Programming Tutorial (DRAFT)

Cell Broadband Engine Architecture SPU Registers

SPU Latencies • Simple fixed point - 2 cycles • Complex fixed point - 4 cycles • Load - 6 cycles • Single-precision (ER) float - 6 cycles • Integer multiply - 7 cycles • Branch miss (no penalty for correct hint) - 20 cycles • DP (IEEE) float (partially pipelined) - 13 cycles • Enqueue DMA Command - 20 cycles

Cell Broadband Engine Architecture Element Interconnect Bus • Peak bandwidth: 96 bytes/cycle • Four 16-byte data rings • 100+ outstanding DMA requests • Source: http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf

Cell Broadband Engine Architecture Platform Details • Cross-Unit Communication • Mailbox mechanism for synchronization • 32-bit messages between SPE • Signal Notification (inbound) • 32-bit signal notification register • PPU and SPUs can retrieve data from Memory into a SPU DMA • DMA loads are asynchronous

Environment • Hardware • PS3 (may have dead SPUs)‏ • Multi-processor blades • Workstations and accelerator cards • Simulator • Cycle-accurate emulation of SPUs • TCL and GUI interfaces • Modified Linux environment

Cell Broadband Engine Architecture Using the environment • Project development was performed using GNU GCC-based cross-compilation toolchain • Executables were tested on both the IBM Cell simulator and on PS3s running Yellow Dog Linux • Simulator is slow but functional • Code ran smoothly on PS3s • Thanks to the University of Delaware for providing access to their PS3s

Toolset • Dual GNU binutils/gcc toolchains (for PPU and SPU) • IBM XLC++ compiler (automatic vectorization) • Currently generates poorly-optimized code • Static and dynamic analysis tools • Multithreaded debugger (gdb) • Cell Simulator and toolchain are provided only for Fedora Linux • We used VMware virtual machines to ease installation • A Gentoo installation package exists, but is poorly supported

Cell Broadband Engine Architecture Toolchain Challenges • Cell SDK Makefiles use custom include footers for Makefiles • These interface POORLY with GNU Autotools • Spiral-WHT uses GNU Autotools • MUCH time was spent analyzing the operation of the Cell Makefiles and mixing this functionality with the Autotools compilation framework • Considered trying to drop Autotools for this project but: • (1) This is just as much work as trying to go the other way, and • (2) Ideally, Cell target can be rolled into the Spiral-WHT package, so this way the porting effort is not wasted

Cell Broadband Engine Architecture More Toolchain Challenges • Best course was to analyze commands run by Cell Makefiles, then add those to the Automake configuration • Initially, scripts were used to munge the Makefiles • Later, cell-specific options were added to Autoconf frameworks • SPU uses separate toolchain; our current implementation is hackish • Still more work needed to implement cleanly

Cell Broadband Engine Architecture Architectural Challenges • Keep SPUs processing at capacity. • PPU needs to run the OS and allocate jobs to SPUs • Exploit multiple levels of parallelism • Vector (SIMD) operations • PPE + 8 SPEs • Dual pipelines • Multiple processors on a blade • Multiple blades! • Exploit data locality

More Architectural Challenges • Distributed architecture basics • Shared memory • Message passing • Synchronization • Manual DMA Scheduling for • Vectorization Issues • PPU and SPUs have different vector intrinsics • Most operators have a direct mapping between SSE and SPU/PPU intrinsics. Exceptions: Shuffle and permutations (due to endianness)

Cell Broadband Engine Architecture Implementation Strategy • Utilize Vector Constructs • SPUs allow vectorization of doubles as well as floats; PPU is single-precision only • Implement a distributed ‘split’ across SPUs: splitcell[] • Use reference ‘d_split’ code as implementation guide

Cell Broadband Engine Architecture Vector Intrinsics • Vector Integer • vector arithmetic, compare, logical, rotate, and shift • Vector Floating-Point • floating-point arithmetic, multiply/add, rounding and conversion, compare, and estimate instructions. • Vector Load and Store • basic integer and floating-point load and store instructions. No update forms of load and store • Vector Permutation and Formatting • vector pack, unpack, merge, splat, permute, select, and shift

Cell Broadband Engine Architecture Vectorizing Details • Conversion from SSE vectors to SIMD style vectors is non-trivial • PPU and SPU have different vector intrinsics • Many SSE intrinsics do have a SIMD intrinsic except for memory interactions and permutations • Care must be takes to maintain the correct endian model

Cell Broadband Engine Architecture splitcell[] Strategy • Pairing transpose blocks by flipping upper and lower address halves • Limited to 22×n block sizes each cell will calculate 2 blocks. Values n = 1,2 xx…x yy…y xx…x ≠ yy…y Move Blocks xx…x yy…y yy…y xx…x xx…x = yy…y Don’t Move Blocks xx…x yy…y xx…x yy…y

Cell Broadband Engine Architecture splitcell[] Mapping Visualized n = 1 n = 2

Cell Broadband Engine Architecture More Intrisics • Processor Control • read and write the vector status and control register (VSCR) • Memory Control • instructions for managing caches (user-level and supervisor-level)

Cell Broadband Engine Architecture Some DMA Memory Interaction tag = mfc_tag_reserve(); // reserve a single tag for exclusive use mfc_get(ls, ea, size, tag, tid, rid); // move data from main memory to local store mfc_write_tag_mask(mask); // mfc_put(ls, ea, size, tag, tid, rid); // move data from local store to main memory mfc_read_tag_status_all(); // wait for all write commands to finish ls - local storage location ea – effective main memory address tag – status id for memory operations tid – transfer id rid – replace id

Cell Broadband Engine Architecture Vector Intrinsics Mapping SSE Intrinsic -> PPE intrinsic;SPE intrinsic --------------------------------------------- _mm_add_ps -> vec_add; spu_add _mm_sub_ps -> vec_sub; spu_sub _mm_load_ps -> vec_ld; (no SPE equiv) _mm_store_ps -> vec_st; (no SPE equiv) _mm_shuffle_ps -> vec_perm; spu_shuffle (both require custom macro for permuation mask)

Cell Broadband Engine Architecture Starting SPU Programs #include <libspe2.h> #include <pthread.h> spe_context_ptr_t ctx; // the SPU construct pthread_t thd; // the thread construct extern spe_program_handle_tspuProgram; // the binary SPU program handle int main() { ctx = spe_context_create(0, NULL); // create the context if(!program_load(ctx, &spuProgram) { // load the SPU program if(pthread_create(&thd, NULL, &thdFunction, &ctx)) { // spawn the thread phthread_join(thd, NULL); // wait for the SPU program to finish } } spe_context_destroy(ctx); // clean up the context return 0; }

Cell Broadband Engine Architecture SPU Threads and Programs // Thread function void* thdFunction(void* arg) { spe_context_ptr_t ctx; // the SPU construct unsigned long longspuArg; // argument pointer to pass to the SPU unsigned int entry = SPE_DEFAULT_ENTRY; spe_context_run(ctx, &entry, 0, spuArg, NULL, NULL); pthread_exit(NULL); } // SPU program this will be linked in as ‘spuProgram’ int main(unsignedlong long spuId, unsigned long long argv) { // SPU code… return 0; }

Cell Broadband Engine Architecture Combining PPU and SPU code • PPU and SPU code are compiled as separate object files using separate compilers • ppu-embedspu is used to embed SPU-compiled object code into PPU ELF binaries • Unfortunately, the technical challenges of integrating this into Spiral-WHT were not overcome within our timeframes • Key problem: embedded SPU code has to be dynamically linked, while our Makefile hackery was using static libraries • This is very poorly documented, and was finally diagnosed too late to allow us to resolve the problem

Cell Broadband Engine Architecture Results • Spiral-WHT successfully ported to Cell platform • Implemented codelets for PPU and SPU • Initial modifications to codelet generator • The most difficult issues were toolchain-related, and these limited our ability to generate empirical performance data • …So, what sort of performance improvement should we expect when the bugs are ironed out?

Cell Broadband Engine Architecture Expected Performance Gains • SPU has eight SPUs; on a PS3, six are available for use • Thus, we would hope for a 6-8x performance increase over using just the PPU • Williams et al. 2006 project a 12.7x speedup for 2-D FFTs over a 64-bit Intel CPU • With some minor architectural modifications, they project a 20x speedup!

Cell Broadband Engine Architecture Future Work and Alternate Strategies • Implementing split/splitddl on the SPU • Use multi-buffered DMA scheduling to maximize throughput • PPU can handle two simultaneous hardware threads • Allow the PPU to run codelets in parallel with SPUs • Additional levels of parallelization: splitting over multiple CBE processors • Cell Blades have 2 CBE processors each

Cell Broadband Engine Architecture References • http://www.ibm.com/developerworks/power/cell/index.html • IBM Full-System Simulator User’s Guide • Cell Broadband Engine Programming Handbook Version 1.1 • Programming Tutorial (DRAFT) • S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, K. Yellick, “The Potential of the Cell Processor for Scientific Computing”, CF06 • http://www.lbl.gov/Science-Articles/Archive/sabl/2006/Jul/CellProcessorPotential.pdf • Links to these and other useful Cell programming resources are on our group’s website: • http://www.cs.drexel.edu/~pls29/cell/

Questions?

CBE Architecture Overview