1 / 33

CBE Architecture Overview

Martin Kreibe: mk433@drexel.edu Matthew Longley: mrl37@drexel.edu Paul Snyder: plsnyder@drexel.edu. CBE Architecture Overview. What is CBE?. A new interpretation of Multi-core processors Development motivated by heavy graphics based applications Game Consoles

mab
Download Presentation

CBE Architecture Overview

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Martin Kreibe: mk433@drexel.edu Matthew Longley: mrl37@drexel.edu Paul Snyder: plsnyder@drexel.edu CBE Architecture Overview

  2. What is CBE? • A new interpretation of Multi-core processors • Development motivated by heavy graphics based applications • Game Consoles • Graphics Rendering Applications • Developed by a collaboration between Sony, Toshiba, and IBM (known as STI) in 2001.

  3. Architecture Components • PPE • Main processing unit. • Controls SPE units • EIB • Communication Bus • SPEs • Non-control Processor Elements • 8 on chip • BEI • Engine Interface

  4. CBE Endian-ness • The CBE Architecture is big endian Cell Broadband Engine Architecture

  5. Power PC Processor Element • 64bit, Dual-thread PowerPC Architecture • 32KB L1 cache size • 512 KB L2 cache size • Instruction set extensions: • Vector/SIMD multimedia (“Altivec”) • PPU to SPU communication • Classic CPU Architecture

  6. Synergistic Processor Elements • Operations must be allocated by PPU • “[O]ptimized for data-rich operations” • Programming Tutorial (DRAFT) • RISC core • 256 KB Local Store (“LS”, holds both Instructions and Data) • Unified 128-bit, 128-entry register file. • Manual branch hinting • Special SIMD instruction set • Vector operations • DMA control • Interprocessor messaging and synchonization • “[N]ot intended to run an operating system.” • Programming Tutorial (DRAFT)

  7. Cell Broadband Engine Architecture SPU Registers

  8. SPU Latencies • Simple fixed point - 2 cycles • Complex fixed point - 4 cycles • Load - 6 cycles • Single-precision (ER) float - 6 cycles • Integer multiply - 7 cycles • Branch miss (no penalty for correct hint) - 20 cycles • DP (IEEE) float (partially pipelined) - 13 cycles • Enqueue DMA Command - 20 cycles

  9. Cell Broadband Engine Architecture Element Interconnect Bus • Peak bandwidth: 96 bytes/cycle • Four 16-byte data rings • 100+ outstanding DMA requests • Source: http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf

  10. Cell Broadband Engine Architecture Platform Details • Cross-Unit Communication • Mailbox mechanism for synchronization • 32-bit messages between SPE • Signal Notification (inbound) • 32-bit signal notification register • PPU and SPUs can retrieve data from Memory into a SPU DMA • DMA loads are asynchronous

  11. Environment • Hardware • PS3 (may have dead SPUs)‏ • Multi-processor blades • Workstations and accelerator cards • Simulator • Cycle-accurate emulation of SPUs • TCL and GUI interfaces • Modified Linux environment

  12. Cell Broadband Engine Architecture Using the environment • Project development was performed using GNU GCC-based cross-compilation toolchain • Executables were tested on both the IBM Cell simulator and on PS3s running Yellow Dog Linux • Simulator is slow but functional • Code ran smoothly on PS3s • Thanks to the University of Delaware for providing access to their PS3s

  13. Toolset • Dual GNU binutils/gcc toolchains (for PPU and SPU) • IBM XLC++ compiler (automatic vectorization) • Currently generates poorly-optimized code • Static and dynamic analysis tools • Multithreaded debugger (gdb) • Cell Simulator and toolchain are provided only for Fedora Linux • We used VMware virtual machines to ease installation • A Gentoo installation package exists, but is poorly supported

  14. Cell Broadband Engine Architecture Toolchain Challenges • Cell SDK Makefiles use custom include footers for Makefiles • These interface POORLY with GNU Autotools • Spiral-WHT uses GNU Autotools • MUCH time was spent analyzing the operation of the Cell Makefiles and mixing this functionality with the Autotools compilation framework • Considered trying to drop Autotools for this project but: • (1) This is just as much work as trying to go the other way, and • (2) Ideally, Cell target can be rolled into the Spiral-WHT package, so this way the porting effort is not wasted

  15. Cell Broadband Engine Architecture More Toolchain Challenges • Best course was to analyze commands run by Cell Makefiles, then add those to the Automake configuration • Initially, scripts were used to munge the Makefiles • Later, cell-specific options were added to Autoconf frameworks • SPU uses separate toolchain; our current implementation is hackish • Still more work needed to implement cleanly

  16. Cell Broadband Engine Architecture Architectural Challenges • Keep SPUs processing at capacity. • PPU needs to run the OS and allocate jobs to SPUs • Exploit multiple levels of parallelism • Vector (SIMD) operations • PPE + 8 SPEs • Dual pipelines • Multiple processors on a blade • Multiple blades! • Exploit data locality

  17. More Architectural Challenges • Distributed architecture basics • Shared memory • Message passing • Synchronization • Manual DMA Scheduling for • Vectorization Issues • PPU and SPUs have different vector intrinsics • Most operators have a direct mapping between SSE and SPU/PPU intrinsics. Exceptions: Shuffle and permutations (due to endianness)

  18. Cell Broadband Engine Architecture Implementation Strategy • Utilize Vector Constructs • SPUs allow vectorization of doubles as well as floats; PPU is single-precision only • Implement a distributed ‘split’ across SPUs: splitcell[] • Use reference ‘d_split’ code as implementation guide

  19. Cell Broadband Engine Architecture Vector Intrinsics • Vector Integer • vector arithmetic, compare, logical, rotate, and shift • Vector Floating-Point • floating-point arithmetic, multiply/add, rounding and conversion, compare, and estimate instructions. • Vector Load and Store • basic integer and floating-point load and store instructions. No update forms of load and store • Vector Permutation and Formatting • vector pack, unpack, merge, splat, permute, select, and shift

  20. Cell Broadband Engine Architecture Vectorizing Details • Conversion from SSE vectors to SIMD style vectors is non-trivial • PPU and SPU have different vector intrinsics • Many SSE intrinsics do have a SIMD intrinsic except for memory interactions and permutations • Care must be takes to maintain the correct endian model

  21. Cell Broadband Engine Architecture splitcell[] Strategy • Pairing transpose blocks by flipping upper and lower address halves • Limited to 22×n block sizes each cell will calculate 2 blocks. Values n = 1,2 xx…x yy…y xx…x ≠ yy…y Move Blocks xx…x yy…y yy…y xx…x xx…x = yy…y Don’t Move Blocks xx…x yy…y xx…x yy…y

  22. Cell Broadband Engine Architecture splitcell[] Mapping Visualized n = 1 n = 2

  23. Cell Broadband Engine Architecture More Intrisics • Processor Control • read and write the vector status and control register (VSCR) • Memory Control • instructions for managing caches (user-level and supervisor-level)

  24. Cell Broadband Engine Architecture Some DMA Memory Interaction tag = mfc_tag_reserve(); // reserve a single tag for exclusive use mfc_get(ls, ea, size, tag, tid, rid); // move data from main memory to local store mfc_write_tag_mask(mask); // mfc_put(ls, ea, size, tag, tid, rid); // move data from local store to main memory mfc_read_tag_status_all(); // wait for all write commands to finish ls - local storage location ea – effective main memory address tag – status id for memory operations tid – transfer id rid – replace id

  25. Cell Broadband Engine Architecture Vector Intrinsics Mapping SSE Intrinsic -> PPE intrinsic;SPE intrinsic --------------------------------------------- _mm_add_ps -> vec_add; spu_add _mm_sub_ps -> vec_sub; spu_sub _mm_load_ps -> vec_ld; (no SPE equiv) _mm_store_ps -> vec_st; (no SPE equiv) _mm_shuffle_ps -> vec_perm; spu_shuffle (both require custom macro for permuation mask)

  26. Cell Broadband Engine Architecture Starting SPU Programs #include <libspe2.h> #include <pthread.h> spe_context_ptr_t ctx; // the SPU construct pthread_t thd; // the thread construct extern spe_program_handle_tspuProgram; // the binary SPU program handle int main() { ctx = spe_context_create(0, NULL); // create the context if(!program_load(ctx, &spuProgram) { // load the SPU program if(pthread_create(&thd, NULL, &thdFunction, &ctx)) { // spawn the thread phthread_join(thd, NULL); // wait for the SPU program to finish } } spe_context_destroy(ctx); // clean up the context return 0; }

  27. Cell Broadband Engine Architecture SPU Threads and Programs // Thread function void* thdFunction(void* arg) { spe_context_ptr_t ctx; // the SPU construct unsigned long longspuArg; // argument pointer to pass to the SPU unsigned int entry = SPE_DEFAULT_ENTRY; spe_context_run(ctx, &entry, 0, spuArg, NULL, NULL); pthread_exit(NULL); } // SPU program this will be linked in as ‘spuProgram’ int main(unsignedlong long spuId, unsigned long long argv) { // SPU code… return 0; }

  28. Cell Broadband Engine Architecture Combining PPU and SPU code • PPU and SPU code are compiled as separate object files using separate compilers • ppu-embedspu is used to embed SPU-compiled object code into PPU ELF binaries • Unfortunately, the technical challenges of integrating this into Spiral-WHT were not overcome within our timeframes • Key problem: embedded SPU code has to be dynamically linked, while our Makefile hackery was using static libraries • This is very poorly documented, and was finally diagnosed too late to allow us to resolve the problem

  29. Cell Broadband Engine Architecture Results • Spiral-WHT successfully ported to Cell platform • Implemented codelets for PPU and SPU • Initial modifications to codelet generator • The most difficult issues were toolchain-related, and these limited our ability to generate empirical performance data • …So, what sort of performance improvement should we expect when the bugs are ironed out?

  30. Cell Broadband Engine Architecture Expected Performance Gains • SPU has eight SPUs; on a PS3, six are available for use • Thus, we would hope for a 6-8x performance increase over using just the PPU • Williams et al. 2006 project a 12.7x speedup for 2-D FFTs over a 64-bit Intel CPU • With some minor architectural modifications, they project a 20x speedup!

  31. Cell Broadband Engine Architecture Future Work and Alternate Strategies • Implementing split/splitddl on the SPU • Use multi-buffered DMA scheduling to maximize throughput • PPU can handle two simultaneous hardware threads • Allow the PPU to run codelets in parallel with SPUs • Additional levels of parallelization: splitting over multiple CBE processors • Cell Blades have 2 CBE processors each

  32. Cell Broadband Engine Architecture References • http://www.ibm.com/developerworks/power/cell/index.html • IBM Full-System Simulator User’s Guide • Cell Broadband Engine Programming Handbook Version 1.1 • Programming Tutorial (DRAFT) • S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, K. Yellick, “The Potential of the Cell Processor for Scientific Computing”, CF06 • http://www.lbl.gov/Science-Articles/Archive/sabl/2006/Jul/CellProcessorPotential.pdf • Links to these and other useful Cell programming resources are on our group’s website: • http://www.cs.drexel.edu/~pls29/cell/

  33. Questions?

More Related