IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Presentation Transcript


  1. Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor • IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

  2. First Consumer Product • PlayStation 3!

  3. Introduction • Developed through a partnership of • Sony Computer Entertainment. • Toshiba. • IBM. • Aim • Highly tuned for media processing. • Meet the expected demand for handling larger and more complex data.

  4. What is Cell? • Cell is an architecture for high-performance distributed computing. • It comprises hardware and software cells. • It allows a wide range of single- or multiple-processor and memory configurations.

  5. “Supercomputer” in daily life • Parallelism with high frequency. • Real-time response. • Supports multiple operating systems. • 10 simultaneous threads. • 128 memory requests. • Optimally addresses many different system and application requirements.

  6. Architecture Overview • 8 SPEs with Local Storage (LS). • PPE with its L2 cache. • Internal element interconnect bus (EIB). • Memory Interface Controller (MIC). • Bus Interface Controller (BIC). • Power Management Unit (PMU). • Thermal Management Unit (TMU). • Pervasive Unit.

  7. High Level Diagram

  8. Die Photograph

  9. Synergistic Processing Elements (SPE) (1/2) • Share system memory with the PPE through DMA. • Data and instructions reside in a private real address space backed by a 256-KB LS. • According to IBM, a single SPE can perform as well as a top-end (single-core) desktop CPU given the right task.

  10. Synergistic Processing Elements (SPE) (2/2) • Access main storage by issuing DMA commands to the associated MFC block (asynchronous transfer). • Fully pipelined, 128-bit-wide, dual-issue SIMD. • SPEs in a Cell can be chained together to act as a stream processor.
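
The asynchronous transfer mentioned on slide 10 is commonly exploited by double buffering: the next block is requested before the current one is processed, so transfer and compute overlap. The following is only a minimal sketch of that idea in plain Python. The names `dma_get`, `process`, and `CHUNK` are hypothetical stand-ins, a worker thread plays the role of the MFC, and no real Cell API is used.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # elements per transfer (illustrative; real DMA sizes are not modeled)


def dma_get(main_memory, start, size):
    """Hypothetical stand-in for an asynchronous DMA read into local store."""
    return main_memory[start:start + size]


def process(chunk):
    """Placeholder compute kernel: sum of squares of one chunk."""
    return sum(x * x for x in chunk)


def stream_sum_of_squares(main_memory):
    """Process main_memory chunk by chunk while the next chunk is in flight."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as mfc:  # worker thread acts as the MFC
        pending = mfc.submit(dma_get, main_memory, 0, CHUNK)
        for start in range(CHUNK, len(main_memory) + CHUNK, CHUNK):
            current = pending.result()        # wait for the transfer issued earlier
            if start < len(main_memory):      # queue the next transfer before computing
                pending = mfc.submit(dma_get, main_memory, start, CHUNK)
            total += process(current)         # compute overlaps the queued transfer
    return total
```

Calling `stream_sum_of_squares(list(range(10)))` walks the ten values in three transfers, two full chunks and one partial chunk.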

  11. Power Processor Element (PPE) (1/2) • 32-kB L1 instruction and data caches. • 64-bit “Power Architecture” core with a 512-kB L2 cache.

  12. Power Processor Element (PPE) (2/2) • Can initiate DMA for an SPE through MMIO control registers. • Hypervisor extension. • Moderate pipeline length.

  13. Element Interconnect Bus (EIB) • Can transfer up to 96 bytes per cycle. • Four 16-byte-wide rings • Two rings going clockwise. • Two rings going counterclockwise. • Separate address and command network. • 12 on/off ramps.
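
The per-cycle figure on slide 13 converts to a peak bandwidth once a bus clock is chosen. The sketch below does only that arithmetic. The 1.6 GHz bus clock in the usage note is an assumption for illustration and is not stated in these slides.

```python
EIB_BYTES_PER_CYCLE = 96  # peak transfer per bus cycle, from slide 13


def eib_peak_bandwidth_gb_s(bus_clock_ghz, bytes_per_cycle=EIB_BYTES_PER_CYCLE):
    """Peak bandwidth in GB/s: bytes moved per cycle times cycles per nanosecond."""
    return bytes_per_cycle * bus_clock_ghz
```

With an assumed 1.6 GHz bus clock, `eib_peak_bandwidth_gb_s(1.6)` gives roughly 153.6 GB/s. This is a peak value; sustained bandwidth depends on the traffic pattern.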

  14. Memory Interface Controller (MIC) • Two 36-bit-wide XDR memory banks. • Can also support just a single bank. • Speed-matching SRAM and two clocks.

  15. Power Reduction • Power Management Unit. • PMU allows software controls to reduce chip power. • Can cause OS to throttle, pause or stop for single or multiple units.

  16. Thermal Monitoring • Thermal Sensors and Thermal Monitoring Unit. • One sensor located at a relatively constant-temperature location, for external cooling. • 10 digital thermal sensors (DTS) at various critical locations.

  17. Optimum Point (1/3) • Triple constraint: Power, Performance, Area. • Gate oxide thickness • Thinner oxide • Higher performance. • Higher gate tunneling too. • Reliability concerns.

  18. Optimum Point (2/3) • Channel Length • Short channel length • Improved performance. • Increased leakage current too. • Supply Voltage • Higher voltage • Improved performance. • Higher AC/DC power.

  19. Optimum Point (3/3) • Wire Levels • Few levels • Increased chip area. • Many levels • More cost.

  20. Final Technology Parameters

  21. Chip Integration • 241M transistors. • 8912 discrete floorplanned blocks. • Custom-tailored nets. • 20 separate power domains.

  22. POWER-CONSCIOUS DESIGN OF THE CELL PROCESSOR’S SPE Osamu Takahashi IBM Systems and Technology Group Scott Cottier Sang H. Dhong Brian Flachs Joel Silberman IBM T.J. Watson Research Center

  23. The CELL Processor - Properties • Mostly static CMOS gates. • Dynamic gates used for timing-critical paths. • Tight coupling of ISA, microarchitecture, and physical implementation achieves a compact and power-efficient design.

  24. APPLICATIONS • To name a few (the list is endless) • Image processing for high-definition TV • Image processing for medical use • High-performance computing • Gaming • Flexible enough to be a general-purpose microprocessor that supports high-level language (HLL) programming.

  25. Cell processor - Architecture • 64-bit Power core • Eight Synergistic Processor Elements (SPEs) • L2 Cache • Interconnection bus • I/O Controller • Rambus Flex I/O

  26. Architecture contd. • SPE has two clock domains: • one with an 11 FO4 cycle time. • the other with a 22 FO4 cycle time. • The high-frequency domain is implemented using custom design. • The SPE contains • 256 Kbytes of dedicated local store memory. • A 128-bit, 128-entry general-purpose register file with six read ports and two write ports.

  27. SPE • The SMF operates at half the SPE’s frequency. • The SPE operates at up to 5.6 GHz at a 1.4 V supply and 56° C. • The SPE’s measured power consumption is in the range of 1 W to 11 W, depending on • Operating clock frequency. • Temperature. • Workload.

  28. Triple design constraints • Cell contains eight copies of the SPE. • Optimization of the SPE’s power and area is critical to the overall chip design. • Conscious effort to reduce SPE area and power while meeting the 11 FO4 cycle time performance objectives. • Optimized design to balance three constraints of • Power. • Area. • Performance. • Tradeoffs to achieve the overall best results • Some techniques used • latch selection. • fine-grained clock-gating scheme. • multiclock-domain design. • use of dual-threshold voltage. • Selective use of dynamic circuits.

  29. Latch selection • Logic takes 8-9 FO4 of the cycle time. • The rest of the cycle is used by latches. • Several latches with various insertion delays are used.

  30. Transmission Gate Latch • SPE’s main workhorse latch. • Comes in two varieties • Scannable. • Non-scannable. • Each has several power levels. • Used almost throughout the SPE.

  31. Pulsed Clock Latch • Non-scannable. • Small insertion delay. • Small area. • Relatively low power consumption. • Used in the most timing- and power-critical areas.

  32. Dynamic multiplexer latch • Scannable. • Multiplexing widths from 4 to 10. • Small insertion delay. • Used in timing-critical areas that require multiplexing. • Typical use in dataflow operand latches.

  33. Dynamic PLA Latch • Scannable latch. • Used to generate control signals (clock gating signals). • The last two latches use slightly higher power. • They complete a complex task within the critical time. • Example of a tradeoff among the triple constraints.

  34. Fine-grained clock gating • Effective method of reducing power, used extensively in the CELL. • Use of local clock buffer (LCB) • Supplies the clock to a bank of latches. • When the enable signal fires, the LCB buffers the global clock and sends it to the bank of latches. • SPE activates only the necessary pipeline stages. • Registers are normally turned off. • Functional blocks were simulated and verified. • 50% active power reduction using this design process.
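
The LCB behaviour on slide 34 can be pictured with a small behavioural model: the latch bank only sees a clock pulse in cycles where the enable is asserted, and otherwise holds its state. This is an illustrative sketch only. The class and attribute names are hypothetical, and the pulse counter is just a rough proxy for clock power, not a power model from the paper.

```python
class LocalClockBuffer:
    """Forwards the global clock to a bank of latches only while enabled."""

    def __init__(self, width):
        self.latches = [0] * width  # state held by the latch bank
        self.clock_pulses = 0       # pulses delivered to the bank (power proxy)

    def tick(self, enable, data):
        """One global clock cycle; the bank captures data only if enabled."""
        if enable:
            self.latches = list(data)
            self.clock_pulses += 1
        # When not enabled, the bank receives no clock and keeps its old value.


def run(schedule, width=4):
    """Apply a list of (enable, data) pairs, one per global clock cycle."""
    lcb = LocalClockBuffer(width)
    for enable, data in schedule:
        lcb.tick(enable, data)
    return lcb
```

Running a schedule in which only half of the cycles are enabled delivers half as many pulses to the bank, which is the intuition behind gating idle pipeline stages.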

  35. Multiple clock frequency domains • High frequency increases performance. • Has some penalties • Higher clock power. • Higher percentage of clock insertion delays. • Shorter distance that a signal can travel. • SPE has some units whose performance does not solely depend on frequency. • SMF operates at half the frequency.

  36. Multiple clock frequency domains • 11 FO4 blocks • Register file. • Fixed point unit. • Floating point unit. • Data forwarding. • Load/Store. • 22 FO4 blocks • Direct memory access unit. • Bus control. • Distribution of one clock to both domains. • SMF activated every second clock cycle.

  37. Multiple clock frequency domains • Avoids physical implementation difficulties. • Helps escape • Latch insertion delay. • Travel distance penalties. • Advantages • Large percentage of clock dedicated to logic. • Most of SMF paths become non-critical. • Smaller transistors can be used. • SMF optimized for both area and power without sacrificing performance.
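
The "large percentage of clock dedicated to logic" point can be checked with simple arithmetic. Slide 29 implies a latch overhead of about 2-3 FO4 in the 11 FO4 cycle. The sketch below assumes, purely for illustration, that the same absolute overhead applies in the 22 FO4 domain; that assumption is not stated in the slides.

```python
def logic_fraction(cycle_fo4, latch_overhead_fo4):
    """Fraction of the cycle left for logic after latch insertion delay."""
    return (cycle_fo4 - latch_overhead_fo4) / cycle_fo4


if __name__ == "__main__":
    for overhead in (2, 3):  # FO4 spent in latches, inferred from slide 29
        fast = logic_fraction(11, overhead)
        slow = logic_fraction(22, overhead)
        print(f"overhead {overhead} FO4: 11 FO4 -> {fast:.0%}, 22 FO4 -> {slow:.0%}")
```

Under that assumption, a 3 FO4 overhead leaves 8/11 of the fast cycle for logic but 19/22 of the slow cycle, which is why paths in the half-frequency domain gain timing slack.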

  38. Dual-threshold-voltage devices • Leakage is a significant portion of power consumption in deep-submicron technology. • Cannot be solved by clock gating or two clock domains. • Use high-threshold-voltage transistors. • Penalty: slower switching time. • Used in paths with enough timing slack. • SMF paths made non-critical by the two clock domains were replaced with these devices.
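
The selection rule on slide 38, high-threshold devices only where timing slack allows, can be sketched as a simple filter over path delays. Everything numeric here is hypothetical: the 1.2x slowdown factor, the path names, and the delays are made up for illustration and do not come from the paper.

```python
def assign_high_vt(path_delays_fo4, cycle_time_fo4, slowdown=1.2):
    """Return the paths that still meet timing after a high-Vt slowdown.

    path_delays_fo4: mapping of path name -> delay in FO4 with low-Vt devices.
    slowdown: assumed delay multiplier for high-Vt devices (hypothetical value).
    """
    return [
        name
        for name, delay in path_delays_fo4.items()
        if delay * slowdown <= cycle_time_fo4
    ]
```

For example, with a 22 FO4 cycle, a hypothetical 12 FO4 path would qualify for high-Vt devices while a 20 FO4 path would not, since 20 x 1.2 exceeds the cycle time.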

  39. Selective use of dynamic circuits • Advantages of static circuits over dynamic • Design ease. • Low switching factor. • Tool compatibility. • Technology independence. • Advantages of dynamic circuits over static counterparts • Faster speed due to low capacitance at dynamic nodes. • Larger gains because of inverters after logic. • Microarchitecture efficiency: fewer stages. • Smaller area.

  40. Selective use of dynamic circuits • Dynamic logic requires a clock – higher power consumption. • Requires both true and complementary signals. • Static implementation tends to hit speed wall earlier. • Approach for design • Implement logic circuits in static CMOS as much as possible. • Alternatives when static did not meet the speed requirements.

  42. Selective use of dynamic circuits • Dynamic circuits have static interfaces. • 19 percent of the non-SRAM area. • Include the following macros • Dataflow forwarding. • Multiport register file. • Floating point unit. • Dynamic PLA. • Multiplexer latch. • Instruction line buffer.

  43. SPE hardware measurements • Tested with complicated 3D picture rendering. • The fastest operation ran at 5.6 GHz with a 1.4 V supply at 56° C. • The global clock mesh’s measured power is 1.3 W per SPE at a 1.2 V supply and 2.0 GHz clock frequency. • The Cell architecture is compatible with the 64-bit Power architecture, so applications can build on existing Power investments. • It can be considered a non-homogeneous coherent chip multiprocessor. • High design frequency has been achieved through a highly optimized implementation. • Its streaming DMA architecture helps to enhance the memory effectiveness of the processor. • Refer to the shmoo plot for power analysis.

  44. SPE shmoo plot

  45. Applications of the CELL Processor and Its Potential for Scientific Computing

  47. THE POWER! • FOLDING@HOME broke the Guinness world record for the “world’s most powerful distributed network” with computing power of > 1 PF (a thousand trillion floating-point operations per second). • Blue Gene is 500 TF.

  48. WHY THE POWER? • Cell combines the considerable floating-point resources required for demanding numerical algorithms with a power-efficient, software-controlled memory hierarchy. • Contains a powerful 64-bit dual-threaded IBM PowerPC core and eight proprietary “Synergistic Processing Elements” (SPEs): eight additional, highly specialized mini-computers on the same die. • Cell’s peak double-precision performance is very impressive relative to its commodity peers (14.6 Gflop/s at 3.2 GHz).
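
One way to arrive at the 14.6 Gflop/s figure is sketched below. The per-SPE parameters are assumptions that these slides do not state: two double-precision lanes per 128-bit SIMD instruction, two flops per lane for a fused multiply-add, and one double-precision SIMD instruction issued every seven cycles. Only the eight SPEs are counted.

```python
def peak_dp_gflops(clock_ghz, num_spes=8, lanes=2, flops_per_lane=2, issue_interval=7):
    """Peak double-precision Gflop/s across the SPEs under the stated assumptions.

    lanes: 64-bit values per 128-bit SIMD register (assumed 2).
    flops_per_lane: 2 for a fused multiply-add (assumption).
    issue_interval: cycles between DP SIMD instructions (assumed 7).
    """
    per_spe = lanes * flops_per_lane * clock_ghz / issue_interval
    return num_spes * per_spe
```

At 3.2 GHz these assumptions give about 1.83 Gflop/s per SPE and about 14.6 Gflop/s for eight SPEs, consistent with the number quoted on the slide.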

  49. OVERVIEW • Quantitative performance comparison of the Cell to AMD Opteron (superscalar), Intel Itanium 2 (VLIW), and Cray X1E (vector). • Minor architectural changes (Cell+) to improve DP performance. • Complexity of mapping scientific algorithms onto the Cell. • A few interesting applications.
