
The Future of Computer Architecture



Presentation Transcript


  1. The Future of Computer Architecture Shekhar Borkar Intel Corp. June 9, 2012

  2. Outline • Compute performance goals • Challenges & solutions for: • Compute, • Memory, • Interconnect • Importance of resiliency • Paradigm shift • Summary

  3. Performance Roadmap [Chart: performance roadmap for client and hand-held devices.]

  4. From Giga to Exa, via Tera & Peta

  5. Energy per Operation [Chart: energy-per-operation data points at 100, 75, 25, and 10 pJ/bit.]

  6. Where is the Energy Consumed? Overheads: decode and control, address translations, power supply losses; architectures bloated with inefficient features. [Chart: today-vs-goal power breakdown: compute 100W at 100 pJ per FLOP, memory 150W at 0.1 B/FLOP and 1.5 nJ per byte, communication 100W at 100 pJ of communication per FLOP, disk 2.5KW (10TB @ 1TB/disk @ 10W); roughly 3KW today against a goal of a few watts per component, ~20W in total.]
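The budget arithmetic above can be sketched as a back-of-envelope calculation. The per-operation energies are the slide's figures; the 1 TFLOP/s machine size is an assumed example, not from the slide.

```python
# Back-of-envelope power from per-operation energy. Per-op energies are the
# slide's figures; the 1 TFLOP/s machine size is an assumed example.
def power_watts(energy_pj_per_op, ops_per_sec):
    """Power = energy per operation x operation rate (pJ -> J via / 1e12)."""
    return energy_pj_per_op * ops_per_sec / 1e12

flops = 1e12                               # assumed 1 TFLOP/s machine
compute = power_watts(100, flops)          # 100 pJ per FLOP
comm    = power_watts(100, flops)          # 100 pJ of communication per FLOP
memory  = power_watts(1.5e3, 0.1 * flops)  # 0.1 B/FLOP at 1.5 nJ per byte
print(compute, comm, memory)  # 100.0 100.0 150.0
```

The same arithmetic run at an exaflop shows why the per-operation energies, not the peak rate, are the design constraint.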

  7. Voltage Scaling When designed to voltage scale

  8. [Chart: 65nm CMOS at 50°C. Energy efficiency (GOPS/Watt) peaks in the near-threshold-voltage (NTV) region around 320mV, a 9.6X gain over nominal supply, while active leakage power (mW) falls with supply voltage (0.2-1.4V) down through the subthreshold region; the slide annotates spans of < 3 and 4 orders of magnitude.] H. Kaul et al, 16.6: ISSCC08

  9. NTV Across Technology Generations NTV operation improves energy efficiency across 45nm-22nm CMOS

  10. Impact of Variation on NTV A 5% variation in Vt or Vdd results in a 20 to 50% variation in circuit performance
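The 20-50% figure can be illustrated with the standard alpha-power delay model, delay ∝ Vdd/(Vdd − Vt)^α. The Vt = 0.3V, α = 1.5, and the two operating points below are illustrative assumptions, not numbers from the slide.

```python
# Alpha-power-law delay model: delay ~ Vdd / (Vdd - Vt)**alpha.
# Vt = 0.3V and alpha = 1.5 are illustrative assumptions, not slide data.
def delay(vdd, vt=0.3, alpha=1.5):
    return vdd / (vdd - vt) ** alpha

def perf_swing(vdd, vt=0.3, dvt=0.05):
    """Fractional performance loss from a +5% shift in Vt at supply vdd."""
    return 1 - delay(vdd, vt) / delay(vdd, vt * (1 + dvt))

print(f"super-threshold (1.1V): {perf_swing(1.1):.1%}")  # a few percent
print(f"near threshold (0.4V):  {perf_swing(0.4):.1%}")  # tens of percent
```

Because Vdd − Vt is small near threshold, the same Vt shift that is negligible at nominal supply produces a performance swing in the tens of percent at NTV.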

  11. Mitigating Impact of Variation 1. Variation control with body biasing • Body effect is substantially reduced in advanced technologies • Energy cost of body biasing could become substantial • Fully-depleted transistors have no body left 2. Variation tolerance at the system level • Example: many-core system [Figure: mesh of cores labeled with native frequencies f, f/2, f/4] • Running all cores at full frequency exceeds the energy budget • Run cores at their native frequency • Law of large numbers: averaging
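The law-of-large-numbers point can be sketched numerically: individual core frequencies vary, but the aggregate throughput of many cores run at their native frequencies varies far less. The ±20% frequency spread and the 256-core chip size below are assumed for illustration.

```python
import random

# Each core runs at its "native" frequency, drawn from an assumed +/-20%
# spread. Chip throughput is the sum; averaging over many cores shrinks
# the chip-to-chip variation (law of large numbers).
random.seed(42)

def chip_throughput(n_cores, nominal=1.0, spread=0.2):
    freqs = [nominal * (1 + random.uniform(-spread, spread)) for _ in range(n_cores)]
    return sum(freqs)

# Normalized throughput of 100 simulated 256-core chips.
samples = [chip_throughput(256) / 256 for _ in range(100)]
print(f"spread across chips: {min(samples):.3f} to {max(samples):.3f}")
```

Any single core may be 20% slow, but no simulated chip deviates from nominal aggregate throughput by more than a few percent.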

  12. Subthreshold Leakage at NTV • NTV operation reduces total power, improves energy efficiency • Subthreshold leakage power is a substantial portion of the total • Variations increase at NTV

  13. Experimental NTV Processor [Die photo: 1.1 mm x 1.8 mm IA-32 core with scan logic, level shifters, clock spine, L1$-I, L1$-D, and ROM; packaged in a 951-pin FCBGA on a custom interposer fitting a legacy Socket-7 motherboard.] S. Jain, et al, “A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS”, ISSCC 2012

  14. Power and Performance [Chart: power breakdown (logic dynamic, logic leakage, memory leakage) versus supply voltage across the subthreshold, NTV, and super-threshold regions.]

  15. Observations • Leakage power dominates • Fine-grain leakage power management is required

  16. Memory & Storage Technologies (Endurance issue)

  17. Revise DRAM Architecture (energy cost today: ~150 pJ/bit) Traditional DRAM: RAS/CAS addressing • Activates many pages • Lots of reads and writes (refresh) • Only a small amount of the read data is used • Requires a small number of pins. New DRAM architecture: • Activates few pages • Read and write (refresh) only what is needed • All read data is used • Requires a large number of IOs (3D)
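At the ~150 pJ/bit quoted above, memory-interface power alone becomes prohibitive at high bandwidth. A quick sketch; the 100 GB/s bandwidth and the ~2 pJ/bit 3D-stacked target are illustrative assumptions, not slide figures.

```python
# Memory-interface power at the slide's ~150 pJ/bit, versus an assumed
# ~2 pJ/bit 3D-stacked target (target and bandwidth are illustrative).
def mem_power_watts(bandwidth_gbps, pj_per_bit):
    bits_per_sec = bandwidth_gbps * 1e9 * 8  # GB/s -> bits/s
    return bits_per_sec * pj_per_bit / 1e12  # pJ -> J

bw = 100  # assumed 100 GB/s of sustained DRAM bandwidth
print(mem_power_watts(bw, 150))  # traditional interface: 120.0 W
print(mem_power_watts(bw, 2))    # 3D-stacked target: 1.6 W
```

The two-orders-of-magnitude gap in per-bit energy, not capacity, is what motivates revising the DRAM architecture.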

  18. 3D Integration of DRAM • Thin logic and DRAM die in one package • Through-silicon vias • Power delivery through the logic die • Energy-efficient, high-speed IO to the logic buffer • Detailed interface signals to the DRAMs created on the logic die • The most promising solution for energy-efficient BW

  19. Communication Energy [Chart: communication energy at each level of the hierarchy: on die, chip to chip, board to board, between cabinets.]

  20. On-Die Communication Power [Die photos: 80-core TFLOP chip (2006), 21.4mm, and 48-core Single-chip Cloud (2009), 26.5mm.] 80-core TFLOP chip: 8 x 10 mesh, 32-bit links, 320 GB/sec bisection BW @ 5 GHz; tile power breakdown: FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, RF 4%. 48-core SCC: 2-core clusters in a 6 x 4 mesh (why not 6 x 8?), 128-bit links, 256 GB/sec bisection BW @ 2 GHz, with DDR3 memory controllers, VRC, JTAG, and system interface I/O.

  21. On-die Communication Energy Traditional homogeneous network: 0.08 pJ/bit/switch, 0.04 pJ/mm (wire); assuming a Byte/FLOP of traffic, ~27 pJ/FLOP of on-die communication energy • Network power too high (27MW for an EFLOP) • Worse if link width scales up each generation • Cache coherency mechanism is complex
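The 27 pJ/FLOP and 27MW figures follow from the per-switch and per-mm energies. The average route below (22 switch traversals, 40mm of wire) is an assumption chosen only to illustrate how such a total arises; the two per-unit energies are the slide's.

```python
# On-die network energy per FLOP from the slide's per-unit figures.
E_SWITCH = 0.08  # pJ per bit per switch traversal (slide)
E_WIRE   = 0.04  # pJ per bit per mm of wire (slide)

def comm_pj_per_flop(switch_hops, wire_mm, bits_per_flop=8):
    """Energy to move one FLOP's worth of traffic (1 byte) across the network."""
    return (switch_hops * E_SWITCH + wire_mm * E_WIRE) * bits_per_flop

e = comm_pj_per_flop(22, 40)  # assumed average route: 22 switches, 40 mm
print(f"{e:.1f} pJ/FLOP")  # ~27 pJ/FLOP, as on the slide
# 1 pJ/FLOP at 1e18 FLOP/s is 1 MW, so the same number reads out in MW:
print(f"{e:.1f} MW at 1 EFLOP/s")
```

The pJ-to-MW identity (1 pJ per operation at 10^18 operations per second is 1 MW) is why a 27 pJ/FLOP network implies a 27MW interconnect at exascale.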

  22. Packet Switched Interconnect [Figure: a multi-hop route across the die with 5mm and 10mm segments; each router traversal costs roughly 3-5 clocks and ~0.5-1mW, link traversal under 1 clock.] • Routers act like STOP signs: they add latency • Each hop consumes power (unnecessarily)
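A quick tally of the per-hop costs above makes the point concrete; the ~4-clock, ~0.5mW per-router figures are in the slide's quoted range, and the 6-hop route is an assumed example.

```python
# Cumulative cost of a multi-hop packet-switched route, using per-hop figures
# in the slide's range (~4 clocks and ~0.5 mW per router traversal).
def route_cost(hops, clk_per_hop=4, mw_per_hop=0.5):
    """Return (total latency in clocks, total router power in mW)."""
    return hops * clk_per_hop, hops * mw_per_hop

cycles, mw = route_cost(6)  # assumed 6-hop route across the mesh
print(cycles, mw)  # 24 3.0
```

Every added hop pays both costs again, which is the argument for circuit switching on the following slides.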

  23. Mesh—Retrospective • Bus: Good at board level, does not extend well • Transmission line issues: loss and signal integrity, limited frequency • Width is limited by pins and board area • Broadcast, simple to implement • Point to point busses: fast signaling over longer distance • Board level, between boards, and racks • High frequency, narrow links • 1D Ring, 2D Mesh and Torus to reduce latency • Higher complexity and latency in each node • Hence, emergence of packet switched network But, pt-to-pt packet switched network on a chip?

  24. Interconnect Delay & Energy

  25. Bus—The Other Extreme… Issues: • Slow, < 300MHz • Shared, limited scalability? Solutions: • Repeaters to increase frequency • Wide busses for bandwidth • Multiple busses for scalability Benefits: • Power? • Simpler cache coherency Move away from frequency, embrace parallelism

  26. Repeated Bus (Circuit Switched) Arbitration: each cycle for the next cycle; decision visible to all nodes. Repeaters: align repeater direction; no driving contention. [Figure: bus spanning the die with repeaters (R) between nodes.] Assume: 10mm die, 1.5u bus pitch, 50ps repeater delay. Anders et al, A 2.9Tb/s 8W 64-Core Circuit-switched Network-on-Chip in 45nm CMOS, ESSCIRC 2008; Anders et al, A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS, ISSCC 2008
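The slide's 50ps repeater delay bounds the achievable bus cycle time across the 10mm die; the 1mm repeater spacing below is an illustrative assumption.

```python
# Repeated-bus cycle time across the slide's 10 mm die with 50 ps repeaters.
# The 1 mm repeater spacing is an illustrative assumption.
DIE_MM = 10
REPEATER_DELAY_PS = 50

def bus_freq_ghz(spacing_mm=1.0):
    stages = DIE_MM / spacing_mm
    crossing_ps = stages * REPEATER_DELAY_PS  # full-die traversal time
    return 1e3 / crossing_ps                  # one crossing per cycle, in GHz

print(f"{bus_freq_ghz():.1f} GHz")  # 10 stages x 50 ps = 500 ps -> 2.0 GHz
```

Even with one full-die traversal per cycle, the repeated bus lands in the GHz range rather than the < 300MHz of an unrepeated shared bus.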

  27. A Circuit Switched Network • Circuit-switched NoC eliminates intra-route data storage • Packet-switching used only for channel requests • High bandwidth and energy efficiency (1.6 to 0.6 pJ/bit) Anders et al, A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS, ISSCC 2008

  28. Hierarchy of Busses [Figure: clusters of cores (C) on local busses, joined by routers (R) over a 2nd-level bus.] • A bus to connect over short distances • Or hierarchical circuit- and packet-switched networks • Hierarchical & heterogeneous

  29. Tapered Interconnect Fabric Tapered, but over-provisioned bandwidth Pay (energy) as you go (afford)

  30. But wait, what about Optical? [Figure: 65nm optical link: pre-driver, driver, VCSEL, TIA, LA, DeMUX, CDR, clock buffer; some blocks possible today, others hopeful.] Source: Hirotaka Tamure (Fujitsu), ISSCC 08 Workshop on HS Transceivers

  31. Impact of Exploding Parallelism [Chart annotations: energy almost flat because Vdd is close to Vt; 4X increase in the number of cores (parallelism).] • Increased communication and related energy • Increased HW, and unreliability 1. Strike a balance between communication & computation 2. Resiliency (gradual, intermittent, permanent faults)

  32. Road to Unreliability? Resiliency will be the cornerstone

  33. Soft Errors and Reliability • Soft error rate per bit reduces each generation • Nominal impact of NTV on soft error rate • Soft errors at the system level will continue to increase Positive impact of NTV on reliability: • Lower V means lower E fields; lower power means lower temperature • Device aging effects will be less of a concern • Fewer electromigration-related defects N. Seifert et al, "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006

  34. Resiliency Minimal overhead for resiliency: error detection, fault isolation, fault confinement, reconfiguration, recovery & adapt, spanning the stack from applications, system software, and the programming system down through microcode & platform, microarchitecture, and circuit & design

  35. Compute vs Data Movement [Chart: energy per bit on a log scale: transfer over 3G (~1mJ), over WiFi, over Ethernet, over Bluetooth (~1nJ), 1 bit over an on-die link, R/W operands (RF), read a bit from internal SRAM, read a bit from DRAM, instruction execution; computation sits near the ~1pJ bottom of the ladder.] Data movement energy will dominate
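The ladder can be captured numerically. Only the 1mJ/1nJ/1pJ decade markers come from the slide; the per-item placements between them are illustrative.

```python
# Order-of-magnitude energy ladder (values in pJ per bit). Only the
# 1 mJ / 1 nJ / 1 pJ decade markers come from the slide; the intermediate
# per-item placements are illustrative.
LADDER_PJ = {
    "xfer over 3G":          1e9,  # ~1 mJ (slide marker)
    "xfer over WiFi":        1e7,
    "xfer over Ethernet":    1e5,
    "xfer over Bluetooth":   1e3,  # ~1 nJ (slide marker)
    "xfer over on-die link": 1e2,
    "read from DRAM":        1e1,
    "read from SRAM":        1e0,  # ~1 pJ (slide marker)
}
ratio = LADDER_PJ["xfer over 3G"] / LADDER_PJ["read from SRAM"]
print(f"radio vs on-chip SRAM: {ratio:.0e}x")  # 1e+09x
```

Nine orders of magnitude separate moving a bit over a radio from reading it on chip, which is why data movement, not compute, will dominate the energy budget.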

  36. System Level Optimization [Chart: compute energy and global interconnect energy versus supply voltage.] • Compute energy reduces faster than global interconnect energy • For constant throughput, NTV demands more parallelism • That increases data movement at the system level • System-level optimization is required to determine the NTV operating point
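The constant-throughput argument is simple arithmetic: if NTV cuts per-core frequency by ~5X, core count must grow by the same factor. The 2 ops/cycle, the frequencies, and the 1000-GOPS target below are illustrative assumptions.

```python
import math

# Cores needed for constant throughput as per-core frequency drops at NTV.
# The 2 ops/cycle, frequencies, and 1000-GOPS target are illustrative.
def cores_needed(throughput_gops, freq_ghz, ops_per_cycle=2):
    per_core_gops = freq_ghz * ops_per_cycle
    return math.ceil(throughput_gops / per_core_gops)

print(cores_needed(1000, freq_ghz=2.0))  # nominal voltage -> 250 cores
print(cores_needed(1000, freq_ghz=0.4))  # NTV, ~5x slower -> 1250 cores
```

More cores means more traffic between them, which is exactly why the NTV operating point must be chosen at the system level rather than per core.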

  37. Architecture needs a Paradigm Shift [Table: architect's past and present priorities versus the architect's future priorities.] Must revisit and evaluate each (even legacy) architecture feature

  38. A non-architect’s biased view… Soft-errors Power, energy Interconnects Bandwidth Variations Resiliency

  39. Appealing to the Architects • Exploit transistor integration capacity • Dark Silicon is a self-fulfilling prophecy • Compute is cheap, reduce data movement • Use futuristic workloads for evaluation • Fearlessly break the legacy shackles

  40. Summary • Power & energy challenge continues • Opportunistically employ NTV operation • 3D integration for DRAM • Hierarchical, heterogeneous, tapered interconnect • Resiliency spanning the entire stack • Does computer architecture have a future? • Yes, if you acknowledge issues & challenges, and embrace the paradigm shift • No, if you keep “head buried in the sand”!
