The Future of Computer Architecture

Shekhar Borkar

Intel Corp.

June 9, 2012


Outline

  • Compute performance goals

  • Challenges & solutions for:

    • Compute

    • Memory

    • Interconnect

  • Importance of resiliency

  • Paradigm shift

  • Summary


Performance Roadmap

(Chart: performance roadmap over time for client and hand-held platforms.)


From Giga to Exa, via Tera & Peta


Energy per Operation

(Chart: energy per operation, with per-bit energy points marked at 100, 75, 25, and 10 pJ/bit.)


Where is the Energy Consumed?

(Diagram: where the energy is consumed, today versus goal. Compute: 100 pJ per FLOP. Memory: 0.1 Byte/FLOP at 1.5 nJ per Byte. Communication: 100 pJ per FLOP. Disk: 10 TB at 1 TB/disk and 10 W each. Today's annotations run from ~100-150 W per component up to system totals of ~2.5-3 KW, while the goal figures are in the 2-20 W range. Compute is bloated with inefficient architectural features: decode and control, address translations, and power supply losses.)

Voltage Scaling

(Chart: behavior when the design is built to voltage scale.)


Near Threshold-Voltage (NTV)

(Chart: energy efficiency in GOPS/Watt and active leakage power in mW versus supply voltage from 0.2 V to 1.4 V, for a 65nm CMOS design at 50°C. Energy efficiency peaks at about 9.6X the full-voltage value near the threshold voltage, around 320mV, and drops off in the subthreshold region below it; annotations mark "< 3 orders" and "4 orders" of magnitude changes across the voltage range.)

H. Kaul et al., paper 16.6, ISSCC 2008
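To make the shape of that curve concrete, here is a minimal first-order model of energy per operation versus supply voltage (a sketch; the alpha-power and leakage constants below are illustrative assumptions, not the measured 65nm data). The dynamic CV² term shrinks as Vdd drops while leakage energy per operation grows as the circuit slows down, so the total bottoms out near the threshold voltage.

```python
import numpy as np

VT = 0.32        # threshold voltage (V), matching the ~320mV sweet spot
ALPHA = 1.5      # velocity-saturation exponent (assumed)
I_LEAK = 0.05    # normalized leakage current (assumed)

vdd = np.linspace(0.35, 1.2, 200)
freq = (vdd - VT) ** ALPHA / vdd        # alpha-power law, normalized frequency
e_dynamic = vdd ** 2                    # CV^2 energy per operation (normalized)
e_leakage = I_LEAK * vdd / freq         # leakage energy grows as operations slow down
e_total = e_dynamic + e_leakage

v_best = vdd[np.argmin(e_total)]
gain = e_total[-1] / e_total.min()
print(f"Energy/op minimized near Vdd ~ {v_best:.2f} V "
      f"(~{gain:.1f}x better than at {vdd[-1]:.1f} V in this toy model)")
```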


NTV Across Technology Generations

NTV operation improves energy efficiency across 45nm-22nm CMOS


Impact of Variation on NTV

A 5% variation in Vt or Vdd results in a 20 to 50% variation in circuit performance at NTV


Mitigating Impact of Variation

1. Variation control with body biasing

Body effect is substantially reduced in advanced technologies

Energy cost of body biasing could become substantial

Fully-depleted transistors have no body left

(Diagram: a many-core die under variation, where individual cores natively run at f, f/2, or f/4.)

2. Variation tolerance at the system level

Example: Many-core System

Running all cores at full frequency exceeds the energy budget

Run each core at its native frequency

Law of large numbers: performance averages out
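A small simulation makes the averaging argument concrete (a sketch; the Gaussian spread and nominal frequency are assumed values, not data from the talk): individual cores vary widely, but total chip throughput concentrates tightly around the mean as the core count grows.

```python
import random

random.seed(1)
NOMINAL_GHZ = 2.0     # assumed nominal core frequency

def native_frequencies(n_cores, sigma_frac=0.25):
    """Per-core variation-limited frequencies (assumed Gaussian spread)."""
    return [max(0.1, random.gauss(NOMINAL_GHZ, sigma_frac * NOMINAL_GHZ))
            for _ in range(n_cores)]

for n in (4, 64, 1024):
    freqs = native_frequencies(n)
    total = sum(freqs)                              # aggregate throughput, core-GHz
    spread = (max(freqs) - min(freqs)) / NOMINAL_GHZ
    print(f"{n:5d} cores: per-core spread ~{spread:4.0%}, "
          f"total {total:8.1f} core-GHz ({total / n:.2f} GHz average)")
# Running every core at its own native frequency keeps the energy budget,
# while the law of large numbers smooths out chip-level performance.
```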


Subthreshold Leakage at NTV

NTV operation reduces total power and improves energy efficiency, but subthreshold leakage power becomes a substantial portion of the total, and variations increase.


Experimental NTV Processor

(Die photo and package: an IA-32 core of roughly 1.8 mm x 1.1 mm in 32nm CMOS, with logic, ROM, L1 instruction and data caches, scan, and level shifters plus a clock spine; packaged in a 951-pin FCBGA on a custom interposer and mounted on a legacy Socket-7 motherboard.)

S. Jain et al., “A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS”, ISSCC 2012


Power and Performance

(Chart: measured power and performance across the subthreshold, NTV, and super-threshold operating regions, with power broken down into logic dynamic, logic leakage, and memory leakage.)


Observations

Leakage power dominates

Fine-grained leakage power management is required


Memory & Storage Technologies

(Table: memory and storage technologies; endurance is flagged as an issue for some of them.)


Revise DRAM Architecture

Energy cost today: ~150 pJ/bit

Traditional DRAM (RAS address, then CAS address):

  • Activates many pages
  • Lots of reads and writes (refresh)
  • Only a small amount of the read data is used
  • Requires a small number of pins

New DRAM architecture:

  • Activates few pages
  • Reads and writes (refreshes) only what is needed
  • All read data is used
  • Requires a large number of IOs (3D)
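To see why ~150 pJ/bit is a problem, the sketch below converts access energy and bandwidth into DRAM interface power (a sketch; the improved pJ/bit targets and the bandwidth points are illustrative assumptions).

```python
# Memory power = bandwidth (bits/s) x energy per bit.
def dram_power_watts(bytes_per_s, pj_per_bit):
    return bytes_per_s * 8 * pj_per_bit * 1e-12

for pj_per_bit in (150, 10, 2):          # today's ~150 pJ/bit vs. assumed targets
    for tb_per_s in (0.1, 1.0):          # 100 GB/s and 1 TB/s of DRAM bandwidth
        p = dram_power_watts(tb_per_s * 1e12, pj_per_bit)
        print(f"{pj_per_bit:4d} pJ/bit @ {tb_per_s:4.1f} TB/s -> {p:8.1f} W")
# At exascale bandwidths (0.1 Byte/FLOP at 1 EFLOP/s is 1e17 B/s), anything
# near 150 pJ/bit lands in the hundred-megawatt range for memory alone.
```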


3D Integration of DRAM

(Diagram: DRAM dice stacked on a logic buffer die in the package.)

Thin Logic and DRAM die

Through silicon vias

Power delivery through logic die

Energy efficient, high speed IO to logic buffer

Detailed interface signals to DRAMs created on the logic die

The most promising solution for energy-efficient bandwidth


Communication Energy

(Chart: communication energy per bit at each level of the hierarchy: on die, chip to chip, board to board, and between cabinets.)


On-Die Communication Power

(Die photos and tile power breakdowns for two Intel research chips.)

80-Core TFLOP Chip (2006): 8 x 10 mesh with 32-bit links, 320 GB/sec bisection bandwidth at 5 GHz. Tile power breakdown: FPMACs 36%, router and links 28%, IMEM + DMEM 21%, clock distribution 11%, register file 4%.

48-Core Single Chip Cloud (2009): 2-core clusters in a 6 x 4 mesh (why not 6 x 8?), with 128-bit links, 256 GB/sec bisection bandwidth at 2 GHz, plus four DDR3 memory controllers and a system interface with I/O.
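The quoted bisection bandwidths follow directly from the mesh dimensions, link widths, and clock rates. A quick check (a sketch; it assumes the bisection cut severs 8 and 4 links respectively and counts both link directions):

```python
def bisection_gb_per_s(links_cut, link_bits, ghz):
    """Bandwidth across the bisection, counting both directions of each link."""
    return links_cut * 2 * (link_bits / 8) * ghz   # GB/s, since ghz is in 1e9/s

# 80-core TFLOP chip: 8 x 10 mesh, 32-bit links at 5 GHz, 8 links cut.
print(bisection_gb_per_s(8, 32, 5.0), "GB/s")      # -> 320.0
# 48-core SCC: 6 x 4 mesh of 2-core tiles, 128-bit links at 2 GHz, 4 links cut.
print(bisection_gb_per_s(4, 128, 2.0), "GB/s")     # -> 256.0
```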


On-die Communication Energy

Traditional homogeneous network: 0.08 pJ/bit per switch, 0.04 pJ/mm of wire.

Assuming a Byte/Flop of traffic, on-die communication energy is ~27 pJ/Flop.

  • Network power is too high (27 MW for an EFLOP machine; see the sketch below)

  • Worse if link width scales up each generation

  • Cache coherency mechanism is complex
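The sketch below rebuilds the ~27 pJ/Flop estimate and the exascale network power from the per-switch and per-millimeter figures above (a sketch; the 20-hop average route and 2 mm hop length are assumptions chosen to illustrate how the total accumulates, not numbers from the talk).

```python
PJ_PER_BIT_PER_SWITCH = 0.08
PJ_PER_BIT_PER_MM = 0.04
BITS_PER_FLOP = 8                         # "assume Byte/Flop"

def network_pj_per_flop(hops, mm_per_hop):
    per_bit = hops * PJ_PER_BIT_PER_SWITCH + hops * mm_per_hop * PJ_PER_BIT_PER_MM
    return per_bit * BITS_PER_FLOP

e_flop = network_pj_per_flop(hops=20, mm_per_hop=2.0)      # assumed average route
print(f"~{e_flop:.0f} pJ/Flop of on-die network energy")    # ~26 pJ/Flop
print(f"~{e_flop * 1e-12 * 1e18 / 1e6:.0f} MW of network power at 1 EFLOP/s")
```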


Packet Switched Interconnect

(Diagram: a packet crossing the die through a series of routers. Each router hop costs roughly 2-7 clocks and ~0.5-1 mW, while the 5-10 mm wire segments between routers take less than a clock.)

  • Routers act like STOP signs, adding latency

  • Each hop consumes power (much of it unnecessarily)


Mesh—Retrospective

  • Bus: Good at board level, does not extend well

    • Transmission line issues: loss and signal integrity, limited frequency

    • Width is limited by pins and board area

    • Broadcast, simple to implement

  • Point to point busses: fast signaling over longer distance

    • Board level, between boards, and racks

    • High frequency, narrow links

    • 1D Ring, 2D Mesh and Torus to reduce latency

    • Higher complexity and latency in each node

  • Hence, emergence of packet switched network

But does a point-to-point packet-switched network make sense on a chip?


Interconnect Delay & Energy


Bus—The Other Extreme…

Issues:

Slow, < 300MHz

Shared, limited scalability?

Solutions:

Repeaters to increase freq

Wide busses for bandwidth

Multiple busses for scalability

Benefits:

Power?

Simpler cache coherency

Move away from frequency, embrace parallelism


Repeated Bus (Circuit Switched)

Arbitration:

Arbitrate in each cycle for the next cycle

Decision visible to all nodes

Repeaters:

Align repeater direction

No driving contention

(Diagram: a repeated bus spanning the die, driven from one node through a chain of repeaters (R). Assumptions: 10mm die, 1.5µm bus pitch, 50ps repeater delay.)

Anders et al., “A 2.9Tb/s 8W 64-Core Circuit-switched Network-on-Chip in 45nm CMOS”, ESSCIRC 2008

Anders et al., “A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS”, ISSCC 2008
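A rough feel for the achievable bus speed, under the slide's assumptions plus an assumed repeater spacing (the 1 mm spacing and the single-cycle-traversal requirement are illustrative assumptions):

```python
DIE_MM = 10.0                    # die size from the slide
REPEATER_DELAY_PS = 50.0         # per-repeater delay from the slide
SPACING_MM = 1.0                 # assumed distance between repeaters

repeaters = int(DIE_MM / SPACING_MM)
traversal_ps = repeaters * REPEATER_DELAY_PS
print(f"{repeaters} repeaters, ~{traversal_ps:.0f} ps to cross the die")
print(f"~{1e3 / traversal_ps:.1f} GHz if the bus is crossed in a single cycle")
# Roughly 2 GHz: far better than the <300 MHz of an unrepeated shared bus,
# with no per-hop routers in the data path.
```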


A Circuit Switched Network

  • Circuit-switched NoC eliminates intra-route data storage

  • Packet-switching used only for channel requests

    • High bandwidth and energy efficiency (1.6 to 0.6 pJ/bit)

Anders et al., “A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS”, ISSCC 2008


Hierarchical & Heterogeneous

(Diagram: a hierarchy of busses. Clusters of cores (C) share a local bus, and routers (R) bridge the local busses onto a 2nd-level bus.)

Use a bus to connect over short distances, or build hierarchical circuit- and packet-switched networks.


Tapered Interconnect Fabric

Tapered, but over-provisioned bandwidth

Pay (energy) as you go, and as you can afford


But wait, what about Optical?

(Chart: energy breakdown of a 65nm optical link: pre-driver, driver, VCSEL, TIA, LA, DeMUX, CDR, and clock buffer, comparing what is possible today with hopeful projections.)

Source: Hirotaka Tamura (Fujitsu), ISSCC 2008 Workshop on High-Speed Transceivers


Impact of Exploding Parallelism

(Chart annotations: almost flat because Vdd is close to Vt; a 4X increase in the number of cores, i.e., parallelism.)

Increased communication and related energy

Increased hardware and unreliability

1. Strike a balance between communication and computation

2. Resiliency (gradual, intermittent, and permanent faults)


Road to Unreliability?

Resiliency will be the cornerstone


Soft Errors and Reliability

The soft error rate per bit decreases with each generation

NTV has only a nominal impact on the soft error rate

Soft errors at the system level will continue to increase

Positive impact of NTV on reliability:

Lower voltage means lower electric fields, and lower power means lower temperature

Device aging effects will be less of a concern

Fewer electromigration-related defects

N. Seifert et al., "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006


Resiliency

Minimal overhead for resiliency: error detection, fault isolation, fault confinement, reconfiguration, recovery and adaptation

(Diagram: these functions span the entire stack: applications, system software, programming system, microcode and platform, microarchitecture, and circuit & design.)


Compute vs Data Movement

1 bit Xfer over 3G

1mJ

1 bit Xfer over WiFi

1 bit Xfer over Ethernet

1 bit Xfer using Bluetooth

1nJ

1 bit over link

R/W Operands (RF)

Read a bit from internal SRAM

Computation

Read a bit from DRAM

Instruction Execution

1pJ

Data movement energy will dominate


System Level Optimization

(Chart: energy versus supply voltage for compute and for global interconnect.)

Compute energy decreases faster than global interconnect energy

For constant throughput, NTV demands more parallelism

This increases data movement at the system level

System-level optimization is required to determine the NTV operating point
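The trade-off can be sketched with a toy model (all constants below are illustrative assumptions): lowering Vdd cuts compute energy per operation, but keeping throughput constant requires more cores, and the extra cores increase global data movement, so the energy-optimal operating point sits above the compute-only optimum.

```python
import numpy as np

VT, ALPHA = 0.32, 1.5
vdd = np.linspace(0.40, 1.1, 200)

freq = (vdd - VT) ** ALPHA / vdd               # normalized per-core frequency
cores_needed = 1.0 / freq                      # cores for constant total throughput
compute_energy = vdd ** 2                      # CV^2 per operation, normalized
# Assume global traffic grows ~sqrt(cores) at a fixed (non-scaling) cost per bit:
interconnect_energy = 0.15 * np.sqrt(cores_needed)

total = compute_energy + interconnect_energy
v_opt = vdd[np.argmin(total)]
print(f"Total-energy optimum near Vdd ~ {v_opt:.2f} V, above the "
      f"compute-only optimum of {vdd[np.argmin(compute_energy)]:.2f} V")
```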


Architecture needs a Paradigm Shift

(Comparison: the architect's past and present priorities versus what the architect's future priorities should be.)

Must revisit and evaluate each architectural feature, even legacy ones


A non-architect’s biased view…

Soft errors

Power, energy

Interconnects

Bandwidth

Variations

Resiliency


Appealing to the Architects

Exploit transistor integration capacity

Dark Silicon is a self-fulfilling prophecy

Compute is cheap, reduce data movement

Use futuristic workloads for evaluation

Fearlessly break the legacy shackles


Summary

  • Power & energy challenge continues

  • Opportunistically employ NTV operation

  • 3D integration for DRAM

  • Hierarchical, heterogeneous, tapered interconnect

  • Resiliency spanning the entire stack

  • Does computer architecture have a future?

    • Yes, if you acknowledge issues & challenges, and embrace the paradigm shift

    • No, if you keep your “head buried in the sand”!

