
The Future of Computer Architecture

Shekhar Borkar

Intel Corp.

June 9, 2012


Outline

  • Compute performance goals
  • Challenges & solutions for:
    • Compute
    • Memory
    • Interconnect
  • Importance of resiliency
  • Paradigm shift
  • Summary


Performance Roadmap

[Figure: projected compute performance roadmap for client and hand-held platforms]



Energy per Operation

[Figure: energy-per-operation trend, with labeled points at 100 pJ/bit, 75 pJ/bit, 25 pJ/bit, and 10 pJ/bit]


Where is the Energy Consumed?

Today's designs are bloated with inefficient architectural features: decode and control, address translations, power supply losses, and the like.

[Figure: energy budget of a ~2.5-3KW system today vs. the goal. Today: Compute at 100pJ per FLOP (100W), Com at 100pJ of communication per FLOP (100W), Memory at 0.1B/FLOP @ 1.5nJ per Byte (150W), Disk at 10TB @ 1TB/disk @ 10W per disk (100W). Goal: single-digit watts to ~20W per component (the slide's goal labels: 5W, 5W, ~20W, ~5W, ~3W, 2W).]
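These wattages follow from the per-operation energies once a machine rate is fixed. A minimal arithmetic check, assuming a 1 TFLOP/s machine (my assumption, chosen because it makes the slide's numbers line up):

```python
# Back-of-envelope check of the slide's power budget.
FLOPS = 1e12  # sustained FLOP/s (assumed)

compute_w = 100e-12 * FLOPS        # 100 pJ per FLOP        -> 100 W
comm_w    = 100e-12 * FLOPS        # 100 pJ of com per FLOP -> 100 W
memory_w  = 0.1 * 1.5e-9 * FLOPS   # 0.1 B/FLOP @ 1.5 nJ/B  -> 150 W
disk_w    = 10 * 10.0              # 10 TB @ 1 TB/disk @ 10 W/disk -> 100 W

print(f"compute={compute_w:.0f}W com={comm_w:.0f}W "
      f"memory={memory_w:.0f}W disk={disk_w:.0f}W")
```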


Voltage Scaling

[Figure: energy and performance scaling with supply voltage, when the design is built to voltage scale]


Near Threshold-Voltage (NTV)

[Figure: measured energy efficiency (GOPS/Watt) and active leakage power (mW) vs. supply voltage from 0.2V to 1.4V, 65nm CMOS at 50°C. Energy efficiency peaks with a 9.6X gain near the threshold voltage (~320mV), just above the subthreshold region; across the same voltage range the curves span "< 3 orders" and "4 orders" of magnitude.]

H. Kaul et al., paper 16.6, ISSCC 2008.
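The shape of that curve falls out of a toy model: dynamic energy shrinks as V², while leakage energy per operation grows as the cycle time stretches near threshold. A minimal sketch with illustrative constants (my assumptions, not the measured data above):

```python
import numpy as np

# Toy energy-per-operation model: CV^2 dynamic energy plus leakage power
# integrated over an alpha-power-law cycle time that explodes near Vt.
VT, ALPHA = 0.32, 1.5      # threshold voltage (V), velocity-saturation exponent
C_EFF, I_LEAK = 1.0, 0.05  # normalized switched capacitance, leakage strength

def energy_per_op(vdd):
    delay = vdd / np.maximum(vdd - VT, 1e-3) ** ALPHA  # relative cycle time
    return C_EFF * vdd**2 + I_LEAK * vdd * delay       # dynamic + leakage energy

vdd = np.linspace(0.4, 1.2, 81)
e = energy_per_op(vdd)
print(f"energy/op minimized near Vdd = {vdd[np.argmin(e)]:.2f} V (Vt = {VT} V)")
```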


NTV Across Technology Generations

NTV operation improves energy efficiency across 45nm-22nm CMOS


Impact of Variation on NTV

A 5% variation in Vt or Vdd results in a 20-50% variation in circuit performance.
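To see why, plug a 5% droop into a standard alpha-power-law delay model (an approximation of mine, not the talk's data): near threshold the gate overdrive Vdd − Vt is tiny, so the droop eats a large fraction of it.

```python
# Relative speed ~ (Vdd - Vt)^alpha / Vdd under the alpha-power law.
def rel_speed(vdd, vt=0.32, alpha=1.5):
    return (vdd - vt) ** alpha / vdd

nominal = rel_speed(0.40)
drooped = rel_speed(0.40 * 0.95)  # 5% Vdd droop at NTV
print(f"5% Vdd droop at NTV  -> {100 * (1 - drooped / nominal):.0f}% slower")
# The same droop at nominal voltage costs far less:
print(f"5% Vdd droop at 1.0V -> "
      f"{100 * (1 - rel_speed(0.95) / rel_speed(1.0)):.0f}% slower")
```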


Mitigating Impact of Variation

1. Variation control with body biasing

  • Body effect is substantially reduced in advanced technologies
  • Energy cost of body biasing could become substantial
  • Fully-depleted transistors have no body left to bias

[Figure: a many-core die in which variation leaves otherwise identical cores with different native frequencies (f, f/2, f/4)]

2. Variation tolerance at the system level

  • Example: a many-core system
  • Running all cores at full frequency exceeds the energy budget
  • Instead, run each core at its native frequency
  • Law of large numbers: the variations average out (see the sketch below)
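The averaging claim is easy to check in simulation. A minimal sketch, with an assumed ±30% spread in per-core native frequency (the spread is illustrative, not from the talk):

```python
import random

# Individual cores vary widely, but total chip throughput is predictable.
random.seed(1)

def chip_relative_throughput(n_cores):
    return sum(random.uniform(0.7, 1.3) for _ in range(n_cores)) / n_cores

for n in (4, 64, 1024):
    chips = [chip_relative_throughput(n) for _ in range(1000)]
    print(f"{n:4d} cores: chip-to-chip throughput range "
          f"{min(chips):.2f} .. {max(chips):.2f}")
```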


Subthreshold Leakage at NTV

[Figure: total power vs. supply voltage, with variations increasing as voltage scales down toward NTV]

  • NTV operation reduces total power and improves energy efficiency
  • Subthreshold leakage power is a substantial portion of the total


Experimental NTV Processor

[Die photo and platform: the 1.1mm × 1.8mm IA-32 core (logic, L1$-I, L1$-D, ROM, scan, level shifters + clock spine) in a 951-pin FCBGA package, mounted on a custom interposer in a legacy Socket-7 motherboard]

S. Jain et al., "A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS", ISSCC 2012.


Power and Performance

[Figure: power vs. supply voltage across the subthreshold, NTV, and super-threshold regions, broken down into logic dynamic, logic leakage, and memory leakage power]


Observations

Leakage power dominates

Fine-grain leakage power management is required



Revise DRAM Architecture

Energy cost today: ~150 pJ/bit.

Traditional DRAM (page selected by RAS address, column by CAS address):

  • Activates many pages per access
  • Lots of reads and writes (refresh)
  • Only a small amount of the read data is used
  • Requires a small number of pins

New DRAM architecture:

  • Activates few pages
  • Reads and writes (refreshes) only what is needed
  • All read data is used
  • Requires a large number of IOs (3D)
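The waste factor behind that cost can be sketched with assumed page and cache-line sizes (both mine, for illustration):

```python
# A conventional DRAM activates a whole page, but the processor consumes
# one cache line, so activation (and refresh) energy is amortized over
# very few useful bits.
PAGE_BYTES = 8 * 1024   # page opened per RAS (assumption)
LINE_BYTES = 64         # data actually used per access (assumption)

waste = PAGE_BYTES / LINE_BYTES
print(f"{waste:.0f}x more bits activated than used; activating only what "
      f"is read attacks the ~150 pJ/bit cost directly")
```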


3D Integration of DRAM

[Diagram: thin DRAM dies stacked on a logic buffer die in the package, connected by through-silicon vias]

  • Thin logic and DRAM dies
  • Through-silicon vias
  • Power delivery through the logic die
  • Energy-efficient, high-speed IO to the logic buffer
  • Detailed interface signals to the DRAMs are created on the logic die
  • The most promising solution for energy-efficient bandwidth


Communication Energy

[Figure: communication energy per bit at each level of the hierarchy: on die, chip to chip, board to board, and between cabinets]


On-Die Communication Power

[Figure: annotated die photos of the 80-Core TFLOP Chip (2006) and the 48-Core Single-Chip Cloud (2009, 21.4mm × 26.5mm die). Tile power breakdown: dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port RF 4%. The SCC die also shows the DDR3 memory controllers, PLL, JTAG, VRC, and system interface + I/O.]

  • 80-core TFLOP chip: 8 × 10 mesh, 32-bit links, 320 GB/sec bisection BW @ 5 GHz
  • 48-core SCC: 2-core clusters in a 6 × 4 mesh (why not 6 × 8?), 128-bit links, 256 GB/sec bisection BW @ 2 GHz
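Both bisection-bandwidth figures check out from the link widths and clock rates. A quick sketch (the cut placement and link counts are my reading of the mesh dimensions):

```python
# Cut the mesh across its shorter dimension; one bidirectional link per
# row/column crosses the cut (assumption).
def bisection_gb_per_s(links_cut, link_bits, freq_ghz):
    per_link_per_dir = link_bits / 8 * freq_ghz   # GB/s, one direction
    return links_cut * 2 * per_link_per_dir       # both directions

print(bisection_gb_per_s(8, 32, 5.0))    # 80-core, 8x10 mesh -> 320.0 GB/s
print(bisection_gb_per_s(4, 128, 2.0))   # SCC, 6x4 mesh      -> 256.0 GB/s
```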


On-die Communication Energy

Traditional homogeneous network, assuming a Byte/Flop of traffic:

  • 0.08 pJ/bit per switch traversal
  • 0.04 pJ/mm of wire
  • ~27 pJ/Flop of on-die communication energy

  • Network power too high (27MW for an EFLOP machine)
  • Worse if link width scales up each generation
  • Cache coherency mechanism is complex
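The 27 pJ/Flop figure is reproducible from the slide's switch and wire energies. In this sketch the average hop count and wire length per byte are my assumptions, chosen to match the quoted total:

```python
E_SWITCH = 0.08e-12   # J per bit per router traversal (from the slide)
E_WIRE   = 0.04e-12   # J per bit per mm of wire (from the slide)
BITS_PER_FLOP = 8     # one Byte of traffic per Flop (slide's assumption)

HOPS, WIRE_MM = 20, 45  # assumed average route on a large many-core die
e_flop = BITS_PER_FLOP * (HOPS * E_SWITCH + WIRE_MM * E_WIRE)

print(f"{e_flop * 1e12:.1f} pJ/Flop")                # ~27 pJ/Flop
print(f"{e_flop * 1e18 / 1e6:.0f} MW at 1 EFLOP/s")  # ~27 MW
```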


Packet Switched Interconnect

[Diagram: a route across the die through four routers, drawn as "STOP" signs. Each router costs 3-5 clocks and ~0.5mW per traversal, while the 5-10mm wire segments between them cost <1 to 2-3 clocks; the stops dominate the end-to-end latency and power.]

  • Routers act like STOP signs: they add latency

  • Each hop consumes power unnecessarily
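Accumulated over even a short route, those stops dominate. A small sketch using the slide's per-router figures and an assumed 5-router path (the path length is my assumption):

```python
ROUTERS = 5
clk_min = ROUTERS * 3 + ROUTERS * 1   # 3 clocks/router + ~1 clock/wire segment
clk_max = ROUTERS * 5 + ROUTERS * 1   # 5 clocks/router + ~1 clock/wire segment
power_mw = ROUTERS * 0.5              # ~0.5 mW per router traversal

print(f"router overhead: {clk_min}-{clk_max} clocks and ~{power_mw} mW, "
      f"paid on every traversal even when the path never changes")
```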


Mesh—Retrospective

  • Bus: Good at board level, does not extend well

    • Transmission line issues: loss and signal integrity, limited frequency

    • Width is limited by pins and board area

    • Broadcast, simple to implement

  • Point to point busses: fast signaling over longer distance

    • Board level, between boards, and racks

    • High frequency, narrow links

    • 1D Ring, 2D Mesh and Torus to reduce latency

    • Higher complexity and latency in each node

  • Hence, emergence of packet switched network

But does a point-to-point packet-switched network belong on a chip?



Bus—The Other Extreme…

Issues:

  • Slow, < 300MHz
  • Shared, hence limited scalability?

Solutions:

  • Repeaters to increase frequency
  • Wide busses for bandwidth
  • Multiple busses for scalability

Benefits:

  • Power?
  • Simpler cache coherency

Move away from frequency; embrace parallelism.


Repeated Bus (Circuit Switched)

Arbitration:

  • Performed each cycle, for the next cycle
  • Decision visible to all nodes

Repeaters:

  • Align repeater direction
  • No driving contention

[Diagram: a bus spanning the die, driven through a chain of aligned repeaters (R)]

Assume: 10mm die, 1.5µm bus pitch, 50ps repeater delay.

Anders et al., "A 2.9Tb/s 8W 64-Core Circuit-Switched Network-on-Chip in 45nm CMOS", ESSCIRC 2008.

Anders et al., "A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS", ISSCC 2008.
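With the slide's assumptions, a repeated bus is fast enough to be interesting. A quick feasibility sketch (the 1mm repeater spacing is my assumption):

```python
DIE_MM, STAGE_MM, T_REPEATER = 10.0, 1.0, 50e-12  # die span, spacing, delay/stage

stages = DIE_MM / STAGE_MM
end_to_end_s = stages * T_REPEATER                # 10 stages x 50 ps = 500 ps
print(f"end-to-end: {end_to_end_s * 1e12:.0f} ps -> "
      f"~{1 / end_to_end_s / 1e9:.0f} GHz if arbitration pipelines the bus each cycle")
```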


A Circuit Switched Network

  • Circuit-switched NoC eliminates intra-route data storage

  • Packet-switching used only for channel requests

    • High bandwidth and energy efficiency (1.6 to 0.6 pJ/bit)

Anders et al., "A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS", ISSCC 2008.


Hierarchical & Heterogeneous

[Diagram: a hierarchy of busses; clusters of cores (C) share local busses and are bridged through routers (R) onto a 2nd-level bus]

  • Busses to connect over short distances
  • Or: hierarchical circuit- and packet-switched networks


Tapered Interconnect Fabric

Tapered, but over-provisioned bandwidth

Pay (energy) as you go (afford)


But wait, what about Optical?

[Figure: energy breakdown of a 65nm optical link (pre-driver, driver, VCSEL, TIA, LA, DeMUX, CDR, clock buffer), contrasting what is possible today with hopeful projections]

Source: Hirotaka Tamura (Fujitsu), ISSCC 2008 Workshop on HS Transceivers.


Impact of Exploding Parallelism

[Figure: near NTV the frequency curve is almost flat because Vdd is close to Vt, so holding throughput constant requires a ~4X increase in the number of cores (parallelism).]

  • Increased communication and related energy
  • Increased hardware, and with it unreliability

Hence:

1. Strike a balance between communication & computation
2. Resiliency (gradual, intermittent, and permanent faults)


Road to Unreliability?

Resiliency will be the cornerstone.


Soft Errors and Reliability

  • Soft error rate per bit reduces each generation
  • NTV has only a nominal impact on the soft error rate
  • Soft errors at the system level will nevertheless continue to increase
  • NTV has a positive impact on reliability:
    • Low voltage → lower E-fields; low power → lower temperature
    • Device aging effects will be less of a concern
    • Fewer electromigration-related defects

N. Seifert et al., "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006.


Resiliency

Goal: minimal overhead for resiliency.

Resiliency loop: error detection → fault isolation → fault confinement → reconfiguration → recovery & adaptation.

The mechanisms must span the entire stack: applications, system software, programming system, microcode & platform, microarchitecture, circuit & design.


Compute vs Data Movement

[Figure: energy per bit on a log scale, from ~1mJ down to ~1pJ:]

  • 1 bit transferred over 3G: ~1mJ
  • 1 bit transferred over WiFi
  • 1 bit transferred over Ethernet
  • 1 bit transferred over Bluetooth: ~1nJ
  • 1 bit over a chip-to-chip link
  • R/W operands (register file)
  • Read a bit from internal SRAM
  • Read a bit from DRAM
  • Computation (instruction execution): ~1pJ

Data movement energy will dominate.
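A sketch using only the three labeled anchors of the ladder (the intermediate rungs are ordered on the slide but not numbered, so they are omitted here):

```python
E_3G_RADIO  = 1e-3    # J per bit over a 3G radio        (~1 mJ)
E_BLUETOOTH = 1e-9    # J per bit over Bluetooth         (~1 nJ)
E_COMPUTE   = 1e-12   # J per bit of on-die computation  (~1 pJ)

print(f"radio vs. compute:     {E_3G_RADIO / E_COMPUTE:.0e}x")   # ~1e+09x
print(f"Bluetooth vs. compute: {E_BLUETOOTH / E_COMPUTE:.0e}x")  # ~1e+03x
```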


System Level Optimization

[Figure: energy vs. supply voltage for compute and for global interconnect]

  • Compute energy reduces faster than global interconnect energy as voltage scales down
  • For constant throughput, NTV demands more parallelism
  • That parallelism increases data movement at the system level
  • System-level optimization is required to determine the NTV operating point


Architecture needs a Paradigm Shift

Architect's past and present priorities, versus what the future priorities should be [the slide's two ranked lists are not captured in this transcript].

Must revisit and evaluate every architectural feature, even the legacy ones.


A non-architect’s biased view…

  • Soft errors
  • Power & energy
  • Interconnects
  • Bandwidth
  • Variations
  • Resiliency


Appealing to the Architects

  • Exploit the transistor integration capacity
  • Dark Silicon is a self-fulfilling prophecy
  • Compute is cheap; reduce data movement
  • Use futuristic workloads for evaluation
  • Fearlessly break the legacy shackles


Summary

  • Power & energy challenge continues

  • Opportunistically employ NTV operation

  • 3D integration for DRAM

  • Hierarchical, heterogeneous, tapered interconnect

  • Resiliency spanning the entire stack

  • Does computer architecture have a future?

    • Yes, if you acknowledge issues & challenges, and embrace the paradigm shift

    • No, if you keep “head buried in the sand”!

