The Future of Computer Architecture

Shekhar Borkar

Intel Corp.

June 9, 2012


Outline

  • Compute performance goals

  • Challenges & solutions for:

    • Compute

    • Memory

    • Interconnect

  • Importance of resiliency

  • Paradigm shift

  • Summary


Performance Roadmap

(Chart: performance roadmap over time for client and hand-held platforms.)


From Giga to Exa, via Tera & Peta


Energy per Operation

(Chart: energy per operation, with per-bit energy points marked at 100, 75, 25, and 10 pJ/bit.)


Where is the Energy Consumed?

(Diagram: where the energy is consumed, today versus goal. Compute: 100 pJ per FLOP. Memory: 0.1 Byte/FLOP at 1.5 nJ per Byte. Communication: 100 pJ per FLOP. Disk: 10 TB at 1 TB/disk and 10 W each. Today's annotations run from ~100-150 W per component up to system totals of ~2.5-3 KW, while the goal figures are in the 2-20 W range. Compute is bloated with inefficient architectural features: decode and control, address translations, and power supply losses.)

Voltage Scaling

(Chart: behavior when the design is built to voltage scale.)


Near Threshold-Voltage (NTV)

(Chart: energy efficiency in GOPS/Watt and active leakage power in mW versus supply voltage from 0.2 V to 1.4 V, for a 65nm CMOS design at 50°C. Energy efficiency peaks at about 9.6X the full-voltage value near the threshold voltage, around 320mV, and drops off in the subthreshold region below it; annotations mark "< 3 orders" and "4 orders" of magnitude changes across the voltage range.)

H. Kaul et al., paper 16.6, ISSCC 2008
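To make the shape of that curve concrete, here is a minimal first-order model of energy per operation versus supply voltage (a sketch; the alpha-power and leakage constants below are illustrative assumptions, not the measured 65nm data). The dynamic CV² term shrinks as Vdd drops while leakage energy per operation grows as the circuit slows down, so the total bottoms out near the threshold voltage.

```python
import numpy as np

VT = 0.32        # threshold voltage (V), matching the ~320mV sweet spot
ALPHA = 1.5      # velocity-saturation exponent (assumed)
I_LEAK = 0.05    # normalized leakage current (assumed)

vdd = np.linspace(0.35, 1.2, 200)
freq = (vdd - VT) ** ALPHA / vdd        # alpha-power law, normalized frequency
e_dynamic = vdd ** 2                    # CV^2 energy per operation (normalized)
e_leakage = I_LEAK * vdd / freq         # leakage energy grows as operations slow down
e_total = e_dynamic + e_leakage

v_best = vdd[np.argmin(e_total)]
gain = e_total[-1] / e_total.min()
print(f"Energy/op minimized near Vdd ~ {v_best:.2f} V "
      f"(~{gain:.1f}x better than at {vdd[-1]:.1f} V in this toy model)")
```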


NTV Across Technology Generations

NTV operation improves energy efficiency across 45nm-22nm CMOS


Impact of Variation on NTV

A 5% variation in Vt or Vdd results in a 20 to 50% variation in circuit performance at NTV


Mitigating Impact of Variation

1. Variation control with body biasing

Body effect is substantially reduced in advanced technologies

Energy cost of body biasing could become substantial

Fully-depleted transistors have no body left

(Diagram: a many-core die under variation, where individual cores natively run at f, f/2, or f/4.)

2. Variation tolerance at the system level

Example: Many-core System

Running all cores at full frequency exceeds the energy budget

Run each core at its native frequency

Law of large numbers: performance averages out
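A small simulation makes the averaging argument concrete (a sketch; the Gaussian spread and nominal frequency are assumed values, not data from the talk): individual cores vary widely, but total chip throughput concentrates tightly around the mean as the core count grows.

```python
import random

random.seed(1)
NOMINAL_GHZ = 2.0     # assumed nominal core frequency

def native_frequencies(n_cores, sigma_frac=0.25):
    """Per-core variation-limited frequencies (assumed Gaussian spread)."""
    return [max(0.1, random.gauss(NOMINAL_GHZ, sigma_frac * NOMINAL_GHZ))
            for _ in range(n_cores)]

for n in (4, 64, 1024):
    freqs = native_frequencies(n)
    total = sum(freqs)                              # aggregate throughput, core-GHz
    spread = (max(freqs) - min(freqs)) / NOMINAL_GHZ
    print(f"{n:5d} cores: per-core spread ~{spread:4.0%}, "
          f"total {total:8.1f} core-GHz ({total / n:.2f} GHz average)")
# Running every core at its own native frequency keeps the energy budget,
# while the law of large numbers smooths out chip-level performance.
```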


Subthreshold Leakage at NTV

NTV operation reduces total power and improves energy efficiency, but subthreshold leakage power becomes a substantial portion of the total, and variations increase.


Experimental NTV Processor

(Die photo and package: an IA-32 core of roughly 1.8 mm x 1.1 mm in 32nm CMOS, with logic, ROM, L1 instruction and data caches, scan, and level shifters plus a clock spine; packaged in a 951-pin FCBGA on a custom interposer and mounted on a legacy Socket-7 motherboard.)

S. Jain et al., “A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS”, ISSCC 2012


Power and Performance

(Chart: measured power and performance across the subthreshold, NTV, and super-threshold operating regions, with power broken down into logic dynamic, logic leakage, and memory leakage.)


Observations

Leakage power dominates

Fine-grained leakage power management is required


Memory & Storage Technologies

(Table: memory and storage technologies; endurance is flagged as an issue for some of them.)


Revise DRAM Architecture

Energy cost today: ~150 pJ/bit

Traditional DRAM (RAS address, then CAS address):

  • Activates many pages
  • Lots of reads and writes (refresh)
  • Only a small amount of the read data is used
  • Requires a small number of pins

New DRAM architecture:

  • Activates few pages
  • Reads and writes (refreshes) only what is needed
  • All read data is used
  • Requires a large number of IOs (3D)
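To see why ~150 pJ/bit is a problem, the sketch below converts access energy and bandwidth into DRAM interface power (a sketch; the improved pJ/bit targets and the bandwidth points are illustrative assumptions).

```python
# Memory power = bandwidth (bits/s) x energy per bit.
def dram_power_watts(bytes_per_s, pj_per_bit):
    return bytes_per_s * 8 * pj_per_bit * 1e-12

for pj_per_bit in (150, 10, 2):          # today's ~150 pJ/bit vs. assumed targets
    for tb_per_s in (0.1, 1.0):          # 100 GB/s and 1 TB/s of DRAM bandwidth
        p = dram_power_watts(tb_per_s * 1e12, pj_per_bit)
        print(f"{pj_per_bit:4d} pJ/bit @ {tb_per_s:4.1f} TB/s -> {p:8.1f} W")
# At exascale bandwidths (0.1 Byte/FLOP at 1 EFLOP/s is 1e17 B/s), anything
# near 150 pJ/bit lands in the hundred-megawatt range for memory alone.
```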


3D Integration of DRAM

(Diagram: DRAM dice stacked on a logic buffer die in the package.)

Thin Logic and DRAM die

Through silicon vias

Power delivery through logic die

Energy efficient, high speed IO to logic buffer

Detailed interface signals to DRAMs created on the logic die

The most promising solution for energy-efficient bandwidth


Communication Energy

(Chart: communication energy per bit at each level of the hierarchy: on die, chip to chip, board to board, and between cabinets.)


On-Die Communication Power

(Die photos and tile power breakdowns for two Intel research chips.)

80-Core TFLOP Chip (2006): 8 x 10 mesh with 32-bit links, 320 GB/sec bisection bandwidth at 5 GHz. Tile power breakdown: FPMACs 36%, router and links 28%, IMEM + DMEM 21%, clock distribution 11%, register file 4%.

48-Core Single Chip Cloud (2009): 2-core clusters in a 6 x 4 mesh (why not 6 x 8?), with 128-bit links, 256 GB/sec bisection bandwidth at 2 GHz, plus four DDR3 memory controllers and a system interface with I/O.
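The quoted bisection bandwidths follow directly from the mesh dimensions, link widths, and clock rates. A quick check (a sketch; it assumes the bisection cut severs 8 and 4 links respectively and counts both link directions):

```python
def bisection_gb_per_s(links_cut, link_bits, ghz):
    """Bandwidth across the bisection, counting both directions of each link."""
    return links_cut * 2 * (link_bits / 8) * ghz   # GB/s, since ghz is in 1e9/s

# 80-core TFLOP chip: 8 x 10 mesh, 32-bit links at 5 GHz, 8 links cut.
print(bisection_gb_per_s(8, 32, 5.0), "GB/s")      # -> 320.0
# 48-core SCC: 6 x 4 mesh of 2-core tiles, 128-bit links at 2 GHz, 4 links cut.
print(bisection_gb_per_s(4, 128, 2.0), "GB/s")     # -> 256.0
```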


On-die Communication Energy

Traditional homogeneous network: 0.08 pJ/bit per switch, 0.04 pJ/mm of wire.

Assuming a Byte/Flop of traffic, on-die communication energy is ~27 pJ/Flop.

  • Network power is too high (27 MW for an EFLOP machine; see the sketch below)

  • Worse if link width scales up each generation

  • Cache coherency mechanism is complex
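The sketch below rebuilds the ~27 pJ/Flop estimate and the exascale network power from the per-switch and per-millimeter figures above (a sketch; the 20-hop average route and 2 mm hop length are assumptions chosen to illustrate how the total accumulates, not numbers from the talk).

```python
PJ_PER_BIT_PER_SWITCH = 0.08
PJ_PER_BIT_PER_MM = 0.04
BITS_PER_FLOP = 8                         # "assume Byte/Flop"

def network_pj_per_flop(hops, mm_per_hop):
    per_bit = hops * PJ_PER_BIT_PER_SWITCH + hops * mm_per_hop * PJ_PER_BIT_PER_MM
    return per_bit * BITS_PER_FLOP

e_flop = network_pj_per_flop(hops=20, mm_per_hop=2.0)      # assumed average route
print(f"~{e_flop:.0f} pJ/Flop of on-die network energy")    # ~26 pJ/Flop
print(f"~{e_flop * 1e-12 * 1e18 / 1e6:.0f} MW of network power at 1 EFLOP/s")
```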


Packet Switched Interconnect

(Diagram: a packet crossing the die through a series of routers. Each router hop costs roughly 2-7 clocks and ~0.5-1 mW, while the 5-10 mm wire segments between routers take less than a clock.)

  • Routers act like STOP signs, adding latency

  • Each hop consumes power (much of it unnecessarily)


Mesh—Retrospective

  • Bus: Good at board level, does not extend well

    • Transmission line issues: loss and signal integrity, limited frequency

    • Width is limited by pins and board area

    • Broadcast, simple to implement

  • Point to point busses: fast signaling over longer distance

    • Board level, between boards, and racks

    • High frequency, narrow links

    • 1D Ring, 2D Mesh and Torus to reduce latency

    • Higher complexity and latency in each node

  • Hence, emergence of packet switched network

But does a point-to-point packet-switched network make sense on a chip?


Interconnect Delay & Energy


Bus—The Other Extreme…

Issues:

Slow, < 300MHz

Shared, limited scalability?

Solutions:

Repeaters to increase freq

Wide busses for bandwidth

Multiple busses for scalability

Benefits:

Power?

Simpler cache coherency

Move away from frequency, embrace parallelism


Repeated Bus (Circuit Switched)

Arbitration:

Arbitrate in each cycle for the next cycle

Decision visible to all nodes

Repeaters:

Align repeater direction

No driving contention

(Diagram: a repeated bus spanning the die, driven from one node through a chain of repeaters (R). Assumptions: 10mm die, 1.5µm bus pitch, 50ps repeater delay.)

Anders et al., “A 2.9Tb/s 8W 64-Core Circuit-switched Network-on-Chip in 45nm CMOS”, ESSCIRC 2008

Anders et al., “A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS”, ISSCC 2008
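A rough feel for the achievable bus speed, under the slide's assumptions plus an assumed repeater spacing (the 1 mm spacing and the single-cycle-traversal requirement are illustrative assumptions):

```python
DIE_MM = 10.0                    # die size from the slide
REPEATER_DELAY_PS = 50.0         # per-repeater delay from the slide
SPACING_MM = 1.0                 # assumed distance between repeaters

repeaters = int(DIE_MM / SPACING_MM)
traversal_ps = repeaters * REPEATER_DELAY_PS
print(f"{repeaters} repeaters, ~{traversal_ps:.0f} ps to cross the die")
print(f"~{1e3 / traversal_ps:.1f} GHz if the bus is crossed in a single cycle")
# Roughly 2 GHz: far better than the <300 MHz of an unrepeated shared bus,
# with no per-hop routers in the data path.
```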


A Circuit Switched Network

  • Circuit-switched NoC eliminates intra-route data storage

  • Packet-switching used only for channel requests

    • High bandwidth and energy efficiency (1.6 to 0.6 pJ/bit)

Anders et al., “A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS”, ISSCC 2008


Hierarchical & Heterogeneous

(Diagram: a hierarchy of busses. Clusters of cores (C) share a local bus, and routers (R) bridge the local busses onto a 2nd-level bus.)

Use a bus to connect over short distances, or build hierarchical circuit- and packet-switched networks.


Tapered Interconnect Fabric

Tapered, but over-provisioned bandwidth

Pay (energy) as you go, and as you can afford


But wait, what about Optical?

(Chart: energy breakdown of a 65nm optical link: pre-driver, driver, VCSEL, TIA, LA, DeMUX, CDR, and clock buffer, comparing what is possible today with hopeful projections.)

Source: Hirotaka Tamura (Fujitsu), ISSCC 2008 Workshop on High-Speed Transceivers


Impact of Exploding Parallelism

(Chart annotations: almost flat because Vdd is close to Vt; a 4X increase in the number of cores, i.e., parallelism.)

Increased communication and related energy

Increased hardware and unreliability

1. Strike a balance between communication and computation

2. Resiliency (gradual, intermittent, and permanent faults)


Road to Unreliability?

Resiliency will be the cornerstone


Soft Errors and Reliability

The soft error rate per bit decreases with each generation

NTV has only a nominal impact on the soft error rate

Soft errors at the system level will continue to increase

Positive impact of NTV on reliability:

Lower voltage means lower electric fields, and lower power means lower temperature

Device aging effects will be less of a concern

Fewer electromigration-related defects

N. Seifert et al., "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006


Resiliency

Minimal overhead for resiliency: error detection, fault isolation, fault confinement, reconfiguration, recovery and adaptation

(Diagram: these functions span the entire stack: applications, system software, programming system, microcode and platform, microarchitecture, and circuit & design.)


Compute vs Data Movement

1 bit Xfer over 3G

1mJ

1 bit Xfer over WiFi

1 bit Xfer over Ethernet

1 bit Xfer using Bluetooth

1nJ

1 bit over link

R/W Operands (RF)

Read a bit from internal SRAM

Computation

Read a bit from DRAM

Instruction Execution

1pJ

Data movement energy will dominate


System Level Optimization

(Chart: energy versus supply voltage for compute and for global interconnect.)

Compute energy decreases faster than global interconnect energy

For constant throughput, NTV demands more parallelism

This increases data movement at the system level

System-level optimization is required to determine the NTV operating point
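The trade-off can be sketched with a toy model (all constants below are illustrative assumptions): lowering Vdd cuts compute energy per operation, but keeping throughput constant requires more cores, and the extra cores increase global data movement, so the energy-optimal operating point sits above the compute-only optimum.

```python
import numpy as np

VT, ALPHA = 0.32, 1.5
vdd = np.linspace(0.40, 1.1, 200)

freq = (vdd - VT) ** ALPHA / vdd               # normalized per-core frequency
cores_needed = 1.0 / freq                      # cores for constant total throughput
compute_energy = vdd ** 2                      # CV^2 per operation, normalized
# Assume global traffic grows ~sqrt(cores) at a fixed (non-scaling) cost per bit:
interconnect_energy = 0.15 * np.sqrt(cores_needed)

total = compute_energy + interconnect_energy
v_opt = vdd[np.argmin(total)]
print(f"Total-energy optimum near Vdd ~ {v_opt:.2f} V, above the "
      f"compute-only optimum of {vdd[np.argmin(compute_energy)]:.2f} V")
```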


Architecture needs a Paradigm Shift

(Comparison: the architect's past and present priorities versus what the architect's future priorities should be.)

Must revisit and evaluate each architectural feature, even legacy ones


A non-architect’s biased view…

Soft errors

Power, energy

Interconnects

Bandwidth

Variations

Resiliency


Appealing to the Architects

Exploit transistor integration capacity

Dark Silicon is a self-fulfilling prophecy

Compute is cheap, reduce data movement

Use futuristic workloads for evaluation

Fearlessly break the legacy shackles


Summary

  • Power & energy challenge continues

  • Opportunistically employ NTV operation

  • 3D integration for DRAM

  • Hierarchical, heterogeneous, tapered interconnect

  • Resiliency spanning the entire stack

  • Does computer architecture have a future?

    • Yes, if you acknowledge issues & challenges, and embrace the paradigm shift

    • No, if you keep your “head buried in the sand”!

