

On-Chip Optical Communication for Multicore Processors

Jason Miller

Carbon Research Group

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LAB



“Moore’s Gap”

[Figure: log-scale plot of Performance (GOPS), 0.01 to 1000, versus time (1992, 1998, 2002, 2006, 2010). Curves for Pipelining, Superscalar, OOO, SMT/FGMT/CGMT, Multicore, and Tiled Multicore are compared against Transistors; the widening shortfall is labeled "The GOPS Gap".]

  • Diminishing returns from single CPU mechanisms (pipelining, caching, etc.)

  • Wire delays

  • Power envelopes



Multicore Scaling Trends

Today

A few large cores on each chip

Diminishing returns prevent cores from getting more complex

Only option for future scaling is to add more cores

Still some shared global structures: bus, L2 caches


Tomorrow

100s to 1000s of simpler cores [S. Borkar, Intel, 2007]

Simple cores are more power and area efficient

Global structures do not scale; all resources must be distributed

[Diagrams: today's chip, a few processors (p) and caches (c) on a shared BUS with an L2 cache; tomorrow's chip, a large grid of tiles, each with a processor (p), memory (m), and switch.]



The Future of Multicore

Number of cores doubles every 18 months

Parallelism replaces clock frequency scaling and core complexity

Resulting Challenges…

Scalability

Programming

Power

Examples: IBM XCell 8i, Tilera TILE64, MIT Raw, Sun UltraSPARC T2



Multicore Challenges

Scalability

How do we turn additional cores into additional performance?

Must accelerate single apps, not just run more apps in parallel

Efficient core-to-core communication is crucial

Architectures that grow easily with each new technology generation

Programming

Traditional parallel programming techniques are hard

Parallel machines were rare and used only by rocket scientists

Multicores are ubiquitous and must be programmable by anyone

Power

Already a first-order design constraint

More cores and more communication → more power

Previous tricks (e.g. lower Vdd) are running out of steam



Multicore Communication Today

Single shared resource

Uniform communication cost

Communication through memory

Doesn’t scale to many cores due to contention and long wires

Scalable up to about 8 cores

[Diagram: bus-based interconnect, with processors (p) and caches (c) sharing a single BUS connected to an L2 cache and off-chip DRAM.]



Multicore Communication Tomorrow

[Diagram: a 4×4 grid of tiles, each with a processor (p), memory (m), and switch; switches connect neighboring tiles, and DRAM interfaces sit at the chip edges.]

Point-to-Point Mesh Network

Examples: MIT Raw, Tilera TILEPro64, Intel Terascale Prototype

Neighboring tiles are connected

Distributed communication resources

Non-uniform costs:

Latency depends on distance (see the hop-count sketch after this slide)

Encourages direct communication

More energy efficient than bus

Scalable to hundreds of cores

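To make the non-uniform cost concrete, here is a minimal hop-count sketch for a mesh like the one above. The per-hop latency and energy constants are illustrative assumptions, not numbers from this presentation.

```python
# Hop-count cost model for a point-to-point 2D mesh (illustrative numbers).
# Assumes XY routing, so hops = Manhattan distance between tiles; the per-hop
# latency and energy values are placeholders, not figures from the slides.

PER_HOP_LATENCY_NS = 1.0   # assumed latency per switch-to-switch hop
PER_HOP_ENERGY_PJ = 0.15   # assumed energy per bit per hop

def mesh_cost(src, dst, bits=32):
    """Return (latency_ns, energy_pj) for one message between tiles src and dst."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])  # Manhattan distance
    return hops * PER_HOP_LATENCY_NS, hops * PER_HOP_ENERGY_PJ * bits

if __name__ == "__main__":
    print(mesh_cost((0, 0), (0, 1)))   # neighboring tiles: 1 hop, cheap
    print(mesh_cost((0, 0), (7, 7)))   # far corners of an 8x8 mesh: 14 hops
```

The bus gives every message the same nominal cost but makes all cores contend for one resource; on the mesh, cost grows with distance, which is why task and data placement matter on the following slides.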



Multicore Programming Trends

Meshes and small cores solve the physical scaling challenge, but programming remains a barrier

Parallelizing applications to thousands of cores is hard

Task and data partitioning

Communication becomes critical as latencies increase

Increasing contention for distant communication

Degraded performance, higher energy

Inefficient broadcast-style communication

Major source of contention

Expensive to distribute signal electrically



Multicore Programming Trends

For high performance, communication and locality must be managed

Tasks and data must be both partitioned and placed

Analyze communication patterns to minimize latencies

Place data near the code that needs it most

Place certain code near critical resources (e.g. DRAM, I/O)

Dynamic, unpredictable communication is impossible to optimize

Orchestrating communication and locality increases programming difficulty exponentially



Improving Programmability

Observations:

  • A cheap broadcast communication mechanism can make programming easier

    • Enables convenient programming models (e.g., shared memory)

    • Reduces the need to carefully manage locality

  • On-chip optical components enable cheap, energy-efficient broadcast



ATAC Architecture

[Diagram: the same grid of processor/memory/switch tiles, now connected by two networks: an Electrical Mesh Interconnect between neighboring tiles and an Optical Broadcast WDM Interconnect that reaches every tile.]



Optical Broadcast Network

Waveguide passes through every core

Multiple wavelengths (WDM) eliminate contention

Signal reaches all cores in <2ns

Same signal can be received by all cores

[Diagram: an optical waveguide passing through every core.]



Optical Broadcast Network

  • Electronic-photonic integration using standard CMOS process

  • Cores communicate via an optical WDM broadcast-and-select network

  • Each core sends on its own dedicated wavelength using modulators

  • Cores can receive from some set of senders using optical filters




Optical bit transmission

[Diagram: optical bit transmission. In the sending core, a flip-flop and modulator driver control a modulator that couples light from the multi-wavelength source waveguide onto the data waveguide; in the receiving core, a filter and photodetector feed a transimpedance amplifier and a flip-flop.]

  • Each core sends data using a different wavelength → no contention

  • Data is sent once, any or all cores can receive it → efficient broadcast



Core-to-core communication

  • 32-bit data words transmitted across several parallel waveguides

  • Each core contains receive filters and a FIFO buffer for every sender

  • Data is buffered at receiver until needed by the processing core

  • Receiver can screen data by sender (i.e., wavelength) or message type (modeled in the sketch after this slide)

[Diagram: sending cores A and B each transmit 32-bit words; in the receiving core, a bank of per-sender FIFOs buffers incoming 32-bit words for the processor core.]
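The receive path above can be modeled in a few lines of Python. This is a toy behavioral sketch; the class and method names (OpticalBroadcastNet, Core.pop, and so on) are mine, not ATAC's.

```python
# Toy model of the broadcast-and-select receive path: one wavelength per sender,
# one receive FIFO per sender at every core. Names are illustrative.
from collections import deque, namedtuple

Word = namedtuple("Word", ["sender", "msg_type", "payload"])  # payload stands in for a 32-bit word

class Core:
    def __init__(self, core_id, num_cores):
        self.core_id = core_id
        # One receive FIFO per sender, i.e., per wavelength the filters can select.
        self.fifos = {s: deque() for s in range(num_cores) if s != core_id}

    def receive(self, word):
        # Buffer at the receiver until the processing core needs it.
        self.fifos[word.sender].append(word)

    def pop(self, sender, msg_type=None):
        # Screen buffered data by sender (wavelength) or message type.
        fifo = self.fifos[sender]
        if fifo and (msg_type is None or fifo[0].msg_type == msg_type):
            return fifo.popleft()
        return None

class OpticalBroadcastNet:
    def __init__(self, num_cores):
        self.cores = [Core(i, num_cores) for i in range(num_cores)]

    def send(self, sender, msg_type, payload):
        # The word is modulated once on the sender's dedicated wavelength;
        # every other core's filters can drop a copy, so one send is a broadcast.
        word = Word(sender, msg_type, payload)
        for core in self.cores:
            if core.core_id != sender:
                core.receive(word)

net = OpticalBroadcastNet(num_cores=4)
net.send(sender=0, msg_type="store", payload=0xDEADBEEF)
print(net.cores[2].pop(sender=0, msg_type="store"))
```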



ATAC Bandwidth

64 cores, 32 lines, 1 Gb/s

Transmit BW: 64 cores × 1 Gb/s × 32 lines = 2 Tb/s

Receive-Weighted BW: 2 Tb/s × 63 receivers = 126 Tb/s (checked in the sketch below)

Receive-weighted bandwidth is a good metric for broadcast networks – it credits each transmitted bit once per core that can receive it, which is exactly what WDM broadcast provides

ATAC allows better utilization of computational resources because less time is spent performing communication
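A quick check of the arithmetic behind these figures (variable names are mine):

```python
# Recompute the slide's bandwidth figures: 64 cores, 32 lines per core, 1 Gb/s per line.
cores, lines, gbps_per_line = 64, 32, 1

transmit_bw_gbps = cores * lines * gbps_per_line   # 2048 Gb/s, rounded to 2 Tb/s on the slide
receive_weighted_tbps = 2 * (cores - 1)            # 2 Tb/s x 63 potential receivers = 126 Tb/s

print(transmit_bw_gbps, receive_weighted_tbps)     # 2048 126
```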



System Capabilities and Performance

Both configurations: 64-core system, 65 nm process.

                           Raw (baseline: leading-edge tiled multicore)   ATAC (future optical-interconnect multicore)
  Peak performance         64 GOPS                                        64 GOPS
  Chip power               24 W                                           25.5 W
  Theoretical power eff.   2.7 GOPS/W                                     2.5 GOPS/W
  Effective performance    7.3 GOPS                                       38.0 GOPS
  Effective power eff.     0.3 GOPS/W                                     1.5 GOPS/W
  Total system power       150 W                                          153 W

Optical communications require a small amount of additional system power but allow for much better utilization of computational resources.



Programming ATAC

Cores can directly communicate with any other core in one hop (<2 ns)

Broadcasts require just one send

No complicated routing on network required

Cheap broadcast enables frequent global communications

Broadcast-based cache update/remote store protocol

All “subscribers” are notified when a writing core issues a store (“publish”); a behavioral sketch follows this slide

Uniform communication latency simplifies scheduling
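Here is a behavioral sketch of the broadcast-based cache-update ("remote store") idea, written as publish/subscribe. It illustrates the shape of the protocol, not ATAC's actual coherence machinery, and the class names are made up.

```python
# Behavioral sketch of a broadcast-based cache-update ("remote store") protocol.
# Every store is published once on the writer's wavelength; any core that has
# subscribed to that address updates its local copy. Names are illustrative.

class CacheLine:
    def __init__(self, value=0):
        self.value = value

class BroadcastCoherence:
    def __init__(self, num_cores):
        self.caches = [dict() for _ in range(num_cores)]   # per-core map: addr -> CacheLine
        self.subscribers = {}                              # addr -> set of subscribing core IDs

    def subscribe(self, core_id, addr):
        self.subscribers.setdefault(addr, set()).add(core_id)
        self.caches[core_id].setdefault(addr, CacheLine())

    def store(self, writer_id, addr, value):
        # One optical broadcast; every subscriber's cached copy is updated.
        self.caches[writer_id][addr] = CacheLine(value)
        for core_id in self.subscribers.get(addr, ()):
            self.caches[core_id][addr] = CacheLine(value)

coh = BroadcastCoherence(num_cores=4)
coh.subscribe(2, addr=0x100)
coh.store(writer_id=0, addr=0x100, value=42)
print(coh.caches[2][0x100].value)   # 42: the subscriber saw the remote store
```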



Communication-centric Computing

[Diagram: in the bus-based multicore, references that cross the BUS and L2 cache to memory are labeled ~500 pJ each; the corresponding on-chip ATAC communication is labeled ~3 pJ.]

  • ATAC reduces off-chip memory calls, and hence energy and latency (estimated in the sketch after this slide)

  • A view of extended global memory can be provided cheaply by combining on-chip distributed cache memory with the ATAC network

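A back-of-the-envelope sketch of the energy argument, using the 500 pJ and 3 pJ labels from the figure. The access count and on-chip fractions are hypothetical inputs chosen only to show the shape of the saving.

```python
# Back-of-the-envelope: energy for memory traffic when references that would go
# off-chip (~500 pJ each, per the figure) are instead served from on-chip
# distributed cache over the ATAC network (~3 pJ each). The access mix is hypothetical.

OFF_CHIP_PJ = 500
ON_CHIP_PJ = 3

def traffic_energy_uj(accesses, on_chip_fraction):
    """Total energy in microjoules for a given number of memory accesses."""
    on_chip = accesses * on_chip_fraction
    off_chip = accesses - on_chip
    return (on_chip * ON_CHIP_PJ + off_chip * OFF_CHIP_PJ) * 1e-6

# Hypothetical: 1M accesses; 50% served on chip vs. 95% served on chip.
print(traffic_energy_uj(1_000_000, 0.50))   # ~251.5 uJ
print(traffic_energy_uj(1_000_000, 0.95))   # ~27.9 uJ
```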



Summary

ATAC uses optical networks to enable multicore programming and performance scaling

ATAC encourages communication-centric architecture, which helps multicore performance and power scalability

ATAC simplifies programming with a contention-free all-to-all broadcast network

ATAC is enabled by recent advances in CMOS integration of optical components



Backup Slides



What Does the Future Look Like?

Corollary of Moore’s law: the number of cores will double every 18 months (projected in the sketch below)

Cores per chip:

            ’02    ’05    ’08    ’11    ’14
  Research   16     64    256   1024   4096
  Industry    4     16     64    256   1024

1K cores by 2014! Are we ready?

(Cores minimally big enough to run a self-respecting OS!)
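The table is just the 18-month doubling rule applied to the 2002 baselines (16 research cores, 4 industry cores); a small projection sketch:

```python
# Project core counts from the "doubles every 18 months" corollary.
def projected_cores(base_cores, base_year, year):
    return base_cores * 2 ** ((year - base_year) / 1.5)

for year in (2002, 2005, 2008, 2011, 2014):
    research = int(projected_cores(16, 2002, year))
    industry = int(projected_cores(4, 2002, year))
    print(year, research, industry)   # matches the table: 2014 -> 4096 and 1024
```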



Scaling to 1000 Cores

Purely optical design scales to about 64 cores

After that, clusters of cores share optical hubs (mapping sketched after this slide)

ENet and BNet move data to/from optical hub

Dedicated, special-purpose electrical networks

[Diagram: within a cluster, processor tiles (Proc, $, Dir $) connect over the electrical ENet and BNet to a HUB, which joins the optical ONet and connects to memory.]

Electrical Networks Connect 16 Cores to Optical Hub

64 Optically-Connected Clusters
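A sketch of the resulting two-level addressing, using the cluster size and count from this slide. The routing description, and the assumption that intra-cluster traffic stays on the electrical networks, are mine.

```python
# Two-level topology: 1024 cores = 64 optically-connected clusters of 16 cores each.
CORES_PER_CLUSTER = 16   # cores reach their cluster's optical hub over ENet/BNet
NUM_CLUSTERS = 64        # hubs talk to each other over the optical ONet

def hub_of(core_id):
    """Index of the optical hub (cluster) that owns this core."""
    return core_id // CORES_PER_CLUSTER

def route(src_core, dst_core):
    """Describe the path a message takes (assumed behavior, for illustration)."""
    src_hub, dst_hub = hub_of(src_core), hub_of(dst_core)
    if src_hub == dst_hub:
        return "electrical networks only (same cluster)"
    return f"ENet to hub {src_hub} -> ONet -> BNet from hub {dst_hub}"

print(route(3, 12))      # same cluster
print(route(3, 1000))    # crosses the optical network: hub 0 -> hub 62
```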



ATAC is an Efficient Network

  • Modulators are the primary source of power consumption

    • Receive power: only ~2 fJ/bit required, even with -5 dB link loss

    • Modulator power:

      • Ge-Si EA design: ~75 fJ/bit (assuming 50 fJ/bit for the modulator driver)

  • Example: 64-core communication (recomputed in the sketch below)

  • (i.e., N = 64 cores = 64 wavelengths; for a 32-bit word: 2048 drops/core and 32 adds/core)

    • Receive power: 2 fJ/bit × 1 Gb/s × 32 bits × N² = 262 mW

    • Modulator power: 75 fJ/bit × 1 Gb/s × 32 bits × N = 153 mW

    • Total energy/bit: 75 fJ/bit + 2 fJ/bit × (N − 1) = 201 fJ/bit

  • Comparison: electrical broadcast across 64 cores

    • Requires 64 × 150 fJ/bit ≈ 10 pJ/bit (~50× more energy per bit)

    • (Assumes 150 fJ/mm/bit and 1 mm tile spacing)
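A quick recomputation of these numbers; the constants come from the slide, the variable names are mine:

```python
# Recompute the ATAC power example: 64 cores, one wavelength each, 32-bit words at 1 Gb/s.
N, bits, rate_hz = 64, 32, 1e9
rx_fj_per_bit, mod_fj_per_bit = 2, 75        # receiver and modulator energy from the slide

FJ = 1e-15
receive_power_w = rx_fj_per_bit * FJ * rate_hz * bits * N * N   # each of N cores drops all N wavelengths: ~0.26 W
modulator_power_w = mod_fj_per_bit * FJ * rate_hz * bits * N    # one set of modulators per sender: ~0.15 W
energy_per_bit_fj = mod_fj_per_bit + rx_fj_per_bit * (N - 1)    # 75 + 2*63 = 201 fJ/bit

electrical_broadcast_fj = 64 * 150   # 150 fJ/mm/bit over 1 mm tiles, 64 tiles: ~9.6 pJ/bit
print(receive_power_w, modulator_power_w, energy_per_bit_fj,
      round(electrical_broadcast_fj / energy_per_bit_fj))       # ratio ~48, i.e. roughly 50x
```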

