PROPEL: Power & Area-Efficient, Scalable Opto-Electronic Network-on-Chips (NoCs)

Thesis Defense

Randy W. Morris, Jr.

Affiliation: EECS, Ohio University

E-mail: [email protected]

Advisor: Avinash Kodi


Outline

  • Motivation & Background

  • PROPEL: Architecture

  • PROPEL: Implementation

  • Performance Analysis

  • Conclusion


Why Chip Multi-Processor? (1/2)

After 2002, single-core designs show diminishing returns.

Courtesy: J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Morgan Kaufmann, San Francisco, 2007.


Why Chip Multi-Processor? (2/2)

Courtesy: G. Konstadinidis et al., “Architecture and Physical Implementation of a Third Generation 65 nm, 16 Core, 32 Thread Chip-Multithreading SPARC Processor”

Examples: RAW, Core 2 Duo, Quad Core, UltraSPARC


Wire Delay Problem

[Figure: a 20 mm chip shown at three stages (Past, Present, Future), partitioned into progressively more cores, up to 64 cores numbered 0 to 63.]

  • Wire delay is proportional to the wire's RC constant.

  • Resistance increases while capacitance remains roughly constant.
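
As a rough illustration of the RC argument above, the sketch below uses the common distributed-RC approximation, delay ≈ 0.38 · r · c · L², which is not stated on the slide. The per-millimetre resistance and capacitance values are hypothetical placeholders, not ITRS figures.

```python
def wire_delay_s(r_ohm_per_mm: float, c_farad_per_mm: float, length_mm: float) -> float:
    """Approximate 50% delay of an unrepeated distributed RC wire.

    Uses the textbook 0.38 * r * c * L^2 approximation (an assumption here,
    not a formula taken from the slides).
    """
    return 0.38 * r_ohm_per_mm * c_farad_per_mm * length_mm ** 2


# Hypothetical per-length values for a 20 mm global wire.
R_PER_MM = 100.0      # ohm/mm, illustrative only
C_PER_MM = 0.2e-12    # F/mm, illustrative only

base = wire_delay_s(R_PER_MM, C_PER_MM, 20.0)
# Resistance doubles while capacitance stays constant: delay doubles.
scaled = wire_delay_s(2 * R_PER_MM, C_PER_MM, 20.0)

print(f"baseline delay: {base * 1e9:.2f} ns")
print(f"delay with 2x resistance: {scaled * 1e9:.2f} ns")
```

Because the delay grows with the square of the length, the same model also shows why a fixed 20 mm die becomes harder to cross in one clock as wires get more resistive.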


Network-on-Chip (NoC)

[Figure: 16 processing cores (Core 0 to Core 15), each attached to a router and joined by links. The router detail shows Route Computation (RC), Virtual Channel (VC), Switch Allocator (SA), a crossbar switch, credits in/out, and +X, -X, +Y, -Y, and core ports.]


Power Dissipation

[Figure: Intel Tera-Flops (65 nm) tile power and routing power breakdown.]

Courtesy: Y. Hoskote, “A 5-GHz Mesh Interconnect for A Teraflops Processor,” IEEE Computer Society, 2007 pp. 51-61

  • 28% of a tile's overall power goes to the router and links

  • Link power will become a larger share of a router's overall power in future VLSI technology

  • Router and link power should be about 10-15% of the tile's power budget

Potential solutions: optics, RF, and 3D stacking


Why use Optics?

  • Lower latency

  • Higher bandwidth (WDM, SDM & TDM)

  • Increased bandwidth density (compact parallel optics)

  • Low power (1.1 mW/Gbps)

  • Bit-rate independent of distance

  • Lower cross-talk

  • Does not suffer from impedance mismatch and signal reflection

  • Low signal attenuation


Electrical Interconnect

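R = wire resistance per unit length

C = wire capacitance per unit length

Cp = inverter output capacitance

C0 = inverter input capacitance

rs = inverter resistance

sopt = inverter size

lopt = wire distance

[Figure: RC link model. A wire of length lopt with per-length resistance R and capacitance C is driven by an inverter of size sopt (resistance rs, output capacitance Cp) and loads an inverter input capacitance C0.]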


ITRS 2007 Transistor & Link Parameters

Electrical link device parameters for various VLSI technologies

  • Wire delay increases due to the RC constant

  • The Ioffn and Ishortckt current parameters increase


Optical Interconnect

[Figure: optical link. An off-chip laser feeds the on-chip optical layer (on-chip modulator, transmission medium, photodetector). The electronics layer contains the transmitter (buffer chain) and the receiver chain (TIA, limiting amplifier, driver for electronics).]


Micro-ring Resonators

Resonant wavelength (λ0):

$\lambda_0 \times m = n_{eff} \times 2\pi R$

m: an integer

neff: effective refractive index

R: radius of the ring resonator

[Figure: micro-ring resonator with an n+/p+/n+ doped ring and control voltage VR, shown for VR = VOFF and VR = VON. Labeled ports: Input Port 0, Output Port 0, Output Port 1.]

  • CMOS compatible

  • Low power (0.1 mW)

  • Small footprint (10 µm)

  • High bandwidth (10 Gb/s)
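
The resonance condition on this slide (λ0 × m = neff × 2πR) can be turned into a small calculation. The sketch below lists the resonant wavelengths of one ring inside a given band. The effective index (2.5), ring radius (5 µm), and the 1500 to 1600 nm band are hypothetical values chosen only for illustration; they are not taken from the thesis.

```python
import math


def resonant_wavelengths(n_eff: float, radius_m: float, lo_m: float, hi_m: float):
    """Return (m, wavelength) pairs satisfying wavelength * m = n_eff * 2*pi*R
    for every integer mode m whose wavelength falls inside [lo_m, hi_m]."""
    optical_path = n_eff * 2 * math.pi * radius_m
    m_min = math.ceil(optical_path / hi_m)   # longest wavelength -> smallest m
    m_max = math.floor(optical_path / lo_m)  # shortest wavelength -> largest m
    return [(m, optical_path / m) for m in range(m_min, m_max + 1)]


# Hypothetical ring: n_eff = 2.5, R = 5 um, searched between 1500 nm and 1600 nm.
for m, wavelength in resonant_wavelengths(2.5, 5e-6, 1.5e-6, 1.6e-6):
    print(f"m = {m}: {wavelength * 1e9:.1f} nm")
```

Changing the effective index slightly, which is what the applied voltage VR does in a real device, shifts every resonance; that shift is how a ring is tuned on or off a given wavelength.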


Waveguide & Receiver

[1] N. Kirman et al., “Leveraging Optical Technology in Future Bus-based Chip Multiprocessors,” 39th Annual IEEE/ACM International Symposium on Microarchitecture, 9-13 Dec. 2006, pg. 492 – 50

[2] S. Koester et al., “Ge-on-SOI-Detector/Si-CMOS-Amplifier Receivers for High-Performance Optical-Communication Applications,” Journal of Lightwave Technology, Vol. 25, No. 1, January 2007

[3] C. Kromer et al., “A 100-mW 4×10 Gb/s Transceiver in 80-nm CMOS for High-Density Optical Interconnects,” IEEE Journal of Solid-State Circuits, Vol. 40, No. 12, December 2005

[4] D. Kuchta et al., “120-Gb/s VCSEL-based parallel-optical interconnect and custom 120-Gb/s testing station,” Journal of Lightwave Technology, Vol. 22, No. 9, pp. 2200-2212, Sept. 2004


Electrical/Optical Comparison

Power-delay product at various technology nodes for a 5 mm link.

Optics becomes more advantageous at the 52 nm node for global interconnects and at the 45 nm node for semi-global interconnects.


Critical Length

Critical length is the link distance beyond which an optical link becomes more advantageous than an electrical one.

[Figure annotation: core-to-core distance.]


Advantages of PROPEL

  • Efficient use of optical components

  • Balance between optics and electronics

  • Simple network design – Low diameter, DOR

  • Scalability

  • Fault Tolerant


PROPEL’s Design

0, 1, 2, …

Broadband Light source

Tile 0

0

1

4

5

8

10

12

14

L2

L2

L2

L2

2

6

7

9

11

13

15

3

Photonic

Transceiver

L2

L2

L2

28

30

L2

16

17

20

22

24

26

Optical

Interconnect

tile

Core

Core 0

Core 1

L2

Cache

27

29

31

18

19

21

23

25

Photonic

Transceiver

40

42

44

45

32

33

36

38

L2

L2

L2

Core2

Core3

L2

41

43

46

47

34

35

37

39

L2

L2

L2

56

57

60

61

48

49

52

53

L2

58

62

63

59

50

51

54

55


PROPEL’s Routing & Wavelength Assignment (x-direction)

[Figure: x-direction wavelength assignment across four tiles (Tile 0 to Tile 3, Cores 0 to 15, one L2 cache per tile). A broadband signal supplies the wavelengths, and the waveguides are labeled Home Channel 0 to Home Channel 3. Signals shown: λ1(0,0) + λ2(0,0) + λ3(0,0) and λ0(1,0) + λ2(1,0) + λ3(2,0).]


PROPEL’s 64 Wavelength Design

Research has shown that 64 wavelengths can travel down a single waveguide.

[Figure: four tiles (Tile 0 to Tile 3) connected by optical inter-tile communication channels fed by a laser. Each tile has four cores with L1 caches, a shared L2, an X-transmitter and X-receiver, and a Y-transmitter and Y-receiver. The 64 wavelengths are split into four bands: λ(0-15), λ(16-31), λ(32-47), and λ(48-63).]
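
The split of 64 wavelengths into four 16-wavelength bands can be sketched as a simple index mapping. Which band belongs to which tile is not recoverable from this transcript, so the rule used below (band k is associated with tile position k in the row) is purely an assumption for illustration, as are the function names.

```python
NUM_WAVELENGTHS = 64
TILES_PER_ROW = 4
BAND_SIZE = NUM_WAVELENGTHS // TILES_PER_ROW  # 16 wavelengths per band


def band_for_tile(tile_x: int) -> range:
    """Wavelength indices assumed to belong to the tile at position tile_x.

    Assumption: band k (wavelengths 16k .. 16k+15) maps to tile position k.
    """
    if not 0 <= tile_x < TILES_PER_ROW:
        raise ValueError("tile_x out of range")
    start = tile_x * BAND_SIZE
    return range(start, start + BAND_SIZE)


def tile_for_wavelength(wavelength: int) -> int:
    """Inverse mapping: which tile position a wavelength index is assigned to."""
    if not 0 <= wavelength < NUM_WAVELENGTHS:
        raise ValueError("wavelength out of range")
    return wavelength // BAND_SIZE


for x in range(TILES_PER_ROW):
    band = band_for_tile(x)
    print(f"tile position {x}: wavelengths {band.start}-{band.stop - 1}")
```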


PROPEL’s x- and y-direction Implementation

[Figure: grid of tiles with x- and y-direction optical links, a laser, and DRAM banks (Bank 0 to Bank 15), with regions labeled Off-Chip and On-Chip. The tile detail shows Core 0 to Core 3 with L1 caches, a shared L2, an X-transmitter and X-receiver, and a Y-transmitter and Y-receiver.]


Memory Routing and Wavelength Assignment

[Figure: memory interface for Bank 0 to Bank 3. A transmitter and a receiver each handle the four bands λ0-15, λ16-31, λ32-47, and λ48-63. Paths are labeled From Laser, From CMP, and To CMP.]


Communication Example

[Figure: tile detail and tile grid. The router shows Route Computation (RC), Virtual Channel (VC), Switch Allocator (SA), a crossbar switch, and credits in/out, with optical ports X0 to X2 and Y0 to Y2. The tile contains Core 0 to Core 3 with L1 caches, a shared L2 cache, X- and Y-transmitters and receivers, and a laser input.]

Tile 3 communicates with Tile 8.
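
The Tile 3 to Tile 8 example can be traced with a short dimension-order routing (DOR) sketch. It assumes a 4×4 tile grid numbered row by row, x-first ordering, and that a tile reaches any tile in its own row or column in a single optical hop. These assumptions are consistent with the low-diameter, DOR design listed earlier but are not spelled out in this transcript.

```python
GRID = 4  # assumed 4x4 tile grid, numbered row-major (Tile 0..3 on the top row)


def coords(tile: int) -> tuple[int, int]:
    """Return (x, y) for a row-major tile number."""
    return tile % GRID, tile // GRID


def dor_route(src: int, dst: int) -> list[int]:
    """Dimension-order route: resolve x first, then y.

    Assumes one optical hop covers any distance within a row or a column.
    """
    sx, sy = coords(src)
    dx, dy = coords(dst)
    path = [src]
    if sx != dx:
        path.append(sy * GRID + dx)  # x-direction hop along the source row
    if sy != dy:
        path.append(dy * GRID + dx)  # y-direction hop along the destination column
    return path


print(dor_route(3, 8))  # [3, 0, 8]
```

Under these assumptions the packet takes one x-direction hop to Tile 0, is switched electrically in Tile 0's router, and then takes one y-direction hop to Tile 8.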


Modulation Implementation

[Figure: modulation of a broadband signal. The bands λ0-15, λ16-31, and λ32-47 are each modulated by their own set of modulators (λ0 … λ15, λ16 … λ31, λ32 … λ47).]


Multicasting & Broadcasting

  • Multicasting: single tile to multiple tiles.

  • Broadcasting: single tile to all tiles.

    • Use 3 individual multicasts

[Figure: 4×4 tile grid (Tile 0 to Tile 15) marking the sending tile and the communication links.]


Performance Evaluation

  • Cost & Component Comparison

  • Synthetic Traffic

    • OPTISM

    • Uniform, Bit-reversal, Butterfly, Complement, Matrix transpose, Perfect Shuffle

  • SPLASH-2

    • Simics with GEMS and Garnet

    • FFT, LU, Radiosity and Ocean

  • Network topologies evaluated

    • Electrical: Mesh, Cmesh and Flattened-butterfly

    • Optical: Circuit-switch, Shared-bus and Corona


Electronic Parameters

[Figure: router with Route Computation (RC), Virtual Channel (VC), Switch Allocator (SA), a crossbar switch, credits in/out, +X, -X, +Y, -Y ports, and a Processing Element (PE) port.]

Crossbar (0.8 mW/flit):

$E_{sw} = w_f \times (C_{xbi} + C_{xbo}) \times V_{DD}^2$

VC Buffer (4.03 mW/flit):

$P_{write} = P_{wordline} + (2 \times F \times P_{bitline}) + (F \times P_{memory\text{-}cell})$

$P_{read} = P_{wordline} + F \times (P_{bitliner} + P_{chg})$

Electrical Link (22 mW/mm):

$P_{link} = P_{dynamic} + P_{leakage} + P_{short\text{-}ckt}$


Optical Parameters

[Figure: optical link. An off-chip laser feeds the on-chip optical layer (on-chip modulator, transmission medium, photodetector). The electronics layer contains the buffer chain, TIA, limiting amplifier, and driver for electronics.]

  • Receiver Circuitry (1.1 mW/Gbps)

  • Micro-ring Modulator (0.1 mW)
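
The two optical figures on this slide, 1.1 mW/Gbps for the receiver circuitry and 0.1 mW for the micro-ring modulator, can be combined into a rough per-wavelength estimate. The sketch assumes each wavelength runs at 10 Gb/s (the bandwidth quoted on the micro-ring slide), that the modulator figure applies per ring, and it leaves out laser power; none of these combinations is stated in the slides.

```python
RECEIVER_MW_PER_GBPS = 1.1  # receiver circuitry, from the slide
MODULATOR_MW = 0.1          # micro-ring modulator, from the slide


def optical_link_power_mw(bit_rate_gbps: float, wavelengths: int = 1) -> float:
    """Modulator plus receiver power for a number of wavelengths.

    Assumptions: one modulator and one receiver per wavelength, and no laser
    power included.
    """
    per_wavelength = MODULATOR_MW + RECEIVER_MW_PER_GBPS * bit_rate_gbps
    return wavelengths * per_wavelength


print(f"one wavelength at 10 Gb/s: {optical_link_power_mw(10):.1f} mW")
print(f"16-wavelength band at 10 Gb/s: {optical_link_power_mw(10, 16):.1f} mW")
```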


Component Comparison

PROPEL is the most cost-effective NoC.


Synthetic Traffic Trace

  • Uniform traffic: each packet's destination is equally likely to be any node.

  • Bit-reversal:

    Source: a_{n-1}, a_{n-2}, ..., a_1, a_0   Destination: a_0, a_1, ..., a_{n-2}, a_{n-1}

  • Butterfly:

    Source: a_{n-1}, a_{n-2}, ..., a_1, a_0   Destination: a_0, a_{n-2}, ..., a_1, a_{n-1}

  • Complement:

    Source: a_{n-1}, a_{n-2}, ..., a_1, a_0   Destination: a_{n-1}', a_{n-2}', ..., a_1', a_0'

  • Matrix transpose:

    Source: a_{n-1}, a_{n-2}, ..., a_1, a_0   Destination: a_{n/2-1}, ..., a_0, a_{n-1}, ..., a_{n/2}

  • Perfect shuffle:

    Source: a_{n-1}, a_{n-2}, ..., a_1, a_0   Destination: a_{n-2}, a_{n-3}, ..., a_0, a_{n-1}
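
The permutation patterns above map each source address to a destination by rearranging its n address bits. The sketch below implements them on integer node IDs; the function names are illustrative, and n = 6 (64 nodes) is chosen to match the 64-core configuration. The matrix-transpose version assumes an even n and swaps the upper and lower halves of the address.

```python
def bit_reversal(src: int, n: int) -> int:
    """Destination bits are the source bits in reverse order."""
    return int(format(src, f"0{n}b")[::-1], 2)


def butterfly(src: int, n: int) -> int:
    """Swap the most and least significant bits."""
    msb = (src >> (n - 1)) & 1
    lsb = src & 1
    if msb != lsb:
        src ^= (1 << (n - 1)) | 1  # flipping both bits swaps them
    return src


def complement(src: int, n: int) -> int:
    """Invert every address bit."""
    return ~src & ((1 << n) - 1)


def matrix_transpose(src: int, n: int) -> int:
    """Swap the upper and lower halves of the address (n assumed even)."""
    half = n // 2
    low = src & ((1 << half) - 1)
    return (low << half) | (src >> half)


def perfect_shuffle(src: int, n: int) -> int:
    """Rotate the address left by one bit."""
    return ((src << 1) | (src >> (n - 1))) & ((1 << n) - 1)


N_BITS = 6  # 64 nodes
for name, fn in [("bit-reversal", bit_reversal), ("butterfly", butterfly),
                 ("complement", complement), ("matrix transpose", matrix_transpose),
                 ("perfect shuffle", perfect_shuffle)]:
    print(f"{name}: node 6 -> node {fn(6, N_BITS)}")
```

Each function is a permutation of the node IDs, so every node sends to exactly one destination and receives from exactly one source; that is what makes these patterns stress specific links rather than spreading load evenly as uniform traffic does.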


Uniform Traffic Throughput

  • 25% improvement over Mesh

  • 9% improvement over Flattened-butterfly

  • Over 2× increase in performance over Circuit-switch, Cmesh, and Shared-bus


Uniform Traffic Latency

  • PROPEL saturates at a network load of 0.5

  • Saturates at a network load 0.1 higher than Flattened-butterfly

  • Saturates at a 2× higher network load than Shared-bus and Circuit-switch


All Traffic Saturation Throughput


Bit-Reversal Traffic Latency

  • PROPEL saturates at a network load of 0.25

  • Saturates at a network load 0.25 higher than Flattened-butterfly

  • Saturates at a 1.5× higher network load than Shared-bus and Circuit-switch


Complement Traffic Latency

  • Networks with core concentration create communication hotspots.


Matrix Transpose Traffic Latency

  • PROPEL saturates at a network load of 0.3

  • Circuit-switch saturates at a higher load than the electrical networks


Synthetic Traffic Power Dissipation

5× Reduction In Power


Simics Parameters

  • Simics is a full system simulator from Virtutech


SPLASH-2 Benchmarks

  • The FFT kernel is a 1-dimensional version of the radix-√n six-step FFT algorithm.

  • The LU kernel factors a dense matrix into upper and lower triangular matrices.

  • Radiosity is a graphics kernel that computes the equilibrium distribution of light in a scene.

  • The Ocean application evaluates the boundary and eddy currents of large-scale ocean movements.


SPLASH-2 Speed-Up


Conclusion

  • PROPEL is a low-power, high-bandwidth NoC for future many-core processors.

  • PROPEL uses electronics for packet switching and optics for inter-router communication, which reduces the number of electrical and optical components.

  • PROPEL uses the fewest optical components and the least area of the opto-electronic networks compared.

  • PROPEL outperforms well-known network topologies while dissipating less power.


Questions?


Future Work

  • Use optics to go to memory

  • Dynamic Bandwidth

  • Dynamic Voltage Scaling

  • Application Integration with the NoC


Examples of NoCs (1/2)

[Figure: torus and mesh topologies, each built from cores, routers, and links.]

Torus

  • Advantages: reduced hop count, DOR routing

  • Disadvantages: difficult to integrate on-chip

Mesh

  • Advantages: simple to integrate on-chip, DOR routing

  • Disadvantages: high hop count


Examples of NoCs (2/2)

Flattened-butterfly

  • Advantages: max hop count of 2, reduced power dissipation

  • Disadvantages: not easily scalable

Cmesh

  • Advantages: reduced network diameter, fewer routers

  • Disadvantages: multiple cores share the same ports


PROPEL Multicasting Example

Multicast example: Tile 0 sends the same data to Tiles 1, 2, and 3.

[Figure: four tiles (Tile 0 to Tile 3) fed by a laser. Each tile has four cores with L1 caches, a shared L2, an X-transmitter and X-receiver, and a Y-transmitter and Y-receiver.]


PROPEL’s Implementation (3/4)

[Figure: transmitters and receivers for Tile 0 to Tile 3. An off-chip laser supplies the bands λ0-15, λ16-31, λ32-47, and λ48-63 to the transmitters; the receivers are labeled with the same four bands. Paths are labeled From Memory and To Memory.]


PROPEL’s Design: 64-Wavelength Assignment

  • Research has shown that 64 wavelengths can travel down a single waveguide.

    • The wavelengths used for PROPEL are extended from 4 to 64.


PROPEL Broadcasting

  • Single tile to all-tile communication.

    • Use 3 individual multicasts

[Figure: 4×4 tile grid (Tile 0 to Tile 15) marking the sending tile and the communication links.]


Electrical Link Power Dissipation

Optical Power Dissipation

