

PROPEL: Power & Area-Efficient, Scalable Opto-Electronic Network-on-Chips (NoCs)

Thesis Defense

Randy W. Morris, Jr.

Affiliation: EECS, Ohio University

E-mail: [email protected]

Advisor: Avinash Kodi


Outline

  • Motivation & Background

  • PROPEL: Architecture

  • PROPEL: Implementation

  • Performance Analysis

  • Conclusion


Why Chip Multi-Processor? (1/2)

After 2002, single-core designs show diminishing returns.

Courtesy: J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Morgan Kaufmann, San Francisco, 2007.


Why Chip Multi-Processor? (2/2)

Courtesy: G. Konstadinidis et al., “Architecture and Physical Implementation of a Third Generation 65 nm, 16 Core, 32 Thread Chip-Multithreading SPARC Processor”

Examples: RAW, Core 2 Duo, Quad Core, Ultra Sparc


Wire Delay Problem

[Figure: a 20 mm die shown for past, present, and future designs; the core labels in the three panels run 0-3, 0-15, and 0-63.]

  • Wire delay is proportional to the wire’s RC constant.

  • Resistance increases while capacitance remains constant.
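As a rough illustration of why this matters, the sketch below uses the standard first-order estimate for an unrepeated distributed RC wire, delay ≈ 0.38·R·C·L², where R and C are per-unit-length values. The per-millimeter numbers are hypothetical placeholders, not figures from the slides.

```python
def distributed_rc_delay(r_per_mm, c_per_mm, length_mm):
    """50% delay in seconds of an unrepeated distributed RC wire: 0.38 * R * C * L^2."""
    return 0.38 * r_per_mm * c_per_mm * length_mm ** 2


# Hypothetical wire parameters, for illustration only.
R_PER_MM = 1000.0    # ohms per mm
C_PER_MM = 0.2e-12   # farads per mm

for length in (5, 10, 20):
    delay_ps = distributed_rc_delay(R_PER_MM, C_PER_MM, length) * 1e12
    print(f"{length} mm wire: {delay_ps:.0f} ps")
```

Doubling the wire length quadruples the delay in this model, which is why long global wires become the bottleneck as resistance per unit length grows.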


Network-on-Chip (NoC)

[Figure: 16 processing cores (Core 0-15), each attached to a router and joined by links. Each router contains route computation (RC), virtual channel (VC), and switch allocator (SA) blocks, credits in/out, a crossbar switch, a core port, and +X, -X, +Y, -Y ports.]


Power Dissipation

Intel Tera-Flops (65 nm)

Tile Power

Routing Power

Courtesy: Y. Hoskote, “A 5-GHz Mesh Interconnect for A Teraflops Processor,” IEEE Computer Society, 2007 pp. 51-61

  • 28% of a tile’s overall power goes to the router and links

  • Link power will make up a larger share of a router’s overall power in future VLSI technologies

  • Router and link power should be about 10-15% of the tile’s power budget

Potential Solutions: Optics, RF and 3D stacking


Why use Optics?

  • Lower latency

  • Higher bandwidth (WDM, SDM & TDM)

  • Increased bandwidth density (compact parallel optics)

  • Low power (1.1 mW/Gbps)

  • Bit rate independent of distance

  • Lower crosstalk

  • Does not suffer from impedance mismatch and signal reflection

  • Low signal attenuation


Electrical Interconnect

R = wire resistance per unit length

C = wire capacitance per unit length

Cp = inverter output capacitance

C0 = inverter input capacitance

Rs = inverter resistance

Sopt = inverter size

Lopt = wire distance

[Figure: RC link model, an inverter of size sopt (rs, Cp, C0) driving a wire segment (R, C) of length lopt.]
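The slide does not show which expressions were used for the optimal repeater size and spacing. As a sketch only, the code below applies the widely used first-order repeater-insertion expressions, lopt = sqrt(2·Rs·(C0 + Cp) / (R·C)) and sopt = sqrt(Rs·C / (R·C0)); treating these as the author's model is an assumption, and all numeric values are hypothetical.

```python
import math


def optimal_repeater(rs, c0, cp, r, c):
    """Return (l_opt, s_opt) from first-order repeater-insertion expressions.

    rs, c0, cp: resistance, input capacitance, and output capacitance of a
                minimum-sized inverter.
    r, c:       wire resistance and capacitance per unit length.
    """
    l_opt = math.sqrt(2.0 * rs * (c0 + cp) / (r * c))
    s_opt = math.sqrt(rs * c / (r * c0))
    return l_opt, s_opt


# Hypothetical device and wire values in SI units (r and c are per meter).
l_opt, s_opt = optimal_repeater(rs=10e3, c0=0.5e-15, cp=0.5e-15, r=1e6, c=2e-10)
print(f"l_opt = {l_opt * 1e3:.2f} mm, s_opt = {s_opt:.0f}x minimum size")
```

With repeaters placed every lopt, the delay of the link grows linearly with length rather than quadratically, at the cost of the repeaters' power.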


ITRS 2007 Transistor & Link Parameters?

Electrical link device parameters for various VLSI technologies

  • Increased wire delay due to the RC constant

  • Increase in the Ioffn & Ishortckt current parameters


Optical Interconnect

[Figure: optical link. Optical layer: off-chip laser, on-chip modulator, transmission medium, on-chip photodetector. Electronics layer: transmitter driver and buffer chain; receiver TIA and limiting amplifier.]


Micro-ring Resonators

Resonant wavelength ($\lambda_0$): $\lambda_0 \times m = n_{eff} \times 2\pi R$

  • $m$: an integer

  • $n_{eff}$: effective refractive index

  • $R$: radius of the ring resonator

[Figure: micro-ring resonator (n+ / p+ / n+ doping) between Input Port 0 and Output Ports 0 and 1, shown with the ring voltage VR = VOFF and VR = VON.]

  • CMOS compatible

  • Low power (0.1 mW)

  • Small footprint (10 µm)

  • High bandwidth (10 Gb/s)
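The resonance condition for a ring, m·λ0 = neff·2πR, can be explored numerically. The sketch below lists the resonant wavelengths that fall inside a given window; the effective index, radius, and wavelength window are hypothetical values chosen for illustration.

```python
import math


def resonant_wavelengths(n_eff, radius_um, lo_nm, hi_nm):
    """Return (m, wavelength_nm) pairs in [lo_nm, hi_nm] with m * lambda = n_eff * 2*pi*R."""
    optical_path_nm = n_eff * 2.0 * math.pi * radius_um * 1000.0
    m_min = math.ceil(optical_path_nm / hi_nm)
    m_max = math.floor(optical_path_nm / lo_nm)
    return [(m, optical_path_nm / m) for m in range(m_min, m_max + 1)]


# Hypothetical ring: n_eff = 2.5, R = 5 um, window 1530-1565 nm.
for m, wavelength in resonant_wavelengths(2.5, 5.0, 1530.0, 1565.0):
    print(f"m = {m}: {wavelength:.2f} nm")
```

Changing neff slightly, as the applied voltage does in a modulator, shifts every resonance and so moves the ring on or off a given wavelength.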


Waveguide & Receiver

[1] N. Kirman et al., “Leveraging Optical Technology in Future Bus-based Chip Multiprocessors,” 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006 Vol. 9 , Iss. 13 Dec. 2006 pg.492 – 50

[2] S. Koester et al., “Ge-on-SOI-Detector/Si-CMOS-Amplifier Receivers for High-Performance Optical-Communication Applications,” Journal of Lightwave Technology, Vol. 25, No. 1, January 2007

[3] C. Kromer et al., “A 100-mW 4X10 Gb/s Transceiver in 80-nm CMOS for High-Density Optical Interconnects,” IEEE Journal of Solid-State Circuits, Vol. 40, No. 12, December 2005

[4] D. Kuchta et al., “120-Gb/s VCSEL-based parallel-optical interconnect and custom 120-Gb/s testing station,” Journal of Lightwave Technology, Vol. 22, No. 9, pp. 2200-2212, Sept. 2004


Electrical/Optical Comparison

Power-delay product at various technology nodes for a 5 mm link.

Optics becomes more advantageous at 52 nm for global interconnects and at 45 nm for semi-global interconnects.


Critical Length

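The critical length is the link distance beyond which an optical interconnect becomes more advantageous than an electrical one.

[Figure: critical length compared against core-to-core distance.]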
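A minimal sketch of how such a crossover distance can be found, assuming simple linear cost models that are not taken from the slides: the electrical cost grows with length, while the optical cost has a fixed conversion overhead plus a small per-length term. The function name and all numbers are hypothetical.

```python
def critical_length_mm(elec_per_mm, opt_fixed, opt_per_mm):
    """Length beyond which the optical link costs less than the electrical link.

    Assumed model (same cost unit for both, e.g. power-delay product):
        electrical(L) = elec_per_mm * L
        optical(L)    = opt_fixed + opt_per_mm * L
    Returns None when the optical link never wins under this model.
    """
    if elec_per_mm <= opt_per_mm:
        return None
    return opt_fixed / (elec_per_mm - opt_per_mm)


# Hypothetical cost figures in arbitrary units.
print(critical_length_mm(elec_per_mm=4.0, opt_fixed=10.0, opt_per_mm=0.5))
```

Under this model, optics is preferred whenever the core-to-core distance exceeds the returned length.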


Advantages of PROPEL

  • Efficient use of optical components

  • Balance between optics and electronics

  • Simple network design – Low diameter, DOR

  • Scalability

  • Fault Tolerant


PROPEL’s Design

0, 1, 2, …

Broadband Light source

Tile 0

0

1

4

5

8

10

12

14

L2

L2

L2

L2

2

6

7

9

11

13

15

3

Photonic

Transceiver

L2

L2

L2

28

30

L2

16

17

20

22

24

26

Optical

Interconnect

tile

Core

Core 0

Core 1

L2

Cache

27

29

31

18

19

21

23

25

Photonic

Transceiver

40

42

44

45

32

33

36

38

L2

L2

L2

Core2

Core3

L2

41

43

46

47

34

35

37

39

L2

L2

L2

56

57

60

61

48

49

52

53

L2

58

62

63

59

50

51

54

55


PROPEL’s Routing & Wavelength Assignment (x-direction)

[Figure: x-direction routing among Tiles 0-3 (Cores 0-15, each tile with an L2 cache). A broadband signal feeds Home Channels 0-3. Wavelength labels in the figure: λ1(0,0), λ2(0,0), λ3(0,0), λ0(1,0), λ2(2,0), λ3(2,0), and the combined signals λ1(0,0)+λ2(0,0)+λ3(0,0) and λ0(1,0)+λ2(1,0)+λ3(2,0).]


PROPEL’s 64 Wavelength Design

Research has shown that 64 wavelengths can traverse a single waveguide.

[Figure: optical inter-tile communication channels fed by a laser. Tiles 0-3 each contain four cores with L1 caches, a shared L2, an X-transmitter, an X-receiver, a Y-transmitter, and a Y-receiver. The wavelength bands shown are λ(0-15), λ(16-31), λ(32-47), and λ(48-63).]
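The figure splits the 64 wavelengths into four bands of 16. A minimal sketch of that partition follows; mapping band k to tile k is an assumption made here for illustration, since the figure labels alone do not fix which tile owns which band.

```python
NUM_WAVELENGTHS = 64
NUM_TILES = 4  # tiles sharing one waveguide in the figure


def wavelength_band(tile):
    """Return the inclusive (first, last) wavelength indices assumed for `tile`."""
    if not 0 <= tile < NUM_TILES:
        raise ValueError("tile index out of range")
    per_tile = NUM_WAVELENGTHS // NUM_TILES
    first = tile * per_tile
    return first, first + per_tile - 1


for tile in range(NUM_TILES):
    first, last = wavelength_band(tile)
    print(f"Tile {tile}: wavelengths {first}-{last}")
```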


PROPEL’s x- and y-direction Implementation

[Figure: grid of tiles, each with Cores 0-3, L1 caches, a shared L2, and X/Y transmitters and receivers, together with a laser and DRAM Banks 0-15. The labels "Off-Chip" and "On-Chip" also appear in the figure.]


Memory Routing and Wavelength Assignment

[Figure: memory Banks 0-3, each with a receiver (from CMP) and a transmitter (to CMP, fed from the laser). Both directions use the wavelength bands λ0-15, λ16-31, λ32-47, and λ48-63.]


Communication Example

Tile 3 communicates with Tile 8.

[Figure: tiles connected by x-direction waveguides X0-X2 and y-direction waveguides Y0-Y2, fed by a laser. The tile detail shows Cores 0-3 with L1 caches, a shared L2 cache, X/Y transmitters and receivers, and a router with route computation (RC), virtual channel (VC), switch allocator (SA), credits in/out, and a crossbar switch.]
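The Tile 3 to Tile 8 example can be sketched as dimension-order routing (DOR). The code assumes a 4 × 4 grid with tiles numbered row by row, routing in x first and then y, with a single optical hop per dimension; these are assumptions made for illustration, not details confirmed by the slide.

```python
GRID = 4  # 4 x 4 tiles, numbered row by row (assumed)


def dor_route(src, dst):
    """Tiles visited when routing x first, then y, with one hop per dimension."""
    sx, sy = src % GRID, src // GRID
    dx, dy = dst % GRID, dst // GRID
    path = [src]
    if sx != dx:
        path.append(sy * GRID + dx)  # x hop within the source row
    if sy != dy:
        path.append(dy * GRID + dx)  # y hop within the destination column
    return path


print(dor_route(3, 8))  # [3, 0, 8]
```

Under these assumptions no route needs more than two hops, which matches the low network diameter listed among PROPEL's advantages.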


Modulation Implementation

[Figure: a broadband signal split into the bands λ0-15, λ16-31, and λ32-47, with individual modulators spanning λ0 to λ15, λ16 to λ31, and λ32 to λ47.]


Multicasting & Broadcasting

  • Multicasting: single tile to multiple tiles.

  • Broadcasting: single tile to all-tile communication.

    • Use 3 individual multicasts

[Figure: Tiles 0-15 with the sending tile and its communication links highlighted.]


Performance Evaluation

  • Cost & Component Comparison

  • Synthetic Traffic

    • OPTISM

    • Uniform, Bit-reversal, Butterfly, Complement, Matrix transpose, Perfect Shuffle

  • SPLASH-2

    • Simics with GEMS and Garnet

    • FFT, LU, Radiosity and Ocean

  • Network topologies evaluated

    • Electrical: Mesh, Cmesh and Flattened-butterfly

    • Optical: Circuit-switch, Shared-bus and Corona


Electronic Parameters

[Figure: router with route computation (RC), virtual channel (VC), and switch allocator (SA) blocks, credits in/out, a crossbar switch, +X, -X, +Y, -Y ports, and a processing element (PE).]

Crossbar (0.8 mW/flit):

$E_{sw} = w_f \times (C_{xbi} + C_{xbo}) V_{DD}^2$

VC Buffer (4.03 mW/flit):

$P_{write} = P_{wordline} + (2 \times F \times P_{bitline}) + (F \times P_{memory\text{-}cell})$

$P_{read} = P_{wordline} + F \times (P_{bitline_r} + P_{chg})$

Electrical Link (22 mW/mm):

$P_{link} = P_{dynamic} + P_{leakage} + P_{short\text{-}ckt}$
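Taking the per-component figures quoted on this slide (crossbar 0.8 mW/flit, VC buffer 4.03 mW/flit, electrical link 22 mW/mm), a rough per-hop estimate can be sketched as follows. Adding the three terms together and the 2.5 mm link length are assumptions made for illustration.

```python
CROSSBAR_MW_PER_FLIT = 0.8    # from the slide
VC_BUFFER_MW_PER_FLIT = 4.03  # from the slide
LINK_MW_PER_MM = 22.0         # from the slide


def electrical_hop_power_mw(link_length_mm):
    """Power for one flit crossing one router (buffer + crossbar) and one link."""
    router = VC_BUFFER_MW_PER_FLIT + CROSSBAR_MW_PER_FLIT
    return router + LINK_MW_PER_MM * link_length_mm


# The link length is a hypothetical value, not taken from the slides.
print(f"{electrical_hop_power_mw(2.5):.2f} mW")
```

Even for a short link the link term dominates the router terms, which is the motivation for replacing long electrical links with optics.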


Optical Parameters

[Figure: optical link. Optical layer: off-chip laser, on-chip modulator, transmission medium, on-chip photodetector. Electronics layer: driver and buffer chain on the transmit side; TIA and limiting amplifier on the receive side.]

  • Micro-ring Modulator (0.1 mW)

  • Receiver Circuitry (1.1 mW/Gbps)
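Using the two figures quoted for the optical link (0.1 mW per micro-ring modulator and 1.1 mW/Gbps for the receiver circuitry), the power of a WDM link can be estimated as below. One modulator and one receiver per wavelength, the exclusion of laser power, and the 16 wavelengths at 10 Gbps are all assumptions made for illustration.

```python
MODULATOR_MW = 0.1          # per micro-ring modulator, from the slide
RECEIVER_MW_PER_GBPS = 1.1  # receiver circuitry, from the slide


def optical_link_power_mw(num_wavelengths, gbps_per_wavelength):
    """Modulator plus receiver power for a WDM link; laser power is not included."""
    per_wavelength = MODULATOR_MW + RECEIVER_MW_PER_GBPS * gbps_per_wavelength
    return num_wavelengths * per_wavelength


# Hypothetical link: 16 wavelengths at 10 Gbps each.
print(f"{optical_link_power_mw(16, 10):.1f} mW")
```

Note that this estimate does not depend on link length, in contrast to the per-millimeter cost of an electrical link.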


Component Comparison

PROPEL is the most cost-effective NoC.


Synthetic Traffic Trace

  • Uniform traffic: each packet’s destination is equally likely to be any node.

  • Bit-reversal:

    • Source: $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$; Destination: $a_0, a_1, \ldots, a_{n-2}, a_{n-1}$

  • Butterfly:

    • Source: $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$; Destination: $a_0, a_{n-2}, \ldots, a_1, a_{n-1}$

  • Complement:

    • Source: $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$; Destination: $a_{n-1}', a_{n-2}', \ldots, a_1', a_0'$

  • Matrix transpose:

    • Source: $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$; Destination: $a_{n/2-1}, \ldots, a_0, a_{n-1}, \ldots, a_{n/2}$

  • Perfect shuffle:

    • Source: $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$; Destination: $a_{n-2}, a_{n-3}, \ldots, a_0, a_{n-1}$
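The bit-permutation patterns above can be written as small functions on the binary node address. This is a sketch for illustration, not the simulator's code; the function names are hypothetical, and matrix transpose assumes an even number of address bits.

```python
def bit_reversal(src, n_bits):
    """Destination bits are the source bits in reverse order."""
    dst = 0
    for i in range(n_bits):
        if (src >> i) & 1:
            dst |= 1 << (n_bits - 1 - i)
    return dst


def butterfly(src, n_bits):
    """Swap the most and least significant bits."""
    msb = (src >> (n_bits - 1)) & 1
    lsb = src & 1
    if msb != lsb:
        src ^= (1 << (n_bits - 1)) | 1
    return src


def complement(src, n_bits):
    """Invert every address bit."""
    return ~src & ((1 << n_bits) - 1)


def matrix_transpose(src, n_bits):
    """Swap the upper and lower halves of the address (n_bits must be even)."""
    half = n_bits // 2
    low = src & ((1 << half) - 1)
    return (low << half) | (src >> half)


def perfect_shuffle(src, n_bits):
    """Rotate the address left by one bit."""
    msb = (src >> (n_bits - 1)) & 1
    return ((src << 1) & ((1 << n_bits) - 1)) | msb


# 64 nodes need 6 address bits; node 3 is 000011.
for pattern in (bit_reversal, butterfly, complement, matrix_transpose, perfect_shuffle):
    print(f"{pattern.__name__}: 3 -> {pattern(3, 6)}")
```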


Uniform Traffic Throughput

  • 25% improvement over Mesh

  • 9% improvement over Flattened-butterfly

  • Over 2× increase in performance over Circuit-switch, Cmesh and Shared-bus


Uniform Traffic Latency

  • PROPEL saturates at a network load of 0.5

  • Saturates at a network load 0.1 higher than Flattened-butterfly

  • Saturates at a 2× higher network load than Shared-bus and Circuit-switch


All Traffic Saturation Throughput


Bit-Reversal Traffic Latency

  • PROPEL saturates at a network load of 0.25

  • Saturates at a network load 0.25 higher than Flattened-butterfly

  • Saturates at a 1.5× higher network load than Shared-bus and Circuit-switch


Complement Traffic Latency

  • Networks with core concentration create communication hotspots.


Matrix Transpose Traffic Latency

  • PROPEL saturates at a network load of 0.3

  • Circuit-switch saturates higher than the electrical networks


Synthetic Traffic Power Dissipation

5× Reduction In Power


Simics Parameters

  • Simics is a full system simulator from Virtutech


SPLASH-2 Benchmarks

  • The FFT kernel is a 1-dimensional version of the radix-√n six-step FFT algorithm.

  • The LU kernel factors a dense matrix into the product of a lower triangular and an upper triangular matrix.

  • Radiosity is a graphics kernel that computes the equilibrium distribution of light in a scene.

  • The Ocean application evaluates the boundary and eddy currents of large-scale ocean movements.


SPLASH-2 Speed-Up


Conclusion

  • PROPEL is a low-power, high-bandwidth NoC for future many-core processors.

  • PROPEL uses electronics for packet switching and optics for inter-router communication, allowing for a reduction in electrical and optical components.

  • PROPEL uses the fewest optical components and consumes the least area when compared to other opto-electronic networks.

  • PROPEL outperforms well-known network topologies while dissipating less power.


Questions?


Future Work

  • Use optics to go to memory

  • Dynamic Bandwidth

  • Dynamic Voltage Scaling

  • Application Integration with the NoC


Examples of NoCs (1/2)

[Figure: torus and mesh topologies, each built from cores, routers, and links.]

Torus

  • Advantages

    • Reduced hop count

    • DOR routing

  • Disadvantages

    • Difficult to integrate on-chip

Mesh

  • Advantages

    • Simple to integrate on-chip

    • DOR routing

  • Disadvantages

    • High hop count
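The hop-count difference between the two topologies can be checked by brute force. The sketch below computes the worst-case minimal hop count (network diameter) of a k × k mesh and a k × k torus, assuming row-by-row node numbering; the function names are illustrative.

```python
def mesh_hops(src, dst, k):
    """Minimal hop count between two nodes of a k x k mesh."""
    sx, sy, dx, dy = src % k, src // k, dst % k, dst // k
    return abs(sx - dx) + abs(sy - dy)


def torus_hops(src, dst, k):
    """Minimal hop count in a k x k torus, where each dimension wraps around."""
    sx, sy, dx, dy = src % k, src // k, dst % k, dst // k
    x = abs(sx - dx)
    y = abs(sy - dy)
    return min(x, k - x) + min(y, k - y)


def diameter(hops, k):
    """Largest minimal hop count over all source/destination pairs."""
    return max(hops(s, d, k) for s in range(k * k) for d in range(k * k))


print(diameter(mesh_hops, 8), diameter(torus_hops, 8))  # 14 8
```

For 64 nodes the wrap-around links cut the worst case from 14 hops to 8, at the price of the long wires that make a torus harder to lay out on-chip.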


Examples of NoCs (2/2)

Flattened-butterfly

  • Advantages

    • Max hop count of 2

    • Reduced power dissipation

  • Disadvantages

    • Not easily scalable

Cmesh

  • Advantages

    • Reduced network diameter

    • Fewer routers

  • Disadvantages

    • Multiple cores share the same ports


PROPEL Multicasting Example

Multicast example: Tile 0 sends the same data to Tiles 1, 2, and 3.

[Figure: Tiles 0-3 fed by a laser, each tile with four cores and L1 caches, a shared L2, and X/Y transmitters and receivers.]


PROPEL’s Implementation (3/4)

[Figure: transmitters and receivers of Tiles 0-3, fed by an off-chip laser, with paths to and from memory. Each tile's transmitters and receivers use the wavelength bands λ0-15, λ16-31, λ32-47, and λ48-63.]


PROPEL’s Design: 64-Wavelength Assignment

  • Research has shown that 64 wavelengths can traverse a single waveguide.

    • The wavelengths used for PROPEL are extended from 4 to 64.


PROPEL Broadcasting

  • Single tile to all-tile communication.

    • Use 3 individual multicasts

[Figure: Tiles 0-15 with the sending tile and its communication links highlighted.]


Electrical Link Power Dissipation

Optical Power Dissipation

