Advisor: Avinash Kodi

PROPEL: Power & Area-Efficient, Scalable Opto-Electronic Network-on-Chips (NoCs). Thesis Defense. Randy W. Morris, Jr. Affiliation: EECS, Ohio University. E-mail: rm700603@ohio.edu. Advisor: Avinash Kodi. Outline: Motivation & Background; PROPEL: Architecture


Presentation Transcript


  1. PROPEL: Power & Area-Efficient, Scalable Opto-Electronic Network-on-Chips (NoCs) Thesis Defense Randy W. Morris, Jr. Affiliation: EECS, Ohio University E-mail: rm700603@ohio.edu Advisor: Avinash Kodi

  2. Outline • Motivation & Background • PROPEL: Architecture • PROPEL: Implementation • Performance Analysis • Conclusion

  3. Why Chip Multi-Processor? (1/2) After 2002, single-core designs show diminishing returns. Courtesy: J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Morgan Kaufmann, San Francisco, 2007.

  4. Why Chip Multi-Processor? (2/2) Courtesy: G. Konstadinidis et al., “Architecture and Physical Implementation of a Third Generation 65 nm, 16 Core, 32 Thread Chip-Multithreading SPARC Processor” Examples: RAW, Core 2 Duo, Quad Core, UltraSPARC

  5. Wire Delay Problem [Figure: three 20 mm × 20 mm chips labeled Past, Present and Future, with numbered cores growing up to 64] • Wire delay is proportional to the wire’s RC constant. Resistance increases while capacitance remains constant.

  6. Network-on-Chip (NoC) [Figure: 16 processing cores (Core 0 to Core 15), each attached to a router, with routers connected by links in the +X, -X, +Y and -Y directions. Router components: Route Computation (RC), Virtual Channel (VC), Switch Allocator (SA), Crossbar Switch, Credits In/Out.]

  7. Power Dissipation Intel Tera-Flops (65 nm) [Figure: tile power and routing power breakdown] Courtesy: Y. Hoskote, “A 5-GHz Mesh Interconnect for A Teraflops Processor,” IEEE Computer Society, 2007, pp. 51-61 • 28% of a tile’s overall power is consumed by the router and links • Link power will become a larger share of a router’s overall power in future VLSI technologies • Router and link power should be about 10-15% of the tile’s power budget • Potential solutions: optics, RF and 3D stacking

  8. Why use Optics? • Lower latency • Higher bandwidth (WDM, SDM & TDM) • Increased bandwidth density (compact parallel optics) • Low power (1.1 mW/Gbps) • Bit-rate independent of distance • Lower cross-talk • Does not suffer from impedance mismatch and signal reflection • Low signal attenuation

  9. Electrical Interconnect • R = wire resistance per unit length • C = wire capacitance per unit length • Cp = inverter output capacitance • C0 = inverter input capacitance • Rs = inverter resistance • sopt = inverter size • lopt = wire distance [Figure: RC link made of wire segments of length lopt (R, C) driven by inverters of size sopt (Rs, Cp, C0)]
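The slide names the symbols of a repeated RC link, but the transcript does not preserve any formula. As a rough sketch, assuming the classic first-order repeater-insertion model (an assumption, not something taken from the slide), the optimal segment length and repeater size could be estimated as below. The function name and the sample values are hypothetical.

```python
import math

def optimal_repeater(r, c, rs, c0, cp):
    """First-order repeater sizing for a distributed RC wire.

    r, c   : wire resistance and capacitance per unit length
    rs     : output resistance of a minimum-sized inverter
    c0, cp : input and output capacitance of a minimum-sized inverter
    Returns (l_opt, s_opt). Assumed textbook model, not the thesis's exact one.
    """
    # Segment length that minimizes delay per unit length.
    l_opt = math.sqrt(2.0 * rs * (c0 + cp) / (r * c))
    # Repeater size as a multiple of the minimum-sized inverter.
    s_opt = math.sqrt(rs * c / (r * c0))
    return l_opt, s_opt

# Hypothetical, unit-free values chosen only to exercise the function.
l_opt, s_opt = optimal_repeater(r=1.0, c=1.0, rs=2.0, c0=1.0, cp=1.0)
print(l_opt, s_opt)
```

In this model the wire's resistance per unit length sits in the denominator of both expressions, so a more resistive wire calls for shorter segments and smaller repeaters.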

  10. ITRS 2007 Transistor & Link Parameters Electrical link device parameters for various VLSI technologies • Wire delay increases due to the RC constant • The Ioffn and Ishortckt current parameters increase

  11. Optical Interconnect [Figure: optical link. On-chip optical layer: off-chip laser, on-chip modulator, transmission medium, photodetector. Electronics layer: transmitter with buffer chain and driver; receiver with TIA and limiting amplifier.]

  12. Micro-ring Resonators Resonant wavelength (λ0): $\lambda_0 \times m = n_{eff} \times 2\pi R$, where m is an integer, $n_{eff}$ is the effective refractive index and R is the radius of the ring resonator. [Figure: ring resonator (n+ / p+ / n+) with bias voltage VR; with VR = VOFF the input at Input Port 0 leaves on Output Port 0, and with VR = VON it is switched to Output Port 1] • CMOS compatible • Low power (0.1 mW) • Small footprint (10 um) • High bandwidth (10 Gb/s)
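To make the resonance condition concrete, here is a minimal sketch that solves m × λ0 = neff × 2πR for λ0. The function name and the sample values (an effective index of 2.5, a 5 um radius, mode number 50) are illustrative assumptions, not figures from the thesis.

```python
import math

def resonant_wavelength(n_eff, radius, m):
    """Return the wavelength that satisfies m * lambda0 = n_eff * 2*pi*R.

    The result is in the same length unit as `radius`.
    """
    if m <= 0:
        raise ValueError("m must be a positive integer")
    return n_eff * 2.0 * math.pi * radius / m

# Hypothetical ring: n_eff = 2.5, R = 5 um, mode number m = 50.
lam = resonant_wavelength(n_eff=2.5, radius=5e-6, m=50)
print(f"{lam * 1e9:.1f} nm")
```

Changing the effective index shifts the resonant wavelength proportionally, which is how a bias voltage on the ring can move it on or off a given channel.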

  13. Waveguide & Receiver [1] N. Kirman et al., “Leveraging Optical Technology in Future Bus-based Chip Multiprocessors,” 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, Vol. 9, Iss. 13, Dec. 2006, pg. 492 – 50 [2] S. Koester et al., “Ge-on-SOI-Detector/Si-CMOS-Amplifier Receivers for High-Performance Optical-Communication Applications,” Journal of Lightwave Technology, Vol. 25, No. 1, January 2007 [3] C. Kromer et al., “A 100-mW 4X10 Gb/s Transceiver in 80-nm CMOS for High-Density Optical Interconnects,” IEEE Journal of Solid-State Circuits, Vol. 40, No. 12, December 2005 [4] D. Kuchta et al., “120-Gb/s VCSEL-based parallel-optical interconnect and custom 120-Gb/s testing station,” Journal of Lightwave Technology, Vol. 22, No. 9, pp. 2200-2212, Sept. 2004

  14. Electrical/Optical Comparison Power-delay product at various technology nodes for a 5 mm link. Optics becomes more advantageous at 52 nm for global and 45 nm for semi-global interconnects.

  15. Critical Length Critical length is the distance beyond which an optical link becomes more advantageous than an electrical one. [Figure axis: core-to-core distance]
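The idea of a critical length can be illustrated with a deliberately simple model. The sketch below assumes an electrical cost that grows linearly with distance and an optical cost made of a fixed conversion overhead plus an optional per-mm term. These linear models, the function name and the parameters are hypothetical and stand in for the power-delay curves the thesis actually compares.

```python
def critical_length(elec_per_mm, opt_fixed, opt_per_mm=0.0):
    """Distance (mm) beyond which the optical link is cheaper.

    Electrical cost: elec_per_mm * L
    Optical cost:    opt_fixed + opt_per_mm * L
    Returns None when the optical link never becomes cheaper.
    """
    slope_gap = elec_per_mm - opt_per_mm
    if slope_gap <= 0:
        return None
    return opt_fixed / slope_gap

# Hypothetical costs in arbitrary units.
print(critical_length(elec_per_mm=2.0, opt_fixed=10.0))
```

Under this model, links shorter than the returned distance favor electrical wires and longer ones favor optics.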

  16. Advantages of PROPEL • Efficient use of optical components • Balance between optics and electronics • Simple network design – Low diameter, DOR • Scalability • Fault Tolerant

  17. PROPEL’s Design [Figure: 64-core layout (cores 0 to 63) fed by a broadband light source (wavelengths 0, 1, 2, …). Cores are grouped four to a tile (Tile 0 holds Core 0 to Core 3) around an L2 cache, and each tile connects to the optical interconnect through photonic transceivers.]

  18. PROPEL’s Routing & Wavelength Assignment (x-direction) [Figure: Tile 0 to Tile 3, each with four cores (Core 0 to Core 15 overall) and an L2 cache, share a broadband signal. Home Channel 0 to Home Channel 3 carry wavelength-multiplexed signals; labels shown include λ1(0,0)+λ2(0,0)+λ3(0,0) and λ0(1,0)+λ2(1,0)+λ3(2,0).]

  19. PROPEL’s 64 Wavelength Design Research has shown that 64 wavelengths can traverse a single waveguide. [Figure: optical inter-tile communication channels fed by a laser. Tile 0 to Tile 3 each contain four cores with L1 caches, a shared L2, and X/Y transmitters and receivers; the wavelength bands shown are λ(0-15), λ(16-31), λ(32-47) and λ(48-63).]

  20. PROPEL’s x- and y-direction Implementation [Figure: grid of tiles connected in the x- and y-directions and fed by an off-chip laser. Each tile holds Core 0 to Core 3 with L1 caches, a shared L2, and X/Y transmitters and receivers. On-chip DRAM is shown as Bank 0 to Bank 15.]

  21. Memory Routing and Wavelength Assignment [Figure: memory Bank 0 to Bank 3. Receivers take λ0-15, λ16-31, λ32-47 and λ48-63 from the CMP; transmitters, fed from the laser, send λ0-15, λ16-31, λ32-47 and λ48-63 to the CMP.]

  22. Communication Example Tile 3 communicates with Tile 8. [Figure: tile grid fed by a laser. Each tile’s router has Route Computation (RC), Virtual Channel (VC), Switch Allocator (SA), Credits In/Out and a Crossbar Switch with ports X0 to X2 and Y0 to Y2, connecting the X/Y transmitters and receivers to Core 0 to Core 3, their L1 caches and the shared L2 cache.]
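The Tile 3 to Tile 8 example can be traced with a small dimension-order routing (DOR) sketch. It assumes a 4×4 grid with row-major tile numbering, x-direction traversal before y-direction, and a single optical hop per dimension. These are assumptions read from the figure labels rather than details the transcript confirms, and `dor_route` is a hypothetical helper.

```python
def dor_route(src, dst, width=4):
    """List the tiles visited from src to dst, x-direction first, then y."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    path = [src]
    if sx != dx:
        path.append(sy * width + dx)  # one optical hop along the row
    if sy != dy:
        path.append(dy * width + dx)  # one optical hop along the column
    return path

print(dor_route(3, 8))  # [3, 0, 8]
```

Under these assumptions no route needs more than two hops, which fits the low network diameter claimed for the design.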

  23. Modulation Implementation [Figure: broadband signal with wavelength bands λ0-15, λ16-31 and λ32-47; the individual wavelengths λ0 … λ15, λ16 … λ31 and λ32 … λ47 are labeled.]

  24. Multicasting & Broadcasting • Multicasting: single tile to multiple tiles. • Broadcasting: single tile to all-tile communication. • Use 3 individual multicasts [Figure: Tile 0 to Tile 15, showing the sending tile and its communication links]

  25. Performance Evaluation • Cost & component comparison • Synthetic traffic (OPTISM): Uniform, Bit-reversal, Butterfly, Complement, Matrix transpose, Perfect shuffle • SPLASH-2 (Simics with GEMS and Garnet): FFT, LU, Radiosity and Ocean • Network topologies evaluated: Electrical (Mesh, Cmesh and Flattened-butterfly) and Optical (Circuit-switch, Shared-bus and Corona)

  26. Electronic Parameters • Crossbar (0.8 mW/flit): $E_{sw} = wf \times (C_{xbi} + C_{xbo}) V_{DD}^2$ • VC buffer (4.03 mW/flit): $P_{write} = P_{wordline} + (2 \times F \times P_{bitline}) + (F \times P_{memory\text{-}cell})$ and $P_{read} = P_{wordline} + F \times (P_{bitliner} + P_{chg})$ • Electrical link (22 mW/mm): $P_{link} = P_{dynamic} + P_{leakage} + P_{short\text{-}ckt}$ [Figure: router with Route Computation (RC), Virtual Channel (VC), Switch Allocator (SA), Crossbar Switch, Credits In/Out, +X/-X/+Y/-Y ports and Processing Element (PE)]
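The expressions on this slide translate directly into small helper functions. The sketch below only mirrors the formulas as written: the function and argument names are hypothetical, the meaning of `wf` and `F` (presumably related to flit width) is an assumption, and the sample inputs are arbitrary numbers rather than thesis data.

```python
def crossbar_energy(wf, c_xbi, c_xbo, vdd):
    """E_sw = wf * (C_xbi + C_xbo) * V_DD^2."""
    return wf * (c_xbi + c_xbo) * vdd ** 2

def buffer_write_power(p_wordline, p_bitline, p_memory_cell, f):
    """P_write = P_wordline + 2*F*P_bitline + F*P_memory-cell."""
    return p_wordline + 2 * f * p_bitline + f * p_memory_cell

def buffer_read_power(p_wordline, p_bitline_r, p_chg, f):
    """P_read = P_wordline + F * (P_bitliner + P_chg)."""
    return p_wordline + f * (p_bitline_r + p_chg)

def link_power(p_dynamic, p_leakage, p_short_ckt):
    """P_link = P_dynamic + P_leakage + P_short-ckt."""
    return p_dynamic + p_leakage + p_short_ckt

# Arbitrary inputs, used only to show how the terms combine.
print(crossbar_energy(wf=2, c_xbi=1, c_xbo=3, vdd=2))
print(buffer_write_power(p_wordline=1, p_bitline=2, p_memory_cell=3, f=4))
print(buffer_read_power(p_wordline=1, p_bitline_r=2, p_chg=3, f=4))
print(link_power(1, 2, 3))
```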

  27. Optical Parameters • Receiver circuitry (1.1 mW/Gbps) • Micro-ring modulator (0.1 mW) [Figure: optical link. On-chip optical layer: off-chip laser, on-chip modulator, transmission medium, photodetector. Electronics layer: buffer chain and driver, TIA and limiting amplifier.]

  28. Component Comparison PROPEL is the most cost-effective NoC!

  29. Synthetic Traffic Trace • Uniform: each node is equally likely to be a packet’s destination. • Bit-reversal: source $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$ → destination $a_0, a_1, \ldots, a_{n-2}, a_{n-1}$ • Butterfly: source $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$ → destination $a_0, a_{n-2}, \ldots, a_1, a_{n-1}$ • Complement: source $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$ → destination $a_{n-1}', a_{n-2}', \ldots, a_1', a_0'$ • Matrix transpose: source $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$ → destination $a_{n/2-1}, \ldots, a_0, a_{n-1}, \ldots, a_{n/2}$ • Perfect shuffle: source $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$ → destination $a_{n-2}, a_{n-3}, \ldots, a_0, a_{n-1}$
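The bit-permutation patterns on this slide can be written as short functions on n-bit node addresses. This is a minimal sketch, not the OPTISM implementation: the function names are hypothetical, and matrix transpose is taken to swap the upper and lower halves of the address, which requires an even n.

```python
def bit_reversal(src, n):
    """Destination bits are the source bits in reverse order."""
    return int(format(src, f"0{n}b")[::-1], 2)

def butterfly(src, n):
    """Swap the most and least significant bits."""
    msb = (src >> (n - 1)) & 1
    lsb = src & 1
    if msb != lsb:
        src ^= (1 << (n - 1)) | 1
    return src

def complement(src, n):
    """Invert every address bit."""
    return src ^ ((1 << n) - 1)

def matrix_transpose(src, n):
    """Swap the upper and lower halves of the address (n must be even)."""
    half = n // 2
    return ((src << half) | (src >> half)) & ((1 << n) - 1)

def perfect_shuffle(src, n):
    """Rotate the address left by one bit."""
    return ((src << 1) | (src >> (n - 1))) & ((1 << n) - 1)

# 64 nodes use 6-bit addresses; node 52 is 0b110100.
n, src = 6, 0b110100
for pattern in (bit_reversal, butterfly, complement,
                matrix_transpose, perfect_shuffle):
    print(pattern.__name__, format(pattern(src, n), f"0{n}b"))
```

Each function is a permutation of the node addresses, so every node sends to exactly one destination and receives from exactly one source under these patterns.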

  30. Uniform Traffic Throughput • 25% improvement over Mesh • 9% improvement over Flattened-butterfly • Over 2× increase in performance over Circuit-switch, Cmesh and Shared-bus

  31. Uniform Traffic Latency • PROPEL saturates at a network load of 0.5 • Saturates at a network load 0.1 higher than Flattened-butterfly • Saturates at a 2× higher network load than Shared-bus and Circuit-switch.

  32. All Traffic Saturation Throughput

  33. Bit-Reversal Traffic Latency • PROPEL saturates at a network load of 0.25 • Saturates at a network load 0.25 higher than Flattened-butterfly • Saturates at a 1.5× higher network load than Shared-bus and Circuit-switch.

  34. Complement Traffic Latency • Networks with core concentration create communication hotspots.

  35. Matrix Transpose Traffic Latency • PROPEL saturates at a network load of 0.3 • Circuit-switch saturates higher than the electrical networks

  36. Synthetic Traffic Power Dissipation 5× Reduction In Power

  37. Simics Parameters • Simics is a full system simulator from Virtutech

  38. SPLASH-2 Benchmarks • The FFT kernel is a 1-dimensional version of the radix-√n six-step FFT algorithm. • The LU kernel factors a dense matrix into lower and upper triangular matrices. • Radiosity is a graphics kernel that calculates the equilibrium distribution of light in a scene. • The Ocean application evaluates the boundary and eddy currents of large-scale ocean movements.

  39. SPLASH-2 Speed-Up

  40. Conclusion • PROPEL is a low-power, high-bandwidth NoC for future many-core processors. • PROPEL uses electronics for packet switching and optics for inter-router communication, allowing for a reduction in electrical and optical components. • PROPEL uses the fewest optical components and consumes the least area when compared to other opto-electronic networks. • PROPEL outperforms well-known network topologies while dissipating less power.

  41. QUESTION?

  42. Future Work • Use optics to go to memory • Dynamic Bandwidth • Dynamic Voltage Scaling • Application Integration with the NoC

  43. Examples of NoCs (1/2) [Figure: Torus and Mesh networks built from cores, routers and links] • Torus: Advantages: reduced hop count, DOR. Disadvantages: difficult to integrate on-chip. • Mesh: Advantages: simple to integrate on-chip, DOR. Disadvantages: high hop count.

  44. Examples of NoCs (2/2) • Flattened-butterfly: Advantages: max hop count of 2, reduced power dissipation. Disadvantages: not easily scalable. • Cmesh: Advantages: reduced network diameter, fewer routers. Disadvantages: multiple cores share the same ports.

  45. PROPEL Multicasting Example Multicast example: Tile 0 communicates the same data to Tiles 1, 2 and 3. [Figure: Tile 0 to Tile 3 fed by a laser, each with four cores and L1 caches, a shared L2, and X/Y transmitters and receivers]

  46. PROPEL’s Implementation (3/4) [Figure: transmitters and receivers of Tile 0 to Tile 3 fed by an off-chip laser, with the wavelength bands λ0-15, λ16-31, λ32-47 and λ48-63 routed to and from memory]

  47. PROPEL’s Design: 64-Wavelength Assignment • Research has shown that 64 wavelengths can traverse a single waveguide. • The wavelengths used for PROPEL are extended from 4 to 64.
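One way to picture the extension from 4 to 64 wavelengths is to replace each of the four original wavelengths with a band of 16, matching the λ0-15, λ16-31, λ32-47 and λ48-63 labels that appear on other slides. The helper below is a hypothetical sketch of that indexing only. Which tile owns which band cannot be recovered from the transcript, so the channel index here is an assumption.

```python
def wavelength_band(channel, band_width=16):
    """Return the wavelength indices that replace original wavelength `channel`."""
    start = channel * band_width
    return range(start, start + band_width)

for channel in range(4):
    band = wavelength_band(channel)
    print(f"channel {channel}: wavelengths {band[0]}-{band[-1]}")
```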

  48. PROPEL Broadcasting • Single tile to all-tile communication. • Use 3 individual multicasts [Figure: Tile 0 to Tile 15, showing the sending tile and its communication links]

  49. Electrical Link Power Dissipation Optical Power Dissipation
