
ATAC: Multicore Processor with On-Chip Optical Network

ATAC: Multicore Processor with On-Chip Optical Network. George Kurian, Jason E. Miller, James Psota, Jonathan Eastep, Jifeng Liu, Jurgen Michel, Lionel C. Kimerling, Anant Agarwal Massachusetts Institute of Technology Cambridge, MA 02139


Presentation Transcript


  1. ATAC: Multicore Processor with On-Chip Optical Network George Kurian, Jason E. Miller, James Psota, Jonathan Eastep, Jifeng Liu, Jurgen Michel, Lionel C. Kimerling, Anant Agarwal Massachusetts Institute of Technology Cambridge, MA 02139 Presenter: Yeluo Chen

  2. Background
  • The number of transistors on a chip doubles every two years
  • Multicore processors will have 1,000 cores or more within the next decade
  • Challenges must be overcome in order to improve performance
  • ATAC addresses these challenges

  3. “Moore’s Gap”
  • Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
  • Wire delays
  • Power envelopes
  [Figure: performance (GOPS) vs. time, 1992-2010, showing transistor counts outpacing the performance delivered by pipelining, OOO superscalar, SMT/FGMT/CGMT, multicore, and tiled multicore designs: the GOPS gap.]

  4. Multicore Scaling Trends
  • Today: a few large cores on each chip
  • Diminishing returns prevent cores from getting more complex
  • Only option for future scaling is to add more cores
  • Still some shared global structures: bus, L2 caches
  • Tomorrow: 100s to 1000s of simpler cores [S. Borkar, Intel, 2007]
  • Simple cores are more power- and area-efficient
  • Global structures do not scale; all resources must be distributed
  [Figure: a bus-based chip with a few cores sharing an L2 cache, versus a tiled mesh of many simple cores, each with its own memory and switch.]

  5. The Future of Multicore
  • Number of cores doubles every 18 months
  • Parallelism replaces clock-frequency scaling and core complexity
  • Resulting challenges: scalability, programming, power
  • Examples: IBM PowerXCell 8i, Tilera TILE64, MIT Raw, Sun UltraSPARC T2

  6. Multicore Challenges
  • Scalability: How do we turn additional cores into additional performance? We must accelerate single applications, not just run more applications in parallel. Efficient core-to-core communication is crucial, as are architectures that grow easily with each new technology generation.
  • Programming: Traditional parallel programming techniques are hard. Parallel machines were once rare and used only by rocket scientists; multicores are ubiquitous and must be programmable by anyone.
  • Power: Already a first-order design constraint. More cores and more communication mean more power, and previous tricks (e.g., lowering Vdd) are running out of steam.

  7. Multicore Communication Today
  • Bus-based interconnect: a single shared resource
  • Uniform communication cost
  • Communication through memory
  • Doesn’t scale to many cores due to contention and long wires
  • Scalable up to about 8 cores
  [Figure: a few cores with private caches sharing a bus, an L2 cache, and DRAM.]

  8. Multicore Communication Tomorrow
  • Point-to-point mesh network; examples: MIT Raw, Tilera TILEPro64, Intel Terascale prototype
  • Neighboring tiles are connected
  • Distributed communication resources
  • Non-uniform costs: latency depends on distance
  • Encourages direct communication
  • More energy-efficient than a bus
  • Scalable to hundreds of cores
  [Figure: a tiled mesh of cores, each with local memory and a switch, with DRAM attached at the edges.]
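
To make the non-uniform costs concrete, here is a minimal sketch (not from the paper) of a distance-based cost model for an XY-routed mesh; the 1 ns/hop latency and per-hop energy are illustrative assumptions, the latter borrowed from the 150 fJ/mm/bit figure quoted later on slide 26.

# Illustrative cost model for a point-to-point mesh (assumptions noted above).

def mesh_hops(src, dst, width):
    """Hops between two tiles under XY routing on a width x width mesh."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)

def mesh_cost(src, dst, width, ns_per_hop=1.0, fj_per_bit_per_hop=150.0):
    """(latency in ns, energy in fJ/bit) for one unicast message."""
    hops = mesh_hops(src, dst, width)
    return hops * ns_per_hop, hops * fj_per_bit_per_hop

print(mesh_cost(0, 1, 8))    # neighbor on an 8x8 mesh: (1.0, 150.0)
print(mesh_cost(0, 63, 8))   # corner to corner: 14 hops, (14.0, 2100.0)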

  9. ATAC Advantages
  • ATAC processor architecture: on-chip optical communication
  • Wavelength Division Multiplexing (WDM): simultaneously carries multiple independent signals on different wavelengths (e.g., 64 WDM channels are equivalent to a 64-bit electrical bus)
  • Eliminates communication contention
  • Low loss and less power: no periodic repeaters required
  • Eliminates multiple hops between cores at large scale

  10. ATAC Optical Building Blocks
  • Light source (optical power supply): generated by an off-chip laser (~1.5 W) and coupled into an on-chip waveguide
  • Waveguide: on-chip channel for light transmission, made of Si and manufactured with a standard CMOS process
  • Optical filter (ring resonator) and modulator: couples a specific wavelength from the power-supply waveguide to the data waveguide; the modulator translates an electrical signal into an optical signal and places it onto the waveguide
  • Photodetector: absorbs photons and outputs an electrical signal

  11. Optical bit transmission • Putting it together • Optical data transmission from one core to another

  12. ATAC Architecture
  • Electrical mesh interconnect
  • Optical broadcast WDM interconnect, consisting of optical waveguides
  [Figure: the tiled electrical mesh of cores and switches overlaid with the optical broadcast network.]

  13. The 1000-Core ATAC
  • The purely optical design scales to about 64 cores
  • Beyond that, clusters of cores share optical hubs on the global optical interconnect: the ANet
  • ENet and BNet, dedicated special-purpose electrical networks, move data to and from each optical hub
  • Electrical networks connect 16 cores to each optical hub, giving 64 optically-connected clusters
  [Figure: one core (processor, cache, directory) connected through the ENet and BNet to its cluster’s hub on the ONet, with paths to memory.]
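
As a rough sketch of how messages might flow in this clustered organization, the snippet below assumes a linear core numbering with 16 cores per cluster and assumes that intra-cluster traffic stays on the electrical mesh; the function names are illustrative, not the paper’s.

# Hypothetical core-to-hub mapping for a 1024-core ATAC (64 clusters of 16).

CORES_PER_CLUSTER = 16

def hub_of(core_id):
    """Cluster (and hence optical hub) serving a given core."""
    return core_id // CORES_PER_CLUSTER

def networks_traversed(src, dst):
    """Per the slide: ENet carries data from a core up to its hub, the
    optical ONet connects hubs, and BNet fans data out from a hub to its
    cores. Intra-cluster routing is an assumption here."""
    if hub_of(src) == hub_of(dst):
        return ["EMesh"]                 # assumed: stays electrical
    return ["ENet", "ONet", "BNet"]

print(hub_of(0), hub_of(1023))           # 0 63
print(networks_traversed(3, 900))        # ['ENet', 'ONet', 'BNet']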

  14. Optical Broadcast Network
  • A waveguide passes through every core
  • Multiple wavelengths (WDM) eliminate contention: each core sends data using a different wavelength
  • Low latency: a signal reaches all cores in <3 ns
  • The same signal can be received by all cores: data is sent once and any or all cores can receive it, an efficient broadcast

  15. Optical Broadcast Network
  • Electronic-photonic integration using a standard CMOS process
  • Cores communicate via an optical WDM broadcast-and-select network
  • Each core sends on its own dedicated wavelength using modulators
  • Cores can receive from some set of senders using optical filters
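
The broadcast-and-select semantics can be captured in a toy functional model. This is a sketch of the behavior only (dedicated per-sender wavelengths, receivers selecting senders with filters), not of the hardware; the class and method names are invented for illustration.

# Toy model of a WDM broadcast-and-select network: one wavelength per sender.

class OpticalBroadcastNet:
    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.channel = {}          # wavelength (== sender id) -> word in flight

    def send(self, sender, word):
        # Each sender owns a wavelength, so sends never contend or arbitrate.
        self.channel[sender] = word

    def receive(self, listen_set):
        # A core sees every sender its filters are tuned to; data sent once
        # is visible to any or all cores.
        return {s: w for s, w in self.channel.items() if s in listen_set}

net = OpticalBroadcastNet(64)
net.send(5, 42)
net.send(9, 7)
print(net.receive(listen_set={5, 9}))    # {5: 42, 9: 7}, one optical "hop"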

  16. Core-to-Core Communication
  • 32-bit data words are transmitted across several parallel waveguides
  • Each core contains receive filters and a FIFO buffer for every sender
  • Data is buffered at the receiver until needed by the processing core
  • The receiver can screen data by sender (i.e., wavelength) or by message type
  [Figure: two sending cores modulate onto a 32-waveguide ONet bundle; the receiving core drops each sender’s wavelengths into a dedicated 32-bit-wide FIFO.]
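
A sketch of this receive path, assuming (as the slide suggests but does not detail) that screening happens on arrival: the sender’s wavelength selects the FIFO, and an optional message-type filter decides whether the word is kept.

# Sketch of a core's per-sender receive FIFOs with arrival-time screening.

from collections import deque

class ReceiveBuffers:
    def __init__(self, n_senders, accept_types=None):
        self.fifo = [deque() for _ in range(n_senders)]
        self.accept_types = accept_types     # None means accept every type

    def enqueue(self, sender, msg_type, word):
        # Screen by message type; the sender id picks the FIFO (wavelength).
        if self.accept_types is None or msg_type in self.accept_types:
            self.fifo[sender].append((msg_type, word))

    def dequeue(self, sender):
        """Hand the oldest buffered word from one sender to the core."""
        q = self.fifo[sender]
        return q.popleft() if q else None

rx = ReceiveBuffers(n_senders=64, accept_types={"data"})
rx.enqueue(7, "data", 42)
rx.enqueue(7, "ack", 0)       # screened out on arrival
print(rx.dequeue(7))          # ('data', 42)
print(rx.dequeue(7))          # None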

  17. ATAC Bandwidth
  • 64 cores, 32 lines, 1 Gb/s per wavelength
  • Transmit BW: 64 cores × 1 Gb/s × 32 lines = 2 Tb/s
  • Receive-weighted BW: 2 Tb/s × 63 receivers = 126 Tb/s
  • Receive-weighted bandwidth is a good metric for broadcast networks because it reflects WDM
  • ATAC allows better utilization of computational resources because less time is spent performing communication
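
The slide’s bandwidth arithmetic, reproduced directly:

# 64-core ANet bandwidth, using the slide's parameters.

cores, lines, gbps_per_line = 64, 32, 1

transmit_bw = cores * lines * gbps_per_line    # every core sends at once
recv_weighted_bw = transmit_bw * (cores - 1)   # each bit reaches 63 receivers

print(transmit_bw)        # 2048 Gb/s, i.e. 2 Tb/s
print(recv_weighted_bw)   # 129024 Gb/s, i.e. 126 Tb/s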

  18. ATAC Efficiency
  • Cores can directly communicate with any other core in one hop (<3 ns)
  • Broadcasts require just one send; no complicated routing on the network is required
  • Cheap broadcast enables frequent global communication
  • Energy required: ANet, 300 fJ/bit; electrical signaling, 94 fJ/bit/mm
  • For 1000 cores at 1 mm/hop: use an electrical signal if the destination is less than four hops away, and an optical signal for broadcasts and long unicast messages
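
These figures imply a simple hybrid send policy; the sketch below uses the slide’s numbers (300 fJ/bit on the ANet, 94 fJ/bit/mm electrically, 1 mm per hop, four-hop threshold), but the policy function itself is an illustration, not the paper’s implementation.

# Hybrid network selection using the slide's energy figures.

ANET_FJ_PER_BIT = 300.0
EMESH_FJ_PER_BIT_PER_HOP = 94.0     # 94 fJ/bit/mm at 1 mm per hop

def pick_network(hops, is_broadcast):
    """Electrical for short unicasts; optical for broadcasts and long hauls."""
    if is_broadcast or hops >= 4:
        return "ANet", ANET_FJ_PER_BIT
    return "EMesh", hops * EMESH_FJ_PER_BIT_PER_HOP

print(pick_network(2, False))   # ('EMesh', 188.0): cheaper than 300 fJ/bit
print(pick_network(10, False))  # ('ANet', 300.0): vs. 940 fJ/bit electrically
print(pick_network(1, True))    # ('ANet', 300.0): one send reaches every core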

  19. ATAC Performance Simulation & Evaluation
  • PARSEC and SPLASH-2 benchmarks: nine applications from SPLASH-2 and three from PARSEC
  • ANet vs. EMesh
  • Cache coherence protocols: ACKwise, DirB, DirNB
  • Performance evaluation and comparison of ANet and EMesh combinations: each type of network is coupled with a coherence protocol, and six combinations are evaluated: (a) ANet-ACKwisek, (b) ANet-DirkB, (c) ANet-DirkNB, (d) EMesh-ACKwisek, (e) EMesh-DirkB, and (f) EMesh-DirkNB

  20. ATAC Performance Simulation & Evaluation
  • 64-core simulation: ANet^64 compared to a 64-bit-wide electrical mesh network
  • 1024-core simulation: ANet^1024 compared to a 256-bit-wide electrical mesh network

  21. ATAC Performance Simulation & Evaluation
  [Figure: benchmark results for the DirB and ACKwise coherence protocols.]

  22. ATAC Performance Simulation &Evaluation

  23. System Capabilities and Performance
  Baseline: the Raw multicore chip, a leading-edge tiled multicore. Comparison: the ATAC chip, a future optical-interconnect multicore. Both are 64-core systems in a 65 nm process with 64 GOPS peak performance.

                               Raw (baseline)    ATAC
  Chip power                   24 W              25.5 W
  Theoretical power eff.       2.7 GOPS/W        2.5 GOPS/W
  Effective performance        7.3 GOPS          38.0 GOPS
  Effective power eff.         0.3 GOPS/W        1.5 GOPS/W
  Total system power           150 W             153 W

  Optical communications require a small amount of additional system power but allow for much better utilization of computational resources.

  24. Communication-Centric Computing
  • A view of extended global memory can be enabled cheaply with on-chip distributed cache memory and the ATAC network
  • ATAC reduces off-chip memory calls, and hence energy and latency
  [Figure: a bus-based multicore paying ~500 pJ per off-chip memory access, versus ~3 pJ on-chip accesses to distributed caches over the ATAC network.]

  25. What Does the Future Look Like?
  • Corollary of Moore’s Law: the number of cores will double every 18 months
  • Projected core counts (counting cores minimally big enough to run a self-respecting OS!):

                '02    '05    '08    '11    '14
    Research    16     64     256    1024   4096
    Industry    4      16     64     256    1024

  • 1K cores by 2014! Are we ready?
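
The projection is just the stated doubling rate compounded from the 2002 starting points:

# Core-count projection: doubling every 18 months (x4 every 3 years).

def projected_cores(start_cores, start_year, year):
    return int(start_cores * 2 ** ((year - start_year) / 1.5))

for year in (2002, 2005, 2008, 2011, 2014):
    print(year, projected_cores(16, 2002, year), projected_cores(4, 2002, year))
# 2014 row: 4096 research cores, 1024 industry cores -> 1K cores by 2014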

  26. ATAC Is an Efficient Network
  • Modulators are the primary source of power consumption
  • Receive power: requires only ~2 fJ/bit, even with -5 dB link loss
  • Modulator power: a Ge-Si EA design needs ~75 fJ/bit (assuming 50 fJ/bit for the modulator driver)
  • Example: 64-core communication (i.e., N = 64 cores = 64 wavelengths; for a 32-bit word, 2048 drops/core and 32 adds/core)
  • Receive power: 2 fJ/bit × 1 Gb/s × 32 bits × N² = 262 mW
  • Modulator power: 75 fJ/bit × 1 Gb/s × 32 bits × N = 153 mW
  • Total energy per bit = 75 fJ/bit + 2 fJ/bit × (N - 1) = 201 fJ/bit
  • Comparison, electrical broadcast across 64 cores: requires 64 × 150 fJ/bit ≈ 10 pJ/bit, about 50× more power (assumes 150 fJ/mm/bit and 1-mm-spaced tiles)
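
Redoing the slide’s arithmetic (with the corrected milliwatt units) confirms the figures:

# 64-core optical power budget, using the slide's per-bit energies.

N, BITS, RATE = 64, 32, 1e9            # cores/wavelengths, word width, bit/s
RX_J, MOD_J = 2e-15, 75e-15            # J/bit to receive and to modulate

rx_power = RX_J * RATE * BITS * N * N  # every core listens to every wavelength
mod_power = MOD_J * RATE * BITS * N
print(rx_power, mod_power)             # 0.262 W and 0.154 W, i.e. ~262 mW, ~153 mW

energy_per_bit = 75 + 2 * (N - 1)      # fJ/bit: one modulate plus 63 receives
electrical = 64 * 150                  # fJ/bit: electrical broadcast, 64 tiles
print(energy_per_bit, electrical)      # 201 fJ/bit vs. 9600 fJ/bit (~48x more)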

  27. Summary ATAC uses optical networks to enable multicore programming and performance scaling ATAC encourages communication-centric architecture, which helps multicore performance and power scalability ATAC simplifies programming with a contention-free all-to-all broadcast network ATAC is enabled by recent advances in CMOS integration of optical components
