Networks-on-Chip Sergio Tota and Mario R. Casu VLSI Laboratory
Seminar contents • The Premises • Homogeneous and Heterogeneous Systems-on-Chip and their interconnection networks • The Network-on-Chip approach • Examples • Our THIN contribution (Sergio’s speech) • Back to the coffee corner…
The premises • The System-on-Chip (SoC) today • Heterogeneous: ~10 IPs • Homogeneous (MP-SoC): ~10 uPs (with exceptions) • On-chip bus (AMBA, CoreConnect, Wishbone, …) • IPs and uPs are sold with proprietary bus interfaces • Near- and long-term forecast • 100 IPs/uPs: buses are not scalable! • Physical design issues: signal integrity, power consumption, timing closure • Clock issues: is it time for the Globally Asynchronous paradigm? (Still locally synchronous) • Need for “more regular” design
Today’s SoC: heterogeneous • [Figure: CPU, DSP, MEM, embedded FPGA, dedicated IP and I/O blocks attached to an interconnection network (BUS)]
The Cell Processor • Fclock > 4 GHz. • Memory bandwidth: 25.6 GBytes per second. • I/O bandwidth: 76.8 GBytes per second. • Performance: • 256 GFLOPS (Single precision at 4 GHz). • 256 GOPS (Integer at 4 GHz). • 25 GFLOPS (Double precision at 4 GHz). • 235 square mm. • 235 million transistors. • Power consumption estimated at 60 - 80 W @ 4GHz
Cell’s Element Interconnect Bus • From the trenches: D. Krolak, IBM • “Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is architected, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just wasn't enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.”
Cell’s Element Interconnect Bus • 4 rings (2 clockwise + 2 counter-clockwise) • Not token rings: arbitration is still request/grant
Homogeneous SoC (MP-SoC) • [Figure: eight CPU blocks and eight MEM blocks attached to an interconnection network (BUS, XBAR)]
MP-SoC: Cisco CRS-1 Router • The CRS-1 router uses 188 extensible network processors per “Silicon Packet Processor” chip • 16 PPE clusters of 12 PPEs each
Very long wires • [Figure: points A and B on the chip, shown for Year 2005 with a 1 ns clock period (1 GHz) and for Year 2010 with a 0.1 ns clock period (10 GHz)]
Bus pros (+) and cons (−) • (−) Every unit attached adds parasitic capacitance, therefore electrical performance degrades with growth. • (−) Bus timing is difficult in a deep submicron process. • (−) Bus arbiter delay grows with the number of masters. The arbiter is also instance-specific. • (−) Bandwidth is limited and shared by all units attached. • (+) Bus latency is zero once the arbiter has granted control. • (+) The silicon cost of a bus is near zero. • (+) Any bus is almost directly compatible with most available IPs, including software running on CPUs. • (+) The concepts are simple and well understood.
What are NoC’s? • According to Wikipedia: • “Network-on-a-chip (NoC) is a new paradigm for System-on-Chip (SoC) design. NoC based-systems accommodate multiple asynchronous clocking that many of today's complex SoC designs use. The NoC solution brings a networking method to on-chip communications and claims roughly a threefold performance increase over conventional bus systems.” • Imprecise…
NoC exemplified • [Figure: nine routing nodes, each attached to a processor (master), plus one global memory (slave) and two global I/O blocks (slaves) attached to the network]
Basic Ingredients of a NoC • N Computational Resources • Processing Elements (PE) • 1 Connection Topology • 1 Routing technique • M N Switches • N Network Interfaces
For the Connoisseurs… • 1 Addressing system • 1 Switch-level Arbitration policy • 1 Communication Protocol • 1 Programming model • Message passing • Shared Memory • Bon appetit!
NoC: Good news • Only point-to-point one-way wires are used, for all network sizes. • Aggregated bandwidth scales with the network size. • Routing decisions are distributed and the same router is re-instantiated, for all network sizes. • NoCs increase wire utilization (as opposed to ad-hoc p2p wires).
There’s no free lunch… • Internal network contention causes (often unpredictable) latency. • The network has a significant silicon area. • Bus-oriented IPs need smart wrappers. • Software needs clean synchronization in multiprocessor systems. • System designers need reeducation for new concepts.
Facts about NoC’s • It is a way to decouple computation from communication • The design is layered (physical, network, application…), which makes taming complexity easier • Communication between processing elements in a NoC takes place by encapsulating data in packets • The elementary piece of a packet to which switching and routing operations apply is the flit
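The packet/flit relationship above can be sketched in a few lines of Python. This is only an illustration: the `Flit` class, the head/body/tail naming and the choice of putting the destination in the head flit are assumptions, not a format taken from any of the NoCs discussed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    kind: str     # "head", "body" or "tail" (hypothetical naming)
    payload: int  # destination id for the head flit, one data word otherwise


def packetize(dest: int, words: List[int]) -> List[Flit]:
    """Encapsulate data words in one packet: a head flit, then one flit per word."""
    if not words:
        raise ValueError("a packet needs at least one data word")
    flits = [Flit("head", dest)]
    for i, word in enumerate(words):
        # The last data word is marked as the tail so a switch can free the path.
        kind = "tail" if i == len(words) - 1 else "body"
        flits.append(Flit(kind, word))
    return flits
```

For instance, `packetize(5, [10, 20, 30])` yields four flits: one head carrying the destination 5, two body flits and one tail flit.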
Topologies • Heritage of networks with new constraints • Need to accommodate interconnects in a 2D layout • Cannot route long wires (clock frequency bound) • SPIN • CLICHÉ • Torus • Folded torus • Octagon • BFT
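To make the topology trade-off concrete, here is a minimal sketch comparing hop counts on a k×k mesh and a k×k torus. The function names are hypothetical, and the model counts hops only; it says nothing about wire length, which is what the folded torus layout is meant to address.

```python
from itertools import product


def mesh_hops(a, b):
    """Minimal hop count between nodes a=(x, y) and b=(x, y) on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a, b, k):
    """Minimal hop count on a k x k torus: each dimension can wrap around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, k - dx) + min(dy, k - dy)


def average_hops(k, dist):
    """Average of dist(a, b) over all ordered pairs of distinct nodes."""
    nodes = list(product(range(k), repeat=2))
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(dist(a, b) for a, b in pairs) / len(pairs)
```

On a 4×4 network, the corner-to-corner distance is 6 hops on the mesh and 2 hops on the torus, and the average distance is lower on the torus as well.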
Switching • Again, techniques inherited from computer and communication networks • New constraints in silicon: area and power • Use as few buffers as possible • Store & Forward and Virtual Cut-Through • Need buffer space for an entire packet: unsuited! • Limited buffer size in Wormhole and in Deflection Routing, a.k.a. “Hot Potato” • Virtual channels • Increase buffer size…
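A simple zero-load model shows why pipelining flits matters. The sketch below assumes one clock cycle per flit per hop and no contention; both the model and the function names are illustrative assumptions, not figures from a specific router.

```python
def store_and_forward_cycles(hops: int, packet_flits: int) -> int:
    """Each switch receives the whole packet before forwarding it."""
    return hops * packet_flits


def wormhole_cycles(hops: int, packet_flits: int) -> int:
    """The head flit advances one hop per cycle; the other flits follow in a pipeline."""
    return hops + packet_flits - 1
```

For an 8-flit packet crossing 4 hops, this model gives 32 cycles for Store & Forward and 11 cycles for Wormhole. In this idealized picture, Store & Forward also needs buffer space for all 8 flits in every switch, while Wormhole can work with much less.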
Routing • Deterministic vs. adaptive • Simple/complex routing logic • Easy/hard to make deadlock-free • Prone/robust to congestion • 2D dimension-order routing (XY) is the most used static routing in NoCs (e.g. with wormhole switching on a mesh)
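XY dimension-order routing is short enough to sketch directly. The function below is a minimal illustration for a 2D mesh with (x, y) node coordinates; the name `xy_route` and the coordinate convention are assumptions.

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst, X dimension first, then Y."""
    x, y = src
    path = [(x, y)]
    # Phase 1: travel along X until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Phase 2: travel along Y until the destination row is reached.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

Because a packet never turns from the Y dimension back to the X dimension, the routing logic stays simple and deterministic, at the price of being unable to steer around congested links.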
Who first had the idea? • No clear parentage. The most cited papers according to Google (# citations): • Guerrier’00 (204), A Generic Architecture for On-Chip Packet-Switched Interconnections • Dally’01 (392), Route Packets, Not Wires: On-Chip Interconnection Networks • Benini’02 (417), Networks on Chips: A New SoC Paradigm • Kumar’02 (184), A Network on Chip Architecture and Design Methodology
SPIN (Guerrier et al., DATE ’00/’03) • Wormhole switching, adaptive routing and credit-based flow control • Based on a fat-tree topology • A flit is only one word (36 bits, of which 4 are for packet framing) • The input buffers have a depth of 4 words
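Credit-based flow control can be illustrated with a small model of one link. This is a generic sketch, not SPIN's actual implementation: the `CreditLink` class is hypothetical, the default depth of 4 simply mirrors the buffer depth quoted above, and the delay of returning a credit to the sender is ignored.

```python
from collections import deque


class CreditLink:
    """One sender and one receiver input buffer, linked by a credit counter."""

    def __init__(self, buffer_depth: int = 4):
        self.credits = buffer_depth  # free slots the sender believes remain
        self.rx_buffer = deque()     # receiver-side input buffer

    def send(self, flit) -> bool:
        """Send a flit if a credit is available; otherwise the sender stalls."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def consume(self):
        """Receiver forwards one flit downstream and hands a credit back."""
        if not self.rx_buffer:
            return None
        flit = self.rx_buffer.popleft()
        self.credits += 1
        return flit
```

The sender can never overflow the receiver: after four flits it has no credits left and must wait until the receiver drains at least one slot, so no flit is dropped.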
Dally et al., DAC’01 • 2D folded torus topology • Wormhole routing and Virtual Channels (VC)
Kumar et al., ISVLSI’02 • Chip-Level Integration of Communicating Heterogeneous Elements (CLICHÉ) • 2D mesh topology • Message passing
Pande et al., TCOMP’05 • Butterfly Fat Tree • Wormhole, virtual channels • Header flits: 3 clock cycles of latency (input arbitration, routing, output arbitration) • “Body” flits: 3 clock cycles (input arbitration, switch traversal, output arbitration)
Goossens et al., IEE CDT’03 • Supports both virtual cut-through (VCT) and wormhole (WH) switching, guaranteed-throughput (GT) and best-effort (BE) traffic, input queuing (IQ) and virtual output queuing (VOQ) • GT uses TDM to avoid contention and create virtual circuits. In each time slot a block of 3 flits is transferred from input “j” to output “k” in store-and-forward (S&F) fashion. • BE uses matrix scheduling • GT connections are set up by special BE system packets • Prototype with WH and IQ • 5 ports • 0.13 µm, 0.26 mm², 500/166 MHz • Flit size = 3 words of 32 bits each • 80 Gb/s aggregate bandwidth
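The prototype figures above are consistent with each other under one plausible reading, worked out below. Treating 500 MHz as the word clock and 166 MHz as the resulting flit clock is an assumption made here for the arithmetic, not something the slide states; the constant and function names are illustrative too.

```python
PORTS = 5               # router ports, from the slide
WORD_BITS = 32          # bits per word, from the slide
WORDS_PER_FLIT = 3      # flit size in words, from the slide
WORD_CLOCK_HZ = 500e6   # assumed: one 32-bit word per port per cycle at 500 MHz


def aggregate_bandwidth_gbps(ports: int, word_bits: int, word_clock_hz: float) -> float:
    """Sum over all ports of the per-port bit rate, in Gb/s."""
    return ports * word_bits * word_clock_hz / 1e9


def flit_rate_mhz(word_clock_hz: float, words_per_flit: int) -> float:
    """Flits per second per port, in MHz, when one word moves per clock cycle."""
    return word_clock_hz / words_per_flit / 1e6


print(aggregate_bandwidth_gbps(PORTS, WORD_BITS, WORD_CLOCK_HZ))  # 80.0
print(int(flit_rate_mhz(WORD_CLOCK_HZ, WORDS_PER_FLIT)))          # 166
```

Under these assumptions, 5 ports × 32 bits × 500 MHz gives the quoted 80 Gb/s, and 500 MHz / 3 words per flit gives about 166 MHz.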
Common properties • Data integrity, meaning that data is delivered uncorrupted • Lossless data delivery, which means no data is dropped in the interconnect • In-order data delivery, which specifies that the order in which data is delivered is the same order in which it has been sent • Throughput and Latency services that offer time related bounds.
What is new? • Yes, you are quite right: no new concepts • An amazing application of network ideas to the chip context • But the ideas need to be re-contextualized • Old constraints • Latency, bandwidth • New constraints are very tight • Area, power, clocks • Differences between fine-grain NoCs and large-grain networks • Today links are 100% reliable. This may no longer hold for ultra-scaled technologies and globally asynchronous NoCs • For many applications, lowest latency is more important than highest bandwidth
Simulation Issues • Stochastic traffic generators • Ease of implementation/simulation • Fast simulation • MP-SoC loop interactions ignored? • Self-similar traffic used by some • Trace-Based Simulation • Need for extensive pre-simulation • Long simulations (days-weeks) • Accurate results • Stay tuned for Sergio’s speech…
Applications • Main NoC feature: high communication bandwidth • Desirable feature for MP-SoC: low communication latency • The two are often conflicting requirements: • “Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed: you can’t bribe God.” (Anonymous) • Desperately seeking benchmarks and killer applications • Networking!!! • Multimedia?
The THIN NoC • What we think will make a NoC sexy enough for chip designers • Least switch area and power • Fast, low-latency switch • Ideally a single clock cycle of latency and a cutting-edge clock frequency Fck (technology limited) • Large bandwidth = high Fck × high data parallelism • Need for a lightweight NoC design • THIN: Torino Hawaii Interconnection Network • Joint work with the University of Hawaii at Manoa, Dept. of Electrical Engineering
Some References • J. Rabaey et al., “A 1-V heterogeneous reconfigurable DSP IC for wireless baseband digital signal processing,” IEEE Journal of Solid State Circuits, vol. 35, no. 11, Nov. 2000, pp. 1697-1704. • P. Guerrier and A. Greiner, “A Generic Architecture for On-Chip Packet-Switched Interconnections,” Proc. Design, Automation and Test in Europe (DATE), pp. 250-256, Mar. 2000. • A. Adriahantenaina et al., “SPIN: a Scalable, Packet Switched, On-chip Micro-network,” Proc. Design, Automation and Test in Europe (DATE), Mar. 2003. • L. Benini and G. De Micheli, “Networks on Chips: A New SoC Paradigm,” Computer, vol. 35, no. 1, Jan. 2002, pp. 70-78. • S. Kumar et al., “A network on chip architecture and design methodology,” in Proc. ISVLSI, 2002. • W. J. Dally and B. Towles, “Route packets, not wires: on-chip interconnection networks,” in Proc. Design Automation Conf., 2001. • K. Goossens et al., “Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip,” IEE Proc.-Comput. Digit. Tech., vol. 150, no. 5, Sep. 2003, pp. 294-302. • P.P. Pande et al., “Performance Evaluation and Design Trade-offs for Network-on-Chip Interconnect Architectures,” IEEE Trans. Computers, vol. 54, no. 8, Aug. 2005, pp. 1025-1040.