Switch Design a unified view of micro-architecture and circuits


Presentation Transcript


  1. Switch Design: a unified view of micro-architecture and circuits Giorgos Dimitrakopoulos Electrical and Computer Engineering Democritus University of Thrace (DUTH) Xanthi, Greece dimitrak@ee.duth.gr http://utopia.duth.gr/~dimitrak

  2. System abstraction • Processors for computation • Memories for storage • IO for connecting to the outside world • Network for communication and system integration [Figure: abstraction stack from Algorithms-Applications and the Operating System down through the Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Logic design, and Circuits, spanning Processors, Memory, Network, and IO Devices] Switch Design - NoCs 2012

  3. Logic, State and Memory • Datapath functions • Controlled by FSMs • Can be pipelined • Mapped on silicon chips • Gate-level netlist from a cell library • Cells built from transistors after custom layout • Memory macros store large chunks of data • Multi-ported register files for fast local storage and access of data Switch Design - NoCs 2012

  4. On-Chip Wires • Passive devices that connect transistors • Many layers of wiring on a chip • Wire width, spacing depends on metal layer • High density local connections, Metal 1-5 • Upper metal layers > 6 are wider and used for less dense low-delay global connections Switch Design - NoCs 2012

  5. Future of wires: 2.5D – 3D integration [Figure: evolution of integration] Switch Design - NoCs 2012

  6. Optical wiring • Optical connections will be integrated on chip • Useful when the power of electrical connections will limit the available chip IO bandwidth • A balanced solution that involves both optical and electrical components will probably win Switch Design - NoCs 2012

  7. Let’s send a word on a chip • Sender and receiver on the same clock domain • Clock-domain crossing just adds latency • Any relation of the sender-receiver clocks is exploited • Mesochronous interface • Tightly coupled synchronizers [AMD Zacate] Switch Design - NoCs 2012

  8. Point-to-point links: Flow control • Synchronous operation: data on every cycle • Sender can stall: data valid signal • Receiver can stall: stall (back-pressure) signal • Either can stall: valid and stall signals • Partially decouple sender and receiver by adding a buffer at the receive side [Figures: S→R link with data only; data + valid; data + stall; data + valid + stall] Switch Design - NoCs 2012
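A minimal way to model the valid/stall handshake of the last case: a word crosses the link only in a cycle where the sender asserts valid and the receiver does not assert stall. A small behavioral sketch of that rule (the signal names follow the slide; the trace is purely illustrative):

    # Behavioral sketch: a word is transferred only when valid=1 and stall=0.
    def transfer_occurs(valid: bool, stall: bool) -> bool:
        return valid and not stall

    # Illustrative cycle-by-cycle trace of the two control signals.
    trace = [(True, False), (True, True), (False, False), (True, False)]
    for cycle, (valid, stall) in enumerate(trace):
        moved = transfer_occurs(valid, stall)
        print(f"cycle {cycle}: valid={int(valid)} stall={int(stall)} -> transfer={moved}")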

  9. Sender and Receiver decoupled by a buffer • Receiver accepts some of the sender’s traffic even if the transmitted words are not consumed • When to stop? How is buffer overflow avoided? • Let’s see first how to build a buffer • Clock-domain crossing can be tightly coupled within the buffer Switch Design - NoCs 2012

  10. Buffer organization • A FIFO container that maintains order of arrival • 4 interfaces (full, empty, put, get) • Elastic • Cascade of depth-1 stages • Internal full/empty signals • Shift register in/Parallel out • Put: shift all entries • Get: tail pointer • Circular buffer • Memory with head / tail pointers • Wrap around array implementation • Storage can be register based Switch Design - NoCs 2012
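As a behavioral sketch of the circular-buffer option (a wrap-around array with head/tail pointers and the full/empty/put/get interface), here is a small Python model; the class name and depth are illustrative, and a hardware version would of course be registers and multiplexers rather than a Python list:

    # Behavioral sketch of a circular FIFO with head/tail pointers.
    # The four interfaces named on the slide: full, empty, put, get.
    class CircularFIFO:
        def __init__(self, depth):
            self.data = [None] * depth   # wrap-around storage array
            self.depth = depth
            self.head = 0                # next slot to read (get)
            self.tail = 0                # next slot to write (put)
            self.count = 0               # number of occupied slots

        def full(self):
            return self.count == self.depth

        def empty(self):
            return self.count == 0

        def put(self, word):
            assert not self.full(), "put on a full FIFO"
            self.data[self.tail] = word
            self.tail = (self.tail + 1) % self.depth   # wrap around
            self.count += 1

        def get(self):
            assert not self.empty(), "get on an empty FIFO"
            word = self.data[self.head]
            self.head = (self.head + 1) % self.depth   # wrap around
            self.count -= 1
            return word

    # Order of arrival is preserved: words leave in the order they were put.
    fifo = CircularFIFO(depth=4)
    for w in ["A", "B", "C"]:
        fifo.put(w)
    print(fifo.get(), fifo.get(), fifo.get())   # A B C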

  11. Buffer implementation • The same basic structure evolves with extra read/write flexibility • Multiplexers and head/tail pointers handle data movement and addressing [Figures: Elastic, Shift-In/Parallel-Out, and Circular-array implementations] Switch Design - NoCs 2012

  12. Link-level flow control: Backpressure • Link-level flow control provides a closed feedback loop to control the flow of data from a sender to a receiver • Explicit flow control (stall-go) • Receiver notifies the sender when to stop/resume transmission • Implicit flow control (credits) • Sender knows when to stop to avoid buffer overflow • For unreliable channels we need extra mechanisms for detecting and handling transmission errors Switch Design - NoCs 2012

  13. STALL-GO flow control • One signal, STALL/GO, is sent back to the sender • STALL=0 (GO) means that the sender is allowed to send • STALL=1 (STALL) means that the sender should stop • The sender changes its behavior the moment it detects a change on the backpressure signal • Data valid (not shown) is asserted when new data are available Switch Design - NoCs 2012

  14. STALL-GO flow control: example • In-flight words will be dropped, or they will replace the ones that wait to be consumed • In either case data are lost • STALL and GO should be tied to the buffer availability of the receiver’s queue • The example assumes that the receiver is stalled or released for other network reasons Switch Design - NoCs 2012

  15. Buffering requirements of STALL&GO • STALL should be asserted early enough, so that no in-flight words are dropped; the timing of STALL assertion guarantees lossless operation • GO should be asserted late enough, so that words are ready to consume before new words arrive; correct timing guarantees high throughput • Minimum buffering for full throughput and lossless operation should cover both STALL&GO reaction cycles; if it is not available, the link remains idle Switch Design - NoCs 2012
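One way to turn this requirement into numbers: words keep arriving for the whole STALL reaction time (these must be absorbed losslessly), and after GO is re-asserted no new words arrive for the GO reaction time (enough words must already be buffered to keep the link busy). A rough sizing helper under that reading, assuming one word per cycle and reaction times expressed in clock cycles:

    # Rough buffer sizing for a STALL&GO link (assumption: one word per cycle,
    # reaction times in clock cycles).
    def stall_go_min_buffer(stall_reaction_cycles, go_reaction_cycles):
        # Words still in flight after STALL is asserted must not be dropped,
        # and enough words must be waiting to cover the GO reaction time,
        # otherwise the link sits idle.
        return stall_reaction_cycles + go_reaction_cycles

    # Example: 2 cycles for STALL to take effect, 2 cycles for GO to take effect.
    print(stall_go_min_buffer(2, 2))   # 4 buffer slots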

  16. STALL&GO on pipelined and elastic links • Traffic is “blind” during a time interval of Round-trip Time (RTT) • the source will only learn about the effects of its transmission RTT after this transmission has started • the (corrective) effects of a contention notification will only appear at the site of contention RTT after that occurrence Switch Design - NoCs 2012

  17. Credit-based flow control • Sender keeps track of the available buffer slots of the receiver • The number of available slots is called credits • The available credits are stored in a credit counter • If #credits > 0 the sender is allowed to send a new word • Credits are decremented by 1 for each transmitted word • When one buffer slot is made free at the receive side, the sender is notified to increase the credit count • The figure shows an example where the credit-update signal is first registered at the receive side Switch Design - NoCs 2012
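A behavioral sketch of the sender-side credit counter described above: the counter starts at the receiver's buffer depth, is decremented on every transmitted word, and is incremented on every credit update (names are illustrative):

    # Behavioral sketch of credit-based flow control on one link (sender side).
    class CreditSender:
        def __init__(self, receiver_buffer_slots):
            self.credits = receiver_buffer_slots  # credit counter

        def can_send(self):
            return self.credits > 0               # #credits > 0 allows a new word

        def send(self):
            assert self.can_send()
            self.credits -= 1                     # one buffer slot consumed downstream

        def credit_update(self):
            self.credits += 1                     # receiver freed one slot

    sender = CreditSender(receiver_buffer_slots=2)
    sender.send(); sender.send()
    print(sender.can_send())       # False: no credits left, sender must wait
    sender.credit_update()         # a word was consumed at the receive side
    print(sender.can_send())       # True: transmission may resume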

  18. Credit-based flow control: Example • 0* means that the credit counter is incremented and decremented in the same cycle (so it stays at 0) [Waveform labels: Available Credits, Credit Update] Switch Design - NoCs 2012

  19. Credit-based flow control: Buffers and Throughput Switch Design - NoCs 2012

  20. Condition for 100% throughput • The number of registers that the data and the credits pass through defines the credit loop • 100% throughput is guaranteed only when the number of available buffer slots at the receive side equals the registers of the credit loop • Changing the available number of credits can reconfigure the maximum throughput at runtime • Credit-based FC is lossless with any buffer size > 0 • STALL&GO FC requires at least one loop’s worth of extra buffer space compared to credit-based FC Switch Design - NoCs 2012
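Read numerically, and assuming the credit-loop length is the round-trip in cycles at one word per cycle, sustained throughput is bounded by roughly min(1, buffer slots / credit-loop registers), reaching 100% exactly when the buffer covers the loop. A small helper under that assumption:

    # Throughput bound for credit-based flow control (assumption: the credit
    # loop length counts the registers that data + credit updates traverse,
    # i.e. the round-trip time in cycles at one word per cycle).
    def credit_fc_throughput(buffer_slots, credit_loop_registers):
        return min(1.0, buffer_slots / credit_loop_registers)

    print(credit_fc_throughput(buffer_slots=2, credit_loop_registers=4))  # 0.5
    print(credit_fc_throughput(buffer_slots=4, credit_loop_registers=4))  # 1.0 -> 100%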

  21. Link-level flow control enhancements • Reservation based flow control • Separate control and data functions • Control links race ahead of the data to reserve resources • When data words arrive, they can proceed with little overhead • Speculative flow control • The sender can transmit cells even without sufficient credits • Speculative transmissions occur when no other words with available credits are eligible for transmission • The receiver drops an incoming cell if its buffer is full • For every dropped word a NACK is returned to the sender • Each cell remains stored at the sender until it is positively acknowledged • Each cell may be speculatively transmitted at most once • All retransmissions must be performed when credits are available • The sender consumes credits for every cell sent, i.e., for speculative as well as credited transmissions Switch Design - NoCs 2012

  22. Send a large message (packet) • Send a long packet of 1 Kbit over a 32-bit-wide channel • Serialize the message into 32 words of 32 bits • Need 32 cycles for packet transmission • Each packet is transmitted word-by-word • When the output port is free, send the next word immediately • Old-fashioned store-and-forward required the entire packet to reach each node before initiating the next transmission Switch Design - NoCs 2012
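The serialization step itself is straightforward: cut the packet into link-width words and send one per cycle. A sketch using the slide's numbers (1 Kbit packet, 32-bit-wide channel):

    # Serialize a packet into link-width words (one word transmitted per cycle).
    def serialize(packet_bits, link_width_bits):
        words = []
        for i in range(0, len(packet_bits), link_width_bits):
            words.append(packet_bits[i:i + link_width_bits])
        return words

    packet = "1" * 1024                    # a 1 Kbit packet (bits as characters)
    words = serialize(packet, 32)          # 32-bit-wide channel
    print(len(words))                      # 32 words -> 32 cycles on the link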

  23. Buffer allocation policies • Each transmitted word needs a free downstream buffer slot • When the output of the downstream node is blocked, the buffer holds the arriving words • How much free buffering is guaranteed before sending the first word of a packet? • Virtual Cut-Through (VCT): the available buffer slots equal the words of the packet • Each blocked packet stays together and consumes the buffers of only one node • Wormhole: just a few slots are enough • A blocked packet inevitably occupies the buffers of more nodes • In either case, nothing is lost, thanks to the flow-control backpressure policy Switch Design - NoCs 2012
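The two policies differ only in the admission check applied before the head word of a packet is sent. A hedged sketch of that check (the wormhole threshold is an illustrative parameter, not a value from the slides):

    # Buffer-allocation check before sending the head word of a packet.
    def can_send_head(policy, free_slots, packet_words, wormhole_threshold=2):
        if policy == "VCT":
            # Virtual cut-through: the whole packet must fit downstream,
            # so a blocked packet stays together in one node's buffers.
            return free_slots >= packet_words
        elif policy == "wormhole":
            # Wormhole: a few slots are enough; a blocked packet may then
            # occupy buffers in several nodes along its path.
            return free_slots >= wormhole_threshold
        raise ValueError(policy)

    print(can_send_head("VCT", free_slots=4, packet_words=16))       # False
    print(can_send_head("wormhole", free_slots=4, packet_words=16))  # True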

  24. VCT and Wormhole in graphics Switch Design - NoCs 2012

  25. Link sharing • The number of wires of the link does not increase • One word can be sent on each clock cycle • The channel should be shared • A multiplexer is needed at the output port of the sender • Each packet is sent uninterrupted • Wormhole and VCT behave this way • The connection is locked for a packet until the tail of the packet passes the output port Switch Design - NoCs 2012

  26. Who drives the select signals? • The arbiter is responsible for selecting which packet will gain access to the output channel • A word is sent if buffer slots are available downstream • It receives requests from the inputs and grants only one of them • Decisions are based on some internal priority state Switch Design - NoCs 2012
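The slides leave the priority policy open; one common concrete choice is round-robin, where the input just after the last winner is searched first. A behavioral sketch producing a one-hot grant vector, like the select signals of the output multiplexer:

    # Behavioral sketch of a round-robin arbiter (one common priority policy;
    # the slide only requires "some internal priority state").
    class RoundRobinArbiter:
        def __init__(self, num_inputs):
            self.n = num_inputs
            self.priority = 0                      # input searched first

        def arbitrate(self, requests):
            # requests: list of booleans, one per input port.
            for offset in range(self.n):
                i = (self.priority + offset) % self.n
                if requests[i]:
                    self.priority = (i + 1) % self.n   # winner gets lowest priority next
                    grant = [False] * self.n
                    grant[i] = True                    # one-hot grant drives the mux select
                    return grant
            return [False] * self.n                    # no requests, no grant

    arb = RoundRobinArbiter(4)
    print(arb.arbitrate([True, False, True, False]))   # grants input 0
    print(arb.arbitrate([True, False, True, False]))   # grants input 2 next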

  27. Arbitration for Wormhole and VCT • In wormhole and VCT the words of each packet are not mixed with the words of other packets • Arbitration is performed once per packet and the decision is locked at the output for the whole packet duration • Even if a packet is blocked downstream, the connection does not change until the tail of the packet leaves the output port • Buffer utilization is managed by the flow-control mechanism Switch Design - NoCs 2012

  28. How can I place my buffers? Switch Design - NoCs 2012

  29. Let’s add some complexity: Networks • A network of terminal nodes • Each node can be a source or a sink • Multiple point-to-point links connected with switches • Parallel communication between components [Figure legend: Switch, Source/Sink, Terminal Node] Switch Design - NoCs 2012

  30. Complexity affects the switches • Multiple input-output permutations should be supported • Contention should be resolved and non-winning inputs should be handled • Buffered locally • Deflected to the network • Separate flow control for each link • Each packet needs to know/compute the path to its destination Switch Design - NoCs 2012

  31. How are the terminal nodes connected to the switch? • More than one terminal node can connect per switch • Concentration is good for bursty traffic • A local switch isolates local traffic from the main network Switch Design - NoCs 2012

  32. Switch design: IO interface • Separate flow control per link Switch Design - NoCs 2012

  33. Switch design: One output port • Let’s reuse the circuit we already have for one output port [Figure label: per-output requests] Switch Design - NoCs 2012

  34. Switch design: Input buffers • Move buffers to the inputs [Figure labels: data from input #1, requests for output #0] Switch Design - NoCs 2012

  35. Switch design: Complete output ports • How are the output requests computed? Switch Design - NoCs 2012

  36. Routing computation • Routing computation generates per output requests • The header of the packet carries the requests for each intermediate node (source routing) • The requests are computed/retrieved based on the packet’s destination (distributed routing) Switch Design - NoCs 2012

  37. Routing logic • Routing logic translates a global destination address to a local output port request • For example, to reach node X from node Y, output port #2 of Y should be used • A lookup table is enough to hold the request vector that corresponds to each destination Switch Design - NoCs 2012
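The lookup-table form of the routing logic can be sketched as a map from destination id to a one-hot output-port request vector; the table contents below are invented purely for illustration:

    # Routing logic as a lookup table: destination id -> one-hot request vector.
    # Table contents are illustrative; in a real switch they encode the topology
    # and the routing algorithm (e.g. which port of node Y leads toward node X).
    NUM_PORTS = 4
    ROUTING_LUT = {
        0: 0b0001,   # destination 0 reached through output port 0
        1: 0b0010,   # destination 1 through port 1
        2: 0b0100,   # destination 2 through port 2 (the "use port #2" case above)
        3: 0b1000,   # destination 3 through port 3
    }

    def route(destination):
        return ROUTING_LUT[destination]      # request vector driving the output arbiters

    print(f"{route(2):0{NUM_PORTS}b}")       # 0100 -> request output port #2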

  38. Switch building blocks Switch Design - NoCs 2012

  39. Running example of switch operation • Switches transfer packets • Packets are broken into flits • Only the head flit knows the packet’s destination • The number of wires of each link equals the number of bits of each flit Switch Design - NoCs 2012

  40. Buffer access • Buffer incoming packets per link • Read the destination of the head of each queue Switch Design - NoCs 2012

  41. Routing Computation/Request Generation • Compute output requests and drive the output arbiters Switch Design - NoCs 2012

  42. Arbitration-Multiplexer path setup • Arbitrate per output • The grant signals drive the output multiplexers and notify the inputs about the arbitration outcome Switch Design - NoCs 2012

  43. Switch traversal • The words marked H will leave the switch on the next clock edge, provided they have at least one credit Switch Design - NoCs 2012

  44. Link traversal • Words going to a non-blocked output leave the switch • The grants of a blocked output (due to flow control) are lost • An output arbiter can also stall in case of blocked output Switch Design - NoCs 2012

  45. Head-of-Line blocking: performance limiter • The FIFO order of the input buffers limits the throughput of the switch • A flit is blocked by the head-of-line flit that lost arbitration • A memory throughput problem Switch Design - NoCs 2012

  46. Wormhole switch operation • The operations can fit in the same cycle or they can be pipelined • Extra registers are needed in the control path • Registers in the input/output ports are already present • LT at the end involves a register write • Body/tail flits inherit the decisions taken by the head flit Switch Design - NoCs 2012

  47. Look-ahead routing • Routing computation is based only on packet’s destination • Can be performed in switch A and used in switch B • Look-ahead routing computation (LRC) Switch Design - NoCs 2012

  48. Look-ahead routing • The LRC is performed in parallel to SA • LRC should be completed before the ST stage in the same switch • The head flit needs the output port requests for the next switch Switch Design - NoCs 2012

  49. Look-ahead routing details • The head flit of each packet carries the output port requests for the next switch together with the destination address Switch Design - NoCs 2012
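A behavioral way to view look-ahead routing: the head flit arrives already carrying the output-port request for the current switch, and while switch allocation (SA) proceeds, the switch looks up (from the destination alone) the request needed at the next hop and writes it back into the header before switch traversal (ST). A sketch under those assumptions, with illustrative flit fields and table contents:

    # Sketch of look-ahead routing computation (LRC): the head flit carries the
    # request for the current switch; this switch computes the request for the
    # next switch in parallel with switch allocation (SA).
    def lookahead_route(head_flit, routing_lut):
        # head_flit: {"dest": node id, "port_request": one-hot vector for THIS switch}
        current_request = head_flit["port_request"]      # used by this switch's arbiters
        next_request = routing_lut[head_flit["dest"]]    # computed here, used downstream
        head_flit["port_request"] = next_request         # must be in place before ST
        return current_request

    # Illustrative routing table of switch B (destination -> one-hot port request).
    lut_switch_B = {7: 0b0010}
    flit = {"dest": 7, "port_request": 0b0100}   # request computed by the previous switch A
    used_here = lookahead_route(flit, lut_switch_B)
    print(bin(used_here), bin(flit["port_request"]))  # 0b100 here, 0b10 at the next hop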

  50. Low-latency organizations • Baseline: SA precedes ST (no speculation) • SA decoupled from ST: predict or speculate the arbiter’s decisions; when the prediction is wrong, replay all the tasks (same as baseline) • Do in different phases, i.e., circuit switching: arbitration and routing at the setup phase; at transmit only ST is needed, since contention is already resolved • Bypass switches: reduce latency under certain criteria; when bypass is not enabled, same as baseline [Pipeline diagrams with stages LRC, SA, ST, LT and Setup/Xmit phases] Switch Design - NoCs 2012
