The NoX Router



Presentation Transcript


  1. The NoX Router. Mitchell Hayenga, Mikko Lipasti

  2. Overview
     • New low-latency router technique
     • Don’t arbitrate or speculate! Encode.
     • XOR property: (A^B) ^ B = A
     • Hides arbitration latency
     • Eliminates dead cycles
     • The NoX Router
     • Single-cycle/wormhole/mesh implementation
     • Frequency competitive with pure speculative designs
     • 2.7% to 34.4% better ED² (energy-delay-squared product) on application traces
     • Up to 9.9% better throughput on synthetic traffic
     [Figure: router block diagram labeled Control, Input Channel, Switch Fabric]

  3. Motivation
     • Modern On-Chip Networks
     • Bandwidth Plentiful, Latency Critical
     • Control
     • Complex, Speculative, Critical Path
     • Datapath
     • Fast, Simple, Wire-Dominated
     • NoX Tradeoff
     • Marginal increase in datapath complexity
     • Hide control latency
     [Figure: Virtual Channel Router Pipeline Evolution, showing successive pipelines built from the stages BW, RC/NRC, VA, SA, ST, LT, including the Intel Teraflops Router]

  4. Switch Arbitration Techniques
     • Non-Speculative
     • Arbitration occurs before switch traversal
     • Speculative Switch Traversal [Mullins ISCA 2004]
     • Assume contention doesn’t happen
     • Wasted cycle in the event of contention
     • Arbiter decides what gets sent on the next cycle
     [Figure: timing diagram over cycles 0 to 4 (clk, port 0, port 1, grant, valid out, data out) for the No Contention and Contention cases; under contention the output is undefined for one cycle before A and B are sent in turn]
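The wasted cycle under speculation can be illustrated with a toy output trace. This is a minimal sketch, not the Mullins design: `speculative_trace` is a hypothetical helper that assumes a single output port, one flit per contending input, a fixed-priority arbiter (list order), and that arbitration performed during the collision cycle governs the following cycles.

```python
from typing import List, Optional


def speculative_trace(flits: List[str]) -> List[Optional[str]]:
    """Per-cycle data on one output port; None marks a wasted cycle.

    Toy assumptions (not from the slides): every input holds one flit,
    all inputs target the same output, and the arbiter uses fixed
    priority given by list order.
    """
    pending = list(flits)
    trace: List[Optional[str]] = []
    arbitrated = False  # becomes True once the arbiter has seen the collision
    while pending:
        if len(pending) > 1 and not arbitrated:
            # Every input speculates at once: the output is invalid and
            # the cycle is wasted while the arbiter picks a winner.
            trace.append(None)
            arbitrated = True
        else:
            # Exactly one input drives the output (either it was alone,
            # or the arbiter granted it during the previous cycle).
            trace.append(pending.pop(0))
    return trace


if __name__ == "__main__":
    print(speculative_trace(["A"]))       # no contention
    print(speculative_trace(["A", "B"]))  # contention costs one extra cycle
```

In this model two contending flits occupy the output for three cycles, which is the dead cycle the encoding approach aims to remove.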

  5. Switch Arbitration Techniques
     • Non-Speculative
     • Arbitration occurs before switch traversal
     • Speculative Switch Traversal [Mullins ISCA 2004]
     • Assume contention doesn’t happen
     • Wasted cycle in the event of contention
     • Arbiter decides what gets sent on the next cycle
     • Encoding
     • Blindly transmit, XOR within switch fabric
     • No contention: data sent unmodified
     • Contention: data sent XOR’d
     • Arbiter decides what was sent
     [Figure: timing diagram over cycles 0 to 4 (clk, port 0, port 1, grant, valid out, data out) for the No Contention and Contention cases; under contention the output carries A^B instead of an undefined value]
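The encoding idea on the sending side can be sketched as follows. This is an illustrative model only: `encoded_trace`, the fixed-priority winner selection, and the one-bit `coded` sideband flag are assumptions made for the sketch, and flits are modeled as plain integers.

```python
from functools import reduce
from operator import xor
from typing import List, Tuple


def encoded_trace(flits: List[int]) -> List[Tuple[int, bool]]:
    """Per-cycle (data_out, coded) pairs on one output port.

    Toy assumptions (not from the slides): each input holds one flit,
    all inputs target the same output, and the arbiter uses fixed
    priority given by list order.
    """
    pending = list(flits)
    trace: List[Tuple[int, bool]] = []
    while pending:
        # The switch fabric XORs together every input that is driving it.
        data_out = reduce(xor, pending)
        # More than one driver means the value on the link is encoded.
        coded = len(pending) > 1
        trace.append((data_out, coded))
        # The arbiter names the winner after the fact; that input stops
        # driving, so the next cycle carries the XOR of the rest.
        pending.pop(0)
    return trace


if __name__ == "__main__":
    a, b = 0xA, 0xB
    for data, coded in encoded_trace([a, b]):
        print(hex(data), "coded" if coded else "plain")
```

With two contending flits the link carries `A^B` and then `B`, so no cycle is spent sending an invalid value.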

  6. Receive Logic
     • Works upon simple XOR property.
     • (A^B^C) ^ (B^C) = A
     • Simple Decode
     • Always able to decode by XORing two sequential values
     • Maintains previous router’s arbitration order/fairness
     [Figure: receive logic with a flit buffer and a Coded flag, shown decoding the sequence A^B^C, B^C, C]
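The decode step described above can be sketched in a few lines. This is a minimal model under stated assumptions: `decode` is a hypothetical helper, flits are integers, and each received value arrives with a `coded` flag (the figure shows a Coded signal, but its exact encoding is assumed here).

```python
from typing import List, Tuple


def decode(received: List[Tuple[int, bool]]) -> List[int]:
    """Recover the original flits from a stream of (value, coded) pairs.

    A coded value is the XOR of the current winner with everything that
    follows it, so XORing it with the next received value isolates the
    winner. An uncoded value is already an original flit.
    """
    decoded: List[int] = []
    for i, (value, coded) in enumerate(received):
        if coded:
            if i + 1 >= len(received):
                raise ValueError("coded value has no successor to decode against")
            next_value = received[i + 1][0]
            decoded.append(value ^ next_value)
        else:
            decoded.append(value)
    return decoded


if __name__ == "__main__":
    a, b, c = 0x12, 0x34, 0x56
    stream = [(a ^ b ^ c, True), (b ^ c, True), (c, False)]
    print([hex(x) for x in decode(stream)])
```

The flits come out in the order the upstream arbiter chose (A, then B, then C), which is how the previous router's arbitration order is preserved.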

  7. Tradeoffs and Scaling
     • Arbitration
     • O(log n) delay for most arbiters
     • Decode logic
     • Constant with respect to # of ports
     • Switch Fabric
     • XOR delay scales slightly worse than a mux/tristate-based solution
     • Maybe not an issue (control latency)
     [Figure: router block diagrams labeled Control, Input Channel, Switch Fabric]

  8. The NoX Router
     • Network of XORs
     • Implementation Details
     • 8x8 Mesh, 2mm long 64-bit links
     • Single Cycle (Router+Link)
     • Wormhole
     • Dimension ordered routing
     • Minimally buffered

  9. Baseline Designs
     • Non-Speculative
     • Serial arbitration & switch logic
     • Long cycle time
     • Efficient link utilization
     • Speculative Techniques [Mullins ISCA 2004]
     • Hides arbitration latency
     • Potential for wasted link bandwidth
     • Spec-Fast & Spec-Accurate [Mullins ASP-DAC 2006]

  10. Frequency Analysis
     • Overheads present in all designs
     • 248ps SRAM delay
     • 98ps link latency
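The effect of these fixed overheads on clock frequency can be worked through with simple arithmetic. The sketch below assumes the SRAM delay and link latency sit in series on the single-cycle (router plus link) path; `max_frequency_ghz` and any `router_logic_ps` value passed to it are hypothetical and do not come from the slides.

```python
SRAM_DELAY_PS = 248.0    # from the slide
LINK_LATENCY_PS = 98.0   # from the slide


def max_frequency_ghz(router_logic_ps: float) -> float:
    """Clock frequency in GHz for a single-cycle router+link path.

    Assumption: cycle time = SRAM delay + link latency + router logic,
    with all three delays in series.
    """
    cycle_ps = SRAM_DELAY_PS + LINK_LATENCY_PS + router_logic_ps
    return 1000.0 / cycle_ps  # 1000 ps per ns, so 1000 / ps gives GHz


if __name__ == "__main__":
    # Upper bound if the router logic itself took no time at all.
    print(round(max_frequency_ghz(0.0), 2))
    # Hypothetical 154 ps of arbitration/switch logic gives a 500 ps cycle.
    print(max_frequency_ghz(154.0))
```

Under that assumption the shared 346 ps overhead caps every design below roughly 2.9 GHz, so the designs differ only in how much logic delay they add on top.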

  11. Synthetic Traffic - Latency
     [Plots: latency versus bandwidth (MB/s/node)]

  12. Synthetic Traffic – ED²
     [Plots: ED² (energy-delay-squared product) versus bandwidth (MB/s/node)]
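For readers unfamiliar with the metric, ED² is the energy-delay-squared product, which weights latency more heavily than energy. The sketch below shows how a relative improvement could be computed; the function names and the sample numbers are illustrative assumptions, and the slides do not state exactly how their percentages were derived.

```python
def ed2(energy: float, delay: float) -> float:
    """Energy-delay-squared product: E * D^2 (lower is better)."""
    return energy * delay ** 2


def ed2_improvement_percent(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` ED^2 is lower than `baseline` ED^2."""
    return (1.0 - candidate / baseline) * 100.0


if __name__ == "__main__":
    # Hypothetical designs: same energy, candidate has 10% lower delay.
    base = ed2(energy=1.0, delay=1.0)
    cand = ed2(energy=1.0, delay=0.9)
    print(round(ed2_improvement_percent(base, cand), 1))
```

Because delay is squared, a 10% latency reduction at equal energy yields about a 19% ED² improvement in this example.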

  13. Application Traffic - Latency

  14. Application Traffic – ED²

  15. Power @ Fixed Bandwidth
     • Traffic Pattern
     • Uniform Random
     • 2 GB/s/node injection rate
     • Spec-Fast saturated
     • Switch/Link glitching in speculative
     • Marginal additional decode power
     [Figure annotation: Decode negligible]

  16. Area Floorplanning
     • NoX Router: ~17% more area than the Standard Router
     [Figure: floorplans of the Standard Router and the NoX Router, each with five 64x4 SRAM input buffers (Port 0 to Port 4); block labels: Crossbar, XOR Switch, Decoding and Masking; dimension labels: 161.2 µm, 140 µm, 101.0 µm, 102.2 µm, 70 µm, 28 µm]

  17. Going Further
     • Input Speedup
     • What if we could drive two values from an input buffer in a single cycle?
     • Final decode step has 2 values available
     • Last packet sees no additional delay from contention at the previous router
     • Multi-hop encoded forwarding
     • Don’t decode at every hop; decode when packets diverge
     • Allow new collisions with the “head” flit
     • Requires additional sideband info
     [Figure: Switch Fabric and Flit Buffer holding the values A^B, B, and A]

  18. Conclusion
     • New encoding-based low-latency router technique
     • Hides arbitration latency
     • Comparable frequency to speculative switch traversal techniques
     • Eliminates wasted interconnect bandwidth
     • Promising application to multiple router architectures

  19. Thanks – Questions?

  20. Virtual Channels
     • Future Work
     • Physical Channels vs. Virtual Channels
     • VC Router Benefits
     • Dynamic bandwidth sharing (performance)
     • VC Router Negatives
     • Increased arbitration delay (performance)
     • Increased buffer energy (power)
     • Large unified crossbar (area, power)
     • Possible but tradeoffs need to be re-evaluated
     • Structuring of input buffers/decode logic
     • VC credit accounting

  21. Multi-Flit Support
     • Current support is conservative
     • Performs similarly to speculative routers if multi-flit packets collide
     • Not all bad though
     • ~70% of packets are single-flit coherence packets
     • Only head-flit collisions matter
     • Requests all single-flit
     • Alternatives
     • Fragment multi-flit packets
     • Provide sufficient buffering space
