Allocator Implementations for Network-on-Chip Routers - PowerPoint PPT Presentation

Presentation Transcript

  1. Allocator Implementations for Network-on-Chip Routers
     Daniel U. Becker and William J. Dally
     Concurrent VLSI Architecture Group, Stanford University

  2. Overview
     • Allocators have major impact on router performance
       • Zero-load latency, throughput under load, cycle time
     • On-chip environment imposes stringent constraints
       • Cycle time, power, no iterative / multi-cycle allocators
     • Main Contributions:
       • RTL-based performance & cost evaluation of virtual channel and switch allocators for NoC routers
       • Sparse VC allocation scheme reduces delay, area & power
       • Pessimistic speculation scheme minimizes delay penalty
     Allocator Implementations for NoC Routers

  3. Separable Allocators
     • Implement allocation as two phases
       • Local arbitration at each input
       • Global arbitration at each output
     • Pros:
       • Straightforward implementation
       • Delay scales logarithmically
     • Cons:
       • Arbiters within each phase are independent
       • Bad choice in first phase can limit matching
     [Figures: input-first and output-first separable allocators, each drawn between Inputs and Outputs]
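A minimal sketch of the input-first variant described on this slide, assuming round-robin arbiters with fixed priority pointers. The slide does not name an arbiter type, and the function names and the `in_ptr`/`out_ptr` parameters are illustrative. The example request matrix shows the listed drawback: both inputs pick output 0 in the first phase, so only one grant is issued even though a two-grant matching exists.

```python
def round_robin_arbiter(requests, pointer):
    """Return the index of the first asserted request at or after pointer, or None."""
    n = len(requests)
    for offset in range(n):
        idx = (pointer + offset) % n
        if requests[idx]:
            return idx
    return None


def separable_input_first(req, in_ptr, out_ptr):
    """req[i][o] is True if input i requests output o. Returns a grant matrix.

    Priority pointers are taken as given and not updated (an assumption made
    to keep the sketch short).
    """
    n_in, n_out = len(req), len(req[0])
    # Phase 1: local arbitration, each input picks one of its requested outputs.
    choice = [round_robin_arbiter(req[i], in_ptr[i]) for i in range(n_in)]
    grants = [[False] * n_out for _ in range(n_in)]
    # Phase 2: global arbitration, each output picks one input that chose it.
    for o in range(n_out):
        contenders = [choice[i] == o for i in range(n_in)]
        winner = round_robin_arbiter(contenders, out_ptr[o])
        if winner is not None:
            grants[winner][o] = True
    return grants


# Input 0 requests outputs 0 and 1; input 1 requests only output 0.
req = [[True, True], [True, False]]
separable_grants = separable_input_first(req, in_ptr=[0, 0], out_ptr=[0, 0])
separable_count = sum(sum(row) for row in separable_grants)
```

Here input 0 could have taken output 1 and left output 0 to input 1, but the first-phase arbiters decide independently and cannot see that.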

  4. Wavefront Allocator [Tamir’93]
     • Consider inputs and outputs together
       • Grant requests on diagonal, kill conflicts
       • Repeat for other diagonals
     • Pros:
       • Tends to generate better matchings
       • Tiled design facilitates full-custom implementation
     • Cons:
       • Delay scales linearly
       • Original design has (false) combinational loops
     [Figure: wavefront allocator array between Inputs and Outputs]
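The diagonal-by-diagonal procedure can be sketched in software as follows. This is a behavioral model only, assuming a square request matrix and wrapped diagonals; `start_diag` is a hypothetical parameter naming the diagonal that is served first. It does not model the tiled hardware or its combinational loops.

```python
def wavefront(req, start_diag=0):
    """Behavioral model of wavefront allocation on an n x n request matrix."""
    n = len(req)
    row_free = [True] * n  # input i has not been granted yet
    col_free = [True] * n  # output o has not been granted yet
    grants = [[False] * n for _ in range(n)]
    for d in range(n):
        diag = (start_diag + d) % n
        # Cells with (i + o) mod n == diag lie in distinct rows and columns,
        # so grants issued within one diagonal never conflict with each other.
        for i in range(n):
            o = (diag - i) % n
            if req[i][o] and row_free[i] and col_free[o]:
                grants[i][o] = True
                row_free[i] = False  # kill later requests in this row
                col_free[o] = False  # kill later requests in this column
    return grants


# Input 0 requests outputs 0 and 1; input 1 requests only output 0.
req = [[True, True], [True, False]]
wavefront_from_diag1 = wavefront(req, start_diag=1)
wavefront_from_diag0 = wavefront(req, start_diag=0)
```

Because rows and columns are considered together, a request is only left ungranted if its input or output is already taken. The result still depends on which diagonal goes first: starting from diagonal 1 grants both inputs in this example, while starting from diagonal 0 grants only one.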

  5. Evaluation Methodology
     • Analytical models useful for developing intuition
       • But becoming increasingly inaccurate
       • Wire delay impact, synthesized vs. full-custom logic, …
     • Use two-pronged evaluation approach:
       • Delay & cost via detailed RTL-based evaluation
         • Synthesized using Synopsys Design Compiler in topo mode
         • Commercial 45nm low power library @ worst case
       • Network-level performance via simulation
         • Cycle-oriented interconnection network simulator
         • 64-node networks: 2D mesh & 2D flattened butterfly
         • Request-reply traffic, synthetic traffic patterns

  6. Virtual Channel Allocation
     • Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels)
     • Before packets can proceed through router, need to claim ownership of VC buffer at next router
     • VC allocator assigns waiting packets at inputs to output VC buffers that are not currently in use
       • P×V inputs (input VCs), P×V outputs (output VCs)
       • Once assigned, VC is used for entire packet’s duration

  7. Sparse VC Allocation (1)
     • VCs are used for variety of purposes:
       • Deadlock avoidance
         • Break cyclic dependencies
         • Routing deadlock (within network)
         • Protocol deadlock (at network boundary)
       • Flow control
         • Decouple buffers and channels to avoid head-of-line blocking
     • Idea: Partition set of VCs to restrict legal requests
       • Significantly reduces VC allocator logic complexity
       • Delay/area/power savings of up to 41%/90%/83%

  8. Sparse VC Allocation (2)
     [Figure: input VC (IVC) to output VC (OVC) request matrices for three configurations: 8 VCs (64 requests), 2×4 VCs (32 requests), 2×2×2 VCs (24 requests). Class labels: REQ, REP, NM, MIN. Per-VC request labels: P×8, P×4, P×2 requests.]
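The request counts in the figure (64, 32, 24) can be reproduced with a short counting sketch. The assumptions here are not stated on the slide: the counts are taken to be IVC/OVC pairs for one input port and one output port, a packet must stay within its message class (REQ or REP), and packets may move from the non-minimal (NM) resource class to the minimal (MIN) class but not back. The function and parameter names are illustrative.

```python
from itertools import product


def count_legal_requests(msg_classes, res_classes, vcs_per_class, res_transitions):
    """Count legal (input VC, output VC) pairs for one input/output port pair."""
    vcs = list(product(range(msg_classes), range(res_classes), range(vcs_per_class)))
    count = 0
    for (m_in, r_in, _), (m_out, r_out, _) in product(vcs, vcs):
        # Assumed rules: same message class, and an allowed resource class transition.
        if m_in == m_out and (r_in, r_out) in res_transitions:
            count += 1
    return count


# 8 VCs, no partitioning: every IVC may request every OVC.
flat = count_legal_requests(1, 1, 8, {(0, 0)})
# 2x4 VCs: two message classes (REQ, REP) with four VCs each.
by_message = count_legal_requests(2, 1, 4, {(0, 0)})
# 2x2x2 VCs: resource class 0 = NM, 1 = MIN; NM->NM, NM->MIN, MIN->MIN assumed legal.
by_message_and_resource = count_legal_requests(2, 2, 2, {(0, 0), (0, 1), (1, 1)})
```

Under these assumptions the three configurations give 64, 32 and 24 legal requests, matching the figure. The allocator only needs logic for the legal pairs, which is where the savings come from.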

  9. VC Allocator Performance [FBfly, 2×2×2 VCs]

  10. VC Allocator Delay

  11. VC Allocator Cost

  12. Switch Allocation
     • Flits require crossbar access to traverse router
     • VCs at each input port share crossbar input
     • Switch allocator generates crossbar schedule
       • Allocation performed on cycle-by-cycle basis
       • P×V inputs (input VCs), P outputs (output ports)
       • At most one VC per input can be granted in each cycle
     • Speculative allocation reduces zero-load latency
       • Start switch allocation before VC allocation completes

  13. Pessimistic Speculation (1)
     • Conventional approach:
       • Separate allocators for spec. and non-spec. requests
       • Non-spec. grants mask conflicting spec. grants
       • Conflict detection is on critical path
     • At low load, most requests are granted
     • Idea: Assume all requests will be granted
       • Mask spec. grants with non-spec. requests
       • Overlap conflict detection and allocation
     • Sacrifice speculation accuracy for lower delay
       • But preserve zero-load latency improvement
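The difference between the two masking schemes can be sketched at port granularity. This is a simplified model, assuming boolean matrices indexed by input port and output port and treating a speculative grant as conflicting when its input row or output column is in use. The function name and example matrices are illustrative and not taken from the RTL.

```python
def mask_spec_grants(spec_grants, nonspec):
    """Drop speculative grants whose input row or output column is used in nonspec."""
    n_in, n_out = len(nonspec), len(nonspec[0])
    row_busy = [any(nonspec[i]) for i in range(n_in)]
    col_busy = [any(nonspec[i][o] for i in range(n_in)) for o in range(n_out)]
    return [
        [spec_grants[i][o] and not row_busy[i] and not col_busy[o] for o in range(n_out)]
        for i in range(n_in)
    ]


T, F = True, False
# Inputs 0 and 1 both make non-speculative requests for output 0; input 0 wins.
nonspec_requests = [[T, F, F], [T, F, F], [F, F, F]]
nonspec_grants = [[T, F, F], [F, F, F], [F, F, F]]
# The speculative allocator granted input 1 -> output 1 and input 2 -> output 2.
spec_grants = [[F, F, F], [F, T, F], [F, F, T]]

# Conventional: mask with non-speculative grants (must wait for that allocator).
conventional = mask_spec_grants(spec_grants, nonspec_grants)
# Pessimistic: mask with non-speculative requests (available before allocation finishes).
pessimistic = mask_spec_grants(spec_grants, nonspec_requests)
```

In this example the conventional scheme keeps both speculative grants, while the pessimistic scheme drops the grant at input 1 because that input also has a non-speculative request, even though the request lost. That is the accuracy given up for a shorter critical path. With no non-speculative requests at all, as at zero load, both schemes keep every speculative grant.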

  14. Pessimistic Speculation (2)
     [Figure: block diagram with nonspec. allocator (nonspec. requests in, nonspec. grants out), conflict detection, spec. allocator (spec. requests in, spec. grants out), and mask]

  15. Switch Allocator Performance (1) [Mesh, 2×1×1 VCs]

  16. Switch Allocator Performance (2) [FBfly, 2×2×4 VCs] (figure annotation: >20%)

  17. Switch Allocator Delay

  18. Switch Allocator Cost

  19. Speculation Performance (1) [Mesh, 2×1×1 VCs]

  20. Speculation Performance (2) [FBfly, 2×2×4 VCs]

  21. Speculation Implementation

  22. Conclusions
     • Network-level performance is largely insensitive to VC allocator implementation
       • Light effective load facilitates near-ideal matchings
     • Sparse VC allocation can greatly reduce delay & cost
       • Partition set of VCs based on functionality
       • Restrict possible requests allocator must handle
     • For switch allocation, wavefront allocator produces better matchings but increases delay & cost
       • Difference increases with number of ports, VCs
     • Pessimistic speculation reduces switch allocator delay
       • Trade for some performance degradation near saturation