Sheng Ma 1,2 , Natalie Enright Jerger 2 , Zhiying Wang 1

Whole Packet Forwarding: Efficient Design of Fully Adaptive Routing Algorithms for Networks-on-chip. Sheng Ma 1,2 , Natalie Enright Jerger 2 , Zhiying Wang 1 1 National University of Defense Technology, China 2 University of Toronto, Canada. Overview.


Presentation Transcript


  1. Whole Packet Forwarding: Efficient Design of Fully Adaptive Routing Algorithms for Networks-on-chip • Sheng Ma 1,2, Natalie Enright Jerger 2, Zhiying Wang 1 • 1 National University of Defense Technology, China • 2 University of Toronto, Canada

  2. Overview • NoC performance is coupled to buffer and virtual channel (VC) utilization • Buffers or VCs consume a large fraction of NoC area and power • ViChaR [Nicopoulos et al., MICRO 2006], FVADA [Xu et al., HPCA 2010] • Virtual channel re-allocation determines how efficiently VC resources are used

  3. Overview • Novel VC re-allocation scheme – Whole packet forwarding • Fully adaptive routing in wormhole networks • Combines flit-based and packet-based flow controls • Efficient with large fraction of short packets

  4. Overview • Novel VC re-allocation scheme – Whole packet forwarding • Fully adaptive routing in wormhole networks • Combines flit-based and packet-based flow controls • Efficient with large fraction of short packets • An efficient design of fully adaptive routing

  5. Overview • Novel VC re-allocation scheme – Whole packet forwarding • Fully adaptive routing in wormhole networks • Combines flit-based and packet-based flow controls • Efficient with large fraction of short packets • An efficient design of fully adaptive routing • Important in a VC-limited network • Maintains packet adaptivity with low hardware overhead

  6. Background: Cache Coherent NoCs • Large fraction of packets are short • Due to abundant wiring resources on chip

  7. Background: Cache Coherent NoCs • Large fraction of packets are short • Due to abundant wiring resources on chip • Multiple virtual networks (VNs) are needed • Prevent protocol-level deadlock • VC budget is limited for each VN

  8. Conservative VC Re-allocation • Fully adaptive routing requires conservative VC re-allocation • A VC can be re-allocated only when it is empty • Results in low VC utilization

  9. Conservative VC Re-allocation • Conservative VC re-allocation exists to avoid deadlock • Two key reasons for deadlock • Cyclic channel dependency • A packet spans two VCs: its header flit is not at the head of the first VC, while its tail flit holds the second VC

  10. Partially Adaptive or Deterministic Routing • Key reasons for deadlock • Cyclic channel dependency • A packet spans two VCs: its header flit is not at the head of the first VC, while its tail flit holds the second VC • Partially adaptive and deterministic routing provide limited physical paths

  11. Aggressive VC Re-allocation • Partially adaptive and deterministic routing use aggressive VC re-allocation • A VC can be re-allocated once it receives the tail flit of the last allocated packet • Achieves high VC utilization at the cost of packet adaptivity

  12. Whole Packet Forwarding (WPF) • Key reasons for deadlock • Cyclic channel dependency • A packet spans two VCs: its header flit is not at the head of the first VC, while its tail flit holds the second VC • If a downstream VC has enough free slots to hold the entire packet, then this VC can be re-allocated to the new packet
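The three re-allocation rules described on the slides (conservative, aggressive, WPF) can be compared side by side. The following is a minimal sketch, not the authors' implementation: the `VCState` record, its field names, and the `can_reallocate` function are hypothetical, and it assumes that every policy waits until the tail flit of the previously allocated packet has arrived, and that WPF falls back to the conservative rule when the VC is empty.

```python
from dataclasses import dataclass


@dataclass
class VCState:
    """Hypothetical view of a downstream VC as seen by the upstream router."""
    depth: int           # total buffer slots in the VC
    occupied: int        # slots currently holding flits
    tail_received: bool  # tail flit of the last allocated packet has arrived

    @property
    def free_slots(self) -> int:
        return self.depth - self.occupied


def can_reallocate(vc: VCState, packet_len: int, policy: str) -> bool:
    """Return True if the VC may be allocated to a new packet of packet_len flits."""
    if not vc.tail_received:
        # Assumption: no policy hands out a VC whose previous packet is incomplete.
        return False
    if policy == "conservative":
        # Only an empty VC may be re-allocated.
        return vc.occupied == 0
    if policy == "aggressive":
        # Receiving the previous tail flit is enough.
        return True
    if policy == "wpf":
        # An empty VC is always fine; a non-empty VC must fit the whole packet.
        return vc.occupied == 0 or vc.free_slots >= packet_len
    raise ValueError(f"unknown policy: {policy}")


# A 4-flit VC that still holds one flit of an earlier packet.
vc = VCState(depth=4, occupied=1, tail_received=True)
for policy in ("conservative", "aggressive", "wpf"):
    print(policy, can_reallocate(vc, packet_len=1, policy=policy))
```

Under these assumptions a single-flit packet can enter the partly filled VC with WPF but not with conservative re-allocation, while a 5-flit packet would have to wait for an empty VC.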

  13. WPF vs. Packet-based Flow Controls • Packet-based flow controls • Store-and-forward (SAF) and virtual cut-through (VCT) • Require enough buffer space to hold an entire packet before VC re-allocation • VCs must be as deep as the maximum packet length • WPF does not have such a requirement

  14. Fully Adaptive Routing Design • WPF can be utilized with many fully adaptive routing algorithms • We propose a novel fully adaptive routing algorithm using WPF • Fully adaptive routing based on Duato’s theory [Duato, IEEE Trans. Parallel Distrib. Syst. 1993] • Classifies adaptive and escape VCs • Escape VCs can only be used when the output port adheres to DOR • Supports jumping between escape and adaptive VCs • A packet must be able to always apply for an escape VC
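The escape/adaptive VC split can be illustrated by listing which (output port, VC) pairs a packet may request at one hop. This is a sketch under stated assumptions rather than the routing function from the talk: it assumes a 2D mesh with (x, y) coordinates, minimal routing, XY dimension-order routing as the DOR variant, and VC 0 as the single escape VC; the function names are hypothetical.

```python
def minimal_ports(cur, dst):
    """Output ports that lie on a minimal path in a 2D mesh."""
    ports = []
    if dst[0] > cur[0]:
        ports.append("E")
    elif dst[0] < cur[0]:
        ports.append("W")
    if dst[1] > cur[1]:
        ports.append("N")
    elif dst[1] < cur[1]:
        ports.append("S")
    return ports


def dor_port(cur, dst):
    """XY dimension-order routing: correct the X offset first, then Y."""
    if dst[0] != cur[0]:
        return "E" if dst[0] > cur[0] else "W"
    if dst[1] != cur[1]:
        return "N" if dst[1] > cur[1] else "S"
    return None  # already at the destination


def vc_candidates(cur, dst, num_vcs=2, escape_vc=0):
    """(port, vc) pairs a packet may request at this hop."""
    escape_port = dor_port(cur, dst)
    candidates = []
    for port in minimal_ports(cur, dst):
        for vc in range(num_vcs):
            # The escape VC is only legal on the port chosen by DOR.
            if vc == escape_vc and port != escape_port:
                continue
            candidates.append((port, vc))
    return candidates


print(vc_candidates((0, 0), (2, 2)))
```

Because the DOR port is always among the minimal ports, the candidate list contains an escape VC at every hop before the destination, which is the property the last bullet of the slide asks for.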

  15. Maintaining Packet Adaptivity • Existing designs select the output port first (port-selection-first) • They do not support jumping between escape and adaptive VCs

  16. Maintaining Packet Adaptivity • Key point: A packet must be able to always apply for an escape VC

  17. Maintaining Packet Adaptivity • Key point: A packet must be able to always apply for an escape VC

  18. Hardware Implementation

  19. Hardware Overhead • [Figure: hardware overhead of the naive design and the FULLY design]

  20. Evaluation • Simulators: Booksim [Dally & Towles, PPIN], FeS2 [Neelakantam et al., ASPLOS 2008] • Deterministic routing: dimension order routing (DOR), with aggressive VC re-allocation • Partially adaptive routing: negative-first, odd-even, etc., with aggressive VC re-allocation • Fully adaptive routing: port-selection-first (PSF) and FULLY, with conservative VC re-allocation or whole packet forwarding • Network: 4x4 mesh, 1 virtual network, 2 VCs per VN, 4 flits per VC • Traffic: single-flit packets (80%) and 5-flit packets (20%)
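A short calculation shows what this traffic mix means for a 4-flit VC. It is a sketch derived only from the numbers on the slide and the WPF rule stated earlier (a non-empty VC must have room for the whole packet); the function names are hypothetical.

```python
# Packet mix and VC depth taken from the evaluation setup on the slide.
PACKET_MIX = {1: 0.80, 5: 0.20}  # packet length in flits -> fraction of packets
VC_DEPTH = 4                     # flits per VC


def average_packet_length(mix):
    """Expected packet length in flits for the given mix."""
    return sum(length * share for length, share in mix.items())


def share_eligible_for_nonempty_vc(mix, vc_depth):
    """Fraction of packets that could ever enter a non-empty VC under WPF.

    A non-empty VC holds at least one flit, so at most vc_depth - 1 slots
    are free; longer packets can only be sent to an empty VC.
    """
    return sum(share for length, share in mix.items() if length <= vc_depth - 1)


print(average_packet_length(PACKET_MIX))
print(share_eligible_for_nonempty_vc(PACKET_MIX, VC_DEPTH))
```

With this mix the average packet is about 1.8 flits, and only the single-flit packets (80% of traffic) can take advantage of a partly filled VC; the 5-flit packets still need an empty one.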

  21. Performance for Synthetic Workloads • [Figure: bit reverse traffic] • Fully adaptive routing is strongly limited by low VC utilization • WPF can greatly improve the performance of fully adaptive routing • Maintaining packet adaptivity is important with a limited VC budget

  22. Performance for Synthetic Workloads • [Figure panels: (a) Transpose-1, (b) Transpose-2] • Partially adaptive routing provides uneven adaptivity: unstable performance for symmetric patterns • Fully adaptive routing maintains stable performance for symmetric patterns

  23. Full System Configuration • PARSEC: simsmall input sets, total task runtime • 16-core x86 computing platform: private L1 (32KB), private L2 (512KB), 16 threads • 4x4 mesh • MOESI distributed directory • 4 unordered virtual networks: each VN has the same configuration as in the synthetic experiments • Cache line: 64 bytes • Flit size: 16 bytes
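The cache line and flit sizes determine how many flits a data packet needs. The sketch below assumes one extra header flit per data packet for address and control information; that header assumption is not stated on the slide and is only an illustration.

```python
import math

CACHE_LINE_BYTES = 64  # from the slide
FLIT_BYTES = 16        # from the slide
HEADER_FLITS = 1       # assumption: one flit carries address/control information


def data_packet_flits(line_bytes, flit_bytes, header_flits):
    """Flits needed to carry one cache line plus the assumed header."""
    payload_flits = math.ceil(line_bytes / flit_bytes)
    return header_flits + payload_flits


print(data_packet_flits(CACHE_LINE_BYTES, FLIT_BYTES, HEADER_FLITS))
```

Under that assumption a cache-line transfer takes 4 payload flits plus 1 header flit, which matches the 5-flit long packets used in the synthetic experiments.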

  24. Performance for PARSEC • Sophisticated routing algorithms improve saturation throughput • They have little effect on light-load applications • Significant system performance improvement for heavy-load applications • [Figure: heavy-load and light-load applications; annotated value: 37.8%]

  25. Sensitivity to Single-flit Packet Ratio • [Figure panels: (a) Transpose-1 with 60% single-flit packet (SFP) ratio, (b) Transpose-1 with 40% SFP ratio] • Aggressive VC re-allocation makes DOR and partially adaptive routing insensitive to the packet length distribution • The effect of WPF drops with a lower SFP ratio

  26. Conclusion • Flow control has a significant impact when a large fraction of packets are short • WPF allows a non-empty VC to be re-allocated in fully adaptive routing • Especially effective in VC-limited networks • WPF does not introduce deadlock • Maintaining packet adaptivity is important in VC-limited networks • Modest hardware overhead

  27. Thank you! Questions?

  28. Backup Slides - Deadlock-freedom Proof • By contradiction • Assume there is a deadlock configuration for WPF (config0) • Step 1: Build another configuration (config1) by removing those packets whose header flits are not at the head of VCs

  29. Backup Slides - Deadlock-freedom Proof • Step 2: Config1 can be achieved with conservative VC re-allocation • WPF can only forward packets to those VCs which are allowed • Step 3: Config1 is a deadlock configuration • Removing packets will not create empty VCs • Config1 would then be a deadlock reachable under conservative VC re-allocation, which contradicts its deadlock freedom

  30. Backup Slides – Scalability • Performance trends among different algorithms are the same • Maintaining packet adaptivity is more important in larger networks

  31. Backup Slides - Sensitivity to VC count • [Figure panels: (a) Bit reverse with 2 VCs, (b) Bit reverse with 4 VCs] • Increasing the VC count from 2 to 4 has a minor effect on DOR • More VCs greatly increase the performance of PSF and FULLY • WPF shows a decreasing effect with more VCs

  32. Backup Slides – Different packet length • More downstream VC status registers are needed

  33. Backup Slides – HoL Blocking • In a VC-limited environment, the benefit of high VC utilization outweighs the negative effect of head-of-line (HoL) blocking

  34. Backup Slides – WPF vs. Aggressive • With WPF, a non-empty VC can be re-allocated only when it has enough free slots to hold the whole packet, while aggressive VC re-allocation has no such requirement

  35. Adaptive Routing • Dynamically route traffic so that load is balanced (uniform link utilization) • Packets may therefore take completely different paths from the same source to the same destination, depending on current network congestion • If congestion is encountered, a packet can be routed around it through a less congested path • Adaptive routing algorithms differ in how accurately they identify the best dynamic path
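One simple way to route around congestion is to pick, among the permitted output ports, the one whose downstream buffers have the most free space. The sketch below is only an illustration of that idea, not a selection function from the talk: `select_output_port` and the free-credit map are hypothetical, and ties go to the first candidate listed.

```python
def select_output_port(candidate_ports, free_credits):
    """Pick the candidate port with the most free downstream buffer slots.

    candidate_ports: ports allowed by the routing function, in preference order.
    free_credits: port -> free downstream slots (a local congestion estimate).
    """
    if not candidate_ports:
        raise ValueError("no candidate output ports")
    # max() returns the first port with the highest credit count,
    # so earlier candidates win ties.
    return max(candidate_ports, key=lambda port: free_credits.get(port, 0))


# East is nearly full, so the packet is steered north instead.
print(select_output_port(["E", "N"], {"E": 1, "N": 3}))
```

Local credit counts are only one possible congestion estimate; other adaptive schemes use different or more global information when choosing the path.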
