
Regional Congestion Awareness for Load Balance in Networks-on-Chip

Boris Grot, Paul Gratz, Steve Keckler. The University of Texas at Austin, Department of Computer Sciences.


Presentation Transcript


  1. Regional Congestion Awareness for Load Balance in Networks-on-Chip
     Boris Grot, Paul Gratz, Steve Keckler
     The University of Texas at Austin, Department of Computer Sciences
     UTCS

  2. The Era of Many-core
     • Intel Polaris
       • 80 tiles
       • 8x10 2D mesh
     • UT TRIPS
       • 2x16 exec tiles
       • 16 NUCA tiles
     • Tilera Tile
       • 64 cores
       • 5 networks

  3. The Era of Many-core
     • Many tiles: cores, cache, accelerators, more!
     • Many on-chip networks
     • Many traffic types: operands, memory, I/O

  4. Networks on a Chip (NOCs)
     • First-order system-level impact:
       • Performance
       • Energy
       • Resilience
     • Prior work:
       • Topology (Dally, DAC 2001)
       • Flow Control (Dally, IEEE Trans. on Computers 1987)
       • Router µArch (Peh, HPCA 2001)
       • Prototyping (Taylor, IEEE Micro 2002)
       • Routing (Seo, ISCA 2005; Kim, DAC 2005)

  5. Routing Policy
     • Determines the path from source to destination
     • Directly impacts load-balancing properties of the network
       • Ability to spread network load
       • Major performance implications
     • Current NOC research & practice: DOR
       • Deadlock freedom
       • Low implementation complexity
       • Fast route calculation
       • Poor load-balancing properties
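
     The DOR policy named on this slide can be sketched in a few lines. This is a minimal illustration, assuming a 2D mesh with (x, y) node coordinates and X-then-Y ordering; the function name `dor_route` and the coordinate convention are hypothetical and do not come from the slides.

```python
def dor_route(src, dst):
    """Return the nodes visited from src to dst under X-then-Y routing.

    Assumption: nodes are (x, y) tuples on a 2D mesh without wraparound.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Fully resolve the X dimension first.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Only then move in the Y dimension.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(dor_route((0, 0), (2, 1)))
```

     Because the route depends only on the source and destination, the calculation is fast and simple, but the path cannot react to congestion, which is the load-balancing weakness the slide points out.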

  6. Routing Example: Transpose Traffic
     • Dimension-Order Routing (DOR): avg latency = 230 cycles
     • Adaptive Routing: avg latency = 18 cycles
     • Wanted: Load Balance
     • Figure label: 100%
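
     To see why transpose traffic stresses DOR, the sketch below counts how many flows cross each directed link when every node (x, y) sends to (y, x) under X-then-Y routing. It is a static flow count, not a latency simulation, and it does not reproduce the cycle counts on the slide. The helper names `dor_links` and `transpose_link_load` and the k x k mesh model are assumptions made for illustration.

```python
from collections import Counter


def dor_links(src, dst):
    """Yield the directed links ((x, y), (x2, y2)) used by X-then-Y routing."""
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        yield (x, y), (nx, y)
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        yield (x, y), (x, ny)
        y = ny


def transpose_link_load(k):
    """Count flows per directed link for transpose traffic on a k x k mesh."""
    load = Counter()
    for x in range(k):
        for y in range(k):
            # Transpose pattern: node (x, y) sends to node (y, x).
            load.update(dor_links((x, y), (y, x)))
    total_links = 4 * k * (k - 1)  # directed links in a k x k mesh
    return load, total_links


if __name__ == "__main__":
    load, total = transpose_link_load(8)
    print("busiest link carries", max(load.values()), "flows")
    print("links used:", len(load), "of", total)
```

     Under this model an 8x8 mesh has a busiest link carrying 7 flows while half of the 224 directed links carry none, which is the kind of imbalance an adaptive policy tries to remove.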

  7. Outline
     • Adaptive routing
     • Problems with adaptive routing
     • Regional Congestion Awareness
     • Evaluation
     • Conclusion

  8. Adaptive Routing
     • Path is a function of network condition
     • Dynamically balances load among network links
     • Used in systems from IBM, Cray, DEC, etc.
     • Issues:
       • Deadlock (Duato, Trans. on Parallel & Dist’d Systems 1993)
       • Minimal vs non-minimal routing
       • Router complexity & latency (Kim, DAC 2005)
       • Performance

  9. Adaptive Routing: Performance Issues
     • Performance depends on ability to estimate network congestion
     • Local metrics
       • Downstream VC & buffer availability
       • Crossbar (XB) demand (i.e., output port contention)
     • Limitations of local metrics
       • Myopic congestion estimation
         • By the time congestion is encountered, it's too late
         • Congestion in the center and underutilization at the edges
       • Poor load-balancing properties
         • Uniformly distributed traffic
         • Transient hot spots
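
     A locally adaptive router of the kind described on this slide can be sketched as follows: among the output ports that keep the route minimal, pick the one whose downstream neighbor reports the most free virtual channels. The port names ("E", "W", "N", "S"), the convention that "N" means increasing y, and the functions `minimal_ports` and `select_port` are all assumptions for illustration, not the authors' implementation.

```python
def minimal_ports(cur, dst):
    """Return the output ports that move a packet closer to dst on a 2D mesh."""
    ports = []
    if dst[0] > cur[0]:
        ports.append("E")
    elif dst[0] < cur[0]:
        ports.append("W")
    if dst[1] > cur[1]:
        ports.append("N")
    elif dst[1] < cur[1]:
        ports.append("S")
    return ports


def select_port(cur, dst, free_vcs):
    """Pick the minimal port with the most free downstream VCs (local metric).

    free_vcs maps each port name to the free-VC count of the adjacent router.
    Returns None when the packet has arrived and should be ejected.
    """
    candidates = minimal_ports(cur, dst)
    if not candidates:
        return None
    # Ties go to the first candidate, i.e. the X-dimension port.
    return max(candidates, key=lambda port: free_vcs[port])


if __name__ == "__main__":
    print(select_port((1, 1), (3, 3), {"E": 1, "N": 3, "W": 4, "S": 4}))
```

     The decision only sees the state of the immediate neighbors, so a port that looks free here can still lead into a congested region two or more hops away. That is the myopia listed under the limitations.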

  10. Ideal Routing
     • Perfect knowledge of network state
     • Low router complexity
       • Low logic & state overhead
       • No impact on critical path
     • Low bandwidth requirements

  11. Regional Congestion Awareness (RCA)
     • Local data collection
     • Propagation to neighboring routers
     • Aggregation of local & non-local data
     • Trivial logic & state overhead
     • Low bandwidth requirements
     • Significantly improved network visibility

  12. RCA 1D

  13. RCA Fanin

  14. Baseline Router µArch (Kim et al.)

  15. RCA Router µArch

  16. RCA Router µArch

  17. RCA Router µArch

  18. RCA Router µArch (RCA 1D)

  19. RCA Router µArch (RCA Fanin)

  20. RCA Details
     • Aggregation
       • Local vs non-local weight assignment: 50-50
       • Trivial logic (one 8-bit adder/port)
     • Propagation
       • Differentiates RCA variants
       • Trivial complexity (0-2 8-bit adders/port)
     • RCA bandwidth
       • Baseline: 8 bits/channel
       • Can be reduced by serializing each update
       • Negligible performance impact at 1 bit/channel
       • Subject to traffic pattern stability
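
     The 50-50 aggregation can be illustrated along a single row of routers, in the spirit of the RCA 1D variant. The sketch below is an assumption-laden model rather than the authors' hardware: congestion estimates are treated as 8-bit integers, the weighting is one add followed by a one-bit right shift, the router at the edge is assumed to receive a zero estimate, and propagation is computed in one pass even though real routers would forward values hop by hop. The function name `rca_1d_east` is hypothetical.

```python
def rca_1d_east(local):
    """Aggregate eastward congestion estimates along one row of routers.

    local[i] is router i's own 8-bit estimate (0..255) for its East direction.
    Each router averages its local value with the aggregate received from its
    East neighbor (50-50 weighting) and passes the result to the West.
    """
    aggregated = [0] * len(local)
    incoming = 0  # assumption: nothing arrives from beyond the mesh edge
    for i in reversed(range(len(local))):
        # One adder plus a one-bit shift implements the 50-50 weighting.
        aggregated[i] = (local[i] + incoming) >> 1
        incoming = aggregated[i]
    return aggregated


if __name__ == "__main__":
    print(rca_1d_east([0, 0, 0, 200]))
```

     With a hot spot of 200 at the far end of a four-router row, this model yields [12, 25, 50, 100]: each router hears about the congestion, with the contribution halving per hop, so nearby conditions weigh more than distant ones. Serializing the 8-bit value over a 1 bit/channel link, as the slide mentions, would take eight transfers per update in this model.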

  21. Experimental Methodology
     • Combined XB Demand + Free VCs
     • Footnote 1: Splash traces courtesy of A. Kumar et al.


  28. Results: Splash

  29. Results: Splash

  30. RCA Conclusions
     • Improved congestion estimation through aggregation of local and non-local measurements
     • Significant performance improvement
       • Improved load-balancing
       • Better throughput
       • 71% max latency reduction on Splash
     • Low complexity, no critical path impact
     • Multiple configurations possible
       • Performance-complexity trade-offs

  31. RCA Future Work
     • Performance in other topologies
       • E.g., 2D and 3D tori
     • Applicability to off-chip networks
       • Early results are promising
     • System-level energy/power impact
       • Earlier task completion vs RCA overhead
     • Network fault tolerance
       • Expect significant improvements in network performance under one or more faults via improved load balancing
