Jieming Yin*, Pingqiang Zhou+, Sachin S. Sapatnekar* and Antonia Zhai*

Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems. Jieming Yin*, Pingqiang Zhou+, Sachin S. Sapatnekar* and Antonia Zhai*. *University of Minnesota, Twin Cities, USA; +ShanghaiTech University, China.

Presentation Transcript


  1. Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems. Jieming Yin*, Pingqiang Zhou+, Sachin S. Sapatnekar* and Antonia Zhai*. *University of Minnesota, Twin Cities, USA; +ShanghaiTech University, China. 28th IEEE International Parallel & Distributed Processing Symposium.

  2. Heterogeneous Multicore System. [Figure: CPU and GPU cores connected through an interconnection network to L2 caches and memory (MEM).] ShanghaiTech

  3. On-chip Traffic Characteristics. CPU traffic pattern: erratic, random, latency-sensitive; switching mechanism: packet switching. GPU traffic pattern: streaming, dedicated, throughput-intensive; switching mechanism: circuit switching. NoCs must handle different traffic differently.

  4. Packet Switching vs. Circuit Switching: Performance Perspective. [Figure: two timing diagrams, packet-switched and circuit-switched, each running from the source node through three intermediate nodes to the destination node. Labels: link traversal, router pipeline, setup, ack, data, setup delay, network delay.]

  5. Packet Switching vs. Circuit Switching: Energy Perspective. [Figure: packet-switched and circuit-switched router paths, with the allocation and arbitration stages marked.] Circuit-switched NoC: potentially energy efficient for certain traffic patterns.

  6. Packet Switching or Circuit Switching. Packet switching: flexible and scalable, but costly in latency and energy. Circuit switching: better latency and energy, but requires setup and maintenance. By communication frequency (regular or erratic) and destination (fixed or random): circuit switching fits regular traffic to a fixed destination; packet switching fits the other combinations. NoC with both packet and circuit switching?

  7. Multi-plane vs. Single-plane. Multi-plane: independent packet-switched (PS) and circuit-switched (CS) planes; increased hardware requirement and low resource utilization. Single-plane: packet and circuit switching share the same communication fabric (PS+CS). How can packet and circuit switching share the same fabric?

  8. Space-Division Multiplexing (SDM). Physically divide a channel into sub-channels shared by PS and CS traffic; in the figure, flows A, B, C and D receive 4 bits, 2 bits, 1 bit and 1 bit of the channel, respectively. Prior work: K. Lusala et al., IJRC 2012; S. Secchi et al., DSD 2008; A. K. Lusala, ReCoSoC 2011; M. Modarressi et al., DATE 2009. SDM suffers from the packet serialization problem.

  9. Time-Division Multiplexing (TDM). [Figure: over time slots 0 to 7, the full 8-bit channel is assigned to one of the flows A, B, C and D in each slot (PS+CS).] We propose a TDM-based hybrid-switched NoC!
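The contrast between the SDM and TDM slides can be sketched in a few lines. This is an illustrative model only, not code from the presentation: the sub-channel widths (4, 2, 1, 1 bits) and the 8-bit channel come from the slides, while the flit size and the exact slot order are assumptions made for the example.

```python
from math import ceil

CHANNEL_BITS = 8   # channel width shown on the TDM slide
FLIT_BITS = 8      # assumption: one flit is as wide as the full channel

# SDM: each flow permanently owns a slice of the channel (widths from the SDM slide).
sdm_widths = {"A": 4, "B": 2, "C": 1, "D": 1}

# TDM: each flow owns the whole channel during its time slots.
# Assumption: this slot order is one reading of the slide's figure.
tdm_schedule = ["A", "A", "D", "C", "B", "B", "A", "A"]


def sdm_cycles_per_flit(flow):
    """Cycles needed to push one flit through the flow's narrow sub-channel."""
    return ceil(FLIT_BITS / sdm_widths[flow])


def tdm_share(flow):
    """Fraction of the time slots (and so of the bandwidth) owned by the flow."""
    return tdm_schedule.count(flow) / len(tdm_schedule)


for flow in sorted(sdm_widths):
    print(flow, sdm_cycles_per_flit(flow), tdm_share(flow))
```

Under these assumptions both schemes give every flow the same bandwidth share, but SDM stretches a flit of flow C over eight cycles, which is the serialization problem the SDM slide mentions; under TDM a flit crosses at full width within a single slot.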

  10. Outline • Introduction • Design TDM-based Hybrid-switching NoC • Optimizations for Hybrid Switching • Conclusion

  11. Hybrid-switched Router. [Figure: router with routing logic, VC allocator, SW allocator, slot tables and crossbar; each input and output port (1 to n) has a packet-switched and a circuit-switched path. Pipeline stage labels: BW, RC, VA, SA, ST for the packet-switched pipeline and HP, ST for the circuit-switched pipeline.]

  12. Circuit-switched Path Setup. [Figure: a circuit-switched (CS) path through routers R0 to R5, with time slots t0 to t7.] • Set up the path before transmission • Setup messages are sent through the packet-switched network • Acknowledge the source upon successful setup. Keep the time-slot assignment in slot tables.

  13. Slot Table Configuration Walkthrough. [Figure: four snapshots of a four-slot table (s0 to s3) for inputs in_1 and in_2; each entry holds a valid bit (v) and an output (out).] ① Setup 1 (in_1 → out_4, slot_id = 2, duration = 2) succeeds. ② Setup 2 (in_1 → out_3, slot_id = 3, duration = 1) fails. ③ Teardown 1 (in_1 → out_4, slot_id = 2, duration = 2). ④ Table after the teardown.
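The walkthrough above can be replayed with a small model of a slot table. This is a sketch under stated assumptions rather than the authors' hardware: the class and method names are hypothetical, the conflict rules (a slot is refused if the input is already reserved or if another input already drives the same output in that slot) are one plausible reading of the slide, and wrap-around at the end of the table is assumed.

```python
class SlotTable:
    """Hypothetical per-router slot table: one entry per input port and time slot."""

    def __init__(self, num_slots, inputs):
        self.num_slots = num_slots
        # None marks an invalid entry (v = 0): the slot is left to packet switching.
        self.table = {port: [None] * num_slots for port in inputs}

    def _slots(self, slot_id, duration):
        # Assumption: a reservation wraps around the end of the table.
        return [(slot_id + i) % self.num_slots for i in range(duration)]

    def setup(self, in_port, out_port, slot_id, duration):
        """Reserve the requested slots if all are free; return True on success."""
        slots = self._slots(slot_id, duration)
        for s in slots:
            if self.table[in_port][s] is not None:
                return False  # input port already reserved in this slot
            if any(row[s] == out_port for row in self.table.values()):
                return False  # output port already driven by another input
        for s in slots:
            self.table[in_port][s] = out_port
        return True

    def teardown(self, in_port, out_port, slot_id, duration):
        """Invalidate the entries written by the matching setup."""
        for s in self._slots(slot_id, duration):
            if self.table[in_port][s] == out_port:
                self.table[in_port][s] = None


table = SlotTable(num_slots=4, inputs=["in_1", "in_2"])
first = table.setup("in_1", "out_4", slot_id=2, duration=2)   # step 1 on the slide
second = table.setup("in_1", "out_3", slot_id=3, duration=1)  # step 2 on the slide
after_setup = list(table.table["in_1"])
table.teardown("in_1", "out_4", slot_id=2, duration=2)        # step 3 on the slide
after_teardown = list(table.table["in_1"])
print(first, second, after_setup, after_teardown)
```

In this model the second setup fails because slot s3 of in_1 is still held by the first reservation, which matches the succeed/fail outcome shown on the slide.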

  14. Slot Table Size. Larger slot table: more energy overhead, longer packet waiting time, finer-grain multiplexing. Smaller slot table: less energy overhead, shorter packet waiting time, coarser-grain multiplexing. [Figure: slot-table state diagram with the labels initial, active, inactive, more request and reset.] Slot table size should be adjusted dynamically.

  15. Circuit-Switched Path Exclusiveness. Slot table entries (slot: v, out): s0: 1, out_3; s1: 1, out_3; s2: 0 (PS); s3: 1, out_2; s4: 1, out_2; s5: 0 (PS); s6: 1, out_1; s7: 1, out_1. The slot table and the SW allocator both produce crossbar configuration signals; slots with v = 1 are exclusively occupied by circuit-switched paths. • The crossbar must be configured before a circuit-switched flit's arrival. • A time slot is wasted if no circuit-switched flit is present.

  16. Time-slot Stealing. [Figure: slot table (v, out) with line address decoder, SW allocator, VC allocator and crossbar configuration signals; a "valid CS flit" signal from the upstream router acts as the enable.] Enable path reuse between packet- and circuit-switched data paths.
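The effect of time-slot stealing can be illustrated with a toy model. The slot table contents are taken from the path-exclusiveness slide; everything else is an assumption for the sake of the example: the function names are hypothetical, the arrival pattern is invented, and a packet-switched flit is assumed to be waiting whenever a slot becomes available to it.

```python
# Slot table from the path-exclusiveness slide: an output name marks a slot
# reserved for a circuit (v = 1); None marks a packet-switched slot (v = 0).
SLOT_TABLE = ["out_3", "out_3", None, "out_2", "out_2", None, "out_1", "out_1"]


def crossbar_owner(slot, cs_flit_valid, stealing):
    """Return who drives the crossbar in this slot: 'CS', 'PS' or 'idle'."""
    reserved = SLOT_TABLE[slot % len(SLOT_TABLE)] is not None
    if not reserved:
        return "PS"  # unreserved slot: the switch allocator decides as usual
    if cs_flit_valid:
        return "CS"  # a circuit-switched flit arrived: use the preset path
    # Reserved slot without a circuit-switched flit: wasted unless stealing is on.
    return "PS" if stealing else "idle"


def usable_slots(cs_arrivals, stealing):
    """Count the slots of one table period that can carry a flit of either kind."""
    return sum(
        crossbar_owner(s, cs_arrivals[s], stealing) != "idle"
        for s in range(len(SLOT_TABLE))
    )


# Hypothetical arrivals: a circuit-switched flit shows up in half of the reserved slots.
arrivals = [True, False, False, True, False, False, True, False]
print(usable_slots(arrivals, stealing=False), usable_slots(arrivals, stealing=True))
```

With this arrival pattern, three of the eight slots sit idle when reserved slots are exclusive, and none do once packet-switched flits may reuse an empty reserved slot.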

  17. Hybrid-switched Network • Path setup: endpoint selection (frequent communication pairs) and route selection (adaptive routing; the routing decision is based on the utilization of slot tables in neighboring routers) • Switching decision: made by referring to packet slack* (*J. Yin et al., ISLPED 2012)

  18. Full System Evaluation Platform. [Figure: tiled floorplan of CPU cores (C), GPU SMs (G), L2 caches (L2), memory controllers (M) and memory (MEM); each router (R) attaches to a CPU core, GPU SM, L2 cache or MC.] • Benchmarks • CPU: ammp, applu, art, equake, gafort, mgrid, swim, wupwise • GPU: blackscholes, lps, lib, nn, hotspot, pathfinder, sto

  19. Performance Evaluation. CPU: ↑ 0.3%; the CPU performance impact is negligible. GPU: ↑ 4.1%; GPU performance is improved.

  20. Network Energy Evaluation. 6.3% saving.

  21. Overall: Basic Hybrid-switched NoC. 0.3% CPU performance improvement; 4.1% GPU performance improvement; 6.3% network energy reduction. [Chart: CPU speedup, GPU speedup, network energy.] Can we do better?

  22. Outline • Introduction • Design TDM-based Hybrid-switching NoC • Optimizations for Hybrid Switching • Conclusion

  23. Opportunity: Low Path Utilization. [Figures: overlapped paths; circuit-switched paths are underutilized.] • Large number of overlapped circuit-switched paths • Circuit-switched paths are not fully utilized • Waste of on-chip resources (slot tables)

  24. Optimization: Path Sharing. [Figures: hitchhiker-sharing, showing a circuit-switched path and the hitchhiker-sharing sources; vicinity-sharing, showing a circuit-switched path and the vicinity-sharing destinations.] Enable path reuse among circuit-switched data paths.

  25. Performance Evaluation. CPU: ↑ 0.3% (basic), ↑ 0.2% (with path sharing). GPU: ↑ 4.1% (basic), ↑ 3.7% (with path sharing).

  26. Network Energy Evaluation. 6.3% saving (basic); 9.0% saving (with path sharing). Can we do EVEN better?

  27. Opportunity: Lower Buffer Pressure. [Chart: percentage of flits that are circuit-switched; legend: packet-switched, circuit-switched.] Observation: Circuit switching diverts on-chip traffic, alleviating the buffer pressure on packet-switched data paths.

  28. Optimization: Aggressive Power-gating. Circuit switching some of the packets alleviates buffer pressure, which facilitates more aggressive power gating. [Figure: input port with packet-switched buffers marked active or inactive, a circuit-switched path and a slot table.] Reduce dynamic and leakage power dissipation.

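  29. Performance Evaluation. CPU: ↑ 0.3% (basic), ↑ 0.2% (with path sharing), ↓ 1.6% (with aggressive power gating). GPU: ↑ 4.1% (basic), ↑ 3.7% (with path sharing), ↑ 2.6% (with aggressive power gating).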

  30. Network Energy Evaluation. 6.3% saving (basic); 9.0% saving (with path sharing); 17.1% saving (with aggressive power gating). Energy saving is significant.

  31. Overall. 1.6% CPU performance degradation; 2.6% GPU performance improvement; 17.1% network energy reduction. [Chart: CPU speedup, GPU speedup, network energy.]

  32. Conclusion • TDM-based Hybrid-switched Network • TDM is an efficient way to enable on-chip resource sharing • Hybrid-switched NoC handles different traffic differently • Performance • Energy efficiency • Scalability (in paper)
