
Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems


Presentation Transcript


  1. Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems
  Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai
  University of Minnesota – Twin Cities

  2. Networks-on-Chip
  [Diagram: tiled cores, each attached to a router (R)]
  • Scalable
  • Provides high bandwidth
  • Adds communication latency
  • Consumes significant energy

  3. Heterogeneous System
  [Diagram: tiles of data-parallel cores alongside super-scalar cores]
  • Only some routers are fully utilized

  4. DVFS for Reducing NoC Energy
  • Dynamic Voltage and Frequency Scaling
  • Router energy dominates NoC energy
  • DVFS reduces router energy, but adds delay
  • Previous work applies DVFS conservatively
  • We need more aggressive DVFS

  5. Limitations of Aggressive DVFS
  • DVFS reduces energy, but aggressive DVFS
  - increases latency
  - reduces throughput
  - works only for limited traffic patterns
  [Quadrant chart: latency sensitivity (sensitive/insensitive) vs. throughput (high/low), positioning our previous work* and this work; annotations: latency, throughput, contention]
  • * Zhou et al., NoC Frequency Scaling with Flexible-Pipeline Routers, ISLPED 2011

  6. Flexible-Pipeline Routers
  [Timing diagram: a 4-stage router pipeline running at frequency 0.5F, rigid vs. flexible]
  • A flexible pipeline reduces router pipeline delay at reduced frequency
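
As a rough illustration of the flexible-pipeline idea (a sketch, not the actual router microarchitecture from the ISLPED-2011 paper): when the clock is slowed by a factor s, a rigid pipeline still spends one longer cycle per stage, whereas a flexible pipeline merges stages into the longer cycle so that the absolute traversal time stays roughly constant. The 4-stage depth and 1 GHz base frequency below are assumed values.

```python
import math

def router_traversal_ns(scale, stages=4, base_freq_ghz=1.0, flexible=False):
    """Per-hop router traversal time (ns) under frequency scaling.

    Rigid pipeline: always `stages` cycles, each cycle longer at lower frequency.
    Flexible pipeline: stages are merged into the longer clock period, so the
    cycle count shrinks roughly in proportion to the scaling factor.
    """
    period_ns = 1.0 / (base_freq_ghz * scale)
    cycles = math.ceil(stages * scale) if flexible else stages
    return cycles * period_ns

for scale in (1.0, 0.5, 0.25):
    rigid = router_traversal_ns(scale)
    flex = router_traversal_ns(scale, flexible=True)
    print(f"freq scale {scale:4.2f}: rigid = {rigid:5.1f} ns, flexible = {flex:5.1f} ns")
```

In this toy model, at 0.5F the rigid pipeline's traversal time doubles (8 ns) while the flexible pipeline stays at 4 ns, which is the "reduces router pipeline delay" point on the slide.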

  7. Exploiting DVFS Opportunity
  [Diagram: legend for high/mid/low router utilization; packet 1 travels from Src1 to Dest1 under (a) minimal path routing, and packet 1' under (b) non-minimal path routing]

  8. Exploiting DVFS Opportunity (cont.)
  • Dynamic Energy: E_dyn ∝ Vdd²
  • Static Energy: E_sta ∝ Vdd
  • Clock Energy: E_clk ∝ Freq × Vdd²
  • Operating at mid-frequency gets most of the benefit
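
A back-of-the-envelope check of these proportionalities, assuming for illustration that Vdd scales linearly with frequency (the real V/F operating points depend on the technology):

```python
def relative_energy(freq_scale):
    """Relative per-router energy terms at a given frequency scale,
    assuming Vdd scales linearly with frequency (illustration only)."""
    vdd = freq_scale
    e_dyn = vdd ** 2               # E_dyn ∝ Vdd^2
    e_sta = vdd                    # E_sta ∝ Vdd
    e_clk = freq_scale * vdd ** 2  # E_clk ∝ Freq * Vdd^2
    return e_dyn, e_sta, e_clk

for s in (1.0, 0.5, 0.25):
    e_dyn, e_sta, e_clk = relative_energy(s)
    print(f"freq {s:4.2f}: E_dyn {e_dyn:6.4f}  E_sta {e_sta:4.2f}  E_clk {e_clk:6.4f}")
```

Under this assumption, dropping from full to half frequency already cuts dynamic energy by 75%; dropping further to quarter frequency recovers comparatively little extra energy while adding more latency, which is the intuition behind operating at mid-frequency.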

  9. Exploiting DVFS Opportunity (cont.)
  [Diagram: routers running at 100%, 50%, and 25% frequency; packet 1 takes the minimal path (a) and packet 1' the non-minimal path (b) from Src1 to Dest1]
  • Trade-offs: 1. Performance  2. Dynamic Energy  3. Static Energy
  • More benefit with a bigger network
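
A toy comparison of the two paths on this slide, under the same illustrative assumption as above (Vdd linear in frequency, one unit of dynamic energy per full-frequency hop); the hop counts and frequency levels are made up for illustration, not measured values:

```python
def path_dynamic_energy(hops_at_freq):
    """Total per-packet dynamic energy for a path, given a map
    {frequency_scale: hop_count}; per-hop energy scales as Vdd^2,
    with Vdd assumed linear in frequency (illustration only)."""
    return sum(hops * freq ** 2 for freq, hops in hops_at_freq.items())

# (a) Minimal path: 4 hops through routers kept at full frequency.
# (b) Non-minimal path: 6 hops through routers scaled to half frequency.
print("minimal    :", path_dynamic_energy({1.00: 4}))  # -> 4.0
print("non-minimal:", path_dynamic_energy({0.50: 6}))  # -> 1.5
```

The detour spends more hops (performance, static energy) but lets the routers along it sit at a lower V/F point; a bigger network offers more room to find such underutilized detours.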

  10. Outline
  • Introduction
  • Non-minimal path selection
  - Issue
  - Solution
  - Challenges
  • Infrastructure (CPU+GPU)
  • Results
  • Conclusion

  11. Non-minimal Path Routing
  [Diagram: legend for high/mid/low router utilization; Src to Dest under (a) minimal path routing and (b) non-minimal path routing]

  12. Too Close!
  [Diagram: Src and Dest only a few hops apart, under (a) minimal path routing and (b) non-minimal path routing, annotated with the impact on performance, static energy, and dynamic energy]
  • Issue: when the source and destination are too close, the detour costs more in performance, static energy, and dynamic energy than it saves

  13. Too Aggressive!
  [Diagram: non-minimal path routing from Src1 to Dest1, annotated with the impact on static and dynamic energy]
  • Issue: detouring too aggressively increases static and dynamic energy

  14. Dynamic Network Tuning
  • Packet: from the input/initial state, check Slack == 1; if yes and the remaining distance satisfies Dx >= 3 || Dy >= 3, send the packet to the least busy port and clear its slack (Slack = 0); otherwise send it to the minimal path port
  • Router: a utilization monitor drives V/F scaling and busy information propagation
  • Open question: how to determine Slack?
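
A minimal sketch of the per-packet decision in the flowchart above, assuming slack is carried as a single bit in the packet header and Dx/Dy are the remaining hop distances to the destination; the busy metrics come from the propagation described on the next slides, and the function signature and example values are hypothetical:

```python
def select_output_port(cur, dest, slack, port_busy, minimal_port):
    """Per-packet routing decision sketched from the flowchart.

    cur, dest    -- (x, y) of the current router and the destination
    slack        -- 1 if the packet can tolerate a detour, else 0
    port_busy    -- {port: aggregated busy metric} from busy-info propagation
    minimal_port -- the port a minimal-path routing function would choose
    Returns (chosen_port, remaining_slack).
    """
    dx, dy = abs(dest[0] - cur[0]), abs(dest[1] - cur[1])

    if slack == 1 and (dx >= 3 or dy >= 3):
        # Far enough from the destination: detour through the least busy port
        # and consume the slack (Slack = 0 in the flowchart).
        return min(port_busy, key=port_busy.get), 0

    # Too close, or no slack left: stay on the minimal path.
    return minimal_port, slack

# Example: a packet with slack, 5 hops away in x, detours to the least busy port.
print(select_output_port((0, 0), (5, 2), slack=1,
                         port_busy={"N": 0.2, "E": 0.7, "S": 0.9},
                         minimal_port="E"))   # -> ('N', 0)
```

The distance check (Dx >= 3 || Dy >= 3) guards against the "Too Close!" case, and consuming the slack bit limits how aggressively a packet can keep detouring.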

  15. Busy Information Propagation
  • Busy metrics:
  - Buffer utilization
  - Crossbar utilization
  - Router utilization
  • Propagation: regional congestion awareness [Grot et al., HPCA 2008]

  16. Regional Congestion Awareness
  • Local data collection
  • Propagation to neighboring routers
  • Aggregation of local & non-local data
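
A sketch of how the aggregated busy metric consumed by the routing decision above might be formed, following the regional congestion awareness idea [Grot et al., HPCA 2008]: blend locally collected utilization with the estimate propagated from the neighboring router. The equal weighting and the plain average over the three local metrics are assumptions, not the paper's exact parameters:

```python
def local_busy(buffer_util, crossbar_util, router_util):
    """Local busy metric: a plain average of the utilizations listed on
    Slide 15 (the actual combination is a design choice)."""
    return (buffer_util + crossbar_util + router_util) / 3.0

def aggregate_busy(local, propagated, weight=0.5):
    """Per-port congestion estimate: blend local data with the non-local
    value propagated from the neighboring router."""
    return weight * local + (1.0 - weight) * propagated

# Example: moderately busy here, heavily congested region downstream.
mine = local_busy(buffer_util=0.4, crossbar_util=0.3, router_util=0.5)
print(aggregate_busy(mine, propagated=0.9))   # -> 0.65
```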

  17. Slack in Applications
  [Timeline: Thread 0 issues a read miss while Threads 1..n keep executing; Thread 0 becomes ready and is scheduled again once the data returns]
  • Slack of a packet: the number of cycles the packet can be delayed without affecting the overall execution time
  • CPU: not necessarily zero slack, but we assume NO slack
  • GPU: based on the number of threads
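
A sketch of how the slack bit might be assigned when a packet is injected, matching the rule on this slide and the backup slide at the end (CPU packets are conservatively given no slack; GPU packets get slack when enough other warps are available to hide the delay). The warp threshold is a hypothetical parameter, not the value used in the paper:

```python
def packet_slack(source_type, ready_warps=0, warp_threshold=8):
    """Slack bit assigned at packet injection.

    CPU packets: conservatively assume no slack.
    GPU packets: if enough other warps are ready to run, the reply to this
    packet can be delayed without stalling the SM, so grant one unit of slack.
    """
    if source_type == "CPU":
        return 0
    return 1 if ready_warps >= warp_threshold else 0

print(packet_slack("CPU"))                  # -> 0
print(packet_slack("GPU", ready_warps=24))  # -> 1
print(packet_slack("GPU", ready_warps=2))   # -> 0
```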

  18. Tile-Based Multicore System
  [Floorplan: tiles of CPU cores with L2 caches (C, L2), GPU SMs (G), and memory controllers (M, MEM), connected through routers (R)]

  19. Benchmarks
  • CPU: afi, ammp, art, equake, kmeans, scalparc
  • GPU: blackscholes, lps, lib, nn, bfs
  • Evaluate ALL 30 CPU+GPU combinations
  • For presentation purposes, classify:
  - CPU: 1) memory-bound or 2) computation-bound, based on L1 cache miss rate
  - GPU: 1) latency-tolerant or 2) latency-intolerant, based on slack cycles

  20. Benchmark Categorization
  • memory-bound CPU + latency-tolerant GPU
  • computation-bound CPU + latency-tolerant GPU
  • memory-bound CPU + latency-intolerant GPU
  • computation-bound CPU + latency-intolerant GPU
  [Quadrant chart: latency sensitivity (sensitive/insensitive) vs. throughput (high/low)]

  21. Network Energy Saving
  [Bar charts: network energy saving for (I) memory-bound CPU + latency-tolerant GPU, (II) computation-bound CPU + latency-tolerant GPU, (III) memory-bound CPU + latency-intolerant GPU, (IV) computation-bound CPU + latency-intolerant GPU]
  • Energy saving is significant on certain workloads

  22. Performance Impact (CPU)
  [Bar charts: CPU performance impact for the four workload classes (I)–(IV) defined on Slide 21]

  23. Performance Impact (GPU)
  [Bar charts: GPU performance impact for the four workload classes (I)–(IV) defined on Slide 21]
  • Performance penalty is minimal compared to DVFS

  24. Conclusion
  • Non-minimal path NoC:
  + balances on-chip workloads
  + reduces NoC energy
  [Quadrant chart: latency sensitivity (sensitive/insensitive) vs. throughput (high/low); the targeted workload mix is high-throughput and latency-insensitive]
  • Given the diverse traffic patterns in heterogeneous systems, non-minimal routing should be deployed judiciously

  25. Thank You!

  26. Exploiting Slack in GPU

  27. Exploiting Slack in GPU Predict slack based on # of available warps
