Energy-Efficient Speculative Threads: Dynamic Thread Allocation in Same-ISA Heterogeneous Multicore System. Yangchun Luo*, Venkatesan Packirisamy†, Wei-Chung Hsu‡, and Antonia Zhai*. *University of Minnesota – Twin Cities, †NVIDIA Corporation, ‡National Chiao Tung University, Taiwan.


Presentation Transcript


  1. Energy-Efficient Speculative Threads: Dynamic Thread Allocation in Same-ISA Heterogeneous Multicore System. Yangchun Luo*, Venkatesan Packirisamy†, Wei-Chung Hsu‡, and Antonia Zhai*. *University of Minnesota – Twin Cities, †NVIDIA Corporation, ‡National Chiao Tung University, Taiwan.

  2. Background: a traditional sequential program occupies only one core of a multicore processor (P0–P3 in the figure); exploiting a multicore requires thread-level parallelism.

  3. Speculative Parallelism
  • Traditional parallelization must prove that p != q before it can run Load *q in parallel with an earlier Store *p, so ambiguous dependences keep execution sequential.
  • Thread-Level Speculation (TLS) runs the threads in parallel anyway. If p != q, the parallel execution succeeds. If p == q, the speculative load has read a stale value (e.g., 20 instead of the 88 written by the earlier store), the dependence violation is detected, and the speculative thread is squashed and re-executed so that it loads 88.
  • Result: parallel execution and more potential parallelism than conservative parallelization (a small sketch follows).
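To make the success and failure cases concrete, here is a minimal C sketch; it models the squash-and-retry behavior in plain software, so the function and variable names are illustrative and do not come from the paper or any TLS hardware interface.

```c
/* Minimal sketch of the two cases on this slide, not the paper's hardware
 * mechanism: the speculative thread's Load *q runs ahead of the earlier
 * thread's Store *p, and it is squashed and re-executed only when the
 * pointers actually alias. */
#include <stdio.h>
#include <stdbool.h>

static int shared = 20;                  /* value at *q before the store */

static bool run_speculatively(int *p, int *q) {
    int loaded = *q;                     /* speculative Load *q (may be stale: 20) */
    *p = 88;                             /* earlier thread's Store *p commits      */
    bool violated = (p == q) && (loaded != *q);
    if (violated)
        loaded = *q;                     /* squash + re-execute: now loads 88      */
    printf("loaded %d: speculation %s\n", loaded,
           violated ? "failed, thread re-executed" : "succeeded in parallel");
    return !violated;
}

int main(void) {
    int other = 5;
    run_speculatively(&other, &shared);  /* p != q: more parallelism exploited */
    run_speculatively(&shared, &shared); /* p == q: true dependence, violation */
    return 0;
}
```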

  4. Speculation vs. Energy Efficiency
  • Successful speculation (p != q) improves performance, and the shorter execution reduces the duration over which the chip leaks.
  • Failed speculation (p == q) wastes dynamic power on squashed work and keeps more components powered up and leaking.
  • Can we exploit this performance without compromising energy efficiency?

  5. Impact of the Underlying Hardware
  • TLS energy efficiency depends on the hardware configuration [Packirisamy et al., ICCD'08].
  • An SMT architecture (threads sharing one core and one L1 cache) gives overall higher efficiency; a CMP architecture is better in some cases.

  6. Optimization Opportunities
  • Resource contention: failed threads compete with useful work on shared resources, which favors CMP.
  • Low instruction-level parallelism: TLS already exploits both ILP and TLP, so simpler cores suffice.
  • Unique cache access patterns of TLS: multiple caches are activated at once, so smaller caches can be used.

  7. Our Proposal: match program execution to the underlying hardware through on-chip heterogeneity and dynamic resource allocation.

  8. Same-ISA Heterogeneity: which components should be integrated? The design space has three dimensions (a configuration sketch follows):
  (1) Multi-threading execution mode: SMT or CMP, with no mixed mode.
  (2) Core computing power: issue width and SMT support.
  (3) L1 cache size: number of sets and associativity; the L2 cache size is not changed.
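To make the three dimensions concrete, the sketch below encodes them as a configuration record; the field names, value ranges, and the two example points are assumptions for illustration, not the paper's encoding.

```c
/* Illustrative configuration record for the three heterogeneity dimensions
 * on this slide; names and example values are assumptions. */
typedef enum { MODE_SMT, MODE_CMP } mt_mode_t;   /* (1) multithreading mode */

typedef struct {
    mt_mode_t mode;      /* SMT (threads share one wide core) vs. CMP        */
    int issue_width;     /* (2) core computing power, e.g. 2, 4, 6, or 8     */
    int smt_contexts;    /* hardware thread contexts per core (1 = non-SMT)  */
    int l1_sets;         /* (3) L1 size, varied by the number of sets ...    */
    int l1_assoc;        /* ... and by associativity; L2 size stays fixed    */
} core_config_t;

/* Two example points, assuming 64-byte lines: a 4-issue SMT core with a
 * 64 KB 4-way L1 (256 sets) and a 2-issue non-SMT core with a 16 KB L1. */
static const core_config_t wide_smt   = { MODE_SMT, 4, 4, 256, 4 };
static const core_config_t narrow_cmp = { MODE_CMP, 2, 1,  64, 4 };
```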

  9. Design Space Exploration: an unbounded heterogeneous multicore that contains every core type (1-issue through 8-issue, with and without SMT) and every L1 cache configuration (16K through 256K, 1-way through 8-way).
  • No power-on and power-off overheads
  • No cache warm-up cost

  10. Component Usage on the Unbounded Machine
  • Sequential segments: coverage by L1 cache is 16K-4way 65%, 32K-4way 5%, 64K-4way 25%; by issue width it is 2-issue 20%, 4-issue 60%, 6-issue 15%.
  • Parallel segments: CMP-based execution covers 15% and SMT-based 41%, spread across 2/4/6-issue cores and 16K/32K/64K caches.
  • Execution always favors the 4-way set-associative caches.

  11. Proposed Integration
  • One 4-issue SMT core combined with 2-issue non-SMT cores, all with 4-way set-associative L1 caches.
  • Each 64K 4-way L1 is resizable by the number of sets.
  • All cores share a unified Level 2 cache.
  How much improvement does this integration capture?

  12. Improvement Estimation over the Baseline
  • Assumes no overheads and oracle thread allocation.
  • Relative to the SMT baseline, the estimated gains (16%, 9%, and 7%) combine into a total improvement upper bound of 29%.

  13. Improvement Estimation
  • No overheads and oracle thread allocation.
  • Proposed integration: 29% improvement over SMT. Unbounded machine: 33% improvement over SMT.
  • Our proposal captures most of the benefit!

  14. Overhead Sources (a cost-model sketch follows)
  1. Startup overhead: a powered-off core must be powered on before it can run a thread, and static power is consumed while it stays on.
  2. Cache reconfiguration overhead: resizing from a bigger to a smaller size discards content and writes dirty lines back to the L2; resizing from a smaller to a bigger size changes the set mapping, so accesses start cold.
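A back-of-the-envelope accounting of these two overhead sources might look like the following sketch; the parameters and the linear cost model are assumptions for illustration, not the paper's model.

```c
/* Rough, assumed cost model for the two overhead sources on this slide. */
typedef struct {
    long power_on_cycles;       /* startup: waking a powered-off core        */
    long dirty_lines;           /* downsizing: L1 lines to write back to L2  */
    long writeback_cycles;      /* cost of one L1 -> L2 writeback            */
    long cold_misses;           /* upsizing: extra misses while sets re-warm */
    long miss_penalty_cycles;   /* cost of one cold miss                     */
} reconfig_cost_t;

static long estimate_reconfig_overhead(const reconfig_cost_t *c) {
    return c->power_on_cycles
         + c->dirty_lines * c->writeback_cycles
         + c->cold_misses * c->miss_penalty_cycles;
}
```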

  15. Overhead Impact
  • Still assuming oracle thread allocation, the overheads dominate: the ideal +29% of the heterogeneous design over the homogeneous one drops to about -80%.

  16. Overhead Mitigation (still with oracle thread allocation)
  • Benchmark statistics: speculative threads are fine-grained, averaging fewer than 300 instructions each, with overall coverage ≈75%.
  • Throttling mechanisms (sketched below): reduce the reconfiguration frequency, reconfigure only when the expected duration exceeds the overhead, and delay powering devices off.
  • Result: the +29% ideal that overheads had dragged down to -80% is recovered to a +13% improvement.
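The throttling checks could be expressed as in the sketch below; the structure, names, and thresholds are assumptions, since the slide does not specify how durations are predicted.

```c
#include <stdbool.h>

/* Assumed state for the two throttling checks described above. */
typedef struct {
    long reconfig_overhead_cycles;   /* estimated cost of switching            */
    long predicted_duration_cycles;  /* expected stay in the new configuration */
    long idle_cycles;                /* how long the device has been idle      */
    long power_off_delay_cycles;     /* hysteresis before powering off         */
} throttle_state_t;

/* Reconfigure only when the expected stay is long enough to amortize the
 * switching overhead (this is what reduces reconfiguration frequency). */
static bool should_reconfigure(const throttle_state_t *s) {
    return s->predicted_duration_cycles > s->reconfig_overhead_cycles;
}

/* Delay powering a component off so that short idle gaps do not pay the
 * startup overhead again. */
static bool should_power_off(const throttle_state_t *s) {
    return s->idle_cycles > s->power_off_delay_cycles;
}
```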

  17. Outline
  On-chip Heterogeneity
  • Design Space Exploration
  • Heterogeneous Integration
  • Overhead Impact
  • Overhead Mitigation
  Dynamic Resource Allocation

  18. Determining the Resource Configuration
  • The difficulty: the factors are intertwined (multithreading type, core issue width, L1 cache size).
  • Our solution: divide and conquer, deciding each dimension separately: SMT vs. CMP, 2-issue vs. 4-issue, and 64KB vs. 16KB.
  • Runtime monitoring: hardware performance counters plus a sampling run on the 4-issue SMT core with the 64K L1.

  19. Decision Making
  • The hardware performance monitor supplies: IPC, cycles the non-speculative thread stalls due to resource contention, instructions issued from the 2nd half of the ROB (few such instructions indicate low ILP, so a 2-issue core suffices), and the fraction of reused cache blocks (e.g., as low as 0.1%).
  • These counters lead to the right decisions (a decision sketch follows).
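An illustrative reading of how these counters could drive the divide-and-conquer decisions is sketched below; only the counter names come from the slide, while the thresholds and decision rules are assumptions.

```c
#include <stdbool.h>

/* Assumed counter set gathered during the sampling run on the
 * 4-issue SMT core with the 64K L1. */
typedef struct {
    double contention_stall_frac;  /* non-speculative thread stall cycles due
                                      to resource contention / total cycles   */
    double back_half_rob_frac;     /* instructions issued from the 2nd half
                                      of the ROB / all issued instructions    */
    double cache_reuse_frac;       /* reused L1 blocks / blocks brought in    */
} perf_counters_t;

typedef struct { bool use_cmp; int issue_width; int l1_kb; } decision_t;

static decision_t decide(const perf_counters_t *c) {
    decision_t d;
    /* (1) SMT vs. CMP: heavy contention on the shared core favors CMP.      */
    d.use_cmp = c->contention_stall_frac > 0.10;
    /* (2) Issue width: few issues from the back half of the ROB means low
       ILP, so a 2-issue core suffices.                                       */
    d.issue_width = (c->back_half_rob_frac < 0.05) ? 2 : 4;
    /* (3) L1 size: little block reuse (e.g. ~0.1%) means 16 KB is enough.    */
    d.l1_kb = (c->cache_reuse_frac < 0.01) ? 16 : 64;
    return d;
}
```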

  20. Comparisons: the sequential program on a single core (SEQ), the homogeneous SMT baseline, the unbounded heterogeneous machine, and the proposed heterogeneous design, with results normalized for comparison.

  21. Heterogeneous vs. Homogeneous
  • The heterogeneous design achieves 4% higher performance and 6% less energy than the SMT baseline, a 13% ED2P improvement (the unbounded machine reaches 33%).
  • Heterogeneity is beneficial.
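These numbers are mutually consistent under the standard definition of the energy-delay-squared product, which is assumed here to be the metric used in the talk:

$$ \mathrm{ED^2P} = E \cdot D^2, \qquad \frac{\mathrm{ED^2P}_{\mathrm{het}}}{\mathrm{ED^2P}_{\mathrm{SMT}}} = 0.94 \cdot \Bigl(\frac{1}{1.04}\Bigr)^{2} \approx 0.87, $$

i.e., roughly the 13% ED2P improvement reported on the slide.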

  22. Execution Mode Breakdown
  • Execution moves between configurations at run time: threads migrate between the 4-issue and 2-issue cores, and the L1 is resized between 16K and 64K.
  • No single configuration covers the whole execution: dynamic allocation is essential.

  23. Comparisons (revisited): the same setups as slide 20 (SEQ, the SMT baseline, the unbounded machine, and the proposed heterogeneous design), now examined against the sequential baseline SEQ.

  24. Heterogeneous vs. Sequential Baseline (SEQ)
  • The heterogeneous design delivers 38% higher performance at the cost of 7% more energy than SEQ, a 44% ED2P improvement.
  • It improves performance energy-efficiently.
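Under the same assumed ED2P definition as above, the 44% figure follows from the 38% performance gain and the 7% energy increase:

$$ \frac{\mathrm{ED^2P}_{\mathrm{het}}}{\mathrm{ED^2P}_{\mathrm{SEQ}}} = 1.07 \cdot \Bigl(\frac{1}{1.38}\Bigr)^{2} \approx 0.56, $$

that is, roughly a 44% ED2P reduction relative to SEQ.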

  25. Related Work
  • TLS energy efficiency vs. hardware configuration [Packirisamy, Luo, Zhai et al., ICCD'08]: CMP and SMT are favored differently, which motivates heterogeneous integration.
  • Energy-efficient TLS on a CMP [Renau et al., ICS'05]: ours matches threads with the hardware configuration; their techniques can complement our system.
  • Same-ISA heterogeneous multicore [Kumar et al., MICRO'03; Kumar et al., ISCA'04]: ours is different in targeting speculative threads at fine granularity with overhead mitigation.
  • Dynamic performance tuning for TLS [Luo, Packirisamy, Zhai et al., ISCA'09]: integrated here to extract efficient threads.

  26. Conclusion
  • Heterogeneous TLS delivers the performance of TLS multithreading while keeping power closer to that of a uniprocessor, through on-chip heterogeneity and dynamic resource allocation.
  • Evaluation summary: 44% better (ED2P) than the uniprocessor and 13% better than the homogeneous design.
