Energy-Efficient Speculative Threads: Dynamic Thread Allocation in Same-ISA Heterogeneous Multicore System. Yangchun Luo*, Venkatesan Packirisamy†, Wei-Chung Hsu‡, and Antonia Zhai*. *University of Minnesota – Twin Cities, †NVIDIA Corporation, ‡National Chiao Tung University, Taiwan.


Presentation Transcript


  1. Energy-Efficient Speculative Threads: Dynamic Thread Allocation in Same-ISA Heterogeneous Multicore System. Yangchun Luo*, Venkatesan Packirisamy†, Wei-Chung Hsu‡, and Antonia Zhai*. *University of Minnesota – Twin Cities, †NVIDIA Corporation, ‡National Chiao Tung University, Taiwan.

  2. Background: a traditional sequential program occupies only one core of a multicore processor (P0–P3 in the figure); exploiting a multicore requires thread-level parallelism.

  3. Speculative Parallelism
  • Traditional parallelization must prove that p != q before it can run Load *q in parallel with an earlier Store *p, so ambiguous dependences keep execution sequential.
  • Thread-Level Speculation (TLS) runs the threads in parallel anyway. If p != q, the parallel execution succeeds. If p == q, the speculative load has read a stale value (e.g., 20 instead of the 88 written by the earlier store), the dependence violation is detected, and the speculative thread is squashed and re-executed so that it loads 88.
  • Result: parallel execution and more potential parallelism than conservative parallelization (a small sketch follows).
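To make the success and failure cases concrete, here is a minimal C sketch; it models the squash-and-retry behavior in plain software, so the function and variable names are illustrative and do not come from the paper or any TLS hardware interface.

```c
/* Minimal sketch of the two cases on this slide, not the paper's hardware
 * mechanism: the speculative thread's Load *q runs ahead of the earlier
 * thread's Store *p, and it is squashed and re-executed only when the
 * pointers actually alias. */
#include <stdio.h>
#include <stdbool.h>

static int shared = 20;                  /* value at *q before the store */

static bool run_speculatively(int *p, int *q) {
    int loaded = *q;                     /* speculative Load *q (may be stale: 20) */
    *p = 88;                             /* earlier thread's Store *p commits      */
    bool violated = (p == q) && (loaded != *q);
    if (violated)
        loaded = *q;                     /* squash + re-execute: now loads 88      */
    printf("loaded %d: speculation %s\n", loaded,
           violated ? "failed, thread re-executed" : "succeeded in parallel");
    return !violated;
}

int main(void) {
    int other = 5;
    run_speculatively(&other, &shared);  /* p != q: more parallelism exploited */
    run_speculatively(&shared, &shared); /* p == q: true dependence, violation */
    return 0;
}
```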

  4. Speculation vs. Energy Efficiency
  • Successful speculation (p != q) improves performance, and the shorter execution reduces the duration over which the chip leaks.
  • Failed speculation (p == q) wastes dynamic power on squashed work and keeps more components powered up and leaking.
  • Can we exploit this performance without compromising energy efficiency?

  5. Impact of the Underlying Hardware
  • TLS energy efficiency depends on the hardware configuration [Packirisamy et al., ICCD'08].
  • An SMT architecture (threads sharing one core and one L1 cache) gives overall higher efficiency; a CMP architecture is better in some cases.

  6. Optimization Opportunities
  • Resource contention: failed threads compete with useful work on shared resources, which favors CMP.
  • Low instruction-level parallelism: TLS already exploits both ILP and TLP, so simpler cores suffice.
  • Unique cache access patterns of TLS: multiple caches are activated at once, so smaller caches can be used.

  7. Our Proposal: match program execution to the underlying hardware through on-chip heterogeneity and dynamic resource allocation.

  8. Same-ISA Heterogeneity: which components should be integrated? The design space has three dimensions (a configuration sketch follows):
  (1) Multi-threading execution mode: SMT or CMP, with no mixed mode.
  (2) Core computing power: issue width and SMT support.
  (3) L1 cache size: number of sets and associativity; the L2 cache size is not changed.
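To make the three dimensions concrete, the sketch below encodes them as a configuration record; the field names, value ranges, and the two example points are assumptions for illustration, not the paper's encoding.

```c
/* Illustrative configuration record for the three heterogeneity dimensions
 * on this slide; names and example values are assumptions. */
typedef enum { MODE_SMT, MODE_CMP } mt_mode_t;   /* (1) multithreading mode */

typedef struct {
    mt_mode_t mode;      /* SMT (threads share one wide core) vs. CMP        */
    int issue_width;     /* (2) core computing power, e.g. 2, 4, 6, or 8     */
    int smt_contexts;    /* hardware thread contexts per core (1 = non-SMT)  */
    int l1_sets;         /* (3) L1 size, varied by the number of sets ...    */
    int l1_assoc;        /* ... and by associativity; L2 size stays fixed    */
} core_config_t;

/* Two example points, assuming 64-byte lines: a 4-issue SMT core with a
 * 64 KB 4-way L1 (256 sets) and a 2-issue non-SMT core with a 16 KB L1. */
static const core_config_t wide_smt   = { MODE_SMT, 4, 4, 256, 4 };
static const core_config_t narrow_cmp = { MODE_CMP, 2, 1,  64, 4 };
```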

  9. Design Space Exploration: an unbounded heterogeneous multicore that contains every core type (1-issue through 8-issue, with and without SMT) and every L1 cache configuration (16K through 256K, 1-way through 8-way).
  • No power-on and power-off overheads
  • No cache warm-up cost

  10. Component Usage on the Unbounded Machine
  • Sequential segments: coverage by L1 cache is 16K-4way 65%, 32K-4way 5%, 64K-4way 25%; by issue width it is 2-issue 20%, 4-issue 60%, 6-issue 15%.
  • Parallel segments: CMP-based execution covers 15% and SMT-based 41%, spread across 2/4/6-issue cores and 16K/32K/64K caches.
  • Execution always favors the 4-way set-associative caches.

  11. Proposed Integration
  • One 4-issue SMT core combined with 2-issue non-SMT cores, all with 4-way set-associative L1 caches.
  • Each 64K 4-way L1 is resizable by the number of sets.
  • All cores share a unified Level 2 cache.
  How much improvement does this integration capture?

  12. Improvement Estimation over the Baseline
  • Assumes no overheads and oracle thread allocation.
  • Relative to the SMT baseline, the estimated gains (16%, 9%, and 7%) combine into a total improvement upper bound of 29%.

  13. Improvement Estimation
  • No overheads and oracle thread allocation.
  • Proposed integration: 29% improvement over SMT. Unbounded machine: 33% improvement over SMT.
  • Our proposal captures most of the benefit!

  14. Overhead Sources (a cost-model sketch follows)
  1. Startup overhead: a powered-off core must be powered on before it can run a thread, and static power is consumed while it stays on.
  2. Cache reconfiguration overhead: resizing from a bigger to a smaller size discards content and writes dirty lines back to the L2; resizing from a smaller to a bigger size changes the set mapping, so accesses start cold.
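A back-of-the-envelope accounting of these two overhead sources might look like the following sketch; the parameters and the linear cost model are assumptions for illustration, not the paper's model.

```c
/* Rough, assumed cost model for the two overhead sources on this slide. */
typedef struct {
    long power_on_cycles;       /* startup: waking a powered-off core        */
    long dirty_lines;           /* downsizing: L1 lines to write back to L2  */
    long writeback_cycles;      /* cost of one L1 -> L2 writeback            */
    long cold_misses;           /* upsizing: extra misses while sets re-warm */
    long miss_penalty_cycles;   /* cost of one cold miss                     */
} reconfig_cost_t;

static long estimate_reconfig_overhead(const reconfig_cost_t *c) {
    return c->power_on_cycles
         + c->dirty_lines * c->writeback_cycles
         + c->cold_misses * c->miss_penalty_cycles;
}
```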

  15. Overhead Impact
  • Still assuming oracle thread allocation, the overheads dominate: the ideal +29% of the heterogeneous design over the homogeneous one drops to about -80%.

  16. Overhead Mitigation (still with oracle thread allocation)
  • Benchmark statistics: speculative threads are fine-grained, averaging fewer than 300 instructions each, with overall coverage ≈75%.
  • Throttling mechanisms (sketched below): reduce the reconfiguration frequency, reconfigure only when the expected duration exceeds the overhead, and delay powering devices off.
  • Result: the +29% ideal that overheads had dragged down to -80% is recovered to a +13% improvement.
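The throttling checks could be expressed as in the sketch below; the structure, names, and thresholds are assumptions, since the slide does not specify how durations are predicted.

```c
#include <stdbool.h>

/* Assumed state for the two throttling checks described above. */
typedef struct {
    long reconfig_overhead_cycles;   /* estimated cost of switching            */
    long predicted_duration_cycles;  /* expected stay in the new configuration */
    long idle_cycles;                /* how long the device has been idle      */
    long power_off_delay_cycles;     /* hysteresis before powering off         */
} throttle_state_t;

/* Reconfigure only when the expected stay is long enough to amortize the
 * switching overhead (this is what reduces reconfiguration frequency). */
static bool should_reconfigure(const throttle_state_t *s) {
    return s->predicted_duration_cycles > s->reconfig_overhead_cycles;
}

/* Delay powering a component off so that short idle gaps do not pay the
 * startup overhead again. */
static bool should_power_off(const throttle_state_t *s) {
    return s->idle_cycles > s->power_off_delay_cycles;
}
```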

  17. Outline
  On-chip Heterogeneity
  • Design Space Exploration
  • Heterogeneous Integration
  • Overhead Impact
  • Overhead Mitigation
  Dynamic Resource Allocation

  18. Determining the Resource Configuration
  • The difficulty: the factors are intertwined (multithreading type, core issue width, L1 cache size).
  • Our solution: divide and conquer, deciding each dimension separately: SMT vs. CMP, 2-issue vs. 4-issue, and 64KB vs. 16KB.
  • Runtime monitoring: hardware performance counters plus a sampling run on the 4-issue SMT core with the 64K L1.

  19. Decision Making
  • The hardware performance monitor supplies: IPC, cycles the non-speculative thread stalls due to resource contention, instructions issued from the 2nd half of the ROB (few such instructions indicate low ILP, so a 2-issue core suffices), and the fraction of reused cache blocks (e.g., as low as 0.1%).
  • These counters lead to the right decisions (a decision sketch follows).
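An illustrative reading of how these counters could drive the divide-and-conquer decisions is sketched below; only the counter names come from the slide, while the thresholds and decision rules are assumptions.

```c
#include <stdbool.h>

/* Assumed counter set gathered during the sampling run on the
 * 4-issue SMT core with the 64K L1. */
typedef struct {
    double contention_stall_frac;  /* non-speculative thread stall cycles due
                                      to resource contention / total cycles   */
    double back_half_rob_frac;     /* instructions issued from the 2nd half
                                      of the ROB / all issued instructions    */
    double cache_reuse_frac;       /* reused L1 blocks / blocks brought in    */
} perf_counters_t;

typedef struct { bool use_cmp; int issue_width; int l1_kb; } decision_t;

static decision_t decide(const perf_counters_t *c) {
    decision_t d;
    /* (1) SMT vs. CMP: heavy contention on the shared core favors CMP.      */
    d.use_cmp = c->contention_stall_frac > 0.10;
    /* (2) Issue width: few issues from the back half of the ROB means low
       ILP, so a 2-issue core suffices.                                       */
    d.issue_width = (c->back_half_rob_frac < 0.05) ? 2 : 4;
    /* (3) L1 size: little block reuse (e.g. ~0.1%) means 16 KB is enough.    */
    d.l1_kb = (c->cache_reuse_frac < 0.01) ? 16 : 64;
    return d;
}
```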

  20. Comparisons: the sequential program on a single core (SEQ), the homogeneous SMT baseline, the unbounded heterogeneous machine, and the proposed heterogeneous design, with results normalized for comparison.

  21. Heterogeneous vs. Homogeneous
  • The heterogeneous design achieves 4% higher performance and 6% less energy than the SMT baseline, a 13% ED2P improvement (the unbounded machine reaches 33%).
  • Heterogeneity is beneficial.
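These numbers are mutually consistent under the standard definition of the energy-delay-squared product, which is assumed here to be the metric used in the talk:

$$ \mathrm{ED^2P} = E \cdot D^2, \qquad \frac{\mathrm{ED^2P}_{\mathrm{het}}}{\mathrm{ED^2P}_{\mathrm{SMT}}} = 0.94 \cdot \Bigl(\frac{1}{1.04}\Bigr)^{2} \approx 0.87, $$

i.e., roughly the 13% ED2P improvement reported on the slide.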

  22. Execution Mode Breakdown
  • Execution moves between configurations at run time: threads migrate between the 4-issue and 2-issue cores, and the L1 is resized between 16K and 64K.
  • No single configuration covers the whole execution: dynamic allocation is essential.

  23. Comparisons (revisited): the same setups as slide 20 (SEQ, the SMT baseline, the unbounded machine, and the proposed heterogeneous design), now examined against the sequential baseline SEQ.

  24. Heterogeneous vs. Sequential Baseline (SEQ)
  • The heterogeneous design delivers 38% higher performance at the cost of 7% more energy than SEQ, a 44% ED2P improvement.
  • It improves performance energy-efficiently.
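Under the same assumed ED2P definition as above, the 44% figure follows from the 38% performance gain and the 7% energy increase:

$$ \frac{\mathrm{ED^2P}_{\mathrm{het}}}{\mathrm{ED^2P}_{\mathrm{SEQ}}} = 1.07 \cdot \Bigl(\frac{1}{1.38}\Bigr)^{2} \approx 0.56, $$

that is, roughly a 44% ED2P reduction relative to SEQ.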

  25. Related Work
  • TLS energy efficiency vs. hardware configuration [Packirisamy, Luo, Zhai et al., ICCD'08]: CMP and SMT are favored differently, which motivates heterogeneous integration.
  • Energy-efficient TLS on a CMP [Renau et al., ICS'05]: ours matches threads with the hardware configuration; their techniques can complement our system.
  • Same-ISA heterogeneous multicore [Kumar et al., MICRO'03; Kumar et al., ISCA'04]: ours is different in targeting speculative threads at fine granularity with overhead mitigation.
  • Dynamic performance tuning for TLS [Luo, Packirisamy, Zhai et al., ISCA'09]: integrated here to extract efficient threads.

  26. Conclusion
  • Heterogeneous TLS delivers the performance of TLS multithreading while keeping power closer to that of a uniprocessor, through on-chip heterogeneity and dynamic resource allocation.
  • Evaluation summary: 44% better (ED2P) than the uniprocessor and 13% better than the homogeneous design.
