
Dynamic Performance Tuning for Speculative Threads

Dynamic Performance Tuning for Speculative Threads. Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre†, Ankit Tarkas†, Wei-Chung Hsu, and Antonia Zhai. Dept. of Computer Science and Engineering, †Dept. of Electrical and Computer Engineering, University of Minnesota – Twin Cities.


Presentation Transcript


  1. Dynamic Performance Tuning for Speculative Threads
  Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre†, Ankit Tarkas†, Wei-Chung Hsu, and Antonia Zhai
  Dept. of Computer Science and Engineering, †Dept. of Electrical and Computer Engineering
  University of Minnesota – Twin Cities

  2. Motivation
  Multicore processors have brought forth a large potential for computing power. [Figure: Intel Core i7 die photo]
  Exploiting this potential demands thread-level parallelism.

  3. Exploiting Thread-Level Parallelism
  [Figure: execution timelines comparing sequential execution, traditional parallelization, and TLS for a loop containing "Load *q" and "Store *p"]
  Traditional parallelization must prove that p != q; when it cannot, the compiler gives up and the code runs sequentially. Thread-Level Speculation (TLS) runs the threads in parallel assuming p != q. If p == q at runtime, a cross-thread dependence is violated, causing a speculation failure and re-execution. Parallel execution becomes unpredictable, but speculation potentially exposes more parallelism.
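The speculation/squash cycle described above can be modeled in software. The following is an illustrative toy sketch (our own construction, not the hardware mechanism from the talk): every iteration runs against a snapshot of memory, then iterations commit in order, and any iteration that read a location an earlier iteration turned out to write is squashed and re-executed.

```python
# Toy software model of Thread-Level Speculation (illustrative only).

def execute(iteration, mem):
    """Run one iteration, recording its read set and buffered writes."""
    reads = set()
    def load(addr):
        reads.add(addr)
        return mem.get(addr, 0)
    writes = iteration(load)              # iteration returns {addr: value}
    return reads, writes

def run_tls(memory, iterations):
    """Return (final memory, number of speculation failures)."""
    snapshot = dict(memory)
    # Optimistic pass: every iteration runs against the same snapshot.
    speculative = [execute(it, snapshot) for it in iterations]
    squashes, committed = 0, set()
    for it, (reads, writes) in zip(iterations, speculative):
        if reads & committed:             # read a location a predecessor wrote
            squashes += 1                 # speculation failure: re-execute
            _, writes = execute(it, memory)
        memory.update(writes)             # commit in original program order
        committed |= set(writes)
    return memory, squashes
```

For example, if the second iteration loads the address the first one stores (the p == q case above), it is squashed and re-executed against committed state; with disjoint addresses (p != q), both iterations commit without failure.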

  4. Addressing Unpredictability
  State-of-the-art approaches:
  • Profiling-directed optimization
  • Compiler analysis
  [Figure: a typical compilation framework: source code goes through region selection, parallel code generation, and speculative code optimization to produce machine code; a profiling run feeds profiling information back into these steps]

  5. Problem #1
  [Figure: timeline of speculative threads 0–4 illustrating TLS overheads: thread spawning, inter-thread synchronization (wait/signal), speculation failure, waiting to be allowed to commit, and load imbalance]
  It is hard to model TLS overhead statically, even with profiling.

  6. Problem #2
  [Figure: the training input yields profiled behavior that differs from the actual input's behavior]
  A profile-based decision may not be appropriate for the actual input.

  7. Problem #3
  [Figure: a performance metric shifting across phase 1, phase 2, and phase 3 over time]
  The behavior of the same code changes over time.

  8. Problem #4
  [Figure: sequential execution with one L1 cache vs. parallel execution with per-core L1 caches]
  Parallel execution changes cache behavior in several ways:
  • Extra misses: a load (e.g., of 0x22) that hits once under sequential execution can miss separately in each core's private L1
  • Prefetching: a speculative thread's loads can bring lines into the cache before they are needed again
  • Pollution: mis-speculated loads can evict useful lines
  How important are these cache impacts?

  9. Impact on Cache Performance
  [Figure: a major loop from the benchmark art (scanner.c:594), where a load of *p in a squashed execution prefetches the line for a later load of *p' with p == p'; the loop shows a 2.5x speedup, with prefetched accesses around 60%]
  Cache impact is difficult to model statically.

  10. Our Proposal
  Proposed performance tuning framework:
  • Accurate timing and cache impacts
  • Adaptation to inputs and phase changes
  [Figure: the framework extends the compilation flow (region selection, parallel code generation, speculative code optimization, compiler annotation in the machine code) with runtime analysis: the dynamic execution supplies runtime information to a performance evaluation step, which feeds a performance profile table consulted by the tuning policy]

  11. Parallelization Decisions at Runtime
  [Figure: at each thread-spawning point, a performance profile table decides whether to parallelize (spawn speculative threads) or serialize; at the end of a parallel execution, its performance is evaluated and the table is updated]
  The new execution model facilitates dynamic tuning.
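The decision loop on this slide can be sketched as follows. This is a hypothetical, simplified rendering (class and method names are ours): a table keyed by speculative region answers parallelize/serialize at each spawn point, and each finished parallel execution updates the entry by comparing measured parallel cycles against predicted sequential cycles.

```python
# Hypothetical sketch of the runtime execution model: a performance
# profile table decides, per region, whether to spawn speculative threads.

class PerformanceProfileTable:
    def __init__(self):
        self.decisions = {}   # region id -> 'parallelize' | 'serialize'

    def should_spawn(self, region):
        # Regions not yet evaluated default to parallel so they get measured.
        return self.decisions.get(region, 'parallelize') == 'parallelize'

    def evaluate(self, region, parallel_cycles, predicted_sequential_cycles):
        # End of a parallel execution: keep parallelizing only if it paid off.
        beneficial = parallel_cycles < predicted_sequential_cycles
        self.decisions[region] = 'parallelize' if beneficial else 'serialize'
```

A region that turns out slower in parallel is serialized on later encounters, which is what lets the model adapt to actual inputs and phase changes.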

  12. Evaluating Performance Impact
  [Figure: the runtime-analysis loop: dynamic execution → runtime information → performance evaluation → performance profile table → tuning policy]
  • Simple metric: cost of speculation failure
  • Comprehensive evaluation: sequential performance prediction

  13. Sequential Performance Prediction
  [Figure: a programmable cycle classifier, driven by hardware counters and information from the ROB, classifies each cycle of speculative parallel execution as Busy, iFetch, ExeStall, Cache, or TLS overhead; these categories are used to predict the performance of the corresponding sequential execution]
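The prediction step can be sketched numerically. We assume here, purely for illustration, that the classifier reports per-category cycle counts and that TLS-overhead cycles simply vanish in the sequential version while busy, fetch, stall, and (suitably adjusted) cache cycles carry over:

```python
# Illustrative sketch: predict sequential cycles from classified
# speculative-parallel cycles by dropping the TLS-only category.

def predict_sequential_cycles(classified):
    """classified: {'busy': n, 'ifetch': n, 'exestall': n,
                    'cache': n, 'tls_overhead': n}"""
    carried_over = ('busy', 'ifetch', 'exestall', 'cache')
    return sum(classified[c] for c in carried_over)

def tls_beneficial(parallel_cycles, classified):
    # Parallel execution pays off if it beats the predicted sequential time.
    return parallel_cycles < predict_sequential_cycles(classified)
```

The cache category is the subtle part: as the next two slides show, not every speculative miss would have been a miss sequentially, so the raw cache count must be adjusted before it carries over.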

  14. Cache Behavior: Scenario #1
  [Figure: under TLS, thread T0 loads p (unmodified), misses, and is squashed; when T0 re-executes, the load hits. Under sequential execution, the load would miss.]
  Rule: count the miss from the squashed execution, since the sequential run would have missed too.

  15. Cache Behavior: Scenario #2
  [Figure: under TLS, T0's load of p misses, it stores p, the line is invalidated, and the re-executed load misses again; naively the prediction would count two misses, while sequential execution incurs only one.]
  Rule: do not count a miss when the tag matches but the line is in invalid state.
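The two accounting rules from scenarios #1 and #2 can be combined into one predicate. The field names below are our own illustrative encoding: misses observed during a squashed execution still count (the sequential run would have missed too), while a miss on a line whose tag matches but which was merely invalidated by speculative coherence traffic does not (the sequential run would have hit).

```python
# Illustrative sketch of the cache-miss accounting rules for
# sequential performance prediction (field names are hypothetical).

def miss_counts_toward_sequential(tag_matched, line_state, in_squashed_run):
    if tag_matched and line_state == 'invalid':
        return False          # scenario #2: sequential execution would hit
    return True               # scenario #1 included: squashed-run misses count

def predicted_sequential_misses(misses):
    """misses: list of dicts with the three fields above."""
    return sum(1 for m in misses if miss_counts_toward_sequential(**m))
```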

  16. Tuning the Performance
  [Figure: the tuning policy within the runtime-analysis loop]
  • Exhaustive search
  • Prioritized search

  17. How to Prioritize the Search
  Search order:
  • Inside-out
  • Compiler suggestion
  • Outside-in
  Performance summary metric:
  • Speedup or not?
  • Improvement in cycles
  [Figure: design space plotting search order against summary metric: from the simple combination (speedup-or-not, inside-out) toward more aggressive ones (quantitative improvement in cycles, plus static compiler hints), with overly aggressive choices risking suboptimal decisions]
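A minimal sketch of the prioritized search order, under our own simplifying assumptions: the candidate loop levels of a nest are given innermost-first, inside-out search simply walks that list, and a compiler suggestion, when present, is hoisted to the front so it is evaluated first.

```python
# Illustrative sketch: order the candidate loop levels for tuning.

def prioritized_order(nest_inside_out, compiler_suggestion=None):
    """nest_inside_out: loop ids listed innermost-first (inside-out order).
    A compiler-suggested loop, if it names a level in the nest, is tried
    before the default inside-out walk."""
    order = list(nest_inside_out)
    if compiler_suggestion in order:
        order.remove(compiler_suggestion)
        order.insert(0, compiler_suggestion)
    return order
```

The point of the hint is to steer the quantitative search away from committing early to an inner loop that looks locally profitable but is suboptimal compared to an outer level.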

  18. Evaluation Infrastructure
  Benchmarks:
  • SPEC2000 benchmarks written in C, compiled with -O3 optimization
  Underlying architecture:
  • 4-core chip multiprocessor (CMP)
  • Speculation supported by the cache coherence mechanism
  Simulator:
  • Superscalar cores with a detailed memory model
  • Simulates communication latency
  • Models bandwidth and contention
  [Figure: four processors with private caches connected through an interconnect]
  Detailed, cycle-accurate simulation.

  19. Comparing Tuning Policies
  [Figure: speedups of the tuning policies over sequential execution — 1.37x, 1.23x, and 1.17x — alongside the parallel-code overhead]

  20. Profile-Based vs. Dynamic
  [Figure: speedup bars — 1.46x, 1.27x, and 1.25x]
  Dynamic tuning outperformed static, profile-based decisions by roughly 10% (1.37x vs. 1.25x), and could potentially reach roughly 15% (1.46x vs. 1.27x).

  21. Related Work: Region Selection
  Statically (lack of adaptability):
  • Compiler heuristics [Vijaykumar, MICRO'98]
  • Balanced min-cut [Johnson, PLDI'04]
  • Simple profiling pass (POSH) [Liu, PPoPP'06]
  • Extensive profiling [Wang, LCPC'05]
  • Profile-based search [Johnson, PPoPP'07]
  Dynamically (lack of high-level information):
  • Runtime loop detection [Tubella, HPCA'98] [Marcuello, ICS'98]
  • Saving power from useless re-spawning [Renau, ICS'05]
  This work: dynamic performance tuning [Luo, ISCA'09]

  22. Conclusions
  Deciding how to extract speculative threads at runtime:
  • Quantitatively summarize the performance profile
  • Compiler hints help avoid suboptimal decisions
  Compiler and hardware working together achieve:
  • 37% speedup over sequential execution
  • ≈10% speedup over profile-based decisions
  Configurable hardware performance monitors (HPMs) enable efficient dynamic optimizations.
  Dynamic performance tuning can improve TLS efficiency.
