
Handling Branches in TLS Systems with Multi-Path Execution

Polychronis Xekalakis and Marcelo Cintra, University of Edinburgh. http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA


Presentation Transcript


  1. Handling Branches in TLS Systems with Multi-Path Execution • Polychronis Xekalakis and Marcelo Cintra • University of Edinburgh • http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA

  2. Introduction • Power efficiency, complexity, and time-to-market reasons lead to CMPs • Many simple cores = high TLP but low ILP • OK for throughput computing and embarrassingly parallel applications • Problem: • No benefits for sequential applications • Even for mostly parallel applications, Amdahl’s Law limits performance gains with many cores • Solution: Speculative Multithreading (SM) HPCA 2010
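
The Amdahl's Law point on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the example fractions are assumptions, not values from the talk.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's Law for a program whose
    parallel_fraction (0..1) scales perfectly across `cores`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical application that is 90% parallel.
    for n in (1, 4, 16, 64):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With a 90% parallel program the speedup can never reach 10x regardless of core count, which is the gap that speeding up the sequential part is meant to close.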

  3. Speculative Multithreading • Basic Idea: Use idle cores/contexts to speculate on future application needs • TLS: speculatively execute parallel threads • HT/RA: speculatively perform future memory operations • MP: speculatively execute along multiple branch targets • No single SM model works best at all times • Hardware infrastructure is very similar • Our Idea: Combine SM models and seamlessly exploit (speculative) TLP and/or ILP • In this work: TLS + MP • (for TLS + HT/RA see [ICS’09]) ICS 2009

  4. Key Contributions • Analyze branch prediction for TLS Systems • Propose a mixed execution model that combines TLS with MP execution • We show that TLS allows MP to be more aggressive • Our approach outperforms state-of-the-art SM models: • TLS by 9.2% avg. (up to 23.2%) • MP by 28.2% avg. (up to 138%)

  5. Outline • Introduction • Speculative Multithreaded Models • Analysis of Branch Prediction in TLS • Mixed Execution Model • Experimental Setup and Results • Conclusions

  6. Thread Level Speculation • Compiler deals with: • Task selection • Code generation • HW deals with: • Different context • Spawn threads • Detecting violations • Replaying • Arbitrate commit [Figure: Thread 1 and Thread 2 on a time axis; label: Speculative]

  7. Thread Level Speculation • Benefit: TLP/ILP • TLP (Overlapped Execution) • ILP (Prefetching) [Figure: two panels, Overlapped Execution and Prefetching, each showing Thread 1 and Thread 2 on a time axis; label: Speculative]

  8. MultiPath Execution • Compiler deals with: • Nothing • HW deals with: • Different context • When to do MP • Discard wrong path [Figure: main thread on a time axis; labels: MP Mode, Correct Paths, Wrong Paths]

  9. MultiPath Execution • Benefit: • ILP (Branch Pred.) [Figure: main thread on a time axis; labels: Correct Paths, Wrong Paths, Branch Misp. Cost]

  10. Outline • Introduction • Speculative Multithreaded Models • Analysis of Branch Prediction in TLS • Mixed Execution Model • Experimental Setup and Results • Conclusions

  11. Impact of Branch Prediction on TLS • TLS emulates a wider processor: • Removing mispredictions is important (Amdahl)

  12. Branch Entropy for TLS • Prediction is much harder for TLS: • History partitioning • History reordering

  13. Increasing the Size of the Branch Predictor • Aliasing is not much of a problem • The fundamental limitation is the lack of history

  14. Designing a Better Predictor • Predictors that exploit longer histories are not necessarily better

  15. Outline • Introduction • Speculative Multithreaded Models • Analysis of Branch Prediction in TLS • Mixed Execution Model • Experimental Setup and Results • Conclusions

  16. Mixed Execution Model • When resources are idle: • Try MP on top of TLS • Map TLS threads on empty cores • Map MP threads on empty contexts (same core) • Minimal extra HW: • Branch confidence estimator • MP bit: thread is in MP mode • PATHS: how many branches are outstanding • DIR: which path the thread followed

  17. Combined TLS/MP Model [Figure: Thread 1 and Thread 2 on a time axis; label: Speculative]

  18. Combined TLS/MP Model • Thread 1: MP: 0, PATHS: 000, DIR: 000 [Figure: Thread 1 and Thread 2 on a time axis; labels: Speculative, Low Confidence Branch]

  19. Combined TLS/MP Model • Thread 1a: MP: 1, PATHS: 001, DIR: 000 • Thread 1b: MP: 1, PATHS: 001, DIR: 001 [Figure: Thread 1a, Thread 1b and Thread 2 on a time axis; labels: Speculative, Multi-Path Mode]

  20. Combined TLS/MP Model • Thread 1a: MP: 1, PATHS: 001, DIR: 000 • Thread 1b: MP: 0, PATHS: 000, DIR: 000 [Figure: Thread 1a, Thread 1b and Thread 2 on a time axis; labels: Speculative, Branch Resolved]
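
The MP / PATHS / DIR values shown across slides 18 to 20 can be reproduced with a small state model. This is only a sketch under stated assumptions: the slides give the three fields and their values for a single fork, while the encoding used here (PATHS as a count, DIR as one bit per outstanding branch, the oldest branch in bit 0) and the names `PathState`, `fork` and `resolve` are hypothetical. It also simply drops the wrong-path thread and does not model the delayed restarts, kills and commits mentioned in the talk.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class PathState:
    mp: int = 0         # MP bit: 1 while the thread is in multi-path mode
    paths: int = 0      # PATHS: number of outstanding forked branches
    direction: int = 0  # DIR: assumed one bit per outstanding branch


def fork(state: PathState) -> Tuple[PathState, PathState]:
    """Split a thread at a low-confidence branch into two path threads."""
    bit = 1 << state.paths  # DIR bit assigned to the new branch (assumption)
    first = replace(state, mp=1, paths=state.paths + 1)
    second = replace(first, direction=state.direction | bit)
    return first, second


def resolve(threads: List[PathState], outcome: int) -> List[PathState]:
    """Resolve the oldest outstanding branch and keep matching paths only."""
    survivors = []
    for t in threads:
        if (t.direction & 1) != outcome:
            continue  # wrong path: discarded in this sketch
        remaining = t.paths - 1
        survivors.append(
            PathState(mp=int(remaining > 0), paths=remaining,
                      direction=t.direction >> 1)
        )
    return survivors


if __name__ == "__main__":
    thread_1 = PathState()              # MP: 0, PATHS: 000, DIR: 000
    thread_1a, thread_1b = fork(thread_1)
    print(thread_1a, thread_1b)         # both MP: 1, PATHS: 1; DIR 0 and 1
    print(resolve([thread_1a, thread_1b], outcome=1))
```

Forking the initial state yields the two states of slide 19, and resolving in favor of the second path returns Thread 1b to the all-zero state of slide 20.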

  21. Intricacies to be Handled • How do we map TLS/MP threads? • Different mapping policies for TLS threads • Dealing with thread ordering • Correct data forwarding • Dealing with violations • While in “MP-Mode” delay restarts/kills/commits • No squashes on the wrong path • Thread spawning: • Delayed as well – keep contention low

  22. Outline • Introduction • Speculative Multithreaded Models • Analysis of Branch Prediction in TLS • Mixed Execution Model • Experimental Setup and Results • Conclusions

  23. Experimental Setup • Simulator, Compiler and Benchmarks: • SESC (http://sesc.sourceforge.net/) • POSH (Liu et al., PPoPP '06) • SPEC 2000 Int. • Architecture: • Four-way CMP, 4-issue cores, 6 contexts/core • 32 Kbit OGEHL, 1 KByte BTB, 32-entry RAS • 8 Kbit enhanced JRS confidence estimator • 32 KB L1 data (multi-versioned) and instruction caches • 1 MB unified L2 caches
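
For readers unfamiliar with the confidence estimator listed above, the following is a minimal sketch of a resetting-counter estimator in the style of JRS. It is not the configuration used in the talk: the table size, counter width, threshold and PC-XOR-history indexing are assumptions chosen for illustration (2048 four-bit counters would add up to 8 Kbit), and the "enhanced" variant's exact indexing is not described on the slides.

```python
class ResettingConfidenceEstimator:
    """Table of saturating counters that reset on a misprediction."""

    def __init__(self, entries: int = 2048, counter_bits: int = 4,
                 threshold: int = 15) -> None:
        if entries < 1 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * entries

    def _index(self, pc: int, history: int) -> int:
        # Assumed indexing: branch PC XOR global history, folded to table size.
        return (pc ^ history) & self.mask

    def is_high_confidence(self, pc: int, history: int) -> bool:
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc: int, history: int, prediction_correct: bool) -> None:
        i = self._index(pc, history)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0  # one misprediction wipes the accumulated confidence


if __name__ == "__main__":
    ce = ResettingConfidenceEstimator(entries=16, counter_bits=2, threshold=3)
    for _ in range(3):
        ce.update(pc=0x40, history=0b1011, prediction_correct=True)
    print(ce.is_high_confidence(0x40, 0b1011))
```

In a combined TLS/MP design, a branch reported as low confidence would be the trigger for forking a second path, so the threshold trades MP coverage against wasted contexts.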

  24. Comparing TLS, MP and Combined TLS/MP

  25. Comparing TLS, MP and Combined TLS/MP • Additive benefits; no point in doubling the predictor

  26. Comparing TLS, MP and Combined TLS/MP • Additive benefits; no point in doubling the predictor • 9.2% over TLS, 28.2% over MP

  27. Pipeline Flushes • Significant reduction in pipeline flushes • Larger than with base MP!

  28. Outline • Introduction • Speculative Multithreaded Models • Analysis of Branch Prediction in TLS • Mixed Execution Model • Experimental Setup and Results • Conclusions

  29. Also in the Paper … • Detailed HW description • Impact of scheduling • Limiting MP to DP (dual-path) • Effect of scaling • Effect of a better CE (confidence estimator)

  30. Conclusions • CMPs are here to stay: • What about single-threaded apps and apps with significant sequential sections? • We advocate the use of speculative multithreading • Analyzed branch prediction for modern TLS systems • Proposed a new mixed execution model • TLS is nicely complemented by MP • Unified scheme outperforms existing SM models: • TLS by 9.2% avg. (up to 23.2%) • MP by 28.2% avg. (up to 138%)

  31. Handling Branches in TLS Systems with Multi-Path Execution • Polychronis Xekalakis and Marcelo Cintra • University of Edinburgh • http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA

  32. Backup Slides

  33. Prediction Stats

  34. Performance Model • $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$ • Compute overall speedup ($S_{all}$): $T_{seq}/T_{mt}$

  35. Performance Model • $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$ • Compute overall speedup ($S_{all}$) • Compute sequential TLS speedup ($S_{seq}$): $T_{seq}/T_{1p}$

  36. Performance Model • $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$ • Compute overall speedup ($S_{all}$) • Compute sequential TLS speedup ($S_{seq}$) • Compute speedup due to ILP ($S_{ilp}$): $(T_1+T_2)/(T_1'+T_2')$

  37. Performance Model • $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$ • Compute overall speedup ($S_{all}$) • Compute sequential TLS speedup ($S_{seq}$) • Compute speedup due to ILP ($S_{ilp}$) • Use everything to compute TLP ($S_{ovl}$): $S_{all}/(S_{seq} \times S_{ilp})$
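
The four steps of the performance model can be walked through numerically. The sketch below only mirrors the ratios given on the slides; the slides do not define the individual timings (Tseq, Tmt, T1p, T1, T2, T1', T2'), so the parameter names simply follow those symbols, and the function name and all numbers are hypothetical.

```python
from typing import Dict, Sequence


def speedup_breakdown(t_seq: float, t_1p: float, t_mt: float,
                      thread_times: Sequence[float],
                      thread_times_prime: Sequence[float]) -> Dict[str, float]:
    """Split the overall speedup into the three factors of the model."""
    s_all = t_seq / t_mt                                 # Sall = Tseq / Tmt
    s_seq = t_seq / t_1p                                 # Sseq = Tseq / T1p
    s_ilp = sum(thread_times) / sum(thread_times_prime)  # (T1+T2) / (T1'+T2')
    s_ovl = s_all / (s_seq * s_ilp)                      # the remainder is TLP
    return {"s_all": s_all, "s_seq": s_seq, "s_ilp": s_ilp, "s_ovl": s_ovl}


if __name__ == "__main__":
    # Made-up timings in arbitrary units, for illustration only.
    result = speedup_breakdown(t_seq=100.0, t_1p=110.0, t_mt=60.0,
                               thread_times=(55.0, 55.0),
                               thread_times_prime=(50.0, 50.0))
    for name, value in result.items():
        print(name, round(value, 3))
```

Because the overlap factor is computed as the remainder, the three factors multiply back to the overall speedup by construction.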
