Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

University of Edinburgh: http://homepages.inf.ed.ac.uk/mc/Projects/VESPA
University of Manchester: http://intranet.cs.man.ac.uk/apt/projects/iTLS

Presentation Transcript


  1. Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm
     University of Edinburgh: http://homepages.inf.ed.ac.uk/mc/Projects/VESPA
     University of Manchester: http://intranet.cs.man.ac.uk/apt/projects/iTLS
     Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra

  2. Introduction
     • Thermal/power constraints, complexity, and time-to-market pressures lead to chip multiprocessors (CMPs)
     • Many simple cores = high TLP but low ILP
     • OK for throughput computing and embarrassingly parallel applications
     • Problem:
       • No benefit for sequential applications
       • Parallel applications with large sequential parts are still limited by Amdahl's law
     • => Thread-Level Speculation (TLS)
     Intl. Symp. on Workload Characterization - December 2010
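The Amdahl limit mentioned on this slide can be made concrete with a small calculation. The following is a minimal sketch; the function name `amdahl_speedup` and the example fractions are illustrative assumptions, not values from the study.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the
    sequential run time can be spread over `cores` cores (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical program in which half of the run time is sequential.
    for cores in (2, 4, 8, 16):
        print(cores, round(amdahl_speedup(0.5, cores), 2))
```

With half of the run time sequential, the bound stays below 2x no matter how many cores are added, which is the gap TLS tries to close by speculatively parallelizing the sequential part.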

  3. Motivation
     • Shortcomings of prior work in assessing TLS performance potential:
       • Evaluations often tied to a particular TLS architectural configuration
       • Proposals of new extensions naturally focus on those extensions, without investigating their interplay with other features
       • Workload choice often limited to one particular domain or programming style

  4. Contributions
     • In-depth, implementation-independent study of TLS performance potential
     • Evaluate TLS architectural features
     • Evaluate workloads from a variety of domains
     • Investigate load imbalance and coverage within the context of TLS

  5. Outline • Introduction • Background • Methodology • Results • Conclusions

  6. Thread Level Speculation
     • Compiler deals with:
       • Task selection
       • Code generation
     • HW deals with:
       • Different contexts
       • Spawning threads
       • Detecting violations
       • Replaying
       • Arbitrating commit
     (Figure: Thread 1 and speculative Thread 2 on a time axis)
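Violation detection, listed above as a hardware responsibility, can be sketched in software as a check for read-after-write dependences that were executed out of order. This is a minimal sketch under stated assumptions: the event format `(task_id, op, addr)` and the function name `find_violations` are hypothetical, and a lower task id is assumed to mean an earlier (less speculative) task in sequential order.

```python
from collections import defaultdict


def find_violations(events):
    """Return the task ids that must be squashed and replayed.

    `events` is an iterable of (task_id, op, addr) tuples in the order the
    accesses actually executed; `op` is "R" or "W". A task is violated when
    an earlier task writes an address that the later task has already read
    without first writing that address itself (an exposed read).
    """
    exposed_reads = defaultdict(set)  # addr -> tasks holding an exposed read
    written_by = defaultdict(set)     # task -> addresses it has written
    squashed = set()
    for task, op, addr in events:
        if op == "R":
            if addr not in written_by[task]:
                exposed_reads[addr].add(task)
        elif op == "W":
            written_by[task].add(addr)
            for reader in exposed_reads[addr]:
                if reader > task:  # a more speculative task read too early
                    squashed.add(reader)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return sorted(squashed)


if __name__ == "__main__":
    # Task 1 reads 0x10 before task 0 writes it, so task 1 is violated.
    trace = [(1, "R", 0x10), (0, "W", 0x10), (0, "W", 0x20), (1, "R", 0x20)]
    print(find_violations(trace))
```

The sketch reports only the directly violated tasks. A real TLS system would typically also squash tasks more speculative than the violated one, since they may have consumed its values.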

  7. Architectural Extensions
     • Multiversioned caches
     • Support for out-of-order spawning
     • Dynamic dependence synchronization
     • Intermediate checkpointing
     • Data value prediction
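As an illustration of data value prediction, the sketch below implements a simple stride predictor, which guesses that a value keeps changing by the same amount as last time (for example, a loop induction variable). The slides do not say which predictor the study models, so this is an assumed, generic technique; the class name `StridePredictor` and its keys are hypothetical.

```python
class StridePredictor:
    """Per-key stride predictor: next value = last value + last stride."""

    def __init__(self):
        self.last = {}    # key -> most recently observed value
        self.stride = {}  # key -> difference between the last two values

    def predict(self, key):
        """Return the predicted next value, or None if the key is unseen."""
        if key not in self.last:
            return None
        return self.last[key] + self.stride.get(key, 0)

    def update(self, key, value):
        """Record the value that was actually produced."""
        if key in self.last:
            self.stride[key] = value - self.last[key]
        self.last[key] = value


if __name__ == "__main__":
    predictor = StridePredictor()
    for observed in (0, 4, 8):
        predictor.update("i", observed)
    print(predictor.predict("i"))
```

A speculative thread that starts with a correctly predicted value avoids waiting for its predecessor, and a wrong prediction is handled like any other violation.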

  8. Outline • Introduction • Background • Methodology • Results • Conclusions

  9. Methodology
     • Imperative:
       • Benchmarks: SPEC CPU 2006, Mediabench II
       • Instrumentation: GCC4 pass
         • Annotates loop iterations and method bodies
         • Marks induction and reduction variables and uses of return values
         • Operates after the intermediate optimizations
     • Object oriented:
       • Benchmarks: SPEC JVM 98, DaCapo
       • Instrumentation: Jikes RVM modification

  10. Methodology
      • Trace Generation
        • Simics, full-system functional simulator
        • Non-intrusive trace of memory accesses
      • Trace-Driven Simulation
        • In-house simulator tool
        • Extracts threads out of loop iterations and/or method call continuations
        • Simulates multi-versioned caches, OoO spawning, dynamic dependence synchronization, and value prediction
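The thread-extraction step of a trace-driven simulator can be sketched as splitting an annotated memory trace at iteration markers. The record format below (`"ITER_BEGIN"`, `"LOOP_END"`, `"R"`, `"W"`) and the function name `split_into_tasks` are hypothetical; the slides do not describe the in-house simulator's actual trace format.

```python
def split_into_tasks(trace):
    """Group memory accesses into one task per annotated loop iteration.

    `trace` is an iterable of (kind, value) records:
      ("ITER_BEGIN", loop_id)  marks the start of a loop iteration
      ("LOOP_END", loop_id)    marks the end of the loop
      ("R", addr) / ("W", addr) are memory reads and writes
    Accesses outside any annotated iteration are ignored.
    """
    tasks = []
    current = None  # access list of the iteration being recorded, if any
    for kind, value in trace:
        if kind == "ITER_BEGIN":
            current = []
            tasks.append(current)
        elif kind == "LOOP_END":
            current = None
        elif kind in ("R", "W"):
            if current is not None:
                current.append((kind, value))
        else:
            raise ValueError(f"unknown record kind: {kind!r}")
    return tasks


if __name__ == "__main__":
    trace = [
        ("R", 1),
        ("ITER_BEGIN", 0), ("R", 2), ("W", 3),
        ("ITER_BEGIN", 0), ("R", 3),
        ("LOOP_END", 0),
        ("W", 9),
    ]
    print(split_into_tasks(trace))
```

Each resulting task is a candidate speculative thread whose accesses can then be replayed against a model of the TLS hardware.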

  11. Methodology
      • Task Selection
        • In-order loop-level speculation
          • Innermost loops
          • Best loops out of three dynamic depth levels
        • In-order method and Out-of-Order speculation
          • Dynamic thread spawning policy favoring safer threads
          • Maximum thread size heuristic
          • All loops and/or methods are candidates

  12. Outline • Introduction • Background • Methodology • Results • Conclusions

  13. Loop-level speculation - Innermost
      for (i = 0; i < m; i++) {
          outer_loop_body1
          for (j = 0; j < l; j++) {
              inner_loop_body1
              for (k = 0; k < n; k++) {
                  spawn_thread();
                  innermost_loop_body
              }
              inner_loop_body2
          }
          outer_loop_body2
      }
      (Figure: speculative iterations Iter. 1, Iter. 2, …, Iter. n)

  14. Loop-level speculation - Innermost

  15. Loop-level speculation – Best loop depth
      for (i = 0; i < m; i++) {
          outer_loop_body1
          for (j = 0; j < l; j++) {
              spawn_thread();
              inner_loop_body1
              for (k = 0; k < n; k++) {
                  innermost_loop_body
              }
              inner_loop_body2
          }
          outer_loop_body2
      }
      (Figure: speculative iterations Iter. 1, Iter. 2, …, Iter. n)

  16. Loop-level speculation – Best loop depth

  17. Method-level speculation - In-Order
      pid = spawn_thread();
      if (pid != 0)
          method();
      method_cont
      (Figure: method and its speculative continuation, method Cont.)

  18. Method-level speculation - In-Order

  19. Method-level speculation - OoO
      pid = spawn_thread();
      if (pid != 0)
          method1();
      method1_cont

      method1() {
          method1_body1
          pid = spawn_thread();
          if (pid != 0)
              method2();
          method2_cont
      }
      (Figure: method1 with speculative method1 Cont. and method2 Cont. on a time axis)

  20. Method-level speculation - OoO

  21. Mixed speculation - In-Order

  22. Mixed speculation - OoO

  23. Load Imbalance and Coverage
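Load imbalance in loop-level speculation can be illustrated with a minimal timing model. The sketch assumes in-order spawning, in-order commit, no violations, and zero spawn or commit overhead, and it assumes that a core is held by its iteration until that iteration commits. The function names and the iteration lengths are hypothetical and do not come from the study.

```python
def speculative_loop_time(iteration_lengths, cores):
    """Time to run all iterations speculatively on `cores` cores.

    Iteration i starts when iteration i - cores has committed (its core is
    then free) and commits once it has finished and iteration i - 1 has
    committed.
    """
    if cores < 1:
        raise ValueError("cores must be >= 1")
    commit = []  # commit[i] = time at which iteration i commits
    for i, length in enumerate(iteration_lengths):
        start = commit[i - cores] if i >= cores else 0
        finish = start + length
        previous_commit = commit[i - 1] if i > 0 else 0
        commit.append(max(finish, previous_commit))
    return commit[-1] if commit else 0


def speculative_loop_speedup(iteration_lengths, cores):
    """Sequential time divided by speculative time under the model above."""
    sequential = sum(iteration_lengths)
    parallel = speculative_loop_time(iteration_lengths, cores)
    return sequential / parallel if parallel else 1.0


if __name__ == "__main__":
    print(speculative_loop_speedup([10, 10, 10, 10], cores=2))  # balanced
    print(speculative_loop_speedup([10, 1, 10, 1], cores=2))    # imbalanced
```

In this model the balanced loop reaches a speedup of 2.0 on two cores, while the imbalanced loop reaches only 1.1, because the short iterations finish early and then wait to commit while holding their core. Coverage then limits the whole-program gain further, since only the fraction of run time spent inside speculated regions benefits.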

  24. Results – Multi-versioning to the rescue?

  25. Outline • Introduction • Background • Methodology • Results • Conclusions

  26. Conclusions
      • Load imbalance and limited coverage are important factors in realizing TLS performance
      • Support for OoO spawning does not provide significant benefits for the task selection policy employed
      • Multi-versioned caches unlock performance in some cases but are not a panacea
      • Task selection is critical

  27. Also in the paper
      • In-depth analysis of high-coverage loops for selected benchmarks
      • Comparison of TLS loop-level speculation with a state-of-the-art auto-parallelizing compiler
      • OoO loop-level speculation
      • Outline of most of the proposed architectural and compiler extensions for TLS systems

  28. Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm
      University of Edinburgh: http://homepages.inf.ed.ac.uk/mc/Projects/VESPA
      University of Manchester: http://intranet.cs.man.ac.uk/apt/projects/iTLS
      Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra

  29. Backup slides – Auto parallelizing compiler comparison

  30. Backup slides – OoO loop
