Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra
University of Edinburgh, http://homepages.inf.ed.ac.uk/mc/Projects/VESPA
University of Manchester, http://intranet.cs.man.ac.uk/apt/projects/iTLS

Introduction
• Thermal/power constraints, complexity, and time-to-market pressures lead to chip multiprocessors (CMPs)
• Many simple cores = high thread-level parallelism (TLP) but low instruction-level parallelism (ILP)
• OK for throughput computing and embarrassingly parallel applications
• Problem:
  • No benefit for sequential applications
  • Parallel applications with large sequential parts are still limited by Amdahl's law
• => Thread Level Speculation (TLS)
Intl. Symp. on Workload Characterization - December 2010
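To make the Amdahl limit concrete, here is a minimal Python sketch of the standard Amdahl's law formula. It is not taken from the slides; the function name and the example fractions are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the run time scales with `cores`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# A program that is half sequential gains little from more cores.
half_parallel_on_4 = amdahl_speedup(0.5, 4)      # 1 / (0.5 + 0.125) = 1.6
half_parallel_on_64 = amdahl_speedup(0.5, 64)    # still below 2
```

Even with 64 cores the half-sequential program stays under 2x, which is the gap TLS tries to close by extracting threads from the sequential part.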

Motivation
• Shortcomings of prior work in assessing TLS performance potential:
  • Evaluations often tied to a particular TLS architectural configuration
  • Proposals of new extensions naturally focus on those extensions, without investigating the interplay with other features
  • Workload choice often limited to one particular domain or programming style

Contributions
• In-depth, implementation-independent study of TLS performance potential
• Evaluate TLS architectural features
• Evaluate workloads from a variety of domains
• Investigate load imbalance and coverage within the context of TLS

Outline
• Introduction
• Background
• Methodology
• Results
• Conclusions

Thread Level Speculation
• Compiler deals with:
  • Task selection
  • Code generation
• HW deals with:
  • Different contexts
  • Spawning threads
  • Detecting violations
  • Replaying
  • Arbitrating commit
Figure labels: Speculative; Thread 1, Thread 2; Time
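The hardware duties listed above (private contexts, violation detection, replay, in-order commit) can be sketched as a small software model. This is a simplified illustration under stated assumptions, not the mechanism evaluated in the talk: all names (`SpeculativeView`, `run_batch`, `increment`) are hypothetical, addresses are plain strings, and violations are checked lazily at commit time.

```python
from typing import Callable, Dict, List, Set


class SpeculativeView:
    """Private context for one task: buffers stores and records exposed loads."""

    def __init__(self, committed: Dict[str, int]) -> None:
        self._committed = committed
        self.writes: Dict[str, int] = {}
        self.reads: Set[str] = set()

    def load(self, addr: str) -> int:
        if addr in self.writes:          # the task reads its own buffered value
            return self.writes[addr]
        self.reads.add(addr)             # exposed read: may be violated later
        return self._committed.get(addr, 0)

    def store(self, addr: str, value: int) -> None:
        self.writes[addr] = value        # invisible to other tasks until commit


Task = Callable[[SpeculativeView], None]


def run_batch(tasks: List[Task], memory: Dict[str, int]) -> int:
    """Run tasks speculatively, commit in program order, return the squash count."""
    snapshot = dict(memory)
    views = []
    for task in tasks:                   # every task starts from the same state
        view = SpeculativeView(snapshot)
        task(view)
        views.append(view)

    squashes = 0
    committed_writes: Set[str] = set()
    for task, view in zip(tasks, views):
        if not view.reads.isdisjoint(committed_writes):
            squashes += 1                # violation: the task read stale data
            view = SpeculativeView(memory)
            task(view)                   # replay as the oldest (safe) task
        memory.update(view.writes)       # commit buffered stores
        committed_writes.update(view.writes)
    return squashes


def increment(src: str, dst: str) -> Task:
    """Hypothetical task body: dst = src + 1."""
    def task(view: SpeculativeView) -> None:
        view.store(dst, view.load(src) + 1)
    return task


memory = {"a": 0}
squashed = run_batch(
    [increment("a", "b"), increment("b", "c"), increment("x", "y")], memory
)
```

Here the second task reads `b` before the first task has committed it, so it is squashed and replayed; the third task is independent and commits without a replay. The final memory matches sequential execution.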

Architectural Extensions
• Multi-versioned caches
• Support for out-of-order spawning
• Dynamic dependence synchronization
• Intermediate checkpointing
• Data value prediction

Methodology
• Benchmarks
  • Imperative:
    • SPEC CPU 2006
    • Mediabench II
    • Instrumentation
      • GCC4 pass
      • Annotates loop iterations and method bodies
      • Marks induction and reduction variables and uses of return values
      • Operates after the intermediate optimizations
  • Object oriented:
    • SPEC JVM 98
    • DaCapo
    • Jikes RVM modification

Methodology
• Trace Generation
  • Simics, a full-system functional simulator
  • Non-intrusive trace of memory accesses
• Trace-Driven Simulation
  • In-house simulator tool
  • Extracts threads out of loop iterations and/or method call continuations
  • Simulates multi-versioned caches, out-of-order (OoO) spawning, dynamic dependence synchronization, and value prediction
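As a rough illustration of how threads can be extracted from an annotated memory trace, the sketch below splits a trace into one task per loop iteration and collects each task's read and write sets. The record format (`"iter"`, `"load"`, `"store"` tuples) and all names are assumptions made for this example; the slides do not describe the in-house simulator's actual trace format.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple

# Hypothetical record format: (kind, value). kind is "iter" (value is a loop id)
# or "load"/"store" (value is a memory address).
Record = Tuple[str, int]


@dataclass
class TaskTrace:
    loop_id: int
    reads: Set[int] = field(default_factory=set)    # exposed reads only
    writes: Set[int] = field(default_factory=set)


def split_into_tasks(trace: Iterable[Record]) -> List[TaskTrace]:
    """Group memory accesses into one task per annotated loop iteration."""
    tasks: List[TaskTrace] = []
    for kind, value in trace:
        if kind == "iter":
            tasks.append(TaskTrace(loop_id=value))   # marker opens a new task
        elif kind not in ("load", "store"):
            raise ValueError(f"unknown record kind: {kind!r}")
        elif not tasks:
            continue                                 # code before the first marker
        elif kind == "store":
            tasks[-1].writes.add(value)
        elif value not in tasks[-1].writes:
            tasks[-1].reads.add(value)               # read not satisfied locally
    return tasks


trace = [
    ("load", 0x10),                       # sequential code, ignored
    ("iter", 7), ("load", 0x20), ("store", 0x30),
    ("iter", 7), ("store", 0x40), ("load", 0x40), ("load", 0x30),
]
tasks = split_into_tasks(trace)
```

In this example the second iteration reads address `0x30`, which the first iteration writes, so a simulator working on these sets would see a cross-iteration dependence there. The load of `0x40` is not recorded as an exposed read because the same task wrote it first.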

Methodology
• Task Selection
  • In-order loop-level speculation
    • Innermost loops
    • Best loops out of three dynamic depth levels
  • In-order method and out-of-order speculation
    • Dynamic thread spawning policy favoring safer threads
    • Maximum thread size heuristic
    • All loops and/or methods are candidates

Loop-level speculation - Innermost
for (i = 0; i < m; i++) {
    outer_loop_body1
    for (j = 0; j < l; j++) {
        inner_loop_body1
        for (k = 0; k < n; k++) {
            spawn_thread();
            innermost_loop_body
        }
        inner_loop_body2
    }
    outer_loop_body2
}
Figure labels: Speculative; Iter. 1, Iter. 2, …, Iter. n

Loop-level speculation - Innermost

Loop-level speculation – Best loop depth
for (i = 0; i < m; i++) {
    outer_loop_body1
    for (j = 0; j < l; j++) {
        spawn_thread();
        inner_loop_body1
        for (k = 0; k < n; k++) {
            innermost_loop_body
        }
        inner_loop_body2
    }
    outer_loop_body2
}
Figure labels: Speculative; Iter. 1, Iter. 2, …, Iter. n

Loop-level speculation – Best loop depth

Method-level speculation - In-Order
pid = spawn_thread();
if (pid != 0)
    method();
method_cont
Figure labels: Speculative; method, method Cont.

Method-level speculation - In-Order

Method-level speculation - OoO
pid = spawn_thread();
if (pid != 0)
    method1();
method1_cont
method1() {
    method1_body1
    pid = spawn_thread();
    if (pid != 0)
        method2();
    method2_cont
}
Figure labels: Speculative; method1, method1 Cont., method2 Cont.; Time

Method-level speculation - OoO

Mixed speculation - In-Order

Mixed speculation - OoO

Load Imbalance and Coverage

Results – Multi-versioning to the rescue?

Conclusions
• Load imbalance and limited coverage are important factors in realizing TLS performance
• Support for OoO spawning does not provide significant benefits for the task selection policy employed
• Multi-versioned caches unlock performance in some cases but are no panacea
• Task selection is critical
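Why load imbalance hurts can be seen in a deliberately idealized model. The sketch below is not the paper's simulator: it assumes zero spawn and commit overhead, no violations, in-order commit, and that a core cannot start a new task until the task it ran has committed. The function name and task sizes are hypothetical.

```python
from typing import List, Sequence


def tls_region_speedup(task_sizes: Sequence[float], cores: int) -> float:
    """Speedup of a speculative region under in-order commit (idealized model)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not task_sizes or any(size <= 0 for size in task_sizes):
        raise ValueError("task_sizes must be non-empty and positive")

    commit_times: List[float] = []
    for i, size in enumerate(task_sizes):
        # Task i reuses the core of task i - cores once that task has committed.
        start = commit_times[i - cores] if i >= cores else 0.0
        finish = start + size
        previous_commit = commit_times[-1] if commit_times else 0.0
        # A finished task still waits for all earlier tasks to commit.
        commit_times.append(max(finish, previous_commit))
    return sum(task_sizes) / commit_times[-1]


balanced = tls_region_speedup([10, 10, 10, 10], cores=4)     # 40 / 10 = 4.0
imbalanced = tls_region_speedup([10, 1, 1, 1], cores=4)      # 13 / 10 = 1.3
```

With four equal tasks the region reaches the full 4x, but when one task dominates, the three short tasks finish early and their cores sit idle waiting to commit, so the region speedup drops to 1.3x in this model.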

Also in the paper
• In-depth analysis of high-coverage loops for selected benchmarks
• Comparison of TLS loop-level speculation with a state-of-the-art auto-parallelizing compiler
• OoO loop-level speculation
• Outline of most of the proposed architectural and compiler extensions for TLS systems


Backup slides – Auto-parallelizing compiler comparison

Backup slides – OoO loop
