Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

University of Edinburgh

http://homepages.inf.ed.ac.uk/mc/Projects/VESPA

University of Manchester

http://apt.cs.man.ac.uk/projects/iTLS

Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra

Introduction
  • Thermal/power constraints, complexity and time-to-market reasons lead to CMPs
  • Many simple cores = high TLP but low ILP
    • Ok for throughput computing, server workloads, and embarrassingly parallel applications
  • Problem:
    • No benefits for sequential applications
    • Parallel applications with large sequential parts are still limited by Amdahl's law
  • => Thread Level Speculation (TLS)
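
The Amdahl limit named on this slide can be made concrete with a short calculation. The following is a minimal sketch and is not taken from the slides: the function name `amdahl_speedup` and the 50% sequential fraction are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the
    sequential run time can be spread perfectly over `cores` cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical program in which half of the run time is sequential.
    for cores in (2, 4, 16, 64):
        print(cores, round(amdahl_speedup(0.5, cores), 3))
```

Under this assumption the bound stays below 2x no matter how many cores are added, which is the gap TLS tries to close by running the sequential part speculatively in parallel.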

Intl. Symp. on Workload Characterization - December 2010

Motivation
  • Shortcomings of prior work in assessing TLS performance potential
    • Evaluations often tied to a particular TLS architectural configuration
    • Proposals of new extensions naturally focus on those extensions and do not investigate the interplay with other features
    • Workload choice often limited to one particular domain or programming style

Contributions
  • In-depth implementation-independent study of TLS performance potential
  • Evaluate TLS architectural features
  • Evaluate workloads from a variety of domains
  • Investigate load imbalance and coverage within the context of TLS

Outline
  • Introduction
  • Background
  • Methodology
  • Results
  • Conclusions

Thread Level Speculation

  • Compiler deals with:
    • Task selection
    • Code generation
  • HW deals with:
    • Maintaining different contexts
    • Spawning threads
    • Detecting violations
    • Replaying
    • Arbitrating commits
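
The hardware duties listed above (separate contexts, spawning, violation detection, replay, and in-order commit) can be sketched in software. This is a toy model under stated assumptions, not the mechanism of any particular TLS design: the names `TaskContext`, `run_tls`, and `increment` are hypothetical, tasks are deterministic functions of the values they load, and every task is spawned from the same pre-spawn snapshot.

```python
from typing import Callable, Dict, List, Set, Tuple

Memory = Dict[str, int]
Task = Callable[["TaskContext"], None]


class TaskContext:
    """Per-task speculative state: a private write buffer plus a read set."""

    def __init__(self, memory: Memory) -> None:
        self._memory = memory            # state visible when the task starts
        self.writes: Memory = {}         # buffered speculative writes
        self.reads: Set[str] = set()     # locations read from visible state

    def load(self, addr: str) -> int:
        if addr in self.writes:          # own buffered value, not exposed
            return self.writes[addr]
        self.reads.add(addr)
        return self._memory.get(addr, 0)

    def store(self, addr: str, value: int) -> None:
        self.writes[addr] = value


def run_tls(tasks: List[Task], memory: Memory) -> Tuple[Memory, int]:
    """Run `tasks` (listed in sequential order) speculatively.

    Returns the final memory and the number of replayed tasks.
    """
    snapshot = dict(memory)
    contexts = []
    for task in tasks:                   # spawn: all run on the snapshot
        ctx = TaskContext(snapshot)
        task(ctx)
        contexts.append(ctx)

    committed = dict(memory)
    dirty: Set[str] = set()              # written by already-committed tasks
    replays = 0
    for task, ctx in zip(tasks, contexts):   # commit in sequential order
        if ctx.reads & dirty:            # stale read: dependence violation
            ctx = TaskContext(committed)
            task(ctx)                    # replay on up-to-date state
            replays += 1
        committed.update(ctx.writes)
        dirty.update(ctx.writes)
    return committed, replays


def increment(src: str, dst: str) -> Task:
    """Hypothetical task: dst = src + 1."""
    return lambda ctx: ctx.store(dst, ctx.load(src) + 1)


if __name__ == "__main__":
    tasks = [increment("a", "b"), increment("b", "c"), increment("x", "y")]
    print(run_tls(tasks, {"a": 1, "x": 10}))
```

In the usage example the second task reads `b` before the first task has committed it, so it is squashed and replayed, while the independent third task commits without a replay. The final memory matches what sequential execution would produce.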

[Figure: Thread 1 and Thread 2 on a time axis, with a region marked Speculative]

Architectural Extensions

  • Multiversioned caches
  • Support for out-of-order spawning
  • Dynamic dependence synchronization
  • Intermediate checkpointing
  • Data value prediction
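
As an illustration of the last extension, data value prediction lets a speculative thread continue with a guessed value for a cross-thread dependence instead of waiting or being squashed. The sketch below uses a stride predictor, which is one common scheme and not necessarily the one evaluated in this study; the names `StridePredictor` and `count_correct` are hypothetical.

```python
from typing import Iterable, Optional


class StridePredictor:
    """Predicts the next value as the last value plus the last stride."""

    def __init__(self) -> None:
        self.last: Optional[int] = None
        self.stride = 0

    def predict(self) -> Optional[int]:
        """Return the predicted next value, or None before any training."""
        if self.last is None:
            return None
        return self.last + self.stride

    def update(self, actual: int) -> None:
        """Train on the value that was actually produced."""
        if self.last is not None:
            self.stride = actual - self.last
        self.last = actual


def count_correct(values: Iterable[int]) -> int:
    """Count how many values in the sequence were predicted correctly."""
    predictor = StridePredictor()
    correct = 0
    for value in values:
        if predictor.predict() == value:
            correct += 1
        predictor.update(value)
    return correct


if __name__ == "__main__":
    # An induction-variable-like sequence becomes predictable after two samples.
    print(count_correct([0, 4, 8, 12, 16]))
```

Values such as loop induction variables follow a fixed stride, so a predictor of this kind needs only two observations before its guesses start to hold.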

Methodology
  • Benchmarks
    • Imperative:
      • SPEC CPU 2006
      • Mediabench II
    • Object oriented:
      • SPEC JVM 98
      • DaCapo
  • Instrumentation
    • GCC4 pass
      • Annotate loop iterations and method bodies
      • Mark induction, reduction variables and use of return values
      • Operate after the intermediate optimizations
    • Jikes RVM modification

Methodology
  • Trace Generation
    • Simics, full-system functional simulator
    • Non-intrusive trace of memory accesses
  • Trace-Driven Simulation
    • In-house simulator tool
      • Extracts threads out of loop iterations and/or method call continuations
      • Simulates: multi-versioned caches, OoO spawning, dynamic dependence synchronization, and value prediction

Methodology
  • Task Selection
    • In-order loop-level speculation
      • Innermost loops
      • Best loops out of three dynamic depth levels
    • In-order method and Out-of-Order speculation
      • Dynamic thread spawning policy favoring safer threads
      • Maximum thread size heuristic
    • All loops and/or methods are candidates

Loop-level speculation - Innermost

for (i = 0; i < m; i++) {
    outer_loop_body1
    for (j = 0; j < l; j++) {
        inner_loop_body1
        for (k = 0; k < n; k++) {
            spawn_thread();   /* each innermost iteration runs as its own thread */
            innermost_loop_body
        }
        inner_loop_body2
    }
    outer_loop_body2
}

[Figure: Iter. 1, Iter. 2, ..., Iter. n, with a region marked Speculative]

Loop-level speculation – Best loop depth

for (i = 0; i < m; i++) {
    outer_loop_body1
    for (j = 0; j < l; j++) {
        spawn_thread();   /* threads are spawned at the chosen loop depth, here the j loop */
        inner_loop_body1
        for (k = 0; k < n; k++) {
            innermost_loop_body
        }
        inner_loop_body2
    }
    outer_loop_body2
}

[Figure: Iter. 1, Iter. 2, ..., Iter. BD, with a region marked Speculative]

Method-level speculation - In-Order

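pid = spawn_thread();
if (pid != 0) method();   /* only the thread that sees a nonzero pid runs the method */
method_cont               /* continuation of the call */

[Figure: method and its continuation (Cont.), with a region marked Speculative]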

Method-level speculation - OoO

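pid = spawn_thread();
if (pid != 0) method1();
method1_cont

method1()
{
    method1_body1
    pid = spawn_thread();          /* nested spawn from inside method1 */
    if (pid != 0) method2();
    method2_cont
}

[Figure: method1, its continuation (Cont.), method2, and its continuation (Cont.) on a time axis, with a region marked Speculative]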

Mixed speculation - In-Order

Mixed speculation - OoO

Load Imbalance and Coverage

Results – Multi-versioning to the rescue?

Conclusions
  • Load imbalance and limited coverage are important factors in realizing TLS performance
  • Support for OoO spawning does not provide significant benefits for the task policy employed
  • Multi-versioned caches unlock performance in some cases but are not a panacea
  • Task selection is critical
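
The first conclusion can be illustrated with a back-of-the-envelope model. This is a sketch under stated assumptions, not the simulator used in the study: tasks never violate dependences, spawn, commit, and squash overheads are ignored, a core is free as soon as its task finishes, and the names `region_speedup` and `program_speedup` are hypothetical.

```python
import heapq
from typing import List


def region_speedup(task_lengths: List[float], cores: int) -> float:
    """Speedup of one speculative region made of violation-free tasks.

    Tasks are dispatched in sequential order to whichever core becomes
    free first. Unequal task lengths show up as load imbalance.
    """
    if not task_lengths:
        return 1.0
    free_at = [0.0] * cores              # time at which each core is idle
    heapq.heapify(free_at)
    finish = 0.0
    for length in task_lengths:
        start = heapq.heappop(free_at)   # earliest available core
        end = start + length
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return sum(task_lengths) / finish


def program_speedup(coverage: float, region: float) -> float:
    """Amdahl-style bound: only `coverage` of the run time is sped up."""
    return 1.0 / ((1.0 - coverage) + coverage / region)


if __name__ == "__main__":
    balanced = region_speedup([1, 1, 1, 1], cores=4)
    skewed = region_speedup([7, 1, 1, 1], cores=4)
    print(balanced, skewed)
    print(program_speedup(0.5, balanced))
```

In this model four equal tasks on four cores give a 4x region speedup, one long task among short ones drops it to about 1.43x, and covering only half of the run time caps even the balanced case at 1.6x overall.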

Also in the paper
  • In-depth analysis of high coverage loops for selected benchmarks
  • Comparison of TLS loop-level speculation with a state-of-the-art auto-parallelizing compiler
  • OoO Loop-level speculation
  • Outline of most of the proposed architectural and compiler extensions for TLS systems

Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

University of Edinburgh

http://homepages.inf.ed.ac.uk/mc/Projects/VESPA

University of Manchester

http://intranet.cs.man.ac.uk/apt/projects/iTLS

Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra

Backup slides – Auto parallelizing compiler comparison

Backup slides – OoO loop
