Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

University of Edinburgh

http://homepages.inf.ed.ac.uk/mc/Projects/VESPA

University of Manchester

http://apt.cs.man.ac.uk/projects/iTLS

Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra


Introduction

  • Thermal/power constraints, design complexity, and time-to-market pressure lead to chip multiprocessors (CMPs)

  • Many simple cores = high thread-level parallelism (TLP) but low instruction-level parallelism (ILP)

    • OK for throughput computing, server workloads, and embarrassingly parallel applications

  • Problem:

    • No benefits for sequential applications

    • Parallel applications with large sequential parts are still limited by Amdahl's law

  • => Thread Level Speculation (TLS)
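The Amdahl limit mentioned in the bullets above can be made concrete with a small calculation. The following is a minimal sketch, not taken from the slides: the function name `amdahl_speedup` and its parameters are illustrative assumptions.

```c
#include <assert.h>

/* Upper bound on speedup according to Amdahl's law.
 * parallel_fraction: share of the sequential run time that can run in parallel (0..1).
 * cores: number of cores available for the parallel part (>= 1).
 * Both names are illustrative; they do not come from the presentation. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

For example, a program whose run time is half sequential gains at most a factor of 1.6 on 4 cores, and can never exceed a factor of 2 regardless of core count.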

Intl. Symp. on Workload Characterization - December 2010


Motivation

  • Shortcomings of prior work in assessing TLS performance potential

    • Evaluations often tied to a particular TLS architectural configuration

    • Proposals of new extensions naturally focus on the extension itself, without investigating its interplay with other features

    • Workload choice often limited to one particular domain or programming style


Contributions

  • In-depth implementation-independent study of TLS performance potential

  • Evaluate TLS architectural features

  • Evaluate workloads from a variety of domains

  • Investigate load imbalance and coverage within the context of TLS


Outline

  • Introduction

  • Background

  • Methodology

  • Results

  • Conclusions


Thread Level Speculation

  • Compiler deals with:

    • Task selection

    • Code generation

  • HW deals with:

    • Maintaining different thread contexts

    • Spawning threads

    • Detecting dependence violations

    • Replaying violated threads

    • Arbitrating commit

[Figure: execution timeline (axis: Time) of Thread 1 and Thread 2; label: Speculative]
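The hardware duty of detecting violations can be illustrated in software. The sketch below is a simplified model and not the mechanism used in the study: it assumes a fixed small number of tasks and addresses, replays accesses in the order they happened, and reports the first more speculative task that loaded a location before a less speculative task stored to it. The names `access_t` and `first_violated_task` are hypothetical, and the check is conservative because it ignores versions produced by intermediate tasks.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_TASKS = 8, MAX_ADDRS = 64 };  /* arbitrary limits for this sketch */

typedef struct {
    int task;      /* position in sequential order; 0 is the least speculative */
    int is_store;  /* 1 = store, 0 = load */
    int addr;      /* small integer standing in for a memory address */
} access_t;

/* Replays accesses in the time order in which they occurred.
 * Returns the id of the first task that must be squashed, or -1 if none. */
int first_violated_task(const access_t *trace, size_t n)
{
    /* exposed_read[t][a]: task t loaded a before storing to a itself */
    unsigned char exposed_read[MAX_TASKS][MAX_ADDRS] = {{0}};
    unsigned char wrote[MAX_TASKS][MAX_ADDRS] = {{0}};

    for (size_t i = 0; i < n; i++) {
        int t = trace[i].task;
        int a = trace[i].addr;
        assert(t >= 0 && t < MAX_TASKS && a >= 0 && a < MAX_ADDRS);

        if (!trace[i].is_store) {
            if (!wrote[t][a])
                exposed_read[t][a] = 1;  /* value came from an earlier task */
            continue;
        }

        wrote[t][a] = 1;
        /* A store by task t makes stale any value that a later task already read. */
        for (int later = t + 1; later < MAX_TASKS; later++)
            if (exposed_read[later][a])
                return later;
    }
    return -1;
}
```

In a real TLS system the violated task and its successors would then be squashed and replayed; this sketch only identifies where the squash would start.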


Architectural Extensions

  • Multiversioned caches

  • Support for out-of-order spawning

  • Dynamic dependence synchronization

  • Intermediate checkpointing

  • Data value prediction
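As an illustration of the last item in the list above, data value prediction, here is a minimal last-value predictor. It is one well-known prediction scheme and is not necessarily the one modeled in the study; indexing the table by memory address, the table size, and all names (`lvp_t`, `lvp_predict`, `lvp_update`) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { LVP_ENTRIES = 256 };  /* table size chosen arbitrarily for the sketch */

typedef struct {
    long value[LVP_ENTRIES];           /* last value observed per slot */
    unsigned char valid[LVP_ENTRIES];  /* 1 once a slot has been trained */
} lvp_t;

void lvp_init(lvp_t *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < LVP_ENTRIES; i++) {
        p->value[i] = 0;
        p->valid[i] = 0;
    }
}

/* Direct-mapped slot selection; different addresses may alias. */
static size_t lvp_slot(unsigned long addr)
{
    return (size_t)(addr % LVP_ENTRIES);
}

/* Writes the predicted value to *predicted and returns 1,
 * or returns 0 when the slot has not been trained yet. */
int lvp_predict(const lvp_t *p, unsigned long addr, long *predicted)
{
    size_t s = lvp_slot(addr);
    if (!p->valid[s])
        return 0;
    *predicted = p->value[s];
    return 1;
}

/* Trains the predictor with the value that was actually produced.
 * Returns 1 if the previous prediction would have been correct. */
int lvp_update(lvp_t *p, unsigned long addr, long actual)
{
    size_t s = lvp_slot(addr);
    int correct = p->valid[s] && p->value[s] == actual;
    p->value[s] = actual;
    p->valid[s] = 1;
    return correct;
}
```

A speculative thread would consume the predicted value instead of waiting for its predecessor; a misprediction then has to be treated like a dependence violation.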


Methodology

  • Benchmarks

    • Imperative:

      • SPEC CPU 2006

      • MediaBench II

    • Object oriented:

      • SPEC JVM 98

      • DaCapo

  • Instrumentation

    • GCC4 pass

      • Annotate loop iterations and method bodies

      • Mark induction, reduction variables and use of return values

      • Operate after the intermediate optimizations

    • Jikes RVM modification


Methodology

  • Trace Generation

    • Simics, full-system functional simulator

    • Non-intrusive trace of memory accesses

  • Trace-Driven Simulation

    • In-house simulation tool

      • Extracts threads out of loop iterations and/or method call continuations

      • Simulates: multi-versioned caches, out-of-order (OoO) spawning, dynamic dependence synchronization, and value prediction
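To give a feel for the trace-driven step described above, the following sketch scans an annotated trace and splits it into candidate tasks. It is not the in-house tool: the record kinds, the assumption that task regions are not nested, and the use of memory-access counts as a stand-in for execution time are all simplifications introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record kinds: markers emitted by the instrumentation
 * plus ordinary memory accesses. */
typedef enum { REC_TASK_BEGIN, REC_TASK_END, REC_ACCESS } rec_kind_t;

typedef struct {
    size_t tasks;         /* number of candidate tasks found */
    size_t covered;       /* accesses that fall inside some task */
    size_t total;         /* all accesses in the trace */
    size_t largest_task;  /* accesses in the largest task */
} trace_stats_t;

/* Single pass over the trace; nesting is not handled in this sketch. */
trace_stats_t scan_trace(const rec_kind_t *trace, size_t n)
{
    trace_stats_t s = {0, 0, 0, 0};
    int in_task = 0;
    size_t current = 0;

    for (size_t i = 0; i < n; i++) {
        switch (trace[i]) {
        case REC_TASK_BEGIN:
            assert(!in_task);
            in_task = 1;
            current = 0;
            s.tasks++;
            break;
        case REC_TASK_END:
            assert(in_task);
            in_task = 0;
            if (current > s.largest_task)
                s.largest_task = current;
            break;
        case REC_ACCESS:
            s.total++;
            if (in_task) {
                s.covered++;
                current++;
            }
            break;
        }
    }
    return s;
}

/* Fraction of accesses that lie inside candidate tasks (0 for an empty trace). */
double coverage(const trace_stats_t *s)
{
    return s->total ? (double)s->covered / (double)s->total : 0.0;
}
```

A fuller simulator would additionally replay the accesses of each task against the modeled architectural features; this sketch stops at task extraction and a coverage estimate.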


Methodology

  • Task Selection

    • In-order loop-level speculation

      • Innermost loops

      • Best loops out of three dynamic depth levels

    • In-order method and Out-of-Order speculation

      • Dynamic thread spawning policy favoring safer threads

      • Maximum thread size heuristic

    • All loops and/or methods are candidates


Loop-level speculation - Innermost

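    for (i = 0; i < m; i++) {
        outer_loop_body1
        for (j = 0; j < l; j++) {
            inner_loop_body1
            for (k = 0; k < n; k++) {
                spawn_thread();    /* each innermost iteration becomes a thread */
                innermost_loop_body
            }
            inner_loop_body2
        }
        outer_loop_body2
    }

[Figure: threads for Iter. 1, Iter. 2, ..., Iter. n; label: Speculative]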


Loop-level speculation - Innermost


Loop-level speculation – Best loop depth

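    for (i = 0; i < m; i++) {
        outer_loop_body1
        for (j = 0; j < l; j++) {
            spawn_thread();    /* each iteration of the selected loop becomes a thread */
            inner_loop_body1
            for (k = 0; k < n; k++) {
                innermost_loop_body
            }
            inner_loop_body2
        }
        outer_loop_body2
    }

[Figure: threads for Iter. 1, Iter. 2, ..., Iter. BD; label: Speculative]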


Loop-level speculation – Best loop depth


Method-level speculation - In-Order

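    pid = spawn_thread();
    if (pid != 0) method();    /* the thread that gets a nonzero pid executes the method */
    method_cont                /* the other thread skips the call and starts the continuation speculatively */

[Figure: timeline showing method and its continuation (Cont.); label: Speculative]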


Method-level speculation - In-Order


Method-level speculation - OoO

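    pid = spawn_thread();
    if (pid != 0) method1();
    method1_cont

    method1()
    {
        method1_body1
        pid = spawn_thread();      /* nested spawn from inside method1 */
        if (pid != 0) method2();
        method2_cont
    }

[Figure: timeline (axis: Time) showing method1, the method1 continuation, method2, and the method2 continuation; label: Speculative]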


Method-level speculation - OoO


Mixed speculation - In-Order


Mixed speculation - OoO


Load Imbalance and Coverage
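One way to see why load imbalance limits speedup is a deliberately crude model, sketched below. It is an assumption made for illustration and not the metric used in the study: tasks are dispatched in rounds of `cores` tasks, and because commit is in order, each round is taken to last as long as its longest task. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup under a simple round-based model (an illustrative assumption):
 * sequential time is the sum of all task lengths, parallel time is the sum
 * of the longest task in each round of `cores` consecutive tasks. */
double round_based_speedup(const unsigned *task_len, size_t n, size_t cores)
{
    assert(cores >= 1);
    unsigned long sequential = 0;
    unsigned long parallel = 0;

    for (size_t start = 0; start < n; start += cores) {
        unsigned longest = 0;
        for (size_t i = start; i < n && i < start + cores; i++) {
            sequential += task_len[i];
            if (task_len[i] > longest)
                longest = task_len[i];
        }
        parallel += longest;  /* the other cores idle until the longest task ends */
    }
    return parallel ? (double)sequential / (double)parallel : 1.0;
}
```

With four equal tasks of length 10 on 4 cores the model gives a speedup of 4, while tasks of length 10, 10, 10, and 70 give only about 1.43, even though no dependence violation occurs.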


Results – Multi-versioning to the rescue?


Conclusions

  • Load imbalance and limited coverage are important factors in realizing TLS performance

  • Support for OoO spawning does not provide significant benefits for the task selection policy employed

  • Multi-versioned caches unlock performance in some cases but are not a panacea

  • Task selection is critical


Also in the paper

  • In-depth analysis of high coverage loops for selected benchmarks

  • Comparison of TLS loop-level speculation with a state-of-the-art auto-parallelizing compiler

  • OoO Loop-level speculation

  • Outline of most of the proposed architectural and compiler extensions for TLS systems


Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm

University of Edinburgh

http://homepages.inf.ed.ac.uk/mc/Projects/VESPA

University of Manchester

http://intranet.cs.man.ac.uk/apt/projects/iTLS

Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra


Backup slides – Auto parallelizing compiler comparison


Backup slides – OoO loop
