Combining Thread Level Speculation, Helper Threads, and Runahead Execution

University of Edinburgh

http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA

Polychronis Xekalakis, Nikolas Ioannou and Marcelo Cintra

Introduction
  • Single-core, out-of-order designs don’t scale
    • Simpler solution: multi-core architectures
  • No speedup for single-threaded applications
    • Use Thread Level Speculation to extract TLP
    • Use Helper Threads or RunAhead to improve ILP
  • However, for different apps (or phases), some models work better than others
  • Our Proposal:
    • Combine these execution models
    • Decide at runtime when to employ them

ICS 2009

Contributions
  • Introduce mixed Speculative Multithreading (SM) Execution Models
  • Design one that combines TLS, HT and RA
  • Propose a performance model able to quantify ILP and TLP benefits
  • Unified approach outperforms state-of-the-art SM models:
    • TLS by 10.2% avg. (up to 41.2%)
    • RA by 18.3% avg. (up to 35.2%)

Outline
  • Introduction
  • Speculative Multithreading Models
  • Performance Model
  • Unified Scheme
  • Experimental Setup and Results
  • Conclusions

Helper Threads
  • Compiler deals with:
    • Memory ops that miss / hard-to-predict branches
    • Backward slices
  • HW deals with:
    • Spawn threads
    • Different context
    • Discard when finished
  • Benefit:
    • ILP (Prefetch/Warmup)

RunAhead Execution
  • Compiler deals with:
    • Nothing
  • HW deals with:
    • Different context
    • When to do RA
    • Value prediction (VP) for memory
    • Commit/Discard
  • Benefit:
    • ILP (Prefetch/Warmup)

Thread Level Speculation
  • Compiler deals with:
    • Task selection
    • Code generation
  • HW deals with:
    • Different context
    • Spawn threads
    • Detecting violations
    • Replaying
    • Arbitrate commit
  • Benefit: TLP/ILP
    • TLP (Overlapped Execution)
      • + ILP (Prefetching)

Understanding Performance Benefits
  • Complex TLS thread interactions obscure performance benefits
  • Even more true for mixed execution models
  • We need a way to quantify ILP and TLP contributions to bottom-line performance
  • Proposed model:
    • Able to break benefits down into ILP and TLP contributions

Performance Model
  • S_all = S_seq × S_ilp × S_ovl
    • Compute overall speedup: S_all = T_seq / T_mt
    • Compute sequential TLS speedup: S_seq = T_seq / T_1p
    • Compute speedup due to ILP: S_ilp = (T_1 + T_2) / (T_1’ + T_2’)
    • Use everything to compute TLP: S_ovl = S_all / (S_seq × S_ilp)
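
As a worked illustration of the decomposition, here is a minimal Python sketch that computes the overall, sequential TLS, ILP, and overlap (TLP) speedups from measured times. The function name, argument names, and sample numbers are hypothetical. It also assumes that T1 and T2 are per-thread execution times in the single-processor TLS run and that T1’ and T2’ are the same threads' times in the multithreaded run; the slides do not spell this out.

```python
def decompose_speedup(t_seq, t_1p, t_mt, thread_times_1p, thread_times_mt):
    """Split overall speedup into sequential, ILP, and overlap (TLP) factors.

    t_seq: time of the original sequential binary (Tseq)
    t_1p: time of the TLS binary on one processor (T1p)
    t_mt: time of the multithreaded run (Tmt)
    thread_times_1p: per-thread times T1, T2, ... (assumed: single-processor run)
    thread_times_mt: per-thread times T1', T2', ... (assumed: multithreaded run)
    """
    s_all = t_seq / t_mt                                  # overall speedup
    s_seq = t_seq / t_1p                                  # sequential TLS speedup
    s_ilp = sum(thread_times_1p) / sum(thread_times_mt)   # speedup due to ILP
    s_ovl = s_all / (s_seq * s_ilp)                       # remainder attributed to TLP
    return {"all": s_all, "seq": s_seq, "ilp": s_ilp, "ovl": s_ovl}


if __name__ == "__main__":
    # Made-up timings, for illustration only.
    result = decompose_speedup(
        t_seq=100.0, t_1p=110.0, t_mt=60.0,
        thread_times_1p=[55.0, 55.0],
        thread_times_mt=[50.0, 50.0],
    )
    for name, value in result.items():
        print(f"S_{name} = {value:.3f}")
```

Because the overlap factor is defined as the remainder, the three factors always multiply back to the overall speedup, whatever the inputs are.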

Unified Execution Model
  • Can we improve TLS?
    • Some of the threads do not help
    • Slack in usage of cores
  • Improve TLP:
    • Requires a better compiler
  • Improve ILP:
    • Combine TLS with another SM model!
    • Most of the HW is common

Combining TLS, HT and RA
  • Start with TLS
  • Provide support to clone TLS threads and convert them to HT
  • Conversion to HT means:
    • Put them in RA mode
    • Suppress squashes and do not cause additional squashes
    • Discard them when they finish
  • No compiler slicing → purely HW approach
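
The conversion steps above can be pictured as flag changes on a cloned thread context. The sketch below is illustrative only: `ThreadContext`, `Mode`, `clone_as_helper`, and every field name are hypothetical and do not come from the presentation or from any simulator.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Mode(Enum):
    TLS = "tls"
    HELPER = "helper"


@dataclass(frozen=True)
class ThreadContext:
    # All fields are assumptions made for this sketch.
    thread_id: int
    mode: Mode = Mode.TLS
    runahead: bool = False           # executing in RA mode
    can_be_squashed: bool = True     # violations may squash this thread
    can_squash_others: bool = True   # this thread may trigger squashes
    commits_state: bool = True       # results are committed when it finishes


def clone_as_helper(parent: ThreadContext, new_id: int) -> ThreadContext:
    """Clone a TLS thread and convert the clone into a helper thread."""
    return replace(
        parent,
        thread_id=new_id,
        mode=Mode.HELPER,
        runahead=True,            # put it in RA mode
        can_be_squashed=False,    # suppress squashes
        can_squash_others=False,  # do not cause additional squashes
        commits_state=False,      # discard it when it finishes
    )


if __name__ == "__main__":
    tls_thread = ThreadContext(thread_id=3)
    helper = clone_as_helper(tls_thread, new_id=4)
    print(tls_thread)
    print(helper)
```

The parent context is left untouched, which mirrors the idea that the TLS thread continues normally while its clone only prefetches and warms up state.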

Intricacies to be Handled
  • HT may not prefetch effectively!
  • Dealing with contention
    • HT threads much faster → saturate BW
  • Dealing with thread ordering
    • TLS imposes total thread order
    • HT killed → squashes TLS threads

Creating and Terminating HT
  • Create an HT on an L2 miss that we can value predict (VP)
    • Use a memory-address-based confidence estimator
    • VP only if confident
  • Create an HT if we have a free processor
  • Only allow the most speculative thread to clone
    • Seamless integration of HT with TLS
    • BUT: if the parent is no longer the most speculative TLS thread, the HT has to be killed
  • Additionally kill the HT when:
    • Parent/HT thread finishes
    • HT causes exception
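
The creation and termination rules above can be sketched as two predicates plus a small address-indexed last-value predictor. This is a hedged illustration, not the authors' hardware design. The class and function names, the counter threshold, and treating the creation conditions as a conjunction are all assumptions.

```python
class LastValuePredictor:
    """Last-value predictor with per-address saturating confidence counters.

    The threshold and counter width are illustrative assumptions.
    """

    def __init__(self, threshold=2, max_conf=3):
        self.table = {}  # address -> (last_value, confidence)
        self.threshold = threshold
        self.max_conf = max_conf

    def predict(self, addr):
        """Return the predicted value, or None when not confident."""
        value, conf = self.table.get(addr, (None, 0))
        return value if conf >= self.threshold else None

    def update(self, addr, actual):
        """Train on the value that the load actually returned."""
        value, conf = self.table.get(addr, (None, 0))
        conf = min(conf + 1, self.max_conf) if value == actual else 0
        self.table[addr] = (actual, conf)


def should_create_ht(l2_miss, vp_confident, free_processor, parent_most_speculative):
    """Assumed policy: every listed creation condition must hold."""
    return l2_miss and vp_confident and free_processor and parent_most_speculative


def should_kill_ht(parent_most_speculative, parent_finished, ht_finished, ht_exception):
    """Kill the HT if the parent lost its position or any stop event occurred."""
    return (not parent_most_speculative) or parent_finished or ht_finished or ht_exception


if __name__ == "__main__":
    vp = LastValuePredictor()
    for _ in range(3):
        vp.update(0x40, 7)  # the same value is seen repeatedly, so confidence builds
    confident = vp.predict(0x40) is not None
    print(should_create_ht(True, confident, True, True))
    print(should_kill_ht(False, False, False, False))
```

Resetting confidence to zero on a mismatch is a deliberately conservative choice for the sketch: a mispredicted value would send the helper thread down a useless path, so it seems safer to stop predicting that address until it proves stable again.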

Experimental Setup
  • Simulator, Compiler and Benchmarks:
    • SESC (http://sesc.sourceforge.net/)
    • POSH (Liu et al., PPoPP ’06)
    • SPEC 2000 Int.
  • Architecture:
    • Four-way CMP, 4-issue cores
    • 16KB L1 Data (multi-versioned) and Instruction Caches
    • 1MB unified L2 Cache
    • Inst. window/ROB – 80/104 entries
    • 16KB Last Value Predictor

Comparing TLS, RunAhead and Unified Scheme
  • Almost additive benefits
  • 10.2% over TLS, 18.3% over RA

Understanding the extra ILP
  • Improvements of ILP come from:
    • Mainly memory
    • Branch prediction (improvement 0.5%)
  • Focus on memory:
    • Miss rate on committed path
    • Clustering of misses (different cost)

Normalized Shared Cache Misses
  • All schemes better than sequential
  • Unified 41% better than sequential

Isolated vs. Clustered Misses
  • Both TLS and RA → large-window machines
  • Unified does even better
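
To make the isolated/clustered distinction concrete, the sketch below classifies misses by issue time. The definition is an assumption made for illustration, since the slides do not give one: a miss counts as clustered when another miss is issued within one miss latency of it, so their latencies can overlap. The function name and the sample numbers are hypothetical.

```python
def classify_misses(miss_cycles, latency):
    """Count isolated and clustered misses.

    miss_cycles: cycles at which cache misses were issued
    latency: miss latency in cycles (assumed to be constant)
    Returns a tuple (isolated, clustered).
    """
    cycles = sorted(miss_cycles)
    isolated = clustered = 0
    for i, cycle in enumerate(cycles):
        # A neighbour closer than one latency overlaps with this miss.
        near_prev = i > 0 and cycle - cycles[i - 1] < latency
        near_next = i + 1 < len(cycles) and cycles[i + 1] - cycle < latency
        if near_prev or near_next:
            clustered += 1
        else:
            isolated += 1
    return isolated, clustered


if __name__ == "__main__":
    # Made-up trace: two overlapping pairs and one miss on its own.
    print(classify_misses([0, 50, 1000, 2000, 2100], latency=200))
```

Under this definition an isolated miss exposes its full latency, while clustered misses share it, which is one way to read the "different cost" noted for clustering.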

Also in the paper …
  • Dealing with the load of the system
  • Converting TLS threads to HT
  • Multiple HT
  • Effect of a better VP
  • Detailed comparison of performance model against existing models (Renau et al., ICS ’05)

Conclusions
  • CMPs are here to stay:
    • What about single threaded apps. and apps with significant seq. sections?
  • Different apps. require different SM techniques
    • Even within apps. different phases
  • We propose the first mixed execution model
    • TLS is nicely complemented by HT and RA
  • Our unified scheme outperforms existing SM models
    • TLS by 10.2% avg. (up to 41.2%)
    • RA by 18.3% avg. (up to 35.2%)
