Implicitly-Multithreaded Processors
Il Park, Babak Falsafi, and T. N. Vijaykumar

Presented by: Ashay Rane

Published in: SIGARCH Computer Architecture News, 2003



Agenda

  • Overview (IMT, state of the art)
  • IMT enhancements
  • Key results
  • Critique
  • Relation to Term Project

Implicitly Multithreaded Processor (IMT)

  • SMT with speculation
  • Optimizations to basic SMT support
  • Average performance improvement of 24% (max: 69%)


  • State-of-the-art commercial SMT: Pentium 4 HT

Speculative SMT operation

  • When a branch is encountered, start executing the likely path "speculatively", i.e. allow for rollback (thread squash) in certain circumstances (misprediction, dependence violation)
  • The cost and overhead are overcome by savings in execution time and power (worth the effort)
  • Complications arise because independent threads commit separately (a buffer per thread); also issue, register renaming, and cache & TLB conflicts
  • If dependence violation, squash thread and restart execution
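The squash-and-rollback idea above can be sketched in a few lines of Python. This is a minimal software analogy, not the paper's hardware: each speculative thread holds its stores in a private buffer, a squash discards the buffer, and only an unsquashed thread drains it to memory.

```python
# Software sketch of speculative execution with rollback (thread squash).
# Names (SpecThread, store_buffer) are illustrative, not from the paper.

class SpecThread:
    def __init__(self, tid):
        self.tid = tid
        self.store_buffer = {}   # addr -> value, held until commit
        self.squashed = False

    def store(self, addr, value):
        # buffered speculatively, not yet visible to other threads
        self.store_buffer[addr] = value

    def squash(self):
        # misprediction or dependence violation: discard all buffered work
        self.store_buffer.clear()
        self.squashed = True

    def commit(self, memory):
        # only a non-squashed thread makes its stores architecturally visible
        if not self.squashed:
            memory.update(self.store_buffer)
        self.store_buffer.clear()

memory = {}
t = SpecThread(1)
t.store(0x100, 42)
t.squash()          # dependence violation detected
t.commit(memory)    # nothing is written: memory stays empty
```

The key point the sketch captures: because stores are buffered per thread, a squash is cheap (drop the buffer and restart) rather than requiring memory state to be undone.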

How to buffer speculative data?

  • Load/Store Queue (LSQ)
    • Buffers data (along with its address)
    • Helps enforce dependence checks
    • Makes rollback possible
  • Cache-based approaches
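As a rough illustration of the dependence check an LSQ enables (the structure below is a simplification, not the paper's design): a speculative load records its address, and a later-arriving store from an *older* thread to the same address reveals that the load read stale data.

```python
# Hedged sketch of an LSQ-style dependence check. Thread ids encode program
# order: a smaller tid is an older (less speculative) thread.

class LSQ:
    def __init__(self):
        self.loads = []   # (thread_id, addr)
        self.stores = []  # (thread_id, addr, value)

    def load(self, tid, addr):
        self.loads.append((tid, addr))

    def store(self, tid, addr, value):
        self.stores.append((tid, addr, value))
        # any load by a more speculative (later) thread from this address
        # read stale data -> that thread must be squashed
        return [lt for (lt, la) in self.loads if la == addr and lt > tid]

lsq = LSQ()
lsq.load(2, 0x40)                  # speculative thread 2 reads 0x40 early
violators = lsq.store(1, 0x40, 7)  # older thread 1 then writes 0x40
# violators == [2]: thread 2 must be squashed and restarted
```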

IMT: Most significant improvements

  • Assistance from Multiscalar compiler
  • Resource- and dependence-aware fetch policy
  • Multiplexing threads on a single hardware context
  • Overlapping thread-startup operations with the previous thread's execution

What does Compiler do?

  • Extracts threads from the program (loops)
  • Generates a thread descriptor with data about registers read and written and control-flow exits (for rename tables)
  • Annotates instructions with special codes ("forward" & "release") for dependence checking
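The descriptor contents listed above can be pictured as a small record. The field names below are assumptions for illustration, not the Multiscalar compiler's actual format:

```python
# Illustrative thread-descriptor record: register create/use masks plus
# possible control-flow exit points, as the slide describes.
from dataclasses import dataclass, field

@dataclass
class ThreadDescriptor:
    create_mask: int = 0      # bitmask of registers this thread writes
    use_mask: int = 0         # bitmask of registers this thread reads
    exits: list = field(default_factory=list)  # possible exit PCs

# hypothetical thread writing r1 and r2, reading r1, with two exits
d = ThreadDescriptor(create_mask=0b0110, use_mask=0b0010,
                     exits=[0x400, 0x480])
assert d.create_mask & (1 << 1)   # r1 is written by this thread
```

Successor threads consult `create_mask` to know which registers they must wait for, and the exits feed branch prediction for the thread's end point.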

Fetch Policy

  • Hardware keeps track of resource utilization
  • Resource requirements predicted from the past four execution instances
  • When dependencies exist (detected from compiler-generated data), bias towards non-speculative threads
  • Goal is to reduce number of thread squashes
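A toy version of such a fetch choice, under simplifying assumptions of my own (a single "predicted need" number per thread and a flat pool of free buffer entries): skip threads predicted not to fit, and among the rest prefer non-speculative ones.

```python
# Sketch of a resource- and dependence-aware fetch pick (illustrative only).
# Each thread carries a 'speculative' flag and a 'predicted_need', standing
# in for the estimate derived from recent execution instances.

def pick_thread(threads, free_entries):
    # drop threads whose predicted resource need exceeds what is free
    eligible = [t for t in threads if t["predicted_need"] <= free_entries]
    if not eligible:
        return None
    # bias toward non-speculative threads to cut the squash risk
    eligible.sort(key=lambda t: (t["speculative"], t["predicted_need"]))
    return eligible[0]

threads = [
    {"id": 0, "speculative": False, "predicted_need": 6},
    {"id": 1, "speculative": True,  "predicted_need": 4},
]
assert pick_thread(threads, 8)["id"] == 0   # non-speculative wins
assert pick_thread(threads, 5)["id"] == 1   # thread 0 would not fit
```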

Multiplexing threads on a single hardware context

  • Observations:
    • Threads are usually short
    • Number of contexts is small (2-8)

Hence frequent switching and little overlap


Multiplexing (contd.)

  • Larger threads can lead to:
    • Speculation buffer overflow
    • Increased dependence mis-speculation
    • Hence thread squashing
  • Each execution context can further support multiple threads (3-6)

Multiplexing: Required Hardware

  • Per context per thread:
    • Program Counter
    • Register rename table
  • LSQ shared among threads running on 1 execution context

Multiplexing: Implementation Issues

  • The LSQ is shared, but it must maintain loads and stores for each thread separately
  • Therefore, create "gaps" (reserved slots) for yet-to-be-fetched instructions/data
  • If space falls short, squash the subsequent thread
  • What if threads from one program are mapped to different contexts? IMT searches through the other contexts
  • Multiple LSQs per context per thread would be easier, but poor in cost and power consumption
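The "gap" reservation and squash-on-overflow policy above can be sketched as a simple allocator. The queue size and per-thread needs below are invented for illustration:

```python
# Sketch of gap reservation in a shared per-context LSQ: each thread reserves
# slots for its not-yet-fetched memory ops; a thread that cannot fit is
# squashed rather than allowed to overflow the queue.

def reserve(lsq_free, predicted_ops):
    # returns remaining free entries, or None meaning "squash this thread"
    if predicted_ops > lsq_free:
        return None
    return lsq_free - predicted_ops

free = 16                    # hypothetical shared LSQ size
free = reserve(free, 6)      # thread A reserves a gap of 6 slots
free = reserve(free, 6)      # thread B reserves a gap of 6 slots
assert free == 4
assert reserve(free, 6) is None   # thread C does not fit -> squash
```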

Register renaming

  • Required because multiple threads may use the same registers
  • Separate rename tables:
    • Master Rename Table (global)
    • Local Rename Table (per thread)
    • Pre-assign Table (per thread)

Register renaming: Flow

  • Thread invocation:
    • Copy from the Master table into the Local table (to reflect current status)
    • Also use the "create" and "use" masks from the thread descriptor (for dependence checking)
  • Before every subsequent thread invocation:
    • Pre-assign rename maps into the Pre-assign table
    • Copy from the Pre-assign table to the Master table and mark the registers as "busy", so no successor thread can use them before the current thread writes to them
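A minimal sketch of that flow, with dictionaries standing in for the tables. `next_phys` is my stand-in for a physical-register free list, and the 32-register loop bound is an assumption, not the paper's mechanism:

```python
# Sketch of the rename-table flow: snapshot Master into Local at invocation,
# then pre-assign a physical register for every architectural register the
# thread will write (its create mask) and mark those registers busy.

def invoke_thread(master, create_mask, next_phys):
    local = dict(master)              # copy Master -> Local (current status)
    preassign = {}
    for r in range(32):               # assumed 32 architectural registers
        if create_mask & (1 << r):
            preassign[r] = next_phys  # pre-assign a physical register
            next_phys += 1
    busy = set()
    for r, p in preassign.items():
        master[r] = p                 # copy Pre-assign into Master
        busy.add(r)                   # successor threads must wait on these
    return local, preassign, busy

master = {0: 10, 1: 11}               # arch reg -> physical reg
local, pre, busy = invoke_thread(master, create_mask=0b10, next_phys=40)
assert local == {0: 10, 1: 11}        # snapshot from before pre-assignment
assert master[1] == 40 and 1 in busy  # r1 remapped and marked busy
```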

Hiding thread startup delay

  • Rename tables must be set up before execution begins
  • This occupies table bandwidth, hence cannot be done for many threads in parallel
  • Hence, overlap the setup of rename tables with the previous thread's execution

Load/Store Queue

  • One per context
  • Speculative loads/stores: search through the current and other contexts for dependences
  • No searching needed for non-speculative loads
  • Searching can take time, so load-dependent instructions are scheduled accordingly

Average improvement: 24%

  • Reduction in data-dependence stalls
  • Little overhead from the optimizations
  • Not all benchmark programs improve

Assumes 2-3 threads per context and 6-8 LSQ entries per thread.

  • Performance relative to IMT with unlimited resources

ICOUNT: Favor the thread with the fewest instructions in the pipeline's pre-issue stages

  • Biased-ICOUNT: Favor non-speculative threads
  • Worst-case resource estimation
  • Reduced thread squashing

TME: Executes both paths of an unpredictable branch (but such branches are uncommon)

  • DMT:
    • Hardware selection of threads: spawns threads on backward branches or function calls instead of loops
    • Also spawns threads out of order, so branch-prediction accuracy is lower

Compiler Support

  • Improvement shown on applications compiled with the Multiscalar compiler
  • Scientific-computing applications, not desktop applications

LSQ Limitations

  • LSQ size decides the size of a speculative thread
  • Pentium 4 (without SMT): 48 loads, 24 stores
  • Pentium 4 HT: 24 loads, 12 stores per thread
  • IBM POWER5: 32 loads, 32 stores per thread

LSQ Limitations: Alternative

  • Cache-based approach, i.e. partition the cache to hold different speculative versions
  • Extra support required, but scalable

Register file size

  • IMT considers register file sizes of 128 and up
  • Pentium 4 (as well as HT): register file size = 128
  • IBM POWER5: register file size = 80

Searching LSQ

  • Since loads and stores are organized per thread, a search involves all entries of the other threads
  • If loads/stores were organized by address, fewer entries would need searching
  • Could exploit cache-style associativity
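The contrast drawn above can be made concrete with a toy comparison (data and sizes invented): scanning per-thread queues probes every entry, while an address-indexed structure, like a set-associative cache lookup, goes straight to the matching entries.

```python
# Illustrative comparison of the two LSQ organizations: per-thread queues
# (scan everything) vs an address-indexed structure (one lookup).

def search_by_thread(queues, addr):
    # queues: thread_id -> list of (addr, value); scan all entries
    hits, probes = [], 0
    for tid, entries in queues.items():
        for a, v in entries:
            probes += 1
            if a == addr:
                hits.append((tid, v))
    return hits, probes

def search_by_address(index, addr):
    # index: addr -> list of (thread_id, value); single lookup
    return index.get(addr, []), 1

# same contents, two organizations
queues = {0: [(0x10, 1), (0x20, 2)], 1: [(0x20, 3)]}
index = {0x10: [(0, 1)], 0x20: [(0, 2), (1, 3)]}
assert search_by_thread(queues, 0x20)[1] == 3   # 3 probes
assert search_by_address(index, 0x20)[1] == 1   # 1 probe, same hits
```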

So how is performance still high?

  • Assistance from Compiler
  • Resource- and dependence-aware fetching
  • Multiple threads on an execution context
  • Overlapping rename table creation with execution

Term project

  • “Cache-based throughput improvement techniques for Speculative SMT processors”
  • Optimizations from IMT
  • Increasing granularity to reduce number of thread squashes