Implicitly-Multithreaded Processors
Implicitly-Multithreaded Processors Il Park and Babak Falsafi and T. N. Vijaykumar

Presented by: Ashay Rane

Published in: SIGARCH Computer Architecture News, 2003


Agenda

  • Overview (IMT, state of the art)

  • IMT enhancements

  • Key results

  • Critique

  • Relation to Term Project


Implicitly Multithreaded Processor (IMT)

  • SMT with speculation

  • Optimizations to basic SMT support

  • Average perf. improvement of 24% (max: 69%)


State-of-the-art

  • Pentium 4 HT

  • IBM POWER5

  • MIPS MT


Speculative SMT operation

  • When a branch is encountered, start executing the likely path “speculatively”, i.e. allow for rollback (thread squash) in certain circumstances (misprediction, dependence violation)

  • Cost and overhead are offset by savings in execution time and power (worth the effort)

  • Commit is complicated because independent threads commit separately (one buffer per thread); issue, register renaming, and cache & TLB conflicts add further complexity

  • If dependence violation, squash thread and restart execution


How to buffer speculative data?

  • Load/Store Queue (LSQ)‏

    • Buffers data (along with its address)‏

    • Helps enforce dependency check

    • Makes rollback possible

  • Cache-based approaches
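As a rough illustration of the LSQ's role, here is a minimal Python sketch of a per-thread queue that buffers speculative stores, records speculative loads for dependence checks, and supports rollback or commit. The class and method names are hypothetical, not the paper's hardware design.

```python
class SpeculativeLSQ:
    """Illustrative per-thread load/store queue (not the actual hardware)."""

    def __init__(self):
        self.stores = []   # buffered (address, value) pairs, in program order
        self.loads = []    # addresses read speculatively from memory

    def store(self, addr, value):
        # Buffer the store instead of writing memory directly,
        # so a squash never has to undo memory writes.
        self.stores.append((addr, value))

    def load(self, addr, memory):
        # Forward from the youngest buffered store to the same address.
        for a, v in reversed(self.stores):
            if a == addr:
                return v
        self.loads.append(addr)          # record for later dependence checks
        return memory.get(addr, 0)

    def violates(self, addr):
        # An earlier thread storing to an address we already loaded means
        # we read a stale value -> this thread must be squashed.
        return addr in self.loads

    def rollback(self):
        # Squash: discard all buffered state; memory was never modified.
        self.stores.clear()
        self.loads.clear()

    def commit(self, memory):
        # Thread became non-speculative: drain buffered stores to memory.
        for addr, value in self.stores:
            memory[addr] = value
        self.stores.clear()
        self.loads.clear()
```

For example, a speculative load of an address later stored to by an earlier thread would make `violates()` return true, triggering a squash instead of a commit.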


IMT: Most significant improvements

  • Assistance from Multiscalar compiler

  • Resource- and dependence-aware fetch policy

  • Multiplexing threads on a single hardware context

  • Overlapping thread startup operations with the previous thread's execution


What does Compiler do?

  • Extracts threads from program (loops)‏

  • Generates a thread descriptor listing the registers read and written and the control-flow exits (used to set up rename tables)

  • Annotates instructions with special codes (“forward” & “release”) for dependence checking
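To make the compiler's output concrete, here is a hypothetical sketch of what a thread descriptor might carry. The field names and values are illustrative only, not the Multiscalar compiler's actual encoding.

```python
from dataclasses import dataclass, field

@dataclass
class ThreadDescriptor:
    # Registers read before being written in this thread (live-in).
    use_mask: set = field(default_factory=set)
    # Registers written by this thread (live-out); a successor thread
    # must wait for these before reading them.
    create_mask: set = field(default_factory=set)
    # Possible control-flow exit targets of the thread.
    exits: list = field(default_factory=list)

# Hypothetical descriptor for a loop-body thread:
loop_body = ThreadDescriptor(
    use_mask={'r1', 'r2'},
    create_mask={'r3'},
    exits=[0x400, 0x480],   # loop-back and fall-through targets (made up)
)
# An instruction performing the last write to r3 would carry a "release"
# annotation so its value can be forwarded to the successor thread.
```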


Fetch Policy

  • Hardware keeps track of resource utilization

  • Resource requirement prediction from past four execution instances

  • When dependencies exist (detected from compiler-generated data), bias towards non-speculative threads

  • Goal is to reduce number of thread squashes
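The fetch policy above can be sketched as follows. The four-instance history window comes from the slide; the averaging and the selection logic are assumptions for illustration.

```python
from collections import deque

class FetchPolicy:
    """Illustrative resource- and dependence-aware fetch policy sketch."""

    def __init__(self):
        self.history = {}   # thread id -> recent resource usages

    def record(self, tid, used):
        # Keep only the past four execution instances (per the slide).
        self.history.setdefault(tid, deque(maxlen=4)).append(used)

    def predict(self, tid):
        # Predict resource needs from the average of recent instances.
        h = self.history.get(tid)
        return sum(h) / len(h) if h else 0

    def choose(self, threads, free_resources, dependences_exist):
        # threads: list of (tid, is_speculative) in priority order.
        eligible = [(tid, spec) for tid, spec in threads
                    if self.predict(tid) <= free_resources]
        if dependences_exist:
            # Bias toward non-speculative threads to cut squashes.
            nonspec = [t for t in eligible if not t[1]]
            if nonspec:
                eligible = nonspec
        return eligible[0][0] if eligible else None
```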


Multiplexing threads on a single hardware context

  • Observations:

    • Threads usually short

    • Number of contexts is small (2-8)

      Hence frequent switching, less overlap


Multiplexing (contd.)

  • Larger threads can lead to:

    • Speculation buffer overflow

    • Increased dependence mis-speculation

    • Hence thread squashing

  • Each execution context can further support multiple threads (3-6)‏


Multiplexing: Required Hardware

  • Per context per thread:

    • Program Counter

    • Register rename table

  • LSQ shared among threads running on 1 execution context


Multiplexing: Implementation Issues

  • LSQ shared but it needs to maintain loads and stores for each thread separately

  • Therefore, create “gaps” for yet-to-be-fetched instructions / data

  • If space falls short, squash subsequent thread

  • What if threads from one program are mapped to different contexts?

  • IMT searches through other contexts

  • A separate LSQ per thread per context would be simpler, but costs extra area and power
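A minimal sketch of the shared-LSQ "gap" reservation described above, assuming made-up queue sizes and a simple squash-on-overflow rule:

```python
class SharedLSQ:
    """One context's LSQ shared by its threads: each thread pre-reserves a
    gap of entries so its loads/stores stay contiguous in thread order.
    Sizes and the interface are illustrative assumptions."""

    def __init__(self, size=8):
        self.size = size
        self.reserved = {}   # tid -> number of reserved slots (the "gap")
        self.used = {}       # tid -> entries actually filled so far

    def reserve(self, tid, predicted_entries):
        # Reserve a gap for a thread's yet-to-be-fetched loads/stores.
        if sum(self.reserved.values()) + predicted_entries > self.size:
            return False     # space falls short: squash this (later) thread
        self.reserved[tid] = predicted_entries
        self.used[tid] = 0
        return True

    def insert(self, tid):
        # Fill one slot of the thread's gap with a load/store entry.
        if self.used[tid] >= self.reserved[tid]:
            return False     # gap overflow: squash the subsequent thread
        self.used[tid] += 1
        return True
```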


Register renaming

  • Required because multiple threads may use same registers

  • Separate rename tables

  • Master Rename Table (global)‏

  • Local Rename Table (per thread)‏

  • Pre-assign table (per thread)‏


Register renaming: Flow

  • Thread Invocation:

    • Copy from Master table into Local table (to reflect current status)‏

    • Also use the “create” and “use” masks from the thread descriptor (for dependence checking)

  • Before every subsequent thread invocation:

    • Pre-assign rename maps into Pre-assign table

    • Copy from Pre-assign table to Master table and mark the registers as “busy”, so no successor thread can use them before the current thread writes to them
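The renaming flow above can be sketched as follows; the table representation and physical-register names are illustrative assumptions.

```python
class RenameTables:
    """Sketch of the Master / Local / Pre-assign table flow (illustrative)."""

    def __init__(self, num_regs=4):
        # Master table: architectural reg -> (physical reg, busy flag).
        self.master = {r: ('p%d' % r, False) for r in range(num_regs)}

    def invoke_thread(self, create_mask, next_phys):
        # On invocation: copy Master into the thread's Local table so it
        # reflects current mappings.
        local = dict(self.master)
        # Before the next thread is invoked: pre-assign new mappings for
        # every register this thread will write (its "create" mask) and
        # mark them busy in Master, so no successor can use them before
        # this thread writes them.
        preassign = {}
        for r in create_mask:
            preassign[r] = next_phys[r]
            self.master[r] = (next_phys[r], True)
        return local, preassign

    def write(self, reg, preassign):
        # The thread writes `reg`: clear the busy flag so successors
        # waiting on the pre-assigned mapping can proceed.
        self.master[reg] = (preassign[reg], False)
```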


Hiding thread startup delay

  • Rename tables must be set up before execution begins

  • Occupies table bandwidth, hence cannot be done for a number of threads in parallel

  • Hence overlap setting up of rename tables with previous thread’s execution


Load/Store Queue

  • Per context

  • Speculative load / store: Search through current and other contexts for dependence

  • No searching for non-speculative loads

  • Searching can take time, so load-dependent instructions are scheduled accordingly
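A sketch of the cross-context dependence search, assuming a simple list-of-entries representation for each context's LSQ (the data layout is hypothetical):

```python
def check_dependence(contexts, my_ctx, my_tid, addr):
    """Search the current context's LSQ, then the other contexts, for a
    conflicting store to `addr` from another thread. `contexts` is a list
    of LSQs, each a list of (tid, op, addr) entries; all illustrative."""
    # Search our own context first (the common case), then the rest,
    # since threads from one program may be mapped to different contexts.
    ordered = [contexts[my_ctx]] + [c for i, c in enumerate(contexts)
                                    if i != my_ctx]
    for lsq in ordered:
        for tid, op, a in lsq:
            if a == addr and tid != my_tid and op == 'store':
                return tid    # conflicting store found -> possible squash
    return None               # no dependence detected
```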



Key Results

  • Average improvement: 24%

  • Reduction in data dependence stalls

  • Little overhead of optimizations

  • Improvement not seen on all benchmark programs





Critique


Compiler Support

  • Improvement in applications compiled using Multiscalar compiler

  • Evaluated on scientific-computing applications, not desktop applications


LSQ Limitations

  • LSQ size limits the size of a speculative thread

  • Pentium 4 (without SMT): 48 loads, 24 stores

  • Pentium 4 HT: 24 loads, 12 stores per thread

  • IBM POWER5: 32 loads, 32 stores per thread


LSQ Limitations: Alternative

  • Cache-based approach, i.e. partition the cache to support different versions

  • Extra support required, but scalable


Register file size

  • IMT considers register file sizes of 128 and up.

  • Pentium 4 (as well as HT): register file size = 128

  • IBM POWER5: register file size = 80


Searching LSQ

  • Since loads and stores are organized per thread, a search must examine the entries of all other threads

  • If loads/stores were organized by address, fewer entries would need to be searched

  • Can make use of associativity of cache
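The address-organized alternative raised here could look like the following sketch, purely hypothetical: indexing LSQ entries by address (as a cache does) confines a dependence check to that address's entries rather than every thread's queue.

```python
from collections import defaultdict

class AddressIndexedLSQ:
    """Illustrative alternative: LSQ entries indexed by address instead of
    by thread, so a search only touches entries for one address."""

    def __init__(self):
        self.by_addr = defaultdict(list)   # addr -> [(tid, op), ...]

    def add(self, tid, op, addr):
        self.by_addr[addr].append((tid, op))

    def conflicting_stores(self, addr, my_tid):
        # Only this address's entries are examined, not the full queues
        # of every other thread.
        return [tid for tid, op in self.by_addr[addr]
                if op == 'store' and tid != my_tid]
```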


Searching LSQ (contd.)


So how is performance still high?

  • Assistance from Compiler

  • Resource and dependency-aware fetching

  • Multiple threads on an execution context

  • Overlapping rename table creation with execution


Term Project

  • “Cache-based throughput improvement techniques for Speculative SMT processors”

  • Optimizations from IMT

  • Increasing granularity to reduce number of thread squashes


Thank you

