
Chapter 3: Limitations on Instruction-Level Parallelism

Bernard Chen Ph.D.

University of Central Arkansas



Overcome Data Hazards with Dynamic Scheduling

  • If there is a data dependence, the hazard detection hardware stalls the pipeline

  • No new instructions are fetched or issued until the dependence is cleared

  • Dynamic Scheduling: the hardware rearranges the instruction execution order to reduce the stalls while maintaining data flow and exception behavior



RAW

  • If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped

  • If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard

I: add r1,r2,r3

J: sub r4,r1,r3



Overcome Data Hazards with Dynamic Scheduling

  • Key idea: Allow instructions behind stall to proceed

    DIV F0 <- F2 / F4
    ADD F10 <- F0 + F8
    SUB F12 <- F8 - F14



Overcome Data Hazards with Dynamic Scheduling

  • Key idea: Allow instructions behind stall to proceed

    DIV F0 <- F2 / F4

    SUB F12 <- F8 - F14

    ADD F10 <- F0 + F8



Overcome Data Hazards with Dynamic Scheduling

  • Key idea: Allow instructions behind stall to proceed

    DIV F0 <- F2 / F4

    SUB F12 <- F8 - F14

    ADD F10 <- F0 + F8

  • Enables out-of-order execution and allows out-of-order completion (e.g., SUB)

  • In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
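A minimal sketch of this behavior (not from the slides; the latencies and the `schedule` helper are illustrative assumptions): instructions issue in order, but each starts executing once its source registers are ready, so SUB completes long before the ADD that waits on DIV.

```python
# Toy model of in-order issue with out-of-order completion.
# Latencies are illustrative assumptions, not from the slides.
LATENCY = {"DIV": 10, "ADD": 1, "SUB": 1}

def schedule(instrs):
    """instrs: list of (op, dest, srcs) in program order.
    Returns (op, completion cycle) for each instruction."""
    ready = {}                 # register -> cycle its value is available
    completions = []
    issue_cycle = 0
    for op, dest, srcs in instrs:
        issue_cycle += 1       # in-order issue: one instruction per cycle
        # Execution starts once all source registers are ready.
        start = max([issue_cycle] + [ready.get(r, 0) for r in srcs])
        finish = start + LATENCY[op]
        ready[dest] = finish
        completions.append((op, finish))
    return completions

result = schedule([("DIV", "F0", ["F2", "F4"]),
                   ("ADD", "F10", ["F0", "F8"]),
                   ("SUB", "F12", ["F8", "F14"])])
# SUB finishes before ADD: out-of-order completion despite in-order issue.
```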



Overcome Data Hazards with Dynamic Scheduling

  • It offers several advantages:

    • It simplifies the compiler

    • It allows code that was compiled for one pipeline to run efficiently on a different pipeline

    • It allows the processor to tolerate unpredictable delays, such as cache misses



Overcome Data Hazards with Dynamic Scheduling

  • However, dynamic execution creates WAR and WAW hazards and makes exceptions harder to handle

  • Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name

  • There are 2 versions of name dependence



I: sub r4,r1,r3

J: add r1,r2,r3

K: mul r6,r1,r7

WAR

  • InstrJ writes operand before InstrI reads it

  • If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard



I: sub r1,r4,r3

J: add r1,r2,r3

K: mul r6,r1,r7

WAW

  • InstrJ writes operand before InstrI writes it.

  • If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard



Example

DIV r0 <- r2 / r4

ADD r6 <- r0 + r8

SUB r8 <- r10 - r14

MUL r6 <- r10 * r7

OR r3 <- r5 or r9


Example RAW

  • ADD reads r0, which DIV writes: a RAW dependence on r0


Example WAR

  • SUB writes r8, which the earlier ADD reads: a WAR dependence on r8


Example WAW

  • MUL writes r6, which the earlier ADD also writes: a WAW dependence on r6
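The dependences in this example can be found mechanically by comparing each instruction's destination and sources against every later instruction. A sketch under an assumed (name, dest, srcs) tuple encoding (the `find_hazards` helper is hypothetical, not from the slides):

```python
def find_hazards(instrs):
    """Classify RAW/WAR/WAW dependences between all pairs of
    instructions, given in program order as (name, dest, srcs)."""
    deps = []
    for i, (ni, di, si) in enumerate(instrs):
        for nj, dj, sj in instrs[i + 1:]:
            if di in sj:            # later instruction reads an earlier write
                deps.append(("RAW", ni, nj))
            if dj in si:            # later instruction writes an earlier read
                deps.append(("WAR", ni, nj))
            if dj == di:            # both instructions write the same register
                deps.append(("WAW", ni, nj))
    return deps

deps = find_hazards([("DIV", "r0", ["r2", "r4"]),
                     ("ADD", "r6", ["r0", "r8"]),
                     ("SUB", "r8", ["r10", "r14"]),
                     ("MUL", "r6", ["r10", "r7"]),
                     ("OR",  "r3", ["r5", "r9"])])
# Finds the RAW on r0 (DIV->ADD), the WAR on r8 (ADD->SUB),
# and the WAW on r6 (ADD->MUL).
```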



For you to practice

  • DIV r0 <- r2 / r4

  • ADD r6 <- r0 + r8

  • ST r1 <- r6

  • SUB r8 <- r10 - r14

  • MUL r6 <- r10 * r8



Overcome Data Hazards with Dynamic Scheduling

  • Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so that the instructions do not conflict

    • Register renaming resolves name dependences for registers

    • This can be done either by the compiler or by hardware
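A sketch of the renaming idea (the `rename` helper and the unbounded pool of physical registers p0, p1, … are assumptions for illustration): every write gets a fresh physical name, and reads go through the current mapping, so only true (RAW) dependences survive.

```python
import itertools

def rename(instrs):
    """Rewrite each destination to a fresh physical register so that
    only true dependences remain. instrs: (op, dest, srcs) in order."""
    fresh = (f"p{i}" for i in itertools.count())
    mapping = {}                                   # architectural -> physical
    renamed = []
    for op, dest, srcs in instrs:
        srcs = [mapping.get(r, r) for r in srcs]   # read current names
        mapping[dest] = next(fresh)                # fresh name per write
        renamed.append((op, mapping[dest], srcs))
    return renamed

renamed = rename([("DIV", "r0", ["r2", "r4"]),
                  ("ADD", "r6", ["r0", "r8"]),
                  ("SUB", "r8", ["r10", "r14"]),
                  ("MUL", "r6", ["r10", "r7"])])
# ADD and MUL now write different registers (no WAW), and SUB's write
# no longer clashes with the r8 that ADD reads (no WAR).
```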



Limits to ILP

Assumptions for ideal/perfect machine to start:

1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided

2. Branch prediction – perfect; no mispredictions

3. Perfect Cache
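Under these assumptions only true (RAW) dependences constrain the schedule, so the ILP upper bound is set by the dataflow critical path. A toy sketch (unit latency and the `dataflow_schedule` helper are assumptions) applied to the earlier five-instruction example:

```python
def dataflow_schedule(instrs):
    """Earliest start cycle per instruction when only RAW dependences
    matter (infinite registers, perfect prediction, perfect cache).
    Unit latency is assumed for every instruction."""
    ready = {}                    # register -> cycle its value is ready
    starts = []
    for op, dest, srcs in instrs:
        start = max([0] + [ready.get(r, 0) for r in srcs])
        ready[dest] = start + 1
        starts.append(start)
    return starts

starts = dataflow_schedule([("DIV", "r0", ["r2", "r4"]),
                            ("ADD", "r6", ["r0", "r8"]),
                            ("SUB", "r8", ["r10", "r14"]),
                            ("MUL", "r6", ["r10", "r7"]),
                            ("OR",  "r3", ["r5", "r9"])])
# Only ADD must wait for DIV; the other four can all start in cycle 0,
# so the ideal machine runs 5 instructions in 2 cycles (ILP = 2.5).
```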



Limits to ILP HW Model comparison



Performance beyond single thread ILP

  • There can be much higher natural parallelism in some applications

  • Such as an online transaction-processing system, which has natural parallelism among the multiple queries and updates that are presented by requests



Thread-level parallelism (TLP)

  • Thread: process with own instructions and data

    • A thread may be one process of a parallel program of multiple processes, or it may be an independent program

    • Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute



Thread-level parallelism (TLP)

  • TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel

  • Goal: Use multiple instruction streams to improve

    • Throughput of computers that run many programs

    • Execution time of multi-threaded programs

  • TLP could be more cost-effective to exploit than ILP


New Approach: Multithreaded Execution

  • Multithreading: multiple threads to share the functional units of 1 processor via overlapping

  • The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table


New Approach: Multithreaded Execution

  • When to switch?

    • Alternate between threads on each instruction (fine-grained)

    • When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse-grained)



Fine-Grained Multithreading

  • Switches between threads on each instruction, causing the execution of multiple threads to be interleaved

  • Usually done in a round-robin fashion, skipping any stalled threads

  • The CPU must be able to switch threads every clock cycle
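A sketch of the fine-grained policy (the thread encoding, the `stalled` predicate, and the `fine_grained` helper are assumptions, not from the slides): each cycle the scheduler issues one instruction from the next ready thread in round-robin order, skipping stalled threads.

```python
from collections import deque

def fine_grained(threads, stalled=lambda t, cycle: False, cycles=4):
    """threads: {name: deque of instructions}. Each cycle, issue from
    the next thread in round-robin order that has work and is not
    stalled; returns the (cycle, thread, instruction) trace."""
    order = list(threads)
    trace = []
    idx = 0
    for cycle in range(cycles):
        for _ in range(len(order)):          # try each thread once
            t = order[idx % len(order)]
            idx += 1
            if threads[t] and not stalled(t, cycle):
                trace.append((cycle, t, threads[t].popleft()))
                break                        # one instruction per cycle
    return trace

trace = fine_grained({"A": deque(["a1", "a2"]),
                      "B": deque(["b1", "b2"])})
# With no stalls the two threads simply alternate: A, B, A, B.
```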


Multithreaded Categories

Fine-Grained: (diagram) instructions from Threads 1–5 interleaved cycle by cycle





Fine-Grained Multithreading

  • The advantage is that it can hide both short and long stalls, since instructions from other threads are executed when one thread stalls

  • The disadvantage is that it slows down the execution of individual threads, since a thread ready to execute without stalls is delayed by instructions from other threads


Coarse-Grained Multithreading

  • Switches threads only on costly stalls, such as cache misses

  • Advantages

    • Relieves need to have very fast thread-switching

    • Doesn’t slow down thread, since instructions from other threads issued only when the thread encounters a costly stall


Coarse-Grained Multithreading

  • The disadvantage is that it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs

    • Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen

    • The new thread must fill the pipeline before instructions can complete

  • Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
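A sketch of the coarse-grained policy (the thread encoding, the miss model, and the 2-cycle refill penalty are illustrative assumptions): one thread runs until it hits a costly stall, then the processor switches and the new thread pays a pipeline-refill penalty before its instructions complete.

```python
def coarse_grained(threads, misses, refill=2):
    """threads: {name: list of instructions}; misses: {name: set of
    instruction indices that cause a costly stall}. Switch threads on
    a miss, paying `refill` extra cycles; returns the issue trace."""
    names = list(threads)
    pos = {t: 0 for t in names}
    cur, cycle, trace = 0, 0, []
    while any(pos[t] < len(threads[t]) for t in names):
        t = names[cur]
        if pos[t] >= len(threads[t]):        # thread finished: move on
            cur = (cur + 1) % len(names)
            continue
        i = pos[t]
        trace.append((cycle, t, threads[t][i]))
        pos[t] += 1
        cycle += 1
        if i in misses.get(t, set()):
            cur = (cur + 1) % len(names)     # switch on the costly stall
            cycle += refill                  # new thread refills pipeline
    return trace

trace = coarse_grained({"A": ["a1", "a2"], "B": ["b1"]}, {"A": {0}})
# A's miss after a1 forces a switch; B issues only after the 2-cycle refill.
```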


Multithreaded Categories

Coarse-Grained: (diagram) each of Threads 1–5 runs for a block of cycles, with a 2-clock-cycle switch on a costly stall

