
Lecture 6: Score Board Contd. and Tomasulo’s Algorithm

Instructor: Laxmi Bhuyan

Lec. 7



Three Parts of the Scoreboard

1. Instruction status: which of the 4 steps the instruction is in (Issue, Read Operands, Execute, Write Result)

2. Functional unit status: the state of each functional unit (FU), held in 9 fields per unit

Busy: whether the unit is busy

Op: operation to perform in the unit (e.g., + or –)

Fi: destination register

Fj, Fk: source-register numbers

Qj, Qk: functional units producing source registers Fj, Fk

Rj, Rk: flags indicating that Fj, Fk are ready and not yet read; set to No after the operands are read

3. Register result status: which functional unit will write each register, if any; blank when no pending instruction will write that register
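As a rough illustration, the three tables could be modeled as plain data structures. The sketch below is in Python; the class name `FUStatus`, the functional unit names, and the sample instruction string are assumptions made for illustration and do not come from the lecture.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass
class FUStatus:
    """One row of the functional unit status table (9 fields)."""
    busy: bool = False            # Busy
    op: Optional[str] = None      # Op
    fi: Optional[str] = None      # Fi: destination register
    fj: Optional[str] = None      # Fj: first source register
    fk: Optional[str] = None      # Fk: second source register
    qj: Optional[str] = None      # Qj: FU producing Fj (None = no producer)
    qk: Optional[str] = None      # Qk: FU producing Fk (None = no producer)
    rj: bool = False              # Rj: Fj ready and not yet read
    rk: bool = False              # Rk: Fk ready and not yet read


# 1. Instruction status: instruction -> current step (sample entry is hypothetical)
instruction_status: Dict[str, str] = {"LD F6, 34(R2)": "Issue"}

# 2. Functional unit status: one row per unit (unit names are hypothetical)
fu_status: Dict[str, FUStatus] = {
    name: FUStatus() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")
}

# 3. Register result status: register -> FU that will write it (None = blank)
register_result: Dict[str, Optional[str]] = {f"F{i}": None for i in range(0, 32, 2)}

field_count = len(fields(FUStatus))
print(field_count)
```

Using `None` for a blank entry keeps the "no pending writer" case distinct from any real unit name.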


Detailed Scoreboard Pipeline Control

Issue
    Wait until: not Busy(FU) and not Result(D) (the Result(D) check prevents WAW)
    Bookkeeping: Busy(FU)← yes; Op(FU)← op; Fi(FU)← 'D'; Fj(FU)← 'S1'; Fk(FU)← 'S2'; Qj← Result('S1'); Qk← Result('S2'); Rj← not Qj; Rk← not Qk; Result('D')← FU

Read operands
    Wait until: Rj and Rk
    Bookkeeping: Rj← No; Rk← No

Execution complete
    Wait until: functional unit done
    Bookkeeping: none

Write result
    Wait until: ∀f((Fj(f)≠Fi(FU) or Rj(f)=No) & (Fk(f)≠Fi(FU) or Rk(f)=No)) (prevents WAR)
    Bookkeeping: ∀f(if Qj(f)=FU then Rj(f)← Yes); ∀f(if Qk(f)=FU then Rk(f)← Yes); Result(Fi(FU))← 0; Busy(FU)← No

(A.55 on page A-76)
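The wait conditions and bookkeeping can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the lecture's implementation: the unit names, the helper function names, and the three-instruction scenario are hypothetical, and clearing Qj/Qk on write is an extra step beyond the table, added so that a stale producer name cannot match later.

```python
def new_fu():
    return dict(busy=False, op=None, fi=None, fj=None, fk=None,
                qj=None, qk=None, rj=False, rk=False)

fus = {name: new_fu() for name in ("Mult", "Add", "Divide")}  # hypothetical units
result = {}  # register result status: register -> FU that will write it

def can_issue(fu, d):
    # Issue waits until the unit is free and no pending write targets D (WAW).
    return not fus[fu]["busy"] and result.get(d) is None

def issue(fu, op, d, s1, s2):
    qj, qk = result.get(s1), result.get(s2)
    fus[fu].update(busy=True, op=op, fi=d, fj=s1, fk=s2,
                   qj=qj, qk=qk, rj=qj is None, rk=qk is None)
    result[d] = fu

def can_read(fu):
    return fus[fu]["rj"] and fus[fu]["rk"]

def read_operands(fu):
    fus[fu]["rj"] = fus[fu]["rk"] = False

def can_write(fu):
    # Write waits until no unit still has to read the old value of Fi (WAR).
    fi = fus[fu]["fi"]
    return all((f["fj"] != fi or not f["rj"]) and (f["fk"] != fi or not f["rk"])
               for f in fus.values())

def write_result(fu):
    for f in fus.values():
        if f["qj"] == fu:
            f["rj"], f["qj"] = True, None   # clearing Qj is an addition to the table
        if f["qk"] == fu:
            f["rk"], f["qk"] = True, None   # clearing Qk is an addition to the table
    result[fus[fu]["fi"]] = None
    fus[fu]["busy"] = False

# Hypothetical scenario: DIVD needs F0 from MULTD; ADDD wants to overwrite F6,
# which DIVD has not read yet.
issue("Mult", "MULTD", "F0", "F2", "F4")
issue("Divide", "DIVD", "F10", "F0", "F6")
issue("Add", "ADDD", "F6", "F8", "F2")
waw_blocked = not can_issue("Mult", "F6")    # Mult is busy and F6 has a pending writer
read_operands("Mult")
read_operands("Add")
divd_waits = not can_read("Divide")          # RAW on F0
addd_war_stall = not can_write("Add")        # WAR on F6
write_result("Mult")
divd_ready = can_read("Divide")
read_operands("Divide")
addd_can_write = can_write("Add")
print(divd_waits, addd_war_stall, divd_ready, addd_can_write)
```

The scenario mirrors the situation in the example where ADDD finishes early but must hold its result until DIVD has read its operands.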


Scoreboard Example

  • The following latencies are chosen to illustrate behavior; they are not representative of real hardware

  • LD takes 1 cycle

    • (compute address + data cache access)

  • ADDD and SUBD take 2 cycles

  • Multiply takes 10 cycles

  • Divide takes 40 cycles


Scoreboard Example Cycle 1


Scoreboard Example Cycle 2

Note: Can’t issue I2 because the Integer unit is busy. Later instructions can’t issue either, because issue is in order.


Scoreboard Example Cycle 3


Scoreboard Example Cycle 4


Scoreboard Example Cycle 5

Now I2 is issued


Scoreboard Example Cycle 6


Scoreboard Example Cycle 7

I3 stalled at read because I2 isn’t complete


Scoreboard Example Cycle 8


Scoreboard Example Cycle 9

Note: I3 and I4 read operands because F2 is now available. ADDD (I6) can’t be issued because SUBD (I4) uses the adder


Scoreboard Example Cycle 11

Note: Add takes 2 cycles, so nothing happens in cycle 10. MUL continues.


Scoreboard Example Cycle 12


Scoreboard Example Cycle 13

Now ADDD is issued because SUBD has completed


Scoreboard Example Cycle 14


Scoreboard Example Cycle 15

Note: ADDD takes 2 cycles, so no change


Scoreboard Example Cycle 16

ADDD completes, but MULTD and DIVD go on


Scoreboard Example Cycle 17

ADDD stalls, can’t write back due to WAR with DIVD. MULT and DIV continue


Scoreboard Example Cycle 18

MULT and DIV continue


Scoreboard Example Cycle 19

MULTD completes execution after 10 cycles


Scoreboard Example Cycle 20

MULTD completes and writes to F0


Scoreboard Example Cycle 21

Now DIVD reads because F0 is available


Scoreboard Example Cycle 22

ADDD writes result because WAR is removed.


Scoreboard Example Cycle 61

DIVD completes execution


Scoreboard Example Cycle 62

Execution is finished


Review: Scoreboard

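  • Limitations of 6600 scoreboard

    • No forwarding

    • Limited to instructions in basic block (small window)

    • Limited number of functional units (structural hazards)

    • Stall on WAR hazards

    • Stall on WAW hazards

      DIV.D  F0, F2, F4

      ADD.D  F6, F0, F8

      S.D    F6, 0(R1)

      SUB.D  F8, F10, F14

      MUL.D  F6, F10, F8

  • WAR (antidependence): SUB.D writes F8, which the earlier ADD.D reads; MUL.D writes F6, which the earlier S.D reads

  • WAW (output dependence): MUL.D writes F6, which the earlier ADD.D also writes

  • Both are name dependences: the instructions share a register name, not a data value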
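The dependences in this five-instruction sequence can be found mechanically. The Python sketch below is only an illustration: the tuple encoding of instructions and the function name `find_dependences` are assumptions, and the pairwise check ignores intervening redefinitions, which is enough for this short sequence.

```python
# (opcode, destination register or None, source registers)
program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D",   None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]


def find_dependences(prog):
    """Return (kind, earlier opcode, later opcode, register) for every pair."""
    found = []
    for i, (op_i, dst_i, src_i) in enumerate(prog):
        for op_j, dst_j, src_j in prog[i + 1:]:
            if dst_i is not None and dst_i in src_j:
                found.append(("RAW", op_i, op_j, dst_i))   # true data dependence
            if dst_j is not None and dst_j in src_i:
                found.append(("WAR", op_i, op_j, dst_j))   # antidependence
            if dst_i is not None and dst_i == dst_j:
                found.append(("WAW", op_i, op_j, dst_i))   # output dependence
    return found


dependences = find_dependences(program)
for dep in dependences:
    print(dep)
```

Only the WAR and WAW entries are name dependences; the RAW entries carry real data and cannot be removed by renaming.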


Another Dynamic Algorithm: Tomasulo Algorithm

  • For IBM 360/91, about 3 years after CDC 6600

  • Goal: High performance without special compilers

  • Differences between Tomasulo Algorithm & Scoreboard

    • Control & buffers are distributed with the Functional Units (as “reservation stations”) rather than centralized in a scoreboard

    • Registers in instructions replaced by pointers to reservation station buffers

    • HW renaming of registers to avoid WAW hazards

    • Buffer operand values to avoid WAR hazards

    • Common Data Bus broadcasts results to all FUs

    • Loads and Stores treated as FUs as well

  • Why study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, Power PC 604 …


FP unit and load-store unit using Tomasulo’s alg.


Another Dynamic Algorithm: Tomasulo Algorithm

Register renaming (S and T are temporary names):

DIV.D  F0, F2, F4

ADD.D  S, F0, F8

S.D    S, 0(R1)

SUB.D  T, F10, F14

MUL.D  F6, F10, T

  • Implemented through reservation stations (rs) per functional unit

    • Buffers an operand as soon as it is available, which avoids WAR hazards

    • Pending instructions designate the rs that will provide their inputs, which avoids WAW hazards

    • Only the last write in a sequence of writes to the same register actually updates the register

    • Hazard detection and execution control are decentralized

    • Results are passed directly to the FUs from the rs rather than through registers

      • Through the common data bus (CDB)
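To make the renaming idea concrete, here is a small Python sketch. It is an assumption-laden simplification: it gives every destination a fresh name (`T0`, `T1`, ...), whereas the slide renames only the two registers involved in name dependences, and the function name `rename` and the tuple encoding are hypothetical.

```python
# (opcode, destination register or None, source registers)
program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D",   None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]


def rename(prog):
    """Give each written register a fresh name; readers use the newest name."""
    current = {}      # architectural register -> most recent new name
    renamed = []
    counter = 0
    for op, dst, srcs in prog:
        # Sources are looked up before the destination is renamed, so an
        # instruction that reads and writes the same register sees the old name.
        new_srcs = [current.get(s, s) for s in srcs]
        new_dst = None
        if dst is not None:
            new_dst = f"T{counter}"
            counter += 1
            current[dst] = new_dst
        renamed.append((op, new_dst, new_srcs))
    return renamed


renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, ADD.D still reads the original F8 while MUL.D reads the name produced by SUB.D, so the WAR and WAW hazards disappear and only the true (RAW) dependences remain.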


Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue

Stall on a structural hazard, i.e. no free reservation station (rs). If an rs is free, the issue logic sends the instruction to it and reads any ready operands into the rs (register renaming, which solves WAR). The destination register's status is set to wait for this latest instruction even if an earlier instruction writing the same register has not completed (solves WAW).

2. Execution: operate on operands (EX)

Execute when both operands are ready; if they are not, watch the CDB for the missing results (solves RAW).

3. Write result: finish execution (WB)

Broadcast the result on the Common Data Bus to all waiting units and mark the reservation station available. Write the result into the destination register only if the register's status still names this reservation station r (solves WAW).

  • Normal data bus: data + destination (“go to” bus)

  • CDB: data + source (“come from” bus)

    • 64 bits of data + 4 bits of Functional Unit source address

    • A unit captures the value if the source matches the Functional Unit it expects to produce its operand

    • The CDB broadcasts to all units
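The "come from" behavior of the CDB in the write-result stage can be sketched as follows. This Python model is illustrative only: the `Station` class, the function name `write_result`, the station names, and the numeric values are assumptions, and `None` stands in for the "no producer" encoding.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Station:
    name: str
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # producing station, while the value is pending
    qk: Optional[str] = None
    busy: bool = True


def write_result(tag: str, value: float, stations: List[Station],
                 qi: Dict[str, Optional[str]], regs: Dict[str, float]) -> None:
    """Broadcast (source tag, value) on the CDB; every listener checks the tag."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
        if rs.name == tag:
            rs.busy = False              # the producing station is free again
    for reg, producer in qi.items():
        if producer == tag:              # write only if the register still waits on us
            regs[reg] = value
            qi[reg] = None


add1 = Station("Add1", vk=2.0, qj="Mult1")   # waiting for Mult1's result
mult1 = Station("Mult1", vj=3.0, vk=4.0)
stations = [add1, mult1]
regs = {"F0": 0.0, "F6": 0.0}
qi = {"F0": "Mult1", "F6": "Add1"}

write_result("Mult1", 12.0, stations, qi, regs)

# A later instruction (hypothetically in Mult2) now also targets F6,
# so Add1's older result must not reach the register file.
qi["F6"] = "Mult2"
write_result("Add1", 14.0, stations, qi, regs)
print(regs, qi)
```

The last call shows the WAW case: Add1's value is still broadcast to any waiting stations, but F6 is left for the newer producer.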


Reservation Station Components

Op: operation to perform in the unit (e.g., + or –)

Vj, Vk: values of the source operands

Qj, Qk: names of the RSs that will provide the source operands. A value of zero means the source operand is already available in Vj or Vk, or is not needed.

Busy: indicates that the reservation station or FU is busy

Register File Status Qi:

Qi: indicates which functional unit will write each register, if one exists. Blank (0) when no pending instruction will write that register, meaning that the value is already available.
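The fields above come together at issue time. The following Python sketch is a simplification under stated assumptions: it uses a single pool of two stations rather than stations per functional unit, represents "zero" as `None`, and the function name `issue`, the station names, and the register values are hypothetical.

```python
def issue(op, dst, src1, src2, stations, qi, regs):
    """Place an instruction in a free station; return its name or None (stall)."""
    rs = next((s for s in stations if not s["busy"]), None)
    if rs is None:
        return None                       # structural hazard: no free station
    rs.update(busy=True, op=op, vj=None, vk=None, qj=None, qk=None)
    for v, q, src in (("vj", "qj", src1), ("vk", "qk", src2)):
        producer = qi.get(src)
        if producer is None:
            rs[v] = regs[src]             # value available: copy it now
        else:
            rs[q] = producer              # value pending: remember who produces it
    qi[dst] = rs["name"]                  # the newest writer owns the register
    return rs["name"]


stations = [dict(name=n, busy=False, op=None, vj=None, vk=None, qj=None, qk=None)
            for n in ("Add1", "Add2")]
regs = {"F0": 0.0, "F6": 0.0, "F8": 8.0, "F10": 10.0, "F14": 14.0}
qi = {"F0": "Mult1"}                      # F0 is still being computed elsewhere

first = issue("ADD.D", "F6", "F0", "F8", stations, qi, regs)
second = issue("SUB.D", "F8", "F10", "F14", stations, qi, regs)
third = issue("ADD.D", "F6", "F8", "F10", stations, qi, regs)
print(first, second, third)
```

Because the first ADD.D copied F8 into Vk when it issued, the later SUB.D may overwrite F8 whenever it finishes; the third instruction stalls only because both stations are occupied.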

Tomasulo Example: Cycles 0 to 12, 15, 16, 56, and 57

[The per-cycle status tables from these slides are not included in the transcript.]


Branch Prediction (3.4, 3.5)


Branch Prediction

  • Easiest (static prediction)

    • Always taken, always not taken

    • Opcode based

    • Displacement based (forward not taken, backward taken)

    • Compiler directed (branch likely, branch not likely)

  • Next easiest

    • 1 bit predictor – remember last taken/not taken per branch

      • Use a branch-prediction buffer or branch-history table

      • Use part of the PC (low-order bits) to index buffer/table

        • Multiple branches may share the same bit

      • Invert the bit if the prediction is wrong

      • Backward branches for loops will be mispredicted twice per loop execution: on the first iteration and on the final (exit) iteration


Q: Assume a loop branch is taken nine times in a row, then not taken once. What is the prediction accuracy using 1-bit predictor?

A: At the start of a loop execution, the predictor says not taken, because the bit was set to “0” the last time execution left the loop. So the first branch is mispredicted, and the bit is set to “1”. Predictions are then correct until the final iteration, where the branch is predicted taken but is not taken. So, 2 mispredictions in 10 branch executions => 80% accuracy.
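The 80% figure can be checked with a short simulation. This Python sketch assumes the predictor bit starts at "0" (not taken), as left behind by the previous loop exit; the function name `one_bit_accuracy` is hypothetical.

```python
def one_bit_accuracy(outcomes, initial_bit=False):
    """Simulate a 1-bit predictor: predict whatever the branch did last time."""
    bit = initial_bit
    correct = 0
    for taken in outcomes:
        if bit == taken:
            correct += 1
        bit = taken          # remember the latest outcome (inverts the bit on a miss)
    return correct / len(outcomes)


loop_branch = [True] * 9 + [False]   # taken nine times, then not taken once
accuracy = one_bit_accuracy(loop_branch)
print(accuracy)
```

The two misses are the first iteration (stale "not taken" bit) and the last iteration (stale "taken" bit).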

How about a 2-bit predictor? Let the prediction be changed only after it misses twice in a row.


2-bit Branch Prediction

  • Has 4 states instead of 2, allowing for more information about tendencies

  • A prediction must miss twice before it is changed

  • Good for backward branches of loops
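One common way to realize the 4 states is a 2-bit saturating counter; the sketch below assumes that scheme (states 0 to 3, predict taken when the counter is 2 or 3), which is one possible state machine rather than necessarily the one on the slide. The function name and the starting state are also assumptions.

```python
def two_bit_accuracy(outcomes, counter=2):
    """Simulate a 2-bit saturating counter; return (accuracy, final counter)."""
    correct = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct / len(outcomes), counter


loop_branch = [True] * 9 + [False]   # same loop as in the 1-bit question
accuracy, final_counter = two_bit_accuracy(loop_branch)
print(accuracy, final_counter)
```

Starting from the state left by a previous loop exit (counter 2), only the final iteration is mispredicted, and the counter returns to 2, so every later execution of the loop behaves the same way.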


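Branch History Table

[Figure: the branch PC selects an entry in the BHT, e.g. one holding the 2-bit state 01]

  • Has limited size

  • 2 bits by N entries (e.g. N = 4K)

  • A 4K-entry table performs about the same as an infinite one (see Fig. 3.9)

  • Uses low-order bits of branch PC to choose entry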
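Indexing by low-order PC bits means two branches can land on the same entry. The Python sketch below illustrates this; it assumes 4-byte instructions (so the two lowest PC bits are dropped), 2-bit saturating counters initialized to 1, and hypothetical branch addresses.

```python
class BranchHistoryTable:
    def __init__(self, entries=4096):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        self.counters = [1] * entries      # 2-bit counters, start weakly not taken

    def index(self, pc):
        return (pc >> 2) & self.mask       # low-order bits of the word address

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)


bht = BranchHistoryTable()
pc_a = 0x400100
pc_b = pc_a + 4096 * 4                     # differs only above the index bits

bht.update(pc_a, True)
bht.update(pc_a, True)
same_entry = bht.index(pc_a) == bht.index(pc_b)
aliased_prediction = bht.predict(pc_b)     # pc_b was never executed
print(same_entry, aliased_prediction)
```

There is no tag check, so the table cannot tell the two branches apart; the prediction for the second branch is simply whatever the first one left behind.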


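Can we do better?

  • Correlating branch predictors also look at other branches for clues

    if (aa==2)         T
      aa = 0
    if (bb==2)         T
      bb = 0
    if (aa!=bb) { …    NT

  • If the first two conditions are both true (T), aa and bb are both 0, so the third condition is false (NT)

(1,1) predictor: uses the history of 1 branch and a 1-bit predictor, so each entry holds two predictions: one used if the last branch was NT and one used if the last branch was T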


Correlating Branch Predictor

  • If we use 2 branches as history, then there are 4 possibilities (T-T, T-NT, NT-T, NT-NT).

  • For each possibility, we need a separate predictor (1-bit or 2-bit).

  • And this repeats for every branch.

(2,2) branch prediction: 2 branches of history, each selecting a 2-bit predictor


Performance of Correlating Branch Prediction

  • With the same number of state bits, a (2,2) predictor performs better than a noncorrelating 2-bit predictor.

  • It even outperforms a 2-bit predictor with an infinite number of entries


General (m,n) Branch Predictors

  • The global history register (GHR) is an m-bit shift register that records the outcomes of the last m branches encountered by the processor

  • Usually both the PC address and the GHR are used (2-level)

[Figure: the PC and the m-bit GHR pass through a combining function that selects one of the n-bit predictors]
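A compact Python sketch of an (m,n) predictor follows. Several choices here are assumptions rather than part of the lecture: the combining function concatenates PC index bits with the GHR, counters are n-bit saturating counters starting at 0, instructions are 4 bytes, and the class and function names are hypothetical.

```python
class CorrelatingPredictor:
    def __init__(self, m, n, index_bits=4):
        self.m, self.n, self.index_bits = m, n, index_bits
        self.ghr = 0                                   # m-bit global history
        self.max_count = (1 << n) - 1
        self.table = [0] * (1 << (index_bits + m))     # n-bit counters

    def _index(self, pc):
        pc_part = (pc >> 2) & ((1 << self.index_bits) - 1)
        return (pc_part << self.m) | self.ghr          # combine PC bits and history

    def predict(self, pc):
        return self.table[self._index(pc)] >= (1 << (self.n - 1))

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.table[i]
        self.table[i] = min(self.max_count, c + 1) if taken else max(0, c - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)


def mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 10                       # T, NT, T, NT, ...
with_history = mispredictions(CorrelatingPredictor(m=1, n=2), 0x40, alternating)
without_history = mispredictions(CorrelatingPredictor(m=0, n=2), 0x40, alternating)
print(with_history, without_history)
```

With one bit of history the predictor learns "after T comes NT, after NT comes T" and stops missing after warm-up, while the plain 2-bit counter (m = 0) keeps missing every taken branch in this pattern.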


Is Branch Predictor Enough?

  • When is using branch prediction beneficial?

    • When the branch outcome is known later than the target

    • For example, in our standard MIPS pipeline, we compute the target in the ID stage, but testing the branch condition there incurs a structural hazard on the register file.

  • If we predict the branch is taken and suppose it is correct, what is the target address?

    • Need a mechanism to provide target address as well

  • Can we eliminate the one cycle delay for the 5-stage pipeline?

    • Need to fetch from branch target immediately after branch


Branch Target Buffer (BTB)

Is the current instruction a branch?

  • The BTB provides the answer before the current instruction is decoded, and therefore enables fetching from the target to begin right after the IF stage.

What is the branch target?

  • The BTB provides the branch target if the prediction is a taken direct branch (for not-taken branches the target is simply PC+4).
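A BTB lookup during fetch might look like the Python sketch below. It is a simplified model under stated assumptions: the BTB is an unbounded dictionary keyed by the full branch PC, the direction predictor is passed in as a function, and all names and addresses are hypothetical.

```python
from typing import Callable, Dict

btb: Dict[int, int] = {}     # branch PC -> target seen the last time it was taken


def next_fetch_pc(pc: int, predict_taken: Callable[[int], bool]) -> int:
    """Choose the next fetch address during IF, before the instruction is decoded."""
    target = btb.get(pc)
    if target is not None and predict_taken(pc):
        return target        # BTB hit and predicted taken: fetch from the target
    return pc + 4            # not a known branch, or predicted not taken


def resolve_branch(pc: int, taken: bool, target: int) -> None:
    """Record the target once the branch has actually executed."""
    if taken:
        btb[pc] = target


always_taken = lambda pc: True
never_taken = lambda pc: False

first = next_fetch_pc(0x1000, always_taken)     # BTB miss: fall through
resolve_branch(0x1000, True, 0x0F00)            # the loop branch turns out taken
second = next_fetch_pc(0x1000, always_taken)    # BTB hit: redirect immediately
third = next_fetch_pc(0x1000, never_taken)      # hit, but predicted not taken
print(hex(first), hex(second), hex(third))
```

A hit in the BTB answers both questions at once: the instruction at this PC is a branch, and this is where it went last time.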
