Lecture 6: Scoreboard (Contd.) and Tomasulo’s Algorithm

Instructor: Laxmi Bhuyan
Lec. 7

Three Parts of the Scoreboard
1. Instruction status: which of the 4 steps the instruction is in (Issue, Read Operands, Execute, Write Result).
2. Functional unit status: the state of each functional unit (FU). There are 9 fields for each functional unit:
   • Busy: whether the unit is busy or not
   • Op: operation to perform in the unit (e.g., + or –)
   • Fi: destination register
   • Fj, Fk: source-register numbers
   • Qj, Qk: functional units producing source registers Fj, Fk
   • Rj, Rk: flags indicating that Fj, Fk are ready and not yet read. Set to No after the operands are read.
3. Register result status: which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
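As a rough illustration, the three parts can be written down as plain data structures. This is only a sketch in Python, not the CDC 6600 hardware layout; the class names `FUStatus` and `Scoreboard` and the functional unit names are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, Optional


@dataclass
class FUStatus:
    """The 9 functional-unit status fields listed above."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # FU producing Fj (None = no pending producer)
    qk: Optional[str] = None   # FU producing Fk
    rj: bool = False           # True = Fj is ready and not yet read
    rk: bool = False           # True = Fk is ready and not yet read


@dataclass
class Scoreboard:
    # 1. Instruction status: stage reached by each issued instruction.
    instr_status: Dict[int, str] = field(default_factory=dict)
    # 2. Functional unit status: one record per unit (unit names are assumed).
    fu_status: Dict[str, FUStatus] = field(default_factory=dict)
    # 3. Register result status: register -> FU that will write it.
    result: Dict[str, str] = field(default_factory=dict)


sb = Scoreboard(fu_status={name: FUStatus() for name in
                           ("Integer", "Mult1", "Mult2", "Add", "Divide")})
field_names = [f.name for f in fields(FUStatus)]
print(field_names)
```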

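Detailed Scoreboard Pipeline Control (A.55 on page A-76)
For each instruction status: the condition to wait for, then the bookkeeping performed. FU is the unit used by the instruction, D its destination register, S1 and S2 its source registers.
• Issue
   Wait until: not Busy(FU) and not Result(D) (the Result(D) check prevents WAW)
   Bookkeeping: Busy(FU) ← Yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
• Read operands
   Wait until: Rj and Rk
   Bookkeeping: Rj ← No; Rk ← No
• Execution complete
   Wait until: functional unit done
• Write result
   Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No)) (prevents WAR)
   Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No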
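The wait conditions and bookkeeping can be sketched as small Python functions over dictionaries. This is a minimal sketch under assumptions: the function names and the dictionary layout are hypothetical, and clearing Qj/Qk when a result arrives is an addition (not in the rules above) that keeps a stale producer name from matching a later write by the same unit.

```python
def new_fu():
    """One functional-unit status record with the 9 fields."""
    return dict(Busy=False, Op=None, Fi=None, Fj=None, Fk=None,
                Qj=None, Qk=None, Rj=False, Rk=False)


def can_issue(fus, result, fu, d):
    # Structural hazard check plus WAW check on destination d.
    return not fus[fu]["Busy"] and result.get(d) is None


def issue(fus, result, fu, op, d, s1, s2):
    u = fus[fu]
    u.update(Busy=True, Op=op, Fi=d, Fj=s1, Fk=s2,
             Qj=result.get(s1), Qk=result.get(s2))
    u["Rj"] = u["Qj"] is None   # ready only if no unit is still producing it
    u["Rk"] = u["Qk"] is None
    result[d] = fu


def can_read(fus, fu):
    return fus[fu]["Rj"] and fus[fu]["Rk"]


def read_operands(fus, fu):
    fus[fu]["Rj"] = fus[fu]["Rk"] = False


def can_write(fus, fu):
    # WAR check: no unit may still hold a ready, unread source equal to Fi(FU).
    fi = fus[fu]["Fi"]
    return all((f["Fj"] != fi or not f["Rj"]) and (f["Fk"] != fi or not f["Rk"])
               for f in fus.values())


def write_result(fus, result, fu):
    for f in fus.values():
        if f["Qj"] == fu:
            f["Rj"], f["Qj"] = True, None   # clearing Qj is an added assumption
        if f["Qk"] == fu:
            f["Rk"], f["Qk"] = True, None   # clearing Qk is an added assumption
    result[fus[fu]["Fi"]] = None
    fus[fu]["Busy"] = False


fus = {name: new_fu() for name in ("Mult1", "Divide", "Add")}
result = {}
issue(fus, result, "Mult1", "MULTD", "F0", "F2", "F4")
issue(fus, result, "Divide", "DIVD", "F10", "F0", "F6")  # waits on Mult1 for F0
issue(fus, result, "Add", "ADDD", "F6", "F8", "F2")
read_operands(fus, "Mult1")
read_operands(fus, "Add")
war_blocked = not can_write(fus, "Add")     # DIVD has not read F6 yet
write_result(fus, result, "Mult1")          # F0 produced, DIVD's Rj becomes Yes
divd_ready = can_read(fus, "Divide")
read_operands(fus, "Divide")
war_cleared = can_write(fus, "Add")
waw_blocked = not can_issue(fus, result, "Mult1", "F6")  # Add still owns F6
print(war_blocked, divd_ready, war_cleared, waw_blocked)
```

In this usage, ADDD is held at Write Result until DIVD has read F6, and a new instruction writing F6 cannot issue while ADDD is still pending.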

Scoreboard Example
• The following latencies are chosen to illustrate behavior; they are not representative
• LD: 1 cycle (compute address + data cache access)
• ADDD and SUBD: 2 cycles
• Multiply: 10 cycles
• Divide: 40 cycles

Scoreboard Example

Scoreboard Example Cycle 1

Scoreboard Example Cycle 2
Note: I2 can’t issue because the Integer unit is busy. No later instruction can issue either, because issue is in order.

Scoreboard Example Cycle 5
I2 is now issued.

Scoreboard Example Cycle 7
I3 is stalled at Read Operands because I2 isn’t complete.

Scoreboard Example Cycle 9
Note: I3 and I4 read operands because F2 is now available. ADDD (I6) can’t be issued because SUBD (I4) is using the adder.

Scoreboard Example Cycle 11
Note: The add unit takes 2 cycles, so nothing happens in cycle 10. MULTD continues.

Scoreboard Example Cycle 13
ADDD is now issued because SUBD has completed.

Scoreboard Example Cycle 15
Note: ADDD takes 2 cycles, so there is no change.

Scoreboard Example Cycle 16
ADDD completes execution; MULTD and DIVD are still in flight.

Scoreboard Example Cycle 17
ADDD stalls: it can’t write back because of the WAR hazard with DIVD. MULTD and DIVD continue.

Scoreboard Example Cycle 18
MULTD and DIVD continue.

Scoreboard Example Cycle 19
MULTD completes execution after 10 cycles.

Scoreboard Example Cycle 20
MULTD writes its result to F0.

Scoreboard Example Cycle 21
DIVD now reads its operands because F0 is available.

Scoreboard Example Cycle 22
ADDD writes its result because the WAR hazard is removed.

Scoreboard Example Cycle 61
DIVD completes execution.

Scoreboard Example Cycle 62
Execution is finished.
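The cycle notes above can be reproduced with a small timing model. The sketch below rests on several assumptions that are not stated in these notes: the six-instruction sequence and the functional unit mix (1 integer unit, 1 adder, 2 multipliers, 1 divider) are the usual textbook choice, picked because they agree with the notes (I2 is a load, I4 is SUBD, I6 is ADDD, DIVD reads F0 and F6). The model also assumes that every decision in cycle t sees only the state left at the end of cycle t-1. The names `Instr`, `simulate` and `src_ready` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

LATENCY = {"LD": 1, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}
UNIT = {"LD": "Integer", "ADDD": "Add", "SUBD": "Add",
        "MULTD": "Mult", "DIVD": "Divide"}
UNIT_COUNT = {"Integer": 1, "Add": 1, "Mult": 2, "Divide": 1}  # assumed FU mix


@dataclass(eq=False)
class Instr:
    op: str
    dest: str
    srcs: Tuple[str, ...]
    issue: Optional[int] = None   # cycle in which each step happened
    read: Optional[int] = None
    done: Optional[int] = None    # execution complete
    write: Optional[int] = None
    producers: Tuple[Optional["Instr"], ...] = ()  # pending writer per source


def before(event, t):
    """True if the event happened in a cycle strictly earlier than t."""
    return event is not None and event < t


def src_ready(f, k, t):
    """Source k of f is ready if it has no producer or the producer has written."""
    p = f.producers[k]
    return p is None or before(p.write, t)


def simulate(program):
    t, next_issue = 0, 0
    while any(i.write is None for i in program):
        t += 1
        assert t < 10_000, "simulation did not converge"
        # Instructions holding a functional unit at the start of cycle t.
        active = [i for i in program
                  if before(i.issue, t) and not before(i.write, t)]
        for ins in active:
            if ins.read is None:
                # Read operands: wait until both sources are ready (RAW).
                if all(src_ready(ins, k, t) for k in range(len(ins.srcs))):
                    ins.read, ins.done = t, t + LATENCY[ins.op]
            elif ins.write is None and before(ins.done, t):
                # Write result: stall while another unit still has to read
                # a ready source equal to our destination (WAR).
                war = any(
                    f is not ins and not before(f.read, t)
                    and any(s == ins.dest and src_ready(f, k, t)
                            for k, s in enumerate(f.srcs))
                    for f in active)
                if not war:
                    ins.write = t
        if next_issue < len(program):
            # In-order issue: needs a free unit and no pending writer of dest (WAW).
            cand = program[next_issue]
            busy = sum(UNIT[f.op] == UNIT[cand.op] for f in active)
            waw = any(f.dest == cand.dest for f in active)
            if busy < UNIT_COUNT[UNIT[cand.op]] and not waw:
                cand.issue = t
                cand.producers = tuple(
                    next((f for f in active if f.dest == s), None)
                    for s in cand.srcs)
                next_issue += 1
    return t


program = [  # assumed instruction sequence (not listed in these notes)
    Instr("LD", "F6", ("R2",)),
    Instr("LD", "F2", ("R3",)),
    Instr("MULTD", "F0", ("F2", "F4")),
    Instr("SUBD", "F8", ("F6", "F2")),
    Instr("DIVD", "F10", ("F0", "F6")),
    Instr("ADDD", "F6", ("F8", "F2")),
]
total = simulate(program)
for ins in program:
    print(f"{ins.op:6}{ins.dest:4} issue={ins.issue} read={ins.read} "
          f"done={ins.done} write={ins.write}")
print("finished in cycle", total)
```

Under these assumptions the model agrees with the notes: I2 issues in cycle 5, I3 and I4 read in cycle 9, ADDD issues in cycle 13 and writes in cycle 22, and DIVD writes in cycle 62.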

Review: Scoreboard
• Limitations of the 6600 scoreboard
   • No forwarding
   • Limited to instructions in a basic block (small window)
   • Large number of functional units (structural hazards)
   • Stall on WAR hazards
   • Stall on WAW hazards
• Example:
   DIV.D F0, F2, F4
   ADD.D F6, F0, F8
   S.D F6, 0(R1)
   SUB.D F8, F10, F14
   MUL.D F6, F10, F8
• WAR (antidependence): ADD.D reads F8, which SUB.D later writes
• WAW (output dependence): ADD.D and MUL.D both write F6
• Both are name dependences
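The hazards in the five-instruction example can be listed mechanically. The following Python sketch compares every pair of instructions; the tuple encoding `(op, dest, srcs)` and the function name `find_hazards` are hypothetical. It is deliberately conservative: it ignores whether an intervening instruction redefines the register.

```python
def find_hazards(program):
    """Return (kind, earlier index, later index, register) for every pair."""
    found = []
    for i, (_, dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            _, dest_j, srcs_j = program[j]
            if dest_i is not None and dest_i in srcs_j:
                found.append(("RAW", i, j, dest_i))   # true dependence
            if dest_j is not None and dest_j in srcs_i:
                found.append(("WAR", i, j, dest_j))   # antidependence
            if dest_i is not None and dest_i == dest_j:
                found.append(("WAW", i, j, dest_i))   # output dependence
    return found


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),      # a store writes no register
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
hazards = find_hazards(program)
for kind, i, j, reg in hazards:
    print(f"{kind} on {reg}: {program[i][0]} -> {program[j][0]}")
```

Besides the WAR on F8 and the WAW on F6 marked above, the pairwise check also reports a WAR on F6 between S.D and MUL.D.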

Another Dynamic Algorithm: Tomasulo Algorithm
• Designed for the IBM 360/91, about 3 years after the CDC 6600
• Goal: high performance without special compilers
• Differences between the Tomasulo algorithm and the scoreboard:
   • Control and buffers are distributed with the functional units rather than centralized in the scoreboard; the buffers are called “reservation stations”
   • Registers in instructions are replaced by pointers to reservation station buffers
   • Hardware renaming of registers avoids WAW hazards
   • Buffering operand values avoids WAR hazards
   • The Common Data Bus broadcasts results to all FUs
   • Loads and stores are treated as FUs as well
• Why study it? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

Figure: FP unit and load-store unit using Tomasulo’s algorithm

Another Dynamic Algorithm: Tomasulo Algorithm
• Example after register renaming (S replaces F6 in ADD.D and S.D; T replaces F8 in SUB.D and MUL.D):
   DIV.D F0, F2, F4
   ADD.D S, F0, F8
   S.D S, 0(R1)
   SUB.D T, F10, F14
   MUL.D F6, F10, T
• Implemented through reservation stations (rs) per functional unit
   • An rs buffers an operand as soon as it is available, which avoids WAR hazards
   • Pending instructions designate the rs that will provide their inputs, which avoids WAW hazards
   • Only the last write in a sequence of writes to the same register actually updates the register
• Hazard detection and execution control are decentralized
• Instruction results are passed directly to the FUs from the reservation stations rather than through registers
   • Through the common data bus (CDB)
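The effect of renaming can be sketched in a few lines of Python. This is an illustration under assumptions, not the 360/91 mechanism: it gives every destination a fresh name `T<n>` (hypothetical), whereas the hardware uses a limited set of reservation station tags and the example above renames only F6 and F8. The function name `rename` and the tuple encoding are also hypothetical.

```python
def rename(program):
    """Give each destination a fresh name and redirect later readers to it."""
    current = {}    # architectural register -> name holding its latest value
    renamed = []
    for n, (op, dest, srcs) in enumerate(program):
        # Sources read the most recent name of each register.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = f"T{n}"          # fresh name: never written before
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)

# After renaming, no name is written twice (no WAW) and no later
# instruction writes a name that an earlier one reads (no WAR).
dests = [d for _, d, _ in renamed if d is not None]
no_waw = len(set(dests)) == len(dests)
no_war = all(renamed[j][1] not in renamed[i][2]
             for i in range(len(renamed)) for j in range(i + 1, len(renamed)))
print(no_waw, no_war)
```

The RAW dependences survive renaming (ADD.D still reads the result of DIV.D, MUL.D still reads the result of SUB.D); only the name dependences disappear.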

Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the FP Op Queue.
   • Stall on a structural hazard, i.e. no free reservation station (rs).
   • If a reservation station is free, the issue logic sends the instruction to it and reads the operands into the rs if they are ready (register renaming => solves WAR).
   • Mark the destination register as waiting for this latest instruction, even if a previous instruction writing the same register hasn’t completed => solves WAW hazards.
2. Execution: operate on operands (EX).
   • When both operands are ready, execute; if not, watch the CDB for the result => solves RAW.
3. Write result: finish execution (WB).
   • Write on the Common Data Bus to all awaiting units; mark the reservation station available.
   • Write the result into the destination register only if the register status still names this reservation station (r) => solves WAW.
• Normal data bus: data + destination (“go to” bus)
• CDB: data + source (“come from” bus)
   • 64 bits of data + 4 bits of functional unit source address
   • A waiting unit captures the value if the source matches the functional unit it expects to produce its operand
   • Does broadcast
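The three stages can be sketched as follows. This is a minimal Python model under assumptions, not the 360/91 design: stations are dictionaries, the station names (`Div1`, `Add1`, `Add2`, `Mult1`), the register values and the helper names are all hypothetical, and execution latency is ignored so the completion order is chosen by hand.

```python
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.truediv}


def issue(rs, reg_val, reg_qi, name, op, dest, s1, s2):
    """Stage 1: fill a free station; copy ready operands, else record the producer."""
    st = rs[name]
    assert not st["Busy"], "structural hazard: issue would stall here"
    st.update(Busy=True, Op=op)
    for v, q, src in (("Vj", "Qj", s1), ("Vk", "Qk", s2)):
        if reg_qi.get(src):                 # a pending instruction will write src
            st[v], st[q] = None, reg_qi[src]
        else:                               # buffer the value now (avoids WAR)
            st[v], st[q] = reg_val[src], None
    reg_qi[dest] = name                     # latest writer wins (avoids WAW)


def ready(st):
    """Stage 2 precondition: both operands have been captured."""
    return st["Busy"] and st["Qj"] is None and st["Qk"] is None


def execute_and_write(rs, reg_val, reg_qi, name):
    """Stages 2 and 3: compute, then broadcast (source=name, value) on the CDB."""
    st = rs[name]
    assert ready(st), "operands not yet available (RAW)"
    value = OPS[st["Op"]](st["Vj"], st["Vk"])
    for other in rs.values():               # every station snoops the bus
        if other["Qj"] == name:
            other["Vj"], other["Qj"] = value, None
        if other["Qk"] == name:
            other["Vk"], other["Qk"] = value, None
    for r, q in reg_qi.items():             # register written only if still ours
        if q == name:
            reg_val[r], reg_qi[r] = value, None
    st["Busy"] = False
    return value


rs = {n: dict(Busy=False, Op=None, Vj=None, Vk=None, Qj=None, Qk=None)
      for n in ("Div1", "Add1", "Add2", "Mult1")}
reg_val = {"F2": 8.0, "F4": 2.0, "F8": 1.0, "F10": 5.0, "F14": 3.0}
reg_qi = {}
issue(rs, reg_val, reg_qi, "Div1", "DIV", "F0", "F2", "F4")
issue(rs, reg_val, reg_qi, "Add1", "ADD", "F6", "F0", "F8")    # waits for Div1
issue(rs, reg_val, reg_qi, "Add2", "SUB", "F8", "F10", "F14")
issue(rs, reg_val, reg_qi, "Mult1", "MUL", "F6", "F10", "F8")  # waits for Add2
waiting = [n for n in rs if not ready(rs[n])]

sub = execute_and_write(rs, reg_val, reg_qi, "Add2")   # SUB finishes first
mul = execute_and_write(rs, reg_val, reg_qi, "Mult1")
div = execute_and_write(rs, reg_val, reg_qi, "Div1")
add = execute_and_write(rs, reg_val, reg_qi, "Add1")   # uses the old F8 value
print(sub, mul, div, add, reg_val["F6"], reg_val["F8"])
```

Here the ADD still sees the old F8 even though SUB overwrote it first (WAR avoided), and F6 ends up holding the MUL result even though the ADD writes later (WAW avoided).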

Reservation Station Components
• Op: operation to perform in the unit (e.g., + or –)
• Vj, Vk: values of the source operands
• Qj, Qk: names of the reservation stations that will produce the source operands. A value of zero means the source operand is already available in Vj or Vk, or is not needed.
• Busy: indicates that the reservation station or FU is busy
Register File Status Qi
• Qi: indicates which functional unit will write each register, if one exists. Blank (0) when no pending instruction will write that register, meaning that the value is already available.
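The field encoding can be illustrated with a small record type. This Python sketch assumes that stations are numbered from 1 so that 0 can mean "no producer", and that a tag fits in 4 bits; the class name `ReservationStation`, the method names and the example numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    tag: int                     # station number; 0 is reserved for "no producer"
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, valid when qj / qk are 0
    vk: Optional[float] = None
    qj: int = 0                  # tag of the station producing each operand
    qk: int = 0
    busy: bool = False

    def ready(self) -> bool:
        """Both operands present: Qj and Qk are zero."""
        return self.busy and self.qj == 0 and self.qk == 0

    def capture(self, src_tag: int, value: float) -> None:
        """Snoop one broadcast: take the value if it comes from an awaited station."""
        assert 1 <= src_tag <= 15, "assumed 4-bit source tag, 0 never broadcast"
        if self.qj == src_tag:
            self.vj, self.qj = value, 0
        if self.qk == src_tag:
            self.vk, self.qk = value, 0


# Register file status Qi: 0 means the register value is already available.
qi = {"F0": 5, "F6": 3, "F8": 0}

st = ReservationStation(tag=3, op="ADD", vk=1.0, qj=5, busy=True)
ready_before = st.ready()
st.capture(7, 9.0)       # broadcast from an unrelated station: ignored
st.capture(5, 4.0)       # broadcast from station 5: fills Vj
ready_after = st.ready()
print(ready_before, ready_after, st.vj, st.vk)
```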

Tomasulo Example Cycle 0
