
Lecture 9: Dynamic Scheduling of Pipeline

CS510 Computer Architectures


Static vs Dynamic Scheduling

  • Static Scheduling by compiler

    • Code motion for LD delay slots and branch delay slots

    • Code motion for avoiding data dependency

    • In-order instruction issue:

      • If an instruction is stalled, no later instructions can proceed.

      • Multiple copies of a functional unit may sit idle, which is inefficient

  • Dynamic Scheduling by Hardware

    • Allows out-of-order execution and out-of-order completion

    • Even when an instruction is stalled, later instructions that have no data dependence on the stalled instruction (or on the instruction causing the stall) can proceed

    • Makes efficient use of multiple functional units


HW Schemes: Instruction Parallelism

  • Why scheduling in HW at run time?

    • Works when dependencies are unknown at compile time

    • Simpler compiler

    • Code for one machine runs well on another

  • Key idea: Allow instructions behind stall to proceed

    DIVD F0,F2,F4

    ADDD F10,F0,F8

    SUBD F8,F8,F14

    In DLX, SUBD cannot execute while ADDD is stalled waiting for DIVD, even if a separate adder is available, because DLX maintains in-order execution.

    • Enables out-of-order execution => out-of-order completion

    • DLX ID stage: checks for both structural hazards and data dependencies


HW Schemes: Instruction Parallelism

  • Out-of-order execution divides ID stage:

    1. Issue - Decode instructions, check for structural hazards

    2. Read operands - Wait until no data hazards, then read operands

  • Scoreboards (Control Data Corp. CDC 6600) allow an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions

    • Centralized implementation of hazard detection and resolution

    • Every instruction goes through the scoreboard

    • Scoreboard determines when an instruction can read operands and begin execution

    • Scoreboard monitors every change in the hardware and determines when each instruction can execute


Scoreboard Implications

  • Out-of-order completion => WAR, WAW hazards?

    WAR:

      ADDD R1,R2,R3

      LD   R2,X

    WAW:

      ADDD R1,R2,R3

      LD   R1,X

  • Solutions for WAR

    • Queue both the operation and copies of its operands

    • Read registers only during Read Operands stage

  • For WAW: stall issue until the other instruction completes

  • Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units (superpipeline)

  • Scoreboard keeps track of dependencies and the state of operations

  • Scoreboard replaces ID, EX, WB with 4 stages
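The WAR and WAW pairs on this slide can be checked mechanically. The following is a minimal sketch, not part of the lecture: the function name `hazards` and the (destination, sources) tuple encoding are assumptions made for illustration.

```python
def hazards(first, second):
    """Classify the hazards between two instructions in program order.

    Each instruction is a hypothetical (dest, sources) pair; `first`
    comes before `second` in the program.
    """
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 is not None and d1 in s2:
        found.add("RAW")   # second reads what first writes
    if d2 is not None and d2 in s1:
        found.add("WAR")   # second overwrites a register first still reads
    if d1 is not None and d1 == d2:
        found.add("WAW")   # both write the same register
    return found


addd = ("R1", ("R2", "R3"))            # ADDD R1,R2,R3
print(hazards(addd, ("R2", ())))       # LD R2,X
print(hazards(addd, ("R1", ())))       # LD R1,X
```

With in-order completion only the RAW case needs a check; the other two appear once a later instruction may finish before an earlier one.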


4 Stages of Scoreboard Control: 1st Stage (ID1) - Issue

  • Decode instructions and check for structural hazards

  • If the functional unit for the instruction is free (no structural hazard), and no other active instruction has the same destination register (no WAW hazard)

    • Scoreboard issues instruction to functional unit

    • Updates internal data structure

  • If a structural hazard or WAW hazard exists

    • Stall instruction issue

    • No further instructions issue until the hazards are cleared

    • IF/ID1 buffer allows further instruction fetch (IF)


4 Stages of Scoreboard Control: 2nd Stage (ID2) - Read Operands

  • Wait until no data hazard, then read operands

  • To prevent RAW, a source operand is available for read

    • If no earlier-issued active instruction is going to write it, or

    • If the register containing the operand is not being written by any currently active functional unit

  • When the source operands are available

    • Scoreboard tells the functional unit to read and begin execution

    • Scoreboard resolves RAW hazards dynamically

  • => out-of-order execution


4 Stages of Scoreboard Control: 3rd Stage (EX) - Execution

  • Operates on operands

    • Functional unit begins execution upon receiving operands

    • When the result is ready, the functional unit notifies the scoreboard of the completion of execution


4 Stages of Scoreboard Control: 4th Stage (WB) - Write Result

  • Finish execution

  • When the scoreboard knows the functional unit has completed execution

    • Scoreboard checks for a WAR hazard: if there is none, it writes the result; if there is one, it stalls the instruction

    • Example:

      DIVD F0,F2,F4

      ADDD F10,F0,F8

      SUBD F8,F8,F14

    • CDC 6600 scoreboard would stall SUBD's write to F8 until ADDD reads its operands


3 Parts of the Scoreboard

1. Instruction status - Indicates which of the 4 steps (Issue, Read Operands, Execution Complete, Write Result) the instruction is in

2. Functional unit status - Indicates the state of the functional unit (FU). 9 fields for each functional unit:

  • Busy: Indicates whether the unit is busy or not

  • Op: Operation to perform in the unit (e.g., + or -)

  • Fi: Destination register number

  • Fj, Fk: Source register numbers

  • Qj, Qk: Functional units producing source registers Fj, Fk

  • Rj, Rk: Flags indicating when Fj, Fk are ready

3. Register result status - Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register


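Scoreboard Pipeline Control

For each instruction status, the scoreboard waits until a condition holds and then does its bookkeeping. FU is the functional unit used by the instruction, D is its destination register, S1 and S2 are its source registers, and f ranges over the functional units.

  • Issue

    • Wait until: not Busy(FU) and not Result(D) (the second term is the WAW check: no active instruction has the same destination register)

    • Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← 'D'; Fj(FU) ← 'S1'; Fk(FU) ← 'S2'; Qj ← Result('S1'); Qk ← Result('S2'); Rj ← not Qj; Rk ← not Qk; Result('D') ← FU

  • Read operands

    • Wait until: Rj and Rk

    • Bookkeeping: Rj ← No; Rk ← No; Qj ← 0; Qk ← 0

  • Execution complete

    • Wait until: functional unit done

  • Write result

    • Wait until (WAR check): ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))

    • Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No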
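The wait conditions and bookkeeping rules of the scoreboard can be sketched as a small Python class. This is an illustrative sketch under stated assumptions, not the CDC 6600 logic itself: the class name `Scoreboard`, its method names, and the use of `None` for an empty field are all hypothetical choices.

```python
class Scoreboard:
    """Functional unit status and register result status (hypothetical sketch)."""

    BLANK = dict(Busy=False, Op=None, Fi=None, Fj=None, Fk=None,
                 Qj=None, Qk=None, Rj=False, Rk=False)

    def __init__(self, unit_names):
        self.fu = {name: dict(self.BLANK) for name in unit_names}
        self.result = {}  # register -> unit that will write it

    def can_issue(self, unit, dest):
        # Structural hazard check (unit busy) and WAW check (pending writer).
        return not self.fu[unit]["Busy"] and dest not in self.result

    def issue(self, unit, op, dest, s1, s2):
        assert self.can_issue(unit, dest)
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[unit].update(Busy=True, Op=op, Fi=dest, Fj=s1, Fk=s2,
                             Qj=qj, Qk=qk, Rj=qj is None, Rk=qk is None)
        self.result[dest] = unit

    def can_read(self, unit):
        # RAW check: both operands must be ready.
        return self.fu[unit]["Rj"] and self.fu[unit]["Rk"]

    def read_operands(self, unit):
        assert self.can_read(unit)
        self.fu[unit].update(Rj=False, Rk=False, Qj=None, Qk=None)

    def can_write(self, unit):
        # WAR check: no unit still has an unread, ready operand in Fi(unit).
        fi = self.fu[unit]["Fi"]
        return all((f["Fj"] != fi or not f["Rj"]) and
                   (f["Fk"] != fi or not f["Rk"])
                   for f in self.fu.values())

    def write_result(self, unit):
        assert self.can_write(unit)
        for f in self.fu.values():
            if f["Qj"] == unit:
                f["Rj"] = True
            if f["Qk"] == unit:
                f["Rk"] = True
        del self.result[self.fu[unit]["Fi"]]
        self.fu[unit] = dict(self.BLANK)


sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
sb.issue("Integer", "Load", "F2", None, "R3")
sb.issue("Mult1", "Mult", "F0", "F2", "F4")
print(sb.can_read("Mult1"))   # Mult1 waits for the load of F2
```

Keeping each check separate from its bookkeeping mirrors the two columns of the table: a stage advances only when its wait condition holds, and only then is the state updated.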


Scoreboard Example

Instruction sequence:

    LD    F6, 34+R2
    LD    F2, 45+R3
    MULT  F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2

Each cycle's snapshot consists of three tables:

  • Instruction Status: Instruction, j, k, Issue, Read Operands, Execution Complete, Write Result

  • Functional Unit Status, one row per unit (Integer, Mult1, Mult2, Add, Divide): Time, Name, Busy, Op, Fi (dest), Fj (S1), Fk (S2), Qj (FU for j), Qk (FU for k), Rj (Fj ready?), Rk (Fk ready?)

  • Register Result Status: Clock, and the FU that will write each of F0, F2, F4, ..., F30

Cycle by cycle:

  • Cycle 0: all functional units are idle and no register has a pending write

  • Cycle 1: LD F6 issues to Integer (Op Load, Fi F6, Fk R2, Rk Yes); F6 will be written by Integer

  • Cycle 2: LD F6 reads operands; LD F2 cannot issue because Integer is busy

  • Cycle 3: LD F6 completes execution

  • Cycle 4: LD F6 writes its result; Integer becomes free

  • Cycle 5: LD F2 issues to Integer (Fi F2, Fk R3, Rk Yes); F2 will be written by Integer

  • Cycle 6: LD F2 reads operands; MULT issues to Mult1 (Fi F0, Fj F2, Fk F4, Qj Integer, Rj No, Rk Yes); F0 will be written by Mult1

  • Cycle 7: LD F2 completes execution; SUBD issues to Add (Fi F8, Fj F6, Fk F2, Qk Integer, Rj Yes, Rk No); F8 will be written by Add

  • Cycle 8a: DIVD issues to Divide (Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes); F10 will be written by Divide

  • Cycle 8b: LD F2 writes its result; Integer becomes free; F2 is now ready for MULT and SUBD

  • Cycle 9: MULT and SUBD read operands (Time: Mult1 10, Add 2); ADDD cannot issue because Add is busy

  • Cycle 11: SUBD completes execution

  • Cycle 12: SUBD writes its result; Add becomes free

  • Cycle 13: ADDD issues to Add (Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes); F6 will be written by Add

  • Cycle 14: ADDD reads operands

  • Cycle 15: ADDD and MULT are executing

  • Cycle 16: ADDD completes execution

  • Cycles 17 and 18: ADDD cannot write F6 because DIVD has not yet read F6 (WAR); MULT is still executing

  • Cycle 19: MULT completes execution

  • Cycle 20: MULT writes its result; F0 is now ready for DIVD

  • Cycle 21: DIVD reads operands

  • Cycle 22: ADDD writes its result (the WAR hazard is gone); DIVD has 40 cycles of execution remaining

  • Cycle 61: DIVD completes execution

  • Cycle 62: DIVD writes its result

Final instruction status (cycle 62):

    Instruction          Issue   Read Operands   Execution Complete   Write Result
    LD    F6, 34+R2        1          2                  3                  4
    LD    F2, 45+R3        5          6                  7                  8
    MULT  F0, F2, F4       6          9                 19                 20
    SUBD  F8, F6, F2       7          9                 11                 12
    DIVD  F10, F0, F6      8         21                 61                 62
    ADDD  F6, F8, F2      13         14                 16                 22
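The instruction status columns of the scoreboard example (issue, read operands, execution complete, write result) can be reproduced with a short simulation. This is a sketch under assumptions, not the lecture's own code: the latencies (Integer 1, Add 2, Mult 10, Divide 40) are inferred from the timers shown in the snapshots, the unit counts from the functional unit table, and every check looks only at events from earlier cycles. The names `Instr`, `simulate`, and `before` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed unit counts per type (Mult1 and Mult2 are two multipliers).
UNITS = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}


@dataclass
class Instr:
    text: str
    dest: str
    srcs: Tuple[str, ...]
    unit: str
    latency: int
    issue: Optional[int] = None
    read: Optional[int] = None
    complete: Optional[int] = None
    write: Optional[int] = None


def before(event, t):
    """True if the event happened in a cycle strictly earlier than t."""
    return event is not None and event < t


def simulate(program, units=UNITS):
    t = 0
    while any(i.write is None for i in program):
        t += 1
        # Issue: in program order, at most one instruction per cycle.
        nxt = next((i for i in program if i.issue is None), None)
        if nxt is not None:
            active = [i for i in program
                      if i.issue is not None and not before(i.write, t)]
            busy = sum(i.unit == nxt.unit for i in active)   # structural
            waw = any(i.dest == nxt.dest for i in active)    # WAW
            if busy < units[nxt.unit] and not waw:
                nxt.issue = t
        for k, ins in enumerate(program):
            earlier = program[:k]
            if before(ins.issue, t) and ins.read is None:
                # RAW: every earlier producer of a source has written.
                if all(before(p.write, t)
                       for p in earlier if p.dest in ins.srcs):
                    ins.read = t
            elif ins.read is not None and ins.complete is None:
                if t == ins.read + ins.latency:
                    ins.complete = t
            elif before(ins.complete, t) and ins.write is None:
                # WAR: every earlier reader of our destination has read.
                if all(before(p.read, t)
                       for p in earlier if ins.dest in p.srcs):
                    ins.write = t
    return [(i.issue, i.read, i.complete, i.write) for i in program]


def example_program():
    return [
        Instr("LD F6,34+R2", "F6", ("R2",), "Integer", 1),
        Instr("LD F2,45+R3", "F2", ("R3",), "Integer", 1),
        Instr("MULT F0,F2,F4", "F0", ("F2", "F4"), "Mult", 10),
        Instr("SUBD F8,F6,F2", "F8", ("F6", "F2"), "Add", 2),
        Instr("DIVD F10,F0,F6", "F10", ("F0", "F6"), "Divide", 40),
        Instr("ADDD F6,F8,F2", "F6", ("F8", "F2"), "Add", 2),
    ]


for ins, row in zip(example_program(), simulate(example_program())):
    print(f"{ins.text:16} {row}")
```

Under these assumptions the last row printed is ADDD with (13, 14, 16, 22): it issues only after SUBD frees the adder, and its write is held until the cycle after DIVD reads F6.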


Scoreboard Summary

  • Speedup of 1.7 for FORTRAN programs and 2.5 for hand-coded assembly language programs, BUT slow memory (no cache) limits the benefit

  • Limitations of 6600 scoreboard:

    • No forwarding hardware

    • Limited to instructions in basic block (small window)

    • Small number of functional units (structural hazards)

    • Wait for WAR hazards

    • Prevent WAW hazards


Case Study: Tomasulo Algorithm


Limitations of Scoreboard

  • No forwarding

  • Limited to instructions in basic block (small window)

  • Small number of functional units (structural hazards)

  • Wait for WAR hazards

  • Prevent WAW hazards


Another Dynamic Algorithm: Tomasulo Algorithm

  • Developed for the IBM 360/91, about 3 years after the CDC 6600

  • Goal: High Performance without special compilers

  • Differences between IBM 360 & CDC 6600 ISA

    • IBM has only 2 register specifiers/instr vs. 3 in CDC 6600

    • IBM has 4 FP registers vs. 8 in CDC 6600

  • Differences between Tomasulo Algorithm & Scoreboard

    • Control & buffers are distributed with the functional units, in “reservation stations”, vs. centralized in the scoreboard

    • Registers in instructions are replaced by pointers to reservation station buffers

    • HW renaming of registers to avoid WAR, WAW hazards

    • Common Data Bus (CDB) broadcasts results to all FUs

    • Loads and stores are treated as FUs as well


Register Renaming

Name dependence and data dependence (original code):

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F0, -8(R1)
          ADDD  F4, F0, F2
          SD    -8(R1), F4
          LD    F0, -16(R1)
          ADDD  F4, F0, F2
          SD    -16(R1), F4
          LD    F0, -24(R1)
          ADDD  F4, F0, F2
          SD    -24(R1), F4
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

Only data dependence, with register renaming:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop
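The step from the code with name dependences to the renamed code can be sketched as a single pass over the instructions. This is an illustrative sketch, not the compiler's or the hardware's actual method: the function `rename`, the tuple encoding (op, dest, FP sources, memory operand), and the free list are assumptions, and the free list is assumed to be disjoint from the registers the program already uses.

```python
def rename(program, free):
    """Give each repeated write to an FP register a fresh name.

    program: list of hypothetical (op, dest, srcs, mem) tuples.
    free:    names of unused registers, handed out in order.
    """
    free = list(free)
    current = {}      # architectural register -> name holding its latest value
    written = set()   # architectural registers already written once
    out = []
    for op, dest, srcs, mem in program:
        # Sources read the most recent name of each register.
        srcs = tuple(current.get(s, s) for s in srcs)
        if dest is not None:
            if dest in written:
                new = free.pop(0)      # later writes get a fresh register
            else:
                new = dest             # the first write keeps its name
                written.add(dest)
            current[dest] = new
            dest = new
        out.append((op, dest, srcs, mem))
    return out


unrolled = []
for offset in (0, -8, -16, -24):
    mem = f"{offset}(R1)"
    unrolled += [("LD", "F0", (), mem),
                 ("ADDD", "F4", ("F0", "F2"), None),
                 ("SD", None, ("F4",), mem)]

for ins in rename(unrolled, ["F6", "F8", "F10", "F12", "F14", "F16"]):
    print(ins)
```

After renaming, every remaining dependence is a true (RAW) data dependence inside one iteration, which is why the four iterations can overlap.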


Tomasulo Organization

[Figure: Tomasulo organization. The floating-point operations queue (issue) is fed from the instruction unit. Load buffers 1-6 hold values arriving from memory to be loaded into registers; store buffers 1-3 hold addresses for data going to memory. The FP registers and the operations queue feed the reservation stations over the operand bus and the operation bus: FP add reservation stations 1-3 in front of the FP adder, and FP multiply reservation stations 1-2 in front of the FP multiplier. Results are returned on the Common Data Bus (CDB).]


Reservation Station Components

Op: Operation to perform in the unit (e.g., + or -)

Qj, Qk: Reservation stations producing the source operands Vj, Vk. 0 indicates that Vj, Vk are ready, which eliminates the Rj, Rk fields of the scoreboard

Vj, Vk: Values of the source operands

Busy: Indicates that the reservation station and its FU are busy

Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.


Three Stages of Tomasulo Algorithm

1. Issue: Get instruction from FP Op Queue

  • FP op: If a reservation station is free, issue the instruction and send the operation and any operands that are already in registers (this renames the registers).

  • LD/ST: If a buffer is available, issue the instruction.

  • If no reservation station or buffer is available, there is a structural hazard: stall.

  • Register renaming takes place in this stage

2. Execution: Operate on operands (EX)

  • When an operand is ready, put it in the reservation station.

  • If an operand is not ready, watch the CDB for its result.

  • When both operands are available, execute.

  • The RAW check takes place in this stage

3. Write Result: Finish execution (WB)

  • When the result is available, write it on the Common Data Bus, and from there to all waiting units: registers and reservation stations

  • Mark the reservation station available.
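The issue and write-result steps of the Tomasulo algorithm can be sketched with a few dictionaries. This is a simplified illustration under assumptions, not the IBM 360/91 design: the class `Tomasulo`, its method names, and the numeric register values are hypothetical, execution timing is left out, and the caller supplies each result value when it is broadcast.

```python
class Tomasulo:
    """Reservation stations, register status and CDB (hypothetical sketch)."""

    def __init__(self, regs):
        self.regs = dict(regs)  # register -> current value
        self.qi = {}            # register result status: register -> tag
        self.rs = {}            # tag -> dict(Op, Vj, Vk, Qj, Qk)

    def _operand(self, reg):
        # A pending producer gives a tag (Q); otherwise copy the value (V).
        tag = self.qi.get(reg)
        return (None, tag) if tag is not None else (self.regs[reg], None)

    def issue_load(self, tag, dest):
        # Load buffers are treated as producers identified by their tag.
        self.qi[dest] = tag

    def issue(self, tag, op, dest, s1, s2):
        if tag in self.rs:
            raise RuntimeError("structural hazard: station busy")
        vj, qj = self._operand(s1)
        vk, qk = self._operand(s2)
        self.rs[tag] = dict(Op=op, Vj=vj, Vk=vk, Qj=qj, Qk=qk)
        self.qi[dest] = tag     # renaming: dest now names this station

    def ready(self, tag):
        r = self.rs[tag]
        return r["Qj"] is None and r["Qk"] is None

    def broadcast(self, tag, value):
        # Write result: the CDB delivers (tag, value) to every waiting unit.
        for r in self.rs.values():
            if r["Qj"] == tag:
                r["Vj"], r["Qj"] = value, None
            if r["Qk"] == tag:
                r["Vk"], r["Qk"] = value, None
        for reg, producer in list(self.qi.items()):
            if producer == tag:          # only the latest writer updates
                self.regs[reg] = value
                del self.qi[reg]
        self.rs.pop(tag, None)           # station becomes available


# Register values are made up for illustration.
t = Tomasulo({"F0": 0.0, "F2": 0.0, "F4": 2.0, "F6": 0.0, "F8": 0.0})
t.issue_load("LD2", "F2")
t.issue("Mult1", "MULTD", "F0", "F2", "F4")
print(t.rs["Mult1"])    # Vk holds F4's value, Qj points at LD2
t.broadcast("LD2", 4.0)
print(t.ready("Mult1"))
```

Because an operand that is already available is copied into the station at issue, a later instruction that overwrites the same register cannot disturb it, which is how WAR hazards disappear; a register accepts a broadcast only from its most recent producer, which handles WAW.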


Instruction status

Exec

Write

Busy

Address

Instruction j k

Issue

complete

Result

LD1

No

LD

F6

34+

R2

LD

F2

45+

R3

LD2

No

MULTD

F0

F2

F4

LD3

No

SUBD

F8

F6

F2

DIVD

F10

F0

F6

ADDD

F6

F8

F2

S1

S2

RS for j

Reservation Stations

RS for k

Busy

Op

Vj

Vk

Qj

Qk

Time

Name

No

0

Add1

No

0

0

Add2

No

Add3

No

0

Mult1

No

0

Mult2

Register result status

F0

F2

F4

F6

F8

F10

F12

...

F30

Clock

R2

R3

Qi

0

80

90

Cycle 0

CS510 Computer Architectures


Instruction status

Exec

Write

Busy

Address

Instruction j k

Issue

complete

Result

LD F8 34+ R2

LD1

No

LD

F6

34+

R2

LD

F2

45+

R3

LD2

No

MULTD

F0

F2

F4

LD3

No

SUBD

F8

F6

F2

DIVD

F10

F0

F6

ADDD

F6

F8

F2

S1

S2

RS for j

Reservation Stations

RS for k

Busy

Op

Vj

Vk

Qj

Qk

Time

Name

No

0

Add1

No

0

Add2

No

Add3

No

0

Mult1

No

0

Mult2

Register result status

F0

F2

F4

F6

F8

F10

F12

...

F30

Clock

R2

R3

Qi

0

80

90

Cycle 1

Yes 34+80

1

LD1

1

CS510 Computer Architectures


Instruction status

Exec

Write

Busy

Address

Instruction j k

Issue

complete

Result

Yes 34+80

LD F8 34+ R2

1

LD1

No

LD

F6

34+

R2

LD F2 45+ R3

LD

F2

45+

R3

LD2

No

MULTD

F0

F2

F4

LD3

No

SUBD

F8

F6

F2

DIVD

F10

F0

F6

ADDD

F6

F8

F2

S1

S2

RS for j

Reservation Stations

RS for k

Busy

Op

Vj

Vk

Qj

Qk

Time

Name

No

0

Add1

No

0

Add2

No

Add3

No

0

Mult1

No

0

Mult2

Register result status

F0

F2

F4

F6

F8

F10

F12

...

F30

Clock

R2

R3

Qi

1

0

80

90

Cycle 2

Yes 45+90

2

LD2

LD1

2

CS510 Computer Architectures


Instruction status

Exec

Write

Busy

Address

Instruction j k

Issue

complete

Result

Yes 34+80

LD F8 34+ R2

1

LD1

No

LD

F6

34+

R2

Yes 45+90

LD F2 45+ R3

2

LD

F2

45+

R3

LD2

No

MULTD F0 F2 F4

MULTD

F0

F2

F4

LD3

No

SUBD

F8

F6

F2

DIVD

F10

F0

F6

ADDD

F6

F8

F2

S1

S2

RS for j

Reservation Stations

RS for k

Busy

Op

Vj

Vk

Qj

Qk

Time

Name

No

0

Add1

No

0

Add2

No

Add3

No

0

Mult1

No

0

Mult2

Register result status

F0

F2

F4

F6

F8

F10

F12

...

F30

Clock

R2

R3

LD2

Qi

1

2

0

80

90

Cycle 3

3

3

Yes MULTD R(F4) LD2

0

Mult1

LD1

3

CS510 Computer Architectures


Instruction status

Exec

Write

Busy

Address

Instruction j k

Issue

complete

Result

Yes 34+80

LD F6 34+ R2

1

3

LD1

No

LD

F6

34+

R2

Yes 45+90

LD F2 45+ R3

2

LD

F2

45+

R3

LD2

No

MULTD F0 F2 F4

3

MULTD

F0

F2

F4

LD3

No

SUBD F8 F6 F2

SUBD

F8

F6

F2

DIVD

F10

F0

F6

ADDD

F6

F8

F2

S1

S2

RS for j

Reservation Stations

RS for k

Busy

Op

Vj

Vk

Qj

Qk

Time

Name

No

0

Add1

No

0

Add2

No

Add3

Yes MULTD R(F4) LD2

0

No

0

Mult1

No

0

Mult2

Register result status

F0

F2

F4

F6

F8

F10

F12

...

F30

Clock

R2

R3

Mult1

LD2

Qi

2

1

3

0

80

90

Cycle 4a

4

Yes SUBD LD1 LD2

LD1

Add1

4

CS510 Computer Architectures


Instruction status

Exec

Write

Busy

Address

Instruction j k

Issue

complete

Result

No

Yes 34+80

LD F6 34+ R2

1

3

LD1

No

LD

F6

34+

R2

Yes 45+90

LD F2 45+ R3

2

LD

F2

45+

R3

LD2

No

MULTD F0 F2 F4

3

MULTD

F0

F2

F4

LD3

No

SUBD F8 F6 F2

4

SUBD

F8

F6

F2

DIVD

F10

F0

F6

ADDD

F6

F8

F2

S1

S2

RS for j

Reservation Stations

RS for k

Busy

Op

Vj

Vk

Qj

Qk

Time

Name

Yes SUBD LD1 LD2

0

No

0

Add1

No

0

Add2

No

Add3

Yes MULTD R(F4) LD2

0

No

0

Mult1

No

0

Mult2

Register result status

F0

F2

F4

F6

F8

F10

F12

...

F30

LD1

Add1

Clock

R2

R3

Mult1

LD2

Qi

4

3

2

1

0

80

90

Cycle 4b

4

4

M(114)

M(114)

CS510 Computer Architectures


Instruction status

Exec

Write

Busy

Address

Instruction j k

Issue

complete

Result

Yes 34+80

No

LD F6 34+ R2

1

3

4

LD1

No

LD

F6

34+

R2

Yes 45+90

LD F2 45+ R3

2

4

No

LD

F2

45+

R3

LD2

No

MULTD F0 F2 F4

3

MULTD

F0

F2

F4

LD3

No

SUBD F8 F6 F2

4

SUBD

F8

F6

F2

DIVD F10 F0 F6

DIVD

F10

F0

F6

ADDD

F6

F8

F2

S1

S2

RS for j

Reservation Stations

RS for k

Busy

Op

Vj

Vk

Qj

Qk

Time

Name

Yes SUBD LD1 LD2

M(114)

0

No

0

Add1

No

0

Add2

No

Add3

Yes MULTD R(F4) LD2

M(135)

0

No

0

Mult1

No

0

Mult2

Register result status

F0

F2

F4

F6

F8

F10

F12

...

F30

LD1

Add1

Clock

R2

R3

Mult1

LD2

Qi

2

1

4

3

0

80

90

M(114)

Cycle 5

5

5

M(135)

2

10

Yes DIVD M(114) Mult1

Mult2

5

M(135)

CS510 Computer Architectures


Instruction status

Exec

Write

Busy

Address

Instruction j k

Issue

complete

Result

Yes 34+80

No

LD F6 34+ R2

1

3

4

LD1

No

LD

F6

34+

R2

Yes 45+90

LD F2 45+ R3

2

4

5

No

LD

F2

45+

R3

LD2

No

MULTD F0 F2 F4

3

MULTD

F0

F2

F4

LD3

No

SUBD F8 F6 F2

4

SUBD

F8

F6

F2

DIVD F10 F0 F6

5

DIVD

F10

F0

F6

ADDD F6 F8 F2

ADDD

F6

F8

F2

S1

S2

RS for j

Reservation Stations

RS for k

Busy

Op

Vj

Vk

Qj

Qk

Time

Name

Yes SUBD LD1 LD2

M(114)

M(135)

2

0

No

0

Add1

No

0

Add2

No

Add3

Yes MULTD R(F4) LD2

M(135)

10

0

No

0

Mult1

Yes DIVD M(114) Mult1

0

No

0

Mult2

Register result status

F0

F2

F4

F6

F8

F10

F12

...

F30

LD1

Add1

Mult2

Clock

R2

R3

Mult1

LD2

Qi

3

1

2

5

4

0

80

90

M(135)

M(114)

Cycle 6

6

1

Yes ADDD M(135) Add1

9

Add2

6

CS510 Computer Architectures


Cycle 7

  • Instruction status (issue / exec complete / write result):

    • LD F6, 34+R2: 1 / 3 / 4

    • LD F2, 45+R3: 2 / 4 / 5

    • MULTD F0, F2, F4: issued in cycle 3

    • SUBD F8, F6, F2: issued in cycle 4, exec complete in cycle 7

    • DIVD F10, F0, F6: issued in cycle 5

    • ADDD F6, F8, F2: issued in cycle 6

  • Load buffers: LD1, LD2, LD3 not busy

  • Reservation stations:

    • Add1: busy, SUBD, Vj = M(114), Vk = M(135), Time 0

    • Add2: busy, ADDD, Vk = M(135), Qj = Add1

    • Add3: not busy

    • Mult1: busy, MULTD, Vj = M(135), Vk = R(F4), Time 8

    • Mult2: busy, DIVD, Vk = M(114), Qj = Mult1

  • Register result status: Clock 7, R2 = 80, R3 = 90; F0 = Mult1, F2 holds M(135), F6 = Add2, F8 = Add1, F10 = Mult2

CS510 Computer Architectures


Cycle 8

  • Instruction status (issue / exec complete / write result):

    • LD F6, 34+R2: 1 / 3 / 4

    • LD F2, 45+R3: 2 / 4 / 5

    • MULTD F0, F2, F4: issued in cycle 3

    • SUBD F8, F6, F2: 4 / 7 / 8

    • DIVD F10, F0, F6: issued in cycle 5

    • ADDD F6, F8, F2: issued in cycle 6

  • Load buffers: LD1, LD2, LD3 not busy

  • Reservation stations:

    • Add1: not busy

    • Add2: busy, ADDD, Vj = M()-M() (that is, M(114) - M(135)), Vk = M(135)

    • Add3: not busy

    • Mult1: busy, MULTD, Vj = M(135), Vk = R(F4), Time 7

    • Mult2: busy, DIVD, Vk = M(114), Qj = Mult1

  • Register result status: Clock 8, R2 = 80, R3 = 90; F0 = Mult1, F2 holds M(135), F6 = Add2, F8 holds M()-M(), F10 = Mult2

CS510 Computer Architectures
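The write-result step visible in Cycle 8 (the SUBD result leaving Add1 and reaching both Add2 and F8) can be sketched as a broadcast on the common data bus. This is only an illustrative sketch: the dictionary layout and the function name `broadcast` are assumptions made for the example, not part of the lecture.

```python
# Hypothetical sketch of a Tomasulo write-result broadcast.
# stations: name -> dict with Vj, Vk (values) and Qj, Qk (producer tags)
# reg_status: register -> tag of the station that will write it
# regs: register -> value

def broadcast(tag, value, stations, reg_status, regs):
    """Deliver `value` produced by station `tag` to every waiting consumer."""
    for rs in stations.values():
        if rs.get("Qj") == tag:          # operand j was waiting on this tag
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == tag:          # operand k was waiting on this tag
            rs["Vk"], rs["Qk"] = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:              # register still expects this result
            regs[reg] = value
            del reg_status[reg]

# State just before SUBD (in Add1) writes its result in cycle 8.
stations = {
    "Add2": {"Op": "ADDD", "Vj": None, "Vk": "M(135)", "Qj": "Add1", "Qk": None},
    "Mult2": {"Op": "DIVD", "Vj": None, "Vk": "M(114)", "Qj": "Mult1", "Qk": None},
}
reg_status = {"F0": "Mult1", "F6": "Add2", "F8": "Add1", "F10": "Mult2"}
regs = {}

broadcast("Add1", "M(114)-M(135)", stations, reg_status, regs)
```

After the call, Add2 holds the SUBD result in Vj and no longer waits on Add1, while Mult2 is untouched because it waits on Mult1.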




Dynamic Loop Unrolling by Tomasulo

  • Eliminates WAW and WAR hazards by dynamic renaming of registers

  • Predicting branches as TAKEN allows instructions from multiple loop iterations to proceed in parallel

  • With dynamic loop unrolling and register renaming, the large number of registers that loop unrolling would otherwise require can be avoided
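A minimal sketch of the renaming idea in the bullets above: each issued instruction is given a station tag, sources read the newest tag for their register, and the destination register is re-pointed at the new tag. The function name `rename`, the tuple layout, and the fixed station assignment are assumptions made for illustration.

```python
# Hypothetical sketch of register renaming through station tags.

def rename(instrs, station_names):
    """instrs: (op, dest, src1, src2) tuples; station_names: one tag per instruction."""
    status = {}      # register -> tag of the newest pending producer
    renamed = []
    for (op, dest, s1, s2), tag in zip(instrs, station_names):
        # A source becomes the producer's tag if one is pending, else the register itself.
        q1 = status.get(s1, s1)
        q2 = status.get(s2, s2)
        renamed.append((tag, op, q1, q2))
        status[dest] = tag           # later readers of dest wait on this tag
    return renamed, status

# Two dynamically unrolled iterations of LD F0 / MULTD F4 from the loop.
instrs = [
    ("LD", "F0", "R1", None),
    ("MULTD", "F4", "F0", "F2"),
    ("LD", "F0", "R1", None),
    ("MULTD", "F4", "F0", "F2"),
]
renamed, status = rename(instrs, ["Load1", "Mult1", "Load2", "Mult2"])
```

Both iterations write F0 and F4, yet the two multiplies read Load1 and Load2 respectively, so the WAW and WAR conflicts on F0 disappear without any extra architectural registers.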

CS510 Computer Architectures


Tomasulo Loop Example

Loop: LD    F0, 0(R1)
      MULTD F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

  • This example shows dynamic loop unrolling through the completion of the first 2 iterations

    • Multiply takes 4 clocks

    • The load in the 1st iteration has a cache miss, which takes 8 cycles
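As a rough illustration of what "dynamic unrolling" means for this loop, the sketch below lists the floating-point instruction stream the hardware sees when the branch is predicted taken, together with the address held in R1. The starting value R1 = 80 is taken from the slides that follow; the function name and tuple format are assumptions for the example.

```python
# Hypothetical sketch: the instruction stream seen after dynamic unrolling.

def unrolled_trace(r1=80, iterations=2):
    """Return (iteration, op, address) tuples for the LD/MULTD/SD of each iteration."""
    trace = []
    for it in range(1, iterations + 1):
        trace.append((it, "LD", r1))        # LD    F0, 0(R1)
        trace.append((it, "MULTD", None))   # MULTD F4, F0, F2 (no memory address)
        trace.append((it, "SD", r1))        # SD    0(R1), F4
        r1 -= 8                             # SUBI  R1, R1, #8
        if r1 == 0:                         # BNEZ falls through when R1 reaches 0
            break
    return trace

trace = unrolled_trace()
```

With the defaults, the first iteration uses address 80 and the second uses address 72, matching the load and store buffer addresses in the cycle-by-cycle slides.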

CS510 Computer Architectures


Loop Example: Cycle 0

  • Instruction status: LD F0,0(R1), MULTD F4,F0,F2 and SD 0(R1),F4 are tracked for iterations 1 and 2; none issued yet

  • Load buffers (Load1 to Load3) and store buffers (Store1 to Store3): none busy

  • Reservation stations (Add1 to Add3, Mult1, Mult2): none busy

  • Register result status: Clock 0, R1 = 80; no pending results

CS510 Computer Architectures


Cycle 1

  • Instruction status: LD (iteration 1) issued in cycle 1

  • Load and store buffers: Load1 busy (address 80)

  • Reservation stations: none busy

  • Register result status: Clock 1, R1 = 80; F0 = Load1

CS510 Computer Architectures


Cycle 2

  • Instruction status: LD (iteration 1) issued in cycle 1; MULTD (iteration 1) issued in cycle 2

  • Load and store buffers: Load1 busy (address 80)

  • Reservation stations: Mult1 busy, MULTD, Vk = R(F2), Qj = Load1

  • Register result status: Clock 2, R1 = 80; F0 = Load1, F4 = Mult1

  • LD cannot progress due to cache miss

CS510 Computer Architectures


Cycle 3

  • Instruction status: LD (iteration 1) issued in cycle 1; MULTD (iteration 1) issued in cycle 2; SD (iteration 1) issued in cycle 3

  • Load and store buffers: Load1 busy (address 80); Store1 busy (address 80, Qi = Mult1)

  • Reservation stations: Mult1 busy, MULTD, Vk = R(F2), Qj = Load1

  • Register result status: Clock 3, R1 = 80; F0 = Load1, F4 = Mult1

  • The multiply cannot progress because it is waiting for F0

CS510 Computer Architectures


Cycle 4

  • Instruction status: LD (iteration 1) issued in cycle 1; MULTD (iteration 1) issued in cycle 2; SD (iteration 1) issued in cycle 3

  • Load and store buffers: Load1 busy (address 80); Store1 busy (address 80, Qi = Mult1)

  • Reservation stations: Mult1 busy, MULTD, Vk = R(F2), Qj = Load1

  • Register result status: Clock 4, R1 = 72; F0 = Load1, F4 = Mult1

  • Execute SUBI for R1 in the 2nd iteration

  • SD cannot progress because it is waiting for F4

CS510 Computer Architectures


Cycle 5

  • Instruction status: LD (iteration 1) issued in cycle 1; MULTD (iteration 1) issued in cycle 2; SD (iteration 1) issued in cycle 3

  • Load and store buffers: Load1 busy (address 80); Store1 busy (address 80, Qi = Mult1)

  • Reservation stations: Mult1 busy, MULTD, Vk = R(F2), Qj = Load1

  • Register result status: Clock 5, R1 = 72; F0 = Load1, F4 = Mult1

  • Execute BNEZ to get to the 2nd iteration

CS510 Computer Architectures


Cycle 6

  • Instruction status: LD (iteration 1) issued in cycle 1; MULTD (iteration 1) issued in cycle 2; SD (iteration 1) issued in cycle 3; LD (iteration 2) issued in cycle 6

  • Load and store buffers: Load1 busy (address 80); Load2 busy (address 72); Store1 busy (address 80, Qi = Mult1)

  • Reservation stations: Mult1 busy, MULTD, Vk = R(F2), Qj = Load1

  • Register result status: Clock 6, R1 = 72; F0 = Load2, F4 = Mult1

  • Predict TAKEN, issue LD

CS510 Computer Architectures


Cycle 7

  • Instruction status: LD (iteration 1) issued in cycle 1; MULTD (iteration 1) issued in cycle 2; SD (iteration 1) issued in cycle 3; LD (iteration 2) issued in cycle 6; MULTD (iteration 2) issued in cycle 7

  • Load and store buffers: Load1 busy (address 80); Load2 busy (address 72); Store1 busy (address 80, Qi = Mult1)

  • Reservation stations: Mult1 busy, MULTD, Vk = R(F2), Qj = Load1; Mult2 busy, MULTD, Vk = R(F2), Qj = Load2

  • Register result status: Clock 7, R1 = 72; F0 = Load2, F4 = Mult2

CS510 Computer Architectures


Cycle 8

  • Instruction status: LD (iteration 1) issued in cycle 1; MULTD (iteration 1) issued in cycle 2; SD (iteration 1) issued in cycle 3; LD (iteration 2) issued in cycle 6; MULTD (iteration 2) issued in cycle 7; SD (iteration 2) issued in cycle 8

  • Load and store buffers: Load1 busy (address 80); Load2 busy (address 72); Store1 busy (address 80, Qi = Mult1); Store2 busy (address 72, Qi = Mult2)

  • Reservation stations: Mult1 busy, MULTD, Vk = R(F2), Qj = Load1; Mult2 busy, MULTD, Vk = R(F2), Qj = Load2

  • Register result status: Clock 8, R1 = 72; F0 = Load2, F4 = Mult2

CS510 Computer Architectures


Cycle 9

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9; MULTD iteration 1: 2; SD iteration 1: 3; LD iteration 2: 6; MULTD iteration 2: 7; SD iteration 2: 8

  • Load and store buffers: Load1 busy (address 80); Load2 busy (address 72); Store1 busy (address 80, Qi = Mult1); Store2 busy (address 72, Qi = Mult2)

  • Reservation stations: Mult1 busy, MULTD, Vk = R(F2), Qj = Load1; Mult2 busy, MULTD, Vk = R(F2), Qj = Load2

  • Register result status: Clock 9, R1 = 64; F0 = Load2, F4 = Mult2

  • Cache miss penalty over

  • Execute SUBI for 3rd iteration

CS510 Computer Architectures


Cycle 10

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2; SD iteration 1: 3; LD iteration 2: 6 / 10; MULTD iteration 2: 7; SD iteration 2: 8

  • Load and store buffers: Load1 no longer busy; Load2 busy (address 72); Store1 busy (address 80, Qi = Mult1); Store2 busy (address 72, Qi = Mult2)

  • Reservation stations: Mult1 busy, MULTD, Vj = M(80), Vk = R(F2), Time 4; Mult2 busy, MULTD, Vk = R(F2), Qj = Load2

  • Register result status: Clock 10, R1 = 64; F0 = Load2, F4 = Mult2

  • Cache available for access; the first multiply starts (4 clocks)

  • Execute BNEZ to get to the 3rd iteration

CS510 Computer Architectures


Cycle 11

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2; SD iteration 1: 3; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7; SD iteration 2: 8

  • Load and store buffers: Load1 and Load2 no longer busy; Store1 busy (address 80, Qi = Mult1); Store2 busy (address 72, Qi = Mult2)

  • Reservation stations: Mult1 busy, MULTD, Vj = M(80), Vk = R(F2), Time 3; Mult2 busy, MULTD, Vj = M(72), Vk = R(F2), Time 4

  • Register result status: Clock 11, R1 = 64; F4 = Mult2

  • The second multiply starts (4 clocks)

CS510 Computer Architectures


Cycle 12

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2; SD iteration 1: 3; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7; SD iteration 2: 8

  • Load and store buffers: Load1 and Load2 no longer busy; Load3 busy (address 64); Store1 busy (address 80, Qi = Mult1); Store2 busy (address 72, Qi = Mult2)

  • Reservation stations: Mult1 busy, MULTD, Vj = M(80), Vk = R(F2), Time 2; Mult2 busy, MULTD, Vj = M(72), Vk = R(F2), Time 3

  • Register result status: Clock 12, R1 = 64; F0 = Load3, F4 = Mult2

CS510 Computer Architectures


Cycle 13

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2; SD iteration 1: 3; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7; SD iteration 2: 8

  • Store buffers: Store1 busy (address 80, Qi = Mult1); Store2 busy (address 72, Qi = Mult2)

  • Reservation stations: Mult1 busy, MULTD, Vj = M(80), Vk = R(F2), Time 1; Mult2 busy, MULTD, Vj = M(72), Vk = R(F2), Time 2

  • Register result status: Clock 13, R1 = 64; F4 = Mult2

CS510 Computer Architectures


Cycle 14

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2 / 14; SD iteration 1: 3; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7; SD iteration 2: 8

  • Store buffers: Store1 busy (address 80, Qi = Mult1); Store2 busy (address 72, Qi = Mult2)

  • Reservation stations: Mult1 busy, MULTD, Vj = M(80), Vk = R(F2), Time 0; Mult2 busy, MULTD, Vj = M(72), Vk = R(F2), Time 1

  • Register result status: Clock 14, R1 = 64; F4 = Mult2

CS510 Computer Architectures


Cycle 15

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2 / 14 / 15; SD iteration 1: 3; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7 / 15; SD iteration 2: 8

  • Store buffers: Store1 busy (address 80), now holding the value M[80]*R(F2); Store2 busy (address 72, Qi = Mult2)

  • Reservation stations: Mult2 busy, MULTD, Vj = M(72), Vk = R(F2), Time 0

  • Register result status: Clock 15, R1 = 64; F4 = Mult2

CS510 Computer Architectures


Cycle 16

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2 / 14 / 15; SD iteration 1: 3; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7 / 15 / 16; SD iteration 2: 8

  • Store buffers: Store1 busy (address 80), holding M[80]*R(F2); Store2 busy (address 72), now holding M[72]*R(F2)

  • Register result status: Clock 16, R1 = 64

CS510 Computer Architectures


Cycle 17

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2 / 14 / 15; SD iteration 1: 3; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7 / 15 / 16; SD iteration 2: 8

  • Store buffers: Store1 busy (address 80), holding M[80]*R(F2); Store2 busy (address 72), holding M[72]*R(F2)

  • Register result status: Clock 17, R1 = 64

CS510 Computer Architectures


Cycle 18

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2 / 14 / 15; SD iteration 1: 3 / 18; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7 / 15 / 16; SD iteration 2: 8

  • Store buffers: Store1 busy (address 80), holding M[80]*R(F2); Store2 busy (address 72), holding M[72]*R(F2); Store3 busy (address 64, Qi = Mult1)

  • Register result status: Clock 18, R1 = 56

  • Execute SUBI for 4th iteration

CS510 Computer Architectures


Cycle 19

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2 / 14 / 15; SD iteration 1: 3 / 18 / 19; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7 / 15 / 16; SD iteration 2: 8

  • Store buffers: Store2 busy (address 72), holding M[72]*R(F2); Store3 busy (address 64, Qi = Mult1)

  • Register result status: Clock 19, R1 = 56

  • Execute BNEZ to get to the 4th iteration

CS510 Computer Architectures


Cycle 20

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2 / 14 / 15; SD iteration 1: 3 / 18 / 19; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7 / 15 / 16; SD iteration 2: 8 / 20

  • Store buffers: Store2 busy (address 72), holding M[72]*R(F2); Store3 busy (address 64, Qi = Mult1)

  • Register result status: Clock 20, R1 = 56

CS510 Computer Architectures


Cycle 21

  • Instruction status (issue / exec complete / write result): LD iteration 1: 1 / 9 / 10; MULTD iteration 1: 2 / 14 / 15; SD iteration 1: 3 / 18 / 19; LD iteration 2: 6 / 10 / 11; MULTD iteration 2: 7 / 15 / 16; SD iteration 2: 8 / 20 / 21

  • Store buffers: Store3 busy (address 64, Qi = Mult1)

  • Register result status: Clock 21, R1 = 56

  • The first 2 iterations are complete

CS510 Computer Architectures


Tomasulo Summary

  • Prevents the register file from becoming a bottleneck

  • Avoids the WAR and WAW hazards of the scoreboard

  • Allows loop unrolling in HW

  • Not limited to basic blocks (provided branch prediction is available)

  • Lasting Contributions

    • Dynamic scheduling

    • Register renaming

    • Load/store disambiguation

  • Next: More branch prediction
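The load/store disambiguation listed among the lasting contributions can be sketched as an address comparison against the busy store buffers: a load may go ahead of earlier stores only if none of them targets the same address. This is a simplified illustration; the function name and buffer layout are assumptions, and real hardware must also handle stores whose addresses are not yet known.

```python
# Hypothetical sketch of load/store disambiguation by address comparison.

def load_may_proceed(load_addr, store_buffers):
    """store_buffers: name -> (busy, address). True if no busy store matches the load."""
    return all(addr != load_addr
               for busy, addr in store_buffers.values()
               if busy)

# Store buffer contents as in the loop example once both stores are issued.
store_buffers = {
    "Store1": (True, 80),
    "Store2": (True, 72),
    "Store3": (False, None),
}

third_iteration_ok = load_may_proceed(64, store_buffers)   # no pending store to 64
conflicting_load_ok = load_may_proceed(72, store_buffers)  # Store2 targets 72
```

In the loop example the third-iteration load reads address 64, which matches neither pending store (80 and 72), so it can be issued without waiting for them.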

CS510 Computer Architectures

