1. COMP 206: Computer Architecture and Implementation
Montek Singh
Mon, Oct 10, 2005
Topic: Instruction-Level Parallelism
(Dynamic Scheduling: Tomasulo’s Algorithm)
2. Reading
Chapter 3: ILP and Its Dynamic Exploitation, Sections 3.1-3.3
3. Dynamic Scheduling: Tomasulo’s Algorithm
For IBM 360/91 (about three years after CDC 6600)
Goal: High performance without special compilers
Differences between IBM 360 and CDC 6600 ISA
IBM has only 2 register specifiers/instruction versus 3 in CDC 6600
IBM has 4 FP registers versus 8 in CDC 6600
Differences between Tomasulo Algorithm and Scoreboard
Control and buffers distributed with Function Units versus centralized in scoreboard; called “reservation stations”
Registers in instructions replaced by pointers to reservation station buffer
Hardware renaming of registers to avoid WAR and WAW hazards
Common Data Bus broadcasts results to all FUs (forwarding)
Loads and stores are treated as FUs as well
4. Tomasulo: Organization
5. More Details of Tomasulo Organization
Entities that produce values are assigned 4-bit tags
1, 2, 3, 4, 5, 6 for load buffers
8, 9 for multiplier reservation stations
10, 11, 12 for adder reservation stations
Tag 0 indicates presence of valid data
FP registers have “busy bits”
0 means that register holds valid data
1 means that it is waiting to receive value from source identified by its tag field
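The tag and busy-bit scheme above can be sketched as a small lookup table. This is an illustrative Python model under stated assumptions, not a description of the 360/91 hardware: the names `TAG_SOURCES`, `source_of`, and `FPRegister` are hypothetical, and tags the slide does not list are simply reported as unassigned.

```python
from dataclasses import dataclass

# Tag ranges as listed on the slide; tag 0 is reserved to mean "valid data".
TAG_SOURCES = {
    "load buffer": range(1, 7),                       # tags 1..6
    "multiplier reservation station": range(8, 10),   # tags 8, 9
    "adder reservation station": range(10, 13),       # tags 10, 11, 12
}

def source_of(tag: int) -> str:
    """Return the kind of unit that produces the value named by a 4-bit tag."""
    if not 0 <= tag <= 15:
        raise ValueError("tags are 4 bits wide")
    if tag == 0:
        return "valid data"
    for name, tags in TAG_SOURCES.items():
        if tag in tags:
            return name
    return "unassigned"   # the slide lists no producer for this tag

@dataclass
class FPRegister:
    value: float = 0.0
    busy: int = 0   # 0: value is valid; 1: waiting on the producer named by tag
    tag: int = 0

f0 = FPRegister()
f0.busy, f0.tag = 1, 4        # F0 now waits for the value from load buffer 4
print(source_of(f0.tag))      # load buffer
```

A register with `busy = 1` never needs to name another register as its source; it names only the unit that will eventually broadcast the value.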
6. Tomasulo: Representing Data Dependences
Inputs
Operand is a register with busy bit = 0
Data copied immediately (through register bus) into reservation station
Tag field of RS set to 0
Operand is a register with busy bit = 1
Tag field of RS receives a copy of the register tag field
Operand is a load buffer that contains valid data
Data copied into RS
Operand is a load buffer that is awaiting data
Tag field of RS receives tag of load buffer
Outputs
Output is a register
Busy bit set to 1, tag set to RS tag
Output is a store buffer
Tag set to RS tag, destination address set
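The input and output rules for register operands can be sketched in a few lines. This is a minimal Python illustration, not the actual hardware logic; `Reg`, `RSOperand`, `read_operand`, and `claim_output` are hypothetical names, and the instruction in the usage lines is made up. Load and store buffers are left out.

```python
from dataclasses import dataclass

@dataclass
class Reg:
    value: float = 0.0
    busy: int = 0      # 0 = holds valid data, 1 = waiting on `tag`
    tag: int = 0

@dataclass
class RSOperand:
    value: float = 0.0
    tag: int = 0       # 0 = value is present in the reservation station

def read_operand(reg: Reg) -> RSOperand:
    """Input rule: copy the value if it is valid, otherwise copy the register's tag."""
    if reg.busy == 0:
        return RSOperand(value=reg.value, tag=0)
    return RSOperand(tag=reg.tag)

def claim_output(reg: Reg, rs_tag: int) -> None:
    """Output rule: mark the register busy and point it at the producing RS."""
    reg.busy, reg.tag = 1, rs_tag

# Hypothetical add F0 <- F0 + F2, issued to adder RS 10 while F0 waits on load buffer 4.
regs = {"F0": Reg(busy=1, tag=4), "F2": Reg(value=2.5)}
left = read_operand(regs["F0"])      # receives tag 4
right = read_operand(regs["F2"])     # receives the value 2.5, tag 0
claim_output(regs["F0"], rs_tag=10)  # F0's tag changes from 4 to 10
print(left, right, regs["F0"])
```

Sources are read before the destination is claimed, so an instruction that reads and writes the same register still sees the older producer's tag.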
7. Three Stages of Tomasulo Algorithm
Issue: get instruction from FP operation queue
If a reservation station is free, issue the instruction and send the operands (renaming registers)
Execution: operate on operands (EX)
When both operands are ready, execute; if not ready, watch the CDB for the result
Write Result: finish execution (WB)
Write on Common Data Bus to all awaiting units; mark reservation station available
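The three stages can be put together in a toy simulator. The sketch below is a simplified Python model under several assumptions that are not in the slides: only ADD and MUL exist, latencies are 2 and 4 cycles, a single CDB carries one result per cycle, and a register's busy bit and tag are folded into one `producer` entry (0 means valid). `Station` and `Machine` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

LATENCY = {"ADD": 2, "MUL": 4}   # assumed execution latencies, in cycles

@dataclass
class Station:
    tag: int
    op: Optional[str] = None     # None means the station is free
    vj: float = 0.0              # operand values
    vk: float = 0.0
    qj: int = 0                  # producer tags; 0 means the value is present
    qk: int = 0
    remaining: int = 0           # execution cycles still needed

class Machine:
    def __init__(self, regs):
        self.value = dict(regs)
        self.producer = {r: 0 for r in regs}   # 0 = register holds valid data
        self.stations = [Station(t) for t in (8, 9, 10, 11, 12)]

    def issue(self, op, dst, src1, src2):
        """Issue: claim a free station, capture values or tags, rename dst."""
        tags = (8, 9) if op == "MUL" else (10, 11, 12)
        for rs in self.stations:
            if rs.tag in tags and rs.op is None:
                rs.op, rs.remaining = op, LATENCY[op]
                rs.qj, rs.vj = self.producer[src1], self.value[src1]
                rs.qk, rs.vk = self.producer[src2], self.value[src2]
                self.producer[dst] = rs.tag
                return True
        return False                           # no free station: stall

    def step(self):
        """One cycle: Execute ready stations, at most one Write Result."""
        ready = [rs for rs in self.stations
                 if rs.op and rs.qj == 0 and rs.qk == 0]
        done = next((rs for rs in ready if rs.remaining == 0), None)
        if done is not None:
            result = done.vj + done.vk if done.op == "ADD" else done.vj * done.vk
            self.broadcast(done.tag, result)
            done.op = None                     # station becomes available
        for rs in ready:
            if rs is not done and rs.remaining > 0:
                rs.remaining -= 1

    def broadcast(self, tag, result):
        """Write Result: every station or register waiting on tag latches result."""
        for rs in self.stations:
            if rs.qj == tag:
                rs.qj, rs.vj = 0, result
            if rs.qk == tag:
                rs.qk, rs.vk = 0, result
        for reg in self.producer:
            if self.producer[reg] == tag:
                self.producer[reg], self.value[reg] = 0, result

m = Machine({"F0": 1.0, "F2": 2.0, "F4": 3.0, "F6": 0.0})
m.issue("MUL", "F0", "F2", "F4")   # tag 8
m.issue("ADD", "F6", "F0", "F2")   # tag 10, waits on tag 8 (RAW)
m.issue("ADD", "F0", "F2", "F4")   # tag 11, WAW with the MUL, WAR with the first ADD
while any(rs.op for rs in m.stations):
    m.step()
print(m.value["F0"], m.value["F6"])   # 5.0 8.0
```

The second ADD finishes before the MUL, yet F0 ends up holding the ADD's result: when the MUL finally broadcasts tag 8, F0 is no longer waiting on that tag, so only the first ADD's reservation station picks the value up.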
8. Tomasulo: State Transitions
9. Tomasulo: Example
10. Tomasulo Example Cycle 0
System is quiescent
11. Tomasulo Example Cycle 1
(A) will arrive at tag 4
(F0) will come from tag 4
F0 is set to “busy”
12. Tomasulo Example Cycle 2
(F0) will be produced at tag 10
Right input of adder came from register (tag bit = 0)
Left input of adder will come from tag 4
Forwarding tag of F0 has been changed from 4 to 10
13. Tomasulo Example Cycle 3
(F0) will be produced at tag 11
(B) will arrive at tag 3
Right input of adder will come from tag 3
Left input of adder will come from tag 10
(A) arrives from memory
Forwarding tag of F0 has been changed from 10 to 11
14. Tomasulo Example Cycle 4
(F2) will be produced at tag 12
Right input of adder came from register (tag bit = 0)
Left input of adder came from register (tag bit = 0)
(A) with tag 4 is broadcast on CDB
Adder (at tag 10) picks it up, and is thereby enabled
The instruction that will write F2 has already read the old contents of F2
15. Tomasulo Example Cycle 5
(F1) will be produced at tag 8
Right input of multiplier will come from tag 12
Left input of multiplier came from register (tag bit = 0)
Adder (at tag 10) starts computing
(B) arrives from memory
16. Tomasulo Example Cycle 6
Memory address of destination is C
Data will come from tag 8
Adder (at tag 10) finishes computing
(B) with tag 3 is broadcast on CDB
Adder (at tag 11) picks it up
Adder (at tag 12) starts computing
17. Tomasulo Example Cycle 7
(F1) will be produced at tag 9
Right input of divider will come from tag 11
Left input of divider will come from tag 8
Result of adder (with tag 10) is broadcast on CDB
Adder (at tag 11) picks it up and is thereby enabled
Adder (at tag 12) finishes computing
18. Tomasulo Example Cycle 8
Result of adder (at tag 12) is broadcast on CDB
Multiplier (at tag 8) picks it up and is thereby enabled
19. Tomasulo Example Cycle 9
Multiplier (at tag 8) starts computing
Adder (at tag 11) finishes computing
20. Tomasulo Example Cycle 10
Result of adder (with tag 11) is broadcast on CDB
Divider (at tag 9) picks it up
Register F0 picks it up
21. Tomasulo Example Cycle 11
Multiplier (at tag 8) finishes computing
22. Tomasulo Example Cycle 12
Result of multiplier (at tag 8) is broadcast on CDB
Divider (at tag 9) picks it up, and is thereby enabled
Store buffer (at tag 1) picks it up, and is thereby enabled
23. Observations on Tomasulo’s Algorithm
Instructions: move from decoder to reservation stations
in program order
dependences can be correctly recorded
Data Flow Graph: The graph of pointers connecting the RS, registers, and memory buffers
helps accomplish out-of-order sequencing of instructions
Chief cost of this scheme: high-speed associative hardware
RS hardware has to search for tags when CDB broadcasts some value with its tag
Full load bypassing is supported
load and store buffers are treated just like functional units
additional hardware on 360/91 also supported load forwarding
24. Tomasulo: Example of Load Bypassing
Instruction 202 depends on instructions 200 and 201, so instruction 203 will start executing well before 202 (assuming C and D are found to be different memory addresses)
Work out details off-line
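The address check behind load bypassing, and the load forwarding mentioned on the previous slide, can be sketched as follows. This is a hedged Python illustration only: the function `resolve_load`, the `(address, data)` store representation, and the numeric addresses for C and D are all assumptions made for the example, and the slide's instructions 200 to 203 are not reproduced.

```python
from typing import List, Optional, Tuple

# (address, data); data is None while the store still waits for its value
Store = Tuple[int, Optional[float]]

def resolve_load(addr: int, older_stores: List[Store]) -> Tuple[str, Optional[float]]:
    """Decide what a load may do given older stores that have not yet written memory.

    older_stores is in program order, oldest first.
    """
    for st_addr, st_data in reversed(older_stores):   # youngest matching store wins
        if st_addr == addr:
            if st_data is None:
                return ("wait", None)        # same address, data not ready yet
            return ("forward", st_data)      # load forwarding from the store buffer
    return ("bypass", None)                  # no address match: go to memory now

C, D = 0x100, 0x200                          # made-up addresses for illustration
pending = [(C, None)]                        # a store to C still waiting for its data
print(resolve_load(D, pending))              # ('bypass', None)
print(resolve_load(C, pending))              # ('wait', None)
print(resolve_load(C, [(C, 7.5)]))           # ('forward', 7.5)
```

The bypass case is the one the slide relies on: once C and D are known to differ, the load from D has no reason to wait for the store to C.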
25. Tomasulo: “Loop Unrolling in Hardware”
360/91 supported a limited kind of speculation
Small loops could be held in a loop buffer
Loop closing branches were predicted as taken
This has the effect of loop unrolling at run-time
Given the small number of FP registers in machine, software loop unrolling was not a viable option
26. Tomasulo Loop Example
Loop: L.D    F0, 0(R1)
      MULT.D F4, F0, F2
      S.D    F4, 0(R1)
      SUBI   R1, R1, #8
      BNEZ   R1, Loop
Multiply takes 4 clocks
Loads have cache misses
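To see why renaming acts like loop unrolling, it helps to track which tags each iteration's instructions receive at issue. The sketch below is a simplified Python model with several assumptions: nothing completes while the iterations are being issued (in line with the cache-miss assumption above), the integer instructions SUBI and BNEZ are omitted, the branch is taken as predicted, and stores are not given tags. `LOOP_BODY` and `issue_iterations` are hypothetical names.

```python
# FP part of the loop body: (opcode, destination register, source registers)
LOOP_BODY = [
    ("L.D",    "F0", []),             # F0 <- Mem[0 + R1]
    ("MULT.D", "F4", ["F0", "F2"]),   # F4 <- F0 * F2
    ("S.D",    None, ["F4"]),         # Mem[0 + R1] <- F4
]

def issue_iterations(n):
    """Issue n iterations in program order and record the tags each one sees."""
    free = {"L.D": [1, 2, 3, 4, 5, 6], "MULT.D": [8, 9]}   # tags from the slides
    producer = {}     # register -> tag of its pending producer (absent = valid)
    trace = []
    for it in range(n):
        for op, dst, srcs in LOOP_BODY:
            if op in free and not free[op]:
                return trace                   # no free buffer or station: stall
            src_tags = [producer.get(r, 0) for r in srcs]
            tag = free[op].pop(0) if op in free else None
            if dst is not None:
                producer[dst] = tag            # rename the destination
            trace.append((it, op, tag, src_tags))
    return trace

for entry in issue_iterations(2):
    print(entry)
# (0, 'L.D', 1, [])
# (0, 'MULT.D', 8, [1, 0])
# (0, 'S.D', None, [8])
# (1, 'L.D', 2, [])
# (1, 'MULT.D', 9, [2, 0])
# (1, 'S.D', None, [9])
```

Both iterations name F0 and F4, but after renaming the first chain runs through tags 1 and 8 and the second through tags 2 and 9, so the two iterations share no dependence and can overlap. With only two multiplier stations, a third iteration would stall at its multiply until a station frees up.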
27-48. Loop Example Cycle 0 through Cycle 21 (one figure per cycle)
49. Summary of Tomasulo’s Algorithm
Prevents registers from becoming a bottleneck
Avoids WAR and WAW hazards of scoreboard
Allows loop unrolling in hardware
Not limited to basic blocks (provided we have branch prediction)
Lasting contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
Next: Dynamic branch prediction