1 / 31

Tomasulo’s Approach and Hardware Based Speculation

Vincent H. Berk October 22nd Reading for Today: 3.1 – 3.7. Tomasulo’s Approach and Hardware Based Speculation. Hardware Schemes for ILP. Why in hardware at run time? Works when dependence is not known at run time Simplifies compiler Allows code for one machine to run well on another

jena-jones
Download Presentation

Tomasulo’s Approach and Hardware Based Speculation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ENGS 116 Lecture 10 Vincent H. Berk October 22nd Reading for Today: 3.1 – 3.7 Tomasulo’s Approach andHardware Based Speculation

  2. ENGS 116 Lecture 8 Hardware Schemes for ILP • Why in hardware at run time? • Works when dependence is not known at run time • Simplifies compiler • Allows code for one machine to run well on another • Key idea: Allow instructions behind stall to proceed • DIVD F0, F2, F4 • ADDD F10, F0, F8 • SUBD F12, F8, F14 • Enables out-of-order execution  out-of-order completion • ID stage checks for both structural hazards and data dependences

  3. ENGS 116 Lecture 8 Hardware Schemes for ILP • Out-of-order execution divides ID stage: 1. Issue — decode instructions, check for structural hazards 2. Read operands — wait until no data hazards, then read operands

  4. ENGS 116 Lecture 8 Tomasulo’s Algorithm • For IBM 360/91 about 3 years after CDC 6600 • Goal: High performance without special compilers • Differences between IBM 360 & CDC 6600 ISA • IBM has only 2 register specifiers/instruction vs. 3 in CDC 6600 • IBM has 4 FP registers vs. 8 in CDC 6600 • Differences between Tomasulo’s Algorithm & Scoreboard • Control & buffers (called “reservation stations”) distributed with functional units vs. centralized in scoreboard • Registers in instructions replaced by pointers to reservation station buffer • HW renaming of registers to avoid WAR, WAW hazards • Common data bus (CDB) broadcasts results to functional units • Load and stores treated as functional units as well • Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, ...

  5. ENGS 116 Lecture 8 Three Stages of Tomasulo Algorithm • 1. Issue: Get instruction from FP operation queue • If reservation station free, issues instruction & sends operands (renames registers). • 2. Execution: Operate on operands (EX) • When operands ready then execute; if not ready, watch common data bus for result. • 3. Write result: Finish execution (WB) • Write on common data bus to all awaiting units; mark reservation station available. • Common data bus: data + source (“come from” bus)

  6. ENGS 116 Lecture 8 Tomasulo Organization From Instruction Unit FP Registers From Memory Load Buffers FP Op Queue Store Buffers Operand Bus To Memory Operation Bus FP Add Res. Station FP Mul Res. Station Reservation Stations FP Adders FP Multipliers Common data bus (CDB)

  7. ENGS 116 Lecture 8 Reservation Station Components • Op – Operation to perform in the unit (e.g., + or – ) • Qj, Qk – Reservation stations producing source registers • Vj, Vk – Value of source operands • Rj, Rk – Flags indicating when Vj, Vk are ready • Busy – Indicates reservation station and FU is busy • Register result status – Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.

  8. ENGS 116 Lecture 8 Tomasulo Example Cycle 1

  9. ENGS 116 Lecture 8 Tomasulo Example Cycle 2

  10. ENGS 116 Lecture 8 Tomasulo Example Cycle 3 Register names are renamed in reservation stations Load1 completing — who is waiting for Load1?

  11. ENGS 116 Lecture 8 Tomasulo Example Cycle 4 Load2 completing — who is waiting for it?

  12. ENGS 116 Lecture 8 Tomasulo Example Cycle 5

  13. ENGS 116 Lecture 8 Tomasulo Example Cycle 6

  14. ENGS 116 Lecture 8 Tomasulo Summary • Reservation stations: renaming to larger set of registers + buffering source operands • Prevents registers as bottleneck • Avoids WAR, WAW hazards of scoreboard • Allows loop unrolling in HW • Not limited to basic blocks • (integer units get ahead, beyond branches) • Lasting Contributions • Dynamic scheduling • Register renaming • Load/store disambiguation • 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

  15. Hardware-Based Speculation • Instead of just instruction fetch and decode, also execute instructions based on prediction of branch. • Execute instructions out of order as soon as their operands are available. • Wait with instruction commit until branch is decided. • Re-order instructions after execution and commit them in order • reorder buffer or ROB • register file not updated until commit • Do not raise exceptions until instruction is committed • ROB holds and provides operands until commit.

  16. Tomasulo with Speculation • Issue – Empty reservation station and an empty ROB slot. Send operands to reservation station from register file or from ROB. This stage is often referred to as: dispatch • Execute – Monitor CDB for operands, check RAW hazards. When both operands are available, then execute. • Write Result – When available, write result to CDB through to ROB and any waiting reservation stations. Stores write to value field in ROB. • Commit – Three cases: • Normal Commit: write registers, in order commit • Store: update memory • Incorrect branch: flush ROB, reservation stations and restart execution at correct PC

  17. Problems with speculation • Multi Issue Machines: • Must be able to commit multiple instructions from ROB • More registers, more renaming • How much speculation: • How many branches deep? • What to do on a cache miss? • TLB miss? • Cache interference due to incorrect branch prediction

  18. Figure: 3.41 Number of registers available for renaming.

  19. Figure: 3.45 Window size: the number of instructions the issue unit may look ahead and schedule from.

  20. ENGS 116 Lecture 11 ILP: Software Approaches 2

  21. HW Support for More ILP • Avoid branch prediction by turning branches into conditionally executed instructions: • If (X) then A = B op C else NOP • If false, then neither store result nor cause exception • Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instruction. • IA-64: 61 1-bit condition fields selected so conditional execution of any instruction • Drawbacks to conditional instructions • Still takes a clock even if “annulled” • Stall if condition evaluated late • Complex conditions reduce effectiveness; condition becomes known late in pipeline X A = B op C

  22. Iteration 0 Iteration Iteration 1 Iteration 2 Iteration 3 4 pipelined Software Pipelining • Observation: if iterations from loops are independent, then can get more ILP by taking instructions from different iterations • Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop Software- iteration

  23. Before: Unrolled 3 times After: Software Pipelined SW Pipelining Example 1 LD F0, 0 (R1) LD F0, 0 (R1) 2 ADDD F4, F0, F2 ADDD F4, F0, F2 3 SD 0 (R1), F4 LD F0, –8 (R1) 4 LD F6, –8 (R1) 1 SD 0 (R1), F4 Stores M[i] 5 ADDD F8, F6, F2 2 ADDD F4, F0, F2 Adds to M[i-1] 6 SD –8, (R1), F8 3 LD F0, –16 (R1) Loads M[i-2] 7 LD F10, –16 (R1) 4 SUBI R1, R1, #8 8 ADDD F12, F10, F2 5 BNEZ R1, LOOP 9 SD –16 (R1), F12 SD 0 (R1), F4 10 SUBI R1, R1, #24 ADDD F4, F0, F2 11 BNEZ R1, LOOP SD –8 (R1), F4 Read F4Read F0 SD IF ID EX Mem WB Write F4 ADD IF ID EX Mem WB LD IF ID EX Mem WB Write F0

  24. SW Pipelining Example • Symbolic Loop Unrolling • Smaller code space • Overhead paid only once vs. each iteration in loop unrolling 100 iterations = 25 loops with 4 unrolled iterations each Software Pipelining Number of overlapped operations (a) Software pipelining Time Loop Unrolling Number of overlapped operations Time (b) Loop unrolling

  25. Trace Scheduling • Focus on critical path (trace selection) • Compiler has to decide what the critical path (the trace) is • Most likely basic blocks are put in the trace • Loops are unrolled in the trace • Now speed it up (trace compaction) • Focus on limiting instruction count • Branches are seen as jumps into or out of the trace • Problem: • Significant overhead for parts that are not in the trace • Unclear if it is feasible in practice

  26. Superblocks • Similar to Trace Scheduling but: • Single entrance, multiple exits • Tail duplication: • Handle cases that exited the superblock • Residual loop handling • Could in itself be a superblock • Problem: • Code size • Worth the hassle?

  27. Conditional instructions • Instruction that is executed depending on one of its arguments: • Instruction is executed but results are not always written. • Should only be used for very small sequences, else use normal branch. BNEZ R1, L ADDU R2, R3, R0 L: CMOVZ R2, R3, R1

  28. Speculation • Compiler moves instructions before branch if: • Data flow is not affected (optionally with use of renaming) • Preserve exception behavior • Avoid load/store address conflicts (no renaming for memory loc.) • Preserving exception behavior • Mechanism to indicate an instruction is speculative • Poison bit: raise exception when value is used • Using Conditional instructions: • Requires In-Order instruction commit • Register renaming • Writeback at commit • Forwarding • Raise exceptions at commit

  29. Speculation if (A==0) A=B; else A=A+4; LD R1, 0(R3) ; load A BNEZ R1, L1 ; test A LD R1, 0(R2) ; then J L2 ; skip else L1: DADDI R1, R1, #4 ; else L2: SD R1, 0(R3) ; store A LD R1, 0(R3) ; load A LD R14, 0(R2) ; load B (speculative) BEQZ R1, L3 ; branch if DADDI R14, R1, #4 ; else L3: SD R14, 0(R3) ; store A

More Related