
Advanced Pipelining

Advanced Pipelining: Out-of-Order Processors (COMP25212). Recap of out-of-order execution with a scoreboard: a centralized data structure that tracks the status of registers, FUs and instructions and builds the dependency graph dynamically, in hardware.

anitra

Presentation Transcript


  1. Advanced Pipelining: Out-of-Order Processors (COMP25212)

  2. From Monday… Out-of-Order Execution with Scoreboard
     • A centralized data structure tracks the status of registers, FUs and instructions and builds the dependency graph dynamically, in hardware
     • The centralized nature limits scalability:
       • Small number of FUs and small window of instructions
     • Dependencies
       • RAW: stall the conflicting instruction
       • WAW: stall the pipeline
       • WAR: stall WB

  3. Out of Order Execution with Tomasulo

  4. Tomasulo’s Algorithm
     • Control logic for out-of-order execution is decentralized
     • Reservation Stations (RS) in the functional units keep instruction information
     • In addition, the RS rename registers implicitly
     • A Common Data Bus (CDB) broadcasts data and results to the different devices
     • Only one instruction can finish (write its result) each cycle
     • Distributed control allows for a larger window of instructions (dynamic scheduling)

  5. Tomasulo’s Algorithm
     • Structural hazards stall the pipeline
     • The RS track when operands become available and buffer them as soon as they are
       • No need to access the register bank (stored values or sources)
     • The impact of RAW dependencies is limited
       • An instruction executes as soon as its operands are available
     • WAW and WAR dependencies are avoided
       • Register renaming

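  6. Register Renaming (Example)
     • Eliminates WAR and WAW hazards by renaming all destination registers
     • Can be done by the compiler

         DIV.D F0, F2, F4
         ADD.D F6, F0, F8
         ST.D  F6, 0(R1)
         SUB.D F8, F10, F14
         MUL.D F6, F10, F8

     • True dependences: DIV.D to ADD.D (via F0), ADD.D to ST.D (via F6), SUB.D to MUL.D (via F8)
     • Antidependence: ADD.D reads F8, which SUB.D later writes
     • Output dependence: ADD.D and MUL.D both write F6
     • Figure labels: S and T (temporary registers used for the renaming)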
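The renaming idea on slide 6 can be sketched in a few lines of Python. This is a minimal illustration, not the hardware mechanism: the physical register names `P0`, `P1`, ... and the tuple encoding of instructions are assumptions made for the example, and the store is modelled as reading `F6` and `R1` without writing any register.

```python
from itertools import count

# Each instruction: (opcode, destination or None, source registers...)
PROGRAM = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("ST.D", None, "F6", "R1"),   # a store reads F6 and R1, writes no register
    ("SUB.D", "F8", "F10", "F14"),
    ("MUL.D", "F6", "F10", "F8"),
]


def rename(program):
    """Give every destination a fresh name; sources follow the latest mapping."""
    fresh = (f"P{i}" for i in count())  # hypothetical physical register names
    mapping = {}                        # architectural register -> current name
    renamed = []
    for op, dest, *srcs in program:
        # Read sources first so an instruction never sees its own new name.
        new_srcs = [mapping.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)      # fresh name: no WAR or WAW on it
            mapping[dest] = new_dest
        renamed.append((op, new_dest, *new_srcs))
    return renamed


RENAMED = rename(PROGRAM)
for instr in RENAMED:
    print(instr)
```

After renaming, ADD.D still reads the original `F8` while SUB.D writes `P2`, so the antidependence disappears; ADD.D writes `P1` and MUL.D writes `P3`, so the output dependence disappears too. The true dependences survive because each source is looked up in the mapping before the destination is renamed.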

  7. Tomasulo Organization (block diagram)
     • FP Op Queue, FP Registers
     • Load Buffers (Load1 to Load6), filled from memory; Store Buffers, written to memory
     • Reservation Stations: Add1, Add2, Add3 (FP adders); Mult1, Mult2 (FP multipliers)
     • Common Data Bus (CDB)
     • Normal data bus: data + destination
     • Common data bus: data + source

  8. Stages of a Tomasulo Pipeline (diagram): Issue, followed by parallel Execute paths (Integer, FP Multiplication, FP Multiplication, FP Division, FP Add), each ending in Write Back

  9. Three Stages of Tomasulo Algorithm
     1. Issue: get an instruction from the FP Op Queue. If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renames registers).
     2. Execute (EX): operate on the operands. When both source operands are ready, execute; if not ready, watch the Common Data Bus for the result.
     3. Write result (WB): finish execution. Write on the Common Data Bus to all awaiting units; mark the reservation station available.
     • Normal data bus: data + destination (“go to” bus)
     • Common data bus: data + source (“come from” bus)
       • 64 bits of data + 4 bits of Functional Unit source address
       • A unit writes the value if the source matches the Functional Unit it expects to produce its result
       • Does the broadcast

  10. Reservation Station Components: no information about instructions is needed

  11. Tomasulo Example
     • Instruction stream
     • Instruction status: Tomasulo does not need this information; the times for each stage are shown for convenience

  12. Reservation Station Components (no information about instructions is needed)
     • Op: operation to perform in the unit (e.g., + or –)
     • Vj, Vk: values of the source operands
       • Store buffers have a V field: the result to be stored
     • Qj, Qk: reservation stations producing the source registers (value to be written)
       • Note: Qj, Qk = 0 => ready
       • Store buffers only have Qi, for the RS producing the result
     • Busy: indicates that the reservation station or FU is busy

  13. Tomasulo Example (table annotations)
     • Reservation stations: 3 load buffers, 3 adder, 2 multiplication
     • FU countdown
     • Source registers
     • Which FU will produce the operands

  14. Reservation Station Components (no information about instructions is needed)
     • Op: operation to perform in the unit (e.g., + or –)
     • Vj, Vk: values of the source operands
       • Store buffers have a V field: the result to be stored
     • Qj, Qk: reservation stations producing the source registers (value to be written)
       • Note: Qj, Qk = 0 => ready
       • Store buffers only have Qi, for the RS producing the result
     • Busy: indicates that the reservation station or FU is busy
     • Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
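The fields listed above, together with the issue and write-result stages from slide 9, can be sketched as a small Python data structure. This is an illustrative model under stated assumptions, not a description of real hardware: the class name `Station`, the helper names `issue`, `operand` and `broadcast`, the use of `None` in place of "Q = 0", and the register values are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # producer tags; None stands for "Q = 0"
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


def operand(src, regs, reg_status):
    """Return (value, None) if src is available, else (None, producer tag)."""
    tag = reg_status.get(src)
    return (regs[src], None) if tag is None else (None, tag)


def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Issue: capture each operand's value, or the tag of its producer."""
    rs.busy, rs.op = True, op
    rs.vj, rs.qj = operand(src_j, regs, reg_status)
    rs.vk, rs.qk = operand(src_k, regs, reg_status)
    reg_status[dest] = rs.name   # register result status: rs will write dest


def broadcast(tag, value, stations, regs, reg_status):
    """Write result: every unit waiting on `tag` takes the value from the CDB."""
    for rs in stations:
        if rs.name == tag:
            rs.busy = False      # the producer's station becomes available
            continue
        if rs.busy and rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.busy and rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg in [r for r, t in reg_status.items() if t == tag]:
        regs[reg] = value        # register bank is updated from the same bus
        del reg_status[reg]


regs = {"F2": 0.0, "F4": 2.0, "F6": 1.5}   # made-up register values
reg_status = {"F2": "Load2"}               # Load2 has not delivered F2 yet
load2 = Station("Load2", busy=True, op="L.D")
mult1, add1 = Station("Mult1"), Station("Add1")
stations = [load2, mult1, add1]

issue(mult1, "MUL.D", "F0", "F2", "F4", regs, reg_status)
issue(add1, "SUB.D", "F8", "F6", "F2", regs, reg_status)
ready_before = (mult1.ready(), add1.ready())
broadcast("Load2", 3.0, stations, regs, reg_status)
ready_after = (mult1.ready(), add1.ready())
```

In this sketch both `Mult1` and `Add1` wait on the tag `Load2`, and a single broadcast makes both ready at once, which is the "come from" behaviour of the Common Data Bus.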

  15. Tomasulo Example (table annotations): which RS will write each register; clock cycle counter

  16. A Tomasulo Example
     The following code is run on a Tomasulo pipeline with:

         Functional Unit (FU)       # of FUs   EX cycles
         FP Multiply/Division       2          10/40
         FP Addition/Subtraction    3          2
         Mem Load                   3          2

     Functional units are not pipelined.

         L.D   F6, 34(R2)
         L.D   F2, 45(R3)
         MUL.D F0, F2, F4
         SUB.D F8, F6, F2
         DIV.D F10, F0, F6
         ADD.D F6, F8, F2

  17. Dependency Graph For Example Code

         1  L.D   F6, 34(R2)
         2  L.D   F2, 45(R3)
         3  MUL.D F0, F2, F4
         4  SUB.D F8, F6, F2
         5  DIV.D F10, F0, F6
         6  ADD.D F6, F8, F2

     • Real data dependence (RAW): (1, 4) (1, 5) (2, 3) (2, 4) (2, 6) (3, 5) (4, 6)
     • Output dependence (WAW): (1, 6)
     • Anti-dependence (WAR): (5, 6)
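The pairs listed on slide 17 can be checked mechanically. The sketch below is one simple way to do it, assuming a hypothetical tuple encoding `(opcode, destination, sources)` and a pairwise scan that ignores a pair when another instruction in between rewrites the register. It reproduces the slide's RAW and WAW lists. For WAR it also reports (4, 6), since SUB.D reads F6 before ADD.D writes it; the slide lists only (5, 6), presumably because the RAW dependence (4, 6) already orders that pair.

```python
PROGRAM = [
    ("L.D", "F6", ("R2",)),
    ("L.D", "F2", ("R3",)),
    ("MUL.D", "F0", ("F2", "F4")),
    ("SUB.D", "F8", ("F6", "F2")),
    ("DIV.D", "F10", ("F0", "F6")),
    ("ADD.D", "F6", ("F8", "F2")),
]


def dependences(program):
    """Return (raw, war, waw) as sets of 1-based instruction pairs (i, j), i < j."""
    raw, war, waw = set(), set(), set()

    def rewritten(reg, i, j):
        # True if some instruction strictly between i and j writes reg.
        return any(program[k - 1][1] == reg for k in range(i + 1, j))

    for j in range(1, len(program) + 1):
        _, dest_j, srcs_j = program[j - 1]
        for i in range(1, j):
            _, dest_i, srcs_i = program[i - 1]
            if dest_i in srcs_j and not rewritten(dest_i, i, j):
                raw.add((i, j))   # j reads what i wrote
            if dest_j in srcs_i and not rewritten(dest_j, i, j):
                war.add((i, j))   # j overwrites what i reads
            if dest_i == dest_j and not rewritten(dest_i, i, j):
                waw.add((i, j))   # i and j write the same register
    return raw, war, waw


RAW, WAR, WAW = dependences(PROGRAM)
print("RAW:", sorted(RAW))
print("WAW:", sorted(WAW))
print("WAR:", sorted(WAR))
```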

  18. Tomasulo Example

  19. Tomasulo Example Cycle 1 LD#1 issued

  20. Tomasulo Example Cycle 2 LD#2 issued

  21. Tomasulo Example Cycle 3 • MULTD is issued • LD#1 completes and broadcasts its result

  22. Tomasulo Example Cycle 4 • SUBD is issued • LD#1 result updates the register bank • LD#2 completes, broadcasting its result

  23. Tomasulo Example Cycle 5 • DIVD is issued • LD#2 result updates the register bank • Add1, Mult1 start execution

  24. Tomasulo Example Cycle 6 • ADDD issued

  25. Tomasulo Example Cycle 7 • Add1 (SUBD) completes and broadcasts result

  26. Tomasulo Example Cycle 8 • Add1 (SUBD) result updates the register bank • Add2 (ADDD) starts execution

  27. Tomasulo Example Cycle 9 • ADDD and MULTD continue execution

  28. Tomasulo Example Cycle 10 • Add2 (ADDD) completes

  29. Tomasulo Example Cycle 11 • ADDD result updates the register bank

  30. Tomasulo Example Cycle 12 • MULTD continues execution

  31. Tomasulo Example Cycle 13 • MULTD continues execution

  32. Tomasulo Example Cycle 14 • MULTD continues execution

  33. Tomasulo Example Cycle 15 • MULTD completes and broadcasts result

  34. Tomasulo Example Cycle 16 • MULTD result updates the register bank • DIVD starts execution

  35. 39 cycles later…

  36. Tomasulo Example Cycle 55 • DIVD is about to complete

  37. Tomasulo Example Cycle 56 • DIVD completes

  38. Tomasulo Example Cycle 57 • DIVD result updates the register bank

  39. Tomasulo Example Cycle 57
     • In-order issue
     • Out-of-order execution
     • Out-of-order completion
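The completion and register-update cycles in the walkthrough (slides 19 to 38) can be reproduced with a small timing model. This is a sketch under simplifying assumptions that happen to hold for this example: one instruction issues per cycle with no structural stalls, no two results compete for the CDB in the same cycle, and an operand can be used in the cycle after its producer's register update. The field names (`issue`, `start`, `complete`, `write`) are invented here; `start` is the first cycle counted toward the latency, which is one cycle later than the cycle the slides label "starts execution".

```python
LATENCY = {"L.D": 2, "ADD.D": 2, "SUB.D": 2, "MUL.D": 10, "DIV.D": 40}

PROGRAM = [
    ("L.D", "F6", ("R2",)),
    ("L.D", "F2", ("R3",)),
    ("MUL.D", "F0", ("F2", "F4")),
    ("SUB.D", "F8", ("F6", "F2")),
    ("DIV.D", "F10", ("F0", "F6")),
    ("ADD.D", "F6", ("F8", "F2")),
]


def schedule(program, latency):
    """Per-instruction cycle numbers under the assumptions stated above."""
    last_writer = {}   # register -> index of the most recent earlier writer
    rows = []
    for idx, (op, dest, srcs) in enumerate(program):
        issue = idx + 1                       # in-order issue, one per cycle
        # Latest register-update cycle among the producers of the sources.
        operands_ready = max(
            (rows[last_writer[s]]["write"] for s in srcs if s in last_writer),
            default=0,
        )
        start = max(issue, operands_ready) + 1
        complete = start + latency[op] - 1    # result is broadcast here
        write = complete + 1                  # register bank is updated here
        rows.append({"op": op, "issue": issue, "start": start,
                     "complete": complete, "write": write})
        last_writer[dest] = idx               # later readers depend on this one
    return rows


ROWS = schedule(PROGRAM, LATENCY)
for row in ROWS:
    print(row)
```

With the latencies from slide 16, the model gives completion cycles 3, 4, 15, 7, 56, 10 and register updates in cycles 4, 5, 16, 8, 57, 11 for the six instructions, which matches the walkthrough: SUB.D and ADD.D finish long before the earlier MUL.D and DIV.D.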

  40. Tomasulo’s advantages
     (1) Distributed hazard detection logic
       • Distributed reservation stations and the CDB
       • If multiple instructions are waiting on a single result, and each already has its other operand, they can be dispatched simultaneously when the result is broadcast on the CDB
       • If a centralized register file were used, the units would have to read their results from the registers when register buses are available
     (2) Avoids stalling due to WAW or WAR hazards

  41. Tomasulo Drawbacks
     • Complexity of hardware
     • Performance limited by the Common Data Bus
       • Each CDB must go to all functional units => high capacitance, high wiring density
       • Number of functional units that can complete per cycle is limited to one!
       • Multiple CDBs => more FU logic for parallel stores
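The one-completion-per-cycle limit can be illustrated with a tiny arbitration sketch. The function name `arbitrate_cdb` and the policy (earliest-ready first, ties broken by the order in which units are listed) are assumptions made for the example; the slides do not say how real hardware picks a winner.

```python
def arbitrate_cdb(ready_cycles):
    """Grant a single CDB to one unit per cycle.

    ready_cycles maps a unit name to the first cycle its result could be
    broadcast. Returns a dict mapping each unit to the cycle it is granted.
    """
    grants = {}
    next_free = 0
    # sorted() is stable, so units ready in the same cycle keep dict order.
    for name, ready in sorted(ready_cycles.items(), key=lambda kv: kv[1]):
        cycle = max(ready, next_free)   # wait if the bus is already taken
        grants[name] = cycle
        next_free = cycle + 1
    return grants


GRANTS = arbitrate_cdb({"Add1": 8, "Mult1": 8, "Load1": 9})
print(GRANTS)
```

Here `Add1` and `Mult1` are both ready in cycle 8, so `Mult1` is pushed to cycle 9 and `Load1`, although ready in cycle 9, slips to cycle 10. A single delayed broadcast can therefore delay every dependent instruction behind it.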

  42. Summary
     • Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands
       • Prevents registers from being a bottleneck
       • Avoids the WAR and WAW hazards of the Scoreboard
     • Lasting contributions
       • Dynamic scheduling
       • Register renaming
       • Load/store disambiguation

  43. Summary of Out-of-Order Processors

  44. Out of Order Processors
     BENEFITS:
     • Accelerates the execution of programs
     • More efficient design
     • Increases the utilisation of processor resources
     LIMITATIONS:
     • More complex design
     • Very expensive in terms of area and power
     • Non-precise interrupts: interrupting exactly after an instruction might not be possible

  45. Scoreboard vs Tomasulo

  46. Example (figure annotations)
     • RAW: stall the pipeline
     • RAW: ADD stalled, SUB could be issued
     • RAW: ADD stalled, SUB can be issued
     • Dependences marked in the figure: RAW, WAW
     • Assuming no structural hazards; LD: 4 cycles, Add/Sub: 2 cycles, Mul/Div: 2 cycles

  47. Example (figure annotations)
     • WAW (Tomasulo): allowed by register renaming in the RS
     • WAW (Scoreboard): SUB cannot be issued; stall the pipeline
     • Assuming no structural hazards; LD: 4 cycles, Add/Sub: 2 cycles, Mul/Div: 2 cycles

  48. Example (figure annotations)
     • 2 instructions can finish at the same time
     • The CDB limits finishing instructions to one per cycle
     • Assuming no structural hazards; LD: 4 cycles, Add/Sub: 2 cycles, Mul/Div: 2 cycles
