
Instruction Level Parallelism and Tomasulo’s approach


amadis

Presentation Transcript


  1. Instruction Level Parallelism and Tomasulo’s approach CSCI 620 NOTE8

  2. Instruction Level Parallelism
     • Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
     • Reduce stalls, reduce CPI
     • Reduce CPI, increase IPC
     • Instruction-level parallelism (ILP) seeks to reduce stalls
     • The importance of ILP is most visible in loop-level parallelism:
         for (i=1; i<1000; i=i+1)
         {
             x[i] = x[i] + y[i];
         }
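
As a rough illustration of the CPI equation on this slide, the sketch below adds up the stall contributions and converts CPI to IPC. The function names and the stall figures are invented for illustration and do not come from the slides.

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Pipeline CPI = ideal CPI + stall cycles per instruction from each source."""
    return ideal_cpi + structural + data_hazard + control


def ipc(cpi):
    """IPC is the reciprocal of CPI."""
    return 1.0 / cpi


# Hypothetical stall cycles per instruction (illustrative values only).
before = pipeline_cpi(1.0, structural=0.25, data_hazard=0.5, control=0.25)
# Same pipeline if all data hazard stalls could be removed.
after = pipeline_cpi(1.0, structural=0.25, data_hazard=0.0, control=0.25)

print(before, ipc(before))
print(after, ipc(after))
```

Removing one stall term lowers CPI and therefore raises IPC, which is the chain of reasoning the bullets state.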

  3. Major Techniques to increase ILP

  4. Instruction Level Parallelism
     • ILP can be exploited by SW (static) or HW (dynamic) techniques
     • HW-intensive ILP dominates the desktop and server markets
     • SW (compiler-intensive) approaches are more likely seen in embedded systems, although IA-64 also uses this approach

  5. Dependences
     • Two instructions are parallel if they can execute simultaneously in a pipeline without causing any stalls (assuming no structural hazards) and can be reordered
     • Two dependent instructions are not parallel and cannot be reordered: they must execute in order, even though they may be partially overlapped
     • Three types of dependences:
         • Data dependences (= true data dependences)
         • Name dependences
         • Control dependences

  6. Dependences
     • Dependences are properties of programs
     • Whether a dependence results in an actual hazard, and the length of any stall, are properties of the pipeline organization
     • A dependence:
         • Indicates the potential for a hazard
         • Determines the order in which results must be calculated
         • Sets an upper bound on ILP
     • Problems caused by dependences can be addressed by:
         • Avoiding the hazard by rescheduling the code
         • Eliminating the dependence by transforming (altering) the code
     • The compiler is concerned with dependences in the program; whether a HW hazard occurs depends on the given pipeline

  7. Review of Data Hazards
     • Consider instructions i and j, where i occurs before j.
     • RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value
     • WAW (write after write): j tries to write an operand before it is written by i, so the writes end up in the wrong order (only possible in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled)
     • WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value (only possible when some instructions can write results early in the pipeline and other instructions can read sources late in the pipeline)
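
The three definitions can be turned into a small checker. This is a minimal sketch: the `Instr` tuple and the `hazards` function are hypothetical names, and it only reports which hazards are possible between an earlier instruction i and a later instruction j, not whether a given pipeline would actually stall.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str                # register written ("" if none)
    srcs: Tuple[str, ...]    # registers read


def hazards(i: Instr, j: Instr) -> Set[str]:
    """Hazard types that could arise between i and a later instruction j."""
    found = set()
    if i.dest and i.dest in j.srcs:
        found.add("RAW")     # j reads what i writes
    if i.dest and i.dest == j.dest:
        found.add("WAW")     # both write the same register
    if j.dest and j.dest in i.srcs:
        found.add("WAR")     # j overwrites something i still has to read
    return found


add = Instr("ADD.D", "F0", ("F2", "F4"))
print(hazards(add, Instr("SUB.D", "F6", ("F0", "F8"))))
print(hazards(add, Instr("SUB.D", "F2", ("F6", "F8"))))
print(hazards(add, Instr("SUB.D", "F0", ("F6", "F8"))))
```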

  8. (1) Data Dependences
     • (True) data dependences: instruction j is data dependent on instruction i if either
         • Instruction i produces a result used by instruction j (directly: j depends on i), or
         • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (indirectly: j depends on k, which depends on i)
     • Example (SUB.D reads the F0 written by ADD.D):
         i: ADD.D F0, F2, F4
         j: SUB.D F6, F0, F8
     • Easy to determine for registers (fixed names)
     • Harder to determine for memory:
         • Does 100(R4) = 20(R6)?
         • From different loop iterations, does 20(R4) = 20(R4)?
     • Will see a hardware technique in chap 2

  9. (2) Name Dependences
     • The second type of dependence is a name dependence: two instructions use the same name (same register or memory location) but do not exchange data
     • Antidependence
         • Instruction j writes a register or memory location that instruction i reads, and instruction i must execute first; if the order is not preserved, a WAR hazard results
             i: ADD.D F0, F2, F4
             j: SUB.D F2, F6, F8
     • Output dependence
         • Instruction i and instruction j write the same register or memory location; the ordering between them must be preserved, otherwise a WAW hazard results
             i: ADD.D F0, F2, F4
             j: SUB.D F0, F6, F8
     • Name dependences are harder to handle for memory accesses:
         • Does 100(R4) = 20(R6)?
         • From different loop iterations, does 20(R4) = 20(R4)?

  10. Register Renaming eliminates WAR & WAW
     • Assuming temporary registers S and T:
         Original code            After renaming
         DIV.D F0, F2, F4         DIV.D F0, F2, F4
         ADD.D F6, F0, F8         ADD.D S, F0, F8
         S.D   F6, 0(R1)          S.D   S, 0(R1)
         SUB.D F8, F10, F14       SUB.D T, F10, F14
         MUL.D F6, F10, F8        MUL.D F6, F10, T
     • Dependences in the original code:
         • (True) data dependences: (1) DIV.D to ADD.D (F0), (2) ADD.D to S.D (F6), (3) SUB.D to MUL.D (F8)
         • Antidependences (WAR): ADD.D to SUB.D (F8)
         • Output dependences (WAW): ADD.D to MUL.D (F6)
     • Which dependences are eliminated by renaming? The WAR and WAW dependences; the true data dependences remain
     • Any subsequent use of F8 must be replaced by T
     • How about F6? Later uses need not be replaced, because MUL.D writes a new value into F6
     • Register renaming will be implemented in hardware
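
The renaming idea can be sketched in a few lines. This is an illustrative version, not the slide's exact method: it gives every destination a fresh name (T0, T1, ...) instead of hand-picking S and T, and the helper names `rename` and `name_dependences` are hypothetical. The detector also reports the S.D to MUL.D antidependence on F6, which the slide does not list.

```python
from itertools import count


def rename(program):
    """program: list of (op, dest, srcs); dest is None for stores.
    Returns a copy in which every destination gets a fresh register name."""
    fresh = (f"T{n}" for n in count())
    latest = {}                      # architectural register -> current name
    out = []
    for op, dest, srcs in program:
        new_srcs = tuple(latest.get(s, s) for s in srcs)   # read newest names
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            latest[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out


def name_dependences(program):
    """Return (kind, i, j) for every WAR/WAW pair with i before j."""
    found = []
    for j, (_, dest_j, _) in enumerate(program):
        if dest_j is None:
            continue
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_j in srcs_i:
                found.append(("WAR", i, j))
            if dest_i == dest_j:
                found.append(("WAW", i, j))
    return found


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed = rename(program)
print(name_dependences(program))
print(name_dependences(renamed))
```

After renaming, the true data dependences are still visible in the source operands (ADD.D reads the renamed DIV.D result, MUL.D reads the renamed SUB.D result), while no WAR or WAW pairs remain.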

  11. (3) Control Dependence
     • The final kind of dependence is control dependence
     • Example: if p1 {S1;}; if p2 {S2;}
     • S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
     • Note that S2 could be data dependent on S1.

  12. Control Dependences
     • Two (obvious) constraints on control dependences:
         • An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
         • An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch
     • Code fragments from the slide (original layout not recoverable):
         S1; if p1 {S1;}; if p2 {S2;}
         if p1 {S1;}; if p2 {S2;}
         if p1 {S1;}; S3; if p2 {S2;}
         S3 if p1 {S1;}; S3; if p2 {S2;}

  13. Limitations of Scoreboarding (scoreboard hardware on next slide)
     • No forwarding hardware
     • Limited to instructions in a basic block (small window)
     • Small number of functional units (structural hazards), especially integer/load/store units: only one each
     • Cannot issue on structural or WAW hazards
     • Must wait until WAR hazards are resolved
     • Imprecise exceptions due to out-of-order execution
     • Improvement? Tomasulo’s approach

  14. Scoreboard Hardware: centralized control by the scoreboard
     • [Figure A.50: The basic structure of a MIPS processor with a scoreboard. Labels: registers, FP mult, FP mult, FP divide, FP add, integer unit, scoreboard, data buses, control/status lines]
     • The scoreboard was originally proposed in the CDC 6600 (Seymour Cray, 1964)

  15. Scoreboard functional unit status fields
     • Busy – Indicates whether the unit is busy or not
     • Op – Operation to perform in the unit (e.g., add or subtract)
     • Fi – Destination register
     • Fj, Fk – Source-register numbers
     • Qj, Qk – Functional units producing source registers Fj, Fk
     • Rj, Rk – Flags indicating when Fj, Fk are available and not yet read

  16. Tomasulo’s Algorithm
     • Designed for the IBM 360/91, about 3 years after the CDC 6600 (late 1960s)
     • Goal: high performance without special compilers
     • Differences between Tomasulo’s algorithm and the scoreboard (similar to scoreboarding, but with register renaming added):
         • Control and buffers (called “reservation stations”) are distributed with the functional units rather than centralized in a scoreboard: Scoreboard/Inst buffer => Reservation Stations for each FU
         • Registers in instructions are replaced by pointers to reservation station buffers
         • HW renaming of registers avoids WAR and WAW hazards
         • A common data bus (CDB) broadcasts results to the functional units
         • Loads and stores are treated as functional units as well
     • Very importantly: Tomasulo’s algorithm has been adopted in many modern CPUs, e.g. Alpha 21264, HP PA-8000, MIPS R10K, Pentium III, Pentium 4, PowerPC 604

  17. Key concept: Reservation Stations (RS)
     • Distributed (rather than centralized) control scheme
         – Bypassing (data sent directly to an RS rather than via registers) is allowed via the Common Data Bus (CDB)
         – Register renaming eliminates WAR/WAW hazards
     • Scoreboard/Instruction Buffer => Reservation Stations
         – Fetch and buffer operands as soon as they are available
             • Eliminates the need to always get values from registers at execute
         – Pending instructions designate the reservation stations that will provide their inputs
         – Successive writes to a register cause only the last one to update the register

  18. MIPS Floating-point unit using Tomasulo’s Algorithm

  19. Details
     • Each reservation station holds an instruction that has been issued and is waiting for execution. The instruction may already have all its operands, or it holds the name(s) of the RSs or load buffers that will provide them. These name fields are called “tags”: 4 bits each, enough to denote one of 5 RSs and 6 load buffers. RSs are used for renaming
     • Load buffers and store buffers behave almost exactly like RSs
     • All results from the FUs and from memory are sent on the Common Data Bus, which is connected everywhere except the load buffers
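
The 4-bit tag width follows from the number of possible result sources. A small arithmetic sketch, assuming (this is not stated on the slide) that tag value 0 is reserved to mean "operand already available"; `tag_bits` is a hypothetical helper name.

```python
def tag_bits(num_sources: int) -> int:
    """Bits needed to encode tags 0..num_sources, where tag 0 is reserved
    (an assumption) to mean that no producer is pending."""
    return num_sources.bit_length()


bits = tag_bits(5 + 6)   # 5 reservation stations + 6 load buffers = 11 sources
print(bits)
```

With 11 sources plus the reserved value there are 12 codes to distinguish, which fits in 4 bits; a fifth bit would only be needed from 16 sources upward under this encoding.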

  20. Three Stages of Tomasulo’s Algorithm
     • 1. Issue: get the next instruction from the FP operation queue (FIFO)
         • If a reservation station is free, issue the instruction and send its operands (values if available in registers, otherwise the name of the FU that will produce them, i.e. renaming); if none is free, stall (structural hazard). Avoids WAR & WAW
     • 2. Execution: operate on operands (EX)
         • When both operands are ready (already in Vj/Vk or captured from the CDB), execute; if not ready, watch the common data bus for the result. Avoids RAW
     • 3. Write result: finish execution (WB)
         • Write on the common data bus so that all waiting FUs can hear; mark the reservation station as available
     • Common data bus: 64 bits of data + 4 bits of source (“come from”)
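
The three stages can be sketched as a toy model. This is a simplified illustration under several assumptions: timing, memory operations and per-unit station pools are ignored, the youngest ready station is always chosen so that out-of-order execution is visible, and the names `run`, `OPS` and the station labels are hypothetical.

```python
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def run(program, regs, station_names):
    """program: list of (op, dest, src1, src2). Returns final register values."""
    regs = dict(regs)
    reg_status = {}                                # register -> producing station
    stations = {n: None for n in station_names}    # None = free
    pending = list(program)
    while pending or any(stations.values()):
        # 1. Issue, in program order, while a station is free (else stall).
        while pending and None in stations.values():
            free = next(n for n, s in stations.items() if s is None)
            op, dest, s1, s2 = pending.pop(0)
            rs = {"op": op, "V": [None, None], "Q": [None, None]}
            for k, src in enumerate((s1, s2)):
                if src in reg_status:              # not produced yet: keep the tag
                    rs["Q"][k] = reg_status[src]
                else:                              # available: copy the value now
                    rs["V"][k] = regs[src]
            stations[free] = rs
            reg_status[dest] = free                # renaming: dest names this station
        # 2. Execute: choose a station whose operands are both present.
        ready = [n for n, s in stations.items() if s and s["Q"] == [None, None]]
        tag = ready[-1]                            # youngest ready, to reorder
        rs = stations[tag]
        result = OPS[rs["op"]](rs["V"][0], rs["V"][1])
        # 3. Write result: broadcast (tag, result) to every waiting station.
        for other in stations.values():
            if other:
                for k in range(2):
                    if other["Q"][k] == tag:
                        other["Q"][k], other["V"][k] = None, result
        for r, t in list(reg_status.items()):
            if t == tag:                           # only the latest writer updates r
                regs[r] = result
                del reg_status[r]
        stations[tag] = None                       # station becomes available
    return regs


program = [
    ("MUL", "F0", "F2", "F4"),
    ("ADD", "F6", "F0", "F8"),     # RAW on F0
    ("SUB", "F8", "F10", "F14"),   # WAR on F8 with ADD
    ("MUL", "F6", "F10", "F8"),    # WAW on F6 with ADD, RAW on F8
]
regs = {"F0": 0.0, "F2": 2.0, "F4": 3.0, "F6": 0.0,
        "F8": 1.0, "F10": 10.0, "F14": 4.0}
final = run(program, regs, ["RS1", "RS2", "RS3", "RS4"])
print(final)
```

In this run SUB and the second MUL complete before ADD, yet ADD still uses the old F8 it copied at issue and its late result does not overwrite F6, because the register status entry for F6 already names the later MUL.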

  21. Data Buses in Tomasulo’s Algorithm
     • Normal data bus: data + destination (“go to” bus)
     • CDB (Common Data Bus): data + source (“come from” bus)
         – 64 bits of data + 4 bits of functional unit source address (the RS’s number)
         – Any receiving unit (store buffer, RSs, FP registers) accepts (writes) the data if the RS number on the bus matches the number it expects
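
A "come from" bus can be sketched as a broadcast in which every receiver compares the source tag on the bus with the tag it is waiting for. The `Listener` class, the `broadcast` function and the tag numbers below are hypothetical, chosen only to illustrate the matching rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Listener:
    """Anything attached to the CDB: an RS operand slot, a store buffer, an FP register."""
    name: str
    waiting_for: Optional[int] = None   # RS number expected; None = not waiting
    value: Optional[float] = None


def broadcast(listeners: List[Listener], source_tag: int, data: float) -> List[str]:
    """Put (data, source tag) on the bus; each listener decides for itself."""
    accepted = []
    for listener in listeners:
        if listener.waiting_for == source_tag:
            listener.value = data
            listener.waiting_for = None
            accepted.append(listener.name)
    return accepted


listeners = [
    Listener("Add1.Vj", waiting_for=3),
    Listener("F6", waiting_for=3),
    Listener("Mult1.Vk", waiting_for=5),
    Listener("F2"),
]
took = broadcast(listeners, source_tag=3, data=1.5)
print(took)
```

The sender never names a destination, so one broadcast can satisfy any number of waiting units, which a "go to" bus would need several transfers to do.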

  22. Reservation Station Components
     • Op – Operation to perform in the unit (e.g., + or –)
     • Qj, Qk – Names of the reservation stations that will produce the source operands; no values are stored here
     • Vj, Vk – Registers that store the values of the source operands; temporary registers for renaming
     • Busy – Indicates that the reservation station and its FU are busy
     • Register result status – Indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register
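
The fields listed here map naturally onto a small record type. The sketch below is one possible encoding, assuming that `None` in Qj/Qk means "no producer pending" and that a missing dictionary key stands for a blank register result status entry; the station and register names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of the first operand, once known
    vk: Optional[float] = None   # value of the second operand, once known
    qj: Optional[str] = None     # station that will produce Vj; None = not pending
    qk: Optional[str] = None     # station that will produce Vk; None = not pending

    def ready(self) -> bool:
        """An issued instruction may execute once neither operand is pending."""
        return self.busy and self.qj is None and self.qk is None


# Register result status: register name -> station that will write it.
register_status: Dict[str, str] = {}

# Hypothetical snapshot: Mult1 will write F0 and still waits on Load2 for Vj.
mult1 = ReservationStation("Mult1", busy=True, op="MUL", vk=2.5, qj="Load2")
register_status["F0"] = "Mult1"
print(mult1.ready())
```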


  25. Load & Store require 2 steps:
     • Step 1: Compute the effective address (ea)
     • Step 2: Place the ea in the buffer
     • Execution (load or store) can start when the memory unit is not busy


  40. Wait until DIVD finishes; divide takes 40 cycles


  44. Tomasulo vs. Scoreboard
     • Assuming (for the scoreboard): add takes 2 clock cycles, multiply 10, divide 40
     • Why does the scoreboard of the CDC 6600 take longer?
         • Structural hazards
         • Lack of forwarding
     • Both use in-order issue and out-of-order execution
     • The scoreboard can only stall on WAR & WAW hazards; Tomasulo removes them with register renaming
     • Both stall on branch instructions; Tomasulo with speculation comes later

  45. Let’s try this site: http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo/AppletTomasulo.html


  47. Tomasulo’s Algorithm: A Loop-Based Example
         Loop: LD    F0 0(R1)
               MULTD F4 F0 F2
               SD    F4 0(R1)
               SUBI  R1 R1 #8
               BNEZ  R1 Loop
     • Multiply takes 4 clocks
     • Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit); on a cache miss, a block (several words) is brought into the cache
     • Reality: integer instructions run ahead
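
For readers less used to MIPS assembly, the following sketch shows what the loop computes: each 8-byte element is multiplied by F2 in place, walking downward until R1 reaches 0. The function name and the dictionary used as memory are illustrative assumptions; R1 is assumed to start at a positive multiple of 8.

```python
def scale_in_place(memory, r1, f2):
    """memory: dict mapping byte address -> double value."""
    while True:
        f0 = memory[r1]      # LD    F0, 0(R1)
        f4 = f0 * f2         # MULTD F4, F0, F2
        memory[r1] = f4      # SD    F4, 0(R1)
        r1 -= 8              # SUBI  R1, R1, #8
        if r1 == 0:          # BNEZ  R1, Loop (falls through when R1 == 0)
            break
    return memory


mem = {8: 1.0, 16: 2.0, 24: 3.0}
scale_in_place(mem, r1=24, f2=2.0)
print(mem)
```

Every iteration reuses the names F0 and F4, so overlapping iterations in hardware depends on the reservation stations renaming those registers dynamically.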


  49. Cache miss occurs, so LD must wait for 8 cycles

