html5-img
1 / 51

Appendix C

Computer Architecture A Quantitative Approach, Fifth Edition. Appendix C. Pipelining: Basic and Intermediate Concepts. Basic Pipelining. Pipelining is the organizational implementation technique that has been responsible for the most dramatic increase in computer performance.

jaser
Download Presentation

Appendix C

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Computer Architecture A Quantitative Approach, Fifth Edition Appendix C Pipelining: Basic and Intermediate Concepts

  2. Basic Pipelining • Pipelining is the organizational implementation technique that has been responsible for the most dramatic increase in computer performance. • Overview of basic pipelining • What is pipelining? • Computing pipeline speedup • Clocking pipelines • Pipelining MIPS • Pipeline hazards • Handling interrupts.

  3. Pipelining

  4. Pipelining 3 Stages • Assume a 2 ns flip-flop delay

  5. Pipelining: Computing the speedup • Time per instruction • TPI = CPI cycle time • We can think about pipelining as reducing either CPI or cycle time • Ideal speedup • Requires that all stages be perfectly balanced • No synchronization (latch, flip-flop) overhead • No stall cycles • The speedup from a pipeline is limited • CPIreal = CPIideal + CPI stall • CCTreal = Timelongestpipestage+ Timelatch overhead

  6. MIPS Instruction Formats

  7. Basic MIPS Pipeline

  8. Basic MIPS Pipeline (simplified)

  9. Pipelining By Adding Registers

  10. MIPS Pipelined Execution

  11. Rules for pipeline registers • Each stage must be independent, so inter-stage registers must hold • Data values • Control signals, including • Decoded instruction fields • MUX controls • ALU controls • Think of the register file as two independent units • Read file, accessed in ID • Write file, accessed in WB • There is no “final” set of registers after WB, (WB/IF) because the instruction is finished and all results are recorded in permanent machine state (register file, memory, and PC)

  12. A More Accurate Pipeline Schematic

  13. Pipeline Dataflow: the details

  14. Problems with Pipelining (Dependencies and Hazards) • Dependencies: a property of the program • Data dependencies • Instruction j uses the result produced by instruction I • Control dependencies • The execution of instruction j depends upon the result of instruction i

  15. Dependencies and Hazards • Hazard a result of dependencies in the pipeline • Hazards lead to pipeline stalls or the execution of the wrong instruction • Data hazards • Instruction depends upon the result of an instruction still in the pipeline • Structural Hazard • Two instructions try to use the same hardware resource in a single cycle • Control hazard • Caused by the delay in fetching an instruction and decision about changes in instruction flow

  16. Structural hazards • When two instructions need to use the same hardware resource in the same cycle. • Resources are not duplicated • Register file write ports • Resources is not fully pipelined, I.e. takes more than one cycle • Division, floating points • Fix #1: Stall later instruction • Low cost, but increases average CPI • Best used for rare events • Examples: • MIPS R2000 multi-cycle multiply • SPARC V1 single memory port for instruction and data • Fix #2: Duplicate the resource • Increase cost, but preserves CPI • Best used for cheap resources and/or frequent events

  17. Structural hazards, continued • Example resource duplication • Separate instruction and data memory • Separate ALU and PC adders • Register files with multiple ports • Fix #3: Pipeline expensive resource • Moderate cost compared to duplication, expensive compared to stalling • Best used for high performance or specialty machines • Fully pipelined floating point units for scientific machines. • How to avoid structural hazards altogether • Design the ISA so that each resource needed by an instruction: • Is used once • Is always used in the same pipeline stage • Takes one cycle • MIPS is designed with pipelining in mind, x86 is not

  18. Types of Data Hazards • RAW (Read After Write) • Only hazard for “fixed” pipelines • Later instruction must be read after the earlier instruction writes • WAW (Write After Write) • Variable length pipeline • Later instructions must write after earlier instruction I • WAR (Write after Read) • Pipeline with late read • Later instruction must write after earlier instruction reads • We can have Data hazard through memory locations

  19. Example RAW pipeline hazard

  20. Stall for RAW hazards • Relatively cheap: just needs some extra compare and control logic • Detected in ID stage by comparing the registers to be read with the registers to be written for the instruction currently in the EX, MEM, or WB stages • Stall if a match is found • Increases the average CPI • Would happen much too frequently Write Data to R1 here Bubble ADD R1, R2, R3 ADD R4, R1, R5 Read from R1 here

  21. Stall type #1: Freeze the whole pipeline • Freeze all pipe stages for one or more cycles, and suppress writeback • Needs only one global stall signal which suppresses all latching in all pipeline stages • Sometimes called a “fixed pipe” or “frozen pipe” stall • Works for cache misses • Will not work to remove pipeline hazards

  22. Stall type #2: Delay completion of an instruction • Instruction progress stops for one cycle • Earlier instructions continue towards completion • Prior instructions must suspend and make no more progress • An “elastic pipe: stall • Good when the need for stalling is only detected after decode, like for pipeline hazards Bubble in: EX MEM WB

  23. Bypass (Forwarding) • If data is available elsewhere in the pipeline, there is no need to stall • Detect condition • Bypass (or forward) data directly to the consuming pipeline stage • Bypass eliminates stalls for single-cycle operations • Reduces longest stall to N-1 cycles for N-cycle operations

  24. Physical Forwarding Paths • The third forwarding operation might not be necessary • if we can make read-after-write register file

  25. Example forwarding decisions • If EX has just finished an operation for which ID wants to read the value from either operand, we must forward • If IR.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS1_Reg_Num then ALUmuxA =SelectALU4 • If IR.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS2_Reg_Num then ALUmuxB =SelectALU4 • Need one comparison and multiplex control for each forwarding path • Be careful: if you forward from more than one instruction, choose the closest in the pipeline

  26. Physical Forwarding Paths

  27. Forwarding Animation (1)

  28. Forwarding Animation (2)

  29. Forwarding Animation (3)

  30. Forwarding Animation (4)

  31. Other Data Hazards • WAR (Write After Read) • Can happen if the instruction pipeline has early writes and/or late reads; something like: DIV (R1), Suppose that it does not read destination indirect until after the divide ADD ..,(R1)+ Incremented value of R1 is written before DIV has read value of R1 • Can not happen in DLX because all reads are early (ID) and all writes are late (WB) • WAW (Write After Write) • Can happen when a fast operation follows a slow one; like LW R1,0(R2) IF ID EX MEM WB ADD R1, R2, R3 IF ID EX WB • Can not happen in DLX (integer) because there is only one WB stage and instructions use it in order

  32. One data hazard left • Loaded data is not available until the end of MEM, which is too late for the following instruction • Forwarding can not help, so we must stall – or just “decree” that you can not write code like this. Such a decree is called a “delayed load” and was used in the original MIPS 2000

  33. Stalling to interlock

  34. The software fix: instruction scheduling to avoid stalls • Since we can not avoid a stall following a load, avoid the stall by rearranging the code (“pipeline scheduling”), if possible • Replace sub r4, r5, r7 lw r1, 50(r2) add r3, r1, r4 • With lw r1, 50(r2) sub r4, r5, r7 add r3, r1, r4 • This can improve a simple RISC machine performance

  35. The software fix: instruction scheduling to avoid stalls • But it is limited • Usually limited to basic blocks between branches, 5-7 instructions • Difficult to do interchanges to variables referenced indirectly (pointer, array, or parameter) due to the risk of aliases.

  36. Branches and jumps • Control point: know target and condition • Mem control point • Branch penalty: number of pipeline stages to control point • This 3-cycle penalty works, but since branches occur every 5-7 instructions, it kills performance. What to do • Determine the branch condition earlier than EX • Compute the target address earlier than MEM

  37. Characteristics of MIPS branches and jumps • The branch condition • Only has EQ/NE comparison to zero • Fast and cheap, no need for a full ALU • Use a 32-bit NOR gate instead • The branch target • Always PC-relative • Needs only 16 bit adder (and carry propagation) • The jump target • Always PC-relative • Target = {PC[31:28], offset, 00} • All can be moved to the ID stage, at the cost of additional hardware (and maybe increased cycle time) • Still requires one stall

  38. Pipelining and Branch ISA Design • Simple branches • Makes ID control point possible • Maybe increases cycle time • 1 cycle penalty • Complex branches • Requires EX control point • Maybe lower cycle time • 2 cycle branch penalty

  39. Reducing branch penalties (1) • Predict that the branch will not be taken • Continue fetching from sequential addresses. • Cancel later if branch was taken • Easy to do • If it is not, continue • If it is, change the following instructions into a NOP and thus take a 1-cycle penalty • Helps a little, but bets the wrong way for loops

  40. Reducing branch penalties (2) • Predict that the branch will be taken • Only useful if the target address is known before the branch condition – not true for MIPS • Cancel later if the branch was not taken • Always has some delay in fetching the branch target

  41. Reducing branch penalties (3) • Change the ISA: delay the effect of the branch • Always execute the instruction(s) after the branch or jump • Depends on the compiler to find something useful to do in the branch delay slot(s). • An ugly dependence of ISA on implementation – may change • Interaction with branch prediction, interrupts.

  42. Filling the branch delay slot

  43. How useful are canceling branches Integer : 35 % slots wasted Floating point : 25% slots wasted

  44. Performance of Branch schemes? • Effective CPI = 1 + %branches  average branch penalty • For integer MIPS: 20% of instructions are branches or jumps. 70% of them go to the target

  45. Pipeline example • Consider the following pipeline which implements the MIPS-like ISA. The only variation on the MIPS ISA is the support of full register compares in branch instructions

  46. The Pipeline stages

  47. Assumptions • Writes to the register file occur in the first half of the clock cycle while reads from the register file occur in the second half • All bypass paths have been implemented to minimize pipeline stalls due to data hazards • The pipeline implements hardware interlocks

  48. Questions • How many register file ports does the processor need to minimize structural hazards? • Indicate all forwarding required to minimize stalls in the given pipeline. Also, specify the minimum number of comparators needed to implement forwarding? • What is the worst case delay due to RAW data hazards? • What is the branch delay of this pipeline?

  49. Instruction Dependencies • The frequencies in the table are presented as percentages of all instructions executed

  50. More Questions • List the instruction sequences from the previous table that cause data stalls in the pipeline. Indicate the corresponding number of stall cycles. • Compute the CPI for the pipeline due to data hazards only. Ignore instruction sequences that are not listed in the table • If the frequency of conditional branches is 10% of which 65% are taken and the frequency of unconditional branches is 6%, compute the overall CPI assuming a TAKEN branch prediction scheme.

More Related