1 / 41

Lecture 8 Advanced Pipeline

Lecture 8 Advanced Pipeline. Interrupts. Interrupts: 5 instructions executing in 5-stage pipeline How to stop the pipeline? How to restart the pipeline? Who caused the interrupt?. Stage Exceptional Conditions IF Page fault on instruction fetch;

eyad
Download Presentation

Lecture 8 Advanced Pipeline

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 8Advanced Pipeline CS510 Computer Architectures

  2. Interrupts Interrupts: 5 instructions executing in 5-stage pipeline • How to stop the pipeline? • How to restart the pipeline? • Who caused the interrupt? Stage Exceptional Conditions IFPage fault on instruction fetch; Unaligned memory access; Memory-protection violation IDUndefined or illegal opcode EXArithmetic interrupt MEM Page fault on data fetch; Unaligned memory access; Memory-protection violation CS510 Computer Architectures

  3. Delays updating the machine state until late in pipeline, possibly at the completion of an instruction! Simultaneous Exceptions in More Than One Pipe Stages • Simultaneous exceptions in more than one pipeline stage, e.g. LD followed by ADD • LD with data page(DM) fault in MEM stage • ADD with instruction page(IM) fault in IF stage • ADD fault will happen BEFORE Load fault • Solution #1 • Interrupt status vector per instruction • Defer check till the last stage, and kill machine state update if exception • Solution #2 • Interrupt ASAP • Restart everything that is incomplete CS510 Computer Architectures

  4. Simultaneous Exceptions Complex Addressing Modes and Instructions • Address modes:Auto-incrementcauses register change during the instruction execution -Register write in EX stage instead of WB stage • Interrupts? Need to restore register state • Adds WAR and WAW hazards since writes in a register in EX, no longer in WB stage • Memory-Memory Move Instructions • Must be able to handle multiple page faults • Long-lived instructions: partial state save on interrupt • Condition Codes CS510 Computer Architectures

  5. EX intunit IF ID IF ID EX FP/int Multiply EX MEM WB MEM WB EX FP adder EX FP/Int divider Extending the DLX to Handle Multi-cycle Operations CS510 Computer Architectures

  6. integer unit EX FP/integer multiply M1 M2 M3 M4 M6 M7 M5 IF ID MEM WB FP adder A1 A2 A3 A4 FP/integer divider DIV Multicycle Operations CS510 Computer Architectures

  7. Data neededResult available Example latency initiation interval Latency and Initiation Interval * FP LD and ST are same as integer by having 64-bit path to memory. MULTD IF IDM1M2 M3 M4 M5 M6M7 MEM WB ADDD IF IDAIA2 A3A4 MEM WB LD* IF IDEXMEMWB SD* IF IDEXMEMWB Latency: Number of intervening cycles between instructions that produces a result and uses the result Initiation Interval: number of cycles that must elapse between issuing of two operations of a given type Integer ALU 0 1 Load 1 1 FP add 3 1 FP mul 6 1 FP div 14 15 CS510 Computer Architectures

  8. Reality: MIPS R4000 Floating Point Operations Floating Point: long execution time Also, pipeline FP execution unit may initiate new instructions without waiting full latency FP Instruction Latency Initiation Interval (MIPS R4000) Add, Subtract 4 3 Multiply 8 4 Divide 36 35 Square root 112 111 Negate 2 1 Absolute value 2 1 FP compare 3 2 Cycles before using result Cycles before issuing instr of the same type CS510 Computer Architectures

  9. Complications Due to FP Operations Divide, Square Root take @ 10X to @ 30X longer than Add • exceptions? • Adds WAR and WAW hazards since pipelines are no longer same length CS510 Computer Architectures

  10. Summary of Pipelining Basics • Hazards limit performance • Structural: need more HW resources • Data: need forwarding, compiler scheduling • Control: early evaluation of PC, delayed branch, prediction • Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency • Interrupts, FP Instruction Set makes pipelining harder • Compilers reduce cost of data and control hazards • Load delay slots • Branch delay slots • Branch prediction CS510 Computer Architectures

  11. Case Study:MIPS R4000 and Introduction to Advanced Pipelining CS510 Computer Architectures

  12. Case Study:MIPS R4000 (100 MHz to 200 MHz) • Cache miss exception • 10s of cycles delay • What is impact on • Load Delay? • Why? 8 Stage Pipeline: IFFirst half of fetching of instruction • PC selection • Initiation of instruction cache access IS- Second half of fetching of instruction • Access to instruction cache RFInstruction decode, register fetch, hazard checkingalso instruction cache hit detection(tag check) EX Execution • Effective address calculation • ALU operation • Branch target computation and condition evaluation DF - First half of access to data cache DS-Second half of access to data cache TC-Tag checkfor data cache hit WB -Write backfor loads and register-register operations CS510 Computer Architectures

  13. IF IS RF EX DF DS TC WB ALU Data Memory Instruction Memory REG REG load data available Instruction is available Tag check The Pipeline Structure of the R4000 CS510 Computer Architectures

  14. Load data available with forwarding EX IF IS RF EX DF DS TC . . . EX Load data needed 2 Stall Cycles Case Study: MIPS R4000LOAD Latency LD R1, X IF IS RF EX DF DS TC WB ADD R3, R1, R2 IF IS RF EX DF DS TC WB IF IS RF EX DF DS IF IS RF EX DF DS . . . IF IS RF EX DF . . . 2 Cycle Load Latency CS510 Computer Architectures

  15. IF IS IF RF IS IF EX RF IS IF DF stall stall stall DS stall stall stall TC EX RF IS WB DF ... EX ... RF ... LWR1 ADD R2,R1 SUB R3,R1 OR R4,R1 Forwarding Case Study: MIPS R4000LOAD Followed by ALU Instructions 2 cycle Load Latency with Forwarding Circuit CS510 Computer Architectures

  16. IF IS IF RF IS IF EX RF IS IF DF stall stall stall DS stall stall stall TC EX RF IS WB DF ... EX ... RF ... LWR1 ADD R2,R1 SUB R3,R1 OR R4,R1 Forwarding Case Study: MIPS R4000LOAD Followed by ALU Instructions 2 cycle Load Latency with Forwarding Circuit CS510 Computer Architectures

  17. (conditions evaluated during EX phase) R4000 uses PredictNOT TAKEN TAKENBr Delay Slot Stall Stall Br Target instr IF IS IF RF IS RF DF EX DS DF IS TC DS RF WB TC ... EX ... EX IF NOT TAKENBr Delay Slot Br instr +2 Br instr +3 Br instr +4 IF IS IF RF IS IF RF IS IF DF EX RF IS DS DF EX RF IS WB TC ... DS ... DF ... EX ... TC DS DF EX RF EX IF Case Study: MIPS R4000Branch Latency Branch target address available after EX stage Delay Slot plus 2 stall cycles Predict NOT TAKEN strategy NOT TAKEN:one-cycle delayed slot TAKEN:one-cycle delayed slot followedby two stalls - 3 cycle latency CS510 Computer Architectures

  18. Integer Unit(EX) FP Multiplier FP/integer multiply IF ID MEM WB FP Adder FP Divider Extending DLX to Handle Floating Point Operations CS510 Computer Architectures

  19. Stage Functional unit Description MIPS R4000 FP Unit • FP Adder, FP Multiplier, FP Divider • Last step of FP Multiplier/Divider uses FP Adder HW • 8 kinds of stages in FP units: • A FP adder Mantissa ADD stage • D FP divider Divide pipeline stage • E FP multiplier Exception test stage • M FP multiplier First stage of multiplier • N FP multiplier Second stage of multiplier • R FP adder Rounding stage • S FP adder Operand shift stage • U Unpack FP numbers CS510 Computer Architectures

  20. MIPS R4000 FP Pipe Stages FP Instr 1 2 3 4 5 6 7 8 ¼ Add, SubtractU S+A A+R R+S MultiplyU E+M M M M N N+A R DivideU A D28¼ D+A D+R, D+A, D+R, A, R Square rootU E (A+R)108¼ A R NegateU S Absolute value U S FP compareU A R Stages: MFirst stage of multiplierN Second stage of multiplier R Rounding stage A Mantissa ADD stage S Operand shift stage D Divide pipeline stage U Unpack FP numbers E Exception test stage CS510 Computer Architectures

  21. Stall clock cycle Operation Issue/stall 0 1 2 3 4 5 6 7 8 9 10 11 12 A A A A ADD issued at4 cycles after Multiply will stall2 cycles. ADD issued at5 cyclesafter Multiply will stall1 cycle. MIPS R4000 FP Pipe Stages Multiply Issue U M M M M N N+ A R Add Issue U S+A A+R R+S Add Issue U S+A A+R R+S Add Issue U S+A A+R R+S Add Stall U S+A A + R R +S Stall Add Stall U S + A A +R R +S Add Issue U S+A A+R R+S Add Issue U S+A A+R R+S CS510 Computer Architectures

  22. Pipeline CPI 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 li gcc eqntott doduc su2cor tomcatv nasa7 ora espresso spice2g6 Integer programs Floating Point programs Base Load stalls Branch stalls FP result stalls FP structural stalls R4000 Performance Not an ideal pipeline CPI of 1: • Load stalls: (1 or 2 clock cycles) • Branch stalls: (2 cycles for taken br. + unfilled branch slots) • FP result stalls: RAW data hazard (latency) • FP structural stalls: Not enough FP hardware (parallelism) CS510 Computer Architectures

  23. CS510 Computer Architectures

  24. Advanced PipelineAndInstruction Level Parallelism CS510 Computer Architectures

  25. Block of Code . . . Branch Target . . . Branch instruction . . . . . . Any instruction . . . Branch instruction . . . Advanced Pipelining and Instruction Level Parallelism • gcc 17% control transfer • 5 instructions + 1 branch • Beyond single block to get more instruction level parallelism • Loop level parallelism is one opportunity, SW and HW CS510 Computer Architectures

  26. Technique Reduces Advanced Pipelining and Instruction Level Parallelism Loop unrolling Control stalls Basic pipeline scheduling RAW stalls Dynamic scheduling with scoreboarding RAW stalls Dynamic scheduling with register renaming WAR and WAW stalls Dynamic branch prediction Control stalls Issuing multiple instructions per cycle Ideal CPI Compiler dependence analysis IdealCPI and data stalls Software pipelining and trace scheduling Ideal CPI and data stalls Speculation All data and control stalls Dynamic memory disambiguation RAW stalls involving memory CS510 Computer Architectures

  27. Basic Pipeline Scheduling and Loop Unrolling FP unit latencies Instruction producing Instruction using Latency in result result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double* FP ALU op 1 Load double* Store double 0 * Same as integer Load since there is a 64-bit data path from/to memory. Fully pipelined or replicated --- no structural hazards, issue on every clock cycle for ( i =1; i <= 1000; i++) x[i] = x[i] + s; CS510 Computer Architectures

  28. Instruction Instruction Latency inproducing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double2 Load double FP ALU op1 Load double Store double 0 Integer op Integer op 0 FP Loop Hazards Loop: LD F0,0(R1) ;R1 is the pointer to a vector ADDD F4,F0,F2 ;F2 contains a scalar value SD 0(R1),F4 ;store back result SUBI R1,R1,8 ;decrement pointer 8B (DW) BNEZ R1,Loop ;branch R1!=zero NOP ;delayed branch slot Where are the stalls? CS510 Computer Architectures

  29. FP Loop Showing Stalls 1 Loop: LD F0,0(R1) ;F0=vector element 2 stall 3 ADDD F4,F0,F2 ;add scalar in F2 4 stall 5 stall 6 SD 0(R1),F4 ;store result 7 SUBI R1,R1,8 ;decrement pointer 8B (DW) 8stall 9 BNEZ R1,Loop ;branch R1!=zero 10 stall ;delayed branch slot Rewrite code to minimize stalls? CS510 Computer Architectures

  30. For Load-ALU latency Consider moving SUBI into this Load Delay Slot. Reading R1 by LD is done before Writing R1 by SUBI. Yes we can. For ALU-ALU latency 8 When we do this, we need to change the immediate value 0 to 8 in SD Reducing Stalls 1 Loop: LD F0,0(R1) 2 stall 3 ADDD F4,F0,F2 4 stall 5 stall 6 SD 0(R1),F4 7 SUBI R1,R1,#8 8 stall 9 BNEZ R1,Loop 10 stall There is only one instruction left, i.e., BNEZ. When we do that, SD instruction fills the delayed branch slot. CS510 Computer Architectures

  31. Instruction Instruction Latency inproducing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Revised FP Loop to Minimize Stalls 1 Loop: LD F0,0(R1) 2 SUBI R1,R1,#8 3 ADDD F4,F0,F2 4 stall 5 BNEZ R1,Loop ;delayed branch 6 SD 8(R1),F4;altered when move past SUBI Unroll loop 4 times to make the code faster CS510 Computer Architectures

  32. Unroll Loop 4 Times 1Loop: LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 ;drop SUBI & BNEZ 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 ;drop SUBI & BNEZ 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12;drop SUBI & BNEZ 10 LD F14,-24(R1) 11 ADDD F16,F14,F2 12 SD -24(R1),F16 13 SUBI R1,R1,#32;alter to 4*8 14 BNEZ R1,Loop 15NOP 15 + 4 x(1*+2+)+1^= 28 clock cycles, or 7 per iteration. 1*: LD to ADDD stall 1 cycle 2+: ADDD to SD stall 2 cycles 1^: Data dependency on R1 Rewrite loop to minimize the stalls CS510 Computer Architectures

  33. Assumptions - OK to move SD past SUBI even though SUBI changes R1 SUBI IF RF EX MEM WB SD IF ID EX MEM WB BNEZ IF ID EX MEM WB - OK to move loads before stores(Get right data) - When is it safe for compiler to do such changes? 14 clock cycles, or 3.5 per iteration Unrolled Loop to Minimize Stalls 1 Loop: LD F0,0(R1) 2LD F6,-8(R1) 3LD F10,-16(R1) 4LD F14,-24(R1) 5ADDD F4,F0,F2 6ADDD F8,F6,F2 7ADDD F12,F10,F2 8ADDD F16,F14,F2 9SD 0(R1),F4 10SD -8(R1),F8 11SUBI R1,R1,#32 12SD16(R1),F12; 16-32= -16 13BNEZ R1,LOOP 14SD8(R1),F16; 8-32 = -24 CS510 Computer Architectures

  34. Compiler Perspectives on Code Movement • Definitions: Compiler is concerned about dependencies in the program, whether this causes a HW hazard or not depends on a given pipeline • Data dependencies (RAW if a hazard for HW): Instruction j is data dependent on instruction i if either • Instruction i produces a result used by instruction j, or • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i. • Easy to determine for registers (fixed names) • Hard for memory: • Does 100(R4) = 20(R6)? • From different loop iterations, does 20(R6) = 20(R6)? CS510 Computer Architectures

  35. Compiler Perspectives on Code Movement • Name Dependence: Two instructions use the same name(register or memory location) but they do not exchange data • Two kinds of Name Dependence Instruction i precedes instruction j • Antidependence (WAR if a hazard for HW) • Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first • Output dependence (WAW if a hazard for HW) • Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved. CS510 Computer Architectures

  36. Compiler Perspectives on Code Movement • Again Hard for Memory Accesses • Does 100(R4) = 20(R6)? • From different loop iterations, does 20(R6) = 20(R6)? • Our example required compiler to know that if R1 doesn’t change then:0(R1) ¹ -8(R1) ¹ -16(R1) ¹ -24(R1) 1 There were no dependencies between some loads and stores, so they could be moved by each other. CS510 Computer Architectures

  37. Compiler Perspectives on Code Movement • Control Dependence • Example if p1 {S1;}; if p2 {S2;} S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1. CS510 Computer Architectures

  38. Compiler Perspectives on Code Movement • Two (obvious) constraints on control dependencies: • An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. • An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch. • Control dependencies may be relaxed in some systems to get parallelism; get the same effect if preserve the order of exceptions and data flow CS510 Computer Architectures

  39. When Safe to Unroll Loop? • Example: When a loop is unrolled, where are data dependencies? (A,B,C distinct, non-overlapping) for (i=1; i<=100; i=i+1) {A[i+1]= A[i] + C[i]; /* S1 */ B[i+1] = B[i] +A[i+1];}/* S2 */ 1. S2 uses the valueA[i+1],computed by S1 in the same iteration. 2. S1 uses a value computed byS1in an earlier iteration, since iteration i computesA[i+1]which is read in iteration i+1. The same is true ofS2 for B[i] and B[i+1]. This is aloop-carried dependencebetween iterations • Implies thatiterations are dependent, and can’t be executed in parallel • Not the case for our example; each iteration was distinct CS510 Computer Architectures

  40. When Safe to Unroll Loop? • Example: Where are data dependencies? (A,B,C,D distinct & non-overlapping)Following looks like there is a loop carried dependence for (i=1; i<=100; i=i+1) {A[i] = A[i] +B[i];/* S1 */B[i+1]= C[i] + D[i];} /* S2 */ However, we can rewrite it as follows for loop carried dependence-free A[1] = A[1] + B[1]; for (i=1; i<=99; i=i+1) { B[i+1] = C[i] + D[i]; A[i+1] = A[i+1] + B[i+1];} B[101] = C[100]+D[100]; CS510 Computer Architectures

  41. Summary • Instruction Level Parallelism in SW or HW • Loop level parallelism is easiest to see • SW parallelism dependencies defined for a program, hazards if HW cannot resolve • SW dependencies/compiler sophistication determine if compiler can unroll loops CS510 Computer Architectures

More Related