Lecture 8 Advanced Pipeline

Lecture 8: Advanced Pipeline, CS510 Computer Architectures

Advanced Pipelining and Instruction Level Parallelism

Technique                                    Reduces
Loop unrolling                               Control stalls
Basic pipeline scheduling                    RAW stalls
Dynamic scheduling with scoreboarding        RAW stalls
Dynamic scheduling with register renaming    WAR and WAW stalls
Dynamic branch prediction                    Control stalls
Issuing multiple instructions per cycle      Ideal CPI
Compiler dependence analysis                 Ideal CPI and data stalls
Software pipelining and trace scheduling     Ideal CPI and data stalls
Speculation                                  All data and control stalls
Dynamic memory disambiguation                RAW stalls involving memory

Basic Pipeline Scheduling and Loop Unrolling

FP unit latencies

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double*                    FP ALU op                   1
Load double*                    Store double                0

* Same as integer load, since there is a 64-bit data path from/to memory.

Fully pipelined or replicated: no structural hazards, issue on every clock cycle.

    for (i = 1; i <= 1000; i++)
        x[i] = x[i] + s;

FP Loop Hazards

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0
Integer op                      Integer op                  0

Loop: LD    F0,0(R1)    ;R1 is the pointer to a vector
      ADDD  F4,F0,F2    ;F2 contains a scalar value
      SD    0(R1),F4    ;store back result
      SUBI  R1,R1,8     ;decrement pointer 8B (DW)
      BNEZ  R1,Loop     ;branch R1!=zero
      NOP               ;delayed branch slot

Where are the stalls?

FP Loop Showing Stalls

 1 Loop: LD    F0,0(R1)    ;F0=vector element
 2       stall
 3       ADDD  F4,F0,F2    ;add scalar in F2
 4       stall
 5       stall
 6       SD    0(R1),F4    ;store result
 7       SUBI  R1,R1,8     ;decrement pointer 8B (DW)
 8       stall
 9       BNEZ  R1,Loop     ;branch R1!=zero
10       stall             ;delayed branch slot

Rewrite code to minimize stalls?

Reducing Stalls

 1 Loop: LD    F0,0(R1)
 2       stall
 3       ADDD  F4,F0,F2
 4       stall
 5       stall
 6       SD    0(R1),F4
 7       SUBI  R1,R1,#8
 8       stall
 9       BNEZ  R1,Loop
10       stall

• For the load-ALU latency: consider moving SUBI into this load delay slot. LD reads R1 before SUBI writes R1, so yes, we can. When we do this, SD executes after SUBI, so we need to change the immediate value in SD from 0 to 8.
• For the ALU-ALU latency: there is only one instruction left to move, i.e., BNEZ. When we move it up after ADDD, the SD instruction fills the delayed branch slot.

Revised FP Loop to Minimize Stalls

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1

 1 Loop: LD    F0,0(R1)
 2       SUBI  R1,R1,#8
 3       ADDD  F4,F0,F2
 4       stall
 5       BNEZ  R1,Loop     ;delayed branch
 6       SD    8(R1),F4    ;altered when moved past SUBI

Unroll the loop 4 times to make the code faster.
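The cycle counts of the original loop (10) and the revised loop (6) can be reproduced with a very small issue model. The following is only a sketch under stated assumptions: the pipeline issues at most one instruction per cycle in order, each instruction waits on at most one earlier instruction, and the SUBI-to-BNEZ latency of 1 is inferred from the stall shown in the slides rather than taken from the latency table. The names `instr`, `last_issue_cycle`, `original_loop_cycles`, and `revised_loop_cycles` are hypothetical.

```c
#include <stddef.h>

/* One instruction in a straight-line sequence.
 * dep:     index of the earlier instruction whose result it waits on, or -1
 * latency: cycles that must separate producer and consumer (latency table) */
struct instr { int dep; int latency; };

/* Cycle (1-based) in which the last instruction issues; 0 if n is 0 or > 64. */
int last_issue_cycle(const struct instr *code, size_t n)
{
    int issue[64];
    if (n == 0 || n > 64) return 0;
    for (size_t i = 0; i < n; i++) {
        int cycle = (i == 0) ? 1 : issue[i - 1] + 1;      /* in-order issue */
        if (code[i].dep >= 0 && (size_t)code[i].dep < i) {
            int ready = issue[code[i].dep] + code[i].latency + 1;
            if (ready > cycle) cycle = ready;             /* stall until ready */
        }
        issue[i] = cycle;
    }
    return issue[n - 1];
}

int original_loop_cycles(void)
{
    static const struct instr loop[] = {
        { -1, 0 },  /* LD   F0,0(R1)                              */
        {  0, 1 },  /* ADDD F4,F0,F2   load -> FP ALU op: 1       */
        {  1, 2 },  /* SD   0(R1),F4   FP ALU op -> store: 2      */
        { -1, 0 },  /* SUBI R1,R1,8                               */
        {  3, 1 },  /* BNEZ R1,Loop    assumed 1, from the slides */
        { -1, 0 },  /* delayed branch slot                        */
    };
    return last_issue_cycle(loop, sizeof loop / sizeof loop[0]);
}

int revised_loop_cycles(void)
{
    static const struct instr loop[] = {
        { -1, 0 },  /* LD   F0,0(R1)                              */
        { -1, 0 },  /* SUBI R1,R1,#8                              */
        {  0, 1 },  /* ADDD F4,F0,F2   load -> FP ALU op: 1       */
        {  1, 1 },  /* BNEZ R1,Loop    assumed 1, from the slides */
        {  2, 2 },  /* SD   8(R1),F4   FP ALU op -> store: 2      */
    };
    return last_issue_cycle(loop, sizeof loop / sizeof loop[0]);
}
```

Only the totals are meant to match the slides. The model issues BNEZ as early as it can, so the remaining stall in the revised loop falls after BNEZ rather than before it, but SD still issues in cycle 6.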

Unroll Loop 4 Times

 1 Loop: LD    F0,0(R1)
 2       ADDD  F4,F0,F2
 3       SD    0(R1),F4      ;drop SUBI & BNEZ
 4       LD    F6,-8(R1)
 5       ADDD  F8,F6,F2
 6       SD    -8(R1),F8     ;drop SUBI & BNEZ
 7       LD    F10,-16(R1)
 8       ADDD  F12,F10,F2
 9       SD    -16(R1),F12   ;drop SUBI & BNEZ
10       LD    F14,-24(R1)
11       ADDD  F16,F14,F2
12       SD    -24(R1),F16
13       SUBI  R1,R1,#32     ;alter to 4*8
14       BNEZ  R1,Loop
15       NOP

15 + 4 x (1* + 2+) + 1^ = 28 clock cycles, or 7 per iteration.
  1*: LD to ADDD stall, 1 cycle
  2+: ADDD to SD stall, 2 cycles
  1^: data dependency on R1

Rewrite the loop to minimize the stalls.
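At the source level, the same transformation can be sketched in C. This is an illustration, not the lecture's code: the function names are hypothetical, the arrays are indexed upward from 0 instead of walking a pointer downward as the assembly does, and the cleanup loop for element counts that are not a multiple of 4 is an addition (the slides use 1000 elements, so none is needed there).

```c
#include <stddef.h>

/* Rolled loop: one add plus loop overhead per element. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled 4 times: loop overhead is paid once per four elements. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
    for (; i < n; i++)          /* cleanup when n is not a multiple of 4 */
        x[i] = x[i] + s;
}

/* Self-check for n <= 16: returns 1 if both versions give the same array. */
int unrolled_matches_rolled(size_t n)
{
    double a[16], b[16];
    if (n > 16) return 0;
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] = (double)i * 0.5;
    add_scalar(a, n, 3.0);
    add_scalar_unrolled4(b, n, 3.0);
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

The four statements in the unrolled body touch different elements, which is what lets a scheduler reorder them freely.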

Unrolled Loop to Minimize Stalls

 1 Loop: LD    F0,0(R1)
 2       LD    F6,-8(R1)
 3       LD    F10,-16(R1)
 4       LD    F14,-24(R1)
 5       ADDD  F4,F0,F2
 6       ADDD  F8,F6,F2
 7       ADDD  F12,F10,F2
 8       ADDD  F16,F14,F2
 9       SD    0(R1),F4
10       SD    -8(R1),F8
11       SUBI  R1,R1,#32
12       SD    16(R1),F12    ; -16 + 32 = 16
13       BNEZ  R1,Loop
14       SD    8(R1),F16     ; -24 + 32 = 8

14 clock cycles, or 3.5 per iteration.

Compiler Perspectives on Code Movement
• Definitions: the compiler is concerned with dependencies in the program; whether a dependence causes a HW hazard depends on the given pipeline.
• Data dependencies (RAW if a hazard for HW): instruction j is data dependent on instruction i if either
  • instruction i produces a result used by instruction j, or
  • instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
• Easy to determine for registers (fixed names)
• Hard for memory:
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) = 20(R6)?

Compiler Perspectives on Code Movement
• Name dependence: two instructions use the same name (register or memory location) but do not exchange data.
• Two kinds of name dependence (instruction i precedes instruction j):
  • Antidependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
  • Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
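For registers, the three kinds of dependence follow mechanically from the destination and source names of the two instructions. The sketch below classifies the direct dependences between an instruction i and a later instruction j. It assumes a single register numbering (for example F0..F31 as 0..31 and R0..R31 as 32..63, an arbitrary choice made here), ignores memory, and does not compute the transitive case of data dependence. All identifiers are hypothetical.

```c
/* Bit flags, so that one pair of instructions can report several kinds. */
enum dep_kind { DEP_NONE = 0, DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* Register numbers; -1 marks an unused field (e.g. a store has no dest). */
struct reg_instr { int dest; int src1; int src2; };

struct reg_instr make_instr(int dest, int src1, int src2)
{
    struct reg_instr r;
    r.dest = dest;
    r.src1 = src1;
    r.src2 = src2;
    return r;
}

/* Direct dependences of j on i, where i precedes j in program order. */
int classify(struct reg_instr i, struct reg_instr j)
{
    int kinds = DEP_NONE;
    if (i.dest >= 0 && (j.src1 == i.dest || j.src2 == i.dest))
        kinds |= DEP_RAW;   /* data dependence: j reads what i writes    */
    if (j.dest >= 0 && (i.src1 == j.dest || i.src2 == j.dest))
        kinds |= DEP_WAR;   /* antidependence: j writes what i reads     */
    if (i.dest >= 0 && i.dest == j.dest)
        kinds |= DEP_WAW;   /* output dependence: both write same name   */
    return kinds;
}
```

With this numbering, `classify(make_instr(4, 0, 2), make_instr(-1, 4, 33))` models ADDD F4,F0,F2 followed by SD 0(R1),F4 and reports a RAW dependence on F4, while ADDD F4,F0,F2 followed by a reload of F0 reports WAR.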

Compiler Perspectives on Code Movement
• Again, hard for memory accesses:
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn’t change, then
    0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
• There were no dependencies between some loads and stores, so they could be moved past each other.
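The simplest form of this reasoning can be sketched as a conservative test on two base-plus-offset addresses. It is an illustration only: `may_alias` is a hypothetical name, both accesses are assumed to have the same size, and the base register is assumed not to change between the two accesses. When the base registers differ, the sketch gives up and answers "may alias", which is the case the slide calls hard.

```c
#include <stdlib.h>

/* Returns 1 if the two accesses might touch the same bytes, 0 if they
 * provably do not. base1/base2 are register numbers; off1/off2 are byte
 * offsets; size is the access width in bytes (8 for LD/SD of a double). */
int may_alias(int base1, long off1, int base2, long off2, long size)
{
    if (base1 != base2)
        return 1;                       /* unknown relation: be conservative */
    return labs(off1 - off2) < size;    /* same base: overlap iff offsets are close */
}
```

For the unrolled loop, `may_alias(1, 0, 1, -8, 8)` is 0, so the load from -8(R1) can move above the store to 0(R1). For 100(R4) and 20(R6) the sketch returns 1 because nothing is known about R4 and R6.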

Compiler Perspectives on Code Movement
• Control dependence
• Example:
    if p1 {S1;};
    if p2 {S2;}
  S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Compiler Perspectives on Code Movement
• Two (obvious) constraints on control dependencies:
  • An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
  • An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.
• Control dependencies may be relaxed in some systems to get parallelism; we get the same effect if we preserve the order of exceptions and the data flow.

When Safe to Unroll Loop?
• Example: when a loop is unrolled, where are the data dependencies? (A, B, C distinct, non-overlapping)
    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }
  1. S2 uses the value A[i+1], computed by S1 in the same iteration.
  2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a loop-carried dependence between iterations.
• Implies that iterations are dependent and can’t be executed in parallel
• Not the case for our example; each iteration was distinct
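For accesses of the form A[i+w] (written) and A[i+r] (read) in the same loop, the distinction between the two cases above comes down to the difference of the constant offsets. The following sketch assumes a single loop index with stride 1, the same array in both accesses, and a trip count larger than the distance; the identifiers are hypothetical.

```c
enum carried_kind {
    LOOP_INDEPENDENT = 0,   /* same element is touched in the same iteration */
    CARRIED_TRUE     = 1,   /* a later iteration reads what an earlier wrote  */
    CARRIED_ANTI     = 2    /* a later iteration writes what an earlier read  */
};

struct dep_result { enum carried_kind kind; int distance; };

/* Iteration i writes A[i + write_off]; iteration i2 reads A[i2 + read_off].
 * Both refer to the same element when i2 - i == write_off - read_off. */
struct dep_result analyze(int write_off, int read_off)
{
    struct dep_result r;
    int d = write_off - read_off;
    if (d > 0)      { r.kind = CARRIED_TRUE;     r.distance = d;  }
    else if (d < 0) { r.kind = CARRIED_ANTI;     r.distance = -d; }
    else            { r.kind = LOOP_INDEPENDENT; r.distance = 0;  }
    return r;
}
```

For S1, `analyze(1, 0)` reports a loop-carried dependence of distance 1 (A[i+1] written, A[i] read). For the value A[i+1] that S1 writes and S2 reads, `analyze(1, 1)` reports a dependence within one iteration, and the same holds for x[i] = x[i] + s.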

When Safe to Unroll Loop?
• Example: where are the data dependencies? (A, B, C, D distinct and non-overlapping)
  The following looks like it has a loop-carried dependence:
    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i+1] = C[i] + D[i];      /* S2 */
    }
  However, we can rewrite it as follows so that it is free of loop-carried dependence:
    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
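That the rewritten loop computes the same values can be checked directly. The sketch below runs both forms on the same input and compares every element; the function names, the `LOOP_N` macro, and the test data are hypothetical, and the arrays carry one unused element at index 0 so that the 1-based indexing of the slide is kept.

```c
#define LOOP_N 100

/* Original form: S1 in iteration i+1 reads the B[i+1] that S2 wrote in iteration i. */
void loop_original(double *A, double *B, const double *C, const double *D)
{
    for (int i = 1; i <= LOOP_N; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Rewritten form: the producer and consumer of B[i+1] share an iteration. */
void loop_rewritten(double *A, double *B, const double *C, const double *D)
{
    A[1] = A[1] + B[1];
    for (int i = 1; i <= LOOP_N - 1; i = i + 1) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[LOOP_N + 1] = C[LOOP_N] + D[LOOP_N];
}

/* Returns 1 if both forms leave A and B identical on arbitrary test data. */
int rewrite_preserves_results(void)
{
    double A1[LOOP_N + 2], B1[LOOP_N + 2], A2[LOOP_N + 2], B2[LOOP_N + 2];
    double C[LOOP_N + 2], D[LOOP_N + 2];
    for (int i = 0; i < LOOP_N + 2; i++) {
        A1[i] = A2[i] = (double)i;
        B1[i] = B2[i] = 2.0 * i;
        C[i] = 0.25 * i;
        D[i] = 100.0 - i;
    }
    loop_original(A1, B1, C, D);
    loop_rewritten(A2, B2, C, D);
    for (int i = 0; i < LOOP_N + 2; i++)
        if (A1[i] != A2[i] || B1[i] != B2[i]) return 0;
    return 1;
}
```

Each element is produced by the same arithmetic in both forms, so the comparison can use exact equality.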

Software Pipelining
• Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations.
• Software pipelining: reorganizes a loop so that each iteration is made from instructions chosen from different iterations of the original loop.

SW Pipelining Example

Before: unrolled 3 times

 1 LOOP: LD    F0,0(R1)
 2       ADDD  F4,F0,F2
 3       SD    0(R1),F4
 4       LD    F6,-8(R1)
 5       ADDD  F8,F6,F2
 6       SD    -8(R1),F8
 7       LD    F10,-16(R1)
 8       ADDD  F12,F10,F2
 9       SD    -16(R1),F12
10       SUBI  R1,R1,#24
11       BNEZ  R1,LOOP

After: software pipelined version of the loop

Start-up code:
         LD    F0,0(R1)
         ADDD  F4,F0,F2
         LD    F0,-8(R1)

 1 LOOP: SD    0(R1),F4      ;stores to M[i]
 2       ADDD  F4,F0,F2      ;adds to M[i-1]
 3       LD    F0,-16(R1)    ;loads from M[i-2]
 4       SUBI  R1,R1,#8
 5       BNEZ  R1,LOOP

Finish code:
         SD    0(R1),F4
         ADDD  F4,F0,F2
         SD    -8(R1),F4

[Figure: pipeline diagram (IF ID EX Mem WB) of SD, ADDD, and LD overlapped across Iter i, Iter i+1, and Iter i+2, between the start-up code and the finish code. Labels: Read F4(i), Write F4(i+1), Read F0(i), Write F0(i+2).]
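The same start-up, steady-state, and finish structure can be written in C for the x[i] = x[i] + s loop. This is a sketch rather than the lecture's code: the function names are hypothetical, the index counts upward instead of decrementing a pointer, two local variables stand in for registers F0 and F4, and arrays shorter than three elements fall back to the plain loop.

```c
#include <stddef.h>

/* Reference version. */
void add_scalar_plain(double *x, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Software-pipelined version: each steady-state iteration stores element i,
 * adds for element i+1, and loads element i+2. */
void add_scalar_swp(double *x, size_t n, double s)
{
    if (n < 3) { add_scalar_plain(x, n, s); return; }

    /* start-up code */
    double loaded = x[0];            /* LD   for element 0 */
    double sum = loaded + s;         /* ADDD for element 0 */
    loaded = x[1];                   /* LD   for element 1 */

    /* steady state */
    for (size_t i = 0; i + 2 < n; i++) {
        x[i] = sum;                  /* SD   for element i   */
        sum = loaded + s;            /* ADDD for element i+1 */
        loaded = x[i + 2];           /* LD   for element i+2 */
    }

    /* finish code */
    x[n - 2] = sum;                  /* SD   for element n-2 */
    sum = loaded + s;                /* ADDD for element n-1 */
    x[n - 1] = sum;                  /* SD   for element n-1 */
}

/* Self-check for n <= 16: returns 1 if both versions give the same array. */
int swp_matches_plain(size_t n)
{
    double a[16], b[16];
    if (n > 16) return 0;
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] = (double)i * 1.5;
    add_scalar_plain(a, n, 2.0);
    add_scalar_swp(b, n, 2.0);
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

Within one steady-state iteration the three statements belong to three different elements, so none of them waits on a result produced in that same iteration.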

SW Pipelining Example
• Symbolic loop unrolling
• Less code space
• Overhead paid only once vs. each iteration in loop unrolling

[Figure: two plots of number of overlapped operations vs. time, one for software pipelining and one for loop unrolling (100 iterations = 25 loops with 4 unrolled iterations each).]
