
Exploiting ILP with SW approaches

This topic discusses advanced computer architecture techniques for exploiting instruction level parallelism (ILP) with software approaches, including static branch prediction and speculation, basic compiler techniques, multiple-issue architectures, and loop-level parallelism.

Presentation Transcript


  1. Advanced Computer Architecture 5MD00 / 5Z033: Exploiting ILP with SW approaches. Henk Corporaal, www.ics.ele.tue.nl/~heco, h.corporaal@tue.nl, TU Eindhoven, 2007

  2. Topics • Static branch prediction and speculation • Basic compiler techniques • Multiple issue architectures • Advanced compiler support techniques • Loop-level parallelism • Software pipelining • Hardware support for compile-time scheduling • EPIC: IA-64 ACA H.Corporaal

  3. We previously discussed dynamic branch prediction. This does not help the compiler! We need static branch prediction. ACA H.Corporaal

  4. Static Branch Prediction and Speculation
  • Static branch prediction is useful for code scheduling
  • Example:
        ld   r1,0(r2)
        sub  r1,r1,r3   # hazard
        beqz r1,L
        or   r4,r5,r6
        addu r10,r4,r3
    L:  addu r7,r8,r9
  • If the branch is taken most of the time, and since r7 is not needed on the fall-through path, we could move addu r7,r8,r9 directly after the ld
  • If the branch is not taken most of the time, and assuming that r4 is not needed on the taken path, we could move or r4,r5,r6 after the ld
  ACA H.Corporaal

  5. Static Branch Prediction Methods
  • Always predict taken
  • Average misprediction rate for SPEC: 34% (9%-59%)
  • Backward branches predicted taken, forward branches not taken
  • In SPEC, most forward branches are taken, so always predicting taken is better
  • Profiling
  • Run the program and profile all branches. If a branch is taken (not taken) most of the time, it is predicted taken (not taken)
  • The behavior of a branch is often biased towards taken or not taken
  • Average misprediction rate for SPECint: 15% (11%-22%), SPECfp: 9% (5%-15%)
  • More advanced control flow restructuring
  ACA H.Corporaal
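
A compiler can also take static prediction hints directly from the programmer or from a profiling run. A minimal C sketch of this idea, assuming GCC or Clang (whose real __builtin_expect builtin is used here); the UNLIKELY macro and the sum_nonnegative function are illustrative names, not part of the lecture:

    #include <stdio.h>
    #include <stddef.h>

    /* The profile (or the programmer) says this condition is almost never true,
     * so pass that knowledge to the compiler as a static branch hint. */
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    static long sum_nonnegative(const long *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (UNLIKELY(a[i] < 0))   /* predicted not taken */
                continue;             /* rare path */
            sum += a[i];              /* hot fall-through path */
        }
        return sum;
    }

    int main(void)
    {
        long a[] = { 1, 2, -3, 4 };
        printf("%ld\n", sum_nonnegative(a, 4));   /* prints 7 */
        return 0;
    }

The hint influences block layout and static prediction in the same way a "predict not taken" rule or a profile would: the common case stays on the fall-through path.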

  6. Basic compiler techniques • Dependences limit ILP • Stalls • Scheduling to avoid stalls • Loop unrolling: more parallelism ACA H.Corporaal

  7. Dependencies Limit ILP
  • C loop:
        for (i=1; i<=1000; i++)
          x[i] = x[i] + s;
  • MIPS assembly code:
        ; R1 = &x[1]
        ; R2 = &x[1000]+8
        ; F2 = s
        Loop: L.D   F0,0(R1)     ; F0 = x[i]
              ADD.D F4,F0,F2     ; F4 = x[i]+s
              S.D   0(R1),F4     ; x[i] = F4
              ADDI  R1,R1,8      ; R1 = &x[i+1]
              BNE   R1,R2,Loop   ; branch if R1 != &x[1000]+8
  ACA H.Corporaal

  8. Schedule this on a MIPS Pipeline
  • FP operations are mostly multicycle
  • The pipeline must be stalled if an instruction uses the result of a not yet finished multicycle operation
  • We'll assume the following latencies:
        Producing instruction   Consuming instruction   Latency (clock cycles)
        FP ALU op               FP ALU op               3
        FP ALU op               Store double            2
        Load double             FP ALU op               1
        Load double             Store double            0
  ACA H.Corporaal

  9. Where to Insert Stalls
  • How would this loop be executed on the MIPS FP pipeline?
        Loop: L.D   F0,0(R1)
              ADD.D F4,F0,F2
              S.D   0(R1),F4
              ADDI  R1,R1,8
              BNE   R1,R2,Loop
  ACA H.Corporaal

  10. Where to Insert Stalls
  • How would this loop be executed on the MIPS FP pipeline?
  • 10 cycles per iteration:
        Loop: L.D   F0,0(R1)
              stall
              ADD.D F4,F0,F2
              stall
              stall
              S.D   0(R1),F4
              ADDI  R1,R1,8
              stall
              BNE   R1,R2,Loop
              stall
  ACA H.Corporaal

  11. Code Scheduling to Avoid Stalls
  • Can we reorder the instructions to avoid stalls?
  • Execution time reduced from 10 to 6 cycles per iteration
  • But only 3 instructions perform useful work; the rest is loop overhead
        Loop: L.D   F0,0(R1)
              ADDI  R1,R1,8
              ADD.D F4,F0,F2
              stall
              BNE   R1,R2,Loop
              S.D   -8(R1),F4
  ACA H.Corporaal

  12. Loop Unrolling: increasing ILP
  At source level:
        for (i=1; i<=1000; i++)
          x[i] = x[i] + s;
  becomes, after unrolling 4 times:
        for (i=1; i<=1000; i=i+4) {
          x[i]   = x[i]   + s;
          x[i+1] = x[i+1] + s;
          x[i+2] = x[i+2] + s;
          x[i+3] = x[i+3] + s;
        }
  MIPS code after unrolling and scheduling:
        Loop: L.D   F0,0(R1)
              L.D   F6,8(R1)
              L.D   F10,16(R1)
              L.D   F14,24(R1)
              ADD.D F4,F0,F2
              ADD.D F8,F6,F2
              ADD.D F12,F10,F2
              ADD.D F16,F14,F2
              S.D   0(R1),F4
              S.D   8(R1),F8
              ADDI  R1,R1,32
              S.D   -16(R1),F12
              BNE   R1,R2,Loop
              S.D   -8(R1),F16
  • Drawbacks:
  • loop unrolling increases code size
  • more registers are needed
  ACA H.Corporaal
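
The unrolled source above assumes the trip count is a multiple of 4. A hedged C sketch of the same transformation for an arbitrary trip count, with a cleanup (epilogue) loop; the function name add_scalar_unrolled is illustrative:

    #include <stdio.h>

    #define N 1000

    /* 4x unrolled version of: for (i = 0; i < n; i++) x[i] += s; */
    static void add_scalar_unrolled(double *x, int n, double s)
    {
        int i = 0;
        for (; i + 3 < n; i += 4) {   /* unrolled body: 4 independent updates */
            x[i]     += s;
            x[i + 1] += s;
            x[i + 2] += s;
            x[i + 3] += s;
        }
        for (; i < n; i++)            /* cleanup loop for the remaining 0-3 elements */
            x[i] += s;
    }

    int main(void)
    {
        static double x[N];
        add_scalar_unrolled(x, N, 3.0);
        printf("%f %f\n", x[0], x[N - 1]);   /* expect 3.000000 3.000000 */
        return 0;
    }

Besides the code-size and register-pressure drawbacks listed above, the cleanup loop is extra code the compiler must also generate.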

  13. Multiple issue architectures How to get CPI < 1 ? • Superscalar • Statically scheduled • Dynamically scheduled (see previous lecture) • VLIW ? • SIMD / Vector ? ACA H.Corporaal

  14. Multiple-Issue Processors
  • Vector processing: explicit coding of independent loops as operations on large vectors of numbers
  • multimedia instructions are being added to many processors
  • Multiple-issue processors:
  • Superscalar: varying number of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo): dynamic issue capability
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
  • VLIW (very long instruction word): fixed number of instructions (4-16) scheduled by the compiler: static issue capability
  • Intel Architecture-64 (IA-64, Itanium), TriMedia, TI C6x
  • The anticipated success of multiple issue led to the Instructions Per Cycle (IPC) metric being used instead of CPI
  ACA H.Corporaal

  15. Statically Scheduled Superscalar
  • Static superscalar 2-issue processor: 1 Integer & 1 FP
    – Fetch 64 bits/clock cycle; Int on left, FP on right
    – Can only issue the 2nd instruction if the 1st instruction issues
    – More ports needed on the FP register file to execute an FP load & an FP op in parallel
        Type               Pipe stages
        Int. instruction   IF ID EX MEM WB
        FP  instruction    IF ID EX MEM WB
        Int. instruction      IF ID EX MEM WB
        FP  instruction       IF ID EX MEM WB
        Int. instruction         IF ID EX MEM WB
        FP  instruction          IF ID EX MEM WB
  • The 1-cycle load delay now impacts the next 3 instructions!
  ACA H.Corporaal

  16. Example
  Load: 1 cycle latency; ALU op: 2 cycles latency
        for (i=1; i<=1000; i++)
          a[i] = a[i] + s;
        Integer instruction   FP instruction     Cycle
    L:  LD   F0,0(R1)                              1
        LD   F6,8(R1)                              2
        LD   F10,16(R1)       ADDD F4,F0,F2        3
        LD   F14,24(R1)       ADDD F8,F6,F2        4
        LD   F18,32(R1)       ADDD F12,F10,F2      5
        SD   0(R1),F4         ADDD F16,F14,F2      6
        SD   8(R1),F8         ADDD F20,F18,F2      7
        SD   16(R1),F12                            8
        ADDI R1,R1,40                              9
        SD   -16(R1),F16                          10
        BNE  R1,R2,L                              11
        SD   -8(R1),F20                           12
  • 2.4 cycles per element vs. 3.5 for the ordinary MIPS pipeline
  • Int and FP instructions are not perfectly balanced
  ACA H.Corporaal

  17. Multiple Issue Issues
  • While the Integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  • exactly 50% FP operations AND no hazards
  • More complex decode and issue:
  • even a 2-issue superscalar must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue (N-issue: ~O(N²) comparisons)
  • Register file for a 2-issue superscalar: needs 2x reads and 1x writes/cycle
  • Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue (see the sketch below):
        add r1, r2, r3    ->  add p11, p4, p7
        sub r4, r1, r2    ->  sub p22, p11, p4
        lw  r1, 4(r4)     ->  lw  p23, 4(p22)
        add r5, r1, r2    ->  add p12, p23, p4
    Imagine doing this transformation in a single cycle!
  • Result buses: need to complete multiple instructions/cycle
  • Need multiple buses with associated matching logic at every reservation station
  ACA H.Corporaal
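
To make the renaming step concrete, here is a hedged C sketch of a sequential rename pass over simple three-operand instructions; the Instr format, table sizes and "allocate upwards" free list are simplifications for illustration, not a description of real rename hardware (which must perform all four renames within a single cycle, including the intra-group dependences):

    #include <stdio.h>

    #define NUM_ARCH 32
    #define NUM_PHYS 64

    typedef struct { int dst, src1, src2; } Instr;  /* simplified 3-operand form */

    static int map[NUM_ARCH];   /* architectural -> physical register mapping */
    static int next_free;       /* trivial free list: allocate upwards        */

    /* Rename one instruction: sources read the current map,
     * the destination gets a fresh physical register.        */
    static Instr rename(Instr in)
    {
        Instr out;
        out.src1 = map[in.src1];
        out.src2 = map[in.src2];
        out.dst  = next_free++;          /* assumes next_free stays < NUM_PHYS */
        map[in.dst] = out.dst;           /* later readers see the new name     */
        return out;
    }

    int main(void)
    {
        for (int r = 0; r < NUM_ARCH; r++) map[r] = r;   /* identity at start */
        next_free = NUM_ARCH;

        /* add r1,r2,r3 ; sub r4,r1,r2 ; lw r1,4(r4) ; add r5,r1,r2
         * (the load is modelled as a 3-operand instruction for simplicity) */
        Instr prog[] = { {1, 2, 3}, {4, 1, 2}, {1, 4, 0}, {5, 1, 2} };
        for (int i = 0; i < 4; i++) {
            Instr p = rename(prog[i]);
            printf("p%d <- p%d, p%d\n", p.dst, p.src1, p.src2);
        }
        return 0;
    }

Note how the third instruction's destination r1 gets a new physical register, just like p23 in the slide's example, so the final add reads the newest definition.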

  18. VLIW Processors
  • Superscalar HW too difficult to build => let the compiler find independent instructions and pack them into one Very Long Instruction Word (VLIW)
  • Example: VLIW processor with 2 ld/st units, two FP units, one integer/branch unit, no branch delay
        Ld/st 1           Ld/st 2           FP 1              FP 2              Int
        LD F0,0(R1)       LD F6,8(R1)
        LD F10,16(R1)     LD F14,24(R1)
        LD F18,32(R1)     LD F22,40(R1)     ADDD F4,F0,F2     ADDD F8,F6,F2
        LD F26,48(R1)                       ADDD F12,F10,F2   ADDD F16,F14,F2
                                            ADDD F20,F18,F2   ADDD F24,F22,F2
        SD 0(R1),F4       SD 8(R1),F8       ADDD F28,F26,F2
        SD 16(R1),F12     SD 24(R1),F16
        SD 32(R1),F20     SD 40(R1),F24                                         ADDI R1,R1,56
        SD -8(R1),F28                                                           BNE R1,R2,L
  ACA H.Corporaal

  19. Superscalar versus VLIW VLIW advantages: • Much simpler to build. Potentially faster VLIW disadvantages and proposed solutions: • Binary code incompatibility • Object code translation or emulation • Less strict approach (EPIC, IA-64, Itanium) • Increase in code size, unfilled slots are wasted bits • Use clever encodings, only one immediate field • Compress instructions in memory and decode them when they are fetched • Lockstep operation: if the operation in one instruction slot stalls, the entire processor is stalled • Less strict approach ACA H.Corporaal

  20. Advanced compiler support techniques • Loop-level parallelism • Software pipelining • Global scheduling (across basic blocks) ACA H.Corporaal

  21. Detecting Loop-Level Parallelism
  • Loop-carried dependence: a statement executed in a certain iteration depends on a statement executed in an earlier iteration
  • If there is no loop-carried dependence, then the loop's iterations can be executed in parallel
        for (i=1; i<=100; i++) {
          A[i+1] = A[i] + C[i];      /* S1 */
          B[i+1] = B[i] + A[i+1];    /* S2 */
        }
  (dependence graph: S1 and S2 each depend on themselves across iterations, and S2 depends on S1 within an iteration, so this loop is not parallel)
  • A loop is parallel <=> the corresponding dependence graph does not contain a cycle
  ACA H.Corporaal
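
A hedged C sketch that contrasts the slide's loop with one that has no loop-carried dependence; the arrays A, B, C follow the slide, while D and the surrounding function are illustrative:

    #define N 102
    static double A[N], B[N], C[N], D[N];

    void example(void)
    {
        /* Loop 1 (the slide's loop): NOT parallel.
         * S1: A[i+1] depends on A[i], written by S1 in the previous iteration.
         * S2: B[i+1] depends on B[i] (previous iteration) and A[i+1] (same iteration). */
        for (int i = 1; i <= 100; i++) {
            A[i + 1] = A[i] + C[i];        /* S1 */
            B[i + 1] = B[i] + A[i + 1];    /* S2 */
        }

        /* Loop 2: parallel. Every iteration touches only its own elements,
         * so the iterations can run in any order or concurrently.          */
        for (int i = 1; i <= 100; i++) {
            A[i] = B[i] + C[i];
            D[i] = 2.0 * A[i];             /* same-iteration dependence only */
        }
    }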

  22. Finding Dependences
  • Is there a dependence in the following loop?
        for (i=1; i<=100; i++)
          A[2*i+3] = A[2*i] + 5.0;
  • Affine expression: an expression of the form a*i + b (a, b constants, i the loop index variable)
  • Does the equation a*i + b = c*j + d have a solution?
  • GCD test: if there is a solution, then GCD(a,c) must divide d-b
  • For the example above: a=2, b=3, c=2, d=0; GCD(2,2)=2 does not divide d-b=-3, so the two references are independent
  • Note: because the GCD test does not take the loop bounds into account, there are cases where the GCD test says "yes, there is a solution" while in reality there isn't
  ACA H.Corporaal
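
A minimal C sketch of the GCD test for a write A[a*i+b] and a read A[c*j+d]; the gcd_test helper is an illustrative name, and (as the note above says) a "maybe" answer can be a false positive because loop bounds are ignored:

    #include <stdio.h>
    #include <stdlib.h>

    static long gcd(long x, long y)
    {
        x = labs(x); y = labs(y);
        while (y != 0) { long t = x % y; x = y; y = t; }
        return x;
    }

    /* A dependence between A[a*i+b] and A[c*j+d] is only possible if
     * gcd(a, c) divides (d - b).  Returns 1 = "maybe", 0 = "independent". */
    static int gcd_test(long a, long b, long c, long d)
    {
        long g = gcd(a, c);
        if (g == 0)                 /* both coefficients zero: compare constants */
            return b == d;
        return (d - b) % g == 0;
    }

    int main(void)
    {
        /* Slide example: A[2*i+3] = A[2*i] + 5.0  ->  a=2, b=3, c=2, d=0 */
        printf("%s\n", gcd_test(2, 3, 2, 0) ? "maybe dependent" : "independent");
        return 0;
    }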

  23. Software Pipelining
  • We have already seen loop unrolling
  • Software pipelining is a related technique that consumes less code space. It interleaves instructions from different iterations
  • instructions within one iteration are often dependent on each other
  (figure: instructions from iterations 0, 1 and 2 are interleaved; the steady-state software-pipelined iteration forms the kernel)
  ACA H.Corporaal

  24. Simple Software Pipelining Example
        L: l.d   f0,0(r1)    # load M[i]
           add.d f4,f0,f2    # compute M[i]
           s.d   f4,0(r1)    # store M[i]
           addi  r1,r1,-8    # i = i-1
           bne   r1,r2,L
  • Software pipelined loop:
        L: s.d   f4,16(r1)   # store M[i]
           add.d f4,f0,f2    # compute M[i-1]
           l.d   f0,0(r1)    # load M[i-2]
           addi  r1,r1,-8
           bne   r1,r2,L
  • Need hardware to avoid the WAR hazards
  ACA H.Corporaal
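
At source level the same idea shows up as a prologue that fills the pipeline, a kernel that mixes the store of iteration i-2, the add of iteration i-1 and the load of iteration i, and an epilogue that drains it. A hedged C sketch for the x[i] = x[i] + s loop; function and variable names are illustrative:

    #include <stdio.h>

    #define N 1000

    /* Software-pipelined version of: for (i = 0; i < n; i++) x[i] = x[i] + s; */
    static void add_scalar_swp(double *x, int n, double s)
    {
        if (n < 2) {                     /* tiny trip counts: no pipeline needed */
            for (int i = 0; i < n; i++) x[i] += s;
            return;
        }

        double loaded = x[0];            /* prologue: fill the pipeline */
        double summed = loaded + s;
        loaded = x[1];

        for (int i = 2; i < n; i++) {    /* kernel: one store, one add, one load */
            x[i - 2] = summed;           /* store result of iteration i-2 */
            summed   = loaded + s;       /* add for iteration i-1 */
            loaded   = x[i];             /* load for iteration i */
        }

        x[n - 2] = summed;               /* epilogue: drain the pipeline */
        x[n - 1] = loaded + s;
    }

    int main(void)
    {
        static double x[N];
        add_scalar_swp(x, N, 1.0);
        printf("%f %f\n", x[0], x[N - 1]);   /* expect 1.000000 1.000000 */
        return 0;
    }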

  25. Global code scheduling
  • Loop unrolling and software pipelining work well when there are no control statements (if statements) in the loop body -> the loop is a single basic block
  • Global code scheduling: scheduling/moving code across branches: larger scheduling scope
  • When can the assignments to B and C be moved before the test? (see the sketch below)
  (figure: flow graph with A[i]=A[i]+B[i], followed by the test A[i]=0?; one path assigns B[i], the other executes X; both paths join at the assignment to C[i])
  ACA H.Corporaal
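
A hedged C sketch of the code motion the question asks about; val1, val2 and X() are placeholders for whatever the two paths and the join block compute, and the comments state the conditions under which each move is legal:

    #include <stdio.h>

    #define N 8
    static double A[N], B[N], C[N];
    static const double val1 = 1.0, val2 = 2.0;       /* placeholders */
    static void X(void) { /* else-path work */ }

    static void original(int i)
    {
        A[i] = A[i] + B[i];
        if (A[i] == 0) B[i] = val1; else X();
        C[i] = val2;
    }

    /* Globally scheduled: the assignments to B[i] and C[i] are moved above the test. */
    static void scheduled(int i)
    {
        A[i] = A[i] + B[i];      /* must still read the old B[i] */
        double old_b = B[i];     /* compensation copy, needed only if B[i] may still
                                    be live on the else path */
        B[i] = val1;             /* hoisted: now speculative */
        C[i] = val2;             /* hoisted: legal if val2 does not depend on the branch,
                                    on B[i] or on X(), and nothing else touches C[i] */
        if (A[i] == 0) {
            /* then path: nothing left to do */
        } else {
            B[i] = old_b;        /* undo the speculative store */
            X();
        }
    }

    int main(void)
    {
        original(0);
        scheduled(1);
        printf("%f %f\n", C[0], C[1]);   /* both 2.000000 */
        return 0;
    }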

  26. Which scheduling scope? Hyperblock/region Trace Superblock Decision Tree ACA H.Corporaal

  27. Comparing scheduling scopes
                                Trace   Sup.block   Hyp.block   Dec.Tree   Region
        Multiple exc. paths     No      No          Yes         Yes        Yes
        Side-entries allowed    Yes     No          No          No         No
        Join points allowed     Yes     No          Yes         No         Yes
        Code motion down joins  Yes     No          No          No         No
        Must be if-convertible  No      No          Yes         No         No
        Tail dup. before sched. No      Yes         No          Yes        No
  ACA H.Corporaal

  28. Scheduling scope creation
  Partitioning a CFG into scheduling scopes:
  (figure: the same CFG with blocks A-G partitioned into a trace and into a superblock; superblock formation tail-duplicates the blocks below the join)
  ACA H.Corporaal

  29. Scheduling scope creation
  Partitioning a CFG into scheduling scopes:
  (figure: the same CFG partitioned into a hyperblock/region and into a decision tree, again using tail duplication of blocks below join points)
  ACA H.Corporaal

  30. Trace Scheduling • Find the most likely sequence of basic blocks that will be executed consecutively (trace selection) • Optimize the trace as much as possible (trace compaction) • move operations as early as possible in the trace • pack the operations in as few VLIWs as possible • additional bookkeeping code may be necessary on exit points of the trace ACA H.Corporaal

  31. Hardware support for compile-time scheduling • Predication • (discussed already) • Deferred exceptions • Speculative loads ACA H.Corporaal

  32. Predicated Instructions
  • Avoid branch prediction by turning branches into conditional or predicated instructions: if the condition is false, then neither store the result nor cause an exception
  • Expanded ISAs of Alpha, MIPS, PowerPC and SPARC have conditional move; PA-RISC can annul any following instruction
  • IA-64/Itanium: conditional execution of any instruction
  • Examples:
        if (R1==0) R2 = R3;        ->   CMOVZ  R2,R3,R1

        if (R1 < R2) R3 = R1;      ->   SLT    R9,R1,R2
        else         R3 = R2;           CMOVNZ R3,R1,R9
                                        CMOVZ  R3,R2,R9
  ACA H.Corporaal
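
A hedged C sketch of the same if-conversion at source level; min_branchy and min_predicated are illustrative names, and a compiler is free (but not obliged) to lower the second version to an SLT/CMOV-style branchless sequence, or to predicated instructions on IA-64:

    #include <stdio.h>

    /* Branchy version: the hardware has to predict the branch. */
    static long min_branchy(long r1, long r2)
    {
        long r3;
        if (r1 < r2) r3 = r1; else r3 = r2;
        return r3;
    }

    /* If-converted version: the condition becomes a data value (a predicate)
     * and the result is selected without control flow, as in SLT + CMOVNZ/CMOVZ. */
    static long min_predicated(long r1, long r2)
    {
        long p = (r1 < r2);        /* predicate, like SLT R9,R1,R2 */
        return p ? r1 : r2;        /* select, like the CMOV pair   */
    }

    int main(void)
    {
        printf("%ld %ld\n", min_branchy(3, 7), min_predicated(7, 3));   /* 3 3 */
        return 0;
    }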

  33. Deferred Exceptions
  Assuming the then-part is almost always executed:
        if (A==0) A = B; else A = A+4;
        ld   r1,0(r3)   # load A
        bnez r1,L1      # test A
        ld   r1,0(r2)   # then part; load B
        j    L2
    L1: addi r1,r1,4    # else part; inc A
    L2: st   r1,0(r3)   # store A
  Speculative version, with the load of B hoisted above the test:
        ld   r1,0(r3)   # load A
        ld   r9,0(r2)   # speculative load of B
        beqz r1,L3      # test A
        addi r9,r1,4    # else part
    L3: st   r9,0(r3)   # store A
  • What if the load generates a page fault?
  • What if the load generates an "index-out-of-bounds" exception?
  ACA H.Corporaal
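
The hazard is easy to reproduce at source level: hoisting a load above its guarding test can raise a fault the original program never raised, which is exactly why deferred-exception support is needed. A hedged C sketch; the function and pointer names are illustrative:

    #include <stdio.h>

    /* Original: the load of *b is guarded and can never fault here. */
    static long guarded(long a, const long *b)
    {
        if (a == 0)
            a = *b;        /* then part: only executed when the guard allows it */
        else
            a = a + 4;     /* else part */
        return a;
    }

    /* Speculative version: the load is hoisted above the test to shorten the
     * critical path.  If b is invalid (e.g. NULL) while a != 0, this version
     * faults where the original did not; hence the non-faulting speculative
     * loads plus a later exception check (sld/speck) on the next slide.      */
    static long speculative(long a, const long *b)
    {
        long t = *b;       /* speculative load of B */
        if (a != 0)
            t = a + 4;
        return t;
    }

    int main(void)
    {
        long v = 42;
        printf("%ld %ld\n", guarded(0, &v), speculative(5, &v));   /* 42 9 */
        /* guarded(5, NULL) is fine; speculative(5, NULL) would crash. */
        return 0;
    }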

  34. HW supporting Speculative Loads
  • Speculative load (sld): does not generate exceptions
  • Speculation check instruction (speck): checks for an exception; the exception occurs when this instruction is executed
        ld    r1,0(r3)   # load A
        sld   r9,0(r2)   # speculative load of B
        bnez  r1,L1      # test A
        speck 0(r2)      # perform exception check
        j     L2
    L1: addi  r9,r1,4    # else part
    L2: st    r9,0(r3)   # store A
  ACA H.Corporaal

  35. Avoiding superscalar complexity
  • An alternative: EPIC (Explicitly Parallel Instruction Computing)
  • Best of both worlds?
  • Superscalar: expensive, but binary compatible
  • VLIW: simple, but not compatible
  ACA H.Corporaal

  36. EPIC Architecture: IA-64
  Explicitly Parallel Instruction Computing
  • IA-64 -> Merced (2001), McKinley (2002), Montecito (2-core, 2006), Tukwila (4-core, 2008)
  • the architecture is now called Itanium
  Register model:
  • 128 64-bit integer registers (each with an extra NaT bit), with register stack and rotating registers
  • 128 82-bit floating-point registers, rotating
  • 64 1-bit boolean (predicate) registers
  • 8 64-bit branch target address registers
  • system control registers
  ACA H.Corporaal

  37. (figure slide, 2002) ACA H.Corporaal

  38. EPIC Architecture: IA-64
  • Instructions are grouped in 128-bit bundles:
  • 3 x 41-bit instructions
  • 5 template bits indicate the instruction types and the stop locations
  • Each 41-bit instruction
  • starts with a 4-bit opcode, and
  • ends with a 6-bit guard (boolean) register-id
  ACA H.Corporaal
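
A hedged C sketch that unpacks a bundle according to this layout (5 template bits in the least-significant positions, followed by three 41-bit slots); the exact bit positions and the Bundle/bits helpers are illustrative, not taken from the Intel manuals:

    #include <stdint.h>
    #include <stdio.h>

    /* A 128-bit bundle held as two 64-bit halves (lo = bits 0..63). */
    typedef struct { uint64_t lo, hi; } Bundle;

    /* Extract `len` (<= 41 here) bits starting at bit `pos` of the 128-bit value. */
    static uint64_t bits(Bundle b, unsigned pos, unsigned len)
    {
        uint64_t v;
        if (pos >= 64)
            v = b.hi >> (pos - 64);
        else if (pos == 0)
            v = b.lo;
        else
            v = (b.lo >> pos) | (b.hi << (64 - pos));
        return v & ((1ULL << len) - 1);
    }

    int main(void)
    {
        Bundle b = { 0x0123456789ABCDEFULL, 0xFEDCBA9876543210ULL };

        uint64_t template = bits(b, 0, 5);    /* type + stop information        */
        uint64_t slot0    = bits(b, 5, 41);   /* three 41-bit instruction slots */
        uint64_t slot1    = bits(b, 46, 41);
        uint64_t slot2    = bits(b, 87, 41);

        printf("template=%llu slot0=%#llx slot1=%#llx slot2=%#llx\n",
               (unsigned long long)template, (unsigned long long)slot0,
               (unsigned long long)slot1,    (unsigned long long)slot2);
        return 0;
    }

Note that 5 + 3 x 41 = 128, so the template and the three slots exactly fill the bundle.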

  39. (figure slide) ACA H.Corporaal

  40. EPIC Architecture: IA-64
  • IA-64 looks like a VLIW. However:
  • instructions contain only one operation; the compiler can indicate that successive instructions can be executed in parallel
  • the HW does the operation-to-FU binding
  • pipeline latencies are not visible in the ISA
  • These measures make the ISA independent of the number of FUs and of the pipeline latencies => the ISA supports multiple implementations
  ACA H.Corporaal

  41. Montecito 2006 ACA H.Corporaal
