Compiler techniques for exposing ILP

Instruction Level Parallelism
• Potential overlap among instructions
• Few possibilities in a basic block
    • Blocks are small (6-7 instructions)
    • Instructions are dependent
• Goal: Exploit ILP across multiple basic blocks
    • Iterations of a loop

    for (i = 1000; i > 0; i=i-1)
        x[i] = x[i] + s;

Basic Scheduling

    for (i = 1000; i > 0; i=i-1)
        x[i] = x[i] + s;

Sequential MIPS Assembly Code:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

Pipelined execution (clock cycle shown at right):

    Loop: LD    F0, 0(R1)        1
          stall                  2
          ADDD  F4, F0, F2       3
          stall                  4
          stall                  5
          SD    0(R1), F4        6
          SUBI  R1, R1, #8       7
          stall                  8
          BNEZ  R1, Loop         9
          stall                 10

Scheduled pipelined execution (clock cycle shown at right):

    Loop: LD    F0, 0(R1)        1
          SUBI  R1, R1, #8       2
          ADDD  F4, F0, F2       3
          stall                  4
          BNEZ  R1, Loop         5
          SD    8(R1), F4        6

Loop Unrolling

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
    Exit:

Pros:
• Larger basic block
• More scope for scheduling and eliminating dependencies
Cons:
• Increases code size
Comment:
• Often a precursor step for other optimizations
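In C terms, the unrolled listing corresponds roughly to the sketch below. The function names `add_scalar_ref` and `add_scalar_unrolled_naive` are hypothetical, and the 1-based indexing (element 0 unused) is an assumption chosen to mirror the slide's `for (i = 1000; i > 0; i=i-1)` loop. The intermediate decrement and exit test are kept after every copy of the body, as in the assembly.

```c
/* Reference loop from the slides: x[i] = x[i] + s for i = n .. 1.
   x must hold at least n + 1 elements; x[0] is unused. */
void add_scalar_ref(double *x, int n, double s) {
    for (int i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Body replicated four times. Each copy still decrements i and tests
   for exit, just like the SUBI/BEQZ pairs in the unrolled listing. */
void add_scalar_unrolled_naive(double *x, int n, double s) {
    int i = n;
    if (i <= 0)
        return;                     /* guard added for the C sketch only */
    for (;;) {
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
    }
}
```

Because the exit tests are still present, this version works for any non-negative `n`, but it has gained nothing yet beyond a larger loop body.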

Loop Transformations
• Instruction independence is the key requirement for the transformations
• Example
    • Determine that it is legal to move SD after SUBI and BNEZ
    • Determine that unrolling is useful (iterations are independent)
    • Use different registers to avoid unnecessary constraints
    • Eliminate extra tests and branches
    • Determine that LD and SD can be interchanged
    • Schedule the code, preserving the semantics of the code

1. Eliminating Name Dependences

Before (every copy of the body reuses F0 and F4):

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F0, -8(R1)
          ADDD  F4, F0, F2
          SD    -8(R1), F4
          LD    F0, -16(R1)
          ADDD  F4, F0, F2
          SD    -16(R1), F4
          LD    F0, -24(R1)
          ADDD  F4, F0, F2
          SD    -24(R1), F4
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

After Register Renaming:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

2. Eliminating Control Dependences

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
    Exit:

• The intermediate BEQZ branches are never taken when the iteration count is a multiple of 4 (as with the 1000 iterations here)
• Eliminate them!

3. Eliminating Data Dependences

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

• Data dependences: each LD and SD depends on the preceding SUBI through R1
    • Force sequential execution of iterations
• Compiler removes this dependence by:
    • Computing intermediate R1 values
    • Eliminating intermediate SUBI
    • Changing final SUBI
• Data flow analysis
    • Can do on registers
    • Cannot do easily on memory locations
    • e.g., do 100(R1) and 20(R2) refer to the same location?
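The three steps so far (renaming, dropping the intermediate exit tests, folding the intermediate decrements into address offsets) can be sketched in C as follows. The function name `add_scalar_unrolled4` is hypothetical, the temporaries `t0`..`t3` stand in for the renamed registers, and the code assumes `n` is a multiple of 4, as with the 1000 iterations in the slides.

```c
/* x[i] = x[i] + s for i = n .. 1, unrolled by four.
   Assumes n is a non-negative multiple of 4; x[0] is unused. */
void add_scalar_unrolled4(double *x, int n, double s) {
    /* One decrement by 4 replaces four decrements by 1
       (SUBI R1, R1, #32 instead of four SUBI R1, R1, #8). */
    for (int i = n; i > 0; i = i - 4) {
        /* Offsets i, i-1, i-2, i-3 are the precomputed "intermediate
           R1 values": 0(R1), -8(R1), -16(R1), -24(R1). */
        double t0 = x[i]     + s;   /* distinct temporaries ~ renamed registers */
        double t1 = x[i - 1] + s;
        double t2 = x[i - 2] + s;
        double t3 = x[i - 3] + s;
        x[i]     = t0;
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }
}
```

With the four bodies independent of each other, a scheduler is free to group the loads, then the adds, then the stores.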

4. Alleviating Data Dependencies

Unrolled loop:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

Scheduled unrolled loop:

    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)
          LD    F14, -24(R1)
          ADDD  F4, F0, F2
          ADDD  F8, F6, F2
          ADDD  F12, F10, F2
          ADDD  F16, F14, F2
          SD    0(R1), F4
          SD    -8(R1), F8
          SUBI  R1, R1, #32
          SD    16(R1), F12
          BNEZ  R1, Loop
          SD    8(R1), F16
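A rough way to compare the versions is cycles per array element. The helper below is a hypothetical illustration. It takes the 10-cycle and 6-cycle counts from the Basic Scheduling slide, and it assumes the scheduled unrolled loop issues its 14 instructions in 14 cycles with no stalls, which the listing does not state explicitly.

```c
/* Cycles spent per array element, given the cycles for one pass through
   the loop body and the number of elements that pass handles. */
double cycles_per_element(int cycles_per_pass, int elements_per_pass) {
    return (double)cycles_per_pass / (double)elements_per_pass;
}
```

Under those assumptions, `cycles_per_element(10, 1)` describes the unscheduled loop, `cycles_per_element(6, 1)` the scheduled loop, and `cycles_per_element(14, 4)` the scheduled unrolled loop.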

Some General Comments
• Dependences are a property of programs
• Actual hazards are a property of the pipeline
• Techniques to avoid dependence limitations
    • Maintain dependences but avoid hazards
        • Code scheduling
            • hardware
            • software
    • Eliminate dependences by code transformations
        • Complex
        • Compiler-based

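Loop-level Parallelism
• Primary focus of dependence analysis
• Determine all dependences and find cycles

Dependence within one iteration only (w[i] uses the x[i] computed in the same iteration):

    for (i=1; i<=100; i=i+1) {
        x[i] = y[i] + z[i];
        w[i] = x[i] + v[i];
    }

Loop-carried dependence that forms a cycle (x[i+1] needs x[i] from the previous iteration):

    for (i=1; i<=100; i=i+1) {
        x[i+1] = x[i] + z[i];
    }

Loop-carried dependence without a cycle (x[i] uses the y[i] written by the previous iteration):

    for (i=1; i<=100; i=i+1) {
        x[i] = x[i] + y[i];
        y[i+1] = w[i] + z[i];
    }

The same loop transformed so that the dependence falls within one iteration:

    x[1] = x[1] + y[1];
    for (i=1; i<=99; i=i+1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];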
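To check that the transformed loop computes the same values as the loop with the acyclic loop-carried dependence, both can be written as C functions and run on identical inputs. The function names `loop_original` and `loop_transformed` are hypothetical; the arrays are assumed to hold at least 102 elements so that indices 1..101 are valid, with element 0 unused.

```c
/* Original form: iteration i reads the y[i] written by iteration i-1. */
void loop_original(double *x, double *y, const double *w, const double *z) {
    for (int i = 1; i <= 100; i = i + 1) {
        x[i] = x[i] + y[i];
        y[i + 1] = w[i] + z[i];
    }
}

/* Transformed form: the first x update and the last y update are peeled
   off, so each iteration now produces y[i+1] and consumes it itself. */
void loop_transformed(double *x, double *y, const double *w, const double *z) {
    x[1] = x[1] + y[1];
    for (int i = 1; i <= 99; i = i + 1) {
        y[i + 1] = w[i] + z[i];
        x[i + 1] = x[i + 1] + y[i + 1];
    }
    y[101] = w[100] + z[100];
}
```

After the transformation no iteration depends on a value produced by another iteration, so the iterations can be overlapped.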

Dependence Analysis Algorithms
• Assume array indexes are affine (ai + b)
• GCD test: for two affine array indexes ai+b and ci+d, if a loop-carried dependence exists, then GCD(c,a) must divide (d-b)

    x[8*i] = x[4*i + 2] + 3

    • Here a = 8, b = 0, c = 4, d = 2: GCD(8,4) = 4 does not divide (2-0) = 2, so no dependence is possible
• Determining whether a dependence actually exists is NP-complete in general
• a, b, c, and d may not be known at compile time
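The GCD test is small enough to write out. The sketch below is a minimal version under the slide's assumptions (affine indexes, constants known at compile time); the names `gcd_int` and `gcd_test_may_depend` are hypothetical. Note the direction of the implication: a failed divisibility check rules a dependence out, while a passed check only means a dependence may exist, since loop bounds are ignored.

```c
#include <stdlib.h>

/* Greatest common divisor of |a| and |b| (Euclid's algorithm). */
static int gcd_int(int a, int b) {
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD test for a store to x[a*i + b] and a load from x[c*i + d].
   Returns 0 if a loop-carried dependence is impossible,
   1 if one MAY exist (the test does not consider loop bounds). */
int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd_int(c, a);
    if (g == 0)                 /* both strides are zero: fixed elements */
        return b == d;
    return (d - b) % g == 0;
}
```

For the slide's example, `gcd_test_may_depend(8, 0, 4, 2)` returns 0, because 4 does not divide 2.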

Software Pipelining

[Figure: Iteration 0, Iteration 1, Iteration 2, and Iteration 3 overlapped in time; labels: Start-up, Software pipelined iteration, Finish-up]

Example

Loop body in three consecutive iterations:

    Iteration i         Iteration i+1       Iteration i+2
    LD   F0, 0(R1)      LD   F0, 0(R1)      LD   F0, 0(R1)
    ADDD F4, F0, F2     ADDD F4, F0, F2     ADDD F4, F0, F2
    SD   0(R1), F4      SD   0(R1), F4      SD   0(R1), F4

Original loop:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

Software pipelined loop:

    Loop: SD    16(R1), F4
          ADDD  F4, F0, F2
          LD    F0, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
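The slide shows only the steady-state loop. A C sketch that also includes the start-up and finish-up code might look like the following. The function name `add_scalar_swp` is hypothetical, the variables `f0` and `f4` are named after the registers in the listing, and the code assumes `n >= 2` and 1-based indexing with element 0 unused.

```c
/* x[i] = x[i] + s for i = n .. 1, software pipelined.
   Requires n >= 2; x must hold at least n + 1 elements. */
void add_scalar_swp(double *x, int n, double s) {
    /* Start-up: fill the pipeline. */
    double f0 = x[n];           /* LD   for element n     */
    double f4 = f0 + s;         /* ADDD for element n     */
    f0 = x[n - 1];              /* LD   for element n - 1 */

    /* Steady state: each pass mixes three different elements. */
    for (int i = n - 2; i > 0; i = i - 1) {
        x[i + 2] = f4;          /* SD   for element i + 2 (16(R1)) */
        f4 = f0 + s;            /* ADDD for element i + 1          */
        f0 = x[i];              /* LD   for element i     (0(R1))  */
    }

    /* Finish-up: drain the pipeline for elements 2 and 1. */
    x[2] = f4;
    x[1] = f0 + s;
}
```

Within one pass of the steady-state loop the store, add, and load no longer depend on each other, which is what removes the stalls inside a single loop body.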

Trace (global-code) Scheduling
• Find ILP across conditional branches
• Two-step process
    • Trace selection
        • Find a trace (sequence of basic blocks)
        • Use loop unrolling to generate long traces
        • Use static branch prediction for other conditional branches
    • Trace compaction
        • Squeeze the trace into a small number of wide instructions
        • Preserve data and control dependences

Trace Selection

[Figure: flowchart. A[I] = A[I] + B[I]; test A[I] = 0?; T branch: B[I] = ; F branch: X; join: C[I] = ]

            LW    R4, 0(R1)
            LW    R5, 0(R2)
            ADD   R4, R4, R5
            SW    0(R1), R4
            BNEZ  R4, Else
            . . . .
            SW    0(R2), . . .
            J     Join
    Else:   . . . .
            X
    Join:   . . . .
            SW    0(R3), . . .

Summary of Compiler Techniques
• Try to avoid dependence stalls
• Loop unrolling
    • Reduce loop overhead
• Software pipelining
    • Reduce single body dependence stalls
• Trace scheduling
    • Reduce impact of other branches
• Compilers use a mix of all three
• All techniques depend on prediction accuracy

Food for thought: Analyze this
• Analyze this for different values of X and Y
    • To evaluate different branch prediction schemes
    • For compiler scheduling purposes

          add   r1, r0, 1000    # all numbers in decimal
          add   r2, r0, a       # Base address of array a
    loop: andi  r10, r1, X
          beqz  r10, even
          lw    r11, 0(r2)
          addi  r11, r11, 1
          sw    0(r2), r11
    even: addi  r2, r2, 4
          subi  r1, r1, Y
          bnez  r1, loop
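One way to start the analysis is to simulate only the control flow and count how often `beqz r10, even` is taken for a given X and Y. The sketch below is an illustration, not part of the original exercise: the names `BranchStats`, `simulate_loop`, and `max_iters` are hypothetical, and `max_iters` is a safety cap added because r1 steps past zero when Y does not divide 1000.

```c
#include <stdint.h>

typedef struct {
    long iterations;      /* passes through the loop body             */
    long beqz_taken;      /* times `beqz r10, even` was taken         */
    long beqz_not_taken;  /* times it fell through to lw/addi/sw      */
} BranchStats;

/* Simulates the branch behaviour of the loop for mask X and step Y.
   r1 is modelled as a 32-bit register, so subtraction wraps around.
   Stops when r1 reaches zero or after max_iters passes (assumed cap). */
BranchStats simulate_loop(uint32_t X, uint32_t Y, long max_iters) {
    BranchStats st = {0, 0, 0};
    uint32_t r1 = 1000;
    do {
        st.iterations++;
        if ((r1 & X) == 0)          /* andi r10, r1, X ; beqz r10, even */
            st.beqz_taken++;
        else
            st.beqz_not_taken++;
        r1 = r1 - Y;                /* subi r1, r1, Y */
    } while (r1 != 0 && st.iterations < max_iters);   /* bnez r1, loop */
    return st;
}
```

For example, `simulate_loop(1, 1, 2000)` alternates between taken and not taken, while `simulate_loop(1, 2, 2000)` takes the `beqz` on every pass because r1 stays even. Those two patterns stress a branch predictor very differently.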
