Compiler techniques for exposing ILP


# Compiler techniques for exposing ILP - PowerPoint PPT Presentation



## PowerPoint Slideshow about 'Compiler techniques for exposing ILP' - darice

Presentation Transcript

### Compiler techniques for exposing ILP

Instruction Level Parallelism
• Potential overlap among instructions
• Few possibilities in a basic block
• Blocks are small (6-7 instructions)
• Instructions are dependent
• Goal: Exploit ILP across multiple basic blocks
• Iterations of a loop

for (i = 1000; i > 0; i=i-1)

x[i] = x[i] + s;

Basic Scheduling

Sequential MIPS Assembly Code

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BNEZ R1, Loop

for (i = 1000; i > 0; i=i-1)

x[i] = x[i] + s;

Pipelined execution:

Loop: LD F0, 0(R1) 1

stall 2

ADDD F4, F0, F2 3

stall 4

stall 5

SD 0(R1), F4 6

SUBI R1, R1, #8 7

stall 8

BNEZ R1, Loop 9

stall 10

Scheduled pipelined execution:

Loop: LD F0, 0(R1) 1

SUBI R1, R1, #8 2

ADDD F4, F0, F2 3

stall 4

BNEZ R1, Loop 5

SD 8(R1), F4 6
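The cycle counts in the two listings above can be reproduced with a small issue-cycle model. This is a minimal sketch, not a pipeline simulator: the `LATENCY` table, the instruction tuples, and the function name are assumptions chosen to match the stall pattern shown in the slides (one stall between a load and a dependent FP add, two between an FP add and a dependent store, one between an integer op and a dependent branch, plus one branch delay slot).

```python
# Extra stall cycles between a producer kind and a consumer kind.
# Assumed values, picked to reproduce the stalls shown in the slides.
LATENCY = {
    ("load", "fp"): 1,
    ("fp", "store"): 2,
    ("int", "branch"): 1,
}


def issue_cycles(program):
    """Return the issue cycle of each (kind, dest, sources) instruction.

    Single-issue, in-order: an instruction issues one cycle after the
    previous one, or later if a source register is not ready yet.
    """
    last_writer = {}  # register -> (issue cycle, producer kind)
    cycles = []
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1
        for reg in sources:
            if reg in last_writer:
                when, producer = last_writer[reg]
                wait = LATENCY.get((producer, kind), 0)
                earliest = max(earliest, when + 1 + wait)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycles


# One iteration of the unscheduled loop; the trailing nop fills the delay slot.
UNSCHEDULED = [
    ("load", "F0", ("R1",)),         # LD   F0, 0(R1)
    ("fp", "F4", ("F0", "F2")),      # ADDD F4, F0, F2
    ("store", None, ("R1", "F4")),   # SD   0(R1), F4
    ("int", "R1", ("R1",)),          # SUBI R1, R1, #8
    ("branch", None, ("R1",)),       # BNEZ R1, Loop
    ("nop", None, ()),               # empty branch delay slot
]

# The scheduled loop: SUBI hoisted, SD moved into the delay slot.
SCHEDULED = [
    ("load", "F0", ("R1",)),         # LD   F0, 0(R1)
    ("int", "R1", ("R1",)),          # SUBI R1, R1, #8
    ("fp", "F4", ("F0", "F2")),      # ADDD F4, F0, F2
    ("branch", None, ("R1",)),       # BNEZ R1, Loop
    ("store", None, ("R1", "F4")),   # SD   8(R1), F4 (delay slot)
]

if __name__ == "__main__":
    print(issue_cycles(UNSCHEDULED)[-1], issue_cycles(SCHEDULED)[-1])
```

Under these assumptions the last instruction issues in cycle 10 for the unscheduled loop and cycle 6 for the scheduled one. The model places the remaining stall after the branch rather than before it, so only the totals, not every individual cycle number, line up with the scheduled listing.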

Loop Unrolling

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BEQZ R1, Exit

LD F6, 0(R1)

ADDD F8, F6, F2

SD 0(R1), F8

SUBI R1, R1, #8

BEQZ R1, Exit

LD F10, 0(R1)

ADDD F12, F10, F2

SD 0(R1), F12

SUBI R1, R1, #8

BEQZ R1, Exit

LD F14, 0(R1)

ADDD F16, F14, F2

SD 0(R1), F16

SUBI R1, R1, #8

BNEZ R1, Loop

Exit:

Pros:

Larger basic block

More scope for scheduling

and eliminating dependencies

Cons:

Increases code size

Comment:

Often a precursor step for

other optimizations
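At source level, unrolling by four looks roughly like the sketch below. It is an illustration only: the function names are hypothetical, the array is 0-based rather than the 1-based `x[1..1000]` of the slides, and it assumes the element count is a multiple of four (as 1000 is), so no clean-up loop is needed.

```python
def add_scalar_rolled(x, s):
    """Original loop: one element, one test, one decrement per iteration."""
    out = list(x)
    i = len(out) - 1
    while i >= 0:
        out[i] = out[i] + s
        i -= 1
    return out


def add_scalar_unrolled4(x, s):
    """Loop unrolled four times: one test and one decrement per four elements."""
    if len(x) % 4 != 0:
        raise ValueError("this sketch assumes a length that is a multiple of 4")
    out = list(x)
    i = len(out) - 1
    while i >= 0:
        # Four independent copies of the body form one larger basic block.
        out[i] = out[i] + s
        out[i - 1] = out[i - 1] + s
        out[i - 2] = out[i - 2] + s
        out[i - 3] = out[i - 3] + s
        i -= 4
    return out
```

The unrolled version executes a quarter of the loop tests and index updates, at the cost of a body four times as long.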

Loop Transformations
• Instruction independence is the key requirement for the transformations
• Example
• Determine that it is legal to move SD after SUBI and BNEZ
• Determine that unrolling is useful (iterations are independent)
• Use different registers to avoid unnecessary constraints
• Eliminate extra tests and branches
• Determine that LD and SD can be interchanged
• Schedule the code, preserving the semantics of the code
1. Eliminating Name Dependences

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F0, -8(R1)

ADDD F4, F0, F2

SD -8(R1), F4

LD F0, -16(R1)

ADDD F4, F0, F2

SD -16(R1), F4

LD F0, -24(R1)

ADDD F4, F0, F2

SD -24(R1), F4

SUBI R1, R1, #32

BNEZ R1, Loop

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F6, -8(R1)

ADDD F8, F6, F2

SD -8(R1), F8

LD F10, -16(R1)

ADDD F12, F10, F2

SD -16(R1), F12

LD F14, -24(R1)

ADDD F16, F14, F2

SD -24(R1), F16

SUBI R1, R1, #32

BNEZ R1, Loop

Register Renaming
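The renaming step between the two listings above can be sketched as follows. The tuple encoding, the function name, and the pool of free registers are assumptions made for illustration; memory offsets are left out because renaming does not touch them.

```python
def rename_copies(body, copies, free_regs):
    """Replicate `body` and give each copy after the first its own registers.

    body: list of (op, dest, sources); dest is None for stores.
    Only registers written inside the body are renamed, so read-only
    registers such as F2 (the scalar s) and R1 keep their names.
    """
    free = iter(free_regs)
    written = [dest for _, dest, _ in body if dest is not None]
    result = []
    for k in range(copies):
        # The first copy keeps the original names.
        mapping = {} if k == 0 else {reg: next(free) for reg in written}
        for op, dest, sources in body:
            new_dest = mapping.get(dest, dest)
            new_sources = tuple(mapping.get(r, r) for r in sources)
            result.append((op, new_dest, new_sources))
    return result


BODY = [
    ("LD", "F0", ("R1",)),
    ("ADDD", "F4", ("F0", "F2")),
    ("SD", None, ("R1", "F4")),
]

RENAMED = rename_copies(BODY, 4, ["F6", "F8", "F10", "F12", "F14", "F16"])
```

With this pool the four copies use F0/F4, F6/F8, F10/F12 and F14/F16, the same pairs as the renamed listing, so the only dependences left between copies are true data dependences.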

2. Eliminating Control Dependences

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BEQZ R1, Exit

LD F6, 0(R1)

ADDD F8, F6, F2

SD 0(R1), F8

SUBI R1, R1, #8

BEQZ R1, Exit

LD F10, 0(R1)

ADDD F12, F10, F2

SD 0(R1), F12

SUBI R1, R1, #8

BEQZ R1, Exit

LD F14, 0(R1)

ADDD F16, F14, F2

SD 0(R1), F16

SUBI R1, R1, #8

BNEZ R1, Loop

Exit:

The intermediate BEQZ branches are never taken, because the iteration count (1000) is a multiple of 4.

Eliminate them!

3. Eliminating Data Dependences

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

LD F6, 0(R1)

ADDD F8, F6, F2

SD 0(R1), F8

SUBI R1, R1, #8

LD F10, 0(R1)

ADDD F12, F10, F2

SD 0(R1), F12

SUBI R1, R1, #8

LD F14, 0(R1)

ADDD F16, F14, F2

SD 0(R1), F16

SUBI R1, R1, #8

BNEZ R1, Loop

• Data dependences among SUBI, LD, and SD
• Force sequential execution of iterations
• Compiler removes this dependence by:
• Computing intermediate R1 values
• Eliminating intermediate SUBI
• Changing final SUBI
• Data flow analysis
• Can do on registers
• Cannot do easily on memory locations
• e.g., do 100(R1) and 20(R2) refer to the same address?
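The three compiler steps listed above (compute the intermediate R1 values, drop the intermediate SUBI instructions, adjust the final SUBI) amount to folding the running decrement into each memory offset. A minimal sketch, assuming a hypothetical tuple encoding in which loads and stores are `(op, register, offset)` and the loop branch is left out:

```python
def fold_subi(instrs):
    """Fold intermediate SUBI R1 decrements into LD/SD offsets.

    Memory operations are (op, reg, offset); a decrement is ("SUBI", amount);
    anything else is passed through unchanged. A single SUBI carrying the
    total decrement is appended at the end.
    """
    folded = []
    decrement = 0  # how far R1 would have moved at this point
    for ins in instrs:
        if ins[0] == "SUBI":
            decrement += ins[1]
        elif ins[0] in ("LD", "SD"):
            op, reg, offset = ins
            folded.append((op, reg, offset - decrement))
        else:
            folded.append(ins)
    folded.append(("SUBI", decrement))
    return folded


# The unrolled body with one SUBI after every copy, as in the listing above.
PAIRS = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
BODY = []
for load_reg, sum_reg in PAIRS:
    BODY += [
        ("LD", load_reg, 0),
        ("ADDD", sum_reg, load_reg, "F2"),
        ("SD", sum_reg, 0),
        ("SUBI", 8),
    ]

FOLDED = fold_subi(BODY)
```

The result uses offsets 0, -8, -16 and -24 and ends in a single `SUBI` by 32. This works because R1 is a register whose updates are all visible in the loop body; the same reasoning is much harder for values held in memory.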
4. Alleviating Data Dependencies

Unrolled loop:

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F6, -8(R1)

ADDD F8, F6, F2

SD -8(R1), F8

LD F10, -16(R1)

ADDD F12, F10, F2

SD -16(R1), F12

LD F14, -24(R1)

ADDD F16, F14, F2

SD -24(R1), F16

SUBI R1, R1, #32

BNEZ R1, Loop

Scheduled Unrolled loop:

Loop: LD F0, 0(R1)

LD F6, -8(R1)

LD F10, -16(R1)

LD F14, -24(R1)

ADDD F4, F0, F2

ADDD F8, F6, F2

ADDD F12, F10, F2

ADDD F16, F14, F2

SD 0(R1), F4

SD -8(R1), F8

SUBI R1, R1, #32

SD 16(R1), F12

BNEZ R1, Loop

SD 8(R1), F16

• Dependences are a property of programs
• Actual hazards are a property of the pipeline
• Techniques to avoid dependence limitations
• Maintain dependences but avoid hazards
• Code scheduling
• hardware
• software
• Eliminate dependences by code transformations
• Complex
• Compiler-based
Loop-level Parallelism
• Primary focus of dependence analysis
• Determine all dependences and find cycles

for (i=1; i<=100; i=i+1) {

x[i] = y[i] + z[i];

w[i] = x[i] + v[i];

}

for (i=1; i<=100; i=i+1) {

x[i+1] = x[i] + z[i];

}

x[1] = x[1] + y[1];

for (i=1; i<=99; i=i+1) {

y[i+1] = w[i] + z[i];

x[i+1] = x[i +1] + y[i +1];

}

y[101] = w[100] + z[100];

for (i=1; i<=100; i=i+1) {

x[i] = x[i] + y[i];

y[i+1] = w[i] + z[i];

}
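The last two loops above are the same computation: the second is the original, in which `x[i]` depends on the `y[i]` written by the previous iteration, and the first is the transformed version, in which that dependence stays inside one iteration. A small sketch can check the equivalence on sample data. The function names and the test data are assumptions; index 0 is left unused to keep the 1-based indexing of the slides.

```python
def original(x, y, w, z, n=100):
    """x[i] uses y[i], which iteration i-1 wrote: a loop-carried dependence."""
    x, y = list(x), list(y)
    for i in range(1, n + 1):
        x[i] = x[i] + y[i]
        y[i + 1] = w[i] + z[i]
    return x, y


def transformed(x, y, w, z, n=100):
    """Same result, but y[i+1] is produced and consumed in one iteration."""
    x, y = list(x), list(y)
    x[1] = x[1] + y[1]              # peeled first statement
    for i in range(1, n):
        y[i + 1] = w[i] + z[i]
        x[i + 1] = x[i + 1] + y[i + 1]
    y[n + 1] = w[n] + z[n]          # peeled last statement
    return x, y


SIZE = 102  # indices 0..101; index 0 unused
X = [k for k in range(SIZE)]
Y = [2 * k for k in range(SIZE)]
W = [3 * k + 1 for k in range(SIZE)]
Z = [k * k for k in range(SIZE)]
```

After the transformation no iteration reads a value written by another iteration, so the iterations can be overlapped or run in parallel.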

Dependence Analysis Algorithms
• Assume array indexes are affine (ai + b)
• GCD test:

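For two affine array indexes a*i + b (written) and c*i + d (read):

if a loop-carried dependence exists, then GCD(c, a) must

divide (d - b)

x[8*i] = x[4*i + 2] + 3

Here a = 8, b = 0, c = 4, d = 2: GCD(4, 8) = 4 does not divide (2 - 0) = 2, so no dependence is possible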

• In general, determining whether a dependence exists is NP-complete
• a, b, c, and d may not be known at compile time
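The GCD test is short enough to write out. The sketch below assumes all four coefficients are known integers; the function name is hypothetical. The test is conservative: a `True` result means only that a dependence cannot be ruled out, since loop bounds are ignored.

```python
from math import gcd


def gcd_test_may_depend(a, b, c, d):
    """GCD test for a write to x[a*i + b] and a read of x[c*i + d].

    Returns False when a loop-carried dependence is impossible,
    True when the test cannot rule one out.
    """
    g = gcd(a, c)
    if g == 0:
        # Both indexes are constants: they conflict only if they are equal.
        return b == d
    return (d - b) % g == 0


# The slide's example: x[8*i] = x[4*i + 2] + 3
print(gcd_test_may_depend(8, 0, 4, 2))
```

For the slide's example GCD(8, 4) = 4 does not divide 2, so the function returns `False`.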
Software Pipelining

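[Figure: Iteration 0 through Iteration 3 overlapped in time. A software-pipelined iteration takes instructions from several original iterations; start-up code precedes the steady state and finish-up code follows it.]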

Example

Iteration i:

LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

Iteration i+1:

LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

Iteration i+2:

LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BNEZ R1, Loop

Loop: SD 16(R1), F4

ADDD F4, F0, F2

LD F0, 0(R1)

SUBI R1, R1, #8

BNEZ R1, Loop
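The software-pipelined loop above stores the result for one element, adds for the next, and loads for the one after that, so it needs start-up and finish-up code around the steady state. A source-level sketch of the same idea follows. The function name and the variables `f0` and `f4` (standing in for registers F0 and F4) are assumptions; the array is 0-based and processed from the top down.

```python
def add_scalar_software_pipelined(x, s):
    """Add s to every element, with store/add/load taken from three iterations."""
    out = list(x)
    n = len(out)
    if n < 3:
        # Too short to fill the pipeline; fall back to the plain loop.
        return [v + s for v in out]

    # Start-up: run the pipeline until the first store is ready.
    f0 = out[n - 1]          # LD   for element n-1
    f4 = f0 + s              # ADDD for element n-1
    f0 = out[n - 2]          # LD   for element n-2

    # Steady state: each pass touches three different elements.
    for i in range(n - 1, 1, -1):
        out[i] = f4          # SD   for element i
        f4 = f0 + s          # ADDD for element i-1
        f0 = out[i - 2]      # LD   for element i-2

    # Finish-up: drain the two elements still in flight.
    out[1] = f4              # SD   for element 1
    out[0] = f0 + s          # ADDD and SD for element 0
    return out
```

Within one pass of the steady-state loop the store, add, and load belong to different elements, so none of them waits on a result produced in the same pass.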

Trace (global-code) Scheduling
• Find ILP across conditional branches
• Two-step process
• Trace selection
• Find a trace (sequence of basic blocks)
• Use loop unrolling to generate long traces
• Use static branch prediction for other conditional branches
• Trace compaction
• Squeeze the trace into a small number of wide instructions
• Preserve data and control dependences
Trace Selection

A[I] = A[I] + B[I]

LW R4, 0(R1)

LW R5, 0(R2)

ADD R4, R4, R5

SW 0(R1), R4

BNEZ R4, else

. . . .

SW 0(R2), . . .

J join

Else: . . . .

X

Join: . . . .

SW 0(R3), . . .

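[Figure: flowchart of the code above. After A[I] = A[I] + B[I], the test "A[I] = 0?" branches: the true (T) path assigns B[I], the false (F) path executes X, and both paths join at the assignment to C[I].]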

Summary of Compiler Techniques
• Try to avoid dependence stalls
• Loop unrolling
• Software pipelining
• Reduce single body dependence stalls
• Trace scheduling
• Reduce impact of other branches
• Compilers use a mix of three
• All techniques depend on prediction accuracy
Food for thought: Analyze this
• Analyze this for different values of X and Y
• To evaluate different branch prediction schemes
• For compiler scheduling purposes
• add r1, r0, 1000 #  all numbers in decimal