Compiler techniques for exposing ILP

PowerPoint Slideshow about 'Compiler techniques for exposing ILP' - darice


Presentation Transcript
Instruction Level Parallelism
  • Potential overlap among instructions
  • Few possibilities in a basic block
    • Blocks are small (6-7 instructions)
    • Instructions are dependent
  • Goal: Exploit ILP across multiple basic blocks
    • Iterations of a loop

for (i = 1000; i > 0; i = i-1)
    x[i] = x[i] + s;

Basic Scheduling

Sequential MIPS Assembly Code

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BNEZ R1, Loop

for (i = 1000; i > 0; i = i-1)
    x[i] = x[i] + s;

Pipelined execution (issue cycle on the right):

Loop: LD   F0, 0(R1)       1
      stall                2
      ADDD F4, F0, F2      3
      stall                4
      stall                5
      SD   0(R1), F4       6
      SUBI R1, R1, #8      7
      stall                8
      BNEZ R1, Loop        9
      stall               10

Scheduled pipelined execution (issue cycle on the right):

Loop: LD   F0, 0(R1)       1
      SUBI R1, R1, #8      2
      ADDD F4, F0, F2      3
      stall                4
      BNEZ R1, Loop        5
      SD   8(R1), F4       6
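The two cycle counts above (10 and 6) can be reproduced with a small model. The following is a minimal sketch, not a real pipeline simulator: it assumes an in-order, single-issue pipeline with one branch delay slot, and the stall rules in `stalls_between` are read off the unscheduled listing rather than stated by the slides. All type, function, and register-number names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes that matter for the stall rules below. */
enum Kind { LOAD, FPADD, STORE, INTALU, BRANCH };

/* Register numbering is an assumption of this sketch. */
enum { NONE = -1, F0 = 0, F2 = 2, F4 = 4, R1 = 33 };

typedef struct {
    enum Kind kind;
    int dst;        /* register written, or NONE */
    int src1, src2; /* registers read, or NONE   */
} Instr;

/* Stall cycles between a producer and a dependent consumer,
   inferred from the unscheduled listing (assumed values). */
static int stalls_between(enum Kind producer, enum Kind consumer)
{
    if (producer == LOAD   && consumer == FPADD)  return 1;
    if (producer == FPADD  && consumer == STORE)  return 2;
    if (producer == INTALU && consumer == BRANCH) return 1;
    return 0;
}

/* Cycles for one pass through the loop body. */
static int loop_cycles(const Instr *prog, size_t n)
{
    int issue[32];
    assert(n > 0 && n <= 32);
    for (size_t i = 0; i < n; i++) {
        int cycle = (i == 0) ? 1 : issue[i - 1] + 1;
        const int srcs[2] = { prog[i].src1, prog[i].src2 };
        for (int s = 0; s < 2; s++) {
            if (srcs[s] == NONE) continue;
            /* Find the most recent earlier writer of this source. */
            for (size_t j = i; j-- > 0; ) {
                if (prog[j].dst == srcs[s]) {
                    int ready = issue[j] + 1 +
                        stalls_between(prog[j].kind, prog[i].kind);
                    if (ready > cycle) cycle = ready;
                    break;
                }
            }
        }
        issue[i] = cycle;
    }
    /* A branch at the very end leaves its delay slot empty. */
    return issue[n - 1] + (prog[n - 1].kind == BRANCH ? 1 : 0);
}

static const Instr unscheduled[] = {
    { LOAD,   F0,   R1, NONE },  /* LD   F0, 0(R1)  */
    { FPADD,  F4,   F0, F2   },  /* ADDD F4, F0, F2 */
    { STORE,  NONE, R1, F4   },  /* SD   0(R1), F4  */
    { INTALU, R1,   R1, NONE },  /* SUBI R1, R1, #8 */
    { BRANCH, NONE, R1, NONE },  /* BNEZ R1, Loop   */
};

static const Instr scheduled[] = {
    { LOAD,   F0,   R1, NONE },  /* LD   F0, 0(R1)  */
    { INTALU, R1,   R1, NONE },  /* SUBI R1, R1, #8 */
    { FPADD,  F4,   F0, F2   },  /* ADDD F4, F0, F2 */
    { BRANCH, NONE, R1, NONE },  /* BNEZ R1, Loop   */
    { STORE,  NONE, R1, F4   },  /* SD   8(R1), F4 (delay slot) */
};

int cycles_unscheduled(void)
{
    return loop_cycles(unscheduled, sizeof unscheduled / sizeof unscheduled[0]);
}

int cycles_scheduled(void)
{
    return loop_cycles(scheduled, sizeof scheduled / sizeof scheduled[0]);
}
```

Under these assumptions `cycles_unscheduled()` gives 10 and `cycles_scheduled()` gives 6, matching the listings. The model only reports totals: it places the remaining stall of the scheduled loop just before the SD, whereas the slide shows it before the BNEZ so that the SD sits directly in the delay slot.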

Loop Unrolling

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BEQZ R1, Exit

LD F6, 0(R1)

ADDD F8, F6, F2

SD 0(R1), F8

SUBI R1, R1, #8

BEQZ R1, Exit

LD F10, 0(R1)

ADDD F12, F10, F2

SD 0(R1), F12

SUBI R1, R1, #8

BEQZ R1, Exit

LD F14, 0(R1)

ADDD F16, F14, F2

SD 0(R1), F16

SUBI R1, R1, #8

BNEZ R1, Loop

Exit:

Pros:
  • Larger basic block
  • More scope for scheduling and eliminating dependencies

Cons:
  • Increases code size

Comment:
  • Often a precursor step for other optimizations
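At the source level, the unrolled listing above corresponds roughly to the following C. This is only a sketch: the function names are invented here, and the array is treated as 1-based (x[1..n]) to match the loop on the slides. The exit test is kept after every copy, as in the assembly.

```c
#include <assert.h>

/* Rolled loop from the slides: x[1..n] += s, counting down. */
void add_scalar(double *x, int n, double s)
{
    for (int i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Unrolled four times; each copy still decrements and tests i,
   mirroring the SUBI / BEQZ pairs in the listing. */
void add_scalar_unrolled(double *x, int n, double s)
{
    assert(n >= 0);
    int i = n;
    if (i == 0)
        return;   /* guard added by this sketch; the slides assume n > 0 */
    for (;;) {
        x[i] = x[i] + s;  i = i - 1;  if (i == 0) break;
        x[i] = x[i] + s;  i = i - 1;  if (i == 0) break;
        x[i] = x[i] + s;  i = i - 1;  if (i == 0) break;
        x[i] = x[i] + s;  i = i - 1;  if (i == 0) break;
    }
}
```

Because every copy keeps its own test, this form is correct for any n, but it has not yet removed any loop overhead. It only enlarges the loop body so that later steps have more to work with.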

Loop Transformations
  • Instruction independence is the key requirement for these transformations
  • Example
    • Determine that it is legal to move the SD after the SUBI and BNEZ (adjusting the SD offset)
    • Determine that unrolling is useful (iterations are independent)
    • Use different registers to avoid unnecessary constraints
    • Eliminate extra tests and branches
    • Determine that LD and SD can be interchanged
    • Schedule the code, preserving the semantics of the code
1. Eliminating Name Dependences

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F0, -8(R1)

ADDD F4, F0, F2

SD -8(R1), F4

LD F0, -16(R1)

ADDD F4, F0, F2

SD -16(R1), F4

LD F0, -24(R1)

ADDD F4, F0, F2

SD -24(R1), F4

SUBI R1, R1, #32

BNEZ R1, Loop

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F6, -8(R1)

ADDD F8, F6, F2

SD -8(R1), F8

LD F10, -16(R1)

ADDD F12, F10, F2

SD -16(R1), F12

LD F14, -24(R1)

ADDD F16, F14, F2

SD -24(R1), F16

SUBI R1, R1, #32

BNEZ R1, Loop

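Register renaming: the second listing gives each unrolled copy its own registers (F6/F8, F10/F12, F14/F16), which removes the name dependences on F0 and F4 present in the first listing.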

2. Eliminating Control Dependences

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BEQZ R1, Exit

LD F6, 0(R1)

ADDD F8, F6, F2

SD 0(R1), F8

SUBI R1, R1, #8

BEQZ R1, Exit

LD F10, 0(R1)

ADDD F12, F10, F2

SD 0(R1), F12

SUBI R1, R1, #8

BEQZ R1, Exit

LD F14, 0(R1)

ADDD F16, F14, F2

SD 0(R1), F16

SUBI R1, R1, #8

BNEZ R1, Loop

Exit:

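The intermediate BEQZ branches are never taken when the iteration count is a multiple of 4 (as with the 1000 iterations here), so they can be eliminated.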
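In C terms, dropping the intermediate tests looks like the sketch below. The cleanup loop is an addition of this sketch, not something shown on the slides: it covers iteration counts that are not a multiple of 4, where the intermediate tests could otherwise not be removed. The function name and the 1-based array layout are assumptions.

```c
#include <assert.h>

/* x[1..n] += s with the body unrolled four times and only one
   exit test per four elements. */
void add_scalar_unrolled4(double *x, int n, double s)
{
    assert(n >= 0);
    int i = n;

    /* Cleanup loop (assumption of this sketch): peel off n % 4
       elements so that the remaining count is a multiple of 4. */
    while (i % 4 != 0) {
        x[i] = x[i] + s;
        i = i - 1;
    }

    /* Main loop: i is a multiple of 4 here, so the three
       intermediate tests would never fire and are omitted. */
    while (i > 0) {
        x[i] = x[i] + s;  i = i - 1;
        x[i] = x[i] + s;  i = i - 1;
        x[i] = x[i] + s;  i = i - 1;
        x[i] = x[i] + s;  i = i - 1;
    }
}
```

The four decrements of i inside the main loop still form a chain, which is the data dependence the next step removes.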

3. Eliminating Data Dependences

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

LD F6, 0(R1)

ADDD F8, F6, F2

SD 0(R1), F8

SUBI R1, R1, #8

LD F10, 0(R1)

ADDD F12, F10, F2

SD 0(R1), F12

SUBI R1, R1, #8

LD F14, 0(R1)

ADDD F16, F14, F2

SD 0(R1), F16

SUBI R1, R1, #8

BNEZ R1, Loop

  • Data dependences: each SUBI feeds the following LD and SD
    • Force sequential execution of iterations
  • Compiler removes this dependence by:
    • Computing intermediate R1 values (folding them into the LD/SD offsets)
    • Eliminating intermediate SUBI
    • Changing final SUBI
  • Data flow analysis
    • Can do on registers
    • Cannot do easily on memory locations
      • e.g., do 100(R1) and 20(R2) refer to the same address?
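The same folding can be written in C: the intermediate decrements become constant index offsets and only one decrement per pass remains. This is a sketch with an invented function name; it assumes a 1-based array and an iteration count that is a multiple of 4, as on the slides. Offsets are in elements here, whereas the assembly uses byte offsets (0, -8, -16, -24 and a final SUBI of 32).

```c
#include <assert.h>

/* x[1..n] += s, four elements per pass, one index update per pass. */
void add_scalar_folded(double *x, int n, double s)
{
    assert(n >= 0 && n % 4 == 0);
    for (int i = n; i > 0; i = i - 4) {
        /* The four statements no longer depend on each other
           through i, so they can be reordered freely. */
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```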
4. Alleviating Data Dependencies

Unrolled loop:

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

LD F6, -8(R1)

ADDD F8, F6, F2

SD -8(R1), F8

LD F10, -16(R1)

ADDD F12, F10, F2

SD -16(R1), F12

LD F14, -24(R1)

ADDD F16, F14, F2

SD -24(R1), F16

SUBI R1, R1, #32

BNEZ R1, Loop

Scheduled Unrolled loop:

Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, #32
      SD   16(R1), F12
      BNEZ R1, Loop
      SD   8(R1), F16
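A quick way to compare the schedules is cycles per array element. The sketch below is plain arithmetic with invented helper names. It assumes the scheduled unrolled loop issues its 14 instructions without stalls, which holds under the latencies implied by the Basic Scheduling listing, and it counts LD, ADDD, and SD as the three useful instructions per element.

```c
#include <assert.h>

/* Average cycles spent per array element for one pass of a loop body. */
double cycles_per_element(int cycles_per_pass, int elements_per_pass)
{
    assert(cycles_per_pass > 0 && elements_per_pass > 0);
    return (double)cycles_per_pass / elements_per_pass;
}

/* Cycles per pass not spent on the LD, ADDD, SD of an element:
   loop bookkeeping (SUBI, BNEZ) plus any stalls. */
int overhead_cycles(int cycles_per_pass, int elements_per_pass)
{
    assert(cycles_per_pass >= 3 * elements_per_pass);
    return cycles_per_pass - 3 * elements_per_pass;
}
```

With these helpers, the original loop costs `cycles_per_element(10, 1)` = 10, the scheduled loop `cycles_per_element(6, 1)` = 6, and the scheduled unrolled loop `cycles_per_element(14, 4)` = 3.5 cycles per element, with overhead dropping from 3 cycles per element to 2 cycles per four elements.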

Some General Comments
  • Dependences are a property of programs
  • Actual hazards are a property of the pipeline
  • Techniques to avoid dependence limitations
    • Maintain dependences but avoid hazards
      • Code scheduling
        • hardware
        • software
    • Eliminate dependences by code transformations
      • Complex
      • Compiler-based
Loop-level Parallelism
  • Primary focus of dependence analysis
  • Determine all dependences and find cycles

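No loop-carried dependence (w[i] uses the x[i] computed in the same iteration):

for (i=1; i<=100; i=i+1) {
    x[i] = y[i] + z[i];
    w[i] = x[i] + v[i];
}

Loop-carried dependence that forms a cycle (x[i+1] depends on x[i]):

for (i=1; i<=100; i=i+1) {
    x[i+1] = x[i] + z[i];
}

Loop-carried dependence without a cycle (x[i] uses the y[i] written by the previous iteration):

for (i=1; i<=100; i=i+1) {
    x[i] = x[i] + y[i];
    y[i+1] = w[i] + z[i];
}

Equivalent form with the loop-carried dependence removed:

x[1] = x[1] + y[1];
for (i=1; i<=99; i=i+1) {
    y[i+1] = w[i] + z[i];
    x[i+1] = x[i+1] + y[i+1];
}
y[101] = w[100] + z[100];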
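The loop with the statements x[i] = x[i] + y[i] and y[i+1] = w[i] + z[i] and its rewritten form can be checked against each other directly. The sketch below wraps both in C functions (the names and the constant N are invented here) so they can be run on the same input. Arrays need at least N + 2 elements because indexes run from 1 to N + 1.

```c
#include <assert.h>

enum { N = 100 };

/* Original: x[i] reads the y[i] written by the previous iteration. */
void loop_original(double *x, double *y, const double *w, const double *z)
{
    for (int i = 1; i <= N; i = i + 1) {
        x[i] = x[i] + y[i];
        y[i + 1] = w[i] + z[i];
    }
}

/* Transformed: the first x update and the last y update are peeled,
   so each iteration only uses the y value it computes itself. */
void loop_transformed(double *x, double *y, const double *w, const double *z)
{
    assert(N >= 1);
    x[1] = x[1] + y[1];
    for (int i = 1; i <= N - 1; i = i + 1) {
        y[i + 1] = w[i] + z[i];
        x[i + 1] = x[i + 1] + y[i + 1];
    }
    y[N + 1] = w[N] + z[N];
}
```

Both functions perform the same additions on the same operands, so their results should match element for element. In the transformed version the iterations no longer depend on each other, which is what allows them to be overlapped.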

Dependence Analysis Algorithms
  • Assume array indexes are affine (ai + b)
    • GCD test:

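For two affine array indexes a*i + b and c*i + d:
if a loop-carried dependence exists, then GCD(c, a) must divide (d - b).

Example: x[8*i] = x[4*i + 2] + 3

Here a = 8, b = 0, c = 4, d = 2, so (d - b) / GCD(c, a) = (2 - 0) / GCD(4, 8) = 2 / 4. GCD(c, a) does not divide (d - b), so no dependence exists.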

  • Determining exactly whether a dependence exists is NP-complete in general
  • a, b, c, and d may not be known at compile time
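The GCD test is easy to state in code. The following is a minimal sketch with invented function names. It assumes the coefficients are known integers and that a and c are not both zero. Note that the test is conservative: a result of 1 only means a dependence cannot be ruled out, since loop bounds are ignored.

```c
#include <assert.h>
#include <stdlib.h>  /* abs */

/* Greatest common divisor by Euclid's algorithm. */
static int gcd(int m, int n)
{
    m = abs(m);
    n = abs(n);
    while (n != 0) {
        int r = m % n;
        m = n;
        n = r;
    }
    return m;
}

/* For a store to x[a*i + b] and a load from x[c*i + d]:
   returns 0 if the GCD test proves there is no dependence,
   1 if a dependence is possible. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(c, a);
    assert(g != 0);  /* a and c must not both be zero */
    return (d - b) % g == 0;
}
```

For the example above, `gcd_test_may_depend(8, 0, 4, 2)` returns 0 because GCD(4, 8) = 4 does not divide 2.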
Software Pipelining

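[Figure: iterations 0 to 3 of the original loop overlapped in time. A software-pipelined iteration is built from instructions of different original iterations, with start-up code before it and finish-up code after it.]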

Example

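Iteration i:
      LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4

Iteration i+1:
      LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4

Iteration i+2:
      LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4

The software-pipelined loop body takes the SD from iteration i, the ADDD from iteration i+1, and the LD from iteration i+2.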

Loop: LD F0, 0(R1)

ADDD F4, F0, F2

SD 0(R1), F4

SUBI R1, R1, #8

BNEZ R1, Loop

Software-pipelined loop (steady state only; start-up and finish-up code not shown):

Loop: SD 16(R1), F4

ADDD F4, F0, F2

LD F0, 0(R1)

SUBI R1, R1, #8

BNEZ R1, Loop
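The software-pipelined loop above omits its start-up and finish-up code. The sketch below writes all three parts in C, with each steady-state pass storing for one iteration, adding for the next, and loading for the one after that. The function name, the variables `loaded` and `summed` (standing in for F0 and F4), the 1-based array, and the fallback for fewer than two elements are assumptions of this sketch.

```c
#include <assert.h>

/* x[1..n] += s, software-pipelined by hand. */
void add_scalar_swp(double *x, int n, double s)
{
    assert(n >= 0);
    if (n < 2) {                     /* too short to pipeline */
        for (int i = n; i > 0; i = i - 1)
            x[i] = x[i] + s;
        return;
    }

    /* Start-up: fill the pipeline. */
    double loaded = x[n];            /* LD   for iteration n     */
    double summed = loaded + s;      /* ADDD for iteration n     */
    loaded = x[n - 1];               /* LD   for iteration n - 1 */

    /* Steady state: one SD, one ADDD, one LD, each from a
       different original iteration. */
    for (int i = n; i > 2; i = i - 1) {
        x[i] = summed;               /* SD   for iteration i     */
        summed = loaded + s;         /* ADDD for iteration i - 1 */
        loaded = x[i - 2];           /* LD   for iteration i - 2 */
    }

    /* Finish-up: drain the pipeline. */
    x[2] = summed;                   /* SD   for iteration 2 */
    summed = loaded + s;             /* ADDD for iteration 1 */
    x[1] = summed;                   /* SD   for iteration 1 */
}
```

Inside the steady-state loop no statement needs a value produced in the same pass, which is what removes the dependence stalls within a single loop body.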

Trace (global-code) Scheduling
  • Find ILP across conditional branches
  • Two-step process
    • Trace selection
      • Find a trace (sequence of basic blocks)
      • Use loop unrolling to generate long traces
      • Use static branch prediction for other conditional branches
    • Trace compaction
      • Squeeze the trace into a small number of wide instructions
      • Preserve data and control dependences
Trace Selection

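[Flowchart: A[I] = A[I] + B[I], followed by the test "A[I] = 0?". The T (true) path assigns B[I], the F (false) path executes X, and both paths join at the assignment to C[I].]

      LW   R4, 0(R1)
      LW   R5, 0(R2)
      ADD  R4, R4, R5
      SW   0(R1), R4
      BNEZ R4, Else
      . . . .
      SW   0(R2), . . .
      J    Join
Else: . . . .
      X
Join: . . . .
      SW   0(R3), . . .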

Summary of Compiler Techniques
  • Try to avoid dependence stalls
  • Loop unrolling
    • Reduce loop overhead
  • Software pipelining
    • Reduce single body dependence stalls
  • Trace scheduling
    • Reduce impact of other branches
  • Compilers use a mix of all three
  • All techniques depend on prediction accuracy
Food for thought: Analyze this
  • Analyze this for different values of X and Y
    • To evaluate different branch prediction schemes
    • For compiler scheduling purposes
      add  r1, r0, 1000    # all numbers in decimal
      add  r2, r0, a       # base address of array a
loop: andi r10, r1, X
      beqz r10, even
      lw   r11, 0(r2)
      addi r11, r11, 1
      sw   0(r2), r11
even: addi r2, r2, 4
      subi r1, r1, Y
      bnez r1, loop
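One way to explore the exercise is to simulate only the control flow and count how often the inner branch is taken for a given X and Y. The sketch below does that under stated assumptions: 32-bit registers that wrap around, r0 equal to zero, and an iteration cap (a parameter invented here) because the loop never reaches r1 = 0 within a reasonable time when Y does not divide 1000. The memory accesses are left out since they do not affect the branches.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    long iterations;  /* times the loop body ran                */
    long beqz_taken;  /* times `beqz r10, even` was taken       */
    int  terminated;  /* 0 if the iteration cap was hit first   */
} BranchStats;

/* Simulate the branch behavior of the loop for given X and Y. */
BranchStats simulate(uint32_t X, uint32_t Y, long max_iterations)
{
    BranchStats st = { 0, 0, 0 };
    uint32_t r1 = 1000;              /* add r1, r0, 1000 */

    assert(max_iterations > 0);
    while (st.iterations < max_iterations) {
        st.iterations++;
        if ((r1 & X) == 0)           /* andi r10, r1, X ; beqz r10, even */
            st.beqz_taken++;
        r1 -= Y;                     /* subi r1, r1, Y (wraps mod 2^32) */
        if (r1 == 0) {               /* bnez r1, loop falls through */
            st.terminated = 1;
            break;
        }
    }
    return st;
}
```

For instance, `simulate(1, 1, 10000)` runs 1000 iterations with the `beqz` taken on the 500 even values of r1, while `simulate(1, 2, 10000)` runs 500 iterations with the `beqz` taken every time. Patterns like these are what distinguish one branch prediction scheme from another.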