CS 378 Programming for Performance
Single-Thread Performance: Compiler Scheduling for Pipelines

Siddhartha Chatterjee

Spring 2008

Review of Pipelining
  • Pipelining improves the throughput of an instruction sequence; it does not reduce the latency of an individual instruction
  • Speedup due to pipelining limited by hazards
    • Structural hazards lead to contention for limited resources
    • Data hazards require stalling or forwarding to maintain sequential semantics
    • Control hazards require cancellation of instructions (to maintain sequential branch semantics) or delayed branches (to define a new branch semantics)
  • Hazard
    • Detection: interlocks in hardware
    • Elimination: renaming, branch elimination
    • Resolution: stalling, forwarding, scheduling


CPI of a Pipelined Machine

Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls

Techniques for attacking each component:
  • Ideal pipeline CPI: issuing multiple instructions per cycle, compiler dependence analysis, software pipelining, trace scheduling
  • RAW stalls: basic pipeline scheduling, dynamic scheduling with scoreboarding, dynamic memory disambiguation (for stalls involving memory), compiler dependence analysis, software pipelining, trace scheduling, speculation
  • WAR and WAW stalls: dynamic scheduling with register renaming, compiler dependence analysis, software pipelining, trace scheduling, speculation
  • Control stalls: loop unrolling, dynamic branch prediction, speculation
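To make the equation concrete, here is a small worked example with hypothetical stall frequencies (the numbers are not from the original slides): suppose the ideal pipeline CPI is 1, 30% of instructions suffer a 1-cycle RAW stall, and 15% of instructions are branches that each cost a 1-cycle control stall. Then

Pipeline CPI = 1 + (0.30 × 1) + (0.15 × 1) = 1.45

so the pipeline runs at roughly 69% of its ideal instruction rate; each technique listed above targets one of the added terms.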

Instruction-Level Parallelism (ILP)
  • Pipelining is most effective when we have parallelism among instructions
    • Instructions u and v are parallel if neither P(u,v) nor P(v,u) holds, where P is the (transitive) dependence relation between instructions
  • Parallelism within a basic block is limited
    • A branch frequency of 15% implies about six instructions per basic block
    • These instructions are likely to depend on each other
    • Need to look beyond basic blocks
  • Loop-level parallelism
    • Parallelism among iterations of a loop
    • To convert loop-level parallelism into ILP, we need to “unroll” the loop
      • Statically, by the compiler
      • Dynamically, by the hardware
      • Using vector instructions


Motivating Example for Loop Unrolling

for (u = 0; u < 1000; u++)
    x[u] = x[u] + s;

Equivalently, running the loop backwards (as the generated code below does):

for (u = 999; u >= 0; u--)
    x[u] = x[u] + s;

LOOP: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, 8
      BNEZ  R1, LOOP
      NOP

  • Assumptions
    • Loop is run backwards (the index counts down)
    • Scalar s is in register pair F2:F3
    • Array x starts at memory address 0
    • 1-cycle branch delay
    • No structural hazards

10 cycles per iteration
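Where the 10 cycles come from, assuming the usual textbook latencies (one stall between a load and a dependent FP op, two stalls between an FP op and a dependent store, and one stall between SUBI and the dependent BNEZ; these match the stall counts quoted later on the Take 4 slide):

 1  LD    F0, 0(R1)
 2  (stall: ADDD waits for F0)
 3  ADDD  F4, F0, F2
 4  (stall: SD waits for F4)
 5  (stall: SD waits for F4)
 6  SD    0(R1), F4
 7  SUBI  R1, R1, 8
 8  (stall: BNEZ waits for R1)
 9  BNEZ  R1, LOOP
10  NOP   (branch delay slot)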


How Far Can We Get With Scheduling?

Before:

LOOP: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, 8
      BNEZ  R1, LOOP
      NOP

After scheduling:

LOOP: LD    F0, 0(R1)
      SUBI  R1, R1, 8
      ADDD  F4, F0, F2
      BNEZ  R1, LOOP
      SD    8(R1), F4

Equivalent source:

for (u = 999; u >= 0; ) {
    register double d = x[u];
    u--; d += s; x[u+1] = d;
}

6 cycles per iteration

Note the change in the SD instruction, from 0(R1) to 8(R1): because the SUBI now decrements R1 before the store executes (and the store fills the branch delay slot), the offset must compensate. This is a non-trivial change for the compiler to make.


Observations on Scheduled Code
  • 3 out of 5 instructions involve FP work
  • The other two constitute loop overhead
  • Could we improve loop performance by unrolling the loop?
  • Assume number of loop iterations is a multiple of 4, and unroll loop body four times
    • In real life, we would need to handle the fact that the loop trip count may not be a multiple of 4 (see the strip-mining sketch below)
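A minimal C sketch of the usual strip-mining fix, assuming an arbitrary trip count n; the function name and signature are illustrative rather than taken from the slides:

void add_scalar(double *x, double s, int n)
{
    int u = n - 1;

    /* Main loop, unrolled by 4: runs while at least 4 elements remain. */
    for (; u >= 3; u -= 4) {
        x[u]   += s;
        x[u-1] += s;
        x[u-2] += s;
        x[u-3] += s;
    }

    /* Remainder loop: handles the last n mod 4 elements. */
    for (; u >= 0; u--)
        x[u] += s;
}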


Unrolling: Take 1

LOOP: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, 8
      BNEZ  R1, LOOP
      LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, 8
      BNEZ  R1, LOOP
      LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, 8
      BNEZ  R1, LOOP
      LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, 8
      BNEZ  R1, LOOP

  • This is no different from the situation before unrolling
  • Branches induce control dependence
    • Can’t move instructions much during scheduling
  • However, the whole point of unrolling was to guarantee that the three internal branches will fall through
  • So, maybe we can delete the intermediate branches
  • There is an implicit NOP after the final branch


Unrolling: Take 2

LOOP: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, 8
      LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, 8
      LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, 8
      LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, 8
      BNEZ  R1, LOOP

  • Even though we have got rid of the control dependences, we have flow dependences through R1
  • We could remove flow dependences by observing that R1 is decremented by 8 each time
    • Adjust the address specifiers
    • Delete the first three SUBIs
    • Change the constant in the fourth SUBI to 32
  • These are non-trivial inferences for a compiler to make

for (u = 999; u >= 0; ) {
    register double d;
    d = x[u]; d += s; x[u] = d; u--;
    d = x[u]; d += s; x[u] = d; u--;
    d = x[u]; d += s; x[u] = d; u--;
    d = x[u]; d += s; x[u] = d; u--;
}


Unrolling: Take 3

LOOP: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F0, -8(R1)
      ADDD  F4, F0, F2
      SD    -8(R1), F4
      LD    F0, -16(R1)
      ADDD  F4, F0, F2
      SD    -16(R1), F4
      LD    F0, -24(R1)
      ADDD  F4, F0, F2
      SD    -24(R1), F4
      SUBI  R1, R1, 32
      BNEZ  R1, LOOP

  • Performance is now limited by the anti-dependences and output dependences on F0 and F4
  • These are name dependences
    • The instructions are not in a producer-consumer relation
    • They are simply using the same registers, but they don’t have to
    • We can use different registers in different loop iterations, subject to availability
  • Let’s rename registers

for (u = 999; u >= 0; u -= 4) {
    register double d;
    d = x[u];   d += s; x[u]   = d;
    d = x[u-1]; d += s; x[u-1] = d;
    d = x[u-2]; d += s; x[u-2] = d;
    d = x[u-3]; d += s; x[u-3] = d;
}


Unrolling: Take 4

LOOP: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, 32
      BNEZ  R1, LOOP

  • Time for execution of 4 iterations
    • 14 instruction cycles
    • 4 LD→ADDD stalls
    • 8 ADDD→SD stalls
    • 1 SUBI→BNEZ stall
    • 1 branch delay stall
  • 28 cycles for 4 iterations, or 7 cycles per iteration
  • Slower than scheduled version of original loop
  • Let’s schedule the unrolled loop

for (u = 999; u >= 0; u -= 4) {
    register double d0, d1, d2, d3;
    d0 = x[u];   d0 += s; x[u]   = d0;
    d1 = x[u-1]; d1 += s; x[u-1] = d1;
    d2 = x[u-2]; d2 += s; x[u-2] = d2;
    d3 = x[u-3]; d3 += s; x[u-3] = d3;
}


Unrolling: Take 5

LOOP: LD    F0, 0(R1)
      LD    F6, -8(R1)
      LD    F10, -16(R1)
      LD    F14, -24(R1)
      ADDD  F4, F0, F2
      ADDD  F8, F6, F2
      ADDD  F12, F10, F2
      ADDD  F16, F14, F2
      SD    0(R1), F4
      SD    -8(R1), F8
      SUBI  R1, R1, 32
      SD    16(R1), F12       ; was -16(R1) before the SUBI moved above it
      BNEZ  R1, LOOP
      SD    8(R1), F16        ; was -24(R1); fills the branch delay slot

  • This code runs without stalls
    • 14 cycles for 4 iterations
    • 3.5 cycles per iteration
    • Performance is limited by loop control overhead once every four iterations
  • Note that original loop had three FP instructions that were not independent
  • Loop unrolling exposed independent instructions from multiple loop iterations
  • By unrolling further, can approach asymptotic rate of 3 cycles per iteration
    • Subject to availability of registers

for (u = 999; u >= 0; ) {
    register double d0, d1, d2, d3;
    d0 = x[u]; d1 = x[u-1]; d2 = x[u-2]; d3 = x[u-3];
    d0 += s; d1 += s; d2 += s; d3 += s;
    x[u] = d0; x[u-1] = d1; u -= 4;
    x[u+2] = d2; x[u+1] = d3;
}


What Did The Compiler Have To Do?
  • Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset
  • Determine that loop unrolling would be useful by discovering independence of loop iterations
  • Rename registers to avoid name dependences
  • Eliminate extra tests and branches and adjust loop control
  • Determine that the LDs and SDs can be interchanged by observing that (since R1 is not updated in between) the address specifiers 0(R1), -8(R1), -16(R1), -24(R1) all refer to different memory locations
  • Schedule the code, preserving dependences
  • Resources consumed: Code space, architectural registers


Dependences
  • Three kinds of dependences
    • Data dependence
    • Name dependence
    • Control dependence
  • In the context of loop-level parallelism, a data dependence can be
    • Loop-independent (within a single iteration)
    • Loop-carried (from one iteration to a later one); see the C example after this list
  • Data dependences act as a limit on how much ILP can be exploited in a compiled program
  • Compiler tries to identify and eliminate dependences
  • Hardware tries to prevent dependences from becoming stalls
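A minimal C illustration of the two kinds of data dependence (the arrays and statements are illustrative, not from the slides):

for (u = 1; u < n; u++) {
    a[u] = b[u] + 1.0;      /* S1 */
    c[u] = a[u] * 2.0;      /* S2: loop-independent dependence on S1
                               (uses the a[u] produced in the same iteration) */
    d[u] = d[u-1] + c[u];   /* S3: loop-carried dependence on itself
                               (uses the d[u-1] produced by the previous iteration) */
}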


Data and Name Dependences
  • Instruction v is data-dependent on instruction u if
    • u produces a result that v consumes
  • Instruction v is anti-dependent on instruction u if
    • u precedes v
    • v writes a register or memory location that u reads
  • Instruction v is output-dependent on instruction u if
    • u precedes v
    • v writes a register or memory location that u writes
  • A data (true) dependence corresponds to a RAW hazard; unlike the name dependences, it cannot be removed by renaming
  • An anti-dependence corresponds to a WAR hazard
  • An output dependence corresponds to a WAW hazard (all three cases are illustrated in the sketch below)
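A small straight-line sketch, in the C-like style used elsewhere in these slides, of all three cases; the variable names are illustrative, and the renaming is done by hand:

r1 = r2 + r3;    /* S1 */
r4 = r1 * 2.0;   /* S2: data (true) dependence on S1 (reads r1)   -> RAW */
r2 = r5 - r6;    /* S3: anti-dependence on S1 (S1 reads r2)       -> WAR */
r4 = r7 + r8;    /* S4: output dependence on S2 (both write r4)   -> WAW */

/* Renaming the later writes removes the name dependences,
   but the true dependence S1 -> S2 remains: */
r1 = r2 + r3;    /* S1 */
r4 = r1 * 2.0;   /* S2 */
t2 = r5 - r6;    /* S3': writes a fresh name instead of r2 */
t4 = r7 + r8;    /* S4': writes a fresh name instead of r4 */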


Control Dependences
  • A control dependence determines the ordering of an instruction with respect to a branch instruction so that the non-branch instruction is executed only when it should be

if (p1) { s1; }
if (p2) { s2; }

Here s1 is control dependent on p1, and s2 is control dependent on p2 but not on p1.

  • Control dependence constrains code motion
    • An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
    • An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution becomes controlled by the branch (a small example of both constraints follows)
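A hypothetical C fragment (not from the slides) showing why the two constraints matter:

if (d != 0)       /* branch */
    q = n / d;    /* control dependent on the branch */
t = n + 1;        /* not control dependent on the branch */

/* Hoisting the division above the test would execute it even when d == 0,
   violating the first constraint; sinking the assignment to t into the
   guarded block would make it conditional, violating the second. */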


Data Dependence in Loop Iterations

Loop 1 (body of iteration u):

A[u+1] = A[u] + C[u];       /* S1 */
B[u+1] = B[u] + A[u+1];     /* S2 */

Two consecutive iterations execute as

A[u+1] = A[u]   + C[u];
B[u+1] = B[u]   + A[u+1];
A[u+2] = A[u+1] + C[u+1];
B[u+2] = B[u+1] + A[u+2];

S1 of one iteration feeds S1 of the next (through A), and S2 feeds S2 (through B): these loop-carried dependences force the iterations to execute in order.

Loop 2 (body of iteration u):

A[u]   = A[u] + B[u];       /* S1 */
B[u+1] = C[u] + D[u];       /* S2 */

Two consecutive iterations execute as

A[u]   = A[u]   + B[u];
B[u+1] = C[u]   + D[u];
A[u+1] = A[u+1] + B[u+1];
B[u+2] = C[u+1] + D[u+1];

The only loop-carried dependence is from S2 of one iteration to S1 of the next (through B); it is not circular, so the statements can be regrouped (next slide).


Loop Transformation
  • Sometimes a loop-carried dependence does not prevent loop parallelization
    • Example: the second loop of the previous slide
  • In other cases, a loop-carried dependence prohibits loop parallelization
    • Example: the first loop of the previous slide

For the second loop, the executed statement sequence is

A[u]   = A[u]+B[u];
B[u+1] = C[u]+D[u];
A[u+1] = A[u+1]+B[u+1];
B[u+2] = C[u+1]+D[u+1];
A[u+2] = A[u+2]+B[u+2];
B[u+3] = C[u+2]+D[u+2];

Regrouping it so that each pair

B[u+1] = C[u]+D[u];
A[u+1] = A[u+1]+B[u+1];

forms one iteration (peeling the first A statement and the last B statement out of the loop) turns the loop-carried dependence into a loop-independent one.
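A sketch in C of the transformed second loop, following the standard treatment of this example (the bound n is illustrative):

A[0] = A[0] + B[0];               /* peeled: first S1 */
for (u = 0; u < n - 1; u++) {
    B[u+1] = C[u] + D[u];         /* S2 of iteration u   */
    A[u+1] = A[u+1] + B[u+1];     /* S1 of iteration u+1 */
}
B[n] = C[n-1] + D[n-1];           /* peeled: last S2 */

/* The dependence from S2 to S1 is now within a single iteration
   (loop-independent), so the iterations of the new loop are independent. */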


Software Pipelining
  • Observation
    • If the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations
  • Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop

[Figure: successive loop iterations i0, i1, i2, i3, i4 overlapped in time; one iteration of the software-pipelined loop takes its instructions from several different original iterations.]


Software Pipelining Example

Before: unrolled 3 times (as in the earlier unrolling slides)

 1  LD    F0, 0(R1)
 2  ADDD  F4, F0, F2
 3  SD    0(R1), F4
 4  LD    F0, -8(R1)
 5  ADDD  F4, F0, F2
 6  SD    -8(R1), F4
 7  LD    F0, -16(R1)
 8  ADDD  F4, F0, F2
 9  SD    -16(R1), F4
10  SUBI  R1, R1, #24
11  BNEZ  R1, LOOP

After: software pipelined

    LD    F0, 0(R1)        ; prologue
    ADDD  F4, F0, F2       ; prologue
    LD    F0, -8(R1)       ; prologue
 1  SD    0(R1), F4        ; stores x[u]
 2  ADDD  F4, F0, F2       ; adds to x[u-1]
 3  LD    F0, -16(R1)      ; loads x[u-2]
 4  SUBI  R1, R1, #8
 5  BNEZ  R1, LOOP
    SD    0(R1), F4        ; epilogue
    ADDD  F4, F0, F2       ; epilogue
    SD    -8(R1), F4       ; epilogue

Equivalent source:

register double d, e;
d = x[999]; e = d + s;
d = x[998];
for (u = 999; u >= 2; u--) {
    x[u] = e;
    e = d + s;
    d = x[u-2];
}
x[1] = e;
e = d + s; x[0] = e;

[Figure: pipeline diagram of the kernel instructions SD, ADDD, LD. Within one kernel iteration, SD reads F4 before ADDD writes it, and ADDD reads F0 before LD writes it, so the kernel runs without RAW stalls.]


Software Pipelining: Concept

“A Study of Scalar Compilation Techniques for Pipelined Supercomputers”, S. Weiss and J. E. Smith, ISCA 1987, pages 105-109

  • Notation: Load, Execute, Store
  • Iterations are independent
  • In normal sequence, Ei depends on Li, and Si depends on Ei, leading to pipeline stalls
  • Software pipelining attempts to reduce these delays by inserting other instructions between such dependent pairs and “hiding” the delay
    • “Other” instructions are L and S instructions from other loop iterations
  • Does this without consuming extra code space or registers
    • Performance usually not as high as that of loop unrolling
  • How can we permute L, E, S to achieve this? (A sketch of one answer follows the figure below.)

Original loop:

Loop:  Li          ; load for iteration i
       Ei          ; execute iteration i
       Si          ; store result of iteration i
       JC Loop     ; conditional jump back

Executed trace: L1 E1 S1, L2 E2 S2, L3 E3 S3, ..., Ln En Sn
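One possible permutation, sketched under the same assumptions as the earlier example (an illustration rather than the unique answer): after a short prologue, each steady-state iteration mixes instructions from three consecutive original iterations.

Prologue:   L1, E1, L2

Loop:  Si          ; store result of iteration i
       Ei+1        ; execute iteration i+1
       Li+2        ; load for iteration i+2
       JC Loop

Epilogue:   S(n-1), En, Sn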
