EENG 449bG/CPSC 439bG Computer Systems
Lecture 15
Software ILP – Chapter 4
Text Sections 4.1 – 4.5

March 24, 2004

Prof. Andreas Savvides

Spring 2004

eng.yale/courses/eeng449bG

Compiler Techniques for Exposing ILP

  • In Chapter 3 we discussed hardware based techniques for ILP

    • Dynamic scheduling and other hardware based optimizations

      – mostly apply to superscalar processors

  • In this chapter

    • Static scheduling techniques at the compiler level

    • Mostly apply to VLIW processors

    • Start by examining how to optimize loops

Running Example

  • This code adds a scalar to a vector:

    for (i=1000; i>0; i=i-1)

      x[i] = x[i] + s;

  • Assume the following latencies for all examples:

Instruction producing result   Instruction using result   Execution in cycles   Latency in cycles
FP ALU op                      Another FP ALU op          4                     3
FP ALU op                      Store double               3                     2
Load double                    FP ALU op                  1                     1
Load double                    Store double               1                     0
Integer op                     Integer op                 1                     0

FP Loop: Where are the Hazards?

  • First translate into MIPS code:

    • To simplify, assume 8 is the lowest address

Loop: L.D F0,0(R1) ;F0=vector element

ADD.D F4,F0,F2 ;add scalar from F2

S.D F4,0(R1) ;store result

DADDUI R1,R1,#-8 ;decrement pointer 8B (DW)

BNEZ R1,Loop ;branch R1!=zero

NOP ;delayed branch slot

Where are the stalls?

FP Loop Showing Stalls

1 Loop: L.D F0,0(R1) ;F0=vector element

2 stall

3 ADD.D F4,F0,F2 ;add scalar in F2

4 stall

5 stall

6 S.D F4, 0(R1) ;store result

7 DADDUI R1,R1,#-8 ;decrement pointer 8B (DW)

8 stall

9 BNEZ R1,Loop ;branch R1!=zero

10 stall ;delayed branch slot

  • 10 clocks: Rewrite code to minimize stalls?

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

Revised FP Loop Minimizing Stalls

1 Loop: L.D F0,0(R1)

2 DADDUI R1,R1,#-8

3 ADD.D F4,F0,F2

4 stall

5 BNE R1,R2, Loop ;delayed branch

6 S.D F4, 8(R1)

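6 clocks, but just 3 for execution (L.D, ADD.D, S.D) and 3 for loop overhead (DADDUI, BNE, stall). How can we make it faster?

The S.D and BNE are swapped so that the S.D fills the branch delay slot. Because the S.D now runs after the DADDUI, its address changes from 0(R1) to 8(R1).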

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
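The 10-clock and 6-clock totals for the two versions of the loop can be reproduced with a small model. The C sketch below is an illustration added here, not part of the lecture: it assumes a single-issue, in-order pipeline, and the `Instr` record and function names are hypothetical. The 1-cycle stall between DADDUI and the branch is taken from the slide rather than from the latency table.

```c
#include <assert.h>

/* Hypothetical instruction record for this sketch. */
typedef struct {
    int dep;      /* index of the earlier instruction whose result is needed, or -1 */
    int latency;  /* stall cycles required between that producer and this consumer */
} Instr;

/* Cycle in which the last instruction issues on a single-issue, in-order pipeline. */
int count_cycles(const Instr *code, int n) {
    int issue[32];
    int cycle = 0;
    assert(n > 0 && n <= 32);
    for (int i = 0; i < n; i++) {
        int earliest = cycle + 1;  /* next free issue slot */
        if (code[i].dep >= 0) {
            int ready = issue[code[i].dep] + code[i].latency + 1;
            if (ready > earliest) earliest = ready;
        }
        issue[i] = earliest;
        cycle = earliest;
    }
    return cycle;
}

/* Original order: L.D, ADD.D, S.D, DADDUI, branch, delay slot. */
int unscheduled_loop_cycles(void) {
    const Instr loop[] = {
        { -1, 0 },  /* L.D                                        */
        {  0, 1 },  /* ADD.D needs L.D: load -> FP ALU op = 1     */
        {  1, 2 },  /* S.D needs ADD.D: FP ALU op -> store = 2    */
        { -1, 0 },  /* DADDUI                                     */
        {  3, 1 },  /* branch needs DADDUI: 1 stall (from slide)  */
        { -1, 0 },  /* branch delay slot                          */
    };
    return count_cycles(loop, 6);
}

/* Scheduled order: L.D, DADDUI, ADD.D, BNE, S.D (in the delay slot). */
int scheduled_loop_cycles(void) {
    const Instr loop[] = {
        { -1, 0 },  /* L.D                     */
        { -1, 0 },  /* DADDUI                  */
        {  0, 1 },  /* ADD.D needs L.D         */
        {  1, 1 },  /* BNE needs DADDUI        */
        {  2, 2 },  /* S.D needs ADD.D         */
    };
    return count_cycles(loop, 5);
}
```

In this model the scheduled loop's single stall lands just before the S.D rather than before the BNE as on the slide; the total of 6 cycles is the same either way.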

Unroll Loop Four Times (straightforward way)

1 Loop: L.D F0,0(R1) ;1 cycle stall after each L.D

2 ADD.D F4,F0,F2 ;2 cycles stall after each ADD.D

3 S.D F4,0(R1) ;drop DADDUI & BNE

4 L.D F6,-8(R1)

5 ADD.D F8,F6,F2

6 S.D F8,-8(R1) ;drop DADDUI & BNE

7 L.D F10,-16(R1)

8 ADD.D F12,F10,F2

9 S.D F12,-16(R1) ;drop DADDUI & BNE

10 L.D F14,-24(R1)

11 ADD.D F16,F14,F2

12 S.D F16,-24(R1)

13 DADDUI R1,R1,#-32 ;alter to 4*8; 1 cycle stall

14 BNE R1,R2,Loop ;1 cycle stall (delayed branch)

14 instruction issues + (4 x (1+2)) load and add stalls + 2 (DADDUI stall and branch delay) = 28 clock cycles, or 7 per iteration

Rewrite loop to minimize stalls?

Unrolled Loop Detail

  • We usually do not know the upper bound of the loop

  • Suppose it is n, and we would like to unroll the loop to make k copies of the body

  • Instead of a single unrolled loop, we generate a pair of consecutive loops:

    • 1st executes (n mod k) times and has a body that is the original loop

    • 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times

    • For large values of n, most of the execution time will be spent in the unrolled loop

  • Problem: Although it improves execution performance, it increases the code size substantially!
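The pair of loops described above can be sketched in C for the running example with k = 4. This is a minimal illustration, not the lecture's code: the function name is hypothetical, and the array is walked upward from index 0 rather than downward as in the MIPS version.

```c
#include <assert.h>
#include <stddef.h>

/* Adds s to x[0..n-1] using an unroll factor of k = 4 (an assumed value). */
void add_scalar_unrolled(double *x, size_t n, double s) {
    size_t i = 0;
    size_t rem = n % 4;
    assert(x != NULL || n == 0);

    /* 1st loop: the original body, executed (n mod 4) times. */
    for (; i < rem; i++)
        x[i] = x[i] + s;

    /* 2nd loop: the unrolled body, executed n/4 times. */
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

For n = 10 the first loop handles 2 elements and the unrolled loop runs twice. The code-size cost is visible even here: the body appears five times instead of once.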

Unrolled Loop That Minimizes Stalls (scheduled based on the latencies from slide 4)

1 Loop: L.D F0,0(R1)

2 L.D F6,-8(R1)

3 L.D F10,-16(R1)

4 L.D F14,-24(R1)

5 ADD.D F4,F0,F2

6 ADD.D F8,F6,F2

7 ADD.D F12,F10,F2

8 ADD.D F16,F14,F2

9 S.D F4, 0(R1)

10 S.D F8, -8(R1)

11 S.D F12, -16(R1)

12 DADDUI R1,R1,#-32

13 BNE R1,R2,Loop

14 S.D F16, 8(R1) ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

Better than 7 per iteration when unrolled but not scheduled, and 6 when scheduled but not unrolled

  • What assumptions were made when the code was moved?

    • OK to move the store past DADDUI even though it changes the register (the store offset is adjusted to compensate)

    • OK to move loads before stores: do we still get the right data?

    • When is it safe for the compiler to make such changes?

Compiler Perspectives on Code Movement

  • The compiler is concerned about dependencies in the program

  • Whether a dependence becomes a HW hazard depends on the pipeline

  • Try to schedule to avoid hazards that cause performance losses

  • (True) Data dependencies (RAW if a hazard for HW)

    • Instruction i produces a result used by instruction j, or

    • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

  • If dependent, can’t execute in parallel

  • Easy to determine for registers (fixed names)

  • Hard for memory (“memory disambiguation” problem):

    • Does 100(R4) = 20(R6)?

    • From different loop iterations, does 20(R6) = 20(R6)?

Compiler Perspectives on Code Movement

  • Name Dependencies are Hard to discover for Memory Accesses

    • Does 100(R4) = 20(R6)?

    • From different loop iterations, does 20(R6) = 20(R6)?

  • Our example required the compiler to know that if R1 doesn’t change then: 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)

    There were no dependencies between some loads and stores, so they could be moved past each other
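The easy case of memory disambiguation, two accesses off the same unchanged base register, reduces to comparing constant offsets. The C sketch below is an illustration under that assumption only; the function name is hypothetical. For different base registers, such as 100(R4) versus 20(R6), no such static test exists without knowing the register contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* True when two accesses of `size` bytes at base+off_a and base+off_b can
   touch a common byte. Assumes both use the same base register and that the
   register is not modified between the two accesses. */
bool same_base_overlap(long off_a, long off_b, long size) {
    assert(size > 0);
    return labs(off_a - off_b) < size;
}
```

With 8-byte doubles, the offsets 0, -8, -16 and -24 are pairwise at least 8 apart, so none of the unrolled loop's loads and stores to different elements can conflict.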

Steps Compiler Performed to Unroll

  • Check OK to move the S.D after DADDUI and BNEZ, and find amount to adjust S.D offset

  • Determine unrolling the loop would be useful by finding that the loop iterations were independent

  • Rename registers to avoid name dependencies

  • Eliminate extra test and branch instructions and adjust the loop termination and iteration code

  • Determine loads and stores in unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent

    • requires analyzing memory addresses and finding that they do not refer to the same address.

  • Schedule the code, preserving any dependences needed to yield same result as the original code

Where are the name dependencies?

1 Loop: L.D F0,0(R1)

2 ADD.D F4,F0,F2

3 S.D F4,0(R1) ;drop DADDUI & BNE

4 L.D F0,-8(R1)

5 ADD.D F4,F0,F2

6 S.D F4, -8(R1) ;drop DADDUI & BNE

7 L.D F0,-16(R1)

8 ADD.D F4,F0,F2

9 S.D F4, -16(R1) ;drop DADDUI & BNE

10 L.D F0,-24(R1)

11 ADD.D F4,F0,F2

12 S.D F4, -24(R1)

13 DADDUI R1,R1,#-32 ;alter to 4*8

14 BNE R1,R2,Loop

15 NOP

How can we remove them? (See pg. 310 of text)

Where are the name dependencies?

1 Loop: L.D F0,0(R1)

2 ADD.D F4,F0,F2

3 S.D 0(R1),F4 ;drop DSUBUI & BNEZ

4 L.D F6,-8(R1)

5 ADD.D F8,F6,F2

6 S.D -8(R1),F8 ;drop DSUBUI & BNEZ

7 L.D F10,-16(R1)

8 ADD.D F12,F10,F2

9 S.D -16(R1),F12 ;drop DSUBUI & BNEZ

10 L.D F14,-24(R1)

11 ADD.D F16,F14,F2

12 S.D -24(R1),F16

13 DSUBUI R1,R1,#32 ;alter to 4*8

14 BNEZ R1,Loop

15 NOP

The original “register renaming”: once each copy of the body uses its own registers, instruction execution can be overlapped or done in parallel
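The effect of renaming can be sketched in C by giving each copy of the body its own temporaries. This is an illustration only: the function name is hypothetical, the variable names simply echo the registers on the slide, and an optimizing C compiler would normally do this renaming on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Adds s to x[0..n-1], four elements per iteration; n must be a multiple of 4.
   Distinct temporaries remove the name dependences, so the four loads,
   four adds and four stores can be grouped as in the scheduled loop. */
void add_scalar_renamed(double *x, size_t n, double s) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* all loads first */
        double f0 = x[i], f6 = x[i + 1], f10 = x[i + 2], f14 = x[i + 3];
        /* then all adds */
        double f4 = f0 + s, f8 = f6 + s, f12 = f10 + s, f16 = f14 + s;
        /* then all stores */
        x[i]     = f4;
        x[i + 1] = f8;
        x[i + 2] = f12;
        x[i + 3] = f16;
    }
}
```

Had every copy reused the same two temporaries, each load would have to wait for the previous store to finish with them, even though no data actually flows between the copies.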

Limits to Loop Unrolling

  • Diminishing returns: the amount of loop overhead amortized decreases with each additional unroll, and after a few unrolls the remaining overhead is very small

  • Code size limitations: memory is not infinite, especially in embedded systems

  • Compiler limitations: excessive unrolling causes a shortfall in registers (register pressure), and the optimized code may lose its advantage due to the lack of registers
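The diminishing return from further unrolling is simple arithmetic. As a rough sketch, assuming the only overhead is the two instructions (pointer update and branch) that are shared by the k copies of the body, the overhead charged to each element is 2/k. The function name is hypothetical.

```c
#include <assert.h>

/* Overhead instructions per element when the loop body is unrolled k times,
   assuming 2 overhead instructions (DADDUI and the branch) per loop trip. */
double overhead_per_element(int k) {
    assert(k > 0);
    return 2.0 / k;
}
```

Going from k = 1 to k = 4 removes 1.5 overhead instructions per element, while going from k = 4 to k = 8 removes only 0.25 more, yet doubles the size of the loop body and the number of registers needed.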

Static Branch Prediction

  • Simplest: Predict taken

    • average misprediction rate = untaken branch frequency, which for the SPEC programs is 34%.

    • Unfortunately, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%)

  • Predict on the basis of branch direction?

    • choosing backward-going branches to be taken (loop)

    • forward-going branches to be not taken (if)

    • In the SPEC programs, however, most forward-going branches are taken => predict taken is better

  • Predict branches on the basis of profile information collected from earlier runs

    • Misprediction varies from 5% to 22%
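The first two static schemes can be compared on a recorded branch trace. The C sketch below is illustrative only: the `Branch` record and function names are hypothetical, and the trace a caller supplies is made-up data, not SPEC measurements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one executed branch. */
typedef struct {
    long pc;      /* address of the branch instruction */
    long target;  /* address it jumps to when taken */
    bool taken;   /* actual outcome */
} Branch;

/* Mispredictions when every branch is predicted taken. */
size_t mispredicts_always_taken(const Branch *trace, size_t n) {
    size_t wrong = 0;
    assert(trace != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!trace[i].taken) wrong++;
    return wrong;
}

/* Mispredictions when backward branches are predicted taken and
   forward branches are predicted not taken. */
size_t mispredicts_by_direction(const Branch *trace, size_t n) {
    size_t wrong = 0;
    assert(trace != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        bool predicted_taken = trace[i].target < trace[i].pc;
        if (predicted_taken != trace[i].taken) wrong++;
    }
    return wrong;
}
```

On a trace where most forward branches are taken, the direction-based count comes out higher than the always-taken count, which is the effect the slide reports for SPEC.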

Basic VLIW Architectures

  • Does not require the hardware for making dynamic issue decisions

    – the compiler is responsible for scheduling

  • Has an advantage in wider-issue processors

    • For small issue widths (2 or 3 instructions) superscalar overhead is minimal

    • For larger issue widths hardware complexity grows

      • Better off with VLIW

    • Typical instruction width: 5 operations

      • 1 integer op, 2 FP ops and 2 memory refs

      • 16 – 24 bits per unit, instruction width 112 – 168 bits

Basic VLIW Architectures II

  • There must be enough parallelism to fill the slots

    • Unroll loops

    • Use local optimizations on straight line code

    • If code has many branches, global optimizations are needed (e.g. trace scheduling)

  • VLIW disadvantage

    • Harder to update compiler between different versions of the hardware

      • Object code translation is a possible solution

  • General advantage of multiple issue processors vs. vector processors

    • Potential to extract parallelism from less structured code

    • Ability to use a more conventional and typically less expensive, cache based memory system

VLIW: Very Large Instruction Word

  • Each “instruction” has explicit coding for multiple operations

    • In IA-64, grouping called a “packet”

    • In Transmeta, grouping called a “molecule” (with “atoms” as ops)

  • Tradeoff instruction space for simple decoding

    • The long instruction word has room for many operations

    • By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel

    • E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch

      • 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide

    • Need compiling technique that schedules across several branches
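The instruction-width figures follow directly from the number of operation fields. A minimal C sketch of that arithmetic, using a hypothetical `VliwFormat` record for the slot mix given above:

```c
#include <assert.h>

/* Hypothetical description of the operation slots in one long instruction. */
typedef struct {
    int int_ops;
    int fp_ops;
    int mem_refs;
    int branches;
} VliwFormat;

/* Total instruction width in bits, assuming every field has the same size. */
int vliw_width_bits(VliwFormat f, int bits_per_field) {
    int fields = f.int_ops + f.fp_ops + f.mem_refs + f.branches;
    assert(fields > 0 && bits_per_field > 0);
    return fields * bits_per_field;
}
```

For the 2 integer, 2 FP, 2 memory and 1 branch mix, this gives 7 fields, so 112 bits at 16 bits per field and 168 bits at 24.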

When Safe to Unroll Loop?

  • Example: Where are the data dependencies? (A, B, C distinct & nonoverlapping)

    for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i]; /* S1 */
      B[i+1] = B[i] + A[i+1]; /* S2 */
    }

    1. S2 uses the value, A[i+1], computed by S1 in the same iteration.

    2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a “loop-carried dependence”: a dependence between iterations

  • In our prior example (x[i] = x[i] + s), each iteration was independent of the others

  • Does a loop-carried dependence imply that iterations can’t be executed in parallel? Not always.

Some Loop Carried Dependences can be Parallelized

  • Example:

    for (i=1; i<=100; i=i+1) {
      A[i] = A[i] + B[i]; /* S1 */
      B[i+1] = C[i] + D[i]; /* S2 */
    }

    S1 uses a value assigned by S2 in the previous iteration: a loop-carried dependence

    HOWEVER, the dependence is not circular

    • No statement depends on itself

    • S1 depends on S2 but S2 does not depend on S1

    • Absence of a cycle gives a partial ordering of the statements, so the loop is parallel

Parallel Version of Loop

1. There is no dependence from S1 to S2. This means that S1 and S2 can be interchanged

2. On the first iteration S1 depends on B[1] computed prior to initiating the loop

A[1] = A[1] + B[1];

for (i=1; i<=99; i=i+1) {
  B[i+1] = C[i] + D[i]; /* S2 */
  A[i+1] = A[i+1] + B[i+1]; /* S1 */
}

B[101] = C[100] + D[100];

The dependence between iterations is gone, so loop iterations can now be overlapped, provided the two statements within each iteration are kept in order.
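That the transformation preserves the result can be checked directly. The C sketch below assumes the textbook form of the loop, A[i] = A[i] + B[i] followed by B[i+1] = C[i] + D[i] for i = 1..100, and hypothetical function names; arrays have 102 entries so that indices 1..101 are valid.

```c
#include <assert.h>

#define N 100  /* iteration count used on the slide */

/* Original loop: S1 reads the B value that S2 wrote in the previous iteration. */
void loop_original(double *A, double *B, const double *C, const double *D) {
    assert(A && B && C && D);
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];          /* S1 */
        B[i + 1] = C[i] + D[i];      /* S2 */
    }
}

/* Transformed loop: S2 now precedes the S1 that consumes its value, so the
   dependence stays inside one iteration. */
void loop_transformed(double *A, double *B, const double *C, const double *D) {
    assert(A && B && C && D);
    A[1] = A[1] + B[1];              /* first S1, peeled off */
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];          /* S2 */
        A[i + 1] = A[i + 1] + B[i + 1];  /* S1 */
    }
    B[N + 1] = C[N] + D[N];          /* last S2, peeled off */
}
```

Running both on identical inputs should leave A and B identical, element for element.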

Another possibility: Software Pipelining

  • Observation: if iterations from loops are independent, then can get more ILP by taking instructions from different iterations

  • Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (~ Tomasulo in SW)

Software Pipelining Example

Loop: L.D F0,0(R1)

ADD.D F4,F0,F2

S.D F4, 0(R1)

DADDUI R1, R1, #-8

BNE R1, R2, Loop

Software Pipelining Example

Before: Unrolled 3 times

1 L.D F0,0(R1)

2 ADD.D F4,F0,F2

3 S.D F4,0(R1)

4 L.D F6,-8(R1)

5 ADD.D F8,F6,F2

6 S.D F8,-8(R1)

7 L.D F10,-16(R1)

8 ADD.D F12,F10,F2

9 S.D F12,-16(R1)

10 DADDUI R1,R1,#-24

11 BNE R1,R2,Loop

After: Software Pipelined

1 S.D F4,0(R1) ; Stores M[i]

2 ADD.D F4,F0,F2 ; Adds to M[i-1]

3 L.D F0,-16(R1) ; Loads M[i-2]

4 DADDUI R1,R1,#-8

5 BNE R1,R2,Loop

[Figure: overlapped ops, SW Pipeline vs. Loop Unrolled]

  • Symbolic loop unrolling

  • Maximize result-use distance

  • Less code space than unrolling

  • Fill & drain pipe only once per loop vs. once per each unrolled iteration in loop unrolling

5 cycles per iteration
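The same restructuring can be written out in C, including the start-up and wind-down code that the steady-state listing omits. This is a sketch under assumptions: the function name is hypothetical, and the array is walked upward, so each steady-state iteration stores element i, adds for element i+1 and loads element i+2.

```c
#include <assert.h>
#include <stddef.h>

/* Adds s to x[0..n-1] with a hand software-pipelined loop. */
void add_scalar_swp(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    if (n < 3) {                      /* too short to fill the pipeline */
        for (size_t i = 0; i < n; i++)
            x[i] = x[i] + s;
        return;
    }

    /* Prologue: fill the pipe. */
    double loaded = x[0];             /* load for element 0 */
    double sum = loaded + s;          /* add for element 0 */
    loaded = x[1];                    /* load for element 1 */

    /* Steady state: one store, one add, one load, each for a different element. */
    for (size_t i = 0; i + 2 < n; i++) {
        x[i] = sum;                   /* store element i */
        sum = loaded + s;             /* add for element i+1 */
        loaded = x[i + 2];            /* load element i+2 */
    }

    /* Epilogue: drain the pipe. */
    x[n - 2] = sum;
    x[n - 1] = loaded + s;
}
```

Within one steady-state iteration no operation depends on another from the same iteration, which is what maximizes the result-use distance; the prologue and epilogue are paid once per loop rather than once per unrolled body.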

Loop Unrolling vs. Software Pipelining

  • Both provide a better scheduled inner loop

  • Loop Unrolling

    • Reduces the overhead of the loop, the branch and counter update code

  • Software Pipelining

    • Reduces the time when the loop is not running at peak speed to once per loop, at the beginning and end

    • Easier when the body of a loop is a basic block, much more complex when it contains internal flow control

  • If we unroll a loop that does 100 iterations a constant number of times, e.g. 4, then we have to pay the overhead 100/4 = 25 times

Trace Scheduling

  • Parallelism across IF branches vs. LOOP branches?

  • Trace scheduling speeds up the most frequent path at a cost to the less frequent paths.

  • Two steps:

    • Trace Selection

      • Find likely sequence of basic blocks (trace) of (statically predicted or profile predicted) long sequence of straight-line code

    • Trace Compaction

      • Squeeze trace into few VLIW instructions

      • Need bookkeeping code in case prediction is wrong

  • This is a form of compiler-generated speculation

    • Compiler must generate “fixup” code to handle cases in which trace is not the taken branch

    • Needs extra registers: undoes bad guess by discarding

  • With no hardware interlocks, subtle compiler bugs mean wrong answers rather than just poorer performance

  • So far it has been successfully applied to scientific code with intensive loops, but it is still unclear whether it is suitable for programs with fewer loops.
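The idea of scheduling along the likely path and repairing the unlikely one can be shown by hand on a tiny function. The C sketch below is a hypothetical example, not taken from the lecture: it assumes the `x > 0` path is the predicted trace and that inputs are small enough that the speculated multiply cannot overflow.

```c
#include <assert.h>

/* Original code: the work after the branch cannot start until it resolves. */
int compute_original(int a, int b, int c) {
    int x = a + b;
    int y;
    if (x > 0)
        y = x * c;      /* likely path (assumed) */
    else
        y = c - x;      /* unlikely path */
    return y + 1;
}

/* Hand "trace-scheduled" version: the likely path is hoisted above the branch
   into extra temporaries, and fixup code on the off-trace path discards the
   speculated values and recomputes. */
int compute_trace_scheduled(int a, int b, int c) {
    int x = a + b;
    int y_spec = x * c;     /* speculated: executed before the test */
    int r = y_spec + 1;     /* also moved up along the trace */
    if (x <= 0) {           /* off trace: fixup */
        int y = c - x;
        r = y + 1;
    }
    return r;
}
```

The extra temporaries `y_spec` and `r` correspond to the extra registers mentioned above: a bad guess is undone simply by ignoring them.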

Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation

  • HW advantages:

    • HW better at memory disambiguation since knows actual addresses

    • HW better at branch prediction since lower overhead

    • HW maintains precise exception model

    • HW does not execute bookkeeping instructions

    • Same software works across multiple implementations

    • Smaller code size (not as many nops filling blank instructions)

  • SW advantages:

    • The window of instructions examined for parallelism is much larger

    • Much less hardware involved in VLIW (unless you are Intel…!)

    • More involved types of speculation can be done more easily

    • Speculation can be based on large-scale program behavior, not just local information

Superscalar v. VLIW

Superscalar advantages:

  • Smaller code size

  • Binary compatibility across generations of hardware

VLIW advantages:

  • Simplified hardware for decoding, issuing instructions

  • No interlock hardware (compiler checks?)

  • More registers, but simplified hardware for register ports (multiple independent register files?)

Next Time

  • Hardware support for exposing more parallelism, examples and conclusion of ILP

  • Discussion of project reports

  • HWK3 out next lecture

    Next Week

  • Memory Hierarchies – Chapter 5 – last chapter for the course!