
Computer Architecture

Chapter 4

Instruction-Level Parallelism - 3

Prof. Jerry Breecher

CS 240

Fall 2003


Chapter overview

Chapter Overview

4.1 Compiler Techniques for Exposing ILP

4.2 Static Branch Prediction

4.3 Static Multiple Issue: VLIW

4.4 Advanced Compiler Support for ILP

4.5 Hardware Support for Exposing more Parallelism


Ideas To Reduce Stalls

[Table: ideas to reduce stalls, organized by where they are covered: Chapter 3 vs. Chapter 4.]


Instruction Level Parallelism

How can compilers recognize and take advantage of ILP?

4.1 Compiler Techniques for Exposing ILP

4.3 Static Multiple Issue: VLIW

4.4 Advanced Compiler Support for ILP

4.5 Hardware Support for Exposing more Parallelism


Compilers and ILP

Pipeline Scheduling and Loop Unrolling

Simple Loop and its Assembler Equivalent

for (i = 1; i <= 1000; i++) x[i] = x[i] + s;

This is a clean and simple example!

Loop:  LD    F0, 0(R1)    ; F0 = vector element
       ADDD  F4, F0, F2   ; add scalar from F2
       SD    0(R1), F4    ; store result
       SUBI  R1, R1, #8   ; decrement pointer 8 bytes (DW)
       BNEZ  R1, Loop     ; branch if R1 != zero
       NOP                ; delayed branch slot


Compilers and ILP

Pipeline Scheduling and Loop Unrolling

FP Loop Hazards

Loop:  LD    F0, 0(R1)    ; F0 = vector element
       ADDD  F4, F0, F2   ; add scalar in F2
       SD    0(R1), F4    ; store result
       SUBI  R1, R1, #8   ; decrement pointer 8 bytes (DW)
       BNEZ  R1, Loop     ; branch if R1 != zero
       NOP                ; delayed branch slot

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op                      3
FP ALU op                       Store double                           2
Load double                     FP ALU op                              1
Load double                     Store double                           0
Integer op                      Integer op                             0

Where are the stalls?


Compilers and ILP

Pipeline Scheduling and Loop Unrolling

FP Loop Showing Stalls

 1  Loop:  LD    F0, 0(R1)    ; F0 = vector element
 2         stall
 3         ADDD  F4, F0, F2   ; add scalar in F2
 4         stall
 5         stall
 6         SD    0(R1), F4    ; store result
 7         SUBI  R1, R1, #8   ; decrement pointer 8 bytes (DW)
 8         stall
 9         BNEZ  R1, Loop     ; branch if R1 != zero
10         stall              ; delayed branch slot

10 clocks: Rewrite code to minimize stalls?

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op                      3
FP ALU op                       Store double                           2
Load double                     FP ALU op                              1
Load double                     Store double                           0
Integer op                      Integer op                             0


Compilers and ILP

Pipeline Scheduling and Loop Unrolling

Scheduled FP Loop Minimizing Stalls

 1  Loop:  LD    F0, 0(R1)
 2         SUBI  R1, R1, #8
 3         ADDD  F4, F0, F2
 4         stall
 5         BNEZ  R1, Loop     ; delayed branch
 6         SD    8(R1), F4    ; altered when moved past SUBI

Now 6 clocks per iteration. Next, unroll the loop 4 times to make it faster.

The remaining stall occurs because SD cannot proceed until the ADDD result is ready.

BNEZ and SD were swapped by changing the address offset of the SD.

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op                      3
FP ALU op                       Store double                           2
Load double                     FP ALU op                              1


Compilers and ILP

Pipeline Scheduling and Loop Unrolling

Unroll Loop Four Times (straightforward way)

 1  Loop:  LD    F0, 0(R1)
 2         stall
 3         ADDD  F4, F0, F2
 4         stall
 5         stall
 6         SD    0(R1), F4
 7         LD    F6, -8(R1)
 8         stall
 9         ADDD  F8, F6, F2
10         stall
11         stall
12         SD    -8(R1), F8
13         LD    F10, -16(R1)
14         stall
15         ADDD  F12, F10, F2
16         stall
17         stall
18         SD    -16(R1), F12
19         LD    F14, -24(R1)
20         stall
21         ADDD  F16, F14, F2
22         stall
23         stall
24         SD    -24(R1), F16
25         SUBI  R1, R1, #32
26         BNEZ  R1, LOOP
27         stall
28         NOP

Rewrite the loop to minimize stalls.

15 + 4 × (1 + 2) + 1 = 28 clock cycles, or 7 per iteration

Assumes the loop count is a multiple of 4
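At the source level, the transformation corresponds roughly to this sketch (a hand-unrolled version of the original C loop; the function name is hypothetical, and it assumes the trip count is a multiple of 4):

    void add_scalar_unrolled4( double *x, double s )    /* x[1..1000] */
    {
        int i;
        /* Original: for (i = 1000; i > 0; i--) x[i] = x[i] + s;            */
        /* Unrolled by 4: one copy of the loop overhead (decrement, test,
           branch) per four array elements.                                 */
        for ( i = 1000; i > 0; i -= 4 ) {
            x[i]   = x[i]   + s;
            x[i-1] = x[i-1] + s;
            x[i-2] = x[i-2] + s;
            x[i-3] = x[i-3] + s;
        }
    }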


Compilers and ILP

Pipeline Scheduling and Loop Unrolling

Unrolled Loop That Minimizes Stalls

 1  Loop:  LD    F0, 0(R1)
 2         LD    F6, -8(R1)
 3         LD    F10, -16(R1)
 4         LD    F14, -24(R1)
 5         ADDD  F4, F0, F2
 6         ADDD  F8, F6, F2
 7         ADDD  F12, F10, F2
 8         ADDD  F16, F14, F2
 9         SD    0(R1), F4
10         SD    -8(R1), F8
11         SD    -16(R1), F12
12         SUBI  R1, R1, #32
13         BNEZ  R1, LOOP
14         SD    8(R1), F16   ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration

What assumptions were made when the code was moved?

  • OK to move the store past the SUBI even though SUBI changes a register the store uses

  • OK to move the loads before the stores: do we still get the right data?

  • When is it safe for compiler to do such changes?

No Stalls!!


Compilers and ILP

Pipeline Scheduling and Loop Unrolling

Summary of Loop Unrolling Example

  • Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset.

  • Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.

  • Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.

  • Eliminate the extra tests and branches and adjust the loop maintenance code.

  • Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.

  • Schedule the code, preserving any dependences needed to yield the same result as the original code.


Compilers and ILP

Dependencies

Compiler Perspectives on Code Movement

The compiler is concerned about dependencies in the program; it is not concerned with whether a given hazard arises on a particular hardware pipeline.

  • Tries to schedule code to avoid hazards.

  • Looks for Data dependencies (RAW if a hazard for HW)

    • Instruction i produces a result used by instruction j, or

    • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

  • If dependent, can’t execute in parallel

  • Easy to determine for registers (fixed names)

  • Hard for memory:

    • Does 100(R4) = 20(R6)?

    • From different loop iterations, does 20(R6) = 20(R6)?
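A small C sketch (hypothetical code, not from the text) of why memory disambiguation is hard: whether two references conflict depends on pointer values the compiler may not know at compile time.

    /* Think of the two references below as the 100(R4) and 20(R6) accesses
       above: if p and q can point into the same array, the store and the
       load may touch the same address, so the compiler must keep them in
       their original order.                                                */
    void scale_two( double *p, double *q, double s )
    {
        p[12] = p[12] * s;      /* store through p */
        q[2]  = q[2]  + s;      /* load and store through q */
    }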


Compilers and ILP

Data Dependencies

Compiler Perspectives on Code Movement

Where are the data dependencies?

 1  Loop:  LD    F0, 0(R1)
 2         ADDD  F4, F0, F2
 3         SUBI  R1, R1, #8
 4         BNEZ  R1, Loop     ; delayed branch
 5         SD    8(R1), F4    ; altered when moved past SUBI


Compilers and ILP

Name Dependencies

Compiler Perspectives on Code Movement

  • Another kind of dependence called name dependence: two instructions use same name (register or memory location) but don’t exchange data

  • Anti-dependence (WAR if a hazard for HW)

    • Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first

  • Output dependence (WAW if a hazard for HW)

    • Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved.


Compilers and ILP

Name Dependencies

Compiler Perspectives on Code Movement

 1  Loop:  LD    F0, 0(R1)
 2         ADDD  F4, F0, F2
 3         SD    0(R1), F4
 4         LD    F0, -8(R1)
 5         ADDD  F4, F0, F2
 6         SD    -8(R1), F4
 7         LD    F0, -16(R1)
 8         ADDD  F4, F0, F2
 9         SD    -16(R1), F4
10         LD    F0, -24(R1)
11         ADDD  F4, F0, F2
12         SD    -24(R1), F4
13         SUBI  R1, R1, #32
14         BNEZ  R1, LOOP
15         NOP

How can we remove these dependencies?

Where are the name dependencies?

No data is passed through F0 between iterations, yet F0 cannot be reused by the LD in instruction 4 until the earlier instructions have read it.
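The fix is renaming: give each copy its own registers (F0/F4, F6/F8, F10/F12, F14/F16), as in the earlier unrolled listing. The same idea at the C level, using separate temporaries (a sketch with hypothetical names):

    double t, t0, t1;

    /* One shared temporary creates name dependences between the copies:  */
    t = x[i];    t = t + s;    x[i]   = t;
    t = x[i-1];  t = t + s;    x[i-1] = t;

    /* Separate temporaries (the "renamed registers") make the copies
       independent, so they can be reordered or overlapped freely:        */
    t0 = x[i];      t1 = x[i-1];
    t0 = t0 + s;    t1 = t1 + s;
    x[i] = t0;      x[i-1] = t1;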


Compilers and ILP

Name Dependencies

Compiler Perspectives on Code Movement

  • Again Name Dependencies are Hard for Memory Accesses

    • Does 100(R4) = 20(R6)?

    • From different loop iterations, does 20(R6) = 20(R6)?

  • Our example required the compiler to know that if R1 doesn't change, then: 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)

    There were no dependencies between some loads and stores so they could be moved around each other


Compilers and ILP

Control Dependencies

Compiler Perspectives on Code Movement

  • Final kind of dependence called control dependence

  • Example

    if p1 {S1;};

    if p2 {S2;};

    S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.


Compilers and ILP

Control Dependencies

Compiler Perspectives on Code Movement

  • Two (obvious) constraints on control dependences:

    • An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.

    • An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch.

  • Control dependencies can be relaxed to get parallelism; we get the same effect if we preserve the order of exceptions (an address in a register is checked by a branch before use) and the data flow (a value in a register may depend on which way a branch went)


Compilers and ILP

Control Dependencies

Where are the control dependencies?

Compiler Perspectives on Code Movement

 1  Loop:  LD    F0, 0(R1)
 2         ADDD  F4, F0, F2
 3         SD    0(R1), F4
 4         SUBI  R1, R1, #8
 5         BEQZ  R1, exit
 6         LD    F0, 0(R1)
 7         ADDD  F4, F0, F2
 8         SD    0(R1), F4
 9         SUBI  R1, R1, #8
10         BEQZ  R1, exit
11         LD    F0, 0(R1)
12         ADDD  F4, F0, F2
13         SD    0(R1), F4
14         SUBI  R1, R1, #8
15         BEQZ  R1, exit
....


Compilers and ILP

Loop Level Parallelism

When Safe to Unroll Loop?

  • Example: Where are data dependencies? (A,B,C distinct & non-overlapping)

    1. S2 uses the value, A[i+1], computed by S1 in the same iteration.

    2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a “loop-carried dependence” between iterations

  • Implies that iterations are dependent, and can’t be executed in parallel

  • Note that this was not the case in our prior example; there, each iteration was independent.

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}
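To make the loop-carried chain concrete, here are the first two iterations written out by hand (taken directly from the loop above):

    A[2] = A[1] + C[1];     /* S1, i = 1 */
    B[2] = B[1] + A[2];     /* S2, i = 1 */
    A[3] = A[2] + C[2];     /* S1, i = 2: reads the A[2] produced by i = 1 */
    B[3] = B[2] + A[3];     /* S2, i = 2: reads the B[2] produced by i = 1 */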


Compilers and ILP

Loop Level Parallelism

When Safe to Unroll Loop?

  • Example: Where are data dependencies? (A,B,C,D distinct & non-overlapping)

    1. No dependence from S1 to S2. If there were, then there would be a cycle in the dependencies and the loop would not be parallel. Since this other dependence is absent, interchanging the two statements will not affect the execution of S2.

    2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}


Compilers and ILP

Loop Level Parallelism

Now Safe to Unroll Loop? (p. 240)

OLD: the loop carried a dependence on B.

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

NEW: the loop-carried dependence has been eliminated; there are no circular dependencies.

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];


Compilers and ILP

Example 1: There are NO dependencies

Loop Level Parallelism

/* *****************************************************
   This is the example on page 305 of Hennessy & Patterson,
   but running on an Intel machine.
   ***************************************************** */

#define MAX  1000
#define ITER 100000

int main( int argc, char *argv[] )
{
    double x[MAX + 2];
    double s = 3.14159;
    int i, j;

    for ( i = MAX; i > 0; i-- )          /* Init array */
        x[i] = 0;

    for ( j = ITER; j > 0; j-- )
        for ( i = MAX; i > 0; i-- )
            x[i] = x[i] + s;
}


Compilers and ILP

Loop Level Parallelism: Example 1

This is the ICC optimized code (elapsed seconds = 0.122848):

.L2:

fstpl 8(%esp,%edx,8)

fldl (%esp,%edx,8)

fadd %st(1), %st

fldl -8(%esp,%edx,8)

fldl -16(%esp,%edx,8)

fldl -24(%esp,%edx,8)

fldl -32(%esp,%edx,8)

fxch %st(4)

fstpl (%esp,%edx,8)

fxch %st(2)

fadd %st(4), %st

fstpl -8(%esp,%edx,8)

fadd %st(3), %st

fstpl -16(%esp,%edx,8)

fadd %st(2), %st

fstpl -24(%esp,%edx,8)

fadd %st(1), %st

addl $-5, %edx

testl %edx, %edx

jg .L2 # Prob 99%

fstpl 8(%esp,%edx,8)

This is the GCC optimized code (elapsed seconds = 0.590026):

.L15:

fldl (%ecx,%eax)

fadd %st(1),%st

decl %edx

fstpl (%ecx,%eax)

addl $-8,%eax

testl %edx,%edx

jg .L15


Compilers and ILP

Example 2

Loop Level Parallelism

There are two dependencies here – what are they?

// Example on Page 320
get_current_time( &start_time );
for ( j = ITER; j > 0; j-- )
{
    for ( i = 1; i <= MAX; i++ )
    {
        A[i+1] = A[i] + C[i];
        B[i+1] = B[i] + A[i+1];
    }
}
get_current_time( &end_time );


Compilers and ILP

Loop Level Parallelism: Example 2

This is the ICC optimized code (elapsed seconds = 0.664073):

.L4:

fstpl 25368(%esp,%edx,8)

fldl 8472(%esp,%edx,8)

faddl 16920(%esp,%edx,8)

fldl 25368(%esp,%edx,8)

fldl 16928(%esp,%edx,8)

fxch %st(2)

fstl 8480(%esp,%edx,8)

fadd %st, %st(1)

fxch %st(1)

fstl 25376(%esp,%edx,8)

fxch %st(2)

faddp %st, %st(1)

fstl 8488(%esp,%edx,8)

faddp %st, %st(1)

addl $2, %edx

cmpl $1000, %edx

jle .L4 # Prob 99%

fstpl 25368(%esp,%edx,8)

This is the GCC optimized code (elapsed seconds = 1.357084):

.L55:

fldl -8(%esi,%eax)

faddl -8(%edi,%eax)

fstl (%esi,%eax)

faddl -8(%ecx,%eax)

incl %edx

fstpl (%ecx,%eax)

addl $8,%eax

cmpl $1000,%edx

jle .L55

This is Microsoft optimized code

$L1225:

fld   QWORD PTR _C$[esp+eax+40108]
add   eax, 8
cmp   eax, 7992
fadd  QWORD PTR _A$[esp+eax+40100]
fst   QWORD PTR _A$[esp+eax+40108]
fadd  QWORD PTR _B$[esp+eax+40100]
fstp  QWORD PTR _B$[esp+eax+40108]
jle   $L1225


Compilers and ILP

Example 3

Loop Level Parallelism

What are the dependencies here?

// Example on Page 321
get_current_time( &start_time );
for ( j = ITER; j > 0; j-- )
{
    for ( i = 1; i <= MAX; i++ )
    {
        A[i] = A[i] + B[i];
        B[i+1] = C[i] + D[i];
    }
}
get_current_time( &end_time );


Compilers and ILP

Loop Level Parallelism: Example 3

This is the ICC optimized code (elapsed seconds = 0.325419):

.L6:

fstpl 8464(%esp,%edx,8)

fldl 8472(%esp,%edx,8)

faddl 25368(%esp,%edx,8)

fldl 16920(%esp,%edx,8)

faddl 33824(%esp,%edx,8)

fldl 8480(%esp,%edx,8)

fldl 16928(%esp,%edx,8)

faddl 33832(%esp,%edx,8)

fxch %st(3)

fstpl 8472(%esp,%edx,8)

fxch %st(1)

fstl 25376(%esp,%edx,8)

fxch %st(2)

fstpl 25384(%esp,%edx,8)

faddp %st, %st(1)

addl $2, %edx

cmpl $1000, %edx

jle .L6 # Prob 99%

fstpl 8464(%esp,%edx,8)

This is the GCC optimized code (elapsed seconds = 1.370478):

.L65:

fldl (%esi,%eax)

faddl (%ecx,%eax)

fstpl (%esi,%eax)

movl -40100(%ebp),%edi

fldl (%edi,%eax)

movl -40136(%ebp),%edi

faddl (%edi,%eax)

incl %edx

fstpl 8(%ecx,%eax)

addl $8,%eax

cmpl $1000,%edx

jle .L65


Compilers and ILP

Example 4

Loop Level Parallelism

Elapsed seconds = 1.200525

How many dependencies are here?

// Example on Page 322
get_current_time( &start_time );
for ( j = ITER; j > 0; j-- )
{
    A[1] = A[1] + B[1];
    for ( i = 1; i <= MAX - 1; i++ )
    {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
}
get_current_time( &end_time );


Compilers and ILP

Loop Level Parallelism: Example 4

This is the GCC optimized code (elapsed seconds = 1.200525):

.L75:

movl -40136(%ebp),%edi

fldl -8(%edi,%eax)

faddl -8(%esi,%eax)

movl -40104(%ebp),%edi

fstl (%edi,%eax)

faddl (%ecx,%eax)

incl %edx

fstpl (%ecx,%eax)

addl $8,%eax

cmpl $999,%edx

jle .L75

This is the Microsoft optimized code

$L1239:
fld   QWORD PTR _D$[esp+eax+40108]
add   eax, 8
cmp   eax, 7984            ; 00001f30H
fadd  QWORD PTR _C$[esp+eax+40100]
fst   QWORD PTR _B$[esp+eax+40108]
fadd  QWORD PTR _A$[esp+eax+40108]
fstp  QWORD PTR _A$[esp+eax+40108]
jle   SHORT $L1239


Compilers and ILP

Loop Level Parallelism: Example 4

This is the ICC optimized code (elapsed seconds = 0.359232):

.L8:
fstpl 8472(%esp,%edx,8)
fldl 16920(%esp,%edx,8)
faddl 33824(%esp,%edx,8)
fldl 8480(%esp,%edx,8)
fldl 16928(%esp,%edx,8)
faddl 33832(%esp,%edx,8)
fldl 8488(%esp,%edx,8)
fldl 16936(%esp,%edx,8)
faddl 33840(%esp,%edx,8)
fldl 8496(%esp,%edx,8)
fxch %st(5)
fstl 25376(%esp,%edx,8)
fxch %st(3)
fstl 25384(%esp,%edx,8)
fxch %st(1)
fstl 25392(%esp,%edx,8)
fxch %st(3)
faddp %st, %st(4)
fxch %st(3)
fstpl 8480(%esp,%edx,8)
faddp %st, %st(2)
fxch %st(1)
fstpl 8488(%esp,%edx,8)
faddp %st, %st(1)
addl $3, %edx
cmpl $999, %edx
jle .L8
fstpl 8472(%esp,%edx,8)


Static Multiple Issue

Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.

Flavor I:

Superscalar processors issue varying number of instructions per clock - can be either statically scheduled (by the compiler) or dynamically scheduled (by the hardware).

Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo).

Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP PA-8000

4.1 Compiler Techniques for Exposing ILP

4.3 Static Multiple Issue: VLIW

4.4 Advanced Compiler Support for ILP

4.5 Hardware Support for Exposing more Parallelism


Multiple Issue

Issuing Multiple Instructions/Cycle

Flavor II:

VLIW - Very Long Instruction Word - issues a fixed number of instructions formatted either as one very large instruction or as a fixed packet of smaller instructions.

A fixed number of instructions (4-16) is scheduled by the compiler; operations are placed into the slots of a wide instruction template.

  • Joint HP/Intel agreement in 1999/2000

  • Intel Architecture-64 (IA-64) 64-bit address

  • Style: “Explicitly Parallel Instruction Computer (EPIC)”


Multiple Issue

Issuing Multiple Instructions/Cycle

Flavor II - continued:

  • 3 Instructions in 128 bit “groups”; field determines if instructions dependent or independent

    • Smaller code size than old VLIW, larger than x86/RISC

    • Groups can be linked to show independence > 3 instr

  • 128 integer registers + 128 floating point registers

    • Not separate files per functional unit as in old VLIW

  • Hardware checks dependencies (interlocks => binary compatibility over time)

  • Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mis-predictions?

  • IA-64 : name of instruction set architecture; EPIC is type

  • Merced is name of first implementation (1999/2000?)


Multiple Issue

A SuperScalar Version of MIPS

Issuing Multiple Instructions/Cycle

  • In our MIPS example, we can handle 2 instructions/cycle:

  • Floating Point

  • Anything Else

– Fetch 64-bits/clock cycle; Int on left, FP on right

– Can only issue 2nd instruction if 1st instruction issues

– More ports for FP registers to do FP load & FP op in a pair

Type               Pipe stages
Int. instruction   IF  ID  EX  MEM  WB
FP instruction     IF  ID  EX  MEM  WB
Int. instruction       IF  ID  EX  MEM  WB
FP instruction         IF  ID  EX  MEM  WB
Int. instruction           IF  ID  EX  MEM  WB
FP instruction             IF  ID  EX  MEM  WB

  • A 1-cycle load delay causes a delay to 3 instructions in the superscalar

    • the instruction in the right half of the same issue pair can’t use the result, nor can either instruction in the next issue slot


Multiple Issue

A SuperScalar Version of MIPS

Unrolled Loop Minimizes Stalls for Scalar

 1  Loop:  LD    F0, 0(R1)
 2         LD    F6, -8(R1)
 3         LD    F10, -16(R1)
 4         LD    F14, -24(R1)
 5         ADDD  F4, F0, F2
 6         ADDD  F8, F6, F2
 7         ADDD  F12, F10, F2
 8         ADDD  F16, F14, F2
 9         SD    0(R1), F4
10         SD    -8(R1), F8
11         SD    -16(R1), F12
12         SUBI  R1, R1, #32
13         BNEZ  R1, LOOP
14         SD    8(R1), F16   ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration

Latencies:

LD to ADDD: 1 Cycle

ADDD to SD: 2 Cycles


Multiple Issue

A SuperScalar Version of MIPS

Loop Unrolling in Superscalar

        Integer instruction     FP instruction        Clock cycle
Loop:   LD   F0, 0(R1)                                     1
        LD   F6, -8(R1)                                    2
        LD   F10, -16(R1)       ADDD F4, F0, F2            3
        LD   F14, -24(R1)       ADDD F8, F6, F2            4
        LD   F18, -32(R1)       ADDD F12, F10, F2          5
        SD   0(R1), F4          ADDD F16, F14, F2          6
        SD   -8(R1), F8         ADDD F20, F18, F2          7
        SD   -16(R1), F12                                  8
        SD   -24(R1), F16                                  9
        SUBI R1, R1, #40                                  10
        BNEZ R1, LOOP                                     11
        SD   8(R1), F20                                   12

  • Unrolled 5 times to avoid delays (+1 due to SS)

  • 12 clocks, or 2.4 clocks per iteration


Multiple Issue

Multiple Instruction Issue & Dynamic Scheduling

Dynamic Scheduling in Superscalar

Code compiled for the scalar version will run poorly on the superscalar.

We may want the code to vary depending on how wide the superscalar is.

Simple approach: separate Tomasulo control, with separate reservation stations for the integer FU/registers and for the FP FU/registers.


Multiple Issue

Multiple Instruction Issue & Dynamic Scheduling

Dynamic Scheduling in Superscalar

  • How to do instruction issue with two instructions and keep in-order instruction issue for Tomasulo?

    • Issue 2X Clock Rate, so that issue remains in order

    • Only FP loads might cause dependency between integer and FP issue:

      • Replace load reservation station with a load queue; operands must be read in the order they are fetched

      • Load checks addresses in Store Queue to avoid RAW violation

      • Store checks addresses in Load Queue to avoid WAR,WAW
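A minimal C sketch of the address checks described above (hypothetical structure and function names; real hardware does this with associative lookups rather than loops):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { long addr; bool valid; } QueueEntry;

    /* A load must not bypass an older store to the same address (RAW). */
    bool load_may_proceed( long load_addr, const QueueEntry *store_q, size_t n )
    {
        for ( size_t i = 0; i < n; i++ )
            if ( store_q[i].valid && store_q[i].addr == load_addr )
                return false;               /* wait for the store's data */
        return true;
    }

    /* A store must not bypass earlier loads/stores to the same address
       (WAR, WAW).                                                       */
    bool store_may_proceed( long store_addr, const QueueEntry *load_q, size_t n )
    {
        for ( size_t i = 0; i < n; i++ )
            if ( load_q[i].valid && load_q[i].addr == store_addr )
                return false;
        return true;
    }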


Multiple Issue

Multiple Instruction Issue & Dynamic Scheduling

Performance of Dynamic Superscalar

Iteration   Instruction           Issues at     Executes at    Writes result at
   no.                            clock cycle   clock cycle    clock cycle
    1       LD    F0, 0(R1)            1              2               4
    1       ADDD  F4, F0, F2           1              5               8
    1       SD    0(R1), F4            2              9
    1       SUBI  R1, R1, #8           3              4               5
    1       BNEZ  R1, LOOP             4              5
    2       LD    F0, 0(R1)            5              6               8
    2       ADDD  F4, F0, F2           5              9              12
    2       SD    0(R1), F4            6             13
    2       SUBI  R1, R1, #8           7              8               9
    2       BNEZ  R1, LOOP             8              9

≈ 4 clocks per iteration

Branches, Decrements still take 1 clock cycle


Multiple Issue

VLIW

Loop Unrolling in VLIW

Memory reference 1   Memory reference 2   FP operation 1     FP operation 2     Int. op / branch   Clock
LD F0,0(R1)          LD F6,-8(R1)                                                                    1
LD F10,-16(R1)       LD F14,-24(R1)                                                                  2
LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2      ADDD F8,F6,F2                           3
LD F26,-48(R1)                            ADDD F12,F10,F2    ADDD F16,F14,F2                         4
                                          ADDD F20,F18,F2    ADDD F24,F22,F2                         5
SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2                                            6
SD -16(R1),F12       SD -24(R1),F16                                                                  7
SD -32(R1),F20       SD -40(R1),F24                                             SUBI R1,R1,#48       8
SD -0(R1),F28                                                                   BNEZ R1,LOOP         9

  • Unrolled 7 times to avoid delays

  • 7 results in 9 clocks, or 1.3 clocks per iteration

  • Need more registers to effectively use VLIW


Multiple Issue

Limitations With Multiple Issue

Limits to Multi-Issue Machines

  • Inherent limitations of ILP

    • 1 branch in 5 instructions => how to keep a 5-way VLIW busy?

    • Latencies of units => many operations must be scheduled

    • Need about Pipeline Depth x No. Functional Units of independent operations to keep machines busy.

  • Difficulties in building HW

    • Duplicate Functional Units to get parallel execution

    • Increase ports to the Register File (the VLIW example needs 6 read and 3 write ports for the integer registers & 6 read and 4 write ports for the FP registers)

    • Increase ports to memory

    • Decoding SS and impact on clock rate, pipeline depth


Multiple Issue

Limitations With Multiple Issue

Limits to Multi-Issue Machines

  • Limitations specific to either SS or VLIW implementation

    • Decode issue in SS

    • VLIW code size: unroll loops + wasted fields in VLIW

    • VLIW lock step => 1 hazard & all instructions stall

    • VLIW & binary compatibility


Multiple Issue

Limitations With Multiple Issue

Multiple Issue Challenges

  • While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:

    • Exactly 50% FP operations

    • No hazards

  • If more instructions issue at same time, greater difficulty of decode and issue

    • Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue

  • VLIW: tradeoff instruction space for simple decoding

    • The long instruction word has room for many operations

    • By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel

    • E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch

      • 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide

    • Need compiling technique that schedules across several branches


Compiler Support For ILP

  • How can compilers be smart?

  • 1. Produce good scheduling of code.

  • 2. Determine which loops might contain parallelism.

  • 3. Eliminate name dependencies.

  • Compilers must be REALLY smart to figure out aliases -- pointers in C are a real problem.

  • Techniques lead to:

    • Symbolic Loop Unrolling

    • Critical Path Scheduling

4.1 Compiler Techniques for Exposing ILP

4.3 Static Multiple Issue: VLIW

4.4 Advanced Compiler Support for ILP

4.5 Hardware Support for Exposing more Parallelism


Compiler Support For ILP

Symbolic Loop Unrolling

Software Pipelining

  • Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations

  • Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW)
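A rough C-level sketch of the idea (hypothetical function; the next slide shows the real assembly version). Each trip through the new loop body contains the store of one original iteration, the add of the next, and the load of the one after that:

    void add_scalar_swp( double *x, int n, double s )   /* assumes n >= 3 */
    {
        double loaded, summed;
        int i;

        loaded = x[n-1];                 /* prologue: load for iteration n-1 */
        summed = loaded + s;             /* prologue: add  for iteration n-1 */
        loaded = x[n-2];                 /* prologue: load for iteration n-2 */

        for ( i = n - 1; i >= 2; i-- ) {
            x[i]   = summed;             /* store for iteration i            */
            summed = loaded + s;         /* add   for iteration i-1          */
            loaded = x[i-2];             /* load  for iteration i-2          */
        }

        x[1] = summed;                   /* epilogue */
        x[0] = loaded + s;
    }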


Compiler Support For ILP

Symbolic Loop Unrolling

SW Pipelining Example

Before: Unrolled 3 times

 1         LD    F0, 0(R1)
 2         ADDD  F4, F0, F2
 3         SD    0(R1), F4
 4         LD    F6, -8(R1)
 5         ADDD  F8, F6, F2
 6         SD    -8(R1), F8
 7         LD    F10, -16(R1)
 8         ADDD  F12, F10, F2
 9         SD    -16(R1), F12
10         SUBI  R1, R1, #24
11         BNEZ  R1, LOOP

After: Software Pipelined

           LD    F0, 0(R1)
           ADDD  F4, F0, F2
           LD    F0, -8(R1)
 1  Loop:  SD    0(R1), F4       ; stores M[i]
 2         ADDD  F4, F0, F2      ; adds to M[i-1]
 3         LD    F0, -16(R1)     ; loads M[i-2]
 4         SUBI  R1, R1, #8
 5         BNEZ  R1, LOOP
           SD    0(R1), F4
           ADDD  F4, F0, F2
           SD    -8(R1), F4

[Pipeline diagram: SD, ADDD, and LD each flow through IF ID EX MEM WB; SD reads F4 before ADDD writes it, and ADDD reads F0 before LD writes it.]


Compiler Support For ILP

Symbolic Loop Unrolling

SW Pipelining Example

  • Symbolic Loop Unrolling

    • Less code space

    • Overhead paid only once vs. each iteration in loop unrolling

[Figure: Software Pipelining vs. Loop Unrolling; e.g., 100 iterations = 25 loops with 4 unrolled iterations each.]


Compiler Support For ILP

Critical Path Scheduling

Trace Scheduling

  • Parallelism across IF branches vs. LOOP branches

  • Two steps:

    • Trace Selection

      • Find a likely sequence of basic blocks (a trace) that forms a long, statically predicted or profile-predicted stretch of straight-line code

    • Trace Compaction

      • Squeeze trace into few VLIW instructions

      • Need bookkeeping code in case prediction is wrong

  • Compiler undoes bad guess (discards values in registers)

  • Subtle compiler bugs can mean a wrong answer rather than just poorer performance; there are no hardware interlocks
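A toy C sketch of the idea (hypothetical code, not from the text): the compiler picks the likely path through a branch, compacts it into straight-line code, and inserts bookkeeping code for the unlikely path:

    void trace_example( int likely_flag, double b, double c, double e,
                        double *a_out, double *d_out )
    {
        /* Original form:
               if (likely_flag) { a = b + c; d = a * e; }   -- taken ~99% of the time
               else             { a = 0;     d = e;     }                              */

        double a = b + c;        /* hot trace, compacted and scheduled first */
        double d = a * e;
        if ( !likely_flag ) {    /* bookkeeping code for the unlikely path   */
            a = 0;
            d = e;
        }
        *a_out = a;
        *d_out = d;
    }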


Hardware Support For Parallelism

  • Software support of ILP is best when code is predictable at compile time.

  • But what if there’s no predictability?

  • Here we’ll talk about hardware techniques. These include:

  • Conditional or Predicated Instructions

  • Hardware Speculation

4.1 Compiler Techniques for Exposing ILP

4.3 Static Multiple Issue: VLIW

4.4 Advanced Compiler Support for ILP

4.5 Hardware Support for Exposing more Parallelism


Hardware Support For Parallelism

Nullified Instructions

Tell the Hardware To Ignore An Instruction

  • Avoid branch prediction by turning branches into conditionally executed instructions:

    IF (x) then A = B op C else NOP

    • If false, then neither store result nor cause exception

    • Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move. PA-RISC can annul any following instruction.

    • IA-64: 64 one-bit predicate fields allow conditional execution of any instruction

  • Drawbacks to conditional instructions:

    • Still takes a clock, even if “annulled”

    • Stalls if condition evaluated late

    • Complex conditions reduce effectiveness; condition becomes known late in pipeline.

      This can be a major win because there is no time lost by taking a branch!!

[Diagram: A = B op C executed conditionally, guarded by condition x.]
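A small C sketch of the same transformation from the compiler's point of view (hypothetical code): the guarded operation always executes, and its result is kept only when the condition is true, which is what a conditional move provides:

    int predicated_update( int x, int A, int B, int C )
    {
        /* Branch form:  if (x) A = B + C;  -- the hardware must predict x. */

        /* Branchless form: always compute, then select.  Many compilers
           turn the ternary below into a conditional-move instruction.
           The computed value must be safe to produce unconditionally
           (no faults, no stores).                                          */
        int t = B + C;
        return x ? t : A;
    }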


Hardware Support For Parallelism

Nullified Instructions

Tell the Hardware To Ignore An Instruction

Suppose we have the code:

    if ( VarA == 0 )
        VarS = VarT;

Previous Method:

    LD    R1, VarA
    BNEZ  R1, Label
    LD    R2, VarT
    SD    VarS, R2
Label:

Nullified Method (compare and nullify the next instruction if not zero):

    LD      R1, VarA
    LD      R2, VarT
    CMPNNZ  R1, #0
    SD      VarS, R2

Conditional Move Method (compare and move if zero):

    LD     R1, VarA
    LD     R2, VarT
    CMOVZ  VarS, R2, R1


Compiler Speculation

Hardware Support For Parallelism

Increasing Parallelism

The theory here is to move an instruction across a branch so as to increase the size of a basic block and thus to increase parallelism.

The primary difficulty is in avoiding exceptions. For example,

if ( a != 0 ) c = b / a; may cause a divide-by-zero error in some cases if the division is moved above the test.

Methods for increasing speculation include:

1. Use a set of status bits (poison bits) associated with the registers. They signal that the instruction's result is invalid until some later time.

2. Result of instruction isn’t written until it’s certain the instruction is no longer speculative.


Compiler Speculation

Hardware Support For Parallelism

Increasing Parallelism

Example on Page 305. Code for:

    if ( A == 0 )
        A = B;
    else
        A = A + 4;

Assume A is at 0(R3) and B is at 0(R2).

Original Code:

        LW    R1, 0(R3)      ; Load A
        BNEZ  R1, L1         ; Test A
        LW    R1, 0(R2)      ; If clause
        J     L2             ; Skip else
L1:     ADDI  R1, R1, #4     ; Else clause
L2:     SW    0(R3), R1      ; Store A

Speculated Code:

        LW    R1, 0(R3)      ; Load A
        LW    R14, 0(R2)     ; Speculative load of B
        BEQZ  R1, L3         ; Branch on the other sense of the if
        ADDI  R14, R1, #4    ; Else clause
L3:     SW    0(R3), R14     ; Non-speculative store

Note here that only ONE side needs to take a branch!!


Compiler Speculation

Hardware Support For Parallelism

Poison Bits

In the example on the last page, if the LW* produces an exception, a poison bit is set on that register. Then, if a later instruction tries to use the register, an exception is raised at that point.

Speculated Code:

        LW    R1, 0(R3)      ; Load A
        LW*   R14, 0(R2)     ; Speculative load of B
        BEQZ  R1, L3         ; Branch on the other sense of the if
        ADDI  R14, R1, #4    ; Else clause
L3:     SW    0(R3), R14     ; Non-speculative store
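A minimal C sketch of the poison-bit bookkeeping just described (hypothetical structures and names, not the book's hardware):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        int64_t value;
        bool    poison;      /* set when a speculative instruction faulted */
    } Reg;

    /* Speculative load (the LW* above): on a fault, defer the exception by
       poisoning the destination register instead of raising it.            */
    void spec_load( Reg *dst, bool faulted, int64_t loaded_value )
    {
        if ( faulted ) {
            dst->poison = true;
        } else {
            dst->value  = loaded_value;
            dst->poison = false;
        }
    }

    /* A later non-speculative use of a poisoned register raises the
       deferred exception at that point.                                    */
    void use_reg( const Reg *src )
    {
        if ( src->poison ) {
            /* raise_exception(); */
        }
    }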


[Figure 4.34, page 311: FP Op Queue, Reorder Buffer, FP Regs, Reservation Stations, and FP Adders.]

Hardware Support For Parallelism

Hardware Speculation

HW support for More ILP

  • Need HW buffer for results of uncommitted instructions: reorder buffer

    • Reorder buffer can be operand source

    • Once an instruction commits, its result is found in the register file

    • 3 fields: instr. type, destination, value

    • Use reorder buffer number instead of reservation station

    • Discard instructions on mis-predicted branches or on exceptions
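A minimal C sketch of a reorder-buffer entry with the three fields named above, plus status flags (field names are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { ROB_ALU, ROB_LOAD, ROB_STORE, ROB_BRANCH } RobType;

    typedef struct {
        RobType type;        /* instruction type                             */
        int     dest;        /* destination register (or store address tag)  */
        int64_t value;       /* result; meaningful only once ready is set    */
        bool    ready;       /* has the result been produced yet?            */
        bool    busy;        /* is this entry allocated?                     */
    } RobEntry;

    /* Entries are allocated in program order and committed (freed) in order;
       an instruction whose source operand maps to entry e can read e.value
       once e.ready is set, before the value reaches the register file.       */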


Hardware Support For Parallelism

Hardware Speculation

HW support for More ILP

How is this used in practice?

Rather than predicting the direction of a branch, execute the instructions on both sides!!

We know the target of a branch early on, long before we know whether it will be taken.

So begin fetching/executing at that new Target PC.

But also continue fetching/executing as if the branch NOT taken.


Summary

4.1 Compiler Techniques for Exposing ILP

4.3 Static Multiple Issue: VLIW

4.4 Advanced Compiler Support for ILP

4.5 Hardware Support for Exposing more Parallelism
