
CMPUT680 - Winter 2006

Topic F: IA-64 Hardware Support for Software Pipelining

José Nelson Amaral

http://www.cs.ualberta.ca/~amaral/courses/680

CMPUT 680 - Compiler Design and Optimization

Suggested Reading

Intel IA-64 Architecture Software

Developer’s Manual, Chapters 8, 9

Instruction Group

An instruction group is a set of instructions that have no read after write (RAW) or write after write (WAW) register dependencies.

Consecutive instruction groups are separated by stops (represented by a double semicolon in the assembly code).

ld8 r1=[r5] // First group
sub r6=r8, r9 ;; // First group
add r3=r1,r4 // Second group
st8 [r6]=r12 // Second group

The add reads r1 and the st8 reads r6, both written in the first group, so they belong to the second group.
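The rule for instruction groups can be checked mechanically. The following is a minimal Python sketch, not an assembler: the helper names `parse`, `split_into_groups`, and `violations` are hypothetical, only general registers (`rN`) are tracked, and post-increment updates and predicates are ignored.

```python
import re

REG = re.compile(r"\br\d+\b")

def parse(line):
    """Return (written_registers, read_registers) for one simplified instruction."""
    text = line.split("//")[0].replace(";;", "")
    lhs, rhs = text.split("=", 1)
    if "[" in lhs:
        # A store: the bracketed register is an address that is read, not written.
        return set(), set(REG.findall(lhs)) | set(REG.findall(rhs))
    return set(REG.findall(lhs)), set(REG.findall(rhs))

def split_into_groups(lines):
    """Split a list of instructions into groups at each ';;' stop."""
    groups, current = [], []
    for line in lines:
        current.append(line)
        if ";;" in line.split("//")[0]:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

def violations(group):
    """List the RAW and WAW register dependencies found inside one group."""
    written, problems = set(), []
    for line in group:
        dests, srcs = parse(line)
        problems += [("RAW", r) for r in sorted(srcs & written)]
        problems += [("WAW", r) for r in sorted(dests & written)]
        written |= dests
    return problems

code = ["ld8 r1=[r5]", "sub r6=r8, r9 ;;", "add r3=r1,r4", "st8 [r6]=r12"]
groups = split_into_groups(code)
```

With the stop placed after the sub, both groups are free of violations; placing the add in the same group as the ld8 would be reported as a RAW dependency on r1.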

Instruction Bundles

Instructions are organized in bundles of three instructions, with the following format:

Bits      Field                Width (bits)
127-87    instruction slot 2   41
86-46     instruction slot 1   41
45-5      instruction slot 0   41
4-0       template             5
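The bundle layout (three 41-bit slots above a 5-bit template) amounts to simple bit-field packing. The sketch below illustrates only the field positions; the function names are hypothetical and the slot values are arbitrary integers, not real instruction encodings.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    assert 0 <= template < (1 << TEMPLATE_BITS)
    for slot in (slot0, slot1, slot2):
        assert 0 <= slot <= SLOT_MASK
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)

def unpack_bundle(bundle):
    """Return (template, slot0, slot1, slot2) from a 128-bit bundle."""
    return (
        bundle & 0x1F,
        (bundle >> 5) & SLOT_MASK,
        (bundle >> 46) & SLOT_MASK,
        (bundle >> 87) & SLOT_MASK,
    )

bundle = pack_bundle(0x10, 1, 2, 3)
fields = unpack_bundle(bundle)
```

The widths add up to 5 + 3 * 41 = 128 bits, which is why a bundle with every field at its maximum value is exactly 2^128 - 1.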

Bundles

In assembly, each 128-bit bundle is enclosed in

curly braces and contains a template specification

{ .mii

ld4 r28=[r8] // Load a 4-byte value

add r9=2,r1 // 2+r1 and put in r9

add r30=1,r1 // 1+r1 and put in r30

}

An instruction group can extend over an arbitrary

number of bundles.

Templates

There are restrictions on the type of instructions that

can be bundled together. The IA-64 has five slot types

(M, I, F, B, and L), six instruction types (M, I, A, F, B, L),

and twelve basic template types (MII, MI_I, MLX, MMI,

M_MI, MFI, MMF, MIB, MBB, BBB, MMB, and MFB).

The underscore in the bundle acronym indicates a stop.

Every basic bundle type has two versions: one

with a stop at the end of the bundle and one

without.

Control Dependency Preventing Code Motion

[Figure: block A ends with a branch (br); the load (ld) is in the following block B.]

In the code below, the ld4 is control dependent on the

branch, and thus cannot be safely moved up in

conventional processor architectures.

add r7=r6,1 // cycle 0

add r13=r25, r27

cmp.eq p1, p2=r12, r23

(p1) br.cond some_label ;;

ld4 r2=[r3] ;; // cycle 1

sub r4=r2, r11 // cycle 3

Control Speculation

In the following code, suppose a load latency of two cycles

(p1) br.cond.dptk L1 // cycle 0

ld8 r3=[r5] ;; // cycle 1

shr r7=r3,r87 // cycle 3

However, if we execute the load before we know that

we actually have to do it (control speculation), we get:

ld8.s r3=[r5] // earlier cycle

// other, unrelated instructions

(p1) br.cond.dptk L1 ;; // cycle 0

chk.s r3, recovery // cycle 1

shr r7=r3,r87 // cycle 1

The ld8.s instruction is a speculative load, and the chk.s instruction is a check instruction that verifies whether the loaded value is valid. If it is not (the speculative load deferred an exception), chk.s branches to the recovery code.

Ambiguous Memory Dependencies

An ambiguous memory dependency is a dependence

between a load and a store, or between two stores,

where it cannot be determined if the instructions

involved access overlapping memory locations.

Two or more memory references are independent

if it is known that they access non-overlapping

memory locations.

Data Speculation

An advanced load allows a load to be moved above a store even if it is not known whether the load and the store may reference overlapping memory locations.

Original code:

st8 [r55]=r45 // cycle 0
ld8 r3=[r5] ;; // cycle 0
shr r7=r3,r87 // cycle 2

With an advanced load:

ld8.a r3=[r5] ;; // Advanced Load
// other, unrelated instructions
st8 [r55]=r45 // cycle 0
ld8.c r3=[r5] ;; // cycle 0 - check
shr r7=r3,r87 // cycle 0

Moving Up Loads + Uses: Recovery Code

Original Code

st8 [r4] = r12 // cycle 0: ambiguous store
ld8 r6 = [r8] ;; // cycle 0: load to advance
add r5 = r6,r7 // cycle 2
st8 [r18] = r5 // cycle 3

Speculative Code

ld8.a r6 = [r8] ;; // cycle -3
// other, unrelated instructions
add r5 = r6,r7 // cycle -1; add that uses r6
// other, unrelated instructions
st8 [r4]=r12 // cycle 0
chk.a r6, recover // cycle 0: check
back: // Return point from jump to recover
st8 [r18] = r5 // cycle 0

recover:
ld8 r6 = [r8] ;; // Reload r6 from [r8]
add r5 = r6,r7 // Re-execute the add
br back // Jump back to main code

ld.c, chk.a and the ALAT

The execution of an advanced load, ld.a, creates an

entry in a hardware structure, the Advanced Load

Address Table (ALAT). This table is indexed by the

register number. Each entry records the load

address, the load type, and the size of the load.

When a check is executed, the ALAT is searched for a valid entry for that register with the specified type.


Entries are removed from the ALAT when:

(1) A store overlaps with the memory locations

specified in the ALAT entry;

(2) Another advanced load to the same register

is executed;

(3) There is a context switch caused by the

operating system (or hardware);

(4) Capacity limitation of the ALAT implementation

requires reuse of the entry.
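The four removal rules above can be illustrated with a small model of the ALAT. This is a sketch under stated assumptions rather than a description of any real implementation: the class and method names are hypothetical, the default capacity of 32 entries is an arbitrary choice, the load type is not modelled, and the oldest entry is the one reused when the table is full.

```python
class ALAT:
    """Toy model of the Advanced Load Address Table, indexed by register name."""

    def __init__(self, capacity=32):
        self.capacity = capacity      # assumed size; real tables are implementation specific
        self.entries = {}             # register -> (address, size in bytes)

    def ld_a(self, reg, addr, size):
        """Advanced load: record the address and size for the target register."""
        self.entries.pop(reg, None)                   # rule 2: same register, old entry removed
        if len(self.entries) >= self.capacity:        # rule 4: capacity forces reuse
            self.entries.pop(next(iter(self.entries)))
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """Rule 1: a store removes every entry whose bytes it overlaps."""
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a
        }

    def context_switch(self):
        """Rule 3: a context switch empties the table."""
        self.entries.clear()

    def check(self, reg):
        """ld.c / chk.a succeed only if a valid entry for the register survives."""
        return reg in self.entries

alat = ALAT()
alat.ld_a("r6", 0x1000, 8)
alat.store(0x2000, 8)            # does not overlap: the entry for r6 survives
survived = alat.check("r6")
alat.store(0x1004, 8)            # overlaps bytes 0x1004-0x1007: the entry is removed
after_conflict = alat.check("r6")
```

When the check fails, ld.c reloads the value and chk.a branches to the recovery code, as in the recovery example earlier.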

Not a Thing (NaT)

The IA-64 has 128 general purpose registers, each

with 64+1 bits, and 128 floating point registers, each

with 82 bits.

The extra bit in the GPRs is the NaT bit that is used to

indicate that the content of the register is not valid.

NaT=1 indicates that an instruction that generated an

exception wrote to the register. It is a way to defer

exceptions caused by speculative loads.

Any operation that uses NaT as an operand

results in NaT.
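The way NaT defers an exception from a speculative load to the later check can be sketched as follows. The model is deliberately simplified: `NAT` is a sentinel standing in for a register whose NaT bit is set, a load from an address missing from the dictionary stands in for a faulting load, and the function names `ld_s`, `alu`, and `chk_s` are hypothetical.

```python
import operator

NAT = object()   # sentinel: "this register holds Not a Thing"

def ld_s(memory, addr):
    """Speculative load: instead of raising a fault, mark the result as NaT."""
    return memory.get(addr, NAT)

def alu(op, a, b):
    """Any operation that uses NaT as an operand results in NaT."""
    if a is NAT or b is NAT:
        return NAT
    return op(a, b)

def chk_s(value, recovery):
    """Check: keep the speculative result, or run the recovery code if it is NaT."""
    return recovery() if value is NAT else value

memory = {0x100: 40}

good = chk_s(alu(operator.add, ld_s(memory, 0x100), 2), recovery=lambda: -1)
bad = chk_s(alu(operator.add, ld_s(memory, 0x999), 2), recovery=lambda: -1)
```

In the second case the NaT propagates through the addition and is only acted on at the check, which is the point at which the original program would have executed the load.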

If-conversion

If-conversion uses predicates to transform conditional code into code with a single control stream.

Source:

if(r4) {
add r1= r2, r3
ld8 r6=[r5]
}

After if-conversion:

cmp.ne p1, p0=r4, 0 ;; // Set predicate reg
(p1) add r1=r2, r3
(p1) ld8 r6=[r5]

Source:

if(r1)
r2 = r3 + r4
else
r7 = r6 - r5

After if-conversion:

cmp.ne p1, p2 = r1, 0 ;; // Set predicate reg
(p1) add r2 = r3, r4
(p2) sub r7 = r6,r5

Code Generation for Software Pipelining

z_0 ← &Z(1)
x_0 ← &X(1)
q_0 ← 0.0
DO k=1,N                        Lat.
(a) u_k ← load z_{k-1}          (6)
(b) v_k ← load x_{k-1}          (6)
(c) w_k ← u_k * v_k             (2)
(d) q_k ← q_{k-1} + w_k         (2)
(e) z_k ← z_{k-1} + 4           (1)
(f) x_k ← x_{k-1} + 4           (1)
END DO

z_0 ← &Z(1)
x_0 ← &X(1)
q_0 ← 0.0
(a1) u_1 ← load z_0
(b1) v_1 ← load x_0
(e1) z_1 ← z_0 + 4
(f1) x_1 ← x_0 + 4
(a2) u_2 ← load z_1
(b2) v_2 ← load x_1
(e2) z_2 ← z_1 + 4
(f2) x_2 ← x_1 + 4
(a3) u_3 ← load z_2
(b3) v_3 ← load x_2
(e3) z_3 ← z_2 + 4
(f3) x_3 ← x_2 + 4
(a4) u_4 ← load z_3
(b4) v_4 ← load x_3
(c1) w_1 ← u_1 * v_1
(e4) z_4 ← z_3 + 4
(f4) x_4 ← x_3 + 4

DO k=1,N-4
(a_{k+4}) u_{k+4} ← load z_{k+3}
(b_{k+4}) v_{k+4} ← load x_{k+3}
(c_{k+1}) w_{k+1} ← u_{k+1} * v_{k+1}
(d_k) q_k ← q_{k-1} + w_k
(e_{k+4}) z_{k+4} ← z_{k+3} + 4
(f_{k+4}) x_{k+4} ← x_{k+3} + 4
END DO
(c98) w_98 ← u_98 * v_98
(d97) q_97 ← q_96 + w_97
(c99) w_99 ← u_99 * v_99
(d98) q_98 ← q_97 + w_98
(c100) w_100 ← u_100 * v_100
(d99) q_99 ← q_98 + w_99
(d100) q_100 ← q_99 + w_100

Code Generation for Software Pipelining

z_0 ← &Z(1)
x_0 ← &X(1)
q_0 ← 0.0
DO k=1,4                        // prolog counter
(a) u_k ← load z_{k-1}
(b) v_k ← load x_{k-1}
(e) z_k ← z_{k-1} + 4
(f) x_k ← x_{k-1} + 4
END DO
(c) w_1 ← u_1 * v_1
DO k=5,N-4                      // loop counter
(a) u_{k+4} ← load z_{k+3}
(b) v_{k+4} ← load x_{k+3}
(c) w_{k+1} ← u_{k+1} * v_{k+1}
(d) q_k ← q_{k-1} + w_k
(e) z_{k+4} ← z_{k+3} + 4
(f) x_{k+4} ← x_{k+3} + 4
END DO

DO k=N-3,N                      // epilog counter
(c) w_{k+1} ← u_{k+1} * v_{k+1}
(d) q_k ← q_{k-1} + w_k
END DO
(d) q_100 ← q_99 + w_100
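The prolog / kernel / epilog structure can be checked against the plain loop with a short Python sketch. It is an illustration under assumptions, not the generated code: the arrays `u`, `v`, `w` stand in for the subscripted temporaries (so register allocation is ignored), the function names are hypothetical, and the index ranges are chosen here so that every element is loaded, multiplied, and accumulated exactly once.

```python
def dot_reference(Z, X):
    """The source loop: q = sum of Z(k) * X(k)."""
    q = 0.0
    for z, x in zip(Z, X):
        q = q + z * x
    return q

def dot_pipelined(Z, X):
    """Loads run four iterations ahead of the add; multiplies run one ahead."""
    n = len(Z)
    assert n == len(X) and n >= 4
    u = [None] * (n + 1)              # 1-based, like the subscripts u_k, v_k, w_k
    v = [None] * (n + 1)
    w = [None] * (n + 1)
    q = 0.0
    # Prolog: loads for iterations 1..4 and the multiply for iteration 1.
    for k in range(1, 5):
        u[k], v[k] = Z[k - 1], X[k - 1]
    w[1] = u[1] * v[1]
    # Kernel: one pair of loads, one multiply, and one add per trip (k = 1..n-4).
    for k in range(1, n - 3):
        u[k + 4], v[k + 4] = Z[k + 3], X[k + 3]
        w[k + 1] = u[k + 1] * v[k + 1]
        q = q + w[k]
    # Epilog: drain the multiplies and adds that are still in flight.
    for k in range(n - 3, n):
        w[k + 1] = u[k + 1] * v[k + 1]
        q = q + w[k]
    return q + w[n]

Z = [float(i) for i in range(1, 11)]
X = [float(2 * i) for i in range(1, 11)]
```

Both functions accumulate the products in the same order, so `dot_pipelined(Z, X)` returns exactly the same value as `dot_reference(Z, X)`.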

Code Generation for Software Pipelining (try 3)

Without software pipelining the following code could be generated.

R0 ← &Z(1)
R1 ← &X(1)
F0 ← 0.0
R2 ← 1
loop: F1 ← load [R0]
F2 ← load [R1]
F3 ← mult F1, F2
F0 ← add F0, F3
R0 ← add R0, 4
R1 ← add R1, 4
R2 ← add R2, 1
brne R2, N loop

But we still have not solved the register allocation problem. The code on the right needs a large number of registers. What can we do about it?


Optimization of Loops

L1: ld4 r4 = [r5], 4 ;; // Cycle 0 load postinc 4
add r7 = r4, r9 ;; // Cycle 2
st4 [r6] = r7, 4 // Cycle 3 store postinc 4
br.cloop L1 ;; // Cycle 3

Instructions Description:

ld4 r4 = [r5], 4        r4 ← MEM[r5]; r5 ← r5 + 4
st4 [r6] = r7, 4        MEM[r6] ← r7; r6 ← r6 + 4
br.cloop L1             if LC ≠ 0 then LC ← LC - 1; goto L1

Optimization of Loops

(a) L1: ld4 r4 = [r5], 4 ;;
(b) add r7 = r4, r9 ;;
(c) st4 [r6] = r7, 4
(d) br.cloop L1 ;;

Cycle   Iteration   Instructions issued
0       1           a
2       1           b
3       1           c/d
4       2           a
6       2           b
7       2           c/d
8       3           a
10      3           b
11      3           c/d
12      4           a
14      4           b

If LC=1000, how long does it take for this loop to execute? It takes 4000 cycles.


Optimization of Loops: Loop Unrolling

(a) L1: ld4 r4 = [r5], 4 ;;
(b) ld4 r14 = [r5], 4 ;;
(c) add r7 = r4, r9 ;;
(d) add r17 = r14, r9
(e) st4 [r6] = r7,4 ;;
(f) st4 [r6] = r17,4
(g) br.cloop L1 ;;

Cycle   Iteration   Instructions issued
0       1           a
1       1           b
2       1           c
3       1           d/e
4       1           f/g
5       2           a
6       2           b
7       2           c
8       2           d/e
9       2           f/g
10      3           a
11      3           b
12      3           c
13      3           d/e
14      3           f/g

For simplicity we assumed that N is a multiple of 2. Because the loads (a) and (b) both update r5, they have to be serialized.

If LC=1000 for the original loop, how long does it take for this loop to execute? It takes 2500 cycles. Thus the loop is 4000/2500 = 1.6 times faster.


Optimization of Loops: Expanding the Induction Variable

add r15 = 4, r5
add r16 = 4, r6 ;;
(a) L1: ld4 r4 = [r5], 8
(b) ld4 r14 = [r15], 8 ;;
(c) add r7 = r4, r9
(d) add r17 = r14, r9 ;;
(e) st4 [r6] = r7,8
(f) st4 [r16] = r17,8
(g) br.cloop L1 ;;

Cycle   Iteration   Instructions issued
0       1           a/b
2       1           c/d
3       1           e/f/g
4       2           a/b
6       2           c/d
7       2           e/f/g
8       3           a/b
10      3           c/d
11      3           e/f/g
12      4           a/b
14      4           c/d

We use twice as many functional units as the original code. But no instruction is issued in cycle 1, and functional units are still under-utilized.

If LC=1000 for the original loop, how long does it take for this loop to execute? It takes 2000 cycles. Thus the loop is 4000/2000 = 2.0 times faster.


Optimization of Loops: Further Loop Unrolling

add r15 = 4, r5
add r25 = 8, r5
add r35 = 12, r5
add r16 = 4, r6
add r26 = 8, r6
add r36 = 12, r6 ;;
(a) L1: ld4 r4 = [r5], 16
(b) ld4 r14 = [r15], 16 ;;
(c) ld4 r24 = [r25], 16
(d) ld4 r34 = [r35], 16 ;;
(e) add r7 = r4, r9
(f) add r17 = r14, r9 ;;
(g) st4 [r6] = r7,16
(h) st4 [r16] = r17,16
(i) add r27 = r24, r9
(j) add r37 = r34, r9 ;;
(k) st4 [r26] = r27, 16
(l) st4 [r36] = r37, 16
(m) br.cloop L1 ;;

Cycle   Iteration   Instructions issued
0       1           a/b
1       1           c/d
2       1           e/f
3       1           g/h/i/j
4       1           k/l/m
5       2           a/b
6       2           c/d
7       2           e/f
8       2           g/h/i/j
9       2           k/l/m
10      3           a/b
11      3           c/d
12      3           e/f
13      3           g/h/i/j
14      3           k/l/m

If LC=1000 for the original

loop, how long does

it take for this loop

(unrolled 4 times) to execute?

It takes 250*5=1250 cycles.

Thus the loop is

4000/1250 = 3.2 times faster
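The cycle counts quoted for the four versions of the loop all follow from the same arithmetic: the number of kernel trips (source iterations divided by the unroll factor) times the cycles per trip read off each schedule. A small sketch, with hypothetical names and the per-trip cycle counts taken from the schedules above:

```python
def loop_cycles(source_iterations, unroll_factor, cycles_per_trip):
    """Total cycles = number of kernel trips * cycles per trip."""
    trips = source_iterations // unroll_factor
    return trips * cycles_per_trip

# (unroll factor, cycles per trip) for each version of the loop
variants = {
    "original": (1, 4),
    "unrolled twice": (2, 5),
    "unrolled twice, expanded induction variable": (2, 4),
    "unrolled four times": (4, 5),
}

cycles = {name: loop_cycles(1000, u, c) for name, (u, c) in variants.items()}
speedup = {name: cycles["original"] / total for name, total in cycles.items()}
```

This reproduces 4000, 2500, 2000, and 1250 cycles, and speedups of 1.6, 2.0, and 3.2 over the original loop.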

Loop Optimization:Loop Unrolling

In the previous example we obtained

a good utilization of the functional units

through loop unrolling.

But at the cost of code expansion

and higher register pressure.

Software Pipelining offers an alternative

by overlapping the execution of operations

from multiple iterations of the loop.

Loop Optimization: Software Pipelining

(S1) ld4 r4 = [r5], 4
(S2) - - -
(S3) add r7 = r4, r9
(S4) st4 [r6] = r7, 4

* This is not real code

Cycle   It.1   It.2   It.3   It.4   It.5   It.6   It.7   Phase
0       S1                                               prologue
1              S1                                        prologue
2       S3            S1                                 prologue
3       S4     S3            S1                          kernel
4              S4     S3            S1                   kernel
5                     S4     S3            S1            kernel
6                            S4     S3            S1     kernel
7                                   S4     S3            epilogue
8                                          S4     S3     epilogue
9                                                 S4     epilogue
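The software-pipelined schedule of stages S1, S3, and S4 can be generated from two pieces of information: the cycle offset of each stage within one source iteration, and the number of cycles between the starts of consecutive iterations. The sketch below assumes one new iteration per cycle; the function name and the `initiation_interval` parameter are illustrative choices, not terms from the slides.

```python
def pipeline_schedule(stage_offsets, iterations, initiation_interval=1):
    """Map each cycle to the (stage, iteration) pairs issued in that cycle."""
    schedule = {}
    for it in range(1, iterations + 1):
        start = (it - 1) * initiation_interval
        for stage, offset in stage_offsets.items():
            schedule.setdefault(start + offset, []).append((stage, it))
    return schedule

# S2 is an empty stage, so S3 issues two cycles after S1 and S4 three cycles after.
stages = {"S1": 0, "S3": 2, "S4": 3}
sched = pipeline_schedule(stages, 7)

# Cycles in which all three stages are active form the kernel.
kernel_cycles = sorted(c for c, ops in sched.items() if len(ops) == len(stages))
total_cycles = max(sched) + 1
```

For seven iterations the kernel occupies cycles 3 to 6 and the whole loop takes 10 cycles; in general the loop needs the number of iterations plus three cycles, compared with four cycles per iteration for the original loop.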

Loop Optimization: Software Pipelining Code

// prologue
ld4 r4 = [r5], 4 ;; // load x[1]
ld4 r4 = [r5], 4 ;; // load x[2]
add r7 = r4, r9 // y[1] = x[1]+ k
ld4 r4 = [r5], 4 ;; // load x[3]

// kernel
L1: ld4 r4 = [r5], 4 // load x[i+3]
add r7 = r4, r9 // y[i+1] = x[i+1] + k
st4 [r6] = r7, 4 // store y[i]
br.cloop L1 ;;

// epilogue
st4 [r6] = r7, 4 // store y[n-2]
add r7 = r4, r9 ;; // y[n-1] = x[n-1] + k
st4 [r6] = r7, 4 // store y[n-1]
add r7 = r4,r9 ;; // y[n] = x[n] + k
st4 [r6] = r7, 4 // store y[n]

Support for Software Pipelining in the IA-64

After a loop is converted into a software pipeline, it looks quite different from the original loop. Intel adopts the following terminology:

source loop and source iteration: refer to the

original source code

kernel loop and kernel iteration: refer to the

code that implements the software pipeline.

Loop Support in the IA-64:Register Rotation

The IA-64 has a rotating register base (rrb) register that is decremented by special software pipelined loop branches.

When the rrb is decremented, the value stored in register X appears to move to register X+1, and the value of the highest numbered rotating register appears to move to the lowest numbered rotating register.

  • What registers can rotate?
    • The predicate registers p16-p63;
    • The floating-point registers f32-f127;
    • A programmable portion of the general registers:
      • The alloc instruction can allocate 0, 8, 16, 24, …, 96 general registers as rotating registers;
      • The lowest numbered rotating register is r32.
    • There are three rrbs: rrb.gr, rrb.fr, and rrb.pr.

How Register Rotation Helps Software Pipeline

The concept of a software pipelining branch:

L1: ld4 r35 = [r4], 4 // post-increment by 4

st4 [r5] = r37, 4 // post-increment by 4

swp_branch L1 ;;

The pseudo-instruction swp_branch in the example rotates

the general registers.

Therefore the value stored into r35 is read in r37 two kernel

iterations (and two rotations) later.

The register rotation eliminated a dependence between

the load and the store instructions, and allowed the loop to

execute in one cycle.
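Register rotation is a renaming step: the logical register number in the instruction is offset by the rrb, modulo the size of the rotating region, to select a physical register. The sketch below assumes a rotating region of eight general registers starting at r32; the function name and the `first` and `size` parameters are illustrative.

```python
def physical_register(logical, rrb, first=32, size=8):
    """Map a logical rotating register number to the physical register it names."""
    assert first <= logical < first + size
    return first + (logical - first + rrb) % size

# The load writes logical r35 while rrb = 0.
written = physical_register(35, rrb=0)

# After two rotations (rrb = -2) the same physical register is named r37,
# which is the register the store reads.
read_two_rotations_later = physical_register(37, rrb=-2)

# The highest numbered rotating register wraps around to the lowest one.
wrapped = physical_register(32, rrb=-1)
```

Because the load in one kernel iteration and the store in the same kernel iteration name different physical registers, they no longer depend on each other and can issue in the same cycle.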

[Figure: three snapshots of the mapping between the logical and physical registers r32-r39, for RRB = 0, -1, and -2. The values 7, 8, and 9 are loaded into logical r35 in successive kernel iterations; after two rotations the value 7 is read through logical r37.]

The stage predicate

When assembling a software pipeline, the programmer can assign a stage predicate to each stage of the pipeline to control the execution of the instructions in that stage.

p16 is architecturally defined as the predicate for the first stage, p17 for the second, and so on.

The software pipeline branch rotates the predicate registers and injects a 1 in p16, thus enabling one stage of the pipeline at a time during the execution of the prolog.

(S1): (p16) ld4 r4 = [r5], 4
(S2): (p17) - - -
(S3): (p18) add r7 = r4, r9
(S4): (p19) st4 [r6] = r7, 4

When the kernel (loop) counter reaches zero, the software pipeline branch starts to decrement the epilog counter and injects a 0 in p16 at every rotation to execute the epilogue of the software pipelined loop.

Anatomy of a Software Pipelining Branch

LC?    EC?    LC      EC      PR[16]   RRB      Result
≠ 0    any    LC--    EC      1        RRB--    branch (prolog/kernel)
= 0    > 1    LC      EC--    0        RRB--    branch (epilog)
= 0    = 1    LC      EC--    0        RRB--    fall-thru
= 0    = 0    LC      EC      0        RRB      fall-thru

Software Pipelining Example in the IA-64

mov pr.rot = 0 // Clear all rotating predicate registers
cmp.eq p16,p0 = r0,r0 // Set p16=1
mov ar.lc = 4 // Set loop counter to n-1
mov ar.ec = 3 // Set epilog counter to 3
loop:
(p16) ld1 r32 = [r12], 1 // Stage 1: load x
(p17) add r34 = 1, r33 // Stage 2: y=x+1
(p18) st1 [r13] = r35,1 // Stage 3: store y
br.ctop loop // Branch back
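The interaction of the stage predicates, the rotating registers, LC, and EC in this loop can be followed with a small simulation. This is a sketch under assumptions, not a model of the hardware: it assumes eight rotating general registers (r32-r39), tracks only p16-p18, requires EC to be at least 1 on entry, and uses the hypothetical name `run_ctop_loop`; the list `x` plays the role of the memory read through r12 and the returned list the memory written through r13.

```python
def run_ctop_loop(x, lc, ec, size=8):
    """Run the three-stage loop; return (stored values, number of kernel iterations)."""
    assert ec >= 1
    phys = [None] * size          # physical registers r32..r39
    p16, p17, p18 = 1, 0, 0       # stage predicates, with p16 preset to 1
    rrb = 0
    src = 0                       # position of the next element to load
    y = []
    trips = 0

    def idx(logical):
        """Index of the physical register currently named by a logical register."""
        return (logical - 32 + rrb) % size

    while True:
        trips += 1
        if p16:                                   # stage 1: load x
            phys[idx(32)] = x[src]
            src += 1
        if p17:                                   # stage 2: y = x + 1
            phys[idx(34)] = 1 + phys[idx(33)]
        if p18:                                   # stage 3: store y
            y.append(phys[idx(35)])
        # br.ctop: count down LC first, then EC once LC has reached zero.
        if lc != 0:
            lc, inject, taken = lc - 1, 1, True
        else:
            ec, inject, taken = ec - 1, 0, ec > 1
        p16, p17, p18 = inject, p16, p17          # rotate the predicates
        rrb -= 1                                  # rotate the registers
        if not taken:
            return y, trips

stored, trips = run_ctop_loop([10, 20, 30, 40, 50], lc=4, ec=3)
```

With LC = 4 and EC = 3 the five elements are loaded in the first five kernel iterations and the pipeline drains in two more, for seven kernel iterations in total.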

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 641

34

36

32

33

35

37

38

39

EC

LC

4

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16)ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

x1

34

36

32

33

35

37

38

39

General Registers (Logical)

Predicate Registers

Memory

1

0

0

18

16

17

x1

x2

x3

x4

x5

RRB

0

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 642

34

36

32

33

35

37

38

39

EC

LC

4

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

x1

34

36

32

33

35

37

38

39

General Registers (Logical)

Predicate Registers

Memory

1

0

0

18

16

17

x1

x2

x3

x4

x5

RRB

0

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 643

34

36

32

33

35

37

38

39

EC

LC

4

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

x1

34

36

32

33

35

37

38

39

General Registers (Logical)

Predicate Registers

Memory

1

0

0

18

16

17

x1

x2

x3

x4

x5

RRB

0

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 644

34

36

32

33

35

37

38

39

EC

LC

4

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

x1

35

37

33

34

36

38

39

32

General Registers (Logical)

Predicate Registers

Memory

1

0

0

1

18

16

17

x1

x2

x3

x4

x5

RRB

-1

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 645

34

36

32

33

35

37

38

39

EC

LC

3

3

1

1

0

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

x1

35

37

33

34

36

38

39

32

General Registers (Logical)

Predicate Registers

Memory

18

16

17

x1

x2

x3

x4

x5

RRB

-1

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 646

34

36

32

33

35

37

38

39

EC

LC

3

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16)ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

x1

x2

35

37

33

34

36

38

39

32

General Registers (Logical)

Predicate Registers

Memory

1

1

0

18

16

17

x1

x2

x3

x4

x5

RRB

-1

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 647

34

36

32

33

35

37

38

39

EC

LC

3

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17)add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y1

x1

x2

35

37

33

34

36

38

39

32

General Registers (Logical)

Predicate Registers

Memory

1

1

0

18

16

17

x1

x2

x3

x4

x5

RRB

-1

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 648

34

36

32

33

35

37

38

39

EC

LC

3

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y1

x1

x2

35

37

33

34

36

38

39

32

General Registers (Logical)

Predicate Registers

Memory

1

1

0

18

16

17

x1

x2

x3

x4

x5

RRB

-1

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 649

34

36

32

33

35

37

38

39

EC

LC

3

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y1

x1

x2

35

37

33

34

36

38

39

32

General Registers (Logical)

Predicate Registers

Memory

1

1

0

18

16

17

x1

x2

x3

x4

x5

RRB

-1

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6410

34

36

32

33

35

37

38

39

EC

LC

2

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y1

x1

x2

36

38

34

35

37

39

32

33

General Registers (Logical)

Predicate Registers

Memory

1

1

1

1

18

16

17

x1

x2

x3

x4

x5

RRB

-2

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6411

34

36

32

33

35

37

38

39

EC

LC

2

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16)ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

x1

y1

x3

x2

36

38

34

35

37

39

32

33

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

x4

x5

RRB

-2

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6412

34

36

32

33

35

37

38

39

EC

LC

2

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17)add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y2

y1

x3

x2

36

38

34

35

37

39

32

33

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

x4

x5

RRB

-2

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6413

34

36

32

33

35

37

38

39

EC

LC

2

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18)stl [r13] = r35,1

br.ctop loop

y2

y1

x3

x2

36

38

34

35

37

39

32

33

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

y1

x4

x5

RRB

-2

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6414

34

36

32

33

35

37

38

39

EC

LC

2

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y2

y1

x3

x2

36

38

34

35

37

39

32

33

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

y1

x4

x5

RRB

-2

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6415

34

36

32

33

35

37

38

39

EC

LC

1

3

1

1

1

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y2

y1

x3

x2

37

39

35

36

38

32

33

34

General Registers (Logical)

Predicate Registers

Memory

1

18

16

17

x1

x2

x3

y1

x4

x5

RRB

-3

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6416

34

36

32

33

35

37

38

39

EC

LC

1

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16)ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y2

y1

x4

x3

x2

37

39

35

36

38

32

33

34

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

y1

x4

x5

RRB

-3

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6417

34

36

32

33

35

37

38

39

EC

LC

1

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17)add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y2

y1

x4

x3

y3

37

39

35

36

38

32

33

34

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

y1

x4

x5

RRB

-3

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6418

34

36

32

33

35

37

38

39

EC

LC

1

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18)stl [r13] = r35,1

br.ctop loop

y2

y1

x4

x3

y3

37

39

35

36

38

32

33

34

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

y1

x4

y2

x5

RRB

-3

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6419

34

36

32

33

35

37

38

39

EC

LC

1

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y2

y1

x4

x3

y3

37

39

35

36

38

32

33

34

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

y1

x4

y2

x5

RRB

-3

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6420

34

36

32

33

35

37

38

39

EC

LC

0

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y2

y1

x4

x3

y3

38

32

36

37

39

33

34

35

General Registers (Logical)

Predicate Registers

Memory

1

1

1

1

18

16

17

x1

x2

x3

y1

x4

y2

x5

RRB

-4

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6421

34

36

32

33

35

37

38

39

EC

LC

0

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16)ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y2

y1

x5

x4

x3

y3

38

32

36

37

39

33

34

35

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

y1

x4

y2

x5

RRB

-4

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6422

34

36

32

33

35

37

38

39

EC

LC

0

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17)add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y2

y1

x5

x4

y4

y3

38

32

36

37

39

33

34

35

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

y1

x4

y2

x5

RRB

-4

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6423

34

36

32

33

35

37

38

39

EC

LC

0

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18)stl [r13] = r35,1

br.ctop loop

y2

y1

x5

x4

y4

y3

38

32

36

37

39

33

34

35

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

y1

x4

y2

x5

y3

RRB

-4

CMPUT 680 - Compiler Design and Optimization

software pipelining example in the ia 6424

34

36

32

33

35

37

38

39

EC

LC

0

3

Software Pipelining Example in the IA-64

General Registers (Physical)

loop:

(p16) ldl r32 = [r12], 1

(p17) add r34 = 1, r33

(p18) stl [r13] = r35,1

br.ctop loop

y2

y1

x5

x4

y4

y3

38

32

36

37

39

33

34

35

General Registers (Logical)

Predicate Registers

Memory

1

1

1

18

16

17

x1

x2

x3

y1

x4

y2

x5

y3

RRB

-4

CMPUT 680 - Compiler Design and Optimization

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y3). LC = 0, EC = 2, RRB = -5.]
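Each time RRB decreases, every logical register name moves to a different physical register. A minimal sketch of that renaming follows, assuming the eight rotating registers r32-r39 drawn on these slides; the function name physical_reg and its base and size parameters are illustrative and not part of the slides.

```python
def physical_reg(logical, rrb, base=32, size=8):
    """Map a logical rotating register number to its physical register.

    Assumes `size` rotating registers starting at `base`. Python's %
    returns a non-negative result for a positive modulus, so a negative
    rrb wraps around correctly.
    """
    return base + (logical - base + rrb) % size


# With RRB = -4, logical r32 names physical r36.
print(physical_reg(32, -4))   # 36
# After one more rotation (RRB = -5) the same physical register is r33,
# which is how the value loaded into r32 reaches the add as r33.
print(physical_reg(33, -5))   # 36
```

Because the hardware applies this renaming, the loop body can use fixed register names while each iteration's value travels from r32 (load) to r33 (add input), and the sum travels from r34 (add output) to r35 (store input).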

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y3). LC = 0, EC = 2, RRB = -5.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y3). LC = 0, EC = 2, RRB = -5.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y3). LC = 0, EC = 2, RRB = -5.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y4). LC = 0, EC = 2, RRB = -5.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y4). LC = 0, EC = 2, RRB = -5.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y4). LC = 0, EC = 1, RRB = -6.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y4). LC = 0, EC = 1, RRB = -6.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y4). LC = 0, EC = 1, RRB = -6.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y5). LC = 0, EC = 1, RRB = -6.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y5). LC = 0, EC = 1, RRB = -6.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y5). LC = 0, EC = 1, RRB = -6.]

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
br.ctop loop

[Figure: physical and logical general registers, predicate registers, and memory (x1-x5, y1-y5). LC = 0, EC = 0, RRB = -7.]
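The whole walkthrough can be replayed with a small Python simulation. This is a sketch under stated assumptions rather than a model of real hardware: the function name run_pipelined_loop is hypothetical, memory is represented by Python lists instead of the addresses in r12 and r13, only the three stage predicates p16-p18 and eight rotating registers r32-r39 are modeled, and the initial values LC = n - 1 and EC = 3 are assumed for a three-stage loop over n elements.

```python
def run_pipelined_loop(xs, nrot=8):
    """Simulate the three-stage loop: load x, add 1, store y.

    Returns (ys, lc, ec, rrb) once br.ctop falls through.
    xs must be non-empty.
    """
    regs = [None] * nrot              # physical rotating registers r32..r39
    preds = [1, 0, 0]                 # p16, p17, p18
    lc, ec, rrb = len(xs) - 1, 3, 0   # assumed initial LC, EC, RRB
    ys, next_x = [], 0

    def phys(logical):
        # Index of the physical register currently named `logical`.
        return (logical - 32 + rrb) % nrot

    while True:
        if preds[0]:                  # (p16) ldl r32 = [r12], 1
            regs[phys(32)] = xs[next_x]
            next_x += 1
        if preds[1]:                  # (p17) add r34 = 1, r33
            regs[phys(34)] = 1 + regs[phys(33)]
        if preds[2]:                  # (p18) stl [r13] = r35, 1
            ys.append(regs[phys(35)])

        # br.ctop loop
        if lc > 0:                    # kernel: start another iteration
            lc, new_p16, taken = lc - 1, 1, True
        elif ec > 1:                  # epilogue: drain the pipeline
            ec, new_p16, taken = ec - 1, 0, True
        else:                         # ec == 1: final pass, fall through
            ec, new_p16, taken = ec - 1, 0, False
        rrb -= 1                      # rotate the registers
        preds = [new_p16, preds[0], preds[1]]
        if not taken:
            return ys, lc, ec, rrb


print(run_pipelined_loop([10, 20, 30, 40, 50]))
# ([11, 21, 31, 41, 51], 0, 0, -7)
```

For five input elements the simulation ends with LC = 0, EC = 0 and RRB = -7, the same final state as the last slide, and every y value equals the corresponding x value plus one.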
