
IA-64: Advanced Loads, Speculative Loads, Software Pipelining



Presentation stolen from the web (with changes) from the Univ. of Alberta, Espen Skoglund, Thomas Richards (470 alum), and our textbook's authors.



IA-64

  • 128 64-bit general registers

    • Uses a register window scheme similar to SPARC's

  • 128 82-bit floating-point registers

  • 64 1-bit predicate registers

  • 8 64-bit branch target registers


Explicit Parallelism

  • Groups

    • Instructions that could be executed in parallel if hardware resources are available.

  • Bundle

    • Code format. 3 instructions fit into a 128-bit bundle.

    • 5 bits of template, 41*3 bits of instruction.

      • Template specifies what execution units each instruction requires.
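The bundle arithmetic (5 + 3 × 41 = 128 bits) can be illustrated with a small Python sketch. The helper names `pack_bundle` and `unpack_bundle` are hypothetical, and placing the template in the low 5 bits with the slots above it is an assumption of this sketch, not something the bullets state.

```python
# Sketch only: a 128-bit bundle modeled as a Python integer.
# Assumed layout: template in the low 5 bits, then three 41-bit slots.
TEMPLATE_BITS = 5
SLOT_BITS = 41


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a bundle back into (template, [slot, slot, slot])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    ]
    return template, slots
```

Packing the largest legal values gives an integer that still fits in 128 bits, which is a quick way to confirm the field widths add up.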


Instruction groups

  • IA-64 instructions are organized into instruction groups

    • No read-after-write dependencies within a group

    • No write-after-write dependencies within a group

    • All instructions in a group may be executed in parallel

    • New processors can easily take advantage of the existing ILP in the instruction group

  • Instruction groups are delimited by stop bits in the template

  • Instruction groups may end dynamically on branches


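Instruction bundles

[Figure: instruction bundle, 128 bits. Fields: Slot 3, Slot 2, Slot 1 (41 bits each) and Template (5 bits); bit boundaries marked at 127, 86, 45, 4 and 0, with the template in bits 4-0.]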

  • Instruction bundles contain

    • 3 instructions

    • A template field which maps instructions to execution units

  • Processor dispatches all three instructions in parallel

  • Instruction group may end in middle of bundle

  • Bundles are aligned on 16 byte boundaries


Predication

  • Use predicates to eliminate branches

    • Predicates are one-bit registers (total of 64)

  • Most instructions can be predicated

    (qp) mnemonic dest = source

  • Predicates are set by compare instructions

    (qp) cmp.crel px,py = source

  • C code:

    if (a == b)
        y += 3;
    else
        y += 4;

  • x86 assembly:

        cmp a, b
        je .eq
        add $4, y
        jmp .done
    .eq:
        add $3, y
    .done:

  • IA-64 assembly:

    cmp.eq p1,p2 = a,b
    (p1) add y = y, 3
    (p2) add y = y, 4
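As a rough illustration of why the predicated version needs no branch, the following Python sketch models the IA-64 sequence above. The function names are hypothetical, and each "instruction" is modeled as a plain function call: both adds are always issued, but only the one whose predicate is true changes y.

```python
def cmp_eq(a, b):
    """Model of cmp.eq p1,p2 = a,b: returns the complementary predicate pair."""
    return a == b, a != b


def predicated_add(pred, dest, src, imm):
    """Model of (pred) add dest = src, imm: dest is unchanged when pred is false."""
    return src + imm if pred else dest


def update_y(a, b, y):
    """if (a == b) y += 3; else y += 4;  written without any branch."""
    p1, p2 = cmp_eq(a, b)
    y = predicated_add(p1, y, y, 3)   # (p1) add y = y, 3
    y = predicated_add(p2, y, y, 4)   # (p2) add y = y, 4
    return y
```

Because p1 and p2 are complementary, exactly one of the two adds takes effect on every call.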


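Advanced loads and speculative loads

Advanced loads

Used to address data dependencies

Speculative loads

Used to address control dependencies

[Figure: two code-motion diagrams. Advanced load: a ld that sits below a st is hoisted above the st as an advanced load, and a check load is left at its original position. Speculative load: a ld that sits below a (p) br is hoisted above the branch as a speculative load, and a check speculation is left at its original position.]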


Advanced loads

addr1 and addr2 in the example might point to the same address

If different:

The datum at addr2 can be prefetched

If same:

The datum at addr2 cannot be prefetched

C code example:

int foo (int *addr1, int *addr2)
{
    int h;
    *addr1 = 4;
    h = *addr2;
    return h+1;
}


Advanced loads (continued)

Insert advanced loads (ld.a) to prefetch data (the load is recorded in the ALAT)

Use a check load instruction (ld.c) in place of the original load

If the memory contents have changed, perform a real load

Advanced loads do not defer exceptions (e.g., page faults)

Regular load:

add r3 = 4,r0 ;;
st4 [r32] = r3
ld4 r2 = [r33] ;;       // regular load
add r5 = r2,r3          // use data

Advanced load:

ld4.a r2 = [r33]        // advanced load
add r3 = 4,r0 ;;
st4 [r32] = r3
ld4.c r2 = [r33] ;;     // verify data
add r5 = r2,r3          // use data


Speculative loads

If addr in the example is legal, we can prefetch its value

If addr is illegal, prefetching the value would cause an exception

Any exception should be delayed until the code path has been resolved

C code example:

int add5 (int *addr)
{
    if (addr == NULL)
        return (-1);
    else
        return (*addr+5);
}


Speculative loads (continued)

Insert speculative loads (ld.s) to prefetch data

Verify the load using a check instruction (chk.s)

The NaT bit/NaTVal is used to track the success of the load

Might also be combined with advanced loads (ld.sa and chk.a)

Assembly code:

add5:
        ld8.s r1 = [r32]
        cmp.eq p6,p5 = r32,r0 ;;
(p6)    add r8 = -1,r0
(p6)    br.ret
(p5)    chk.s r1,return_error
        add r8 = 5,r1
        br.ret ;;

return_error:
        // recovery code


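Code example: "Why hoist loads?"

  • add r15 = r2,r3 //A

  • mult r4 = r15,r2 //B

  • mult r4 = r4,r4 //C

  • st8 [r12] = r4 //D

  • ld8 r5 = [r15] //E

  • div r6 = r5,r7 //F

  • add r5 = r6,r2 //G

  • Assume latencies are:

    • add, store: +0

    • mult, div: +3

    • ld: +4

[Figure: dependency graph with per-instruction cycle counts A:1, B:4, C:4, D:1, E:5, F:4, G:1. The chain A, B, C, D is labeled 10 cycles and the chain A, E, F, G is labeled 11 cycles; the total of 20 cycles corresponds to running the load E only after the store D.]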
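The payoff from hoisting can be checked with a short Python sketch. It assumes each instruction costs 1 cycle plus the extra latency listed above, that issue width is unlimited, and that the only constraints are the register dependencies read off the code (plus, in the non-hoisted case, the load E waiting for the store D). The dictionary names are hypothetical.

```python
# Extra latency per opcode, taken from the list above.
EXTRA = {"add": 0, "st8": 0, "mult": 3, "div": 3, "ld8": 4}
OPS = {"A": "add", "B": "mult", "C": "mult", "D": "st8",
       "E": "ld8", "F": "div", "G": "add"}
# Register dependencies: each instruction maps to the producers it waits for.
DEPS = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"],
        "E": ["A"], "F": ["E"], "G": ["F"]}


def cost(name):
    """Cycles occupied by one instruction: 1 plus its extra latency."""
    return 1 + EXTRA[OPS[name]]


def schedule_length(deps):
    """Length of the longest dependency chain, in cycles."""
    done = {}

    def finish(name):
        if name not in done:
            start = max((finish(p) for p in deps[name]), default=0)
            done[name] = start + cost(name)
        return done[name]

    return max(finish(name) for name in deps)


hoisted = schedule_length(DEPS)
# Without hoisting, the load E must also wait for the store D.
not_hoisted = schedule_length({**DEPS, "E": ["A", "D"]})
print(hoisted, not_hoisted)  # 11 20
```

Under these assumptions, letting the load start right after A nearly halves the schedule, which is the motivation for moving loads above stores.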


Advanced Loads Recovery

  • Case A: Hoist just the load.

    • In this case, if there is a memory dependency we just re-execute the load.

// Case A: Advanced Load
ld.a r2 = [r10]
st8 [r1] = r9
ld.c r2 = [r10]
add r5 = r2, r3
st8 [r18] = r19

  • Case B: Hoist the load and dependent instructions.

    • In this case, we need to re-execute all of the dependent instructions.

// Case B: Advanced Load
// With Speculative Add
ld.a r2 = [r10]
add r5 = r2, r3
st8 [r1] = r9
ld.c r2 = [r10] // Wrong
st8 [r18] = r19

A ld.c will only re-execute the load; r5 is still wrong after the ld.c!


Advanced Load-Use Recovery: Compiler Generated Recovery Code

// Solution: Using the chk.a instruction
ld8.a r2 = [r10]
add r5 = r2, r3
st8 [r1] = r9
chk.a r2, fixup
return: // Return Point
st8 [r18] = r19
......
......
fixup: // Re-execute load and all speculative uses
ld8 r2 = [r10]
add r5 = r2, r3
br return

  • Use ld.c if JUST a load is speculative. Use chk.a if a load and an instruction that is dependent on the load are both speculative.


The Advanced Load Address Table (ALAT)

  • The ALAT tells us if we need to recover from an advanced load.

  • When an advanced load is executed, save the type of load, size of load, and load address in the ALAT (indexed by the physical target register, PR).

  • When we execute a ld.c or chk.a, look for the entry in the ALAT. If it is missing, recover: ld.c re-executes the load, chk.a branches to the recovery code.

  • Remove an entry from the ALAT if

    • A store address overlaps an ALAT entry.

    • Capacity/associativity evictions occur.

    • Another advanced load is indexed by the same PR.
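The rules above can be made concrete with a toy Python model. This is a sketch under simplifying assumptions: the `ALAT` class and its method names are hypothetical, entries are keyed by a register name, memory is a dictionary from address to value, and capacity or associativity evictions are left out.

```python
class ALAT:
    """Toy ALAT: maps a target register to the (address, size) of its advanced load."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, size):
        # A later advanced load to the same register replaces the old entry.
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        # Drop every entry whose byte range overlaps the stored byte range.
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a
        }

    def check(self, reg):
        """True if the entry survived, so the speculated value is still valid."""
        return reg in self.entries


def foo(memory, addr1, addr2):
    """foo() from the advanced loads slide, with the load hoisted above the store."""
    alat = ALAT()
    h = memory[addr2]                  # ld4.a r2 = [addr2]
    alat.advanced_load("r2", addr2, 4)
    memory[addr1] = 4                  # st4 [addr1] = 4
    alat.store(addr1, 4)
    if not alat.check("r2"):           # ld4.c r2 = [addr2]
        h = memory[addr2]              # entry is gone: redo the load
    return h + 1
```

When addr1 and addr2 differ, the entry survives and the early value is used; when they alias, the store removes the entry and the check reloads the freshly stored 4.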


Control Speculation and Recovery

  • What if we want to move a load above a branch?

    • The problem is that the load might not have been meant to execute at all, and might raise a spurious exception.

  • Similar to advanced loads, but no ALAT.

    • Instead, check the NaT bit for deferred exceptions.

      • See next slide.

    • Use chk.s for recovery (instead of chk.a or ld.c).

// Control Speculation and Recovery
ld8.s r1 = [r10]              // load moved above the branch
st8 [r11] = r9
(p1) br.cond branch_label     // p1 is a predicate register
chk.s r1, recovery
return:
add r2 = r1, r2

  • chk.s checks r1 to see if the NaT bit is set. If so, it branches to the recovery code (which re-executes instructions if necessary).


Not a Thing Bit (NaT)

  • Each IA-64 general register holds 64 bits + 1 NaT bit.

  • If a control speculative load causes an exception, the processor can set this bit, which defers the exception.

  • NaT bits propagate.

    • Propagation allows a single check for multiple ld.s.

ld8.s r1 = [r10]
ld8.s r2 = [r11]
add r3 = r1, r2
ld8.s r4 = [r3]
st8 [r11] = r9
(p1) br.cond branch_label
chk.s r4, recovery
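NaT propagation through the chain above can be sketched in Python. The `Reg` type and the helper names are hypothetical; a faulting load is modeled as an address missing from a dictionary, which is an assumption of this sketch and not how hardware detects faults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False   # the extra NaT bit


def ld_s(memory, addr):
    """Speculative load: a NaT or unmapped address yields NaT instead of faulting."""
    if addr.nat or addr.value not in memory:
        return Reg(nat=True)
    return Reg(memory[addr.value])


def add(a, b):
    """NaT propagates: the result is NaT if either source is NaT."""
    if a.nat or b.nat:
        return Reg(nat=True)
    return Reg(a.value + b.value)


def chk_s(reg):
    """True means: branch to the recovery code."""
    return reg.nat


def chain(memory, r10, r11):
    """The four-instruction chain above; only r4 needs to be checked."""
    r1 = ld_s(memory, Reg(r10))
    r2 = ld_s(memory, Reg(r11))
    r3 = add(r1, r2)
    r4 = ld_s(memory, r3)
    return r4
```

A failure in any of the three loads ends up in r4, so one chk.s on r4 covers the whole chain.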


Software pipelining on IA-64

  • Lots of tricks

    • Rotating registers

    • Special counters

  • Often don't need a prolog and epilog.

    • Special counters and predication let us execute only those instructions we need.


Prolog and epilog (from before)

      r3=r3-8            // Need to check that this is legal!
      r4=MEM[r2+0]       // A(1)
      r1=r4*2            // B(1)
      r4=MEM[r2+4]       // A(2)
Loop: MEM[r2+0]=r1       // C(n)
      r1=r4*2            // B(n+1)
      r4=MEM[r2+8]       // A(n+2)
      r2=r2+4            // D(n)
      bne r2 r3 Loop     // E(n)
      MEM[r2+0]=r1       // C(x-1)
      r1=r4*2            // B(x)
      MEM[r2+4]=r1       // C(x)
      r3=r3+8            // Could have used tmp var.
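To see that the prolog/kernel/epilog schedule computes the same result as the plain loop, here is a Python sketch. It uses element indices in place of byte addresses and assumes at least three elements, because the kernel runs at least once before its exit test; the final store C(x) is written one element past C(x-1). The function names are hypothetical.

```python
def plain_loop(mem):
    """Reference version: double every element."""
    return [2 * v for v in mem]


def pipelined_loop(mem):
    """Prolog, kernel and epilog in the order shown above."""
    mem = list(mem)
    x = len(mem)
    if x < 3:
        raise ValueError("this schedule needs at least three elements")
    i = 0
    r4 = mem[i]              # A(1)
    r1 = r4 * 2              # B(1)
    r4 = mem[i + 1]          # A(2)
    while True:
        mem[i] = r1          # C(n)
        r1 = r4 * 2          # B(n+1)
        r4 = mem[i + 2]      # A(n+2)
        i += 1               # D(n)
        if i == x - 2:       # E(n): leave the loop two elements before the end
            break
    mem[i] = r1              # C(x-1)
    r1 = r4 * 2              # B(x)
    mem[i + 1] = r1          # C(x), one element past C(x-1)
    return mem
```

Each load in the kernel reads an element that has not been stored to yet, so overlapping the iterations does not change the values read.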


There are three special purpose registers used in IA-64 for software pipelining

  • Loop counter (LC) indicates how many times to run through the loop (prolog/kernel)

    • Initialized to N-1 before starting loop code

    • Decremented until LC == 0

  • Epilog counter (EC) indicates how many times to run the loop after the loop counter is exhausted (epilog)

    • Needed to flush the software pipeline

    • Initialized to num-stages before entering loop code

    • Decremented if LC == 0 and EC > 0


And RRB (Register Rename Base)

  • Add the internal counter RRB to the register number to get the register actually used

    • Counter decremented by special loop branch instructions

    • May be reset by the clrrrb instruction

    • Use modular lookup (so we wrap around!)

  • Rotated predicate registers

    • Initially set using: mov pr.rot = value

    • pr63 is written (to 1 or 0) before every rotation


How does register rotation work? (Basics)

  • Rotated registers:

    • General: gr32 - grN (as specified by the alloc instruction)

    • Predicate: pr16 - pr63

    • Floating point: fr16 - fr127

  • Registers are rotated to higher numbers

    • Register rn is renamed to rn+1, rmax is renamed to rmin

  • Registers are rotated by specific loop branch instructions

    • br.ctop, br.cexit (for counted loops)

    • br.wtop, br.wexit (for while loops)
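The renaming rule can be written as a one-line modular lookup. This Python sketch assumes a rotating region of eight general registers, r32 to r39; the function name and its `first` and `count` parameters are hypothetical.

```python
def physical_register(logical, rrb, first=32, count=8):
    """Map a logical register number to a physical one by adding RRB
    modulo the size of the rotating region."""
    if not first <= logical < first + count:
        return logical                 # static registers are not renamed
    return first + (logical - first + rrb) % count
```

With RRB = -1 (one rotation done), logical r33 maps to physical r32, so a value written to r32 before the rotation is read as r33 after it, and logical r32 wraps around to physical r39.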


How they relate: ctop, cexit

          LC != 0           LC == 0, EC > 1   LC == 0, EC == 1   LC == 0, EC == 0
          (prolog/kernel)   (epilog)          (epilog)           (special unrolled loops)
LC        LC--              LC=LC             LC=LC              LC=LC
EC        EC=EC             EC--              EC--               EC=EC
PR[63]    1                 0                 0                  0
RRB       RRB--             RRB--             RRB--              RRB=RRB
ctop      branch            branch            fall-thru          fall-thru
cexit     fall-thru         fall-thru         branch             branch
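The four cases can be expressed as a small state-transition function. This is a Python sketch with hypothetical names; it models only br.ctop (br.cexit updates the same state but takes the opposite branch decision).

```python
def br_ctop(lc, ec, rrb):
    """Return (lc, ec, rrb, pr63, taken) after one br.ctop."""
    if lc != 0:                          # prolog / kernel
        return lc - 1, ec, rrb - 1, 1, True
    if ec > 1:                           # epilog, more passes to go
        return lc, ec - 1, rrb - 1, 0, True
    if ec == 1:                          # last epilog pass: fall through
        return lc, ec - 1, rrb - 1, 0, False
    return lc, ec, rrb, 0, False         # ec == 0: fall through, nothing rotates


def trip_count(lc, ec):
    """Number of passes through the loop body for the given initial counters."""
    trips, rrb, taken = 0, 0, True
    while taken:
        trips += 1
        lc, ec, rrb, _, taken = br_ctop(lc, ec, rrb)
    return trips
```

With LC = N-1 and EC equal to the number of stages, the body runs N + stages - 1 times, which is what is needed to fill and then drain the pipeline.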


    Software pipelining example in the ia 64

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    4

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    34

    36

    32

    33

    35

    37

    38

    39

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    0

    0

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    0


    Software pipelining example in the ia 641

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    4

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16)ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    x1

    34

    36

    32

    33

    35

    37

    38

    39

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    0

    0

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    0


    Software pipelining example in the ia 642

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    4

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    x1

    34

    36

    32

    33

    35

    37

    38

    39

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    0

    0

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    0


    Software pipelining example in the ia 643

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    4

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    x1

    34

    36

    32

    33

    35

    37

    38

    39

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    0

    0

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    0


    Software pipelining example in the ia 644

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    4

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    x1

    35

    37

    33

    34

    36

    38

    39

    32

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    0

    0

    1

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    -1


    Software pipelining example in the ia 645

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    3

    3

    1

    1

    0

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    x1

    35

    37

    33

    34

    36

    38

    39

    32

    General Registers (Logical)

    Predicate Registers

    Memory

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    -1


    Software pipelining example in the ia 646

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    3

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16)ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    x1

    x2

    35

    37

    33

    34

    36

    38

    39

    32

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    0

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    -1


    Software pipelining example in the ia 647

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    3

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17)add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y1

    x1

    x2

    35

    37

    33

    34

    36

    38

    39

    32

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    0

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    -1


    Software pipelining example in the ia 648

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    3

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y1

    x1

    x2

    35

    37

    33

    34

    36

    38

    39

    32

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    0

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    -1


    Software pipelining example in the ia 649

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    3

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y1

    x1

    x2

    35

    37

    33

    34

    36

    38

    39

    32

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    0

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    -1


    Software pipelining example in the ia 6410

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    2

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y1

    x1

    x2

    36

    38

    34

    35

    37

    39

    32

    33

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    -2


    Software pipelining example in the ia 6411

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    2

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16)ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    x1

    y1

    x3

    x2

    36

    38

    34

    35

    37

    39

    32

    33

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    -2


    Software pipelining example in the ia 6412

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    2

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17)add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x3

    x2

    36

    38

    34

    35

    37

    39

    32

    33

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    x4

    x5

    RRB

    -2


    Software pipelining example in the ia 6413

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    2

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18)stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x3

    x2

    36

    38

    34

    35

    37

    39

    32

    33

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    x5

    RRB

    -2


    Software pipelining example in the ia 6414

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    2

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x3

    x2

    36

    38

    34

    35

    37

    39

    32

    33

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    x5

    RRB

    -2


    Software pipelining example in the ia 6415

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    1

    3

    1

    1

    1

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x3

    x2

    37

    39

    35

    36

    38

    32

    33

    34

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    x5

    RRB

    -3


    Software pipelining example in the ia 6416

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    1

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16)ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x4

    x3

    x2

    37

    39

    35

    36

    38

    32

    33

    34

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    x5

    RRB

    -3


    Software pipelining example in the ia 6417

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    1

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17)add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x4

    x3

    y3

    37

    39

    35

    36

    38

    32

    33

    34

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    x5

    RRB

    -3


    Software pipelining example in the ia 6418

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    1

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18)stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x4

    x3

    y3

    37

    39

    35

    36

    38

    32

    33

    34

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    RRB

    -3


    Software pipelining example in the ia 6419

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    1

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x4

    x3

    y3

    37

    39

    35

    36

    38

    32

    33

    34

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    RRB

    -3


    Software pipelining example in the ia 6420

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x4

    x3

    y3

    38

    32

    36

    37

    39

    33

    34

    35

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    RRB

    -4


    Software pipelining example in the ia 6421

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16)ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    x4

    x3

    y3

    38

    32

    36

    37

    39

    33

    34

    35

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    RRB

    -4


    Software pipelining example in the ia 6422

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17)add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    x4

    y4

    y3

    38

    32

    36

    37

    39

    33

    34

    35

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    RRB

    -4


    Software pipelining example in the ia 6423

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18)stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    x4

    y4

    y3

    38

    32

    36

    37

    39

    33

    34

    35

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    y3

    RRB

    -4


    Software pipelining example in the ia 6424

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    3

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    x4

    y4

    y3

    38

    32

    36

    37

    39

    33

    34

    35

    General Registers (Logical)

    Predicate Registers

    Memory

    1

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    y3

    RRB

    -4


    Software pipelining example in the ia 6425

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    2

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    x4

    y4

    y3

    39

    33

    37

    38

    32

    34

    35

    36

    General Registers (Logical)

    Predicate Registers

    Memory

    0

    1

    1

    0

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    y3

    RRB

    -5


    Software pipelining example in the ia 6426

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    2

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    x4

    y4

    y3

    39

    33

    37

    38

    32

    34

    35

    36

    General Registers (Logical)

    Predicate Registers

    Memory

    0

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    y3

    RRB

    -5


    Software pipelining example in the ia 6427

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    2

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17)add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    y5

    y4

    y3

    39

    33

    37

    38

    32

    34

    35

    36

    General Registers (Logical)

    Predicate Registers

    Memory

    0

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    y3

    RRB

    -5


    Software pipelining example in the ia 6428

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    2

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18)stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    y5

    y4

    y3

    39

    33

    37

    38

    32

    34

    35

    36

    General Registers (Logical)

    Predicate Registers

    Memory

    0

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    y3

    RRB

    y4

    -5


    Software pipelining example in the ia 6429

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    2

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    y5

    y4

    y3

    39

    33

    37

    38

    32

    34

    35

    36

    General Registers (Logical)

    Predicate Registers

    Memory

    0

    1

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    y3

    RRB

    y4

    -5


    Software pipelining example in the ia 6430

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    1

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    y5

    y4

    y3

    32

    34

    38

    39

    33

    35

    36

    37

    General Registers (Logical)

    Predicate Registers

    Memory

    0

    0

    1

    0

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    y3

    RRB

    y4

    -6


    Software pipelining example in the ia 6431

    34 software pipelining

    36

    32

    33

    35

    37

    38

    39

    EC

    LC

    0

    1

    Software Pipelining Example in the IA-64

    General Registers (Physical)

    loop:

    (p16) ldl r32 = [r12], 1

    (p17) add r34 = 1, r33

    (p18) stl [r13] = r35,1

    br.ctop loop

    y2

    y1

    x5

    y5

    y4

    y3

    32

    34

    38

    39

    33

    35

    36

    37

    General Registers (Logical)

    Predicate Registers

    Memory

    0

    0

    1

    18

    16

    17

    x1

    x2

    x3

    y1

    x4

    y2

    x5

    y3

    RRB

    y4

    -6


    Software Pipelining Example in the IA-64

    loop:
    (p16) ldl r32 = [r12], 1
    (p17) add r34 = 1, r33
    (p18) stl [r13] = r35, 1
    br.ctop loop

    [Figure: rotating-register diagram with panels General Registers (Physical), General Registers (Logical, r32-r39), Predicate Registers (p16-p18) and Memory. Memory now holds x1-x5 and y1-y5; only one stage predicate is still 1; the EC and LC boxes show 0 and 1; RRB = -6.]


    Software Pipelining Example in the IA-64

    loop:
    (p16) ldl r32 = [r12], 1
    (p17) add r34 = 1, r33
    (p18) stl [r13] = r35, 1
    br.ctop loop

    [Figure: rotating-register diagram with panels General Registers (Physical), General Registers (Logical, r32-r39), Predicate Registers (p16-p18) and Memory. Memory holds x1-x5 and y1-y5; all stage predicates are 0; EC = 0 and LC = 0; RRB = -7.]
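The frames of this example can be replayed with a small simulation. What follows is only a sketch in Python, not IA-64 code: it assumes the loop was entered with LC = 4, EC = 3 and p16 = 1 (the setup is not part of this excerpt), models just the three stage predicates instead of the full rotating predicate file, and uses plain lists (`xs`, `ys`) in place of the memory addressed through r12 and r13. The name `run_pipelined_loop` is made up for illustration.

```python
def run_pipelined_loop(xs):
    """Replay the three-stage br.ctop kernel for y[i] = x[i] + 1."""
    if not xs:
        raise ValueError("need at least one element")
    nrot = 8                          # rotating window r32..r39
    phys = [None] * nrot              # physical slots behind the logical names
    p16, p17, p18 = 1, 0, 0           # stage predicates; p16 preset (assumption)
    lc, ec, rrb = len(xs) - 1, 3, 0   # assumed setup: LC = n - 1, EC = 3 stages
    src = 0                           # stands in for the post-incremented r12
    ys = []                           # stands in for memory written through r13
    trace = []

    def slot(logical):
        # Register rotation: the logical number is offset by RRB, modulo the window.
        return (logical - 32 + rrb) % nrot

    while True:
        trace.append((rrb, lc, ec, p16, p17, p18))
        # Kernel body: an instruction only takes effect if its predicate is 1.
        if p18:
            ys.append(phys[slot(35)])             # (p18) stl [r13] = r35, 1
        if p17:
            phys[slot(34)] = phys[slot(33)] + 1   # (p17) add r34 = 1, r33
        if p16:
            phys[slot(32)] = xs[src]              # (p16) ldl r32 = [r12], 1
            src += 1
        # br.ctop: decisions use the counter values seen on entry to the branch.
        taken = lc != 0 or ec > 1
        rotate = lc != 0 or ec >= 1
        next_p16 = 1 if lc != 0 else 0
        if lc != 0:
            lc -= 1
        elif ec >= 1:
            ec -= 1
        if rotate:
            rrb -= 1
            p16, p17, p18 = next_p16, p16, p17    # predicates rotate with the registers
        if not taken:
            break
    trace.append((rrb, lc, ec, p16, p17, p18))
    return ys, trace


ys, trace = run_pipelined_loop([10, 20, 30, 40, 50])
print(ys)
for state in trace:
    print("RRB=%d LC=%d EC=%d p16=%d p17=%d p18=%d" % state)
```

Under these assumptions the kernel runs seven times for five elements, and the last trace entry is RRB = -7 with LC, EC and all three predicates at 0, which agrees with the final frame. The entries at RRB = -5 and RRB = -6 show the epilog draining: first the load stage is switched off, then the add stage, and the store stage last.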


    IA-64 Software Pipelining Review

    • No prolog or epilog in code

      • But we execute a lot of no-ops (instructions whose stage predicate is off).

    • Rotated registers help

      • In this case, we just didn’t have to reverse the code ordering.

        • In general they help even more: a load can be moved more than one loop iteration away from its use.

    • Looks good at least in this case…
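The cost of dropping the prolog and epilog can be estimated with a little arithmetic. This is a rough sketch that assumes a kernel with one predicated instruction per stage, as in the three-instruction example, and ignores the branch itself; `kernel_slot_usage` is an illustrative name, not part of any toolchain.

```python
def kernel_slot_usage(iterations, stages):
    """Count predicated instruction slots for a kernel-only pipelined loop.

    Returns (kernel_passes, issued_slots, useful_slots, nullified_slots).
    """
    passes = iterations + stages - 1   # fill and drain add stages - 1 passes
    issued = passes * stages           # every pass issues every stage's slot
    useful = iterations * stages       # each iteration goes through each stage once
    return passes, issued, useful, issued - useful


print(kernel_slot_usage(5, 3))      # the five-element, three-stage example
print(kernel_slot_usage(1000, 3))   # a longer loop with the same kernel
```

In this model the number of nullified slots is stages × (stages - 1) regardless of the trip count, so the no-ops matter most for short loops like the five-element example and fade for long ones.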


    IA-64 Review

    • Some problems

      • ALAT is difficult for compilers to use.

        • Recall Colwell talking about “once we figure out how to do this…”

      • Instruction size of 128/3 (about 43) bits makes I-cache behavior worse.

      • Big register file has disadvantages

        • Context switch mainly.

      • With so many dependencies created by special-purpose instructions, a dynamically scheduled out-of-order (OoO) implementation is unlikely.

    • But…

      • If the compiler could do a good job, there really does look to be potential for a big win.
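Two of the problems listed above can be put into numbers using figures from earlier in this presentation: three instructions per 128-bit bundle, and a register file of 128 64-bit general registers, 128 82-bit floating-point registers, 64 1-bit predicates and 8 64-bit branch registers. The sketch below is only back-of-the-envelope arithmetic; the 32-bit fixed-width instruction used as a baseline is an assumption for comparison, and the state total counts raw register bits, not what an operating system would actually save on a context switch.

```python
def encoding_and_state_costs(baseline_instruction_bits=32):
    """Rough size figures for IA-64 code density and register state."""
    bundle_bits, instructions_per_bundle = 128, 3
    bits_per_instruction = bundle_bits / instructions_per_bundle

    # Register counts as listed on the earlier "IA-64" slide.
    state_bits = (128 * 64     # general registers
                  + 128 * 82   # floating-point registers
                  + 64 * 1     # predicate registers
                  + 8 * 64)    # branch target registers

    return {
        "bits_per_instruction": bits_per_instruction,
        "size_vs_baseline": bits_per_instruction / baseline_instruction_bits,
        "state_bits": state_bits,
        "state_bytes": state_bits / 8,
    }


for name, value in encoding_and_state_costs().items():
    print(name, round(value, 2))
```

With these inputs each instruction costs about 42.67 bits, roughly a third more than the assumed 32-bit baseline, and the listed registers add up to 19264 bits (2408 bytes) of architectural state.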

