IA-64: Advanced Loads Speculative Loads Software Pipelining

Presentation Transcript

  1. IA-64: Advanced Loads Speculative Loads Software Pipelining
     Presentation stolen from the web (with changes) from the Univ of Aberta and Espen Skoglund and Thomas Richards (470 alum) and our textbook’s authors.

  2. IA-64
     • 128 64-bit general registers
       • Use a register window somewhat similar to SPARC
     • 128 82-bit FP registers
     • 64 1-bit predicate registers
     • 8 64-bit branch target registers

  3. Explicit Parallelism
     • Groups
       • Instructions which could be executed in parallel if hardware resources are available.
     • Bundle
       • Code format: 3 instructions fit into a 128-bit bundle.
       • 5 bits of template, 41*3 bits of instruction.
       • Template specifies what execution units each instruction requires.

  4. Instruction groups
     • IA-64 instructions are bound in instruction groups
       • No read-after-write dependencies
       • No write-after-write dependencies
       • Any instruction in the group may be executed in parallel
     • New processors can easily take advantage of the existing ILP in the instruction group
     • Instruction groups are indicated by stop bits in the template
     • Instruction groups may end dynamically on branches

  5. Instruction bundles
     • Instruction bundles contain
       • 3 instructions
       • A template field which maps instructions to execution units
     • Processor dispatches all three instructions in parallel
     • Instruction group may end in the middle of a bundle
     • Bundles are aligned on 16-byte boundaries
     Instruction bundle (figure): fields Slot 3, Slot 2, Slot 1, Template, with bit boundaries at 127, 86, 45, 4, 0. The template occupies bits 4-0 and the three 41-bit slots occupy bits 45-5, 86-46 and 127-87.
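
The bundle layout above can be sketched as plain bit arithmetic. This is only an illustration in Python, not a real decoder: the helper names `encode_bundle` and `decode_bundle` are made up here, slots are numbered 0 to 2 from the low bits upward (an assumption about how the slide's Slot 1 to Slot 3 labels map), and the template and slot contents are treated as opaque integers.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3 or any(not 0 <= s <= SLOT_MASK for s in slots):
        raise ValueError("need exactly three 41-bit slots")
    bundle = template  # bits 4..0
    for i, slot in enumerate(slots):
        # slot i starts right after the template and the lower slots
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def decode_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots
```

The arithmetic confirms the slide's count: 5 + 3 * 41 = 128 bits, so the highest slot starts at bit 87 and ends at bit 127.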

  6. Predication
     • Use predicates to eliminate branches
     • Predicates are one-bit registers (total of 64)
     • Most instructions can be predicated:
         (qp) mnemonic dest = source
     • Predicates are set by compare instructions:
         (qp) cmp.crel px,py = source
     • C code:
         if (a == b) y += 3;
         else y += 4;
     • x86 assembly:
         cmp a, b
         je .eq
         add $4, y
         jmp .done
       .eq:
         add $3, y
       .done:
     • IA-64 assembly:
         cmp.eq p1,p2 = a,b
         (p1) add y = y, 3
         (p2) add y = y, 4
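
To see that the predicated form computes the same result as the branching form, here is a small Python sketch. It only models the semantics (both adds are issued, each takes effect only when its predicate is true); the function names `branchy` and `predicated` are illustrative and not part of the original slides.

```python
def branchy(a, b, y):
    """The C code from the slide, with an explicit branch."""
    if a == b:
        y += 3
    else:
        y += 4
    return y


def predicated(a, b, y):
    """Model of the IA-64 sequence: no branch, two predicated adds."""
    # cmp.eq p1,p2 = a,b : p1 receives the comparison result, p2 its complement
    p1 = (a == b)
    p2 = not p1
    # (p1) add y = y, 3 : takes effect only if p1 is true
    y = y + 3 if p1 else y
    # (p2) add y = y, 4 : takes effect only if p2 is true
    y = y + 4 if p2 else y
    return y
```

Because p1 and p2 are complementary, exactly one of the two adds updates y on any given execution.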

  7. Advanced loads and speculative loads
     • Advanced loads
       • Used to address data dependencies
       • Figure: the sequence "st, ld" becomes "advanced load, st, check load"
     • Speculative loads
       • Used to address control dependencies
       • Figure: the sequence "(p) br, ld" becomes "speculative load, (p) br, check speculation"

  8. Advanced loads
     • addr1 and addr2 in the example might point to the same address
       • If different: the datum in addr2 can be prefetched
       • If same: the datum in addr2 cannot be prefetched
     • C code example:
         int foo (int *addr1, int *addr2)
         {
           int h;
           *addr1 = 4;
           h = *addr2;
           return h+1;
         }

  9. Advanced loads
     • Insert advanced loads (ld.a) to prefetch data (the address is stored in the ALAT)
     • Use the check data instruction (ld.c) in place of the original load
       • If the memory contents have changed, perform a real load
     • Advanced loads do not defer exceptions (e.g., page faults)
     • Regular load:
         add r3 = 4,r0 ;;
         st4 [r32] = r3
         ld4 r2 = [r33]          // regular load
         add r5 = r2,r3          // use data
     • Advanced load:
         ld4.a r2 = [r33]        // advanced load
         add r3 = 4,r0 ;;
         st4 [r32] = r3
         ld4.c r2 = [r33] ;;     // verify data
         add r5 = r2,r3          // use data

  10. Speculative loads
      • If addr in the example is legal, we can prefetch its value
      • If addr is illegal, prefetching the value would cause an exception
      • Any exception should be delayed until the code path has been resolved
      • C code example:
          int add5 (int *addr)
          {
            if (addr == NULL)
              return (-1);
            else
              return (*addr+5);
          }

  11. Speculative loads
      • Insert speculative loads (ld.s) to prefetch data
      • Verify the load using a check instruction (chk.s)
      • NaT bit/NaTVal is used to track the success of the load
      • Might also be combined with advanced loads (ld.sa and chk.a)
      • Assembly code:
          add5:
            ld8.s r1 = [r32]
            cmp.eq p6,p5 = r32,r0 ;;
            (p6) add r8 = -1,r0
            (p6) br.ret
            (p5) chk.s r1,return_error
            add r8 = 5,r1
            br.ret ;;
          return_error:
            // recovery code
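
The add5 sequence can be mimicked in Python to show why the hoisted load is safe. This is a rough sketch under stated assumptions: memory is a dictionary of legal addresses, NULL is address 0, `ld_s` is a made-up helper that returns a (value, NaT) pair, and the recovery code is modelled as simply raising the fault that was deferred.

```python
NULL = 0


def ld_s(memory, addr):
    """Speculative load: a bad address sets NaT instead of raising."""
    if addr in memory:
        return memory[addr], False
    return 0, True  # deferred exception: value is meaningless, NaT is set


def add5(memory, addr):
    r1, nat = ld_s(memory, addr)  # ld8.s r1 = [r32], hoisted above the test
    if addr == NULL:              # cmp.eq p6,p5 = r32,r0
        return -1                 # (p6) path: r1 is never checked or used
    if nat:                       # (p5) chk.s r1, return_error
        # recovery code: here we just surface the deferred fault
        raise MemoryError("illegal address %#x" % addr)
    return r1 + 5                 # add r8 = 5,r1
```

With `memory = {0x100: 7}`, a call with address 0x100 takes the normal path, a call with NULL returns -1 without ever faulting on the speculative load, and only a non-NULL illegal address reaches the recovery path.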

  12. Code example: “Why hoist loads?”
        add r15 = r2,r3     //A
        mult r4 = r15,r2    //B
        mult r4 = r4,r4     //C
        st8 [r12] = r4      //D
        ld8 r5 = [r15]      //E
        div r6 = r5,r7      //F
        add r5 = r6,r2      //G
      • Assume latencies are:
        • add, store: +0
        • mult, div: +3
        • ld: +4
      Dependence graph (figure): node costs A:1, B:4, C:4, D:1, E:5, F:4, G:1; the totals 10, 11 and 20 are marked on the graph.
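
One way to read the figure's totals is as critical-path lengths through the dependence graph. The Python sketch below is an interpretation, not something stated on the slide: it assumes each instruction costs 1 cycle plus the listed extra latency, that issue width is unlimited, and that the only thing keeping the load E below the store D is a possible memory dependency.

```python
# Cost per instruction: 1 cycle plus the extra latency from the slide.
COST = {"A": 1, "B": 4, "C": 4, "D": 1, "E": 5, "F": 4, "G": 1}

# Register dependencies: B and E read r15 from A, C reads r4 from B, and so on.
REG_DEPS = {"B": ["A"], "C": ["B"], "D": ["C"], "E": ["A"], "F": ["E"], "G": ["F"]}


def finish_times(cost, deps):
    """Earliest finish time of every node, assuming unlimited issue width."""
    done = {}

    def finish(node):
        if node not in done:
            start = max((finish(p) for p in deps.get(node, [])), default=0)
            done[node] = start + cost[node]
        return done[node]

    for node in cost:
        finish(node)
    return done


def critical_path(cost, deps):
    return max(finish_times(cost, deps).values())


# If the load E may not move above the store D, E also waits for D.
ORDERED_DEPS = dict(REG_DEPS, E=["A", "D"])

hoisted = critical_path(COST, REG_DEPS)        # load hoisted above the store
in_order = critical_path(COST, ORDERED_DEPS)   # load kept below the store
```

Under these assumptions the store chain A, B, C, D finishes at 10, the load chain A, E, F, G finishes at 11 when the load is hoisted, and the whole sequence takes 20 when the load has to wait for the store.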

  13. Advanced Loads Recovery
      • Case A: Hoist just the load.
        • In this case, if there is a memory dependency we just re-execute the load.
          // Case A: Advanced Load
          ld.a r2 = [r10]
          st8 [r1] = r9
          ld.c r2 = [r10]
          add r15 = r2, r3
          st8 [r18] = r19
      • Case B: Hoist the load and dependent instructions.
        • In this case, we need to re-execute all of the dependent instructions.
          // Case B: Advanced Load
          // With Speculative Add
          ld.a r2 = [r10]
          add r5 = r2, r3
          st8 [r1] = r9
          ld.c r2 = [r10]    // Wrong
          st8 [r18] = r19
      • A ld.c will only re-execute the load; r5 is still wrong after the ld.c!

  14. Advanced Load-Use Recovery: Compiler Generated Recovery Code
        // Solution: Using the chk.a instruction
        ld8.a r2 = [r10]
        add r5 = r2, r3
        st8 [r1] = r9
        chk.a r2, fixup
      return:                // Return Point
        st8 [r18] = r19
        ......
        ......
      fixup:                 // Re-execute load and all speculative uses
        ld8 r2 = [r10]
        add r5 = r2, r3
        br return
      • Use ld.c if JUST a load is speculative. Use chk.a if a load and an instruction that is dependent on the load are both speculative.
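
The difference between checking with ld.c and checking with chk.a when a dependent add has also been hoisted can be replayed in a few lines of Python. This is a toy model with assumptions: memory is a dictionary, the ALAT is reduced to a mapping from register name to load address, and the function name `run_hoisted_load_and_add` is invented for the sketch.

```python
def run_hoisted_load_and_add(memory, r1, r10, r9, r3, use_chk_a):
    """Replay 'ld.a; add; st; check' and return the final register values."""
    memory = dict(memory)  # work on a copy
    alat = {}
    regs = {}

    # ld8.a r2 = [r10] : load early and remember the address in the ALAT
    regs["r2"] = memory[r10]
    alat["r2"] = r10
    # add r5 = r2, r3 : speculative use of the loaded value
    regs["r5"] = regs["r2"] + r3
    # st8 [r1] = r9 : a store to the same address removes the ALAT entry
    memory[r1] = r9
    for reg, addr in list(alat.items()):
        if addr == r1:
            del alat[reg]

    if "r2" not in alat:  # the check failed: the store aliased the load
        regs["r2"] = memory[r10]          # both ld.c and the fixup code reload
        if use_chk_a:
            regs["r5"] = regs["r2"] + r3  # only the chk.a fixup redoes the add
    return regs
```

With `memory = {0x10: 5}`, `r10 = 0x10`, `r9 = 9`, `r3 = 1` and an aliasing store (`r1 = 0x10`), the ld.c variant ends with r2 = 9 but a stale r5 = 6, while the chk.a variant ends with r5 = 10.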

  15. The Advanced Load Address Table (ALAT)
      • The ALAT tells us if we need to recover from an Advanced Load.
      • When an advanced load is executed: save the type of load, size of load, and load address to the ALAT (indexed by PR, the physical register number).
      • When we execute a ld.c or chk.a, look for the entry in the ALAT. If it is missing, run the recovery code.
      • Remove an entry from the ALAT if
        • A store address overlaps an ALAT entry.
        • Capacity/associativity evictions.
        • Another advanced load indexes the same PR.
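
The three removal rules can be captured in a small Python model. This is a sketch under assumptions, not a description of any real implementation: the class name `ALAT`, the default capacity, and the oldest-first eviction policy are all chosen for illustration, and the "type of load" field is left out.

```python
class ALAT:
    """Toy model of the Advanced Load Address Table."""

    def __init__(self, capacity=32):
        self.capacity = capacity  # arbitrary size chosen for the sketch
        self.entries = {}         # register number -> (address, size)

    def advanced_load(self, reg, addr, size):
        """ld.a: record the load, replacing any entry for the same register."""
        self.entries.pop(reg, None)
        if len(self.entries) >= self.capacity:
            # capacity eviction; oldest-first is an assumption of this model
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """A store removes every entry whose byte range it overlaps."""
        for reg, (a, s) in list(self.entries.items()):
            if addr < a + s and a < addr + size:
                del self.entries[reg]

    def check(self, reg):
        """ld.c / chk.a: True means the early load is still valid."""
        return reg in self.entries
```

A check that returns False corresponds to the "entry is missing, run the recovery code" case on the slide. Note that an eviction makes a check fail even though no store interfered, which costs time but is still safe.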

  16. Control Speculation and Recovery
      • What if we want to move a load above a branch?
      • The problem is that the load maybe shouldn’t have executed and might have thrown a spurious exception.
      • Similar to Advanced Load, but no ALAT.
      • Instead, check the NaT bit for deferred exceptions.
        • See next slide.
      • Use chk.s for recovery (instead of chk.a or ld.c).
          // Control Speculation and Recovery
          ld8.s r1 = [r10]             // load moved outside of branch
          st8 [r11] = r9
          (p1) br.cond branch_label    // (p1) is a predicate bit
          chk.s r1, recovery
        return:
          add r2 = r1, r2
      • chk.s checks r1 to see if the NaT bit is set. If so, branch to recovery code (re-execute instructions if necessary).

  17. Not a Thing Bit (NaT)
      • IA-64 register: 64 bits + 1 NaT bit
      • If a control speculative load causes an exception, the processor can set this bit, which defers the exception.
      • NaT bits propagate.
        • Propagation allows a single check for multiple ld.s.
          ld8.s r1 = [r10]
          ld8.s r2 = [r11]
          add r3 = r1, r2
          ld8.s r4 = [r3]
          st8 [r11] = r9
          (p1) br.cond branch_label
          chk.s r4, recovery
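
NaT propagation can be illustrated by carrying a flag alongside every register value. The Python sketch below is a simplified model: the `Reg` type and the helpers `ld_s`, `add` and `chk_s` are invented names, memory is a dictionary of legal addresses, and the store and branch from the slide are left out because they do not affect the NaT chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False  # the extra NaT bit next to the 64-bit value


def ld_s(memory, addr):
    """Speculative load: a NaT or illegal address yields a NaT result."""
    if addr.nat or addr.value not in memory:
        return Reg(0, True)
    return Reg(memory[addr.value])


def add(a, b):
    """The result is NaT if either source is NaT."""
    return Reg(a.value + b.value, a.nat or b.nat)


def chk_s(reg):
    """True means: branch to the recovery code."""
    return reg.nat


def chain(memory, r10, r11):
    r1 = ld_s(memory, Reg(r10))  # ld8.s r1 = [r10]
    r2 = ld_s(memory, Reg(r11))  # ld8.s r2 = [r11]
    r3 = add(r1, r2)             # add r3 = r1, r2
    r4 = ld_s(memory, r3)        # ld8.s r4 = [r3]
    return r4                    # only r4 needs to be checked
```

If any of the three loads defers an exception, the NaT bit reaches r4, so the single `chk_s(r4)` covers all of them.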

  18. Software pipelining on IA-64
      • Lots of tricks
        • Rotating registers
        • Special counters
      • Often don’t need a prolog and epilog.
        • Special counters and predication let us execute only those instructions we need to.

  19. Prolog and epilog (from before!)
              r3=r3-8          // Needed to check legal!
              r4=MEM[r2+0]     // A(1)
              r1=r4*2          // B(1)
              r4=MEM[r2+4]     // A(2)
        Loop: MEM[r2+0]=r1     // C(n)
              r1=r4*2          // B(n+1)
              r4=MEM[r2+8]     // A(n+2)
              r2=r2+4          // D(n)
              bne r2 r3 Loop   // E(n)
              MEM[r2+0]=r1     // C(x-1)
              r1=r4*2          // B(x)
              MEM[r2+4]=r1     // C(x)
              r3=r3+8          // Could have used tmp var.
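
The same prolog, kernel and epilog structure can be written out in Python to check that it really doubles every element. This is a sketch rather than the course's code: the list index `i` stands in for the pointer r2, the function names are made up, and arrays shorter than two elements fall back to the plain loop because the pipelined form is not legal for them.

```python
def double_plain(mem):
    """Reference loop: load, multiply by 2, store, advance."""
    for i in range(len(mem)):
        mem[i] = mem[i] * 2
    return mem


def double_pipelined(mem):
    """Software-pipelined version with explicit prolog and epilog."""
    n = len(mem)
    if n < 2:               # pipelined form needs at least two elements
        return double_plain(mem)
    # prolog: start the first two iterations
    r4 = mem[0]             # A(1)
    r1 = r4 * 2             # B(1)
    r4 = mem[1]             # A(2)
    i = 0
    # kernel: one store, one multiply and one load from three iterations
    while i < n - 2:
        mem[i] = r1         # C(n)
        r1 = r4 * 2         # B(n+1)
        r4 = mem[i + 2]     # A(n+2)
        i += 1              # D(n)
    # epilog: drain the two iterations still in flight
    mem[i] = r1             # C(x-1)
    r1 = r4 * 2             # B(x)
    mem[i + 1] = r1         # C(x)
    return mem
```

The kernel runs n - 2 times, which mirrors the `r3=r3-8` adjustment in the slide: the last two elements are finished by the epilog, the second of them one element past the position where the loop stopped.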

  20. There are three special purpose registers used in IA-64 for software pipelining
      • Loop counter (LC) indicates how many times to run through the loop (prolog/kernel)
        • Initialized to N-1 before starting loop code
        • Decremented until LC == 0
      • Epilog counter (EC) indicates how many times to run the loop after the loop counter is exhausted (epilog)
        • Needed to flush the software pipeline
        • Initialized to num-stages before entering loop code
        • Decremented when LC == 0 and EC > 0

  21. And RRB (Register Rename Base)
      • Add internal counter RRB to the register number to get the actual register used
        • Counter is decreased by special loop branch instructions
        • May be reset by the clrrrb instruction
        • Use modular lookup (so we wrap around!)
      • Rotated predicate registers
        • Initially reset using: mov pr.rot = value
        • pr63 is reset before every rotation

  22. How does register rotation work? (Basics)
      • Rotated registers:
        • General: gr32 - grN (as specified by the alloc instruction)
        • Predicate: pr16 - pr63
        • Floating point: fr32 - fr127
      • Registers are rotated to higher numbers
        • Register rn is renamed to rn+1, rmax is renamed to rmin
      • Registers are rotated by specific loop branch instructions
        • br.ctop, br.cexit (for counted loops)
        • br.wtop, br.wexit (for while loops)
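
The "add RRB and wrap around" lookup can be sketched as one line of modular arithmetic. The Python function below is illustrative only: the name `physical_reg` is invented, and the default rotating region of eight general registers starting at r32 is an assumption chosen to match the r32 to r39 registers drawn in the animation that follows.

```python
def physical_reg(logical, rrb, first=32, size=8):
    """Map a logical register number to the physical register it names."""
    if not first <= logical < first + size:
        return logical  # outside the rotating region: no renaming
    return first + (logical - first + rrb) % size


def logical_view(rrb, first=32, size=8):
    """For each physical register, the logical name it currently has."""
    view = {}
    for logical in range(first, first + size):
        view[physical_reg(logical, rrb, first, size)] = logical
    return view
```

With RRB = -1, `logical_view(-1)` shows physical r32 answering to the name r33 and physical r39 wrapping around to r32, which is the "rn is renamed to rn+1, rmax is renamed to rmin" rule from the slide.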

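  23. How they relate: ctop, cexit (decision flowchart on LC, then EC)
      • LC != 0 (prolog/kernel): LC--, EC=EC, PR[63]=1, RRB--. ctop: branch; cexit: fall-thru.
      • LC == 0 (epilog), EC > 1: LC=LC, EC--, PR[63]=0, RRB--. ctop: branch; cexit: fall-thru.
      • LC == 0 (epilog), EC == 1: LC=LC, EC--, PR[63]=0, RRB--. ctop: fall-thru; cexit: branch.
      • LC == 0, EC == 0 (special unrolled loops): LC=LC, EC=EC, PR[63]=0, RRB=RRB. ctop: fall-thru; cexit: branch.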
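
The four cases of the flowchart translate directly into a small state-update function. The Python sketch below covers br.ctop only and is a model for reasoning about the counters, not a full description of the instruction; the function names `br_ctop` and `count_iterations` are made up for this example.

```python
def br_ctop(lc, ec, rrb):
    """Return the new (lc, ec, rrb), the value written to PR[63], and 'taken'."""
    if lc != 0:                      # prolog/kernel
        return lc - 1, ec, rrb - 1, 1, True
    if ec > 1:                       # epilog, more stages to drain
        return lc, ec - 1, rrb - 1, 0, True
    if ec == 1:                      # last epilog pass: fall through
        return lc, ec - 1, rrb - 1, 0, False
    return lc, ec, rrb, 0, False     # EC == 0: nothing changes


def count_iterations(lc, ec):
    """How many times the loop body runs before br.ctop falls through."""
    rrb = 0
    runs = 0
    while True:
        runs += 1                    # loop body executes, then br.ctop
        lc, ec, rrb, _, taken = br_ctop(lc, ec, rrb)
        if not taken:
            return runs, rrb
```

For a loop over N elements with S stages, starting from LC = N-1 and EC = S, this model runs the body N + S - 1 times, for example 7 times for N = 5 and S = 3.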

  24. Software Pipelining Example in the IA-64 (animation; slides 24-50 are successive frames)
        loop:
          (p16) ldl r32 = [r12], 1
          (p17) add r34 = 1, r33
          (p18) stl [r13] = r35, 1
          br.ctop loop
      Each frame shows the general registers r32-r39 (physical and logical numbering), the predicate registers p16-p18, LC, EC, RRB and memory.
      Initial state: LC = 4, EC = 3, RRB = 0; p16 p17 p18 = 1 0 0; logical numbering equals physical numbering; registers empty; memory holds x1 x2 x3 x4 x5.

  25. ldl is highlighted: x1 is loaded. LC = 4, EC = 3, RRB = 0; p16 p17 p18 = 1 0 0.

  26. No change. Registers hold x1.

  27. No change. Registers hold x1.

  28. Rotation: RRB = -1. Physical r32-r39 are now logical r33-r39, r32. A new 1 enters the predicate registers. LC = 4, EC = 3.

  29. LC = 3, EC = 3, RRB = -1; p16 p17 p18 = 1 1 0. Registers hold x1.

  30. ldl is highlighted: x2 is loaded. Registers hold x1, x2.

  31. add is highlighted: y1 is computed. Registers hold y1, x1, x2.

  32. No change. LC = 3, EC = 3, RRB = -1; p16 p17 p18 = 1 1 0.

  33. No change.

  34. Rotation: LC = 2, EC = 3, RRB = -2. Physical r32-r39 are now logical r34-r39, r32, r33. A new 1 enters the predicate registers.

  35. ldl is highlighted: x3 is loaded. Registers hold x1, y1, x3, x2. p16 p17 p18 = 1 1 1.

  36. add is highlighted: y2 replaces x1. Registers hold y2, y1, x3, x2.

  37. stl is highlighted: y1 is written to memory. Memory holds x1 x2 x3 x4 x5 and y1.

  38. No change. LC = 2, EC = 3, RRB = -2; p16 p17 p18 = 1 1 1.

  39. Rotation: LC = 1, EC = 3, RRB = -3. Physical r32-r39 are now logical r35-r39, r32-r34. A new 1 enters the predicate registers.

  40. ldl is highlighted: x4 is loaded. Registers hold y2, y1, x4, x3, x2. p16 p17 p18 = 1 1 1.

  41. add is highlighted: y3 replaces x2. Registers hold y2, y1, x4, x3, y3.

  42. stl is highlighted: y2 is written to memory. Memory holds x1 x2 x3 x4 x5 and y1 y2.

  43. No change. LC = 1, EC = 3, RRB = -3; p16 p17 p18 = 1 1 1.

  44. Rotation: LC = 0, EC = 3, RRB = -4. Physical r32-r39 are now logical r36-r39, r32-r35. A new 1 enters the predicate registers.

  45. ldl is highlighted: x5 is loaded. Registers hold y2, y1, x5, x4, x3, y3. p16 p17 p18 = 1 1 1.

  46. add is highlighted: y4 replaces x3. Registers hold y2, y1, x5, x4, y4, y3.

  47. stl is highlighted: y3 is written to memory. Memory holds x1 x2 x3 x4 x5 and y1 y2 y3.

  48. No change. LC = 0, EC = 3, RRB = -4; p16 p17 p18 = 1 1 1.

  49. Rotation: LC = 0, EC = 2, RRB = -5. Physical r32-r39 are now logical r37-r39, r32-r36. A new 0 enters the predicate registers.

  50. LC = 0, EC = 2, RRB = -5; p16 p17 p18 = 0 1 1. Registers hold y2, y1, x5, x4, y4, y3; memory holds x1 x2 x3 x4 x5 and y1 y2 y3.
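
The frames can be replayed with a short Python simulation of the four-instruction loop. This is a sketch under several assumptions: `ldl` and `stl` are treated as "load next input" and "store next output", a single RRB value is applied to both the general and the predicate rotating regions, and the function name `run_pipelined_loop` and its trace format are invented for the example.

```python
def run_pipelined_loop(xs, stages=3):
    """Simulate the predicated, rotating-register loop; return (outputs, trace)."""
    if not xs:
        return [], []
    GR_FIRST, GR_SIZE = 32, 8    # rotating general registers r32-r39
    PR_FIRST, PR_SIZE = 16, 48   # rotating predicates p16-p63
    gr = [0] * GR_SIZE
    pr = [0] * PR_SIZE
    rrb = 0
    lc, ec = len(xs) - 1, stages  # LC = N-1, EC = number of stages

    def g(logical):               # physical index of a logical general register
        return (logical - GR_FIRST + rrb) % GR_SIZE

    def p(logical):               # physical index of a logical predicate
        return (logical - PR_FIRST + rrb) % PR_SIZE

    pr[p(16)] = 1                 # only the first stage is enabled at the start
    next_load = 0
    out, trace = [], []
    while True:
        if pr[p(16)]:             # (p16) ldl r32 = [r12], 1
            gr[g(32)] = xs[next_load]
            next_load += 1
        if pr[p(17)]:             # (p17) add r34 = 1, r33
            gr[g(34)] = 1 + gr[g(33)]
        if pr[p(18)]:             # (p18) stl [r13] = r35, 1
            out.append(gr[g(35)])
        # br.ctop loop: write p63, update the counters, then rotate
        if lc != 0:
            pr[p(63)], lc, taken = 1, lc - 1, True
        elif ec > 1:
            pr[p(63)], ec, taken = 0, ec - 1, True
        elif ec == 1:
            pr[p(63)], ec, taken = 0, ec - 1, False
        else:
            break                 # EC == 0: no rotation, fall through
        rrb -= 1                  # the old p63 becomes the new p16
        trace.append((lc, ec, rrb))
        if not taken:
            break
    return out, trace
```

For five inputs the trace passes through (3, 3, -1) after the first br.ctop and (0, 2, -5) after the fifth, matching the LC, EC and RRB values in the frames above, and the epilog passes with p16 = 0 store the remaining results without loading past the end of the input.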