1 / 55

IA-64 Architecture Innovations

IA-64 Architecture Innovations. John Crawford Architect & Intel Fellow Intel Corporation. Jerry Huck Manager & Lead Architect Hewlett Packard Co. Agenda. Architecture Principles Predication & Speculation Branch Architecture Software Pipelining.

Download Presentation

IA-64 Architecture Innovations

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. IA-64 Architecture Innovations John Crawford Architect & Intel Fellow Intel Corporation Jerry Huck Manager & Lead Architect Hewlett Packard Co.

  2. Agenda • Architecture Principles • Predication & Speculation • Branch Architecture • Software Pipelining

  3. Traditional Architectures: Limited Parallelism Original Source Code Sequential Machine Code Compiler Hardware parallelized code parallelized code multiple functional units Execution Units Available Used Inefficiently . . . . . . . . . . . . Today’s Processors often 60% Idle

  4. IA-64 Architecture: Explicit Parallelism Parallel Machine Code Original Source Code Compile Compiler Hardware multiple functional units IA-64 Compiler Views Wider Scope More efficient use of execution resources . . . . . . . . . . . . Increases Parallel Execution

  5. IA-64 Principles • Explicitly parallel: • Instruction level parallelism (ILP) in machine code • Compiler schedules across a wider scope • Enhanced ILP : • Predication, Speculation, Software pipelining, ... • Fully compatible: • Across all IA-64 family members • IA-32 in hardware and PA-RISC through instruction mapping • Inherently scalable • Massively resourced: • Many registers • Many functional units

  6. Predication Traditional Architectures IA-64 • Removes branches, converts to predicated execution • Executes multiple paths simultaneously • Increases performance by exposing parallelism and reducing critical path • Better utilization of wider machines • Reduces mispredicted branches cmp cmp then p1 p2 p1 p2 else p1 p2

  7. Predication Review • Two kinds of normal compares • Regular • Unconditional (nested IF’s) p1,p2,<-... (p1) p3= (p2) p3= (p2) p3,p4 <-cmp.unc... p2&p3 p2&p4 (p3)... (p4)... (p3)... Regular: p3 is set just once Unconditional: p3 and p4 are AND’ed with p2 Opportunity for Even More Parallelism

  8. Introducing Parallel Compares • Three new types of compares: • AND: both target predicates set FALSE if compare is false • OR: both target predicates set TRUE if compare is true • ANDOR: if true, sets one TRUE, sets other FALSE A B A C B D C Reduces Critical Path D

  9. Eight Queens Example if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true)) Unconditional Compares 8 queens control flow R1=&b[j] R3=&a[i+j] R5=&c[i-j+7] ld R2=[R1] ld.s R4=[R3] ld.s R6=[R5] P1,P2 <-cmp.unc(R2==true) (p1)chk.s R4 (p1)P3,P4 <-cmp.unc(R4==true) (p3)chk.s R6 (p3)P5,P6 <-cmp.unc(R5==true) (P5) br then else 1 P2 P1 2 4 P4 P3 5 P6 P5 Else Then 6 7

  10. if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true)) P1 Eight Queens Example 8 queens control flow Parallel Compares R1=&b[j] R3=&a[i+j] R5=&c[i-j+7] p1 <- true ld R2=[R1] ld R4=[R3] ld R6=[R5] p1,p2 <- cmp.and(R2==true) p1,p2 <- cmp.and(R4==true) p1,p2 <- cmp.and(R6==true) (p1) br then else P2 1 P1 2 P4 P3 P6 4 P5 Else Then 5

  11. if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true)) Eight Queens Example 8 queens control flow Parallel Compares R1=&b[j] R3=&a[i+j] R5=&c[i-j+7] p1 <- true ld R2=[R1] ld R4=[R3] ld R6=[R5] p1,p2 <- cmp.and(R2==true) p1,p2 <- cmp.and(R4==true) p1,p2 <- cmp.and(R6==true) (p1) br then else P2 1 P1 2 P4 P3 P1=False P1= true P6 4 P5 Else Else Then Then 5 Reduced from 7 cycles to 5

  12. Five Predicate Compare Types • (qp) p1,p2 <- cmp.relation • if(qp) {p1 = relation; p2 = !relation}; • (qp) p1,p2 <- cmp.relation.unc • p1 = qp&relation; p2 = qp&!relation; • (qp) p1,p2 <- cmp.relation.and • if(qp & (relation==FALSE)) { p1=0; p2=0; } • (qp) p1,p2 <- cmp.relation.or • if(qp & (relation==TRUE)) { p1=1; p2=1; } • (qp) p1,p2 <- cmp.relation.or.andcm • if(qp & (relation==TRUE)) { p1=1; p2=0; } Tbit (Test Bit) Also Sets Predicates

  13. Predication Benefits • Reduces branches and mispredict penalties • 50% fewer branches and 37% faster code* • Parallel compares further reduce critical paths • Greatly improves code with hard to predict branches • Large server apps- capacity limited • Sorting, data mining- large database apps • Data compression • Traditional architectures’ “bolt-on” approach can’t efficiently approximate predication • Cmove: 39% more instructions, 23% slower performance* • Instructions must all be speculative * Source: S. Mahlke, 1995

  14. Speculation Review IA-64 Traditional Architectures ld.s instr 1 instr 2 • Memory latency is a major performance bottleneck in today’s systems • CPU to memory gap increasing instr 1 instr 2 . . . br Barrier br Load use chk.s use Allows elevation of load, even above a branch

  15. Hoisting Uses IA-64 ld.s instr 1 instr 2 • The uses of speculative data can also be executed speculatively • distinguishes speculation from simple prefetch br chk.s use Enables Further Parallelism

  16. Introducing the NaT(“Not a Thing”) IA-64 • NaT is the GR’s 65th bit that indicates: • whether or not an exception has occurred • branch to fixup code required • NaT set during ld.s, checked by Chk.s ld.s instr 1 instr 2 ;Exception Detection Propagate Exception br ;Exception Delivery chk.s use

  17. Propagation • All computation instructions propagate NaTs to reduce number of checks • Cmp propagates “false” when writing predicates • RISC architectures require more instructions for equivalent integrity • e.g., non faulting load ld8.s r3 = (r9) ld8.s r4 = (r10) add r6 = r3, r4 ld8.s r5 = (r6) p1,p2 = cmp(...) Allows single chk on result chk.s r5 sub r7 = r5,r2

  18. Exception Deferral: More Than Skin Deep • Deferral allows the efficient delay of costly exceptions • OS controlled deferral by hardware of: • Page faults • Protection violations • … • NaTs enable deferral with recovery • Efficiently support structured exception handling in C/C++ ld.s instr 1 instr 2 uses br Recovery code ld uses br home chk.s (Home Block) Complete Solution for Exception Management

  19. Control Speculation Summary • All loads have a speculative form that sets the NaT bit when deferring exceptions • Computational instructions propagate NaTs • OS controls deferral of faults but supported directly in HW - “no-fault speculation” • Minimizes overhead of data that is not used • Chk more effective than non-faulting load

  20. Store Barrier Traditional Architectures instr 1 instr 2 . . . Store(*) Barrier Load (*) use Traditional architectures limited by the Store Barrier

  21. Introducing Data Speculation • Compiler can issue a load prior to a preceding, possibly-conflicting store Traditional Architectures IA-64 instr 1 ld8.a instr 1 instr 2 instr 2 . . . st8 Barrier st8 ld8 use ld.c use Unique feature to IA-64

  22. ld8.a instr 1 use instr 2 ld8.a instr 1 instr 2 st8 st8 Recovery code ld8 uses br home chk.a ld.c use Data Speculation • Uses can be hoisted Synergy with control speculation yields greater performance

  23. ld.a reg# =... chk.a reg# ? st Advanced Load Address Table - ALAT • ld.a inserts entries. • Conflicting stores remove entries • Also: ld.c.clr, chk.a.clr, • Presence of entry indicates success • chk.a branches when no entry is found reg # Address reg # Address reg # Address ...

  24. Architectural Support for Data Speculation • Instructions • ld.a - advanced loads • ld.c - check loads • chk.a - advance load checks • Speculative Advanced loads - ld.sa - is an advanced load with deferral • ALAT - HW structure containing outstanding advanced loads

  25. Speculation Benefits • Reduces impact of memory latency • Study demonstrates performance improvement of 79% when combined with predication* • Greatest improvement to code with many cache accesses • Large databases • Operating systems • Scheduling flexibility enables new levels of performance headroom * August, et.al, 1998

  26. Agenda • Architecture Principles • Predication & Speculation • Branch Architecture • Software Pipelining

  27. Branch Instruction 128-bit bundle 41-bits 127 0 • Two basic branch formats • Relative: IP := IP + Offset21 • Indirect: IP := BR[I] • 8 branch registers for efficient branch execution • Call/Return linking through branch registers • Loop branches with 64-bit loopcount register (LC) • Enables perfect branch prediction of counted loops • Traditional architectures always mispredict last iteration • Incurs misprediction stall costing many cycles QP Branch Instruction 1 Instruction 0 Template IP-Offset 21-bits

  28. Branch Predicates Conditional branches Unconditional branches • Compiler directed static prediction augments dynamic prediction • Better predict highly correlated branches (always/never taken) • Frees space in H/W predictor • Can give hint for dynamic predictor (p0) BR #label_A; (p1) BR #label_A; P1=true P1=false “always true” A A B

  29. Compare & Branch in Same Cycle Queens Loop: Parallel Compares & Compare-branch R1=&b[j] R3=&a[i+j] R5=&c[i-j+7] p1 <- true ld R2=[R1] ld R4=[R3] ld R6=[R5] p1,p2 <- cmp.and(R2==true) p1,p2 <- cmp.and(R4==true) p1,p2 <- cmp.and(R6==true) (p1) br then else 1 2 4 From 5 Cycles Down to 4

  30. ld8 r6 = (ra) (p1) br exit1 chk r6, rec0 (p2) chk r7, rec1 (p4) chk r8, rec2 }{ (p1) br exit1 (p3) br exit2 (p5) br exit3 } ld8 r7 = (rb) (p3) br exit2 ld8 r8 = (rc) (p5) br exit3 Multi-way Branch w/o Speculation Hoisting Loads IA-64 ld8.s r6 = (ra) ld8.s r7 = (rb) ld8.s r8 = (rc) ld8.s r6 = (ra) ld8.s r7 = (rb) ld8.s r8 = (rc) P1 chk r6, rec0 (p1) br exit1 P2 Chk r7, rec1 (p3) br exit2 P3 P4 Chk r8, rec2 (p5) br exit3 P5 P6 3 branch cycles 1 branch cycle • Multiway branches: more than 1 branch in a single cycle • Allows n-way branching Supports Aggressive Speculation

  31. Software Pipelining • Overlapping execution of different loop iterations vs. • More iterations in same amount of time

  32. Software Pipelining • IA-64 features that make this possible • Full Predication • Special branch handling features • Register rotation: removes loop copy overhead • Predicate rotation: removes prologue & epilogue • Traditional architectures use loop unrolling • High overhead: extra code for loop body, prologue, and epilogue Especially Useful for Integer Code With Small Number of Loop Iterations

  33. 3 ops Basic Loop Example Basic Copy Loop For (i=0; i<n; i++) { *b++ =*a++; } /* MemCopy */ • Simple Non-overlapping iterations • 2 cycles per iteration • 3 operations in loop body Execution (Cycles) 1 2 3 4 5 6 7 8 ld1 br.cloop st1 // setup ra/rb/lc, .label loop { ld8 r35 = [ra],8 }{ st8 [rb],8 = r35 br.cloop #loop // check n!=0 } ld2 br. cloop st2 ld3 br. cloop st3 ld4 br. cloop st4

  34. Loop Support: Unrolling Unrolled Copy Loop Test for loop count 0,1 ld8 r34 = [ra],8 .label loop ld8 r35 = [ra],8 st8 [rb],8 = r34 br.cle #e-exit ld8 r34 = [ra],8 st8 [rb],8 = r35 br.cloop #loop st8 [rb],8 = r34 br #thru .label e-exit st8 [rb],8 = r35 .label thru • Overlapped iterations • 1 cycle per word • 1.6X performance improvement • 3.3X code expansion Execution cycles 1 2 3 4 5 Prologue ld1 ld2 st1 br.cle Main loop br.cloop ld3 st2 st3 br.cle ld4 10 ops Epilogue st4 Incurs Code Expansion Penalties

  35. ld1 r34 Software Register Renaming Traditional Architecture . . . R32 R33 R34 R35 . . .

  36. st1 r34 ld2 r35 Software Register Renaming Traditional Architecture . . . R32 R33 R34 R35 . . .

  37. ld3 r34 st2 r35 Software Register Renaming Traditional Architecture . . . R32 R33 R34 R35 . . .

  38. st3 r34 ld4 r35 Software Register Renaming Traditional Architecture . . . R32 R33 R34 R35 . . .

  39. st4 r35 Software Register Renaming Traditional Architecture . . . R32 R33 R34 R35 . . .

  40. ld1 R35 Introducing Rotating Registers • GR 32-127, FR32-127 can rotate • Separate Rotating Register Base for each: GRs, FRs • Loop branches decrement all register rotating bases (RRB) • Instructions contain a “virtual” register number • RRB + virtual register number = physical register number. . . . 36: Palm 35: 34: 33: 32: . . . Palm Sunny Springs is RRB=0

  41. st1 R35 ld2 R34 Introducing Rotating Registers • GR 32-127, FR32-127 can rotate • Separate Rotating Register Base for each: GRs, FRs • Loop branches decrement all register rotating bases (RRB) • Instructions contain a “virtual” register number • RRB + virtual register number = physical register number. IA-64 Palm . . . 36: Palm 35: Springs 34: 33: 32: . . . Palm Sunny Springs is RRB=0

  42. st2 R35 ld3 R34 Introducing Rotating Registers • GR 32-127, FR32-127 can rotate • Separate Rotating Register Base for each: GRs, FRs • Loop branches decrement all register rotating bases (RRB) • Instructions contain a “virtual” register number • RRB + virtual register number = physical register number. IA-64 Palm Springs . . . Palm 35: Springs 34: is 33: 32: 127: . . . Palm Sunny Springs is RRB=-1

  43. st3 R35 ld4 R34 Introducing Rotating Registers • GR 32-127, FR32-127 can rotate • Separate Rotating Register Base for each: GRs, FRs • Loop branches decrement all register rotating bases (RRB) • Instructions contain a “virtual” register number • RRB + virtual register number = physical register number. IA-64 Palm Springs is . . . Springs 34: is 33: Sunny 32: 127: 126: . . . Palm Sunny Springs is RRB=-2

  44. st4 R35 Introducing Rotating Registers • GR 32-127, FR32-127 can rotate • Separate Rotating Register Base for each: GRs, FRs • Loop branches decrement all register rotating bases (RRB) • Instructions contain a “virtual” register number • RRB + virtual register number = physical register number. IA-64 Palm Sunny Springs is . . . is 33: Sunny 32: 127: 126: 125: . . . Palm Sunny Springs is RRB=-3

  45. 5 ops Loop Support: Rotating Registers Software Pipelined Copy Loop // setup ra/rb/lc/ec, check n > 2 { ld8 r35 = [ra],8 } .label loop { ld8 r34 = [ra],8 st8 [rb] = r35,8 br.ctop #loop }{ st8 [rb] = r35,8 } • Modulo Scheduled Iterations • 1 cycle per word • 1.6X performance improvement • additional upside for higher latency conditions • 1.7X code expansion Execution cycles 1 2 3 4 5 Prologue ld1 ld2 st1 br.ctop Main loop ld3 br. ctop st2 ld4 st3 br. ctop Epilogue st4

  46. IA-64 . . . (p16) ld1 R34 (p17) st R35 0 18: 0 17: 0 16: 0 63: 0 62: . . . Introducing Rotating Predicate Registers • PR16-63 can rotate, with separate Rotating Register Base • Loop branches decrement all register rotating base (RRB) • Instructions contain a “virtual” predicate register number • RRB + virtual register number = physical register number. LC=3 EC=2 IA-64 Code . . . 0 18: (p17) st R35 0 17: 1 (p16) ld R34 (p16) ld R34 16: 0 63: 0 62: . . . RRB=0 Initialize

  47. IA-64 IA-64 IA-64 . . . . . . . . . (p16) ld2 R34 (p16) ld1 R34 (p17) st1 R35 (p17) st R35 0 0 0 18: 18: 18: 0 0 0 17: 17: 17: 1 0 1 16: 16: 16: 0 0 1 63: 63: 63: 0 0 0 62: 62: 62: . . . . . . . . . Introducing Rotating Predicate Registers • PR16-63 can rotate, with separate Rotating Register Base • Loop branches decrement all register rotating base (RRB) • Instructions contain a “virtual” predicate register number • RRB + virtual register number = physical register number. LC=2 EC=2 IA-64 Code . . . 0 17: (p17) st R35 (p17) st R35 1 16: 1 (p16) ld R34 (p16) ld R34 63: 0 62: 0 61: . . . RRB=-1 Branch 1

  48. IA-64 IA-64 IA-64 IA-64 IA-64 . . . . . . . . . . . . . . . (p16) ld3 R34 (p16) ld2 R34 (p16) ld1 R34 (p17) st R35 (p17) st1 R35 (p17) st2 R35 0 0 0 0 0 18: 17: 17: 18: 18: 0 1 0 0 1 16: 17: 16: 17: 17: 1 1 1 1 0 63: 16: 63: 16: 16: 0 0 1 0 1 62: 63: 62: 63: 63: 0 0 0 0 0 61: 62: 61: 62: 62: . . . . . . . . . . . . . . . Introducing Rotating Predicate Registers • PR16-63 can rotate, with separate Rotating Register Base • Loop branches decrement all register rotating base (RRB) • Instructions contain a “virtual” predicate register number • RRB + virtual register number = physical register number. LC=1 EC=2 IA-64 Code . . . 1 16: (p17) st R35 (p17) st R35 1 63: 1 (p16) ld R34 (p16) ld R34 62: 0 61: 0 60: . . . RRB=-2 Branch 2

  49. IA-64 IA-64 IA-64 IA-64 IA-64 IA-64 IA-64 . . . . . . . . . . . . . . . . . . . . . (p16) ld4 R34 (p16) ld1 R34 (p16) ld2 R34 (p16) ld3 R34 (p17) st R35 (p17) st2 R35 (p17) st1 R35 (p17) st3 R35 0 0 1 1 0 0 0 17: 17: 16: 18: 18: 18: 16: 1 0 1 1 0 1 0 17: 17: 17: 63: 16: 16: 63: 1 1 1 1 1 1 0 62: 16: 16: 16: 62: 63: 63: 1 1 1 0 0 0 0 63: 61: 62: 63: 63: 62: 61: 0 0 0 0 0 0 0 60: 61: 62: 61: 62: 60: 62: . . . . . . . . . . . . . . . . . . . . . Introducing Rotating Predicate Registers • PR16-63 can rotate, with separate Rotating Register Base • Loop branches decrement all register rotating base (RRB) • Instructions contain a “virtual” predicate register number • RRB + virtual register number = physical register number. LC=0 EC=2 IA-64 Code . . . 1 63: (p17) st R35 (p17) st R35 1 62: 1 (p16) ld R34 (p16) ld R34 61: 0 60: 0 59: . . . RRB=-3 Branch 3

  50. IA-64 IA-64 IA-64 IA-64 IA-64 IA-64 IA-64 IA-64 IA-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . (p16) ld1 R34 (p16) ld4 R34 (p16) ld R34 (p16) ld3 R34 (p16) ld2 R34 (p17) st3 R35 (p17) st R35 (p17) st2 R35 (p17) st3 R35 (p17) st1 R35 0 1 0 1 0 1 0 0 1 17: 16: 17: 18: 18: 63: 63: 18: 16: 1 1 1 1 0 0 0 1 1 17: 16: 17: 62: 63: 62: 63: 17: 16: 1 1 1 1 0 1 1 1 1 16: 63: 62: 61: 16: 16: 63: 61: 62: 0 0 1 0 0 0 1 0 1 61: 61: 62: 63: 62: 60: 63: 60: 63: 0 0 0 0 0 0 0 0 0 61: 60: 61: 59: 62: 62: 62: 60: 59: . . . . . . . . . . . . . . . . . . . . . . . . . . . Introducing Rotating Predicate Registers • PR16-63 can rotate, with separate Rotating Register Base • Loop branches decrement all register rotating base (RRB) • Instructions contain a “virtual” predicate register number • RRB + virtual register number = physical register number. LC=0 EC=1 IA-64 Code . . . 1 62: (p17) st R35 (p17) st R35 1 61: (p16) ld R34 0 (p16) ld R34 (p16) ld R34 60: 0 59: 0 58: . . . RRB=-4 Branch 4

More Related