
Advanced Computer Architecture 5MD00 / 5Z033 EPIC / Itanium architecture best of both worlds

Advanced Computer Architecture 5MD00 / 5Z033 EPIC / Itanium architecture best of both worlds. Henk Corporaal www.ics.ele.tue.nl/~heco TUEindhoven 2009. Avoiding superscalar complexity. An alternative: EPIC (Explicitly Parallel Instruction Computing). EPIC: best of both worlds?


Presentation Transcript


  1. Advanced Computer Architecture 5MD00 / 5Z033
     EPIC / Itanium architecture: best of both worlds
     Henk Corporaal, www.ics.ele.tue.nl/~heco, TUEindhoven 2009

  2. Avoiding superscalar complexity
     • An alternative: EPIC (Explicitly Parallel Instruction Computing)
     • EPIC: best of both worlds?
         • Superscalar: expensive but binary compatible
         • VLIW: simple, but not binary compatible
     • Or: use VLIW with binary translation at run time
         • Transmeta: Crusoe VLIW processor
         • Runs x86 code on a VLIW!
     ACA H.Corporaal

  3. EPIC Architecture: IA-64 / Itanium
     Explicitly Parallel Instruction Computing
     • IA-64
         • Implementations: Merced (2001), McKinley (2002), Montecito (2-core, 2006), Tukwila (4-core, 2009), Poulson (Q4 2009, 8-core)
         • The architecture is now called Itanium
     • Register model:
         • 128 64-bit integer registers (each with a NaT bit), stacked, rotating
         • 128 82-bit floating-point registers, rotating
         • 64 1-bit booleans (predicates)
         • 8 64-bit branch target address registers
         • system control registers

  4. (2002)

  5. Itanium Instruction format
     • Instructions are grouped in 128-bit bundles
         • 3 * 41-bit instruction
         • 5 template bits, which indicate instruction type and stop location
     • Each 41-bit instruction
         • starts with a 4-bit opcode, and
         • ends with a 6-bit guard (boolean) register id
     (Figure: bundle layout, field widths in bits: 5 + 41 + 41 + 41 = 128)
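
The field widths above can be illustrated with a minimal Python sketch that splits a 128-bit bundle into its template and three instruction slots. The slide gives only the widths; the bit positions used here (template in the lowest 5 bits, guard register id in the lowest 6 bits of each slot, opcode in the top 4 bits) are an assumption, and the names `decode_bundle` and `encode_bundle` are hypothetical helpers, not part of any toolchain.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, raw in enumerate(slots):
        bundle |= (raw & SLOT_MASK) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def decode_bundle(bundle):
    """Split a 128-bit bundle into its template and three decoded slots."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = []
    for i in range(3):
        raw = (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        slots.append({
            "raw": raw,
            "opcode": raw >> (SLOT_BITS - 4),  # assumed: top 4 bits of the slot
            "guard": raw & 0x3F,               # assumed: low 6 bits, one of 64 predicates
        })
    return {"template": template, "slots": slots}


# Example: a slot with opcode 8 guarded by predicate register 5.
example_slot = (0x8 << 37) | 5
example = decode_bundle(encode_bundle(0x10, [example_slot, 0, 0]))
```

Note that 5 + 3 * 41 = 128, so the three slots and the template fill the bundle exactly.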


  7. Predication
     • Predicated execution of virtually all instructions
         • (p) add r1 = r2, r3
         • If p is true, normal add operation; otherwise, NOP
         • 64 1-bit predicate registers
     • Advantages of predicated execution:
         • Removes branches
         • Converts control dependence to data dependence
         • Reduces misprediction penalties
         • Increases the size of basic blocks
         • Code from both the taken and the not-taken path can be scheduled in the same cycle
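
As a rough illustration of how predication removes a branch, the following Python sketch interprets a tiny predicated program computing |x - y| without any jump. It is a toy model, not IA-64 semantics in full: the tuple instruction format and the `execute` and `abs_diff` names are assumptions made for this example. The compare writes a pair of complementary predicates, and each subtract is guarded by one of them.

```python
def execute(program, regs, preds):
    """Run straight-line predicated code; a false guard turns the op into a NOP."""
    for guard, op, dst, a, b in program:
        if not preds[guard]:
            continue  # guarded-off instruction has no effect
        if op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "sub":
            regs[dst] = regs[a] - regs[b]
        elif op == "cmp.gt":
            p_true, p_false = dst  # compare sets two complementary predicates
            preds[p_true] = regs[a] > regs[b]
            preds[p_false] = not preds[p_true]
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs


def abs_diff(x, y):
    """if (x > y) r1 = x - y; else r1 = y - x;  expressed without a branch."""
    regs = {"r1": 0, "r2": x, "r3": y}
    preds = [False] * 64
    preds[0] = True  # p0 is always true, so it guards unconditional code
    program = [
        (0, "cmp.gt", (1, 2), "r2", "r3"),  # (p0) cmp.gt p1, p2 = r2, r3
        (1, "sub", "r1", "r2", "r3"),       # (p1) sub r1 = r2, r3
        (2, "sub", "r1", "r3", "r2"),       # (p2) sub r1 = r3, r2
    ]
    return execute(program, regs, preds)["r1"]
```

Both subtracts sit in the same basic block; only the one whose predicate is true updates r1, which is the control-to-data dependence conversion the slide mentions.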

  8. Control Speculation
     • Loads incur high latency
     • Need to schedule loads as early as possible
     • Two barriers: branches and stores
     • Control speculation: move loads above branches

  9. Control speculation: move loads above branches
     Problem: loads can cause exceptions
     • Separate load behavior from exception behavior
         • Speculative load (ld.s) initiates a load operation and detects exceptions
         • On an exception, hardware propagates an exception token (stored with the destination register) from ld.s to chk.s
         • Speculative check (chk.s) delivers the exception detected by ld.s

  10. Control Speculation
     • Speculating the uses of a control-speculative load further increases ILP
     • Dependent instructions following the load can also be speculated above branches
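
The ld.s / chk.s mechanism can be sketched in Python as follows. This is a simplified model under stated assumptions: memory is a dict, a missing key stands in for a faulting access, and the `NAT` object plays the role of the exception token stored with the destination register. The function names are hypothetical.

```python
NAT = object()  # stand-in for the deferred-exception token


def ld_s(memory, addr):
    """Speculative load: never raises; a faulting access yields the token."""
    return memory.get(addr, NAT)


def add(a, b):
    """Dependent operations propagate the token instead of computing."""
    return NAT if a is NAT or b is NAT else a + b


def chk_s(value, recovery):
    """Speculative check: run recovery code only if the token arrived."""
    return recovery() if value is NAT else value


def guarded_increment(memory, ptr, branch_taken):
    # Load and its use are hoisted above the branch.
    v = ld_s(memory, ptr)
    t = add(v, 1)
    if not branch_taken:
        return None  # speculative result is discarded; no exception is raised

    def recovery():
        # Non-speculative re-execution; a real fault surfaces here (KeyError).
        return memory[ptr] + 1

    return chk_s(t, recovery)
```

In this model a bad pointer is harmless when the branch is not taken, and raises only at the check, which is where the original program would have faulted.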

  11. Data Speculation
     • Loads and previous stores can conflict
         • When a load and a store overlap (access the same memory location), the load must wait for the previous store because of the RAW dependence
     • IA-64 enables data speculation by ld.a and ld.c/chk.a with the ALAT (Advanced Load Address Table)
         • ld.a performs a normal load and inserts the address into the ALAT
         • Any intervening store removes the overlapping entries from the ALAT
         • The advanced load check (ld.c) checks the ALAT for a violation and reissues the load if necessary
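
The ALAT protocol described above can be sketched as a small Python class. This is a minimal model with several assumptions: entries are keyed by destination register name, overlap is tested by exact address equality, and capacity limits are ignored. The class and method names are illustrative only.

```python
class Alat:
    """Toy Advanced Load Address Table: destination register -> load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, memory, reg, addr):
        """Advanced load: load normally and record the address."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """A store removes every entry whose address it overlaps."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, memory, reg, addr, speculative_value):
        """Check load: keep the early value if the entry survived, else reload."""
        if self.entries.get(reg) == addr:
            return speculative_value
        return memory[addr]


def load_above_store(memory, load_addr, store_addr, store_value):
    """The load is moved above a store that may or may not alias it."""
    alat = Alat()
    early = alat.ld_a(memory, "r4", load_addr)    # hoisted load
    alat.store(memory, store_addr, store_value)   # possibly conflicting store
    return alat.ld_c(memory, "r4", load_addr, early)
```

When the addresses differ, the early value is used and the load latency is hidden; when they collide, the check reissues the load and returns the stored value.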

  12. Data Speculation
     • Move loads above potentially overlapping stores

  13. Data Speculation
     • Uses of speculative data can be further speculated
     • Control and data speculation can also be combined
         • Schedule loads across branches and across stores at the same time
         • Speculative advanced loads: ld.sa combines the semantics of ld.a and ld.s

  14. Register Stack
     • Procedure call overhead
         • Spill registers to memory on call
         • Restore them on procedure return
     • Register Stack
         • The register stack is used to save/restore procedure contexts across calls
         • Stack area in memory to save/restore procedure context
     • Explicit allocation of stack frames
         • Effective use of 96 registers
         • Allocate only what is needed
         • Overlapping stack frames avoid parameter copying
     • Mechanism implemented by renaming register addresses
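
How overlapping frames avoid parameter copying can be sketched with a toy Python model of the renaming. It is an assumption-laden simplification: the 96 stacked registers are a flat list, virtual registers r32 and up map to `base + (r - 32)`, a call moves the base past the caller's inputs and locals so that the caller's outputs become the callee's first registers, and spilling to memory is not modelled at all. `RegisterStack` and its methods are hypothetical names.

```python
class RegisterStack:
    """Toy model of stacked registers r32.. with overlapping frames."""

    def __init__(self, n_phys=96):
        self.phys = [0] * n_phys
        self.base = 0    # physical index that r32 currently maps to
        self.sol = 0     # size of locals (inputs + locals)
        self.sof = 0     # size of the whole frame
        self.saved = []  # frame markers of suspended callers

    def alloc(self, n_in, n_local, n_out):
        """Explicitly allocate only the registers this procedure needs."""
        size = n_in + n_local + n_out
        if self.base + size > len(self.phys):
            raise OverflowError("frame does not fit; spilling is not modelled")
        self.sol = n_in + n_local
        self.sof = size

    def _index(self, r):
        if not 32 <= r < 32 + self.sof:
            raise IndexError(f"r{r} is outside the current frame")
        return self.base + (r - 32)  # renaming: add the frame base

    def read(self, r):
        return self.phys[self._index(r)]

    def write(self, r, value):
        self.phys[self._index(r)] = value

    def call(self):
        """Caller's outputs become the callee's registers starting at r32."""
        self.saved.append((self.base, self.sol, self.sof))
        self.base += self.sol
        self.sof -= self.sol
        self.sol = 0

    def ret(self):
        """Restore the caller's frame marker; its registers were never moved."""
        self.base, self.sol, self.sof = self.saved.pop()
```

A caller with two locals and one output writes its argument to r34; after `call()` the callee reads the same physical register as r32, so no copy takes place.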

  15. Register Stack

  16. Register Stack Engine (RSE)
     • Automatically saves/restores stack registers without software intervention
         • Avoids explicit spill/fill (eliminates stack management overhead)
         • Provides the illusion of infinite physical registers
     • The RSE uses unused memory bandwidth (cycle stealing) to perform register spill and fill operations in the background
         • Overflow: alloc needs more registers than are available
         • Underflow: return needs to restore a frame saved in memory

  17. Software Pipelining Support
     • High-performance loops without code size overhead
         • No prologue and epilogue
     • Rotating registers
         • Provide automatic renaming
     • Rotating predicates (stage predicates)
         • Unify prologue, kernel, and epilogue
     • Loop control registers (LC, EC)
     • Loop branches
         • Counted loop (br.ctop)
         • While loop (br.wtop)
     • Especially valuable for integer loops with small trip counts

  18. Software Pipelining Example
     (Figure: overlapped loop iterations of ld, add, st, divided into Prolog, Kernel and Epilog)
     Original loop:
       L1: ld4 r4 = [r5], 4            // Cycle 0
           add r7 = r4, r9             // Cycle 2
           st4 [r6] = r7, 4            // Cycle 3
           br.cloop L1 ;;
     Software-pipelined loop:
       L1: (p16) ld4 r32 = [r5], 4     // Cycle 0
           (p18) add r35 = r34, r9     // Cycle 0
           (p19) st4 [r6] = r36, 4     // Cycle 0
           br.ctop L1                  // Cycle 0
     What happens during runtime?
       Iteration 1:  r32 r33 r34 r35 …    p16 p17 p18 p19 ..
                                           1   0   0   0  ..
       Iteration 2:  r33 r34 r35 r36 …    p17 p18 p19 ..  p16
                                           1   0   0  ..   1
       Iteration 3:  r34 r35 r36 r37 …    p18 p19 ..  p16 p17
                                           1   0  ..   1   1
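
To make the run-time behaviour of the pipelined loop concrete, here is a small Python simulation of the kernel above. It is a sketch under assumptions: rotation is modelled by shifting Python lists (so the value written to r32 is visible as r33 one iteration later), addresses are replaced by list indices, EC is initialised to 4 (one per stage predicate p16..p19), and LC to the trip count minus one. The function name `pipelined_add` is made up for this example.

```python
def pipelined_add(src, r9):
    """Simulate the rotating-register kernel; computes dst[i] = src[i] + r9."""
    n = len(src)
    if n == 0:
        return []
    dst = []
    rot = [0] * 96          # rotating registers r32..r127; index 0 is r32
    pred = [False] * 48     # rotating predicates p16..p63; index 0 is p16
    pred[0] = True          # p16 = 1 starts the first iteration
    lc, ec = n - 1, 4       # loop count and epilog count (assumed setup)
    r5 = 0                  # load pointer, modelled as a list index

    while True:
        if pred[0]:                      # (p16) ld4 r32 = [r5], 4
            rot[0] = src[r5]
            r5 += 1
        if pred[2]:                      # (p18) add r35 = r34, r9
            rot[3] = rot[2] + r9
        if pred[3]:                      # (p19) st4 [r6] = r36, 4
            dst.append(rot[4])

        # br.ctop: decide the next p16, then rotate registers and predicates
        if lc > 0:
            lc -= 1
            next_p16, taken = True, True     # kernel: start another iteration
        elif ec > 1:
            ec -= 1
            next_p16, taken = False, True    # epilog: drain the pipeline
        else:
            ec -= 1
            next_p16, taken = False, False   # last stage done: fall through
        rot = [rot[-1]] + rot[:-1]           # r32 becomes r33, and so on
        pred = [next_p16] + pred[:-1]        # p16 becomes p17, and so on
        if not taken:
            break
    return dst
```

With three elements the loop body executes six times: the first iterations fill the pipeline (only the load is enabled), the middle ones run load, add and store together, and the last ones drain it, all from a single copy of the loop body.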

  19. IA-64 / Itanium architecture: a VLIW?
     • Yes, but:
         • Instructions contain only one operation; the compiler can indicate that successive instructions can be executed in parallel
         • Hardware does the operation-to-FU binding
         • Pipeline latencies are not visible in the ISA
     • These measures make the ISA independent of the number of FUs and of pipeline latencies, so the ISA supports multiple implementations

  20. Montecito 2006: dual 11-issue cores

  21. Tukwila: 4-core Itanium, 2009

  22. How further? (Burton Smith, Microsoft, 2005)
