
IA-64 Architecture (Think Intel Itanium)



Presentation Transcript


  1. IA-64 Architecture (Think Intel Itanium)
     Also known as EPIC (Explicitly Parallel Instruction Computing), a new kind of superscalar computer.
     HW 5 - Due 12/4
     Please clean up boards in lab by Dec 3:
     • Put good wires in the box
     • Take chips off of the board using chip puller
     • Put parts away in the proper bins
     THANKS!

  2. Superpipelined & Superscalar Machines
     Superpipelined machine:
     • Superpipelined machines overlap pipe stages
     • Relies on stages being able to begin operations before the previous one is complete
     Superscalar machine:
     • A superscalar machine employs multiple independent pipelines to execute multiple independent instructions in parallel
     • Particularly common instructions (arithmetic, load/store, conditional branch) can be executed independently

  3. Why A New Architecture Direction?
     Processor designers' obvious choices for using the increasing number of transistors on chip and the extra speed:
     • Bigger caches → diminishing returns
     • Increase degree of superscaling by adding more execution units → complexity wall: more logic, need improved branch prediction, more renaming registers, more complicated dependencies
     • Multiple processors → challenge to use them effectively in general computing
     • Longer pipelines → greater penalty for misprediction

  4. IA-64: Background
     • Explicitly Parallel Instruction Computing (EPIC), jointly developed by Intel & Hewlett-Packard (HP)
     • New 64-bit architecture
       • Not an extension of the x86 series
       • Not an adaptation of HP's 64-bit RISC architecture
     • To exploit increasing chip transistors and increasing speeds
     • Utilizes systematic parallelism
     • Departure from superscalar trend
     Note: Became the architecture of the Intel Itanium

  5. Basic Concepts for IA-64
     • Instruction-level parallelism
       • EXPLICIT in machine instruction, rather than determined at run time by processor
     • Long or very long instruction words (LIW/VLIW)
       • Fetch bigger chunks already “preprocessed”
     • Predicated execution
       • Marking groups of instructions for a late decision on “execution”
     • Control speculation
       • Go ahead and fetch & decode instructions, but keep track of them so the decision to “issue” them, or not, can be practically made later
     • Data speculation (or speculative loading)
       • Go ahead and load data early so it is ready when needed, and have a practical way to recover if speculation proved wrong
     • Software pipelining
       • Multiple iterations of a loop can be executed in parallel

  6. General Organization

  7. Predicate Registers
     • Used as a flag for instructions that may or may not be executed.
     • A set of instructions is assigned a predicate register when it is uncertain whether the instruction sequence will actually be executed (think branch).
     • Only instructions with a predicate value of true are executed.
     • When it is known that the instruction is going to be executed, its predicate is set. All instructions with that predicate true can now be completed.
     • Those instructions with predicate false are now candidates for cleanup.

  8. Predication

  9. Speculative Loading

  10. General Organization

  11. IA-64 Key Hardware Features
      • Large number of registers
        • IA-64 instruction format assumes 256 registers
          • 128 × 64-bit integer, logical & general purpose
          • 128 × 82-bit floating point and graphic
        • 64 predicated execution registers
        • (To support high degree of parallelism)
      • Multiple execution units
        • Probably pipelined
        • 8 or more?

  12. IA-64 Register Set

  13. Relationship between Instruction Type & Execution Unit

  14. IA-64 Execution Units
      • I-Unit
        • Integer arithmetic
        • Shift and add
        • Logical
        • Compare
        • Integer multimedia ops
      • M-Unit
        • Load and store
          • Between register and memory
        • Some integer ALU operations
      • B-Unit
        • Branch instructions
      • F-Unit
        • Floating point instructions

  15. Instruction Format Diagram

  16. Instruction Format
      128-bit bundles
      • Can fetch one or more bundles at a time
      • Bundle holds three instructions plus template
      • Instructions are usually 41 bits long
        • Have associated predicated execution registers
      • Template contains info on which instructions can be executed in parallel
        • Not confined to single bundle
        • e.g. a stream of 8 instructions may be executed in parallel
        • Compiler will have re-ordered instructions to form contiguous bundles
        • Can mix dependent and independent instructions in same bundle
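
As a rough illustration of the bundle layout on slide 16, the Python sketch below packs three 41-bit instruction slots and a template into one 128-bit integer. The slide gives neither the template width nor the field order: the 5-bit width follows from 128 − 3 × 41 = 5, and placing the template in the low-order bits is an assumption of this sketch. The function names are hypothetical.

```python
SLOT_BITS = 41        # width of one instruction slot (from the slide)
TEMPLATE_BITS = 5     # assumed: 128 - 3 * 41 = 5 bits left for the template
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into a 128-bit integer.

    Assumed layout (low to high bits): template, slot 0, slot 1, slot 2.
    """
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    )
    return template, slots
```

Packing the largest legal value into every field sets all 128 bits, which is a quick check that the four fields tile the bundle exactly with no overlap.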

  17. Field Encoding & Instr Set Mapping
      Note: A bar indicates a stop; instructions after the stop may depend on instructions before it.

  18. Assembly Language Format
          [qp] mnemonic[.comp] dest = srcs ;; //
      • qp - predicate register
        • 1 at execution → execute and commit result to hardware
        • 0 → result is discarded
      • mnemonic - name of instruction
      • comp - one or more instruction completers used to qualify mnemonic
      • dest - one or more destination operands
      • srcs - one or more source operands
      • ;; - instruction group stop (when appropriate)
        • A group is a sequence without read-after-write or write-after-write dependencies
        • Does not need hardware register dependency checks
      • // - comment follows

  19. Assembly Example
      Register dependency:
          ld8 r1 = [r5] ;;   // first group
          add r3 = r1, r4    // second group
      • Second instruction depends on value in r1
        • Changed by first instruction
        • Cannot be in same group for parallel execution
      • Note: ;; ends the group of instructions that can be executed in parallel

  20. Assembly Example
      Multiple register dependencies:
          ld8 r1 = [r5]      // first group
          sub r6 = r8, r9 ;; // first group
          add r3 = r1, r4    // second group
          st8 [r6] = r12     // second group
      • Last instruction stores in the memory location whose address is in r6, which is established in the second instruction
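
The grouping rule from slides 18 to 20 (no read-after-write or write-after-write register dependency inside a group) can be sketched as a small Python routine that decides where a stop has to go. This is a simplified model, not how a real IA-64 assembler or compiler works: it tracks register names only, ignores memory dependencies and the architecture's special cases, and the `(text, dests, srcs)` tuple format is invented for the example.

```python
def place_stops(instructions):
    """Greedily split a sequence of instructions into instruction groups.

    Each instruction is a (text, dests, srcs) tuple of register names.
    A new group starts when an instruction reads (RAW) or rewrites (WAW)
    a register that was already written earlier in the current group.
    """
    groups, current, written = [], [], set()
    for text, dests, srcs in instructions:
        if written & (set(srcs) | set(dests)):
            groups.append(current)          # this is where ";;" would go
            current, written = [], set()
        current.append(text)
        written |= set(dests)
    if current:
        groups.append(current)
    return groups


# The four instructions from slide 20, with registers listed by hand.
SLIDE_20 = [
    ("ld8 r1 = [r5]",   ["r1"], ["r5"]),
    ("sub r6 = r8, r9", ["r6"], ["r8", "r9"]),
    ("add r3 = r1, r4", ["r3"], ["r1", "r4"]),
    ("st8 [r6] = r12",  [],     ["r6", "r12"]),
]
```

Calling `place_stops(SLIDE_20)` puts the load and the subtract in one group and the add and the store in the next, which is the split the slide marks with `;;`.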

  21. Assembly Example – Predicated Code
      Consider the following program with branches:
          if (a && b)
              j = j + 1;
          else if (c)
              k = k + 1;
          else
              k = k - 1;
          i = i + 1;

  22. Assembly Example – Predicated Code
      Source code:
          if (a && b)
              j = j + 1;
          else if (c)
              k = k + 1;
          else
              k = k - 1;
          i = i + 1;
      Pentium assembly code:
              cmp a, 0    ; compare a with 0
              je  L1      ; branch to L1 if a = 0
              cmp b, 0
              je  L1
              add j, 1    ; j = j + 1
              jmp L3
          L1: cmp c, 0
              je  L2
              add k, 1    ; k = k + 1
              jmp L3
          L2: sub k, 1    ; k = k - 1
          L3: add i, 1    ; i = i + 1

  23. Assembly Example – Predicated Code
      Source code:
          if (a && b)
              j = j + 1;
          else if (c)
              k = k + 1;
          else
              k = k - 1;
          i = i + 1;
      Pentium code:
              cmp a, 0
              je  L1
              cmp b, 0
              je  L1
              add j, 1
              jmp L3
          L1: cmp c, 0
              je  L2
              add k, 1
              jmp L3
          L2: sub k, 1
          L3: add i, 1
      IA-64 code:
               cmp.eq p1, p2 = 0, a ;;
          (p2) cmp.eq p1, p3 = 0, b
          (p3) add j = 1, j
          (p1) cmp.ne p4, p5 = 0, c
          (p4) add k = 1, k
          (p5) add k = -1, k
               add i = 1, i
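
To see that the predicated IA-64 sequence on slide 23 computes the same result as the branching source, the sketch below models each predicated instruction in Python. It is a behavioural model only: every `if p[...]` stands for a qualifying predicate that lets an instruction take effect or be discarded, not for a branch, and the instructions are executed strictly in order. The slide does not say what the predicates hold initially; starting them all at false is an assumption of this sketch.

```python
def branching(a, b, c, i, j, k):
    """The source program from the slide, written with ordinary branches."""
    if a and b:
        j += 1
    elif c:
        k += 1
    else:
        k -= 1
    i += 1
    return i, j, k


def predicated(a, b, c, i, j, k):
    """Instruction-by-instruction model of the IA-64 code on slide 23."""
    # Assumption: predicates start out false (the slide leaves this open).
    p = {name: False for name in ("p1", "p2", "p3", "p4", "p5")}

    # cmp.eq p1, p2 = 0, a      (p1 = result, p2 = complement)
    p["p1"], p["p2"] = (a == 0), (a != 0)
    # (p2) cmp.eq p1, p3 = 0, b
    if p["p2"]:
        p["p1"], p["p3"] = (b == 0), (b != 0)
    # (p3) add j = 1, j
    if p["p3"]:
        j += 1
    # (p1) cmp.ne p4, p5 = 0, c
    if p["p1"]:
        p["p4"], p["p5"] = (c != 0), (c == 0)
    # (p4) add k = 1, k
    if p["p4"]:
        k += 1
    # (p5) add k = -1, k
    if p["p5"]:
        k -= 1
    # add i = 1, i              (unpredicated, always executes)
    i += 1
    return i, j, k
```

Running both functions over all eight combinations of `a`, `b`, `c` in {0, 1} gives identical `(i, j, k)` results, which is the point of the slide: the control flow has been turned into data flow through p1 to p5.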

  24. Example of Predication

  25. Data Speculation
      • Load data from memory before needed
      • What might go wrong?
        • Load moved before a store that might alter the memory location
        • Need a subsequent check on the value

  26. Assembly Example – Data Speculation
      Consider the following program:
          (p1) br some_label      // cycle 0
               ld8 r1 = [r5] ;;   // cycle 0 (indirect memory op – 2 cycles)
               add r1 = r1, r3    // cycle 2

  27. Assembly Example – Data Speculation
      Consider the following program:
      Original code:
          (p1) br some_label      // cycle 0
               ld8 r1 = [r5] ;;   // cycle 0
               add r1 = r1, r3    // cycle 2
      Speculated code:
               ld8.s r1 = [r5] ;; // cycle -2
               // other instructions
          (p1) br some_label      // cycle 0
               chk.s r1, recovery // cycle 0
               add r2 = r1, r3    // cycle 0
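
The `ld8.s` / `chk.s` pair on slide 27 can be modelled in a few lines of Python. In IA-64 terminology this load-hoisted-above-a-branch pattern is usually called control speculation, and the deferred-fault marker is the NaT ("Not a Thing") bit; neither term appears on the slide. Everything else here is a simplification for illustration: memory is a dictionary, a missing key stands for a faulting address, and the function names are hypothetical.

```python
def ld8_s(memory, addr):
    """Speculative load: never faults, returns (value, nat).

    If the address is invalid, the fault is deferred by setting the
    NaT flag instead of raising an exception.
    """
    if addr not in memory:
        return 0, True
    return memory[addr], False


def ld8(memory, addr):
    """Ordinary load: an invalid address faults immediately."""
    if addr not in memory:
        raise MemoryError(f"fault at address {addr:#x}")
    return memory[addr]


def run_control_speculation(memory, r5, r3, p1):
    """Model of the speculated code: load first, branch and check later."""
    r1, nat = ld8_s(memory, r5)      # ld8.s r1 = [r5], hoisted above the branch
    if p1:                           # (p1) br some_label
        return None                  # result unused, deferred fault never surfaces
    if nat:                          # chk.s r1, recovery
        r1 = ld8(memory, r5)         # recovery: repeat the load non-speculatively
    return r1 + r3                   # add r2 = r1, r3
```

The useful property is the middle case: when the branch is taken, a bad address in `r5` causes no fault at all, because the original program would never have executed the load on that path.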

  28. Assembly Example – Data Speculation
      Consider the following program:
          st8 [r4] = r12       // cycle 0
          ld8 r6 = [r8] ;;     // cycle 0 (indirect memory op – 2 cycles)
          add r5 = r6, r7 ;;   // cycle 2
          st8 [r18] = r5       // cycle 3
      What if r4 and r8 point to the same address?

  29. Assembly Example – Data Speculation
      Consider the following program:
      Without data speculation:
          st8 [r4] = r12       // cycle 0
          ld8 r6 = [r8] ;;     // cycle 0
          add r5 = r6, r7 ;;   // cycle 2
          st8 [r18] = r5       // cycle 3
      With data speculation:
          ld8.a r6 = [r8] ;;   // cycle -2, advanced load
          // other instructions
          st8 [r4] = r12       // cycle 0
          ld8.c r6 = [r8]      // cycle 0, check load
          add r5 = r6, r7 ;;   // cycle 0
          st8 [r18] = r5       // cycle 1

  30. Assembly Example – Data Speculation
      Data dependencies:
      Speculation:
          ld8.a r6 = [r8] ;;   // cycle -2
          // other instructions
          st8 [r4] = r12       // cycle 0
          ld8.c r6 = [r8]      // cycle 0
          add r5 = r6, r7 ;;   // cycle 0
          st8 [r18] = r5       // cycle 1
      Speculation with data dependency:
                   ld8.a r6 = [r8] ;;   // cycle -3, advanced load
                   // other instructions
                   add r5 = r6, r7      // cycle -1, uses r6
                   // other instructions
                   st8 [r4] = r12       // cycle 0
                   chk.a r6, recover    // cycle 0, check
          back:                         // return point
                   st8 [r18] = r5       // cycle 0
          recover: ld8 r6 = [r8] ;;     // get r6 from [r8]
                   add r5 = r6, r7 ;;   // re-execute
                   br back              // jump back
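
The advanced-load / check-load sequence on slides 29 and 30 depends on the hardware remembering which addresses were loaded early. IA-64 does this with a structure called the Advanced Load Address Table (ALAT), which the slides do not name. The Python sketch below is a deliberately small model of that idea, with a dictionary as the table and hypothetical method names; it is not a description of the real hardware's organisation.

```python
class ALAT:
    """Toy model: remembers the address behind each advanced load."""

    def __init__(self):
        self.entries = {}                      # register name -> address

    def advanced_load(self, reg, addr):
        self.entries[reg] = addr               # ld8.a allocates an entry

    def store(self, addr):
        # A store to a tracked address invalidates the matching entries.
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def is_valid(self, reg):
        return reg in self.entries


def run_data_speculation(memory, r4, r8, r12, r7):
    """Model of the 'with data speculation' code; returns (r5, reloaded)."""
    alat = ALAT()
    r6 = memory[r8]                            # ld8.a r6 = [r8]
    alat.advanced_load("r6", r8)
    memory[r4] = r12                           # st8 [r4] = r12
    alat.store(r4)
    reloaded = not alat.is_valid("r6")         # ld8.c r6 = [r8]
    if reloaded:
        r6 = memory[r8]                        # entry gone: load again
    r5 = r6 + r7                               # add r5 = r6, r7
    return r5, reloaded
```

When `r4` and `r8` differ, the early value is kept and the check costs nothing; when they are equal (the "what if" on slide 28), the check reloads and the add sees the stored value.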

  31. Software Pipelining
          // y[i] = x[i] + c
          L1: ld4 r4 = [r5], 4 ;;   // cycle 0, load postinc 4
              add r7 = r4, r9 ;;    // cycle 2
              st4 [r6] = r7, 4      // cycle 3, store postinc 4
              br.cloop L1 ;;        // cycle 3
      • Adds constant to one vector and stores result in another
      • No opportunity for instruction-level parallelism in one iteration
      • Instructions in iteration x are all executed before iteration x+1 begins
      • If there are no address conflicts between loads and stores, independent instructions can be moved from iteration x+1 to iteration x

  32. Pipeline - Unrolled Loop, Pipeline Display
      Original loop:
          L1: ld4 r4 = [r5], 4 ;;   // cycle 0, load postinc 4
              add r7 = r4, r9 ;;    // cycle 2
              st4 [r6] = r7, 4      // cycle 3, store postinc 4
              br.cloop L1 ;;        // cycle 3
      Unrolled loop:
          ld4 r32 = [r5], 4 ;;      // cycle 0
          ld4 r33 = [r5], 4 ;;      // cycle 1
          ld4 r34 = [r5], 4         // cycle 2
          add r36 = r32, r9 ;;      // cycle 2
          ld4 r35 = [r5], 4         // cycle 3
          add r37 = r33, r9         // cycle 3
          st4 [r6] = r36, 4 ;;      // cycle 3
          ld4 r36 = [r5], 4         // cycle 4
          add r38 = r34, r9         // cycle 4
          st4 [r6] = r37, 4 ;;      // cycle 4
          add r39 = r35, r9         // cycle 5
          st4 [r6] = r38, 4 ;;      // cycle 5
          add r40 = r36, r9         // cycle 6
          st4 [r6] = r39, 4 ;;      // cycle 6
          st4 [r6] = r40, 4 ;;      // cycle 7

  33. Unrolled Loop Observations
      • Completes 5 iterations in 7 cycles
        • Compared with 20 cycles in original code
      • Assumes two memory ports
        • Load and store can be done in parallel
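
The cycle counts on slide 33 can be reproduced with a short Python sketch. It assumes the timing shown in the listings: a load issued in cycle n feeds an add in cycle n+2, the add feeds a store one cycle later, and the pipelined version starts one new iteration every cycle. The constants and function names are invented for this example.

```python
LOAD_TO_ADD = 2      # cycles between ld4 and the add that uses it
ADD_TO_STORE = 1     # cycles between the add and the st4 that uses it


def sequential_cycles(iterations):
    """Original loop: each iteration occupies cycles 0..3 before the next starts."""
    return iterations * (LOAD_TO_ADD + ADD_TO_STORE + 1)


def pipelined_schedule(iterations):
    """Map cycle number -> operations issued, starting one iteration per cycle."""
    schedule = {}
    for i in range(iterations):
        schedule.setdefault(i, []).append(f"ld4 x[{i}]")
        schedule.setdefault(i + LOAD_TO_ADD, []).append(f"add {i}")
        schedule.setdefault(i + LOAD_TO_ADD + ADD_TO_STORE, []).append(f"st4 y[{i}]")
    return schedule
```

For five iterations, `sequential_cycles(5)` gives the slide's 20 cycles, and the last entry of `pipelined_schedule(5)` is cycle 7, matching the final `st4` in the unrolled listing. No cycle in the schedule holds more than one load and one store, which is where the two-memory-port assumption comes from.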

  34. Support For Software Pipelining
      • Automatic register renaming
        • Fixed-size area of predicate and fp register files (p16-p63, fr32-fr127) and programmable-size area of gp register file (max r32-r127) capable of rotation
        • Loop using r32 on first iteration automatically uses r33 on second
      • Predication
        • Each instruction in loop predicated on rotating predicate register
        • Determines whether pipeline is in prolog, kernel, or epilog
      • Special loop termination instructions
        • Branch instructions that cause registers to rotate and loop counter to decrement
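
Register rotation can be sketched in Python as a renaming base that moves by one on every loop branch, so that a value written to r32 in one pass is read as r33 in the next. This is one simplified reading of the mechanism on slide 34: the 8-register rotating area, the class and method names, and the two-stage loop are all assumptions made for illustration, and the stage conditions stand in for the rotating predicates that select prolog, kernel, and epilog.

```python
class RotatingFile:
    """Toy rotating register area starting at logical register r32."""

    def __init__(self, size=8, base=32):
        self.size, self.base = size, base
        self.rrb = 0                           # register rename base
        self.physical = [0] * size

    def _index(self, logical):
        return (logical - self.base + self.rrb) % self.size

    def read(self, logical):
        return self.physical[self._index(logical)]

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def rotate(self):
        # Done by the loop branch: r32's value becomes visible as r33.
        self.rrb = (self.rrb - 1) % self.size


def pipelined_add(x, c):
    """y[i] = x[i] + c as a two-stage software pipeline."""
    regs, y, n = RotatingFile(), [], len(x)
    for t in range(n + 1):                     # one extra pass drains the pipeline
        if t < n:                              # stage 1 active: load x[t] into r32
            regs.write(32, x[t])
        if t >= 1:                             # stage 2 active: last pass's load is r33
            y.append(regs.read(33) + c)
        regs.rotate()                          # loop branch rotates the registers
    return y
```

The loop body always names the same two registers, r32 and r33, yet each pass works on different data. That is what removes the need to unroll the loop by hand as in slide 32.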

  35. Intel’s Itanium Implements the IA-64
