IA-64 Architecture (Think Intel Itanium)

also known as

(EPIC: Explicitly Parallel Instruction Computing)

a new kind of superscalar computer

HW 5 - Due 12/4

Please clean up boards in lab by Dec 3

* Put good wires in the box

* Take chips off of the board using chip puller

* Put parts away in the proper bins.

* THANKS!

Superpipelined & Superscalar Machines

Superpipelined machine:

  • A superpipelined machine overlaps pipeline stages
    • Relies on a stage being able to begin its operation before the previous one is complete.

Superscalar machine:

  • A superscalar machine employs multiple independent pipelines to execute multiple independent instructions in parallel.
    • Particularly common instructions (arithmetic, load/store, conditional branch) can be executed independently.
Why A New Architecture Direction?

Processor designers' obvious choices for using the increasing number of transistors on a chip and the extra speed:

  • Bigger caches → diminishing returns
  • Increase degree of superscaling by adding more execution units → complexity wall: more logic, need for improved branch prediction, more renaming registers, more complicated dependencies
  • Multiple processors → challenge to use them effectively in general computing
  • Longer pipelines → greater penalty for misprediction
IA-64: Background
  • Explicitly Parallel Instruction Computing (EPIC)
    • Jointly developed by Intel & Hewlett-Packard (HP)
  • New 64-bit architecture
    • Not an extension of the x86 series
    • Not an adaptation of HP's 64-bit RISC architecture
  • Designed to exploit increasing transistor counts and increasing speeds
  • Utilizes systematic parallelism
  • Departure from the superscalar trend

Note: Became the architecture of the Intel Itanium

Basic Concepts for IA-64
  • Instruction-level parallelism
    • EXPLICIT in the machine instructions, rather than determined at run time by the processor
  • Long or very long instruction words (LIW/VLIW)
    • Fetch bigger chunks, already “preprocessed”
  • Predicated execution
    • Marking groups of instructions for a late decision on “execution”
  • Control speculation
    • Go ahead and fetch & decode instructions, but keep track of them so the decision to “issue” them, or not, can be practically made later
  • Data speculation (or speculative loading)
    • Go ahead and load data early so it is ready when needed, and have a practical way to recover if the speculation proves wrong
  • Software pipelining
    • Multiple iterations of a loop can be executed in parallel
Predicate Registers
  • Used as a flag for instructions that may or may not be executed.
  • A set of instructions is assigned a predicate register when it is uncertain whether the instruction sequence will actually be executed (think branch).
  • Only instructions with a predicate value of true are executed.
  • When it is known that the instruction is going to be executed, its predicate is set. All instructions with that predicate true can now be completed.
  • Those instructions with predicate false are now candidates for cleanup.
IA-64 Key Hardware Features
  • Large number of registers
    • IA-64 instruction format assumes 256 registers
      • 128 × 64-bit general-purpose registers (integer and logical)
      • 128 × 82-bit floating-point and graphics registers
    • 64 one-bit predicate registers for predicated execution

(To support a high degree of parallelism)

  • Multiple execution units
    • Probably pipelined
    • 8 or more?
IA-64 Execution Units
  • I-Unit
    • Integer arithmetic
    • Shift and add
    • Logical
    • Compare
    • Integer multimedia ops
  • M-Unit
    • Load and store
      • Between register and memory
    • Some integer ALU operations
  • B-Unit
    • Branch instructions
  • F-Unit
    • Floating point instructions
Instruction Format

128-bit bundles

  • Can fetch one or more bundles at a time
  • Bundle holds three instructions plus a template
  • Instructions are usually 41 bits long
    • Each has an associated predicate register
  • Template contains info on which instructions can be executed in parallel
    • Parallelism is not confined to a single bundle
    • e.g. a stream of 8 instructions may be executed in parallel
    • Compiler will have re-ordered instructions to form contiguous bundles
    • Can mix dependent and independent instructions in the same bundle
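
The bundle arithmetic above can be checked with a short Python sketch: three 41-bit slots leave 128 - 3 × 41 = 5 bits for the template. The function names are hypothetical, and the bit positions (template in the low five bits, slots stacked above it) are an assumption, since the slide does not give the layout.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 128 - 3 * SLOT_BITS      # 5 bits remain for the template
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three 41-bit slots into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template does not fit in 5 bits")
    bundle = template                     # assumed layout: template in bits 0..4
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("slot does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    ]
    return template, slots
```

Packing and then unpacking returns the original fields, which confirms that the four fields tile the 128 bits without overlap.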
Field Encoding & Instr Set Mapping

Note: A bar indicates a stop: instructions after the stop may depend on instructions before it.

Assembly Language Format

[qp] mnemonic[.comp] dest = srcs ;; // comment

  • qp - qualifying predicate register
      • 1 at execution → execute and commit result to hardware
      • 0 → result is discarded
  • mnemonic - name of instruction
  • comp - one or more instruction completers used to qualify the mnemonic
  • dest - one or more destination operands
  • srcs - one or more source operands
  • ;; - instruction group stop (when appropriate)
      • An instruction group is a sequence without read-after-write or write-after-write dependencies
      • Hardware does not need register dependency checks within a group
  • // - comment follows
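
As an illustration of the format above, here is a minimal Python sketch that splits one assembly line into its fields. The function name parse_line and the dictionary keys are hypothetical, and the parser assumes the predicate is written in parentheses, as in the examples on the later slides; it is not a full IA-64 assembler grammar.

```python
def parse_line(line):
    """Split '[(qp)] mnemonic[.comp] dest = srcs ;; // comment' into fields."""
    code, _, comment = line.partition("//")
    code = code.strip()

    stop = code.endswith(";;")            # ';;' marks the end of a group
    if stop:
        code = code[:-2].strip()

    qp = None
    if code.startswith("("):              # optional qualifying predicate
        qp, _, code = code[1:].partition(")")
        qp, code = qp.strip(), code.strip()

    head, _, operands = code.partition(" ")
    mnemonic, *completers = head.split(".")

    dest, equals, srcs = operands.partition("=")
    if not equals:                        # no '=': all operands are sources
        dest, srcs = "", operands

    def split(text):
        return [item.strip() for item in text.split(",") if item.strip()]

    return {
        "qp": qp,
        "mnemonic": mnemonic,
        "completers": completers,
        "dests": split(dest),
        "srcs": split(srcs),
        "stop": stop,
        "comment": comment.strip(),
    }
```

For example, parse_line("(p1) cmp.ne p4, p5 = 0, c") yields predicate p1, mnemonic cmp, completer ne, destinations p4 and p5, and sources 0 and c.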
Assembly Example

Register Dependency:

    ld8 r1 = [r5] ;;    // first group
    add r3 = r1, r4     // second group

  • Second instruction depends on value in r1
    • Changed by first instruction
    • Cannot be in same group for parallel execution
  • Note ;; ends the group of instructions that can be executed in parallel

Assembly Example

Multiple Register Dependencies:

    ld8 r1 = [r5]       // first group
    sub r6 = r8, r9 ;;  // first group
    add r3 = r1, r4     // second group
    st8 [r6] = r12      // second group

  • Last instruction stores in the memory location whose address is in r6, which is established in the second instruction
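
The grouping rule in these two examples can be sketched in Python: split the instruction stream at each stop, then look for a register that is written and later read (RAW) or written twice (WAW) inside one group. The tuple layout and the function names are hypothetical, and the check covers only the basic rule stated on the slides, not any special cases the real architecture may allow.

```python
def split_groups(instructions):
    """Split (text, writes, reads, stop) tuples into groups at each ';;'."""
    groups, current = [], []
    for instruction in instructions:
        current.append(instruction)
        if instruction[3]:                # stop flag ends the current group
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def hazards(group):
    """Return (kind, register, text) for each RAW or WAW hazard in a group."""
    written = set()
    found = []
    for text, writes, reads, _stop in group:
        for register in reads:
            if register in written:
                found.append(("RAW", register, text))
        for register in writes:
            if register in written:
                found.append(("WAW", register, text))
        written.update(writes)
    return found


# The four-instruction example, with the stop after the sub.
program = [
    ("ld8 r1 = [r5]",   ("r1",), ("r5",),       False),
    ("sub r6 = r8, r9", ("r6",), ("r8", "r9"),  True),
    ("add r3 = r1, r4", ("r3",), ("r1", "r4"),  False),
    ("st8 [r6] = r12",  (),      ("r6", "r12"), False),
]

for group in split_groups(program):
    print([text for text, *_ in group], hazards(group))
```

With the stop in place both groups are hazard-free; clearing the stop flag merges them into one group, and the checker then reports the reads of r1 and r6.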

Assembly Example: Predicated Code

Consider the following program with branches:

    if (a && b)
        j = j + 1;
    else
        if (c)
            k = k + 1;
        else
            k = k - 1;
    i = i + 1;

Pentium Assembly Code

        cmp a, 0      ; compare a with 0
        je L1         ; branch to L1 if a = 0
        cmp b, 0
        je L1
        add j, 1      ; j = j + 1
        jmp L3
    L1: cmp c, 0
        je L2
        add k, 1      ; k = k + 1
        jmp L3
    L2: sub k, 1      ; k = k - 1
    L3: add i, 1      ; i = i + 1

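IA-64 Code

          cmp.eq p1, p2 = 0, a ;;    // p1 = (a == 0), p2 = (a != 0)
    (p2)  cmp.eq p1, p3 = 0, b       // if a != 0: p1 = (b == 0), p3 = (b != 0)
    (p3)  add j = 1, j               // a && b: j = j + 1
    (p1)  cmp.ne p4, p5 = 0, c       // else branch: p4 = (c != 0), p5 = (c == 0)
    (p4)  add k = 1, k               // k = k + 1
    (p5)  add k = -1, k              // k = k - 1
          add i = 1, i               // always: i = i + 1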
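
To check that the branch-free IA-64 sequence computes the same result as the branching source, the following Python sketch models each predicated instruction as a statement guarded by its predicate. It assumes p1 to p5 start cleared and that a compare whose own predicate is false leaves its target predicates unchanged; the function names are hypothetical. The Python `if` statements stand in for predicate gating, not for machine branches.

```python
from itertools import product


def branchy(a, b, c, i, j, k):
    """Reference semantics: the source-level if/else chain."""
    if a and b:
        j += 1
    elif c:
        k += 1
    else:
        k -= 1
    i += 1
    return i, j, k


def predicated(a, b, c, i, j, k):
    """Straight-line version mirroring the seven IA-64 instructions."""
    p = {n: False for n in range(1, 6)}    # assumption: p1..p5 start cleared

    p[1], p[2] = (a == 0), (a != 0)        # cmp.eq p1, p2 = 0, a
    if p[2]:                               # (p2) cmp.eq p1, p3 = 0, b
        p[1], p[3] = (b == 0), (b != 0)
    if p[3]:                               # (p3) add j = 1, j
        j += 1
    if p[1]:                               # (p1) cmp.ne p4, p5 = 0, c
        p[4], p[5] = (c != 0), (c == 0)
    if p[4]:                               # (p4) add k = 1, k
        k += 1
    if p[5]:                               # (p5) add k = -1, k
        k -= 1
    i += 1                                 # add i = 1, i
    return i, j, k


# Compare both versions on every combination of truth values for a, b, c.
agree = all(
    branchy(a, b, c, 0, 0, 0) == predicated(a, b, c, 0, 0, 0)
    for a, b, c in product((0, 1), repeat=3)
)
print(agree)
```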

Data Speculation
  • Load data from memory before it is needed
  • What might go wrong?
    • Load moved before a store that might alter the memory location
    • Need a subsequent check of the loaded value
Assembly Example: Control Speculation

Consider the following program:

Original code

    (p1) br some_label     // cycle 0
    ld8 r1 = [r5] ;;       // cycle 0 (indirect memory op, 2 cycles)
    add r2 = r1, r3        // cycle 2

Speculated code

    ld8.s r1 = [r5] ;;     // cycle -2, speculative load hoisted above the branch
    // other instructions
    (p1) br some_label     // cycle 0
    chk.s r1, recovery     // cycle 0, check the speculative load
    add r2 = r1, r3        // cycle 0
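
A small Python model may help show why hoisting the load above the branch is safe. It assumes the usual IA-64 mechanism, which the slide does not spell out: a speculative load that would fault marks its target register with a deferred-exception token (NaT) instead of raising, and chk.s jumps to recovery code only if that token is set. The names ld_s, chk_s and run are hypothetical, and a missing dictionary key stands in for a faulting address.

```python
def ld_s(memory, address):
    """Speculative load: a faulting access yields (0, NaT=True), not an error."""
    if address in memory:
        return memory[address], False
    return 0, True                         # exception deferred, not raised


def chk_s(register, recovery):
    """Return the loaded value, or run the recovery code if NaT is set."""
    value, nat = register
    return recovery() if nat else value


def run(memory, r5, r3, p1):
    """Model of the speculated sequence; returns r2, or None if branch taken."""
    r1 = ld_s(memory, r5)                  # ld8.s r1 = [r5]  (hoisted)
    if p1:                                 # (p1) br some_label
        return None                        # speculative result is simply unused
    # Recovery repeats the load non-speculatively, so a bad address
    # faults here, at the point where the original load would have run.
    value = chk_s(r1, lambda: memory[r5])  # chk.s r1, recovery
    return value + r3                      # add r2 = r1, r3
```

If the branch is taken, a bad address in r5 causes no fault at all, which matches the behavior of the original code where the load would never have executed.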

Assembly Example: Data Speculation

Consider the following program:

    st8 [r4] = r12         // cycle 0
    ld8 r6 = [r8] ;;       // cycle 0 (indirect memory op, 2 cycles)
    add r5 = r6, r7 ;;     // cycle 2
    st8 [r18] = r5         // cycle 3

What if r4 and r8 point to the same address?

Assembly Example: Data Speculation

Without data speculation

    st8 [r4] = r12         // cycle 0
    ld8 r6 = [r8] ;;       // cycle 0
    add r5 = r6, r7 ;;     // cycle 2
    st8 [r18] = r5         // cycle 3

With data speculation

    ld8.a r6 = [r8] ;;     // cycle -2, advanced load
    // other instructions
    st8 [r4] = r12         // cycle 0
    ld8.c r6 = [r8]        // cycle 0, check load
    add r5 = r6, r7 ;;     // cycle 0
    st8 [r18] = r5         // cycle 1

Assembly Example: Data Speculation

Data Dependencies:

Speculation

    ld8.a r6 = [r8] ;;     // cycle -2
    // other instructions
    st8 [r4] = r12         // cycle 0
    ld8.c r6 = [r8]        // cycle 0
    add r5 = r6, r7 ;;     // cycle 0
    st8 [r18] = r5         // cycle 1

Speculation with data dependency

    ld8.a r6 = [r8] ;;     // cycle -3, advanced load
    // other instructions
    add r5 = r6, r7        // cycle -1, uses r6
    // other instructions
    st8 [r4] = r12         // cycle 0
    chk.a r6, recover      // cycle 0, check
    back:                  // return point
    st8 [r18] = r5         // cycle 0

    recover:
    ld8 r6 = [r8] ;;       // get r6 from [r8]
    add r5 = r6, r7 ;;     // re-execute
    br back                // jump back
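
The ld8.a / ld8.c pair can be modeled in Python with a small table of outstanding advanced loads (on Itanium this is the Advanced Load Address Table, a name the slides do not use). The class and method names are hypothetical, and addresses are treated as whole 8-byte cells, so partial overlaps are not modeled.

```python
class Machine:
    """Toy memory plus a table of outstanding advanced loads."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.advanced = {}                 # target register -> load address
        self.reloads = 0                   # how often a check load had to reload

    def ld_a(self, register, address):
        """Advanced load: load early and remember the address."""
        self.advanced[register] = address
        return self.memory[address]

    def st(self, address, value):
        """Store: also invalidate any advanced load from the same address."""
        self.memory[address] = value
        self.advanced = {
            reg: addr for reg, addr in self.advanced.items() if addr != address
        }

    def ld_c(self, register, address, current):
        """Check load: keep the early value if its entry survived, else reload."""
        if self.advanced.get(register) == address:
            return current
        self.reloads += 1
        return self.memory[address]


def run(memory, r4, r8, r12, r7, r18):
    m = Machine(memory)
    r6 = m.ld_a("r6", r8)                  # ld8.a r6 = [r8]
    m.st(r4, r12)                          # st8 [r4] = r12
    r6 = m.ld_c("r6", r8, r6)              # ld8.c r6 = [r8]
    r5 = r6 + r7                           # add r5 = r6, r7
    m.st(r18, r5)                          # st8 [r18] = r5
    return m.memory[r18], m.reloads
```

When r4 and r8 differ, the early value is used and no reload happens; when they are equal, the store removes the entry and the check load fetches the newly stored value, so the result matches the non-speculated program in both cases.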

Software Pipelining

    // y[i] = x[i] + c
    L1: ld4 r4 = [r5], 4 ;;    // cycle 0, load with post-increment of 4
        add r7 = r4, r9 ;;     // cycle 2
        st4 [r6] = r7, 4       // cycle 3, store with post-increment of 4
        br.cloop L1 ;;         // cycle 3

  • Adds a constant to one vector and stores the result in another
  • No opportunity for instruction-level parallelism within one iteration
  • Instructions in iteration x are all executed before iteration x+1 begins
  • If there are no address conflicts between the loads and stores, independent instructions can be moved from iteration x+1 to iteration x
Pipeline - Unrolled Loop, Pipeline Display

Unrolled loop

    ld4 r32 = [r5], 4 ;;    // cycle 0
    ld4 r33 = [r5], 4 ;;    // cycle 1
    ld4 r34 = [r5], 4       // cycle 2
    add r36 = r32, r9 ;;    // cycle 2
    ld4 r35 = [r5], 4       // cycle 3
    add r37 = r33, r9       // cycle 3
    st4 [r6] = r36, 4 ;;    // cycle 3
    ld4 r36 = [r5], 4       // cycle 4
    add r38 = r34, r9       // cycle 4
    st4 [r6] = r37, 4 ;;    // cycle 4
    add r39 = r35, r9       // cycle 5
    st4 [r6] = r38, 4 ;;    // cycle 5
    add r40 = r36, r9       // cycle 6
    st4 [r6] = r39, 4 ;;    // cycle 6
    st4 [r6] = r40, 4 ;;    // cycle 7


Pipeline Display

Unrolled Loop Observations
  • Completes 5 iterations in 7 cycles
    • Compared with 20 cycles in original code
  • Assumes two memory ports
    • Load and store can be done in parallel
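
The cycle counts above follow from the latencies in the loop comments (add two cycles after the load, store one cycle after the add). A Python sketch of the arithmetic, with hypothetical names, assuming one new iteration is started per cycle:

```python
LOAD_TO_ADD = 2    # cycles between issuing the load and the dependent add
ADD_TO_STORE = 1   # cycles between the add and the dependent store


def sequential_cycles(iterations):
    """Original loop: each iteration occupies cycles 0..3 before the next starts."""
    return iterations * (LOAD_TO_ADD + ADD_TO_STORE + 1)


def pipelined_schedule(iterations):
    """Start one iteration per cycle; map each cycle to the ops issued in it."""
    schedule = {}
    for i in range(iterations):
        schedule.setdefault(i, []).append(("ld", i))
        schedule.setdefault(i + LOAD_TO_ADD, []).append(("add", i))
        schedule.setdefault(i + LOAD_TO_ADD + ADD_TO_STORE, []).append(("st", i))
    return schedule


schedule = pipelined_schedule(5)
for cycle in sorted(schedule):
    print(cycle, sorted(schedule[cycle]))
```

The original loop needs 5 × 4 = 20 cycles, while in the overlapped schedule the last store issues in cycle 7. No cycle contains more than one load and one store, which is why two memory ports are enough.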
Support For Software Pipelining
  • Automatic register renaming
    • A fixed-size area of the predicate and FP register files (p16-p63, fr32-fr127) and a programmable-size area of the GP register file (at most r32-r127) are capable of rotation
    • A loop using r32 on the first iteration automatically uses r33 on the second
  • Predication
    • Each instruction in the loop is predicated on a rotating predicate register
      • Determines whether the pipeline is in prolog, kernel, or epilog
  • Special loop termination instructions
    • Branch instructions that cause the registers to rotate and the loop counter to decrement
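
Register rotation can be illustrated with a small Python model: a logical register number is mapped to a physical slot through a rotating base, so a value written as r32 is visible as r33 after one rotation. The class and function names are hypothetical. In the loop, the conditions `step < n` and `step >= 1` stand in for the rotating stage predicates that switch the load and add stages on and off during prolog and epilog; this is a simplified two-stage sketch, not the four-cycle schedule shown on the earlier slide.

```python
class RotatingFile:
    """Rotating region of a register file, e.g. r32-r127."""

    def __init__(self, first=32, size=96):
        self.first = first                 # lowest rotating register number
        self.size = size                   # number of rotating registers
        self.base = 0                      # rotating register base
        self.physical = [0] * size

    def _index(self, logical):
        return (logical - self.first + self.base) % self.size

    def read(self, logical):
        return self.physical[self._index(logical)]

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def rotate(self):
        """Done by the loop branch: r32's value becomes visible as r33."""
        self.base = (self.base - 1) % self.size


def add_constant(x, c):
    """y[i] = x[i] + c with the load one stage ahead of the add."""
    regs = RotatingFile()
    y = []
    n = len(x)
    for step in range(n + 1):
        if step < n:                       # load stage active (prolog, kernel)
            regs.write(32, x[step])
        if step >= 1:                      # add stage active (kernel, epilog)
            y.append(regs.read(33) + c)    # value loaded one rotation ago
        regs.rotate()
    return y
```

The same loop body names r32 and r33 on every pass, yet each pass works on a different element, which is the renaming effect the slide describes.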