The IA-64 architecture and Itanium processors Explicitly Parallel Instruction Computing

Frans Dondorp
Presentation et 4 074, January 8th 2001

Contents Introduction to the IA-64 architecture and EPIC The Itanium processor Branch removal Predication Speculative execution Control speculation Data speculation Comparison: ARM conditional instructions

Introduction to the IA-64 architecture
• Joint research by Intel and Hewlett-Packard (1994)
• Exploitation of the ILP (instruction-level parallelism) concept
• Tight coupling of hardware and software
EPIC (Explicitly Parallel Instruction Computing) is introduced as the basic concept. This results in:
• a more complex task for the compiler
• hardware support for communicating meta-information: speculation, predication and branch hints
“The future of computing” (Intel web site)

The Itanium processor The Itanium, formerly code-named Merced, is the first processor based on the IA-64 architecture Still a prototype, compilers announced (as of nov. 2000) 10-stage pipeline, running at 800Mhz To support EPIC, it is equipped with: 4 ALU’s, 4 MMX units, 4 FPU’s (2 SP, 2 DP), 2 L/S units, 3 br units MS Win2K and Linux announced (as of oct. 2000)

IA-64 resources and instructions
• Register resources

  Register file                 Count  Names          Width     Layout
  GR  General register          128    r0 to r127     64 + 1 b  r0 to r31 static; r32 to r127 stacked/rotating
  FR  Floating point register   128    f0 to f127     82 b      f0 to f31 static; f32 to f127 rotating
  AR  Application register      128    ar0 to ar127   64 b
  BR  Branch register           8      b0 to b7       64 b
  PR  Predicate register        64     p0 to p63      1 b       p0 to p15 static; p16 to p63 rotating

• The stacked and rotating registers support the register stack and software pipelining.
• The extra bit of each GR holds a deferred exception token (Not A Thing, NaT), used for control speculation.
• The BRs are used for function call linkage and return (64 b address space!).
• A PR holds the result of a conditional expression evaluation, used for predication.

IA-64 resources and instructions
• Instruction encoding

Templates are used to group instructions to exploit parallel execution by keeping the execution units busy. Predicates are used to allow for conditional execution; 6 bits are used to address the 64 predicate registers.

IA-64 “bundle” (128 b):
  Instruction 2 (41 b) | Instruction 1 (41 b) | Instruction 0 (41 b) | Template (5 b)

Instruction format (41 b):
  Op (14 b) | Reg 1 (7 b) | Reg 2 (7 b) | Reg 3 (7 b) | Predicate (6 b)

Example of two bundles:
  { .mii
    ld8 r1 = 4[r2]
    add r3 = r1, r3
    shr r7 = r4, r12
  }
  { .mbb
    ld8 r6 = 8[r5]
    (p3) br.cond Label1
    (p4) br.cond Label2
  }

The Itanium processor issues 8 ops/clock. The slots of the two bundles above (M I I and M B B) are dispersed over the available units: 4 ALU, 2 L/S, 2 FP (SP), 2 FP (DP), 3 BR and 4 MMX.
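The bundle and instruction layouts above can be sketched as plain bit packing. This is a minimal illustration, not an encoder for real IA-64 opcodes: the field widths come from the slide, while the placement of the fields (template in the lowest bits, instruction 0 next to it, predicate in the lowest bits of a slot), the function names and the example values are assumptions made for the sketch.

```python
# Field widths taken from the slide; bit positions are assumptions of this sketch.
TEMPLATE_BITS = 5
SLOT_BITS = 41
FIELDS = (("op", 14), ("reg1", 7), ("reg2", 7), ("reg3", 7), ("predicate", 6))


def encode_instruction(**values):
    """Pack the five fields into one 41-bit word (op highest, predicate lowest)."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_instruction(word):
    """Inverse of encode_instruction."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: returns (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
             for i in range(3)]
    return template, slots
```

The widths add up as the slide implies: 14 + 7 + 7 + 7 + 6 = 41 bits per instruction and 3 × 41 + 5 = 128 bits per bundle. The 7-bit register fields address 128 registers and the 6-bit predicate field addresses 64 predicate registers.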

Branch removal
• Branch prediction is costly
• The cost of a misprediction is proportional to the pipeline length
Optimizing the use of prediction resources can significantly improve overall performance. Conditional instructions can eliminate the need for branches.

With branches:
    cmp r1, r2
    beq equal
    mov r1, #0
    bal end
  .equal
    mov r2, #0
  .end

With conditional instructions:
    cmp r1, r2
    movne r1, #0
    moveq r2, #0    ; executes only if the eq-bit is set in the status register; else NOP

Branch removal – Conditional instructions
Conditional instructions can reduce the branch penalty due to a misprediction from N pipeline stages to 1.
• Implementing conditional instructions directly in the instruction encoding increases instruction size, while the number of conditions that can be tested is limited (typically to a few bits in the processor status register). Example: ARM conditional instructions.
• Unbalanced execution paths: conditional code can cost more than a branch misprediction and thus decrease performance.

Branch removal – Conditional instructions
Example: conditional code performance (one instruction executed each cycle)

Conditional code:
    cmp r1, r2
    moveq r1, #0
    addeq r2, r2, #10
    ldbeq r3, (r5)+
    inceq r3
    stbeq r3, (r5)+
    inceq r1
    mov r2, #0
If r1 ≠ r2: 6 NOPs. LOSS: 6

Code with a branch:
    cmp r1, r2
    bne end
    mov r1, #0
    add r2, r2, #10
    ldb r3, (r5)+
    inc r3
    stb r3, (r5)+
    inc r1
  .end
    mov r2, #0
If r1 ≠ r2 and the branch is mispredicted: pipeline flushed, branch penalty. LOSS: number of pipeline stages

On a machine with a 5-stage pipeline, conditional instructions would lead to a performance loss here. The compiler should decide!
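The comparison above can be written as a small cost model. This is a sketch under the slide's simplifications: one instruction per cycle, and a misprediction costing as many cycles as there are pipeline stages. The function name and the `mispredict_rate` parameter are assumptions added for illustration; the slide itself only compares the worst case, which corresponds to a rate of 1.0.

```python
def prefer_conditional(skipped_instructions, pipeline_stages, mispredict_rate=1.0):
    """Return True if conditional code loses fewer cycles than a branch.

    skipped_instructions: conditional instructions that turn into NOPs
                          when the condition is false (6 in the example).
    pipeline_stages:      flush penalty of a mispredicted branch.
    mispredict_rate:      assumed fraction of mispredicted branches
                          (hypothetical parameter, not on the slide).
    """
    conditional_loss = skipped_instructions
    expected_branch_loss = mispredict_rate * pipeline_stages
    return conditional_loss < expected_branch_loss


# 5-stage pipeline: 6 NOPs cost more than a 5-cycle flush, so keep the branch.
print(prefer_conditional(6, 5))    # False
# 10-stage pipeline: the flush is more expensive than the 6 NOPs.
print(prefer_conditional(6, 10))   # True
```

A compiler making this decision would also need an estimate of how often the condition is false, which this sketch leaves out.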

Predication
Predication: tagging instructions with a boolean value

    cmp.ne p1, p2 = r4, 0;;
    (p1) add r1 = r2, r3
    (p1) ld8 r6 = [r5]

• cmp.ne sets the boolean values: compare r4 to #0, not equal. p1 is TRUE if r4 ≠ 0; p2 = NOT(p1)
• (p1) add: if r4 ≠ 0 then r1 = (r2 + r3)
• (p1) ld8: if r4 ≠ 0 then r6 = MEM(r5)

Predication reduces the limitations of conditional instructions: with predication, the number of conditions that can be tested equals the number of predicate registers.
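The effect of the predicated sequence above can be modelled in a few lines. This is a minimal sketch, assuming registers are held in a dictionary and memory is a dictionary from address to value; the function name and this representation are illustrative, not part of IA-64.

```python
def run_predicated(regs, memory):
    """Model: cmp.ne p1, p2 = r4, 0 ;; (p1) add r1 = r2, r3 ; (p1) ld8 r6 = [r5]."""
    regs = dict(regs)                    # work on a copy
    p1 = regs["r4"] != 0                 # cmp.ne: TRUE if r4 != 0
    p2 = not p1                          # complementary predicate

    # Every instruction is issued; it only takes effect if its predicate is TRUE.
    program = [
        (p1, "r1", lambda: regs["r2"] + regs["r3"]),
        (p1, "r6", lambda: memory[regs["r5"]]),
    ]
    for predicate, dest, compute in program:
        if predicate:
            regs[dest] = compute()
        # else: the instruction behaves as a NOP, no branch is needed

    return regs, {"p1": p1, "p2": p2}


start = {"r1": 0, "r2": 2, "r3": 3, "r4": 7, "r5": 100, "r6": 0}
regs, preds = run_predicated(start, {100: 55})
print(regs["r1"], regs["r6"], preds)
```

With r4 = 0 both predicated instructions are squashed and r1 and r6 keep their old values, without any branch in the instruction stream.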

Predication – moving instructions
Advantages of predication:
• The compiler has more freedom when scheduling if predicates are guaranteed not to conflict.
• Code motion past branches and load/store operations results in speculative execution.
[Figure: code motion, upward and downward]

Speculative execution
• The compiler selects commonly executed blocks
• Instruction selection, prioritization and reordering
• To enable aggressive code motion by the compiler, explicitly speculative instructions must be available

Speculative execution – Control speculation
IA-64 provides speculative load instructions. The load instruction is replaced by a speculative load that is moved above the branch.

Original code:
    instrA
    instrB
    ...
    br
    ld8 r1 = [r2]
    use r1

With control speculation:
    ld8.s r1 = [r2]    ; NaT may be written in r1
    use r1
    instrA
    instrB
    ...
    br
    chk.s              ; speculation check

Exception handling: if a speculative load raises an exception, a deferred exception token (NaT) is written to the target register. This NaT is propagated by almost all instructions. chk.s checks for a NaT and, if one is present, jumps to compiler-generated fix-up code. This fix-up code may execute the load non-speculatively and return to the main code afterwards.
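The NaT mechanism described above can be imitated in software. This is a rough sketch, assuming memory is a dictionary in which a missing address stands in for a faulting load; the names `NAT`, `ld8_s`, `add` and `speculative_sequence` are hypothetical helpers, and the "use" of r1 is an arbitrary add.

```python
NAT = object()  # stand-in for the deferred exception token (Not A Thing)


def ld8_s(memory, addr):
    """Speculative load: a fault is deferred by returning NAT instead of raising."""
    try:
        return memory[addr]
    except KeyError:
        return NAT


def add(a, b):
    """Ordinary instruction: propagates NAT instead of computing."""
    return NAT if a is NAT or b is NAT else a + b


def speculative_sequence(memory, addr, branch_taken):
    r1 = ld8_s(memory, addr)        # ld8.s hoisted above the branch
    r3 = add(r1, 1)                 # use of r1, now also speculative
    if branch_taken:                # br: the load was not needed after all,
        return None                 # so a deferred fault is never reported
    if r3 is NAT:                   # chk.s: speculation failed
        r1 = memory[addr]           # fix-up: non-speculative load (may raise)
        r3 = r1 + 1                 # redo the dependent instruction
    return r3


print(speculative_sequence({8: 41}, 8, branch_taken=False))
print(speculative_sequence({8: 41}, 99, branch_taken=True))
```

Note that the fix-up path repeats both the load and the instruction that used its result, since both were executed speculatively.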

Speculative execution – Data speculation
IA-64 provides advanced load instructions. The load instruction is replaced by an advanced load that is moved above the store.

Original code:
    instrA
    ...
    store
    ld8 r1 = [r2]
    use r1

With data speculation:
    ld8.a r1 = [r2]    ; reg#, addr and size are stored in the advanced load address table (ALAT)
    use r1
    instrA
    ...
    store
    chk.a              ; advanced load check

[Figure: the ALAT, a table of entries with the fields reg#, addr and size]

WaR handling: when the store is executed, all ALAT entries are compared with the store address. Overlapping entries are removed. chk.a checks the ALAT for the address of its corresponding advanced load. If the address is still there, chk.a does nothing. If it is gone, chk.a jumps to fix-up code.
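The ALAT bookkeeping described above can be sketched as follows. This is an illustrative model, not the hardware structure: the class and method names are hypothetical, entries are keyed by register name, memory is a dictionary of 8-byte words, and the "use" of r1 is an arbitrary add.

```python
class ALAT:
    """Toy advanced load address table: reg# -> (addr, size)."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, size):
        self.entries[reg] = (addr, size)        # recorded by ld.a

    def store(self, addr, size):
        # Remove every entry whose byte range overlaps the store.
        for reg, (a, s) in list(self.entries.items()):
            if addr < a + s and a < addr + size:
                del self.entries[reg]

    def check(self, reg):
        return reg in self.entries              # chk.a: is the entry still there?


def data_speculative_sequence(memory, load_addr, store_addr, store_value):
    alat = ALAT()
    r1 = memory[load_addr]                      # ld8.a r1 = [r2], moved above the store
    alat.advanced_load("r1", load_addr, 8)
    result = r1 + 1                             # use r1
    memory[store_addr] = store_value            # store
    alat.store(store_addr, 8)
    if not alat.check("r1"):                    # chk.a: entry gone, so a conflict occurred
        r1 = memory[load_addr]                  # fix-up: reload
        result = r1 + 1                         # and redo the dependent computation
    return result


print(data_speculative_sequence({0: 10, 8: 20}, 0, 8, 99))   # no conflict
print(data_speculative_sequence({0: 10, 8: 20}, 0, 0, 99))   # store hits the loaded address
```

In the conflicting case the fix-up path recomputes the result from the stored value, which is what the original, non-speculative order would have produced.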

Speculative execution – fix-up
The fix-up code generated by the compiler is general:
• In case of control speculation: not only the load is speculative, but also all instructions using the destination register.
• In case of data speculation: not only the load is speculative, but also all computations before the (possibly conflicting) store.
Although the compiler must include fix-up code to handle exceptions and WaR conflicts, this relatively simple mechanism allows for aggressive code motion.

Comparison: ARM conditional instructions
Conditional instructions that allow for branch removal, as implemented in the ARM processor (circa 1985).

Condition codes:
  Code  Mnemonic  Condition
  0000  EQ        Z
  0001  NE        ~Z
  0010  CS        C
  0011  CC        ~C
  0100  MI        N
  0101  PL        ~N
  0110  VS        V
  0111  VC        ~V
  1000  HI        C and ~Z
  1001  LS        ~C or Z
  1010  GE        N = V
  1011  LT        N = ~V
  1100  GT        (N = V) and ~Z
  1101  LE        (N = ~V) or Z
  1110  AL        True
  1111  NV        False (= NOP)

Instruction format:
  Cond | 000 | OPC | S | SRC1 | DEST | SH# | SH | SRC2

Example:
  ADD EQ S Rd, Rn, Rm, ASL Rc
  code: Rd = Sign(Rn + (Rm << Rc))

• Single-cycle execution
• Straightforward orthogonal instruction coding: all instructions can be coded conditionally on all conditions
• Only 4 condition bits (Z, C, N, V) in the processor status register, set by CMN, CMP, TEQ, TST
• Flexibility: branch removal, but no code motion! (conditional instructions must come after the CMP)
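The condition table above translates directly into a lookup of predicates over the four status bits. The sketch below is illustrative: the function names are hypothetical, and `flags_after_cmp` assumes 32-bit registers and the usual ARM convention that C is set when a subtraction produces no borrow.

```python
# One predicate per 4-bit condition code, in table order (0000 .. 1111).
CONDITIONS = [
    ("EQ", lambda n, z, c, v: z),
    ("NE", lambda n, z, c, v: not z),
    ("CS", lambda n, z, c, v: c),
    ("CC", lambda n, z, c, v: not c),
    ("MI", lambda n, z, c, v: n),
    ("PL", lambda n, z, c, v: not n),
    ("VS", lambda n, z, c, v: v),
    ("VC", lambda n, z, c, v: not v),
    ("HI", lambda n, z, c, v: c and not z),
    ("LS", lambda n, z, c, v: (not c) or z),
    ("GE", lambda n, z, c, v: n == v),
    ("LT", lambda n, z, c, v: n != v),
    ("GT", lambda n, z, c, v: n == v and not z),
    ("LE", lambda n, z, c, v: n != v or z),
    ("AL", lambda n, z, c, v: True),
    ("NV", lambda n, z, c, v: False),
]


def condition_passes(cond, n, z, c, v):
    """True if an instruction with 4-bit condition field `cond` executes."""
    _, predicate = CONDITIONS[cond]
    return bool(predicate(n, z, c, v))


def flags_after_cmp(a, b, bits=32):
    """N, Z, C, V as set by CMP a, b (computes a - b, discards the result)."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    result = (a - b) & mask
    n = result >> (bits - 1)                      # sign bit of the result
    z = int(result == 0)
    c = int(a >= b)                               # no borrow
    v = ((a ^ b) & (a ^ result)) >> (bits - 1)    # signed overflow
    return n, z, c, v


flags = flags_after_cmp(3, 5)
print([name for code, (name, _) in enumerate(CONDITIONS)
       if condition_passes(code, *flags)])
```

Because every instruction carries the same 4-bit field, one lookup like this covers the whole instruction set, which is the orthogonality the slide points out.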

In conclusion
EPIC: the future of computing?
• As processors grow in complexity, shifting responsibilities to the compiler seems an obvious step
• Keeping up with Moore’s law calls for conceptual innovations, not only technological ones


It is now safe to ask your questions
