Processor Architectures and Program Mapping

Processor Architectures and Program Mapping Exploiting ILP part 1: VLIW architectures TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman

What are we talking about?

ILP = Instruction Level Parallelism = ability to perform multiple operations (or instructions), from a single instruction stream, in parallel

VLIW = Very Long Instruction Word architecture

Instruction format: | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

VLIW: Topics Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• Limits on ILP
• VLIW
• Examples
• Clustering
• Code generation
• Hands-on

Enhance performance: 3 architecture methods
• (Super)-pipelining
• Powerful instructions
  • MD-technique: multiple data operands per operation
  • MO-technique: multiple operations per instruction
• Multiple instruction issue

Architecture methods: Pipelined Execution of Instructions

[Figure: simple 5-stage pipeline; instructions 1 to 4 overlap across cycles 1 to 8, each passing through IF, DC, RF, EX, WB one cycle after its predecessor]

IF: Instruction Fetch
DC: Instruction Decode
RF: Register Fetch
EX: Execute instruction
WB: Write Result Register

• Purpose of pipelining:
  • Reduce #gate_levels in critical path
  • Reduce CPI close to one
  • More efficient hardware
• Problems
  • Hazards: pipeline stalls
    • Structural hazards: add more hardware
    • Control hazards, branch penalties: use branch prediction
    • Data hazards: bypassing required
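As a rough illustration of the pipeline diagram on this slide, the following Python sketch lists which stage each instruction occupies in each cycle of an ideal, stall-free 5-stage pipeline. The function name `pipeline_diagram` and the `--` filler are illustrative choices, not part of the course material, and hazards are deliberately ignored.

```python
STAGES = ["IF", "DC", "RF", "EX", "WB"]

def pipeline_diagram(n_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c+1."""
    total_cycles = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        row = ["--"] * total_cycles
        for s, name in enumerate(stages):
            # Instruction i (0-based) is in stage s during cycle i + s + 1.
            row[i + s] = name
        rows.append(row)
    return rows

if __name__ == "__main__":
    for i, row in enumerate(pipeline_diagram(4), start=1):
        print(f"instr {i}: " + " ".join(row))
```

With 4 instructions and 5 stages the diagram spans 4 + 5 - 1 = 8 cycles, which is the cycle range shown on the slide.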

Architecture methods: Pipelined Execution of Instructions

Superpipelining:
• Split one or more of the critical pipeline stages

Architecture methods: Powerful Instructions (1)

MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data

Vector instruction:

    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];

or, in vector notation: c = a + 5*b

Assembly:

    set   vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)
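To make the vector assembly concrete, here is a minimal Python sketch that emulates each vector instruction as a list operation. It assumes a flat memory modeled as a Python list, with r1, r2 and r3 holding the start indices of a, b and c; the helper names mirror the mnemonics on the slide but are otherwise hypothetical.

```python
VL = 64  # vector length, as set by "set vl,64"

def ldv(mem, base):
    """Load VL consecutive elements starting at mem[base]."""
    return mem[base:base + VL]

def mulvi(v, imm):
    """Multiply every element by an immediate."""
    return [x * imm for x in v]

def addv(v1, v2):
    """Element-wise addition of two vectors."""
    return [x + y for x, y in zip(v1, v2)]

def stv(mem, base, v):
    """Store a vector to VL consecutive elements starting at mem[base]."""
    mem[base:base + VL] = v

def vector_kernel(mem, r1, r2, r3):
    """c = a + 5*b, with a at mem[r1], b at mem[r2], c at mem[r3]."""
    v1 = ldv(mem, r2)      # ldv   v1,0(r2)
    v2 = mulvi(v1, 5)      # mulvi v2,v1,5
    v1 = ldv(mem, r1)      # ldv   v1,0(r1)
    v3 = addv(v1, v2)      # addv  v3,v1,v2
    stv(mem, r3, v3)       # stv   v3,0(r3)
```

Six vector instructions replace the 64 iterations of the scalar loop, which is the point of the MD-technique.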

Architecture methods: Powerful Instructions (1)

SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploit data locality of e.g. image processing applications
• Dense encoding (few instruction bits needed)

[Figure: SIMD execution method; over time, Instruction 1, Instruction 2, Instruction 3, ..., Instruction n are each executed on node1, node2, ..., node-K]

Architecture methods: Powerful Instructions (1)

• Sub-word parallelism
  • SIMD on restricted scale
  • Used for multi-media instructions
• Examples
  • MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, Trimedia II
• Example: $\sum_{i=1..4} |a_i - b_i|$
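The example expression on this slide reads as a sum of absolute differences over four sub-words. Under that reading, the sketch below shows the idea in Python: four unsigned 8-bit values are packed into one 32-bit word, and a single operation processes all four lanes. The packing order (first value in the low byte) and the names `pack4` and `sad4` are assumptions made for illustration; real multimedia instruction sets define their own lane layout.

```python
def pack4(b0, b1, b2, b3):
    """Pack four unsigned 8-bit values into one 32-bit word (b0 in the low byte)."""
    for b in (b0, b1, b2, b3):
        assert 0 <= b <= 255, "each sub-word must fit in 8 bits"
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)

def sad4(word_a, word_b):
    """Sum of absolute differences over the four 8-bit lanes of two words."""
    total = 0
    for lane in range(4):
        a = (word_a >> (8 * lane)) & 0xFF
        b = (word_b >> (8 * lane)) & 0xFF
        total += abs(a - b)
    return total

if __name__ == "__main__":
    print(sad4(pack4(10, 20, 30, 40), pack4(12, 18, 35, 40)))
```

In hardware the four lanes are computed in parallel by one instruction; the loop here only models the result.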

Architecture methods: Powerful Instructions (2)

MO-technique: multiple operations per instruction
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)

VLIW instruction example (one field per FU):

    field        FU 1           FU 2            FU 3            FU 4          FU 5
    instruction  sub r8, r5, 3  and r1, r5, 12  mul r6, r5, r2  ld r3, 0(r5)  bnez r5, 13
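The following Python sketch executes the example VLIW instruction from this slide. It assumes that all operations of one instruction read their source registers before any result is written back, and that the branch operand 13 is an absolute target address; both are modeling assumptions, and the tuple encoding of operations is hypothetical.

```python
def execute_vliw(instruction, regs, mem, pc):
    """Execute one VLIW instruction and return the next program counter."""
    results = {}        # register writes are buffered until every operation has read
    next_pc = pc + 1
    for op, *args in instruction:
        if op == "sub":
            rd, rs, imm = args
            results[rd] = regs[rs] - imm
        elif op == "and":
            rd, rs, imm = args
            results[rd] = regs[rs] & imm
        elif op == "mul":
            rd, rs1, rs2 = args
            results[rd] = regs[rs1] * regs[rs2]
        elif op == "ld":
            rd, offset, rs = args
            results[rd] = mem[regs[rs] + offset]
        elif op == "bnez":
            rs, target = args
            if regs[rs] != 0:
                next_pc = target
        else:
            raise ValueError(f"unknown operation: {op}")
    regs.update(results)  # write-back happens after all operations have read
    return next_pc

# The five operations of the slide's example, one per function unit.
EXAMPLE = [
    ("sub", "r8", "r5", 3),
    ("and", "r1", "r5", 12),
    ("mul", "r6", "r5", "r2"),
    ("ld", "r3", 0, "r5"),
    ("bnez", "r5", 13),
]
```

All five operations read r5, yet none of them depends on a result of another, which is what lets the compiler place them in the same instruction.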

Architecture methods: Powerful Instructions (2)

VLIW Characteristics
• Only RISC like operation support
  • Short cycle times
• Flexible: Can implement any FU mixture
• Extensible
• Tight inter FU connectivity required
• Large instructions (up to 1000 bits)
• Not binary compatible
• But good compilers exist

Architecture methods: Multiple instruction issue (per cycle)

Who guarantees semantic correctness, i.e. who decides which instructions can be executed in parallel?
• User specifies multiple instruction streams
  • MIMD (Multiple Instruction Multiple Data)
• Run-time detection of ready instructions
  • Superscalar
• Compile into dataflow representation
  • Dataflow processors

Multiple instruction issue: Three Approaches

Example code:

    a := b + 15;
    c := 3.14 * d;
    e := c / f;

Translation to DDG (Data Dependence Graph):

    ld(&b) -> +(15) -> st(&a)
    ld(&d) -> *(3.14) -> st(&c)
    *(3.14), ld(&f) -> / -> st(&e)

Generated Code

    Instr.  Sequential Code     Dataflow Code
    I1      ld   r1,M(&b)       ld   M(&b) -> I2
    I2      addi r1,r1,15       addi 15    -> I3
    I3      st   r1,M(&a)       st   M(&a)
    I4      ld   r1,M(&d)       ld   M(&d) -> I5
    I5      muli r1,r1,3.14     muli 3.14  -> I6, I8
    I6      st   r1,M(&c)       st   M(&c)
    I7      ld   r2,M(&f)       ld   M(&f) -> I8
    I8      div  r1,r1,r2       div        -> I9
    I9      st   r1,M(&e)       st   M(&e)

Notes:
• An MIMD may execute two streams: (1) I1-I3 (2) I4-I9
  • No dependencies between streams; in practice communication and synchronization are required between streams
• A superscalar issues multiple instructions from a sequential stream
  • Obey dependencies (true and name dependencies)
  • Reverse engineering of DDG needed at run-time
• Dataflow code is a direct representation of the DDG
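A small Python sketch can show how the dataflow code executes: a node fires as soon as all of its operand tokens have arrived, and its result is forwarded to the listed destinations. The dictionary encoding, the operand port numbers for `div`, and the step-by-step firing loop are assumptions made for illustration; they are not a description of any particular dataflow machine.

```python
# node -> (operation, constant or memory name, list of (destination, operand port))
def build_program():
    return {
        "I1": ("ld", "b", [("I2", 0)]),
        "I2": ("addi", 15, [("I3", 0)]),
        "I3": ("st", "a", []),
        "I4": ("ld", "d", [("I5", 0)]),
        "I5": ("muli", 3.14, [("I6", 0), ("I8", 0)]),
        "I6": ("st", "c", []),
        "I7": ("ld", "f", [("I8", 1)]),
        "I8": ("div", None, [("I9", 0)]),
        "I9": ("st", "e", []),
    }

# Number of operand tokens each operation waits for.
INPUTS = {"ld": 0, "addi": 1, "muli": 1, "st": 1, "div": 2}

def run_dataflow(program, memory):
    """Fire all ready nodes per step; return the number of steps needed."""
    tokens = {name: {} for name in program}   # arrived operands, indexed by port
    fired = set()
    steps = 0
    while len(fired) < len(program):
        ready = [n for n, (op, _, _) in program.items()
                 if n not in fired and len(tokens[n]) == INPUTS[op]]
        if not ready:
            raise RuntimeError("no node can fire: missing tokens")
        produced = []
        for n in ready:
            op, arg, dests = program[n]
            t = tokens[n]
            if op == "ld":
                value = memory[arg]
            elif op == "addi":
                value = t[0] + arg
            elif op == "muli":
                value = t[0] * arg
            elif op == "div":
                value = t[0] / t[1]
            else:  # "st": write the operand to memory, no result token
                memory[arg] = t[0]
                value = None
            produced.extend((d, port, value) for d, port in dests)
            fired.add(n)
        # Tokens become visible only in the next step.
        for dest, port, value in produced:
            tokens[dest][port] = value
        steps += 1
    return steps
```

With this model the nine instructions complete in four firing steps (the three loads, then `addi` and `muli`, then two stores and `div`, then the last store), compared with nine steps when executed strictly sequentially.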

Multiple Instruction Issue: Data flow processor

[Figure: dataflow processor organization with Token Matching, Token Store, Instruction Generate, Instruction Store, Reservation Stations, function units FU-1, FU-2, ..., FU-K, and Result Tokens]

Instruction Pipeline Overview

[Figure: pipeline diagrams for CISC, RISC, Superscalar, Superpipelined, DATAFLOW, and VLIW, drawn with the stages IF, DC, RF, EX, WB, plus ISSUE and ROB stages and k parallel lanes where shown]

Four dimensional representation of the architecture design space <I, O, D, S>

[Figure: design space with axes Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D', and Superpipelining Degree 'S' (tick marks 0.1, 10, 100); plotted architectures: CISC, RISC, Superpipelined, Superscalar, VLIW, Vector, SIMD, MIMD, Dataflow]

Architecture design space

Typical values of K (# of functional units or processor nodes), and <I, O, D, S> for different architectures:

    Architecture     K     I     O     D     S     Mpar
    CISC             1     0.2   1.2   1.1   1     0.26
    RISC             1     1     1     1     1.2   1.2
    VLIW             10    1     10    1     1.2   12
    Superscalar      3     3     1     1     1.2   3.6
    Superpipelined   1     1     1     1     3     3
    Vector           7     0.1   1     64    5     32
    SIMD             128   1     1     128   1.2   154
    MIMD             32    32    1     1     1.2   38
    Dataflow         10    10    1     1     1.2   12

$S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$

$M_{par} = I \cdot O \cdot D \cdot S$
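The Mpar column can be checked directly from the product I*O*D*S. The sketch below copies the typical values from the table on this slide into a Python dictionary (the name `DESIGN_SPACE` is an illustrative choice) and recomputes the product for each architecture.

```python
# (K, I, O, D, S) per architecture, copied from the table on the slide.
DESIGN_SPACE = {
    "CISC":           (1,   0.2, 1.2, 1.1, 1),
    "RISC":           (1,   1,   1,   1,   1.2),
    "VLIW":           (10,  1,   10,  1,   1.2),
    "Superscalar":    (3,   3,   1,   1,   1.2),
    "Superpipelined": (1,   1,   1,   1,   3),
    "Vector":         (7,   0.1, 1,   64,  5),
    "SIMD":           (128, 1,   1,   128, 1.2),
    "MIMD":           (32,  32,  1,   1,   1.2),
    "Dataflow":       (10,  10,  1,   1,   1.2),
}

def mpar(architecture):
    """Mpar = I * O * D * S; K is listed in the table but not part of the product."""
    _k, i, o, d, s = DESIGN_SPACE[architecture]
    return i * o * d * s

if __name__ == "__main__":
    for name in DESIGN_SPACE:
        print(f"{name:15s} Mpar = {mpar(name):.2f}")
```

The recomputed values agree with the table after rounding, for example 0.2 * 1.2 * 1.1 * 1 = 0.264 for CISC and 1 * 1 * 128 * 1.2 = 153.6 for SIMD.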

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
  • limits on ILP
• VLIW
• Examples
• Clustering
• Code generation
• Hands-on

General organization of an ILP architecture

[Figure: CPU containing an instruction fetch unit, an instruction decode unit, function units FU-1 to FU-5, a bypassing network, and a register file, together with instruction memory and data memory]

Motivation for ILP
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
  • multi-media (image, audio, video, 3-D)
  • intelligent search and filtering engines
  • neural, fuzzy, genetic computing
• More functionality
• Use of existing code (compatibility)
• Low power: $P = f C V_{dd}^2$

Low power through parallelism
• Sequential Processor
  • Switching capacitance C
  • Frequency f
  • Voltage V
  • $P = f C V^2$
• Parallel Processor (two times the number of units)
  • Switching capacitance 2C
  • Frequency f/2
  • Voltage V' < V
  • $P = (f/2)(2C) V'^2 = f C V'^2$
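The power argument on this slide can be worked through numerically. The sketch below evaluates P = f C V^2 for the sequential case and for the parallel case with 2C, f/2 and a lower voltage V'. The voltages used in the example are hypothetical; the slide only states V' < V.

```python
def dynamic_power(f, c, v):
    """Dynamic power P = f * C * V^2."""
    return f * c * v ** 2

def parallel_power_ratio(v_seq, v_par):
    """Power of the two-unit parallel processor relative to the sequential one."""
    f, c = 1.0, 1.0  # normalized: the ratio does not depend on f or C
    sequential = dynamic_power(f, c, v_seq)
    parallel = dynamic_power(f / 2, 2 * c, v_par)
    return parallel / sequential

if __name__ == "__main__":
    # Hypothetical supply voltages, chosen only to illustrate the quadratic effect.
    print(parallel_power_ratio(v_seq=1.0, v_par=0.8))
```

Because f/2 and 2C cancel, the ratio reduces to (V'/V)^2: the saving comes entirely from the lower supply voltage that the halved clock frequency permits.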

Measuring and exploiting available ILP
• How much ILP is there in applications?
• How to measure parallelism within applications?
  • Using existing compiler
  • Using trace analysis
    • Track all the real data dependencies (RaWs) of instructions from issue window
      • register dependence
      • memory dependence
    • Check for correct branch prediction
      • if prediction correct continue
      • if wrong, flush schedule and start in next cycle

Trace analysis

Program:

    For i := 0..2
        A[i] := i;
    S := X+3;

Compiled code:

          set  r1,0
          set  r2,3
          set  r3,&A
    Loop: st   r1,0(r3)
          add  r1,r1,1
          add  r3,r3,4
          brne r1,r2,Loop
          add  r1,r5,3

Trace:

    set  r1,0
    set  r2,3
    set  r3,&A
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    add  r1,r5,3

How parallel can this code be executed?

Trace analysis

Parallel Trace (one row per cycle):

    1:  set r1,0     | set r2,3    | set r3,&A
    2:  st r1,0(r3)  | add r1,r1,1 | add r3,r3,4
    3:  st r1,0(r3)  | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop
    4:  st r1,0(r3)  | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop
    5:  brne r1,r2,Loop
    6:  add r1,r5,3

Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7
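The parallel trace can be reproduced with a short scheduler. This Python sketch makes several assumptions that are not stated on the slide: every instruction has a latency of one cycle, only register read-after-write dependences are tracked (renaming removes WAR and WAW, and the three stores go to different addresses, so memory dependences are not modeled), an instruction may share a cycle with a correctly predicted preceding branch, and the loop-exit branch is treated as mispredicted so that later instructions start in the next cycle.

```python
def loop_body(mispredicted):
    """One loop iteration: (text, registers written, registers read, branch outcome)."""
    return [
        ("st r1,0(r3)", [], ["r1", "r3"], None),
        ("add r1,r1,1", ["r1"], ["r1"], None),
        ("add r3,r3,4", ["r3"], ["r3"], None),
        ("brne r1,r2,Loop", [], ["r1", "r2"], mispredicted),
    ]

# The 16-instruction trace; the last field is None for non-branches and
# True/False for branches (True = mispredicted, an assumption for the loop exit).
TRACE = (
    [("set r1,0", ["r1"], [], None),
     ("set r2,3", ["r2"], [], None),
     ("set r3,&A", ["r3"], [], None)]
    + loop_body(False) + loop_body(False) + loop_body(True)
    + [("add r1,r5,3", ["r1"], ["r5"], None)]
)

def schedule(trace):
    """Return the issue cycle of each instruction (1-based)."""
    ready = {}   # register -> first cycle in which its latest value can be read
    floor = 1    # earliest cycle allowed by the preceding branches
    cycles = []
    for _text, writes, reads, mispredicted in trace:
        cycle = max([floor] + [ready.get(r, 1) for r in reads])
        for r in writes:
            ready[r] = cycle + 1          # unit latency
        if mispredicted is not None:
            # Correct prediction: continue in the same cycle; wrong: restart next cycle.
            floor = cycle + 1 if mispredicted else cycle
        cycles.append(cycle)
    return cycles

if __name__ == "__main__":
    cycles = schedule(TRACE)
    print(max(cycles), round(len(TRACE) / max(cycles), 1))
```

Under these assumptions the schedule takes 6 cycles for 16 instructions, giving the speedup of about 2.7 quoted on the slide.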

Ideal Processor

Assumptions for ideal/perfect processor:
1. Register renaming: infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and jump prediction: perfect => all program instructions available for execution
3. Memory-address alias analysis: addresses are known. A store can be moved before a load provided the addresses are not equal

Also:
• unlimited number of instructions issued/cycle (unlimited resources), and
• unlimited instruction window
• perfect caches
• 1 cycle latency for all instructions (FP *,/)

Programs were compiled using MIPS compiler with maximum optimization level

Upper Limit to ILP: Ideal Processor

[Figure: IPC per benchmark on the ideal processor]
• Integer: 18 - 60 IPC
• FP: 75 - 150 IPC

Window Size and Branch Impact
• Change from an infinite window to a window of 2000 instructions, issuing at most 64 instructions per cycle

[Figure: IPC per benchmark for the branch predictors Perfect, Tournament, BHT (512), Profile, and No prediction]
• FP: 15 - 45 IPC
• Integer: 6 - 12 IPC

Impact of Limited Renaming Registers
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)

[Figure: IPC per benchmark for Infinite, 256, 128, 64, and 32 renaming registers]
• FP: 11 - 45 IPC
• Integer: 5 - 15 IPC

Memory Address Alias Impact
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers

[Figure: IPC per benchmark for the alias analysis models Perfect, Global/stack perfect, Inspection, and None]
• FP: 4 - 45 IPC (Fortran, no heap)
• Integer: 4 - 9 IPC

Window Size Impact
• Assumptions: Perfect disambiguation, 1K Selective predictor, 16 entry return stack, 64 renaming registers, issue as many as window

[Figure: IPC per benchmark for window sizes Infinite, 256, 128, 64, 32, 16, 8, and 4]
• FP: 8 - 45 IPC
• Integer: 6 - 12 IPC

How to Exceed ILP Limits of This Study?
• WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not those through memory
• Unnecessary dependences
  • the compiler did not unroll loops, so the iteration-variable dependence remains
• Overcoming the data flow limit: value prediction, i.e. predicting values and speculating on the prediction
• Address value prediction and speculation: predict addresses and speculate by reordering loads and stores. Could provide better aliasing analysis

Conclusions
• Amount of parallelism is limited
  • higher in Multi-Media
  • higher in kernels
• Trace analysis detects all types of parallelism
  • task, data and operation types
• Detected parallelism depends on
  • quality of compiler
  • hardware
  • source-code transformations

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  • C6
  • TM
  • IA-64: Itanium, ....
  • TTA
• Clustering
• Code generation
• Hands-on

VLIW concept

[Figure: a VLIW architecture with 7 FUs; instruction memory and instruction register, function units (3 Int FU, 2 LD/ST, 2 FP FU), Int Register File, Floating Point Register File, and data memory]

VLIW characteristics
• Multiple operations per instruction
• One instruction per cycle issued (at most)
• Compiler is in control
• Only RISC like operation support
  • Short cycle times
  • Easier to compile for
• Flexible: Can implement any FU mixture
• Extensible / Scalable

However:
• tight inter FU connectivity required
• not binary compatible !!
  • (new long instruction format)

VelociTI C6x datapath

[Figure: VelociTI C6x datapath]

VLIW example: TMS320C62

TMS320C62 VelociTI Processor
• 8 operations (of 32-bit) per instruction (256 bit)
• Two clusters
  • 8 FUs: 4 FUs / cluster (2 Multipliers, 6 ALUs)
  • 2 x 16 registers
  • One bus available to write in register file of other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All instructions conditional
• 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
• 128 KB on-chip RAM

VLIW example: Trimedia

[Figure: Trimedia block diagram; SDRAM memory interface, timers, PCI interface (32 bit, 33 MHz), Video In (19 Mpix/s), Video Out (40 Mpix/s), digital audio (Audio In, Audio Out, 208 channel digital audio), I2C interface, serial interface, Huffman decoder for MPEG1,2 (VLD coprocessor), and the VLIW processor with 32K I$ and 16K D$]

• 5-issue
• 128 registers
• 27 FUs
• 32-bit
• 8-way set associative caches
• dual-ported data cache
• guarded operations

Intel Architecture IA-64

Explicitly Parallel Instruction Computing (EPIC)
• IA-64 architecture -> Itanium, first realization

Register model:
• 128 64-bit int x bits, stack, rotating
• 128 82-bit floating point, rotating
• 64 1-bit boolean
• 8 64-bit branch target address
• system control registers

EPIC Architecture: IA-64
• Instructions grouped in 128-bit bundles
  • 3 * 41-bit instruction
  • 5 template bits, indicate type and stop location
• Each 41-bit instruction
  • starts with 4-bit opcode, and
  • ends with 6-bit guard (boolean) register-id
• Supports speculative loads
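The bundle arithmetic on this slide (3 * 41 + 5 = 128 bits) can be illustrated with a small pack/unpack sketch in Python. The exact bit positions used here are assumptions for illustration: the template is placed in the lowest 5 bits, the opcode in the top 4 bits of each 41-bit slot, and the guard register id in the lowest 6 bits of each slot. The function names are hypothetical, and the Intel architecture manuals should be consulted for the authoritative layout.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit instructions into one 128-bit word."""
    assert 0 <= template < (1 << TEMPLATE_BITS)
    assert len(slots) == 3
    word = template
    for i, slot in enumerate(slots):
        assert 0 <= slot <= SLOT_MASK
        word |= slot << (TEMPLATE_BITS + SLOT_BITS * i)
    return word

def unpack_bundle(word):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    slots = [(word >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK for i in range(3)]
    return template, slots

def opcode_of(slot):
    """Assumed position: the 4-bit opcode occupies the top bits of the slot."""
    return slot >> (SLOT_BITS - 4)

def guard_of(slot):
    """Assumed position: the 6-bit guard register id occupies the lowest bits."""
    return slot & 0x3F
```

A 6-bit guard field is exactly what is needed to select one of the 64 1-bit boolean registers of the register model.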

Itanium

[Figure: Itanium]

Itanium 2: McKinley

[Figure: Itanium 2 (McKinley)]

EPIC Architecture: IA-64
• EPIC allows for more binary compatibility than a plain VLIW:
  • Function unit assignment performed at run-time
  • Lock when FU results are not available
• See other website for more info on IA-64:
  • www.ics.ele.tue.nl/~heco/courses/ACA
  • (look at related material)

VLIW evaluation

Strong points of VLIW:
• Scalable (add more FUs)
• Flexible (an FU can be almost anything; e.g. multimedia support)

Weak points:
• With N FUs:
  • Bypassing complexity: $O(N^2)$
  • Register file complexity: $O(N)$
  • Register file size: $O(N^2)$
• Register file design restricts FU flexibility

Solution: .................................................. ?
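A back-of-the-envelope model shows where these growth rates come from. The Python sketch below assumes each FU reads two operands and writes one result per cycle, that every FU result can be bypassed to every operand input, and that register file area grows with the square of the port count. All three are simplifying assumptions (the last one is a rule of thumb), not figures from the course material.

```python
def vliw_costs(n_fus, reads_per_fu=2, writes_per_fu=1):
    """Return (register ports, bypass paths, relative register file area)."""
    ports = n_fus * (reads_per_fu + writes_per_fu)    # grows as O(N)
    bypass_paths = n_fus * (n_fus * reads_per_fu)     # results x operand inputs: O(N^2)
    relative_rf_area = ports ** 2                     # assumed area ~ ports^2: O(N^2)
    return ports, bypass_paths, relative_rf_area

if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, vliw_costs(n))
```

Doubling the number of FUs from 5 to 10 doubles the port count but quadruples both the bypass paths and the modeled register file area, which is the scaling problem the slide points at.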

VLIW evaluation

[Figure: general organization of an ILP architecture (instruction fetch unit, instruction decode unit, FU-1 to FU-5, bypassing network, register file, instruction memory, data memory), annotated with the control problem and the complexities $O(N^2)$ and $O(N)$-$O(N^2)$, with N function units]

Solution

TTA: Transport Triggered Architecture

Mirroring the Programming Paradigm

[Figure: two operation graphs built from +, -, >, * and st nodes]

Transport Triggered Architecture

General organization of a TTA

[Figure: CPU containing an instruction fetch unit, an instruction decode unit, FU-1 to FU-5, a bypassing network, and a register file, together with instruction memory and data memory]

TTA structure; datapath details

[Figure: TTA datapath with two load/store units, two integer ALUs, a float ALU, integer RF, float RF, boolean RF, instruction unit, and immediate unit, connected through sockets; data memory and instruction memory attached]

TTA hardware characteristics
• Modular: building blocks easy to reuse
• Very flexible and scalable
  • easy inclusion of Special Function Units (SFUs)
• Very low complexity
  • > 50% reduction on # register ports
  • reduced bypass complexity (no associative matching)
  • up to 80 % reduction in bypass connectivity
  • trivial decoding
  • reduced register pressure
  • easy register file partitioning (a single port is enough!)
