
CENG 450 Computer Systems and Architecture Lecture 12

CENG 450 Computer Systems and Architecture Lecture 12. Amirali Baniasadi amirali@ece.uvic.ca. Multiple Issue. Multiple Issue is the ability of the processor to start more than one instruction in a given cycle. Superscalar processors Very Long Instruction Word (VLIW) processors.


Presentation Transcript


  1. CENG 450 Computer Systems and Architecture, Lecture 12 Amirali Baniasadi amirali@ece.uvic.ca

  2. Multiple Issue • Multiple Issue is the ability of the processor to start more than one instruction in a given cycle. • Superscalar processors • Very Long Instruction Word (VLIW) processors

  3. 1990’s: Superscalar Processors • Bottleneck: CPI >= 1 • Limit on scalar performance (single instruction issue) • Hazards • Superpipelining? Diminishing returns (hazards + overhead) • How can we make the CPI = 0.5? • Multiple instructions in every pipeline stage (superscalar), two issued per cycle:

              1    2    3    4    5    6    7
      Inst0   IF   ID   EX   MEM  WB
      Inst1   IF   ID   EX   MEM  WB
      Inst2        IF   ID   EX   MEM  WB
      Inst3        IF   ID   EX   MEM  WB
      Inst4             IF   ID   EX   MEM  WB
      Inst5             IF   ID   EX   MEM  WB
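The CPI arithmetic behind the diagram above can be sketched in a few lines. This is an illustrative model (not from the slides) of an ideal hazard-free pipeline; the function names and the 5-stage, 2-wide defaults are assumptions chosen to match the example.

```python
# Illustrative sketch: cycles and CPI for an ideal width-wide, depth-stage
# in-order pipeline with no hazards. The first issue group takes `depth`
# cycles to drain; every later group completes one cycle behind it.
import math

def ideal_cycles(n_insts: int, width: int = 2, depth: int = 5) -> int:
    """Total cycles to run n_insts with no hazards."""
    if n_insts == 0:
        return 0
    issue_groups = math.ceil(n_insts / width)
    return depth + issue_groups - 1

def cpi(n_insts: int, width: int = 2, depth: int = 5) -> float:
    return ideal_cycles(n_insts, width, depth) / n_insts

print(ideal_cycles(6))   # the 6-instruction diagram above: 7 cycles
print(cpi(1_000_000))    # long runs approach CPI = 1/width = 0.5
```

For short runs the pipeline-fill cycles keep the measured CPI above 0.5, which is why the diagram's 6 instructions take 7 cycles.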

  4. Superscalar Vs. VLIW • Religious debate, similar to RISC vs. CISC • Wisconsin + Michigan (Superscalar) vs. Illinois (VLIW) • Q. Who can schedule code better, hardware or software?

  5. Hardware Scheduling • High branch prediction accuracy • Dynamic information on latencies (cache misses) • Dynamic information on memory dependences • Easy to speculate (& recover from mis-speculation) • Works for generic, non-loop, irregular code • Ex: databases, desktop applications, compilers • Limited reorder buffer size limits “lookahead” • High cost/complexity • Slow clock

  6. Software Scheduling • Large scheduling scope (full program), large “lookahead” • Can handle very long latencies • Simple hardware with fast clock • Only works well for “regular” codes (scientific, FORTRAN) • Low branch prediction accuracy • Can improve by profiling • No information on latencies like cache misses • Can improve by profiling • Pain to speculate and recover from mis-speculation • Can improve with hardware support

  7. Superscalar Processors • Pioneer: IBM (America => RIOS, RS/6000, Power-1) • Superscalar instruction combinations • 1 ALU or memory or branch + 1 FP (RS/6000) • Any 1 + 1 ALU (Pentium) • Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1 branch (Pentium II) • Impact of superscalar • More opportunity for hazards (why?) • More performance loss due to hazards (why?)

  8. Superscalar Processors • Issue a varying number of instructions per clock • Scheduling: static (by the compiler) or dynamic (by the hardware) • Superscalar issues a varying number of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo) • IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000

  9. Elements of Advanced Superscalars • High performance instruction fetching • Good dynamic branch and jump prediction • Multiple instructions per cycle, multiple branches per cycle? • Scheduling and hazard elimination • Dynamic scheduling • Not necessarily: Alpha 21064 & Pentium were statically scheduled • Register renaming to eliminate WAR and WAW • Parallel functional units, paths/buses/multiple register ports • High performance memory systems • Speculative execution

  10. SS + DS + Speculation • Superscalar + Dynamic scheduling + Speculation Three great tastes that taste great together • CPI >= 1? • Overcome with superscalar • Superscalar increases hazards • Overcome with dynamic scheduling • RAW dependences still a problem? • Overcome with a large window • Branches a problem for filling large window? • Overcome with speculation

  11. The Big Picture • Flattened figure: a static program flows through fetch & branch predict → issue → execution → reorder & commit

  12. Superscalar Microarchitecture • Flattened block diagram: instruction cache + pre-decode → instruction buffer → decode/rename/dispatch → floating point instruction buffer and integer/address instruction buffer → floating point register file + functional units, and integer register file + functional units + data cache (memory interface) → reorder and commit

  13. Register renaming methods • First method: • Physical register file vs. logical (architectural) register file • A mapping table associates a physical register with the current value of each logical register • A free list tracks unallocated physical registers • The physical register file is bigger than the logical register file • Second method: • Physical register file is the same size as the logical one • In addition, a buffer with one entry per in-flight instruction: the reorder buffer

  14. Register renaming: first method • Figure: mapping table and free list before and after Add r3,r3,4 • The source reads r3’s current physical register, a fresh physical register is popped from the free list, and r3’s mapping table entry is updated to point to it
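The first renaming method (mapping table + free list) can be sketched as follows. This is a hypothetical illustration: the class name, register naming (lowercase logical `rN`, uppercase physical `RN`), and table sizes are assumptions, not from any real machine.

```python
# Illustrative sketch of renaming with a mapping table and a free list.
# Sources are translated through the current mapping *before* the
# destination is remapped to a fresh physical register; giving every
# write a new register is what eliminates WAR and WAW hazards.
class Renamer:
    def __init__(self, n_logical: int, n_physical: int):
        self.map = {f"r{i}": f"R{i}" for i in range(n_logical)}
        used = set(self.map.values())
        self.free = [f"R{i}" for i in range(n_physical) if f"R{i}" not in used]

    def rename(self, dest: str, srcs: list) -> tuple:
        phys_srcs = [self.map[s] for s in srcs]   # read old mappings first
        new_dest = self.free.pop(0)               # allocate from free list
        self.map[dest] = new_dest                 # update mapping table
        return new_dest, phys_srcs

rn = Renamer(n_logical=5, n_physical=16)
# add r3, r3, 4 (as on the slide): source reads the old physical register,
# the destination gets a brand-new one.
d, s = rn.rename("r3", ["r3"])
print(d, s)   # R5 ['R3']
```

A real design also records the old mapping so it can be restored on a branch mis-speculation, which this sketch omits.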

  15. More Realistic HW: Register Impact • Figure: effect of limiting the number of renaming registers on IPC (FP: 11–45, integer: 5–15)

  16. Reorder Buffer • Reserve entry at tail when dispatched • Place data in entry when execution finishes • Bypass to other instructions when needed • Remove from head when complete
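The four-step reorder-buffer protocol above can be sketched directly. This is an illustrative model (class and method names are assumptions): reserve at the tail at dispatch, fill in the value when execution finishes, and retire from the head strictly in program order.

```python
# Illustrative sketch of a reorder buffer: a bounded FIFO whose head is
# the oldest in-flight instruction.
from collections import deque

class ReorderBuffer:
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()          # head = oldest instruction

    def dispatch(self, dest: str) -> dict:
        assert len(self.entries) < self.size, "ROB full: dispatch stalls"
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)      # reserve entry at tail
        return entry

    def finish(self, entry: dict, value):
        entry["value"] = value          # younger instructions may bypass
        entry["done"] = True            # from here before retirement

    def retire(self) -> list:
        """Remove completed entries from the head, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft())
        return retired

rob = ReorderBuffer(8)
a = rob.dispatch("r3")
b = rob.dispatch("r4")
rob.finish(b, 7)                 # the younger instruction finishes first...
print(len(rob.retire()))         # ...but 0 retire: the head is unfinished
```

In-order retirement from the head is what makes precise exceptions and mis-speculation recovery possible.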

  17. Register renaming: reorder buffer • Figure: before and after add r3,r3,4 • Beforehand, r3’s mapping table entry points at reorder buffer entry rob6; the add is renamed to add rob8, rob6, 4 (reading the old tag, writing a newly reserved tail entry), and r3’s mapping is updated to rob8

  18. Instruction Buffers • Same block diagram as slide 12, here highlighting the floating point and integer/address instruction buffers that sit between decode/rename/dispatch and the functional units

  19. Issue Buffer Organization • a) Single, shared queue • No out-of-order issue, no renaming • b) Multiple queues, one per instruction type • No out-of-order issue within a queue, but the queues issue out of order with respect to each other

  20. Issue Buffer Organization • c) Multiple reservation stations (one per instruction type, or one big pool), filled from instruction dispatch • No FIFO ordering: execution starts as soon as operands are ready and hardware is available • Proposed by Tomasulo

  21. Typical reservation station • Entry fields: operation | source 1, data 1, valid 1 | source 2, data 2, valid 2 | destination
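The fields above map naturally onto a small data structure. This is a hypothetical sketch (names and the dictionary layout are assumptions): an entry is ready to issue once both valid bits are set, and it snoops result broadcasts to capture missing operands.

```python
# Illustrative sketch of a reservation-station entry with the fields
# listed on the slide: operation, two (tag, data, valid) sources, and a
# destination tag.
class RSEntry:
    def __init__(self, op, src1, src2, dest):
        self.op = op
        self.src = [src1, src2]   # each: {"tag":..., "data":..., "valid":...}
        self.dest = dest

    def ready(self) -> bool:
        """Ready to start execution once both operands are valid."""
        return all(s["valid"] for s in self.src)

    def capture(self, tag, value):
        """Snoop a result broadcast; grab data for matching source tags."""
        for s in self.src:
            if not s["valid"] and s["tag"] == tag:
                s["data"], s["valid"] = value, True

e = RSEntry("add",
            {"tag": None,   "data": 5,    "valid": True},
            {"tag": "rob6", "data": None, "valid": False},
            dest="rob8")
print(e.ready())        # False: still waiting on tag rob6
e.capture("rob6", 12)   # result broadcast arrives
print(e.ready())        # True: both operands valid, entry can issue
```

The tag names `rob6`/`rob8` echo the reorder-buffer renaming example on slide 17; in Tomasulo's original scheme the tags name reservation stations instead.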

  22. Memory Hazard Detection Logic • Flattened figure: instruction issue feeds address add & translation; load addresses fill a load address buffer and store addresses a store address buffer; an address compare between the two drives the hazard control logic before loads and stores go to memory

  23. Example • MIPS R10000, Alpha 21264, AMD K5: self-study • READ THE PAPER.

  24. VLIW • VLIW: Very Long Instruction Word • In-order pipe, but each “instruction” is N instructions (a VLIW) • Typically “slotted” (i.e., 1st must be ALU, 2nd must be load, etc.) • The VLIW travels down the pipe as a unit • Compiler packs independent instructions into the VLIW • Figure: one IF/ID front end feeding parallel pipes: ALU→WB, ALU→WB, Addr→MEM→WB, FP→FP→WB

  25. Very Long Instruction Word • VLIW issues a fixed number of instructions, formatted either as one very large instruction or as a fixed packet of smaller instructions • Fixed number of instructions (4-16) scheduled by the compiler; put operations into wide templates • Started with microcode (“horizontal microcode”) • Joint HP/Intel agreement in 1999/2000 • Intel Architecture-64 (IA-64), 64-bit addresses / Itanium • Explicitly Parallel Instruction Computer (EPIC) • Transmeta: translates x86 to VLIW • Many embedded controllers (TI, Motorola) are VLIW

  26. Pure VLIW: What Does VLIW Mean? • All latencies fixed • All instructions in VLIW issue at once • No hardware interlocks at all • Compiler responsible for scheduling entire pipeline • Includes stall cycles • Possible if you know structure of pipeline and latencies exactly

  27. Problems with Pure VLIW • Latencies are not fixed (e.g., caches) • Option I: don’t use caches (forget it) • Option II: stall whole pipeline on a miss? • Option III: stall instructions waiting for memory? (need out-of-order) • Different implementations • Different pipe depths, different latencies • New pipeline may produce wrong results (code stalls in wrong place) • Recompile for new implementations? • Code compatibility is very important, made Intel what it is

  28. Key: Static Scheduling • VLIW relies on the fact that software can schedule code well • Loop unrolling (we have seen this one already) • Code growth • Poor scheduling along unrolled copies
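Loop unrolling, mentioned above, can be illustrated with a small sketch. Python stands in for compiler output here purely for readability; the function names are assumptions, and a real compiler would do this on machine code.

```python
# Illustrative sketch of 4-way loop unrolling: four body copies per
# iteration expose independent operations for the static scheduler, at
# the cost of code growth and a cleanup loop for leftover iterations.
def sum_rolled(a):
    s = 0
    for x in a:          # one long serial dependence chain on s
        s += x
    return s

def sum_unrolled4(a):
    s0 = s1 = s2 = s3 = 0            # independent accumulators break the
    n = len(a) // 4 * 4              # dependence chain into four streams
    for i in range(0, n, 4):
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3]
    s = s0 + s1 + s2 + s3
    for x in a[n:]:                  # cleanup copies: trip count % 4 != 0
        s += x
    return s

data = list(range(10))
print(sum_unrolled4(data) == sum_rolled(data))   # True: same result
```

The unrolled version is roughly four times the code for the same work, which is the code-growth cost the slide points to.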

  29. Limits to Multi-Issue Machines • Inherent limitations of ILP • 1 branch in 5 instructions => how to keep a 5-way VLIW busy? • Latencies of units => many operations must be scheduled • Need about Pipeline Depth x No. Functional Units of independent operations to keep machines busy. • Difficulties in building HW • Duplicate Functional Units to get parallel execution • Increase ports to Register File • Increase ports to memory • Decoding Superscalar and impact on clock rate, pipeline depth: • Complexity-effective designs

  30. Limits to Multi-Issue Machines • Limitations specific to either the superscalar or the VLIW implementation • Decode/issue complexity in superscalar • VLIW code size: unrolled loops + wasted fields in the VLIW • VLIW lock step => one hazard stalls all instructions in the word • VLIW & binary compatibility

  31. Multiple Issue Challenges • While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with: • Exactly 50% FP operations • No hazards • If more instructions issue at same time, greater difficulty of decode and issue • Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue • VLIW: tradeoff instruction space for simple decoding • The long instruction word has room for many operations • By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel • E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch • 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide • Need compiling technique that schedules across several branches
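The 2-scalar check above (2 opcodes, 6 register specifiers) can be made concrete. This is a simplified, hypothetical sketch: instructions are `(dest, src1, src2)` triples, and only RAW and WAW between the pair are checked; renaming would remove the WAW case.

```python
# Illustrative sketch of the dual-issue check: compare the younger
# instruction's sources and destination against the older one's
# destination before allowing both to issue in the same cycle.
def can_dual_issue(older, younger) -> bool:
    """Each instruction is (dest, src1, src2)."""
    d0, _, _ = older
    d1, s1a, s1b = younger
    raw = d0 in (s1a, s1b)   # younger reads what older writes
    waw = d0 == d1           # both write the same register
    return not (raw or waw)

print(can_dual_issue(("r1", "r2", "r3"), ("r4", "r5", "r6")))  # True
print(can_dual_issue(("r1", "r2", "r3"), ("r4", "r1", "r6")))  # False: RAW on r1
```

Even this toy version needs 6 specifier comparisons for one pair; the comparator count grows roughly quadratically with issue width, which is the decode-complexity limit discussed on the previous slides.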

  32. HW Support for More ILP • How is this used in practice? • Rather than predicting the direction of a branch, execute the instructions on both sides! • We know the target of a branch early, long before we know if it will be taken or not • So begin fetching/executing at that new target PC • But also continue fetching/executing as if the branch were NOT taken

  33. Studies of ILP • Conflicting studies of amount of improvement available • Benchmarks (vectorized FP Fortran vs. integer C programs) • Hardware sophistication • Compiler sophistication • How much ILP is available using existing mechanisms with increasing HW budgets? • Do we need to invent new HW/SW mechanisms to keep on processor performance curve?

  34. Summary • Static ILP • Simple, advanced loads, predication hardware (fast clock), complex compilers • VLIW

  35. Summary • Dynamic ILP • Instruction buffer • Split ID into two stages: one issues in order, the other executes out of order • Scoreboard • Out-of-order execution; stalls on (does not eliminate) WAR/WAW hazards • Tomasulo’s algorithm • Uses register renaming to eliminate WAR/WAW hazards • Dynamic scheduling + speculation • Superscalar
