Superscalar Processors: Operation and Limitations

Topics Left • Superscalar machines • IA64 / EPIC architecture • Multithreading (explicit and implicit) • Multicore Machines • Clusters • Parallel Processors • Hardware implementation vs microprogramming

Chapter 14 • Superscalar Processors • Definition of Superscalar • Design Issues: • Instruction Issue Policy • Register renaming • Machine parallelism • Branch Prediction • Execution • Pentium 4 example

What is Superscalar? • “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently. • Equally applicable to RISC & CISC, but more straightforward in RISC machines. • The order of execution is usually assisted by the compiler. A Superscalar machine executes multiple independent instructions in parallel. They are pipelined as well.

Example of Superscalar Organization • 2 Integer ALU pipelines, • 2 FP ALU pipelines, • 1 memory pipeline (?)

Superscalar v Superpipelined

Limitations of Superscalar • Dependent upon: - Instruction level parallelism possible - Compiler based optimization - Hardware support • Limited by • Data dependency • Procedural dependency • Resource conflicts

(Recall) True Data Dependency (Must W before R)
ADD r1, r2     r1 + r2 → r1
MOVE r3, r1    r1 → r3
• Can fetch and decode second instruction in parallel with first
LOAD r1, X     X (memory) → r1
MOVE r3, r1    r1 → r3
• Can NOT execute second instruction until first is finished
Second instruction is dependent on first (R after W)

(Recall) Antidependency (Must R before W)
ADD R4, R3, 1    R3 + 1 → R4
ADD R3, R5, 1    R5 + 1 → R3
• Cannot complete the second instruction before the first has read R3
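The two recalled dependency types, plus the output dependency that comes up later under register renaming, can be checked mechanically. The sketch below is only an illustration: the function name `classify_hazards` and the `(destination, sources)` encoding of an instruction are assumptions made here, not notation from the slides.

```python
def classify_hazards(first, second):
    """Classify register hazards between two instructions.

    Each instruction is a (dest, sources) pair; `first` precedes
    `second` in program order. This encoding is hypothetical.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    hazards = set()
    if dest1 in srcs2:
        hazards.add("RAW")  # true data dependency: second reads what first writes
    if dest2 in srcs1:
        hazards.add("WAR")  # antidependency: second overwrites what first reads
    if dest1 == dest2:
        hazards.add("WAW")  # output dependency: both write the same register
    return hazards


# ADD r1, r2 (r1 + r2 -> r1) followed by MOVE r3, r1 (r1 -> r3)
true_dep = classify_hazards(("r1", {"r1", "r2"}), ("r3", {"r1"}))
# ADD R4, R3, 1 followed by ADD R3, R5, 1
anti_dep = classify_hazards(("R4", {"R3"}), ("R3", {"R5"}))
print(true_dep, anti_dep)  # {'RAW'} {'WAR'}
```

Only register operands are compared; memory dependencies and resource conflicts are outside this sketch.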

(Recall) Procedural Dependency • Can’t execute instructions after a branch in parallel with instructions before a branch, because? Note: Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed

(recall) Resource Conflict • Two or more instructions requiring access to the same resource at the same time • e.g. two arithmetic instructions need the ALU • Solution - Can possibly duplicate resources • e.g. have two arithmetic units

Effect of Dependencies on Superscalar Operation Notes: 1) Superscalar operation is doubly impacted by a stall. 2) CISC machines typically have variable-length instructions, which must be at least partially decoded before the next can be fetched – not good for superscalar operation

Instruction-level Parallelism – degree of
• Consider:
LOAD R1, R2
ADD R3, 1
ADD R4, R2
These can be handled in parallel.
• Consider:
ADD R3, 1
ADD R4, R3
STO (R4), R0
These cannot be handled in parallel.
The “degree” of instruction-level parallelism is determined by the number of instructions that can be executed in parallel without stalling for dependencies
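One rough way to quantify the difference between the two sequences is the length of the longest chain of true dependencies: a depth of 1 means every instruction could start together, while a depth equal to the instruction count means fully serial execution. The sketch below assumes a `(dest, sources)` encoding and reads `ADD R3, 1` as R3 + 1 → R3; both are assumptions for illustration, and antidependencies, output dependencies, and resource limits are ignored.

```python
def dependency_depth(program):
    """Length of the longest true-dependency (read-after-write) chain.

    `program` is a list of (dest, sources) pairs in program order;
    dest is None for instructions that write no register (e.g. a store).
    """
    writer_depth = {}  # register -> chain depth of its most recent writer
    longest = 0
    for dest, sources in program:
        depth = 1 + max((writer_depth.get(r, 0) for r in sources), default=0)
        if dest is not None:
            writer_depth[dest] = depth
        longest = max(longest, depth)
    return longest


parallel = [("R1", ["R2"]), ("R3", ["R3"]), ("R4", ["R4", "R2"])]
serial = [("R3", ["R3"]), ("R4", ["R4", "R3"]), (None, ["R4", "R0"])]
print(dependency_depth(parallel), dependency_depth(serial))  # 1 3
```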

Instruction Issue Policies • Order in which instructions are fetched • Order in which instructions are executed • Order in which instructions update registers and memory values (order of completion) Standard Categories: • In-order issue with in-order completion • In-order issue with out-of-order completion • Out-of-order issue with out-of-order completion

In-Order Issue -- In-Order Completion Issue instructions in the order they occur: • Not very efficient • Instructions must stall if necessary (and stalling in superpipelining is expensive)

In-Order Issue -- In-Order Completion (Example) • Assume: • I1 requires 2 cycles to execute • I3 & I4 conflict for the same functional unit • I5 depends upon value produced by I4 • I5 & I6 conflict for a functional unit

In-Order Issue -- Out-of-Order Completion (Example) • Again: • I1 requires 2 cycles to execute • I3 & I4 conflict for the same functional unit • I5 depends upon value produced by I4 • I5 & I6 conflict for a functional unit How does this affect interrupts?

Out-of-Order Issue -- Out-of-Order Completion • Decouple decode pipeline from execution pipeline • Can continue to fetch and decode until the “window” is full • When a functional unit becomes available an instruction can be executed (usually as close to program order as possible) • Since instructions have been decoded, processor can look ahead

Out-of-Order Issue -- Out-of-Order Completion (Example) • Again: • I1 requires 2 cycles to execute • I3 & I4 conflict for the same functional unit • I5 depends upon value produced by I4 • I5 & I6 conflict for a functional unit Note: I5 depends upon I4, but I6 does not
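The constraints in this example can be run through a toy model of out-of-order issue. Everything below is an assumption made for illustration rather than a reproduction of the slide's timing diagram: the `Instr` record, the unit names "A" to "D", a decode width of 2, an 8-entry window, and results becoming usable in the cycle after execution ends.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    unit: str         # functional unit the instruction needs (hypothetical names)
    latency: int = 1  # cycles spent executing
    deps: tuple = ()  # names of instructions whose results it reads


def simulate_ooo_issue(program, decode_width=2, window_size=8):
    """Return {name: cycle in which the instruction starts executing}."""
    pending = list(program)  # not yet decoded, in program order
    window = []              # decoded, waiting to issue
    unit_free_at = {}        # unit -> first cycle it is free again
    done_at = {}             # name -> first cycle its result is usable
    start = {}
    cycle = 1
    while pending or window:
        # Issue: scan the window oldest-first and start whatever is ready.
        for ins in list(window):
            operands_ready = all(done_at.get(d, float("inf")) <= cycle for d in ins.deps)
            unit_free = unit_free_at.get(ins.unit, 0) <= cycle
            if operands_ready and unit_free:
                start[ins.name] = cycle
                unit_free_at[ins.unit] = cycle + ins.latency
                done_at[ins.name] = cycle + ins.latency
                window.remove(ins)
        # Decode: refill the window; new entries can issue from the next cycle.
        decoded = 0
        while pending and decoded < decode_width and len(window) < window_size:
            window.append(pending.pop(0))
            decoded += 1
        cycle += 1
    return start


program = [
    Instr("I1", "A", latency=2),
    Instr("I2", "B"),
    Instr("I3", "C"),
    Instr("I4", "C"),                # conflicts with I3 for unit C
    Instr("I5", "D", deps=("I4",)),  # needs the value produced by I4
    Instr("I6", "D"),                # conflicts with I5 for unit D
]
print(simulate_ooo_issue(program))
# {'I1': 2, 'I2': 2, 'I3': 3, 'I4': 4, 'I6': 4, 'I5': 5}
```

In this model I6 starts in cycle 4, ahead of I5, because it does not depend on I4. A dependency on a name that never appears in the program would stall forever, so the sketch expects a well-formed program.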

Register Renaming to avoid hazards • Output and antidependencies occur because register contents may not reflect the correct ordering from the program • Can require a pipeline stall • One solution: Allocate Registers dynamically (renaming registers)

Register Renaming example
Add R3, R3, R5    R3b := R3a + R5a    (I1)
Add R4, R3, 1     R4b := R3b + 1      (I2)
Add R3, R5, 1     R3c := R5a + 1      (I3)
Add R7, R3, R4    R7b := R3c + R4b    (I4)
• A name without a subscript refers to the logical register in the instruction
• A name with a subscript is the hardware register allocated: R3a R3b R3c
Note: R3c avoids the antidependency on I2 and the output dependency on I1
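The renaming in this example follows a simple rule: read each source through the current mapping, then allocate a fresh hardware register for the destination. A minimal sketch of that rule follows; the function name `rename`, the `(dest, sources)` encoding, and the letter-suffix scheme are assumptions made for illustration (real hardware draws from a finite pool of physical registers rather than generating names).

```python
def rename(program, arch_regs):
    """Rename registers so each write gets a fresh hardware register.

    `program` is a list of (dest, sources) pairs; operands that are not
    in `arch_regs` (such as the immediate "1") pass through unchanged.
    """
    version = {r: 0 for r in arch_regs}  # 0 -> suffix "a", 1 -> "b", ...

    def current(operand):
        if operand in version:
            return operand + chr(ord("a") + version[operand])
        return operand

    renamed = []
    for dest, sources in program:
        srcs = [current(s) for s in sources]  # read the old mappings first
        version[dest] += 1                    # then allocate a new register
        renamed.append((current(dest), srcs))
    return renamed


program = [
    ("R3", ["R3", "R5"]),  # I1
    ("R4", ["R3", "1"]),   # I2
    ("R3", ["R5", "1"]),   # I3
    ("R7", ["R3", "R4"]),  # I4
]
for dest, srcs in rename(program, ["R3", "R4", "R5", "R7"]):
    print(dest, ":=", " + ".join(srcs))
# R3b := R3a + R5a
# R4b := R3b + 1
# R3c := R5a + 1
# R7b := R3c + R4b
```

Because I3 writes R3c rather than R3b, it no longer has to wait for I2 to read R3b or for I1 to finish writing it.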

Recapping: Machine Parallelism Support • Duplication of Resources • Out of order issue hardware • Windowing to decouple execution from decode • Register Renaming capability

Speedups of Machine Organizations (Without Procedural Dependencies) • Not worth duplication of functional units without register renaming • Need instruction window large enough (more than 8, probably not more than 32)

Branch Prediction in Superscalar Machines • Delayed branch not used much. Why? Multiple instructions need to execute in the delay slot. This leads to much complexity in recovery. • Branch prediction should be used - Branch history is very useful
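The slide does not say which history-based scheme is meant. One commonly taught option is a 2-bit saturating counter per branch, sketched below as an assumption rather than as the method the slides have in mind; the class name, the initial counter value, and the example address are all hypothetical.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, addr):
        return self.counters.get(addr, 1) >= 2  # unseen branches start weakly not taken

    def update(self, addr, taken):
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)


def count_mispredictions(predictor, addr, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(addr) != taken:
            misses += 1
        predictor.update(addr, taken)
    return misses


# A loop branch taken 9 times and then falling through, executed twice.
loop_branch = [True] * 9 + [False]
misses = count_mispredictions(TwoBitPredictor(), 0x400, loop_branch * 2)
print(misses)  # 3
```

After the first pass the counter stays in a taken state across the single loop exit, so the second pass mispredicts only the exit itself.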

View of Superscalar Execution

Committing or Retiring Instructions Results need to be put into order (commit or retire) • Results sometimes must be held in temporary storage until it is certain they can be placed in “permanent” storage (they are then either committed/retired or flushed) • Temporary storage requires regular clean up – overhead – done in hardware.
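The idea of holding results until they can be retired in order can be sketched with a queue kept in program order. This is a simplified software model under stated assumptions: the class and method names are hypothetical, and the hardware structure it imitates also stores the result values, which are omitted here.

```python
from collections import deque


class ReorderBuffer:
    """Instructions may complete in any order but retire oldest-first."""

    def __init__(self):
        self.entries = deque()  # program order, oldest on the left

    def allocate(self, name):
        self.entries.append({"name": name, "done": False})

    def complete(self, name):
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True

    def retire(self):
        """Retire the longest completed prefix; stop at the first unfinished entry."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
for name in ("I1", "I2", "I3"):
    rob.allocate(name)
rob.complete("I2")
print(rob.retire())  # []  (I2 is done but must wait behind I1)
rob.complete("I1")
print(rob.retire())  # ['I1', 'I2']
```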

Superscalar Hardware Support • Facilities to simultaneously fetch multiple instructions • Logic to determine true dependencies involving register values and Mechanisms to communicate these values • Mechanisms to initiate multiple instructions in parallel • Resources for parallel execution of multiple instructions • Mechanisms for committing process state in correct order

Example: Pentium 4, a Superscalar CISC Machine

Pentium 4 alternate view

Pentium 4 pipeline: 20 stages!

a) Generation of Micro-ops (stages 1 & 2) • Using the Branch Target Buffer and Instruction Translation Lookaside Buffer, the x86 instructions are fetched 64 bytes at a time from the L2 cache • The instruction boundaries are determined and each instruction is decoded into one to four 118-bit RISC micro-ops • Micro-ops are stored in the trace cache

b) Trace cache next instruction pointer (stage 3) • The Trace Cache Branch Target Buffer contains dynamically gathered history information (4-bit tag) • If target is not in BTB: • Branch not PC relative: predict taken if it is a return, predict not taken otherwise • For PC relative backward conditional branches, predict taken, otherwise not taken
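The fallback rules used when a branch has no BTB entry reduce to a small decision function. The sketch below encodes only the rules listed on this slide; the function name and boolean parameters are assumptions chosen for illustration.

```python
def static_predict(pc_relative, is_return=False, backward=False, conditional=False):
    """Predict taken (True) or not taken (False) for a branch missing from the BTB."""
    if not pc_relative:
        # Not PC relative: taken only if the branch is a return.
        return is_return
    # PC relative: taken only for backward conditional branches (typical loop ends).
    return backward and conditional


print(static_predict(pc_relative=False, is_return=True))                   # True
print(static_predict(pc_relative=True, backward=True, conditional=True))   # True
print(static_predict(pc_relative=True, backward=False, conditional=True))  # False
```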

c) Trace Cache fetch (stage 4) • Orders micro-ops in program-ordered sequences called traces • These are fetched in order, subject to branch prediction • Some instructions require many micro-ops (complex CISC instructions). These are coded into the ROM and fetched from the ROM

d) Drive (stage 5) • Delivers instructions from the Trace Cache to the Rename/Allocator module for reordering

e) Allocate: register renaming (stages 6, 7, & 8)
• Allocates resources for execution (3 micro-ops arrive per clock cycle):
  - Each micro-op is allocated to a slot in the 126-position circular Reorder Buffer (ROB), which tracks progress of the micro-ops.
• Buffer entries include:
  - State: scheduled, dispatched, completed, ready for retire
  - Address of the instruction that generated the micro-op
  - Operation
  - Alias register: a reference to one of the 16 architectural registers is redirected to one of 128 alias registers (to remove data dependencies)
• The micro-ops are dispatched out of order as resources are available
• Allocates an entry to one of the 2 scheduler queues: memory access or not
• The micro-ops are retired in order from the ROB

f) Micro-op queuing (stage 9) • Micro-ops are loaded into one of 2 queues: • - one for memory operations • - one for non memory operations • Each queue operates on a FIFO policy

g) Micro-op scheduling (stages 10, 11, & 12) h) Dispatch (stages 13 & 14) • The 2 schedulers retrieve micro-ops based upon having all the operands ready and dispatch them to an available unit (up to 6 per clock cycle) • If two micro-ops need the same unit, they are dispatched in sequence.
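The dispatch rule described here (operands ready, unit available, at most 6 per clock) can be sketched as a single-cycle selection function. The dictionary layout of a micro-op, the unit names, and the function name are assumptions for illustration; the two separate scheduler queues are flattened into one oldest-first list to keep the sketch short.

```python
def dispatch(candidates, ready_regs, free_units, max_per_cycle=6):
    """Pick the micro-ops to dispatch this cycle.

    candidates: list of dicts with "name", "sources", "unit", oldest first.
    A micro-op goes out only if all its sources are ready and its unit is
    still free this cycle; a second micro-op wanting the same unit waits.
    """
    free = set(free_units)
    dispatched = []
    for uop in candidates:
        if len(dispatched) == max_per_cycle:
            break
        operands_ready = all(s in ready_regs for s in uop["sources"])
        if operands_ready and uop["unit"] in free:
            free.remove(uop["unit"])
            dispatched.append(uop["name"])
    return dispatched


candidates = [
    {"name": "u1", "sources": ["r1"], "unit": "alu0"},
    {"name": "u2", "sources": ["r2"], "unit": "alu0"},  # same unit as u1
    {"name": "u3", "sources": ["r9"], "unit": "fpu"},   # r9 not ready yet
    {"name": "u4", "sources": [], "unit": "load"},
]
print(dispatch(candidates, {"r1", "r2"}, {"alu0", "fpu", "load"}))  # ['u1', 'u4']
```

Calling the function again on the next cycle, with u1 and u4 removed, would let u2 take the ALU, which matches the "dispatched in sequence" behavior for unit conflicts.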

i) Register file (stages 15 & 16) j) Execute: flags (stages 17 & 18) • The register files are the sources for pending fixed and FP operations • A separate stage is used to compute the flags

k) Branch check (stage 19) l) Branch check results (stage 20) • Checks flags and compares results with predictions • If the branch prediction was wrong: • all incorrect micro-ops must be flushed (don’t want to be wrong!) • the correct branch destination is provided to the Branch Predictor • the pipeline is restarted from the new target address
