
Computer Architecture



Presentation Transcript


  1. Computer Architecture Chapter 3 Instruction-Level Parallelism 2 Prof. Jerry Breecher CSCI 240 Fall 2003

  2. Chapter Overview
  3.1 Instruction Level Parallelism: Concepts and Challenges
  3.2 Overcoming Data Hazards with Dynamic Scheduling
  3.3 Dynamic Scheduling: Examples & The Algorithm
  3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
  3.5 High Performance Instruction Delivery
  3.6 Taking Advantage of More ILP with Multiple Issue
  3.7 Hardware-based Speculation
  3.8 Studies of The Limitations of ILP
  3.10 The Pentium 4
  Chap. 3 -ILP 2

  3. Ideas To Reduce Stalls (table of stall-reduction techniques, divided between those covered in Chapter 3 and those covered in Chapter 4) Chap. 3 -ILP 2

  4. Dynamic Hardware Prediction
  Dynamic branch prediction is the hardware's ability to make an educated guess about which way a branch will go: taken or not taken. The hardware can look for clues in the instructions themselves, or it can use past history; we will discuss both approaches.
  Key Concept: A Branch History Table records what a branch did the last time it was executed.
  Chap. 3 -ILP 2

  5. Dynamic Hardware Prediction
  Basic Branch Prediction: Branch Prediction Buffers
  • Performance = ƒ(accuracy, cost of misprediction)
  • Branch History Table (BHT): the lower bits of the branch PC index a table of 1-bit values
  • Each bit says whether or not the branch was taken the last time it executed
  • Problem: for a loop branch, a 1-bit BHT causes two mispredictions per loop execution:
  • At loop exit, when the branch falls through instead of looping as before
  • On the first iteration the next time through the code, when it predicts exit instead of looping
  Chap. 3 -ILP 2
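The two mispredictions per loop execution can be seen in a small simulation (a sketch of ours, not from the slides): a 1-bit entry simply remembers the branch's last outcome.

```python
def simulate_1bit(outcomes):
    """Mispredictions for one BHT entry that stores only the branch's
    last outcome. Initial state: predict taken."""
    predict_taken = True
    mispredicts = 0
    for taken in outcomes:
        if predict_taken != taken:
            mispredicts += 1
        predict_taken = taken          # 1-bit scheme: remember only the last outcome
    return mispredicts

# A 9-iteration loop: the backward branch is taken 8 times, then falls through.
run = [True] * 8 + [False]
```

Running the loop once costs one misprediction (the exit); each later pass costs two, because the stored bit now says "not taken" at the first iteration, and the exit surprises it again.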

  6. Dynamic Hardware Prediction
  Basic Branch Prediction: Branch Prediction Buffers
  • How does it work? Low-order bits of the 32-bit branch address, just above the byte offset, index the Branch History Table; a 1024-entry table (entries 0-1023) needs 10 such index bits. Based on its address, each branch's prediction is stored in the table.
  Chap. 3 -ILP 2
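The indexing can be sketched as follows (our code, with hypothetical PC values): the 2-bit byte offset of the word-aligned PC is dropped and the low-order remaining bits select an entry. Because there are no tags, two branches whose addresses differ by a multiple of the table span share an entry.

```python
def bht_index(pc, entries=1024):
    """Index into a direct-mapped branch-history table: drop the 2-bit
    byte offset, keep log2(entries) low-order bits of the branch PC."""
    return (pc >> 2) & (entries - 1)
```

Aliasing falls out directly: `bht_index(0x1000)` and `bht_index(0x1000 + 1024 * 4)` hit the same entry, which is one of the misprediction causes listed on the next slides.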

  7. Dynamic Hardware Prediction
  Basic Branch Prediction: Branch Prediction Buffers
  • Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions (Figure 3.7, p. 198)
  • Four states: two "predict taken" states and two "predict not taken" states; a taken outcome (T) moves toward predict-taken, a not-taken outcome (NT) moves toward predict-not-taken, so a single surprise only moves to the weaker state with the same prediction
  • Red: stop, predict not taken; green: go, predict taken
  • Adds hysteresis to the decision-making process
  Chap. 3 -ILP 2
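The four-state diagram can be sketched as a saturating counter (our simulation, not from the text): values 0-3, predict taken when the counter is 2 or more.

```python
def simulate_2bit(outcomes, counter=3):
    """Mispredictions for one 2-bit saturating counter (0..3).
    Predict taken when counter >= 2; a single surprise only drops to
    the weak state, so the prediction flips after two misses."""
    mispredicts = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            mispredicts += 1
        # saturating update: taken moves up, not-taken moves down
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return mispredicts

run = [True] * 8 + [False]      # one pass of a 9-iteration loop
```

Two passes of the loop now cost 2 mispredictions (only the exits) instead of the 1-bit scheme's 3: the exit surprise no longer flips the prediction for the next pass.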

  8. Dynamic Hardware Prediction
  Basic Branch Prediction: Branch Prediction Buffers
  Branch History Table Accuracy
  • We mispredict because either:
  • We made the wrong guess for that branch, or
  • We got the history of the wrong branch when indexing the table
  • With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
  • 4096 entries is about as good as an infinite table, but 4096 is a lot of HW
  Chap. 3 -ILP 2

  9. Dynamic Hardware Prediction
  Basic Branch Prediction: Branch Prediction Buffers
  Correlating Branches. What if we have the code:
  if (aa == 2) aa = 0;
  if (bb == 2) bb = 0;
  if (aa != bb) { ...
  Then the third "if" can be somewhat predicted based on the outcomes of the first two. Generated MIPS code (aa in R1, bb in R2):
  DSUBUI R3,R1,#2
  BNEZ R3,L1 ; branch b1 (skip if aa != 2)
  DADD R1,R0,R0 ; aa = 0
  L1: DSUBUI R3,R2,#2
  BNEZ R3,L2 ; branch b2 (skip if bb != 2)
  DADD R2,R0,R0 ; bb = 0
  L2: DSUBU R3,R1,R2
  BEQZ R3,L3 ; branch b3: based on the outcome of the previous 2 branches
  Chap. 3 -ILP 2

  10. Dynamic Hardware Prediction
  Basic Branch Prediction: Branch Prediction Buffers
  Correlating Branches
  • Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to that branch's own history)
  • The behavior of recent branches then selects between, say, 4 predictions for the next branch, and only the selected prediction is updated
  • (2,2) predictor: 2 bits of global branch history select among four local 2-bit predictors per entry; the table is indexed by the branch address (4 bits in the figure), and a global history of 01 means "not taken, then taken"
  Chap. 3 -ILP 2
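The (2,2) scheme above can be sketched as a toy predictor (our code; table size and initial counter values are our assumptions): a 2-bit global history of the last two branches picks one of four 2-bit counters in each branch-table entry.

```python
class TwoTwoPredictor:
    """Toy (2,2) correlating predictor: per-branch entry holds four
    2-bit saturating counters, selected by 2 bits of global history."""
    def __init__(self, entries=16):
        self.table = [[1] * 4 for _ in range(entries)]  # start weakly not-taken
        self.ghist = 0                                  # 2-bit global history
        self.mask = entries - 1

    def predict(self, pc):
        return self.table[(pc >> 2) & self.mask][self.ghist] >= 2

    def update(self, pc, taken):
        ctrs = self.table[(pc >> 2) & self.mask]
        g = self.ghist
        ctrs[g] = min(ctrs[g] + 1, 3) if taken else max(ctrs[g] - 1, 0)
        self.ghist = ((g << 1) | int(taken)) & 0b11     # shift in the outcome
```

A branch that strictly alternates taken/not-taken defeats a plain 2-bit counter, but here each history pattern trains its own counter, so after warm-up the alternation is predicted perfectly.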

  11. Dynamic Hardware Prediction
  Basic Branch Prediction: Branch Prediction Buffers
  Accuracy of Different Schemes (Figure 3.15, p. 206): frequency of mispredictions, from 0% up to 18%, compared across a 4096-entry buffer with 2 bits per entry, an unlimited-entry buffer with 2 bits per entry, and a 1024-entry buffer with 2 bits of history and 2 bits per entry.
  Chap. 3 -ILP 2

  12. High Performance Instruction Delivery
  The goal here is to be able to fetch an instruction from the destination of a branch: you need the target address at the same time as you make the prediction. That is tough if you don't know where the target instruction is until the branch has been resolved. The solution is a table that remembers the destination addresses of previous branches.
  Chap. 3 -ILP 2

  13. Dynamic Hardware Prediction
  Basic Branch Prediction: Branch Target Buffers
  • Branch Target Buffer (BTB): use the address of the branch as an index to get the prediction AND the branch target address (if taken) (Figure 3.19, p. 210)
  • The PC of the fetched instruction is compared (=?) against the stored branch PCs; extra bits hold the prediction state
  • Yes (match): the instruction is a branch; use the predicted PC as the next fetch PC
  • No (miss): the branch is not predicted; proceed normally (next PC = PC+4)
  • Note: we must check for a full branch match, since we can't afford to use the target address of the wrong branch
  Chap. 3 -ILP 2
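The lookup described above can be sketched like this (our toy structure, using a dictionary where real hardware would use a tagged, set-associative table):

```python
class BranchTargetBuffer:
    """Toy BTB: maps the fetch PC of a branch to its predicted target,
    so fetch can redirect in the same cycle the prediction is made."""
    def __init__(self):
        self.entries = {}                  # branch PC -> (target, predict taken?)

    def next_fetch_pc(self, pc):
        hit = self.entries.get(pc)         # exact-PC match plays the role of the tag check
        if hit is not None and hit[1]:
            return hit[0]                  # predicted-taken branch: fetch the target
        return pc + 4                      # miss, or predicted not taken: sequential fetch

    def record(self, pc, target, taken):
        self.entries[pc] = (target, taken)
```

The full-PC match is the point of the slide's "=?" box: redirecting fetch to another branch's target would be worse than simply falling through to PC+4.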

  14. Dynamic Hardware Prediction
  Basic Branch Prediction: Branch Target Buffers
  Example (page 211). Penalties per case:
  Case 1: instruction in buffer, predicted taken, actually taken - penalty 0 cycles
  Case 2: instruction in buffer, predicted taken, actually not taken - penalty 2 cycles
  Case 3: not in buffer, actually taken - penalty 2 cycles
  Case 4: not in buffer, actually not taken - penalty 0 cycles
  Determine the total branch penalty for a BTB using the above penalties. Assume: prediction accuracy of 90%, hit rate in the buffer of 90%, and 60% of branches taken.
  Branch penalty = (buffer hit rate X percent incorrect predictions X 2)   [case 2]
  + ((1 - buffer hit rate) X percent taken branches X 2)   [case 3]
  Branch penalty = (90% X 10% X 2) + (10% X 60% X 2) = 0.18 + 0.12 = 0.30 clock cycles
  Chap. 3 -ILP 2
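The same arithmetic, spelled out: only cases 2 and 3 cost cycles, and each costs 2.

```python
hit_rate = 0.90        # fraction of branches found in the buffer
accuracy = 0.90        # prediction accuracy on buffer hits
taken = 0.60           # fraction of branches that are taken

case2 = hit_rate * (1 - accuracy) * 2    # in buffer, predicted wrong
case3 = (1 - hit_rate) * taken * 2       # not in buffer, branch taken
penalty = case2 + case3                  # average cycles lost per branch
```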

  15. Multiple Issue
  Multiple issue is the ability of the processor to start more than one instruction in a given cycle. Our dream is CPI < 1 !!
  Chap. 3 -ILP 2

  16. Multiple Issue
  Issuing Multiple Instructions/Cycle
  Flavor I: Vector processing: explicit coding of independent loops as operations on large vectors of numbers
  • Multimedia instructions are being added to many processors
  Flavor II: Superscalar processors issue a varying number of instructions per clock (1 to 8), scheduled either statically (by the compiler) or dynamically (by the hardware, e.g., Tomasulo). Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000
  Chap. 3 -ILP 2

  17. Multiple Issue
  Issuing Multiple Instructions/Cycle
  Flavor III: VLIW (Very Long Instruction Word) issues a fixed number of instructions (4-16), formatted either as one very large instruction or as a fixed packet of smaller instructions, scheduled by the compiler, which places operations into wide templates
  • Joint HP/Intel agreement in 1999/2000
  • Intel Architecture-64 (IA-64): 64-bit addresses
  • Style: "Explicitly Parallel Instruction Computer" (EPIC)
  Chap. 3 -ILP 2

  18. Multiple Issue
  Issuing Multiple Instructions/Cycle
  Flavor III, continued:
  • 3 instructions per 128-bit "group"; a template field marks instructions as dependent or independent
  • Smaller code size than old VLIW, larger than x86/RISC
  • Groups can be linked to show independence across more than 3 instructions
  • 128 integer registers + 128 floating point registers
  • Not separate register files per functional unit as in old VLIW
  • Hardware checks dependencies (interlocks => binary compatibility over time)
  • Predicated execution (select 1 out of 64 1-bit predicate flags) => 40% fewer mispredictions?
  • IA-64: name of the instruction set architecture; EPIC: the style
  • Itanium: name of the current implementation
  Chap. 3 -ILP 2

  19. Multiple Issue
  A Superscalar Version of MIPS
  Issuing Multiple Instructions/Cycle
  • In our MIPS example, we can handle 2 instructions/cycle: one floating point + anything else
  - Fetch 64 bits/clock cycle; integer on the left, FP on the right
  - Can only issue the 2nd instruction if the 1st instruction issues
  - Need more ports on the FP register file to do an FP load and an FP op in a pair
  Pipe stages (each issue pair starts one cycle after the previous pair):
  Int. instruction  IF ID EX MEM WB
  FP instruction    IF ID EX MEM WB
  Int. instruction     IF ID EX MEM WB
  FP instruction       IF ID EX MEM WB
  Int. instruction        IF ID EX MEM WB
  FP instruction          IF ID EX MEM WB
  • A 1-cycle load delay now delays 3 instructions in the superscalar: the instruction in the right half of the load's slot can't use the result, nor can either instruction in the next slot
  Chap. 3 -ILP 2

  20. Multiple Issue
  A Superscalar Version of MIPS
  Unrolled Loop that Minimizes Stalls for the Scalar Pipeline:
  1 Loop: LD F0,0(R1)
  2 LD F6,-8(R1)
  3 LD F10,-16(R1)
  4 LD F14,-24(R1)
  5 ADDD F4,F0,F2
  6 ADDD F8,F6,F2
  7 ADDD F12,F10,F2
  8 ADDD F16,F14,F2
  9 SD 0(R1),F4
  10 SD -8(R1),F8
  11 SD -16(R1),F12
  12 SUBI R1,R1,#32
  13 BNEZ R1,LOOP
  14 SD 8(R1),F16 ; 8-32 = -24
  14 clock cycles, or 3.5 per iteration
  Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles
  Chap. 3 -ILP 2

  21. Multiple Issue
  A Superscalar Version of MIPS
  Loop Unrolling in the Superscalar (integer instruction | FP instruction | clock cycle):
  Loop: LD F0,0(R1)   |                  | 1
  LD F6,-8(R1)        |                  | 2
  LD F10,-16(R1)      | ADDD F4,F0,F2    | 3
  LD F14,-24(R1)      | ADDD F8,F6,F2    | 4
  LD F18,-32(R1)      | ADDD F12,F10,F2  | 5
  SD 0(R1),F4         | ADDD F16,F14,F2  | 6
  SD -8(R1),F8        | ADDD F20,F18,F2  | 7
  SD -16(R1),F12      |                  | 8
  SD -24(R1),F16      |                  | 9
  SUBI R1,R1,#40      |                  | 10
  BNEZ R1,LOOP        |                  | 11
  SD 8(R1),F20        |                  | 12
  • Unrolled 5 times to avoid delays (one extra copy due to superscalar issue)
  • 12 clocks, or 2.4 clocks per iteration
  Chap. 3 -ILP 2

  22. Multiple Issue
  Multiple Instruction Issue & Dynamic Scheduling
  Performance of the Dynamic Superscalar (clock cycle at which each instruction issues / executes / writes its result):
  Iter 1: LD F0,0(R1)    issues 1, executes 2, writes result 4
  Iter 1: ADDD F4,F0,F2  issues 1, executes 5, writes result 8
  Iter 1: SD 0(R1),F4    issues 2, executes 9
  Iter 1: SUBI R1,R1,#8  issues 3, executes 4, writes result 5
  Iter 1: BNEZ R1,LOOP   issues 4, executes 5
  Iter 2: LD F0,0(R1)    issues 5, executes 6, writes result 8
  Iter 2: ADDD F4,F0,F2  issues 5, executes 9, writes result 12
  Iter 2: SD 0(R1),F4    issues 6, executes 13
  Iter 2: SUBI R1,R1,#8  issues 7, executes 8, writes result 9
  Iter 2: BNEZ R1,LOOP   issues 8, executes 9
  ~4 clocks per iteration; branches and decrements still take 1 clock cycle
  Chap. 3 -ILP 2

  23. Multiple Issue
  VLIW: Loop Unrolling in VLIW. Each long word has five slots (memory reference 1 | memory reference 2 | FP operation 1 | FP operation 2 | integer op / branch):
  Clock 1: LD F0,0(R1) | LD F6,-8(R1)
  Clock 2: LD F10,-16(R1) | LD F14,-24(R1)
  Clock 3: LD F18,-32(R1) | LD F22,-40(R1) | ADDD F4,F0,F2 | ADDD F8,F6,F2
  Clock 4: LD F26,-48(R1) | ADDD F12,F10,F2 | ADDD F16,F14,F2
  Clock 5: ADDD F20,F18,F2 | ADDD F24,F22,F2
  Clock 6: SD 0(R1),F4 | SD -8(R1),F8 | ADDD F28,F26,F2
  Clock 7: SD -16(R1),F12 | SD -24(R1),F16
  Clock 8: SD -32(R1),F20 | SD -40(R1),F24 | SUBI R1,R1,#48
  Clock 9: SD -0(R1),F28 | BNEZ R1,LOOP
  • Unrolled 7 times to avoid delays
  • 7 results in 9 clocks, or 1.3 clocks per iteration
  • Need more registers to use VLIW effectively
  Chap. 3 -ILP 2

  24. Multiple Issue
  Limitations With Multiple Issue
  Limits to Multi-Issue Machines
  • Inherent limitations of ILP
  • 1 branch in 5 instructions => how to keep a 5-way VLIW busy?
  • Latencies of units => many operations must be scheduled
  • Need about (pipeline depth x number of functional units) independent operations to keep the machine busy
  • Difficulties in building the HW
  • Duplicate functional units to get parallel execution
  • Increase ports to the register file (the VLIW example needs 6 read and 3 write ports for the integer registers and 6 read and 4 write ports for the FP registers)
  • Increase ports to memory
  • Superscalar decoding and its impact on clock rate and pipeline depth
  Chap. 3 -ILP 2

  25. Multiple Issue
  Limitations With Multiple Issue
  Limits to Multi-Issue Machines
  • Limitations specific to either the SS or VLIW implementation
  • Decode/issue complexity in SS
  • VLIW code size: unrolled loops + wasted fields in the VLIW word
  • VLIW lock step => 1 hazard stalls all instructions
  • VLIW & binary compatibility
  Chap. 3 -ILP 2

  26. Multiple Issue
  Limitations With Multiple Issue
  Multiple Issue Challenges
  • While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  • Exactly 50% FP operations
  • No hazards
  • If more instructions issue at the same time, decode and issue get harder
  • Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue
  • VLIW: trade instruction space for simple decoding
  • The long instruction word has room for many operations
  • By definition, all the operations the compiler puts in the long instruction word are independent => they execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
  • 16 to 24 bits per field => 7x16 = 112 bits to 7x24 = 168 bits wide
  • Need a compiling technique that schedules across several branches
  Chap. 3 -ILP 2

  27. Speculation
  • Every cycle, execute an instruction, even if it may be the WRONG instruction!
  • Prediction alone is not sufficient for a high amount of ILP
  • Overcome control dependence by speculating on the outcome of branches
  • Execute the program as if our guesses were correct
  • Dynamic scheduling: fetch and issue (no execute past unresolved branches)
  • Speculation: fetch, issue, and execute
  • Incorrect speculation => undo
  Chap. 3 -ILP 2

  28. Speculation
  Key Ideas
  • Dynamic branch prediction to choose which instructions to execute
  • Speculation to allow the execution of instructions before the control dependence is resolved
  • Dynamic scheduling to deal with the scheduling of different combinations of basic blocks
  • Used in: PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, AMD K5/K6/Athlon
  Chap. 3 -ILP 2

  29. Speculation
  Load/Store RAW Hazard
  • Question: Given a load that follows a store in program order, are the two related? (Alternatively: is there a RAW hazard between the store and the load?) E.g.:
  st 0(R2),R5
  ld R6,0(R3)
  • Can we go ahead and start the load early?
  • The store address could be delayed for a long time by some calculation that leads to R2 (a divide?)
  • We might want to issue/begin execution of both operations in the same cycle
  • The answer is that we are not allowed to start the load until we know that address 0(R2) ≠ 0(R3)
  Chap. 3 -ILP 2

  30. Speculation
  Hardware Support for Memory Disambiguation
  • Need a buffer to keep track of all outstanding stores to memory, in program order
  • Keep track of each store's address (when it becomes available) and value (when it becomes available)
  • FIFO ordering: stores retire from this buffer in program order
  • When issuing a load, record the current head of the store queue (so we know which stores are ahead of it)
  • When the load's address is available, check the store queue:
  • If any store prior to the load is waiting for its address, stall the load
  • If the load address matches an earlier store address (associative lookup), then we have a memory-induced RAW hazard:
  • Store value available => return the value
  • Store value not available => return the ROB number of the source
  • Otherwise, send the request out to memory
  • Actual stores commit in order, so there is no worry about WAR/WAW hazards through memory
  Chap. 3 -ILP 2
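The check above can be sketched as a function (our names, not from the text): each pending store ahead of the load holds an (address, value) pair, either of which may still be unknown (`None`).

```python
def resolve_load(store_queue, load_addr):
    """store_queue: (address, value) pairs for stores ahead of the load,
    oldest first. Returns 'stall', ('forward', value),
    ('wait', index of producing store), or 'memory'."""
    match = None
    for i, (addr, value) in enumerate(store_queue):
        if addr is None:
            return 'stall'                 # an earlier store's address is unknown
        if addr == load_addr:
            match = (i, value)             # keep the youngest matching store
    if match is None:
        return 'memory'                    # no conflict: go to memory
    i, value = match
    if value is not None:
        return ('forward', value)          # RAW hazard: forward the store's data
    return ('wait', i)                     # data not ready: wait on its producer
```

Note the conservative rule from the slide: any earlier store with an unresolved address stalls the load, even if some other store already matches.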

  31. Studies of ILP
  • Conflicting studies of how much improvement is available
  • Benchmarks (vectorized FP Fortran vs. integer C programs)
  • Hardware sophistication
  • Compiler sophistication
  • How much ILP is available using existing mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?
  Chap. 3 -ILP 2

  32. Studies of ILP
  Limits to ILP
  Initial HW model here; MIPS compilers. Assumptions for an ideal/perfect machine to start:
  1. Register renaming - infinite virtual registers, so all WAW & WAR hazards are avoided
  2. Branch prediction - perfect; no mispredictions
  3. Jump prediction - all jumps perfectly predicted => a machine with perfect speculation & an unbounded buffer of instructions available
  4. Memory-address alias analysis - addresses are known, and a store can be moved before a load provided the addresses are not equal
  Also: 1-cycle latency for all instructions; unlimited number of instructions issued per clock cycle
  Chap. 3 -ILP 2

  33. Studies of ILP
  Upper Limit to ILP: Ideal Machine (Figure 3.35, page 242)
  This is the amount of parallelism available when there are no branch mispredictions and we are limited only by data dependencies. IPC (instructions that could theoretically be issued per cycle): 75-150 for the FP programs, 18-60 for the integer programs.
  Chap. 3 -ILP 2

  34. Studies of ILP
  Impact of Realistic Branch Prediction
  What parallelism do we get when we don't allow perfect branch prediction, as in the last picture, but assume some realistic model? Possibilities include:
  1. Perfect - all branches are perfectly predicted (the last slide)
  2. Selective history predictor - a complicated but doable selection mechanism
  3. Standard 2-bit history predictor with 512 2-bit entries
  4. Static prediction based on the past (profiled) history of the program
  5. None - parallelism is limited to the basic block
  Chap. 3 -ILP 2

  35. Studies of ILP
  Bonus!! Selective History Predictor: an 8K x 2-bit selector, indexed by the branch address, chooses between a non-correlating predictor (8K x 2 bits) and a correlating predictor (2048 x 4 x 2 bits, with 2 bits of global history selecting among the 4 counters per entry); each 2-bit counter predicts taken (11, 10) or not taken (01, 00).
  Chap. 3 -ILP 2

  36. Studies of ILP
  Impact of Realistic Branch Prediction (Figure 3.39, page 248)
  Limiting the type of branch prediction (perfect, selective history, 512-entry BHT, static, no prediction), the achievable IPC drops to 15-45 for the FP programs and 6-12 for the integer programs.
  Chap. 3 -ILP 2

  37. The Pentium 4
  • This section looks at some REAL implementations of the hardware, concentrating mostly on the Pentium 4.
  Chap. 3 -ILP 2

  38. The Pentium 4
  Dynamic Scheduling in the P6 (Pentium Pro, II, III)
  • Q: How do you pipeline 1- to 17-byte 80x86 instructions?
  • The P6 doesn't pipeline 80x86 instructions directly
  • The P6 decode unit translates the Intel instructions into 72-bit micro-operations (~ MIPS instructions)
  • It sends the micro-operations to the reorder buffer & reservation stations
  • Many instructions translate to 1 to 4 micro-operations
  • Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
  • 14 clocks in the total pipeline (~ 3 state machines)
  Chap. 3 -ILP 2

  39. The Pentium 4
  Dynamic Scheduling in the P6
  Parameter (80x86 / micro-ops):
  Max. instructions issued/clock: 3 / 6
  Max. instr. completing execution/clock: 5
  Max. instr. committed/clock: 3
  Window (instrs in reorder buffer): 40
  Number of reservation stations: 20
  Number of rename registers: 40
  No. integer functional units (FUs): 2
  No. floating point FUs: 1
  No. SIMD floating point FUs: 1
  No. memory FUs: 1 load + 1 store
  Chap. 3 -ILP 2

  40. The Pentium 4
  P6 Pipeline
  • 14 clocks in total (~3 state machines)
  • 8 stages are used for in-order instruction fetch, decode, and issue
  • Takes 1 clock cycle to determine the length of the 80x86 instructions + 2 more to create the micro-operations (uops)
  • 3 stages are used for out-of-order execution in one of 5 separate functional units
  • 3 stages are used for instruction commit
  • Flow (from the block diagram): Instr Fetch (16B/clk) -> Instr Decode (3 instr/clk) -> Renaming (3 uops/clk) -> Reorder Buffer / Reservation Stations (6 uops) -> Execution units (5) -> Graduation (3 uops/clk)
  Chap. 3 -ILP 2

  41. The Pentium 4
  AMD Athlon
  • Similar to the P6 microarchitecture (Pentium III), but with more resources
  • Transistors: PIII 24M vs. Athlon 37M
  • Die size: 106 mm2 vs. 117 mm2
  • Power: 30W vs. 76W
  • Cache (L1I/L1D/L2): 16K/16K/256K vs. 64K/64K/256K
  • Window size: 40 vs. 72 uops
  • Rename registers: 40 vs. 36 int + 36 FP
  • BTB: 512 x 2 vs. 4096 x 2
  • Pipeline: 10-12 stages vs. 9-11 stages
  • Clock rate: 1.0 GHz vs. 1.2 GHz
  • Memory bandwidth: 1.06 GB/s vs. 2.12 GB/s
  Chap. 3 -ILP 2

  42. The Pentium 4
  Pentium 4
  • Still translates from 80x86 to micro-ops
  • The P4 has a better branch predictor and more FUs
  • The instruction cache holds micro-operations instead of 80x86 instructions
  • No decode stages for 80x86 on a cache hit
  • Called the "trace cache" (TC)
  • Faster memory bus: 400 MHz vs. 133 MHz
  • Caches:
  • Pentium III: L1I 16KB, L1D 16KB, L2 256KB
  • Pentium 4: L1I 12K uops, L1D 8KB, L2 256KB
  • Block size: PIII 32B vs. P4 128B; 128 vs. 256 bits/clock
  • Clock rates: Pentium III 1 GHz vs. Pentium 4 1.5 GHz
  • Pipeline: 14 stages vs. 24 stages
  Chap. 3 -ILP 2

  43. The Pentium 4
  Pentium 4 Features
  • Multimedia instructions 128 bits wide vs. 64 bits wide => 144 new instructions
  • When will they be used by programs??
  • Faster floating point: executes 2 64-bit FP operations per clock
  • Memory FU: 1 128-bit load + 1 128-bit store per clock to the MMX registers
  • Uses RAMBUS DRAM
  • Bandwidth faster, latency the same as SDRAM
  • Cost 2X-3X vs. SDRAM
  • ALUs operate at 2X the clock rate for many ops
  • The pipeline doesn't stall at this clock rate: uops replay
  • Rename registers: 40 vs. 128; window: 40 vs. 126
  • BTB: 512 vs. 4096 entries (Intel: 1/3 improvement)
  Chap. 3 -ILP 2

  44. The Pentium 4
  Pentium, Pentium Pro, Pentium 4 Pipelines
  • Pentium (P5) = 5 stages
  • Pentium Pro, II, III (P6) = 10 stages (1-cycle execute)
  • Pentium 4 (NetBurst) = 20 stages (not counting decode)
  Chap. 3 -ILP 2

  45. The Pentium 4
  Block Diagram of the Pentium 4 Microarchitecture
  • BTB = Branch Target Buffer (branch predictor)
  • I-TLB = instruction TLB; Trace Cache = instruction cache
  • RF = register file; AGU = Address Generation Unit
  • "Double-pumped ALU" means the ALU clock rate is 2X => equivalent to 2X the ALU functional units
  Chap. 3 -ILP 2

  46. The Pentium 4
  Pentium 4 Die Photo
  • 42M transistors (PIII: 26M)
  • 217 mm2 (PIII: 106 mm2)
  • L1 execution (trace) cache buffers 12,000 micro-ops
  • 8KB data cache
  • 256KB L2 cache
  Chap. 3 -ILP 2

  47. The Pentium 4
  Benchmarks: Pentium 4 vs. PIII vs. Athlon
  • SPECbase2000
  • Int: P4@1.5 GHz: 524, PIII@1 GHz: 454, AMD Athlon@1.2 GHz: ?
  • FP: P4@1.5 GHz: 549, PIII@1 GHz: 329, AMD Athlon@1.2 GHz: 304
  • WorldBench 2000 benchmark (business), PC World magazine, Nov. 20, 2000 (bigger is better)
  • P4: 164, PIII: 167, AMD Athlon: 180
  • Quake 3 Arena: P4 172, Athlon 151
  • SYSmark 2000 composite: P4 209, Athlon 221
  • Office productivity: P4 197, Athlon 209
  • S.F. Chronicle 11/20/00: "... the challenge for AMD now will be to argue that frequency is not the most important thing-- precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."
  Chap. 3 -ILP 2

  48. The Pentium 4
  Why is the Pentium 4 Slower than the Pentium III?
  • Instruction count is the same for x86 code
  • Clock rates: P4 > Athlon > PIII
  • How can the P4 be slower?
  • Time = Instruction count x CPI x 1/Clock rate
  • The average clocks per instruction (CPI) of the P4 must be worse than that of the Athlon and PIII
  • Will CPI ever get below 1.0 for real programs?
  Chap. 3 -ILP 2
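The argument above is just the iron law of performance. A quick illustration (the CPI figures here are hypothetical, chosen only to show the effect; the slide gives no measured CPIs):

```python
def exec_time(instr_count, cpi, clock_hz):
    """Time = instruction count x CPI x clock period (1/clock rate)."""
    return instr_count * cpi / clock_hz

ic = 1e9                              # same x86 instruction count on both chips
t_p4 = exec_time(ic, 1.6, 1.5e9)      # assumed CPI of 1.6 at 1.5 GHz
t_p3 = exec_time(ic, 1.0, 1.0e9)      # assumed CPI of 1.0 at 1.0 GHz
```

With these assumed numbers the P4 takes about 1.07 s against the PIII's 1.0 s: a 50% clock advantage is wiped out by a 60% CPI disadvantage.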

  49. The Pentium 4
  Another Approach: Multithreaded Execution for Servers
  • Thread: a process with its own instructions and data
  • A thread may be one process of a parallel program of multiple processes, or it may be an independent program
  • Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
  • Multithreading: multiple threads share the functional units of 1 processor via overlapping
  • The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file and a separate PC
  • Memory is shared through the virtual memory mechanisms
  • Threads execute overlapped, often interleaved
  • When a thread is stalled, perhaps for a cache miss, another thread can execute, improving throughput
  Chap. 3 -ILP 2

  50. The Pentium 4
  Simultaneous Multithreading (SMT)
  • SMT insight: a dynamically scheduled processor already has many of the HW mechanisms needed to support multithreading
  • A large set of virtual (renamed) registers that can hold the register sets of independent threads (assuming separate renaming tables are kept for each thread)
  • Out-of-order completion allows the threads to execute out of order and get better utilization of the HW
  Chap. 3 -ILP 2
