ECE 4100/6100 Guest Lecture: P6 & NetBurst Microarchitecture

ECE 4100/6100 Guest Lecture: P6 & NetBurst Microarchitecture • Prof. Hsien-Hsin Sean Lee • School of ECE, Georgia Institute of Technology • February 11, 2003

Why study P6 from the last millennium? • A paradigm shift from Pentium • A RISC core disguised as a CISC • Huge market success: • Microarchitecture • And stock price • Architected by former VLIW and RISC folks • Multiflow (pioneer in VLIW architecture for super-minicomputers) • Intel i960 (Intel's RISC for graphics and embedded controllers) • NetBurst (P4's microarchitecture) is based on P6

P6 Basics • One implementation of the IA-32 architecture • Super-pipelined processor • 3-way superscalar • In-order front-end and back-end • Dynamic execution engine (restricted dataflow) • Speculative execution • P6 microarchitecture family processors include • Pentium Pro • Pentium II (PPro + MMX + 2x caches: 16KB I / 16KB D) • Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD) • Celeron (without MP support) • Later P-II/P-III/Celeron all have on-die L2 cache

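x86 Platform Architecture [Figure: platform block diagram. Host processor: P6 core with L1 cache (SRAM) and L2 cache (SRAM) on a back-side bus, on-die or on-package. The front-side bus connects the processor to the chipset (MCH, ICH). Other blocks: system memory (DRAM), graphics processor (GPU) on AGP with a local frame buffer, PCI, USB, I/O.]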

Pentium III Die Map • EBL/BBL - External/Backside Bus Logic • MOB - Memory Order Buffer • Packed FPU - Floating Point Unit for SSE • IEU - Integer Execution Unit • FAU - Floating Point Arithmetic Unit • MIU - Memory Interface Unit • DCU - Data Cache Unit (L1) • PMH - Page Miss Handler • DTLB - Data TLB • BAC - Branch Address Calculator • RAT - Register Alias Table • SIMD - Packed Floating Point unit • RS - Reservation Station • BTB - Branch Target Buffer • TAP - Test Access Port • IFU - Instruction Fetch Unit and L1 I-Cache • ID - Instruction Decode • ROB - Reorder Buffer • MS - Micro-instruction Sequencer

ISA Enhancements (on top of Pentium) • CMOVcc / FCMOVcc r, r/m • Conditional move (predicated move) instructions • Based on condition codes (cc) • FCOMI/P: compare FP stack and set integer flags • RDPMC/RDTSC instructions • Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory • MMX in Pentium II • SIMD integer operations • SSE in Pentium III • Prefetches (non-temporal nta + temporal t0, t1, t2), sfence • SIMD single-precision FP operations

P6 Pipelining [Figure: pipeline stage diagram. Stage names: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2. In-order front end, stages 11-17 (labels: I-Cache, Next IP, ILD, Rotate, Dec1, Dec2, Br Dec, IDQ, RAT, RS Write), ending at the FE in-order boundary. Out-of-order stages 31-33 (RS schd, RS Disp, Exec / WB) are drawn for four cases: single-cycle instruction pipeline; multi-cycle instruction pipeline (Exec2 … Exec n); non-blocking memory pipeline (AGU, DCache1, DCache2; stages 42-43); blocking memory pipeline (AGU, MOB blk, MOB wr, MOB wakeup, MOB disp, DCache1, DCache2; stages 40-43). RS, ROB, and MOB scheduling delays separate the segments. Writeback stages: 81: Mem/FP WB, 82: Int WB, 83: Data WB, up to the retirement in-order boundary. Retirement stages 91-93 (labels: Ret ptr wr, Ret ROB rd, RRF wr).]

P6 Microarchitecture [Figure: block diagram inside the chip boundary. Instruction Fetch Cluster: instruction fetch unit, BTB/BAC. Issue Cluster: instruction decoders, microcode sequencer, register alias table, allocator. Out-of-order Cluster: reservation station, ROB & retire RF. Memory Cluster: data cache unit (L1), memory order buffer. Bus Cluster: bus interface unit, connected to the external bus. Execution units shown: IEU/JEU (two), FEU, MMX, AGU, MIU. The legend distinguishes control flow from (restricted) data flow.]

Instruction Fetching Unit • IFU1: Initiate fetch, requesting 16 bytes at a time • IFU2: Instruction length decoder (ILD) marks instruction boundaries; BTB makes prediction • IFU3: Align instructions to the 3 decoders in 4-1-1 format [Figure: IFU datapath. Blocks: next PC mux (with a select mux for other fetch requests), instruction TLB (linear address in, physical address out), instruction cache, victim cache, streaming buffer, branch target buffer (prediction marks), ILD (length marks), instruction buffer, instruction rotator (advanced by the number of bytes consumed by ID).]

Dynamic Branch Prediction • Similar to a 2-level PAs design • Associated with each BTB entry • With a 16-entry Return Stack Buffer • 4 branch predictions per cycle (due to 16-byte fetch per cycle) • 512-entry BTB (ways W0-W3) • Static prediction provided by the Branch Address Calculator when the BTB misses (see the Static Branch Prediction slide) [Figure: a 4-bit Branch History Register (BHR) indexes the Pattern History Tables (PHT, entries 0000-1111) of 2-bit saturating counters to produce the prediction. The new history is updated speculatively; Rc, the branch result, updates the counter.]
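
The two-level scheme on this slide can be illustrated with a minimal Python sketch: a per-branch history register selects one 2-bit saturating counter, and the counter's upper half means "predict taken". This is an illustration under stated assumptions, not the P6 implementation: the class name `TwoLevelPredictor`, the dictionary keyed by branch address (standing in for BTB entries), the initial counter value, and the absence of speculative history update are all choices made here.

```python
class TwoLevelPredictor:
    """Per-branch history register indexing a table of 2-bit saturating counters."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.mask = (1 << history_bits) - 1
        self.bhr = {}  # branch address -> history register (one per "BTB entry")
        self.pht = {}  # branch address -> list of 2-bit counters, one per history value

    def _entry(self, pc):
        if pc not in self.bhr:
            self.bhr[pc] = 0
            # Assumption: counters start at 2 (weakly taken).
            self.pht[pc] = [2] * (1 << self.history_bits)
        return self.bhr[pc], self.pht[pc]

    def predict(self, pc):
        history, counters = self._entry(pc)
        return counters[history] >= 2  # 2 or 3 means "taken"

    def update(self, pc, taken):
        history, counters = self._entry(pc)
        c = counters[history]
        counters[history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the actual outcome into the history (no speculative update here).
        self.bhr[pc] = ((history << 1) | int(taken)) & self.mask


def run_pattern(pattern, repeats, pc=0x400):
    """Feed a repeating taken/not-taken pattern; return per-branch correctness."""
    predictor = TwoLevelPredictor()
    results = []
    for _ in range(repeats):
        for taken in pattern:
            results.append(predictor.predict(pc) == taken)
            predictor.update(pc, taken)
    return results
```

For a repeating taken-taken-not-taken loop branch, `run_pattern([True, True, False], 20)` mispredicts only during warm-up, because each 4-bit history value ends up mapping to a single outcome. A lone 2-bit counter without history would keep mispredicting the not-taken iteration.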

Static Branch Prediction [Figure: decision flowchart] • BTB miss? No: use the BTB's decision • BTB miss? Yes: PC-relative? • Not PC-relative: return? Yes: Taken; No (indirect jump): Taken • PC-relative: conditional? No (unconditional PC-relative): Taken • Conditional: backwards? Yes: Taken; No: Not Taken
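
One way to read the flowchart on this slide is as a small decision function. The sketch below is an interpretation of the flattened figure, so the exact branch order is an assumption; the function and parameter names are hypothetical.

```python
def static_predict(pc_relative, conditional, backward):
    """Prediction used on a BTB miss. Returns True for 'taken'."""
    if not pc_relative:
        # Returns and indirect jumps: both leaves of the flowchart say "taken".
        return True
    if not conditional:
        # Unconditional PC-relative branch.
        return True
    # Conditional PC-relative branch: backward (loop-like) branches are
    # predicted taken, forward branches not taken.
    return backward


def predict(btb_prediction, pc_relative, conditional, backward):
    """btb_prediction is None on a BTB miss, else the BTB's taken/not-taken decision."""
    if btb_prediction is not None:
        return btb_prediction
    return static_predict(pc_relative, conditional, backward)
```

The only not-taken outcome on a BTB miss is a forward conditional branch, which matches the single "Not Taken" leaf in the figure.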

x86 Instruction Decode • 4-1-1 decoder fed from IFU3: one complex decoder (1-4 μops) and two simple decoders (1 μop each) • Decode rate depends on instruction alignment • DEC1: translate x86 instructions into micro-operations (μops) • DEC2: move decoded μops to the instruction decoder (ID) queue (6 μops) • The micro-instruction sequencer (MS) performs translation in one of two ways • Generate the entire μop sequence from the microcode ROM • Receive 4 μops from the complex decoder, and the rest from the microcode ROM

Allocator • The interface between the in-order and out-of-order pipelines • Allocates • "3-or-none" μops per cycle into RS, ROB • "all-or-none" in MOB (LB and SB) • Generates the physical destination Pdst from the ROB and passes it to the Register Alias Table (RAT) • Stalls upon shortage of resources

Register Alias Table (RAT) • Register renaming for 8 integer registers, 8 floating-point (stack) registers, and flags: 3 μops per cycle • 40 80-bit physical registers embedded in the ROB (hence 6 bits to specify a PSrc) • RAT looks up physical ROB locations for renamed sources based on the RRF bit [Figure: renaming example with RAT entries (PSrc, RRF bit): EAX = (25, 0), EBX = (2, 0), ECX = (RRF copy of ECX, 1), EDX = (15, 0); entries with RRF = 0 point into the ROB, entries with RRF = 1 point into the RRF. RAT datapath: logical sources from the in-order queue index the integer RAT array and FP RAT array (with FP TOS adjust); int and FP overrides produce the physical sources (PSrc); physical ROB pointers come from the allocator.]
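
The RAT lookup described on this slide can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the P6 logic: the class and method names are hypothetical, one μop is renamed at a time (the slide says 3 per cycle), partial registers and flags are ignored, and a full ROB is not checked because that is the allocator's job.

```python
ROB_SIZE = 40  # 40 physical registers embedded in the ROB, per the slide


class RegisterAliasTable:
    def __init__(self, logical_regs):
        # RRF bit = 1: the current value lives in the retirement register file.
        # RRF bit = 0: the value will be produced into ROB entry `psrc`.
        self.table = {r: {"rrf": 1, "psrc": None} for r in logical_regs}
        self.next_rob = 0  # the ROB is allocated sequentially

    def rename(self, dest, srcs):
        """Rename one uop; returns (pdst, list of renamed sources)."""
        sources = []
        for s in srcs:
            entry = self.table[s]
            sources.append(("RRF", s) if entry["rrf"] else ("ROB", entry["psrc"]))
        pdst = self.next_rob
        self.next_rob = (self.next_rob + 1) % ROB_SIZE
        self.table[dest] = {"rrf": 0, "psrc": pdst}
        return pdst, sources

    def retire(self, dest, pdst):
        """On retirement, point back at the RRF only if no younger uop renamed dest."""
        entry = self.table[dest]
        if entry["rrf"] == 0 and entry["psrc"] == pdst:
            entry["rrf"] = 1
            entry["psrc"] = None
```

Sources are looked up before the destination is updated, so a μop such as `ADD EAX, EAX` reads the previous mapping of EAX and then installs its own.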

Partial Register Width Renaming • RAT entry fields: RRF (1 bit), PSrc (6 bits), Size (2 bits) • INT low bank (32b/16b/L): 8 entries • INT high bank (H): 4 entries • 32/16-bit accesses: • Read from low bank • Write to both banks • 8-bit RAT accesses: depending on which bank is being written • Example: μop0: MOV AL = (a) • μop1: MOV AH = (b) • μop2: ADD AL = (c) • μop3: ADD AH = (d) [Figure: RAT datapath as on the previous slide: logical sources from the in-order queue, integer RAT array, FP RAT array with FP TOS adjust, int and FP overrides, physical sources out, physical ROB pointers from the allocator.]

Partial Stalls due to RAT • Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of the larger (e.g. 32-bit) register • Example: MOVB AL, m8 / ADD EAX, m32 ; stall (AL is written, then EAX is read) • Idiom fix (1): XOR EAX, EAX / MOVB AL, m8 / ADD EAX, m32 ; no stall • Idiom fix (2): SUB EAX, EAX / MOVB AL, m8 / ADD EAX, m32 ; no stall • Partial flag stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction writes • Example (1): CMP EAX, EBX / INC ECX / JBE XX ; stall (JBE reads both ZF and CF, while INC affects only ZF, OF, SF, AF, PF) • Example (2): TEST EBX, EBX / LAHF ; stall (LAHF loads the low byte of EFLAGS)

Reservation Stations • Gateway to execution: binds up to 5 μops per cycle to the 5 ports • 20-entry μop buffer bridging the in-order and out-of-order engines • RS fields include μop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc. • Oldest-first FIFO scheduling when multiple μops are ready in the same cycle [Figure: RS dispatch ports. Ports 0 and 1 feed the execution units (IEU0, IEU1, JEU, Fadd, Fmul, Imul, Div, Pfadd, Pfmul, Pfshuf) and return results on WB bus 0 and WB bus 1. Port 2: load address (AGU0, LDA). Port 3: store address (AGU1, STA). Port 4: store data (STD). Ports 2-4 connect to the MOB and DCU; loaded data returns to the RS, and retired data moves from the ROB to the RRF.]

ReOrder Buffer • A 40-entry circular buffer • Similar to that described in [SmithPleszkun85] • 157 bits wide • Provides 40 alias physical registers • Out-of-order completion • Deposits exceptions in each entry • Retirement (or de-allocation) • After resolving prior speculation • Handles exceptions through the MS • Clears OOO state when a mispredicted branch or exception is detected • 3 μops per cycle in program order • For multi-μop x86 instructions: none or all (atomic) [Figure: the ROB and its connections to ALLOC, RAT, RS, RRF, and MS; exceptions and code assists are routed to the MS.]
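
The combination of out-of-order completion and in-order retirement listed on this slide can be sketched as follows. This is a minimal model under assumptions, not the P6 design: the class name is hypothetical, a `deque` stands in for the circular buffer's head and tail pointers, and atomic retirement of multi-μop instructions and exception handling are not modeled.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size=40, retire_width=3):
        self.size = size                  # 40 entries on P6, per the slide
        self.retire_width = retire_width  # up to 3 uops retire per cycle
        self.entries = deque()            # oldest entry at the left

    def allocate(self, name):
        """Allocate in program order; None means the allocator must stall."""
        if len(self.entries) >= self.size:
            return None
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def retire(self):
        """Retire completed uops from the head, in order, up to retire_width."""
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < self.retire_width):
            retired.append(self.entries.popleft()["name"])
        return retired
```

A younger μop that finishes early (its `done` flag set by execution) still waits in the buffer until every older μop has retired, which is what keeps architectural state precise.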

Memory Execution Cluster • Manages data memory accesses • Address translation • Detects violations of access ordering • Fill buffers (FB) in the DCU (similar to MSHR [Kroft'81]) handle cache misses (non-blocking) [Figure: memory cluster blocks. LD, STA, and STD μops arrive from the RS / ROB; blocks shown are the load buffer, store buffer, DTLB, DCU with fill buffers (FB), and EBL.]

Memory Order Buffer (MOB) • Allocated by ALLOC • A second-order RS for memory operations • 1 μop for a load; 2 μops for a store: Store Address (STA) and Store Data (STD) • MOB • 16-entry load buffer (LB) • 12-entry store address buffer (SAB) • SAB works in unison with • Store data buffer (SDB) in MIU • Physical Address Buffer (PAB) in DCU • Store Buffer (SB): SAB + SDB + PAB • Senior stores • Once STD/STA have retired from the ROB • SB marks the store "senior" • Senior stores are committed to memory in program order when the bus is idle or the SB is full • Prefetch instructions in P-III • Senior load behavior • Because they have no explicit architectural destination

Store Coloring • ALLOC assigns a Store Buffer ID (SBID) to each store in program order • ALLOC tags each load with the most recent SBID • Loads are checked against stores with equal or older SBIDs (stores earlier in program order) for potential address conflicts • SDB forwards data if a conflict is detected
x86 instruction      | μops         | store color
mov (0x1220), ebx    | std (ebx)    | 2
                     | sta 0x1220   | 2
mov (0x1110), eax    | std (eax)    | 3
                     | sta 0x1100   | 3
mov ecx, (0x1220)    | ld           | 3
mov edx, (0x1280)    | ld           | 3
mov (0x1400), edx    | std (edx)    | 4
                     | sta 0x1400   | 4
mov edx, (0x1380)    | ld           | 4
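
The coloring rule on this slide can be reproduced with a short Python sketch. It is illustrative only: the function names and the `(kind, address)` tuple encoding are hypothetical, SBIDs here are unbounded integers (real SBIDs index a small store buffer and wrap around), and a conflict is taken to mean an exact address match.

```python
def color_uops(program, first_sbid=2):
    """program: list of ("store", addr) or ("load", addr) in program order.

    Returns (kind, addr, color): each store gets the next SBID, each load is
    tagged with the most recent SBID. first_sbid=2 mirrors the slide's example.
    """
    colored = []
    sbid = first_sbid - 1
    for kind, addr in program:
        if kind == "store":
            sbid += 1
        colored.append((kind, addr, sbid))
    return colored


def conflicting_store(colored, load_index):
    """SBID of the youngest store that is not younger than the load and
    writes the load's address, or None if there is no conflict."""
    kind, addr, color = colored[load_index]
    if kind != "load":
        raise ValueError("not a load")
    match = None
    for k, a, c in colored:
        if k == "store" and c <= color and a == addr:
            if match is None or c > match:
                match = c
    return match


slide_program = [
    ("store", 0x1220), ("store", 0x1110), ("load", 0x1220),
    ("load", 0x1280), ("store", 0x1400), ("load", 0x1380),
]
```

Running `color_uops(slide_program)` yields colors 2, 3, 3, 3, 4, 4, the same sequence as the slide's table. The load from `0x1220` conflicts with the store of color 2, so its data would be forwarded; the other two loads have no conflict.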

Memory Type Range Registers (MTRR) • Control registers written by the system (OS) • Supported memory types • UnCacheable (UC) • Uncacheable Speculative Write-Combining (USWC or WC) • Uses a fill buffer entry as the WC buffer • Write-Back (WB) • Write-Through (WT) • Write-Protected (WP) • E.g. supports copy-on-write in UNIX: memory is saved by letting child processes share pages with their parents, and new pages are created only when a child process attempts to write • Page Miss Handler (PMH) • Looks up the MTRRs while supplying physical addresses • Returns memory types and physical addresses to the DTLB

Intel NetBurst Microarchitecture • Pentium 4's microarchitecture, a new post-P6 generation • Original target market: graphics workstations, but … the major competitor screwed up … • Design goals: • Performance, performance, performance, … • Unprecedented multimedia/floating-point performance • Streaming SIMD Extensions 2 (SSE2) • Reduced CPI • Low-latency instructions • High-bandwidth instruction fetching • Rapid execution of arithmetic & logic operations • Reduced clock period • New pipeline designed for scalability

Innovations Beyond P6 • Hyperpipelined technology • Streaming SIMD Extensions 2 • Enhanced branch predictor • Execution trace cache • Rapid execution engine • Advanced Transfer Cache • Hyper-Threading Technology (in Xeon and Xeon MP)

Pentium 4 Fact Sheet • IA-32 fully backward compatible • Available at speeds ranging from 1.3 to ~3 GHz • Hyperpipelined (20+ stages) • 42+ million transistors • 0.18 μm for 1.7 to 1.9 GHz; 0.13 μm for 1.8 to 2.8 GHz • Die size of 217 mm² • Consumes 55 watts of power at 1.5 GHz • 400 MHz (850 chipset) and 533 MHz (850E chipset) system bus • 512KB or 256KB 8-way full-speed on-die L2 Advanced Transfer Cache (up to 89.6 GB/s @ 2.8 GHz to L1) • 1MB or 512KB L3 cache (in Xeon MP) • 144 new 128-bit SIMD instructions (SSE2) • Hyper-Threading Technology (only enabled in Xeon and Xeon MP)
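
The bandwidth figures quoted in this deck follow from bus width times transfer rate. The short check below is a sketch with a hypothetical helper name; the 256-bit L2-to-L1 path and the 64-bit quad-pumped system bus are taken from the Pentium 4 microarchitecture diagram slide.

```python
def bandwidth_gb_per_s(bus_bits, transfers_per_second):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for a bus of the given width."""
    return bus_bits / 8 * transfers_per_second / 1e9


# L2 Advanced Transfer Cache: 256 bits per core clock.
l2_at_2_8_ghz = bandwidth_gb_per_s(256, 2.8e9)   # about 89.6 GB/s
l2_at_1_5_ghz = bandwidth_gb_per_s(256, 1.5e9)   # about 48 GB/s

# System bus: 64 bits wide, 400M or 533M transfers per second.
fsb_400 = bandwidth_gb_per_s(64, 400e6)          # about 3.2 GB/s
fsb_533 = bandwidth_gb_per_s(64, 533e6)          # about 4.3 GB/s
```

The gap between the on-die L2 path and the system bus is more than an order of magnitude, which is one reason the deck highlights the L2 cache as a headline feature.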

Recent Intel IA-32 Processors

Building Blocks of NetBurst [Figure: block diagram. Memory subsystem: system bus, bus unit, level 2 cache, L1 data cache. Front-end: fetch/decode, execution trace cache (ETC), μROM, BTB / branch prediction. Out-of-order engine: OOO logic, retire (with branch history update back to the predictor). Execution units: INT and FP execution units.]

Pentium 4 Microarchitecture [Figure: block diagram. Front end: BTB (4K entries), I-TLB/prefetcher, IA-32 decoder, microcode ROM, trace cache BTB (512 entries), execution trace cache, μop queue. Allocator / register renamer feeding the INT / FP μop queue and the memory μop queue. Schedulers: memory, fast, slow/general FP, simple FP. INT register file / bypass network feeding AGU (load address), AGU (store address), two 2x ALUs (simple instructions), and a slow ALU (complex instructions). FP RF / bypass network feeding FP/MMX/SSE/SSE2 and FP move units. L1 data cache: 8KB, 4-way, 64-byte line, WT, 1 read + 1 write port. Unified L2 cache: 256KB, 8-way, 128B line, WB, 256-bit path, 48 GB/s @ 1.5 GHz. BIU to a 64-bit system bus, quad-pumped, 400/533 MHz, 3.2/4.3 GB/s.]

Pipeline Depth Evolution • P5 microarchitecture: PREF, DEC, DEC, EXEC, WB • P6 microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2 • NetBurst microarchitecture: TC NextIP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive

Execution Trace Cache • Primary first-level I-cache, replacing the conventional L1 • Decoding several x86 instructions per cycle at high frequency is difficult and takes several pipeline stages • Branch misprediction penalty is horrible • 20 pipeline stages lost vs. 10 stages in P6 • Advantages • Caches post-decode μops • High-bandwidth instruction fetching • Eliminates x86 decoding overheads • Reduces branch recovery time if the TC hits • Holds up to 12,000 μops • 6 μops per trace line • Many (?) trace lines in a single trace

Execution Trace Cache • Delivers 3 μops per cycle to the OOO engine • x86 instructions are read from L2 when the TC misses (7+ cycle latency) • TC hit rate is comparable to that of an 8KB to 16KB conventional I-cache • Simplified x86 decoder • Only one complex instruction per cycle • Instructions of more than 4 μops are handled by the microcode ROM (P6's MS) • Performs branch prediction in the TC • 512-entry BTB + 16-entry RAS • Together with the BP in the x86 IFU, reduces mispredictions by 1/3 compared to P6 • Intel did not disclose the details of the BP algorithms used in the TC and x86 IFU (dynamic + static)

Out-Of-Order Engine • Similar design philosophy to P6; uses • Allocator • Register Alias Table • 128 physical registers • 126-entry ReOrder Buffer • 48-entry load buffer • 24-entry store buffer

Register Renaming Schemes [Figure: two panels. P6 register renaming: a RAT (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) points into a 40-entry ROB, allocated sequentially, that holds both data and status, or into the RRF. NetBurst register renaming: a front-end RAT and a retirement RAT (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) point into a separate 128-entry register file (RF) holding data, while the 126-entry ROB, allocated sequentially, holds status.]

Micro-op Scheduling • μop FIFO queues • Memory queue for loads and stores • Non-memory queue • μop schedulers • Several schedulers fire instructions to execution (P6's RS) • 4 distinct dispatch ports • Maximum dispatch: 6 μops per cycle (2 per cycle from the fast ALU on each of ports 0 and 1; 1 each from the load and store ports)
Port        | Unit                 | Operations
Exec Port 0 | Fast ALU (2x pumped) | Add/sub, Logic, Store Data, Branches
Exec Port 0 | FP Move              | FP/SSE Move, FP/SSE Store, FXCH
Exec Port 1 | Fast ALU (2x pumped) | Add/sub
Exec Port 1 | INT Exec             | Shift, Rotate
Exec Port 1 | FP Exec              | FP/SSE Add, FP/SSE Mul, FP/SSE Div, MMX
Load Port   | Memory Load          | Loads, LEA, Prefetch
Store Port  | Memory Store         | Stores

Data Memory Accesses • 8KB 4-way L1 + 256KB 8-way L2 (with a HW prefetcher) • Load-to-use speculation • Dependent instructions are dispatched before the load finishes • Due to the high frequency and deep pipeline • Scheduler assumes loads always hit L1 • On an L1 miss, dependent instructions that have already left the scheduler temporarily receive incorrect data (mis-speculation) • Replay logic: re-execute the load when mis-speculated • Independent instructions are allowed to proceed • Up to 4 outstanding load misses (= 4 fill buffers in the original P6) • Store-to-load forwarding buffer • 24 entries • Forwarding requires that the store and load have the same starting physical address • And that load data size <= store data size
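
The two store-to-load forwarding conditions listed on this slide translate directly into a predicate. The sketch below is an illustration with hypothetical names; it searches a list of pending stores youngest-first, which is an assumption about lookup order rather than something the slide states.

```python
def can_forward(store_addr, store_size, load_addr, load_size):
    """Both slide conditions: same starting address, load no wider than store."""
    return store_addr == load_addr and load_size <= store_size


def find_forwarding_store(pending_stores, load_addr, load_size):
    """pending_stores: list of (addr, size, data), oldest first.

    Returns the data of the youngest store that satisfies can_forward,
    or None if the load cannot be forwarded under these two conditions.
    """
    for addr, size, data in reversed(pending_stores):
        if can_forward(addr, size, load_addr, load_size):
            return data
    return None
```

Under this rule a 2-byte load at `0x1002` does not forward from a 4-byte store at `0x1000`, even though the store covers those bytes, because the starting addresses differ.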

Streaming SIMD Extensions 2 • P-III SSE (Katmai New Instructions: KNI) • Eight 128-bit wide xmm registers (new architectural state) • Single-precision 128-bit SIMD FP • Four 32-bit FP operations in one instruction • Broken down into 2 μops for execution (only 80-bit data in ROB) • 64-bit SIMD MMX (uses 8 mm registers, mapped onto the FP stack) • Prefetch (nta, t0, t1, t2) and sfence • P4 SSE2 (Willamette New Instructions: WNI) • Supports double-precision 128-bit SIMD FP • Two 64-bit FP operations in one instruction • Throughput: 2 cycles for most SSE2 operations (exceptions: DIVPD and SQRTPD take 69 cycles, non-pipelined) • Enhanced 128-bit SIMD MMX using xmm registers

Examples of Using SSE (xmm1 = X3 X2 X1 X0, xmm2 = Y3 Y2 Y1 Y0, listed from high element to low) • Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0 • Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = X3, X2, X1, X0 op Y0 • Shuffle FP operation with 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8): each of the two high result elements is selected from Y3 .. Y0, each of the two low result elements from X3 .. X0 • Example: SHUFPS xmm1, xmm2, 0xf1 gives xmm1 = Y3, Y3, X0, X1

Examples of Using SSE and SSE2 (elements listed from high to low) • SSE (xmm1 = X3 X2 X1 X0, xmm2 = Y3 Y2 Y1 Y0) • Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0 • Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = X3, X2, X1, X0 op Y0 • Shuffle FP operation with 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8): two high elements selected from Y3 .. Y0, two low elements from X3 .. X0; e.g. SHUFPS xmm1, xmm2, 0xf1 gives xmm1 = Y3, Y3, X0, X1 • SSE2 (xmm1 = X1 X0, xmm2 = Y1 Y0) • Packed DP FP operation (e.g. ADDPD xmm1, xmm2): xmm1 = X1 op Y1, X0 op Y0 • Scalar DP FP operation (e.g. ADDSD xmm1, xmm2): xmm1 = X1, X0 op Y0 • Shuffle DP operation with 2-bit immediate (e.g. SHUFPD xmm1, xmm2, imm2): xmm1 = (Y1 or Y0), (X1 or X0)
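
The packed, scalar, and shuffle operations on this slide can be modeled with plain Python lists. This is a behavioral sketch only: the function names mirror the instruction mnemonics but are ordinary Python functions, registers are lists indexed by element number (index 0 is the lowest element, so the slide's "X3 X2 X1 X0" is `[X0, X1, X2, X3]` here), and each function returns the new value of the destination xmm1.

```python
def addps(xmm1, xmm2):
    """Packed single precision: all four element pairs are added."""
    return [x + y for x, y in zip(xmm1, xmm2)]


def addss(xmm1, xmm2):
    """Scalar single precision: only element 0 is added, the rest pass through."""
    return [xmm1[0] + xmm2[0]] + list(xmm1[1:])


def shufps(xmm1, xmm2, imm8):
    """Each 2-bit field of imm8 selects an element: the two low results
    come from xmm1, the two high results from xmm2."""
    sel = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]
    return [xmm1[sel[0]], xmm1[sel[1]], xmm2[sel[2]], xmm2[sel[3]]]


def shufpd(xmm1, xmm2, imm2):
    """Bit 0 selects the low result from xmm1, bit 1 the high result from xmm2."""
    return [xmm1[imm2 & 1], xmm2[(imm2 >> 1) & 1]]
```

With `x = ["X0", "X1", "X2", "X3"]` and `y = ["Y0", "Y1", "Y2", "Y3"]`, `shufps(x, y, 0xF1)` returns `["X1", "X0", "Y3", "Y3"]`. Read from high element to low, that is Y3, Y3, X0, X1, the result shown on the slide.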

Hyper-Threading • In the Intel Xeon processor and Intel Xeon MP processor • Enables Simultaneous Multi-Threading (SMT) • Exploits ILP through TLP (Thread-Level Parallelism) • Issues and executes multiple threads in the same cycle • A single P4 Xeon appears as 2 logical processors • They share the same execution resources • Architectural state is duplicated in hardware

Multithreading (MT) Paradigms [Figure: issue-slot diagrams, with functional units FU1-FU4 across and execution time down, shaded by thread (Thread 1-5, or unused). Panels: Conventional Superscalar (Single Threaded); Fine-grained Multithreading (cycle-by-cycle interleaving); Coarse-grained Multithreading (block interleaving); Chip Multiprocessor (CMP); Simultaneous Multithreading.]

More SMT commercial processors • Intel Xeon Hyper-Threading • Supports 2 replicated hardware contexts: PC (or IP) and architectural registers • New directions of usage • Helper (or assisted) threads (e.g. speculative precomputation) • Speculative multithreading • Clearwater (once called Xtream Logic): 8-context SMT "network processor" designed by the DISC architect (company no longer exists) • Sun 4-SMT-processor CMP?

Speculative Multithreading • SMT can justify a wider-than-ILP datapath • But the datapath is only fully utilized by multiple threads • How to speed up a single-threaded program by utilizing multiple threads? • What to do with spare resources? • Execute both sides of hard-to-predict branches • Eager execution or polypath execution • Dynamic predication • Send another thread to scout ahead to warm up caches & BTB • Speculative precomputation • Early branch resolution • Speculatively execute future work • Multiscalar or dynamic multithreading • e.g. start several loop iterations concurrently as different threads; if a data dependence is detected, redo the work • Run a dynamic compiler/optimizer on the side • Dynamic verification • DIVA or Slipstream Processor
