
The Alpha 21264 Microprocessor: Out-of-Order Execution at 600 MHz


Presentation Transcript


  1. The Alpha 21264 Microprocessor: Out-of-Order Execution at 600 MHz
  R. E. Kessler, Compaq Computer Corporation, Shrewsbury, MA

  2. Some Highlights
  • Continued Alpha performance leadership
  • 600 MHz operation in 0.35 µm CMOS6, 6 metal layers, 2.2 V
  • 15 million transistors, 3.1 cm² die, 587-pin PGA
  • SPECint95 of 30+ and SPECfp95 of 50+
  • Out-of-order and speculative execution
  • 4-way integer issue
  • 2-way floating-point issue
  • Sophisticated tournament branch prediction
  • High-bandwidth memory system (1+ GB/sec)

  3. Alpha 21264: Block Diagram
  [Pipeline stages 0-6: Fetch, Map, Queue, Reg, Exec, Dcache. The 64 KB, 2-set L1 instruction cache delivers 4 instructions per cycle, guided by the branch predictors and the next-line address/set prediction. The integer register map feeds a 20-entry integer issue queue and two copies of the 80-entry integer register file, which drive the integer execution and address (load/store) pipes; the floating-point register map feeds a 15-entry FP issue queue, a 72-entry FP register file, and the FP add, divide/square-root, and multiply units. Loads and stores access the 64 KB, 2-set L1 data cache. The bus interface unit, with its victim buffer and miss-address logic, connects to the 128-bit L2 cache bus and the 64-bit system bus using 44-bit physical addresses. Up to 80 instructions can be in flight, plus 32 loads and 32 stores.]

  4. Alpha 21264: Block Diagram [same diagram repeated; see slide 3]

  5. 21264 Instruction Fetch Bandwidth Enablers
  • The 64 KB two-way associative instruction cache supplies four instructions every cycle
  • The next-fetch and set predictors provide the fast cache access time of a direct-mapped cache and eliminate bubbles in non-sequential control flows (a sketch of this prediction follows below)
  • The instruction fetcher speculates through up to 20 branch predictions to supply a continuous stream of instructions
  • The tournament branch predictor dynamically selects between local and global history to minimize mispredicts
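  To make the "direct-mapped speed from a two-way cache" idea concrete, here is a minimal C sketch of a next-fetch/set predictor stored alongside each fetch block. The structure names, table size, and field widths are illustrative assumptions, not the 21264's actual implementation.

```c
/* Minimal sketch of a next-fetch/set predictor for a 2-way I-cache.
 * Table size and field names are illustrative, not the 21264's real
 * implementation details. */
#include <stdint.h>

#define ICACHE_LINES 512            /* illustrative lines per way */

typedef struct {
    uint32_t next_fetch_addr;       /* predicted address of the next fetch block */
    uint8_t  next_set;              /* predicted way (0 or 1) for that fetch     */
} line_predictor_t;

static line_predictor_t line_pred[ICACHE_LINES];

/* Predict where to fetch next, before the current tag compare finishes. */
static void predict_next_fetch(uint32_t cur_index,
                               uint32_t *pred_addr, uint8_t *pred_set)
{
    *pred_addr = line_pred[cur_index].next_fetch_addr;
    *pred_set  = line_pred[cur_index].next_set;
}

/* Train on the resolved outcome: a wrong line or way prediction simply
 * rewrites the entry for the current fetch block. */
static void train_next_fetch(uint32_t cur_index,
                             uint32_t actual_addr, uint8_t actual_set)
{
    line_pred[cur_index].next_fetch_addr = actual_addr;
    line_pred[cur_index].next_set        = actual_set;
}
```

  The point of the design is that the predicted way can be read immediately, while the tag compares for both sets proceed in parallel only to verify the guess.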

  6. Instruction Stream Improvements
  [Fetch-stage diagram: PC generation feeds instruction decode and branch prediction; each Icache access reads the data, the tags for both sets (S0 and S1), and the stored next-fetch and set prediction, returning 4 instructions plus the predicted next address and set while the tag compares produce the hit/miss/set-miss outcome. Annotations: learns dynamic jumps, 0-cycle penalty on branches, set-associative behavior.]

  7. Fetch Stage: Branch Prediction
  • Some branch directions can be predicted based on their own past behavior: Local Correlation. For example, a branch such as if (a%2 == 0) may repeat the pattern TNTN, while one such as if (b == 0) may repeat NNNN.
  • Others can be predicted based on how the program arrived at the branch: Global Correlation. For example, if if (a == 0) and if (b == 0) were both just taken, then if (a == b) can also be predicted taken.

  8. Tournament Branch Prediction
  [Predictor structure: the program counter indexes a 1024-entry x 10-bit local history table, whose output indexes 1024 3-bit local prediction counters; a 12-bit global history register indexes 4096 2-bit global prediction counters and 4096 2-bit choice prediction counters, and the choice counters select between the local and global predictions.]
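  The table sizes on the slide are enough to sketch how the tournament predictor fits together. The C sketch below is a simplified model under those sizes (1024x10 local history, 1024 3-bit local counters, 4096 2-bit global and choice counters); the exact indexing, update timing, and choice-training rule are assumptions.

```c
/* Sketch of a 21264-style tournament predictor using the table sizes
 * on the slide.  Indexing and update details are simplified. */
#include <stdint.h>
#include <stdbool.h>

static uint16_t local_hist[1024];    /* 10 bits of per-branch history  */
static uint8_t  local_pred[1024];    /* 3-bit saturating counters      */
static uint8_t  global_pred[4096];   /* 2-bit saturating counters      */
static uint8_t  choice_pred[4096];   /* 2-bit: prefer global vs. local */
static uint16_t global_hist;         /* 12 bits of path history        */

static bool predict(uint32_t pc)
{
    uint16_t lh = local_hist[(pc >> 2) & 0x3ff] & 0x3ff;
    bool local_taken  = local_pred[lh] >= 4;             /* MSB of 3 bits */
    bool global_taken = global_pred[global_hist & 0xfff] >= 2;
    bool use_global   = choice_pred[global_hist & 0xfff] >= 2;
    return use_global ? global_taken : local_taken;
}

static void update(uint32_t pc, bool taken)
{
    uint16_t idx = (pc >> 2) & 0x3ff;
    uint16_t lh  = local_hist[idx] & 0x3ff;
    uint16_t gh  = global_hist & 0xfff;
    bool local_taken  = local_pred[lh] >= 4;
    bool global_taken = global_pred[gh] >= 2;

    /* Train the choice counter only when the two predictors disagree. */
    if (local_taken != global_taken) {
        if (global_taken == taken && choice_pred[gh] < 3) choice_pred[gh]++;
        if (local_taken  == taken && choice_pred[gh] > 0) choice_pred[gh]--;
    }
    /* Saturating-counter updates for both component predictors. */
    if (taken) {
        if (local_pred[lh] < 7)  local_pred[lh]++;
        if (global_pred[gh] < 3) global_pred[gh]++;
    } else {
        if (local_pred[lh] > 0)  local_pred[lh]--;
        if (global_pred[gh] > 0) global_pred[gh]--;
    }
    /* Shift the outcome into both history registers. */
    local_hist[idx] = (uint16_t)(((lh << 1) | taken) & 0x3ff);
    global_hist     = (uint16_t)(((global_hist << 1) | taken) & 0xfff);
}
```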

  9. Alpha 21264: Block Diagram [repeated as a roadmap; see slide 3]

  10. Mapper and Queue Stages
  • Mapper:
  • Renames 4 instructions per cycle (8 source / 4 destination registers)
  • 80 integer + 72 floating-point physical registers
  • Queue stage:
  • Integer: 20 entries / quad-issue
  • Floating point: 15 entries / dual-issue
  • Instructions are issued out-of-order when their data are ready
  • Prioritized from oldest to youngest each cycle
  • Instructions leave the queues after they issue
  • The queue collapses every cycle as needed
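  As a rough illustration of what the mapper does each cycle, here is a minimal C sketch of renaming one integer instruction against a free list of physical registers. The data structures are simplified assumptions; physical-register reclamation at retire, the saved map state used for misprediction recovery, and the intra-group dependence checks needed to rename 4 instructions at once are all omitted.

```c
/* Minimal register-rename sketch: 31 architectural integer registers
 * (R31 reads as zero) mapped onto 80 physical registers, per the slide. */
#include <stdint.h>

#define ARCH_REGS 32
#define PHYS_REGS 80

static uint8_t map_table[ARCH_REGS]; /* arch -> phys mapping */
static uint8_t free_list[PHYS_REGS];
static int     free_count;

static void rename_init(void)
{
    for (int r = 0; r < ARCH_REGS; r++)
        map_table[r] = (uint8_t)r;             /* identity mapping at reset */
    free_count = 0;
    for (int p = ARCH_REGS; p < PHYS_REGS; p++)
        free_list[free_count++] = (uint8_t)p;  /* remaining regs are free   */
}

/* Rename one instruction: look up the two sources, then allocate a fresh
 * physical register for the destination.  The real mapper does four of
 * these per cycle. */
static void rename(uint8_t src1, uint8_t src2, uint8_t dest,
                   uint8_t *psrc1, uint8_t *psrc2, uint8_t *pdest)
{
    *psrc1 = map_table[src1];
    *psrc2 = map_table[src2];
    if (dest != 31 && free_count > 0) {        /* R31 is never renamed */
        *pdest = free_list[--free_count];
        map_table[dest] = *pdest;
    } else {
        *pdest = map_table[dest];
    }
}
```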

  11. Map and Queue Stages
  [Diagram: the mapper's 80-entry map CAMs, together with saved map state, translate register numbers into renamed registers for 4 instructions per cycle; the renamed instructions enter the 20-entry issue queue, where a register scoreboard and a request/grant arbiter select the issued instructions.]

  12. Alpha 21264: Block Diagram [repeated as a roadmap; see slide 3]

  13. Register and Execute Stages
  [Diagram: two integer execution clusters (cluster 0 and cluster 1), each with its own copy of the register file; each cluster has an upper pipe with add/logic and shift/branch units (one upper pipe also holds the multiplier, the other the motion-video unit) and a lower pipe with add/logic/memory units. The floating-point execution units have their own register file with FP add, FP multiply, and FP divide/square-root units.]

  14. Integer Cross-Cluster Instruction Scheduling and Execution
  • Instructions are statically pre-slotted to the upper or lower execution pipes
  • The issue queue dynamically selects between the left and right clusters
  • This has most of the performance of 4-way issue with the simplicity of 2-way issue
  [Diagram: the two integer clusters, each with its own register file, an upper pipe (add/logic, shift/branch, multiply or motion-video) and a lower pipe (add/logic/memory).]

  15. Alpha 21264: Block Diagram [repeated as a roadmap; see slide 3]

  16. 21264 On-Chip Memory System Features
  • Two loads/stores per cycle, in any combination
  • 64 KB two-way associative L1 data cache (9.6 GB/sec)
  • Phase-pipelined at > 1 GHz (no bank conflicts!)
  • 3-cycle latency (load issue to issue of the consumer) with hit prediction
  • Out-of-order and speculative execution minimizes effective memory latency
  • 32 outstanding loads and 32 outstanding stores maximize memory system parallelism
  • Speculative stores forward data to subsequent loads

  17. Memory System (Continued)
  • New references check their addresses against existing in-flight references
  • Speculative store data bypasses to loads with matching addresses
  • Multiple misses to the same cache block merge
  • Hazard detection logic manages the out-of-order references
  [Diagram: address and data paths between the cluster 0 and cluster 1 memory units, the double-pumped (GHz) data cache, the speculative store data buffer, and the bus interface; 64- and 128-bit data busses carry fill and victim data, with the L2 cache and system ports transferring once per 1.5 CPU cycles.]

  18. Low-latency Speculative Issue of Integer Load Data Consumers (Predict Hit)
  Example sequence: LDQ R0, 0(R0); ADDQ R0, R0, R1; OP R2, R2, R3. The ADDQ issues 3 cycles after the LDQ, before the hit/miss outcome is known; on a miss, both the ADDQ and the following OP are squashed and restarted.
  • When predicting a load hit:
  • The ADDQ issues (speculatively) after 3 cycles
  • Best performance if the load actually hits (matching the prediction)
  • The ADDQ issues before the load hit/miss calculation is known
  • If the LDQ misses when predicted to hit:
  • Squash two cycles (replay the ADDQ and its consumers)
  • Force a "mini-replay" (direct from the issue queue)

  19. Low-latency Speculative Issue of Integer Load Data Consumers (Predict Miss)
  Example sequence: LDQ R0, 0(R0); OP1 R2, R2, R3; OP2 R4, R4, R5; ADDQ R0, R0, R1. The ADDQ waits until the LDQ's hit/miss result is known, so the earliest it can issue is 5 cycles after the LDQ; independent ops (OP1, OP2) fill the gap.
  • When predicting a load miss:
  • The minimum load latency is 5 cycles (more on a real miss)
  • There are no squashes
  • Best performance if the load actually misses (as predicted)
  • The hit/miss predictor is the MSB of a 4-bit counter (hits increment by 1, misses decrement by 2)
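  The last bullet specifies the predictor closely enough to sketch it directly in C. The increment/decrement behavior is as stated on the slide; the reset value and the assumption that it is a single counter rather than a table are mine.

```c
/* Load hit/miss predictor per the slide: the MSB of a 4-bit saturating
 * counter; hits count up by 1, misses count down by 2. */
#include <stdbool.h>

static int hitmiss_counter = 15;     /* 4-bit counter, 0..15; initial value assumed */

static bool predict_hit(void)
{
    return (hitmiss_counter & 0x8) != 0;   /* MSB set => predict hit */
}

static void train_hitmiss(bool hit)
{
    if (hit)
        hitmiss_counter = (hitmiss_counter < 15) ? hitmiss_counter + 1 : 15;
    else
        hitmiss_counter = (hitmiss_counter > 1) ? hitmiss_counter - 2 : 0;
}
```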

  20. Dynamic Hazard Avoidance (Before Marking)
  Program order (assume R28 == R29):
  LDQ R0, 64(R28)
  STQ R0, 0(R28)
  LDQ R1, 0(R29)    <- store followed by a load to the same memory address!
  First execution order:
  LDQ R0, 64(R28)
  LDQ R1, 0(R29)    <- this (re-ordered) load got the wrong data value!
  ...
  STQ R0, 0(R28)

  21. Dynamic Hazard Avoidance (After Marking/Training)
  Program order (assume R28 == R29):
  LDQ R0, 64(R28)
  STQ R0, 0(R28)
  LDQ R1, 0(R29)    <- store followed by a load to the same memory address; this load is marked this time
  Subsequent executions (after marking):
  LDQ R0, 64(R28)
  ...
  STQ R0, 0(R28)
  LDQ R1, 0(R29)    <- the marked (delayed) load gets the bypassed store data!
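  A minimal C sketch of the marking mechanism described above: a PC-indexed table of "wait" bits that is set when a load is caught issuing ahead of an older store to the same address, and is consulted at issue time thereafter. The table size, indexing, and function names are illustrative assumptions.

```c
/* Sketch of the load-marking ("store-wait") training on the slide. */
#include <stdint.h>
#include <stdbool.h>

#define WAIT_TABLE_SIZE 1024         /* illustrative size */

static bool store_wait[WAIT_TABLE_SIZE];

/* Called when the memory system detects that this load got stale data
 * because an older store to the same address had not yet executed. */
static void train_hazard(uint32_t load_pc)
{
    store_wait[(load_pc >> 2) % WAIT_TABLE_SIZE] = true;
}

/* Consulted at issue time: a marked load is held until earlier stores
 * have issued, so it can pick up the bypassed store data. */
static bool must_wait_for_stores(uint32_t load_pc)
{
    return store_wait[(load_pc >> 2) % WAIT_TABLE_SIZE];
}
```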

  22. New Memory Prefetches
  • Software-directed prefetches:
  • Prefetch
  • Prefetch with modify intent
  • Prefetch, evict next
  • Evict block (eject data from the cache)
  • Write hint: allocate a block with no data read, useful for full-cache-block writes
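  As a rough illustration of how software would use such prefetches, here is a sketch in C using GCC's __builtin_prefetch as a portable stand-in; the actual Alpha prefetch and write-hint instruction encodings are not shown on the slide, and the prefetch distance of 64 elements is an arbitrary choice.

```c
/* Software-directed prefetching sketch: prefetch source data for reading
 * and destination data with modify intent, well ahead of use. */
#include <stddef.h>

void scale(double *dst, const double *src, double k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n) {
            __builtin_prefetch(&src[i + 64], 0);  /* 0 = read           */
            __builtin_prefetch(&dst[i + 64], 1);  /* 1 = modify intent  */
        }
        dst[i] = k * src[i];
    }
}
```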

  23. 21264 Off-Chip Memory System Features
  • 8 outstanding block fills + 8 victims
  • Split L2 cache and system busses (back-side cache)
  • High-speed, high-bandwidth point-to-point channels: clock-forwarding technology, low pin counts
  • L2 hit load latency (load issue to consumer issue) = 6 cycles + SRAM latency
  • Max L2 cache bandwidth of 16 bytes per 1.5 cycles: 6.4 GB/sec with a 400 MHz transfer rate
  • L2 miss load latency can be 160 ns (60 ns DRAM)
  • Max system bandwidth of 8 bytes per 1.5 cycles: 3.2 GB/sec with a 400 MHz transfer rate
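  The bandwidth figures follow from the 600 MHz core clock: one transfer every 1.5 CPU cycles is a 400 MHz transfer rate, so 16 bytes per transfer gives 6.4 GB/sec on the L2 port and 8 bytes gives 3.2 GB/sec on the system port. A trivial C check of that arithmetic:

```c
/* Working out the peak-bandwidth figures above from the 600 MHz clock. */
#include <stdio.h>

int main(void)
{
    double transfer_rate = 600e6 / 1.5;        /* one beat per 1.5 cycles = 400 MHz */
    double l2_bw  = 16 * transfer_rate / 1e9;  /* 16 bytes per beat on the L2 port  */
    double sys_bw =  8 * transfer_rate / 1e9;  /*  8 bytes per beat on the sys port */
    printf("L2: %.1f GB/s, system: %.1f GB/s\n", l2_bw, sys_bw);  /* 6.4 and 3.2 */
    return 0;
}
```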

  24. 21264 Pin Bus
  [Diagram: the 21264 has an L2 cache port (tag plus 128-bit data connections to the L2 cache TAG and data RAMs) and a separate system pin bus (address-out, address-in, and 64-bit data connections to the system chipset with DRAM and I/O); the independent data busses allow simultaneous data transfers. L2 cache: 0-16 MB, synchronous, direct-mapped.]
  Peak L2 bandwidth examples:
  • Example 1: register-register 133 MHz burst RAM, 2.1 GB/sec
  • Example 2: 'RISC' 200+ MHz late-write RAM, 3.2+ GB/sec
  • Example 3: dual-data 400+ MHz RAM, 6.4+ GB/sec

  25. Compaq Alpha / AMD Shared System Pin Bus (External Interface)
  [Diagram: the AMD K7 uses the same system pin bus ("Slot A"): address-out, address-in, and 64-bit data connections to a system chipset with DRAM and I/O, plus its own cache port.]
  • High-performance system pin bus
  • Shared system chipset designs
  • This is a win-win!

  26. Dual 21264 System Example
  [Diagram: two 21264s, each with its own L2 cache, connect through a control chip and 8 data-switch chips (256-bit internal data paths, 64-bit links to each CPU) to two 100 MHz SDRAM memory banks and to two PCI bridge chips with 64-bit PCI busses.]

  27. Measured Performance: SPEC95

  28. Measured Performance: STREAMS

  29. Alpha 21264
  [Chip layout with labeled blocks: floating-point unit, bus interface unit (BIU), integer mapper, floating-point mapper and queue, integer units (left and right), integer queue, memory controllers, instruction cache, data cache, instruction/data path, and data and control busses.]

  30. Summary
  • The 21264 will maintain Alpha's performance lead
  • 30+ SPECint95 and 50+ SPECfp95
  • 1+ GB/sec memory bandwidth
  • The 21264 proves that high frequency and sophisticated architectural features can coexist:
  • High-bandwidth speculative instruction fetch
  • Out-of-order execution
  • 6-way instruction issue
  • Highly parallel out-of-order memory system
