Advanced Branch Prediction Techniques in Computer Architecture

CENG 450 Computer Systems and Architecture Lecture 11 Amirali Baniasadi amirali@ece.uvic.ca

This Lecture • Branch Prediction • Multiple Issue

Branch Prediction • Predicting the outcome of a branch • Direction: • Taken / Not Taken • Direction predictors • Target Address • PC+offset (Taken)/ PC+4 (Not Taken) • Target address predictors • Branch Target Buffer (BTB)
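
The split between direction and target prediction can be sketched as a tiny direct-mapped BTB. This is a minimal illustration, not a description of any real design: the class name, the 16-entry size, the full-PC tag, and the 4-byte instruction width are all assumptions.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: remembers the last taken target seen for a branch PC."""

    def __init__(self, entries=16):  # size is an arbitrary assumption
        self.entries = entries
        self.tags = [None] * entries
        self.targets = [0] * entries

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low two PC bits carry no information.
        return (pc >> 2) % self.entries

    def predict_target(self, pc):
        """Return the stored target, or None on a BTB miss."""
        i = self._index(pc)
        return self.targets[i] if self.tags[i] == pc else None

    def update(self, pc, target):
        """Record the resolved target of a taken branch."""
        i = self._index(pc)
        self.tags[i] = pc
        self.targets[i] = target


def next_fetch_pc(pc, predicted_taken, btb):
    """Taken with a known target: fetch from the target. Otherwise fall through to PC+4."""
    target = btb.predict_target(pc)
    if predicted_taken and target is not None:
        return target
    return pc + 4
```

The direction predictor supplies `predicted_taken`; the BTB only answers "where to", which is why a taken prediction without a BTB hit still falls through in this sketch.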

Why do we need branch prediction? • Branch prediction • Increases the number of instructions available for the scheduler to issue. Increases instruction level parallelism (ILP) • Allows useful work to be completed while waiting for the branch to resolve

Branch Prediction Strategies • Static • Decided before runtime • Examples: • Always-Not Taken • Always-Taken • Backwards Taken, Forward Not Taken (BTFNT) • Profile-driven prediction • Dynamic • Prediction decisions may change during the execution of the program
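
As a rough sketch of one static strategy, BTFNT needs only the branch PC and its target. The function names and the addresses below are hypothetical, chosen only for illustration.

```python
def predict_btfnt(pc, target):
    """Backward branch (target below the PC) is predicted taken; forward is not taken."""
    return target < pc


def loop_branch_accuracy(iterations):
    """Accuracy of BTFNT on a loop-closing backward branch.

    The branch is taken on every iteration except the last one.
    """
    pc, target = 0x120, 0x100  # hypothetical addresses with target < pc
    outcomes = [True] * (iterations - 1) + [False]
    correct = sum(predict_btfnt(pc, target) == taken for taken in outcomes)
    return correct / iterations
```

The intuition is that backward branches usually close loops, so the only miss on a simple loop is the final exit.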

What happens when a branch is predicted? • On misprediction: • No speculative state may commit • Squash instructions in the pipeline • Must not allow stores in the pipeline to occur • Cannot allow stores which would not have happened to commit • Even for good branch predictors more than half of the fetched instructions are squashed

A Generic Branch Predictor • Figure: Fetch uses f(PC, x) = T or NT to produce the predicted stream; Resolve compares it against the actual stream in execution order • What is f(PC, x)? • x can be any relevant information • Thus far x was empty

Bimodal Branch Predictors • Dynamically store information about the branch behaviour • Branches tend to behave in a fixed way • Branches tend to behave in the same way across program execution • Index a Pattern History Table using the branch address • 1 bit: branch behaves as it did last time • Saturating 2 bit counter: branch behaves as it usually does

Saturating-Counter Predictors • Consider a strongly biased branch with an infrequent outcome • TTTTTTTTNTTTTTTTTNTTTT • Last-outcome will mispredict twice per infrequent outcome encounter: • TTTTTTTTNTTTTTTTTNTTTT • Idea: Remember the most frequent case • Saturating counter: hysteresis • Often called a bimodal predictor • Captures temporal bias
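
The difference between last-outcome and a 2-bit saturating counter can be checked on the slide's own stream. This is a small sketch for a single branch; the initial states (predict taken, counter at 3) are assumptions.

```python
def last_outcome_mispredictions(stream, initial="T"):
    """1-bit scheme: predict whatever the branch did last time."""
    prediction = initial
    misses = 0
    for outcome in stream:
        if prediction != outcome:
            misses += 1
        prediction = outcome
    return misses


def two_bit_mispredictions(stream, counter=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for outcome in stream:
        predicted = "T" if counter >= 2 else "N"
        if predicted != outcome:
            misses += 1
        # Saturate at 0 and 3 so one odd outcome cannot flip a strong bias.
        counter = min(counter + 1, 3) if outcome == "T" else max(counter - 1, 0)
    return misses


STREAM = "TTTTTTTTNTTTTTTTTNTTTT"
print(last_outcome_mispredictions(STREAM))  # 4: two misses per N
print(two_bit_mispredictions(STREAM))       # 2: one miss per N
```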

Bimodal Prediction • Table of 2-bit saturating counters • Predict the most common direction • Advantages: simple, cheap, “good” accuracy • Bimodal will mispredict once per infrequent outcome encounter: TTTTTTTTNTTTTTTTTNTTTT
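
A table of such counters, indexed by branch address, might look like the sketch below. The table size, the "weakly taken" initial value, and the index function are assumptions made for illustration.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch PC."""

    def __init__(self, entries=1024):      # size is an assumption
        self.entries = entries
        self.counters = [2] * entries      # start at "weakly taken" (assumption)

    def _index(self, pc):
        return (pc >> 2) % self.entries    # assumes 4-byte instructions

    def predict(self, pc):
        """True means predict taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter toward the resolved outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Each branch with its own entry settles on its usual direction independently of the others, which is the "captures temporal bias" point from the previous slide.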

Correlating Predictors • From program perspective: • Different Branches may be correlated • if (aa == 2) aa = 0; • if (bb == 2) bb = 0; • if (aa != bb) then … • Can be viewed as a pattern detector • Instead of keeping aggregate history information • I.e., most frequent outcome • Keep exact history information • Pattern of n most recent outcomes • Example: • BHR: n most recent branch outcomes • Use PC and BHR (xor?) to access prediction table

Pattern-based Prediction • Nested loops: for i = 0 to N for j = 0 to 3 … • Branch Outcome Stream for j-for branch • 11101110111011101110 • Patterns: • 111 -> 0 • 110 -> 1 • 101 -> 1 • 011 -> 1 • 100% accuracy • Learning time 4 instances • Table Index (PC, 3-bit history)
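
The pattern table on this slide can be reproduced with a few lines. This sketch stores the last outcome seen after each 3-bit history (rather than a counter) and predicts taken for unseen histories; both choices are simplifying assumptions.

```python
def simulate_pattern_predictor(stream, history_bits=3):
    """Learn 'history -> next outcome' and report per-branch prediction correctness."""
    table = {}      # history string -> outcome last seen after it
    history = ""
    results = []    # True where the prediction was correct
    for outcome in stream:
        if len(history) == history_bits:
            prediction = table.get(history, "1")  # unseen history: guess taken
            results.append(prediction == outcome)
            table[history] = outcome
        history = (history + outcome)[-history_bits:]
    return table, results


table, results = simulate_pattern_predictor("1110" * 5)
print(table)  # {'111': '0', '110': '1', '101': '1', '011': '1'}
```

Once every pattern has been seen, each later prediction in this stream is correct, matching the slide's 100% steady-state accuracy.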

Two-level Branch Predictors • A branch outcome depends on the outcomes of previous branches • First level: Branch History Registers (BHR) • Global history / Branch correlation: past executions of all branches • Self history / Private history: past executions of the same branch • Second level: Pattern History Table (PHT) • Use first level information to index a table • Possibly XOR with the branch address • PHT: Usually saturating 2 bit counters • Also private, shared or global
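
One possible arrangement of the two levels, using private (per-branch) history and a shared PHT, is sketched below. The sizes, the initial counter value, and the helper `count_mispredictions` are assumptions, not part of the lecture.

```python
class TwoLevelLocalPredictor:
    """First level: per-branch history registers. Second level: shared PHT of 2-bit counters."""

    def __init__(self, history_bits=3, bhr_entries=64):
        self.history_bits = history_bits
        self.bhr_entries = bhr_entries
        self.bhrs = [0] * bhr_entries          # first level
        self.pht = [2] * (1 << history_bits)   # second level, weakly taken

    def _bhr_index(self, pc):
        return (pc >> 2) % self.bhr_entries

    def predict(self, pc):
        history = self.bhrs[self._bhr_index(pc)]
        return self.pht[history] >= 2

    def update(self, pc, taken):
        b = self._bhr_index(pc)
        h = self.bhrs[b]
        self.pht[h] = min(3, self.pht[h] + 1) if taken else max(0, self.pht[h] - 1)
        mask = (1 << self.history_bits) - 1
        self.bhrs[b] = ((h << 1) | int(taken)) & mask  # shift in the newest outcome


def count_mispredictions(predictor, pc, outcomes):
    """Predict, then train, for each outcome; return the number of misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses
```

Swapping the per-branch registers for a single shared register would give the global-history variant described on the slide.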

Gshare Predictor (McFarling) • Figure: the PC and the Global BHR are combined by a function f to index the Branch History Table, which supplies the prediction • PC and BHR can be • concatenated • completely overlapped • partially overlapped • xored, etc. • How deep should the BHR be? • Really depends on the program • But a deeper BHR increases learning time • May increase quality of information
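
The xor variant of combining PC and BHR can be sketched as follows. The 10-bit index, the initial counter value, and the `run` helper are assumptions for illustration only.

```python
class GsharePredictor:
    """Global history xored with the PC selects a 2-bit counter."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.ghr = 0                          # global branch history register
        self.pht = [2] * (1 << index_bits)    # weakly taken (assumption)

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)                   # index with the history used to predict
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


def run(predictor, pc, outcomes):
    """Predict then train on each outcome; return the number of mispredictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses
```

A longer history register here means more distinct table entries must be trained before the predictor settles, which is the learning-time cost the slide mentions.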

Hybrid Prediction • Figure: the PC indexes GSHARE, Bimodal, ... in parallel; each produces T/NT and a Selector chooses the final T/NT • Combining branch predictors • Two or more predictor components combined • Use two different branch predictors • Access both in parallel • A third table determines which prediction to use • Different branches benefit from different types of history
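
A selector table can be sketched as one more array of 2-bit counters that is trained only when the two components disagree. All class names, table sizes, and the update policy below are assumptions; real designs differ in detail.

```python
def bump(counter, up):
    """Saturating 2-bit counter step."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class Bimodal:
    def __init__(self, bits=8):
        self.mask = (1 << bits) - 1
        self.table = [2] * (1 << bits)

    def index(self, pc):
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        self.table[i] = bump(self.table[i], taken)


class Gshare(Bimodal):
    def __init__(self, bits=8):
        super().__init__(bits)
        self.ghr = 0

    def index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def update(self, pc, taken):
        super().update(pc, taken)  # uses the xor index above
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


class Hybrid:
    def __init__(self, bits=8):
        self.bimodal = Bimodal(bits)
        self.gshare = Gshare(bits)
        self.mask = (1 << bits) - 1
        self.selector = [2] * (1 << bits)  # >= 2 means "trust gshare"

    def predict(self, pc):
        use_gshare = self.selector[(pc >> 2) & self.mask] >= 2
        return self.gshare.predict(pc) if use_gshare else self.bimodal.predict(pc)

    def update(self, pc, taken):
        b_ok = self.bimodal.predict(pc) == taken
        g_ok = self.gshare.predict(pc) == taken
        if b_ok != g_ok:  # train the selector only when the components disagree
            s = (pc >> 2) & self.mask
            self.selector[s] = bump(self.selector[s], g_ok)
        self.bimodal.update(pc, taken)
        self.gshare.update(pc, taken)
```

On a repeating TTTN branch the bimodal component keeps missing the N while gshare learns it, so the selector drifts toward gshare for that branch.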

Issues Affecting Accurate Branch Prediction • Aliasing • More than one branch may use the same BHT/PHT entry • Constructive • Prediction that would have been incorrect, predicted correctly • Destructive • Prediction that would have been correct, predicted incorrectly • Neutral • No change in the accuracy
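
Destructive aliasing is easy to provoke in a small simulation: two branches with opposite biases that share one counter. The addresses and table sizes are hypothetical, picked so the two PCs collide in a 16-entry table but not in a 32-entry one.

```python
def run_bimodal(trace, entries):
    """Count mispredictions of a bimodal table over a list of (pc, taken) pairs."""
    counters = [2] * entries  # weakly taken (assumption)
    misses = 0
    for pc, taken in trace:
        i = (pc >> 2) % entries
        if (counters[i] >= 2) != taken:
            misses += 1
        counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)
    return misses


# Branch at 0x100 is always taken, branch at 0x140 is never taken.
trace = [(0x100, True), (0x140, False)] * 50

print(run_bimodal(trace, 16))  # 50: both branches fight over entry 0
print(run_bimodal(trace, 32))  # 1: separate entries, one warm-up miss
```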

More Issues • Training time • Need to see enough branches to uncover pattern • Need enough time to reach steady state • “Wrong” history • Incorrect type of history for the branch • Stale state • Predictor is updated after information is needed • Operating system context switches • More aliasing caused by branches in different programs

Performance Metrics • Misprediction rate • Mispredicted branches per executed branch • Unfortunately, the most commonly reported metric • Instructions per mispredicted branch • Gives a better idea of the program behaviour • Branches are not evenly spaced
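
The two metrics can diverge when branch density differs. The sketch below uses made-up instruction and branch counts purely to show the arithmetic.

```python
def misprediction_rate(mispredicted, branches):
    """Mispredicted branches per executed branch."""
    return mispredicted / branches


def instructions_per_misprediction(instructions, mispredicted):
    """Average number of instructions between mispredicted branches."""
    return instructions / mispredicted


# Hypothetical programs: same misprediction rate, different branch density.
prog_a = {"instructions": 1_000_000, "branches": 200_000, "mispredicted": 10_000}
prog_b = {"instructions": 1_000_000, "branches": 50_000, "mispredicted": 2_500}

for p in (prog_a, prog_b):
    rate = misprediction_rate(p["mispredicted"], p["branches"])
    ipm = instructions_per_misprediction(p["instructions"], p["mispredicted"])
    print(rate, ipm)
```

Both hypothetical programs report the same misprediction rate, yet the first one is interrupted by a misprediction four times as often.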

Upper Limit to ILP: Ideal Machine • Amount of parallelism when there are no branch mispredictions and we’re limited only by data dependencies • IPC: instructions that could theoretically be issued per cycle • FP: 75 - 150 IPC • Integer: 18 - 60 IPC

Impact of Realistic Branch Prediction • Limiting the type of branch prediction • FP: 15 - 45 IPC • Integer: 6 - 12 IPC

Pentium III • Dynamic branch prediction • 512-entry BTB predicts direction and target, 4-bit history used with PC to derive direction • Mispredicted: at least 9 cycles, as many as 26, average 10-15 cycles

AMD Athlon K7 • 10-stage integer, 15-stage fp pipeline, predictor accessed in fetch • 2K-entry bimodal, 2K-entry BTB • Branch Penalties: • Mispredict penalty: at least 10 cycles

Multiple Issue • Multiple Issue is the ability of the processor to start more than one instruction in a given cycle. • Superscalar processors • Very Long Instruction Word (VLIW) processors

1990s: Superscalar Processors • Bottleneck: CPI >= 1 • Limit on scalar performance (single instruction issue) • Hazards • Superpipelining? Diminishing returns (hazards + overhead) • How can we make the CPI = 0.5? • Multiple instructions in every pipeline stage (superscalar) • Pipeline diagram over cycles 1 to 7, two instructions per stage: • Inst0, Inst1: IF ID EX MEM WB in cycles 1-5 • Inst2, Inst3: IF ID EX MEM WB in cycles 2-6 • Inst4, Inst5: IF ID EX MEM WB in cycles 3-7
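
The CPI = 0.5 target follows from simple pipeline arithmetic. The sketch below assumes an ideal 5-stage pipeline with no hazards or stalls; the function names are illustrative.

```python
import math


def ideal_cycles(n, width, depth=5):
    """Cycles to finish n instructions on a hazard-free pipeline issuing `width` per cycle."""
    return depth + math.ceil(n / width) - 1


def schedule(n, width, stages=("IF", "ID", "EX", "MEM", "WB")):
    """Return (name, {cycle: stage}) for each instruction, with cycles starting at 1."""
    rows = []
    for i in range(n):
        start = i // width + 1  # instructions enter the pipeline in groups of `width`
        rows.append((f"Inst{i}", {start + s: name for s, name in enumerate(stages)}))
    return rows


print(ideal_cycles(6, 2))             # 7 cycles for six instructions, two per cycle
print(ideal_cycles(6, 1))             # 10 cycles on a scalar pipeline
print(ideal_cycles(1000, 2) / 1000)   # 0.504, approaching CPI = 0.5
```

The fill time of the pipeline is amortized over long instruction streams, so the ideal CPI tends to 1/width; hazards are what keep real machines from reaching it.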

Superscalar vs. VLIW • Religious debate, similar to RISC vs. CISC • Wisconsin + Michigan (superscalar) vs. Illinois (VLIW) • Q. Who can schedule code better, hardware or software?

Hardware Scheduling • High branch prediction accuracy • Dynamic information on latencies (cache misses) • Dynamic information on memory dependences • Easy to speculate (& recover from mis-speculation) • Works for generic, non-loop, irregular code • Ex: databases, desktop applications, compilers • Limited reorder buffer size limits “lookahead” • High cost/complexity • Slow clock

Software Scheduling • Large scheduling scope (full program), large “lookahead” • Can handle very long latencies • Simple hardware with fast clock • Only works well for “regular” codes (scientific, FORTRAN) • Low branch prediction accuracy • Can improve by profiling • No information on latencies like cache misses • Can improve by profiling • Pain to speculate and recover from mis-speculation • Can improve with hardware support

Superscalar Processors • Pioneer: IBM (America => RIOS, RS/6000, Power-1) • Superscalar instruction combinations • 1 ALU or memory or branch + 1 FP (RS/6000) • Any 1 + 1 ALU (Pentium) • Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1 branch (Pentium II) • Impact of superscalar • More opportunity for hazards (why?) • More performance loss due to hazards (why?)

Superscalar Processors • Issue a varying number of instructions per clock (1 to 8) • Scheduling: static (by the compiler) or dynamic (by the hardware, e.g. Tomasulo) • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

Elements of Advanced Superscalars • High performance instruction fetching • Good dynamic branch and jump prediction • Multiple instructions per cycle, multiple branches per cycle? • Scheduling and hazard elimination • Dynamic scheduling • Not necessarily: Alpha 21064 & Pentium were statically scheduled • Register renaming to eliminate WAR and WAW • Parallel functional units, paths/buses/multiple register ports • High performance memory systems • Speculative execution

SS + DS + Speculation • Superscalar + Dynamic scheduling + Speculation Three great tastes that taste great together • CPI >= 1? • Overcome with superscalar • Superscalar increases hazards • Overcome with dynamic scheduling • RAW dependences still a problem? • Overcome with a large window • Branches a problem for filling large window? • Overcome with speculation

The Big Picture • Figure: static program, fetch & branch predict, issue, execution, reorder & commit

Readings • New paper on branch prediction online. READ. • Material would be used in the THIRD quiz
