Branch Prediction

Branch Prediction J. Nelson Amaral

Why Branch Prediction? • Roughly one in every 5 to 7 instructions of a program is a branch. • Not predicting, or mispredicting, is very costly in architectures with deep pipelines or with many functional units. Baer p. 129

Anatomy of a Predictor Baer p. 130

Anatomy of a Branch Predictor • Event Source: the execution of the program • Predictive information: • Can be encoded in the instruction code • a bit indicates the most likely outcome • forward/backward branch • Obtained from some profiling information Baer p. 130

Anatomy of a Branch Predictor (cont.) • Event Selection: when to predict? • Simple solution: compute the prediction for every instruction (even non-branches) • Only use the result of the prediction for branches Baer p. 130

Anatomy of a Branch Predictor (cont.) • Prediction Indexing: • Use part of the PC to index prediction tables: • history of the outcomes of previous branches at this PC • history of the execution path leading to this PC Baer p. 130

Anatomy of a Branch Predictor (cont.) • Predictor Mechanism: • Static (example): • forward: always not taken • backward: always taken • Dynamic: • Finite State Machine predictor: saturating counters • Markov predictor: correlation Baer p. 131

Anatomy of a Branch Predictor (cont.) • Feedback and Recovery: • Use the real outcome to reinforce the prediction • Must recover from mispredictions Baer p. 131

Control Flow Statistics A 4-way superscalar has to predict a branch, on average, every other cycle. Baer p. 131

Interbranch Distances • 40% of the time there are 0 or 1 cycles between predictions. • Branch resolution takes about 10 cycles. • If the prediction is wrong, up to 40 wrong-path instructions (4 per cycle × 10 cycles) are in flight by the time the resolution occurs. • Simulation for a 4-way out-of-order architecture. Baer p. 131

Static Predictions • Always Taken OR • Always Not Taken Baer p. 132

Static Predictions: Always Taken • Early studies indicated that 2/3 of branches are taken • but 30% of those branches were unconditional! • For conditional branches there appears to be no preferred direction. Baer p. 132

Alternative Static Predictions • Forward: always not taken • Backward: always taken • Accuracy improvements are barely noticeable. • Static prediction based on profiling is slightly better. • Static branch-not-taken has no implementation cost on the pipeline. Baer p. 132

Dynamic Predictors • The prediction for a given branch changes as the program executes. • Simple: a finite-state machine encodes the outcome of a few recent executions of the branch. • Elaborate: not only earlier outcomes of the branch, but also other correlated parts of the program are considered. Baer p. 132

When to predict? • Static prediction: at the Instruction Decode stage • Know that the instruction is a branch • Dynamic prediction: at the Instruction Fetch stage • Calculate prediction for every instruction, even non-branch ones. Baer p. 133

What to Predict? • Branch Direction: Is the branch taken or not? • Branch Target: Address of the next instruction for a taken branch Baer p. 133

Predicting Direction • Where do we find the prediction? Look at the recent past: what was the direction the last time this same branch was executed? • How do we encode the prediction? A single bit encodes the prediction. The prediction bit is set at prediction time. Baer p. 133

Prediction Hysteresis • Look at the last two resolutions • Two wrong predictions are necessary to change the prediction • Motivated by wrong predictions at the end of inner loops. Baer p. 133

2-Bit Saturating Counter
Four states, from most strongly taken to most strongly not taken:
• Last two instances were taken
• Last instance was not taken but the previous was taken
• Last instance was taken but the previous was not
• Last two instances were not taken
Baer p. 134
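
The four states can be sketched as a small integer counter. This is a minimal sketch of a plain 2-bit saturating counter, assuming the encoding 0 = strong not-taken through 3 = strong taken; the names `counter2_t`, `counter2_predict`, and `counter2_update` are hypothetical, and the state diagram in the textbook figure may use different transitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: 0 = strong not-taken, 1 = weak not-taken,
 * 2 = weak taken, 3 = strong taken. */
typedef uint8_t counter2_t;

/* Predict taken when the counter is in one of the two "taken" states. */
static bool counter2_predict(counter2_t c)
{
    assert(c <= 3);
    return c >= 2;
}

/* Move one step toward the actual outcome, saturating at 0 and 3. */
static counter2_t counter2_update(counter2_t c, bool taken)
{
    assert(c <= 3);
    if (taken)
        return (counter2_t)(c < 3 ? c + 1 : 3);
    return (counter2_t)(c > 0 ? c - 1 : 0);
}
```

Starting from strong taken, a single not-taken outcome moves the counter to weak taken, so the prediction stays "taken"; only a second not-taken outcome flips it. This is the hysteresis that the previous slide motivates.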

2-Bit Saturating Counter (Example)
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        begin S1; S2; …; Sk end;
[Flowchart: nested loop with inner branch j < n (T/NT) and outer branch i < m]
Inner-loop branch with a 1-bit predictor:
i  j   Pred  Outc
0  0   NT    T
0  1   T     T
0  n   T     NT
1  0   NT    T
1  1   T     T
1-bit predictor: 2 × m mispredictions
Baer p. 134

2-Bit Saturating Counter (Example)
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        begin S1; S2; …; Sk end;
[Flowchart: nested loop with inner branch j < n (T/NT) and outer branch i < m]
Inner-loop branch with a 1-bit and with a 2-bit predictor:
         1-bit         2-bit
i  j   Pred  Outc   State  Pred  Outc
0  0   NT    T      wNT    NT    T
0  1   T     T      sT     T     T
0  n   T     NT     sT     T     NT
1  0   NT    T      wT     T     T
1  1   T     T      sT     T     T
2-bit predictor: m + 1 mispredictions
Baer p. 134
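
The misprediction counts for the inner-loop branch can be checked with a short simulation. This is a sketch under stated assumptions: the branch is taken n times and then not taken once for each of the m outer iterations, both predictors start in a (weak) not-taken state, and the 2-bit predictor is a plain saturating counter, so its intermediate states may differ from the table. The function name `count_mispredictions` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Count mispredictions of the inner-loop branch over m executions of the
 * inner loop, where each execution is n taken outcomes followed by one
 * not-taken outcome. bits selects a 1-bit or a 2-bit counter. */
static int count_mispredictions(int bits, int m, int n)
{
    assert(bits == 1 || bits == 2);
    assert(m >= 0 && n >= 0);

    int max = (1 << bits) - 1;      /* 1 for 1-bit, 3 for 2-bit */
    int threshold = (max + 1) / 2;  /* predict taken when state >= threshold */
    int state = threshold - 1;      /* assumed start: (weak) not-taken */
    int misses = 0;

    for (int i = 0; i < m; i++) {
        for (int j = 0; j <= n; j++) {
            bool taken = (j < n);   /* the last instance exits the loop */
            bool predicted = (state >= threshold);
            if (predicted != taken)
                misses++;
            if (taken && state < max)
                state++;
            else if (!taken && state > 0)
                state--;
        }
    }
    return misses;
}
```

For m = 3 and n = 4 this gives 6 mispredictions with 1 bit (2 × m) and 4 with 2 bits (m + 1). The m + 1 count needs n ≥ 2: with n = 1 the saturating counter in this sketch never reaches a strong state and mispredicts as often as the 1-bit scheme.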

Accuracy of Branch Prediction • Includes unconditional branches • Predictions are associated with branches after each branch’s first execution • Average of 26 traces (IBM 370, DEC PDP-11, CDC 6400) • Average of 32 traces (MIPS R2000, Sun SPARC, DEC VAX, Motorola 68000) • Fixed prediction: determined by the first execution of the branch • 3-bit counters yield only minor improvements Baer p. 135

Where to Store the Prediction • Storing prediction bits with instructions: need to modify code every 5 instructions. • One (or two) bits for each possible branch address: 32-bit address → 2^30 entries. • Use a cache (Branch Prediction Buffer, BPB): many more bits for tags than for predictions. • Solution: ditch the tags. Baer p. 136

Pattern History Table (PHT) • Use selected bits from the PC to index (or hash) the PHT. • Each entry of the PHT stores the state of a finite state machine associated with a branch. • Aliasing: multiple branches may index the same PHT entry. Performance degrades slightly. Baer p. 136
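
A tagless PHT of 2-bit counters can be sketched as follows. The table size, the choice of PC bits, the initial counter value, and all names (`pht_t`, `pht_index`, `pht_init`, `pht_predict`, `pht_update`) are assumptions made for illustration, not details taken from the textbook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PHT_BITS 10                      /* assumed size: 2^10 counters */
#define PHT_SIZE (1u << PHT_BITS)

typedef struct {
    uint8_t counter[PHT_SIZE];           /* 2-bit counters, values 0..3 */
} pht_t;

/* Index with low-order PC bits; the two lowest bits are dropped under the
 * assumption of 4-byte aligned instructions. No tags are kept, so two
 * branches whose PCs share these bits alias to the same counter. */
static uint32_t pht_index(uint32_t pc)
{
    return (pc >> 2) & (PHT_SIZE - 1);
}

static void pht_init(pht_t *t)
{
    assert(t != NULL);
    for (uint32_t i = 0; i < PHT_SIZE; i++)
        t->counter[i] = 1;               /* assumed start: weak not-taken */
}

static bool pht_predict(const pht_t *t, uint32_t pc)
{
    assert(t != NULL);
    return t->counter[pht_index(pc)] >= 2;
}

/* Feedback: move the indexed counter one step toward the real outcome. */
static void pht_update(pht_t *t, uint32_t pc, bool taken)
{
    assert(t != NULL);
    uint8_t *c = &t->counter[pht_index(pc)];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}
```

With this indexing, two branches whose addresses differ by a multiple of 4 × PHT_SIZE bytes share a counter, which is the aliasing the slide mentions.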

Accuracy of Bimodal Predictor (based on PHT) Based on 10 SPEC89 traces. Baer p. 137

Where is the Predictor Stored? • Two options: a separate PHT, or counters embedded in the instruction cache • MIPS R10000: 512 counters • Alpha 21264: 1 counter per instruction? (2K counters) • Sun UltraSPARC: 2 counters per cache line (2K counters) • IBM PowerPC 620: 512 counters • AMD K5: 1 counter per cache line (1K counters) • Intel Pentium: combines the PHT with the Branch Target Buffer (512 entries) Baer p. 137

Feedback and Recovery Baer p. 137

Feedback: Bimodal Predictor • Feedback: update the 2-bit counter of the executing branch • When is the update done? • When the actual direction is found (EX stage): other predictions of the same branch may already have been made. • When the branch commits: even more predictions may have been made. • Speculatively, when the prediction is made: this only reinforces the prediction in a bimodal predictor. • Updating at EX or at commit makes little difference in performance. Baer p. 137 Textbook typo (p. 137): choice for the timing of the “update”.

Local vs. Global Predictor • Local: • Only uses the history of the branch to be predicted • Global: • Uses the history of other branches that precede the branch to be predicted. Baer p. 138

Motivation for Global Prediction • Example from the SPEC program eqntott:
    if (aa == 2)        /* b1 */
        aa = 0;
    if (bb == 2)        /* b2 */
        bb = 0;
    if (aa != bb) {     /* b3 */
        ….
    }
If b1 and b2 are taken, then b3 is not taken. Baer p. 138

Correlator Predictor • Two-level predictor. • History Register: a 1 is inserted at the right when a branch is taken (0 otherwise); shifted-out bits are lost. Baer p. 139
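
The shift-register update of the history register can be sketched in a few lines. The history length `HIST_BITS` and the function name `history_shift` are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 4   /* assumed history length */

/* Shift the outcome in at the right (1 = taken, 0 = not taken); the bit
 * shifted out at the left is lost. */
static uint32_t history_shift(uint32_t gr, bool taken)
{
    assert(gr < (1u << HIST_BITS));
    return ((gr << 1) | (taken ? 1u : 0u)) & ((1u << HIST_BITS) - 1);
}
```

For example, with a 4-bit register holding 1001, a taken branch yields 0011: the leftmost 1 is dropped and a 1 enters on the right. The resulting value is what the second level uses to select a PHT entry.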

Update Problem in the Correlator Predictor • The PHT is updated non-speculatively at the commit stage. • What is the problem with non-speculative updates of the global register? Baer p. 139

Updating the Global Register in the Correlator Predictor
    if (aa == 2)        /* b1 */
        aa = 0;
    if (bb == 2)        /* b2 */
        bb = 0;
    if (aa != bb) {     /* b3 */
        ….
    }
With non-speculative (commit-time) updates, b1 and b2 may not have committed yet when b3 is predicted, so branches b1 and b2 are not included in the prediction of branch b3! Baer p. 139

Updating the Global Register in the Correlator Predictor • Mispredictions and cache misses affect the commit time of earlier branches. • Two consecutive predictions of a branch b may use different ancestors of b, even if the path leading to b is the same.
    if (aa == 2)        /* b1 */
        aa = 0;
    if (bb == 2)        /* b2 */
        bb = 0;
    if (aa != bb) {     /* b3 */
        ….
    }
Baer p. 139

Solution to the Update Problem in the Correlator Predictor • Update the Global Register speculatively when the prediction is made. • New problem: • Need a repair mechanism • All bits after a misprediction are from branches on the wrong path. Baer p. 139

Repair Mechanism for Global Register in the Correlator Predictor • Decode Stage: • Checkpoint current GR into a FIFO queue • Commit Stage: • H: head of the queue • The corresponding check-pointed GR is H. • Correct prediction: discard H • Incorrect prediction: shift branch outcome into H and make it the new GR. Baer p. 144

Optimization to GR Checkpointing Put into the queue a GR that has the corrected bit shifted into it. Baer p. 144
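
The basic checkpoint-and-repair scheme (before the optimization above) can be sketched with a small ring buffer. Everything here is an illustrative assumption: the history length, the queue capacity, the names `ghist_t`, `ghist_decode`, and `ghist_commit`, and in particular the choice to empty the queue on a misprediction, which models a pipeline flush of the younger wrong-path branches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 8                     /* assumed history length */
#define HIST_MASK ((1u << HIST_BITS) - 1)
#define QUEUE_CAP 16                    /* assumed bound on in-flight branches */

typedef struct {
    uint32_t gr;                        /* speculative global register (GR) */
    uint32_t queue[QUEUE_CAP];          /* FIFO of checkpointed GR values */
    int head, count;
} ghist_t;

static uint32_t shift_in(uint32_t gr, bool taken)
{
    return ((gr << 1) | (taken ? 1u : 0u)) & HIST_MASK;
}

/* Decode stage: checkpoint the current GR into the FIFO, then update the
 * GR speculatively with the predicted direction. */
static void ghist_decode(ghist_t *g, bool predicted_taken)
{
    assert(g->count < QUEUE_CAP);
    g->queue[(g->head + g->count) % QUEUE_CAP] = g->gr;
    g->count++;
    g->gr = shift_in(g->gr, predicted_taken);
}

/* Commit stage: H is the head of the queue. A correct prediction discards
 * H; a misprediction shifts the real outcome into H and makes it the GR. */
static void ghist_commit(ghist_t *g, bool predicted_taken, bool taken)
{
    assert(g->count > 0);
    uint32_t h = g->queue[g->head];
    g->head = (g->head + 1) % QUEUE_CAP;
    g->count--;
    if (predicted_taken != taken) {
        g->gr = shift_in(h, taken);
        g->count = 0;   /* assumption: younger checkpoints are flushed */
    }
}
```

With the optimization on this slide, the decode stage would instead enqueue the GR with the opposite (corrected) bit already shifted in, so that the repair at commit is a plain copy.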

Issues with Correlator Predictor • For small PHTs • Performance is worse than local predictors • It does not use the location of the branch in the program for the prediction • May introduce excessive aliasing • Solution to the aliasing problem: • Reintroduce the PC in the indexing of PHT Baer p. 140

gshare Predictor A common hash is an XOR function. Baer p. 141
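
The XOR hash can be sketched as an index function that combines PC bits with the global history register. The table size `PHT_BITS`, the dropped low PC bits, and the name `gshare_index` are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PHT_BITS 12   /* assumed: 4096-entry PHT */

/* gshare-style index: XOR the global history with PC bits, then keep
 * PHT_BITS bits. Dropping the two low PC bits assumes 4-byte instructions. */
static uint32_t gshare_index(uint32_t pc, uint32_t gr)
{
    uint32_t mask = (1u << PHT_BITS) - 1;
    assert(gr <= mask);
    return ((pc >> 2) ^ gr) & mask;
}
```

Two different branches reached with the same global history now map to different entries, which reintroduces the branch location into the indexing without a second table.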

Accuracy and Use of gshare • Almost perfect for SPEC FP95. • 0.83 accuracy for SPEC INT95 • 0.65 for program go • Used in: Sun UltraSPARC, IBM Power4, AMD K5 Baer p. 141

Example • Assume n=4: • bimodal mispredicts 1 out of every 5 times • global mispredicts from 0 to 5 times, depending on other branches in the loop • This branch has a fixed pattern: “4 taken, 1 not taken” • How can this pattern be learned? • Remember the history of individual branches • We need predictors more attuned to the locality of individual branches [Flowchart: nested loop with inner branch j < n (T/NT) and outer branch i < m] Baer p. 142
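
One way to see how a per-branch history captures the “n taken, 1 not taken” pattern is a small simulation of a local two-level predictor for a single branch. This is a sketch under stated assumptions: a 4-bit local history, a private PHT of 2-bit saturating counters starting at weak not-taken, and the hypothetical function name `local_mispredictions`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCAL_BITS 4                      /* assumed local history length */
#define LOCAL_PHT_SIZE (1u << LOCAL_BITS)

/* Run a single branch whose outcome repeats "n taken, 1 not taken" through
 * a local two-level predictor and return the mispredictions observed
 * after the first `warmup` executions of the branch. */
static int local_mispredictions(int n, int total, int warmup)
{
    assert(n >= 1 && warmup >= 0 && total >= warmup);
    uint32_t history = 0;                 /* this branch's last outcomes */
    uint8_t pht[LOCAL_PHT_SIZE];          /* 2-bit counters, one per history */
    for (uint32_t i = 0; i < LOCAL_PHT_SIZE; i++)
        pht[i] = 1;                       /* assumed start: weak not-taken */

    int misses = 0;
    for (int k = 0; k < total; k++) {
        bool taken = (k % (n + 1)) != n;  /* every (n+1)-th instance exits */
        bool predicted = pht[history] >= 2;
        if (k >= warmup && predicted != taken)
            misses++;
        if (taken && pht[history] < 3)
            pht[history]++;
        else if (!taken && pht[history] > 0)
            pht[history]--;
        history = ((history << 1) | (taken ? 1u : 0u)) & (LOCAL_PHT_SIZE - 1);
    }
    return misses;
}
```

For n = 4 the five positions in the pattern see five distinct 4-bit histories, so after a short warm-up the sketch predicts every instance correctly, including the loop exit. The same is not guaranteed when n exceeds the history length, because the history 1111 then precedes both a taken and a not-taken outcome.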

global-set predictor • First Level: A global shift register for correlations • Second Level: A set of multiple PHTs to prevent aliasing • expensive in terms of storage • must use few PHTs to be viable Baer p. 142/143

set-global predictor • Set of Branch History registers (BHT) • A single global PHT Baer p. 143

set-set predictor • A set of branch history registers (BHT) • A set of PHTs Baer p. 143

Predicting the Branch Target • When is the target of a branch computed? • In a superscalar architecture (e.g., the IA-32 of the Intel P6), only after several pipeline stages. • What is the point of predicting the direction early if we don’t know where the branch goes? • Need to also predict the branch target address. Baer p. 145

Branch Target Buffer (BTB) • A cache-like storage that records branch addresses and their associated targets • If there is a hit in the BTB for a branch predicted taken: • PC ← Target in BTB for the branch Baer p. 146
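
The lookup rule can be sketched with a small direct-mapped table. The size, the direct-mapped organization, the use of the full PC as the tag, the 4-byte fall-through, and all names (`btb_t`, `btb_lookup`, `btb_insert`, `next_pc`) are assumptions for illustration; real BTBs are typically set-associative and store partial tags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BTB_ENTRIES 256   /* assumed size, direct-mapped for simplicity */

typedef struct {
    bool     valid;
    uint32_t branch_pc;   /* full PC used as the tag in this sketch */
    uint32_t target;
} btb_entry_t;

typedef struct {
    btb_entry_t entry[BTB_ENTRIES];
} btb_t;

static uint32_t btb_index(uint32_t pc)
{
    return (pc >> 2) % BTB_ENTRIES;   /* assumes 4-byte instructions */
}

/* On a hit, store the recorded target in *target and return true. */
static bool btb_lookup(const btb_t *b, uint32_t pc, uint32_t *target)
{
    assert(b != NULL && target != NULL);
    const btb_entry_t *e = &b->entry[btb_index(pc)];
    if (e->valid && e->branch_pc == pc) {
        *target = e->target;
        return true;
    }
    return false;
}

/* Record (or overwrite) the target of a branch once it is known. */
static void btb_insert(btb_t *b, uint32_t pc, uint32_t target)
{
    assert(b != NULL);
    btb_entry_t *e = &b->entry[btb_index(pc)];
    e->valid = true;
    e->branch_pc = pc;
    e->target = target;
}

/* Next fetch address: the BTB target on a hit for a branch predicted
 * taken, otherwise the fall-through address. */
static uint32_t next_pc(const btb_t *b, uint32_t pc, bool predicted_taken)
{
    uint32_t target;
    if (predicted_taken && btb_lookup(b, pc, &target))
        return target;
    return pc + 4;
}
```

The `predicted_taken` argument stands in for whatever direction predictor is used; on a BTB miss this sketch simply falls through to the next sequential instruction.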

Integrated BTB-PHT • The BTB needs much more space than the PHT • The number of entries is limited by the BTB. • The BTB must be accessed in a single cycle Baer p. 146

Decoupled BTB-PHT • Parallel BTB and PHT access • If the PHT says ‘taken’ and there is a hit in the BTB, then PC ← Address in BTB Baer p. 146

Decoupled BTB-PHT • For space efficiency: • Only taken branches are added to the BTB • They are added at the backend when the outcome is known. • Example: IBM PowerPC 620: 256-entry, 2-way set-associative BTB; 2K-counter PHT Baer p. 146

Integrating the BTB with the Branch History Table (BHT) • The history of all branches needs to be recorded in the BTB+BHT • Taken and not-taken branches need to be included • Most likely, it is not the same bit field from the PC that is used to index the BTB+BHT and to select the PHT • What happens on a BTB miss? “Backward taken, forward not taken” prediction. • Intel P6: 4-bit local history; 512 BTB entries; # of PHTs not published Baer p. 147
