Modern processors rely on branch prediction to minimize the control flow penalty and improve performance. The penalty can be attacked with software techniques such as loop unrolling, with hardware techniques such as delay slots, and most effectively with dynamic branch prediction. Simple schemes such as 2-bit prediction buffers predict branch outcomes with high accuracy; the key hardware components are branch history tables, branch target buffers, and mispredict recovery mechanisms. Together, these techniques significantly reduce branch penalties and enhance overall performance.
Branch Prediction: Static and Dynamic Branch Prediction Techniques
Control Flow Penalty: Why Branch Prediction?
[Figure: pipeline from PC generation through Fetch (I-cache), Fetch Buffer, Decode, Issue Buffer, Execute (Func. Units), Result Buffer, and Commit (Arch. State); the next fetch is started long before the branch is executed.]
• Modern processors have 10-14 pipeline stages between next-PC calculation and branch resolution
• Work lost if the pipeline makes a wrong prediction ≈ loop length × pipeline width
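For example (the numbers are assumed for illustration, not taken from the slides): with roughly 12 stages between next-PC calculation and branch resolution in a 4-wide machine, one misprediction discards on the order of 12 × 4 ≈ 48 instruction slots of work.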
Reducing Control Flow Penalty
• Software solutions
  • Minimize branches - loop unrolling (increases the run length between branches); see the sketch below
• Hardware solutions
  • Find something else to do - delay slots
  • Speculate - dynamic branch prediction (speculative execution of instructions beyond the branch)
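A minimal sketch of loop unrolling in C (the function names and unroll factor are illustrative): unrolling by four executes one loop branch per four elements of work, so the predictor sees far fewer branches.

```c
#include <stddef.h>

/* Original loop: one backward branch per element processed. */
void scale(float *a, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* Unrolled by 4: one backward branch per four elements, plus a
 * short cleanup loop for the leftover iterations. */
void scale_unrolled(float *a, size_t n, float k) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
    for (; i < n; i++)          /* remainder */
        a[i] *= k;
}
```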
Branch Prediction
• Motivation:
  • Branch penalties limit performance of deeply pipelined processors
  • Much worse for superscalar processors
  • Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
• Required hardware support:
  • Dynamic prediction HW: branch history tables, branch target buffers, etc.
  • Mispredict recovery mechanisms:
    • Keep computation results separate from commit
    • Kill instructions following the branch
    • Restore state to the state immediately following the branch
Static Branch Prediction (review)
[Figure: forward and backward conditional branches (JZ) in a code stream.]
• Overall probability a branch is taken is ~60-70%, but: backward branches ~90% taken, forward branches ~50% taken (a simple heuristic based on this is sketched below)
• ISA can attach preferred-direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken)
• ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate
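A minimal sketch of the "backward taken, forward not taken" static heuristic implied by those statistics; the function name and types are illustrative, not from any particular ISA.

```c
#include <stdbool.h>
#include <stdint.h>

/* BTFN static prediction: backward branches (negative displacement,
 * typically loop branches) are predicted taken; forward branches are
 * predicted not taken. */
static bool predict_taken_static(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc < branch_pc;   /* backward branch => predict taken */
}
```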
Branch Prediction Needs
• Target address generation
  • Get register: PC, link reg, GP reg
  • Calculate: +/- offset, auto inc/dec
  • Target speculation
• Condition resolution
  • Get register: condition code reg, count reg, other reg
  • Compare registers
  • Condition speculation
Branch Prediction Schemes
• 2-bit Branch-Prediction Buffer
• Branch Target Buffer
• Correlating Branch Prediction Buffer
• Tournament Branch Predictor
• Integrated Instruction Fetch Units
• Return Address Predictors (for subroutines; Pentium, Core Duo)
• Predicated Execution (Itanium)
Dynamic Branch Prediction: learning based on past behavior
• Incoming stream of branch addresses
• Fast outgoing stream of predictions
• Correction information returned from the pipeline (a sketch of this interface follows)
[Figure: branch predictor holding history information, fed by incoming branches {Address}, producing predictions {Address, Value}, and receiving corrections {Address, Value} from the pipeline.]
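A minimal sketch, in C, of the interface those streams imply (all names here are assumptions for illustration): the fetch stage asks for a prediction by branch address, and the pipeline later reports the actual outcome so the predictor can learn.

```c
#include <stdbool.h>
#include <stdint.h>

/* Opaque predictor state: the "history information" kept by the hardware. */
typedef struct branch_predictor branch_predictor_t;

/* Fetch side: given a branch address, return a fast prediction
 * (taken / not taken, plus an optional predicted target). */
bool bp_predict(branch_predictor_t *bp, uint64_t branch_pc,
                uint64_t *predicted_target);

/* Resolution side: the pipeline reports the actual outcome so the
 * predictor can learn (and so a misprediction can be repaired). */
void bp_update(branch_predictor_t *bp, uint64_t branch_pc,
               bool actually_taken, uint64_t actual_target);
```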
Branch History Table (BHT): table of predictors
[Figure: the branch PC indexes a table of predictors (Predictor 0 ... Predictor 7).]
• Each branch is given its own predictor; the BHT is a table of predictors (1 bit or more each), indexed by the PC address of the branch
• Problem: in a loop, a 1-bit BHT causes two mispredictions per loop execution (average is ~9 iterations before exit):
  • End-of-loop case: when it exits the loop
  • First time through the loop, it predicts exit instead of looping
  • Most schemes therefore use at least 2-bit predictors
• Performance = ƒ(accuracy, cost of misprediction); a misprediction flushes the reorder buffer
• In the Fetch stage of a branch: use the predictor to make a prediction
• When the branch completes: update the corresponding predictor
A 1-bit BHT is sketched below.
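A minimal software model of a 1-bit BHT, assuming a direct-mapped, untagged table indexed by the low bits of the PC (table size and names are illustrative). It shows why a loop branch costs two mispredictions per loop execution: the exit flips the bit, so the next entry into the loop also mispredicts.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_BITS 12                       /* 2^12 = 4096 entries */
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];             /* 1 bit of state per entry: 0 = not taken, 1 = taken */

static unsigned bht_index(uint64_t pc) {
    return (pc >> 2) & (BHT_SIZE - 1);    /* drop the byte offset, keep the low index bits */
}

/* Fetch stage: make a prediction. */
static bool bht_predict(uint64_t pc) {
    return bht[bht_index(pc)] != 0;
}

/* Branch resolution: update the predictor with the actual outcome. */
static void bht_update(uint64_t pc, bool taken) {
    bht[bht_index(pc)] = taken ? 1 : 0;
}
```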
Branch History Table Organization
[Figure: the low k bits of the fetch PC (ignoring the byte offset) index a 2^k-entry BHT with 2 bits per entry; in parallel, the I-cache supplies the instruction (opcode, offset), and the taken/not-taken prediction plus the computed target select the next PC.]
• Target PC calculation takes time
• A 4K-entry BHT, 2 bits/entry, gives ~80-90% correct predictions
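As a quick sizing check (simple arithmetic, not from the slides): a 4K-entry, 2-bit BHT holds 4096 × 2 = 8192 bits, i.e. only about 1 KB of predictor state for that ~80-90% accuracy.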
2-bit Dynamic Branch Prediction: more accurate than 1-bit
• Better solution: a 2-bit scheme that changes its prediction only after it mispredicts twice in a row
• Adds hysteresis to the decision-making process
[Figure: four-state saturating-counter diagram with two "Predict Taken" states (green: go, taken) and two "Predict Not Taken" states (red: stop, not taken); T/NT edges move between states, so a single opposite outcome does not flip the prediction.]
A C sketch of the saturating counter follows.
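A minimal sketch of one 2-bit saturating counter (a common encoding is assumed: 0-1 predict not taken, 2-3 predict taken); a whole BHT simply holds one such counter per entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counter:
 * 0 = strongly not taken, 1 = weakly not taken,
 * 2 = weakly taken,       3 = strongly taken.
 * The prediction only flips after two consecutive mispredictions. */
typedef uint8_t counter2_t;

static bool counter2_predict(counter2_t c) {
    return c >= 2;                        /* top bit decides the prediction */
}

static counter2_t counter2_update(counter2_t c, bool taken) {
    if (taken)
        return (c < 3) ? c + 1 : 3;       /* saturate at 3 */
    else
        return (c > 0) ? c - 1 : 0;       /* saturate at 0 */
}
```

Reading counter2_predict at fetch and applying counter2_update at branch resolution reproduces the four-state diagram above.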
BTB: Branch Address at the Same Time as the Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the table to get the prediction AND the branch target address (if taken)
[Figure: at fetch, the PC of the instruction is compared (=?) against the branch-PC tags in the BTB; a hit supplies prediction state bits and a predicted PC.]
  • Yes (hit): the instruction is a branch; use the predicted PC as the next PC
  • No (miss): branch not predicted; proceed normally (next PC = PC + 4)
• Only predicted-taken branches and jumps are held in the BTB
• The next PC is determined before the branch is fetched and decoded
• Later: check the prediction; if wrong, kill the instruction(s) and update the prediction bits
A lookup sketch follows below.
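A minimal sketch of a direct-mapped BTB lookup in C (the size, field names, and tagging scheme are assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_BITS 9
#define BTB_SIZE (1u << BTB_BITS)        /* 512 entries */

typedef struct {
    bool     valid;
    uint64_t branch_pc;                  /* tag: PC of the branch */
    uint64_t predicted_target;           /* next PC if predicted taken */
    uint8_t  pred_bits;                  /* 2-bit prediction state */
} btb_entry_t;

static btb_entry_t btb[BTB_SIZE];

/* Fetch stage: look up the fetch PC. On a hit the fetch unit can
 * redirect to the predicted target without decoding the instruction;
 * on a miss the next PC is simply PC + 4. */
static bool btb_lookup(uint64_t pc, uint64_t *next_pc) {
    btb_entry_t *e = &btb[(pc >> 2) & (BTB_SIZE - 1)];
    if (e->valid && e->branch_pc == pc && e->pred_bits >= 2) {
        *next_pc = e->predicted_target;  /* predicted-taken branch or jump */
        return true;
    }
    *next_pc = pc + 4;                   /* not in BTB: fall through */
    return false;
}
```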
BTB Contains Only Branch & Jump Instructions
• The BTB holds information for branch and jump instructions only; it is not updated for other instructions
• For all other instructions the next PC is PC + 4
• This is achieved without decoding the instruction
Combining BTB and BHT
[Figure: front-end pipeline stages — A: PC Generation/Mux, P: Instruction Fetch Stage 1, F: Instruction Fetch Stage 2, B: Branch Address Calc/Begin Decode, I: Complete Decode, J: Steer Instructions to Functional Units, R: Register File Read, E: Integer Execute. The BTB is accessed in the early fetch stages; the BHT, in a later pipeline stage, corrects the BTB when it misses a predicted-taken branch. The BTB/BHT are only updated after the branch resolves in the E stage.]
• BTB entries are considerably more expensive than BHT entries, but they redirect fetch earlier in the pipeline and can accelerate indirect branches (JR)
• The BHT can hold many more entries and is more accurate
Subroutine Return Stack
• Small stack of k entries (typically k = 8-16) used to accelerate subroutine returns; more accurate than a BTB for returns
• Push the return address when a function call is executed
• Pop the return address when a subroutine return is decoded
[Figure: small stack holding return addresses &nexta, &nextb, &nextc after nested calls.]
A sketch is shown below.
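A minimal sketch of a return address stack in C, assuming a fixed-size stack that silently overwrites its oldest entry on overflow (the overflow policy and all names are assumptions):

```c
#include <stdint.h>

#define RAS_SIZE 16                      /* typically 8-16 entries */

static uint64_t ras[RAS_SIZE];
static unsigned ras_top;                 /* index of the next free slot */

/* On a call: push the address of the instruction after the call. */
static void ras_push(uint64_t return_pc) {
    ras[ras_top % RAS_SIZE] = return_pc;
    ras_top++;                           /* overflow overwrites the oldest entry */
}

/* On a decoded return: pop the predicted return address. */
static uint64_t ras_pop(void) {
    if (ras_top == 0)
        return 0;                        /* empty: no prediction available */
    ras_top--;
    return ras[ras_top % RAS_SIZE];
}
```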
Mispredict Recovery
• In-order execution machines:
  • Instructions issued after the branch cannot write back before the branch resolves
  • All instructions in the pipeline behind a mispredicted branch are killed
Predicated Execution
• Avoid branch prediction by turning branches into conditionally executed instructions:
    if (x) then A = B op C else NOP
  • If the condition is false, the instruction neither stores its result nor causes an exception
  • Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instruction
  • IA-64: 64 1-bit predicate fields can be selected, allowing conditional execution of any instruction
  • This transformation is called "if-conversion"; see the sketch below
• Drawbacks of conditional instructions
  • Still takes a clock cycle even if "annulled"
  • Stall if the condition is evaluated late
  • Complex conditions reduce effectiveness; the condition becomes known late in the pipeline
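A minimal C illustration of if-conversion (the function names are illustrative, and compilers typically perform this automatically): the branchy version needs a data-dependent branch, while the converted version computes a select that can be lowered to a conditional move or a predicated instruction.

```c
/* Branchy version: a data-dependent, potentially hard-to-predict branch. */
int clamp_branch(int a, int limit) {
    if (a > limit)
        a = limit;
    return a;
}

/* If-converted version: no control flow; the select can be lowered to a
 * conditional move (or a predicated instruction on IA-64-style ISAs). */
int clamp_select(int a, int limit) {
    return (a > limit) ? limit : a;
}
```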
Dynamic Branch Prediction Summary
• Prediction has become an important part of scalar execution
• Branch History Table: 2 bits per entry for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
• Tournament Predictor: give more resources to competing predictors and pick between them
• Branch Target Buffer: includes the branch address & prediction
• Predicated execution can reduce the number of branches and the number of mispredicted branches
• Return address stack for prediction of indirect jumps (subroutine returns)