Modern processors rely on branch prediction to minimize the control flow penalty and improve performance. The penalty can be attacked with software techniques such as loop unrolling, with hardware techniques such as delay slots, and most effectively with dynamic branch prediction. Simple schemes such as 2-bit prediction buffers predict branch outcomes with high accuracy; the key hardware components are branch history tables, branch target buffers, and mispredict recovery mechanisms. Together, these techniques significantly reduce branch penalties and enhance overall performance.
Branch Prediction: Static and Dynamic Branch Prediction Techniques
Control Flow Penalty: Why Branch Prediction?
[Figure: pipeline from PC generation through Fetch (I-cache), Fetch Buffer, Decode, Issue Buffer, Execute (Func. Units), Result Buffer, and Commit (Arch. State); the next fetch is started long before the branch is executed.]
• Modern processors have 10-14 pipeline stages between next-PC calculation and branch resolution
• Work lost if the pipeline makes a wrong prediction ≈ loop length × pipeline width
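For example (the numbers are assumed for illustration, not taken from the slides): with roughly 12 stages between next-PC calculation and branch resolution in a 4-wide machine, one misprediction discards on the order of 12 × 4 ≈ 48 instruction slots of work.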
Reducing Control Flow Penalty
• Software solutions
  • Minimize branches - loop unrolling (increases the run length between branches); see the sketch below
• Hardware solutions
  • Find something else to do - delay slots
  • Speculate - dynamic branch prediction (speculative execution of instructions beyond the branch)
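A minimal sketch of loop unrolling in C (the function names and unroll factor are illustrative): unrolling by four executes one loop branch per four elements of work, so the predictor sees far fewer branches.

```c
#include <stddef.h>

/* Original loop: one backward branch per element processed. */
void scale(float *a, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* Unrolled by 4: one backward branch per four elements, plus a
 * short cleanup loop for the leftover iterations. */
void scale_unrolled(float *a, size_t n, float k) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
    for (; i < n; i++)          /* remainder */
        a[i] *= k;
}
```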
Branch Prediction
• Motivation:
  • Branch penalties limit performance of deeply pipelined processors
  • Much worse for superscalar processors
  • Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
• Required hardware support:
  • Dynamic prediction HW: branch history tables, branch target buffers, etc.
  • Mispredict recovery mechanisms:
    • Keep computation results separate from commit
    • Kill instructions following the branch
    • Restore state to the state immediately following the branch
Static Branch Prediction (review)
[Figure: forward and backward conditional branches (JZ) in a code stream.]
• Overall probability a branch is taken is ~60-70%, but: backward branches ~90% taken, forward branches ~50% taken (a simple heuristic based on this is sketched below)
• ISA can attach preferred-direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken)
• ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64; typically reported as ~80% accurate
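A minimal sketch of the "backward taken, forward not taken" static heuristic implied by those statistics; the function name and types are illustrative, not from any particular ISA.

```c
#include <stdbool.h>
#include <stdint.h>

/* BTFN static prediction: backward branches (negative displacement,
 * typically loop branches) are predicted taken; forward branches are
 * predicted not taken. */
static bool predict_taken_static(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc < branch_pc;   /* backward branch => predict taken */
}
```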
Branch Prediction Needs
• Target address generation
  • Get register: PC, link reg, GP reg
  • Calculate: +/- offset, auto inc/dec
  • Target speculation
• Condition resolution
  • Get register: condition code reg, count reg, other reg
  • Compare registers
  • Condition speculation
Branch Prediction Schemes
• 2-bit Branch-Prediction Buffer
• Branch Target Buffer
• Correlating Branch Prediction Buffer
• Tournament Branch Predictor
• Integrated Instruction Fetch Units
• Return Address Predictors (for subroutines; Pentium, Core Duo)
• Predicated Execution (Itanium)
Dynamic Branch Prediction: learning based on past behavior
• Incoming stream of branch addresses
• Fast outgoing stream of predictions
• Correction information returned from the pipeline (a sketch of this interface follows)
[Figure: branch predictor holding history information, fed by incoming branches {Address}, producing predictions {Address, Value}, and receiving corrections {Address, Value} from the pipeline.]
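A minimal sketch, in C, of the interface those streams imply (all names here are assumptions for illustration): the fetch stage asks for a prediction by branch address, and the pipeline later reports the actual outcome so the predictor can learn.

```c
#include <stdbool.h>
#include <stdint.h>

/* Opaque predictor state: the "history information" kept by the hardware. */
typedef struct branch_predictor branch_predictor_t;

/* Fetch side: given a branch address, return a fast prediction
 * (taken / not taken, plus an optional predicted target). */
bool bp_predict(branch_predictor_t *bp, uint64_t branch_pc,
                uint64_t *predicted_target);

/* Resolution side: the pipeline reports the actual outcome so the
 * predictor can learn (and so a misprediction can be repaired). */
void bp_update(branch_predictor_t *bp, uint64_t branch_pc,
               bool actually_taken, uint64_t actual_target);
```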
Branch History Table (BHT): table of predictors
[Figure: the branch PC indexes a table of predictors (Predictor 0 ... Predictor 7).]
• Each branch is given its own predictor; the BHT is a table of predictors (1 bit or more each), indexed by the PC address of the branch
• Problem: in a loop, a 1-bit BHT causes two mispredictions per loop execution (average is ~9 iterations before exit):
  • End-of-loop case: when it exits the loop
  • First time through the loop, it predicts exit instead of looping
  • Most schemes therefore use at least 2-bit predictors
• Performance = ƒ(accuracy, cost of misprediction); a misprediction flushes the reorder buffer
• In the Fetch stage of a branch: use the predictor to make a prediction
• When the branch completes: update the corresponding predictor
A 1-bit BHT is sketched below.
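A minimal software model of a 1-bit BHT, assuming a direct-mapped, untagged table indexed by the low bits of the PC (table size and names are illustrative). It shows why a loop branch costs two mispredictions per loop execution: the exit flips the bit, so the next entry into the loop also mispredicts.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_BITS 12                       /* 2^12 = 4096 entries */
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];             /* 1 bit of state per entry: 0 = not taken, 1 = taken */

static unsigned bht_index(uint64_t pc) {
    return (pc >> 2) & (BHT_SIZE - 1);    /* drop the byte offset, keep the low index bits */
}

/* Fetch stage: make a prediction. */
static bool bht_predict(uint64_t pc) {
    return bht[bht_index(pc)] != 0;
}

/* Branch resolution: update the predictor with the actual outcome. */
static void bht_update(uint64_t pc, bool taken) {
    bht[bht_index(pc)] = taken ? 1 : 0;
}
```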
Branch History Table Organization
[Figure: the low k bits of the fetch PC (ignoring the byte offset) index a 2^k-entry BHT with 2 bits per entry; in parallel, the I-cache supplies the instruction (opcode, offset), and the taken/not-taken prediction plus the computed target select the next PC.]
• Target PC calculation takes time
• A 4K-entry BHT, 2 bits/entry, gives ~80-90% correct predictions
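As a quick sizing check (simple arithmetic, not from the slides): a 4K-entry, 2-bit BHT holds 4096 × 2 = 8192 bits, i.e. only about 1 KB of predictor state for that ~80-90% accuracy.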
2-bit Dynamic Branch Prediction: more accurate than 1-bit
• Better solution: a 2-bit scheme that changes its prediction only after it mispredicts twice in a row
• Adds hysteresis to the decision-making process
[Figure: four-state saturating-counter diagram with two "Predict Taken" states (green: go, taken) and two "Predict Not Taken" states (red: stop, not taken); T/NT edges move between states, so a single opposite outcome does not flip the prediction.]
A C sketch of the saturating counter follows.
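A minimal sketch of one 2-bit saturating counter (a common encoding is assumed: 0-1 predict not taken, 2-3 predict taken); a whole BHT simply holds one such counter per entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counter:
 * 0 = strongly not taken, 1 = weakly not taken,
 * 2 = weakly taken,       3 = strongly taken.
 * The prediction only flips after two consecutive mispredictions. */
typedef uint8_t counter2_t;

static bool counter2_predict(counter2_t c) {
    return c >= 2;                        /* top bit decides the prediction */
}

static counter2_t counter2_update(counter2_t c, bool taken) {
    if (taken)
        return (c < 3) ? c + 1 : 3;       /* saturate at 3 */
    else
        return (c > 0) ? c - 1 : 0;       /* saturate at 0 */
}
```

Reading counter2_predict at fetch and applying counter2_update at branch resolution reproduces the four-state diagram above.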
BTB: Branch Address at the Same Time as the Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the table to get the prediction AND the branch target address (if taken)
[Figure: at fetch, the PC of the instruction is compared (=?) against the branch-PC tags in the BTB; a hit supplies prediction state bits and a predicted PC.]
  • Yes (hit): the instruction is a branch; use the predicted PC as the next PC
  • No (miss): branch not predicted; proceed normally (next PC = PC + 4)
• Only predicted-taken branches and jumps are held in the BTB
• The next PC is determined before the branch is fetched and decoded
• Later: check the prediction; if wrong, kill the instruction(s) and update the prediction bits
A lookup sketch follows below.
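A minimal sketch of a direct-mapped BTB lookup in C (the size, field names, and tagging scheme are assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_BITS 9
#define BTB_SIZE (1u << BTB_BITS)        /* 512 entries */

typedef struct {
    bool     valid;
    uint64_t branch_pc;                  /* tag: PC of the branch */
    uint64_t predicted_target;           /* next PC if predicted taken */
    uint8_t  pred_bits;                  /* 2-bit prediction state */
} btb_entry_t;

static btb_entry_t btb[BTB_SIZE];

/* Fetch stage: look up the fetch PC. On a hit the fetch unit can
 * redirect to the predicted target without decoding the instruction;
 * on a miss the next PC is simply PC + 4. */
static bool btb_lookup(uint64_t pc, uint64_t *next_pc) {
    btb_entry_t *e = &btb[(pc >> 2) & (BTB_SIZE - 1)];
    if (e->valid && e->branch_pc == pc && e->pred_bits >= 2) {
        *next_pc = e->predicted_target;  /* predicted-taken branch or jump */
        return true;
    }
    *next_pc = pc + 4;                   /* not in BTB: fall through */
    return false;
}
```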
BTB Contains Only Branch & Jump Instructions
• The BTB holds information for branch and jump instructions only; it is not updated for other instructions
• For all other instructions the next PC is PC + 4
• This is achieved without decoding the instruction
Combining BTB and BHT
[Figure: front-end pipeline stages — A: PC Generation/Mux, P: Instruction Fetch Stage 1, F: Instruction Fetch Stage 2, B: Branch Address Calc/Begin Decode, I: Complete Decode, J: Steer Instructions to Functional Units, R: Register File Read, E: Integer Execute. The BTB is accessed in the early fetch stages; the BHT, in a later pipeline stage, corrects the BTB when it misses a predicted-taken branch. The BTB/BHT are only updated after the branch resolves in the E stage.]
• BTB entries are considerably more expensive than BHT entries, but they redirect fetch earlier in the pipeline and can accelerate indirect branches (JR)
• The BHT can hold many more entries and is more accurate
Subroutine Return Stack
• Small stack of k entries (typically k = 8-16) used to accelerate subroutine returns; more accurate than a BTB for returns
• Push the return address when a function call is executed
• Pop the return address when a subroutine return is decoded
[Figure: small stack holding return addresses &nexta, &nextb, &nextc after nested calls.]
A sketch is shown below.
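A minimal sketch of a return address stack in C, assuming a fixed-size stack that silently overwrites its oldest entry on overflow (the overflow policy and all names are assumptions):

```c
#include <stdint.h>

#define RAS_SIZE 16                      /* typically 8-16 entries */

static uint64_t ras[RAS_SIZE];
static unsigned ras_top;                 /* index of the next free slot */

/* On a call: push the address of the instruction after the call. */
static void ras_push(uint64_t return_pc) {
    ras[ras_top % RAS_SIZE] = return_pc;
    ras_top++;                           /* overflow overwrites the oldest entry */
}

/* On a decoded return: pop the predicted return address. */
static uint64_t ras_pop(void) {
    if (ras_top == 0)
        return 0;                        /* empty: no prediction available */
    ras_top--;
    return ras[ras_top % RAS_SIZE];
}
```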
Mispredict Recovery
• In-order execution machines:
  • Instructions issued after the branch cannot write back before the branch resolves
  • All instructions in the pipeline behind a mispredicted branch are killed
Predicated Execution
• Avoid branch prediction by turning branches into conditionally executed instructions:
    if (x) then A = B op C else NOP
  • If the condition is false, the instruction neither stores its result nor causes an exception
  • Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instruction
  • IA-64: 64 1-bit predicate fields can be selected, allowing conditional execution of any instruction
  • This transformation is called "if-conversion"; see the sketch below
• Drawbacks of conditional instructions
  • Still takes a clock cycle even if "annulled"
  • Stall if the condition is evaluated late
  • Complex conditions reduce effectiveness; the condition becomes known late in the pipeline
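A minimal C illustration of if-conversion (the function names are illustrative, and compilers typically perform this automatically): the branchy version needs a data-dependent branch, while the converted version computes a select that can be lowered to a conditional move or a predicated instruction.

```c
/* Branchy version: a data-dependent, potentially hard-to-predict branch. */
int clamp_branch(int a, int limit) {
    if (a > limit)
        a = limit;
    return a;
}

/* If-converted version: no control flow; the select can be lowered to a
 * conditional move (or a predicated instruction on IA-64-style ISAs). */
int clamp_select(int a, int limit) {
    return (a > limit) ? limit : a;
}
```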
Dynamic Branch Prediction Summary
• Prediction has become an important part of scalar execution
• Branch History Table: 2 bits per entry for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
• Tournament Predictor: give more resources to competing predictors and pick between them
• Branch Target Buffer: includes the branch address & prediction
• Predicated execution can reduce the number of branches and the number of mispredicted branches
• Return address stack for prediction of indirect jumps (subroutine returns)