Dynamic Branch Prediction

CSCI 620 Dynamic Branch Prediction So far, we looked at hardware techniques to reduce hazards from Data Dependences and Name Dependences Now, the (hardware) techniques to handle Control Dependences

CSCI 620 Summary of techniques to increase ILP • Dynamic inst scheduling • -Scoreboard • -Tomasulo’s • Dynamic Branch Predictions • Branch Target Buffers + Branch Target cache + Return address predictor • Superscalar/multiple issue • -Superscalar • -VLIW • Hardware based Speculation To increase ILP Reduce Stalls Reduces Structural & Data Hazards Hardware techniques Reduces Control Hazards Techniques for reducing Stalls Software techniques by compiler--later

CSCI 620 Branch prediction methods • If branches can be predicted (Taken or Not-Taken) with high percentage, then we can reduce stalls by fetching the “target instruction” earlier • When is information about branches gathered/applied? – When the machine is designed—hardwired – When the program is compiled (“compile-time”) – When a “training run” of the program is executed (“profile-based”) – As the program is being executed (“dynamic”)

CSCI 620 Dynamic Branch Prediction • So far we looked at Hardware solutions for Data Dependences & Name Dependences • Control dependences limit ILP(due to branch penalties)—key to performance for current microprocessors • Branch penalties depend upon ? —Penalties depend on 1) the Structure of pipeline 2) The type of predictor 3) Strategies used for recovering from mispredictions • Performance of branch prediction = (accuracy of prediction, cost of misprediction) • Branches arrive much faster when multiple instructions are issued per clock—branches will arrive n times faster in an n-issue processors • We want to predict outcome of branch as early as possible & as accurately as possible • Methods for branch prediction: • Branch History Table (1 or more bits) • Correlated branches • Tournament predictor Branch prediction techniques can be augmented with Branch target buffers & Speculation

CSCI 620 Dynamic Branch Prediction: A Simple Approach • Main idea is: Hope that future behavior is co-related to past behavior; loops, conditional branches • Branch History Table (BHT) (aka Branch Prediction Buffer) • Lower bits of PC address index table of 1-bit values(T or NT) • Entry says whether or not branch taken last time • No address check—several branch instructions share the same bit for prediction inaccuracy • Problem: In a loop, 1-bit BHT will cause two mispredictions 2) 2-bit scheme—next slide Lower bits of PC 000 T 001 NT T 010 Example: 8 bit buffer T 011 NT 100 NT 101 T 110 • End of loop branch, • - First time through loop, • when it predicts exit instead of looping • - Last time through the loop, when it exits instead of • looping as before NT 111

CSCI 620 Taken Not taken Predict taken Predict taken Taken Not taken Taken Not taken Predict not taken Predict not taken Taken Not taken Dynamic Branch Prediction: A Better Way Solution: 2-bit scheme where prediction changes only if we get misprediction twice. Helps when target is known before result of condition. “Strongly Taken” “Strongly Not-Taken”

CSCI 620 Example: 1-bit vs. 2-bit Branch Prediction Assuming both start with “Not-Taken” prediction

CSCI 620 BHT General Case(Branch History Table) • n-bit predictor: • counter can hold values between 0 and • predict taken when value is greater than or equal to half of maximum value: • The counter is incremented on each taken branch • and decremented on each not taken branch

CSCI 620 BHT Accuracy • Mispredict because either: • Wrong guess for that branch • Got branch history of wrong branch from index table • 2-bit 4096-entry table: programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%. • 4096 entries about as good as infinite number of entries • 2-bit predictors work nearly as well as n-bit predictors

CSCI 620 Prediction accuracy of a 4096-entry 2-bit prediction buffer for SPEC89 benchmarks

CSCI 620 Prediction accuracy of a 4096-entry 2-bit prediction buffer versus an infinite buffer for SPEC89 benchmarks

CSCI 620 Correlating Branches (2nd method)(1st was Branch History Table) • Hypothesis: Previous 2-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch • But branches are correlated; that is, behavior of recently-executed branches affects prediction of current branch if (aa == 2) aa = 0; #Branch 1 if (bb == 2) bb = 0; #Branch 2 if (aa! = bb) { #Branch 3 • Clearly branch #3 depends on outcome of #1 and #2 • Prediction must be a function of own branch as well as recent outcomes of other branches

CSCI 620 Correlating predictors (= two-level predictors) • Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table • In general, (m,n) predictor means record last m branches to select between 2m history tables, each with n-bit counters • Thus, old 2-bit BHT is a (0,2) predictor • Global Branch History: m-bit shift register keeping T/NT status of last m branches. • Each entry in table has mn-bit predictors.

CSCI 620 Branch address(low-order bits) 4 bits 2-bits per branch predictor Prediction NT NN TN TT SHIFT N T 2-bit global branch history Correlating Branches (2,2) predictor Behavior of recent branches(global branch history) is used to select between four predictions of next branch, updating just that prediction

CSCI 620 Accuracy of Different Schemes(FROM SECOND EDITION) 20% 4096 Entries 2-bit BHT Unlimited Entries 2-bit BHT 1024 Entries (2,2) BHT 18% 16% 14% 12% 11% Frequency of Mispredictions 10% 8% 6% 6% 6% 6% 5% 5% 4% 4% 2% 1% 1% 0% 0% nasa7 matrix300 tomcatv doducd spice fpppp gcc expresso eqntott li 4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2)

CSCI 620 Tournament Predictors • Tournament Predictors are the most popular form of Multilevel branch predictors • Use 2-bit saturating counter per branch to choose between two different predictors • Use choice predictor to choose between global and local predictors n/m means: n = left predictor value m = right predictor value 0: Incorrect 1: Correct

CSCI 620 Tournament Predictors: DEC Alpha 21264 Tournament predictor using 4K 2-bit counters indexed by local branch address. Chooses between: • Global predictor • 4K entries index by history of last 12 branches (212 = 4K) • Each entry is a standard 2-bit predictor • Local predictor • Local history table: 1024 10-bit entries recording last 10 branches, index by branch address • The pattern of the last 10 occurrences of that particular branch used to index table of 1K entries with 3-bit saturating counters

CSCI 620 Alpha 21264 Microprocessor’s Predictor From http://h18002.www1.hp.com/alphaserver/download/ev6chip.pdf

CSCI 620 Shift left at every local branch—collecting local history Only the most significant bit is used 0=NT, 1=T For each NT branch, it is decremented and for each T branch, it is incremented Keep history of choices between Local or Global Predictor Alpha 21264 Branch Predictor Shift left Keeps the history of branch globally

CSCI 620 Misprediction rate for 3 predictors on SPEC89

CSCI 620 Figure A.24 Branch Target calculation at ID stage—for MIPS MIPS way of handling branch—Decide branch as early as possible—due to simplicity of branch instructions(beq R1, R2, 100(R3))—Here R1-R2 =0? is done for branch decision and the branch address is calculated also. This takes Not Taken approach—penalty? 1 cycle

CSCI 620 Branch Target Predictions • We have been looking at the ways to predict direction of branches (Taken/Not-Taken) • It would be beneficial if we could predict the branchaddress (target address) and fetch it ASAP – Branch Target Buffer (BTB) • Branch target buffer (BTB) stores address of the last target location to avoid recomputing • Useful for conditional/unconditional branches – Return Address Stack (RAS) • Useful for procedure returns

CSCI 620 Branch Target Buffers Branch target calculation is costly and stalls the instruction fetch since the branch address is calculated in ID and the decision is done in EX in 5 stage pipeline(Appendix A)—To avoid stalls, we need to predict the target address and prefetch it in IF cycle !—so, while the branch instruction is fetched, the target address is fetched also—if the instruction happens to be a branch instruction, then we already have the target address BTB stores PCs (predicted target addresses) the same way as caches—content-addressable(associative) memory So, to prefetch, the current PC is sent to the BTB When a match is found, the corresponding Predicted PC(Predicted Target address) is returned along with “Branch prediction bit” If the branch was predicted taken, instruction fetch continues at the returned predicted PC, else fetch next sequential instruction(fall through)

CSCI 620 Instruction cache Branch Target Buffers (Figure 2.22) Optional bit since we only need to store Predicted Taken branches This bit could be used for extra prediction state bits 100 105 cache = content-addressable memory

CSCI 620 Enter branch instruction PC and next PC into branch target buffer Figure 2.23 The steps involved in handling an instruction with a branch-target buffer Send PC to memory and branch-target buffer IF No Yes Entry found in branch-target buffer? Send out predicted PC No Yes Is instruction ataken branch? ID No Yes Normal instruction execution Branch taken? Mispredicted branch, kill fetched instruction; restart fetch at other target; delete entry from target buffer Branch correctly predicted; continue execution with no stalls EX

CSCI 620 Branch Target Cache Target instructions stored here Branch Target Cache • Similar to BTB, but we also want to get the target instruction! – Prediction returns not just the target address, but also the instruction stored there – Allows zero-cycle unconditional branches (branch-folding) • Send target-instruction to ID rather than branch • Branch is not even sent into pipe • For conditional branches? Read Target instruction

CSCI 620 Return Address Predictors • Included in many recent processors – Alpha 21264 => 12 entry RAS (Return Address Stack ) • Procedure returns account for ~85% of indirect jumps (jumps whose address varies at run time). It will then return to many different locations—BTB may not predict accurately, • Therefore, small buffer of Return Addresses=cache of the most recent return addresses • Like a hardware stack, LIFO – At Procedure Call => Push Return address onto stack – Procedure Return => Prediction off of top of stack, Pop it • RAS tends to work quite well since call depths are typically not large

CSCI 620 Prediction accuracy for Return Address Predictors

CSCI 620 Multiple Issue Machines Superscalar means “Beyond Scalar”— ”Scalar“ means “one operation at a time” • To decrease the CPI to less than “one”, multiple issue machines—they come in two flavors • Superscalar: multiple parallel dedicated pipelines: • Issue varying number of instructions per cycle • either statically scheduled by compiler(in-order) and/or dynamically by hardware(out-of-order) (Tomasulo--# of inst issued depends upon?) • IBM PowerPC, Sun UltraSparc, DEC Alpha, IA32 Pentium • Is this same as multicore? • VLIW (Very Long Instruction Word): Several operations encoded as one long instruction: • Instructions have wide template (4-16 operations) • Also classified as EPIC (Explicitly Parallel Instruction Computer) • IA-64 Itanium # of Reservation Stations NO, it is single CPU with multiple Functional Units http://www.semiconductors.philips.com/acrobat_download/other/vliw-wp.pdf

CSCI 620 Choices for Multiple Issue machines

CSCI 620 Getting CPI < 1: Issuing Multiple Instructions/Cycle • Superscalar MIPS: Assume (for simplicity of description) 2 instructions/cycle, 1 FP & 1 anything else • Fetch 64-bits/clock cycle; integer on left, FP on right • Issue 2 instructions in one clock cycle

CSCI 620 Multiple Issue Challenges • While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with: • Exactly 50% FP operations • No hazards • If more instructions issued at same time, greater difficulty in decode and issue • Even 2-way scalar  examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue • VLIW: tradeoff=larger instruction space but simpler decoding—no hardware scheduling • The long instruction word has room for many operations • By definition, all the operations the compiler puts in the long instruction word are independent  execute in parallel • E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch  16 to 24 bits per field  7*16 or 112 bits to 7*24 or 168 bits wide • Need compiling technique that schedules across several branches • Static issue– completely rely on compilers to schedule (Superscalar machines rely on hardware(dynamic issue))

CSCI 620 Limitations to Multi-Issue Machines • As usual, we have to deal with the three hazards: – Structural Hazards – Data Hazards – Control Hazards • Multiple issue gives: – More hazards probable (why?) – Also larger performance hit from hazards (why?)

CSCI 620 Summary So Far • Many things looked at independently • Instruction Fetch – Branch Prediction (fill scheduler with instruction + multiple instructions per cycle) • Scheduling/Hazard elimination – Dynamic Scheduling with Tomasulo (RAW Hazards) – Register Renaming (WAR and WAW Hazards) • Multiple functional units, register file ports – Potentially can reduce CPI < 1 • Speculative Execution • Precise Interrupts • Memory systems

CSCI 620 Major Techniques to increase ILP

CSCI 620 Focus on Speculation/Interrupts • As ILP is pushed further, maintaining “Control Dependences” becomes an increasing burden • Branch prediction reduces stalls but for machines executing multiple instructions per clock, branch prediction may not be enough to keep high ILP Factors to consider • Precise Interrupts – All instructions before interrupt must be completed – All instructions after interrupt must behave as if they never started • Speculation: Fetch, issue, execute the target inst of predicted branch – If branch prediction is wrong, could update state incorrectly leading to wrong program behavior—should be taken care of! • Out-of-Order completion – Post-interrupt/mispredict writebacks change state of machine—should be taken care of!

CSCI 620 Hardware-Based Speculation • Instead of just instruction fetch and decode, also execute instructions based on prediction of branch. • Execute instructions out of order as soon as their operands are available. • But hold instruction commitoperation until branch is decided. • Re-order instructions after execution and commit them in order • Using reorder buffer or ROB • register file not updated until commit • Do not raise exceptions until the instruction is committed • ROB holds and provides operands until commit.

ROB & Store buffer are combined into one ROB Executed instructions are held in ROB until they are no longer “Speculative” instruction commit  commit in-order Example of Tomasulo with Speculation applied to Floating point unit But equally applicable to Integer unit CSCI 620

CSCI 620 4-step execution on Tomasulo with Speculation • Issue—get instruction from instruction queue—among the instructions, there will be branch instructions and target instructions (of course with predicted branch)—These target instructions(along with Store) will go into ROB If a reservation station and reorder buffer slotfree, issue instr & send(to RS) operands & reorder buffer # (for RS to get the operand when it(#) commits) 2) Execute When both operands ready then execute; if not ready, watch CDB for result; when both operands are in reservation station, execute; checks for RAW Cycles taken are dependent on the operation—Load=2 cycles, Store=1 cycle, …… 3) Write result—finish execution (WB) Write on Common Data Bus to all awaiting RSs & reorder buffer; mark reservation station as available(free). (tags are now ROB #s not RS #s) 4) Commit—update register with reorder result Three sequences of actions when an instruction reaches the head of ROB: • Normal Commit: write registers, in-order commit • Store: update memory • Branch with incorrect prediction: flush ROB and flush reservation stations and restart execution at correct PC`

Op – Operation to perform in the unit (e.g., + or – ) Qj, Qk – ROB # producing source registers Vj, Vk – Value of source operands—temp regs for renaming Busy – Indicates reservation station and FU is busy State – Which state? Issue, EX, WB, Commit Destination – Destination register # Value – Result of operation Busy – ROB entry is occupied Load buffer ROB FP register status CSCI 620

This is called? Register Renaming! CSCI 620

CSCI 620

When Mult1 finishes, it sends the result on CDB with #3 as the tag CSCI 620

For SUBD, the Control logic checks F2 and finds that it was renamed to #2, so look at the CDB for #2(which is broadcasted at this moment) and gets the data—Mem[45+Regs[R3]] CSCI 620

CSCI 620

Dynamic Branch Prediction