Dynamic Branch Prediction & Speculation & more ILP


Branches Kill! • Branches are very frequent • Branches arrive much faster when multiple instructions are issued per clock • According to ??, approx. 20% of all instructions are branches • Cannot wait until we know where the branch goes • Long pipelines (superpipelining) • Branch outcome known after B cycles • No scheduling past the branch until outcome known • Superscalars (e.g. 4-way) • Branch every cycle or so! • One cycle of work, then bubbles for ~B cycles?

1. Dynamic Branch Prediction

Branch Prediction • Need to know two things • Whether the branch is taken or not (direction) • The target address if it is taken (target)

Branch Prediction: Direction • Needed for conditional branches • Most branches are of this type • Many, many kinds of predictors for this • Static: compiler annotation • Basic (stall the pipeline) • Predict-not-taken and predict-taken • Delayed branch • Dynamic: hardware prediction • Branch history table (1 or more bits) • Correlated branches • Branch target buffer • Performance = f(accuracy, cost of misprediction)

1. Basic Branch Prediction Buffers • a.k.a. Branch History Table (BHT) • Small direct-mapped cache of T/NT bits • [Figure: the branch PC indexes the BHT; a T entry (predict taken) selects the branch target, an NT entry (predict not taken) selects PC + 4]

One-Bit BTB • Branch history table of 2^K entries, 1 bit per entry • Index: K bits of the branch instruction address • Use this entry to predict this branch • 0: predict not taken • 1: predict taken • When the branch direction is resolved, go back into the table and update the entry: 0 if not taken, 1 if taken

One-Bit Branch Predictor (cont’d)
0xDC08: for (i = 0; i < 100000; i++)
        {
0xDC44:     if ((i % 100) == 0)
                tick();
0xDC50:     if ((i & 1) == 1)
                odd();
        }

Examples • How often is branch outcome != previous outcome?
• DC08: TTTTTTTTTTT ... TTTTTTTTTTNTTTTTTTTT … (100,000 iterations): 2 / 100,000 mispredicted, 99.998% prediction rate
• DC44: TTTTT ... TNTTTTT … TNTTTTT …: 2 / 100 mispredicted, 98.0% prediction rate
• DC50: TNTNTNTNTNTNTNTNTNTNTNTNTNTNT …: 2 / 2 mispredicted, 0.0% prediction rate
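The counts above can be reproduced with a small simulation. This is a minimal sketch, not the slide author's code: the class name `OneBitBHT`, the choice K = 10, the initial "not taken" state, and the 4-byte instruction alignment are all assumptions. "Taken" is modeled as skipping the `if` body (for DC44 and DC50) or looping again (for DC08), which matches the T/N patterns listed.

```python
class OneBitBHT:
    """Branch history table with 2**k one-bit entries (0 = NT, 1 = T)."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        # Assumption: every entry starts out predicting "not taken".
        self.table = [0] * (1 << k)

    def index(self, pc):
        # Assumption: 4-byte instructions, so the two low PC bits are dropped.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self.index(pc)] == 1

    def update(self, pc, taken):
        # One-bit scheme: remember only the last outcome.
        self.table[self.index(pc)] = 1 if taken else 0


def run_loop(iterations=100_000, k=10):
    """Run the three branches of the loop and count mispredicts per branch."""
    bht = OneBitBHT(k)
    misses = {0xDC08: 0, 0xDC44: 0, 0xDC50: 0}

    def branch(pc, taken):
        if bht.predict(pc) != taken:
            misses[pc] += 1
        bht.update(pc, taken)

    for i in range(iterations):
        branch(0xDC44, i % 100 != 0)        # taken = skip tick()
        branch(0xDC50, (i & 1) == 0)        # taken = skip odd()
        branch(0xDC08, i + 1 < iterations)  # taken = go around the loop again
    return misses
```

With these assumptions the three branches map to different table entries, so they do not interfere. The exact counts depend on the initial table state, which is why DC44 comes out one short of 2 per 100.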

The Bit Is Not Enough! • Example: short loop (8 iterations) • Taken 7 times, then not taken once • Not-taken mispredicted (was taken previously) • Execute the same loop again • First always mispredicted (previous outcome was not taken) • Then 6 predicted correctly • Then last one mispredicted again • Results in two mispredicts per loop

Two Bits BTB • [Figure: FSM for Last-Outcome Prediction, states 0 and 1] • [Figure: FSM for 2bC (2-bit Counter), states 0, 1, 2 and 3] • Legend: Predict NT states, Predict T states, transition on T outcome, transition on NT outcome

Example • Outcomes: T T T T T T N T T T • 1bC state: 0 1 1 1 1 … 1 1 0 1 1 … • 2bC state: 0 1 2 3 3 … 3 3 2 3 3 … • The first entries are the initial training/warm-up • Only 1 mispredict per N branches now! • DC08: 99.999% • DC44: 99.0%
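The difference between the last-outcome predictor (1bC) and the 2-bit counter (2bC) can be checked on the 8-iteration loop from the earlier slide. This is a minimal sketch: the function names and the initial state 0 are assumptions, and the 2bC predicts taken in states 2 and 3, as the state trace above suggests.

```python
def mispredicts_1bc(outcomes, state=0):
    """Last-outcome predictor: state 1 predicts taken, state 0 not taken."""
    misses = 0
    for taken in outcomes:
        if (state == 1) != taken:
            misses += 1
        state = 1 if taken else 0
    return misses


def mispredicts_2bc(outcomes, counter=0):
    """2-bit saturating counter: 0..3, predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Count up on taken, down on not taken, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# One execution of the short loop: taken 7 times, then not taken once.
LOOP = [True] * 7 + [False]
TRACE = LOOP * 100  # the same loop executed 100 times
```

On `TRACE` the 1bC mispredicts twice per loop execution (200 in total), while the 2bC pays 3 mispredicts during warm-up and then 1 per loop execution (102 in total).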

Another 2-bit Predictor • Two-Bit Predictor • First bit is the prediction • Second bit tells if it is strong or weak • A mispredict will • Weaken a strong prediction • Change a weak prediction to the opposite strong prediction • Correct prediction will • Strengthen a weak prediction • Leave strong predictions strong • [Figure (from book): four-state diagram; states 11 and 10 predict taken, states 00 and 01 predict not taken; arcs are labeled Taken and Not taken]
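The rules on this slide can be written out as a transition table. This is a sketch under one assumption about the encoding: 11 is strong taken, 10 is weak taken, 01 is weak not taken, and 00 is strong not taken, so the first (high) bit is the prediction. The names `NEXT_STATE`, `predict` and `run` are illustrative.

```python
# (state, outcome_taken) -> next state
NEXT_STATE = {
    (0b11, True): 0b11,   # strong taken stays strong
    (0b11, False): 0b10,  # mispredict weakens a strong prediction
    (0b10, True): 0b11,   # correct prediction strengthens a weak one
    (0b10, False): 0b00,  # mispredict flips weak to the opposite strong
    (0b01, True): 0b11,   # mispredict flips weak to the opposite strong
    (0b01, False): 0b00,  # correct prediction strengthens a weak one
    (0b00, True): 0b01,   # mispredict weakens a strong prediction
    (0b00, False): 0b00,  # strong not taken stays strong
}


def predict(state):
    """The high bit is the prediction: 1 = taken."""
    return bool(state & 0b10)


def run(outcomes, state=0b00):
    """Return (number of mispredicts, final state) for a branch trace."""
    misses = 0
    for taken in outcomes:
        if predict(state) != taken:
            misses += 1
        state = NEXT_STATE[(state, taken)]
    return misses, state
```

Unlike the saturating counter, two mispredicts in a row take this predictor from one strong state straight to the opposite strong state.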

Prediction Accuracy of a 4K-entry 2-bit Prediction Buffer • [Figure: accuracy chart with three annotated groups: "These are good", "We can live with these", "This is bad!"]

Correlating Branch Predictors • The behavior of branch b3 is correlated with the behavior of branches b1 and b2 (if b1 and b2 are both not taken, b3 will be taken). A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior. • Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.

Example
      BNEZ   R1, L1        ; branch b1 (d!=0)
      DADDIU R1, R0, #1
L1:   DADDIU R3, R1, #-1
      BNEZ   R3, L2        ; branch b2
L2:   . . .
Basic one-bit predictor
d=? | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
2 | NT | T | T | NT | T | T
0 | T | NT | NT | T | NT | NT
2 | NT | T | T | NT | T | T
0 | T | NT | NT | T | NT | NT
One-bit predictor with one-bit correlation (each pair reads: prediction if the last branch was not taken / prediction if the last branch was taken)
d=? | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
2 | NT/NT | T | T/NT | NT/NT | T | NT/T
0 | T/NT | NT | T/NT | NT/T | NT | NT/T
2 | T/NT | T | T/NT | NT/T | T | NT/T
0 | T/NT | NT | T/NT | NT/T | NT | NT/T
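Both tables can be replayed in a few lines. This is a minimal sketch, with assumptions labeled: all prediction bits start at NT, the "last branch" history starts at not taken, and the function name `count_mispredicts` is illustrative. With `use_history=False` it behaves as the basic one-bit predictor; with `use_history=True` each branch keeps two one-bit predictors selected by the outcome of the most recent branch.

```python
def count_mispredicts(d_values, use_history):
    """Replay branches b1 and b2 for each value of d and count mispredicts."""
    # pred[branch][h]: True = predict taken; h is the last branch outcome (0/1).
    pred = {"b1": [False, False], "b2": [False, False]}
    last = 0  # assumption: the branch before the first b1 was not taken
    misses = 0
    for d in d_values:
        b1_taken = d != 0          # BNEZ R1, L1
        if d == 0:
            d = 1                  # DADDIU R1, R0, #1 (only when b1 falls through)
        b2_taken = (d - 1) != 0    # R3 = d - 1; BNEZ R3, L2
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            h = last if use_history else 0
            if pred[name][h] != taken:
                misses += 1
            pred[name][h] = taken  # one-bit update: remember the last outcome
            last = 1 if taken else 0
    return misses
```

For d alternating 2, 0, 2, 0 the basic predictor mispredicts all 8 branches, while the version with one bit of correlation mispredicts only the two branches of the first iteration.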

(N,M) Correlating Predictors • Branch outcome correlates with the outcome of some recently executed branches • Use this in our prediction • Keep N bits of history of recent outcomes • Use a different M-bit predictor for each different history • Note: N-bit history means 2^N different predictors for each branch • [Figure: the branch address selects a row of four 2-bit per-branch predictors; the 2-bit global branch history selects which of the four supplies the prediction]

(m, n) Predictors • Use behavior of the last m branches • 2^m n-bit predictors for each branch • Simple implementation • Use an m-bit shift register to record the behavior of the last m branches • [Figure: the PC and the m-bit global branch history (GBH) together select an n-bit predictor in the (m,n) BPF]

Size of the Buffers • Number of bits in an (m,n) predictor • 2^m × n × number of entries in the table • Example: assume 8K bits in the BHT • (0,1): 8K entries • (0,2): 4K entries • (2,2): 1K entries • (12,2): 1 entry! • Does not use the branch address • Relies only on the global branch history
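The entry counts in the example follow directly from the size formula. A small worked check, assuming 8K bits means 8192 bits (the function names are illustrative):

```python
def predictor_bits(m, n, entries):
    """Total storage of an (m,n) predictor: 2^m * n * number of entries."""
    return (2 ** m) * n * entries


def entries_for_budget(m, n, total_bits):
    """How many branch-address-selected entries fit in a fixed bit budget."""
    return total_bits // ((2 ** m) * n)


BUDGET = 8 * 1024  # assumption: "8K bits" = 8192 bits

for m, n in [(0, 1), (0, 2), (2, 2), (12, 2)]:
    print(f"({m},{n}): {entries_for_budget(m, n, BUDGET)} entries")
```

For (12,2) the whole budget is consumed by the 2^12 history-selected counters of a single entry, which is why the branch address no longer contributes to the index.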

Performance Comparison of 2-bit Predictors

Branch-Target Buffers • Further reduce control stalls (hopefully to 0) • Store the predicted address in the buffer • Access the buffer during IF • [Figure: the PC of the instruction to fetch is looked up and compared (=) against the stored PCs; each of the entries in the branch-target buffer holds a predicted PC and whether the branch is predicted taken or untaken] • No match: instruction is not predicted to be a branch; proceed normally • Match: instruction is a taken branch and the predicted PC should be used as the next PC

Prediction with BTB • IF: send PC to memory and branch-target buffer • Entry found in branch-target buffer? • Yes: send out predicted PC • No: fetch continues normally • ID, no entry found: is instruction a taken branch? • No: normal instruction execution • Yes: enter branch instruction PC and next PC into branch-target buffer (EX) • EX, entry found: branch taken? • Yes: branch correctly predicted; continue execution with no stalls • No: mispredicted branch; kill fetched instruction, restart fetch at other target, delete entry from target buffer
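The flowchart can be sketched as a small direct-mapped table. This is an illustration under stated assumptions, not a description of any particular processor: the class and method names are hypothetical, the index uses 4-byte instruction alignment, and the case of a hit with a different target (not covered by the flowchart) is handled by overwriting the entry.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB; each slot holds (branch_pc, predicted_pc) or None."""

    def __init__(self, num_entries=1024):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the two low PC bits.
        return (pc >> 2) % self.num_entries

    def lookup(self, pc):
        """IF stage: predicted next PC if pc matches an entry, else None."""
        slot = self.entries[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return None

    def resolve(self, pc, taken, target):
        """Branch outcome known: update the BTB, return True if fetch was right."""
        idx = self._index(pc)
        predicted = self.lookup(pc)
        if predicted is None:
            if taken:
                # No entry and the branch was taken: enter PC and target.
                self.entries[idx] = (pc, target)
            return not taken  # fetch fell through to pc + 4
        if not taken:
            # Entry said "taken" but the branch was not: delete the entry.
            self.entries[idx] = None
            return False
        if predicted != target:
            # Assumption beyond the flowchart: refresh a stale target.
            self.entries[idx] = (pc, target)
            return False
        return True  # correctly predicted, no stall
```

Only taken branches are ever stored, so a hit always means "predict taken", which matches the flowchart's single yes/no question in IF.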

Target Instruction Buffers • Store target instructions instead of addresses • Advantages • BTB access can take longer than time between IFs and BTB can be larger • Branch folding • Zero-cycle unconditional branches • Replace branch with target instruction

Performance Issues • Limitations of branch prediction schemes • Prediction accuracy (80% - 95%) • Type of program • Size of buffer • Penalty of misprediction • Fetch from both directions to reduce penalty • Memory system should: • Be dual-ported, or • Have an interleaved cache, or • Fetch from one path and then from the other

Branch Target Buffer • BTB indexed by instruction address • We don’t even know if it is a branch! • If address matches a BTB entry, it is predicted to be a branch • BTB entry tells whether it is taken (direction) and where it goes if taken • BTB takes only the instruction address, so while we fetch one instruction in the IF stage we are predicting where to fetch the next one from

Return Address Stack (RAS) • Function returns are frequent, yet • Address is difficult to compute (have to wait until EX stage done to know it) • Address difficult to predict with BTB (function can be called from multiple places) • But return address is actually easy to predict • It is the address after the last call instruction that we haven’t returned from yet • Hence the Return Address Stack

Return Address Stack (RAS) • Call pushes return address into the RAS • When a return instruction is decoded, pop the predicted return address from the RAS • Accurate prediction even w/ small RAS
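A fixed-size RAS can be sketched as a circular buffer. This is a minimal illustration; the class name, the default size of 8, and the policy of overwriting the oldest entry on overflow are assumptions rather than details from the slides.

```python
class ReturnAddressStack:
    """Small circular stack of predicted return addresses."""

    def __init__(self, size=8):
        self.slots = [None] * size
        self.top = 0     # index of the next slot to push into
        self.count = 0   # number of valid entries (at most size)

    def push(self, return_pc):
        """On a call: record the address of the instruction after the call."""
        self.slots[self.top] = return_pc
        self.top = (self.top + 1) % len(self.slots)
        # Assumption: on overflow the oldest entry is silently overwritten.
        self.count = min(self.count + 1, len(self.slots))

    def pop(self):
        """On a decoded return: predicted return address, or None if empty."""
        if self.count == 0:
            return None
        self.top = (self.top - 1) % len(self.slots)
        self.count -= 1
        return self.slots[self.top]
```

Overflow only costs the outermost return predictions, which is consistent with the slide's point that a small RAS is already accurate.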

Processor Branch Prediction Comparison
Processor | Released | Accuracy | Prediction Mechanism
Cyrix 6x86 | early '96 | ca. 85% | BHT associated with BTB
Cyrix 6x86MX | May '97 | ca. 90% | BHT associated with BTB
AMD K5 | mid '94 | 80% | BHT associated with I-cache
AMD K6 | early '97 | 95% | 2-level adaptive associated with BTIC and ALU
Intel Pentium | late '93 | 78% | BHT associated with BTB
Intel P6 | mid '96 | 90% | 2-level adaptive with BTB
PowerPC750 | mid '97 | 90% | BHT associated with BTIC
MC68060 | mid '94 | 90% | BHT associated with BTIC
DEC Alpha | early '97 | 95% | Hybrid 2-level adaptive associated with I-cache
HP PA8000 | early '96 | 80% | BHT associated with BTB
SUN UltraSparc | mid '95 | 88% int, 94% FP | BHT associated with I-cache

2. Speculation

Speculation • Predict branches, then do everything (execute, write result, schedule instructions) • What do we do when we mispredict? • Two things • Allow things-before-the-branch to complete • Undo things-after-the-branch we have completed • Solution • At the end, put instructions in the correct order again

Speculative Execution + Tomasulo’s Algorithm • Speculation: The Picture • [Figure: Tomasulo datapath with a reorder buffer that stores results; the reorder buffer is usually implemented as a circular buffer]

Speculation Pipeline • New Structure: Reorder Buffer (ROB) • Queues instructions in the original order • Use ROB entry number as “name” in renaming • ROB entry keeps the result after Write Result • New stage: Commit • Takes the oldest instruction in ROB • If instruction executed and result in ROB entry • Write result to registers • Free the ROB entry • Do this N times per cycle in an N-way superscalar

Four Steps of Speculative Tomasulo Algorithm • 1. Issue (in-order): get an instruction from the instruction queue. If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called “dispatch”) • 2. Execution (out-of-order): operate on operands (EX). When both operands are in the reservation station, execute; if not ready, watch the CDB for the result. Checks RAW hazards (this stage is sometimes called “issue”) • 3. Write result (out-of-order): finish execution (WB). Write on the Common Data Bus (CDB) to all awaiting FUs and the reorder buffer; mark the reservation station available • 4. Commit (in-order): update registers and memory with the reorder buffer result • When an instruction is at the head of the reorder buffer and the result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer • A mispredicted branch at the head of the reorder buffer flushes the reorder buffer (cancels speculated instructions after the branch)
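The commit step can be sketched with a queue of ROB entries. This is a simplified illustration, not the full algorithm: reservation stations, the CDB and stores are not modeled, and the names `RobEntry`, `ReorderBuffer`, `issue`, `write_result` and `commit` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dest: Optional[str]           # destination register name, None for branches
    value: Optional[int] = None   # result, filled in at Write Result
    done: bool = False            # True once the result is present
    is_branch: bool = False
    mispredicted: bool = False


class ReorderBuffer:
    def __init__(self, size=8):
        self.size = size
        self.entries = deque()    # head (oldest instruction) is entries[0]
        self.regs = {}            # committed (architectural) register values

    def issue(self, entry):
        """In-order issue: fails (stall) when no ROB slot is free."""
        if len(self.entries) == self.size:
            return False
        self.entries.append(entry)
        return True

    def write_result(self, entry, value=None, mispredicted=False):
        """Out-of-order completion: the result waits in the ROB entry."""
        entry.value = value
        entry.done = True
        entry.mispredicted = mispredicted

    def commit(self, width=1):
        """In-order commit of up to `width` finished instructions."""
        committed = 0
        while committed < width and self.entries and self.entries[0].done:
            head = self.entries.popleft()
            committed += 1
            if head.dest is not None:
                self.regs[head.dest] = head.value
            if head.is_branch and head.mispredicted:
                # Everything younger than the branch was speculative: flush.
                self.entries.clear()
                break
        return committed
```

A result that finishes early (out of order) never reaches `regs` unless every older instruction commits first, which is what makes the flush on a mispredicted branch safe.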

Hardware-Based Speculation Example • Show speculated single-issue Tomasulo status when MUL.D is ready to commit • Example on page 229

[Figure: speculated Tomasulo status when MUL.D is ready to commit (next cycle), with callouts for the reorder buffer entry # for MUL.D and the reorder buffer entry # for DIV.D]

Recovery From a Misprediction • Mispredicted branch eventually committed • Now precise state is in the registers • Everything before the branch done and in regs • Nothing after the branch is in regs yet • Flush all the other structures • Reservation stations, ROB, instruction queue • Restart fetch from correct destination • Precise exceptions? Same thing!

Speculation: Stores • ROB takes over the role of the store queue • Stores go to memory when they commit • Commit is in-order, so store order is correct • Mispredictions do not affect memory state

ROB vs. Register Renaming • How many ports do we need for the ROB? • Lots! Look at a single-issue processor: • Issue: read two entries and write one • Write Result: write one entry • Commit: read and write one entry • ROB has a dual role • Keeps results (names) • Keeps order

ROB vs. Register Renaming • Keeping results: physical registers • Have a large physical register file • Keep architected-to-physical mapping in a table • Physical registers hold all values (names) • Keeping order: simplified ROB • Only keeps info needed to commit instructions • Reservation stations also simplified • No need to keep values • Called “instruction window” instead of RS

How does it work? • Rename • Find in the rename RAT (Register Allocation Table) which physical registers are sources • Get a free physical register for destination and change rename RAT • Dispatch • Wait in window until all source registers have values, then • Read source values from registers • Write Result • Send result to destination register • Send destination register number to window

Committing • Wait until oldest instruction done • Change commit RAT • Before it said Rn is in Pj • Now change it so Rn is in Pk (the destination) • Free physical register Pj • Everything that wants Pj is already committed • All future uses of Rn should use Pk

Recovering Precise State • To get precise state after instruction X, we • Wait until X commits • The commit RAT is the precise state • E.g. recovery from branch misprediction • Wait until X commits • Rename map = commit map • Flush window & ROB, restart fetch
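The rename, commit and recovery rules can be sketched together. This is a minimal model under stated assumptions: 8 architectural and 16 physical registers with Rn initially in Pn, instructions with exactly one destination, and a free list that is rebuilt on recovery from the commit RAT. The class and method names are hypothetical.

```python
class Renamer:
    def __init__(self, num_arch=8, num_phys=16):
        self.num_phys = num_phys
        # Assumption: initially architectural register Rn lives in physical Pn.
        self.rename_rat = {f"R{i}": i for i in range(num_arch)}
        self.commit_rat = dict(self.rename_rat)
        self.free = list(range(num_arch, num_phys))
        self.rob = []  # in program order: (architectural dest, new physical reg)

    def rename(self, dest, srcs):
        """Look up sources in the rename RAT, then allocate a new destination."""
        src_phys = [self.rename_rat[s] for s in srcs]
        # A real design would stall when the free list is empty.
        new_phys = self.free.pop(0)
        self.rename_rat[dest] = new_phys
        self.rob.append((dest, new_phys))
        return new_phys, src_phys

    def commit(self):
        """Commit the oldest instruction: Rn moves from Pj to Pk, Pj is freed."""
        dest, new_phys = self.rob.pop(0)
        old_phys = self.commit_rat[dest]
        self.commit_rat[dest] = new_phys
        self.free.append(old_phys)

    def recover(self):
        """After the mispredicted branch commits: rename map = commit map."""
        self.rename_rat = dict(self.commit_rat)
        self.rob.clear()
        in_use = set(self.commit_rat.values())
        self.free = [p for p in range(self.num_phys) if p not in in_use]
```

For example, renaming `XOR R0, R0, R0` followed by `LD.IMM R1, 416(R0)` gives destinations P8 and P9, and the load reads its source from P8 rather than P0.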

Register Renaming Example • 8 architectural (logical) registers: R0..R7 • 16 physical registers (numbered 0..15), 6-instruction window • Single-issue, nine-stage pipeline • Fetch (also use BTB to predict next fetch addr) • Decode • Rename and put in instruction window • Also use RAS and direction predictor, calculate target address if not indirect • Schedule • Instruction stays in schedule stage until operands ready • Read Operands • Execute • Also calculate target address if indirect • Read Memory • Write Result • Commit • Instruction stays in commit stage until it can actually commit

Register Renaming Example: Program • I1: XOR R0, R0, R0 • I2: LD.IMM R1, 416(R0) • I3: LD.IMM R2, 4(R0) • I4: LD.IMM R3, 400(R0) • I5: AND R4, R0, R0 • I6: LD R5, 0(R3) • I7: ADD R4, R4, R5 • I8: ADD R3, R3, R2 • I9: BNE R3, R1, -12(PC)

Register Renaming Example: Snapshots • [Figures: per-cycle state of the reg alloc table (RAT), the physical regs P0..P15 (valid bit 1 for P0..P7, 0 for P8..P15), the instruction window, and the ROB entries #0..#13 with the current ROB no. and commit no.] • Cycle 3: Rename I1 • XOR gets ROB #0 and destination P8; sources P0 and P0 are ready • ROB #0 records R0 in P8 • End of Cycle 3 • Next ROB no. is #1, commit no. is #0 • Cycle 4: Renamed I2 • LD.IMM gets ROB #1 and destination P9; source is P8; immediate 416 • ROB #1 records R1 in P9 • Cycle 4: Schedule (the XOR is scheduled) • Cycle 5: Renamed I3 • LD.IMM gets ROB #2 and destination P10; source is P8; immediate 4 • ROB #2 records R2 in P10 • The XOR moves on to the read-registers (RR) stage • Cycle 5: Schedule • Cycle 5: I1 Reads Regs
