
Dynamic Branch Prediction & Speculation & more ILP


Presentation Transcript


  1. Dynamic Branch Prediction & Speculation & more ILP

  2. Branches Kill! • Branches are very frequent • Branches arrive much faster when multiple instructions are issued per clock • According to ?? • Approx. 20% of all instructions • Cannot wait until we know where it goes • Long pipelines (super-pipeline) • Branch outcome known after B cycles • No scheduling past the branch until outcome known • Superscalars (e.g. 4-way) • Branch every cycle or so! • One cycle of work, then bubbles for ~B cycles?

  3. 1. Dynamic Branch Prediction

  4. Branch Prediction • Need to know two things • Whether the branch is taken or not (direction) • The target address if it is taken (target)

  5. Branch Prediction: Direction • Needed for conditional branches • Most branches are of this type • Many, many kinds of predictors for this • Static: compiler annotation • Basic (stall the pipeline) • Predict-not-taken and predict-taken • Delayed branch • Dynamic: hardware prediction • Branch history table (1 or more bits) • Correlated branches • Branch target buffer • Performance = (accuracy, cost of misprediction)

  6. 1. Basic Branch Prediction Buffers • a.k.a. Branch History Table (BHT) • Small direct-mapped cache of T/NT bits • [Figure: the PC of the branch instruction indexes the BHT; on T (predict taken) the next PC is the branch target, on NT (predict not-taken) it is PC + 4]

  7. One-Bit Branch Predictor • Branch history table of 2^K entries, 1 bit per entry • Index: K bits of the branch instruction address • Use this entry to predict this branch: 0: predict not taken, 1: predict taken • When the branch direction is resolved, go back into the table and update the entry: 0 if not taken, 1 if taken

  8. One-Bit Branch Predictor (cont’d)
       0xDC08: for (i = 0; i < 100000; i++)
               {
       0xDC44:     if ((i % 100) == 0)
                       tick();
       0xDC50:     if ((i & 1) == 1)
                       odd();
               }

  9. Examples (100,000 iterations) • How often is branch outcome != previous outcome? • DC08: TTTTTTTTTTT ... TTTTTTTTTTNTTTTTTTTT … 2 / 100,000 mispredicted, 99.998% prediction rate • DC44: TTTTT ... TNTTTTT … TNTTTTT … 2 / 100 mispredicted, 98.0% • DC50: TNTNTNTNTNTNTNTNTNTNTNTNTNTNT … 2 / 2 mispredicted, 0.0%
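
The one-bit numbers above can be reproduced with a small simulation. This is a minimal sketch, not the slides' own code: the function name `simulate_one_bit_bht`, the table size `k=10`, the initial "not taken" state, and the mapping from C source to branch direction (each `if` compiles to a branch that is taken when the call is skipped, and the loop branch sits at the bottom of the loop) are all assumptions.

```python
def simulate_one_bit_bht(iterations=100_000, k=10):
    """Run the example loop through a 2^k-entry, 1-bit branch history table."""
    table = [False] * (1 << k)   # False = predict not taken (assumed initial state)
    mask = (1 << k) - 1
    pcs = {"DC08": 0xDC08, "DC44": 0xDC44, "DC50": 0xDC50}
    misses = {name: 0 for name in pcs}

    def branch(name, taken):
        index = (pcs[name] >> 2) & mask   # assumes 4-byte instructions
        if table[index] != taken:         # prediction = last outcome
            misses[name] += 1
        table[index] = taken              # update once the direction is resolved

    for i in range(iterations):
        branch("DC44", i % 100 != 0)         # taken = skip tick()
        branch("DC50", (i & 1) == 0)         # taken = skip odd()
        branch("DC08", i + 1 < iterations)   # taken = loop again
    return misses


if __name__ == "__main__":
    for name, count in simulate_one_bit_bht().items():
        print(name, "mispredicts:", count)
```

Under these assumptions the loop branch mispredicts twice in total, the `i % 100` branch mispredicts about twice per 100 iterations, and the alternating branch mispredicts every time.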

  10. The Bit Is Not Enough! • Example: short loop (8 iterations) • Taken 7 times, then not taken once • Not-taken mispredicted (was taken previously) • Execute the same loop again • First always mispredicted (previous outcome was not taken) • Then 6 predicted correctly • Then last one mispredicted again • Results in two mispredicts per loop

  11. Two-Bit Predictor • [Figure: FSM for last-outcome prediction, states 0 and 1] • [Figure: FSM for 2bC (2-bit counter), states 0 to 3] • Legend: predict NT, predict T, transition on T outcome, transition on NT outcome

  12. Example • Outcomes: T T T T T … T N T T T … • 1bC state: 0 1 1 1 1 … 1 1 0 1 1 … • 2bC state: 0 1 2 3 3 … 3 3 2 3 3 … • After initial training/warm-up, only 1 mispredict per N branches now! • DC08: 99.999% • DC44: 99.0%
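
To compare the two schemes on the short 8-iteration loop described earlier, here is a hedged sketch of a saturating-counter predictor. The function name `count_mispredicts`, the starting state 0, and the choice of 100 loop executions are illustrative assumptions; with `bits=1` the counter degenerates to last-outcome prediction.

```python
def count_mispredicts(outcomes, bits):
    """Count mispredictions of one saturating counter with the given width."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)      # predict taken in the upper half of states
    state = 0                        # assumed initial state
    misses = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            misses += 1
        # Count up on taken, down on not taken, saturating at both ends.
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return misses


# Loop branch: taken 7 times, then not taken once; the loop runs 100 times.
outcomes = ([True] * 7 + [False]) * 100

if __name__ == "__main__":
    print("1-bit mispredicts:", count_mispredicts(outcomes, bits=1))
    print("2-bit mispredicts:", count_mispredicts(outcomes, bits=2))
```

After warm-up the 1-bit scheme misses twice per loop execution and the 2-bit counter only once, because a single not-taken outcome moves the counter from 3 to 2, which still predicts taken.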

  13. Another 2-bit Predictor • Two-Bit Predictor • First bit is the prediction • Second bit tells if it is strong or weak • A mispredict will • Weaken a strong prediction • Change a weak prediction to the opposite strong prediction • Correct prediction will • Strengthen a weak prediction • Leave strong predictions strong • [Figure (from book): state diagram in which states 11 and 10 predict taken, states 00 and 01 predict not taken, with Taken / Not taken transitions between them]
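
The rules on this slide can be written down as a transition table. The sketch below assumes the encoding the slide describes (first bit is the prediction, so 11 and 10 predict taken, 00 and 01 predict not taken) and treats 11 and 00 as the strong states; the names `TRANSITIONS` and `run_predictor` are hypothetical.

```python
# state -> {outcome taken?: next state}
TRANSITIONS = {
    "11": {True: "11", False: "10"},  # strong taken: a mispredict weakens it
    "10": {True: "11", False: "00"},  # weak taken: a mispredict flips to strong not taken
    "01": {True: "11", False: "00"},  # weak not taken: a mispredict flips to strong taken
    "00": {True: "01", False: "00"},  # strong not taken: a mispredict weakens it
}


def run_predictor(outcomes, state="11"):
    """Return (mispredict count, final state) for a sequence of outcomes."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state[0] == "1"   # first bit is the prediction
        if predicted_taken != taken:
            misses += 1
        state = TRANSITIONS[state][taken]
    return misses, state


if __name__ == "__main__":
    print(run_predictor([True, False] * 4))
    print(run_predictor([False, False, True]))
```

Unlike the saturating counter, two consecutive mispredicts from a strong state land directly in the opposite strong state.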

  14. Prediction Accuracy of a 4K-entry 2-bit Prediction Buffer • [Chart annotations: "These are good", "We can live with these", "This is bad!"]

  15. Correlating Branch Predictors • The behavior of branch b3 is correlated with the behavior of branches b1 and b2 (if b1 and b2 are both not taken, b3 will be taken). A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior. • Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.

  16. Example
            BNEZ   R1, L1        ; branch b1 (d!=0)
            DADDIU R1, R0, #1
       L1:  DADDIU R3, R1, #-1
            BNEZ   R3, L2        ; branch b2
       L2:  . . .

      Basic one-bit predictor
      d=?  b1 pred  b1 action  new b1 pred  b2 pred  b2 action  new b2 pred
      2    NT       T          T            NT       T          T
      0    T        NT         NT           T        NT         NT
      2    NT       T          T            NT       T          T
      0    T        NT         NT           T        NT         NT

      One-bit predictor with one-bit correlation
      d=?  b1 pred  b1 action  new b1 pred  b2 pred  b2 action  new b2 pred
      2    NT/NT    T          T/NT         NT/NT    T          NT/T
      0    T/NT     NT         T/NT         NT/T     NT         NT/T
      2    T/NT     T          T/NT         NT/T     T          NT/T
      0    T/NT     NT         T/NT         NT/T     NT         NT/T
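
The two tables can be checked with a short simulation. This sketch assumes what the assembly implies: b1 is taken when d != 0, the fall-through path sets d to 1, and b2 is taken when d != 1. The function name `count_misses`, the all-not-taken initial state, and the "last branch not taken" initial history are assumptions.

```python
def count_misses(d_values, correlated):
    """Count mispredictions of b1 and b2 for a sequence of d values."""
    # table[branch][h]: 1-bit prediction used when the last branch outcome was h
    # (h = 0 not taken, 1 taken). Without correlation only slot 0 is used.
    table = {"b1": [False, False], "b2": [False, False]}
    last_taken = False
    misses = 0
    for d in d_values:
        b1_taken = d != 0
        if not b1_taken:
            d = 1                    # DADDIU R1, R0, #1 on the fall-through path
        b2_taken = d != 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            slot = int(last_taken) if correlated else 0
            if table[name][slot] != taken:
                misses += 1
            table[name][slot] = taken
            last_taken = taken       # one bit of global history
    return misses


if __name__ == "__main__":
    sequence = [2, 0, 2, 0]
    print("one-bit:", count_misses(sequence, correlated=False))
    print("one-bit with one-bit correlation:", count_misses(sequence, correlated=True))
```

With d alternating between 2 and 0, the plain one-bit predictor misses every branch, while the correlated version misses only during the first iteration.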

  17. (N,M) Correlating Predictors • Branch outcome correlates with the outcome of some recently executed branches • Use this in our prediction • Keep N bits of history of recent outcomes • Use a different M-bit predictor for each different history • Note: N-bit history means 2^N different predictors for each branch • [Figure: 4 bits of the branch address select a row of 2-bit per-branch predictors; the 2-bit global branch history selects which predictor in that row supplies the prediction]

  18. (m, n) Predictors • Use behavior of the last m branches • 2^m n-bit predictors for each branch • Simple implementation • Use m-bit shift register to record the behavior of the last m branches • [Figure labels: (m,n) BPF, m-bit GBH, PC, n-bit predictor]
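
A minimal sketch of an (m, n) predictor follows, assuming n-bit saturating counters, an m-bit global history shift register, and a table keyed by a few PC bits plus the history. The class name `MNPredictor`, the `index_bits` parameter, and the dictionary-backed table are illustrative choices, not taken from the slides.

```python
class MNPredictor:
    """(m, n) predictor: 2^m n-bit counters per branch, picked by global history."""

    def __init__(self, m, n, index_bits=4):
        self.m, self.n = m, n
        self.history = 0                      # m-bit global history shift register
        self.index_mask = (1 << index_bits) - 1
        self.counters = {}                    # (pc index, history) -> counter value

    def _key(self, pc):
        return ((pc >> 2) & self.index_mask, self.history)

    def predict(self, pc):
        return self.counters.get(self._key(pc), 0) >= (1 << (self.n - 1))

    def update(self, pc, taken):
        key = self._key(pc)
        value = self.counters.get(key, 0)
        top = (1 << self.n) - 1
        self.counters[key] = min(value + 1, top) if taken else max(value - 1, 0)
        # Shift the new outcome into the history, keeping only m bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def misses_on(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    alternating = [True, False] * 500
    print("(0,2):", misses_on(MNPredictor(0, 2), 0xDC50, alternating))
    print("(2,2):", misses_on(MNPredictor(2, 2), 0xDC50, alternating))
```

On a strictly alternating branch the (0,2) configuration keeps missing the taken outcomes, while (2,2) learns the pattern after a few warm-up misses because each history value gets its own counter.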

  19. Size of the Buffers • Number of bits in an (m,n) predictor • 2^m × n × number of entries in the table • Example: assume 8K bits in the BHT • (0,1): 8K entries • (0,2): 4K entries • (2,2): 1K entries • (12,2): 1 entry! • Does not use the branch address • Relies only on the global branch history
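
The entry counts in this example follow directly from the size formula. A small sketch, with the hypothetical helper name `entries_for_budget` and 8K taken to mean 8192 bits:

```python
def entries_for_budget(total_bits, m, n):
    """Number of table entries an (m, n) predictor gets from a fixed bit budget."""
    bits_per_entry = (2 ** m) * n     # 2^m predictors of n bits each per entry
    return total_bits // bits_per_entry


if __name__ == "__main__":
    budget = 8 * 1024                 # 8K bits, as in the slide's example
    for m, n in [(0, 1), (0, 2), (2, 2), (12, 2)]:
        print(f"({m},{n}): {entries_for_budget(budget, m, n)} entries")
```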

  20. Performance Comparison of 2-bit Predictors

  21. Branch-Target Buffers • Further reduce control stalls (hopefully to 0) • Store the predicted address in the buffer • Access the buffer during IF • [Figure: the PC of the instruction to fetch is looked up in the branch-target buffer, whose entries hold the predicted PC and whether the branch is predicted taken or untaken. No match: the instruction is not predicted to be a branch; proceed normally. Match: the instruction is a taken branch and the predicted PC should be used as the next PC]

  22. Prediction with BTB • IF: send PC to memory and branch-target buffer • Entry found in branch-target buffer? • No entry found: in ID, is the instruction a taken branch? If no, normal instruction execution; if yes, in EX enter the branch instruction PC and next PC into the branch-target buffer • Entry found: send out the predicted PC; in EX, was the branch taken? If yes, the branch was correctly predicted, continue execution with no stalls; if no, mispredicted branch: kill the fetched instruction, restart fetch at the other target, delete the entry from the target buffer
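
The flow on this slide maps onto a lookup step and a resolve step. The sketch below is a simplification under stated assumptions: the buffer is an unbounded dictionary rather than a finite tagged table, instructions are 4 bytes, and the names `BranchTargetBuffer`, `predict_next_pc`, and `resolve` are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}                       # branch PC -> predicted target PC

    def predict_next_pc(self, pc):
        """IF stage: a hit means 'taken branch, fetch from the stored target'."""
        return self.entries.get(pc, pc + 4)     # a miss falls through to PC + 4

    def resolve(self, pc, taken, target):
        """EX stage: update the buffer; return True if the fetch was correct."""
        predicted = self.predict_next_pc(pc)
        actual = target if taken else pc + 4
        if taken:
            self.entries[pc] = target           # enter (or refresh) the branch
        else:
            self.entries.pop(pc, None)          # predicted taken but was not: delete
        return predicted == actual


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(btb.resolve(0x100, True, 0x200))      # first encounter: not in the buffer
    print(hex(btb.predict_next_pc(0x100)))      # now predicted taken
```

Only taken branches are stored, so a non-branch instruction or a never-taken branch costs no entry and is fetched sequentially.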

  23. Target Instruction Buffers • Store target instructions instead of addresses • Advantages • BTB access can take longer than time between IFs and BTB can be larger • Branch folding • Zero-cycle unconditional branches • Replace branch with target instruction

  24. Performance Issues • Limitations of branch prediction schemes • Prediction accuracy (80% - 95%) • Type of program • Size of buffer • Penalty of misprediction • Fetch from both directions to reduce penalty • Memory system should: • Be dual-ported • Have an interleaved cache • Fetch from one path and then from the other
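
How accuracy and misprediction penalty combine can be illustrated with the usual stall-cycle estimate. In this sketch the 20% branch frequency and the 80% to 95% accuracy range come from the slides; the base CPI of 1.0, the 10-cycle penalty, and the function name `effective_cpi` are assumptions chosen only for illustration.

```python
def effective_cpi(base_cpi, branch_fraction, accuracy, penalty_cycles):
    """Base CPI plus average stall cycles per instruction from mispredicts."""
    return base_cpi + branch_fraction * (1.0 - accuracy) * penalty_cycles


if __name__ == "__main__":
    for accuracy in (0.80, 0.95):
        cpi = effective_cpi(1.0, 0.20, accuracy, penalty_cycles=10)
        print(f"accuracy {accuracy:.0%}: CPI = {cpi:.2f}")
```

With these numbers, moving from 80% to 95% accuracy cuts the branch stall component from 0.4 to 0.1 cycles per instruction.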

  25. Branch Target Buffer • BTB indexed by instruction address • We don’t even know if it is a branch! • If address matches a BTB entry, it is predicted to be a branch • BTB entry tells whether it is taken (direction) and where it goes if taken • BTB takes only the instruction address, so while we fetch one instruction in the IF stage we are predicting where to fetch the next one from

  26. Return Address Stack (RAS) • Function returns are frequent, yet • Address is difficult to compute (have to wait until EX stage done to know it) • Address difficult to predict with BTB (function can be called from multiple places) • But return address is actually easy to predict • It is the address after the last call instruction that we haven’t returned from yet • Hence the Return Address Stack

  27. Return Address Stack (RAS) • Call pushes return address into the RAS • When a return instruction is decoded, pop the predicted return address from RAS • Accurate prediction even w/ small RAS
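
A return address stack can be sketched as a small circular buffer. The class name `ReturnAddressStack`, the default size of 8, the 4-byte call instruction, and the overwrite-oldest policy on overflow are assumptions; real designs differ in how they handle overflow and misspeculated pushes.

```python
class ReturnAddressStack:
    def __init__(self, size=8):
        self.slots = [0] * size
        self.top = 0          # index of the next free slot
        self.count = 0        # number of valid entries

    def on_call(self, call_pc, instr_bytes=4):
        """Push the address of the instruction after the call."""
        self.slots[self.top] = call_pc + instr_bytes
        self.top = (self.top + 1) % len(self.slots)
        self.count = min(self.count + 1, len(self.slots))  # overflow drops the oldest

    def on_return(self):
        """Pop the predicted return address, or None if the stack is empty."""
        if self.count == 0:
            return None
        self.top = (self.top - 1) % len(self.slots)
        self.count -= 1
        return self.slots[self.top]


if __name__ == "__main__":
    ras = ReturnAddressStack(size=2)
    ras.on_call(0x100)
    ras.on_call(0x200)
    print(hex(ras.on_return()), hex(ras.on_return()), ras.on_return())
```

Because calls and returns nest, even a few entries cover most call depths; only the outermost returns of a very deep call chain lose their prediction.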

  28. Processor Branch Prediction Comparison
      Processor        Released    Accuracy          Prediction Mechanism
      Cyrix 6x86       early '96   ca. 85%           BHT associated with BTB
      Cyrix 6x86MX     May '97     ca. 90%           BHT associated with BTB
      AMD K5           mid '94     80%               BHT associated with I-cache
      AMD K6           early '97   95%               2-level adaptive associated with BTIC and ALU
      Intel Pentium    late '93    78%               BHT associated with BTB
      Intel P6         mid '96     90%               2 level adaptive with BTB
      PowerPC750       mid '97     90%               BHT associated with BTIC
      MC68060          mid '94     90%               BHT associated with BTIC
      DEC Alpha        early '97   95%               Hybrid 2-level adaptive associated with I-cache
      HP PA8000        early '96   80%               BHT associated with BTB
      SUN UltraSparc   mid '95     88% int, 94% FP   BHT associated with I-cache

  29. 2. Speculation

  30. Speculation • Predict branches, then do everything (execute, write result, schedule instructions) • What do we do when we mispredict? • Two things • Allow things-before-the-branch to complete • Undo things-after-the-branch we have completed • Solution • At the end, put instructions in the correct order again

  31. Speculation: The Picture • Speculative Execution + Tomasulo’s Algorithm • [Figure labels: "Usually implemented as a circular buffer", "Store Results"]

  32. Speculation Pipeline • New Structure: Reorder Buffer (ROB) • Queues instructions in the original order • Use ROB entry number as “name” in renaming • ROB entry keeps the result after Write Result • New stage: Commit • Takes the oldest instruction in ROB • If instruction executed and result in ROB entry • Write result to registers • Free the ROB entry • Do this N times per cycle in an N-way superscalar

  33. Four Steps of Speculative Tomasulo Algorithm • 1. Issue (in-order): get an instruction from the instruction queue. If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called “dispatch”) • 2. Execution (out-of-order): operate on operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result; checks RAW (sometimes called “issue”) • 3. Write result (out-of-order): finish execution (WB). Write on the Common Data Bus (CDB) to all awaiting FUs and the reorder buffer; mark the reservation station available • 4. Commit (in-order): update registers or memory with the reorder buffer result • When an instruction is at the head of the reorder buffer and the result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer • A mispredicted branch at the head of the reorder buffer flushes the reorder buffer (cancels speculated instructions after the branch)
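
The in-order commit step and the flush on a mispredicted branch can be sketched with a queue. This is a toy model under stated assumptions: only register results are tracked (no stores, no reservation stations), and the names `Entry`, `ReorderBuffer`, `issue`, `write_result`, and `commit` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    dest: Optional[str]             # architectural register, None for a branch
    value: Optional[int] = None
    done: bool = False
    mispredicted: bool = False


class ReorderBuffer:
    def __init__(self, width=1):
        self.entries = deque()      # oldest instruction at the left
        self.regs = {}              # committed (precise) register state
        self.width = width          # commits per cycle

    def issue(self, dest):
        entry = Entry(dest)
        self.entries.append(entry)  # program order is preserved here
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        entry.value, entry.done, entry.mispredicted = value, True, mispredicted

    def commit(self):
        """Commit up to `width` finished instructions; True if a flush happened."""
        for _ in range(self.width):
            if not self.entries or not self.entries[0].done:
                return False        # the head is not finished: everything waits
            head = self.entries.popleft()
            if head.dest is not None:
                self.regs[head.dest] = head.value
            if head.mispredicted:
                self.entries.clear()  # cancel everything after the branch
                return True
        return False


if __name__ == "__main__":
    rob = ReorderBuffer(width=2)
    add, branch, later = rob.issue("R1"), rob.issue(None), rob.issue("R2")
    rob.write_result(later, 7)                   # finishes early, cannot commit yet
    rob.write_result(add, 5)
    rob.write_result(branch, mispredicted=True)
    print(rob.commit(), rob.regs)
```

The instruction after the branch had already produced its result, but because results reach the registers only at commit, flushing the buffer is enough to undo it.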

  34. Hardware-Based Speculation Example Show speculated single-issue Tomasulo status when MUL.D is ready to commit Example on page 229

  35. [Figure: speculated Tomasulo status when MUL.D is ready to commit (next cycle), with callouts marking the reorder buffer entry # for MUL.D and the reorder buffer entry # for DIV.D]

  36. Recovery From a Misprediction • Mispredicted branch eventually committed • Now precise state is in the registers • Everything before the branch done and in regs • Nothing after the branch is in regs yet • Flush all the other structures • Reservation stations, ROB, instruction queue • Restart fetch from correct destination • Precise exceptions? Same thing!

  37. Speculation: Stores • ROB takes over the role of the store queue • Stores go to memory when they commit • Commit is in-order, so store order is correct • Mispredictions do not affect memory state

  38. ROB vs. Register Renaming • How many ports do we need for the ROB? • Lots! Look at a single-issue processor: • Issue: read two entries and write one • Write Result: write one entry • Commit: read and write one entry • ROB has a dual role • Keeps results (names) • Keeps order

  39. ROB vs. Register Renaming • Keeping results: physical registers • Have a large physical register file • Keep architected-to-physical mapping in a table • Physical registers hold all values (names) • Keeping order: simplified ROB • Only keeps info needed to commit instructions • Reservation stations also simplified • No need to keep values • Called “instruction window” instead of RS

  40. How does it work? • Rename • Find in the rename RAT (Register Allocation Table) which physical registers are sources • Get a free physical register for destination and change rename RAT • Dispatch • Wait in window until all source registers have values, then • Read source values from registers • Write Result • Send result to destination register • Send destination register number to window

  41. Committing • Wait until oldest instruction done • Change commit RAT • Before it said Rn is in Pj • Now change it so Rn is in Pk (the destination) • Free physical register Pj • Everything that wants Pj is already committed • All future uses of Rn should use Pk

  42. Recovering Precise State • To get precise state after instruction X, we • Wait until X commits • The commit RAT is the precise state • E.g. recovery from branch misprediction • Wait until X commits • Rename map = commit map • Flush window & ROB, restart fetch
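
The rename, commit, and recovery steps from the preceding slides can be sketched with two mapping tables and a free list. The sizes (8 architectural and 16 physical registers) match the example that follows; the class name `Renamer`, the identity initial mapping, and the FIFO free list are assumptions, and the sketch does not model stalling when no physical register is free.

```python
from collections import deque


class Renamer:
    def __init__(self, n_arch=8, n_phys=16):
        self.n_phys = n_phys
        self.rename_rat = list(range(n_arch))    # speculative Rn -> Pj mapping
        self.commit_rat = list(range(n_arch))    # precise Rn -> Pj mapping
        self.free = deque(range(n_arch, n_phys))

    def rename(self, dest, srcs):
        """Look up sources first, then give the destination a fresh register."""
        phys_srcs = [self.rename_rat[s] for s in srcs]
        phys_dest = self.free.popleft()          # raises IndexError if none is free
        self.rename_rat[dest] = phys_dest
        return phys_dest, phys_srcs

    def commit(self, dest, phys_dest):
        """Rn was in Pj; now Rn is in Pk, so Pj can be reused."""
        old = self.commit_rat[dest]
        self.commit_rat[dest] = phys_dest
        self.free.append(old)

    def recover(self):
        """Rename map = commit map; uncommitted registers return to the free list."""
        self.rename_rat = list(self.commit_rat)
        live = set(self.commit_rat)
        self.free = deque(p for p in range(self.n_phys) if p not in live)


if __name__ == "__main__":
    r = Renamer()
    print(r.rename(0, [0, 0]))   # XOR R0, R0, R0
    print(r.rename(1, [0]))      # LD.IMM R1, 416(R0) reads the renamed R0
    r.commit(0, 8)
    r.recover()
    print(r.rename_rat, list(r.free))
```

Reading the sources before updating the destination matters for instructions such as XOR R0, R0, R0, where the same architectural register is both source and destination.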

  43. Register Renaming Example • 8 architectural (logical) registers: R0..R7 • 16 physical registers (numbered 0..15), 6-instruction window • Single-issue, nine-stage pipeline: • 1. Fetch (also use BTB to predict next fetch addr) • 2. Decode • 3. Rename and put in instruction window (also use RAS and direction predictor, calculate target address if not indirect) • 4. Schedule (instruction stays in schedule stage until operands ready) • 5. Read Operands • 6. Execute (also calculate target address if indirect) • 7. Read Memory • 8. Write Result • 9. Commit (instruction stays in commit stage until it can actually commit)

  44. Register Renaming Example • Cycle 3: Rename I1 • Program: I1: XOR R0, R0, R0 • I2: LD.IMM R1, 416(R0) • I3: LD.IMM R2, 4(R0) • I4: LD.IMM R3, 400(R0) • I5: AND R4, R0, R0 • I6: LD R5, 0(R3) • I7: ADD R4, R4, R5 • I8: ADD R3, R3, R2 • I9: BNE R3, R1, -12(PC) • [Figure: reg alloc table (RAT), physical regs P0..P15 with ready bits, instruction window, and ROB entries #0..#13; the XOR takes ROB entry #0 with destination P8 and sources P0, P0]

  45. Register Renaming Example • End of Cycle 3 • [Figure: same structures; R0 now maps to P8 in the rename RAT, next ROB entry is #1]

  46. Register Renaming Example • Cycle 4: Renamed I2 • [Figure: same structures; LD.IMM R1 takes ROB entry #1 with destination P9, source P8 (not yet ready), immediate 416]

  47. Register Renaming Example • Cycle 4: Schedule (the XOR is scheduled) • [Figure: same structures]

  48. Register Renaming Example • Cycle 5: Renamed I3 • [Figure: same structures; LD.IMM R2 takes ROB entry #2 with destination P10, source P8, immediate 4; the XOR is in the Read Operands (RR) stage]

  49. Register Renaming Example • Cycle 5: Schedule • [Figure: same structures]

  50. Register Renaming Example • Cycle 5: I1 Reads Regs • [Figure: same structures]
