

  1. Computer Architecture Principles, Dr. Mike Frank. CDA 5155, Summer 2003. Module #22: Dynamic Branch Prediction

  2. Dynamic Branch Prediction

  3. Dynamic Branch Prediction (3.4) • As the amount of ILP exploited increases (CPI decreases), the impact of control stalls increases. • Branches come more often • An n-cycle delay postpones more instructions • Dynamic Hardware Branch Prediction • “Learns” which branches are taken, or not • Makes the right guess (most of the time) about whether a branch will be taken. • Delay depends on whether the prediction is correct, and whether the branch is taken.

  4. Branch-Prediction Buffers (BPB) • Also called “branch history table” • Low-order n bits of the branch address are used to index a table of branch-history data. • May have “collisions” between distant branches whose addresses share the same low-order bits. • Associative tables are also possible • In each entry, k bits of information about the history of that branch are stored. • Common values of k: 1, 2, and larger • The entry is used to predict what the branch will do. • The actual behavior of the branch then updates the entry.
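A minimal sketch in C of how such a buffer might be indexed, assuming 4,096 one-byte entries, 32-bit word-aligned instructions, and hypothetical names (the sizes and names are illustrative, not from the slides):

    #include <stdint.h>

    #define BPB_BITS    12                  /* low-order address bits used as the index */
    #define BPB_ENTRIES (1u << BPB_BITS)    /* 4,096 entries */

    static uint8_t bpb[BPB_ENTRIES];        /* k bits of history per entry, held in a byte */

    /* Index with the low-order bits of the branch address. The two lowest bits are
     * dropped because word-aligned instructions share them; two distant branches
     * that map to the same index will collide and share an entry. */
    static unsigned bpb_index(uint32_t branch_pc)
    {
        return (branch_pc >> 2) & (BPB_ENTRIES - 1);
    }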

  5. 1-bit Branch Prediction • The entry for a branch has only two states: • Bit = 1 • “The last time this branch was encountered, it was taken. I predict it will be taken next time.” • Bit = 0 • “The last time this branch was encountered, it was not taken. I predict it will not be taken next time.” • Will make 2 mistakes each time a loop is encountered. • At the end of the first & last iterations. • May always mispredict in pathological cases!
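For concreteness, a 1-bit entry can be modeled in C as follows (a sketch with hypothetical names; each entry simply stores the last observed outcome):

    #include <stdbool.h>
    #include <stdint.h>

    static uint8_t last_outcome[4096];      /* 1 = taken last time, 0 = not taken last time */

    /* Predict whatever the branch did the last time it was seen. */
    static bool predict_1bit(unsigned idx)             { return last_outcome[idx] != 0; }

    /* After the branch resolves, remember the actual outcome. */
    static void update_1bit(unsigned idx, bool taken)  { last_outcome[idx] = taken ? 1 : 0; }

For a loop branch this is wrong exactly twice per execution of the loop: on the final iteration (it still predicts taken) and again on the first iteration the next time the loop runs (it now predicts not taken).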

  6. 2-bit Branch Prediction • 2 bits → 4 states • Commonly used to code the most recent branch outcome, & the most recent run of 2 consecutive identical outcomes. • Strategy: • Prediction mirrors the most recent run of 2. • Only 1 misprediction per loop execution (on the last iteration), after the first time the loop is reached.

  7. State Transition Diagram • (Diagram showing the four predictor states: State #3 (11), State #2 (10), State #1 (01), State #0 (00))

  8. Misprediction rate for 2-bit BPB (with 4,096 entries)

  9. n-bit Branch Prediction One commonly tried scheme: • Each entry contains an integer in [0, 2^n - 1]. • After branch execution, if the branch was taken, • then: entry ← min(entry+1, 2^n - 1) ; increment • else: entry ← max(entry-1, 0) ; decrement • If entry < ½·2^n, then predict not taken, • else predict taken. • Effectively does the following: • Averages branch behavior over a long time, and • Predicts the more frequently occurring outcome • Empirically, not much better than 2-bit!
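This scheme maps directly onto a small C sketch (N, the table size, and the function names are illustrative; N = 2 gives the familiar 2-bit counter):

    #include <stdbool.h>
    #include <stdint.h>

    #define N        2                      /* bits of history per entry */
    #define MAX_VAL  ((1u << N) - 1)        /* 2^N - 1 */

    static uint8_t ctr[4096];               /* each entry is an integer in [0, 2^N - 1] */

    /* Predict taken iff the counter is in the upper half of its range. */
    static bool predict_nbit(unsigned idx)
    {
        return ctr[idx] >= (1u << (N - 1));
    }

    /* Saturating increment on taken, saturating decrement on not taken. */
    static void update_nbit(unsigned idx, bool taken)
    {
        if (taken) { if (ctr[idx] < MAX_VAL) ctr[idx]++; }
        else       { if (ctr[idx] > 0)       ctr[idx]--; }
    }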

  10. Ideal Branch Prediction? • More sophisticated schemes are certainly theoretically possible… • Could recognize simple patterns of branches • E.g., TNTNTNTNTN (T=Taken, N=Not taken) • However, totally general, optimal prediction is essentially equivalent to the general learning problem in its difficulty! • Ideal branch prediction is uncomputable statically. • It is apparently impossible to even objectively define “ideal” dynamic prediction, • and it’s intractable to compute it under many specific (and subjectively motivated) definitions.

  11. Implementing Branch Histories • Separate “cache” accessed during IF • Extra bits in instruction cache • Problem with this approach, in the simple RISC pipeline we’ve been studying: • After fetch, don’t know whether the instruction is really a branch or not (until decoding) • Also don’t know the target address. • By the time you know these things (in ID), you already know whether it’s really taken! • Haven’t saved any time! • Branch-Target Buffers can fix this problem (later)...

  12. Branch-Prediction Performance • Contribution to cycle count depends on: • Branch frequency & misprediction frequency • Freqs. of taken/not taken, predicted/mispredicted. • Delay of taken/not taken, predicted/mispredicted. • How to reduce misprediction frequency? • Increase buffer size to avoid collisions. • Empirically, has little effect beyond ~4,096 entries. • Increase prediction accuracy • Increase # of bits/entry (little effect beyond 2) • Use a different prediction scheme • e.g., correlated predictors, which we will now discuss…
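As a rough illustration with made-up numbers (not measurements from these slides): if 20% of instructions are branches, 10% of those branches are mispredicted, and each misprediction costs 3 cycles, the contribution is about 0.20 × 0.10 × 3 ≈ 0.06 extra cycles per instruction.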

  13. Correlated Prediction - example • Code fragment from eqntott:
        if (aa==2) aa=0;
        if (bb==2) bb=0;
        if (aa!=bb) { …
  • Simple RISC code (aa=R1, bb=R2):
             SUBUI R3,R1,#2   ;(aa-2)
             BNEZ  R3,L1      ;branch b1 (aa!=2)
             ADD   R1,R0,R0   ; aa=0
        L1:  SUBUI R3,R2,#2   ;(bb-2)
             BNEZ  R3,L2      ;branch b2 (bb!=2)
             ADD   R2,R0,R0   ; bb=0
        L2:  SUBU  R3,R1,R2   ;(aa-bb)
             BEQZ  R3,L3 …    ;branch b3 (aa==bb)
  • Note that if b1 and b2 are both untaken, b3 will be taken.

  14. Even simpler example • b1 untaken implies b2 untaken • C code:
        if (d==0) d=1;
        if (d==1) …
  • Simple RISC code (d=R1):
             BNEZ  R1,L1     ;b1: d!=0
             ADDI  R1,R0,#1  ;d=1
        L1:  SUBUI R3,R1,#1  ;(d-1)
             BNEZ  R3,L2     ;b2: d!=1

  15. Behavior w. 1-bit predictor • Suppose initial value of d alternates between 2 and 0. • All branches are mispredicted!
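Spelling out the slide's claim, and assuming both 1-bit entries happen to start out predicting not taken (the starting state is an assumption):

    d=2:  b1 predicted NT, actually T  (miss);   b2 predicted NT, actually T  (miss)
    d=0:  b1 predicted T,  actually NT (miss);   b2 predicted T,  actually NT (miss)
    d=2:  b1 predicted NT, actually T  (miss);   b2 predicted NT, actually T  (miss)

… and so on: each 1-bit entry always stores the previous outcome, which is always the opposite of the next one.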

  16. Correlating Predictors • Have different predictions for the current branch depending on whether the previously executed branch instruction was taken or not. • Notation: _ / _ • What to predict if the last branch was NOT taken / What to predict if the last branch was TAKEN • Prediction used is shown in bold

  17. (m,n) correlated predictors • Uses the behavior of the most recent m branches encountered to select one of 2^m different branch predictors for the next branch. • Each of these predictors records n bits of history information for any given branch. • On previous slide we saw a (1,1) predictor. • Easy to implement: • Behavior of last m branches: an m-bit shift register • Branch-prediction buffer: access with low-order bits of branch address, concatenated with shift register.
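A sketch of this indexing in C, assuming m = 2 bits of global history and 10 low-order address bits (the sizes and names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define M         2                              /* bits of global branch history */
    #define ADDR_BITS 10                             /* low-order branch-address bits used */

    static uint8_t  table_mn[1u << (M + ADDR_BITS)]; /* one n-bit predictor per (history, address) pair */
    static unsigned global_hist;                     /* m-bit shift register of recent outcomes */

    /* Concatenate the low-order address bits with the m-bit shift register. */
    static unsigned mn_index(uint32_t branch_pc)
    {
        unsigned addr_part = (branch_pc >> 2) & ((1u << ADDR_BITS) - 1);
        return (addr_part << M) | global_hist;
    }

    /* After each branch resolves, shift its outcome into the history register. */
    static void record_outcome(bool taken)
    {
        global_hist = ((global_hist << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);
    }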

  18. Correlated predictor schematic • (Diagram: the branch-address bits drive the row select, the m-bit shift register of recent branch outcomes drives the column select, and each new branch outcome is shifted into the register)

  19. Correlated Predictors Do Better

  20. Branch-Target Buffers (BTB) • How to know the address of the next instruction as soon as the current instruction is fetched? • Normally, an extra (ID) cycle is needed to: • Determine that the fetched instruction is a branch • Determine whether the branch is taken • Compute the target address (PC+offset) • Branch prediction alone doesn’t help DLX • What if, instead, the next instruction address could be fetched at the same time that the current instruction is fetched? → BTB
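One way this could look as C, assuming a direct-mapped buffer of (branch address, target address) pairs with a valid bit (the structure and sizes are illustrative, not the slides' specific design):

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 512

    struct btb_entry {
        bool     valid;
        uint32_t branch_pc;   /* full address of the branch, used as the tag */
        uint32_t target_pc;   /* predicted target address */
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Consulted during IF, in parallel with the instruction fetch itself:
     * on a hit, fetch from the predicted target next cycle; on a miss,
     * just fetch the fall-through address PC+4 as usual. */
    static uint32_t next_fetch_pc(uint32_t pc)
    {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->branch_pc == pc)
            return e->target_pc;
        return pc + 4;
    }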

  21. BTB Schematic

  22. BTB Flowchart

  23. Penalties in different cases • Using a BTB. • If instruction not in BTB and branch not taken (case not shown), penalty is 0.

  24. Branch-Target Buffer Variants • Store target instructions instead of their addresses! • Saves on fetch time. • Permits branch folding - zero-cycle branches! • Substitute destination instruction for branch in pipeline! • Predicting register/indirect branches • E.g., abstract function calls, switch statements, procedure returns. • CPU-internal return-address stack
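A sketch of such a CPU-internal return-address stack (a small circular stack; the depth and names are illustrative):

    #include <stdint.h>

    #define RAS_DEPTH 16

    static uint32_t ras[RAS_DEPTH];
    static unsigned ras_top;                /* index of the next free slot; wraps around */

    /* On a predicted call: push the return address (the instruction after the call). */
    static void ras_push(uint32_t return_pc)
    {
        ras[ras_top] = return_pc;
        ras_top = (ras_top + 1) % RAS_DEPTH;
    }

    /* On a predicted return: pop the top entry and use it as the predicted target. */
    static uint32_t ras_pop(void)
    {
        ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
        return ras[ras_top];
    }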

  25. Return Address Prediction Stack

  26. Branch Prediction Styles • Local predictors (e.g., simple 2-bit) look only at the history of the particular branch in question • Global (e.g., correlating) predictors also look at other events that have happened in context • e.g., history of recent branch outcomes • Tournament predictors operate several branch predictors in parallel, • e.g., 1 local and 1 global, • and dynamically learn which one performs best for a given branch. • Tournament predictors are one type of multilevel branch predictors • These have 2 or more levels of branch-prediction tables

  27. FSM for Tournament Predictor • (Diagram: a 2-bit selection counter with states Counter = 0, 1, 2, 3; predictor 1/2 result status: 1 = prediction correct, 0 = prediction incorrect) • If predictor 1 is correct, counter = min(counter+1, 3); if predictor 2 is correct, counter = max(counter-1, 0)
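The counter update on this slide can be written out in C (a sketch; which counter values select which predictor is my reading of the FSM, not spelled out in the transcript):

    #include <stdbool.h>

    static unsigned sel[4096];              /* 2-bit selection counter per entry, in [0, 3] */

    /* Counter values 2-3 favor predictor 1, values 0-1 favor predictor 2 (assumed mapping). */
    static bool use_predictor1(unsigned idx)
    {
        return sel[idx] >= 2;
    }

    /* Net effect of the slide's rule: when the two predictors disagree, move the
     * counter toward the one that was correct; if both were right or both wrong,
     * the increment and decrement cancel and the counter stays put. */
    static void update_selector(unsigned idx, bool p1_correct, bool p2_correct)
    {
        if (p1_correct && !p2_correct && sel[idx] < 3) sel[idx]++;
        if (p2_correct && !p1_correct && sel[idx] > 0) sel[idx]--;
    }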

  28. Which predictor is selected?

  29. Comparing Predictor Types

  30. When the best predictor fails... • Even the best branch predictors have a non-zero miss rate! • What else can you do to improve these cases? • Another approach: Reduce the miss penalty to zero. • One way to reduce the miss penalty for branches: • Take both paths simultaneously! (Parallel speculative execution.) • Fetch (or pre-fetch) both possible next instructions • Begin executing both in parallel until the choice is known • May only work for constant (immed./PC-relative) branches • A branch to a computed EA may have too many destinations • May have a large penalty in energy, area, clock speed…
