1 / 45

ECE 4100/6100 Advanced Computer Architecture Lecture 5 Branch Prediction

ECE 4100/6100 Advanced Computer Architecture Lecture 5 Branch Prediction. Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology. Predict What?. Direction (1-bit) Single direction for unconditional jumps and calls/returns

galvin
Download Presentation

ECE 4100/6100 Advanced Computer Architecture Lecture 5 Branch Prediction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ECE 4100/6100Advanced Computer ArchitectureLecture 5 Branch Prediction Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

  2. Predict What? • Direction (1-bit) • Single direction for unconditional jumps and calls/returns • Binary for conditional branches • Target (32-bit or 64-bit addresses) • Some are easy • One: Uni-directional jumps • Two: Fall through (Not Taken) vs. Taken • Many: Function Pointer or Indirect Jump (e.g. jr r31)

  3. Categorizing Branches Source: H&P using Alpha

  4. Branch Misprediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 PC Next PC Fetch Drive Alloc Rename Schedule Dispatch Reg File Exec Flags Br Resolve Queue Single Issue

  5. Mispredict Branch Misprediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 PC Next PC Fetch Drive Alloc Rename Schedule Dispatch Reg File Exec Flags Br Resolve Queue Single Issue

  6. Mispredict Branch Misprediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 PC Next PC Fetch Drive Alloc Rename Schedule Dispatch Reg File Exec Flags Br Resolve Queue Single Issue (flush entailed instructions and refetch)

  7. Fetch the correct path Branch Misprediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 PC Next PC Fetch Drive Alloc Rename Schedule Dispatch Reg File Exec Flags Br Resolve Queue Single Issue

  8. Branch Misprediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 PC Next PC Fetch Drive Alloc Rename Schedule Dispatch Reg File Exec Flags Br Resolve Queue Single Issue Mispredict 8-issue Superscalar Processor (Worst case)

  9. Why Branch is Predictable? if (aa==2) aa = 0; if (bb==2) bb = 0; if (aa!=bb) …. for (i=0; i<100; i++) { …. } addi r2, r0, 2 bne r10, r2, L_bb xor r10, r10, r10 j L_exit L_bb: bne r11, r2, L_xx xor r11, r11, r11 j L_exit L_xx: beq r10, r11, L_exit … Lexit: addi r10, r0, 100 addi r1, r0, r0 L1: … … … … addi r1, r1, 1 bne r1, r10, L1 … …

  10. Control Speculation • Execute instruction beyond a branch before the branch is resolved  Performance • Speculative execution • What if mis-speculated? need • Recovery mechanism • Squash instructions on the incorrect path • Branch prediction: Dynamic vs. Static • What to predict?

  11. Static Branch Prediction • Uni-directional, always predict taken (or not taken) • Backward taken, Forward not taken • Need offset information (when?) • Compiler hints with branch annotation • When the info will be available? Post-decode?

  12. Simplest Dynamic Branch Predictor • Prediction based on latest outcome • Index by some bits in the branch PC • Aliasing NT 1-bit Branch History Table T T for (i=0; i<100; i++) { …. } NT T . . . 0x40010100 0x40010104 0x40010108 … 0x40010A04 0x40010A08 addi r10, r0, 100 addi r1, r1, r0 L1: … … … … addi r1, r1, 1 bne r1, r10, L1 … … T NT NT How accurate?

  13. Typical Table Organization PC(32 bits) Hash 2N entries . . . . . table update N bits FSM Update Logic Actual outcome Prediction

  14. Simplest Dynamic Branch Predictor for (i=0; i<100; i++) { if (a[i] == 0) { … } … } NT 1-bit Branch History Table T T NT 0x40010100 0x40010104 0x40010108 0x4001010c 0x40010110 0x40010210 0x40010B0c 0x40010B10 addi r10, r0, 100 addi r1, r1, r0 L1: add r21, r20, r1 lw r2, (r21) beq r2, r0, L2 … … j L3 L2: … … … L3: addi r1, r1, 1 bne r1, r10, L1 T . . . T NT NT

  15. 0 1 FSM of the Simplest Predictor • A 2-state machine • Change mind fast If branch taken If branch not taken 0 Predict not taken 1 Predict taken

  16. 1 1 1 1 1 1 1 1 1 0 0 Example using 1-bit branch history table addi r10, r0, 4 addi r1, r1, r0 L1: … … addi r1, r1, 1 bne r1, r10, L1 for (i=0; i<4; i++) { …. }            Pred 0 T T NT T NT T T T T T T Actual 60% accuracy

  17. 10/ WT 11/ ST 01/ WN 00/ SN 2-bit Saturating Up/Down Counter Predictor MSB: Direction bit LSB: Hysteresis bit Taken ST: Strongly Taken WT: Weakly Taken WN: Weakly Not Taken SN: Strongly Not Taken Not Taken Predict Not taken Predict taken

  18. 11/ ST 10/ WT 01/ WN 00/ SN 2-bit Counter Predictor (Another Scheme) Taken ST: Strongly Taken WT: Weakly Taken WN: Weakly Not Taken SN: Strongly Not Taken Not Taken Predict Not taken Predict taken

  19. 1 11 10 11 11 11 11 11 11 10 10 Example using 2-bit up/down counter addi r10, r0, 4 addi r1, r1, r0 L1: … … addi r1, r1, 1 bne r1, r10, L1 for (i=0; i<4; i++) { …. }            Pred 01 T T NT T NT T T T T T T Actual 80% accuracy

  20. aa2 bb=0 aa=0 bb=0 aa=0 bb2 aa2 bb2 Branch Correlation b1 Code Snippet 1 (T) 0 (NT) if (aa==2) // b1 aa = 0; if (bb==2) // b2 bb = 0; if (aa!=bb) { // b3 ……. } b2 b2 1 0 1 0 b3 b3 b3 b3 Path: A:1-1 C:0-1 D:0-0 B:1-0 • Branch direction • Not independent • Correlated to the path taken • Example: Path 1-1 of b3 can be surely known beforehand • Track path using a 2-bit register

  21. X X . . . . . . . . . . . . . . . . 2-bit counter 2-bit counter 2-bit counter 2-bit counter Correlated Branch Predictor [PanSoRahmeh’92] • (M,N) correlation scheme • M: shift register size (# bits) • N: N-bit counter 2-bit shift register (global branch history) Subsequent branch direction select Branch PC Branch PC X X Prediction Prediction w hash hash . . . . 2w 2-bit counter (2,2) Correlation Scheme 2-bit Sat. Counter Scheme

  22. 1 1 1 0 Two-Level Branch Predictor [YehPatt91,92,93] • Generalized correlated branch predictor • 1st level keeps branch history in Branch History Register (BHR) • 2nd level segregates pattern history in Pattern History Table (PHT) Pattern History Table (PHT) 00…..00 2N entries 00…..01 Branch History Register (BHR) (Shift left when update) 00…..10 Rc-1 Rc-k . . . . . N Prediction 11…..10 Current State PHT update 11…..11 Branch History Pattern FSM Update Logic Rc: Actual Branch Outcome

  23. Branch History Register • An N-bit Shift Register = 2N patterns in PHT • Shift-in branch outcomes • 1 taken • 0  not taken • First-in First-Out • BHR can be • Global • Per-set • Local (Per-address)

  24. Pattern History Table • 2N entries addressed by N-bit BHR • Each entry keeps a counter (2-bit or more) for prediction • Counter update: the same as 2-bit counter • Can be initialized in alternate patterns (01, 10, 01, 10, ..) • Alias (or interference) problem

  25. . . . . . . . . . . . . . . . . . . . . . Global History Schemes GAg GAs GAp Per-set PHTs (SPHTs) Per-addr PHTs (PPHTs) SetP(B) Addr(B) Global PHT Global BHR Global BHR Global BHR .. .. Set can be determined by branch opcode, compiler classification, or branch PC address. * [PanSoRahmeh’92] similar to GAp

  26. GAs Two-Level Branch Prediction The 2 LSBs are insignificant for 32-bit instruction PHT 00000000 00000001 00000010 PC = 0x4001000C . . . 00110110 10 00110110 00110111 0110 . . BHR 11111101 11111110 11111111 MSB = 1 Predict Taken

  27. Predictor Update (Actually, Not Taken) PHT 00000000 00000001 00000010 Wrong Prediction PC = 0x4001000C . . . 00111100 00110110 10 01 00110110 decremented 00110111 1100 0110 . . 00111100 BHR 11111101 11111110 11111111 • Update Predictor after branch is resolved

  28. . . . . . . . . . . . . . . . . . . . . . Per-Address History Schemes PAp PAg PAs Per-set PHTs (SPHTs) Per-addr PHTs (PPHTs) SetP(B) Addr(B) Per-addr BHT (PBHT) Global PHT Per-addr BHT (PBHT) Per-set BHT (PBHT) . . . . . . . . . Addr(B) Addr(B) Addr(B) .. .. • Ex: Alpha 21264’s local predictor • Ex: P6, Itanium

  29. PHT PC = 1110 0000 1001 1001 0010 1100 1110 1000 00000000 00000001 00000010 . . . 11010110 000 001 110 010 11010101 11 11010110 011 . . 100 101 11111101 110 MSB = 1 Predict Taken 11111110 111 BHT 11111111 PAs Two-Level Branch Predictor

  30. . . . . . . . . . . . . . . . . . . . . . Per-Set History Schemes SAp SAg SAs Per-set PHTs (SPHTs) Per-addr PHTs (PPHTs) SetP(B) Addr(B) Per-set BHT (SBHT) Global PHT Per-set BHT (SBHT) Per-set BHT (SBHT) . . . . . . . . . SetH(B) SetH(B) SetH(B) .. ..

  31. PHT Indexing • Tradeoff between more history bits and address bits • Too many bits needed in Gselect  sparse table entries Insufficient History

  32. Gshare Branch Predictor [McFarling93] • Tradeoff between more history bits and address bits • Too many bits needed in Gselect  sparse table entries • Gshare  Not to lose global history bits • Ex: AMD Athlon, MIPS R12000, Sun MAJC, Broadcom SiByte’s SB-1 Gselect 4/4: Index PHT by concatenate low order 4 bits Gshare 8/8: Index PHT by {Branch address  Global history}

  33. . . . . . 1 1 1 0 1 1 . . . . . 1 0 1 . . . . . 0 1 0 0 Gshare Branch Predictor PHT PC Address . . .  00 . . Global BHR MSB = 0 Predict Not Taken

  34. Aliasing Example PHT (indexed by 10) PHT Gshare GAp 0000 0000 BHR 1101 PC 0110 ---- || 1001 0001 BHR 1101 PC 0110 ---- XOR 1011 0001 0010 0010 0011 0011 0100 0100 0101 0101 0110 0110 0111 0111 1000 1000 1001 1001 BHR 1001 PC 1010 ---- || 1001 BHR 1001 PC 1010 ---- XOR 0011 1010 1010 1011 1011 1100 1100 1101 1101 1110 1110 1111 1111

  35. Hybrid Branch Predictor [McFarling93] Branch PC P0 P1 . . . Final Prediction Choice (or Meta) Predictor • Some branches correlated to global history, some correlated to local history • Only update the meta-predictor when 2 predictors disagree

  36. Alpha 21264 (EV6) Hybrid Predictor • A “tournament branch predictor” • Multi-predictor scheme w/ • Local predictor (~PAg) • Self-correlation • Global predictor • Inter-correlation • Choice predictor as the decision maker: a 2-bit sat. counter to credit either local or global predictors. • Die size impact • History info tables ~2% • BTB ~ 2.7% (associated with I-$ on a per-line basis) • 2 cycle latency, we will discuss more later PC Global history 12 Local History Table 1024 x 10 bits Single Local Predictor 1024 x 3 bits Global Predictor 4096 x 2 bits Choice Predictor 4096 x 2 bits 10 Global prediction Local prediction Meta prediction Final Branch Prediction Next Line/set Prediction For Single-cycle Prediction L1 I-cache (64KB 2w) & TLB Virtual address 4 instr./cycle

  37. Alpha EV8 Branch Predictor • Real silicon never sees the daylight • Use a 2Bc-gskew predictor (one form of enhanced gskew) • Bimodal predictor used as (1) static biased predictor and (2) part of e-gskew predictor • Global predictors G0 and G1 are part of e-gskew predictor • Table sizes: 352Kbits in total (208Kbits for prediction table; 144Kbits for hysteresis table.) Branch PC Global history G0 G1 Meta Bimodal F3 F2 F1 F4 e-gskew predictor majority vote prediction

  38. Branch Target Prediction • Try the easy ones first • Direct jumps • Call/Return • Conditional branch (bi-directional) • Branch Target Buffer (BTB) • Return Address Stack (RAS)

  39. Branch Target Buffer (BTB) Branch PC BTB Tag Target Tag Target … Tag Target 4 + Predicted Branch Direction = = … = Branch Target 1 0

  40. Return Address Stack (RAS) • Different call sites make return address hard to predict • Printf() being called by many callers • The target of “return” instruction in printf() is a moving target • A hardware stack (LIFO) • Call will push return address on the stack • Return uses the prediction off of TOS

  41. Return PC BTB Return? Return Address Stack Call PC 4 BTB + Push Return Address • Does it always work? • Call depth • Setjmp/Longjmp • Speculative call? • May not know it is a return instruction prior to decoding • Rely on BTB for speculation • Fix once recognize Return

  42. Indirect Jump • Need Target Prediction • Many (potentially 230 for 32-bit machine) • In reality, not so many • Similar to predicting values • Tagless Target Prediction • Tagged Target Prediction

  43. . . . . . 1 1 1 0 Tagless Target Prediction [ChangHaoPatt’97] • Modify the PHT to be a “Target Cache” • (indirect jump) ? (from target cache) : (from BTB) • Alias? Target Cache (2N entries) PC  BHR Pattern 00…..00 00…..01 Branch PC 00…..10 Predicted Target Address Hash 11…..10 Branch History Register (BHR) 11…..11

  44. . . . . . 1 1 1 0 Tagged Target Prediction [ChangHaoPatt’97] • To reduce aliasing with set-associative target cache • Use branch PC and/or history for tags Target Cache (2n entries per way) 00…..00 00…..01 Branch PC Predicted Target Address 00…..10 Hash n =? 11…..10 BHR 11…..11 Tag Array

  45. Multiple Branch Prediction • For a really wide machine • Across several basic blocks • Need to predict multiple branches per cycle • How to fetch non-contiguous instructions in one cycle? • Prediction accuracy extremely critical (will be reduced geometrically)

More Related