CDA 5155

Week 4: Branch Prediction

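[Figure: pipelined datapath with the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers. Labeled blocks: PC, instruction memory, register file, sign extend, ALU, data memory, control, adders, and multiplexers. Labeled signals: beq, target, bpc, eq?]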

Branch Target Buffer
- Fetch PC: send the PC to the BTB.
- Found? Yes: use the stored target. No: use PC+1.
- The result is the predicted target PC.
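
The lookup flow above can be sketched in a few lines. This is a minimal illustration, not the design from the lecture: the direct-mapped organization, the 16-entry size, the use of the full PC as the tag, and the class name `BranchTargetBuffer` are all assumptions made for the example.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: maps a fetch PC to a predicted target."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        # Each entry holds (tag, target), or None when empty.
        self.entries = [None] * num_entries

    def predict(self, pc):
        """Return the stored target on a hit, otherwise PC+1."""
        entry = self.entries[pc % self.num_entries]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return pc + 1

    def update(self, pc, target):
        """Record the target of a taken branch (replaces an aliasing entry)."""
        self.entries[pc % self.num_entries] = (pc, target)


btb = BranchTargetBuffer()
miss_prediction = btb.predict(40)  # PC 40 is not in the BTB yet
btb.update(40, 12)                 # branch at PC 40 was taken to PC 12
hit_prediction = btb.predict(40)
alias_prediction = btb.predict(56)  # same index as PC 40, different tag
```

The tag check matters: PC 56 maps to the same entry as PC 40 here, and without the tag it would wrongly inherit that branch's target.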

Branch prediction
- Predict not taken: ~50% accurate. No BTB needed; always use PC+1.
- Predict backward taken: ~65% accurate. BTB holds targets for backward branches (loops).
- Predict same as last time: ~80% accurate. Update BTB for any taken branch.
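
The first two policies need no history at all, which a short sketch makes concrete. The function names and the convention that the target is already known at prediction time are assumptions for illustration; in hardware the target would come from the BTB.

```python
def predict_not_taken(pc, target):
    """Always fall through to the next instruction."""
    return pc + 1


def predict_backward_taken(pc, target):
    """Take the branch only when it jumps backward (a likely loop)."""
    return target if target < pc else pc + 1


loop_back = predict_backward_taken(40, 12)   # backward branch: predict target
forward = predict_backward_taken(40, 60)     # forward branch: predict PC+1
never = predict_not_taken(40, 12)
```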

What about indirect branches?
- Could use the same approach; PC+1 is an unlikely indirect target.
- Indirect jumps often have multiple targets (for the same instruction):
  - Switch statements
  - Virtual function calls
  - Shared library (DLL) calls

Indirect jump: special case
- Return address stack (RAS)
- Function returns have deterministic behavior (usually):
  - They return to different locations (so the BTB doesn't work well).
  - The return location is known ahead of time: it is in some register at the time of the call.
- Build a specialized structure for return addresses:
  - Call instructions write the return address to R31 AND the RAS.
  - Return instructions pop the predicted target off the stack.
- Issues: finite size (save or forget on overflow?)
- Issues: long jumps (clear when wrong?)
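
A small sketch of the push/pop behavior, including one possible answer to the overflow question. The class name, the depth, and the "forget the oldest entry" overflow policy are assumptions chosen for the example; the slide leaves the policy open.

```python
class ReturnAddressStack:
    """Fixed-depth return address stack sketch."""

    def __init__(self, depth=4):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        # A call pushes its return address (PC+1 in the slides' notation).
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed overflow policy: forget the oldest
        self.stack.append(call_pc + 1)

    def on_return(self):
        # A return pops the predicted target; None means no prediction.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
ras.on_call(100)
ras.on_call(200)
ras.on_call(300)  # overflow: return address 101 is forgotten
predictions = [ras.on_return(), ras.on_return(), ras.on_return()]
```

With a depth of 2, the two innermost returns are predicted and the outermost one has no prediction, which is the cost of forgetting on overflow.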

Costs of branch prediction/speculation
- Performance costs? Minimal: there is no difference between waiting and squashing, and it is a huge gain when the prediction is correct!
- Power? Large: in very long/wide pipelines many instructions can be squashed.
  - Squashed = # mispredictions × pipeline length/width before target resolved

Costs of branch prediction/speculation
- Area? Can be large: predictors can get very big, as we will see next time.
- Complexity? Designs are more complex. Testing becomes more difficult, but …

What else can be speculated?
- Dependencies: "I think this data is coming from that store instruction."
- Values: "I think I will load a 0 value."
- Accuracy?
  - Branch prediction (direction) is Boolean (T, NT).
  - Branch targets are stable or predictable (RAS).
  - Dependencies are limited.
  - Values cover a huge space (0 to 4B).

Parts of the branch predictor
- Direction predictor
  - For conditional branches
  - Predicts whether the branch will be taken
  - Examples: always taken; backwards taken
- Address predictor
  - Predicts the target address (used if predicted taken)
  - Examples: BTB; return address stack; precomputed branch
- Recovery logic
Ref: The Precomputed Branch Architecture

Characteristics of branches
- Individual branches differ.
- Loops tend not to exit.
  - Unoptimized code: not-taken
  - Optimized code: taken
- If-statements tend to be less predictable.
- Unconditional branches still need address prediction.

Example: gzip loop branch A @ 0x1200098d8
- Executed: 1359575 times
- Taken: 1359565 times
- Not-taken: 10 times
- % time taken: 99% - 100%
- Easy to predict (direction and address)

Example: gzip if branch B @ 0x12000fa04
- Executed: 151409 times
- Taken: 71480 times
- Not-taken: 79929 times
- % time taken: ~47%
- Easy to predict? (maybe not / maybe dynamically)

Example: gzip
- [Figure: branches A and B marked on an axis from 0 to 100, with "Easy to predict" labels.]
- Direction prediction: always taken
- Accuracy: ~73%
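
The accuracy of always-taken can be checked against the counts given for branches A and B. The sketch below computes it two ways. Reading the ~73% figure as the unweighted mean of the two per-branch rates is an assumption of this example, not something the slides state; weighting by execution count gives a noticeably higher number because branch A dominates.

```python
# Counts copied from the two gzip example slides.
branches = {
    "A": {"executed": 1359575, "taken": 1359565},
    "B": {"executed": 151409, "taken": 71480},
}

# Always-taken is correct exactly when the branch is taken.
per_branch = {name: b["taken"] / b["executed"] for name, b in branches.items()}

# Mean of the per-branch rates, each branch counted once.
unweighted = sum(per_branch.values()) / len(per_branch)

# Rate over all dynamic executions, so frequent branches count more.
weighted = (sum(b["taken"] for b in branches.values())
            / sum(b["executed"] for b in branches.values()))

print(f"A: {per_branch['A']:.4%}  B: {per_branch['B']:.2%}")
print(f"unweighted mean: {unweighted:.1%}, execution-weighted: {weighted:.1%}")
```

The unweighted mean comes out near 73.6% and the execution-weighted rate near 94.7%.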

Branch backwards
- Most backward branches are heavily TAKEN.
- Forward branches are slightly more likely to be NOT-TAKEN.
Ref: The Effects of Predicated Execution on Branch Prediction

Using history: 1-bit history (direction predictor)
- Remember the last direction for a branch.
- [Figure: branchPC indexes the Branch History Table; each entry holds one bit, NT or T.]
- How big is the BHT?
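
A minimal sketch of a 1-bit BHT, run on a synthetic loop branch. The table size, the initial "taken" state, the indexing by low PC bits, the made-up branch PC, and the trace itself (9 taken outcomes followed by one loop exit, repeated 10 times) are all assumptions for illustration.

```python
class OneBitBHT:
    """Branch History Table with one 'last direction' bit per entry."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [True] * entries  # assumed to start at 'taken'

    def predict(self, pc):
        return self.table[pc % self.entries]

    def update(self, pc, taken):
        self.table[pc % self.entries] = taken


def count_mispredictions(bht, pc, outcomes):
    """Predict, compare, then update, for one branch's outcome stream."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses


# Synthetic loop branch: taken 9 times, then not taken at loop exit.
loop_trace = ([True] * 9 + [False]) * 10
loop_misses = count_mispredictions(OneBitBHT(), 0x40, loop_trace)
```

Under these assumptions each loop exit costs two mispredictions: one at the exit itself and one when the loop is entered again, because the single bit was flipped by the exit.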

Example: gzip
- [Figure: branches A and B marked on an axis from 0 to 100.]
- Direction prediction: always taken
- Accuracy: ~73%
- How many times will branch A mispredict?
- How many times will branch B mispredict?

Using history: 2-bit history (direction predictor)
- States: SN, NT, T, ST
- [Figure: branchPC indexes the Branch History Table; each entry holds a 2-bit state.]
- How big is the BHT?
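
The four states can be modeled as a saturating counter. The sketch below handles a single BHT entry and reuses a synthetic loop trace (9 taken, then one exit, repeated 10 times). The numeric encoding of the states, the starting state, and the trace are assumptions for illustration.

```python
def simulate_two_bit(outcomes, state=3):
    """Count mispredictions of one 2-bit saturating counter.

    States: 0 = SN, 1 = NT, 2 = T, 3 = ST. Predict taken when state >= 2.
    """
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# Synthetic loop branch: taken 9 times, then not taken at loop exit.
loop_trace = ([True] * 9 + [False]) * 10
loop_misses = simulate_two_bit(loop_trace)
```

Here a loop exit only moves the counter from ST to T, so the next loop entry is still predicted taken and each exit costs one misprediction instead of two.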

Example: gzip
- [Figure: branches A and B marked on an axis from 0 to 100.]
- Direction prediction: always taken
- Accuracy: ~73%
- How many times will branch A mispredict?
- How many times will branch B mispredict?

Using history patterns
- ~80 percent of branches are either heavily TAKEN or heavily NOT-TAKEN.
- For the other 20%, we need to look at patterns of reference to see if they are predictable using a more complex predictor.
- Example: gcc has a branch that flips each time, T(1) NT(0): 10101010101010101010101010101010101010
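
To see why a flipping branch needs a more complex predictor, the sketch below runs simple 1-bit and 2-bit predictors on an alternating trace like the one above. The trace length and both starting states are assumptions; a different starting state changes the 2-bit result.

```python
def one_bit_correct(outcomes, last=True):
    """Correct predictions for 'same as last time' on one branch."""
    correct = 0
    for taken in outcomes:
        correct += (last == taken)
        last = taken
    return correct


def two_bit_correct(outcomes, state=3):
    """Correct predictions for a 2-bit saturating counter (0..3)."""
    correct = 0
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


# Alternating trace, 1 = taken, 0 = not taken.
outcomes = [bit == "1" for bit in "10" * 19]
one_bit = one_bit_correct(outcomes)
two_bit = two_bit_correct(outcomes)
```

Starting from "taken", the 1-bit predictor is right only on the first outcome and wrong on every flip after it. The 2-bit counter starting at ST oscillates between ST and T, so it gets every taken outcome right and every not-taken outcome wrong.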

Local history
- [Figure: branchPC indexes the Branch History Table; the selected history (10101010) indexes the Pattern History Table, whose entry gives NT or T.]
- What is the prediction for this BHT entry, 10101010?
- When do I update the tables?

Local history
- [Figure: branchPC indexes the Branch History Table; the selected history (01010101) indexes the Pattern History Table, whose entry gives NT or T.]
- On the next execution of this branch instruction, the branch history table entry is 01010101, pointing to a different pattern.
- What is the accuracy of a flip/flop branch 0101010101010…?
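
A two-level local predictor can be sketched as a table of per-branch history registers feeding a table of 2-bit counters. The table sizes, the initial counter value, the made-up branch PC, and the choice to update both tables when the branch outcome is known are assumptions for illustration.

```python
class LocalHistoryPredictor:
    """BHT of per-branch histories indexing a PHT of 2-bit counters."""

    def __init__(self, history_bits=8, bht_entries=64):
        self.mask = (1 << history_bits) - 1
        self.bht_entries = bht_entries
        self.bht = [0] * bht_entries          # per-branch history registers
        self.pht = [1] * (1 << history_bits)  # counters, assumed to start at NT

    def predict(self, pc):
        history = self.bht[pc % self.bht_entries]
        return self.pht[history] >= 2

    def update(self, pc, taken):
        i = pc % self.bht_entries
        history = self.bht[i]
        # Train the counter selected by the OLD history, then shift it.
        counter = self.pht[history]
        self.pht[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.bht[i] = ((history << 1) | int(taken)) & self.mask


predictor = LocalHistoryPredictor()
outcomes = [step % 2 == 0 for step in range(200)]  # T, NT, T, NT, ...
miss_steps = []
for step, taken in enumerate(outcomes):
    if predictor.predict(0x40) != taken:
        miss_steps.append(step)
    predictor.update(0x40, taken)

late_misses = [s for s in miss_steps if s >= 100]
```

Once the history register fills with the alternating pattern, the two histories 10101010 and 01010101 each train their own counter, and the flip/flop branch is predicted without further misses in this run.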

Global history
- [Figure: the Branch History Register (01110101) indexes the Pattern History Table.]
- Nested loop example:
    for (i=0; i<100; i++)
      for (j=0; j<3; j++)
  - j<3 with j = 1: history 1101 → taken
  - j<3 with j = 2: history 1011 → taken
  - j<3 with j = 3: history 0111 → not taken
  - i<100: history 1110 → usually taken
- Another example:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { …
- How can branches interfere with each other?
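
The nested-loop example can be replayed with a 4-bit global history register. In this sketch the PHT is indexed by history alone, the counters start at weakly taken, and the outcome stream is generated directly rather than from real code; all three are assumptions for illustration.

```python
def run_global_history(outcomes, history_bits=4):
    """Return the steps at which a global-history predictor mispredicts."""
    mask = (1 << history_bits) - 1
    history = 0
    pht = [2] * (1 << history_bits)  # 2-bit counters, assumed weakly taken
    miss_steps = []
    for step, taken in enumerate(outcomes):
        if (pht[history] >= 2) != taken:
            miss_steps.append(step)
        pht[history] = min(pht[history] + 1, 3) if taken else max(pht[history] - 1, 0)
        history = ((history << 1) | int(taken)) & mask
    return miss_steps


# Outcome stream of the two loop branches, in execution order.
stream = []
for i in range(100):
    stream += [True, True, False]  # j<3 branch for j = 1, 2, 3
    stream.append(i < 99)          # i<100 branch, not taken on the last pass

miss_steps = run_global_history(stream)
```

After the first two inner-loop exits train the counter for history 0111, the only remaining misprediction in this run is the final exit of the outer loop.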

Gshare predictor (must read!)
- [Figure: branchPC is XORed with the Branch History Register (01110101); the result indexes the Pattern History Table.]
Ref: Combining Branch Predictors
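
The XOR indexing can be sketched as follows. The 8-bit index width, the initial counter value, the class name `Gshare`, and the branch PC `0x4C` are assumptions made for the example; only the history value 01110101 comes from the slide.

```python
class Gshare:
    """Gshare sketch: PHT indexed by (branch PC XOR global history)."""

    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1
        self.history = 0                    # global Branch History Register
        self.pht = [2] * (1 << index_bits)  # 2-bit counters, assumed weakly taken

    def index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        self.pht[i] = min(self.pht[i] + 1, 3) if taken else max(self.pht[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = Gshare(index_bits=8)
predictor.history = 0b01110101       # history value shown on the slide
slide_index = predictor.index(0x4C)  # 0x4C is a made-up branch PC
predictor.update(0x4C, taken=False)
history_after = predictor.history
```

Mixing the PC into the index lets two branches that see the same global history land on different counters, which reduces the interference that a history-only index suffers from.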

Bi-mode predictor
- [Figure: branchPC XORed with the global history register indexes two PHTs, one skewed taken and one skewed not-taken; a choice predictor drives the mux that selects between them.]

Hybrid predictors
- Prediction 1: global/gshare predictor (much more state)
- Prediction 2: local predictor (e.g. 2-bit)
- Selection table (2-bit state machine) picks the final prediction.
- How do you select which predictor to use?
- How do you update the various predictors and the selector?
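
One common answer to the selection question is to train the 2-bit selector only when the two predictors disagree in correctness. The sketch below shows that policy for a single selector entry. The encoding (0-1 favors prediction 1, 2-3 favors prediction 2) and the sample inputs are assumptions, and the component predictors are replaced by hard-coded predictions.

```python
def choose(selector, prediction_1, prediction_2):
    """Pick prediction 2 when the selector is in its upper half."""
    return prediction_2 if selector >= 2 else prediction_1


def update_selector(selector, prediction_1, prediction_2, taken):
    """Move toward whichever predictor was right when only one was."""
    correct_1 = prediction_1 == taken
    correct_2 = prediction_2 == taken
    if correct_2 and not correct_1:
        return min(selector + 1, 3)
    if correct_1 and not correct_2:
        return max(selector - 1, 0)
    return selector  # both right or both wrong: no information


selector = 0
chosen = []
# Each tuple: (prediction 1, prediction 2, actual outcome).
samples = [
    (True, False, False),
    (True, False, False),
    (False, False, False),
    (False, True, False),
]
for p1, p2, taken in samples:
    chosen.append(choose(selector, p1, p2))
    selector = update_selector(selector, p1, p2, taken)
```

In this sketch both component predictors would still be updated on every branch; only the selector update is conditional.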

Overriding predictors
- Big predictors are slow, but more accurate.
- Use a single-cycle predictor in fetch.
- Start the multi-cycle predictor.
- When it completes, compare it to the fast prediction.
  - If same, do nothing.
  - If different, assume the slow predictor is right and flush the pipeline.
- Advantage: reduced branch penalty for those branches mispredicted by the fast predictor and correctly predicted by the slow predictor.

Pipelined gshare predictor
- How can we get a pipelined global prediction by stage 1?
- Start in stage -2.
- Don't have the most recent branch history…
- Access multiple entries.
  - E.g. if we are missing the last three branches, get 8 histories and pick between them during the fetch stage.
Ref: Reconsidering Complex Branch Predictors
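
The "get 8 and pick" idea can be sketched as an early read with stale history followed by a late select. This is only an illustration of the principle under assumed parameters (8-bit index, three missing history bits, a made-up PC and PHT contents); it is not the exact organization from the referenced paper.

```python
INDEX_BITS = 8
MASK = (1 << INDEX_BITS) - 1


def direct_lookup(pht, pc, history):
    """Ordinary gshare read with the fully up-to-date history."""
    return pht[(pc ^ history) & MASK]


def early_read(pht, pc, stale_history, missing_bits):
    """Read one candidate per possible value of the missing history bits."""
    candidates = []
    for recent in range(1 << missing_bits):
        full_history = (stale_history << missing_bits) | recent
        candidates.append(pht[(pc ^ full_history) & MASK])
    return candidates


def late_select(candidates, recent):
    """Pick the candidate matching the branch outcomes that arrived late."""
    return candidates[recent]


pht = [i % 4 for i in range(1 << INDEX_BITS)]  # arbitrary counter values
history = 0b01110101
stale = history >> 3      # history as known three branches earlier
recent = history & 0b111  # the three newest outcomes, known only at fetch

candidates = early_read(pht, 0x4C, stale, missing_bits=3)
picked = late_select(candidates, recent)
```

The slow table access starts before the last three outcomes are known, and the fetch stage only needs a fast 8-to-1 selection.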

Exceptions
- Exceptions are events that are difficult or impossible to manage in hardware alone.
- Exceptions are usually handled by jumping into a service (software) routine.
- Examples: I/O device request, page fault, divide by zero, memory protection violation (seg fault), hardware failure, etc.

Taking an exception
- Once an exception occurs, how does the processor proceed?
- Non-pipelined: don't fetch from PC; save state; fetch from interrupt vector table.
- Pipelined: depends on the exception.
  - Precise interrupt: must stop all instructions "after the exception" (squash).
    - Divide by zero: flush fetch/decode.
    - Page fault: (fetch or mem stage?)
  - Save state after the last instruction before the exception completes (PC, regs).
  - Fetch from interrupt vector table.

How Much ILP is There?

ALU operation GOOD, branch BAD
- Expected number of branches between mispredicts: E(X) ~ 1/(1-p)
- E.g., p = 95%: E(X) ~ 20 branches, 100-ish instructions
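
The arithmetic behind the example can be checked with a short sketch. The function name is made up, and the figure of one branch per five instructions is an assumption inferred from the slide's "20 branches, 100-ish instructions".

```python
def expected_branches_between_mispredicts(p):
    """E(X) ~ 1/(1-p) for prediction accuracy p, with 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("accuracy p must be in [0, 1)")
    return 1.0 / (1.0 - p)


INSTRUCTIONS_PER_BRANCH = 5  # assumed: roughly one branch per 5 instructions

branches = expected_branches_between_mispredicts(0.95)
instructions = branches * INSTRUCTIONS_PER_BRANCH

# Raising accuracy from 95% to 99% stretches the gap fivefold.
branches_at_99 = expected_branches_between_mispredicts(0.99)
```

Because the gap grows as 1/(1-p), small gains in accuracy near the top of the range have a large effect on how much uninterrupted work the pipeline sees.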

How Accurate are Branch Predictors?
