Advanced Microarchitecture Lecture 5: Advanced Fetch
Branch Predictions Can Be Wrong • How/when do we detect a misprediction? • What do we do about it? • resteer fetch to the correct address • hunt down and squash instructions from the wrong path
Example Control Flow • [Figure: control-flow graph with a branch in block A; the correct path and the predicted path diverge at the branch and continue through blocks B, C, D, E, F, G]
Simple Pipeline • Stages: Fetch (IF), Decode (ID), Dispatch (DP), Execute (EX) • [Figure: a predicted branch flows down the pipeline while blocks A, B, D are fetched behind it; the misprediction is not detected until the branch reaches EX] • Multiple speculatively fetched basic blocks may be in flight at the same time!
In More Detail • IF: direction prediction, target prediction • ID: we now know whether the branch is a return, an indirect jump, or a phantom branch (RAS, iBTB); squash instructions in BP and I$ lookup; resteer BP to the new target from the RAS/iBTB • DP: if the target is indirect, can potentially read the target from the RF; squash instructions in BP, I$, and ID; resteer BP to the target from the RF • EX: detect a wrong direction, or a wrong target (indirect); squash instructions in BP, I$, ID, and DP, plus the RS and ROB; resteer BP to the correct next PC
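To make the stage-by-stage squash scope above concrete, here is a tiny Python sketch; the dictionary layout and function name are illustrative assumptions, not the lecture's hardware.

```python
# Hypothetical mapping from the stage where a bad prediction is detected
# to the structures that must be squashed and the source of the resteer target.
SQUASH_SCOPE = {
    "ID": {"squash": ["BP", "I$-lookup"],                     "resteer_from": "RAS/iBTB"},
    "DP": {"squash": ["BP", "I$-lookup", "ID"],               "resteer_from": "RF (indirect target)"},
    "EX": {"squash": ["BP", "I$-lookup", "ID", "DP", "RS", "ROB"],
           "resteer_from": "computed next PC"},
}

def recover(stage):
    info = SQUASH_SCOPE[stage]
    print(f"detected at {stage}: squash {', '.join(info['squash'])}; "
          f"resteer front-end to target from {info['resteer_from']}")

for s in ("ID", "DP", "EX"):
    recover(s)
```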
Phantom Branches • May occur when performing multiple branch predictions per cycle: 4 predictions corresponding to the 4 possible branch positions A, B, C, D in the fetch group • [Figure: the I$ line actually holds ADD, BR, XOR, BR; the predictor returns N, N, T, T for the four positions] • Predicting position C taken steers fetch to ABCX… (C appears to be a branch) • After fetch, we discover C cannot be taken because it is not even a branch! This is a phantom branch • Should have fetched ABCDZ…
Hardware Organization • [Figure: front-end block diagram: the PC feeds the I$ and the BPred/BTB; the next PC (NPC) is selected from PC + sizeof(I$-line) when there is no branch, the BTB target for conditional/unconditional branches, the RAS (push on call, pop on return) for returns, the iBTB for indirect jumps, or the actual target from EX when control disagrees with the prediction]
Recovery • Squashing instructions in the front-end pipeline: [Figure: a mispredict is detected at EX; the wrong-path fetch groups in IF, ID, and DS (e.g. EFGH, WXYZ, QRST, KLMN) are converted into nops] • The nops are filtered out, so they do not take up RS and ROB entries • What about instructions that are already in the RS, ROB, and LSQ?
Wait for Drain • Squash the in-order front-end (as before) • Stall dispatch (no new instructions enter the ROB or RS) • Let the OOO engine execute as usual • Let commit operate as usual, except: • check for the mispredicted branch • cannot commit any instructions after it • but once the mispredicted branch has committed, any remaining instructions in the ROB, RS, and LSQ must be on the wrong path • flush the OOO engine • allow dispatch to continue
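A minimal sketch of the drain policy above, assuming a simplified in-order ROB model; the Inst class and the example instruction stream are illustrative.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Inst:
    name: str
    is_branch: bool = False
    mispredicted: bool = False

def wait_for_drain(rob: deque):
    """Stall dispatch, commit in order until the mispredicted branch retires,
    then flush everything still in the OOO engine and resume dispatch."""
    print("dispatch stalled")
    while rob:
        inst = rob.popleft()                 # commit in program order
        print("commit", inst.name)
        if inst.is_branch and inst.mispredicted:
            # Anything left in the ROB/RS/LSQ is on the wrong path: flush it.
            print("flush wrong path:", [i.name for i in rob])
            rob.clear()
            break
    print("dispatch resumes on the correct path")

rob = deque([Inst("LOAD"), Inst("ADD"),
             Inst("BR", is_branch=True, mispredicted=True),
             Inst("XOR"), Inst("SUB"), Inst("ST")])
wait_for_drain(rob)
```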
Wait for Drain (2) • Simple to implement! • But it degrades performance: [Figure: commit timeline comparing the ideal recovery (LOAD, ADD, BR commit and the wrong-path junk is squashed immediately) against drain-and-wait, where the wrong-path XOR, LOAD, SUB, ST, BR must drain out of the machine first] • What if a load older than the mispredicted branch has a cache miss and goes to main memory? Recovery is delayed until it retires
Branch Tags/IDs/Colors • Each instruction fetched is assigned the "current branch tag" • Each predicted branch causes a new branch tag to be allocated (and becomes the current tag) • [Figure: ROB entries labeled with tag groups 1 1 1 1 1 | 2 2 2 2 2 2 2 | 4 4 4 | 7 7 7 7 7 | 5 | 3 3 3 3, each group starting at a branch] • Tags might not necessarily be in any particular order
Branch Tags (2) • [Figure: same ROB tag layout (1 1 1 1 1 2 2 2 2 2 2 2 4 4 4 7 7 7 7 7 5 3 3 3 3); a tag list records the allocation order 1, 2, 4, 7, 5, 3] • On a mispredict, the tag list identifies all tags allocated after the mispredicted branch (here 7, 5, 3), and every instruction carrying one of those tags must be squashed
Overkill for ROB / LSQ • The ROB and LSQ keep instructions in program order (more on this in a future lecture) • All instructions physically after the mispredicted branch should be squashed … simple! • Some sort of tagging/coloring is still useful for the RS: • instructions in the RS may be in arbitrary order • there may be multiple sets of RSs (e.g. integer RS and FP RS alongside the ROB)
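A small sketch of tag-based squashing for the RS, under the assumption that each RS entry records the branch tag it was fetched under and that a misprediction supplies the set of tags to invalidate; the entry contents are illustrative.

```python
# Each RS entry carries the branch tag that was current when it was fetched.
rs_entries = [("add", 1), ("xor", 2), ("ld", 4), ("sub", 7), ("st", 5), ("br", 3)]

def squash_rs(entries, squashed_tags):
    """Keep only entries whose tag is not in the squashed set.
    The ROB/LSQ do not need this: being in program order, they can simply
    discard everything physically after the mispredicted branch."""
    return [(op, tag) for (op, tag) in entries if tag not in squashed_tags]

# Tag list in allocation order: 1, 2, 4, 7, 5, 3.  If the branch that
# allocated tag 7 mispredicts, tags 7, 5, and 3 are all invalidated.
print(squash_rs(rs_entries, squashed_tags={7, 5, 3}))
```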
Hardware Complexity • [Figure: each RS entry compares its own tag against every "invalidate tag 0/1/2/…" broadcast line; any match asserts squash] • Both the height and the width of this invalidation logic grow with the number of branch tags, so the overall area overhead is quadratic in the tag count
Simplifications • For a ROB with n entries, there could potentially be n different branches, each requiring a unique tag • In practice, only a fraction of instructions are branches, so limit to k < n tags instead • If a (k+1)st branch is fetched, dispatch must be stalled until a tag has been deallocated
Simplifications (2) • For k tags, may need to broadcast all of them if the oldest branch mispredicted, resulting in O(k²) overhead • Limit to only one (for example) broadcast per cycle • [Figure: tags 7, 5, 3 are invalidated one per cycle, and fetch resumes afterwards]
Branch Predictor Latency • To provide a continuous stream of instructions, the branch predictor must make one prediction every cycle • Pipelining? • Nope. If the current prediction is not-taken, the next PC is A; if taken, the next PC is B. A dependency exists between successive predictions • This limits predictor size/latency • A smaller predictor is less accurate • Or pay a clock-frequency penalty
Ahead Prediction • Normally: • PC1 → PC2 → PC3 → PC4 → PC5 … • Each "→" is a prediction that takes a single cycle • PCi is predicted from PCi-1 • Instead: • PC1 → PC3 → PC5 … • and PC2 → PC4 → … • PCi is predicted from PCi-2, and so the prediction can take two cycles instead of one • In general, can k-ahead pipeline the predictor
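A toy sketch of a 2-cycle ahead-pipelined predictor; predict() is a stand-in for a real predictor, and the point is only that the two interleaved chains PC1 → PC3 → PC5 … and PC2 → PC4 … keep one fetch address coming every cycle while each prediction takes two cycles.

```python
def predict(pc):
    # Stand-in for the real predictor; the "next PC" here is faked deterministically.
    return pc + 8

def ahead_pipelined_fetch(pc1, pc2, n_cycles=6):
    """Two-deep prediction pipeline: the prediction launched from PC_i
    completes two cycles later and becomes fetch address PC_{i+2}."""
    in_flight = [pc1, pc2]            # predictions launched but not yet complete
    fetch_stream = [pc1, pc2]
    for cycle in range(n_cycles):
        done = in_flight.pop(0)       # prediction launched two cycles ago finishes
        next_pc = predict(done)
        fetch_stream.append(next_pc)
        in_flight.append(next_pc)     # immediately launch the next 2-ahead prediction
        print(f"cycle {cycle}: fetch {next_pc:#x}")
    return fetch_stream

ahead_pipelined_fetch(0x1000, 0x1008)
```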
Ahead Prediction Timing • [Figure: pipeline diagram of a 2-cycle ahead-pipelined branch predictor over cycles k, k+1, k+2: the prediction started from PCi finishes two cycles later and supplies fetch address PCi+2, the prediction started from PCi+1 supplies PCi+3, and so on]
Ahead Prediction Misprediction • On a mispredict, the new PC sent to the front-end is the correct NPC; the address before NPC is the PC of the mispredicted branch, which is what restarts the ahead-pipelined predictor (PC → next-next PC, i.e. N2PC) • Cycle k: mispredict detected • Cycle k+1: NPC → I$, PC → predictor • Cycle k+2: I$ bubble, NPC → predictor • Cycle k+3: N2PC → I$, N2PC → predictor
Overriding Branch Predictors • Use two branch predictors • The 1st has single-cycle latency (fast, medium accuracy) • The 2nd has multi-cycle latency but is more accurate • The second predictor can override the 1st prediction if it disagrees • Idea: better to pay a small number of bubbles (the difference between the 1st and 2nd predictor latencies) than to pay for a full branch misprediction (full pipeline flush, 20+ cycles of delay)
Overriding Predictors (2) • [Figure: timeline with the fast 1st predictor, the slower 2-cycle pipelined 2nd predictor, and the I$: the fast predictor produces A, B, C and fetch follows them, while the slow predictor later produces A', B', C'] • If A = A' (both predictors agree), done • If A != A', flush A, B, and C and restart fetch with A'
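A sketch of the override decision, assuming a single-cycle fast predictor and a slow predictor whose result arrives a couple of cycles later; the lambdas stand in for real predictors.

```python
def fetch_with_override(pc, fast_predict, slow_predict, slow_latency=2):
    """Fetch speculatively with the fast prediction; if the slow predictor
    later disagrees, flush the speculative fetches and restart."""
    fast_target = fast_predict(pc)
    speculative = [fast_target]               # blocks fetched while waiting
    for _ in range(slow_latency - 1):
        speculative.append(fast_predict(speculative[-1]))
    slow_target = slow_predict(pc)            # arrives slow_latency cycles later
    if slow_target == fast_target:
        return speculative                    # agreement: no bubbles
    print(f"override: flush {speculative}, restart fetch at {slow_target:#x}")
    return [slow_target]                      # pay slow_latency bubbles, not a full flush

fetch_with_override(0x2000,
                    fast_predict=lambda pc: pc + 16,
                    slow_predict=lambda pc: pc + 32)
```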
Benefit of Overriding Predictors • Assume: 1-cycle predictor with 80% accuracy, 3-cycle predictor with 95% accuracy, misprediction penalty of 20 cycles • Fetch bubbles per branch: • 1-cycle pred only: 0.8×0 + 0.2×20 = 4 • 3-cycle pred only: 0.95×3 + 0.05×20 = 3.85 • Overriding config: 0.8×0.95×0 + 0.2×0.95×3 + 0.2×0.05×20 + 0.8×0.05×23 = 1.69 • Worst case (fast predictor right, slow predictor wrongly overrides), the branch mispredict penalty is worse than without overriding predictors!
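The same expected-bubble arithmetic, reproduced in Python; treating the two predictors' accuracies as independent is the slide's simplifying assumption.

```python
acc_fast, acc_slow = 0.80, 0.95   # 1-cycle and 3-cycle predictor accuracies
mispredict, override = 20, 3      # penalties in cycles

fast_only = acc_fast * 0 + (1 - acc_fast) * mispredict                  # 4.0
slow_only = acc_slow * override + (1 - acc_slow) * mispredict           # 3.85
overriding = (acc_fast * acc_slow * 0                                   # both right
              + (1 - acc_fast) * acc_slow * override                    # slow fixes fast
              + (1 - acc_fast) * (1 - acc_slow) * mispredict            # both wrong
              + acc_fast * (1 - acc_slow) * (mispredict + override))    # slow wrongly overrides
print(fast_only, slow_only, round(overriding, 2))                       # 4.0 3.85 1.69
```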
Speculative Branch Update • Ideal branch prediction problem: given a PC, predict the branch outcome; given the actual outcome, update/train the predictor; repeat • Actual problem: the streams of predictions and updates run in parallel • [Figure: timeline in which the predictions for branches A through G are all made before most of the corresponding updates arrive]
Speculative Branch Update (2) • If we can't update the BHR until commit (because the outcome isn't known until then), then branches B through E are all predicted with the same stale BHR value (011010), while only the retired branch has advanced the history to 110101 • So the BHR update cannot be delayed until branch retirement
Speculative Branch Update (3) • Update the branch history using the predictions • Speculative update • If the predictions are correct, then the BHR is correct • Effectively simulates alternating lookup and update w.r.t. the BHR • So what if there's a misprediction? • Checkpoint and recover
Recovery of Speculative BHR • [Figure: two copies of the history: a speculative BHR (e.g. 0110100100100…) updated at BPred lookup/update time, and a retirement BHR updated at retirement; on a mispredict, the speculative BHR is restored from the retirement BHR]
Execution-Time Recovery • Commit-time recovery may substantially delay branch misprediction recovery: if an older load misses the cache and goes to DRAM, the branch has executed and detected the mispredict, but recovery cannot start until the load retires • Instead, have every branch checkpoint the BHR at the time it was predicted • On a mispredict, recover the speculative BHR from this checkpoint
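A minimal sketch of speculative BHR management with per-branch checkpoints as described above; modeling the BHR as a string and keying checkpoints by a branch id are illustrative simplifications.

```python
class SpeculativeBHR:
    """Shift predictions into the history at predict time; checkpoint so a
    misprediction can restore the history the branch saw when it predicted."""
    def __init__(self, bits=8):
        self.hist = "0" * bits
        self.checkpoints = {}                    # branch id -> history at predict time

    def predict(self, branch_id, predicted_taken):
        self.checkpoints[branch_id] = self.hist  # checkpoint before speculative update
        self.hist = self.hist[1:] + ("1" if predicted_taken else "0")
        return predicted_taken

    def resolve(self, branch_id, actual_taken, predicted_taken):
        if actual_taken != predicted_taken:      # mispredict: recover at execute time
            self.hist = self.checkpoints[branch_id][1:] + ("1" if actual_taken else "0")

bhr = SpeculativeBHR()
p = bhr.predict(branch_id=1, predicted_taken=True)
bhr.resolve(branch_id=1, actual_taken=False, predicted_taken=p)
print(bhr.hist)        # history repaired with the actual outcome
```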
Traces • A "trace" is a dynamic stream of instructions • [Figure: static layout of basic blocks A B C D E F G H I J K L versus observed paths through the program; example traces: A B C H I J, H I J K L A, G H I J H I J K L, A B C H I J]
Trace Cache • Idea: cache dynamic traces instead of static instructions • [Figure: fetching the path A B C D E F G H I J from the I$ takes 5 cycles (one cache line per cycle), while the trace cache (T$) supplies the whole trace in 1 cycle]
Hardware Organization • [Figure: trace-cache front-end block diagram: the fetch address indexes both the trace cache (tag, instructions, hit logic) and the conventional I$/BPred/BTB path; fill control and merge logic feed a line-fill buffer that writes completed traces; mask, exchange, and shift logic feeds the instruction latch to the decoder]
Tags, etc. • Each trace cache line stores: Tag, # Br. (number of branches), Branch Mask, Fall-thru Addr, Target Addr • Example for Fetch: A → Tag = A, # Br. = 3, Mask = 11,1 (branches 1 and 2 are both taken in this trace; the trailing 1 means the trace ends in a branch), Fall-thru = X, Target = Y
Hit Logic, Next Address Selection • [Figure: on Fetch: A, the stored tag A is compared against the fetch address (match 1st block), the multi-bpred's predictions are compared against the branch mask (match the remaining blocks), and the results are ANDed to signal a trace hit; the prediction for the trace-ending branch selects between the fall-thru address X and the target address Y as the next fetch address]
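A simplified sketch of the hit logic and next-address selection, assuming a trace line holds (tag, branch count, intra-trace taken mask, fall-through address, target address) as on the previous slide; the mask encoding here is simplified to a plain bit string.

```python
from dataclasses import dataclass

@dataclass
class TraceLine:
    tag: int          # starting fetch address of the trace
    num_br: int       # number of branches in the trace
    mask: str         # taken pattern of the branches internal to the trace, e.g. "11"
    fall_thru: int    # next fetch address if the trace-ending branch is not taken
    target: int       # next fetch address if the trace-ending branch is taken

def trace_lookup(line, fetch_addr, predictions):
    """Hit if the tag matches and the multi-bpred predictions agree with the
    intra-trace branch mask; then select the next fetch address."""
    if line.tag != fetch_addr:
        return False, None
    intra = predictions[: line.num_br - 1]        # branches internal to the trace
    if "".join(intra) != line.mask:
        return False, None                        # path mismatch: treat as a miss here
    ending_taken = predictions[line.num_br - 1] == "1"
    return True, (line.target if ending_taken else line.fall_thru)

line = TraceLine(tag=0xA000, num_br=3, mask="11", fall_thru=0xB000, target=0xC000)
print(trace_lookup(line, 0xA000, predictions=["1", "1", "1"]))   # (True, 0xC000)
```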
Generating Multiple Predictions • [Figure: three predictor lookups chained serially versus three lookups made in parallel from the same BHR] • Serialized access: incredibly slow • Making three predictions in parallel means the predictor must be BHR-based only (no PC bits, since the later branches' PCs aren't known yet!)
Associativity • Set-Associativity: [Figure: traces ABC and XYZ coexist in the same set] • Benefit: reduced miss rate • Cost: access time, replacement complexity • Path/Trace-Associativity: [Figure: traces ABC and ABD, which start at the same block A, coexist] • Benefit: possibly reduced miss rate and less trace thrashing • Cost: access time, replacement complexity, code duplication (A and B are stored twice)
Indexing • [Figure: index the T$ with the fetch address A combined with BHR bits; if the path into A came through X, the trace ABC is selected, and if it came through Y, ABD is selected] • Works if the path after AB consistently correlates with the path before AB • Provides a similar benefit to path-associativity
Trace Fill Unit Placement • Build trace at fetch: [Figure: instructions flow from the I$ toward decode and into a trace construction buffer, which stores into the T$ when the trace is complete] • Build trace at retire: [Figure: instructions retiring from the ROB feed the trace construction buffer, which stores into the T$ when the trace is complete]
Trace Fill Unit Placement (2) • At Fetch • Speculative traces (uses branch prediction, not yet verified) • Construction buffer management: while building ABC, a mispredict reveals the trace should be ABD; need to find C in the buffer, clean it out, and then insert D • At Retire • Non-speculative, all traces are "correct" • No interaction with the branch predictor • Simpler construction buffer • Slower response time • The time from fetching ABC to retiring ABC may be long • Until retirement, ABC is not in the T$ and fetch must use the I$
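A sketch of a retire-side fill unit, assuming a fixed maximum trace length and an "end the trace at the Nth branch" heuristic; the constants and the example commit stream are illustrative.

```python
MAX_INSTS, MAX_BRANCHES = 16, 3

def build_traces_at_retire(commit_stream):
    """Accumulate retired instructions into a construction buffer and emit a
    trace when the buffer hits the instruction or branch limit."""
    traces, buf, branches = [], [], 0
    for pc, is_branch in commit_stream:
        buf.append(pc)
        branches += is_branch
        if len(buf) == MAX_INSTS or branches == MAX_BRANCHES:
            traces.append(tuple(buf))        # write the completed trace into the T$
            buf, branches = [], 0
    return traces

stream = [(0xA0, False), (0xA4, True), (0xB0, False), (0xB4, True),
          (0xC0, False), (0xC4, False), (0xC8, True), (0xD0, False)]
print(build_traces_at_retire(stream))
```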
Trace Selection • [Figure: at block A, the branch goes to B 97% of the time and to C only 3% of the time, with both paths rejoining at D] • Some traces may have poor temporal locality • Storing ACD evicts ABD (assuming no path-associativity), but ACD likely won't be useful again soon • Alternative: use a trace filtering mechanism • extra HW required
Statistical Filtering [PACT 2005] • For each trace, insert with probability p < 1.0 • Example: p = 0.05 (5% chance of insertion per sighting of a trace) • Hot trace: ABC, seen 50 times • Cold trace: XYZ, seen twice • Probability of ABC getting inserted: 1.0 − P(never inserted) = 1.0 − (1.0 − 0.05)^50 = 1.0 − 0.95^50 ≈ 92.3% (good chance that ABC gets into the T$) • Probability of XYZ getting inserted: 1.0 − (1.0 − 0.05)^2 = 1.0 − 0.95^2 ≈ 9.75% (not so likely)
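The insertion probabilities above, recomputed in Python.

```python
def p_inserted(p_per_sighting, sightings):
    """Probability that a trace gets inserted into the T$ at least once."""
    return 1.0 - (1.0 - p_per_sighting) ** sightings

print(f"hot trace ABC, seen 50x: {p_inserted(0.05, 50):.1%}")   # ~92.3%
print(f"cold trace XYZ, seen 2x: {p_inserted(0.05, 2):.2%}")    # ~9.75%
```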
Partial Matches • [Figure: on Fetch: A, the trace cache holds ABC but the predictor now wants ABD; a partial hit can still supply AB, or just A, instead of missing entirely and falling back to the I$] • Benefit: more instructions fetched per cycle • Cost: more complex "hit" logic, squashing logic, and storage of targets for the intermediate branches
Netburst (P4) Trace Cache • [Figure: the front-end BTB, iTLB, and prefetcher fetch from the L2 cache into the decoder, which fills the trace cache; the trace cache, with its own BTB, feeds rename, execute, etc.] • Stores decoded instructions • Uses trace-based prediction (predict the next trace, not the next PC) • No I$ !!
Trace Prediction • Each trace has a unique identifier, analogous to but different from a conventional PC • effectively the starting PC plus the intra-trace branch directions • The trace predictor takes a trace-id as input and outputs a predicted next-trace-id • The trace cache is indexed with the trace-id, and the tag match is against the trace-id as well
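A minimal sketch of forming a trace-id from the starting PC plus the intra-trace branch directions, and using it to index and tag the T$; the packing scheme is an assumption for illustration.

```python
def trace_id(start_pc, branch_dirs):
    """Pack the starting PC with the intra-trace branch outcomes (1 = taken)."""
    bits = 0
    for d in branch_dirs:
        bits = (bits << 1) | d
    return (start_pc << len(branch_dirs)) | bits

def t_cache_index(tid, num_sets=256):
    return tid % num_sets              # index with the trace-id; the full id is the tag

tid = trace_id(0xA000, [1, 1, 0])      # trace starting at 0xA000, branches T, T, NT
print(hex(tid), t_cache_index(tid))
```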
No I$, Decoded Trace Cache • No I$ means a T$ miss must pay the latency of an L2 access • Severe performance penalty for applications with poor trace locality • Storing decoded instructions removes the decode logic from the branch misprediction penalty: [Figure: with a conventional front-end the mispredict penalty spans Fetch, Fetch, Dec, Dec, Ren, Disp, Exec; with the decoded trace cache it spans only T$, T$, Ren, Disp, Exec]