CS 7960-4 Lecture 8
The Impact of Delay on the Design of Branch Predictors
D.A. Jimenez, S.W. Keckler, C. Lin
Proceedings of MICRO-33, 2000
Prediction Accuracy Vs. IPC
• Fig.1 – IPC saturates at around 1.28, assuming single-cycle predictions
• A 2KB predictor takes two cycles to access – multi-cycle predictors can’t yield IPC > 1.0 (reduced fetch bandwidth)
• However, note that a single-cycle predictor is within 10% of optimal IPC (might not be true for more aggressive o-o-o processors)
Long Latency Predictions
• Total branch latency C = d + (r x p)
  d = delay = 1, r = misprediction rate = 0.04, p = penalty = 20
• Always better to reduce d than r (see the worked example below)
• Note that correctly predicted branches are often not on the program critical path
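A small worked example of the cost model above, using the slide's numbers (the 2-cycle comparison values are my assumptions for illustration):

    # Average cost per branch (model from the slide): C = d + r * p
    #   d = predictor delay in cycles, r = misprediction rate, p = penalty in cycles
    def branch_cost(d, r, p=20):
        return d + r * p

    print(branch_cost(d=1, r=0.04))  # 1.8 -- 1-cycle predictor, 4% mispredictions
    print(branch_cost(d=2, r=0.02))  # 2.4 -- halving r does not recover the extra cycle
    print(branch_cost(d=2, r=0.00))  # 2.0 -- even a perfect 2-cycle predictor loses,
                                     #        since delay is paid on every branch

This is why, under this simple model, reducing the delay d matters more than reducing the misprediction rate r.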
Branch Frequency
• Branches are not as common as we think – on average, they occur every six instructions, and 61% of the time there is at least one cycle of separation between consecutive branches (Fig.3)
• Branches can be treated differently, based on whether they can tolerate latency or not
Branch Predictor Cache
• The cache is a subset of the 3-cycle predictor and requires tags
• ABP provides a prediction if there is a cache miss (a behavioral sketch follows)
[Figure: the xor of address and history indexes both the 3-cycle PHT and a tagged 1-cycle PHT cache; on a hit the cached entry supplies the prediction, on a miss the ABP does]
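A minimal behavioral sketch of the lookup path (class name, table size, and indexing details are my assumptions, not the paper's implementation):

    class PredictorCache:
        # 1-cycle tagged cache holding a subset of the 3-cycle PHT entries.
        # On a miss, a small always-available predictor (ABP) supplies the
        # prediction instead.
        def __init__(self, entries=256):
            self.entries = entries
            self.tags = {}        # index -> tag of the cached PHT entry
            self.counters = {}    # index -> 2-bit counter copied from the big PHT

        def lookup(self, pc, ghist, abp_prediction):
            key = pc ^ ghist                       # xor of address and history
            idx, tag = key % self.entries, key // self.entries
            if self.tags.get(idx) == tag:          # cache hit: use the cached counter
                return self.counters[idx] >= 2
            return abp_prediction                  # cache miss: fall back to the ABP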
Cascading Lookahead Prediction
• Use the current PC to predict where the next branch will go – initiate the look-up before you see that branch
• Use predictors with different latencies – when you do see the branch, use the prediction available to you (sketched below)
• You can use a good prediction 60% of the time and a poor prediction 40% of the time
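A sketch of the selection step, assuming lookups in a 1-cycle and a 3-cycle table were both started at the previous branch (the exact timing condition is an assumption; it depends on pipeline details):

    def cascaded_predict(cycles_since_lookup, fast_pred, slow_pred, slow_latency=3):
        # The fast (1-cycle) prediction is always ready; the slow (3-cycle)
        # prediction is only usable if enough cycles passed before the next
        # branch was actually fetched.
        if cycles_since_lookup >= slow_latency:
            return slow_pred        # the better prediction, most of the time
        return fast_pred            # otherwise settle for the quick one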
Overriding Branch Predictor
• Use a quick-and-dirty prediction
• When you get the slow-and-clean prediction and it disagrees, initiate recovery action
• If prediction rates are 92% and 97%, 5% of all branches see a 2-cycle mispredict penalty and 3% see a 20-cycle penalty (worked out below)
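The expected penalty implied by the slide's numbers (this assumes the slow predictor's mistakes are a subset of the fast predictor's, which is what makes 8% − 3% = 5% of branches pay only the override cost):

    def expected_penalty(frac_override=0.05, override_cost=2,
                         frac_mispredict=0.03, mispredict_cost=20):
        # 92%/97% accurate fast/slow predictors: ~5% of branches pay the small
        # override (refetch) cost, ~3% pay the full misprediction penalty.
        return frac_override * override_cost + frac_mispredict * mispredict_cost

    print(expected_penalty())   # 0.7 cycles of penalty per branch on average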
Combining the Predictors?
• Lookahead into a number of predictors
• When you see a branch (after 3 cycles), use the prediction from your cache (in case of a hit) or the prediction from the regular 3-cycle predictor (in case of a miss)
• When you see the super-duper 5-cycle prediction, let it override any previous incorrect prediction
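A tiny sketch of how the structures combine, per this slide (function names and the hit/miss plumbing are assumptions):

    def predict_at_3_cycles(cache_hit, cache_pred, pht3_pred):
        # After 3 cycles: prefer the cached entry on a hit, else the 3-cycle PHT.
        return cache_pred if cache_hit else pht3_pred

    def override_at_5_cycles(current_pred, pred5):
        # When the 5-cycle prediction arrives, it overrides on disagreement;
        # returns the final prediction and whether a refetch is needed.
        return pred5, (pred5 != current_pred)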
Results (Fig.8)
• The cache doesn’t seem to help at all (IPC of 1.1!) – it is very surprising that the ABP and PHT have matching predictions most of the time
• For the cascading predictor, the slow predictor is used 45% of the time and it gives a better prediction than the 1-cycle predictor 5.5% of the time
• The overriding predictor disagrees 16.5% of the time and yields an IPC of 1.2 – hmmm…
Alpha 21264 Predictor
[Figure: tournament predictor – the PC indexes a local history table whose history indexes a local PHT; global history indexes the global predictor PHT and a chooser PHT that selects between the local and global predictions; sizes shown in the figure: 128 entries, 512 entries, 3200 bits]
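A behavioral sketch of the selection path (table sizes, counter widths, and indexing are simplified assumptions, not the actual 21264 parameters; the update logic is omitted):

    class TournamentPredictor:
        # Chooser indexed by global history selects between a global-history
        # prediction and a per-branch local-history prediction.
        def __init__(self, ghist_bits=12, local_entries=1024):
            self.ghist_bits = ghist_bits
            self.ghist = 0
            self.global_pht = [1] * (1 << ghist_bits)   # 2-bit counters
            self.chooser    = [1] * (1 << ghist_bits)   # >= 2 -> use global
            self.local_hist = [0] * local_entries       # per-PC branch history
            self.local_pht  = [1] * local_entries

        def predict(self, pc):
            g_idx = self.ghist & ((1 << self.ghist_bits) - 1)
            l_idx = self.local_hist[pc % len(self.local_hist)] % len(self.local_pht)
            local_pred  = self.local_pht[l_idx] >= 2
            global_pred = self.global_pht[g_idx] >= 2
            return global_pred if self.chooser[g_idx] >= 2 else local_pred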
Alpha 21464 (EV8)
• 352Kb! 2-cycle access time – 4 predictor arrays accessed in parallel – overrides line prediction
• 14-25 cycle mispredict penalty – 8-wide processor – 256 in-flight instructions
Predictor Sizes
• All tables are indexed using combinations of history and PC
2Bc-gskew
[Figure: BIM is indexed by the address; G0, G1, and the Meta chooser are indexed by combinations of address and history; BIM, G0, and G1 vote on the prediction, and Meta selects between BIM and the vote]
Rules (a sketch of this policy follows)
• On a correct prediction
  • if all agree, no update
  • if they disagree, strengthen correct preds and chooser
• On a misprediction
  • update chooser and recompute the prediction
  • on a correct prediction, strengthen correct preds
  • on a misprediction, update all preds
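A minimal sketch of these partial-update rules (the counter convention, the meta update direction, and the recompute step are my interpretation of the slide, not the EV8 implementation):

    def update_2bcgskew(bim, g0, g1, meta, taken):
        # bim, g0, g1, meta: the 2-bit saturating counters read for this branch
        # (table indexing omitted); taken: the resolved outcome.
        # Assumed convention: meta >= 2 means "trust the majority vote",
        # meta < 2 means "trust the bimodal (BIM) prediction".
        def toward(ctr, outcome):                 # saturating move toward 'outcome'
            return min(ctr + 1, 3) if outcome else max(ctr - 1, 0)

        preds = [c >= 2 for c in (bim, g0, g1)]   # component predictions
        vote = sum(preds) >= 2                    # majority vote of BIM, G0, G1
        final = vote if meta >= 2 else preds[0]   # meta picks BIM or the vote

        def strengthen_correct():                 # strengthen only correct components
            return tuple(toward(c, taken) if p == taken else c
                         for c, p in zip((bim, g0, g1), preds))

        def retrain_meta(m):                      # nudge meta toward the correct source
            if vote == taken:
                return min(m + 1, 3)
            if preds[0] == taken:
                return max(m - 1, 0)
            return m                              # neither source was correct

        if final == taken:
            if len(set(preds)) > 1:               # correct, but components disagreed
                meta = retrain_meta(meta)
                bim, g0, g1 = strengthen_correct()
            # if all agreed: no update (the partial-update rule)
        else:
            meta = retrain_meta(meta)             # misprediction: update the chooser
            recomputed = vote if meta >= 2 else preds[0]
            if recomputed == taken:               # the other source was correct
                bim, g0, g1 = strengthen_correct()
            else:                                 # still wrong: update all components
                bim, g0, g1 = toward(bim, taken), toward(g0, taken), toward(g1, taken)
        return bim, g0, g1, meta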
Design Choices
• The local predictor was avoided because you need up to 16 predictions in a cycle and it is hard to maintain speculative local histories
• You have no control over local histories – you would need a 16-ported PHT
• Since global history is common to all 16 predictions, you can control indexing into the PHT
• They advocate the use of larger overriding predictors for future technologies
Next Week’s Paper
• “Trace Cache: A Low-Latency Approach to High-Bandwidth Instruction Fetching”, Rotenberg, Bennett, Smith, MICRO-29, 1996
• Combines common instruction traces in the I-cache