This paper presents an in-depth analysis of how delay impacts the design and efficiency of branch predictors in modern processors. Key concepts include the relationship between prediction accuracy and instructions per cycle (IPC), the significance of latency in branch prediction, and the effectiveness of different predictor strategies. The authors detail mechanisms such as lookahead prediction and overriding predictors to enhance performance, and provide empirical results on predictor-cache efficiency and design choices that influence predictor effectiveness, highlighting the need to optimize latency to improve processor throughput.
CS 7960-4 Lecture 8
The Impact of Delay on the Design of Branch Predictors
D.A. Jimenez, S.W. Keckler, C. Lin
Proceedings of MICRO-33, 2000
Prediction Accuracy Vs. IPC
• Fig.1 – IPC saturates at around 1.28, assuming single-cycle predictions
• A 2KB predictor takes two cycles to access – multi-cycle predictors can't yield IPC > 1.0 (reduced fetch bandwidth)
• However, note that a single-cycle predictor is within 10% of optimal IPC (this might not hold for more aggressive out-of-order processors)
Long Latency Predictions
• Total branch latency C = d + (r × p)
• d = delay = 1, r = misprediction rate = 0.04, p = misprediction penalty = 20
• It is always better to reduce d than r
• Note that correctly predicted branches are often not on the program's critical path
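The cost formula above can be checked with a back-of-the-envelope helper (our sketch, not from the paper):

```python
# Sketch of the slide's cost formula C = d + (r * p):
# average cycles per branch given delay, misprediction rate, and penalty.

def branch_cost(d, r, p=20):
    """Prediction delay d plus misprediction rate r times penalty p."""
    return d + r * p

base = branch_cost(d=1, r=0.04)           # 1 + 0.8 = 1.8 cycles
faster = branch_cost(d=0.5, r=0.04)       # halving d saves 0.5 cycles
more_accurate = branch_cost(d=1, r=0.02)  # halving r saves only 0.4 cycles
```

With the slide's numbers, shaving delay buys more than an equal relative improvement in accuracy, which is the point of the d-versus-r comparison.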
Branch Frequency
• Branches are not as common as we think – on average, they occur every six instructions, and 61% of the time there is at least one cycle of separation between consecutive branches (Fig.3)
• Branches can be treated differently, based on whether or not they can tolerate prediction latency
Branch Predictor Cache
• The cache is a subset of the 3-cycle predictor and requires tags
• The ABP provides a prediction if there is a cache miss
[Diagram: the XOR of address and history indexes both the 3-cycle PHT and the tagged 1-cycle PHT; a tag comparison produces Hit/Miss, and on a miss the ABP supplies the prediction]
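A minimal sketch of that lookup path (function and parameter names are ours, not the paper's): the small 1-cycle structure is a tagged subset of the big table, and a tag miss falls back to the ABP.

```python
# Hypothetical sketch of the predictor-cache hit/miss path. The cache maps
# an index to a (tag, 2-bit counter) pair; a counter >= 2 means "taken".

def cached_prediction(index, tag, cache, abp_prediction):
    """Use the 1-cycle cached counter on a tag hit; otherwise
    fall back to the ABP's prediction."""
    entry = cache.get(index)
    if entry is not None and entry[0] == tag:
        counter = entry[1]
        return counter >= 2
    return abp_prediction
```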
Cascading Lookahead Prediction
• Use the current PC to predict where the next branch will go – initiate the look-up before you see that branch
• Use predictors with different latencies – when you do see the branch, use the best prediction available to you
• You can use a good prediction 60% of the time and a poor prediction 40% of the time
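The selection step can be sketched as follows, assuming (our assumption, for illustration) a bank of predictors with 1-, 2-, and 3-cycle latencies whose lookahead lookups were all launched at the previous branch:

```python
# Illustrative sketch: pick the slowest (presumably most accurate)
# predictor whose lookahead lookup has finished when the branch is seen.

def best_available(gap_cycles, latencies=(1, 2, 3)):
    """gap_cycles: cycles between launching the lookahead lookup and
    actually seeing the next branch. Returns the latency of the
    predictor whose result gets used."""
    finished = [lat for lat in latencies if lat <= gap_cycles]
    # If nothing has finished, we are stuck waiting for the fastest
    # predictor - its result arrives first once the branch is seen.
    return max(finished) if finished else min(latencies)
```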
Overriding Branch Predictor
• Use a quick-and-dirty prediction to keep fetching
• When the slow-and-clean prediction arrives and disagrees, initiate recovery action
• If the prediction rates are 92% and 97%, 5% of all branches see a 2-cycle override penalty and 3% see a 20-cycle misprediction penalty
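The arithmetic behind those numbers, under the slide's implicit simplification that the slow predictor is wrong only where the fast one is also wrong (our sketch, not the paper's model):

```python
def expected_penalty(fast_acc=0.92, slow_acc=0.97,
                     override_cost=2, mispredict_cost=20):
    """Average penalty cycles per branch for an overriding predictor."""
    slow_wrong = 1 - slow_acc                 # 3%: full misprediction penalty
    overridden = (1 - fast_acc) - slow_wrong  # 5%: slow corrects the fast one
    return overridden * override_cost + slow_wrong * mispredict_cost

overriding = expected_penalty()   # 0.05*2 + 0.03*20 = 0.7 cycles per branch
fast_alone = (1 - 0.92) * 20      # 0.08*20 = 1.6 cycles per branch
```

So the slow predictor more than halves the average penalty even though every override itself costs a couple of cycles.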
Combining the Predictors?
• Look ahead into a number of predictors
• When you see a branch (after 3 cycles), use the prediction from your cache (in case of a hit) or the prediction from the regular 3-cycle predictor (in case of a miss)
• When the super-duper 5-cycle prediction arrives, let it override any previous incorrect prediction
Results (Fig.8)
• The cache doesn't seem to help at all (IPC of 1.1!) – very surprisingly, the ABP and the PHT produce matching predictions most of the time
• For the cascading predictor, the slow predictor is used 45% of the time, and it gives a better prediction than the 1-cycle predictor only 5.5% of the time
• The overriding predictor disagrees 16.5% of the time and yields an IPC of 1.2 – hmmm…
Alpha 21264 Predictor
[Diagram: tournament predictor – the PC indexes a local-history table feeding a local PHT, while the global history indexes a global-predictor PHT and a chooser PHT that selects between the two; labeled sizes: 128 entries, 512 entries, 128 entries, 3200 bits total]
Alpha 21464 (EV8)
• 352Kb of predictor state! 2-cycle access time – 4 predictor arrays accessed in parallel – overrides the line prediction
• 14-25 cycle misprediction penalty – 8-wide processor – 256 in-flight instructions
Predictor Sizes
• All tables are indexed using combinations of history and PC
2Bc-gskew
[Diagram: the BIM bank is indexed by the branch address; G0, G1, and the Meta chooser by address+history hashes; BIM, G0, and G1 feed a majority Vote, and Meta selects between the vote and the BIM prediction]
Rules
• On a correct prediction:
  – if all banks agree, no update
  – if they disagree, strengthen the correct predictors and the chooser
• On a misprediction:
  – update the chooser and recompute the prediction
  – on a correct (recomputed) prediction, strengthen the correct predictors
  – on a misprediction, update all predictors
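These rules can be sketched as a simplified, single-entry model (our code, using 2-bit saturating counters; real 2Bc-gskew hashes each bank with a different address/history mix, which is omitted here):

```python
def _toward(counter, direction):
    """Move a 2-bit saturating counter one step toward
    taken (direction=True) or not-taken (direction=False)."""
    return min(counter + 1, 3) if direction else max(counter - 1, 0)

class TwoBcGskew:
    def __init__(self):
        self.bim, self.g0, self.g1 = 2, 2, 2  # predictor banks
        self.meta = 2  # chooser: >= 2 trusts the vote, else the BIM bank

    def _vote(self):
        # Majority vote of the three banks (counter >= 2 means taken).
        return sum(c >= 2 for c in (self.bim, self.g0, self.g1)) >= 2

    def predict(self):
        return self._vote() if self.meta >= 2 else (self.bim >= 2)

    def _strengthen_correct(self, outcome):
        for bank in ("bim", "g0", "g1"):
            if (getattr(self, bank) >= 2) == outcome:
                setattr(self, bank, _toward(getattr(self, bank), outcome))

    def update(self, outcome):
        if self.predict() == outcome:
            # Correct: if the banks disagreed, strengthen the correct
            # ones and the chooser; if all agreed, touch nothing.
            if not ((self.bim >= 2) == (self.g0 >= 2) == (self.g1 >= 2)):
                self._strengthen_correct(outcome)
                self.meta = _toward(self.meta, self.meta >= 2)
        else:
            # Mispredict: update the chooser toward whichever component
            # (vote vs. BIM) was right, then recompute the prediction.
            if self._vote() != (self.bim >= 2):
                self.meta = _toward(self.meta, self._vote() == outcome)
            if self.predict() == outcome:
                self._strengthen_correct(outcome)  # now correct
            else:
                for bank in ("bim", "g0", "g1"):   # still wrong: update all
                    setattr(self, bank, _toward(getattr(self, bank), outcome))
```

The chooser is nudged before the banks themselves, mirroring the "update chooser and recompute" step on the slide.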
Design Choices
• The local predictor was avoided because you need up to 16 predictions per cycle and it is hard to maintain speculative local histories
• You have no control over local histories – you would need a 16-ported PHT
• Since the global history is common to all 16 predictions, you can control indexing into the PHT
• They advocate the use of larger overriding predictors for future technologies
Next Week's Paper
• "Trace Cache: A Low-Latency Approach to High-Bandwidth Instruction Fetching", Rotenberg, Bennett, Smith, MICRO-29, 1996
• Combines common instruction traces in the I-cache