
CS 7960-4 Lecture 8


Presentation Transcript


  1. CS 7960-4 Lecture 8
  The Impact of Delay on the Design of Branch Predictors
  D.A. Jimenez, S.W. Keckler, C. Lin
  Proceedings of MICRO-33, 2000

  2. Prediction Accuracy Vs. IPC
  • Fig. 1 – IPC saturates at around 1.28, assuming single-cycle predictions
  • A 2KB predictor takes two cycles to access – multi-cycle predictors can’t yield IPC > 1.0 (reduced fetch bandwidth)
  • However, note that a single-cycle predictor is within 10% of optimal IPC (this might not hold for more aggressive o-o-o processors)

  3. Long Latency Predictions
  • Total branch latency: C = d + (r × p)
  • d = predictor delay = 1 cycle, r = misprediction rate = 0.04, p = misprediction penalty = 20 cycles
  • Always better to reduce d than r
  • Note that correctly predicted branches are often not on the program’s critical path
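
  A quick worked check of that claim using the slide’s numbers (the break-even figure is my own arithmetic, not from the paper):

    # Total branch latency model from the slide: C = d + r * p
    def branch_latency(d, r, p=20):
        return d + r * p

    base = branch_latency(d=1, r=0.04)   # 1 + 0.04 * 20 = 1.8 cycles
    slow = branch_latency(d=2, r=0.04)   # 2 + 0.04 * 20 = 2.8 cycles
    print(base, slow)                    # 1.8 vs 2.8 cycles per branch
    # To win back the extra cycle of delay, r would have to fall by 1/p = 0.05,
    # i.e. from 0.04 to below zero -- so cutting d matters more than cutting r.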

  4. Branch Frequency
  • Branches are not as common as we think – on average they occur every six instructions, but 61% of the time there is at least one cycle of separation (Fig. 3)
  • Branches can be treated differently, based on whether or not they can tolerate latency

  5. Branch Predictor Cache
  • The cache is a subset of the 3-cycle predictor and requires tags
  • The ABP provides a prediction if there is a cache miss
  [Diagram: the XOR of address and history indexes both a tagged 1-cycle PHT (the cache) and the full 3-cycle PHT; the hit/miss signal selects between the cached prediction and the ABP’s]
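
  A rough Python sketch of that lookup flow (the class name, table sizes, and the simple bimodal table standing in for the ABP are my assumptions, not the paper’s design; counter training is omitted):

    class PredictorCache:
        def __init__(self, cache_bits=8, pht_bits=14):
            self.cache = {}                          # line index -> (full index tag, counter)
            self.cache_mask = (1 << cache_bits) - 1
            self.pht = [1] * (1 << pht_bits)         # the large, slow (3-cycle) PHT
            self.pht_mask = (1 << pht_bits) - 1
            self.abp = [1] * 4096                    # bimodal stand-in for the fallback ABP

        def predict(self, pc, ghist):
            idx = (pc ^ ghist) & self.pht_mask       # gshare-style index
            line = self.cache.get(idx & self.cache_mask)
            if line is not None and line[0] == idx:  # cache hit: 1-cycle prediction
                return line[1] >= 2
            return self.abp[pc & 4095] >= 2          # cache miss: use the ABP

        def refill(self, pc, ghist):
            # once the 3-cycle PHT read completes, install its counter in the cache
            idx = (pc ^ ghist) & self.pht_mask
            self.cache[idx & self.cache_mask] = (idx, self.pht[idx])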

  6. Cascading Lookahead Prediction
  • Use the current PC to predict where the next branch will go – initiate the look-up before you see that branch
  • Use predictors with different latencies – when you do see the branch, use whichever prediction is available to you by then
  • You can use a good prediction 60% of the time and a poor prediction 40% of the time
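
  In code the selection rule is simply “take the best prediction that is ready when the branch is fetched” (a sketch; the 60/40 split above is what determines how often each case fires):

    # Pick whichever prediction has finished by the time the branch shows up.
    def cascading_predict(cycles_until_branch, fast_pred, slow_pred, slow_latency=3):
        if cycles_until_branch >= slow_latency:
            return slow_pred     # the accurate multi-cycle prediction is ready
        return fast_pred         # otherwise fall back to the 1-cycle prediction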

  7. Overriding Branch Predictor
  • Use a quick-and-dirty prediction right away
  • When you get the slow-and-clean prediction and it disagrees, initiate recovery action
  • If the prediction rates are 92% and 97%, 5% of all branches see a 2-cycle mispredict penalty and 3% see a 20-cycle penalty
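
  Working those numbers out per branch (this assumes, as the slide implies, that the slow predictor’s mispredictions are a subset of the fast one’s):

    # Expected mispredict penalty per branch with overriding
    override  = 0.05 * 2 + 0.03 * 20   # 5% pay a 2-cycle override, 3% pay the full 20
    fast_only = 0.08 * 20              # the 1-cycle (92%) predictor on its own
    print(override, fast_only)         # 0.7 vs 1.6 penalty cycles per branch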

  8. Combining the Predictors?
  • Look ahead into a number of predictors
  • When you see a branch (after 3 cycles), use the prediction from your cache (in case of a hit) or the prediction from the regular 3-cycle predictor (in case of a miss)
  • When you see the super-duper 5-cycle prediction, let it override any previous incorrect prediction

  9. Latencies

  10. Results (Fig. 8)
  • The cache doesn’t seem to help at all (IPC of 1.1!) – it is very surprising that the ABP and PHT have matching predictions most of the time
  • For the cascading predictor, the slow predictor is used 45% of the time and it gives a better prediction than the 1-cycle predictor 5.5% of the time
  • The overriding predictor disagrees 16.5% of the time and yields an IPC of 1.2 – hmmm…

  11. Alpha 21264 Predictor
  [Diagram: a tournament predictor – the PC indexes a local-history table whose history indexes a local PHT; the global history indexes a global-predictor PHT and a chooser PHT that selects between the local and global predictions; sizes labeled on the slide: 128 entries, 512 entries, 128 entries, 3200 bits]
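
  A stripped-down model of that structure (table sizes and indexing here are arbitrary stand-ins, not the 21264’s actual configuration):

    class Tournament:
        def __init__(self):
            self.local_hist = [0] * 128    # per-branch history registers, PC-indexed
            self.local_pht  = [1] * 512    # counters indexed by a branch's local history
            self.global_pht = [1] * 512    # counters indexed by global history
            self.chooser    = [1] * 512    # picks local vs. global, global-history indexed
            self.ghist = 0

        def predict(self, pc):
            lh = self.local_hist[pc % 128]
            local_pred  = self.local_pht[lh % 512] >= 2
            global_pred = self.global_pht[self.ghist % 512] >= 2
            use_global  = self.chooser[self.ghist % 512] >= 2
            # (history and counter updates on branch resolution are omitted)
            return global_pred if use_global else local_pred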

  12. Alpha 21464 (EV8)
  • 352 Kbits of predictor state! 2-cycle access time – 4 predictor arrays accessed in parallel – overrides the line prediction
  • 14-25 cycle mispredict penalty – 8-wide processor – 256 in-flight instructions

  13. Predictor Sizes
  • All tables are indexed using combinations of history and PC
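
  For example, a gshare-style index simply XORs the global history into the PC bits (a generic illustration of “combinations of history and PC”, not any one predictor’s exact hash):

    def gshare_index(pc, ghist, table_bits=12):
        # drop the low PC bits (instruction alignment), then fold in the history
        mask = (1 << table_bits) - 1
        return ((pc >> 2) ^ ghist) & mask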

  14. 2Bc-gskew
  [Diagram: four banks – BIM indexed by the address, and G0, G1, Meta indexed by address + history; BIM, G0, and G1 vote on the prediction, and Meta chooses between the BIM prediction and the majority vote]

  15. Rules
  • On a correct prediction:
    – if all banks agree, no update
    – if they disagree, strengthen the correct predictors and the chooser
  • On a misprediction:
    – update the chooser and recompute the prediction
    – if the recomputed prediction is correct, strengthen the correct predictors
    – if it is still a misprediction, update all predictors
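
  A compact Python model of the 2Bc-gskew structure from the previous slide together with this partial-update policy (toy hash functions and sizes, and my own reading of the rules – not the EV8 implementation):

    def _sat(ctr, taken):
        # 2-bit saturating counter update
        return min(ctr + 1, 3) if taken else max(ctr - 1, 0)

    class TwoBcGskew:
        def __init__(self, bits=12):
            n = 1 << bits
            self.mask = n - 1
            self.bim, self.g0, self.g1, self.meta = ([1] * n for _ in range(4))

        def _indices(self, pc, hist):
            # toy stand-ins for the real skewing hash functions
            return {"bim":  (pc >> 2) & self.mask,
                    "g0":   ((pc >> 2) ^ hist) & self.mask,
                    "g1":   ((pc >> 2) ^ (hist * 7)) & self.mask,
                    "meta": ((pc >> 2) ^ (hist * 3)) & self.mask}

        def _components(self, pc, hist):
            ix = self._indices(pc, hist)
            b  = self.bim[ix["bim"]] >= 2
            g0 = self.g0[ix["g0"]] >= 2
            g1 = self.g1[ix["g1"]] >= 2
            vote = (b + g0 + g1) >= 2                # majority of the three banks
            use_vote = self.meta[ix["meta"]] >= 2    # chooser: vote vs. bimodal
            return ix, b, g0, g1, vote, use_vote

        def predict(self, pc, hist):
            _, b, _, _, vote, use_vote = self._components(pc, hist)
            return vote if use_vote else b

        def update(self, pc, hist, taken):
            ix, b, g0, g1, vote, use_vote = self._components(pc, hist)
            pred = vote if use_vote else b
            banks = (("bim", self.bim, b), ("g0", self.g0, g0), ("g1", self.g1, g1))
            if pred == taken:                        # correct prediction
                if not (b == g0 == g1):              # banks disagreed: strengthen the
                    for name, tbl, p in banks:       # correct banks and the chooser
                        if p == taken:
                            tbl[ix[name]] = _sat(tbl[ix[name]], taken)
                    self.meta[ix["meta"]] = _sat(self.meta[ix["meta"]], use_vote)
            else:                                    # misprediction
                self.meta[ix["meta"]] = _sat(self.meta[ix["meta"]], not use_vote)
                recomputed = vote if self.meta[ix["meta"]] >= 2 else b
                if recomputed == taken:              # now right: strengthen correct banks
                    for name, tbl, p in banks:
                        if p == taken:
                            tbl[ix[name]] = _sat(tbl[ix[name]], taken)
                else:                                # still wrong: update all banks
                    for name, tbl, _p in banks:
                        tbl[ix[name]] = _sat(tbl[ix[name]], taken)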

  16. Design Choices
  • The local predictor was avoided because you need up to 16 predictions per cycle and it is hard to maintain speculative local histories
  • You have no control over local histories – you would need a 16-ported PHT
  • Since the global history is common to all 16 predictions, you can control indexing into the PHT
  • They advocate the use of larger overriding predictors for future technologies

  17. Next Week’s Paper
  • “Trace Cache: A Low-Latency Approach to High-Bandwidth Instruction Fetching”, Rotenberg, Bennett, Smith, MICRO-29, 1996
  • Combine common instruction traces in the I-cache

