Two-level Adaptive Branch Prediction

Colin Egan, University of Hertfordshire, Hatfield, U.K. c.egan@herts.ac.uk

Presentation Structure • Two-level Adaptive Branch Prediction • Cached Correlated Branch Prediction • Neural Branch Prediction • Conclusion and Discussion • Where next?

Two-level Adaptive Branch Prediction Schemes • First level: • History register(s) record the outcome of the last k branches encountered. • A global history register records information from other branches leading to the branch. • Local (Per-Address) history registers record information of specific branches.

Two-level Adaptive Branch Prediction Schemes • Second level: • Is termed the Pattern History Table (PHT). • The PHT consists of at least one array of two-bit up/down saturating counters that provide the prediction.
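As an illustration of the two-bit up/down saturating counters held in the PHT, here is a minimal Python sketch. The class name, the initial state (1, weakly not taken) and the convention that states 2 and 3 predict taken are assumptions made for the example, not details given on the slides.

```python
class TwoBitCounter:
    """Two-bit up/down saturating counter with states 0..3."""

    def __init__(self, state=1):
        # Assumption: start in state 1 (weakly not taken).
        self.state = state

    def predict(self):
        # Assumption: states 2 and 3 predict taken, 0 and 1 predict not taken.
        return self.state >= 2

    def update(self, taken):
        # Count up on a taken branch, down on a not-taken branch,
        # saturating at 3 and 0 respectively.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


counter = TwoBitCounter()
for outcome in [True, True, True, False]:
    counter.update(outcome)
print(counter.state, counter.predict())
```

Because the counter saturates, a single not-taken outcome after a run of taken outcomes does not flip the prediction.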

Two-level Adaptive Branch Prediction Schemes • Global schemes: • Exploit correlation between the outcome of the current branch and neighbouring branches that were executed leading to this branch. • Local schemes: • Exploit correlation between the outcome of the current branch and its past behaviour.

Global Two-level Adaptive Branch Prediction • GAg Predictor Implementation [Figure: the branch address indexes the BTC (br-tag, target address, status); the k-bit global HR indexes a global PHT of 2^k counters, which supplies the prediction.]
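The GAg organisation can be sketched in a few lines of Python. This is a behavioural model only: the class and function names are hypothetical, the BTC is omitted, and the initial counter value of 1 is an assumption.

```python
class GAgPredictor:
    """One k-bit global history register indexing one global PHT."""

    def __init__(self, k):
        self.k = k
        self.history = 0               # k-bit global HR
        self.pht = [1] * (1 << k)      # 2^k two-bit counters (initial value assumed)

    def predict(self):
        return self.pht[self.history] >= 2

    def update(self, taken):
        # Train the counter selected by the current history pattern.
        c = self.pht[self.history]
        self.pht[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the history register, keeping k bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)


def misprediction_rate(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses / len(outcomes)


# A branch that repeats taken, taken, not taken.
pattern = [True, True, False] * 200
gag_rate = misprediction_rate(GAgPredictor(k=4), pattern)
print(gag_rate)
```

After a short warm-up each history pattern maps to a counter that has learned the outcome that follows it, so the repeating pattern is predicted correctly.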

Global Two-level Adaptive Branch Prediction • GAp / GAs Predictor Implementation [Figure: n bits of the branch address select a local/set PHT; the k-bit global HR indexes its 2^k counters, which supply the prediction. The BTC holds br-tag, target address and status.]

Local Two-level Adaptive Branch Prediction • PAg [Figure: n bits of the branch address (PC) index the BTC (first-level BHT: br-tag, target address, status, HRl); HRl indexes a global PHT (second level), which supplies the prediction.]
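A behavioural sketch of the PAg organisation follows. The dictionary keyed by branch address stands in for the per-branch HRl fields of the BTC; the names, the initial counter value and the two-branch workload are assumptions chosen for illustration.

```python
class PAgPredictor:
    """Per-address history registers sharing one global PHT."""

    def __init__(self, k):
        self.k = k
        self.local_history = {}        # pc -> k-bit HRl (stand-in for the BTC)
        self.pht = [1] * (1 << k)      # shared PHT (initial value assumed)

    def predict(self, pc):
        return self.pht[self.local_history.get(pc, 0)] >= 2

    def update(self, pc, taken):
        h = self.local_history.get(pc, 0)
        c = self.pht[h]
        self.pht[h] = min(3, c + 1) if taken else max(0, c - 1)
        self.local_history[pc] = ((h << 1) | int(taken)) & ((1 << self.k) - 1)


pag = PAgPredictor(k=2)
pag_misses = 0
for i in range(100):
    # Branch 0x100 alternates taken / not taken; branch 0x200 is always taken.
    for pc, taken in ((0x100, i % 2 == 0), (0x200, True)):
        if pag.predict(pc) != taken:
            pag_misses += 1
        pag.update(pc, taken)
print(pag_misses)
```

Both branches train the same PHT, so they can disturb each other's counters whenever their local histories coincide, as happens briefly during warm-up here.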

Local Two-level Adaptive Branch Prediction • PAs / PAp [Figure: n bits of the branch address (PC) index the BTC (first-level BHT: br-tag, target address, status, HRl); the branch address selects a local or set PHT (second level), which HRl indexes to supply the prediction.]

Problems with Two-level Adaptive Branch Prediction • Size of PHT • Increases exponentially as a function of HR length. • Use of uninitialised predictors • No tag fields are associated with PHT prediction counters. • Branch interference (aliasing) • In GAg and PAg all branches share a common PHT. • In GAs and PAs each PHT is shared by a set of branches.
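To make the exponential growth concrete, the short sketch below computes the storage needed for a PHT of two-bit counters indexed by a k-bit history register. The function name and the choice of history lengths are illustrative.

```python
def pht_storage_bits(history_bits, num_phts=1, counter_bits=2):
    """Bits of counter storage: one counter per history pattern per PHT."""
    return num_phts * (1 << history_bits) * counter_bits


for k in (8, 16, 24, 30):
    print(k, pht_storage_bits(k) // 8, "bytes")
```

Each extra history bit doubles the table, and a set or per-address scheme multiplies the total again by the number of PHTs.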

Cached Correlated Branch Prediction • Minimises the number of initial mispredictions. • Eliminates branch interference. • Is used in a disciplined manner. • Is cost-effective.

Cached Correlated Branch Prediction • Today we are going to look at two types of cached correlated predictors: • A Global Cached Correlated Predictor. • A Local Cached Correlated Predictor. • We have also developed a combined predictor that uses both global and local history information.

Cached Correlated Branch Prediction • The first level history register remains the same as conventional two-level predictors. • Uses a second level Prediction Cache instead of a PHT.

Cached Correlated Branch Prediction • Uses a secondary default predictor (BTC). • Both predictors provide a prediction. • A priority selector chooses the actual prediction.

Cached Correlated Branch Prediction • The Prediction Cache predicts on the past behaviour of the branch with the current history register pattern. • The Default Predictor predicts on the overall past behaviour of the branch.

Prediction Cache • Size is not a function of the history register length. • Size is determined by the number of prediction counters that are actually used. • Requires a tag-field. • Is cost effective as long as the cost of redundant counters removed from a conventional PHT exceeds the cost of the added tags.

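A Global Cached Correlated Branch Predictor [Figure: the PC indexes the BTC (fields: br_tag, br_trgt, pred, vld, lru); a hash of the PC and the global history register indexes the Prediction Cache (fields: br_tag, hrg_tag, pred, vld, lru); a prediction selector chooses between the two predictions.]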
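The two-stage mechanism can be modelled roughly as follows. This is a sketch under several assumptions that the slides do not specify: the Prediction Cache is modelled as a fully associative LRU table keyed by (branch address, history pattern), a new entry is initialised from the first outcome seen, a branch that misses in both structures is predicted not taken, and all names are hypothetical.

```python
from collections import OrderedDict


def bump(counter, taken):
    """Two-bit up/down saturating update."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


class CachedCorrelatedPredictor:
    def __init__(self, history_bits, cache_entries):
        self.mask = (1 << history_bits) - 1
        self.history = 0                    # global history register
        self.cache_entries = cache_entries
        self.cache = OrderedDict()          # (pc, history) -> counter, in LRU order
        self.default = {}                   # pc -> counter (stand-in for the BTC)

    def predict(self, pc):
        key = (pc, self.history)
        if key in self.cache:               # correlated hit has priority
            return self.cache[key] >= 2
        if pc in self.default:              # otherwise use the default predictor
            return self.default[pc] >= 2
        return False                        # assumption: unseen branch is not taken

    def update(self, pc, taken):
        key = (pc, self.history)
        if key in self.cache:
            self.cache[key] = bump(self.cache[key], taken)
            self.cache.move_to_end(key)
        else:
            if len(self.cache) >= self.cache_entries:
                self.cache.popitem(last=False)      # evict least recently used
            self.cache[key] = 3 if taken else 0     # assumption: initialise from outcome
        if pc in self.default:
            self.default[pc] = bump(self.default[pc], taken)
        else:
            self.default[pc] = 3 if taken else 0
        self.history = ((self.history << 1) | int(taken)) & self.mask


ccp = CachedCorrelatedPredictor(history_bits=4, cache_entries=8)
ccp_misses = 0
for taken in [True, True, False] * 100:
    if ccp.predict(0x40) != taken:
        ccp_misses += 1
    ccp.update(0x40, taken)
print(ccp_misses, len(ccp.cache))
```

Only the history patterns that actually occur take up entries, and the tag check means a pattern never borrows a counter trained by a different branch or pattern.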

A Local Cached Correlated Branch Predictor • Problem • A local predictor would require two sequential clock accesses: • One to access the BTC to furnish HRl. • A second to access the Prediction Cache. • Solution • Cache the next prediction for each branch in the BTC. • Only one clock access is therefore needed.

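A Local Cached Correlated Branch Predictor [Figure: the PC indexes the BTC, which supplies the default prediction, the cached Prediction Cache prediction, the correlated hit signal and the BTC hit signal to a prediction selector that produces the actual prediction; hrl from the BTC is hashed to index the Prediction Cache.]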

Simulations • Stanford Integer Benchmark suite. • These benchmarks are difficult to predict. • Instruction traces were obtained from the Hatfield Superscalar Architecture (HSA).

Global Simulations • A comparative study of misprediction rates • A conventional GAg, a conventional GAs(16) and a conventional GAp. • Against a Global Cached Correlated predictor (1K – 64K).

Global Simulation Results

Global Simulation Results • For conventional global two-level predictors the best average misprediction rate of 9.23% is achieved by a GAs(16) predictor with a history register length of 26. • In general there is little benefit from increasing the history register length beyond 16 bits for GAg and 14 bits for GAs/GAp.

Global Simulation Results • A 32K-entry Prediction Cache with a 30-bit history register achieved the best misprediction rate of 5.99% for the global cached correlated predictors. • The best misprediction rate achieved by a conventional global two-level predictor (9.23%) is 54% higher than this; equivalently, the cached predictor removes about 35% of the mispredictions.

Global Simulations • We repeated the same simulations without the default predictor.

Global Simulation Results (without default predictor)

Global Simulation Results (without default predictor) • The best misprediction rate is now 9.12%. • The high performance of the cached predictor depends crucially on the provision of the two-stage mechanism.

Local Simulations • A comparative study of misprediction rates. • A conventional PAg, a conventional PAs(16) and a conventional PAp. • Against a local cached correlated predictor (1K – 64K), with and without the default predictor.

Local Simulation Results (with default predictor)

Local Simulation Results (without default predictor)

Local Simulation Results • For conventional local two-level predictors the best average misprediction rate of 7.35% is achieved by a PAp predictor with a history register length of 30. • The best misprediction rate achieved by a local cached correlated predictor (64K, HR = 28) is 6.19%. • This is a 19% improvement over the best conventional local two-level predictor. • However, without the default predictor the best misprediction rate achieved is 8.21% (32K, HR = 12).

Three-Stage Predictor • Since the high performance of a cached predictor depends crucially on the provision of the two-stage mechanism, we were led to develop the three-stage predictor.

Three-Stage Predictor • Stages • Primary Prediction Cache. • Secondary Prediction Cache. • Default Predictor. • The predictions from the two Prediction Caches are stored in the BTC so that a prediction is furnished in a single clock cycle.

Three-Stage Predictor Simulations • We repeated the same set of simulations. • We varied the Primary Prediction Cache size (1K – 64K). • The Secondary Prediction Cache was always half the size of the Primary Prediction Cache and used exactly half of the history register bits.
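The three-stage arrangement can be sketched as a priority chain: Primary Prediction Cache, then Secondary Prediction Cache, then Default Predictor. The model below is an assumption-laden illustration, not the simulated design: the caches are modelled as fully associative LRU tables, the secondary cache is assumed to use the most recent half of the history bits, every stage is trained on every branch, and the BTC caching of predictions is not modelled.

```python
from collections import OrderedDict


def bump(counter, taken):
    """Two-bit up/down saturating update."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


class PredictionCache:
    """LRU table of two-bit counters tagged by (pc, masked history)."""

    def __init__(self, entries, history_bits):
        self.entries = entries
        self.mask = (1 << history_bits) - 1   # keeps the most recent bits (assumed)
        self.table = OrderedDict()

    def lookup(self, pc, history):
        return self.table.get((pc, history & self.mask))

    def train(self, pc, history, taken):
        key = (pc, history & self.mask)
        if key in self.table:
            self.table[key] = bump(self.table[key], taken)
            self.table.move_to_end(key)
        else:
            if len(self.table) >= self.entries:
                self.table.popitem(last=False)    # evict least recently used
            self.table[key] = 3 if taken else 0   # assumption: initialise from outcome


class ThreeStagePredictor:
    def __init__(self, primary_entries, history_bits):
        self.history = 0
        self.history_mask = (1 << history_bits) - 1
        self.primary = PredictionCache(primary_entries, history_bits)
        # Half the size and half the history bits of the primary cache.
        self.secondary = PredictionCache(primary_entries // 2, history_bits // 2)
        self.default = {}                         # pc -> counter (stand-in for the BTC)

    def predict(self, pc):
        for stage in (self.primary, self.secondary):
            counter = stage.lookup(pc, self.history)
            if counter is not None:
                return counter >= 2
        return self.default.get(pc, 0) >= 2

    def update(self, pc, taken):
        self.primary.train(pc, self.history, taken)
        self.secondary.train(pc, self.history, taken)
        self.default[pc] = bump(self.default.get(pc, 1), taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


tsp = ThreeStagePredictor(primary_entries=16, history_bits=8)
tsp_misses = 0
for taken in [True, True, False] * 100:
    if tsp.predict(0x40) != taken:
        tsp_misses += 1
    tsp.update(0x40, taken)
print(tsp_misses)
```

The shorter history of the secondary cache means it warms up sooner and can supply a correlated prediction while the primary cache is still missing.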

Global Three-Stage Predictor Simulation Results

Global Three-Stage Predictor Simulation Results • The global three-stage predictor consistently outperforms the simpler global two-stage predictor. • The best misprediction rate is now 5.57% achieved with a 32K Prediction Cache and a 30-bit HR. • This represents a 7.5% improvement over the best global two-stage predictor.

Local Three-Stage Predictor Simulation Results

Local Three-Stage Predictor Simulation Results • The local three-stage predictor consistently outperforms the simpler local two-stage predictor. • The best misprediction rate is now 6.00% achieved with a 64K Prediction Cache and a 28-bit HR. • This represents a 3.2% improvement over the best local two-stage predictor.
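The percentage figures quoted in the results slides appear to be computed relative to the improved (lower) misprediction rate. The sketch below reproduces them from the quoted rates and also reports the reduction relative to the baseline rate; the function name and dictionary labels are illustrative.

```python
def relative_change(baseline, improved):
    """Return (percent relative to improved rate, percent relative to baseline)."""
    diff = baseline - improved
    return round(diff / improved * 100, 1), round(diff / baseline * 100, 1)


# Misprediction rates (%) quoted on the results slides: (baseline, improved).
comparisons = {
    "global two-level vs global cached": (9.23, 5.99),
    "local two-level vs local cached": (7.35, 6.19),
    "global two-stage vs three-stage": (5.99, 5.57),
    "local two-stage vs three-stage": (6.19, 6.00),
}

results = {name: relative_change(b, i) for name, (b, i) in comparisons.items()}
for name, (vs_improved, vs_baseline) in results.items():
    print(f"{name}: {vs_improved}% of improved rate, {vs_baseline}% of baseline")
```

The first figure in each pair matches the slides (54%, 19%, 7.5%, 3.2%); the second is the fraction of baseline mispredictions that is removed.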

Conclusion So Far • Conventional PHTs use large amounts of hardware with increasing history register length. • The history register size of a cached correlated predictor does not determine cost. • A Prediction Cache can reduce the hardware cost over a conventional PHT.

Conclusion So Far • Cached correlated predictors provide better prediction accuracy than conventional two-level predictors. • The role of the default predictor in a cached correlated predictor is crucial. • Three-stage predictors consistently record a small but significant improvement over their two-stage counterparts.

Neural Network Branch Prediction • Dynamic branch prediction can be considered a specific instance of general Time Series Prediction. • Two-level Adaptive Branch Prediction is a very specific solution to the branch prediction problem. • An alternative approach is to look at other application areas and fields for novel solutions to the problem. • At Hatfield, we have examined the application of neural networks to the branch prediction problem.

Neural Network Branch Prediction • Two neural networks are considered: • A Learning Vector Quantisation (LVQ) Network. • A Backpropagation Network. • One of our main research objectives is to use neural networks to identify new correlations that can be exploited by branch predictors. • We also wish to determine whether more accurate branch prediction is possible and to gain a greater understanding of the underlying prediction mechanisms.

Neural Network Branch Prediction • As with Cached Correlated Branch Prediction, we retain the first level of a conventional two-level predictor. • The k-bit pattern of the history register is fed into the network as input. • In fact, we concatenate the 10 lsb of the branch address with the HR as input to the network.
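The network input described above can be sketched as a bit vector built from the 10 least significant bits of the branch address followed by the k history bits. The ordering (address bits first, most significant bit first) and the 0/1 encoding are assumptions made for the example.

```python
def network_input(pc, history, history_bits, pc_bits=10):
    """Concatenate the low pc_bits of the branch address with the HR pattern."""
    pc_part = [(pc >> i) & 1 for i in reversed(range(pc_bits))]
    hr_part = [(history >> i) & 1 for i in reversed(range(history_bits))]
    return pc_part + hr_part


x = network_input(pc=0x1234, history=0b101, history_bits=3)
print(len(x), x)
```

Including address bits lets the network distinguish branches that happen to share the same history pattern.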

LVQ prediction • The idea of using an LVQ predictor was to see whether respectable prediction rates could be delivered by a simple LVQ network that was dynamically trained after each branch prediction.

LVQ prediction • The LVQ predictor contains two “codebook” vectors: • Vt is associated with a taken branch. • Vnt is associated with a not taken branch. • The concatenated PC + HR form a single input vector. • We call this vector X.

LVQ prediction • Modified Hamming distances are then computed between X, Vt and Vnt. • The winning vector, Vw, is the vector with the smallest HD.

LVQ prediction • Vw is used to predict the branch. • If Vt wins then the branch is predicted as taken. • If Vnt wins then the branch is predicted as not taken.

LVQ prediction • LVQ network training • At branch resolution, Vw is adjusted: Vw(t + 1) = Vw(t) ± a(t)[X(t) - Vw(t)] • To reinforce correct predictions, the correction term is added when the prediction proves correct and subtracted when it proves incorrect. • The learning factor a(t) was usually set to a small constant below 0.1. • The losing vector remains unchanged.

LVQ prediction • LVQ network training • Training is therefore dynamic. • Training is also adaptive since the “codebook” vectors reflect the outcomes of the most recently encountered branches.
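The LVQ prediction and training steps can be put together in a small sketch. Several details here are assumptions rather than facts from the slides: the "modified Hamming distance" is stood in for by a sum of squared differences, the codebook vectors start at all ones (Vt) and all zeros (Vnt), ties go to taken, and a = 0.05 is simply one constant below 0.1.

```python
class LVQPredictor:
    def __init__(self, length, alpha=0.05):
        self.alpha = alpha                    # learning factor a(t), held constant
        self.v_taken = [1.0] * length         # Vt (initial values assumed)
        self.v_not_taken = [0.0] * length     # Vnt (initial values assumed)

    @staticmethod
    def distance(x, v):
        # Assumed stand-in for the modified Hamming distance.
        return sum((xi - vi) ** 2 for xi, vi in zip(x, v))

    def predict(self, x):
        # The codebook vector closest to X wins; ties go to taken (assumed).
        return self.distance(x, self.v_taken) <= self.distance(x, self.v_not_taken)

    def train(self, x, taken):
        predicted = self.predict(x)
        winner = self.v_taken if predicted else self.v_not_taken
        # Add the correction after a correct prediction, subtract it otherwise.
        sign = 1.0 if predicted == taken else -1.0
        for i, xi in enumerate(x):
            winner[i] += sign * self.alpha * (xi - winner[i])
        # The losing vector is left unchanged.


lvq = LVQPredictor(length=4)
x = [1, 0, 1, 1]
first_prediction = lvq.predict(x)
lvq.train(x, taken=True)
print(first_prediction, lvq.v_taken, lvq.v_not_taken)
```

A correct prediction pulls the winning vector towards the input, so similar inputs are more likely to select the same vector next time; an incorrect one pushes it away.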

Backpropagation prediction • Prediction information is fed into a backpropagation network. [Figure: PC + HR enter the input layer; weights are applied to the inputs, which pass through hidden layers to an output layer that produces the prediction.]
