Two-level Adaptive Branch Prediction Colin Egan University of Hertfordshire Hatfield U.K. email@example.com
Presentation Structure • Two-level Adaptive Branch Prediction • Cached Correlated Branch Prediction • Neural Branch Prediction • Conclusion and Discussion • Where next?
Two-level Adaptive Branch Prediction Schemes • First level: • History register(s) record the outcome of the last k branches encountered. • A global history register records information from other branches leading to the branch. • Local (Per-Address) history registers record information of specific branches.
Two-level Adaptive Branch Prediction Schemes • Second level: • Is termed the Pattern History Table (PHT). • The PHT consists of at least one array of two-bit up/down saturating counters that provide the prediction.
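The behaviour of these two-bit up/down saturating counters can be sketched in a few lines of Python (an illustrative sketch, not code from the study; states 0–1 predict not taken, 2–3 predict taken):

```python
def update_counter(counter, taken):
    """Saturating update: move towards 3 on taken, towards 0 on not taken."""
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)

def predict(counter):
    """Predict taken when the counter is in a 'taken' state (2 or 3)."""
    return counter >= 2

# A strongly-taken counter tolerates a single not-taken outcome
# before its prediction flips - the point of using two bits.
c = 3
c = update_counter(c, False)   # 3 -> 2, still predicts taken
assert predict(c)
c = update_counter(c, False)   # 2 -> 1, now predicts not taken
assert not predict(c)
```

The hysteresis is why two-bit counters outperform a single last-outcome bit: one anomalous outcome (e.g. a loop exit) does not immediately flip the prediction.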
Two-level Adaptive Branch Prediction Schemes • Global schemes: • Exploit correlation between the outcome of the current branch and neighbouring branches that were executed leading to this branch. • Local schemes: • Exploit correlation between the outcome of the current branch and its past behaviour.
Global Two-level Adaptive Branch Prediction • GAg Predictor Implementation • [Diagram: the k-bit global HR indexes a single global PHT of 2^k two-bit counters, which supplies the prediction; the branch address indexes the BTC (br-tag, target address, status).]
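A minimal GAg predictor can be sketched as follows (an illustrative sketch; the weakly-not-taken initialisation is an assumption the slides do not specify):

```python
class GAgPredictor:
    """Sketch of GAg: one global k-bit history register indexes a
    single global PHT of 2**k two-bit saturating counters."""

    def __init__(self, k):
        self.k = k
        self.hr = 0                      # global history register
        self.pht = [1] * (1 << k)        # counters start weakly not taken

    def predict(self, pc):
        # GAg ignores the branch address: the HR alone indexes the PHT.
        return self.pht[self.hr] >= 2

    def update(self, pc, taken):
        idx = self.hr
        if taken:
            self.pht[idx] = min(self.pht[idx] + 1, 3)
        else:
            self.pht[idx] = max(self.pht[idx] - 1, 0)
        # Shift the resolved outcome into the history register.
        self.hr = ((self.hr << 1) | int(taken)) & ((1 << self.k) - 1)
```

After a short warm-up, even a 2-bit HR lets the predictor lock on to a strictly alternating taken/not-taken branch, which a plain two-bit counter can never predict.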
Global Two-level Adaptive Branch Prediction • GAp / GAs Predictor Implementation • [Diagram: n bits of the branch address select one of several local/set PHTs; the k-bit global HR then indexes the 2^k counters of the selected PHT to supply the prediction; the branch address also indexes the BTC (br-tag, target address, status).]
Local Two-level Adaptive Branch Prediction • PAg • [Diagram: the PC indexes the first-level BTC (BHT), whose entry (br-tag, target address, status) supplies the local history register HRl; HRl then indexes a single global second-level PHT to supply the prediction.]
Local Two-level Adaptive Branch Prediction • PAs / PAp • [Diagram: the PC indexes the first-level BTC (BHT) to obtain HRl; n bits of the branch address select a local or set second-level PHT, which HRl indexes to supply the prediction.]
Problems with Two-level Adaptive Branch Prediction • Size of PHT • Increases exponentially as a function of HR length. • Use of uninitialised prediction counters • No tag fields are associated with PHT prediction counters. • Branch interference (aliasing) • In GAg and PAg all branches share a common PHT. • In GAs and PAs each PHT is shared by a set of branches.
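The exponential growth in PHT size is easy to quantify (illustrative figures only):

```python
# Second-level storage grows as 2**k two-bit counters per table,
# so every extra history bit doubles the cost.

def pht_bits(hr_len, tables=1):
    """Total second-level storage in bits for `tables` PHTs,
    each indexed by an hr_len-bit history register."""
    return tables * (1 << hr_len) * 2

# One extra history bit doubles the storage:
assert pht_bits(17) == 2 * pht_bits(16)

# A single PHT indexed by a 26-bit HR already needs 2**26 counters,
# i.e. 16 MiB of counter state - far beyond practical budgets.
assert pht_bits(26) // 8 == 16 * 1024 * 1024
```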
Cached Correlated Branch Prediction • Minimises the number of initial mispredictions. • Eliminates branch interference. • Uses its prediction hardware in a disciplined manner. • Is cost-effective.
Cached Correlated Branch Prediction • Today we are going to look at two types of cached correlated predictors: • A Global Cached Correlated Predictor. • A Local Cached Correlated Predictor. • We have also developed a combined predictor that uses both global and local history information.
Cached Correlated Branch Prediction • The first level history register remains the same as conventional two-level predictors. • Uses a second level Prediction Cache instead of a PHT.
Cached Correlated Branch Prediction • Uses a secondary default predictor (BTC). • Both predictors provide a prediction. • A priority selector chooses the actual prediction.
Cached Correlated Branch Prediction • The Prediction Cache predicts on the past behaviour of the branch with the current history register pattern. • The Default Predictor predicts on the overall past behaviour of the branch.
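The priority selection described above can be sketched as follows (the not-taken fallback when both predictors miss is an assumption, not stated in the slides):

```python
def select_prediction(cache_hit, cache_pred, btc_hit, btc_pred):
    """Priority selector sketch: prefer the Prediction Cache (history-
    specific behaviour) when it hits; otherwise fall back to the
    default BTC predictor (overall behaviour); otherwise predict
    not taken as a static fallback (an assumption here)."""
    if cache_hit:
        return cache_pred
    if btc_hit:
        return btc_pred
    return False
```

The ordering embodies the idea on this slide: a prediction tied to the current history pattern is trusted over one based only on the branch's overall behaviour.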
Prediction Cache • Size is not a function of the history register length. • Size is determined by the number of prediction counters that are actually used. • Requires a tag field. • Is cost-effective as long as the cost of the redundant counters removed from a conventional PHT exceeds the cost of the added tags.
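A rough cost comparison illustrates the trade-off (entry layout, tag width, and bookkeeping bits below are illustrative assumptions, not figures from the study):

```python
def conventional_pht_bits(hr_len):
    """A conventional PHT needs 2**hr_len two-bit counters."""
    return (1 << hr_len) * 2

def prediction_cache_bits(entries, tag_bits, lru_bits=2):
    """Each Prediction Cache entry holds a tag, a 2-bit counter,
    a valid bit, and some LRU state (layout is an assumption)."""
    return entries * (tag_bits + 2 + 1 + lru_bits)

# A 30-bit HR would demand a 2**30-entry conventional PHT, while a
# 32K-entry Prediction Cache with generous ~40-bit tags is orders
# of magnitude smaller - the tags cost far less than the counters
# they make removable.
assert conventional_pht_bits(30) > 1000 * prediction_cache_bits(32 * 1024, 40)
```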
A Global Cached Correlated Branch Predictor • [Diagram: the PC and the global history register are hashed to index the Prediction Cache (br_tag, hrg_tag, pred, vld, lru); the PC also indexes the BTC (br_tag, br_trgt, pred, vld, lru); a priority selector chooses between the two predictions.]
A Local Cached Correlated Branch Predictor • Problem • A local predictor would require two sequential clock accesses: • One to access the BTC to furnish HRl. • A second to access the Prediction Cache. • Solution • Cache the next prediction for each branch in the BTC. • Only one clock access is then needed.
A Local Cached Correlated Branch Predictor • [Diagram: the PC indexes the BTC, which supplies HRl and the default prediction; HRl is hashed to index the Prediction Cache, which supplies the correlated prediction; the selector combines the BTC hit and correlated hit signals to choose the actual prediction.]
Simulations • Stanford Integer Benchmark suite. • These benchmarks are difficult to predict. • Instruction traces were obtained from the Hatfield Superscalar Architecture (HSA).
Global Simulations • A comparative study of misprediction rates • A conventional GAg, a conventional GAs(16) and a conventional GAp. • Against a Global Cached Correlated predictor (1K – 64K).
Global Simulation Results • For conventional global two-level predictors the best average misprediction rate of 9.23% is achieved by a GAs(16) predictor with a history register length of 26. • In general there is little benefit from increasing the history register length beyond 16-bits for GAg and 14-bits for GAs/GAp.
Global Simulation Results • A 32K-entry Prediction Cache with a 30-bit history register achieved the best misprediction rate of 5.99% for the global cached correlated predictors. • The best conventional global two-level predictor therefore suffers 54% more mispredictions (9.23% against 5.99%).
Global Simulations • We repeated the same simulations without the default predictor.
Global Simulation Results(without default predictor) • The best misprediction rate is now 9.12%. • The high performance of the cached predictor depends crucially on the provision of the two-stage mechanism.
Local Simulations • A comparative study of misprediction rates. • A conventional PAg, a conventional PAs(16) and a conventional PAp. • Against a local cached correlated predictor (1K – 64K), with and without the default predictor.
Local Simulation Results • For conventional local two-level predictors the best average misprediction rate of 7.35% is achieved by a PAp predictor with a history register length of 30. • The best misprediction rate achieved by a local cached correlated predictor (64K, HR = 28) is 6.19%. • This is a 19% improvement over the best conventional local two-level predictor. • However, without the default predictor the best misprediction rate achieved is 8.21% (32K, HR = 12).
Three-Stage Predictor • Since the high performance of a cached predictor depends crucially on the provision of the two-stage mechanism, we were led to develop a three-stage predictor.
Three-Stage Predictor • Stages • Primary Prediction Cache. • Secondary Prediction Cache. • Default Predictor. • The predictions from the two Prediction Caches are stored in the BTC so that a prediction is furnished in a single clock cycle.
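The three-stage priority selection can be sketched as follows (the ordering follows the stage list above; treating the default predictor as always able to supply a prediction is an assumption):

```python
def three_stage_select(primary_hit, primary_pred,
                       secondary_hit, secondary_pred,
                       default_pred):
    """Sketch of three-stage selection: the primary Prediction Cache
    is consulted first, then the secondary (half-size, half the HR
    bits), and finally the default predictor."""
    if primary_hit:
        return primary_pred
    if secondary_hit:
        return secondary_pred
    return default_pred
```

The secondary cache, indexed with a shorter history, acts as a middle ground: it catches branches whose full-history pattern has not yet been seen but whose shorter-history behaviour has.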
Three-Stage Predictor Simulations • We repeated the same set of simulations. • We varied the Primary Prediction Cache size (1K – 64K). • The Secondary Prediction Cache was always half the size of the Primary Prediction Cache and used exactly half of the history register bits.
Global Three-Stage Predictor Simulation Results • The global three-stage predictor consistently outperforms the simpler global two-stage predictor. • The best misprediction rate is now 5.57%, achieved with a 32K Prediction Cache and a 30-bit HR. • This represents a 7.5% improvement over the best global two-stage predictor.
Local Three-Stage Predictor Simulation Results • The local three-stage predictor consistently outperforms the simpler local two-stage predictor. • The best misprediction rate is now 6.00% achieved with a 64K Prediction Cache and a 28-bit HR. • This represents a 3.2% improvement over the best local two-stage predictor.
Conclusion So Far • Conventional PHTs use large amounts of hardware with increasing history register length. • The history register size of a cached correlated predictor does not determine cost. • A Prediction Cache can reduce the hardware cost over a conventional PHT.
Conclusion So Far • Cached correlated predictors provide better prediction accuracy than conventional two-level predictors. • The role of the default predictor in a cached correlated predictor is crucial. • Three-stage predictors consistently record a small but significant improvement over their two-stage counterparts.
Neural Network Branch Prediction • Dynamic branch prediction can be considered to be a specific instance of general time series prediction. • Two-level Adaptive Branch Prediction is a very specific solution to the branch prediction problem. • An alternative approach is to look to other application areas and fields for novel solutions to the problem. • At Hatfield, we have examined the application of neural networks to the branch prediction problem.
Neural Network Branch Prediction • Two neural networks are considered: • A Learning Vector Quantisation (LVQ) Network, • A Backpropagation Network. • One of our main research objectives is to use neural networks to identify new correlations that can be exploited by branch predictors. • We also wish to determine whether more accurate branch prediction is possible and to gain a greater understanding of the underlying prediction mechanisms.
Neural Network Branch Prediction • As with Cached Correlated Branch Prediction, we retain the first level of a conventional two-level predictor. • The k-bit pattern of the history register is fed into the network as input. • In fact, we concatenate the 10 least significant bits of the branch address with the HR as the network input.
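The input construction can be sketched as follows (bit ordering within the vector is an assumption; the slides do not specify it):

```python
def network_input(pc, hr, hr_bits):
    """Build the network input vector: the 10 least significant bits
    of the branch address concatenated with the hr_bits-bit history
    register, each expanded to one 0/1 input per bit."""
    pc_part = [(pc >> i) & 1 for i in range(10)]
    hr_part = [(hr >> i) & 1 for i in range(hr_bits)]
    return pc_part + hr_part
```

Including PC bits lets a single shared network distinguish different branches that happen to reach the network with the same history pattern.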
LVQ prediction • The idea of using an LVQ predictor was to see whether respectable prediction rates could be delivered by a simple LVQ network that was dynamically trained after each branch prediction.
LVQ prediction • The LVQ predictor contains two “codebook” vectors: • Vt – is associated with a taken branch. • Vnt – is associated with a not taken branch. • The concatenated PC + HR form a single input vector. • We call this vector X.
LVQ prediction • Modified Hamming distances are then computed between X and each of Vt and Vnt. • The winning vector, Vw, is the one with the smaller distance.
LVQ prediction • Vw is used to predict the branch. • If Vt wins then the branch is predicted as taken. • If Vnt wins then the branch is predicted as not taken.
LVQ prediction • LVQ network training • At branch resolution, Vw is adjusted: Vw(t + 1) = Vw(t) ± a(t)[X(t) − Vw(t)] • Vw is moved towards X whenever the prediction proved correct, and away from X whenever it proved incorrect. • The learning factor a(t) was usually set to a small constant (< 0.1). • The losing vector remains unchanged.
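Putting the pieces together, a minimal LVQ predictor might look like this (an illustrative sketch: the exact form of the "modified" Hamming distance and the codebook initialisation are assumptions, since the slides do not define them):

```python
class LVQPredictor:
    """Sketch of the LVQ branch predictor: two codebook vectors, one
    per outcome; the nearer one wins and supplies the prediction."""

    def __init__(self, n_bits, alpha=0.05):
        self.alpha = alpha                # learning factor a(t), < 0.1
        self.v_t = [0.5] * n_bits         # codebook vector for taken
        self.v_nt = [0.5] * n_bits        # codebook vector for not taken

    @staticmethod
    def _distance(x, v):
        # "Modified Hamming distance", approximated here as the sum of
        # absolute differences against the real-valued codebook.
        return sum(abs(xi - vi) for xi, vi in zip(x, v))

    def predict(self, x):
        # The winning (closer) vector supplies the prediction.
        return self._distance(x, self.v_t) <= self._distance(x, self.v_nt)

    def train(self, x, taken):
        # Vw(t+1) = Vw(t) +/- a(t)[X(t) - Vw(t)]: move the winner
        # towards X if the prediction was correct, away from X if it
        # was wrong; the losing vector is left unchanged.
        pred = self.predict(x)
        vw = self.v_t if pred else self.v_nt
        sign = 1.0 if pred == taken else -1.0
        for i, xi in enumerate(x):
            vw[i] += sign * self.alpha * (xi - vw[i])
```

Because training happens at every branch resolution, the codebook vectors continually track the most recently seen history patterns, which is what makes the scheme both dynamic and adaptive.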
LVQ prediction • LVQ network training • Training is therefore dynamic. • Training is also adaptive since the “codebook” vectors reflect the outcomes of the most recently encountered branches.
Backpropagation prediction • Prediction information (the PC + HR input vector) is fed into a backpropagation network. • [Diagram: an input layer taking PC + HR, weights applied to the inputs through hidden layers, and an output layer that delivers the prediction.]