Alpha EV8 (cancelled June 2001). SMT: 4 threads; wide-issue superscalar processor: 8-way issue, 512 registers. Single-process performance is the goal; multiprocess performance is the bonus; 5-10% overhead for SMT. Challenges on the EV8 conditional branch predictor: high accuracy is needed.
1. Design tradeoffs for the Alpha EV8 Conditional Branch Predictor
André Seznec, IRISA/INRIA
Stephen Felix, Intel
Venkata Krishnan, Stargen Inc
Yiannakis Sazeides, University of Cyprus
2. Alpha EV8 (cancelled June 2001)
SMT: 4 threads
wide-issue superscalar processor:
8-way issue
512 registers
3. Challenges on the EV8 conditional branch predictor
High accuracy is needed:
14 cycles minimum miss penalty
silicon budget is large, but don't waste it
Up to 16 predictions per cycle:
from two non-contiguous fetch blocks!
history vector(s) must be updated with 0 to 16 branch outcomes !
Branch history information is 3 fetch blocks old
Various implementation constraints:
keep the number of physical memory arrays under control
use of single-ported memory cells
timing constraints on indexing functions
4. Alpha EV8 front-end pipeline
Fetches up to two 8-instruction blocks per cycle from the I-cache:
a block ends either at an aligned 8-instruction boundary or at a taken control-flow instruction
up to 16 conditional branches fetched and predicted per cycle
Next two block addresses must be predicted in a single cycle:
critical path: use of a line predictor backed with a complex PC address generator: conditional branch predictor, RAS, jump predictor, ...
5. instruction fetch blocks on EV8
6. PC address generation pipeline
7. Minimum background in branch prediction (0)
Only direction is an issue:
targets are recomputed on the fly
8. Minimum background in branch prediction (1)
Use the past to predict the current branch:
read and update tables:
2-bit counters: prediction + hysteresis
index:
address
global branch history: what happened with the last few branches
a single history vector
or local branch history: what happened the last times this branch was executed
must maintain a table of histories
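As an illustration of the table lookup described on this slide, here is a minimal sketch of a global-history predictor built from 2-bit counters. It is not the EV8 design: the class name, the table size, and the XOR of address and history used as the index are all assumptions chosen for brevity.

```python
class GlobalHistoryPredictor:
    """Toy predictor: one table of 2-bit counters indexed by address XOR global history.

    Hypothetical sketch; sizes and the XOR index are illustrative assumptions.
    """

    def __init__(self, index_bits=10, history_bits=10):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        # Counter values 0..3: the high bit is the prediction, the low bit the hysteresis.
        self.counters = [1] * (1 << index_bits)  # start weakly not-taken
        self.history = 0                          # a single global history vector

    def _index(self, pc):
        return (pc ^ self.history) & self.index_mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


if __name__ == "__main__":
    p = GlobalHistoryPredictor()
    misses = 0
    for n in range(60):
        taken = (n % 2 == 0)          # a strictly alternating branch
        if n >= 40 and p.predict(0x40) != taken:
            misses += 1
        p.update(0x40, taken)
    print("mispredictions after warm-up:", misses)
```

A local-history variant would replace the single `history` field with a table of histories indexed by branch address, which is the extra storage the slide refers to.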
9. Minimum background in branch prediction (2)
Problems:
global or local schemes ?
which precise scheme ?
Interference in the tables may ruin accuracy
10. Global vs local history
Local history:
16 local history reads:
2-ported history read
16-ported prediction table ?
Speculative history:
up to 256 branches in flight
3 fetch blocks old history:
tight loops ?
SMT: sharing is disastrous
Global history:
bank-interleaved prediction tables:
no arbitration even needed !
Speculative history:
3 fetch blocks old:
not a real issue !
use of path !
SMT: sharing may be constructive
11. EV8 predictor: derived from 2Bc-gskew
12. 2Bc-gskew: hybrid skewed predictor
Leverages:
de-aliased predictor e-gskew
bimodal for easy-to-predict branches
State-of-the-art global history branch predictor
at least (published) in 1999 :-)
along with the YAGS predictor :-)
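The way the hybrid combines its components can be sketched as follows: e-gskew is a majority vote over the bimodal table and two global-history tables, and a meta table chooses between that vote and the bimodal prediction alone. The function names and the polarity of the meta bit (true selects e-gskew) are assumptions made for this sketch.

```python
def majority(a, b, c):
    """Majority vote over three boolean predictions."""
    return (int(a) + int(b) + int(c)) >= 2


def predict_2bc_gskew(bim, g0, g1, meta):
    """Combine the four component predictions of a 2Bc-gskew-style hybrid.

    bim, g0, g1: direction predictions read from the three prediction tables.
    meta: chooser prediction; assumed here to select e-gskew when true
          and the bimodal prediction when false.
    """
    egskew = majority(bim, g0, g1)
    return egskew if meta else bim


if __name__ == "__main__":
    # Bimodal says taken, both history tables say not-taken.
    print(predict_2bc_gskew(True, False, False, meta=True))
    print(predict_2bc_gskew(True, False, False, meta=False))
```

With the same component predictions, the final direction depends only on which side the meta table trusts for this branch.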
13. 2Bc-gskew: degrees of freedom
Partial update policy: on correct predictions, only update the correct components:
do not destroy other predictions
+ two tricks to further minimize write pressure
better accuracy !
USE OF DISTINCT PREDICTION AND HYSTERESIS ARRAYS !!
On correct predictions:
prediction bit is only read
hysteresis bit is only written
14. 2Bc-gskew: degrees of freedom (2): sharing hysteresis bits
Using 2-bit counters:
strong states occur more often than weak states
very small loss of accuracy when sharing hysteresis bit between 2 or 4 counters
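A minimal sketch of what splitting a 2-bit counter into separate prediction and hysteresis arrays, with one hysteresis bit shared by several prediction bits, might look like. The class name, the "hysteresis true means strong" encoding, and the choice of neighbouring entries as the sharing group are assumptions of this sketch, not details taken from the slides.

```python
class SplitCounterTable:
    """2-bit counters stored as a prediction array plus a smaller hysteresis array.

    Hypothetical encoding: hyst == True means "strong". Each hysteresis bit is
    shared by `share` neighbouring prediction entries.
    """

    def __init__(self, entries, share=2):
        self.share = share
        self.pred = [False] * entries            # direction bits
        self.hyst = [False] * (entries // share)  # strength bits, shared

    def predict(self, i):
        return self.pred[i]                      # only the prediction array is read

    def update(self, i, taken):
        h = i // self.share
        if self.pred[i] == taken:
            self.hyst[h] = True                  # correct: hysteresis is only written
        elif self.hyst[h]:
            self.hyst[h] = False                 # wrong but strong: weaken
        else:
            self.pred[i] = taken                 # wrong and weak: flip the direction


if __name__ == "__main__":
    t = SplitCounterTable(8, share=2)
    t.update(0, True)   # weak and wrong: entry 0 now predicts taken
    t.update(0, True)   # correct: the shared hysteresis bit becomes strong
    t.update(1, True)   # entry 1 is wrong, but inherits the strong bit: only weakened
    print(t.predict(0), t.predict(1))
```

The last update shows the cost of sharing: entry 1 needs one extra misprediction before it flips, because its neighbour had strengthened the common hysteresis bit.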
15. 2Bc-gskew: degrees of freedom (3)
Different applications: different optimal history lengths
using different indexing functions allows different history lengths for the predictor tables
smooths out the difficulty
Different prediction table sizes:
the bimodal table may be smaller than the other tables :-)
16. EV8 predictor
17. Dealing with implementation constraints
18. Issues on global history
19. Block compressed history lghist
Incorporate at most one bit in the history per fetch block:
0, 1 or 2 bits to be incorporated in history vector per cycle
Which bit ?
Direction of the last conditional branch in the block
previous ones are not taken
XORed with position (1st half/ 2nd half) in the block
more uniform distribution of the history vectors
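The rule on this slide can be sketched as a small function. The data layout (a list of `(slot, taken)` pairs per fetch block), the reading of "position" as the slot of the last conditional branch (slots 4 to 7 counting as the second half), and the 21-bit default length are assumptions made for illustration.

```python
def lghist_bit(block):
    """Return the history bit contributed by one fetch block, or None.

    block: conditional branches in fetch order as (slot, taken) pairs, slot in 0..7.
    Assumption: "position" means the half of the block holding the last branch.
    """
    if not block:
        return None                  # no conditional branch: nothing is inserted
    slot, taken = block[-1]          # earlier branches in the block were not taken
    second_half = slot >= 4
    return int(taken) ^ int(second_half)


def update_lghist(history, blocks, length=21):
    """Shift in at most one bit per fetch block (0, 1 or 2 bits per cycle)."""
    mask = (1 << length) - 1
    for block in blocks:
        bit = lghist_bit(block)
        if bit is not None:
            history = ((history << 1) | bit) & mask
    return history


if __name__ == "__main__":
    cycle = [[(2, True)], [(5, False)]]   # two fetch blocks in one cycle
    print(bin(update_lghist(0, cycle)))
```

Because each block contributes at most one bit, the update logic never has to merge more than two bits per cycle, instead of up to 16 branch outcomes.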
20. instruction fetch blocks on EV8
21. The EV8 branch predictor information vector
History information is not available for the last three blocks A, B, and C
but, addresses are available !!
22. Using single-ported memory arrays
23. Different arrays for hysteresis and predictions
Prediction uses the most significant bit of a 2-bit counter
Partial update on correct prediction: only strengthening the hysteresis bit (not even always)
only a WRITE on the hysteresis
Misprediction:
update = read of hysteresis + write of (prediction + hysteresis) for a single branch
24. Bank-interleaved or double-ported branch predictor ?
Reads of predictions for two 8-instruction blocks:
double-porting: memory cells twice as large
losing half of the entries ?
bank-interleaving: need for arbitration
longer critical electrical path
losing throughput
short loops fitting in a single 8-instruction block !?
25. Conflict free interleaved bank predictor
26. Conflict free bank-interleaved predictor (2)
Conflicts are avoided by construction
Bank number is computed one cycle ahead
not on the critical path
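The slides do not spell out the bank selection rule, so the following is only one hypothetical rule with the two stated properties: the bank is derived from information that is already known a cycle ahead (here, address bits of an older fetch block), and it is forced to differ from the bank of the previous block, so two successive blocks never collide. The function name and the choice of address bits 5 and 6 are assumptions.

```python
def next_bank(older_block_addr, prev_bank):
    """Pick one of 4 banks for the next fetch block.

    older_block_addr: address of a block fetched earlier (known ahead of time).
    prev_bank: bank used by the immediately preceding block.
    Hypothetical rule: take two address bits, and flip the low bit on a clash.
    """
    candidate = (older_block_addr >> 5) & 0b11
    if candidate == prev_bank:
        candidate ^= 1               # still in 0..3, and now different from prev_bank
    return candidate


if __name__ == "__main__":
    bank = 0
    for addr in (0x1000, 0x1020, 0x1020, 0x1040, 0x2060):
        bank = next_bank(addr, bank)
        print(hex(addr), "-> bank", bank)
```

Since the result never equals `prev_bank`, the two blocks predicted in the same cycle always read different banks and no arbitration logic is needed on the critical path.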
27. « Logical view » vs real implementation
Logical view: 4 tables * 4 banks * 2 (pred. + hyst.):
32 memory arrays
Indexing functions are computed, then arrays are accessed
Real implementation: 4 banks * 2 (pred. + hyst.)
4 tables in a single array
8 memory arrays
No time to lose:
start access and compute part of the index in parallel
28. Reading the branch prediction tables
29. Reading the branch prediction tables (2)
Spans 5/2 cycles:
Cycle -1:
bank number computation
bank selection
Cycle 0:
phase 0: wordline selection
phase 1: column selection
Cycle 1:
phase 0: unshuffle permutation
30. Constraints on the different parts in the indices
Strong: Wordline bits:
immediate availability
common to the four logical tables
Medium: Column bits
a single 2-input XOR gate
Weak: Unshuffle bits:
near complete freedom, a full tree of XOR gates if needed
31. Designing the indexing functions (1): 6 wordline bits
Must be available at the beginning of the cycle:
block address bits
3-block old lghist bits
path bits
Tradeoff:
address bits for emphasizing bimodal component behavior
lghist bits are more uniformly distributed
32. Designing the indexing functions (2): column selection and unshuffle
Favor independence of the four indexing functions:
if two (address, history) pairs conflict on one table, try to avoid repeating the conflict on another table
Guarantee that for a single address, two histories differing by only one or two bits will not map to the same entry
Favor usage of the whole table:
lghist bits are more uniformly distributed than address bits
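The guarantee about histories differing by one or two bits can be checked by brute force on small examples. The sketch below compares two hypothetical 6-bit index functions over an 8-bit history; neither is an EV8 function. Plain folding fails the check because two history bits feed the same index bit, while the second function gives every history bit a distinct XOR pattern.

```python
from itertools import combinations


def violates_guarantee(index_fn, addr, hist_bits):
    """True if some pair of histories differing in 1 or 2 bits maps to the same entry."""
    for h in range(1 << hist_bits):
        for k in (1, 2):
            for bits in combinations(range(hist_bits), k):
                flipped = h
                for b in bits:
                    flipped ^= 1 << b
                if index_fn(addr, flipped) == index_fn(addr, h):
                    return True
    return False


def fold_index(addr, hist):
    """Hypothetical: fold 8 history bits onto 6 index bits (bits 0 and 6 overlap)."""
    return (addr ^ hist ^ (hist >> 6)) & 0x3F


def spread_index(addr, hist):
    """Hypothetical: each history bit toggles a distinct, non-empty set of index bits."""
    idx = hist & 0x3F
    if hist & 0x40:
        idx ^= 0b000011
    if hist & 0x80:
        idx ^= 0b001100
    return (addr ^ idx) & 0x3F


if __name__ == "__main__":
    print("fold_index violates:", violates_guarantee(fold_index, 0x15, 8))
    print("spread_index violates:", violates_guarantee(spread_index, 0x15, 8))
```

For XOR-based functions like these, the property holds exactly when the per-bit toggle patterns are all non-zero and pairwise distinct, which is why the second function passes.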
33. EV8 branch predictor configuration
208 Kbits for prediction and 144 Kbits for hysteresis
«BIM»: 16K + 16K, 4 lghist bits (+ 3-block path)
G0: 64K + 32K, 13 lghist bits
G1: 64K + 64K, 21 lghist bits
Meta: 64K + 32K, 17 lghist bits
4 prediction banks and 4 hysteresis banks
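As a quick consistency check of the figures on this slide, the per-table sizes can be summed, reading each "X + Y" pair as prediction bits plus hysteresis bits. The dictionary layout and function names are assumptions; the numbers are those listed above.

```python
# (prediction Kbits, hysteresis Kbits, lghist bits) per table, as listed on the slide
CONFIG = {
    "BIM":  (16, 16, 4),
    "G0":   (64, 32, 13),
    "G1":   (64, 64, 21),
    "Meta": (64, 32, 17),
}


def budget(config):
    """Total prediction and hysteresis storage in Kbits."""
    pred = sum(p for p, _, _ in config.values())
    hyst = sum(h for _, h, _ in config.values())
    return pred, hyst


def counters_per_hysteresis_bit(config):
    """How many prediction bits share one hysteresis bit in each table."""
    return {name: p // h for name, (p, h, _) in config.items()}


if __name__ == "__main__":
    pred_kbits, hyst_kbits = budget(CONFIG)
    print(pred_kbits, hyst_kbits, pred_kbits + hyst_kbits)  # 208 144 352
    print(counters_per_hysteresis_bit(CONFIG))
```

The totals match the 208 Kbits and 144 Kbits quoted on the slide, and the ratios show that G0 and Meta share each hysteresis bit between two counters while BIM and G1 do not share.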
34. Performance evaluation
Sorry,
SPEC 95 :-)
35. Benchmark characteristics
Highly optimized SPECint 95:
many more not-taken than taken branches
ratio lghist/ghist length:
from 1.12 to 1.59
from 8.9 to 16.2 branches per 100 instructions
36. 2Bc-gskew vs other global history predictors
37. History should be longer than log2(size)
38. Quality of information vector
39. Quality of information vector (2)
lghist vs ghist: no significant loss of information
Using path or no path does not appear (here) to be an issue
3-block old lghist: slight loss of accuracy
EV8 info vector:
path info on the last blocks recovers part of the loss associated with 3-block old lghist
40. Reducing some table sizes
no significant impact
41. Qualities of the indexing functions
42. Qualities of the indexing functions (2)
Using history bits in the wordline selection is a good tradeoff
Embedding path information in lghist is beneficial
EV8 indexing functions are as good as indexing functions designed with no constraints
43. Pushing the limits !?
44. Conclusion
Design of a real branch predictor leads to challenges ignored in academic studies:
3-block old history vector
impossibility of maintaining a complete history
multiple parallel accesses to the predictor
minimization of the number of memory arrays
timing constraints on the indexing functions
45. Summary of the contributions
An efficient information vector can be built by mixing path and compressed history:
don’t focus on the info vector, use what is convenient!
Use of different table sizes, history lengths in the predictor.
Sharing of hysteresis bits
Conflict free parallel access scheme for the predictor
Engineering of indexing functions