Design tradeoffs for the Alpha EV8 Conditional Branch Predictor

1. Design tradeoffs for the Alpha EV8 Conditional Branch Predictor. André Seznec, IRISA/INRIA; Stephen Felix, Intel; Venkata Krishnan, Stargen Inc; Yiannakis Sazeides, University of Cyprus

2. Alpha EV8 (cancelled June 2001). SMT: 4 threads. Wide-issue superscalar processor: 8-way issue, 512 registers.

3. Challenges on the EV8 conditional branch predictor. High accuracy is needed: the minimum misprediction penalty is 14 cycles. The silicon budget is large, but don't waste it. Up to 16 predictions per cycle, from two non-contiguous fetch blocks! The history vector(s) must be updated with 0 to 16 branch outcomes! Branch history information is 3 fetch blocks old. Various implementation constraints: master the number of physical memory arrays; use single-ported memory cells; timing constraints on the indexing functions.

4. Alpha EV8 front-end pipeline. Fetches up to two 8-instruction blocks per cycle from the I-cache: a block ends either on an aligned 8-instruction boundary or on a taken control-flow instruction; up to 16 conditional branches are fetched and predicted per cycle. The next two block addresses must be predicted in a single cycle. This is the critical path: a line predictor is used, backed by a complex PC address generator (conditional branch predictor, RAS, jump predictor, ...).

5. instruction fetch blocks on EV8

6. PC address generation pipeline

7. Minimum background in branch prediction (0). Only direction is an issue: targets are recomputed on the fly.

8. Minimum background in branch prediction (1). Use the past to predict the current branch: read and update tables of 2-bit counters (prediction + hysteresis). Index: the branch address, combined with either global branch history (what happened with the last few branches; a single history vector) or local branch history (what happened the last times this branch was executed; a table of histories must be maintained).
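The 2-bit counter mentioned on this slide can be sketched as follows. This is a minimal illustration of a bimodal (address-indexed) table, not the EV8 design; the class name, table size, and initial state are assumptions chosen for the example.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by branch address bits.

    Counter values: 0 = strongly not-taken, 1 = weakly not-taken,
    2 = weakly taken, 3 = strongly taken. The high bit is the prediction,
    the low bit acts as hysteresis.
    """

    def __init__(self, log2_entries=12):  # table size is an assumption
        self.mask = (1 << log2_entries) - 1
        self.counters = [1] * (1 << log2_entries)  # start weakly not-taken

    def predict(self, pc):
        # Predict taken when the counter is in one of the two "taken" states.
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A single outcome against a strong state does not flip the prediction; that is the role of the hysteresis bit.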

9. Minimum background on branch prediction (2). Problems: global or local schemes? Which precise scheme? Interference in the tables may ruin accuracy.

10. Global vs local history. Local history: 16 local history reads; 2-ported history read; 16-ported prediction table? Speculative history: up to 256 branches in flight. 3-fetch-block-old history: tight loops? SMT: sharing is disastrous. Global history: bank-interleaved prediction tables, without even arbitration! Speculative history: being 3 fetch blocks old is not a real issue; use of path! SMT: sharing may be constructive.

11. EV8 predictor: derived from 2Bc-gskew

12. 2Bc-gskew: hybrid skewed predictor. Leverages the de-aliased predictor e-gskew, plus a bimodal predictor for easy-to-predict branches. State-of-the-art (published) global history branch predictor, at least in 1999 :-), along with the YAGS predictor :-)
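The slide does not spell out how the components are combined. The sketch below assumes the usual description of 2Bc-gskew: e-gskew is a majority vote over the bimodal, G0 and G1 predictions, and a meta-predictor chooses between that vote and the bimodal prediction alone. The function name and the boolean inputs are hypothetical; table indexing is left out.

```python
def two_bc_gskew_predict(bim, g0, g1, meta):
    """Combine four component predictions (booleans, True = taken).

    bim, g0, g1: predictions read from the bimodal, G0 and G1 tables.
    meta: True selects the e-gskew majority vote, False selects bimodal.
    """
    majority = (int(bim) + int(g0) + int(g1)) >= 2  # e-gskew vote
    return majority if meta else bim
```

For example, with `bim=True, g0=False, g1=False`, the result is not-taken when the meta-predictor selects e-gskew and taken when it selects the bimodal component.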

13. 2Bc-gskew: degrees of freedom. Partial update policy: on correct predictions, only update the correct components, so as not to destroy other predictions; plus two tricks to further minimize write pressure. Better accuracy! USE OF DISTINCT PREDICTION AND HYSTERESIS ARRAYS!! On correct predictions, the prediction bit is only read and the hysteresis bit is only written.

14. 2Bc-gskew: degrees of freedom (2): sharing hysteresis bits. With 2-bit counters, strong states occur more often than weak states, so there is only a very small loss of accuracy when sharing a hysteresis bit between 2 or 4 counters.
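A minimal sketch of hysteresis sharing, assuming the 2-bit counter is split into a prediction bit and a "strong" bit held in separate arrays. Which counters share a hysteresis bit (here, neighbours in index order) and the class name are assumptions made for the example.

```python
class SplitCounterTable:
    """2-bit counters stored as separate prediction and hysteresis arrays.

    Each hysteresis bit (True = strong state) is shared by `share`
    consecutive prediction entries.
    """

    def __init__(self, entries, share=2):
        self.share = share
        self.pred = [False] * entries            # one prediction bit per entry
        self.hyst = [False] * (entries // share)  # fewer hysteresis bits

    def predict(self, i):
        return self.pred[i]

    def update(self, i, taken):
        h = i // self.share
        if self.pred[i] == taken:
            self.hyst[h] = True    # correct: strengthen, hysteresis write only
        elif self.hyst[h]:
            self.hyst[h] = False   # wrong in a strong state: weaken
        else:
            self.pred[i] = taken   # wrong in a weak state: flip the prediction
```

Because most counters sit in a strong state most of the time, two entries sharing a bit usually want to write the same value, which is why the accuracy loss reported on the slide is small.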

15. 2Bc-gskew: degrees of freedom (3). Different applications have different optimal history lengths: using different indexing functions allows different history lengths for the predictor tables, which smooths out the difficulty. Different prediction table sizes: the bimodal table may be smaller than the other tables :-)

16. EV8 predictor

17. Dealing with implementation constraints

18. Issues on global history

19. Block compressed history lghist. Incorporate at most one bit in the history per fetch block: 0, 1 or 2 bits are incorporated in the history vector per cycle. Which bit? The direction of the last conditional branch in the block (the previous ones are not taken), XORed with its position (1st half / 2nd half) in the block, for a more uniform distribution of the history vectors.
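The lghist rule above can be sketched as below. The sketch assumes that a block with no conditional branch contributes no bit, that "position" means the slot of the last conditional branch (slots 4 to 7 being the second half of an 8-instruction block), and that new bits are shifted in at the low end; the function names and the `length` parameter are hypothetical.

```python
def lghist_bit(outcomes):
    """Return the history bit for one fetch block, or None if it has none.

    outcomes: list of (slot, taken) pairs for the conditional branches in
    the block, in program order, with slot in 0..7.
    """
    if not outcomes:
        return None
    slot, taken = outcomes[-1]     # only the last conditional branch counts
    second_half = slot >= 4        # position in the block: 1st or 2nd half
    return int(taken) ^ int(second_half)


def update_lghist(history, outcomes, length):
    """Shift the block's bit (if any) into a `length`-bit history vector."""
    bit = lghist_bit(outcomes)
    if bit is None:
        return history
    return ((history << 1) | bit) & ((1 << length) - 1)
```

With two fetch blocks per cycle, calling `update_lghist` once per block inserts 0, 1 or 2 bits per cycle, as stated on the slide.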

20. instruction fetch blocks on EV8

21. The EV8 branch predictor information vector. History information is not available for the three last blocks A, B, and C, but their addresses are available!!

22. Using single-ported memory arrays

23. Different arrays for hysteresis and predictions. The prediction uses the most significant bit of a 2-bit counter. Partial update on a correct prediction: only strengthening the hysteresis bit (and not even always), so only a WRITE on the hysteresis array. Misprediction: update = read of hysteresis + write of (prediction + hysteresis), for a single branch.

24. Bank-interleaved or double-ported branch predictor? Predictions must be read for two 8-instruction blocks. Double-porting: memory cells twice as large; losing half of the entries? Bank-interleaving: need for arbitration; longer critical electrical path; losing throughput; short loops fitting in a single 8-instruction block!?

25. Conflict free interleaved bank predictor

26. Conflict free bank-interleaved predictor (2). Conflicts are avoided by construction. The bank number is computed one cycle ahead, so it is not on the critical path.
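The slide does not give the bank-selection rule. One way to avoid conflicts by construction, sketched here purely as an assumption, is to derive a 2-bit candidate bank from address bits of an older fetch block (known a cycle ahead) and to flip its low bit whenever it equals the bank used by the immediately preceding block. The function name and the bit positions are hypothetical.

```python
def next_bank(older_block_addr, prev_bank):
    """Pick one of 4 banks so that it always differs from prev_bank.

    older_block_addr: address of an earlier fetch block (assumed available
    one cycle ahead); bits 5 and 6 are used as the candidate bank number.
    prev_bank: bank (0..3) accessed by the immediately preceding block.
    """
    candidate = (older_block_addr >> 5) & 0b11
    if candidate == prev_bank:
        candidate ^= 1  # move to the neighbouring bank instead
    return candidate
```

Since consecutive fetch blocks never get the same bank, the two blocks predicted in one cycle can be read from single-ported banks without arbitration.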

27. "Logical view" vs real implementation. Logical view: 4 tables * 4 banks * 2 (pred. + hyst.) = 32 memory arrays; indexing functions are computed, then arrays are accessed. Real implementation: 4 banks * 2 (pred. + hyst.), with the 4 tables in a single array = 8 memory arrays; no time to lose: start the access and compute part of the index in parallel.

28. Reading the branch prediction tables

29. Reading the branch prediction tables (2). Spans over 5/2 cycles. Cycle -1: bank number computation and bank selection. Cycle 0: phase 0: wordline selection; phase 1: column selection. Cycle 1: phase 0: unshuffle permutation.

30. Constraints on the different parts of the indices. Strong: wordline bits: immediate availability, common to the four logical tables. Medium: column bits: a single 2-input XOR gate. Weak: unshuffle bits: near-complete freedom, a full tree of XOR gates if needed.

31. Designing the indexing functions (1): 6 wordline bits. Must be available at the beginning of the cycle: block address bits, 3-block-old lghist bits, path bits. Tradeoff: address bits emphasize the bimodal component behavior; lghist bits are more uniformly distributed.

32. Designing the indexing functions (2): column selection and unshuffle. Favor independence of the four indexing functions: if two (address, history) pairs conflict on a table, then try to avoid repeating the conflict on another table. Guarantee that, for a single address, two histories differing by only one or two bits will not map onto the same entry. Favor usage of the whole table: lghist bits are more uniformly distributed than address bits.

33. EV8 branch predictor configuration. 208 Kbits for prediction and 144 Kbits for hysteresis (sizes below are prediction + hysteresis). BIM: 16K + 16K, 4 lghist bits (+ 3-block path). G0: 64K + 32K, 13 lghist bits. G1: 64K + 64K, 21 lghist bits. Meta: 64K + 32K, 17 lghist bits. 4 prediction banks and 4 hysteresis banks.
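As a quick consistency check, the per-table sizes on this slide do add up to the stated totals. The dictionary layout and variable names are just for illustration.

```python
K = 1024

# table name: (prediction bits, hysteresis bits, lghist bits used)
config = {
    "BIM":  (16 * K, 16 * K, 4),
    "G0":   (64 * K, 32 * K, 13),
    "G1":   (64 * K, 64 * K, 21),
    "Meta": (64 * K, 32 * K, 17),
}

pred_bits = sum(p for p, _, _ in config.values())
hyst_bits = sum(h for _, h, _ in config.values())
total_bits = pred_bits + hyst_bits

print(pred_bits // K, hyst_bits // K, total_bits // K)  # 208 144 352
```

The total storage budget is therefore 352 Kbits.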

34. Performance evaluation. Sorry, SPEC 95 :-)

35. Benchmark characteristics. Highly optimized SPECint 95: many more not-taken than taken branches; ratio of lghist/ghist length: from 1.12 to 1.59; from 8.9 to 16.2 branches per 100 instructions.

36. 2Bc-gskew vs other global history predictors

37. History should be longer than log2(size)

38. Quality of information vector

39. Quality of information vector (2). lghist vs ghist: no significant loss of information. Using path or no path does not appear (here) to be an issue. 3-block-old lghist: slight loss of accuracy. EV8 info vector: path information on the last blocks recovers part of the loss associated with the 3-block-old lghist.

40. Reducing some table sizes: no significant impact

41. Qualities of the indexing functions

42. Qualities of the indexing functions (2). Using history bits in the wordline selection is a good tradeoff. Embedding path information in lghist is beneficial. EV8 indexing functions are as good as indexing functions designed with no constraints.

43. Pushing the limits !?

44. Conclusion. The design of a real branch predictor leads to challenges ignored in academic studies: a 3-block-old history vector; the impossibility of maintaining a complete history; multiple parallel accesses to the predictor; minimization of the number of memory arrays; timing constraints on the indexing functions.

45. Summary of the contributions. An efficient information vector can be built by mixing path and compressed history: don't focus on the info vector, use what is convenient! Use of different table sizes and history lengths in the predictor. Sharing of hysteresis bits. Conflict-free parallel access scheme for the predictor. Engineering of the indexing functions.
