
Design tradeoffs for the Alpha EV8 Conditional Branch Predictor

Alpha EV8 (cancelled June 2001). SMT: 4 threads. Wide-issue superscalar processor: 8-way issue, 512 registers. Single-process performance is the goal; multiprocess performance is the bonus. 5-10% overhead for SMT. Challenges on the EV8 conditional branch predictor: high accuracy is needed.


Presentation Transcript


    1. Design tradeoffs for the Alpha EV8 Conditional Branch Predictor. André Seznec, IRISA/INRIA; Stephen Felix, Intel; Venkata Krishnan, Stargen Inc; Yiannakis Sazeides, University of Cyprus

    2. Alpha EV8 (cancelled June 2001). SMT: 4 threads. Wide-issue superscalar processor: 8-way issue, 512 registers.

    3. Challenges on the EV8 conditional branch predictor. High accuracy is needed: 14-cycle minimum misprediction penalty; the silicon budget is large, but don't waste it. Up to 16 predictions per cycle, from two non-contiguous fetch blocks! History vector(s) must be updated with 0 to 16 branch outcomes! Branch history information is 3 fetch blocks old. Various implementation constraints: master the number of physical memory arrays; use of single-ported memory cells; timing constraints on indexing functions.

    4. Alpha EV8 front-end pipeline. Fetches up to two 8-instruction blocks per cycle from the I-cache: a block ends either at the end of an aligned 8-instruction block or on a taken control-flow instruction; up to 16 conditional branches are fetched and predicted per cycle. The next two block addresses must be predicted in a single cycle (critical path): use of a line predictor backed by a complex PC address generator (conditional branch predictor, RAS, jump predictor, ...).

    5. instruction fetch blocks on EV8

    6. PC address generation pipeline

    7. Minimum background in branch prediction (0). Only direction is an issue: targets are recomputed on the fly.

    8. Minimum background in branch prediction (1). Use the past to predict the current branch: read and update tables. 2-bit counters: prediction + hysteresis. Index: address, plus either global branch history (what happened with the last few branches; a single history vector) or local branch history (what happened the last times this branch was executed; must maintain a table of histories).
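
As a reminder of what "prediction + hysteresis" means for a 2-bit counter, here is a minimal sketch. The state encoding (0 to 3, with the upper half predicting taken) and the function names are illustrative assumptions, not taken from the slides.

```python
def predict(counter: int) -> bool:
    """Prediction is the most significant bit of the 2-bit counter (states 0..3)."""
    return counter >= 2


def update(counter: int, taken: bool) -> int:
    """Saturating increment on a taken branch, saturating decrement otherwise."""
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)


# A loop-like branch: taken three times, one not-taken exit, then taken again.
counter = 0
predictions = []
for outcome in [True, True, True, False, True]:
    predictions.append(predict(counter))
    counter = update(counter, outcome)
```

The single not-taken outcome weakens the counter but does not flip the prediction for the next occurrence; that is the role of the hysteresis bit.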

    9. Minimum background in branch prediction (2). Problems: global or local schemes? Which precise scheme? Interference in the tables may ruin accuracy.

    10. Global vs local history. Local history: 16 local history reads: 2-ported history read, 16-ported prediction table? Speculative history: up to 256 branches in flight; 3-fetch-block-old history: tight loops? SMT: sharing is disastrous. Global history: bank-interleaved prediction tables: not even arbitration! Speculative history, 3 fetch blocks old: not a real issue! Use of path! SMT: sharing may be constructive.

    11. EV8 predictor: derived from 2Bc-gskew

    12. 2Bc-gskew: hybrid skewed predictor. Leverages the de-aliased predictor e-gskew, plus bimodal for easy-to-predict branches. State-of-the-art global history branch predictor, at least (published) in 1999 :-), along with the YAGS predictor :-)
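
The way the four components combine can be sketched as follows, assuming the usual published description of 2Bc-gskew: e-gskew is a majority vote over the bimodal table and two skewed global-history tables (G0, G1), and a meta table chooses between that vote and the bimodal prediction alone. The function names and the polarity of the meta bit are assumptions made for illustration.

```python
def majority(a: bool, b: bool, c: bool) -> bool:
    """Majority vote over three direction predictions."""
    return (int(a) + int(b) + int(c)) >= 2


def predict_2bc_gskew(bim: bool, g0: bool, g1: bool, meta_uses_egskew: bool) -> bool:
    """Select between the bimodal prediction and the e-gskew majority vote.

    bim, g0, g1: direction predictions read from the three prediction tables.
    meta_uses_egskew: assumed polarity, True means "trust the majority vote".
    """
    egskew = majority(bim, g0, g1)
    return egskew if meta_uses_egskew else bim
```

For an easy-to-predict branch the meta table can settle on the bimodal component, leaving the G0 and G1 entries free for branches that need history.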

    13. 2Bc-gskew: degrees of freedom. Partial update policy: on correct predictions, only update the correct components, so as not to destroy other predictions; plus two tricks to further minimize write pressure. Better accuracy! USE OF DISTINCT PREDICTION AND HYSTERESIS ARRAYS!! On correct predictions: the prediction bit is only read, the hysteresis bit is only written.

    14. 2Bc-gskew: degrees of freedom (2): sharing hysteresis bits. Using 2-bit counters, strong states occur more often than weak states: very small loss of accuracy when sharing a hysteresis bit between 2 or 4 counters.
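
Sharing can be pictured as indexing the hysteresis array with fewer bits than the prediction array. The sketch below assumes that adjacent prediction entries share one hysteresis bit (dropping low-order index bits); the slides do not say which entries are grouped, so this mapping and the function name are hypothetical.

```python
def hysteresis_index(pred_index: int, share_log2: int = 1) -> int:
    """Map a prediction-array index to its (shared) hysteresis-array index.

    share_log2 = 1: one hysteresis bit per 2 prediction bits.
    share_log2 = 2: one hysteresis bit per 4 prediction bits.
    """
    return pred_index >> share_log2


# With 2-way sharing, a 64K-entry prediction array needs a 32K-entry hysteresis array.
pred_entries = 64 * 1024
hyst_entries = hysteresis_index(pred_entries - 1) + 1
```

Under this mapping, prediction entries 6 and 7 strengthen and weaken the same hysteresis bit, which matters only when the two counters are in different strength states.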

    15. 2Bc-gskew: degrees of freedom (3). Different applications have different optimal history lengths: using different indexing functions allows different history lengths for the predictor tables, which smooths out the difficulty. Different prediction table sizes: the bimodal table may be smaller than the other tables :-)

    16. EV8 predictor

    17. Dealing with implementation constraints

    18. Issues on global history

    19. Block compressed history lghist. Incorporate at most one bit in the history per fetch block: 0, 1 or 2 bits to be incorporated in the history vector per cycle. Which bit? The direction of the last conditional branch in the block (previous ones are not taken), XORed with its position (1st half / 2nd half) in the block: more uniform distribution of the history vectors.
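
The lghist rule above can be sketched as follows. The slot numbering (0 to 7, second half starting at slot 4), the shift direction, and the 21-bit default length (the longest lghist length quoted in the configuration slide) are assumptions for illustration, as are the function names.

```python
from typing import List, Optional, Tuple


def lghist_bit(branches: List[Tuple[int, bool]]) -> Optional[int]:
    """Return the bit one fetch block contributes to lghist, or None if it has none.

    branches: (slot, taken) for each conditional branch in the block, in program
    order, with slot in 0..7. Only the last one matters, since a block ends on a
    taken branch and all earlier branches are therefore not taken.
    """
    if not branches:
        return None  # no conditional branch: nothing is inserted
    slot, taken = branches[-1]
    second_half = slot >= 4  # assumed split of the 8-instruction block
    return int(taken) ^ int(second_half)


def push_lghist(lghist: int, branches: List[Tuple[int, bool]], length: int = 21) -> int:
    """Shift at most one bit per fetch block into the history register."""
    bit = lghist_bit(branches)
    if bit is None:
        return lghist
    return ((lghist << 1) | bit) & ((1 << length) - 1)


h = 0
h = push_lghist(h, [(2, True)])              # taken, first half  -> inserts 1
h = push_lghist(h, [])                       # no branch          -> unchanged
h = push_lghist(h, [(1, False), (6, True)])  # taken, second half -> inserts 0
```

With two fetch blocks per cycle, calling `push_lghist` once per block inserts 0, 1 or 2 bits per cycle, as stated on the slide.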

    20. instruction fetch blocks on EV8

    21. The EV8 branch predictor information vector. History information is not available for the last three blocks A, B, and C, but addresses are available!!

    22. Using single-ported memory arrays

    23. Different arrays for hysteresis and predictions. Prediction uses the most significant bit of a 2-bit counter. Partial update on correct prediction: only strengthening the hysteresis bit (not even always), so only a WRITE on the hysteresis. Misprediction: update = read of hysteresis + write of (prediction + hysteresis) for a single branch.
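
The access pattern described on this slide can be sketched with two separate bit arrays and a few access counters. The encoding of the hysteresis bit as a strength flag (True = strong), the class name, and the counters are assumptions for illustration; the sketch also ignores the "not even always" refinement and models one table entry updated immediately.

```python
class SplitCounterTable:
    """2-bit counters stored as separate prediction and hysteresis bit arrays."""

    def __init__(self, entries: int) -> None:
        self.pred = [False] * entries   # direction bit (MSB of the counter)
        self.hyst = [False] * entries   # assumed encoding: True = strong state
        self.pred_writes = 0
        self.hyst_reads = 0
        self.hyst_writes = 0

    def predict(self, i: int) -> bool:
        return self.pred[i]             # fetch time: prediction array only

    def update(self, i: int, predicted: bool, taken: bool) -> None:
        if predicted == taken:
            # Correct prediction: strengthen, a blind WRITE of the hysteresis bit.
            self.hyst[i] = True
            self.hyst_writes += 1
            return
        # Misprediction: read hysteresis, then write prediction + hysteresis.
        self.hyst_reads += 1
        if not self.hyst[i]:
            self.pred[i] = taken        # weak state: flip the direction
        self.hyst[i] = False            # end in a weak state either way
        self.pred_writes += 1
        self.hyst_writes += 1


table = SplitCounterTable(4)
for outcome in [True, True, False, True]:
    table.update(0, table.predict(0), outcome)
```

In this run the two correct predictions touch only the hysteresis array, and each of the two mispredictions costs one hysteresis read plus one write to each array.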

    24. Bank-interleaved or double-ported branch predictor? Reads of predictions for two 8-instruction blocks. Double-porting: memory cells twice as large; losing half of the entries? Bank-interleaving: need for arbitration; longer critical electrical path; losing throughput; short loops fitting in a single 8-instruction block!?

    25. Conflict free interleaved bank predictor

    26. Conflict free bank-interleaved predictor (2). Conflicts are avoided by construction. The bank number is computed one cycle ahead: not on the critical path.
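
The transcript does not give the bank-selection function, so the following is only one hypothetical way to get "conflict free by construction" with 4 banks: derive a candidate bank from address bits of the previous fetch block, and flip its low bit if it collides with the previous block's bank. Because only information about the previous block is used, the bank number can be ready a cycle ahead. The choice of address bits and the function name are assumptions.

```python
def next_bank(prev_block_addr: int, prev_bank: int) -> int:
    """Pick a bank (0..3) for the next fetch block that differs from prev_bank.

    Hypothetical rule: use byte-address bits 5 and 6 of the previous block
    (an 8-instruction block of 4-byte instructions spans 32 bytes), and flip
    the low bit on a collision with the previous block's bank.
    """
    candidate = (prev_block_addr >> 5) & 0b11
    if candidate == prev_bank:
        candidate ^= 1
    return candidate


# Walk a short sequence of fetch-block addresses, including a tight loop
# that keeps refetching the same block.
banks = [0]
for addr in [0x1000, 0x1020, 0x1020, 0x1020, 0x2040]:
    banks.append(next_bank(addr, banks[-1]))
```

Any two consecutive fetch blocks land in different banks under this rule, even when the same block is fetched repeatedly, so the two reads issued in one cycle never need arbitration.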

    27. «Logical view» vs real implementation. Logical view: 4 tables * 4 banks * 2 (pred. + hyst.) = 32 memory arrays; indexing functions are computed, then arrays are accessed. Real implementation: 4 banks * 2 (pred. + hyst.), with the 4 tables in a single array = 8 memory arrays. No time to lose: start the access and compute part of the index in parallel.

    28. Reading the branch prediction tables

    29. Reading the branch prediction tables (2). Spans over 5/2 cycles. Cycle -1: bank number computation, bank selection. Cycle 0: phase 0: wordline selection; phase 1: column selection. Cycle 1: phase 0: unshuffle permutation.

    30. Constraints on the different parts in the indices. Strong: wordline bits: immediate availability, common to the four logical tables. Medium: column bits: a single 2-input XOR gate. Weak: unshuffle bits: near complete freedom, a full tree of XOR gates if needed.

    31. Designing the indexing functions (1). 6 wordline bits. Must be available at the beginning of the cycle: block address bits, 3-block-old lghist bits, path bits. Tradeoff: address bits emphasize the bimodal component behavior; lghist bits are more uniformly distributed.

    32. Designing the indexing functions (2). Column selection and unshuffle. Favor independence of the four indexing functions: if two (address, history) pairs conflict on a table, then try to avoid repeating the conflict on another table. Guarantee that for a single address, two histories differing by only one or two bits will not map onto the same entry. Favor usage of the whole table: lghist bits are more uniformly distributed than address bits.

    33. EV8 branch predictor configuration. 208 Kbits for prediction and 144 Kbits for hysteresis. «BIM»: 16K + 16K, 4 lghist bits (+ 3-block path). G0: 64K + 32K, 13 lghist bits. G1: 64K + 64K, 21 lghist bits. Meta: 64K + 32K, 17 lghist bits. 4 prediction banks and 4 hysteresis banks.
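
The totals quoted on this slide follow directly from the per-table sizes; a small check is sketched below. The per-bank figures assume the tables are spread evenly over the 4 banks, which the slides imply but do not state explicitly.

```python
# (prediction Kbits, hysteresis Kbits, lghist bits) per logical table, from the slide.
tables = {
    "BIM":  (16, 16, 4),
    "G0":   (64, 32, 13),
    "G1":   (64, 64, 21),
    "Meta": (64, 32, 17),
}

pred_total = sum(pred for pred, _, _ in tables.values())   # Kbits of prediction
hyst_total = sum(hyst for _, hyst, _ in tables.values())   # Kbits of hysteresis
total = pred_total + hyst_total

# Assumption: each of the 4 banks holds an equal slice of every table.
banks = 4
pred_per_bank = pred_total // banks
hyst_per_bank = hyst_total // banks

print(f"prediction: {pred_total} Kbits, hysteresis: {hyst_total} Kbits, total: {total} Kbits")
print(f"per bank: {pred_per_bank} Kbits prediction, {hyst_per_bank} Kbits hysteresis")
```

G0 and Meta use half as many hysteresis bits as prediction bits, which corresponds to sharing one hysteresis bit between 2 counters.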

    34. Performance evaluation. Sorry, SPEC 95 :-)

    35. Benchmark characteristics. Highly optimized SPECint 95: many more not-taken than taken branches; ratio lghist/ghist length: from 1.12 to 1.59; from 8.9 to 16.2 branches per 100 instructions.

    36. 2Bc-gskew vs other global history predictors

    37. History should be longer than log2(size)

    38. Quality of information vector

    39. Quality of information vector (2). Lghist vs ghist: no significant loss of information. Using path or no path does not appear (here) as an issue. 3-block-old lghist: slight loss of accuracy. EV8 info vector: path info on the last blocks recovers part of the loss associated with 3-block-old lghist.

    40. Reducing some table sizes: no significant impact

    41. Qualities of the indexing functions

    42. Qualities of the indexing functions (2). Using history bits in the wordline selection is a good tradeoff. Embedding path information in lghist is beneficial. EV8 indexing functions are as good as indexing functions designed with no constraints.

    43. Pushing the limits !?

    44. Conclusion. Design of a real branch predictor leads to challenges ignored in academic studies: 3-block-old history vector; impossibility of maintaining a complete history; multiple parallel accesses to the predictor; minimization of the number of memory arrays; timing constraints on the indexing functions.

    45. Summary of the contributions. An efficient information vector can be built by mixing path and compressed history: don't focus on the info vector, use what is convenient! Use of different table sizes and history lengths in the predictor. Sharing of hysteresis bits. Conflict-free parallel access scheme for the predictor. Engineering of indexing functions.
