Effective ahead pipelining of instruction block address generation



  1. Effective ahead pipelining of instruction block address generation André Seznec and Antony Fraboulet IRISA/ INRIA

  2. Instruction fetch on wide issue superscalar processors • Fetching 6-10 instructions in parallel on each cycle: • Fetch can be pipelined • I-cache can be banked • Instruction streams are “relatively long” • Fetching itself is not the real issue: next block address generation is critical

  3. Instruction block address generation • One block per cycle • Speculative: accuracy is critical • Accuracy comes with hardware complexity: • Conditional branch predictor • Sequential block address computation • Return address stack read • Jump prediction • Branch target prediction/computation • Final address selection

  4. Using a complex instruction address generator (IAG) • PentiumPro, Alpha EV6, Alpha EV8: • Fast but (relatively) inaccurate IAG responding in a single cycle, backed by a complex multicycle IAG • Loss of a significant part of instruction bandwidth • Overfetching implemented on Alpha EV8 • Trend to deeper pipelines: • Smaller line predictor: less accurate • Deeper pipelined IAG: longer misfetch penalty • 10% misfetches, 3-cycle penalty: 30% bandwidth loss
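
The 30% figure on slide 4 can be checked with a back-of-the-envelope model. This is a minimal sketch, assuming one fetch block is delivered per cycle when there is no misfetch and that each misfetch costs a fixed number of extra cycles. The function name `fetch_cycles_per_block` is hypothetical and does not come from the slides.

```python
def fetch_cycles_per_block(misfetch_rate: float, penalty_cycles: int) -> float:
    """Average cycles per fetch block: 1 cycle when the fast IAG is right,
    plus `penalty_cycles` extra cycles for each misfetch (assumed model)."""
    return 1.0 + misfetch_rate * penalty_cycles


cycles = fetch_cycles_per_block(misfetch_rate=0.10, penalty_cycles=3)
extra_cycles = cycles - 1.0          # extra fetch cycles per block
relative_bandwidth = 1.0 / cycles    # blocks per cycle relative to the ideal

print(f"extra fetch cycles per block: {extra_cycles:.0%}")
print(f"relative fetch bandwidth: {relative_bandwidth:.1%}")
```

Under this model the 30% is the overhead in fetch cycles; the delivered bandwidth is roughly 77% of the ideal one block per cycle.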

  5. Ahead pipelining the IAG • Suggested with multiple block ahead prediction in Seznec, Jourdan, Sainrat and Michaud (ASPLOS 96): • Conventional IAG: • use information available with block A to predict block B • Multiple block ahead prediction: • use information available with A to predict block C (or D) • This paper: • How to really do it
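
The difference between conventional and multiple block ahead prediction can be illustrated with a toy table. This is only a sketch: it assumes a last-value table indexed by the block address alone, whereas a real IAG also uses branch history, and the name `train_ahead_table` is hypothetical.

```python
from typing import Dict, Hashable, Sequence


def train_ahead_table(trace: Sequence[Hashable], distance: int) -> Dict[Hashable, Hashable]:
    """Map each block to the block seen `distance` positions later in the trace.

    distance=1 mimics a conventional IAG (A predicts B);
    distance=2 mimics two-block ahead prediction (A predicts C).
    """
    table: Dict[Hashable, Hashable] = {}
    for i in range(len(trace) - distance):
        table[trace[i]] = trace[i + distance]  # last-value update
    return table


trace = ["A", "B", "C", "D"]
conventional = train_ahead_table(trace, distance=1)
two_ahead = train_ahead_table(trace, distance=2)

print(conventional["A"])  # block predicted from A by the conventional table
print(two_ahead["A"])     # block predicted from A by the two-block ahead table
```

With the ahead table, the lookup for a block can start before the intermediate block address is known, which is what makes a multi-cycle table read tolerable.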

  6. Fetch blocks for us • Rc: B inst / fetch block • Rs: ends with the cache block • Rl: ends with the second block [Figure: fetch block layouts, each starting at the 1st inst]

  7. Fetch blocks (2) • 1nt: 1NT bypassing • aNT: all NT bypassing • 0nt: no NT bypassing [Figure: fetch block starting at the 1st inst and crossing two conditional branches (Cond)]

  8. Fetch blocks (3) • 0NT: no extra CTI • 0NT+: no extra cond [Figure: fetch blocks with Cond and Uncond control-flow instructions, labelled 0nt, 0NT and 0NT+]

  9. Technological hypothesis • Alpha EV8 front-end + twice faster clock: • Cycle = 1 EV8 cycle phase • Cycle = time to cross an 8-to-10-entry multiplexer • Cycle = time to read a 2Kb table and route back the read data as an index • 2 cycles for reading a 16Kb table and routing back the read data as an index

  10. Hierarchical IAG • Complex IAG + Line predictor • Conventional complex IAG spans four cycles: • 3 cycles for conditional branch prediction • 3 cycles for I-cache read and branch target computation • Jump prediction, return stack read • + 1 cycle for final address selection • Line prediction: • a single 2Kb table + 1-bit direction table • select between fallthrough and line predictor read

  11. Hierarchical IAG (2) [Figure: block diagram of the hierarchical IAG; labels: Cond., Jump Pred, RAS, LP, Pred Check, Final Selection, branch target addresses + decode info]

  12. Ahead pipelining the IAG • Same functionalities required: • Final selection: • uses the last cycle in the address generation • Conditional branch predictor • Jump predictor • Return address stack • Branch target prediction: • Recomputation impossible! • Use of a branch target buffer (BTB) • Decode information: MUST BE PREDICTED

  13. Ahead pipelining the IAG (2) • Initiate table reads N cycles ahead with information available: • N-block ahead address • N-block ahead branch history • Problem: significant loss of accuracy • Use of N-block ahead (address, history) + intermediate path!

  14. Inflight path incorporation • Pipelined table access + final logic: • Column selection • Wordline selection • Final logic • Insert one bit of inflight information per intermediate block • Not the same bit for each table in the predictor !
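
The idea of starting a table read early and folding in inflight path bits late can be sketched as follows. This is a toy model under stated assumptions: the XOR index hash, the 2-bit counters, the class name `AheadTable` and the `bit_pos` parameter (which address bit of each intermediate block is used) are all hypothetical choices made for illustration, not the design from the slides.

```python
from typing import List, Sequence


class AheadTable:
    """Toy predictor table whose read is split into an early and a late step."""

    def __init__(self, log2_rows: int, inflight_bits: int) -> None:
        self.row_mask = (1 << log2_rows) - 1
        self.inflight_bits = inflight_bits
        # One wordline holds 2**inflight_bits two-bit counters,
        # all initialised to 1 (weakly not-taken).
        self.counters = [[1] * (1 << inflight_bits) for _ in range(1 << log2_rows)]

    def start_read(self, ahead_addr: int, ahead_hist: int) -> List[int]:
        """Early step: pick the wordline from N-block ahead information only."""
        return self.counters[(ahead_addr ^ ahead_hist) & self.row_mask]

    def path_bits(self, intermediate_addrs: Sequence[int], bit_pos: int) -> int:
        """Collect one bit per intermediate block, oldest block first."""
        assert len(intermediate_addrs) == self.inflight_bits
        sel = 0
        for addr in intermediate_addrs:
            sel = (sel << 1) | ((addr >> bit_pos) & 1)
        return sel

    def finish_read(self, wordline: List[int],
                    intermediate_addrs: Sequence[int], bit_pos: int) -> bool:
        """Late step: the inflight bits select one counter in the wordline."""
        return wordline[self.path_bits(intermediate_addrs, bit_pos)] >= 2


table = AheadTable(log2_rows=4, inflight_bits=2)
wordline = table.start_read(ahead_addr=0x1234, ahead_hist=0b1011)
wordline[0b10] = 3  # train: strongly taken on the path whose bits are 1, 0
taken_on_path = table.finish_read(wordline, [0b100, 0b000], bit_pos=2)
taken_off_path = table.finish_read(wordline, [0b000, 0b000], bit_pos=2)
```

Giving each table of a multi-table predictor a different `bit_pos` would correspond to the remark that the inserted bit is not the same for each table.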

  15. Ahead conditional branch predictor • Global history branch predictor • 512 Kbits 2bc-gskew • 4 cycles prediction delay: • Indexed using 5 blocks ahead (address + history) • One bit per table per intermediate block • Accuracy equivalent to conventional conditional branch predictors

  16. Ahead Branch Target Prediction • Use of a BTB: • Tradeoffs: • Size: 2Kb/16Kb (1 vs 2 cycles) • tagless or tags (+1 cycle) • Associativity: +1 cycle? • Difficulty: • The longer the pipelined read, the larger the number of possible paths, especially if not-taken branches are bypassed

  17. Ahead Branch target prediction (2) • 2Kb is too small: • 16Kb: 2 cycles access time • Associativity is required, but an extra cycle for tag check is disastrous: • tagless + way-prediction! • 2-way skewed-associativity: • to incorporate different inflight information on the two ways

  18. Ahead Branch Target Prediction (3) • Bypassing not-taken branches: • N possible branches: N possible targets • N targets per entry: waste of space • Many entries use a single target • Read of N contiguous entries

  19. Ahead jump predictor, return stack • Jump predictor: history + address • 3 cycles ahead + inflight information • Return address stack as usual: • Direct access to the previous cycle top • Access to the possible address to be pushed

  20. Decode information !! • Needs the decode information of the current block to select among: • Fallthrough • Branch targets • Jump target • Return address stack top • Cannot be obtained from the I-cache: must be predicted!
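
The role of decode information in the final selection stage can be sketched as a simple multiplexer. This is an illustrative sketch only: the `BlockEnd` encoding of decode information, the `Candidates` record and the function `final_selection` are hypothetical names, and a real selection stage would also handle several branches per fetch block.

```python
from dataclasses import dataclass
from enum import Enum, auto


class BlockEnd(Enum):
    """Assumed encoding of the (predicted) decode information of a block."""
    FALLTHROUGH = auto()    # no control-flow instruction ends the block
    COND_BRANCH = auto()    # block ends with a conditional branch
    UNCOND_BRANCH = auto()  # block ends with a direct unconditional branch
    JUMP = auto()           # block ends with an indirect jump
    RETURN = auto()         # block ends with a return


@dataclass
class Candidates:
    """Candidate next-block addresses available at the final stage."""
    fallthrough: int
    branch_target: int
    jump_target: int
    ras_top: int


def final_selection(decode: BlockEnd, cond_taken: bool, c: Candidates) -> int:
    """Pick the next block address from the candidates using decode info."""
    if decode is BlockEnd.COND_BRANCH:
        return c.branch_target if cond_taken else c.fallthrough
    if decode is BlockEnd.UNCOND_BRANCH:
        return c.branch_target
    if decode is BlockEnd.JUMP:
        return c.jump_target
    if decode is BlockEnd.RETURN:
        return c.ras_top
    return c.fallthrough


cands = Candidates(fallthrough=0x1020, branch_target=0x2000,
                   jump_target=0x3000, ras_top=0x4000)
next_addr = final_selection(BlockEnd.COND_BRANCH, cond_taken=False, c=cands)
```

Every branch of the selection depends on `decode`, which is why the ahead pipelined IAG has to predict it rather than wait for the I-cache read.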

  22. Ahead pipelined IAG and decode [Figure: block diagram of the ahead pipelined IAG; labels: Cond., Jump, BTB, RAS, Predicted Decode, Final Selection, Decode Info ??]

  23. Decode information (2) • 1st try (not in the paper): in the BTB • Entries without targets (more capacity misses) • Duplicated decode information if multiple targets • Decode mispredictions, but correct target prediction by jump predictor, RAS, fallthrough • Not very convincing results

  24. Predicting Decode information • Principle: decode information associated with a block is associated with the block address itself whenever possible • Whenever possible: BTB, jump predictor • But not for returns and fallthrough blocks

  25. Predicting decode information (2) • Return block decode prediction: • Return decode prediction table indexed with (one cycle ahead) top of the return stack • Systematic decode misprediction if chaining call and associated return in two successive blocks. Sorry !

  26. Fallthrough block decode prediction • Just after a taken control-flow instruction, no time to read an extra table before the final selection stage: • Store decode information for block A+1 with address A in BTB, jump predictor • Fall through after fall through: • 2-block ahead decode prediction table: • A to read A+2 decode info
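
The 2-block ahead decode prediction table for consecutive fallthrough blocks can be sketched as a small tagless table. This is a minimal sketch under assumptions: the table is direct-mapped, sequential blocks are `block_bytes` apart, decode information is represented as a string, and the class name `TwoBlockAheadDecodeTable` is hypothetical.

```python
from typing import List, Optional


class TwoBlockAheadDecodeTable:
    """Tagless table indexed with block address A that stores the decode
    information of the fallthrough block two blocks later (A+2)."""

    def __init__(self, log2_entries: int, block_bytes: int) -> None:
        self.mask = (1 << log2_entries) - 1
        self.block_bytes = block_bytes
        self.entries: List[Optional[str]] = [None] * (1 << log2_entries)

    def _index(self, block_addr: int) -> int:
        # Drop the offset bits inside a block, keep the low index bits.
        return (block_addr // self.block_bytes) & self.mask

    def predict(self, addr_a: int) -> Optional[str]:
        """Read with A: returns the predicted decode info of block A+2."""
        return self.entries[self._index(addr_a)]

    def update(self, addr_a: int, decode_of_a_plus_2: str) -> None:
        """Once block A+2 has been decoded, store its info under address A."""
        self.entries[self._index(addr_a)] = decode_of_a_plus_2


table = TwoBlockAheadDecodeTable(log2_entries=10, block_bytes=32)
block_a = 0x1000
table.update(block_a, "ends_with_cond_branch")
prediction = table.predict(block_a)
```

Indexing with A rather than A+2 means the read can be launched two blocks early, so the result is available in time for the final selection stage.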

  27. Hierarchical IAG vs ahead pipelined IAG [Figure: pipeline timing diagram over cycles 0 to 3 and -4 to -1; labels: LP, IAG init., LP check, selection, Decode pred, check completed, BTB, JP, Ret dec, 2A dec, CB, selection]

  28. Recovering after a misprediction • Generation of address for the next block after recovery should have begun 4 cycles before recovery: • 4 cycles extra misprediction penalty !? • Unacceptable • Checkpoint/repair mechanism should provide information for a smooth restart

  29. Recovering after a misprediction (2): predicting the next block • There is no time to read any of the tables in the IAG, only time to cross the final selection stage. • The checkpoint must provide everything that was at the entry of the final stage one cycle ahead. • Less than 200 bits for all policies, except bypassing all not-taken branches

  30. Recovering from misprediction (3): 3rd block and subsequent • 3rd block: • BTB and jump predictor cannot provide targets in time: all possible targets must come from the checkpoint • Conditional branch predictions are not available • 4th and 5th block: • Conditional branch predictions are not available • (Approximately) 4 possible sets of entries for the final selection stage in the checkpoint repair, but 2 cycles access time

  31. Recovering from misprediction (4): one or two bubbles restart • If full speed I-fetch resuming is too costly in the checkpoint mechanism, then: • Two bubbles restart: • All possible targets recomputed/predicted • Only conditional predictions for next and third block to be checkpointed • One bubble restart: • Only one set of exits from BTB, jump predictor + conditional branch predictions to be checkpointed

  32. Performance evaluation • SPEC 2000 int + float • Traces of 100 million instructions after skipping the initialization • Immediate update

  33. Average fetch block size (integer applications) • Instruction streams: 9.79 instructions • 2 * 8 inst/cache line: • No bypassing of not-taken branches: 5.81 instructions • Bypassing one not-taken branch: 7.28 instructions • Bypassing all not-taken branches: 7.40 instructions • Bypassing a single not-taken branch is sufficient
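
The conclusion of slide 33 follows from the relative gains between the three bypassing policies. The sketch below only redoes the arithmetic on the averages quoted on the slide; the policy labels 0nt, 1nt and aNT are borrowed from slide 7, and the variable names are hypothetical.

```python
# Average fetch block sizes (instructions) quoted on slide 33.
avg_block_size = {
    "0nt": 5.81,  # no bypassing
    "1nt": 7.28,  # bypassing one branch
    "aNT": 7.40,  # bypassing all not-taken branches
}

# Relative gain of bypassing one branch over no bypassing.
gain_1nt = avg_block_size["1nt"] / avg_block_size["0nt"] - 1.0
# Additional relative gain of bypassing all branches over bypassing one.
gain_ant = avg_block_size["aNT"] / avg_block_size["1nt"] - 1.0

print(f"1nt over 0nt: {gain_1nt:+.1%}")
print(f"aNT over 1nt: {gain_ant:+.1%}")
```

Bypassing one branch enlarges the average fetch block by about a quarter, while bypassing all of them adds less than 2% on top of that.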

  34. Accuracy of IAGs (integer benchmarks) [Figure: mispredictions per kilo-instruction (misp/KI)] • Very similar accuracies

  35. Misfetches: IAG faults corrected at decode time [Figure: misfetches/KI, integer applications]

  36. Misfetches (2) [Figure: misfetches/KI, floating-point applications]

  37. (Maximum) Instruction fetch bandwidth [Figure: integer applications]

  38. Summary • Ahead pipelining the IAG is a valid alternative to the use of a hierarchy of IAGs: • Accuracy in the same range • Significantly higher instruction bandwidth • Main contributions: • Decode prediction • Checkpoint/repair analysis

  39. Future work • Sufficient to feed 4-to-6 way processors • May be a little bit short for 8-way or more processors • We plan to extend the study to: • Decoupled instruction fetch front end (Reinman et al.) • Multiple (non-contiguous) block address generation • Trace caches