  1. CS 7810 Lecture 7
  Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching
  E. Rotenberg, S. Bennett, J.E. Smith, Proceedings of MICRO-29, 1996

  2. Fetching Multiple Blocks
  • Aggressive o-o-o processors will perform poorly if they only fetch a single basic block every cycle
  • Solution (a fetch-loop sketch follows below):
    • Predict multiple branches and targets in a cycle
    • Fetch multiple cache lines in the cycle
    • Initiate the next set of fetches in the next cycle
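
A minimal Python sketch of the fetch loop this implies; BasicBlock, icache_read, and predict are illustrative stand-ins, not structures from the paper:

    from dataclasses import dataclass

    @dataclass
    class BasicBlock:
        instructions: list    # decoded instructions in this block
        ends_in_branch: bool  # block ends in a conditional branch?
        taken_target: int     # fetch address if the branch is taken
        fall_through: int     # fetch address otherwise

    def fetch_cycle(pc, icache_read, predict, max_blocks=3):
        """One cycle: follow up to max_blocks predicted branches."""
        fetched = []
        for _ in range(max_blocks):
            block = icache_read(pc)            # one I-cache access per block
            fetched.extend(block.instructions)
            if block.ends_in_branch:
                taken = predict(pc)            # one of several predictions made this cycle
                pc = block.taken_target if taken else block.fall_through
            else:
                pc = block.fall_through
        return fetched, pc                     # pc seeds the next cycle's fetch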

  3. Without the Trace Cache
  • Stage 1 requires identification of predictions and target addresses
  • Stage 2 requires multi-ported access of the I-cache
  • Stage 3 requires shifting and alignment

  4. Trace Cache
  [Figure: control-flow graph of basic blocks A–G with branch outcomes (0/1); dynamic traces such as A B D, A B E, and A C F are captured as trace-cache entries]
  • Takes advantage of temporal locality and biased branches
  • Does not require multiple I-cache accesses (a lookup sketch follows below)
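
As a rough sketch of the core idea (the direct-mapped structure and all names here are assumptions, not the paper's exact design): a trace-cache line stores a dynamic instruction sequence plus the branch outcomes embedded in it, and a lookup hits only when both the fetch address and the predicted outcomes match.

    from dataclasses import dataclass

    @dataclass
    class TraceLine:
        branch_flags: tuple  # stored outcome of each embedded branch (True = taken)
        instructions: list   # dynamic sequence spanning several basic blocks

    def trace_lookup(cache, pc, predictions):
        """Hit iff a trace starting at pc matches this cycle's predictions."""
        line = cache.get(pc)                  # indexed/tagged by fetch address
        if line is None:
            return None                       # trace miss: fall back to the I-cache
        n = len(line.branch_flags)
        if tuple(predictions[:n]) == line.branch_flags:
            return line.instructions          # whole multi-block trace in one access
        return None

A single hit thus supplies in one access what would otherwise take multiple I-cache accesses plus shifting and alignment.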

  5. Base Case
  • In each cycle, fetch up to three sequential basic blocks

  6. Multiple Branch Predictor
  [Figure: a k-bit global history register indexes the pattern history table (PHT); a MUX selects between the k-bit and (k-1)-bit indices so that multiple predictions can be made per cycle]
  (A Python sketch of making multiple predictions from one history follows below.)
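
One way to get several predictions per cycle from a single history register, sketched in Python (the PHT banking and MUX selection are simplified; updating the history speculatively with each prediction is my assumption):

    K = 10                    # history length k; illustrative value
    PHT = [2] * (1 << K)      # 2-bit saturating counters, initialized weakly taken

    def predict_three(ghr):
        """Return three predictions from the k-bit global history."""
        preds = []
        hist = ghr & ((1 << K) - 1)
        for _ in range(3):
            taken = PHT[hist] >= 2   # high bit of the 2-bit counter
            preds.append(taken)
            # shift the just-made prediction into the history so the
            # next prediction is conditioned on the earlier ones
            hist = ((hist << 1) | int(taken)) & ((1 << K) - 1)
        return preds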

  7. Trace Cache Design
  • The branch predictions can be used to index into the trace cache or for tag comparison (Fig. 4)
  • Keep track of next address (taken and not-taken)
  • Line-fill buffer and merge logic assemble traces (sketched below)
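
A sketch of the line-fill step under the limits given on slide 12 (16 instructions, 3 branches per entry); the interface is hypothetical:

    MAX_INSTRS, MAX_BRANCHES = 16, 3   # per-entry limits (slide 12)

    class LineFillBuffer:
        """Merges blocks leaving fetch into a trace, then commits it."""
        def __init__(self, cache):
            self.cache = cache             # dict: start pc -> (flags, instrs)
            self._reset()

        def _reset(self):
            self.start_pc, self.instrs, self.flags = None, [], []

        def add_block(self, pc, instrs, branch_taken=None):
            """branch_taken is True/False if the block ends in a branch."""
            if self.start_pc is None:
                self.start_pc = pc
            if len(self.instrs) + len(instrs) > MAX_INSTRS:
                self.flush()               # entry full: commit, start a new trace
                self.start_pc = pc
            self.instrs += instrs
            if branch_taken is not None:
                self.flags.append(branch_taken)
                if len(self.flags) == MAX_BRANCHES:
                    self.flush()

        def flush(self):
            if self.instrs:
                self.cache[self.start_pc] = (tuple(self.flags), list(self.instrs))
            self._reset()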

  8. Trace Cache

  9. Design Alternatives
  • Associativity (including path associativity)
  • Partial matches – use all instructions up to the first branch whose stored outcome diverges from the prediction (sketched below)
  • Multiple line-fill buffers
  • Trace selection to reduce conflicts
  • Multi-cycle trace caches?
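
A sketch of the partial-match alternative; branch_positions is a hypothetical per-line table giving the index of each embedded branch within the trace:

    def partial_match(stored_flags, instrs, branch_positions, predictions):
        """Supply instructions up to the first branch whose stored
        outcome disagrees with the current predictions."""
        for i, (stored, predicted) in enumerate(zip(stored_flags, predictions)):
            if stored != predicted:
                return instrs[:branch_positions[i] + 1]   # stop at the divergence
        return instrs   # full match: the entire trace is usable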

  10. Branch Address Cache
  • The BTB maintains 14 addresses (a tree of basic blocks)
  • Based on the branch predictions, three addresses are forwarded to the I-Cache (tree walk sketched below)
  • A BTB extension that allows multiple target prediction:
    • adds pipeline stages
    • can still have I-Cache bank contention
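
A sketch of walking a BAC entry's address tree: 2 + 4 + 8 = 14 stored addresses, with the three predictions selecting one fetch address per level (the entry layout is an assumption for illustration):

    def select_addresses(entry, predictions):
        """entry[level] holds 2**(level+1) target addresses (2 + 4 + 8 = 14);
        predictions is a sequence of three booleans, True = taken."""
        path, addrs = 0, []
        for level, taken in enumerate(predictions):
            path = (path << 1) | int(taken)   # extend the path through the tree
            addrs.append(entry[level][path])  # fetch address for this block
        return addrs                          # three addresses, one per I-cache bank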

  11. Collapsing Buffer
  • Can detect taken branches within a single cache line and splice out the intervening instructions (sketched below)
  • Also suffers from merge-logic delay and I-Cache bank contention
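
Roughly, the collapsing step splices out the instructions between an intrablock taken branch and its (forward) target; an illustrative one-liner, not the paper's hardware:

    def collapse(cache_line, branch_idx, target_idx):
        """Drop instructions between a taken branch and its target when
        both fall within the same cache line."""
        return cache_line[:branch_idx + 1] + cache_line[target_idx:]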

  12. Methodology
  • Very aggressive o-o-o processor – large window (2048 instrs), unlimited resources, no artificial dependences, no cache misses
  • SPEC92-Int and Instruction Benchmark Suite (IBS)
  • Trace cache – 64 entries, 16 instrs and 3 branches per entry – 712 tag bytes and 4KB worth of instructions (arithmetic check below) – I-Cache is 128KB
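
A quick check of the instruction-storage figure, assuming 4-byte instructions:

    entries, instrs_per_entry, bytes_per_instr = 64, 16, 4
    print(entries * instrs_per_entry * bytes_per_instr)   # 4096 bytes = 4KB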

  13. Results
  • Fetching three sequential basic blocks (SEQ.3) is not much more complex than fetching one – IPC improvement of ~15%
  • Trace cache outperforms BAC and CB – note that the latter two can't handle all kinds of trace patterns and suffer from I-Cache bank contention
  • TC outperforms SEQ.3 by 12%
  • BAC and CB do worse than SEQ.3 if they increase front-end latency

  14. Ideal Fetch
  • The trace cache is within 20% of ideal fetch
  • The trace miss rate is fairly high – 18-76%
  • Up to 60% of instructions do not come from the trace cache
  • A larger trace cache comes within 10% of ideal fetch – note that the front-end is the bottleneck in this processor

