A Scalable Front-End Architecture for Fast Instruction Delivery


Presentation Transcript


  1. A Scalable Front-End Architecture for Fast Instruction Delivery Paper by: Glenn Reinman, Todd Austin and Brad Calder Presenter: Alexander Choong

  2. Conventional Pipeline Architecture • High-performance processors can be broken down into two parts • Front-end: fetches and decodes instructions • Execution core: executes instructions

  3. Front-End and Pipeline [Diagram: a simple front-end, with Fetch followed by Decode repeated in successive cycles]

  4. Front-End with Prediction [Diagram: the simple front-end with a Predict stage added alongside Fetch and Decode in each cycle]

  5. Front-End Issues I • Flynn’s bottleneck: • IPC is bounded by the number of instructions fetched per cycle • Implies: as execution performance increases, the front-end must keep up to ensure overall performance
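
A concrete illustration (using the baseline machine described later in this talk): if the execution core can issue 16 instructions per cycle but the front-end fetches only 8 instructions per cycle, sustained IPC can never exceed 8, no matter how fast the core is.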

  6. Front-End Issues II • Two opposing forces • Designing a faster front-end pushes toward increasing the I-cache size • The interconnect scaling problem (wire performance does not scale with feature size) pushes toward decreasing the I-cache size

  7. Key Contributions I

  8. Key Contributions: Fetch Target Queue • Objective • Avoid using a large cache with branch prediction • Purpose • Decouple the I-cache from branch prediction • Results • Improves throughput

  9. Key Contributions: Fetch Target Buffer • Objective • Avoid large caches with branch prediction • Implementation • A multi-level buffer • Results • Delivers performance 25% better than a single-level design • Scales better with “future” feature sizes

  10. Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion

  11. Fetch Target Queue • Decouples the I-cache from branch prediction • Branch predictor can generate predictions independent of when the I-cache uses them [Diagram: simple front-end, with Fetch and Predict coupled in each cycle]

  12. Fetch Target Queue • Decouples the I-cache from branch prediction • Branch predictor can generate predictions independent of when the I-cache uses them [Diagram: front-end with the FTQ, where Predict runs ahead of Fetch]

  13. Fetch Target Queue • Fetch and predict can have different latencies • Allows the I-cache to be pipelined • As long as they have the same throughput
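
A minimal C sketch of this decoupling (queue depth, field names, and types are illustrative assumptions, not taken from the paper): the branch predictor enqueues fetch-block descriptors into a small ring buffer, and the I-cache stage dequeues them whenever it is ready, so the two only have to agree on throughput, not latency.

      #include <stdbool.h>
      #include <stdint.h>

      #define FTQ_ENTRIES 8                  /* illustrative queue depth */

      typedef struct {                       /* one predicted fetch block */
          uint32_t start_pc;                 /* first instruction of the block */
          uint32_t fallthrough_pc;           /* address just past the block */
          bool     taken;                    /* predicted direction of the ending branch */
          uint32_t target_pc;                /* predicted target if taken */
      } ftq_entry;

      typedef struct {
          ftq_entry q[FTQ_ENTRIES];
          unsigned  head, tail, count;
      } ftq;

      /* Predictor side: enqueue a fetch block if there is room. */
      static bool ftq_push(ftq *f, ftq_entry e) {
          if (f->count == FTQ_ENTRIES) return false;   /* FTQ full: predictor stalls */
          f->q[f->tail] = e;
          f->tail = (f->tail + 1) % FTQ_ENTRIES;
          f->count++;
          return true;
      }

      /* I-cache side: dequeue the next fetch block when the fetch stage is ready. */
      static bool ftq_pop(ftq *f, ftq_entry *out) {
          if (f->count == 0) return false;             /* FTQ empty: fetch stalls */
          *out = f->q[f->head];
          f->head = (f->head + 1) % FTQ_ENTRIES;
          f->count--;
          return true;
      }

Because the producer and consumer touch only the queue, the I-cache can be pipelined over several cycles while the predictor keeps running ahead.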

  14. Fetch Blocks • FTQ stores fetch blocks • A fetch block is a sequence of instructions • Starting at a branch target • Ending at a strongly biased branch • Instructions are fed directly into the pipeline

  15. Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion

  16. Fetch Target Buffer: Outline • Review: Branch Target Buffer • Fetch Target Buffer • Fetch Blocks • Functionality

  17. Review: Branch Target Buffer I • Previous work (Perleberg and Smith [2]) • Makes fetch independent of predict [Diagram: simple front-end vs. a front-end with a Branch Target Buffer, where Fetch and Predict proceed in parallel]

  18. Review: Branch Target Buffer II • Characteristics • Hash table • Makes predictions • Caches prediction information
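
A minimal C sketch of that structure (table size, tag layout, and the 2-bit counter are illustrative assumptions): the fetch PC hashes into a small table whose entries cache the branch's target and prediction state, so a hit produces a predicted next PC right away.

      #include <stdbool.h>
      #include <stdint.h>

      #define BTB_INDEX_BITS 9
      #define BTB_ENTRIES (1u << BTB_INDEX_BITS)     /* illustrative: 512 entries */

      typedef struct {
          uint32_t tag;          /* upper PC bits identifying the branch */
          uint32_t target;       /* cached branch target address */
          uint8_t  counter;      /* 2-bit saturating prediction counter */
          bool     valid;
      } btb_entry;

      static btb_entry btb[BTB_ENTRIES];

      /* Look up the fetch PC; on a hit, produce a predicted next PC in the same cycle. */
      static bool btb_lookup(uint32_t pc, uint32_t *next_pc) {
          unsigned idx = (pc >> 2) & (BTB_ENTRIES - 1);      /* simple hash: low PC bits */
          uint32_t tag = pc >> (2 + BTB_INDEX_BITS);
          if (!btb[idx].valid || btb[idx].tag != tag)
              return false;                                  /* miss: fall back to PC + 4 */
          *next_pc = (btb[idx].counter >= 2) ? btb[idx].target : pc + 4;
          return true;
      }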

  19. Review: Branch Target Buffer III [Diagram: BTB indexed by the PC]

  20. FTB Optimizations over BTB • Multi-level • Solves a conundrum • Need a small cache (for fast access) • Need enough space to successfully predict branches

  21. FTB Optimizations over BTB • Oversize bit • Indicates if a block is larger than a cache line • With a multi-port cache • Allows several smaller blocks to be loaded at the same time

  22. FTB Optimizations over BTB • Only stores a partial fall-through address • The fall-through address is close to the current PC • Only need to store an offset
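
A small C sketch of this encoding (the 5-bit width and 4-byte instructions are assumptions for illustration): only the distance from the block's start PC is stored, and the full fall-through address is rebuilt on a hit.

      #include <stdint.h>

      #define FALLTHRU_BITS 5     /* illustrative: enough for short fetch blocks */

      /* Pack: keep only the distance (in instructions) from the block's start PC.
         Blocks whose fall-through lies farther away are simply not encoded this way. */
      static uint8_t encode_fallthrough(uint32_t start_pc, uint32_t fallthrough_pc) {
          return (uint8_t)(((fallthrough_pc - start_pc) >> 2) & ((1u << FALLTHRU_BITS) - 1));
      }

      /* Unpack: rebuild the full fall-through address from the start PC and the offset. */
      static uint32_t decode_fallthrough(uint32_t start_pc, uint8_t offset) {
          return start_pc + ((uint32_t)offset << 2);
      }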

  23. FTB Optimizations over BTB • Doesn’t store every block: • Fall-through blocks • Blocks that are seldom taken

  24. Fetch Target Buffer • Target: target of the branch that ends the block • Type: conditional, subroutine call/return • Oversize: set if block size > cache line [Diagram: FTB entry fields used to form the next PC]
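
Put together, an FTB entry might look roughly like the C struct below (field widths are illustrative; a real design packs these into a handful of hardware bits, with the partial fall-through offset from the earlier slide):

      #include <stdint.h>

      typedef struct {
          uint32_t tag;              /* identifies the fetch block's start PC */
          uint32_t target;           /* target of the branch that ends the block */
          uint8_t  fallthru_offset;  /* partial fall-through address (a few bits) */
          uint8_t  branch_type;      /* conditional, subroutine call, or return */
          uint8_t  oversize;         /* 1 if the block is larger than a cache line */
          uint8_t  valid;
      } ftb_entry;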

  25. Fetch Target Buffer

  26. PC used as index into FTB

  27. L1 Hit [Diagram: the lookup hits in the L1 FTB]

  28. Branch Not Taken [Diagram: L1 hit with the branch predicted not taken; fetch continues at the fall-through address]

  30. Branch Taken [Diagram: L1 hit with the branch predicted taken; fetch continues at the branch target]

  31. L1 Miss [Diagram: the L1 FTB misses; the front-end falls through to the next sequential block]

  32. L1 Miss, L2 Hit [Diagram: the L1 FTB misses and the front-end falls through; the L2 FTB hits after an N-cycle delay]

  33. L1 and L2 Miss [Diagram: both FTB levels miss; the fall-through guess eventually results in a misprediction]
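
The walkthrough on the last few slides can be summarised in roughly the following self-contained C sketch (the function names, the stubbed lookups, and the fall-through distance are illustrative stand-ins, not the paper's interfaces):

      #include <stddef.h>
      #include <stdint.h>

      #define FETCH_BLOCK_BYTES 64    /* illustrative fall-through distance on a miss */

      typedef struct {                /* simplified view of an L1 FTB hit */
          uint32_t target;            /* predicted target if taken */
          uint32_t fallthru;          /* fall-through address if not taken */
          int      taken;             /* prediction for the branch ending the block */
      } ftb_hit;

      /* Stubs standing in for the real L1/L2 FTB arrays. */
      static ftb_hit *ftb_l1_lookup(uint32_t pc) { (void)pc; return NULL; }
      static void     ftb_l2_probe(uint32_t pc)  { (void)pc; }  /* L2 hit, if any, arrives after an N-cycle delay */

      /* Pick the start PC of the next fetch block (slides 26-33). */
      static uint32_t next_fetch_pc(uint32_t pc) {
          ftb_hit *e = ftb_l1_lookup(pc);
          if (e)                                   /* L1 hit */
              return e->taken ? e->target          /* predicted taken     -> target */
                              : e->fallthru;       /* predicted not taken -> fall through */
          ftb_l2_probe(pc);                        /* L1 miss: probe the larger, slower L2 */
          return pc + FETCH_BLOCK_BYTES;           /* fall through for now; if the L2 also
                                                      misses, this guess is later squashed
                                                      as a misprediction */
      }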

  34. Hybrid branch prediction • Meta-predictor selects between • Local history predictor • Global history predictor • Bimodal predictor
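
A rough C sketch of such a hybrid scheme (the table size, selection encoding, and stubbed component predictors are all illustrative): each component produces a taken/not-taken guess, and a meta-predictor table indexed by the PC chooses which component to believe.

      #include <stdbool.h>
      #include <stdint.h>

      #define META_ENTRIES 1024      /* illustrative meta-predictor table size */

      /* 0 = use bimodal, 1 = use local history, 2 = use global history (illustrative encoding) */
      static uint8_t meta[META_ENTRIES];

      /* Component predictors, stubbed out; each returns a taken/not-taken guess. */
      static bool bimodal_predict(uint32_t pc)        { (void)pc; return false; }
      static bool local_history_predict(uint32_t pc)  { (void)pc; return false; }
      static bool global_history_predict(uint32_t pc) { (void)pc; return true;  }

      static bool hybrid_predict(uint32_t pc) {
          unsigned idx = (pc >> 2) & (META_ENTRIES - 1);
          switch (meta[idx]) {
          case 1:  return local_history_predict(pc);
          case 2:  return global_history_predict(pc);
          default: return bimodal_predict(pc);
          }
      }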

  35. Branch Prediction [Diagram: global history predictor]

  36. Branch Prediction

  37. Committing Results • When full, the SHQ (speculative history queue) commits its oldest value to the local history or the global history
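
A minimal sketch of that commit step (the queue depth, table sizes, and the assumption that each entry updates either the global or one local history are illustrative guesses, not details from the paper):

      #include <stdbool.h>
      #include <stdint.h>

      #define SHQ_ENTRIES 8                   /* illustrative queue depth */

      typedef struct {
          uint32_t pc;                        /* branch this speculative outcome belongs to */
          bool     taken;                     /* speculative outcome */
          bool     is_global;                 /* which history it should update */
      } shq_entry;

      static shq_entry shq[SHQ_ENTRIES];
      static unsigned  shq_head, shq_count;

      static uint32_t global_history;         /* architectural global history register */
      static uint8_t  local_history[1024];    /* architectural per-branch local histories */

      /* Called when the SHQ is full: retire the oldest speculative outcome into real history. */
      static void shq_commit_oldest(void) {
          shq_entry *e = &shq[shq_head];
          if (e->is_global)
              global_history = (global_history << 1) | (uint32_t)e->taken;
          else {
              unsigned idx = (e->pc >> 2) & 1023;
              local_history[idx] = (uint8_t)((local_history[idx] << 1) | (uint8_t)e->taken);
          }
          shq_head = (shq_head + 1) % SHQ_ENTRIES;
          shq_count--;
      }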

  38. Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion

  39. Experimental Methodology I • Baseline Architecture • Processor • 8-instruction fetch with 16-instruction issue per cycle • 128-entry reorder buffer with a 32-entry load/store buffer • 8-cycle minimum branch misprediction penalty • Cache • 64K 2-way instruction cache • 64K 4-way data cache (pipelined)

  40. Experimental Methodology II • Timing Model • CACTI cache compiler • Models on-chip memory • Modified for 0.35 um, 0.18 um, and 0.10 um processes • Test set • 6 SPEC95 benchmarks • 2 C++ programs

  41. Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion

  42. Comparing FTB to BTB • FTB provides slightly better performance • Tested for various cache sizes: 64, 256, 1K, 4K, and 8K entries

  43. Comparing Multi-level FTB to Single-Level FTB • Two-level FTB performance • Smaller average fetch size • Two-level: 6.6 • Single-level: 7.5 • Higher accuracy on average • Two-level: 83.3% • Single-level: 73.1% • Higher performance • 25% average speedup over the single-level design

  44. Fall-through Bits Used • Number of fall-through bits needed: 4-5 • Because fetch distances over 16 instructions do not improve performance
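
One way to see why 4-5 bits are enough (assuming the offset is stored at instruction granularity): a fall-through distance capped at 16 instructions needs only log2(16) = 4 offset bits, with a fifth bit giving some slack for longer or differently aligned blocks.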

  45. FTQ Occupancy • Roughly indicates throughput • On average, the FTQ is • Empty: 21.1% of the time • Full: 10.7% of the time

  46. Scalability • The two-level FTB scales well with feature size • Higher slope is better

  47. Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion

  48. Analysis • 25% improvement in IPC over the best-performing single-level designs • System scales well with feature size • On average, FTQ is non-empty 21.1% of the time • The FTB design requires at most 5 bits for the fall-through address

  49. Conclusion • FTQ and FTB design • Decouples the I-cache from branch prediction • Produces higher throughput • Uses a multi-level buffer • Produces better scalability

  50. References • [1] A Scalable Front-End Architecture for Fast Instruction Delivery. Glenn Reinman, Todd Austin, and Brad Calder. ACM/IEEE 26th Annual International Symposium on Computer Architecture, May 1999. • [2] Branch Target Buffer Design and Optimization. Chris Perleberg and Alan Smith. Technical Report, December 1989.
