CS252 Project Presentation Optimizing the Leon Soft Core

foster

Presentation Transcript


  1. CS252 Project Presentation: Optimizing the Leon Soft Core

  2. Project Outline
     • Goal: Reduce the size of Leon on FPGAs
     • Our motivation for using Leon:
       • RAMP research: emulation of multiprocessors
     • Analysis:
       • LUT breakdown
     • Optimizations:
       • Circuit Level
       • Architectural Level

  3. Leon Overview
     • 32-bit SPARC V8 compliant processor
     • 7-stage, in-order pipeline
     • Separate L1 instruction and data caches
       • Configurable cache size, associativity, and replacement policy
     • Optional Memory Management Unit (MMU)
     • AMBA bus interface to memory and peripherals
     • Supports symmetric multiprocessing (SMP)
     • Open source (Gaisler Research)

  4. Area analysis
     • Configuration
       • MMU: combined I/D TLB, only 2 entries
       • Integer MUL/DIV enabled
       • Cache: direct-mapped I/D caches
     • Variables
       • DSU: Debug Support Unit
       • Target clock
         • 20 MHz: easy to achieve
         • 200 MHz: over-constrained

  5. Resource breakdown

  6. Why it’s BIG
     • Debugging support
       • More MUXes
       • One additional pipeline stage
       • Useful for RAMP emulation / bootstrapping
     • The integer unit (IU) accounts for over 50% of the LUTs
       • Barrel shifter
       • Pipeline control (forwarding)

  7. Circuit Level Optimizations
     • Store LRU bits in Block RAMs instead of flip-flops
       • Also saves LUTs
     • One-hot encoding for signals
       • The synthesis tool already does a good job of one-hot encoding many signals (e.g., state encoding)
       • We applied this to the cache output: instead of data(set), we use data(0) or data(1) or data(2) or data(3), with each way masked by its one-hot select bit
       • Useful only for multiway caches
       • LUT savings: ~100 LUTs
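
The one-hot cache-output selection can be illustrated behaviourally. What follows is a minimal Python sketch, not HDL. It assumes a 4-way set-associative cache with 32-bit ways and a one-hot hit vector, and the function names are hypothetical rather than taken from the Leon sources. It only shows that the binary-indexed MUX and the AND-OR form select the same word. Whether the AND-OR form actually saves LUTs depends on the synthesis tool and FPGA; the ~100-LUT figure above is specific to this project.

```python
from typing import List, Sequence

WIDTH = 32
MASK = (1 << WIDTH) - 1


def binary_mux(data: Sequence[int], sel: int) -> int:
    """Binary-encoded select: data(set), a multiplexer indexed by the way number."""
    return data[sel]


def onehot_mux(data: Sequence[int], hit: Sequence[int]) -> int:
    """One-hot select: AND each way with its replicated select bit, then OR the results.

    Assumes exactly one bit of `hit` is set, as after a cache hit.
    """
    out = 0
    for way_data, h in zip(data, hit):
        out |= way_data & (MASK if h else 0)
    return out


def to_onehot(sel: int, ways: int) -> List[int]:
    """Convert a binary way number into a one-hot vector of length `ways`."""
    return [1 if i == sel else 0 for i in range(ways)]


ways = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
for s in range(len(ways)):
    # Both select styles must return the same way's data.
    assert onehot_mux(ways, to_onehot(s, len(ways))) == binary_mux(ways, s)
```

In hardware terms, the one-hot form replaces the decode-then-select tree with a single AND-OR level per bit. The hit signals already provide the one-hot vector, so no encoder to a binary set number is needed.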

  8. Circuit Level Optimizations
     • Use fast carry-chain logic
       • Provided 30% savings in LUT usage for TLB entries
     • Use multipliers for the barrel shifter
       • A left shift by b is the same as multiplication by 2^b
       • Savings of ~100 LUTs
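
The shift-by-multiplication identity can be checked in software. This is a minimal Python sketch with hypothetical function names, modelling 32-bit logical shifts as multiplications. The left shift keeps the low word of x · 2^b. The right-shift variant keeps the high word of the 64-bit product x · 2^(32−b), and it is our own assumption about how a multiplier could also cover right shifts. The slides do not say how the design handles them, and arithmetic shifts and rotates are not modelled here.

```python
MASK32 = (1 << 32) - 1


def shl_by_mul(x: int, b: int) -> int:
    """32-bit logical left shift as a multiply: low word of x * 2**b."""
    assert 0 <= x <= MASK32 and 0 <= b < 32
    return (x * (1 << b)) & MASK32


def shr_by_mul(x: int, b: int) -> int:
    """32-bit logical right shift as a multiply: high word of x * 2**(32 - b).

    For b == 0 the factor is 2**32, which needs a 33-bit multiplier operand.
    """
    assert 0 <= x <= MASK32 and 0 <= b < 32
    product = x * (1 << (32 - b))  # fits in 64 bits because x < 2**32
    return product >> 32


for x in (0, 1, 0x80000000, 0xDEADBEEF, MASK32):
    for b in range(32):
        assert shl_by_mul(x, b) == (x << b) & MASK32
        assert shr_by_mul(x, b) == x >> b
```

The power of two 2^b (or 2^(32−b)) is a one-hot decode of the shift amount. The multiplier's partial-product array then does the work that the barrel shifter's MUX levels would otherwise do in LUTs.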

  9. LUTs for Integer Mul / Div
     • 2195 / 18429* LUTs for the entire two-core system (12%)
     • 11.5% of a Leon3 core
     * Reported by Xilinx ISE

  10. Didn’t your mother teach you to share?
     • Savings of ~350 LUTs for the prototype
       • Only the multiplier shared
       • Only two cores
     • With more cores sharing, 10% could become 5%... 2.5%... 1%...
     • Even more for MAC (multiply-accumulate)
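
One way to read the progression on this slide is that a single multiplier/divider is shared by progressively more cores, so its area per core shrinks in proportion to the number of sharers. The following is a minimal Python model under that reading. It is idealized: it ignores the arbitration and multiplexing logic that sharing adds and any performance loss from contention. The function name is hypothetical.

```python
def per_core_fraction(unit_fraction: float, cores_sharing: int) -> float:
    """Share of each core's area taken by one unit shared among `cores_sharing` cores.

    Idealized model: arbitration/multiplexing overhead is ignored.
    """
    if cores_sharing < 1:
        raise ValueError("cores_sharing must be at least 1")
    return unit_fraction / cores_sharing


for n in (1, 2, 4, 10):
    print(f"{n:2d} cores sharing: {per_core_fraction(0.10, n):.1%} per core")
```

Starting from the slide's ~10% figure, sharing among 2, 4, and 10 cores gives 5%, 2.5%, and 1% per core. In a real design the arbiter's own LUTs would eat into these savings.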

  11. Operand MUXes (figure): a 32-bit 7-to-1 MUX and a 32-bit 5-to-1 MUX

  12. Operand MUXes
     • 313 LUTs + 64 MUXes each

  13. Integer Pipeline Changes
     • Remove all forwarding
       • Single thread: just stall
       • Fine-grained multithreading could boost performance
     • LUTs saved: 27-37%
     • Maximum frequency improvement: 20%
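
A toy issue model shows why removing forwarding pairs well with fine-grained multithreading. The Python sketch below is our own illustration, not the project's pipeline, and it rests on three assumptions. First, a result becomes readable a fixed number of cycles after issue when there is no forwarding; 3 cycles is used here, chosen arbitrarily. Second, the pipeline issues one instruction per cycle. Third, threads are selected round-robin, skipping any thread whose operands are not ready. Under this model a single thread must stall on every dependency, while interleaving enough threads hides the latency.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination register, source registers)


def cycles_to_issue(threads: Sequence[List[Instr]], latency: int) -> int:
    """Cycles needed to issue every instruction with no forwarding.

    A result issued at cycle t is readable from cycle t + latency. Each cycle,
    threads are polled round-robin and the first one whose next instruction has
    ready operands issues; if none is ready, the cycle is a bubble (stall).
    """
    queues = [deque(t) for t in threads]
    ready_at: List[Dict[str, int]] = [{} for _ in threads]  # per-thread registers
    cycle, start = 0, 0
    while any(queues):
        for k in range(len(queues)):
            tid = (start + k) % len(queues)
            if not queues[tid]:
                continue
            dest, srcs = queues[tid][0]
            if all(ready_at[tid].get(r, 0) <= cycle for r in srcs):
                queues[tid].popleft()
                ready_at[tid][dest] = cycle + latency
                start = (tid + 1) % len(queues)  # next poll starts after this thread
                break
        cycle += 1
    return cycle


# A chain of four dependent instructions (each reads the previous result).
chain: List[Instr] = [("r1", ()), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]

for n in (1, 2, 3):
    total = cycles_to_issue([chain] * n, latency=3)
    print(f"{n} thread(s): {4 * n} instructions in {total} cycles")
```

With these assumptions, one thread issues 4 instructions in 10 cycles. Three interleaved threads issue 12 instructions in 12 cycles with no bubbles. The real benefit depends on the actual pipeline depth and the number of hardware threads.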

  14. Conclusions
     • CAD tools already perform many optimizations
       • Remove unused logic
       • Infer technology-dependent logic from HDL source, e.g., fast carry-chain logic
       • Optimize logic globally

  15. Conclusions
     • Optimization is possible
     • Higher levels yield (much) greater savings
       • Circuit level: 200-300 LUTs
       • Architectural level: 1000+ LUTs
       • Sharing: ~700 LUTs per core
     • Total: 35-40% savings