
CS252 Project Presentation: Optimizing the Leon Soft Core


Project Outline

  • Goal: Reduce the size of Leon on FPGAs

    • Our motivation for using Leon:

      • RAMP research: emulation of multiprocessors

  • Analysis:

    • LUT breakdown

  • Optimizations:

    • Circuit Level

    • Architectural Level


Leon Overview

  • 32-bit SPARC V8 compliant processor

    • 7-stage, in-order pipeline

    • Separate L1 Instruction & Data caches

      • Configurable cache size, associativity, replacement policy

    • Optional Memory Management Unit

    • AMBA bus interface to memory and peripherals

  • Supports Symmetric Multiprocessing

  • Open-source (Gaisler Research)


Area Analysis

  • Configuration

    • MMU: Combined I/D-TLB, 2-entry only

    • Integer MUL/DIV enabled

    • Cache: Direct-mapped I and D caches

  • Variables

    • DSU - Debug support unit

    • Target clock

      • 20 MHz - easy to achieve

      • 200 MHz - over-constrained



Why it’s BIG

  • Debugging Support

    • More MUXes

    • One additional pipeline stage

    • Useful for RAMP emulation / bootstrapping

  • Integer unit (IU) is over 50% of the area

    • Barrel shifter

    • Pipeline control (forwarding)


Circuit Level Optimizations

  • Store LRU bits in Block RAMs instead of Flip Flops

    • Also saves LUTs

  • One-hot encoding for signals

    • Synthesis tool does a good job of 1-hot encoding for many signals (e.g., state encoding)

    • Applied this to the cache output

      • Instead of indexing data(set) with an encoded set number, select among data(0), data(1), data(2) and data(3) with one-hot signals

      • Useful only for multiway caches

      • LUT savings: ~ 100 LUTs
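The one-hot cache output selection above can be modelled in a few lines of Python. This is only a behavioural sketch, not the Leon VHDL: the function names, the 4-way configuration, and the 32-bit word width are assumptions made for illustration. It shows that gating each way with its own hit bit and OR-ing the results gives the same value as indexing with an encoded set number.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit cache data words


def select_encoded(data, way):
    """Index-based select: data(set) with a binary-encoded way number."""
    return data[way]


def select_one_hot(data, hit):
    """AND-OR select: gate each way with its own hit bit, then OR the results."""
    result = 0
    for word, h in zip(data, hit):
        gate = -h & MASK32  # all ones when h == 1, zero when h == 0
        result |= word & gate
    return result


def one_hot(way, ways=4):
    """Hypothetical helper: build a one-hot hit vector for the given way."""
    return [1 if i == way else 0 for i in range(ways)]


data = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
for way in range(4):
    # Both selection styles must agree for every way.
    assert select_one_hot(data, one_hot(way)) == select_encoded(data, way)
```

In hardware the AND-OR form avoids decoding the set number again, which is where a LUT saving could come from; the Python model only checks functional equivalence.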


Circuit Level Optimizations

  • Use fast-carry chain logic

    • Provided 30% savings in LUT usage for TLB entries

  • Multipliers for barrel shifter

    • Left shift by b is the same as multiplication by 2^b (a right shift by b can be read from the upper half of the product with 2^(32-b))

    • Savings of ~ 100 LUTs
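The shift-by-multiply idea can be checked with a small Python model. It is a sketch under assumptions: 32-bit operands, logical (not arithmetic) shifts, and a multiplier that returns a 64-bit product. The function names are hypothetical and do not come from the Leon source.

```python
MASK32 = 0xFFFFFFFF


def shift_left_mul(x, b):
    """Logical left shift by b (0..31): low 32 bits of x * 2**b."""
    return ((x & MASK32) * (1 << b)) & MASK32


def shift_right_mul(x, b):
    """Logical right shift by b (0..31): upper 32 bits of x * 2**(32 - b)."""
    if b == 0:
        return x & MASK32  # 2**32 does not fit in a 32-bit multiplier operand
    product = (x & MASK32) * (1 << (32 - b))  # 64-bit product
    return (product >> 32) & MASK32  # take the upper word of the product


# Compare against Python's native shifts for a few sample values.
for x in (0x00000001, 0x80000001, 0xDEADBEEF, 0xFFFFFFFF):
    for b in range(32):
        assert shift_left_mul(x, b) == (x << b) & MASK32
        assert shift_right_mul(x, b) == x >> b
```

The `>> 32` in `shift_right_mul` only picks the upper word of the product; in hardware that is wiring, not another shifter.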


LUTs for Integer Mul / Div

  • 2195 / 18429* LUTs for the entire two-core system (12%)

  • 11.5% of Leon3 core

  • *(Xilinx ISE)


Didn’t your mother teach you to share?

  • Savings of ~350 LUTs for prototype

    • Only multiplier shared

    • Only two cores

  • 10% could become 5% ... 2.5% ... 1% ... as more cores share

  • Even more for MAC
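The amortization argument above is simple arithmetic, sketched here in Python. The model is an assumption: it treats the shared unit's cost as a fixed percentage of one core and divides it evenly across the cores, ignoring any arbitration or extra MUX logic that sharing would add. The function name is hypothetical.

```python
def amortized_share(unit_share_percent, cores):
    """Per-core cost of a unit shared by `cores` cores, as a percent of one core."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return unit_share_percent / cores


# Roughly 10% of a core when private; the share shrinks as more cores use it.
for n in (1, 2, 4, 10):
    print(f"{n:2d} cores: {amortized_share(10.0, n):.2f}% per core")
```

A real design would also have to account for contention: more cores sharing one multiplier means more cycles spent waiting for it.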


Operand MUXes:

  • 32-bit, 7-to-1 MUX

  • 32-bit, 5-to-1 MUX


Operand MUXes

  • 313 LUTs + 64 MUXes each


Integer Pipeline Changes

  • Remove all forwarding

    • Single thread: Just stall

    • Fine Grain Multithreading could boost performance

    • LUTs saved: 27-37%

    • Maximum frequency improvement: 20%
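The trade-off between stalling and fine-grain multithreading can be illustrated with a toy issue model in Python. Everything here is an assumption made for illustration: every instruction depends on the previous one in its own thread (a worst case), a result becomes usable `latency` cycles after issue because there is no forwarding, and threads are picked round-robin. The latency of 4 cycles is hypothetical, not a measured Leon figure.

```python
def cycles_to_issue(num_threads, instrs_per_thread, latency):
    """Cycles needed to issue all instructions without forwarding.

    Each instruction depends on the previous one in the same thread and
    may issue only `latency` cycles after that predecessor issued.
    """
    ready_at = [0] * num_threads  # earliest cycle each thread may issue
    remaining = [instrs_per_thread] * num_threads
    next_thread = 0  # round-robin pointer
    cycle = 0
    while any(remaining):
        for k in range(num_threads):
            t = (next_thread + k) % num_threads
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                next_thread = (t + 1) % num_threads
                break  # at most one instruction issues per cycle
        cycle += 1
    return cycle


single = cycles_to_issue(num_threads=1, instrs_per_thread=32, latency=4)
multi = cycles_to_issue(num_threads=4, instrs_per_thread=8, latency=4)
print(f"1 thread, 32 instructions: {single} cycles")
print(f"4 threads, 8 instructions each: {multi} cycles")
```

In this model, once the number of threads reaches the result latency, another thread always has an instruction ready, so the stalls are hidden without any forwarding logic.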


Conclusions

  • CAD tools already perform many optimizations

    • Remove unused logic

    • Infer technology-dependent logic from HDL source, e.g., fast carry-chain logic

    • Optimize logic globally


Conclusions

  • Optimization is possible

    • Higher levels yield (much) greater savings

      • Circuit Level: 200-300 LUTs

      • Architectural Level: 1000+ LUTs

      • Sharing: ~700 per core

      • Total: 35-40% savings

