
CS252 Project Presentation: Optimizing the Leon Soft Core


Project Outline

  • Goal: Reduce the size of Leon on FPGAs

    • Our motivation for using Leon:

      • RAMP research: emulation of multiprocessors

  • Analysis:

    • LUT breakdown

  • Optimizations:

    • Circuit Level

    • Architectural Level


Leon Overview

  • 32-bit SPARC V8 compliant processor

    • 7-stage, in-order pipeline

    • Separate L1 Instruction & Data caches

      • Configurable cache size, associativity, replacement policy

    • Optional Memory Management Unit

    • AMBA bus interface to memory and peripherals

  • Supports Symmetric Multiprocessing

  • Open-source (Gaisler Research)


Area Analysis

  • Configuration

    • MMU: Combined I/D-TLB, 2-entry only

    • Integer MUL/DIV enabled

    • Cache: Direct-mapped I/D caches

  • Variables

    • DSU - Debug support unit

    • Target clock

      • 20 MHz - easy to achieve

      • 200 MHz - over-constrained



Why it’s BIG

  • Debugging Support

    • More MUXes

    • One additional pipeline stage

    • Useful for RAMP emulation / bootstrapping

  • Integer unit (IU) is over 50% of the area

    • Barrel shifter

    • Pipeline control (forwarding)


Circuit Level Optimizations

  • Store LRU bits in Block RAMs instead of Flip Flops

    • Also saves LUTs

  • One-hot encoding for signals

    • Synthesis tool does a good job of 1-hot encoding for many signals (e.g., state encoding)

    • Applied this to the cache output

      • Instead of the indexed mux data(set), use data(0) or data(1) or data(2) or data(3), with the one-hot select gating each way

      • Useful only for multiway caches

      • LUT savings: ~100 LUTs
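
The one-hot cache output selection can be modeled in a few lines of software. This is only a behavioral sketch, not the Leon VHDL: the function names, the 4-way cache, and the 32-bit word width are assumptions made for illustration. It shows that an AND-OR structure driven by one-hot hit signals returns the same word as indexing by a binary-encoded way number.

```python
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit cache data words


def select_indexed(data, way):
    """Binary-encoded select: data[way], which implies a decoder plus a mux."""
    return data[way]


def select_one_hot(data, hit):
    """One-hot select: AND each way with its hit bit, then OR the results."""
    result = 0
    for word, h in zip(data, hit):
        mask = WORD_MASK if h else 0  # replicate the hit bit across the word
        result |= word & mask
    return result


# Hypothetical 4-way example: way 2 hits.
ways = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
hit = [0, 0, 1, 0]
assert select_one_hot(ways, hit) == select_indexed(ways, hit.index(1))
```

The equivalence only holds when exactly one hit bit is set, which is why the technique applies to multiway caches where the tag comparators already produce per-way hit signals.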


Circuit Level Optimizations

  • Use fast-carry chain logic

    • Provided 30% savings in LUT usage for TLB entries

  • Multipliers for barrel shifter

    • Left shift by b is the same as multiplication by 2^b (a right shift by b can take the upper half of the product with 2^(32-b))

    • Savings of ~100 LUTs
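
The arithmetic behind reusing a multiplier as a shifter can be checked with a small model. This is a sketch under assumptions: 32-bit operands, a multiplier that produces a full 64-bit product, and hypothetical function names. It does not describe how the project wired the FPGA multipliers. A left shift by b is a multiplication by 2^b truncated to 32 bits; a logical right shift by b (for b from 1 to 31) is the upper 32 bits of the product with 2^(32-b).

```python
MASK32 = 0xFFFFFFFF


def shl_via_mul(x, b):
    """Logical left shift: low 32 bits of x * 2**b."""
    return (x * (1 << b)) & MASK32


def shr_via_mul(x, b):
    """Logical right shift: high 32 bits of x * 2**(32 - b)."""
    if b == 0:
        # 2**32 does not fit in a 32-bit operand, so bypass the multiplier.
        return x & MASK32
    product = x * (1 << (32 - b))  # fits in 64 bits for 32-bit x
    return (product >> 32) & MASK32


# Compare against the native shift operators for a few sample values.
for x in (0, 1, 0x80000000, 0xDEADBEEF, MASK32):
    for b in range(32):
        assert shl_via_mul(x, b) == (x << b) & MASK32
        assert shr_via_mul(x, b) == x >> b
```

An arithmetic (sign-extending) right shift would need extra handling of the sign bit, which this sketch leaves out.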


LUTs for Integer Mul / Div

  • 2195 of 18429* LUTs for the entire two-core system (12%)

  • 11.5% of Leon3 core

  • * As reported by Xilinx ISE


Didn’t your mother teach you to share?

  • Savings of ~350 LUTs for prototype

    • Only multiplier shared

    • Only two cores

  • 10% could become 5%, 2.5%, 1%, ... as more cores share one unit

  • Even more for MAC
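
The sharing argument is simple amortization, which a short calculation makes concrete. This is a rough sketch with a hypothetical helper name: it assumes one multiplier shared evenly by all cores and ignores the arbitration logic and contention that sharing introduces, so real savings would be somewhat lower.

```python
def shared_fraction(unit_fraction, cores):
    """Per-core area share of a unit when one copy serves `cores` cores."""
    return unit_fraction / cores


# Measured share of the mul/div logic in the two-core system (from the slides).
measured = 2195 / 18429
print(f"measured mul/div share: {measured:.1%}")

# Starting from a unit that costs roughly 10% of a core, share it more widely.
for n in (1, 2, 4, 10):
    print(f"{n:2d} cores sharing: {shared_fraction(0.10, n):.1%} per core")
```

Sharing among 2, 4, and 10 cores gives the 5%, 2.5%, and 1% figures quoted on the slide.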


Operand MUXes

  • 32-bit, 7-to-1 MUX

  • 32-bit, 5-to-1 MUX

  • 313 LUTs + 64 MUXes each


Integer Pipeline Changes

  • Remove all forwarding

    • Single thread: Just stall

    • Fine Grain Multithreading could boost performance

    • LUTs saved: 27-37%

    • Maximum frequency improvement: 20%
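
Why fine-grain multithreading can hide the stalls left by removing forwarding is easier to see in a toy issue model. The sketch below is not a model of the Leon pipeline: `gap` (the number of cycles a dependent instruction must wait after its producer issues), the strict round-robin slot assignment, and the instruction stream are all assumptions chosen for illustration.

```python
def run(deps, gap, threads=1):
    """Cycles needed to issue every instruction of every thread.

    deps[i] is True if instruction i depends on instruction i - 1 of the
    same thread. All threads run the same stream (an assumption), and
    thread t may only issue in cycles where cycle % threads == t.
    """
    next_instr = [0] * threads  # index of the next instruction per thread
    ready_at = [0] * threads    # earliest cycle that instruction may issue
    remaining = threads * len(deps)
    cycle = 0
    while remaining:
        t = cycle % threads     # strict round-robin slot owner
        i = next_instr[t]
        if i < len(deps) and cycle >= ready_at[t]:
            next_instr[t] = i + 1
            remaining -= 1
            # Without forwarding, a dependent successor waits `gap` cycles.
            if i + 1 < len(deps) and deps[i + 1]:
                ready_at[t] = cycle + gap
        cycle += 1
    return cycle


chain = [False, True, True, True]  # each instruction needs the previous result
single = run(chain, gap=4, threads=1)
multi = run(chain, gap=4, threads=4)
print(single, multi)
```

In this model a single thread needs 13 cycles for 4 dependent instructions, while four interleaved threads issue 16 instructions in 16 cycles, because each thread's next slot arrives only after its previous result is available.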


Conclusions

  • CAD tools already perform many optimizations

    • Remove unused logic

    • Infer technology-dependent logic from HDL source, e.g., fast carry-chain logic

    • Optimize logic globally


Conclusions

  • Optimization is possible

    • Higher levels yield (much) greater savings

      • Circuit Level: 200-300 LUTs

      • Architectural Level: 1000+ LUTs

      • Sharing: ~700 per core

      • Total: 35-40% savings

