
The Microarchitecture of FPGA-Based Soft Processors

Peter Yiannacouras, Jonathan Rose, Greg Steffan. University of Toronto, Electrical and Computer Engineering. Our goal is to study the architecture of soft processors. Processors are present in many digital systems.


Presentation Transcript


  1. The Microarchitecture of FPGA-Based Soft Processors
     Peter Yiannacouras, Jonathan Rose, Greg Steffan
     University of Toronto, Electrical and Computer Engineering

  2. Processors and FPGAs
     • Processors are present in many digital systems
     • Soft processors are implemented in the FPGA fabric
     [Figure: an FPGA containing a processor and custom logic]
     Our goal is to study the architecture of soft processors.

  3. Motivation for understanding soft processor architecture
     • Soft processors are popular
       • 16% of FPGA designs use a soft processor (FPGA Journal, November 2003)
       • This number has increased and will continue to increase
     • Soft processors are end-user customizable
       • Application-specific architectural tradeoffs
       • Can be tuned by designers

  4. Don’t we already understand processor architecture?
     • Not accurately/completely
       • Accurate cycle-to-cycle behaviour
       • Estimated area/power
       • No clock frequency impact
     • Not in FPGA domain
       • Lookup tables vs transistors
       • Dedicated RAMs and multipliers are fast
     Must revisit processor architecture in FPGA context.

  5. Research Goals
     Explore soft processor architecture experimentally.
     • Generate soft processor implementations
       • System for generating RTL
     • Develop measurement methodology
       • Metrics for comparing soft processors
     • Develop understanding of architectural tradeoffs
       • Analyze area/performance/power space

  6. Soft Processor Rapid Exploration Environment (SPREE)
     [Figure: ISA and datapath descriptions are the inputs to SPREE, which outputs RTL]

  7. Input: Instruction Set Architecture (ISA) Description
     • Graph of Generic Operations (GENOPs)
     • Edges indicate flow of data
     • Example: MIPS ADD (add rd, rs, rt) uses the GENOPs FETCH, RFREAD, RFREAD, ADD, RFWRITE
     ISA currently fixed (subset of MIPS I).
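
The GENOP graph for MIPS ADD can be sketched as a small directed graph. This is only an illustration: the node names come from the slide, but the edge list is an assumed reading of the figure, and `topological_order` is a hypothetical helper, not part of SPREE.

```python
from collections import defaultdict

# Hypothetical encoding of "add rd, rs, rt" as a GENOP graph.
# Node names follow the slide; the edges are an assumed reading of the figure.
ADD_NODES = ["FETCH", "RFREAD_rs", "RFREAD_rt", "ADD", "RFWRITE"]
ADD_EDGES = [
    ("FETCH", "RFREAD_rs"),
    ("FETCH", "RFREAD_rt"),
    ("RFREAD_rs", "ADD"),
    ("RFREAD_rt", "ADD"),
    ("ADD", "RFWRITE"),
]


def topological_order(nodes, edges):
    """Return the nodes in an order that respects every data-flow edge."""
    indegree = {node: 0 for node in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    ready = [node for node in nodes if indegree[node] == 0]
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("GENOP graph contains a cycle")
    return order
```

Because edges only express data flow, any order returned by `topological_order(ADD_NODES, ADD_EDGES)` is a legal sequencing of the operations.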

  8. Input: Datapath Description
     • Interconnection of hand-coded components
       • Allows efficient synthesis
     • Described using C++
     [Figure: datapaths assembled from the SPREE Component Library: Ifetch, Reg File, Mul, Shifter, ALU, Data Mem, Write Back]
     Limited to simple in-order issue pipelines.

  9. Step 1. ISA vs Datapath Verification
     • Components described using GENOPs
     [Figure: the GENOP graph (FETCH, RFREAD, RFREAD, ADD, RFWRITE) is verified against the datapath (Ifetch, Reg File, Mul, ALU, Data Mem, Write Back)]
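
One way to picture this verification step is to check that every data-flow edge in an instruction's GENOP graph is backed by a connection between components that implement those GENOPs. The sketch below is a toy model under stated assumptions: the component-to-GENOP mapping, the wiring, and the function names are all hypothetical, and direct wires stand in for whatever path matching SPREE actually performs.

```python
# Hypothetical mapping from datapath components to the GENOPs they implement.
COMPONENT_GENOPS = {
    "Ifetch": {"FETCH"},
    "Reg File": {"RFREAD", "RFWRITE"},
    "ALU": {"ADD"},
}

# Hypothetical datapath wiring as directed component-to-component connections.
DATAPATH_WIRES = {
    ("Ifetch", "Reg File"),
    ("Reg File", "ALU"),
    ("ALU", "Reg File"),
}

# Assumed GENOP data-flow edges for MIPS ADD.
ADD_EDGES = [("FETCH", "RFREAD"), ("RFREAD", "ADD"), ("ADD", "RFWRITE")]


def providers(genop, component_genops):
    """List the components that implement a given GENOP."""
    return [name for name, ops in component_genops.items() if genop in ops]


def unsupported_edges(genop_edges, component_genops, wires):
    """Return the GENOP edges that no pair of connected components can carry."""
    missing = []
    for src_op, dst_op in genop_edges:
        supported = any(
            (src, dst) in wires
            for src in providers(src_op, component_genops)
            for dst in providers(dst_op, component_genops)
        )
        if not supported:
            missing.append((src_op, dst_op))
    return missing
```

An empty result from `unsupported_edges(ADD_EDGES, COMPONENT_GENOPS, DATAPATH_WIRES)` means this toy datapath can execute ADD; a non-empty result names the data flows the datapath cannot provide.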

  10. Step 2. Datapath Instantiation
      • Multiplexer insertion
      • Unused connection/component removal
      [Figure: datapath of Ifetch, Reg File, Mul, ALU, Data Mem, Write Back]
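
Multiplexer insertion can be illustrated as follows: whenever several sources drive the same destination port, a multiplexer is placed in front of that port. This is a minimal sketch, not SPREE's algorithm; the port names and the `mux_<port>` naming scheme are assumptions made for the example.

```python
from collections import defaultdict


def insert_muxes(connections):
    """Rewrite (source, destination_port) pairs so each port has one driver.

    Ports with a single source are left alone. Ports with several sources
    get a hypothetical multiplexer named "mux_<port>" placed in front of them.
    """
    sources_by_port = defaultdict(list)
    for src, dst in connections:
        sources_by_port[dst].append(src)

    result = []
    for dst, sources in sources_by_port.items():
        if len(sources) == 1:
            result.append((sources[0], dst))
            continue
        mux = f"mux_{dst}"
        for index, src in enumerate(sources):
            result.append((src, f"{mux}.in{index}"))
        result.append((f"{mux}.out", dst))
    return result


# Hypothetical connections: two components drive the write-back data port.
EXAMPLE_CONNECTIONS = [
    ("ALU.result", "WriteBack.data"),
    ("DataMem.out", "WriteBack.data"),
    ("Ifetch.instr", "RegFile.addr"),
]
```

Calling `insert_muxes(EXAMPLE_CONNECTIONS)` leaves the single-driver port untouched and routes the two write-back sources through one two-input multiplexer.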

  11. Step 3. Control Generation
      [Figure: control logic attached to the datapath components: Ifetch, Reg File, Mul, ALU, Data Mem, Write Back]
      Laborious step performed automatically.

  12. Output: Verilog RTL Description
      [Figure: SPREE outputs Verilog RTL for the datapath and its generated control]

  13. Back-end Infrastructure
      • Inputs: generated RTL and benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
      • Tools: Modelsim RTL simulator; Quartus II 4.2 CAD software targeting a Stratix 1S40
      • Measurements:
        1. Cycle count
        2. Resource usage
        3. Clock frequency
        4. Power
      In this work we can measure each accurately!

  14. Metrics for Measurement
      • Area: Equivalent Stratix Logic Elements (LEs)
        • Relative silicon areas used for RAMs/multipliers
      • Performance: Wall clock time
        • Cycle count ÷ clock frequency
        • Arithmetic mean across benchmark set
      • Energy: Dynamic energy (e.g. nJ/instr)
        • Excluding I/O
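
The performance metric above can be written out as a short calculation: each benchmark's wall clock time is its cycle count divided by the clock frequency, and the processor's score is the arithmetic mean of those times. The function names below are illustrative, and any numbers passed in are the caller's own measurements, not results from the presentation.

```python
def wall_clock_time(cycle_count, clock_frequency_hz):
    """Wall clock time in seconds: cycle count divided by clock frequency."""
    if clock_frequency_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycle_count / clock_frequency_hz


def mean_wall_clock_time(cycle_counts, clock_frequency_hz):
    """Arithmetic mean of wall clock time across a benchmark set."""
    if not cycle_counts:
        raise ValueError("need at least one benchmark")
    times = [wall_clock_time(c, clock_frequency_hz) for c in cycle_counts]
    return sum(times) / len(times)
```

Dividing by clock frequency is what lets a deeper pipeline with more cycles but a faster clock be compared fairly against a shallower one.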

  15. Trace-Based Verification
      • Ensure SPREE generates functional processors
      [Figure: benchmark applications run both on the RTL in Modelsim (RTL simulator) and on MINT (instruction-set simulator); the two traces are compared]
      All generated soft processors are verified this way.
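
The comparison at the heart of trace-based verification amounts to walking two traces in lockstep and reporting the first point where they disagree. The sketch below assumes each trace is simply a list of opaque entries (for instance bit strings like those in the figure); what SPREE actually records per entry is not stated in the slides.

```python
from itertools import zip_longest


def first_trace_mismatch(rtl_trace, iss_trace):
    """Return the index of the first differing entry, or None if traces match.

    A trace that ends early counts as a mismatch at the first missing entry.
    Entries are assumed never to be None.
    """
    pairs = zip_longest(rtl_trace, iss_trace)
    for index, (rtl_entry, iss_entry) in enumerate(pairs):
        if rtl_entry != iss_entry:
            return index
    return None
```

Reporting the first mismatch rather than a plain pass/fail points directly at the instruction where the generated processor diverges from the reference simulator.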

  16. Architectural Exploration Results

  17. Architectural Features Explored
      • Hardware vs software multiplication
      • Shifter implementation
      • Pipelining
        • Depth
        • Organization
        • Forwarding

  18. Validation of SPREE Through Comparison to Altera’s Nios II
      • Nios II has three variations:
        • Nios II/e: unpipelined, no HW multiplier
        • Nios II/s: 5-stage, with HW multiplier
        • Nios II/f: 6-stage, dynamic branch prediction
      • Caveats: not a completely fair comparison
        • Very similar but tweaked ISA
        • Nios II supports exceptions, OS, and caches
        • We do not, and so save on the hardware costs
      We believe the comparison is meaningful.

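  19. SPREE vs Nios II
      [Figure: performance-area plot with axis directions labelled "faster" and "smaller"; the highlighted SPREE design has a 3-stage pipe, HW multiply, and a multiply-based shifter]
      Competitive and can dominate (9% smaller, 11% faster).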

  20. Architectural Features Explored
      • Hardware vs software multiplication
      • Shifter implementation
      • Pipelining
        • Depth
        • Organization
        • Forwarding

  21. Hardware vs Software Multiplication
      • Hardware multiply is fast but not always needed
      • Wastes area (220 LEs) and can waste energy
      Total energy is wasted if there are few multiply instructions, and saved if there are many.

  22. Shifter Implementation
      • Shifters are expensive in FPGAs
      • We explore three implementations:
        • Serial shifter (shift register)
        • Multiplier-based barrel shifter (hard multiplier)
        • LUT-based barrel shifter (multiplexer tree)
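
The three shifter styles compute the same result by different means, which a behavioural model makes concrete. This sketch covers only 32-bit logical left shifts and models behaviour rather than hardware cost; the function names are illustrative and do not come from SPREE.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def serial_shift_left(value, amount):
    """Shift one bit per step, the way a shift register spends one cycle per bit."""
    for _ in range(amount):
        value = (value << 1) & MASK
    return value


def multiplier_shift_left(value, amount):
    """Shift by multiplying with 2**amount, as a hard multiplier could."""
    return (value * (1 << amount)) & MASK


def mux_tree_shift_left(value, amount):
    """Logarithmic barrel shifter: stage k shifts by 2**k when bit k of amount is set."""
    for k in range(5):  # five stages cover shift amounts 0..31
        if (amount >> k) & 1:
            value = (value << (1 << k)) & MASK
    return value
```

The serial version takes a number of steps proportional to the shift amount, while the other two finish in a fixed number of steps; that contrast is the behavioural side of the performance-area tradeoff.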

  23. Performance-Area of Different Shifter Implementations
      [Figure: performance-area plot with axis directions labelled "faster" and "smaller"]
      Multiplier-based shifter is a good compromise.

  24. Pipeline Depth
      • Explored between 2 and 7 stages
      • 1-stage and 6-stage pipelines not interesting
      • Stage assignments:
        • 2-stage: F/D/R/EX/M, WB
        • 3-stage: F/D, R/EX/M, WB
        • 4-stage: F, D, R/EX/M, WB
        • 5-stage: F, D, R/EX, EX/M, WB
        • 7-stage (new): F, D, R, EX, EX, EX/M, WB

  25. Pipeline Depth and Performance
      • 2-stage and 7-stage pipelines suffer from nuances
      • 3-, 4-, and 5-stage pipelines perform the same

  26. Pipeline Organization Tradeoff
      • 4-stage (A): F, D, R/EX/M, WB
      • 4-stage (B): F/D, R/EX, EX/M, WB
      4-stage (B) is 15% faster but requires up to 70 more LEs.

  27. Pipeline Forwarding
      • Prevent stalls when data hazards occur
      • MIPS has two source operands (rs & rt)
      • Four forwarding configurations are possible:
        • No forwarding
        • Forward rs
        • Forward rt
        • Forward both rs and rt
      [Figure: pipeline with stages F, D/R, EX, M, WB]
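
The four forwarding configurations can be compared with a toy stall counter. This is a deliberately simplified model and not the pipeline measured in the presentation: it assumes a hazard arises only between adjacent instructions, that an unforwarded hazard costs a fixed one-cycle stall, and that every instruction is a three-register operation given as a hypothetical `(rd, rs, rt)` tuple.

```python
def count_stalls(program, forward_rs, forward_rt, stall_penalty=1):
    """Count stall cycles for a list of (rd, rs, rt) register tuples.

    Assumed model: an instruction stalls when it reads, through an operand
    that is not forwarded, the register written by the previous instruction.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        produced = previous[0]
        _, rs, rt = current
        rs_hazard = rs == produced and not forward_rs
        rt_hazard = rt == produced and not forward_rt
        if rs_hazard or rt_hazard:
            stalls += stall_penalty
    return stalls


# Hypothetical program: the second instruction reads r1 through rs,
# the third reads r4 through rt.
EXAMPLE_PROGRAM = [(1, 2, 3), (4, 1, 5), (6, 7, 4)]
```

Running `count_stalls` on `EXAMPLE_PROGRAM` under each of the four configurations shows how forwarding one operand removes only the hazards that flow through that operand.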

  28. Pipeline Forwarding
      • Up to 20% speed improvement with forwarding of both operands
      • The rs operand benefits more than rt (9% faster)

  29. Summary of Presented Architectural Conclusions
      • Hardware multiplication can be wasteful
      • Multiplier-based shifter is a sweet spot
      • 3-stage pipelines are attractive
      • Tradeoffs exist within pipeline organization
      • Forwarding
        • Improves performance by up to 20%
        • Favours the rs operand

  30. Future Work
      • Explore other exciting architectural axes
        • Branch prediction, aggressive forwarding
        • ISA changes
        • VLIW datapaths
        • Caches and memory hierarchy
      • Compiler optimizations
      • Port to other devices
      • Explore aggressive customization
      • Add exceptions and OS support
