Improving Pipelined Soft Processors with Multithreading

ECE Dept., University of Toronto. Presented at RAAW 2006, Orlando, FL. Martin Labrecque, Gregory Steffan.

Presentation Transcript


  1. Improving Pipelined Soft Processors with Multithreading
     Martin Labrecque, Gregory Steffan
     ECE Dept., University of Toronto
     Presented at RAAW 2006, Orlando, FL

  2. Processors and FPGAs
     [Diagram labels: FPGA, Processor, Custom Logic]
     • Soft processors are:
       • Easier to program than HDL
       • Customizable
     • FPGAs increasingly implement SoCs, with CPUs
     • Soft processors: processors in the FPGA fabric

  3. Soft processors in Embedded Systems
     $\frac{\text{Performance}}{\text{Area}} = \frac{\text{Instr. Count} \times \text{Frequency}}{\text{Cycle Count} \times \text{Area}}$
     • We trade off 4 criteria (soft processor power is related to area)
     • What do designers care about?
       • Minimizing area?
       • Matching frequency?
       • Hitting performance target?
     • Area efficiency: a combined metric, MIPS per 1000 LEs
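The combined metric on this slide can be written out as a short calculation. This is a minimal sketch, assuming the instruction count, cycle count, clock frequency (in MHz) and area (in LEs) of a design are known. The function name and the sample numbers are illustrative and are not taken from the talk.

```python
def area_efficiency(instr_count, cycle_count, freq_mhz, area_les):
    """Return area efficiency in MIPS per 1000 LEs (hypothetical helper)."""
    if cycle_count <= 0 or area_les <= 0:
        raise ValueError("cycle_count and area_les must be positive")
    ipc = instr_count / cycle_count      # instructions per cycle
    mips = ipc * freq_mhz                # million instructions per second
    return mips / (area_les / 1000.0)    # normalize by area in thousands of LEs


# Hypothetical design point: 0.8 IPC at 80 MHz in 1200 LEs.
example = area_efficiency(instr_count=800_000, cycle_count=1_000_000,
                          freq_mhz=80.0, area_les=1200)
print(round(example, 2))
```

A design improves this metric by raising IPC or frequency, or by shrinking area, which is why the slide treats the four quantities as one trade-off.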

  4. Multithreading
     $\frac{\text{Million Instr.} \times \text{Frequency}}{\text{\# Cycles} \times \text{Area}}$
     • Replace processor stalls
       • Fill them with instructions from other threads
     • When to switch thread?
       • Every instruction (e.g. Sun’s Niagara)
     • Convenient technique for in-order processors
     • Fine-grained multithreading: 1 instr. per thread in round-robin

  5. Avoiding processor stall cycles
     • Data and control hazards create stall cycles
       [Diagram, BEFORE: traditional execution on 3 stages (F, E, W) over time]
     • Multithreading: execute streams of independent instructions
       [Diagram, AFTER: instructions from Thread1, Thread2 and Thread3 interleaved on the same 3 stages over time]
     • Ideally, eliminates all stalls
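The before/after comparison on this slide can be illustrated with a toy cycle-level model. This is a sketch under loudly stated assumptions, not the authors' simulator: every instruction depends on the previous instruction of the same thread, there is no forwarding, and a thread may fetch again only after its previous instruction has left the last stage. The function name and parameters are hypothetical.

```python
def simulate_round_robin(depth, threads, instrs_per_thread):
    """Return (issue_cycles, stall_cycles) for strict round-robin fetch.

    Assumption (worst case): a thread's next instruction cannot be fetched
    until its previous one has completed all `depth` pipeline stages.
    """
    ready_at = [0] * threads                  # earliest fetch cycle per thread
    remaining = [instrs_per_thread] * threads
    cycle = 0
    stalls = 0
    t = 0                                     # thread whose turn it is
    while any(remaining):
        if remaining[t] == 0:                 # finished thread: skip its slot
            t = (t + 1) % threads
            continue
        if cycle >= ready_at[t]:
            remaining[t] -= 1                 # fetch one instruction
            ready_at[t] = cycle + depth       # dependent successor must wait
            t = (t + 1) % threads
        else:
            stalls += 1                       # bubble: thread t is not ready
        cycle += 1
    return cycle, stalls


single = simulate_round_robin(depth=3, threads=1, instrs_per_thread=4)
multi = simulate_round_robin(depth=3, threads=3, instrs_per_thread=4)
print(single)  # (10, 6): 4 instructions, 6 stall cycles
print(multi)   # (12, 0): 12 instructions, no stall cycles
```

Under these assumptions the single-threaded run reaches an IPC of 0.4, while three interleaved threads on three stages reach the ideal IPC of 1.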

  6. How useful is multithreading?
     • Commercial SPs are single-threaded (Nios II, MicroBlaze)
     • Fort et al. [FCCM’06] have shown:
       • a multithreaded SP is smaller than multiple SPs
       • with some performance degradation
     • We go further by showing that the area efficiency of a multithreaded SP is greater than the area efficiency of a single-threaded SP
     • Not straightforward; here is how we did it

  7. Outline
     • Architectural Support for Multiple Threads
     • Soft Processor Infrastructure
     • Improvements to Baseline Multithreading

  8. Single-Threaded Processor (simplified)
     [Diagram labels: PC, +4, Instr. Mem, Reg. Array, ALU, Data Mem, Forwarding lines, Hazard Detection Logic]

  9. 2-Threaded Processor (simplified)
     [Diagram labels: PC (two copies), PC Ctrl., +4, Instr. Mem, Reg. Array, ALU, Data Mem, Hazard Detection Logic]
     • Replicate state for each thread
     • Simplify control logic

  10. Additional storage for multiple threads
      • N × program counters, data memory and registers
      • Increase memory size while preserving frequency
      • More efficiently done in FPGA than in ASIC
      • Multithreading builds on the strengths of FPGAs
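One way to picture the replicated state is a single, larger register memory indexed by thread ID, plus one program counter per thread. The sketch below is only an illustration of that idea; the class name, the 32-register default and the `thread * regs + reg` indexing scheme are assumptions, not details given in the talk.

```python
class MultithreadedState:
    """Hypothetical per-thread architectural state: N PCs, N x registers."""

    def __init__(self, num_threads, regs_per_thread=32):
        self.num_threads = num_threads
        self.regs_per_thread = regs_per_thread
        self.pcs = [0] * num_threads                         # one PC per thread
        self.regfile = [0] * (num_threads * regs_per_thread)  # one larger memory

    def _index(self, thread, reg):
        if not (0 <= thread < self.num_threads):
            raise IndexError("bad thread id")
        if not (0 <= reg < self.regs_per_thread):
            raise IndexError("bad register number")
        # Assumed layout: each thread owns a contiguous bank of registers.
        return thread * self.regs_per_thread + reg

    def read(self, thread, reg):
        return self.regfile[self._index(thread, reg)]

    def write(self, thread, reg, value):
        self.regfile[self._index(thread, reg)] = value & 0xFFFFFFFF


state = MultithreadedState(num_threads=2)
state.write(0, 5, 7)   # thread 0, register 5
state.write(1, 5, 9)   # thread 1, same register number, different storage
print(state.read(0, 5), state.read(1, 5), len(state.regfile))
```

The memory simply grows by a factor of N, which matches the slide's point that the extra storage is a matter of memory size rather than extra logic.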

  11. Outline • Architectural Support for Multiple Threads • Soft Processor Infrastructure • Improvements to baseline multithreading

  12. Measurement Infrastructure
      • Processors: single-threaded processors from the SPREE System [FPGA’06] (RTL)
      • Benchmarks: MiBench, Dhrystone 2.1, RATES, XiRisc
      • Tools: ModelSim RTL simulator; Quartus II 5.0 CAD software; Stratix 1S40C5
      • Measurements: 1. Cycle Count 2. Resource Usage 3. Clock Frequency 4. Power
      • We can measure area/performance/energy accurately

  13. Evaluation methodology
      • Same benchmark running on all threads
      • Some mixed-benchmark results are in the paper
      • Run until completion of the last thread
      • Same instruction space
      • We present results with fixed-latency on-chip RAM
      • We are implementing a solution for off-chip RAM

  14. Processors: 3, 5 and 7 stages
      • Pipe3: F/D, R/EX/M, WB (1174 LEs, 78.3 MHz)
      • Pipe5: F, D, R/EX1, EX2/M, WB (1283 LEs, 86.79 MHz)
      • Pipe7: F, D, R, EX1, EX2/M, EX3/WB1, WB2 (1557 LEs, 100.59 MHz)
      • F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback
      • Best of each pipeline depth generated by SPREE
      • By default: thread count = number of pipeline stages

  15. Area efficiency results
      [Chart: area efficiency for the 3-stage, 5-stage and 7-stage processors; improvement labels: 77%, 33%, 106%]
      • Area efficiency is most improved with deeper pipelines
      • 3- and 7-stages have similar area efficiency

  16. IPC results for 3, 5 and 7 stages
      [Chart: IPC versus single-threaded processor; ideal IPC = 1]
      • 24%, 45% and 104% more instructions per cycle, respectively

  17. Improvements to the Baseline Multithreaded Soft Processors
      • Optimize away unpipelined multi-cycle paths
      • Selection of architectural features
        • Multiplier implementation
        • Number of registers
        • Number of threads
      • Combination of techniques optimizing area efficiency

  18. 1- Changing multiplication support
      • 3-operand multiplies (Nios II and MicroBlaze)
        • Two instructions compute the high and low parts
        • Avoids replicating Hi/Lo register support
      • Default MIPS has Hi/Lo registers
      [Diagram labels: Hi/Lo, Register file, Multiplier, MUX]
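The idea of splitting a multiply into two 3-operand instructions, one per half of the product, can be shown with plain integer arithmetic. This is a minimal sketch for unsigned 32-bit operands; the function names `mul_lo` and `mul_hi_unsigned` are hypothetical and do not correspond to any specific instruction mnemonic.

```python
MASK32 = 0xFFFFFFFF


def mul_lo(a, b):
    """Low 32 bits of the 64-bit product (result goes to a normal register)."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mul_hi_unsigned(a, b):
    """High 32 bits of the unsigned 64-bit product."""
    return ((a & MASK32) * (b & MASK32)) >> 32


a, b = 0xFFFFFFFF, 2          # product is 0x1_FFFF_FFFE
lo = mul_lo(a, b)
hi = mul_hi_unsigned(a, b)
print(hex(hi), hex(lo))       # 0x1 0xfffffffe
```

Because each half is written to an ordinary destination register, no dedicated Hi/Lo state has to be kept, and therefore none has to be replicated per thread.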

  19. 2- Reducing the register file
      • Not all registers are utilized [RAAW’06]
      • Many threads can combine the savings
      • Results in saved memory blocks
      [Diagram labels: 1..N, 1..N (2N in total) versus 1..N-k, 1..N-k (2N-2k in total)]
      • Applicable to the 5-stage processor
      • Slightly increases cycle count due to increased register pressure
      • Allows area and frequency improvements
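The way per-thread savings add up to whole memory blocks can be sketched with a rough capacity calculation. Everything specific here is an assumption for illustration: the 4096-bit block capacity, the 32-bit register width, the choice of k = 7 removed registers, and the simplification that a register file is a single memory sized only by total bits (real block RAMs also constrain width, depth and port count).

```python
def blocks_needed(threads, regs_per_thread, reg_bits=32, block_bits=4096):
    """Hypothetical estimate: memory blocks for all threads' registers."""
    total_bits = threads * regs_per_thread * reg_bits
    return (total_bits + block_bits - 1) // block_bits   # ceiling division


# Hypothetical 5-thread processor with N = 32 registers per thread.
full = blocks_needed(threads=5, regs_per_thread=32)      # 5120 bits
# Same processor after removing k = 7 unused registers per thread.
reduced = blocks_needed(threads=5, regs_per_thread=25)   # 4000 bits
print(full, reduced)
```

In this made-up case a saving of 7 registers per thread is too small to matter for one thread, but across five threads it drops the total below a block boundary.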

  20. Reducing the Number of Threads
      • Usually: # threads = # pipeline stages
      • Last stage: writeback to non-conflicting register
      [Diagram: 3-stage pipeline (F, E, W) timelines; legend: Thread1, Thread2, Thread3]
      • Positive effect on the 5 and 7-stage processors
      • Helps meet processing latency deadline (shorter round-robin)
      • Gives designers more flexibility
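The latency argument for a shorter round-robin can be made concrete with simple arithmetic. This is a sketch assuming a stall-free round-robin in which each thread issues exactly one instruction every `threads` cycles; the function name and the numbers are hypothetical.

```python
def per_thread_latency_us(instrs, threads, freq_mhz):
    """Time for ONE thread to issue `instrs` instructions, in microseconds.

    Assumption: stall-free round-robin, so a given thread gets one issue
    slot every `threads` clock cycles.
    """
    cycles = instrs * threads
    return cycles / freq_mhz          # cycles / (cycles per microsecond)


# Hypothetical 1000-instruction handler on a 100 MHz processor.
seven = per_thread_latency_us(instrs=1000, threads=7, freq_mhz=100.0)
five = per_thread_latency_us(instrs=1000, threads=5, freq_mhz=100.0)
print(seven, five)
```

With fewer threads sharing the pipeline, each individual thread finishes its work sooner, provided the remaining hazards do not reintroduce stalls.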

  21. Conclusions
      • Multithreaded SPs outperform single-threaded SPs
        • Assumes independent threads
        • Assumes use of on-chip memory
      • 33%, 77% and 106% increase in area efficiency
      • Demonstrated that benefits increase with pipeline depth
      • Techniques to optimize away unpipelined multi-cycle paths
      • Selection and combination of architectural features
        • Multiplier support
        • Number of threads
        • Number of registers
      • Commercial FPGA makers should have a multithreaded SP

  22. Long term goals
      • Multiple multithreaded soft processors
      • Research using off-chip memory hierarchy
      • Study of synchronization mechanisms
      • Make it easy to target and scale up for non-HW people
      • Experimental Testbed: NetFPGA
        • Virtex-II Pro
        • 4 x 1 Gbps Ethernet
        • PCI board
        • 64 MB DDR2 DRAM
        • Stanford/Xilinx platform
        • Collaboration with network researchers
      • Perform real high bandwidth experiments

  23. Thank you
      Martin Labrecque (martinl@eecg.utoronto.ca), Gregory Steffan
      ECE Dept., University of Toronto

  24. Where do threads come from?
      • Event processing, e.g. multiple sources of interrupts
      • Packet processing, e.g. CAN, RS-485, Ethernet, etc.
      • Systems handling requests, e.g. bus controllers
      • For now, we consider independent threads

  25. SPREE vs Nios II [IEEE TCAD’07]
      [Chart: axis directions labeled "faster" and "smaller"]

  26. Architectural Parameters Used in SPREE
      • Multiplication support: hardware FU or software routine
      • Shifter implementation: flipflops, multiplier, or LUTs
      • Pipelining: depth (2-7 stages), forwarding lines
      • We focus on core microarchitecture (for now)

  27. Contributions on Multithreaded Soft Processors
      • Multithreaded SPs dominate single-threaded processors in area and IPC
      • Demonstrated that these benefits increase with the # of pipeline stages
      • Explained techniques to optimize away unpipelined multi-cycle paths
      • Selection of architectural features
        • Number of threads
        • Number of registers
        • Multiplier support
      • Combination of techniques that optimize area efficiency

  28. Unpipelined Multicycle Paths
      [Diagram: example of a 3-stage pipeline with multicycle paths on load, store, shift and multiply, shown for ST and MT; stage labels: F/D, R/EX, EX, M, WB]
      • Not practical in ST because of hazard detection
      • Important source of IPC improvement

  29. Changing multiplication support
      [Charts: 3-stage, 5-stage, 7-stage]
      • For multithreaded SPs, 3-operand multiplies always win

  30. Reducing the Number of Threads Positive effect on the 5 and 7-stage processors

  31. SPREE System (Soft Processor Rapid Exploration Environment)
      [Diagram labels: Processor Description (ISA, Datapath), SPREE, RTL]
      • Input: processor description
        • Made of hand-coded components
      • SPREE System
        • Verify ISA against datapath
        • Datapath instantiation
        • Control generation
      • Output: synthesizable Verilog

  32. Multithreading
      $\frac{\text{Million Instr.} \times \text{Frequency}}{\text{\# Cycles} \times \text{Area}}$
      [Diagram: interleaved instructions in the pipeline over time: T1, T2, T3, T1, T2, T3]
      • Replace processor stalls
        • Fill them with instructions from other threads
      • When to switch thread?
        • Multiple techniques
        • Most common: every instruction (e.g. Sun’s Niagara)
      • Fine-grained multithreading: 1 instr. per thread in round-robin

  33. Experimental Testbed: NetFPGA • Virtex-II Pro • 4 x 1 Gbps Ethernet • PCI board • 64 MB DDR2 DRAM • Stanford/Xilinx platform • Collaboration with network researchers Perform real high bandwidth experiments

  34. Removed load and branch delay slots in the code
