
FAST CRITICAL SECTIONS VIA THREAD SCHEDULING FOR FPGA-BASED MULTITHREADED PROCESSORS


Presentation Transcript


  1. FAST CRITICAL SECTIONS VIA THREAD SCHEDULING FOR FPGA-BASED MULTITHREADED PROCESSORS. Martin Labrecque, Gregory Steffan. ECE Dept., University of Toronto. FPL 2009, Prague, Czech Republic.

  2. NetThreads Project • Hardware: NetFPGA board, 4 GigE ports, Virtex II Pro FPGA • Collaboration with CS researchers interested in performing network experiments (e.g., new traffic shaping, encapsulation, and subscription protocols), not in coding Verilog • Want to use the GigE links at maximum capacity • Goals: an easy-to-program, efficient system. What is the easiest way to describe an application?

  3. Soft Processors in FPGAs [Block diagram: FPGA containing a processor, DDR controllers, and Ethernet MACs] • Soft processors: processors built in the FPGA fabric • Easier to program than HDL, and customizable • FPGAs increasingly implement SoCs with CPUs • Commercial soft processors: NIOS-II and MicroBlaze. Are soft processors fast enough?

  4. Measure of Throughput [Timeline: an input rate that is too fast causes drops; too slow wastes capacity] • Fastest constant input packet rate such that no packet is dropped (processing time may vary) • Gigabit link, 2 processors running at 125 MHz • Cycle budget: 152 to 3060 cycles per packet (64 to 1518 bytes). Soft processors: non-trivial processing at line rate! How can they be efficiently organized?

  5. Minimalist Multiprocessor System [Block diagram: two 4-threaded processors with instruction caches and a synchronization unit; input/output buffers and memories; a shared data cache; off-chip DDR; hardware accelerators attach alongside the processors] Make this system: • Deliver throughput • Be easier to program. Notes: • Separate input/output memories overcome the 2-port limitation of FPGA block RAMs • The shared data cache is not the main bottleneck in our experiments • Complex applications are the bottleneck, not the architecture

  6. [Block diagram: the same minimalist multiprocessor system, shown without annotations]

  7. Outline: Multithreading → Synchronization → Thread Scheduling

  8. Conventional Single-Threaded Processors • Single-issue, in-order pipeline • Should commit 1 instruction every cycle, but stalls on instruction dependences and on memory, I/O, and accelerator accesses • Throughput depends on sequential execution, yet the workload offers many concurrent threads: packet processing, device control, event monitoring. Solution to avoid stalls: multithreading

  9. Avoiding Processor Stall Cycles • Multithreading: execute streams of independent instructions [Pipeline diagrams, 5 stages F/R/E/M/W. BEFORE: traditional single-threaded execution stalls on data or control hazards. AFTER: instructions from Threads 1-4 interleave, ideally eliminating all stalls] • 4 threads eliminate hazards in a 5-stage pipeline (see the sketch below)
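
A minimal C sketch of this round-robin interleaving, assuming the slide's 4 threads and 5-stage pipeline (F/R/E/M/W); it prints which thread occupies each stage in each cycle and shows that two instructions from the same thread are never in back-to-back stages, so hazards vanish. Purely illustrative, not the processor's logic:

```c
#include <stdio.h>

#define THREADS 4
#define STAGES  5

int main(void)
{
    const char *stage[STAGES] = { "F", "R", "E", "M", "W" };
    /* the instruction fetched at cycle t occupies stage s at cycle t + s */
    for (int cycle = 0; cycle < 8; cycle++) {
        printf("cycle %d:", cycle);
        for (int s = 0; s < STAGES; s++) {
            int t = cycle - s;              /* fetch cycle of this occupant */
            if (t >= 0)
                printf("  %s:T%d", stage[s], t % THREADS + 1);
        }
        printf("\n");
    }
    return 0;
}
```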

  10. Multithreading is Area Efficient [Datapath diagram: PC, instruction cache, register array, ALU, control, data cache; per-thread state (PCs and registers) is replicated and the hazard detection logic is removed] • Replicate state for each thread • Simplify control logic • 77% more area efficient than single-threaded [FPL'07]

  11. Multithreading Evaluation

  12. Infrastructure • Compilation: MIPS-I instruction set, modified GCC 4.0.2 and Binutils 2.16 • Platform: Virtex II Pro 50, 4 GigE + 1 PCI interfaces, 2 processors @ 125 MHz, 64 MB DDR2 SDRAM @ 200 MHz • Small caches, which would be larger on a more modern FPGA • A real system executing real applications

  13. Our Benchmarks • Realistic, non-trivial applications dominated by control flow

  14. Cycle Breakdown • Multithreading is effective at hiding memory stalls • 18% of cycles are wasted while blocked on synchronization • Why is there so much time spent waiting for a packet?

  15. Packet Backlog due to Synchronization [Timelines: serializing tasks creates a packet backlog; throughput is defined by bursts of activity] • Let's focus on the underlying problem: synchronization

  16. Addressing Synchronization Overhead

  17. Real Threads Synchronize • All threads execute the same code • Concurrent threads may access shared data • Critical sections ensure correctness: Lock(); shared_var = f(); Unlock(); (executed by Thread1, Thread2, Thread3, Thread4) • What is the impact on round-robin scheduled threads? (a minimal sketch follows)
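
A minimal sketch of this critical-section pattern, using POSIX threads as a stand-in for the NetThreads lock primitives (shared_var, f(), and the thread count are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_var;

static int f(void) { return shared_var + 1; }   /* placeholder work */

static void *worker(void *arg)                  /* same code in every thread */
{
    (void)arg;
    pthread_mutex_lock(&lock);      /* Lock():   enter the critical section */
    shared_var = f();               /* update shared data safely */
    pthread_mutex_unlock(&lock);    /* Unlock(): leave it */
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared_var = %d\n", shared_var);
    return 0;
}
```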

  18. Multithreaded Processor with Synchronization [Pipeline diagram: only one thread wants a lock; it acquires the lock, runs its critical section, and releases the lock while the other threads continue with no stalls] • Threads continue with no stall • What happens when more threads want the same lock?

  19. Synchronization Wrecks Round-Robin Multithreading [Pipeline diagram: all threads want the lock; between acquire and release, only the lock holder advances while the other threads occupy pipeline slots] • Only 1 thread makes progress: 1/4 of the expected throughput • Can we use the idle time to help the lock-holding thread make progress?

  20. Better Handling of Synchronization [Pipeline diagrams. BEFORE: Threads 3 and 4 occupy pipeline slots while waiting for the lock. AFTER: Threads 3 and 4 are descheduled, and the remaining threads fill the pipeline]

  21. Thread Scheduler [Pipeline diagram: with waiters descheduled, the active threads fill all slots] • Suspend any thread waiting for a lock • Round-robin among the other threads to hide hazards • An unlock operation resumes threads across processors • But fewer active threads require hazard detection, and hazard detection was on the critical path of the single-threaded processor (a software model of the policy follows)
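
A small software model of the descheduling policy; the data structures and step granularity are assumptions for illustration, not the hardware design. A thread that fails to acquire the lock is removed from the round-robin, and an unlock hands the lock to a waiter and resumes it:

```c
#include <stdbool.h>
#include <stdio.h>

#define THREADS 4

static bool waiting[THREADS];   /* descheduled, blocked on the lock */
static int  lock_owner = -1;    /* -1: lock is free */

static void try_lock(int t)
{
    if (lock_owner < 0) {
        lock_owner = t;
        printf("T%d acquires the lock and keeps running\n", t);
    } else {
        waiting[t] = true;      /* descheduled: consumes no issue slots */
        printf("T%d descheduled (lock held by T%d)\n", t, lock_owner);
    }
}

static void unlock(int t)
{
    printf("T%d releases the lock\n", t);
    lock_owner = -1;
    for (int i = 0; i < THREADS; i++)
        if (waiting[i]) {       /* resume one waiter, handing it the lock */
            waiting[i] = false;
            lock_owner = i;
            printf("T%d resumed, now holds the lock\n", i);
            break;
        }
}

int main(void)
{
    for (int t = 0; t < THREADS; t++)
        try_lock(t);            /* all four threads want the same lock */
    unlock(0);                  /* T0 finishes its critical section */
    unlock(1);                  /* the resumed T1 finishes next */
    return 0;
}
```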

  22. Typical Thread Scheduling [Pipeline: Fetch → Thread Selection (MUX) → Register Read → Execute → Memory → Writeback] • Add a pipeline stage to pick a hazard-free instruction • Result: increased instruction latency, a larger hazard window, and a higher branch mis-prediction cost • Can we add hazard detection without an extra pipeline stage?

  23. Static Hazard Detection • Hazards can be determined at compile time • Hazard distances are encoded in the instructions [Pipeline diagram: 'or r1,r1,r8' followed by 'or r2,r2,r9'; a hazard distance of 0 means another thread must be scheduled] • Static hazard detection allows scheduling without an extra pipeline stage (a sketch follows)
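
A sketch of how a modified compiler or assembler might compute hazard distances; the slide does not spell out the exact semantics or field width, so the definition below (how many following instructions can issue before one reads the destination register, saturating at a 2-bit maximum) is an assumption:

```c
#include <stdio.h>

#define MAX_DIST 3   /* assumed: a 2-bit hazard-distance field saturates here */

typedef struct { int dest, src1, src2; } insn_t;   /* registers, -1 if unused */

/* Distance 0 means the very next instruction depends on this one,
 * so the scheduler must pick another thread to fill the gap. */
static int hazard_distance(const insn_t *code, int n, int i)
{
    if (code[i].dest < 0) return MAX_DIST;          /* writes no register */
    for (int j = i + 1; j < n && j <= i + MAX_DIST; j++)
        if (code[j].src1 == code[i].dest || code[j].src2 == code[i].dest)
            return j - i - 1;       /* independent instructions in between */
    return MAX_DIST;                /* no nearby consumer */
}

int main(void)
{
    /* or r1,r1,r8 ; or r2,r2,r9 ; add r3,r1,r2 (example sequence) */
    insn_t code[] = { {1, 1, 8}, {2, 2, 9}, {3, 1, 2} };
    int n = (int)(sizeof code / sizeof code[0]);
    for (int i = 0; i < n; i++)
        printf("insn %d: hazard distance %d\n", i, hazard_distance(code, n, i));
    return 0;
}
```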

  24. FPGA-Efficient Implementation [Diagram: two 4-threaded processors with 36-bit instruction caches; 32-bit instructions in off-chip DDR] • Where to store the hazard-distance bits? • Block RAMs are a multiple of 9 bits wide: a 36-bit word leaves 4 bits free beside the 32-bit instruction • Also encode the lock and unlock flags there • How to convert instructions from 36 bits to 32 bits? (a packing sketch follows)
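
A sketch of packing the 32-bit instruction and the 4 spare bits into a 36-bit block-RAM word; the field layout (a 2-bit hazard distance and the lock/unlock flags in bits 35..32) is an assumed assignment for illustration, not the documented NetThreads encoding:

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t pack36(uint32_t insn, unsigned hazard_dist,
                       unsigned lock, unsigned unlock)
{
    uint64_t word = insn;                        /* bits 31..0: instruction */
    word |= (uint64_t)(hazard_dist & 0x3) << 32; /* bits 33..32: hazard dist */
    word |= (uint64_t)(lock & 1) << 34;          /* bit 34: lock flag */
    word |= (uint64_t)(unlock & 1) << 35;        /* bit 35: unlock flag */
    return word;                                 /* fits one 36-bit BRAM word */
}

int main(void)
{
    uint64_t w = pack36(0x01094825u /* or r9,r8,r9 */, 2, 1, 0);
    printf("36-bit word: 0x%09llx\n", (unsigned long long)w);
    return 0;
}
```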

  25. Instruction Compaction: 36 → 32 bits • Handled per instruction format: R-type (example: add rd, rs, rt), J-type (example: j label), I-type (example: addi rt, rs, immediate) • De-compaction: 2 block RAMs plus some logic between the DDR and the cache • Not on a critical path of the pipeline (a format-classification sketch follows)
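
The slide does not give the compaction tables themselves, but any de-compaction logic must first classify each word by MIPS-I format; a sketch of that classification using the standard opcode rules (opcode 0 is R-type, opcodes 2 and 3 are J-type, the rest are I-type):

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { R_TYPE, I_TYPE, J_TYPE } mips_fmt_t;

static mips_fmt_t mips_format(uint32_t insn)
{
    uint32_t opcode = insn >> 26;           /* bits 31..26 */
    if (opcode == 0x00) return R_TYPE;      /* SPECIAL: add, or, ... */
    if (opcode == 0x02 || opcode == 0x03)   /* j, jal */
        return J_TYPE;
    return I_TYPE;                          /* addi, lw, beq, ... */
}

int main(void)
{
    const char *names[] = { "R-type", "I-type", "J-type" };
    uint32_t samples[] = { 0x01094825u /* or   */,
                           0x21080004u /* addi */,
                           0x08000010u /* j    */ };
    for (int i = 0; i < 3; i++)
        printf("0x%08x is %s\n", samples[i], names[mips_format(samples[i])]);
    return 0;
}
```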

  26. Thread Scheduler Evaluation

  27. CAD Results • Preserved 125 MHz operation • Modest logic and memory footprint

  28. Results on 3 benchmark applications • Thread scheduling improves throughput by 63%, 31%, and 41%

  29. Better Cycle Breakdown [Charts: UDHCP, Classifier, NAT] • Removed the cycles stalled waiting for a lock • Throughput is still dominated by serialization (future work)

  30. Conclusions • Made performance from parallelism easy to obtain: parallel threads hide stalls in any one thread • Reduced synchronization cost with thread scheduling: an efficient hardware scheduler, transparent to the programmer, with low hardware overhead that capitalizes on the FPGA • Throughput improvements of 63%, 31%, and 41% • On the lookout for relevant applications suitable for benchmarking • NetThreads is available with its compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads

  31. ECE Dept., University of Toronto. Martin Labrecque, Gregory Steffan. martinL@eecg.utoronto.ca. NetThreads: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads

  32. Backup

  33. Future Work • Adding custom hardware accelerators: same interconnect as the processors, same synchronization interface • Evaluate speculative threading: alleviate the need for fine-grained synchronization, reduce conservative synchronization overhead

  34. Software Network Processing • Not meant for straightforward tasks accomplished at line speed in hardware, e.g. basic switching and routing • Advantages compared to hardware: complex applications are best described in high-level software; easier to design, with fast time-to-market; can interface with custom accelerators and controllers; can be easily updated • Our focus: stateful applications, whose data structures are modified by most packets and whose code is difficult to pipeline into balanced stages • Run-to-completion / pool-of-threads model for parallelism: each thread processes a packet from beginning to end, with no thread-specific behavior (sketched below)
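
A minimal sketch of the run-to-completion / pool-of-threads loop; the packet source, output, and shared flow counter are hypothetical stand-ins for the NetThreads input/output buffers and application state:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define THREADS 4

typedef struct { int id; } packet_t;

static int next_id;                        /* stand-in packet sequence */
static int flow_count;                     /* shared, stateful data */
static pthread_mutex_t flow_lock = PTHREAD_MUTEX_INITIALIZER;

static packet_t *get_packet(void)          /* stand-in for the input buffer */
{
    packet_t *p = malloc(sizeof *p);
    p->id = __sync_fetch_and_add(&next_id, 1);
    return p;
}

static void send_packet(packet_t *p)       /* stand-in for the output buffer */
{
    printf("sent packet %d\n", p->id);
    free(p);
}

static void *packet_thread(void *arg)      /* every thread runs the same code */
{
    (void)arg;
    for (int i = 0; i < 4; i++) {
        packet_t *p = get_packet();        /* 1. claim the next packet */
        pthread_mutex_lock(&flow_lock);    /* 2. guard the shared state */
        flow_count++;                      /*    most packets modify it */
        pthread_mutex_unlock(&flow_lock);
        send_packet(p);                    /* 3. carried through to completion */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, packet_thread, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    printf("processed %d packets\n", flow_count);
    return 0;
}
```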

  35. Cycle Breakdown in Simulation [Charts: Classifier, NAT, UDHCP] • Removed the cycles stalled waiting for a lock • Throughput is still dominated by serialization

  36. Conclusions • Recently built a transactional multiprocessor • Based on a single-threaded design • Thread count limited by signature size • Promising performance results

  37. Envisioned System (Someday) [Diagram: many processors exploiting control-flow parallelism alongside hardware accelerators exploiting data-level parallelism] • Many compute engines • Delivers the expected performance • Hardware handles communication and synchronization • Processors inside an FPGA?

  38. Performance in Packet Processing • The application defines the required throughput: edge routing (≥ 1 Gbps/link), home networking (~100 Mbps/link), scientific instruments (< 100 Mbps/link) • Our measure of throughput: a bisection search for the minimum packet inter-arrival time, while not dropping any packet (a sketch follows) • Are soft processors fast enough?
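
A sketch of the bisection search itself; drops_packets() is a hypothetical stand-in for replaying traffic at a given inter-arrival time and checking for loss, here wired to an assumed capacity of 152 cycles per packet (the 64-byte budget from slide 4) so the example is self-contained:

```c
#include <stdio.h>

/* Stand-in for a real trial run at the given packet inter-arrival time. */
static int drops_packets(double interarrival_cycles)
{
    return interarrival_cycles < 152.0;   /* assumed hidden capacity */
}

int main(void)
{
    double lo = 1.0, hi = 4000.0;   /* known-too-fast .. known-sustainable */
    while (hi - lo > 0.5) {         /* bisect to half-cycle precision */
        double mid = (lo + hi) / 2.0;
        if (drops_packets(mid))
            lo = mid;               /* too fast: packets were lost */
        else
            hi = mid;               /* sustained: try a faster rate */
    }
    printf("minimum inter-arrival: %.1f cycles/packet\n", hi);
    return 0;
}
```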

  39. Key Design Features

  40. Efficient Network Processing 1. Memory system with specialized memories 2. Multiple-processor support 3. Multithreaded soft processor

  41. Multithreading on 3 benchmark applications Why isn’t the 2nd processor always improving throughput?

  42. Cycle Breakdown in Simulation Most of the time is spent waiting for a packet

  43. System Under-Utilized • A consequence of the zero-packet-drop policy

  44. Impact of Allowing Packet Drops (NAT benchmark) • The processors can actually process packets much faster

  45. Impact of Allowing Packet Drops (NAT benchmark, continued)

  46. Fixed packet rate
