
FAST CRITICAL SECTIONS VIA THREAD SCHEDULING FOR FPGA-BASED MULTITHREADED PROCESSORS


Presentation Transcript


  1. FAST CRITICAL SECTIONS VIA THREAD SCHEDULING FOR FPGA-BASED MULTITHREADED PROCESSORS. Martin Labrecque, Gregory Steffan. ECE Dept., University of Toronto. FPL 2009, Prague, Czech Republic.

  2. NetThreads Project • Hardware: NetFPGA board, 4 GigE ports, Virtex II Pro FPGA • Collaboration with CS researchers interested in performing network experiments (e.g., new traffic shaping, encapsulation, and subscription protocols), not in coding Verilog • Want to use the GigE links at maximum capacity • Goals: an easy-to-program, efficient system. What is the easiest way to describe an application?

  3. Soft Processors in FPGAs [Block diagram: FPGA containing a processor, DDR controllers, and Ethernet MACs] • Soft processors: processors built in the FPGA fabric • Easier to program than HDL, and customizable • FPGAs increasingly implement SoCs with CPUs • Commercial soft processors: NIOS-II and MicroBlaze. Are soft processors fast enough?

  4. Measure of Throughput [Timeline: an input rate that is too fast causes drops; too slow wastes capacity] • Fastest constant input packet rate such that no packet is dropped (processing time may vary) • Gigabit link, 2 processors running at 125 MHz • Cycle budget: 152 to 3060 cycles per packet (64 to 1518 bytes). Soft processors: non-trivial processing at line rate! How can they be efficiently organized?

  5. Minimalist Multiprocessor System [Block diagram: two 4-threaded processors with instruction caches and a synchronization unit; input/output buffers and memories; a shared data cache; off-chip DDR; hardware accelerators attach alongside the processors] Make this system: • Deliver throughput • Be easier to program. Notes: • Separate input/output memories overcome the 2-port limitation of FPGA block RAMs • The shared data cache is not the main bottleneck in our experiments • Complex applications are the bottleneck, not the architecture

  6. [Block diagram: the same minimalist multiprocessor system, shown without annotations]

  7. Outline: Multithreading → Synchronization → Thread Scheduling

  8. Conventional Single-Threaded Processors • Single-issue, in-order pipeline • Should commit 1 instruction every cycle, but stalls on instruction dependences and on memory, I/O, and accelerator accesses • Throughput depends on sequential execution, yet the workload offers many concurrent threads: packet processing, device control, event monitoring. Solution to avoid stalls: multithreading

  9. Avoiding Processor Stall Cycles • Multithreading: execute streams of independent instructions [Pipeline diagrams, 5 stages F/R/E/M/W. BEFORE: traditional single-threaded execution stalls on data or control hazards. AFTER: instructions from Threads 1-4 interleave, ideally eliminating all stalls] • 4 threads eliminate hazards in a 5-stage pipeline (see the sketch below)
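
A minimal C sketch of this round-robin interleaving, assuming the slide's 4 threads and 5-stage pipeline (F/R/E/M/W); it prints which thread occupies each stage in each cycle and shows that two instructions from the same thread are never in back-to-back stages, so hazards vanish. Purely illustrative, not the processor's logic:

```c
#include <stdio.h>

#define THREADS 4
#define STAGES  5

int main(void)
{
    const char *stage[STAGES] = { "F", "R", "E", "M", "W" };
    /* the instruction fetched at cycle t occupies stage s at cycle t + s */
    for (int cycle = 0; cycle < 8; cycle++) {
        printf("cycle %d:", cycle);
        for (int s = 0; s < STAGES; s++) {
            int t = cycle - s;              /* fetch cycle of this occupant */
            if (t >= 0)
                printf("  %s:T%d", stage[s], t % THREADS + 1);
        }
        printf("\n");
    }
    return 0;
}
```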

  10. Multithreading is Area Efficient [Datapath diagram: PC, instruction cache, register array, ALU, control, data cache; per-thread state (PCs and registers) is replicated and the hazard detection logic is removed] • Replicate state for each thread • Simplify control logic • 77% more area efficient than single-threaded [FPL'07]

  11. Multithreading Evaluation

  12. Infrastructure • Compilation: MIPS-I instruction set, modified GCC 4.0.2 and Binutils 2.16 • Platform: Virtex II Pro 50, 4 GigE + 1 PCI interfaces, 2 processors @ 125 MHz, 64 MB DDR2 SDRAM @ 200 MHz • Small caches, which would be larger on a more modern FPGA • A real system executing real applications

  13. Our Benchmarks • Realistic, non-trivial applications dominated by control flow

  14. Cycle Breakdown • Multithreading is effective at hiding memory stalls • 18% of cycles are wasted while blocked on synchronization • Why is there so much time spent waiting for a packet?

  15. Packet Backlog due to Synchronization [Timelines: serializing tasks creates a packet backlog; throughput is defined by bursts of activity] • Let's focus on the underlying problem: synchronization

  16. Addressing Synchronization Overhead

  17. Real Threads Synchronize • All threads execute the same code • Concurrent threads may access shared data • Critical sections ensure correctness: Lock(); shared_var = f(); Unlock(); (executed by Thread1, Thread2, Thread3, Thread4) • What is the impact on round-robin scheduled threads? (a minimal sketch follows)
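
A minimal sketch of this critical-section pattern, using POSIX threads as a stand-in for the NetThreads lock primitives (shared_var, f(), and the thread count are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_var;

static int f(void) { return shared_var + 1; }   /* placeholder work */

static void *worker(void *arg)                  /* same code in every thread */
{
    (void)arg;
    pthread_mutex_lock(&lock);      /* Lock():   enter the critical section */
    shared_var = f();               /* update shared data safely */
    pthread_mutex_unlock(&lock);    /* Unlock(): leave it */
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared_var = %d\n", shared_var);
    return 0;
}
```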

  18. Multithreaded Processor with Synchronization [Pipeline diagram: only one thread wants a lock; it acquires the lock, runs its critical section, and releases the lock while the other threads continue with no stalls] • Threads continue with no stall • What happens when more threads want the same lock?

  19. Synchronization Wrecks Round-Robin Multithreading [Pipeline diagram: all threads want the lock; between acquire and release, only the lock holder advances while the other threads occupy pipeline slots] • Only 1 thread makes progress: 1/4 of the expected throughput • Can we use the idle time to help the lock-holding thread make progress?

  20. Better Handling of Synchronization [Pipeline diagrams. BEFORE: Threads 3 and 4 occupy pipeline slots while waiting for the lock. AFTER: Threads 3 and 4 are descheduled, and the remaining threads fill the pipeline]

  21. Thread Scheduler [Pipeline diagram: with waiters descheduled, the active threads fill all slots] • Suspend any thread waiting for a lock • Round-robin among the other threads to hide hazards • An unlock operation resumes threads across processors • But fewer active threads require hazard detection, and hazard detection was on the critical path of the single-threaded processor (a software model of the policy follows)
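
A small software model of the descheduling policy; the data structures and step granularity are assumptions for illustration, not the hardware design. A thread that fails to acquire the lock is removed from the round-robin, and an unlock hands the lock to a waiter and resumes it:

```c
#include <stdbool.h>
#include <stdio.h>

#define THREADS 4

static bool waiting[THREADS];   /* descheduled, blocked on the lock */
static int  lock_owner = -1;    /* -1: lock is free */

static void try_lock(int t)
{
    if (lock_owner < 0) {
        lock_owner = t;
        printf("T%d acquires the lock and keeps running\n", t);
    } else {
        waiting[t] = true;      /* descheduled: consumes no issue slots */
        printf("T%d descheduled (lock held by T%d)\n", t, lock_owner);
    }
}

static void unlock(int t)
{
    printf("T%d releases the lock\n", t);
    lock_owner = -1;
    for (int i = 0; i < THREADS; i++)
        if (waiting[i]) {       /* resume one waiter, handing it the lock */
            waiting[i] = false;
            lock_owner = i;
            printf("T%d resumed, now holds the lock\n", i);
            break;
        }
}

int main(void)
{
    for (int t = 0; t < THREADS; t++)
        try_lock(t);            /* all four threads want the same lock */
    unlock(0);                  /* T0 finishes its critical section */
    unlock(1);                  /* the resumed T1 finishes next */
    return 0;
}
```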

  22. Typical Thread Scheduling [Pipeline: Fetch → Thread Selection (MUX) → Register Read → Execute → Memory → Writeback] • Add a pipeline stage to pick a hazard-free instruction • Result: increased instruction latency, a larger hazard window, and a higher branch mis-prediction cost • Can we add hazard detection without an extra pipeline stage?

  23. Static Hazard Detection • Hazards can be determined at compile time • Hazard distances are encoded in the instructions [Pipeline diagram: 'or r1,r1,r8' followed by 'or r2,r2,r9'; a hazard distance of 0 means another thread must be scheduled] • Static hazard detection allows scheduling without an extra pipeline stage (a sketch follows)
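
A sketch of how a modified compiler or assembler might compute hazard distances; the slide does not spell out the exact semantics or field width, so the definition below (how many following instructions can issue before one reads the destination register, saturating at a 2-bit maximum) is an assumption:

```c
#include <stdio.h>

#define MAX_DIST 3   /* assumed: a 2-bit hazard-distance field saturates here */

typedef struct { int dest, src1, src2; } insn_t;   /* registers, -1 if unused */

/* Distance 0 means the very next instruction depends on this one,
 * so the scheduler must pick another thread to fill the gap. */
static int hazard_distance(const insn_t *code, int n, int i)
{
    if (code[i].dest < 0) return MAX_DIST;          /* writes no register */
    for (int j = i + 1; j < n && j <= i + MAX_DIST; j++)
        if (code[j].src1 == code[i].dest || code[j].src2 == code[i].dest)
            return j - i - 1;       /* independent instructions in between */
    return MAX_DIST;                /* no nearby consumer */
}

int main(void)
{
    /* or r1,r1,r8 ; or r2,r2,r9 ; add r3,r1,r2 (example sequence) */
    insn_t code[] = { {1, 1, 8}, {2, 2, 9}, {3, 1, 2} };
    int n = (int)(sizeof code / sizeof code[0]);
    for (int i = 0; i < n; i++)
        printf("insn %d: hazard distance %d\n", i, hazard_distance(code, n, i));
    return 0;
}
```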

  24. FPGA-Efficient Implementation [Diagram: two 4-threaded processors with 36-bit instruction caches; 32-bit instructions in off-chip DDR] • Where to store the hazard-distance bits? • Block RAMs are a multiple of 9 bits wide: a 36-bit word leaves 4 bits free beside the 32-bit instruction • Also encode the lock and unlock flags there • How to convert instructions from 36 bits to 32 bits? (a packing sketch follows)
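
A sketch of packing the 32-bit instruction and the 4 spare bits into a 36-bit block-RAM word; the field layout (a 2-bit hazard distance and the lock/unlock flags in bits 35..32) is an assumed assignment for illustration, not the documented NetThreads encoding:

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t pack36(uint32_t insn, unsigned hazard_dist,
                       unsigned lock, unsigned unlock)
{
    uint64_t word = insn;                        /* bits 31..0: instruction */
    word |= (uint64_t)(hazard_dist & 0x3) << 32; /* bits 33..32: hazard dist */
    word |= (uint64_t)(lock & 1) << 34;          /* bit 34: lock flag */
    word |= (uint64_t)(unlock & 1) << 35;        /* bit 35: unlock flag */
    return word;                                 /* fits one 36-bit BRAM word */
}

int main(void)
{
    uint64_t w = pack36(0x01094825u /* or r9,r8,r9 */, 2, 1, 0);
    printf("36-bit word: 0x%09llx\n", (unsigned long long)w);
    return 0;
}
```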

  25. Instruction Compaction: 36 → 32 bits • Handled per instruction format: R-type (example: add rd, rs, rt), J-type (example: j label), I-type (example: addi rt, rs, immediate) • De-compaction: 2 block RAMs plus some logic between the DDR and the cache • Not on a critical path of the pipeline (a format-classification sketch follows)
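
The slide does not give the compaction tables themselves, but any de-compaction logic must first classify each word by MIPS-I format; a sketch of that classification using the standard opcode rules (opcode 0 is R-type, opcodes 2 and 3 are J-type, the rest are I-type):

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { R_TYPE, I_TYPE, J_TYPE } mips_fmt_t;

static mips_fmt_t mips_format(uint32_t insn)
{
    uint32_t opcode = insn >> 26;           /* bits 31..26 */
    if (opcode == 0x00) return R_TYPE;      /* SPECIAL: add, or, ... */
    if (opcode == 0x02 || opcode == 0x03)   /* j, jal */
        return J_TYPE;
    return I_TYPE;                          /* addi, lw, beq, ... */
}

int main(void)
{
    const char *names[] = { "R-type", "I-type", "J-type" };
    uint32_t samples[] = { 0x01094825u /* or   */,
                           0x21080004u /* addi */,
                           0x08000010u /* j    */ };
    for (int i = 0; i < 3; i++)
        printf("0x%08x is %s\n", samples[i], names[mips_format(samples[i])]);
    return 0;
}
```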

  26. Thread Scheduler Evaluation

  27. CAD Results • Preserved 125 MHz operation • Modest logic and memory footprint

  28. Results on 3 benchmark applications • Thread scheduling improves throughput by 63%, 31%, and 41%

  29. Better Cycle Breakdown [Charts: UDHCP, Classifier, NAT] • Removed the cycles stalled waiting for a lock • Throughput is still dominated by serialization (future work)

  30. Conclusions • Made performance from parallelism easy to obtain: parallel threads hide stalls in any one thread • Reduced synchronization cost with thread scheduling: an efficient hardware scheduler, transparent to the programmer, with low hardware overhead that capitalizes on the FPGA • Throughput improvements of 63%, 31%, and 41% • On the lookout for relevant applications suitable for benchmarking • NetThreads is available with its compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads

  31. ECE Dept., University of Toronto. Martin Labrecque, Gregory Steffan. martinL@eecg.utoronto.ca. NetThreads: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads

  32. Backup

  33. Future Work • Adding custom hardware accelerators: same interconnect as the processors, same synchronization interface • Evaluate speculative threading: alleviate the need for fine-grained synchronization, reduce conservative synchronization overhead

  34. Software Network Processing • Not meant for straightforward tasks accomplished at line speed in hardware, e.g. basic switching and routing • Advantages compared to hardware: complex applications are best described in high-level software; easier to design, with fast time-to-market; can interface with custom accelerators and controllers; can be easily updated • Our focus: stateful applications, whose data structures are modified by most packets and whose code is difficult to pipeline into balanced stages • Run-to-completion / pool-of-threads model for parallelism: each thread processes a packet from beginning to end, with no thread-specific behavior (sketched below)
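
A minimal sketch of the run-to-completion / pool-of-threads loop; the packet source, output, and shared flow counter are hypothetical stand-ins for the NetThreads input/output buffers and application state:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define THREADS 4

typedef struct { int id; } packet_t;

static int next_id;                        /* stand-in packet sequence */
static int flow_count;                     /* shared, stateful data */
static pthread_mutex_t flow_lock = PTHREAD_MUTEX_INITIALIZER;

static packet_t *get_packet(void)          /* stand-in for the input buffer */
{
    packet_t *p = malloc(sizeof *p);
    p->id = __sync_fetch_and_add(&next_id, 1);
    return p;
}

static void send_packet(packet_t *p)       /* stand-in for the output buffer */
{
    printf("sent packet %d\n", p->id);
    free(p);
}

static void *packet_thread(void *arg)      /* every thread runs the same code */
{
    (void)arg;
    for (int i = 0; i < 4; i++) {
        packet_t *p = get_packet();        /* 1. claim the next packet */
        pthread_mutex_lock(&flow_lock);    /* 2. guard the shared state */
        flow_count++;                      /*    most packets modify it */
        pthread_mutex_unlock(&flow_lock);
        send_packet(p);                    /* 3. carried through to completion */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, packet_thread, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    printf("processed %d packets\n", flow_count);
    return 0;
}
```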

  35. Cycle Breakdown in Simulation [Charts: Classifier, NAT, UDHCP] • Removed the cycles stalled waiting for a lock • Throughput is still dominated by serialization

  36. Conclusions • Recently built a transactional multiprocessor • Based on a single-threaded design • Thread count limited by signature size • Promising performance results

  37. Envisioned System (Someday) [Diagram: many processors exploiting control-flow parallelism alongside hardware accelerators exploiting data-level parallelism] • Many compute engines • Delivers the expected performance • Hardware handles communication and synchronization • Processors inside an FPGA?

  38. Performance in Packet Processing • The application defines the required throughput: edge routing (≥ 1 Gbps/link), home networking (~100 Mbps/link), scientific instruments (< 100 Mbps/link) • Our measure of throughput: a bisection search for the minimum packet inter-arrival time, while not dropping any packet (a sketch follows) • Are soft processors fast enough?
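
A sketch of the bisection search itself; drops_packets() is a hypothetical stand-in for replaying traffic at a given inter-arrival time and checking for loss, here wired to an assumed capacity of 152 cycles per packet (the 64-byte budget from slide 4) so the example is self-contained:

```c
#include <stdio.h>

/* Stand-in for a real trial run at the given packet inter-arrival time. */
static int drops_packets(double interarrival_cycles)
{
    return interarrival_cycles < 152.0;   /* assumed hidden capacity */
}

int main(void)
{
    double lo = 1.0, hi = 4000.0;   /* known-too-fast .. known-sustainable */
    while (hi - lo > 0.5) {         /* bisect to half-cycle precision */
        double mid = (lo + hi) / 2.0;
        if (drops_packets(mid))
            lo = mid;               /* too fast: packets were lost */
        else
            hi = mid;               /* sustained: try a faster rate */
    }
    printf("minimum inter-arrival: %.1f cycles/packet\n", hi);
    return 0;
}
```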

  39. Key Design Features

  40. Efficient Network Processing 1. Memory system with specialized memories 2. Multiple-processor support 3. Multithreaded soft processor

  41. Multithreading on 3 benchmark applications Why isn’t the 2nd processor always improving throughput?

  42. Cycle Breakdown in Simulation Most of the time is spent waiting for a packet

  43. System Under-Utilized • A consequence of the zero-packet-drop policy

  44. Impact of Allowing Packet Drops (NAT benchmark) • The processors can actually process packets much faster

  45. Impact of Allowing Packet Drops (NAT benchmark, continued)

  46. Fixed packet rate
