NetThreads: Programming NetFPGA with Threaded Software


Presentation Transcript



NetThreads: Programming NetFPGA with Threaded Software

Geoff Salmon

Monia Ghobadi

Yashar Ganjali

Martin Labrecque

Gregory Steffan

ECE Dept.

CS Dept.

University of Toronto



Real-Life Customers

  • Hardware:

    • NetFPGA board, 4 GigE ports, Virtex 2 Pro FPGA

  • Collaboration with CS researchers

    • Interested in performing network experiments

    • Not in coding Verilog

    • Want to use GigE link at maximum capacity

  • Requirements:

    • An easy-to-program system

    • An efficient system

What would the ideal solution look like?


Envisioned System (Someday)

[diagram: an array of processors exploiting data-level parallelism, paired with hardware accelerators exploiting control-flow parallelism]

  • Many compute engines

  • Delivers the expected performance

  • Hardware handles communication and synchronization

Processors inside an FPGA?


Soft Processors in FPGAs

[diagram: FPGA fabric containing a processor alongside DDR controllers and Ethernet MACs]

  • Easier to program than HDL

  • Customizable

  • Soft processors: processors implemented in the FPGA fabric

  • FPGAs increasingly implement SoCs with CPUs

  • Commercial soft processors: NIOS-II and Microblaze

What is the performance requirement?



Performance In Packet Processing

  • The application defines the throughput required

    • Edge routing (≥ 1 Gbps/link)

    • Home networking (~100 Mbps/link)

    • Scientific instruments (< 100 Mbps/link)

  • Our measure of throughput:

    • Bisection search for the minimum packet inter-arrival time

    • Must not drop any packet

Are soft processors fast enough?



Realistic Goals

  • A 10⁹ bps (1 Gbps) stream with the normal inter-frame gap of 12 bytes

  • 2 processors running at 125 MHz

  • Cycle budget:

    • 152 cycles for minimally-sized 64B packets;

    • 3060 cycles for maximally-sized 1518B packets

Soft processors: non-trivial processing at line rate!

How can they efficiently be organized?



Key Design Features


Efficient Network Processing:

  1. Multithreaded soft processor

  2. Memory system with specialized memories

  3. Multiple-processor support

Multiprocessor System Diagram

[diagram: two 4-threaded processors, each with a private instruction cache, connected through a synchronization unit to input/output packet memories and buffers, a shared data cache, and off-chip DDR; packets flow from the input buffer to the output buffer]

- Overcomes the 2-port limitation of block RAMs

- The shared data cache is not the main bottleneck in our experiments



Performance of Single-Threaded Processors

  • Single-issue, in-order pipeline

  • Should commit 1 instruction every cycle, but:

    • stalls on instruction dependences

    • stalls on memory, I/O, and accelerator accesses

  • Throughput depends on sequential execution of:

    • packet processing

    • device control

    • event monitoring

    → many concurrent threads

Solution to avoid stalls: multithreading


Avoiding Processor Stall Cycles

  • Multithreading: execute streams of independent instructions

BEFORE (single thread, traditional execution): [5-stage F/D/E/M/W pipeline diagram; a data or control hazard stalls the lone thread]

AFTER (4 threads, round-robin): [5-stage F/D/E/M/W pipeline diagram; instructions from Thread1–Thread4 interleave, ideally eliminating all stalls]

  • 4 threads eliminate hazards in a 5-stage pipeline

  • The 5-stage pipeline is 77% more area-efficient [FPL’07]



Multithreading Evaluation



Infrastructure

  • Compilation:

    • modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA

  • Timing:

    • no free PLL: the processors run at the speed of the Ethernet MACs, 125 MHz

  • Platform:

    • 2 processors, 4 MAC + 1 DMA ports, 64 Mbytes 200 MHz DDR2 SDRAM

    • Virtex II Pro 50 (speed grade 7ns)

    • 16KB private instruction caches and shared data write-back cache

    • Capacity would be increased on a more modern FPGA

  • Validation:

    • Reference trace from MIPS simulator

    • Modelsim and online instruction trace collection

- A PC server can send ~0.7 Gbps of maximum-size packets

- A simple packet-echo application can keep up

- Complex applications are the bottleneck, not the architecture



Our benchmarks

Realistic non-trivial applications, dominated by control flow



What is limiting performance?

[chart: packet backlog builds up due to synchronization; serializing tasks limit throughput]

Let’s focus on the underlying problem: Synchronization



Addressing Synchronization Overhead



Real Threads Synchronize

  • All threads execute the same code

  • Concurrent threads may access shared data

  • Critical sections ensure correctness

All four threads (Thread1–Thread4) execute:

    Lock();
    shared_var = f();
    Unlock();

Impact on round-robin scheduled threads?


Multithreaded Processor with Synchronization

[5-stage pipeline diagram: one thread acquires a lock and releases it several instructions later; while the lock is held, instructions from other threads contending for it cannot make progress]


Synchronization Wrecks Round-Robin Multithreading

[5-stage pipeline diagram: threads blocked on a contended lock leave their round-robin slots empty]

With round-robin thread scheduling and contention on locks:

  • fewer than 4 threads execute concurrently

  • more than 18% of cycles are wasted while blocked on synchronization


Better Handling of Synchronization

BEFORE: [pipeline diagram: threads blocked on the lock keep their round-robin slots, wasting issue cycles]

AFTER: [pipeline diagram: Thread3 and Thread4 are descheduled while waiting for the lock; Thread1 and Thread2 fill every pipeline slot]



Thread scheduler

  • Suspend any thread waiting for a lock

  • Round-robin among the remaining threads

  • Unlock operation resumes threads across processors

- The multithreaded processor hides hazards across active threads

- Running fewer than N threads requires hazard detection

But hazard detection was on the critical path of the single-threaded processor.

Is there a low-cost solution?



Static Hazard Detection

  • Hazards can be determined at compile time

- Hazard distances are encoded as part of the instructions

Static hazard detection allows scheduling without an extra pipeline stage

Very low area overhead (5%), no frequency penalty



Thread Scheduler Evaluation



Results on 3 benchmark applications

- Thread scheduling improves throughput by 63%, 31%, and 41%

- Why isn’t the 2nd processor always improving throughput?



Cycle Breakdown in Simulation

[chart: cycle breakdown for the Classifier, NAT, and UDHCP benchmarks]

- Removed cycles stalled waiting for a lock

- What is the bottleneck?



Impact of Allowing Packet Drops

- System still under-utilized

- Throughput still dominated by serialization



Future Work

  • Adding custom hardware accelerators

    • Same interconnect as processors

    • Same synchronization interface

  • Evaluate speculative threading

    • Alleviate the need for fine-grained synchronization

    • Reduce conservative synchronization overhead



Conclusions

  • Efficient multithreaded design

    • Parallel threads hide stalls on one thread

    • Thread scheduler mitigates synchronization costs

  • System Features

    • System is easy to program in C

    • Performance from parallelism is easy to get

On the lookout for relevant applications suitable for benchmarking

NetThreads available with compiler at:

http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads





Backup



Software Network Processing

  • Not meant for:

    • Straightforward tasks accomplished at line speed in hardware

    • E.g. basic switching and routing

  • Advantages compared to hardware:

    • Complex applications are best described in high-level software

    • Easier to design, faster time-to-market

    • Can interface with custom accelerators, controllers

    • Can be easily updated

  • Our focus: stateful applications

    • Data structures modified by most packets

    • Difficult to pipeline the code into balanced stages

  • Run-to-Completion/Pool-of-Threads model for parallelism:

    • Each thread processes a packet from beginning to end

    • No thread-specific behavior


Impact of Allowing Packet Drops

[chart: throughput as packet drops are allowed, NAT benchmark]


Cycle Breakdown in Simulation

[chart: cycle breakdown for the Classifier, NAT, and UDHCP benchmarks]

- Removed cycles stalled waiting for a lock

- Throughput still dominated by serialization


More Sophisticated Thread Scheduling

[diagram: 6-stage pipeline: Fetch, Thread Selection, Register Read, Execute, Memory, Writeback]

  • Add pipeline stage to pick hazard-free instruction

  • Result:

    • Increased instruction latency

    • Increased hazard window

    • Increased branch mis-prediction cost


Add hazard detection without an extra pipeline stage?


Implementation

[diagram: two 4-threaded processors with instruction caches; instructions are stored on-chip as 36-bit words and converted to/from 32 bits at the off-chip DDR boundary]

  • Where to store the hazard distance bits?

    • Block RAMs are multiple of 9 bits wide

    • 36 bits word leaves 4 bits available

  • Also encode lock and unlock flags

[word layout: 32-bit instruction | 4-bit hazard/lock field]

How to convert instructions from 36 bits to 32 bits?


Instruction Compaction: 36 → 32 bits

  • R-Type instructions, e.g. add rd, rs, rt

  • J-Type instructions, e.g. j label

  • I-Type instructions, e.g. addi rt, rs, immediate

- De-compaction: 2 block RAMs plus some logic between the DDR and the cache

- Not on the critical path of the pipeline

