NetThreads: Programming NetFPGA with Threaded Software

Geoff Salmon

Monia Ghobadi

Yashar Ganjali

Martin Labrecque

Gregory Steffan

ECE Dept.

CS Dept.

University of Toronto

Real-Life Customers
  • Hardware:
    • NetFPGA board, 4 GigE ports, Virtex 2 Pro FPGA
  • Collaboration with CS researchers
    • Interested in performing network experiments
    • Not in coding Verilog
    • Want to use the GigE links at maximum capacity
  • Requirements:
    • Easy-to-program system
    • Efficient system

What would the ideal solution look like?

Envisioned System (Someday)

[Figure: an array of many processors providing control-flow parallelism, alongside hardware accelerators providing data-level parallelism]

  • Many compute engines
  • Delivers the expected performance
  • Hardware handles communication and synchronization

Processors inside an FPGA?

Soft Processors in FPGAs

[Figure: an FPGA containing a soft processor alongside DDR controllers and Ethernet MACs]

  • Easier to program than HDL
  • Customizable
  • Soft processors: processors implemented in the FPGA fabric
  • FPGAs increasingly implement SoCs with CPUs
  • Commercial soft processors: NIOS-II and MicroBlaze

What is the performance requirement?

Performance in Packet Processing

  • The application defines the throughput required:
    • Edge routing (≥ 1 Gbps/link)
    • Home networking (~100 Mbps/link)
    • Scientific instruments (< 100 Mbps/link)

  • Our measure of throughput:
    • Bisection search for the minimum packet inter-arrival time (sketched below)
    • Must not drop any packets
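
As an illustration of the measurement method, here is a minimal sketch of the bisection search in C. The helper send_burst_with_gap() and its stub threshold are assumptions for the example, not part of NetThreads.

    #include <stdio.h>

    /* Stub for illustration: pretend the system sustains any
     * inter-arrival gap of at least 608 ns without drops. */
    static int send_burst_with_gap(int gap_ns)
    {
        return gap_ns >= 608;
    }

    /* Bisection search: smallest inter-arrival gap (ns) at which
     * no packet is dropped. lo_ns must fail, hi_ns must succeed. */
    static int min_interarrival_ns(int lo_ns, int hi_ns)
    {
        while (hi_ns - lo_ns > 1) {
            int mid = lo_ns + (hi_ns - lo_ns) / 2;
            if (send_burst_with_gap(mid))
                hi_ns = mid;      /* no drops: try a smaller gap */
            else
                lo_ns = mid;      /* packets dropped: back off   */
        }
        return hi_ns;
    }

    int main(void)
    {
        printf("min inter-arrival: %d ns\n", min_interarrival_ns(0, 10000));
        return 0;
    }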

Are soft processors fast enough?

Realistic Goals
  • A 10^9 bps stream with the normal 12-byte inter-frame gap
  • 2 processors running at 125 MHz
  • Cycle budget per packet, with the two processors sharing the stream (see the arithmetic below):
    • 152 cycles for minimally-sized 64B packets
    • 3060 cycles for maximally-sized 1518B packets
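
The budget follows directly from the line rate; a back-of-the-envelope check in C (it assumes the preamble is not counted and that the two processors split the packet stream, which the slides do not state explicitly):

    #include <stdio.h>

    int main(void)
    {
        const int gap_bytes = 12;  /* inter-frame gap */
        const int cpus = 2;
        /* At 1 Gbps a byte takes 8 ns on the wire; at 125 MHz one
         * cycle is 8 ns, so each wire byte buys one cycle per CPU. */
        printf("64B packets:   %d cycles\n", (64 + gap_bytes) * cpus);
        printf("1518B packets: %d cycles\n", (1518 + gap_bytes) * cpus);
        return 0;  /* prints 152 and 3060, matching the slide */
    }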

Soft processors: non-trivial processing at line rate!

How can they be organized efficiently?

Efficient Network Processing

  1. Memory system with specialized memories
  2. Multiple-processor support
  3. Multithreaded soft processor

Multiprocessor System Diagram

[Figure: two 4-threaded processors, each with a private instruction cache (I$), share a synchronization unit, instruction memory, a data cache backed by off-chip DDR, and dedicated input/output memories; packets flow from the input buffer through the processors to the output buffer]

- The dedicated input/output memories overcome the 2-port limitation of block RAMs

- The shared data cache is not the main bottleneck in our experiments

Performance of Single-Threaded Processors
  • Single-issue, in order pipeline
  • Should commit 1 instruction every cycle, but:
    • stall on instruction dependences
    • stall on memory, I/O, and accelerator accesses
  • Throughput depends on sequential execution of:
    • packet processing
    • device control
    • event monitoring


Solution to Avoid Stalls: Multithreading

Avoiding Processor Stall Cycles

  • Multithreading: execute streams of independent instructions

[Figure, BEFORE: a single thread in the 5-stage pipeline (F, D, E, M, W) over time; data or control hazards stall traditional execution]

[Figure, AFTER: Threads 1-4 interleaved round-robin in the 5-stage pipeline; ideally, eliminates all stalls]

  • 4 threads eliminate hazards in a 5-stage pipeline
  • 5-stage pipeline is 77% more area-efficient [FPL’07]
Infrastructure
  • Compilation:
    • modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA
  • Timing:
    • no free PLL: processors run at the speed of the Ethernet MACs, 125 MHz
  • Platform:
    • 2 processors, 4 MAC + 1 DMA ports, 64 MB of 200 MHz DDR2 SDRAM
    • Virtex II Pro 50 (speed grade 7 ns)
    • 16 KB private instruction caches and a shared write-back data cache
    • Capacities would increase on a more modern FPGA
  • Validation:
    • Reference trace from MIPS simulator
    • Modelsim and online instruction trace collection

- A PC server can send ~0.7 Gbps of maximally-sized packets

- A simple packet-echo application can keep up

- Complex applications are the bottleneck, not the architecture

Our Benchmarks

[Table: the three benchmark applications: Classifier, NAT, and UDHCP]

Realistic, non-trivial applications dominated by control flow

What Is Limiting Performance?

[Figure: packet backlog builds up because synchronization serializes tasks]

Let’s focus on the underlying problem: Synchronization

Real Threads Synchronize
  • All threads execute the same code
  • Concurrent threads may access shared data
  • Critical sections ensure correctness

All threads (1-4) execute:

Lock();
shared_var = f();
Unlock();
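
For concreteness, a minimal runnable version of this critical-section pattern in C, using pthreads as a stand-in for the NetThreads lock interface (which the deck does not spell out):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_var;            /* data touched by every thread */

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);    /* Lock();           */
        shared_var++;                 /* shared_var = f(); */
        pthread_mutex_unlock(&lock);  /* Unlock();         */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];               /* 4 threads, as on the slide */
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("shared_var = %d\n", shared_var);  /* always 4 */
        return 0;
    }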

Impact on round-robin scheduled threads?

Multithreaded Processor with Synchronization

[Figure: four threads interleaved round-robin in the 5-stage pipeline (F, D, E, M, W) over time; one thread acquires a lock and later releases it while the other threads contend for it]

Synchronization Wrecks Round-Robin Multithreading

[Figure: the 5-stage pipeline over time; a thread blocked on an acquire keeps occupying its round-robin slots until the lock is released, wasting cycles]

With round-robin thread scheduling and contention on locks:

< 4 threads execute concurrently

> 18% of cycles are wasted while blocked on synchronization

Better Handling of Synchronization

[Figure, BEFORE: threads blocked on a lock still occupy pipeline slots (F, D, E, M, W), wasting cycles]

[Figure, AFTER: Threads 3 and 4 are DESCHEDULED while blocked, so the remaining threads fill every pipeline slot]

Thread Scheduler
  • Suspend any thread waiting for a lock
  • Round-robin among the remaining threads (sketched below)
  • Unlock operations resume threads across processors
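
A minimal software sketch of this scheduling policy, in C. In NetThreads the logic is implemented in hardware; the data structures and names here are hypothetical.

    #include <stdbool.h>

    enum { NTHREADS = 4 };

    static bool blocked[NTHREADS];    /* waiting on a lock?      */
    static int last = NTHREADS - 1;   /* last thread issued from */

    /* Round-robin over the remaining (unblocked) threads;
     * returns -1 if every thread is suspended on a lock. */
    int next_thread(void)
    {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (!blocked[t]) {
                last = t;
                return t;
            }
        }
        return -1;
    }

    /* An unlock resumes the threads waiting on that lock
     * (across both processors, in the real design). */
    void unlock_resume_all(void)
    {
        for (int t = 0; t < NTHREADS; t++)
            blocked[t] = false;
    }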

- A multithreaded processor hides hazards across active threads

- Fewer than N active threads requires hazard detection

- But hazard detection was on the critical path of the single-threaded processor

Is there a low-cost solution?

Static Hazard Detection
  • Hazards can be determined at compile time

- Hazard distances are encoded as part of the instructions

Static hazard detection allows scheduling without an extra pipeline stage

Very low area overhead (5%), no frequency penalty
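
To make "hazard distance" concrete, a small C sketch of what the compiler computes: for each instruction, the distance to the next instruction that reads its result. The struct layout and the distance cap are assumptions for illustration.

    #include <stdio.h>

    struct insn {
        int dest;                  /* register written, -1 if none */
        int src1, src2;            /* registers read,  -1 if none  */
    };

    /* Distance from instruction i to the first later instruction
     * that reads its destination register (capped so it fits in
     * the spare instruction bits; see the Implementation slide). */
    int hazard_distance(const struct insn *code, int n, int i)
    {
        for (int j = i + 1; j < n; j++)
            if (code[j].src1 == code[i].dest ||
                code[j].src2 == code[i].dest)
                return j - i;
        return 15;                 /* no nearby consumer */
    }

    int main(void)
    {
        /* add r1,r2,r3 ; add r4,r5,r6 ; add r7,r1,r4 */
        struct insn code[] = { {1,2,3}, {4,5,6}, {7,1,4} };
        printf("%d\n", hazard_distance(code, 3, 0));  /* prints 2 */
        return 0;
    }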

Results on 3 Benchmark Applications

- Thread scheduling improves throughput by 63%, 31%, and 41%

- Why isn’t the 2nd processor always improving throughput?

Cycle Breakdown in Simulation

[Figure: cycle breakdown for Classifier, NAT, and UDHCP]

- Removed cycles stalled waiting for a lock

- What is the bottleneck?

Impact of Allowing Packet Drops

- System still under-utilized

- Throughput still dominated by serialization

Future Work
  • Adding custom hardware accelerators
    • Same interconnect as processors
    • Same synchronization interface
  • Evaluate speculative threading
    • Alleviate the need for fine-grained synchronization
    • Reduce conservative synchronization overhead
Conclusions
  • Efficient multithreaded design
    • Parallel threads hide stalls on one thread
    • Thread scheduler mitigates synchronization costs
  • System Features
    • System is easy to program in C
    • Performance from parallelism is easy to get

We are on the lookout for relevant applications suitable for benchmarking

NetThreads available with compiler at:

http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads


Geoff Salmon

Monia Ghobadi

Yashar Ganjali

Martin Labrecque

Gregory Steffan

ECE Dept.

CS Dept.

University of Toronto

NetThreads available with compiler at:

http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads

Software Network Processing
  • Not meant for:
    • Straightforward tasks accomplished at line speed in hardware
    • E.g. basic switching and routing
  • Advantages compared to hardware:
    • Complex applications are best described in high-level software
    • Easier to design, with faster time-to-market
    • Can interface with custom accelerators, controllers
    • Can be easily updated
  • Our focus: stateful applications
    • Data structures modified by most packets
    • Difficult to pipeline the code into balanced stages
  • Run-to-completion / pool-of-threads model for parallelism (sketched below):
    • Each thread processes a packet from beginning to end
    • No thread-specific behavior
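
A minimal, self-contained sketch of the run-to-completion / pool-of-threads model in C: every thread runs the same loop and carries one packet from beginning to end. The queue helpers below are stubs, not the actual NetThreads API.

    #include <stdio.h>

    struct packet { int id; };

    static struct packet pool[3] = { {1}, {2}, {3} };
    static int next_in;

    static struct packet *get_packet(void)      /* stub input queue */
    {
        return next_in < 3 ? &pool[next_in++] : NULL;
    }

    static void process(struct packet *p)       /* application logic */
    {
        p->id *= 10;   /* may lock shared state in real code */
    }

    static void send_packet(struct packet *p)   /* stub output queue */
    {
        printf("sent packet %d\n", p->id);
    }

    /* Identical code on every hardware thread: no thread-specific
     * behavior, no pipelining of the application into stages. */
    int main(void)
    {
        struct packet *p;
        while ((p = get_packet()) != NULL) {
            process(p);
            send_packet(p);
        }
        return 0;
    }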
Cycle Breakdown in Simulation

Classifier

NAT

UDHCP

- Removed cycles stalled waiting for a lock

- Throughput still dominated by serialization

More Sophisticated Thread Scheduling

[Figure: pipeline with an extra Thread Selection stage, fed by a MUX, between Fetch and Register Read, ahead of Execute, Memory, and Writeback]

  • Add a pipeline stage to pick a hazard-free instruction
  • Result:
    • Increased instruction latency
    • Increased hazard window
    • Increased branch misprediction cost

Add hazard detection without an extra pipeline stage?

Implementation

[Figure: the two 4-threaded processors fetch 36-bit instruction words on chip, while the off-chip DDR stores 32-bit words]

  • Where to store the hazard distance bits?
    • Block RAMs are multiples of 9 bits wide
    • A 36-bit word leaves 4 bits available next to the 32-bit instruction (see the sketch below)
  • Also encode lock and unlock flags
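
A hypothetical sketch in C of packing a 32-bit MIPS instruction plus 4 metadata bits (hazard distance, lock/unlock flags) into one 36-bit block-RAM word. The field layout is an assumption; the deck only says 4 bits are available.

    #include <stdint.h>
    #include <assert.h>

    /* metadata: bits [3:0]; instruction: bits [35:4] (assumed layout) */
    static uint64_t pack36(uint32_t insn, uint8_t meta)
    {
        assert(meta < 16);                        /* only 4 bits fit */
        return ((uint64_t)insn << 4) | meta;
    }

    static uint32_t unpack_insn(uint64_t w) { return (uint32_t)(w >> 4); }
    static uint8_t  unpack_meta(uint64_t w) { return (uint8_t)(w & 0xF); }

    int main(void)
    {
        uint64_t w = pack36(0x00221820u /* add r3,r1,r2 */, 0x5);
        assert(unpack_insn(w) == 0x00221820u);
        assert(unpack_meta(w) == 0x5);
        return 0;
    }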


How to convert instructions from 36 bits to 32 bits?

Instruction Compaction: 36 → 32 Bits

[Figure: re-encoded instruction formats]

  • R-Type instructions, e.g. add rd, rs, rt
  • J-Type instructions, e.g. j label
  • I-Type instructions, e.g. addi rt, rs, immediate

- De-compaction: 2 block RAMs + some logic between DDR and cache

- Not on the critical path of the pipeline