
An Efficient Programmable 10 Gigabit Ethernet Network Interface Card

Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai

Designing a 10 Gigabit NIC
  • Programmability for performance
    • Computation offloading improves performance
  • NICs have power, area concerns
    • Architecture solutions should be efficient
  • Above all, must support 10 Gb/s links
    • What are the computation, memory requirements?
    • What architecture efficiently meets them?
    • What firmware organization should be used?

Mechanisms for an Efficient Programmable 10 Gb/s NIC
  • A partitioned memory system
    • Low-latency access to control structures
    • High-bandwidth, high-capacity access to frame data
  • A distributed task-queue firmware
    • Utilizes frame-level parallelism to scale across many simple, low-frequency processors
  • New RMW instructions
    • Reduce firmware frame-ordering overheads by 50% and reduce clock frequency requirement by 17%

Outline
  • Motivation
  • How Programmable NICs Work
  • Architecture Requirements, Design
  • Frame-parallel Firmware
  • Evaluation

How Programmable NICs Work

[Block diagram: the NIC's processor(s), memory, PCI interface, and Ethernet interface connected by an on-board bus, with the Ethernet interface attached to the external Ethernet link.]
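
As a rough illustration of the flow implied by this diagram, the sketch below shows a transmit path a NIC processor might run; every function and structure name here is hypothetical, not taken from this design. The processor fetches a descriptor over the PCI interface, stages the frame in NIC memory, and hands it to the Ethernet interface.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers standing in for the diagram's blocks; none of these
   names come from the actual firmware. */
extern uint64_t host_tx_ring_next(void);                         /* address of next TX descriptor in host memory */
extern void     pci_dma_read(void *dst, uint64_t src, size_t n); /* PCI interface: DMA from host to NIC */
extern void    *nic_mem_alloc(size_t n);                         /* buffer in on-NIC memory */
extern void     eth_enqueue_tx(void *buf, size_t n);             /* Ethernet interface: queue frame for transmit */

struct tx_descriptor { uint64_t host_addr; uint32_t len; };

void tx_loop(void)
{
    for (;;) {
        struct tx_descriptor d;
        pci_dma_read(&d, host_tx_ring_next(), sizeof d);  /* fetch the descriptor over PCI */
        void *buf = nic_mem_alloc(d.len);                 /* stage the frame in NIC memory */
        pci_dma_read(buf, d.host_addr, d.len);            /* DMA the frame data from the host */
        eth_enqueue_tx(buf, d.len);                       /* hand the frame to the Ethernet interface */
    }
}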

Per-frame Requirements

[Table: processing and control data requirements per frame, as determined by dynamic traces of relevant NIC functions.]

Aggregate Requirements: 10 Gb/s, Maximum-Sized Frames

1514-byte Frames at 10 Gb/s

812,744 Frames/s
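
The 812,744 figure follows directly from Ethernet framing overhead. A minimal check, assuming the standard 4-byte FCS, 8-byte preamble, and 12-byte inter-frame gap on top of each 1514-byte frame:

/* Frames per second for maximum-sized frames at 10 Gb/s, assuming standard
   Ethernet overhead of 4-byte FCS + 8-byte preamble + 12-byte inter-frame gap. */
#include <stdio.h>

int main(void)
{
    const double link_bps   = 10e9;               /* 10 Gb/s link rate */
    const double wire_bytes = 1514 + 4 + 8 + 12;  /* 1538 bytes on the wire per frame */
    printf("%.0f frames/s\n", link_bps / (wire_bytes * 8.0));  /* prints 812744 */
    return 0;
}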

Meeting 10 Gb/s Requirements with Hardware
  • Processor Architecture
    • At least 435 MIPS within an embedded device
    • Does NIC firmware have ILP?
  • Memory Architecture
    • Low latency control data
    • High bandwidth, high capacity frame data
    • … both, how?
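
For scale: at the 812,744 frames/s computed above, 435 MIPS amounts to a budget of roughly 435,000,000 / 812,744, or about 535 instructions per maximum-sized frame.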

ILP Processors for NIC Firmware?
  • ILP limited by data, control dependences
  • Analysis of dynamic traces reveals these dependences

Processors: 1-Wide, In-order
  • Achieving 2x per-core performance is costly
    • Branch prediction, reorder buffer, renaming logic, wakeup logic
    • These overheads translate to greater than 2x core power and area costs
    • Acceptable for a general-purpose processor; not for an embedded device
  • Other opportunities for parallelism? YES!
    • Many steps to process a frame - run them simultaneously
    • Many frames need processing - process simultaneously
  • Use parallel single-issue cores

Memory Architecture
  • Competing demands
    • Frame data: High bandwidth, high capacity for many offload mechanisms
    • Control data: Low latency; coherence among processors, PCI Interface, and Ethernet Interface
  • The traditional solution: Caches
    • Advantages: low latency, transparent to the programmer
    • Disadvantages: Hardware costs (tag arrays, coherence)
    • In many applications, advantages outweigh costs

Are Caches Effective?

[Chart: SMPCache trace analysis of a 6-processor NIC architecture.]

Choosing a Better Organization

[Diagram: a partitioned memory organization compared with a cache hierarchy.]

Putting it All Together

[Architecture diagram: P CPUs, each with a private I-cache backed by a shared instruction memory, connect through a (P+4) x S 32-bit crossbar to S scratchpads; the PCI interface, the Ethernet interface, and an external memory interface to off-chip DRAM also attach to the crossbar, and the PCI interface connects to the host PCI bus.]
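
To make the partitioned organization concrete, the sketch below shows one way firmware might place data, assuming hypothetical base addresses and structure layouts that are not taken from the actual design: small, latency-critical control structures go in a scratchpad reachable over the crossbar, while bulky frame contents stay in off-chip DRAM and are moved only by DMA.

#include <stdint.h>

/* Hypothetical address map; the addresses and layouts are illustrative only. */
#define SCRATCHPAD0_BASE 0x00100000u   /* low-latency on-chip scratchpad */
#define FRAME_DRAM_BASE  0x40000000u   /* high-capacity off-chip frame memory */

/* Control data: small, latency-critical, shared by the CPUs and the interfaces. */
struct rx_descriptor {
    uint32_t frame_addr;   /* points into the frame-data (DRAM) partition */
    uint16_t length;
    uint16_t status;
};

/* The descriptor ring lives in a scratchpad; frame contents live in DRAM and
   are moved by the DMA engines rather than by processor loads and stores. */
#define RX_RING    ((volatile struct rx_descriptor *)SCRATCHPAD0_BASE)
#define FRAME_POOL ((volatile uint8_t *)FRAME_DRAM_BASE)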

Parallel Firmware
  • NIC processing steps already well-defined
  • Previous Gigabit NIC firmware divides steps between 2 processors
  • … but does this mechanism scale?

Task Assignment with an Event Register

[Diagram: a hardware event register with one bit per event type - a PCI read bit, a SW event bit, and other bits. Annotations show the PCI interface finishing work, the processor(s) inspecting transactions, the processor(s) needing to enqueue TX data, and the processor(s) passing data to the Ethernet interface.]
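
A hedged sketch of how a processor might use such an event register follows; the bit positions, register name, and handler names are illustrative assumptions, not the NIC's real interface. Each processor reads the register and dispatches the handler for any set bit.

#include <stdint.h>

#define EVT_PCI_READ_DONE (1u << 0)   /* PCI interface finished a read (illustrative bit position) */
#define EVT_SW_ENQUEUE_TX (1u << 1)   /* firmware needs to enqueue TX data (illustrative bit position) */

extern volatile uint32_t event_register;   /* assumed memory-mapped hardware event register */
extern void handle_pci_read_done(void);    /* e.g. pass DMAed data toward the Ethernet interface */
extern void handle_sw_enqueue_tx(void);

void event_loop(void)
{
    for (;;) {
        uint32_t events = event_register;   /* inspect pending event bits */
        if (events & EVT_PCI_READ_DONE)
            handle_pci_read_done();
        if (events & EVT_SW_ENQUEUE_TX)
            handle_sw_enqueue_tx();
        /* ...other bits dispatched the same way... */
    }
}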

Task-level Parallel Firmware

[Timeline diagram of which function runs on Proc 0 and Proc 1 over time: reading PCI hardware status (tracked by the PCI read bit), processing DMAs 0-4 and 5-9, and transferring DMAs 0-4 and 5-9, with idle periods on both processors.]

Frame-level Parallel Firmware

[Timeline diagram of which function runs on Proc 0 and Proc 1 over time: after reading the PCI hardware status, Proc 0 builds the event for, processes, and transfers DMAs 0-4 while Proc 1 does the same for DMAs 5-9; idle periods are shorter than in the task-level organization.]
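
The idea behind the frame-level organization is that any idle processor can pick up the next per-frame unit of work instead of owning a fixed pipeline step. Below is a minimal sketch of such a distributed task queue, with assumed lock and queue primitives that are not the firmware's actual interface.

#include <stdint.h>

struct frame_event { uint16_t frame_index; uint16_t type; };

extern void spin_lock(volatile uint32_t *l);       /* assumed synchronization primitive */
extern void spin_unlock(volatile uint32_t *l);
extern int  dequeue_event(struct frame_event *e);  /* shared queue of per-frame work items; 0 if empty */
extern void process_frame_event(const struct frame_event *e);

static volatile uint32_t queue_lock;

/* Every processor runs this loop; whichever CPU is idle takes the next frame's
   work item, so frames are processed in parallel instead of tying each task
   to a fixed processor. */
void worker(void)
{
    struct frame_event ev;
    for (;;) {
        spin_lock(&queue_lock);
        int have = dequeue_event(&ev);
        spin_unlock(&queue_lock);
        if (have)
            process_frame_event(&ev);
    }
}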

Evaluation Methodology
  • Spinach: A library of cycle-accurate LSE simulator modules for network interfaces
    • Memory latency, bandwidth, contention modeled precisely
    • Processors modeled in detail
    • NIC I/O (PCI, Ethernet Interfaces) modeled in detail
    • Verified by modeling the Tigon-2 Gigabit Ethernet NIC (LCTES 2004)
  • Idea: Model everything inside the NIC
    • Gather performance, trace data

Processor Performance
  • Achieves 83% of theoretical peak IPC
  • Small I-Caches work
  • Sensitive to mem stalls
    • Half of loads are part of a load-to-use sequence
    • Conflict stalls could be reduced with more ports, more banks

Reducing Frame Ordering Overheads
  • Firmware ordering costly - 30% of execution
  • Synchronization and bitwise check/update operations occupy processors and memory
  • Solution: Atomic bitwise operations that also update a pointer according to last set location

Maintaining Frame Ordering

[Diagram: a frame status array with one bit per frame index. CPU A and CPU B prepare frames and set the corresponding status bits; CPU C takes a lock, iterates over the array to detect completed frames, notifies the hardware so the Ethernet interface sends them in order, and then unlocks.]
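
The lock-protected scan in the figure might look like the sketch below; all names are illustrative. The RMW instructions proposed on the previous slide would fold the "set a status bit and advance past the last contiguously set location" step into one atomic operation, removing much of this locking and iteration, consistent with the roughly 50% reduction in ordering overhead cited earlier.

#include <stdint.h>

#define NUM_FRAMES 256

extern volatile uint8_t  frame_done[NUM_FRAMES];   /* frame status array: 1 = frame prepared */
extern volatile uint32_t order_lock;
extern uint32_t next_to_notify;                     /* first frame not yet handed to hardware */

extern void lock(volatile uint32_t *l);
extern void unlock(volatile uint32_t *l);
extern void notify_hardware(uint32_t first, uint32_t count);  /* tell the Ethernet interface to send */

/* Software-only ordering: under a lock, walk the contiguous run of completed
   frames and notify the hardware so frames go out in order. */
void flush_in_order(void)
{
    lock(&order_lock);
    uint32_t start = next_to_notify, n = 0;
    while (n < NUM_FRAMES && frame_done[(start + n) % NUM_FRAMES])  /* iterate over set status bits */
        n++;
    if (n) {
        notify_hardware(start, n);
        for (uint32_t i = 0; i < n; i++)
            frame_done[(start + i) % NUM_FRAMES] = 0;   /* consume the status bits */
        next_to_notify = (start + n) % NUM_FRAMES;
    }
    unlock(&order_lock);
}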

RMW Instructions Reduce Clock Frequency
  • Performance: 6 cores at 166 MHz (with RMW) = 6 cores at 200 MHz (without)
    • Performance is equivalent at all frame sizes
    • 17% reduction in frequency requirement
  • Dynamically tasked firmware balances the benefit
    • Send cycles reduced by 28.4%
    • Receive cycles reduced by 4.7%

Conclusions: A Programmable 10 Gb/s NIC
  • This NIC architecture relies on:
    • Data Memory System - Partitioned organization, not coherent caches
    • Processor Architecture - Parallel scalar processors
    • Firmware - Frame-level parallel organization
    • RMW Instructions - reduce ordering overheads
  • A programmable NIC: A substrate for offload services