


Task Partitioning for Multi-Core Network Processors

Rob Ennals, Richard Sharp

Intel Research, Cambridge

Alan Mycroft

Programming Languages Research Group,

University of Cambridge Computer Laboratory



Talk Overview

  • Network Processors

    • What they are, and why they are interesting

  • Architecture Mapping Scripts (AMS)

    • How to separate your high level program from low level details

  • Task Pipelining

    • How it can go wrong, and how to make sure it goes right



Network Processors

  • Designed for high speed packet processing

    • Up to 40Gb/s

    • High performance per watt

    • ASIC performance with CPU programmability

  • Highly parallel

    • Multiple programmable cores

    • Specialised co-processors

    • Exploit the inherent parallelism of packet processing

  • Products available from many manufacturers

    • Intel, Broadcom, Hifn, Freescale, EZChip, Xelerated, etc



Lots of Parallelism

  • Intel IXP 2800: 16 cores, each with 8 threads

  • EZChip NP-1c: 5 different types of cores

  • Agere APP: several specialised cores

  • FreeScale C-5: 16 cores, 5 co-processors

  • Hifn 5NP4G: 16 cores

  • Xelerated X10: 200 VLIW packet engines

  • Broadcom BCM1480: 4 cores



Pipelined Programming Model

  • Used by many NP designs

  • Packets flow between cores

  • Why do this?

    • Cores may have different functional units

    • Cores may maintain state tables locally

    • Cores may have limited code space

    • Reduce contention for shared resources

    • Makes it easier to preserve packet ordering

[Diagram: packets flowing through a pipeline of cores — Core → Core → Core → Core]



An Example: IXP2800

  • 16 microengine cores

    • Each with 8 concurrent threads

    • Each with local memory and specialised functional units

  • Pipelined programming model

    • Dedicated datapath between adjacent microengines

  • Exposed IO Latency

    • Separate operations to schedule IO, and to wait for it to finish

  • No cache hierarchy

    • Must manually cache data in faster memories

  • Very powerful, but hard to program
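The exposed-IO-latency point above is the classic split-phase pattern: issue the memory operation first, do independent work, and only then wait for the result. The IXP microengines express this with separate issue/wait instructions; the Python sketch below is just an analogy for the same idea, with a thread-pool future standing in for the in-flight memory read:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_memory_read(addr):
    """Stand-in for a high-latency SRAM/DRAM read."""
    time.sleep(0.01)
    return addr * 2

pool = ThreadPoolExecutor(max_workers=1)

# Split-phase IO: schedule the read first...
future = pool.submit(slow_memory_read, 21)

# ...overlap it with independent computation...
checksum = sum(range(100))

# ...and only block when the result is actually needed.
value = future.result()
pool.shutdown()
print(value, checksum)  # 42 4950
```

Hoisting the `submit` as early as possible — and the `result` as late as possible — is exactly the manual "IO hoisting" the next slide complains about.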


IXP2800

[Block diagram: IXP2800 — 16 MEv2 microengines (MEv2 1–16), XScale core with 32K instruction and 32K data caches, 3 RDRAM channels (72b each) with a stripe/byte-align unit, 4 QDR SRAM channels with E/D queues (18b interfaces), 16KB scratch memory, hash unit (64/48/128), Rbuf and Tbuf (64 @ 128B), SPI-4 or CSIX media interface (16b), PCI (64b, 66 MHz), and CSRs: Fast_wr, UART, timers, GPIO, BootROM/SlowPort]

IXDP-2400

  • Things are even harder in practice…

  • Systems contain multiple NPs!

[Diagram: IXDP-2400 board — two IXP2400 NPs connected by a CSIX fabric; packets from the network enter one NP and packets to the network leave the other]


What People Do Now

  • Design their programs around the architecture

    • Explicitly program each microengine thread

    • Explicitly access low level functional units

    • Manually hoist IO operations to be early

  • THIS SUCKS!

    • High level program gets polluted with low level details

    • IO hoisting breaks modularity

    • Programs are hard to understand, hard to modify, hard to write, hard to maintain, and hard to port to other platforms.



The PacLang Project

  • Aiming to make it easier to program Network Processors

  • Based around the PacLang language

    • C-like syntax and semantics

    • Statically allocated threads, linked by queues

    • Abstracts away all low level details

  • A number of interesting features

    • Linear type system

    • Architecture Mapping scripts (this talk)

    • Various other features in progress

  • A prototype implementation is available



Architecture Mapping Scripts

  • Our compiler takes two files

    • A high level PacLang program

    • An architecture mapping script (AMS)

  • PacLang program contains no low-level details

    • Portable across different architectures

    • Very easy to read and debug

  • Low level details are all in the AMS

    • Specific to a particular architecture

    • Can change performance, but not semantics

    • Tells the compiler how to transform the program so that it executes efficiently
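To make the division of labour concrete, here is a hypothetical mapping script expressed as a Python dictionary. This is illustrative only — the key names (`pipeline`, `memory`, `unroll`) and the whole structure are invented for this sketch and are not the actual PacLang AMS syntax; the point is that every entry tunes performance while none of it can change what the program computes:

```python
# Hypothetical AMS content (invented structure, not real PacLang syntax).
ams = {
    # Split the user task "ip_forward" across three microengines.
    "pipeline": {"ip_forward": ["me1", "me2", "me3"]},
    # Decide which state lives in which physical memory.
    "memory": {"route_table": "sram", "counters": "scratch"},
    # Low-level code-generation hints.
    "unroll": {"checksum_loop": 4},
}

def memory_for(symbol):
    """Look up placement, defaulting to DRAM for anything unlisted."""
    return ams["memory"].get(symbol, "dram")

print(memory_for("route_table"), memory_for("packet_buf"))  # sram dram
```

Deleting any entry would leave a semantically identical, if slower, program — which is exactly the property the slides claim for the AMS.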



Design Flow with an AMS

[Flow: PacLang Program + AMS → Compiler → Analyse Performance → either Refine AMS (and recompile) or Deploy]



Advantages of the AMS Approach

  • Improved code readability and portability

    • The code isn’t polluted with low-level details

  • Easier to get programs correct

    • Correctness depends only on the PacLang program

    • The AMS can change the performance, but not the semantics

  • Easy exploration of optimisation choices

    • You only need to modify the AMS

  • Performance

    • The programmer still has a lot of control over the generated code.

    • No need to pass all control over to someone else’s optimiser



AMS + Optimiser = Good

  • Writing an optimiser that can do everything perfectly is hard

    • Network Processors are much harder to optimise for than CPUs

    • More like hardware synthesis than conventional compilation

  • Writing a program that applies an AMS is easier

  • AMS can fill in gaps left by an optimiser

    • Write an optimiser that usually does a reasonable job

    • Use an AMS to deal with places where the optimiser does poorly

  • Programmers like to have control

    • I may know exactly how I want to map my program to hardware

    • Optimisers can give unpredictable behaviour



An AMS is an addition, not an alternative to an automatic optimiser!

  • This is a sufficiently important point that it is worth making twice



What can an AMS say?

  • How to pipeline a task across multiple microengines

  • What to store in each kind of memory

  • When to move data between different memories

  • How to represent data in memory (e.g. pack or not?)

  • How to protect shared resources

  • How to implement queues

  • Which code should be considered the critical path

  • Which code should be placed on the XScale core

  • Low level details such as loop unrolling and function inlining

  • Which of several alternative algorithms to use

    And whatever else one might think of



AMS-based program pipelining

  • High-level program has problem-oriented concurrency

    • Division of program into tasks models the problem

    • Tasks do not map directly to hardware units

  • AMS transforms this to implementation-oriented concurrency

    • Original tasks are split and joined to make new tasks

    • New tasks map directly to hardware units

[Diagram: the Compiler, directed by the AMS, splits and joins a few User Tasks into many Hardware Tasks, one per hardware unit]



Task Pipelining

  • Convert one repeating task into several tasks with a queue between them

Original task (one loop):        A; B; C;

        │  Pipeline Transform
        ▼

Three tasks linked by queues:    A;  →  B;  →  C;

Pipelining is not always safe

  • May change the behaviour of the program:

Original task (repeating):       q.enq(1); q.enq(2);        q carries 1,2,1,2,...

After the Pipeline Transform:    t1: q.enq(1);    t2: q.enq(2);

If iterations of t1 get ahead of t2, q carries 1,1,2,2,...
Elements are now written to the queue out of order!
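The reordering above can be reproduced deterministically by simulating one adversarial schedule in which stage t1 runs two iterations before t2 runs at all (a sketch only — real interleavings are nondeterministic):

```python
from collections import deque

q = deque()     # the observable output queue
link = deque()  # the queue the transform inserts between t1 and t2

def t1():              # first half of the split task: q.enq(1)
    q.append(1)
    link.append(None)  # hand the iteration on to t2

def t2():              # second half of the split task: q.enq(2)
    link.popleft()
    q.append(2)

# Unpipelined, the program writes 1,2,1,2. Under this schedule,
# t1 gets two iterations ahead before t2 runs at all:
t1(); t1()
t2(); t2()
print(list(q))  # [1, 1, 2, 2] -- observable order has changed
```

The internal `link` queue keeps each half doing the right *number* of iterations; what it cannot preserve is the interleaving of their writes to the externally visible queue.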


Pipelining Safety is tricky (1/3)

  • Concurrent tasks interact in complex ways

Split task (pipeline split point between the halves):   q1.enq(1);  |  q2.enq(2);
A separate task passes values from q1 to q2:            q2.enq(q1.deq);

If the first half runs ahead, q1 carries 1,1,... and the forwarding task copies
those onto q2, which can then carry 1,1,2,2,... — values appear on q2 out of order.


Pipelining Safety is tricky (2/3)

  • Concurrent tasks interact in complex ways

Split task (pipeline split point between the halves):   q1.enq(1);  |  q2.enq(2);
Another task t3:                                        q1.enq(3); q2.enq(4);

Observed queue contents:    q1: 1,1,3,...      q2: 4,2,2,...

  • q1 says: 1,1 written before 3.
  • q2 says: 4 written before 2.
  • t3 says: 3 written before 4.
  • The unsplit task says: 2 written before 1,1.

This combination was not possible in the original program.


Pipelining safety is tricky 3 3

q2.enq(q1.deq);

q1.enq(q2.deq);

1,1,...

1,1,2,2,...

1,1,2,2

2,2,...

q1

q2

q1

q2

q1.enq(1);

q2.enq(2);

q1.enq(1);

q2.enq(2);

Pipeline split point

Pipeline split point

Pipelining Safety is tricky (3/3)

Unsafe

Safe



Checking Pipeline Safety

  • Difficult for programmer to know if pipeline is safe

  • Fortunately, our compiler checks safety

    • Rejects AMS if pipelining is unsafe

  • Applies a safety analysis that checks that pipelining cannot change observable program behaviour

  • I won’t subject you to the full safety analysis now

    • Read the details in the paper
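As a flavour of what such a check must decide, here is a toy, deliberately over-conservative stand-in. It is NOT the analysis from the paper: it simply rejects any split where both halves enqueue to externally visible queues, since the relative order of those enqueues could then be observed out of order (the helper names are invented for this sketch):

```python
def enqueue_targets(stmts):
    """Queues written by a list of statements like 'q1.enq(1)'."""
    return {s.split(".enq")[0] for s in stmts if ".enq" in s}

def split_is_safe(first_half, second_half):
    # Conservative rule: at most one half may perform enqueues.
    return not (enqueue_targets(first_half) and enqueue_targets(second_half))

# The running example: q1.enq(1) | q2.enq(2) -- both halves enqueue.
print(split_is_safe(["q1.enq(1)"], ["q2.enq(2)"]))     # False
# A split where only one half enqueues is trivially order-preserving.
print(split_is_safe(["x = q1.deq()"], ["q2.enq(x)"]))  # True
```

A real analysis, like the one in the paper, has to be far less blunt — the (3/3) slide showed a both-halves-enqueue split that is nonetheless safe — but the toy illustrates the shape of the question the compiler answers.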


Task Rearrangement in Action

[Diagram, before: user tasks — Rx, Classify, IP, IP Options, ARP, ICMP Err, Tx]

[Diagram, after: hardware tasks — Rx; Classify + IP(1/3); two instances of IP(2/3); IP Options + ARP + ICMP Err; Tx]



The PacLang Language

  • High level language, abstracting all low level details

  • Not IXP specific – can be targeted to any architecture

    • Our toolset can also generate Click modules

  • C-like, imperative language

  • Static threads, connected by queues

  • Advanced type system

    • Linearly typed packets – allow better packet implementation

    • Packet views – make it easier to work with multiple protocols



Performance

  • One of the main aims of PacLang

    • No feature is added to the language if it can’t be implemented efficiently

  • PacLang programs run fast

  • We have implemented a high performance IP forwarder

    • It achieves 3Gb/s on a RadiSys ENP2611, IXP2400 card

      • Worst case, using min-size packets

    • Using a standard longest-prefix-match algorithm

    • Using only 5 of the 8 available micro-engines (including drivers)

    • Competitive with other IP forwarders on the same platform



Availability

  • A preview release of the PacLang compiler is available

    • Download it from Intel Research Cambridge, or from SourceForge

    • Full source-code is available

  • A research prototype, not a commercial quality product

    • Runs simple demo programs

    • But lacks many features that would be needed in a full product

    • Not all AMS features are currently working



A Tangent: LockBend

  • Abstracted Lock Optimisation for C Programs

    • Take an existing C program

    • Add some pragmas telling the compiler how to transform the program to use a different locking strategy

      • Fine grained, ordered, optimistic, two phase, etc

    • Compiler verifies that program semantics is preserved
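The idea of swapping locking strategies without changing semantics can be sketched in miniature. LockBend targets C with pragmas; the Python below is only an analogy, showing a coarse-grained "legacy" strategy and the fine-grained strategy a pragma might request, both protecting the same state:

```python
import threading

counts = {"a": 0, "b": 0}

# Coarse-grained: one lock guards everything (the "legacy" strategy).
big_lock = threading.Lock()
def bump_coarse(key):
    with big_lock:
        counts[key] += 1

# Fine-grained: one lock per counter (what a pragma might request);
# bumps of different keys no longer contend with each other.
locks = {k: threading.Lock() for k in counts}
def bump_fine(key):
    with locks[key]:
        counts[key] += 1

# Either strategy yields the same final state for the same operations.
for bump in (bump_coarse, bump_fine):
    bump("a"); bump("b")
print(counts)  # {'a': 2, 'b': 2}
```

The compiler's job, as the slide says, is to verify mechanically that the requested strategy preserves the program's semantics — the part this sketch takes on faith.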

[Flow: Legacy C Program + LockBend Pragmas → Compiler → Program with Optimised Locking Strategy]

