PAPR - Network Processor Architecture and Programming Daniela GENIUS LIP6 ASIM
PAPR Team PAPR started as a collaboration between the ASIM and RP departments of LIP6 • Etienne Faure • Daniela Genius • Alain Greiner • Eric Horlait • Kave Salamatian
Contents • Introduction • The PAPR Generic Platform • Hardware/Software Codesign • Application Specification Language • MWMR Communication Channels • Memory Management • Hardware Coprocessors • Performance Evaluation • Research Perspectives
Introduction [Figure: growth of link bandwidth over time, from DS0 (64 Kb/s) around 1980 to OC768 (40 Gb/s) around 2005. DS = Digital Signal, OC = Optical Carrier.]
Introduction • In recent years, diverse architectures have been proposed for Network Processors: • Intel IXP • IBM PowerNP • Motorola • AMCC • ... • Bottlenecks are memory access and on-chip communications. • Our aim is to propose a design method for embedded telecom applications, using a generic, multiprocessor hardware platform. • Related work: STepNP from STMicroelectronics
Introduction • Data rates keep increasing • Protocols and applications keep evolving • System design and test is slow and expensive • Special-purpose hardware can hardly be reused • Flexibility and adaptation to new standards (UMTS, IPv6, MPLS, ...) is a must => The key is programmability. We will use general-purpose processors and a "classical" shared-memory multiprocessor architecture.
LIP6 Contributions • Generic network processor architecture: LIP6 proposes a generic and flexible hardware platform which can adapt to different (networking) applications: PAPR • Application specification language/environment: LIP6 proposes an environment which allows the system designer to describe and validate multi-threaded applications and which facilitates the mapping onto the generic platform. • Design space exploration: LIP6 proposes a general method to ease the migration from software (running on programmable processors) to hardware (dedicated coprocessors), in order to optimize performance or minimize power consumption.
The PAPR Generic Platform • Based on the SoCLib component library • Shared address space • VCI compliant • MIPS R3000 microprocessors • External RAM controller • DSPIN network-on-chip • Dedicated hardware coprocessors for I/O • Optional (synthesized) hardware coprocessors
Hardware/Software Codesign A multi-threaded application is mapped onto a multi-processor architecture. The system designer must have the following possibilities: • choose the hardware/software implementation for each task • map the software tasks onto the programmable processors (and the hardware tasks onto synthesized or existing coprocessors) • map the communication channels onto the physical memory banks
Application Specification Language The software parallel application is described as a task graph with two types of nodes: tasks & communication channels. Tasks communicate through Multi-Writer / Multi-Reader FIFOs. Tasks can be hardware or software.
MWMR Communication Channels • Each MWMR channel is implemented as a software FIFO, and is characterized by 2 parameters: a width & a depth. • Each MWMR channel is protected by a lock, in order to guarantee exclusive access. • Read & Write communication primitives are non-blocking: - int mwmr_read(channel_id, *buffer, nb_bytes) - int mwmr_write(channel_id, *buffer, nb_bytes) • As any task can be implemented in hardware or software, MWMR channels can be accessed by both hardware and software controllers. • The software versions of the communication primitives are built upon the POSIX API: the software application can be executed on any UNIX workstation, before being mapped onto the SoC.
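The following is a minimal sketch of how a software task might use these primitives. Only mwmr_read()/mwmr_write() come from the slide; the concrete parameter types, the return value (assumed to be the number of bytes actually transferred), the task name and the 128-bit descriptor payload are assumptions made for illustration.

```c
#include <stdint.h>

/* Assumed concrete signatures for the primitives quoted above
 * (channel_id type and return value are not specified on the slide). */
extern int mwmr_read (int channel_id, void *buffer, int nb_bytes);
extern int mwmr_write(int channel_id, void *buffer, int nb_bytes);

#define DESC_BYTES 16   /* one 128-bit packet descriptor */

/* Hypothetical software task: reads a descriptor from one channel,
 * processes the corresponding packet, then forwards the descriptor.
 * Because the primitives are non-blocking, each transfer is retried
 * until the full descriptor has been moved. */
void filter_task(int in_channel, int out_channel)
{
    uint8_t desc[DESC_BYTES];

    for (;;) {
        int done = 0;
        while (done < DESC_BYTES)
            done += mwmr_read(in_channel, desc + done, DESC_BYTES - done);

        /* ... examine / modify the packet referenced by the descriptor ... */

        done = 0;
        while (done < DESC_BYTES)
            done += mwmr_write(out_channel, desc + done, DESC_BYTES - done);
    }
}
```

Since the software versions of the primitives sit on the POSIX API, the same task body can first be run and debugged on a UNIX workstation and then mapped onto the SoC unchanged.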
Application Specification Language [Figure: task graph of the IPv4 routing application, routing tasks connected by MWMR FIFOs]
Application Specification Language [Figure: task graph of the classification application, tasks connected by MWMR FIFOs]
Memory Management • The network processor must have the largest possible storage capacity (several thousand packets). • In networking applications, the relevant information is usually located within the first few bytes of a packet. • On-chip memory is limited: external RAM is mandatory, with a careful allocation/free policy. • Only the packet descriptors are stored in the on-chip RAM.
Memory Management The "Slot" data structure (a NULL-terminated chain of slots): • Descriptor (128 bits): stored in the MWMR channels • First slot (128 bytes): on-chip RAM • Following slots (128 bytes): external RAM
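The slide only gives the sizes and the NULL-terminated chaining; the C layout below is a hypothetical illustration of how the slot chain and the 128-bit descriptor might look on the 32-bit MIPS R3000 target. All field names are assumptions.

```c
#include <stdint.h>

#define SLOT_BYTES 128

/* Hypothetical layout of one 128-byte slot: a link to the next slot
 * (NULL for the last one, as in the diagram) followed by payload bytes. */
typedef struct slot {
    struct slot *next;                                  /* next slot, or NULL */
    uint8_t      data[SLOT_BYTES - sizeof(struct slot *)];
} slot_t;

/* Hypothetical 128-bit packet descriptor, passed through the MWMR
 * channels while the slots themselves stay in on-chip/external RAM.
 * With 32-bit pointers the four fields add up to exactly 128 bits. */
typedef struct {
    slot_t  *first_slot;   /* first 128-byte slot, kept in on-chip RAM */
    uint32_t length;       /* total packet length in bytes             */
    uint32_t port;         /* e.g. input port / routing information    */
    uint32_t reserved;     /* padding up to 128 bits                   */
} packet_desc_t;
```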
Memory Management • Incoming packets: the slots are allocated by the Input Engine coprocessor. • Outgoing packets: after treatment, once read by the Output Engine, the slots are de-allocated.
Hardware Coprocessors Both the Input Engine and Output Engine coprocessors are configured by software, and use a MWMR hardware controller. • Input Engine • Its aim is to copy the packets coming from the Gigabit Ethernet link into system memory. • It implements the management of the slot structure. • Output Engine • Its aim is to reconstitute the packets from their slots in order to copy them to the outgoing Ethernet link. • The Output Engine works symmetrically to the Input Engine.
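To make the slot management concrete, here is a software-level sketch of the Input Engine's per-packet behaviour, reusing the hypothetical slot_t and packet_desc_t types from the memory-management sketch above. The helpers alloc_onchip_slot(), alloc_external_slot() and eth_rx() are assumptions; only the overall flow (allocate slots, fill them, emit a descriptor on an MWMR channel) comes from the slides, and in the real platform this logic is implemented in hardware behind the MWMR controller.

```c
#include <stdint.h>
#include <string.h>

extern slot_t *alloc_onchip_slot(void);              /* first slot: on-chip RAM   */
extern slot_t *alloc_external_slot(void);            /* later slots: external RAM */
extern int     eth_rx(uint8_t *buf, int max_bytes);  /* returns bytes received    */
extern int     mwmr_write(int channel_id, void *buffer, int nb_bytes);

void input_engine(int out_channel)
{
    uint8_t packet[2048];

    for (;;) {
        int length = eth_rx(packet, sizeof(packet));

        /* First slot goes to on-chip RAM, the following ones to external RAM. */
        slot_t *first = alloc_onchip_slot();
        slot_t *cur   = first;
        int     done  = 0;
        for (;;) {
            int chunk = length - done;
            if (chunk > (int)sizeof(cur->data))
                chunk = (int)sizeof(cur->data);
            memcpy(cur->data, packet + done, chunk);
            done += chunk;
            if (done >= length) { cur->next = NULL; break; }
            cur->next = alloc_external_slot();
            cur       = cur->next;
        }

        /* Emit a 128-bit descriptor on the outgoing MWMR channel,
         * retrying because the primitive is non-blocking. */
        packet_desc_t desc = { first, (uint32_t)length, 0, 0 };
        int sent = 0;
        while (sent < (int)sizeof(desc))
            sent += mwmr_write(out_channel, (uint8_t *)&desc + sent,
                               (int)sizeof(desc) - sent);
    }
}
```

The Output Engine would do the reverse: read a descriptor from its MWMR channel, walk the slot chain to rebuild the packet on the outgoing Ethernet link, and free the slots.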
The Hardware MWMR Controller [Figure: a hardware coprocessor (IP core) connected through a VCI_MWMR_INITIATOR wrapper to the VCI interconnect; the MWMR software FIFO resides in a VCI_RAM] The VCI_MWMR_initiator wrapper is a generic hardware MWMR controller that has a DMA capability. It implements a variable number of read or write MWMR communication channels.
Performance Evaluation • We use SoCLib models for the hardware part of the platform. • The abstraction level is CABA (cycle accurate, bit accurate). • In a class project, we developed a suite of small benchmarks (IPv4, classification, NAT, firewall, ...). • We analyze and compare the output files generated by the simulation (chronograms and Ethernet flows). • The throughput of incoming packets can be varied for performance / maximum-load measurements.
Ongoing Research • Multi-Cluster Communication Architecture: We want to optimize performance by exploiting the locality of applications. Clusterization further complicates the mapping of tasks onto processors and of communication channels onto memory banks. • Hardware-managed Synchronization: By using hardware queues, we try to minimize the cost of taking locks for exclusive FIFO access. • Macro-pipeline for Packet Treatment: The two applications shown exhibit packet-level parallelism (one packet per task). We try to parallelize further by decomposing the treatment into several threads. • Packet Reordering: MWMR communication channels allow out-of-order arrival of packets at the Output Engine. We want to analyse several packet reordering strategies. • KPN Communication Mode on the MWMR Channels: Other classes of applications, whose task graphs are not necessarily of task-farm type, are currently being studied: MJPEG and JPEG2000.