A Scalable, Cache-Based Queue Management Subsystem for Network Processors

Presentation Transcript


  1. A Scalable, Cache-Based Queue Management Subsystem for Network Processors Sailesh Kumar, Patrick Crowley Dept. of Computer Science and Engineering

  2. Packet processing systems • ASIC-based • High performance • Low configurability • May be expensive when volumes are low • Network processor (NP) based • Very high degree of configurability • High volumes can result in low cost • Challenging to match ASIC performance • In this paper we concentrate on the queuing bottlenecks of NP-based packet processors

  3. Basic NP architecture • Chip-Multiprocessor (CMP) architecture • A pool of relatively simple processors • Dedicated hardware units for common cases • High-speed interconnect between processors • Integrated network interfaces and memory controllers • A group of processors collectively performs the packet processing tasks (queuing, scheduling, etc.) • Best-case performance is N times that of a single processor when all N processors operate in parallel • Example: Intel's IXP architecture

  4. Intel's IXP2850 Introduction • CMP architecture • 16 RISC-type processors called Microengines (MEs) • 8 hardware thread contexts on each ME • SPI4.2 and CSIX interface cores • 3 Rambus DRAM and 4 QDR SRAM controllers • Various hardware units, such as hash units, CAMs, etc. • Typically, MEs are arranged in a pipeline and groups of MEs collectively perform the packet processing task

  5. Why does a PP need queues? • Routers and switch fabrics are packet processing systems that • Receive packets at input ports • Classify packets and identify the next hop for each packet • Transmit packets on the appropriate output port • The ingress rate can exceed the output link capacity due to • Traffic from many input ports destined to one output port • The bursty nature of Internet traffic • Statistical oversubscription • Implications: • Unbounded delay for all flows • Packet loss across every active flow • A single misbehaving flow can affect all other flows

  6. Solution • Keep queues for every flow or group of flows • Put arriving packets into the appropriate queue • Treat each queue such that resources are allocated fairly to all flows • Send packets from the queues such that each flow receives a fair share of the aggregate link bandwidth • In fact, queues are the fundamental data structure in any packet processing system; they ensure • Fair allocation of resources (bandwidth, buffer space, etc.) • Isolation of misbehaving and high-priority flows • Guaranteed traffic treatment: delay, bandwidth, QoS • Conclusion: any packet processing system must handle a large number of queues at very high speed

  7. A simple queuing model • DRAM space is divided into fixed-size regions, each able to hold one arriving packet • Each such region is called a buffer • SRAM keeps the queue descriptors (QDs) and the next pointers • A QD holds the head address, tail address, and length of a queue • A next pointer holds the address of the next buffer in a queue • We need two categories of queues • One queue holding all the free buffers available (the free buffer queue) • A set of queues holding the buffers that contain packets of the various flows (the virtual queues) • This separation enables isolation of flows
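To make this model concrete, here is a minimal C sketch of the data structures described above. The field and type names are illustrative assumptions, not taken from the paper: QDs and next pointers live in SRAM, while the numbered buffers themselves live in DRAM.

    /* Sketch of the queuing data structures described above.
     * All names are illustrative; the paper does not prescribe a layout. */

    typedef unsigned int buf_idx_t;       /* index of a DRAM packet buffer   */

    struct queue_descriptor {             /* kept in SRAM (or in a QD cache) */
        buf_idx_t head;                   /* first buffer in the queue       */
        buf_idx_t tail;                   /* last buffer in the queue        */
        unsigned int length;              /* number of buffers in the queue  */
    };

    /* next[i] is the buffer that follows buffer i in whatever queue it is
     * currently linked into; kept in SRAM alongside the QDs. */
    extern buf_idx_t next[];

    /* One descriptor for the free-buffer queue, plus one per virtual queue. */
    extern struct queue_descriptor free_queue;
    extern struct queue_descriptor virtual_queue[];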

  8. A simple queuing model (cont)

  9. Queuing Operation • For each arriving packet • A buffer is dequeued from the free buffer queue • The packet is written into it • The buffer is enqueued into the appropriate queue • The queue descriptors are updated • Thus an enqueue into any queue involves • An update of the free queue descriptor (a read followed by a write) • An update of the virtual queue descriptor (a read followed by a write) • The free queue descriptor is kept on-chip, so its updates are fast • The virtual queue descriptors, however, are off-chip, and hence their updates are slow
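As a rough illustration of this sequence, an enqueue path could look like the sketch below. It builds on the hypothetical structures from the earlier snippet; dram_write() is an assumed helper standing in for the packet copy into DRAM, and error handling (e.g., an empty free queue) is omitted.

    /* Enqueue one arriving packet into virtual queue q (illustrative sketch).
     * Each queue-descriptor access is an off-chip SRAM read or write unless
     * the descriptor happens to be held in an on-chip cache. */
    extern void dram_write(buf_idx_t b, const void *pkt, unsigned int len);

    void enqueue_packet(const void *pkt, unsigned int len, unsigned int q)
    {
        /* 1. Dequeue a buffer from the free-buffer queue (QD read + write). */
        buf_idx_t b = free_queue.head;
        free_queue.head = next[b];
        free_queue.length--;

        /* 2. Write the packet into the DRAM buffer. */
        dram_write(b, pkt, len);

        /* 3. Link the buffer onto the tail of virtual queue q
         *    (QD read + write, plus a next-pointer write). */
        if (virtual_queue[q].length == 0)
            virtual_queue[q].head = b;
        else
            next[virtual_queue[q].tail] = b;
        virtual_queue[q].tail = b;
        virtual_queue[q].length++;
    }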

  10. Queuing Operation in an NP • To achieve high throughput, a group of processors and their associated threads collectively perform the queuing • Each thread handles one packet at a time and enqueues/dequeues it into/from the appropriate queue • When the arriving/departing packets all belong to different queues, such a scheme speeds up the operation roughly linearly with the number of threads • However, when packets belong to the same queue, the entire operation is serialized and the threads start competing for the same queue descriptor • Multiple processors/threads then provide no benefit

  11. Operation • If all threads access different queues: the timing diagram on this slide shows Thread 0, Thread 1, Thread 2, ..., Thread x each independently reading a queue descriptor (QD A, C, E, G), updating it, writing it back, and moving on to the next QD (B, D, F, H), all in parallel. • What if all threads access the same queue: the diagram shows Thread 0 reading, updating, and writing QD A while Thread 1, Thread 2, ..., Thread x simply wait for the preceding thread to finish, so the whole operation is serialized.

  12. Solution • Accelerate the serialized operations • Use a mechanism that makes the serialized operations run relatively faster • This can be done by adding a small on-chip cache to hold the queue descriptors currently being accessed • All threads but the first are then able to update the queue descriptor much faster • When threads access different queue descriptors, the operation proceeds as before • When threads access the same queue descriptor, the operation still gets serialized, but each individual operation is very fast

  13. Queuing cache • Thus, the queuing cache sits between the memory hierarchy and the MEs • Whenever queue descriptors are accessed, they are brought into the cache • Questions • What size should the cache be? • What eviction policy should it use? • Intuitively, the cache needs only as many entries as the maximum number of threads collectively performing the queuing operation • Because only that many QDs can be in use at any time • The eviction policy can be Least Recently Used (LRU)
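A minimal software sketch of such a cache is shown below, assuming a fully associative array with one entry per hardware thread and LRU replacement. The names (NUM_THREADS, sram_read_qd, sram_write_qd) and the struct queue_descriptor from the earlier snippet are assumptions for illustration; a real queuing cache would implement this in hardware, and the arbitration that serializes threads targeting the same queue is omitted here.

    /* Illustrative QD cache: one entry per queuing thread, LRU eviction. */
    #define NUM_THREADS 16                /* assumed number of queuing threads */

    extern void sram_read_qd(unsigned int queue_id, struct queue_descriptor *qd);
    extern void sram_write_qd(unsigned int queue_id, const struct queue_descriptor *qd);

    struct qd_cache_entry {
        unsigned int queue_id;            /* which queue this QD belongs to */
        struct queue_descriptor qd;       /* cached copy of the descriptor  */
        unsigned int last_used;           /* logical timestamp for LRU      */
        int valid;
    };

    static struct qd_cache_entry cache[NUM_THREADS];
    static unsigned int now;              /* logical clock for LRU          */

    /* Return a cached QD for queue_id, fetching it (and evicting the LRU
     * entry) on a miss. */
    struct queue_descriptor *qd_cache_lookup(unsigned int queue_id)
    {
        int victim = 0;
        for (int i = 0; i < NUM_THREADS; i++) {
            if (cache[i].valid && cache[i].queue_id == queue_id) {
                cache[i].last_used = ++now;           /* hit: refresh LRU */
                return &cache[i].qd;
            }
            if (!cache[i].valid || cache[i].last_used < cache[victim].last_used)
                victim = i;               /* remember the best eviction candidate */
        }
        /* Miss: write back the evicted descriptor, then fetch the new one. */
        if (cache[victim].valid)
            sram_write_qd(cache[victim].queue_id, &cache[victim].qd);
        sram_read_qd(queue_id, &cache[victim].qd);
        cache[victim].queue_id = queue_id;
        cache[victim].last_used = ++now;
        cache[victim].valid = 1;
        return &cache[victim].qd;
    }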

  14. Operation with queuing cache • If all threads access different queues: the timing diagram shows the same parallel behavior as before; each thread reads its QD into the cache, updates it, and the LRU entry is written back, so nothing is lost in this case. • If all threads access the same queue: the diagram shows Thread 1, Thread 2, ..., Thread x waiting only for the initial fetch of QD A into the cache; after that each thread performs its update back-to-back in the on-chip cache, so the serialized updates are fast.

  15. Performance comparison • For a 200 MHz DDR SRAM with an SRAM access latency of 80 ns and a queuing cache access latency of 20 ns • And assuming that the processor takes 10 ns to execute all of the queuing-related instructions associated with a single packet
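The slide's comparison chart is not reproduced in this transcript, but a rough back-of-the-envelope model using these latencies gives a feel for the gap. This is an illustration only, not the paper's measured result; it assumes a fully serialized worst case in which each packet pays one descriptor read and one descriptor write at the stated latency.

    /* Crude serialized-case model: per-packet cost = QD read + compute + QD write. */
    #include <stdio.h>

    int main(void)
    {
        const double sram_ns  = 80.0;   /* off-chip SRAM access latency     */
        const double cache_ns = 20.0;   /* queuing cache access latency     */
        const double cpu_ns   = 10.0;   /* per-packet queuing instructions  */

        double t_sram  = sram_ns  + cpu_ns + sram_ns;   /* 170 ns per packet */
        double t_cache = cache_ns + cpu_ns + cache_ns;  /*  50 ns per packet */

        printf("serialized, QDs in SRAM  : %.0f ns/packet (~%.1f Mpkt/s)\n",
               t_sram, 1000.0 / t_sram);
        printf("serialized, QDs in cache : %.0f ns/packet (~%.1f Mpkt/s)\n",
               t_cache, 1000.0 / t_cache);
        return 0;
    }

Under these assumptions the cached case sustains roughly 20 million packets per second versus about 5.9 million for the uncached case, a factor of about 3.4.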

  16. Our design approach • Since queuing is so common in NPs, it may be a very good idea to add hardware-level support for the enqueue and dequeue operations • The queuing cache is the best place to put this functionality, because then queuing remains very fast in situations where it gets serialized • Thus each NP would support standard instructions, like enqueue, dequeue, etc. • These instructions are sent to the queuing cache • The queuing cache internally manages the pointers and also handles any contention when threads access the same queue • The threads themselves are also relieved of the burden of synchronization and pointer management and can operate independently
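From the software side, the intent is that a thread's queuing code reduces to issuing a command and letting the queuing cache do the pointer work. The sketch below is a hypothetical illustration of that programming model; the function names and the packet structure are assumptions, not an actual NP instruction interface.

    /* Hypothetical per-thread loop once enqueue is offloaded to the queuing
     * cache: no locks and no pointer manipulation in the thread itself. */
    struct packet { unsigned int buffer_index; /* ... other metadata ... */ };

    extern struct packet *receive_packet(void);
    extern unsigned int classify(const struct packet *p);
    extern void queuing_cache_enqueue(unsigned int queue, unsigned int buffer);

    void queuing_thread(void)
    {
        for (;;) {
            struct packet *p = receive_packet();   /* assumed helper         */
            unsigned int q = classify(p);          /* pick the virtual queue */

            /* One command to the queuing cache: it fetches and caches the QD,
             * serializes concurrent access to the same queue, and updates the
             * head/tail/next pointers internally. */
            queuing_cache_enqueue(q, p->buffer_index);
        }
    }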

  17. Implementation

  18. Intel's Approach • Intel's second-generation IXP network processors support queuing via • The SRAM controller, which holds the queue descriptors and implements the queuing operations, and • The MEs, which support enqueue and dequeue instructions • Caching of queue descriptors is implemented using • A Q-array in the memory controller • Any queuing operation is preceded by a transfer of the queue descriptor from SRAM to the Q-array • A CAM kept in each ME • To keep track of which QDs are cached and their positions in the Q-array • The CAM supports LRU, which is used to evict entries from the Q-array
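The ME-side control flow described on this slide is roughly the following. This is a schematic C-like paraphrase, not actual IXP microcode; every function name here is invented for illustration.

    /* Schematic paraphrase of the IXP-style flow above (names are invented). */
    extern int  cam_lookup(unsigned int queue_id);   /* slot index, or -1 on miss */
    extern int  cam_lru_slot(void);                  /* LRU slot to reuse         */
    extern void cam_update(int slot, unsigned int queue_id);
    extern void qarray_writeback(int slot);          /* evicted QD back to SRAM   */
    extern void qarray_load(int slot, unsigned int queue_id);
    extern void hw_enqueue(int slot, unsigned int buffer_handle);

    void ixp_style_enqueue(unsigned int queue_id, unsigned int buffer_handle)
    {
        int slot = cam_lookup(queue_id);     /* is this QD already in the Q-array? */
        if (slot < 0) {                      /* miss: bring the QD on chip         */
            slot = cam_lru_slot();
            qarray_writeback(slot);          /* write the evicted QD back to SRAM  */
            qarray_load(slot, queue_id);     /* load the requested QD from SRAM    */
            cam_update(slot, queue_id);
        }
        hw_enqueue(slot, buffer_handle);     /* enqueue handled by SRAM controller */
    }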

  19. Comparison • Reduced instruction count on each processor • If we move all the logic associated with enqueues and dequeues into the queuing cache, the software becomes simpler • Simple and modular software code for queuing tasks • No need for synchronization, etc. • A queuing cache built near the memory controller results in significantly reduced on-chip communication • Since the queuing cache handles the pointer processing as well, the processors need not fetch the queue descriptors at all • The only communication between the processors and the queuing cache is the instruction exchange • More scalable • Any number of MEs can participate in queuing • No local CAM per ME is needed, unlike in Intel's IXP approach

  20. Conclusion • Contributions • A brief qualitative and quantitative analysis of the queuing cache • A proposal for an efficient and scalable design • Future work • Comparison to other caching techniques • An implementation to measure the real complexity • We believe that such a cache-based, centralized queuing hardware unit will make future network processors more • Scalable and • Easy to program • Questions?
