
Network Processors: A generation of multi-core processors


Presentation Transcript


  1. INF5063: Programming Heterogeneous Multi-Core Processors. Network Processors: A generation of multi-core processors. October 23, 2014

  2. Agere Payload Plus APP550 [block diagram: classifier with classifier buffer and memory, scheduler with scheduler buffer and memory, stream editor memory, statistics memory, PCI bus, ingress/egress interfaces and a co-processor interface]

  3. Agere Payload Plus APP550 [same block diagram, annotated with the functional units]
  • Stream Editor (SED)
   - two parallel engines
   - modifies outgoing packets (e.g., checksum, TTL, …)
   - configurable, but not programmable
  • Packet (protocol data unit) assembler
   - collects all blocks of a frame
   - not programmable
  • Pattern Processing Engine
   - patterns specified by the programmer
   - programmable using a special high-level language
   - only pattern-matching instructions
   - parallelism by hardware, using multiple copies and several sets of variables
   - access to different memories
  • Reorder Buffer Manager
   - transfers data between classifier and traffic manager
   - ensures packet order, which can otherwise be lost due to parallelism and variable processing time in the pattern processing
  • State Engine
   - gathers information (statistics) for scheduling
   - verifies that flows stay within bounds
   - provides an interface to the host
   - configures and controls the other functional units

  4. PowerNP [block diagram: embedded PowerPC core, embedded processors with instruction memory, hardware classifier, dispatch unit, ingress/egress data stores and queues, internal and external memory, control store, 4 interfaces in/out to the network and 2 interfaces in/out to the host]

  5. PowerNP [block diagram as on the previous slide]
  • Coprocessors
   - 8 embedded processors
   - 4 kbytes local memory each
   - 2 cores/processor
   - 2 threads/core
  • Embedded PowerPC general-purpose core
   - no OS on the NP
  • Link layer
   - framing handled outside the processor
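Taken together, the numbers above give 8 processors × 2 cores × 2 threads = 32 hardware threads for packet processing, in addition to the embedded PowerPC control core.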

  6. IXP1200 Architecture
  • RISC processor
   - StrongARM running Linux
   - control, higher-layer protocols and exceptions
   - 232 MHz
  • Access units
   - coordinate access to external units
  • Scratchpad
   - on-chip memory
   - used for IPC and synchronization (see the sketch below)
  • Microengines
   - low-level devices with a limited set of instructions
   - transfers between memory devices
   - packet processing
   - 232 MHz
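The scratchpad is the small on-chip memory the StrongARM and the microengines use for IPC and synchronization. As a rough illustration of the idea (this is not the Intel SDK API; the structure, names and sizes are invented), a single-producer/single-consumer ring placed in such a shared region could look like this:

```c
#include <stdint.h>

/* Illustrative single-producer/single-consumer ring in a shared,
 * scratchpad-like memory region. It only sketches how a control CPU and
 * a packet engine could exchange small work descriptors on-chip. */
#define RING_SLOTS 64            /* must be a power of two */

struct ring {
    volatile uint32_t head;      /* written by the producer */
    volatile uint32_t tail;      /* written by the consumer */
    uint32_t slot[RING_SLOTS];   /* e.g., packet-buffer handles */
};

/* Producer side (e.g., StrongARM enqueuing a packet handle). */
static int ring_put(struct ring *r, uint32_t handle)
{
    uint32_t head = r->head;
    if (head - r->tail == RING_SLOTS)
        return -1;               /* ring full */
    r->slot[head % RING_SLOTS] = handle;
    r->head = head + 1;          /* publish after the slot is written */
    return 0;
}

/* Consumer side (e.g., a microengine thread dequeuing work). */
static int ring_get(struct ring *r, uint32_t *handle)
{
    uint32_t tail = r->tail;
    if (tail == r->head)
        return -1;               /* ring empty */
    *handle = r->slot[tail % RING_SLOTS];
    r->tail = tail + 1;
    return 0;
}
```

A real implementation would also need the hardware's memory barriers and atomic operations; the sketch only shows why a small, low-latency shared memory is convenient for passing work between the control processor and the packet engines.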

  7. IXP2400 Architecture [block diagram: embedded RISC CPU (XScale), microengines on multiple independent internal buses, SRAM/SDRAM/scratch memories with their access units and buses, FLASH via slowport access, PCI access, MSF access, receive and transmit buses]
  • RISC processor
   - StrongARM → XScale
   - 233 MHz → 600 MHz
  • Microengines
   - 6 → 8
   - 233 MHz → 600 MHz
  • Coprocessors
   - hash unit
   - 4 timers
   - general-purpose I/O pins
   - external JTAG connections (in-circuit tests)
   - several bulk ciphers (IXP2850 only)
   - checksum (IXP2850 only)
   - …
  • Slowport
   - shared interface to external units
   - used for FlashROM during bootstrap
  • Media Switch Fabric
   - forms the fast path for transfers
   - interconnect for several IXP2xxx processors
  • Receive/transmit buses
   - shared bus → separate buses

  8. INF5063: Programming Heterogeneous Multi-Core Processors. Example: SpliceTCP

  9. TCP Splicing [animation: SYN / SYN-ACK handshakes between the client and the splicer, and between the splicer and the server across the Internet]

  10. TCP Splicing [animation: ACKs flowing on both spliced connections]

  11. TCP Splicing [animation: HTTP GET forwarded from the client connection to the server connection, DATA returned the other way]

  12. TCP Splicing [animation continued: the spliced data path between the client and the server]

  13. TCP Splicing [diagram: an application-level relay (accept, connect, then while(1) { read; write }) carries every byte up through the data link, network, transport and application layers and back down; a user-space sketch follows below]
  • Linux Netfilter
   - establish upstream connection
   - receive entire packet
   - rewrite headers
   - forward packet
  • IXP 2400
   - establish upstream connection
   - parse packet headers
   - rewrite headers
   - forward packet
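To make the contrast concrete, a minimal user-space relay doing the accept / connect / read / write loop from the figure could look like the following sketch. The port numbers and server address are placeholders, and error handling is stripped to the bare minimum.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Naive user-space TCP relay: every byte is copied through the full host
 * stack twice (client->proxy, then proxy->server), which is exactly the
 * overhead TCP splicing avoids. Addresses and ports are illustrative. */
int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in laddr = { .sin_family = AF_INET,
                                 .sin_port = htons(8080),
                                 .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(lfd, (struct sockaddr *)&laddr, sizeof(laddr));
    listen(lfd, 16);

    int cfd = accept(lfd, NULL, NULL);          /* downstream connection */

    int sfd = socket(AF_INET, SOCK_STREAM, 0);  /* upstream connection  */
    struct sockaddr_in saddr = { .sin_family = AF_INET,
                                 .sin_port = htons(80) };
    inet_pton(AF_INET, "192.0.2.10", &saddr.sin_addr);  /* example server */
    connect(sfd, (struct sockaddr *)&saddr, sizeof(saddr));

    char buf[4096];
    ssize_t n;

    /* Forward one request, then relay the response. A real relay would
     * multiplex both directions with poll()/select() instead. */
    n = read(cfd, buf, sizeof(buf));
    if (n > 0)
        write(sfd, buf, n);
    while ((n = read(sfd, buf, sizeof(buf))) > 0)
        write(cfd, buf, n);

    close(sfd);
    close(cfd);
    close(lfd);
    return 0;
}
```

Every byte crosses the host's full protocol stack twice here, which is the cost that both the Netfilter and the IXP 2400 variants avoid by rewriting headers below the application layer.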

  14. Graph from the presentation of the paper "SpliceNP: A TCP Splicer using a Network Processor", ANCS 2005, Princeton, NJ, Oct 27-28, 2005, by Li Zhao, Yan Luo, Laxmi Bhuyan (Univ. of California, Riverside) and Ravi Iyer (Intel). [graph: throughput vs. request file size] Major performance gain at all request sizes.

  15. INF5063: Programming Heterogeneous Multi-Core Processors. Example: Transparent protocol translation and load balancing in a media streaming scenario. Slides from an ACM MM 2007 presentation by Espeland, Lunde, Stensland, Griwodz and Halvorsen.

  16. IXP 2400 load balancer [diagram: RTSP/RTP video servers behind an IXP 2400 running an RTSP/RTP parser, a balancer and a monitor; mplayer clients connect over RTSP and RTP/UDP through the ingress and egress ports]
  • Appears as ONE machine to the outside world
  • Balancer (a table-lookup sketch follows below)
   - identify the connection
   - if it exists, send to the right server (select the port to use); else create a new session (select one server)
   - send the packet
  • Monitor
   - historic and current loads of the different servers
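A rough sketch of the balancer decision described above, assuming a simple hash table keyed on the connection 4-tuple. The data structures, the load[] array fed by the monitor and pick_least_loaded() are invented for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative flow table: if the 4-tuple is known, reuse the session's
 * server; otherwise pick the least-loaded server (as reported by the
 * monitor) and remember the choice. */
#define TABLE_SIZE  4096
#define NUM_SERVERS 4

struct session {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    int server;                       /* index of the chosen backend */
    struct session *next;
};

static struct session *table[TABLE_SIZE];
static unsigned load[NUM_SERVERS];    /* updated by the monitor */

static unsigned hash4(uint32_t saddr, uint32_t daddr,
                      uint16_t sport, uint16_t dport)
{
    return (saddr ^ daddr ^ ((uint32_t)sport << 16 | dport)) % TABLE_SIZE;
}

static int pick_least_loaded(void)
{
    int best = 0;
    for (int i = 1; i < NUM_SERVERS; i++)
        if (load[i] < load[best])
            best = i;
    return best;
}

/* Returns the backend server index for this packet's flow. */
int balance(uint32_t saddr, uint32_t daddr, uint16_t sport, uint16_t dport)
{
    unsigned h = hash4(saddr, daddr, sport, dport);
    for (struct session *s = table[h]; s; s = s->next)
        if (s->saddr == saddr && s->daddr == daddr &&
            s->sport == sport && s->dport == dport)
            return s->server;                  /* existing session */

    struct session *s = malloc(sizeof(*s));    /* new session */
    if (!s)
        return pick_least_loaded();            /* fall back without caching */
    s->saddr = saddr; s->daddr = daddr;
    s->sport = sport; s->dport = dport;
    s->server = pick_least_loaded();
    s->next = table[h];
    table[h] = s;
    load[s->server]++;
    return s->server;
}
```

On the IXP the lookup would typically use the hash unit and on-chip memory rather than malloc'ed host memory, but the control flow is the same: known flow, reuse; unknown flow, select and record.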

  17. IXP 2400 transport protocol translator [diagram as on the previous slide, extended with a transport protocol translator on the egress side; the mplayer clients now connect over HTTP while the RTSP/RTP video servers are unchanged]
  • HTTP streaming is frequently used today!

  18. IXP 2400 protocol translator [diagram: the transport protocol translator converts between RTSP/RTP (RTP/UDP) on the server side and HTTP on the client side, next to the RTSP/RTP parser, balancer and monitor; a stripped-down translation sketch follows below]
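In the server-to-client direction, the translation essentially removes the RTP/UDP framing and streams the payload over the client's HTTP/TCP connection. A stripped-down sketch, assuming RTP packets without CSRC entries, header extensions or padding; the function name and arguments are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Forward the payload of one RTP packet onto an already-established
 * HTTP/TCP client connection. Only the minimal 12-byte RTP header is
 * handled; CSRC lists, extensions and padding are assumed absent. */
int rtp_to_http(int http_fd, const uint8_t *rtp, size_t len)
{
    const size_t RTP_HDR = 12;
    if (len <= RTP_HDR)
        return -1;                       /* no payload */
    if ((rtp[0] >> 6) != 2)
        return -1;                       /* not RTP version 2 */

    const uint8_t *payload = rtp + RTP_HDR;
    size_t plen = len - RTP_HDR;

    /* The HTTP response headers are assumed to have been sent when the
     * client connected; afterwards the media bytes are simply streamed. */
    ssize_t n = write(http_fd, payload, plen);
    return n == (ssize_t)plen ? 0 : -1;
}
```

The opposite direction (client requests) is an HTTP-to-RTSP mapping handled by the parser and balancer; the sketch only covers the bulk data path.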

  19. Results
  • The prototype works and both load-balances and translates between HTTP/TCP and RTP/UDP
  • The protocol translation gives a much more stable bandwidth than using HTTP/TCP all the way from the server [graph: achieved bandwidth over time for HTTP vs. protocol translation]

  20. INF5063: Programming Heterogeneous Multi-Core Processors. Example: Booster Boxes. Slide content and structure mainly from the NetGames 2002 presentation by Bauer, Rooney and Scotton.

  21. Client-Server [diagram: clients in several local distribution networks connected across a backbone network to a central server]

  22. Peer-to-peer [diagram: peers in several local distribution networks exchanging traffic across the backbone network]

  23. Booster boxes
  • Middleboxes
   - attached directly to ISPs' access routers
   - less generic than, e.g., firewalls or NAT
   - assist distributed event-driven applications
   - improve scalability of client-server and peer-to-peer applications
  • Application-specific code ("boosters")
   - caching on behalf of a server
   - aggregation of events
   - intelligent filtering
   - application-level routing

  24. Booster boxes [diagram: booster boxes placed at the edge between each local distribution network and the backbone network]

  25. Booster boxes [diagram as on the previous slide]
  • Process data close to the source
  • Load redistribution by delegating server functions

  26. Booster boxes: application-specific code
  • Caching on behalf of a server
   - non-real-time information is cached
   - booster boxes answer on behalf of servers
  • Aggregation of events
   - information from two or more clients within a time window is aggregated into one packet (see the sketch below)
  • Intelligent filtering
   - outdated or redundant information is dropped
  • Application-level routing
   - packets are forwarded based on packet content, application state and destination address
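As a concrete (invented) illustration of event aggregation, the sketch below packs all events that arrive within the same time window into one outgoing packet. The window length, buffer size and the flush_packet() callback are placeholders.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Illustrative event aggregator: client events arriving within the same
 * time window are packed into a single outgoing packet. */
#define WINDOW_MS   50
#define MAX_PAYLOAD 1400

struct aggregator {
    uint8_t  buf[MAX_PAYLOAD];
    size_t   used;
    uint64_t window_start_ms;
};

/* Supplied by the surrounding system: sends one aggregated packet. */
void flush_packet(const uint8_t *data, size_t len);

static uint64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

void aggregate_event(struct aggregator *a, const uint8_t *ev, size_t len)
{
    if (len == 0 || len > MAX_PAYLOAD)
        return;                          /* ignore malformed events */

    uint64_t now = now_ms();

    /* Flush if the window expired or the event would not fit. */
    if (a->used > 0 &&
        (now - a->window_start_ms >= WINDOW_MS ||
         a->used + len > MAX_PAYLOAD)) {
        flush_packet(a->buf, a->used);
        a->used = 0;
    }
    if (a->used == 0)
        a->window_start_ms = now;

    memcpy(a->buf + a->used, ev, len);
    a->used += len;
}
```

The same pattern covers intelligent filtering: instead of appending every event, outdated or redundant ones would simply be dropped before the copy.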

  27. Architecture
  • Data layer
   - behaves like a layer-2 switch for the bulk of the traffic
   - copies or diverts selected traffic
   - IBM's booster boxes use the packet capture library ("pcap") filter specification to select traffic (see the example below)
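For reference, selecting traffic with a pcap filter specification looks roughly like this from user space; the interface name and the filter expression (here UDP port 7777) are placeholders.

```c
#include <pcap/pcap.h>
#include <stdio.h>

/* Minimal libpcap example: compile a filter expression and hand every
 * matching packet to a callback. Interface and filter are placeholders. */
static void on_packet(u_char *user, const struct pcap_pkthdr *hdr,
                      const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("selected packet, %u bytes\n", hdr->len);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *p = pcap_open_live("eth0", 65535, 1, 100, errbuf);
    if (!p) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }

    struct bpf_program prog;
    if (pcap_compile(p, &prog, "udp port 7777", 1, PCAP_NETMASK_UNKNOWN) < 0 ||
        pcap_setfilter(p, &prog) < 0) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(p));
        return 1;
    }

    pcap_loop(p, -1, on_packet, NULL);   /* runs until interrupted */
    pcap_close(p);
    return 0;
}
```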

  28. Data Aggregation Example: Floating Car Data
  • Cars transmit position, speed, driven distance, …
  • Main booster task
   - complex message aggregation (statistics gathering, compression, filtering, …)
   - statistical computations
   - context information
  • Very low real-time requirements
  • Uses: traffic monitoring/predictions, pay-as-you-drive insurance, car maintenance, car taxes, …

  29. Interactive TV Game Show
  • Main booster task: simple message aggregation
  • Limited real-time requirements
  • [diagram: 1. packet generation at the clients, 2. packet interception at the booster box, 3. packet aggregation, 4. packet forwarding to the server]

  30. Game with a large virtual space
  • Main booster task: dynamic server selection based on the current in-game location
  • Requires application-specific processing
  • High real-time requirements
  • [diagram: the virtual space is split into regions handled by server 1 and server 2]

  31. Summary
  • Scalability
   - by application-specific knowledge
   - by network awareness
  • Main mechanisms
   - caching on behalf of a server
   - aggregation of events
   - attenuation
   - intelligent filtering
   - application-level routing
  • Which mechanism to apply depends on
   - workload
   - real-time requirements

  32. INF5063: Programming Heterogeneous Multi-Core Processors. Multimedia Examples

  33. Multicast Video-Quality Adjustment

  34. Multicast Video-Quality Adjustment

  35. Multicast Video-Quality Adjustment [diagram: data path through the I/O hub, memory hub, CPU and memory]

  36. Multicast Video-Quality Adjustment
  • Several ways to do video-quality adjustments
   - frame dropping
   - re-quantization
   - scalable video codecs
   - …
  • Yamada et al. 2002: use a low-pass filter to eliminate high-frequency components of the MPEG-2 video signal and thus reduce the data rate
   - determine a low-pass parameter for each GOP
   - use the low-pass parameter to calculate how many DCT coefficients to remove from each macroblock in a picture
   - by eliminating the specified number of DCT coefficients, the video data rate is reduced (see the sketch below)
   - implemented the low-pass filter on an IXP1200
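As an illustration of the coefficient-elimination step (not Yamada et al.'s actual code), cutting high-frequency components of an 8x8 block amounts to zeroing the tail of its frequency-ordered DCT coefficients; the number of coefficients to keep would come from the per-GOP low-pass parameter.

```c
#include <stdint.h>

/* Illustrative low-pass step: given the 64 DCT coefficients of one
 * 8x8 block in low-to-high frequency order, keep only the first `keep`
 * coefficients and zero the rest. `keep` would be derived from the
 * per-GOP low-pass parameter. */
void lowpass_block(int16_t coeff[64], int keep)
{
    if (keep < 1)
        keep = 1;            /* always keep the DC coefficient */
    if (keep > 64)
        keep = 64;
    for (int i = keep; i < 64; i++)
        coeff[i] = 0;        /* discard high-frequency components */
}

/* A macroblock carries four luminance and two chrominance blocks,
 * so the filter is applied to each of its six blocks. */
void lowpass_macroblock(int16_t blocks[6][64], int keep)
{
    for (int b = 0; b < 6; b++)
        lowpass_block(blocks[b], keep);
}
```

In a real MPEG-2 stream the remaining coefficients also have to be re-encoded into the variable-length bitstream; the sketch only shows the data-reduction idea.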

  37. Multicast Video-Quality Adjustment (Yamada et al. 2002)
  • Low-pass filter on the IXP1200
   - parallel execution on the 200 MHz StrongARM and the microengines
   - 24 MB DRAM devoted to the StrongARM only
   - 8 MB DRAM and 8 MB SRAM shared
  • A test-filtering program on a regular PC determined the work distribution
   - 75% of the data comes from the block layer
   - 56% of the processing overhead is due to the DCT
  • Five-step algorithm:
   1. StrongARM receives a packet and copies it to the shared memory area
   2. StrongARM processes the headers and generates macroblocks (in shared memory)
   3. Microengines read data and information from shared memory and perform quality adjustments on each block
   4. StrongARM checks if the last macroblock has been processed (if not, go to 2)
   5. StrongARM rebuilds the packet

  38. Multicast Video-Quality Adjustment (Yamada et al. 2002)
  • Segmentation of MPEG-2 data
   - slice = a stripe 16 pixels high
   - macroblock = a 16 x 16 pixel square
    - four 8 x 8 luminance blocks
    - two 8 x 8 chrominance blocks
    - DCT transformed, with the coefficients sorted in ascending (frequency) order
  • Data packetization for video filtering
   - 720 x 576 pixel frames at 30 fps
   - 36 slices with 45 macroblocks per frame
   - each slice = one packet
   - 8 Mbps stream → ~7 Kbit per packet
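As a quick check of the packet size: 8 Mbit/s divided by 30 frames/s and 36 slices per frame gives roughly 7.4 kbit per slice, which matches the ~7 Kbit packets above.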

  39. Multicast Video-Quality Adjustment (Yamada et al. 2002)

  40. Multicast Video-Quality Adjustment (Yamada et al. 2002)

  41. Multicast Video-Quality Adjustment (Yamada et al. 2002)

  42. Multicast Video-Quality Adjustment (Yamada et al. 2002)

  43. Multicast Video-Quality Adjustment (Yamada et al. 2002)
  • Evaluation: three scenarios tested
   - StrongARM only → 550 kbps
   - StrongARM + 1 microengine → 350 kbps
   - StrongARM + all microengines → 1350 kbps
  • The achieved real-time transcoding rate is not enough for practical purposes, but the distribution of workload is nice

  44. INF5063: Programming Heterogeneous Multi-Core Processors. Parallelism, Pipelining & Workload Partitioning

  45. Divide and …
  • Divide a problem into parts, but how? [diagram: pipelining (stages in series), parallelism (identical units side by side), and hybrid combinations of the two]

  46. Key Considerations
  • System topology
   - processor capacities: different processors have different capabilities
   - memory attachments: different memory types have different rates and access times; different memory banks have different access times
   - interconnections: different interconnects/busses have different capabilities
  • Requirements of the workload
   - dependencies
  • Parameters
   - width of the pipeline (level of parallelism)
   - depth of the pipeline (number of stages)
   - number of jobs sharing busses

  47. Network Processor Example
  • "Pipelining vs. Multiprocessor" study by Ning Weng & Tilman Wolf
   - network processor example: pipelining, parallelism and hybrids are all possible
   - packet-processing scenario
  • What is the performance of the different schemes, taking into account
   - processing dependencies
   - processing demands
   - contention on memory interfaces
   - pipelining and parallelism effects (experimenting with the width and the depth of the pipeline)

  48. Simulations
  • Several application examples in the paper give different DAGs, e.g., flow classification: classify flows according to IP addresses and transport protocols
  • Measure system throughput while varying all the parameters
   - # processors in parallel (width)
   - # stages in the pipeline (depth)
   - # memory interfaces (busses) between each stage in the pipeline
   - memory access times

  49. Results [throughput graphs, with # memory interfaces per stage M = 1 and memory service time S = 10]
  • Throughput increases with the pipeline depth D
   - good scalability, proportional to the # processors
  • Throughput increases with the width W initially, but tails off for large W
   - poor scalability due to contention on the memory channel
  • Efficiency per processing engine…?

  50. Lessons learned…
  • Memory contention can become a severe system bottleneck
   - the memory interface saturates with about two processing elements per interface
   - off-chip memory accesses cause a significant reduction in throughput and a drastic increase in queuing delay
   - performance increases with more memory channels and lower access times
  • Most NP applications are of a sequential nature, which leads to highly pipelined NP topologies
   - balance processing tasks to avoid slow pipeline stages
  • Communication and synchronization are the main contributors to the pipeline stage time, next to the memory access delay
  • "Topology" has a significant impact on performance
