1 / 28

Framework For Supporting Multi-Service Edge Packet Processing On Network Processors


Presentation Transcript


  1. Framework For Supporting Multi-Service Edge Packet Processing On Network Processors Arun Raghunath, Aaron Kunze, Erik J. Johnson Intel Research and Development Vinod Balakrishnan Openwave Systems Inc. ANCS 2005

  2. Overview Problem • Edge routers need to support a sophisticated set of services • How best to use the numerous hardware resources that network processors provide • Cores, multiple memory levels, inter-core queuing, crypto assists • Workloads fluctuate over time

  3. Overview Problem: Workload variations [Figure: average http_data traffic over a 5-day trace, showing large fluctuations] Location: network edge in front of a group of Internet clients. Duration: 5 days. Source: "A Case for Run-time Adaptation in Packet Processing Systems", R. Kokku et al., HotNets-II, vol. 34, issue 1, January 2004; trace: http://ita.ee.lbl.gov/html/contrib/UCB.home-IP-HTTP.html There is no representative workload!

  4. Overview Problem • Edge routers need to support large sets of sophisticated services • How best to use the numerous hardware resources that network processors provide • Cores, multiple memory levels, inter-core queuing, crypto assists • Workloads fluctuate over time • There is no representative workload • Systems are usually over-provisioned to handle the worst case • Run-time adaptation: the ability to change the mapping of services to hardware resources

  5. Overview Adaptation opportunities (diagrams: services mapped onto microengines MEv2 1–16 plus the Intel XScale® core) • Ex. 1: Change the allocation of cores between IPv6 compression-and-forwarding and IPv4 compression-and-forwarding to increase an individual service's performance • Ex. 2: Support a large set of services in the "fast path", according to use (e.g., adding VPN encrypt/decrypt alongside IPv4/IPv6 compression and forwarding) • Ex. 3: Power down unneeded processors

  6. Overview Theory of operation (diagram: run-time system spanning the XScale core and the microengines) • The run-time system holds executable binaries for each service (A, B, C) • The System Monitor observes queue info and the traffic mix • The Linker binds resources through the Resource Abstraction Layer (RAL), producing the resource mapping • The run-time system checkpoints processors and re-binds resources when the mapping changes

  7. Monitoring Rate-based monitoring • Observe the queue between two stages • Arrival/departure rates are indicative of processing needs • Rarr = current arrival rate, Rdep = current departure rate, Rworst = worst-case arrival rate, tsw = time to switch on a core, Qsize = queue capacity • Assumption: Rdep scales linearly, so for a stage running on n cores, Rdep = n * Rdep1 (where Rdep1 is the departure rate with one core)
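The rate-based monitoring above can be sketched as follows. This is a minimal illustrative Python version (class and method names are hypothetical; the real system samples hardware queue counters on the network processor):

```python
class QueueMonitor:
    """Samples a queue's enqueue/dequeue counters to estimate rates."""

    def __init__(self, rdep1):
        self.rdep1 = rdep1          # measured departure rate of one core (Rdep1)
        self.prev_enq = 0
        self.prev_deq = 0

    def sample(self, enq_count, deq_count, interval):
        """Return (Rarr, Rdep) over the last monitoring interval."""
        rarr = (enq_count - self.prev_enq) / interval
        rdep = (deq_count - self.prev_deq) / interval
        self.prev_enq, self.prev_deq = enq_count, deq_count
        return rarr, rdep

    def expected_rdep(self, n_cores):
        # Slide's assumption: departure rate scales linearly with cores.
        return n_cores * self.rdep1
```

With `rdep1 = 100` packets/s, a stage on 3 cores is expected to drain `expected_rdep(3) = 300` packets/s; comparing that against the sampled Rarr drives the policy on the next slides.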

  8. Policy Allocation policy • Number of cores = R / Rdep1 • If R = Rworst, the system moves directly to the worst-case-provisioned state • Instead, only request cores as needed: NumCores(Rarr) = Rarr / Rdep1 • Qadapt = buffer space to handle the worst burst • If Rarr >> Rdep, request allocation of processors immediately; how many is a function of Rarr / Rdep1 • If Rarr is only slightly larger than Rdep, let the queue grow to Qadapt, then request allocation of one processor
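A sketch of this allocation policy in Python (function names and the burst threshold `burst_factor` are assumptions for illustration, not values from the paper):

```python
import math

def cores_needed(rarr, rdep1):
    # NumCores(Rarr) = ceil(Rarr / Rdep1): request only as many cores as needed.
    return math.ceil(rarr / rdep1)

def allocation_request(rarr, rdep, rdep1, qdepth, qadapt, burst_factor=2.0):
    """Return how many extra cores to request for this stage (0 if none)."""
    current = round(rdep / rdep1)       # cores currently serving the stage
    if rarr > burst_factor * rdep:
        # Arrival rate far exceeds departure rate: request immediately,
        # enough cores to match the arrival rate.
        return max(cores_needed(rarr, rdep1) - current, 0)
    if rarr > rdep and qdepth >= qadapt:
        # Slight overload: let the queue grow to Qadapt, then add one core.
        return 1
    return 0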

  9. Policy De-allocation policy • While increasing the allocation, latch Rdep1 • If Rarr / Rdep1 < current allocation, request de-allocation of one core • Hysteresis: wait for some cycles before requesting de-allocation again • Avoids fluctuations due to transient dips in the arrival rate
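The de-allocation rule with hysteresis can be sketched like this (a hypothetical Python version; the cycle count is a placeholder, not the paper's tuning):

```python
import math

class Deallocator:
    """One-core-at-a-time release with a hysteresis wait between releases."""

    def __init__(self, rdep1, hysteresis_cycles=10):
        self.rdep1 = rdep1              # latched while the allocation was increased
        self.hysteresis = hysteresis_cycles
        self.cooldown = 0               # cycles until the next release is allowed

    def tick(self, rarr, current_cores):
        """Called once per monitoring cycle; True => release one core."""
        if self.cooldown > 0:
            self.cooldown -= 1          # still inside the hysteresis window
            return False
        if math.ceil(rarr / self.rdep1) < current_cores:
            # Fewer cores suffice; wait before the next release so transient
            # dips in the arrival rate do not cause allocation thrashing.
            self.cooldown = self.hysteresis
            return True
        return False
```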

  10. Overview Theory of operation (diagram, continued: a Resource Allocator is added on the XScale core) • The System Monitor raises triggers to the Resource Allocator based on queue info and the traffic mix • The Resource Allocator decides the new resource mapping • The Linker re-binds the services (A, B, C) to microengines through the Resource Abstraction Layer (RAL)

  11. Resource Allocation Resource allocator • Handles requests for allocation/de-allocation from individual stages • Is aware of global system state and decides: • which specific processor to allocate/free • whether to de-allocate or migrate a stage when no free processors are available • Steal a core only when the victim's arrival rate < the arrival rate of the requesting stage • whether a request is declined
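The stealing condition above can be sketched as a victim-selection step (all names here are illustrative; the paper's allocator considers more global state than this):

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    rarr: float   # current arrival rate
    cores: int    # cores currently allocated

def choose_victim(stages, requester):
    """Pick a stage to steal a core from, or None to decline the request."""
    candidates = [s for s in stages
                  if s is not requester
                  and s.rarr < requester.rarr   # steal only from lighter-loaded stages
                  and s.cores > 1]              # never strip a stage entirely
    if not candidates:
        return None
    # Prefer the stage with the lowest arrival rate, i.e., the most spare capacity.
    return min(candidates, key=lambda s: s.rarr)
```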

  12. Overview Theory of operation (diagram, continued: a System Evaluation component is added) • As before, the System Monitor raises triggers to the Resource Allocator based on queue info and the traffic mix, and the Linker binds services (A, B, C) through the Resource Abstraction Layer (RAL) to produce the resource mapping • System Evaluation assesses the resulting mapping

  13. Results Experimental setup • Radisys, Inc. ENP-2611* • 600 MHz Intel® IXP2400 processor • MontaVista Linux* • 3 optical Gigabit Ethernet ports • IXIA* traffic generator for packet stimulus * Third-party brands/names are property of their respective owners

  14. Results Adaptation costs • Overhead due to function calls into the resource abstraction layer: 14% performance degradation when processing minimum-size packets at line rate • Overall adaptation time = binding time + (checkpointing-and-loading time × number of cores) • Cumulative effect: ~100 ms, dominated by the cost of the binding mechanism
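The adaptation-time formula on this slide, written out (the per-component millisecond figures used below are placeholders for illustration, not measurements from the paper; only the ~100 ms total and the dominance of binding are from the slide):

```python
def adaptation_time_ms(bind_ms, checkpoint_load_ms, n_cores):
    # Overall adaptation time = binding time
    #   + (checkpointing-and-loading time per core * number of cores).
    return bind_ms + checkpoint_load_ms * n_cores

# Hypothetical split consistent with a binding-dominated ~100 ms total:
# 80 ms binding + 5 ms checkpoint-and-load per core across 4 cores.
total = adaptation_time_ms(80.0, 5.0, 4)
```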

  15. Results Adaptation benefits: Testing methodology • Need to measure the ability of the system to handle long-term workload variations • Systems compared: • static system (profile-driven compilation) • adaptive system

  16. Results Adaptation benefits: Testing methodology (diagram: Layer 3 switching application mapped across the microengines and the XScale core) • Pipeline: Rx → L2 classifier → parallel L2 bridge and L3 forwarder stages → Ethernet encapsulation → Tx • For the static system, a profile compiler uses the system, traffic, and performance characteristics to produce a static binary

  17. Results Benefits of run-time adaptation [Chart: performance across traffic mixes ranging from (0%, 100%) through (50%, 50%) to (100%, 0%)] Source: Intel. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

  18. Conclusion Future work • Study the ability of an adaptive system to handle short-term fluctuations • Would it drop more packets than a non-adaptive system? • Enable flow-aware run-time adaptation • Explore more sophisticated resource allocation algorithms • Support properties like fairness and performance guarantees

  19. Conclusion Related work • Ease of programming • NP-Click: N. Shah et al., NP-2 workshop, 2003 • Nova: L. George, M. Blume, ACM SIGPLAN 2003 • Auto-partitioning programming model: Intel, whitepaper, 2003 • Dynamic extensibility • Router plugins: D. Decasper et al., SIGCOMM 1998 • PromethOS: R. Keller et al., IWAN 2002 • VERA: S. Karlin, L. Peterson, Computer Networks, 2002 • NetBind: M. Kounavis, Software: Practice and Experience, 2004 • Load balancing • ShaRE: R. Kokku, Ph.D. thesis, UT Austin, 2005

  20. Conclusion Conclusion • Run-time adaptation is an attractive approach for handling traffic fluctuations • Implemented a framework capable of adapting the processing cores allocated to network services • Implemented a policy that: • automatically balances the service pipeline • overcomes the code-store limitation of fixed-control-store processor cores

  21. Background

  22. Mechanisms Checkpointing: Leveraging domain characteristics • Finding the best checkpoint is easier in packet processing than in general domains • Characteristics of data-flow applications: • typically implemented as a dispatch loop • the dispatch loop is executed at high frequency • the top of the dispatch loop has no stack information • Since the compiler creates the dispatch loop, the compiler inserts checkpoints into the code
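The dispatch-loop shape this slide relies on can be illustrated as follows (a Python sketch with hypothetical names; on the real hardware the compiler emits this structure and the checkpoint test into microengine code):

```python
def dispatch_loop(rx_queue, handle_packet, checkpoint_requested, save_state):
    """Checkpoint test at the top of the loop, where no stack state is live."""
    while True:
        if checkpoint_requested():
            save_state()            # nothing on the stack needs saving here
            return                  # core can now be re-assigned by the RTS
        pkt = rx_queue.get()
        handle_packet(pkt)          # per-packet processing runs to completion
```

Because every packet returns control to the loop top at high frequency, a checkpoint request is honored quickly and with minimal state to save.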

  23. Mechanisms Why have binding? (diagram: stages A and B re-mapped between microengines and the Intel XScale™ core) • After re-mapping, adjacent stages may land on neighboring microengines, so we can use NN rings and local locks • Want to be able to use the fastest implementations of resources available

  24. Mechanisms (4/6) Binding • Goal: use the fastest implementations of resources available • Resource abstraction • Programmers write to abstract resources (packet channels, uniform memory, locks, etc.) • Must have little impact on run-time performance • Our approach: adaptation-time linking

  25. Mechanisms (6/6) Resource binding approach: Adaptation-time linking (a microengine-based example) • RAL calls in the application code are initially undefined in the application .o file • At run time, the RTS has the application .o file and the RAL .o file (holding the RAL implementations 0–6) • The linker produces the final .o file, adjusting jump targets using the import-variable mechanism • The process is repeated after each adaptation

  26. Binding: The Value of Choosing the Right Resource [Chart: performance comparison across resource implementation choices]

  27. Problem domain (diagram: services at edge locations between the access network, MAN/WAN, and enterprise LAN) • Compression, monitoring (billing, QoS), forwarding, switching • VPN gateway, firewall, intrusion detection • XML & SSL acceleration, L4–L7 switching, application acceleration

  28. Policy Determining Qadapt and the monitoring interval • Want to maximize Qadapt, the buffer space available to handle the worst burst • Qadapt is a function of the queue monitoring interval • The queue fills up while a core comes online, so Qadapt with n cores must leave headroom for the fill-up until n+1 cores are draining • The theoretical maximum Qadapt is achieved when the queue depth can be detected instantaneously
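A rough sketch of the headroom reasoning above, under the instantaneous-detection (theoretical-maximum) case. This is my own simplification using the slide's quantities (Qsize, Rworst, Rdep1, tsw), not the paper's exact derivation, which also accounts for the monitoring interval:

```python
def qadapt_max(qsize, rworst, rdep1, n_cores, t_switch):
    # While the (n+1)th core is switched on (taking t_switch), the n active
    # cores drain at n * Rdep1, so a worst-case burst can grow the queue by
    # (Rworst - n * Rdep1) * t_switch. The adaptation threshold Qadapt must
    # sit at least that far below the queue capacity to avoid overflow.
    fill_during_switch = max(rworst - n_cores * rdep1, 0) * t_switch
    return qsize - fill_during_switch
```

For example, with `qsize = 1000` packets, `rworst = 500` packets/s, two cores each draining 100 packets/s, and a 2 s switch-on time, the threshold must sit at 400 packets or below.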
