Optimizing OpenStack with DPDK: Strategies for High Performance Applications

Integrating OpenStack with DPDK for High Performance Applications
OpenStack Summit 2018 Vancouver

Who are we?
• Yasufumi Ogawa (@yogawa)
  • Core maintainer in DPDK/SPP
• Tetsuro Nakamura (@tetsuro)
  • Core maintainer in “networking-spp”
  • Active in Nova (NFV/Placement)

Agenda:
• DPDK
  • Strategies for High Performance
  • Examples of How to Configure
  • Motivation and Use Case - SPP (Soft Patch Panel)
• OpenStack
  • Bring the tuning settings to OpenStack
    • CPU pinning
    • Emulator threads policy
    • NUMA Architecture
  • Manage DPDK vhost-user interface

Strategies for High Performance
There are three strategies for getting better performance from DPDK:
1. Configuration that takes the hardware architecture into account
  • NUMA and CPU layout
  • Hugepages
  • Memory channels
2. Optimization of VM resource assignment
  • isolcpus
  • taskset
3. Writing efficient code
  • Reduce memory copies
  • Communicate between lcores via rings

Configurations for DPDK
DPDK provides several options for tuning an application to the architecture.
(1) CPU Layout
• Decide the core assignment with the '-l' option when launching a DPDK app.
  $ sudo /path/to/app -l 0-2 ...
  [Figure: cores 0 to 4; with '-l 0-2', the main thread runs on core 0 and worker threads run on cores 1 and 2.]
(2) Memory Channel
• Give the number of memory channels with the '-n' option for optimization.
• Padding of an appropriate size is added between packets so that packet loads and stores are spread across channels and ranks. For 2 channels and 4-ranked DIMMs, the 2nd packet should start from the channel 1, rank 1 memory address.
  [Figure: memory addresses interleaved over channels 0-1 and ranks 0-3; packet 1 is followed by padding so that packet 2 starts on a different channel and rank.]
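The padding rule for memory channels can be illustrated with a little arithmetic. The sketch below is an assumption-laden illustration, not DPDK code: it rounds an object size up until its size in 64-byte units shares no common factor with channels × ranks, which is similar in spirit to what DPDK's mempool library does. The function name and the 64-byte unit are hypothetical choices made for this example.

```python
from math import gcd

CACHE_LINE = 64  # bytes; assumed alignment unit for this sketch


def padded_object_size(obj_size: int, channels: int, ranks: int) -> int:
    """Round obj_size up so consecutive objects start on different channel/rank pairs.

    The size (in cache-line units) is increased until it is coprime with
    channels * ranks, so successive objects walk through every channel and rank.
    """
    if obj_size <= 0 or channels <= 0 or ranks <= 0:
        raise ValueError("sizes and counts must be positive")
    units = -(-obj_size // CACHE_LINE)  # ceiling division into cache-line units
    while gcd(units, channels * ranks) != 1:
        units += 1
    return units * CACHE_LINE


if __name__ == "__main__":
    # 2 channels and 4 ranks, as in the slide's example
    print(padded_object_size(2048, channels=2, ranks=4))
```

With 2 channels and 4 ranks, a 2048-byte object (32 units) is padded to 33 units, so each new object starts one cache line further along the channel/rank interleaving.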

Configurations for VM
CPU assignment for VMs is not controlled by default, but it can be done manually.
(1) isolcpus
Use the isolcpus Linux kernel parameter to isolate CPUs from the Linux scheduler and reduce context switches.
  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=3-7"
(2) taskset
Use the taskset command to update the affinity of vCPU threads and pin them.
  $ sudo taskset -pc 4 192623
  pid 192623's current affinity list: 0-31
  pid 192623's new affinity list: 4
  $ sudo taskset -pc 5 192624
  ....
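Pinning many vCPU threads by hand gets tedious, so the manual taskset step can be scripted. The following is a minimal sketch that only builds the command strings; the helper names are hypothetical, and the thread IDs are the ones shown on the slide. It assumes one thread per CPU and does not run anything.

```python
from typing import Iterable, List


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a CPU list such as '3-7' or '4,6-7' into sorted CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def taskset_commands(thread_ids: Iterable[int], cpu_spec: str) -> List[str]:
    """Pair each thread ID with one CPU and return the taskset commands."""
    tids = list(thread_ids)
    cpus = parse_cpu_list(cpu_spec)
    if len(tids) > len(cpus):
        raise ValueError("not enough CPUs for one-to-one pinning")
    return [f"taskset -pc {cpu} {tid}" for tid, cpu in zip(tids, cpus)]


if __name__ == "__main__":
    for command in taskset_commands([192623, 192624], "4-5"):
        print(command)
```

The CPUs passed in should be a subset of those isolated with isolcpus, otherwise the pinned threads still compete with ordinary processes.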

Motivation - Large Scale Telco-Services on NFV
• Large-scale cloud for telecom services
• Service Function Chaining for virtual network appliances
• Flexibility, maintainability, and high performance
[Figure: a wide variety of service apps running on VMs: Monitoring, Security, Web Service, Load Balancer, L2 Switch, Audio, Video, L3 Router, MPLS, Firewall, DPI, ...]

SPP (Soft Patch Panel)
• Change network paths through a simple, patch-panel-like interface
• High-speed packet processing with DPDK
• Update network configuration dynamically without terminating services
[Figure: VMs attached to virtual ports, which are patched to physical ports.]

SPP (Soft Patch Panel)
• Multi-process application
  • Primary process is a resource manager
  • Secondary processes are workers for packet forwarding
• Support for several kinds of virtual ports
  • ring pmd
  • vhost pmd
  • pcap pmd, etc.
[Figure: on the host, SPP runs a primary process (resource manager) and spp_nfv secondary processes; guest NFV apps connect via vhost, alongside MPLS, Firewall, and L2 Switch functions.]

Performance
Performance of SPP, OVS-DPDK, and SR-IOV with 1 to 8 VMs
Environment:
• CPU: Xeon E5-2690v3 (12 cores / 24 threads)
• NIC: Intel X520 DP 10Gb DA/SFP+ Server Adapter
• DPDK v16.07
• Traffic: 64 byte packets at 10Gb
[Figure: pktgen on one host sends traffic over a 10Gb link to the other host, where SPP / OVS / SR-IOV forwards it to up to 8 VMs running l2fwd.]

Performance
• SPP ring achieves the best performance.
• (vhost) SPP vhost keeps 7 Mpps for up to 8 VMs.

Can we bring these performance tunings to the OpenStack world?

A Basic Gap:

It’s not that easy. The gap makes it complex. Let’s see the status today, pain points, and possible improvements for “Rocky”.


Agenda:
• Bring the tuning settings to OpenStack
  • 1. CPU pinning
    • 1-1. How to assign cores - Service Setup
    • 1-2. How to assign cores - VM deployment
    • 1-3. Pain Points in Queens
    • 1-4. Proposed improvements for Rocky
  • 2. Emulator threads policy
  • 3. NUMA Architecture

1-1. How to assign cores - Service Setup
• Isolate CPUs manually for the vSwitch and VMs so that other processes are not scheduled on them.
  [/etc/default/grub]
  GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=2-15"
• Set the CPUs for networking, which are used by the vSwitch (SPP), via the configuration file. (“0x3e” means 2-5)
  [local.conf]
  DPDK_PORT_MAPPINGS = 00:04.0#phys1#1#0x3e
• Reserve the rest of the CPUs for VMs.
  [local.conf]
  vcpu_pin_set = 6-15
[Figure: a compute node where SPP connects the physical port to two VMs.]
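Core masks such as the one in DPDK_PORT_MAPPINGS are easy to misread, so it can help to convert between masks and core lists mechanically. The helpers below are a hypothetical sketch that treats the value as a plain bitmask (bit N set means core N). Read that way, 0x3e has bits 1 through 5 set, while a mask covering exactly cores 2-5 would be 0x3c; the slide does not say how networking-spp uses the lowest set bit, so this is only an arithmetic check.

```python
from typing import Iterable, List


def mask_to_cores(mask: int) -> List[int]:
    """Return the core numbers whose bits are set in a plain core bitmask."""
    if mask < 0:
        raise ValueError("mask must be non-negative")
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def cores_to_mask(cores: Iterable[int]) -> int:
    """Build a plain core bitmask from core numbers."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


if __name__ == "__main__":
    print(mask_to_cores(0x3E))
    print(hex(cores_to_mask(range(2, 6))))
```

Checking the mask against the isolcpus and vcpu_pin_set ranges before restarting services is a cheap way to catch overlapping assignments.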

1-2. How to assign cores - VM deployment
• Use “cpu_policy” in flavor extra specs.
hw:cpu_policy=shared (default)
• vCPUs “float” across all the CPUs.
• 〇 Can use more vCPUs than real CPUs
• × Lower and unpredictable performance
hw:cpu_policy=dedicated
• Each vCPU is “pinned” to a dedicated CPU.
• 〇 Can get more performance
• × Lower accommodation rate
[Figure: for each policy, a 2-vCPU VM on a host whose CPUs 0-15 are split into OS, “Used by vSwitch”, and vcpu_pin_set.]

(ref.) Performance Difference
Environment:
• PowerEdge R730
• CPU: E5-2690v3 (2.60GHz, 12 cores)
• NIC: Intel X520 DP 10Gb DA/SFP+
• DPDK 17.11
• Hugepage: 1GB
• Traffic: 64 byte UDP
[Figure: bar chart of Throughput (GB) comparing “isolcpu + cpupinning” with “harmed by other process”; the chart is annotated “× 2.98”.]

1-3. Pain Points 1
• VMs with different “cpu_policy” values can’t be colocated on the same host.
• This is because the “shared” VM would float on pCPUs pinned to the “dedicated” VM, which results in unpredictable performance degradation.
[Figure: two 2-vCPU VMs on a host whose CPUs 0-15 are split into OS, “Used by vSwitch”, and vcpu_pin_set.]

1-3. Pain Points 2
• There is no way to assign both dedicated and shared vCPUs to one VM.
• We want to save cores for:
  • housekeeping tasks for the OS, and DPDK cores for controlling tasks
• Example architecture of a DPDK application: hw:cpu_policy=mixed (not supported!)
[Figure: a VM with three vCPUs (0, 1, 2) labeled master core, slave core1, and slave core2, placed on host CPUs 0-15.]

1-4. Proposed improvements for Rocky
• Service Setup Options
  • Deprecate the “vcpu_pin_set” option
  • Introduce “cpu_shared_set” and “cpu_dedicated_set” instead
  • They are reported to Placement as the “VCPU” and “PCPU” resource classes, respectively.
• VM deployment Options
  • Deprecate the “hw:cpu_policy” option
  • Request each resource class explicitly
  • “resources:VCPU=2&resources:PCPU=4”
spec: Standardize CPU resource tracking
https://review.openstack.org/#/c/555081/

1-4. Proposed improvements for Rocky
• Set up compute hosts with both VCPUs and PCPUs.
• Simply request them via the flavor: resources:VCPU=1, PCPU=3
• One vCPU of the deployed VM floats across the “VCPU”s; the other vCPUs are pinned to dedicated “PCPU”s.
• The virt driver reports to the Placement service that this compute node has:
  • 2 VCPUs
  • 8 PCPUs
[Figure: a 4-vCPU VM deployed from the flavor onto a host whose CPUs 0-15 are split into OS, “Used by vSwitch”, cpu_shared_set, and cpu_dedicated_set.]
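The request model proposed here comes down to simple bookkeeping per resource class. The sketch below is a toy model, not Nova or Placement code: the function names are hypothetical, allocation ratios are ignored, and the inventory numbers are the ones on the slide (2 VCPUs, 8 PCPUs).

```python
from typing import Dict


def parse_resource_specs(extra_specs: Dict[str, str]) -> Dict[str, int]:
    """Collect 'resources:CLASS' keys from flavor extra specs into {CLASS: amount}."""
    prefix = "resources:"
    return {
        key[len(prefix):]: int(value)
        for key, value in extra_specs.items()
        if key.startswith(prefix)
    }


def fits(inventory: Dict[str, int], used: Dict[str, int], request: Dict[str, int]) -> bool:
    """True if every requested resource class still has enough free units."""
    return all(
        inventory.get(rc, 0) - used.get(rc, 0) >= amount
        for rc, amount in request.items()
    )


if __name__ == "__main__":
    flavor = {
        "resources:VCPU": "1",
        "resources:PCPU": "3",
        "hw:emulator_threads_policy": "share",  # ignored by the parser
    }
    request = parse_resource_specs(flavor)
    node = {"VCPU": 2, "PCPU": 8}
    print(request, fits(node, {}, request))
```

Because shared and dedicated CPUs are counted separately, one host can accept both kinds of request without the “shared” vCPUs landing on pinned CPUs.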

Agenda:
• Bring the tuning settings to OpenStack
  • 1. CPU pinning
  • 2. Emulator threads policy
    • 2-1. What are emulator threads?
    • 2-2. Emulator threads policy options
    • 2-3. Pain points
    • 2-4. Proposed improvements for Rocky
  • 3. NUMA Architecture

2-1. What are emulator threads?
• A VM (QEMU) process has “emulator threads” and “vCPU threads”.
• vCPU threads:
  • one thread per guest vCPU
  • used for guest CPU execution
• Emulator threads:
  • one or more threads per guest instance
  • not associated with any of the guest vCPUs
  • used for
    • the QEMU main event loop
    • asynchronous I/O operation completion
    • SPICE display I/O, etc.
  $ pstree -p 2606
  qemu-system-x86(2606) ┬ {qemu-system-x8}(2607)
                        ├ {qemu-system-x8}(2623)
                        ├ {qemu-system-x8}(2624)
                        ├ {qemu-system-x8}(2625)
                        ├ {qemu-system-x8}(2626)
  (On the slide, part of this thread list is annotated “vCPUs”.)

2-2. Emulator threads policy options
• Take care not to let the “emulator threads” steal time from the vCPU threads, which run the actual instructions for fast data path packet processing.
hw:emulator_threads_policy=share (default)
• Emulator threads run on the same CPUs as the vCPU threads.
hw:emulator_threads_policy=isolate
• Emulator threads are isolated to a dedicated CPU.
[Figure: for each policy, a 2-vCPU VM on a host whose CPUs 0-15 are split into OS, “Used by vSwitch”, and vcpu_pin_set; “e” marks where the emulator threads run.]

2-3. Pain Points
• Take care not to let the “emulator threads” steal time from the vCPU threads, which run the actual instructions for fast data path packet processing.
hw:emulator_threads_policy=isolate
• Emulator threads are isolated to a dedicated CPU.
• Question: Do we want to consume one dedicated CPU for the emulator threads of every VM?
• -> Not really... it is the vCPU threads that process fast data path packets, not the emulator threads.
[Figure: a 2-vCPU VM on a host whose CPUs 0-15 are split into OS, “Used by vSwitch”, and vcpu_pin_set; “e” marks the dedicated CPU running the emulator threads.]

2-4. Proposed improvements for Rocky
• “hw:emulator_threads_policy=share” will try to run emulator threads on CPUs in “cpu_shared_set” and fall back to the legacy behavior if none is available.
• “hw:emulator_threads_policy=isolate” will remain the same.
hw:emulator_threads_policy=share (default)
• Emulator threads try to float across the “VCPU”s.
hw:emulator_threads_policy=isolate
• Emulator threads stay on a dedicated CPU.
[Figure: for each policy, a 4-vCPU VM on a host whose CPUs 0-15 are split into OS, “Used by vSwitch”, cpu_shared_set, and cpu_dedicated_set; “e” marks where the emulator threads run.]
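The proposed placement rule for emulator threads can be written down as a small decision function. This is a hypothetical model of the behavior described on the slide, not Nova's implementation: the function and parameter names are invented for illustration, and “legacy behavior” is assumed to mean running on the same CPUs as the vCPU threads.

```python
from typing import Set


def emulator_cpus(policy: str, vcpu_pins: Set[int],
                  cpu_shared_set: Set[int], free_dedicated: Set[int]) -> Set[int]:
    """Return the host CPUs on which a VM's emulator threads may run."""
    if policy == "isolate":
        # One extra dedicated CPU is consumed just for the emulator threads.
        if not free_dedicated:
            raise ValueError("no free dedicated CPU for emulator threads")
        return {min(free_dedicated)}
    if policy == "share":
        if cpu_shared_set:
            # Proposed behavior: float across the shared CPUs.
            return set(cpu_shared_set)
        # Assumed legacy behavior: share the CPUs of the vCPU threads.
        return set(vcpu_pins)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    print(emulator_cpus("share", {8, 9, 10}, {6, 7}, {11, 12}))
    print(emulator_cpus("share", {8, 9, 10}, set(), {11, 12}))
    print(emulator_cpus("isolate", {8, 9, 10}, {6, 7}, {11, 12}))
```

The point of the proposal is visible in the first case: the emulator threads neither steal time from pinned vCPUs nor cost a dedicated CPU per VM.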

Agenda:
• Bring the tuning settings to OpenStack
  • 1. CPU pinning
  • 2. Emulator threads policy
  • 3. NUMA Architecture
    • 3-1. What is NUMA?
    • 3-2. NUMA strategy in OpenStack
    • 3-3. Pain Points / Proposed Improvements for Rocky

3-1. What is NUMA?
• NUMA stands for Non-Uniform Memory Access.
• The cost of accessing memory differs depending on where the memory is (it is not symmetric).
• We want to avoid remote access for NFV applications.
• -> Therefore, in OpenStack with the libvirt/KVM backend, the NUMA architecture in an instance *always* reflects the underlying physical NUMA architecture. See the next page.
[Figure: two sockets (Socket0 as NUMA0, Socket1 as NUMA1) with 24 cores (Core0 to Core23) in total and memory attached to each socket; access to a socket's own memory is local access, access to the other socket's memory is remote access.]

(ref.) Performance Difference
Environment:
• PowerEdge R740
• CPU: Xeon GOLD 5118 (2.30GHz)
• NIC: Intel X710 DP 10Gb DA/SFP+
• DPDK 17.11
• Hugepage: 1GB
• Traffic: 64 byte UDP
[Figure: bar chart of Throughput (GB) comparing “remote access” with “local access”; the chart is annotated “× 1.75”.]

3-2. NUMA strategy in OpenStack
• Let’s think of deploying instances to a host with 2 NUMA nodes.
• With the CPU pinning feature, nova picks dedicated CPUs from only one NUMA node.
• Each VM’s memory is allocated on the same NUMA node as its CPUs.
[Figure: a host with CPUs 0-31 across NUMA0 and NUMA1, split into OS, “Used by vSwitch”, and vcpu_pin_set; VM1 and VM2 (4 vCPUs each, hw:cpu_policy=dedicated) are pinned to host CPUs.]

3-2. NUMA strategy in OpenStack
• The host has 18 CPUs left: 6 on NUMA node 0 and 12 on NUMA node 1.
• Can we deploy a VM with 16 vCPUs and the “dedicated” CPU policy to this host?
• -> The answer is “No”, because neither NUMA node has room for it.
[Figure: the same host as on the previous slide, with VM1 and VM2 (hw:cpu_policy=dedicated) already deployed.]
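The arithmetic behind that “No” is worth spelling out: the host-wide total is large enough, but no single node is. The check below is a minimal sketch of the single-node placement rule described on these slides; the function name is hypothetical and the free-CPU counts are the ones given above.

```python
from typing import Dict, List


def numa_nodes_that_fit(free_cpus_per_node: Dict[int, int], requested: int) -> List[int]:
    """Return the NUMA nodes that can hold all requested dedicated CPUs alone."""
    return sorted(
        node for node, free in free_cpus_per_node.items() if free >= requested
    )


if __name__ == "__main__":
    free = {0: 6, 1: 12}
    print(sum(free.values()))             # CPUs left on the whole host
    print(numa_nodes_that_fit(free, 16))  # nodes that could take a 16-vCPU VM
    print(numa_nodes_that_fit(free, 12))  # nodes that could take a 12-vCPU VM
```

A scheduler that only sees the host-wide total would accept the 16-vCPU request and fail later, which is why per-node information matters.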

3-3. Pain Points/Improvements for Rocky
• NUMA node information is very important in deployment.
• However, the placement API exposes no information about NUMA nodes.
• -> In Rocky, we propose to use placement to see NUMA resources.
Before:
• Compute host
  • PCPU:26 (used:8)
  • MEMORY_MB:4096 (used:2048)
  • DISK_GB:300 (used:200)
After “Enable NUMA in Placement”:
• Compute host
  • DISK_GB:300 (used:200)
  • NUMA 0
    • PCPU:10 (used:4)
    • MEMORY_MB:2048 (used:1024)
  • NUMA 1
    • PCPU:16 (used:4)
    • MEMORY_MB:2048 (used:1024)
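The per-NUMA view carries strictly more information than the flat host view, which can be checked by rolling the NUMA figures back up. The data structure below is a hypothetical illustration using the numbers on this slide; it is not the Placement API's data model.

```python
from typing import Dict, Tuple

# provider name -> resource class -> (total, used); figures taken from the slide
Providers = Dict[str, Dict[str, Tuple[int, int]]]

PROVIDERS: Providers = {
    "compute host": {"DISK_GB": (300, 200)},
    "NUMA 0": {"PCPU": (10, 4), "MEMORY_MB": (2048, 1024)},
    "NUMA 1": {"PCPU": (16, 4), "MEMORY_MB": (2048, 1024)},
}


def rollup(providers: Providers) -> Dict[str, Tuple[int, int]]:
    """Sum (total, used) per resource class over all providers."""
    totals: Dict[str, Tuple[int, int]] = {}
    for inventory in providers.values():
        for rc, (total, used) in inventory.items():
            prev_total, prev_used = totals.get(rc, (0, 0))
            totals[rc] = (prev_total + total, prev_used + used)
    return totals


def free_units(providers: Providers, provider: str, rc: str) -> int:
    """Free units of one resource class on one provider."""
    total, used = providers[provider].get(rc, (0, 0))
    return total - used


if __name__ == "__main__":
    print(rollup(PROVIDERS))
    print(free_units(PROVIDERS, "NUMA 0", "PCPU"), free_units(PROVIDERS, "NUMA 1", "PCPU"))
```

The rolled-up totals match the flat host view (26 PCPUs with 8 used, 4096 MB with 2048 used), while the per-node free counts of 6 and 12 are only visible in the nested view.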

Manage DPDK vhost-user interfaces
• vhost-user is the standard choice of interface for communication between DPDK-based switches and VMs.
• However, the number of interfaces is limited to 32 ports by default because of the memory requirement.
• The number of SR-IOV VFs is similarly limited and is going to be managed in Placement.
• We’d like to have a similar solution for managing vhost-user interfaces.

Summary
• We introduced our DPDK product “Soft Patch Panel (SPP)”, which is available from OpenStack using “networking-spp”.
• SPP, like other DPDK applications, has many parameters for performance tuning.
• OpenStack already has several schemes for tuning these parameters.
• For further optimization, new features are being proposed and are under community review for the Rocky release.
Soft Patch Panel: http://dpdk.org/browse/apps/spp/
networking-spp: https://github.com/openstack/networking-spp/
...feel free to contact us !!
