NVMe™/TCP Development Status and a Case Study of SPDK User Space Solution
2019 NVMe™ Annual Members Meeting and Developer Day
March 19, 2019
Sagi Grimberg, Lightbits Labs
Ben Walker and Ziye Yang, Intel
NVMe™/TCP Status
• TP ratified in Nov 2018
• Linux kernel NVMe/TCP support was included in v5.0
• Interoperability tested with vendors and SPDK
• Running in large-scale production environments (backported, though)
• Main TODOs:
  • TLS support
  • Connection termination rework
  • I/O polling (leverage sk_busy_loop() for polling)
  • Various performance optimizations (mainly in the host driver)
  • A few minor specification wording issues to fix up
Performance: Interrupt Affinity
• In NVMe™ we pay close attention to steering each interrupt to the application CPU core
• In TCP networking:
  • TX interrupts are usually steered to the submitting CPU core (XPS)
  • RX interrupt steering is determined by Hash(5-tuple)
    • That is not necessarily local to the application CPU core
• But aRFS (accelerated Receive Flow Steering) comes to the rescue!
  • The RPS mechanism is offloaded to the NIC
  • The NIC driver implements .ndo_rx_flow_steer
  • The RPS stack learns which CPU core processes the stream and teaches the HW with a dedicated steering rule
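The steering behavior above can be sketched as a toy model. This is only an illustration, not how any NIC or the kernel implements it: `zlib.crc32` stands in for the keyed hash a real NIC would use, and `NUM_RX_QUEUES`, `rss_queue`, and `FlowSteeringTable` are hypothetical names introduced here.

```python
import zlib

NUM_RX_QUEUES = 8  # assumption: one RX queue (and interrupt) per CPU core


def rss_queue(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Pick an RX queue from a hash of the 5-tuple.

    Stand-in hash only: the result depends on the flow identity,
    not on where the consuming application runs.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    return zlib.crc32(key) % NUM_RX_QUEUES


class FlowSteeringTable:
    """Toy model of per-flow steering rules (flow -> consuming core)."""

    def __init__(self):
        self.rules = {}

    def learn(self, flow, app_core):
        # The stack observes which core consumes the flow and installs a rule.
        self.rules[flow] = app_core

    def rx_queue(self, flow):
        # A dedicated rule wins; otherwise fall back to the hash.
        if flow in self.rules:
            return self.rules[flow]
        return rss_queue(*flow)
```

Before a rule is learned, the RX queue is whatever the hash yields; after `learn(flow, core)`, the same flow is delivered to the core that actually processes it.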
Canonical Latency Overhead Comparison
• The measurement tests the latency overhead of a QD=1 I/O operation
• NVMe™/TCP is faster than iSCSI but slower than NVMe/RDMA
Performance: Large Transfer Optimizations
• NVMe™ usually imposes minor CPU overhead for large I/O
  • <= 8K (two pages): only assign 2 pointers
  • > 8K: set up PRP/SGL
• In TCP networking:
  • TX large transfers involve higher overhead for TCP segmentation and copy
    • Solution: TCP Segmentation Offload (TSO) and .sendpage()
  • RX large transfers involve higher overhead due to more interrupts and copy
    • Solution: Generic Receive Offload (GRO) and Adaptive Interrupt Moderation
  • Still more overhead than PCIe though...
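The "two pointers vs. PRP list" distinction can be made concrete with a small calculation. This is a simplified sketch assuming a 4 KiB memory page size and ignoring PRP list chaining; `pages_spanned` and `prp_layout` are hypothetical helper names, not driver functions.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB memory page size


def pages_spanned(length, offset=0, page_size=PAGE_SIZE):
    """Pages touched by `length` bytes starting `offset` bytes into a page."""
    if length <= 0:
        raise ValueError("length must be positive")
    return (offset + length + page_size - 1) // page_size


def prp_layout(length, offset=0, page_size=PAGE_SIZE):
    """Describe how PRP1/PRP2 are used for a transfer (simplified)."""
    pages = pages_spanned(length, offset, page_size)
    if pages == 1:
        return {"prp1": "data", "prp2": None, "list_entries": 0}
    if pages == 2:
        return {"prp1": "data", "prp2": "data", "list_entries": 0}
    # More than two pages: PRP2 points to a list describing the remaining pages.
    return {"prp1": "data", "prp2": "list", "list_entries": pages - 1}
```

Under these assumptions an aligned 8K transfer needs only the two in-command pointers, while a 1MB transfer needs a list of 255 entries. Note that an unaligned 8K buffer touches three pages, so the 8K threshold on the slide holds for page-aligned buffers.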
Throughput Comparison
• Single-threaded NVMe™/TCP achieves 2x better throughput
• NVMe/TCP scales to saturate 100Gb/s with 2-3 threads, whereas iSCSI is blocked
NVMe™/TCP Parallel Interface
• Each NVMe queue maps to a dedicated bidirectional TCP connection
• No controller-wide sequencing
• No controller-wide reassembly constraints
4K IOPs Scalability
• iSCSI is heavily serialized and cannot scale with the number of threads
• NVMe™/TCP scales very well, reaching over 2M 4K IOPs
Performance: Read vs. Write I/O Queue Separation
• A common problem with TCP/IP is head-of-queue (HOQ) blocking
  • For example, a small 4KB Read is blocked behind a large 1MB Write until the Write completes its data transfer
• Linux has supported separate queue maps since v5.0
  • Default Queue Map
  • Read Queue Map
  • Poll Queue Map
• NVMe™/TCP leverages separate Queue Maps to eliminate HOQ blocking
• Future: Priority Based Queue Arbitration can reduce the impact even further
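A minimal sketch of how separate queue maps keep small reads from queuing behind large writes. The selection order (polled I/O first, then reads, then everything else) is loosely modeled on the three maps named above; `QueueMap`, `select_queue_map`, and `select_queue` are hypothetical names, and the per-CPU modulo mapping is an assumption made for illustration.

```python
from enum import Enum


class QueueMap(Enum):
    DEFAULT = "default"
    READ = "read"
    POLL = "poll"


def select_queue_map(is_read, polled=False):
    """Choose a queue map for a request: polled I/O, then reads, then the rest."""
    if polled:
        return QueueMap.POLL
    if is_read:
        return QueueMap.READ
    return QueueMap.DEFAULT


def select_queue(maps, queue_map, cpu):
    """Pick a queue (one TCP connection each) from the chosen map for this CPU."""
    queues = maps[queue_map]
    return queues[cpu % len(queues)]
```

For example, with `maps = {QueueMap.DEFAULT: [0, 1, 2, 3], QueueMap.READ: [4, 5, 6, 7], QueueMap.POLL: [8, 9]}`, a 4KB read and a 1MB write issued from the same CPU land on different queues, and therefore on different TCP connections.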
Mixed Workloads Test
• Tests the impact of large Write I/O on Read latency
  • 32 “readers” issuing synchronous READ I/O
  • 1 writer issuing 1MB Writes @ QD=16
• iSCSI latencies collapse in the presence of large Writes
  • Heavy serialization over a single channel
• NVMe™/TCP is very much on par with NVMe/RDMA
Commercial Performance
Software NVMe™/TCP controller performance (IOPs vs. Latency)*
* Commercial single 2U NVM subsystem that implements RAID and compression with 8 attached hosts
Commercial Performance – Mixed Workloads
Software NVMe™/TCP controller performance (IOPs vs. Latency)*
* Commercial single 2U NVM subsystem that implements RAID and compression with 8 attached hosts
Slab, sendpage and kernel hardening
We never copy buffers on the NVMe™/TCP TX side (not even PDU headers)
As a proper blk_mq driver, our PDU headers were preallocated in advance
PDU headers were allocated as normal Slab objects
Can an original Slab allocation be sent to the network with zero-copy?
• Linux-mm seemed to agree we can (Discussion)...
But every now and then, under some workloads, the kernel would panic...

    kernel BUG at mm/usercopy.c:72!
    CPU: 3 PID: 2335 Comm: dhclient Tainted: G O 4.12.10-1.el7.elrepo.x86_64 #1
    ...
    Call Trace:
    copy_page_to_iter_iovec+0x9c/0x180
    copy_page_to_iter+0x22/0x160
    skb_copy_datagram_iter+0x157/0x260
    packet_recvmsg+0xcb/0x460
    sock_recvmsg+0x3d/0x50
    ___sys_recvmsg+0xd7/0x1f0
    __sys_recvmsg+0x51/0x90
    SyS_recvmsg+0x12/0x20
    entry_SYSCALL_64_fastpath+0x1a/0xa5
Slab, sendpage and kernel hardening
Root Cause:
At high queue depth, the TCP stack coalesces PDU headers into a single fragment
At the same time, userspace programs apply bpf packet filters (in this case dhclient)
Kernel hardening applies heuristics to catch exploits:
• In this case, panic if usercopy attempts to copy an skbuff that contains a fragment crossing a Slab object boundary
Resolution:
Don’t allocate PDU headers from the Slab allocators
Instead use a queue-private page_frag_cache
• This resolved the panic issue
• It also improved the page referencing efficiency on the TX path!
Ecosystem
Linux kernel support is upstream since v5.0 (both host and NVM subsystem)
• https://lwn.net/Articles/772556/
• https://patchwork.kernel.org/patch/10729733/
SPDK support (both host and NVM subsystem)
• https://github.com/spdk/spdk/releases
• https://spdk.io/news/2018/11/15/nvme_tcp/
NVMe™ compliance program
• Interoperability testing started at UNH-IOL in the Fall of 2018
• Formal NVMe compliance testing at UNH-IOL planned to start in the Fall of 2019
For more information see:
• https://nvmexpress.org/welcome-nvme-tcp-to-the-nvme-of-family-of-transports/
Summary
• NVMe™/TCP is a new NVMe-oF™ transport
• NVMe/TCP is specified by TP 8000 (available at www.nvmexpress.org)
• Since TP 8000 is ratified, NVMe/TCP is officially part of NVMe-oF 1.0 and will be documented as part of the next NVMe-oF specification release
• NVMe/TCP offers a number of benefits
  • Works with any fabric that supports TCP/IP
  • Does not require a “storage fabric” or any special hardware
  • Provides near direct-attached NAND SSD performance
  • Scalable solution that works within a data center or across the world
Storage Performance Development Kit
• User-space C libraries that implement a block stack
  • Includes an NVMe™ driver
  • Full-featured block stack
• Open source
  • 3-clause BSD
• Asynchronous, event loop, polling design strategy
  • Very different from a traditional OS stack (but very similar to the new io_uring in Linux)
• 100% focus on performance (latency and bandwidth)
https://spdk.io
NVMe-oF™ History
NVMe™ over Fabrics Target
• July 2016: Initial Release (RDMA Transport)
• July 2016 – Oct 2018:
  • Hardening, Feature Completeness
  • Performance Improvements (scalability)
  • Design changes (introduction of poll groups)
• Jan 2019: TCP Transport
  • Compatible with Linux kernel
  • Based on POSIX sockets (option to swap in VPP)
NVMe over Fabrics Host
• December 2016: Initial Release (RDMA Transport)
• July 2016 – Oct 2018:
  • Hardening, Feature Completeness
  • Performance Improvements (zero copy)
• Jan 2019: TCP Transport
  • Compatible with Linux kernel
  • Based on POSIX sockets (option to swap in VPP)
NVMe-oF™ Target Design Overview
The target spawns one thread per core, each running an event loop
• The event loop is called a “poll group”
New connections (sockets) are assigned to a poll group when accepted
The poll group polls the sockets it owns using epoll/kqueue for incoming requests
The poll group polls dedicated NVMe™ queue pairs on the back end for completions (indirectly, via the block device layer)
I/O processing runs in run-to-completion mode and is entirely lock-free
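The poll group idea can be sketched in a few lines. This is not SPDK code (SPDK is written in C); `PollGroup` and `assign_round_robin` are hypothetical names, the round-robin assignment policy is an assumption, and the "request" is just an opaque byte string that is answered immediately to mimic run-to-completion handling.

```python
import selectors
import socket


class PollGroup:
    """Toy single-threaded poll group: owns sockets and polls them without blocking."""

    def __init__(self):
        # DefaultSelector uses epoll on Linux and kqueue on BSD/macOS.
        self.selector = selectors.DefaultSelector()
        self.completed = []

    def add_connection(self, sock):
        sock.setblocking(False)
        self.selector.register(sock, selectors.EVENT_READ)

    def poll(self):
        """One non-blocking pass; each ready request is handled to completion."""
        handled = 0
        for key, _events in self.selector.select(timeout=0):
            conn = key.fileobj
            data = conn.recv(4096)
            if not data:  # peer closed the connection
                self.selector.unregister(conn)
                conn.close()
                continue
            conn.send(b"done:" + data)  # respond on the same thread, no locks
            self.completed.append(data)
            handled += 1
        return handled


def assign_round_robin(groups, connections):
    """Assumed policy: spread newly accepted connections across poll groups."""
    for i, sock in enumerate(connections):
        groups[i % len(groups)].add_connection(sock)
```

Because each socket belongs to exactly one poll group and each group runs on one thread, no state is shared between threads on the I/O path, which is what makes a lock-free design possible.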
Adding a New Transport
• Transports are abstracted away from the common NVMe-oF™ code via a plugin system
• A plugin is a set of function pointers that is registered as a new transport
• The TCP transport is implemented in lib/nvmf/tcp.c
Transport Abstraction
• Socket operations are also abstracted behind a plugin system
• POSIX sockets and VPP are supported
[Diagram labels: FC?, RDMA, TCP; POSIX, VPP]
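The "set of function pointers registered as a new transport" pattern can be sketched as follows. The names here (`TransportOps`, `register_transport`, `get_transport`, and the two operations) are hypothetical and do not reflect SPDK's actual C structures or function names.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TransportOps:
    """A plugin: a named bundle of callables the common code invokes."""
    name: str
    listen: Callable[[str], str]
    poll: Callable[[], int]


_TRANSPORTS: Dict[str, TransportOps] = {}


def register_transport(ops: TransportOps) -> None:
    if ops.name in _TRANSPORTS:
        raise ValueError(f"transport {ops.name!r} already registered")
    _TRANSPORTS[ops.name] = ops


def get_transport(name: str) -> TransportOps:
    return _TRANSPORTS[name]


# Illustrative TCP plugin: the common code never references these directly.
def tcp_listen(address: str) -> str:
    return f"tcp listening on {address}"


def tcp_poll() -> int:
    return 0  # number of requests processed in this pass


register_transport(TransportOps(name="tcp", listen=tcp_listen, poll=tcp_poll))
```

Common code only calls `get_transport(name).listen(...)` or `.poll()`, so adding another transport means registering one more `TransportOps` without touching the callers.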
Future Work
Better socket syscall batching!
• Calling epoll_wait, readv, and writev over and over isn’t efficient. We need to batch the syscalls for a given poll group.
• Abuse libaio’s io_submit? io_uring? This can likely reduce the number of syscalls by a factor of 3 or 4.
Better integration with VPP (eliminate a copy)
Integrate with TCP acceleration available in NICs
NVMe-oF offload support