
Achieving 10 Gb/s Using Xen Para-virtualized Network Drivers





Presentation Transcript


  1. Achieving 10 Gb/s Using Xen Para-virtualized Network Drivers
     Kaushik Kumar Ram*, J. Renato Santos+, Yoshio Turner+, Alan L. Cox*, Scott Rixner*
     +HP Labs  *Rice University

  2. Xen PV Driver on 10 Gig Networks
     (Figure: throughput on a single TCP connection, measured with netperf)
     • Focus of this talk: RX

  3. Network Packet Reception in Xen
     (Diagram: the incoming packet is DMAed by the NIC into the driver domain, passes through the physical driver and the software bridge for demux, and the backend driver grant-copies it into a guest buffer whose grant was posted on the I/O channel; an event/IRQ then tells the frontend driver to push the packet into the guest network stack.)
     Mechanisms to reduce driver domain cost:
     • Use of multi-queue NIC
       • Avoid data copy
       • Packet demultiplex in hardware
     • Grant reuse mechanism
       • Reduce cost of grant operations
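To make the baseline path concrete, here is a minimal C sketch of the per-packet grant copy performed by the backend driver. The structure and function names (grant_copy_op, hypervisor_grant_copy, notify_frontend) are simplified stand-ins for illustration, not the real Xen interfaces (Xen uses GNTTABOP_copy and event channels).

```c
#include <stdint.h>

typedef uint32_t grant_ref_t;   /* grant reference posted by the guest */
typedef uint16_t domid_t;       /* guest domain id                     */

/* Simplified stand-in for a grant-copy descriptor. */
struct grant_copy_op {
    const void *src;        /* packet data in driver-domain memory  */
    grant_ref_t dest_gref;  /* guest receive buffer, named by grant */
    domid_t     dest_domid; /* owner of the destination page        */
    uint16_t    offset;     /* offset within the guest page         */
    uint16_t    len;        /* bytes to copy                        */
};

/* Stand-ins for the grant-copy hypercall and the event-channel notify. */
static int  hypervisor_grant_copy(struct grant_copy_op *op) { (void)op; return 0; }
static void notify_frontend(domid_t guest)                  { (void)guest; }

/* Backend receive step: one grant-copy hypercall and one event per packet.
 * This per-packet copy is the main driver-domain cost that multi-queue
 * NICs remove. */
static int backend_deliver_packet(const void *pkt, uint16_t len,
                                  grant_ref_t gref, domid_t guest)
{
    struct grant_copy_op op = {
        .src = pkt, .dest_gref = gref, .dest_domid = guest,
        .offset = 0, .len = len,
    };

    if (hypervisor_grant_copy(&op) != 0)
        return -1;          /* bad grant or failed copy */

    notify_frontend(guest); /* raises the IRQ that makes the frontend
                               push the packet into the network stack */
    return 0;
}
```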

  4. Using Multi-Queue NICs
     (Diagram: the NIC demultiplexes incoming packets by guest MAC address into one RX queue per guest; the backend maps each granted guest buffer and posts it on that guest's device queue, the NIC DMAs the packet directly into it, and after the event/IRQ the buffer is unmapped and the packet is pushed into the guest network stack.)
     • Advantages of multi-queue
       • Avoid data copy
       • Avoid software bridge
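A minimal sketch of this multi-queue receive path, assuming hypothetical helper names (guest_rx_queue, grant_map, nic_post_rx_buffer); the real netback and NIC driver interfaces differ, but the shape of the path is the same: map the granted guest page, post it on that guest's hardware queue, and let the NIC demux and DMA directly into guest memory.

```c
#include <stdint.h>

typedef uint32_t grant_ref_t;

/* One hardware RX queue per guest; the NIC demuxes by the guest's MAC. */
struct guest_rx_queue {
    uint8_t  mac[6];        /* guest MAC address programmed into the NIC */
    uint64_t ring[256];     /* DMA addresses of posted guest buffers     */
    unsigned head;
};

/* Stand-ins: a grant-map hypercall that yields a DMA-able address, and
 * the driver call that hands a buffer to the NIC hardware. */
static int grant_map(grant_ref_t gref, uint64_t *dma_addr)
{
    (void)gref;
    *dma_addr = 0;          /* placeholder address */
    return 0;
}
static void nic_post_rx_buffer(struct guest_rx_queue *q, uint64_t dma_addr)
{
    q->ring[q->head++ % 256] = dma_addr;
}

/* Backend step: map the granted guest page and post it on that guest's
 * queue, so the NIC DMAs the packet straight into guest memory with no
 * copy and no software bridge. */
static int post_guest_buffer(struct guest_rx_queue *q, grant_ref_t gref)
{
    uint64_t dma_addr;

    if (grant_map(gref, &dma_addr) != 0)    /* one hypercall per buffer */
        return -1;
    nic_post_rx_buffer(q, dma_addr);
    return 0;
}
```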

  5. Performance Impact of Multi-queue
     (Figure: driver domain CPU cost)
     • Savings due to multi-queue
       • grant copy
       • bridge
     • Most of remaining cost
       • grant hypercalls (grant + xen functions)

  6. Using Grants with Multi-queue NIC
     (Diagram: the guest grants a buffer page; the backend issues a map grant hypercall, uses the page for I/O, then issues an unmap grant hypercall.)
     • Multi-queue replaces one grant hypercall (copy) with two hypercalls (map/unmap)
     • Grant hypercalls are expensive
     • Map/unmap calls for every I/O operation
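The per-I/O hypercall pattern this slide calls out, as a short illustrative sketch with stand-in function names; this map/DMA/unmap sequence per packet is exactly what the grant-reuse mechanism on the following slides removes.

```c
#include <stdint.h>

typedef uint32_t grant_ref_t;

static unsigned char fake_guest_page[4096];   /* stands in for the mapped page */

/* Stand-ins for the two grant hypercalls and the device receive. */
static int grant_map_hypercall(grant_ref_t gref, void **mapped)
{
    (void)gref;
    *mapped = fake_guest_page;
    return 0;
}
static void grant_unmap_hypercall(void *mapped) { (void)mapped; }
static void dma_receive_into(void *buf)         { (void)buf; }

/* Every receive pays for one map and one unmap hypercall, replacing the
 * single copy hypercall of the non-multi-queue path. Grant reuse (next
 * slides) eliminates both from the per-I/O path. */
static int receive_one_packet(grant_ref_t gref)
{
    void *buf;

    if (grant_map_hypercall(gref, &buf) != 0)
        return -1;
    dma_receive_into(buf);          /* NIC DMAs the packet into the page */
    grant_unmap_hypercall(buf);     /* second hypercall, per packet      */
    return 0;
}
```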

  7. Reducing Grant Cost
     • Grant reuse
       • Do not revoke a grant after the I/O is completed
       • Keep the buffer page in a pool of unused I/O pages
       • Reuse already-granted pages from the buffer pool for future I/O operations
       • Avoids map/unmap on every I/O
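A minimal, self-contained sketch of the grant-reuse idea: a frontend-side free list of buffer pages whose grants stay valid across I/Os. The io_buffer and io_buffer_pool names are hypothetical; they only illustrate the pool the slide describes.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t grant_ref_t;

/* A pooled receive buffer: the page stays granted between I/Os. */
struct io_buffer {
    void             *page;    /* guest page used for receive DMA */
    grant_ref_t       gref;    /* grant created once, then reused */
    struct io_buffer *next;
};

/* Frontend-side pool of unused, already-granted I/O pages. */
struct io_buffer_pool {
    struct io_buffer *free_list;
};

/* Take an already-granted buffer for the next receive; no map/unmap
 * or pin/unpin hypercall is needed because the grant is still live. */
static struct io_buffer *pool_get(struct io_buffer_pool *pool)
{
    struct io_buffer *buf = pool->free_list;
    if (buf)
        pool->free_list = buf->next;
    return buf;   /* NULL means the pool must grow: grant a new page */
}

/* Return a buffer after the I/O completes, keeping its grant intact
 * so the next receive can reuse it immediately. */
static void pool_put(struct io_buffer_pool *pool, struct io_buffer *buf)
{
    buf->next = pool->free_list;
    pool->free_list = buf;
}
```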

  8. Revoking a Grant when the Page is Mapped in the Driver Domain
     • Guest may need to reclaim the I/O page for other use (e.g., memory pressure on the guest)
     • Need to unmap the page at the driver domain before using it in the guest kernel
       • To preserve memory isolation (e.g., protect from driver bugs)
     • Need a handshake between frontend and backend to revoke the grant
       • This may be slow, especially if the driver domain is not running
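For contrast with the next slide, a rough sketch of the frontend side of such a revoke handshake, with all message and helper names assumed for illustration: the guest must ask the backend to unmap and wait for the acknowledgment before the grant can be safely revoked.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t grant_ref_t;

/* Stand-ins for the I/O-channel messaging and the grant-table call. */
static void send_unmap_request(grant_ref_t gref) { (void)gref; }
static bool wait_for_unmap_ack(grant_ref_t gref) { (void)gref; return true; }
static void end_foreign_access(grant_ref_t gref) { (void)gref; }

/* Frontend side of the revoke handshake: the guest cannot reuse the page
 * until the backend has unmapped it, which may take a long time if the
 * driver domain is not currently scheduled. */
static bool revoke_grant_with_handshake(grant_ref_t gref)
{
    send_unmap_request(gref);       /* request over the I/O channel  */
    if (!wait_for_unmap_ack(gref))  /* blocks until the backend runs */
        return false;
    end_foreign_access(gref);       /* now safe to revoke the grant  */
    return true;
}
```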

  9. Approach to Avoid Handshake when Revoking Grants
     • Observation: no need to map the guest page into the driver domain with a multi-queue NIC
       • Software does not need to look at the packet header, since demux is performed in the device
       • Just need the page address for the DMA operation
     • Approach: replace the grant map hypercall with a shared memory interface to the hypervisor
       • A shared memory table provides translation of guest grants to page addresses
       • No need to unmap the page when the guest needs to revoke a grant (no handshake)

  10. Software I/O Translation Table
     (Diagram: the guest creates a grant for a buffer page, invokes the set hypercall, and sends the grant over the I/O channel; the backend gets the page address from the SIOTT, sets the "use" flag, uses the page for DMA, and resets the flag when done; the guest later invokes the clear hypercall, and Xen checks the "use" flag before revoking.)
     • SIOTT: software I/O translation table
       • Indexed by grant reference
       • "pg" field: guest page address & permission
       • "use" field: indicates whether the grant is in use by the driver domain
     • set/clear hypercalls
       • Invoked by the guest
       • Set validates the grant, pins the page, and writes the page address to the SIOTT
       • Clear requires that "use" = 0
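A sketch of what the SIOTT and its set/clear hypercalls could look like, following only what the slide states (a "pg" field, a "use" flag, set validates/pins/publishes, clear requires "use" = 0); the actual field layout and hypercall interface are assumptions.

```c
#include <stdint.h>

/* One SIOTT entry per grant reference, shared between the hypervisor
 * (which fills "pg" in the set hypercall) and the driver domain (which
 * reads "pg" and toggles "use" around each DMA). Field layout here is
 * only illustrative. */
struct siott_entry {
    uint64_t pg;    /* guest page address plus permission bits            */
    uint32_t use;   /* nonzero while the driver domain is using the grant */
};

/* Hypervisor-side sketch of the two guest-invoked hypercalls. */
static int siott_set(struct siott_entry *table, uint32_t gref,
                     uint64_t page_addr_and_perm)
{
    /* Real implementation: validate the grant and pin the page first. */
    table[gref].pg = page_addr_and_perm;
    return 0;
}

static int siott_clear(struct siott_entry *table, uint32_t gref)
{
    if (table[gref].use != 0)
        return -1;          /* still in use by the driver domain: refuse */
    table[gref].pg = 0;     /* unpin and revoke in the real implementation */
    return 0;
}
```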

  11. Grant Reuse: Avoid pin/unpin hypercall on every I/O
     (Diagram: the frontend keeps an I/O buffer pool; a buffer is granted once via the set hypercall, reused for successive I/Os, and returned to the pool with its grant kept; under kernel memory pressure the guest revokes the grant with the clear hypercall and returns the page to the kernel.)
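Tying the previous sketches together, here is an illustrative revoke path under guest memory pressure; it reuses the hypothetical io_buffer_pool from the slide 7 sketch and siott_clear() from the slide 10 sketch, plus an assumed return_page_to_kernel() helper. No handshake with the backend is needed: the clear hypercall simply fails while the grant is in use.

```c
/* Builds on the hypothetical io_buffer_pool (slide 7 sketch) and
 * siott_clear() (slide 10 sketch); return_page_to_kernel() is also a
 * stand-in. */
static void return_page_to_kernel(void *page) { (void)page; }

/* Guest response to memory pressure: shrink the I/O buffer pool by
 * revoking a grant that the driver domain is not currently using. */
static int shrink_pool_one(struct io_buffer_pool *pool,
                           struct siott_entry *siott)
{
    struct io_buffer *buf = pool_get(pool);
    if (!buf)
        return -1;                      /* pool is already empty */

    if (siott_clear(siott, buf->gref) != 0) {
        pool_put(pool, buf);            /* still in use: keep it granted */
        return -1;
    }
    return_page_to_kernel(buf->page);   /* page is now free for other use */
    return 0;
}
```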

  12. Performance Impact of Grant Reuse w/ Software I/O Translation Table
     (Figure: driver domain CPU cost)
     • Cost saving: grant hypercalls

  13. Impact of optimizations on throughput
     (Figures: CPU utilization and data rate)
     • Multi-queue with grant reuse significantly reduces driver domain cost
     • The bottleneck shifts from the driver domain to the guest
     • Higher cost in the guest than in Linux still limits throughput in Xen

  14. Additional optimizations at the guest frontend driver
     • LRO (Large Receive Offload) support at the frontend
       • Consecutive packets on the same connection are combined into one large packet
       • Reduces the cost of processing packets in the network stack
     • Software prefetch
       • Prefetch the next packet and socket buffer struct into the CPU cache while processing the current packet
       • Reduces cache misses at the frontend
     • Avoid full-page buffers
       • Use half-page (2 KB) buffers (max packet size is 1500 bytes)
       • Reduces the TLB working set and thus TLB misses
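An illustrative sketch of the last two bullets (software prefetch and half-page buffers), using the GCC/Clang __builtin_prefetch hint; the ring layout and helper names are assumptions, and LRO is omitted since it lives in the real netfront/network-stack code.

```c
#include <stddef.h>
#include <stdlib.h>

#define RX_BUF_SIZE 2048    /* half page: fits a 1500-byte frame and
                               halves the TLB working set vs. 4 KB buffers */

struct rx_slot {
    void *data;             /* RX_BUF_SIZE-byte receive buffer */
    void *skb;              /* socket buffer metadata          */
};

static void *alloc_rx_buffer(void)             { return malloc(RX_BUF_SIZE); }
static void  process_packet(struct rx_slot *s) { (void)s; }

/* Frontend receive loop sketch: while handling packet i, prefetch the
 * data and socket-buffer struct of packet i+1 so they are already in
 * the CPU cache when the loop reaches them. */
static void frontend_rx_poll(struct rx_slot *ring, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (i + 1 < count) {
            __builtin_prefetch(ring[i + 1].data, 0 /* read */, 1);
            __builtin_prefetch(ring[i + 1].skb,  0 /* read */, 1);
        }
        process_packet(&ring[i]);
    }
}
```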

  15. Performance impact of guest frontend optimizations
     (Figure: guest domain CPU cost)
     • The optimizations bring the CPU cost in the guest close to native Linux
     • Remaining cost difference
       • Higher cost in netfront than in the physical driver
       • Xen functions to send and deliver events

  16. Impact of all optimizations on throughput
     (Figure: throughput of the optimized PV driver with 1 guest and with 2 guests, the current PV driver, Direct I/O with 1 guest, and native Linux)
     • Multi-queue with the software optimizations achieves the same throughput as direct I/O (~8 Gb/s)
     • 2 or more guests are able to saturate the 10 gigabit link

  17. Conclusion
     • Use of multi-queue support in modern NICs enables high-performance networking with Xen PV drivers
     • Attractive alternative to Direct I/O
       • Same throughput, although with some additional CPU cycles in the driver domain
       • Avoids hardware dependence in the guests
     • A light driver domain enables scalability for multiple guests
       • The driver domain can now handle 10 Gb/s data rates
       • Multiple guests can leverage multiple CPU cores and saturate the 10 gigabit link

  18. Status
     • Performance results were obtained on a modified netfront/netback implementation using the original Netchannel1 protocol
     • Currently porting the mechanisms to Netchannel2
       • Basic multi-queue support is already available in the public netchannel2 tree
       • Additional software optimizations are still under discussion with the community and should be included in netchannel2 soon
     Thanks to
     • Mitch Williams and John Ronciak from Intel for providing samples of Intel NICs and for adding multi-queue support to their driver
     • Ian Pratt, Steven Smith and Keir Fraser for helpful discussions
