
The ‘zero-copy’ initiative


luke

Presentation Transcript


  1. The ‘zero-copy’ initiative: A look at the ‘zero-copy’ concept and an x86 Linux implementation for the case of outgoing packets

  2. From Wikipedia, the free encyclopedia: Zero-copy is an adjective that refers to computer operations in which the CPU does not perform the task of copying data from one area of memory to another. The availability of zero-copy versions of operating system elements such as device drivers, file systems and network protocol stacks greatly increases the performance of many applications, since using a CPU that is capable of complex operations just to make copies of data can be a great waste of resources. Zero-copy also reduces the number of context-switches from User space to Kernel space and vice-versa. Several OS like Linux support zero copying of files through specific API's like sendfile, sendfile64, etc. Techniques for creating zero-copy software include the use of DMA-based copying, and memory-mapping through an MMU. These features require specific hardware support and usually involve particular memory alignment requirements. Zero-copy protocols are especially important for high-speed networks, as memory copies would cause a serious workload for the host cpu. Still, such protocols have some initial overhead so that avoiding programmed IO (PIO) there only makes sense for large messages.

  3. Application source-code

         #include <stdio.h>      // for printf(), perror()
         #include <stdlib.h>     // for exit()
         #include <string.h>     // for strlen()
         #include <fcntl.h>      // for open()
         #include <unistd.h>     // for write()

         char message[] = "This is a test of network-packet transmission \n";

         int main( void )
         {
                 int fd = open( "/dev/nic", O_RDWR );
                 if ( fd < 0 ) { perror( "/dev/nic" ); exit(1); }

                 int msglen = strlen( message );
                 int nbytes = write( fd, message, msglen );
                 if ( nbytes < 0 ) { perror( "write" ); exit(1); }

                 printf( "Transmitted %d bytes \n", nbytes );
         }

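  4. Transmit operation. [Diagram: in user space, the application program calls write() through the runtime library; in kernel space, the Linux OS kernel's file subsystem passes the call to the nic device-driver's my_write(), which uses copy_from_user() to copy the user data-buffer into the packet buffer that the hardware then fetches by DMA.] We want to eliminate this copying-operation.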

  5. Our driver’s packet-layout. Packet-buffer in kernel-space: destn-address | source-address | TYPE/LENGTH | count | -- data --. Format for Legacy Transmit-Descriptor (16 bytes): base-address (64-bits) | Packet-length | CSO | cmd | status | CSS | special
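The 16-byte descriptor layout on this slide can be sketched as a C structure. This is only an illustration: the slide gives the field order and the 64-bit width of the base-address, while the widths of the remaining fields (16-bit length and special, 8-bit CSO, cmd, status and CSS) follow the usual legacy-descriptor layout for this family of Intel controllers. The type name `tx_descriptor_t` and all member names except `base_address` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One legacy transmit descriptor, as laid out on the slide.
 * Member names other than base_address are made up for illustration. */
typedef struct {
    uint64_t base_address;   /* physical address of the packet-buffer  */
    uint16_t packet_length;  /* number of bytes to fetch from it       */
    uint8_t  cso;            /* checksum offset                        */
    uint8_t  cmd;            /* command byte (EOP, RS, ...)            */
    uint8_t  status;         /* written back by the NIC                */
    uint8_t  css;            /* checksum start                         */
    uint16_t special;        /* VLAN-related field                     */
} tx_descriptor_t;

/* Fails to compile if padding ever breaks the 16-byte layout. */
_Static_assert(sizeof(tx_descriptor_t) == 16,
               "legacy TX descriptor must be exactly 16 bytes");
```

Because every member sits on its natural alignment, no packing attribute is needed to reach exactly 16 bytes.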

  6. Can zero-copy be transparent? • We would like to implement the zero-copy concept in our ‘nic2.c’ character driver in such a manner that no changes would be required to an ‘application’ program’s code • We will show how to do this for ‘outgoing’ packets (i.e., by modifying ‘my_write()’), but achieving zero-copy with ‘incoming’ packets would be a lot more complicated!

  7. TX Descriptor’s CMD byte. Command-Byte Format (bit 7 down to bit 0): IDE | VLE | 0 | 0 | RS | IC | IFCS | EOP. EOP = End-Of-Packet (1=yes, 0=no); RS = Report Status (1=yes, 0=no); VLE = VLAN-tag Enable. Key question: What will the NIC do if we don’t set the EOP-bit in a TX Descriptor?
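The command-byte format above maps onto bit masks, sketched here under the assumption that EOP is bit 0 and IDE is bit 7, as the left-to-right order on the slide suggests. The `TXCMD_` names and the helper `make_tx_cmd()` are hypothetical, not taken from the driver.

```c
#include <stdint.h>

/* Bit masks for the legacy TX descriptor's command byte. */
enum {
    TXCMD_EOP  = 0x01,  /* End-Of-Packet                 */
    TXCMD_IFCS = 0x02,  /* Insert Frame-Check-Sequence   */
    TXCMD_IC   = 0x04,  /* Insert Checksum               */
    TXCMD_RS   = 0x08,  /* Report Status                 */
    TXCMD_VLE  = 0x40,  /* VLAN-tag Enable               */
    TXCMD_IDE  = 0x80   /* Interrupt-Delay Enable        */
};

/* Build the command byte for one buffer of a packet.  Only the last
 * buffer of a packet carries EOP; status reporting is optional. */
uint8_t make_tx_cmd(int is_last_buffer, int report_status)
{
    uint8_t cmd = 0;
    if (is_last_buffer) cmd |= TXCMD_EOP;
    if (report_status)  cmd |= TXCMD_RS;
    return cmd;
}
```

A descriptor whose command byte lacks EOP tells the controller that more of the same packet follows in the next descriptor, which is what the split-packet idea relies on.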

  8. Splitting our packet-layout. [Diagram: the packet-buffer in kernel-space (destn-address, source-address, TYPE/LENGTH, count, then -- data --) is marked with two lengths, HDR for the header part and LEN for the data part.] Format for Legacy Transmit-Descriptor Pair: first descriptor: base-address (64-bits) | Packet-Length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special; second descriptor: base-address (64-bits) | Packet-Length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special

  9. Splitting our packet-buffer. [Diagram: the header part (destn-address, source-address, TYPE/LENGTH, count; length HDR) stays in a packet-buffer in kernel-space, while the data part (length LEN) lies in a packet-buffer in user-space.] Format for Legacy Transmit-Descriptor Pair: first descriptor: base-address (64-bits) | Packet-Length (=HDR) | CSO | cmd (EOP=0) | status | CSS | special; second descriptor: base-address (64-bits) | Packet-Length (=LEN) | CSO | cmd (EOP=1) | status | CSS | special. Two physical packet-buffers comprise one logical packet that gets transmitted!
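Filling in such a descriptor pair might look roughly like the following sketch. The structure and function names are hypothetical, and only the EOP bit is set because it is the only command bit this slide specifies; a real driver would probably set further command bits as well.

```c
#include <stdint.h>

#define TXCMD_EOP 0x01   /* End-Of-Packet bit of the command byte */

typedef struct {
    uint64_t base_address;
    uint16_t packet_length;
    uint8_t  cso, cmd, status, css;
    uint16_t special;
} tx_descriptor_t;

/* Describe one logical packet as two physical buffers:
 * pair[0] covers the header (EOP=0), pair[1] the data (EOP=1). */
void fill_split_packet(tx_descriptor_t pair[2],
                       uint64_t hdr_phys,  uint16_t hdr_len,
                       uint64_t data_phys, uint16_t data_len)
{
    pair[0] = (tx_descriptor_t){ .base_address  = hdr_phys,
                                 .packet_length = hdr_len,
                                 .cmd           = 0 };
    pair[1] = (tx_descriptor_t){ .base_address  = data_phys,
                                 .packet_length = data_len,
                                 .cmd           = TXCMD_EOP };
}
```

Both addresses must be physical addresses, since the NIC fetches the buffers by DMA without going through the CPU's page-tables.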

  10. Transmitting a ‘split-packet’. The 82573L controller ‘merges’ the contents of these separate buffers into just a single ethernet-packet. [Diagram: the NIC hardware uses DMA to fetch the packet-data buffer from the application-program in user-space and the packet-header buffer from the device-driver module in kernel-space.]

  11. The ‘virt_to_phys()’ macro • Linux provides a convenient macro which kernel-module code can employ to obtain the physical-address for a memory-region from its virtual-address – but it only works for addresses that aren’t in ‘high’ memory • For ‘normal’ memory-regions, conversion between ‘virtual’ and ‘physical’ addresses amounts to a simple addition/subtraction
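The ‘simple addition/subtraction’ can be modeled in ordinary user-space C. This sketch assumes the common 32-bit x86 arrangement in which kernel virtual addresses begin at 0xC0000000 and only the first 896 MB of physical RAM are permanently mapped; the `MODEL_` constants and function names are hypothetical and are not the kernel's own macros.

```c
#include <stdint.h>

#define MODEL_PAGE_OFFSET  0xC0000000u  /* assumed start of kernel virtual space */
#define MODEL_LOWMEM_LIMIT 0x38000000u  /* 896 MB of permanently mapped RAM      */

/* For 'normal' memory the conversion is a fixed offset in each direction. */
uint32_t model_virt_to_phys(uint32_t vaddr) { return vaddr - MODEL_PAGE_OFFSET; }
uint32_t model_phys_to_virt(uint32_t paddr) { return paddr + MODEL_PAGE_OFFSET; }

/* Physical addresses at or above the limit lie in 'high' memory,
 * where this fixed-offset rule no longer applies. */
int model_is_lowmem(uint32_t paddr) { return paddr < MODEL_LOWMEM_LIMIT; }
```

In this model the kernel virtual address 0xC0100000 corresponds to physical address 0x00100000.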

  12. Linux memory-mapping. [Diagram: the CPU’s virtual address-space (user space and kernel space) is shown beside physical RAM (the first 896-MB and the HMA); a legend distinguishes the persistent mapping from transient mappings.] There is more physical RAM in our classroom’s systems than can be ‘mapped’ into the available address-range for kernel virtual addresses.

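  13. Two-Level Translation Scheme. [Diagram: CR3 → PAGE DIRECTORY → PAGE TABLES → PAGE FRAMES]

  14. Linear to Physical. [Diagram: the linear address is split into dir-index, table-index and offset; CR3 locates the page directory, dir-index selects a page table, table-index selects a page frame in the physical address-space, and offset selects the byte within that page frame.]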

  15. Address-translation • The CPU examines any virtual address it encounters, subdividing it into three fields. Bits 31-22 (10-bits): index into page-directory; this field selects one of the 1024 array-entries in the Page-Directory. Bits 21-12 (10-bits): index into page-table; this field selects one of the 1024 array-entries in that Page-Table. Bits 11-0 (12-bits): offset into page-frame; this field provides the offset to one of the 4096 bytes in that Page-Frame.
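The three-field split can be tried out in user space with a small sketch like this one. The type `va_fields_t` and the function name are hypothetical; the shifts and masks follow the bit-ranges given on the slide.

```c
#include <stdint.h>

typedef struct {
    uint32_t dindex;  /* bits 31-22: index into the page-directory */
    uint32_t pindex;  /* bits 21-12: index into the page-table     */
    uint32_t offset;  /* bits 11-0:  offset into the page-frame    */
} va_fields_t;

/* Subdivide a 32-bit virtual address into its three fields. */
va_fields_t split_virtual_address(uint32_t va)
{
    va_fields_t f;
    f.dindex = (va >> 22) & 0x3FF;
    f.pindex = (va >> 12) & 0x3FF;
    f.offset =  va        & 0xFFF;
    return f;
}
```

For example, the address 0x0804A123 yields directory index 32, table index 74 and offset 0x123.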

  16. Format of a Page-Table entry. Bits 31-12: PAGE-FRAME BASE ADDRESS | bits 11-9: AVAIL | bit 8: 0 | bit 7: 0 | bit 6: D | bit 5: A | bit 4: PCD | bit 3: PWT | bit 2: U | bit 1: W | bit 0: P. LEGEND: P = Present (1=yes, 0=no); W = Writable (1=yes, 0=no); U = User (1=yes, 0=no); A = Accessed (1=yes, 0=no); D = Dirty (1=yes, 0=no); PWT = Page Write-Through (1=yes, 0=no); PCD = Page Cache-Disable (1=yes, 0=no)
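A sketch of how these fields might be decoded from a 32-bit entry follows. The mask values come from the bit positions on the slide; the `PTE_` names and helper functions are hypothetical.

```c
#include <stdint.h>

/* Flag bits of a 32-bit page-table entry. */
enum {
    PTE_PRESENT  = 1u << 0,  /* P   */
    PTE_WRITABLE = 1u << 1,  /* W   */
    PTE_USER     = 1u << 2,  /* U   */
    PTE_PWT      = 1u << 3,  /* PWT */
    PTE_PCD      = 1u << 4,  /* PCD */
    PTE_ACCESSED = 1u << 5,  /* A   */
    PTE_DIRTY    = 1u << 6   /* D   */
};

/* Bits 31-12 hold the page-frame number (PFN). */
uint32_t pte_frame_number(uint32_t pte) { return pte >> 12; }

/* Physical base address of the page-frame the entry points at. */
uint32_t pte_frame_base(uint32_t pte)   { return pte & ~0xFFFu; }

/* Nonzero if every flag in 'flags' is set in the entry. */
int pte_has_flags(uint32_t pte, uint32_t flags) { return (pte & flags) == flags; }
```

An entry of 0x12345067, for instance, refers to page-frame 0x12345 and is present, writable, user-accessible, accessed and dirty.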

  17. Finding the user-buffer’s PFN • To program the ‘base-address’ field in the second TX-Descriptor, our driver’s ‘write()’ function will need to know which physical Page-Frame the application’s buffer lies in • And its PFN (Page-Frame Number) can be found from its virtual address by ‘walking-the-cpu-page-tables’ – even when Linux puts some page-tables in ‘high’ memory
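The idea of ‘walking’ the page-tables can be rehearsed outside the kernel on a simulated memory. In this sketch physical memory is modeled as a flat array of 32-bit words, only 4096-byte pages are handled, and the function name and parameters are hypothetical; it is not the driver's code, which has to deal with real mappings and ‘high’ memory.

```c
#include <stddef.h>
#include <stdint.h>

#define PRESENT_BIT 0x1u

/* Translate 'va' by walking a two-level table held in simulated physical
 * memory 'mem' (nwords 32-bit words, word i sits at physical address 4*i).
 * Returns 0 and stores the physical address in *pa, or -1 on failure. */
int walk_page_tables(const uint32_t *mem, size_t nwords,
                     uint32_t cr3, uint32_t va, uint32_t *pa)
{
    uint32_t dindex = (va >> 22) & 0x3FF;
    uint32_t pindex = (va >> 12) & 0x3FF;
    uint32_t offset =  va        & 0xFFF;

    /* step 1: CR3 gives the page-directory's frame */
    size_t pde_at = (size_t)(cr3 >> 12) * 1024 + dindex;
    if (pde_at >= nwords || !(mem[pde_at] & PRESENT_BIT)) return -1;

    /* step 2: the directory entry gives the page-table's frame */
    size_t pte_at = (size_t)(mem[pde_at] >> 12) * 1024 + pindex;
    if (pte_at >= nwords || !(mem[pte_at] & PRESENT_BIT)) return -1;

    /* step 3: the table entry gives the PFN of the data's page-frame */
    *pa = (mem[pte_at] & ~0xFFFu) | offset;
    return 0;
}
```

Unlike a bare table lookup, the sketch checks the Present bit at both levels, so an unmapped address is reported as a failure rather than translated to garbage.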

  18. Performing ‘virt_to_phys()’

          ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
          {
                  unsigned int  _cr3, *pgdir, *pgtbl, pfn_pgtbl, pfn_frame;
                  unsigned int  dindex, pindex, offset;

                  // take apart the virtual-address of the user's 'buf' variable
                  dindex = ((unsigned long)buf >> 22) & 0x3FF;  // pgdir-index (10-bits)
                  pindex = ((unsigned long)buf >> 12) & 0x3FF;  // pgtbl-index (10-bits)
                  offset = ((unsigned long)buf >>  0) & 0xFFF;  // frame-offset (12-bits)

                  // then walk the CPU's paging-tables to get buf's physical-address
                  asm( " mov %%cr3, %%eax \n mov %%eax, %0 " : "=m" (_cr3) : : "ax" );
                  pgdir = (unsigned int *)phys_to_virt( _cr3 & ~0xFFF );
                  pfn_pgtbl = (pgdir[ dindex ] >> 12);
                  pgtbl = (unsigned int *)kmap( &mem_map[ pfn_pgtbl ] );
                  pfn_frame = (pgtbl[ pindex ] >> 12);
                  kunmap( &mem_map[ pfn_pgtbl ] );
                  txring[ txtail + 1 ].base_address = (pfn_frame << 12) + offset;
                  // ... (the function continues beyond this slide)

  19. Can’t cross a ‘page-boundary’ • In order for the NIC to fetch the user’s data using its Bus-Master DMA capability, the buffer needs to reside in a physically contiguous memory-region • But we can’t be sure Linux will have set up the CPU’s page-tables that way – unless the ‘buf’ is confined to a single page-frame

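  20. Truncate ‘len’ if necessary

          ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
          {
                  // 'offset' is buf's offset within its page-frame (see slide 18)
                  if ( offset + len > PAGE_SIZE ) len = PAGE_SIZE - offset;
                  // ...

      [Diagram: ‘buf’ starts at ‘offset’ within one PAGE_SIZE frame and extends for ‘len’ bytes, possibly running past the end of that frame into the next one.]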
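The truncation rule is easy to check in isolation. This is a user-space sketch that assumes 4096-byte page-frames; `MODEL_PAGE_SIZE` and `clamp_len_to_page()` are hypothetical names chosen so as not to clash with the kernel's PAGE_SIZE.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page-frame size */

/* Return how many of 'len' bytes starting at address 'addr'
 * fit before the end of the page-frame that contains 'addr'. */
size_t clamp_len_to_page(uintptr_t addr, size_t len)
{
    size_t offset = (size_t)(addr & (MODEL_PAGE_SIZE - 1));
    if (offset + len > MODEL_PAGE_SIZE)
        len = MODEL_PAGE_SIZE - offset;
    return len;
}
```

A buffer at address 0x1F00 with a requested length of 0x200 would be cut to 0x100 bytes, so the application sees a short write and can send the rest with another call.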

  21. ‘zerocopy.c’ • We created this modification of our ‘nic2.c’ device-driver so its ‘my_write()’ function lets an application perform transmissions without performing a memory-to-memory copy-operation (i.e., ‘copy_from_user()’) • It is not so easy to implement ‘zero-copy’ for receiving packets – can you say why?

  22. Website article • We’ve posted a link on our CS686 website to a frequently cited research-article about the various issues that arise when trying to implement the ‘zero-copy’ concept for the case of ‘incoming’ network-packets: The Need for Asynchronous, Zero-Copy Network I/O, by Ulrich Drepper, Red Hat, Inc.
