Make Hosts Ready for Gigabit Networks

Hardware Requirement • To allow a host to fully utilize Gbps bandwidth, its hardware must be ready for Gbps. For example: • CPU speed • Is a 100 MHz Pentium PC fast enough to process a large number of packets per second? (At 1 Gbps it would have to handle 10 bits per clock cycle.) • Memory throughput • Is SDRAM’s sustained throughput large enough to move data in and out of it at Gbps? • I/O bus bandwidth • Is a 32-bit, 33 MHz PCI bus fast enough to move data at Gbps? • Network interface • Is the firmware on the NIC fast enough to process packets at Gbps?
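
A rough back-of-the-envelope check of the questions above, as a sketch only. The 1 Gbps target and the 1500-byte and 64-byte packet sizes are assumptions chosen for illustration, and framing, bus arbitration, and protocol overhead are ignored, so real figures would be lower.

```python
# Back-of-the-envelope figures for the hardware questions above.
# Assumptions (not from the slides): 1 Gbps target, 1500- and 64-byte packets,
# no framing or bus-arbitration overhead.

GBPS = 1_000_000_000  # target link rate in bits per second


def pci_peak_bps(width_bits: int, clock_hz: int) -> int:
    """Theoretical peak of a parallel bus: one transfer of width_bits per clock."""
    return width_bits * clock_hz


def bits_per_cycle(link_bps: int, cpu_hz: int) -> float:
    """Bits the CPU must handle in each clock cycle to keep up with the link."""
    return link_bps / cpu_hz


def packets_per_second(link_bps: int, packet_bytes: int) -> float:
    """Packet rate at full link utilization for a fixed packet size."""
    return link_bps / (packet_bytes * 8)


if __name__ == "__main__":
    print(pci_peak_bps(32, 33_000_000))           # 1056000000 (about 1.06 Gbps peak)
    print(bits_per_cycle(GBPS, 100_000_000))      # 10.0
    print(round(packets_per_second(GBPS, 1500)))  # 83333
    print(round(packets_per_second(GBPS, 64)))    # 1953125
```

The PCI peak is barely above 1 Gbps, which leaves almost no headroom once a packet has to cross the bus more than once.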

Software Design • If a host’s hardware can barely support Gbps bandwidth, its software must be carefully designed so that an application can still achieve Gbps. For example: • NIC device driver in the OS • TCP/IP protocol stack in the OS • Routing table lookup in the OS • Buffer system in the OS • API between the OS and application programs • Networking services (e.g., NAT, firewall) • (Improving the design and implementation of these software systems is the focus of our course.)

The Path of Moving Data • Networking basically moves data from a networking application on one machine to a networking application on a different machine. • The path of the data is: • Application -> operating system -> network interface -> network -> network interface -> operating system -> application. • Therefore, to achieve Gbps, data must move between the application and the operating system, and between the operating system and the network interface, at Gbps or faster.

The Cost of Moving Data • The cost of moving data is very high. • CPU speed has improved continuously and has now reached 2 GHz. However, the throughput and access speed of memory (e.g., SDRAM) remain about the same as a few years ago. • Therefore, the CPU now has to wait, and waste more clock cycles, to access a word in memory. • The relative cost of moving data keeps increasing, and memory becomes the performance bottleneck. • Therefore, the goal is to minimize the need for moving data or to hide the cost of moving data.

Hide Memory Access Cost (1) • Scoreboarding processor (single instruction stream) • An instruction that loads data into a register does not wait for the data to come back from memory; instead, it marks the register as awaiting data. • The processor can then continue execution. • The processor needs to stop only if an instruction accesses the register before the memory access has completed.

Hide Memory Access Cost (2) • Super-scalar processor • Permits independent instructions to be executed in the same clock cycle (multiple instruction streams) • Therefore, an instruction that is loading data from memory can be executed in parallel with an instruction that does not need this data. • Both the scoreboarding and super-scalar methods help when reading a small amount of data. They are not very useful for reading a large amount of data. • Therefore, the operating system should be designed to minimize the number of times that a large amount of data has to be copied.

Host Memory Hierarchy Good cache performance depends on good locality. However, networking code often violates the locality assumptions. Example: when a packet arrives, it interrupts the execution of the processor. This forces the processor to load new instructions. Furthermore, because the data of the packet is not in the data cache, it needs to be fetched from memory.

The Problem with Layered Code • Layering is a useful concept that lets network researchers work in parallel on different aspects of a networking problem. • However, an implementation of a protocol stack based on strict layering often performs badly. • Because the upper layer does not know which format the lower layer wants, packets copied into the lower layer often need to be reformatted and recopied. • Nowadays, more and more implementations violate the layering concept for higher performance. • E.g., content-aware (URL) Web switching at an IP router.

Reduce Memory Copy Operations • Currently, on a UNIX host, two data copy operations are needed to move data from an application to the network interface: • Application -> OS • OS -> network interface.

One-Copy Techniques • Virtual page remapping: • The first copy can be eliminated by using the virtual memory mechanism to map the pages used by the application onto the pages used by the buffer in the OS. • The buffers in the application must start and end on page boundaries for this mechanism to work. • Copy-on-write (COW): • The first copy can also be saved by COW. • If a packet needs to be copied from one domain to another, copy-on-write can be used to reduce or eliminate the copy operation. • New copies of the packet’s pages are made only when the packet is modified. Otherwise, the same pages are shared by the different domains.
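
The copy-on-write behaviour described above can be observed from user space. The sketch below is only an analogy: it uses Python's `mmap.ACCESS_COPY` on a temporary file rather than kernel buffers shared between protection domains, and the function name `cow_demo` is made up for this example.

```python
import mmap
import os
import tempfile


def cow_demo(payload: bytes) -> tuple:
    """Map a file copy-on-write, modify the mapping, and return
    (bytes seen through the mapping, bytes still in the file)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        # ACCESS_COPY: pages are shared with the file until they are written;
        # a write gives this mapping its own private copy of the touched page.
        view = mmap.mmap(fd, 0, access=mmap.ACCESS_COPY)
        try:
            view[0:4] = b"EDIT"            # triggers the private copy
            seen = view[:len(payload)]
        finally:
            view.close()
        with open(path, "rb") as f:
            on_disk = f.read()             # the shared original is untouched
        return seen, on_disk
    finally:
        os.close(fd)
        os.unlink(path)
```

As long as nobody writes, both sides keep reading the same pages; the copy cost is paid only on modification, and only for the pages that are touched.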

One-Copy Techniques • Memory-mapped buffer: • The second copy can be eliminated by mapping the memory on the NIC to a part of the system memory. • The OS can then use the mapped system memory as its buffer area. • Therefore, when the application’s data is copied to the buffer in the OS, it is effectively copied into the memory on the NIC. • (According to the PCI specification, memory on a PCI card can be mapped to a part of the system memory.)

Zero-Copy Technique • Memory-mapped buffer + virtual page remapping: • We can first map the NIC’s buffer to the buffer in the OS (PCI hardware map operation). • We can then map the buffer in the application to the buffer in the OS (OS software map operation). • The buffer in the application is then mapped to the NIC’s buffer. • This results in a zero-copy operation. • Although zero-copy is good from a network-performance viewpoint, it is very difficult for applications to use, • because the application now needs to know hardware details that the OS should hide.

DMA Technique • To avoid the data copy operation between the OS and the NIC, we can use DMA instead of normal programmed I/O (PIO). • Using DMA, a NIC can transfer data directly from/to memory without involving the CPU. This enables the CPU to execute in parallel with the data transfer. (However, the CPU may still be stalled.) • Generally, DMA performs better than PIO. However, there are some situations where PIO is preferred (e.g., when checksumming the data). • Scatter-gather capability in a DMA-based NIC is important because it can avoid data copy operations.
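
Scatter-gather on the NIC cannot be shown from user space, but the same idea exists at the system-call level. The sketch below is an analogy under that assumption: it uses `socket.sendmsg` and `socket.recvmsg_into` (available on Unix-like systems) over a local socket pair, and the function name `send_gathered` is made up for this example.

```python
import socket


def send_gathered(header: bytes, payload: bytes) -> tuple:
    """Send header and payload as one message without joining them first,
    then receive them into two separate buffers."""
    left, right = socket.socketpair()
    try:
        # Gather: sendmsg takes a list of buffers, so user code never has to
        # build header + payload in a new contiguous buffer.
        left.sendmsg([header, memoryview(payload)])
        # Scatter: the incoming bytes are spread over preallocated buffers.
        hdr_buf = bytearray(len(header))
        body_buf = bytearray(len(payload))
        right.recvmsg_into([hdr_buf, body_buf])
        return bytes(hdr_buf), bytes(body_buf)
    finally:
        left.close()
        right.close()
```

A scatter-gather NIC applies the same principle one level down: the driver hands it a list of (address, length) pairs, and the hardware assembles the frame from them.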

Buffer Editing • To support Gbps, the design of a buffer system should allow buffers to be created, clipped, shared, split, concatenated, and destroyed with little overhead. • Otherwise, a packet may need to be copied to a new buffer again and again while traversing the layers of a protocol stack. • E.g., as a packet goes down a protocol stack when it is sent, more and more headers are prepended to it; as it goes up the stack when it is received, those headers are stripped from it. • Generally, list or tree structures are used as the data structure to support these operations easily. • E.g., the mbuf used in BSD UNIX.
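
A minimal sketch of such a buffer structure, loosely inspired by the mbuf idea but not the BSD implementation. The class name `BufferChain` and its methods are hypothetical; a packet is kept as a list of `memoryview` segments so that prepending, stripping, and concatenating never move the payload bytes.

```python
class BufferChain:
    """A packet as an ordered list of memoryview segments (mbuf-like sketch)."""

    def __init__(self, payload: bytes = b""):
        self.segments = [memoryview(payload)] if payload else []

    def prepend(self, header: bytes) -> None:
        """Add a header in front; existing segments are not copied."""
        self.segments.insert(0, memoryview(header))

    def strip(self, nbytes: int) -> None:
        """Drop nbytes from the front by adjusting views, not moving data."""
        while nbytes > 0 and self.segments:
            first = self.segments[0]
            if len(first) <= nbytes:
                nbytes -= len(first)
                self.segments.pop(0)
            else:
                self.segments[0] = first[nbytes:]  # slicing a view copies nothing
                nbytes = 0

    def concat(self, other: "BufferChain") -> None:
        """Append another chain by linking its segments."""
        self.segments.extend(other.segments)

    def __len__(self) -> int:
        return sum(len(s) for s in self.segments)

    def tobytes(self) -> bytes:
        """Flatten for inspection only; this is the one place that copies."""
        return b"".join(s.tobytes() for s in self.segments)
```

On the send path each layer calls `prepend` with its header; on the receive path each layer calls `strip` with its header length. In both directions the payload stays where it is.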

API Design • The design of the application program interface (API) can significantly affect the performance of passing data between the user application and the OS. • Currently, the read() and write() system calls provided on UNIX allow the user to choose a buffer with arbitrary address, size, and alignment, and give the user unconstrained access to that buffer. • This makes it difficult for the OS to avoid the data copy operation between the application and the OS. • Suppose that, instead, UNIX required that the buffer start and end on page boundaries and that its length be a multiple of the page size. Then the copy-on-write technique could be used to avoid one copy operation.

Data Manipulation • Data manipulations are computations that inspect and possibly modify every word of data in a network packet. • E.g., encryption, compression, checksumming, presentation conversion, etc. • Typically, different network layers manipulate data independently of each other. • Each data manipulation requires the CPU to load potentially uncached data from memory and store the inspected/modified data back to memory. • Therefore, the same data has to cross the CPU/memory data path multiple times, which lowers the maximum achievable throughput.

Integrated Layer Processing • The integrated layer processing (ILP) technique can be used to minimize the number of data transfers. • The data manipulation steps from different protocol layers are combined into a pipeline. • A word of data is loaded into a register, then manipulated by multiple manipulation layers while it remains in the register, then finally stored, all before the next word of data is processed. • In this way, a combined series of data manipulations transfers data from memory to the CPU and back only once, instead of once per distinct layer. • The difficulty is that different manipulations cannot be easily integrated.
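
The loop structure of ILP can be sketched as follows. This is an illustration only: Python cannot show the register-level benefit, the XOR with a fixed 16-bit `key` is a made-up stand-in for a real per-word transformation such as encryption, and the input is assumed to have an even number of bytes. The two functions produce the same result, but `layered` walks the data twice while `integrated` walks it once.

```python
import struct


def fold16(total: int) -> int:
    """Fold carries back into 16 bits (ones'-complement addition)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def layered(data: bytes, key: int) -> tuple:
    """Two independent passes: checksum first, then XOR-transform."""
    words = struct.unpack(f"!{len(data) // 2}H", data)
    checksum = fold16(sum(words))        # pass 1 reads every word
    out = [w ^ key for w in words]       # pass 2 reads every word again
    return struct.pack(f"!{len(out)}H", *out), checksum


def integrated(data: bytes, key: int) -> tuple:
    """One pass: each word is loaded once and handed to both manipulations."""
    words = struct.unpack(f"!{len(data) // 2}H", data)
    total = 0
    out = []
    for w in words:                      # single traversal of the data
        total += w                       # checksum step
        out.append(w ^ key)              # transform step
    return struct.pack(f"!{len(out)}H", *out), fold16(total)
```

The integration is easy here because both steps work on the same 16-bit unit and neither depends on the other's output; manipulations with different block sizes or ordering constraints are what make ILP hard in practice.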

Copy-Avoiding Techniques Relationship

NIC to NIC Transfers • So far, we have discussed how to reduce the number of copy operations required for sending data from the user application, through the OS, to the NIC. • Here, we discuss methods that reduce the number of copy operations required for forwarding data from one NIC to another NIC (i.e., when the system functions as a routing or switching device).

Techniques for NIC-to-NIC (1) • Hardware streaming (peer-to-peer) • The maximum achievable forwarding throughput is the I/O bus bandwidth. • The problems are that special hardware is required and the OS has no chance to inspect/modify packets. • As a result, some processing (e.g., routing table lookup) needs to be performed by the CPU on the NIC. • However, for economic reasons, the CPUs on NICs are often much slower than the CPU of the host system. • DMA-DMA streaming • The maximum achievable throughput is only ½ of the I/O bus bandwidth, because every packet crosses the bus twice (NIC to memory, then memory to NIC). • However, packets can be inspected/modified by the OS. • E.g., routing table lookup

Techniques for NIC-to-NIC (2) • OS kernel streaming • Packets are first DMA’ed into memory. • Packets are then read from memory into the CPU for inspection or modification. • Depending on the number of inspections/modifications and the read/write throughput of the memory system, the maximum achievable forwarding throughput is further limited to (memory throughput / number of reads or writes). • User-level streaming • In some applications, packets may need to go up to the user level for inspection/modification. • E.g., a Web proxy system, an email relay system, NATD • The throughput will be limited even further.
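
The throughput bounds for the forwarding techniques above can be written down as a small calculation. This is a sketch of the slides' rules of thumb, not a measurement model; the function names and the example figures in the note below are assumptions.

```python
def hardware_streaming(bus_bps: float) -> float:
    """Peer-to-peer: each packet crosses the I/O bus once."""
    return bus_bps


def dma_dma_streaming(bus_bps: float) -> float:
    """NIC -> memory -> NIC: each packet crosses the I/O bus twice."""
    return bus_bps / 2


def kernel_streaming(bus_bps: float, mem_bps: float, accesses: int) -> float:
    """DMA-DMA bound, further capped by memory throughput divided by the
    number of reads or writes made over each packet."""
    bound = dma_dma_streaming(bus_bps)
    if accesses > 0:
        bound = min(bound, mem_bps / accesses)
    return bound
```

For example, with a hypothetical 1056 Mbps bus and 800 Mbps of sustained memory throughput, four reads or writes per packet cap kernel streaming at 200 Mbps, well below the 528 Mbps that the bus alone would allow.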

Latency of Small Packets • For large packets, we care about the cost of copying them (i.e., transfer throughput). • For small packets, however, what we care about is the latency of their transmission. • The following three interactions between the processor and memory can affect latency: • Branch misses • Context switching • Interrupts

Branch Misses • To keep the instruction pipeline full (some processors have up to 13 stages), most processors today fetch instructions continuously. • Conditional branches, however, present a problem because the target instruction cannot be determined until the condition result has been computed. • If the CPU waits for the condition test to complete before fetching the next instruction, the pipeline cannot be full most of the time. This results in low CPU utilization. • To solve this problem, most processors today try to predict the next instruction to execute. • If the guess is wrong, the instructions that have already been fetched need to be abandoned. This also results in low CPU utilization. • Do not put too many if-then-else branches in your networking code.

Context Switches • Context switches are very expensive because they require both new code and new data to be fetched from slow memory and loaded into the processor cache. • In a perfect system, no more than one context switch should be needed to send a packet, and one context switch plus an interrupt to receive a packet. • In a micro-kernel OS, sending and receiving a packet needs more context switches than in a traditional UNIX kernel (because the packet needs to traverse the application program, network server, and micro-kernel domains). • For a high-speed system, macros are preferred over function calls, and function calls are preferred over threads (which need their PC and stack saved) for processing an incoming packet.

Interrupts • Interrupts are very expensive. • They cause context switches, which in turn cause many code and data cache misses. • Sometimes the processor needs to switch from user mode to privileged mode when an interrupt occurs. Changing mode, however, is a very slow operation. • One solution is to minimize the number of interrupts. • Do not issue a receive interrupt for every incoming packet. Issue an interrupt only when a certain number of packets have been received or a timer has expired. • When a receive interrupt occurs, the device driver retrieves and processes as many packets from the NIC as possible. • Do not issue a transmit interrupt for every sent packet. Issue a transmit interrupt only after a certain number of packets have been sent or a timer has expired.
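
The effect of the "certain number of packets or a timer" rule can be sketched with a small simulation. The function name, the parameters `max_packets` and `timeout`, and the choice to start the timer at the first pending packet are assumptions made for this example; real NICs differ in the details.

```python
def count_interrupts(arrivals, max_packets, timeout):
    """Count receive interrupts under coalescing.

    arrivals: sorted packet arrival times (any consistent time unit).
    An interrupt fires when max_packets are pending, or when timeout has
    passed since the first pending packet arrived, whichever comes first.
    """
    interrupts = 0
    pending = 0
    deadline = None
    for t in arrivals:
        # The timer may have expired before this packet arrived.
        if pending and t >= deadline:
            interrupts += 1
            pending = 0
        if pending == 0:
            deadline = t + timeout   # timer starts with the first pending packet
        pending += 1
        if pending >= max_packets:
            interrupts += 1
            pending = 0
    if pending:                      # the timer eventually flushes the remainder
        interrupts += 1
    return interrupts
```

With 100 packets arriving 1000 time units apart, `max_packets=1` gives 100 interrupts, while `max_packets=10` with a long timeout gives 10. The timer bounds the extra latency that batching adds when traffic is light.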

Receive Livelock Problem • Can happen in an interrupt-driven kernel. • This problem happens when packets arrive at the system at a high rate. • When it occurs, the system spends all of its time processing interrupts, to the exclusion of other necessary tasks. • The result is that, under extreme conditions, no packets can be delivered to the user application or to the output of the system. • To avoid this problem, tasks and interrupts must be carefully scheduled.

Techniques to Avoid Livelock • Limit the interrupt arrival rate • For example, when the ipintr queue is about to become full and packets are about to be dropped, we can temporarily disable interrupts. Interrupts can be re-enabled when the buffer occupancy of the ipintr queue drops below a certain threshold. • Use of polling • Poll each NIC at a fixed rate. This can limit the packet processing rate and also provide fair resource allocation among multiple interfaces. • Avoiding preemption • Let higher-level protocol processing (e.g., TCP/IP) be executed at the same priority level as that used by an interrupt service routine.
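
A minimal sketch of the polling idea, assuming a per-interface quota per polling round. The function name `poll_round`, the `quota` parameter, and the use of in-memory queues in place of NIC receive rings are all hypothetical.

```python
from collections import deque


def poll_round(queues, quota):
    """Poll every NIC queue once, taking at most quota packets from each.

    queues: dict mapping a NIC name to a deque of pending packets.
    Returns the packets handled in this round, in processing order.
    """
    handled = []
    for name, q in queues.items():
        for _ in range(min(quota, len(q))):
            handled.append((name, q.popleft()))
    return handled
```

Because each round handles at most `quota` packets per interface, the work per round is bounded and a flooded interface cannot starve the others. Packets beyond the quota wait in the queue, or are dropped at the NIC, instead of consuming CPU time as interrupts.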
