
Improving IPC by Kernel Design

Jochen Liedtke

Proceedings of the 14th ACM Symposium on Operating Systems Principles

Asheville, North Carolina, 1993



The Performance of µ-Kernel-Based Systems

H. Haertig, M. Hohmuth, J. Liedtke, S. Schoenberg, J. Wolter

Proceedings of the 16th Symposium on Operating Systems Principles

October 1997, pp. 66-77

Jochen Liedtke (1953 – 2001)

  • 1977 – Diploma in Mathematics from the University of Bielefeld.

  • 1984 – Moved to GMD (German National Research Center). Built L3; known for overcoming IPC performance hurdles.

  • 1996 – IBM T. J. Watson Research Center. Developed L4, a 12 KB second-generation microkernel.

The IPC Dilemma

  • IPC is a core paradigm of µ-kernel architectures

  • Most IPC implementations perform poorly

  • Very fast message passing is needed to run device drivers and other performance-critical components at user level.

  • Result: programmers circumvent IPC, co-locating device drivers in the kernel and defeating the main purpose of the microkernel architecture

What to Do?

  • Optimize IPC performance above all else!

  • Result: L3 and L4, second-generation microkernel-based operating systems

  • Many clever optimizations, but no single “silver bullet”

Summary of Techniques

Seventeen techniques in total

Standard System Calls (Send/Recv)

With plain send/receive primitives, a client–server round trip looks like this:

  • Client (sender): send( ), a system call; enter kernel, exit kernel. The client is not blocked and must later invoke receive( ) for the reply.

  • Server (receiver): receive( ), a system call; enter kernel, exit kernel.

  • Server: send( ) for the reply; enter kernel, exit kernel.

  • Client: receive( ) for the reply; enter kernel, exit kernel.

Kernel entered/exited four times per call!
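The crossing count can be made concrete with a toy cost model (the names and counting are illustrative, not L3's real interface): each system call costs one kernel entry/exit pair, so an RPC assembled from separate send and receive calls costs four.

```c
/* Toy cost model of the classic send/receive protocol.  Each system
 * call enters and exits the kernel exactly once; an RPC built from
 * separate send and receive calls therefore crosses the user/kernel
 * boundary four times.  All names are illustrative. */

static int crossings;               /* kernel entry/exit pairs so far */

static void sys_send(void)    { crossings++; }   /* one enter + exit */
static void sys_receive(void) { crossings++; }   /* one enter + exit */

/* One client->server round trip with the standard primitives. */
int rpc_with_send_recv(void)
{
    crossings = 0;
    sys_send();        /* client sends the request  */
    sys_receive();     /* server picks it up        */
    sys_send();        /* server sends the reply    */
    sys_receive();     /* client picks the reply up */
    return crossings;
}
```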

New Call/Response-based System Calls

Special system calls for RPC-style interaction

Kernel entered and exited only twice per call!

  • Client invokes call( ): enter kernel; the kernel allocates the CPU directly to the server.

  • Server resumes from being suspended, exits the kernel, and handles the message.

  • Server invokes reply_and_recv_next( ): enter kernel; the kernel sends the reply, reallocates the CPU to the client, and leaves the server waiting for the next message.

  • Client exits the kernel with the reply.
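Extending the same toy cost model to the combined primitives (names follow the slides; the counting is a sketch, not kernel code) shows the saving: one round trip now costs two kernel entry/exit pairs instead of four.

```c
/* Toy cost model of the combined RPC primitives.  call() enters the
 * kernel once and returns only when the reply is ready;
 * reply_and_recv_next() folds the server's reply into its wait for
 * the next request.  One round trip: two kernel entry/exit pairs. */

static int crossings2;              /* kernel entry/exit pairs so far */

static void sys_call(void)                { crossings2++; }
static void sys_reply_and_recv_next(void) { crossings2++; }

/* One client->server round trip with the call/response primitives. */
int rpc_with_call_reply(void)
{
    crossings2 = 0;
    sys_call();                  /* client: send request + wait for reply */
    sys_reply_and_recv_next();   /* server: send reply + wait for next    */
    return crossings2;
}
```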

Complex Message Structure

Batching IPC

Combine a sequence of send operations into a single operation by supporting complex messages

  • Benefit: reduces number of sends.
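A complex message can be sketched as a scatter/gather descriptor: several discontiguous parts are described once and handed to a single send. The struct layout below is an assumption for illustration (modeled on iovec-style gather I/O), not L3's actual message format.

```c
#include <string.h>

/* One part of a complex message: a pointer and a length.
 * Illustrative layout, not L3's real message structure. */
struct msg_part {
    const void *base;
    size_t      len;
};

/* Gather all parts into one destination buffer in a single
 * (simulated) IPC operation; returns total bytes transferred.
 * The point: one send replaces nparts separate sends. */
size_t send_complex(char *dst, const struct msg_part *parts, int nparts)
{
    size_t off = 0;
    for (int i = 0; i < nparts; i++) {
        memcpy(dst + off, parts[i].base, parts[i].len);
        off += parts[i].len;
    }
    return off;
}
```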

Direct Transfer by Temporary Mapping

  • Naïve message transfer: copy from sender to kernel then from kernel to receiver

  • Optimizing transfer by sharing memory between sender and receiver is not secure

  • L3 supports single-copy transfers by temporarily mapping the receiver's buffer into a communication window in the sender's address space.
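The copy-count argument can be sketched in user-level C. Here the "communication window" is just a pointer; the real mechanism manipulates page tables inside the kernel and is invisible to user code.

```c
#include <string.h>

int copies;                           /* memcpy count, for comparison */

static void counted_copy(void *d, const void *s, size_t n)
{
    memcpy(d, s, n);
    copies++;
}

/* Naive two-copy transfer through an intermediate kernel buffer. */
void transfer_two_copy(char *recv_buf, const char *send_buf, size_t n)
{
    char kernel_buf[256];
    copies = 0;
    counted_copy(kernel_buf, send_buf, n);   /* sender -> kernel   */
    counted_copy(recv_buf, kernel_buf, n);   /* kernel -> receiver */
}

/* Single-copy transfer through a temporarily mapped window.
 * The pointer stands in for the kernel's temporary mapping. */
void transfer_one_copy(char *recv_buf, const char *send_buf, size_t n)
{
    char *window = recv_buf;
    copies = 0;
    counted_copy(window, send_buf, n);       /* sender -> receiver */
}
```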

Scheduling

  • Conventionally, IPC operations such as call or reply-and-receive require four scheduling actions:

    • Delete the sending thread from the ready queue.

    • Insert the sending thread into the waiting queue.

    • Delete the receiving thread from the waiting queue.

    • Insert the receiving thread into the ready queue.

  • These operations, together with four expected TLB misses, take at least 1.2 µs (23% of the total IPC time T).

Solution: Lazy Scheduling

  • Don’t bother updating the scheduler queues!

  • Instead, delay the movement of threads among queues until the queues are queried.

  • Why?

    • A sending thread that blocks will soon unblock again, and maybe nobody will ever notice that it blocked

  • Lazy scheduling is achieved by setting state flags (ready / waiting) in the Thread Control Blocks
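A minimal sketch of the idea, with an illustrative TCB layout (not L3's): blocking only flips a state flag, and stale entries are dropped from the ready list later, when the scheduler actually walks it.

```c
/* Lazy scheduling sketch: blocking is a flag write, not list surgery.
 * The ready list is cleaned up only when it is queried.  Data
 * structures are illustrative, not L3's real TCB layout. */

enum tcb_state { READY, WAITING };

struct tcb {
    enum tcb_state state;
    struct tcb    *next;       /* singly linked ready list */
};

/* Blocking and unblocking touch only the TCB flag. */
void block(struct tcb *t)   { t->state = WAITING; }
void unblock(struct tcb *t) { t->state = READY;   }

/* Lazily drop stale entries from the head of the ready list while
 * searching for the next runnable thread. */
struct tcb *pick_next_ready(struct tcb **head)
{
    while (*head && (*head)->state != READY)
        *head = (*head)->next;
    return *head;
}
```

A thread that blocks and unblocks again before the list is queried never pays for removal and reinsertion at all, which is exactly the common IPC case.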

Pass Short Messages in Registers

  • Most messages are very short: 8 bytes of data (plus 8 bytes of sender id).

    • E.g., ack/error replies from device drivers, or hardware-initiated interrupt messages.

  • Transfer short messages via CPU registers.

  • Performance gain of 2.4 µs, or 48% of T.
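The register path can be sketched as follows: a short message (sender id plus payload) fits in two machine words, so the kernel can hand it over without touching memory buffers. The struct here is a stand-in for a register pair, not a real calling convention.

```c
#include <stdint.h>

/* A short message: 8 bytes of sender id plus 8 bytes of payload,
 * i.e. two machine words on a 64-bit CPU.  Illustrative only. */
struct short_msg {
    uint64_t sender_id;   /* would travel in one register  */
    uint64_t payload;     /* would travel in another       */
};

/* "Delivery" is just moving two words: no message buffer is copied
 * and no communication window needs to be mapped. */
struct short_msg deliver_in_registers(uint64_t sender_id, uint64_t payload)
{
    struct short_msg m = { sender_id, payload };
    return m;
}
```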

Impact on IPC Performance

  • For an eight-byte message, IPC time for L3 is 5.2 µs, compared to 115 µs for Mach: a 22-fold improvement.

  • For large messages (4 KB), a 3-fold improvement is seen.

Relative Importance of Techniques

  • Quantifiable impact of techniques

    • A figure of 49% means that removing that technique would increase IPC time by 49%.

Conclusion

  • Use a synergistic approach to improve IPC performance

    • A thorough understanding of hardware/software interaction is required

    • no “silver bullet”

  • IPC performance can be improved by a factor of 10

  • … but even so, a micro-kernel-based OS will not be as fast as an equivalent monolithic OS

    • L4-based Linux outperforms Mach-based Linux, but not monolithic Linux