CS533 Concepts of Operating Systems Class 6

Micro-kernels: Mach vs L3 vs L4

Binary Compatibility • Emulation libraries • Trampoline mechanism • Single server architecture • Multi-server architecture • IPC overhead proportional to number of servers (independent protection domains) CS533 - Concepts of Operating Systems

Optimizing IPC • Liedtke argues Mach’s overhead is due to poor implementation! • Optimized IPC implementation in L3 • Architectural level • System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks. • Algorithmic level • Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages. • Interface level • Unnecessary Copies, Parameter passing. • Coding level • Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.

L3 IPC Performance vs Mach IPC

L3 RPC Performance vs Previous Systems

But Is That Enough? • What is the impact on overall system performance? • Haertig et al explore performance and extensibility of L4-based Linux OS vs Mach-based Linux and native Linux • L4 has even more IPC optimizations than L3!

L4Linux – Design & Implementation • Fully binary compliant with Linux/X86 • Restricted modifications to architecture-dependent part of Linux • No Linux-specific modifications to L4 kernel

Experiment • What is the penalty of using L4Linux? Compare L4Linux to native Linux • Does the performance of the underlying micro-kernel matter? Compare L4Linux to MkLinux • Does co-location improve performance? Compare L4Linux to an in-kernel version of MkLinux

Microbenchmarks • Measured system call overhead using the shortest system call, “getpid()”

Microbenchmarks (cont.) • Measures specific system calls to determine basic performance.

Macrobenchmarks • Measured the time to recompile the Linux server

Macrobenchmarks (cont.) • Next, used a commercial test suite to simulate a system under full load.

Performance Analysis • L4Linux is, on average, 8.3% slower than native Linux, and only 6.8% slower at maximum load. • MkLinux: 49% slower on average, 60% at maximum load. • Co-located MkLinux: 29% slower on average, 37% at maximum load.

Conclusion? • Can hardware-based protection be made to work efficiently enough? • Did these experiments explore the cost of “fine grained” protection?

Spare Slides

The IPC Dilemma • IPC is very important in μ-kernel design • Increases modularity, flexibility, security and scalability. • Past implementations have been inefficient. • Message transfer takes 50 - 500 μs.

The L3 (μ-kernel based) OS • A task consists of: • Threads • Communicate via messages that consist of strings and/or memory objects. • Dataspaces • Memory objects. • Address space • Where dataspaces are mapped.

Redesign Principles • IPC performance is the Master. • All design decisions require a performance discussion. • If something performs poorly, look for new techniques. • Synergetic effects have to be taken into consideration. • The design has to cover all levels from architecture down to coding. • The design has to be made on a concrete basis. • The design has to aim at a concrete performance goal.

Achievable Performance • A simple scenario • Thread A sends a null message to thread B • Minimum of 172 cycles • Will aim at 350 cycles (7 μs) • Will actually achieve 250 cycles (5 μs)
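
The cycle counts and times on this slide are consistent with a 50 MHz clock (350 cycles in 7 μs). The clock rate is not stated on the slide, so it is treated as an assumption in this short sketch of the conversion:

```python
CLOCK_MHZ = 50  # assumption: inferred from "350 cycles (7 us)" on the slide


def cycles_to_us(cycles, clock_mhz=CLOCK_MHZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz


for label, cycles in [("minimum", 172), ("goal", 350), ("achieved", 250)]:
    print(f"{label}: {cycles} cycles = {cycles_to_us(cycles):.2f} us")
```

Under that assumption the 172-cycle minimum corresponds to roughly 3.4 μs, so the achieved 250 cycles sits within about 1.5x of the lower bound.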

Levels of the redesign • Architectural • System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks. • Algorithmic • Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages. • Interface • Unnecessary Copies, Parameter passing. • Coding • Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.

Architectural Level • System Calls • Expensive! So, require as few as possible. • Implement two calls: • Call • Reply & Receive Next • Combines sending an outgoing message with waiting for an incoming message. • Schedulers can handle replies the same as requests.
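
To see why the two combined calls matter, it helps to count kernel entries per RPC round trip. The sketch below assumes a baseline with separate send and receive primitives (client send, server receive, server send, client receive); the function names are illustrative and not part of L3.

```python
def kernel_entries_separate(round_trips):
    """Separate primitives: client send + client receive,
    server receive + server send, for every round trip."""
    return 4 * round_trips


def kernel_entries_combined(round_trips):
    """Combined primitives: one client Call and one server
    Reply & Receive Next per round trip (steady state)."""
    return 2 * round_trips


for n in (1, 1000):
    print(n, kernel_entries_separate(n), kernel_entries_combined(n))
```

In steady state the combined calls halve the number of user/kernel crossings, which is the saving the slide is pointing at.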

Messages (figure: a complex message) • Complex Messages: • Direct String, Indirect Strings (optional) • Memory Objects • Used to combine sends if no reply is needed. • Can transfer values directly from sender’s variable to receiver’s variables.

Direct Transfer (figure labels: User A, Kernel, User B) • Each address space has a fixed kernel accessible part. • Messages transferred via the kernel part • User A space -> Kernel -> User B space • Requires 2 copies. • Larger messages lead to higher costs

Shared User Level memory (LRPC, SRC RPC) • Security can be penetrated. • Cannot check message’s legality. • Long messages -> address space becoming a critical resource. • Explicit opening of communication channels. • Not application friendly.

Temporary Mapping (figure labels: User A, Kernel, User B) • L3 uses a Communication Window • Only kernel accessible, and exists per address space. • Target region is temporarily mapped there. • Then the message is copied to the communication window and ends up in the correct place in the target address space.

Temporary Mapping • Must be fast! • 2 level page table only requires one word to be copied. • pdir A -> pdir B • TLB must be clean of entries relating to the use of the communication window by other operations. • One thread • TLB is always “window clean”. • Multiple threads • Interrupts – TLB is flushed • Thread switch – Invalidate Communication window entries.
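
The temporary-mapping idea can be modelled in a few lines. This is a toy sketch, not L3 code: page directories are Python lists, the region behind each directory entry is a `bytearray`, and `WINDOW_SLOT`, `PDIR_ENTRIES` and `REGION_SIZE` are hypothetical values. TLB handling is left out entirely.

```python
PDIR_ENTRIES = 8   # toy page directory size (hypothetical)
WINDOW_SLOT = 7    # hypothetical kernel-only entry used as the communication window
REGION_SIZE = 16   # toy size of the region behind one directory entry


class AddressSpace:
    def __init__(self):
        self.pdir = [None] * PDIR_ENTRIES

    def add_region(self, slot):
        self.pdir[slot] = bytearray(REGION_SIZE)
        return self.pdir[slot]


def map_window(sender, receiver, target_slot):
    # Copying one directory entry makes the receiver's region
    # visible through the sender's communication window.
    sender.pdir[WINDOW_SLOT] = receiver.pdir[target_slot]


def unmap_window(sender):
    sender.pdir[WINDOW_SLOT] = None


def send(sender, src_slot, receiver, dst_slot, length):
    map_window(sender, receiver, dst_slot)
    window = sender.pdir[WINDOW_SLOT]
    # The only data copy: sender buffer -> window, which aliases
    # the receiver's memory.
    window[:length] = sender.pdir[src_slot][:length]
    unmap_window(sender)


a, b = AddressSpace(), AddressSpace()
a.add_region(0)[:5] = b"hello"
b.add_region(3)
send(a, 0, b, 3, 5)
print(bytes(b.pdir[3][:5]))
```

Compared with the two-copy path through a kernel buffer, the message is written once and lands directly in the target address space.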

Strict Process Orientation • Kernel mode handled in same way as User mode • One kernel stack per thread • May lead to a large number of stacks • Minor problem if stacks are objects in virtual memory

Thread Control Blocks (tcb’s) (figure labels: User area, Kernel area, tcb, Kernel stack) • Hold kernel, hardware, and thread-specific data. • Stored in a virtual array in shared kernel space.

Tcb Benefits • Fast tcb access • Saves 3 TLB misses per IPC • Threads can be locked by unmapping the tcb • Helps make threads persistent • IPC independent from memory management

Algorithmic Level • Thread IDs • L3 uses a 64 bit unique identifier (uid) containing the thread number. • Tcb address is easily obtained • ANDing the lower 32 bits with a bit mask and adding the tcb base address. • Virtual Queues • Busy queue, present queue, polling-me queue. • Unmapping the tcb includes removal from queues • Prevents page faults from parsing/adding/deleting from the queues.
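
The mask-and-add lookup can be sketched as follows. The bit layout, `TCB_BASE`, `TCB_SHIFT` and `THREAD_BITS` are assumptions chosen for illustration; the slide only states that the lower 32 bits are masked and the tcb base address is added.

```python
TCB_SHIFT = 11                  # hypothetical: tcbs are 2 KiB apart
THREAD_BITS = 10                # hypothetical: up to 1024 threads
TCB_MASK = ((1 << THREAD_BITS) - 1) << TCB_SHIFT
TCB_BASE = 0xE0000000           # hypothetical base of the virtual tcb array


def make_uid(thread_no, version, high=0):
    """Build a 64-bit uid whose low word holds the thread number,
    pre-shifted so that masking yields the tcb offset directly."""
    low = (thread_no << TCB_SHIFT) | (version & ((1 << TCB_SHIFT) - 1))
    return (high << 32) | low


def tcb_address(uid):
    """AND the lower 32 bits with a mask and add the tcb base."""
    low32 = uid & 0xFFFFFFFF
    return (low32 & TCB_MASK) + TCB_BASE


uid = make_uid(thread_no=3, version=5, high=0xABCD)
print(hex(tcb_address(uid)))
```

Because the thread number is stored already scaled by the tcb size, no multiplication or table lookup is needed on the IPC path.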

Algorithmic Level • Timeouts and Wakeups • Operation fails if message transfer has not started t ms after invoking it. • Kept in n unordered wakeup lists. • A new thread’s tcb is linked into the list τ mod n. • Threads with wakeups far away are kept in a long time wakeup list and reinserted into the normal lists when time approaches. • Scheduler will only have to check k/n entries per clock interrupt. • Usually costs less than 4% of ipc time.
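
A minimal sketch of the n unordered wakeup lists, assuming wakeup times are expressed in whole clock ticks and that no tick is skipped. The class and method names are illustrative, and the separate long-time wakeup list is omitted.

```python
class WakeupLists:
    def __init__(self, n):
        self.n = n
        self.lists = [[] for _ in range(n)]

    def insert(self, thread, wakeup_time):
        # A thread with wakeup time tau goes into list (tau mod n).
        self.lists[wakeup_time % self.n].append((wakeup_time, thread))

    def tick(self, now):
        # Only one of the n lists is inspected per clock interrupt,
        # so about k/n of the k pending entries are examined.
        bucket = self.lists[now % self.n]
        due = [thread for (t, thread) in bucket if t <= now]
        bucket[:] = [(t, thread) for (t, thread) in bucket if t > now]
        return due


w = WakeupLists(n=8)
w.insert("A", 5)
w.insert("B", 13)   # same list as A (13 mod 8 == 5), but due later
w.insert("C", 6)
print(w.tick(5), w.tick(6), w.tick(13))
```

Insertion is constant time because the lists are unordered; the cost is moved to the per-tick scan, which touches only one list.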

Algorithmic Level • Lazy Scheduling • Only a thread state variable is changed (ready/waiting). • Deletion from queues happens when queues are parsed. • Reduces delete operations. • Reduces insert operations when a thread needs to be inserted that hasn’t been deleted yet.
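
Lazy scheduling can be illustrated with a toy ready queue. This is a sketch under simplifying assumptions (a single round-robin queue, hypothetical class names), not the L3 data structure: blocking only flips the state, and stale entries are dropped when the queue is next parsed.

```python
from collections import deque


class Thread:
    def __init__(self, name):
        self.name = name
        self.state = "waiting"
        self.queued = False     # is the thread physically in the ready queue?


class LazyReadyQueue:
    def __init__(self):
        self.queue = deque()
        self.inserts = 0
        self.deletes = 0

    def block(self, t):
        t.state = "waiting"     # no queue operation on the IPC path

    def unblock(self, t):
        t.state = "ready"
        if not t.queued:        # skip the insert if never removed
            self.queue.append(t)
            t.queued = True
            self.inserts += 1

    def pick_next(self):
        # Stale (waiting) entries are deleted only while parsing.
        while self.queue:
            t = self.queue[0]
            if t.state == "ready":
                self.queue.rotate(-1)   # round robin: move to the back
                return t
            self.queue.popleft()
            t.queued = False
            self.deletes += 1
        return None


rq = LazyReadyQueue()
a, b = Thread("a"), Thread("b")
rq.unblock(a)
rq.unblock(b)
rq.block(a)
rq.unblock(a)                   # a was never removed, so no second insert
print(rq.inserts, rq.deletes)
```

A thread that blocks and becomes ready again between two scheduler passes, which is the common case in a fast RPC, costs no queue operations at all.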

Algorithmic Level • Short messages via registers • Register transfers are fast • 50-80% of messages ≤ 8 bytes • Messages of up to 8 bytes can be transferred in registers with a decent performance gain. • May not pay off for other processors.
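
As a rough illustration of the register path, the sketch below packs a message of up to 8 bytes into two 32-bit values standing in for general registers. The helper names and the little-endian layout are assumptions; longer messages would take the memory-copy path, which is not modelled here.

```python
import struct

MAX_REGISTER_BYTES = 8  # two 32-bit registers


def to_registers(msg):
    """Return two 32-bit register values, or None if the message
    is too long for the register path."""
    if len(msg) > MAX_REGISTER_BYTES:
        return None
    padded = msg.ljust(MAX_REGISTER_BYTES, b"\0")
    return struct.unpack("<II", padded)


def from_registers(regs, length):
    """Rebuild the message bytes on the receiving side."""
    return struct.pack("<II", *regs)[:length]


regs = to_registers(b"ping")
print(regs, from_registers(regs, 4))
```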

Interface Level • Unnecessary Copies • Message objects grouped by types • Send/receive buffers structured in the same way • Use same variable for sending and receiving • Avoid unnecessary copies • Parameter Passing • Use registers whenever possible. • Far more efficient • Gives compilers better opportunities to optimize code.

Coding Level • Cache Misses • Cache line fill sequence should match the usual data access sequence. • TLB Misses • Try to pack into one page: • Ipc related kernel code • Processor internal tables • Start/end of larger tables • Most heavily used entries

Coding Level • Registers • Segment register loading is expensive. • One flat segment covering the complete address space. • On entry, kernel checks if registers contain the flat descriptor. • Guarantees they contain it when returning to user level. • Jumps and Checks • Basic code blocks should be arranged so that as few jumps are taken as possible. • Process switch • Save/restore of stack pointer and address space only invoked when really necessary.

L4 Slides

Introduction • μ-kernels have a reputation for being too slow and inflexible • Can a 2nd generation μ-kernel (L4) overcome these limitations? • Experiment: • Port Linux to run on L4 • Compare to native Linux and MkLinux (Linux on a 1st generation, Mach 3.0 derived μ-kernel)

Introduction (cont.) • Test speed of standard OS personality on top of fast μ-kernel: Linux implemented on L4 • Test extensibility of system: • pipe-based communication implemented directly on μ-kernel • mapping-related OS extensions implemented as user tasks • user-level real-time memory management implemented • Test whether L4 abstractions are independent of the platform

L4 Essentials • Based on threads and address spaces • Recursive construction of address spaces by user-level servers • Initial address space σ0 represents physical memory • Basic operations: granting, mapping, and unmapping. • Owner of address space can grant or map page to another address space • All address spaces maintained by user-level servers (pagers)

L4Linux – Design & Implementation • Address Spaces • Initial address space σ0 represents physical memory • Basic operations: granting, mapping, and unmapping. • L4 uses “flexpages”: logical memory ranging from one physical page up to a complete address space. • An invoker can only map and unmap pages that have been mapped into its own address space
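
The three address-space operations can be modelled with a toy mapping tree. This sketch works on single page numbers rather than flexpages, assumes the destination does not already hold the page, and uses hypothetical class and attribute names; it is meant only to show the recursive revocation and the "own pages only" rule.

```python
class AddressSpace:
    def __init__(self, name, pages=()):
        self.name = name
        self.pages = set(pages)
        self.children = {}   # page -> spaces that received it from us
        self.parent = {}     # page -> space we received it from

    def _check(self, page):
        # An invoker may only operate on pages in its own address space.
        if page not in self.pages:
            raise PermissionError(f"{self.name} does not hold page {page}")

    def map(self, page, dest):
        """Page becomes visible in dest and stays visible here."""
        self._check(page)
        dest.pages.add(page)
        dest.parent[page] = self
        self.children.setdefault(page, []).append(dest)

    def grant(self, page, dest):
        """Page moves to dest; dest takes our place in the mapping tree."""
        self._check(page)
        self.pages.discard(page)
        dest.pages.add(page)
        kids = self.children.pop(page, [])
        for kid in kids:
            kid.parent[page] = dest
        dest.children.setdefault(page, []).extend(kids)
        up = self.parent.pop(page, None)
        if up is not None:
            siblings = up.children[page]
            siblings[siblings.index(self)] = dest
            dest.parent[page] = up

    def unmap(self, page):
        """Revoke the page from every space that got it from us,
        directly or indirectly; we keep our own mapping."""
        self._check(page)
        for child in self.children.pop(page, []):
            child._revoke(page)

    def _revoke(self, page):
        for child in self.children.pop(page, []):
            child._revoke(page)
        self.pages.discard(page)
        self.parent.pop(page, None)


sigma0 = AddressSpace("sigma0", pages={0, 1})
pager = AddressSpace("pager")
app = AddressSpace("app")
sigma0.map(0, pager)
pager.map(0, app)
sigma0.unmap(0)          # revokes page 0 from pager and app
sigma0.grant(1, pager)   # page 1 now belongs to pager only
print(sorted(sigma0.pages), sorted(pager.pages), sorted(app.pages))
```

Because every mapping is derived from one further up the tree, starting at σ0, a pager can always reclaim what it handed out.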

L4Linux – Design & Implementation • Address Spaces (cont.) • I/O ports are parts of address spaces. • Hardware interrupts are handled by user-level processes. The L4 kernel will send a message via IPC.

L4Linux – Design & Implementation • The Linux server • L4Linux will use a single-server approach. • A single Linux server will run on top of L4, multiplexing a single thread for system calls and page faults. • The Linux server maps physical memory into its address space, and acts as the pager for any user processes it creates. • The Server cannot directly access the hardware page tables, and must maintain logical pages in its own address space.

L4Linux – Design & Implementation • Interrupt Handling • All interrupt handlers are mapped to messages. • The Linux server contains threads that do nothing but wait for interrupt messages. • Interrupt threads have a higher priority than the main thread.

L4Linux – Design & Implementation • User Processes • Each different user process is implemented as a different L4 task: Has its own address space and threads. • The Linux Server is the pager for these processes. Any fault by the user-level processes is sent by RPC from the L4 kernel to the Server.

L4Linux – Design & Implementation • System Calls • Three system call interfaces: • A modified version of libc.so that uses L4 primitives. • A modified version of libc.a • A user-level exception handler (trampoline) calls the corresponding routine in the modified shared library. • The first two options are the fastest. The third is maintained for compatibility.

L4Linux – Design & Implementation • Signalling • Each user-level process has an additional thread for signal handling. • Main server thread sends a message to the signal handling thread, telling the user thread to save its state and enter Linux

L4Linux – Design & Implementation • Scheduling • All thread scheduling is done by the L4 kernel • The Linux server’s schedule() routine is only used for multiplexing its single thread. • After each system call, if no other system call is pending, it simply resumes the user process thread and sleeps.
