html5-img
1 / 28

Improving IPC by Kernel Design

Improving IPC by Kernel Design. By Jochen Liedtke Presented by Chad Kienle. The IPC Dilemma. IPC is very import in μ -kernel design Increases modularity, flexibility, security and scalability. Past implementations have been inefficient. Message transfer takes 50 - 500 μ s.

eron
Download Presentation

Improving IPC by Kernel Design

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Improving IPC by Kernel Design By Jochen Liedtke Presented by Chad Kienle

  2. The IPC Dilemma • IPC is very import in μ-kernel design • Increases modularity, flexibility, security and scalability. • Past implementations have been inefficient. • Message transfer takes 50 - 500μs.

  3. The L3 (μ-kernel based) OS • A task consists of: • Threads • Communicate via messages that consist of strings and/or memory objects. • Dataspaces • Memory objects. • Address space • Where dataspaces are mapped.

  4. Redesign Principles • IPC performance is the Master. • All design decisions require a performance discussion. • If something performs poorly, look for new techniques. • Synergetic effects have to be taken into considerations. • The design has to cover all levels from architecture down to coding. • The design has to be made on a concrete basis. • The design has to aim at a concrete performance goal.

  5. Achievable Performance • A simple scenario • Thread A sends a null message to thread B • Minimum of 172 cycles • Will aim at 350 cycles (7 μs) • Will actually achieve 250 cycles (5 μs)

  6. Levels of the redesign • Architectural • System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks. • Algorithmic • Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages. • Interface • Unnecessary Copies, Parameter passing. • Coding • Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.

  7. Architectural Level • System Calls • Expensive! So, require as few as possible. • Implement two calls: • Call • Reply & Receive Next • Combines sending an outgoing message with waiting for an incoming message. • Schedulers can handle replies the same as requests.

  8. A Complex Message Messages • Complex Messages: • Direct String, Indirect Strings (optional) • Memory Objects • Used to combine sends if no reply is needed. • Can transfer values directly from sender’s variable to receiver’s variables.

  9. User A Kernel User B Direct Transfer • Each address space has a fixed kernel accessible part. • Messages transferred via the kernel part • User A space -> Kernel -> User B space • Requires 2 copies. • Larger Messages lead to higher costs

  10. Shared User Level memory (LRPC, SRC RPC) • Security can be penetrated. • Cannot check message’s legality. • Long messages -> address space becoming a critical resource. • Explicit opening of communication channels. • Not application friendly.

  11. User A Kernel User B Temporary Mapping • L3 uses a Communication Window • Only kernel accessible, and exists per address space. • Target region is temporarily mapped there. • Then the message is copied to the communication window and ends up in the correct place in the target address space.

  12. Temporary Mapping • Must be fast! • 2 level page table only requires one word to be copied. • pdir A -> pdir B • TLB must be clean of entries relating to the use of the communication window by other operations. • One thread • TLB is always “window clean”. • Multiple threads • Interrupts – TLB is flushed • Thread switch – Invalidate Communication window entries.

  13. Strict Process Orientation • Kernel mode handled in same way as User mode • One kernel stack per thread • May lead to a large number of stacks • Minor problem if stacks are objects in virtual memory

  14. User area Kernel area tcb Kernel stack Thread Control Blocks (tcb’s) • Hold kernel, hardware, and thread-specific data. • Stored in a virtual array in shared kernel space.

  15. Tcb Benefits • Fast tcb access • Saves 3 TLB misses per IPC • Threads can be locked by unmapping the tcb • Helps make thread persistent • IPC independent from memory management

  16. Algorithmic Level • Thread ID’s • L3 uses a 64 bit unique identifier (uid) containing the thread number. • Tcb address is easily obtained • anding the lower 32 bits with a bit mask and adding the tcb base address. • Virtual Queues • Busy queue, present queue, polling-me queue. • Unmapping the tcb includes removal from queues • Prevents page faults from parsing/adding/deleting from the queues.

  17. Algorithmic Level • Timeouts and Wakeups • Operation fails if message transfer has not started t ms after invoking it. • Kept in n unordered wakeup lists. • A new thread’s tcb is linked into the list τ mod n. • Thread with wakeups far away are kept in a long time wakeup list and reinserted into the normal lists when time approaches. • Scheduler will only have to check k/n entries per clock interrupt. • Usually costs less the 4% of ipc time.

  18. Algorithmic Level • Lazy Scheduling • Only a thread state variable is changed (ready/waiting). • Deletion from queues happens when queues are parsed. • Reduces delete operations. • Reduces insert operations when a thread needs to be inserted that hasn’t been deleted yet.

  19. Algorithmic Level • Short messages via registers • Register transfers are fast • 50-80% of messages ≥ 8 bytes • Up to 8 byte messages can be transferred by registers with a decent performance gain. • May not pay off for other processors.

  20. Interface Level • Unnecessary Copies • Message objects grouped by types • Send/receive buffers structured in the same way • Use same variable for sending and receiving • Avoid unnecessary copies • Parameter Passing • Use registers whenever possible. • Far more efficient • Give compilers better opportunities to optimize code.

  21. Code Level • Cache Misses • Cache line fill sequence should match the usual data access sequence. • TLB Misses • Try and pack in one page: • Ipc related kernel code • Processor internal tables • Start/end of Larger tables • Most heavily used entries

  22. Coding Level • Registers • Segment register loading is expensive. • One flat segment coving the complete address space. • On entry, kernel checks if registers contain the flat descriptor. • Guarantees they contain it when returning to user level. • Jumps and Check • Basic code blocks should be arranged so that as few jumps are taken as possible. • Process switch • Save/restore of stack pointer and address space only invoked when really necessary.

  23. Some techniques

  24. Results

  25. Results

  26. Remarks • Ports • Extend L3 to support indirection of port rights. • Port-based ipc can be implemented efficiently. • Dash-like Message Passing • “restricted virtual memory remapping” • Implemented effieciently • Cache thrashing • Processor Dependencies • Processor specific implementations are required to get really high performance

  27. Conclusion • Faster, cross address space ipc can be achieved! • 22 times faster (compared to mach) • Applying principles: • Performance based reasoning • Hunting for new techniques • Consideration of synergetic effects • Concreteness • Performance goal must be aimed at from the beginning! • Techniques and methods are applicable to many μ-kernels and different hardware.

  28. Questions/comments?

More Related