Improving IPC by Kernel Design

Jochen Liedtke

Presentation: Rebekah Leslie

Microkernels and IPC:
  • Microkernel architectures introduce a heavy reliance on IPC, particularly in modular systems
  • Mach pioneered an approach to highly modular and configurable systems, but had poor IPC performance
  • Poor performance led developers to avoid microkernels entirely, or to architect their designs to minimize IPC
  • The paper examines a performance-oriented design process and the specific optimizations that achieve good IPC performance
Performance vs. Protection:
  • Mach provided strong isolation between tasks using indirect IPC transfer (ports) with limited access controlled by capabilities (port rights)
  • L3 removes indirect transfer, capabilities, and RPC validation to achieve better performance
    • Provides basic address space protection
    • Does not provide true isolation
  • Recent L4 designs incorporate the use of capabilities to achieve isolation in security-critical systems
L3 System Architecture:
  • Mach-like design with a focus on highly modular systems
    • Limit kernel features to functionality that absolutely cannot be implemented at user-level
    • Reliance on user level servers for many “traditional” OS features: page-fault handling, exception handling, device drivers
  • System organized into tasks and threads
  • IPC allows direct data transfer and memory sharing
  • Direct communication between threads via thread ID
Performance-centric Design:
  • Focus on IPC
    • Any feature that will increase cost must be closely evaluated
    • When in doubt, design in favor of IPC
  • Design for Performance
    • A poorly performing technique is unacceptable
    • Evaluate feature cost compared to concrete baseline
    • Aim for a concrete performance goal
  • Comprehensive design
    • Consider synergistic effects of all methods and techniques
    • Cover all levels of implementation, from design to code
Performance Baseline:
  • The cost of each feature must be evaluated relative to a concrete performance baseline
  • For IPC, the theoretical minimum is the cost of transferring an empty message: this measures the overhead without any data-transfer cost

127 cycles without prefetching delays or cache misses

+ 45 cycles for TLB misses

= 172 cycle minimum time

GOAL: 350 cycles (7 µs) for short messages
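The baseline arithmetic, together with the cycles-to-time conversion, can be written out as a small sketch (the 50 MHz clock rate is implied by 350 cycles being quoted as 7 µs):

```python
# Cycle budget for an empty-message IPC (numbers from the slide above).
# The 50 MHz clock (i486-class machine) is inferred from 350 cycles = 7 us.
BASE_CYCLES = 127       # no prefetch delays or cache misses
TLB_MISS_CYCLES = 45    # additional TLB-miss cost
MINIMUM_CYCLES = BASE_CYCLES + TLB_MISS_CYCLES

GOAL_CYCLES = 350       # the paper's target for a short-message IPC

def cycles_to_microseconds(cycles: int, clock_mhz: float = 50.0) -> float:
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz
```

At 50 MHz the 350-cycle goal works out to exactly 7 µs, and the 172-cycle theoretical minimum to 3.44 µs.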

Messages in L3:
  • Tag: Description of message contents
  • Direct string: Data to be transferred directly from send buffer to receive buffer
  • Indirect string: Location and size of data to be transferred by reference
  • Memory object: Description of a region of memory to be mapped in receiver address space (shared memory)
  • System calls: Send, receive, call (send and receive), reply/wait (receive and send)


(Slide diagram: message layout showing the direct string, the indirect strings, and the memory objects.)
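The message layout described above can be sketched as plain data structures (the field and type names here are illustrative, not L3's actual declarations):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IndirectString:
    """Location and size of data to be transferred by reference."""
    address: int
    size: int

@dataclass
class MemoryObject:
    """Region of memory to be mapped into the receiver's address space."""
    base: int
    length: int

@dataclass
class Message:
    tag: int                       # description of message contents
    direct: bytes = b""            # copied send buffer -> receive buffer
    indirect: List[IndirectString] = field(default_factory=list)
    mem_objects: List[MemoryObject] = field(default_factory=list)

# Example: a short reply carrying 2 direct bytes plus one by-reference region.
msg = Message(tag=1, direct=b"ok", indirect=[IndirectString(0x1000, 64)])
```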

Basic Message Optimizations:
  • Ability to transfer long, complex messages reduces the number of messages that need to be sent (system calls)
  • Indirect strings avoid copy operations at user level
    • User specifies data location, rather than copying data to buffer
    • Receiver specifies destination, rather than copying from buffer
  • Memory objects are transferred lazily, i.e., the page table is not modified until access is required
  • Combined send/receive calls reduce number of traps
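The indirect-string idea can be illustrated with a toy simulation: the sender names source regions, the receiver names destination offsets, and the transfer touches the data exactly once, with no user-level staging buffer. The function and descriptor format are invented for illustration:

```python
def deliver_indirect(src_mem: bytearray, dst_mem: bytearray,
                     send_descs, recv_descs):
    """Copy each (offset, size) source region directly into the
    receiver-chosen destination offset: one copy, no staging buffer."""
    for (s_off, size), d_off in zip(send_descs, recv_descs):
        dst_mem[d_off:d_off + size] = src_mem[s_off:s_off + size]

sender = bytearray(b"hello world!....")
receiver = bytearray(16)
# Sender names 5 bytes at offset 6 ("world"); receiver wants them at offset 0.
deliver_indirect(sender, receiver, [(6, 5)], [0])
```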
Optimization - Direct Transfer via Temporary Mapping:
  • A two-copy message transfer costs 20 + 0.75n cycles
  • L3 copies the data only once, through a special communication window in kernel space
  • The window is mapped to the receiver for the duration of the call (via a page directory entry)

(Slide diagram: the communication window is mapped with kernel-only permission, and the mapping is added to address space B.)
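The temporary-mapping trick can be modeled in miniature: address spaces become dictionaries from virtual page to physical frame, and the kernel aliases the receiver's frame into a reserved window page so a single copy lands directly in the receiver's memory. All names, sizes, and the window location are illustrative:

```python
PAGE = 4096
phys = {0: bytearray(PAGE), 1: bytearray(PAGE)}  # physical frames

space_a = {0: 0}      # sender: virtual page 0 -> frame 0
space_b = {0: 1}      # receiver: virtual page 0 -> frame 1
COMM_WINDOW = 0x200   # reserved window page in the kernel part of space A

def ipc_copy(src_space, dst_space, src_vp, dst_vp, nbytes):
    # Temporarily alias the receiver's frame into the window (one PDE write).
    src_space[COMM_WINDOW] = dst_space[dst_vp]
    src = phys[src_space[src_vp]]
    dst = phys[src_space[COMM_WINDOW]]
    dst[:nbytes] = src[:nbytes]     # the single copy, sender -> receiver
    del src_space[COMM_WINDOW]      # unmap; a real kernel also flushes the TLB

phys[0][:4] = b"ping"               # sender's message, in its own page
ipc_copy(space_a, space_b, 0, 0, 4)
```

After the call, the receiver's frame holds the data even though only one copy was performed and space B's page table was never modified.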
Optimization - Transfer Short Messages in Registers:
  • IPC messages are often very short
    • Example: Device driver ack or error replies
    • On average, between 50% and 80% of L3 messages are less than eight bytes long
  • Even on the register-poor x86, 2 registers can be set aside for short message transfer
  • The register-transfer implementation saved 2.4 µs, even more than the overhead of temporary mapping (1.2 µs), because it enabled further optimizations
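A sketch of the register encoding (the two-register budget is from the slide; the particular byte packing is an assumption for illustration):

```python
def pack_short(data: bytes):
    """Pack up to 8 bytes into two 32-bit 'registers'.
    Little-endian packing is an illustrative choice."""
    assert len(data) <= 8
    padded = data.ljust(8, b"\x00")
    r1 = int.from_bytes(padded[:4], "little")
    r2 = int.from_bytes(padded[4:], "little")
    return r1, r2

def unpack_short(r1: int, r2: int, length: int) -> bytes:
    """Recover the original bytes on the receiver side."""
    raw = r1.to_bytes(4, "little") + r2.to_bytes(4, "little")
    return raw[:length]
```

A short ack or error code rides through the trap in registers, never touching a memory buffer on either side.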
Thread Scheduling in L3:
  • Scheduler maintains several queues to keep track of relevant thread-state information
    • Ready queue stores threads that are able to run
    • Wakeup queues store threads that are blocked waiting for an IPC operation to complete or timeout (organized by region)
    • Polling-me queue stores threads waiting to send to some thread
  • Efficient representation of data structures
    • Queues are stored as doubly-linked lists distributed across TCBs
    • Scheduling never causes page faults
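The "queues distributed across TCBs" idea is an intrusive doubly-linked list: the link fields live inside each TCB, so enqueue and unlink are O(1) with no separate node allocations to fault in. A minimal sketch (names invented for illustration):

```python
class TCB:
    """Thread control block with embedded (intrusive) queue links."""
    def __init__(self, tid):
        self.tid = tid
        self.prev = None
        self.next = None

class ReadyQueue:
    def __init__(self):
        self.head = None

    def enqueue(self, tcb):
        # Push at the head: O(1), no allocation.
        tcb.next = self.head
        tcb.prev = None
        if self.head:
            self.head.prev = tcb
        self.head = tcb

    def remove(self, tcb):
        # Unlink in O(1) using the links embedded in the TCB itself.
        if tcb.prev:
            tcb.prev.next = tcb.next
        else:
            self.head = tcb.next
        if tcb.next:
            tcb.next.prev = tcb.prev
        tcb.prev = tcb.next = None
```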
Optimization - Lazy Dequeueing
  • A significant cost in a microkernel is the scheduling overhead for kernel threads (recall the user-level threads papers)
  • Sometimes, threads are removed from a queue, only to be inserted again a short while later
  • With weak invariants on the scheduling queues, you can delay removing a thread from a queue and save that overhead
    • The ready queue contains at least all ready threads
    • A wakeup queue contains at least all waiting threads
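A toy model of lazy dequeueing, keeping the weak invariant that the list contains at least all ready threads; blocking a thread does no list work, and stale entries are discarded only when the scheduler walks past them (structure and names invented for illustration):

```python
class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.ready = True

class LazyReadyQueue:
    """Invariant: the list contains AT LEAST all ready threads."""
    def __init__(self):
        self.q = []
        self.lazy_removals = 0

    def enqueue(self, t):
        if t not in self.q:
            self.q.append(t)

    def block(self, t):
        t.ready = False     # no list operation here -- this is the saving

    def next_ready(self):
        # Stale (non-ready) entries are unlinked only on this path.
        while self.q:
            t = self.q[0]
            if t.ready:
                return t
            self.q.pop(0)
            self.lazy_removals += 1
        return None
```

If the blocked thread becomes ready again before the scheduler reaches it, no queue operation ever happens at all.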
Optimization - Store Task Control Blocks in Virtual Arrays
  • A task control block (TCB) stores kernel data for a particular thread
  • Every operation on a thread requires lookup, and possibly modification, of that thread’s TCB
  • Storing TCBs in a virtual array provides fast access to TCB structures
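The virtual-array lookup reduces TCB access to a shift and an add; the base address and TCB size below are illustrative values, not L3's:

```python
TCB_BASE = 0xE0000000   # assumed base of the virtual TCB array
TCB_SIZE_BITS = 9       # assumed 512-byte TCBs: shift instead of multiply

def tcb_address(tid: int) -> int:
    """Thread ID -> TCB virtual address: one shift and one add,
    with no table walk or hashing. An unallocated TCB's page can
    simply fault on first touch."""
    return TCB_BASE + (tid << TCB_SIZE_BITS)
```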
Optimization - Compact Structures with Good Locality
  • Access TCBs through a pointer to the center of the structure so that short displacements can be used
    • One-byte displacements then reach twice as much TCB data as with a pointer to the start of the structure
  • Group related TCB information on cache line boundaries to minimize cache misses
  • Store frequently accessed kernel data in same page as hardware tables (IDT, GDT, TSS)
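The mid-structure pointer trick exploits signed one-byte displacements, which span -128..+127: from the start of a structure they reach only the first 128 bytes, but from the center they reach 256 bytes. A quick check of the reachable range:

```python
def reachable_bytes(struct_size: int, pointer_offset: int) -> int:
    """Bytes of a structure addressable from a base pointer at
    pointer_offset using a signed one-byte displacement (-128..+127)."""
    lo = max(0, pointer_offset - 128)
    hi = min(struct_size, pointer_offset + 128)
    return hi - lo
```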
Optimization - Reduce Segment Register Loads
  • Loading segment registers is expensive (9 cycles per register), so many systems use a single, flat segment
  • Kernel preservation of the segment registers requires 66 cycles with the naive approach (always reload the registers)
  • L3 instead checks whether the flat value is still intact, and only reloads if not
  • Checking alone costs 10 cycles
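The check-before-reload pattern, using the cycle costs quoted above (the accounting function itself is illustrative):

```python
CHECK_COST = 10    # cycles to verify the segment value is still flat
RELOAD_COST = 66   # cycles for the naive always-reload path

def segment_restore_cost(still_flat: bool) -> int:
    """Common case: the check alone suffices. Rare case: the check
    fails and the full reload is paid on top of it."""
    return CHECK_COST if still_flat else CHECK_COST + RELOAD_COST
```

As long as user code rarely changes the segment registers, the expected cost stays close to 10 cycles rather than 66.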
Performance Impact of Specific Optimizations:
  • Large messages dominated by copy overhead
  • Small messages get benefit of faster context switching, fewer system calls, and fast access to kernel structures
IPC Performance Compared to Mach (Short Message):
  • Measured using pingpong micro-benchmark that makes use of unified send/receive calls
  • For an n-byte message, the cost is 7 + 0.02n µs in L3
IPC Performance Compared to Mach (Long Messages):
  • Same benchmark with larger messages.
  • For n-byte messages larger than 2K, cache misses increase and the IPC time is 10 + 0.04n µs
    • Slightly higher base cost
    • Higher per-byte cost
  • By comparison, Mach takes 120 + 0.08n µs
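Plugging the quoted cost models into a small comparison (the 2K threshold for switching between the two L3 formulas is an interpretation of the slide):

```python
def l3_cost_us(n: int) -> float:
    """L3 IPC cost in microseconds for an n-byte message."""
    return 7 + 0.02 * n if n <= 2048 else 10 + 0.04 * n

def mach_cost_us(n: int) -> float:
    """Mach IPC cost in microseconds for an n-byte message."""
    return 120 + 0.08 * n

def speedup(n: int) -> float:
    return mach_cost_us(n) / l3_cost_us(n)
```

For an empty message the model gives 120/7, roughly a 17x advantage for L3.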
  • Fast IPC was essential for microkernels to gain wide adoption, and poor IPC performance was a major limitation of Mach
  • L3 demonstrates that good performance is attainable in a microkernel system with IPC performance that is 10 to 22 times better than Mach
  • The performance-centric techniques demonstrated in the paper can be employed in any system, even if the specific optimizations cannot