
Kernel-Kernel communication in a shared-memory multiprocessor



  1. Kernel-Kernel communication in a shared-memory multiprocessor By ELISEU M. CHAVES, JR.,* PRAKASH CH. DAS,* THOMAS J. LEBLANC, BRIAN D. MARSH* AND MICHAEL L. SCOTT

  2. Introduction • Computer architecture can influence multiprocessor operating system design • How to distribute kernel functionality across processors • Inter-kernel communication • Data structures and their location • How to maintain data integrity and provide fast access

  3. Introduction (continued) • The same questions, illustrated by some examples...

  4. Introduction • Bus-based shared-memory multiprocessors • Kernel code shared by all processors • Typically uniform memory access (UMA) • Use explicit synchronization to control access to data structures (critical sections, locks, spin locks, semaphores); a minimal sketch follows this slide • Advantages • Easy to port an existing uniprocessor kernel by adding explicit synchronization • Decent performance • Simplified load balancing and global resource management • Disadvantages • Good for smaller machines but not very scalable • Increased contention leads to performance degradation
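As a concrete illustration of the explicit synchronization named above, here is a minimal sketch, assuming C11 atomics stand in for whatever atomic primitive the hardware provides; the spin lock and the ready_queue_len variable are invented for this example and are not Psyche code.

```c
/* A test-and-set spin lock guarding a shared kernel structure,
 * as a bus-based shared-memory kernel might use. Illustrative only. */
#include <stdatomic.h>
#include <stdio.h>

typedef struct { atomic_flag held; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    /* Busy-wait until this processor wins the atomic test-and-set. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin; a real kernel would add a backoff or pause hint */
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static spinlock_t ready_queue_lock = { ATOMIC_FLAG_INIT };
static int ready_queue_len = 0;   /* stand-in for shared kernel data */

int main(void) {
    spin_lock(&ready_queue_lock);   /* critical section */
    ready_queue_len++;
    spin_unlock(&ready_queue_lock);
    printf("ready queue length: %d\n", ready_queue_len);
    return 0;
}
```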

  5. Introduction • Distributed systems & distributed-memory computers • Kernel data distributed among processors • Each processor executes its own copy of the kernel • Local data are operated on directly; remote data are accessed via remote invocation • Uses implicit synchronization to control access to data • Examples: hypercubes & mesh-connected machines • Advantages • Works well for machines with no shared memory • Modular design gives increased fault protection • Can use non-preemption for the bulk of synchronization • Disadvantages • Reliance on a communication network to deliver invocations • Security of data while in the network

  6. Large-Scale Shared-Memory Multiprocessors • Memory can be accessed directly, but at varying cost (NUMA) • Some with coherent caches between processors (ccNUMA) • Large-scale shared-memory multiprocessors have properties in common with both • Bus-based systems, which use remote access • Distributed-memory multicomputers, which use remote invocation • Kernel designers must therefore choose between • Remote ACCESS & • Remote INVOCATION • The tradeoffs between remote access and remote invocation remain largely unknown (circa 1993)

  7. Target Parameters of this Research • General-purpose parallel operating system • Shared memory accessible to all processors • Full range of kernel services • Reasonable use of all processors • System arranged as a collection of nodes • Possibly with caches? • NUMA architecture • No cache coherence

  8. Kernel-Kernel Communication Options • Remote access • Execution stays on node i, reading and writing node j’s memory when necessary • Remote invocation • The processor at node i sends a message to node j’s processor requesting that it perform an operation on i’s behalf • Bulk data transfer • Move the required data from node j to node i via a specific request to the kernel • Re-map memory from node j to node i (data migration) • Cache coherence blurs the distinction with remote access • The first two options are sketched below
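The sketch below contrasts the first two options. It is illustrative scaffolding only: the "nodes" are plain structs in a single address space, and handle_invocation() is called directly where a real kernel would send an interprocessor interrupt and wait for a reply message.

```c
/* Remote access vs. remote invocation, simulated in one process. */
#include <stdio.h>

typedef struct { long run_queue_len; } node_kernel_t;
static node_kernel_t nodes[2] = { { 3 }, { 7 } };

/* Remote access: node i dereferences node j's memory directly;
 * every individual load/store pays the remote NUMA latency. */
static long read_via_remote_access(int j) {
    return nodes[j].run_queue_len;
}

/* Remote invocation: node i packages a request and node j's
 * processor performs the operation on purely local memory. */
typedef struct { long reply; } invocation_t;

static void handle_invocation(int j, invocation_t *m) {
    m->reply = nodes[j].run_queue_len;   /* local access on node j */
}

static long read_via_remote_invocation(int j) {
    invocation_t m = { 0 };
    handle_invocation(j, &m);  /* stands in for an IPI plus reply */
    return m.reply;
}

int main(void) {
    printf("remote access: %ld, remote invocation: %ld\n",
           read_via_remote_access(1), read_via_remote_invocation(1));
    return 0;
}
```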

  9. Types of Remote Invocation [1/2] • Interrupt level • Request executes in the interrupt handler (“bottom half”) • Doesn’t execute in the context of a process • Can’t sleep • Can’t call the scheduler • Can’t transfer data to/from user space • Interrupts remain enabled during execution so that other incoming interrupts can still be handled • Mutual exclusion achieved by masking interrupts (sketched below) • Coarse granularity can unnecessarily block other invocations • Prohibits outgoing invocations while incoming invocations are disabled (to avoid deadlock) • Limiting when an operation must access data on two different processors • Yes, lots of limitations, but very fast!
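A minimal sketch of the interrupt-masking discipline this slide describes. The mask/restore functions only flip a simulated flag here; a real kernel would manipulate the processor's interrupt-enable state. All names are invented for illustration.

```c
/* Mutual exclusion by masking interrupts, simulated in one process. */
#include <stdio.h>

static int interrupts_masked = 0;      /* simulated CPU state */
static long pending_invocations = 0;   /* data shared with the handler */

static int mask_interrupts(void) {
    int old = interrupts_masked;
    interrupts_masked = 1;
    return old;
}

static void restore_interrupts(int old) { interrupts_masked = old; }

/* Runs at interrupt level: no process context, so it must not block,
 * sleep, or call the scheduler -- it only mutates kernel data. */
static void interrupt_level_handler(void) {
    pending_invocations--;
}

/* Non-interrupt code touching the same data masks interrupts around it. */
static void enqueue_invocation(void) {
    int old = mask_interrupts();       /* coarse-grained exclusion */
    pending_invocations++;
    restore_interrupts(old);
}

int main(void) {
    enqueue_invocation();
    interrupt_level_handler();
    printf("pending invocations: %ld\n", pending_invocations);
    return 0;
}
```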

  10. Types of Remote Invocation [2/2] • Kernel process level • Normal “top half” activity • From user space: the interrupt handler performs an asynchronous trap and the remote invocation executes • From kernel space: the interrupt handler queues the invocation for execution when the kernel becomes idle or when control returns to user space • Has process context, so the requested operation is free to block during its execution • A process holding semaphores or scheduler-based locks can perform process-level RIs • Deadlock still possible, but can be avoided (see the sketch after the next slide) • Requires the requesting process to block when making an RI • Busy waiting would lock out other incoming process-level RIs

  11. Coexistence is possible • Remote access and interrupt-level RI can be used on the same data structure • Deadlock-avoidance rule: it must always be possible to perform incoming invocations while waiting for an outgoing invocation, as sketched below
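A sketch of the deadlock-avoidance rule under a simulated message queue: while waiting for its own outgoing invocation to complete, the node keeps serving incoming invocations, so two nodes that invoke each other at the same time cannot deadlock. The counters and the "simulated reply" are stand-ins for real interprocessor message queues.

```c
/* Serving incoming invocations while awaiting an outgoing reply. */
#include <stdio.h>

static int incoming_pending = 2;   /* invocations queued by other nodes */
static int reply_arrived = 0;      /* set when our outgoing RI completes */

static void serve_one_incoming(void) {
    if (incoming_pending > 0) {
        incoming_pending--;        /* perform the requested operation */
        printf("served an incoming invocation\n");
    }
}

static void wait_for_reply(void) {
    while (!reply_arrived) {
        /* The rule: never spin blindly; keep servicing incoming
         * requests so mutual invocations cannot deadlock. */
        serve_one_incoming();
        if (incoming_pending == 0)
            reply_arrived = 1;     /* simulated reply delivery */
    }
}

int main(void) {
    wait_for_reply();
    printf("outgoing invocation completed\n");
    return 0;
}
```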

  12. Trade-Offs • Direct costs of remote operations • Latency of the operation: compare (R − 1) × n against C, where • R: ratio of remote to local memory access time • n: number of memory accesses in the operation • C: overhead of a remote invocation • If C is greater, implement the operation using remote access • Notes: • In general C is fixed, so use RI for longer operations • Parameter passing is not included in this calculation • A worked form of this rule follows
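To make the rule concrete, a small helper that applies the (R − 1) × n < C test, expressing C in units of the local access time; the numbers below are placeholders chosen for illustration, not measurements.

```c
/* Breakeven test between remote access and remote invocation. */
#include <stdio.h>

/* Returns 1 if remote access is predicted cheaper than invocation:
 * the extra per-reference cost of n remote accesses, (R - 1) * n,
 * is below the fixed invocation overhead C. */
static int prefer_remote_access(double R, double n, double C) {
    return (R - 1.0) * n < C;
}

int main(void) {
    double R = 12.0;   /* remote/local access-time ratio (placeholder) */
    double C = 110.0;  /* RI overhead, in local access times (placeholder) */
    for (double n = 5; n <= 15; n += 5)
        printf("n=%2.0f: use %s\n", n,
               prefer_remote_access(R, n, C) ? "remote access"
                                             : "remote invocation");
    return 0;
}
```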

  13. Trade-Offs • Indirect costs for local operations • An RI may need the context of the invoker passed as a parameter • Operations may need to be arranged to increase node locality • Accesses by the invoking and target processors are not interleaved • Using process-level RI for all remote accesses may allow data structures to be implemented without explicit synchronization • Uses lack of preemption within the kernel to provide implicit synchronization • Avoiding explicit synchronization • Can improve the speed of both remote and local operations because of the high cost of explicit synchronization • Remote memory access may require that large portions of the kernel data space on other processors be mapped into each instance of the kernel • May not scale well to large machines • Mapping on demand is not a good option

  14. Trade-Offs • Competition for processor and memory cycles • Where operations serialize • Remote invocation: on the processor that executes them • Invocations serialize whether or not they touch common data • Remote memory access: at the memory • More chance of parallel execution, because operations do more than just access data • Operations competing for locked data may still serialize • Tolerance for contention • Remote memory access has slightly higher throughput because of the possibility of parallel operation • However, it can introduce greater variance in completion time

  15. Trade-Offs • Compatibility with the conceptual kernel model • Procedure-based kernel • Gives a uniform model for user and kernel processes • Closely mimics the hardware of a UMA multiprocessor • Minimizes context switching • Fits nicely with remote memory access • Message-based kernel • Compartmentalization of the kernel • Synchronization is subsumed by message passing • Closely mimics the hardware of a distributed-memory multicomputer • Easier to debug • Remote invocation is more appropriate

  16. Case Study • Psyche on the BBN Butterfly • 12:1 remote-to-local memory access time ratio • 6.88 µs: read a 32-bit word in remote memory • 0.518 µs: read a 32-bit word in local memory • 4.27 µs: write a 32-bit word in remote memory • 0.398 µs: write a 32-bit word in local memory • 56 µs: average latency of an interrupt-level remote invocation • 421 µs: average latency of a process-level remote invocation

  17. Latency of Kernel Operations • Times in the table’s columns represent the cost of operations: • Columns 1 & 3: measurements for the original Psyche kernel • Column 2: synchronization without context switching • Column 4: operation subsumed in another operation with coarse-grained locking • Column 6: cost if data were always accessed via remote invocation, so no explicit synchronization is required • Column 5: cost of remote invocations in a hybrid kernel

  18. % Latency from Locking • Cost of synchronization • Measured by comparing each operation with locking OFF vs. ON

  19. Remote Access Penalty • NL: (RA(off) − local(off)) / RA(off) • L: (RA(off) − local(off)) / RA(on) • The overhead of explicit synchronization and remote access grows with the complexity of the operation • Example • An interrupt-level RI can be justified to avoid 11 remote references • Given a ≈6 µs remote memory access cost and a ≈60 µs interrupt-level RI overhead (worked through below)
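Working through the slide’s own numbers: each remote reference costs roughly 6 µs more than a local one, so an operation making n remote references pays about 6n µs extra under remote access. Against the ≈60 µs overhead of an interrupt-level invocation, the breakeven is n = 60/6 = 10; operations making 11 or more remote references are therefore cheaper as interrupt-level RIs.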

  20. RI Time as a Percentage of RA Time • RA: locking in the case of remote access • Column 6 as % of column 3 • B: locking in the case of both • Column 5 as % of column 3 • Simulates a hybrid kernel • N: locking in the case of neither • Column 6 as % of column 4 • Simulates the desired operation being subsumed by another operation with coarse-grained locking

  21. Not so fast • Not such a clear win for remote invocation • The experiments represent extreme cases • Simple enough to perform via interrupt-level RI • Complex enough to absorb the overhead of process-level RI • Medium-sized operations might not tolerate the limitations of interrupt-level RI or the overhead of process-level RI • Throughput for tiny operations • Remote memory accesses steal only memory cycles from the processor whose node holds the memory • Remote invocations steal the entire processor for the entire operation • Experiments show interrupt-level RI hurts the throughput of the remote processor, while remote access does not

  22. Conclusions • Good kernel design must consider • The cost of the remote invocation mechanism • The cost of atomic operations and synchronization • The ratio of remote to local access time • Breakeven points depend on the number of remote references in a given operation • Interrupt-level RI has a small number of uses • TLB shootdown • Console I/O • The results could apply to ccNUMA machines as well • The risks of putting data structures in the interrupt level of the kernel may be a reason to use remote access instead for small operations

  23. References • Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman, The Design and Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley • Andrew S. Tanenbaum, Modern Operating Systems, 2nd Edition, Prentice-Hall of India • Alessandro Rubini and Jonathan Corbet, Linux Device Drivers, 2nd Edition, O’Reilly
