実践システムソフトウェア Practical System Software

実践システムソフトウェアPracticalSystem Software 石川　裕 (Yutaka Ishikawa) 情報理工学系研究科コンピュータ科学専攻 part 4

Schedule • Oct. 6 Introduction • Oct. 13 Linux and Windows Basic Architecture • Oct. 20 Lab Experiment 1 • Setting up and Introducing the WRK environment • Oct. 27 Scheduler • Nov. 10 Lab Experiment 2 • Adding a new system call for counting some kernel events (context switch, page fault, process creation/deletion, etc..) • Assignment 1 • Nov. 17 No class • Nov. 24 Synchronizations • Dec. 1 Lab Experiment 3 • User space synchronization and priority inversion • Assignment 2 • Dec. 8 Virtual Memory • Dec. 15 Lab Experiment 4 • Experiment on working set • Assignment 3 • Dec. 22 I/O & File System • Jan. 12 Lab Experiment 5 • Final Assignment • Jan. 19 Lab Experiment 6 • Jan. 26 Lab Experiment 7 Grade: based on the number of attendances and the evaluation of four reports part 4

Concurrency & Synchronizationin Windows NT and Linux Advanced operating systems course, The University of Tokyo

Outline • Important notions • Concurrency in OS kernels • Interrupt handling • Synchronizations methods • Atomic variables • Spinlocks • Waitqueues / semaphores / mutexes • Windows • Linux

Important notions • Concurrency: simultaneous threads of execution, potentially interacting with each other • Synchronization: coordination of threads of execution for achieving correct/desired behavior • Race condition: situation of multiple threads competing for a shared resource, where the order of access is significant (e.g.: for the resource’s consistency) • Critical section: part of a thread’s execution where shared resource is accessed, possibly leading to race condition • Mutual exclusion: mechanism for ensuring that only one thread is in its critical section for a particular shared resource at a given time (i.e. to avoid race condition)

Important notions • Resource locking: a solution for achieving mutual exclusion among several threads • Deadlock: multiple threads waiting for an event that can be caused by only one of the waiting threads • Starvation: indefinite waiting due to the inability of accessing necessary resource(s)

Concurrency in OS kernels Advanced operating systems course, The University of Tokyo

Concurrency in OS kernels(kernel execution contexts) Code runs in kernel mode for different reasons: 1. Requests from user mode: • System-calls, kernel code runs in the context of the requesting thread 2. Asynchronous interrupts (e.g.: from external devices): • Interrupt service routine (ISR) • No process address space associated • IMPORTANT: ISRs are not schedulable entities (they cannot sleep) 3. Synchronous interrupts (also called exceptions): • Faults, division by zero, page fault, etc.. • Associated with the faulting thread and instruction 4. Dedicated kernel-mode threads: • Scheduled, preempted, etc., like any other threads

Trapping into the kernel Trap: processor‘s mechanism to capture executing thread • Switch from user to kernel mode • Interrupts – asynchronous • Exceptions - synchronous Interruptserviceroutines Interruptserviceroutines Interruptserviceroutines Interrupt Interruptdispatcher System service call Systemservicedispatcher System services System services System services HW exceptionsSW exceptions Exceptionhandlers Exceptionhandlers Exceptiondispatcher Exceptionhandlers Virtual addressexceptions Virtual memorymanager‘s pager

Hardware interrupt handling,top and bottom-halves • Hardware signals the CPU that an event occurred, data is ready on a port, etc.. • OS executes Interrupt Service Routine (ISR) • Normally split into two phases/halves: • Top-half (the actual ISR): • Executes time-critical work with the interrupt disabled • Responds (acknowledges or resets) hardware • Saves device data • Defers further work (i.e.: requests/schedules bottom-half) • Bottom-half: • Executed later at a safer time • Interrupts are not disabled • Does the actual job, can wake up processes, schedule I/O, processes a protocol, etc.. • NOTE: can be even in process context!

0 2 3 Peripheral Device Controller CPU Interrupt Controller n CPU Interrupt Service Table ISR ACK interrupt Read from device Schedule bottom-half Handle protocol, Wake process, etc.. Interrupt handler (bottom-half) Interrupt handler (top-half) Hardware interrupt handling,execution flow, top and bottom-halves

Hardware interrupt handling,top and bottom-halves • Why splitting? • Interrupt handlers may need to perform a large amount of work in response to an event • Example: receiving a packet from the network • Top-half: • ACKs network card and copies data buffer • Bottom-half: • Processes network protocol stack, signals process for I/O • Minimizing time while interrupts are disabled • Improves responsiveness • Ensures that new interrupts are not missed

Synchronization in OS kernels Three major synchronization mechanisms: • Atomic operations / variables • Spinlocks • Busy waiting • Wait queues (semaphores, mutexes) • Waiting by means of descheduling

Atomic operations • Simplest locking mechanism • Guarantees atomicity on a variable • for instance to increase/decrease a counter • Usually provides operations such as: • Test and set operations • atomic_read, atomic_set, atomic_add, atomic_sub, atomic_inc_and_test, atomic_dec_and_test • atomic_test_and_set • Swap values

Mutual Exclusion with atomic test-and-Set • Shared data: boolean lock = false; • Thread Ti do { while (TestAndSet(lock)); critical section lock = false; remainder section }

Spinlocks • Spinlock acquisition and release routines implement mutual exclusion • A spinlock is either free, or is considered to be taken by a CPU • If a spinlock is taken the other CPU/thread continuously checks its value while it becomes available, hence the expression “spins” • IMPORTANT: busy waiting, can be very inefficient, therefore: • Spinlocks are meant to be held for only a short time! • The thread holding the lock is not supposed to sleep • Why? Another thread may try to acquire the same lock, deadlocking the system (CPU will be spinning there forever…) • NOTE: thread means thread of execution, not necessary a schedulable thread • A spinlock is just a data cell in memory • Accessed with a test-and-modify operation that is atomic across all processors 31 0

Waitqueues and semaphores • Semaphore S – an integer variable if S = 1, called mutex (mutual exclusion) • Can only be accessed via two atomic operations: wait (S): while (S <= 0);S--; signal (S): S++;

Semaphore implementation based on a waitqueue • Semaphores may suspend(deschedule) and resume threads, by interacting with the scheduler code • Purpose: avoid busy waiting • Define a semaphore as a record of: typedef struct { int value; struct thread *L; // waitqueue } semaphore; • Assume two simple operations: • suspend() suspends the thread that invokes it. • resume(T) resumes the execution of a blocked thread T.

Implementation of wait() / signal() • Semaphore operations now defined as wait(S): S.value--; if (S.value < 0) { add this thread to S.L; suspend(); } signal(S): S.value++; if (S.value <= 0) { remove a thread T from S.L; resume(T); }

Mutual exclusion comparison • Spinlock • Low overhead locking • Short lock hold time • Busy waiting while contended • Cannot sleep when held • Note: it can still be used from process context, because preemption is disabled, but need to disable interrupts that can access the same resource and make sure that doesn’t sleep • Semaphore / mutex • High overhead • Waitqueue management, context switching, etc.. • Long(er) lock hold time • Waiting by means of sleeping (descheduling) when contended • Sleeping is fine • Cannot be used from interrupt context! 20

Windows NT Advanced operating systems course, The University of Tokyo

Windows Interrupt Request Levels (IRQLs) • IRQL = Interrupt Request Level • the “precedence” of the interrupt with respect to other interrupts • Different interrupt sources have different IRQLs • not the same as IRQ • IRQL is also a state of the processor • Servicing an interrupt raises processor IRQL to that interrupt’s IRQL • this masks subsequent interrupts at equal and lower IRQLs • User mode is limited to IRQL 0 • No waits (sleep) or page faults at IRQL >= DISPATCH_LEVEL

Interrupt Precedence via IRQLs (x86) 31 High 30 Power fail 29 Interprocessor Interrupt 28 Clock Hardware interrupts Profile & Synch (Srv 2003) . . . Device 1 Deferrable software interrupts 2 Dispatch/DPC APC 1 normal thread execution 0 Passive/Low

IRQLs on 64-bit Systems x64 IA64 15 High/Profile/Power High/Profile 14 Interprocessor Interrupt Interprocessor Interrupt/Power 13 Clock Clock 12 Synch (MP only) Synch (Srv 2003) Device n Device n . . 4 Device 1 . 3 Correctable Machine Check Device 1 2 Dispatch/DPC & Synch (UP only) Dispatch/DPC APC APC 1 Passive/Low Passive/Low 0

Interrupt Prioritization & Delivery • IRQLs are determined as follows: • x86 UP systems: IRQL = 27 - IRQ • x86 MP systems: bucketized (random) • x64 & IA64 systems: IRQL = IDT vector number / 16 • On MP systems, which processor is chosen to deliver an interrupt? • By default, any processor can receive an interrupt from any device • Can be configured with IntFilter utility in Resource Kit • On x86 and x64 systems, the IOAPIC (I/O advanced programmable interrupt controller) is programmed to interrupt the processor running at the lowest IRQL • On IA64 systems, the SAPIC (streamlined advanced programmable interrupt controller) is configured to interrupt one processor for each interrupt source • Processors are assigned round robin for each interrupt vector

WinNT interrupt handling (top-half) user or kernel mode code kernel mode Note: no thread or process context switch! Interrupt dispatch routine interrupt ! Disable interrupts Record machine state (trap frame) to allow resume Disable (some) interrupts Find and call appropriate ISR Dismiss interrupt Restore machine state (including mode and enabled interrupts) Interrupt service routine (top-half) Tell the device to stop interrupting Interrogate device state, start next operation on device, etc. Defer part of the work, request DPC (bottom-half) Return to caller

DPC object DPC object DPC object Deferred Procedure Calls (DPCs)(atomic bottom-half) • Used to defer processing from higher (device) interrupt level to a lower (dispatch) level • Also used for quantum end and timer expiration • Driver (usually ISR) queues request • One queue per CPU. DPCs are normally queued to the current processor, but can be targeted to other CPUs • Executes specified procedure at dispatch IRQL (or “dispatch level”, also “DPC level”) when all higher-IRQL work (interrupts) completed queue head

DPC DPC DPC DPC Delivering a DPC 1. Timer expires, kernelqueues DPC that willrelease all waiting threads.Kernel requests SW int. DPC routines can‘tassume whatprocess addressspace is currentlymapped Interruptdispatch table high Power failure 2. DPC interrupt occurswhen IRQL drops belowdispatch/DPC level 3. After DPC interrupt,control transfers tothread dispatcher dispatcher Dispatch/DPC APC DPC queue Low DPC routines can call kernel functionsbut can‘t call system services, generatepage faults, or create or wait on objects 4. Dispatcher executes each DPCroutine in DPC queue

Asynchronous Procedure Calls (APCs) • Execute code in context of a particular user thread • APC routines can acquire resources (objects), incur page faults,call system services • APC queue is thread-specific • User mode & kernel mode APCs • Permission required for user mode APCs • Executive uses APCs to complete work in thread space • Wait for asynchronous I/O operation • Emulate delivery of POSIX signals • Make threads suspend/terminate itself (env. subsystems) • APCs are delivered when thread is in alertable wait state • WaitForMultipleObjectsEx(), SleepEx()

Asynchronous Procedure Calls(APCs) • Special kernel APCs (process-context bottom half) • Run in kernel mode, at IRQL 1 • Always deliverable unless thread is already at IRQL 1 or above • Used for I/O completion reporting from “arbitrary thread context” • Kernel-mode interface is linkable, but not documented • “Ordinary” kernel APCs • Always deliverable if at IRQL 0, unless explicitly disabled (disable with KeEnterCriticalRegion) • User mode APCs • Used for I/O completion callback routines (see ReadFileEx, WriteFileEx); also, QueueUserApc • Only deliverable when thread is in “alertable wait” Thread Object K APC objects U

WinNT Spinlocks in Action CPU 1 CPU 2 Try to acquire spinlock: Test, set, WAS CLEAR (got the spinlock!) Begin updating data that’s protected by the spinlock (done with update) Release the spinlock: Clear the spinlock bit Try to acquire spinlock: Test, set, was set, loop Test, set, was set, loop Test, set, was set, loop Test, set, was set, loop Test, set, WAS CLEAR (got the spinlock!) Begin updating data

Queued Spinlocks • Problem: Checking status of spinlock via test-and-set operation creates bus contention • Queued spinlocks maintain queue of waiting processors • First processor acquires lock; other processors wait on processor-local flag • Thus, busy-wait loop requires no access to the memory bus • When releasing lock, the first processor’s flag is modified • Exactly one processor is being signaled • Pre-determined wait order

WinNT waiting (by means of scheduler) • Flexible wait calls • Wait for one or multiple objects in one call • Wait for multiple can wait for “any” one or “all” at once • “All”: all objects must be in the signalled state concurrently to resolve the wait • All wait calls include optional timeout argument • Waiting threads consume no CPU time • Waitable objects include: • Events (may be auto-reset or manual reset; may be set or “pulsed”) • Mutexes (“mutual exclusion”, one-at-a-time) • Semaphores (n-at-a-time) • Timers • Processes and Threads (signalled upon exit or terminate) • Directories (change notification) • No guaranteed ordering of wait resolution • If multiple threads are waiting for an object, and only one thread is released (e.g. it’s a mutex or auto-reset event), which thread gets released is unpredictable • Typical order of wait resolution is FIFO; however APC delivery may change this order

Executive Synchronization • Waiting on Dispatcher Objects – outside the kernel Create and initialize thread object Initialized Wait is complete; Set object tosignaled state Thread waitson an objecthandle Waiting Ready Terminated Transition Standby Running Interaction with thread scheduling

Interactions between Synchronization and Thread Dispatching • User mode thread waits on an event object‘s handle • Kernel changes thread‘s scheduling state from ready to waiting and adds thread to wait-list • Another thread sets the event • Kernel wakes up waiting threads; variable priority threads get priority boost • Dispatcher re-schedules new thread – it may preempt running thread it it has lower priority and issues software interrupt to initiate context switch • If no processor can be preempted, the dispatcher places the ready thread in the dispatcher ready queue to be scheduled later

What signals an object? System events and resultingstate change Dispatcher object Effect of signaled stateon waiting threads Owning thread releases mutex Mutex (kernel mode) Kernel resumes one waiting thread nonsignaled signaled Resumed thread acquires mutex Owning thread or otherthread releases mutex Mutex (exported to user mode) Kernel resumes one waiting thread nonsignaled signaled Resumed thread acquires mutex One thread releases thesemaphore, freeing a resource Semaphore Kernel resumes one or more waiting threads nonsignaled signaled A thread acquires the semaphore.More resources are not available

What signals an object? (contd.) Dispatcher object System events and resultingstate change Effect of signaled stateon waiting threads A thread sets the event Event Kernel resumes one or more waiting threads nonsignaled signaled Kernel resumes one or more threads Dedicated thread sets oneevent in the event pair Event pair Kernel resumes waitingdedicated thread nonsignaled signaled Kernel resumes theother dedicated thread Timer expires Timer Kernel resumes all waiting threads nonsignaled signaled A thread (re) initializes the timer

What signals an object? (contd.) Dispatcher object System events and resultingstate change Effect of signaled stateon waiting threads IO operation completes File Kernel resumes waitingdedicated thread nonsignaled signaled Thread initiates wait on an IO port Process terminates Process Kernel resumes all waiting threads nonsignaled signaled A process reinitializesthe process object Thread terminates Thread Kernel resumes all waiting threads nonsignaled signaled A thread reinitializesthe thread object

WinNT Wait Internals 1:Dispatcher Objects • Any kernel object you can wait for is a “dispatcher object” • some exclusively for synchronization • e.g. events, mutexes (“mutants”), semaphores, queues, timers • others can be waited for as a side effect of their prime function • e.g. processes, threads, file objects • non-waitable kernel objects are called “control objects” • All dispatcher objects have a common header • wait list (queue) embedded • All dispatcher objects are in one of two states • “signaled” vs. “nonsignaled” • when signalled, a wait on the object is satisfied • different object types differ in terms of what changes their state • wait and unwait implementation iscommon to all types of dispatcher objects Dispatcher Object Size Type State Wait listhead Object-type-specific data (see \ntddk\inc\ddk\ntddk.h)

Represent a thread’s reference to something it’s waiting for (one per handle passed to WaitFor…) All wait blocks from a given wait call are chained to the waiting thread Type indicates wait for “any” or “all” Key denotes argument list position for WaitForMultipleObjects WaitBlockList List entry Thread Object Key Type Next link Size Type State Wait listhead List entry List entry Thread Thread Object Object Key Type Key Type Next link Next link Thread Objects Wait Internals 2:Wait Blocks WaitBlockList Dispatcher Objects Wait blocks Size Type State Wait listhead Object-type-specific data Object-type-specific data

Linux 2.6 Advanced operating systems course, The University of Tokyo

Linux Interrupt handling and Windows IRQLs • Linux doesn’t use the notion of IRQ levels (IRQL) • IRQs can be masked (disabled) one-by-one • enable_irq(int irq) / disable_irq(int irq) • There are executions contexts (similar to WinNT’s IRQL >= DISPATCH_LEVEL) • in_interrput(): • CPU is executing a hardware or software interrupt handler • in_atomic(): • CPU currently holds a spinlock, schedule() or sleeping is not available (can be on behalf of a process) • Note: in_interrupt () => in_atomic()

Linux interrupt handler (top-half) user or kernel mode code Switch to kernel stack Save registers Entry task interrupt ! Tell the device to stop interrupting Interrogate device state, start next operation on device, etc. Defer part of the work, request tasklet/workqueue (bottom-half) Interrupt handler routine Schedule if necessary Deliver signals if necessary Restore registers Activate user stack Exit task

Linux bottom halves (introduction) • Software interrupts (softIRQ): • Specified at kernel compile time • Atomic context, but interrupts are enabled • Tasklets: • Based on softIRQs • Interrupt context • Workqueues: • Executed by kernel threads • Process context

Software interrupts (softIRQ)(atomic bottom-half) • SoftIRQs are executed: • In the return from hardware interrupt code • In the ksoftirqd kernel thread • In any code that explicitly checks for and executes pending softirqs, such as the networking subsystem • Interrupts are enabled! • Can be executed simultaneously on multiple CPUs • Code must be reentrant, requires careful locking • Prioritized:

Tasklets (atomic bottom-half) • Build on top of softIRQs • Note: has nothing to do with tasks.. • Can be created and destroyed dynamically • Synchronized with respect to itself • The same tasklet cannot preempt itself • Interrupt context (no schedule(), sleeping)

Workqueues (process context bottom-half) • Can be added and destroyed dynamically • Run in process context by kernel threads • Schedulable • Can sleep, but no user mode address space is available • Meant to be used where memory allocation, obtaining semaphores, performing I/O is necessary…

Synchronization and locking in Linux • Spinlocks (basically the same how it is in Windows) • Reader-writer spinlocks • Multiple readers can access the resource at the same time • Write access is exclusive • Waitqueues • Semaphores • Reader-writer semaphores: down_read/down_write in rwsem.h • Multiple readers, exclusive writers • Completions in completion.h • Read-Copy-Update (RCU) locks

Waitqueues • Realized by a list(of tasks) and a spinlock • Tasks can wait: • TASK_INTERRUPTIBLE • TASK_UNINTERRUPTIBLE • Waiting tasks can be woken up • struct __wait_queue_head { • spinlock_t lock; • struct list_head task_list; • };

Semaphores and mutexes • Realized by: • A counter specifying how many processes may be in the in the critical section • Number of sleepers • Waitqueue • Mutexes are binary semaphores structsemaphore { atomic_tcount; intsleepers; wait_queue_head_twait; };

実践システムソフトウェア Practical System Software

実践システムソフトウェア Practical System Software

Presentation Transcript

Chapter 4

Practical Use of MDA Tables

Chapter 5: System Software: Operating Systems and Utility Programs

System Software and Machine Architecture

Practical System of Accounting : Cash Book

Patterns and Anti-Patterns in Hibernate

Object-Oriented Software Engineering Practical Software Development using UML and Java

Multi-Core System on Chip 설계 동향 2 발표 : 조준동 교수 2003 년 12 월

INTRODUCTION TO DIGITAL SWITCHING SYSTEM EWSD (Electronics Switching System Digital)

Practical Malware Analysis

Practical Aspects of Modern Cryptography

An Open Framework for Certified System Software

PRACTICAL INFECTION CONTROL-1

Engineering

Basics Of Computers

Information system architectures and architecting

Design and Software Architecture

Software Engineering Methods Software Design

Linux System Call

Database Design, a Practical Guide