Kernel Synchronization

Kernel Synchronization 9662542 吳承峯 9662638 陳誌偉 9665516 劉銘傑

Kernel Synchronization • Think of the kernel as a server that answers requests. • Requests can come either from a process running on a CPU or from an external device issuing an interrupt request. • Parts of the kernel are not run serially, but in an interleaved way. • Interleaving gives rise to race conditions → synchronization techniques are needed.

How the Kernel Services Requests • Think of the kernel as a waiter who must satisfy two types of requests: • Boss requests → interrupts • Customer requests → system calls or exceptions raised by User Mode processes. • A boss calls while the waiter is idle → the waiter starts servicing the boss. • A boss calls while the waiter is servicing a customer → the waiter stops servicing the customer and starts servicing the boss.

How the Kernel Services Requests • A boss calls while the waiter is servicing another boss → the waiter stops servicing the first boss and starts servicing the second one. • When he finishes servicing the new boss, he resumes servicing the former one. • One of the bosses may induce the waiter to leave the customer currently being serviced. • After servicing the last request of the bosses, the waiter may decide to temporarily drop his customer and pick up a new one.

Kernel Preemption • It is very hard to give a good definition of kernel preemption. • In general, a kernel is preemptive if a process switch may occur while the replaced process is executing a kernel function, that is, while it runs in Kernel Mode.

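Types of Process Switches • Two kinds of process switches: • Planned process switch: in both preemptive and nonpreemptive kernels, a process running in Kernel Mode can voluntarily relinquish the CPU. • Forced process switch: a process running in Kernel Mode is replaced in reaction to an asynchronous event that could induce a process switch (for instance, an interrupt handler that awakens a higher priority process). Preemptive and nonpreemptive kernels differ in how they handle this case.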

The switch_to macro • All process switches are performed by the switch_to macro. • However, in nonpreemptive kernels, the current process cannot be replaced unless it is about to switch to User Mode. • The main characteristic of a preemptive kernel is that a process running in Kernel Mode can be replaced by another process while in the middle of a kernel function.

Example of the difference between preemptive and nonpreemptive kernels • While process A executes an exception handler (necessarily in Kernel Mode), a higher priority process B becomes runnable. • For instance, an IRQ occurs and the corresponding handler awakens process B. • Preemptive kernel: • A forced process switch replaces process A with B. • The exception handler is left unfinished and will be resumed only when the scheduler selects process A again for execution. • Nonpreemptive kernel: • No process switch occurs until process A either finishes executing the exception handler or voluntarily relinquishes the CPU.

Example of the difference between preemptive and nonpreemptive kernels • Consider a process that executes an exception handler and whose time quantum expires. • Preemptive kernel: • The process may be replaced immediately. • Nonpreemptive kernel: • The process continues to run until it finishes executing the exception handler or voluntarily relinquishes the CPU.

Motivation of Kernel Preemption • Reduce the dispatch latency of the User Mode processes, that is, the delay between the time they become runnable and the time they actually begin running. • Processes performing timely scheduled tasks really benefit from kernel preemption, because it reduces the risk of being delayed by another process running in Kernel Mode.

Synchronization Primitives • This section examines how kernel control paths can be interleaved while avoiding race conditions on shared data.

Per-CPU Variables • The simplest and most efficient synchronization technique consists of declaring kernel variables as per-CPU variables. • A per-CPU variable is an array of data structures, one element for each CPU in the system. • A CPU should not access the elements of the array corresponding to the other CPUs. • It can freely read and modify its own element without fear of race conditions, because it is the only CPU entitled to do so.

Disadvantages • Per-CPU variables do not provide protection against accesses from asynchronous functions (interrupt handlers and deferrable functions). • Per-CPU variables are prone to race conditions caused by kernel preemption, both in uniprocessor and multiprocessor systems. • As a general rule, a kernel control path should access a per-CPU variable with kernel preemption disabled.
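The per-CPU idea can be sketched in user space as an array with one slot per CPU. This is only an illustration: the names demo_hits, demo_hit and DEMO_NR_CPUS are hypothetical, the CPU index is passed in by hand, and the 64-byte padding is an assumption about the cache line size. In the kernel, the get_cpu_var() and put_cpu_var() macros select the local CPU's element and disable and re-enable kernel preemption around the access.

```c
#include <stddef.h>

#define DEMO_NR_CPUS 4   /* hypothetical number of CPUs */

/* One slot per CPU. The padding (assumed 64-byte cache line) keeps
 * two CPUs from writing into the same cache line. */
struct demo_percpu_counter {
    unsigned long value;
    char pad[64 - sizeof(unsigned long)];
};

static struct demo_percpu_counter demo_hits[DEMO_NR_CPUS];

/* Must be called only by code running on CPU `cpu`.
 * No lock is needed: that CPU is the only one touching this slot.
 * In the kernel this access would run with preemption disabled,
 * so the control path cannot migrate to another CPU halfway through. */
static void demo_hit(int cpu)
{
    demo_hits[cpu].value++;
}
```

Each CPU pays no synchronization cost for its own slot, which is why the technique is cheap; the price is that the data must be something that can be split per CPU.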

Table 5-3. Functions and macros for the per-CPU variables

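Atomic Operations • Several assembly language instructions are of type "read-modify-write". • They access a memory location twice: first to read the old value, then to write the new one. • If two kernel control paths on two CPUs execute such an instruction on the same location at the same time, a race condition may occur. • The easiest way to prevent this is to ensure that such operations are atomic at the chip level. • An atomic operation is executed in a single instruction without being interrupted in the middle, • and avoids accesses to the same memory location by other CPUs.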

Atomic Operation on 80x86 Instructions • Read-modify-write assembly language instructions (such as inc or dec) are atomic if no other processor has taken the memory bus after the read and before the write. • Read-modify-write assembly language instructions whose opcode is prefixed by the lock byte (0xf0) are atomic even on a multiprocessor system. • When the control unit detects the prefix, it "locks" the memory bus until the instruction is finished.

Atomic Operation on 80x86 Instructions • Assembly language instructions whose opcode is prefixed by a rep byte (0xf2, 0xf3, which forces the control unit to repeat the same instruction several times) are not atomic. • The control unit checks for pending interrupts before executing a new iteration.

Atomic operations in Linux • Linux provides a special atomic_t type (an atomically accessible counter) and some special functions and macros (see Table 5-4) that act on atomic_t variables and are implemented as single, atomic assembly language instructions. • On multiprocessor systems, each such instruction is prefixed by a lock byte. • The type is declared as:
    typedef struct { volatile int counter; } atomic_t;
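A user-space approximation of an atomic counter can be written with C11 <stdatomic.h>. This is a sketch, not the kernel implementation: the demo_ names are hypothetical and merely mirror the kernel's atomic_read, atomic_set, atomic_inc, atomic_dec_and_test and atomic_sub_and_test.

```c
#include <stdatomic.h>

/* Hypothetical user-space counterpart of atomic_t. */
typedef struct { atomic_int counter; } demo_atomic_t;

static int demo_atomic_read(demo_atomic_t *v)
{
    return atomic_load(&v->counter);
}

static void demo_atomic_set(demo_atomic_t *v, int i)
{
    atomic_store(&v->counter, i);
}

static void demo_atomic_inc(demo_atomic_t *v)
{
    atomic_fetch_add(&v->counter, 1);   /* one indivisible read-modify-write */
}

/* Decrements and returns 1 if the result is zero, 0 otherwise. */
static int demo_atomic_dec_and_test(demo_atomic_t *v)
{
    /* atomic_fetch_sub returns the value held before the subtraction. */
    return atomic_fetch_sub(&v->counter, 1) == 1;
}

/* Subtracts i and returns 1 if the result is zero, 0 otherwise. */
static int demo_atomic_sub_and_test(int i, demo_atomic_t *v)
{
    return atomic_fetch_sub(&v->counter, i) == i;
}
```

The "and_test" variants matter because testing the counter in a separate step after the update would reopen the window for a race.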

Table 5-4. Atomic operations in Linux

Spin Locks • When a kernel control path must access a shared data structure or enter a critical region, it needs to acquire a "lock" for it. • A resource protected by a lock is like a resource confined in a room whose door is locked when someone is inside. • If a kernel control path wishes to access the resource, it tries to "open the door" by acquiring the lock. • It succeeds only if the resource is free. • When the kernel control path releases the lock, the door is unlocked and another kernel control path may enter the room.

Protecting critical regions with several locks • Five kernel control paths (P0, P1, P2, P3, and P4) are trying to access two critical regions (C1 and C2).

Spin Locks • Spin locks are a special kind of lock designed to work in a multiprocessor environment. • If the kernel control path finds the spin lock "open", • it acquires the lock and continues its execution. • If the kernel control path finds the lock "closed" by a kernel control path running on another CPU, • it "spins" around, repeatedly executing a tight instruction loop, until the lock is released. • The instruction loop of spin locks represents a "busy wait". • Many kernel resources are locked for a fraction of a millisecond only; therefore, it would be far more time consuming to release the CPU and reacquire it later.

Spin Lock Structure • In Linux, each spin lock is represented by a spinlock_t structure consisting of two fields: • slock • the value 1 corresponds to the unlocked state • every negative value and 0 denote the locked state • break_lock • flag signaling that a process is busy waiting for the lock (present only if the kernel supports both SMP and kernel preemption)

Table 5-7. Spin lock macros • All these macros are based on atomic operations; this ensures that the spin lock will be updated properly even when multiple processes running on different CPUs try to modify the lock at the same time

The spin_lock macro with kernel preemption • The spin_lock macro is used to acquire a spin lock. • It is first declared as
    #define spin_lock(lock) _spin_lock(lock)
• The _spin_lock() macro is then defined as
    #define _spin_lock(lock) \
    do { \
        preempt_disable(); \
        _raw_spin_lock(lock); \
        __acquire(lock); \
    } while(0)

The spin_lock macro • The following description refers to a preemptive kernel for an SMP system. The macro takes the address slp of the spin lock as its parameter and executes the following steps: • 1. Invokes preempt_disable( ) to disable kernel preemption. • 2. Invokes the _raw_spin_trylock( ) function, which does an atomic test-and-set operation on the spin lock's slock field.

The spin_lock macro • 3. If the old value of the spin lock was positive, the macro terminates: the kernel control path has acquired the spin lock. • 4. Otherwise, the kernel control path failed to acquire the spin lock, so the macro must cycle until the spin lock is released by a kernel control path running on some other CPU. It invokes preempt_enable( ) to undo the increase of the preemption counter done in step 1. • If kernel preemption was enabled before executing the spin_lock macro, another process can now replace this process while it waits for the spin lock.

The spin_lock macro • 5. If the break_lock field is 0, sets it to 1. • By checking this field, the process owning the lock and running on another CPU can learn whether there are other processes waiting for the lock. • If a process holds a spin lock for a long time, it may decide to release it prematurely to allow another process waiting for the same spin lock to progress. • 6. Executes the wait cycle: while (spin_is_locked(slp) && slp->break_lock) cpu_relax(); • 7. Jumps back to step 1 to try once more to get the spin lock.
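The acquisition procedure described above can be approximated in user space with C11 atomics. This is a minimal sketch under stated assumptions: the demo_ names are hypothetical, there is no user-space equivalent of preempt_disable( ) and preempt_enable( ), so they appear only as comments, and clearing break_lock after the lock is acquired is an assumption of this sketch rather than something stated in the slides.

```c
#include <stdatomic.h>

/* Hypothetical user-space stand-in for spinlock_t. */
struct demo_spinlock {
    atomic_int slock;       /* 1 = unlocked, 0 or negative = locked */
    atomic_int break_lock;  /* nonzero: somebody is busy waiting */
};

#define DEMO_SPINLOCK_INIT { 1, 0 }

/* Atomic test-and-set: write 0 (locked) and inspect the old value.
 * Returns 1 if the lock was free and is now ours, 0 otherwise. */
static int demo_spin_trylock(struct demo_spinlock *l)
{
    return atomic_exchange(&l->slock, 0) > 0;
}

static int demo_spin_is_locked(struct demo_spinlock *l)
{
    return atomic_load(&l->slock) <= 0;
}

static void demo_spin_lock(struct demo_spinlock *l)
{
    for (;;) {
        /* (the kernel disables preemption here) */
        if (demo_spin_trylock(l))
            break;                          /* old value was positive */
        /* (the kernel re-enables preemption here) */
        atomic_store(&l->break_lock, 1);    /* tell the owner we wait */
        while (demo_spin_is_locked(l) && atomic_load(&l->break_lock))
            ;                               /* cpu_relax() in the kernel */
        /* loop back and try the test-and-set again */
    }
    atomic_store(&l->break_lock, 0);        /* assumption: reset on success */
}

static void demo_spin_unlock(struct demo_spinlock *l)
{
    atomic_store(&l->slock, 1);             /* a plain store releases it */
}
```

The wait loop only reads the lock; the expensive atomic exchange is retried only once the lock looks free.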

The spin_lock macro without kernel preemption • If the kernel preemption option has not been selected when the kernel was compiled, the spin_lock macro produces an assembly language fragment that is essentially equivalent to the following tight busy wait:
    1:  lock; decb slp->slock
        jns 3f
    2:  pause
        cmpb $0,slp->slock
        jle 2b
        jmp 1b
    3:
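For readers less familiar with 80x86 assembly, the same busy wait can be rendered in C11. This is a hedged translation, not kernel code: the function names are hypothetical, and an int is used where the kernel operates on a single byte.

```c
#include <stdatomic.h>

/* slock convention: 1 = unlocked, 0 or negative = locked. */
static void demo_spin_lock_nopreempt(atomic_int *slock)
{
    for (;;) {
        /* 1: lock; decb slp->slock  /  jns 3f
         * Atomically decrement; a non-negative result means the lock
         * was free (1 -> 0) and now belongs to us. */
        if (atomic_fetch_sub(slock, 1) - 1 >= 0)
            return;                 /* 3: lock acquired */

        /* 2: pause  /  cmpb $0,slp->slock  /  jle 2b
         * Spin with plain reads until the lock looks free. */
        while (atomic_load(slock) <= 0)
            ;

        /* jmp 1b: the lock looked free, retry the atomic decrement. */
    }
}

static void demo_spin_unlock_nopreempt(atomic_int *slock)
{
    atomic_store(slock, 1);         /* movb $1, slp->slock */
}
```

Failed attempts leave slock negative, which is harmless because the unlock path stores 1 unconditionally.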

The spin_unlock macro • The spin_unlock macro releases a previously acquired spin lock; it essentially executes the assembly language instruction
    movb $1, slp->slock
and then invokes
    preempt_enable( )   /* does nothing if kernel preemption is not supported */
• Notice that the lock byte is not used because write-only accesses in memory are always atomically executed by the current 80x86 microprocessors.

Read/Write Spin Locks • Read/write spin locks have been introduced to increase the amount of concurrency inside the kernel. • They allow several kernel control paths to simultaneously read the same data structure, as long as no kernel control path modifies it. • If a kernel control path wishes to write to the structure, it must acquire the write version of the read/write lock, which grants exclusive access to the resource.

Read/Write Spin Locks • Two critical regions (C1 and C2) protected by read/write locks. • Kernel control paths R0 and R1 are reading the data structures in C1 at the same time, while W0 is waiting to acquire the lock for writing.

Read/write Spin Lock Structure • Each read/write spin lock is a rwlock_t structure. • Its lock field is a 32-bit field that encodes two distinct pieces of information: • A 24-bit counter, lock[0..23], denoting the number of kernel control paths currently reading the protected data structure; the counter is stored in two's complement form. • An unlock flag, lock[24], that is set when no kernel control path is reading or writing, and clear otherwise.

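Read/write Spin Lock Structure • Notice the values taken by the lock field: • lock[0..31] = 0x01000000 when the lock is idle (unlock flag set, no readers). • lock[0..31] = 0x00000000 when the lock is held for writing (unlock flag clear, no readers). • lock[0..31] = 0x00ffffff, 0x00fffffe, and so on when the lock is held for reading by one, two, or more kernel control paths (unlock flag clear, two's complement of the number of readers in the counter). • The rwlock_t structure also includes a break_lock field to signal that a process is busy waiting for the lock.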
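The encoding of the lock field can be checked with a small decoding helper. The function and constant names below are hypothetical, and the helper only understands the stable states listed above (idle, one writer, or some readers), not the transient values that appear while a failed attempt is being undone.

```c
#include <stdint.h>

#define DEMO_RW_BIAS 0x01000000u   /* unlock flag: bit 24 set, counter 0 */

/* Returns -1 if the lock is held by a writer,
 * otherwise the number of readers (0 means idle). */
static int demo_rwlock_readers(uint32_t lock)
{
    if (lock == 0)
        return -1;                      /* flag clear, no readers: writer */
    /* Each reader subtracted 1 from the idle value. */
    return (int)(DEMO_RW_BIAS - lock);
}
```

Worked through by hand: 0x01000000 - 0x00ffffff = 1 reader, and 0x01000000 - 0x00fffffe = 2 readers.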

Getting and releasing a lock for reading • If the kernel preemption option has been selected when the kernel was compiled, the read_lock macro performs the very same actions as those of spin_lock( ), with just one exception: step 2 executes the _raw_read_trylock( ) function:
    int _raw_read_trylock(rwlock_t *lock)
    {
        /* Treat the lock field as an atomic counter. */
        atomic_t *count = (atomic_t *)&lock->lock;

        atomic_dec(count);              /* tentatively register as a reader */
        if (atomic_read(count) >= 0)
            return 1;                   /* no writer holds the lock */
        atomic_inc(count);              /* undo the decrement */
        return 0;
    }

The read_lock macro with kernel preemption • The read/write lock counter of the lock field is accessed by means of atomic operations. • Notice that the whole function does not act atomically on the counter: • For instance, the counter might change after having tested its value with the if statement and before returning 1. • Nevertheless, the function works properly: in fact, the function returns 1 only if the counter was not zero or negative before the decrement, because the counter is equal to 0x01000000 for no owner, 0x00ffffff for one reader, and 0x00000000 for one writer.

The read_lock macro without kernel preemption • If the kernel preemption option has not been selected when the kernel was compiled, the read_lock macro yields the following assembly language code:
        movl $rwlp->lock,%eax
        lock; subl $1,(%eax)
        jns 1f
        call __read_lock_failed
    1:

The read_lock macro without kernel preemption • where __read_lock_failed( ) is the following assembly language function:
    __read_lock_failed:
        lock; incl (%eax)
    1:  pause
        cmpl $1,(%eax)
        js 1b
        lock; decl (%eax)
        js __read_lock_failed
        ret

Releasing a Read Lock • Releasing the read lock is quite simple, because the read_unlock macro must simply increase the counter in the lock field, which decreases the number of readers, with the assembly language instruction:
    lock; incl rwlp->lock
• It then invokes preempt_enable( ) to reenable kernel preemption.

Getting and releasing a lock for writing • The write_lock macro is implemented in the same way as spin_lock( ) and read_lock( ). • If kernel preemption is supported, the function disables kernel preemption and tries to grab the lock right away by invoking _raw_write_trylock( ). • If this function returns 0, the lock was already taken, thus the macro reenables kernel preemption and starts a busy wait loop, as explained in the description of spin_lock( ) in the previous section.

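The _raw_write_trylock( ) function • The _raw_write_trylock( ) function is shown below:
    int _raw_write_trylock(rwlock_t *lock)
    {
        /* Treat the lock field as an atomic counter. */
        atomic_t *count = (atomic_t *)&lock->lock;

        /* Clear the unlock flag; the result is 0 only if the lock was idle. */
        if (atomic_sub_and_test(0x01000000, count))
            return 1;
        atomic_add(0x01000000, count);  /* undo the subtraction */
        return 0;
    }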

Releasing the Write Lock • Releasing the write lock is much simpler because the write_unlock macro must simply set the unlock flag in the lock field with the assembly language instruction:
    lock; addl $0x01000000,rwlp
• It then invokes preempt_enable().
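The read and write try-lock logic can be exercised in user space with C11 atomics. This is a sketch with hypothetical demo_ names, and it differs from the functions shown earlier in one respect: the reader path uses the value returned by a single atomic subtraction instead of a decrement followed by a separate read.

```c
#include <stdatomic.h>

#define DEMO_RW_BIAS 0x01000000   /* idle value: unlock flag set */

/* Hypothetical user-space stand-in for rwlock_t. */
struct demo_rwlock { atomic_int lock; };

#define DEMO_RWLOCK_INIT { DEMO_RW_BIAS }

/* Returns 1 if a read lock was obtained, 0 if a writer holds the lock. */
static int demo_read_trylock(struct demo_rwlock *rw)
{
    /* atomic_fetch_sub returns the old value; old - 1 is the new one. */
    if (atomic_fetch_sub(&rw->lock, 1) - 1 >= 0)
        return 1;
    atomic_fetch_add(&rw->lock, 1);     /* undo: a writer is active */
    return 0;
}

static void demo_read_unlock(struct demo_rwlock *rw)
{
    atomic_fetch_add(&rw->lock, 1);     /* one reader fewer */
}

/* Returns 1 if the write lock was obtained, 0 if the lock was busy. */
static int demo_write_trylock(struct demo_rwlock *rw)
{
    /* The new value is 0 only if the old value was exactly the bias,
     * that is, no readers and no writer. */
    if (atomic_fetch_sub(&rw->lock, DEMO_RW_BIAS) == DEMO_RW_BIAS)
        return 1;
    atomic_fetch_add(&rw->lock, DEMO_RW_BIAS);  /* undo */
    return 0;
}

static void demo_write_unlock(struct demo_rwlock *rw)
{
    atomic_fetch_add(&rw->lock, DEMO_RW_BIAS);  /* set the unlock flag */
}
```

A failed attempt always restores the field, so the lock word returns to one of its stable values once the attempt is over.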

Seqlocks • Seqlocks in Linux 2.6 are similar to read/write spin locks, except that they give a much higher priority to writers: • A writer is allowed to proceed even when readers are active. • The good part of this strategy is that a writer never waits (unless another writer is active) • The drawback is that a reader may sometimes be forced to read the same data several times until it gets a valid copy.

The Seqlock Structure • Each seqlock is a seqlock_t structure consisting of two fields: • a lock field of type spinlock_t • an integer sequence field, which acts as a sequence counter • Each reader must read this sequence counter twice, before and after reading the data, and check whether the two values coincide. • If they differ, a new writer has become active and has increased the sequence counter, thus implicitly telling the reader that the data just read is not valid.

The Writer • A seqlock_t variable is initialized to "unlocked" either • by assigning to it the value SEQLOCK_UNLOCKED, or • by executing the seqlock_init macro. • Writers acquire and release a seqlock by invoking write_seqlock( ) and write_sequnlock( ) • write_seqlock( ) • acquires the spin lock in the seqlock_t data structure, then increases the sequence counter by 1 • write_sequnlock( ) • increases the sequence counter again, then releases the spin lock. • This ensures that • when the writer is in the middle of writing, the counter is odd • when no writer is altering data, the counter is even.

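The Reader • Readers implement a critical region as follows: • read_seqbegin() returns the current sequence number of the seqlock; the reader saves it in a local variable seq before reading the data. • read_seqretry() returns 1, telling the reader to repeat the read, if either • the seq local variable is odd (a writer was updating the data when read_seqbegin() was invoked), or • the value of seq does not match the current value of the seqlock's sequence counter (a writer started working while the reader was still in its critical region).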
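The reader's retry loop and the writer's odd/even protocol can be sketched in user space as follows. All demo_ names are hypothetical; an atomic_flag stands in for the spinlock_t field, and the protected data is a pair of atomic integers accessed with the default sequentially consistent ordering, which keeps the sketch simple at the cost of speed.

```c
#include <stdatomic.h>

/* Hypothetical seqlock-protected pair of coordinates. */
struct demo_seq_point {
    atomic_flag writer_lock;   /* stands in for the spinlock_t field */
    atomic_uint sequence;      /* even: stable, odd: writer active */
    atomic_int x, y;           /* protected data */
};

#define DEMO_SEQ_POINT_INIT { ATOMIC_FLAG_INIT, 0, 0, 0 }

static void demo_write_seqlock(struct demo_seq_point *p)
{
    while (atomic_flag_test_and_set(&p->writer_lock))
        ;                                   /* wait for other writers */
    atomic_fetch_add(&p->sequence, 1);      /* counter becomes odd */
}

static void demo_write_sequnlock(struct demo_seq_point *p)
{
    atomic_fetch_add(&p->sequence, 1);      /* counter becomes even */
    atomic_flag_clear(&p->writer_lock);
}

static unsigned demo_read_seqbegin(struct demo_seq_point *p)
{
    return atomic_load(&p->sequence);
}

/* Returns 1 if the data read since demo_read_seqbegin() may be invalid. */
static int demo_read_seqretry(struct demo_seq_point *p, unsigned seq)
{
    return (seq & 1) || atomic_load(&p->sequence) != seq;
}

static void demo_set_point(struct demo_seq_point *p, int x, int y)
{
    demo_write_seqlock(p);
    atomic_store(&p->x, x);
    atomic_store(&p->y, y);
    demo_write_sequnlock(p);
}

/* Reader critical region: repeat until a consistent snapshot is read. */
static void demo_get_point(struct demo_seq_point *p, int *x, int *y)
{
    unsigned seq;
    do {
        seq = demo_read_seqbegin(p);
        *x = atomic_load(&p->x);
        *y = atomic_load(&p->y);
    } while (demo_read_seqretry(p, seq));
}
```

Note that the reader never writes to the lock, so readers cannot delay a writer; they simply pay with a repeated read when a writer interferes.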

Conditions of using seqlock • As a general rule, the following conditions must hold: • The data structure to be protected does not include pointers that are modified by the writers and dereferenced by the readers. • Otherwise, a writer could change the pointer under the nose of the readers. • The critical regions of the readers should be short and writers should seldom acquire the seqlock. • Otherwise, repeated read accesses would cause a severe overhead.

Read-Copy Update (RCU) • Read-copy update (RCU) is another synchronization technique designed to protect data structures that are mostly accessed for reading by several CPUs. • RCU allows many readers and many writers to proceed concurrently • An improvement over seqlocks, which allow only one writer to proceed. • RCU is lock-free • It uses no lock or counter shared by all CPUs. • RCU is used in the networking layer and in the Virtual Filesystem

Key Idea of RCU • The key idea consists of limiting the scope of RCU as follows: • Restriction 1: Only data structures that are dynamically allocated and referenced by means of pointers can be protected by RCU. • Restriction 2: No kernel control path can sleep inside a critical region protected by RCU.
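Restriction 1 hints at why pointers matter: a writer can prepare a modified copy off to the side and then switch one pointer to it. The sketch below shows only this copy-and-publish half of the idea, under loud assumptions: all names are hypothetical, a single writer is assumed, the initial version is static only to keep the example short, and nothing here detects when readers are done with the old copy, which is the part the kernel's RCU machinery provides.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical data structure reached only through a pointer. */
struct demo_config {
    int timeout;
    int retries;
};

static struct demo_config demo_initial = { 30, 3 };
static _Atomic(struct demo_config *) demo_current = &demo_initial;

/* Reader: one atomic pointer load, no lock and no shared counter. */
static const struct demo_config *demo_read(void)
{
    return atomic_load(&demo_current);
}

/* Writer (single writer assumed): copy the current version into
 * `fresh`, modify the copy, then publish it with one pointer store.
 * Returns the old version, which may be reclaimed only after every
 * reader that could still hold it has finished. */
static struct demo_config *demo_update_timeout(struct demo_config *fresh,
                                               int timeout)
{
    struct demo_config *old = atomic_load(&demo_current);

    *fresh = *old;                 /* read and copy */
    fresh->timeout = timeout;      /* update the private copy */
    atomic_store(&demo_current, fresh);
    return old;
}
```

A reader that fetched the pointer before the update keeps seeing a complete, unmodified old version, and a reader that arrives later sees the complete new one; neither can observe a half-written structure.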
