
W4118 Operating Systems

Presentation Transcript


  1. W4118 Operating Systems Instructor: Junfeng Yang

  2. Logistics • Homework 2 update • Assembly code to call a hook function written in C: syscall_fail(long syscall_nr) • Clarifications on sys_fail(int ith, int ncall, struct syscall_failures calls) • Only fail system calls from the current process • Each sys_fail() injects only one failure • If a system call matches one of the system call numbers specified in the calls argument, count it as one matching call toward ith • Mac users: talk to the TA to get access to VMware Fusion

  3. Last lecture • Processes in Linux • Context switch on x86 • Kernel stack captures all state → switch stack = switch process • Threads: good at expressing concurrency efficiently • Multithreading Models: different tradeoffs • Race conditions

  4. Recall: banking example

int balance = 1000;

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, deposit, (void*)1);
    pthread_create(&t2, NULL, withdraw, (void*)2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("all done: balance = %d\n", balance);
    return 0;
}

void* deposit(void *arg) {
    int i;
    for (i = 0; i < 1e7; ++i)
        ++balance;
}

void* withdraw(void *arg) {
    int i;
    for (i = 0; i < 1e7; ++i)
        --balance;
}

  5. Recall: a closer look at the banking example

$ objdump -d bank
…
08048464 <deposit>:
…
// ++ balance
8048473: a1 80 97 04 08    mov 0x8049780,%eax
8048478: 83 c0 01          add $0x1,%eax
804847b: a3 80 97 04 08    mov %eax,0x8049780
…
0804849b <withdraw>:
…
// -- balance
80484aa: a1 80 97 04 08    mov 0x8049780,%eax
80484af: 83 e8 01          sub $0x1,%eax
80484b2: a3 80 97 04 08    mov %eax,0x8049780
…

  6. Avoiding Race Conditions • Race condition: a timing-dependent error involving shared state • Critical section: a segment of code that accesses a shared variable (or resource) and must not be concurrently executed by more than one thread

// ++ balance
mov 0x8049780,%eax
add $0x1,%eax
mov %eax,0x8049780
…
// -- balance
mov 0x8049780,%eax
sub $0x1,%eax
mov %eax,0x8049780
…

  7. How to implement critical sections? • Atomic operations: no other instructions can be interleaved, executed “as a unit” “all or none”, guaranteed by hardware • A possible solution: create a super instruction that does what we want atomically • add $0x1, 0x8049780 • Problem • Can’t anticipate every possible way we want atomicity • Increases hardware complexity, slows down other instructions // ++ balance mov 0x8049780,%eax add $0x1,%eax mov %eax,0x8049780 … // -- balance mov 0x8049780,%eax sub $0x1,%eax mov %eax,0x8049780 …
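For the specific increment in this example, x86 does in fact offer such a combined instruction: an add with the lock prefix, which performs the load-add-store as one atomic step. Below is a minimal sketch, assuming C11 atomics as the way to request it from C (the compiler emits a lock-prefixed add on x86); the slide's objection still stands, since this only covers simple read-modify-write patterns, not arbitrary critical sections.

#include <stdatomic.h>

atomic_int balance = 1000;

void deposit_once(void)
{
    atomic_fetch_add(&balance, 1);   /* one atomic "lock add": ++balance */
}

void withdraw_once(void)
{
    atomic_fetch_sub(&balance, 1);   /* one atomic "lock sub": --balance */
}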

  8. Layered approach to synchronization • Hardware provides simple low-level atomic operations, upon which we can build high-level atomic operations, upon which we can implement critical sections and build correct multi-threaded/multi-process programs

Properly synchronized application
High-level synchronization primitives
Hardware-provided low-level atomic operations

  9. Example low-level atomic operations and high-level synchronization primitives • Low-level atomic operations • On uniprocessor, disable/enable interrupt • x86 load and store of words • Special instructions: • Test-and-set • High-level synchronization primitives • Lock • Semaphore • Monitor Will look at them all. Start with lock.

  10. Locks (or Mutex: Mutual exclusion) • Two common operations • lock(): acquire lock exclusively; wait if not available • unlock(): release exclusive access to lock • pthread example

pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER;

void* deposit(void *arg) {
    int i;
    for (i = 0; i < 1e7; ++i) {
        pthread_mutex_lock(&l);
        ++balance;
        pthread_mutex_unlock(&l);
    }
}

void* withdraw(void *arg) {
    int i;
    for (i = 0; i < 1e7; ++i) {
        pthread_mutex_lock(&l);
        --balance;
        pthread_mutex_unlock(&l);
    }
}

  11. Critical Section Goals • Requirements • Safety (aka mutual exclusion): no more than one thread in critical section at a time • Liveness (aka progress): • If multiple threads simultaneously request to enter critical section, must allow one to proceed • Must not depend on threads outside critical section • Bounded waiting (aka starvation-free) • Must eventually allow waiting thread to proceed • Makes no assumptions about the speed and number of CPUs • However, assumes each thread makes progress • Desirable properties • Efficient: don't consume too many resources while waiting • Don't busy wait (spin wait). Better to relinquish CPU and let other threads run • Fair: don't make some threads wait longer than others. Hard to do efficiently • Simple: should be easy to use

  12. Implementing Locks: version 1 • Can cheat on uniprocessor: implement locks by disabling and enabling interrupts • Linux kernel heavily used this trick in single-core days • cli(): __asm__ __volatile__("cli": : :"memory") • sti(): __asm__ __volatile__("sti": : :"memory") • Good: simple! • Bad: • Both operations are privileged, can't let user programs use them • Doesn't work on multiprocessors

lock() { disable_interrupt(); }
unlock() { enable_interrupt(); }

  13. Implementing Locks: version 2 • Peterson’s algorithm: software-based lock implementation • Good: doesn’t require much from hardware • Only assumptions: • Loads and stores are atomic • They execute in order • Does not require special hardware instructions

  14. Software-based lock: 1st attempt • Idea: use one flag; test then set; if unavailable, spin-wait (or busy-wait) • Why doesn't it work? • Not safe: both threads can be in the critical section • Not efficient: busy wait, particularly bad on uniprocessor (will solve this later)

// 0: lock is available, 1: lock is held by a thread
int flag = 0;

lock() {
    while (flag == 1)
        ;          // spin wait
    flag = 1;
}

unlock() {
    flag = 0;
}

  15. Software-based locks: 2nd attempt • Idea: use per-thread flags, set then test, to achieve mutual exclusion • Why doesn't it work? • Not live: can deadlock

// 1: a thread wants to enter critical section, 0: it doesn't
int flag[2] = {0, 0};

lock() {
    flag[self] = 1;              // I need lock
    while (flag[1 - self] == 1)
        ;                        // spin wait
}

unlock() {
    flag[self] = 0;              // not any more
}

  16. Software-based locks: 3rd attempt • Idea: strict alternation to achieve mutual exclusion • Why doesn't it work? • Not live: depends on threads outside critical section

// whose turn is it?
int turn = 0;

lock() {
    while (turn == 1 - self)     // wait for my turn
        ;                        // spin wait
}

unlock() {
    turn = 1 - self;             // I'm done. your turn
}

  17. Software-based locks: final attempt (Peterson's algorithm) • Why does it work? • Safe? • Live? • Bounded wait?

// whose turn is it?
int turn = 0;
// 1: a thread wants to enter critical section, 0: it doesn't
int flag[2] = {0, 0};

lock() {
    flag[self] = 1;              // I need lock
    turn = 1 - self;             // wait for my turn
    while (flag[1 - self] == 1 && turn == 1 - self)
        ;                        // spin wait while the other thread has
                                 // intent AND it is the other thread's turn
}

unlock() {
    flag[self] = 0;              // not any more
}
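To see the algorithm end to end, here is a minimal self-contained sketch that uses Peterson's lock to protect the banking counter from slide 4. The names peterson_lock/peterson_unlock are illustrative, and flag/turn are C11 sequentially consistent atomics so that loads and stores really are atomic and execute in order, which is exactly the assumption the algorithm needs (plain int variables do not guarantee this on modern compilers and CPUs; see the next slide).

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int flag[2];      /* 1: thread i wants to enter      */
static atomic_int turn;         /* whose turn it is to wait        */
static int balance = 1000;      /* shared state guarded by the lock */

static void peterson_lock(int self)
{
    atomic_store(&flag[self], 1);           /* I need the lock            */
    atomic_store(&turn, 1 - self);          /* but you go first           */
    while (atomic_load(&flag[1 - self]) &&  /* spin while the other       */
           atomic_load(&turn) == 1 - self)  /* thread has intent AND it   */
        ;                                   /* is the other thread's turn */
}

static void peterson_unlock(int self)
{
    atomic_store(&flag[self], 0);           /* not any more */
}

static void *deposit(void *arg)
{
    int self = (int)(long)arg;
    for (int i = 0; i < 10000000; ++i) {
        peterson_lock(self);
        ++balance;
        peterson_unlock(self);
    }
    return NULL;
}

static void *withdraw(void *arg)
{
    int self = (int)(long)arg;
    for (int i = 0; i < 10000000; ++i) {
        peterson_lock(self);
        --balance;
        peterson_unlock(self);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, deposit, (void *)0L);
    pthread_create(&t2, NULL, withdraw, (void *)1L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("all done: balance = %d\n", balance);   /* should print 1000 */
    return 0;
}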

  18. Notes on Peterson's algorithm • Great way to start thinking about multi-threaded programming • Scheduler is "malicious" • The algorithm is useful in other contexts as well • Problem: • Doesn't work with N > 2 threads • Obvious extension (N flags, turn = 0, 1, …, N-1) doesn't work • Leslie Lamport's bakery algorithm does • More importantly, doesn't really work on modern out-of-order processors • Next: implement locks with hardware support

  19. Implementing locks: version 3 • Problem with the test-then-set approach: test and set are not atomic • Special atomic operation • test_and_set address, register • Load *address into register, and set *address to 1 • Why does it work?

// 0: lock is available, 1: lock is held by a thread
int flag = 0;

lock() {
    while (test_and_set(&flag))
        ;
}

unlock() {
    flag = 0;
}
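A minimal user-space sketch of version 3, assuming C11's atomic_flag as the test-and-set primitive (atomic_flag_test_and_set compiles to an xchg-style instruction on x86, as shown on the next slide); the function names are illustrative.

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

static void spin_lock(void)
{
    /* atomically set the flag and get its old value; spin while it was set */
    while (atomic_flag_test_and_set(&lock_flag))
        ;   /* spin wait */
}

static void spin_unlock(void)
{
    atomic_flag_clear(&lock_flag);   /* flag = 0: lock available again */
}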

  20. test_and_set on x86 • xchg reg, addr: atomically swaps *addr and reg • spin_lock in Linux can be implemented using this instruction (include/asm-i386/spin_lock.h) • Another way: append a "lock" prefix before an instruction • Examples in include/asm-i386/atomic.h

long test_and_set(volatile long* lock)
{
    int old;
    asm("xchgl %0, %1"
        : "=r"(old), "+m"(*lock)   // output
        : "0"(1)                   // input
        : "memory"                 // can clobber anything in memory, so gcc
                                   // won't reorder this statement with others
       );
    return old;
}

  21. Spin-wait or block • So far the lock implementations we've seen are busy-wait or spin-wait locks: endlessly checking the lock flag without yielding the CPU • Problem: waste CPU cycles • Worst case: the thread holding a busy-wait lock gets preempted, other threads try to acquire the same lock • On uniprocessor: should not use spin-lock • Yield CPU when lock not available (need OS support) • On multiprocessor • Thread holding lock gets preempted → ??? • Correct action depends on how long before lock release • Lock released "quickly" → spin-wait • Lock released "slowly" → block • Quick or slow is relative to context-switch overhead • Good plan: spin a bit, then block
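A small sketch of the "spin a bit, then block" plan, again assuming C11's atomic_flag as the test-and-set primitive; SPIN_LIMIT and the use of sched_yield() in place of a real sleep/wakeup queue are simplifications of mine, not the slides' design (version 4 below shows the queue-based approach).

#include <sched.h>
#include <stdatomic.h>

#define SPIN_LIMIT 100            /* arbitrary; tune against context-switch cost */

static atomic_flag lk = ATOMIC_FLAG_INIT;

static void two_phase_lock(void)
{
    for (;;) {
        /* phase 1: spin for a bounded number of attempts */
        for (int i = 0; i < SPIN_LIMIT; ++i)
            if (!atomic_flag_test_and_set(&lk))
                return;           /* got the lock while spinning */
        /* phase 2: give up the CPU (a real lock would block on a queue) */
        sched_yield();
    }
}

static void two_phase_unlock(void)
{
    atomic_flag_clear(&lk);
}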

  22. Problem with simple yield lock() { while(test_and_set(&flag)) yield(); } • Problem: • Still a lot of context switches • Starvation possible • Why? No control over who gets the lock next • Need explicit control over who gets the lock

  23. Implementing locks: version 4 • Add thread to queue when lock unavailable

typedef struct __mutex_t {
    int flag;     // 0: mutex is available, 1: mutex is not available
    int guard;    // guard lock to internal mutex data structure
    queue_t *q;   // queue of waiting threads
} mutex_t;

void lock(mutex_t *m) {
    while (test_and_set(&m->guard))
        ;                  // acquire guard lock by spinning
    if (m->flag == 0) {
        m->flag = 1;       // acquire mutex
        m->guard = 0;
    } else {
        enqueue(m->q, self);
        m->guard = 0;
        yield();
    }
}

void unlock(mutex_t *m) {
    while (test_and_set(&m->guard))
        ;
    if (queue_empty(m->q))
        m->flag = 0;                // release mutex; no one wants mutex
    else
        wakeup(dequeue(m->q));      // hold mutex (for next thread!)
    m->guard = 0;
}

  24. Semaphore • Synchronization tool that does not require busy waiting • Semaphore S: integer variable • Two standard operations modify S: acquire() and release() • Originally called P() and V() • from Dutch Proberen and Verhogen (Dijkstra) • Also called down() and up() • And even wait() and signal() • Higher-level abstraction, less complicated • Can only be accessed via two indivisible (atomic) operations

P(S) {
    while (S <= 0)
        ;
    S--;
}

V(S) {
    S++;
}

  25. Semaphore Types • Counting semaphore – integer value can range over an unrestricted domain • Used for synchronization • Binary semaphore – integer value can range only between 0 and 1; can be simpler to implement • Used for mutual exclusion: same as mutex

Process i:
    P(S);
    Critical Section
    V(S);
    Remainder Section

  26. Semaphore Implementation • Must guarantee that no two processes can execute P()/acquire() and V()/release() on the same semaphore at the same time • Thus, implementation of these operations becomes the critical section problem again, where the acquire and release code are placed inside the critical section • Could now have busy waiting in the critical section implementation • But if we know we can't acquire the semaphore, should we busy wait and burn up the CPU? • Note that applications may spend lots of time in critical sections, so this is not a good solution • We'd like a semaphore that sleeps (or at least lets someone else run)

  27. Semaphore Implementation with no Busy waiting • With each semaphore there is an associated waiting queue. Each entry in a waiting queue has two data items: • value (of type integer) • pointer to next record in the list • Two operations: • block– place the process invoking the operation on the appropriate waiting queue. • wakeup – remove one of processes in the waiting queue and place it in the ready queue. • Potential queuing policies: FIFO, LIFO, undef

  28. Semaphore Implementation with no Busy waiting (Cont.) • Implementation of acquire() and release(): see the sketch below
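A sketch of the standard textbook implementation this slide describes (not the slide's verbatim code): block() and wakeup() are the kernel-provided primitives from the previous slide, while add_to_queue(), remove_from_queue(), and current_process() are assumed helper names.

typedef struct {
    int value;                 // semaphore value
    struct process *list;      // queue of processes waiting on this semaphore
} semaphore_t;

void acquire(semaphore_t *S)   // P(S) / down(S) / wait(S)
{
    S->value--;
    if (S->value < 0) {
        // no resource available: go to sleep instead of spinning
        add_to_queue(&S->list, current_process());
        block();
    }
}

void release(semaphore_t *S)   // V(S) / up(S) / signal(S)
{
    S->value++;
    if (S->value <= 0) {
        // someone is waiting: move one waiter to the ready queue
        struct process *p = remove_from_queue(&S->list);
        wakeup(p);
    }
}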

  29. Producer-Consumer Problem • Bounded buffer: size N • Access entries 0 … N-1, then "wrap around" to 0 again • Producer process writes data to buffer • Must not write more than N items more than the consumer "ate" • Consumer process reads data from buffer • Should not try to consume if there is no data • (Diagram: circular buffer with slots 0 … N-1 and In/Out pointers)

  30. Solving Producer-Consumer Problem • Solving with semaphores • We’ll use two kinds of semaphores • We’ll use counters to track how much data is in the buffer • One counter counts as we add data and stops the producer if there are N objects in the buffer • A second counter counts as we remove data and stops a consumer if there are 0 in the buffer • Idea: since general semaphores can count for us, we don’t need a separate counter variable • Why do we need a second kind of semaphore? • We’ll also need a mutex semaphore

  31. Producer-Consumer Problem

Shared: Semaphores mutex, empty, full;
Init:   mutex = 1;   /* for mutual exclusion     */
        empty = N;   /* number empty buf entries */
        full  = 0;   /* number full buf entries  */

Producer:
do {
    ...              // produce an item in nextp
    P(empty);
    P(mutex);
    ...              // add nextp to buffer
    V(mutex);
    V(full);
} while (true);

Consumer:
do {
    P(full);
    P(mutex);
    ...              // remove item to nextc
    V(mutex);
    V(empty);
    ...              // consume item in nextc
} while (true);
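A runnable sketch of the same structure using POSIX semaphores; the buffer size, item count, and function names are illustrative choices, not from the slides.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 8                      /* bounded buffer size */

static int buf[N];
static int in = 0, out = 0;      /* next slot to fill / drain */

static sem_t mutex;              /* mutual exclusion on buf/in/out */
static sem_t empty;              /* counts empty slots, starts at N */
static sem_t full;               /* counts full slots, starts at 0  */

static void *producer(void *arg)
{
    for (int item = 0; item < 100; ++item) {
        sem_wait(&empty);        /* P(empty): wait for a free slot */
        sem_wait(&mutex);        /* P(mutex)                       */
        buf[in] = item;          /* add item to buffer             */
        in = (in + 1) % N;       /* wrap around                    */
        sem_post(&mutex);        /* V(mutex)                       */
        sem_post(&full);         /* V(full): one more full slot    */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    for (int i = 0; i < 100; ++i) {
        sem_wait(&full);         /* P(full): wait for data         */
        sem_wait(&mutex);
        int item = buf[out];     /* remove item from buffer        */
        out = (out + 1) % N;
        sem_post(&mutex);
        sem_post(&empty);        /* V(empty): one more empty slot  */
        printf("consumed %d\n", item);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&mutex, 0, 1);
    sem_init(&empty, 0, N);
    sem_init(&full, 0, 0);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}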

  32. Common Errors with Semaphores

Process i: P(S); CS; P(S)
A typo. Process i will get stuck (forever) the second time it does the P() operation. Moreover, every other process will freeze up too when trying to enter the critical section!

Process j: V(S); CS; V(S)
A typo. Process j won't respect mutual exclusion even if the other processes follow the rules correctly. Worse still, once we've done two "extra" V() operations this way, other processes might get into the CS inappropriately!

Process k: P(S); CS
Whoever next calls P() will freeze up. The bug might be confusing because that other process could be perfectly correct code, yet that's the one you'll see hung when you use the debugger to look at its state!

Process l: P(S); if (something) return; CS; V(S)
Someone forgot to release the semaphore before returning! The next caller will get stuck.
