
CS 3304 Comparative Languages



  1. CS 3304 Comparative Languages • Lecture 24: Concurrency – Implementations • 12 April 2012

  2. Implementing Synchronization • Typically, synchronization is used to: • Make some operation atomic. • Delay that operation until some necessary precondition holds. • Atomicity: usually achieved with mutual exclusion locks. • Mutual exclusion ensures that only one thread is executing some critical section of code at a given point in time: • Much early research was devoted to figuring out how to build it from simple atomic reads and writes. • Dekker is generally credited with finding the first correct solution for two threads in the early 1960s. • Dijkstra: a version that works for n threads in 1965. • Peterson: a much simpler two-thread solution in 1981 (sketched below). • Condition synchronization: allows a thread to wait for a precondition, e.g., a predicate on the value(s) of one or more shared variables.
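
A minimal Java sketch of Peterson's two-thread solution, not taken from the slides: the class and field names are illustrative, and the atomic classes are used so the accesses behave like ordered (volatile) reads and writes under the Java memory model.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicIntegerArray;

    // Peterson's two-thread mutual exclusion algorithm.
    class PetersonLock {
        private final AtomicIntegerArray interested = new AtomicIntegerArray(2); // 1 = wants in
        private final AtomicInteger victim = new AtomicInteger(0);               // who yields

        void lock(int self) {                // self is 0 or 1
            int other = 1 - self;
            interested.set(self, 1);         // announce intent
            victim.set(self);                // defer to the other thread
            while (interested.get(other) == 1 && victim.get() == self) {
                // busy-wait (spin)
            }
        }

        void unlock(int self) {
            interested.set(self, 0);
        }
    }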

  3. Busy-Wait Synchronization • Busy-wait condition synchronization with atomic reads and writes is easy: • You just cast each condition in the form of “location X contains value Y” and you keep reading X in a loop until you see what you want. • Other forms are more difficult: • Spin locks: provide mutual exclusion. • Barriers: ensure that no thread continues past a given point in a program until all threads have reached that point.
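
As a concrete illustration of the recipe above (cast the condition as "location X contains value Y" and keep reading X), a minimal Java sketch; the class and member names are illustrative.

    // Busy-wait condition synchronization: spin until the shared location
    // holds the awaited value. volatile keeps the reads from being hoisted.
    class FlagSync {
        private volatile boolean ready = false;   // the shared location X

        void awaitReady() {                       // reads X in a loop
            while (!ready) { /* spin */ }
        }

        void announce() {                         // writes the awaited value Y
            ready = true;
        }
    }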

  4. Spin Locks • Spin lock: a busy-wait mutual exclusion mechanism. • Processors have instructions for atomic read/modify/write. • The problem with spin locks is that they waste processor cycles: overdemand for hardware resources – contention. • Synchronization mechanisms are needed that interact with a thread/process scheduler to put a thread to sleep and run something else instead of spinning. • Note, however, that spin locks are still valuable for certain things, and are widely used. • In particular, it is better to spin than to sleep when the expected spin time is less than the rescheduling overhead. • Reader-writer lock: allows reader threads to access the protected data concurrently; writers still exclude all other threads.
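
A minimal test-and-test-and-set spin lock in Java, sketching the atomic read-modify-write idea; AtomicBoolean.getAndSet stands in for the hardware instruction, and the names are illustrative.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Test-and-test-and-set spin lock: spin on an ordinary read first,
    // then attempt the atomic read-modify-write.
    class SpinLock {
        private final AtomicBoolean held = new AtomicBoolean(false);

        void acquire() {
            while (true) {
                while (held.get()) { /* spin without generating writes */ }
                if (!held.getAndSet(true)) {
                    return;                       // we got the lock
                }
            }
        }

        void release() {
            held.set(false);
        }
    }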

  5. Barriers • In data-parallel algorithms, correctness often depends on making sure that every thread completes the previous step before any thread moves on to the next. • Globally shared counter - modified by an atomic fetch_and_decrement instruction: • Threads toggle their local sense. • Threads decrement the counter and wait. • The last thread (counter is 1) allows other threads to proceed: • Reinitializes the counter to n. • Sets the global sense to its local sense. • A centralized sense-reversing barrier can lead to significant contention on large machines: • The fastest software barriers are O(log n). • Special hardware allows near-constant-time barriers.
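
A sense-reversing counter barrier along the lines described above, as a hedged Java sketch; it uses decrementAndGet, so the last arrival sees 0 rather than the slide's fetch_and_decrement seeing 1, and the names are illustrative.

    import java.util.concurrent.atomic.AtomicInteger;

    // Sense-reversing barrier: the last thread to arrive resets the counter
    // and flips the global sense; the others spin until the sense changes.
    class SenseBarrier {
        private final int n;                         // number of participating threads
        private final AtomicInteger count;
        private volatile boolean globalSense = false;

        SenseBarrier(int n) {
            this.n = n;
            this.count = new AtomicInteger(n);
        }

        // localSense is the caller's toggled copy; the returned value must be
        // passed back in on the next call to await.
        boolean await(boolean localSense) {
            boolean mySense = !localSense;           // toggle local sense
            if (count.decrementAndGet() == 0) {      // last thread to arrive
                count.set(n);                        // reinitialize the counter to n
                globalSense = mySense;               // release everyone else
            } else {
                while (globalSense != mySense) { /* spin */ }
            }
            return mySense;
        }
    }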

  6. Nonblocking Algorithms • Compare and swap (CAS) - a universal primitive for single-location atomic update. Lock-based update versus nonblocking CAS retry loop:
      acquire(L)                     start:
      r1 := x                        r1 := x
      r2 := foo(r1)                  r2 := foo(r1)
      x := r2                        r2 := CAS(x, r1, r2)
      release(L)                     if !r2 goto start
  • Nonblocking: if the CAS operation fails, it is because some other thread has made progress. • Generalization:
      repeat
          prepare    -- harmless if we need to repeat
          CAS        -- if successful, completes in a way visible to all threads
      until success
      clean up       -- performed by any thread if the original is delayed
  • Advantages: tolerant of page faults and preemption; can be safely used in signal/interrupt handlers; can be faster. • Disadvantages: exceptionally subtle and difficult to devise.
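
The same read-compute-CAS-retry pattern as a small Java sketch using AtomicInteger.compareAndSet; the class and method names are illustrative.

    import java.util.concurrent.atomic.AtomicInteger;

    // Nonblocking update in the CAS style above: read, compute, CAS, retry.
    class NonblockingCounter {
        private final AtomicInteger x = new AtomicInteger(0);

        int incrementViaCas() {
            while (true) {
                int r1 = x.get();                  // r1 := x
                int r2 = r1 + 1;                   // r2 := foo(r1)
                if (x.compareAndSet(r1, r2)) {     // CAS(x, r1, r2)
                    return r2;                     // success: visible to all threads
                }
                // failure: some other thread made progress; retry
            }
        }
    }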

  7. Memory Consistency Models • Hardware memory coherence alone is not enough to make a multiprocessor behave as most programmers would expect. • When more than one location is written at about the same time, the order in which the writes become visible to different processors becomes very important. • Sequential consistency: • All writes are visible to all processors in the same order. • Any given processor’s writes are visible in the order they were performed. • Very difficult to implement efficiently. • Relaxed memory models: • Certain loads and stores may appear to occur “out of order”. • Important ramifications for language designers, compiler writers, and the implementors of synchronization mechanisms and nonblocking algorithms.

  8. The Cost of Ordering • Straightforward implementations: require both hardware and compilers to serialize operations. • Example - ordinary store instructions: • Temporal loop: • A’s write of inspected precedes its read of X in program order. • B’s write of X precedes its read of inspected in program order. • B’s read of inspected appears to precede A’s write of inspected, because it sees the unset value. • A’s read of X appears to precede B’s write of X as well, leaving us with xA = 0 and iB = false. • May also be caused by compiler optimizations.
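
To make the temporal loop concrete, a hedged Java sketch (class and variable names are ours, not from the slides); with ordinary, non-volatile fields the outcome xa == 0 and ib == false is permitted.

    // Store-buffering / "temporal loop" litmus test. With plain fields the
    // JVM and hardware may reorder each thread's write and read, so both
    // threads can observe the other's old value.
    class StoreBufferingDemo {
        static int x = 0;                  // written by B, read by A
        static boolean inspected = false;  // written by A, read by B
        static int xa;                     // A's observation of x
        static boolean ib;                 // B's observation of inspected

        public static void main(String[] args) throws InterruptedException {
            Thread a = new Thread(() -> { inspected = true; xa = x; });
            Thread b = new Thread(() -> { x = 1; ib = inspected; });
            a.start(); b.start();
            a.join(); b.join();
            System.out.println("xa=" + xa + ", ib=" + ib);  // xa=0, ib=false is possible
        }
    }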

  9. Forcing Order • Avoiding temporal loops: use special synchronization or memory fence instructions. • Temporal loop - both A and B must prevent their read from bypassing (completing before) the logically earlier write: • Identifying the read or the write as a synchronization instruction may suffice. • Sometimes more significant program changes are needed. • Fences and synchronization instructions may not suffice for the problem of concurrent propagation of writes: • Enclose the writes in a lock-based critical section.

  10. Data Race Freedom • Multiprocessor memory behavior - a transitive happens-before relationship between instructions: • In certain cases an instruction on one processor happens before an instruction on another processor. • Write data-race-free programs according to some (language-specific) memory model: • The program never performs conflicting operations unless they are ordered by the model. • Memory consistency models distinguish: • Data races (memory races): between ordinary loads and stores. • Synchronization races - between lock operations, volatile loads and stores, or other distinguished operations: • Temporal loop: avoid by declaring both X and inspected as volatile (see the sketch below). • Concurrent propagation of writes: both C and D should read X and Y together in a single atomic operation.
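
The volatile fix for the temporal loop, continuing the earlier sketch (again with illustrative names): volatile accesses are synchronization operations under the Java memory model, so the xa == 0 and ib == false outcome is no longer allowed.

    // Same litmus test with both shared fields declared volatile: the program
    // is now data-race free, and at least one thread must see the other's write.
    class StoreBufferingFixed {
        static volatile int x = 0;
        static volatile boolean inspected = false;

        static int threadA()     { inspected = true; return x; }         // write, then read
        static boolean threadB() { x = 1;            return inspected; } // write, then read
    }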

  11. Scheduler Implementation • OS-level processes must synchronize access to the ready list and condition queues, usually by means of spinning: • Assumes a single “low-level” lock (scheduler_lock) that protects the entire scheduler. • On a large multiprocessor we might increase concurrency by employing a separate lock for each condition queue, and another for the ready list. • Synchronization for sleep_on:
      disable_signals
      acquire_lock(scheduler_lock)
      if not desired_condition
          sleep_on(condition_queue)
      release_lock(scheduler_lock)
      reenable signals

  12. Scheduler-Based Synchronization • Busy-wait synchronization is generally level independent: • Consumes cycles that could be used for computation. • Makes sense only if the processor is idle or the expected wait time is less than the time required to switch contexts. • Scheduler-based synchronization is level dependent: • Specific to threads (language implementation) or processes (OS). • Semaphores were the first proposed scheduler-based synchronization mechanism, and remain widely used. • Conditional critical regions (CCRs), monitors, and transactional memory came later. • Bounded buffer abstraction - a concurrent queue of limited size into which producer threads insert data and from which consumer threads remove it: • The buffer evens out fluctuations in the relative speeds of producers and consumers. • A correct implementation requires both atomicity and condition synchronization.

  13. Semaphores • A semaphore is a special counter: • Has an initial value and two operations, P and V, for changing its value. • A semaphore keeps track of the difference between the number of P and V operations that have occurred. • A P operation is delayed (the process is de-scheduled) until #P-#V <= C, the initial value of the semaphore. • Semaphores are generally fair, i.e., processes complete P operations in the same order they start them. • Problems with semaphores: • They're pretty low-level: • When using them for mutual exclusion, it's easy to forget a P or a V, especially when they don't occur in strictly matched pairs. • Their use is scattered all over the place: • If you want to change how processes synchronize access to a data structure, you have to find all the places in the code where they touch that structure, which is difficult and error-prone.
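
A bounded buffer (the abstraction from slide 12) built from counting semaphores; this sketch uses java.util.concurrent.Semaphore, whose acquire and release play the roles of P and V, and the class and field names are illustrative.

    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.concurrent.Semaphore;

    // Classic semaphore solution: one semaphore counts free slots, one counts
    // filled slots, and a binary semaphore provides mutual exclusion.
    class SemaphoreBoundedBuffer<T> {
        private final Queue<T> items = new LinkedList<>();
        private final Semaphore emptySlots;                    // free slots
        private final Semaphore fullSlots = new Semaphore(0);  // filled slots
        private final Semaphore mutex = new Semaphore(1);      // mutual exclusion

        SemaphoreBoundedBuffer(int capacity) {
            emptySlots = new Semaphore(capacity);
        }

        void insert(T item) throws InterruptedException {
            emptySlots.acquire();    // P: wait for a free slot
            mutex.acquire();         // P: enter the critical section
            items.add(item);
            mutex.release();         // V
            fullSlots.release();     // V: announce a filled slot
        }

        T remove() throws InterruptedException {
            fullSlots.acquire();     // P: wait for data
            mutex.acquire();
            T item = items.remove();
            mutex.release();
            emptySlots.release();    // V: announce a free slot
            return item;
        }
    }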

  14. Semaphore Operations - Scheduler • Implementations of P and V for the scheduler operations. • The code for sleep_on cannot disable timer signals and acquire the scheduler lock itself because the caller needs to test a condition and then block as a single atomic operation.

  15. Language-Level Mechanisms • Semaphores are considered to be too “low level” for well-structured, maintainable code: • Their operations are simple subroutine calls that are easy to leave out. • Uses of a given semaphore tend to get scattered throughout a program (unless hidden inside an abstraction) - difficult to track down for purposes of software maintenance. • Other language mechanisms include: • Monitors. • Conditional critical regions. • Transactional memory. • Implicit synchronization.

  16. Monitors • Suggested by Dijkstra as a solution to the problems of semaphores (used in the languages Concurrent Pascal, Modula, and Mesa). • A monitor is a module or object with operations, internal state, and a number of condition variables: • Only one operation of a given monitor is allowed to be active at a given point in time (programmers are relieved of the responsibility of using P and V operations correctly). • A thread that calls a busy monitor is automatically delayed until the monitor is free. • An operation can suspend itself by waiting on a condition variable (unlike semaphores, condition variables have no memory). • All operations on the encapsulated data, including synchronization, are collected together. • Monitors have the highest-level semantics, but also a few sticky semantic problems; they are widely used.

  17. Monitor - Semantic Details • Hoare’s definition of monitors: • One thread queue for every condition variable. • Two bookkeeping queues: • Entry queue: threads that attempt to enter a busy monitor. • Urgent queue: when a thread executes a signal operation from within a monitor, and some other thread is waiting on the specified condition, then the signaling thread waits on the monitor’s urgent queue. • Monitor variations: • Semantics of the signal operation. • Management of mutual exclusion when a thread waits inside a nested sequence of two or more monitor calls. • Monitor invariant: a predicate that captures the notion that “the state of the monitor is consistent.” • Needs to be true initially and at monitor exit. • Monitors and semaphores are equally powerful.

  18. Signals • One signals a condition variable when some condition on which a thread may be waiting has become true. • To make sure the condition is still true when the waiting thread wakes up, it needs to run as soon as the signal occurs: we need the urgent queue. • Induces unnecessary scheduling overhead. • Mesa – signals are hints, not absolutes:
      if not desired_condition
          wait(condition_variable)
  becomes
      while not desired_condition
          wait(condition_variable)
  • Modula-3 takes a similar approach. • Concurrent Pascal - the signal operation causes an immediate return from the monitor operation in which it appears: • Preserves the invariant with low overhead, but precludes algorithms in which a thread does useful work after signaling a condition.

  19. Nested Monitor Calls • A wait in a nested sequence of monitor operations usually: • Releases mutual exclusion on the innermost monitor. • Leaves the outer monitors locked. • Can lead to deadlock if the only way for another thread to reach a corresponding signal operation is through the same outer monitors: • The thread that entered the outer monitor first is waiting for the second thread to execute a signal operation, but the second thread is waiting for the first to leave the monitor. • Deadlock: any situation in which a collection of threads are all waiting for each other, and none of them can proceed. • Solution - release exclusion on outer monitors when waiting in an inner one – adopted by early uniprocessor implementations: • Requires that the monitor invariant hold at any subroutine call that may result in a wait or (under Hoare semantics) a signal in a nested monitor. • These calls may not all be known to the programmer.

  20. Conditional Critical Regions • Proposed as an alternative to semaphores by Brinch Hansen. • Critical region - a syntactically delimited critical section in which the code is permitted to access a protected variable: • Specifies a Boolean condition that must be true before control enters:
      region protected_variable, when Boolean_condition do
          …
      end region
  • No thread can access the protected variable except within a region statement. • Any thread that reaches a region statement waits until the condition is true and no other thread is currently in a region for the same variable. • Nesting regions: a deadlock is possible. • Example language: Edison. • CCRs influenced the synchronization mechanisms of Ada 95, Java, and C#.

  21. Synchronization in Ada 95 • In addition to the message passing of Ada 83, Ada 95 has the notion of a protected object: • Three types of methods: functions, procedures, and entries. • Functions can only read the fields of the object. • Procedures and entries can read and write them. • An implicit reader-writer lock on the protected object ensures that potentially conflicting operations exclude one another in time. • Entries differ from procedures: • Can have a Boolean expression guard that the calling thread must wait for before beginning execution. • Three special forms of call: • Timed: abort after waiting for a specified amount of time. • Conditional: execute alternative code if the call cannot proceed now. • Asynchronous: execute alternative code now, abort it if the call can proceed. • Ada 95 shared-memory synchronization: a hybrid of monitors and CCRs.

  22. Synchronization in Java • Every object has an implicit mutual exclusion lock, acquired with synchronized. • Synchronized statements that refer to different objects may proceed concurrently. • Within a synchronized statement or method, a thread can suspend itself by calling the predefined method wait. • Threads can be awoken for spurious reasons, so the wait goes in a loop:
      while (!condition) {
          wait();
      }
  • Resuming a thread suspended on an object: • Some other thread must execute the predefined method notify from within a synchronized statement. • There is also notifyAll, which awakens all waiting threads. • Synchronization in Java is sort of a hybrid of monitors and CCRs (Java 3 will have true monitors) – similarly in C#.
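
A monitor-style bounded buffer using Java's implicit lock together with the while/wait idiom above; the class and method names are illustrative.

    import java.util.LinkedList;
    import java.util.Queue;

    // Bounded buffer as a Java "monitor": synchronized methods for mutual
    // exclusion, wait/notifyAll for condition synchronization.
    class SynchronizedBoundedBuffer<T> {
        private final Queue<T> items = new LinkedList<>();
        private final int capacity;

        SynchronizedBoundedBuffer(int capacity) {
            this.capacity = capacity;
        }

        synchronized void insert(T item) throws InterruptedException {
            while (items.size() == capacity) {
                wait();                        // re-check the condition on wakeup
            }
            items.add(item);
            notifyAll();                       // wake consumers waiting for data
        }

        synchronized T remove() throws InterruptedException {
            while (items.isEmpty()) {
                wait();
            }
            T item = items.remove();
            notifyAll();                       // wake producers waiting for space
            return item;
        }
    }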

  23. Lock Variables • C# and Java versions prior to 5 assume that threads never need to wait for more than one condition per lock. • The Java 5 java.util.concurrent package provides a more general solution - explicit creation of Lock variables:
      Lock l = new ReentrantLock();
      l.lock();
      try {
          …
      } finally {
          l.unlock();
      }
  • Lacks the implicit release at the end of scope associated with synchronized methods and statements. • A Java object that uses only synchronized methods behaves like a monitor. • A Java synchronized statement that begins with a wait in a loop resembles a CCR.
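
The added generality comes from explicit Condition objects: a single Lock can have several of them, created with newCondition. A small hedged sketch, with illustrative names:

    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    // One Lock with one explicit Condition; additional conditions can be
    // created from the same lock with further newCondition() calls.
    class Gate {
        private final Lock lock = new ReentrantLock();
        private final Condition opened = lock.newCondition();
        private boolean open = false;

        void passThrough() throws InterruptedException {
            lock.lock();
            try {
                while (!open) {
                    opened.await();        // wait on the explicit condition
                }
            } finally {
                lock.unlock();             // release must be explicit
            }
        }

        void openGate() {
            lock.lock();
            try {
                open = true;
                opened.signalAll();        // wake all threads waiting at the gate
            } finally {
                lock.unlock();
            }
        }
    }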

  24. The Java Memory Model • Specifies exactly: • Which operations are guaranteed to be ordered across threads. • For every read/write pair, whether the read is permitted to return the value written by the write. • A Java thread is allowed to: • Buffer or reorder its writes until the point at which it writes a volatile variable or leaves a monitor. • Keep cached copies of values written by other threads until it reads a volatile variable or enters a monitor. • The compiler can: • Reorder ordinary reads/writes in the absence of intrathread data dependences. • It cannot reorder volatile accesses, monitor entry, or monitor exit with respect to one another.

  25. Transactional Memory • Locks (semaphores, monitors, CCRs) make it easy to write data-race-free programs, but they do not scale: • Adding processors and threads: the lock becomes a bottleneck. • We can partition program data into equivalence classes: a critical section must acquire the lock for every equivalence class it accesses. • Different critical sections may acquire locks in different orders: deadlock can result. • Enforcing a common order can be difficult. • Locks may be too low-level a mechanism. • The mapping between locks and critical sections is an implementation detail from a semantic point of view: • What we really want is a composable atomic construct: transactional memory (TM).

  26. Atomicity without Locks • Transactions have been used for atomicity in databases. • The basic idea of TM: • The programmer labels code blocks as atomic. • The underlying system takes responsibility for executing those blocks in parallel whenever possible. • If the code inside the atomic block can safely be rolled back in the event of conflict, then the implementation can be based on speculation. • Implementations: a rather surprising amount of variety. • Challenges: • What should we do about operations inside transactions that cannot easily be rolled back (I/O, system calls)? • How do we discourage programmers from creating large transactions? • How should transactions interact with locks and nonblocking data structures?

  27. Implicit Synchronization • Thread operations on shared data are restricted in such a way that synchronization can be implicit in the operations themselves, rather than appearing as explicit operations: • Example: the forall loop of HPF and Fortran 95. • Dependence analysis: the compiler identifies situations in which statements within a loop do not depend on one another and can proceed without synchronization. • Automatic parallelization: • Considerable success with well-structured data-parallel programs. • Thread-level parallelization, for irregularly structured programs, is very difficult.

  28. Futures • Implicit synchronization without compiler analysis: • Multilisp Scheme: (future (my-function my-args)) • In a purely functional program, semantically neutral. • Executes its work in parallel until it detects an attempt to perform an operation that is too complex for the system to run safely in parallel. • Work in a future is suspended if it depends in some way on the current continuation, such as raising an exception. • C# 3.0/Parallel FX: Future class. • Java: Future class and Executor object (see the example below). • CC++: single-assignment variable. • Linda: a set of subroutines that manipulate a shared abstraction called the tuple space. • Parallel logic programming - AND and OR (speculative) parallelism: fails to adhere to the deterministic search order.
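
A small example of the Java Future/Executor combination named above; the computation and its result type are made up for illustration.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // A Future started via an Executor: get() blocks until the value is ready,
    // so the synchronization is implicit in the use of the result.
    class FutureDemo {
        static int expensiveComputation(int n) {   // stand-in for real work
            return n * 2;
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            Future<Integer> answer = pool.submit(() -> expensiveComputation(21));
            // ... other work can proceed here, in parallel with the computation ...
            System.out.println(answer.get());      // blocks until the result is ready
            pool.shutdown();
        }
    }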

  29. Message Passing • Most concurrent programming on large multicomputers and networks is currently based on messages. • To send/receive a message, one must generally specify where to send it to, or where to receive it from: communication partners need names for one another: • Addressing messages to processes: Hoare’s CSP (Communicating Sequential Processes). • Addressing messages to ports: Ada. • Addressing messages to channels: Occam. • Ada’s comparatively high-level semantics for parameter modes allows the same set of modes to be used for both subroutines and entries (rendezvous). • Some concurrent languages provide parameter modes specifically designed with remote invocation in mind.

  30. Summary • We focused on shared-memory programming models and on synchronization in particular. • We distinguish between atomicity and condition synchronization, and between busy-wait and scheduler-based implementation. • Busy-wait mechanisms include spin locks and barriers. • Scheduler-based implementations include semaphores, monitors, and conditional critical regions. • Transactional memory sacrifices some performance for the sake of programmability.
