Release Consistency

Presentation Transcript


  1. Release Consistency • Slides by Konstantin Shagin, 2002

  2. The need for Relaxed Consistency Schemes • In any implementation of Sequential Consistency there must be some global control mechanism. • Either writes or reads must involve memory synchronization operations. • In most implementations, writes require some kind of memory synchronization. [Timeline diagram: process A performs w(x), w(y), w(x); each write triggers a synchronization message towards process B.]

  3. The Idea of Relaxed Consistency Schemes • Relaxed consistency schemes are designed to require fewer memory synchronization operations. • Writes can be delayed, aggregated, or eliminated. • This results in less communication and therefore higher performance. [Timeline diagram: process A performs w(x), w(y), w(x) with no per-write messages; the modifications are propagated to process B only at a barrier.]

  4. Software Distributed Shared Memory [Diagram: nodes 1 through n, each with its own local memory, connected by a network and presenting a single distributed shared memory.] • Page based, permissions, … • Single system image, shared virtual address space, …

  5. False Sharing • False sharing is a situation in which two or more processes access different variables within a page and at least one of the accesses is a write. • If only one process is allowed to write to a page at a time, false sharing leads to unnecessary communication, called the “ping-pong” effect.

  6. Understanding False Sharing [Diagram: process A repeatedly writes x while process B repeatedly reads y. When x and y lie on the same page p, the page bounces between A and B on every access; when x and y lie on separate pages p1 and p2, no communication is needed.]

  7. False Sharing in Relaxed Consistency Schemes • False sharing has a much smaller overhead in relaxed consistency models. • The overhead induced by false sharing can be further reduced by using multiple-writer protocols. • Multiple-writer protocols allow multiple processes to simultaneously modify their local copy of a shared page. • The modifications are merged at certain points of execution.

  8. Release Consistency [Gharachorloo et al. 1990, DASH]* • Introduces a special type of variables, called synchronization variables or locks. • Locks cannot be read or written to. They can only be acquired and released. For a lock L those operations are denoted by acquire(L) and release(L), respectively. • We say that a process that has acquired a lock L but has not released it holds the lock L. • No more than one process can hold a lock L at a time. One process holds the lock while others wait. (*) K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. L. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26. IEEE, May 1990.

  9. Using Release and Acquire to define execution-flow synchronization primitives • Let a set of processes release tokens by reaching the release operation in their program order. • Let another set (possibly overlapping) acquire those tokens by performing an acquire operation, where the acquire can proceed only when all tokens have arrived from all releasing processes. • 2-way synchronization = lock/unlock: 1 release, 1 acquire. • n-way synchronization = barrier: n releases, n acquires. • PARC's synch = k-way synchronization. A sketch of the n-way token scheme appears below.
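
A minimal sketch of the n-way token scheme described above, using plain Python threads as a stand-in for distributed processes (the class name and structure are illustrative assumptions, not part of the original design):

import threading

class TokenBarrier:
    # n-way synchronization built from release/acquire of tokens:
    # acquirers may proceed only after all n tokens have been released.
    def __init__(self, n):
        self.n = n
        self.tokens = 0
        self.cond = threading.Condition()

    def release_token(self):
        # A releasing process deposits one token.
        with self.cond:
            self.tokens += 1
            self.cond.notify_all()

    def acquire_tokens(self):
        # An acquiring process blocks until all n tokens have arrived.
        with self.cond:
            self.cond.wait_for(lambda: self.tokens >= self.n)

With n = 1 this degenerates into the 2-way lock/unlock handoff; with n equal to the number of processes, each process calling release_token() followed by acquire_tokens() behaves like a one-shot barrier.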

  10. Model of Atomicity • A read by Pi is considered performed with respect to process Pk at a point in time when the issuing of a write to the same address by Pk can no longer affect the value returned by the read. • A write by Pi is considered performed with respect to process Pk at a point in time when an issued read to the same address by Pk returns the value defined by this write (or a later value). • An access is performed when it is performed with respect to all processes. • An acquire(L) by Pi is performed when Pi receives exclusive ownership of L (before any other requester). • A release(L) by Pi is performed when Pi gives away its exclusive ownership of L.

  11. Formal Definition of Release Consistency • Conditions for Release Consistency: • (A) Before a read or write access is allowed to perform with respect to any other process, all previous acquire accesses must be performed, and • (B) before a release access is allowed to perform with respect to any other process, all previous read or write accesses must be performed, and • (C) acquire and release accesses are sequentially consistent.

  12. Understanding RC [Timeline diagram: A performs w(x)1 and then rel(L1); B performs r(x)0, r(x)?, r(x)1, acq(L1), r(x)1 along the time axis t.] • Before A's write has performed, B reads 0. • While the write is propagating, it is undefined what value is read; it can be any value written by some process, here 0 or 1. • From the point where A's release has performed, all processes must see the value 1 in x: the read must return 1 according to rule (B), but the programmer cannot rely on this, since nothing tells the programmer that the release has already performed. • After B's acq(L1), the programmer is sure the read returns 1, according to rules (C) and (A).

  13. Acquire and Release • A release serves as a memory-synchronization operation: a flush that brings the local modifications to the attention of all other processes. • According to the definition, the acquire and release operations are used not only for synchronization of execution, but also for synchronization of memory, i.e. for propagation of writes from/to other processes. • This allows the two expensive kinds of synchronization to be overlapped. • It also turns out to be semantically simpler for the programmer.

  14. Acquire and Release (cont.) • A release followed by an acquire of the same lock guarantees to the programmer that all writes preceding the release will be seen by all reads following the acquire. • The idea is to let the programmer decide which blocks of operations need to be synchronized, and to put them between a matching pair of acquire/release operations. • In the absence of release/acquire pairs, there is no assurance that modifications will ever propagate between processes.

  15. Consistency of synchronization operations • Note that the ordering of the release/acquire operations among themselves defines an independent memory consistency scheme. • Rule (C) defines it to be Sequential Consistency. • There are other flavors of RC in which the consistency of the synchronization operations is defined to be some consistency model x (e.g., Coherence). Such a memory model is denoted by RCx. • RCx is weaker than RCy if x is weaker than y. • For simplicity, we deal only with RCsc.

  16. Happened-Before relation induced by acquire/release • Redefine the happened-before relation using acquire and release instead of receive and send, respectively. • We say that event e happened before event e' (denoted e → e' or e < e') if one of the following properties holds: Processor Order: e precedes e' in the same process; Release-Acquire: e is a release and e' is the following acquire of the same lock; Transitivity: there exists e'' such that e < e'' and e'' < e'.
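
As a small illustration, the relation can be computed as the transitive closure of the processor-order and release-acquire edges. The representation below (events as ids, edges as pairs) is my own sketch, not from the slides:

from itertools import product

def happened_before(po_edges, sync_edges):
    # po_edges:   set of (e, e2) where e precedes e2 in the same process
    # sync_edges: set of (rel, acq) where acq is the next acquire of the
    #             same lock after rel
    # Returns the happened-before relation as a set of pairs.
    hb = set(po_edges) | set(sync_edges)
    changed = True
    while changed:                       # naive transitive closure
        changed = False
        for (a, b), (c, d) in product(list(hb), list(hb)):
            if b == c and (a, d) not in hb:
                hb.add((a, d))
                changed = True
    return hb

Two accesses are then synchronized exactly when one happened before the other, which is what the definition of competing accesses on the next slide relies on.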

  17. Happened-Before relation induced by acquire/release (cont.) [Timeline diagram: three processes A, B and C perform reads and writes of x and y interleaved with acquire/release operations on locks L1 and L2; the release-acquire edges between the processes, combined with processor order and transitivity, induce the happened-before relation across the execution.]

  18. Competing Accesses • Two memory accesses are not synchronized if they are independent events according to the previously defined happened-before relationship. • Two memory accesses are conflicting if they access the same memory location and at least one of them is a write. • Conflicting accesses are said to be competing if there exists an execution in which they are not synchronized. • Competing accesses form a race condition, as they may be executed concurrently.

  19. Data Races in RC • Release Consistency does not guarantee anything about the propagation of updates in the absence of synchronization. Example: initially grades = oldDatabase and updated = false. Thread T.A. executes: grades = newDatabase; updated = true. Thread Lecturer executes: while (updated == false); x := grades.gradeOf(lecturersSon). • If the modification of the variable updated reaches Lecturer while the modification of grades does not, then Lecturer looks at the old database! • This is possible in Release Consistency, but not in Sequential Consistency. A sketch of a fix appears below.
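
A minimal sketch of removing the race, using plain Python threads and a threading.Lock as a stand-in for an RC lock variable (the database values are illustrative strings):

import threading

L = threading.Lock()          # plays the role of the lock L
grades = "oldDatabase"
updated = False

def thread_ta():
    global grades, updated
    with L:                   # acquire(L)
        grades = "newDatabase"
        updated = True
    # leaving the block is release(L): both writes above must be
    # performed before the release performs

def thread_lecturer():
    while True:
        with L:               # an acquire(L) that follows T.A.'s release
            if updated:       # is guaranteed to see both writes
                print(grades) # prints "newDatabase"
                return

t1 = threading.Thread(target=thread_ta)
t2 = threading.Thread(target=thread_lecturer)
t2.start(); t1.start(); t1.join(); t2.join()

Because both the update and the check are wrapped in matching acquire/release pairs on the same lock, rule (B) forces the writes to perform before the release, and rules (C) and (A) make them visible to the reads that follow the acquire.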

  20. Expressiveness of Release Consistency [Gharachorloo et al. 1990] • A properly-labeled (PL) program is one that has no competing accesses. • Theorem: RCsc = SC for PL programs. • In other words, the programmer only has to make sure there are no data races.

  21. Implementing RC [Timeline diagram: P1 performs w(x), w(y), w(z) and then rel(L); the three writes are pipelined to P2 as separate messages, and the release waits until all of them have performed.] • The first implementation was proposed by the inventors of RC and is called DASH. • DASH combats memory latency by pipelining writes to shared memory. • The processor is stalled only when executing a release, at which time it must wait for all its previous writes to perform.

  22. Implementing RC (cont.) [Timeline diagram: P1 performs w(x), w(y), w(z) and then rel(L); the modifications to x, y and z are sent to P2 in a single message at the release.] • It is important to reduce the number of message exchanges, because every message has an additional fixed overhead, independent of its size. • Another implementation of RC, called Munin, reduces the number of messages by buffering writes until a release.

  23. Eager Release Consistency [Carter et al. 1991, Munin]* • An implementation of Release Consistency (not a new memory model). • Postpone sending modifications until the next release. • Upon a release, send all accumulated modifications to all caching processes. • No memory-synchronization operations on an acquire. • Upon a miss (the variable is not cached locally), get the latest modification from the latest modifier (this requires a little extra bookkeeping to track the modifier's identity, but is not a big deal). (*) John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and Performance of MUNIN. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 152-164, October 1991.
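
A minimal, single-process sketch of the eager (Munin-style) release path; peers are modelled as in-memory objects, whereas in a real system the updates and acknowledgements would be network messages (class and field names are illustrative):

class EagerNode:
    def __init__(self, name):
        self.name = name
        self.pages = {}           # page id -> {offset: value}
        self.pending = []         # modifications buffered since the last release
        self.caching_peers = []   # processes caching our pages

    def write(self, page, offset, value):
        self.pages.setdefault(page, {})[offset] = value
        self.pending.append((page, offset, value))   # buffered, not sent

    def apply_updates(self, updates):
        for page, offset, value in updates:
            self.pages.setdefault(page, {})[offset] = value
        return "ack"

    def release(self):
        # Upon a release, push all accumulated modifications to every
        # caching process and wait for their acknowledgements; only then
        # is the release considered performed.
        acks = [peer.apply_updates(self.pending) for peer in self.caching_peers]
        assert all(a == "ack" for a in acks)
        self.pending.clear()

a, b = EagerNode("A"), EagerNode("B")
a.caching_peers.append(b)
a.write("p1", 0, 42)          # nothing is sent yet
a.release()                   # the accumulated write reaches B here
assert b.pages["p1"][0] == 42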

  24. Understanding ERC [Timeline diagram: three processes A, B and C; B writes x and y and, at rel(L1), sends "apply changes" messages to the processes caching those pages; C writes z and does the same at rel(L2); reads of x, y and z on the other processes return the new values only after the corresponding "apply changes" message has been applied.] • The release operation does not complete (is not performed) until the acknowledgements from all the processes are received.

  25. Supporting Multiple Writers in ERC • Modifications are detected by twinning. • On the first write to an unmodified page, a twin (an exact copy) of the page is created. • At release time, the final copy of the page is compared to its twin. • The resulting difference is called a diff. • Twinning and diffing not only allow multiple writers, they also reduce communication. • Sending a diff is cheaper than sending an entire page.
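
A minimal sketch of twinning and diffing over byte pages (the page size and the diff format, a list of (offset, new byte) pairs, are illustrative):

PAGE_SIZE = 4096

def make_twin(page: bytearray) -> bytes:
    # On the first write to a clean page, keep an immutable copy (the twin).
    return bytes(page)

def make_diff(twin: bytes, page: bytearray):
    # At release time, compare the working copy with its twin and record
    # only the bytes that changed.
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]

def apply_diff(page: bytearray, diff):
    # Merge a diff (possibly from another writer) into a local copy.
    for offset, value in diff:
        page[offset] = value

page = bytearray(PAGE_SIZE)
twin = make_twin(page)          # created on the first write to the page
page[10] = 7                    # local modification
diff = make_diff(twin, page)    # [(10, 7)] -- far smaller than the page
other_copy = bytearray(PAGE_SIZE)
apply_diff(other_copy, diff)
assert other_copy[10] == 7

Because, in a data-race-free program, two concurrent writers of the same page touch different locations, their diffs do not overlap and can be merged in any order.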

  26. Twinning and Diffing [Diagram: on the first write to page P a twin is created and the page becomes a writable working copy; at release time the working copy is compared with the twin to produce a diff.]

  27. Update-based vs. Invalidate-based • In update-based protocols the modifications themselves are sent, whereas in invalidate-based protocols only notifications of modifications are sent. [Diagrams: P1 writes x:=1 and y:=2 and then releases L. Update-based: at the release P1 sends the new values of x and y to P2. Invalidate-based: P1 only sends P2 the notification "I changed x and y".]

  28. Update-Based vs. Invalidate-Based (cont.) • Invalidations are smaller than the updates. • The bigger the coherency unit, the bigger the difference. • In invalidation-based schemes there can be significant overhead due to access misses. [Timeline diagram: P1 writes x=1 and y=2 and releases L, sending inv(x) and inv(y) to P2; after acq(L), P2's reads of x and y miss and must fetch the values with get(x) and get(y).]

  29. Reducing the Number of Messages • In the DASH and Munin systems all processes (or all processes that cache the page) see the updates of a process. • Consider the following example of an execution in Munin: [Timeline diagram: P1, P2 and P3 each acquire L, write x and release L in turn; P4 then acquires L and reads x; at every release the writer's update is sent to all the other processes.] • There are many unneeded messages. In DASH even more. • This problem exists in invalidation-based schemes as well.

  30. Reducing the Number of Messages (cont.) • Logically, however, it suffices to update each processor's copy only when it acquires L. [Timeline diagram: the same execution, but each process's copy of x is brought up to date only when that process acquires L.] • Therefore a new algorithm for implementing RC, called Lazy Release Consistency (LRC), was proposed. • LRC aims at reducing both the number of messages and the amount of data exchanged.

  31. Lazy Release Consistency [Keleher et al. 1992, TreadMarks]* • The idea is to postpone the sending of modifications until a remote processor actually needs them. • An invalidate-based protocol. • The BIG advantage: no need to fetch modifications that are irrelevant because they are already masked by newer ones. • NOTE: implements a slightly more relaxed memory model than RC! (*) P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115-132, January 1994.

  32. Formal Definition of Lazy Release Consistency • Conditions for Lazy Release Consistency: • Before a read or write access is allowed to perform with respect to any other process, all previous acquire accesses must be performed with respect to that other process, and • Before a release access is allowed to perform with respect to any other process, all previous read or write accesses must be performed with respect to that other process, and • acquire and release accesses are sequentially consistent.

  33. Understanding the LRC Memory Model [Timeline diagram: A performs w(x)1 and rel(L1); B's reads of x before its acq(L1) are undefined, while its reads after acq(L1) return 1; C performs acq(L2), and its reads of x remain undefined (they may return 0 or 1), because it never acquires the lock that A released.] • It is guaranteed that an acquirer of the same lock sees the modifications that precede the release in program order.

  34. Understanding the LRC Memory Model: Transitivity [Timeline diagram: A performs w(x)1 and rel(L1); B performs acq(L1), w(y)1 and rel(L2); C performs acq(L2) and then reads r(x)1 and r(y)1.] • Process C sees the modification of x by A, by transitivity.

  35. Implementation of LRC • Satisfying the happened-before relationship between all operations is enough to satisfy LRC. • Maintaining and using such a detailed ordering would be expensive. • Instead, the ordering is applied to process intervals. • Intervals are segments of time in the execution of a single process. • A new interval begins each time a process executes a synchronization operation.

  36. Intervals [Timeline diagram: processes P1, P2 and P3; each acquire or release starts a new numbered interval in the issuing process, and the release-acquire pairs on locks L1, L2 and L3 relate intervals of different processes.]

  37. Happened-before of Intervals • A happened before partial order is defined between intervals. • An interval i1 precedes an interval i2 according to happened-before of intervals, if all accesses in i1 precede accesses in i2 according to the happened-before of accesses.

  38. Vector Timestamps • An interval is said to be performed at a process if all of the interval's accesses have been performed at that process. • Each process p has a vector timestamp Vp that tracks which intervals have been performed at that process. • A vector timestamp consists of a set of interval indices, one per process in the system.

  39. Management of Vector Timestamps • Vector timestamps are managed like vector clocks, with send and receive events replaced by release and acquire (of the same lock), respectively. • A lock grant message (sent from the releaser to the acquirer to hand over exclusive ownership) contains the current timestamp of the releaser. • Just before executing a release or acquire, p increments its own entry: Vp[p] := Vp[p] + 1. • A lock grant message m is time-stamped with t(m) = Vp. • Upon an acquire, for every q: Vp[q] := max{ Vp[q], t(m)[q] }.
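
A minimal sketch of this bookkeeping (the class layout is illustrative; the update rules are the ones above):

class VectorTimestamp:
    def __init__(self, num_procs, my_id):
        self.v = [0] * num_procs
        self.my_id = my_id

    def start_new_interval(self):
        # Just before executing a release or an acquire, p increments its
        # own entry: a new interval begins.
        self.v[self.my_id] += 1

    def stamp_lock_grant(self):
        # The lock grant message carries the releaser's current timestamp.
        return list(self.v)

    def merge_on_acquire(self, grant_stamp):
        # Upon acquire: V_p[q] := max(V_p[q], t(m)[q]) for every q.
        self.v = [max(a, b) for a, b in zip(self.v, grant_stamp)]

    def covers(self, proc, interval):
        # Interval `interval` of process `proc` is covered by this
        # timestamp if V_p[proc] >= interval (see the next slide).
        return self.v[proc] >= interval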

  40. Vector Timestamps (cont.) • A process updates its vector timestamp at the end of an interval; therefore during an interval the process's timestamp does not change. • We denote the vector timestamp of process p at interval i by Vpi. • The entry for a process q ≠ p is denoted by Vpi[q]. • It specifies the most recent interval of process q that has been performed at process p. • The entry Vpi[p] is always equal to i. • An interval x of process q is said to be covered by Vpi if Vpi[q] ≥ x.

  41. Write Notices • A write notice is an indication that a given page has been modified. • Each process keeps a table of the intervals covered by it. • An entry in this table represents an interval and contains a write notice for every page that was modified during the segment of time corresponding to that interval. • Write notices are sent in the lock grant message along with the vector timestamp of the releaser.
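
A minimal sketch of the per-process interval table and its write notices (the field names and layout are illustrative assumptions):

from dataclasses import dataclass, field

@dataclass
class IntervalRecord:
    proc: int                    # process that executed the interval
    index: int                   # interval index within that process
    write_notices: set = field(default_factory=set)   # pages modified in it

class IntervalTable:
    def __init__(self):
        self.records = {}        # (proc, index) -> IntervalRecord

    def record_write(self, proc, index, page):
        key = (proc, index)
        rec = self.records.setdefault(key, IntervalRecord(proc, index))
        rec.write_notices.add(page)

    def intervals_of(self, proc):
        return [rec for (p, _), rec in self.records.items() if p == proc]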

  42. Write Notices (cont.) • It is not necessary to send the acquirer the write notices belonging to intervals already covered by its vector timestamp. • In order to let the releaser know which intervals are covered by the acquirer, the acquirer sends the releaser its timestamp inside a lock request message. • When the releaser sends a lock grant message to the acquirer, it sends only the write notices belonging to intervals covered by itself but not covered by the acquirer. • When the acquirer receives the lock grant message, it invalidates all the pages for which a write notice is included in the message.
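
A minimal sketch of building and consuming such a lock grant message; timestamps are plain lists and write notices a dict keyed by (process, interval index), which is an assumption of this sketch rather than the TreadMarks wire format:

def build_lock_grant(releaser_ts, acquirer_ts, notices_by_interval):
    # notices_by_interval: {(proc, index): set of modified pages} known to
    # the releaser; acquirer_ts arrived in the lock request message.
    selected = {}
    for (proc, index), pages in notices_by_interval.items():
        covered_by_releaser = releaser_ts[proc] >= index
        covered_by_acquirer = acquirer_ts[proc] >= index
        if covered_by_releaser and not covered_by_acquirer:
            selected[(proc, index)] = set(pages)
    return {"timestamp": list(releaser_ts), "write_notices": selected}

def handle_lock_grant(grant, valid_pages):
    # The acquirer invalidates every page mentioned in the grant.
    for pages in grant["write_notices"].values():
        valid_pages -= pages
    return valid_pages

releaser_ts, acquirer_ts = [2, 3], [1, 3]
notices = {(0, 1): {"p1"}, (0, 2): {"p2"}, (1, 3): {"p3"}}
grant = build_lock_grant(releaser_ts, acquirer_ts, notices)
assert grant["write_notices"] == {(0, 2): {"p2"}}
assert handle_lock_grant(grant, {"p1", "p2", "p3"}) == {"p1", "p3"}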

  43. Write Notices (cont.) [Timeline diagram: A writes x and y and reaches rel(L); B, on acq(L), sends a lock request containing its vector timestamp VCB; A generates write notices and replies with a lock grant carrying the write notices for the intervals not covered by VCB; B invalidates the pages named in those write notices, and on its later r(y) it requests the diffs for the invalidated page and applies them.]

  44. Access Misses • When accessing an invalidated page, all the modifications made to it in the intervals that happened before the current interval must be obtained. • Note that this is true even if the access is a write. • A process can identify those intervals and the processes that performed the modification by the write notices it has for the page. • A write notice is saved along with the id of the process from which it was received and its vector timestamp. • How do we merge modifications performed by concurrent writers to a page?
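
A minimal sketch of the miss path: gather the write notices stored for the page, fetch the corresponding diffs from their creators, and apply them in an order consistent with happened-before (the request_diff/apply_diff callbacks and the data layout are assumptions of this sketch):

def service_miss(page, write_notices, request_diff, apply_diff):
    # write_notices: list of (creator_proc, interval_index, vector_timestamp)
    # saved locally for this page.
    # Sorting by the sum of the vector timestamp gives an order consistent
    # with happened-before: if one interval happened before another, its
    # timestamp is componentwise <= and strictly smaller in some entry,
    # hence its sum is smaller. Diffs of concurrent (independent) intervals
    # modify different locations in a data-race-free program, so their
    # relative order does not matter.
    ordered = sorted(write_notices, key=lambda wn: sum(wn[2]))
    for creator, index, _ts in ordered:
        diff = request_diff(creator, page, index)
        apply_diff(page, diff)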

  45. Tracking Modifications with Multiple Writers • It is possible that several processes make modifications to different variables on the same page. [Diagram: P1 modifies x while P2 modifies y on the same page.] • If the intervals in which the modifications are performed are independent (according to happened-before), we cannot just bring the page from one of the processes. • What should we do? Employ the twinning and diffing technique again!

  46. Twinning and Diffing (reminder) [Diagram: on the first write to page P a twin is created and the page becomes a writable working copy; at release time the working copy is compared with the twin to produce a diff.]

  47. Tracking Modifications with Multiple Writers (cont.) [Timeline diagram: P1 writes x and P3 writes y on the same page P in independent intervals, each sending inv(P); P2, after its acquires, takes an access miss on r(x) and fetches the two diffs rather than whole-page copies.] • Note that twinning and diffing not only allow multiple independent writers, they also significantly reduce the amount of data sent.

  48. Access Misses (cont.) • Consider the following scenario, in which P3 has a miss on a page containing the variables x, y and z: [Timeline diagram: P1 writes x and releases, sending inv(x); P2 acquires, writes y and releases, sending inv(x, y); P3 acquires and, on r(z), fetches the modifications mod(x, y) from P2.] • When accessing z, P3 sees from the locally stored write notices that there have been two previous modifications. • They are ordered by the happened-before relationship, therefore P3 can request both modifications from P2.

  49. Access Misses (cont.) • More generally, if processor q modified page P in its interval x, then q is guaranteed to have all diffs of P created in intervals that happened before interval x. • Therefore, even if diffs from multiple writers need to be retrieved, it is usually only necessary to communicate with very few processors. • How long should a process keep the diffs? • How long should a process keep the write notices? • Clearly, not forever! Garbage collection needs to be done…

  50. Garbage Collection • A diff needs to be retained until it is clear that it will never be requested again. • This is the case once the diff has been sent to every processor. • When a process sees that it is running out of memory, it initiates garbage collection, which is invoked at the next barrier. • Garbage collection piggybacks on the barrier to “stop the world”: each process receives all write notices in the system and uses them to validate all of its cached pages. As a result, all write notices and diffs can be discarded.
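
A minimal sketch of the barrier-time collection described above (the fetch_current_copy callback and the data layout are assumptions of this sketch):

def garbage_collect(cached_pages, all_write_notices, diffs, fetch_current_copy):
    # cached_pages:      {page id: bytearray} held by this process
    # all_write_notices: every (page, proc, interval) in the system,
    #                    gathered at the barrier
    # diffs:             locally retained diffs, keyed by (page, proc, interval)
    modified = {page for (page, proc, interval) in all_write_notices}
    for page in list(cached_pages):
        if page in modified:
            # Bring the cached copy fully up to date while the world is stopped.
            cached_pages[page] = fetch_current_copy(page)
    # Every processor now holds validated copies, so the old bookkeeping
    # will never be requested again and can be dropped.
    all_write_notices.clear()
    diffs.clear()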
