
CS 162 Memory Consistency Models


- Memory operations are reordered to improve performance
  - Hardware (e.g., store buffer, reorder buffer)
  - Compiler (e.g., code motion, caching a value in a register)

- A single thread behaves the same as long as dependences are respected

a1: St x          a2: Ld y
a2: Ld y     ≡    a1: St x

Reordering in Multiprocessors

- Reordering can lead to counter-intuitive program behavior

Initially x = y = 0

P1                P2
a1: x = 1;        b1: Ry = y;
a2: y = 1;        b2: Rx = x;

Intuitively, Ry = 1 implies Rx = 1

Possible outcomes, each with one execution order that produces it:

- (Rx=0, Ry=0): b1, b2, a1, a2
- (Rx=1, Ry=0): a1, b1, b2, a2
- (Rx=1, Ry=1): a1, a2, b1, b2
- (Rx=0, Ry=1): a2, b1, b2, a1 (only if a1 and a2 are reordered)

Reordering in Multiprocessors

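- Reordering can lead to counter-intuitive program behavior

Initially p = NULL, flag = false

P1                    P2
p = new A(…)          if (flag)
flag = true;              a = p->var;

flag is supposed to be set after p is allocated; if the two stores are reordered, P2 can see flag == true while p is still NULL

- Reordering also breaks lock-free algorithms, e.g., Dekker, Peterson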

Reordering in Multiprocessors

- Dekker Algorithm (mutual exclusion)
- Reordering can lead to counter-intuitive program behavior

Initially flag1 = flag2 = 0

P1                      P2
flag1 = 1;              flag2 = 1;
if (flag2 == 0)         if (flag1 == 0)
    critical section        critical section

On P1, "St flag1" can be reordered after "Ld flag2" (and likewise on P2)

After reordering, both loads can return 0, so both processors enter the critical section
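The store-then-load pattern above can be tried out in C++. This is a minimal sketch, assuming C++11 `std::atomic` with sequentially consistent ordering. The names `r1`, `r2`, `run_once`, and `count_violations` are illustrative and do not come from the slides. With `memory_order_seq_cst`, the outcome where both loads return 0 is ruled out by the language model. With `memory_order_relaxed` it would be allowed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// flag1/flag2 follow the slide; everything else is an illustrative assumption.
std::atomic<int> flag1{0}, flag2{0};

// Runs both "processors" once. Returns true if BOTH loads saw 0,
// i.e. both would have entered the critical section.
bool run_once() {
    flag1.store(0);
    flag2.store(0);
    int r1 = -1, r2 = -1;
    std::thread p1([&] {
        flag1.store(1, std::memory_order_seq_cst);   // St flag1
        r1 = flag2.load(std::memory_order_seq_cst);  // Ld flag2
    });
    std::thread p2([&] {
        flag2.store(1, std::memory_order_seq_cst);   // St flag2
        r2 = flag1.load(std::memory_order_seq_cst);  // Ld flag1
    });
    p1.join();
    p2.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the experiment and counts mutual-exclusion violations.
int count_violations(int iterations) {
    int violations = 0;
    for (int i = 0; i < iterations; ++i) {
        if (run_once()) ++violations;
    }
    return violations;
}
```

Calling `count_violations(200)` should return 0 under these orderings, since any single total order of the four operations places at least one store before the other thread's load.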

- A memory consistency model specifies the ordering of loads and stores to different memory locations
  - Ld → Ld, Ld → St, St → Ld, St → St

- Contract between hardware, compiler, and programmer
  - Hardware and compiler will not violate the ordering specified
  - The programmer will not assume a stricter order than that of the model

[Figure: programmability vs. performance. Stronger models impose stronger constraints: fewer memory reorderings make programs easier to reason about, but performance is lower.]

- Cache coherence ensures a consistent view of memory
  - Guarantees that an update to memory by one processor will eventually be seen by other processors

- But how consistent?
  - NO guarantee on when an update will be seen
  - NO guarantee on the order in which updates will be seen

Initially A = B = 0

P1            P2                     P3
A = 1;        while (A != 1) ;       while (B != 1) ;
              B = 1;                 tmp = A;

tmp = 1? or tmp = 0?
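The three-processor example can be written down in C++ to see what a language-level model promises. This is a sketch under the assumption that `A` and `B` are C++11 `std::atomic` variables using the default sequentially consistent ordering. The function name `run_three_processors` and the `yield` calls in the spin loops are illustrative additions. Under that assumption P3 must read 1. Cache coherence alone, as the slide notes, does not settle the question.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> A{0}, B{0};

// Runs P1, P2 and P3 from the slide once and returns the value P3 reads into tmp.
int run_three_processors() {
    A.store(0);
    B.store(0);
    int tmp = -1;
    std::thread p1([] { A.store(1); });  // P1: A = 1;
    std::thread p2([] {
        while (A.load() != 1) { std::this_thread::yield(); }  // wait for A
        B.store(1);                                           // B = 1;
    });
    std::thread p3([&] {
        while (B.load() != 1) { std::this_thread::yield(); }  // wait for B
        tmp = A.load();                                       // tmp = A;
    });
    p1.join();
    p2.join();
    p3.join();
    return tmp;
}
```

With these orderings, P3's read of `A` is ordered after P2's write of `B`, which is ordered after P2 observed `A == 1`, so the function returns 1.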

[Figure: processors P1, P2, P3, ..., Pn sharing a single MEMORY]

- Definition of sequential consistency (SC) [Lamport]
  - (1) the result of any execution is the same as if the operations of all processors were executed in some sequential order;
  - (2) the operations of each individual processor appear in this sequence in the order specified by its program.

- SC behaves as if the following steps were repeated:
  - Pick a processor by any method (e.g., randomly)
  - The processor completes one load/store operation
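The "pick a processor, complete one operation" view can be explored exhaustively for a small program. The sketch below assumes the two-processor example used elsewhere in these slides (P1: `x = 1; y = 1;` and P2: `Ry = y; Rx = x;`). Instead of picking randomly, it tries every possible choice at each step. The names `State`, `explore`, and `sc_outcomes` are illustrative.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Shared memory (x, y) and P2's registers (Rx, Ry); all start at 0.
struct State { int x = 0, y = 0, Rx = 0, Ry = 0; };

// P1: a1: x = 1;  a2: y = 1;
static void step_p1(State& s, int pc) { if (pc == 0) s.x = 1; else s.y = 1; }
// P2: b1: Ry = y;  b2: Rx = x;
static void step_p2(State& s, int pc) { if (pc == 0) s.Ry = s.y; else s.Rx = s.x; }

// At each step, try both choices of "which processor goes next".
// Each processor always executes its own operations in program order.
static void explore(State s, int pc1, int pc2,
                    std::set<std::pair<int, int>>& out) {
    if (pc1 == 2 && pc2 == 2) {
        out.emplace(s.Rx, s.Ry);  // record the final (Rx, Ry)
        return;
    }
    if (pc1 < 2) { State t = s; step_p1(t, pc1); explore(t, pc1 + 1, pc2, out); }
    if (pc2 < 2) { State t = s; step_p2(t, pc2); explore(t, pc1, pc2 + 1, out); }
}

// Returns every (Rx, Ry) pair reachable under sequential consistency.
std::set<std::pair<int, int>> sc_outcomes() {
    std::set<std::pair<int, int>> out;
    explore(State{}, 0, 0, out);
    return out;
}
```

The resulting set contains (Rx=0, Ry=0), (Rx=1, Ry=0), and (Rx=1, Ry=1). The pair (Rx=0, Ry=1) never appears, because it requires y to be written before x.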

SC Example

P1                P2
a1: x = 1;        b1: Ry = y;
a2: y = 1;        b2: Rx = x;

[Figure: interleavings of a1, a2, b1, b2 that keep each processor's program order, e.g. b1, b2, a1, a2 with outcome (Rx=0, Ry=0)]

- Simple and intuitive
  - consistent with programmers’ intuition
  - easy to reason about program behavior

- However, the simplicity comes at the cost of performance
  - prevents aggressive compiler optimizations (e.g., load reordering, store reordering, caching a value in a register)
  - constrains the use of hardware features (e.g., store buffer)

SC Violation

P1                P2
a1: x = 1         b1: R1 = y
a2: y = 1         b2: R2 = x

program order: a1 → a2, b1 → b2
conflict relation: a1 and b2 (both access x), a2 and b1 (both access y)

- A cycle formed by program orders and conflict orders [Shasha and Snir, 1988], e.g., (a2, b1, b2, a1, a2)
- Executing in the order (a2, b1, b2, a1) will produce R1=1, R2=0, which is not an SC outcome

Insert fences to break the cycle

- a2 cannot be executed before a1
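The cycle condition can be checked mechanically for the four-operation example. This sketch makes an assumption: each conflict edge is oriented by the order in which the two operations actually executed, and the program-order edges are then added. It reports a violation when the resulting graph has a cycle. The function name `has_sc_violation` and the use of a topological sort for cycle detection are illustrative choices, not taken from the slides.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Operations of the example, indexed 0..3 (names follow the slide).
enum Op { a1 = 0, a2 = 1, b1 = 2, b2 = 3 };

// Returns true if executing the operations in `order` forms a cycle of
// program-order and conflict edges, i.e. an SC violation.
bool has_sc_violation(const std::array<Op, 4>& order) {
    std::array<int, 4> pos{};
    for (int i = 0; i < 4; ++i) pos[order[i]] = i;

    std::vector<std::vector<int>> adj(4);
    // Program order: a1 -> a2 on P1, b1 -> b2 on P2.
    adj[a1].push_back(a2);
    adj[b1].push_back(b2);
    // Conflict edges, oriented from the earlier-executed operation to the later one.
    auto conflict = [&](Op u, Op v) {
        if (pos[u] < pos[v]) adj[u].push_back(v);
        else adj[v].push_back(u);
    };
    conflict(a1, b2);  // both access x
    conflict(a2, b1);  // both access y

    // Topological sort: if some node can never be removed, there is a cycle.
    std::array<int, 4> indeg{};
    for (const auto& outs : adj)
        for (int v : outs) ++indeg[v];
    std::vector<int> ready;
    for (int u = 0; u < 4; ++u)
        if (indeg[u] == 0) ready.push_back(u);
    int removed = 0;
    while (!ready.empty()) {
        int u = ready.back();
        ready.pop_back();
        ++removed;
        for (int v : adj[u])
            if (--indeg[v] == 0) ready.push_back(v);
    }
    return removed < 4;
}
```

For the order (a2, b1, b2, a1) the edges form the cycle a1 → a2 → b1 → b2 → a1, so the function returns true. For the order (a1, a2, b1, b2) it returns false.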

Fence Instructions

- Order memory operations before and after the fence

P1
p = new A(…)
FENCE
flag = true;

- Inevitable: required for building concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL’11]
- Expensive: Cilk-5’s THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI’98]
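One way to express the fenced P1 code in C++ is with `std::atomic_thread_fence`. This is a minimal sketch, not the slides' implementation. It assumes `p` and `flag` are atomics accessed with relaxed ordering, so that the fences alone provide the ordering. The class `A` with a single field `var`, the constructor argument 42, and the function name `publish_and_read` are illustrative. The reader spins on `flag` instead of the slide's single `if (flag)`, so the result is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative stand-in for the slide's class A.
struct A {
    int var;
    explicit A(int v) : var(v) {}
};

std::atomic<A*> p{nullptr};
std::atomic<bool> flag{false};

// P1 publishes an object, P2 waits for the flag and reads through the pointer.
int publish_and_read() {
    p.store(nullptr);
    flag.store(false);
    int a = -1;
    std::thread p1([] {
        p.store(new A(42), std::memory_order_relaxed);         // p = new A(...)
        std::atomic_thread_fence(std::memory_order_release);   // FENCE
        flag.store(true, std::memory_order_relaxed);           // flag = true;
    });
    std::thread p2([&] {
        while (!flag.load(std::memory_order_relaxed)) {        // wait for flag
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);   // matching fence on the reader
        a = p.load(std::memory_order_relaxed)->var;            // a = p->var;
    });
    p1.join();
    p2.join();
    delete p.load();
    return a;
}
```

The release fence keeps the store to `p` from being ordered after the store to `flag`. The acquire fence on the reader side keeps the load of `p` from being ordered before the load of `flag`. Both sides are needed for the reader to be sure it sees the allocated object.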

P1                P2
a1: St x          b1: St y
Fence1            Fence2
a2: Ld y          b2: Ld x

- At time T, a1 and a2 have completed; b1 and b2 only execute after time T.
- No cycle is formed at runtime, so the fences are unnecessary in this execution
- Fences are inserted statically and conservatively

P1                P2
if (cond)         b1: St y
    a1: St x      Fence2
Fence1            b2: Ld x
a2: Ld y

- a1 is in a conditional branch

P1                P2
a1: St *p         b1: St x
Fence1            Fence2
a2: Ld x          b2: Ld *q

- p and q may point to the same memory location

- No cycle is formed at runtime
- Fences are inserted statically and conservatively

- Traditional fence
  - Processor-centric: unaware of memory accesses in other processors

- However, the purpose of fences is to
  - prevent memory accesses from being reordered and observed by other processors (i.e., a cycle formed at runtime)

- Consider the memory locations accessed around fences at runtime

- Fences only take effect when a cycle is about to form

[Figure: Proc 1 (a1, Fence1, a2) and Proc 2 (b1, Fence2, b2), with labels A1, A2, B1, B2 around the fences, a conflict edge c1, and a possible conflict edge c2 marked "?". A second version of the figure adds a watchlist used to detect c2.]

- How to detect c2 efficiently?
  - Collect a watchlist for each fence
  - A completing memory operation checks the watchlist
    - bypass, if its address is not in the watchlist
    - stall, otherwise
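The bypass-or-stall check can be sketched as a toy software model. This is only an illustration of the decision rule in the bullets above, under the assumption that a watchlist is a set of addresses. It does not model how the hardware collects the watchlist or when entries are removed. The names `AddressAwareFence`, `watch`, `check`, and `Action` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

enum class Action { Bypass, Stall };

// Toy model of one active fence and the watchlist collected for it.
struct AddressAwareFence {
    std::unordered_set<std::uintptr_t> watchlist;

    // Adds an address to the watchlist (how addresses are chosen is not modeled).
    void watch(std::uintptr_t addr) { watchlist.insert(addr); }

    // Called for a memory operation that is about to complete past the fence.
    Action check(std::uintptr_t addr) const {
        return watchlist.count(addr) ? Action::Stall : Action::Bypass;
    }
};
```

In this model, an operation whose address is absent from the watchlist proceeds as if there were no fence. Only operations on watched addresses pay the cost of stalling.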

Traditional fence (T) vs. Address-aware fence (A)

Fence overhead becomes negligible

L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690–691, 1979.

S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66–76, 1996.

D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282–312, 1988.

D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, 2011.

C. Lin, V. Nagarajan, and R. Gupta. Address-aware fences. ICS ’13, pages 313–324, 2013.