CS 162 Memory Consistency Models

yehudi

Presentation Transcript


  1. CS 162 Memory Consistency Models

  2. Reordering in Uniprocessors
     • Memory operations are reordered to improve performance
       • Hardware (e.g., store buffer, reorder buffer)
       • Compiler (e.g., code motion, caching a value in a register)
     • The program behaves the same as long as dependences are respected
     • Example: (a1: St x; a2: Ld y) ≡ (a2: Ld y; a1: St x)

  3. Reordering in Multiprocessors
     • Counter-intuitive program behavior
     • Initially x = y = 0
         P1                P2
         a1: x = 1;        b1: Ry = y;
         a2: y = 1;        b2: Rx = x;
     • Intuitively, y = 1 ⇒ x = 1
     • Possible outcomes: (Rx=0, Ry=0), (Rx=0, Ry=1), (Rx=1, Ry=0), (Rx=1, Ry=1)
     • (Rx=0, Ry=1) contradicts the intuition: it can only arise if a1 and a2 (or b1 and b2) are reordered

  4. Reordering in Multiprocessors
     • Counter-intuitive program behavior
     • Initially p = NULL, flag = false
         P1                P2
         p = new A(…)      if (flag)
         flag = true;          a = p->var;
     • flag is supposed to be set after p is allocated
     • Lock-free algorithms, e.g., Dekker, Peterson

  5. Reordering in Multiprocessors
     • Dekker algorithm (mutual exclusion)
     • Counter-intuitive program behavior
     • Initially flag1 = flag2 = 0
         P1                      P2
         flag1 = 1;              flag2 = 1;
         if (flag2 == 0)         if (flag1 == 0)
             critical section        critical section
     • On P1, St flag1 may be reordered after Ld flag2 (and likewise on P2)
     • After reordering, both loads can return 0, so both processors enter the critical section
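
The Dekker failure can be reproduced with a small exhaustive simulation. The sketch below is a toy model, not a description of any real processor: it assumes each processor has a FIFO store buffer that drains to memory at arbitrary times, and that a load checks its own buffer before memory. The function name `dekker_outcomes` and the encoding of the two programs are illustrative choices.

```python
def dekker_outcomes(store_buffers):
    """Return every (r1, r2) pair the two loads can observe.

    r1 is what P1 reads from flag2, r2 is what P2 reads from flag1.
    (0, 0) means both processors enter the critical section.
    """
    programs = (
        (("st", "flag1"), ("ld", "flag2")),  # P1
        (("st", "flag2"), ("ld", "flag1")),  # P2
    )
    outcomes = set()

    def run(mem, bufs, pcs, regs):
        progressed = False
        for p in (0, 1):
            # Option 1: drain the oldest buffered store of processor p.
            if bufs[p]:
                new_bufs = list(bufs)
                new_bufs[p] = bufs[p][1:]
                run({**mem, bufs[p][0]: 1}, tuple(new_bufs), pcs, regs)
                progressed = True
            # Option 2: execute the next instruction of processor p.
            if pcs[p] < len(programs[p]):
                op, var = programs[p][pcs[p]]
                new_mem, new_bufs = dict(mem), list(bufs)
                new_pcs, new_regs = list(pcs), list(regs)
                new_pcs[p] += 1
                if op == "st":
                    if store_buffers:
                        new_bufs[p] = bufs[p] + (var,)  # store waits in the buffer
                    else:
                        new_mem[var] = 1                # store is visible at once
                else:
                    # Load: forward from the own buffer, otherwise read memory.
                    new_regs[p] = 1 if var in bufs[p] else mem[var]
                run(new_mem, tuple(new_bufs), tuple(new_pcs), tuple(new_regs))
                progressed = True
        if not progressed:  # both programs finished and both buffers empty
            outcomes.add(regs)

    run({"flag1": 0, "flag2": 0}, ((), ()), (0, 0), (None, None))
    return outcomes


without_buffers = dekker_outcomes(store_buffers=False)
with_buffers = dekker_outcomes(store_buffers=True)
print("mutual exclusion can break without buffers:", (0, 0) in without_buffers)
print("mutual exclusion can break with buffers:   ", (0, 0) in with_buffers)
```

In this model the store buffer is what lets a later load take effect before an earlier store becomes visible, which is the St → Ld reordering the slide describes.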

  6. Memory Consistency Models
     • Specify the ordering of loads and stores to different memory locations
       • Ld → Ld, Ld → St, St → Ld, St → St
     • Contract between hardware, compiler, and programmer
       • The hardware and compiler will not violate the specified ordering
       • The programmer will not assume a stricter order than that of the model

  7. Memory Consistency Models
     • Stronger models: stronger constraints, fewer memory reorderings, easier to reason about, lower performance
     • [Figure: spectrum of models; programmability runs from high to low while performance runs from low to high]

  8. Cache Coherence vs. Memory Model
     • Cache coherence ensures a consistent view of memory
       • Guarantees that an update to memory by one processor will eventually be seen by other processors
     • But how consistent?
       • NO guarantee on when an update will be seen
       • NO guarantee on the order in which updates will be seen

  9. Cache Coherence vs. Memory Model
     • Initially A = B = 0
         P1          P2                   P3
         A = 1;      while (A != 1) ;     while (B != 1) ;
                     B = 1;               tmp = A;
     • tmp = 1? or tmp = 0?

  10. Sequential Consistency (SC)
     • [Figure: processors P1, P2, P3, Pn and a single shared MEMORY]
     • Definition [Lamport]
       • (1) The result of any execution is the same as if the operations of all processors were executed in some sequential order;
       • (2) the operations of each individual processor appear in this sequence in the order specified by its program.
     • Behaves as the repetition of:
       • Pick a processor by any method (e.g., randomly)
       • That processor completes one load/store operation

  11. SC Example
         P1                P2
         a1: x = 1;        b1: Ry = y;
         a2: y = 1;        b2: Rx = x;
     • [Figure: interleavings of a1, a2, b1, b2 that keep each processor's program order, with the resulting register values, e.g. (Rx=0, Ry=0)]
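
The SC example can be checked by brute force. The following sketch enumerates every interleaving of P1 and P2 that preserves each processor's program order and records the (Rx, Ry) values each one produces. The helper names `interleavings` and `run` are illustrative, not taken from the slides.

```python
from itertools import combinations

# (label, operation, variable) for each instruction, in program order.
P1 = [("a1", "st", "x"), ("a2", "st", "y")]
P2 = [("b1", "ld", "y"), ("b2", "ld", "x")]


def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps both program orders."""
    n = len(p1) + len(p2)
    for slots in combinations(range(n), len(p1)):
        it1, it2 = iter(p1), iter(p2)
        # Positions in `slots` take the next P1 instruction, the rest take P2's.
        yield [next(it1) if k in slots else next(it2) for k in range(n)]


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for _label, op, var in schedule:
        if op == "st":
            mem[var] = 1
        else:
            regs["R" + var] = mem[var]
    return regs["Rx"], regs["Ry"]


results = {}
for schedule in interleavings(P1, P2):
    order = " ".join(label for label, _, _ in schedule)
    results.setdefault(run(schedule), []).append(order)

for (rx, ry), orders in sorted(results.items()):
    print(f"(Rx={rx}, Ry={ry}): {orders}")
```

Six interleavings exist, and none of them yields (Rx=0, Ry=1), which matches the intuition that y = 1 implies x = 1 under SC.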

  12. Sequential Consistency (SC)
     • Simple and intuitive
       • Consistent with programmers’ intuition
       • Easy to reason about program behavior
     • However, the simplicity comes at the cost of performance
       • Prevents aggressive compiler optimizations (e.g., load reordering, store reordering, caching a value in a register)
       • Constrains hardware utilization (e.g., store buffer)

  13. SC Violation
         P1                P2
         a1: x = 1         b1: R1 = y
         a2: y = 1         b2: R2 = x
     • [Figure: edges between the four operations, labeled as program order or conflict relation]
     • SC violation: a cycle formed by program orders and conflict orders [Shasha and Snir, 1988], e.g., (a2, b1, b2, a1, a2)
     • Executing in the order (a2, b1, b2, a1) produces R1=1, R2=0, which is not an SC outcome
     • Insert fences to break the cycle: a2 cannot be executed before a1
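
The cycle condition can be sketched as a graph check. The code below assumes that each conflict edge is directed by which of the two conflicting accesses executed first in a given run, and then looks for a cycle in the union of program-order and conflict edges with a depth-first search. The function `has_cycle` and the edge lists are illustrative; this is a simplification for a single concrete execution, not the static analysis of Shasha and Snir.

```python
def has_cycle(edges):
    """Return True if the directed graph given by (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = dict.fromkeys(graph, WHITE)

    def visit(node):
        color[node] = GREY
        for succ in graph[node]:
            # A GREY successor is an ancestor on the stack: back edge, so a cycle.
            if color[succ] == GREY or (color[succ] == WHITE and visit(succ)):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


program_order = [("a1", "a2"), ("b1", "b2")]

# Execution (a2, b1, b2, a1): b1 reads the y written by a2, b2 reads x before a1 writes it.
reordered_run = [("a2", "b1"), ("b2", "a1")]

# Execution (a1, a2, b1, b2): both stores precede both loads.
in_order_run = [("a2", "b1"), ("a1", "b2")]

print("reordered run violates SC:", has_cycle(program_order + reordered_run))
print("in-order run violates SC: ", has_cycle(program_order + in_order_run))
```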

  14. Fence Instructions
     • Order memory operations before and after the fence
         P1
         p = new A(…)
         FENCE
         flag = true;
     • Inevitable when building concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL’11]
     • Expensive: Cilk-5’s THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI’98]
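
The effect of the fence on the p/flag example can be sketched with another toy simulation. The model here is an assumption made for illustration: P1's stores sit in a buffer that may drain in any order (so St → St reordering is possible), a fence may only execute once that buffer is empty, and P2 performs its loads in program order. Pointer allocation is reduced to writing 1 into `p`; the name `mp_outcomes` is hypothetical.

```python
def mp_outcomes(with_fence):
    """Return every (flag_seen, p_seen) pair P2 can observe.

    (1, 0) is the bad case: P2 sees flag set while p is still unset.
    """
    p1 = [("st", "p")]
    if with_fence:
        p1.append(("fence", None))
    p1.append(("st", "flag"))
    p2 = [("ld", "flag"), ("ld", "p")]
    outcomes = set()

    def run(mem, buf, pc1, pc2, regs):
        moved = False
        # Any buffered store may become visible next: the buffer is not FIFO.
        for var in buf:
            run({**mem, var: 1}, buf - {var}, pc1, pc2, regs)
            moved = True
        if pc1 < len(p1):
            op, var = p1[pc1]
            if op == "st":
                run(mem, buf | {var}, pc1 + 1, pc2, regs)
                moved = True
            elif not buf:
                # Fence: proceed only after all earlier stores are visible.
                run(mem, buf, pc1 + 1, pc2, regs)
                moved = True
        if pc2 < len(p2):
            _, var = p2[pc2]
            run(mem, buf, pc1, pc2 + 1, regs + (mem[var],))
            moved = True
        if not moved:  # both programs finished and the buffer is empty
            outcomes.add(regs)

    run({"p": 0, "flag": 0}, frozenset(), 0, 0, ())
    return outcomes


print("bad outcome possible without fence:", (1, 0) in mp_outcomes(with_fence=False))
print("bad outcome possible with fence:   ", (1, 0) in mp_outcomes(with_fence=True))
```

The fence removes the bad outcome in this model because the store to flag cannot even enter the buffer until the store to p has reached memory.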

  15. Conservativeness of Fences
         P1                P2
         a1: St x          b1: St y
         Fence1            Fence2
         a2: Ld y          b2: Ld x
     • At time T, a1 and a2 have completed; b1 and b2 only execute after time T
     • No cycle is formed at runtime
     • Fences are inserted statically and conservatively

  16. Conservativeness of Fences
     • a1 is in a conditional branch
         P1                      P2
         if (cond) a1: St x      b1: St y
         Fence1                  Fence2
         a2: Ld y                b2: Ld x
     • p and q may point to the same memory location
         P1                P2
         a1: St *p         b1: St x
         Fence1            Fence2
         a2: Ld x          b2: Ld *q
     • No cycle is formed at runtime
     • Fences are inserted statically and conservatively

  17. Processor-centric Fence
     • Traditional fence
       • Processor-centric: unaware of memory accesses in other processors
     • However, the purpose of fences is to
       • Prevent memory accesses from being reordered and observed by other processors (i.e., a cycle formed at runtime)

  18. Address-aware Fences
     • Consider memory locations accessed around fences at runtime
     • Fences only take effect when a cycle is about to happen

  19. Detect and Avoid Cycles
     • [Figure: Proc 1 with A1, a1: …, Fence1, a2: …, A2; Proc 2 with B1, b1: …, Fence2, b2: …, B2; conflict edges c1 and c2, with c2 marked "?"]
     • How to detect c2 efficiently?

  20. Detect and Avoid Cycles
     • [Figure: Proc 1 with A1, a1: …, Fence1, a2: …, A2; Proc 2 with B1, b1: …, Fence2, b2: …, B2; conflict edges c1 and c2, with c2 marked "?"; a watchlist is added]
     • How to detect c2 efficiently?
       • Collect a watchlist for each fence
       • A completing memory operation checks the watchlist
         - bypass, if its address is not in the watchlist
         - stall, otherwise
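
The bypass-or-stall check can be sketched in a few lines. The slides do not say how a watchlist is collected or how it is held in hardware, so the class below is purely hypothetical: `FenceWatchlist`, `watch`, and `check` are made-up names, and the caller is assumed to supply the addresses to watch.

```python
class FenceWatchlist:
    """Toy watchlist attached to one fence (hypothetical software model)."""

    def __init__(self):
        self.addresses = set()

    def watch(self, address):
        """Record an address whose access past the fence could close a cycle."""
        self.addresses.add(address)

    def check(self, address):
        """Decide what a completing memory operation on `address` should do."""
        return "stall" if address in self.addresses else "bypass"


fence1 = FenceWatchlist()
fence1.watch(0x1000)          # assumed: 0x1000 was flagged as risky for this fence

print(fence1.check(0x1000))   # address is on the watchlist
print(fence1.check(0x2000))   # address is not on the watchlist
```

The point of the design is that an operation on an unwatched address pays no fence cost, so the fence only takes effect for the accesses that could actually form a cycle.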

  21. Performance: Execution Time
     • Traditional fence (T) vs. address-aware fence (A)
     • Fence overhead becomes negligible

  22. Further Reading
     • L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690–691, 1979.
     • S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29:66–76, 1996.
     • D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282–312, 1988.
     • D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, 2011.
     • C. Lin, V. Nagarajan, and R. Gupta. Address-aware fences. ICS ’13, pages 313–324, 2013.
