Are We Trading Consistency Too Easily?
A Case for Sequential Consistency

Madan Musuvathi (Microsoft Research)
Dan Marino, Todd Millstein (UCLA)
Abhay Singh, Satish Narayanasamy (University of Michigan)

Memory Consistency Model
  • Abstracts the program runtime (compiler + hardware)
    • Hides compiler transformations
    • Hides hardware optimizations, cache hierarchy, …
  • Sequential consistency (SC) [Lamport ‘79]

“The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”

Sequential Consistency Explained

int X = F = 0; // F = 1 implies X is initialized

Thread 1:        Thread 2:
X = 1;           t = F;
F = 1;           u = X;

Under SC, the four operations interleave in six possible ways:

X = 1; F = 1; t = F; u = X;   →  t=1, u=1
X = 1; t = F; F = 1; u = X;   →  t=0, u=1
X = 1; t = F; u = X; F = 1;   →  t=0, u=1
t = F; X = 1; F = 1; u = X;   →  t=0, u=1
t = F; X = 1; u = X; F = 1;   →  t=0, u=1
t = F; u = X; X = 1; F = 1;   →  t=0, u=0

In every case, t=1 implies u=1: a reader that sees the flag set also sees X initialized.
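
For concreteness, here is the example as a runnable program — a minimal sketch using pthreads. The plain int accesses race deliberately (undefined behavior in ISO C), which is exactly why the memory model's guarantees matter; an SC runtime would never print t=1, u=0.

    #include <pthread.h>
    #include <stdio.h>

    int X = 0, F = 0;                    /* F = 1 implies X is initialized */

    void *writer(void *arg) {
        X = 1;                           /* initialize the data   */
        F = 1;                           /* then publish the flag */
        return NULL;
    }

    void *reader(void *arg) {
        int t = F;                       /* read the flag */
        int u = X;                       /* then the data */
        printf("t=%d, u=%d\n", t, u);    /* under SC, t=1 implies u=1 */
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }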

Conventional Wisdom
  • SC is slow
    • Disables important compiler optimizations
    • Disables important hardware optimizations
  • Relaxed memory models are faster
Conventional Wisdom, Revisited
  • “SC is slow” — crossed out on the slide:
    • Hardware speculation can hide the cost of SC hardware [Gharachorloo et al. ’91, …, Blundell et al. ’09]
    • Compiler optimizations that break SC provide negligible performance improvement [PLDI ’11]
  • “Relaxed memory models are faster” — now marked with a question mark:
    • Need fences for correctness
    • Programmers conservatively add more fences than necessary
    • Libraries use the strongest fence necessary for all clients
    • Fence implementations are slow
      • Efficient fence implementations require speculation support

Implementing Sequential Consistency Efficiently

This talk: enforce SC at both layers.

src:  t = X;   →   SC-Preserving Compiler   →   asm:  mov eax, [X]   →   SC Hardware

  • SC-Preserving Compiler: every SC behavior of the binary is an SC behavior of the source
  • SC Hardware: every observed runtime behavior is an SC behavior of the binary

Challenge: Important Compiler Optimizations are not SC-Preserving
  • Example: Common Subexpression Elimination (CSE)

t, u, v are local variables; X, Y are possibly shared.

Before CSE:          After CSE:
L1: t = X*5;         L1: t = X*5;
L2: u = Y;           L2: u = Y;
L3: v = X*5;         L3: v = t;

Common Subexpression Elimination is not SC-Preserving

Init: X = Y = 0;

Before CSE:
Thread 1:            Thread 2:
L1: t = X*5;         M1: X = 1;
L2: u = Y;           M2: Y = 1;
L3: v = X*5;

Under SC, u == 1 implies v == 5: if L2 observes M2’s store, then M1 has already executed, so L3 reloads X == 1.

After CSE:
Thread 1:            Thread 2:
L1: t = X*5;         M1: X = 1;
L2: u = Y;           M2: Y = 1;
L3: v = t;

Now possibly u == 1 && v == 0: the interleaving L1, M1, M2, L2, L3 reads X == 0 at L1 and reuses that stale value at L3.

Implementing CSE in an SC-Preserving Compiler

Before CSE:          After CSE:
L1: t = X*5;         L1: t = X*5;
L2: u = Y;           L2: u = Y;
L3: v = X*5;         L3: v = t;

  • Enable this transformation when
    • X is a local variable, or
    • Y is a local variable
  • In these cases, the transformation is SC-preserving (see the sketch below)
  • Identifying local variables:
    • Compiler-generated temporaries
    • Stack-allocated variables whose address is not taken
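
As a hedged illustration, the check reduces to a disjunction over the two variables. This is a sketch only; the type and function names below (Value, is_local, cse_is_sc_preserving) are hypothetical, not LLVM's actual API.

    #include <stdbool.h>

    /* Illustrative stand-in for a compiler IR value; the two flags mirror
       the slide's definition of "local": compiler-generated temporaries
       and stack slots whose address is never taken. */
    typedef struct {
        bool is_compiler_temporary;
        bool is_nonescaping_stack_slot;
    } Value;

    static bool is_local(const Value *v) {
        return v->is_compiler_temporary || v->is_nonescaping_stack_slot;
    }

    /* CSE of "L1: t = X*5; L2: u = Y; L3: v = X*5" into "L3: v = t" is
       SC-preserving when X or Y is local: if X is local, no other thread
       can modify X between L1 and L3, so t still equals X*5 at L3; if Y
       is local, L2 cannot observe another thread's store, so no SC
       observer can distinguish reusing t from reloading X. */
    bool cse_is_sc_preserving(const Value *X, const Value *Y) {
        return is_local(X) || is_local(Y);
    }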
An SC-Preserving LLVM Compiler for C Programs
  • Modify each of the ~70 phases in LLVM to be SC-preserving
    • Without any additional analysis
  • Enable trace-preserving optimizations
    • These do not change the order of memory operations
    • e.g., loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, …
  • Enable transformations on local variables
  • Enable transformations involving a single shared variable
    • e.g., t = X; u = X; v = X;  →  t = X; u = t; v = t;  (sketched below)
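
A minimal before/after sketch of that single-shared-variable case (function names are illustrative): eliminating the two reloads of X is SC-preserving because every run of the transformed code matches an SC execution of the original in which the three loads of X happened to execute back-to-back.

    /* X is possibly shared; t, u, v are locals. */
    int X;

    int before(void) {
        int t = X;    /* load 1 */
        int u = X;    /* load 2 */
        int v = X;    /* load 3 */
        return t + u + v;
    }

    int after(void) {
        int t = X;    /* single load of the shared variable */
        int u = t;    /* reuse: equivalent to an SC execution in which  */
        int v = t;    /* all three loads ran with no intervening writer */
        return t + u + v;
    }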
Average Performance Overhead is ~2%

[Bar chart: per-benchmark overhead of the SC-preserving compiler]

  • Baseline: LLVM -O3
  • Experiments on an Intel Xeon, 8 cores, 2 threads/core, 6 GB RAM
How Far Can an SC-Preserving Compiler Go?

Source:

float s, *x, *y;
int i;
s = 0;
for (i = 0; i < n; i++) {
    s += (x[i]-y[i]) * (x[i]-y[i]);
}

No optimization — every access recomputes its address and reloads memory:

float s, *x, *y;
int i;
s = 0;
for (i = 0; i < n; i++) {
    s += (*(x + i*sizeof(float)) - *(y + i*sizeof(float)))
       * (*(x + i*sizeof(float)) - *(y + i*sizeof(float)));
}

SC-preserving — strength-reducing the indexing into pointer increments is trace-preserving, but the repeated loads of *px and *py must remain:

float s, *x, *y;
float *px, *py, *e;
s = 0; py = y; e = &x[n];
for (px = x; px < e; px++, py++) {
    s += (*px-*py) * (*px-*py);
}

Full optimization — CSE additionally folds the difference into the local t, halving the shared loads:

float s, *x, *y;
float *px, *py, *e, t;
s = 0; py = y; e = &x[n];
for (px = x; px < e; px++, py++) {
    t = (*px-*py);
    s += t*t;
}

We Can Reduce the FaceSim Overhead (if we cheat a bit)
  • 30% of the overhead comes from the inability to perform CSE in the matrix product below
  • But argument evaluation in C/C++ is nondeterministic
    • The specification explicitly allows overlapped evaluation of function arguments, so reusing subexpressions across arguments introduces no new behaviors

return MATRIX_3X3<T>(
    x[0]*A.x[0]+x[3]*A.x[1]+x[6]*A.x[2],  x[1]*A.x[0]+x[4]*A.x[1]+x[7]*A.x[2],
    x[2]*A.x[0]+x[5]*A.x[1]+x[8]*A.x[2],  x[0]*A.x[3]+x[3]*A.x[4]+x[6]*A.x[5],
    x[1]*A.x[3]+x[4]*A.x[4]+x[7]*A.x[5],  x[2]*A.x[3]+x[5]*A.x[4]+x[8]*A.x[5],
    x[0]*A.x[6]+x[3]*A.x[7]+x[6]*A.x[8],  x[1]*A.x[6]+x[4]*A.x[7]+x[7]*A.x[8],
    x[2]*A.x[6]+x[5]*A.x[7]+x[8]*A.x[8] );
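
The same rewrite is available to the programmer, as the next slide suggests. A hedged sketch in plain C (Matrix3 and mul3 are illustrative stand-ins for the templated MATRIX_3X3 above, with float for T): hoisting each shared load into a local performs it exactly once, so the SC-preserving compiler no longer has any redundant load to eliminate.

    /* Illustrative 3x3 matrix in the same element layout as the slide. */
    typedef struct { float x[9]; } Matrix3;

    Matrix3 mul3(const Matrix3 *m, const Matrix3 *A) {
        /* One load per shared location; the locals below are
           thread-private, so every later reuse is SC-preserving. */
        float m0 = m->x[0], m1 = m->x[1], m2 = m->x[2],
              m3 = m->x[3], m4 = m->x[4], m5 = m->x[5],
              m6 = m->x[6], m7 = m->x[7], m8 = m->x[8];
        float a0 = A->x[0], a1 = A->x[1], a2 = A->x[2],
              a3 = A->x[3], a4 = A->x[4], a5 = A->x[5],
              a6 = A->x[6], a7 = A->x[7], a8 = A->x[8];
        Matrix3 r = {{ m0*a0 + m3*a1 + m6*a2,  m1*a0 + m4*a1 + m7*a2,
                       m2*a0 + m5*a1 + m8*a2,  m0*a3 + m3*a4 + m6*a5,
                       m1*a3 + m4*a4 + m7*a5,  m2*a3 + m5*a4 + m8*a5,
                       m0*a6 + m3*a7 + m6*a8,  m1*a6 + m4*a7 + m7*a8,
                       m2*a6 + m5*a7 + m8*a8 }};
        return r;
    }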

Improving Performance of the SC-Preserving Compiler
  • Request programmers to reduce shared accesses in hot loops
  • Use sophisticated static analysis
    • Infer more thread-local variables
    • Infer data-race-free shared variables
  • Use program annotations
    • Requires changing the programming language
    • Minimal annotations suffice to optimize the hot loops
  • Perform load optimizations speculatively
    • Hardware exposes speculative-load optimization to the software
    • Load optimizations reduce the max overhead to 6%
Conclusion
  • Hardware should support strong memory models
    • TSO is efficiently implementable [Mark Hill]
      • Speculation support for SC over TSO is not currently justifiable
    • Can we quantify the programmability cost for TSO?
  • Compiler optimizations should preserve the hardware memory model
  • High-level programming models can abstract TSO/SC
    • Further enable compiler/hardware optimizations
    • Improve programmer productivity, testability, and debuggability
Eager-Load Optimizations
  • Eagerly perform loads, or reuse values from previous loads or stores

Common Subexpression Elimination:
L1: t = X*5;          L1: t = X*5;
L2: u = Y;       →    L2: u = Y;
L3: v = X*5;          L3: v = t;

Constant/Copy Propagation:
L1: X = 2;            L1: X = 2;
L2: u = Y;       →    L2: u = Y;
L3: v = X*5;          L3: v = 10;

Loop-Invariant Code Motion:
L1:                   L1: u = X*5;
L2: for(…)       →    L2: for(…)
L3:   t = X*5;        L3:   t = u;

Performance Overhead

[Bar chart: per-benchmark overhead with eager-load optimizations enabled]

Allowing eager-load optimizations alone reduces the max overhead to 6%.

Correctness Criteria for Eager-Load Optimizations
  • Eager-load optimizations rely on a variable remaining unmodified in a region of code
  • Sequential validity: no modifications of X by the current thread in L1–L3
  • SC-preservation: no modifications of X by any other thread in L1–L3

L1: t = X*5;   // enable the invariant t == 5*X
L2: *p = q;    // maintain the invariant across this store
L3: v = X*5;   // use the invariant to transform L3 into v = t;
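
A small sketch of both hazards in plain C (the function is illustrative): the store at L2 breaks the invariant sequentially if p aliases X, and a racing store by another thread breaks it even when p does not.

    int X;   /* possibly shared */

    int eager_load_hazards(int *p, int q) {
        int t = X * 5;   /* L1: enables the invariant t == 5*X */
        *p = q;          /* L2: if p == &X, this thread breaks the
                            invariant (sequential validity); a concurrent
                            store to X here breaks it too (SC-preservation) */
        int v = X * 5;   /* L3: folding this to v = t is sound only if
                            neither modification can occur in L1..L3 */
        return v - t;    /* nonzero exactly when the invariant broke */
    }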

Speculatively Performing Eager-Load Optimizations
  • On monitor.load, hardware starts tracking coherence messages on X’s cache line
  • The interference check fails if X’s cache line has been downgraded since the monitor.load
  • In our implementation, a single instruction checks interference on up to 32 tags

Original:                 Speculatively optimized:
L1: t = X*5;              L1: t = monitor.load(X, tag) * 5;
L2: u = Y;                L2: u = Y;
L3: v = X*5;              L3: v = t;
                          C4: if (interference.check(tag))
                          C5:     v = X*5;