The Case for a SC-preserving Compiler
Madan Musuvathi (Microsoft Research)
Dan Marino, Todd Millstein (UCLA)
Abhay Singh, Satish Narayanasamy (University of Michigan)
Talk Summary
• SC-preserving compiler
  • Every SC behavior of the binary is a SC behavior of the source
  • Guarantees SC assuming SC hardware
• A SC-preserving compiler is acceptably efficient
  • Enable optimizations only when provably SC-preserving
  • With simple, scalable, and readily implementable analysis
  • 2% avg, 30% max overhead on SPLASH & PARSEC benchmarks
• Static and dynamic analyses can further reduce the performance overhead
Many Compiler Optimizations are not SC-Preserving
• Example: Common Subexpression Elimination (CSE)
• t, u, v are local variables; X, Y are possibly shared

  L1: t = X*5;        L1: t = X*5;
  L2: u = Y;     =>   L2: u = Y;
  L3: v = X*5;        L3: v = t;
Common Subexpression Elimination is not SC-Preserving

Original program (Init: X = Y = 0;)
  Thread 1            Thread 2
  L1: t = X*5;        M1: X = 1;
  L2: u = Y;          M2: Y = 1;
  L3: v = X*5;
Under SC: u == 1 implies v == 5

After CSE (Init: X = Y = 0;)
  Thread 1            Thread 2
  L1: t = X*5;        M1: X = 1;
  L2: u = Y;          M2: Y = 1;
  L3: v = t;
Under SC: possibly u == 1 && v == 0
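The claim on this slide can be checked by brute force, since the two threads have only ten SC interleavings. The following is a minimal sketch, not part of the talk: the names `outcome`, `run`, `bits_set`, and `stale_read_reachable` are hypothetical, and the shared variables are modeled as plain integers stepped by a single-threaded scheduler rather than by real threads.

```c
#include <stdbool.h>

/* Final values of thread 1's locals after one execution. */
typedef struct { int u, v; } outcome;

/* Execute one sequentially consistent interleaving.
   Bit i of mask set:   step i runs thread 2's next statement (M1, then M2).
   Bit i of mask clear: step i runs thread 1's next statement (L1, L2, L3).
   cse selects the transformed thread 1, where L3 is "v = t". */
static outcome run(unsigned mask, bool cse) {
    int X = 0, Y = 0;          /* shared; Init: X = Y = 0 */
    int t = 0, u = 0, v = 0;   /* thread 1 locals */
    int l = 0, m = 0;          /* statements already run by each thread */
    for (int i = 0; i < 5; i++) {
        if (mask & (1u << i)) {
            if (m == 0) X = 1;                      /* M1 */
            else        Y = 1;                      /* M2 */
            m++;
        } else {
            if (l == 0)      t = X * 5;             /* L1 */
            else if (l == 1) u = Y;                 /* L2 */
            else             v = cse ? t : X * 5;   /* L3 */
            l++;
        }
    }
    outcome o = { u, v };
    return o;
}

/* Number of set bits in x. */
static int bits_set(unsigned x) {
    int n = 0;
    for (; x; x >>= 1) n += (int)(x & 1u);
    return n;
}

/* True if some SC interleaving ends with u == 1 && v == 0. */
bool stale_read_reachable(bool cse) {
    for (unsigned mask = 0; mask < 32; mask++) {
        if (bits_set(mask) != 2) continue;   /* exactly M1 and M2 come from thread 2 */
        outcome o = run(mask, cse);
        if (o.u == 1 && o.v == 0) return true;
    }
    return false;
}
```

Calling `stale_read_reachable(false)` explores the original program and `stale_read_reachable(true)` the program after CSE. In the transformed program, the interleaving L1, M1, M2, L2, L3 is the one that yields u == 1 with v == 0.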
Implementing CSE in a SC-Preserving Compiler

  L1: t = X*5;        L1: t = X*5;
  L2: u = Y;     =>   L2: u = Y;
  L3: v = X*5;        L3: v = t;

• Enable this transformation when
  • X is a local variable, or
  • Y is a local variable
• In these cases, the transformation is SC-preserving
• Identifying local variables:
  • Compiler generated temporaries
  • Stack allocated variables whose address is not taken
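The enabling condition above amounts to a small predicate. The sketch below is illustrative only: `var_info`, `is_local`, and `cse_allowed` are hypothetical names, and the three flags stand in for whatever facts a real compiler pass (such as one of the LLVM phases mentioned in the talk) would actually query.

```c
#include <stdbool.h>

/* Hypothetical summary of what the compiler knows about a variable. */
typedef struct {
    bool compiler_temp;    /* compiler generated temporary */
    bool stack_allocated;  /* allocated on the current thread's stack */
    bool address_taken;    /* its address is taken somewhere in the program */
} var_info;

/* A variable counts as local if it is a compiler temporary, or a stack
   variable whose address is never taken. */
bool is_local(var_info v) {
    return v.compiler_temp || (v.stack_allocated && !v.address_taken);
}

/* Reusing t for the second X*5 across the read of Y is enabled only when
   X or Y is local; otherwise the transformation is conservatively skipped. */
bool cse_allowed(var_info x, var_info y) {
    return is_local(x) || is_local(y);
}
```

The check is deliberately conservative: a variable that is in fact never shared but whose address is taken is treated as possibly shared, so the optimization is simply not applied.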
A SC-preserving LLVM Compiler for C programs
• Modify each of ~70 phases in LLVM to be SC-preserving
• Enable trace-preserving optimizations
  • These do not change the order of memory operations
  • e.g. loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, …
• Enable transformations on local variables
• Enable transformations involving a single shared variable
  • e.g. t = X; u = X; v = X;  =>  t = X; u = t; v = t;
Performance overhead
[Figure: bar chart of performance overhead; data labels on the chart: 480, 373, 173, 237, 132, 200, 116, 159, 298, 154]
• Baseline: LLVM -O3
• Experiments on Intel Xeon, 8 cores, 2 threads/core, 6GB RAM
The Overhead in Facesim

Original:
  float s, *x, *y;
  int i;
  …
  hot_for_loop(… i …){
    s += (x[i]-y[i]) * (x[i]-y[i]);
    …
  }

Transformed:
  float s, t, *x, *y;
  int i;
  …
  hot_for_loop(… i …){
    t = (x[i]-y[i]);
    s += t*t;
    …
  }

• This transformation reduces the overhead from 34% to 6%
• Optimizations in non-hot-loops do not buy much performance
• A SC-preserving compiler slows down a program if
  • The hot-loops involve more than one shared variable, and
  • Aliasing constraints do not prevent optimizations in the loop
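The two loop bodies can be written out as complete functions to show that the source-level rewrite halves the reads of x[i] and y[i] without changing the result. This is a sketch under assumptions: the function names, the `const float *` parameters, and the explicit length `n` are not from the talk, which elides the loop header and the rest of the body.

```c
#include <stddef.h>

/* Loop body as in the original code: x[i] and y[i] are each read twice.
   A SC-preserving compiler has to keep both reads when x and y may be shared. */
float sum_sq_diff(const float *x, const float *y, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += (x[i] - y[i]) * (x[i] - y[i]);
    return s;
}

/* Rewritten at the source level: the difference is held in a local t,
   so x[i] and y[i] are each read once per iteration. */
float sum_sq_diff_temp(const float *x, const float *y, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float t = x[i] - y[i];
        s += t * t;
    }
    return s;
}
```

Because `t` is a local variable, the rewritten loop gives the compiler nothing shared to reason about; the programmer has performed by hand the CSE that the compiler could not prove SC-preserving.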
Improving Performance of SC-Preserving Compiler
• Request programmers to reduce shared accesses in hot loops
• Use sophisticated static analysis
  • Infer more thread-local variables
  • Infer data-race-free shared variables
• Use program annotations
  • Requires changing the programming language
  • Minimum annotations sufficient to optimize the hot loops
• Perform load-optimizations speculatively
  • Hardware exposes speculative-load optimization to the software
  • Load optimizations reduce the max overhead to 6%
Eager-Load Optimizations
• Eagerly perform loads or use values from previous loads or stores

Common Subexpression Elimination
  L1: t = X*5;        L1: t = X*5;
  L2: u = Y;     =>   L2: u = Y;
  L3: v = X*5;        L3: v = t;

Constant/copy Propagation
  L1: X = 2;          L1: X = 2;
  L2: u = Y;     =>   L2: u = Y;
  L3: v = X*5;        L3: v = 10;

Loop-invariant Code Motion
  L1:                 L1: u = X*5;
  L2: for(…)     =>   L2: for(…)
  L3:   t = X*5;      L3:   t = u;
Performance overhead
[Figure: bar chart of performance overhead; data labels on the chart: 480, 373, 173, 237, 132, 200, 116, 159, 298, 154]
• Allowing eager-load optimizations alone reduces max overhead to 6%
Correctness Criteria for Eager-Load Optimizations
• Eager-load optimizations rely on a variable remaining unmodified in a region of code
• Sequential validity: No mods to X by the current thread in L1-L3
• SC-preservation: No mods to X by any other thread in L1-L3

  L1: t = X*5;     Enable invariant "t == 5*X"
  L2: *p = q;      Maintain invariant "t == 5*X"
  L3: v = X*5;     Use invariant "t == 5*X" to transform L3 to v = t;
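The sequential-validity condition can be seen with a single thread: if `p` happens to alias `X`, the store at L2 breaks the invariant "t == 5*X" and reusing `t` at L3 gives a different answer. The sketch below is illustrative; the function names `v_reload` and `v_reuse` and the use of `int` are assumptions, and only the three statements L1 to L3 come from the slide.

```c
/* L1-L3 as written on the slide: X is loaded again at L3. */
int v_reload(int *X, int *p, int q) {
    int t = *X * 5;   /* L1: establishes t == 5*X */
    *p = q;           /* L2: breaks the invariant if p points to X */
    int v = *X * 5;   /* L3: reload */
    (void)t;          /* t is unused in this version */
    return v;
}

/* L3 rewritten to v = t. Valid only if L2 cannot modify X. */
int v_reuse(int *X, int *p, int q) {
    int t = *X * 5;   /* L1 */
    *p = q;           /* L2 */
    return t;         /* L3: v = t */
}
```

When `p` points somewhere other than `X`, both functions return the same value; when `p == X`, they diverge. SC-preservation asks for the same guarantee, but against stores by other threads, which no single-threaded alias analysis can rule out for a shared `X`.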
Speculatively Performing Eager-Load Optimizations
• On monitor.load, hardware starts tracking coherence messages on X’s cache line
• The interference check fails if X’s cache line has been downgraded since the monitor.load
• In our implementation, a single instruction checks interference on up to 32 tags

Original:
  L1: t = X*5;
  L2: u = Y;
  L3: v = X*5;

Transformed:
  L1: t = monitor.load(X, tag) * 5;
  L2: u = Y;
  L3: v = t;
  C4: if (interference.check(tag))
  C5:   v = X*5;
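The control flow of the transformed code can be mimicked in software to see when the recovery path C5 runs. This is only a single-threaded model under stated assumptions, not the hardware mechanism from the talk: the `line` struct, its `version` counter, `remote_store`, and `speculative_v` are hypothetical, and `monitor_load` / `interference_check` merely stand in for the `monitor.load` and `interference.check` instructions.

```c
#include <stdbool.h>

#define NUM_TAGS 32   /* the talk mentions checking up to 32 tags */

/* Software stand-in for a cache line: version counts writes by other threads. */
typedef struct { int value; unsigned version; } line;

static const line *watched[NUM_TAGS];       /* line tracked under each tag */
static unsigned    seen_version[NUM_TAGS];  /* version at monitor_load time */

/* Models monitor.load: read the value and start tracking the line under tag. */
int monitor_load(const line *x, int tag) {
    watched[tag] = x;
    seen_version[tag] = x->version;
    return x->value;
}

/* Models interference.check: true if the line changed since monitor_load. */
bool interference_check(int tag) {
    return watched[tag]->version != seen_version[tag];
}

/* Models a store by another thread that downgrades the monitored line. */
void remote_store(line *x, int value) {
    x->value = value;
    x->version++;
}

/* Transformed L1-C5. If remote_write is non-null, its value is stored to X
   between L2 and L3 to simulate interference from another thread. */
int speculative_v(line *X, const line *Y, const int *remote_write) {
    int t = monitor_load(X, 0) * 5;   /* L1 */
    int u = Y->value;                 /* L2 */
    (void)u;
    if (remote_write) remote_store(X, *remote_write);
    int v = t;                        /* L3: eager use of the earlier load */
    if (interference_check(0))        /* C4 */
        v = X->value * 5;             /* C5: redo the load */
    return v;
}
```

Without interference the function returns the value computed from the first load; with a simulated remote store it falls back to reloading `X`, which is the behavior the unoptimized L3 would have had.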
Conclusion(s)
• Performance cost of SC = 5%
  • Cost of SC hardware = 3% [Milo’s talk yesterday]
  • Cost of SC-preserving compiler = 2%