
The Case for a SC-preserving Compiler



Presentation Transcript


  1. The Case for a SC-preserving Compiler
     Dan Marino (UCLA), Abhay Singh (University of Michigan), Todd Millstein (UCLA), Madan Musuvathi (Microsoft Research), Satish Narayanasamy (University of Michigan)

  2. Talk Summary
  • SC-preserving compiler
    • Every SC behavior of the binary is a SC behavior of the source
    • Guarantees SC assuming SC hardware
  • A SC-preserving compiler is acceptably efficient
    • Enable optimizations only when provably SC-preserving
    • With simple, scalable, and readily implementable analysis
    • 2% avg, 30% max overhead on SPLASH and PARSEC benchmarks
  • Static and dynamic analyses can further reduce the performance overhead

  3. Many Compiler Optimizations are not SC-Preserving
  • Example: Common Subexpression Elimination (CSE)
  • t, u, v are local variables; X, Y are possibly shared

    Before CSE:            After CSE:
    L1: t = X*5;           L1: t = X*5;
    L2: u = Y;             L2: u = Y;
    L3: v = X*5;           L3: v = t;

  4. Common Subexpression Elimination is not SC-Preserving

    Init: X = Y = 0;

    Original:                          After CSE:
    Thread 1         Thread 2          Thread 1         Thread 2
    L1: t = X*5;     M1: X = 1;        L1: t = X*5;     M1: X = 1;
    L2: u = Y;       M2: Y = 1;        L2: u = Y;       M2: Y = 1;
    L3: v = X*5;                       L3: v = t;

  • Original program: u == 1 ⇒ v == 5
  • After CSE: possibly u == 1 && v == 0
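
  For concreteness, here is a minimal pthreads litmus test corresponding to slide 4. The thread and variable wrappers are mine; the slide shows only the three-line thread bodies, and the program is intentionally racy, so it is an illustrative sketch rather than well-defined C.

    #include <pthread.h>
    #include <stdio.h>

    int X = 0, Y = 0;          /* possibly shared */
    int u, v;                  /* observed results */

    /* Thread 2 from the slide: M1, M2 */
    void *writer(void *arg) {
        (void)arg;
        X = 1;
        Y = 1;
        return NULL;
    }

    /* Thread 1 from the slide, after CSE has replaced L3 with v = t */
    void *reader(void *arg) {
        (void)arg;
        int t = X * 5;         /* L1: may read X == 0                */
        u = Y;                 /* L2: may read Y == 1                */
        v = t;                 /* L3: reuses the stale value of X*5  */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, writer, NULL);
        pthread_create(&b, NULL, reader, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Without CSE, every SC execution satisfies u == 1 ⇒ v == 5.
           With v = t, the interleaving L1, M1, M2, L2, L3 gives u == 1, v == 0. */
        printf("u = %d, v = %d\n", u, v);
        return 0;
    }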

  5. Implementing CSE in a SC-Preserving Compiler

    Before CSE:            After CSE:
    L1: t = X*5;           L1: t = X*5;
    L2: u = Y;             L2: u = Y;
    L3: v = X*5;           L3: v = t;

  • Enable this transformation when
    • X is a local variable, or
    • Y is a local variable
  • In these cases, the transformation is SC-preserving
  • Identifying local variables:
    • Compiler-generated temporaries
    • Stack-allocated variables whose address is not taken
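
  A sketch of that enabling condition as code, using a hypothetical per-variable descriptor rather than LLVM's real data structures; the struct and function names here are illustrative assumptions, not the paper's implementation.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical facts a compiler could record per variable. */
    typedef struct {
        bool is_compiler_temp;     /* compiler-generated temporary            */
        bool is_stack_alloc;       /* stack allocated in the current function */
        bool address_taken;        /* its address escapes somewhere           */
    } VarInfo;

    /* "Local" in the sense of slide 5: provably invisible to other threads. */
    static bool is_local(const VarInfo *var) {
        return var->is_compiler_temp ||
               (var->is_stack_alloc && !var->address_taken);
    }

    /* CSE of X*5 across the intervening access to Y is SC-preserving
       only when X or Y is local; otherwise the optimization is disabled. */
    static bool cse_is_sc_preserving(const VarInfo *x, const VarInfo *y) {
        return is_local(x) || is_local(y);
    }

    int main(void) {
        VarInfo X = { false, false, false };  /* global, possibly shared        */
        VarInfo Y = { false, true,  false };  /* stack local, address not taken */
        printf("enable CSE: %d\n", cse_is_sc_preserving(&X, &Y));  /* prints 1 */
        return 0;
    }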

  6. A SC-preserving LLVM Compiler for C programs
  • Modify each of ~70 phases in LLVM to be SC-preserving
  • Enable trace-preserving optimizations
    • These do not change the order of memory operations
    • e.g. loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, …
  • Enable transformations on local variables
  • Enable transformations involving a single shared variable (sketched below)
    • e.g. t = X; u = X; v = X;  →  t = X; u = t; v = t;
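
  The single-shared-variable case from the last bullet, written out as a hedged before/after sketch (the function names are mine). The intuition, per the talk, is that forwarding the first load's value only produces outcomes the original program could also produce under SC, namely executions in which no other thread writes X between the three loads.

    int X;   /* shared */

    /* Before: three separate loads of the shared variable X. */
    void loads_before(int *t, int *u, int *v) {
        *t = X;
        *u = X;
        *v = X;
    }

    /* After: t = X; u = t; v = t.  Every SC behavior of this version matches
       an SC behavior of loads_before in which all three loads happen to read
       the same write to X, so the transformation is SC-preserving. */
    void loads_after(int *t, int *u, int *v) {
        int first = X;
        *t = first;
        *u = first;
        *v = first;
    }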

  7. Performance overhead
  • [Bar chart of per-benchmark slowdowns omitted; the individual bar values are not recoverable from the transcript]
  • Baseline: LLVM -O3
  • Experiments on Intel Xeon, 8 cores, 2 threads/core, 6 GB RAM

  8. The Overhead in Facesim

    Original hot loop:
      float s, *x, *y; int i;
      …
      hot_for_loop(… i …) {
        s += (x[i]-y[i]) * (x[i]-y[i]);
        …
      }

    Source-level rewrite:
      float s, t, *x, *y; int i;
      …
      hot_for_loop(… i …) {
        t = (x[i]-y[i]);
        s += t*t;
        …
      }

  • This transformation reduces the overhead from 34% to 6%
  • Optimizations in non-hot loops do not buy much performance
  • A SC-preserving compiler slows down a program if
    • the hot loops involve more than one shared variable, and
    • aliasing constraints do not prevent optimizations in the loop
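
  A compilable version of the same pattern, with the elided loop body reduced to a plain reduction; the function names, signatures, and loop bounds are my own simplification of the Facesim kernel.

    /* Original shape: the SC-preserving compiler cannot assume x[i] and y[i]
       are unchanged between the two uses, so it reloads them and redoes the
       subtraction in every iteration. */
    float sum_sq_diff_before(const float *x, const float *y, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += (x[i] - y[i]) * (x[i] - y[i]);
        return s;
    }

    /* Rewrite from slide 8: the difference is computed once into a local t,
       so no eager-load optimization is needed to avoid the extra loads. */
    float sum_sq_diff_after(const float *x, const float *y, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            float t = x[i] - y[i];
            s += t * t;
        }
        return s;
    }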

  9. Improving Performance of the SC-Preserving Compiler
  • Request programmers to reduce shared accesses in hot loops
  • Use sophisticated static analysis
    • Infer more thread-local variables
    • Infer data-race-free shared variables
  • Use program annotations
    • Requires changing the programming language
    • Minimal annotations are sufficient to optimize the hot loops
  • Perform load optimizations speculatively
    • Hardware exposes a speculative-load optimization to the software
    • Load optimizations reduce the max overhead to 6%

  10. Eager-Load Optimizations
  • Eagerly perform loads, or reuse values from previous loads or stores (a compilable LICM sketch follows the examples)

    Common Subexpression Elimination:
      L1: t = X*5;          L1: t = X*5;
      L2: u = Y;       →    L2: u = Y;
      L3: v = X*5;          L3: v = t;

    Constant/copy Propagation:
      L1: X = 2;            L1: X = 2;
      L2: u = Y;       →    L2: u = Y;
      L3: v = X*5;          L3: v = 10;

    Loop-invariant Code Motion:
      L1:                   L1: u = X*5;
      L2: for(…)       →    L2: for(…)
      L3:   t = X*5;        L3:   t = u;
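
  The loop-invariant code motion case from the third example, written as a hedged C sketch (names and loop bounds are illustrative): the load of X is performed eagerly once before the loop, which is only SC-preserving if no other thread can write X while the loop runs.

    int X;   /* shared */

    /* Before: X is (re)loaded in every iteration, as the source specifies. */
    void licm_before(int *out, int n) {
        for (int i = 0; i < n; i++)
            out[i] = X * 5;        /* L3: t = X*5 inside the loop */
    }

    /* After eager-load LICM: u = X*5 is hoisted out of the loop.
       If another thread updates X mid-loop, the transformed code can
       produce results no SC execution of licm_before could produce. */
    void licm_after(int *out, int n) {
        int u = X * 5;             /* L1: hoisted, eager load of X */
        for (int i = 0; i < n; i++)
            out[i] = u;            /* L3: t = u */
    }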

  11. Performance overhead
  • [Bar chart of per-benchmark slowdowns omitted; the individual bar values are not recoverable from the transcript]
  • Allowing eager-load optimizations alone reduces the max overhead to 6%

  12. Correctness Criteria for Eager-Load Optimizations
  • Eager-load optimizations rely on a variable remaining unmodified in a region of code
  • Sequential validity: no modifications to X by the current thread in L1-L3
  • SC-preservation: no modifications to X by any other thread in L1-L3

    L1: t = X*5;    establishes the invariant t == 5*X
    L2: *p = q;     must maintain the invariant t == 5*X
    L3: v = X*5;    uses the invariant t == 5*X to transform L3 into v = t;
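
  To make the two conditions concrete, a minimal sketch (the function and variable names are mine): sequential validity concerns the store through p possibly aliasing X within this thread, while SC-preservation concerns writes to X by other threads between L1 and L3, which a compiler built for sequential code never considers.

    int X;   /* shared */

    void eager_load_region(int *p, int q) {
        int t = X * 5;   /* L1: establishes the invariant t == 5*X             */
        *p = q;          /* L2: sequential validity requires *p not to alias X;
                                SC-preservation additionally requires that no
                                other thread writes X anywhere in L1-L3         */
        int v = t;       /* L3: uses the invariant to replace X*5 with t        */
        (void)v;
    }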

  13. Speculatively Performing Eager-Load Optimizations

    Original:                After speculative CSE:
    L1: t = X*5;             L1: t = monitor.load(X, tag) * 5;
    L2: u = Y;               L2: u = Y;
    L3: v = X*5;             L3: v = t;
                             C4: if (interference.check(tag))
                             C5:   v = X*5;

  • On monitor.load, the hardware starts tracking coherence messages on X's cache line
  • The interference check fails if X's cache line has been downgraded since the monitor.load
  • In our implementation, a single instruction checks interference on up to 32 tags
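
  A C-level sketch of that compiled code, with hypothetical intrinsics standing in for the proposed monitor.load and interference.check instructions; the names, signatures, and software fallbacks below are assumptions made so the sketch compiles, while the real proposal exposes these operations at the ISA level.

    #include <stdint.h>

    int X, Y;   /* shared */

    /* Software stand-ins for the proposed hardware: monitor_load would tag
       X's cache line; interference_check would report whether any tagged line
       was downgraded since the corresponding monitor_load. These fallbacks
       always report interference, which simply forces the recovery path. */
    static int monitor_load(int *addr, unsigned tag) { (void)tag; return *addr; }
    static int interference_check(uint32_t tag_mask) { (void)tag_mask; return 1; }

    void speculative_cse(int *u, int *v) {
        int t = monitor_load(&X, 0) * 5;   /* L1: tagged load of X               */
        *u = Y;                            /* L2                                 */
        *v = t;                            /* L3: optimistically reuse t         */
        if (interference_check(1u << 0))   /* C4: was X's cache line touched?    */
            *v = X * 5;                    /* C5: recovery, redo the computation */
    }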

  14. Conclusion(s)
  • Performance cost of SC = 5%
    • Cost of SC hardware = 3% [Milo's talk yesterday]
    • Cost of SC-preserving compiler = 2%
