CSE P501 – Compiler Construction

  1. CSE P501 – Compiler Construction Loop optimizations Dominators – discovering loops Loop Hoisting Operator Strength Reduction Loop Unrolling Memory Caches Jim Hogg - UW - CSE - P501

  2. Loops • Most of the time executing a program is spent in loops. (Why is this true?) • So focus on loops when trying to make a program run faster • So: • How do we recognize loops? • How do we improve those loops?

  3. What’s a Loop? • In source code, a loop is the set of statements in the body of a for/while construct • But, in a language that permits free use of GOTOs, how do we recognize a loop? • In a control-flow-graph (node = basic-block, arc = flow-of-control), how do we recognize a loop?

  4. Any Loops in this Code?
         i = 0
         goto L8
     L7: i++
     L8: if (i >= N) goto L9
         s = 0
         j = 0
         goto L5
     L4: j++
     L5: N--
         if (j >= N) goto L3
         if (a[j+1] >= a[j]) goto L2
         t = a[j+1]
         a[j+1] = a[j]
         a[j] = t
         s = 1
     L2: goto L4
     L3: if (s != 0) goto L1 else goto L9
     L1: goto L7
     L9: return
  Anyone recognize or guess the algorithm?

  5. Any Loops in this Flowgraph?

  6. Loop in a Flowgraph: Intuition [figure: cluster of nodes with a designated header node] • Cluster of nodes, such that: • There's one node called the "header" • I can reach all nodes in the cluster from the header • I can get back to the header from all nodes in the cluster • Only one entrance - via the header • One or more exits

  7. What’s a Loop? • In a control flow graph, a set of nodes S such that: • S includes a unique header node h • From any node in S there is a path (of directed edges) leading to h • There is a path from h to every node in S • There is no edge from any node outside S to any node in S other than via h

  8. Entries and Exits • In a loop • The entry node is one with some predecessor outside the loop • An exit node is one that has a successor node outside the loop • Corollary: A loop may have multiple exit nodes, but only one entry node

  9. Dominators • We use dominators to discover loops in a flowgraph • Recall • Every control flow graph has a unique start node 0 • Node d dominates node n if every path from 0 to n must go thru d • A node x dominates itself • You can't reach n, from 0, without passing thru d

  10. Dominators by Inspection [flowgraph with nodes 0-8] For each Node, n, which nodes dominate it? We denote this Dom(n) - the dominators of n

  11. Dominators by Inspection [the same flowgraph, nodes 0-8, with the answers filled in]

  12. Dominators: Intuition Travel everywhere you can! It only counts if you reached here by every incoming path. [flowgraph with each node annotated with its Dom set]

  13. Dominators by Calculation [flowgraph with nodes 0-8] • Dom(n) is the set of nodes that dominate n • Calculate by iterating to a fixed point: • Dom(n) = {n} ∪ ( ∩ Dom(p), over all predecessors p of n ) • Initial conditions: • Dom(0) = {0} • Otherwise, Dom(n) = U = set of all nodes

  14. Dominators by Calculation: Details Use most recent values to converge faster
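
  A minimal sketch in C of this iterative calculation, assuming the flowgraph is stored as predecessor lists and each Dom set as a 64-bit mask (so at most 64 nodes); the names and representation are illustrative, not from the lecture:

      #include <stdint.h>
      #include <stdbool.h>

      #define MAX_NODES 64

      // dom[n] is a bitmask: bit d set means "node d dominates node n".
      // preds[n][..] lists the predecessors of n; node 0 is the start node.
      void compute_dominators(int nnodes, const int preds[][MAX_NODES],
                              const int npreds[], uint64_t dom[]) {
          dom[0] = 1ULL << 0;                  // Dom(0) = {0}
          for (int n = 1; n < nnodes; n++)
              dom[n] = ~0ULL;                  // Dom(n) = U, the set of all nodes

          bool changed = true;
          while (changed) {                    // iterate to a fixed point
              changed = false;
              for (int n = 1; n < nnodes; n++) {
                  uint64_t meet = ~0ULL;       // intersection over predecessors
                  for (int p = 0; p < npreds[n]; p++)
                      meet &= dom[preds[n][p]];          // uses most recent values
                  uint64_t newdom = meet | (1ULL << n);  // Dom(n) = {n} ∪ ∩ Dom(p)
                  if (newdom != dom[n]) { dom[n] = newdom; changed = true; }
              }
          }
      }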

  15. Dominator Algorithm • Iterative Dataflow • Easier than previous Iterative Dataflow solutions • Ignores content of each basic block • Concerned only with how blocks link - the structure of the flowgraph • Double-check Cooper & Torczon • page 481, bottom table • iteration #1 for node B3 • Answer = {0,1,3} • Should it be {0,1,2,3} ?

  16. Immediate Dominators • Every node n (except the start node) has a single immediate dominator • Denote the immediate dominator of node n as IDom(n) • IDom(n) is not n itself - we are interested now in strict dominance • strict in the same sense as strict subset: ⊂ versus ⊆ • IDom(n) strictly dominates n and is the nearest dominator of n • nearest: IDom(n) does not dominate any other strict dominator of n • Theorem: If both a and b dominate n, then either a dominates b or b dominates a • Proof: reductio ad absurdum • Therefore, IDom(n) is unique
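
  Once the Dom sets are computed, IDom(n) can be read off: it is the strict dominator of n that every other strict dominator of n also dominates. A sketch using the same bitmask representation as the dominator code above (illustrative only):

      #include <stdint.h>

      // Returns IDom(n), or -1 if n is the start node.
      // dom[n] is the dominator bitmask computed by compute_dominators().
      int immediate_dominator(int n, int nnodes, const uint64_t dom[]) {
          uint64_t strict = dom[n] & ~(1ULL << n);    // strict dominators of n
          for (int d = 0; d < nnodes; d++) {
              if (!(strict & (1ULL << d))) continue;
              // d is IDom(n) iff every strict dominator of n also dominates d,
              // ie d is the "nearest" strict dominator
              if ((strict & ~dom[d]) == 0) return d;
          }
          return -1;
      }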

  17. Dominator Tree [flowgraph with nodes 0-8 alongside its (strict) dominator tree] (Strict) Dominator Tree. Can read off the Immediate Dominator of each node. Eg: 3 dominates 4

  18. Identifying Loops in Flowgraph [flowgraph with nodes 0-11 alongside its dominator tree]

  19. Loops in Flowgraph [the same flowgraph and dominator tree] A back edge is an edge from n to h where h dom n. Natural Loop = h and n and all nodes that can reach n without going thru h (see the sketch below)
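
  A hedged sketch of collecting that natural loop, reusing the predecessor-list representation from the dominator sketch: starting at n, walk backward over predecessors, never moving past the header h.

      #include <stdbool.h>

      #define MAX_NODES 64   // as in the dominator sketch

      // Collect the natural loop of back edge n -> h into in_loop[],
      // which must be all-false on entry. The header h blocks the backward
      // walk, so only nodes that can reach n without passing thru h are kept.
      void natural_loop(int h, int n, const int preds[][MAX_NODES],
                        const int npreds[], bool in_loop[]) {
          int stack[MAX_NODES], top = 0;
          in_loop[h] = true;                   // the header is always in the loop
          if (!in_loop[n]) { in_loop[n] = true; stack[top++] = n; }
          while (top > 0) {
              int m = stack[--top];
              for (int p = 0; p < npreds[m]; p++) {
                  int q = preds[m][p];
                  if (!in_loop[q]) { in_loop[q] = true; stack[top++] = q; }
              }
          }
      }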

  20. Inner Loops Inner loops are more important for optimization • most execution time is spent there
      for (i = 1; i <= 100; ++i) {
          ...                            // executes 100 times
          for (j = 1; j <= 100; ++j) {
              ...                        // executes 10,000 times!
          }
          ...                            // executes 100 times
      }
  So, how do we identify inner loops? . . .

  21. Nested Loops in Flowgraph [flowgraph with nodes 0-11] • A nested, or inner, loop is one whose nodes are a strict subset of another loop • Eg: {6,7} ⊂ {4,6,7,8}

  22. Loop Surprise [fragment of the flowgraph: nodes 4, 5, 6, 7, 8, 10] • 4 dom 7 • 7→4 (back edge) • The natural loop includes 4 and 7 and all nodes that can reach 7 without passing thru 4 • Ie: {4,5,6,7,8,10} • So technically, loop 7→4 has inner loop 10→7 • We can specify a loop by giving its back-edge • We are not examining all kinds of loops. • Eg: nested loops that share a header • Eg: unnatural, or "irreducible" loops

  23. An Unnatural Loop! [flowgraph: node 1 branches to nodes 2 and 3, which branch to each other] • Natural loops have one entry • Unnatural loops have multiple entries • All loops constructed using FOR, WHILE, BREAK are guaranteed natural, by-construction • We feel that {2, 3} should be a loop . . . • But 2 does not dominate 3, since we can reach 3 without going thru 2 • And 3 does not dominate 2, since we can reach 2 without going thru 3 • 1 dominates 2 and 3, but there is no back edge 2→1 or 3→1 • Checkmate!

  24. Loop 'Hoisting' • Idea: If x = a op b, inside the loop, computes the same value each time, then: • hoist x = a op b out of the loop • inside the loop simply re-use x without re-calculating it • similar to CSE, but now a bigger payoff: 'trip-count' times! • But need a few safety checks before hoisting . . .
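
  In source-level terms the payoff looks like the hedged before/after sketch below; the names are invented for illustration, and a real compiler performs this on the IR rather than the source:

      // Before: a*b is loop-invariant but recomputed on every iteration
      void before(int *x, int n, int a, int b) {
          for (int i = 0; i < n; i++)
              x[i] = a * b + i;
      }

      // After hoisting: compute a*b once, in the loop pre-header
      void after(int *x, int n, int a, int b) {
          int t = a * b;               // hoisted: computed once, not n times
          for (int i = 0; i < n; i++)
              x[i] = t + i;            // re-use t inside the loop
      }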

  25. Loop Preheader • Often we need a place to park code right before the start of a loop. Sometimes called a "landing pad" • Easy if there is a single node preceding the loop header h • But this isn’t the case in general • We don't want to execute extra code every time thru the loop! • So insert a preheader node p • Include an edge p→h • Change all edges x→h to be x→p [before/after diagrams: header h with its back edge; then preheader p inserted in front of h]

  26. Check for Loop-Invariance • We can hoist d: x = a op b to loop pre-header so long as: • a and b are constant, or • All the defs for a and b that reach d are outside the loop, or • One def of a reaches d, but that def is loop-invariant • Use this to build an iterative algorithm (sketched below) • Base cases: constants and operands defined outside the loop • Then: repeatedly find definitions with loop-invariant operands
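
  A sketch of that iterative marking in C, assuming reaching-definitions information is already available and simplifying so each operand records either a constant, a single reaching definition, or "unknown" (multiple defs); all types and names here are illustrative:

      #include <stdbool.h>
      #include <stddef.h>

      typedef struct Inst {
          bool in_loop;          // does this instruction sit inside the loop?
          bool invariant;        // marked loop-invariant so far
          bool op_const[2];      // operand k is a constant
          struct Inst *def[2];   // sole reaching def of operand k, else NULL
      } Inst;

      // An operand is invariant if it is a constant, defined outside the
      // loop, or defined by one instruction already marked invariant.
      static bool operand_invariant(const Inst *in, int k) {
          if (in->op_const[k]) return true;
          const Inst *d = in->def[k];
          if (d == NULL) return false;       // multiple reaching defs: give up
          return !d->in_loop || d->invariant;
      }

      void mark_loop_invariant(Inst *body[], int n) {
          bool changed = true;
          while (changed) {                  // repeat until no new marks appear
              changed = false;
              for (int i = 0; i < n; i++) {
                  Inst *in = body[i];
                  if (in->invariant) continue;
                  if (operand_invariant(in, 0) && operand_invariant(in, 1)) {
                      in->invariant = true;  // base cases fall out on pass 1
                      changed = true;
                  }
              }
          }
      }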

  27. Hoisting is allowed if . . . • Assume we already checked d: x = a + b is loop-invariant • Hoist instruction to loop pre-header if • if x is live-out from the loop, d must dominate all loop exits, and • there is only one def of x in the loop, and • x is not live-out of the loop pre-header (ie, x is not possibly used before reaching d) • Need to modify this if a + b could have side-effects or could raise an exception

  28. Hoisting: Possible?
  Example 1:
      L0: t = 0
      L1: if i ≥ n goto L2
          i = i + 1
          t = a + b
          M[i] = t
          goto L1
      L2: x = t
  Example 2:
      L0: t = 0
      L1: i = i + 1
          t = a + b
          M[i] = t
          if i < n goto L1
      L2: x = t
  • a and b are loop-invariant • L0: is the pre-header • can I hoist t = a + b?

  29. Hoisting: Possible?
  Example 3:
      L0: t = 0
      L1: i = i + 1
          t = a + b
          M[i] = t
          t = 0
          M[j] = t
          if i < n goto L1
      L2: x = t
  Example 4:
      L0: t = 0
      L1: M[j] = t
          i = i + 1
          t = a + b
          M[i] = t
          if i < n goto L1
      L2: x = t
  • a and b are loop-invariant • L0: is the pre-header • can I hoist t = a + b?

  30. Operator Strength Reduction (OSR) • Replace an expensive operator (eg: multiplication) in a loop with a cheaper operator (eg: addition) • Usually requires replacing the original "induction variables"

  31. OSR Example
  Original:
      s = 0
      i = 0
  L1: if i ≥ n goto L2
      j = i*4
      k = j+a
      x = M[k]
      s = s+x
      i = i+1
      goto L1
  L2:
  To Optimize: • Induction-variable analysis: discover i and j as related IVs • Strength reduction to replace *4 with an addition • Replace the i ≥ n test • Assorted copy propagation

  32. Induction Variable • An induction variable, i, is one whose value varies systematically inside a loop. Eg: • for i = 0 1 2 3 ... • i * c ; 0 5 10 15 ... (here c = 5) • i + c ; 7 8 9 10 ... (here c = 7) • where c is a region constant • Replace multiplications with (cheaper) additions • Eg: j = i * c is replaced with j += c, which generates the same sequence of values

  33. After OSR
  Original:
      s = 0
      i = 0
  L1: if i ≥ n goto L2
      j = i * 4
      k = j + a
      x = M[k]
      s = s + x
      i = i + 1
      goto L1
  L2:
  Transformed:
      s = 0
      k' = a
      t = n * 4
      c = a + t
  L1: if k' ≥ c goto L2
      x = M[k']
      s = s + x
      k' = k' + 4
      goto L1
  L2:
  i = 0, 1, 2, 3 ...   j = 0, 4, 8, 12 ...   k = a, 4+a, 8+a, 12+a ...
  • Loop counter i is gone! • k replaced with k' • t is a temp - used to calculate c
  The algorithm is messy. See Cooper & Torczon, p584 (not required for the exam)

  34. Optimizing Induction Variables (IVs) • Strength reduction: if a "derived" IV is defined as j = i * c, replace with an addition inside the loop • Elimination: after strength reduction some IVs are not used, so delete them • Rewrite comparisons: If a variable is used only in comparisons against CIVs and in its own definition, modify the comparison to use a related IV

  35. Loop Unrolling • Candidates: • Loop with small body • Then loop overhead (increment+test+jump) is a significant fraction of (or even bigger than!) the useful computation • Idea: reduce overhead by unrolling • put two or more copies of the body inside the loop • Bonus: more opportunities to optimize

  36. Original:
          sum = 0;
          for (i = 0; i < 100; ++i)
              sum += a[i];
      Unrolled 2x:
          sum = 0;
          for (i = 0; i < 100; i += 2) {
              sum += a[i];
              sum += a[i+1];
          }
  • Example depends upon loop trip-count being a multiple of 2 • In general, add a 'tidy-up' step (see the sketch below)
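
  A hedged source-level sketch of the general pattern, unrolling 4x with a tidy-up loop for trip counts that are not a multiple of the unroll factor (function and names invented for illustration):

      // Sum an array, unrolled 4x, with a scalar tidy-up loop at the end.
      int sum_unrolled(const int *a, int n) {
          int sum = 0;
          int i = 0;
          for (; i + 3 < n; i += 4) {   // main loop: 4 copies of the body
              sum += a[i];
              sum += a[i + 1];
              sum += a[i + 2];
              sum += a[i + 3];
          }
          for (; i < n; i++)            // tidy-up: at most 3 leftover iterations
              sum += a[i];
          return sum;
      }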

  37. Unrolling Algorithm Results
  Original:
  L1: x = M[i]
      sum = sum + x
      i = i + 4
      if i < n goto L1
  L2:
  Unrolled 2x:
  L1: x = M[i]
      sum = sum + x
      i = i + 4
      x = M[i]
      sum = sum + x
      i = i + 4
      if i < n goto L1
  L2:
  • Example depends upon loop trip-count being a multiple of 2 • In general, add a 'tidy-up' step • This code can be further optimized (eg: OSR on i)

  38. Further Optimized
  Before:
  L1: x = M[i]
      sum = sum + x
      i = i + 4
      x = M[i]
      sum = sum + x
      i = i + 4
      if i < n goto L1
  L2:
  After:
  L1: x = M[i]
      sum = sum + x
      x = M[i+4]
      sum = sum + x
      i = i + 8
      if i < n goto L1
  L2:
  • Actual optimizations depend partly upon the power of our IR • (and subsequently on the richness of the target ISA)

  39. Postscript on Loop Unrolling • This example only unrolls the loop by a factor of 2 • In general, unroll by a factor of K • More aggressive unroll for reductions (like sum, product) • Need a 'tidy-up' mini-loop where trip-count is not a multiple of K • May unroll short loops entirely • Why not unroll more? • Code bloat - increases memory footprint for a diminishing perf win • Increases register pressure

  40. Memory Caches • One of the great triumphs of computer hardware design • Effect is a large, fast memory • Reality is a series of progressively larger, slower, cheaper stores, with frequently accessed data automatically staged to faster storage (cache, main memory, disk) • Programmer/compiler typically treats it as one large store. • Hardware maintains cache coherency - well, mostly!

  41. Intel Haswell Caches [diagram: per-core L1 and L2 caches, shared L3, main memory] • L1 = 64 KB per core • L2 = 256 KB per core • L3 = 2-8 MB, shared • Main Memory

  42. Just How Slow is Operand Access? • Instruction issue: ~5 per cycle • Register: 1 cycle • L1 cache: ~4 cycles • L2 cache: ~10 cycles • L3 cache (unshared line): ~40 cycles • DRAM: ~100 ns

  43. Memory Issues • Byte load/store is often slower than whole (physical) word load/store • Unaligned access is often extremely slow • Temporal locality: accesses to recently accessed data will usually find it in the cache • Spatial locality: accesses to data near recently used data will usually be fast • “near” = in the same cache block • But – alternating accesses to blocks that map to the same cache line will cause thrashing (cache-line "aliasing") • Increases in CPU speed have outpaced increases in memory access times • Memory accesses now often determine overall program speed • "Instruction Count" is no longer the only performance metric to optimize for

  44. Data Alignment • Data objects (structs) often are similar in size to a cache line/block (≈ 64 bytes) • Better if objects don’t span blocks • Some strategies • Allocate objects sequentially; bump to next block boundary if useful • Allocate objects of same common size in separate pools (all size-2, size-4, etc) • Tradeoff: speed for some wasted space
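
  One way to keep an object from spanning cache lines is to round its size up and allocate it on a line boundary. A sketch using C11 aligned_alloc, which requires the size to be a multiple of the alignment; the 64-byte line size is an assumption:

      #include <stdlib.h>
      #include <string.h>

      #define CACHE_LINE 64   // assumed cache-line size

      // Allocate an object padded and aligned to a cache line, so a single
      // line fill brings in the whole object and it never straddles two lines.
      void *alloc_cache_aligned(size_t size) {
          size_t padded = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
          void *p = aligned_alloc(CACHE_LINE, padded);   // C11
          if (p) memset(p, 0, padded);
          return p;
      }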

  45. Instruction Alignment • Align frequently executed basic blocks on cache boundaries (or avoid spanning cache blocks) • Branch targets (particularly loops) may be faster if they start on a cache line boundary • In optimized code, will often see multi-byte NOPs as padding, to align loop header • Alignment varies with chip. Eg: current-gen Intel 16 bytes; current-gen AMD prefers 32 bytes • Try to move infrequent code (startup, exceptions) away from hot code • Optimizing compiler may perform basic-block ordering ("layout")

  46. Loop Interchange
  Textbook Matrix Multiply: A = BC, ie aij += bik * ckj
      for (i = 0; i < n; i++)
          for (j = 0; j < n; j++)
              for (k = 0; k < n; k++)
                  a[i,j] += b[i,k] * c[k,j];
  As k increments (innermost/fastest), we move along a row of B, but down a column of C. The latter results in poor use of cache, and low performance.

  47. Loop Interchange [diagram: matrix C overlaid with 64-byte-wide cache lines]
  • Eg: int C[1000,1000]
  • Row 0 begins at address 0 (0 % 64 = 0: offset 0 in a cache line)
  • Row 1 begins at address 4000 (4000 % 64 = 32)
  • Row 2 begins at address 8000 (8000 % 64 = 0)
  • As k increments (innermost/fastest), we move down a column of C.
  • Each read fills a cache line (typically 64 bytes)
  • Algorithm uses only 4 bytes of that read
  • Then fills another cache line - so every access to C incurs a cache-miss!
  • No reuse of cache line; poor locality; low performance

  48. Loop Interchange
  Textbook Matrix Multiply, interchanged: A = BC, ie aij += bik * ckj
      for (i = 0; i < n; i++)
          for (k = 0; k < n; k++)
              for (j = 0; j < n; j++)
                  a[i,j] += b[i,k] * c[k,j];
  As j increments (innermost/fastest), we move along a row of A and along a row of C. This pattern produces good locality; reuse of cache line; and high performance. (Note that we now have to visit calculation of a[i,j] multiple times)

  49. Loop Interchange [the slide-47 diagram again, for comparison: matrix C overlaid with 64-byte-wide cache lines] • As k increments (innermost/fastest), we move down a column of C. • Each read fills a cache line (typically 64 bytes) • Algorithm uses only 4 bytes of that read • Then fills another cache line • No reuse of cache line; poor locality; low performance

  50. Blocking/Tiling Matrix Multiply Textbook Matrix Multiply: A = BC, ie aij += bik * ckj [diagram: a 6x6 matrix, elements a11 ... a66, partitioned into four 3x3 blocks A11, A12, A21, A22] Blocked version: Aij += Bik * Ckj, where A, B, C are now treated as matrices of blocks (tiles)
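
  A hedged sketch of the blocked loop nest: walk A, B, C tile by tile so the three tiles in use stay cache-resident while they are reused. TILE is an assumed size (tuned so three TILE x TILE blocks fit in cache), and matrices are stored row-major in flat arrays:

      #define TILE 32

      // Blocked matrix multiply: a += b * c, all n x n, n a multiple of TILE.
      void matmul_tiled(double *a, const double *b, const double *c, int n) {
          for (int ii = 0; ii < n; ii += TILE)
              for (int kk = 0; kk < n; kk += TILE)
                  for (int jj = 0; jj < n; jj += TILE)
                      // multiply one TILE x TILE block of b into a block of a
                      for (int i = ii; i < ii + TILE; i++)
                          for (int k = kk; k < kk + TILE; k++) {
                              double bik = b[i*n + k];      // b[i,k], reused across j
                              for (int j = jj; j < jj + TILE; j++)
                                  a[i*n + j] += bik * c[k*n + j];
                          }
      }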
