
CMPUT680 - Winter 2006




  1. CMPUT680 - Winter 2006 Topic C: Loop Fusion Kit Barton www.cs.ualberta.ca/~cbarton

  2. Outline • Definition of loop fusion • Basic concepts • Prerequisites of loop fusion • A loop fusion algorithm • Example

  3. Loop Fusion • Combine 2 or more loops into a single loop • This cannot violate any dependencies between the loop bodies • Several conditions which must be met for fusion to occur • Often these conditions are not initially satisfied
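As a minimal illustration of the definition above, the following Python sketch runs two conforming loops separately and then as one fused loop. The function names and the loop bodies are hypothetical and not taken from the slides; the second loop reads only the element the first loop wrote at the same index, so fusing does not violate any dependence.

```python
def separate_loops(a):
    """Two conforming loops executed one after the other."""
    n = len(a)
    doubled = [0] * n
    shifted = [0] * n
    for i in range(n):              # loop 1
        doubled[i] = a[i] * 2
    for i in range(n):              # loop 2 reads what loop 1 wrote at index i
        shifted[i] = doubled[i] - 2
    return doubled, shifted


def fused_loop(a):
    """The same work in a single loop: one increment and branch per iteration."""
    n = len(a)
    doubled = [0] * n
    shifted = [0] * n
    for i in range(n):
        doubled[i] = a[i] * 2
        shifted[i] = doubled[i] - 2  # value is reused right after it is produced
    return doubled, shifted
```

Both functions return the same pair of lists for any input list.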

  4. Advantages of Loop Fusion • Save increment and branch instructions • Creates opportunities for data reuse • Provide more instructions to instruction scheduler to balance the use of functional units

  5. Disadvantages of Loop Fusion • Increases code size, affecting instruction cache performance • Increases register pressure within a loop • Could cause the formation of loops with more complex control flow

  6. Background • There has been extensive work done on loop fusion • Most has focused on weighted loop fusion (Gao et al., Kennedy and McKinley, Megiddo and Sarkar) • Extensive work has also been done on performing loop fusion to increase parallelism

  7. Weighted Loop Fusion • Associates non-negative weights with each pair of loop nests • Weights are a measurement of the expected gain if the two loops are fused • Gains include potential for array contraction, data reuse and improved local register allocation

  8. Optimal Loop Fusion • Fuse loops to optimize data reuse, taking into consideration resource constraints and register usage • This problem is NP-Hard

  9. Maximal Loop Fusion • Our approach is to perform maximal loop fusion • Fuse as many loops as possible, without considering resource constraints • Fuse loops as soon as possible, not considering the consequences

  10. Dominators and Post Dominators • A node x in a directed graph G with a single entry node dominates node y in G if every path from the entry node of G to y must pass through x • A node x in a directed graph G with a single exit node post-dominates node y in G if every path from y to the exit node of G must pass through x (Allen & Kennedy, p. 150, 353)
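The two definitions can be computed with the standard iterative data-flow formulation. The sketch below is one simple way to do it, not necessarily how the course compiler does; the graph encoding (a dict of successor lists) and the diamond-shaped CFG are assumptions made for illustration.

```python
def dominators(succ, entry):
    """Return {node: set of nodes that dominate it}.

    succ maps every node to a list of its successors. Every node is
    assumed to be reachable from entry.
    """
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)
    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = set(nodes)
            for p in pred[n]:
                new &= dom[p]              # must dominate every predecessor
            new |= {n}                     # a node dominates itself
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def post_dominators(succ, exit_node):
    """Post-dominators are the dominators of the reversed graph."""
    reverse = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            reverse[t].append(n)
    return dominators(reverse, exit_node)


# Hypothetical diamond-shaped CFG: A -> B, A -> C, B -> D, C -> D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
dom = dominators(cfg, "A")
pdom = post_dominators(cfg, "D")
```

In this graph A dominates D and D post-dominates A, while neither B nor C dominates D because each can be bypassed through the other branch.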

  11. Requirements for Loop Fusion • Loops must have identical iteration counts (be conforming) • Loops must be control-flow equivalent • Loops must be adjacent • There cannot be any negative distance dependencies between the loops

  12. Non-conforming Loops • If iteration counts are different, one loop must be manipulated to make the iteration counts the same • Loop peeling • Introduce a guard into one of the loops

  13. Loop Peeling • Find the difference between the iteration count of the two loops (n) • Duplicate the body of the loop with the higher iteration count n times • Update the iteration count of the peeled loop

  14. Loop Peeling Example
      Before:
        while (i < 10) { a[i] = a[i - 1] * 2; i++; }
        while (j < 12) { b[j] = b[j - 1] - 2; j++; }
      After peeling two iterations from the second loop:
        while (i < 10) { a[i] = a[i - 1] * 2; i++; }
        while (j < 10) { b[j] = b[j - 1] - 2; j++; }
        b[j] = b[j - 1] - 2; j++;
        b[j] = b[j - 1] - 2; j++;
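The peeling example can be checked with a small Python sketch. The slide does not show the initial values of i and j, so the code assumes both start at 1, and the function names are hypothetical. The third function shows the fusion that peeling makes possible.

```python
def original(a, b):
    a, b = list(a), list(b)
    for i in range(1, 10):       # i = 1..9
        a[i] = a[i - 1] * 2
    for j in range(1, 12):       # j = 1..11, two more iterations
        b[j] = b[j - 1] - 2
    return a, b


def peeled(a, b):
    a, b = list(a), list(b)
    for i in range(1, 10):
        a[i] = a[i - 1] * 2
    for j in range(1, 10):       # now conforms with the first loop
        b[j] = b[j - 1] - 2
    b[10] = b[9] - 2             # peeled iteration j = 10
    b[11] = b[10] - 2            # peeled iteration j = 11
    return a, b


def peeled_and_fused(a, b):
    a, b = list(a), list(b)
    for k in range(1, 10):       # the two conforming loops, fused
        a[k] = a[k - 1] * 2
        b[k] = b[k - 1] - 2
    b[10] = b[9] - 2
    b[11] = b[10] - 2
    return a, b
```

All three functions produce the same arrays when given lists of at least 12 elements.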

  15. Guarding Iterations • Increase the iteration count of the loop with fewer iterations • Insert a guard branch around statements that would not normally be executed

  16. Guarding Iterations Example
      Before:
        while (i < 10) { a[i] = a[i - 1] * 2; i++; }
        while (j < 12) { b[j] = b[j - 1] - 2; j++; }
      After guarding the first loop (the increment stays outside the guard so the loop still terminates):
        while (i < 12) { if (i < 10) { a[i] = a[i - 1] * 2; } i++; }
        while (j < 12) { b[j] = b[j - 1] - 2; j++; }
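A Python sketch of the same idea, again assuming that i and j start at 1 (the slide does not say) and using hypothetical function names. The guarded version runs both bodies in one loop of 11 iterations and skips the first body on the two extra iterations.

```python
def original(a, b):
    a, b = list(a), list(b)
    for i in range(1, 10):       # i = 1..9
        a[i] = a[i - 1] * 2
    for j in range(1, 12):       # j = 1..11
        b[j] = b[j - 1] - 2
    return a, b


def guarded_and_fused(a, b):
    a, b = list(a), list(b)
    for k in range(1, 12):       # both bodies now share k = 1..11
        if k < 10:               # guard: the first loop never ran k = 10, 11
            a[k] = a[k - 1] * 2
        b[k] = b[k - 1] - 2
    return a, b
```

The guard costs a branch on every iteration, which is the control flow inside the loop body that peeling avoids.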

  17. Loop Peeling • Advantage: • Does not generate control flow within a loop body • Disadvantage: • Generates additional code outside of loops, which could possibly intervene between other loops

  18. Guarding Iterations • Advantages: • Does not introduce intervening code • Can be “undone” later • Disadvantage: • Generates control flow within a loop

  19. Control Flow Equivalence • Two loops are control-flow equivalent if, whenever one executes, the other is also guaranteed to execute [Figure: CFG diagrams built from basic blocks (BB) and loops labelled Loop 1, Loop 2 and Loop 3]

  20. Determining Control Flow Equivalence • Use the concepts of dominators and post dominators. Two loops L1 and L2 are control-flow equivalent if the following two conditions are true: • L1 dominates L2; and • L2 post dominates L1.
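The two conditions translate directly into a check over dominator sets. The sketch below treats each loop as a single CFG node, which is a simplifying assumption, and the CFG itself is hypothetical: a basic block BB may branch around L2, so L1 and L3 are control-flow equivalent but L1 and L2 are not.

```python
def dominators(succ, entry):
    """Iterative dominator sets; succ maps node -> list of successors."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = set(nodes)
            for p in pred[n]:
                new &= dom[p]
            new |= {n}
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def reverse_graph(succ):
    rev = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            rev[t].append(n)
    return rev


def control_flow_equivalent(succ, entry, exit_node, l1, l2):
    """True if l1 dominates l2 and l2 post-dominates l1."""
    dom = dominators(succ, entry)
    pdom = dominators(reverse_graph(succ), exit_node)
    return l1 in dom[l2] and l2 in pdom[l1]


# Hypothetical CFG in which BB conditionally skips L2.
cfg = {
    "entry": ["L1"],
    "L1": ["BB"],
    "BB": ["L2", "L3"],
    "L2": ["L3"],
    "L3": ["exit"],
    "exit": [],
}
```

Here `control_flow_equivalent(cfg, "entry", "exit", "L1", "L3")` holds, while the same call with `"L2"` in place of `"L3"` does not, because the path L1, BB, L3 bypasses L2.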

  21. Intervening Code • Two loops are adjacent if there are no statements between the two loops • Can be determined using the CFG: • If the immediate successor of the first loop is the second loop, the two loops are adjacent • If two loops are not adjacent, there is intervening code between them.

  22. Dealing with Non-Adjacent Loops • If two loops are not adjacent, we attempt to make them adjacent by moving the intervening code • Intervening code can be moved: • Above the first loop • Below the second loop • Both • as long as no data dependencies are violated

  23. Intervening Code Example • Assume CFG has 20 nodes • 0-5 are above Loop 1 • 17-19 are below Loop 2 • What algorithm should be used to determine which nodes are between Loop 1 and Loop 2? [Figure: CFG fragment running from node 6 (Loop 1) through nodes 7-15 to node 16 (Loop 2)]

  24. Gathering Intervening Code • Given two loops L1 and L2, a basic block B is intervening code between L1 and L2 if and only if: • B is strictly dominated by L1 • B is not dominated by L2 • Once the dominance relations are known, the set subtraction can be efficiently computed using bit vectors
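The set subtraction can be sketched with Python integers used as bit vectors, where bit k stands for CFG node k. The dominance sets below are assumed for a 20-node CFG in which Loop 1 is node 6 and Loop 2 is node 16; they are illustrative inputs, not the output of a real dominance analysis, and the helper names are hypothetical.

```python
def to_bits(nodes):
    """Encode a collection of node numbers as an int (bit k set = node k)."""
    bits = 0
    for k in nodes:
        bits |= 1 << k
    return bits


def from_bits(bits):
    """Decode a bit vector back into a sorted list of node numbers."""
    return [k for k in range(bits.bit_length()) if (bits >> k) & 1]


# Assumed inputs: nodes strictly dominated by Loop 1 and nodes dominated by Loop 2.
strictly_dominated_by_l1 = to_bits(range(7, 20))
dominated_by_l2 = to_bits(range(16, 20))

# Set subtraction is a single AND with the complement.
intervening = strictly_dominated_by_l1 & ~dominated_by_l2
```

With these inputs `from_bits(intervening)` yields nodes 7 through 15.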

  25. Intervening Code Example
      Loop 1:     0000 0011 1111 1111 1111 1
      Loop 2:     0000 0000 0000 0000 1111 1
      Difference: 0000 0011 1111 1111 0000 0
      [Figure: CFG fragment running from node 6 (Loop 1) through nodes 7-15 to node 16 (Loop 2)]

  26. Analyze Intervening Code • Build a DDG of the intervening code • Put all nodes with no predecessors into a queue • For each node in the queue: • If there are no dependencies between the node and the loop • Mark node as moveable • Add all of the node's immediate successors to the queue • All marked nodes can be moved around the loop
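The queue-based analysis can be sketched as follows. This is a literal transcription of the steps above under simplifying assumptions: each statement is described only by the scalar names it reads and writes, a DDG edge is added whenever a later statement conflicts with an earlier one, and the helper names are hypothetical. The sample statements model the intervening code of the non-adjacent loops example that follows in the slides.

```python
from collections import deque


def conflicts(x, y):
    """True if x and y touch a common variable and at least one writes it."""
    return bool(x["writes"] & (y["reads"] | y["writes"]) or x["reads"] & y["writes"])


def moveable_nodes(stmts, loop):
    """stmts: ordered dict name -> read/write sets; loop: read/write sets."""
    names = list(stmts)
    succs = {n: [] for n in names}
    has_pred = set()
    for i, t in enumerate(names):            # build the DDG
        for s in names[i + 1:]:
            if conflicts(stmts[t], stmts[s]):
                succs[t].append(s)
                has_pred.add(s)
    queue = deque(n for n in names if n not in has_pred)
    seen = set(queue)
    moveable = []
    while queue:
        node = queue.popleft()
        if not conflicts(stmts[node], loop):  # no dependence with the loop
            moveable.append(node)
            for s in succs[node]:             # only now consider its successors
                if s not in seen:
                    seen.add(s)
                    queue.append(s)
    return moveable


stmts = {
    "b": {"reads": {"a"}, "writes": {"b"}},        # b := a * 2
    "c": {"reads": {"b"}, "writes": {"c"}},        # c := b + 6
    "g": {"reads": set(), "writes": {"g"}},        # g := 0
    "h": {"reads": {"g"}, "writes": {"h"}},        # h := g + 10
    "if": {"reads": {"c"}, "writes": {"d", "e"}},  # if (c < 100) ... else ...
}
loop1 = {"reads": {"a", "i", "N"}, "writes": {"a", "i"}}
loop2 = {"reads": {"g", "j", "N"}, "writes": {"f", "j"}}
```

Against `loop1` the moveable statements are g and h; against `loop2` they are b, c and the if statement.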

  27. Non-Adjacent loops example
        while (i < N) { a += i; i++; }
        b := a * 2;
        c := b + 6;
        g := 0;
        h := g + 10;
        if (c < 100) d := c/2; else e := c * 2;
        while (j < N) { f := g + 6; j++; }
      [Figure: DDG of the intervening code, with nodes b := a * 2; g := 0; c := b + 6; h := g + 10; and if (c < 100) d := c/2; else e := c * 2;]

  28. Non-Adjacent loops example
      Before:
        while (i < N) { a += i; i++; }
        b := a * 2;
        c := b + 6;
        g := 0;
        h := g + 10;
        if (c < 100) d := c/2; else e := c * 2;
        while (j < N) { f := g + 6; j++; }
      After moving the intervening code:
        g := 0;
        h := g + 10;
        while (i < N) { a += i; i++; }
        while (j < N) { f := g + 6; j++; }
        b := a * 2;
        c := b + 6;
        if (c < 100) d := c/2; else e := c * 2;

  29. Non-Adjacent loops example
      Loop 2:
        while (j < N) { f := g + 6; j++; }
      DDG nodes: b := a * 2; g := 0; c := b + 6; h := g + 10; if (c < 100) d := c/2; else e := c * 2;
      Node Queue: b := a * 2; g := 0; c := b + 6; if (c < 100) d := c/2; else e := c * 2;
      Moveable Nodes: b := a * 2; c := b + 6; if (c < 100) d := c/2; else e := c * 2;

  30. Non-Adjacent loops example
      Loop 1:
        while (i < N) { a += i; i++; }
      DDG nodes: b := a * 2; g := 0; c := b + 6; h := g + 10; if (c < 100) d := c/2; else e := c * 2;
      Node Queue: b := a * 2; g := 0; h := g + 10;
      Moveable Nodes: g := 0; h := g + 10;

  31. Dependencies Preventing Fusion • Can the following loops be fused?
        i = j = 1;
        while (i < 10) { a[i] = c[i] + 10; i++; }
        while (j < 10) { b[j] = a[j+1] * 2; j++; }

  32. Dependencies Preventing Fusion • If we look at the array access patterns of a[], we see the following a[i] = c[i] + 10; b[j] = a[j+1] * 2;

  33. Dependencies Preventing Fusion • By aligning the array access patterns, we get the following: a[i] = c[i] + 10; b[j] = a[j+1] * 2;

  34. Loop Alignment
      Before:
        i = j = 1;
        while (i < 10) { a[i] = c[i] + 10; i++; }
        while (j < 10) { b[j] = a[j+1] * 2; j++; }
      After aligning the first loop:
        j = 1; i = 2;
        a[1] = c[1] + 10;
        while (i < 10) { a[i] = c[i] + 10; i++; }
        while (j < 10) { b[j] = a[j+1] * 2; j++; }

  35. Loop Alignment • Loop alignment can be used to remove dependencies between loop bodies • Easy to do when all dependencies have the same distance • Gets tricky when there are multiple dependencies with different distances
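The effect of alignment can be checked with a small Python sketch based on the a[i] and a[j+1] accesses from the preceding slides. The function names are hypothetical, and the trailing statement for b[9] in the aligned version is this sketch's own addition so that both fused bodies cover exactly the original iterations; the slides do not show that step.

```python
def original(a, b, c):
    a, b = list(a), list(b)
    for i in range(1, 10):
        a[i] = c[i] + 10
    for j in range(1, 10):
        b[j] = a[j + 1] * 2      # reads the element written one iteration later
    return a, b


def naive_fusion(a, b, c):
    a, b = list(a), list(b)
    for k in range(1, 10):
        a[k] = c[k] + 10
        b[k] = a[k + 1] * 2      # wrong: a[k + 1] has not been updated yet
    return a, b


def aligned_fusion(a, b, c):
    a, b = list(a), list(b)
    a[1] = c[1] + 10             # peeled first iteration of the first loop
    for k in range(1, 9):
        a[k + 1] = c[k + 1] + 10 # first loop shifted by one iteration
        b[k] = a[k + 1] * 2      # dependence now stays within the iteration
    b[9] = a[10] * 2             # leftover iteration of the second loop (assumed)
    return a, b


a0 = list(range(11))
b0 = [0] * 11
c0 = [100 + k for k in range(11)]
```

On these inputs `aligned_fusion` matches `original`, while `naive_fusion` does not, because it reads stale values of a.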

  36. Putting it all together • We’ve seen ways to deal with each of the preconditions of loop fusion • If the conditions are not met, we apply transformations to try and modify the code • If the transformations are successful, loop fusion can occur • But in what order should these transformations be applied?

  37. Loop Fusion Algorithm
      For each Ni from outermost to innermost:
        Gather control equivalent loops in Ni into LoopSets
        For each set Si in LoopSets
          remove non-eligible loops from Si
          FusedLoops = true
          Direction = forward
          while FusedLoops == true
            if |Si| < 2 break
            Compute Dominance Relation
            FusedLoops = LoopFusionPass(Si, Direction)
            Reverse Direction

  38. Loop Fusion Algorithm
      LoopFusionPass(S, Direction)
        FusedLoops = false
        For each pair of loops Lj and Lk in S such that Lj dominates Lk in Direction
          if (DependenceDistance(Lj, Lk) < 0) continue
          if (InterveningCode(Lj, Lk) == true and IsInterveningCodeMoveable(Lj, Lk) == false) continue
          d = | IterationCount(Lj) - IterationCount(Lk) |
          if (Lj and Lk are non-conforming and (d cannot be determined at compile time or d > MAXPEEL)) continue
          if (Lj and Lk are non-conforming)
            Peel iterations
          MoveInterveningCode(Lj, Lk)
          if InterveningCode(Lj, Lk) == false
            FuseLoops(Lj, Lk)
            FusedLoops = true
        Return FusedLoops

  39. Example
        L1: do i1 = 1, n
              a(i1) = a(i1) * k1
            end do
        L2: do i2 = 1, n-1
              d(i2) = a(i2) - b(i2+1) * k2
            end do
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
      Loop Set: L1, L2, L3, L4

  40. Peeling Loop 1
      Before:
        L1: do i1 = 1, n
              a(i1) = a(i1) * k1
            end do
        L2: do i2 = 1, n-1
              d(i2) = a(i2) - b(i2+1) * k2
            end do
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
      After:
        S7: a(1) = a(1) * k1
        L1: do i1 = 1, n-1
              a(i1+1) = a(i1+1) * k1
            end do
        L2: do i2 = 1, n-1
              d(i2) = a(i2) - b(i2+1) * k2
            end do
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do

  41. Fuse L1 and L2
      Before:
        S7: a(1) = a(1) * k1
        L1: do i1 = 1, n-1
              a(i1+1) = a(i1+1) * k1
            end do
        L2: do i2 = 1, n-1
              d(i2) = a(i2) - b(i2+1) * k2
            end do
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
      After:
        S7: a(1) = a(1) * k1
        L5: do i5 = 1, n-1
              a(i5+1) = a(i5+1) * k1
              d(i5) = a(i5) - b(i5+1) * k2
            end do
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do

  42. Compare L5 and L3 • We now compare loops L5 and L3 • They are not adjacent, but the intervening code can move • The difference in iteration count is not known, so fusion fails
        S7: a(1) = a(1) * k1
        L5: do i5 = 1, n-1
              a(i5+1) = a(i5+1) * k1
              d(i5) = a(i5) - b(i5+1) * k2
            end do
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do

  43. Compare L5 and L4
        S7: a(1) = a(1) * k1
        L5: do i5 = 1, n-1
              a(i5+1) = a(i5+1) * k1
              d(i5) = a(i5) - b(i5+1) * k2
            end do
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
      Intervening Code:
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m

  44. Peel L5
      Before:
        S7: a(1) = a(1) * k1
        L5: do i5 = 1, n-1
              a(i5+1) = a(i5+1) * k1
              d(i5) = a(i5) - b(i5+1) * k2
            end do
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
      After:
        S7: a(1) = a(1) * k1
        S8: a(2) = a(2) * k1
        S9: d(1) = a(1) - b(2) * k2
        L5: do i5 = 1, n-2
              a(i5+2) = a(i5+2) * k1
              d(i5+1) = a(i5+1) - b(i5+2) * k2
            end do
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do

  45. Move Intervening Code
      Before:
        S7: a(1) = a(1) * k1
        S8: a(2) = a(2) * k1
        S9: d(1) = a(1) - b(2) * k2
        L5: do i5 = 1, n-2
              a(i5+2) = a(i5+2) * k1
              d(i5+1) = a(i5+1) - b(i5+2) * k2
            end do
        S1: ds = 0.0
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
      After:
        S7: a(1) = a(1) * k1
        S8: a(2) = a(2) * k1
        S9: d(1) = a(1) - b(2) * k2
        S1: ds = 0.0
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L5: do i5 = 1, n-2
              a(i5+2) = a(i5+2) * k1
              d(i5+1) = a(i5+1) - b(i5+2) * k2
            end do
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do

  46. Reverse Pass
        S7: a(1) = a(1) * k1
        S8: a(2) = a(2) * k1
        S9: d(1) = a(1) - b(2) * k2
        S1: ds = 0.0
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L5: do i5 = 1, n-2
              a(i5+2) = a(i5+2) * k1
              d(i5+1) = a(i5+1) - b(i5+2) * k2
            end do
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
      Loop Set: L5, L3, L4
      Sorted in Reverse Dominance Direction: L4, L3, L5

  47. Compare L4 and L3 • No dependencies to prevent fusion • Iteration count cannot be determined at compile time • Fusion fails
        S7: a(1) = a(1) * k1
        S8: a(2) = a(2) * k1
        S9: d(1) = a(1) - b(2) * k2
        S1: ds = 0.0
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L5: do i5 = 1, n-2
              a(i5+2) = a(i5+2) * k1
              d(i5+1) = a(i5+1) - b(i5+2) * k2
            end do
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do

  48. Compare L4 and L5
        S7: a(1) = a(1) * k1
        S8: a(2) = a(2) * k1
        S9: d(1) = a(1) - b(2) * k2
        S1: ds = 0.0
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L5: do i5 = 1, n-2
              a(i5+2) = a(i5+2) * k1
              d(i5+1) = a(i5+1) - b(i5+2) * k2
            end do
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
      Intervening Code:
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do

  49. Move Intervening Code
      Before:
        S7: a(1) = a(1) * k1
        S8: a(2) = a(2) * k1
        S9: d(1) = a(1) - b(2) * k2
        S1: ds = 0.0
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L5: do i5 = 1, n-2
              a(i5+2) = a(i5+2) * k1
              d(i5+1) = a(i5+1) - b(i5+2) * k2
            end do
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
      After:
        S7: a(1) = a(1) * k1
        S8: a(2) = a(2) * k1
        S9: d(1) = a(1) - b(2) * k2
        S1: ds = 0.0
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L5: do i5 = 1, n-2
              a(i5+2) = a(i5+2) * k1
              d(i5+1) = a(i5+1) - b(i5+2) * k2
            end do
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do

  50. Fuse L4 and L5
      Before:
        S7: a(1) = a(1) * k1
        S8: a(2) = a(2) * k1
        S9: d(1) = a(1) - b(2) * k2
        S1: ds = 0.0
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L5: do i5 = 1, n-2
              a(i5+2) = a(i5+2) * k1
              d(i5+1) = a(i5+1) - b(i5+2) * k2
            end do
        L4: do i4 = 1, n-2
              b(i4) = a(i4) + b(i4) / c(i4)
            end do
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
      After:
        S7: a(1) = a(1) * k1
        S8: a(2) = a(2) * k1
        S9: d(1) = a(1) - b(2) * k2
        S1: ds = 0.0
        S2: if (n<m)
        S3:   c(n-2) = n
        S4: else
        S5:   c(n-2) = m
        L6: do i6 = 1, n-2
              a(i6+2) = a(i6+2) * k1
              d(i6+1) = a(i6+1) - b(i6+2) * k2
              b(i6) = a(i6) + b(i6) / c(i6)
            end do
        L3: do i3 = 1, m
              ds = ds + d(i3)
            end do
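One way to gain confidence in the whole sequence of transformations is to run the original loop set and the final fused code on the same data and compare the results. The Python sketch below does that under stated assumptions: arrays are lists with an unused index 0 to mimic the 1-based Fortran indexing, n is at least 3, the arrays are long enough for every index touched, and c holds no zeros. The function names and test data are hypothetical.

```python
def original(n, m, k1, k2, a, b, c, d):
    a, b, c, d = list(a), list(b), list(c), list(d)
    for i in range(1, n + 1):                # L1
        a[i] = a[i] * k1
    for i in range(1, n):                    # L2
        d[i] = a[i] - b[i + 1] * k2
    ds = 0.0                                 # S1
    for i in range(1, m + 1):                # L3
        ds = ds + d[i]
    c[n - 2] = n if n < m else m             # S2-S5
    for i in range(1, n - 1):                # L4
        b[i] = a[i] + b[i] / c[i]
    return a, b, c, d, ds


def transformed(n, m, k1, k2, a, b, c, d):
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[1] = a[1] * k1                         # S7
    a[2] = a[2] * k1                         # S8
    d[1] = a[1] - b[2] * k2                  # S9
    ds = 0.0                                 # S1
    c[n - 2] = n if n < m else m             # S2-S5
    for i in range(1, n - 1):                # L6 (fusion of L1, L2 and L4)
        a[i + 2] = a[i + 2] * k1
        d[i + 1] = a[i + 1] - b[i + 2] * k2
        b[i] = a[i] + b[i] / c[i]
    for i in range(1, m + 1):                # L3
        ds = ds + d[i]
    return a, b, c, d, ds


size = 10
a0 = [float(k) for k in range(size)]
b0 = [float(2 * k + 1) for k in range(size)]
c0 = [float(k + 1) for k in range(size)]
d0 = [0.5] * size
```

Each array element goes through the same floating-point operations in the same order in both versions, so the results can be compared for exact equality, for example with `original(6, 4, 3.0, 0.5, a0, b0, c0, d0) == transformed(6, 4, 3.0, 0.5, a0, b0, c0, d0)`.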
