Compiler Improvement of Register Usage


  1. Compiler Improvement of Register Usage Part 1 - Chapter 8, through Section 8.4 Anastasia Braginsky

  2. Roadmap • Introduction • Scalar Replacement • Unroll-and-Jam

  3. Introduction • Processor cycle time is decreasing, while memory access time remains almost the same. • Goal: better use of the register set (especially on RISC architectures). • Caches are not discussed here.

  4. Register Allocation Algorithms • Define a live range for each variable. • Build an interference graph to model which pairs of live ranges cannot be assigned to the same register. • Color the graph with a fast heuristic coloring algorithm. • If coloring fails, spill at least one live range out of registers and repeat from step 3.
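The four steps above can be sketched as a greedy coloring pass. This is a minimal illustration rather than a production allocator: the interference graph and register count are hypothetical inputs, and a real compiler would derive live ranges from liveness analysis and pick spill candidates by cost.

```python
def color_live_ranges(interference, num_registers):
    """Assign each live range a register (color) different from all of its
    already-colored neighbors in the interference graph; return None when
    coloring fails, signaling that some live range must be spilled."""
    assignment = {}
    for lr in sorted(interference):              # deterministic visit order
        taken = {assignment[n] for n in interference[lr] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if not free:
            return None                          # spill and retry (step 4)
        assignment[lr] = free[0]
    return assignment

# 'a' and 'c' never interfere, so they may share a register.
regs = color_live_ranges({'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}, 2)
```

With two registers the heuristic succeeds (`a` and `c` share one); with a single register it returns None, and the allocator would spill a live range, rebuild the graph, and recolor.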

  7. Register Allocation Algorithms – so what’s the problem? • Only non-array variables are assigned to registers. • There is almost no optimization for floating-point registers, which typically hold individual elements of array variables temporarily. • We want to eliminate the unneeded loads and stores. Array A: 5 7 2 9 4 0 6 1 8 3

  8. Introduction – example: DO I=1,N DO J=1,M A(I)=A(I)+B(J) ENDDO ENDDO Inner iteration J: • Load A(I) into register RA. • Load B(J) into register RB. • RA = RA + RB • Store RA back to A(I). Inner iteration J+1: • Load A(I) into register RA. • Load B(J+1) into register RB. • RA = RA + RB • Store RA back to A(I).

  9. Introduction – example (cont.) DO I = 1, N T = A(I) DO J = 1, M T = T + B(J) ENDDO A(I) = T ENDDO • All loads and stores to A in the inner loop have been saved. • T has a high chance of being allocated a register by the coloring algorithm. • This is a source-to-source transformation.
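The before/after pair can be checked mechanically. A sketch in Python (0-based indexing instead of Fortran's 1-based), showing that the scalar-replaced loop computes the same result while touching A only once per outer iteration:

```python
def sum_reduction(A, B):
    # Original loop: A(I) = A(I) + B(J), reloading/storing A(I) for every J.
    for i in range(len(A)):
        for j in range(len(B)):
            A[i] = A[i] + B[j]
    return A

def sum_reduction_replaced(A, B):
    # Scalar-replaced loop: T holds A(I) across the inner loop, so the
    # inner loop contains no loads or stores of A at all.
    for i in range(len(A)):
        t = A[i]
        for j in range(len(B)):
            t = t + B[j]
        A[i] = t
    return A
```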

  10. Data Dependence for Register Reuse • Two applications of dependence: • To determine the correctness of different transformations. • To determine the transformations that improve the performance of particular memory accesses.

  11. Types of dependences • A true (flow) dependence: assignment to a variable followed by a use of it. S1: V = … → Store the register representing V to V’s memory location S2: … = V → Load V into the register representing V • The load can be saved (the value is still in a register after the store); a cache miss can also be saved.

  12. Types of dependences • An antidependence: use of a variable followed by an assignment to it. S1: … = V → Load V into the register representing V S2: V = … → Store the register representing V to V’s memory location • Nothing can be done to improve register usage (if the value is used only once), but a cache miss can be saved.

  13. Types of dependences • An output dependence: the assignment to a variable repeats. S1: V = … → Store the register representing V to V’s memory location S2: V = … → Store the register representing V to V’s memory location • The first store is not needed.

  14. Types of dependences – example S1: A(I) = … (store) S2: … = A(I) (load) S3: A(I) = … (store) • True dependence between S1 and S2. • Output dependence between S1 and S3.

  15. Types of dependences • An input dependence: the use of a variable repeats. S1: … = V → Load V into the register representing V S2: … = V → Load V into the register representing V • The second load is not needed.

  16. Types of dependences • Loop-independent dependence. • Loop-carried dependence: • Consistent dependence – a loop-carried dependence with a constant dependence distance throughout the loop. • Inconsistent dependence – the opposite.

  17. Dependence graph modification S1: A(I)=… (store to A(I)) S2: …=A(I) (load from A(I)) S3: …=A(I) (load from A(I)) • Two true dependences: from S1 to S2 (load can be saved) and from S1 to S3 (load can be saved). • Input dependence from S2 to S3 (load can be saved). • It is not possible to save all three memory references, so the dependence graph should be pruned.

  18. Summary: • True: V = … then … = V – save the load. • Antidependence: … = V then V = … – save nothing. • Output: V = … then V = … – save the first store. • Input: … = V then … = V – save the second load. Sum reduction example (again): DO I = 1, N DO J = 1, M A(I) = A(I) + B(J) ENDDO ENDDO

  19. Sum reduction example (again) DO I = 1, N DO J = 1, M A(I) = A(I) + B(J) ENDDO ENDDO Generated inner-loop code: Load from A(I) to R1 Load from B(J) to R2 R1 = R1 + R2 Store R1 to A(I) Load from A(I) to R1 Load from B(J+1) to R2 R1 = R1 + R2 Store R1 to A(I) Load from A(I) to R1 …

  20. Scalar Register Allocation DO I = 1, N T = A(I) DO J = 1, M T = T + B(J) ENDDO A(I) = T ENDDO

  21. Roadmap • Introduction • Scalar Replacement • Unroll-and-Jam

  22. Simple replacement (loop-independent dependence) Before: DO I=1,N A(I) = B(I) + C X(I) = A(I) + Q ENDDO After: DO I=1,N t = B(I) + C X(I) = t + Q A(I) = t ENDDO
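A Python sketch of this loop-independent replacement (0-based indices; C and Q are plain scalars here), confirming both versions produce identical A and X:

```python
def simple(B, c, q):
    # Original: X(I) reloads the A(I) value that was just stored.
    A, X = [0] * len(B), [0] * len(B)
    for i in range(len(B)):
        A[i] = B[i] + c
        X[i] = A[i] + q
    return A, X

def simple_replaced(B, c, q):
    # Replaced: t carries the value; A(I) is stored once and never reloaded.
    A, X = [0] * len(B), [0] * len(B)
    for i in range(len(B)):
        t = B[i] + c
        X[i] = t + q
        A[i] = t
    return A, X
```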

  23. Spanning a single iteration (loop-carried dependences) DO I=1,N A(I) = B(I-1) B(I) = A(I) + C(I) ENDDO Unrolled: I=1: A(1) = B(0) B(1) = A(1)+C(1) I=2: A(2) = B(1) B(2) = A(2)+C(2) I=3: A(3) = B(2) B(3) = A(3)+C(3) … • From B(I) to B(I-1) there is a loop-carried true dependence.

  24. Spanning a single iteration T = B(0) DO I=1,N A(I) = T T = A(I) + C(I) B(I) = T ENDDO Unrolled, with T carrying the value across the iteration boundary: I=1: A(1) = B(0) [T] B(1) [T] = A(1)+C(1) I=2: A(2) = B(1) [T] B(2) [T] = A(2)+C(2) I=3: A(3) = B(2) [T] B(3) [T] = A(3)+C(3) …
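A Python sketch of the single-iteration-carried case (lists are padded so index 0 plays the role of Fortran's B(0)); the scalar T forwards B(I-1) from the previous iteration, so the loop body keeps its stores but loses all loads of B:

```python
def carried(B, C, n):
    # Original: A(I) = B(I-1); B(I) = A(I) + C(I), for I = 1..n.
    A = [0] * (n + 1)
    B = B[:]                      # B[0] holds the pre-loop value B(0)
    for i in range(1, n + 1):
        A[i] = B[i - 1]
        B[i] = A[i] + C[i]
    return A, B

def carried_replaced(B, C, n):
    # Replaced: T carries B(I-1) across the iteration boundary.
    A = [0] * (n + 1)
    B = B[:]
    t = B[0]                      # T = B(0)
    for i in range(1, n + 1):
        A[i] = t                  # A(I) = T
        t = t + C[i]              # T = A(I) + C(I); A(I) is already in T
        B[i] = t                  # B(I) = T
    return A, B
```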

  25. Dependences Spanning Multiple Iterations DO I=1, N A(I) = B(I-1) + B(I+1) ENDDO • The loop-carried input dependence from B(I+1) to B(I-1) has distance 2 iterations. • Use 3 different scalar temporaries, defined as follows: t1 = B(I-1) t2 = B(I) t3 = B(I+1)

  26. Dependences Spanning Multiple Iterations DO I=1, N A(I) = B(I-1) + B(I+1) ENDDO With t1 = B(I-1), t2 = B(I), t3 = B(I+1): DO I=1, N t1 = B(I-1) t2 = B(I) t3 = B(I+1) A(I) = t1 + t3 t1 = t2 t2 = t3 ENDDO

  27. Dependences Spanning Multiple Iterations With t1 = B(I-1), t2 = B(I), t3 = B(I+1), the loads of B(I-1) and B(I) move out of the loop: t1 = B(0) t2 = B(1) DO I=1, N t3 = B(I+1) A(I) = t1 + t3 t1 = t2 t2 = t3 ENDDO • The two register-to-register copies waste 2 pipeline cycles/slots in every iteration!

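A Python sketch of the rotating-temporaries version (index 0 stands in for Fortran's B(0), and B is long enough to supply B(N+1)); only one load of B remains per iteration, at the cost of the two register copies:

```python
def span_two(B, n):
    # Original: A(I) = B(I-1) + B(I+1), for I = 1..n.
    A = [0] * (n + 1)
    for i in range(1, n + 1):
        A[i] = B[i - 1] + B[i + 1]
    return A

def span_two_replaced(B, n):
    # t1, t2, t3 track B(I-1), B(I), B(I+1); t1 and t2 are preloaded.
    A = [0] * (n + 1)
    t1, t2 = B[0], B[1]
    for i in range(1, n + 1):
        t3 = B[i + 1]             # the only memory load in the body
        A[i] = t1 + t3
        t1, t2 = t2, t3           # the two register-to-register copies
    return A
```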

  29. Dependences Spanning Multiple Iterations – unroll by 3 to eliminate the copies: t1 = B(0) t2 = B(1) DO I=1, N, 3 t3 = B(I+1) A(I) = t1 + t3 t1 = B(I+2) A(I+1) = t2 + t1 t2 = B(I+3) A(I+2) = t3 + t2 ENDDO Unrolled: I = 1 A(1)=B(0)+B(2) I = 2 A(2)=B(1)+B(3) I = 3 A(3)=B(2)+B(4) I = 4 A(4)=B(3)+B(5) …

  30. Eliminating Scalar Copies Previous solution: t1 = B(0) t2 = B(1) DO I=1, N t3 = B(I+1) A(I) = t1 + t3 t1 = t2 t2 = t3 ENDDO New solution: t1 = B(0) t2 = B(1) mN3 = MOD(N,3) DO I = 1, mN3 t3 = B(I+1) A(I) = t1 + t3 t1 = t2 t2 = t3 ENDDO DO I = mN3 + 1, N, 3 t3 = B(I+1) A(I) = t1 + t3 t1 = B(I+2) A(I+1) = t2 + t1 t2 = B(I+3) A(I+2) = t3 + t2 ENDDO • Pre-loop: the same as the previous solution. • Main loop: three sums per iteration, with no register copies.
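A Python sketch of the unrolled-by-3 version with the MOD(N,3) pre-loop (same padding conventions as before; B must extend to index N+1). The main loop performs three sums per trip and rotates the temporaries implicitly through the code, with no register copies:

```python
def span_two_unrolled(B, n):
    # A pre-loop of n % 3 iterations makes the main loop's trip count
    # divisible by 3; the main loop has no t1 = t2 / t2 = t3 copies.
    A = [0] * (n + 1)
    t1, t2 = B[0], B[1]
    m = n % 3
    for i in range(1, m + 1):         # pre-loop: same as the previous solution
        t3 = B[i + 1]
        A[i] = t1 + t3
        t1, t2 = t2, t3
    for i in range(m + 1, n + 1, 3):  # main loop, unrolled by 3
        t3 = B[i + 1]
        A[i] = t1 + t3
        t1 = B[i + 2]
        A[i + 1] = t2 + t1
        t2 = B[i + 3]
        A[i + 2] = t3 + t2
    return A
```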

  31. Scalar Replacement • Loop-independent dependence – simple replacement • Handling loop-carried dependence – spanning one iteration • Dependences spanning multiple iterations • Eliminating scalar copies • Pruning the dependence graph • Moderating register pressure

  32. Break?

  33. Pruning the dependence graph Recap: • True: V=… then …=V – save the load. • Antidependence: …=V then V=… – save nothing. • Output: V=… then V=… – save the first store. • Input: …=V then …=V – save the last load. DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO Unrolled: I = 1 A(2)=A(0)+B(0) A(1)=A(1)+B(1)+B(2) I = 2 A(3)=A(1)+B(1) A(2)=A(2)+B(2)+B(3) I = 3 A(4)=A(2)+B(2) A(3)=A(3)+B(3)+B(4) …

  34. Pruning the dependence graph • Two kinds of edges to be pruned: • True and input dependence edges which are killed by an intervening assignment to the same location. • Input dependence edges that are redundant.

  35. Three-phase algorithm for pruning edges • Phase 1: Eliminate killed dependences • These can be true or input dependences. • Look for an intervening store to the involved location between the two endpoints of the dependence.

  36. Three-phase algorithm for pruning edges • Phase 1: Eliminate killed dependences • A true dependence is killed: look for an output dependence from the source to an assignment before the sink (with a true dependence from that assignment to the sink). • S1: A(I+1) = … • S2: A(I) = … • S3: … = A(I) Here S1→S2 is the output dependence and S2→S3 the true dependence, so the true dependence S1→S3 is killed.

  37. Three-phase algorithm for pruning edges • Phase 1: Eliminate killed dependences • An input dependence is killed: look for an antidependence from the source to an assignment before the sink (with a true dependence from that assignment to the sink). • S1: … = A(I+1) • S2: A(I) = … • S3: … = A(I) Here S1→S2 is the antidependence and S2→S3 the true dependence, so the input dependence S1→S3 is killed.

  38. Three-phase algorithm for pruning edges • Phase 2: Identify generators • A generator is a reference • with at least one true or input dependence starting from it • AND • with no input or true dependence into it. • In the subgraph containing only true and input dependences, the generators are exactly the sources.

  39. Three-phase algorithm for pruning edges • Phase 3: Find name partitions and eliminate input dependences • A name partition is a set of references that can be replaced by a reference to a single scalar variable.

  40. Three-phase algorithm for pruning edges • Phase 3: Find name partitions and eliminate input dependences • A name partition is a set of references that can be replaced by a reference to a single scalar variable. • Starting at each generator, mark each reference reachable from the generator by a flow or input dependence as part of that generator’s name partition (similar to the typed fusion problem). • Eliminate input dependences within the same name partition, unless the source is the generator.
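The marking step in Phase 3 is a plain reachability walk. A minimal sketch under assumed inputs: `generators` is the list from Phase 2, and `deps` maps each reference to the references it reaches by one true or input dependence edge (the edge set shown for the running example is illustrative, not the full pruned graph):

```python
def name_partitions(generators, deps):
    """Assign every reference reachable from a generator (via true/input
    dependence edges) to that generator's name partition."""
    partition = {}
    for g in generators:
        work = [g]
        while work:
            ref = work.pop()
            if ref in partition:          # already claimed by some partition
                continue
            partition[ref] = g
            work.extend(deps.get(ref, []))
    return partition

# Illustrative edges for A(I+1)=A(I-1)+B(I-1); A(I)=A(I)+B(I)+B(I+1)
deps = {'A(I+1)': ['A(I-1)'], 'B(I+1)': ['B(I)', 'B(I-1)'], 'B(I)': ['B(I-1)']}
parts = name_partitions(['A(I+1)', 'A(I)', 'B(I+1)'], deps)
```

Each partition can then be replaced by one family of scalars, and input dependences inside a partition are pruned.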

  41. Three-phase algorithm for pruning edges • Phase 3 and a half: Antidependences • Antidependences cannot directly give rise to register reuse, so they are always pruned as the last step of graph pruning.

  42. DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO Phase 1 Try to eliminate true dependences Try to eliminate input dependences Phase 2 Identify generators Candidates are sources of true and input dependences Pruning the dependence graph

  43. Pruning the dependence graph DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO Phase 3 Starting at each generator, mark each reference reachable from the generator by an input or true dependence as part of the name partition for that variable. Phase 3.5 Prune the antidependences.

  44. Pruning the dependence graph DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO How to start scalar replacement from here? Generators: A(I+1), A(I), B(I+1). Other references in their partitions: A(I-1), B(I-1), B(I). The number of scalars for each generator = how many iterations are spanned by the dependence.

  45. Pruning the dependence graph DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO References: A(I-1) → tA1, A(I) → tA2, A(I+1) → tA3, B(I-1) → tB1, B(I) → tB2, B(I+1) → tB3.

  46. Pruning the dependence graph With tA1 = A(I-1), tA2 = A(I), tA3 = A(I+1) and tB1 = B(I-1), tB2 = B(I), tB3 = B(I+1): tA1 = A(0) tA2 = A(1) tB1 = B(0) tB2 = B(1) DO I = 1, N tA3 = tA1 + tB1 tB3 = B(I+1) tA1 = tA2 + tB2 + tB3 // tA1 is reused to hold the final A(I) A(I) = tA1 tA2 = tA3 tB1 = tB2 tB2 = tB3 ENDDO A(N+1) = tA3 • Only one load (tB3 = B(I+1)) and one store (A(I) = tA1) remain in the loop body.

  47. Pruning the dependence graph For comparison, the original loop body generates: DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO Load A(I-1) to R1 Load B(I-1) to R2 R1 = R1 + R2 Store R1 to A(I+1) Load A(I) to R1 Load B(I) to R2 R1 = R1 + R2 Load B(I+1) to R2 R1 = R1 + R2 Store R1 to A(I)
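The whole slide-46 transformation can be checked in Python (index 0 plays the role of Fortran's A(0)/B(0), and A and B extend to index N+1). This sketch mirrors the scalar-replaced loop exactly: one load and one store per iteration, plus the final store of A(N+1) after the loop:

```python
def original(A, B, n):
    A = A[:]
    for i in range(1, n + 1):
        A[i + 1] = A[i - 1] + B[i - 1]
        A[i] = A[i] + B[i] + B[i + 1]
    return A

def scalar_replaced(A, B, n):
    A = A[:]
    tA1, tA2 = A[0], A[1]          # A(I-1), A(I) on entry to iteration I
    tB1, tB2 = B[0], B[1]          # B(I-1), B(I)
    for i in range(1, n + 1):
        tA3 = tA1 + tB1            # new A(I+1); its store is killed next iteration
        tB3 = B[i + 1]             # the only load in the body
        tA1 = tA2 + tB2 + tB3      # final value of A(I), reusing tA1
        A[i] = tA1                 # the only store in the body
        tA2, tB1, tB2 = tA3, tB2, tB3
    A[n + 1] = tA3                 # the last A(N+1) store is still live
    return A
```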

  48. Pruning the dependence graph • Special case: loop-invariant references in the loop. DO I=1, N A(J) = B(I) + C(I,J) C(I,J) = A(J) + D(I) ENDDO The invariant reference A(J) is replaced by the scalar tA, with a single store after the loop: DO I=1, N tA = B(I) + C(I,J) C(I,J) = tA + D(I) ENDDO A(J) = tA Assumption: N>1

  49. Pruning the dependence graph • Special cases: Forcing stores and loads DO I = 1, N A(I) = A(I-1) + B(I) A(J) = A(J) + A(I) ENDDO I=1 A(1) = A(0) + B(1) A(J) = A(J) + A(1) I=2 A(2) = A(1) + B(2) A(J) = A(J) + A(2) I=3 …

  50. Pruning the dependence graph • Special cases: Forcing stores and loads DO I = 1, N tAI = A(I-1) + B(I) A(I) = tAI A(J) = A(J) + tAI ENDDO I=1 A(1) = A(0) + B(1) A(J) = A(J) + A(1) I=2 A(2) = A(1) + B(2) A(J) = A(J) + A(2) I=3 …
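The aliasing hazard can be demonstrated directly. In this Python sketch, J is an assumed loop-invariant index that may coincide with some I, so the store of A(I) and the loads of A(J) and A(I-1) must stay in the loop; only the second use of A(I) can safely go through the scalar tAI:

```python
def forced(A, B, n, j):
    # Original: A(I) = A(I-1) + B(I); A(J) = A(J) + A(I).
    A = A[:]
    for i in range(1, n + 1):
        A[i] = A[i - 1] + B[i]
        A[j] = A[j] + A[i]
    return A

def forced_replaced(A, B, n, j):
    A = A[:]
    for i in range(1, n + 1):
        tAI = A[i - 1] + B[i]  # A(I-1) must still be loaded: j may equal i-1
        A[i] = tAI             # store forced: the A(J) below may alias A(I)
        A[j] = A[j] + tAI      # the reuse of A(I) through tAI is still safe
    return A
```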
