Register Pressure in Instruction Level Parallelism

Presentation Transcript


  1. Register Pressure in Instruction Level Parallelism TOUATI Sid-Ahmed-Ali

  2. Outline • Prologue • Part one : Basic Blocks • Part two : Simple Innermost Loops • Epilogue Thesis defense

  3. Memory Bottleneck • From [Lin et al 01], in HPCA 2001. • Simulated performance on an Alpha 21364 processor (1.6 GHz). • Recent Compaq compiler (peak-optimization compiler flags).

  4. Solutions by Software • To avoid / To tolerate • ILP/TLP • Software prefetching • Using registers

  5. Combined Complex Pass: Scheduling + Register Allocation [Diagram: Original Graph, Combined Complex Pass] • We do not advocate this method

  6. Why Decoupling? • Memory wall • Memory bottleneck >> ILP enhancement • Useless spilling: early RA still generates faster code [Freudenberg et al 92, Brasier et al 95, Janssen 01]. • Register constraints are more generic • Heterogeneous, complex resource constraints. • Registers: types and number of available registers. • Complexity of register constraints • While there is always a schedule for a DDG under any resource constraints, this is not the case with a limited number of registers (spilling is sometimes unavoidable).

  7. Our chart • Performance enhancement • Priority to registers over ILP scheduling, but the former must respect the latter (if possible). • Portability • No major rewriting of compilers (investment cost). • Generic processor: covers most existing ILP processors.

  8. Our Generic ILP processor • No restriction on the ILP degree. • Infinite parallelism (infinite resources). • Multiple register types. • A statement may produce multiple values, but with distinct types. • Visible delays in reading from and writing into registers. • A register is not occupied until the result is available (some time after the issue date in the pipeline).

  9. First Strategy : Register Pressure Management [Diagram boxes: Original DDG, Register Constraints, Register Pressure Management, Modified DDG, Register Allocation, Code Scheduling] • Minimize Critical Path Increase

  10. Register Saturation and Sufficiency [Figure: the number of available registers R positioned relative to RS and RF; labels: Add arcs, Spilling]

  11. Second Strategy : Schedule Independent Register Allocation [Diagram boxes: Original DDG, Register Constraints, Early Register Allocation, Allocated DDG, Code Scheduling] • Minimize Critical Path Increase

  12. Part I : Basic Blocks • Register Requirement (Use) • Register Saturation • Register Sufficiency • Local Schedule Independent Register Allocation • Related Work • Conclusion

  13. Local Register Requirement [Figure: example DAG with 12 numbered nodes (additions, a multiplication, loads and a store)]

  14. Without Assuming a Schedule… • Value lifetime intervals are not defined • Register Requirement is not defined. • Two notions in this case: • Register Saturation per register type (max RR) • Guarantees that registers do not constrain the ILP scheduling. • Register Sufficiency per register type (min RR) • Prevents unnecessary spilling.
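
Once a schedule is fixed, lifetime intervals are defined and the register requirement can be computed directly. The sketch below is only an illustration under stated assumptions: one register type, a value occupies a register over the half-open interval ]birth, kill], values that are never read are ignored, and the names `register_requirement`, `sigma`, `readers`, `delta_w` and `delta_r` are hypothetical, not taken from the slides.

```python
def register_requirement(sigma, readers, delta_w=None, delta_r=None):
    """Maximum number of values simultaneously alive under the schedule sigma.

    sigma   : dict node -> issue date
    readers : dict value -> list of nodes that read it
    delta_w : optional dict node -> delay before the result is written
    delta_r : optional dict node -> delay before the operand is read
    """
    delta_w = delta_w or {}
    delta_r = delta_r or {}
    intervals = []
    for u, consumers in readers.items():
        if not consumers:
            continue  # assumption: a value that is never read needs no register
        born = sigma[u] + delta_w.get(u, 0)
        killed = max(sigma[v] + delta_r.get(v, 0) for v in consumers)
        if killed > born:
            intervals.append((born, killed))
    if not intervals:
        return 0
    first = min(b for b, _ in intervals)
    last = max(k for _, k in intervals)
    # A value is alive at date t when born < t <= killed.
    return max(sum(1 for b, k in intervals if b < t <= k)
               for t in range(first + 1, last + 1))


# Two independent chains: c reads a, d reads b.
readers = {"a": ["c"], "b": ["d"], "c": [], "d": []}
parallel = {"a": 0, "b": 0, "c": 1, "d": 1}
sequential = {"a": 0, "c": 1, "b": 1, "d": 2}
print(register_requirement(parallel, readers))    # both chains overlap
print(register_requirement(sequential, readers))  # chains executed one after the other
```

The same DAG needs a different number of registers depending on the schedule, which is why the two schedule-independent bounds (max RR and min RR) are introduced.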

  15. Computing Register Saturation • Given a DAG, compute the exact maximal register requirement for all valid schedules. • NP-complete problem [Touati 00]. • Optimal method (integer linear programming). • Algorithmic heuristics.
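
To make the definition concrete, the following brute-force sketch enumerates every valid schedule of a tiny DAG within a bounded horizon and reports the largest and smallest register requirement found. It is not the integer program or the heuristics of the slides, and it is exponential in the number of nodes. Assumptions: every arc is a flow dependence (the target reads the value of the source), read and write delays are zero, values that are never read are ignored, and all names (`saturation_and_sufficiency`, `horizon`) are hypothetical.

```python
from itertools import product


def register_requirement(sigma, readers):
    """Max number of values alive at once; a value lives over ]sigma[u], last read]."""
    intervals = [(sigma[u], max(sigma[v] for v in vs))
                 for u, vs in readers.items() if vs]
    if not intervals:
        return 0
    first = min(b for b, _ in intervals)
    last = max(k for _, k in intervals)
    return max((sum(1 for b, k in intervals if b < t <= k)
                for t in range(first + 1, last + 1)), default=0)


def saturation_and_sufficiency(nodes, arcs, horizon):
    """Return (max RR, min RR) over all schedules with dates in [0, horizon).

    arcs is a list of (u, v, latency) flow dependences.
    """
    readers = {u: [] for u in nodes}
    for u, v, _ in arcs:
        readers[u].append(v)
    needs = []
    for dates in product(range(horizon), repeat=len(nodes)):
        sigma = dict(zip(nodes, dates))
        # Keep only schedules that respect every dependence latency.
        if all(sigma[v] - sigma[u] >= lat for u, v, lat in arcs):
            needs.append(register_requirement(sigma, readers))
    return max(needs), min(needs)


nodes = ["a", "b", "c", "d"]
arcs = [("a", "c", 1), ("b", "d", 1)]  # two independent chains
print(saturation_and_sufficiency(nodes, arcs, horizon=4))
```

On this example the two chains can either overlap or be serialized, so the maximum and the minimum differ. The horizon bound is a convenience of the sketch; it only needs to be large enough for the extreme schedules to fit.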

  16. Integer Programming Techniques • We use binary variables for expressing disjunction, implication, equivalence and the max operator. • Disjunction: the domain set of the variables must be bounded.

  17. Integer Programming Techniques • Implication • Equivalence • Max
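
The formulas of these two slides did not survive in the transcript. As a reminder of what such encodings typically look like, here are the standard "big-M" linearizations with one binary variable $b$. They are textbook forms and not necessarily the exact ones used in the thesis. Assumptions: $f$ and $g$ are integer-valued linear expressions with known bounds $m_f \le f \le M_f$ and $g \le M_g$, and $M \ge |x - y|$; these constants exist only because the variable domains are bounded.

```latex
\begin{align*}
\text{Disjunction } (f \le 0) \lor (g \le 0):\quad
  & f \le M_f\, b, \qquad g \le M_g\,(1-b), \qquad b \in \{0,1\} \\[4pt]
\text{Implication } (f \le 0) \Rightarrow (g \le 0):\quad
  & f \ge 1 - (1 - m_f)\, b, \qquad g \le M_g\,(1-b), \qquad b \in \{0,1\} \\[4pt]
\text{Equivalence } (f \le 0) \Leftrightarrow (g \le 0):\quad
  & \text{both implications, each with its own binary variable} \\[4pt]
\text{Max } z = \max(x, y):\quad
  & z \ge x, \qquad z \ge y, \qquad z \le x + M\, b, \qquad z \le y + M\,(1-b), \qquad b \in \{0,1\}
\end{align*}
```

In each case, fixing $b$ to 0 or 1 activates one alternative and relaxes the other to its trivial bound. The implication is written as the disjunction $(f \ge 1) \lor (g \le 0)$, which is why integrality of $f$ is assumed.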

  18. Optimal RS Computation • Scheduling constraints: • $\forall e=(u,v):\ \sigma_v - \sigma_u \ge \delta(e)$ • Killing dates: $kill(u^t) = \max_v\,(\sigma_v + \delta_r(v))$, over all $v$ that read $u^t$ • Interferences: • $s^t_{u,v} = 1 \iff \neg\,(kill(u) \le def(v) \lor kill(v) \le def(u))$ • Maximal clique = independent set in the complementary graph: • $s^t_{u,v} = 0 \implies x_{u^t} + x_{v^t} \le 1$ • Objective function = maximize (independent set) • Maximize $\sum x_{u^t}$ • At most $O(n^2)$ variables and $O(n^2+m)$ constraints.

  19. Problem Formulation with Graphs • RS computation amounts to choosing a unique killer for each value. • Computing a killing function that associates a unique killer to each value. • Two constraints: • The killing function must not introduce a circuit in the DAG. • The killing function must maximize the register requirement.

  20. Killing Function... [Figure: example DAG annotated with a killing function, and the resulting disjoint value DAG, which is an interval order]

  21. Register Saturation Problem • Find a valid killing function such that the maximal antichain in the disjoint value DAG is maximal among other killing functions. • NP-complete Problem. • Polynomial heuristics.
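
The evaluation of one candidate killing function can be sketched as follows. This is a simplified reading of the slides, not the thesis algorithm: it assumes one register type and zero delays, considers only values that have at least one reader, forces the other readers of a value to precede its chosen killer by adding arcs, and puts value u before value v in the disjoint value DAG when v is the killer of u or one of its descendants. The maximal antichain is obtained through Dilworth's theorem (number of values minus a maximum matching on the order relation). The names `antichain_for_killing_function`, `readers` and `killer` are hypothetical.

```python
def reachable(succ, start):
    """Strict descendants of start in the graph given by succ."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def antichain_for_killing_function(readers, killer):
    """Size of a maximal antichain of the disjoint value DAG induced by killer.

    readers : dict value -> list of nodes reading it (non-empty)
    killer  : dict value -> the reader chosen as its last consumer
    Raises ValueError when the killing function introduces a circuit.
    """
    succ = {u: set(vs) for u, vs in readers.items()}
    for u, k in killer.items():
        for v in readers[u]:
            if v != k:
                succ.setdefault(v, set()).add(k)  # v must read u before k does
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    if any(n in reachable(succ, n) for n in nodes):
        raise ValueError("killing function introduces a circuit")

    values = list(killer)
    # later[u]: values that can only be defined once u has been killed.
    later = {}
    for u in values:
        after_kill = reachable(succ, killer[u]) | {killer[u]}
        later[u] = [v for v in values if v in after_kill]

    match = {}  # value -> predecessor it is chained to

    def augment(u, visited):
        for v in later[u]:
            if v not in visited:
                visited.add(v)
                if v not in match or augment(match[v], visited):
                    match[v] = u
                    return True
        return False

    matched = sum(augment(u, set()) for u in values)
    return len(values) - matched


readers = {"a": ["p", "q"], "b": ["q"], "p": ["t"]}
print(antichain_for_killing_function(readers, {"a": "p", "b": "q", "p": "t"}))
print(antichain_for_killing_function(readers, {"a": "q", "b": "q", "p": "t"}))
```

In this small example the two killing functions give different antichain sizes, which illustrates why the choice of killers matters when looking for the saturating one.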

  22. Our Heuristics (Greedy-k) • Decompose the potential killing graph into connected bipartite components • $cb = (S, T, E_{cb})$ • Find a Saturating Killing Set: maximize the parallel values with S (minimize the number of arcs in the disjoint value DAG).

  23. Saturating Killing Set [Figure: bipartite component with node sets S, T, T’ and T-T’, and their descendant values]

  24. Greedy-k versus Optimal RS • Benchmarks: 27 loops from Spec-FP-95, whetstone, livermore. • DAGs = unrolled loops. • 134 experimented DAGs (#nodes up to 120). • Maximal empirical difference between the optimal RS and the RS* approximated by Greedy-k is 1 FP register (5% of DAGs).

  25. Representative RS Behaviour

  26. Reducing Register Saturation • Problem: does there exist an extended DDG G’ from G such that $RS(G') \le R$ and Critical Path $\le P$? • NP-hard problem [Touati 01] • Optimal solution with integer programming. • Algorithmic heuristics.

  27. Optimal RS Reduction • The problem is equivalent to computing a schedule $\sigma$ that does not require more than R registers (NP-complete), while the total schedule time is $\le P$. • Given such a schedule, we add arcs to G so as to guarantee the same interval order as defined by $\sigma$.

  28. Integer Program for RS Reduction • We bound the register requirement for each register type, and the total schedule time: • $\sum x_{u^t} \le R_t$ • total schedule time $\le P$ • The objective function maximizes the RR of a considered register type: • Maximize $\sum x_{u^t}$ • At most $O(n^2)$ variables and $O(n^2+m)$ constraints

  29. Algorithmic Heuristics for RS Reduction • Serialize the lifetime intervals of saturating values. • Do not extend the critical path if possible.

  30. Interval Serialization [Figure: example DAG extended with a serialization arc (label: r-w)]

  31. Experiments (RS Reduction) • Optimal versus approximated • Loops were unrolled up to 4 times, #nodes up to 80. • We parameterise R (#available registers) as 1, next power of 2, and 32. • Maximal empirical error is two registers.

  32. Experiments (RS reduction)

  33. Experiments (ILP loss)

  34. Part I : Basic Blocks • Register Requirement (Use) • Register Saturation • Register Sufficiency • Local Schedule Independent Register Allocation • Related Work • Conclusion

  35. Computing Register Sufficiency • Its complexity is still an open problem! • Proved NP-complete for sequential codes, but not for parallel ones. • Proved NP-complete for ILP codes if we restrict the total schedule time. • Integer programming • Same intLP system as RS, but we bound the register requirement: $\sum x_{u^t} \le R_t$ • Algorithmic heuristics • Lifetime interval serialization (as in RS reduction) • Do not care about critical path increase. • Set R=1 (reduce RS as low as possible)

  36. Experiments (RF Computation) • 27 loop bodies, maximal empirical error is 1 register (7 cases).

  37. Part I : Basic Blocks • Register Requirement (Use) • Register Saturation • Register Sufficiency • Local Schedule Independent Register Allocation • Related Work • Conclusion

  38. Example of Early RA • Register Allocation is a minimal chain decomposition [Figure: example DAG]

  39. Two Critical Loops

  40. Related Work (RS) • Our RS study is an extension of the URSA framework [Berson 96]. We provide an adequate formulation of this problem. • URSA Assumption • DAG = pure data-flow graph. No multiple register types, no delays, all nodes are assumed to be values. • The URSA problem formalisation is not correct. • The efficiency of URSA was not compared to the optimal solutions.

  41. Conclusion of Part I • RS and RF are analysed before ILP scheduling: the DAG becomes free from register constraints. • RS management maximizes the register requirement in order to minimize the number of introduced false dependences. • RF analysis makes it possible to check whether spill code is unnecessary. • Our heuristics are nearly optimal (empirical results).

  42. Part II : Simple Innermost Loops • Cyclic Register Requirement (Use) • Cyclic Register Saturation • Cyclic Register Sufficiency • Cyclic Schedule Independent Register Allocation • Related Work • Conclusion

  43. Software Pipelining [Figure: successive iterations (1st, 2nd, 3rd) of operations a to e overlapped along a time axis from 0 to 17; labels: Motif, iterations, time, h, L]

  44. Cyclic Register Requirement [Figure: lifetimes of values v1, v2, v3 over iterations i, i+1, i+2 on a time axis from 0 to 20, with h=4, and their circular representation]

  45. Computing CRN • In a circular interval graph, the size of a maximal clique in the interference graph is not necessarily equal to the width [Tucker 75]. • We decompose the circular graph into two parts: • Complete turns around the circle (# of distinct interfering instances) • In_fraction_of_h intervals.

  46. In_fraction_of_h Intervals [Figure: values v1, v2, v3 on a circle of period h=4] • In_fraction_of_h intervals are the remainders of circular intervals after removing the complete turns around the circle. • If we unroll the kernel of the in_fraction_of_h intervals twice, the maximal clique of the interference graph is equal to the width of the in_fraction_of_h intervals [Touati 2002].
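
The decomposition into complete turns and in_fraction_of_h intervals can be sketched as below. This is an illustration under stated assumptions, not the procedure of the thesis: one register type, integer dates, each value alive over ]start, end] within its iteration, and the result is the width, that is the largest number of value instances alive at any slot of a kernel of period h. The function name and the lifetimes in the example are hypothetical (they are not read from the figure).

```python
def cyclic_register_need(lifetimes, h):
    """Width of the circular lifetime intervals for a kernel of period h.

    lifetimes : dict value -> (start, end), the value being alive over ]start, end]
    """
    complete_turns = 0
    fractions = []  # (left end on the circle, remaining length < h)
    for start, end in lifetimes.values():
        length = end - start
        complete_turns += length // h  # one register per complete turn
        remainder = length % h
        if remainder:
            fractions.append((start % h, remainder))

    def covers(left, remainder, slot):
        # slot belongs to ]left, left + remainder] taken modulo h
        return 0 < (slot - left) % h <= remainder

    width = max(sum(1 for left, rem in fractions if covers(left, rem, slot))
                for slot in range(h))
    return complete_turns + width


lifetimes = {"v1": (0, 3), "v2": (1, 7), "v3": (2, 4)}
print(cyclic_register_need(lifetimes, h=4))
```

A value whose lifetime exceeds h interferes with later instances of itself, which is what the complete turns count; only the remainders need to be compared slot by slot.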

  47. Part II : Simple Innermost Loops • Cyclic Register Requirement (Use) • Cyclic Register Saturation (CRS) • Computing Cyclic Register Saturation • Reducing Cyclic Register Saturation • Cyclic Register Sufficiency • Cyclic Schedule Independent Register Allocation • Related Work • Conclusion

  48. Computing CRS • $CRS_t$ is the exact maximal cyclic register requirement of all valid SWP schedules. • Absolute $CRS_t$ is infinite. • If MII=0 (acyclic DDG), the loop is completely parallel: it cannot be implemented by a SWP kernel. • If L is not bounded, we may have an infinite # of values simultaneously live. • Optimal method by integer programming.

  49. Optimal CRS Computation (1) • The intLP system is written for a fixed h and a bounded L. • At most $O(n^2)$ variables and $O(n^2+m)$ constraints. • Scheduling constraints : • Killing dates :

  50. Optimal CRS Computation (2) • # of complete turns around the circle • Two acyclic intervals (]a,b], ]a’,b’]) for each in_fraction_of_h interval (]l,r]).
