Register pressure in instruction level parallelism
TOUATI Sid-Ahmed-Ali

Outline

  • Prologue

  • Part one : Basic Blocks

  • Part two : Simple Innermost Loops

  • Epilogue

Thesis defense


Memory Bottleneck

From [Lin et al 01], HPCA 2001: simulated performance on an Alpha 21364 processor (1.6 GHz), using a recent Compaq compiler with peak-optimization flags.

Solutions by Software

  • To avoid

    • Using registers

  • To tolerate

    • ILP/TLP

    • Software prefetching

Combined Complex Pass

[Diagram: Original Graph → Scheduling + Register Allocation, done in a single combined pass]

We do not advocate this method.

Why Decoupling ?

  • Memory wall

    • Memory bottleneck >> ILP enhancement

    • Useless spilling. Early RA still generates faster code [Freudenberg et al 92, Brasier et al 95, Janssen 01].

  • Register constraints are more generic

    • Heterogeneous, complex resource constraints.

    • Registers : types and number of available registers.

  • Complexity of register constraints

    • A DDG can always be scheduled under any resource constraints, but not under a limited number of registers: spilling is sometimes unavoidable.

Our chart

  • Performance enhancement

    • Priority is given to registers over ILP scheduling, but register handling must respect the ILP schedule (if possible).

  • Portability

    • No major re-writing of compilers (investment cost).

    • Generic processor model : it covers most existing ILP processors.

Our Generic ILP processor

  • No restriction on ILP degree.

    • Infinite parallelism (infinite resources).

  • Multiple register types.

    • A statement may produce multiple values, but with distinct types.

  • Visible delays in reading from and writing into registers.

    • A register is not occupied until the result is available (some time after the issue time in the pipeline).

First Strategy : Register Pressure Management

[Diagram: Original DDG and Register Constraints → Register Pressure Management (minimize the critical path increase) → Modified DDG → Register Allocation and Code Scheduling]

Register Saturation and Sufficiency

[Diagram: the possible positions of RS and RF relative to the number R of available registers; the surviving labels are "Add arcs" and "Spilling".]

Second Strategy : Schedule Independent Register Allocation

[Diagram: Original DDG and Register Constraints → Early Register Allocation (minimize the critical path increase) → Allocated DDG → Code Scheduling]

Part I : Basic Blocks

  • Register Requirement (Use)

  • Register Saturation

  • Register Sufficiency

  • Local Schedule Independent Register Allocation

  • Related Work

  • Conclusion

Local Register Requirement

[Figure: example DAG with nodes numbered 1 to 12 (additions, a multiplication, two loads and a store), used to illustrate the local register requirement.]

Without Assuming a Schedule…

  • Value lifetime intervals are not defined

    • Register Requirement not defined.

  • Two notions in this case:

    • Register Saturation per register type (max RR)

      • Guarantees that registers do not constrain the ILP scheduling.

    • Register Sufficiency per register type (min RR)

      • Prevents useless spilling.
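
Once a schedule is fixed, lifetime intervals are defined and the register requirement is simply the largest number of intervals that overlap. The helper below is a minimal Python sketch of that idea, assuming half-open lifetimes ]def, kill] with integer dates; the function name and the sample values are hypothetical and are not taken from the thesis.

```python
def register_requirement(lifetimes):
    """Return the maximum number of simultaneously live values.

    `lifetimes` maps a value name to (def_time, kill_time); the value is
    assumed to occupy a register during the interval ]def_time, kill_time].
    """
    events = []
    for def_time, kill_time in lifetimes.values():
        if kill_time > def_time:             # an empty interval needs no register
            events.append((def_time, +1))    # live just after def_time
            events.append((kill_time, -1))   # dead just after kill_time
    # At equal dates the -1 events sort first, so ]a, b] and ]b, c] do not overlap.
    events.sort()
    live = peak = 0
    for _, step in events:
        live += step
        peak = max(peak, live)
    return peak


# Hypothetical schedule: a and b overlap, b and c overlap, a and c do not.
example = {"a": (0, 2), "b": (1, 3), "c": (2, 4)}
print(register_requirement(example))  # prints 2
```

Without a schedule the dates above are unknown, which is why the deck switches to the two schedule-independent bounds (max RR and min RR).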

Computing Register Saturation

  • Given a DAG, compute the exact maximal register requirement over all valid schedules.

  • NP-complete problem [Touati 00].

    • Optimal method (integer linear programming).

    • Algorithmic heuristics.
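
To make the definition concrete, the sketch below enumerates every schedule of a tiny DAG within a bounded horizon and reports the largest and smallest register requirement it finds. It is a brute-force illustration only, not the integer program or the heuristics of the thesis. It assumes a single register type, no read or write delays, that a value is killed when its last consumer is issued, and that only nodes with at least one consumer hold a value; the function names and the example DAG are hypothetical.

```python
from itertools import product


def max_live(intervals):
    """Largest number of overlapping half-open intervals ]d, k]."""
    events = sorted((t, step) for d, k in intervals
                    for t, step in ((d, +1), (k, -1)))
    live = best = 0
    for _, step in events:          # at equal dates, ends are processed first
        live += step
        best = max(best, live)
    return best


def saturation_and_sufficiency(nodes, arcs, horizon):
    """Return (max RR, min RR) over all schedules with dates in 0..horizon.

    `arcs` is a list of (u, v, latency) with latency >= 1.
    """
    consumers = {u: [v for (x, v, _) in arcs if x == u] for u in nodes}
    requirements = []
    for dates in product(range(horizon + 1), repeat=len(nodes)):
        sigma = dict(zip(nodes, dates))
        if any(sigma[v] - sigma[u] < lat for u, v, lat in arcs):
            continue                # violates a precedence constraint
        intervals = [(sigma[u], max(sigma[v] for v in consumers[u]))
                     for u in nodes if consumers[u]]
        requirements.append(max_live(intervals))
    return max(requirements), min(requirements)


# Two independent load/store pairs: in parallel they need 2 registers,
# one after the other they need only 1.
nodes = ["ld1", "st1", "ld2", "st2"]
arcs = [("ld1", "st1", 1), ("ld2", "st2", 1)]
print(saturation_and_sufficiency(nodes, arcs, horizon=3))  # prints (2, 1)
```

The enumeration grows exponentially with the number of nodes, which is consistent with the need for an integer program or heuristics on realistic DAGs.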

Integer Programming Techniques

  • We use binary variables for expressing disjunction, implication, equivalence and max operator.

  • Disjunction : the domain set of the variables must be bounded.


  • Implication

  • Equivalence

  • Max
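
The formulas for these operators did not survive in this transcript. As a hedged illustration, the textbook big-M linearization with one binary variable $b$ is sketched below; the bounds $\underline{f}$, $\overline{f}$, $\underline{g}$ and the constant $M$ are notation assumed for this sketch, and the thesis may write the constraints differently. It shows why the variable domains must be bounded: the bounds are what make each constraint vacuous when it is switched off.

```latex
% Disjunction  f(x) >= 0  or  g(x) >= 0,
% with finite lower bounds \underline{f} <= f(x) and \underline{g} <= g(x):
\begin{align*}
  f(x) &\ge \underline{f}\, b, &
  g(x) &\ge \underline{g}\,(1 - b), &
  b &\in \{0, 1\}.
\end{align*}

% Implication  b = 1  ==>  f(x) >= 0:
\begin{align*}
  f(x) &\ge \underline{f}\,(1 - b).
\end{align*}

% Equivalence  b = 1  <==>  f(x) >= 0, for integer-valued f
% with \underline{f} <= f(x) <= \overline{f}:
\begin{align*}
  f(x) &\ge \underline{f}\,(1 - b), &
  f(x) &\le -1 + (\overline{f} + 1)\, b.
\end{align*}

% Max operator  z = max(x, y), with M >= |x - y| over the whole domain:
\begin{align*}
  z &\ge x, & z &\ge y, &
  z &\le x + M\, b, & z &\le y + M\,(1 - b).
\end{align*}
```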

Optimal RS Computation

  • Scheduling constraints :

    • $\forall e=(u,v) : \sigma_v - \sigma_u \ge \delta(e)$

    • Killing dates : $kill(u^t) = \max_{v \text{ reads } u^t} (\sigma_v + \delta_r(v))$

  • Interferences :

    • $s^t_{u,v} = 1 \iff (kill(u) > def(v) \land kill(v) > def(u))$

  • Maximal clique = independent set in the complementary graph:

    • $s^t_{u,v} = 0 \implies x_{u^t} + x_{v^t} \le 1$

  • Objective function = maximize (independent set)

    • Maximize $\sum x_{u^t}$

  • At most $O(n^2)$ variables and $O(n^2+m)$ constraints.

Problem Formulation with Graphs

  • RS computation amounts to choosing a unique killer for each value.

  • Computing a killing function that associates a unique killer to each value.

  • Two constraints :

    • The killing function must not introduce a circuit in the DAG.

    • The killing function must maximize the register requirement.

Killing Function

[Figure: an example DAG (additions, a multiplication, a load and a store), a killing function chosen on it, and the resulting disjoint value DAG, which defines an interval order.]

Register Saturation Problem

  • Find a valid killing function whose disjoint value DAG has a maximal antichain at least as large as that of any other valid killing function.

  • NP-complete Problem.

  • Polynomial heuristics.

Our Heuristics (Greedy-k)

  • Decompose the potential killing graph into connected bipartite components

    • $cb = (S, T, E_{cb})$

  • Find a Saturating Killing Set: maximize the parallel values with S (minimize the number of arcs in the disjoint value DAG).

Saturating Killing Set

[Figure: a connected bipartite component with its sets S and T; T is split into the saturating killing set T’ and the remainder T-T’, each shown with its descendant values.]

Greedy-k versus Optimal RS

  • Benchmarks : 27 loops from SPECfp95, Whetstone and Livermore.

  • DAGs = unrolled loops.

  • 134 DAGs in the experiments (up to 120 nodes).

  • The maximal empirical difference between the optimal RS and the approximation RS* computed by Greedy-k is 1 FP register (5% of DAGs).

Reducing Register Saturation

  • Problem : does there exist an extended DDG G’ from G such that $RS(G') \le R$ and Critical Path $\le P$ ?

  • NP-hard problem [Touati 01]

    • Optimal solution with integer programming.

    • Algorithmic heuristics.

Optimal RS Reduction

  • The problem is equivalent to computing a schedule $\sigma$ that does not require more than R registers (NP-complete), while the total schedule time is $\le P$.

  • Given such a schedule, we add arcs to G so as to guarantee the same interval order as defined by $\sigma$.

Integer Program for RS Reduction

  • We bound the register requirement for each register type, and the total schedule time:

    • $\sum x_{u^t} \le R_t$

    • total schedule time of $\sigma$ $\le P$

  • The objective function maximizes the RR of a considered register type:

    • Maximize $\sum x_{u^t}$

  • At most $O(n^2)$ variables and $O(n^2+m)$ constraints

Algorithmic Heuristics for RS Reduction

  • Serialize the lifetime intervals of saturating values.

  • Do not extend the critical path if possible.
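
The second point can be illustrated with a small filter that keeps only the serialization arcs that leave the critical path unchanged. This is a minimal sketch and not the thesis heuristic: it assumes the candidate arcs and their latencies are already given, and the names `critical_path`, `safe_serializations` and the example graph are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def critical_path(nodes, arcs):
    """Length of the longest path in a DAG; arcs are (u, v, latency)."""
    preds = {n: set() for n in nodes}
    for u, v, _ in arcs:
        preds[v].add(u)
    dist = {n: 0 for n in nodes}
    # static_order() yields every node after all of its predecessors
    # and raises CycleError if the graph contains a circuit.
    for n in TopologicalSorter(preds).static_order():
        for u, v, lat in arcs:
            if v == n:
                dist[n] = max(dist[n], dist[u] + lat)
    return max(dist.values())


def safe_serializations(nodes, arcs, candidates):
    """Keep candidate arcs that add no circuit and do not lengthen the critical path."""
    base = critical_path(nodes, arcs)
    kept = []
    for arc in candidates:
        try:
            if critical_path(nodes, arcs + [arc]) == base:
                kept.append(arc)
        except CycleError:
            pass                      # this arc would make the graph cyclic
    return kept


nodes = ["a", "b", "c", "d", "e"]
arcs = [("a", "b", 1), ("b", "c", 1), ("d", "e", 1)]
candidates = [("b", "d", 0), ("c", "d", 0), ("e", "d", 0)]
print(safe_serializations(nodes, arcs, candidates))  # prints [('b', 'd', 0)]
```

In the example, the arc from c to d is rejected because it stretches the longest path from 2 to 3, and the arc from e to d is rejected because it closes a circuit.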

Interval Serialization

[Figure: a DAG and its extended DAG after interval serialization; the added arc is labelled "r-w".]

Experiments (RS Reduction)

  • Optimal versus approximated

    • Loops were unrolled up to 4 times, giving DAGs of up to 80 nodes.

    • We parameterise R (#available registers) as 1, next power of 2, and 32.

    • Maximal empirical error is two registers.

Experiments (ILP loss)

Computing Register Sufficiency

  • Its complexity is still an open problem.

    • Proved NP-complete for sequential codes, but not for parallel ones.

    • Proved NP-complete for ILP codes if we restrict the total schedule time.

  • Integer programming

    • Same intLP system as RS, but we bound the register requirement : $\sum x_{u^t} \le R_t$

  • Algorithmic heuristics

    • Lifetime interval serialization (as RS reduction)

    • Do not care about critical path increase.

    • Set R=1 (reduce RS as low as possible)

Experiments (RF Computation)

  • 27 loop bodies, maximal empirical error is 1 register (7 cases).

Example of Early RA

[Figure: example DAG (additions, a multiplication and a load) with an early register allocation.]

Register allocation is a minimal chain decomposition.
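
The link between register allocation and chain decomposition can be illustrated on a fixed interval order: values whose lifetimes never overlap are comparable, so they can form a chain and share one register. The sketch below is a standard greedy allocation on given lifetimes, assuming half-open intervals ]def, kill]; it is not the schedule-independent allocation of the thesis, and the function name and sample lifetimes are hypothetical.

```python
import heapq


def allocate(lifetimes):
    """Assign a register index to each value; each register is one chain."""
    free_at = []          # min-heap of (kill_time, register)
    assignment = {}
    next_register = 0
    by_def_time = sorted(lifetimes.items(), key=lambda item: item[1][0])
    for name, (def_time, kill_time) in by_def_time:
        if free_at and free_at[0][0] <= def_time:
            # The earliest-freed register is dead before this value is born.
            _, register = heapq.heappop(free_at)
        else:
            register = next_register
            next_register += 1
        assignment[name] = register
        heapq.heappush(free_at, (kill_time, register))
    return assignment


lifetimes = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5)}
print(allocate(lifetimes))  # prints {'a': 0, 'b': 1, 'c': 0, 'd': 1}
```

Here the two chains are {a, c} and {b, d}, and two registers are also the minimum, since a and b are live at the same time.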

Two Critical Loops

Related Work (RS)

  • Our RS study extends the URSA framework [Berson 96]. We provide an adequate formulation of this problem.

  • URSA Assumption

    • DAG = pure data-flow graph : no multiple register types, no delays, and all nodes are assumed to be values.

  • The URSA problem formalisation is not correct.

  • The efficiency of URSA was not compared to the optimal solutions.

Conclusion of Part I

  • RS and RF are analysed before ILP scheduling : the DAG becomes free from register constraints.

  • RS management maximizes the register requirement in order to minimize the number of introduced false dependences.

  • RF analysis makes it possible to check whether spill code is unnecessary.

  • Our heuristics are nearly optimal (empirical results).

Part II : Simple Innermost Loops

  • Cyclic Register Requirement (Use)

  • Cyclic Register Saturation

  • Cyclic Register Sufficiency

  • Cyclic Schedule Independent Register Allocation

  • Related Work

  • Conclusion

Software Pipelining Motif

[Figure: three successive iterations (1st, 2nd, 3rd) on a time axis from 0 to 17. Each iteration issues a, b, -, -, c, -, d, -, -, e; a new iteration starts every h cycles, and L marks the length of one iteration.]

Kernel of the motif (one row per cycle 0 to 3, columns numbered 2, 1, 0):

  cycle    2    1    0
    0      -    c    a
    1      e    -    b
    2      -    d    -
    3      -    -    -

Cyclic Register Requirement

[Figure: lifetime intervals of three values v1, v2 and v3 over successive iterations (It i, It i+1, It i+2) on a time axis from 0 to 20, and the same intervals wrapped around a circle for h=4.]

Computing CRN

  • In a circular interval graph, the size of a maximal clique in the interference graph may differ from the width [Tucker 75].

  • We decompose the circular graph into two parts :

    • Complete turns around the circle (# of distinct interfering instances)

    • In_fraction_of_h intervals.

In_fraction_of_h Intervals

[Figure: the in_fraction_of_h intervals of v1, v2 and v3 for h=4.]

  • In_fraction_of_h intervals are the remainder of circular intervals after removing the complete turns around the circle.

  • If we unroll the kernel of the in_fraction_of_h intervals twice, the maximal clique of the interference graph is equal to the width of the in_fraction_of_h intervals [Touati 2002].
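
The decomposition into complete turns and in_fraction_of_h remainders can be sketched for an already scheduled loop. The Python below assumes integer dates, half-open lifetimes ]def, kill] for the values of one iteration, and a kernel of h cycles; it counts the steady-state number of live values per kernel cycle. The function name and the sample lifetimes are hypothetical and are not the ones of the figure.

```python
def cyclic_register_requirement(lifetimes, h):
    """Steady-state maximum number of live values for a kernel of h cycles.

    `lifetimes` maps each value of one iteration to (def_time, kill_time),
    meaning the acyclic interval ]def_time, kill_time].
    """
    complete_turns = 0
    per_cycle = [0] * h                  # remainder intervals, per kernel cycle
    for def_time, kill_time in lifetimes.values():
        length = kill_time - def_time
        complete_turns += length // h    # each full turn costs one register
        remainder = length % h           # the in_fraction_of_h part
        for t in range(def_time + 1, def_time + remainder + 1):
            per_cycle[t % h] += 1
    return complete_turns + max(per_cycle)


lifetimes = {"v1": (0, 3), "v2": (1, 7), "v3": (2, 4)}
print(cyclic_register_requirement(lifetimes, h=4))  # prints 4
```

In this example v2 lives for 6 cycles, so it contributes one complete turn plus a remainder of 2 cycles; the three remainders all cover kernel cycle 3, which gives 1 + 3 = 4.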

Part II : Simple Innermost Loops

  • Cyclic Register Requirement (Use)

  • Cyclic Register Saturation (CRS)

    • Computing Cyclic Register Saturation

    • Reducing Cyclic Register Saturation

  • Cyclic Register Sufficiency

  • Cyclic Schedule Independent Register Allocation

  • Related Work

  • Conclusion

Computing CRS

  • $CRS_t$ is the exact maximal cyclic register requirement over all valid SWP schedules.

  • Absolute $CRS_t$ is infinite.

    • If MII=0 (acyclic DDG), the loop is completely parallel : cannot be implemented by a SWP kernel.

    • If L is not bounded, we may have an infinite # of values simultaneously live.

  • Optimal method by integer programming.

Optimal CRS Computation (1)

  • The intLP system is written for a fixed h and a bounded L.

  • At most $O(n^2)$ variables and $O(n^2+m)$ constraints.

  • Scheduling constraints :

  • Killing dates :

Optimal CRS Computation (2)

  • # of complete turns around the circle

  • Two acyclic intervals (]a,b], ]a’,b’]) for each in_fraction_of_h interval (]l,r]).

Optimal CRS Computation (3)

  • The interferences of acyclic intervals are computed as in the acyclic case

  • A maximal clique is computed like in the acyclic case

  • Objective function

CRS Experiments

Reducing CRS

  • Problem : Given a DDG G, does there exist an extended DDG G’ such that $CRS_t(G') \le R_t$ and its critical circuit is $\le h$ ?

  • NP-hard [Touati 2002].

  • The reduction is from the problem of computing $\sigma$, an SWP motif with an initiation interval equal to h that does not require more than $R_t$ registers.

  • G’ is constructed from G so as to guarantee the same lifetime interval order as defined by $\sigma$.

Optimal CRS Reduction

  • The same intLP system as CRS computation, except that we bound the objective function by $R_t$

Part II : Simple Innermost Loops

  • Cyclic Register Requirement (Use)

  • Cyclic Register Saturation

  • Cyclic Register Sufficiency

  • Cyclic Schedule Independent Register Allocation

    • With loop unrolling.

    • Rotating register files.

    • Polynomial cases.

  • Related Work

  • Conclusion

Motivating Example

[Figure: successive iterations (It i, It i+1, ...) of a loop with two statements u and v, annotated with the registers R, R1 and R2.]

Killing Tasks

[Figure: killing tasks ku and ku’ added for the values u and u’, shown with the consumers v1 and v2; the arc labels did not survive in this transcript.]

Reuse Graphs

[Figure: a DDG with nodes numbered 1 to 8 and examples of reuse graphs built on it.]

SIRA Problem

  • Given a DDG, find a valid reuse graph for each register type such that its register requirement is at most R while the critical circuit is minimized.

  • Classical NP-complete problem [Eisenbeis and Gasperoni 95].

  • Minimizing the unrolling degree is a difficult problem (non linear).

SIRA Exact Formulation (1)

  • The existence of a SWP

  • Reuse Arc (bijection)

  • Anti-dependences

SIRA Exact Formulation (2)

  • Register Requirement

  • Objective function

  • At most $O(n^2)$ variables and $O(n^2+m)$ constraints.

Rotating Register Files

[Figure: physical registers r0 to r5 of a rotating register file across successive kernel iterations of h cycles; the logical register R1 is written in each iteration.]

  • No need to unroll the kernel to allocate registers.

  • Cost (if we do not unroll the loop) : at most one extra register for the same h [Rau et al 92, Lelait 96].
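
A toy model can show why a rotating register file removes the need to unroll the kernel: the same logical register name reaches a different physical register in each iteration. The sketch below assumes a decrementing rotation base, where the physical index is (logical + base) mod size and the base is decremented once per kernel iteration. That convention, the class name and the loop are assumptions made for illustration, not a description of a specific machine or of the thesis.

```python
class RotatingRegisterFile:
    """Toy rotating register file with a decrementing rotation base."""

    def __init__(self, size):
        self.cells = [None] * size
        self.base = 0

    def _physical(self, logical):
        return (logical + self.base) % len(self.cells)

    def write(self, logical, value):
        self.cells[self._physical(logical)] = value

    def read(self, logical):
        return self.cells[self._physical(logical)]

    def rotate(self):
        # After rotation, the value written as logical r is visible as r + 1.
        self.base = (self.base - 1) % len(self.cells)


rrf = RotatingRegisterFile(size=3)
seen = []
for i in range(5):            # one pass per kernel iteration
    rrf.write(0, i)           # every iteration writes the same logical name
    seen.append(rrf.read(2))  # and reads the value produced two iterations ago
    rrf.rotate()
print(seen)  # prints [None, None, 0, 1, 2]
```

The kernel always names logical registers 0 and 2, yet three values are kept alive at once without any unrolling or register-to-register copy.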

Hamiltonian Ordering

[Figure: a Hamiltonian reuse circuit over seven nodes numbered 0 to 6.]

  • (u,v) is a reuse arc $\iff ho(v) = (ho(u)+1) \bmod n$
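
The relation between a Hamiltonian ordering and its reuse arcs can be sketched in a few lines: the arcs link each value to the next one in the ordering, modulo n, and together they form a single circuit through all values. The helper names and the three-value example are hypothetical.

```python
def reuse_arcs_from_order(order):
    """Reuse arcs (u, v) such that ho(v) = (ho(u) + 1) mod n."""
    n = len(order)
    return [(order[i], order[(i + 1) % n]) for i in range(n)]


def is_hamiltonian_reuse(values, arcs):
    """True if the reuse arcs form one circuit visiting every value once."""
    if not values or len(arcs) != len(values):
        return False
    successor = dict(arcs)
    if set(successor) != set(values) or set(successor.values()) != set(values):
        return False                    # not a bijection on the values
    visited, node = set(), values[0]
    while node not in visited:          # follow the circuit from values[0]
        visited.add(node)
        node = successor[node]
    return len(visited) == len(values)


values = ["a", "b", "c"]
arcs = reuse_arcs_from_order(values)
print(arcs)                                # prints [('a', 'b'), ('b', 'c'), ('c', 'a')]
print(is_hamiltonian_reuse(values, arcs))  # prints True
# Self reuse (each value reuses its own register) is a bijection
# but not a single circuit when there is more than one value.
print(is_hamiltonian_reuse(values, [(v, v) for v in values]))  # prints False
```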

Polynomial Problem

Theorem [Touati 2002]: if the reuse arcs are fixed statically, the problem of computing the reuse distances that minimize the register requirement under a fixed execution rate has a totally unimodular constraint matrix, and can therefore be solved in polynomial time.

Polynomial Instances

  • Disable register sharing among statements : fix statically a self-reuse scheme (each variable reuses the register freed by itself) [Ning 93].

  • Fix an arbitrary (or a more cleverly chosen) Hamiltonian reuse circuit.

  • Look for other, cleverer methods.

Experiments (SIRA vs. Ham SIRA)

Hamiltonian SIRA needs at most one register more than SIRA (under the same II), and only in very few cases.

Optimal vs. Polynomial SIRA

  • Two strategies : self reuse (disable register sharing) and arbitrary hamiltonian reuse.

  • In summary :

    • The self-reuse strategy is the worst choice in terms of register requirement. An arbitrary Hamiltonian reuse scheme behaves better.

    • The self-reuse strategy behaves better in terms of unrolling degrees (if no rotating register file exists).

Related Work in Loops

  • As far as we know, no prior work addresses cyclic register saturation and sufficiency.

  • SWP under register constraints

    • Heuristics : [Huff 93, Ning 93, Wang et al 95, Sanchez 96, Llosa 96]

    • Parallelism vs. Storage : [Strout et al 98, Thies et al 01].

    • Optimal : buffers [Altman 95], stage scheduling [Eichenberger et al 96, Huard 01], MAXLIVE [Sawya 96, Fimmel et al 01].

  • Register Allocation of an already scheduled loop

    • RRF : [Rau et al 92], [Duesterwald et al 92].

    • Loop unrolling : [Hendren et al 92].

    • Meeting graph [Lelait et al 96].

Conclusion of Part II

  • Contrary to acyclic RS, absolute CRS may be infinite. Practically, we can compute it if we bound L.

  • CRF exists, and we can compute it.

  • SIRA makes it possible to construct an optimal (early) cyclic register allocation with a minimal critical circuit of the loop.

    • Fixing at compile time the reuse arcs makes the problem polynomial.

Future Research Proposals (1)

  • A fact : a thesis can never be exhaustive ! Our efforts open a wide range of subjects for future work.

  • Open Problems

    • Complexity of Register Sufficiency in ILP Codes.

  • Experiments

    • CRS, CRF, and spilling heuristics.

    • Machine with finite resources.

Future Research Proposals (2)

  • Extend processor model (write multiple results per type).

  • Loops with branches.

  • Multi-dimensional scheduling : loop nest.

Conclusion : our THESIS in two points

  • Registers should be handled (early) at the intermediate level of compilation, but with register saturation (maximize and not minimize the register requirement).

  • Avoid spilling, even if you degrade static ILP extraction.
