Register Pressure in Instruction Level Parallelism
TOUATI SidAhmedAli

Outline
Prologue
Part one : Basic Blocks
Part two : Simple Innermost Loops
Epilogue

Memory Bottleneck
From [Lin et al 01], in HPCA 2001. Simulated performance on an Alpha 21364 processor (1.6 GHz). Recent Compaq compiler (peak optimization compiler flags).
To tolerate
Scheduling + Register Allocation
We do not advocate this method
First Strategy : Register Pressure Management
[Figure labels: Register Constraints; Register Pressure Management; Modified DDG; Register Allocation; Code Scheduling. Annotation: Minimize Critical Path Increase]
Second Strategy : Schedule Independent Register Allocation
[Figure labels: Register Constraints; Early Register Allocation; Allocated DDG; Code Scheduling. Annotation: Minimize Critical Path Increase]
Local Register Requirement
[Figure: example DAGs of ld, st, + and x operations; the nodes of one DAG are numbered 1 to 12]
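The local register requirement of a scheduled DAG can be illustrated with a small sketch. It assumes, for illustration only, that a value is alive from the cycle after its producer issues until the cycle of its last reader; the operation names and both schedules are hypothetical and are not taken from the slide.

```python
def register_requirement(schedule, readers):
    """Maximum number of values alive at the same cycle.

    schedule: dict mapping each operation to its issue cycle.
    readers:  dict mapping each value-producing operation to its consumers.
    Assumption: a value lives in the left-open interval
    ]issue(producer), issue(last reader)].
    """
    intervals = [
        (schedule[producer], max(schedule[r] for r in users))
        for producer, users in readers.items()
        if users
    ]
    if not intervals:
        return 0
    horizon = max(death for _, death in intervals)
    return max(
        sum(1 for birth, death in intervals if birth < t <= death)
        for t in range(horizon + 1)
    )


# Hypothetical DAG: four loads, two additions, one multiplication, one store.
readers = {
    "l1": ["a1"], "l2": ["a1"], "l3": ["a2"], "l4": ["a2"],
    "a1": ["m"], "a2": ["m"], "m": ["st"],
}
parallel = {"l1": 0, "l2": 0, "l3": 0, "l4": 0, "a1": 1, "a2": 1, "m": 2, "st": 3}
sequential = {"l1": 0, "l2": 1, "a1": 2, "l3": 3, "l4": 4, "a2": 5, "m": 6, "st": 7}

print(register_requirement(parallel, readers))    # 4
print(register_requirement(sequential, readers))  # 3
```

The same DAG needs four registers under the parallel schedule and three under the sequential one, which is the tension between ILP and register pressure that the talk addresses.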
Killing Function...
[Figure: example DAG of ld, st, + and x operations, annotated with a killing function]
Disjoint Value DAG : interval order
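A minimal sketch of how a killing function induces a disjoint value DAG. It makes two assumptions that go beyond what survives on the slide: a killing function picks, for each value, the one consumer that ends its lifetime, and a value v can never be alive together with u when v is the killer of u or one of the killer's descendants. The DAG, the function names and the brute-force antichain search are illustrative only.

```python
from itertools import combinations


def descendants(succ, node):
    """All nodes reachable from `node` in the DAG (excluding `node`)."""
    seen, stack = set(), [node]
    while stack:
        for s in succ.get(stack.pop(), []):
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return seen


def disjoint_value_arcs(succ, values, killer):
    """Arc (u, v) means: u is always dead before v is defined."""
    arcs = set()
    for u in values:
        after_kill = descendants(succ, killer[u]) | {killer[u]}
        arcs |= {(u, v) for v in values if v != u and v in after_kill}
    return arcs


def max_antichain_size(values, arcs):
    """Brute force: largest set of values that may be alive together."""
    for size in range(len(values), 0, -1):
        for group in combinations(values, size):
            if all((a, b) not in arcs and (b, a) not in arcs
                   for a, b in combinations(group, 2)):
                return size
    return 0


# Hypothetical DAG: c = a + b, d = a x c, then d is stored.
succ = {"a": ["c", "d"], "b": ["c"], "c": ["d"], "d": ["st"]}
values = ["a", "b", "c", "d"]
killer = {"a": "d", "b": "c", "c": "d", "d": "st"}

arcs = disjoint_value_arcs(succ, values, killer)
print(sorted(arcs))
print(max_antichain_size(values, arcs))  # 2
```

With this killing function at most two values can be alive together, for instance a and c.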
Saturating Killing Set
[Figure labels: S, TT', T', T, Descendant values]
Example of Early RA
Register Allocation is a minimal chain decomposition
[Figure: example DAG of ld, + and x operations]
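The statement that register allocation is a minimal chain decomposition can be sketched as follows. The sketch assumes the "always dead before" relation between values is already available as a transitively closed set of pairs, and it uses a standard augmenting-path bipartite matching to build the chains; this is a textbook technique chosen for illustration, not necessarily the method used in the thesis. All names are hypothetical.

```python
def min_chain_decomposition(nodes, before):
    """Split `nodes` into the fewest chains of the strict order `before`.

    before: transitively closed set of pairs (u, v), read as
            "u is always dead before v is defined".
    Each chain is a list of values that can share one register.
    """
    pred = {}  # pred[v] = u when v directly follows u in some chain

    def try_link(u, visited):
        # Look for a successor of u, re-routing earlier links if needed.
        for v in nodes:
            if (u, v) in before and v not in visited:
                visited.add(v)
                if v not in pred or try_link(pred[v], visited):
                    pred[v] = u
                    return True
        return False

    for u in nodes:
        try_link(u, set())

    succ = {u: v for v, u in pred.items()}
    chains = []
    for head in nodes:
        if head not in pred:  # a chain starts at a node with no predecessor
            chain = [head]
            while chain[-1] in succ:
                chain.append(succ[chain[-1]])
            chains.append(chain)
    return chains


# Hypothetical order on four values.
order = {("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")}
for register, chain in enumerate(min_chain_decomposition(["a", "b", "c", "d"], order)):
    print(f"R{register}: {chain}")
```

Each chain is assigned one register, so the number of chains is the number of registers used.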
RS and RF are analysed before ILP scheduling: the DAG becomes free of register constraints.
RS management maximizes the register requirement in order to minimize the number of introduced false dependences.
RF analysis makes it possible to check whether spill code is unnecessary.
Our heuristics are nearly optimal (empirical results).
Software Pipelining Motif
[Figure: 1st, 2nd and 3rd iterations of operations a, b, c, d, e, each started h after the previous one. Axis: iterations. Other labels: cn, rn, L, h]

        2   1   0
    0   .   c   a
    1   e   .   b
    2   .   d   .
    3   .   .   .
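A small sketch of how a one-iteration schedule folds into a software pipelining motif with period h. It assumes the usual decomposition of an issue time t into a row t mod h and a column t div h; the issue times below are hypothetical values chosen for illustration.

```python
def motif(schedule, h):
    """Fold a one-iteration schedule into h rows.

    schedule: dict mapping each operation to its issue time in its own iteration.
    Returns a list of h rows; each row holds (operation, column) pairs,
    where row = time mod h and column = time div h.
    """
    rows = [[] for _ in range(h)]
    for op, t in sorted(schedule.items(), key=lambda item: item[1]):
        rows[t % h].append((op, t // h))
    return rows


# Hypothetical issue times for operations a to e, with period h = 4.
schedule = {"a": 0, "b": 1, "c": 4, "d": 6, "e": 9}
for row_number, row in enumerate(motif(schedule, 4)):
    print(row_number, row)
```

In steady state, every h cycles the machine executes one full pass over the rows, with operations in column k belonging to the iteration started k periods earlier.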
Cyclic Register Requirement
[Figure: lifetimes of values v1, v2 and v3 for iterations It i, It i+1 and It i+2. Axis: time. Period: h=4]
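The cyclic register requirement can be sketched by folding each lifetime onto the h time steps of the steady state. The sketch assumes half-open lifetimes [birth, death) measured within one iteration; the three lifetimes below are hypothetical and are not read off the slide.

```python
def cyclic_register_requirement(lifetimes, h):
    """Values simultaneously alive at each time step of the steady state.

    lifetimes: dict mapping each value to (birth, death) in its own iteration,
               read as the half-open interval [birth, death).
    Returns (maximum, per-step counts). A lifetime longer than h overlaps
    the instances of the same value from later iterations, so it is
    counted several times on the same step.
    """
    alive = [0] * h
    for birth, death in lifetimes.values():
        for t in range(birth, death):
            alive[t % h] += 1
    return max(alive), alive


# Hypothetical lifetimes for three values, with period h = 4.
lifetimes = {"v1": (0, 6), "v2": (2, 5), "v3": (3, 11)}
print(cyclic_register_requirement(lifetimes, 4))  # (5, [5, 4, 4, 4])
```

The total lifetime here is 17 cycles, so no schedule with h = 4 and these lifetimes could use fewer than ceil(17/4) = 5 registers.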
In_fraction_of_h Intervals
[Figure labels: v1, v2, v3; It i, It i+1, It i+, It i+]
Motivating Example
[Figure labels: u, v]
Rotating Register Files
[Figure labels: R, R1, R2, R1=…; physical registers r0 to r5; iteration; h]
Thesis defense
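A minimal sketch of register renaming in a rotating register file. It assumes one common convention, in which the rotation base moves down by one at every new iteration, so the value written to R1 in one iteration is read as R2 in the next; the convention drawn on the slide is not recoverable, and the function name is hypothetical.

```python
def physical_register(arch_index, iteration, n_physical):
    """Physical register behind architectural register R<arch_index>.

    Assumption: the rotation base decreases by one per iteration,
    modulo the number of physical registers.
    """
    return (arch_index - iteration) % n_physical


# With six physical registers r0..r5: where does R1 land in each iteration?
for it in range(4):
    print(f"iteration {it}: R1 -> r{physical_register(1, it, 6)}")

# The value written to R1 in iteration 0 is visible as R2 in iteration 1.
print(physical_register(1, 0, 6) == physical_register(2, 1, 6))  # True
```

Successive iterations write R1 into different physical registers, so a value can stay alive longer than h without being overwritten by the next iteration.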
Theorem [Touati 2002]: if the reuse arcs are fixed statically, then computing the distances that minimize the register requirement under a fixed execution rate is a problem whose constraint matrix is totally unimodular.
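To make the term concrete: a matrix is totally unimodular when every square submatrix has determinant -1, 0 or 1, which guarantees that the linear relaxation of an integer program with integer right-hand sides has integral optimal vertices. The brute-force check below is only an illustration on small hypothetical matrices; neither matrix is the one from the theorem.

```python
from itertools import combinations


def det(m):
    """Integer determinant by cofactor expansion (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, x in enumerate(m[0]):
        if x:
            minor = [row[:j] + row[j + 1:] for row in m[1:]]
            total += (-1) ** j * x * det(minor)
    return total


def is_totally_unimodular(a):
    """True if every square submatrix has determinant in {-1, 0, 1}."""
    n_rows, n_cols = len(a), len(a[0])
    for k in range(1, min(n_rows, n_cols) + 1):
        for rows in combinations(range(n_rows), k):
            for cols in combinations(range(n_cols), k):
                sub = [[a[r][c] for c in cols] for r in rows]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


# Hypothetical difference constraints x_i - x_j <= c: one +1 and one -1 per row.
difference_constraints = [[1, -1, 0], [0, 1, -1], [-1, 0, 1], [1, 0, -1]]
print(is_totally_unimodular(difference_constraints))  # True

# A matrix that is not totally unimodular (its determinant is 2).
print(is_totally_unimodular([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))  # False
```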
Hamiltonian SIRA needs at most one more register than SIRA (under the same II), and only in very few cases.
