

Lattice-Based Memory Allocation. Alain Darte, Compsys Project: Compilation and Embedded Systems, CNRS, LIP, ENS-Lyon, France. Joint work with Rob Schreiber (HP Labs) and Gilles Villard (CNRS, LIP). References: CASES’03, IEEE Transactions on Computers (to appear).



  1. Lattice-Based Memory Allocation Alain Darte Compsys Project: Compilation and Embedded Systems CNRS, LIP, ENS-Lyon, France Joint work with Rob Schreiber (HP Labs) and Gilles Villard (CNRS, LIP). References: CASES’03, IEEE Transactions on Computers (to appear). WOG’04, April 25th, 2004. Recent Trends in Compiler Construction. Sven Verdoolaege’s PhD Defense.

  2. Outline • Introduction: • The initial context: PICO, HP Labs software tool for compiling high-level programs (e.g., C code) into NPAs (Non Programmable Accelerators). How to store intermediate results? • Mathematical tools for high-level program transformations. • An example of communicating pipelined loops. • Lattice-based memory allocation. • Examples of previous work limitations. • Main results and open questions.

  3. PICO (Program In, Chip Out): HP Labs automatic generation of non-programmable accelerators (NPAs). Input code: • C code. Output “code”: • synthesizable VHDL • netlists for FPGA • VLIW code (interface). Other possible inputs: recurrence equations, Matlab, Kahn processes. Similar tools: MMAlpha (Inria), Atomium (IMEC), Compaan (Leiden).

  4. High-Level Program Optimizations • Program analysis: dependence analysis, lifetime analysis, footprint analysis, array expansion, array renaming, etc. • Code and loop transformations: tiling, scheduling, nested loop transformations, modulo scheduling, etc. ⇒ Well-established mathematical tools and theory: graph algorithms, polyhedral manipulations, Hermite/Smith forms, integer linear programming, Ehrhart polynomials, etc. BUT • Memory optimizations: • optimization of local memory (intra-loop buffer); • optimization of inter-loop buffers for communicating NPAs. ⇒ No suitable mathematical tools so far.

  5. Huge gap! Example: DCT-like code. First NPA: do br = 0, 63 do bc = 0, 63 do r = 0, 7 A(br, bc, r, …) = … enddo enddo enddo Second NPA: do br = 0, 63 do bc = 0, 63 do c = 0, 7 … = A(br, bc, …, c) enddo enddo enddo pipelined with Memory for A. How to schedule the computations? How to allocate elements of A in local memory so as to reduce its size? a) Full array: 256K elements. b) Optimized size = 112 elements (< 2 blocks): A(br, bc, r, c) mapped to (r mod 4, (16(br+bc) + 2r + c) mod 28).
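As a quick sanity check of the numbers on this slide, the modular mapping and the two sizes can be reproduced in a few lines of Python (a sketch of mine, not part of the slides):

```python
# Sketch of the optimized modular mapping from the slide:
# A(br, bc, r, c) -> (r mod 4, (16*(br + bc) + 2*r + c) mod 28).
# The moduli b = (4, 28) give a buffer of 4 * 28 = 112 elements,
# versus 64 * 64 * 8 * 8 = 262144 (256K) for the full array.

def addr(br, bc, r, c):
    """Address of A(br, bc, r, c) in the reduced buffer."""
    return (r % 4, (16 * (br + bc) + 2 * r + c) % 28)

full_size = 64 * 64 * 8 * 8    # 256K elements
buffer_size = 4 * 28           # 112 elements

print(full_size, buffer_size, addr(63, 63, 7, 7))
```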

  6. Outline • Introduction. • Lattice-based memory allocation: • Definition of modular allocations. • Conflicting indices and critical lattices. • Examples of limitations of previous work. • Main results and open questions.

  7. Memory Reduction Problem for Arrays Given a scheduled program (i.e., operations are not reordered), or several communicating programs, find the minimal memory size to store intermediate values and an adequate memory mapping. • Lifetime analysis: • Schedule of computations ⇒ Lifetime for each value (similar to dependence analysis, exact or over-approximated). • Memory reuse: • Values simultaneously live should not share the same location (constraints similar to register allocation). • Restrict to “simple” addressing functions (for code generation): • canonical linearization, linear mapping in multi-dimensional arrays + wrapping with modulo operations (reuse). All are special cases of modular memory allocations.

  8. Modular Mappings • Generalization of (rotating) registers in higher dimensions: Value indexed by i writes in multi-dimensional position Mi mod b, where b is a positive integral vector, and M an integral matrix. Ex: i=(i1,i2) stored at address (2i1+i2 mod 3, i1+i2 mod 6) ⇒ b=(3,6), size = 18. • Given a schedule and a lifetime analysis, find a valid allocation (M,b) such that the product of the components of b (memory size) is minimized. • Generalizes all previous approaches: • De Greef, Catthoor, De Man (1996-1997): linearizations + 1 modulo. • Lefebvre, Feautrier (1996-1997): successive modulos. • Wilde, Rajopadhye (1996), Quilleré, Rajopadhye (2000): projections. • Strout, Carter, Ferrante, Simon (ASPLOS’98): only 1 modulo. • Thies, Vivien, Sheldon, Amarasinghe (PLDI’01): same.
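A modular mapping (M, b) is easy to prototype; the following Python sketch (mine, with hypothetical names) evaluates the slide's 2-D example:

```python
# Modular mapping: index vector i is stored at (M i) mod b, componentwise.
# Example from the slide: M = [[2, 1], [1, 1]], b = (3, 6), memory size 18.

from math import prod

def modular_address(M, b, i):
    """Componentwise (M i) mod b."""
    return tuple(sum(m * x for m, x in zip(row, i)) % bk
                 for row, bk in zip(M, b))

M = [[2, 1], [1, 1]]
b = (3, 6)
size = prod(b)          # memory footprint = product of the moduli

print(size, modular_address(M, b, (4, 5)))
```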

  9. Our Main Contributions [Thies et al., PLDI’01]: There is a need for a technique able “to consider more general storage mappings” and that “would allow variations in the number of array dimensions, while still capturing the directional and modular reuse of the occupancy vector”. • We identify the fundamental object to work with: • The set S of all differences of conflicting indices. • We show the link with critical lattices: • Finding the best allocation Mi mod b among ALL possible modular allocations amounts to finding the critical integer lattice for the set S. • We give guaranteed heuristics to approximate the optimal: ⇒ It explains previous work; ⇒ It gives new (and better) solutions; ⇒ It shows the link with theoretical work on successive minima, basis reduction, Minkowski’s theorems, etc.

  10. Outline • Introduction. • Lattice-based memory allocation. • Examples of previous work limitations: • rely on particular linearizations, • or may wrap along the wrong axis. • Main results and open questions.

  11. De Greef, Catthoor, and De Man • Were the first to identify the need for memory reduction techniques for embedded multimedia applications. ⇒ Patent (1996) for intra- and inter-array memory reuse. • Inter-array reuse: • Geometrical heuristics for packing different arrays in a given memory buffer. ⇒ will not be discussed here. • Intra-array memory reuse: • Consider each original d-dimensional array and its 2^d·d! canonical linearizations. (Example in 2D for an NxM array, look at 8 linearizations: Mi+j, Mi-j, -Mi+j, -Mi-j, i+Nj, i-Nj, -i+Nj, -i-Nj). • Compute the maximal address difference D between two simultaneously live values. • Select the linearization with smallest distance D and wrap the array modulo (D+1).
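The 2^d·d! canonical linearizations are exactly the choices of a dimension order (d! permutations) plus a sign per dimension (2^d factors); a Python sketch of mine (function names hypothetical) enumerates them for a 2-D array:

```python
# Enumerate the 2^d * d! canonical linearizations of a d-dimensional array:
# one stride assignment per dimension order, times one sign per dimension.
# For an N x M array in 2-D this yields the 8 functions listed on the slide.

from itertools import permutations, product

def canonical_linearizations(dims):
    """Yield coefficient tuples such that addr(i) = sum(coeffs[k]*i[k])."""
    d = len(dims)
    for order in permutations(range(d)):
        strides = [0] * d
        s = 1
        for axis in reversed(order):   # innermost dimension gets stride 1
            strides[axis] = s
            s *= dims[axis]
        for signs in product((1, -1), repeat=d):
            yield tuple(sg * st for sg, st in zip(signs, strides))

lins = list(canonical_linearizations((4, 3)))   # N=4, M=3 (small example)
print(len(lins))
```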

  12. De Greef, Catthoor, De Man: Example 1 do i = 1,N do j = 1,N a(i,j) = ... b(i,j) = a(i-1,j) enddo enddo do i = 1,N do j = 1,N a((Ni+j) mod (N+1)) = ... b(i,j) = a((Ni+j+1) mod (N+1)) enddo enddo do i = 1,N do j = 1,N a((-i+j) mod (N+1)) = ... b(i,j) = a((-i+j+1) mod (N+1)) enddo enddo Column-major order (Fortran-like): i+Nj, maximal distance = N(N-1)+1. Row-major order (C-like): Ni+j, maximal distance = N. ⇒ Best canonical linearization: Ni+j mod (N+1), equal to (-i+j) mod (N+1) since N ≡ -1 (mod N+1).
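For a small N, the validity of the allocation Ni+j mod (N+1) can be checked exhaustively by simulating lifetimes; a sketch of mine, assuming 0-based loop indices and a hypothetical N = 5:

```python
# Exhaustive check, for small N, that Ni+j taken modulo N+1 is a valid
# allocation for Example 1: a(i, j) is written at iteration (i, j) and
# last read at iteration (i+1, j), as a(i-1, j).

N = 5

def time(i, j):                 # lexicographic execution order
    return N * i + j

# Live interval of a(i, j): from its write to its read at (i+1, j);
# values in the last row are never read again.
live = {(i, j): (time(i, j), time(i + 1, j) if i + 1 < N else time(i, j))
        for i in range(N) for j in range(N)}

def addr(i, j):
    return (N * i + j) % (N + 1)

# Two distinct values whose live intervals overlap must get distinct addresses.
ok = all(addr(*u) != addr(*v)
         for u in live for v in live
         if u != v
         and live[u][0] <= live[v][1] and live[v][0] <= live[u][1])
print(ok)
```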

  13. De Greef, Catthoor, De Man: Example 2 How could we have missed this? do i = 1,N do j = 1,N a(i,j) = ... b(i,j) = a(i-1,j) enddo enddo do t = 2,2N /* t = i+j */ do j = max(1,t-N),min(N,t-1) a(t-j,j) = ... b(t-j,j) = a(t-j-1,j) enddo enddo do t = 2,2N /* t = i+j */ do j = max(1,t-N),min(N,t-1) a(t-j) = ... b(t-j,j) = a(t-j-1) enddo enddo Any canonical linearization leads to a distance Θ(N²)! But the allocation i mod N, or even i, is just fine!

  14. Lefebvre and Feautrier • Developed in the context of parallelizing compilers: • a) Eliminate spurious memory dependences thanks to single-assignment form; b) wrap memory back when possible. • Inter-array reuse: • Coloring heuristics on array names (as for register allocation). • Intra-array memory reuse: • Idea 1: forget about original arrays, focus on original loop indices. • Idea 2: wrap successively in each dimension with modulos. ⇒ From a computational point of view, use classical techniques based on (rational) linear programming.
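The successive-modulo mechanism, as I read these slides, can be sketched directly on the set DS of conflicting index differences (a sketch of mine, names hypothetical):

```python
# Successive modulos: for each loop dimension k, take the maximal k-th
# component over the conflicting differences whose earlier components are
# all zero, and wrap that dimension modulo (max + 1).

def successive_moduli(DS):
    """DS: set of conflicting index-difference vectors (tuples)."""
    n = len(next(iter(DS)))
    b = []
    for k in range(n):
        candidates = [abs(d[k]) for d in DS
                      if all(d[t] == 0 for t in range(k))]
        b.append(max(candidates, default=0) + 1)
    return b

# Example 1: the conflicting differences include (0, dj) and (1, dj) for
# |dj| < N: along i the max is 1, and with i fixed the max along j is N-1.
# Take a hypothetical N = 6.
N = 6
DS = ({(0, dj) for dj in range(-(N - 1), N)}
      | {(1, dj) for dj in range(-(N - 1), N)})
print(successive_moduli(DS))   # (i mod 2, j mod N), size 2N
```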

  15. Lefebvre, Feautrier: Example 1 revisited do i = 1,N do j = 1,N a(i,j) = ... b(i,j) = a(i-1,j) enddo enddo do i = 1,N do j = 1,N a(i mod 2, j) = ... b(i,j) = a((i-1) mod 2, j) enddo enddo Along i, maximal distance = 1 ⇒ i mod 2. Along j (for a fixed i), maximal distance = N-1 ⇒ j mod N, i.e., j. ⇒ Selected allocation (i mod 2, j), with a memory size 2N (note: N+1 in previous solution).

  16. Lefebvre, Feautrier: Example 2 revisited do i = 1,N do j = 1,N a(i,j) = ... b(i,j) = a(i-1,j) enddo enddo do t = 2,2N /* t = i+j */ do j = max(1,t-N),min(N,t-1) a(t-j,j) = ... b(t-j,j) = a(t-j-1,j) enddo enddo do t = 2,2N /* t = i+j */ do j = max(1,t-N),min(N,t-1) a(t-j) = ... b(t-j,j) = a(t-j-1) enddo enddo Along i, maximal distance = N-1 ⇒ i mod N, i.e., i. Along j (for a fixed i), maximal distance = 0 ⇒ no extra dimension. ⇒ Selected allocation i mod N, i.e., i. (Note: order N² in previous solution.)

  17. Lefebvre, Feautrier: Example 3 do i = 1,N do j = 1,N a(i,j) = ... enddo enddo pipelined 1 clock cycle later with do i = 1,N do j = 1,N b(i,j) = a(i,j)+... enddo enddo • Along i, maximal distance = 1 ⇒ i mod 2. • Along j (for a fixed i), maximal distance = 1 ⇒ j mod 2. • Selected allocation (i mod 2, j mod 2) and size 4. OK.

  18. Lefebvre, Feautrier: Example 3 (variant) do t = 2,2N /* t = i+j */ do j = max(1,t-N),min(N,t-1) a(t-j,j) = ... enddo enddo pipelined 1 clock cycle later with do t = 2,2N /* t = i+j */ do j = max(1,t-N),min(N,t-1) b(t-j,j) = a(t-j,j)+... enddo enddo • Along i, maximal distance = N-1 ⇒ i mod N. • Along j (for a fixed i), maximal distance = 0 ⇒ j mod 1. • Corresponding memory size N (instead of 4)! Same if starting with j. FAIL!

  19. Outline • Introduction. • Lattice-based memory allocation. • Examples of previous work limitations. • Main results and open questions: • No way to explain all details quickly, even to experts in lattice theory and reduction theory... • See the CASES’03 proceedings, the research report (http://perso.ens-lyon.fr/alain.darte), or the IEEE TC journal version (to appear). • But I can try to: • Explain the basic concepts of critical lattices and modular allocations. • Illustrate the different mechanisms. • State the results.

  20. There was a Need for a Framework for Memory Reduction Based on Modular Allocations • Lower bounds: • Given a lifetime analysis, can we give a lower bound for the best achievable memory size? What is the best modular memory allocation? • Upper bounds: • Can we find mechanisms leading to allocations whose corresponding memory size is not arbitrarily bad compared to the lower bound (guaranteed heuristics)? • Robustness: • We need a framework that can possibly capture parameters, that does not depend on the basis in which the problem is described, etc. ⇒ Geometrical model. • Computability: • We need to make sure the mechanisms are constructive and lead to heuristics (or algorithms) that can be implemented.

  21. Set of Conflicting Index Differences • Index description: • Choose an index description for the values that are going to share a given array (the allocation will be linear with respect to these indices). Typically, loop indices, array indices, etc. • Set of conflicting index differences: • Build the set CS of pairs of conflicting (i.e., simultaneously live) indices (i,j), and the set DS of differences (i-j). We want: (i,j) ∈ CS, i ≠ j ⇒ Mi mod b ≠ Mj mod b, or equivalently: d ∈ DS, d ≠ 0 ⇒ Md mod b ≠ 0, or equivalently: Md mod b = 0 and d ∈ DS ⇒ d = 0.
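The validity condition above is directly executable; a Python sketch of mine, checked on the slide-8 example mapping:

```python
# Validity of a modular allocation (M, b) with respect to the set DS of
# conflicting index differences:  d in DS and Md mod b = 0  must imply d = 0.

def is_valid(M, b, DS):
    def maps_to_zero(d):
        return all(sum(m * x for m, x in zip(row, d)) % bk == 0
                   for row, bk in zip(M, b))
    # every nonzero conflicting difference must NOT map to zero
    return all(not maps_to_zero(d) for d in DS if any(d))

# Toy check with the slide-8 mapping (2*i1 + i2 mod 3, i1 + i2 mod 6):
M, b = [[2, 1], [1, 1]], (3, 6)
print(is_valid(M, b, {(0, 0), (1, 0), (0, 1), (2, -1)}))
```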

  22. Admissible and Critical Lattices • The kernel of (M,b): • The set Λ = {i | Mi mod b = 0} is a full-dimensional lattice. • (M,b) is valid iff Λ ∩ DS ⊆ {0}, i.e., Λ is an admissible lattice for DS. • Conversely: • If A is a basis for Λ, an admissible integral lattice for DS, compute the Smith form A = Q1 S Q2 with Q1 and Q2 unimodular, S = diag(b). • The mapping (M,b), where M is the inverse of Q1, has kernel Λ, thus is a valid allocation with memory size det(S) = det(Λ). ⇒ The modular allocation with smallest memory size corresponds to a critical integer lattice for DS, i.e., an admissible integer lattice for DS with smallest determinant.
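The kernel lattice and its determinant can be checked numerically for the slide-8 example (a sketch of mine; the counting argument works here because (18,0) and (0,18) are themselves kernel points, so the box is an exact union of fundamental domains):

```python
# The kernel of (M, b) is the lattice L = { i | M i mod b = 0 }, and its
# determinant equals the memory size prod(b).  For the slide-8 example
# (size 18), a lattice of determinant 18 has exactly 18^2 / 18 = 18 points
# in the box [0, 18) x [0, 18).

M, b = [[2, 1], [1, 1]], (3, 6)

def in_kernel(i):
    return all(sum(m * x for m, x in zip(row, i)) % bk == 0
               for row, bk in zip(M, b))

points = [(x, y) for x in range(18) for y in range(18) if in_kernel((x, y))]
print(len(points))
```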

  23. Modular Mappings: Toy Example Critical lattice: basis (4,3), (8,0) ⇒ Corresponding allocation (3i-4j mod 24). Corners: (-1,5), (1,-5), (8,1), (-8,-1)

  24. Modular Mappings: Toy Example Bounding box: (i mod 9, j mod 6) ⇒ Size = 54. Corners: (-1,5), (1,-5), (8,1), (-8,-1)

  25. Modular Mappings: Toy Example Successive modulos: (i mod 9, j mod 5) ⇒ Size = 45. Corners: (-1,5), (1,-5), (8,1), (-8,-1)

  26. Modular Mappings: Toy Example Skewed bounding box: (i-j mod 8, j mod 6) ⇒ Size = 48. Corners: (-1,5), (1,-5), (8,1), (-8,-1)

  27. Modular Mappings: Toy Example Skewed successive modulos: (i-j mod 8, j mod 4) ⇒ Size = 32. Corners: (-1,5), (1,-5), (8,1), (-8,-1)

  28. Modular Mappings: Toy Example Better allocation: (i-j mod 7, j mod 4) ⇒ Size = 28. Corners: (-1,5), (1,-5), (8,1), (-8,-1)

  29. Modular Mappings: Toy Example Critical lattice: basis (4,3), (8,0) ⇒ Best allocation (3i-4j mod 24). Corners: (-1,5), (1,-5), (8,1), (-8,-1)

  30. Results for 0-Symmetric Convex Bodies • We work with a 0-symmetric polytope K such that DS ⊆ K. (Actually, we assume that the vector spaces generated by the points in K and by the integer points in K are equal: ⇒ K is full-dimensional.) • Lower bound in terms of volume: Vol(K)/2^n. • Optimal solution found by optimized enumeration + ILP. • Heuristics exist with memory size ≤ c_n Vol(K), where c_n depends on the dimension n only. ⇒ Guaranteed heuristics. • One heuristic uses exactly the Lefebvre-Feautrier mechanism, but in a well-chosen basis. Always equivalent (i.e., with the same memory size) to a particular linearization (= 1D mapping). • Another heuristic (Rogers’ principle) works even for arbitrary sets, but the equivalent linearization is not clear. • In practice: follow the schedule, when possible... Reference: Gruber and Lekkerkerker, Geometry of Numbers.

  31. Remarks on Critical Lattices • a) Hard to find the critical lattice, starting from 3D, even for simple bodies. b) Critical integer lattice ≈ critical lattice for large bodies. ⇒ Hard to find the optimal, heuristics needed. • Lower bound in terms of volume: Δ(K) ≥ Vol(K)/2^n. • If S-S ⊆ K, then all elements in S are mapped to different locations: ⇒ Δ(K) ≥ Card(S). • Minkowski’s first theorem: if Λ is a lattice and K is 0-symmetric with Vol(K) ≥ 2^n det(Λ), then K contains a nonzero lattice point of Λ. • Gauge function: F(x) = inf{λ > 0 | x ∈ λK} is a distance function. • Successive minima: λi(K) = inf{λ ≥ 0 | dim(Vect(λK ∩ Z^n)) ≥ i}. • Minkowski’s second theorem: (2^n/n!) det(Λ) ≤ λ1(K) … λn(K) Vol(K) ≤ 2^n det(Λ).
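For the toy example, the volume lower bound Vol(K)/2^n can be computed from the corners with the shoelace formula (a sketch of mine; it assumes the four corners, once put in cyclic order, bound the 0-symmetric quadrilateral):

```python
# Lower bound Vol(K)/2^n for the toy body K with corners
# (-1,5), (1,-5), (8,1), (-8,-1).  In cyclic order the vertices are
# (8,1), (1,-5), (-8,-1), (-1,5).

def shoelace(vertices):
    """Area of a simple polygon given its vertices in cyclic order."""
    area2 = 0
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:] + vertices[:1]):
        area2 += x0 * y1 - x1 * y0
    return abs(area2) / 2

K = [(8, 1), (1, -5), (-8, -1), (-1, 5)]
vol = shoelace(K)
lower = vol / 2 ** 2     # Vol(K) / 2^n with n = 2

print(vol, lower)        # the critical lattice of the slides has det 24 >= lower
```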

  32. Looking for the Optimal Solution • Generate all possible lattices of a given determinant: • Avoid duplicates: each lattice is uniquely determined by its Hermite form (triangular matrix). (Remark: it is not clear we could do the same for non-equivalent mappings without reasoning with the corresponding lattices.) • Check that the lattice is admissible for K, either by ILP, or by enumeration if the integer points in K can be enumerated. • For the DCT example: • in 4D, optimal = 112; there are 86,416,644 lattices to check, it takes roughly 2 days! • rewritten in 3D, optimal = 112; there are 941,901 lattices to check, it takes roughly 30 minutes. ⇒ Feasible only for small sets K and small dimensions.
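The duplicate-free enumeration via Hermite forms can be sketched in 2-D, where the number of sublattices of Z² of determinant d is the sum of divisors of d (a sketch of mine, not the implementation used for the DCT experiment):

```python
# Enumerate the sublattices of Z^2 of determinant d without duplicates via
# Hermite forms: row bases [[a, c], [0, bb]] with a*bb = d and 0 <= c < bb.
# Each sublattice has exactly one such canonical triangular basis, so in
# 2-D the count is sigma(d), the sum of the divisors of d.

def hermite_forms_2d(d):
    forms = []
    for a in range(1, d + 1):
        if d % a == 0:
            bb = d // a
            for c in range(bb):          # off-diagonal reduced modulo bb
                forms.append(((a, c), (0, bb)))
    return forms

print(len(hermite_forms_2d(6)))   # sigma(6) = 1 + 2 + 3 + 6
```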

  33. Rogers’ Heuristic Adapted • Choose n positive integers ρ1, …, ρn such that ρi is a multiple of ρi+1 and dim(Li) ≤ i-1, where Li = Vect(K/ρi ∩ Z^n). • Choose a basis (a1, …, an) of Z^n such that Li ⊆ Vect(a1, …, ai-1). • Define Λ as the lattice generated by the vectors ρi ai. ⇒ det(Λ) ≤ n! Vol(K)

  34. Heuristic Based on K (i.e., on the Lattice) • Choose n linearly independent integer vectors (a1, …, an). • Compute Fi(ai) = inf{F(y) | y ∈ ai + Vect(a1, …, ai-1)}. • Choose n integers ρi such that ρi Fi(ai) > 1. • Define Λ as the lattice generated by the vectors ρi ai. ⇒ det(Λ) ≤ (n!)² Vol(K) if Fi(ai) ≤ 1 for all i

  35. Heuristic Based on K* (i.e., on the Mapping) • K* = dual (or polar reciprocal) of K = {y | y·x ≤ 1 for all x ∈ K}. • K** = K, F* related to F, Vol(K) related to Vol(K*), successive minima related, etc. • Choose n linearly independent integer vectors (c1, …, cn). • Compute F*i(ci) = sup{ci·x | x ∈ K, c1·x = … = ci-1·x = 0}. • Choose n integers ρi such that ρi > F*i(ci). • Define the mapping (M,b) with the ci as the rows of M and b = ρ. ⇒ det(Λ) ≤ (n!)² Vol(K) if F*i(ci) ≥ 1 for all i ⇒ Dual of the previous heuristic. Exactly Lefebvre-Feautrier in a well-chosen basis.

  36. Important Practical Factors The set DS can be skewed for 3 reasons: • Skewed iteration sub-domain with respect to the full domain. • Skewed schedule with respect to the iteration domain. • Skewed access function when reasoning with array indices. ⇒ In practice, “following” the schedule -- if it is expressed as a basis -- is not too bad. ⇒ But ad-hoc counter-examples can be built, and the schedule basis may be hidden in a “linearized” schedule.

  37. Open or On-Going Questions • How much do we lose if we restrict to 1D mappings? • How much do we lose, when restricting to modular mappings, compared to MAXLIVE? • Mixing both Lefebvre-Feautrier (successive modulos) and Quilleré-Rajopadhye (choice of basis) is often OK (i.e., follow the schedule and wrap…). Can we quickly identify when? • How costly and how good are the heuristics in practice? • How to handle more general cases (unions of polyhedra for conflicting differences, multiple arrays, etc.)? • Can this be used as a basis for solving the general problem (i.e., finding the schedule with minimal memory requirements)? • Fully implemented in Cl@K: parameters still in progress…
