Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology. F. Hannig, H. Dutta, W. Tichy, and Jürgen Teich, University of Erlangen-Nuremberg, Germany.

Presentation Transcript


  1. Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology
  F. Hannig, H. Dutta, W. Tichy, and Jürgen Teich, University of Erlangen-Nuremberg, Germany
  Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)
  Presented by: Luis Ortiz, Department of Computer Science, The University of Texas at San Antonio

  2. Outline
  • Overview
  • The Problem
  • Reconfigurable Architectures
  • Design Flow for Regular Mapping
  • Parallelizing Transformations
  • Constraints Related to CG Reconfigurable Arrays
  • Case Study
  • Results
  • Conclusions and Future Work

  3. Overview
  • Constructing a parallel program is equivalent to specifying its execution order
    • the operations of a program form a set, and the execution order is a binary, transitive, and asymmetric relation on it
    • the relevant sets are (unions of) Z-polytopes
    • most optimizations can be presented as transformations of the original program
  • The problem of automatic parallelization (a legality sketch follows this slide)
    • given a set of operations E and a strict total order on it
    • find a partial order on E such that executing E under it is determinate and gives the same results as the original program
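
A minimal formal sketch of the requirement behind this problem (my addition, not from the slides, using the notation of the bullet above): the new partial order is legal only if it preserves every producer-consumer relation of the original execution order.

```latex
% Sketch in assumed notation: E is the set of operations, < the original strict
% total order, and \prec the new partial order. Legality requires that whenever
% an operation u produces a value that v later reads, their relative order is kept:
\forall\, u, v \in E:\quad
  \big(u < v \;\wedge\; u \text{ writes a value read by } v\big)
  \;\Longrightarrow\; u \prec v
```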

  4. Overview (cont.)
  • Defining a polyhedron
    • a polyhedron is described by a set of linear inequalities Ax + a ≥ 0; it is the set of all x that satisfy these inequalities
    • the basic property of a polyhedron is convexity: if two points a and b belong to a polyhedron, then so do all their convex combinations λa + (1 – λ)b, 0 ≤ λ ≤ 1
    • a bounded polyhedron is called a polytope
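
A small worked example (added for illustration, with N a fixed bound): the triangle { (i, j) | 0 ≤ j ≤ i ≤ N } written in the form Ax + a ≥ 0.

```latex
% Illustrative example (not from the slides): the triangle
% { (i,j) | 0 <= j <= i <= N }, with x = (i, j)^T, is given by
A = \begin{pmatrix} -1 & 0 \\ 0 & 1 \\ 1 & -1 \end{pmatrix},
\qquad
a = \begin{pmatrix} N \\ 0 \\ 0 \end{pmatrix},
\qquad
A x + a \geq 0
\;\Longleftrightarrow\;
i \leq N,\;\; j \geq 0,\;\; j \leq i
% Being bounded (for fixed N), this polyhedron is a polytope.
```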

  5. Overview (cont.)
  • The essence of the polytope model is to apply affine transformations to the iteration spaces of a program
    • the iteration domain of statement S is Dom(S) = {x | D_S x + d_S ≥ 0}
    • D_S and d_S are the matrix and constant vector that define the iteration polytope; d_S may depend linearly on the structure parameters (see the worked loop nest below)
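
A worked loop nest (my example, using the slide's notation) showing how D_S and d_S encode a triangular iteration domain and how d_S carries the structure parameter N:

```latex
% Illustrative example: for the nest
%   for (i = 0; i < N; i++)
%     for (j = 0; j <= i; j++)
%       S;
% the iteration domain Dom(S) = { x | D_S x + d_S >= 0 } with x = (i, j)^T is
D_S = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 1 & -1 \end{pmatrix},
\qquad
d_S = \begin{pmatrix} 0 \\ N-1 \\ 0 \\ 0 \end{pmatrix}
% i.e. 0 <= i <= N-1 and 0 <= j <= i; d_S depends linearly on the parameter N.
```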

  6. Overview (cont.)
  • Coarse-grained reconfigurable architectures
    • provide the flexibility of software combined with the performance of hardware
    • but hardware complexity is a problem due to the lack of mapping tools
  • Parallelization techniques and compilers
    • map computationally intensive algorithms efficiently onto coarse-grained reconfigurable arrays

  7. The Problem
  “Mapping a certain class of regular nested loop programs onto a dedicated processor array”

  8. Reconfigurable Architectures
  • Span a wide range of abstraction levels
    • from fine-grained Look-Up Table (LUT) based reconfigurable logic devices to distributed and hierarchical systems with heterogeneous reconfigurable components
  • Efficiency comparison
    • standard arithmetic is less efficient on fine-grained architectures due to the large routing-area overhead
  • Little research work deals with compilation to coarse-grained reconfigurable architectures

  9. Design Flow for Regular Mapping

  10. Design Flow for Regular Mapping (cont.)
  • A piecewise regular algorithm contains N quantified equations
    • each equation S_i[I] is of the form sketched below
    • the x_i[I] are indexed variables
    • the f_i are arbitrary functions
    • the d_ji ∈ ℤ^n are constant data dependence vectors, and the ellipses (…) denote similar arguments
    • the I_i are called index spaces
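
The equation itself appears only as an image in the original presentation; the following is a reconstruction in the usual piecewise regular algorithm notation, consistent with the bullet definitions above (a sketch, not a verbatim copy of the slide):

```latex
% Reconstructed general form of one quantified equation S_i
% (standard piecewise regular algorithm notation):
S_i:\quad x_i[I] \;=\; f_i\big(\ldots,\; x_j[I - d_{ji}],\; \ldots\big)
\qquad \forall\, I \in \mathcal{I}_i,\;\; 1 \leq i \leq N
% The ellipses stand for similar arguments, the d_{ji} \in \mathbb{Z}^n are the
% constant data dependence vectors, and \mathcal{I}_i is the index space of S_i.
```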

  11. Design Flow for Regular Mapping (cont.)
  • Linearly bounded lattice
    • a set of integer points is affinely mapped onto the iteration vectors I (the standard definition is sketched below)
  • Block pipelining period (β)
    • the time interval between the initiations of two successive problem instances
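
The defining set is not reproduced in the transcript; as a hedged sketch, the standard definition of a linearly bounded lattice (which I assume is the one intended) is:

```latex
% Assumed standard definition of a linearly bounded lattice:
\mathcal{I} \;=\; \{\, I \in \mathbb{Z}^n \;\mid\; I = M\kappa + c,\;\; A\kappa \geq b,\;\; \kappa \in \mathbb{Z}^m \,\}
% i.e. the integer points kappa of the polyhedron { kappa | A kappa >= b } are
% mapped affinely (kappa -> M kappa + c) onto the iteration vectors I.
```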

  12. Parallelizing Transformations
  • Based on the representation of equations and index spaces, several combinations of parallelizing transformations in the polytope model can be applied:
    • Affine Transformations
    • Localization
    • Operator Splitting
    • Exploration of Space-Time Mappings (sketched after this list)
    • Partitioning
    • Control Generation
    • HDL Generation & Synthesis
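
As a sketch of the space-time mapping exploration step (my notation, under the usual assumptions of the polytope model): an affine space-time mapping assigns each iteration point a processor coordinate and a start time.

```latex
% Sketch of an affine space-time mapping (assumed notation):
\begin{pmatrix} p \\ t \end{pmatrix}
\;=\;
\begin{pmatrix} Q \\ \lambda \end{pmatrix} I,
\qquad Q \in \mathbb{Z}^{(n-1) \times n},\;\; \lambda \in \mathbb{Z}^{1 \times n}
% Q (allocation) projects the n-dimensional index space onto the processor array;
% lambda (schedule) must respect every dependence, e.g. lambda * d_ji >= 1 for
% unit-delay operations, so operands are produced before they are consumed.
```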

  13. Constraints Related to CG Reconfigurable Arrays
  • Coarse-grained (re)configurable architectures consist of the following (a rough summary record is sketched after this list):
    • an array of processor elements (PEs), each with
      • one or more dedicated functional units, or
      • one or more arithmetic logic units (ALUs)
      • memory
        • local memory → register files
        • memory banks
        • an instruction memory is required if the PE contains an instruction-programmable ALU
    • interconnect structures
    • I/O ports
    • synchronization and reconfiguration mechanisms
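
Purely as an illustration of this checklist (all field names are hypothetical and do not correspond to any real device API), the PE resources above could be captured in a configuration record like the following:

```c
/* Illustrative sketch only: a record of the PE resources enumerated above.
 * Field names are hypothetical; they do not describe any real architecture's API. */
#include <stddef.h>

typedef struct {
    unsigned num_alus;            /* instruction-programmable ALUs or dedicated FUs  */
    unsigned num_registers;       /* local memory realized as a register file        */
    unsigned num_memory_banks;    /* attached memory banks                           */
    size_t   memory_bank_bytes;   /* size of each memory bank                        */
    size_t   instr_memory_bytes;  /* > 0 only if the ALU is instruction-programmable */
    unsigned num_io_ports;        /* links into the interconnect structure           */
    unsigned reconfig_cycles;     /* cost of the reconfiguration mechanism           */
} pe_config;
```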

  14. Case Study
  • Regular mapping methodology applied to a matrix multiplication algorithm
  • Target architecture: PACT XPP64-A reconfigurable processor array
    • 64 ALU-PAEs of 24-bit data width in an 8x8 array
    • each ALU-PAE consists of three objects:
      • the ALU object
      • a Back-Register object (BREG)
      • a Forward-Register object (FREG)
    • all objects are connected to horizontal routing channels

  15. Case Study (cont.)
  • RAM-PAEs are located in two columns at the left and right borders of the array, with two ports for independent read/write operations
  • the RAM can be configured to FIFO mode
  • each RAM-PAE has a 512x24-bit storage capacity
  • four independent I/O interfaces are located in the corners of the array

  16. Case Study (cont.)
  Figures: structure of the PACT XPP64-A reconfigurable processor; ALU-PAE objects

  17. Case Study (cont.)
  • Matrix multiplication algorithm: C = A * B, with A ∈ ℤ^{N×N} and B ∈ ℤ^{N×N} (see the sketch below)
  • the computations may be represented by a dependence graph (DG)
  • dependence graphs can be represented in a reduced form
    • reduced dependence graph: to each edge e = (v_i, v_j) there is an associated dependence vector d_ij ∈ ℤ^n
  • virtual processor elements (VPEs) are used to map the PEs obtained from the design flow onto the given architecture
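
For concreteness, here is the underlying computation together with a localized, single-assignment form of the kind produced by the localization step. This is a standard textbook formulation, not copied from the slides; the variable names a, b, z and the 1-based indexing are my assumptions.

```latex
% C = A * B with A, B in Z^{N x N}:
c_{i,j} \;=\; \sum_{k=1}^{N} a_{i,k}\, b_{k,j}, \qquad 1 \leq i, j \leq N
% Localized single-assignment form with only constant (uniform) dependence vectors:
a[i,j,k] = a[i,j-1,k], \quad a[i,0,k] = a_{i,k}     % a-values propagated along j
b[i,j,k] = b[i-1,j,k], \quad b[0,j,k] = b_{k,j}     % b-values propagated along i
z[i,j,k] = z[i,j,k-1] + a[i,j,k] \cdot b[i,j,k], \quad z[i,j,0] = 0, \quad c_{i,j} = z[i,j,N]
```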

  18. Case Study (cont.)
  Figures: matrix multiplication algorithm, C code; the algorithm after parallelization, operator splitting, embedding, and localization (a generic version of the untransformed C code is sketched below)
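
The C code on this slide is only an image in the transcript. A generic, untransformed triple-loop version of the algorithm it refers to (a sketch, assuming 4 x 4 integer matrices) would look like this:

```c
/* Plain matrix multiplication C = A * B for A, B in Z^(N x N).
 * Generic reconstruction of the untransformed C code referred to on the slide;
 * the slide image itself is not reproduced in the transcript. */
#define N 4

void matmul(const int A[N][N], const int B[N][N], int C[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++) {
                /* the accumulation along k is the recurrence that localization
                 * later turns into a uniform dependence of length 1 */
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```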

  19. Case Study (cont.)
  Figures: DG of the transformed matrix multiplication algorithm (N = 2); 4 x 4 processor array; reduced dependence graph

  20. Case Study (cont.)
  • Output data
    • O_x is the output-variable space of variable x of the space-time mapped or partitioned index space
    • the output can be two-dimensional
    • the transformed output variables are distributed over the entire array
    • the data are collected from one processor line PL and fed out to an array border
    • m ∈ ℤ^{1×n}, and t ∈ T_x(P_{i,j}) denotes the time instances at which variable x produces an output at processor element P_{i,j}

  21. Case Study (cont.)
  • if one of the following conditions holds, the output data can be serialized

  22. Case Study (cont.)
  • Partitioned implementation of the matrix multiplication algorithm
  Figures: dataflow graph of the LPGS-partitioned matrix multiplication (4 x 4 example); dataflow graph after performing localization inside each tile; array implementation of the partitioned example

  23. Results
  • Both implementations (full-size and partitioned) show optimal utilization of resources
  • Each configured MAC unit performs one operation per cycle
  • With fewer resources and a better implementation, more performance per cycle can be achieved
  • The number of ALUs is reduced from O(3N) to O(N)
  • Merging and writing of the output data streams is overlapped with the computations in the PEs

  24. Conclusions and Future Work
  • The mapping methodology based on loop parallelization in the polytope model provides results that are efficient in terms of resource utilization and execution time
  • Future work focuses on the automatic compilation of nested loop programs
