
Hierarchically Tiled Arrays

Presented by Kenneth Detweiler.


Presentation Transcript


  1. Hierarchically Tiled Arrays • Presented by Kenneth Detweiler

  2. Overview • Design and Creation • Background • Parallel Operators • Tiling • Locality • HTA • Illustrated • Creation • Data Layout • Accessing • Implementation • Sparse Matrix • Cannon Algorithm • Estimate pi using Monte Carlo method • MATLAB • Internal Structure • What it does • C++ Implementation • Logical Index Space • HTA Class • Machine Mapping • Operator Framework • C++ Optimizations • Automatic Memory Management • Template Class • Specialized Methods • Inlined Hot Methods • Lazy Evaluation • Relaxation of serial evaluation • Current State of HTA • My Thoughts

  3. Design and Creation • Developed in 2006 by: • University of A Coruña (UDC): A Coruña, Spain • University of Illinois at Urbana-Champaign (UIUC): Urbana, Illinois • IBM: Yorktown Heights, NY

  4. Background • Data type based on higher-level data-parallel operators • Hierarchical tiling • HTA makes tiles part of the programming language • Tiling helps control locality and data distribution • Tiles are referenced explicitly in the program • Operators are extended to work on tiles • C++ library implementation • Distributed memory using MPI • Shared memory using TBB (Intel Threading Building Blocks)

  5. Parallel Operators • To exploit the power of a supercomputer, developers usually resort to low-level parallel constructs • This is both time-consuming and error-prone • The solution is parallel operators, which exist simultaneously on all processors involved in the distributed computation • Each operator acts as a single entity capable of processing shared data in parallel

  6. Hierarchical Tiling • Arrays are partitioned into tiles • Exploits parallelism and locality at all levels of the memory hierarchy • Can be represented iteratively or recursively • Decided during program design • Cuts communication costs in parallel programs by organizing computations around tiles

  7. Locality • Also known as locality of reference or the principle of locality • Describes the tendency of a program to access the same value, or related storage locations, frequently • Two major types: • Temporal: if a particular memory location is referenced, it is likely that the same location will be referenced again in the near future • Spatial: if a particular memory location is referenced, it is likely that nearby memory locations will also be referenced

  8. HTA, Illustrated

  9. HTA, Creation • Need: • Source array • Series of delimiters • Function: • hta(MATRIX, {[delim1],[delim2],…}, [processor mesh]) • Example: • h = hta(a, {[1,4],[1,3]}, [2,2]);
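How a series of delimiters turns into tiles can be sketched in a few lines of C++. This is only an illustration under one assumption: each delimiter is taken to be the first (1-based) index of a tile along that dimension. The helper `tile_ranges` is hypothetical and is not part of the HTA toolbox or htalib.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not an HTA API.
// Splits the 1-based index range [1, extent] into tiles, assuming every
// delimiter is the first index of a tile and the delimiters are sorted.
// Returns one inclusive (first, last) pair per tile.
std::vector<std::pair<int, int>> tile_ranges(int extent,
                                             const std::vector<int>& starts) {
    assert(!starts.empty() && starts.front() == 1);
    std::vector<std::pair<int, int>> ranges;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        const int first = starts[i];
        // A tile ends just before the next delimiter, or at the array edge.
        const int last = (i + 1 < starts.size()) ? starts[i + 1] - 1 : extent;
        assert(first <= last);
        ranges.emplace_back(first, last);
    }
    return ranges;
}
```

Under that assumption, a 6-row array with row delimiters [1,4] yields the row ranges 1..3 and 4..6, and column delimiters [1,3] yield 1..2 and 3..6, which together give a 2×2 grid of tiles.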

  10. HTA, Data Layout • Innermost tiles are the leaf tiles • Leaf tile layouts: • ROW • COLUMN • TILE: stores the elements of a tile in contiguous memory locations • The memory mapping of an HTA is determined by: • How tiles are allocated • The memory layout of the tiles
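The difference between a plain row-major layout and a tile-contiguous layout can be made concrete with an address calculation. The sketch below assumes uniform tiles, 0-based indices, tiles stored one after another in row-major tile order, and row-major order inside each tile. The names `TiledShape`, `row_layout_offset` and `tile_layout_offset` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a uniformly tiled 2D array.
struct TiledShape {
    std::size_t rows, cols;            // full array extent
    std::size_t tile_rows, tile_cols;  // extent of every leaf tile
};

// Offset of element (r, c) when the whole array is stored row-major.
std::size_t row_layout_offset(const TiledShape& s, std::size_t r, std::size_t c) {
    assert(r < s.rows && c < s.cols);
    return r * s.cols + c;
}

// Offset of element (r, c) when each tile occupies contiguous memory.
std::size_t tile_layout_offset(const TiledShape& s, std::size_t r, std::size_t c) {
    assert(r < s.rows && c < s.cols);
    assert(s.rows % s.tile_rows == 0 && s.cols % s.tile_cols == 0);
    const std::size_t tiles_per_row = s.cols / s.tile_cols;
    const std::size_t tile_index =
        (r / s.tile_rows) * tiles_per_row + (c / s.tile_cols);
    const std::size_t within_tile =
        (r % s.tile_rows) * s.tile_cols + (c % s.tile_cols);
    return tile_index * (s.tile_rows * s.tile_cols) + within_tile;
}
```

For a 4×4 array with 2×2 tiles, element (1,1) sits at offset 5 in the row-major layout but at offset 3 in the tiled layout, because all four elements of the first tile come first.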

  11. HTA, Accessing • Tiles: • C{2,2} selects a tile with curly braces, here the lower-right quadrant • Elements: • C(2,2) selects a scalar element with parentheses

  12. Parallel Programming using HTA: Sparse Matrix
      a = hta(MX, {dist}, [P 1]);
      b = hta(P, 1, [P 1]);
      b{:} = V;
      r = a * b;
      • Multiplies a sparse matrix MX by a dense vector V using P processors
      • Requires communication only between the client and each server
      • The constructor distributes MX in chunks of rows into an HTA
      • The P servers handling the HTA are organized into a single column
      • The dist argument distributes MX so that the computation is uniform across the servers
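The row-chunk idea behind this product can be sketched serially in C++. The sketch assumes a CSR (compressed sparse row) matrix and splits it into equally sized chunks of rows, one per "server"; the slide's dist argument would instead balance the work, which is not modeled here. `Csr`, `multiply_rows` and `multiply_in_chunks` are hypothetical names, and the loop over servers runs sequentially rather than in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR storage: row i owns entries row_ptr[i] .. row_ptr[i+1]-1.
struct Csr {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;
    std::vector<std::size_t> col;
    std::vector<double> val;
};

// The work one server would do: multiply rows [first, last) by v.
void multiply_rows(const Csr& m, const std::vector<double>& v,
                   std::size_t first, std::size_t last,
                   std::vector<double>& result) {
    for (std::size_t i = first; i < last; ++i) {
        double acc = 0.0;
        for (std::size_t k = m.row_ptr[i]; k < m.row_ptr[i + 1]; ++k) {
            acc += m.val[k] * v[m.col[k]];
        }
        result[i] = acc;
    }
}

// Splits the rows into `servers` chunks and processes them one by one.
std::vector<double> multiply_in_chunks(const Csr& m, const std::vector<double>& v,
                                       std::size_t servers) {
    assert(servers > 0 && m.row_ptr.size() == m.rows + 1);
    std::vector<double> result(m.rows, 0.0);
    const std::size_t chunk = (m.rows + servers - 1) / servers;
    for (std::size_t p = 0; p < servers; ++p) {
        const std::size_t first = std::min(p * chunk, m.rows);
        const std::size_t last = std::min(first + chunk, m.rows);
        multiply_rows(m, v, first, last, result);
    }
    return result;
}
```

Each chunk writes a disjoint slice of the result, which is why the servers do not need to talk to each other for this operation.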

  13. Parallel Programming using HTA: Cannon Algorithm
      for i = 1:n
        c = c + a * b;
        a = circshift(a, [0, -1]);
        b = circshift(b, [-1, 0]);
      end
      • Requires communication between the client and the servers, and also among the servers
      • In each iteration of the loop, each server multiplies the tiles of a and b that reside on it
      • The result of the multiplication is accumulated in a local tile of the HTA c
      • Tiles of a are circularly shifted along the second dimension, and tiles of b along the first dimension
      • Thus tiles of a are sent to the processor on the left in the mesh, and tiles of b to the processor above
      • The leftmost processor sends its tiles of a to the rightmost processor in its row, and the topmost processor sends its tiles of b to the bottommost processor in its column
      • The end result is that HTA c = a*b
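The shifting pattern can be checked with a small serial simulation. The sketch below makes two simplifications that are assumptions of this example: every tile is a single scalar, so the mesh is the matrix itself, and all "servers" are visited in ordinary loops. It also includes the initial alignment step that Cannon's algorithm needs before the main loop, which the slide's loop does not show. The function name `cannon` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// Serial simulation of Cannon's algorithm on a q x q mesh with scalar tiles.
Grid cannon(Grid a, Grid b) {
    const std::size_t q = a.size();
    assert(b.size() == q);
    for (std::size_t i = 0; i < q; ++i) assert(a[i].size() == q && b[i].size() == q);

    Grid c(q, std::vector<double>(q, 0.0));
    Grid next_a = a, next_b = b;

    // Initial alignment: row i of a moves left by i, column j of b moves up by j.
    for (std::size_t i = 0; i < q; ++i)
        for (std::size_t j = 0; j < q; ++j) {
            next_a[i][j] = a[i][(j + i) % q];
            next_b[i][j] = b[(i + j) % q][j];
        }
    a.swap(next_a);
    b.swap(next_b);

    for (std::size_t step = 0; step < q; ++step) {
        // Local multiply-accumulate on every mesh position.
        for (std::size_t i = 0; i < q; ++i)
            for (std::size_t j = 0; j < q; ++j) c[i][j] += a[i][j] * b[i][j];
        // circshift(a, [0, -1]) moves tiles left; circshift(b, [-1, 0]) moves them up.
        for (std::size_t i = 0; i < q; ++i)
            for (std::size_t j = 0; j < q; ++j) {
                next_a[i][j] = a[i][(j + 1) % q];
                next_b[i][j] = b[(i + 1) % q][j];
            }
        a.swap(next_a);
        b.swap(next_b);
    }
    return c;
}
```

After the alignment, position (i, j) holds a[i][m] and b[m][j] for the same m at every step, and the shifts walk m through all q values, so the accumulated sum is the ordinary matrix product.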

  14. Parallel Programming using HTA: Estimate pi using Monte Carlo method
      input = hta(P, 1, [P 1]);
      input{:} = eP;
      output = parHTAFunc(@estimatePi, input);
      myPi = mean(output(:));
      function r = estimatePi(n)
        x = randx(1, n);
        y = randx(1, n);
        pos = x .* x + y .* y;
        r = sum(pos < 1) / n * 4;
      • A distributed HTA input with one tile per processor is built
      • Tiles are filled with eP, the number of experiments to run per processor
      • Experiments are run on each processor by the function estimatePi()
      • The result of the parallel execution is a distributed HTA output that has the same mapping as input and keeps a single tile per processor with the local estimate of pi
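The same computation can be written as a serial C++ sketch, with one call per "tile" standing in for the per-processor work and a final mean over the local estimates. The names `estimate_pi` and `estimate_pi_over_tiles` are hypothetical, and the use of `std::mt19937` with a per-tile seed is an assumption of this example, not something the slide specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Local estimate: fraction of n random points in the unit square that fall
// inside the quarter circle, multiplied by 4.
double estimate_pi(std::size_t n, unsigned seed) {
    assert(n > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const double x = unit(gen);
        const double y = unit(gen);
        if (x * x + y * y < 1.0) ++inside;
    }
    return 4.0 * static_cast<double>(inside) / static_cast<double>(n);
}

// One estimate per tile (processor), then the mean of the local estimates.
double estimate_pi_over_tiles(std::size_t tiles, std::size_t per_tile) {
    assert(tiles > 0);
    double sum = 0.0;
    for (std::size_t p = 0; p < tiles; ++p) {
        // A different seed per tile keeps the experiments independent.
        sum += estimate_pi(per_tile, static_cast<unsigned>(p) + 1u);
    }
    return sum / static_cast<double>(tiles);
}
```

Because the tiles never exchange data, this is the easiest of the three examples to parallelize: only the final mean needs all the results.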

  15. Implementation • MATLAB (matrix laboratory) • A numerical computing environment and fourth-generation programming language • Started as a wrapper around Fortran libraries

  16. MATLAB – Internal Structure • MATLAB is used on the client, where the code is executed • MATLAB is also used on the servers, as the computational engine for the distributed operations on HTAs

  17. MATLAB – Internal Structure (cont) • All communication is done through MPI • The lower layers of the HTA toolbox take care of communication requirements • The higher layers implement the syntax expected by MATLAB users • HTA programs have a single thread that is interpreted and executed by the MATLAB client • HTAs are just objects within the environment of the interpreter

  18. MATLAB – Internal Structure (cont) • When an HTA is local, it is not distributed on the array servers • The client keeps both the structure and the content of the HTA • When an HTA is distributed • The client holds the structure of the HTA at all its levels • The client keeps all information about the mapping of the top-level tiles onto the mesh of servers • Whether the HTA is local or distributed, the client is always able to: • Test the legality of an operation • Calculate the structure and mapping of the output HTA • Send messages that encode the command and its arguments to the servers

  19. MATLAB • Is a linear algebra language with a large base of users who write scientific code • HTAs let these users harness a cluster of workstations instead of a single machine • Is polymorphic • HTAs can substitute for regular arrays with almost no changes to the rest of the code, adding parallelism painlessly • Is designed to be extensible • Third-party developers can provide so-called toolboxes of functions for specialized purposes • Provides a native method interface called MEX • Allows functions to be implemented in languages like C or Fortran

  20. C++ Implementation • HTAs can be used in C++ through the htalib library • HTAs are represented as composite objects with methods that operate on both distributed and sequential HTAs • Two communication layers are available: MPI and UPC • The implementation follows the SPMD execution model, while the programming model is still single-threaded • Core data structures of htalib • Logical index space • HTA class • Machine mapping • Operator framework

  21. C++ Implementation: Logical Index Space • Classes used to define the index space and tiling of an HTA • Tuple<N>: an N-dimensional index value • Triplet: a 1D range with optional stride (low:high:step) • Region<N>: an N-dimensional rectangular index space spanned by N triplets
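To make the three index-space concepts concrete, here is a minimal C++ sketch of what such types could look like. These are hypothetical stand-ins written for this transcript: they reuse the names from the slide, but the members and signatures are assumptions and htalib's real classes may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// An N-dimensional index value.
template <std::size_t N>
using Tuple = std::array<int, N>;

// A 1D range low:high:step with an optional stride (default 1).
struct Triplet {
    int low, high, step;
    Triplet(int lo, int hi, int st = 1) : low(lo), high(hi), step(st) {
        assert(st > 0 && lo <= hi);
    }
    // Number of indices the range visits.
    std::size_t size() const {
        return static_cast<std::size_t>((high - low) / step) + 1;
    }
    bool contains(int i) const {
        return i >= low && i <= high && (i - low) % step == 0;
    }
};

// An N-dimensional rectangular index space spanned by N triplets.
template <std::size_t N>
struct Region {
    std::array<Triplet, N> dims;

    std::size_t size() const {
        std::size_t n = 1;
        for (const Triplet& t : dims) n *= t.size();
        return n;
    }
    bool contains(const Tuple<N>& index) const {
        for (std::size_t d = 0; d < N; ++d) {
            if (!dims[d].contains(index[d])) return false;
        }
        return true;
    }
};
```

With these definitions, the region spanned by 0:9 and 0:8:2 contains 10 × 5 = 50 index tuples; (3,4) is one of them, while (3,5) is not because 5 is skipped by the stride.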

  22. C++ Implementation: HTA Class • Defines an HTA with scalar elements of type T and N dimensions • Implements scalar access (operator[]) and tile access (operator()) • Built-in array operations • transpose • permute • dpermute • reduce

  23. C++ Implementation: Machine Mapping • Specifies where the HTA is allocated in a distributed system • Also specifies the memory layout of the scalar data array • Allocation is captured by instances of the class distribution, which specify the home location of the scalar data for each tile of an HTA • Memory mapping • Specifies the layout (row-major across tiles, row-major per tile) • Specifies the size and stride of the flat array data underlying the HTA

  24. C++ Implementation: Operator Framework • htalib provides a powerful operator framework that follows the design of the STL operator classes • Consists of routines that evaluate specific operators on HTAs, plus base classes • Serves as a foundation for user-defined operators

  25. C++ Optimizations • HTAs are represented as composite objects • Methods are available for both distributed and local HTAs • Two implementations: MPI and UPC • Performance optimizations • Automatic memory management • Template class • Specialized methods • Inlined hot methods • Lazy evaluation • Relaxation of serial evaluation semantics

  26. C++ Optimizations: Automatic Memory Management • HTAs are allocated on the heap through factory methods, which enables automatic memory management • The factory methods return a handle, which is assigned to a stack-allocated variable • All access occurs through this handle • Once all handles to an HTA disappear from the stack, the HTA and its related structures are automatically deleted from memory
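The handle scheme described here behaves like reference counting. The sketch below illustrates that behavior with `std::shared_ptr` from the standard library; this is a swapped-in technique chosen for the example, since the slide does not say how htalib implements its handles. `TileStorage`, `Handle` and `alloc` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical heap-allocated storage standing in for an HTA.
struct TileStorage {
    static int live;  // number of storages currently alive
    std::vector<double> data;

    explicit TileStorage(std::size_t n) : data(n, 0.0) { ++live; }
    ~TileStorage() { --live; }
    TileStorage(const TileStorage&) = delete;
    TileStorage& operator=(const TileStorage&) = delete;
};
int TileStorage::live = 0;

// Stack-allocated handle; copying it shares the same storage.
using Handle = std::shared_ptr<TileStorage>;

// Factory method: the only way client code creates storage.
Handle alloc(std::size_t n) { return std::make_shared<TileStorage>(n); }
```

Copying a `Handle` adds another owner of the same storage, and the storage is destroyed only when the last handle goes out of scope, so client code never calls `delete`.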

  27. C++ Optimizations: Template Class • Templates are used in htalib to handle data with different types and dimensions • They provide flexibility and opportunities for optimization at compile time • The element type of an HTA can be any built-in or user-defined type

  28. C++ Optimizations: Specialized Methods • Methods are optimized and, whenever possible, specialized for specific cases • For example, a specialized method that avoids multiplication by the stride is invoked when the data being accessed is known to be stored in consecutive locations
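The stride example can be illustrated with a pair of loops and a dispatcher. This is a minimal sketch of the general idea only: the function names are hypothetical, and the choice is made at run time here, whereas a library could also make it at compile time.

```cpp
#include <cassert>
#include <cstddef>

// General case: every access multiplies the index by the stride.
double sum_strided(const double* data, std::size_t count, std::size_t stride) {
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i) total += data[i * stride];
    return total;
}

// Specialized case: elements are consecutive, so no multiplication is needed.
double sum_contiguous(const double* data, std::size_t count) {
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i) total += data[i];
    return total;
}

// Picks the specialized loop when the data is known to be consecutive.
double sum_elements(const double* data, std::size_t count, std::size_t stride) {
    assert(stride > 0);
    return stride == 1 ? sum_contiguous(data, count)
                       : sum_strided(data, count, stride);
}
```

Both paths return the same value for a stride of 1; the specialized one simply does less index arithmetic per element.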

  29. C++ Optimizations: Inlined Hot Methods / Lazy Evaluation • Inlined hot methods • Inlining is applied to methods that are used frequently • The tile access functions and scalar access functions are carefully inlined to reduce the overhead of function calls • Lazy evaluation • In an HTA assignment whose RHS has more than one operand, htalib uses lazy evaluation to avoid or reduce the temporaries generated by one or more binary operations • If the LHS and RHS have no data dependence, the RHS is evaluated directly into the LHS
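One common C++ technique for this kind of lazy evaluation is expression templates, sketched below for an element-wise sum. The slide does not say that htalib uses expression templates, so treat this as an illustration of the idea, not as the library's implementation. `Sum` and `Vec` are hypothetical types, and the `allocations` counter exists only to show that no temporary array is created.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unevaluated element-wise sum; holds references only, no data of its own.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    static int allocations;  // counts arrays allocated by the constructor below
    std::vector<double> data;

    Vec(std::size_t n, double value) : data(n, value) { ++allocations; }
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Evaluates the whole expression in one loop, straight into this array.
    template <typename L, typename R>
    Vec& operator=(const Sum<L, R>& expr) {
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};
int Vec::allocations = 0;

// Building an expression allocates nothing; it only records the operands.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) { return {a, b}; }
```

Writing `d = a + b + c` with these types runs a single loop and creates no intermediate `Vec`. Direct evaluation is safe here because element i of the result depends only on element i of each operand; an expression that read shifted elements of the LHS would have a data dependence and would need a temporary.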

  30. C++ Examples: Matrix Multiplication

  31. C++ Optimizations: Relaxation of serial evaluation semantics
      • htalib provides a mechanism to temporarily relax the serial evaluation ordering
      • This helps overlap different communications, and communications with computations
      htalib::async();
      B(1:n)[0] = B(0:n-1)[d];
      B(0:n-1)[d+1] = B(1:n)[1];
      htalib::sync();
      • The code above shows the boundary exchange in the 1D Jacobi computation
      • There is no data dependence between the two assignments, so both statements can proceed concurrently
      • This is achieved through the runtime calls to async and sync
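Why the two assignments are independent is easier to see in a serial C++ sketch of the same boundary exchange. The sketch assumes each tile is a vector of d + 2 cells: a ghost cell at index 0, interior cells 1..d, and a ghost cell at index d + 1. `Tile` and `exchange_boundaries` are hypothetical names, and nothing here runs concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tile = std::vector<double>;

// Serial version of the 1D Jacobi boundary exchange over tiles 0..n.
void exchange_boundaries(std::vector<Tile>& tiles, std::size_t d) {
    assert(d >= 1);
    for (const Tile& t : tiles) assert(t.size() == d + 2);

    // B(1:n)[0] = B(0:n-1)[d]: left ghost cell gets the left neighbor's last interior cell.
    for (std::size_t t = 1; t < tiles.size(); ++t) {
        tiles[t][0] = tiles[t - 1][d];
    }
    // B(0:n-1)[d+1] = B(1:n)[1]: right ghost cell gets the right neighbor's first interior cell.
    for (std::size_t t = 0; t + 1 < tiles.size(); ++t) {
        tiles[t][d + 1] = tiles[t + 1][1];
    }
}
```

The first loop writes only index 0 and reads only index d; the second writes only index d + 1 and reads only index 1. Since the written cells are never read by the other loop, the two loops could run in either order or at the same time.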

  32. C++ Examples: Matrix Transposition

  33. State • HTA library support is currently being extended to: • Sparse data partitioning • Hierarchical place trees • The library is continually optimized to increase performance • New ways of bringing HTAs to other languages are being explored to increase productivity

  34. My Thoughts • By making parallel computing more productive, we can make better use of processing power across machines • Powerful library tools make programming languages more powerful • HTA is something I will keep in mind for my future programming projects
