Hierarchically Tiled Arrays (HTAs)

Hierarchically Tiled Arrays (HTAs) Anthony Delprete Jason Mckean Ryan Pineres Chris Olszewski

Overview • History and Purposes • Tiles, Locality and Parallelism • Structure, Creation of and Accessing HTAs • Operations on HTAs • Communication and Global • HTA Implementations • HTA vs MPI • Conclusion • Resources

History Developed in 2004 by: Jia Guo, Ganesh Bikshandi, María J. Garzarán and David Padua Dept. of Computer Science U. of Illinois at Urbana-Champaign Basilio B. Fraguela Dept. de Electrónica e Sistemas Universidade da Coruña, Spain Gheorghe Almási and José Moreira IBM Thomas J. Watson, Research Center Yorktown Heights, NY, USA

Purpose • Hierarchically Tiled Array (HTA) is a Object Oriented programming library • The purpose of the library was to improve the programmability of distributed memory environments. This allows for improving performance by enhancing locality and parallelism. This was done through creating a new data type, HTA, allowing for easier manipulation of tiles.

What are Tiles, Locality and Parallelism? Tile • A tile is a block of information. • It is used in scientific computing • An example of a tile would be a matrix. Locality • When the same value or location is frequently accessed. • It is a predictable behavior that occurs in computers and is a good candidate for performance optimization

What are Tiles, Locality and Parallelism? (cont.) Parallelism • Computation where many calculations are carried out simultaneously. • Based on the principle of taking a large problem and dividing it into smaller ones and solving them at the same time.

Structure of an HTA • HTAs are arrays partitioned into tiles. • The tiles can be arrays or other HTAs. • Allows for easier access to a specific location in an array • By distributing the tiles across processors, parallelism is carried out. By arranging the tiles in a certain order, locality can be utilized.

Structure of an HTA (cont.)

Creating an HTA Using Existing Array • Matrix and delimiters • See picture New Empty HTA • F = hta(3,3) • Must be assigned data to complete

Accessing the HTAs Contents Notation - { } used to index tiles - ( ) used to access elements within HTA or its tiles Accessing Tiles - C{2,1} refers to the lower left tile Accessing Elements Directly - C(5,4) refers directly to a specific element at 5,4 Accessing Elements Relatively - C{2,1}(1,4) refers to lower left tile, element at 1,4 - C{2,1}{1,2}(1,2) refers to lower left tile, upper right tile of {2,1}, element at 1,2

Accessing the HTAs Contents (cont.) Regional Access (Flattening) - Ignores tiling and returns array - C(1:2,3:6) returns a 2x4 matrix Logical Indexing/Selection - Matrix of boolean values with same dimensions as HTA

Communication Operations • Communication is represented as assignments on distributed HTAs. • V{2:3,:}(1,:) = V{1:2,:}(5,:) • Can also be represented by overloaded HTA library methods. • Permute Operations • permute(h,[x,y]) • dpermute{h,[x,y]} • Operates on a 3D array. Some of the overloaded array operations provided by the HTA library. • HTAs execute these operations at the tile level. • When circular shift is called, whole tiles are shifted instead of individual array elements.

Matrix Matrix Multiplication • Cannon's Algorithm for matrix-matrix multiplication shows how circular shift can be used to distribute work. • Normal Implementation • Shift rows and columns • Perform multiplication by element • HTA Implementation • Shift entire tiles • Perform multiplication as matrix multiplication of tiles where each processor or unit owns a tile. • HTA Implementation increases locality due to single matrix multiplication. • Can increased even further if more levels of tiling are used.

Global Computations Passing an HTA to a function/operation • Operates in parallel on a set of tiles from an HTA distributed across a parallel machine • parHTA(@func, H) where func is a function pointer Reduction reduce(+, [5, 1, 3, 8]) = 17 • An operation applied to all the specified regions of a n dimensional vector to produce a scalar, producing a n-1 dimensional array. • If no dimension is given, the output contains only one scalar in each tile, corresponding the associated input HTA tile in every dimension. • reduceHTA(op, dim, recurLevel, replicFlag) • op = any associative and commutative operation • dim = dimension of the reduction • recurLevel = termination level of recursion • replicFlag = replication flag

Matrix-Vector Product • The simplest global computation is achieved by operating in parallel on a set of distributed tiles from an HTA. • Matrix-vector multiplication is one example of utilizing HTA global computation. • A is an HTA containing the matrix MX • Distributed across m n processors. • B is a two-dimensional HTA obtained by replicating the HTA V which contains the vector VX to multiply. • The HTA V is replicated m times as specified by the operator repmat(V,m,1). • Before multiplication, the row-vector B is transposed to a column. • The matrix-vector product A * B takes place locally and each processor multiplies its portion of the matrix A by its portion of the vector in B.

HTA Implementations • HTAs can be added to almost any object-based or object oriented language. • Most research was done on MATLAB and C++

MATLAB Implementation Pros • Overall MATLABS syntax lends itself to HTAs • MATLAB provides a rich set of scientiﬁc operations which can be easily incorporated in the HTA toolbox. Cons • There is an immense overhead when MATLAB is interpreted. • MATLAB creates temporary variables to hold the partial results of an expression. Greatly slows the program. • MATLAB passes parameters by value and copies of the data are created from assignment statements.

C++ Implementation: htalib Why C++ over MATLAB? • Allocation/Deallocation improves performance: • HTAs are allocated onto the heap. Return a handle • Typically small in size • Once all handles are removed, HTAs are deleted. • Inline functioning • Compiler will replace functions with their full body of instructions. • Used for Tile access

HTA compared to MPI • HTAs can be naturally implemented into many different languages. • MPI's unstructured manner, can potentially lead to programs that are difficult to understand and maintain • Follows a single threaded programming approach. • eases the programmer from sequential to parallel programming • HTAs are partitioned using the single HTA constructor. • MPI has to make a lot more computations • The lines of code for communication is significantly lower in HTA.

Conclusion • Data Tiling is an effective mechanism for improving performance for both locality and parallelism. • HTA as a library gives the programmer more control. • HTAs facilitate algorithms that use multiple independent CPUs.

Resources Bikshandi, Ganesh, et al. "Programming for parallelism and locality with hierarchically tiled arrays." Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM, 2006. Basilio B. Fraguela , Jia Guo , Ganesh Bikshandi , María J. Garzarán , Gheorghe Almási , José Moreira , David Padua, The Hierarchically Tiled Arrays programming approach, Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems, p.1-12, October 22-23, 2004, Houston, Texas [doi>10.1145/1066650.1066657]

Hierarchically Tiled Arrays (HTAs) Anthony Delprete Jason Mckean Ryan Pineres Chris Olszewski

Hierarchically Tiled Arrays (HTAs)

Hierarchically Tiled Arrays (HTAs)

Presentation Transcript

Tiled Convolutional Neural Networks

Arrays: Higher Dimensional Arrays

3D Tiled Display Walls

Hierarchically Tiled Arrays

Heuristic Evaluation

All Tiled Up

Building Partnerships: U.S. Community Foundations and Mexican Hometown Associations

INAHTA Working Group 3

Hierarchically Tiled Arrays

TILED DATA RECONSTRUCTION AND CORRECTION

Astronomical Tiled Image Compression

Hierarchically Structured Optical Materials

Arrays of Arrays (Multidimensional Arrays)

TILED DATA RECONSTRUCTION AND CORRECTION

Hierarchically nested factor models

Synthesis and characterization of hierarchically structured aluminosilicates

Arrays of Arrays

Pointers, Arrays, Multidimensional Arrays

Arrays of Arrays: