
New Abstractions For Data Parallel Programming

Presentation Transcript


  1. New Abstractions For Data Parallel Programming James C. Brodman Department of Computer Science brodman2@illinois.edu In collaboration with: George Almási, Basilio Fraguela, María Garzarán, David Padua

  2. Outline • Introduction • Hierarchically Tiled Arrays • Additional Abstractions for Data Parallel Programming • Conclusions

  3. 1. Introduction

  4. Going Beyond Arrays • Parallel programming has been well studied for numerical programs • Shared/Distributed Memory APIs, Array languages • Hierarchically Tiled Arrays (HTAs) • However, many important problems today are non-numerical • Examine non-numerical programs and find new abstractions • Data Structures • Parallel Primitives

  5. Array Languages • Many numerical programs were written using array languages • Popular among scientists and engineers • Fortran 90 and its successors • MATLAB • Parallelism was not the reason for this notation's popularity

  6. Array Languages • Convenient notation for linear algebra and other algorithms • More compact • Higher level of abstraction

Loop notation:
do i=1,n
  do j=1,n
    C(i,j) = A(i,j) + B(i,j)
  end do
end do

do i=1,n
  do j=1,n
    S = S + A(i,j)
  end do
end do

Array notation:
C = A + B
S += sum(A)

  7. Data Parallel Programming • Array languages seem a natural fit for parallelism • Parallel programming with aggregate-based or loop-based languages is data centric, or, data parallel • Phrased in terms of performing the same operation on multiple pieces of data • Contrast with task parallelism where parallel tasks may perform completely different operations • Many reasons to prefer data parallel programming over task parallel approaches

  8. Data Parallel Advantages • Data parallel programming is scalable • Scales with an increasing number of processors by increasing the size of the data • Data parallel programs based on array operations resemble conventional, serial programs • Parallelism is encapsulated • Parallelism is structured • Portable • Can run on any class of machine for which the appropriate operators are implemented: Shared/Distributed Memory, Vector Intrinsics, GPUs • Operations are implemented as parallel loops in shared memory, as messages in distributed memory, and with vector intrinsics for SIMD

  9. Data Parallel Advantages • Data parallel programming can both: • Enforce determinacy • Encapsulate non-determinacy • Data parallel programming facilitates autotuning

  10. 2. Hierarchically Tiled Arrays

  11. Numerical Programs and Tiling • Blocking/Tiling is important: • Data Distribution • Locality • Parallelism • Who is responsible for tiling: • The Compiler? • The Programmer?
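To make the tiling question concrete, here is a minimal sketch (plain C++, not HTA code) of a blocked matrix multiplication: the loops are restructured so that the computation works on B x B tiles that fit in cache. All names are illustrative, and the block size B is assumed to be chosen so that three tiles fit in cache.

#include <vector>
#include <algorithm>

// Blocked (tiled) matrix multiplication: C += A * Bmat for n x n row-major
// matrices, processed one B x B tile at a time for cache locality.
void blocked_matmul(const std::vector<double>& A, const std::vector<double>& Bmat,
                    std::vector<double>& C, int n, int B) {
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                // Multiply the (ii,kk) tile of A by the (kk,jj) tile of Bmat
                // and accumulate into the (ii,jj) tile of C.
                for (int i = ii; i < std::min(ii + B, n); ++i)
                    for (int k = kk; k < std::min(kk + B, n); ++k)
                        for (int j = jj; j < std::min(jj + B, n); ++j)
                            C[i * n + j] += A[i * n + k] * Bmat[k * n + j];
}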

  12. Tiling and Compilers (Matrix Multiplication) • [Chart: MFLOPS vs. matrix size for Intel MKL, icc -O3 -xT, and icc -O3; MKL is roughly 20x faster than the compiled code] • Clearly, the compiler isn't doing a good job at tiling

  13. Tiling and Array Languages • Another option is to leave it up to the programmer • What does the code look like? • Notation can get complicated • Additional Dimensions • Arrays of Arrays • Operators not built to handle tiling

  14. Hierarchically Tiled Arrays • The complexity of the tiling problem directly motivates the Hierarchically Tiled Array (HTA) • Makes tiles first class objects • Referenced explicitly • Extended array operations to operate with tiles
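As an illustration of what "tiles as first class objects" means, here is a toy sketch in C++ of a one-level tiled vector whose tiles can be referenced explicitly. The class and method names are hypothetical and are not the HTA library's actual API.

#include <vector>
#include <cstddef>

// Illustrative only: a toy one-level tiled array where tiles are
// first-class objects that can be selected and operated on directly.
class TiledVector {
public:
    TiledVector(std::size_t num_tiles, std::size_t tile_size)
        : tiles_(num_tiles, std::vector<double>(tile_size)) {}

    // Tiles are referenced explicitly...
    std::vector<double>&       tile(std::size_t t)       { return tiles_[t]; }
    const std::vector<double>& tile(std::size_t t) const { return tiles_[t]; }

    // ...and elements can still be addressed through (tile, offset) pairs.
    double& elem(std::size_t t, std::size_t i) { return tiles_[t][i]; }

    std::size_t num_tiles() const { return tiles_.size(); }

private:
    std::vector<std::vector<double>> tiles_;
};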

  15. Hierarchically Tiled Arrays • [Figure: an HTA with several levels of tiling, e.g. an outer level of tiles distributed across nodes, a middle level of tiles mapped to the cores of a multicore, and an inner level of tiles sized for locality]

  16. Higher Level Operations • Many operators are part of the library • Map, reduce, circular shift, replicate, transpose, etc. • Programmers can create new complex parallel operators through the primitive hmap (and MapReduce) • hmap applies a user-defined operator to each tile of the HTA • And to the corresponding tiles if multiple HTAs are involved • The operator is applied in parallel across tiles

  17. User Defined Operations • hmap( F(), X, Y ) • [Figure: F() is applied to each tile of X together with the corresponding tile of Y]
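A sketch of the hmap idea in plain C++ with OpenMP, not the HTA library's own interface: a user-defined operator is applied to each tile of X together with the corresponding tile of Y, and the tiles are processed in parallel. Names are illustrative.

#include <vector>
#include <cstddef>
#include <cassert>

using Tile = std::vector<double>;

// Apply a user-defined operator f to each tile of X and the corresponding
// tile of Y; the tiles are processed in parallel.
template <typename F>
void hmap_sketch(F f, std::vector<Tile>& X, std::vector<Tile>& Y) {
    assert(X.size() == Y.size());
    #pragma omp parallel for                 // one tile per parallel iteration
    for (long t = 0; t < static_cast<long>(X.size()); ++t)
        f(X[t], Y[t]);                       // the operator sees whole tiles
}

// Example user-defined operator: add tile x into tile y, element by element.
void add_into(Tile& x, Tile& y) {
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += x[i];
}
// Usage: hmap_sketch(add_into, X, Y);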

  18. HTA Examples • We can handle many basic types of computations using the HTA library • Cannon’s Matrix Multiplication • Sparse Matrix/Vector Multiplication • We also support more complicated computations • Recursive parallelism • Dynamic partitioning

  19. Cannon's Matrix Multiplication • [Figure: 3x3 tilings of A and B; in the initial skew, row i of A's tiles is shifted left by i and column j of B's tiles is shifted up by j; each subsequent step multiplies and adds the aligned tiles, then shifts A's tiles left and B's tiles up (shift-multiply-add)]

  20. Cannon's Matrix Multiplication

HTA A, B, C
do i = 1:m                                        // initial skew
  A(i,:) = circ_shift( A(i,:), [ 0, -(i-1) ] )    // shift row i of tiles left
  B(:,i) = circ_shift( B(:,i), [ -(i-1), 0 ] )    // shift column i of tiles up
end do
do i = 1:n                                        // main loop
  C = C + A * B                                   // tile-wise matrix multiply and add
  A = circ_shift( A, [ 0, -1 ] )
  B = circ_shift( B, [ -1, 0 ] )
end do

  21. Sparse Matrix/Vector Multiplication • [Figure: a tiled sparse matrix multiplied by a vector to produce the result vector]

  22. Sparse Matrix/Vector Multiplication • [Figure: the vector is transposed, replicated across the tile rows, multiplied element-by-element (.*) with the sparse matrix, and the partial products are reduced (+) along each row to give the result]

  23. Sparse Matrix/Vector Multiplication

Sparse_HTA A
HTA In, Res
Res = transpose( In )
Res = replicate( Res, [3 1] )      // replicate across the tile rows
Res = map( *, A, Res )             // element-by-element multiplication
Res = reduce( +, Res, [0 1] )      // row reduction

  24. User Defined Operations - Merge

Merge( HTA input1, HTA input2, HTA output ) {
  ...
  if ( output.size() < THRESHOLD )
    SerialMerge( input1, input2, output )
  else {
    i = input1.size() / 2
    input1.addPartition( i )                     // split input1 at its midpoint
    j = input2.location_first_gt( input1[i] )    // matching split point in input2
    input2.addPartition( j )
    k = i + j
    output.addPartition( k )                     // partition output to match
    hmap( Merge(), input1, input2, output )      // merge the two halves in parallel
  }
  ...
}

[Figure: dynamic partitioning splits input1 and input2 into two pairs of tiles, and Merge is applied recursively to each pair]

  25. Advantages of tiling as a first class object for optimization • HTAs have been implemented as C++ and MATLAB libraries • For shared and distributed memory machines • A GPU version is planned • Implemented several benchmark suites • Performance is competitive with OpenMP, MPI, and TBB counterparts • Furthermore, the HTA notation produces code that is more readable than other notations and significantly reduces the number of lines of code

  26. Advantages of tiling as a first class object • [Chart: lines of code for the EP, CG, MG, FT, and LU benchmarks, HTA vs. MPI]

  27. Performance Results • With basic compiler optimizations, HTA can match Fortran/MPI • [Charts: performance of the MG, FT, IS, and CG benchmarks]

  28. 3. Additional Abstractions for Data Parallel Programming

  29. Extending Data Parallel Programming • Many of today's programs are amenable to data parallelism, but not with today's abstractions • Need to identify new primitives to extend data parallelism to these types of programs • Non-numerical • Non-deterministic • Traditionally task parallel

  30. 3.1. Non Numerical Computations

  31. New Data Structures for Non Numerical Computations • Operations on aggregates do not have to be confined to arrays • Trees • Graphs • Sets

  32. Sets • Sets are a possible aggregate to consider for data parallelism • Have been examined before (The Connection Machine – Hillis) • What primitives do we need? • Map – apply some function to every element of a set • Reduce – apply reductions across a set or multiple sets (Union, Intersection, etc) • MapReduce • Scan – perform a prefix operation on sets
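A minimal sketch of these set primitives, written here against std::set in plain C++ rather than any particular parallel library; a real implementation would apply the function to the elements (or to tiles of the set) in parallel.

#include <set>
#include <numeric>

// Map: apply a function to every element of a set, producing a new set.
template <typename T, typename F>
std::set<T> set_map(const std::set<T>& s, F f) {
    std::set<T> out;
    for (const T& x : s) out.insert(f(x));
    return out;
}

// Reduce: fold the elements of a set with a binary operator.
template <typename T, typename Op>
T set_reduce(const std::set<T>& s, T init, Op op) {
    return std::accumulate(s.begin(), s.end(), init, op);
}

// A reduction across multiple sets, e.g. union of two sets.
template <typename T>
std::set<T> set_union_of(const std::set<T>& a, const std::set<T>& b) {
    std::set<T> out = a;
    out.insert(b.begin(), b.end());
    return out;
}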

  33. What problem domains can be solved in parallel using set operations? • We have studied several areas including • Search • Data mining • Mesh refinement • In all cases, it was possible to obtain a highly parallel and readable version using set operations

  34. Example – Search – 15 Puzzle • 4x4 grid of tiles with a “hole” • Slide tiles to go from a start state to the Goal • States (puzzle configurations) and transitions (moves) form a graph • Solve using a Best-First Search

  35. Example - Search

  36. Parallel Search Algorithms • Best-First search uses a heuristic to guide the search to examine "good" nodes first • If the search space is very large, prefer nodes that are closer to a solution over nodes less likely to reach the goal quickly • Ex. the 15 puzzle search space has ~16! states • For the puzzle, the heuristic function takes a state and gives it a score • Better scores are likely to lead to solutions more quickly • The metric is the sum of: • Steps so far • Sum of the distances of each tile from its final position
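A sketch of that score in C++: the steps taken so far plus the sum of the Manhattan distances of each tile from its goal position. The board representation (a flat array of 16 values with 0 marking the hole and tile t belonging at index t-1) is an assumption made for illustration.

#include <array>
#include <cstdlib>

struct State {
    std::array<int, 16> board;   // board[pos] = tile number, 0 = the hole
    int steps_so_far;            // moves made from the start state
};

// Heuristic score: steps so far plus the sum of Manhattan distances of
// each tile from its final position. Lower scores are explored first.
int score(const State& s) {
    int distance = 0;
    for (int pos = 0; pos < 16; ++pos) {
        int tile = s.board[pos];
        if (tile == 0) continue;                  // the hole does not count
        int goal = tile - 1;                      // tile t belongs at index t-1
        distance += std::abs(pos / 4 - goal / 4)  // row distance
                  + std::abs(pos % 4 - goal % 4); // column distance
    }
    return s.steps_so_far + distance;
}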

  37. Parallel Search Algorithms • [Figure: a work list W; nodes are selected from W, expanded, and their successors are inserted back into W]

  38. Parallel Search Algorithms

Search( initial_state )
  work_list.add( initial_state )
  while ( work_list not empty )
    n = SELECT( work_list )
    if ( n contains GOAL ) break
    work_list = work_list - n
    successors = expand( n )
    update( work_list, successors )

• The implementation of SELECT determines the type of search: • ALL → Breadth-First • DEEPEST → Depth-First • BEST (heuristic) → Best-First • The code looks sequential, but the operators can be parallel

  39. Parallel Search Algorithms • One way to efficiently implement the parallel operators is to use tiled sets and a map primitive (as before we used tiled arrays and HTA's hmap) • Want to tile for the same reasons as before: • Data distribution • Locality • Parallelism

  40. Mapping and Tiled Sets • Cannot create a tiled set as easily as a tiled array • Specifying tiling is trivial for arrays • A tiled set requires two parameters: • The number of tiles • A mapping function that takes a piece of data from the set and specifies a destination tile number • [Figure: elements of a set being mapped to tiles 1-4]
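A sketch of such a tiled set in C++: the constructor takes the number of tiles and a user-supplied mapping function from an element to a destination tile. The class and method names are illustrative, not an existing library's API.

#include <vector>
#include <set>
#include <functional>
#include <cstddef>

template <typename T>
class TiledSet {
public:
    // num_tiles: how many tiles; map_fn: element -> destination tile number.
    TiledSet(std::size_t num_tiles, std::function<std::size_t(const T&)> map_fn)
        : tiles_(num_tiles), map_fn_(std::move(map_fn)) {}

    // The mapping function decides which tile an element lands in.
    void insert(const T& x) { tiles_[map_fn_(x) % tiles_.size()].insert(x); }

    std::set<T>& tile(std::size_t t) { return tiles_[t]; }
    std::size_t  num_tiles() const   { return tiles_.size(); }

private:
    std::vector<std::set<T>> tiles_;
    std::function<std::size_t(const T&)> map_fn_;
};

// Usage: a set of ints spread over 4 tiles by a hash-like mapping function.
// TiledSet<int> s(4, [](const int& x) { return static_cast<std::size_t>(x); });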

  41. Locality vs Load Balance • Choosing a "good" mapping function is important as it affects: • Load balance • Locality • Load imbalance can occur if data is not evenly mapped to tiles • One possible solution is overdecomposition • A compromise between extra overhead and better load balance • Specify more tiles than processors and rely on a "smart" runtime (e.g. Cilk's and Intel TBB's task stealing, CHARM++)

  42. Tiled Sets and Locality • The mapping function affects locality • Ideally, all the newly expanded nodes would end up in the original tile • Shared memory: new nodes are in cache for the next iteration • Distributed memory: minimizes communication when mapping new nodes • However, this is not always the case • [Figure: nodes selected from a tile are expanded; some of the new nodes map to other tiles]

  43. Tiled Sets and Locality • [Figure: expanded nodes are routed to tiles by the mapping function in a MapReduce step and merged into each tile by set union]

  44. 15 Puzzle Performance

  45. Non Numerical Computations • Many non-numerical computations are amenable to data parallelism when it is properly extended • Search, etc. • Tiling can benefit sets just as it does arrays when properly extended • The mapping function is explicit • The "quality" of the mapping is important

  46. 3.2. Non Deterministic Computations

  47. Non Deterministic Computations • Many non-deterministic problems could be amenable to data parallelism with the proper extensions • Need new primitives that can either: • Enforce determinacy • Encapsulate the non-determinacy • Two examples: • Vector operations with indirect indices • Delaunay Mesh Refinement

  48. Vector Operations with Indirect Indices • Consider A( X(i) ) += V(i) : • Fully parallel if X does not contain duplicate values • Potential races if duplicates exist • One possible way to parallelize is to annotate that all updates to A must be atomic
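A sketch of the atomic-annotation option in C++ with OpenMP: the iterations run in parallel, and each update to A is marked atomic so that duplicate indices in X cannot race. The function name is illustrative.

#include <vector>

// Parallel A( X(i) ) += V(i) with atomic updates: correct even when X
// contains duplicate values, at the cost of serializing conflicting updates.
void indirect_update(std::vector<double>& A,
                     const std::vector<int>& X,
                     const std::vector<double>& V) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(X.size()); ++i) {
        #pragma omp atomic          // only conflicting updates are serialized
        A[X[i]] += V[i];
    }
}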

  49. Vector Operations with Indirect Indices • A( X(i) ) += V(i) : • Let A represent the balances of accounts • Let the values of X represent the indices of specific accounts in A • Let V be a series of transactions sorted chronologically • If the bank imposes penalties for negative balances, the transactions associated with an individual account cannot be reordered • Can be successfully parallelized if the programmer can specify that updates are not commutative • Allow the parallel update of different accounts, but serialize updates to the same account • Inspector/Executor
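A sketch of the inspector/executor idea for this example in C++: the inspector groups transaction indices by account, preserving their chronological order, and the executor processes different accounts in parallel while applying each account's transactions serially and in order. All names are illustrative.

#include <vector>
#include <unordered_map>
#include <utility>
#include <cstddef>

void update_accounts(std::vector<double>& A,          // account balances
                     const std::vector<int>& X,        // account index per transaction
                     const std::vector<double>& V) {   // transactions, in time order
    // Inspector: bucket transaction indices by account, keeping their order.
    std::unordered_map<int, std::vector<std::size_t>> by_account;
    for (std::size_t i = 0; i < X.size(); ++i)
        by_account[X[i]].push_back(i);

    // Executor: different accounts in parallel, same account strictly in order.
    std::vector<std::pair<int, std::vector<std::size_t>>> groups(by_account.begin(),
                                                                 by_account.end());
    #pragma omp parallel for
    for (long g = 0; g < static_cast<long>(groups.size()); ++g) {
        int acct = groups[g].first;
        for (std::size_t i : groups[g].second)
            A[acct] += V[i];                           // e.g. apply penalties here
    }
}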

  50. Delaunay Mesh Refinement • Given a mesh of triangles, we want to refine the mesh so that all the triangles meet certain properties • The circumcircle of any triangle does not contain points of any other triangle • The minimum angle of any triangle is at least a certain size • Can be written as a sequence of data parallel operators • Given a set of triangles, find those that are "bad" • For each bad triangle, calculate the affected neighboring triangles, or cavity • For each cavity, remove the bad triangle and its neighbors and replace them with new triangles • Might create new bad triangles • Repeat until the mesh contains no bad triangles
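A skeleton of that sequence of operators in C++, sketched as a generic function: the geometric kernels (is_bad, compute_cavity, retriangulate) are placeholders supplied by the caller, and each step could be implemented as a parallel map or reduce over the set of triangles.

#include <set>
#include <vector>

// Refinement skeleton only: the geometric work is supplied by the caller.
template <typename Tri, typename IsBad, typename Cavity, typename Retri>
void refine(std::set<Tri>& mesh, IsBad is_bad, Cavity compute_cavity, Retri retriangulate) {
    while (true) {
        // 1. Find the bad triangles (a filter over the mesh; parallelizable).
        std::vector<Tri> bad;
        for (const Tri& t : mesh)
            if (is_bad(t)) bad.push_back(t);
        if (bad.empty()) break;                      // no bad triangles remain

        // 2-3. For each bad triangle, compute the affected cavity, remove it,
        //      and insert the new triangles. Non-overlapping cavities could be
        //      processed in parallel; new triangles may themselves be bad.
        for (const Tri& t : bad) {
            if (mesh.count(t) == 0) continue;        // already removed by an earlier cavity
            auto cavity = compute_cavity(t, mesh);
            for (const Tri& c : cavity) mesh.erase(c);
            for (const Tri& n : retriangulate(cavity)) mesh.insert(n);
        }
    }
}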
