1 / 29

Dense Linear Algebra (Data Distributions)

Dense Linear Algebra (Data Distributions). Sathish Vadhiyar. Gaussian Elimination - Review. Version 1 for each column i zero it out below the diagonal by adding multiples of row i to later rows for i= 1 to n-1 for each row j below row i for j = i+1 to n

donna-tran
Download Presentation

Dense Linear Algebra (Data Distributions)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Dense Linear Algebra(Data Distributions) Sathish Vadhiyar

  2. Gaussian Elimination - Review Version 1 for each column i zero it out below the diagonal by adding multiples of row i to later rows for i= 1 to n-1 for each row j below row i for j = i+1 to n add a multiple of row i to row j for k = i to n A(j, k) = A(j, k) – A(j, i)/A(i, i) * A(i, k) i 0 0 0 0 0 0 k 0 0 0 0 0 i i,i X X x j

  3. Gaussian Elimination - Review Version 2 – Remove A(j, i)/A(i, i) from inner loop for each column i zero it out below the diagonal by adding multiples of row i to later rows for i= 1 to n-1 for each row j below row i for j = i+1 to n m = A(j, i) / A(i, i) for k = i to n A(j, k) = A(j, k) – m* A(i, k) i 0 0 0 0 0 0 k 0 0 0 0 0 i i,i X X x j

  4. Gaussian Elimination - Review Version 3 – Don’t compute what we already know for each column i zero it out below the diagonal by adding multiples of row i to later rows for i= 1 to n-1 for each row j below row i for j = i+1 to n m = A(j, i) / A(i, i) for k = i+1 to n A(j, k) = A(j, k) – m* A(i, k) i 0 0 0 0 0 0 k 0 0 0 0 0 i i,i X X x j

  5. Gaussian Elimination - Review Version 4 – Store multipliers m below diagonals for each column i zero it out below the diagonal by adding multiples of row i to later rows for i= 1 to n-1 for each row j below row i for j = i+1 to n A(j, i) = A(j, i) / A(i, i) for k = i+1 to n A(j, k) = A(j, k) – A(j, i)* A(i, k) i 0 0 0 0 0 0 k 0 0 0 0 0 i i,i X X x j

  6. GE - Runtime • Divisions • Multiplications / subtractions • Total 1+ 2 + 3 + … (n-1) = n2/2 (approx.) 12 + 22 + 32 + 42 +52 + …. (n-1)2 = n3/3 – n2/2 2n3/3

  7. Parallel GE • 1st step – 1-D block partitioning along blocks of n columns by p processors i 0 0 0 0 0 0 k 0 0 0 0 0 i i,i X X x j

  8. 1D block partitioning - Steps 1. Divisions n2/2 2. Broadcast xlog(p) + ylog(p-1) + zlog(p-3) + … log1 < n2logp 3. Multiplications and Subtractions (n-1)n/p + (n-2)n/p + …. 1x1 = n3/p (approx.) Runtime: < n2/2 +n2logp + n3/p

  9. 2-D block • To speedup the divisions P i 0 0 0 0 0 0 k 0 0 0 0 0 i Q i,i X X x j

  10. 2D block partitioning - Steps 1. Broadcast of (k,k) logQ 2. Divisions n2/Q (approx.) 3. Broadcast of multipliers xlog(P) + ylog(P-1) + zlog(P-2) + …. = n2/Q logP 4. Multiplications and subtractions n3/PQ (approx.)

  11. Problem with block partitioning for GE • Once a block is finished, the corresponding processor remains idle for the rest of the execution • Solution? -

  12. Onto cyclic • The block partitioning algorithms waste processor cycles. No load balancing throughout the algorithm. • Onto cyclic 0 1 2 3 0 1 2 3 0 1 2 3 0 1-D block-cyclic 2-D block-cyclic cyclic Load balance, block operations, but column factorization bottleneck Has everything Load balance

  13. Block cyclic • Having blocks in a processor can lead to block-based operations (block matrix multiply etc.) • Block based operations lead to high performance

  14. GE: MiscellaneousGE with Partial Pivoting • 1D block-column partitioning: which is better? Column or row pivoting • 2D block partitioning: Can restrict the pivot search to limited number of columns • Column pivoting does not involve any extra steps since pivot search and exchange are done locally on each processor. O(n-i-1) • The exchange information is passed to the other processes by piggybacking with the multiplier information • Row pivoting • Involves distributed search and exchange – O(n/P)+O(logP)

  15. Triangular Solve – Unit Upper triangular matrix • Sequential Complexity - • Complexity of parallel algorithm with 2D block partitioning (P0.5*P0.5) O(n2) O(n2)/P0.5 • Thus (parallel GE / parallel TS) < (sequential GE / sequential TS) • Overall (GE+TS) – O(n3/P)

  16. Dense LU on GPUs

  17. LU for Hybrid Multicore + GPU Systems(Tomov et al., Parallel Computing, 2010) • Assume the CPU host has 8 cores • Assume NxN matrix; Divided into blocks of size NB; • Split such that the first N-7NB columns are on GPU memory • Last 7NB on the host

  18. Load Splitting for Hybrid LU

  19. Steps • Current panel is downloaded to CPU; the dark blue part in the figure • Panel factored on CPU and result sent to GPU to update trailing sub-matrix; the red part; • GPU updates the first NB columns of the trailing submatrix • Updated panel sent to CPU; Asynchronously factored on CPU while the GPU updates the rest of the trailing submatrix • The rest of the 7 host cores update the last 7 NB host columns

  20. Dense Cholesky on GPUs

  21. Cholesky Factorization • Symmetric positive definite matrix A • A = LLT • Similar to Gaussian elimination; succession of block column factorizations followed by updates of the trailing submatrix • Can be parallelized using fork-join model similar to parallel GE • Matrix-matrix multiplications in parallel (fork) and synchronization needed for next block column factorization (join)

  22. Heterogeneous Tile Algorithms (Song et al., ICS 2012) • Divide a matrix into a set of small and large tiles • Small tiles for CPU, and large tiles for GPUs • At the top level, divide the matrix into large square tiles of size B x B • Then subdivide each top level tile into a number of small rectangular tiles of size Bxb and a remaining tile

  23. Heterogeneous Tile Cholesky Factorization • Factor the first tile to solve L(1,1) • Apply L(1,1) to update its right A(1,2)

  24. Heterogeneous Tile Cholesky Factorization • Factor the two tiles below L(1,1) • Update all tiles on the right of the first tile column

  25. Heterogeneous Tile Cholesky Factorization • At the second iteration, repeat the above steps on the trailing submatrix starting from the second tile column

  26. Two Level Block-Cyclic Distribution • Divide a matrix A into p x (s.p) rectangular tiles discussed earlier • i.e., first divided into pxp large tiles, then partition each large tile into s tiles • On a hybrid CPU and GPU machine, allocate the tiles to the host and a number of P GPUs in a 1-D block-cyclic way • Those columns whose indices are multiples of s are mapped to the P GPUs in a cyclic way, and remaining columns go to all CPU cores in the host

  27. Two Level Block-Cyclic Distribution - Example

  28. Tuning Tile Size • Depends on relative performance of CPU and GPU cores • For example, to determine the tile width, Bh for CPU:

  29. References • E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, S. Tomov, and W. m. Hu. Faster, Cheaper, Better: a Hybridization Methodology to Develop Linear Algebra Software for GPUs. GPU Computing Gems, 2, 2010. • F. Song, S. Tomov, and J. Dongarra. Enabling and Scaling Matrix Computations on Heterogeneous Multi-core and Multi-GPU Systems. In In the Proceedings of IEEE/ACM Supercomputing Conference (SC), pages 365{376, 2012.

More Related