
Dense Matrix Algorithms

Presentation Transcript


  1. Dense Matrix Algorithms CS 524 – High-Performance Computing

  2. Definitions • p = number of processors (0 to p-1) • n = dimension of array/matrix (0 to n-1) • q = number of blocks along one dimension (0 to q-1) • tc = computation time for one flop • ts = communication startup time • tw = communication transfer time per word • Interconnection network: crossbar switch with bi-directional links CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  3. Uniform Striped Partitioning CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  4. Checkerboard Partitioning CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  5. Matrix Transpose (MT) • AT(i, j) = A(j, i) for all i and j • Sequential algorithm:

     do i = 0, n-1
       do j = 0, n-1
         B(i, j) = A(j, i)
       end do
     end do

  • The copy loop above performs n² assignments; an in-place transpose swaps only the (n² – n)/2 elements below the diagonal, which is why the run time is quoted as (n² – n)/2, or roughly n²/2 CS 524 (Wi 2003/04)- Asim Karim @ LUMS
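For concreteness, a minimal C sketch of the in-place variant that performs the (n² – n)/2 swaps counted above; the dimension N and the test data are assumptions of this illustration, not part of the slides.

    #include <stdio.h>

    #define N 4  /* example dimension (assumption of this sketch) */

    /* In-place transpose: swap each below-diagonal element with its
       mirror above the diagonal, (N*N - N)/2 swaps in total. */
    void transpose(double a[N][N]) {
        for (int i = 1; i < N; i++) {
            for (int j = 0; j < i; j++) {
                double t = a[i][j];
                a[i][j] = a[j][i];
                a[j][i] = t;
            }
        }
    }

    int main(void) {
        double a[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = i * N + j;   /* distinct test values */
        transpose(a);
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                printf("%6.1f ", a[i][j]);
            printf("\n");
        }
        return 0;
    }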

  6. MT - Checkerboard Partitioning (1) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  7. MT – Checkerboard Partitioning (2) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  8. MT – Striped Partitioning CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  9. Matrix-Vector Multiplication (MVM) • MVM: y = Ax

     do i = 0, n-1
       y(i) = 0
       do j = 0, n-1
         y(i) = y(i) + A(i, j)*x(j)
       end do
     end do

  • Sequential algorithm requires n² multiplications and additions • Assuming one flop takes tc time, sequential run time is 2tcn² CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  10. Row-wise Striping – p = n (1) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  11. Row-wise Striping – p = n (2) • Data partitioning: Pi has row i of A and element i of x • Communication: each processor broadcasts its element of x • Computation: each processor performs n multiplications and additions • Parallel run time: Tp = 2ntc + p(ts + tw) = 2ntc + n(ts + tw) • Algorithm is cost-optimal, as both the parallel and serial costs are O(n²) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  12. Row-wise Striping – p < n • Data partitioning: each processor has n/p rows of A and the corresponding n/p elements of x • Communication: each processor broadcasts its elements of x • Computation: each processor performs n²/p multiplications and additions • Parallel run time: Tp = 2tcn²/p + p[ts + (n/p)tw] • Algorithm is cost-optimal for p = O(n) CS 524 (Wi 2003/04)- Asim Karim @ LUMS
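To make the row-wise striped scheme of slides 11–12 concrete, here is a rough MPI sketch; it assumes p divides n, uses MPI_Allgather to realize the "each processor broadcasts its elements of x" step, and the identity-matrix test data, n = 8, and the name mvm_striped are illustrative only.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Row-wise striped y = A*x: each rank owns n/p rows of A and the
       matching n/p entries of x. Assumes p divides n. */
    void mvm_striped(int n, const double *Aloc, const double *xloc,
                     double *y, MPI_Comm comm) {
        int p;
        MPI_Comm_size(comm, &p);
        int rows = n / p;
        double *x = malloc(n * sizeof(double));
        /* Communication step: all-to-all broadcast of the x pieces. */
        MPI_Allgather(xloc, rows, MPI_DOUBLE, x, rows, MPI_DOUBLE, comm);
        /* Computation step: n/p local dot products of length n. */
        for (int i = 0; i < rows; i++) {
            y[i] = 0.0;
            for (int j = 0; j < n; j++)
                y[i] += Aloc[i * n + j] * x[j];
        }
        free(x);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int n = 8;                       /* assumes p divides n */
        int rows = n / p;
        double *Aloc = malloc(rows * n * sizeof(double));
        double *xloc = malloc(rows * sizeof(double));
        double *y    = malloc(rows * sizeof(double));
        /* Test data: A = identity, x(i) = i, so y should equal x. */
        for (int i = 0; i < rows; i++) {
            xloc[i] = rank * rows + i;
            for (int j = 0; j < n; j++)
                Aloc[i * n + j] = (rank * rows + i == j) ? 1.0 : 0.0;
        }
        mvm_striped(n, Aloc, xloc, y, MPI_COMM_WORLD);
        printf("rank %d: y[0] = %g\n", rank, y[0]);
        free(Aloc); free(xloc); free(y);
        MPI_Finalize();
        return 0;
    }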

  13. Checkerboard Partitioning – p = n² (1) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  14. Checkerboard Partitioning – p = n² (2) • Data partitioning: each processor has one element of A; only processors in the last column have one element of x • Communication • One element of x from the last column to the diagonal processor in the same row • Broadcast from the diagonal processor to all processors in its column • Global sum of y contributions from all processors in a row to the last processor in the row • Computation: one multiplication + one addition per processor • Parallel run time: Tp = 2tc + 3(ts + tw) • Algorithm is cost-optimal, as serial and parallel costs are both O(n²) • For a bus network, communication time is 3n(ts + tw); the system is then not cost-optimal, as its cost is O(n³) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  15. Checkerboard Partitioning – p < n² • Data partitioning: each processor has an n/√p × n/√p block of A; processors in the last column also have n/√p elements of x • Communication • n/√p elements of x from the last column to the diagonal processor in the same row • Broadcast from the diagonal processor to all processors in its column • Global sum of y contributions from all processors in a row to the last processor in the row • Computation: n²/p multiplications and additions per processor • Parallel run time: Tp = 2tcn²/p + 3(ts + tw·n/√p) • Algorithm is cost-optimal only if p = O(n²) CS 524 (Wi 2003/04)- Asim Karim @ LUMS
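The checkerboard MVM of slides 13–15 can be sketched with a 2-D Cartesian communicator. The following illustration hard-codes n = 8, assumes p is a perfect square whose root divides n, and fills A and x with ones so the answer (each y entry equals n) is easy to check; none of these choices come from the slides.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Checkerboard MVM: q x q process grid, each rank owns a b x b block
       of A; the vector x lives in the last column of the grid. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int q = (int)(sqrt((double)p) + 0.5); /* assumes p is a perfect square */
        int n = 8, b = n / q;                 /* assumes q divides n */

        int dims[2] = {q, q}, periods[2] = {0, 0}, coords[2];
        MPI_Comm grid, rowc, colc;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
        MPI_Cart_coords(grid, rank, 2, coords);
        int row = coords[0], col = coords[1];
        int keep_col[2] = {0, 1}, keep_row[2] = {1, 0};
        MPI_Cart_sub(grid, keep_col, &rowc);  /* same row; ranks = column index */
        MPI_Cart_sub(grid, keep_row, &colc);  /* same column; ranks = row index */

        double *Aloc  = malloc(b * b * sizeof(double));
        double *xblk  = malloc(b * sizeof(double));
        double *ypart = malloc(b * sizeof(double));
        double *yblk  = malloc(b * sizeof(double));
        for (int i = 0; i < b * b; i++) Aloc[i] = 1.0;  /* A = ones (test) */
        if (col == q - 1)
            for (int i = 0; i < b; i++) xblk[i] = 1.0;  /* x = ones (test) */

        /* 1. x block moves from the last column to the diagonal processor. */
        if (col == q - 1 && row != q - 1) {
            int c2[2] = {row, row}, dst;
            MPI_Cart_rank(grid, c2, &dst);
            MPI_Send(xblk, b, MPI_DOUBLE, dst, 0, grid);
        } else if (row == col && col != q - 1) {
            int c2[2] = {row, q - 1}, src;
            MPI_Cart_rank(grid, c2, &src);
            MPI_Recv(xblk, b, MPI_DOUBLE, src, 0, grid, MPI_STATUS_IGNORE);
        }
        /* 2. Diagonal processor broadcasts its x block down its column. */
        MPI_Bcast(xblk, b, MPI_DOUBLE, col, colc);
        /* 3. Local multiply: b x b block of A times the x block. */
        for (int i = 0; i < b; i++) {
            ypart[i] = 0.0;
            for (int j = 0; j < b; j++)
                ypart[i] += Aloc[i * b + j] * xblk[j];
        }
        /* 4. Global sum of y parts across each row into the last column. */
        MPI_Reduce(ypart, yblk, b, MPI_DOUBLE, MPI_SUM, q - 1, rowc);
        if (col == q - 1)
            printf("grid row %d: y[0] = %g (expect %d)\n", row, yblk[0], n);
        free(Aloc); free(xblk); free(ypart); free(yblk);
        MPI_Finalize();
        return 0;
    }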

  16. Matrix-Matrix Multiplication (MMM) • C = A x B, n x n square matrices • Block matrix multiplication: algebraic operations on sub-matrices, or blocks, of the matrices. This view of MMM aids parallelization.

     do i = 0, q-1
       do j = 0, q-1
         do k = 0, q-1
           Ci,j = Ci,j + Ai,k x Bk,j
         end do
       end do
     end do

  • Number of multiplications and additions = n³. Sequential run time = 2tcn³ CS 524 (Wi 2003/04)- Asim Karim @ LUMS
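As an illustration of the block view, a small sequential C sketch; the dimension N and block size B (so q = N/B) are assumptions made for the example.

    #include <stdio.h>

    #define N 8   /* matrix dimension (assumption) */
    #define B 4   /* block size; q = N/B blocks per dimension */

    /* Blocked C = A*B: loop over blocks (i,j,k) exactly as on the slide,
       then multiply the B x B sub-matrices element by element inside. */
    void mmm_blocked(const double A[N][N], const double Bm[N][N],
                     double C[N][N]) {
        int q = N / B;
        for (int i = 0; i < q; i++)
            for (int j = 0; j < q; j++)
                for (int k = 0; k < q; k++)
                    /* block update: C(i,j) = C(i,j) + A(i,k) x B(k,j) */
                    for (int ii = i*B; ii < (i+1)*B; ii++)
                        for (int jj = j*B; jj < (j+1)*B; jj++)
                            for (int kk = k*B; kk < (k+1)*B; kk++)
                                C[ii][jj] += A[ii][kk] * Bm[kk][jj];
    }

    int main(void) {
        static double A[N][N], Bm[N][N], C[N][N];   /* C starts at zero */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0; Bm[i][j] = 1.0; }
        mmm_blocked(A, Bm, C);
        printf("C[0][0] = %g (expect %d)\n", C[0][0], N);
        return 0;
    }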

  17. Checkerboard Partitioning – q = √p • Data partitioning: Pi,j has blocks Ai,j and Bi,j of A and B, each of dimension n/√p × n/√p • Communication: each processor broadcasts its block Ai,j to all processors in its row and its block Bi,j to all processors in its column • Computation: each processor performs (n/√p)(n/√p)n = n³/p multiplications and additions • Parallel run time: Tp = 2tcn³/p + 2√p[ts + (n²/p)tw] • Algorithm is cost-optimal only if p = O(n²) CS 524 (Wi 2003/04)- Asim Karim @ LUMS
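A minimal MPI sketch of this broadcast-based scheme in the degenerate case of one element per processor (p = n²), where the all-to-all broadcasts along rows and columns become MPI_Allgather calls; a full implementation would hold n/√p × n/√p blocks, but the communication pattern is the same. The test values A = 1, B = 2 are assumptions of the example.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int n = (int)(sqrt((double)p) + 0.5);   /* n x n process grid */
        int dims[2] = {n, n}, periods[2] = {0, 0}, coords[2];
        MPI_Comm grid, rowc, colc;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
        MPI_Cart_coords(grid, rank, 2, coords);
        int keep_col[2] = {0, 1}, keep_row[2] = {1, 0};
        MPI_Cart_sub(grid, keep_col, &rowc);
        MPI_Cart_sub(grid, keep_row, &colc);

        double a = 1.0, b = 2.0;   /* A(i,j) = 1, B(i,j) = 2 (test data) */
        double *arow = malloc(n * sizeof(double));
        double *bcol = malloc(n * sizeof(double));
        /* Every process contributes its A element to its whole row and
           its B element to its whole column (all-to-all broadcasts). */
        MPI_Allgather(&a, 1, MPI_DOUBLE, arow, 1, MPI_DOUBLE, rowc);
        MPI_Allgather(&b, 1, MPI_DOUBLE, bcol, 1, MPI_DOUBLE, colc);

        double c = 0.0;
        for (int k = 0; k < n; k++) c += arow[k] * bcol[k];
        if (rank == 0) printf("C(0,0) = %g (expect %g)\n", c, 2.0 * n);
        free(arow); free(bcol);
        MPI_Finalize();
        return 0;
    }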

  18. Cannon's Algorithm (1) • Memory-efficient version of the checkerboard-partitioned block MMM • At any time, each processor holds one block of A and one block of B • Blocks are cycled after each computation step in such a way that after √p steps the multiplication is complete for Ci,j • Initial distribution of the matrices is the same as in checkerboard partitioning • Communication • Initial alignment: block Ai,j is moved left by i steps (with wraparound); block Bi,j is moved up by j steps (with wraparound) • Subsequent √p–1 steps: the resident block of A is moved left by one step and the resident block of B up by one step (both with wraparound) • After √p computation and communication steps the multiplication is complete for Ci,j CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  19. Cannon’s Algorithm (2) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  20. Cannon’s Algorithm (3) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  21. Cannon's Algorithm (4) • Communication • √p point-to-point communications of size n²/p along rows • √p point-to-point communications of size n²/p along columns • Computation: over the √p steps, each processor performs n³/p multiplications and additions • Parallel run time: Tp = 2tcn³/p + 2√p[ts + (n²/p)tw] • Algorithm is cost-optimal if p = O(n²) CS 524 (Wi 2003/04)- Asim Karim @ LUMS
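A rough MPI sketch of Cannon's algorithm, again in the one-element-per-process case (p = n²) so the alignment and shift steps stay visible; block-local multiplications would replace the scalar products in a full implementation, and the test values A = 1, B = 2 are assumptions of the example.

    #include <mpi.h>
    #include <stdio.h>
    #include <math.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int n = (int)(sqrt((double)p) + 0.5);
        int dims[2] = {n, n}, periods[2] = {1, 1}, coords[2]; /* wraparound */
        MPI_Comm grid;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
        MPI_Cart_coords(grid, rank, 2, coords);
        int row = coords[0], col = coords[1];

        double a = 1.0, b = 2.0, c = 0.0;   /* test data: A = 1, B = 2 */
        int left, right, up, down, src, dst;

        /* Initial alignment: A(i,j) moves left by i, B(i,j) moves up by j. */
        MPI_Cart_shift(grid, 1, -row, &src, &dst);
        MPI_Sendrecv_replace(&a, 1, MPI_DOUBLE, dst, 0, src, 0, grid,
                             MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -col, &src, &dst);
        MPI_Sendrecv_replace(&b, 1, MPI_DOUBLE, dst, 0, src, 0, grid,
                             MPI_STATUS_IGNORE);

        /* sqrt(p) compute-and-shift steps: multiply, then move A left by
           one and B up by one (both with wraparound). */
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        left = dst; right = src;
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        up = dst; down = src;
        for (int step = 0; step < n; step++) {
            c += a * b;
            MPI_Sendrecv_replace(&a, 1, MPI_DOUBLE, left, 0, right, 0, grid,
                                 MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(&b, 1, MPI_DOUBLE, up, 0, down, 0, grid,
                                 MPI_STATUS_IGNORE);
        }
        if (rank == 0) printf("C(0,0) = %g (expect %g)\n", c, 2.0 * n);
        MPI_Finalize();
        return 0;
    }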

  22. Fox's Algorithm (1) • Another memory-efficient version of the checkerboard-partitioned block MMM • Initial distribution of the matrices is the same as in checkerboard partitioning • At any time, each processor holds one block of A and one block of B • Steps (repeated √p times): 1. Broadcast Ai,i to all processors in row i 2. Multiply the block of A just received with the resident block of B 3. Send the block of B up one step (with wraparound) 4. Select block Ai,(j+1)mod√p, where Ai,j was the block broadcast previously, broadcast it to all processors in row i, and go to step 2 CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  23. Fox’s Algorithm (2) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  24. Fox's Algorithm (3) • Communication • √p broadcasts of size n²/p along rows • √p point-to-point communications of size n²/p along columns • Computation: each processor performs n³/p multiplications and additions • Parallel run time: Tp = 2tcn³/p + 2√p[ts + (n²/p)tw] • Algorithm is cost-optimal if p = O(n²) CS 524 (Wi 2003/04)- Asim Karim @ LUMS
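For comparison, a sketch of Fox's algorithm in the same degenerate one-element-per-process setting; the row broadcast uses a sub-communicator whose ranks coincide with column indices (which holds for MPI_Cart_sub's ordering), and the test values A = 1, B = 2 are assumptions of the example.

    #include <mpi.h>
    #include <stdio.h>
    #include <math.h>

    /* Fox's algorithm with one element per process (p = n*n): at step t,
       the block A(i, (i+t) mod n) is broadcast along row i, multiplied
       with the resident B, and B is rolled up one row with wraparound. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int n = (int)(sqrt((double)p) + 0.5);
        int dims[2] = {n, n}, periods[2] = {1, 1}, coords[2];
        MPI_Comm grid, rowc;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
        MPI_Cart_coords(grid, rank, 2, coords);
        int row = coords[0];
        int keep_col[2] = {0, 1};
        MPI_Cart_sub(grid, keep_col, &rowc);   /* ranks = column index */

        double a = 1.0, b = 2.0, c = 0.0, abuf;
        int up, down, src, dst;
        MPI_Cart_shift(grid, 0, -1, &src, &dst);  /* up by one, wraparound */
        up = dst; down = src;

        for (int t = 0; t < n; t++) {
            int root = (row + t) % n;  /* column holding the block to send */
            abuf = a;
            /* Steps 1/4: broadcast the selected A block along the row. */
            MPI_Bcast(&abuf, 1, MPI_DOUBLE, root, rowc);
            /* Step 2: multiply received A with resident B. */
            c += abuf * b;
            /* Step 3: send B up one step (with wraparound). */
            MPI_Sendrecv_replace(&b, 1, MPI_DOUBLE, up, 0, down, 0, grid,
                                 MPI_STATUS_IGNORE);
        }
        if (rank == 0) printf("C(0,0) = %g (expect %g)\n", c, 2.0 * n);
        MPI_Finalize();
        return 0;
    }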

  25. Solving a System of Linear Equations • System of linear equations: Ax = b • A is a dense n x n matrix of coefficients • b is an n x 1 vector of RHS values • x is an n x 1 vector of unknowns • Solving for x is usually done in two stages • First, Ax = b is reduced to Ux = y, where U is a unit upper triangular matrix [U(i, j) = 0 if i > j; otherwise U(i, j) ≠ 0, and U(i, i) = 1 for 0 ≤ i < n]. This stage is called Gaussian elimination. • Second, the unknowns are solved for in reverse order, starting from x(n-1). This stage is called back-substitution. CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  26. Gaussian Elimination (1)

     do k = 0, n-1
       do j = k+1, n-1
         A(k, j) = A(k, j)/A(k, k)
       end do
       y(k) = b(k)/A(k, k)
       A(k, k) = 1
       do i = k+1, n-1
         do j = k+1, n-1
           A(i, j) = A(i, j) - A(i, k)*A(k, j)
         end do
         b(i) = b(i) - A(i, k)*y(k)
         A(i, k) = 0
       end do
     end do

  CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  27. Gaussian Elimination (2) • Computations • Approximately n²/2 divisions • Approximately n³/3 – n²/2 multiplications + subtractions • Approx. sequential run time: Ts = 2tcn³/3 CS 524 (Wi 2003/04)- Asim Karim @ LUMS
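These counts follow from summing the per-iteration work of the loops on slide 26 over the outer index k; a short worked derivation (notation as defined on slide 2):

    \[
    \text{divisions: } \sum_{k=0}^{n-1} (n-k-1) = \frac{n(n-1)}{2} \approx \frac{n^2}{2}
    \]
    \[
    \text{multiply/subtract pairs: } \sum_{k=0}^{n-1} (n-k-1)^2 = \frac{(n-1)n(2n-1)}{6} \approx \frac{n^3}{3}
    \]
    \[
    T_s \approx 2 t_c \, \frac{n^3}{3}
    \]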

  28. Striped Partitioning – p = n (1) • Data partitioning: each processor has one row of matrix A • Communication during iteration k (outer loop) • Broadcast of the active part of row k (size n–k–1) to processors k+1 to n-1 • Computation during iteration k (outer loop) • n–k–1 divisions at processor Pk • n–k–1 multiplications + subtractions at each processor Pi (k < i < n) • Parallel run time: Tp = (3/2)n(n-1)tc + nts + 0.5n(n-1)tw • Algorithm is not cost-optimal: although serial and parallel costs are both O(n³), processors Pi with i ≤ k are idle during iteration k, so the parallel cost of about (3/2)tcn³ is more than twice the serial cost of (2/3)tcn³ (an MPI sketch of this scheme follows slides 29–30) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  29. Striped Partitioning – p = n (2) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  30. Striped Partitioning – p = n (3) CS 524 (Wi 2003/04)- Asim Karim @ LUMS
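A rough MPI sketch of the synchronous striped scheme of slides 28–30 with one row per process (run with exactly N = 4 processes here); the Hilbert-matrix test system, whose solution is x = (1, ..., 1), is an assumption of the example.

    #include <mpi.h>
    #include <stdio.h>

    #define N 4   /* run with exactly N processes (p = n); example size */

    /* Row-striped Gaussian elimination, one row per process: in iteration
       k the owner of row k normalizes it and broadcasts the active part;
       processes holding rows i > k then eliminate their A(i, k). */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Test system: A(i,j) = 1/(i+j+1), b(i) = sum of row i, so the
           exact solution is x = (1, ..., 1). */
        double row[N + 1];               /* row of A followed by b entry */
        for (int j = 0; j < N; j++) row[j] = 1.0 / (rank + j + 1);
        row[N] = 0.0;
        for (int j = 0; j < N; j++) row[N] += row[j];

        double pivot[N + 1];
        for (int k = 0; k < N; k++) {
            if (rank == k) {
                /* Division step: scale row k so the pivot becomes 1. */
                double d = row[k];
                for (int j = k; j <= N; j++) row[j] /= d;
                for (int j = 0; j <= N; j++) pivot[j] = row[j];
            }
            /* Broadcast the normalized active part of row k. */
            MPI_Bcast(pivot, N + 1, MPI_DOUBLE, k, MPI_COMM_WORLD);
            /* Elimination step on rows below the pivot row. */
            if (rank > k) {
                double f = row[k];
                for (int j = k; j <= N; j++) row[j] -= f * pivot[j];
            }
        }
        printf("rank %d: y(%d) = %g\n", rank, rank, row[N]);
        MPI_Finalize();
        return 0;
    }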

  31. Pipelined Version (Striped Partitioning) • In the non-pipelined, or synchronous, version the outer loop over k is executed in order • While Pk performs the division step, all other processors are idle • During the elimination step, only processors k+1 to n-1 are active; the rest are idle • In the pipelined version, the division step, communication, and elimination step are overlapped • Each processor communicates if it has data to communicate, computes if it has computations to do, or waits if neither is possible • Cost-optimal on linear array, mesh, and hypercube interconnection networks, which have directly connected processors CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  32. Pipelined Version (2) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  33. Pipelined Version (3) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  34. Striped Partitioning – p < n (1) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  35. Striped Partitioning – p < n (2) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  36. Checkerboard Partitioning – p = n² (1) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  37. Checkerboard Partitioning – p = n² (2) • Data partitioning: Pi,j has element A(i, j) of matrix A • Communication during iteration k (outer loop) • Broadcast of A(k, k) to processors (k, k+1) through (k, n-1) in row k • Broadcast of the modified A(i, k) along row i for k < i < n • Broadcast of the modified A(k, j) along column j for k < j < n • Computation during iteration k (outer loop) • One division at each processor Pk,j in row k (k < j < n) • One multiplication + subtraction at each processor Pi,j (k < i, j < n) • Parallel run time: Tp ≈ 3ntc + 3n(ts + tw), i.e., O(1) computation and three single-word broadcasts per iteration over n iterations • Algorithm is cost-optimal since serial and parallel costs are O(n³) CS 524 (Wi 2003/04)- Asim Karim @ LUMS

  38. Back-Substitution • Solution of Ux = y, where U is a unit upper triangular matrix

     do k = n-1, 0, -1
       x(k) = y(k)
       do i = k-1, 0, -1
         y(i) = y(i) - x(k)*U(i, k)
       end do
     end do

  • Computation: approx. n²/2 multiplications + subtractions • Parallel algorithm is similar to that for the Gaussian elimination stage CS 524 (Wi 2003/04)- Asim Karim @ LUMS
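Putting slides 26 and 38 together, a small self-contained C example that runs both stages sequentially on an assumed 3 x 3 system with known solution (1, 1, 1); as on the slides, there is no pivoting, so the sketch assumes the pivots A(k, k) stay nonzero.

    #include <stdio.h>

    #define N 3   /* example size (assumption of this illustration) */

    int main(void) {
        double A[N][N] = {{2, 1, 1}, {1, 3, 2}, {1, 0, 2}};
        double b[N]    = {4, 6, 3};       /* solution is x = (1, 1, 1) */
        double y[N], x[N];

        /* Gaussian elimination (slide 26): reduce A to unit upper
           triangular U and b to y. */
        for (int k = 0; k < N; k++) {
            for (int j = k + 1; j < N; j++) A[k][j] /= A[k][k];
            y[k] = b[k] / A[k][k];
            A[k][k] = 1.0;
            for (int i = k + 1; i < N; i++) {
                for (int j = k + 1; j < N; j++)
                    A[i][j] -= A[i][k] * A[k][j];
                b[i] -= A[i][k] * y[k];
                A[i][k] = 0.0;
            }
        }
        /* Back-substitution (slide 38) on Ux = y. */
        for (int k = N - 1; k >= 0; k--) {
            x[k] = y[k];
            for (int i = k - 1; i >= 0; i--)
                y[i] -= x[k] * A[i][k];
        }
        for (int k = 0; k < N; k++) printf("x(%d) = %g\n", k, x[k]);
        return 0;
    }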
