
Matrix-Vector Multiplication Ver.1 – Rowwise Block-Striped Decomposition
2013.11.25 20126128 이창원


Presentation Transcript


  1. Matrix-Vector Multiplication Ver.1 Rowwise Block-Striped Decomposition 2013.11.25 20126128 이창원

  2. SEQUENTIAL ALGORITHM
  • Sequential matrix-vector multiplication algorithm
    Input:  a[0..m-1, 0..n-1] – matrix with dimensions m × n
            b[0..n-1] – vector with dimensions n × 1
    Output: c[0..m-1] – vector with dimensions m × 1

      for i ← 0 to m-1
         c[i] ← 0
         for j ← 0 to n-1
            c[i] ← c[i] + a[i, j] * b[j]
         endfor
      endfor

  • Matrix A size: m × n
  • Complexity of the inner product: O(n) (n multiplications and n-1 additions)
  • Total complexity: O(mn)
  • When the matrix is square, the algorithm’s complexity is O(n²)
  [Figure: the product A × b = c]
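
  For concreteness, here is a minimal C version of the sequential loop above
  (the array names follow the pseudocode; the 3 × 3 test data is made up for
  illustration):

     /* Minimal C sketch of the sequential algorithm above. */
     #include <stdio.h>

     #define M 3
     #define N 3

     int main (void) {
        double a[M][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        double b[N] = {1, 0, 2};
        double c[M];

        for (int i = 0; i < M; i++) {        /* one inner product per row */
           c[i] = 0.0;
           for (int j = 0; j < N; j++)
              c[i] += a[i][j] * b[j];        /* n multiplications, n-1 additions */
        }

        for (int i = 0; i < M; i++)
           printf ("c[%d] = %g\n", i, c[i]);
        return 0;
     }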

  3. DATA DECOMPOSITION OPTIONS
  • There are three straightforward ways to decompose an m × n matrix
  • In this presentation I’ll explain rowwise block-striped decomposition
  [Figure: rowwise decomposition, columnwise decomposition, checkerboard decomposition]

  4. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Design and Analysis 1. Data decomposition
  [Figure: primitive task i holds row i of A and the vector b; each task
  computes the inner product ci = (row i of A) · b, and an all-gather
  communication then gives every task the complete result vector c]

  5. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Design and Analysis 2. Agglomeration
  • Agglomerate primitive tasks associated with contiguous groups of rows and
    assign each of these combined tasks to a single process
  [Figure: rows i, i+1, i+2, … of A grouped into contiguous blocks owned by
  processes P0, P1, P2]
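
  The usual way to express this contiguous-block mapping in code is a set of
  macros; the program on slide 13 relies on one of them, BLOCK_SIZE, and its
  comment there (BLOCK_LOW((id)+1) - BLOCK_LOW(id)) matches these Quinn-style
  definitions:

     /* Block-decomposition macros (id = process rank, p = process count,
        n = number of rows). With these, no process owns more than
        ceil(n/p) rows – exactly the bound used on slide 6. */
     #define BLOCK_LOW(id, p, n)    ((id) * (n) / (p))
     #define BLOCK_HIGH(id, p, n)   (BLOCK_LOW((id) + 1, p, n) - 1)
     #define BLOCK_SIZE(id, p, n)   (BLOCK_LOW((id) + 1, p, n) - BLOCK_LOW(id, p, n))
     #define BLOCK_OWNER(index, p, n) (((p) * ((index) + 1) - 1) / (n))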

  6. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Design and Analysis
  • Computation (n × n matrix)
    • Sequential: O(n²)
    • Parallel: each process multiplies its portion of matrix A by vector b.
      No process is responsible for more than ⌈n/p⌉ rows. ∴ O(n²/p)
  • Communication (all-gather)
    • Each process sends ⌈log p⌉ messages (λ: the time to initiate a message)
    • ∴ O(log p + n)
  • Overall complexity: O(n²/p + log p + n)
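
  A worked count behind the O(log p + n) claim (a sketch, assuming the
  standard hypercube-style all-gather in which each process doubles its data
  every round):

     \text{rounds} = \lceil \log p \rceil, \qquad
     \text{elements sent per process}
       = \sum_{i=1}^{\lceil \log p \rceil} 2^{\,i-1}\,\frac{n}{p}
       = \frac{n\,(2^{\lceil \log p \rceil}-1)}{p}
       \approx n\,\frac{p-1}{p} \;<\; n

  So each process pays ⌈log p⌉ message latencies plus O(n) transmission time,
  which together give the O(log p + n) communication term.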

  7. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Design and Analysis
  • Scalability of parallel algorithm
    • The time complexity of the sequential algorithm is O(n²)
    • When n is reasonably large, message transmission time in the all-gather
      operation is greater than the message latency. ∴ communication
      complexity: O(n)
    • Isoefficiency function: T(n, 1) ≥ C·T₀(n, p), where
      T₀(n, p) = (p−1)·σ(n) + p·κ(n, p)
    • ∴ n² ≥ Cpn ⟹ n ≥ Cp
    • When the problem size is n, the matrix has n² elements. ∴ M(n) = n²
    • Scalability function: M(Cp)/p = C²p²/p = C²p
    • The algorithm is not highly scalable
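
  Spelling the derivation out (a sketch using the slide’s symbols: σ(n) is the
  inherently sequential work, κ(n, p) the communication overhead; here σ(n) is
  negligible and the all-gather costs each of the p processes O(n) time, so
  T₀(n, p) ≈ pn):

     T(n,1) \ge C\,T_0(n,p)
       \;\Longrightarrow\; n^2 \ge C\,p\,n
       \;\Longrightarrow\; n \ge C\,p

     \frac{M(Cp)}{p} = \frac{(Cp)^2}{p} = C^2 p

  Since the per-processor memory requirement C²p grows linearly with p,
  keeping efficiency constant demands ever more memory per processor – which
  is why the slide concludes the algorithm is not highly scalable.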

  8. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Replicating a Block-Mapped Vector
  • After each process performs its portion of the matrix-vector product, it
    has produced a block of result vector c. We must transform this
    block-mapped vector into a replicated vector
  • Each process needs to allocate memory to accommodate the entire vector c
  • The processes must concatenate their pieces of the vector into a complete
    vector
  [Figure: all-gather – P0, P1, P2 each hold one block of c before; all three
  hold the complete c after]

  9. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Function MPI_Allgatherv

     int MPI_Allgatherv (
        void        *send_buffer,    /* starting address of the data this
                                        process is contributing to the
                                        "all gather" */
        int          send_cnt,       /* number of data items this process is
                                        contributing */
        MPI_Datatype send_type,      /* type of the data items this process is
                                        contributing */
        void        *receive_buffer, /* address of the beginning of the buffer
                                        used to store the gathered elements */
        int         *receive_cnt,    /* array indicating the number of data
                                        items to be received from each
                                        process */
        int         *receive_disp,   /* array indicating, for each process, the
                                        first index in the receive buffer where
                                        that process's items should be put */
        MPI_Datatype receive_type,   /* type of the received elements */
        MPI_Comm     communicator    /* communicator in which this collective
                                        communication is occurring */
     )

  10. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Function MPI_Allgatherv
  [Figure: three processes call MPI_Allgatherv with send_cnt = 3, 4, and 4;
  each ends up with the same 11-element receive_buffer (indices 0–10), using
  receive_cnt = {3, 4, 4} and receive_disp = {0, 3, 7}]
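
  As a sketch of how slide 14’s replicate_block_vector( ) might use
  MPI_Allgatherv under the hood (illustrative, assuming the block macros shown
  after slide 5; not necessarily the textbook’s exact code):

     #include <stdlib.h>
     #include <mpi.h>

     /* Gather every process's block of the result vector into a complete,
        replicated copy on all processes. */
     void replicate_block_vector (void *ablock, int n, void *arep,
                                  MPI_Datatype dtype, MPI_Comm comm) {
        int id, p;
        MPI_Comm_rank (comm, &id);
        MPI_Comm_size (comm, &p);

        int *cnt  = (int *) malloc (p * sizeof(int));   /* receive_cnt  */
        int *disp = (int *) malloc (p * sizeof(int));   /* receive_disp */
        for (int i = 0; i < p; i++) {
           cnt[i]  = BLOCK_SIZE(i, p, n);   /* block size owned by process i */
           disp[i] = BLOCK_LOW(i, p, n);    /* where that block lands in arep */
        }

        MPI_Allgatherv (ablock, cnt[id], dtype, arep, cnt, disp, dtype, comm);
        free (cnt);
        free (disp);
     }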

  11. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Replicated Vector Input/Output
  • Process p−1 tries to open the data file for reading
  • If it can open the file, it reads n and broadcasts it to the other
    processes
  • Every process allocates memory to store the vector
  • Process p−1 reads the vector and broadcasts it to the other processes
  • Function for read: read_replicated_vector( )
  • Function for print: print_replicated_vector( )
  • Ensure that only a single process executes the calls to printf( )
  [Figure: file layout – the number of elements n, followed by the n vector
  elements, e.g. 8, 13, …, 1]
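
  A sketch of how read_replicated_vector( ) could implement these steps
  (illustrative, assuming the binary file layout in the figure above; the
  signature matches the call on slide 13, but this is not necessarily the
  textbook’s exact code):

     #include <stdio.h>
     #include <stdlib.h>
     #include <mpi.h>

     void read_replicated_vector (char *filename, void **v, MPI_Datatype dtype,
                                  int *n, MPI_Comm comm) {
        int id, p, size;
        MPI_Comm_rank (comm, &id);
        MPI_Comm_size (comm, &p);
        MPI_Type_size (dtype, &size);   /* bytes per vector element */

        *v = NULL;
        if (id == p - 1) {              /* only process p-1 touches the file */
           FILE *f = fopen (filename, "r");
           *n = 0;
           if (f != NULL) {             /* a real program would abort here
                                           if the open fails */
              fread (n, sizeof(int), 1, f);
              *v = malloc (*n * size);
              fread (*v, size, *n, f);
              fclose (f);
           }
        }
        MPI_Bcast (n, 1, MPI_INT, p - 1, comm);    /* share the element count */
        if (id != p - 1) *v = malloc (*n * size);
        MPI_Bcast (*v, *n, dtype, p - 1, comm);    /* share the vector itself */
     }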

  12. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Documenting the Parallel Program

     typedef double dtype;
     #define mpitype MPI_DOUBLE

     int main (int argc, char *argv[]) {
        dtype **a;        /* First factor, a matrix */
        dtype  *b;        /* Second factor, a vector */
        dtype  *c_block;  /* Partial product vector */
        dtype  *c;        /* Replicated product vector */
        dtype  *storage;  /* Matrix elements stored here */
        int     i, j;     /* Loop indices */
        int     id;       /* Process ID number */
        int     m;        /* Rows in matrix */
        int     n;        /* Columns in matrix */
        int     nprime;   /* Elements in vector */
        int     p;        /* Number of processes */
        int     rows;     /* Number of rows on this process */

  13. ROWWISE BLOCK-STRIPED DECOMPOSITION

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &id);
        MPI_Comm_size (MPI_COMM_WORLD, &p);

        read_row_striped_matrix (argv[1], (void *) &a, (void *) &storage,
           mpitype, &m, &n, MPI_COMM_WORLD);
        rows = BLOCK_SIZE (id, p, m);  /* = BLOCK_LOW((id)+1) - BLOCK_LOW(id) */
        print_row_striped_matrix ((void **) a, mpitype, m, n, MPI_COMM_WORLD);
        read_replicated_vector (argv[2], (void *) &b, mpitype, &nprime,
           MPI_COMM_WORLD);
        print_replicated_vector (b, mpitype, nprime, MPI_COMM_WORLD);

  14. ROWWISE BLOCK-STRIPED DECOMPOSITION

        c_block = (dtype *) malloc (rows * sizeof(dtype));
        c = (dtype *) malloc (n * sizeof(dtype));
        for (i = 0; i < rows; i++) {
           c_block[i] = 0.0;
           for (j = 0; j < n; j++)       /* inner product of local row i with b */
              c_block[i] += a[i][j] * b[j];
        }
        replicate_block_vector (c_block, n, (void *) c, mpitype,
           MPI_COMM_WORLD);
        print_replicated_vector (c, mpitype, n, MPI_COMM_WORLD);
        MPI_Finalize ();
        return 0;
     }

  15. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Benchmarking
  • χ: the time needed to compute a single iteration of the loop performing
    the inner product
  • λ: the time to initiate a message (latency)
  • β: the number of data items that can be sent down a channel in one unit
    of time (bandwidth)
  • Computation time: χ · n · ⌈n/p⌉ (χ = 1 addition time + 1 multiplication
    time)
  • All-gather communication time: each vector element is a double-precision
    floating-point number occupying 8 bytes
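
  Combining the two terms, the predicted parallel run time would look like
  this (a sketch assuming the hypercube-style all-gather counted on slide 6,
  with β measured in bytes per unit time and 8 bytes per element as stated
  above):

     T_p \;\approx\;
       \underbrace{\chi\, n \left\lceil \frac{n}{p} \right\rceil}_{\text{computation}}
       \;+\; \underbrace{\lambda \lceil \log p \rceil
       \;+\; \frac{8\,n}{\beta}\cdot\frac{p-1}{p}}_{\text{all-gather}}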

  16. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Benchmarking
  • Benchmarking on a commodity cluster of 450 MHz Pentium II processors
    connected by fast Ethernet reveals that χ = 63.4 nsec, λ = 250 μsec, and
    β = 10⁶ bytes/sec
  • Megaflops = total number of floating-point operations / (execution time ×
    10⁶)
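
  Purely as an illustration, these constants can be plugged into the
  predicted-time sketch after slide 15; the model, and the choice n = 1000,
  are assumptions for demonstration, not figures from the presentation:

     /* Illustration: predicted time and megaflops from the sketched model.
        Compile with -lm. The constants come from slide 16; n = 1000 is an
        assumed problem size. */
     #include <stdio.h>
     #include <math.h>

     int main (void) {
        double chi    = 63.4e-9;   /* sec per inner-product iteration */
        double lambda = 250e-6;    /* sec to initiate a message */
        double beta   = 1e6;       /* bytes per second */
        int n = 1000;

        for (int p = 1; p <= 16; p *= 2) {
           double comp = chi * n * ceil ((double) n / p);   /* computation */
           double comm = (p > 1)                            /* all-gather */
              ? lambda * ceil (log2 (p)) + 8.0 * n * (p - 1) / (beta * p)
              : 0.0;
           double t = comp + comm;
           /* ~2n^2 flops: n multiplications + n-1 additions per row */
           double mflops = 2.0 * n * n / (t * 1e6);
           printf ("p=%2d  predicted time = %.4f s  megaflops = %.1f\n",
                   p, t, mflops);
        }
        return 0;
     }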

  17. Q&A
