
Matrix-Vector Multiplication Ver.1 – Rowwise Block-Striped Decomposition
2013.11.25 20126128 이창원


Presentation Transcript


  1. Matrix-Vector Multiplication Ver.1 Rowwise Block-Striped Decomposition 2013.11.25 20126128 이창원

  2. SEQUENTIAL ALGORITHM
  • Sequential matrix-vector multiplication algorithm
    Input:  a[0..m-1, 0..n-1] – matrix with dimensions m × n
            b[0..n-1] – vector with dimensions n × 1
    Output: c[0..m-1] – vector with dimensions m × 1

      for i ← 0 to m-1
         c[i] ← 0
         for j ← 0 to n-1
            c[i] ← c[i] + a[i, j] * b[j]
         endfor
      endfor

  • Matrix A size: m × n
  • Complexity of the inner product: O(n) (n multiplications and n-1 additions)
  • Total complexity: O(mn)
  • When the matrix is square, the algorithm’s complexity is O(n²)
  [Figure: the product A × b = c]
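
  For concreteness, here is a minimal C version of the sequential loop above
  (the array names follow the pseudocode; the 3 × 3 test data is made up for
  illustration):

     /* Minimal C sketch of the sequential algorithm above. */
     #include <stdio.h>

     #define M 3
     #define N 3

     int main (void) {
        double a[M][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        double b[N] = {1, 0, 2};
        double c[M];

        for (int i = 0; i < M; i++) {        /* one inner product per row */
           c[i] = 0.0;
           for (int j = 0; j < N; j++)
              c[i] += a[i][j] * b[j];        /* n multiplications, n-1 additions */
        }

        for (int i = 0; i < M; i++)
           printf ("c[%d] = %g\n", i, c[i]);
        return 0;
     }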

  3. DATA DECOMPOSITION OPTIONS
  • There are three straightforward ways to decompose an m × n matrix
  • In this presentation I’ll explain rowwise block-striped decomposition
  [Figure: rowwise decomposition, columnwise decomposition, checkerboard decomposition]

  4. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Design and Analysis 1. Data decomposition
  [Figure: primitive task i holds row i of A and the vector b; each task
  computes the inner product ci = (row i of A) · b, and an all-gather
  communication then gives every task the complete result vector c]

  5. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Design and Analysis 2. Agglomeration
  • Agglomerate primitive tasks associated with contiguous groups of rows and
    assign each of these combined tasks to a single process
  [Figure: rows i, i+1, i+2, … of A grouped into contiguous blocks owned by
  processes P0, P1, P2]
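
  The usual way to express this contiguous-block mapping in code is a set of
  macros; the program on slide 13 relies on one of them, BLOCK_SIZE, and its
  comment there (BLOCK_LOW((id)+1) - BLOCK_LOW(id)) matches these Quinn-style
  definitions:

     /* Block-decomposition macros (id = process rank, p = process count,
        n = number of rows). With these, no process owns more than
        ceil(n/p) rows – exactly the bound used on slide 6. */
     #define BLOCK_LOW(id, p, n)    ((id) * (n) / (p))
     #define BLOCK_HIGH(id, p, n)   (BLOCK_LOW((id) + 1, p, n) - 1)
     #define BLOCK_SIZE(id, p, n)   (BLOCK_LOW((id) + 1, p, n) - BLOCK_LOW(id, p, n))
     #define BLOCK_OWNER(index, p, n) (((p) * ((index) + 1) - 1) / (n))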

  6. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Design and Analysis
  • Computation (n × n matrix)
    • Sequential: O(n²)
    • Parallel: each process multiplies its portion of matrix A by vector b.
      No process is responsible for more than ⌈n/p⌉ rows. ∴ O(n²/p)
  • Communication (all-gather)
    • Each process sends ⌈log p⌉ messages (λ: the time to initiate a message)
    • ∴ O(log p + n)
  • Overall complexity: O(n²/p + log p + n)
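
  A worked count behind the O(log p + n) claim (a sketch, assuming the
  standard hypercube-style all-gather in which each process doubles its data
  every round):

     \text{rounds} = \lceil \log p \rceil, \qquad
     \text{elements sent per process}
       = \sum_{i=1}^{\lceil \log p \rceil} 2^{\,i-1}\,\frac{n}{p}
       = \frac{n\,(2^{\lceil \log p \rceil}-1)}{p}
       \approx n\,\frac{p-1}{p} \;<\; n

  So each process pays ⌈log p⌉ message latencies plus O(n) transmission time,
  which together give the O(log p + n) communication term.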

  7. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Design and Analysis
  • Scalability of parallel algorithm
    • The time complexity of the sequential algorithm is O(n²)
    • When n is reasonably large, message transmission time in the all-gather
      operation is greater than the message latency. ∴ communication
      complexity: O(n)
    • Isoefficiency function: T(n, 1) ≥ C·T₀(n, p), where
      T₀(n, p) = (p−1)·σ(n) + p·κ(n, p)
    • ∴ n² ≥ Cpn ⟹ n ≥ Cp
    • When the problem size is n, the matrix has n² elements. ∴ M(n) = n²
    • Scalability function: M(Cp)/p = C²p²/p = C²p
    • The algorithm is not highly scalable
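
  Spelling the derivation out (a sketch using the slide’s symbols: σ(n) is the
  inherently sequential work, κ(n, p) the communication overhead; here σ(n) is
  negligible and the all-gather costs each of the p processes O(n) time, so
  T₀(n, p) ≈ pn):

     T(n,1) \ge C\,T_0(n,p)
       \;\Longrightarrow\; n^2 \ge C\,p\,n
       \;\Longrightarrow\; n \ge C\,p

     \frac{M(Cp)}{p} = \frac{(Cp)^2}{p} = C^2 p

  Since the per-processor memory requirement C²p grows linearly with p,
  keeping efficiency constant demands ever more memory per processor – which
  is why the slide concludes the algorithm is not highly scalable.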

  8. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Replicating a Block-Mapped Vector
  • After each process performs its portion of the matrix-vector product, it
    has produced a block of result vector c. We must transform this
    block-mapped vector into a replicated vector
  • Each process needs to allocate memory to accommodate the entire vector c
  • The processes must concatenate their pieces of the vector into a complete
    vector
  [Figure: all-gather – P0, P1, P2 each hold one block of c before; all three
  hold the complete c after]

  9. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Function MPI_Allgatherv

     int MPI_Allgatherv (
        void        *send_buffer,    /* starting address of the data this
                                        process is contributing to the
                                        "all gather" */
        int          send_cnt,       /* number of data items this process is
                                        contributing */
        MPI_Datatype send_type,      /* type of the data items this process is
                                        contributing */
        void        *receive_buffer, /* address of the beginning of the buffer
                                        used to store the gathered elements */
        int         *receive_cnt,    /* array indicating the number of data
                                        items to be received from each
                                        process */
        int         *receive_disp,   /* array indicating, for each process, the
                                        first index in the receive buffer where
                                        that process's items should be put */
        MPI_Datatype receive_type,   /* type of the received elements */
        MPI_Comm     communicator    /* communicator in which this collective
                                        communication is occurring */
     )

  10. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Function MPI_Allgatherv
  [Figure: three processes call MPI_Allgatherv with send_cnt = 3, 4, and 4;
  each ends up with the same 11-element receive_buffer (indices 0–10), using
  receive_cnt = {3, 4, 4} and receive_disp = {0, 3, 7}]
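
  As a sketch of how slide 14’s replicate_block_vector( ) might use
  MPI_Allgatherv under the hood (illustrative, assuming the block macros shown
  after slide 5; not necessarily the textbook’s exact code):

     #include <stdlib.h>
     #include <mpi.h>

     /* Gather every process's block of the result vector into a complete,
        replicated copy on all processes. */
     void replicate_block_vector (void *ablock, int n, void *arep,
                                  MPI_Datatype dtype, MPI_Comm comm) {
        int id, p;
        MPI_Comm_rank (comm, &id);
        MPI_Comm_size (comm, &p);

        int *cnt  = (int *) malloc (p * sizeof(int));   /* receive_cnt  */
        int *disp = (int *) malloc (p * sizeof(int));   /* receive_disp */
        for (int i = 0; i < p; i++) {
           cnt[i]  = BLOCK_SIZE(i, p, n);   /* block size owned by process i */
           disp[i] = BLOCK_LOW(i, p, n);    /* where that block lands in arep */
        }

        MPI_Allgatherv (ablock, cnt[id], dtype, arep, cnt, disp, dtype, comm);
        free (cnt);
        free (disp);
     }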

  11. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Replicated Vector Input/Output
  • Process p−1 tries to open the data file for reading
  • If it can open the file, it reads n and broadcasts it to the other
    processes
  • Every process allocates memory to store the vector
  • Process p−1 reads the vector and broadcasts it to the other processes
  • Function for read: read_replicated_vector( )
  • Function for print: print_replicated_vector( )
  • Ensure that only a single process executes the calls to printf( )
  [Figure: file layout – the number of elements n, followed by the n vector
  elements, e.g. 8, 13, …, 1]
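
  A sketch of how read_replicated_vector( ) could implement these steps
  (illustrative, assuming the binary file layout in the figure above; the
  signature matches the call on slide 13, but this is not necessarily the
  textbook’s exact code):

     #include <stdio.h>
     #include <stdlib.h>
     #include <mpi.h>

     void read_replicated_vector (char *filename, void **v, MPI_Datatype dtype,
                                  int *n, MPI_Comm comm) {
        int id, p, size;
        MPI_Comm_rank (comm, &id);
        MPI_Comm_size (comm, &p);
        MPI_Type_size (dtype, &size);   /* bytes per vector element */

        *v = NULL;
        if (id == p - 1) {              /* only process p-1 touches the file */
           FILE *f = fopen (filename, "r");
           *n = 0;
           if (f != NULL) {             /* a real program would abort here
                                           if the open fails */
              fread (n, sizeof(int), 1, f);
              *v = malloc (*n * size);
              fread (*v, size, *n, f);
              fclose (f);
           }
        }
        MPI_Bcast (n, 1, MPI_INT, p - 1, comm);    /* share the element count */
        if (id != p - 1) *v = malloc (*n * size);
        MPI_Bcast (*v, *n, dtype, p - 1, comm);    /* share the vector itself */
     }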

  12. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Documenting the Parallel Program

     typedef double dtype;
     #define mpitype MPI_DOUBLE

     int main (int argc, char *argv[]) {
        dtype **a;        /* First factor, a matrix */
        dtype  *b;        /* Second factor, a vector */
        dtype  *c_block;  /* Partial product vector */
        dtype  *c;        /* Replicated product vector */
        dtype  *storage;  /* Matrix elements stored here */
        int     i, j;     /* Loop indices */
        int     id;       /* Process ID number */
        int     m;        /* Rows in matrix */
        int     n;        /* Columns in matrix */
        int     nprime;   /* Elements in vector */
        int     p;        /* Number of processes */
        int     rows;     /* Number of rows on this process */

  13. ROWWISE BLOCK-STRIPED DECOMPOSITION

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &id);
        MPI_Comm_size (MPI_COMM_WORLD, &p);

        read_row_striped_matrix (argv[1], (void *) &a, (void *) &storage,
           mpitype, &m, &n, MPI_COMM_WORLD);
        rows = BLOCK_SIZE (id, p, m);  /* = BLOCK_LOW((id)+1) - BLOCK_LOW(id) */
        print_row_striped_matrix ((void **) a, mpitype, m, n, MPI_COMM_WORLD);
        read_replicated_vector (argv[2], (void *) &b, mpitype, &nprime,
           MPI_COMM_WORLD);
        print_replicated_vector (b, mpitype, nprime, MPI_COMM_WORLD);

  14. ROWWISE BLOCK-STRIPED DECOMPOSITION

        c_block = (dtype *) malloc (rows * sizeof(dtype));
        c = (dtype *) malloc (n * sizeof(dtype));
        for (i = 0; i < rows; i++) {
           c_block[i] = 0.0;
           for (j = 0; j < n; j++)       /* inner product of local row i with b */
              c_block[i] += a[i][j] * b[j];
        }
        replicate_block_vector (c_block, n, (void *) c, mpitype,
           MPI_COMM_WORLD);
        print_replicated_vector (c, mpitype, n, MPI_COMM_WORLD);
        MPI_Finalize ();
        return 0;
     }

  15. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Benchmarking
  • χ: the time needed to compute a single iteration of the loop performing
    the inner product
  • λ: the time to initiate a message (latency)
  • β: the number of data items that can be sent down a channel in one unit
    of time (bandwidth)
  • Computation time: χ · n · ⌈n/p⌉ (χ = 1 addition time + 1 multiplication
    time)
  • All-gather communication time: each vector element is a double-precision
    floating-point number occupying 8 bytes
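
  Combining the two terms, the predicted parallel run time would look like
  this (a sketch assuming the hypercube-style all-gather counted on slide 6,
  with β measured in bytes per unit time and 8 bytes per element as stated
  above):

     T_p \;\approx\;
       \underbrace{\chi\, n \left\lceil \frac{n}{p} \right\rceil}_{\text{computation}}
       \;+\; \underbrace{\lambda \lceil \log p \rceil
       \;+\; \frac{8\,n}{\beta}\cdot\frac{p-1}{p}}_{\text{all-gather}}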

  16. ROWWISE BLOCK-STRIPED DECOMPOSITION
  • Benchmarking
  • Benchmarking on a commodity cluster of 450 MHz Pentium II processors
    connected by fast Ethernet reveals that χ = 63.4 nsec, λ = 250 μsec, and
    β = 10⁶ bytes/sec
  • Megaflops = total number of floating-point operations / (execution time ×
    10⁶)
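
  Purely as an illustration, these constants can be plugged into the
  predicted-time sketch after slide 15; the model, and the choice n = 1000,
  are assumptions for demonstration, not figures from the presentation:

     /* Illustration: predicted time and megaflops from the sketched model.
        Compile with -lm. The constants come from slide 16; n = 1000 is an
        assumed problem size. */
     #include <stdio.h>
     #include <math.h>

     int main (void) {
        double chi    = 63.4e-9;   /* sec per inner-product iteration */
        double lambda = 250e-6;    /* sec to initiate a message */
        double beta   = 1e6;       /* bytes per second */
        int n = 1000;

        for (int p = 1; p <= 16; p *= 2) {
           double comp = chi * n * ceil ((double) n / p);   /* computation */
           double comm = (p > 1)                            /* all-gather */
              ? lambda * ceil (log2 (p)) + 8.0 * n * (p - 1) / (beta * p)
              : 0.0;
           double t = comp + comm;
           /* ~2n^2 flops: n multiplications + n-1 additions per row */
           double mflops = 2.0 * n * n / (t * 1e6);
           printf ("p=%2d  predicted time = %.4f s  megaflops = %.1f\n",
                   p, t, mflops);
        }
        return 0;
     }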

  17. Q&A
