
Lecture 9 Architecture Independent (MPI) Algorithm Design




Presentation Transcript


  1. Lecture 9 Architecture Independent (MPI) Algorithm Design Parallel Computing Spring 2010

  2. Matrix Computations
  • SPMD program design stipulates that all processors execute a single program on different pieces of data. For matrix-related computations it therefore makes sense to distribute a matrix evenly among the p processors of a parallel computer. Such a distribution should also take into account how the matrix is stored (e.g. by the compiler), so that locality is exploited (filling cache lines efficiently to speed up computation). There are various ways to divide a matrix; some of the most common are described below.
  • One way to distribute a matrix is with block distributions. Split the array into blocks of size (n/p1) × (n/p2), where p = p1 × p2, and assign the i-th block to processor i (see the sketch after this slide). This distribution is suitable as long as the amount of work is the same for all elements of the matrix.
  • The most common block distributions are:
    • column-wise (block) distribution. Split the matrix into p column stripes so that the n/p consecutive columns forming the i-th stripe are stored on processor i. This is p1 = 1 and p2 = p.
    • row-wise (block) distribution. Split the matrix into p row stripes so that the n/p consecutive rows forming the i-th stripe are stored on processor i. This is p1 = p and p2 = 1.
    • block or square distribution. This is the case p1 = p2 = √p, i.e. the blocks are of size (n/√p) × (n/√p), and block i is stored on processor i.
  • There are certain cases (e.g. LU decomposition, Cholesky factorization) where the amount of work differs for different elements of the matrix. For these cases block distributions are not suitable.
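As an illustration, here is a minimal sketch in plain C of how the owner of element (i, j) can be computed under a p1 × p2 block distribution; the helper name block_owner and the row-major numbering of the process mesh are assumptions, not part of the slides. The column-wise, row-wise and square cases fall out by choosing p1 and p2 accordingly.

  #include <stdio.h>

  /* Owner of element (i, j) of an n x n matrix under a block distribution on a
   * p1 x p2 process mesh (p = p1 * p2). Blocks are (n/p1) x (n/p2), the mesh is
   * numbered row-major, and n is assumed divisible by p1 and p2. */
  int block_owner(int i, int j, int n, int p1, int p2) {
      int block_row = i / (n / p1);   /* which stripe of block rows holds row i    */
      int block_col = j / (n / p2);   /* which stripe of block columns holds col j */
      return block_row * p2 + block_col;
  }

  int main(void) {
      int n = 8;
      printf("%d\n", block_owner(5, 2, n, 2, 2)); /* square distribution: prints 2      */
      printf("%d\n", block_owner(5, 2, n, 4, 1)); /* row-wise distribution: prints 2    */
      printf("%d\n", block_owner(5, 2, n, 1, 4)); /* column-wise distribution: prints 1 */
      return 0;
  }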

  3. Matrix block distributions

  4. Matrix-Vector Multiplication
  • Sequential algorithm: n^2 multiplications and additions, so the running time is O(n^2).

    MAT_VECT(A, x, y) {
        for i = 0 to n-1 {
            y[i] = 0;
            for j = 0 to n-1
                y[i] = y[i] + A[i][j] * x[j];
        }
    }

  5. Matrix-Vector Multiplication: Rowwise 1-D Partitioning
  • Assume p = n (p – no. of processors).
  • Steps:
    • Step 1: Initial partition of matrix and vector:
      • Matrix distribution: each process gets one complete row of the matrix.
      • Vector distribution: the n × 1 vector is distributed such that each process owns one of its elements.
    • Step 2: All-to-all broadcast:
      • Every process has one element of the vector, but every process needs the entire vector.
    • Step 3: Computation:
      • Process Pi computes y[i] = Σj A[i][j] · x[j].
  • Running time:
    • All-to-all broadcast: Θ(n) on any architecture.
    • Multiplication of a single row of A with the vector x is Θ(n).
    • Total running time is Θ(n).
    • Total work is Θ(n^2) – cost-optimal.
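A minimal MPI sketch of these steps for the p = n case (one row of A and one element of x per process); names such as my_row and my_x are illustrative assumptions:

  #include <mpi.h>
  #include <stdlib.h>

  /* Assumes p == n: process i owns row i of A (my_row) and element x[i] (my_x). */
  void mat_vect_row_p_eq_n(const double *my_row, double my_x, double *my_y, int n) {
      double *x = malloc(n * sizeof(double));
      /* Step 2: all-to-all broadcast; each process contributes its single element. */
      MPI_Allgather(&my_x, 1, MPI_DOUBLE, x, 1, MPI_DOUBLE, MPI_COMM_WORLD);
      /* Step 3: dot product of the local row with the full vector, Theta(n). */
      *my_y = 0.0;
      for (int j = 0; j < n; j++)
          *my_y += my_row[j] * x[j];
      free(x);
  }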

  6. Matrix-Vector Multiplication: Rowwise 1-D Partitioning

  7. Matrix-Vector Multiplication: Rowwise 1-D Partitioning
  • Assume p < n (p – no. of processors).
  • Three steps:
    • Step 1: Initial partition of matrix and vector:
      • Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.
    • Step 2: All-to-all broadcast:
      • Among p processes, with messages of size n/p.
    • Step 3: Computation:
      • Each process multiplies its n/p rows of the matrix with the vector x to produce n/p elements of the result vector.
  • Running time:
    • All-to-all broadcast:
      • T = (ts + (n/p)·tw)(p − 1) on any architecture
      • T = ts·log p + (n/p)·tw·(p − 1) on a hypercube
    • Computation: T = n · (n/p) = Θ(n^2/p)
    • Total running time: T = Θ(n^2/p + ts·log p + n·tw)
    • Total work: W = Θ(n^2 + ts·p·log p + n·p·tw) – cost-optimal
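A hedged MPI sketch of the general p ≤ n case (n divisible by p for simplicity); local_A, local_x, local_y and the row-major layout are assumptions, not code from the slides:

  #include <mpi.h>
  #include <stdlib.h>

  /* Rowwise 1-D partitioning: each process owns n/p rows of A (local_A, row-major),
   * n/p elements of x (local_x) and produces n/p elements of y (local_y). */
  void mat_vect_row_1d(const double *local_A, const double *local_x,
                       double *local_y, int n) {
      int p;
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      int nloc = n / p;
      double *x = malloc(n * sizeof(double));

      /* Step 2: all-to-all broadcast of the vector, messages of size n/p. */
      MPI_Allgather(local_x, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, MPI_COMM_WORLD);

      /* Step 3: multiply the n/p local rows with the full vector, Theta(n^2/p). */
      for (int i = 0; i < nloc; i++) {
          local_y[i] = 0.0;
          for (int j = 0; j < n; j++)
              local_y[i] += local_A[i * n + j] * x[j];
      }
      free(x);
  }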

  8. Matrix-Vector Multiplication: Columnwise 1-D Partitioning • Similar to rowwise 1-D Partitioning

  9. Matrix-Vector Multiplication: 2-D Partitioning
  • Assume p = n^2.
  • Steps:
    • Step 1: Initial partitioning:
      • Each process gets one element of the matrix.
      • The vector is distributed only among the processes on the diagonal, each of which owns one element.
    • Step 2: Columnwise one-to-all broadcast:
      • The i-th element of the vector must be available to the i-th element of each row of the matrix, i.e. to every process in the i-th column. So this step consists of n simultaneous one-to-all broadcast operations, one in each column of processes.
    • Step 3: Computation:
      • Each process multiplies its matrix element with the corresponding element of x.
    • Step 4: All-to-one reduction of partial results:
      • The products computed for each row must be added, leaving the sums in the last column of processes.
  • Running time:
    • One-to-all broadcast: Θ(log n)
    • Computation in each process: Θ(1)
    • All-to-one reduction: Θ(log n)
    • Total running time: Θ(log n)
    • Total work: Θ(n^2 log n) – not cost-optimal

  10. Matrix-Vector Multiplication: 2-D Partitioning

  11. Matrix-Vector Multiplication: 2-D Partitioning
  • Assume p < n^2.
  • Steps:
    • Step 1: Initial partitioning:
      • Each process gets an (n/√p) × (n/√p) block of the matrix.
      • The vector is distributed only among the processes on the diagonal, each of which owns n/√p elements.
    • Step 2: Columnwise one-to-all broadcast:
      • The i-th group of n/√p vector elements must be available to the i-th block of each row of the matrix, i.e. to every process in the i-th column. So this step consists of √p simultaneous one-to-all broadcast operations, one in each column of processes.
    • Step 3: Computation:
      • Each process multiplies its (n/√p) × (n/√p) block of the matrix with the corresponding n/√p elements of x.
    • Step 4: All-to-one reduction of partial results:
      • The products computed for each row must be added, leaving the sums in the last column of processes.
  • Running time:
    • Columnwise one-to-all broadcast: T = (ts + (n/√p)·tw)·log √p on any architecture
    • Computation in each process: T = (n/√p)·(n/√p) = n^2/p
    • All-to-one reduction: T = (ts + (n/√p)·tw)·log √p on any architecture
    • Total running time: T = n^2/p + 2(ts + (n/√p)·tw)·log √p = n^2/p + (ts + (n/√p)·tw)·log p on any architecture
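A hedged MPI sketch of these four steps, following the slide's layout (vector on the diagonal, sums left in the last column of processes); the communicator handling, local_A/local_x/local_y and the row-major numbering of the √p × √p mesh are assumptions:

  #include <mpi.h>
  #include <stdlib.h>
  #include <math.h>

  /* 2-D partitioning: local_A is this process's (n/sqrt(p)) x (n/sqrt(p)) block,
   * local_x holds n/sqrt(p) vector elements on diagonal processes only, and
   * local_y receives n/sqrt(p) result elements on the last column only. */
  void mat_vect_2d(const double *local_A, const double *local_x, double *local_y, int n) {
      int rank, p;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      int q = (int)(sqrt((double)p) + 0.5);       /* sqrt(p) x sqrt(p) process mesh */
      int nloc = n / q;
      int my_row = rank / q, my_col = rank % q;

      MPI_Comm row_comm, col_comm;
      MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, &row_comm); /* one comm per mesh row    */
      MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, &col_comm); /* one comm per mesh column */

      /* Step 2: columnwise one-to-all broadcast; the diagonal process of each
       * column (mesh row == mesh column) is the root. */
      double *x = malloc(nloc * sizeof(double));
      if (my_row == my_col)
          for (int i = 0; i < nloc; i++) x[i] = local_x[i];
      MPI_Bcast(x, nloc, MPI_DOUBLE, my_col, col_comm);

      /* Step 3: multiply the local block with the local piece of x, Theta(n^2/p). */
      double *partial = calloc(nloc, sizeof(double));
      for (int i = 0; i < nloc; i++)
          for (int j = 0; j < nloc; j++)
              partial[i] += local_A[i * nloc + j] * x[j];

      /* Step 4: all-to-one reduction along each row; sums land in the last column. */
      MPI_Reduce(partial, local_y, nloc, MPI_DOUBLE, MPI_SUM, q - 1, row_comm);

      free(x); free(partial);
      MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
  }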

  12. Matrix-Vector Multiplication: 1-D Partitioning vs. 2-D Partitioning • Matrix-vector multiplication is faster with block 2-D partitioning of the matrix than with block 1-D partitioning for the same number of processes. • If the number of processes is greater than n, then the 1-D partitioning cannot be used. • If the number of processes is less than or equal to n, 2-D partitioning is preferable.

  13. Matrix Distributions: Block Cyclic
  • In block-cyclic distributions the rows (similarly for columns) are split into q groups of n/q consecutive rows per group, where potentially q > p, and the i-th group is assigned to a processor in a cyclic fashion.
    • column-cyclic distribution. This is a one-dimensional cyclic distribution. Split the matrix into q column stripes so that the n/q consecutive columns forming the i-th stripe are stored on processor i % p, where % is the mod (remainder of the division) operator. Usually q > p. The term wrapped-around column distribution is sometimes used for the case n/q = 1, i.e. q = n.
    • row-cyclic distribution. This is a one-dimensional cyclic distribution. Split the matrix into q row stripes so that the n/q consecutive rows forming the i-th stripe are stored on processor i % p. Usually q > p. The term wrapped-around row distribution is sometimes used for the case n/q = 1, i.e. q = n.
    • scattered distribution. Let p = qi · qj processors be divided into qj groups, each group Pj consisting of qi processors; specifically, Pj = { j·qi + l | 0 ≤ l ≤ qi − 1 }. Processor j·qi + l is called the l-th processor of group Pj. Matrix element (i, j), 0 ≤ i, j < n, is then assigned to the (i mod qi)-th processor of group P(j mod qj). A scattered distribution refers to the special case qi = qj = √p.
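A minimal sketch of the column-cyclic ownership rule (plain C; the helper name is a hypothetical illustration, not from the slides):

  /* Column-cyclic distribution: n columns are split into q stripes of n/q
   * consecutive columns, and stripe s is stored on processor s % p. */
  int column_cyclic_owner(int j, int n, int q, int p) {
      int stripe = j / (n / q);   /* which of the q column stripes holds column j   */
      return stripe % p;          /* stripes are dealt out to processors cyclically */
  }

For example, with n = 8, q = 8 (the wrapped-around case) and p = 4, columns 0 and 4 go to processor 0, columns 1 and 5 to processor 1, and so on.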

  14. Block cyclic distributions

  15. Scattered Distribution

  16. Matrix Multiplication – Serial algorithm

  17. Matrix Multiplication
  • The algorithm for matrix multiplication described below appears in the seminal work of Valiant. It works for p ≤ n^2.
  • Three steps:
    • Initial partitioning: matrices A and B are partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each. These blocks are mapped onto a √p × √p logical mesh of processes. The processes are labeled from P0,0 to P√p−1,√p−1.
    • All-to-all broadcast: process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p. To acquire all the required blocks, an all-to-all broadcast of matrix A's blocks is performed in each row of processes, and an all-to-all broadcast of matrix B's blocks is performed in each column.
    • Computation: after Pi,j acquires Ai,0, Ai,1, …, Ai,√p−1 and B0,j, B1,j, …, B√p−1,j, it performs the submatrix multiplication and addition steps of lines 7 and 8 in Alg 8.3.
  • Running time:
    • All-to-all broadcast (in each row/column of √p processes, blocks of size n^2/p):
      • T = (ts + (n^2/p)·tw)(√p − 1) on any architecture
      • T = ts·log √p + (n^2/p)·tw·(√p − 1) on a hypercube
    • Computation: T = √p · (n/√p)^3 = n^3/p
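A hedged MPI sketch of the two all-to-all broadcast steps and the local computation; the row-major numbering of the √p × √p mesh, the row-major block layout, and all names are assumptions rather than the slides' code:

  #include <mpi.h>
  #include <stdlib.h>
  #include <math.h>

  /* a, b: this process's m x m blocks Ai,j and Bi,j (m = n/sqrt(p));
   * c: the resulting m x m block Ci,j. */
  void mat_mult_blocks(const double *a, const double *b, double *c, int n) {
      int rank, p;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      int q = (int)(sqrt((double)p) + 0.5);
      int m = n / q, bs = m * m;
      int my_row = rank / q, my_col = rank % q;

      MPI_Comm row_comm, col_comm;
      MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, &row_comm);
      MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, &col_comm);

      /* All-to-all broadcast: collect Ai,0..Ai,q-1 along the row of processes
       * and B0,j..Bq-1,j along the column of processes. */
      double *arow = malloc(q * bs * sizeof(double));
      double *bcol = malloc(q * bs * sizeof(double));
      MPI_Allgather(a, bs, MPI_DOUBLE, arow, bs, MPI_DOUBLE, row_comm);
      MPI_Allgather(b, bs, MPI_DOUBLE, bcol, bs, MPI_DOUBLE, col_comm);

      /* Computation: Ci,j = sum over k of Ai,k * Bk,j, Theta(n^3/p) work. */
      for (int x = 0; x < bs; x++) c[x] = 0.0;
      for (int k = 0; k < q; k++) {
          const double *Ak = arow + k * bs, *Bk = bcol + k * bs;
          for (int r = 0; r < m; r++)
              for (int t = 0; t < m; t++)
                  for (int s = 0; s < m; s++)
                      c[r * m + t] += Ak[r * m + s] * Bk[s * m + t];
      }
      free(arow); free(bcol);
      MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
  }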

  18. Matrix Multiplication
  • The input matrices A and B are divided into p block-submatrices, each of dimension m × m, where m = n/√p. We call this distribution of the input among the processors a block distribution. This way, element A(i, j), 0 ≤ i < n, 0 ≤ j < n, belongs to the ((j/m)·√p + (i/m))-th block, which is assigned to the memory of the same-numbered processor.
  • Let Ai (respectively, Bi) denote the i-th block of A (respectively, B) stored in processor i. With these conventions the algorithm can be described in Figure 1. The following proposition describes the performance of the aforementioned algorithm.
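As an illustration of the block-numbering formula above, a tiny sketch (the helper name is hypothetical):

  /* Block (and processor) number of element A(i, j) when the p blocks of size
   * m x m (m = n / sqrt(p)) are numbered as in the text: (j/m)*sqrt(p) + (i/m). */
  int block_of(int i, int j, int m, int sqrt_p) {
      return (j / m) * sqrt_p + (i / m);
  }

For example, with n = 4 and p = 4 (so m = 2 and √p = 2), element A(3, 1) belongs to block (1/2)·2 + (3/2) = 1, i.e. it is stored on processor 1.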

  19. Matrix Multiplication
