
MPI_Alltoall


Presentation Transcript


  1. MPI_Alltoall By: Jason Michalske

  2. What is MPI_Alltoall? • Each process sends a distinct block of data to every process (including itself): the jth block of process i's send buffer is received by process j and placed in the ith block of process j's receive buffer. • Simple Visual. • Can be used to perform a transpose over multiple processors. • Example of this (see the sketch below).
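To make the block exchange concrete, here is a minimal sketch (my own illustration, not taken from the slides): each of P ranks fills entry j of its send buffer with rank*P + j, and after MPI_Alltoall entry i of its receive buffer holds i*P + rank, i.e. the distributed P x P matrix of values has been transposed across the ranks.

    /* Minimal illustration (assumed, not from the slides): one int per block.
     * Block j sent by rank i ends up as block i on rank j. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, j;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));

        /* Entry j is destined for rank j; its value encodes (row, column) = (rank, j). */
        for (j = 0; j < size; j++)
            sendbuf[j] = rank * size + j;

        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        /* Now recvbuf[i] == i * size + rank: the P x P matrix has been transposed. */
        printf("rank %d received:", rank);
        for (j = 0; j < size; j++)
            printf(" %d", recvbuf[j]);
        printf("\n");

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }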

  3. What is MPI_Alltoall? • int MPI_Alltoall(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm) • scount elements of type stype are sent from each process's sbuf to every process and placed into the corresponding block of that process's rbuf. • In this version of Alltoall, stype and rtype must be the same, and scount and rcount must be the same.

  4. Using MPI_Alltoall • MPI_Alltoall(localA, sendcount, MPI_FLOAT, localB, recvcount, MPI_FLOAT, MPI_COMM_WORLD); • Here, localA and localB are each arrays of (number of processors * sendcount) floating-point numbers (sendcount and recvcount are the same). • Example: my code (a sketch follows below).
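A self-contained version of that call might look like the following sketch; the buffer sizes follow the slide's description, while the counts and the illustrative initialization are my assumptions (the slide's actual program is not shown here).

    /* Hedged sketch of the slide's MPI_Alltoall call with MPI_FLOAT buffers.
     * localA and localB each hold (number of processes * sendcount) floats. */
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, numtasks, i;
        int sendcount = 2, recvcount = 2;          /* must be equal in this version */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

        float *localA = malloc(numtasks * sendcount * sizeof(float));
        float *localB = malloc(numtasks * recvcount * sizeof(float));

        /* Illustrative data: block j of localA is destined for rank j. */
        for (i = 0; i < numtasks * sendcount; i++)
            localA[i] = rank + i / 100.0f;

        MPI_Alltoall(localA, sendcount, MPI_FLOAT,
                     localB, recvcount, MPI_FLOAT, MPI_COMM_WORLD);

        /* localB now holds one sendcount-sized block from every rank, in rank order. */
        free(localA);
        free(localB);
        MPI_Finalize();
        return 0;
    }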

  5. How Does MPI_Alltoall work? • Each process executes a send to every process (including itself) with a call to MPI_Send(sendbuf + i*sendcount, sendcount, sendtype, i, …) and a receive from every process with a call to MPI_Recv(recvbuf + i*recvcount, recvcount, recvtype, i, …). • Example: Simplest Solution: Conditional Statements (a reconstruction follows below).
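As a rough reconstruction of that "conditional statements" idea (the author's actual slide code is not shown in the transcript), the sketch below writes the exchange out explicitly for exactly two ranks. The ordering of the blocking calls is what keeps it safe, which foreshadows the deadlock problem on the next slide.

    /* Assumed reconstruction, not the author's code: the conditional approach
     * spelled out for exactly two ranks (run with mpirun -np 2).  If both ranks
     * posted MPI_Recv first, or both posted large blocking sends first, they
     * could wait on each other forever -- the problem the next slide describes. */
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, i, tag = 1, sendcount = 4;
        float sendbuf[8], recvbuf[8];              /* 2 ranks * sendcount floats each */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 8; i++)
            sendbuf[i] = rank * 8 + i;             /* illustrative data */

        int partner = 1 - rank;
        /* The "send to itself" is just a local copy of this rank's own block. */
        memcpy(recvbuf + rank * sendcount, sendbuf + rank * sendcount,
               sendcount * sizeof(float));

        if (rank == 0) {                           /* rank 0 sends first, then receives */
            MPI_Send(sendbuf + partner * sendcount, sendcount, MPI_FLOAT,
                     partner, tag, MPI_COMM_WORLD);
            MPI_Recv(recvbuf + partner * sendcount, sendcount, MPI_FLOAT,
                     partner, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                                   /* rank 1 receives first, then sends */
            MPI_Recv(recvbuf + partner * sendcount, sendcount, MPI_FLOAT,
                     partner, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(sendbuf + partner * sendcount, sendcount, MPI_FLOAT,
                     partner, tag, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }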

  6. Problems with MPI_Send and Conditional Statements • The code is very repetitive for any number of processors. • What about different data types? Handling them this way would make the code very complex. • What if two processes each call a blocking Receive and wait for each other to finish? That is a deadlock, and the execution never completes.

  7. Fixing the Problems • Repetitive code is generally best solved by looping. • If different data types are used between calls to MPI_Alltoall, the looping construct makes them much easier to handle. • The remaining problem is that MPI_Send and MPI_Recv are blocking functions: no code beyond that statement can be executed until the call is locally complete.

  8. Fixing the Problems • To solve the blocking problem, you can use the non-blocking forms of communication, MPI_Isend and MPI_Irecv. • The send buffer may not be modified until the request has been completed by MPI_Wait, MPI_Test, or one of the other MPI wait or test functions.

  9. Better Solution • Loop over every rank, posting a non-blocking send and a non-blocking receive for that rank, then wait for both requests before the next iteration:

    if (sendtype == MPI_FLOAT) {           /* analogous branches handle other datatypes */
        fsendbuf = (float *) sendbuf;
        frecvbuf = (float *) recvbuf;
        for (source = 0; source < numtasks; source++) {
            dest = source;
            MPI_Isend(fsendbuf + (source * sendcount), sendcount, sendtype,
                      dest, tag, MPI_COMM_WORLD, &reqs[0]);
            MPI_Irecv(frecvbuf + (source * recvcount), recvcount, recvtype,
                      source, tag, MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, stats);   /* complete both before reusing reqs */
        }
    }

  10. Does the Code Work? • With 6 processors and a sendcount of 1, the resulting receive buffers on each process were as shown in the next visual (these are the same parameters as in the graphic used to explain what Alltoall should do). • With 10 processors and a sendcount of 2, the resulting receive buffers on each process were as shown in the next visual.

  11. Comparisons • Until the data size approached 60 million elements, both versions of MPI_Alltoall were very close in time. • Beyond 60 million elements, in the runs where neither program exceeded 600 seconds, my version seemed to be much faster. • A sample location was printed out for verification: my version's results were consistent with the real Alltoall (a hedged timing sketch follows below).
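The slides do not show the timing harness itself; a plausible fragment (my assumption) is below. It reuses the localA/localB float buffers from the earlier sketch plus an extra localC buffer of the same size, and a hypothetical my_alltoall() wrapper around the MPI_Isend/MPI_Irecv loop from slide 9; it times each version with MPI_Wtime and prints one sample location from each result for the consistency check.

    /* Hedged timing/verification fragment (assumed, not from the slides).
     * Assumes rank, sendcount, recvcount, localA, localB, localC are declared
     * as in the earlier float sketch, and that my_alltoall() is a hypothetical
     * wrapper around the non-blocking loop from slide 9. */
    double t0, t_mine, t_mpi;

    MPI_Barrier(MPI_COMM_WORLD);                   /* start all ranks together */
    t0 = MPI_Wtime();
    my_alltoall(localA, sendcount, MPI_FLOAT,
                localB, recvcount, MPI_FLOAT, MPI_COMM_WORLD);
    t_mine = MPI_Wtime() - t0;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Alltoall(localA, sendcount, MPI_FLOAT,
                 localC, recvcount, MPI_FLOAT, MPI_COMM_WORLD);
    t_mpi = MPI_Wtime() - t0;

    /* Spot check: one sample location from each receive buffer should match. */
    if (rank == 0)
        printf("mine: %.3f s   MPI_Alltoall: %.3f s   sample: %f vs %f\n",
               t_mine, t_mpi, localB[0], localC[0]);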

  12. Conclusions • The resulting timing data showed that both versions seemed to perform very closely with smaller data sets. • As the send and receive buffers got bigger, so did the time. • More data means bigger messages sent between the processors, which leads to the function taking longer. • For send and receive buffers of the same length and an increasing number of processors, the time also increased. • More processors means more communications for the same data set, which leads to more function calls and thus longer times.
