
Design and Implementation of Non-Blocking Communication for Cell Messaging Layer

Teng Ma, PAL Group Meeting


Presentation Transcript


  1. Design and Implementation of Non-Blocking Communication for Cell Messaging Layer • Teng Ma, PAL Group Meeting

  2. Outline • Cell Messaging Layer (CML2.5) • Non-Blocking Communication • Performance • Discussion

  3. Cell Messaging Layer (CML2.5) • Supports a subset of the MPI library: • MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv, MPI_Barrier, MPI_Bcast, MPI_Reduce, MPI_Allreduce, MPI_Wtime, MPI_Abort, and MPI_Finalize. • My target: • MPI_Isend, MPI_Irecv, MPI_Wait, MPI_Waitall, MPI_Test

  4. Inter-node communication • Figure: Inter-Cell communication in the Cell Messaging Layer. • Non-blocking communication does not change the method of inter-node communication.

  5. Intra-node communication • CML uses a receiver-initiated method; receive requests are indexed as Recvreqs[rank][tag].

  6. Problems introduced by non-blocking communication within a node • Unlike blocking communication, non-blocking communication needs one more piece of information, the issue order, to match sends with receives.

     Sender (rank 0)                         Receiver (rank 1)
     MPI_Isend(buf,count,MPI_INT, 2,0,...)   MPI_Irecv(buf,count,MPI_INT, 0,3,...)
     MPI_Isend(buf,count,MPI_INT, 1,3,...)   MPI_Irecv(buf,count,MPI_INT, 3,0,...)
     MPI_Isend(buf,count,MPI_INT, 1,4,...)   MPI_Irecv(buf,count,MPI_INT, 0,3,...)
     MPI_Isend(buf,count,MPI_INT, 1,3,...)   MPI_Irecv(buf,count,MPI_INT, 0,4,...)

     In each call, the first number after MPI_INT is the rank and the second is the tag. • Matching needs three pieces of information: rank, tag, and index (issue order).
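The role of the issue order can be illustrated with a small sketch. It is not CML code: `op_key`, `assign_issue_order`, and the array sizes are hypothetical names chosen for the illustration, and a plain per-(rank, tag) counter stands in for whatever index scheme the implementation really uses.

```c
#include <assert.h>
#include <string.h>

#define MAX_RANKS 4   /* hypothetical sizes, large enough for the slide's example */
#define MAX_TAGS  8

/* One non-blocking operation as seen by the side that issues it. */
typedef struct {
    int peer;   /* destination rank for a send, source rank for a receive */
    int tag;
    int index;  /* issue order among operations with the same peer and tag */
} op_key;

/* Fill in ops[i].index by counting earlier operations with the same peer and tag. */
void assign_issue_order(op_key *ops, int n)
{
    int next[MAX_RANKS][MAX_TAGS];
    memset(next, 0, sizeof next);
    for (int i = 0; i < n; i++) {
        assert(ops[i].peer >= 0 && ops[i].peer < MAX_RANKS);
        assert(ops[i].tag >= 0 && ops[i].tag < MAX_TAGS);
        ops[i].index = next[ops[i].peer][ops[i].tag]++;
    }
}
```

Applied to the two columns above, the sender's second and fourth calls (rank 1, tag 3) get indexes 0 and 1, and the receiver's first and third calls (rank 0, tag 3) get indexes 0 and 1 as well, so each send pairs with exactly one receive even though rank and tag alone are ambiguous.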

  7. 3D array implementation • Two changes: • Recvreqs becomes a 3D array, Recvreqs[Rank][Tag][Index]. • Sender and receiver each generate the index locally to represent the issue order among operations with the same rank and tag. When the operation completes, the index is returned for reuse. • Requirement: • A fast algorithm that can get and return an index in constant time; a vector algorithm is preferable.

  8. Two index generation methods • Method 1 • The index is the position of the first 0 bit in a bit string; taking the index sets that bit to 1. Returning the index just clears the bit back to 0. • Pros: O(1) algorithm. • Cons: Needs a modulo operation, which is expensive on the SPE. It is a scalar algorithm that cannot use SPU intrinsics. • Method 2 • Preallocate a vector of 16 char elements initialized to (0, 1, 2, ..., 15). Getting an index takes an element from the vector, sets that element to -1, and rotates the vector left. Returning an index searches from the rightmost element for a -1 and writes the index back there. • Pros: a vector algorithm that can use SPU intrinsics. • Cons: O(N) in the worst case, since it must search for a -1 to return an index and for an element other than -1 to get one. It supports only indexes 0-15.
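Both methods can be sketched in portable scalar C. This is only one reading of the slide, not the SPU implementation: all names (`bit_pool`, `vec_pool`, and their functions) are hypothetical, Method 1 uses a loop bounded by 16 where real code would likely use a bit-scan instruction, and Method 2 emulates the vector with a plain array instead of SPU intrinsics.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS 16  /* 16 outstanding operations, as in the slides */

/* ---- Method 1: bit string; bit i set means index i is in use ---- */
typedef struct { uint16_t used; } bit_pool;

/* Take the lowest free index and mark it used; return -1 if none is free. */
int bit_get(bit_pool *p)
{
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (!(p->used & (1u << i))) {
            p->used |= (uint16_t)(1u << i);
            return i;
        }
    }
    return -1;
}

/* Give an index back by clearing its bit. */
void bit_put(bit_pool *p, int idx)
{
    assert(idx >= 0 && idx < NUM_SLOTS);
    p->used &= (uint16_t)~(1u << idx);
}

/* ---- Method 2: 16-element array; -1 marks an empty position ---- */
typedef struct { signed char v[NUM_SLOTS]; } vec_pool;

void vec_init(vec_pool *p)
{
    for (int i = 0; i < NUM_SLOTS; i++) p->v[i] = (signed char)i;
}

/* Take the first available element, mark its position -1, rotate left by one. */
int vec_get(vec_pool *p)
{
    int pos = 0;
    while (pos < NUM_SLOTS && p->v[pos] == -1) pos++;  /* skip empty positions */
    if (pos == NUM_SLOTS) return -1;                   /* every index is taken */
    int idx = p->v[pos];
    p->v[pos] = -1;
    signed char first = p->v[0];
    for (int i = 0; i + 1 < NUM_SLOTS; i++) p->v[i] = p->v[i + 1];
    p->v[NUM_SLOTS - 1] = first;
    return idx;
}

/* Store a returned index in the rightmost empty position. */
void vec_put(vec_pool *p, int idx)
{
    assert(idx >= 0 && idx < NUM_SLOTS);
    for (int i = NUM_SLOTS - 1; i >= 0; i--) {
        if (p->v[i] == -1) { p->v[i] = (signed char)idx; return; }
    }
    assert(!"pool has no empty position");
}
```

In the array version, the common case finds a usable element at the head, but after indexes are returned out of order the head can hold -1, and the skip loop then shows the O(N) worst case the slide mentions.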

  9. Pros and cons of the 3D implementation • Pros: • The sender can match quickly using local information (rank, tag, and index). • Few changes from CML2.5. • Cons: • Wastes memory: about 70 KB is preallocated for Recvreqs to support 4 tags and 16 outstanding operations (64*17*4*16 = 69,632 bytes). • Index generation and return are expensive on the SPU. • Supports only a limited number of tags.

  10. 2D array implementation • Recvreqs[rank+1][Tag][Index] ==> Recvreqs[rank+1][OUTSTANDING_OP] • Figure: an example of matching by search on the 2D array Recvreqs[rank+1][OUTSTANDING_OP] when requests finish out of order.
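Matching by search on a 2D table might look like the sketch below. The slide does not show the entry layout or how issue order is kept, so everything here is an assumption made for illustration: the `recv_req` fields, the per-row sequence counter `next_seq`, the value of `NUM_RANKS`, and the function names are all hypothetical. The table is expected to start zero-initialized.

```c
#include <assert.h>

#define NUM_RANKS      16  /* hypothetical: number of peers */
#define OUTSTANDING_OP 16  /* slots per row, as in the slides */

typedef struct {
    int      in_use;
    int      tag;
    unsigned seq;  /* assumed: issue order, assigned when the receive is posted */
    void    *buf;  /* where the receiver wants the data */
} recv_req;

typedef struct {
    recv_req slots[NUM_RANKS + 1][OUTSTANDING_OP];
    unsigned next_seq[NUM_RANKS + 1];
} recv_table;

/* Receiver side: post a request in the first free slot of the row for src.
   Returns the slot number, or -1 if the row is full. */
int post_recv(recv_table *t, int src, int tag, void *buf)
{
    assert(src >= 0 && src <= NUM_RANKS);
    for (int i = 0; i < OUTSTANDING_OP; i++) {
        recv_req *r = &t->slots[src][i];
        if (!r->in_use) {
            r->in_use = 1;
            r->tag = tag;
            r->buf = buf;
            r->seq = t->next_seq[src]++;
            return i;
        }
    }
    return -1;
}

/* Sender side: search the row for the oldest posted request with this tag.
   Returns the slot number, or -1 if nothing matches. Worst case scans the row. */
int match_send(const recv_table *t, int src, int tag)
{
    assert(src >= 0 && src <= NUM_RANKS);
    int best = -1;
    for (int i = 0; i < OUTSTANDING_OP; i++) {
        const recv_req *r = &t->slots[src][i];
        if (r->in_use && r->tag == tag &&
            (best < 0 || r->seq < t->slots[src][best].seq))
            best = i;
    }
    return best;
}

/* Free a slot once its transfer has finished. */
void complete_recv(recv_table *t, int src, int slot)
{
    assert(slot >= 0 && slot < OUTSTANDING_OP);
    t->slots[src][slot].in_use = 0;
}
```

Because slots are reused as soon as a request completes, slot position alone does not reflect issue order once requests finish out of order; the sequence number is one simple way to keep the oldest-first rule, at the cost of a full row scan per match.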

  11. Pros and cons of the 2D array implementation • Pros: • Saves SPU memory: 17 KB is preallocated for Recvreqs, which supports any tag and at most 16 outstanding operations. • Eliminates the expensive index generation and return. • Cons: • The sender must search a row to find the match; the worst case is O(#OUTSTANDING).
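The memory figures quoted for the two layouts can be checked with a short calculation. The slides give the product 64*17*4*16 without naming its factors; reading 64 as the size of one entry in bytes and 17 as the number of rows (rank+1) is an assumption, and the constant and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum {
    ENTRY_BYTES = 64,  /* assumed size of one Recvreqs entry, in bytes */
    ROWS        = 17,  /* assumed: the rank+1 dimension */
    TAGS        = 4,   /* tags supported by the 3D layout */
    OUTSTANDING = 16   /* outstanding operations per row */
};

/* Recvreqs[rank+1][Tag][Index] */
size_t recvreqs_3d_bytes(void)
{
    return (size_t)ENTRY_BYTES * ROWS * TAGS * OUTSTANDING;
}

/* Recvreqs[rank+1][OUTSTANDING_OP] */
size_t recvreqs_2d_bytes(void)
{
    return (size_t)ENTRY_BYTES * ROWS * OUTSTANDING;
}
```

Under these assumptions the 3D table takes 69,632 bytes (68 KiB, roughly the 70 KB quoted) and the 2D table takes 17,408 bytes (exactly 17 KiB), a factor of four that comes entirely from dropping the tag dimension.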

  12. Performance: latency

  13. Performance: bandwidth (2D array)

  14. Effect of increasing the number of outstanding operations (2D array implementation) • Figure panels: 128 KB message; 0-byte message. • The number of outstanding operations is configured by the user according to the application.

  15. Conclusion • CML now has non-blocking communication. • The bandwidth of CML_2D for 192 KB messages is 23.908 GB/s (93.4% of the theoretical peak of 25.6 GB/s). • The latency overhead added by non-blocking communication is acceptable. • Users can configure the number of outstanding operations according to the application.
