Basics of Message-passing
  • Mechanics of message-passing
    • A means of creating separate processes on different computers
    • A way to send and receive messages
  • Single program multiple data (SPMD) model
    • Logic for multiple processes merged into one program
    • Control statements route each process to its own block of logic
    • A compiled program is stored on each processor
    • All executables are started together statically
    • Example: MPI (Message Passing Interface)
  • Multiple program multiple data (MPMD) model
    • Each processor has a separate master program
    • Master program spawns child processes dynamically
    • Example: PVM (Parallel Virtual Machine)
PVM (Parallel Virtual Machine)

From Oak Ridge National Laboratory; freely distributed

  • Multiple process control
    • Host process: controls the environment; any process can spawn others
    • Daemon: handles message passing
  • PVM System Calls (see the sketch after this list)
    • Control: pvm_mytid(), pvm_spawn(), pvm_parent(), pvm_exit()
    • Get send buffer: pvm_initsend()
    • Pack for sending: pvm_pkint(), pvm_pkfloat(), pvm_pkstr()
    • Blocking/non-blocking transmission: pvm_send(), pvm_recv(), pvm_nrecv()
    • Unpack after receipt: pvm_upkint(), pvm_upkfloat(), pvm_upkstr()
    • Group definition: pvm_joingroup()
    • Collective communication: pvm_bcast(), pvm_scatter(), pvm_gather(), pvm_reduce(), pvm_mcast()
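
A minimal sketch of these calls in use, assuming a working PVM installation with an executable named "worker" available on the virtual machine (the task name and message tag are illustrative): the parent spawns one worker and sends it an integer.

#include <pvm3.h>
#include <stdio.h>

int main(void)
{   int value, child;
    if (pvm_parent() == PvmNoParent)   /* no parent: we are the master */
    {   pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &child);
        value = 42;
        pvm_initsend(PvmDataDefault);  /* get a fresh send buffer */
        pvm_pkint(&value, 1, 1);       /* pack one int with stride 1 */
        pvm_send(child, 1);            /* blocking send, message tag 1 */
    }
    else                               /* we are the spawned worker */
    {   pvm_recv(-1, 1);               /* blocking receive, any source, tag 1 */
        pvm_upkint(&value, 1, 1);      /* unpack in the order packed */
        printf("worker received %d\n", value);
    }
    pvm_exit();                        /* leave the virtual machine */
    return 0;
}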
mpij and MpiJava
  • Overview
    • MpiJava is a wrapper sitting on MPICH or LAM/MPI
    • mpij is a native Java implementation of MPI
  • Documentation
    • MpiJava (http://www.hpjava.org/mpiJava.html)
    • mpij (uses the same API as MpiJava)
  • Java Grande consortium (http://www.javagrande.org)
    • Sponsors conferences & encourages Java for Parallel Programming
    • Maintains Java based paradigms (mpiJava, HPJava, and mpiJ)
  • Other Java based implementations
    • JavaMpi is another less popular MPI Java wrapper
SPMD Computation (MPI)

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    /* ... */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0)
        master();
    else
        slave();
    /* ... */
    MPI_Finalize();
}

The master process executes master(); the slave processes execute slave().

A Simple MPI Program

#include <mpi.h>
#include <stdio.h>

#define MAX 101
#define TAG 1

int main(int argc, char *argv[])
{   int rank, size;
    char data[MAX];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);  /* Terminate all processes */
    if (rank == 0)
    {   sprintf(data, "Sending from %d of %d", rank, size);
        MPI_Send(data, MAX, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
    }
    else
    {   MPI_Recv(data, MAX, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%s\n", data);
    }
    MPI_Finalize();
}

Start and Finish
  • MPI_Init: Brings up the program on all computers, passes command line arguments, establishes ranks
  • MPI_Comm_rank: Determines the rank of the current process
  • MPI_Comm_size: Returns the number of processes running in the communicator
  • MPI_Finalize: Terminates the program normally
  • MPI_Abort: Terminates with an error code when something goes wrong
Standard Send (MPI_Send)

Blocks until the message is received or the data is copied to a system buffer

int MPI_Send(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)

  • Input Parameters
    • buf: initial address of send buffer (choice)
    • count: integer number of elements in send buffer
    • type: type of each send buffer element (ex: MPI_CHAR, MPI_INT, MPI_DOUBLE, MPI_BYTE, MPI_PACKED, etc.)
    • dest: rank of destination (integer)
    • tag: message tag (integer)
    • comm: communicator (handle)
  • Note: MPI_PACKED allows different data types to be sent in a single buffer using the MPI_Pack and MPI_Unpack functions (see the sketch after this list)
  • Note: Google MPI_Send, MPI_Recv, etc. for more information
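
A hedged sketch of the MPI_PACKED mechanism, packing an int and a double into one message (the buffer size, ranks, and tag are illustrative):

char buf[64];
int position = 0, n = 5;
double x = 3.14;

/* Sender (rank 0): pack both values, then send the packed bytes */
MPI_Pack(&n, 1, MPI_INT, buf, sizeof(buf), &position, MPI_COMM_WORLD);
MPI_Pack(&x, 1, MPI_DOUBLE, buf, sizeof(buf), &position, MPI_COMM_WORLD);
MPI_Send(buf, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);

/* Receiver (rank 1): receive the bytes, then unpack in the same order */
MPI_Recv(buf, sizeof(buf), MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
position = 0;
MPI_Unpack(buf, sizeof(buf), &position, &n, 1, MPI_INT, MPI_COMM_WORLD);
MPI_Unpack(buf, sizeof(buf), &position, &x, 1, MPI_DOUBLE, MPI_COMM_WORLD);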
Matching Message Tags
  • Differentiates between types of messages
  • The message tag is carried within the message
  • Wild card codes allow receipt of any message from any source
    • MPI_ANY_TAG: matches any message type
    • MPI_ANY_SOURCE: matches messages from any sender
    • Sends cannot use wildcards (pull operation, not push)

Example: send a message with tag 5 from buffer x to buffer y in process 2 (see the sketch below)
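
A sketch of that example in MPI form (the int payload and sender rank are assumptions):

int x, y;

/* Sender: a message with tag 5 goes from buffer x to process 2 */
MPI_Send(&x, 1, MPI_INT, 2, 5, MPI_COMM_WORLD);

/* In process 2: receive into buffer y; the tag (or MPI_ANY_TAG) must match */
MPI_Recv(&y, 1, MPI_INT, MPI_ANY_SOURCE, 5, MPI_COMM_WORLD, MPI_STATUS_IGNORE);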

Status of Sends and Receives

MPI_Status status;
MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
         MPI_ANY_TAG, MPI_COMM_WORLD, &status);

  • status.MPI_SOURCE /* rank of sender */
  • status.MPI_TAG /* type of message */
  • status.MPI_ERROR /* error code */
    • MPI_SUCCESS - Successful, MPI_ERR_BUFFER - Invalid buffer pointer
    • MPI_ERR_COUNT - Invalid count, MPI_ERR_TYPE - Invalid data type
    • MPI_ERR_TAG - Invalid tag, MPI_ERR_COMM - Invalid communicator
    • MPI_ERR_RANK - Invalid rank, MPI_ERR_ARG - Invalid argument
    • MPI_ERR_UNKNOWN - Unknown error, MPI_ERR_INTERN - internal error
    • MPI_ERR_TRUNCATE - message truncated on receive
  • MPI_Get_count(&status, recv_type, &count) /* number of elements */
Console Input and Output
  • Input
    • Console input must be initiated at the host process:
      if (rank == 0) { printf("Enter some fraction: "); scanf("%lf", &value); }
      or use fgets(data, MAX, stdin) to read a string
  • Output
    • Any process can initiate an output
    • MPI uses internal library functions to route the output to the process initiating the program
    • Output transmissions initiated through library functions before normal application transmissions can arrive after them, or vice versa
Groups and Communicators
  • Group: A set of processes ordered by relative rank
  • Communicator: The context required for sends and receives
  • Purpose: Enable collective communication (to subgroups of processes)
  • The default communicator is MPI_COMM_WORLD
    • A unique rank corresponds to each executing process
    • The rank is an integer from 0 to p – 1
    • The number of processors executing is p
  • Applications can create subset communicators
    • Each processor has a unique rank in each sub-communicator
    • The rank is an integer from 0 to g – 1
    • The number of processors in the group is g
MPI Group Communicator Functions

Typical Usage

  • Extract group from communicator: MPI_Comm_group
  • Form new group: MPI_Group_incl or MPI_Group_excl
  • Create new group communicator: MPI_Comm_create
  • Determine group rank: MPI_Comm_rank
  • Communications: MPI message passing functions
  • Destroy created communicators and groups: MPI_Comm_free and MPI_Group_free
Details
  • MPI_Group_excl:
    • New group without certain processes from an existing group
    • int MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup);
  • MPI_Group_incl:
    • New group with selected processes from an existing group
    • int MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup);
Creating and using a sub-group

int ranks[4] = {1, 3, 5, 7};
MPI_Group original, subgroup;
MPI_Comm slave;

MPI_Comm_group(MPI_COMM_WORLD, &original);
MPI_Group_incl(original, 4, ranks, &subgroup);
/* All processes call MPI_Comm_create; those outside the group get MPI_COMM_NULL */
MPI_Comm_create(MPI_COMM_WORLD, subgroup, &slave);
MPI_Send(data, strlen(data)+1, MPI_CHAR, 0, 0, slave);
MPI_Group_free(&subgroup);
MPI_Group_free(&original);
MPI_Comm_free(&slave);

Point-to-point Communication

[Figure: Process 1 executes send(&x, 2) to send x to process 2, which executes recv(&y, 1) to receive it into y. Generic syntax; actual formats appear later.]
  • Pseudo code constructs

Send(data, destination, message tag)

Receive(data, source, message tag)

  • Synchronous
    • Send completes when the data is safely received
    • Receive completes when data is available
    • No copying to/from internal buffers
  • Asynchronous
    • Copy to internal message buffer
    • Send completes when transmission begins
    • Local buffers are free for application use
    • Receive polls to determine if data is available
Synchronized Sends and Receives

[Figure: (a) send() occurs before recv(): process 1 issues a request to send and suspends; once process 2 calls recv() and returns an acknowledgment, the message transfers and both processes continue. (b) recv() occurs before send(): process 2 suspends in recv() until process 1's send() delivers the message; both processes then continue.]

Point-to-Point MPI Calls
  • Buffered send (receiver gets to it when it can)
    • Completes after the data is copied to a user-supplied buffer
    • Becomes synchronous if no buffers are available
  • Ready send (transmission is guaranteed to succeed)
    • A matching receive call must precede the send
    • Completion occurs when the remote processor receives the data
  • Standard send (starts transmission if possible)
    • If a receive call is posted, completes when transmission starts
    • If no receive call is posted, completes when MPI buffers the data, but becomes synchronous if no buffers are available
  • Blocking: return occurs when the call completes
  • Non-blocking: return occurs immediately
    • The application must periodically poll or wait for completion
    • Why non-blocking? To allow more parallel processing
Buffered Send Example

Applications supply a buffer area using MPI_Buffer_attach() to hold the data during transmission (see the sketch below)

Note: transmission is between sender/receiver MPI buffers

Note: copying in and out of buffers can be expensive
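
A minimal sketch, assuming the message fits in the attached buffer (sizes, ranks, and tag are illustrative):

#include <stdlib.h>

char data[100] = "hello";
int size = 1024 + MPI_BSEND_OVERHEAD;    /* message space plus MPI bookkeeping */
char *buffer = malloc(size);

MPI_Buffer_attach(buffer, size);         /* hand MPI the user-supplied buffer */
MPI_Bsend(data, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD);  /* returns once copied */
MPI_Buffer_detach(&buffer, &size);       /* blocks until buffered sends drain */
free(buffer);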

Point-to-point Message Transfer

int x, myrank;
MPI_Status stat;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{   MPI_Send(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
}
else if (myrank == 1)
{   MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}

Non-blocking Point-to-point Transfer

int x, myrank;
MPI_Request io;
MPI_Status stat;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{   MPI_Isend(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &io);
    doSomeProcessing();
    MPI_Wait(&io, &stat);   /* block until the send completes */
}
else if (myrank == 1)
{   MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}

MPI_Isend() and MPI_Irecv() return immediately

MPI_Rsend(): ready send (the matching receive must already be posted); MPI_Bsend(): buffered send; MPI_Send(): standard send

MPI_Wait() returns after the transmission completes; MPI_Test() sets its flag argument to non-zero once the transmission completes, zero otherwise
Message Passing Order

[Figure: two timelines of process 0 issuing send(…,1,…) twice alongside a library call lib(), while process 1 issues recv(…,0,…) twice plus a library receive. In (a) the messages are received out of order; in (b) they are received in order.]

Note: Messages originating from a processor will always be received in order. Messages from different processors can be received out of order.

Collective Communication

MPI operations on groups of processes

  • Broadcast (MPI_Bcast()): Broadcast or multicast data to the processes in a group
  • Scatter (MPI_Scatter()): Send parts of an array to separate processes
  • Gather (MPI_Gather()): Collect array elements from separate processes
  • AlltoAll (MPI_Alltoall()): A combination of gather and scatter. All processes send; then sections of the combined data are gathered
  • MPI_Reduce(): Combine values from all processes to a single value using some operation (function call).
  • MPI_Reduce_scatter(): First reduce and then scatter result
  • MPI_Scan(): Reduce values received from processors of lower rank in the group. (Note: this is a prefix reduction)
  • MPI_Barrier(): Pause until all processors reach the barrier call
  • Advantages
    • MPI can use the processor hierarchy to improve efficiency
    • Although we could implement collective communication using standard send and receive calls, the collective operations require less programming and debugging
Reduce, Broadcast, All-Reduce

[Figure: two ways to implement all-reduce: a reduce followed by a broadcast, and a butterfly all-reduce]

Predefined Collective Operations
  • MPI_MAX, MPI_MIN: maximum, minimum
  • MPI_MAXLOC, MPI_MINLOC:
    • For each index i of the output buffer out, out[i].val holds the maximum (or minimum) value and out[i].rank the rank of the process contributing it (see the sketch after this list)
  • MPI_SUM, MPI_PROD: sum, product
  • MPI_LAND, MPI_LOR, MPI_LXOR: logical &, |, ^
  • MPI_BAND, MPI_BOR, MPI_BXOR: bitwise &, |, ^
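
A hedged sketch of MPI_MAXLOC with a single value per process (myValue and the root rank are illustrative):

struct { double val; int rank; } local, global;

local.val = myValue;      /* this process's contribution */
local.rank = myrank;
/* MPI_DOUBLE_INT is the predefined {double, int} pair type for MAXLOC/MINLOC */
MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
if (myrank == 0)
    printf("max %f held by rank %d\n", global.val, global.rank);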
Derived MPI Data Types

/* Goal: send items, each containing a double, an integer, and a string */

int lengths[3] = {1, 1, 100};
MPI_Datatype types[3] = {MPI_DOUBLE, MPI_INT, MPI_CHAR};
MPI_Aint displacements[3] = {0, sizeof(double), sizeof(double) + sizeof(int)};
MPI_Datatype myType;

/* Derive a data type */
MPI_Type_create_struct(3, lengths, displacements, types, &myType);
MPI_Type_commit(&myType);   /* Commit it for use */

/* Broadcast count data items from source to the processes in the communicator */
MPI_Bcast(&data, count, myType, source, comm);
MPI_Type_free(&myType);     /* Don't need it anymore */

Note: Broadcasts can be fifty to a hundred times faster than individual sends in a for loop

Collective Communication Example
  • Master: Allocate memory to hold all of the data and then gather items from a group of processes
  • Remotes: Fill an array with data and send it to the master
  • Note: All processes execute the MPI_Gather() function

int data[10];   /* data to gather from each process */
int *buf = NULL, grp_size, myrank, i;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{   MPI_Comm_size(MPI_COMM_WORLD, &grp_size);
    buf = (int *)malloc(grp_size * 10 * sizeof(int));
}
for (i = 0; i < 10; i++) data[i] = myrank;   /* every rank contributes */

/* recvcount is the number of elements gathered from EACH process */
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT,
           0 /* gatherer rank */, MPI_COMM_WORLD);

User Defined Collective Operation

/* User-defined function to add complex numbers (dest = src + dest) */
typedef struct { double real, imag; } Complex;

void compSum(Complex *src, Complex *dest, int *len, MPI_Datatype *ptr)
{   int i;
    for (i = 0; i < *len; ++i)
    {   dest->real += src->real;
        dest->imag += src->imag;
        src++; dest++;
    }
}

Complex in[100], out[100];
MPI_Op operation;
MPI_Datatype complexType;

MPI_Type_contiguous(2, MPI_DOUBLE, &complexType);   // Define the type
MPI_Type_commit(&complexType);                      // Record it for use
MPI_Op_create((MPI_User_function *)compSum, 1 /* commutative */, &operation);
MPI_Reduce(in, out, 100, complexType, operation, root, communicator);

Collective Communication Rules
  • All of the processes in the communicator call the same collective function
  • The arguments must specify the same root, input data array length, data type, operation, and communicator
  • The destination process is the only one that needs to specify an output array
  • There is no message tag. Matching is done by the calling order and the communicator
  • The input and output buffers must be different and should not overlap
Broadcast

[Figure: processes 0 through p-1 all call bcast(); the data in the root's buf is copied to a data buffer on every process]

Broadcast - Sending the same message to all processes

Multicast - Sending the same message to a defined group of processes.

Scatter

[Figure: processes 0 through p-1 all call scatter(); the elements of the root's buf are distributed to the data buffers of the individual processes]

Distributing each element of an array to separate processes

Contents of the ith location of the array transmit to process i (see the sketch below)
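
A minimal sketch, assuming one int per process (the data values are illustrative):

int p, myrank, item, i;
int *array = NULL;

MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)                          /* only the root supplies the array */
{   array = (int *)malloc(p * sizeof(int));
    for (i = 0; i < p; i++) array[i] = i * i;
}
/* Every process, including the root, receives array[myrank] into item */
MPI_Scatter(array, 1, MPI_INT, &item, 1, MPI_INT, 0, MPI_COMM_WORLD);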

Gather

[Figure: processes 0 through p-1 all call gather(); their data buffers are collected into the root's buf]

One process collects individual values from a set of processes.

Reduce

[Figure: processes 0 through p-1 all call reduce(); their data values are combined with an operation such as + into the root's buf]

Perform a distributed calculation

Example: Perform addition over a distributed array (see the sketch below)
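
A minimal sketch of the addition example (each process's contribution is illustrative):

int localValue = myrank + 1;   /* each process contributes one value */
int sum;

MPI_Reduce(&localValue, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (myrank == 0)
    printf("sum = %d\n", sum);  /* 1 + 2 + ... + p at the root */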

Avoiding MPI Deadlocks

An MPI_Recv without a matching send will block forever

  • MPI_Send doesn't always work the same way
    • Can copy to a buffer and then return before the transmission is received
    • Can block until the matching MPI_Recv starts
    • MPI uses thresholds to switch from buffered to blocking sends
    • Some implementations buffer small messages and block large messages
  • Deadlock Possibilities (MPI_Send followed by MPI_Recv)
    • If all of the sends block, none of the receives can start
    • Small messages may succeed, while larger messages may lead to deadlock
  • Possible Solutions:
    • Some processors send before receive; others receive before send
    • Use MPI_Sendrecv or MPI_Sendrecv_replace so that MPI will automatically handle the order of the calls and guarantee no deadlock (see the sketch below)
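
A hedged sketch of the MPI_Sendrecv approach for a ring exchange (the neighbor computation and payload are illustrative):

double mine = 1.0, theirs;
int right = (myrank + 1) % p;        /* rank to send to      */
int left  = (myrank + p - 1) % p;    /* rank to receive from */

/* MPI pairs the send and receive internally, so call order cannot deadlock */
MPI_Sendrecv(&mine,   1, MPI_DOUBLE, right, 0,
             &theirs, 1, MPI_DOUBLE, left,  0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);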
Timing Parallel Programs
  • What should not be timed?
    • Time to type input
    • Time to print or display output
  • What should be timed?
    • The actual algorithm's computation
    • Communication blocks
  • How? Answer: Either use MPI or C time.h functions

double start = MPI_Wtime();          /* wall-clock time in seconds */

/* Do stuff */

double time = MPI_Wtime() - start;   /* MPI_Wtick() gives only the timer resolution */

OR C (but clock() doesn't include idle time):

clock_t start = clock();

/* Do stuff */

double time = ((double)(clock() - start)) / CLOCKS_PER_SEC;

Maximum Time over Processors

double start, localElapsed, elapsed;

// Start all processors together
MPI_Barrier(MPI_COMM_WORLD);
start = MPI_Wtime();           // Start time

/* do code here */

// Get this processor's elapsed time
localElapsed = MPI_Wtime() - start;

// Get the maximum elapsed time over all processors
MPI_Reduce(&localElapsed, &elapsed, 1,
           MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

if (rank == 0)                 // Master processor outputs the result
    printf("Elapsed time = %f seconds\n", elapsed);

Note: Another way is to add a second barrier at the end, which avoids the need for a reduce operation
