MARMOT - MPI Analysis and Checking Tool

Bettina Krammer, Matthias Müller
krammer@hlrs.de, mueller@hlrs.de

HLRS
High Performance Computing Center Stuttgart
Allmandring 30
D-70550 Stuttgart
http://www.hlrs.de

Overview
  • General Problems of MPI Programming
  • Related Work
  • New Approach: MARMOT
  • Results within CrossGrid
  • Outlook
Problems of MPI Programming
  • All problems of serial programming
  • Additional problems:
    • Increased difficulty to verify correctness of program
    • Increased difficulty to debug N parallel processes
    • Portability issues between different MPI implementations and platforms, e.g. with mpich on the CrossGrid testbed:

      WARNING: MPI_Recv: tag= 36003 > 32767 ! MPI only guarantees tags up to this.
      THIS implementation allows tags up to 137654536

      (Versions of LAM-MPI < 7.0 only guarantee tags up to 32767. A portable way to query the actual tag limit is sketched after this list.)
    • New problems such as race conditions and deadlocks
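The tag limit mentioned in the warning above is implementation-specific. As a hedged illustration (a minimal sketch using the MPI-1.2 C binding, not taken from the slides), a program can query the guaranteed upper bound at runtime through the predefined MPI_TAG_UB attribute instead of assuming a value:

/* Sketch: query the implementation's tag upper bound instead of assuming it.
   The MPI standard only guarantees MPI_TAG_UB >= 32767. */
#include <stdio.h>
#include "mpi.h"

int main( int argc, char **argv ) {
    int *tag_ub = NULL;
    int flag = 0;

    MPI_Init( &argc, &argv );
    /* MPI-1.2 style attribute query; the attribute value is a pointer to int */
    MPI_Attr_get( MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag );
    if( flag )
        printf( "This implementation allows tags up to %d\n", *tag_ub );
    MPI_Finalize();
    return 0;
}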
Related Work
  • Parallel debuggers, e.g. TotalView, DDT, p2d2
  • Debug version of MPI Library
  • Post-mortem analysis of trace files
  • Special verification tools for runtime analysis
    • MPI-CHECK
      • Limited to Fortran
    • Umpire
      • First version limited to shared memory platforms
      • Distributed memory version in preparation
      • Not publicly available
    • MARMOT
What is MARMOT?
  • Tool for the development of MPI applications
  • Automatic runtime analysis of the application:
    • Detect incorrect use of MPI
    • Detect non-portable constructs
    • Detect possible race conditions and deadlocks
  • MARMOT does not require source code modifications, just relinking
  • The C and Fortran bindings of MPI-1.2 are supported
Design of MARMOT

[Architecture diagram: the application's MPI calls pass through the MPI profiling interface into the MARMOT core tool, which forwards them to the MPI library; a debug server runs as an additional process.]
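To make the role of the profiling interface concrete, here is a minimal, hypothetical sketch of how a tool can intercept an MPI call and forward it to the underlying library (MPI-1.2 C binding assumed; this is not MARMOT's actual source):

/* Hypothetical interception of MPI_Send via the MPI profiling interface:
   the wrapper runs its checks, then calls PMPI_Send, the MPI library's
   real entry point. Because interception happens at link time, the
   application only needs to be relinked. */
#include <stdio.h>
#include "mpi.h"

int MPI_Send( void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm ) {
    /* local checks before the call reaches the MPI library */
    if( count < 0 )
        fprintf( stderr, "ERROR: MPI_Send: negative count %d\n", count );
    if( tag < 0 )
        fprintf( stderr, "ERROR: MPI_Send: negative tag %d\n", tag );

    /* forward the call to the underlying implementation */
    return PMPI_Send( buf, count, datatype, dest, tag, comm );
}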
Examples of Client Checks: verification on the local nodes
  • Verification of proper construction and usage of MPI resources such as communicators, groups, datatypes etc., for example
    • Verification of MPI_Request usage (a minimal bookkeeping sketch follows this list)
      • invalid recycling of active request
      • invalid use of unregistered request
      • warning if number of requests is zero
      • warning if all requests are MPI_REQUEST_NULL
      • Check for pending messages and active requests in MPI_Finalize
  • Verification of all other arguments such as ranks, tags, etc.
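A minimal sketch of the bookkeeping behind the request checks listed above (hypothetical names and data structure, not MARMOT's implementation): the tool remembers which request handles are active and complains when a handle is reused before it has been completed or freed.

/* Hypothetical bookkeeping for the "Request is still in use" check. The
   wrappers for MPI_Isend/MPI_Irecv would call register_request() before
   forwarding the call; the MPI_Wait/MPI_Test wrappers would call
   release_request() once the request has completed. */
#include <stdio.h>
#include "mpi.h"

#define MAX_REQUESTS 1024

static MPI_Request *active_requests[MAX_REQUESTS];
static int n_active = 0;

void register_request( MPI_Request *request, const char *call ) {
    int i;
    for( i = 0; i < n_active; i++ ) {
        if( active_requests[i] == request ) {
            fprintf( stderr, "ERROR: %s Request is still in use !!\n", call );
            return; /* keep the existing entry */
        }
    }
    if( n_active < MAX_REQUESTS )
        active_requests[n_active++] = request;
}

void release_request( MPI_Request *request ) {
    int i;
    for( i = 0; i < n_active; i++ ) {
        if( active_requests[i] == request ) {
            active_requests[i] = active_requests[--n_active];
            return;
        }
    }
}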
Examples of Server Checks: verification between the nodes, control of program
  • Everything that requires a global view
  • Control the execution flow, trace the MPI calls on each node throughout the whole application
  • Signal conditions, e.g. deadlocks (with a traceback on each node); a minimal timeout-based sketch follows this list
  • Check matching send/receive pairs for consistency
  • Check collective calls for consistency
  • Output of a human-readable log file
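As an illustration of the global view, a hedged sketch of a timeout-based "all clients are pending" heuristic of the kind that produces the deadlock warnings shown in the log examples later on (hypothetical code, not MARMOT's actual debug server):

/* Hypothetical server-side heuristic: every rank reports the MPI call it is
   currently blocked in; if all ranks have been blocked for longer than a
   timeout, the server reports a suspected deadlock and prints the traceback
   of the last calls on each node. */
#include <stdio.h>
#include <time.h>

#define NCLIENTS          3
#define DEADLOCK_TIMEOUT 60.0  /* seconds without progress on any rank */

static int    pending[NCLIENTS];        /* rank i is inside a blocking call */
static time_t blocked_since[NCLIENTS];  /* when rank i entered that call    */

void check_deadlock( void ) {
    time_t now = time( NULL );
    int i;
    for( i = 0; i < NCLIENTS; i++ )
        if( !pending[i] || difftime( now, blocked_since[i] ) < DEADLOCK_TIMEOUT )
            return; /* at least one rank is not (yet) suspected of hanging */
    fprintf( stderr, "WARNING: deadlock detected, all clients are pending\n" );
}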
CrossGrid Application Tasks

[Diagram of the HEP data-acquisition chain: 40 MHz (40 TB/sec) into level 1 (special hardware), 75 KHz (75 GB/sec) into level 2 (embedded processors), 5 KHz (5 GB/sec) into level 3 (PCs), and 100 Hz (100 MB/sec) into data recording & offline analysis.]
CrossGrid Application: WP 1.4: Air pollution modelling (STEM-II)
  • Air pollution modeling with STEM-II model
  • Transport equation solved with a Petrov-Crank-Nicolson-Galerkin method
  • Chemistry and mass transfer are integrated using semi-implicit Euler and pseudo-analytical methods
  • 15500 lines of Fortran code
  • 12 different MPI calls:
    • MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Type_extent, MPI_Type_struct, MPI_Type_commit, MPI_Type_hvector, MPI_Bcast, MPI_Scatterv, MPI_Barrier, MPI_Gatherv, MPI_Finalize.
Conclusion
  • MARMOT supports MPI-1.2 for the C and Fortran bindings
  • Tested with several applications and benchmarks
  • Tested on different platforms, using different compilers and MPI implementations, e.g.
    • IA32/IA64 clusters (Intel, g++ compiler) with mpich (CrossGrid testbed, …)
    • IBM Regatta
    • NEC SX5, SX6
    • Hitachi SR8000, Cray T3E
Impact and Exploitation
  • MARMOT is mentioned in the U.S. DoD High Performance Computing Modernization Program (HPCMP) Programming Environment and Training (PET) project as one of two freely available tools (MPI-Check and MARMOT) for the development of portable MPI programs.
  • Contact with the developers of the other two known verification tools similar to MARMOT (MPI-Check, Umpire)
  • Contact with users outside CrossGrid
  • MARMOT has been presented at various conferences and in several publications:

    http://www.hlrs.de/organization/tsc/projects/marmot/pubs.html

Future Work
  • Scalability and general performance improvements
  • Better user interface to present problems and warnings
  • Extended functionality:
    • more tests to verify collective calls
    • MPI-2
    • Hybrid programming
Impact and Exploitation cont’d
  • Tescico proposal (“Technical and Scientific Computing”)
  • French/German consortium:
    • Bull, CAPS Entreprise, Université Versailles
    • T-Systems, Intel Germany, GNS, Universität Dresden, HLRS
  • ITEA-labelled
  • The French authorities agreed to provide funding; the German authorities did not
Impact and Exploitation cont’d
  • MARMOT is integrated in the HLRS training classes
  • Deployment in Germany: MARMOT is in use in two of the three national High Performance Computing Centers:
    • HLRS at Stuttgart (IA32, IA64, NEC SX)
    • NIC at Jülich (IBM Regatta):
      • Just replaced their Cray T3E with an IBM Regatta
      • Spent a lot of time finding a problem that would have been detected automatically by MARMOT
Feedback from CrossGrid Applications
  • Task 1.1 (biomedical)
    • C application
    • Identified issues:
      • Possible race conditions due to use of MPI_ANY_SOURCE
      • Deadlock
  • Task 1.2 (flood):
    • Fortran application
    • Identified issues:
      • Tags outside of valid range
      • Possible race conditions due to use of MPI_ANY_SOURCE
  • Task 1.3 (hep):
    • ANN (C application)
    • no issues found by MARMOT
  • Task 1.4 (meteo):
    • STEMII (Fortran)
    • MARMOT detected holes in the self-defined datatypes used in MPI_Scatterv and MPI_Gatherv. Removing these holes helped to improve communication performance (see the sketch after this list).
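To illustrate the kind of hole MARMOT reported (the sketch below is a hypothetical C analogue; STEM-II itself builds its datatypes in Fortran): compiler padding inside a struct becomes part of a derived datatype's extent, so every MPI_Scatterv/MPI_Gatherv moves or skips around unused bytes until the hole is removed, e.g. by reordering or packing the members.

/* Hypothetical derived datatype with a hole: on most platforms the compiler
   inserts 4 bytes of padding between 'id' and 'value', and that padding
   becomes part of the MPI datatype's extent. */
#include <stddef.h>
#include "mpi.h"

struct cell {
    int    id;     /* 4 bytes                     */
                   /* (typically 4 bytes padding) */
    double value;  /* 8 bytes                     */
};

void build_cell_type( MPI_Datatype *cell_type ) {
    int          blocklens[2] = { 1, 1 };
    MPI_Aint     displs[2]    = { offsetof( struct cell, id ),
                                  offsetof( struct cell, value ) };
    MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };

    /* MPI-1.2 call, as listed among STEM-II's MPI calls above;
       MPI_Type_create_struct is the MPI-2 replacement */
    MPI_Type_struct( 2, blocklens, displs, types, cell_type );
    MPI_Type_commit( cell_type );
}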
Examples of Log File

54 rank 1 performs MPI_Cart_shift
55 rank 2 performs MPI_Cart_shift
56 rank 0 performs MPI_Send
57 rank 1 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
58 rank 2 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
59 rank 0 performs MPI_Send
60 rank 1 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
61 rank 0 performs MPI_Send
62 rank 1 performs MPI_Bcast
63 rank 2 performs MPI_Recv
WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
64 rank 0 performs MPI_Pack
65 rank 2 performs MPI_Bcast
66 rank 0 performs MPI_Pack

Examples of Log File (continued)

7883 rank 2 performs MPI_Barrier
7884 rank 0 performs MPI_Sendrecv
7885 rank 1 performs MPI_Sendrecv
7886 rank 2 performs MPI_Sendrecv
7887 rank 0 performs MPI_Sendrecv
7888 rank 1 performs MPI_Sendrecv
7889 rank 2 performs MPI_Sendrecv
7890 rank 0 performs MPI_Barrier
7891 rank 1 performs MPI_Barrier
7892 rank 2 performs MPI_Barrier
7893 rank 0 performs MPI_Sendrecv
7894 rank 1 performs MPI_Sendrecv

Examples of Log File (Deadlock)

9310 rank 1 performs MPI_Sendrecv
9311 rank 2 performs MPI_Sendrecv
9312 rank 0 performs MPI_Barrier
9313 rank 1 performs MPI_Barrier
9314 rank 2 performs MPI_Barrier
9315 rank 1 performs MPI_Sendrecv
9316 rank 2 performs MPI_Sendrecv
9317 rank 0 performs MPI_Sendrecv
9318 rank 1 performs MPI_Sendrecv
9319 rank 0 performs MPI_Sendrecv
9320 rank 2 performs MPI_Sendrecv
9321 rank 0 performs MPI_Barrier
9322 rank 1 performs MPI_Barrier
9323 rank 2 performs MPI_Barrier
9324 rank 1 performs MPI_Comm_rank
9325 rank 1 performs MPI_Bcast
9326 rank 2 performs MPI_Comm_rank
9327 rank 2 performs MPI_Bcast
9328 rank 0 performs MPI_Sendrecv
WARNING: all clients are pending!

Examples of Log File (Deadlock: traceback on 0)

timestamp= 9298: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9300: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9304: MPI_Barrier(comm = MPI_COMM_WORLD)

timestamp= 9307: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9309: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9312: MPI_Barrier(comm = MPI_COMM_WORLD)

timestamp= 9317: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9319: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9321: MPI_Barrier(comm = MPI_COMM_WORLD)

timestamp= 9328: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

Examples of Log File (Deadlock: traceback on 1)

timestamp= 9301: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9302: MPI_Barrier(comm = MPI_COMM_WORLD)

timestamp= 9306: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9310: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9313: MPI_Barrier(comm = MPI_COMM_WORLD)

timestamp= 9315: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9318: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9322: MPI_Barrier(comm = MPI_COMM_WORLD)

timestamp= 9324: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank)

timestamp= 9325: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)

Examples of Log File (Deadlock: traceback on 2)

timestamp= 9303: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9305: MPI_Barrier(comm = MPI_COMM_WORLD)

timestamp= 9308: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9311: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9314: MPI_Barrier(comm = MPI_COMM_WORLD)

timestamp= 9316: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9320: MPI_Sendrecv(*sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, *recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, *status)

timestamp= 9323: MPI_Barrier(comm = MPI_COMM_WORLD)

timestamp= 9326: MPI_Comm_rank(comm = MPI_COMM_WORLD, *rank)

timestamp= 9327: MPI_Bcast(*buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)

Classical Solutions I: Parallel Debugger
  • Examples: TotalView, DDT, p2d2
  • Advantages:
    • Same approach and tool as in serial case
  • Disadvantages:
    • Can only fix problems after and if they occur
    • Scalability: How can you debug programs that crash after 3 hours on 512 nodes?
    • Reproducibility: How to debug a program that crashes only every fifth time?
    • Does not help to improve portability
Classical Solutions II: Debug version of MPI Library
  • Examples:
    • Catches some incorrect usage, e.g. node count in MPI_CART_CREATE (mpich)
    • Deadlock detection (NEC MPI)
  • Advantages:
    • Good scalability
    • Better debugging in combination with TotalView
  • Disadvantages:
    • Portability: only helps when using this particular MPI implementation
    • Trade-off between performance and safety
    • Reproducibility: Does not help to debug irreproducible programs
Example 1: request-reuse (source code)

/*
** Here we re-use a request we didn't free before
*/

#include <stdio.h>
#include <assert.h>
#include "mpi.h"

int main( int argc, char **argv ) {
    int size = -1;
    int rank = -1;
    int value = -1;
    int value2 = -1;
    MPI_Status send_status, recv_status;
    MPI_Request send_request, recv_request;

    printf( "We call Irecv and Isend with non-freed requests.\n" );

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    printf( " I am rank %d of %d PEs\n", rank, size );

Example 1: request-reuse (source code continued)

    if( rank == 0 ){
        /*** this is just to get the request used ***/
        MPI_Irecv( &value, 1, MPI_INT, 1, 18, MPI_COMM_WORLD, &recv_request );

        /*** going to receive the message and reuse a non-freed request ***/
        MPI_Irecv( &value, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &recv_request );
        MPI_Wait( &recv_request, &recv_status );
        assert( value == 19 );
    }

    if( rank == 1 ){
        value2 = 19;
        /*** this is just to use the request ***/
        MPI_Isend( &value, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &send_request );

        /*** going to send the message ***/
        MPI_Isend( &value2, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, &send_request );
        MPI_Wait( &send_request, &send_status );
    }

    MPI_Finalize();
    return 0;
}

Example 1: request-reuse (output log)

We call Irecv and Isend with non-freed requests.
1 rank 0 performs MPI_Init
2 rank 1 performs MPI_Init
3 rank 0 performs MPI_Comm_size
4 rank 1 performs MPI_Comm_size
5 rank 0 performs MPI_Comm_rank
6 rank 1 performs MPI_Comm_rank
 I am rank 0 of 2 PEs
7 rank 0 performs MPI_Irecv
 I am rank 1 of 2 PEs
8 rank 1 performs MPI_Isend
9 rank 0 performs MPI_Irecv
10 rank 1 performs MPI_Isend
ERROR: MPI_Irecv Request is still in use !!
11 rank 0 performs MPI_Wait
ERROR: MPI_Isend Request is still in use !!
12 rank 1 performs MPI_Wait
13 rank 0 performs MPI_Finalize
14 rank 1 performs MPI_Finalize

Example 2: deadlock (source code)

/* This program produces a deadlock.
** At least 2 nodes are required to run the program.
**
** Rank 0 recv a message from Rank 1.
** Rank 1 recv a message from Rank 0.
**
** AFTERWARDS:
** Rank 0 sends a message to Rank 1.
** Rank 1 sends a message to Rank 0.
*/

#include <stdio.h>
#include "mpi.h"

int main( int argc, char** argv ){
    int rank = 0;
    int size = 0;
    int dummy = 0;
    MPI_Status status;

Example 2: deadlock (source code continued)

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    if( size < 2 ){
        fprintf( stderr, " This program needs at least 2 PEs!\n" );
    }
    else {
        if( rank == 0 ){
            MPI_Recv( &dummy, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 1, 18, MPI_COMM_WORLD );
        }
        if( rank == 1 ){
            MPI_Recv( &dummy, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 0, 17, MPI_COMM_WORLD );
        }
    }

    MPI_Finalize();
    return 0;
}
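For comparison, a hedged sketch of one conventional fix (not part of the original slides): if rank 0 sends before it receives, the two blocking calls no longer wait on each other; using MPI_Sendrecv on both ranks would work equally well.

/* Deadlock-free variant (sketch): rank 0 sends first, rank 1 receives first. */
#include <stdio.h>
#include "mpi.h"

int main( int argc, char** argv ){
    int rank = 0;
    int size = 0;
    int dummy = 0;
    MPI_Status status;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    if( size < 2 ){
        fprintf( stderr, " This program needs at least 2 PEs!\n" );
    }
    else {
        if( rank == 0 ){
            MPI_Send( &dummy, 1, MPI_INT, 1, 18, MPI_COMM_WORLD );
            MPI_Recv( &dummy, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &status );
        }
        if( rank == 1 ){
            MPI_Recv( &dummy, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 0, 17, MPI_COMM_WORLD );
        }
    }

    MPI_Finalize();
    return 0;
}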

Example 2: deadlock (output log)

$ mpirun -np 3 deadlock1
1 rank 0 performs MPI_Init
2 rank 1 performs MPI_Init
3 rank 0 performs MPI_Comm_rank
4 rank 1 performs MPI_Comm_rank
5 rank 0 performs MPI_Comm_size
6 rank 1 performs MPI_Comm_size
7 rank 0 performs MPI_Recv
8 rank 1 performs MPI_Recv
8 Rank 0 is pending!
8 Rank 1 is pending!
WARNING: deadlock detected, all clients are pending

Example 2: deadlock (output log continued)

Last calls (max. 10) on node 0:
timestamp = 1: MPI_Init( *argc, ***argv )
timestamp = 3: MPI_Comm_rank( comm, *rank )
timestamp = 5: MPI_Comm_size( comm, *size )
timestamp = 7: MPI_Recv( *buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, *status )

Last calls (max. 10) on node 1:
timestamp = 2: MPI_Init( *argc, ***argv )
timestamp = 4: MPI_Comm_rank( comm, *rank )
timestamp = 6: MPI_Comm_size( comm, *size )
timestamp = 8: MPI_Recv( *buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, *status )