
Umpire: Making MPI Programs Safe



1. Umpire: Making MPI Programs Safe
   Bronis R. de Supinski and Jeffrey S. Vetter
   Center for Applied Scientific Computing
   August 15, 2000

2. Umpire
   • Writing correct MPI programs is hard
   • Unsafe or erroneous MPI programs
     • Deadlock
     • Resource errors
   • Umpire
     • Automatically detect MPI programming errors
     • Dynamic software testing
     • Shared memory implementation

3. Umpire Architecture
   [Architecture diagram: MPI application tasks 0 ... N-1 run on the MPI runtime system. Umpire interposes on each task's MPI calls using the MPI profiling layer (sketched below), and the tasks forward transactions via shared memory to the Umpire Manager, which runs the verification algorithms.]
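
The interposition shown in the diagram relies on the MPI profiling interface: the standard guarantees that every MPI routine is also callable under a PMPI_ name, so a tool can define its own MPI_ wrapper that records the call and forwards to the real implementation. A minimal sketch of one wrapper follows; umpire_record_transaction is a hypothetical stand-in for the shared-memory transaction mechanism described on the next slide.

      /* Sketch: intercepting MPI_Send via the MPI profiling layer.
       * umpire_record_transaction() is hypothetical, not Umpire's API. */
      #include <mpi.h>
      #include <stdio.h>

      static void umpire_record_transaction(const char *call, int peer)
      {
          fprintf(stderr, "umpire transaction: %s to task %d\n", call, peer);
      }

      int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                   int dest, int tag, MPI_Comm comm)
      {
          umpire_record_transaction("MPI_Send", dest);  /* local record */
          return PMPI_Send(buf, count, datatype, dest, tag, comm);
      }

Linking such wrappers ahead of the MPI library intercepts every call in the application without changing its source.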

4. Collection system
   • Calling task
     • Use MPI profiling layer
     • Perform local checks
     • Communicate with manager if necessary
       • Call parameters
       • Return program counter (PC)
       • Call-specific information (e.g., buffer checksum; sketched below)
   • Manager
     • Allocate Unix shared memory
     • Receive transactions from calling tasks
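
As a rough illustration of what one transaction might carry, here is a possible record layout plus a simple checksum; all names and the checksum algorithm are assumptions, since the slide does not specify them.

      #include <stddef.h>
      #include <stdint.h>

      /* Illustrative transaction record, one per intercepted MPI call. */
      typedef struct {
          int      task;       /* MPI rank of the calling task */
          int      op;         /* which MPI call was made */
          void    *return_pc;  /* return program counter of the call site */
          int      peer, tag;  /* call parameters needed for matching */
          uint32_t checksum;   /* send-buffer checksum, recomputed later to
                                  detect writes to an in-flight buffer */
      } umpire_transaction_t;

      /* Simple byte-wise checksum over the send buffer. */
      static uint32_t buffer_checksum(const void *buf, size_t len)
      {
          const unsigned char *p = buf;
          uint32_t sum = 0;
          while (len--)
              sum = sum * 31 + *p++;
          return sum;
      }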

5. Manager
   • Detects global programming errors
   • Unix shared memory communication
   • History queues (sketched below)
     • One per MPI task
     • Chronological lists of MPI operations
   • Resource registry
     • Communicators
     • Derived datatypes
     • Required for message matching
   • Perform verification algorithms
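
One plausible shape for a history queue is a per-task linked list ordered by arrival time; the code below is illustrative only.

      #include <stdlib.h>

      /* Chronological list of a task's outstanding MPI operations. */
      typedef struct history_entry {
          int op;                      /* MPI operation code */
          int peer, tag;               /* information used for matching */
          struct history_entry *next;  /* next (later) operation */
      } history_entry;

      typedef struct {
          history_entry *head, *tail;  /* oldest and newest unmatched ops */
      } history_queue;                 /* the Manager keeps one per task */

      /* Append a transaction as it arrives from a calling task. */
      static void enqueue(history_queue *q, history_entry *e)
      {
          e->next = NULL;
          if (q->tail) q->tail->next = e;
          else         q->head = e;
          q->tail = e;
      }

      /* Drop the head once it is safely matched (queue must be non-empty). */
      static void remove_matched(history_queue *q)
      {
          history_entry *e = q->head;
          q->head = e->next;
          if (!q->head) q->tail = NULL;
          free(e);
      }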

6. Configuration Dependent Deadlock
   • Unsafe MPI programming practice
   • Code result depends on:
     • MPI implementation limitations
     • User input parameters
   • Classic example code:

     Task 0      Task 1
     MPI_Send    MPI_Send
     MPI_Recv    MPI_Recv
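
Written out as a complete program, the pattern looks like this; whether it hangs depends on whether the MPI implementation can buffer the message, which is exactly the configuration dependence described above.

      /* Run with exactly 2 tasks. Both send before receiving; if neither
       * send can complete without a matching receive, both block forever. */
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          int rank, buf = 0;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          int peer = 1 - rank;

          MPI_Send(&buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
          MPI_Recv(&buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);

          MPI_Finalize();
          return 0;
      }

A safe rewrite reverses one task's call order or replaces the pair with MPI_Sendrecv, which cannot deadlock.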

7. Mismatched Collective Operations
   • Erroneous MPI programming practice
   • Simple example code:

     Tasks 0, 1, & 2    Task 3
     MPI_Bcast          MPI_Barrier
     MPI_Barrier        MPI_Bcast

   • Possible code results:
     • Deadlock
     • Correct message matching
     • Incorrect message matching
     • Mysterious error messages
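
The same mismatch as a complete program, to make the two orderings concrete:

      /* Run with exactly 4 tasks. Tasks 0-2 enter the broadcast first,
       * while task 3 enters the barrier first. */
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          int rank, value = 42;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank != 3) {
              MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
              MPI_Barrier(MPI_COMM_WORLD);
          } else {
              MPI_Barrier(MPI_COMM_WORLD);
              MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
          }

          MPI_Finalize();
          return 0;
      }

Which of the outcomes above actually occurs depends on how the implementation synchronizes its collectives, which is why Umpire flags the mismatched ordering itself rather than its symptoms.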

8. Deadlock detection
   • MPI history queues
     • One per task in Manager
     • Track MPI messaging operations
     • Items added through transactions
     • Removed when safely matched
   • Automatically detect deadlocks
     • MPI operations only
     • Wait-for graph
     • Recursive algorithm (sketched below)
     • Invoked when queue head changes
   • Also support timeouts
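
A toy version of the recursive wait-for check, assuming each blocked task waits on a single representative task; Umpire's actual algorithm walks the full history queues.

      #include <stdio.h>

      #define NTASKS 4

      static int waits_for[NTASKS];  /* task each task is blocked on; -1 = none */

      /* Follow wait-for edges; revisiting a task means a cycle (deadlock). */
      static int has_cycle(int task, int visited[])
      {
          if (task < 0)      return 0;
          if (visited[task]) return 1;
          visited[task] = 1;
          return has_cycle(waits_for[task], visited);
      }

      int main(void)
      {
          /* Scenario from the next slide: tasks 0-2 block in MPI_Bcast
           * waiting on task 3, which blocks in MPI_Barrier waiting
           * (representatively) on task 0. */
          waits_for[0] = 3; waits_for[1] = 3;
          waits_for[2] = 3; waits_for[3] = 0;

          for (int t = 0; t < NTASKS; t++) {
              int visited[NTASKS] = {0};
              if (has_cycle(t, visited)) {
                  fprintf(stderr, "ERROR: deadlock involving task %d\n", t);
                  return 1;
              }
          }
          return 0;
      }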

9. Deadlock Detection Example
   [Diagram of the Manager's history queues as transactions arrive: Task 1: MPI_Bcast, Task 0: MPI_Bcast, Task 2: MPI_Bcast, Task 2: MPI_Barrier, Task 0: MPI_Barrier, Task 3: MPI_Barrier, Task 1: MPI_Barrier. Tasks 0, 1, and 2 block in MPI_Bcast waiting on Task 3, while Task 3 blocks in MPI_Barrier waiting on Tasks 0-2; the wait-for graph has a cycle: ERROR! Report it!]

10. Resource Tracking Errors
    • Many MPI features require resource allocations
      • Communicators, datatypes and requests
    • Detect "leaks" automatically (tracking sketched below)
    • Simple "lost request" example:

      MPI_Irecv (..., &req);
      MPI_Irecv (..., &req);
      MPI_Wait (&req, …);

    • Complicated by assignment
    • Also detect errant writes to send buffers
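
A sketch of how such leak tracking can ride on the same profiling layer: register each request on creation, unregister it on completion, and report anything still live at finalize. The fixed-size registry is an illustrative simplification.

      #include <mpi.h>
      #include <stdio.h>

      #define MAX_REQS 1024
      static MPI_Request live[MAX_REQS];  /* created but not yet completed */
      static int nlive = 0;

      int MPI_Irecv(void *buf, int count, MPI_Datatype dt, int src, int tag,
                    MPI_Comm comm, MPI_Request *req)
      {
          int rc = PMPI_Irecv(buf, count, dt, src, tag, comm, req);
          if (nlive < MAX_REQS) live[nlive++] = *req;  /* register */
          return rc;
      }

      int MPI_Wait(MPI_Request *req, MPI_Status *status)
      {
          for (int i = 0; i < nlive; i++)              /* unregister */
              if (live[i] == *req) { live[i] = live[--nlive]; break; }
          return PMPI_Wait(req, status);
      }

      int MPI_Finalize(void)
      {
          if (nlive > 0)
              fprintf(stderr, "umpire: %d request(s) never completed\n", nlive);
          return PMPI_Finalize();
      }

In the "lost request" example above, the second MPI_Irecv overwrites req, so the first request is still registered at finalize and gets reported as a leak.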

11. Conclusion
    • First automated MPI debugging tool
      • Detects deadlocks
      • Eliminates resource leaks
      • Assures correct non-blocking sends
    • Performance
      • Low overhead (21% for sPPM)
      • Located deadlock in code set-up
    • Limitations
      • MPI_Waitany and MPI_Cancel
      • Shared memory implementation
      • Prototype only

12. Future Work
    • Further prototype testing
      • Improve user interface
      • Handle all MPI calls
    • Tool distribution
      • LLNL application group testing
      • Exploring mechanisms for wider availability
    • Detection of other errors
      • Datatype matching
      • Others?
    • Distributed memory implementation

13. UCRL-VG-139184
    Work performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.
