
µπ: A Scalable & Transparent System for Simulating MPI Programs




Presentation Transcript


  1. SimuTools, Malaga, Spain, March 17, 2010. µπ: A Scalable & Transparent System for Simulating MPI Programs. Kalyan S. Perumalla, Ph.D., Senior R&D Manager, Oak Ridge National Laboratory; Adjunct Professor, Georgia Institute of Technology

  2. Motivation & Background
  • Software & hardware lifetimes: a large parallel machine lives about 5 years, while useful parallel code lives about 20 years and must be ported, analyzed, and optimized across machines
  • Software & hardware co-design: e.g., cost/benefit of a 1 μs barrier
  • Hardware design: e.g., the load generated by the application
  • Software design: scaling, debugging, testing, customizing
  • Ease of development: obviates the need for the actual hardware at scale
  • Energy efficiency: reduces failed runs at actual scale

  3. μπ Performance Investigation System
  • μπ = micro parallel performance investigator
  • Performance prediction for MPI, Portals, and other parallel applications
  • Actual application code is executed on the real hardware
  • The platform is simulated at large virtual scale, with timing customized by a user-defined machine model
  • Scale is the key differentiator: the target is 1,000,000 virtual cores, e.g., 1,000,000 virtual MPI ranks in the simulated MPI application
  • Based on the µsik micro-simulator kernel, a highly scalable PDES (parallel discrete event simulation) engine

  4. Generalized Interface & Timing Framework
  [Slide diagram: the application alternates compute phases with MPI calls intercepted by μπ; labels: Tcomp (compute time), Tcomm (communication time), MPI call entry, MPI call exit]
  • Accommodates an arbitrary level of timing detail
  • Compute time: can use a full-system (instruction-level) simulation on the side, or model it with cache effects, a corrected processor speed, etc., depending on the user's desired accuracy-cost trade-off
  • Communication time: can use a network simulator, queueing and congestion models, etc., again depending on the desired accuracy-cost trade-off
  A minimal sketch of such a user-supplied communication-time model is given below.
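To make the timing hooks concrete, here is a minimal sketch of the kind of communication-time model a user might plug into this framework. The function name, parameters, and constants are hypothetical illustrations under a simple latency-plus-bandwidth assumption; they are not part of the μπ API described on the slide.

    /* Hypothetical user-defined communication-time model (illustration only):
     * a simple latency + size/bandwidth estimate for a point-to-point message. */
    #include <stddef.h>

    static const double LINK_LATENCY_S = 2.0e-6;   /* assumed 2 us per-message latency */
    static const double LINK_BANDWIDTH = 2.0e9;    /* assumed 2 GB/s link bandwidth    */

    /* Returns the simulated time (seconds) charged for sending `bytes` bytes. */
    double model_comm_time(size_t bytes)
    {
        return LINK_LATENCY_S + (double)bytes / LINK_BANDWIDTH;
    }

A model like this would stand in for Tcomm; a full network simulator could replace it when higher accuracy is worth the cost.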

  5. Compiling an MPI application with μπ
  • Modify the #include and recompile: change #include <mpi.h> to #include <mupi.h>
  • Relink against the μπ library: instead of -lmpi, use -lmupi
  A minimal example of the source-level change is shown below.
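As a concrete illustration, only the header line of an ordinary MPI program changes when building against μπ; the program body below is standard MPI and uses only calls listed as supported on slide 7. Per the slide, the link step swaps -lmpi for -lmupi.

    /* Minimal MPI program prepared for μπ: the only source change from a
     * normal MPI build is including <mupi.h> instead of <mpi.h>. */
    #include <mupi.h>   /* was: #include <mpi.h> */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("virtual rank %d of %d\n", rank, size);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }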

  6. Executing an MPI application over μπ
  • Run the modified MPI application (a μπ simulation): mpirun -np 4 test -nvp 32 runs test with 32 virtual MPI ranks, while the simulation itself uses 4 real cores
  • μπ itself uses multiple real cores to run the simulation in parallel

  7. Interface Support
  • Existing, sufficient: MPI_Init(), MPI_Finalize(), MPI_Comm_rank(), MPI_Comm_size(), MPI_Barrier(), MPI_Send(), MPI_Recv(), MPI_Isend(), MPI_Irecv(), MPI_Waitall(), MPI_Wtime(), MPI_COMM_WORLD
  • Planned, optional: other wait variants, other send/recv variants, other collectives, group communication
  • Other, performance-oriented: MPI_Elapse_time(dt), added for simulation speed; avoids actual computation and instead simply elapses simulated time
  A sketch of how MPI_Elapse_time() might be used in place of a compute kernel follows.
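The sketch below shows how MPI_Elapse_time() could replace an actual compute kernel when only its duration matters to the simulation. Only the name MPI_Elapse_time(dt) appears on the slide; the argument type (seconds as a double) and the surrounding function are illustrative assumptions.

    /* Illustrative use of μπ's MPI_Elapse_time(): instead of burning real CPU
     * time in a compute kernel, the virtual rank simply advances its simulated
     * clock by the modeled duration. Signature assumed, not documented here. */
    #include <mupi.h>

    void timestep(double modeled_compute_seconds)
    {
        /* The original code might call an expensive kernel at this point. */
        MPI_Elapse_time(modeled_compute_seconds);   /* charge the cost in virtual time only */
        MPI_Barrier(MPI_COMM_WORLD);                /* communication is still simulated as usual */
    }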

  8. Performance Study
  • Benchmarks: zero lookahead; 10 μs lookahead
  • Platform: Cray XT5, 226K cores
  • Scaling results: event cost, synchronization overhead, multiplexing gain

  9. Experimentation Platform: Jaguar (* data and images from http://nccs.gov)

  10. Event Cost

  11. Synchronization Speed

  12. Multiplexing Gain

  13. μπ Summary: Quantitative
  • Unprecedented scalability: 27,648,000 virtual MPI ranks on 216,000 actual cores
  • Optimal multiplex factor observed: 64 virtual ranks per real rank
  • Low slowdown even in zero-lookahead scenarios, even on fast virtual networks
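For context, simple arithmetic on the figures above gives the packing of the largest run:

    27,648,000 virtual ranks / 216,000 real cores = 128 virtual ranks per real core

So the largest demonstrated run packed 128 virtual ranks onto each real core, while the best multiplexing gain was observed at 64 virtual ranks per real rank.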

  14. μπ Summary: Qualitative
  • The only available simulator for highly scaled MPI runs
  • Suitable for source-available, trace-driven, or modeled applications
  • Configurable hardware timing: user-specified latencies, bandwidths, arbitrary inter-network models
  • Executions are repeatable and deterministic: global time-stamped ordering, a deterministic timing model, and purely discrete event simulation
  • Most suitable for applications whose MPI communication can be trapped (on-line, live actual execution), instrumented (off-line trace generation, trace-driven on-line execution), or modeled (model-driven computation and MPI communication patterns)
  • Nearly zero perturbation with unlimited instrumentation

  15. Ongoing Work
  • NAS benchmarks, e.g., FFT
  • Actual at-scale applications, e.g., chemistry
  • Optimized implementation of certain MPI primitives, e.g., MPI_Barrier(), MPI_Reduce()
  • Ties to other important phenomena, e.g., energy consumption models

  16. Thank you! Questions? Discrete Computing Systems: www.ornl.gov/~2ip
