
AMPI: Adaptive MPI



Presentation Transcript


  1. AMPI: Adaptive MPI Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

  2. Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI

  3. AMPI: Motivation • Challenges • New generation parallel applications are: • Dynamically varying: load shifting during execution • Adaptively refined • Composed of multi-physics modules • Typical MPI Implementations: • Not naturally suitable for dynamic applications • Available processor set may not match algorithm • Alternative: Adaptive MPI (AMPI) • MPI & Charm++ virtualization: VP (“Virtual Processors”) AMPI: Adaptive MPI

  4. Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI

  5. AMPI: Overview • Virtualization: MPI ranks → Charm++ threads • MPI “tasks” are implemented as user-level migratable threads (VPs: virtual processors) mapped onto real processors AMPI: Adaptive MPI

  6. AMPI: Overview (cont.) • AMPI Execution Model: • Multiple user-level threads per process • Typically, one process per physical processor • Charm++ Scheduler coordinates execution • Threads (VPs) can migrate across processors • Virtualization ratio: R = #VP / #P (over-decomposition) • (figure: Charm++ Scheduler driving 4 VPs on 1 processor, i.e. P=1, VP=4) AMPI: Adaptive MPI

  7. AMPI: Overview (cont.) • AMPI’s Over-Decomposition in Practice • MPI: P=4, ranks=4 • AMPI: P=4, VP=ranks=16 AMPI: Adaptive MPI
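
For concreteness, a build-and-run sketch of the over-decomposition above, assuming the standard AMPI compiler wrapper (ampicc) and the charmrun launcher; the program and file names are illustrative:

  # build the MPI code with the AMPI wrapper (hypothetical file name)
  ampicc -o stencil stencil.c
  # run 16 virtual processors (MPI ranks) on 4 physical processors
  ./charmrun +p4 ./stencil +vp16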

  8. Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI

  9. Benefits of AMPI • Overlap between Computation/Communication • Automatically achieved • When one thread blocks on a message, another thread on the same processor can execute • Charm++ Scheduler picks the next thread among those that are ready to run AMPI: Adaptive MPI

  10. Benefits of AMPI (cont.) • Potentially Better Cache Utilization • Gains occur when a subdomain is accessed repeatedly (e.g. by multiple functions called in sequence) • Smaller VP subdomains (1 and 2 in the figure) might fit in cache, but a larger one (3) might not AMPI: Adaptive MPI

  11. Benefits of AMPI (cont.) • Thread Migration for Load Balancing • Migration of thread 13: AMPI: Adaptive MPI

  12. Benefits of AMPI (cont.) • Load Balancing in AMPI: MPI_Migrate() • Collective operation informing the load balancer that the threads can be migrated, if needed, for balancing load • Easy to insert in the code of iterative applications • Leverages Load-Balancing framework of Charm++ • Balancing decisions can be based on • Measured parameters: computation load, communication pattern • Application-provided information AMPI: Adaptive MPI
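
A minimal sketch of where MPI_Migrate(), the AMPI extension named above, typically goes in an iterative code; the kernels and the migration interval are illustrative, not from the slides:

  #include <mpi.h>

  /* Hypothetical application kernels, stubbed out for illustration. */
  static void compute_local(void)  { /* ... local work ...    */ }
  static void exchange_halos(void) { /* ... halo exchange ... */ }

  void solve(int iters) {
      for (int it = 0; it < iters; it++) {
          compute_local();
          exchange_halos();
          /* Every 20 iterations, let the Charm++ load balancer decide
             whether to move this thread (collective call). */
          if (it % 20 == 0)
              MPI_Migrate();
      }
  }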

  13. Benefits of AMPI (cont.) • Decoupling of Physical/Virtual Processors • Problem setup: 3D stencil calculation of size 240³ run on Lemieux • AMPI runs on any number of PEs (e.g. 19, 33, 105), while native MPI needs P=K³ AMPI: Adaptive MPI

  14. Benefits of AMPI (cont.) • Asynchronous Implementation of Collectives • Collective operation is posted and returns immediately • Test/wait for its completion; meanwhile, do useful work e.g. MPI_Ialltoall( … , &req); /* other computation */ MPI_Wait(&req, MPI_STATUS_IGNORE); • Other operations available: MPI_Iallreduce, MPI_Iallgather • Example: 2D FFT benchmark (figure: time in ms) AMPI: Adaptive MPI
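
A fuller sketch of the post/overlap/wait pattern from the bullet above, using the standard MPI-3 signature of MPI_Ialltoall; the buffer sizes and the independent-work routine are illustrative:

  #include <mpi.h>

  /* Hypothetical computation that does not depend on the collective's result. */
  static void do_independent_work(void) { /* ... */ }

  void overlapped_alltoall(double *sendbuf, double *recvbuf, int n, MPI_Comm comm) {
      MPI_Request req;
      /* Post the collective; the call returns immediately. */
      MPI_Ialltoall(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE, comm, &req);
      do_independent_work();
      /* Block only when the result is actually needed. */
      MPI_Wait(&req, MPI_STATUS_IGNORE);
  }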

  15. Motivation for Collective Communication Optimization • Time breakdown of an all-to-all operation using the Mesh library • Computation is only a small proportion of the elapsed time • A number of optimization techniques have been developed to improve collective communication performance AMPI: Adaptive MPI

  16. Benefits of AMPI (cont.) • Fault Tolerance via Checkpoint/Restart • State of the application is checkpointed to disk or memory • Capable of restarting on a different number of physical processors! • Synchronous checkpoint, collective call: • In-disk: MPI_Checkpoint(DIRNAME) • In-memory: MPI_MemCheckpoint(void) • Restart: • In-disk: charmrun +p4 prog +restart DIRNAME • In-memory: automatic restart upon failure detection AMPI: Adaptive MPI
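
A minimal sketch of periodic in-disk checkpointing in an iterative driver, using the MPI_Checkpoint extension named above; the timestep routine, interval, and directory name are illustrative:

  #include <mpi.h>

  /* Hypothetical application timestep, stubbed out for illustration. */
  static void timestep(void) { /* ... */ }

  void run(int iters) {
      for (int it = 0; it < iters; it++) {
          timestep();
          /* Collective call: every rank (VP) must reach it. */
          if (it % 100 == 0)
              MPI_Checkpoint("ckpt");
      }
  }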

  17. Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI

  18. Converting MPI Codes to AMPI • AMPI needs its own initialization, before user code • Fortran program entry-point becomes MPI_Main: • program pgm → subroutine MPI_Main • ... ... • end program → end subroutine • C program entry-point is handled automatically via mpi.h (include it in the same file as main() if it is absent) • If the code has no global/static variables, this is all that is needed to convert! AMPI: Adaptive MPI

  19. Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI

  20. Handling Global/Static Variables • Global and static variables are a problem in multi-threaded programs (similar problem in OpenMP): • Globals/statics have a single instance per process • They become shared by all threads in the process • Example (two threads over time): thread 1 executes var = myid (=1) and blocks in MPI_Recv(); thread 2 then executes var = myid (=2) and blocks in MPI_Recv(); when thread 1 resumes, b = var picks up thread 2's value • If var is a global/static, an incorrect value is read! AMPI: Adaptive MPI
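
A minimal C sketch of the hazard in the timeline above, assuming two AMPI threads of the same process run this code; the names are illustrative:

  #include <mpi.h>

  int var;  /* one instance per process: shared by all AMPI threads in it */

  int step(int myid, int src, MPI_Comm comm) {
      int msg;
      var = myid;  /* this thread stores its rank in the shared global */
      /* Blocking receive: the Charm++ scheduler may switch to another thread,
         which also executes var = myid and overwrites our value. */
      MPI_Recv(&msg, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
      return var;  /* may now be the other thread's rank: incorrect value read */
  }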

  21. Handling Global/Static Variables (cont.) • General Solution: Privatize variables in thread • Approaches: • Swap global variables • Source-to-source transformation via Photran • Use TLS scheme (in development) Specific approach to use must be decided on a case-by-case basis AMPI: Adaptive MPI

  22. Handling Global/Static Variables (cont.) • First Approach: Swap global variables • Leverages ELF, the Executable and Linking Format (e.g. on Linux) • ELF maintains a Global Offset Table (GOT) for globals • Switch GOT contents at thread context switch • Implemented in AMPI via the build flag -swapglobals • No source code changes needed • Works with any language (C, C++, Fortran, etc.) • Does not handle static variables • Context-switch overhead grows with the number of global variables AMPI: Adaptive MPI
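
For example, the flag is passed when building with the AMPI wrapper (a sketch; wrapper usage and file names are illustrative):

  # link-time option: swap GOT entries at every thread context switch
  ampicc -o prog prog.c -swapglobals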

  23. Handling Global/Static Variables (cont.) • Second Approach: Source-to-source transform • Move globals/statics to an object, then pass it around • Automatic solution for Fortran codes: Photran • Similar idea can be applied to C/C++ codes • Totally portable across systems/compilers • May improve locality and cache utilization • No extra overhead at context-switch • Requires new implementation for each language AMPI: Adaptive MPI

  24. Handling Global/Static Variables (cont.) • Example of Transformation: C Program • Original Code: Transformed Code: AMPI: Adaptive MPI
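
The slide's before/after code is shown as images that did not survive in the transcript; the sketch below illustrates the kind of transformation meant, with illustrative names rather than the slide's actual example:

  /* Original: a process-wide global, shared by all AMPI threads. */
  int counter;
  void bump(void) { counter++; }

  /* Transformed: the global moves into a per-thread structure that is
     created at startup and passed explicitly to every routine. */
  typedef struct { int counter; } GlobalVars;
  void bump_t(GlobalVars *g) { g->counter++; }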

  25. Handling Global/Static Variables (cont.) • Example of Photran Transformation: Fortran Prog. • Original Code: Transformed Code: AMPI: Adaptive MPI

  26. Handling Global/Static Variables (cont.) • Photran Transformation Tool • Eclipse-based IDE, implemented in Java • Incorporates automatic refactorings for Fortran codes • Operates on “pure” Fortran 90 programs • Code transformation infrastructure: • Construct rewriteable ASTs • ASTs are augmented with binding information • Source: Stas Negara & Ralph Johnson • http://www.eclipse.org/photran/ AMPI: Adaptive MPI

  27. Handling Global/Static Variables (cont.) • Photran-AMPI • GUI: • Source: Stas Negara & Ralph Johnson • http://www.eclipse.org/photran/ AMPI: Adaptive MPI

  28. NAS Benchmark AMPI: Adaptive MPI

  29. FLASH Results AMPI: Adaptive MPI

  30. Handling Global/Static Variables (cont.) • Third Approach: TLS scheme (Thread-Local-Store) • Originally employed in kernel threads • In C code, variables are annotated with __thread • Modified/adapted gfortran compiler available • Handles uniformly both globals and statics • No extra overhead at context-switch • Although popular, not yet a standard for compilers • Current Charm++ support only for x86 platforms AMPI: Adaptive MPI
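
A minimal sketch of the TLS annotation described above (the variable name is illustrative):

  /* Each thread gets its own instance; no code changes beyond the annotation. */
  __thread int my_counter = 0;

  void bump(void) { my_counter++; }  /* touches only this thread's copy */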

  31. Handling Global/Static Variables (cont.) • Summary of Current Privatization Schemes: • Program transformation is very portable • TLS scheme may become supported on Blue Waters, depending on work with IBM AMPI: Adaptive MPI

  32. NAS Benchmark AMPI: Adaptive MPI

  33. FLASH Results • FLASH is a parallel, multi-dimensional code used to study astrophysical fluids. • Many astrophysical environments are highly turbulent, and have structure on scales varying from large scale, like galaxy clusters, to small scale, like active galactic nuclei, in the same system. AMPI: Adaptive MPI

  34. Outline • Motivation • AMPI: Overview • Benefits of AMPI • Converting MPI Codes to AMPI • Handling Global/Static Variables • Running AMPI Programs • AMPI Status • AMPI References, Conclusion AMPI: Adaptive MPI

  35. Object Migration AMPI: Adaptive MPI

  36. Object Migration • How do we move work between processors? • Application-specific methods • E.g., move rows of sparse matrix, elements of FEM computation • Often very difficult for application • Application-independent methods • E.g., move entire virtual processor • Application’s problem decomposition doesn’t change AMPI: Adaptive MPI

  37. How to Migrate a Virtual Processor? • Move all application state to new processor • Stack Data • Subroutine variables and calls • Managed by compiler • Heap Data • Allocated with malloc/free • Managed by user • Global Variables • Open files, environment variables, etc. (not handled yet!) AMPI: Adaptive MPI

  38. Stack Data • The stack is used by the compiler to track function calls and provide temporary storage • Local Variables • Subroutine Parameters • C “alloca” storage • Most of the variables in a typical application are stack data AMPI: Adaptive MPI

  39. Migrate Stack Data • Without compiler support, we cannot change the stack's address • Because we cannot update the stack's interior pointers (return frame pointers, function arguments, etc.) • Solution: “isomalloc” addresses • Reserve address space on every processor for every thread stack • Use mmap to scatter stacks in virtual memory efficiently • Idea comes from PM2 AMPI: Adaptive MPI
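
A rough sketch of the underlying mmap idea (the addresses, sizes, and helper name are illustrative; AMPI's actual isomalloc implementation is more involved):

  #include <stddef.h>
  #include <sys/mman.h>

  /* Reserve the same virtual-address slot for a thread's stack on every
     processor, so the stack's interior pointers remain valid after migration. */
  void *reserve_stack_slot(void *fixed_addr, size_t len) {
      return mmap(fixed_addr, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  }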

  40. Migrate Stack Data (figure: Processor A's and Processor B's memory, from 0x00000000 to 0xFFFFFFFF, each holding code, globals, heap, and thread stacks; thread 3's stack is about to migrate from A to B) AMPI: Adaptive MPI

  41. Migrate Stack Data (figure: the same layout after migration, with thread 3's stack now resident on Processor B) AMPI: Adaptive MPI

  42. Migrate Stack Data • Isomalloc is a completely automatic solution • No changes needed in application or compilers • Just like a software shared-memory system, but with proactive paging • But has a few limitations • Depends on having large quantities of virtual address space (best on 64-bit) • 32-bit machines can only have a few gigs of isomalloc stacks across the whole machine • Depends on unportable mmap • Which addresses are safe? (We must guess!) • What about Windows? Blue Gene? AMPI: Adaptive MPI

  43. Heap Data • Heap data is any dynamically allocated data • C “malloc” and “free” • C++ “new” and “delete” • F90 “ALLOCATE” and “DEALLOCATE” • Arrays and linked data structures are almost always heap data AMPI: Adaptive MPI

  44. Migrate Heap Data • Automatic solution: isomalloc all heap data just like stacks! • “-memory isomalloc” link option • Overrides malloc/free • No new application code needed • Same limitations as isomalloc • Manual solution: application moves its heap data • Need to be able to size message buffer, pack data into message, and unpack on other side • “pup” abstraction does all three AMPI: Adaptive MPI
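
For example, the automatic option is selected at link time (a sketch; wrapper usage and file names are illustrative):

  # route malloc/free through isomalloc so heap data migrates with the thread
  ampicc -o prog prog.c -memory isomalloc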

  45. Migrate Heap Data: PUP • Same idea as MPI derived types, but datatype description is code, not data • Basic contract: here is my data • Sizing: counts up data size • Packing: copies data into message • Unpacking: copies data back out • Same call works for network, memory, disk I/O ... • Register “pup routine” with runtime • F90/C Interface: subroutine calls • E.g., pup_int(p,&x); • C++ Interface: operator| overloading • E.g., p|x; AMPI: Adaptive MPI

  46. Migrate Heap Data: PUP Builtins • Supported PUP Datatypes • Basic types (int, float, etc.) • Arrays of basic types • Unformatted bytes • Extra Support in C++ • Can overload user-defined types • Define your own operator| • Support for pointer-to-parent class • PUP::able interface • Supports STL vector, list, map, and string • “pup_stl.h” • Subclass your own PUP::er object AMPI: Adaptive MPI

  47. Migrate Heap Data: PUP C++ Example

  #include "pup.h"
  #include "pup_stl.h"

  class myMesh {
    std::vector<float> nodes;
    std::vector<int> elts;
  public:
    ...
    void pup(PUP::er &p) {
      p|nodes;
      p|elts;
    }
  };

  AMPI: Adaptive MPI

  48. Migrate Heap Data: PUP C Example

  #include <stdlib.h>

  struct myMesh {
    int nn, ne;
    float *nodes;
    int *elts;
  };

  void pupMesh(pup_er p, struct myMesh *mesh) {
    pup_int(p, &mesh->nn);
    pup_int(p, &mesh->ne);
    if (pup_isUnpacking(p)) {  /* allocate data on arrival */
      mesh->nodes = (float *)malloc(mesh->nn * sizeof(float));
      mesh->elts  = (int *)malloc(mesh->ne * sizeof(int));
    }
    pup_floats(p, mesh->nodes, mesh->nn);
    pup_ints(p, mesh->elts, mesh->ne);
    if (pup_isDeleting(p)) {  /* free data on departure */
      deleteMesh(mesh);
    }
  }

  AMPI: Adaptive MPI

  49. Migrate Heap Data: PUP F90 Example

  TYPE myMesh
    INTEGER :: nn, ne
    REAL*4, ALLOCATABLE :: nodes(:)
    INTEGER, ALLOCATABLE :: elts(:)
  END TYPE

  SUBROUTINE pupMesh(p, mesh)
    USE ...
    INTEGER :: p
    TYPE(myMesh) :: mesh
    CALL fpup_int(p, mesh%nn)
    CALL fpup_int(p, mesh%ne)
    IF (fpup_isUnpacking(p)) THEN
      ALLOCATE(mesh%nodes(mesh%nn))
      ALLOCATE(mesh%elts(mesh%ne))
    END IF
    CALL fpup_floats(p, mesh%nodes, mesh%nn)
    CALL fpup_ints(p, mesh%elts, mesh%ne)
    IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)
  END SUBROUTINE

  AMPI: Adaptive MPI

  50. Global Data • Global data is anything stored at a fixed place • C/C++ “extern” or “static” data • F77 “COMMON” blocks • F90 “MODULE” data • Problem if multiple objects/threads try to store different values in the same place (thread safety) • Compilers should make all of these per-thread; but they don’t! • Not a problem if everybody stores the same value (e.g., constants) AMPI: Adaptive MPI
