
Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)


Presentation Transcript


  1. Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH) Heon Y. Yeom, Distributed Computing Systems Lab., Seoul National University

  2. Contents: 1. Motivation 2. Introduction 3. Architecture 4. Conclusion

  3. Motivation • Hardware performance limitations keep being pushed back, just as Moore's Law predicts • These cutting-edge technologies make “Tera-scale” clusters feasible !!! • However.. what about “THE” system reliability ??? • Distributed systems are still fragile due to unexpected failures…

  4. Motivation • High-performance network trend: every MPICH variant demands fault-resilience !!! • MPICH-G2 (Ethernet): good speed (1 Gbps), common, MPICH standard • MPICH-GM (Myrinet): high speed (10 Gbps), popular, MPICH compatible • MVAPICH (InfiniBand): high speed (up to 30 Gbps), will be popular, MPICH compatible • Hence the need for a multiple fault-tolerant framework

  5. Introduction • Unreliability of distributed systems • Even a single local failure can be fatal to parallel processes, since it can render useless all the computation executed up to the point of failure. • Our goal is • To construct a practical multiple fault-tolerant framework for the various MPICH variants working on high-performance clusters/Grids.

  6. Introduction • Why the Message Passing Interface (MPI)? • Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems. • We chose the MPICH series because ... • MPI is the most popular programming model in cluster computing. • Providing fault-tolerance to MPI is more cost-effective than providing it to the OS or hardware…

  7. Architecture -Concept- • The multiple fault-tolerant framework combines failure detection & monitoring, a checkpoint/restart (C/R) protocol, and a consensus & election protocol.

  8. Architecture -Overall System- • (Diagram) The Management System communicates with every node over Ethernet/Gigabit Ethernet, while the MPI processes communicate with one another over the high-speed network (Myrinet, InfiniBand).

  9. Architecture -Development History- • (Timeline, 2003 to current) MPICH-GF: fault-tolerant MPICH-G2 (Ethernet); FT-MPICH-GM: fault-tolerant MPICH-GM (Myrinet); FT-MVAPICH: fault-tolerant MVAPICH (InfiniBand).

  10. Management System • Makes MPI more reliable by handling failure detection, initialization coordination, output management, checkpoint coordination, checkpoint transfer, and recovery.

  11. Management System

  12. Job Management System 1/2 • Job Management System • Manages and monitors multiple MPI processes and their execution environments • Should be lightweight • Helps the system take consistent checkpoints and recover from failures • Has a fault-detection mechanism • Two main components • Central Manager & Local Job Manager

  13. Job Management System 2/2 • Central Manager • Manages all system functions and states • Detects node failures via periodic heartbeats, as well as Job Manager failures • Job Manager • Relays messages between the Central Manager & MPI Processes • Detects unexpected MPI process failures
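A minimal sketch of the heartbeat-based detection described above, assuming a simple table of last-heartbeat times and a fixed timeout; the constants, names, and data layout are illustrative, not the actual FT-MPICH manager code.

```c
/* heartbeat_sketch.c -- illustrative only (assumed names and timeout).
 * The Central Manager keeps the last heartbeat time per node and declares
 * any node that stays silent too long as failed.
 * Build: cc -std=c99 -o heartbeat_sketch heartbeat_sketch.c
 */
#include <stdio.h>
#include <time.h>

#define MAX_NODES         8
#define HEARTBEAT_TIMEOUT 5           /* seconds of silence before declaring failure */

static time_t last_heartbeat[MAX_NODES];   /* 0 means "not registered / already failed" */

/* Called whenever a heartbeat message arrives from a node's Job Manager. */
static void record_heartbeat(int node)
{
    last_heartbeat[node] = time(NULL);
}

/* Scanned periodically; a silent node triggers the recovery protocol
 * (in the real system: roll every rank back to the last checkpoint). */
static void check_failures(void)
{
    time_t now = time(NULL);
    for (int node = 0; node < MAX_NODES; node++) {
        if (last_heartbeat[node] != 0 &&
            now - last_heartbeat[node] > HEARTBEAT_TIMEOUT) {
            printf("node %d declared failed (silent for %ld s)\n",
                   node, (long)(now - last_heartbeat[node]));
            last_heartbeat[node] = 0;  /* report each failure only once */
        }
    }
}

int main(void)
{
    record_heartbeat(0);   /* node 0 reports in */
    check_failures();      /* nothing has timed out yet */
    return 0;
}
```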

  14. Fault-Tolerant MPI 1/3 • To provide MPI fault-tolerance, we adopt • Coordinated checkpointing scheme (vs. independent scheme) • The Central Manager is the Coordinator!! • Application-level checkpointing (vs. kernel-level CKPT.) • This method does not require any effort on the part of cluster administrators • User-transparent checkpointing scheme (vs. user-aware) • This method requires no modification of the user's MPI source code
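To make the "application-level, user-transparent" choice concrete, here is a minimal sketch of a checkpoint hook that the library (not the user program) could install, so no application source changes are needed; the signal number, function names, and checkpoint file name are assumptions, not the actual MPICH-GF mechanism.

```c
/* ckpt_hook_sketch.c -- illustrative user-transparent checkpoint hook.
 * Assumed design: the MPI library installs a signal handler at startup and
 * saves process state at a safe point when the Job Manager requests it.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>

static volatile sig_atomic_t ckpt_requested = 0;

/* Only raise a flag here; the real save happens outside the handler. */
static void ckpt_signal_handler(int sig)
{
    (void)sig;
    ckpt_requested = 1;
}

/* Would be called from inside the library (e.g. wrapped around MPI_Init),
 * so the user's MPI source code never changes. */
void ftmpi_install_checkpoint_hook(void)        /* hypothetical name */
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = ckpt_signal_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);              /* assumed checkpoint signal */
}

/* Checked at library-chosen safe points (e.g. inside blocking MPI calls). */
void ftmpi_maybe_checkpoint(void)               /* hypothetical name */
{
    if (!ckpt_requested)
        return;
    ckpt_requested = 0;
    FILE *f = fopen("rank.ckpt", "wb");         /* hypothetical checkpoint image */
    if (f) {
        /* a real implementation would dump data/heap/stack segments here */
        fwrite("state", 1, 5, f);
        fclose(f);
    }
}

int main(void)
{
    ftmpi_install_checkpoint_hook();
    raise(SIGUSR1);            /* pretend the Job Manager asked for a checkpoint */
    ftmpi_maybe_checkpoint();
    return 0;
}
```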

  15. Fault-Tolerant MPI 2/3 • Coordinated Checkpointing • (Diagram) The Central Manager issues a checkpoint command to all ranks (rank0–rank3); each rank saves a new checkpoint version (ver 1, ver 2, …) to stable storage.
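A minimal sketch of one coordinated checkpoint round, written as a plain MPI program so it stays self-contained; in the real system the Central Manager drives this over its own control channel rather than over MPI, and the version number and file naming below are assumptions.

```c
/* coord_ckpt_sketch.c -- illustrative coordinated checkpoint round.
 * Build/run: mpicc coord_ckpt_sketch.c -o coord_ckpt_sketch && mpirun -np 4 ./coord_ckpt_sketch
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, version;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Phase 1: the coordinator announces a new checkpoint version (assumed "ver 2"). */
    version = (rank == 0) ? 2 : 0;
    MPI_Bcast(&version, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Drain in-flight traffic so every rank cuts a consistent global state. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Phase 2: every rank writes its local state to stable storage. */
    char name[64];
    snprintf(name, sizeof name, "rank%d.ckpt.%d", rank, version);  /* assumed naming */
    printf("rank %d: checkpoint written to %s (sketch only)\n", rank, name);

    /* The coordinator commits the version only after all ranks have finished. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("coordinator: checkpoint version %d committed\n", version);

    MPI_Finalize();
    return 0;
}
```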

  16. Fault-Tolerant MPI 3/3 • Recovery from failures • (Diagram) On failure detection, the Central Manager issues a checkpoint (restart) command and all ranks (rank0–rank3) roll back to the last checkpoint version (e.g., ver 1) kept in stable storage.
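For the restart side, a correspondingly minimal sketch, reusing the checkpoint file naming assumed in the previous example; it only illustrates the "reload the last committed version" step, not the full recovery protocol (process respawn, connection re-establishment, etc.).

```c
/* recovery_sketch.c -- illustrative rollback step after a failure.
 * Assumed: the Central Manager tells each (re)started rank which checkpoint
 * version was last committed, and the rank reloads its saved image from it.
 */
#include <stdio.h>

static int restore_checkpoint(int rank, int version)
{
    char name[64];
    snprintf(name, sizeof name, "rank%d.ckpt.%d", rank, version);  /* assumed naming */

    FILE *f = fopen(name, "rb");
    if (!f) {
        fprintf(stderr, "rank %d: missing checkpoint %s\n", rank, name);
        return -1;
    }
    /* a real implementation would map the saved segments back into memory
     * and then re-establish connections with the other ranks */
    fclose(f);
    printf("rank %d: restored from %s\n", rank, name);
    return 0;
}

int main(void)
{
    int committed_version = 1;   /* e.g. "roll back to ver 1" from the manager */
    return restore_checkpoint(0, committed_version) ? 1 : 0;
}
```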

  17. Management System • MPICH-GF • Based on the Globus Toolkit 2 • Hierarchical Management System • Suitable for multiple clusters • Supports recovery from process/manager/node failure • Limitation • Does not support recovery from multiple failures • Has a single point of failure (the Central Manager)

  18. Management System • FT-MPICH-GM • New version • It does not rely on the Globus Toolkit. • Removes the hierarchical structure • Myrinet/InfiniBand clusters no longer require a hierarchical structure. • Supports recovery from multiple failures • FT-MVAPICH • More robust • Removes the single point of failure • Leader election for the job manager
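To illustrate how leader election can remove the single point of failure mentioned above, the sketch below lets surviving Job Managers agree on a new Central Manager with a simple "lowest live id wins" rule; the slides do not spell out the actual FT-MVAPICH election protocol, so this is only an assumption-level example.

```c
/* election_sketch.c -- illustrative leader election among Job Managers.
 * Assumed rule: every survivor applies the same deterministic choice
 * (lowest id that is still alive), so they all pick the same new leader
 * once their failure detectors agree on who is alive.
 */
#include <stdbool.h>
#include <stdio.h>

#define NUM_MANAGERS 4

/* Liveness as observed via heartbeats (see the failure-detection sketch). */
static bool alive[NUM_MANAGERS] = { false, true, true, true };  /* manager 0 just died */

static int elect_leader(void)
{
    for (int id = 0; id < NUM_MANAGERS; id++)
        if (alive[id])
            return id;
    return -1;   /* no manager left at all */
}

int main(void)
{
    printf("new central manager: %d\n", elect_leader());   /* prints 1 */
    return 0;
}
```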

  19. Fault-tolerant MPICH variants • (Layer diagram) MPICH-GF, FT-MPICH-GM and FT-MVAPICH all layer collective and P2P operations over the ADI (Abstract Device Interface), which maps onto Globus2 (Ethernet), GM (Myrinet) or MVAPICH (InfiniBand) over the corresponding network (Ethernet, Myrinet, InfiniBand). • The FT module adds a recovery module, a checkpoint toolkit, connection re-establishment, and atomic message transfer.
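The "connection re-establishment" and "atomic message transfer" boxes in the layer diagram suggest a send path that can replay messages that were in flight when a connection broke. The sketch below shows that idea with an invented resend queue; the structure, names, and ack scheme are assumptions, not the real FT module.

```c
/* atomic_send_sketch.c -- illustrative atomic message transfer.
 * Every outgoing message stays in a resend queue until the peer acknowledges
 * its sequence number, so after connection re-establishment any unacknowledged
 * message can be replayed instead of being lost or half-delivered.
 */
#include <stdio.h>

#define QUEUE_LEN 16

struct pending {                  /* one not-yet-acknowledged message */
    unsigned seq;
    char     payload[64];
    int      in_use;
};

static struct pending queue[QUEUE_LEN];
static unsigned next_seq = 1;

/* Record the message before handing it to the network layer (GM/InfiniBand send omitted). */
static unsigned atomic_send(const char *payload)
{
    struct pending *slot = &queue[next_seq % QUEUE_LEN];
    slot->seq = next_seq;
    snprintf(slot->payload, sizeof slot->payload, "%s", payload);
    slot->in_use = 1;
    return next_seq++;
}

/* The peer acknowledged everything up to `seq`; drop those entries. */
static void atomic_ack(unsigned seq)
{
    for (int i = 0; i < QUEUE_LEN; i++)
        if (queue[i].in_use && queue[i].seq <= seq)
            queue[i].in_use = 0;
}

/* After the connection is re-established, replay whatever was never acknowledged. */
static void replay_unacked(void)
{
    for (int i = 0; i < QUEUE_LEN; i++)
        if (queue[i].in_use)
            printf("resend seq %u: %s\n", queue[i].seq, queue[i].payload);
}

int main(void)
{
    atomic_send("hello");
    atomic_send("world");
    atomic_ack(1);          /* only "hello" was acknowledged before the failure */
    replay_unacked();       /* replays seq 2 after reconnection */
    return 0;
}
```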

  20. Future Work • We're working to incorporate our FT protocol into the GT-4 framework. • MPICH-GF is GT-2 compliant • Incorporating the fault-tolerant management protocol into GT-4 • Make MPICH work with different clusters • Gig-E • Myrinet • Open-MPI, VMI, etc. • InfiniBand • Supporting non-Intel CPUs • AMD (Opteron)

  21. GRID Issues • Who should be responsible for: • Monitoring the up/down status of nodes. • Resubmitting failed processes. • Allocating new nodes. • GRID Job Management • Resource management • Scheduler • Health monitoring

  22. Thank You !
