
A Proposal of Application Failure Detection and Recovery in the Grid

Marian Bubak (1,2), Tomasz Szepieniec (2), Marcin Radecki (2). (1) Institute of Computer Science, AGH; (2) Academic Computer Centre -- CYFRONET.


Presentation Transcript


  1. Institute of Computer Science AGH • A Proposal of Application Failure Detection and Recovery in the Grid • Marian Bubak (1,2), Tomasz Szepieniec (2), Marcin Radecki (2) • (1) Institute of Computer Science, AGH • (2) Academic Computer Centre -- CYFRONET

  2. Outline • Motivation & introduction • Services useful in fault recovery approach • Overview of our proposal • Problems & workflow approach • Summary

  3. Motivation • Reliability of a single component does not rise considerably • Environment and application size increase steadily • Risk that an application crashes is higher • A crash is more expensive for a large application • The fault tolerance problem becomes important in the Grid

  4. Demands vs. Reality • Demands: • Minimal overhead • Automatic, quick recovery • Scalability • Transparent • Porting to any kind of application • Reality: • Checkpointing is costly • Often restarting whole application • Many global operations • Additional developer's effort is required • Application-specific methods

  5. Two classes of FT approaches • Application built-in FT • Algorithm/structure profile can be exploited • FT activity can be done more efficiently, e.g. checkpointing • Naturally fault-tolerant problem class, e.g. genetic algorithms • Fault Tolerant MPI • but... all must be done by the developer • FT realized by external services • automatic middleware services • no developer effort required • but... limited functionality • It would be beneficial to combine these two

  6. Services useful in FT approach • Monitoring services • For fault detection in hardware and software • e.g. check if a process is still running • Checkpointing, logging, redundancy services • For preparing recovery • e.g. store the current state of the application • Recovery services • In case of failure • e.g. roll back to the last checkpoint • Scheduler and resource broker • For knowledge about the started application • For re-scheduling or re-brokering a job or its part

  7. How to make it work together? • A component that manages these services is needed • part of middleware • job companion • co-ordinates actions of FT services • The recovery action taken is more appropriate, because: • the whole job state is considered • the most suitable of the available services can be used [Diagram labels: Fault Tolerant Manager; Infrastructure Mon. Services; Application Mon. Services; Checkpointing Services; Scheduler Services; Recovery Services]

  8. FT Manager – Architecture [Diagram labels: Infrastructure; Application; Application Monitoring; Checkpointing; Recovery; Infrastructure Mon. Services; Application Mon. Services; Checkpointing Services; Scheduler Services; Recovery Services; Fault Tolerant Manager with Job Supervisor, Decision Maker, Recovery Scenario Executor]

  9. Job Supervisor (1) • Main functionality: • Monitors job execution • Manages (or stores information about) checkpointing • When something is wrong, generates a Fault Alarm • The Fault Alarm contains not only information about what is wrong, but also the status of the job (e.g. last checkpoint) • The Job Supervisor can be asked by the Decision Maker to perform more checking [Diagram labels: Fault Tolerant Manager; Job Supervisor; Fault Alarm; Decision Maker; Recovery Scenario Executor]
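The Job Supervisor behaviour on this slide could be sketched roughly as follows. This is only an illustration in Python, not the authors' implementation: the class and field names (`FaultAlarm`, `JobSupervisor`, `probes`, `record_checkpoint`) are assumptions, and the liveness probes stand in for whatever monitoring services are actually available.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FaultAlarm:
    """Hypothetical alarm record: what is wrong plus the job status."""
    job_id: str
    component: str
    fault: str
    last_checkpoint: Optional[str]


class JobSupervisor:
    """Hypothetical supervisor: polls probes and remembers the last checkpoint."""

    def __init__(self, job_id: str, probes: Dict[str, Callable[[], bool]]):
        self.job_id = job_id
        self.probes = probes  # component name -> callable, True if healthy
        self.last_checkpoint: Optional[str] = None

    def record_checkpoint(self, checkpoint_id: str) -> None:
        # Store information about checkpointing so alarms can carry it.
        self.last_checkpoint = checkpoint_id

    def check(self, components: Optional[List[str]] = None) -> List[FaultAlarm]:
        # With no argument all components are checked; the Decision Maker
        # could pass a subset to request additional, targeted checking.
        names = list(self.probes) if components is None else components
        alarms = []
        for name in names:
            if not self.probes[name]():
                alarms.append(
                    FaultAlarm(self.job_id, name, "not responding",
                               self.last_checkpoint)
                )
        return alarms


supervisor = JobSupervisor(
    "job-1", {"master": lambda: True, "worker-3": lambda: False}
)
supervisor.record_checkpoint("ckpt-7")
alarms = supervisor.check()
```

Carrying the last checkpoint inside the alarm means the receiver does not need a second query to learn the job status before planning recovery.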

  10. Job Supervisor (2) – Faults • Typical examples of faults: • process crash • node is not responding • lost connection (link is down) • Extended fault characteristics: • Occurrence and duration characteristics • Severity for the application • e.g. a master fault is more dangerous than a slave fault • A fault occurs not only when a connection is lost, but also when performance decreases dramatically • Sophisticated performance monitoring is required [Diagram labels: Fault Tolerant Manager; Job Supervisor; Fault Alarm; Decision Maker; Recovery Scenario Executor]
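As a rough illustration of treating a dramatic performance decrease as a fault, the sketch below flags a component when its recent throughput falls well under a baseline. The function names, the 50% threshold, the window of three samples, and the numeric severity values are all assumptions made for this example; the slides do not specify them.

```python
from statistics import mean
from typing import Sequence

# Assumed severity ranking: a master fault is more dangerous than a slave fault.
SEVERITY = {"master": 2, "slave": 1}


def severity(role: str) -> int:
    """Return the assumed severity of a fault in a component with this role."""
    return SEVERITY.get(role, 1)


def is_performance_fault(samples: Sequence[float], baseline: float,
                         threshold: float = 0.5, window: int = 3) -> bool:
    """Flag a fault when the mean of the last `window` samples
    drops below `threshold * baseline`."""
    if len(samples) < window:
        return False  # not enough data to judge
    return mean(samples[-window:]) < threshold * baseline


degraded = is_performance_fault([100, 95, 30, 20, 25], baseline=100)
healthy = is_performance_fault([100, 95, 90], baseline=100)
```

Averaging over a window rather than reacting to a single sample is one simple way to account for the duration of a fault, so that a short dip does not raise an alarm.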

  11. Decision Maker • Main functionality: • Analyzes the situation when it gets a Fault Alarm • Prepares recovery scenarios and sends the best of them for execution • Issues to be considered: • What is possible • The cost of each recovery scenario • A do-nothing or wait scenario is always possible and sometimes beneficial • e.g. in case of a problem with a network link, when the only recovery is to restart the whole application • Historical data and probabilistic methods should be used [Diagram labels: Fault Tolerant Manager; Job Supervisor; Fault Alarm; Decision Maker; Recovery Scenario; Recovery Scenario Executor]
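One possible way to compare scenarios by cost, including the wait scenario, is an expected-cost rule. The sketch below is an assumption for illustration, not the method proposed in the slides: each scenario has a cost and a success probability (which could come from historical data), and a failed scenario is assumed to be followed by a fallback such as restarting the whole application. All names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scenario:
    """Hypothetical recovery scenario with an estimated cost (e.g. seconds)
    and an estimated probability that it resolves the fault."""
    name: str
    cost: float
    success_probability: float


def expected_cost(scenario: Scenario, fallback_cost: float) -> float:
    # Cost of trying the scenario, plus the fallback cost weighted by
    # the probability that the scenario does not help.
    return scenario.cost + (1.0 - scenario.success_probability) * fallback_cost


def choose_scenario(scenarios: List[Scenario], fallback_cost: float) -> Scenario:
    """Pick the scenario with the lowest expected cost."""
    return min(scenarios, key=lambda s: expected_cost(s, fallback_cost))


restart = Scenario("restart whole application", cost=600.0,
                   success_probability=1.0)
wait = Scenario("wait for link", cost=60.0, success_probability=0.8)
best = choose_scenario([restart, wait], fallback_cost=600.0)
```

With these assumed numbers waiting has an expected cost of 60 + 0.2 * 600 = 180 against 600 for an immediate restart, which matches the slide's point that waiting is sometimes beneficial; if the link rarely recovers on its own, the same rule prefers the restart.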

  12. Recovery Scenario Executor • Main functionality: • Executes actions from the scenario • Supervises the recovery process • A Recovery Scenario contains several actions that could be performed by different recovery services • In case of a failure in scenario execution, the Decision Maker is alarmed [Diagram labels: Fault Tolerant Manager; Job Supervisor; Decision Maker; Recovery Scenario; Recovery Scenario Executor]
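The executor's role could look roughly like the following sketch, in which a scenario is an ordered list of named actions and a callback stands in for alarming the Decision Maker. The function name `execute_scenario`, the `on_failure` callback, and the action names are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

# A hypothetical action: a name plus a callable that returns True on success.
Action = Tuple[str, Callable[[], bool]]


def execute_scenario(actions: List[Action],
                     on_failure: Callable[[str, List[str]], None]) -> bool:
    """Run actions in order; stop and report on the first failure."""
    completed: List[str] = []
    for name, run in actions:
        try:
            ok = run()
        except Exception:
            ok = False  # an exception in a recovery service counts as failure
        if not ok:
            # Alarm the Decision Maker, telling it what already succeeded.
            on_failure(name, list(completed))
            return False
        completed.append(name)
    return True


reported = []
result = execute_scenario(
    [
        ("allocate node", lambda: True),
        ("roll back to checkpoint", lambda: False),
        ("resume job", lambda: True),
    ],
    on_failure=lambda name, done: reported.append((name, done)),
)
```

Reporting the already completed actions along with the failed one gives the receiver enough context to prepare a follow-up scenario instead of starting from scratch.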

  13. Problems • Many classes of services to cooperate with • Many interfaces • How to obtain information about the application? • Which services are available? • A semantic specification for monitoring and recovery services is needed

  14. Feasibility – Workflows • A Grid-services-based approach could help to solve our problems • Knowledge about the application architecture is accessible • Details of the workflow description are welcome • Exchanging a single component is better than restarting the whole application • Directives for the FT Manager could be included in the job description • Interfaces are unified

  15. Summary • Fault tolerance issues become more and more important in the Grid • A service for fault tolerance management has been proposed • ...which enables more sophisticated fault tolerance for the Grid • A workflow-based framework facilitates the task • But this is a proposal only... Comments and remarks are welcome!
