
Runtime Fault-Handling for Job-Flow Management In Grid Environments



Presentation Transcript


Team: Gargi Dasgupta (3), Onyeka Ezenwoye (2), Liana Fong (1), Selim Kalayci (2), S. Masoud Sadjadi (2), Balaji Viswanathan (3)

(1) IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
(2) Florida International University, Miami, FL 33199
(3) IBM India Research Labs, New Delhi

1. Overview

2. Two-Layered Architecture

• Motivation
    • Provide flexible job-flow orchestration and submission for scientists
    • Integrate service-based flow orchestration with job scheduling
    • Provide fault-tolerance handling that is transparent to the job-flow
• Approach
    • Isolate job-flow orchestration from fault handling using a layered design
    • Top layer: job-flow (service) orchestration, using BPEL for job-flows
    • Second layer: job trapping and fault handling, using JSDL for jobs
    • Introduce a fault-handling proxy to implement job recovery policies
        • Use the Generic Proxy from the TRAP/BPEL framework
    • Fault-handling policies can be job-specific, depending on
        • Type of job
        • Type of failure
        • Level of fault tolerance specified by the user

3. Runtime Fault-Handling

4. Prototypical Architecture

• The Generic Proxy sits between the Job-Flow Manager and the Meta-Scheduler and transparently intercepts the calls between these two layers. It applies recovery policies when an invocation fails.
• Recovery Policies contain rules to detect failures and a sequence of recovery actions to follow once a failure is detected. These policies are defined as patterns composed of common, reusable recovery actions.
• Two sample fault-tolerant patterns:
    • Re-stage Data Pattern
    • Re-submit Job Pattern
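The proxy-and-policy idea above can be illustrated with a minimal Python sketch. It is not the TRAP/BPEL Generic Proxy and does not reflect its actual interface: every name here (`GenericProxy`, `JobFailure`, `POLICIES`, `restage_data`, `stub_scheduler`, the failure kinds `missing_input` and `node_crash`, and the `max_attempts` limit) is a hypothetical stand-in chosen for illustration. The sketch assumes a policy is a lookup from failure type to a list of reusable recovery actions that run before the proxy resubmits the job.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class JobFailure(Exception):
    """Hypothetical failure reported by the meta-scheduler."""

    def __init__(self, kind: str) -> None:
        super().__init__(kind)
        self.kind = kind


@dataclass
class Job:
    name: str
    data_staged: bool = False
    log: List[str] = field(default_factory=list)


def restage_data(job: Job) -> None:
    """Reusable recovery action: stage the input data again (simulated)."""
    job.data_staged = True
    job.log.append("re-stage data")


# Assumed policy shape: failure kind -> actions to run before resubmission.
POLICIES: Dict[str, List[Callable[[Job], None]]] = {
    "missing_input": [restage_data],  # Re-stage Data pattern
    "node_crash": [],                 # Re-submit Job pattern (resubmit only)
}


class GenericProxy:
    """Intercepts submissions and applies a recovery policy on failure."""

    def __init__(self, submit: Callable[[Job], str],
                 policies: Dict[str, List[Callable[[Job], None]]],
                 max_attempts: int = 3) -> None:
        self.submit = submit
        self.policies = policies
        self.max_attempts = max_attempts

    def invoke(self, job: Job) -> str:
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.submit(job)
            except JobFailure as failure:
                # Give up on unknown failures or when attempts run out.
                if (failure.kind not in self.policies
                        or attempt == self.max_attempts):
                    raise
                for action in self.policies[failure.kind]:
                    action(job)
                job.log.append("re-submit job")
        # Only reached if max_attempts is less than 1.
        raise RuntimeError("max_attempts must be at least 1")


def stub_scheduler(job: Job) -> str:
    """Stand-in for the meta-scheduler: fails until input data is staged."""
    if not job.data_staged:
        raise JobFailure("missing_input")
    return f"{job.name}: completed"


job = Job("blast")
proxy = GenericProxy(stub_scheduler, POLICIES)
print(proxy.invoke(job))  # blast: completed
print(job.log)            # ['re-stage data', 're-submit job']
```

The caller, standing in for the job-flow layer, only ever calls `invoke` and sees either a result or a final failure, which is one way to read "transparent to the job-flow". Keeping recovery actions as small functions lets several patterns share them.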
