MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project • Young Suk Moon • Chair: Dr. Hans-Peter Bischof • Reader: Dr. Gregor von Laszewski • Observer: Dr. Minseok Kwon

Outline • Introduction to Water Threat Management Project • Motivation • Research Objectives • Fault-Tolerant Queue • Evaluation • Conclusion

Water Threat Management • Motivation • Urban Water Distribution Systems (WDSs) can be an easy target for terror attacks, e.g. contamination of the water. • Methods • Detect contamination using sensors located across the WDSs. • Run algorithms (developed by NCSA) that determine sensor locations so as to minimize the time needed to locate the contaminant sources (sensors are expensive).

Water Threat Management • Requirements • Time sensitive • Massive calculation • Dynamic adaptation to a Grid environment • Fault tolerance • Our goals • The current system is not fault-tolerant. • Develop a fault-tolerant framework and increase performance in the faulty environment.

Existing Water Threat Management System Architecture

EPANET Simulation in the Simulation Engine

Motivation – (1) Resource Outages • TeraGrid resource outages during 2009 • Source: TeraGrid User & System News (http://news.teragrid.org/)

Motivation – (1) Resource Outages • Outage rate (total outage time / year) in 2009 • Source: TeraGrid User & System News (http://news.teragrid.org/)

Motivation – (1) Resource Outages • WTM deployment problem with outages • Source: TeraGrid User & System News (http://news.teragrid.org/)

Motivation – (2) Queue Wait Time • Queue wait time

Research Objectives • Develop a fault-tolerant framework dealing with resource outages • Strategy: generation distribution on multiple sites • Reduce queue wait time • Strategy: dynamic job dependency

Water Threat Management Application • Sequential & parallel processing

Generation Distribution • Divide generations into multiple parts as multiple jobs.
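
The division of generations into jobs can be illustrated with a minimal sketch. It assumes that generations are numbered from 0 and that each job receives one contiguous block; the function name `split_generations` and the even-split policy are illustrative assumptions, not taken from the WTM code.

```python
def split_generations(total, parts):
    """Split generation indices 0..total-1 into `parts` contiguous ranges.

    Hypothetical helper: the even-split policy is an assumption made
    for illustration only.
    """
    if parts <= 0 or total < 0:
        raise ValueError("parts must be positive and total non-negative")
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` parts get one additional generation each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# Example: 10 generations spread over 3 jobs.
parts = split_generations(10, 3)
```

Each range would then be handed to one job, with the output of one block passed on as the input of the next.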

Generation Distribution • File communication

Dynamic Job Dependency • Problems of generation distribution on multiple sites • Additional queue wait times • Each job depends on another. • A job cannot be submitted before the prior job finishes. • Solution: determine job dependency at run time. • Submit all jobs at the same time. • Whichever job starts first computes the first set of generations.
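
The run-time assignment described above can be sketched as follows. This is only an illustration under stated assumptions: the class `GenerationBoard` and its `claim` method are hypothetical names, and an in-process lock stands in for whatever cross-site coordination the real workflow uses.

```python
import threading


class GenerationBoard:
    """Hands out generation blocks in the order in which jobs start.

    Hypothetical coordinator: in a real Grid deployment the shared
    state would live outside a single process.
    """

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._next = 0
        self._lock = threading.Lock()

    def claim(self, job_name):
        """Return (job_name, chunk) for the next unclaimed block, or None."""
        with self._lock:
            if self._next >= len(self._chunks):
                return None
            chunk = self._chunks[self._next]
            self._next += 1
            return (job_name, chunk)


# Both jobs are submitted together; the one that leaves its queue first
# claims the first block, whichever site it runs on.
board = GenerationBoard([range(0, 50), range(50, 100)])
first = board.claim("big_red")
second = board.claim("abe")
```

Because the block is chosen when a job starts rather than when it is submitted, no job has to wait in a queue for a predecessor to finish before being submitted.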

Dynamic WTM Workflow Management • Example scenario

Fault-tolerant Queue • Most common fault-tolerant strategies in a Grid • Replication • Checkpointing • Limitations of checkpointing under time-criticality • Checkpointing degrades performance • A checkpoint may not be compatible with a different site (heterogeneity) • A job cannot be rescheduled on the same site in case of a site outage • The fault-tolerant queue therefore uses the replication strategy

Fault-tolerant Queue Design • Architecture

Fault-tolerant Queue Design • Components • Command Line Interface • Task Pool • Resource Pool • Scheduler • Resource Checker (integration with the TeraGrid Information Services)

Fault-tolerant Queue Design • Fault detection • Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit • Communicate with GRAM to detect job failure • TeraGrid Information Services • GRAM service may fail when the resource is down • Publishes XML documents containing the outage information
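
The two fault signals above can be combined roughly as in the sketch below. Everything specific in it is an assumption: the XML layout in `SAMPLE_XML` is invented for illustration and is not the actual TeraGrid Information Services schema, and the job state is passed in as a plain string instead of being obtained through a real GRAM call.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# HYPOTHETICAL outage document; the real schema is not shown in the slides.
SAMPLE_XML = """
<outages>
  <outage>
    <resource>abe</resource>
    <start>2009-03-01T08:00:00</start>
    <end>2009-03-01T20:00:00</end>
  </outage>
</outages>
"""


def sites_in_outage(xml_text, now):
    """Return the set of resources whose outage window contains `now`."""
    root = ET.fromstring(xml_text.strip())
    down = set()
    for outage in root.iter("outage"):
        start = datetime.fromisoformat(outage.findtext("start"))
        end = datetime.fromisoformat(outage.findtext("end"))
        if start <= now <= end:
            down.add(outage.findtext("resource"))
    return down


def job_has_failed(job_state, site, down_sites):
    """Treat a job as failed if its reported state says so or its site is down.

    `job_state` is a placeholder string (assumed value "FAILED"); the
    outage check covers the case where the job manager itself is
    unreachable because the resource is down.
    """
    return job_state == "FAILED" or site in down_sites


down = sites_in_outage(SAMPLE_XML, datetime(2009, 3, 1, 12, 0))
```

The second check matters because a job on a site that is down may never report a failure through the job manager at all.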

Evaluation – WTM performance • WTM application performance (generation)

Evaluation – Queue Wait Time • Queue wait time statistics

Evaluation - Overhead • Performance overhead • Integrating a fault-tolerant framework usually causes performance degradation • No performance loss in our framework

Evaluation – Workflow Performance • Run time comparison of different workflow types • Original deployment vs. fault-tolerant deployment • Dynamic job dependency vs. static job dependency • Test each type of deployment on the real Grid system, including queue wait time

Evaluation – Workflow Performance • Setup points • What to measure • Job run time + queue wait time • 4 different types of deployment • Original on Abe • Original on Big Red • Static fault-tolerant workflow on Abe + Big Red • Dynamic fault-tolerant workflow on Abe + Big Red • 6 different jobs • 6 = 1 (original) + 1 (original) + 2 (static) + 2 (dynamic)

Evaluation – Workflow Performance • Setup points • “Submit” the 4 different deployments at the same time • 5 of the 6 jobs are submitted at the same time (the remaining job belongs to the static workflow). • Repeat this at different times • Results differ from run to run because queue wait times vary

Evaluation – Workflow Performance • Workflow comparison results

Simulation – Run Time Comparison • Average run time • Statistical model for the original WTM deployment (t: run time of a job, p: failure rate, q: average queue wait time) • Statistical model for the dynamic WTM deployment (k: number of jobs, qi: average queue wait time of the ith job, ti: run time of the ith job)

Simulation – Run Time Comparison • Results (queue wait time + job run time + “failure” time)

Simulation – Worst Case Run Time Comparison • A threat management system must deliver results under any circumstances. • Thus, the worst-case run time is a critical factor in the Water Threat Management system.

Simulation – Worst Case Run Time Comparison • Simulation setup • Use the 2009 TeraGrid outage data for this simulation • Submit jobs every 5 minutes during 2009 and compare the worst case run time between the original deployment and the dynamic workflow deployment
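
A simulation of this kind can be sketched as below. This is a simplified model under stated assumptions, not the simulation used in the thesis: times are in minutes, queue wait time is ignored, a job hit by an outage restarts from scratch once the outage ends, and the fault-tolerant case is modeled as running a replica of the whole job on each site. The outage windows in the example are made-up values, not 2009 TeraGrid data.

```python
def finish_time(submit, run_time, outages):
    """Finish time of a job on one site.

    `outages` is a sorted list of non-overlapping (start, end) windows.
    """
    start = submit
    for o_start, o_end in outages:
        if o_end <= start:
            continue  # outage ended before the job (re)started
        if o_start < start + run_time:
            start = o_end  # run interrupted or site down: restart afterwards
    return start + run_time


def worst_case_single(run_time, outages, horizon, step=5):
    """Worst turnaround when submitting every `step` minutes on one site."""
    return max(finish_time(s, run_time, outages) - s
               for s in range(0, horizon, step))


def worst_case_replicated(run_time, outages_by_site, horizon, step=5):
    """Worst turnaround when the earliest-finishing replica wins."""
    return max(min(finish_time(s, run_time, o) for o in outages_by_site) - s
               for s in range(0, horizon, step))


# Made-up outage windows for two sites (illustrative only).
site_a = [(100, 200)]
site_b = [(400, 450)]
single = worst_case_single(60, site_a, 300)
replicated = worst_case_replicated(60, [site_a, site_b], 300)
```

In this toy setting the single-site worst case is dominated by a job submitted shortly before an outage, while the replicated run is unaffected as long as the outages on the two sites do not overlap.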

Simulation – Worst Case Run Time Comparison

Conclusion • In general, the dynamic fault-tolerant workflow performs similarly to the original deployment. • In the worst-case scenario, however, the dynamic workflow performs much better than the original deployment.
