This thesis defense explores the development of a fault-tolerant grid workflow for managing water threats, focusing on minimizing sensor search times in the event of contamination. It addresses resource outages, queue wait times, and dynamic job dependencies in urban water distribution systems.
MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project • Young Suk Moon • Chair: Dr. Hans-Peter Bischof • Reader: Dr. Gregor von Laszewski • Observer: Dr. Minseok Kwon
Outline • Introduction to Water Threat Management Project • Motivation • Research Objectives • Fault-Tolerant Queue • Evaluation • Conclusion
Water Threat Management • Motivation • Urban Water Distribution Systems (WDSs) can be an easy target of terror attacks, e.g. contamination of the water. • Methods • Detect contamination using sensors located across the WDSs. • Run algorithms (developed by NCSA) to choose sensor locations that minimize the time needed to find the contaminant source locations (sensors are expensive).
Water Threat Management • Requirements • Time sensitive • Massive calculation • Dynamic adaptation to a Grid environment • Fault tolerance • Our goals • The current system is not fault-tolerant. • Develop a fault-tolerant framework and improve performance in a faulty environment.
Motivation – (1) Resource Outages • TeraGrid resource outages during 2009. TeraGrid User & System News (http://news.teragrid.org/)
Motivation – (1) Resource Outages • Outage Rate (total outage time / year) in 2009 TeraGrid User & System News (http://news.teragrid.org/)
Motivation – (1) Resource Outages • WTM deployment problem with outages TeraGrid User & System News (http://news.teragrid.org/)
Motivation – (2) Queue Wait Time • Queue wait time
Research Objectives • Develop a fault-tolerant framework dealing with resource outages • Strategy: generation distribution on multiple sites • Reduce queue wait time • Strategy: dynamic job dependency
Water Threat Management Application • Sequential & parallel processing
Generation Distribution • Divide generations into multiple parts as multiple jobs.
Generation Distribution • File communication
Dynamic Job Dependency • Problems of generation distribution on multiple sites • Additional queue wait times • Each job depends on the previous one. • A job cannot be submitted before the prior job finishes. • Solution: determine job dependency at run time. • Submit all jobs at the same time. • Whichever job starts first computes the first set of generations.
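The run-time dependency idea can be illustrated with a small timing model. This is only a sketch under stated assumptions, not the thesis implementation: the `Job` class, the site names, the queue wait values, and the equal-length chunks are all hypothetical, and each job's queue wait is assumed to be the same no matter when it is submitted.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    queue_wait: float  # minutes in the batch queue (hypothetical value)


def dynamic_schedule(jobs, chunk_runtime):
    """All jobs are submitted at time 0; chunks are claimed in start order.

    Returns a list of (job name, chunk index, start time, end time).
    """
    schedule = []
    prev_end = 0.0
    # The job that leaves its queue first takes chunk 0, the next takes chunk 1, ...
    for chunk, job in enumerate(sorted(jobs, key=lambda j: j.queue_wait)):
        # A chunk needs the previous chunk's output, so it cannot begin earlier.
        start = max(job.queue_wait, prev_end)
        end = start + chunk_runtime
        schedule.append((job.name, chunk, start, end))
        prev_end = end
    return schedule


def static_makespan(jobs, chunk_runtime):
    """Static dependency: each job is submitted only after the prior one finishes."""
    total = 0.0
    for job in jobs:
        total += job.queue_wait + chunk_runtime
    return total


jobs = [Job("abe", 30.0), Job("bigred", 10.0)]
plan = dynamic_schedule(jobs, chunk_runtime=20.0)
print(plan[-1][3], static_makespan(jobs, chunk_runtime=20.0))
```

In this toy case the queue waits overlap under the dynamic scheme, while the static scheme pays them one after another.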
Dynamic WTM Workflow Management • Example scenario
Fault-tolerant Queue • Most common fault-tolerant strategies in a Grid • Replication • Checkpointing • Limitation of checkpointing with time-criticality • Checkpointing performance degradation • Checkpointing may not be compatible on a different site (heterogeneity) • Cannot reschedule job on the same site in case of site outage • Choosing the replication strategy within the fault-tolerant queue
Fault-tolerant Queue Design • Architecture
Fault-tolerant Queue Design • Components • Command Line Interface • Task Pool • Resource Pool • Scheduler • Resource Checker (integration with the TeraGrid Information Services)
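How the listed components might fit together can be sketched as follows. This is a minimal illustration, not the design shown in the architecture slide: the class name, method names, the `replicas` parameter, and the resource name `site_c` are assumptions introduced here.

```python
from collections import deque


class FaultTolerantQueue:
    """Toy model of a task pool, a resource pool, and a replicating scheduler."""

    def __init__(self, resources):
        self.task_pool = deque()              # tasks waiting to be scheduled
        self.resource_pool = set(resources)   # resources believed to be up

    def submit(self, task):
        self.task_pool.append(task)

    def mark_down(self, resource):
        # In the real system this signal would come from the Resource Checker.
        self.resource_pool.discard(resource)

    def schedule(self, replicas=2):
        """Pair each pending task with up to `replicas` available resources."""
        plan = []
        while self.task_pool:
            task = self.task_pool.popleft()
            targets = sorted(self.resource_pool)[:replicas]
            if not targets:
                # Nothing is up: keep the task for a later scheduling pass.
                self.task_pool.appendleft(task)
                break
            plan.append((task, targets))
        return plan


queue = FaultTolerantQueue(["abe", "bigred", "site_c"])
queue.submit("wtm-1")
queue.mark_down("abe")
plan = queue.schedule()
print(plan)
```

Replicating each task on more than one resource means a single site outage does not force a resubmission.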
Fault-tolerant Queue Design • Fault detection • Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit • Communicate with GRAM to detect job failure • TeraGrid Information Services • GRAM service may fail when the resource is down • Publishes XML documents containing the outage information
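The two fault signals above can be combined as in the sketch below. The XML layout, the attribute names, and the state strings are hypothetical; the slides do not show the schema of the published outage documents, and no real GRAM call is made here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical outage document; the real schema is not given in the slides.
SAMPLE_XML = """
<outages>
  <outage resource="abe" start="2009-03-01T08:00:00" end="2009-03-01T20:00:00"/>
  <outage resource="bigred" start="2009-04-10T00:00:00" end="2009-04-10T06:00:00"/>
</outages>
"""


def resources_down(xml_text, now):
    """Return the set of resources whose outage window covers `now`."""
    down = set()
    root = ET.fromstring(xml_text.strip())
    for outage in root.iter("outage"):
        start = datetime.fromisoformat(outage.get("start"))
        end = datetime.fromisoformat(outage.get("end"))
        if start <= now < end:
            down.add(outage.get("resource"))
    return down


def job_failed(resource, job_state, down):
    """Decide whether a job should be treated as failed.

    `job_state` stands in for the job manager's answer; None means the
    job manager did not respond at all.
    """
    if job_state == "FAILED":
        return True
    if job_state is None:
        # The job manager may itself be unreachable during an outage,
        # so fall back to the published outage information.
        return resource in down
    return False


down = resources_down(SAMPLE_XML, datetime(2009, 3, 1, 12, 0))
print(down, job_failed("abe", None, down))
```

The fallback matters because a silent job manager alone does not distinguish a site outage from a transient network problem.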
Evaluation – WTM performance • WTM application performance (generation)
Evaluation – Queue Wait Time • Queue wait time statistics
Evaluation – Overhead • Performance overhead • Integrating a fault-tolerant framework usually causes performance degradation • No performance loss in our framework
Evaluation – Workflow Performance • Run time comparison of different workflow types • Original deployment vs. fault-tolerant deployment • Dynamic job dependency vs. static job dependency • Test each type of deployment on the real Grid system, including queue wait time
Evaluation – Workflow Performance • Setup points • What to measure • Job run time + queue wait time • 4 different types of deployment • Original on Abe • Original on Big Red • Static fault-tolerant workflow on Abe + Big Red • Dynamic fault-tolerant workflow on Abe + Big Red • 6 different jobs • 6 = 1 (original) + 1 (original) + 2 (static) + 2 (dynamic)
Evaluation – Workflow Performance • Setup points • “Submit” the 4 different deployments at the same time • 5 of the 6 jobs are submitted at the same time (the remaining job belongs to the static workflow and is submitted later). • Repeat this at different times • Queue wait times vary, so the results differ from run to run
Evaluation – Workflow Performance • Workflow comparison results
Simulation – Run Time Comparison • Average run time • Statistical model for the original WTM deployment (t: run time of a job, p: failure rate, q: avg. queue wait time) • Statistical model for the dynamic WTM deployment (k: number of jobs, q_i: avg. queue wait time of the i-th job, t_i: run time of the i-th job)
Simulation – Run Time Comparison • Results (queue wait time + job run time + “failure” time)
Simulation – Worst Case Run Time Comparison • A threat management system must deliver results under any circumstances. • Thus, the worst-case run time is a critical factor in the Water Threat Management system.
Simulation – Worst Case Run Time Comparison • Simulation setup • Use the 2009 TeraGrid outage data for this simulation • Submit jobs every 5 minutes during 2009 and compare the worst case run time between the original deployment and the dynamic workflow deployment
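A simulation of this kind can be sketched as below. It is an illustration under strong assumptions, not the thesis simulation: the outage windows are synthetic rather than the 2009 TeraGrid data, an interrupted job is assumed to restart from scratch once the outage ends, and the multi-site deployment is approximated as finishing when its fastest site finishes.

```python
def completion_time(submit, work, outages):
    """Finish time of a job needing `work` minutes (queue wait + run time).

    `outages` is a sorted list of non-overlapping (start, end) minutes for
    one site. A job hit by an outage restarts when the outage ends.
    """
    start = submit
    for o_start, o_end in outages:
        if o_end <= start:
            continue              # this outage is already over
        if o_start >= start + work:
            break                 # the job finishes before this outage begins
        start = o_end             # job lost or site down: restart afterwards
    return start + work


def worst_case(outages_by_site, sites, work, horizon, step=5):
    """Worst turnaround over submissions every `step` minutes up to `horizon`."""
    worst = 0
    for submit in range(0, horizon, step):
        finish = min(
            completion_time(submit, work, outages_by_site[site]) for site in sites
        )
        worst = max(worst, finish - submit)
    return worst


# Synthetic one-day example (minutes since midnight), not real outage data.
outages = {"abe": [(100, 400)], "bigred": [(1000, 1100)]}
single_site = worst_case(outages, ["abe"], work=60, horizon=1440)
two_sites = worst_case(outages, ["abe", "bigred"], work=60, horizon=1440)
print(single_site, two_sites)
```

With these made-up windows the single-site worst case is dominated by the long outage, while the two-site deployment is never delayed because the outages do not overlap.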
Conclusion • In general, the dynamic fault-tolerant workflow performs similarly to the original deployment. • However, in the worst-case scenario, the dynamic workflow performs much better than the original deployment.