MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Young Suk Moon. Chair: Dr. Hans-Peter Bischof. Reader: Dr. Gregor von Laszewski. Observer: Dr. Minseok Kwon.

Presentation Transcript


  1. MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project • Young Suk Moon • Chair: Dr. Hans-Peter Bischof • Reader: Dr. Gregor von Laszewski • Observer: Dr. Minseok Kwon

  2. Outline • Introduction to Water Threat Management Project • Motivation • Research Objectives • Fault-Tolerant Queue • Evaluation • Conclusion

  3. Water Threat Management • Motivation • Urban Water Distribution Systems (WDSs) are an easy target for terror attacks, e.g. contamination of the water. • Methods • Detect contamination using sensors located across the WDSs. • Run algorithms (developed by NCSA) to choose sensor locations that minimize the time needed to find the contaminant source locations (sensors are expensive).

  4. Water Threat Management • Requirements • Time sensitive • Massive calculation • Dynamic adaptation to a Grid environment • Fault tolerance • Our goals • The current system is not fault-tolerant. • Develop a fault-tolerant framework and improve performance in a faulty environment.

  5. Existing Water Threat Management System Architecture

  6. EPANET Simulation in the Simulation Engine

  7. Motivation – (1) Resource Outages • TeraGrid resource outages during 2009. Source: TeraGrid User & System News (http://news.teragrid.org/)

  8. Motivation – (1) Resource Outages • Outage rate (total outage time / year) in 2009. Source: TeraGrid User & System News (http://news.teragrid.org/)
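
The outage rate defined on this slide (total outage time divided by one year) can be computed as in the following minimal sketch. The function name `outage_rate` and the outage intervals are made up for illustration; they are not the 2009 TeraGrid figures, which are not preserved in this transcript.

```python
def outage_rate(outages_hours, hours_per_year=365 * 24):
    """Return total outage time divided by the length of the year.

    outages_hours: list of (start, end) outage intervals, in hours since
    the start of the year. The intervals are assumed not to overlap.
    """
    total_outage = sum(end - start for start, end in outages_hours)
    return total_outage / hours_per_year


# Hypothetical example: one 12-hour outage and one 36-hour outage.
rate = outage_rate([(100, 112), (3000, 3036)])
print(f"outage rate: {rate:.4%}")
```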

  9. Motivation – (1) Resource Outages • WTM deployment problem with outages. Source: TeraGrid User & System News (http://news.teragrid.org/)

  10. Motivation – (2) Queue Wait Time • Queue wait time

  11. Research Objectives • Develop a fault-tolerant framework dealing with resource outages • Strategy: generation distribution on multiple sites • Reduce queue wait time • Strategy: dynamic job dependency

  12. Water Threat Management Application • Sequential & parallel processing

  13. Generation Distribution • Divide generations into multiple parts as multiple jobs.
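
As a rough illustration of dividing generations into parts, the sketch below splits a run into contiguous ranges, one per job. The function name `split_generations` and the even-split policy are assumptions; the slides do not say how the parts are sized.

```python
def split_generations(total_generations, num_jobs):
    """Split generations 0..total_generations-1 into num_jobs contiguous ranges.

    Any remainder is spread over the first jobs, so part sizes differ by at most one.
    """
    base, extra = divmod(total_generations, num_jobs)
    parts = []
    start = 0
    for job in range(num_jobs):
        size = base + (1 if job < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


# Example: 10 generations divided between 3 jobs.
for job_index, part in enumerate(split_generations(10, 3)):
    print(job_index, list(part))
```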

  14. Generation Distribution • File communication

  15. Dynamic Job Dependency • Problems of generation distribution on multiple sites • Additional queue wait times • Each job depends on the previous one. • A job cannot be submitted before the prior job finishes. • Solution: determine job dependency at run time. • Submit all jobs at the same time. • Whichever job starts first computes the first set of generations.
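
The run-time dependency idea can be sketched as follows: every job is submitted up front, and each job claims the next unclaimed set of generations only when it actually leaves the queue. The `StageBoard` class, the `simulate` helper, and the queue wait times are hypothetical; the slides do not describe the real coordination mechanism beyond file communication between jobs.

```python
import threading


class StageBoard:
    """Hands out generation stages in the order in which jobs start running."""

    def __init__(self, num_stages):
        self._next_stage = 0
        self._num_stages = num_stages
        self._lock = threading.Lock()

    def claim(self):
        """Return the next unclaimed stage index, or None if all are taken."""
        with self._lock:
            if self._next_stage >= self._num_stages:
                return None
            stage = self._next_stage
            self._next_stage += 1
            return stage


def simulate(queue_waits):
    """Map each job to the stage it would claim, given its queue wait time."""
    board = StageBoard(len(queue_waits))
    assignment = {}
    # Jobs leave the queue in order of increasing wait time.
    for job in sorted(queue_waits, key=queue_waits.get):
        assignment[job] = board.claim()
    return assignment


# Hypothetical wait times in minutes: the Big Red job starts first,
# so it computes the first set of generations (stage 0).
print(simulate({"abe_job": 42.0, "bigred_job": 7.5}))
```

In this sketch the stage order is decided by start time rather than by submission order, which is why no job has to wait in a queue for a predecessor to finish before being submitted.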

  16. Dynamic WTM Workflow Management • Example scenario

  17. Fault-tolerant Queue • Most common fault-tolerant strategies in a Grid • Replication • Checkpointing • Limitations of checkpointing for time-critical applications • Checkpointing degrades performance • A checkpoint may not be compatible with a different site (heterogeneity) • A job cannot be rescheduled on the same site during a site outage • Therefore, the fault-tolerant queue uses the replication strategy

  18. Fault-tolerant Queue Design • Architecture

  19. Fault-tolerant Queue Design • Components • Command Line Interface • Task Pool • Resource Pool • Scheduler • Resource Checker (integration with the TeraGrid Information Services)

  20. Fault-tolerant Queue Design • Fault detection • Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit • Communicate with GRAM to detect job failure • TeraGrid Information Services • GRAM service may fail when the resource is down • Publishes XML documents containing the outage information
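
A resource checker of the kind listed above could consume published outage documents roughly as sketched here. The XML layout (`outages`/`outage` elements with `resource`, `start`, and `end` attributes), the sample dates, and the function name `resources_down` are all assumptions made for illustration; this is not the actual TeraGrid Information Services schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical outage document; the real published format is not shown in the slides.
SAMPLE_XML = """<outages>
  <outage resource="abe" start="2009-03-01T08:00:00" end="2009-03-01T20:00:00"/>
  <outage resource="bigred" start="2009-03-05T00:00:00" end="2009-03-05T06:00:00"/>
</outages>"""


def resources_down(xml_text, now):
    """Return the set of resource names with an outage covering the time `now`."""
    root = ET.fromstring(xml_text)
    down = set()
    for outage in root.iter("outage"):
        start = datetime.fromisoformat(outage.get("start"))
        end = datetime.fromisoformat(outage.get("end"))
        if start <= now < end:
            down.add(outage.get("resource"))
    return down


# A scheduler could skip these resources when placing or replicating a task.
print(resources_down(SAMPLE_XML, datetime(2009, 3, 1, 12, 0)))
```

Checking the published outage information in addition to GRAM messages matters because, as the slide notes, the GRAM service itself may be unreachable when the resource is down.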

  21. Evaluation – WTM performance • WTM application performance (generation)

  22. Evaluation – Queue Wait Time • Queue wait time statistics

  23. Evaluation - Overhead • Performance overhead • Integrating a fault-tolerant framework usually causes performance degradation • No performance loss in our framework

  24. Evaluation – Workflow Performance • Run time comparison of different workflow types • Original deployment vs. fault-tolerant deployment • Dynamic job dependency vs. static job dependency • Test each type of deployment on the real Grid system, including queue wait time

  25. Evaluation – Workflow Performance • Setup points • What to measure • Job run time + queue wait time • 4 different types of deployment • Original on Abe • Original on Big Red • Static fault-tolerant workflow on Abe + Big Red • Dynamic fault-tolerant workflow on Abe + Big Red • 6 different jobs • 6 = 1 (original) + 1 (original) + 2 (static) + 2 (dynamic)

  26. Evaluation – Workflow Performance • Setup points • “Submit” the 4 different deployments at the same time • 5 of the 6 jobs are submitted at the same time (the remaining job belongs to the static workflow and must wait for its prior job to finish). • Repeat this at different times • Queue wait times vary, so the results differ between runs

  27. Evaluation – Workflow Performance • Workflow comparison results

  28. Simulation – Run Time Comparison • Average run time • Statistical model for the original WTM deployment (t: run time of a job, p: failure rate, q: average queue wait time) • Statistical model for the dynamic WTM deployment (k: number of jobs, q_i: average queue wait time of the i-th job, t_i: run time of the i-th job)

  29. Simulation – Run Time Comparison • Results (queue wait time + job run time + “failure” time)

  30. Simulation – Worst Case Run Time Comparison • A threat management system must deliver results under any circumstances. • Thus, the worst-case run time is a critical factor in the Water Threat Management system.

  31. Simulation – Worst Case Run Time Comparison • Simulation setup • Use the 2009 TeraGrid outage data for this simulation • Submit jobs every 5 minutes during 2009 and compare the worst case run time between the original deployment and the dynamic workflow deployment
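
The setup above can be approximated with a small simulation such as the one below. It is a simplified sketch, not the simulation used in the thesis: the outage intervals are invented, queue wait time is ignored, a job interrupted by an outage is assumed to restart from scratch once the outage ends, and the multi-site case is modelled as finishing on whichever site completes first. The names `finish_time` and `worst_case` are hypothetical.

```python
def finish_time(submit, run_time, outages):
    """Completion time of a job on one site.

    outages: sorted, non-overlapping (start, end) intervals in minutes.
    A job hit by an outage restarts from scratch when the outage ends.
    """
    start = submit
    for outage_start, outage_end in outages:
        if outage_end <= start:
            continue  # outage is already over when the job starts
        if outage_start < start + run_time:
            start = outage_end  # interrupted (or site already down): restart after it
        else:
            break  # outage begins after the job has finished
    return start + run_time


def worst_case(sites, run_time, horizon, step=5):
    """Largest turnaround over all submissions made every `step` minutes."""
    worst = 0
    for submit in range(0, horizon, step):
        finish = min(finish_time(submit, run_time, outages) for outages in sites)
        worst = max(worst, finish - submit)
    return worst


# Invented outage data (minutes) for two sites and a 60-minute job.
abe = [(30, 100)]
big_red = [(200, 230)]
print("single site (Abe):", worst_case([abe], 60, 300))
print("single site (Big Red):", worst_case([big_red], 60, 300))
print("both sites:", worst_case([abe, big_red], 60, 300))
```

With the real 2009 outage data in place of the invented intervals, the same loop would cover a full year of submissions at 5-minute spacing.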

  32. Simulation – Worst Case Run Time Comparison

  33. Simulation – Worst Case Run Time Comparison

  34. Conclusion • In general, the dynamic fault-tolerant workflow performs similarly to the original deployment. • However, in the worst-case scenario the dynamic workflow performs much better than the original deployment.
