Enhancing Fault-Tolerance in Grid Computing through Adaptive Checkpointing and Replication Techniques

Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids • Maria Chtepen, Filip H.A. Claeys, Bart Dhoedt (Member, IEEE), Filip De Turck (Member, IEEE), Piet Demeester (Senior Member, IEEE), and Peter A. Vanrolleghem

Table of Contents • Introduction • Adaptive Checkpointing Heuristics • Replication-Based Heuristics • Conclusion and Future Work

Introduction • A novel fault-tolerant algorithm that combines • Checkpointing • Replication • Evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE)

Introduction (cont.) • Simulation • Runs employ workload and system parameters • Derived from the logs of several large-scale parallel production systems • Executed with the discrete event grid simulator DSiDE

Introduction (cont.) • Throughput and fault tolerance comparable to • Static checkpointing with optimal parameters • Replication with optimal parameters

Adaptive Checkpointing Heuristics • The Checkpointing Model • Limitations • Runtime overhead (C) • Network latency (L) • Recovery delay (R) • This work concentrates on reducing the checkpointing runtime overhead
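
To make the runtime overhead (C) concrete, here is a minimal sketch of what periodic checkpointing costs in a failure-free run. It assumes a fixed checkpointing interval and a fixed cost C per checkpoint, and that no checkpoint is taken at the instant the job finishes. The function name and this simple cost model are illustrative assumptions, not taken from the slides.

```python
import math


def periodic_checkpoint_overhead(exec_time: float, interval: float, cost: float) -> tuple[int, float]:
    """Return (number of checkpoints, total runtime) for a failure-free run.

    Assumed model: one checkpoint after every full interval of useful work,
    each adding `cost` (the runtime overhead C) to the job's runtime.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    # No checkpoint is needed at the very end of the job.
    checkpoints = max(math.ceil(exec_time / interval) - 1, 0)
    return checkpoints, exec_time + checkpoints * cost


if __name__ == "__main__":
    # A 100-minute job, checkpointed every 10 minutes at 2 minutes per checkpoint.
    print(periodic_checkpoint_overhead(100, 10, 2))
```

Under this model every omitted checkpoint saves exactly C, which is why the adaptive heuristics focus on skipping checkpoints that are unlikely to be needed.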

Adaptive Checkpointing Heuristics (cont.) • Problem: the heuristics assume that the job execution time can be determined exactly in advance • Simulation: the results therefore show upper bounds on the algorithms' performance with respect to this parameter

Adaptive Checkpointing Heuristics (cont.) • Last Failure Dependent Checkpointing (LastFailureCP) • Goal • To reduce the overhead

Adaptive Checkpointing Heuristics (cont.) • Mean Failure Dependent Checkpointing (MeanFailureCP) • Only considers checkpoint omissions • Modify the checkpointing interval based on the runtime information • The remaining job execution time • The average failure interval of the resource
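
The interval adjustment described above can be sketched as follows. This is a minimal illustration, assuming the interval grows by a fixed step when the job's remaining execution time is shorter than the resource's average failure interval (the job is likely to finish before the next failure) and shrinks otherwise. The step size, the lower bound, and all names are hypothetical and not taken from the slides.

```python
def adapt_interval(current: float, remaining_time: float,
                   mean_failure_interval: float, step: float,
                   minimum: float) -> float:
    """Return the next checkpointing interval for a job on a resource.

    Assumed rule (illustrative only):
    - remaining time < mean failure interval: checkpoint less often;
    - otherwise: checkpoint more often, but never below `minimum`.
    """
    if remaining_time < mean_failure_interval:
        return current + step
    return max(current - step, minimum)


if __name__ == "__main__":
    # Job with 20 minutes left on a resource that fails every 60 minutes on average.
    print(adapt_interval(30, 20, 60, step=10, minimum=5))
    # Job with 120 minutes left on the same resource.
    print(adapt_interval(30, 120, 60, step=10, minimum=5))
```

Both inputs are runtime information, so the interval can be re-evaluated each time a checkpoint is due.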

Adaptive Checkpointing Heuristics (cont.) • DSiDE Simulation Environment • Goal: validate the heuristics • Architecture • DExec • DGen • Each DSiDE event has a time stamp • Events are provided a priori or at runtime • Supports several types of dynamic system modifications

Adaptive Checkpointing Heuristics (cont.) • The DSiDE simulator architecture

Adaptive Checkpointing Heuristics (cont.) • The resource performed useful computations • Total grid availability • DSiDE provides a set of events to specify network links and routes

Adaptive Checkpointing Heuristics (cont.) • Simulation Result • To compare the performance • Checkpointing heuristics • Realistic workload • System failure model

Adaptive Checkpointing Heuristics (cont.) • Job submission time • 80% (7 a.m. to 9 p.m.) • 20% (9 p.m. to 7 a.m.)
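
As an illustration of this submission pattern, the sketch below samples job submission times so that about 80% fall between 7 a.m. and 9 p.m. and the rest fall in the night window. It assumes submissions are uniformly distributed within each window, which the slides do not state. The function name is hypothetical and this is not DSiDE's workload generator.

```python
import random


def sample_submit_hour(rng: random.Random) -> float:
    """Return a submission time as an hour of day in [0, 24)."""
    if rng.random() < 0.8:
        # Daytime window: 7 a.m. to 9 p.m.
        return rng.uniform(7.0, 21.0)
    # Night window: 9 p.m. to 7 a.m., a 10-hour span that wraps past midnight.
    return (21.0 + rng.uniform(0.0, 10.0)) % 24.0


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    hours = [sample_submit_hour(rng) for _ in range(20000)]
    daytime = sum(7.0 <= h < 21.0 for h in hours) / len(hours)
    print(f"daytime share: {daytime:.3f}")
```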

Adaptive Checkpointing Heuristics (cont.) • Execution time • More than 80 percent of all submitted jobs have medium execution times • 1 hour to 6 hours

Adaptive Checkpointing Heuristics (cont.) • As the checkpointing interval I decreases, longer jobs can be processed • At the same time, job runtime increases • Results • The results achieved with PeriodicCP are partially improved by LastFailureCP due to the omission of redundant checkpoints • The technique provides the best results for short checkpointing intervals • The effectiveness of LastFailureCP strongly depends on failure periodicity

Adaptive Checkpointing Heuristics (cont.) • Failures occur quite periodically • Can easily be predicted by the algorithm • LastFailureCP will perform similarly to PeriodicCP • The fully dynamic scheme of MeanFailureCP proves to be the most effective • A selective increase in checkpointing keeps the number of processed jobs and the average execution time of MeanFailureCP more or less constant • For the PeriodicCP and LastFailureCP algorithms, performance drops considerably

Replication-based Heuristics • Load-Dependent Replication (LoadDependentRep) • Providing fault tolerance in distributed environments through replication • Idle resources can be utilized to run job copies without significantly delaying the execution of the original job

Replication-based Heuristics (cont.) • The algorithm requires a number of parameters to be provided in advance • Minimum number of job copies (Repmin) • Maximum number of job copies (Repmax) • The CPU limit (CL)

Replication-based Heuristics (cont.) • The outcome of the comparison determines the choice of the next job to be scheduled • CA >= CL: jobs with fewer than Repmax copies • 0 < CA < CL: jobs with fewer than Repmin copies • CA = 0: skip the current scheduling round • When one of the job duplicates finishes, the other replicas are automatically canceled
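
The comparison above can be sketched as a small selection routine. This is a minimal illustration, assuming CA is the current CPU availability and CL the CPU limit, expressed in the same unit. The function names and the dictionary of per-job replica counts are hypothetical, and the sketch only lists candidate jobs, since the slides do not say how one job is picked among them.

```python
from typing import Optional


def replica_limit(ca: float, cl: float, rep_min: int, rep_max: int) -> Optional[int]:
    """Return the replica cap for this scheduling round, or None to skip it."""
    if ca <= 0:
        return None  # CA = 0: skip the current scheduling round
    # CA >= CL: replicate up to Repmax; 0 < CA < CL: only up to Repmin.
    return rep_max if ca >= cl else rep_min


def eligible_jobs(replicas: dict[str, int], ca: float, cl: float,
                  rep_min: int, rep_max: int) -> list[str]:
    """List jobs whose current number of copies is below this round's cap."""
    limit = replica_limit(ca, cl, rep_min, rep_max)
    if limit is None:
        return []
    return [job for job, copies in replicas.items() if copies < limit]


if __name__ == "__main__":
    counts = {"a": 0, "b": 1, "c": 3}  # current copies per job (made-up data)
    print(eligible_jobs(counts, ca=50, cl=40, rep_min=1, rep_max=3))
    print(eligible_jobs(counts, ca=10, cl=40, rep_min=1, rep_max=3))
    print(eligible_jobs(counts, ca=0, cl=40, rep_min=1, rep_max=3))
```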

Replication-based Heuristics (cont.) • Failure Detection and Load Dependent Replication (FailureDependentRep) • Increases the fault tolerance of the previously discussed LoadDependentRep heuristic • Offers a higher level of fault tolerance compared to solely replication-based strategies • Does not ensure job execution

Replication-based Heuristics (cont.) • Adaptive Checkpoint and Replication-Based Fault Tolerance (CombinedFT) • Dynamically switches between both techniques based on runtime information on system load • Checkpointing mode • Replication mode

Replication-based Heuristics (cont.) • Checkpointing mode • CPU availability is low (CA < CL) • CombinedFT rolls back • The earlier distributed active job replicas (ARj) • Starts job checkpointing • ARj > 0 • ARj = 0 & CA > 0 • ARj = 0 & CA = 0 & ∃i: ARi > 1 • ARj = 0 & CA = 0 & ¬∃i: ARi > 1

Replication-based Heuristics (cont.) • Replication mode • Either the system load decreases • Or enough resources recover from failure (CA ≥ CL) • All jobs with fewer than Repmax replicas are considered for submission to the available resources • Assigned to the fastest resource connected to a grid site S with the maximum SpeedS • The smallest number of identical replicas
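
The switch between the two CombinedFT modes can be sketched as a single threshold test. This is a minimal illustration, assuming CA (CPU availability) and CL (the CPU limit) use the same unit. The enum and function names are hypothetical, and the sketch covers only mode selection, not the rollback of replicas or the assignment of jobs to resources.

```python
from enum import Enum


class Mode(Enum):
    CHECKPOINTING = "checkpointing"
    REPLICATION = "replication"


def select_mode(ca: float, cl: float) -> Mode:
    """Pick the fault-tolerance mode from the current CPU availability.

    CA >= CL: enough resources are available, so replicate.
    CA <  CL: CPU availability is low, so fall back to checkpointing.
    """
    return Mode.REPLICATION if ca >= cl else Mode.CHECKPOINTING


if __name__ == "__main__":
    for ca in (10, 40, 75):
        print(ca, select_mode(ca, cl=40).value)
```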

Replication-based Heuristics (cont.) • Simulation Results • Approaches • Unconditional RL(1) • Unconditional RL(2) • Unconditional RL(3) • LoadDependentRep(1, 3, 40) • FailureDependentRep(1, 3, 40) • MeanFailureCP • CombinedFT


Conclusion and Future Work • Fault tolerance forms an important problem • Job checkpointing • Replication • Evaluated in the DSiDE grid simulator • The runtime overhead characteristic of periodic checkpointing can be reduced

Conclusion and Future Work (cont.) • Advantage • When the distributed system properties are not known in advance, both techniques can best be applied • Future Work • Scheduling methods will be considered

Thank you for your attention
