Analysis of checkpointing schemes for multiprocessor systems

Avi Ziv

Jehoshua Bruck

Presentation By: Emre Chasan Moustafa


Outline

  • Introduction

    • Checkpointing

    • Execution Of A Task

    • Performance Analysis

    • Analysis Technique

  • Analysis Technique

    • Building The State Machine

    • Creating the Markov Chain

    • Analyzing the Scheme Using the MRM

  • Scheme Comparison

    • Average Execution Time

    • Average Work

  • Conclusion


Checkpointing

  • A technique used in distributed shared-memory systems to add fault tolerance.

  • Reduces the time spent in retrying a task in case of a failure

    • Hence reduces the average execution time of a task

  • Important in many applications

    • Real-time systems with hard deadlines,

    • Transaction systems, where high availability is required.

Checkpointing (2)

  • Basically serves two purposes:

    • Detecting faults that occurred during the execution of a task,

    • Reducing the time spent in recovering from faults.

  • Achieved by

    • Duplicating the task on two or more processors

    • Comparing the states of the processors at the checkpoints.

Execution Of A Task

  • Execute one interval of the task on all the processors that are assigned to it.

  • Perform the operations necessary to achieve fault detection and recovery.

    • Store the states of the processors in the stable storage

    • Compare those states.

      • If no fault occurred

        • The execution of the task is resumed with the next interval in the next step.

      • Otherwise

        • Checkpoint processor performs operations to recover from the fault.
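
The execute, store, compare loop above can be sketched as a small simulation. This is only an illustration under assumptions that are not in the slides: two processors, independent faults with a fixed probability per processor per interval, a faulty state that never matches any other state, and plain rollback as the recovery action. The names `run_task` and `fault_prob` are hypothetical.

```python
import random

def run_task(n_intervals, fault_prob, rng):
    """Return the number of steps needed to finish n_intervals intervals.

    Assumed model: two processors, independent faults, plain rollback.
    """
    steps = 0
    done = 0
    while done < n_intervals:
        steps += 1
        # Both processors execute the same interval; True means a fault.
        fault_a = rng.random() < fault_prob
        fault_b = rng.random() < fault_prob
        # Checkpoint: the stored states match only if neither faulted.
        states_match = not fault_a and not fault_b
        if states_match:
            done += 1  # interval verified, continue with the next one
        # Otherwise the same interval is executed again (rollback).
    return steps

rng = random.Random(0)
print(run_task(20, 0.0, rng))   # no faults: exactly 20 steps
print(run_task(20, 0.1, rng))   # with faults: 20 steps or more
```

Averaging `run_task` over many runs gives a simulation estimate of the number of steps, which is the kind of slow evaluation the analytical technique is meant to replace.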

Performance Analysis

  • Important when

    • Trying to evaluate and compare different schemes

    • Checking whether a scheme achieves its goals in a certain system.

  • Making simulations for performance evaluation

    • Leads to long and time-consuming evaluation

  • Using a simplified fault model

    • Provides only approximate results

  • This paper describes an analysis technique for studying the performance of checkpointing schemes for fault tolerance

    • Provides a way to compare various schemes and select optimal values for some parameters of the scheme, like the number of checkpoints.

Analysis Technique

  • Based on the analysis of a discrete-time Markov Reward Model (MRM)

  • Done in three steps

    • The analyzed scheme is modelled as a state-machine.

    • The edges of the state-machine are assigned transition probabilities according to the events that cause the transition and the fault model used.

    • The Markov chain, created by the first two steps, is analyzed, and values for the properties of interest are derived.

Analysis Technique (2)

  • An example using the DMR-B-1 scheme

    • The task is executed by two processors in parallel

Building The State Machine

  • Describes the behaviour of the scheme in the eyes of an external viewer, who can observe the faults that occurred during a step.

  • Each transition in the state-machine represents one step.

  • Each transition has associated with it a set of properties called rewards.

  • For the execution time of the schemes, we use two rewards

    • v_i - The amount of useful work that is done during the transition.

    • t_i - The time it takes to complete the step that corresponds to the transition.
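
One possible way to represent transitions and their two rewards in code is shown below. The class name `Transition`, the state label, and the numeric values (interval length 0.05, per-step overhead 0.004) are hypothetical and chosen only for illustration; they do not describe the DMR-B-1 state-machine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    source: str
    target: str
    work: float  # v_i: useful work done during the step
    time: float  # t_i: time needed to complete the step

def totals(path):
    """Sum both rewards over a sequence of transitions."""
    total_work = sum(t.work for t in path)
    total_time = sum(t.time for t in path)
    return total_work, total_time

# Hypothetical edges: a successful step and a failed step that is repeated.
ok = Transition("normal", "normal", work=0.05, time=0.054)
failed = Transition("normal", "normal", work=0.0, time=0.054)

work, time = totals([ok, failed, ok])
print(work, time)
```

A failed step still costs time but contributes no useful work, which is why the two rewards are tracked separately.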

Building The State Machine (2)

  • In the DMR-B-1 scheme, the operation has two basic modes.

    • The first mode is the normal operation mode, where two processors are executing the task in parallel.

    • The second mode is the fault recovery mode, where a single processor tries to find a match to an unverified checkpoint.

The execution in the previous figure causes the following transitions (the numbers above the arrows are the edges that are used for the transitions)

Creating the Markov Chain

  • Involves assigning probabilities to each of the transitions in the state-machine constructed in the first step.

  • The probabilities assigned to the edges are determined by the fault model.

  • F is the probability that a processor will have a fault while executing an interval.

  • Transition description for the DMR-B-1 extended state-machine:
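
As a minimal sketch of how a fault model turns events into edge probabilities, the snippet below computes the probabilities of the possible fault outcomes of one step with two processors. It assumes that faults in the two processors are independent, which is an assumption of this example; the function name `step_probabilities` is hypothetical, and the actual DMR-B-1 transition table is not reproduced here.

```python
def step_probabilities(F):
    """Outcome probabilities for one step executed by two processors.

    Assumes independent faults, each with probability F per interval.
    """
    return {
        "no_fault": (1 - F) ** 2,
        "one_fault": 2 * F * (1 - F),
        "two_faults": F ** 2,
    }

probs = step_probabilities(0.1)
for event, p in probs.items():
    print(event, round(p, 4))
```

The three probabilities always sum to one, so every state of the chain has a complete set of outgoing edges.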

Analyzing the Scheme Using the MRM

  • To solve the MRM, construct the transition matrix of the Markov chain.

    • Each entry p_{i,j} is the probability of transition from state i to state j.

  • Two ways to analyze a Markov chain

    • Transient analysis

      • We look at the state probabilities at each step, and from those probabilities get the desired quantities.

    • Steady-state or limiting analysis.

      • We look at the state probabilities in the limit as t→∞.

      • In this paper we use the steady-state analysis.
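
A steady-state analysis can be sketched in a few lines. The example below approximates the steady-state probabilities by repeated multiplication with the transition matrix and then takes the long-run ratio of expected time per step to expected useful work per step as the average execution time of a unit-length task. The two-state chain, its probabilities, and its rewards are hypothetical values for illustration and are not the DMR-B-1 matrix from the paper; the ratio formula is a standard Markov reward result used here as an assumption.

```python
def steady_state(P, iterations=1000):
    """Approximate pi satisfying pi = pi P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def average_execution_time(P, V, T):
    """Long-run time spent per unit of useful work."""
    pi = steady_state(P)
    n = len(P)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    mean_work = sum(pi[i] * P[i][j] * V[i][j] for i, j in pairs)
    mean_time = sum(pi[i] * P[i][j] * T[i][j] for i, j in pairs)
    return mean_time / mean_work

# Hypothetical chain: state 0 = normal mode, state 1 = recovery mode.
F = 0.1
p_ok = (1 - F) ** 2
P = [[p_ok, 1 - p_ok],
     [1 - F, F]]
step = 0.05 + 0.004           # assumed interval length plus overhead
V = [[0.05, 0.0],             # useful work only when a step succeeds
     [0.05, 0.0]]
T = [[step, step],            # every step takes the same time
     [step, step]]

print(steady_state(P))
print(average_execution_time(P, V, T))
```

Power iteration is used to keep the sketch dependency free; solving the linear system for pi directly would work equally well.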

Analysis of DMR-B-1

  • Applying results to the DMR-B-1 scheme:

    • The transition matrix of the scheme is:

  • The steady-state probabilities are:

  • And the average execution time of a task:

Simulation Results

  • The comparison was made for a task of length 1 with 20 checkpoints (n = 20, t_l = 0.05), t_ck = 0.001 and t_ld = 0.003.

  • The simulation points fall on the analytical curve.

  • For the other schemes as well, the analytical and simulation results match well.

Comparison between analytical and simulation results of the average execution time for the DMR-B-1 scheme

Scheme Comparison

  • TMR-F scheme

    • The task is executed by three processors, all of them executing the same interval.

    • A fault in a single processor can be recovered without a rollback because two processors with correct execution still agree on the checkpoint.

    • If faults occur in more than one processor, all the processors are rolled back and execute the same interval again.

  • DMR-B-2 scheme

    • Two processors execute the task.

    • Whenever a fault occurs, both processors are rolled back and execute the same interval again.

    • The difference between this scheme and simple rollback schemes, like TMR-F, is that all the unverified checkpoints are stored and compared, not just the checkpoints of the last step.

    • Two steps with a single fault are enough to verify an interval.
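
The benefit of the third processor in TMR-F can be illustrated with a short calculation. The sketch assumes independent faults with probability F per processor per interval and assumes that faulty states never match each other. The two-processor case shown here is a plain rollback scheme that compares only the last step, so it is simpler than DMR-B-2; all function names are hypothetical.

```python
def tmr_success(F):
    """Probability that at least two of three processors are fault free."""
    return (1 - F) ** 3 + 3 * F * (1 - F) ** 2

def duplex_success(F):
    """Probability that both of two processors are fault free."""
    return (1 - F) ** 2

def expected_steps(p_success):
    """Mean number of steps until the first success (geometric law)."""
    return 1.0 / p_success

F = 0.1
print(tmr_success(F), expected_steps(tmr_success(F)))
print(duplex_success(F), expected_steps(duplex_success(F)))
```

Under these assumptions the three-processor scheme verifies an interval with fewer repeated steps for any F between 0 and 1.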

Scheme Comparison (2)

  • DMR-F-1 scheme

    • Uses spare processors and the roll-forward recovery technique in order to avoid rollback

    • Two processors are used during fault-free steps.

    • Three additional spare processors are added for a single step after each fault to try to recover without a rollback.

  • Roll-Forward Checkpointing Scheme (RFCS)

    • A spare processor is used in fault recovery in order to avoid rollback.

    • The difference between the DMR-F-1 and RFCS schemes is that RFCS uses only one spare processor and the recovery takes two steps instead of one step in DMR-F-1.

Scheme Comparison (3)

  • Two properties are compared:

    • Average execution time

      • Important in real-time systems where fast response is desired

    • Average work used to complete the execution of a task

      • Important in transaction systems, where high availability of the system is required, and so the system should use as few resources as possible.

Simplified Model

  • To obtain general properties of the schemes without the influence of a specific implementation

  • The time to execute each step is

    • t_s + t_oh, where t_oh is the overhead time required by the scheme.

  • Using the simplified model, and a task with n intervals (t_l = 1/n)

    • The average execution time:

    • The total work of a task:

Average Execution Time

  • The average execution time of a task with n checkpoints is:

    where S is the average number of steps it takes to complete an interval.

  • The average execution time of the four schemes
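
The trade-off behind choosing the number of checkpoints can be sketched numerically. The code assumes that each step costs one interval length plus the overhead, so that a task of length 1 takes n * S * (1/n + t_oh) on average; this form is inferred from the definitions above and is not copied from the slides. It further assumes Poisson faults with a hypothetical rate `fault_rate` and a plain two-processor rollback scheme, for which S = 1 / (1 - F)^2.

```python
import math

def average_time(n, fault_rate, t_oh):
    """Average execution time of a unit-length task with n intervals."""
    t_l = 1.0 / n                              # interval length
    F = 1.0 - math.exp(-fault_rate * t_l)      # assumed Poisson faults
    S = 1.0 / (1.0 - F) ** 2                   # assumed plain rollback
    return n * S * (t_l + t_oh)

def best_n(fault_rate, t_oh, max_n=1000):
    """Number of intervals that minimizes the average execution time."""
    return min(range(1, max_n + 1),
               key=lambda n: average_time(n, fault_rate, t_oh))

n_opt = best_n(fault_rate=1.0, t_oh=0.002)
print(n_opt, average_time(n_opt, 1.0, 0.002))
```

Few checkpoints make each retry expensive, while many checkpoints accumulate overhead, so the minimum lies in between.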

Average Execution Time (2)

  • As seen from the figure:

    • The TMR-F scheme has the lowest execution time.

      • It uses more processors

      • It has a much lower probability of failing to find two matching checkpoints.

    • The DMR-B-2 scheme is the worst

      • It uses only two processors

      • It does not use spare processors to try to overcome the failure.

Average execution time with optimal checkpoints

  • The RFCS and DMR-F-1 schemes use spare processors during fault recovery,and thus have better performance than DMR-B-2.

Average Work

  • Applying the precise model, the four schemes give the following formulas:

    (Average work for a task of length 1 with overhead time t_oh = 0.002)

Average Work (2)

  • The results here are the reverse of the results for the average execution time.

    • The best scheme here is the DMR-B-2 because

      • it always uses only two processors.

    • The RFCS and DMR-F-1, which use 2 processors during normal execution and add spare processors during fault recovery, require more work.

    • The TMR-F scheme, which uses 3 processors, is the worst scheme.
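
The reversal between execution time and work can be reproduced with a rough calculation. The sketch treats work as the number of processors multiplied by the execution time, assumes independent faults with probability F per processor per interval, and compares TMR-F with a plain two-processor rollback scheme (simpler than DMR-B-2). The function name and all parameter values are hypothetical.

```python
def time_and_work(processors, p_success, n, t_oh):
    """Average time and work for a unit-length task with n intervals."""
    steps = n / p_success                 # expected total number of steps
    time = steps * (1.0 / n + t_oh)       # each step: interval + overhead
    work = processors * time              # all processors busy every step
    return time, work

F, n, t_oh = 0.05, 20, 0.002
p_tmr = (1 - F) ** 3 + 3 * F * (1 - F) ** 2   # at least two of three agree
p_duplex = (1 - F) ** 2                        # both of two agree

tmr_time, tmr_work = time_and_work(3, p_tmr, n, t_oh)
dup_time, dup_work = time_and_work(2, p_duplex, n, t_oh)
print(tmr_time, tmr_work)
print(dup_time, dup_work)
```

With these values the three-processor scheme finishes sooner but consumes more total processor time, matching the ordering described above.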


Conclusion

  • A novel technique to analyze the performance of checkpointing schemes is presented.

  • The technique is based on modeling the schemes under a given fault model with a Markov Reward Model

  • Results show that:

    • Generally the number of processors has a major effect on both quantities.

      • When a scheme uses more processors, its execution time decreases, while the total work increases.

      • The complexity of the scheme has only a minor effect on its performance.

  • The proposed technique is not limited to the schemes described in this paper, or to the fault model used here.

    • It can be used to analyze any checkpointing fault-tolerance scheme, with various fault models.
