Analysis of Checkpointing Schemes for Multiprocessor Systems

Avi Ziv

Jehoshua Bruck

Presentation By: Emre Chasan Moustafa

- Introduction
- Checkpointing
- Execution Of A Task
- Performance Analysis
- Analysis Technique

- Analysis technique
- Building The State Machine
- Creating the Markov Chain
- Analyzing the Scheme Using the MRM

- Scheme Comparison
- Average Execution Time
- Average Work

- Conclusion

- A technique for adding fault tolerance to distributed shared-memory systems.
- Reduces the time spent in retrying a task in case of a failure
- Hence reduces the average execution time of a task

- Important in many applications
- Real-time systems with hard deadlines,
- Transaction systems, where high availability is required.

- Basically serves two purposes:
- Detecting faults that occurred during the execution of a task,
- Reducing the time spent in recovering from faults.

- Achieved by
- Duplicating the task onto two or more processors
- Comparing the states of the processors at the checkpoints.

- Execute one interval of the task on all the processors that are assigned to it.
- Perform the operations necessary to achieve fault detection and recovery.
- Store the states of the processors in the stable storage
- Compare those states.
- If no fault occurred
- The execution of the task is resumed with the next interval in the next step.

- Otherwise
- Checkpoint processor performs operations to recover from the fault.

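
The step described above (execute an interval, store and compare states, then either advance or recover) can be sketched as a small simulation. This is only an illustrative sketch, not the authors' code: the function name `run_task`, the use of exactly two processors, independent faults with a fixed per-interval probability, and recovery by plain rollback are all assumptions made for the example.

```python
import random


def run_task(n_intervals, fault_prob, rng):
    """Simulate a duplicated task with a state comparison at each checkpoint.

    Assumptions (hypothetical, not from the slides): two processors,
    independent faults with probability fault_prob per processor per
    interval, a fault always makes the two states differ, and recovery
    is a plain rollback that repeats the same interval.
    Returns the number of steps needed to finish the task.
    """
    steps = 0
    interval = 0
    while interval < n_intervals:
        steps += 1
        # Execute one interval on both processors; True means a fault occurred.
        faults = [rng.random() < fault_prob for _ in range(2)]
        # Store and compare the states: they match only if neither faulted.
        if not any(faults):
            interval += 1  # checkpoint verified, resume with the next interval
        # Otherwise: recover (here, roll back and repeat the same interval).
    return steps


if __name__ == "__main__":
    print(run_task(20, 0.05, random.Random(0)))
```

With `fault_prob = 0` the task takes exactly `n_intervals` steps; any fault adds at least one repeated step.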

- Important when
- Trying to evaluate and compare different schemes
- Checking if a scheme achieves its goals in a certain system.

- Running simulations for performance evaluation
- Leads to long, time-consuming evaluation

- Using a simplified fault model
- Provides only approximate results

- This paper describes an analysis technique for studying the performance of checkpointing schemes for fault tolerance
- Provides a way to compare various schemes and select optimal values for some parameters of the scheme, like the number of checkpoints.

- Based on the analysis of a discrete-time Markov Reward Model (MRM)
- Done in 3 steps
- The analyzed scheme is modelled as a state-machine.
- The edges of the state-machine are assigned transition probabilities according to the events that cause the transition and the fault model used.
- The Markov chain, created by the first two steps, is analyzed, and values for the properties of interest are derived.

- An example using the DMR-B-1 scheme
- Task is executed by two processors in parallel

- Describes the behaviour of the scheme in the eyes of an external viewer, who can observe the faults that occurred during a step.
- Each transition in the state-machine represents one step.
- Each transition has associated with it a set of properties called rewards.
- For the execution time of the schemes, we use two rewards
- v_i - The amount of useful work that is done during the transition.
- t_i - The time it takes to complete the step that corresponds to the transition.

- In the DMR-B-1 scheme, the operation has two basic modes.
- The first mode is the normal operation mode, where two processors are executing the task in parallel.
- The second mode is the fault recovery mode, where a single processor tries to find a match to an unverified checkpoint.

The execution shown in the previous figure causes the following transitions (the numbers above the arrows are the edges that are used for the transitions)

- Involves assigning probabilities to each of the transitions in the state-machine constructed in the first step.
- The probabilities assigned to the edges are determined by the fault model.
- F is the probability that a processor will have a fault while executing an interval.
- Transition description for the DMR-B-1 extended state-machine:

- To solve the MRM, construct the transition matrix of the Markov chain.
- Each entry p_{i,j} is the probability of a transition from state i to state j.

- Two ways to analyze a Markov chain
- Transient analysis
- We look at the state probabilities at each step, and from those probabilities get the desired quantities.

- Steady-state or limiting analysis.
- We look at the state probabilities in the limit as t→∞.
- In this paper we use the steady-state analysis.

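
To make the steady-state analysis concrete, the sketch below computes the limiting state probabilities of a small Markov chain by repeated multiplication with the transition matrix, then combines them with per-transition rewards. The two-state chain, its probabilities, and the reward values are hypothetical and are not the DMR-B-1 chain from the slides. The ratio of expected time per step to expected useful work per step is the standard long-run average for an ergodic chain, used here as an assumption about how the rewards are combined.

```python
def steady_state(P, iters=10_000, tol=1e-12):
    """Approximate the stationary distribution of row-stochastic matrix P
    by power iteration (assumes the chain is ergodic)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


def time_per_unit_work(P, t, v):
    """Long-run time spent per unit of useful work.

    t[i][j] and v[i][j] are the time and work rewards of transition i -> j.
    """
    pi = steady_state(P)
    n = len(P)
    total_t = sum(pi[i] * P[i][j] * t[i][j] for i in range(n) for j in range(n))
    total_v = sum(pi[i] * P[i][j] * v[i][j] for i in range(n) for j in range(n))
    return total_t / total_v


# Hypothetical chain: state 0 = normal operation, state 1 = fault recovery.
P = [[0.81, 0.19],
     [0.90, 0.10]]
t = [[1.0, 1.0], [1.0, 1.0]]   # every step costs one time unit
v = [[1.0, 0.0], [1.0, 0.0]]   # work is done only on transitions into state 0

if __name__ == "__main__":
    print(steady_state(P))
    print(time_per_unit_work(P, t, v))
```

For a two-state chain the stationary distribution has the closed form (b/(a+b), a/(a+b)), with a = 0.19 and b = 0.90 here, which gives a quick check on the iteration.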

- Applying results to the DMR-B-1 scheme:
- The transition matrix of the scheme is:

- The steady-state probabilities are:
- And the average execution time of a task:

- The comparison was made for a task of length 1 with 20 checkpoints (n = 20, t_I = 0.05), t_ck = 0.001 and t_ld = 0.003.
- The simulation points fall on the line of the analytical plot.
- For the other schemes as well, the analytical and simulation results match well.

Comparison between analytical and simulation results of the average execution time for the DMR-B-1 scheme

- TMR-F scheme
- The task is executed by three processors, all of them executing the same interval.
- A fault in a single processor can be recovered without a rollback because two processors with correct execution still agree on the checkpoint.
- If faults occur in more than one processor, all the processors are rolled back and execute the same interval again.
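
As a worked example of the TMR-F rule above, the sketch below computes the average number of steps per interval. It assumes, beyond what the slides state, that faults hit each processor independently with probability F per interval and that two faulty processors never produce matching states. Under those assumptions an interval is verified when at most one of the three processors faults, and the number of attempts is geometric. The function name is hypothetical.

```python
def tmr_f_steps_per_interval(F):
    """Average steps to complete one interval under TMR-F.

    Assumes independent faults with probability F per processor per
    interval (hypothetical fault model for illustration).
    """
    # No fault in any processor, or a fault in exactly one of the three.
    p_ok = (1 - F) ** 3 + 3 * F * (1 - F) ** 2
    # Each attempt succeeds with probability p_ok, so the mean number of
    # attempts is 1 / p_ok (geometric distribution).
    return 1.0 / p_ok


if __name__ == "__main__":
    for F in (0.0, 0.01, 0.1):
        print(F, tmr_f_steps_per_interval(F))
```

For F = 0.1 the success probability is 0.729 + 0.243 = 0.972, so an interval needs about 1.029 steps on average.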

- DMR-B-2 scheme
- Two processors execute the task.
- Whenever a fault occurs, both processors are rolled back and execute the same interval again.
- The difference between this scheme and simple rollback schemes, like TMR-F, is that all the unverified checkpoints are stored and compared, not just the checkpoints of the last step.
- Two steps with a single fault are enough to verify an interval.

- DMR-F-1 scheme
- Uses spare processors and the roll-forward recovery technique in order to avoid rollback
- Two processors are used during fault-free steps.
- Three additional spare processors are added for a single step after each fault to try to recover without a rollback.

- Roll-Forward Checkpointing Scheme (RFCS)
- A spare processor is used in fault recovery in order to avoid rollback.
- The difference between the DMR-F-1 and RFCS schemes is that RFCS uses only one spare processor and the recovery takes two steps instead of one step in DMR-F-1.

- Two properties are compared:
- Average execution time
- Important in real-time systems where fast response is desired

- Average work used to complete the execution of a task
- Important in transaction systems, where high availability of the system is required, and so the system should use as few resources as possible.


- To obtain general properties of the schemes without the influence of a specific implementation, a simplified model is used
- The time to execute each step is t_s + t_oh, where t_oh is the overhead time required by the scheme.

- Using the simplified model, and a task with n intervals (t_I = 1/n)
- The average execution time:
- The total work of a task:

- The average execution time of a task with n checkpoints is:
where S is the average number of steps it takes to complete an interval.
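
The formula itself did not survive on this slide. The following is a reconstruction from the definitions given here, not a copy of the paper's equation. It assumes that the useful part of each step is one interval of length t_I = 1/n, so that a step takes t_I + t_oh, and that S is the average number of steps per interval as defined above. The work expression further assumes a scheme that keeps a fixed number k of processors busy for the whole execution, where k is a symbol introduced only for this sketch.

```latex
\begin{align*}
T &= n \cdot S \cdot (t_I + t_{oh}) \\
  &= n \cdot S \cdot \left(\tfrac{1}{n} + t_{oh}\right) \\
  &= S \,(1 + n\, t_{oh}), \\[4pt]
W &= k \cdot T \qquad \text{(fixed number of processors } k\text{)}.
\end{align*}
```

Under typical fault models S also depends on n, because shorter intervals are less likely to contain a fault, while the overhead term grows with n. That trade-off is what makes an optimal number of checkpoints possible.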

- The average execution time of the four schemes

- As seen from the figure:
- The TMR-F scheme has the lowest execution time.
- Because it uses more processors than the other schemes
- It has a much lower probability of failing to find two matching checkpoints.

- The DMR-B-2 scheme is the worst
- Because it uses only two processors
- And does not use spare processors to try to overcome the failure.


Average execution time with optimal checkpoints

- The RFCS and DMR-F-1 schemes use spare processors during fault recovery, and thus have better performance than DMR-B-2.

- Applying the precise model, the four schemes give the following formulas:
(Average work for a task of length 1 with overhead time t_oh = 0.002)

- The results here are the reverse of the results in the average execution time.
- The best scheme here is the DMR-B-2 because
- it always uses only two processors.

- The RFCS and DMR-F-1, which use 2 processors during normal execution and add spare processors during fault recovery, require more work.
- The TMR-F scheme, which uses 3 processors, is the worst scheme.


- A novel technique to analyze the performance of checkpointing schemes is presented.
- The technique is based on modeling the schemes under a given fault model with a Markov Reward Model
- Results show that:
- Generally, the number of processors has a major effect on both quantities.
- When a scheme uses more processors, its execution time decreases, while the total work increases.
- The complexity of the scheme has only a minor effect on its performance.

- The proposed technique is not limited to the schemes described in this paper, or to the fault model used here.
- It can be used to analyze any checkpointing fault-tolerance scheme, with various fault models.