Movement-Based Check-pointing and Logging for Recovery in Mobile Computing Systems

Sapna E. George, Ing-Ray Chen, Ying Jin • Dept. of Computer Science, Virginia Polytechnic Institute and State University

Outline • Background • Problem Definition – Failure Recovery in the Mobile Computing Environment • Proposed Solution – Movement-Based Check-pointing and Logging • Performance Analysis • Analytic Model of the System • Analysis Results and Conclusions • Future Work

Background

Mobile Computing • Advances in wireless networking and portable device technologies are revolutionizing computing • Mobile Computing – A type of distributed computing • Involves hosts that may be mobile • Host network connectivity maintained through wireless communications

Fault-tolerance in Distributed systems: Check-pointing, Logging, Rollback recovery • Check-pointing (during failure-free operation) • Save system state to stable storage • This snapshot is called a checkpoint • Logging (during failure-free operation) • All non-deterministic events and the information necessary to replay these events are logged to stable storage • In addition to checkpoints

Fault-tolerance in Distributed systems • Failure Recovery • Failed process rolls back to the latest checkpoint • Replays all the logged events in their original order • Recreates pre-failure state independently
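The checkpoint, log, and rollback steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class `RecoverableProcess`, its dictionary state, and the `(key, delta)` event format are all assumptions made for the example, and in-memory fields stand in for stable storage.

```python
import copy


class RecoverableProcess:
    """Toy process whose state is a dict of counters (hypothetical model)."""

    def __init__(self):
        self.state = {}               # volatile state, lost on failure
        self.stable_checkpoint = {}   # stands in for stable storage
        self.stable_log = []          # events logged since the last checkpoint

    def _apply(self, event):
        key, delta = event
        self.state[key] = self.state.get(key, 0) + delta

    def handle(self, event):
        # Log the event before applying it so that it can be replayed later.
        self.stable_log.append(event)
        self._apply(event)

    def checkpoint(self):
        # Save a snapshot; events before it no longer need to be replayed.
        self.stable_checkpoint = copy.deepcopy(self.state)
        self.stable_log.clear()

    def recover(self):
        # Roll back to the latest checkpoint, then replay logged events in order.
        self.state = copy.deepcopy(self.stable_checkpoint)
        for event in self.stable_log:
            self._apply(event)


p = RecoverableProcess()
p.handle(("x", 1))
p.checkpoint()
p.handle(("x", 2))
p.handle(("y", 5))
state_before_failure = dict(p.state)

p.state = {}   # simulate a failure that wipes volatile state
p.recover()
print(p.state == state_before_failure)
```

Because every event after the checkpoint is in the log, the process rebuilds its pre-failure state without involving any other process.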

Problem Definition Failure Recovery in the Mobile Computing Environment

Effects of Properties of MC Env. • Mobility of hosts • If checkpointing requires coordination, the mobile host (MH) must first be located before control messages can be delivered; this increases communication delay • Data related to recovery, such as checkpoints and logs, may be distributed over many mobile support stations (MSSs); a mechanism is required for efficient storage, retrieval and management of this dispersed information

Effects of Properties of MC Env. • Low bandwidth and unreliable network connectivity • A recovery mechanism that requires many messages or large messages imposes an undue burden on the wireless resources and increases the cost of providing fault tolerance.

Effects of Properties of MC Env. • Limited battery life of host devices • Communication is energy intensive. • Recovery mechanism must keep communication (the number of messages and the size of messages) to a minimum.

Effects of Properties of MC Env. • Lack of stable storage on host devices • Devices are vulnerable to physical damage • Devices are small and are equipped with limited memory • MH’s disk cannot reliably function as the stable storage required to store recovery information.

Effects of Properties of MC Env. • Different types of 'failures' • Voluntary disconnection and hardware failure must be handled differently • A disconnected host may reconnect after a while and expect to resume operations • An MH that is currently unreachable cannot be expected to participate in a checkpointing or recovery operation. • A scheme that requires synchronization or coordination with other MHs would either block until the MH reconnects or fail.

The Problem… • Traditional recovery schemes suffer from many shortcomings when applied to the mobile computing environment. • The failure-prone nature of the environment makes it essential to provide some form of explicit recovery mechanism.

The Problem… • In general, application recovery mechanisms try to balance • Recovery cost (failure-free operational cost) • Recovery time • Storage requirements for recovery related information

The Problem… • Adaptations of traditional recovery schemes for the mobile computing environment • Do not consider mobility in the selection of checkpointing interval • Use periodic checkpointing • Subsequently control the proliferation of recovery information using techniques that merge logs and move the information closer to the MH.

Proposed Solution Movement-Based Check-pointing and Logging

Assumed Mobile Computing System • A set of mobile hosts (MHs) • They maintain network connectivity through a wireless link to a static mobile support station (MSS) • An MSS handles all communications to and from MHs within its area of influence, known as a cell • Each MSS is equipped with enough stable storage to store the state and log information

Assumed Mobile Computing System • Interactions between the MH and the network infrastructure relevant to failure recovery • Handoff – Cell boundary crossing • Disconnection – For power conservation • Reconnection – Possibly in a cell different from the one in which it disconnected

Assumed Mobile Computation • A distributed computation: a number of processes executing concurrently on multiple hosts. • Process states: • Normal – executing application related computations, receiving user inputs or sending and receiving messages. • Save – saves its state as a checkpoint to the stable storage • Between checkpoints, the process also logs all events (Normal state) • Recovery – Loads checkpoints and applies logs

Movement-Based Checkpointing and Logging • Interval between checkpoints is governed by the number of handoffs experienced by the MH and is not fixed • MH maintains a handoff counter which is incremented by 1 every time a handoff occurs. • When the value of the counter becomes greater than a threshold M, a checkpoint is taken. • In between checkpoints, all write events related to an MH are also logged to the local MSS.
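The handoff counter described above might look like the following sketch. The class `MobileHost`, its field names, and the MSS identifiers are hypothetical. Resetting the counter and discarding older logs after a checkpoint are assumptions that the slides imply but do not state.

```python
class MobileHost:
    """Toy MH that checkpoints once its handoff counter exceeds M."""

    def __init__(self, threshold_m, start_mss="mss-0"):
        self.threshold_m = threshold_m
        self.current_mss = start_mss
        self.handoff_count = 0
        self.checkpoints_taken = 0
        self.logs_by_mss = {}   # MSS id -> write events logged at that MSS

    def write(self, event):
        # Between checkpoints, each write event is logged at the local MSS.
        self.logs_by_mss.setdefault(self.current_mss, []).append(event)

    def handoff(self, new_mss):
        self.current_mss = new_mss
        self.handoff_count += 1
        # A checkpoint is taken when the counter becomes greater than M.
        if self.handoff_count > self.threshold_m:
            self.take_checkpoint()

    def take_checkpoint(self):
        self.checkpoints_taken += 1
        self.handoff_count = 0      # assumption: counter restarts per checkpoint
        self.logs_by_mss.clear()    # assumption: older logs are no longer needed


mh = MobileHost(threshold_m=2)
for cell in ["mss-1", "mss-2", "mss-3"]:
    mh.write(("x", cell))
    mh.handoff(cell)
print(mh.checkpoints_taken, mh.handoff_count)
```

With M = 2 the first two handoffs only advance the counter; the third pushes it past the threshold and triggers the checkpoint, so logs never span more than a bounded number of MSSs.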

Movement-Based Checkpointing and Logging • The threshold M is a configurable parameter. Depends on: • User mobility rate • Failure rate • Application log arrival rate

Movement-Based Checkpointing and Logging • Thus, depending on the variability in the MH’s mobility, the time interval between successive checkpoints differs. • Recovery – MH recovers independently without coordination with other MHs • Upon reconnection, MH informs local MSS. • Local MSS contacts the MSS holding the latest checkpoint • Local MSS contacts all MSSs storing logs • All data transferred to local MSS via wired network and to MH via wireless link • MH rolls back and applies logs
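The recovery steps above can be illustrated with a small sketch. It assumes, for illustration only, that a checkpoint is a dictionary, that each log entry is a `(key, value)` write, and that the local MSS knows the order in which the MH visited the MSSs holding its logs. The function name `recover_state` and the MSS identifiers are hypothetical.

```python
def recover_state(checkpoint, logs_by_mss, visit_order):
    """Rebuild MH state from the latest checkpoint plus logs held at several MSSs.

    checkpoint   -- dict saved at the MSS holding the latest checkpoint
    logs_by_mss  -- dict: MSS id -> list of (key, value) writes logged there
    visit_order  -- MSS ids in the order the MH visited them (oldest first)
    """
    state = dict(checkpoint)               # roll back to the checkpoint
    for mss in visit_order:                # gather logs MSS by MSS
        for key, value in logs_by_mss.get(mss, []):
            state[key] = value             # replay writes in original order
    return state


checkpoint = {"a": 1}
logs = {
    "mss-1": [("a", 2), ("b", 1)],
    "mss-2": [("a", 3)],
}
recovered = recover_state(checkpoint, logs, ["mss-1", "mss-2"])
print(recovered)
```

Replaying the logs in visit order matters here: the later write to `a` at `mss-2` must override the earlier one at `mss-1`.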

Movement-Based Checkpointing and Logging • The performance of this scheme depends on identifying the optimal movement threshold M per user and application. • Checkpoints and logs remain within an acceptable range of the MH’s current location, which eliminates the need for information consolidation. • Recovery time stays acceptable because M bounds the number of MSSs from which logs must be retrieved.

Performance Analysis Analytic Model

Stochastic Petri-Net (SPN) Model

SPN Model Parameters

SPN Model Parameters • Θk – Checkpoint rate of the MH • Θi – Recovery rate of the MH (the inverse of the recovery time) • i – Number of handoffs experienced by the MH since the last checkpoint and before failure.

Analytic Model – Recovery Time

Analytic Model – Recovery Time • Treq_rec - Time spent on recovery information requests • Nmss_logs – Number of MSSs storing logs • Dmss - average hop count between MSScp and MSSrec

Analytic Model – Recovery Time • Tckp_tx - Time spent on transmitting the latest checkpoint to the MH • Tlog_tx - Time spent on transmitting the logs to the MH • Trec - Time spent to rollback to the last checkpoint and apply the logs

Analytic Model – Cost of Recovery • Tr – Average recovery time per failure • Fr – Recovery probability • Tc – Cost of recovery • Equation terms: number of checkpoints before failure, number of logs before failure

SPN Evaluation Parameters • Size of a log entry – 50 B • Size of a checkpoint – 2000 B • Bandwidth of wired network – 2 Mbps • Ratio of bandwidth of wireless to wired network (r) – 0.1 • Time required to apply a log entry (Telog) – 0.0001 s • Time required to transmit a log entry through the wireless channel (Tlog_w) – 0.002 s • Time required to transmit a checkpoint through the wireless channel (Tckp_w) – 0.08 s
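The two wireless transmission times follow from the sizes and bandwidths listed above. The short check below assumes that B means bytes and that 1 Mbps is 10^6 bits per second; the variable and function names are illustrative only.

```python
LOG_ENTRY_BYTES = 50
CHECKPOINT_BYTES = 2000
WIRED_BANDWIDTH_BPS = 2_000_000   # 2 Mbps, assuming 1 Mbps = 10**6 bit/s
WIRELESS_TO_WIRED_RATIO = 0.1     # r

# Wireless bandwidth is the wired bandwidth scaled by r: 0.2 Mbps.
wireless_bps = WIRED_BANDWIDTH_BPS * WIRELESS_TO_WIRED_RATIO


def transmit_seconds(size_bytes, bandwidth_bps):
    """Time to push size_bytes through a link of bandwidth_bps bits per second."""
    return size_bytes * 8 / bandwidth_bps


t_log_w = transmit_seconds(LOG_ENTRY_BYTES, wireless_bps)    # Tlog_w
t_ckp_w = transmit_seconds(CHECKPOINT_BYTES, wireless_bps)   # Tckp_w

print(f"Tlog_w = {t_log_w:.4f} s, Tckp_w = {t_ckp_w:.4f} s")
```

Under these assumptions the results agree with the listed Tlog_w of 0.002 s (400 bits over 200,000 bit/s) and Tckp_w of 0.08 s (16,000 bits over 200,000 bit/s).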

Performance Analysis Results and Conclusions

Recovery Probability vs. Recovery Time

Recovery Probability vs. Log Arrival Rate

Recovery Probability vs. Failure Rate

Recovery Probability & Recovery Time vs. Movement Threshold

Determining Optimal Movement Threshold that Minimizes Recovery Cost Per Failure

Conclusion – Proposed Scheme • An efficient failure recovery scheme for mobile computing systems based on movement-based checkpointing and logging • Movement-based checkpointing and logging scheme takes a checkpoint only after the mobile node has made M movements (mobility handoffs). • The value of M is governed by the failure rate, log arrival rate, and the mobility rate of the application and MH. • Identify the optimal movement threshold M, when given the failure, mobility and log arrival rates, to minimize the cost of recovery per failure.

Conclusion – Practical Application • Build a table at configuration time covering possible parameter values of the mobility rate and failure rate of the MH and log arrival rate of the mobile applications, and listing the optimal M value that would minimize the recovery cost per failure. • At runtime, based on the measured rates, the optimal M may be selected dynamically to minimize the recovery cost per failure. • Optimal M selected must also satisfy the specified recovery probability when given an application deadline to recover from a failure.
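The table-driven selection described above could be sketched as follows. Everything here is an assumption made for illustration: `toy_cost` is a placeholder standing in for the recovery cost per failure that the analytic model would supply, the rate grids are invented, and the recovery-probability constraint is left out for brevity.

```python
def build_table(cost_fn, mobility_rates, failure_rates, log_rates, candidates):
    """Configuration time: for each rate combination, store the M with lowest cost."""
    table = {}
    for mob in mobility_rates:
        for fail in failure_rates:
            for log in log_rates:
                table[(mob, fail, log)] = min(
                    candidates, key=lambda m: cost_fn(m, mob, fail, log)
                )
    return table


def nearest(grid, value):
    """Snap a measured rate to the closest tabulated rate."""
    return min(grid, key=lambda g: abs(g - value))


def select_m(table, grids, measured):
    """Runtime: look up M for the measured (mobility, failure, log) rates."""
    key = tuple(nearest(grid, value) for grid, value in zip(grids, measured))
    return table[key]


def toy_cost(m, mobility_rate, failure_rate, log_rate):
    # Placeholder only: NOT the cost model from the slides.
    return (m - 10 * mobility_rate) ** 2


grids = ([0.1, 0.5], [0.01], [1.0])   # hypothetical rate grids
table = build_table(toy_cost, *grids, candidates=range(1, 11))
chosen_m = select_m(table, grids, measured=(0.45, 0.02, 3.0))
print(chosen_m)
```

Precomputing the table keeps the runtime step to a lookup, which suits a setting where the rates are measured while the application is running.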
