
Movement-Based Check-pointing and Logging for Recovery in Mobile Computing Systems

Sapna E. George, Ing-Ray Chen, Ying Jin. Dept. of Computer Science, Virginia Polytechnic Institute and State University.


Presentation Transcript


  1. Movement-Based Check-pointing and Logging for Recovery in Mobile Computing Systems. Sapna E. George, Ing-Ray Chen, Ying Jin. Dept. of Computer Science, Virginia Polytechnic Institute and State University

  2. Outline • Background • Problem Definition – Failure Recovery in the Mobile Computing Environment • Proposed Solution – Movement-Based Check-pointing and Logging • Performance Analysis • Analytic Model of the System • Analysis Results and Conclusions • Future Work

  3. Background

  4. Mobile Computing • Advances in wireless networking and portable device technologies are revolutionizing computing • Mobile Computing – A type of distributed computing • Involves hosts that may be mobile • Host network connectivity maintained through wireless communications

  5. Fault-tolerance in Distributed systems: Check-pointing, Logging, Rollback recovery • Check-pointing (during failure-free operation) • Save system state to stable storage • This snapshot is called a checkpoint • Logging (during failure-free operation) • All non-deterministic events and the information necessary to replay these events are logged to stable storage • In addition to checkpoints

  6. Fault-tolerance in Distributed systems • Failure Recovery • Failed process rolls back to the latest checkpoint • Replays all the logged events in their original order • Recreates pre-failure state independently
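
The rollback recovery steps on slides 5 and 6 can be illustrated with a minimal sketch. It assumes a toy process whose whole state is one integer and whose only events are integer deltas; the names StableStorage, Counter and demo are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class StableStorage:
    """Hypothetical stand-in for stable storage that survives a failure."""
    checkpoint: int = 0
    log: list = field(default_factory=list)


class Counter:
    """Toy process whose entire state is a single integer."""

    def __init__(self, storage):
        self.storage = storage
        self.state = storage.checkpoint

    def apply(self, delta):
        # Log the event before applying it so that it can be replayed later.
        self.storage.log.append(delta)
        self.state += delta

    def take_checkpoint(self):
        # Save the current state; earlier events no longer need replaying.
        self.storage.checkpoint = self.state
        self.storage.log.clear()

    def recover(self):
        # Roll back to the latest checkpoint, then replay the log in order.
        self.state = self.storage.checkpoint
        for delta in self.storage.log:
            self.state += delta


def demo():
    storage = StableStorage()
    process = Counter(storage)
    process.apply(5)
    process.take_checkpoint()
    process.apply(2)
    process.apply(-1)
    before_failure = process.state
    process.state = None  # simulate the loss of volatile state
    process.recover()
    return before_failure, process.state
```

Calling demo() returns the state just before the simulated failure and the state after recovery; the two match because the process only needs its own checkpoint and log.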

  7. Problem Definition Failure Recovery in the Mobile Computing Environment

  8. Effects of Properties of MC Env. • Mobility of hosts • If checkpointing requires coordination, the mobile host (MH) must be searched for and located before control messages can be delivered; this increases communication delay • Data related to recovery, such as checkpoints and logs, may be distributed over many mobile support stations (MSSs); a mechanism is required for efficient storage, retrieval and management of this dispersed information

  9. Effects of Properties of MC Env. • Low bandwidth and unreliable network connectivity • A recovery mechanism that requires many messages or large messages places an undue burden on the wireless resources and increases the cost of providing fault tolerance.

  10. Effects of Properties of MC Env. • Limited battery life of host devices • Communication is energy intensive. • Recovery mechanism must keep communication (the number of messages and the size of messages) to a minimum.

  11. Effects of Properties of MC Env. • Lack of stable storage on host devices • Devices are vulnerable to physical damage • Devices are small and are equipped with limited memory • MH’s disk cannot reliably function as the stable storage required to store recovery information.

  12. Effects of Properties of MC Env. • Different types of ‘failures’ • Voluntary disconnection and hardware failure must be handled differently • A disconnected host may reconnect after a while and expect to resume operations • An MH that is currently unreachable cannot be expected to participate in a checkpointing or recovery operation. • A scheme that requires synchronization or coordination with other MHs would either block until the MH reconnects or fail.

  13. The Problem… • Traditional recovery schemes suffer from many shortcomings when applied to the mobile computing environment. • The failure-prone nature of the environment makes it essential to provide some form of explicit recovery mechanism.

  14. The Problem… • In general, application recovery mechanisms try to balance • Recovery cost (failure-free operational cost) • Recovery time • Storage requirements for recovery related information

  15. The Problem… • Adaptations of traditional recovery schemes for the mobile computing environment • Do not consider mobility in the selection of checkpointing interval • Use periodic checkpointing • Subsequently control the proliferation of recovery information using techniques that merge logs and move the information closer to the MH.

  16. Proposed Solution Movement-Based Check-pointing and Logging

  17. Assumed Mobile Computing System • A set of mobile hosts (MHs) • They maintain network connectivity through a wireless link to a static mobile support station (MSS) • An MSS handles all communications to and from MHs within its area of influence, known as a cell • Each MSS is equipped with enough stable storage to store the state and log information

  18. Assumed Mobile Computing System • Interactions between the MH and the network infrastructure relevant to failure recovery • Handoff – Cell boundary crossing • Disconnection – For power conservation • Reconnection – Possibly in a cell different from the one in which it disconnected

  19. Assumed Mobile Computation • A distributed computation: a number of processes executing concurrently on multiple hosts. • Process states: • Normal – executing application related computations, receiving user inputs or sending and receiving messages. • Save – saves its state as a checkpoint to the stable storage • Between checkpoints, the process also logs all events (Normal state) • Recovery – Loads checkpoints and applies logs

  20. Movement-Based Checkpointing and Logging • Interval between checkpoints is governed by the number of handoffs experienced by the MH and is not fixed • MH maintains a handoff counter which is incremented by 1 every time a handoff occurs. • When the value of the counter becomes greater than a threshold M, a checkpoint is taken. • In between checkpoints, all write events related to an MH are also logged to the local MSS.
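
The counter rule on slide 20 can be sketched as follows. This is an illustration only: the class name HandoffCheckpointer and the helper checkpoint_pattern are hypothetical, and resetting the counter to zero after each checkpoint is an assumption, since the slide does not say when the counter restarts.

```python
class HandoffCheckpointer:
    """Counts handoffs and signals when a checkpoint is due."""

    def __init__(self, threshold_m):
        self.threshold_m = threshold_m
        self.handoff_count = 0
        self.checkpoints_taken = 0

    def on_handoff(self):
        """Record one handoff; return True if it triggered a checkpoint."""
        self.handoff_count += 1
        # Slide 20: checkpoint when the counter becomes greater than M.
        if self.handoff_count > self.threshold_m:
            self.checkpoints_taken += 1
            self.handoff_count = 0  # assumption: counter restarts here
            return True
        return False


def checkpoint_pattern(threshold_m, handoffs):
    """Return, per handoff, whether that handoff triggered a checkpoint."""
    checkpointer = HandoffCheckpointer(threshold_m)
    return [checkpointer.on_handoff() for _ in range(handoffs)]
```

With M = 2, checkpoint_pattern(2, 6) marks the third and sixth handoffs as checkpoints, so the wall-clock time between checkpoints depends entirely on how quickly the host moves between cells.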

  21. Movement-Based Checkpointing and Logging • The threshold M is a configurable parameter. Depends on: • User mobility rate • Network failure rate • Application log arrival rate

  22. Movement-Based Checkpointing and Logging • Thus, depending on the variability in the MH’s mobility, the time interval between successive checkpoints differs. • Recovery – MH recovers independently without coordination with other MHs • Upon reconnection, MH informs local MSS. • Local MSS contacts the MSS holding the latest checkpoint • Local MSS contacts all MSSs storing logs • All data transferred to local MSS via wired network and to MH via wireless link • MH rolls back and applies logs
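
The recovery steps on slide 22 can be sketched as a single function. The data layout is an assumption made for illustration: each checkpoint and log entry carries a sequence number, the state is one integer, and events are integer deltas. The function name recover_state and the MSS identifiers are hypothetical.

```python
def recover_state(checkpoints, logs_by_mss):
    """Rebuild the MH state from dispersed recovery information.

    checkpoints: {mss_id: (seq_no, state)} checkpoints held at each MSS
    logs_by_mss: {mss_id: [(seq_no, delta), ...]} log entries held at each MSS
    """
    # Fetch the latest checkpoint, i.e. the one with the highest sequence number.
    ckpt_seq, state = max(checkpoints.values(), key=lambda c: c[0])

    # Gather log entries from every MSS, keeping only those after the checkpoint.
    pending = [
        entry
        for entries in logs_by_mss.values()
        for entry in entries
        if entry[0] > ckpt_seq
    ]

    # Roll forward: apply the entries in their original order.
    for _, delta in sorted(pending, key=lambda e: e[0]):
        state += delta
    return state
```

For example, with a checkpoint (3, 10) at MSS "B" and later entries (4, 1) at "B" and (5, 2) at "C", the function returns 13 and ignores older entries left at other stations. No other MH takes part, which matches the independent recovery described above.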

  23. Movement-Based Checkpointing and Logging • The performance of this scheme depends on identifying the optimal movement threshold M per user and application. • Checkpoints and logs remain within an acceptable range of the MH’s current location, which eliminates the need for information consolidation. • Ensures acceptable recovery time since M bounds the number of MSSs from which logs must be retrieved.

  24. Performance Analysis Analytic Model

  25. Stochastic Petri-Net (SPN) Model

  26. SPN Model Parameters

  27. SPN Model Parameters • Parameter Θk- Checkpoint rate of the MH • Parameter Θi- Recovery rate of the MH = inverse of recovery time • i - number of handoffs experienced by the MH since the last checkpoint and before failure.

  28. Analytic Model – Recovery Time

  29. Analytic Model – Recovery Time • Treq_rec - Time spent on recovery information requests • Nmss_logs – Number of MSSs storing logs • Dmss - average hop count between MSScp and MSSrec

  30. Analytic Model – Recovery Time • Tckp_tx - Time spent on transmitting the latest checkpoint to the MH • Tlog_tx - Time spent on transmitting the logs to the MH • Trec - Time spent to rollback to the last checkpoint and apply the logs

  31. Analytic Model – Cost of Recovery • Tr – Average recovery time per failure • Fr – Recovery probability • Tc – Cost of recovery • Equation labels: No. of checkpoints before failure; No. of logs before failure

  32. SPN Evaluation Parameters • Size of a log entry – 50 B • Size of a checkpoint – 2000 B • Bandwidth of wired network – 2 Mbps • Ratio of bandwidth of wireless to wired network (r) – 0.1 • Time required to apply a log entry (Telog) – 0.0001 s • Time required to transmit a log entry through the wireless channel (Tlog_w) – 0.002 s • Time required to transmit a checkpoint through the wireless channel (Tckp_w) – 0.08 s
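
The two wireless transmission times on slide 32 follow from the listed sizes and bandwidths, as the short check below shows. It assumes that 2 Mbps means 2,000,000 bits per second and that transmission time is simply size divided by bandwidth with no protocol overhead; the function and constant names are illustrative.

```python
def tx_time(size_bytes, bandwidth_bps):
    """Seconds needed to send size_bytes over a link of bandwidth_bps."""
    return size_bytes * 8 / bandwidth_bps


WIRED_BPS = 2_000_000          # assumption: 2 Mbps = 2,000,000 bits/s
R = 0.1                        # wireless-to-wired bandwidth ratio from the slide
WIRELESS_BPS = WIRED_BPS * R   # 0.2 Mbps

T_LOG_W = tx_time(50, WIRELESS_BPS)     # one 50 B log entry
T_CKP_W = tx_time(2000, WIRELESS_BPS)   # one 2000 B checkpoint
```

Under these assumptions T_LOG_W comes to about 0.002 s and T_CKP_W to about 0.08 s, which agrees with the values given for Tlog_w and Tckp_w.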

  33. Performance Analysis Results and Conclusions

  34. Recovery Probability vs. Recovery Time

  35. Recovery Probability vs. Log Arrival Rate

  36. Recovery Probability vs. Failure Rate

  37. Recovery Probability & Recovery Time vs. Movement Threshold

  38. Determining Optimal Movement Threshold that Minimizes Recovery Cost Per Failure

  39. Conclusion – Proposed Scheme • An efficient failure recovery scheme for mobile computing systems based on movement-based checkpointing and logging • Movement-based checkpointing and logging scheme takes a checkpoint only after the mobile node has made M movements (mobility handoffs). • The value of M is governed by the failure rate, log arrival rate, and the mobility rate of the application and MH. • Identify the optimal movement threshold M, when given the failure, mobility and log arrival rates, to minimize the cost of recovery per failure.

  40. Conclusion – Practical Application • Build a table at configuration time covering possible parameter values of the mobility rate and failure rate of the MH and log arrival rate of the mobile applications, and listing the optimal M value that would minimize the recovery cost per failure. • At runtime, based on the measured rates, the optimal M may be selected dynamically to minimize the recovery cost per failure. • Optimal M selected must also satisfy the specified recovery probability when given an application deadline to recover from a failure.
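
The table-driven selection on slide 40 might look like the sketch below. It is illustrative only: cost_fn stands in for whatever model yields the recovery cost per failure (the SPN evaluation in the presentation), the nearest-grid-point lookup is an assumed policy, and the recovery probability constraint from the last bullet is not modelled. All function names are hypothetical.

```python
from itertools import product


def build_table(mobility_rates, failure_rates, log_rates, candidate_ms, cost_fn):
    """Configuration time: store the cost-minimizing M for each rate combination."""
    table = {}
    for key in product(mobility_rates, failure_rates, log_rates):
        table[key] = min(candidate_ms, key=lambda m: cost_fn(m, *key))
    return table


def nearest(values, x):
    """Return the grid value closest to the measured value x."""
    return min(values, key=lambda v: abs(v - x))


def select_m(table, grids, measured):
    """Runtime: look up M for the grid point nearest to the measured rates.

    grids: (mobility_rates, failure_rates, log_rates) used to build the table
    measured: (mobility_rate, failure_rate, log_rate) observed at runtime
    """
    key = tuple(nearest(grid, x) for grid, x in zip(grids, measured))
    return table[key]
```

A caller would build the table once from its own cost model and then call select_m whenever the measured rates change, so no model evaluation is needed on the mobile host at runtime.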
