System Reliability Axioms: Understanding Failure Modes & Recovery Strategies

Operating System Reliability

Some Axioms • Some simple systems, designed from scratch, sometimes work • A complex system that works is invariably found to have evolved from a simple system that works • A complex system designed from scratch never works

Failure-Mode Theorems • Complex systems usually operate in failure mode • A system should have safe behaviors when encountering failures • When a “fail-safe” system fails, it fails by failing to fail safe

Some definitions • Failure occurs when the system does not perform its services in the manner specified • Failures can be subtle (e.g., performance fault) • Fault is anomalous physical condition • Includes system specification/implementation mistakes • Error is part of system state that differs from its intended value

Classification of Failures • Process failures • System failures • Secondary storage failures • Communication medium failures

Process Failures • Examples • Computation results in incorrect outcome • System state deviates from specification • Process fails to progress • Errors leading to failure • Deadlock, timeout, protection violation • Bad input, consistency violation • Ignoring malicious behavior

System Failures • Processor fails to execute • Software error, hardware error (CPU, bus, etc.) • Fail-stop behavior assumed • Failure types • Amnesia • Partial-amnesia • Pause • Halting

Secondary Storage Failures • Stored data inaccessible • Parity error • Head crash • Contaminated medium • Reconstructable from archive + log, maybe • Mirrored disks (independent failure mode)

Communication Medium Failures • Site can’t communicate with another site • Causes • Switching node failure • Hardware failure • Software failure • Congestion • Link failure • Hardware • Implementation failure • Network partitions can result

Recovery • Restart process/processor • Reclaim resources • Undo/finish incomplete transactions • Concurrency makes things harder

Forward Error Recovery • Goal: To restore system from erroneous state to error-free state • If nature of error is completely known • Remove error from state • Proceed with execution from error-free state • Rarely possible to do

Backward Error Recovery • When error source unknown • Restore state to previous error-free state; restart • Independent of the fault and of the errors it causes • Problems • Performance penalty • No guarantee fault will not recur • Possible unrecoverable component of state • Recovery point: saved state used to replace the erroneous state

Backward Error Recovery • Basic approaches • Operation-based • Logs • Update-in-place • Write-ahead-log • State-based

Update-in-Place • Every update to an object also writes a log record • Name of object • Old and new states of object • Recoverable update operation implemented as • Do, undo, redo operations
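The do/undo/redo idea can be sketched in a few lines. This is a minimal illustration, not a production design: the names `LogRecord` and `RecoverableStore` are hypothetical, the log lives in memory rather than on stable storage, and an old state of `None` is assumed to mean "the object did not exist".

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class LogRecord:
    name: str  # name of the object that was updated
    old: Any   # state before the update (used by undo)
    new: Any   # state after the update (used by redo)


class RecoverableStore:
    """Toy in-memory object store with an operation log (hypothetical names)."""

    def __init__(self) -> None:
        self.objects: Dict[str, Any] = {}
        self.log: List[LogRecord] = []

    def do(self, name: str, value: Any) -> LogRecord:
        """Update the object in place and log its old and new states."""
        record = LogRecord(name, self.objects.get(name), value)
        self.objects[name] = value
        self.log.append(record)
        return record

    def undo(self, record: LogRecord) -> None:
        """Restore the state the object had before the logged update."""
        if record.old is None:  # assumption: None means "did not exist"
            self.objects.pop(record.name, None)
        else:
            self.objects[record.name] = record.old

    def redo(self, record: LogRecord) -> None:
        """Reapply the logged update."""
        self.objects[record.name] = record.new


store = RecoverableStore()
store.do("balance", 100)
rec = store.do("balance", 70)
store.undo(rec)
after_undo = store.objects["balance"]
store.redo(rec)
after_redo = store.objects["balance"]
```

Because each record carries both the old and the new state, the same log entry supports rolling an update back and replaying it.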

Write-ahead Log • Update-in-place has a problem if a crash occurs after the update but before the log record reaches stable storage • Update object only after undo log recorded • Before committing updates, record both redo and undo logs • Expensive to write log to stable storage
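A rough sketch of the write-ahead rule and of crash recovery follows. It rests on several assumptions: `WalStore` and its methods are hypothetical names, a Python list stands in for stable storage, each log record holds both undo and redo information, and transactions are assumed not to interleave writes to the same object.

```python
from typing import Any, Dict, List, Tuple


class WalStore:
    """Toy store; `stable_log` stands in for stable storage (hypothetical names)."""

    def __init__(self) -> None:
        self.stable_log: List[Tuple] = []
        self.objects: Dict[str, Any] = {}

    def update(self, txn: str, name: str, new: Any) -> None:
        old = self.objects.get(name)  # assumption: None means "did not exist"
        # Write-ahead rule: append the log record before changing the object.
        self.stable_log.append(("update", txn, name, old, new))
        self.objects[name] = new

    def commit(self, txn: str) -> None:
        self.stable_log.append(("commit", txn))

    def recover(self) -> None:
        """Redo committed updates, then undo uncommitted ones."""
        committed = {rec[1] for rec in self.stable_log if rec[0] == "commit"}
        for rec in self.stable_log:  # redo in log order
            if rec[0] == "update" and rec[1] in committed:
                self.objects[rec[2]] = rec[4]
        for rec in reversed(self.stable_log):  # undo in reverse log order
            if rec[0] == "update" and rec[1] not in committed:
                if rec[3] is None:
                    self.objects.pop(rec[2], None)
                else:
                    self.objects[rec[2]] = rec[3]


store = WalStore()
store.update("T1", "a", 1)
store.commit("T1")
store.update("T2", "a", 2)
store.update("T2", "b", 5)
# Simulated crash here: T2 never commits.
state_at_crash = dict(store.objects)
store.recover()
state_after_recovery = dict(store.objects)
```

In this run the uncommitted changes made by T2 are rolled back, leaving only the value written by the committed transaction T1.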

State-Based Recovery • Save entire process state at recovery point • Recovery point called checkpoint • Rolling back process: restoring to checkpoint • Tradeoff: frequent checkpoints vs. completion delay • Shadow pages • Save unmodified page copy on stable storage • Update only volatile copy; discard on rollback
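Checkpointing and rollback can be illustrated with a small sketch. It assumes the whole process state fits in one Python object and that a deep copy in memory is an acceptable stand-in for saving to stable storage; `CheckpointedProcess` is a hypothetical name.

```python
import copy
from typing import Any


class CheckpointedProcess:
    """Toy process whose entire state is saved at each recovery point."""

    def __init__(self, state: Any) -> None:
        self.state = state
        # The initial state serves as the first checkpoint.
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self) -> None:
        """Save a full copy of the current state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Restore the state saved at the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


proc = CheckpointedProcess({"counter": 0, "items": []})
proc.state["counter"] = 5
proc.state["items"].append("a")
proc.checkpoint()

# Work done after the checkpoint is lost on rollback.
proc.state["counter"] = 99
proc.state["items"].append("bad")
proc.rollback()
restored = proc.state
```

The deep copy matters: a shallow copy would share the nested list with the live state, so later updates would silently corrupt the checkpoint.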

Concurrent Systems Recovery • Rollback issues • Orphan messages • Domino effect • Lost messages • Livelocks

Orphan Messages (a message prior to a checkpoint is sent to the future) • [Figure: timelines for processes X, Y, and Z with checkpoints x1, x2, y1, y2, z1, z2, message m, and the recovery point marked]

Domino Effect • Suppose Y rolls back to y2 • m is an orphan message • Process Y must roll back to y1 • Suppose Z rolls back to z2 • Y rolls back to y1 • Now a message from the future is sent to the past, prior to a checkpoint • Forcing Z to roll back to z1

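Lost Messages • [Figure: timelines for processes X and Z with checkpoints x1 and z1, message m, a failure, and the recovery point marked]

Livelocks • [Figure: timelines for processes X and Z with checkpoints x1 and z1, a repeated failure, and the recovery point marked]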

Concurrent Recovery • Coordination required at either • Time of establishing checkpoints • Beginning of recovery

Checkpoint Assumptions • Communication via messages • Unreliable FIFO channels • Higher-level end-to-end protocols assumed • Subsumes rollback-caused message loss • No network partitions from communication failures

Checkpoint Algorithm Concepts • Permanent and tentative checkpoints • Saved on stable storage • Permanent: part of known consistent global checkpoint • Tentative: until successful termination of checkpoint algorithm • Rolls back only to permanent checkpoints

Synchronous Checkpoint Algorithms • Two-phase commit • Problems: • Message overhead for synchronizations • Synchronization delays • Costly when failures are rare
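The two-phase structure, with tentative checkpoints that become permanent only if every process succeeds, might look roughly like the sketch below. It is a simplified illustration and not a specific published algorithm: `Participant`, `coordinated_checkpoint`, and the `healthy` flag are hypothetical, and message passing is replaced by direct method calls.

```python
from typing import Any, List


class Participant:
    """Toy process that can hold one tentative and one permanent checkpoint."""

    def __init__(self, name: str, state: Any) -> None:
        self.name = name
        self.state = state
        self.healthy = True        # hypothetical flag: can it checkpoint now?
        self.permanent = state     # last permanent checkpoint
        self.tentative = None
        self.has_tentative = False

    def take_tentative(self) -> bool:
        if not self.healthy:
            return False
        self.tentative, self.has_tentative = self.state, True
        return True

    def make_permanent(self) -> None:
        if self.has_tentative:
            self.permanent = self.tentative
        self.discard_tentative()

    def discard_tentative(self) -> None:
        self.tentative, self.has_tentative = None, False

    def rollback(self) -> None:
        # Processes roll back only to permanent checkpoints.
        self.state = self.permanent


def coordinated_checkpoint(participants: List[Participant]) -> bool:
    # Phase 1: every participant is asked to take a tentative checkpoint.
    votes = [p.take_tentative() for p in participants]
    # Phase 2: commit only if all succeeded; otherwise discard everywhere.
    if all(votes):
        for p in participants:
            p.make_permanent()
        return True
    for p in participants:
        p.discard_tentative()
    return False


x, y = Participant("X", 10), Participant("Y", 20)
x.state, y.state = 11, 21
first_ok = coordinated_checkpoint([x, y])

x.state, y.state = 12, 22
y.healthy = False
second_ok = coordinated_checkpoint([x, y])
x.rollback()
y.rollback()
```

The second round fails because Y cannot checkpoint, so both processes keep the permanent checkpoints from the first round and roll back to them.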

Asynchronous Checkpointing • Local checkpoints taken independently • Log all incoming messages on stable storage • Minimizes undone computation • Allows reprocessing of messages after rollback

Asynchronous Checkpointing Assumptions • Reliable FIFO communication channels • Infinite buffers • Event-driven computation • A process is idle until a message is received • Processes the message and changes state • Sends zero or more messages • Can identify each event with a monotonically increasing counter
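The event-driven model in these assumptions can be sketched as follows. The class name, the integer payloads, and the forwarding rule are all invented for illustration; only the shape (idle until a message arrives, one state change per event, zero or more outgoing messages, a monotonically increasing event counter) comes from the slide.

```python
from collections import deque
from typing import Deque, List, Tuple


class EventDrivenProcess:
    """Toy process that only does work when a message arrives."""

    def __init__(self, name: str, peer: str) -> None:
        self.name = name
        self.peer = peer          # hypothetical single destination
        self.state = 0
        self.event_count = 0      # monotonically increasing event counter
        self.inbox: Deque[int] = deque()

    def step(self) -> List[Tuple[str, int]]:
        """Handle one message, if any, and return the messages to send."""
        if not self.inbox:
            return []             # idle: no event, counter unchanged
        payload = self.inbox.popleft()
        self.event_count += 1     # each received message is one event
        self.state += payload     # process the message and change state
        # Illustrative rule: forward a smaller payload, or send nothing.
        if payload > 1:
            return [(self.peer, payload - 1)]
        return []


x = EventDrivenProcess("X", peer="Y")
x.inbox.append(3)
first_out = x.step()   # handles the message and emits one outgoing message
second_out = x.step()  # inbox is empty, so the process stays idle
```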

Event-Driven Computation • [Figure: timelines for processes X, Y, and Z with points labeled x1, x2, y1, y2, z1, z2]

Asynchronous Checkpointing • Basic idea • Save states, messages sent at each event • Volatile logging • Each processor notes number of messages sent to others, and received from others • Use counters to determine orphan messages
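One way to use the sent and received counters to spot orphan messages is sketched below. It assumes each process keeps a per-peer count of messages sent and received; the function name `orphan_pairs` and the nested-dictionary layout are hypothetical. A receiver that has recorded more messages from a sender than the sender's restored state says it sent is holding orphans.

```python
from typing import Dict, List, Tuple

# counts[p][q] is the number of messages p sent to (or received from) q.
Counts = Dict[str, Dict[str, int]]


def orphan_pairs(sent: Counts, received: Counts) -> List[Tuple[str, str]]:
    """Return (sender, receiver) pairs where the receiver holds orphan messages."""
    orphans = []
    for receiver, by_sender in received.items():
        for sender, n_received in by_sender.items():
            n_sent = sent.get(sender, {}).get(receiver, 0)
            # More receipts than sends means some sends were undone by rollback.
            if n_received > n_sent:
                orphans.append((sender, receiver))
    return sorted(orphans)


# After X rolls back, its restored state records only 1 message sent to Y,
# but Y's state still records 2 messages received from X.
sent = {"X": {"Y": 1}, "Y": {"X": 2}}
received = {"Y": {"X": 2}, "X": {"Y": 2}}
orphans = orphan_pairs(sent, received)
```

Here Y would need to roll back to a point where it had received at most one message from X, while the Y-to-X direction is consistent and needs no action.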

Summary • Failures caused by errors • Can remove errors by forward/backward error recovery • Backward error-recovery more costly, more general • Synchronous checkpoints helpful, costly • Asynchronous checkpoints messier, domino effects
