
COMP 655: Distributed/Operating Systems

Winter 2012

Mihajlo Jovanovic

Week 7: Fault Tolerance

Fault Tolerance

  • Fault tolerance concepts

  • Implementation – distributed agreement

  • Distributed agreement meets transaction processing: 2- and 3-phase commit

    Bonus material

  • Implementation – reliable point-to-point communication

  • Implementation – process groups

  • Implementation – reliable multicast

  • Recovery

  • Sparing

Fault tolerance concepts

  • Availability – can I use it now?

    • Usually quantified as a percentage

  • Reliability – can I use it for a certain period of time?

    • Usually quantified as MTBF

  • Safety – will anything really bad happen if it does fail?

  • Maintainability – how hard is it to fix when it fails?

    • Usually quantified as MTTR

Comparing nines

  • 1 year = 8760 hr

  • Availability levels

    • 90% = 876 hr downtime/yr

    • 99% = 87.6 hr downtime/yr

    • 99.9% = 8.76 hr downtime/yr

    • 99.99% = 52.56 min downtime/yr

    • 99.999% = 5.256 min downtime/yr
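
The downtime figures above come from multiplying the length of a year by the unavailability. A small sketch of the arithmetic (illustrative only):

```python
# Allowed downtime per year for a given availability target.
HOURS_PER_YEAR = 365 * 24                      # 8760 hr, as on the slide

def downtime_hr_per_year(availability_pct):
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for a in (90, 99, 99.9, 99.99, 99.999):
    hr = downtime_hr_per_year(a)
    print(f"{a}%: {hr:.3f} hr/yr ({hr * 60:.2f} min/yr)")
```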

Exercise: how to get five nines

  • Brainstorm what you would have to deal with to build a single-machine system that could run for five years with about 26 minutes of downtime (five nines). Consider:

    • Hardware failures, especially disks

    • Power failures

    • Network outages

    • Software installation

    • What else?

  • Come up with some ideas about how to solve the problems you identify

Multiple machines at 99%

Assuming independent failures
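
The charts for this slide and the next two (95% and 80% machines) are not reproduced in this transcript. Under the independence assumption the combined availability is easy to compute; a minimal sketch, with the `redundant`/`series` names chosen here for illustration:

```python
# Combined availability of n machines, each with individual availability a,
# assuming failures are independent.

def redundant(a, n):
    """System is up if at least one of the n machines is up."""
    return 1 - (1 - a) ** n

def series(a, n):
    """System is up only if all n machines are up."""
    return a ** n

for a in (0.99, 0.95, 0.80):
    print(f"a={a}: 3 redundant -> {redundant(a, 3):.6f}, "
          f"3 in series -> {series(a, 3):.6f}")
```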

Multiple machines at 95%

Assuming independent failures

Multiple machines at 80%

Assuming independent failures

Things to watch out for in availability requirements

  • What constitutes an outage …

    • A client PC going down?

    • A client applet going into an infinite loop?

    • A server crashing?

    • A network outage?

    • Reports unavailable?

    • If a transaction times out?

    • If 100 transactions time out in a 10 min period?

    • etc

More to watch out for

  • What constitutes being back up after an outage?

  • When does an outage start?

  • When does it end?

  • Are there outages that don’t count?

    • Natural disasters?

    • Outages due to operator errors?

  • What about MTBF?

Ways to get 99% availability

  • MTBF = 99 hr, MTTR = 1 hr

  • MTBF = 99 min, MTTR = 1 min

  • MTBF = 99 sec, MTTR = 1 sec
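
All three combinations follow from the same relationship between availability, MTBF, and MTTR; a one-line check (illustrative):

```python
# Availability as the fraction of time the system is up:
#   availability = MTBF / (MTBF + MTTR), independent of the time unit.
def availability(mtbf, mttr):
    return mtbf / (mtbf + mttr)

print(availability(99, 1))   # 0.99 whether the unit is hours, minutes, or seconds
```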


More definitions

A fault causes an error, which may cause a failure.

  • Types of faults:

  • transient

  • intermittent

  • permanent

Fault tolerance is continuing to work correctly in the presence of faults.

Types of failures

If you remember one thing

  • Components fail in distributed systems on a regular basis.

  • Distributed systems have to be designed to deal with the failure of individual components so that the system as a whole

    • Is available and/or

    • Is reliable and/or

    • Is safe and/or

    • Is maintainable

      depending on the problem it is trying to solve and the resources available …

Fault Tolerance

  • Fault tolerance concepts

  • Implementation – distributed agreement

  • Distributed agreement meets transaction processing: 2- and 3-phase commit

Two-army problem

  • Red army has 5,000 troops

  • Blue army and White army have 3,000 troops each

  • Attack together and win

  • Attack separately and be defeated one at a time

  • Communication is by messenger, who might be captured

  • The Blue and White generals have no way to know when a messenger is captured

Activity: outsmart the generals

  • Take your best shot at designing a protocol that can solve the two-army problem

  • Spend ten minutes

  • Did you think of anything promising?

Conclusion: go home

  • “agreement between even two processes is not possible in the face of unreliable communication”

Byzantine generals

  • Assume perfect communication

  • Assume n generals, m of whom should not be trusted

  • The problem is to reach agreement on troop strength among the non-faulty generals

Byzantine generals - example

n = 4, m = 1

(units are K-troops)

  • Multicast troop-strength messages

  • Construct troop-strength vectors

  • Compare notes: majority rules in each component

  • Result: 1, 2, and 4 agree on (1,2,unknown,4)
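
A minimal sketch of this n=4, m=1 exchange, with general 3 as the traitor. The values the traitor sends are made up for illustration, and the code follows the three steps listed above rather than being a general Byzantine-agreement implementation:

```python
LOYAL = {1: 1, 2: 2, 4: 4}     # loyal generals' troop strengths (K-troops)
ALL = (1, 2, 3, 4)

def first_round(sender, receiver):
    # Step 1: every general multicasts its strength; the traitor (3) tells
    # each receiver something different.
    return LOYAL[sender] if sender != 3 else 10 + receiver

# Step 2: each loyal general assembles the vector of values it heard.
vector = {p: {g: (LOYAL[p] if g == p else first_round(g, p)) for g in ALL}
          for p in ALL if p != 3}

def forwarded(sender, receiver):
    # Step 3: generals forward their vectors; the traitor forwards garbage.
    if sender == 3:
        return {g: 20 + receiver + g for g in ALL}
    return vector[sender]

def majority(values):
    for v in set(values):
        if values.count(v) > len(values) // 2:
            return v
    return "UNKNOWN"

for p in (1, 2, 4):
    received = [forwarded(q, p) for q in ALL if q != p]
    result = {g: majority([r[g] for r in received]) for g in ALL}
    print(p, result)   # every loyal general gets {1: 1, 2: 2, 3: 'UNKNOWN', 4: 4}
```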

Doesn’t work with n=3, m=1

Fault Tolerance

  • Fault tolerance concepts

  • Implementation – distributed agreement

  • Distributed agreement meets transaction processing: 2- and 3-phase commit

Distributed commit protocols

  • What is the problem they are trying to solve?

    • Ensure that a group of processes all do something, or none of them do

    • Example: in a distributed transaction that involves updates to data on three different servers, ensure that all three commit or none of them do

2-phase commit

[Figure: finite-state machines for the coordinator and for a participant in 2-phase commit]
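
Since the state-machine figures do not survive in this transcript, here is a minimal, single-process sketch of the message flow they describe. The class, state, and message names (Participant, READY, GLOBAL_COMMIT, and so on) are illustrative, not a production implementation:

```python
# Phase 1: the coordinator collects votes; Phase 2: it broadcasts the decision.
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit, self.state = can_commit, "INIT"

    def on_vote_request(self):
        self.state = "READY" if self.can_commit else "ABORT"
        return "VOTE_COMMIT" if self.can_commit else "VOTE_ABORT"

    def on_decision(self, decision):
        if self.state != "ABORT":
            self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"

def coordinator(participants):
    votes = [p.on_vote_request() for p in participants]               # phase 1
    decision = ("GLOBAL_COMMIT" if all(v == "VOTE_COMMIT" for v in votes)
                else "GLOBAL_ABORT")
    for p in participants:                                            # phase 2
        p.on_decision(decision)
    return decision

group = [Participant(), Participant(), Participant(can_commit=False)]
print(coordinator(group), [p.state for p in group])   # one NO vote aborts all
```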

If coordinator crashes

  • Participants could wait until the coordinator recovers

  • Or, they could try to figure out what to do among themselves

    • Example, if P contacts Q, and Q is in the COMMIT state, P should COMMIT as well

2-phase commit

What to do when P, in READY state, contacts Q

  • If all surviving participants are in READY state, either

    • wait for the coordinator to recover, or

    • elect a new coordinator (?)
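
A sketch of the cooperative-termination rule this slide and the previous one describe, for a participant P stuck in READY. The INIT case is standard 2PC reasoning (if some participant has not even voted, commit cannot have been decided); the names are illustrative:

```python
def decide_from_peer(q_state):
    if q_state == "COMMIT":
        return "COMMIT"                 # the coordinator must have decided commit
    if q_state in ("ABORT", "INIT"):
        return "ABORT"                  # commit can no longer have been decided
    return None                         # Q is READY too; it cannot help

def terminate(peer_states):
    """peer_states: states reported by the other surviving participants."""
    for s in peer_states:
        decision = decide_from_peer(s)
        if decision:
            return decision
    return "BLOCK"                      # everyone READY: wait for the coordinator

print(terminate(["READY", "COMMIT"]))   # -> COMMIT
print(terminate(["READY", "READY"]))    # -> BLOCK (or elect a new coordinator)
```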

3-phase commit

  • Problem addressed:

    • Non-blocking distributed commit in the presence of failures

    • Interesting theoretically, but rarely used in practice

3-phase commit

[Figure: finite-state machines for the coordinator and for a participant in 3-phase commit]

Bonus material

  • Implementation – reliable point-to-point communication

  • Implementation – process groups

  • Implementation – reliable multicast

  • Recovery

  • Sparing

RPC, RMI crash & omission failures

  • Client can’t locate server

  • Request lost

  • Server crashes after receipt of request

  • Response lost

  • Client crashes after sending request

Can’t locate server

  • Raise an exception, or

  • Send a signal, or

  • Log an error and return an error code

    Note: hard to mask distribution in this case

Request lost

  • Timeout and retry

  • Back off to “cannot locate server” if too many timeouts occur
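
A hedged sketch of this timeout-and-retry rule; `send_request` is a hypothetical callable that raises socket.timeout when no reply arrives in time:

```python
import socket

def call_with_retry(send_request, timeout_s=1.0, max_attempts=5):
    for _ in range(max_attempts):
        try:
            return send_request(timeout=timeout_s)    # resend on each attempt
        except socket.timeout:
            continue                                  # request or reply was lost
    raise ConnectionError("cannot locate server")     # back off after too many timeouts
```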

Server crashes after receipt of request

  • Possible semantic commitments

    • Exactly once

    • At least once

    • At most once

[Figure: server crash after receipt of a request: normal case, crash after the work is done, crash before the work is done]

Behavioral possibilities

  • Server events

    • Process (P)

    • Send completion message (M)

    • Crash (C)

  • Server order

    • P then M

    • M then P

  • Client strategies

    • Retry every message

    • Retry no messages

    • Retry if unacknowledged

    • Retry if acknowledged

Combining the options
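
The table that belongs on this slide is not reproduced in this transcript. A small sketch that enumerates the combinations from the previous slide (assuming one crash, at most one retry, and that a retried request succeeds) shows why no client strategy gives exactly-once behavior:

```python
from itertools import product

SERVER_ORDERS = ["PM", "MP"]      # process then message, or message then process
CRASH_POINTS = [0, 1, 2]          # crash before the 1st event, between, or after the 2nd
STRATEGIES = ["always", "never", "if_no_ack", "if_ack"]

def times_work_done(order, crash, strategy):
    done = "P" in order[:crash]   # did processing happen before the crash?
    acked = "M" in order[:crash]  # did the completion message get out?
    retry = {"always": True, "never": False,
             "if_no_ack": not acked, "if_ack": acked}[strategy]
    return int(done) + int(retry)  # a retry re-executes the request

for order, crash, strategy in product(SERVER_ORDERS, CRASH_POINTS, STRATEGIES):
    n = times_work_done(order, crash, strategy)
    print(f"{order} crash@{crash} retry={strategy:9s} -> done {n} time(s)")
```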

Lost replies

  • Make server operations idempotent whenever possible

  • Structure requests so that server can distinguish retries from the original
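
One common way to let the server distinguish retries, sketched with an illustrative request-ID cache (the names are assumptions, not a specific library):

```python
# Deduplicate non-idempotent requests by tagging each logical request with an ID.
completed = {}                     # request_id -> cached reply

def handle(request_id, operation):
    if request_id in completed:    # a retry of a request we already executed
        return completed[request_id]
    reply = operation()            # execute exactly once
    completed[request_id] = reply  # remember the reply for future retries
    return reply

print(handle("req-1", lambda: "transferred $10"))
print(handle("req-1", lambda: "transferred $10"))   # retry: no second transfer
```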

Client crashes

  • The server-side activity is called an orphan computation

  • Orphans can tie up resources, hold locks, etc

  • Four strategies (at least)

    • Extermination, based on client-side logs

      • Client writes a log record before and after each call

      • When client restarts after a crash, it checks the log and kills outstanding orphan computations

      • Problems include:

        • Lots of disk activity

        • Grand-orphans

Client crashes, continued

  • More approaches for handling orphans

    • Re-incarnation, based on client-defined epochs

      • When client restarts after a crash, it broadcasts a start-of-epoch message

      • On receipt of a start-of-epoch message, each server kills any computation for that client

    • “Gentle” re-incarnation

      • Similar, but server tries to verify that a computation is really an orphan before killing it

Yet more client-crash strategies

  • One more strategy

    • Expiration

      • Each computation has a lease on life

      • If not complete when the lease expires, a computation must obtain another lease from its owner

      • Clients wait one lease period before restarting after a crash (so any orphans will be gone)

      • Problem: what’s a reasonable lease period?

Common problems with client-crash strategies

  • Crashes that involve network partition

    (communication between partitions will not work at all)

  • Killed orphans may leave persistent traces behind, for example

    • Locks

    • Requests in message queues

Bonus material

  • Implementation – reliable point-to-point communication

  • Implementation – process groups

  • Implementation – reliable multicast

  • Recovery

  • Sparing

How to do it?

  • Redundancy applied

    • In the appropriate places

    • In the appropriate ways

  • Types of redundancy

    • Data (e.g. error correcting codes, replicated data)

    • Time (e.g. retry)

    • Physical (e.g. replicated hardware, backup systems)

Triple Modular Redundancy
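
The figure for this slide (not reproduced here) shows triplicated modules whose outputs pass through majority voters. The voter itself is tiny; a minimal sketch:

```python
def vote(a, b, c):
    """Forward the majority value, masking a single faulty replica."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one replica disagrees")

print(vote(42, 42, 17))   # the one faulty replica is masked -> 42
```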

Tandem Computers

  • TMR on

    • CPUs

    • Memory

  • Duplicated

    • Buses

    • Disks

    • Power supplies

  • A big hit in operations systems for a while

Replicated processing

  • Based on process groups

  • A process group consists of one or more identical processes

  • Key events

    • Message sent to one member of a group

    • Process joins group

    • Process leaves group

    • Process crashes

  • Key requirements

    • Messages must be received by all members

    • All members must agree on group membership

Flat or non-flat?

Effective process groups require

  • Distributed agreement

    • On group membership

    • On coordinator elections

    • On whether or not to commit a transaction

  • Effective communication

    • Reliable enough

    • Scalable enough

    • Often, multicast

    • Typically looking for atomic multicast

Process groups also require

  • Ability to tolerate crash failures and omission failures

    • Need k+1 processes to deal with up to k silent failures

  • Ability to tolerate performance, response, and arbitrary failures

    • Need 3k+1 processes to reach agreement with up to k Byzantine failures

    • Need 2k+1 processes to ensure that a majority of the system produces the correct results with up to k Byzantine failures
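
The same three sizing rules as a small helper (illustrative names):

```python
def processes_needed(k, failure_model):
    return {"crash_or_omission": k + 1,        # k+1 masks k fail-silent processes
            "byzantine_agreement": 3 * k + 1,  # agreement despite k Byzantine failures
            "byzantine_majority": 2 * k + 1}[failure_model]

print(processes_needed(1, "byzantine_agreement"))   # 4, as in the generals example
```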

Bonus material

  • Implementation – reliable point-to-point communication

  • Implementation – process groups

  • Implementation – reliable multicast

  • Recovery

  • Sparing

Reliable multicasting

Exercise: Reliable Multicast

  • Brainstorm issues that could arise when designing a solution to guarantee delivery of messages to all members in a process group:

    • Number of receivers?

    • Is ordering of messages important?

    • Can processes crash and re-join the group?

Scalability problem

  • Too many acknowledgements

    • One from each receiver

    • Can be a huge number in some systems

    • Also known as “feedback implosion”

Basic feedback suppression in scalable reliable multicast

  • If a receiver decides it has missed a message,

    • it waits a random time, then multicasts a retransmission request

    • while waiting, if it sees the same retransmission request from another receiver, it suppresses its own request

  • The sender multicasts all retransmissions
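
A hedged sketch of the randomized suppression step, modelled here with asyncio; the timer bound and function names are illustrative:

```python
import asyncio, random

async def nack_with_suppression(seq, requests_seen, multicast_nack, max_wait=0.5):
    await asyncio.sleep(random.uniform(0, max_wait))  # wait a random time first
    if seq in requests_seen:      # someone else already requested this message
        return                    # suppress our own retransmission request
    multicast_nack(seq)           # otherwise, multicast the request ourselves
```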

Hierarchical feedback suppression for scalable reliable multicast

  • messages flow from root toward leaves

  • acks and retransmit requests flow toward root from coordinators

  • each group can use any reliable small-group multicast scheme

Atomic multicast

  • Often, in a distributed system, reliable multicast is a step toward atomic multicast

  • Atomic multicast is atomicity applied to communications:

    • Either all members of a process group receive a message, OR

    • No members receive it

  • Often requires some form of order agreement as well

How atomic multicast helps

  • Assume we have atomic multicast, among a group of processes, each of which owns a replica of a database

  • One replica goes down

  • Database activity continues

  • The process comes back up

  • Atomic multicast allows us to figure out exactly which transactions have to be re-played

More concepts

  • Group view

  • View change

  • Virtually synchronous

    • A stronger form of reliable multicast

    • Each message is received by all non-faulty processes, or

    • If sender crashes during multicast, message could be ignored by all processes

Virtual synchrony picture

Basic idea:

in virtual synchrony, a multicast cannot cross a view-change

Receipt vs Delivery

Remember totally-ordered multicast …

What about multicast message order?

  • Two aspects:

    • Relationship between sending order and delivery order

    • Agreement on delivery order

  • Send/delivery ordering relationships

    • Unordered

    • FIFO-ordered

    • Causally-ordered

  • If receivers agree on delivery order, it’s called totally-ordered multicast

Unordered

Process P1      Process P2      Process P3
sends m1        delivers m1     delivers m2
sends m2        delivers m2     delivers m1


FIFO-ordered

Agreement on: m1 before m2, m3 before m4

Process P1      Process P2      Process P3      Process P4
sends m1        delivers m1     delivers m3     sends m3
sends m2        delivers m3     delivers m1     sends m4
                delivers m2     delivers m2
                delivers m4     delivers m4


Six types of virtually synchronous reliable multicast

[Table: the six types combine the relationship between sending order and delivery order (none, FIFO-ordered, causally-ordered) with whether or not receivers agree on a total delivery order]

Implementing virtual synchrony

Don’t deliver a message until it’s been received everywhere -

but “everywhere” can change

  • Process 7’s crash (in the slide’s figure) is detected by process 4, which sends a view-change message

  • Processes forward unstable messages, followed by flush

  • When have flush from all processes in new view, install new view

Bonus material

  • Implementation – reliable point-to-point communication

  • Implementation – process groups

  • Implementation – reliable multicast

  • Recovery

  • Sparing

Recovery from error

  • Two main types:

    • Backward recovery to a checkpoint (assumed to be error-free)

    • Forward recovery (infer a correct state from available data)

More about checkpoints

  • They are expensive

  • Usually combined with a message log

  • Message logs are cleared at checkpoints

  • Recovering a crashed process:

    • Restart it

    • Restore its state to the most recent checkpoint

    • Replay the message log
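
A minimal sketch of this recovery sequence using a pickled checkpoint and an append-only message log; the file names and the `apply` step are illustrative:

```python
import pickle

def apply(state, msg):                    # illustrative: messages just add to a counter
    return state + msg

def take_checkpoint(state):
    with open("checkpoint.bin", "wb") as f:
        pickle.dump(state, f)
    open("msglog.bin", "wb").close()      # the log is cleared at each checkpoint

def log_message(msg):
    with open("msglog.bin", "ab") as f:
        f.write(pickle.dumps(msg))

def recover():
    with open("checkpoint.bin", "rb") as f:
        state = pickle.load(f)            # 1. restore the most recent checkpoint
    with open("msglog.bin", "rb") as f:   # 2. replay the logged messages
        while True:
            try:
                state = apply(state, pickle.load(f))
            except EOFError:
                break
    return state

take_checkpoint(0); log_message(5); log_message(7)
print(recover())                          # -> 12
```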

Recovery line = the most recent distributed snapshot

Domino effect

Bonus material

  • Implementation – reliable point-to-point communication

  • Implementation – process groups

  • Implementation – reliable multicast

  • Recovery
