Distributed Systems
Fundamentals Stream Session 8: Fault Tolerance & Dependability I
CSC 253
Gordon Blair, François Taïani

Overview of the day

Today: Fault-Tolerance in DS / Guest Lecture

  • 9-10am: Fundamental Stream / Fault Tolerance (a)
  • 10-11am: Guest Lecturer Prof. Jean-Charles Fabre
    • How to assess the robustness of OS and middleware
  • 12-1pm: Fundamental Stream / Fault Tolerance (b)

Next week: Advanced Fault-Tolerance

Overview of the session
  • A famous example (video)
  • Basic dependability concepts
    • Also covered in CSC 365 (Year B) "Safety Critical Systems"
  • Fault-tolerance at the node level: Replication
  • Fault-tolerance at the network level: TCP
  • Fault-tolerance at the environment level: Data centres

Associated reading: Chapter 7 ("Fault Tolerance") of Tanenbaum & van Steen

A Famous Example
  • April 14, 1912

(Video shown in the lecture.)

A Major Disaster
  • One of the worst peacetime maritime disasters
    • more than 1,500 dead
    • only 706 survivors
  • The wreck was only discovered in 1985
    • 2 1/2 miles down on the ocean floor

The Titanic and Fault-Tolerance
  • How fault-tolerant was the Titanic?
    • best technology of the time: deemed "practically unsinkable"
    • 16 watertight compartments
    • Titanic could float with its first 4 compartments flooded
  • Some flaws in the design:
    • poor turning ability
    • bulkheads only went as high as E-deck
    • not enough lifeboats!

What Does It Teach Us?
  • Good to plan ahead for problems
    • Like collision and hull damage for a ship
  • Replication / redundancy is the key to survival
    • Titanic contained 16 watertight compartments
    • Could still float with the first four flooded (& other combinations)
    • But always a limit to what you can tolerate
  • But you also need diversity / independence
    • Compartments are watertight
    • Critical issue: prevent water propagation (deck E issue)
  • The case of the Titanic can be applied to many other systems
    • Generic concepts have been developed to discuss them

Dependability Concepts
  • Failure: what we want to avoid
    • Titanic: sinking and killing people
    • Distributed System: stop functioning correctly
  • Fault: a problem that could cause a failure
    • Titanic: the iceberg, design flaws
    • Distributed System: hardware crash, software bug
  • Error: an erroneous internal state that can lead to a failure
    • Titanic: damaged hull, water in compartment
    • DS: erroneous data, inconsistent internal behaviour
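
To make the chain concrete in distributed-systems terms, here is a minimal, hypothetical Python sketch (not from the slides): the bug is the fault, the corrupted internal value is the error, and the wrong reply delivered to the client is the failure.

def average(values):
    # FAULT: a programming bug; the divisor should be len(values)
    return sum(values) / (len(values) + 1)

def handle_request(values):
    result = average(values)       # ERROR: the internal value 'result' is now wrong
    return f"average = {result}"   # FAILURE: the service delivers an incorrect reply

print(handle_request([2, 4, 6]))   # prints 'average = 3.0' instead of the correct 4.0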

System and Sub-Systems
  • Faults, Errors, Failures are recursive notions
  • A system is usually made of components, i.e. sub-systems
    • Titanic: hull, decks, compartments, smoke stacks
    • Distributed system: nodes, network links, routers, hubs, services
      • Nodes contain CPUs, hard disks, network card, ...
      • Network comprises routers, hubs, gateways, name servers ...
  • Failure of internal component: a fault for enclosing system

Means to Dependability
  • Fault prevention:
    • Do not go through the Arctic Ocean (no icebergs)
    • Buy good hardware (less faulty hardware)
    • Hire skilled programmers (fewer bugs)
  • Fault removal
    • Blast the iceberg (not practicable as external fault)
    • Test and replace faulty hardware before it is used
    • Test and remove bugs
  • Fault tolerance
    • Use replication / redundancy and diversity
    • Prevent water / error propagation

Quality of Service
  • The notion of 'failure' is not black or white
    • Consider a web server taking 5 minutes to deliver each request
    • It fulfils its functional specification (to deliver web pages)
    • It works better than a server that would not reply at all
    • But can it be considered to be functioning correctly?
  • Lots of intermediary situations
    • Quality of Service (QoS) of a system: how 'good' the system is
      • Multiple dimensions: latency, bandwidth, security, availability, reliability, ...
    • 'Failure' is usually reserved for when a service becomes unusable
    • Goal: highest QoS in spite of faults (and at the lowest cost)

(Diagram: a spectrum of service quality, ranging from complete failure to optimal service.)

Graceful Degradation
  • Ideally fault tolerance should mask any internal failure/fault
    • A router crashes but you do not even notice
  • In practice not always possible
    • Essentially because of cost + fault assumption
  • Goal is then to minimise impact of faults on system user
    • Avoid complete failure
    • Maintain highest possible QoS in spite of faults
  • This is called "Graceful Degradation"
    • Extremely important: it gives administrators time to react
    • Users continue using the service, albeit with a lower QoS

Distributed Systems and Fault Tolerance
  • Fault tolerance requires distribution
    • For replication and independence of failure modes
    • i.e. to tolerate earthquakes → not all servers in San Francisco
  • Distribution requires fault-tolerance
    • In large distributed systems, partial node failures unavoidable
    • Single points of failure to be avoided as much as possible

What Can Go Wrong in a DS?
  • Nodes (servers, clients, peers)
    • Faulty hardware → crash or data corruption
    • Power failure → crash (and sometimes data corruption)
    • Any "physical" accident: fire, flood, earthquake, ...
  • Network
    • Same as nodes:
      • Routers / gateways: whole subnet impacted
      • Name servers: whole name domain impacted
    • Congestion (dropped packets)
  • Users / People
    • Involuntary mistakes
    • Security attacks (external & internal, i.e. from legitimate users)

Main Steps of Fault Tolerance
  • Error Detection
    • Detecting that something is wrong (like water in the keel)
    • The earlier the better
    • The earlier the more difficult
  • Error Recovery
    • Preventing errors from propagating (watertight door)
    • Removing errors
    • Does not mean the root cause (the fault) has been removed
  • Fault treatment (usually off-line, manual)
    • Fault diagnosis
    • Fault removal

Error Detection
  • Error detection in the Titanic quite “easy”
    • No water should leak inside the boat (safety invariant)
  • Distributed Systems: more difficult
    • The correct result of a request is usually not known in advance
    • In some cases sanity or correctness check possible
    • In many cases some form of redundancy needed


(Diagram: a checker C verifies a computed result x through redundancy, e.g. checking that squaring the answer 1.414213 gives back 2: (1.414213)² ≈ 2. A sanity-check sketch follows below.)
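
A minimal Python sketch of such a sanity check, following the slide's square-root example (the function name and tolerance are assumptions for illustration): the checker does not know the correct answer in advance, but it can cheaply verify a property the answer must satisfy.

import math

def checked_sqrt(x, tol=1e-6):
    """Compute sqrt(x), then sanity-check it: squaring the result must
    (approximately) give back x, as in (1.414213)^2 ≈ 2 on the slide."""
    y = math.sqrt(x)
    if abs(y * y - x) > tol:
        raise ValueError(f"error detected: sanity check failed for sqrt({x})")
    return y

print(checked_sqrt(2.0))   # 1.4142135623730951, and squaring it gives back ~2
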
Error Recovery
  • Backward error recovery
    • Return the system to a previous, hopefully error-free, state
    • Can be achieved in a generic manner (replication, checkpointing; see the sketch after this list)
    • Risk of inconsistency in the general case
    • Can be minimised (see next week)
  • Forward error recovery
    • Bring the system into a valid future state
    • Highly application dependent
    • Makes particular sense when real-time constraints are involved
    • Examples: video streaming, flight control software
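
A minimal sketch of backward error recovery by checkpointing (class and method names are hypothetical, chosen for illustration): the service saves a presumed error-free state and rolls back to it when an error is detected.

import copy

class CheckpointedService:
    """Toy service: checkpoint() saves the current state, rollback() restores it."""

    def __init__(self):
        self.state = {"count": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)   # save a presumed error-free state

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)   # backward recovery: return to it

    def increment(self, n):
        self.state["count"] += n

svc = CheckpointedService()
svc.increment(5)
svc.checkpoint()            # state {'count': 5} saved
svc.increment(-999)         # suppose this update is later found to be erroneous
svc.rollback()              # the error is removed, but not its root cause (the fault)
print(svc.state)            # {'count': 5}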

Node FT: Replication
  • Why replicate?
    • Performance: e.g. replication of heavily loaded web servers
    • Allows error detection (when replicas disagree)
    • Allows backward error recovery
  • Importance of fault-model
    • Effect of replication depends on how individual replicas fail
    • Crash faults → availability: availability = 1 - p^n (where n = number of replicas and p = probability of individual replica failure)
  • Requirements for a replication service
    • Replication transparency ...
    • in spite of network partitions, disconnections, etc.

Focus on Availability
  • Assumptions (very important!)
    • Replicas fail independently
    • Replicas fail silently (crashes, no arbitrary behaviour)
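
Under exactly these assumptions (independent, crash-silent replicas), the availability formula from the previous slide, 1 - p^n, can be evaluated directly. A small sketch (the 5% failure probability is made up for illustration):

def availability(p, n):
    """Probability that at least one of n replicas is up, assuming replicas fail
    independently and silently with probability p each (crash faults only)."""
    return 1 - p ** n

# Replicas that are each unavailable 5% of the time:
for n in (1, 2, 3):
    print(f"{n} replica(s): availability = {availability(0.05, n):.6f}")
# 1 replica(s): availability = 0.950000
# 2 replica(s): availability = 0.997500
# 3 replica(s): availability = 0.999875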

A Generic Replication Architecture

(Diagram: clients C issue requests and receive replies through front ends FE, which communicate with the replica managers RM that implement the service.)

Passive Replication

(Diagram: a client's front end FE sends each request to the primary RM, which propagates updates to the backup RMs.)
  • What is passive replication (primary backup)?
    • FEs communicate with a single primary RM, which must then communicate with secondary RMs (slaves)
    • Requires election of new primary in case of primary failure
    • Only tolerates crash faults (silent failure of replicas)
    • Variant: primary’s state saved to stable storage (cold passive)
      • saving to stable storage known as “checkpointing”
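
A highly simplified, single-process sketch of the primary-backup idea (all names are hypothetical; a real implementation also needs failure detection, election of a new primary, and checkpointing to stable storage):

class ReplicaManager:
    """Holds a copy of the service state and applies updates to it."""
    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value

class PrimaryRM(ReplicaManager):
    """The primary executes each request, then propagates the update to its backups."""
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def handle_request(self, key, value):
        self.apply(key, value)          # 1. execute the request locally
        for backup in self.backups:     # 2. push the update to every backup RM
            backup.apply(key, value)
        return "ok"                     # 3. reply (via the front end) to the client

backups = [ReplicaManager(), ReplicaManager()]
primary = PrimaryRM(backups)
primary.handle_request("x", 42)
print(primary.state, backups[0].state, backups[1].state)   # all show {'x': 42}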

Active Replication

(Diagram: each client's front end FE multicasts its requests to every replica manager RM.)
  • What is active replication?
    • Front Ends multicast requests to every Replica Manager
    • Appropriate reliable group communications needed
    • ordering guarantees crucial (see W4: Group Communication)
    • Tolerates crash faults and arbitrary faults
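
A similarly simplified sketch of the active-replication idea (hypothetical names; the plain loop below stands in for the reliable, totally ordered multicast a real system would use): the front end sends each request to every replica manager and votes over the replies, which can mask a faulty minority.

from collections import Counter

class ReplicaManager:
    """Each replica executes every request on its own copy of the state."""
    def __init__(self):
        self.total = 0

    def handle(self, amount):
        self.total += amount
        return self.total

class FrontEnd:
    def __init__(self, replicas):
        self.replicas = replicas

    def request(self, amount):
        # "Multicast" the request to all replicas, then vote over their replies.
        replies = [rm.handle(amount) for rm in self.replicas]
        answer, _votes = Counter(replies).most_common(1)[0]
        return answer

fe = FrontEnd([ReplicaManager(), ReplicaManager(), ReplicaManager()])
print(fe.request(10))   # 10
print(fe.request(5))    # 15: all correct replicas agree on the reply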

Comparing Active vs. Passive

  • Passive replication: less expensive and less complex, but less powerful
  • Active replication: more powerful, but more expensive and more complex

Network FT: TCP
  • TCP is a fault-tolerant protocol
    • correct messages arrive even if packets get lost or corrupted
  • A TCP packet:

(Diagram: layout of a TCP packet, annotated with the fault model and with the checksum field used for error detection against corruption; the packet format is covered in CSC 251: Networking.)

TCP Checksum
  • TCP checksum: the 16-bit blocks of the packet are added and complemented (sketched in code below)
    • (plus some additional one's-complementing here and there)
    • 2^16 (~65,000) possible combinations
    • TCP packets typically around 1,500 bytes → 2^12000 combinations
    • Several packets are bound to have the same checksum
      • Roughly: 2^12000 ÷ 2^16 = 2^11984 ≈ 3.5 × 10^3607 packets per checksum value
  • The trick
    • Usual corruption is unlikely to produce a consistent checksum: it would take multiple, well-placed corruptions
    • Notion of Hamming distance
    • Additional checksums at the network (IP) and link (Ethernet) layers
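
A sketch of the checksum computation described above, following the usual Internet-checksum recipe (RFC 1071 style: one's-complement sum of 16-bit words, then complemented); the TCP header and pseudo-header details are deliberately omitted:

def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                     # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                      # fold the carries back in (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

segment = b"an example TCP payload"
checksum = internet_checksum(segment)
print(hex(checksum))
# The receiver recomputes the checksum over what it received; a mismatch with the
# transmitted checksum means the segment was corrupted in transit.
print(internet_checksum(segment) == checksum)   # True for an uncorrupted segment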

 (!)

multiple corruptions

G. Blair/ F. Taiani

hamming distance

0110

+

1010

0110

1010

Hamming Distance
  • Between 2 strings of bits:
    • the number of bit flips required to go from one to the other
  • Distance between 1011101 and 1001001? 2
  • Minimum Hamming distance of a code
    • the minimal Hamming distance between 2 valid "code words"
    • i.e. the minimal number of corruptions needed to land back on a valid packet
  • Simple checksum:
    • packets of 8 bits
    • checksum on 4 bits: adds the two 4-bit blocks
  • What is the minimal Hamming distance of this code? (answered on the next slide; a sketch follows below)
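
A small helper, not from the slides, that computes the Hamming distance between two equal-length bit strings and confirms the example above:

def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    assert len(a) == len(b), "bit strings must have the same length"
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("1011101", "1001001"))   # 2, as in the example above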

Hamming Distance
  • The minimum Hamming distance of this code is 2
    • Note that the 2 corruptions must obey a particular pattern
    • In a network, corruptions usually occur in bursts

(Worked example from the slides, using the blocks 0110 and 1010 and the value 1100: a single bit flip makes the recomputed checksum disagree with the stored one, so the packet is invalid and the corruption is detected; a suitably placed second flip makes them match again, so the corrupted packet looks valid and the corruption goes undetected.)
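
The claim can be checked by brute force over the slide's toy code. The sketch below assumes the 4-bit checksum is the sum of the two 4-bit data blocks modulo 16 (the slides do not show the exact carry convention, but the result is the same if bitwise XOR is used instead):

from itertools import combinations

def codeword(d1: int, d2: int) -> str:
    """12-bit code word: two 4-bit data blocks followed by their 4-bit checksum."""
    checksum = (d1 + d2) % 16              # assumption: plain addition modulo 16
    return f"{d1:04b}{d2:04b}{checksum:04b}"

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

words = [codeword(d1, d2) for d1 in range(16) for d2 in range(16)]
min_distance = min(hamming(a, b) for a, b in combinations(words, 2))
print(min_distance)   # 2: two well-placed bit flips turn one valid packet into another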

What Is in a Checksum?
  • The TCP checksum is a form of error detection code
    • Min. Hamming distance reflects strength of the detection
    • Far more complex error detection codes exist
    • For instance CRC: cyclic redundancy check
  • It is a form of redundancy
    • information redundancy: if the packet is OK, there is no need for the checksum
    • partial redundancy: impossible to reconstruct the packet from the checksum alone
  • When the code used is "strong" enough (min. Hamming distance > 2), it becomes possible to reconstruct the "most likely" correct packet
    • Error correction codes (not seen in this course)
    • They provide error recovery

Environment FT: Data Centres
  • Mission critical systems (your company’s web server)
    • Little point in addressing only one type of fault
    • Major risks: power failure & overheating
  • Collocation Provider: rents luxury “parking places” for servers
    • Multiple network operators
    • Secured physical access
    • Multiple power sources (including power generator)
    • Highly robust air conditioning
  • Example: Redbus Interhouse plc (http://www.interhouse.org/)
    • 3 centres in London, others in Amsterdam, Frankfurt, Milan, Paris
    • Even there things can go wrong

http://www.theregister.co.uk/2004/06/16/redbus_power_fails/

What we'll see next week
  • Refinement of what we have seen today
  • Replication scheme: the problem of consistency
    • Consistent check-pointing for primary backup
    • Fault-tolerant total ordering for active replication
  • Contrast replication and distributed transactions
    • Short flash-back to Week 3 lecture

References
  • Chapter 7 of Tanenbaum & van Steen
  • On the Titanic
    • http://www.charlespellegrino.com/Time%20Line.htm
      • Very detailed account
    • http://octopus.gma.org/space1/titanic.html
  • Fundamental concepts of dependability
    • by Algirdas Avizienis, Jean-Claude Laprie and Brian Randell
    • http://www.cert.org/research/isw/isw2000/papers/56.pdf

Expected Learning Outcomes

At the end of this 8th Unit:

  • You should understand the basic dependability concepts of fault, error, failure
  • You should know the basic steps of fault-tolerance
  • You should appreciate fundamental fault-tolerance techniques like replication and error-detection codes
  • You should be able to explain and compare the active and passive replication styles
