Fault Tolerant Computing Basics
Dan Siewiorek, Carnegie Mellon University, June 2012
Preview
  • Many terms have multiple usages, which can lead to confusion when they are used out of context
    • Sources of error
  • Faults go through at least ten stages from inception to repair, so the designer had better plan for all ten stages
    • Relationship between the sequence of events in handling a fault and mathematical measures
Outline
  • Introduction
  • Definitions
  • Sources of Errors
WHY RELIABILITY?
  • Three of the driving factors:
    • Critical applications
      • Computer outage or error can cause loss of money, time, or life
      • No longer just in aerospace, but also in more mundane applications (customer expectations)
    • Increasing system complexity
      • More components mean more likelihood of failure (counter: increased reliability of VLSI)
      • Lower signal/noise ratios at higher VLSI speeds mean more likelihood of transient errors
      • Diagnosis is more difficult, downtime is longer, and repair costs rise; inventory costs increase too
    • Relative cost is less
AVAILABILITY EXAMPLE
  • 90 MINUTES DOWNTIME PER WEEK
  • AVAILABILITY 0.991
  • RESERVATION SYSTEM -- $36,000/MINUTE DOWN
  • $3.24 MILLION PER WEEK
  • 0.1% AVAILABILITY = 10 MINUTES = $360,000
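The arithmetic behind this example can be checked with a short sketch. The function names and the flat per-minute cost model are assumptions made for illustration; only the 90 minutes, the $36,000/minute rate, and the one-week period come from the slide.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes

def availability(downtime_min, period_min=MINUTES_PER_WEEK):
    """Fraction of the period during which the system is up."""
    return 1 - downtime_min / period_min

def outage_cost(downtime_min, cost_per_min):
    """Total cost of an outage at a flat per-minute rate (assumed model)."""
    return downtime_min * cost_per_min

weekly_availability = availability(90)        # about 0.991
weekly_cost = outage_cost(90, 36_000)         # 3,240,000 dollars
tenth_percent_min = 0.001 * MINUTES_PER_WEEK  # 10.08 minutes, rounded to 10 on the slide
tenth_percent_cost = outage_cost(10, 36_000)  # 360,000 dollars
```

Under these assumptions, each 0.1% of weekly availability corresponds to roughly ten minutes of downtime, which is why small availability differences translate into large costs.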

Univac I Checkers
  • Parity
    • Memory
    • Input to function table
    • Output from function table, odd number of selected gates. Dummy lines preserve parity
    • Unitypes
  • 1-of-n
    • Intermediate line function table
    • Memory bank select
Univac I Checkers (cont’d)
  • Duplication
    • Registers
    • Adder
    • Comparator
    • Multiplier-quotient coupler
    • Bus amplifier
    • Bus interface
  • Automatic voltage monitoring system tests every DC voltage at a rate of one per minute
  • “720 checker” counts 720 characters per I/O block
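The three checker families named on these slides (parity, 1-of-n, duplication) can be illustrated in software. This is a minimal sketch of the general techniques, not a model of the Univac I circuitry; the function names and the choice of odd parity are assumptions made for illustration.

```python
def odd_parity_bit(bits):
    """Return the check bit that makes the total number of 1s odd."""
    return (sum(bits) + 1) % 2

def odd_parity_ok(word):
    """A word (data bits plus check bit) is valid if it holds an odd number of 1s."""
    return sum(word) % 2 == 1

def one_of_n_ok(lines):
    """A 1-of-n code is valid only when exactly one line is active."""
    return sum(lines) == 1

def duplicate_and_compare(op, *args):
    """Run the same operation twice and flag any disagreement.

    In hardware the two results would come from two physical units;
    calling the function twice here only illustrates the comparison step.
    """
    first = op(*args)
    second = op(*args)
    if first != second:
        raise RuntimeError("duplicated units disagree")
    return first

data = [1, 0, 1, 1]
word = data + [odd_parity_bit(data)]
corrupted = [word[0] ^ 1] + word[1:]  # single-bit flip
```

Here `odd_parity_ok(word)` holds while `odd_parity_ok(corrupted)` does not: parity detects any single-bit error but cannot locate or correct it.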
Definitions
  • RELIABILITY: SURVIVAL PROBABILITY
    • When repair is costly or function is critical
  • AVAILABILITY: THE FRACTION OF TIME A SYSTEM MEETS ITS SPECIFICATION
    • When service can be delayed or denied
  • REDUNDANCY: EXTRA HARDWARE, SOFTWARE, TIME
Stages in the development of a system

STAGE                    ERROR SOURCES            ERROR DETECTION

Specification & design   Algorithm design         Simulation
                         Formal specification     Consistency checks, model checking

Prototype                Algorithm design         Stimulus/response testing
                         Wiring & assembly
                         Timing
                         Component failure

Manufacture              Wiring & assembly        System testing
                         Component failure        Diagnostics

Installation             Assembly                 System testing
                         Component failure        Diagnostics

Field operation          Component failure        Diagnostics
                         Operator errors
                         Environmental factors

Cause-effect sequence
  • FAILURE: component does not provide service
  • FAULT: deviation of logic function from design value
    • Hard, Transient
  • ERROR: manifestation of a fault by incorrect value
Fault Classification
  • DURATION:
    • Transient - design errors, environment
    • Intermittent - repair by replacement
    • Permanent - repair by replacement
  • EXTENT:
    • Local (independent)
    • Distributed (related)
  • VALUE:
    • Determinate (stuck at X)
    • Indeterminate (variable)
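The three classification axes above map naturally onto a small data structure. The sketch below is one possible encoding; the class names and the `needs_replacement` helper are hypothetical, and the helper simply restates the slide's note that intermittent and permanent faults are repaired by replacement.

```python
from dataclasses import dataclass
from enum import Enum

class Duration(Enum):
    TRANSIENT = "transient"
    INTERMITTENT = "intermittent"
    PERMANENT = "permanent"

class Extent(Enum):
    LOCAL = "local"              # independent
    DISTRIBUTED = "distributed"  # related

class Value(Enum):
    DETERMINATE = "determinate"      # stuck at X
    INDETERMINATE = "indeterminate"  # variable

@dataclass(frozen=True)
class Fault:
    duration: Duration
    extent: Extent
    value: Value

    def needs_replacement(self):
        # Per the slide: intermittent and permanent faults are repaired by replacement.
        return self.duration in (Duration.INTERMITTENT, Duration.PERMANENT)

stuck_at = Fault(Duration.PERMANENT, Extent.LOCAL, Value.DETERMINATE)
glitch = Fault(Duration.TRANSIENT, Extent.LOCAL, Value.INDETERMINATE)
```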
Basic Steps in Fault Handling
  • Fault Confinement -- contain it before it can spread
  • Fault Detection -- find out about it to prevent acting on bad data
  • Fault Masking -- mask effects
  • Retry -- since most problems are transient, just try again
  • Diagnosis -- figure out what went wrong as prelude to correction
  • Reconfiguration -- work around a defective component
  • Recovery -- resume operation after reconfiguration in degraded mode
  • Restart -- re-initialize (warm restart; cold restart)
  • Repair -- repair defective component
  • Reintegration -- after repair, go from degraded to full operation
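A few of these steps (detection, retry, reconfiguration, recovery in degraded mode) can be sketched in software. This is an illustrative sketch only: the function name, the use of exceptions as the detection mechanism, and the single-spare model are all assumptions, and the remaining steps (confinement, masking, diagnosis, restart, repair, reintegration) are not modeled.

```python
def run_with_fault_handling(primary, spare, max_retries=2):
    """Call primary(); retry on failure, then fall back to spare().

    Returns (result, log), where log records each detected fault and
    any reconfiguration.
    """
    log = []
    for attempt in range(1 + max_retries):
        try:
            return primary(), log
        except Exception as exc:  # detection: the fault surfaces as an exception
            log.append(f"fault detected on attempt {attempt + 1}: {exc}")
    # Retries exhausted: treat the fault as hard and work around the component.
    log.append("reconfigured: using spare (degraded mode)")
    return spare(), log

calls = {"n": 0}

def flaky():
    """Fails once (a transient fault), then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("transient")
    return "ok"

def dead():
    """Always fails (a hard fault)."""
    raise IOError("hard")

flaky_result, flaky_log = run_with_fault_handling(flaky, lambda: "spare")
dead_result, dead_log = run_with_fault_handling(dead, lambda: "spare")
```

The transient case is resolved by a single retry, while the hard fault exhausts its retries and is handled by switching to the spare.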
MTBF -- MTTD -- MTTR

Availability = MTTF / (MTTF + MTTR)
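The availability formula is easy to evaluate numerically. In the sketch below the function names and the example figures (1000 hours MTTF, 10 hours MTTR) are hypothetical values chosen for illustration and do not come from the slides.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def max_mttr_for_target(mttf_hours, target_availability):
    """Largest MTTR that still meets a target availability (formula solved for MTTR)."""
    return mttf_hours * (1 - target_availability) / target_availability

example = steady_state_availability(1000, 10)  # hypothetical figures
halved = steady_state_availability(1000, 5)    # same MTTF, repair twice as fast
```

With these figures, halving the repair time raises availability from about 0.990 to about 0.995, so shortening MTTR improves availability without any change to MTTF.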

Error Containment Levels
  • For distributed systems there are additional levels
    • Containment to a single node or FTU
    • Containment to a single bus or subsystem
    • Containment to a single vehicle/piece of equipment in a national infrastructure
“Mainframe” Outage Sources

(* the sum of these sources was 0.75)

Tandem Causes of System Failures

(Up is good; down is bad)

Tandem Hardware Causes of Outage
  • Disks 49%
  • Communications 24%
  • Processors 18%
  • Timing 9%
  • Spares 1%
Tandem Operations Causes of Outage
  • Procedures 42%
  • Configurations 39%
  • Move 13%
  • Overflow 4%
  • Upgrade 1%
Tandem Maintenance Causes of Outage
  • Disk 67%
  • Communication 20%
  • Processor 13%
Tandem Environmental Outages
  • Extended Power Loss 80%
  • Earthquake 5%
  • Flood 4%
  • Fire 3%
  • Lightning 3%
  • Halon Activation 2%
  • Air Conditioning 2%
  • Total MTBF about 20 years
  • MTBAoG* about 100 years
    • Roadside highway equipment will be more exposed than this

* (AoG = “Act of God”)

CMU Andrew File Server Study
  • Configuration
    • 13 SUN II Workstations with 68010 processor
    • 4 Fujitsu Eagle Disk Drives
  • Observations
    • 21 Workstation Years
  • Frequency of events
    • Permanent Failures 29
    • Intermittent Faults 610
    • Transient Faults 446
    • System Crashes 298
  • Mean Time To
    • Permanent Failures 6552 hours
    • Intermittent Faults 58 hours
    • Transient Faults 354 hours
    • System Crash 689 hours
Some Interesting Ratios
  • Permanent Outages/Total Crashes = 0.1
  • Intermittent Faults/Permanent Failures = 21
    • Thus the first symptom appears over 1200 hours (about 21 intermittent faults × 58 hours each) prior to repair
  • (Crashes - Permanent)/Total Faults = 0.255
  • 14/29 failures had three or fewer error log entries
    • 8/29 had no error log entries
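These ratios follow directly from the event counts reported for the CMU Andrew file server study. The sketch below recomputes them; the variable names are illustrative, and "total faults" is taken to mean intermittent plus transient faults, an assumption that reproduces the slide's 0.255.

```python
permanent_failures = 29
intermittent_faults = 610
transient_faults = 446
system_crashes = 298
mean_hours_to_intermittent = 58

permanent_per_crash = permanent_failures / system_crashes
intermittent_per_permanent = intermittent_faults / permanent_failures

# Roughly 21 intermittent faults precede each permanent failure, about 58 hours
# apart, so the first symptom leads the repair by about 21 * 58 hours.
warning_hours = round(intermittent_per_permanent) * mean_hours_to_intermittent

# Assumption: "total faults" = intermittent + transient faults.
crash_ratio = (system_crashes - permanent_failures) / (
    intermittent_faults + transient_faults
)
```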