MISSION / SAFETY CRITICAL ARCHITECTURES
Presented by Linda S. Alger
INTRUSION TOLERANT WORKSHOP
Williamsburg, Virginia
5 October 1999

Overview


  • Mission/Safety Critical Fault Tolerance
  • Fault Tolerant Techniques
  • Examples of Draper Fault Tolerant Systems
  • Backup


Mission / Safety Critical Fault Tolerance
  • Fault Tolerance is the ability to provide intended functionality in the presence of faults. Redundancy is used to mitigate the effects of system error due to faults.
  • There are two classes of faults that are of particular concern to mission and safety critical applications:
    • Random hardware faults
    • Software / Common mode faults
  • Techniques have been developed for both classes of faults
    • Random hardware faults:
      • Formally validated solutions using hardware redundancy and exact consensus to tolerate arbitrarily malicious (Byzantine) faults
      • Continuous Self Tests for latent faults
      • State restoration
    • Software / Common mode faults:
      • Design diversity
      • Exception handlers
      • Watchdog timers


Hardware Redundancy, Synchronization & Exact Consensus
  • Technique
    • Hardware redundancy using exact consensus with rigorously implemented fault containment regions and tight synchronization
  • Problems being addressed
    • Random hardware fault
    • High coverage fault detection and identification
      • identifying all failure modes and methods for dealing with them
      • dealing with arbitrarily malicious (Byzantine) faults
    • Avoidance of output errors through fault masking by voters
  • Pros
    • No need to exhaustively identify all possible failure modes
    • System amenable to formal verification methods
  • Cons
    • Does not protect from common mode faults
    • Assumes recovery from first fault occurs prior to occurrence of second fault
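
Fault masking by an exact-match voter can be illustrated with a short Python sketch. It assumes that each redundant channel delivers one word per frame and that healthy channels agree bit for bit; the function name `majority_vote` and the sample values are hypothetical and do not come from the slides.

```python
from collections import Counter

def majority_vote(words):
    """Exact-match voter: return (voted_value, indexes_of_disagreeing_channels).

    Raises ValueError when no value is held by a strict majority of channels.
    """
    value, count = Counter(words).most_common(1)[0]
    if count * 2 <= len(words):
        raise ValueError("no majority among channels")
    # Channels whose word differs from the voted word are fault suspects.
    suspects = [i for i, w in enumerate(words) if w != value]
    return value, suspects

# Four channels, channel 2 delivers a corrupted word.
voted, suspects = majority_vote([0x1A2B, 0x1A2B, 0xFFFF, 0x1A2B])
```

The voter never needs to know how channel 2 failed, which is the point of the first "Pros" bullet: any disagreement is masked, whatever its cause.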


State Restoration
  • Technique
    • Resynchronize and reset the state of the channel being recovered using the fault tolerant clock, inter-channel data exchange from known ‘good’ channels and voting network
    • Incremental recovery using tagged memory and data exchange network
  • Problems being addressed
    • Random hardware fault
    • Recovering temporarily failed or repaired hardware
  • Pros
    • Recovery process is a straightforward application of the existing cross-channel data exchange and voting hardware
  • Cons
    • System is offline during all or part of the recovery process
    • Incremental recovery may not converge


Design Diversity
  • Technique
    • N-version programming with confidence voter, layered on Fault Tolerant Processor with attached processors
  • Problems being addressed
    • Common mode software problems without losing Byzantine resilience to random hardware faults
    • Coincident software faults across N versions
  • Pros
    • Unified approach addressing random hardware faults and common mode software faults
    • Isolation between hardware and software faults
  • Cons
    • Cannot use the exact consensus approach to vote software versions; thresholds must be set to balance false alarms against missed detections
    • Not enough testing was done to prove that the confidence voter solves coincident software faults
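
The threshold trade-off in the first "Cons" bullet can be seen in a small Python sketch. The slides do not describe how the confidence voter works, so this is not that voter: it is a generic median-select voter with a hypothetical `tolerance` parameter, and the sample outputs are invented for illustration.

```python
from statistics import median

def threshold_vote(outputs, tolerance):
    """Inexact voter: select the median and flag versions that deviate
    from it by more than `tolerance` (an assumed, tunable threshold)."""
    mid = median(outputs)
    flagged = [i for i, x in enumerate(outputs) if abs(x - mid) > tolerance]
    return mid, flagged

# Four diverse versions; version 3 is wrong, version 1 is merely less precise.
outputs = [10.02, 9.98, 10.01, 10.40]
tight = threshold_vote(outputs, 0.01)   # flags versions 1 and 3 (one false alarm)
medium = threshold_vote(outputs, 0.10)  # flags only version 3
loose = threshold_vote(outputs, 0.50)   # flags nothing (missed detection)
```

Diverse versions legitimately differ in their low-order digits, so a bit-for-bit comparison would flag all of them; any workable threshold sits between the two failure cases shown.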


Hierarchical Approach to Fault Management
  • Near Perfect Fault Detection, Identification and Reconfiguration of Random, Arbitrarily Malicious Faults to a Fault Containment Region
  • Continuous Hardware based Fault Masking and Detection
    • Near Real-Time (msec.) Software based Fault Isolation and Reconfiguration
  • Resolution of Transient, Intermittent, and Hard faults by Heuristics
  • Software based Self-Tests (Built-In-Test) Run in Background on a time-available basis:
    • Uncover Latent Faults (Especially with Voters, etc.)
    • Localize Faults to Module or Chip Level
  • On-line Repair & Diagnostic Capability
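
The slides do not say which heuristics separate transient, intermittent, and hard faults. As a purely hypothetical illustration of the idea, the Python sketch below classifies a channel from its recent per-frame error history; the rule and the `hard_run` parameter are assumptions, not the Draper method.

```python
def classify_fault(error_history, hard_run=3):
    """Classify a channel from per-frame booleans (True = error detected).

    Assumed rule: one isolated error is transient, an error run of at least
    `hard_run` frames that is still ongoing is hard, anything else is
    intermittent.
    """
    if not any(error_history):
        return "healthy"
    longest = run = 0
    for err in error_history:
        run = run + 1 if err else 0   # length of the current error run
        longest = max(longest, run)
    if longest >= hard_run and error_history[-1]:
        return "hard"
    if sum(error_history) == 1:
        return "transient"
    return "intermittent"
```

A classifier of this kind would sit above the hardware voters: the voters mask every error immediately, while the classification only decides whether the channel is retried, watched, or reconfigured out.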


Seawolf Ship Control Processing Unit (SCPU)
  • Quadruply redundant fault tolerant computer
  • Fault isolation to the card level
  • Heuristics to resolve transient, intermittent, & hard faults
  • Channel recovery with SCPU offline for < 150 msec.
  • Continuous self tests for latent faults
  • Exception handlers, watchdog timer, & overrun flag for common mode software faults


Flight Critical Computer for the X-38 / Crew Return Vehicle
  • Fault Tolerant Parallel Processor
    • Eight Processors: one quad group & four simplex groups
  • COTS hardware and software
  • Four Fault Containment Regions
    • Expandable to Five
  • Network Element provides:
    • Hardware synchronization
    • Source congruent data exchange & voting
    • Message passing between parallel processors
    • Error detection
  • Exception handlers, watchdog timer, overrun flag and memory management for software faults


Fault Tolerant Processor with Attached Processors (FTP/AP)
  • Quadruply redundant fault tolerant processor with 4 attached processors
    • Four software versions of critical function on AP
  • Fault isolation to hardware FCR or software version
  • Confidence voter instead of majority voter used to resolve software version discrepancies and the issue of coincident errors
  • Hardened kernel approach used for operating system software
  • Recovery of software versions is accomplished by continuous execution with output masked
    • Output is compared to voted output
    • If output agrees for several iterations, version is restored


Fault Containment Region (FCR)
  • An FCR is a collection of components that operates correctly regardless of any arbitrary logical or electrical fault outside the region.
  • An arbitrary logical or electrical fault in an FCR cannot cause the hardware outside the region to misbehave or fail in any manner.
  • Faults cannot propagate across containment regions but their effects (errors) can.


Error Containment
  • Voting planes are used to mask errors at different stages in a fault tolerant system.
    • Input voting prevents a failed sensor value from propagating to application programs.
    • Internal computer voting prevents erroneous data from a failed Fault Containment Region from propagating to other FCRs.
    • Output voting and monitor/interlock mechanisms prevent outputs of failed FCRs from propagating outside the computational core.
    • Actuator voting masks errors in the transmission mechanism connecting computer to actuators.


Requirements for Exact Consensus & Byzantine Resilience
  • Requirements for Exact Consensus:
    • Identical initial states
    • Identical inputs
    • Identical operations
    • No faults
    • Bounded skew
  • Theoretically correct implementation of f-Byzantine Resilience requires:
    • Bit-wise comparison of results emanating from redundant sites of equivalent state complexity
    • 3f+1 fault containment regions (FCRs)
    • 2f+1 inter-FCR connectivity
    • f+1 round inter-FCR protocol
    • FCR synchronism
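
The three counting requirements above are simple functions of f. The Python helper below evaluates them; the function name and dictionary keys are hypothetical.

```python
def byzantine_requirements(f):
    """Minimum resources needed to tolerate f simultaneous Byzantine faults,
    using the 3f+1 / 2f+1 / f+1 rules listed on the slide."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return {
        "fcrs": 3 * f + 1,          # fault containment regions
        "connectivity": 2 * f + 1,  # disjoint inter-FCR paths
        "rounds": f + 1,            # rounds of inter-FCR exchange
    }

single_fault = byzantine_requirements(1)
```

For f = 1 this gives four FCRs and two rounds of exchange, which matches the quadruply redundant systems described in this presentation.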


A Correct Solution: 4 FCRs
  • Four participants in input distribution algorithm


A Correct Solution: 2 Rounds of Exchange
  • 2-Round input distribution algorithm
  • Vote exchanged values
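
The slide diagrams are not part of this transcript, so the Python sketch below is an assumed reconstruction of a two-round input distribution for one source and three receivers (four FCRs, f = 1). In round 1 the source sends its value to every receiver; in round 2 each receiver relays what it got to the other two; each receiver then votes on its three copies. The names `two_round_exchange`, `relay_fault`, and `DEFAULT` are hypothetical.

```python
from collections import Counter

DEFAULT = None  # assumed fallback when the copies show no majority

def vote(values):
    """Strict-majority vote over the exchanged copies."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else DEFAULT

def two_round_exchange(sent_by_source, relay_fault=None):
    """sent_by_source: dict receiver -> value received in round 1.
    relay_fault: optional (receiver, bad_value); that receiver relays
    bad_value to its peers in round 2 instead of what it received.
    Returns dict receiver -> voted value."""
    results = {}
    for r in sent_by_source:
        seen = [sent_by_source[r]]            # round 1: direct copy
        for other in sent_by_source:          # round 2: relayed copies
            if other == r:
                continue
            if relay_fault and other == relay_fault[0]:
                seen.append(relay_fault[1])
            else:
                seen.append(sent_by_source[other])
        results[r] = vote(seen)
    return results

healthy = two_round_exchange({"B": 7, "C": 7, "D": 7})
bad_relay = two_round_exchange({"B": 7, "C": 7, "D": 7}, relay_fault=("D", 99))
bad_source = two_round_exchange({"B": 1, "C": 2, "D": 3})
```

With a faulty relay, the two healthy receivers still vote 7. With a faulty source that sends three different values, every receiver sees the same set of copies and falls back to the same default, so the healthy FCRs stay in agreement either way.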