
CprE 545: FAULT-TOLERANT SYSTEMS



Presentation Transcript


  1. CprE 545: FAULT-TOLERANT SYSTEMS Lecture 1: What is fault tolerance all about?

  2. Course information • Webpage: http://www.ee.iastate.edu/~gmani/cpre545 • Book • Fault Tolerance in Distributed Systems, Pankaj Jalote, Prentice Hall • Grading • Homework: 25% • Mid-term: 25% • Term-project: 25% • Final exam: 25% • Project • Study/Implementation design • Written report of about 25 pages • Need creative component • No cheating allowed.

  3. Course outline • Dependability concepts • Dependable system, techniques for achieving dependability, dependability measures, fault, error, failure, and classification of faults and failures. • Fault-tolerant strategies • Fault detection, masking, containment, location, reconfiguration, and recovery. • Fault-tolerant design techniques • Hardware redundancy, software redundancy, time redundancy, and information redundancy. • Fault tolerance in real-time systems • Time-space tradeoff, imprecise computation, (m,k)-firm deadline model, fault-tolerant scheduling algorithms.

  4. Course outline (contd.) • Dependable communication • Dependable channels, survivable networks, fault-tolerant routing. • Fault tolerance in distributed systems • Byzantine Generals problem, consensus protocols, checkpointing and recovery, stable storage and RAID architectures, and data replication and resiliency. • Fault-tolerant interconnection networks • Hypercube, star graphs, and fault-tolerant ATM switches. • Dependability evaluation techniques and tools • Fault trees, Markov chains; HIMAP tool. • Reading of some of the state-of-the-art research material.

  5. Motivation • Systems are implemented using COTS parts • Components may fail due to various reasons • Hostile Environment • Operating conditions out of specification range • Aging • Poor design • Being able to tolerate an individual failure may save the day

  6. Why fault-tolerance? • 10,000 units of a component are used in a system • Failure rate of components: 0.5%/1000 hours • Total failure rate: λ = (0.5 × 10,000)/(100 × 1000) = 0.05/hour • Approximate unreliability = λt • Desired reliability = 0.99 • Time duration, t = (1 − 0.99)/λ = 0.01/0.05 = 1/5 hour • System goes below desired level after 12 minutes
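The arithmetic on this slide can be checked with a short script. This is only a sketch: the function names are made up for illustration, and the comparison against the exponential model R(t) = exp(-rate × t) is an added assumption (the slide itself uses only the linear approximation, which is the first-order term of that model).

```python
import math

def system_failure_rate(units, percent_per_1000_hours):
    """Failures per hour when any one of `units` identical components failing counts as a failure."""
    per_unit_rate = (percent_per_1000_hours / 100.0) / 1000.0
    return units * per_unit_rate

def hours_until_linear(rate, desired_reliability):
    """Slide's approximation: unreliability ~= rate * t, solved for t."""
    return (1.0 - desired_reliability) / rate

def hours_until_exponential(rate, desired_reliability):
    """Assumed exponential model R(t) = exp(-rate * t), solved for t."""
    return -math.log(desired_reliability) / rate

rate = system_failure_rate(10_000, 0.5)           # about 0.05 failures/hour
t_linear = hours_until_linear(rate, 0.99)         # about 0.2 hours
t_exact = hours_until_exponential(rate, 0.99)     # slightly more than 0.2 hours

print(f"failure rate: {rate:.3f} per hour")
print(f"linear approximation: {t_linear * 60:.1f} minutes")
print(f"exponential model:    {t_exact * 60:.1f} minutes")
```

Both estimates land at roughly 12 minutes, which suggests the linear approximation is adequate when the target reliability is close to 1.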

  7. Fault-tolerant computing concepts • Fault-tolerant computing • Correct execution of specified algorithm in the presence of defects • Fault tolerance is achieved using redundancy • Physical (space) or temporal (time) • Two important criteria • Availability • Reliability • Three-dimensional reliability framework • Physical: circuit/RTL/logic switch/processor/memory/links • Time: specification/design/prototyping/manufacturing/installation/operation • Cost: acquisition/repair

  8. System and fault tolerance • Function: what the system is intended for • Behavior: what it does • Structure: what makes it do what it does • System may be layered • In a layered system, each layer behaves as a component at the next layer • Function and service may be plural • [Figure labels: System, User, Component]

  9. Ownership cost and fault tolerance • Two systems • To be used for 4 years • System A • Acquiring: $2000 • Maintenance: $250/year • Total cost: $3000 • System B • Acquiring: $1000 • Maintenance: $500/year • Total cost: $3000 • But down time frustration can be avoided • [Figure: cost versus reliability, with curves for total cost, cost of acquiring, and cost of maintenance]
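The totals on this slide follow from a simple sum of acquisition cost plus yearly maintenance over the usage period. A minimal sketch, assuming maintenance is a flat yearly charge with no discounting; the function name `ownership_cost` is hypothetical:

```python
def ownership_cost(acquisition, maintenance_per_year, years):
    """Total cost of owning a system: purchase price plus flat yearly maintenance."""
    return acquisition + maintenance_per_year * years

YEARS = 4  # usage period stated on the slide

cost_a = ownership_cost(2000, 250, YEARS)  # System A: higher purchase price, lower upkeep
cost_b = ownership_cost(1000, 500, YEARS)  # System B: lower purchase price, higher upkeep

print(f"System A over {YEARS} years: ${cost_a}")
print(f"System B over {YEARS} years: ${cost_b}")
```

Under these assumptions the two systems cost the same over 4 years, so the remaining difference between them is how often they need maintenance, not the total spent.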

  10. Some definitions • Fault • Physical change → physical world • Error • Result of a fault → information world • Failure • Deviation from intended function → external effect • Three-world model: physical, informational, external

  11. Why is fault tolerance more important? • System speed is higher • More reason for timing faults • Harsher environments • Systems are employed in all kinds of applications • Novice users • Inadvertent user abuse • Higher cost for repair • Manpower and down times are expensive • Larger systems • Use more components, more chances of failure

  12. Issues • At what level to introduce redundancy? • Duplicate or triplicate? • How to manage redundancy? • Automatic or user-assisted fault tolerance? • Fault tolerance or reliability, and how are they related? • Information redundancy? • How to evaluate?
