Fault Tolerant Computer Design COMS30125

Learning Objectives: why computers fail; how to design computers that are fault tolerant; how to evaluate reliability; error-correcting codes; diagnosis.

Sophia
Presentation Transcript


    1. Fault Tolerant Computer Design (COMS30125)

    3. Why FT? What is fault tolerance? A “fault-tolerant system” is one that continues to perform at the desired level of service in spite of failures in some of the components that constitute the system. Be sure everyone has a conduct sheet. Re GROUND RULES: 1. In terms of allowed collaboration vs. individual work, ask if you are not sure. 2. Deactivate all cell phones or pagers during class unless you are on-call during your job. 3. No tape-recording permitted. Take notes.

    4. Why FT? Key attributes: fault, error, failure; performance, availability, reliability; more recently, the concept of “survivability”. Including these constraints at the design stage is likely to be more cost-effective than adding them later.

    5. Why FT? Who is concerned about fault tolerance? System users, irrespective of the application, although some are far more concerned than others. Who is concerned at the design stage? Universities: R, d, and a (Research, development, applications). Industry: r, D, and A (research, Development, Applications). Issues: design, analysis/validation, implementation, testing/validation, evaluation.

    6. Why FT? Examples: general-purpose systems. PCs: RAM with parity checks and possibly ECC (re-execution on failure detection is being investigated). Workstations/servers: error detection (HW), occasional corrective action (SW), even ECC (HW), and logging (SW).
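
The parity check mentioned for PC RAM can be illustrated with a minimal sketch. It assumes a single even-parity bit per word; the 8-bit word and the bit positions that get flipped are hypothetical values chosen for illustration, not taken from the slides. A single parity bit only detects an odd number of flipped bits and cannot correct anything, which is why ECC is listed as the stronger option.

```python
def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total number of 1 bits (word + parity) even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, parity: int) -> bool:
    """True when the word read back is still consistent with its stored parity bit."""
    return even_parity_bit(word) == parity


# Hypothetical 8-bit memory word and the parity bit stored alongside it.
stored_word = 0b1011_0010
stored_parity = even_parity_bit(stored_word)

# Simulate one upset bit, then a second upset bit in the same word.
flipped_once = stored_word ^ 0b0000_1000
flipped_twice = flipped_once ^ 0b0000_0001

print(parity_ok(stored_word, stored_parity))    # word unchanged: check passes
print(parity_ok(flipped_once, stored_parity))   # one flipped bit: check fails
print(parity_ok(flipped_twice, stored_parity))  # two flipped bits: check passes again
```

The last line shows the limitation: two flipped bits cancel out in the parity count, so the error goes unnoticed.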

    7. Why FT? Examples: reliable systems. Telephone systems; banking systems, e.g. ATMs; stock market; CAE (exams/projects); football games (display/ticketing).

    8. Why FT? Examples: critical and life-critical systems. Manned and unmanned spaceborne systems; aircraft control systems; nuclear reactor control systems; life support systems.

    9. Why FT? Examples: reliable -> critical systems. 911 telephone switching system; traffic light control system; automobile control systems (ABS, fuel injection).

    10. Introduction: historical perspective and major push; new initiatives; goals of fault tolerance; applications of fault tolerance. Do not discuss these topics much here. “Computer system overall” implies what a computer system is: its architecture and components. Then focus on hardware and software components.

    11. Introduction (contd.) Historical perspective: not a new concept; first use by J. von Neumann, 1956, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” Annals of Mathematics Studies, Princeton University Press. Major push: the space program; HW fault tolerance first, SW fault tolerance later; then merging the two.

    12. Introduction (contd.) New initiatives. Density of devices: more failures likely. Deep submicron technology and time-to-market pressure: designs not fully verified. Implementation of numerous functionalities on a chip/board/system: possibility of system hang-up. Speculative execution: results may need to be re-checked. Low cost of HW and SW: affordable/economical. Hot issues: soft errors, lifetime failures.

    13. Introduction (contd.) Goals: different goals for different applications. The key word is “reliability”, which has different meanings for different users and applications. Intuitive explanations; dependability; service; specification.

    14. Introduction (contd.) Intuitive concepts. Reliability: continues to work. Availability: works when I need it. Safety: does not put me in jeopardy. Performability; maintainability; testability. Survivability: will the system survive catastrophic events?

    15. Introduction (contd.) Applications. Spaceborne system: long-life system. Airplane control system: critical system. Transaction processing system: high-availability system. Switching system: high availability above a certain level of performance.

    16. Terminology and definitions. Reliability and the concept of probability. R(t): the conditional probability that a system provides continuous proper service over the interval [0, t], given that it provided the desired service at time 0. Availability; performability; an example; dependability.
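
To make the definition of R(t) concrete, here is a minimal sketch under one common modelling assumption that the slides do not state: a constant failure rate, which gives the exponential form R(t) = exp(-λt). The function name `reliability` and the numeric failure rate are illustrative choices only.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """R(t) under an assumed constant failure rate (exponential model).

    t and failure_rate must use matching units, e.g. hours and failures per hour.
    """
    return math.exp(-failure_rate * t)


# Hypothetical component with an assumed rate of 0.01 failures per hour.
rate = 0.01
for hours in (0, 10, 100, 500):
    print(hours, round(reliability(hours, rate), 4))
```

R(0) is 1 by construction, which matches the "given that it provided the desired service at time 0" condition, and the value can only fall as the interval grows.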

    17. 17

    18. Fault-Error-Failure concept: intuitive definitions; origins of faults; methods to break the FEF chain; attributes of faults.

    19. Fault-Error-Failure concept (contd.) Intuitive definitions. Fault: an anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or unintentional), design flaw, … A fault causes an error. Error: the effect of activation of a fault. Failure: the overall system effect of an error. Fault -> Error -> Failure.

    20. Fault-Error-Failure concept (contd.) Origins of faults: physical device level (HW); logic level (HW); chip level (HW); system level (HW/SW): interfacing, specifications, … Why systems fail.

    21. Fault-Error-Failure concept (contd.) Methods to break the FEF chain. Flow: FEF. Barriers: fault avoidance; fault masking; fault removal; fault forecasting.

    22. Fault-Error-Failure concept (contd.) Attributes of faults: cause; nature; duration; extent; value.

    24. The development of a systematic approach to reliability and fault tolerance. 1950s: theoretical work on redundancy and coding (Moore, Shannon, Hamming, and von Neumann). 1960s: fault tolerance systematically built into systems (Bell ESS, IBM 360, space systems (SATURN IV)). 1970s: reliability a primary concern in commercial designs (Tandem NonStop).

    25. Revival of interest. Vacuum tubes -> transistors -> VLSI. Computers have become increasingly complex: billions of transistors switching billions of times per second. Wear-out: the 5-year repair interval for TV sets can be reduced to a few months for computers.

    26. System reliability can be enhanced in two ways. Fault avoidance: implementing the system from ultra-reliable components that are extremely unlikely to fail. Fault tolerance (FT): designing the system so that it continues to operate correctly, with error-free execution of programs, even in the presence of certain specified faults. Fault tolerance is achieved by using protective redundancy: hardware redundancy, software redundancy, and time redundancy.
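
One widely used form of protective hardware redundancy is triple modular redundancy (TMR) with a majority voter. The slide does not name TMR, so the sketch below is an illustration of the general idea rather than the course's own example; `majority_vote`, `run_tmr`, and the two adder functions are hypothetical names.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(outputs: Sequence[T]) -> T:
    """Return the value produced by a strict majority of the replicas.

    Outputs must be hashable. Raises RuntimeError when no value has a majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def run_tmr(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and mask a minority of wrong answers."""
    return majority_vote([replica(x) for replica in replicas])


def good_adder(x: int) -> int:
    return x + 1


def faulty_adder(x: int) -> int:
    # Hypothetical faulty replica that returns a wrong result.
    return x + 2


print(run_tmr([good_adder, good_adder, faulty_adder], 41))
```

With one faulty replica out of three the voter still returns the correct result; if two replicas fail in different ways the vote has no majority and the error is reported instead of masked.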

    27. Classes of fault-tolerant systems. Ultra-reliable systems: employed in critical real-time control applications. System reliability: the probability that the system will operate correctly over the desired mission time. Example: avionic computers for unstable aircraft (NASA), with failure probability constrained to be less than $10^{-9}$ for a 10-hour mission. Fault tolerance: the maximum number of failures that may occur anywhere in the system without causing system failure.
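
To get a feel for how demanding the avionics figure is, the sketch below converts a mission failure-probability budget into an equivalent constant failure rate. The constant-rate (exponential) model is an assumption made here for illustration and is not stated in the slides; `max_failure_rate` is a hypothetical helper name.

```python
import math


def max_failure_rate(max_failure_prob: float, mission_hours: float) -> float:
    """Largest constant failure rate (per hour) that keeps
    1 - exp(-rate * mission_hours) at or below max_failure_prob."""
    # log1p keeps precision when max_failure_prob is tiny.
    return -math.log1p(-max_failure_prob) / mission_hours


rate = max_failure_rate(1e-9, 10.0)
print(f"{rate:.3e} failures per hour")
```

Under this assumption the budget works out to roughly one failure per ten billion operating hours, far beyond what a single unreplicated computer can be expected to achieve.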

    28. Long-life systems. Applications where maintenance and/or repair is impossible, e.g. unmanned spacecraft. Mean time to failure (MTTF): the expected (average) time to system failure. Example: 20-year MTTF for a communication satellite. Maximum mission time: the maximum time of operation for some specified minimum reliability. Example: reliability of 0.90 for a 10-year mission on an outer-planet exploratory vehicle.
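
The two measures on this slide can be related with a short sketch, again assuming a constant failure rate so that R(t) = exp(-t / MTTF). That model and the helper names are assumptions for illustration; the slide's two examples describe different systems, and only their numbers are reused here.

```python
import math


def max_mission_time(mttf: float, min_reliability: float) -> float:
    """Longest mission for which R(t) stays at or above min_reliability."""
    return -mttf * math.log(min_reliability)


def required_mttf(mission_time: float, min_reliability: float) -> float:
    """MTTF needed so that R(mission_time) equals min_reliability."""
    return -mission_time / math.log(min_reliability)


# Times are in years.
print(round(max_mission_time(20.0, 0.90), 2))
print(round(required_mttf(10.0, 0.90), 1))
```

Under this model a 20-year MTTF supports a mission of only about 2.1 years at reliability 0.90, while a 10-year mission at 0.90 needs an MTTF of about 95 years. MTTF on its own therefore says little about mission reliability.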

    29. Safety Critical High Reliability Systems

    30. Highly available systems. Applications where downtime is expensive: telephone switching computers; expensive high-performance systems. Mean time to repair (MTTR): the average time before the system is repaired following a failure. Mean time between failures (MTBF) = MTTF + MTTR. Maintainability: the probability that a failed system will be restored to operation within a specified period of time. Availability: the probability that a system will be operating correctly at any given time during its operation schedule.

    31. $\text{Availability} = \dfrac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} = \dfrac{\text{MTBF} - \text{MTTR}}{\text{MTBF}}$. Examples: Cray-1 (1975): MTTF = 4 hours, MTTR = 0.1 hours, Availability = 4 / 4.1 ≈ 0.98. Bell ESS goal: 20 minutes of downtime in 40 years.
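
The two examples on this slide can be checked numerically. The sketch below uses the availability formula from the slide for the Cray-1 figures and converts the Bell ESS downtime goal into an availability; the function names and the 365.25-day year are assumptions made for illustration.

```python
def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR), both in the same unit."""
    return mttf / (mttf + mttr)


def availability_from_downtime(downtime: float, period: float) -> float:
    """Fraction of the period during which the system is up."""
    return 1.0 - downtime / period


# Cray-1 figures from the slide, in hours.
cray = availability(4.0, 0.1)

# Bell ESS goal from the slide: 20 minutes of downtime in 40 years.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumed average year length
ess = availability_from_downtime(20.0, 40 * MINUTES_PER_YEAR)

print(f"Cray-1: {cray:.4f}")
print(f"Bell ESS goal: {ess:.8f}")
```

The Cray-1 value rounds to the 0.98 quoted on the slide, whereas the Bell ESS goal corresponds to an unavailability below one part in a million.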

    32. Availability:

    33. Cost of ownership as a function of reliability and maintainability features

    34. Fail-fast is good, repair is needed. Improving either MTTR or MTTF gives benefit. Simple redundancy does not help much.
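
The slide's figure is not in the transcript, so the following is only one plausible way to illustrate the claim. It uses standard textbook results for modules with independent, exponentially distributed failures, a perfect voter or switch, and (where repair applies) an exponentially distributed repair time. The module MTTF of 1000 hours and MTTR of 10 hours are hypothetical numbers.

```python
def mttf_simplex(failure_rate: float) -> float:
    """MTTF of a single module: 1 / λ."""
    return 1.0 / failure_rate


def mttf_pair(failure_rate: float, repair_rate: float = 0.0) -> float:
    """MTTF of a two-module parallel pair that fails when both modules are down.

    Markov-model result (3λ + μ) / (2λ²); repair_rate = 0 means no repair.
    """
    return (3.0 * failure_rate + repair_rate) / (2.0 * failure_rate ** 2)


def mttf_tmr_no_repair(failure_rate: float) -> float:
    """MTTF of a 2-out-of-3 (TMR) system without repair: 5 / (6λ)."""
    return 5.0 / (6.0 * failure_rate)


lam = 1.0 / 1000.0  # hypothetical module failure rate (MTTF = 1000 hours)
mu = 1.0 / 10.0     # hypothetical repair rate (MTTR = 10 hours)

print("single module      :", round(mttf_simplex(lam)))
print("pair, no repair    :", round(mttf_pair(lam)))
print("TMR, no repair     :", round(mttf_tmr_no_repair(lam)))
print("pair, with repair  :", round(mttf_pair(lam, mu)))
```

Without repair, duplication raises MTTF by only a factor of 1.5, and TMR actually lowers it to 5/6 of a single module. With repair of the failed module, the pair's MTTF grows by more than an order of magnitude, which is consistent with the slide's emphasis on repair.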

    35. 35

    36. The last 5 years: availability dark ages. Ready for a renaissance? Things got better, then things got a lot worse!

    37. A schematic of HotMail: ~7,000 servers; 100 backend stores with 120 TB; 3 data centers; links to Passport, Ad-rotator, Internet mail gateways, …; ~1B messages per day; 150M mailboxes, 100M active; ~400,000 new per day.

    38. Availability
