
HUMBOLDT-UNIVERSITÄT ZU BERLIN

INSTITUT FÜR INFORMATIK

DEPENDABLE SYSTEMS

Lecture 1

INTRODUCTION

Winter Semester 2000/2001

Instructor: Prof. Dr. Miroslaw Malek

www.informatik.hu-berlin.de/~rok/ftc

DS - IX - NFT - 1

FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline:
  • Introduction (Unit I)
    • Motivation
    • System views
    • Dependability rings
    • Dependable design methodology
  • Dependability Concepts, Measures and Models (UNIT DCMM)
    • Basic definitions
    • Dependability measures
    • Dependability models
    • Examples
    • Dependability evaluation tools
  • Testing Techniques (UNIT TT)
    • Testing techniques principles
    • Processor testing
    • Memory testing
    • Network testing

FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline (continued):
  • Fault Diagnosis Techniques (UNIT FST)
    • Fault detection techniques
    • Fault location (isolation) methods
  • Fault Recovery and Tolerance Techniques (UNIT FRTT) (System Level)
    • Dynamic techniques
    • Static techniques
    • Hybrid techniques
  • Fault-tolerant and Fault-secure Memories (UNIT FRTT)
    • Fault-tolerant techniques in manufacturing
    • Replication
    • Coding
    • Reconfiguration

FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline (continued):
  • Network Fault Tolerance (UNIT NFT)
    • Computer networks
    • Basic techniques
    • Example – multistage networks
  • Case Studies (UNIT CS)
    • ESS and 3B20
    • FTMP – Fault-tolerant Multiprocessor
    • SIFT – Software-implemented Fault Tolerance
    • Communication controller
    • Fault-tolerant Building Block Architecture

COURSE ACTIVITIES
  • PROJECT
  • PRESENTATION
  • INVITED SPEAKERS
  • CONFERENCES AND WORKSHOPS
  • Some Websites:
    • www.dependability.org
    • www.paradise.caltech.edu
    • www.milan.eas.asu.edu
    • www.crhc.uiuc.edu

Major References on Fault-tolerant Computing (Books/General) 1
  • Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970.
  • Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.
  • Breuer, M. A. and A.D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.
  • Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.
  • Anderson, T. and P.A. Lee, Fault Tolerance Principles and Practice, Prentice-Hall, 1982.
  • Siewiorek, D.P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982 & 1995.
  • Lala, P.K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.
  • Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.

Major References on Fault-tolerant Computing (Books/General) 2
  • Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.
  • Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.
  • Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.
  • Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.
  • Landwehr, C. E., B. Randell, L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8, Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.
  • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, System Implementation, Kluwer Academic Publishers, 1994.
  • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.

Major References on Fault-tolerant Computing (Books/General) 3
  • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.
  • Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.
  • Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems, Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.
  • Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9, Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.
  • Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
  • Shvartsman, A. A., Fault-Tolerant Parallel Computation, Kluwer, 1997.
  • Schneeweiss, W., Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999.
  • Montenegro, S., Sichere und fehlertolerante Steuerungen, Hanser Muenchen, 1999.

Major References on Fault-tolerant Computing (Books/Reliability Evaluation)
  • Myers, G. J., Software Reliability Principles and Practice, Wiley-Interscience, 1976.
  • Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.
  • Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.
  • Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.
  • Schneeweiss, W., Petri Nets for Reliability Modeling, LiLoLe, 1999.

Major References on Fault-tolerant Computing (Books/Coding)
  • Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.
  • Peterson, W. W. and E. J. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.
  • Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.
  • Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.
  • Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.
  • Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989.

Major References on Fault-tolerant Computing (Books/Software)
  • Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.
  • Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.
  • Shooman, M. L., Software Engineering, McGraw-Hill, 1983.
  • Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.
  • Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
  • Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.
  • Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.
  • Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.

Major References on Fault-tolerant Computing (Journals)
  • Special Issue of Proc. of the IEEE, October 1978
  • Special Issue of Computer, October 1979
  • Special Issue of Computer, March 1980
  • Special Issue of Computer, August 1984
  • Special Issue of IEEE Software, May 1995
  • IEEE Trans. on Reliability
  • IEEE Trans. on Software Engineering
  • Computer
  • Design and Test
  • Electronics
  • Proc. of the IEEE
  • Computer Design
  • Journal of Electronic Testing: Theory and Applications
  • Journal of Parallel and Distributed Computing
  • IEEE Trans. on Parallel and Distributed Systems
  • Real-Time Systems Journal

Major References on Fault-tolerant Computing (Conference Proceedings)
  • Fault-Tolerant Computing Symposium
  • Reliability and Maintainability Symposium
  • Reliability in Distributed Software and Database Systems Symposium
  • Test Conference
  • Distributed Computing Systems Conference
  • Parallel Processing Conference
  • Real-Time Systems Symposium
  • Computer Architecture Symposium

INTRODUCTION
  • OBJECTIVES:
    • TO MOTIVATE FAULT-TOLERANT SYSTEMS
    • TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS AND THEIR RELATIONS TO COMPUTER SYSTEM DEPENDABILITY
    • TO PRESENT BASIC CONCEPTS AND APPROACHES
    • TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY
  • CONTENTS:
    • MOTIVATION
    • SYSTEM VIEWS
    • SYSTEM DEPENDABILITY CONCEPTS
    • APPROACHES TO DEPENDABLE DESIGN
    • DEPENDABILITY RINGS
    • DEPENDABLE DESIGN METHODOLOGY

TYPES OF SYSTEMS
  • Dependable (Reliable) System
    • A system that delivers the required service throughout its lifetime
  • Fault-Tolerant Computer System
    • A system that can continue to execute its programs and input/output functions correctly in the presence of faults
  • Real-Time Computer System
    • A system that delivers service to a user within a specified deadline (physical time, duration, etc.)
  • Responsive Computer System
    • A fault-tolerant real-time system that delivers satisfactory service in a timely manner

MOTIVATION FOR RELIABLE AND FAULT-TOLERANT COMPUTING
  • ECONOMIC NECESSITY
  • LIFE SAVING
  • NOVICE USERS
  • HARSH ENVIRONMENTS
  • MORE COMPLEX SYSTEMS

DEVICE RELIABILITY AND SYSTEM RELIABILITY

[Figure: Mean Time between Failures (MTBF) in years, on a logarithmic axis from 1 to 10^6, plotted against the years 1950 to 1990 and the technology generations Relays, Vacuum Tubes, Semiconductors, SSI, MSI, LSI, VLSI. Curves: Equivalent Device Reliability, System Reliability, Minimum Acceptable Reliability.]
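
One way to see why system MTBF can stay far below device MTBF is a series model, sketched here in Python. The slide gives no formula, so everything in it is an assumption for illustration: devices are identical and independent, each has a constant failure rate (exponential lifetime), and the system fails as soon as any single device fails. The function names are hypothetical.

```python
import math


def series_system_mtbf(device_mtbf_years: float, n_devices: int) -> float:
    """MTBF of a system that fails when any one of n identical devices fails.

    Assumption: independent devices with constant failure rates, so the
    failure rates add up and the system MTBF is the device MTBF divided by n.
    """
    if device_mtbf_years <= 0 or n_devices < 1:
        raise ValueError("need a positive MTBF and at least one device")
    return device_mtbf_years / n_devices


def series_system_reliability(device_mtbf_years: float,
                              n_devices: int,
                              t_years: float) -> float:
    """Probability that the series system survives t years without failure."""
    system_failure_rate = n_devices / device_mtbf_years  # failures per year
    return math.exp(-system_failure_rate * t_years)


if __name__ == "__main__":
    # Hypothetical numbers: 100,000 devices, each with an MTBF of 1,000 years.
    print(series_system_mtbf(1000.0, 100_000))               # years
    print(series_system_reliability(1000.0, 100_000, 0.01))  # over 0.01 years
```

Under these assumptions, improving device reliability by a factor of ten is cancelled out by a tenfold increase in device count, which is one reading of the flat system-reliability curve.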

DEPENDABILITY – PERFORMANCE TRADE-OFF

[Figure: Availability (0.9 to 0.99999) plotted against Throughput in MIPS (1 to 100000). Regions: Ultra Reliable Systems; Commercial Fault-Tolerant Systems; Massively Parallel/Distributed Systems.]
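
The availability values on the vertical axis are easier to compare when converted to expected downtime. The following Python sketch does that conversion; the function name and the 365-day year are assumptions made for illustration, not part of the slide.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given steady-state availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # The availability levels shown on the slide's axis.
    for a in (0.9, 0.99, 0.999, 0.9999, 0.99999):
        print(f"{a}: {downtime_minutes_per_year(a):.1f} minutes/year")
```

Each additional "nine" cuts the expected downtime by a factor of ten, from about 36.5 days per year at 0.9 to a little over five minutes per year at 0.99999.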

EXAMPLES
  • DEFENSE SYSTEMS
  • FLIGHT SYSTEMS
  • AIR TRAFFIC CONTROL
  • COMMUNICATION SYSTEMS
  • BANKING SYSTEMS
  • AIRLINE SEAT RESERVATIONS
  • TELEPHONE SYSTEMS
  • HOUSEHOLD APPLIANCES
  • VIDEO GAMES

VIEW 1: SYSTEM LIFE CYCLE

[Figure: system life cycle. Diagram labels: SYSTEM CONSTRAINTS, NEW TECHNOLOGY, OBSOLESCENCE, NEEDS.]

Phases:
  • CONCEPT FORMULATION
  • SYSTEM SPECIFICATION
  • DESIGN
  • PROTOTYPE
  • PRODUCTION
  • INSTALLATION
  • OPERATIONAL LIFE
  • MODIFICATION AND RETIREMENT

  • Note that testing, verification, or validation should occur after every phase of the life cycle
  • Very few tools exist, and those cover only some steps of the cycle

VIEW 2: PACKAGING LEVELS OF INTEGRATION
  • APPLICATIONS
  • APPLICATIONS MODULES
  • SPECIAL-PURPOSE LANGUAGES
  • STANDARD LANGUAGES
  • OPERATING SYSTEMS
  • CABINETS/FRAMES
  • BOXES/CAGES
  • PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs
  • INTEGRATED CIRCUITS (CHIPS)
  • Dependability must be considered at every level
  • System decomposition (partitioning) may have a significant impact on dependability

VIEW 3: WORKLOAD VIEW

[Figure: workload breakdown. Labels: LIVEWARE; HARDWARE/SOFTWARE; USEFUL WORK; PREPARATION; SEMI-USEFUL WORK; IDLING; FAULT SERVICING.]

  • ELIMINATE IDLING AND USE IT FOR TESTING TO IMPROVE DEPENDABILITY


VIEW 4: LEVELS OF ABSTRACTION FOR DIGITAL COMPUTERS
  • PMS
    • Components: Processors, Memories, Switches, Links (Networks), Controllers, ALUs, I/Os
  • Program
    • Sublevels: HLL, ISP (Instruction Set Processor)
    • Components: Software, Memory State, Processor State, Effective Address Calculation, Instruction Decode, Instruction Execution
  • Logic
    • Sublevel: Register Transfer Level (RTL)
    • Components: Data Paths, Registers, Data Operators, Control (Hardwired), Microprogramming (Microstore)
  • Circuit
    • Components: Resistors, Capacitors, Inductors, Power Sources, Diodes, Transistors
  • Quantum & Electromagnetic
    • Components: Disks, Tapes
  • DEPENDABILITY AND TESTING MUST BE CONSIDERED AT EVERY LEVEL

VIEW 5: COMPUTER SYSTEM

  • LIVEWARE: MAINTENANCE PERSONNEL, OPERATORS, SYSTEM DESIGNERS, SYSTEM ANALYSTS, PROGRAMMERS, USERS
  • SOFTWARE: PACKAGES, ASSEMBLERS, COMPILERS, OPERATING SYSTEMS, UTILITY PROGRAMS, DEBUGGING PROGRAMS, FILE PROCESSING PROGRAMS
  • FIRMWARE: MICROPROGRAM & MICROPROGRAMMING SYSTEMS
  • HARDWARE: CPUs, I/O DEVICES, MEMORIES, INTERCONNECTION NETWORKS

FAULTS ARE ATTRIBUTED TO:
  • HARDWARE: 20%-65%
  • SOFTWARE: 20%-80%
  • PEOPLE: 15%-40%
  • AT&T’s: 20-40-40% (2/3 applications + 1/3 OS)

(WARNING!!!) VIEW 6: IF YOU DO NOT FOLLOW A DEPENDABLE DESIGN METHODOLOGY, YOU MAY END UP WITH THE FOLLOWING:

SIX PHASES OF A PROJECT

  • ENTHUSIASM
  • DISILLUSIONMENT
  • PANIC AND HYSTERIA
  • SEARCH FOR THE GUILTY
  • PUNISHMENT OF THE INNOCENT
  • PRAISE AND AWARDS FOR THE NON-PARTICIPANTS

(Author unknown – found in one of the computer companies)

SYSTEM DEPENDABILITY CONCEPTS
  • RELIABILITY
    • The conditional probability that the system performs its intended function without failure throughout the interval [0, t], given that it was fully operational at time t = 0
  • AVAILABILITY
    • Instantaneous availability is the probability that a system is performing correctly at time t; for non-repairable systems it equals the reliability:

$A(t) = R(t)$

    • Steady-state availability is the probability that a system is operational at a random point in time, expressed as the fraction of its expected lifetime during which the system is operational:

$A_s = \dfrac{\text{time operational}}{\text{expected lifetime}}$

  • SURVIVABILITY is the probability that a system delivers the required service in the presence of a set of faults defined a priori, or of any subset of it
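
The reliability and steady-state availability definitions can be turned into numbers once a failure model is chosen. The Python sketch below assumes a constant failure rate (exponential lifetimes) and expresses availability through mean time to failure (MTTF) and mean time to repair (MTTR). Neither the model nor these two parameters appear on the slide; they are common textbook choices used here for illustration only.

```python
import math


def reliability(t: float, mttf: float) -> float:
    """R(t) under an assumed constant failure rate of 1/MTTF.

    Gives the probability of operating without failure over [0, t],
    given the system was fully operational at t = 0.
    """
    if t < 0 or mttf <= 0:
        raise ValueError("need t >= 0 and MTTF > 0")
    return math.exp(-t / mttf)


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the system is operational.

    Assumption: the system alternates between up periods (mean MTTF)
    and repair periods (mean MTTR).
    """
    if mttf <= 0 or mttr < 0:
        raise ValueError("need MTTF > 0 and MTTR >= 0")
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    # Hypothetical system: fails on average every 1000 h, repaired in 1 h.
    print(reliability(100.0, 1000.0))
    print(steady_state_availability(1000.0, 1.0))
```

With MTTR set to zero the availability is 1, and as repairs become slower the availability drops even though the reliability curve is unchanged, which is why the two measures are kept separate.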

APPROACHES
  • FAULT INTOLERANCE
  • FAULT TOLERANCE
  • MAINTAINABILITY
  • HARDWARE/SOFTWARE TRADE-OFFS

HARDWARE/SOFTWARE CONTINUUM AND VERTICAL MIGRATION

[Figure: the hardware/software continuum, running from HARDWARE to SOFTWARE.]
  • HARDWARE EXAMPLES: M6800; MC68000; VAX-11/780, IBM-30XX; CRAY-XMP, C-205; SYSTOLIC ARRAYS, RECONFIGURABLE OR EXPERIMENTAL MULTICOMPUTERS
  • INSTRUCTIONS: INTEGER ARITHMETIC ADD/SUB; MPY/DIV; FLOATING-POINT ARITHMETIC; VECTOR PROCESSING; MULTIPROCESSING (e.g., submachine set-up)

VERTICAL MIGRATION is the transfer of the implementation of functions from software to firmware and/or hardware, or vice versa.

Vertical Migration improves performance and dependability, and reduces cost.

DEPENDABILITY (RELIABILITY) RINGS FOR FAULT TOLERANCE

[Figure: concentric dependability rings, each separated from the next by an Acceptance Test.]
  • Operating System, Languages and Application
  • System Hardware
  • Register-Transfer Level
  • Logic Level

Each Dependability Ring should provide measures and mechanisms for Fault Tolerance (Detection, Location, Testability and Recovery)

A BOOTSTRAP – TEST RINGS IN A MULTICOMPUTER SYSTEM

[Figure: concentric test rings. Labels: Network; Memories; Processor; Diagnostic and Maintenance Processor(s) (Hardcore).]
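
One reading of the bootstrap idea is that a trusted hardcore tests the next ring, and each ring that passes is then relied on to test the one after it. The Python sketch below follows that reading. The ring order, the `bootstrap_test` function, and the placeholder self-tests are all assumptions for illustration; the slide does not specify a procedure.

```python
from typing import Callable, List, Optional, Tuple

# A ring is a name plus a self-test that returns True when the ring is healthy.
Ring = Tuple[str, Callable[[], bool]]


def bootstrap_test(rings: List[Ring]) -> Tuple[List[str], Optional[str]]:
    """Test rings from innermost to outermost, stopping at the first failure.

    Returns the names of the rings that passed (and can be trusted) and the
    name of the first failing ring, or None if every ring passed.
    """
    trusted: List[str] = []
    for name, self_test in rings:
        if not self_test():
            # Outer rings are not tested: they would depend on a faulty ring.
            return trusted, name
        trusted.append(name)
    return trusted, None


if __name__ == "__main__":
    # Hypothetical system in which the memory test fails.
    rings: List[Ring] = [
        ("hardcore", lambda: True),
        ("processor", lambda: True),
        ("memories", lambda: False),
        ("network", lambda: True),
    ]
    print(bootstrap_test(rings))
```

Stopping at the first failing ring doubles as a coarse fault-location step, since everything inside that ring has already passed its test.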

DEPENDABLE DESIGN METHODOLOGY
  • Identify fault classes, fault latency and fault impact
  • Determine qualitative and quantitative specs for fault tolerance and evaluate your design in the specific environment
  • Identify “weak spots” and assess potential damage
  • Decompose the system
  • Develop fault and error detection techniques and algorithms
  • Develop fault isolation techniques and algorithms
  • Develop recovery/reintegration/restart techniques
  • Evaluate degree of fault tolerance
  • Refine, iterate for improvement; try to eliminate “weak spots” and minimize potential damage

REAL-TIME SYSTEMS DESIGN
  • Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.
  • Characterize the timing of the system (hardware and software).
  • Map the timing specification onto the system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.
  • Verify and validate the design for quantitative and qualitative specifications.
  • Refine, iterate and fine-tune the design.
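
As an illustration of mapping a timing specification onto a system, the Python sketch below applies the Liu and Layland utilization bound for rate-monotonic scheduling. The slide does not name a scheduling method, so this choice is an assumption, as are the function names. The test applies to independent periodic tasks on one processor with deadlines equal to periods, and it is sufficient but not necessary: a task set that fails it may still be schedulable.

```python
from typing import List, Tuple

# Each task is (worst-case execution time, period), in the same time unit.
Task = Tuple[float, float]


def rm_utilization_bound(n_tasks: int) -> float:
    """Liu and Layland bound n * (2^(1/n) - 1) for n periodic tasks."""
    if n_tasks < 1:
        raise ValueError("need at least one task")
    return n_tasks * (2.0 ** (1.0 / n_tasks) - 1.0)


def total_utilization(tasks: List[Task]) -> float:
    """Sum of execution time divided by period over all tasks."""
    return sum(exec_time / period for exec_time, period in tasks)


def passes_rm_test(tasks: List[Task]) -> bool:
    """True if the task set is guaranteed schedulable under rate-monotonic priorities."""
    return total_utilization(tasks) <= rm_utilization_bound(len(tasks))


if __name__ == "__main__":
    # Hypothetical task set: three periodic tasks.
    tasks = [(1.0, 4.0), (1.0, 5.0), (2.0, 10.0)]
    print(total_utilization(tasks), rm_utilization_bound(len(tasks)))
    print(passes_rm_test(tasks))
```

A task set that fails this check would go back to the refine-and-iterate step, for example with a more exact response-time analysis or a different allocation of tasks to processors.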

RESPONSIVE SYSTEM DESIGN
  • Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.
  • Determine system timing (hardware and software), and assess damage, availability and responsiveness.
  • Develop and time fault and error detection techniques and algorithms.
  • Develop and time fault isolation techniques and algorithms.
  • Develop and time recovery/reintegration/restart.
  • Map timing specification onto system timing under appropriate assumptions and incorporate concurrent monitoring.
  • Evaluate responsiveness.
  • Refine and iterate for improvement.

RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND ARCHITECTS OF TIME
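
Timing the detection, isolation and recovery steps makes it possible to check whether a task still meets its deadline when faults occur. The Python sketch below is one simple way to budget this. The `TimedTask` class, its fields, and the additive worst-case model are hypothetical, and `recovery_time` is assumed to include any re-execution needed after a fault.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedTask:
    """Timing budget for one task (all values in the same time unit)."""
    name: str
    exec_time: float
    detection_time: float
    isolation_time: float
    recovery_time: float  # assumed to include any re-execution
    deadline: float

    def worst_case_response(self, faults_tolerated: int = 1) -> float:
        """Fault-free execution time plus the handling cost of each fault."""
        per_fault = self.detection_time + self.isolation_time + self.recovery_time
        return self.exec_time + faults_tolerated * per_fault

    def is_responsive(self, faults_tolerated: int = 1) -> bool:
        """True if the deadline is met despite the given number of faults."""
        return self.worst_case_response(faults_tolerated) <= self.deadline


if __name__ == "__main__":
    # Hypothetical control task with a 30 ms deadline.
    task = TimedTask("control_loop", 10.0, 1.0, 2.0, 12.0, 30.0)
    print(task.worst_case_response(1), task.is_responsive(1))
    print(task.worst_case_response(2), task.is_responsive(2))
```

In this example the task tolerates one fault within its deadline but not two, so the designer would either shorten the fault-handling steps or relax the fault assumption.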

REFERENCES (TEXTBOOK)
  • Bell, C. G., J. C. Mudge and J. E. McNamara, “Seven Views of Computer Systems”, Chapter 1 in Computer Engineering, by the same authors, Digital Press, 1978.
  • Lipovski, G. J. and M. Malek, Parallel Computing: Theory and Comparisons, Wiley-Interscience, New York, 1987.
  • Malek, M., “Parallel Computer Systems Testing and Integration”, in Testing and Diagnosis of VLSI and ULSI, M. G. Sami and F. Lombardi (eds.), Kluwer, 1988.
  • Jalote, P., Fault Tolerance in Distributed Systems, Prentice-Hall, 1994.
  • Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
