
HUMBOLDT-UNIVERSITÄT ZU BERLIN

INSTITUT FÜR INFORMATIK

DEPENDABLE SYSTEMS

Lecture 1

INTRODUCTION

Winter Semester 2000/2001

Instructor: Prof. Dr. Miroslaw Malek

www.informatik.hu-berlin.de/~rok/ftc

DS - IX - NFT - 1


FAULT-TOLERANT COMPUTING SYSTEMS

Topical Outline:

  • Introduction (Unit I)

    • Motivation

    • System views

    • Dependability rings

    • Dependable design methodology

  • Dependability Concepts, Measures and Models (UNIT DCMM)

    • Basic definitions

    • Dependability measures

    • Dependability models

    • Examples

    • Dependability evaluation tools

  • Testing Techniques (UNIT TT)

    • Testing techniques principles

    • Processor testing

    • Memory testing

    • Network testing



FAULT-TOLERANT COMPUTING SYSTEMS

Topical Outline:

  • Fault Diagnosis Techniques (UNIT FST)

    • Fault detection techniques

    • Fault location (isolation) methods

  • Fault Recovery and Tolerance Techniques (UNIT FRTT) (System Level)

    • Dynamic techniques

    • Static techniques

    • Hybrid techniques

  • Fault-tolerant and Fault-secure Memories (UNIT FRTT)

    • Fault-tolerant techniques in manufacturing

    • Replication

    • Coding

    • Reconfiguration



FAULT-TOLERANT COMPUTING SYSTEMS

Topical Outline:

  • Network Fault Tolerance (UNIT NFT)

    • Computer networks

    • Basic techniques

    • Example – multistage networks

  • Case Studies (UNIT CS)

    • ESS and 3B20

    • FTMP – Fault-tolerant Multiprocessor

    • SIFT – Software-implemented Fault Tolerance

    • Communication controller

    • Fault-tolerant Building Block Architecture



COURSE ACTIVITIES

  • PROJECT

  • PRESENTATION

  • INVITED SPEAKERS

  • CONFERENCES AND WORKSHOPS

  • Some Websites:

    • www.dependability.org

    • www.paradise.caltech.edu

    • www.milan.eas.asu.edu

    • www.crhc.uiuc.edu



Major References on Fault-tolerant Computing (Books/General) 1

  • Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970.

  • Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.

  • Breuer, M. A. and A.D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.

  • Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.

  • Anderson, T. and P.A. Lee, Fault Tolerance Principles and Practice, Prentice-Hall, 1982.

  • Siewiorek, D.P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982 & 1995.

  • Lala, P.K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.

  • Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.



Major References on Fault-tolerant Computing (Books/General) 2

  • Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.

  • Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.

  • Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.

  • Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.

  • Landwehr, C. E., B. Randell, L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8, Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.

  • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, System Implementation, Kluwer Academic Publishers, 1994.

  • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.



Major References on Fault-tolerant Computing (Books/General) 3

  • Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing, Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.

  • Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.

  • Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems, Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.

  • Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9, Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.

  • Pradhan, D. K., Fault-Tolerant Computer System Design, 1996.

  • Shvartsman, A. A., Fault-Tolerant Parallel Computation, Kluwer, 1997.

  • Schneeweiss, W., Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999.

  • Montenegro, S., Sichere und fehlertolerante Steuerungen, Hanser Muenchen, 1999.



Major References on Fault-tolerant Computing (Books/Reliability Evaluation)

  • Myers, G. J., Software Reliability Principles and Practice, Wiley-Interscience, 1976.

  • Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.

  • Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.

  • Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.

  • Schneeweiss, W., Petri Nets for Reliability Modeling, LiLoLe, 1999.



Major References on Fault-tolerant Computing (Books/Coding)

  • Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.

  • Peterson, W. and E. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.

  • Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.

  • Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.

  • Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.

  • Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989.



Major References on Fault-tolerant Computing (Books/Software)

  • Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.

  • Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.

  • Shooman, M. L., Software Engineering, McGraw-Hill, 1983.

  • Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.

  • Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.

  • Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.

  • Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.

  • Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.



Major References on Fault-tolerant Computing (Journals)

  • Special Issue of Proc. of the IEEE, October 1978

  • Special Issue of Computer, October 1979

  • Special Issue of Computer, March 1980

  • Special Issue of Computer, August 1984

  • Special Issue of IEEE Software, May 1995

  • IEEE Trans. on Reliability

  • IEEE Trans. on Software Engineering

  • Computer

  • Design and Test

  • Electronics

  • Proc. of the IEEE

  • Computer Design

  • Journal of Electronic Testing: Theory and Applications

  • Journal of Parallel and Distributed Computing

  • IEEE Trans. on Parallel and Distributed Systems

  • Real-Time Systems Journal



Major References on Fault-tolerant Computing (Conference Proceedings)

  • Fault-Tolerant Computing Symposium

  • Reliability and Maintainability Symposium

  • Reliability in Distributed Software and Database Systems Symposium

  • Test Conference

  • Distributed Computing Systems Conference

  • Parallel Processing Conference

  • Real-Time Systems Symposium

  • Computer Architecture Symposium



INTRODUCTION

  • OBJECTIVES:

    • MOTIVATION FOR FAULT-TOLERANT SYSTEMS

    • TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS AND THEIR RELATIONS TO COMPUTER SYSTEM DEPENDABILITY

    • TO PRESENT BASIC CONCEPTS AND APPROACHES

    • TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY

  • CONTENTS:

    • MOTIVATION

    • SYSTEM VIEWS

    • SYSTEM DEPENDABILITY CONCEPTS

    • APPROACHES TO DEPENDABLE DESIGN

    • DEPENDABILITY RINGS

    • DEPENDABLE DESIGN METHODOLOGY



TYPES OF SYSTEMS

  • Dependable (Reliable) System

    • A system that delivers the required service throughout its lifetime

  • Fault-Tolerant Computer System

    • A system that can continue to execute its programs and input/output functions correctly in the presence of faults

  • Real-Time Computer System

    • A system that delivers service to a user within a specified deadline (physical time, duration, etc.)

  • Responsive Computer System

    • A fault-tolerant real-time system that delivers satisfactory service in a timely manner



MOTIVATION FOR RELIABLE AND FAULT-TOLERANT COMPUTING

  • ECONOMIC NECESSITY

  • LIFE SAVING

  • NOVICE USERS

  • HARSH ENVIRONMENTS

  • MORE COMPLEX SYSTEMS



DEVICE RELIABILITY AND SYSTEM RELIABILITY

[Figure: Mean Time between Failures (MTBF) in years, on a logarithmic axis from 1 to 10^6, plotted over the years 1950 to 1990 and the technology generations relays, vacuum tubes, semiconductors, SSI, MSI, LSI and VLSI. Curves shown: equivalent device reliability, system reliability and minimum acceptable reliability.]
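
One way to see why system reliability can lag far behind device reliability is a simple series model. The sketch below assumes, beyond what the slide states, that a system consists of n identical, independent devices with constant failure rates and that any single device failure fails the system. Under those assumptions the failure rates add, so the system MTBF is the device MTBF divided by n. The function name and the numbers are hypothetical.

```python
def series_system_mtbf(device_mtbf_years: float, n_devices: int) -> float:
    """MTBF of n identical, independent devices in series.

    Assumes constant failure rates, so the rates add:
    lambda_system = n * lambda_device, hence MTBF_system = MTBF_device / n.
    """
    if device_mtbf_years <= 0 or n_devices <= 0:
        raise ValueError("MTBF and device count must be positive")
    return device_mtbf_years / n_devices


# Hypothetical figures, chosen only for illustration.
print(series_system_mtbf(1e6, 100_000))  # 10.0
```

Even devices with an MTBF of a million years give a system MTBF of only ten years once a hundred thousand of them must all work, which is the gap that redundancy and fault tolerance try to close.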



DEPENDABILITY – PERFORMANCE TRADE-OFF

[Figure: Availability (0.9 to 0.99999) plotted against throughput (1 to 100000 MIPS). Regions shown: ultra-reliable systems, commercial fault-tolerant systems and massively parallel/distributed systems.]
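
To make the availability axis concrete, the following sketch converts an availability figure into expected downtime per year. It assumes a 365-day year and treats the value as a steady-state availability; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year implied by a steady-state availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * MINUTES_PER_YEAR


for a in (0.9, 0.99, 0.999, 0.9999, 0.99999):
    print(f"{a}: {downtime_minutes_per_year(a):.1f} min/year")
```

An availability of 0.9 corresponds to about 36.5 days of downtime per year, while 0.99999 corresponds to roughly 5.3 minutes.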



EXAMPLES

  • DEFENSE SYSTEMS

  • FLIGHT SYSTEMS

  • AIR TRAFFIC CONTROL

  • COMMUNICATION SYSTEMS

  • BANKING SYSTEMS

  • AIRLINE SEAT RESERVATIONS

  • TELEPHONE SYSTEMS

  • HOUSEHOLD APPLIANCES

  • VIDEO GAMES



VIEW 1: SYSTEM LIFE CYCLE

  • Inputs: SYSTEM CONSTRAINTS, NEW TECHNOLOGY, OBSOLESCENCE, NEEDS

  • Phases, in order:

    • CONCEPT FORMULATION

    • SYSTEM SPECIFICATION

    • DESIGN

    • PROTOTYPE

    • PRODUCTION

    • INSTALLATION

    • OPERATIONAL LIFE

    • MODIFICATION AND RETIREMENT

  • Note that testing, verification or validation should occur after every phase of the life cycle

  • Very few tools exist, and only for some steps of the cycle



VIEW 2: PACKAGING LEVELS OF INTEGRATION

  • APPLICATIONS

  • APPLICATION MODULES

  • SPECIAL-PURPOSE LANGUAGES

  • STANDARD LANGUAGES

  • OPERATING SYSTEMS

  • CABINETS/FRAMES

  • BOXES/CAGES

  • PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs

  • INTEGRATED CIRCUITS (CHIPS)

  • Dependability must be considered at every level

  • System decomposition (partitioning) may have a significant impact on dependability



VIEW 3: WORKLOAD VIEW

[Figure: workload view. Labels: LIVEWARE; HARDWARE/SOFTWARE; USEFUL WORK; PREPARATION; SEMI-USEFUL WORK; IDLING; FAULT SERVICING.]

  • ELIMINATE IDLING AND USE IT FOR TESTING TO IMPROVE DEPENDABILITY



VIEW 4: LEVELS OF ABSTRACTION FOR DIGITAL COMPUTERS

  • PMS level

    • Components: Processors, Memories, Switches, Links (Networks), Controllers, ALUs, I/Os

  • Program level

    • Sublevels: HLL, ISP (Instruction Set Processor)

    • Components: Software, Memory State, Processor State, Effective Address Calculation, Instruction Decode, Instruction Execution

  • Logic level

    • Sublevel: Register Transfer Level (RTL)

    • Components: Data Paths, Registers, Data Operators, Control (Hardwired), Microprogramming (Microstore)

  • Circuit level

    • Components: Resistors, Capacitors, Inductors, Power Sources, Diodes, Transistors

  • Quantum & Electromagnetic level

    • Components: Disks, Tapes

  • DEPENDABILITY AND TESTING MUST BE CONSIDERED AT EVERY LEVEL



VIEW 5: COMPUTER SYSTEM

  • LIVEWARE

    • MAINTENANCE PERSONNEL

    • OPERATORS

    • SYSTEM DESIGNERS

    • SYSTEM ANALYSTS

    • PROGRAMMERS

    • USERS

  • SOFTWARE

    • PACKAGES

    • ASSEMBLERS

    • COMPILERS

    • OPERATING SYSTEMS

    • UTILITY PROGRAMS

    • DEBUGGING PROGRAMS

    • FILE PROCESSING PROGRAMS

  • FIRMWARE

    • MICROPROGRAM & MICROPROGRAMMING SYSTEMS

  • HARDWARE

    • CPUs

    • I/O DEVICES

    • MEMORIES

    • INTERCONNECTION NETWORKS

FAULTS ARE ATTRIBUTED TO: HARDWARE: 20%-65%; SOFTWARE: 20%-80%; PEOPLE: 15%-40%. AT&T’s FIGURES: 20%-40%-40%, in the same order (SOFTWARE: 2/3 applications + 1/3 OS)



VIEW 6 (WARNING!!!): IF YOU DO NOT FOLLOW DEPENDABLE DESIGN METHODOLOGY, YOU MAY END UP WITH THE FOLLOWING:

SIX PHASES OF A PROJECT

  • ENTHUSIASM

  • DISILLUSIONMENT

  • PANIC AND HYSTERIA

  • SEARCH FOR THE GUILTY

  • PUNISHMENT OF THE INNOCENT

  • PRAISE AND AWARDS FOR THE NON-PARTICIPANTS

    (Author unknown – found in one of the computer companies)



SYSTEM DEPENDABILITY CONCEPTS

  • RELIABILITY

    • The conditional probability $R(t)$ that the system performs its intended function without failure throughout the interval $[0, t]$, given that it was fully operational at time $t = 0$

  • AVAILABILITY

    • Instantaneous availability $A(t)$ is the probability that a system is performing correctly at time $t$. For non-repairable systems it equals the reliability:

      $A(t) = R(t)$

    • Steady-state availability is the probability that a system will be operational at any random point of time. It is expressed as the fraction of time a system is operational during its expected lifetime, where $T_{\text{up}}$ and $T_{\text{down}}$ denote the expected operational and non-operational time:

      $A_s = \dfrac{T_{\text{up}}}{T_{\text{up}} + T_{\text{down}}}$

  • SURVIVABILITY is the probability that a system will deliver the required service in the presence of an a priori defined set of faults or any subset of it
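
The reliability and availability measures can be illustrated with a short numerical sketch. It assumes a constant failure rate (an exponential lifetime model), which the slide does not state, and it expresses steady-state availability through mean time to failure (MTTF) and mean time to repair (MTTR), one common way of measuring the operational fraction of time. All function names and numbers are hypothetical.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t): probability of no failure in [0, t].

    Assumes a constant failure rate (exponential lifetime model).
    """
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Fraction of time operational: MTTF / (MTTF + MTTR)."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


# Hypothetical numbers: one failure per 1000 h on average, 1 h to repair.
print(reliability(t=1000.0, failure_rate=1 / 1000.0))    # exp(-1), about 0.368
print(steady_state_availability(mttf=1000.0, mttr=1.0))  # about 0.999
```

The two measures answer different questions: the same system has only about a 37% chance of running 1000 hours without any failure, yet it is operational about 99.9% of the time because repairs are fast.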



APPROACHES

  • FAULT INTOLERANCE

  • FAULT TOLERANCE

  • MAINTAINABILITY

  • HARDWARE/SOFTWARE TRADE-OFFS



HARDWARE/SOFTWARE CONTINUUM AND VERTICAL MIGRATION

[Figure: continuum from HARDWARE to SOFTWARE. Example machines are paired with the instruction classes shown alongside them:]

  • M6800: INTEGER ARITHMETIC ADD/SUB

  • MC68000: MPY/DIV

  • VAX-11/780, IBM-30XX: FLOATING-POINT ARITHMETIC

  • CRAY-XMP, C-205: VECTOR PROCESSING

  • SYSTOLIC ARRAYS, RECONFIGURABLE OR EXPERIMENTAL MULTICOMPUTERS: MULTIPROCESSING (e.g., submachine set-up)

VERTICAL MIGRATION is the transfer of a function's implementation from software to firmware and/or hardware, or vice versa.

Vertical migration can improve performance and dependability, and reduce cost.



DEPENDABILITY (RELIABILITY) RINGS FOR FAULT TOLERANCE

[Figure: dependability rings, each paired with an acceptance test. Rings as listed on the slide:]

  • Operating System, Languages and Application

  • System Hardware

  • Register-Transfer Level

  • Logic Level

Each Dependability Ring should provide measures and mechanisms for Fault Tolerance (Detection, Location, Testability and Recovery)



A BOOTSTRAP – TEST RINGS IN A MULTICOMPUTER SYSTEM

[Figure: test rings, as listed on the slide:]

  • Network

  • Memories

  • Processor

  • Diagnostic and Maintenance Processor(s) (Hardcore)
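
The bootstrap idea can be sketched as follows: a trusted hardcore tests the next ring, and a ring serves as the base for testing the following one only after it has passed its own test. The ring order used here (processor, then memories, then network, starting from the hardcore) is an assumption about how the figure is meant to be read, and the test functions are hypothetical stand-ins for real diagnostics.

```python
from typing import Callable, List, Optional, Tuple

Ring = Tuple[str, Callable[[], bool]]  # (ring name, self-test returning True on pass)


def bootstrap_test(rings: List[Ring]) -> Optional[str]:
    """Test rings from the innermost outward and stop at the first failure.

    Returns the name of the first failing ring, or None if all rings pass.
    Rings beyond a failing one are not tested, because the failing ring
    cannot be trusted as a base for testing them.
    """
    for name, test in rings:
        if not test():
            return name
    return None


# Hypothetical self-tests; real ones would exercise actual hardware.
rings = [
    ("processor", lambda: True),
    ("memories", lambda: False),  # simulated memory fault
    ("network", lambda: True),
]
print(bootstrap_test(rings))  # memories
```

Stopping at the first failing ring also narrows fault location: everything inside that ring has already been shown to work.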



DEPENDABLE DESIGN METHODOLOGY

  • Identify fault classes, fault latency and fault impact

  • Determine qualitative and quantitative specifications for fault tolerance and evaluate your design in its specific environment

  • Identify “weak spots” and assess potential damage

  • Decompose the system

  • Develop fault and error detection techniques and algorithms

  • Develop fault isolation techniques and algorithms

  • Develop recovery/reintegration/restart techniques

  • Evaluate degree of fault tolerance

  • Refine, iterate for improvement; try to eliminate “weak spots” and minimize potential damage



REAL-TIME SYSTEMS DESIGN

  • Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.

  • Characterize the timing of the system (hardware and software).

  • Map the timing specification onto the system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.

  • Verify and validate the design for quantitative and qualitative specifications.

  • Refine, iterate and fine-tune the design.
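
As one concrete instance of mapping a timing specification onto system timing, the sketch below applies the Liu and Layland utilization bound for rate-monotonic scheduling. The slide does not name a scheduling method, so this choice is only an illustration. The test is sufficient but not necessary, and it assumes independent periodic tasks on a single processor with deadlines equal to periods. The task set and all names are hypothetical.

```python
from typing import List, Tuple

Task = Tuple[float, float]  # (worst-case execution time, period), same time unit


def utilization(tasks: List[Task]) -> float:
    """Total processor utilization: sum of C_i / T_i."""
    return sum(c / t for c, t in tasks)


def rm_bound(n: int) -> float:
    """Liu-Layland bound n * (2**(1/n) - 1) for n periodic tasks."""
    return n * (2 ** (1.0 / n) - 1)


def passes_rm_test(tasks: List[Task]) -> bool:
    """Sufficient (not necessary) schedulability test for rate-monotonic priorities."""
    return bool(tasks) and utilization(tasks) <= rm_bound(len(tasks))


# Hypothetical task set: (execution time, period) in milliseconds.
tasks = [(1.0, 4.0), (1.0, 5.0), (2.0, 10.0)]
print(utilization(tasks), rm_bound(len(tasks)), passes_rm_test(tasks))
```

A task set that fails this test is not necessarily unschedulable; it would need an exact analysis or a different allocation, which is where the "refine, iterate and fine-tune" step comes in.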



RESPONSIVE SYSTEM DESIGN

  • Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.

  • Determine system timing (hardware and software); assess damage, availability and responsiveness.

  • Develop and time fault and error detection techniques and algorithms.

  • Develop and time fault isolation techniques and algorithms.

  • Develop and time recovery/reintegration/restart.

  • Map timing specification onto system timing under appropriate assumptions and incorporate concurrent monitoring.

  • Evaluate responsiveness.

  • Refine and iterate for improvement.

    RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND ARCHITECTS OF TIME
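
Timing the fault-tolerance mechanisms matters because a task stays responsive under a fault only if execution plus detection, isolation and recovery still fits inside its deadline. The sketch below checks that budget under a deliberately simple model: a single fault per task activation, with worst-case times that simply add. The model, the class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedTask:
    """Timing budget of one task; all values share one time unit (hypothetical model)."""
    execution: float  # worst-case fault-free execution time
    detection: float  # worst-case fault/error detection time
    isolation: float  # worst-case fault isolation time
    recovery: float   # worst-case recovery/reintegration/restart time
    deadline: float

    def worst_case_response(self) -> float:
        """Response time if one fault strikes and every mechanism runs once."""
        return self.execution + self.detection + self.isolation + self.recovery

    def slack(self) -> float:
        """Time left before the deadline in the single-fault case (negative means a miss)."""
        return self.deadline - self.worst_case_response()


task = TimedTask(execution=4.0, detection=1.0, isolation=0.5, recovery=2.0, deadline=10.0)
print(task.worst_case_response(), task.slack())  # 7.5 2.5
```

A negative slack signals that either the deadline or the fault-tolerance mechanisms must be redesigned before the system can be called responsive.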



REFERENCES (TEXTBOOK)

  • C. G. Bell, J. C. Mudge and J. E. McNamara “Seven Views of Computer Systems”, Chapter 1 in the book by the same authors titled “Computer Engineering”, Digital Press, 1978.

  • G.J. Lipovski and M. Malek, “Parallel Computing: Theory and Comparisons”, Wiley-Interscience, New York, 1987.

  • M. Malek, “Parallel Computer Systems Testing and Integration”, in the book titled “Testing and Diagnosis of VLSI and LSI”, M. G. Sami and F. Lombardi (eds.), Kluwer, 1988.

  • Pankaj Jalote, Fault Tolerance in Distributed Systems, 1994.

  • Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, 1996.
