
Probability for Computer Science





Presentation Transcript


  1. Probability for Computer Science Kishor S. Trivedi Visiting Prof. of Computer Science and Engineering, IITK Prof., Department of Electrical and Computer Engineering Duke University Durham, NC 27708-0291 Phone: 7576 e-mail: kst@ee.duke.edu URL: www.ee.duke.edu/~kst IIT Kanpur

  2. Outline • Introduction • Preliminaries: Sample Space, Probability Axioms, Independence, Conditioning, Binomial Trials • Random Variables: Binomial, Poisson, Exponential, Weibull, Erlang, Hyperexponential, Hypoexponential, Pareto, Defective • Reliability, Hazard Rate • Average Case Analysis of Program Performance • Reliability Analysis Using Block Diagrams and Fault Trees • Reliability of Standby Systems • Statistical Inference Including Confidence Intervals • Hypothesis Testing • Regression

  3. Schedule & Textbook • Schedule: Jan 21, 23, 28 and Feb 6, 18, 25, 27 • Probability & Statistics with reliability, queuing, and computer science applications, K. S. Trivedi, second edition, John Wiley & Sons, 2001 (Indian paperback).

  4. Program Performance Evaluation • Worst-case vs. Average case analysis • Data-structure-oriented vs. Control structure-oriented • Sequential vs. Concurrent • Centralized vs. Distributed • Structured vs. with unrestricted transfer of control • Unlimited (hardware) resources vs. limited resources • Software architecture: modules, their characteristics (execution time) and interactions (branching, looping) • Characteristics of hardware on which the software is run • Measures: completion time (mean, variance & dist.), thruput • Measurements or Models (simulation vs. analytic) analytic models: combinatorial, DTMC, SMP, CTMC, SPN

  5. System Performance Evaluation • Workload: traffic arrivals, service time distributions pattern of resource requests • Hardware architecture and software architecture • Resource Contention, Scheduling & Allocation • Concurrency, Synchronization, distributed processing • Timeliness (Have to Meet Deadlines) • Measures: Thruput, Goodput, loss probability, response time or delay (mean, variance & dist.) • Low-level (Cache, memory interference: ch. 7) • System-level (CPU-I/O, multiprocessing: ch. 8,9) • Network-level (protocols, handoff in wireless: ch. 7,8) • Measurements or models (simulation or analytic) analytic models: DTMC, CTMC, PFQN, SPN

  6. System Performance Evaluation • Workload: • Single vs. multiple types of requests (classes, chains); in the latter case, the following three items are needed for each type of request • traffic arrivals: one time vs. a stream; a stream may be Poisson (Bernoulli), General renewal, IPP (IBP), MMPP (MMBP), MAP, BMAP, NHPP, Self-similar • service time distributions: Exponential (geometric), deterministic, uniform, Erlang, Hyperexponential, Hypoexponential, Phase-type, general (with finite mean and variance), Pareto • pattern of resource requests: service time distribution (or the mean) at each resource per visit, branching probabilities; often described as a DTMC (discrete-time Markov chain) and can also be seen as the behavior of an individual program • All this information should be collected from actual measurements (if possible) followed by statistical inference
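As a small illustration of the service-time distributions listed above, the sketch below draws samples from a two-phase hyperexponential and checks the sample mean against the closed form E[X] = Σᵢ pᵢ/λᵢ. The phase probabilities and rates are illustrative values, not course data.

```python
import random

def sample_hyperexp(probs, rates, rng):
    """One draw from a hyperexponential: pick phase i with probability
    probs[i], then draw an Exp(rates[i]) sample."""
    u, cum = rng.random(), 0.0
    for p, lam in zip(probs, rates):
        cum += p
        if u <= cum:
            return rng.expovariate(lam)
    return rng.expovariate(rates[-1])   # guard against rounding

rng = random.Random(42)
probs, rates = [0.4, 0.6], [1.0, 5.0]   # illustrative two-phase parameters
n = 200_000
mean_est = sum(sample_hyperexp(probs, rates, rng) for _ in range(n)) / n
mean_true = sum(p / lam for p, lam in zip(probs, rates))   # 0.4/1 + 0.6/5 = 0.52
print(mean_est, mean_true)
```

The same mixing construction extends to any number of phases, which is why hyperexponentials are a convenient fit for workloads whose coefficient of variation exceeds one.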

  7. Software Reliability • Black-box (measurements + statistical inference) vs. Architecture-based approach (models) • Black-box approaches treat software as a monolithic whole, considering only its interactions with the external environment, without an attempt to model its internal structure • With the growing emphasis on reuse, the software development process moves toward component-based software design • A white-box approach may be better to analyze a system with many software components and how they fit together

  8. Software Architecture • Software behavior with respect to the manner in which different components interact • May include the information about the execution time of each component • Use control flow graph to represent architecture • Sequential program architecture modeled by • Discrete Time Markov Chain (DTMC) • Continuous Time Markov Chain (CTMC) • Semi-Markov process (SMP)

  9. Failure Behavior of Components and Interfaces Failure can happen • during the execution of any component or • during the transfer of control between components Failure behavior can be specified in terms of • reliability • constant failure rate • time-dependent failure intensity

  10. System Reliability/Availability • Faultload: fault types, fault arrivals, repair/recovery procedures and delay time distributions • Hardware architecture and software architecture • Minimum Resource Requirements • Dynamic failures • Performance/Reliability interdependence • Measures: Reliability, Availability, MTTF, Downtime • Low-level (Physics of failures, chip level) • System-level (CPU-I/O, multiprocessing: ch. 8,9) • Software and Hardware combined together • Network-level • Measurements or models (simulation or analytic) analytic models: RBD, FTREE, CTMC, SPN

  11. Definition of Reliability • Recommendations E.800 of the International Telecommunications Union (ITU-T) defines reliability as follows: • “The ability of an item to perform a required function under given conditions for a given time interval.” • In this definition, an item may be a circuit board, a component on a circuit board, a module consisting of several circuit boards, a base transceiver station with several modules, a fiber-optic transport-system, or a mobile switching center (MSC) and all its subtending network elements. The definition includes systems with software.

  12. Definition of Availability • Availability is closely related to reliability, and is also defined in ITU-T Recommendation E.800 as follows: "The ability of an item to be in a state to perform a required function at a given instant of time or at any instant of time within a given time interval, assuming that the external resources, if required, are provided." • An important difference between reliability and availability is that reliability refers to failure-free operation during an interval, while availability refers to failure-free operation at a given instant of time, usually the time when a device or system is first accessed to provide a required function or service.

  13. High Reliability/Availability/Safety • Traditional applications (long-life/life-critical/safety-critical) • Space missions, aircraft control, defense, nuclear systems • New applications (non-life-critical/non-safety-critical, business critical) • Banking, airline reservation, e-commerce applications, web-hosting, telecommunication • Scientific applications (non-critical)

  14. Motivation: High Availability • Scott McNealy, Sun Microsystems Inc. • "We're paying people for uptime. The only thing that really matters is uptime, uptime, uptime, uptime and uptime. I want to get it down to a handful of times you might want to bring a Sun computer down in a year. I'm spending all my time with employees to get this design goal" • Sun Microsystems – SunUP & RASCAL program for high availability • Motorola – 5NINES Initiative • HP, Cisco, Oracle, SAP – 5nines:5minutes Alliance • IBM – Cornhusker clustering technology for high availability, eLiza, autonomic computing • Microsoft – Trustable computing initiative • John Hennessey – in IEEE Computer • Microsoft – Regular full-page ad on 99.999% availability in USA Today

  15. Motivation – High Availability

  16. Need for a new term • Reliability is used both in a generic sense and as a precisely defined mathematical function • To remove this confusion, IFIP WG 10.4 has proposed Dependability as an umbrella term

  17. Dependability – Umbrella term • Trustworthiness of a computer system such that reliance can justifiably be placed on the service it delivers • Attributes: Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability (Security: Confidentiality + Integrity + Availability) • Means: Fault Prevention, Fault Removal, Fault Tolerance, Fault Forecasting • Threats: Faults, Errors, Failures

  18. IFIP WG10.4 • Failure occurs when the delivered service no longer complies with the specification • Error is that part of the system state which is liable to lead to subsequent failure • Fault is the adjudged or hypothesized cause of an error • Faults are the cause of errors that may lead to failures: Fault → Error → Failure

  19. Dependability: Reliability, Availability, Safety, Security • Redundancy: Hardware (Static, Dynamic), Information, Time, Software • Fault Types: Permanent (needs repair or replacement), Intermittent (reboot/restart or replacement), Transient (retry), Design: Bohrbugs, Heisenbugs, Aging-related bugs • Fault Detection, Automated Reconfiguration • Imperfect Coverage • Maintenance: scheduled, unscheduled

  20. Software Fault Classification • Bohrbugs: many software bugs are reproducible, easily found and fixed during the testing and debugging phase • Heisenbugs: other bugs that are hard to find and fix remain in the software during the operational phase • These bugs may never be fixed, but if the operation is retried or the system is rebooted, they may not manifest themselves as failures • Their manifestation is non-deterministic and dependent on the software reaching very rare states

  21. Software Fault Classification (diagram): maps Bohrbugs, Heisenbugs, and “Aging”-related bugs to remedies – Test/Debug in the Design/Development phase; Des./Data Diversity, Retry opn., Restart app., and Reboot node in the Operational phase

  22. Failure Classification (Cristian) • Failures • Omission failures (Send/receive failures) • Crash failures • Infinite loop • Timing failures • Early • Late (performance or dynamic failures) • Response failures • Value failures • State-transition failures

  23. Security • Security intrusions cause a system to fail • Security Failure • Integrity: Destruction/Unauthorized modification of information • Confidentiality: Theft of information • Availability: e.g., Denial of Services (DoS) • Similarity (as well as differences) between: • Malicious vs. accidental faults • Security vs. reliability/availability • Intrusion tolerance vs. fault tolerance

  24. The Need for Performability Modeling • New technologies, services & standards need new modeling methodologies • Pure performance modeling: too optimistic! Outage-and-recovery behavior not considered • Pure dependability modeling: too conservative! Different levels of performance not considered

  25. “ilities” besides performance • Reliability: for a specified operational time • Availability: at any given instant • Survivability: performance under failures • Performability: measures of the system's ability to perform designated functions • R.A.S.-ability concerns grow: high R.A.S. is not only a selling point for equipment vendors and service providers; regulatory outage reports required by the FCC for public switched telephone networks (PSTN) may soon apply to wireless

  26. Evaluation vs. Optimization • Evaluation of system for desired measures given a set of parameters • Sensitivity Analysis • Bottleneck analysis • Reliability importance • Optimization • Static:Linear,integer,geometric,nonlinear, multi-objective; constrained or unconstrained • Dynamic: Dynamic programming, Markov decision process, semi-Markov decision process

  27. PURPOSE OF EVALUATION • Understanding a system • Observation: operational environment or controlled environment • Reasoning: a model is a convenient abstraction • Predicting behavior of a system • Need a model • Accuracy based on degree of extrapolation

  28. PURPOSE OF EVALUATION (Continued) These famous quotes bring out the difficulty of prediction based on models: • “All models are wrong; some models are useful” – George Box • “Prediction is fine as long as it is not about the future” – Mark Twain

  29. Basic Definitions • Reliability R(t): X: time to failure of a system, F(t): distribution function of system lifetime, so R(t) = P(X > t) = 1 − F(t) • Mean Time To system Failure, where f(t) is the density function of system lifetime: MTTF = E[X] = ∫₀^∞ t f(t) dt = ∫₀^∞ R(t) dt
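The equality of the two MTTF expressions can be checked numerically. A minimal sketch, assuming an exponential lifetime with an illustrative failure rate λ = 0.5 (so R(t) = e^(−λt) and MTTF should come out to 1/λ = 2):

```python
import math

lam = 0.5                 # illustrative failure rate (per hour)

def R(t):                 # reliability of an exponential lifetime
    return math.exp(-lam * t)

# Trapezoidal integration of R(t) over a long horizon approximates
# MTTF = integral of R(t) dt, which should equal 1/lam = 2.0 here.
T, n = 100.0, 100_000
h = T / n
mttf = h * (0.5 * R(0.0) + sum(R(i * h) for i in range(1, n)) + 0.5 * R(T))
print(mttf)
```

Replacing R with a Weibull or hypoexponential survival function gives the corresponding MTTF with no other changes, which is the practical appeal of the ∫R(t)dt form.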

  30. Availability (Continued) • Instantaneous (point) Availability A(t): A(t) = P(system working at t) • g(t): density function of system repair time, G(t) its distribution function • Let H(t) be the convolution of F and G (the distribution of one failure-repair cycle) • Then the renewal equation A(t) = R(t) + ∫₀^t A(t − x) dH(x) holds, and Inst. Availability A(t) ≥ Reliability R(t)
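For the common special case of exponential time to failure (rate λ) and exponential repair time (rate μ), point availability has the well-known closed form A(t) = μ/(λ+μ) + (λ/(λ+μ))·e^(−(λ+μ)t). The sketch below (rates and time point are illustrative, not course data) estimates A(t) by simulating the alternating up/down process and compares it with that closed form:

```python
import math
import random

def up_at(t, lam, mu, rng):
    """Simulate one alternating UP/DOWN history (Exp(lam) uptimes,
    Exp(mu) repair times) and report whether the system is UP at time t."""
    clock, up = 0.0, True
    while True:
        dur = rng.expovariate(lam if up else mu)
        if clock + dur > t:
            return up
        clock += dur
        up = not up

lam, mu, t = 1.0, 10.0, 0.3     # illustrative failure and repair rates
rng = random.Random(1)
n = 100_000
est = sum(up_at(t, lam, mu, rng) for _ in range(n)) / n
exact = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)
print(est, exact)
```

As t grows the exponential term dies out and both estimates approach the steady-state value μ/(λ+μ).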

  31. Availability • System working at time t in one of two ways: • Never failed in (0, t), with probability R(t), or • First failed and got repaired at some time x < t (first repair completed in (x, x + dx)) and is UP at the end of the interval (x, t)

  32. Availability (Continued) • MTTR: Mean Time To Repair • Y: repair period of the system, so MTTR = E[Y] = ∫₀^∞ (1 − G(t)) dt • Availability and Reliability are related but different!

  33. Availability (Continued) • Steady-State Availability: A = lim t→∞ A(t) • We can show that for systems without redundancy: A = MTTF / (MTTF + MTTR) • For a system with redundancy: A = MTTFeq / (MTTFeq + MTTReq), where MTTFeq & MTTReq must be carefully defined • Also: steady-state unavailability U = 1 − A = MTTR / (MTTF + MTTR)
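A back-of-the-envelope sketch of the steady-state formula, using illustrative MTTF/MTTR values (not course data), also showing the downtime implied per year, which connects to the 99.999%-availability discussion earlier in the deck:

```python
# Steady-state availability A = MTTF / (MTTF + MTTR) and the downtime it
# implies per year; the MTTF and MTTR values below are illustrative.
mttf = 4000.0   # hours between failures
mttr = 2.0      # hours to repair
A = mttf / (mttf + mttr)
downtime_min_per_year = (1 - A) * 365 * 24 * 60
print(A, downtime_min_per_year)   # ~0.9995, ~263 minutes/year
```

By contrast, five-nines availability (A = 0.99999) allows only about 5.3 minutes of downtime per year, which is why MTTR reduction matters as much as MTTF improvement.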

  34. MEASURES TO BE EVALUATED • Dependability • Reliability: R(t), System MTTF • Availability: Steady-state, Transient • Downtime • Performance • Throughput, Blocking Probability, Response Time “Does it work, and for how long?'' “Given that it works, how well does it work?''

  35. MEASURES TO BE EVALUATED (Continued) • Composite Performance and Dependability • Need Techniques and Tools That Can Evaluate • Performance, Dependability and Their Combinations “How much work will be done(lost) in a given interval including the effects of failure/repair/contention?''

  36. Methods of EVALUATION • Measurement-Based Most believable, most expensive Not always possible or cost effective during system design • Statistical techniques are very important here • Model-Based

  37. Methods of EVALUATION(Continued) • Model-Based Less believable, Less expensive 1. Discrete-Event Simulation vs. Analytic 2. State-Space Methods vs. Non-State-Space Methods 3. Hybrid: Simulation + Analytic (SPNP) 4. State Space + Non-State Space (SHARPE)

  38. Methods of EVALUATION (Continued) • Measurements + Models • Vaidyanathan et al., ISSRE ’99

  39. QUANTITATIVE EVALUATION TAXONOMY (figure): closed-form solution vs. numerical solution using a tool

  40. Note that • Both measurements & simulations imply statistical analysis of outputs (ch. 10,11) • Statistical inference • Hypothesis testing • Design of experiments • Analysis of variance • Regression (linear, nonlinear) • Distribution driven simulation requires generation of random deviates (variates) (ch. 3, 4, 5) • Probability and Statistics are different yet highly related • Probability models need inputs that generally come from measurement data (followed by statistical inference) • Statistics in turn uses probability theory

  41. MODELING THROUGHOUT SYSTEM LIFECYCLE • System Specification/Design Phase Answer “What-if Questions'' • Compare design alternatives (Bedrock, Wireless handoff) • Performance-Dependability Trade-offs (Wireless Handoff) • Design Optimization (optimizing the number of guard channels)

  42. MODELING THROUGHOUT SYSTEM LIFECYCLE (Continued) • Design Verification Phase Use Measurements + Models E.g. Fault/Injection + Availability Model Union Switch and Signals, Boeing, Draper • Configuration Selection Phase: DEC, HP • System Operational Phase: IDEN Project Workload based adaptive rejuvenation • It is fun!

  43. MODELING TAXONOMY

  44. MODELER'S DILEMMA Should I Use Discrete-Event Simulation? • Point Estimates and Confidence Intervals • How many simulation runs are sufficient? • What Specification Language to use? • C, SIMULA, SIMSCRIPT, MODSIM, GPSS, RESQ, SPNP v6, Bones, SES workbench

  45. MODELER'S DILEMMA (Continued) • Simulation: + Detailed System Behavior including non-exponential distributions & non-Poisson arrival processes + Performance, Availability and Performability Modeling Possible - Long Execution Time (Variance Reduction Possible) • Importance sampling, importance splitting, regenerative simulation • Parallel and Distributed Simulation - Many users in practice do not realize the need to calculate confidence intervals
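A minimal sketch of the point-estimate-plus-confidence-interval calculation from independent replications. The replication outputs here are stand-in Exp(1) draws, and the Student-t critical value for 29 degrees of freedom is supplied by hand rather than computed:

```python
import math
import random
import statistics

def ci_mean(samples, t_crit):
    """Point estimate and half-width of a confidence interval for the mean
    of i.i.d. replications, given a Student-t critical value."""
    m = statistics.mean(samples)
    s = statistics.stdev(samples)
    return m, t_crit * s / math.sqrt(len(samples))

rng = random.Random(7)
# Stand-in replication outputs: each "run" reports an Exp(1) response time.
reps = [rng.expovariate(1.0) for _ in range(30)]
m, half = ci_mean(reps, t_crit=2.045)   # t_{0.975, 29} ≈ 2.045
print(f"mean response time: {m:.3f} ± {half:.3f} (95% CI)")
```

A point estimate without the half-width is exactly the mistake the slide warns about: a single simulation run gives one draw from a random quantity, not "the" answer.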

  46. MODELER'S DILEMMA (Continued) Should I Use Non-State-Space Methods? • Model Solved Without Generating State Space • Also Known as Combinatorial Models • Use: Order Statistics, Mixing, Convolution • Common Dependability Model Types: • Series-Parallel Reliability Block Diagrams (RBD) • Non-Series-Parallel Block Diagrams (or Reliability Graphs) • Fault Trees Without Repeated Events • Fault Trees With Repeated Events
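For series-parallel RBDs with independent blocks, the combinatorial solution reduces to two rules: reliabilities multiply in series, and unreliabilities multiply in parallel. A small sketch with illustrative block reliabilities (the example system is invented, not taken from the slides):

```python
# Series-parallel RBD rules, assuming independent blocks:
#   series:   reliabilities multiply
#   parallel: unreliabilities multiply
def series(*rs):
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel(*rs):
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

# Illustrative system: two redundant units (R = 0.9 each) in series
# with a controller (R = 0.99).
r_sys = series(parallel(0.9, 0.9), 0.99)
print(r_sys)   # 0.99 * (1 - 0.1 * 0.1) = 0.9801
```

Because the two rules compose, any series-parallel diagram can be evaluated bottom-up without enumerating system states, which is precisely what makes these non-state-space methods cheap.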

  47. RBD example

  48. RELIABILITY GRAPH Example

  49. Fault tree without repeated events

  50. FAULT TREE WITH REPEATED EVENTS EXAMPLE
