probability and statistics with reliability queuing and computer science applications introduction n.
Skip this Video
Loading SlideShow in 5 Seconds..
Probability and Statistics with Reliability, Queuing and Computer Science Applications: Introduction PowerPoint Presentation
Download Presentation
Probability and Statistics with Reliability, Queuing and Computer Science Applications: Introduction

Probability and Statistics with Reliability, Queuing and Computer Science Applications: Introduction

1720 Views Download Presentation
Download Presentation

Probability and Statistics with Reliability, Queuing and Computer Science Applications: Introduction

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Probability and Statistics with Reliability, Queuing and Computer Science Applications: Introduction Kishor S. Trivedi Visiting Prof. Of Computer Science and Engineering, IITK Prof. Department of Electrical and Computer Engineering Duke University Durham, NC 27708-0291 Phone: 7576 e-mail: URL: IIT Kanpur

  2. Outline • Introduction • Reliability, Availability, Security, Performance, Performability • Methods of Evaluation • Evaluation Vs. Optimization • Model construction, parameterization,solution,validation, interpretation • Preliminaries: Sample Space, Probability Axioms, Independence, Conditioning, Binomial Trials • Random Variables: Binomial, Poisson, Exponential, Weibull, Erlang, Hyperexponential, Hypoexponential, Pareto, Defective • Reliability, Hazard Rate • Average Case Analysis of Program Performance • Reliability Analysis Using Block Diagrams and Fault Trees • Reliability of Standby Systems • Statistical Inference Including Confidence Intervals • Hypothesis Testing • Regression

  3. Schedule & Textbooks • Schedule: Jan 21, 23, 28 and Feb 6, 18, 25, 27 • Probability & Statistics with reliability, queuing, and computer science applications, K. S. Trivedi, second edition, John Wiley & Sons, 2001. • Performance and reliability analysis of computer systems: An Example-Based Approach Using the SHARPE Software Package, Sahner, Trivedi, Puliafito, Kluwer Academic Publishers, 1996.

  4. Program Performance Evaluation • Worst-case vs. Average case analysis • Data-structure-oriented vs. Control structure-oriented • Sequential vs. Concurrent • Centralized vs. Distributed • Structured vs. with unrestricted transfer of control • Unlimited (hardware) resources vs. limited resources • Software architecture: modules, their characteristics (execution time) and interactions (branching, looping) • Measures: completion time (mean, variance & dist.) • Measurements or Models (simulation vs. analytic) analytic models: combinatorial, DTMC, SMP, CTMC, SPN

  5. System Performance Evaluation • Workload: traffic arrivals, service time distributions pattern of resource requests • Hardware architecture and software architecture • Resource Contention, Scheduling & Allocation • Concurrency, Synchronization, distributed processing • Timeliness (Have to Meet Deadlines) • Measures: Thruput, Goodput, loss probability, response time or delay (mean, variance & dist.) • Low-level (Cache, memory interference: ch. 7) • System-level (CPU-I/O, multiprocessing: ch. 8,9) • Network-level (protocols, handoff in wireless: ch. 7,8) • Measurements or models (simulation or analytic) analytic models: DTMC, CTMC, PFQN, SPN

  6. System Performance Evaluation • Workload: • Single vs. multiple types of requests (classes, chains) • The following items needed for each type of request: • traffic arrivals: one time vs. a stream stream: Poisson (Bernoulli), General renewal, IPP (IBP), MMPP(MMBP), MAP, BMAP, NHPP, Self-similar • service time distributions: Exponential (geometric), deterministic, uniform, Erlang, Hyperexponential, Hypoexponential, Phase-type, general (with finite mean and variance), Pareto • pattern of resource requests: service time distribution (or the mean) at each resource per visit, branching probabilities; often described as a DTMC (discrete-time Markov chain) and can also be seen as the behavior of an individual program • All this information should be collected from actual measurements (if possible) followed by statistical inference

  7. Software Reliability • Black-box (measurements+ statistical inference) vs. Architecture-based approach (models) • Black-box approaches treat software as a monolithic whole, considering only its interactions with external environment, without an attempt to model its internal structure • With growing emphasis on reuse, software development process moves toward component-based software design • White-box approach may be better to analyze a system with many software components and how they fit together

  8. Software Architecture • Software behavior with respect to the manner in which different components interact • May include the information about the execution time of each component • Use control flow graph to represent architecture • Sequential program architecture modeled by • Discrete Time Markov Chain (DTMC) • Continuous Time Markov Chain (CTMC) • Semi-Markov process (SMP)

  9. Failure Behavior of Components and Interfaces Failure can happen • during the execution of any component or • during the transfer of control between components Failure behavior can be specified in terms of • reliability • constant failure rate • time-dependent failure intensity

  10. System Reliability/Availability • Faultload: fault types, fault arrivals, repair/recovery procedures and delay time distributions • Hardware architecture and software architecture • Minimum Resource Requirements • Dynamic failures • Performance/Reliability interdependence • Measures: Reliability, Availability, MTTF, Downtime • Low-level (Physics of failures, chip level) • System-level (CPU-I/O, multiprocessing: ch. 8,9) • Software and Hardware combined together • Network-level • Measurements or models (simulation or analytic) analytic models: RBD, FTREE, CTMC, SPN

  11. Definition of Reliability • Recommendations E.800 of the International Telecommunications Union (ITU-T) defines reliability as follows: • “The ability of an item to perform a required function under given conditions for a given time interval.” • In this definition, an item may be a circuit board, a component on a circuit board, a module consisting of several circuit boards, a base transceiver station with several modules, a fiber-optic transport-system, or a mobile switching center (MSC) and all its subtending network elements. The definition includes systems with software.

  12. Definition of Availability • Availability is closely related to reliability, and is also defined in ITU-T Recommendation E.800 as follows:[1] "The ability of an item to be in a state to perform a required function at a given instant of time or at any instant of time within a given time interval, assuming that the external resources, if required, are provided." • An important difference between reliability and availability is that reliability refers to failure-free operation during an interval, while availability refers to failure-free operation at a given instant of time, usually the time when a device or system is first accessed to provide a required function or service

  13. High Reliability/Availability/Safety • Traditional applications (long-life/life-critical/safety-critical) • Space missions, aircraft control, defense, nuclear systems • New applications (non-life-critical/non-safety-critical, business critical) • Banking, airline reservation, e-commerce applications, web-hosting, telecommunication • Scientific applications (non-critical)

  14. Motivation: High Availability • Scott McNealy, Sun Microsystems Inc. • "We're paying people for uptime.The only thing that really matters is uptime, uptime, uptime, uptime and uptime. I want to get it down to a handful of times you might want to bring a Sun computer down in a year. I'm spending all my time with employees to get this design goal” • SUN Microsystems – SunUP & RASCAL program for high-availability • Motorola - 5NINES Initiative • HP, Cisco, Oracle, SAP - 5nines:5minutes Alliance • IBM – Cornhusker clustering technology for high-availability, eLiza, autonomic computing • Microsoft – Trustable computing initiative • John Hennessey – in IEEE Computer • Microsoft – Regular full page ad on 99.999% availability in USA Today

  15. Motivation – High Availability

  16. Need for a new term • Reliability is used in a generic sense • Reliability used as a precisely defined mathematical function • To remove confusion, IFIP WG 10.4 has proposed Dependability as an umbrella term


  18. IFIP WG10.4 • Failure occurs when the delivered service no longer complies with the specification • Error is that part of the system state which is liable to lead to subsequent failure • Fault is adjudged or hypothesized cause of an error Faults are the cause of errors that may lead to failures Fault Error Failure

  19. Dependability:Reliability, Availability,Safety, Security • Redundancy: Hardware (Static,Dynamic), Information, Time, software • Fault Types: Permanent (needs repair or replacement), Intermittent (reboot/restart or replacement), Transient (retry), Design : Heisenbugs, Aging related bugs Bohrbugs • Fault Detection, Automated Reconfiguration • Imperfect Coverage • Maintenance: scheduled, unscheduled

  20. Software Fault Classification • Many software bugs are reproducible, easily found and fixed during the testing and debugging phase • Other bugs that are hard to find and fix remain in the software during the operational phase • These bugs may never be fixed, but if the operation is retried or the system is rebooted, the bugs may not manifest themselves as failures • manifestation is non-deterministic and dependent on the software reaching very rare states Bohrbugs Heisenbugs

  21. Software Bohrbugs Heisenbugs “Aging” related bugs Test/ Debug Retry opn. Des./Data Diversity Restart app. Reboot node Design/ Development Operational Software Fault Classification

  22. Failure Classification (Cristian) • Failures • Omission failures (Send/receive failures) • Crash failures • Infinite loop • Timing failures • Early • Late (performance or dynamic failures) • Response failures • Value failures • State-transition failures

  23. Security • Security intrusions cause a system to fail • Security Failure • Integrity: Destruction/Unauthorized modification of information • Confidentiality: Theft of information • Availability: e.g., Denial of Services (DoS) • Similarity (as well as differences) between: • Malicious vs. accidental faults • Security vs. reliability/availability • Intrusion tolerance vs. fault tolerance

  24. The Need of Performability Modeling • New technologies, services & standards need new modeling methodologies • Pure performance modeling: too optimistic! Outage-and-recovery behavior not considered • Pure dependability modeling: too conservative! Different levels of performance not considered

  25. for a specified operational time Reliability Availability at any given instant Performance under failures Survivability “ilities” besides performance Performability measures of the systems ability to perform designated functions R.A.S.-ability concerns grow. High-R.A.S. not only a selling point for equipment vendors and service providers. But, regulatory outage report required by FCC for public switched telephone networks (PSTN) may soon apply to wireless.

  26. Evaluation vs. Optimization • Evaluation of system for desired measures given a set of parameters • Sensitivity Analysis • Bottleneck analysis • Reliability importance • Optimization • Static:Linear,integer,geometric,nonlinear, multi-objective; constrained or unconstrained • Dynamic: Dynamic programming, Markov decision process, semi-Markov decision process

  27. PURPOSE OF EVALUATION • Understanding a system • Observation Operational environment Controlled environment • Reasoning A model is a convenient abstraction • Predicting behavior of a system • Need a model • Accuracy based on degree of extrapolation

  28. PURPOSE OF EVALUATION(Continued) These famous quotes bring out the difficulty of prediction based on models: • “All Models are Wrong; Some Models are Useful”George Box • “Prediction is fine as long as it is not about the future” Mark Twain

  29. Basic Definitions • Reliability R(t): X : time to failure of a system F(t): distribution function of system lifetime • Mean Time To system Failure: f(t): density function of system lifetime

  30. Availability (Continued) • Instantaneous (point) Availability A(t): A(t) = P (system working at t) Let H(t) be the convolution of F and G: • g(t): density function of system repair time Then: Inst. Availability , , Reliability

  31. Availability Never failed in (0,t), prob: R(t) • System working at time t First failed and got repaired at time x<t & UP at end of interval (x,t), prob: x + dx t x 0 First repair completed here

  32. Availability (Continued) • MTTR: Mean Time to Repair • Y: repair period of the system • Availability and Reliability are related but different!

  33. Availability (Continued) • We can show from equation (1) that: • Also:

  34. Availability (Continued) • Steady-State Availability: • There are two kinds of Availabilities! • Instantaneous &Steady-state • For a system with high degree of redundancy where MTTFeq & MTTReq must be carefully defined; they can be computed using SHARPE

  35. MEASURES TO BE EVALUATED • Dependability • Reliability: R(t), System MTTF • Availability: Steady-state, Transient; Downtime • Safety, security • Performance • Throughput, Blocking Probability, Response Time “Does it work, and for how long?'' “Given that it works, how well does it work?''

  36. MEASURES TO BE EVALUATED (Continued) • Composite Performance and Dependability • Need Techniques and Tools That Can Evaluate • Performance, Dependability and Their Combinations “How much work will be done(lost) in a given interval including the effects of failure/repair/contention?''

  37. Methods of EVALUATION • Measurement-Based Most believable, most expensive Not always possible or cost effective during system design • Statistical techniques are very important here • Model-Based

  38. Methods of EVALUATION(Continued) • Model-Based Less believable, Less expensive 1. Discrete-Event Simulation vs. Analytic 2. State-Space Methods vs. Non-State-Space Methods 3. Hybrid: Simulation + Analytic (SPNP) 4. State Space + Non-State Space (SHARPE)

  39. Methods of EVALUATION(Continued) • Measurements + Models Vaidyanathan et al ISSRE 99

  40. QUANTITATIVE EVALUATION TAXONOMY Closed-form solution Numerical solution using a tool

  41. Note that • Both measurements & simulations imply statistical analysis of outputs (ch. 10,11) • Statistical inference • Hypothesis testing • Design of experiments • Analysis of variance • Regression (linear, nonlinear) • Distribution driven simulation requires generation of random deviates (variates) (ch. 3, 4, 5) • Probability and Statistics are different yet highly related • Probability models need inputs that generally come from measurement data (followed by statistical inference) • Statistics in turn uses probability theory


  43. ANALYTIC MODELING TAXONOMY NON-STATE SPACE MODELING TECHNIQUES SP reliability block diagrams Non-SP reliability block diagrams

  44. State Space Modeling Taxonomy discrete-time Markov chains Markovian modeling continuous-time Markov chains Markov reward models State space methods Semi-Markov models non-Markovian modeling Markov regenerative models Non-Homogeneous Markov

  45. Modeling Steps • Model construction • Model parameterization • Model solution • Result interpretation • Model Validation

  46. MODELING AND MEASUREMENTS: INTERFACES • Measurements supply Input Parameters to Models (Model Calibration or Parameterization) Confidence Intervals should be obtained Boeing, Draper, Union Switch projects • Model Sensitivity Analysis can suggest which Parameters to Measure More Accurately: Blake, Reibman and Trivedi: SIGMETRICS 1988.

  47. MODELING AND MEASUREMENTS: INTERFACES • Model Validation 1. Face Validation 2. Input-Output Validation 3. Validation of Model Assumptions (Hypothesis Testing) • Rejection of a hypothesis regarding model assumption based on measurement data leads to an improved model

  48. MODELING AND MEASUREMENTS: INTERFACES • Model Structure Based on Measurement Data • Hsueh, Iyer and Trivedi; IEEE TC, April 1988 • Gokhale et al, IPDS 98; • Vaidyanathan et al, ISSRE99


  50. ANALYTIC MODELING TAXONOMY NON-STATE SPACE MODELING TECHNIQUES SP reliability block diagrams Non-SP reliability block diagrams