1 / 47

Characterization and Mitigation of Failures in Complex Systems

Characterization and Mitigation of Failures in Complex Systems. Dr. K. S. Trivedi Dr. Y. Hong. Center for Advanced Computing and Communications Dept. of Electrical & Computer Eng Duke University Durham, NC 27708-0294. System Dependability. Faults.

hova
Download Presentation

Characterization and Mitigation of Failures in Complex Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Characterization and Mitigation of Failures in Complex Systems Dr. K. S. Trivedi Dr. Y. Hong Center for Advanced Computing and Communications Dept. of Electrical & Computer Eng Duke University Durham, NC 27708-0294

  2. System Dependability Faults Impairments Errors Failures Fault prevention (FP) Procurement Fault tolerance (FT) Dependability Means Fault removal (FR) Validation Fault forecasting (FF) Performability Reliability + Performance Availability + Service Survivability Attributes Security Safety Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  3. System Dependability FP Minimal maintenance Redundancy: hardware,software,information,time FT Diversity: data, design, environment Applications Verification (testing) FR Maintenance: corrective (repair) and preventive Probabilistic modeling FF Non-probabilistic modeling Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  4. Modeling and Analysis Fault forecasting Fault removal Probabilistic Modeling Methods & Tools Fault tolerance Fault prevention Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  5. A Research Triangle Theory Modeling methods and Numerical solutions: CTMC, SRN, MRSPN FSPN, NHCTMC, FTREE Applications Tools SHARPE SPNP SREPT SRA Hardware and software: Fault-tolerant Systems Real-time Systems Computer Networks Wireless Networks Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  6. Modeling Methods Fault trees SMP MRGP DTMC, CTMC NHCTMC SRN Probabilistic modeling SPN MRSPN FSPN SDE FBM, FGN HS Nonlinear dynamics Non-probabilistic modeling HS Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  7. CTMC  SMP  MRGP CTMC SMP state state t t T ~ Exp T ~ Gen T ~ Exp state MRGP t T1 ~ Gen T1 ~ Gen Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  8. Relation: Dependability Models GSPN CTMC SRN MRM FTRE RG RBD FT Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  9. Modeling of HS and FSPN • Statistical multiplexed ATM switch and HS model: B TRAFFIC B t ATM SWITCH High priority Low priority Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  10. Modeling of HS and FSPN FSPN model for ATM switch K Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  11. Solution Techniques • Algorithms • Fast algorithms for network reliability and fault tree computation • Fast algorithms for SRN • Numerical and simulation tools • SHARPE (symbolic hierarchical automated reliability and performance evaluator) • SREPT (software reliability estimation prediction tool) • SPNP (stochastic Petri net package) • SRA (software rejuvenation agents) Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  12. Architecture of SHARPE ftree SMP/MRGP Markov block Hierarchical & Hybrid Compositions GSPN/SRN relgraph graph pfqn, mfqn Reliability/availability Performability Performance analysis Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  13. Architecture of SPNP Stochastic Reward Net (SRN) models Markovian Stochastic Petri Net Non-Markovian Stochastic Petri Net Fluid Stochastic Petri Net (FSPN) Reachability Graph Analytic-Numeric Method Discrete Event Simulation (DES) Steady-State Transient Steady-State Transient SOR Std. Uniformization Batch Means Indep. Replication Gauss-Seidel Fox-Glynn Method Regenerative Simulation Restart Fast Methods Power Method Stiff Uniformization Importance Sampling Importance Splitting Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  14. Architecture of SREPT SREPT Architecture-based approach Black-box approach Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  15. Black-box Approach Estimate of the number of faults Fault Density Software Product/Process Metrics + Past experience Regression Tree Predicted 1. failure intensity 2. #faults remaining 3. reliability after release Test coverage ENHPP model Sequence of Interfailure times ENHPP model Predicted 1. failure intensity 2. #faults remaining 3. reliability after release 4. coverage ENHPP model Sequence of Interfailure times & Release criteria Optimization & ENHPP model Predicted release time and Predicted coverage Predicted 1. failure intensity 2. #faults remaining 3. reliability after release NHCTMC Failure intensity Debugging policies Discrete-event simulation Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  16. Architecture Analytical modeling Reliability and Performance prediction Failure behavior of components Discrete-event simulation Architecture-based Approach Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  17. Challenges Fault trees using BDD Largeness Automated generation/solution based SPN, SRN Markov model Hierarchical methods, fixed point iteration Stiffness NHCTMC Non-exponential distributions SMP, MRGP DES (Discrete-Event Simulation) Combined performance and reliability/availability MRM SRN FSPN HS Joint continuous & discrete state spaces SDE Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  18. Applications • For high-performance and high-dependability • Software • Software reliability (FR,FF) • Software rejuvenation (FF,FR) • Software fault-tolerance (FT,FF) • Hardware (with software) • Preventive maintenance (FR,FF) • Availability model (FF) • Upgrade design (FT,FF) • Performability of wireless communications (FF) • Transient behavior of ATM networks with failure (FF) • Error recovery in communication networks (FF) Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  19. Software Reliability • Early prediction of quality based on software product/process metrics. • Reliability growth modeling based on failure and coverage data collected during testing • Architecture-based software reliability • Developed a tool (SREPT) Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  20. Software Rejuvenation Definition: proactive fault management technique for the systems to counteract the effect of aging Approaches: • Periodic: Rejuvenation at regular (deterministic) time intervals irrespective of system load • Periodic and instantaneous load: Rejuvenation at regular intervals and wait until system load is zero • Prediction-based: Rejuvenation at times based on estimated times to failure (implemented in IBM Netfinity Director) • Purely time-based estimation • Time and workload-based estimation Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  21. Recovering Available A C B Analytical Modeling Undergoing rejuvenation Subordinated non-homogeneous CTMCfor t = d Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  22. Numerical Example Service rate and failure rate are functions of time, m(t) and r(t) Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  23. Measurement-based Estimation • Objective detection and validation of software aging • Basic idea periodically monitor and collect data on the attributes responsible for determining the health of the executing software • Quantifying the effect of aging proposed metric - Estimated time to exhaustion Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  24. Non-parametric Regression Smoothing of Rossby data - Real Memory Free Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  25. Software Fault Tolerance • Recovery block: • common-mode software failures, • common-mode acceptance test failures, • failures clustered in the input stream • A generalized model for recovery block strategy to capture the dependencies: • operate input within a recovery block execution • evaluate module output in the recovery block • difficult inputs clustered in input space Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  26. SRN Model for Recovery Block A recovery block with clustered failures Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  27. Hardware Maintenance • Conditioned based maintenance: • CTMC: closed form solution • SHARPE: numerical solution • Find optimal inspection interval Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  28. Hardware Maintenance Comparison: Closed form result & Numerical result Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  29. Hardware Maintenance • Time based maintenance: • SMP based solution (PM: preventive maintenance) • Find optimal maintenance interval Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  30. Availability Models Availability models commonly used in practice assume that times to outage and recovery are exponentially distributed. • How accurate will the all-exponential models be for systems w/ limited information of outage/recovery? • Can we give availability bounds for such systems? Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  31. Result of a General Model • For a system with multiple types of outage-recovery (non-exponentially distributed outrages), the underlying stochastic process is a semi-Markov process (SMP). • We give a closed-form formula of system availability. • Findings – • Only the mean value of time-to-recovery (E[TTR]) affects the availability of systems. The distribution does not matter. • However, the distribution of Time-to-outage (TTO) does affect availability. Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  32. Upgrade with Redundancy • Load-sharing clustering is common in networks (wired, wireless, optical) and server systems. • How to take advantage of the cluster structure and redundancy to upgrade hardware and software components? • How to quantify system performance for different upgrade schemes? Active node Standby node Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  33. Upgrade Schemes • Direct transfer: Simple, but long downtime, for non-realtime-critical system. • Phased upgrade: Almost zero downtime, upgrade paradox, version incompatibility, etc. Baseline model Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  34. Control channel down System down(!) Performability of Cellular Control Channel Protection • Each cell has Nb base repeaters (BR) • Each BR provides M TDM channels • One control channel resides in one of the BRs Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  35. control_down causes only system partially down no longer a full outage! Asys Pblock Pdrop APS Automatic Protection Switch Upon control_down, the failed control channel is automatically switched to a channel on a working base repeater. Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  36. Model of System w/o APS CTMC State (b,k) b=No. of BR up k=No. of talking channels Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  37. Model of System w/ APS CTMC State (b,k) b=No. of BR up k=No. of talking channels Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University A Segment of the Composite Markov Chain Model

  38. Numerical Results Handoff Call Blocking Probability Improvement by APS Unavailability in handoff call dropping probability Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  39. s0 ON l1 0FF l0 s1 ATM Networks under Overloads • To quantify the effects of transients in ATM networks with • Markov Modulated Poisson (MMPP) used to allow for • Correlated arrivals • Transient effects: relaxation time, maximum overshoot, • and Expected excess number of losses in overload High Low 2-state MMPP traffic model Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  40. Queuing Model for ATM A failure occurs in the connection (N1 and N3). Therefore, the traffic destined for N3 is re-routed through N2 Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  41. SRN Model for ATM Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  42. Numerical Results Loss probability vs. burstiness Queue length vs. burstiness Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  43. Error Recovery in Communication Study error recovery in communication networks using (non-exponential detection/restoration times) MRGP 1:N protection switching: Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  44. Two Modeling Methods CTMC is used with ignoring the detection and restoration times MRGP (a non-Markovian modeling) is used with including the detection and restoration times Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  45. Numerical Result Comparison of CTMC and non-Markovian model Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  46. Open Problems • Theoretical studies in the modeling and analysis of complex systems and failures • Capabilities, limitations, and relationships among formalisms such as FSPN and HS, FSPN and SDE. • Fast algorithms • Fast algorithms for FSPN • Fast algorithms for discrete-event simulation • Fast algorithms for non-Markovian models • Applications • Software: further exploration of software rejuvenation • Performability analysis of (wired and wireless) networks Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

  47. The End Thank you! Center for Advanced Computing and Communication Department of Electrical and Computer Engineering, Duke University

More Related