PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

PhD Course: Performance & Reliability Analysis of IP-based Communication Networks Henrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen • Day 1 Basics & Simple Queueing Models(HPS) • Day 2 Traffic Measurements and Traffic Models (MC) • Day 3 Advanced Queueing Models & Stochastic Control (HPS) • Day 4 Network Models and Application (HS) • Day 5 Simulation Techniques (HPS, SA) • Day 6 Reliability Aspects (HPS) Organized by HP Schwefel & H Schiøler

Introduction • System availablility/reliablility/dependability= ”ability of system to provide ’correct’ (specified) functionality” • for communication networks, e.g. • Connectivity • Services (network services, applications) • Properties when applied to complex systems: • Specification of ’correct’ functionality difficult (as well as verification and testing) • Stateful systems: individual faults lead to subsequent dependent faults (significance of fault detection) • Dependability is in general load dependent; faults influence performance  strong mutual relation between performance and dependability aspects (’performability’) • Analysis of dependability aspects via • Analytic models: Frequently independence and Markovian assumptions • Simulation models: Relevance of rare-event techniques • Experimental Settings (Prototype/Test systems)

Main Application Areas • Safety critical systems • Air-plane control systems • Vehicular electronic systems (e.g. ABS) • Satellites/Space missions • Business critical systems • Electronic stock markets • Electronic ordering/booking systems • High-Performance/Parallel Computing • Distributed Systems: Telecommunication systems • Traditional PSTNs: Availability 99.94% (end-to-end) • Internet: below 99%

Motivation: Failure Types in Communication Networks Source: Hal Stern, Evan Marcus WANs PSTNs

Economic Aspects of System Failures

Content Today: Basic concepts & Mathematical models • Basic concepts • Fault, failure, fault-tolerance, redundancy types • Availability, reliability, dependability • Life-times, hazard rates • Mathematical models • Availability models • Life-time models • Repair-Models • Performability Models • Availability and Network Management • Rolling upgrade, planned downtime, etc. • High-Availability for Hosts/Servers • Cluster solution, distributed approach • Network Availability • SCTP, HSRP, VRRP March 23: Architectures & Protocols Note: this slide-set contains only the outline of the lecture; many formulas and the mathematical derivations will be presented on the black-board!

Basic concepts (I) • Fault/defect/error: component behavior deviates from desired behavior • Transient vs. Persistent faults • Possible causes: • Design errors • Aging/wear-out • Mis-use • Maintenance (planned down-times) • System Failure • Observed system behavior deviates from specification, defined by • Failure criteria (time/functionality) • Note: Faults are a necessary condition for failures but not sufficient • faul-tolerance, ’robust’ failure criteria • Redundancy • Active (functional) vs passive (non-functional, stand-by) • Structural (components) vs. Information (e.g. additional header fields) vs. Temporal (retries)

Basic concepts (II): Metrics • Availability (’accessibility’) A(t) • System operational at time t • System accepts a service request at time t • Reliability R(t1,t2) • System operational within time interval [t1,t2] provided it was operational (’accessible’) at time t1 • System serves a request given that it had accepted it at time t1 • Dependability D(t1,t2) • System available at t1 and operational within time interval [t1,t2] • Successfully accepts and serves a service request • Systems without repair • Life-times LT • Hazard/failure rates • Systems with repair • Down-times, up-times • Distribution/Mean of time to failure • Distribution/Mean of time to repair

Content Today: Basic concepts & Mathematical models • Basic concepts • Fault, failure, fault-tolerance, redundancy types • Availability, reliability, dependability • Life-times, hazard rates • Mathematical models • Availability models • Life-time models • Repair-Models • Performability Models • Availability and Network Management • Rolling upgrade, planned downtime, etc. • High-Availability for Hosts/Servers • Cluster solution, distributed approach • Network Availability • SCTP, HSRP, VRRP March 24: Architectures & Protocols

Availability Models (I) • System description • System consists of N components • Components states Z1,...,ZN availability state vector • Binary: Zi in {0,1} • Multi-state or continuous • System availability: function of component availability Z=f(Z1,...,ZN) • Disjunct normal form (DNF) of binary availability function • Monotonicity, coherence (no irrelevant components) • Availability graph (binary case) • Graph: Start node S, End-node E, Path for each ’summand’ in DNF, node for each ’factor’ • Representation of system availability function • Simplification of graph through factorization

Availability Models (II) Common Patterns (with independence assumptions!) • Serial structures • All components need to be available for system availability • A=Ai • Boolscher Term and Availability Graph: • Parallel structures • At least one component has to be available for system availability • A=1-(1-Ai) • Boolscher Term and Availability Graph: • k-out-of-n systems • At least k out of n identical components have to be available for system availability • Use-cases: load-balancing/throughput, failure detection by majority votes • A=ni=k { binomial(n,i) A1i (1-A1)n-i} • Special cases: Parallel structure (k=1) and serial structure (k=n) • General case: A=...

Availability Models (III): Examples Examples • Triple Modular Redundancy (TMR) • 3 functionally identical components, result by majority vote2-out-of-3 case • Efficient use of redundancy in serial structure? • Replication of serial structure • Replicate each component • Combinations when neglecting components necessary for • Failure detection • Node Elimination • Majority votes

Lifetime models (I) • States time-dependent Zi(t) and Z(t)  description as stochastic process • Time LT=mint{Z(t)=0} is called life-time of system • Compare to first passage times in queueing models • Reliability: R(t)=Pr(Z(t)=1) [for systems without repair] • Serial structure: R(t)=Prod(Ri(t)) [independence!] • Example: exponential distributed LTs E[LT]=1/(sum i) • Parallel structure: R(t)= 1- Prod(1-Ri(t)) • Example: exponential distributed LTs  E[LT]=1/ sum(1/i) (law of diminishing return) • Hazard/failure Rate: • (t)=lim0 1/ Pr(’failure occurs in interval [t;t+[ | system operational at time t’) • Constant Hazard rate  exponential life-time

Lifetime models (II) • Typical hazard rates • HW: bath-tub functions • Exponential LT • Power-tailed LT • Examples: • Exponentially distributed lifetimes of components • Hazard Rate of serial structure equals sum of individual rates • TMR • R(t), E(LT), condition for RTMR(t)>R(t) • Computation of R(t) in Markovian Models • Via Matrix Exponential distributions

Repair models • Description of component failure and repair processes • Repair time distribution with mean MTTR • Failure Time distribution with mean MTTF Common assumption for modeling: Exponential repair time and exponential time to failure • Availability: A(t)=Pr(’system operational at time t’) • lim A(t) = MTTF/(MTTR+MTTF) • Reliability: R(t,t+)=Pr(’system operational in [t ,t+[ | operational at time t’) • Dependability: D(t,t+)= Pr(’system operational in [t ,t+[’) • D(t,t+)= R(t,t+) A(t) (definition of conditional probabilities) • Example: steady-state and transient behavior for • Single back-up component (passive stand-by): stationary probabilities, transient proces • Redundant unit fails only when in operation • Redundant unit can also fail in stand-by • TMR • Matrix Exponential distribution for Time to failur and time to repair

Performability models • Strong relation between system performance and dependability • Load/traffic dependent failure probabilities (e.g. overload scenarios!) • Failures lead to reduced throughput • Additional delay when redundant components have to become active • Failure detection and recovery cause additional traffic • Extensions of traditional queueing networks (Jackson, BCMP, etc.) • Introduction of hazard rates with components • Spontaneous faults: indep. of load • Load-induced faults • Increase of traffic as function of hazard rates (failure detection, etc.) • Service rates as function of hazard rates  Iterative solutions

Exercises: • Availability Models: A service is implemented on two redundant servers in the following network topology. The service is availability if at least one of the servers is up and reachable. • Draw the availability graph for the service availability. • Write down the DNF of the service availability function • Compute the service availability for the following values • Availability of the wireless link: 85% • Availability of the wired links: 99% • Server Availability (HW+SW): 96% • Performability Modles: A redundant network service in passive stand-by is found to be adequately described by an M/M/1 queueing model with request arrival rate (Traffic)  and service rate . In case of a failure (with rate ), the second server takes over until the failed server is repaired (repair rate ). If both servers are failed, requests will be queued until service starts again. Draw the Markov state diagram including state transition rates for this model. How will the transition rates change, if the second server is active and utilized with perfect load-balancing?

References/Acknowledgments • Heyman, Sobel (ed.), ‘Stochastic Models’,Chapt. 13 ‘Reliability and Maintainability’ (Shaked, Shanthikumar), North-Holland,1990 • Stern, Marcus, ‘Blueprints for High Availability: Designing Resilient Distributed Systems’. Wiley, 2000. • E. Jessen, ‘Dependability and Fault-tolerance of computing systems’, lecture notes (German), TU Munich, 1996. • K. S. Trivedi, ‘Probability and Statistics with Reliability, Queueing, and Computer Science Applications’, 1982.

PhD Course: Performance & Reliability Analysis of IP-based Communication Networks