Failure Data Analysis of a Large-Scale Heterogeneous Server Environment

This paper analyzes the statistical properties of system errors and failures in a network of 395 nodes, showing time-varying behavior and periodic patterns. It discusses technological factors, solutions, related work, and system-wide and per-node errors and failures.

Presentation Transcript


  1. Failure Data Analysis of a Large-Scale Heterogeneous Server Environment Authors: Ramendra K. Sahoo, Anand Sivasubramaniam, Mark S. Squillante, Yanyong Zhang Presenter: Sajala Rajendran

  2. Abstract • Faults occur more frequently as the complexity of hardware and software increases • This paper analyzes the empirical and statistical properties of system errors and failures collected from a network of 395 nodes • Results show that the system errors and failures exhibit time-varying behavior, containing long stationary intervals that show strong correlation structures and periodic patterns

  3. Outline • Technological Factors • Solutions • Related Work • System Environment • System–Wide Errors and Failures • Per–Node Errors and Failures • Conclusion

  4. Technological Factors • Lowering operating voltages in order to reduce power consumption makes circuits more susceptible to alpha particles and cosmic rays, which in turn cause Bit-flips : a random event that corrupts the value stored in a memory cell rather than the cell itself • The high workload imposed on these systems leads to thermal instability that results in breakdowns.

  5. • More complex software and applications make systems prone to bugs such as memory leaks, which may lead to crashes. • In parallel/distributed systems where nodes depend on one another, each node is susceptible to another node's failures/errors.

  6. Solutions • Provide sufficient redundancy, e.g. duplicate file servers that can mask problems when failures occur, but the additional software/hardware is more expensive. • Anticipate failures and take pro-active measures in advance, e.g. “software rejuvenation” aims to prevent unexpected/unplanned outages due to software aging. • Take action once a failure has actually occurred, e.g. when nodes/disks fail, replace them.

  7. Observations Each of the above solutions has its own pros and cons. Thus, it is very important to understand the properties of the errors and failures that can occur, which will help in developing schemes to improve system performance and availability.

  8. Related Work • Dong Tang, Ravishankar Iyer and Sujatha Subramani studied a VAX cluster system consisting of 7 machines and 4 storage controllers. • More than 46% of the failures were due to shared resources, and 98.8% of the errors were recovered. A semi-Markov failure model was used to show that the failure distributions on the machines are correlated. • Heath, Martin and Nguyen collected failure data from three clustered servers ranging from 18 to 89 workstations. • Times between failures are independent, and nodes that just failed are more likely to fail again • Time between failures modeled by a Weibull distribution

  9. System Environment • Event logs obtained from 395 nodes in a machine room (single-CPU servers, 2/4/8/12-way SMPs) over 487 days • Workload – long-running scientific and commercial applications

  10. Node Breakdown

  11. Error Logs • The kernel/applications log errors in /dev/error • A daemon process (errdaemon) monitors the above file and compares each entry against a database of error record templates. • Each entry of the error log contains the following information: • Node number (Node) • Error Identifier (ID) • Time Stamp (Time) • Error Type (Type) • Error Class (Class) • Description of the problem (Description)
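Each entry can be thought of as a small record; a minimal sketch in Python, assuming only the six fields listed above (the field names are illustrative, not the actual log format):

```python
# A minimal sketch of one error-log entry; field names are illustrative.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ErrorRecord:
    node: int           # node number (Node)
    event_id: str       # error identifier (ID)
    time: datetime      # time stamp (Time)
    err_type: str       # error type: TEMP, PERF, UNKN, PEND or PERM
    err_class: str      # error class: O, S, U or H
    description: str    # free-text description of the problem
```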

  12. Classification of Errors • Error Types • TEMP: Error condition recovered after a number of unsuccessful attempts. • PERF: Performance degradation • UNKN: Unknown error • PEND: Loss of availability of device or component is forthcoming • PERM: Permanent error

  13. Error Classes • Class O : Informational error only • Class S : Software related errors • Class U : Undetermined errors • Class H : Hardware related errors

  14. Event Processing and Filtering • 6,533,152 events were collected from the nodes • These need to be filtered • Events appearing within a short time interval on a node are the result of the same error, as shown by Tang in his article • Using this result, the filtering algorithm works as follows:

  15. Contd.… • A new event type is recorded as a new event ID, and a new node number as a new node ID • The event ID and node ID at any time T are compared with the event IDs and node IDs of all events since time ( T - Tth ), where Tth is the threshold time, chosen to be 5 minutes • This eliminated 91.82% of the total raw events
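A minimal sketch of this filtering step, assuming the ErrorRecord type sketched above and an event list already sorted by time stamp (the 5-minute threshold comes from the slide; everything else is illustrative):

```python
from datetime import timedelta

T_TH = timedelta(minutes=5)  # threshold Tth: repeats within this window count as the same error

def filter_events(events):
    """Drop events that repeat the same (node, event ID) within T_TH of a previous occurrence."""
    filtered = []
    last_seen = {}  # (node, event_id) -> time of the most recent raw occurrence
    for ev in events:  # assumes events are sorted by ev.time
        key = (ev.node, ev.event_id)
        prev = last_seen.get(key)
        if prev is None or ev.time - prev > T_TH:
            filtered.append(ev)      # first occurrence within the window: keep it
        last_seen[key] = ev.time     # always record the latest occurrence, kept or not
    return filtered
```

Note that each event is compared against the most recent raw occurrence, kept or dropped, matching the slide's "all the events since time ( T - Tth )".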

  16. • Failures – PERM type errors + a small subset of PEND type error entries • Errors – remaining PEND type entries + TEMP + PERF + UNKN type entries
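A minimal sketch of this failure/error split; the slide does not specify which PEND entries count as failures, so that decision is passed in as a flag here (illustrative assumption):

```python
def classify(record, pend_is_failure=False):
    """Return 'failure' for PERM (and selected PEND) entries, 'error' for everything else."""
    if record.err_type == "PERM":
        return "failure"
    if record.err_type == "PEND" and pend_is_failure:
        return "failure"
    return "error"  # remaining PEND + TEMP + PERF + UNKN entries
```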

  17. Background Information • Hazard Rate Function Defined as h(x) = f(x) / (1 – F(x)) h(x) – conditional probability that an item will fail at time x given that it survived until time x f(x) – Marginal density function F(x) – Marginal cumulative distribution function • Cross-Correlation Function Plot of the similarity between one waveform and a time-shifted version of the other, as a function of the time shift • Autocorrelation Function Cross-correlation of a signal with itself
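A minimal sketch of how these three quantities could be estimated from the data with NumPy; the binning and lag choices are illustrative, not the estimators used in the paper:

```python
import numpy as np

def empirical_hazard(inter_event_times, bins=50):
    """Estimate h(x) = f(x) / (1 - F(x)) from a sample of inter-event times."""
    counts, edges = np.histogram(inter_event_times, bins=bins)
    width = np.diff(edges)
    n = len(inter_event_times)
    at_risk = n - np.concatenate(([0], np.cumsum(counts)[:-1]))  # items surviving into each bin
    with np.errstate(divide="ignore", invalid="ignore"):
        return counts / (at_risk * width), edges

def acf(x, max_lag=100):
    """Autocorrelation of a (stationary) rate process at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom for k in range(max_lag + 1)])

def ccf(x, y, max_lag=100):
    """Cross-correlation of two rate processes at lags -max_lag..+max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    vals = []
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            vals.append(np.dot(x[:len(x) - k], y[k:]) / denom)
        else:
            vals.append(np.dot(x[-k:], y[:len(y) + k]) / denom)
    return np.array(vals)
```

Sketches like these could be applied to the stationary intervals discussed on slides 22, 24 and 25.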

  18. System-Wide Errors and Failures

  19. Rate process – Number of events per minute • Increment process – Time between events
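A minimal sketch of turning a sorted list of event time stamps into these two processes, assuming Python datetime objects and one-minute buckets (illustrative choices):

```python
import numpy as np

def rate_and_increment(timestamps):
    """Return (events per minute, minutes between consecutive events)."""
    minutes = np.array([int(t.timestamp() // 60) for t in timestamps])
    rate = np.bincount(minutes - minutes.min())   # rate process: event count in each minute
    increments = np.diff(minutes)                 # increment process: gaps between events
    return rate, increments
```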

  20. Observations • Failures are a small fraction of all events • The average time between failures is larger than the average time between all other error types • The variability of the inter-failure times is large, with a coefficient of variation (CV) exceeding 2.5 • The variability of the inter-error times is much larger, with a CV of 14 • CV = standard deviation / mean
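For example, computing the CV of a set of inter-failure times (a minimal sketch, assuming a NumPy array of times in minutes):

```python
import numpy as np

def coefficient_of_variation(times):
    """CV = standard deviation / mean; an exponential (memoryless) process has CV = 1."""
    x = np.asarray(times, dtype=float)
    return np.std(x) / np.mean(x)
```

A CV of 2.5 or 14 therefore indicates far more variability than a simple Poisson arrival process would produce.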

  21. Empirical Distribution • Tail properties explain the high variance of the failure and error processes

  22. Hazard Rate Function • The hazard rate of failures and errors increases once the time since the last failure/error grows beyond a particular point

  23. Auto-SLEX • Partitions the error and failure rate and increment processes into stationary intervals • Results: • The error/failure rate and increment processes are nonstationary but contain long intervals that are stationary • E.g. two stationary intervals of length 8192 minutes in the error+failure rate process • 5 stationary intervals of length 4096 minutes

  24. Autocorrelation Function • Plot for a stationary interval of the error+failure rate process of length 4096 minutes • Results showed significant correlation structures • To gain a deeper understanding, the ACF is plotted separately for (TEMP, PERF, UNKN), termed less serious errors, and (PEND + failures), termed more serious errors • The magnitude of the correlation is then much smaller and dies out with increasing lag. These results suggest that there exist interactions between the statistical properties of the two types of errors

  25. Cross-Correlation Function • Plot for a stationary interval of 4096 minutes • Positive-valued lags show the relationship between less serious errors and more serious errors+failures at lags of k minutes apart • Negative-valued lags show the relationship at lags of –k minutes apart • Results: significant cross-correlation structures between the two partitions and considerable periodic behavior.

  26. Per-Node Errors and Failures

  27. Hazard rate functions for an individual node are quite similar to those of the entire system

  28. Almost 70% of the failures occur in less than 4% of the nodes

  29. Observations • High failure rates on 5 nodes • 3 file servers • 2 database servers • Previous studies show that • High workload leads to failures • Most hardware failures occur in the I/O subsystem

  30. Failures occurring in these nodes and their time-varying behavior are influenced by the load imposed on them.

  31. Conclusion • Understanding failure behavior helps to modify existing techniques and also to develop new mechanisms. • Event logs were collected from 395 systems in a machine room and the results were analyzed. • Error and failure processes exhibit time-varying behavior and different forms of strong correlation structures. • A small fraction of the nodes incur most of the failures. • The nodes having the most failures show a strong temporal correlation with the time of day at the hourly level.

  32. Thank You !!!
