Failure Data Analysis of a Large-Scale Heterogeneous Server Environment

This paper analyzes the statistical properties of system errors and failures in a network of 395 nodes, showing time-varying behavior and periodic patterns. It discusses technological factors, solutions, related work, and system-wide and per-node errors and failures.

Presentation Transcript


  1. Failure Data Analysis of a Large-Scale Heterogeneous Server Environment Authors: Ramendra K. Sahoo, Anand Sivasubramaniam, Mark S. Squillante, Yanyong Zhang Presenter: Sajala Rajendran

  2. Abstract • Faults occur more frequently as the complexity of hardware and software increases • This paper analyzes the empirical and statistical properties of system errors and failures collected from a network of 395 nodes • Results show that the system errors and failures exhibit time-varying behavior, containing long stationary intervals that show strong correlation structures and periodic patterns

  3. Outline • Technological Factors • Solutions • Related Work • System Environment • System–Wide Errors and Failures • Per–Node Errors and Failures • Conclusion

  4. Technological Factors • Lowering operating voltages in order to reduce power consumption makes circuits more susceptible to alpha particles and cosmic rays, which in turn cause Bit-flips : a random event that corrupts the value stored in a memory cell rather than the cell itself • The high workload imposed on these systems leads to thermal instability that results in breakdowns.

  5. • More complex software and applications make systems prone to bugs such as memory leaks, which may lead to crashes. • In parallel/distributed systems where nodes depend on one another, each node is susceptible to another node's failures/errors.

  6. Solutions • Provide sufficient redundancy, e.g. duplicate file servers that can mask problems when failures occur, but the additional software/hardware is more expensive. • Anticipate failures and take pro-active measures in advance, e.g. “software rejuvenation” aims to prevent unexpected/unplanned outages due to software aging. • Take action once a failure has actually occurred, e.g. when nodes/disks fail, replace them.

  7. Observations Each of the above solutions has its own pros and cons. Thus, it is very important to understand the properties of the errors and failures that can occur, which will help in developing schemes to improve system performance and availability.

  8. Related Work • Dong Tang, Ravishankar Iyer and Sujatha Subramani studied a VAX cluster system consisting of 7 machines and 4 storage controllers. • More than 46% of the failures were due to shared resources, and 98.8% of the errors were recovered. A semi-Markov failure model was used to show that the failure distributions on the machines are correlated. • Heath, Martin and Nguyen collected failure data from three clustered servers ranging from 18 to 89 workstations. • Times between failures are independent, and nodes that just failed are more likely to fail again • Time between failures modeled by a Weibull distribution

  9. System Environment • Event logs obtained from 395 nodes in a machine room (single-CPU servers, 2/4/8/12-way SMPs) over 487 days • Workload – long-running scientific and commercial applications

  10. Node Breakdown

  11. Error Logs • The kernel/applications log errors in /dev/error • A daemon process (errdaemon) monitors the above file and compares each entry against a database of error record templates. • Each entry of the error log contains the following information: • Node number (Node) • Error Identifier (ID) • Time Stamp (Time) • Error Type (Type) • Error Class (Class) • Description of the problem (Description)
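Each entry can be thought of as a small record; a minimal sketch in Python, assuming only the six fields listed above (the field names are illustrative, not the actual log format):

```python
# A minimal sketch of one error-log entry; field names are illustrative.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ErrorRecord:
    node: int           # node number (Node)
    event_id: str       # error identifier (ID)
    time: datetime      # time stamp (Time)
    err_type: str       # error type: TEMP, PERF, UNKN, PEND or PERM
    err_class: str      # error class: O, S, U or H
    description: str    # free-text description of the problem
```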

  12. Classification of Errors • Error Types • TEMP: Error condition recovered after a number of unsuccessful attempts. • PERF: Performance degradation • UNKN: Unknown error • PEND: Loss of availability of device or component is forthcoming • PERM: Permanent error

  13. Error Classes • Class O : Informational error only • Class S : Software related errors • Class U : Undetermined errors • Class H : Hardware related errors

  14. Event Processing and Filtering • 6,533,152 events were collected from the nodes • These need to be filtered • Events appearing within a short time interval on a node are the result of the same error, as shown by Tang in his article • Using this result, the filtering algorithm works as follows:

  15. Contd.… • A new event type is recorded as a new event ID, and a new node number as a new node ID • The event ID and node ID at any time T are compared with the event IDs and node IDs of all events since time ( T - Tth ), where Tth is the threshold time, chosen to be 5 minutes • This eliminated 91.82% of the total raw events
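A minimal sketch of this filtering step, assuming the ErrorRecord type sketched above and an event list already sorted by time stamp (the 5-minute threshold comes from the slide; everything else is illustrative):

```python
from datetime import timedelta

T_TH = timedelta(minutes=5)  # threshold Tth: repeats within this window count as the same error

def filter_events(events):
    """Drop events that repeat the same (node, event ID) within T_TH of a previous occurrence."""
    filtered = []
    last_seen = {}  # (node, event_id) -> time of the most recent raw occurrence
    for ev in events:  # assumes events are sorted by ev.time
        key = (ev.node, ev.event_id)
        prev = last_seen.get(key)
        if prev is None or ev.time - prev > T_TH:
            filtered.append(ev)      # first occurrence within the window: keep it
        last_seen[key] = ev.time     # always record the latest occurrence, kept or not
    return filtered
```

Note that each event is compared against the most recent raw occurrence, kept or dropped, matching the slide's "all the events since time ( T - Tth )".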

  16. • Failures – PERM type errors + a small subset of PEND type error entries • Errors – remaining PEND type entries + TEMP + PERF + UNKN type entries
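A minimal sketch of this failure/error split; the slide does not specify which PEND entries count as failures, so that decision is passed in as a flag here (illustrative assumption):

```python
def classify(record, pend_is_failure=False):
    """Return 'failure' for PERM (and selected PEND) entries, 'error' for everything else."""
    if record.err_type == "PERM":
        return "failure"
    if record.err_type == "PEND" and pend_is_failure:
        return "failure"
    return "error"  # remaining PEND + TEMP + PERF + UNKN entries
```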

  17. Background Information • Hazard Rate Function Defined as h(x) = f(x) / (1 – F(x)) h(x) – conditional probability that an item will fail at time x given that it survived until time x f(x) – Marginal density function F(x) – Marginal cumulative distribution function • Cross-Correlation Function Plot of the similarity between one waveform and a time-shifted version of the other, as a function of the time shift • Autocorrelation Function Cross-correlation of a signal with itself
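A minimal sketch of how these three quantities could be estimated from the data with NumPy; the binning and lag choices are illustrative, not the estimators used in the paper:

```python
import numpy as np

def empirical_hazard(inter_event_times, bins=50):
    """Estimate h(x) = f(x) / (1 - F(x)) from a sample of inter-event times."""
    counts, edges = np.histogram(inter_event_times, bins=bins)
    width = np.diff(edges)
    n = len(inter_event_times)
    at_risk = n - np.concatenate(([0], np.cumsum(counts)[:-1]))  # items surviving into each bin
    with np.errstate(divide="ignore", invalid="ignore"):
        return counts / (at_risk * width), edges

def acf(x, max_lag=100):
    """Autocorrelation of a (stationary) rate process at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom for k in range(max_lag + 1)])

def ccf(x, y, max_lag=100):
    """Cross-correlation of two rate processes at lags -max_lag..+max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    vals = []
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            vals.append(np.dot(x[:len(x) - k], y[k:]) / denom)
        else:
            vals.append(np.dot(x[-k:], y[:len(y) + k]) / denom)
    return np.array(vals)
```

Sketches like these could be applied to the stationary intervals discussed on slides 22, 24 and 25.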

  18. System-Wide Errors and Failures

  19. Rate process – Number of events per minute • Increment process – Time between events
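A minimal sketch of turning a sorted list of event time stamps into these two processes, assuming Python datetime objects and one-minute buckets (illustrative choices):

```python
import numpy as np

def rate_and_increment(timestamps):
    """Return (events per minute, minutes between consecutive events)."""
    minutes = np.array([int(t.timestamp() // 60) for t in timestamps])
    rate = np.bincount(minutes - minutes.min())   # rate process: event count in each minute
    increments = np.diff(minutes)                 # increment process: gaps between events
    return rate, increments
```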

  20. Observations • Failures are a small fraction of all events • The average time between failures is larger than the average time between all other error types • The variability of the inter-failure times is large, with a coefficient of variation (CV) exceeding 2.5 • The variability of the inter-error times is much larger, with a CV of 14 • CV = standard deviation / mean
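For example, computing the CV of a set of inter-failure times (a minimal sketch, assuming a NumPy array of times in minutes):

```python
import numpy as np

def coefficient_of_variation(times):
    """CV = standard deviation / mean; an exponential (memoryless) process has CV = 1."""
    x = np.asarray(times, dtype=float)
    return np.std(x) / np.mean(x)
```

A CV of 2.5 or 14 therefore indicates far more variability than a simple Poisson arrival process would produce.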

  21. Empirical Distribution • Tail properties explain the high variance of the failure and error processes

  22. Hazard Rate Function • The hazard rate of failures and errors increases once the time since the last failure/error grows beyond a particular point

  23. Auto-SLEX • Partitions the error and failure rate and increment processes into stationary intervals • Results: • The error/failure rate and increment processes are nonstationary but contain long intervals that are stationary • E.g. two stationary intervals of length 8192 minutes in the error+failure rate process • 5 stationary intervals of length 4096 minutes

  24. Autocorrelation Function • Plot for a stationary interval of the error+failure rate process of length 4096 minutes • Results showed significant correlation structures • To gain a deeper understanding, the ACF is plotted separately for (TEMP, PERF, UNKN), termed less serious errors, and (PEND + failures), termed more serious errors • The magnitude of the correlation is then much smaller and dies out with increasing lag. These results suggest that there exist interactions between the statistical properties of the two types of errors

  25. Cross-Correlation Function • Plot for a stationary interval of 4096 minutes • Positive-valued lags show the relationship between less serious errors and more serious errors+failures at lags of k minutes apart • Negative-valued lags show the relationship at lags of –k minutes apart • Results: significant cross-correlation structures between the two partitions and considerable periodic behavior.

  26. Per-Node Errors and Failures

  27. Hazard rate functions for an individual node are quite similar to those of the entire system

  28. Almost 70% of the failures occur in less than 4% of the nodes

  29. Observations • High failure rates on 5 nodes • 3 file servers • 2 database servers • Previous studies show that • High workload leads to failures • Most hardware failures occur in the I/O subsystem

  30. Failures occurring in these nodes and their time-varying behavior are influenced by the load imposed on them.

  31. Conclusion • Understanding failure behavior helps to modify existing techniques and also to develop new mechanisms. • Event logs were collected from 395 systems in a machine room and the results were analyzed. • Error and failure processes exhibit time-varying behavior and different forms of strong correlation structures. • A small fraction of the nodes incur most of the failures. • The nodes having the most failures show a strong temporal correlation with the time of day at the hourly level.

  32. Thank You !!!
