A Proactive Resiliency Approach for Large-Scale HPC Systems
System Research Team
Presented by Geoffroy Vallee
Oak Ridge National Laboratory
Welcome to HPCVirt 2009
Goal of the Presentation
Can we anticipate failures and avoid their impact on application execution?
Introduction
Traditional Fault Tolerance Policies in HPC Systems
Alternative approach: proactive fault tolerance
Two critical capabilities to make proactive FT successful
Testing / Experimentation
Is proactive fault tolerance the solution?
Study non-intrusive monitoring techniques
Postmortem failure analysis
System log analysis
Live analysis for failure prediction
Collaboration with George Ostrouchov
Statistical tool for anomaly detection
Anomaly Analyzer (George Ostrouchov)
Ability to view groups of components as statistical distributions
Identify anomalous components
Identify anomalous time periods
Based on numeric data with no expert knowledge for grouping
Scalable approach, only statistical properties of simple summaries
Power from examination of high-dimensional relationships
Visualization utility used to explore data
R project for statistical computing
GGobi visualization tool for high-dimensional data exploration
With good failure data, could be used for failure prediction
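A minimal sketch of the idea of viewing a group of nodes as a distribution and flagging the ones that sit far from it. This is not the Anomaly Analyzer itself (which builds on R and GGobi): it swaps in a simple median/MAD score per metric, and the node names, metric values, and threshold below are hypothetical.

```python
from statistics import mean, median

def node_summaries(samples):
    """Reduce each node's samples (a list of metric tuples) to per-metric means."""
    return {node: tuple(mean(col) for col in zip(*rows))
            for node, rows in samples.items()}

def robust_scores(summaries):
    """Score each node by its largest per-metric deviation from the group
    median, scaled by the median absolute deviation (MAD) of the group."""
    nodes = list(summaries)
    n_metrics = len(next(iter(summaries.values())))
    scores = {node: 0.0 for node in nodes}
    for j in range(n_metrics):
        values = [summaries[n][j] for n in nodes]
        med = median(values)
        # Guard against a zero MAD when most nodes report identical values.
        mad = median([abs(v - med) for v in values]) or 1e-9
        for n in nodes:
            scores[n] = max(scores[n], abs(summaries[n][j] - med) / mad)
    return scores

def anomalous_nodes(samples, threshold=5.0):
    """Return the nodes whose score exceeds the (assumed) threshold."""
    scores = robust_scores(node_summaries(samples))
    return sorted(n for n, s in scores.items() if s > threshold)

# Hypothetical hourly samples: (CPU temperature in C, memory utilization).
samples = {
    "node1": [(40.0, 0.20), (41.0, 0.25)],
    "node2": [(42.0, 0.22), (41.0, 0.21)],
    "node3": [(39.0, 0.18), (40.0, 0.24)],
    "node4": [(71.0, 0.23), (73.0, 0.20)],
}
print(anomalous_nodes(samples))  # ['node4']
```

No expert knowledge goes into the grouping here: every node is compared only against simple summaries of the whole group, which is what keeps this kind of approach scalable.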
Monitoring / Data collection
Prototype developed using XTORC
Ganglia monitoring system
Standard metrics, e.g., memory/CPU utilization
lm_sensors data, e.g., CPU/motherboard temperature
Leveraged RRD reader from Ovis v1.1.1
Goal: move the application away from the component that is about to fail
Major proactive FT mechanisms
Virtual machine migration
In our context
Do not care about the underlying mechanism
We can easily switch between solutions
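One way to picture "not caring about the underlying mechanism" is a common interface that the proactive policy calls, with interchangeable back ends behind it. The sketch below is illustrative only: the class and function names are hypothetical, and the back ends just return a description instead of driving a real migration tool.

```python
from abc import ABC, abstractmethod

class MigrationMechanism(ABC):
    """Hypothetical interface: the policy only ever calls migrate()."""

    @abstractmethod
    def migrate(self, job: str, source: str, target: str) -> str:
        ...

class ProcessMigration(MigrationMechanism):
    def migrate(self, job, source, target):
        # Placeholder: a real back end would drive a process-level mechanism.
        return f"process-level: {job} {source} -> {target}"

class VMMigration(MigrationMechanism):
    def migrate(self, job, source, target):
        # Placeholder: a real back end would drive a virtual machine migration.
        return f"vm-level: {job} {source} -> {target}"

def evacuate(mechanism, jobs, failing_node, spare_node):
    """Move every job away from the node that is predicted to fail."""
    return [mechanism.migrate(job, failing_node, spare_node) for job in jobs]

print(evacuate(VMMigration(), ["job1", "job2"], "node13", "node20"))
```

Switching between solutions then amounts to passing a different object to evacuate(); the policy code itself does not change.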
How to evaluate a failure prediction mechanism?
How to evaluate the impact of a given proactive policy?
First purpose: testing our research
Inject failure at different levels: system, OS, application
Framework for fault injection
Controller: Analyzer, Detector & Injector
Target system & user level targets
Testing of failure prediction/detection mechanisms
Mimic behavior of other systems
“Replay” a failure sequence on another system
Based on system logs, we can evaluate the impact of different policies
Bit-flips - CPU registers/memory
Memory errors - memory corruption/leaks
Disk faults - read/write errors
Network faults - packet loss, etc.
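To make the first fault type concrete, here is a minimal user-level sketch of a bit-flip injector. It only corrupts a Python bytearray standing in for application memory; it does not touch CPU registers or another process, and the function names are hypothetical.

```python
import random

def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Flip one bit in place; bit 0 is the least significant bit of buf[0]."""
    byte, offset = divmod(bit_index, 8)
    buf[byte] ^= 1 << offset

def inject_random_bit_flip(buf: bytearray, rng: random.Random) -> int:
    """Flip one uniformly chosen bit and return its index.

    Returning the index (and seeding the generator) lets the same fault
    be injected again in a later run."""
    bit_index = rng.randrange(len(buf) * 8)
    flip_bit(buf, bit_index)
    return bit_index

data = bytearray(b"\x00\x00")
flip_bit(data, 9)            # second byte, bit 1
print(data)                  # bytearray(b'\x00\x02')
```

Recording which bit was flipped is what ties injection to detection: the test harness knows exactly which fault the detector should have seen.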
Representative failures (fidelity)
Transparency and low overhead
Detection/Injection are linked
Techniques: Hardware vs. Software
Software fault injection (FI) can leverage performance/debug hardware
Not many publicly available tools
Based on system logs
Currently uses logs from LLNL ASCI White
Evaluate impact of
System/FT mechanisms parameters (e.g., checkpoint cost)
Enable studies & evaluation of different configurations before actual deployment
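A minimal sketch of what a log-driven evaluation can look like: failure timestamps (as they might be extracted from a system log) are replayed against an application that checkpoints periodically, and the total completion time is reported. The function name, parameters, and cost model are assumptions for illustration, not the simulator described here.

```python
def replay(work, interval, ckpt_cost, restart_cost, failure_times):
    """Wall-clock time to finish `work` units of computation under periodic
    checkpointing while replaying failures at the given wall-clock times."""
    failures = sorted(failure_times)
    clock = 0.0               # current wall-clock time
    remaining = float(work)   # work not yet protected by a checkpoint
    i = 0                     # index of the next failure to replay
    while remaining > 0:
        # Failures that fall inside a restart window hit a node that is
        # already down, so they are skipped.
        while i < len(failures) and failures[i] < clock:
            i += 1
        segment = min(interval, remaining)
        last = segment == remaining
        # No checkpoint is taken after the final segment.
        end = clock + segment + (0.0 if last else ckpt_cost)
        if i < len(failures) and failures[i] < end:
            # The segment is lost: resume from the last checkpoint.
            clock = failures[i] + restart_cost
            i += 1
        else:
            clock = end
            remaining -= segment
    return clock

print(replay(10, 2, 0.5, 1, []))      # 12.0
print(replay(10, 2, 0.5, 1, [3.0]))   # 13.5
```

Sweeping interval or ckpt_cost over the same failure trace gives a rough comparison of configurations before anything is deployed.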
Compute nodes: ~45-60 (P4 @ 2 GHz)
Head node: 1 (P4 @ 1.7 GHz)
Service/log server: 1 (P4 @ 1.8 GHz)
Network: 100 Mb/s Ethernet
Operating systems span RedHat 9, Fedora Core 4 & 5
FC4: node4, 58, 59, 60
FC5: node1-3, 5-52, 61
RH9 is Linux 2.4
FC4/5 is Linux 2.6
NFS exports ‘/home’
Data classified and grouped automatically
However, the results were interpreted manually (by an administrator and a statistician)
Node 0 is the most different from the rest, particularly hours 13, 37, 46, and 47. This is the head node where most services are running.
Node 53 runs the older Red Hat 9 (all others run Fedora Core 4/5).
It turned out that nodes 12, 31, 39, 43, and 63 were all down.
Node 13 … and particularly its hour 47!
Node 30 hour 7 … ?
Node 1 & Node 5 … ?
Three groups emerged in data clustering
1. temperature/memory related, 2. cpu related, 3. i/o related
Reduce overhead in data gathering
Monitor more fields
Investigate methods to aid data interpretation
Identify significant fields for given workloads
Base (no/low work)
Loaded (benchmark/app work)
Loaded + Fault Injection
Working toward links between anomalies and failures
Proactive & reactive fault tolerance
Process level: BLCR + LAM-MPI
Virtual machine level: Xen + any kind of MPI implementation
Monitoring framework: based on Ganglia
Anomaly detection tool
System log based
Enable customization of policies and system/application parameters
Most of the time, prediction accuracy is not good enough, so we may lose all the benefit of proactive FT
No “one-size-fits-all” solution
Combination of different policies
“Holistic” fault tolerance
Example: decrease the checkpoint frequency combining proactive and reactive FT policies
Optimization of existing policies
Leverage existing techniques/policies
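The checkpoint-frequency example can be illustrated with a back-of-the-envelope calculation. The sketch below assumes Young's first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTBF), which is not taken from these slides, and further assumes that failures avoided by proactive migration simply stretch the MTBF seen by the reactive mechanism. The function names and the `recall` parameter (fraction of failures predicted and avoided in time) are hypothetical.

```python
import math

def young_interval(ckpt_cost, mtbf):
    """Young's first-order approximation of the checkpoint interval."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)

def combined_interval(ckpt_cost, mtbf, recall):
    """Checkpoint interval when a fraction `recall` of failures is avoided
    proactively, so only the rest must be handled by checkpoint/restart."""
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    # Assumption: avoided failures stretch the effective MTBF.
    return young_interval(ckpt_cost, mtbf / (1.0 - recall))

# Hypothetical numbers: 0.5 h checkpoint cost, 100 h MTBF.
print(young_interval(0.5, 100.0))           # 10.0
print(combined_interval(0.5, 100.0, 0.75))  # 20.0
```

Under these assumptions, avoiding three out of four failures doubles the checkpoint interval. With poor prediction accuracy the gain shrinks quickly, which is why the reactive policy has to remain in place.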
Geoffroy Vallee <firstname.lastname@example.org>
Significant variance between different runs of the same experiment
Only a few studies address the problem
Critical to scale up
Scientists want strict answers
What are the problems:
Lack of tools?
VMMs are too big/complex?
Not enough VMM-bypass/optimization?
FT mechanisms are not yet mainstream (out-of-the-box)
But different solutions start to be available (BLCR, Xen, etc.)
Support of as many mechanisms as possible
Reactive FT mechanisms
Virtual machine checkpoint/restart
Proactive FT mechanisms
Virtual machine migration
Pro: focused on FI & experiments, code available
Con: older project, lots of dependencies, slow
Pro: works with ‘qemu’ emulator, code available
Con: patch for ARM arch, limited capabilities
Linux (>= 2.6.20)
Pro: extensible, kernel & user level targets, maintained by Linux community
Con: immature, focused on testing Linux
Implementation of the RAS framework
Ultimately have an “end-to-end” solution for system resilience
From initial studies based on the simulator
To deployment and testing on computing platforms
Using different low-level mechanisms (process level versus virtual machine level mechanisms)
Adapting the policies to both the platform and the applications