A Proactive Resiliency Approach for Large-Scale HPC Systems

System Research Team

Presented by Geoffroy Vallee

Oak Ridge National Laboratory

Welcome to HPCVirt 2009

Goal of the Presentation

Can we anticipate failures and avoid their impact on application execution?

Introduction

Traditional Fault Tolerance Policies in HPC Systems

Reactive policies

Another approach: proactive fault tolerance

Two critical capabilities to make proactive FT successful

Failure prediction

Anomaly detection

Application migration

Proactive policy

Testing / Experimentation

Is proactive fault tolerance the solution?

Failure Detection & Prediction

System monitoring

Live monitoring

Study non-intrusive monitoring techniques

Postmortem failure analysis

System log analysis

Live analysis for failure prediction

Postmortem analysis

Anomaly analysis

Collaboration with George Ostrouchov

Statistical tool for anomaly detection
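
To make the live log analysis concrete, here is a minimal sketch of the kind of scanner such a predictor could start from; the warning patterns, window size, and threshold are illustrative assumptions, not the team's actual analysis.

```python
import re
from collections import deque

# Illustrative warning patterns that often precede hardware failures;
# a real predictor would be derived from the target system's own logs.
PATTERNS = [r"ECC error", r"I/O error", r"over[- ]?temperature", r"machine check"]
WINDOW = 100      # number of recent log lines to keep
THRESHOLD = 5     # warnings within the window that raise a prediction

recent = deque(maxlen=WINDOW)

def feed(line):
    """Feed one syslog line; return True when a failure is predicted."""
    recent.append(any(re.search(p, line, re.IGNORECASE) for p in PATTERNS))
    return sum(recent) >= THRESHOLD

# Postmortem use: replay an existing log file (path is site-specific)
if __name__ == "__main__":
    with open("/var/log/messages") as log:
        for line in log:
            if feed(line):
                print("predicted failure near:", line.strip())
                break
```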

Anomaly Detection

Anomaly Analyzer (George Ostrouchov)

Ability to view groups of components as statistical distributions

Identify anomalous components

Identify anomalous time periods

Based on numeric data with no expert knowledge for grouping

Scalable approach: uses only statistical properties of simple summaries

Power comes from examining high-dimensional relationships

Visualization utility used to explore data

Implementation uses

R project for statistical computing

GGobi visualization tool for high-dimensional data exploration

With good failure data, could be used for failure prediction
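
The analyzer itself builds on R and GGobi as listed above; purely as an illustration of the underlying idea, the sketch below condenses each node into a vector of simple summaries and flags statistical outliers with a robust z-score. The choice of summaries and the cutoff are assumptions for illustration, not the analyzer's actual method.

```python
import numpy as np

def anomalous_nodes(summaries, node_ids, cutoff=3.5):
    """Flag nodes whose per-node metric summaries are statistical outliers.

    summaries: array of shape (n_nodes, n_metrics), e.g. each node's mean
    CPU temperature, mean memory use, ... over a time window. Uses a robust
    z-score (median/MAD) per metric: only simple summaries, no expert labels.
    """
    X = np.asarray(summaries, dtype=float)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) + 1e-12
    z = 0.6745 * (X - med) / mad
    scores = np.max(np.abs(z), axis=1)           # worst metric per node
    return [(node_ids[i], scores[i]) for i in np.argsort(-scores)
            if scores[i] > cutoff]

# Toy example with made-up summaries for three nodes
print(anomalous_nodes([[45, 0.20], [46, 0.21], [70, 0.90]], ["n1", "n2", "n0"]))
```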

Anomaly Detection Prototype

Monitoring / Data collection

Prototype developed using XTORC

Ganglia monitoring system

Standard metrics, e.g., memory/CPU utilization

lm_sensors data, e.g., CPU/motherboard temperature

Leveraged RRD reader from Ovis v1.1.1
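
A hedged sketch of how such RRD-backed samples could be pulled into the analysis, assuming the python-rrdtool bindings and Ganglia's default on-disk layout (the prototype itself reuses the Ovis v1.1.1 RRD reader):

```python
import rrdtool  # python-rrdtool bindings (assumed available)

def fetch_metric(rrd_path, hours=48):
    """Read averaged samples from one Ganglia RRD file.

    rrd_path follows Ganglia's usual layout, e.g.
    /var/lib/ganglia/rrds/<cluster>/<node>/cpu_user.rrd
    (the exact location depends on the local gmetad configuration).
    """
    (start, end, step), names, rows = rrdtool.fetch(
        rrd_path, "AVERAGE", "--start", "-%dh" % hours)
    # one column per data source; Ganglia metric RRDs typically hold one
    return [(start + i * step, row[0])
            for i, row in enumerate(rows) if row[0] is not None]

samples = fetch_metric("/var/lib/ganglia/rrds/xtorc/node1/cpu_user.rrd")
```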

Proactive Fault Tolerance Mechanisms

Goal: move the application away from the component that is about to fail

Migration

Pause/unpause

Major proactive FT mechanisms

Process-level migration

Virtual machine migration

In our context

Do not care about the underlying mechanism

We can easily switch between solutions
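
To illustrate that separation of policy from mechanism, here is a minimal sketch of a common migration interface with interchangeable back-ends; class and method names are hypothetical, not the team's actual API.

```python
class Migrator:
    """Common interface so a policy never sees the migration back-end."""
    def migrate(self, workload, target_node):
        raise NotImplementedError
    def pause(self, workload):
        raise NotImplementedError
    def unpause(self, workload):
        raise NotImplementedError

class ProcessMigrator(Migrator):
    def migrate(self, workload, target_node):
        # e.g. checkpoint the process and restart it on target_node
        print("process-level migration of %s to %s" % (workload, target_node))

class VMMigrator(Migrator):
    def migrate(self, workload, target_node):
        # e.g. live-migrate the guest VM hosting the application
        print("VM-level migration of %s to %s" % (workload, target_node))

def evacuate(migrator, workload, healthy_node):
    """Policy code calls only the abstract interface."""
    migrator.migrate(workload, healthy_node)
```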

System and application resilience

  • What policy to use for proactive FT?
  • Modular framework
    • Virtual machine checkpoint/restart and migration
    • Process-level checkpoint/restart and migration
    • Implementation of new policies via our SDK
    • Feedback loop
  • Policy simulator
    • Ease initial phase of study of new policies
    • Results match experimental virtualization results
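
As a rough sketch of what a policy written against such an SDK might look like (names are illustrative assumptions, not the framework's actual interfaces): a policy turns monitoring data into an action, any failure predictor can be plugged in, and the feedback loop applies the decision.

```python
class Policy:
    """Hypothetical policy plug-in: turn monitoring data into an action."""
    def decide(self, node_health):
        """node_health maps node -> latest metrics; return (action, node) or None."""
        raise NotImplementedError

class EvacuateOnPrediction(Policy):
    """Migrate work away from any node the plugged-in predictor flags."""
    def __init__(self, predict_failure):
        self.predict_failure = predict_failure
    def decide(self, node_health):
        for node, metrics in node_health.items():
            if self.predict_failure(node, metrics):
                return ("migrate_away", node)
        return None

def feedback_loop_step(policy, monitor, actuate):
    """One iteration of the feedback loop: observe, decide, act."""
    decision = policy.decide(monitor())
    if decision is not None:
        actuate(*decision)
```
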
Type 1 Feedback-Loop Control Architecture
  • Alert-driven coverage
    • Basic failures
  • No evaluation of application health history or context
    • Prone to false positives
    • Prone to false negatives
    • Prone to miss real-time window
    • Prone to decrease application health through migration
    • No correlation of health context or history
Type 2 Feedback-Loop Control Architecture
  • Trend-driven coverage
    • Basic failures
    • Fewer false positives/negatives
  • No evaluation of application reliability
    • Prone to miss real-time window
    • Prone to decrease application health through migration
    • No correlation of health context or history
Type 3 Feedback-Loop Control Architecture
  • Reliability-driven coverage
    • Basic and correlated failures
    • Fewer false positives/negatives
    • Able to maintain real-time window
    • Does not decrease application health through migration
    • Correlation of short-term health context and history
  • No correlation of long-term health context or history
    • Unable to match system and application reliability patterns
Type 4 Feedback-Loop Control Architecture
  • Reliability-driven coverage of failures and anomalies
    • Basic and correlated failures, anomaly detection
    • Less prone to false positives
    • Less prone to false negatives
    • Able to maintain real-time window
    • Does not decrease application health through migration
    • Correlation of short and long-term health context & history
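
To make the contrast between the architectures concrete, the sketch below shows trigger logic only: a Type 1 loop acts on every raw alert, while a Type 4 style trigger correlates short- and long-term health history before acting, which filters one-off spikes yet still catches slow degradation. The health score and constants are illustrative assumptions, not the framework's actual heuristics.

```python
def type1_trigger(alert):
    """Type 1: act on every raw alert, with no context or history."""
    return bool(alert)

class Type4Trigger:
    """Type 4 flavour: correlate short- and long-term health history.

    Keeps two exponential moving averages of a node health score in [0, 1]
    (1 = healthy). Migration is suggested only when the short-term trend
    drops well below the long-term baseline.
    """
    def __init__(self, short=0.3, long=0.02, margin=0.2):
        self.short_a, self.long_a, self.margin = short, long, margin
        self.short_ema = self.long_ema = 1.0

    def update(self, health_score):
        self.short_ema += self.short_a * (health_score - self.short_ema)
        self.long_ema += self.long_a * (health_score - self.long_ema)
        return self.short_ema < self.long_ema - self.margin
```
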
Testing and Experimentation

How to evaluate a failure prediction mechanism?

Failure injection

Anomaly detection

How to evaluate the impact of a given proactive policy?

Simulation

Experimentation

Fault Injection / Testing

First purpose: testing our research

Inject failure at different levels: system, OS, application

Framework for fault injection

Controller: Analyzer, Detector & Injector

System- and user-level targets

Testing of failure prediction/detection mechanisms

Mimic behavior of other systems

“Replay” failure sequences on another system

Based on system logs, we can evaluate the impact of different policies
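
A minimal sketch of the replay idea, assuming a failure trace already extracted from another system's logs as (time offset, node, fault) tuples; the trace format and the injector callback are illustrative assumptions.

```python
import time

def replay(trace, inject):
    """Replay a failure trace on a test system.

    trace: list of (seconds_since_start, node, fault_kind) tuples,
    typically extracted from another machine's system logs.
    inject: callback that performs the actual injection on `node`.
    """
    t0 = time.time()
    for offset, node, fault in sorted(trace):
        delay = offset - (time.time() - t0)
        if delay > 0:
            time.sleep(delay)
        inject(node, fault)

# Example trace (offsets in seconds); this injector only prints.
replay([(0.0, "node12", "disk_read_error"), (5.0, "node31", "node_down")],
       lambda node, fault: print("injecting", fault, "on", node))
```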

Fault Injection

Example faults/errors

Bit-flips - CPU registers/memory

Memory errors - mem corruptions/leaks

Disk faults - read/write errors

Network faults - packet loss, etc.

Important characteristics

Representative failures (fidelity)

Transparency and low overhead

Detection/Injection are linked

Existing Work

Techniques: Hardware vs. Software

Software FI can leverage perf./debug hardware

Not many publicly available tools
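
As a toy illustration of the memory-error class listed above, the sketch below flips one random bit in a local buffer; a real injector targets a running process at the system, OS, or application level rather than a Python bytearray.

```python
import random

def flip_random_bit(buf, rng=random):
    """Emulate a single-bit memory error in an application buffer."""
    byte = rng.randrange(len(buf))
    bit = rng.randrange(8)
    buf[byte] ^= 1 << bit
    return byte, bit

data = bytearray(b"result vector of the application")
print("flipped (byte, bit)", flip_random_bit(data), "->", bytes(data))
```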

Simulator

Based on system logs

Currently based on LLNL ASCI White

Evaluate impact of

Alternate policies

System/FT mechanism parameters (e.g., checkpoint cost)

Enable studies & evaluation of different configurations before actual deployment
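
A hedged toy version of such a simulation: replay failure timestamps taken from system logs and estimate the overhead of a periodic checkpoint policy. The cost model and the numbers are illustrative only; the actual simulator supports richer policies and parameters.

```python
def wasted_time(failure_times, interval, ckpt_cost, restart_cost=60.0):
    """Crude estimate of checkpointing overhead against a failure trace.

    failure_times: failure timestamps (seconds) extracted from system logs.
    interval: time between checkpoints; ckpt_cost: cost of one checkpoint.
    Returns checkpoint overhead + lost work + restart time.
    """
    if not failure_times:
        return 0.0
    horizon = max(failure_times)
    overhead = int(horizon // interval) * ckpt_cost
    lost_work = sum(t % interval for t in failure_times)  # work since last ckpt
    return overhead + lost_work + restart_cost * len(failure_times)

# Compare two checkpoint intervals against the same illustrative trace
trace = [3600.0 * h for h in (7.4, 19.1, 44.6)]
for interval in (1800.0, 7200.0):
    print(interval, wasted_time(trace, interval, ckpt_cost=300.0))
```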

Anomaly Detection: Experimentation on “XTORC”

Hardware

Compute nodes: ~45-60 (P4 @ 2 GHz)

Head node: 1 (P4 @ 1.7 GHz)

Service/log server: 1 (P4 @ 1.8 GHz)

Network: 100 Mbit/s Ethernet

Software

Operating systems span Red Hat 9 and Fedora Core 4 & 5

RH9: node53

FC4: node4, 58, 59, 60

FC5: node1-3, 5-52, 61

RH9 is Linux 2.4

FC4/5 is Linux 2.6

NFS exports ‘/home’

XTORC Idle 48-hr Results

Data classified and grouped automatically

However, those results were manually interpreted (admin & statistician)

Observations

Node 0 is the most different from the rest, particularly hours 13, 37, 46, and 47. This is the head node where most services are running.

Node 53 runs the older Red Hat 9 (all others run Fedora Core 4/5).

It turned out that nodes 12, 31, 39, 43, and 63 were all down.

Node 13 … and particularly its hour 47!

Node 30 hour 7 … ?

Node 1 & Node 5 … ?

Three groups emerged in data clustering

1. temperature/memory related, 2. CPU related, 3. I/O related

Anomaly Detection - Next Steps

Data

Reduce overhead in data gathering

Monitor more fields

Investigate methods to aid data interpretation

Identify significant fields for given workloads

Heterogeneous nodes

Different workloads

Base (no/low work)

Loaded (benchmark/app work)

Loaded + Fault Injection

Working toward links between anomalies and failures

Prototypes - Overview

Proactive & reactive fault tolerance

Process level: BLCR + LAM-MPI

Virtual machine level: Xen + any kind of MPI implementation

Detection

Monitoring framework: based on Ganglia

Anomaly detection tool

Simulator

Based on system logs

Enable customization of policies and system/application parameters

Is proactive the answer?

Most of the time, prediction accuracy is not good enough; we may lose all the benefits of proactive FT

No “one-size-fits-all” solution

Combination of different policies

“Holistic” fault tolerance

Example: decrease the checkpoint frequency by combining proactive and reactive FT policies (see the sketch below)

Optimization of existing policies

Leverage existing techniques/policies

Tuning

Customization
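
One way to quantify the checkpoint-frequency example above: with Young's first-order approximation of the optimal checkpoint interval, a proactive layer that successfully avoids a fraction of failures raises the effective MTBF seen by the reactive layer, so checkpoints can be taken less often. The MTBF, checkpoint cost, and avoidance rates below are purely illustrative.

```python
from math import sqrt

def young_interval(ckpt_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return sqrt(2.0 * ckpt_cost * mtbf)

mtbf = 8.0 * 3600    # 8 h system MTBF (illustrative)
ckpt = 300.0         # 5 min checkpoint cost (illustrative)
for avoided in (0.0, 0.5, 0.7):   # fraction of failures handled proactively
    eff_mtbf = mtbf / (1.0 - avoided)
    print("%.0f%% avoided -> checkpoint every %.0f min"
          % (100 * avoided, young_interval(ckpt, eff_mtbf) / 60))
```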

Resource

http://www.csm.ornl.gov/srt/

Contacts

Geoffroy Vallee <valleegr@ornl.gov>

Performance Prediction

Significant variance between different runs of the same experiment

Only a few studies address the problem

“System noise”

Critical to scale up

Scientists want strict answers

What are the problems?

Lack of tools?

VMMs are too big/complex?

Not enough VMM-bypass/optimization?

Fault Tolerance Mechanisms

FT mechanisms are not yet mainstream (out-of-the-box)

But different solutions are becoming available (BLCR, Xen, etc.)

Support for as many mechanisms as possible

Reactive FT mechanisms

Process-level checkpoint/restart

Virtual machine checkpoint/restart

Proactive FT mechanisms

Process-level migration

Virtual machine migration
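
For the process-level reactive path, a checkpoint/restart could be driven roughly as follows, assuming the standard BLCR command-line tools (cr_run, cr_checkpoint, cr_restart) are installed; this is a sketch, not the prototype's actual integration, which couples BLCR with LAM-MPI.

```python
import subprocess

def blcr_checkpoint(pid, context_file):
    """Checkpoint a process that was started under cr_run."""
    subprocess.check_call(["cr_checkpoint", "-f", context_file, str(pid)])

def blcr_restart(context_file):
    """Restart a previously checkpointed process from its context file."""
    return subprocess.Popen(["cr_restart", context_file])
```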

Existing System Level Fault Injection

Virtual Machines

FAUmachine

Pro: focused on FI & experiments, code available

Con: older project, lots of dependencies, slow

FI-QEMU (patch)

Pro: works with ‘qemu’ emulator, code available

Con: patch for ARM arch, limited capabilities

Operating System

Linux (>= 2.6.20)

Pro: extensible, kernel & user level targets, maintained by Linux community

Con: immature, focused on testing Linux
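
As an illustration of the Linux option, the kernel fault-injection framework is configured through debugfs; a sketch along these lines enables slab-allocation failures, assuming CONFIG_FAILSLAB is built in and debugfs is mounted (the exact knobs vary per kernel build).

```python
def configure_failslab(probability=10, interval=1, times=100,
                       debugfs="/sys/kernel/debug"):
    """Enable slab-allocation failures via the kernel fault-injection
    framework (requires root; available knobs depend on the kernel config)."""
    base = debugfs + "/failslab"
    for name, value in (("probability", probability),
                        ("interval", interval),
                        ("times", times)):
        with open("%s/%s" % (base, name), "w") as f:
            f.write(str(value))
```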

Future Work

Implementation of the RAS framework

Ultimately have an “end-to-end” solution for system resilience

From initial studies based on the simulator

To deployment and testing on computing platforms

Using different low-level mechanisms (process-level versus virtual machine-level)

Adapting the policies to both the platform and the applications