Using failure injection mechanisms to experiment and evaluate a grid failure detector

Sébastien Monnet and Marin Bertier

IRISA / INRIA, PARIS project-team

WCGC 2006 - Rio de Janeiro


Systems evaluation

  • Simulations

    • Fast and easy

    • Rely on a system model

  • Formal proofs

    • Reliable

    • Rely on a system model

  • Experiments on real testbeds

    • Real system code / real environment

    • Hard!



Running experiments

  • Find resources

  • Deploy the system

  • Launch the test

  • Control the test

  • Get and analyze results



Experimenting with fault tolerance

  • Evaluate fault tolerance mechanisms

    • Fault-free runs

    • With failures

  • Fault prevention cost

  • Resilience to failures

  • Overhead due to failures (recovery, adaptation, etc.)



Volatility control: needs

  • Assumption: a stable testbed

  • Injecting failures

    • At large scale

    • Accurately

    • Reproducibly

      • Using failure scenarios



JXTA Distributed Framework (JDF)

  • A tool to automate the tests of JXTA-based systems (Sun Microsystems, Paris research team)

  • Test description

    • Nodes file

    • File listing the files to deploy

    • XML file describing the node profiles

  • Set of scripts to deploy, launch and fetch results



Description language extension

  • Adding a specific XML tag for failure injection

    • <failure grp="groupName">: single failure

    • <failure dep="profileName">: correlated failures

    <network analyze-class="test.Analyze">
      <profile name="manager" replicas="1">
        <!-- peer information -->
        <peer base-name="peerA"/>
        ...
        <bootstrap class="test.MyClass1"/>
        <!-- argument -->
        <arg value="x"/>
      </profile>
      <profile name="non-manager" replicas="20">
        <peer base-name="peerB"/>
        ...
        <bootstrap class="test.MyClass2"/>
      </profile>
    </network>



Injecting failures: when?

  • Active research field

  • A failure schedule generator

    • Input

      • The failure tags in the XML description file

      • Probabilistic parameters (MTBF)

    • Output

      • A new configuration file for JDF

        Format: peerID=uptime
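
The generator described above can be illustrated with a minimal sketch. It is not the authors' tool: it assumes uptimes are drawn from an exponential distribution whose mean is the MTBF, that uptimes are expressed in seconds, and that a fixed random seed is what makes a schedule reproducible. The function names and peer IDs are hypothetical; only the peerID=uptime output format comes from the slide.

```python
import random


def generate_schedule(peer_ids, mtbf_seconds, seed=0):
    """Draw one uptime per peer (assumption: exponential law, mean = MTBF)."""
    rng = random.Random(seed)  # fixed seed: the same schedule on every run
    return {peer: rng.expovariate(1.0 / mtbf_seconds) for peer in peer_ids}


def format_schedule(schedule):
    """Render the schedule as peerID=uptime lines, uptime rounded to seconds."""
    return "\n".join(f"{peer}={round(uptime)}" for peer, uptime in schedule.items())


# Hypothetical peer IDs, for illustration only.
peers = [f"peerB{i}" for i in range(1, 4)]
print(format_schedule(generate_schedule(peers, mtbf_seconds=60, seed=42)))
```

Running the generator twice with the same seed yields the same file, which is one simple way to replay a failure scenario.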



Injecting failures: how?



Using failure injectors (1)

  • Launching a simple test

  • Correlated failures



Using failure injectors (2)

  • Refining the failure schedule



Failure detectors

  • Basic building block for fault-tolerance mechanisms

  • Basic principle

    • Periodic heartbeat exchanges (all-to-all)

    • On each node, a list of suspects is updated according to heartbeat arrivals
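
The basic principle can be sketched as follows. This is a simplified illustration, not the detector evaluated in the talk: it assumes a fixed timeout (an adaptive detector would tune it at run time), and the class and node names are hypothetical. Timestamps are passed in explicitly so that the example is deterministic.

```python
class HeartbeatDetector:
    """Suspects any peer whose last heartbeat is older than a fixed timeout."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        # Every peer starts with a last-seen time of 0.0.
        self.last_seen = {peer: 0.0 for peer in peers}

    def on_heartbeat(self, peer, now):
        """Record the arrival time of a heartbeat from `peer`."""
        self.last_seen[peer] = now

    def suspects(self, now):
        """Return the sorted list of peers that have been silent for too long."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


detector = HeartbeatDetector(["n1", "n2", "n3"], timeout=2.0)
detector.on_heartbeat("n1", now=1.0)
detector.on_heartbeat("n2", now=1.0)
detector.on_heartbeat("n1", now=3.0)
print(detector.suspects(now=4.0))  # ['n2', 'n3']
```

A suspected peer leaves the list as soon as a later heartbeat from it arrives, since the list is recomputed from the last-seen times on each call.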



Grid failure detectors (GFD)

  • Adaptability

    • Network load

    • Quality of service

  • Scalability

    • Hierarchical failure detectors

      • All-to-all within clusters

      • Leader-to-leader among clusters
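
To see why the hierarchy helps scalability, the number of directed monitoring links can be counted for both organizations. The sketch below is a back-of-the-envelope illustration: the function names are hypothetical, and the example of 64 nodes split into 4 clusters of 16 assumes equal cluster sizes, which the slides do not state.

```python
def flat_links(n):
    """All-to-all monitoring: every node watches every other node."""
    return n * (n - 1)


def hierarchical_links(cluster_sizes):
    """All-to-all inside each cluster, plus leader-to-leader among clusters."""
    intra = sum(size * (size - 1) for size in cluster_sizes)
    leaders = len(cluster_sizes)  # one leader per cluster
    inter = leaders * (leaders - 1)
    return intra + inter


print(flat_links(64))                        # 4032
print(hierarchical_links([16, 16, 16, 16]))  # 972
```

Under these assumptions the hierarchical organization needs roughly a quarter of the links, and only 12 of them cross the wide-area network.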



Experimental testbed

  • Grid5000 grid platform

    • 9 cities interconnected by Renater

      • Bandwidth: 1 Gb/s (10 Gb/s soon)

      • Latency: from 4 to ~30 ms

    • In each city, clusters provide high-performance networks

      • Bandwidth: 1 Gb/s

      • Latency: a few microseconds

        http://www.grid5000.fr/



Experimental setup

  • 64 nodes partitioned across 4 different cities

[Figure: four clusters, labeled Cluster 1 to Cluster 4]



Failure injector alone

  • MTBF = 1 minute

  • No failure dependencies



Correlated failures

  • Adding a failure dependency in cluster 1:

    <failure dep="cluster1-leader">

[Figure annotation: the cluster 1 leader crashes]
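
One possible reading of such a dependency is that a dependent peer never outlives the peer it depends on. The sketch below applies that rule to a schedule in the peerID=uptime spirit. The semantics (a dependent's uptime is capped at that of its target), the function name, and the sample values are all assumptions made for illustration, not the documented behavior of the tool. Only direct, single-level dependencies are handled.

```python
def apply_dependencies(schedule, depends_on):
    """Cap each dependent peer's uptime at the uptime of its target peer."""
    result = dict(schedule)  # leave the input schedule untouched
    for peer, target in depends_on.items():
        result[peer] = min(schedule[peer], schedule[target])
    return result


# Hypothetical uptimes (seconds) and dependencies on the cluster leader.
schedule = {"leader1": 40, "peerA": 90, "peerB": 25}
deps = {"peerA": "leader1", "peerB": "leader1"}
print(apply_dependencies(schedule, deps))
# {'leader1': 40, 'peerA': 40, 'peerB': 25}
```

Here peerA is brought down together with the leader, while peerB keeps its own earlier failure time.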



Failure detection in subgroups

  • No leader failures

  • No failure dependencies



Between groups

  • Failure dependency in each cluster to avoid new leader selection



Conclusion

  • Evaluating a distributed system is complex

  • Running experiments makes it possible to

    • Evaluate a new concept or software

    • Debug during the implementation phase

  • Failure-injection mechanisms make it possible to experiment with fault-tolerance mechanisms

  • We have designed a failure injection tool that allows the tester to run large-scale experiments

    • Under various volatility conditions

    • In a reproducible manner
