Using failure injection mechanisms to experiment and evaluate a grid failure detector

Sébastien Monnet and Marin Bertier

IRISA / INRIA, PARIS project-team

WCGC 2006 - Rio de Janeiro


Systems evaluation

  • Simulations

    • Fast/easy

    • System model

  • Formal proofs

    • Reliable

    • System model

  • Experiments on real testbeds

    • Real system code / real environment

    • Hard!



Running experiments

  • Find resources

  • Deploy the system

  • Launch the test

  • Control the test

  • Get and analyze results



Experimenting with fault tolerance

  • Evaluate fault tolerance mechanisms

    • Fault-free runs

    • With failures

  • Fault prevention cost

  • Resilience to failures

  • Overhead due to failures (recovery, adaptation, etc.)



Volatility control: needs

  • Assumption: a stable testbed

  • Injecting failures

    • At large scale

    • Accurately

    • Reproducibly

      • Using failure scenarios



JXTA Distributed Framework (JDF)

  • A tool to automate the testing of JXTA-based systems (Sun Microsystems, PARIS research team)

  • Test description

    • Nodes file

    • File listing the files to deploy

    • XML file describing the node profiles

  • A set of scripts to deploy, launch, and fetch results



Description language extension

Adding a specific XML tag for failure injection, covering single failures and correlated failures:

    <failure grp="groupName">

    <failure dep="profileName">

    <network analyze-class="test.Analyze">
      <profile name="manager" replicas="1">
        <!-- peer information -->
        <peer base-name="peerA"/>
        ...
        <bootstrap class="test.MyClass1"/>
        <!-- argument -->
        <arg value="x"/>
      </profile>
      <profile name="non-manager" replicas="20">
        <peer base-name="peerB"/>
        ...
        <bootstrap class="test.MyClass2"/>
      </profile>
    </network>
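
To make the role of the description file concrete, here is a minimal Python sketch that reads a description of this shape and collects, for each profile, its replica count and its failure attributes. It is only an illustration: the slides do not show where the failure tag sits, so placing it as a self-closing child of a profile is an assumption, and `read_profiles` and the sample document are hypothetical names, not part of JDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical description file. The position of <failure> inside <profile>
# is an assumption; the slides only show the tag itself.
DESCRIPTION = """
<network analyze-class="test.Analyze">
  <profile name="manager" replicas="1">
    <peer base-name="peerA"/>
    <bootstrap class="test.MyClass1"/>
    <arg value="x"/>
  </profile>
  <profile name="non-manager" replicas="20">
    <peer base-name="peerB"/>
    <failure dep="manager"/>
    <bootstrap class="test.MyClass2"/>
  </profile>
</network>
"""


def read_profiles(xml_text):
    """Return one dict per <profile>, including its failure attributes if any."""
    root = ET.fromstring(xml_text)
    profiles = []
    for profile in root.findall("profile"):
        failure = profile.find("failure")
        profiles.append({
            "name": profile.get("name"),
            "replicas": int(profile.get("replicas")),
            "base_name": profile.find("peer").get("base-name"),
            # None when the profile carries no failure tag.
            "failure": dict(failure.attrib) if failure is not None else None,
        })
    return profiles


profiles = read_profiles(DESCRIPTION)
for p in profiles:
    print(p["name"], p["replicas"], p["failure"])
```

A schedule generator could take this list as its input, together with the probabilistic parameters.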



Injecting failures: when?

  • Active research field

  • A failure schedule generator

    • Input

      • The failure tags in the XML description file

      • Probabilistic parameters (MTBF)

    • Output

      • A new configuration file for JDF

        Format: peerID=uptime
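
As an illustration of what such a generator might do, the Python sketch below draws one uptime per peer and prints it in the peerID=uptime format. Several points are assumptions rather than facts from the talk: uptimes are drawn from an exponential law whose mean is the MTBF, the MTBF applies per peer, uptimes are written as whole seconds, and the peer names and the functions `generate_schedule` and `format_schedule` are hypothetical.

```python
import random


def generate_schedule(peer_ids, mtbf_seconds, seed=0):
    """Draw one uptime per peer from an exponential law of mean mtbf_seconds.

    The exponential law is an assumption; the slides only mention MTBF.
    """
    rng = random.Random(seed)  # fixed seed, so the same schedule can be replayed
    return {peer: rng.expovariate(1.0 / mtbf_seconds) for peer in peer_ids}


def format_schedule(schedule):
    """Render the schedule as peerID=uptime lines (whole seconds, assumed unit)."""
    return "\n".join(f"{peer}={int(uptime)}" for peer, uptime in schedule.items())


# Hypothetical peer identifiers: 20 replicas of a "peerB" profile.
peers = [f"peerB{i}" for i in range(1, 21)]
schedule = generate_schedule(peers, mtbf_seconds=60)
print(format_schedule(schedule))
```

Seeding the random generator is one simple way to meet the reproducibility need: the same seed and parameters give the same failure scenario on every run.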



Injecting failures: how?



Using failure injectors (1)

  • Launching a simple test

  • Correlated failures



Using failure injectors (2)

  • Refining the failure schedule



Failure detectors

  • Basic building block for fault-tolerance mechanisms

  • Basic principle

    • Periodic heartbeat exchanges (all-to-all)

    • On each node, a suspect list is updated according to heartbeat arrivals
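
The basic principle can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the detector evaluated in the talk: it uses a fixed timeout and explicit timestamps, and the class `HeartbeatDetector` and the node names are hypothetical.

```python
class HeartbeatDetector:
    """Suspects a peer when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, peers, timeout, now=0.0):
        self.timeout = timeout
        # Every peer is considered alive at start-up.
        self.last_seen = {peer: now for peer in peers}

    def on_heartbeat(self, peer, now):
        """Record the arrival time of a heartbeat from `peer`."""
        self.last_seen[peer] = now

    def suspects(self, now):
        """Return the sorted list of peers whose last heartbeat is too old."""
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.timeout)


detector = HeartbeatDetector(["n1", "n2", "n3"], timeout=2.0)
detector.on_heartbeat("n1", now=1.0)
detector.on_heartbeat("n2", now=1.0)
detector.on_heartbeat("n1", now=2.5)
print(detector.suspects(now=3.5))  # n2 and n3 have been silent for more than 2 s

# A late heartbeat removes the peer from the suspect list again.
detector.on_heartbeat("n2", now=3.6)
print(detector.suspects(now=3.7))
```

A suspicion is therefore revocable: a peer that was only slow leaves the list as soon as one of its heartbeats arrives.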



Grid failure detectors (GFD)

  • Adaptability

    • Network load

    • Quality of service

  • Scalability

    • Hierarchical failure detectors

      • All-to-all within clusters

      • Leader-to-leader among clusters
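
A rough count of heartbeat messages per period shows why the hierarchy helps scalability. The Python sketch below assumes one message per ordered pair of nodes per period and uses the 64 nodes and 4 clusters of the experimental setup; the even split into clusters of 16 is an assumption, and the function names are hypothetical.

```python
def all_to_all_messages(n):
    """Heartbeats per period when each of n nodes sends to every other node."""
    return n * (n - 1)


def hierarchical_messages(cluster_sizes):
    """Heartbeats per period with all-to-all inside clusters and between leaders."""
    intra = sum(size * (size - 1) for size in cluster_sizes)
    leaders = len(cluster_sizes)  # one leader per cluster
    return intra + leaders * (leaders - 1)


flat = all_to_all_messages(64)
hier = hierarchical_messages([16, 16, 16, 16])
print(flat, hier)  # 4032 versus 972
```

Under these assumptions the hierarchical organization sends about a quarter of the messages, and only the 12 leader-to-leader messages cross the wide-area links.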



Experimental testbed

  • Grid5000 grid platform

    • 9 cities interconnected by Renater

      • Bandwidth: 1 Gb/s (10 Gb/s soon)

      • Latency: from 4 to ~30 ms

    • In each city, clusters provide high-performance networks

      • Bandwidth: 1 Gb/s

      • Latency: a few microseconds

        http://www.grid5000.fr/



Experimental setup

  • 64 nodes partitioned across 4 different cities

(Figure: Cluster 1, Cluster 2, Cluster 3, Cluster 4)



Failure injector alone

  • MTBF = 1 minute

  • No failure dependencies



Correlated failures

  • Adding a failure dependency in cluster 1:

    <failure dep="cluster1-leader">

(Figure annotation: the cluster 1 leader crashes)
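
One possible reading of such a dependency is that a dependent node cannot outlive the node it depends on. The Python sketch below applies that reading to a failure schedule; the semantics are an assumption, not something the slides state, and `apply_dependencies`, the peer names, and the uptimes are hypothetical.

```python
def apply_dependencies(schedule, depends_on):
    """Cap each dependent peer's uptime at the uptime of the peer it depends on.

    Assumed semantics: when the leader crashes, its dependents crash too.
    """
    result = dict(schedule)
    for peer, leader in depends_on.items():
        result[peer] = min(schedule[peer], schedule[leader])
    return result


# Hypothetical uptimes in seconds.
schedule = {"cluster1-leader": 40, "c1-node1": 90, "c1-node2": 25, "c2-node1": 70}
depends_on = {"c1-node1": "cluster1-leader", "c1-node2": "cluster1-leader"}

correlated = apply_dependencies(schedule, depends_on)
print(correlated)
# c1-node1 now stops at 40 with its leader, c1-node2 had already failed at 25,
# and c2-node1 (no dependency) keeps its own uptime.
```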



Failure detection in subgroups

  • No leader failures

  • No failure dependencies



Between groups

  • Failure dependency in each cluster to avoid the selection of a new leader



Conclusion

  • Evaluating a distributed system is complex

  • Running experiments provides the ability to

    • Evaluate a new concept or software

    • Debug during the implementation phase

  • Failure-injection mechanisms provide the ability to experiment with fault-tolerance mechanisms

  • We have designed a failure injection tool that allows the tester to run large-scale experiments

    • Under various volatility conditions

    • In a reproducible manner
