Using failure injection mechanisms to experiment and evaluate a grid failure detector

Sébastien Monnet and Marin Bertier

IRISA / INRIA, PARIS project-team

WCGC 2006 - Rio de Janeiro


Systems evaluation

  • Simulations

    • Fast/easy

    • System model

  • Formal proofs

    • Reliable

    • System model

  • Experiments on real testbeds

    • Real system code / real environment

    • Hard!


Running experiments

  • Find resources

  • Deploy the system

  • Launch the test

  • Control the test

  • Get and analyze results


Experimenting with fault tolerance

  • Evaluate fault tolerance mechanisms

    • Fault-free runs

    • With failures

  • Fault prevention cost

  • Resilience to failures

  • Overhead due to failures (recovery, adaptation, etc.)


Volatility control - needs

  • Assumption: a stable testbed

  • Injecting failures

    • At large scale

    • Accurately

    • Reproducibly

      • Using failure scenarios


JXTA Distributed Framework (JDF)

  • A tool to automate the tests of JXTA-based systems (Sun Microsystems, Paris research team)

  • Test description

    • Nodes file

    • File listing the files to deploy

    • XML file describing the node profiles

  • Set of scripts to deploy, launch and fetch results


Description language extension

Adding a specific XML tag for failure injection

<failure grp="groupName">

<failure dep="profileName">

Single failure

Correlated failures

<network analyze-class="test.Analyze">
  <profile name="manager" replicas="1">
    <!-- peer information -->
    <peer base-name="peerA"/>
    ...
    <bootstrap class="test.MyClass1"/>
    <!-- argument -->
    <arg value="x"/>
  </profile>
  <profile name="non-manager" replicas="20">
    <peer base-name="peerB"/>
    ...
    <bootstrap class="test.MyClass2"/>
  </profile>
</network>
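
To make the extension more concrete, here is a minimal Python sketch that reads a description file and lists the failure tag attached to each profile. It assumes, for illustration only, that the failure tag is written as an empty child element of a profile; the slides show the tag's attributes but not where it is placed. The helper name read_failure_tags and the attribute values are hypothetical and are not part of JDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical description file: placing <failure> inside <profile>
# is an assumption made for this sketch, not something the slides state.
DESCRIPTION = """
<network analyze-class="test.Analyze">
  <profile name="manager" replicas="1">
    <peer base-name="peerA"/>
    <bootstrap class="test.MyClass1"/>
    <failure grp="groupName"/>
  </profile>
  <profile name="non-manager" replicas="20">
    <peer base-name="peerB"/>
    <bootstrap class="test.MyClass2"/>
    <failure dep="manager"/>
  </profile>
</network>
"""


def read_failure_tags(xml_text):
    """Return {profile name: attributes of its <failure> tag, or None}."""
    root = ET.fromstring(xml_text)
    tags = {}
    for profile in root.findall("profile"):
        failure = profile.find("failure")
        tags[profile.get("name")] = (
            dict(failure.attrib) if failure is not None else None
        )
    return tags


if __name__ == "__main__":
    for name, attrs in read_failure_tags(DESCRIPTION).items():
        print(name, attrs)
```

A schedule generator could take such a mapping as its input, together with the probabilistic parameters.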


Injecting failures - when?

  • Active research field

  • A failure schedule generator

    • Input

      • The failure tags in the XML description file

      • Probabilistic parameters (MTBF)

    • Output

      • A new configuration file for JDF

        Format: peerID=uptime
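
The generator described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: uptimes are drawn independently from an exponential distribution whose mean is the MTBF (a common modelling choice that the slides do not confirm), the MTBF is taken per peer and in seconds, and the peer IDs are made up. A fixed seed makes the schedule reproducible, and the output uses the peerID=uptime format.

```python
import random


def generate_schedule(peer_ids, mtbf_seconds, seed=0):
    """Draw one uptime per peer; the same seed gives the same schedule."""
    rng = random.Random(seed)
    # expovariate takes the rate (1 / mean), so the mean uptime is the MTBF.
    return {peer: rng.expovariate(1.0 / mtbf_seconds) for peer in peer_ids}


def format_schedule(schedule):
    """Render the schedule as peerID=uptime lines (uptime in whole seconds)."""
    return "\n".join(
        f"{peer}={int(uptime)}" for peer, uptime in schedule.items()
    )


if __name__ == "__main__":
    peers = [f"peerB{i}" for i in range(1, 6)]  # hypothetical peer IDs
    print(format_schedule(generate_schedule(peers, mtbf_seconds=60, seed=42)))
```

Re-running with the same seed replays the same failure scenario, which is what makes a test reproducible.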


Injecting failures - how?


Using failure injectors (1)

  • Launching a simple test

  • Correlated failures


Using failure injectors (2)

  • Refining the failure schedule


Failure detectors

  • Basic building block for fault-tolerance mechanisms

  • Basic principle

    • Periodic heartbeat exchanges (all-to-all)

    • On each node, a list of suspects is updated according to heartbeat arrivals
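
The basic principle can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the timeout is fixed, timestamps are passed in explicitly so the example is deterministic, and the class name HeartbeatDetector is hypothetical. It is not the detector evaluated in this presentation.

```python
class HeartbeatDetector:
    """Suspect a peer when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, peers, timeout, now=0.0):
        self.timeout = timeout
        # Every peer is considered heard from at start-up.
        self.last_seen = {peer: now for peer in peers}

    def on_heartbeat(self, peer, now):
        """Record the arrival time of a heartbeat from `peer`."""
        self.last_seen[peer] = now

    def suspects(self, now):
        """Return the sorted list of peers whose last heartbeat is too old."""
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    fd = HeartbeatDetector(["n1", "n2", "n3"], timeout=2.0)
    fd.on_heartbeat("n1", now=1.0)
    fd.on_heartbeat("n2", now=1.0)
    fd.on_heartbeat("n1", now=2.5)
    print(fd.suspects(now=3.5))  # ['n2', 'n3']
```

A late heartbeat removes the peer from the list again, so a suspicion is a revisable guess rather than a final verdict.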


Grid failure detectors (GFD)

  • Adaptability

    • Network load

    • Quality of service

  • Scalability

    • Hierarchical failure detectors

      • All-to-all within clusters

      • Leader-to-leader among clusters
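
A back-of-the-envelope Python sketch shows why the hierarchy helps scalability. It counts directed heartbeat links under two assumptions that the slides do not spell out: each cluster has exactly one leader, and the leaders exchange heartbeats all-to-all among themselves. The cluster sizes are illustrative.

```python
def flat_links(n):
    """Directed heartbeat links when every node monitors every other node."""
    return n * (n - 1)


def hierarchical_links(cluster_sizes):
    """All-to-all inside each cluster, plus all-to-all among the leaders."""
    inside = sum(size * (size - 1) for size in cluster_sizes)
    leaders = len(cluster_sizes)  # assumed: one leader per cluster
    return inside + leaders * (leaders - 1)


if __name__ == "__main__":
    clusters = [16, 16, 16, 16]  # illustrative sizes, 64 nodes in total
    print(flat_links(sum(clusters)))     # 4032
    print(hierarchical_links(clusters))  # 972
```

With these sizes, the hierarchical layout needs roughly a quarter of the links of a flat all-to-all layout, and most of them stay inside a cluster.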


Experimental testbed

  • Grid5000 grid platform

    • 9 cities inter-connected by Renater

      • Bandwidth: 1Gb/s (10Gb/s soon)

      • Latency: from 4 to ~30ms

    • In each city, clusters provide high-performance networks

      • Bandwidth: 1Gb/s

      • Latency: a few microseconds

        http://www.grid5000.fr/


Experimental setup

  • 64 nodes partitioned in 4 different cities

[Figure: Cluster 1, Cluster 2, Cluster 3, Cluster 4]


Failure injector - alone

  • MTBF = 1 minute

  • No failure dependencies


Correlated failures

  • Adding a failure dependency in cluster 1:

    <failure dep="cluster1-leader">

[Figure annotation: Cluster 1 leader crashes]
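
One plausible reading of such a dependency is that the dependent peers go down no later than the peer they depend on. The Python sketch below applies that reading to a set of uptimes. It is an assumption about the semantics rather than the tool's documented behaviour; the peer names other than cluster1-leader are hypothetical, and only one level of dependency is handled.

```python
def apply_dependencies(uptimes, depends_on):
    """Cap each dependent peer's uptime at the uptime of its target peer."""
    result = dict(uptimes)
    for peer, target in depends_on.items():
        # Read the target's original uptime: chained dependencies are
        # deliberately not resolved in this sketch.
        result[peer] = min(uptimes[peer], uptimes[target])
    return result


if __name__ == "__main__":
    uptimes = {"cluster1-leader": 30, "c1-n1": 50, "c1-n2": 10, "c2-n1": 80}
    depends_on = {"c1-n1": "cluster1-leader", "c1-n2": "cluster1-leader"}
    print(apply_dependencies(uptimes, depends_on))
    # {'cluster1-leader': 30, 'c1-n1': 30, 'c1-n2': 10, 'c2-n1': 80}
```

Peers that were already scheduled to fail earlier keep their own uptime, and peers outside the dependency are left untouched.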


Failure detection in subgroups

  • No leader failures

  • No failure dependencies


Between groups

  • Failure dependency in each cluster to avoid new leader selection


Conclusion

  • Evaluating a distributed system is complex

  • Running experiments makes it possible to

    • Evaluate a new concept or software

    • Debug during the implementation phase

  • Failure-injection mechanisms make it possible to experiment with fault-tolerance mechanisms

  • We have designed a failure injection tool that allows the tester to run large-scale experiments

    • With various volatility conditions

    • In a reproducible manner
