Using Fault Model Enforcement (FME) to Improve Availability

EASY ’02 Workshop

Kiran Nagaraja, Ricardo Bianchini,

Richard Martin, Thu Nguyen

Department of Computer Science

Rutgers University


Motivation

  • Network services are extremely complex

    • Typically many software and hardware components

    • Numerous fault points and types

      • E.g., nodes, disks, cables, links, switches, etc.

  • Extremely difficult for services to tolerate all these faults

    • Hard to reason about all possible faults

    • Difficult to determine actual fault

      • Many faults exhibit same runtime symptoms


FME Approach

  • Define a reduced abstract fault model

    • Components, faults, symptoms, component behavior during faults

  • Enforce this fault model at run-time

    • If an “unexpected” fault occurs, map to one that was planned for in the abstract model

    • “If the facts don’t fit the theory, change the facts.” - Albert Einstein

  • Allow designer to concentrate on tolerating a well-defined, yet limited in complexity, set of faults


Our Study

  • Estimate potential impact of FME

    • Have not yet implemented FME

  • Case study: PRESS cluster-based web server

    • PRESS has simple abstract fault model

    • In a companion study, PRESS achieved only around three 9’s of availability

  • Study hypothetical improvement if FME were used to enforce PRESS’s abstract fault model

    • FME can reduce the unavailability by up to 50%


Outline

  • FME in more detail

  • Evaluation methodology

  • PRESS web server

  • Availability study

  • Related work

  • Conclusions

  • Future directions


Fault Model Enforcement (FME)

  • Enforce a reduced fault model at runtime

    • Allow service to perform correct recovery action to regain full functionality

  • How to enforce a reduced fault model?

    • Two ideas so far

      • Map an unexpected fault to an expected fault

        • E.g., crash a node if the network link connecting it to the switch fails

      • Fail outer component if sub-component fails

        • E.g., crash a node if the disk fails

  • How is it different from fail-stop?

    • Allows reasoning about failures at a desired abstraction
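
The two enforcement ideas above can be read as a lookup from an observed fault to a fault the abstract model already covers. The sketch below is only an illustration under that reading: the names `ABSTRACT_MODEL`, `ENFORCEMENT_MAP`, and `enforce` are hypothetical, and the slides state that FME had not yet been implemented.

```python
# Hypothetical sketch of fault model enforcement; not the authors' implementation.

# Faults the service is designed to tolerate (the reduced abstract model).
ABSTRACT_MODEL = {"node_crash"}

# Unexpected faults and the expected fault each one is converted into.
# Both entries follow the examples given on the slide.
ENFORCEMENT_MAP = {
    "link_down": "node_crash",     # map an unexpected fault to an expected one
    "disk_failure": "node_crash",  # fail the outer component when a sub-component fails
}


def enforce(observed_fault):
    """Return the fault the service should see after enforcement."""
    if observed_fault in ABSTRACT_MODEL:
        return observed_fault  # already part of the model, nothing to do
    if observed_fault in ENFORCEMENT_MAP:
        return ENFORCEMENT_MAP[observed_fault]
    raise ValueError(f"no enforcement rule for fault: {observed_fault}")
```

With this table, the service only ever has to handle `node_crash`, which is the sense in which the designer reasons at a chosen abstraction rather than about every physical fault.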


Evaluation Methodology

  • Want to evaluate FME’s potential impact

  • Two-phase methodology

    • Phase I - Single fault injection analysis

      • Define and inject faults on “live” system

      • Monitor system performance (throughput, T) and availability (A), where A = fraction of successful requests

    • Phase II - Use an analytical model to determine performability

      • Computes average availability and average throughput
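
The Phase I metrics can be computed from request counts over a measurement interval. This is a minimal sketch: it assumes throughput counts successful requests per second (the slides do not define T precisely), and the function name `measure` is hypothetical.

```python
def measure(successes, failures, interval_seconds):
    """Return (throughput, availability) for one measurement interval.

    Availability follows the slide's definition: the fraction of
    requests that succeeded. Throughput is assumed here to be
    successful requests per second.
    """
    total = successes + failures
    if total == 0:
        raise ValueError("no requests were issued in this interval")
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    availability = successes / total
    throughput = successes / interval_seconds
    return throughput, availability
```

For example, 900 successful and 100 failed requests over 10 seconds give an availability of 0.9 and, under the assumption above, a throughput of 90 requests per second.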


Case Study: PRESS Web Server

  • Cluster-based, locality-conscious web server

    • Serve requests out of global memory pool

    • Exclusion from pool → lower performance

  • Simple fault model

    • Connection failure/lost heartbeats = node failure

    • Recovery through rejoin of “new” node

  • Several versions developed over time

    • TCP, VIA

    • Different fault detection mechanisms

      • Heart-beat for TCP

      • Connection breaks for VIA
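
The heartbeat rule in the fault model (lost heartbeats = node failure) can be sketched as a timeout check. This is an assumption-laden illustration, not PRESS code: the class `HeartbeatMonitor`, its methods, and the timeout value are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Declare a node failed when no heartbeat arrives within `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of most recent heartbeat

    def heartbeat(self, node, now):
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

Note that this detector cannot tell a crashed node from a frozen node or a dead link: all three look like missing heartbeats, which is exactly the ambiguity the simple fault model accepts.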


Fault Set

  • Fault load

    • Link down

    • Switch down

    • SCSI timeout

    • Node crash

    • Node freeze

    • Application crash

    • Application hang

  • All faults are modeled as fail-stop


PRESS with FME

  • Recovery upon fault model mismatch

    • Restart 0, 1 or all nodes?

  • FME approach: reboot the appropriate node after a fault and its recovery have occurred

    • Link down – reboot unreachable node

    • Switch down – reboot all nodes

    • Disk failure – reboot node with faulty disk

    • Node, application crash – do nothing
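
The reboot policy listed above can be written as a small decision function. This is a sketch under stated assumptions: the fault labels and the function `nodes_to_reboot` are hypothetical names, and the slides describe this policy as part of a hypothetical study rather than a deployed mechanism.

```python
def nodes_to_reboot(fault, affected_node, all_nodes):
    """Return the nodes FME would reboot once the fault has been repaired."""
    if fault == "link_down":
        return [affected_node]   # reboot the unreachable node
    if fault == "switch_down":
        return list(all_nodes)   # reboot every node
    if fault == "disk_failure":
        return [affected_node]   # reboot the node with the faulty disk
    if fault in ("node_crash", "application_crash"):
        return []                # already matches the fault model: do nothing
    raise ValueError(f"fault not covered by the policy: {fault}")
```

A call such as `nodes_to_reboot("switch_down", None, ["n1", "n2", "n3", "n4"])` returns all four nodes, while a node crash returns an empty list because PRESS already handles it through its normal rejoin path.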


Single-Fault Experiments

  • Setup: 4-PC cluster running at 90% load

  • 3 versions: TCP, TCP-HB, VIA

  • Use results to evaluate impact of FME


Single Fault - Results

[Figure panels: Link Failure, Application Hang]


Modeling – Seven Stage Model

  • Input: measured throughput and availability

  • Parameters: MTTF, MTTR, operator on site time

  • Output: average availability & average throughput


Modeling Availability

  • Assumptions:

    • Effects of faults are independent

    • Fault arrivals are exponential

  • Overall unavailability = Σ over all faults (unavailability due to each fault)
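
Under the independence assumption, per-fault unavailabilities simply add. The sketch below is a deliberately simplified single-stage stand-in for the seven-stage model, which the slides do not spell out: each fault contributes the fraction of time spent degraded multiplied by the fraction of requests lost while degraded. The function names and the per-fault formula are assumptions.

```python
def fault_unavailability(mttf_hours, degraded_hours, availability_during_fault):
    """Unavailability contributed by one fault type (simplified, one stage).

    mttf_hours: mean time between occurrences of this fault
    degraded_hours: mean time the service stays degraded per occurrence
    availability_during_fault: fraction of requests still served meanwhile
    """
    fraction_degraded = degraded_hours / (mttf_hours + degraded_hours)
    return fraction_degraded * (1.0 - availability_during_fault)


def overall_unavailability(faults):
    """Sum the contributions, assuming the effects of faults are independent."""
    return sum(fault_unavailability(*fault) for fault in faults)
```

As a usage illustration with made-up numbers (not results from the study), `overall_unavailability([(99, 1, 0.5), (999, 1, 0.0)])` combines a fault that halves availability for 1 hour in every 100 with one that takes the service fully down for 1 hour in every 1000.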


Modeling Results

  • Application fault rate: 1/month

  • Time to operator intervention: 5 minutes

  • Unavailability of TCP-HB reduced by ~50%

  • VIA: ~36% reduction


Modeling Results

  • Application fault rate: 1/day (unstable software)

  • Time to operator intervention: 5 minutes

  • Unavailability of TCP-HB reduced by >50%

  • VIA: ~13% reduction


Related Work

  • Enforcing fail-stop

    • Tandem Non-Stop – process pairs

      • Robust design with rigorous internal assertions

  • Fault detection and fail-over

    • HA-Linux

  • Reactive and proactive rejuvenation

    • Recursive restartability (ROC) – Berkeley & Stanford

    • Software rejuvenation – Duke


Conclusion

  • FME allows for very simple fault models

  • FME can cut the unavailability by up to 50%

  • Fault detection mechanism is crucial for effectiveness

    • Benefits increase with fault coverage


FME - Future Directions

  • How extensive should the fault model be?

    • Determines programming complexity/effort

  • How to prevent FME from reducing availability?

    • Bugs within enforcement?

    • When to declare a symptom a fault?

  • FME reduces human intervention

    • Are humans better at deciding?

      • 8-23% of recovery procedures are botched [Brown 2001]


Thank you.

http://www.panic-lab.rutgers.edu/Projects/vivo


Communication Architecture

  • All operations by main thread are non-blocking

  • Separate send, receive and multiple disk helper threads

  • Filling up of queues could stall the entire node


Performability

  • Model computes 2 metrics:

    • Average throughput (AT)

    • Average availability (AA)

  • Performability

    P = Tn × log(AI) / log(AA)

    • AI: availability of an ideal system (99.999%)

    • Log scale ratio allows a linear relationship with unavailability
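
The performability formula can be evaluated directly to see the linear relationship with unavailability. In this sketch `Tn` is assumed to be a throughput normalized to the range 0 to 1 (the slide does not define it), and the function name `performability` is hypothetical.

```python
import math


def performability(normalized_throughput, average_availability,
                   ideal_availability=0.99999):
    """P = Tn * log(AI) / log(AA), with AI defaulting to five 9's."""
    if not 0.0 < average_availability < 1.0:
        raise ValueError("average availability must be strictly between 0 and 1")
    if not 0.0 < ideal_availability < 1.0:
        raise ValueError("ideal availability must be strictly between 0 and 1")
    return (normalized_throughput * math.log(ideal_availability)
            / math.log(average_availability))
```

Because log(A) is approximately -(1 - A) when A is close to 1, the ratio of logs is approximately the ratio of unavailabilities. Going from 0.999 to 0.9995 availability halves the unavailability and therefore roughly doubles P at the same throughput, and a system that matches the ideal availability at full normalized throughput scores 1.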


Experiments: Single-Fault Loads

  • 4 × 800 MHz PIII PCs, 206 MB, 2 × 10000 SCSI disks, 1 Gb/s cLAN interconnect (TCP or VIA)

  • PRESS: 128MB file cache, static content

  • Clients: constant rate ~ 90% server capacity

    • Modified sclient [Banga 97]

    • Rutgers trace; file size = avg. request size


Mendosus – Fault Injection

[Figure: Mendosus fault-injection architecture. Recoverable labels: Central Controller; Events; User-Level Daemon; Process Ctrl; Applications (e.g., PRESS); Mlib; comLib; glibc; sys_calls; n/w stack; Kernel; emulation; SCSI; n/w faults; Node/OS; Node A; Node B; Fast & Reliable SAN.]


Phase II – Modeling Performability

  • 5-minute duration for the operator intervention (E) and restart (F) stages

