The virtue of dependent failures in multi-site systems. Flavio Junqueira and Keith Marzullo, University of California, San Diego. Workshop on Hot Topics in System Dependability (HotDep), Yokohama, Japan, June 2005.

Presentation Transcript

The virtue of dependent failures in multi-site systems

Flavio Junqueira and Keith Marzullo

University of California, San Diego

Workshop on Hot Topics in System Dependability (HotDep), Yokohama, Japan, June 2005

Multi-site systems
  • Collection of sites across a WAN
    • Multiple processors per site
      • Storage nodes
      • Computing nodes
    • Share resources
    • E.g. BIRN, Geon, TeraGrid
  • Failures
    • Processors unavailable
    • Services do not mask failures
  • Improve availability under failures
    • Replication
    • Minimize overhead
Introduction
  • Failures in multi-site systems
    • Processor failures
    • Site failures
      • Processors of the site become unavailable
    • A new failure model
  • Availability through replication
    • Replica placement
    • Operations on replicas: quorums
      • Replicated data: quorum update
      • Replicated functionality: state-machine using Paxos
    • Quorum constructions
  • Failure model in practice
    • Implement the model
    • Site availability in BIRN
    • Model for processor failures within a site

Software and hardware faults

  • Misconfigured software
  • Shared resources
    • Storage
    • Power circuits
    • Cooling pipes
    • Air conditioning
    • Network
A dependent failure model
  • Threshold model
    • Limit on the number of processor failures
    • Simple
    • Models homogeneous, independently failing processors well
  • Multi-site: sites are unavailable frequently enough
    • Processor failures are not IID
    • All processors of a site become unavailable together
  • The multi-site threshold model
    • Two components
      • Threshold on the number of site failures (fs)
      • One threshold per site on processor failures (t)
    • Assumptions
      • Sites are homogeneous
      • Processors within a site are homogeneous
      • Processor failure = crash
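
A minimal sketch of how a concrete failure pattern could be checked against the multi-site threshold model. The function name `within_model`, the site and processor names, and the rule that a site counts as failed once more than t of its processors are down are assumptions made for illustration; they do not come from the slides.

```python
from typing import Dict, Set


def within_model(sites: Dict[str, Set[str]], failed: Set[str], fs: int, t: int) -> bool:
    """Return True if the failure pattern respects both thresholds.

    Assumption of this sketch: a site with more than t unavailable
    processors is counted as a failed site, and at most fs sites may fail.
    """
    failed_sites = 0
    for processors in sites.values():
        down = len(processors & failed)  # unavailable processors in this site
        if down > t:
            failed_sites += 1            # per-site threshold exceeded
    return failed_sites <= fs


# Hypothetical layout: three sites with three processors each.
SITES = {
    "s1": {"s1p1", "s1p2", "s1p3"},
    "s2": {"s2p1", "s2p2", "s2p3"},
    "s3": {"s3p1", "s3p2", "s3p3"},
}
```

With fs = 1 and t = 1, losing all of s1 plus one processor of s2 stays within the model, while losing two processors in each of s1 and s2 does not.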
Quorum systems
  • Quorum system Q
    • Quorum system: set of quorums
    • Quorum: set of processors
    • Intersection property: every pair of quorums in Q intersect
  • Algorithms: access a quorum
  • Example: Majority system
      • n processors
      • Every subset of size (n+1)/2 is a quorum
      • Optimal availability for IID processor failures
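
The Majority system and the intersection property can be illustrated with a short sketch. The helper names `majority_quorums` and `is_quorum_system` are hypothetical; the code enumerates every quorum explicitly, so it is only practical for small n.

```python
from itertools import combinations


def majority_quorums(processors):
    """Enumerate every subset whose size is a strict majority of the processors."""
    size = len(processors) // 2 + 1  # equals ceil((n + 1) / 2)
    return [frozenset(q) for q in combinations(processors, size)]


def is_quorum_system(quorums):
    """Check the intersection property: every pair of quorums shares a processor."""
    return all(a & b for a, b in combinations(quorums, 2))
```

For five processors this yields the ten subsets of size three, and any two of them overlap in at least one processor.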
A quorum construction: QSite
  • QSite
    • Select at least (2fs + 1) sites: S
    • Select at least (2t + 1) processors from each site in S
  • Quorum
    • Majority of sites in S
    • Majority of processors in each of these sites
  • An example (fs = 1, t = 1)

[Figure: quorums drawn over the processors of Site 1, Site 2, and Site 3]
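
The QSite construction can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `qsite_quorums` is hypothetical, the sketch picks exactly 2fs + 1 sites and 2t + 1 processors per site (the first ones in sorted order), and processor names are assumed to be unique across sites.

```python
from itertools import combinations, product


def qsite_quorums(sites, fs, t):
    """Enumerate QSite quorums for a dict mapping site name -> processor names."""
    if len(sites) < 2 * fs + 1:
        raise ValueError("need at least 2*fs + 1 sites")
    chosen = {}
    for name in sorted(sites)[: 2 * fs + 1]:      # the set S of selected sites
        processors = sorted(sites[name])
        if len(processors) < 2 * t + 1:
            raise ValueError("need at least 2*t + 1 processors per site")
        chosen[name] = processors[: 2 * t + 1]    # replicas used in this site

    quorums = []
    # A quorum takes a majority of the sites in S (fs + 1 of them) ...
    for site_group in combinations(chosen, fs + 1):
        # ... and a majority of the processors (t + 1) in each of those sites.
        per_site = [combinations(chosen[s], t + 1) for s in site_group]
        for picks in product(*per_site):
            quorums.append(frozenset(p for pick in picks for p in pick))
    return quorums
```

For fs = 1 and t = 1 with three sites of three processors, this produces 27 quorums of four processors each. Any two quorums share a site, because two majorities of sites overlap, and within that site their processor majorities overlap.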

QSite vs. Majority
  • Properties of the multi-site threshold model hold
  • Same replicas for QSite and Majority
  • Availability
    • Worst case: fs unavailable sites, and t unavailable processors in each of the remaining fs + 1 sites
    • Majority: no quorum available
      • Requires: a majority of the (2fs + 1)(2t + 1) replicas
      • Available: (fs + 1)(t + 1) replicas
    • QSite: one quorum available
    • QSite has better availability
    • Majority is not optimal
  • Quorum sizes
    • QSite produces smaller quorums
    • Reduces load
    • Increases capacity
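
The comparison can be worked through numerically. The sketch below assumes exactly 2fs + 1 sites with 2t + 1 replicas each, and a worst case in which fs sites are entirely unavailable and each remaining site has t unavailable processors; the function names are hypothetical.

```python
def replicas(fs, t):
    """Total replicas when using exactly 2fs + 1 sites with 2t + 1 processors each."""
    return (2 * fs + 1) * (2 * t + 1)


def worst_case_available(fs, t):
    """Processors still up: fs + 1 surviving sites with t + 1 live processors each."""
    return (fs + 1) * (t + 1)


def majority_quorum_size(fs, t):
    """Majority needs more than half of all replicas."""
    return replicas(fs, t) // 2 + 1


def qsite_quorum_size(fs, t):
    """QSite needs t + 1 processors in each of fs + 1 sites."""
    return (fs + 1) * (t + 1)
```

For fs = 1, t = 1 there are 9 replicas: Majority needs 5 but only 4 remain in the worst case, while a QSite quorum needs exactly those 4. For fs = 2, t = 1 the figures are 15 replicas, 8 for Majority, and 6 for QSite.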
Reducing quorum sizes and sites
  • QSite, fs = 2, t = 1:
    • 5 sites
    • 3 processors per site
    • 6 processors per quorum
  • Compromise availability

[Figure: quorums drawn over the processors of Site 1, Site 2, Site 3, and Site 4]

Site availability
  • Goals
    • Show that sites are unavailable frequently enough
    • Threshold on the number of site failures
  • BIRN - Biomedical Informatics Research Network
    • Test bed projects centered around brain imaging
    • Currently: 19 universities, 26 research groups
  • Availability
    • Monthly basis
    • Pings (BIRN-CC)
    • Storage broker logs
  • Site availability
    • Jan/04-Aug/04
    • Availability under 100%
      • On average in 5 out of the 8 months

BIRN site availability

10 sites experience at least one outage

One site under 97%

Threshold on unavailable sites
  • Worst-case scenario
    • Assumption: independent site failures
    • n most unavailable sites in each month
    • Probability that all n sites are unavailable
    • Each 1% of unavailability is approximately 7 hours
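
The worst-case calculation can be sketched as below. Under the independence assumption, the probability that the n most unavailable sites are all down at once is the product of their unavailabilities. The function names and the 30-day month are assumptions of this sketch, and any availability figures passed in would be illustrative inputs, not BIRN measurements.

```python
from math import prod

HOURS_PER_MONTH = 30 * 24  # assumes a 30-day month


def prob_all_unavailable(availabilities, n):
    """Probability that the n least available sites are down simultaneously,
    assuming site failures are independent."""
    worst = sorted((1.0 - a for a in availabilities), reverse=True)[:n]
    return prod(worst)


def downtime_hours(unavailability):
    """Convert a monthly unavailability fraction into hours."""
    return unavailability * HOURS_PER_MONTH
```

With a 30-day month, 1% of unavailability corresponds to 7.2 hours, which matches the approximation of 7 hours on the slide.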
Modeling failures in a site
  • Homogeneous set of processors
    • Independent processor failures
    • Identical probability of failure
  • Processors are repaired
    • Repair probabilities change with number of failures
  • Markov chain
  • From the model: threshold on the number of failures (t)
    • Desired degree of availability
    • Stationary probabilities
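
One way to read the steps above is as a birth-death Markov chain whose state is the number of failed processors. The sketch below is an illustration under assumptions not stated on the slides: at most one failure or repair happens per step, a failure occurs with probability (n - k) * p_fail in state k, and a repair occurs with probability p_repair[k]. The function names and all numeric values are hypothetical.

```python
def stationary(n, p_fail, p_repair):
    """Stationary distribution over k = 0..n failed processors.

    p_repair[k] is the repair probability when k processors are failed
    (p_repair[0] is unused). Uses detailed balance for a birth-death chain:
    pi[k + 1] * p_repair[k + 1] == pi[k] * (n - k) * p_fail.
    """
    weights = [1.0]
    for k in range(n):
        up = (n - k) * p_fail      # one more processor fails
        down = p_repair[k + 1]     # one processor is repaired
        weights.append(weights[-1] * up / down)
    total = sum(weights)
    return [w / total for w in weights]


def threshold(pi, max_unavailability):
    """Smallest t such that P(more than t processors failed) <= max_unavailability."""
    for t in range(len(pi)):
        if sum(pi[t + 1:]) <= max_unavailability:
            return t
    return len(pi) - 1
```

For example, with three processors, a failure probability of 0.01, and repair probabilities of 0.5, 0.7, and 0.9 for one, two, and three failed processors, the probability of more than one failure is about 0.0016, so a target of 0.01 gives t = 1.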
An example
  • Three processors per site
  • Probabilities
    • Failure probability much smaller than repair probabilities
    • Repair probabilities increase with failures

t = 1

Availability  0.001

Discussion & Future work
  • Multi-site systems: important class of distributed systems
    • Share resources
    • Collaboration among distant groups
  • Improve availability through replication
    • A useful abstraction: quorum systems
    • Algorithms built on top of quorum systems
  • Dependent failures
    • Site failures
    • Enables smaller, more highly available quorums
  • Lessons to learn
    • Considering dependent failures may improve results
    • Models are not necessarily complex
  • Future work
    • Validate model, evaluate constructions in practice, more constructions, etc.

Software and hardware faults

  • Software incompatibility, misconfiguration
  • Shared resources (e.g. storage)
  • Power failures
  • Broken pipes
  • Loss of air conditioning
  • Network problems
Introduction
  • Failures in multi-site systems
    • Processor failures
    • Site failures
      • Processors of the site become unavailable
    • A new failure model
  • Availability through replication
    • Replica placement
    • Operations on replicas: quorums
    • Replicated data (quorum update)
    • Replicated functionality (state-machine using Paxos)
    • Quorum constructions
  • Failure model in practice
    • Implementability of the model
    • Real system for site availability (BIRN)
    • Model for processor failures within a site
Software incompatibility, misconfiguration

Shared resources (e.g. storage)

Power failures

Broken pipes

Loss of air conditioning

Network problems

Introduction
  • Failures in multi-site systems
    • Processor failures
      • E.g. HW failures
    • Site failures
  • Strategies for replica placement
    • Large number of sites and nodes
  • Updates
    • Naïve approach: every non-faulty replica up to date
    • Quorum update: contact a quorum of processors
  • Distributed shared register (replicated data)
    • Multiple copies of a data set (Quorum Update)
    • E.g. Brain images (BIRN); Geological data (Geon)
  • Consensus (replicated functionality)
    • State-machine approach (Paxos algorithm)
    • E.g.: Parallel computation (TeraGrid)
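
The quorum update idea for replicated data can be illustrated with a versioned register. This is a simplified sketch, not the protocol used by BIRN or Geon: it assumes a single client at a time, no failures during an operation, and quorums that pairwise intersect. The class and function names are hypothetical.

```python
class Replica:
    """One copy of the data, tagged with a version number."""

    def __init__(self):
        self.version = 0
        self.value = None


def write(replicas, quorum, value):
    """Store value on every replica in the quorum under a fresh version."""
    # Because quorums intersect, this quorum holds the latest version so far.
    version = max(replicas[p].version for p in quorum) + 1
    for p in quorum:
        replicas[p].version = version
        replicas[p].value = value
    return version


def read(replicas, quorum):
    """Return the value with the highest version found in the quorum."""
    newest = max((replicas[p] for p in quorum), key=lambda r: r.version)
    return newest.value
```

With five replicas and majority quorums, a write to replicas {0, 1, 2} is visible to a later read from {2, 3, 4}, because the two quorums share replica 2. Replicas outside the write quorum may be stale, which is what distinguishes this from the naïve approach of updating every non-faulty replica.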
Why sites fail
  • Software incompatibility, misconfiguration
  • Shared resources (e.g. storage)
  • Power failures
  • Broken pipes
  • Loss of air conditioning
  • Network problems
Quorums in a multi-site system
  • Data replication
    • Multiple copies of data sets
  • Functionality replication
    • State-machine approach
    • Paxos (Coteries for Classic Paxos)
  • Question: How do we choose nodes to replicate?
    • Flat organization
    • Organization into sites
Quorum systems
  • Quorum system Q
    • Quorum system: set of quorums
    • Quorum: set of processors
    • Intersection property: every pair of quorums in Q intersect
    • Algorithms: access a quorum when executing some operation
  • Examples
    • Majority system:
      • n processors
      • Every subset of size (n+1)/2 is a quorum
      • Optimal availability for IID processor failures
    • Multi-colored: colors as sites

[Figure: processors and quorums]

Quorum systems (cont.)
  • In multi-site systems
    • Replicated data
      • Multiple copies of a data set (Quorum update)
      • E.g. Brain images (BIRN); Geological data (Geon)
    • Replicated functionality
      • State-machine approach (Paxos algorithm)
      • E.g.: Parallel computation (TeraGrid)
  • Quorums for multi-site systems
    • Replicating on every node is excessive
    • Quorum construction
      • Set of processors to replicate on
      • Quorums
Examples of quorum systems
  • Majority system:
    • n processors
    • Every subset of size (n+1)/2 is a quorum
  • Multi-colored: colors as sites
  • Majority has optimal availability for independent and identically distributed processor failures (IID)

[Figure: universe of processors and quorum patterns]

BIRN site availability

10 sites have at least one outage

One site under 97%

Discussion & Future work
  • Multi-site systems: important class of distributed systems
    • Share resources
    • Collaboration among distant groups
  • Improve availability through replication
    • A useful abstraction: quorum systems
    • Algorithms built on top of quorum systems
  • Dependent failures
    • Site failures
    • Enables smaller, more highly available quorums
  • Future work
    • Validate multi-site threshold model
    • Evaluate proposed constructions in practice
    • More constructions
    • More issues with dependent failures