StarFish: highly-available block storage

Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Özden, and Elizabeth Shriver

Computing Sciences Research

Lucent Technologies, Bell Laboratories

StarFish
  • Geographically distributed on-the-fly replication.
  • Dynamic recovery from element failures
  • Works with any application / OS / file-system – looks like a SCSI disk
  • Built on commodity hardware
What StarFish is Not
  • Distributed file system
  • Solution to the multiple writer problem
  • iSCSI
Status
  • Implemented on FreeBSD 4.x.
  • Live system working in lab since February 28th, 2001.
  • No production data was ever lost due to software error. However, there were operator errors.
  • Source available to the public at http://www.bell-labs.com/topic/swdist/

Outline
  • How Does it Work?
  • When Things Go Wrong
  • What’s a Good Configuration?
  • Performance Measurements
StarFish Architecture

[Diagram: a Host attached over SCSI to StarFish.]

StarFish Architecture

[Diagram: the Host attaches over SCSI to the HE (host element); the HE connects over the network to two SEs (storage elements), each attached over SCSI to a RAID.]

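How it Works – Writes

[Diagram: the Host attached over SCSI to the HE, which propagates writes to three SEs, each attached over SCSI to a RAID.]
  • SCSI Write: the host sends a write to the HE.
  • HE assigns a sequence number, then propagates the write to the SEs.
  • HE waits for a quorum of acks.
  • SCSI Ack: the HE acknowledges the write to the host.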
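
The write path above can be sketched in a few lines of Python. This is only an illustration: the class and method names (`StorageElement`, `HostElement`, `scsi_write`) are hypothetical, the calls are synchronous for brevity, and nothing here is taken from the StarFish source code.

```python
from dataclasses import dataclass, field


@dataclass
class StorageElement:
    """Hypothetical stand-in for an SE; `up` marks whether it answers."""
    name: str
    up: bool = True
    log: list = field(default_factory=list)

    def write(self, seq: int, block: int, data: bytes) -> bool:
        if not self.up:
            return False              # a failed SE sends no ack
        self.log.append((seq, block, data))
        return True                   # ack: the write is committed


class HostElement:
    """Hypothetical HE: numbers each write and counts acks against a quorum."""

    def __init__(self, ses, quorum):
        self.ses = ses
        self.quorum = quorum
        self.next_seq = 0

    def scsi_write(self, block: int, data: bytes) -> bool:
        seq = self.next_seq           # HE assigns the sequence number
        self.next_seq += 1
        # Propagate to every SE and count the acks that come back.
        acks = sum(se.write(seq, block, data) for se in self.ses)
        # True means the HE would send the SCSI ack to the host.
        return acks >= self.quorum
```

With three SEs and a quorum of 2, `scsi_write` still succeeds when one SE is down and fails once a second SE stops answering.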

How it Works – Reads
  • Depends on “Read Policy”
    • SendAll – Read is sent to all active SEs, and the HE responds to the host on the first reply.
    • SendOne – Read is sent to the lowest-latency SE. The HE retries if there is no response.
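
A minimal sketch of the two read policies, under stated assumptions: `reply_ms` is a hypothetical map from SE name to reply time in milliseconds (`None` for no response), `ranking` is a hypothetical list of SEs ordered by estimated latency, and the slide does not say where SendOne retries, so retrying at the next SE in the ranking is an assumption.

```python
def send_all(reply_ms):
    """SendAll: query every active SE; the first reply is returned to the host.

    Returns (serving SE or None, number of read requests sent).
    """
    answered = {se: ms for se, ms in reply_ms.items() if ms is not None}
    if not answered:
        return None, len(reply_ms)
    return min(answered, key=answered.get), len(reply_ms)


def send_one(reply_ms, ranking):
    """SendOne: query the lowest-latency SE first.

    Assumption: on no response the HE retries at the next SE in `ranking`.
    Returns (serving SE or None, number of read requests sent).
    """
    for attempts, se in enumerate(ranking, start=1):
        if reply_ms.get(se) is not None:
            return se, attempts
    return None, len(ranking)
```

The trade-off this makes visible: SendAll always costs one request per SE, while SendOne costs a single request in the common case and extra round trips only when the preferred SE is silent.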
Outline
  • How Does it Work?
  • When Things Go Wrong
  • What’s a Good Configuration?
  • Performance Measurements
When Things Go Wrong – SE Failures
  • If an out-of-date SE restarts, one of three types of recovery will take place:
    • Quick (HE)
    • Replay (SE)
    • Full (SE)
When Things Go Wrong – HE Failures
  • Manual switchover to a backup HE via an SNMP command to the SEs. The SEs then reconnect to the backup HE.
  • SCSI connection is a single point of failure.
    • An automatic redundant-host architecture, using a controllable SCSI switch, is partially implemented.
When Things Slow Down – Throttling
  • One SE will always be slower than the others.
  • When queues build up, the HE will delay SCSI processing to allow SEs to keep up.
  • The HE will make sure that a quorum of SEs can keep up. Extra SEs that are too slow, even after some throttling, are dropped.
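
One way to picture the throttling rule is the decision function below. It is a sketch under assumptions, not the StarFish policy: the queue-depth threshold `high_water`, the round counter, and the rule that the `quorum` SEs with the shortest queues are never dropped are all hypothetical choices made for illustration.

```python
def throttle_step(queue_depths, quorum, high_water, rounds_throttled, max_rounds):
    """Decide whether the HE should delay SCSI processing and which SEs to drop.

    queue_depths: hypothetical map of SE name -> number of queued writes.
    Returns (delay, drop): a bool and a list of SE names.
    """
    by_depth = sorted(queue_depths, key=queue_depths.get)
    # Assumption: the `quorum` SEs with the shortest queues are never dropped.
    protected, extras = by_depth[:quorum], by_depth[quorum:]
    quorum_lagging = any(queue_depths[se] > high_water for se in protected)
    lagging_extras = [se for se in extras if queue_depths[se] > high_water]
    # Extra SEs are dropped only after throttling has been tried for a while.
    drop = lagging_extras if rounds_throttled >= max_rounds else []
    # Keep delaying while the quorum lags, or while a lagging extra SE
    # is still being given a chance to catch up.
    delay = quorum_lagging or (bool(lagging_extras) and not drop)
    return delay, drop
```

For example, with queue depths `{"a": 1, "b": 3, "c": 50}`, a quorum of 2 and a threshold of 10, the function asks for a delay at first and drops `"c"` once `rounds_throttled` reaches `max_rounds`.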
Outline
  • How Does it Work?
  • When Things Go Wrong
  • What’s a Good Configuration?
  • Performance Measurements
Availability Definitions
  • Read Availability – an up-to-date version of the data is available for reading.
  • Write Availability – the system is available to accept new writes.
Choosing the Number of SEs

[Plot: Write Availability versus SE Availability for Q=1.]
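
How write availability depends on SE availability can be estimated with a simple binomial model. The sketch below assumes that SEs fail independently, that every SE has the same availability `p`, and that the HE and the network never fail; these assumptions are ours, and the function name is hypothetical.

```python
from math import comb


def write_availability(n, q, p):
    """Probability that at least q of n SEs are up, each up with probability p.

    Assumes independent SE failures and ignores HE and network failures.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(q, n + 1))
```

Under this model, with N=3 and an SE availability of 0.9, write availability is 0.999 for Q=1 and 0.972 for Q=2, which shows the price paid in write availability for a larger quorum.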

Outline
  • How Does it Work?
  • When Things Go Wrong
  • What’s a Good Configuration?
  • Performance Measurements
Measurements – testbed
  • Local SE – different machine connected to the same GbE switch – no artificial delay
  • Near SE – increased latency simulated through dummynet
  • Far SE – increased latency simulated through dummynet, with and without bandwidth restrictions
Experimental Cases
  • “Dark Fiber”
    • Dedicated bandwidth
    • 1 ms of delay corresponds to about 200 km, e.g. the distance to a neighboring city
  • “Internet”
    • TCP/IP over fractional OC-3
    • 1/3 of an OC-3 link is 51 Mbps
    • 65ms latency – reflects latency on the AT&T backbone between NY and LA.
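
The two figures quoted above follow from simple arithmetic, shown here as a small Python check. The constants are standard values rather than numbers from the slides: light travels at roughly 200,000 km/s in optical fiber, and an OC-3 line runs at 155.52 Mbps. The function names are hypothetical.

```python
FIBER_KM_PER_S = 200_000   # approximate signal speed in fiber, about 2/3 of c
OC3_MBPS = 155.52          # OC-3 line rate


def one_way_delay_ms(distance_km):
    """Propagation delay over fiber, ignoring switching and queueing."""
    return distance_km / FIBER_KM_PER_S * 1000


def fractional_oc3_mbps(fraction):
    """Bandwidth of a fraction of an OC-3 link."""
    return OC3_MBPS * fraction
```

`one_way_delay_ms(200)` gives about 1 ms, and `fractional_oc3_mbps(1 / 3)` gives about 51.84 Mbps, in line with the 51 Mbps quoted for the "Internet" case.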
Related Work
  • High end: EMC SRDF
  • Mid range: NetApp SnapMirror
  • DataCore SANsymphony
  • Petal
Concluding Points
  • N=3, Q=2 is Good.
  • Replicas not in the quorum can have a high-latency / low-bandwidth connection.
  • Recovery activity does not significantly degrade performance.
Making a Good Configuration – definitions
  • Consistency – Writes, once acknowledged, are not “forgotten”.
  • Write availability – The system is able to accept and acknowledge writes.
  • Read Availability – The system is able to respond to read requests.
Choosing a Quorum Size – Consistency
  • If Q > N/2, StarFish can only lose data if the HE and Q SEs fail simultaneously.
    • Even if the Q up-to-date SEs fail after the HE has acked the write, as long as the HE is up, it will ensure that all SEs will get the most recent writes.
  • If Q <= N/2, on restart, the HE might not have access to current data even if Q SEs are available.
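
The reason behind the Q > N/2 condition is that any two sets of Q SEs must share at least one member, so whichever Q SEs are available after a restart include one that acknowledged the latest write. The brute-force check below illustrates this; the function name is hypothetical and the code is a sketch, not part of StarFish.

```python
from itertools import combinations


def quorums_always_overlap(n, q):
    """True if every pair of q-sized subsets of n SEs shares at least one SE."""
    ses = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(ses, q)
        for b in combinations(ses, q)
    )
```

For N=3 the check passes with Q=2 and fails with Q=1; for N=4 it fails with Q=2, since two disjoint pairs exist.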
Choosing the Number of SEs

For a highly available system, where Q = floor(N/2) + 1

How it Works – Writes
  • Single-owner (the HE) access semantics
  • Writes are assigned sequence numbers by the HE.
  • Each SE applies them in order. Any gap in the sequence numbers necessitates an SE recovery.
  • No reordering/coalescing/optimizations
  • HE acks the write to the host once a quorum of SEs report it’s been committed.
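
The in-order rule on the SE side can be sketched as follows. The class name and fields are hypothetical, and treating any out-of-sequence write (including a duplicate) as a gap is a simplifying assumption of this sketch.

```python
class StorageElementLog:
    """Hypothetical SE state: applies writes strictly in sequence order."""

    def __init__(self):
        self.last_seq = -1            # sequence number of the last applied write
        self.blocks = {}              # block number -> data
        self.needs_recovery = False

    def apply(self, seq, block, data):
        """Apply a write if it is the next in sequence; otherwise flag recovery."""
        if self.needs_recovery or seq != self.last_seq + 1:
            # A gap was seen: stop applying writes until recovery completes.
            self.needs_recovery = True
            return False
        self.blocks[block] = data
        self.last_seq = seq
        return True                   # committed; the SE would ack to the HE
```

An SE that receives write 0 and then write 2 applies the first, rejects the second, and refuses further writes until it has recovered.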