Dynamo: Amazon’s Highly Available Key-value Store

Presented By: Devarsh Patel

CS5204 – Operating Systems


Introduction

  • Amazon’s e-commerce platform

    • Requires performance, reliability, and efficiency

    • To support continuous growth, platform needs to be highly scalable

  • Dynamo – A highly available and scalable distributed data store built for Amazon’s platform

  • Dynamo is used to manage services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance.

  • Dynamo provides a simple primary-key only interface to meet requirements of applications like best seller lists, shopping carts, customer preferences, session management, etc.

  • A completely decentralized system with minimal need for manual administration.


System Assumptions and Requirements

  • Simple key-value interface

    • Highly available

    • Efficient in resource usage

    • Simple scale out scheme to address growth in data set size or request rates

  • Each service that uses Dynamo runs its own Dynamo instances

  • Used only by Amazon’s internal services

    • Non-hostile environment

    • No security requirements like authentication and authorization

  • Targets applications that operate with weaker consistency in favor of high availability

  • Service level agreements (SLA)

    • Measured at the 99.9th percentile of the distribution

    • Key factors: service latency at a given request rate

    • Example: response time of 300ms for 99.9% of requests at peak client load of 500 requests per second

    • State management is the main component of a service’s SLAs
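A minimal sketch of why the SLA is stated at the 99.9th percentile rather than as an average, using made-up latency samples. The function name `meets_sla` and the sample data are hypothetical; the 300 ms limit is the figure from the example above.

```python
from statistics import mean

def meets_sla(latencies_ms, limit_ms=300, permille=999):
    """True if at least permille/1000 of the requests finished within limit_ms.

    Integer arithmetic avoids floating-point rounding at the 99.9% boundary.
    """
    within = sum(1 for t in latencies_ms if t <= limit_ms)
    return within * 1000 >= permille * len(latencies_ms)

# Hypothetical sample: 998 fast requests and 2 very slow ones.
samples = [20] * 998 + [900] * 2

print(mean(samples) < 300)   # the average looks healthy
print(meets_sla(samples))    # but only 99.8% of requests met the limit
```

The average hides the two slow requests, while the percentile-based check exposes them.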


Design Considerations

  • Designed to be an eventually consistent data store

  • “Always writeable” data store

  • Consistency vs. availability

    • To achieve a given level of consistency, replication algorithms are forced to trade off the availability of the data under certain failure scenarios.

    • To improve availability,

      • Dynamo uses weaker form of consistency (eventual consistency)

      • Allows optimistic replication techniques

        • Can lead to conflicting changes which must be detected and resolved

  • Conflict resolution is performed during reads, by either the data store or the application

  • Other key principles

    • Incremental scalability – Scale out one storage node at a time

    • Symmetry – Every node has the same set of responsibilities

    • Decentralization – Favor decentralized peer-to-peer techniques

    • Heterogeneity – Work distribution must be proportional to the capabilities of the individual servers


System Architecture

  • Core distributed system techniques used in Dynamo:

    • Partitioning, Replication, Versioning, Membership, Failure handling and Scaling


System Interface

Two operations: get() and put()

get(key) – Locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context

put(key, context, object) - Determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk

context – encodes system metadata about the object

An MD5 hash of the key generates a 128-bit identifier, which is used to determine the storage nodes responsible for the key
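The hashing step can be sketched with Python's standard `hashlib` module. The helper name `key_to_position` and the example key are hypothetical, chosen only for illustration.

```python
import hashlib

def key_to_position(key: bytes) -> int:
    """Map a key to a 128-bit integer using MD5 (illustrative helper)."""
    digest = hashlib.md5(key).digest()   # MD5 always yields 16 bytes = 128 bits
    return int.from_bytes(digest, "big")

position = key_to_position(b"cart:12345")
print(0 <= position < 2**128)  # the identifier always fits in 128 bits
```

The same key always maps to the same identifier, which is what lets any node work out where a key belongs without a central directory.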


Partitioning Algorithm

  • Consistent Hashing

  • Output range is a fixed circular space or “ring”

  • Advantage

    • Departure or arrival of a node only affects immediate neighbors

  • Issues

    • Non-uniform data and load distribution

  • Dynamo uses a variant of consistent hashing based on the concept of “virtual nodes”
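A minimal sketch of consistent hashing with virtual nodes, not Dynamo's actual implementation. The class name `Ring`, the `node#i` token labels, and the default of 8 virtual nodes per host are assumptions made for this example.

```python
import bisect
import hashlib

def ring_position(label: str) -> int:
    """Position on the ring: the 128-bit MD5 hash of the label."""
    return int.from_bytes(hashlib.md5(label.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node is given several tokens (virtual nodes) on the ring.
        self.tokens = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.positions = [pos for pos, _ in self.tokens]

    def owner(self, key: str) -> str:
        # The first token clockwise from the key's position owns the key;
        # the modulo wraps around the end of the ring.
        idx = bisect.bisect_right(self.positions, ring_position(key))
        return self.tokens[idx % len(self.tokens)][1]

keys = [f"key-{i}" for i in range(1000)]
before = Ring(["A", "B", "C"])
after = Ring(["A", "B", "C", "D"])
moved = [k for k in keys if before.owner(k) != after.owner(k)]
# Every key that moved now belongs to the new node D; nothing is shuffled
# between the existing nodes.
print(all(after.owner(k) == "D" for k in moved))
```

Spreading each host over several tokens is what evens out the non-uniform load that a single random position per node would produce.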


Replication

  • Replicate data on multiple hosts

    • Reason – To achieve high availability and durability

  • Each data item is replicated at N hosts, where N is a parameter configured “per-instance”

  • Preference list – List of nodes responsible for storing a particular key

    Figure 1: Partitioning and replication of keys in Dynamo ring.
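A rough sketch of how a preference list could be built by walking the ring clockwise from the key's position. The token table `TOKENS` and its integer positions are hypothetical, and skipping tokens that belong to an already chosen host is an assumption made so that the list contains only distinct physical nodes.

```python
import bisect

# Hypothetical ring: (token position, physical node), sorted by position.
TOKENS = [(10, "A"), (20, "B"), (30, "A"), (40, "C"), (50, "B"), (60, "D")]

def preference_list(key_pos: int, n: int) -> list[str]:
    """Walk clockwise from the key and collect n distinct physical nodes."""
    positions = [pos for pos, _ in TOKENS]
    start = bisect.bisect_right(positions, key_pos)
    result = []
    for step in range(len(TOKENS)):
        node = TOKENS[(start + step) % len(TOKENS)][1]
        if node not in result:      # skip virtual nodes of a host already chosen
            result.append(node)
        if len(result) == n:
            break
    return result

print(preference_list(25, 3))  # ['A', 'C', 'B']
print(preference_list(55, 3))  # ['D', 'A', 'B'] (wraps around the ring)
```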


Data Versioning

  • Dynamo treats the result of each modification as a new and immutable version of the data

  • Allows for multiple versions of an object to be present in the system at the same time.

  • Problem

    • Version branching due to failures combined with concurrent updates, resulting in conflicting versions of object

    • Updates in the presence of network partitions and node failures result in an object having distinct version sub-histories


Data Versioning

Uses vector clocks – A list of (node, counter) pairs

Vector clocks determine whether two versions of an object are on parallel branches or have a causal ordering

Conflict requires reconciliation

Conflicting versions are passed to the application as the output of the get operation

Application resolves conflicts and puts a new (consistent) version
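A minimal sketch of vector clock comparison, with each clock represented as a dict from node name to counter. The function names and the node names `Sx`, `Sy`, `Sz` are illustrative assumptions, not Dynamo's API.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every update recorded in clock b."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"      # causal ordering: b can be discarded
    if descends(b, a):
        return "b is newer"
    return "conflict"            # parallel branches: needs reconciliation

def reconcile(clocks, coordinator: str) -> dict:
    """Clock for a reconciled write: element-wise max, then bump the coordinator."""
    merged = {}
    for clock in clocks:
        for node, counter in clock.items():
            merged[node] = max(merged.get(node, 0), counter)
    merged[coordinator] = merged.get(coordinator, 0) + 1
    return merged

d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
print(compare(d3, d2))           # a is newer
print(compare(d3, d4))           # conflict
print(reconcile([d3, d4], "Sx")) # {'Sx': 3, 'Sy': 1, 'Sz': 1}
```

The reconciled clock descends from both conflicting versions, so a later read can discard them in favor of the merged one.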


Data Versioning

Figure: Version evolution of an object over time


Execution of get/put operations

  • Two strategies to select a node:

    • Request through a load balancer

    • Request directly to the coordinator nodes

  • Coordinator – Node handling a read or write operation

    • First among the top N nodes in the preference list

  • Quorum system

    • Two key configurable values: R and W

    • R - minimum number of nodes that must participate in a successful read operation

    • W - minimum number of nodes that must participate in a successful write operation

    • A quorum-like system requires R + W > N

    • (N, R, W) can be chosen to achieve desired tradeoff

    • R and W are usually configured to be less than N, to provide better latency.

  • Write is successful – If at least W-1 other nodes respond to the put() request (the coordinator's own local write counts toward W)

  • Read is successful – If R nodes respond to the get() request
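The condition R + W > N can be checked by brute force: it is exactly the condition under which every possible read set shares at least one node with every possible write set. The sketch below is illustrative only, and the function name `quorums_always_overlap` is hypothetical.

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every set of r readers intersects every set of w writers."""
    nodes = range(n)
    return all(
        set(readers) & set(writers)          # an empty intersection is falsy
        for readers in combinations(nodes, r)
        for writers in combinations(nodes, w)
    )

print(quorums_always_overlap(3, 2, 2))  # True:  2 + 2 > 3
print(quorums_always_overlap(3, 1, 1))  # False: 1 + 1 <= 3
```

With an overlap guaranteed, at least one node contacted by a read has seen the most recent successful write; lowering R or W below that threshold trades this guarantee for latency.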


Hinted Handoff

  • “Sloppy quorum”

    • All read and write operations are performed on the first N healthy nodes in the preference list

    • Coordinator is first in this group

    • A replica sent to a substitute node carries a “hint” in its metadata indicating the original node that should hold the replica

    • Hinted replicas are stored by the available node and forwarded when the original node recovers.

  • Ensures read and write operations do not fail because of temporary node or network failures
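A simplified sketch of how write targets and hints could be chosen under a sloppy quorum. The ring order `PREFERENCE`, the value `N = 3`, and the function name `pick_targets` are assumptions for illustration, not Dynamo's interface.

```python
PREFERENCE = ["A", "B", "C", "D", "E"]  # hypothetical ring order for one key
N = 3

def pick_targets(healthy: set) -> list:
    """Return (node, hint) pairs for the first N healthy nodes.

    hint is None for an intended replica; for a substitute it names the
    unavailable node on whose behalf the replica is being held.
    """
    intended = PREFERENCE[:N]
    targets = [node for node in PREFERENCE if node in healthy][:N]
    missing = [node for node in intended if node not in healthy]
    result = []
    for node in targets:
        if node in intended:
            result.append((node, None))
        else:
            result.append((node, missing.pop(0)))
    return result

# Node A is down, so D stands in for it and remembers that the replica
# should be handed back to A once A recovers.
print(pick_targets({"B", "C", "D", "E"}))
# [('B', None), ('C', None), ('D', 'A')]
```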


Replica synchronization

  • Merkle trees are used to detect inconsistencies between replicas faster and to minimize the amount of data transferred.

  • Separate tree maintained by each node for each key range

  • Advantage:

    • Each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set

  • Disadvantage:

    • Adds overhead to maintain Merkle trees when a node joins or leaves the system
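A small sketch of Merkle tree comparison over one key range. SHA-256 as the hash function, a power-of-two number of leaves, and the helper names `build` and `diff` are assumptions of this example, not details taken from the slides.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(values):
    """Return the tree as a list of levels, leaves first and root last.

    Assumes len(values) is a power of two to keep the sketch short.
    """
    level = [h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(a, b):
    """Indices of differing leaves, descending only into mismatching subtrees."""
    if a[-1] == b[-1]:
        return []                      # equal roots: replicas are in sync
    suspects = [0]
    for depth in range(len(a) - 2, -1, -1):
        suspects = [
            child
            for parent in suspects
            for child in (2 * parent, 2 * parent + 1)
            if a[depth][child] != b[depth][child]
        ]
    return suspects

replica_1 = build([b"v1", b"v2", b"v3", b"v4"])
replica_2 = build([b"v1", b"v2", b"vX", b"v4"])
print(diff(replica_1, replica_2))  # [2]
```

Only the hashes along the mismatching path are compared, so two replicas that agree can confirm it by exchanging a single root hash.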


Membership and Failure Detection

  • Ring Membership

    • Explicit mechanism to add a node to or remove a node from a ring

    • Done by administrator using command line tool or browser

    • Gossip-based protocol propagates membership, partitioning, and placement information via periodic exchanges

    • Each node eventually learns the key ranges handled by its peers and can forward requests to them

  • External Discovery

    • To prevent logical partitions, some nodes play the role of seeds

    • “Seed” nodes discovered via external mechanism are known to all nodes

  • Failure Detection

    • Node failures are detected by lack of responsiveness, and recovery is detected by periodic retry
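A toy sketch of one gossip exchange, in which two nodes reconcile their membership views. The per-node version numbers and the `(version, status)` tuples are assumptions introduced for this example; the slides do not describe the actual data format.

```python
def merge_views(mine: dict, theirs: dict) -> dict:
    """Keep, for every node, the entry with the higher version number."""
    merged = dict(mine)
    for node, (version, status) in theirs.items():
        if node not in merged or version > merged[node][0]:
            merged[node] = (version, status)
    return merged

# Hypothetical views: node -> (version, status)
view_a = {"A": (3, "member"), "B": (1, "member")}
view_b = {"B": (2, "removed"), "C": (1, "member")}

# After one exchange both peers hold the same, more complete view.
view_a = view_b = merge_views(view_a, view_b)
print(view_a)  # {'A': (3, 'member'), 'B': (2, 'removed'), 'C': (1, 'member')}
```

Repeating such pairwise exchanges between randomly chosen peers is what lets every node eventually converge on the same membership information.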


Experiences & Lessons Learned

  • Main patterns in which Dynamo is used:

    • Business logic specific reconciliation

    • Timestamp based reconciliation

    • High performance read engine

  • Client applications can tune values of N, R and W

  • Common (N,R,W) configuration used by several instances of Dynamo is (3,2,2)
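The first two usage patterns can be illustrated with two small reconciliation functions. Both are sketches under assumed data shapes: versions as `(timestamp, value)` pairs for the timestamp-based case, and lists of item names for a shopping-cart style merge. The function names are hypothetical.

```python
def last_write_wins(versions):
    """Timestamp-based reconciliation: keep the version with the newest timestamp."""
    return max(versions, key=lambda version: version[0])

def merge_carts(versions):
    """Business-logic reconciliation: union of the items in all conflicting versions."""
    merged = set()
    for items in versions:
        merged |= set(items)
    return sorted(merged)

print(last_write_wins([(10, "theme=dark"), (12, "theme=light")]))
# (12, 'theme=light')
print(merge_carts([["book", "pen"], ["book", "lamp"]]))
# ['book', 'lamp', 'pen']
```

The union merge never loses an added item, which fits an "always writeable" store, while last-write-wins is simpler but silently drops the older version.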


Experiences & Lessons Learned

Balancing performance and Durability


Experiences & Lessons Learned

Ensuring Uniform Load Distribution


Partitioning & Placement Strategies

Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the preference list for the key k1 on the consistent hashing ring (N=3). The shaded area indicates the key range for which nodes A, B, and C form the preference list. Dark arrows indicate the token locations for various nodes.


Partitioning & Placement Strategies

  • Strategy 1

    • T random tokens per node and partition by token value:

      • A new node needs to steal its key ranges from other nodes

      • Bootstrapping of new node is lengthy

      • Other nodes process scanning/transmission of key ranges for new node as background activities

      • Disadvantages:

        • Numerous nodes have to adjust their Merkle trees when a new node joins or leaves the system

        • Archiving the entire key space is highly inefficient


Partitioning & Placement Strategies

  • Strategy 2

    • T random tokens per node and equal sized partitions:

      • Divided into Q equally sized partitions

      • Q >> N and Q >> S*T, where S is the number of nodes in the system

      • Advantages:

        • Decoupling of partition and partition placement

        • Allows changing of placement scheme at run-time

  • Strategy 3

    • Q/S tokens per node, equal sized partitions:

      • Decoupling of partition and placement

      • Advantages:

        • Faster bootstrapping/recovery

        • Ease of archival
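A rough sketch of Strategy 3: the hash space is cut into Q equal partitions and each of the S nodes holds Q/S of them. The values of `Q` and `NODES`, the helper names, and in particular the round-robin assignment are assumptions made for illustration; the slides do not say how partitions are handed out.

```python
from collections import Counter

HASH_SPACE = 2**128          # size of the MD5 output range
Q = 12                       # number of equal-sized partitions (hypothetical)
NODES = ["A", "B", "C"]      # S = 3 nodes, so Q/S = 4 partitions each

def partition_of(key_pos: int, q: int = Q) -> int:
    """Index of the fixed-size partition that contains a ring position."""
    return key_pos * q // HASH_SPACE

def assign(q: int, nodes: list) -> dict:
    """Round-robin placement (an assumption): partition index -> node."""
    return {p: nodes[p % len(nodes)] for p in range(q)}

placement = assign(Q, NODES)
print(Counter(placement.values()))     # every node holds Q/S = 4 partitions
print(partition_of(0), partition_of(HASH_SPACE - 1))  # 0 11
```

Because partition boundaries never move, a joining node can receive whole partitions, which is consistent with the faster bootstrapping and easier archival listed above.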


Partitioning & Placement Strategies

Strategies have different tuning parameters

Fair way to compare strategies is to evaluate the skew in their load distributions for a fixed amount of space to maintain membership information

Strategy 3 achieves best load balancing efficiency


Client-driven or Server-driven Coordination

Any node can coordinate read requests; write requests are handled by the coordinator, a node in the key's preference list

The state machine for coordination can run in a load-balancing server or be incorporated into the client

Client-driven coordination has lower latency because it avoids extra network hop (redirection)


Thank You
