Dynamo: Amazon’s Highly Available Key-value Store
Giuseppe DeCandia et al. [Amazon.com]

Jagrut Sharma
jagrutsh@usc.edu
CSCI-572 (Prof. Chris Mattmann)
20-Jul-2010

Outline of Talk
  • Motivation (1)
  • Contribution (1)
  • Context (1)
  • Background (3)
  • Related Work (2)
  • System Architecture (7)
  • Implementation (1)
  • Experiences, Results & Lessons Learnt (4)
  • Conclusion (1)
  • Pros (1)
  • Cons (1)
  • Questions (1)

Motivation
  • Tens of millions of customers
  • Tens of thousands of servers
  • Globally distributed data centers
  • 24 * 7 * 365 operations
  • Cost of an outage: financial consequences, loss of customer trust

Contribution
  • Evaluation of how different techniques can be combined to provide a highly-available system
  • Demonstration of how an eventually consistent storage system (like Dynamo) can be used in a production environment with demanding applications
  • Provision of tuning methods to meet requirements of production systems with very strict performance demands

Context
  • Amazon’s e-commerce platform
    • Highly de-centralized
    • Loosely coupled
    • Service-oriented architecture
    • Hundreds of services
    • Millions of components
    • Failure is a way of life
  • Critical requirement
    • Always available storage
  • Storage techniques
    • S3 (Amazon Simple Storage Service)
    • Dynamo
      • Highly available and scalable distributed data store for Amazon’s platform
      • Provides primary-key only interface for selected applications (e.g. shopping cart)
      • Combined multiple, high-performance techniques & algorithms
      • Excellent performance in real-world scenarios
Background (1 of 3)
  • E-commerce platform services: Stateless & Stateful
  • Relational databases are overkill for stateful lookups by primary key
  • Dynamo:
    • Simple key/value interface
    • Highly available
    • Efficient in resource usage
    • Scalable
  • Each service that uses Dynamo runs its own Dynamo instances
  • Dynamo’s target applications:
    • Store small-sized objects (<1 MB)
    • Operate with weaker consistency if this gives high availability
  • Simple read-write to a data item uniquely identified by a key
  • No query operations span multiple data items
  • Services use Dynamo to give priority to latency & throughput
  • Amazon’s SLAs are expressed and measured at the 99.9th percentile of the distribution (in contrast to common industry approach of using average, median and expected variance)
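
A minimal sketch of why an SLA measured at the 99.9th percentile behaves differently from one measured on the average. The `percentile` helper (nearest-rank method) and the latency samples are hypothetical, chosen only for illustration; they are not data from the paper.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in ms: 9,980 fast requests and 20 slow ones.
latencies = [10] * 9980 + [500] * 20

mean = sum(latencies) / len(latencies)   # dominated by the fast majority
p999 = percentile(latencies, 99.9)       # exposes the slow tail
print(f"mean = {mean:.2f} ms, 99.9th percentile = {p999} ms")
```

In this made-up sample the mean stays close to the fast requests while the 99.9th percentile lands on the slow tail, which is the behaviour a percentile-based SLA is meant to capture.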
Background (2 of 3)

Assumptions About Dynamo

  • Used only by Amazon’s internal services
  • Operation environment is non-hostile
  • There are no security-related requirements (e.g. authentication, authorization)
  • Each service uses its distinct instance of Dynamo
  • Dynamo’s initial design targets a scale of up to hundreds of storage hosts
Background (3 of 3)
  • Dynamo Design Considerations
  • How to balance replication against consistency?
    • Eventually consistent data store: consistency is relaxed in favor of availability
  • When to resolve update conflicts?
    • On reads, so that the data store stays “always writeable”
  • Who performs conflict resolution?
    • Either the data store or the application
  • Incremental scalability at node-level
  • Symmetry among nodes
  • Favors decentralization
  • Capable of exploiting infrastructure heterogeneity

SOA of Amazon’s platform

Related Work (1 of 2)
  • Peer to Peer Systems
    • Tackle problems of data storage and distribution
    • Only support flat namespaces
    • Unstructured P2P: Freenet, Gnutella
      • Search query floods network
    • Structured P2P systems: Pastry, Chord, Oceanstore, PAST
      • Employ globally consistent query routing protocol
      • Bounded number of hops
      • Maintain local routing tables
      • Provide rich storage services with conflict resolution
  • Distributed File Systems and Databases
    • Support both flat & hierarchical namespaces
    • Ficus, Coda: high availability at expense of consistency
    • Farsite: high availability and scalability using replication
    • Google File System: master server, chunkservers
    • Bayou: Distributed RDBMS, disconnected operations
    • Antiquity: Wide-area distributed storage system
    • BigTable: Distributed storage system for structured data
Related Work (2 of 2)

Dynamo Vs Other Systems

  • Targeted mainly at apps that need an “always writeable” data store
  • Built for an infrastructure within a single administrative domain where all nodes are assumed to be trusted
  • Applications using Dynamo do not require support for hierarchical namespaces or complex relational schema
  • Built for latency sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds.
  • Avoids routing requests through multiple nodes. Hence, similar to a zero-hop Distributed Hash Table.
System Architecture (1 of 7)

List Of Techniques Used By Dynamo & Their Advantages

System Architecture (2 of 7)

System Interface

  • get (key)
    • locates the object replicas associated with key in the storage system
    • Returns a single object/list of objects with conflicting versions + context
  • put(key, context, object)
    • Determines where the replicas of the object should be placed based on the associated key
    • Writes replicas to disk
  • context
    • encodes system metadata about object
    • includes additional information (e.g. object version)
  • key, object: considered as an opaque array of bytes
  • MD5 hash (key) -> 128-bit identifier, used to determine the storage nodes that are responsible for serving the key
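
The key-to-identifier step above can be sketched in a few lines. The function name `key_to_id` and the sample key are assumptions made for illustration; only the use of MD5 and the 128-bit result come from the slide.

```python
import hashlib

def key_to_id(key: bytes) -> int:
    """Map an opaque byte-array key to a 128-bit integer identifier via MD5."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")

# Hypothetical key; Dynamo itself treats keys as opaque bytes.
ident = key_to_id(b"cart:12345")
print(f"{ident:032x}")  # 32 hex digits = 128 bits
```

The resulting integer can then be treated as a position in the hash space when deciding which storage nodes serve the key.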
System Architecture (3 of 7)

Partitioning Algorithm

  • Provides mechanism to dynamically partition the data over the set of nodes (i.e. storage hosts)
  • Uses variant of consistent hashing (output range of a hash function is treated as a fixed circular space or ‘ring’ - largest hash value wraps around to the smallest hash value)
    • Advantage: departure or arrival of a node only affects its immediate neighbors
    • Limitation 1: leads to non-uniform data and load distribution
    • Limitation 2: oblivious to heterogeneity in the performance of nodes
  • (single node) -> multiple points in the ring i.e. virtual nodes
  • Advantages of virtual nodes:
    • Graceful handling of failure of a node
    • Easy accommodation of a new node
    • Heterogeneity in physical infrastructure can be exploited
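
A minimal sketch of consistent hashing with virtual nodes, assuming an in-memory ring. The `Ring` class, the `node#i` token naming, and the node and key names are all hypothetical; Dynamo's actual token assignment is not described on this slide.

```python
import bisect
import hashlib

def h(text: str) -> int:
    """Position on the ring: MD5 of the text as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode()).digest(), "big")

class Ring:
    def __init__(self):
        self.tokens = []   # sorted token positions on the ring
        self.owner = {}    # token position -> physical node

    def add_node(self, node: str, vnodes: int) -> None:
        # One physical node is placed at several points (virtual nodes).
        for i in range(vnodes):
            token = h(f"{node}#{i}")
            bisect.insort(self.tokens, token)
            self.owner[token] = node

    def lookup(self, key: str) -> str:
        # First token clockwise from the key; the modulo wraps past the largest token.
        i = bisect.bisect_left(self.tokens, h(key)) % len(self.tokens)
        return self.owner[self.tokens[i]]

ring = Ring()
ring.add_node("A", vnodes=8)
ring.add_node("B", vnodes=8)
ring.add_node("C", vnodes=16)   # a more capable host can take more virtual nodes

keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("D", vnodes=8)
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node D")
```

Adding a node only reassigns keys that now fall just before one of the new node's tokens, so every key that moves goes to the new node and the rest stay where they were.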
System Architecture (4 of 7)


Replication

  • Each data item is replicated at N hosts
  • N is configured per-instance
  • Each node is responsible for the region of the ring between it and its Nth predecessor
  • Preference list: List of nodes responsible for storing a particular key
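
A sketch of how a preference list could be derived by walking the ring clockwise and keeping the first N distinct physical nodes. The token positions, node names, and the `preference_list` helper are hypothetical and exist only to make the walk concrete.

```python
import bisect

# Hypothetical ring: token position -> physical node (two virtual nodes per host).
TOKENS = {10: "A", 20: "B", 30: "A", 40: "C", 50: "B", 60: "D", 70: "C", 80: "D"}
POSITIONS = sorted(TOKENS)

def preference_list(key_pos: int, n: int) -> list:
    """Walk clockwise from key_pos and collect the first n distinct physical nodes."""
    start = bisect.bisect_left(POSITIONS, key_pos)
    nodes = []
    for step in range(len(POSITIONS)):
        node = TOKENS[POSITIONS[(start + step) % len(POSITIONS)]]
        if node not in nodes:      # skip further virtual nodes of a host already chosen
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes

print(preference_list(25, 3))  # ['A', 'C', 'B']
print(preference_list(5, 3))   # ['A', 'B', 'C'] (the second token of A is skipped)
print(preference_list(85, 3))  # ['A', 'B', 'C'] (wraps around the ring)
```

Skipping repeated hosts matters once virtual nodes are in use: otherwise two of the N replicas could land on the same physical machine.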

Data Versioning

  • Eventual consistency: Allows updates to be propagated to all replicas asynchronously
  • put() may return to caller before update has been applied at all replicas
  • get() may return an object that does not have the latest updates
  • Multiple versions of an object can be present in the system at same time
  • syntactic reconciliation: performed by system
  • semantic reconciliation: performed by client
  • vector clock: a list of (node, counter) pairs, used to capture causality between different versions of the same object. One vector clock is associated with every version of every object.
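
A minimal sketch of how vector clocks separate versions the system can reconcile on its own (syntactic) from versions that need the client (semantic). Clocks are modelled as plain dictionaries; the helper names and the node names `Sx`, `Sy`, `Sz` are illustrative assumptions.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    a_ge, b_ge = descends(a, b), descends(b, a)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a supersedes b"   # b can be dropped (syntactic reconciliation)
    if b_ge:
        return "b supersedes a"
    return "conflict"             # concurrent versions (semantic reconciliation)

d2 = {"Sx": 2}                    # written twice via node Sx
d3 = {"Sx": 2, "Sy": 1}           # d2 updated via node Sy
d4 = {"Sx": 2, "Sz": 1}           # d2 updated concurrently via node Sz
d5 = {"Sx": 3, "Sy": 1, "Sz": 1}  # client merged d3 and d4, written via Sx

print(compare(d3, d2))  # a supersedes b
print(compare(d3, d4))  # conflict
print(compare(d5, d4))  # a supersedes b
```

When `compare` reports a conflict, both versions would be returned to the caller of get() together with their context, and the merged result written back.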
System Architecture (5 of 7)

Execution of get() and put() Operations

  • Any storage node in Dynamo is eligible to receive client get() and put() operations for any key
  • Client can select a node using:
    • generic load balancer
    • partition-aware client library
  • Coordinator:
    • node handling a read or write operation
    • typically, first among the top N nodes in the preference list
  • Consistency protocol used to maintain consistency among replicas. Two key configurable values are:
    • R: min. no. of nodes that must participate in a successful read operation
    • W: min. no. of nodes that must participate in a successful write operation
    • Setting R + W > N yields a quorum-like system
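
The reason R + W > N is attractive can be checked by brute force: when it holds, every possible read set shares at least one replica with every possible write set. This sketch assumes a fixed set of N replicas (a strict quorum); the function name is hypothetical, and the guarantee may not hold once writes are diverted to stand-in nodes during failures.

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Check every read set of size r against every write set of size w."""
    replicas = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(replicas, r)
               for write_set in combinations(replicas, w))

print(quorums_always_overlap(3, 2, 2))  # True:  2 + 2 > 3
print(quorums_always_overlap(3, 1, 1))  # False: 1 + 1 <= 3
```

With overlap, a read is guaranteed to contact at least one replica that acknowledged the latest successful write.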
System Architecture (6 of 7)

Handling Failures: Hinted Handoff

  • Mechanism to ensure that read and write operations do not fail because of temporary node or network failures.
  • All read and write operations are performed on the first N healthy nodes from the preference list, which may NOT always be the first N nodes encountered while walking the consistent hashing ring.
  • Each object is replicated across multiple data centers, which are connected through high-speed network links.
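
A sketch of the "first N healthy nodes" rule with a hint attached to each stand-in replica. The function `choose_write_targets`, the node names, and the tuple layout are hypothetical; the slide does not specify how hints are stored or delivered.

```python
def choose_write_targets(ring_order, n, healthy):
    """Return (target_node, hinted_owner) pairs for one write.

    ring_order: nodes in the order met when walking the ring from the key.
    hinted_owner is None for a normal replica, otherwise the unreachable
    node that the stand-in should hand the replica back to later.
    """
    intended = ring_order[:n]
    targets = [(node, None) for node in intended if node in healthy]
    fallbacks = (node for node in ring_order[n:] if node in healthy)
    for missing in (node for node in intended if node not in healthy):
        stand_in = next(fallbacks, None)
        if stand_in is not None:
            targets.append((stand_in, missing))
    return targets

# Hypothetical scenario: N = 3 and node A is temporarily unreachable.
targets = choose_write_targets(["A", "B", "C", "D", "E"], 3, {"B", "C", "D", "E"})
print(targets)  # [('B', None), ('C', None), ('D', 'A')]
```

The write still reaches three nodes, and the hint on D records that the replica belongs to A so it can be handed back once A recovers.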

Handling Permanent Failures: Replica Synchronization

  • Dynamo implements an anti-entropy protocol to keep replicas synchronized. Uses Merkle trees.
  • Merkle tree: a hash tree whose leaves are hashes of the values of individual keys and whose parent nodes are hashes of their children.
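
A minimal sketch of Merkle-tree comparison between two replicas. It assumes both replicas hold the same non-empty key set and uses SHA-256 for the tree hashes; the function names and the sample data are hypothetical and say nothing about Dynamo's actual tree layout.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(store: dict) -> list:
    """Return the tree as a list of levels: leaves first, root last."""
    level = [_h(store[k].encode()) for k in sorted(store)]
    levels = [level]
    while len(level) > 1:
        # A parent hashes its two children; an unpaired last node is hashed with itself.
        level = [_h(level[i] + level[min(i + 1, len(level) - 1)])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def out_of_sync(a: dict, b: dict) -> list:
    """Keys whose values differ, found by descending only into mismatching subtrees."""
    la, lb = build_levels(a), build_levels(b)
    keys, found = sorted(a), []

    def descend(depth: int, idx: int) -> None:
        if la[depth][idx] == lb[depth][idx]:
            return                      # identical subtree: nothing to compare further
        if depth == 0:
            found.append(keys[idx])
            return
        for child in (2 * idx, 2 * idx + 1):
            if child < len(la[depth - 1]):
                descend(depth - 1, child)

    descend(len(la) - 1, 0)
    return found

replica_a = {f"k{i}": f"v{i}" for i in range(1, 6)}
replica_b = dict(replica_a, k4="stale")
print(out_of_sync(replica_a, replica_b))  # ['k4']
```

If the two roots match, the replicas are in sync and nothing below the root needs to be exchanged, which is what keeps anti-entropy cheap in the common case.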
System Architecture (7 of 7)

Membership and Failure Detection

  • Explicit mechanism available to initiate the addition and removal of nodes from a Dynamo ring.
  • To prevent logical partitions, some Dynamo nodes play the role of seed nodes.
  • Seeds: Nodes that are discovered by an external mechanism and known to all nodes.
  • Failure detection is purely local: a node treats a peer as failed when that peer stops responding to its messages.
  • Gossip-based distributed failure detection and membership protocol

Implementation

Each storage node has three main components:

  • Request coordination
    • Built on top of an event-driven messaging substrate
    • Uses Java NIO
    • Coordinator executes client read & write requests
    • State machines created on nodes serving requests
    • Each state machine instance handles exactly one client request
    • State machine contains entire process and failure handling logic
  • Membership & failure detection
  • Local persistence engine
    • Pluggable storage engines:
      • Berkeley Database (BDB) Transactional Data Store
      • BDB Java Edition
      • MySQL
      • In-memory buffer with persistent backing store
    • Engine chosen based on the application’s object size distribution
Experiences, Results & Lessons Learnt (1 of 4)
  • Main Dynamo Usage Patterns
  • Business logic specific reconciliation
    • E.g. Merging different versions of a customer’s shopping cart
  • Timestamp based reconciliation
    • E.g. Maintaining customer’s session information
  • High performance read engine
    • E.g. Maintaining product catalog and promotional items
  • Client applications can tune N, R and W to achieve the desired levels of performance, availability and durability:
    • N: no. of hosts at which a data item is replicated
    • R: min. no. of nodes that must participate in a successful read operation
    • W: min. no. of nodes that must participate in a successful write operation
    • Commonly used configuration (N,R,W) = (3,2,2)
  • Dynamo exposes data consistency & reconciliation logic to developers
  • Dynamo adopts a full membership model – each node is aware of the data hosted by its peers
Experiences, Results & Lessons Learnt (2 of 4)
  • Typical SLA of service using Dynamo: 99.9% of the read and write requests execute within 300 ms
  • Balancing Performance and Durability

Average & 99.9th percentile latencies of Dynamo’s read and write operations during a period of 30 days

Comparison of performance of 99.9th percentile latencies for buffered vs. non-buffered writes over 24 hours

Experiences, Results & Lessons Learnt (3 of 4)
  • Ensuring Uniform Load Distribution
    • Dynamo uses consistent hashing to partition its key space across its replicas and to ensure uniform load distribution.
    • Node “in-balance”: request load for node deviates from the average load by a value less than a certain threshold. Otherwise, Node “out-of-balance”
    • Imbalance ratio = Nodes out-of-balance / Total Nodes
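
The imbalance ratio defined above can be computed directly from per-node request loads. The slide leaves the threshold open, so the 15% default here, the function name, and the load figures are assumptions for illustration only.

```python
def imbalance_ratio(loads: dict, threshold: float = 0.15) -> float:
    """Fraction of nodes whose load deviates from the average by the threshold or more.

    threshold is relative to the average load (0.15 = 15%, an assumed value).
    """
    average = sum(loads.values()) / len(loads)
    out_of_balance = [node for node, load in loads.items()
                      if abs(load - average) >= threshold * average]
    return len(out_of_balance) / len(loads)

# Hypothetical request counts per node; the average is 100.
loads = {"A": 100, "B": 104, "C": 96, "D": 130, "E": 70}
print(imbalance_ratio(loads))  # 0.4 (D and E are out of balance)
```

A lower ratio means more nodes carry close to the average load.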

Comparison of load distribution efficiency of different strategies

Node imbalance & Workload

Experiences, Results & Lessons Learnt (4 of 4)
  • Three strategies for load distribution
    • T random tokens per node and partition by token value
    • T random tokens per node and equal sized partitions
    • Q/S tokens per node, equal-sized partitions (S= #allnodes, Q= #partitions)
  • Divergent versions of data item (rarely) arise in two scenarios:
    • System is facing failure scenarios (node/data center/network)
    • Large number of concurrent writers to a single data item
  • Server-driven coordination: client requests are uniformly assigned to nodes in the ring by a load balancer.
  • Client-driven coordination: client applications use a library to perform request coordination locally.
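
For the third load-distribution strategy listed above (Q/S tokens per node over equal-sized partitions), a minimal sketch of the bookkeeping follows. The round-robin placement, the function name, and the node names are assumptions made to keep the example deterministic; the slide does not say how the Q/S tokens are chosen for each node.

```python
def assign_partitions(q: int, nodes: list) -> dict:
    """Give each of the S nodes exactly Q/S of the Q equal-sized partitions."""
    s = len(nodes)
    if q % s != 0:
        raise ValueError("this sketch assumes Q is divisible by S")
    # Round-robin placement (an assumption): partition p goes to node p mod S.
    return {node: [p for p in range(q) if p % s == i]
            for i, node in enumerate(nodes)}

assignment = assign_partitions(12, ["A", "B", "C"])
print(assignment)
# {'A': [0, 3, 6, 9], 'B': [1, 4, 7, 10], 'C': [2, 5, 8, 11]}
```

Because partition boundaries are fixed and every node holds the same number of them, membership changes move whole partitions rather than re-splitting key ranges.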


Conclusion
  • Dynamo:
    • Is a highly available and scalable data store
    • Is used for storing state of a number of core services of Amazon.com’s e-commerce platform
    • Has provided desired levels of availability and performance and has been successful in handling:
      • Server failures
      • Data center failures
      • Network partitions
    • Is incrementally scalable
    • Sacrifices consistency under certain failure scenarios
    • Extensively uses object versioning and application-assisted conflict resolution
    • Allows service owners to:
      • scale up and down based on their current request load
      • customize their storage system to meet desired performance, durability and consistency SLAs by allowing tuning of N, R, W parameters
  • Decentralized techniques can be combined to provide a single highly-available system.

Pros
  • Excellent description of core distributed systems techniques used in Dynamo:
    • partitioning, replication, versioning, membership, failure handling, scaling
  • Liberal use of diagrams, charts and tables to explain concepts
  • Real-world examples are provided to help the reader understand and appreciate the theoretical concepts
  • Theoretical and implementation-level differences have been clearly explained
  • Exhaustive list of references for the interested researcher
  • Well-written paper with logical transition from one topic to the next

Cons
  • Little description of supporting techniques used in Dynamo for:
    • state transfer, concurrency & job scheduling, request marshalling, request routing, system monitoring and alarming
  • Certain theoretically possible problems have not been investigated in detail because they have not been encountered in production systems.
  • Sophisticated comparison with existing systems has not been provided.
  • To protect Amazon.com’s business interests, certain parts of the system are either not fully described or described only at a very high level.
  • Future work and possible extensions have not been mentioned clearly.