
Dynamo: Amazon’s Highly Available Key-value Store


Presentation Transcript


  1. Dynamo: Amazon’s Highly Available Key-value Store Giuseppe DeCandia et al., SOSP ‘07

  2. Introduction • Dynamo: used to manage applications that require only primary-key access to data • Dynamo applications need scalability, high availability, fault tolerance, but don’t need the complexity of a relational DB • ACID properties => little parallelism, low availability

  3. Assumptions: • Applications perform simple read/write ops on single, small ( < 1MB) data objects which are identified by a unique key. • Example: the shopping cart • Replace ACID properties with weaker guarantees: eventual consistency, no isolation promises • Services must operate efficiently on commodity hardware • Used only by internal services, so security isn’t an issue

  4. Service Level Agreements (SLA) • Clients and servers negotiate SLAs to establish the kind of service and the expected performance • Amazon expects the guarantees to apply to 99.9% of requests • Claim that most industry systems express SLAs in terms of “average”, “median”, and “expected variance” – much weaker than Amazon’s requirements

  5. Design Considerations • Services control properties such as durability and consistency, evaluate tradeoffs (cost v performance, for example) • Replicated databases cannot guarantee strong consistency and high availability at the same time • Optimistic replication updates replicas as a background process to get eventual consistency

  6. Design Considerations: Resolving Conflicting Updates • When? Dynamo targets services that require an “always writeable” data store (e.g., users must always be able to add to or delete from the shopping cart), so conflicts are resolved during reads, not writes • By whom? Each application decides for itself, but the default is “last write wins”.

  7. Other Key Design Principles • Incremental scalability: adding a single node should not affect the system significantly • Symmetry: all nodes have the same responsibilities • Decentralization: favor P2P techniques over centralized control • Heterogeneity: take advantage of differences in server capabilities.

  8. Comparison to Other Systems • Peer-to-Peer (Freenet, Chord, …) • Structured v unstructured: access times • Conflict resolution for concurrent updates without wide-area file locking • Distributed File Systems and Databases (Google, Bayou, Coda, …) • Treatment of system partitions • Conflict resolution, eventual consistency • Strong consistency v eventual consistency

  9. Dynamo v Other Decentralized Storage Systems • “Always writeable”: updates won’t be rejected because of failures or concurrent updates • One administrative domain; nodes are assumed to be trustworthy • No requirement for hierarchical name spaces or a relational schema • Operations must be performed within a few hundred milliseconds.

  10. System Architecture • The Dynamo data storage system contains items that are associated with a single key • Operations that are implemented: get( ) and put( ). • get(key) • put(key, context, object) where context refers to various kinds of system metadata
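
A minimal sketch of this key/value interface, in Python. The class name, method signatures, and the dict-shaped context are illustrative assumptions, not the paper's actual API; they only show that get( ) may return multiple versions and that put( ) carries the context from an earlier read.

```python
# Hypothetical sketch of the client-facing get/put interface (names and
# signatures are illustrative, not from the paper).

from typing import Any, Dict, List, Tuple


class DynamoStore:
    """Client-side view of the store: objects are blobs keyed by a unique key."""

    def get(self, key: str) -> Tuple[List[Any], Dict]:
        """Return all causally unreconciled versions of the object for `key`,
        plus the opaque context (system metadata such as a vector clock)
        needed by a subsequent put()."""
        raise NotImplementedError

    def put(self, key: str, context: Dict, obj: Any) -> None:
        """Write a new version of the object; `context` ties the write to the
        version(s) previously read."""
        raise NotImplementedError
```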

  11. Table 1: Summary of techniques used in Dynamo and their advantages
      • Problem: Partitioning | Technique: Consistent hashing | Advantage: Incremental scalability
      • Problem: High availability for writes | Technique: Vector clocks, reconciled during reads | Advantage: Version size is decoupled from update rates
      • Problem: Temporary failures | Technique: Sloppy quorum, hinted handoff | Advantage: Provides high availability & durability guarantee when some of the replicas are not available
      • Problem: Permanent failures | Technique: Anti-entropy using Merkle trees | Advantage: Synchronizes divergent replicas in the background
      • Problem: Membership & failure detection | Technique: Gossip-based protocol | Advantage: Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

  12. Partitioning Algorithm • Partitioning = dividing data storage across all nodes. Supports scalability • Very similar to Chord-based schemes • Consistent hashing scheme distributes content across multiple nodes • In consistent hashing the effect of adding a node is localized – on average, K/n objects must be remapped (K = # of keys, n = # of nodes)

  13. Partitioning Algorithm • Hash function produces an m-bit number which defines a circular name space (like Chord) • Nodes are assigned numbers randomly in the name space • Hash(data key) and assign to node using successor function like Chord
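
As a rough illustration of that scheme, the sketch below places nodes at hashed positions on a circular space and assigns each key to its clockwise successor. MD5 as the hash, the node names, and the single-position-per-node layout are assumptions for illustration only.

```python
# Sketch of Chord-style consistent hashing: nodes sit at hashed positions on
# a ring, and a key belongs to the first node clockwise from hash(key).
# MD5 and the node names are illustrative choices.

import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string into the 128-bit circular hash space."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node is assigned a position derived from its name.
        self.positions = sorted((ring_hash(n), n) for n in nodes)
        self._keys = [p for p, _ in self.positions]

    def successor(self, key: str) -> str:
        """The first node clockwise from hash(key) owns the key."""
        idx = bisect.bisect_right(self._keys, ring_hash(key))
        return self.positions[idx % len(self.positions)][1]


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.successor("cart:12345"))  # one of the three nodes
```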

  14. Load Distribution • Random assignment of nodes to positions in the ring may produce a non-uniform distribution of data. • Solution: virtual nodes • Assign several random positions (“tokens”) to each physical node; the physical node is then responsible for the data that maps to each of its virtual nodes
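
Extending the ring sketch above, each physical host can claim several positions (virtual nodes). The token count of 8 and the host names are arbitrary illustrative choices.

```python
# Sketch: each physical host appears at several "virtual" positions on the
# ring, which smooths the key distribution. The token count is arbitrary.

import bisect
import hashlib


def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class VirtualNodeRing:
    def __init__(self, hosts, tokens_per_host=8):
        self.positions = sorted(
            (ring_hash(f"{host}#{token}"), host)
            for host in hosts
            for token in range(tokens_per_host)
        )
        self._keys = [p for p, _ in self.positions]

    def owner(self, key: str) -> str:
        """Physical host whose virtual node first follows hash(key) clockwise."""
        idx = bisect.bisect_right(self._keys, ring_hash(key))
        return self.positions[idx % len(self.positions)][1]
```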

  15. Replication • Data is replicated at N nodes • Succ(key) = coordinator node • The coordinator replicates the object at the N-1 successor nodes in the ring, skipping ring positions that map to physical nodes already holding a replica, so the N replicas reside on distinct physical machines • Preference list: the list of nodes that store a particular key • The preference list actually contains more than N nodes, in order to ensure N “healthy” nodes at all times.
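
Building on the VirtualNodeRing sketch above, a preference list can be assembled by walking the ring clockwise from the key and keeping only distinct physical hosts; N = 3 and the host names are illustrative assumptions.

```python
# Sketch: walk clockwise from hash(key) and collect the first N distinct
# physical hosts; virtual-node positions that map to a host already on the
# list are skipped. Uses VirtualNodeRing/ring_hash from the sketch above.

import bisect


def preference_list(ring, key, n=3):
    start = bisect.bisect_right(ring._keys, ring_hash(key))
    hosts = []
    for i in range(len(ring.positions)):
        host = ring.positions[(start + i) % len(ring.positions)][1]
        if host not in hosts:
            hosts.append(host)
        if len(hosts) == n:
            break
    return hosts  # hosts[0] acts as the coordinator for this key


ring = VirtualNodeRing(["host-a", "host-b", "host-c", "host-d"])
print(preference_list(ring, "cart:12345"))  # three distinct physical hosts
```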

  16. Data Versioning • Updates can be propagated to replicas asynchronously – the put( ) call may return before all updates have been applied. • Implication: a subsequent get( ) may return stale data. • Barring failure, most updates are applied within bounded time, but server or network failure can delay updates “for an extended period of time”.

  17. Data Versioning • Some applications can be designed to work in this environment; e.g., the “add-to/delete-from cart” operation. • It’s okay to add to an old cart, as long as all versions of the cart are eventually reconciled • Dynamo treats each modification as a new (& immutable) version of the object. • Multiple versions can exist at the same time

  18. Reconciliation • Usually, new versions subsume the old versions – no problem • Sometimes concurrent updates and failures generate conflicting versions • Typically this is handled by merging • For add-to-cart operations, nothing is lost • For delete-from-cart operations, deleted items might reappear after the reconciliation (see the sketch below)
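
A toy illustration of merge-based reconciliation, with the cart modeled as a plain set of item IDs (a simplification, not the paper's representation): the union keeps every add, but it also resurrects an item that one branch had deleted.

```python
# Toy example: reconciling two divergent cart versions by set union.
# Every added item survives the merge, but an item deleted on only one
# branch reappears afterwards.

base = {"book", "cable"}

branch_1 = base | {"mug"}       # one replica saw an add of "mug"
branch_2 = base - {"cable"}     # another replica saw a delete of "cable"

merged = branch_1 | branch_2    # union-based merge
print(merged)                   # contains "cable" again despite the delete
```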

  19. Parallel Version Branches • There may be multiple versions of the same data, each coming from a different path (e.g., if there’s been a network partition) • Vector clocks are used to identify causally related versions and parallel (concurrent) versions • For causally related versions, accept the final version as the “true” version • For parallel (concurrent) versions, use some reconciliation technique to resolve the conflict
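
A minimal sketch of the vector-clock test behind that rule, representing a clock as a dict from node ID to counter (the representation and node IDs are illustrative).

```python
# Sketch: a vector clock maps node-id -> counter. Version A causally follows
# version B if every counter in A is >= the corresponding counter in B;
# if neither follows the other, the versions are concurrent.

def descends(a: dict, b: dict) -> bool:
    """True if the version with clock `a` causally follows (or equals) `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"        # keep a, discard b
    if descends(b, a):
        return "b supersedes a"
    return "concurrent"                # parallel branches: reconcile


print(compare({"sx": 2, "sy": 1}, {"sx": 2}))   # a supersedes b
print(compare({"sx": 2, "sz": 1}, {"sx": 3}))   # concurrent
```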

  20. Execution of get( ) and put( ) • Operations can originate at any node in the system. • Clients may • Route the request through a load balancer, which forwards it to a node in the ring • Use client software that routes the request directly to the coordinator for that object • A read must gather responses from at least R replicas and a write must gather acknowledgements from at least W replicas, with R + W > N so that the read and write sets overlap

  21. “Sloppy Quorum” • put( ): the coordinator writes to the first N healthy nodes on the preference list. If W writes succeed, the write is considered to be successful • get( ): coordinator reads from N nodes; waits for R responses. • If they agree, return value. • If they disagree, but are causally related, return the most recent value • If they are causally unrelated apply reconciliation techniques and write back the corrected version
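
A rough sketch of that coordinator logic. The node objects and their write()/read() methods are hypothetical stand-ins, and N = 3, R = 2, W = 2 are illustrative settings; the point is the W-acknowledgement and R-response thresholds with R + W > N.

```python
# Rough sketch of quorum-style coordination. `replicas` stands for the first
# N healthy nodes on the preference list; write()/read() are hypothetical
# RPCs, and the N/R/W values are illustrative.

N, R, W = 3, 2, 2
assert R + W > N  # read and write sets overlap in at least one replica


def coordinate_put(replicas, key, context, value) -> bool:
    acks = sum(1 for node in replicas[:N] if node.write(key, context, value))
    return acks >= W           # success once W replicas acknowledge


def coordinate_get(replicas, key):
    responses = []
    for node in replicas[:N]:
        version = node.read(key)
        if version is not None:
            responses.append(version)
        if len(responses) >= R:
            break
    # Causally related versions collapse to the newest; concurrent versions
    # are all returned so they can be reconciled and written back.
    return responses
```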

  22. Hinted Handoff • What if a write operation can’t reach some of the nodes on the preference list? • To preserve availability and durability, store the replica temporarily on another node, accompanied by a metadata “hint” that records where the replica should eventually be stored. • Hinted handoff ensures that read and write operations don’t fail because of network partitioning or node failures.
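
A sketch of the handoff idea under the same hypothetical node API as above: if an intended replica holder is unreachable, the write goes to a stand-in node together with a hint naming the intended owner, and the replica is handed back later.

```python
# Sketch of hinted handoff. alive()/store()/hinted_items()/delete() are
# hypothetical methods; the essential part is that the hint records the
# intended owner so the replica can be returned when that node recovers.

def write_with_handoff(preference_list, standby_nodes, key, value):
    for intended in preference_list:
        if intended.alive():
            intended.store(key, value, hint=None)
        else:
            stand_in = next(n for n in standby_nodes if n.alive())
            # The hint tells the stand-in where this replica really belongs.
            stand_in.store(key, value, hint=intended)


def return_hinted_replicas(stand_in):
    """Periodically scan locally held hinted replicas and hand them back."""
    for key, value, intended in stand_in.hinted_items():
        if intended.alive():
            intended.store(key, value, hint=None)
            stand_in.delete(key)
```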

  23. Handling Permanent Failures • Hinted replicas may be lost before they can be returned to the original node. Other problems may cause replicas to be lost or fall out of agreement • Merkle trees allow two nodes to compare a set of replicas and determine fairly easily • Whether or not they are consistent • Where the inconsistencies are

  24. Handling Permanent Failures • Merkle trees have leaves whose values are hashes of the values associated with keys (one key/leaf) • Parent nodes contain hashes of their children • Eventually, root contains a hash that represents everything in that replica • To detect inconsistency between two sets of replicas, compare the roots • Source of inconsistency can be detected by looking at internal nodes
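
A small sketch of that comparison: hash each key/value pair into a leaf, hash pairs of children upward, and compare roots before drilling down. SHA-256 and the simple pairwise tree shape are illustrative choices, not the paper's exact construction.

```python
# Sketch: build a Merkle root over a replica's key range and compare roots.
# Equal roots mean the replicas agree; unequal roots mean a divergence that
# can be localized by comparing subtrees. SHA-256 is an illustrative choice.

import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(items: dict) -> bytes:
    # Leaves: hash of each key/value pair, in key order.
    level = [h(f"{k}={v}".encode()) for k, v in sorted(items.items())]
    if not level:
        return h(b"")
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


replica_a = {"cart:1": "v3", "cart:2": "v1"}
replica_b = {"cart:1": "v3", "cart:2": "v2"}
print(merkle_root(replica_a) == merkle_root(replica_b))  # False: drill down
```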

  25. Failures • Like Google, Amazon has a number of data centers, each with many commodity machines. • Individual machines fail regularly • Sometimes entire data centers fail due to power outages, network partitions, tornados, etc. • To handle failure of entire centers, replicas are spread across multiple data centers.

  26. Membership and Failure Detection • Temporary failures or accidental additions of nodes are possible but shouldn’t cause load re-balancing. • Additions and deletions of nodes are explicitly executed by an administrator. • A gossip-based protocol is used to ensure that every node eventually has a consistent view of the membership list.

  27. Gossip-based Protocol • Periodically, each node contacts another node in the network, randomly selected. • Nodes compare their membership histories and reconcile them.
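
A toy sketch of one such round, where each node's membership view is a dict from member name to a (status, version) pair and the entry with the higher version wins during reconciliation. The representation is an illustrative simplification.

```python
# Toy sketch of gossip-based membership reconciliation: on contact, two
# nodes merge their views, keeping the higher-versioned entry per member.

import random


def reconcile(view_a: dict, view_b: dict) -> dict:
    merged = dict(view_a)
    for member, (status, version) in view_b.items():
        if member not in merged or merged[member][1] < version:
            merged[member] = (status, version)
    return merged


def gossip_round(views: dict):
    """views: node name -> membership view. Each node contacts one random peer."""
    for name in list(views):
        peer = random.choice([n for n in views if n != name])
        merged = reconcile(views[name], views[peer])
        views[name] = merged
        views[peer] = dict(merged)
```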

  28. Load Balancing for Additions and Deletions • When a node is added, it acquires key ranges from other nodes in the network. • Existing nodes learn of the addition through the gossip protocol and offer the new node the keys it is now responsible for; the keys are transferred once the new node accepts the offer • When a node is removed, a similar process happens in reverse • Experience has shown that this approach leads to a relatively uniform distribution of key/value pairs across the system

  29. Summary • Experience with Dynamo indicates that it meets the requirements of scalability and availability. • Service owners are able to customize their storage system to emphasize performance, durability, or consistency. The primary parameters are N, R, and W. • The developers conclude that decentralization and eventual consistency can provide a satisfactory platform for hosting highly-available applications.
