Amazon’s Key-Value Store: Dynamo

Presentation Transcript


  1. DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's highly available key-value store. SOSP 2007. Amazon’s Key-Value Store: Dynamo. Adapted from Amazon’s Dynamo Presentation. UCSB CS271

  2. Motivation • Reliability at a massive scale • Slightest outage → significant financial consequences • High write availability • Amazon’s platform: 10s of thousands of servers and network components, geographically dispersed • Provide persistent storage in spite of failures • Sacrifice consistency to achieve performance, reliability, and scalability UCSB CS271

  3. Dynamo Design rationale • Most services need key-based access: • Best-seller lists, shopping carts, customer preferences, session management, sales rank, product catalog, and so on. • Building such services on the prevalent RDBMS-based application design would be catastrophic for scale and availability. • Dynamo therefore provides a primary-key-only interface. UCSB CS271

  4. Dynamo Design Overview • Data partitioning using consistent hashing • Data replication • Consistency via version vectors • Replica synchronization via quorum protocol • Gossip-based failure-detection and membership protocol UCSB CS271

  5. System Requirements • Data & Query Model: • Read/write operations via primary key • No relational schema: use <key, value> objects • Object size typically < 1 MB • Consistency guarantees: • Weak • Only single-key updates • No isolation guarantees (e.g., read-modify-write sequences are not isolated) • Efficiency: • SLAs expressed at the 99.9th percentile of operations • Notes: • Commodity hardware • Minimal security measures, since Dynamo is for internal use UCSB CS271

  6. Service Level Agreements (SLA) • An application can deliver its functionality in a bounded time only if every dependency in the platform delivers its functionality with even tighter bounds. • Example SLA: a service guarantees that it will provide a response within 300 ms for 99.9% of its requests at a peak client load of 500 requests per second. UCSB CS271
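
A minimal Python sketch of checking measured latencies against such a percentile target; the 300 ms / 99.9% figures come from the example above, while the sampling and function names are illustrative, not Amazon’s tooling:

    import math
    import random

    def percentile(latencies, p):
        # Nearest-rank p-th percentile (p in 0..100) of a list of latency samples.
        ordered = sorted(latencies)
        rank = max(1, math.ceil(p / 100.0 * len(ordered)))
        return ordered[rank - 1]

    # Simulated response times in milliseconds (illustrative only).
    samples = [abs(random.gauss(120, 40)) for _ in range(10_000)]
    p999 = percentile(samples, 99.9)
    print(f"99.9th percentile latency: {p999:.1f} ms; SLA met: {p999 <= 300}")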

  7. System Interface • Two basic operations: • Get(key): • Locates replicas • Returns the object + context (encodes metadata including the version) • Put(key, context, object): • Writes the replicas to disk • Context: version (vector timestamp) • Hash(key) → 128-bit identifier UCSB CS271
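
A hedged Python sketch of what this two-operation interface looks like from the client side; the class and parameter names are illustrative, not Dynamo’s actual API:

    import hashlib

    def key_to_id(key: str) -> int:
        # Hashing the key (MD5 here) yields a 128-bit identifier on the ring.
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

    class DynamoLikeStore:
        # Illustrative two-operation interface: get() and put().
        def __init__(self):
            # key id -> list of (context, object) versions; stands in for real replicas.
            self.versions = {}

        def get(self, key):
            # Returns the object(s) plus the context that encodes version metadata.
            return list(self.versions.get(key_to_id(key), []))

        def put(self, key, context, obj):
            # The context (a vector clock) travels with the write.
            self.versions.setdefault(key_to_id(key), []).append((context, obj))

    store = DynamoLikeStore()
    store.put("cart:alice", context={"Sx": 1}, obj={"items": ["book"]})
    print(store.get("cart:alice"))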

  8. Partition Algorithm • Consistent hashing: the output range of a hash function is treated as a fixed circular space or “ring”, a la Chord. • “Virtual Nodes”: each node can be responsible for more than one virtual node (to deal with non-uniform data and load distribution) UCSB CS271
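
A minimal consistent-hashing sketch in Python, with each physical node owning several positions (virtual nodes) on the ring; the hash width, node names, and token counts are illustrative:

    import bisect
    import hashlib

    def ring_position(value: str) -> int:
        # Map a string to a position on a fixed circular 128-bit hash space.
        return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

    class ConsistentHashRing:
        def __init__(self, nodes, vnodes_per_node=8):
            # Each physical node is assigned several "virtual node" positions on the ring.
            self.tokens = sorted(
                (ring_position(f"{node}#{i}"), node)
                for node in nodes
                for i in range(vnodes_per_node)
            )
            self.positions = [pos for pos, _ in self.tokens]

        def coordinator(self, key: str) -> str:
            # A key is handled by the first virtual node clockwise from its position.
            idx = bisect.bisect_right(self.positions, ring_position(key)) % len(self.tokens)
            return self.tokens[idx][1]

    ring = ConsistentHashRing(["A", "B", "C"])
    print(ring.coordinator("shopping-cart:42"))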

  9. Virtual Nodes UCSB CS271

  10. Advantages of using virtual nodes • The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure. • A real node’s load can be distributed across the ring, so that a hot spot does not fall on a single physical node. • If a node becomes unavailable, the load it handled is evenly dispersed across the remaining available nodes. • When a node becomes available again, it accepts a roughly equivalent amount of load from each of the other available nodes. UCSB CS271

  11. Replication • Each data item is replicated at N hosts. • preference list: The list of nodes that is responsible for storing a particular key. • Some fine-tuning to account for virtual nodes UCSB CS271

  12. Replication UCSB CS271

  13. Replication UCSB CS271

  14. Preference Lists • List of nodes responsible for storing a particular key. • Due to failures, preference list contains more than N nodes. • Due to virtual nodes, preference list skips positions to ensure distinct physical nodes. UCSB CS271
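
A Python sketch of how such a preference list could be built: walk the ring clockwise from the key’s position and keep only distinct physical nodes, skipping further virtual nodes of hosts already chosen. The token layout and names are illustrative:

    import bisect
    import hashlib

    def ring_position(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

    def preference_list(tokens, key, n=3):
        # tokens: sorted list of (ring position, physical node), one entry per virtual node.
        positions = [pos for pos, _ in tokens]
        start = bisect.bisect_right(positions, ring_position(key))
        nodes = []
        for step in range(len(tokens)):
            _, node = tokens[(start + step) % len(tokens)]
            if node not in nodes:      # skip extra virtual nodes of a host already chosen
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes

    # Three physical nodes, two virtual nodes each (illustrative tokens).
    tokens = sorted((ring_position(f"{node}#{i}"), node) for node in "ABC" for i in range(2))
    print(preference_list(tokens, "shopping-cart:42", n=3))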

  15. Data Versioning • A put() call may return to its caller before the update has been applied at all the replicas. • A get() call may return many versions of the same object. • Challenge: an object may have multiple distinct versions in the system at once. • Solution: use vector clocks to capture causality between different versions of the same object. UCSB CS271

  16. Vector Clock • A vector clock is a list of (node, counter) pairs. • Every version of every object is associated with one vector clock. • If all the counters in the first object’s clock are less than or equal to the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten. • The application reconciles divergent versions and collapses them into a single new version. UCSB CS271
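
A minimal vector-clock sketch in Python showing the ancestor test described above and the element-wise-maximum clock attached to a reconciled version; the clock representation and node names are illustrative:

    def descends(vc_a, vc_b):
        # True if vc_a dominates vc_b, i.e. vc_b is an ancestor version and can be forgotten.
        # Clocks are dicts mapping node name -> counter.
        return all(vc_a.get(node, 0) >= counter for node, counter in vc_b.items())

    def concurrent(vc_a, vc_b):
        # Neither clock descends from the other: the versions conflict.
        return not descends(vc_a, vc_b) and not descends(vc_b, vc_a)

    def merged_clock(vc_a, vc_b):
        # Clock for the application-reconciled version: element-wise maximum.
        return {node: max(vc_a.get(node, 0), vc_b.get(node, 0))
                for node in set(vc_a) | set(vc_b)}

    d1 = {"Sx": 1}              # version written at node Sx
    d2 = {"Sx": 2, "Sy": 1}     # later update handled by Sy
    d3 = {"Sx": 2, "Sz": 1}     # sibling update handled by Sz
    print(descends(d2, d1))     # True: d1 is an ancestor of d2 and can be dropped
    print(concurrent(d2, d3))   # True: d2 and d3 conflict; the application must reconcile
    print(merged_clock(d2, d3)) # {'Sx': 2, 'Sy': 1, 'Sz': 1} (key order may vary)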

  17. Vector clock example UCSB CS271

  18. Routing requests • Route requests through a generic load balancer that selects a node based on load information, or • Use a partition-aware client library that routes requests directly to the relevant node. • A gossip-based protocol propagates membership changes: every second each node contacts a peer chosen at random, and the two nodes reconcile their membership change histories. UCSB CS271
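
A toy Python sketch of that periodic reconciliation: each node keeps a version-stamped membership view, and when two nodes gossip, the higher-versioned entry for every member wins on both sides. The view structure and field names are assumptions for illustration:

    def reconcile(view_a, view_b):
        # Each view maps node -> (version, status); the newer entry wins.
        merged = dict(view_a)
        for node, (version, status) in view_b.items():
            if node not in merged or version > merged[node][0]:
                merged[node] = (version, status)
        return merged

    node_a_view = {"A": (5, "up"), "B": (2, "up")}
    node_c_view = {"B": (3, "down"), "C": (1, "up")}
    # Once per second each node picks a random peer; both then adopt the merged view.
    print(reconcile(node_a_view, node_c_view))
    # {'A': (5, 'up'), 'B': (3, 'down'), 'C': (1, 'up')}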

  19. Sloppy Quorum • R and W are the minimum numbers of nodes that must participate in a successful read and write operation, respectively. • Setting R + W > N yields a quorum-like system. • In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency and availability. UCSB CS271
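
A small Python sketch of the arithmetic behind choosing (N, R, W); the N=3, R=2, W=2 setting used below is a commonly cited Dynamo configuration, shown only as an example:

    def quorum_properties(n, r, w):
        # Summarize what an (N, R, W) configuration buys you.
        return {
            "read_overlaps_latest_write": r + w > n,   # quorum intersection
            "node_failures_tolerated_on_write": n - w,
            "node_failures_tolerated_on_read": n - r,
        }

    print(quorum_properties(n=3, r=2, w=2))
    # {'read_overlaps_latest_write': True,
    #  'node_failures_tolerated_on_write': 1,
    #  'node_failures_tolerated_on_read': 1}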

  20. Highlights of Dynamo • High write availability • Optimistic: vector clocks for resolution • Consistent hashing (Chord) in controlled environment • Quorums for relaxed consistency. UCSB CS271

  21. Lakshman and Malik: Cassandra—A Decentralized Structured Storage System. LADIS 2009. Cassandra (Facebook) UCSB CS271

  22. Data Model • Key-value store—more like Bigtable. • Basically, a distributed multi-dimensional map indexed by a key. • Value is structured into Columns, which are grouped into Column Families: simple and super (column family within a column family). • An operation is atomic on a single row. • API: insert, get and delete. UCSB CS271
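
A rough Python sketch of this data model as nested maps, row key -> column family -> column -> (value, timestamp), with a super column family adding one more level of nesting; all names and values are illustrative:

    # keyspace[row_key][column_family][column] = (value, timestamp)
    keyspace = {
        "user:42": {
            "profile": {                       # simple column family
                "name": ("Alice", 1000),
                "email": ("alice@example.com", 1000),
            },
            "posts": {                         # super column family: one more nesting level
                "2009-06-01": {"title": ("Hello", 1001)},
            },
        }
    }

    def insert(keyspace, row_key, cf, column, value, ts):
        # Operations on a single row are atomic in Cassandra; this sketch ignores concurrency.
        keyspace.setdefault(row_key, {}).setdefault(cf, {})[column] = (value, ts)

    insert(keyspace, "user:42", "profile", "city", "Santa Barbara", 1002)
    print(keyspace["user:42"]["profile"]["city"])   # ('Santa Barbara', 1002)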

  23. System Architecture • Like Dynamo (and Chord). • Uses an order-preserving hash function on a fixed circular space. The node responsible for a key is called the coordinator. • Non-uniform data distribution: keep track of the data distribution and reorganize if necessary. UCSB CS271

  24. Replication • Each item is replicated at N hosts. • Replicas can be: Rack Unaware; Rack Aware (within a data center); Datacenter Aware. • System has an elected leader. • When a node joins the system, the leader assigns it a range of data items and replicas. • Each node is aware of every other node in the system and the range they are responsible for. UCSB CS271

  25. Membership and Failure Detection • Gossip-based mechanism to maintain cluster membership. • A node determines which nodes are up and down using a failure detector. • The Φ accrual failure detector returns a suspicion level, Φ, for each monitored node. • If a node suspects A at Φ = 1, 2, or 3, the likelihood of a mistake is 10%, 1%, and 0.1%, respectively. • Every node maintains a sliding window of interarrival times of gossip messages from other nodes, estimates their distribution, and then calculates Φ; the distribution is approximated as exponential. UCSB CS271
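
A Python sketch of the Φ computation under the exponential approximation mentioned above: with mean interarrival time m, the probability that the next heartbeat is still outstanding after t seconds is e^(-t/m), so Φ = -log10(e^(-t/m)) = t / (m · ln 10). The window size and timing values below are illustrative:

    import math
    from collections import deque

    class PhiAccrualDetector:
        # Φ accrual failure detector with an exponential model of interarrival times.
        def __init__(self, window=100):
            self.intervals = deque(maxlen=window)   # sliding window of interarrival times
            self.last_heartbeat = None

        def heartbeat(self, now):
            if self.last_heartbeat is not None:
                self.intervals.append(now - self.last_heartbeat)
            self.last_heartbeat = now

        def phi(self, now):
            # Φ = 1, 2, 3 correspond to roughly 10%, 1%, 0.1% chance of a false suspicion.
            if not self.intervals:
                return 0.0
            mean = sum(self.intervals) / len(self.intervals)
            elapsed = now - self.last_heartbeat
            # P(heartbeat arrives later than 'elapsed') ~ exp(-elapsed / mean)
            return elapsed / (mean * math.log(10))

    detector = PhiAccrualDetector()
    for t in range(10):                      # heartbeats once per second
        detector.heartbeat(float(t))
    print(round(detector.phi(now=12.0), 2))  # ~1.3 after 3 s of silence with a 1 s mean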

  26. Operations • Use quorums: R and W. • If R + W > N, then a read is guaranteed to see the latest value. • Read operations return the value with the highest timestamp, so they may return older versions. • Read Repair: with every read, send the newest version to any out-of-date replicas. • Anti-Entropy: compute Merkle trees to catch any out-of-sync data (expensive). • Each write goes first into a persistent commit log, then into an in-memory data structure. UCSB CS271
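
A Python sketch of a quorum read with read repair along the lines described above: take the value with the highest timestamp among the R responses and push it back to any replica that returned an older version. The replica layout and key names are illustrative:

    def quorum_read(replicas, key, r):
        # replicas: list of dicts mapping key -> (value, timestamp); contact R of them.
        responses = [(rep, rep.get(key)) for rep in replicas[:r]]
        newest = max((resp for _, resp in responses if resp is not None),
                     key=lambda vt: vt[1], default=None)
        if newest is None:
            return None
        for rep, resp in responses:
            if resp is None or resp[1] < newest[1]:
                rep[key] = newest           # read repair: push the newest version back
        return newest[0]

    replicas = [
        {"cart:7": ("v2", 20)},
        {"cart:7": ("v1", 10)},             # stale replica
        {},                                 # replica that missed the write entirely
    ]
    print(quorum_read(replicas, "cart:7", r=3))           # 'v2'
    print(replicas[1]["cart:7"], replicas[2]["cart:7"])   # both repaired to ('v2', 20)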
