
Dynamo



Presentation Transcript


  1. Dynamo: A Highly Available Key-Value Store – Dennis Kafura – CS5204 – Operating Systems

  2. Context
  • Core e-commerce services need scalable and reliable storage for massive amounts of data
    • n x 100 services
    • n x 100,000 concurrent sessions on key services
  • Size and scalability require a storage architecture that is
    • highly decentralized
    • high in component count
    • built from commodity hardware
  • The high component count creates reliability problems ("treats failure handling as the normal case")
  • Reliability problems are addressed by replication
  • Replication raises issues of:
    • Consistency (replicas differ after failure)
    • Performance
      • When to enforce consistency (on read, on write)
      • Who enforces consistency (client, storage system)

  3. System Elements
  • Maintains state for services with
    • High reliability requirements
    • Latency-sensitive performance
    • A controllable tradeoff between consistency and performance
  • Used only internally
    • Can leverage characteristics of services and workloads
    • Non-hostile environment (no security requirements)
  • Simple key-value interface
    • Applications do not require richer (e.g. database) semantics or a hierarchical name space
    • Key is a unique identifier for a data item; value is a binary object (blob)
    • No operations span multiple data items
  • Adopts a weaker model of consistency (eventual consistency) in favor of higher availability
  • Service level agreements (SLAs)
    • Measured at the 99.9th percentile
    • Key factor: service latency at a given request rate
    • Example: response time of 300 ms for 99.9% of requests at a peak client load of 500 requests per second (see the sketch below)
    • State management (storage) efficiency is a key factor in meeting SLAs
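As a rough illustration of the SLA example on this slide, the 99.9th-percentile latency is simply the value below which 99.9% of sampled request latencies fall, and the SLA holds if that value stays under the 300 ms bound. The snippet below is only a sketch; the simulated latencies and the `sla_met` helper are illustrative assumptions, not measurements or code from the paper.

```python
# Minimal sketch: checking a 99.9th-percentile latency SLA over sampled request latencies.
# The simulated data and helper names are illustrative, not from the Dynamo paper.
import math
import random

def percentile(values, p):
    """Return the p-th percentile (0-100) of the values, nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

def sla_met(latencies_ms, bound_ms=300.0, p=99.9):
    """The SLA holds if the p-th percentile latency stays within the bound."""
    return percentile(latencies_ms, p) <= bound_ms

# Simulate roughly one second at 500 requests/sec: mostly fast responses, a small slow tail.
random.seed(0)
latencies = [random.gauss(40, 10) + (400 if random.random() < 0.0005 else 0)
             for _ in range(500)]
print(round(percentile(latencies, 99.9), 1), sla_met(latencies))
```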

  4. Design Considerations
  • Consistency vs. availability
    • Strict consistency means data is unavailable when one of the replicas has failed
    • To improve availability:
      • use a weaker form of consistency (eventual consistency)
      • allow optimistic updates (changes propagate in the background)
      • this can lead to conflicting changes, which must be detected and resolved
  • Conflicts
    • Dynamo applications require "always writeable" storage
    • Perform conflict detection/resolution on reads
  • Other factors
    • Incremental scalability
    • Symmetry/decentralization (P2P organization/control)
    • Heterogeneity (not all servers are the same)

  5. Design Overview

  6. Partitioning
  • Interface
    • get(key)
      • Returns a context and either a single object or a list of conflicting objects
    • put(key, context, object)
      • The context comes from a previous read
  • Object placement/replication (see the sketch below)
    • MD5 hash of the key yields a 128-bit identifier
    • Consistent hashing determines placement; the preference list is the list of nodes responsible for storing the key
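The partitioning scheme on this slide can be sketched as a consistent-hash ring: MD5 maps each key to a 128-bit position, and the preference list is the first N distinct nodes found walking clockwise from that position. The code below is a minimal illustration under those assumptions; the node names, the number of virtual-node tokens, and the `preference_list` helper are illustrative choices, not Dynamo's actual implementation.

```python
# Minimal sketch of Dynamo-style placement: MD5 maps a key onto a 128-bit ring,
# and the preference list is the first N distinct physical nodes clockwise from it.
# Node names, token counts, and helper names are illustrative assumptions.
import bisect
import hashlib

def ring_position(data: str) -> int:
    """128-bit position on the ring derived from an MD5 hash."""
    return int(hashlib.md5(data.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, tokens_per_node=8):
        # Each physical node claims several positions on the ring (virtual nodes).
        self.tokens = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self.positions = [pos for pos, _ in self.tokens]

    def preference_list(self, key, n=3):
        """First n distinct nodes walking clockwise from the key's position."""
        start = bisect.bisect(self.positions, ring_position(key))
        chosen = []
        for i in range(len(self.tokens)):
            node = self.tokens[(start + i) % len(self.tokens)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == n:
                break
        return chosen

ring = Ring(["A", "B", "C", "D", "E"])
print(ring.preference_list("shopping-cart:12345", n=3))
```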

  7. Versioning
  (diagram: a put operation propagating to its replicas)
  Failure-free operation vs. what to do in case of failure?

  8. Versioning
  (diagram: a put creates version v2 from v1 at each replica)
  Object content is treated as immutable; an update operation creates a new version.

  9. Versioning
  (diagram: only some replicas receive version v2)
  Versioning can lead to inconsistency due to network partitioning.

  10. Versioning
  (diagram: concurrent puts a and b produce divergent versions v2a and v2b at different replicas)
  Versioning can lead to inconsistency due to concurrent updates.

  11. Object Resolution
  • Uses vector clocks
  • Conflicting versions are passed to the application as the output of a get operation
  • The application resolves the conflict and puts a new (consistent) version
  • Inconsistent versions are rare: 99.94% of get operations saw exactly one version
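The vector-clock resolution on this slide can be sketched as follows: each version carries a map from node id to counter; one version subsumes another if every counter is greater than or equal, and two versions conflict if neither subsumes the other, in which case both are returned to the application. The `descends` and `compare` names below are illustrative, not from the paper.

```python
# Minimal sketch of vector-clock comparison for Dynamo-style object resolution.
# A clock is a dict {node_id: counter}; function names are illustrative.

def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is causally equal to or later than clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "conflict"           # neither descends from the other: surface both to the app

v1 = {"X": 1}                   # written through coordinator X
v2a = {"X": 1, "Y": 1}          # v1 updated through coordinator Y
v2b = {"X": 1, "Z": 1}          # v1 concurrently updated through coordinator Z
print(compare(v2a, v1))         # "a supersedes b" -> the old version can be discarded
print(compare(v2a, v2b))        # "conflict" -> get() returns both, application reconciles
```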

  12. Handling get/put Operations
  • Operations are handled by a coordinator:
    • the first among the top N nodes in the preference list
    • located by either
      • a call to a load balancer (no Dynamo-specific code needed in the application, but this may require an extra level of indirection), or
      • a direct call to the coordinator (via a Dynamo-specific client library)
  • Quorum voting (see the sketch below)
    • R nodes must agree to a get operation
    • W nodes must agree to a put operation
    • R + W > N
    • (N, R, W) can be chosen to achieve the desired tradeoff
    • A common configuration is (3, 2, 2)
  • "Sloppy quorum"
    • Uses the top N healthy nodes in the preference list
    • The coordinator is the first node in this group
    • A replica sent to a stand-in node carries a "hint" naming the (unavailable) node that should hold it
    • Hinted replicas are stored by the available node and forwarded when the original node recovers
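As a rough sketch of the quorum rule on this slide: with N replicas, a put waits for W acknowledgements and a get waits for R responses, and choosing R + W > N guarantees that the read and write sets overlap in at least one node. The coordinator functions below are an illustrative simplification (in-memory replicas, no failures or hinted handoff), not Dynamo's actual code.

```python
# Illustrative sketch of quorum reads/writes with N=3, R=2, W=2 (so R + W > N).
# The in-memory "replicas" and coordinator functions are simplifications for exposition.
N, R, W = 3, 2, 2
assert R + W > N, "read and write quorums must overlap in at least one replica"

replicas = [dict() for _ in range(N)]        # stand-ins for the top-N nodes of a preference list

def put(key, value, version):
    """Send the write to all N replicas; succeed once at least W acknowledge."""
    acks = 0
    for store in replicas:                   # in reality acks arrive asynchronously and some may fail
        store[key] = (version, value)
        acks += 1
    return acks >= W

def get(key):
    """Collect responses until R replicas have answered; return the newest version seen."""
    responses = [store[key] for store in replicas if key in store][:R]
    if len(responses) < R:
        raise RuntimeError("read quorum not reached")
    return max(responses)                    # highest (version, value) among the R responses

put("cart:42", ["book"], version=1)
print(get("cart:42"))
```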

  13. Replica Synchronization
  • Detection of inconsistent replicas is accelerated using Merkle trees (see the sketch below)
  • A separate tree is maintained by each node for each key range
  • Adds overhead to maintain the Merkle trees
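A minimal illustration of the Merkle-tree comparison: leaves hash individual keys, parents hash their children, and two nodes compare root hashes and only descend into subtrees whose hashes differ, so only divergent keys are exchanged. The tree layout and function names below are a toy version for exposition, not Dynamo's on-disk structure, and they assume both replicas hold the same key set.

```python
# Toy Merkle tree over a sorted key range; two replicas exchange only the hashes of
# subtrees that differ, so synchronization cost tracks the number of divergent keys.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build(items):
    """items: sorted list of (key, value). Returns (hash, key) leaves or (hash, left, right) nodes."""
    if len(items) == 1:
        key, value = items[0]
        return (h(f"{key}={value}".encode()), key)
    mid = len(items) // 2
    left, right = build(items[:mid]), build(items[mid:])
    return (h(left[0] + right[0]), left, right)

def diff(a, b, out):
    """Collect keys whose leaf hashes differ, descending only into unequal subtrees."""
    if a[0] == b[0]:
        return                              # identical subtrees: nothing to transfer
    if len(a) == 2 and len(b) == 2:         # both are leaves with differing hashes
        out.append(a[1])
        return
    diff(a[1], b[1], out)
    diff(a[2], b[2], out)

replica1 = [("k1", "a"), ("k2", "b"), ("k3", "c"), ("k4", "d")]
replica2 = [("k1", "a"), ("k2", "B"), ("k3", "c"), ("k4", "d")]
stale = []
diff(build(replica1), build(replica2), stale)
print(stale)                                # ['k2'] -> only this key needs repair
```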

  14. Ring Membership
  • Nodes are explicitly added to / removed from the ring
  • Membership, partitioning, and placement information propagates via periodic exchanges (a gossip protocol, sketched below)
  • Existing nodes transfer key ranges to a newly added node, or receive key ranges from exiting nodes
  • Nodes eventually know the key ranges of their peers and can forward requests to them
  • Some "seed" nodes are well known
  • Node failures are detected by lack of responsiveness; recovery is detected by periodic retry
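The membership propagation can be sketched as a gossip exchange: each node keeps a version-stamped view of the ring, and on each periodic contact two nodes merge their views by keeping the higher-versioned entry for each member. The view structure and helper names below are illustrative assumptions, not Dynamo's wire format.

```python
# Illustrative gossip merge of membership views: each entry is node -> (version, status).
# Periodically every node contacts one random peer and both reconcile to the newer entries.
import random

def merge(view_a: dict, view_b: dict) -> dict:
    """Reconcile two membership views, keeping the higher-versioned entry for each node."""
    merged = dict(view_a)
    for node, entry in view_b.items():
        if node not in merged or entry[0] > merged[node][0]:
            merged[node] = entry
    return merged

def gossip_round(views: dict):
    """Each node contacts one random peer; both end up with the merged view."""
    for name in list(views):
        peer = random.choice([n for n in views if n != name])
        combined = merge(views[name], views[peer])
        views[name] = dict(combined)
        views[peer] = dict(combined)

# Three nodes; only B has learned so far that D joined the ring (a higher-versioned entry).
views = {
    "A": {"A": (1, "up"), "B": (1, "up"), "C": (1, "up")},
    "B": {"A": (1, "up"), "B": (1, "up"), "C": (1, "up"), "D": (2, "joined")},
    "C": {"A": (1, "up"), "B": (1, "up"), "C": (1, "up")},
}
random.seed(1)
rounds = 0
while not all("D" in view for view in views.values()):
    gossip_round(views)
    rounds += 1
print(f"all nodes learned of D after {rounds} gossip round(s)")
```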

  15. Partition/Placement Strategies (S = number of nodes)
  • Strategy 1: T random tokens per node; partitions determined by the token values
  • Strategy 2: T random tokens per node; equal-sized partitions
  • Strategy 3: Q/S tokens per node; equal-sized partitions

  16. Strategy Performance Factors
  • Strategy 1
    • Bootstrapping of a new node is lengthy
      • It must acquire its key ranges from other nodes
      • Other nodes process the scanning/transmission of key ranges for the new node as a background activity
      • This has taken a full day during peak periods
    • Many nodes may have to adjust their Merkle trees when a node joins or leaves the system
    • The archival process is difficult
      • Key ranges may be in transit
      • There is no obvious synchronization/checkpointing structure

  17. Strategy Performance Factors
  • Strategy 2
    • Decouples partitioning and placement
    • Allows changing the placement scheme at run time
  • Strategy 3 (see the sketch below)
    • Decouples partitioning and placement
    • Faster bootstrapping/recovery and easier archiving, because key ranges can be segregated into different files that can be shared/archived separately
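Strategy 3 can be illustrated as follows: the hash space is divided into Q equal-sized partitions up front and each of the S nodes owns roughly Q/S of them, so a node joining or leaving only moves whole partitions, which can be transferred or archived as individual files. The round-robin assignment below is a simplified illustration, not the actual placement mechanism.

```python
# Illustrative sketch of Strategy 3: Q fixed, equal-sized partitions over the 128-bit
# hash space, with roughly Q/S partitions assigned to each of the S nodes.
# The round-robin assignment is a simplification for exposition.
Q = 16                                   # number of equal-sized partitions (fixed)
nodes = ["A", "B", "C", "D"]             # S = 4 nodes
SPACE = 2 ** 128                         # size of the MD5 hash space

partitions = [(i * SPACE // Q, (i + 1) * SPACE // Q) for i in range(Q)]
assignment = {i: nodes[i % len(nodes)] for i in range(Q)}   # ~Q/S partitions per node

def partition_of(position: int) -> int:
    """Map a 128-bit ring position to its fixed partition index."""
    return position * Q // SPACE

# Partition boundaries never change; membership changes only move whole partitions.
for node in nodes:
    owned = [i for i, owner in assignment.items() if owner == node]
    print(node, "owns partitions", owned)
```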

  18. Partition Strategies - Performance
  • The strategies have different tuning parameters
  • Fair comparison: evaluate the skew in their load distributions for a fixed amount of space used to maintain membership information
  • Strategy 3 is superior

  19. Client- vs. Server-Side Coordination
  • Any node can coordinate read requests; write requests are handled by the coordinator
  • The state machine for coordination can live in a load-balancing server or be incorporated into the client
  • Client-driven coordination has lower latency because it avoids an extra network hop (redirection)
