Understanding Yahoo! PNUTS: A Distributed Database System for Web Applications
This tutorial delves into Yahoo! PNUTS, a massively parallel and geographically distributed database system designed for web applications. PNUTS organizes data in hashed or ordered tables, ensuring low latency for numerous concurrent requests, such as updates and queries. It balances consistency and scalability by providing per-record consistency guarantees while allowing for eventual consistency in less critical operations. With a focus on a simplified query model and efficient tablet storage, PNUTS addresses both data partitioning and balancing, making it suitable for dynamic web environments.
Distributed Systems Tutorial 11 – Yahoo! PNUTS • Written by Alex Libov • Based on OSCON 2011 presentation • Winter semester, 2013-2014
Yahoo! PNUTS • A massively parallel and geographically distributed database system for Yahoo!’s web applications • provides data storage organized as hashed or ordered tables • low latency for large numbers of concurrent requests including updates and queries • per-record consistency guarantees
Consistency • Serializability of general transactions is inefficient and often unnecessary • If a user changes an avatar, posts new pictures, or invites several friends to connect, little harm is done if the new avatar is not initially visible to one friend • Many distributed applications go to the other extreme and provide only eventual consistency • This is too weak and inadequate for web applications • PNUTS suggests a consistency model that falls between those two extremes
SYSTEM ARCHITECTURE • Data is organized into tables of records with attributes • In addition to typical data types, “blob” is a valid data type, allowing arbitrary structures inside a record • Data tables are horizontally partitioned into groups of records called tablets • Tablets are scattered across many servers • Each server might have hundreds or thousands of tablets, but each tablet is stored on a single server within a region
Distributed Hash Table • [Figure: the hash space is divided into tablets at boundaries such as 0x0000, 0x2AF3 and 0x911F]
Distributed Hash Table • [Figure: tablets clustered by key range]
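The lookup implied by the two table layouts above can be sketched as a binary search over sorted tablet boundaries. This is a minimal illustration, not PNUTS code: `TabletMap`, `hash_key` and the storage unit names are hypothetical, and the 16-bit CRC-based hash is an assumption chosen only to match the four-digit boundaries on the slide.

```python
import bisect
import zlib


class TabletMap:
    """Maps a point to the tablet whose interval [lower, next_lower) contains it."""

    def __init__(self, boundaries, storage_units):
        # boundaries[i] is the inclusive lower bound of tablet i (sorted ascending).
        if boundaries != sorted(boundaries) or len(boundaries) != len(storage_units):
            raise ValueError("boundaries must be sorted and match storage_units")
        self.boundaries = boundaries
        self.storage_units = storage_units

    def locate(self, point):
        # bisect_right returns the position of the first boundary greater than
        # point, so the owning tablet is the one just before that position.
        index = bisect.bisect_right(self.boundaries, point) - 1
        if index < 0:
            raise KeyError(point)
        return index, self.storage_units[index]


def hash_key(key):
    # Assumption: any uniform hash would do; CRC32 truncated to 16 bits is used here.
    return zlib.crc32(key.encode("utf-8")) & 0xFFFF


# Hashed table: look up the hash of the key.
hashed = TabletMap([0x0000, 0x2AF3, 0x911F], ["su-a", "su-b", "su-c"])
tablet_for_alice = hashed.locate(hash_key("alice"))

# Ordered table: look up the key itself, so neighbouring keys share a tablet.
ordered = TabletMap(["", "grape", "peach"], ["su-a", "su-b", "su-c"])
tablet_for_banana = ordered.locate("banana")
```

The same structure serves both layouts; only the value being searched changes, which is why an ordered table can answer small range scans from one tablet while a hashed table spreads load more evenly.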
Query model • PNUTS supports only very simple queries, sacrificing a rich API in favor of response time and overall simplicity • No joins, group-by, etc. • This is stated as future work • The system is designed to work well with queries that read and write single records or small groups of records
PNUTS-Single Region • Tablet controller: a single pair of active/standby servers; maintains the map from database.table.key to tablet to storage unit • Routers: route client requests to the correct storage unit; cache the maps from the tablet controller • Storage units: store records; service get/set/delete requests
Tablet Splitting & Balancing • Each storage unit has many tablets (horizontal partitions of the table) • A storage unit may become a hotspot • Tablets may grow over time • Overfull tablets split • Shed load by moving tablets to other servers
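A rough sketch of the two operations on this slide, splitting an overfull tablet and moving a tablet off a hot storage unit. The function names, the record-count threshold, the median split point and the "move the smallest tablet" policy are all assumptions made for illustration; the slides do not say how PNUTS picks them.

```python
def split_tablet(records, max_records):
    """Split a tablet (dict of key -> value) at its median key if it is overfull."""
    if len(records) <= max_records:
        return [records]
    keys = sorted(records)
    middle = len(keys) // 2
    left = {k: records[k] for k in keys[:middle]}
    right = {k: records[k] for k in keys[middle:]}
    return [left, right]


def shed_load(units):
    """Move one tablet from the most loaded storage unit to the least loaded one.

    units maps a storage unit name to its list of tablets; load is measured
    as the total number of records (an assumed stand-in for request rate).
    Returns (source, destination), or None if nothing was moved.
    """
    def load(name):
        return sum(len(tablet) for tablet in units[name])

    hottest = max(units, key=load)
    coolest = min(units, key=load)
    if hottest == coolest or not units[hottest]:
        return None
    # Assumed policy: move the smallest tablet to limit the data copied.
    tablet = min(units[hottest], key=len)
    units[hottest].remove(tablet)
    units[coolest].append(tablet)
    return hottest, coolest


halves = split_tablet({k: k for k in range(5)}, max_records=4)
units = {"su-a": [{1: "x", 2: "y"}, {3: "z"}], "su-b": []}
moved = shed_load(units)
```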
Consistency Options (listed from most availability to most consistency) • Eventual Consistency: low latency; updates and inserts are done locally • Record Timeline Consistency: each record is assigned a “master region”; inserts succeed, but updates could fail during outages • Primary Key Constraint + Record Timeline: each tablet and record is assigned a “master region”; inserts and updates could fail during outages
Record Timeline Consistency • One of the replicas is designated as the master, per record • All updates to that record are forwarded to the master • If a replica is receiving the majority of write requests – it becomes the master • Each update advances the version of the record (a new insert starts a new generation)
Record Timeline Consistency • Transactions: Alice changes status from “Sleeping” to “Awake”; Alice changes location from “Home” to “Work” • [Figure: record timelines in Region 1 and Region 2; the states shown are (Alice, Home, Sleeping), (Alice, Home, Awake) and (Alice, Work, Awake), with the updates “Awake” and “Work” propagating between regions] • No replica should see the record as (Alice, Work, Sleeping)
API calls • Read-any: returns a possibly stale version of the record. The returned record is always a valid one from the record’s history. This call has lower latency than other read calls with stricter guarantees • Read-critical(required version): returns a version of the record that is strictly newer than, or the same as, the required version • Read-latest: returns the latest copy of the record that reflects all writes that have succeeded • Write: gives the same ACID guarantees as a transaction with a single write operation in it. This call is useful for blind writes, e.g., a user updating his status on his profile • Test-and-set-write(required version): performs the requested write to the record if and only if the present version of the record is the same as the required version
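The semantics of the five calls can be illustrated with a toy, single-process model of one record. Everything here is a sketch under stated assumptions: the `Record` class, its method names and the single lagging replica are hypothetical, versions are plain integers, and real PNUTS calls go over the network to replicas in different regions.

```python
class Record:
    """One record with a master timeline and a single, possibly stale, replica."""

    def __init__(self, value):
        self.history = [value]      # history[i] is version i as seen by the master
        self.replica_version = 0    # newest version the lagging replica has applied

    @property
    def latest_version(self):
        return len(self.history) - 1

    def replicate(self):
        """Simulate asynchronous propagation catching the replica up."""
        self.replica_version = self.latest_version

    def read_any(self):
        # May be stale, but is always a real version from the record's history.
        version = self.replica_version
        return version, self.history[version]

    def read_critical(self, required_version):
        if required_version > self.latest_version:
            raise ValueError("required version does not exist yet")
        # Serve locally if the replica is new enough, otherwise ask the master.
        if self.replica_version >= required_version:
            return self.read_any()
        return self.read_latest()

    def read_latest(self):
        version = self.latest_version
        return version, self.history[version]

    def write(self, value):
        # Blind write: always succeeds and advances the version.
        self.history.append(value)
        return self.latest_version

    def test_and_set_write(self, required_version, value):
        # Write only if nobody else has written since required_version was read.
        if self.latest_version != required_version:
            return None
        return self.write(value)


status = Record("Sleeping")
v1 = status.write("Awake")                       # version 1 at the master
stale = status.read_any()                        # replica still has version 0
fresh = status.read_critical(v1)                 # forced to see version 1 or newer
rejected = status.test_and_set_write(0, "Busy")  # version 0 is no longer current
```

Test-and-set-write is what lets a client implement a read-modify-write cycle safely: read a version, compute the new value, and retry if the write is rejected.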
Eventual Consistency • Timeline consistency comes at a price • Writes not originating in the record’s master region are forwarded to the master and have longer latency • The mastership of a record can migrate between replicas • When the master region is down, the record is unavailable for writes • Eventual consistency mode • On conflict, the latest write per field wins • Target customers • Those that externally guarantee no conflicts • Those that understand the conflicts and can cope with them
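The "latest write per field wins" rule can be sketched as a merge of two replica copies of the same record. This is an illustration only: the function name, the representation of a field as a `(timestamp, value)` pair, and the tie-break on the value when timestamps are equal are assumptions, not details given in the slides.

```python
def merge_latest_per_field(a, b):
    """Merge two copies of a record; each maps field -> (timestamp, value).

    For every field the entry with the larger timestamp wins. On equal
    timestamps the larger value wins (assumed tie-break, which requires
    comparable values) so that the merge does not depend on argument order.
    """
    merged = dict(a)
    for field, entry in b.items():
        if field not in merged or entry > merged[field]:
            merged[field] = entry
    return merged


# Two regions accepted different local updates to Alice's record.
region_1 = {"status": (2, "Awake"), "location": (1, "Home")}
region_2 = {"status": (1, "Sleeping"), "location": (3, "Work")}
converged = merge_latest_per_field(region_1, region_2)
```

Because the merge works field by field, both regions converge to the same record, but an intermediate state such as (Alice, Work, Sleeping) can be visible in this mode, which is the anomaly timeline consistency rules out.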
Yahoo! Message Broker (YMB) • A topic-based publish/subscribe system • Data updates are considered “committed” when they have been published to YMB • At some point after being committed, the update is asynchronously propagated to the other regions and applied to their replicas • YMB guarantees that published messages will be delivered to all topic subscribers, even in the presence of single broker machine failures, by logging each message to multiple disks on different servers • Two copies are logged initially, and more copies are logged as the message propagates • The message is not purged from the YMB log until PNUTS has verified that the update is applied to all replicas of the database • YMB provides partial ordering of published messages • Messages published to a particular YMB cluster will be delivered to all subscribers in the order they were published
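Two of the properties above, in-order delivery within one cluster and purging only after every replica has applied the update, can be modelled with a small in-memory broker. This is a sketch and not the YMB interface: the `Broker` class, its methods and the region names are hypothetical, and the multi-disk logging that gives YMB its durability is not modelled.

```python
class Broker:
    """Toy single-cluster broker: ordered delivery, purge after all acknowledge."""

    def __init__(self, subscribers):
        self.subscribers = list(subscribers)
        self.log = {}        # message id -> payload, kept until fully acknowledged
        self.pending = {}    # message id -> subscribers that have not acknowledged
        self.inbox = {name: [] for name in self.subscribers}
        self.next_id = 0

    def publish(self, payload):
        # Once the message is in the log, the update counts as committed.
        message_id = self.next_id
        self.next_id += 1
        self.log[message_id] = payload
        self.pending[message_id] = set(self.subscribers)
        # Every subscriber receives messages in publication order.
        for name in self.subscribers:
            self.inbox[name].append((message_id, payload))
        return message_id

    def acknowledge(self, subscriber, message_id):
        waiting = self.pending.get(message_id)
        if waiting is None:
            return  # already purged
        waiting.discard(subscriber)
        if not waiting:
            # All replicas applied the update, so the log entry can go.
            del self.pending[message_id]
            del self.log[message_id]


broker = Broker(["region-west", "region-east"])
first = broker.publish({"alice": {"status": "Awake"}})
second = broker.publish({"alice": {"location": "Work"}})
broker.acknowledge("region-west", first)
still_logged = first in broker.log      # region-east has not applied it yet
broker.acknowledge("region-east", first)
purged = first not in broker.log
```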
Recovery • Recovering from a failure involves copying lost tablets from another replica • A three-step process: (1) The tablet controller requests a copy from a particular remote replica (the “source tablet”) (2) A “checkpoint message” is published to YMB, to ensure that any in-flight updates at the time the copy is initiated are applied to the source tablet (3) The source tablet is copied to the destination region • To support this recovery protocol, tablet boundaries are kept synchronized across replicas, and tablet splits are conducted by having all regions split a tablet at the same point, coordinated by a two-phase commit between regions
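The role of the checkpoint message in the three steps above can be shown with a simplified, single-process sketch. The names `CHECKPOINT` and `copy_tablet` are hypothetical, updates are modelled as `(key, value)` pairs, and the message queue stands in for the in-flight YMB messages addressed to the source tablet.

```python
CHECKPOINT = object()  # sentinel standing in for the "checkpoint message"


def copy_tablet(source_tablet, in_flight_updates):
    """Return a copy of source_tablet that includes every in-flight update.

    Step 1 (the tablet controller choosing this source) is implied by the call.
    """
    # Step 2: publish the checkpoint behind the updates already in flight.
    queue = list(in_flight_updates) + [CHECKPOINT]
    for message in queue:
        if message is CHECKPOINT:
            break  # everything published before the checkpoint is now applied
        key, value = message
        source_tablet[key] = value
    # Step 3: copy the source tablet to the destination region.
    return dict(source_tablet)


source = {"alice": "Home"}
recovered = copy_tablet(source, [("alice", "Work"), ("bob", "Home")])
```

Because messages in one cluster are delivered in publication order, seeing the checkpoint is enough to know that no earlier update is missing from the copy.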
For more info http://www.mpi-sws.org/~druschel/courses/ds/papers/cooper-pnuts.pdf