Sinfonia: A New Paradigm for Building Scalable Distributed Systems

Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis • Presented by Jay Chen, 9/19/07

Problem • Building (scalable) distributed systems is hard • Specifically, sharing data via message passing is error-prone • Distributed state protocols must be developed for: • Replication • File data and metadata management • Cache consistency • Group membership

Goals • Want to build infrastructure applications such as cluster file systems, lock managers, and group communication services • Want shared application data that is fault-tolerant, scalable, and consistent • Want to make building these applications easier

Solution • Change the paradigm for building scalable distributed systems • Transform the problem from designing message-passing protocols to designing and manipulating data structures • Export a minitransaction primitive that atomically accesses and conditionally modifies data at multiple nodes

Design Principles • Principle 1: Reduce operation coupling to obtain scalability • Sinfonia does this by not imposing structure on the data it services • Principle 2: Make components reliable before scaling them • Individual Sinfonia nodes are fault-tolerant

Components • Memory nodes – hold application data, either in RAM or on stable storage • User library – runs on application nodes • Memory nodes and application nodes are logically distinct, but may run on the same machine • Each memory node exposes a linear address space; data is referenced via (memory-node-id, address) pairs

Minitransactions • In a conventional distributed transaction, the coordinator asks participants to perform one or more actions • At the end of the transaction, the coordinator executes two-phase commit • Sinfonia piggybacks the transaction's actions onto the two-phase commit protocol itself • Guarantees: • Atomicity – a minitransaction executes completely or not at all • Consistency – data is not corrupted • Isolation – minitransactions are serializable • Durability – committed minitransactions are not lost despite failures
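The piggybacking idea can be sketched in a few lines of Python. This is a toy, single-process illustration under stated assumptions, not Sinfonia's implementation: the names Participant, prepare, finish, and run_minitransaction are hypothetical, each address holds one whole value, and there is no logging, networking, or failure handling. The point it shows is that the coordinator ships all of a participant's items with the phase-one message, so the vote already reflects the outcome of the compare step.

```python
class Participant:
    """Hypothetical stand-in for one memory node."""

    def __init__(self):
        self.data = {}      # address -> value
        self.locked = set() # addresses held by prepared transactions
        self.staged = {}    # tid -> (locked addresses, pending writes)

    def prepare(self, tid, compares, reads, writes):
        """Phase one: lock, check compare items, read, stage writes, vote."""
        addrs = set(compares) | set(reads) | set(writes)
        if addrs & self.locked:
            return False, {}  # vote abort: an address is busy
        if any(self.data.get(a) != v for a, v in compares.items()):
            return False, {}  # vote abort: a comparison failed
        self.locked |= addrs
        self.staged[tid] = (addrs, dict(writes))
        return True, {a: self.data.get(a) for a in reads}

    def finish(self, tid, commit):
        """Phase two: apply staged writes on commit, then release locks."""
        if tid not in self.staged:
            return  # this node voted abort, nothing to undo
        addrs, writes = self.staged.pop(tid)
        if commit:
            self.data.update(writes)
        self.locked -= addrs


def run_minitransaction(tid, participants, items):
    """items maps node id -> (compares dict, reads list, writes dict)."""
    votes, results = [], {}
    for node_id, (compares, reads, writes) in items.items():
        ok, values = participants[node_id].prepare(tid, compares, reads, writes)
        votes.append(ok)
        results[node_id] = values
    commit = all(votes)
    for node_id in items:
        participants[node_id].finish(tid, commit)
    return commit, (results if commit else {})
```

In this sketch a minitransaction that compares a value on one node and writes on another either commits at both nodes or leaves both untouched, which is the atomicity guarantee listed above.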

Minitransaction Details • A minitransaction contains • Compare items (checked for equality) • Read items • Write items (applied only if all compare items match) • Minitransactions are expressive enough to implement useful primitives • Swap – read item returns old value and write item replaces it • Compare and swap • Atomic read of many data items • Acquire a lease • Acquire multiple leases atomically • Change data if lease is held • Application uses the user library to communicate with memory nodes through RPCs • Minitransactions are implemented on top of this
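As a rough illustration of how the three item types compose into the primitives listed above, here is a minimal Python sketch. It assumes a hypothetical in-process ToyStore in place of real memory nodes, and it simplifies items so that each (memory-node-id, address) pair holds one whole bytes value; Minitransaction, exec_and_commit, compare_and_swap, and swap are illustrative names, not the library's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# An address is a (memory_node_id, offset) pair, as on the Components slide.
Addr = Tuple[int, int]


@dataclass
class Minitransaction:
    compare_items: List[Tuple[Addr, bytes]] = field(default_factory=list)
    read_items: List[Addr] = field(default_factory=list)
    write_items: List[Tuple[Addr, bytes]] = field(default_factory=list)


class ToyStore:
    """Hypothetical single-process stand-in for a set of memory nodes."""

    def __init__(self):
        self.cells: Dict[Addr, bytes] = {}

    def exec_and_commit(self, t: Minitransaction):
        # If any compare item mismatches, abort without reading or writing.
        for addr, expected in t.compare_items:
            if self.cells.get(addr, b"") != expected:
                return False, {}
        # Reads observe the state before this minitransaction's writes.
        reads = {addr: self.cells.get(addr, b"") for addr in t.read_items}
        for addr, data in t.write_items:
            self.cells[addr] = data
        return True, reads


def compare_and_swap(store, addr, old, new):
    """One compare item plus one write item at the same address."""
    t = Minitransaction(compare_items=[(addr, old)], write_items=[(addr, new)])
    committed, _ = store.exec_and_commit(t)
    return committed


def swap(store, addr, new):
    """One read item returns the old value; one write item replaces it."""
    t = Minitransaction(read_items=[addr], write_items=[(addr, new)])
    _, reads = store.exec_and_commit(t)
    return reads[addr]
```

Under the same assumptions, acquiring a lease could be expressed as a compare-and-swap on a lease cell (compare against an "unowned" value, write the new owner), and acquiring several leases atomically as one minitransaction with several compare and write items.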

Various Implementation Details and Optimizations • Fault tolerance – transparent recovery from: • Coordinator crashes – dedicated recovery coordinator node • Participant crashes – redo logs, decided lists • Complete system crashes – replay logs and vote • Log garbage collection • Read-only minitransactions are not logged • Consistent backups – via locked disk snapshots • Replication – primary-copy replication scheme

Application: Cluster File System • NFS v2 interface for cluster file system • Superblock – global info • Inodes keep file attributes • Data blocks, 16 KB each • Free-block bitmap • Chaining-list blocks – indicate which blocks belong to a file • Each NFS function is implemented with a single minitransaction

Application: Group Communication • Service ensures that all members receive the same messages in the same order • Instead of ensuring total order via token-ring schemes, each member has a dedicated queue stored on a memory node • Messages are threaded together with “next” pointers to create a global list • Each message is given a global sequence number (GSN) once threaded • Writers write to their own queue and update their lastThreaded value instead of updating a global tail pointer • To find the global tail, members read all the lastThreaded values and pick the message with the highest GSN • Readers keep a pointer to the latest message received, and follow “next” pointers to retrieve further messages
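The queue-and-threading scheme above can be sketched as follows. This is a single-threaded toy under stated assumptions: ToyGroup, Message, broadcast, global_tail, and read_after are hypothetical names, queues live in process memory rather than on memory nodes, and the threading step is not protected against concurrent writers (a real system would need to make it atomic, presumably with a minitransaction).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    sender: str
    payload: str
    gsn: Optional[int] = None          # assigned once the message is threaded
    next: Optional["Message"] = None   # "next" pointer in the global list


class ToyGroup:
    def __init__(self, members):
        # One dedicated queue and one lastThreaded slot per member.
        self.queues = {m: [] for m in members}
        self.last_threaded = {m: None for m in members}

    def global_tail(self):
        """Read every lastThreaded value and pick the highest GSN."""
        threaded = [m for m in self.last_threaded.values() if m is not None]
        return max(threaded, key=lambda m: m.gsn, default=None)

    def broadcast(self, sender, payload):
        msg = Message(sender, payload)
        self.queues[sender].append(msg)    # write to the sender's own queue
        tail = self.global_tail()
        msg.gsn = 0 if tail is None else tail.gsn + 1
        if tail is not None:
            tail.next = msg                # thread onto the global list
        self.last_threaded[sender] = msg   # no shared tail pointer to update
        return msg


def read_after(last_seen):
    """Reader side: follow "next" pointers from the last message received."""
    out = []
    cur = last_seen.next
    while cur is not None:
        out.append(cur.payload)
        cur = cur.next
    return out
```

Because each writer updates only its own lastThreaded slot, writers do not all contend on a single tail pointer; the cost is that finding the tail requires reading one value per member.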

Costs and Considerations • The evaluation shows that the system does not scale well when minitransactions are spread across many memory nodes or when contention is high • It is the application writer’s job to consider node locality during application design (data accessed together should be on the same node) • This is in contrast to data striping, which is argued to improve single-user throughput but reduce scalability • Load migration is also the application’s responsibility • Evaluations focus on throughput; few measure latency • Latency seems fairly important for group communication systems

Discuss
