Sinfonia proposes a paradigm shift for building scalable distributed systems, focusing on data structure design and manipulation rather than message passing protocols. It introduces minitransactions for atomic data access and conditional modifications across nodes, ensuring fault tolerance, scalability, and consistency. This innovative approach simplifies the development of infrastructure applications like cluster file systems and group communication services. Sinfonia's design principles prioritize reducing operation coupling and ensuring reliability before scaling components, leading to improved system performance and flexibility.
Sinfonia: A New Paradigm for Building Scalable Distributed Systems
Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis
Presented by Jay Chen, 9/19/07
Problem
• Building (scalable) distributed systems is hard
• Specifically, sharing data via message passing is error prone
• Distributed state protocols must be developed for:
  • Replication
  • File data and metadata management
  • Cache consistency
  • Group membership
Goals
• Want to build infrastructure applications such as cluster file systems, lock managers, and group communication services
• Want shared application data that is fault-tolerant, scalable, and consistent
• Want to make building these applications easier
Solution
• Change the paradigm for building scalable distributed systems
• Transform the problem from message passing protocols to data structure design and manipulation
• Export a minitransaction primitive that atomically accesses, and conditionally modifies, data at multiple nodes (sketched below)
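To make the primitive concrete, here is a minimal sketch in Go of what a minitransaction might carry; the type and field names are hypothetical illustrations, not the actual Sinfonia user library:

```go
package sketch

// A minitransaction bundles compare, read, and write items. Each
// item names a memory node, an address within it, and (for compare
// and write items) the bytes involved. These names are illustrative
// assumptions, not Sinfonia's real API.

// Item identifies a byte range on one memory node; len(Data)
// gives the range's length.
type Item struct {
	MemNode int    // memory node id
	Addr    uint64 // offset in that node's linear address space
	Data    []byte // expected bytes (compare) or new bytes (write)
}

// Minitransaction applies its write items and returns its read
// items only if every compare item matches the current contents
// at its location; otherwise it aborts.
type Minitransaction struct {
	Compares []Item
	Reads    []Item
	Writes   []Item
}
```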
Design Principles
• Principle 1: Reduce operation coupling to obtain scalability
  • Sinfonia does this by not imposing structure on the data it serves
• Principle 2: Make components reliable before scaling them
  • Individual Sinfonia nodes are fault-tolerant
Components
• Memory nodes – hold application data, either in RAM or on stable storage
• User library – runs on application nodes
• Memory nodes and application nodes are logically distinct, but may run on the same machine
• Linear address space referenced via (memory-node-id, address) pairs
Minitransactions
• A coordinator executes a transaction by asking participants to perform one or more actions
• At the end of the transaction the coordinator runs two-phase commit
• Sinfonia piggybacks the minitransaction on the two-phase commit protocol itself, so it completes within the commit protocol's two rounds (sketched below)
• Guarantees (ACID):
  • Atomicity – the minitransaction executes completely or not at all
  • Consistency – data is not corrupted
  • Isolation – minitransactions are serializable
  • Durability – minitransactions are not lost even given failures
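A rough sketch of the coordinator side, assuming participants lock the touched locations, check the compare items, and vote; the `Participant` interface and names are stand-ins, and the real protocol sends prepares in parallel rather than serially:

```go
package sketch

// Coordinator side of Sinfonia's piggybacked two-phase commit,
// sketched. Round 1 carries the minitransaction's items along with
// the prepare; round 2 carries only the commit/abort decision.

type Vote int

const (
	VoteCommit Vote = iota
	VoteAbort
)

// Minitransaction's compare/read/write items are elided here;
// see the earlier sketch.
type Minitransaction struct{}

// Participant abstracts one memory node's role in the protocol.
type Participant interface {
	// Prepare locks the transaction's locations on this node,
	// evaluates its compare items, logs its write items, and votes.
	Prepare(tx *Minitransaction) Vote
	// Commit applies the logged writes and releases the locks.
	Commit()
	// Abort discards the logged writes and releases the locks.
	Abort()
}

// runMinitransaction executes the whole minitransaction inside the
// two rounds of two-phase commit and reports whether it committed.
func runMinitransaction(tx *Minitransaction, parts []Participant) bool {
	var prepared []Participant
	for _, p := range parts {
		if p.Prepare(tx) != VoteCommit {
			// a participant that votes abort releases its own state;
			// roll back the ones that had already prepared
			for _, q := range prepared {
				q.Abort()
			}
			return false
		}
		prepared = append(prepared, p)
	}
	for _, p := range parts {
		p.Commit()
	}
	return true
}
```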
Minitransaction Details
• A minitransaction contains:
  • Compare items – locations whose contents must equal supplied values for the transaction to commit
  • Read items – locations whose contents are returned to the application
  • Write items – locations to be updated if all comparisons succeed
• Minitransactions are expressive enough to implement powerful primitives (sketched below):
  • Swap – a read item returns the old value and a write item replaces it
  • Compare-and-swap
  • Atomic read of many data items
  • Acquire a lease
  • Acquire multiple leases atomically
  • Change data only if a lease is held
• The application uses the user library to communicate with memory nodes through RPCs; minitransactions are implemented on top of this
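For example, compare-and-swap and lease acquisition fall out directly from compare and write items. A hedged sketch in Go, reusing the illustrative types from above (the constructors and byte layout are assumptions, not the library's API):

```go
package sketch

import "encoding/binary"

// How compare/read/write items compose into higher-level
// primitives. Layouts and names here are illustrative guesses.

type Item struct {
	MemNode int
	Addr    uint64
	Data    []byte
}

type Minitransaction struct {
	Compares, Reads, Writes []Item
}

// compareAndSwap builds a minitransaction that installs newVal at
// (node, addr) only if that word currently holds oldVal.
func compareAndSwap(node int, addr uint64, oldVal, newVal uint64) *Minitransaction {
	o := make([]byte, 8)
	n := make([]byte, 8)
	binary.LittleEndian.PutUint64(o, oldVal)
	binary.LittleEndian.PutUint64(n, newVal)
	return &Minitransaction{
		Compares: []Item{{MemNode: node, Addr: addr, Data: o}},
		Writes:   []Item{{MemNode: node, Addr: addr, Data: n}},
	}
}

// acquireLease grabs a lease word atomically: the minitransaction
// commits only if the lease is currently free (zero), so at most
// one caller can win. Acquiring several leases at once is the same
// trick with one compare/write pair per lease.
func acquireLease(node int, leaseAddr uint64, owner uint64) *Minitransaction {
	return compareAndSwap(node, leaseAddr, 0, owner)
}
```

A failed compare simply aborts the minitransaction, so the caller refreshes its view and retries; that retry loop is what makes these primitives safe under concurrency.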
Various Implementation Details and Optimizations
• Fault tolerance – transparent recovery from:
  • Coordinator crashes – a dedicated recovery coordinator node
  • Participant crashes – redo logs and decided lists (sketched below)
  • Complete system crashes – replay logs and vote
• Log garbage collection
• Read-only minitransactions are not logged
• Consistent backups – via locked disk snapshots
• Replication – primary-copy replication scheme
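As a rough illustration of the participant-side logging that recovery relies on, here is a hedged sketch; the record layout, field names, and recovery loop are guesses for exposition, not Sinfonia's actual on-disk format or protocol:

```go
package sketch

// Participant-side redo logging, sketched. At prepare time a
// participant forces the transaction's write items to its redo log
// before voting commit; the commit/abort outcome later lands in the
// decided list.

type TxID uint64

type Write struct {
	Addr uint64
	Data []byte
}

type LogRecord struct {
	ID     TxID
	Writes []Write // write items staged at prepare time
}

type Participant struct {
	redoLog []LogRecord   // forced to stable storage before voting
	decided map[TxID]bool // true = committed, false = aborted
	memory  map[uint64][]byte
}

// recoverState replays the redo log after a crash: committed
// transactions are reapplied, aborted ones are skipped, and
// undecided ones would be resolved by consulting the other
// participants' votes (elided here).
func (p *Participant) recoverState() {
	for _, rec := range p.redoLog {
		committed, known := p.decided[rec.ID]
		if !known {
			continue // run the vote/recovery protocol for this id
		}
		if committed {
			for _, w := range rec.Writes {
				p.memory[w.Addr] = w.Data
			}
		}
	}
}
```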
Application: Cluster File System
• Exports an NFS v2 interface for a cluster file system
• Superblock – global info
• Inodes – keep file attributes
• Data blocks – 16 KB each
• Free-block bitmap
• Chaining-list blocks – indicate which blocks belong to a file
• Every NFS function is implemented with a single minitransaction (example below)
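As one illustration, a setattr-style update can use the client's cached inode as the compare item, so the write commits only if the inode is unchanged since it was cached; the layout and names below are invented for the example, not SinfoniaFS's real format:

```go
package sketch

// Sketch: an NFS setattr as one minitransaction. On a compare
// mismatch the minitransaction aborts, and the client refetches the
// inode and retries, which keeps clients' caches consistent without
// a separate locking protocol.

type Item struct {
	MemNode int
	Addr    uint64
	Data    []byte
}

type Minitransaction struct {
	Compares, Reads, Writes []Item
}

// setattrTx validates the cached inode bytes and installs the
// updated attributes in a single atomic step.
func setattrTx(node int, inodeAddr uint64, cachedInode, newInode []byte) *Minitransaction {
	return &Minitransaction{
		// commit only if the inode still matches what we cached
		Compares: []Item{{MemNode: node, Addr: inodeAddr, Data: cachedInode}},
		// write the inode with its new attributes
		Writes: []Item{{MemNode: node, Addr: inodeAddr, Data: newInode}},
	}
}
```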
Application: Group Communication
• The service ensures that all members receive the same messages in the same order
• Instead of ensuring total order via token-ring schemes, each member has a dedicated queue stored on a memory node
• Messages are threaded together with "next" pointers to create a global list
• Each message is given a global sequence number (GSN) once threaded
• Writers write to their own queue and update their lastThreaded value instead of updating a global tail pointer
• To find the global tail, members read all the lastThreaded values and take the message with the highest GSN
• Readers keep a pointer to the latest message received and follow "next" pointers to retrieve further messages (sketched below)
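The threading scheme can be sketched as follows. In Sinfonia the linking step is a single minitransaction (compare that the chosen tail is still the tail, then write the next pointer, the GSN, and the writer's lastThreaded value); this Go sketch models it with plain in-memory updates, and the names are illustrative:

```go
package main

import "fmt"

// Sketch of queue threading: each member owns a queue, and
// threading a message links it behind the current global tail and
// stamps it with the next global sequence number (GSN).

type Msg struct {
	GSN  uint64
	Body string
	Next *Msg
}

type Member struct {
	LastThreaded *Msg // newest message this member has threaded
}

// globalTail scans every member's lastThreaded value and returns
// the message with the highest GSN: the tail of the global list.
func globalTail(members []*Member) *Msg {
	var tail *Msg
	for _, m := range members {
		if m.LastThreaded != nil && (tail == nil || m.LastThreaded.GSN > tail.GSN) {
			tail = m.LastThreaded
		}
	}
	return tail
}

// thread links msg behind the current tail and updates the writer's
// lastThreaded value; Sinfonia performs this step atomically.
func thread(members []*Member, writer *Member, msg *Msg) {
	if tail := globalTail(members); tail != nil {
		msg.GSN = tail.GSN + 1
		tail.Next = msg
	} else {
		msg.GSN = 1
	}
	writer.LastThreaded = msg
}

func main() {
	a, b := &Member{}, &Member{}
	members := []*Member{a, b}
	thread(members, a, &Msg{Body: "hello"})
	thread(members, b, &Msg{Body: "world"})
	fmt.Println("tail GSN:", globalTail(members).GSN) // 2: "world" follows "hello"
}
```

The point of lastThreaded is that writers never contend on a single global tail pointer; readers reconstruct the total order by following "next" pointers from their last-read position.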
Costs and Considerations
• The evaluation shows that the system does not scale when data is spread across nodes or accessed under contention
• It is the application writer's job to consider node locality during application design: data accessed together should live on the same node
• This contrasts with data striping, which the authors argue improves single-user throughput but reduces scalability
• Load migration is also an application's responsibility
• The evaluations focus on data throughput, with few measurements of latency
  • Latency seems fairly important for group communication systems