Sinfonia: A New Paradigm for Building Scalable Distributed Systems

Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis. Presented by Jay Chen, 9/19/07.

Presentation Transcript

  1. Sinfonia: A New Paradigm for Building Scalable Distributed Systems
     Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis
     Presented by Jay Chen, 9/19/07

  2. Problem
     • Building (scalable) distributed systems is hard
     • Specifically, sharing data via message passing is error-prone
     • Distributed state protocols must be developed for:
       • Replication
       • File data and metadata management
       • Cache consistency
       • Group membership

  3. Goals
     • Want to build infrastructure applications such as cluster file systems, lock managers, and group communication services
     • Want shared application data that is fault-tolerant, scalable, and consistent
     • Want to make building these applications easier

  4. Solution
     • Change the paradigm for building scalable distributed systems
     • Transform the problem from designing message-passing protocols to designing and manipulating data structures
     • Export a minitransaction primitive that atomically accesses and conditionally modifies data at multiple nodes

  5. Design Principles
     • Principle 1: Reduce operation coupling to obtain scalability
       • Sinfonia does this by not imposing structure on the data it services
     • Principle 2: Make components reliable before scaling them
       • Individual Sinfonia nodes are fault-tolerant

  6. Components
     • Memory nodes – hold application data, either in RAM or on stable storage
     • User library – runs on application nodes
     • Memory nodes and application nodes are logically distinct, but may run on the same machine
     • Linear address space referenced via (memory-node-id, address) pairs
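
The addressing scheme above can be sketched as follows. This is a minimal in-process illustration, not Sinfonia's API: the names `MemoryNode`, `read_at`, and `write_at`, the node count, and the 1 KB address space are all assumptions made for the example.

```python
class MemoryNode:
    """Hypothetical memory node: a flat, byte-addressable space with no imposed structure."""

    def __init__(self, node_id, size=1024):
        self.node_id = node_id
        self.data = bytearray(size)  # stands in for RAM or stable storage

    def read(self, addr, length):
        return bytes(self.data[addr:addr + length])

    def write(self, addr, payload):
        self.data[addr:addr + len(payload)] = payload


# Three memory nodes; the count is arbitrary for the sketch.
nodes = {i: MemoryNode(i) for i in range(3)}


def read_at(node_id, addr, length):
    """Read bytes located by a (memory-node-id, address) pair."""
    return nodes[node_id].read(addr, length)


def write_at(node_id, addr, payload):
    """Write bytes located by a (memory-node-id, address) pair."""
    nodes[node_id].write(addr, payload)


write_at(2, 16, b"hello")
greeting = read_at(2, 16, 5)
```

The same address on a different node refers to unrelated data, which is why the pair, not the address alone, identifies an item.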

  7. Minitransactions
     • Coordinator executes a transaction by asking participants to perform one or more actions
     • At the end of the transaction the coordinator executes two-phase commit
     • Sinfonia piggybacks minitransactions onto the two-phase commit protocol
     • Guarantees:
       • Atomicity – minitransaction executes completely or not at all
       • Consistency – data is not corrupted
       • Isolation – minitransactions are serializable
       • Durability – minitransactions are not lost even given failures
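
The coordinator/participant flow above can be sketched as a plain two-phase commit. This is a simplified in-process illustration under stated assumptions: `Participant`, `run_transaction`, and the per-key lock table are hypothetical names, and logging, recovery, and networking are left out. It is not Sinfonia's implementation.

```python
class Participant:
    """Hypothetical participant holding a key-value store."""

    def __init__(self):
        self.store = {}
        self.locks = {}   # key -> id of the transaction holding the lock
        self.staged = {}  # transaction id -> writes awaiting the decision

    def prepare(self, txn, writes):
        # Phase 1: vote "no" if any key is locked by another transaction.
        if any(key in self.locks for key in writes):
            return False
        for key in writes:
            self.locks[key] = txn
        self.staged[txn] = dict(writes)
        return True

    def finish(self, txn, commit):
        # Phase 2: apply or discard the staged writes, then release locks.
        writes = self.staged.pop(txn, {})
        if commit:
            self.store.update(writes)
        for key in writes:
            del self.locks[key]


def run_transaction(txn, writes_by_participant):
    """Coordinator: commit only if every participant votes yes."""
    votes = [p.prepare(txn, w) for p, w in writes_by_participant]
    decision = all(votes)
    for p, _ in writes_by_participant:
        p.finish(txn, decision)
    return decision


a, b = Participant(), Participant()
ok = run_transaction("t1", [(a, {"x": 1}), (b, {"y": 2})])
```

If any participant votes no, the writes staged at the others are dropped, so the transaction takes effect everywhere or nowhere.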

  8. Minitransaction Details
     • A minitransaction contains:
       • Compare items
       • Read items
       • Write items
     • Minitransactions are expressive enough to implement useful primitives:
       • Swap – read item returns old value and write item replaces it
       • Compare and swap
       • Atomic read of many data items
       • Acquire a lease
       • Acquire multiple leases atomically
       • Change data if lease is held
     • Application uses the user library to communicate with memory nodes through RPCs
     • Minitransactions are implemented on top of this
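
A rough sketch of how compare, read, and write items combine into the primitives listed above. It assumes a single-process stand-in for the memory nodes; `Minitransaction`, `exec_and_commit`, `compare_and_swap`, and `swap` are illustrative names, not the library's actual interface.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for two memory nodes: node id -> address space.
memory = {0: bytearray(64), 1: bytearray(64)}


@dataclass
class Minitransaction:
    compare_items: list = field(default_factory=list)  # (node, addr, expected bytes)
    read_items: list = field(default_factory=list)     # (node, addr, length)
    write_items: list = field(default_factory=list)    # (node, addr, data)

    def exec_and_commit(self):
        """Return (committed, read results); abort if any compare item fails."""
        for node, addr, expected in self.compare_items:
            if bytes(memory[node][addr:addr + len(expected)]) != expected:
                return False, []
        # Reads happen before writes, so a read item sees the old value.
        reads = [bytes(memory[n][a:a + length]) for n, a, length in self.read_items]
        for node, addr, data in self.write_items:
            memory[node][addr:addr + len(data)] = data
        return True, reads


def compare_and_swap(node, addr, old, new):
    """Write `new` only if the location currently holds `old`."""
    t = Minitransaction(compare_items=[(node, addr, old)],
                        write_items=[(node, addr, new)])
    return t.exec_and_commit()[0]


def swap(node, addr, new):
    """Replace the value at a location and return the previous one."""
    t = Minitransaction(read_items=[(node, addr, len(new))],
                        write_items=[(node, addr, new)])
    return t.exec_and_commit()[1][0]


won = compare_and_swap(0, 0, b"\x00\x00", b"ab")
old = swap(0, 0, b"xy")
```

Because the compare items may sit on a different node than the write items, a failed comparison on one node prevents the write on another.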

  9. Various Implementation Details and Optimizations
     • Fault tolerance – transparent recovery from:
       • Coordinator crashes – dedicated recovery coordinator node
       • Participant crashes – redo logs, decided lists
       • Complete system crashes – replay logs and vote
     • Log garbage collection
     • Read-only minitransactions are not logged
     • Consistent backups – via locked disk snapshots
     • Replication – primary copy replication scheme

  10. Application: Cluster File System
      • NFS v2 interface for cluster file system
      • Superblock – global info
      • Inodes keep file attributes
      • Data blocks are 16 KB each
      • Free-block bitmap
      • Chaining-list blocks – indicate which blocks belong to a file
      • Each NFS function is implemented with a single minitransaction

  11. Application: Group Communication
      • Service ensures that all members receive the same messages and in the same order
      • Instead of ensuring total order via token ring schemes, each member has a dedicated queue stored on a memory node
      • Messages are threaded together with “next” pointers to create a global list
      • Each message is given a global sequence number (GSN) once threaded
      • Writers write to their queue and update their lastThreaded value instead of updating a global tail pointer
      • To find the global tail, members can read all the lastThreaded values and find the message with the highest GSN
      • Readers keep a pointer to the latest message received, and follow “next” pointers to retrieve further messages
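
The threading scheme above can be sketched in a single process. Everything here is an assumption made for illustration: `Message`, `Member`, `join`, `broadcast`, and `receive` are hypothetical names, the sentinel `head` message is a convenience, and the threading step that would need a minitransaction in a real deployment is an ordinary assignment.

```python
class Message:
    def __init__(self, sender, body):
        self.sender, self.body = sender, body
        self.gsn = None   # global sequence number, set once threaded
        self.next = None  # "next" pointer in the global list


head = Message(None, None)  # sentinel that starts the global list
head.gsn = 0


class Member:
    def __init__(self, name):
        self.name = name
        self.queue = []            # dedicated per-member queue
        self.last_threaded = None  # this member's lastThreaded value
        self.cursor = head         # latest message received


members = {}


def join(name):
    members[name] = Member(name)
    return members[name]


def global_tail():
    """Read every lastThreaded value and pick the message with the highest GSN."""
    threaded = [m.last_threaded for m in members.values() if m.last_threaded]
    return max(threaded, key=lambda msg: msg.gsn, default=head)


def broadcast(member, body):
    msg = Message(member.name, body)
    member.queue.append(msg)  # write to own queue first
    tail = global_tail()
    tail.next = msg           # thread the message onto the global list
    msg.gsn = tail.gsn + 1
    member.last_threaded = msg  # no shared tail pointer is updated
    return msg.gsn


def receive(member):
    """Follow "next" pointers from the last message this member received."""
    delivered = []
    while member.cursor.next is not None:
        member.cursor = member.cursor.next
        delivered.append((member.cursor.gsn, member.cursor.body))
    return delivered


alice, bob = join("alice"), join("bob")
broadcast(alice, "m1")
broadcast(bob, "m2")
broadcast(alice, "m3")
delivered = receive(bob)
```

Each writer touches only its own queue and lastThreaded value, so there is no single tail pointer for all writers to contend on.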

  12. Costs and Considerations
      • The evaluation shows that the system does not scale with increasing data spread or contention
      • It is the application writer’s job to consider node locality during application design (data accessed together should be on the same node)
        • This contrasts with data striping, which the authors argue improves single-user throughput but reduces scalability
      • Load migration is also the application’s responsibility
      • The evaluations focus on data throughput; few measure latency
        • Latency seems fairly important for group communication systems

  13. Discuss
