
Sinfonia: A New Paradigm for Building Scalable Distributed Systems


Presentation Transcript


  1. Sinfonia: A New Paradigm for Building Scalable Distributed Systems Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, and Christos Karamanolis Presented by Dennis Iannicca

  2. Introduction • Distributed systems traditionally use a message-passing interface • Error-prone and hard to use, because developers must design complex message-passing protocols to handle distributed state • Sinfonia uses a distributed shared memory architecture • Developers are free to design and manipulate data structures directly in Sinfonia without dealing with complicated protocols

  3. Introduction • Sinfonia presents an address space in which to store data without imposing any sort of structure on the developer • No types, schemas, tuples, or tables • The system is distributed over multiple nodes for fault tolerance and performance

  4. Introduction Sinfonia communicates between application nodes and memory nodes through a simple primitive called the minitransaction

  5. Goals • Help developers build distributed infrastructure applications that support other applications • Lock managers • Cluster file systems • Group communication services • Distributed name services • These applications need to provide reliability, consistency, and scalability

  6. Design • Sinfonia relies on two main design principles: • Reduce operation coupling to allow parallel execution • Make components reliable before scaling them

  7. Design • Basic Components • Memory Nodes • Hold application data in RAM or on disk • Keep a sequence of raw, uninterpreted words of a standard length, organized as a linear address space without any structure • Application Nodes • A user library provides the mechanism for programs to manipulate data stored on memory nodes through minitransactions
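Conceptually, a memory node is just a flat, untyped byte array, and data is addressed globally by the pair (memory-node id, offset). Below is a minimal Python sketch of that idea; the class and method names are illustrative assumptions, not Sinfonia's actual interface.

# Minimal sketch of a memory node's linear, unstructured address space.
class MemoryNode:
    def __init__(self, node_id: int, size: int):
        self.node_id = node_id
        self.mem = bytearray(size)   # raw, uninterpreted bytes; no types or schema

    def read(self, addr: int, length: int) -> bytes:
        return bytes(self.mem[addr:addr + length])

    def write(self, addr: int, data: bytes) -> None:
        self.mem[addr:addr + len(data)] = data

# Data is globally addressed by (memory-node id, offset):
node = MemoryNode(node_id=1, size=1 << 20)
node.write(0x100, b"hello")
assert node.read(0x100, 5) == b"hello"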

  8. Design • Minitransactions • Allow applications to update data on multiple memory nodes • Ensure atomicity, consistency, isolation, and durability • Examples: • Swap • Compare-and-swap • Atomic read of many locations • Acquire a lease • Acquire multiple leases atomically • Change data if lease is held
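A minitransaction is built from three item lists: compare items, read items, and write items; it commits only if every compare item matches the current memory contents. The Python sketch below illustrates this structure and a compare-and-swap built from it; the class, field, and method names are assumptions for exposition, not the library's API.

# Sketch of a minitransaction as three item lists; names are assumptions.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Minitransaction:
    # Each item addresses a byte range on one memory node.
    compare_items: List[Tuple[int, int, bytes]] = field(default_factory=list)  # (node, addr, expected)
    read_items:    List[Tuple[int, int, int]]   = field(default_factory=list)  # (node, addr, length)
    write_items:   List[Tuple[int, int, bytes]] = field(default_factory=list)  # (node, addr, data)

    def cmp(self, node, addr, expected):
        self.compare_items.append((node, addr, expected))

    def read(self, node, addr, length):
        self.read_items.append((node, addr, length))

    def write(self, node, addr, data):
        self.write_items.append((node, addr, data))

# Example: compare-and-swap of an 8-byte word on memory node 3; the
# minitransaction commits only if the compare item still matches.
old_value = (42).to_bytes(8, "little")
new_value = (43).to_bytes(8, "little")
t = Minitransaction()
t.cmp(3, 0x1000, old_value)    # commit only if the word still equals old_value
t.write(3, 0x1000, new_value)  # ...then overwrite it atomically

The other examples on the slide combine the same items differently; for instance, acquiring a lease can plausibly be expressed as a compare against an empty lease word plus a write of the new holder's id.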

  9. Design • Caching and Consistency • Sinfonia doesn’t cache data at application nodes • Applications must perform their own caching • Application-controlled caching has three advantages: • Greater flexibility on policies of what to cache • Cache utilization potentially improves since applications know their data access patterns • Sinfonia becomes a simpler service to use because data access is always current and performance is predictable

  10. Design • Fault Tolerance • Crashes of application nodes never affect data in Sinfonia since minitransactions are atomic • Optional protections: • Masking independent failures • Preserving data on correlated failures • Restoring data on disasters • Mechanisms: • Disk images • Logging • Replication • Backup

  11. Design Modes of operation and fault tolerance

  12. Implementation • Minitransaction Protocol • Integrates execution of the minitransaction into the commit protocol for efficiency • In the event of a participant crash, minitransactions block until the participant node recovers • Participants (memory nodes) can be optionally replicated and grouped into logical participants to avoid blocking in the event of a single node crash

  13. Implementation • Minitransaction Protocol (cont.) • Uses a two-phase commit protocol • Phase 1: Prepare and Execute Minitransaction • The coordinator (application node) generates a new transaction id (tid) and sends an EXEC-AND-PREPARE message to the participants (memory nodes) • The participant tries to acquire locks for the memory locations requested • If the participant acquires all the locks, it performs the comparisons and reads. If all comparisons succeed, the participant buffers the conditional writes in a redo-log • The participant then chooses a vote: if it acquired all the locks and all compare items succeeded, it votes to commit; otherwise it votes to abort
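The following is a minimal, self-contained sketch of that phase-1 logic at a participant. The data structures (lock map, redo-log, uncertain and forced-abort lists) follow the slides, but all names, the item shapes, and the exact-byte-range lock granularity are simplifying assumptions.

# Sketch of a participant's EXEC-AND-PREPARE handling; names are assumptions.
class Participant:
    def __init__(self, size: int):
        self.mem = bytearray(size)   # in-memory copy of the address space
        self.locks = {}              # (addr, length) range -> tid holding it
        self.redo_log = {}           # tid -> buffered write items
        self.uncertain = set()       # tids with a commit vote but no decision yet
        self.forced_abort = set()    # tids forced to abort by the recovery process

    def _try_lock_all(self, tid, ranges):
        # Locks are only taken if all are free; the participant never blocks here.
        # (Lock granularity is simplified to exact byte ranges.)
        if any(r in self.locks for r in ranges):
            return False
        for r in ranges:
            self.locks[r] = tid
        return True

    def release_locks(self, tid):
        self.locks = {r: t for r, t in self.locks.items() if t != tid}

    def exec_and_prepare(self, tid, compare_items, read_items, write_items):
        """Per-node items: compare (addr, expected), read (addr, length), write (addr, data)."""
        if tid in self.forced_abort:                  # recovery already aborted this tid
            return "ABORT", None
        ranges  = [(a, len(e)) for a, e in compare_items]
        ranges += [(a, l) for a, l in read_items]
        ranges += [(a, len(d)) for a, d in write_items]
        if not self._try_lock_all(tid, ranges):       # busy locks -> abort vote
            return "ABORT", None
        for addr, expected in compare_items:          # every compare must match
            if bytes(self.mem[addr:addr + len(expected)]) != expected:
                self.release_locks(tid)
                return "ABORT", None
        reads = [bytes(self.mem[a:a + l]) for a, l in read_items]
        self.redo_log[tid] = write_items              # buffer writes; apply only in phase 2
        self.uncertain.add(tid)
        return "COMMIT", reads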

  14. Implementation • Minitransaction Protocol (cont.) • Two-phase commit protocol (cont.) • Phase 2: Commit Minitransaction • The coordinator tells participants to commit if all votes are for committing • Participants apply the conditional-write items if committing • Participants release all locks on memory acquired by the minitransaction • If a minitransaction aborts because locks were busy or it has been marked “forced-abort” by the recovery process, the coordinator retries the minitransaction after a while using a new tid • Participants log minitransactions in phase 1 if logging is enabled • Participants keep an “uncertain” list of transaction ids that are in the redo-log but still undecided, a “forced-abort” list, and a “decided” list of finished tids and their outcomes
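To show how the two phases fit together, here is a hedged sketch of the phase-2 step at a participant and the coordinator loop that drives both phases against the Participant sketch above. In the real system these are remote messages, the decided list is updated, and logging to stable storage happens in phase 1; all of that is omitted here.

import uuid

# Phase 2 at a participant: apply or discard the buffered writes, release locks.
def apply_decision(p, tid, decision):
    if decision == "COMMIT":
        for addr, data in p.redo_log.get(tid, []):
            p.mem[addr:addr + len(data)] = data   # apply the conditional writes
    p.redo_log.pop(tid, None)
    p.uncertain.discard(tid)
    p.release_locks(tid)                          # locks released either way

# Coordinator (application node) side of the protocol. `participants` maps
# node id -> Participant; `items_by_node` maps node id -> (compare, read,
# write) item lists for that node, in the per-node shapes used above.
def run_minitransaction(participants, items_by_node):
    tid = uuid.uuid4().hex                        # fresh transaction id
    votes, reads = {}, {}
    for nid, (cmps, rds, wrs) in items_by_node.items():   # phase 1
        votes[nid], reads[nid] = participants[nid].exec_and_prepare(tid, cmps, rds, wrs)
    decision = "COMMIT" if all(v == "COMMIT" for v in votes.values()) else "ABORT"
    for nid in items_by_node:                     # phase 2
        apply_decision(participants[nid], tid, decision)
    return decision, reads   # on ABORT the caller retries with a new tid after a while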

  15. Implementation • Recovery from Coordinator Crashes • In the event of a coordinator crash, a recovery protocol is executed by a dedicated management node • The management node periodically probes memory nodes for minitransactions in the “uncertain” list that have not yet committed after a timeout period

  16. Implementation • Recovery from Coordinator Crashes (cont.) • The recovery protocol is executed by a recovery coordinator, which tries to abort the minitransactions • The recovery coordinator forces a participant to choose an abort vote if it has not yet chosen a vote
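A hedged sketch of that rule, reusing the Participant and apply_decision sketches above: a participant that already prepared (its uncertain list holds the tid) counts as a commit vote; any participant that has not voted is forced onto the forced-abort list, so recovery can only finish an already fully prepared minitransaction as a commit, never create one.

# Sketch of the recovery coordinator's handling of one stuck tid (assumed names).
def recover_minitransaction(tid, involved_participants):
    votes = []
    for p in involved_participants:
        if tid in p.uncertain:           # p already voted COMMIT in phase 1
            votes.append("COMMIT")
        else:
            p.forced_abort.add(tid)      # block any late COMMIT vote for this tid
            votes.append("ABORT")
    decision = "COMMIT" if all(v == "COMMIT" for v in votes) else "ABORT"
    for p in involved_participants:
        apply_decision(p, tid, decision) # finish phase 2 on the crashed coordinator's behalf
    return decision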

  17. Implementation • Recovery from Participant Crashes • The system blocks outstanding minitransactions involving the participant while it is offline • If the participant loses stable storage (disk), it must be recovered from backup • If the participant was rebooted without losing stable storage, it synchronizes its disk image by replaying its redo-log and remains unavailable for new minitransactions until its data structures are recovered
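A small sketch of that replay step, assuming the redo-log is an ordered list of (tid, write items) and the decided list records each finished tid's outcome; tids with no recorded outcome stay on the uncertain list for the recovery protocol to settle.

# Sketch of redo-log replay after a reboot that preserved stable storage.
def replay_redo_log(log_entries, decided, disk_image):
    """log_entries: ordered (tid, write_items); decided: tid -> 'COMMIT' | 'ABORT'."""
    uncertain = []
    for tid, writes in log_entries:
        outcome = decided.get(tid)
        if outcome == "COMMIT":
            for addr, data in writes:
                disk_image[addr:addr + len(data)] = data   # re-apply committed writes
        elif outcome is None:
            uncertain.append(tid)        # settled later by the recovery protocol
    return uncertain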

  18. Implementation • Optimized Recovery from Crash of the Whole System • If many memory nodes or the whole system crashes and restarts (e.g. after a power failure), the management node attempts to recover all the memory nodes at once • To avoid too many vote queries, memory nodes send each other the tids in their logs for which they voted to commit • Such queries are batched together so that a recovering node sends only one message per memory node instead of one message per minitransaction per memory node

  19. Implementation • Garbage Collection • The management node collects the lists to be garbage-collected and relays them to the appropriate memory nodes in batches • A committed minitransaction tid can be removed from the redo-log head only when the tid has been applied to the disk image of every memory node involved in it • The uncertain, decided, and all-log-tids lists are garbage-collected at the same time as the redo-log • The forced-abort list is garbage-collected only when a memory node’s epoch number is stale, to avoid corrupting the system
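The truncation rule in the second bullet can be stated compactly. The sketch below assumes a map that records, per memory node, which tids have already been applied to its disk image; the names are illustrative only.

# Sketch of the redo-log truncation rule: a committed tid may be dropped from
# the log head only after every memory node involved in it has applied it.
def can_collect(tid, involved_nodes, applied_to_image):
    """applied_to_image: maps (node_id, tid) -> True once flushed to that node's image."""
    return all(applied_to_image.get((node, tid), False) for node in involved_nodes)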

  20. Implementation • Consistent Backups • Sinfonia can perform transactionally consistent backups of its data • While the disk image is being backed up, new minitransactions are temporarily prevented from updating the disk image • Minitransactions are not blocked; instead they are written to the redo-log and memory until the backup is complete, and then the transactions in the redo-log are flushed to the disk image
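A hedged sketch of that flow, with copy_image and flush_redo_log standing in for machinery the slide does not detail, and image_frozen as an assumed flag checked by the image-update path.

# Sketch of a transactionally consistent backup (assumed names and helpers).
def consistent_backup(node, copy_image, flush_redo_log):
    node.image_frozen = True       # new commits go to the redo-log and memory only
    try:
        copy_image(node)           # the frozen disk image is the consistent snapshot
    finally:
        node.image_frozen = False
        flush_redo_log(node)       # apply the minitransactions buffered meanwhile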

  21. Implementation • Replication • Sinfonia can replicate memory nodes for high-availability • Integrates primary-copy replication into the two-phase minitransaction protocol • The replica is updated while the primary writes its redo-log to stable storage in the first phase of commit • The replica writes the updates from the primary to its redo-log and then acknowledges the primary
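A simplified sketch of how that replica update could piggyback on phase 1, reusing the Participant sketch from earlier; in the real system the primary's log write and the replica update proceed in parallel over the network, whereas here they are sequential for clarity.

# Sketch of primary-copy replication folded into phase 1 (assumed names).
def exec_and_prepare_replicated(primary, replica, tid, cmps, rds, wrs):
    vote, reads = primary.exec_and_prepare(tid, cmps, rds, wrs)
    if vote == "COMMIT":
        # While the primary writes its redo-log to stable storage, it also ships
        # the update; the replica logs it and acknowledges the primary.
        replica.redo_log[tid] = wrs
        replica.uncertain.add(tid)
    return vote, reads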

  22. Application: Cluster File System • SinfoniaFS • Scalable and fault-tolerant • Performance can be increased by adding more machines • Even if all nodes crash, data is never lost or left inconsistent • Exports NFS v2, and cluster nodes mount their own NFS server locally • All NFS servers export the same file system • Cluster nodes are application nodes and use Sinfonia to store file system metadata and data, including inodes, the free-block map, chaining information with a list of data blocks for each inode, and the contents of files • Sinfonia is used in mode LOG or LOG-REPL so that the address space can be as large as the disk
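As an illustration of how a file-system operation maps onto the primitives above, a setattr-style inode update might be issued as one minitransaction that validates the locally cached inode and rewrites it. This reuses the Minitransaction sketch from earlier; the address layout and names are assumptions, not SinfoniaFS code.

# Sketch: validate-then-update an inode in one minitransaction (assumed layout).
def setattr_minitransaction(node_id, inode_addr, cached_inode, new_inode):
    t = Minitransaction()
    t.cmp(node_id, inode_addr, cached_inode)   # inode unchanged since it was cached?
    t.write(node_id, inode_addr, new_inode)    # then install the new attributes
    return t
# If the compare item fails, the file server refreshes its cache and retries.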

  23. Application: Group Communication • SinfoniaGCS • Each member has a dedicated circular queue stored on one memory node where it can place its own broadcast messages without contention • Messages of all queues are threaded together with next pointers to create a single global list
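A simplified sketch of how one broadcast might be issued as a single minitransaction, again reusing the Minitransaction sketch: the message goes into the sender's own queue slot with a plain write item, and it is threaded onto the global list by a compare item on the current global tail's next pointer. The parameter names and layout are assumptions.

# Sketch of a SinfoniaGCS-style broadcast as one minitransaction (assumed layout).
def broadcast(queue_node, slot_addr, msg,
              tail_node, tail_next_addr, expected_next, new_next_ptr):
    t = Minitransaction()
    t.write(queue_node, slot_addr, msg)               # uncontended: sender owns this queue
    t.cmp(tail_node, tail_next_addr, expected_next)   # is the global tail still the tail?
    t.write(tail_node, tail_next_addr, new_next_ptr)  # link the message into the global list
    return t
# If the compare fails (another member appended first), the sender re-reads the
# tail and retries the minitransaction.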

  24. Evaluation Sinfonia vs. BerkeleyDB on one memory node

  25. Evaluation Scalability

  26. Future Work • Disk Arrays and Memory Nodes • It should be possible to implement the memory node functionality inside disk arrays • It is currently difficult for different hosts to share a writable disk volume • Network-attached disk arrays would benefit from having a concurrency control mechanism, such as Sinfonia’s minitransactions, that would allow volumes to be pooled and shared among multiple hosts

  27. Future Work • Extending Minitransactions • Minitransactions are not always an efficient way to execute some operations, such as atomic increment, especially under high contention • Extend minitransactions to address this limitation by allowing more general conditional tests (besides equality comparisons) and more general updates (instead of only writes)

  28. Related Work • GFS • Bigtable • Chubby • MapReduce • Distributed Shared Memory (DSM) • Plurix • Perdis • Thor • BerkeleyDB • Stasis • Pond • Paxos Commit • QuickSilver • Camelot • Remote Direct Memory Access • Boxwood • Gribble

  29. Conclusion • New paradigm for building scalable distributed systems • Streamlined minitransactions over unstructured data rather than message passing interfaces • Shifts concerns about failures and distributed protocol design into the higher-level and easier problem of designing shared data structures

  30. Questions?
