
MinCopysets: Derandomizing Replication in Cloud Storage

Presentation Transcript


  1. MinCopysets: Derandomizing Replication in Cloud Storage
  Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum
  Stanford University

  2. Overview
  Assumptions: no geo-replication; Azure uses much smaller clusters in practice

  3. RAMCloud
  • Primary data stored on masters (in memory)
  • Each master's data is divided into chunks
  • Chunks are replicated on backups (on disk)
  • When a master crashes, its data is recovered from thousands of backups
  [Diagram: masters, backups, and a crashed master]

  4. Random Replication
  [Diagram: chunks 1-3 spread across nodes 1-10, each chunk with one primary and two secondary replicas on randomly chosen nodes]

  5. The Problem
  • Randomized replication loses data in power outages
    • 0.5-1% of the nodes fail to reboot
    • This happens 1-2 times a year
  • Result: a handful of chunks (GBs of data) become unavailable (LinkedIn '12)
  • Sub-problem: managed power-downs
    • Software upgrades
    • Reduced power consumption

  6. Intuition
  • With a single chunk, we are safe:
    • The chunk is replicated on three nodes
    • Data is lost only if the failed nodes hold all three copies of the chunk
    • If 1% of the nodes fail: 0.0001% probability of data loss
  • With millions of chunks, we lose data:
    • A 1000-node HDFS cluster has 10 million chunks
    • If 1% of the nodes fail: 99.93% probability of data loss

  7. Mathematical Intuition
  • A copyset of nodes is a single unit of failure
  • Each chunk is replicated on a single copyset
  • For one chunk, the probability of data loss is C(F, R) / C(N, R), where C(n, k) is the binomial coefficient and:
    • F = number of failed nodes
    • R = replication factor
    • N = number of nodes
  • For all chunks, the probability is 1 - (1 - C(F, R) / C(N, R))^B, where:
    • B = number of chunks
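
A minimal Python check of these formulas against the figures on slide 6 (my own sketch, not part of the talk; `single_chunk_loss` and `any_chunk_loss` are illustrative names, and the second formula treats the chunks' copysets as independent, the usual approximation for random replication):

```python
# Sketch: numerically checking the numbers quoted on slide 6 using the
# formulas on slide 7. Assumes Python 3.8+ for math.comb.
from math import comb

def single_chunk_loss(N, R, F):
    """P(one specific chunk loses all R replicas) = C(F, R) / C(N, R)."""
    return comb(F, R) / comb(N, R)

def any_chunk_loss(N, R, F, B):
    """Approximate P(at least one of B chunks is lost), treating chunks independently."""
    return 1 - (1 - single_chunk_loss(N, R, F)) ** B

N, R, F = 1000, 3, 10                       # 1000-node cluster, R = 3, 1% of nodes fail
print(single_chunk_loss(N, R, F))           # ~7.2e-07  (~0.0001%, slide 6)
print(any_chunk_loss(N, R, F, 10_000_000))  # ~0.9993   (~99.93%, slide 6)
```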

  8. Changing R Doesn't Help

  9. Changing the Chunk Size Doesn't Help

  10. MinCopysets: Decouple Load Balancing and Durability
  • Split the nodes into fixed replication groups
  • Random distribution: place the primary replica on a random node
  • Deterministic replication: place the secondary replicas deterministically on the same replication group as the primary
  (See the placement sketch below.)
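
A minimal sketch of this placement rule (illustrative Python, not the RAMCloud or HDFS code; the group construction and the names `build_groups` and `place_chunk` are my assumptions — in the real systems the coordinator or NameNode tracks group membership, per slides 21-22):

```python
# Sketch of MinCopysets placement: nodes are partitioned once into fixed
# groups of size R; a chunk's primary goes on a random node and its
# secondaries go on the remaining members of that node's group.
import random

R = 3  # replication factor == replication group size

def build_groups(nodes):
    """Statically partition nodes into disjoint groups of R; maps node -> its group."""
    nodes = list(nodes)
    random.shuffle(nodes)
    usable = len(nodes) - len(nodes) % R        # leftover nodes omitted for simplicity
    groups = [nodes[i:i + R] for i in range(0, usable, R)]
    return {node: group for group in groups for node in group}

def place_chunk(group_of):
    """Return [primary, secondary, secondary] for a new chunk."""
    primary = random.choice(list(group_of))     # random node for load balancing
    return [primary] + [n for n in group_of[primary] if n != primary]

group_of = build_groups(f"node{i}" for i in range(9))
print(place_chunk(group_of))                    # e.g. ['node4', 'node7', 'node1']
```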

  11. MinCopysets Architecture
  [Diagram: chunks 1-4 placed on replication groups 1-3, e.g. group 1 = {node 1, node 2, node 55}, group 2 = {node 7, node 8, node 83}, group 3 = {node 22, node 24, node 47}; each chunk's primary and secondaries all reside within a single group]

  15. Extreme Failure Scenarios
  • Even in the extreme scenario where 3-4% of the cluster's nodes fail to reboot, MinCopysets keeps the probability of data loss low
  • For example:
    • 4000-node HDFS cluster
    • 120 nodes fail to reboot after a power outage
    • Only 3.5% probability of data loss
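
A back-of-the-envelope check of the 3.5% figure (my own sketch, not from the talk): under MinCopysets, data is lost only if every node of some replication group fails, and treating the N/R disjoint groups as independent gives a simple approximation.

```python
# Approximate MinCopysets loss probability: each of the N // R groups loses
# all R members with probability C(F, R) / C(N, R); groups treated as independent.
from math import comb

def mincopysets_loss(N, R, F):
    p_group = comb(F, R) / comb(N, R)   # a given group is wiped out entirely
    return 1 - (1 - p_group) ** (N // R)

print(mincopysets_loss(N=4000, R=3, F=120))   # ~0.035, i.e. ~3.5% (slide 15)
```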

  16. Extreme Failure Scenarios: Normal Clusters

  17. Extreme Failure Scenarios: Big Clusters

  18. MinCopysets' Trade-off
  • Trades off the frequency of failures against their magnitude
  • The expected amount of data lost stays the same
  • Data loss occurs very rarely
  • But when it occurs, the magnitude of the loss is greater

  19. Frequency vs. Magnitude of Failures
  • Setup:
    • 5000-node HDFS cluster
    • 3 TB per machine
    • R = 3
    • One power outage per year
  • Random replication:
    • Loses ~5.5 GB every single year
  • MinCopysets:
    • Loses data once every 625 years
    • Loses an entire node's worth of data when it does
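
A quick sanity check on these numbers (my arithmetic, not from the slides): losing one node's worth of data, about 3 TB, once every 625 years averages out to roughly 3000 GB / 625 ≈ 4.8 GB per year, the same order of magnitude as the ~5.5 GB lost every year under random replication. This is the trade-off stated on slide 18: comparable expected loss, concentrated into rare but large events.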

  20. RAMCloud Implementation
  • The RAMCloud implementation was relatively straightforward
  • Two non-trivial issues:
    • Groups of nodes must be managed: chunks are allocated on entire groups, and nodes joining and leaving groups must be handled
    • Machine failures become more complex: an entire group must be re-replicated, rather than an individual node

  21. RAMCloud Implementation
  [Diagram: RAMCloud master, coordinator (with its server list), and backup; RPCs shown: "Open New Chunk" request, "Assign Replication Group" request, and a "Replication Group" reply]

  22. HDFS Implementation
  • Even simpler than RAMCloud
  • In HDFS, replication decisions are centralized on the NameNode; in RAMCloud they are distributed
  • The NameNode assigns DataNodes to replication groups
  • Prototyped in about 200 lines of code

  23. HDFS Issues
  • Has the same issues as RAMCloud in managing groups of nodes
  • Issue: repair bandwidth
    • Solution: hybrid scheme (slide 26)
  • Issue: network bottlenecks and load balancing
    • Solution: kill the replication group and re-replicate its data elsewhere
  • Issue: a replication group's capacity is limited by the node with the smallest capacity
    • Solution: choose replication groups with similar capacities

  24. Facebook's HDFS Replication
  • Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss
  • Facebook's algorithm:
    • The primary replica is placed on node j in rack k
    • Secondary replicas are placed on randomly selected nodes among (j+1, ..., j+5), on racks (k+1, k+2)
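
A minimal sketch of this constrained placement as described on the slide (the indices, modular wrap-around, and the name `facebook_place` are my assumptions for illustration, not Facebook's actual HDFS code):

```python
# Sketch of the Facebook-style constraint from slide 24: primary on node j,
# rack k; each secondary on a random node from the window (j+1 .. j+5),
# placed on rack k+1 or k+2.
import random

def facebook_place(j, k, num_nodes, num_racks, replicas=3):
    """Return (node, rack) pairs: primary plus `replicas - 1` secondaries."""
    placement = [(j, k)]                                    # primary replica
    node_window = [(j + d) % num_nodes for d in range(1, 6)]
    rack_window = [(k + 1) % num_racks, (k + 2) % num_racks]
    while len(placement) < replicas:
        candidate = (random.choice(node_window), random.choice(rack_window))
        if candidate not in placement:                      # avoid duplicate placements
            placement.append(candidate)
    return placement

print(facebook_place(j=7, k=2, num_nodes=100, num_racks=20))
# e.g. [(7, 2), (10, 3), (12, 4)]
```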

  25. Facebook's Replication

  26. Hybrid MinCopysets
  • Split the nodes into replication groups of 2 and 15
  • The first and second replicas are always placed on the group of 2
  • The third replica is placed randomly within the group of 15
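
A minimal sketch of the hybrid rule (illustrative names and structure, assuming each 2-node group is paired with a 15-node group):

```python
# Sketch of hybrid MinCopysets from slide 26: replicas 1 and 2 always land on
# the fixed 2-node group; replica 3 goes to a random member of the 15-node group.
import random

def hybrid_place(pair_group, scatter_group):
    """First two replicas on the fixed pair, third on a random scatter-group node."""
    assert len(pair_group) == 2 and len(scatter_group) == 15
    return list(pair_group) + [random.choice(scatter_group)]

pair = ["nodeA", "nodeB"]
scatter = [f"node{i}" for i in range(15)]
print(hybrid_place(pair, scatter))   # e.g. ['nodeA', 'nodeB', 'node11']
```

Slide 23 lists this hybrid scheme as the answer to the repair-bandwidth issue; presumably, spreading the third replica over 15 nodes gives re-replication more choices while the fixed pair still keeps the number of distinct copysets small.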

  27. Thank You! Stanford University
