
Robustness in the Salus scalable block store


Presentation Transcript


  1. Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin University of Texas at Austin

  2. Salus overview • Usage: • Provide remote disks to users (Amazon EBS) • Scalability: • Thousands of servers • Robustness: • Tolerate disk/memory corruptions, CPU errors, … • Do NOT hurt performance/scalability.

  3. Scalability and robustness [figure: the software stack, from MegaStore and Spanner through BigTable, GFS, and Chubby down to the local FS and disk driver] More hardware -> more failures. More complex software -> more failures.

  4. Achieving both scalability and robustness is hard. Strong protections (end-to-end checks, BFT, Depot, …) and scalable systems (GFS/Bigtable, HDFS/HBase, WAS, Spanner, FDS, …) each exist, but combining them is challenging.

  5. Challenge: Parallelism vs Consistency [figure: clients exchange infrequent metadata with the metadata server and transfer data in parallel to the storage servers] State-of-the-art architecture: GFS/Bigtable, HDFS/HBase, WAS, … Data is replicated for durability and availability.

  6. Challenges • Write in parallel and in order • Eliminate single points of failure • Write: prevent a single node from corrupting data • Read: read safely from one node • Do not increase replication cost

  7. Write in parallel and in order [figure: clients, the metadata server, and the data servers]

  8. Write in parallel and in order [figure: writes 1 and 2 issued in parallel] Write 2 is committed but write 1 is not: that is not allowed for a block store.

  9. Prevent a single node from corrupting data [figure: clients, the metadata server, and the data servers]

  10. Prevent a single node from corrupting data [figure: a computation node] • Tasks of computation nodes: data forwarding, garbage collection, etc. • Examples of computation nodes: tablet server (Bigtable), region server (HBase), … (WAS)

  11. Read safely from one node • Read is executed on one node: • Maximize parallelism • Minimize latency • If that node experiences corruptions, …

  12. Do not increase replication cost • Industrial systems: • Write to f+1 nodes and read from one node • BFT systems: • Write to 2f+1 nodes and read from f+1 nodes

  13. Salus’ approach • Start from a scalable architecture (Bigtable/HBase). • Ensure the robustness techniques do not hurt scalability.

  14. Salus’ key ideas • Pipelined commit • Guarantee ordering despite parallel writes • Active storage • Prevent a computation node from corrupting data • End-to-end verification • Read safely from one node

  15. Salus’ key ideas [figure: clients, the metadata server, and the storage servers, annotated with the three ideas: end-to-end verification at the clients, pipelined commit on the write path, and active storage at the servers]

  16. Pipelined commit • Goal: barrier semantics • A request can be marked as a barrier. • All previous requests must be executed before it. • Naïve solution: the client blocks at a barrier, which loses parallelism. • Barrier semantics is a weaker version of a distributed transaction. • Well-known solution: two-phase commit (2PC)
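
A minimal Python sketch of this idea, using hypothetical DataServer and PipelinedClient classes rather than the real Salus components: writes are grouped into batches at barriers, prepares for a later batch may overlap with earlier batches, but commits are applied strictly in batch order, so parallelism is kept without violating barrier semantics.

class DataServer:
    """Hypothetical data server: buffers prepared writes until commit."""
    def __init__(self):
        self.prepared = {}   # batch_id -> list of (block, data)
        self.disk = {}       # committed block contents

    def prepare(self, batch_id, block, data):
        self.prepared.setdefault(batch_id, []).append((block, data))

    def commit(self, batch_id):
        for block, data in self.prepared.pop(batch_id, []):
            self.disk[block] = data


class PipelinedClient:
    """Hypothetical client: batches end at barriers and commit in order."""
    def __init__(self, servers):
        self.servers = servers
        self.next_batch = 0
        self.committed = -1

    def write_batch(self, writes):
        # Prepare one batch in parallel; a barrier request closes the batch.
        bid = self.next_batch
        self.next_batch += 1
        for server_idx, block, data in writes:
            self.servers[server_idx].prepare(bid, block, data)
        return bid

    def commit(self, bid):
        # Batch i commits only after batch i-1, so a later write never
        # becomes visible before an earlier one.
        assert bid == self.committed + 1
        for s in self.servers:
            s.commit(bid)
        self.committed = bid


# Two batches prepared back to back; batch 1 may be prepared before batch 0
# commits, but the commits themselves are applied in order.
servers = [DataServer(), DataServer()]
client = PipelinedClient(servers)
b0 = client.write_batch([(0, "blk1", "A"), (1, "blk2", "B")])
b1 = client.write_batch([(0, "blk3", "C")])
client.commit(b0)
client.commit(b1)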

  17. Pipelined commit – 2PC [figure: the client sends batch i (writes 1, 2, 3) and batch i+1 (writes 4, 5) to their servers in parallel; each batch has a leader, both batches reach the prepared state, and the previous leader has already committed its batch]

  18. Pipelined commit – 2PC [figure: once the previous leader reports batch i-1 as committed, the leader of batch i sends commit for writes 1, 2, 3; once batch i is committed, the leader of batch i+1 sends commit for writes 4 and 5]

  19. Pipelined commit – challenge • Is 2PC slow? • Additional network messages? Disk is the bottleneck. • Additional disk write? Let's eliminate that. • Challenge: whether to commit a write after recovery [figure: writes 1 and 3 are committed, write 2 is only prepared] Should write 2 be committed? Both cases are possible. • Salus’ solution: ask other nodes
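
A sketch of that recovery rule, assuming a hypothetical Peer interface that reports the highest batch a node has committed: because batches commit in order, seeing any committed batch at or beyond the prepared one implies the prepared write must also be committed; otherwise it can be discarded.

class Peer:
    """Hypothetical stand-in for another server/leader of the same volume."""
    def __init__(self, highest_committed):
        self._highest_committed = highest_committed

    def highest_committed_batch(self):
        return self._highest_committed


def recover_prepared_write(batch_id, peers):
    # Ask the other nodes instead of guessing: if any peer committed this
    # batch or a later one, the prepared write must be committed; if no
    # peer ever saw a commit for it, it is safe to discard.
    if any(p.highest_committed_batch() >= batch_id for p in peers):
        return "commit"
    return "discard"


# Example: batch 2 is prepared locally; one peer already committed batch 3,
# so batch 2 must have committed before the crash.
print(recover_prepared_write(2, [Peer(1), Peer(3)]))   # -> commit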

  20. Active Storage • Goal: a single node cannot corrupt data • Well-known solution: replication • Problem: replication cost vs availability • Salus’ solution: use f+1 replicas • Require unanimous consent of the whole quorum • If one replica fails, replace the whole quorum

  21. Active Storage [figure: a computation node on top of its storage nodes]

  22. Active Storage [figure: f+1 computation nodes on top of the storage nodes] • Unanimous consent: all updates must be agreed on by the f+1 computation nodes. • Additional benefit: reduces network bandwidth usage
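
A sketch of unanimous consent, with hypothetical ComputationNode and StorageNode classes standing in for the real region-server and storage layers: an update reaches the storage nodes only if every one of the f+1 computation nodes certifies the same thing, so a single faulty computation node can at worst block progress, not corrupt data.

class ComputationNode:
    """Hypothetical computation node (e.g. one region-server replica)."""
    def certify(self, update):
        # A real node would validate ordering and content; here it simply
        # returns a digest of what it intends to write.
        return hash(update)


class StorageNode:
    """Hypothetical storage node that persists certified updates."""
    def __init__(self):
        self.log = []

    def write(self, update):
        self.log.append(update)


def apply_update(update, computation_nodes, storage_nodes):
    # Unanimous consent: all f+1 computation nodes must agree on the update
    # before anything is written; disagreement triggers quorum replacement.
    votes = {node.certify(update) for node in computation_nodes}
    if len(votes) != 1:
        raise RuntimeError("computation nodes disagree: replace the whole quorum")
    for s in storage_nodes:
        s.write(update)


# f = 1: two computation nodes must agree before two storage nodes persist.
apply_update("blk7=X", [ComputationNode(), ComputationNode()],
             [StorageNode(), StorageNode()])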

  23. Active Storage [figure: computation nodes and storage nodes] • What if one computation node fails? • Problem: we may not know which one is faulty. • Solution: replace the whole quorum.

  24. Active Storage [figure: computation nodes and storage nodes] • What if one computation node fails? • Problem: we may not know which one is faulty. • Solution: replace the whole quorum. • The new quorum must agree on the states.
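
One way to picture the handoff, as a rough sketch only (the real agreement protocol is more involved than this, and StorageReplica is a hypothetical helper): each newly assigned computation node loads the region state from a storage replica, and the new quorum starts serving writes only if all of the loaded views match.

class StorageReplica:
    """Hypothetical storage replica holding the committed region state."""
    def __init__(self, state):
        self._state = dict(state)

    def committed_state(self):
        return dict(self._state)


def replace_quorum(storage_replicas, num_new_nodes):
    # Each new computation node reads the state back from a storage replica;
    # the new quorum may process writes only once all views agree.
    views = [storage_replicas[i % len(storage_replicas)].committed_state()
             for i in range(num_new_nodes)]
    if any(v != views[0] for v in views):
        raise RuntimeError("new quorum disagrees on state: abort the handoff")
    return views[0]


# f = 1: two storage replicas, two freshly assigned computation nodes.
state = replace_quorum([StorageReplica({"b0": "A"}),
                        StorageReplica({"b0": "A"})], 2)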

  25. Active Storage • Does it provide BFT with f+1 replication? • No …. • During recovery, may accept stale states if: • The client fails; • At least one storage node provides stale states; • All other storage nodes are not available. • 2f+1 replicas can eliminate this case: • Is it worth adding f replicas to eliminate that?

  26. End-to-end verification • Goal: read safely from one node • The client should be able to verify the reply. • If corrupted, the client retries another node. • Well-known solution: Merkle tree • Problem: scalability • Salus’ solution: • Single writer • Distribute the tree among servers

  27. End-to-end verification [figure: the Merkle tree distributed across servers 1–4, with the client holding the top tree] The client maintains the top tree. The client does not need to store anything persistently; it can rebuild the top tree from the servers.
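
A sketch of the distributed-Merkle-tree idea, with the per-server subtree reduced to a flat hash over that server's blocks for brevity (the real structure is a tree, and this is not the Salus code): each server keeps the hashes for its own blocks and exposes a subtree root; the client keeps only the small top tree over those roots, can rebuild it from the servers at any time, and checks every read against it.

import hashlib

def h(*parts):
    """Hash a sequence of strings; stands in for the Merkle hash function."""
    m = hashlib.sha256()
    for p in parts:
        m.update(p.encode())
    return m.hexdigest()


class Server:
    """Hypothetical server: holds some blocks and the hashes over them."""
    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.leaves = {b: h(d) for b, d in self.blocks.items()}

    def subtree_root(self):
        # Flat hash over the leaves; a real system uses a Merkle subtree so
        # that per-read proofs stay small.
        return h(*[self.leaves[b] for b in sorted(self.leaves)])

    def read(self, block_id):
        return self.blocks[block_id], dict(self.leaves)


class Client:
    """Keeps only the top tree; rebuilds it from the servers when needed."""
    def __init__(self, servers):
        self.roots = [s.subtree_root() for s in servers]

    def verified_read(self, servers, idx, block_id):
        data, leaves = servers[idx].read(block_id)
        # Recompute the server's root and the leaf hash; on any mismatch the
        # reply is corrupted and the client retries another node.
        root = h(*[leaves[b] for b in sorted(leaves)])
        if root != self.roots[idx] or h(data) != leaves[block_id]:
            raise ValueError("corrupted reply: retry another node")
        return data


servers = [Server({"b0": "A", "b1": "B"}), Server({"b2": "C"})]
client = Client(servers)
assert client.verified_read(servers, 0, "b1") == "B"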

  28. Recovery • Pipelined commit • How to ensure write order after recovery? • Active storage: • How to agree on the current states? • End-to-end verification • How to rebuild Merkle tree if client recovers?

  29. Discussion – why HBase? • It’s a popular architecture • Bigtable: Google • HBase: Facebook, Yahoo, … • Windows Azure Storage: Microsoft • It’s open source. • Why two layers? • Necessary if storage layer is append-only • Why append-only storage layer? • Better random write performance • Easy to scale

  30. Discussion – multiple writers?

  31. Lessons • Strong checking makes debugging easier.

  32. Evaluation

  33. Evaluation

  34. Evaluation

  35. Scalability and robustness [figure: the software stack, from the distributed protocol down to the operating system] More hardware -> more failures. Complex software -> more failures. BigTable: 1 corruption per 5 TB of data?
