
Robustness in the Salus scalable block store




Presentation Transcript


  1. Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin University of Texas at Austin

  2. Storage: Do not lose data. Past: fsck, IRON, ZFS, … Now & future: large-scale, scalable storage systems.

  3. Salus: Provide robustness to scalable systems. Strong protections tolerate arbitrary failures (BFT, end-to-end verification, …); scalable systems tolerate only crash failures plus certain disk corruptions (GFS/Bigtable, HDFS/HBase, WAS, …). Salus aims to combine the two.

  4. Start from "Scalable": borrow scalable designs (file systems, key-value stores, block stores, databases) and move them toward the robust end of the spectrum.

  5. Scalable architecture: clients, a metadata server, and data servers. Computing nodes have no persistent storage; data is transferred to the storage nodes in parallel and replicated for durability and availability.

  6. Problems • No ordering guarantees for writes. • Single points of failure: computing nodes can corrupt data, and clients can accept corrupted data.

  7. 1. No ordering guarantees for writes: a block store requires barrier semantics, but writes sent in parallel to different data servers can be applied out of order.

  8. 2. Computing nodes can corrupt data: a single faulty computing node is a single point of failure.

  9. 3. Clients can accept corrupted data: a simple disk checksum does not work, because it cannot prevent errors in memory, etc. The computing node remains a single point of failure.

  10. Salus: Overview. Components: block drivers (a single driver per virtual disk), a metadata server, and active storage; techniques: pipelined commit and end-to-end verification. Failure model: almost arbitrary failures, but no impersonation.

  11. Pipelined commit • Goal: barrier semantics. A request can be marked as a barrier; all previous requests must be executed before it. • Naïve solution: wait at each barrier, which loses parallelism. • The problem looks similar to a distributed transaction, so a well-known solution is two-phase commit (2PC).

  12. Problem of 2PC: the driver splits requests 1-5 at a barrier into batch i (requests 1-3) and batch i+1 (requests 4-5); each batch is prepared at its own leader and servers.

  13. Problem of 2PC (continued): even if both batches reach the commit phase, the problem can still happen, because nothing orders the commits of different batches. We need an ordering guarantee between batches.

  14. Pipelined commit: data transfer to the servers happens in parallel, but commits are sequential. The leader of batch i waits until batch i-1 has committed before committing, and the leader of batch i+1 in turn waits for batch i (see the sketch below).
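A minimal sketch of this idea, in Python with hypothetical names (`BatchLeader`, `transfer_to_servers`) that are not from the paper: every batch transfers its data concurrently, but each batch leader blocks until the previous batch has committed before committing its own.

```python
import threading

class BatchLeader:
    """Hypothetical leader for one batch of write requests (sketch only)."""

    def __init__(self, batch_id, requests, prev_committed):
        self.batch_id = batch_id
        self.requests = requests
        self.prev_committed = prev_committed   # event set when batch i-1 commits
        self.committed = threading.Event()     # set when this batch commits

    def transfer_to_servers(self):
        # Phase 1: send the batch's data to the data servers in parallel;
        # the servers mark the requests as prepared but not yet visible.
        for _req in self.requests:
            pass  # network I/O omitted in this sketch

    def commit(self):
        # Phase 2: commit only after the previous batch has committed,
        # which preserves barrier ordering across batches.
        self.prev_committed.wait()
        # ... tell the servers to commit the prepared requests ...
        self.committed.set()

def pipelined_commit(batches):
    """Transfer all batches in parallel, but commit them strictly in order."""
    prev = threading.Event()
    prev.set()                                 # "batch 0" counts as committed
    leaders = []
    for i, reqs in enumerate(batches, start=1):
        leader = BatchLeader(i, reqs, prev)
        leaders.append(leader)
        prev = leader.committed

    # Data transfer proceeds concurrently for every batch ...
    transfers = [threading.Thread(target=l.transfer_to_servers) for l in leaders]
    # ... and so do the commit phases, but each one blocks on its predecessor.
    commits = [threading.Thread(target=l.commit) for l in leaders]
    for t in transfers + commits:
        t.start()
    for t in transfers + commits:
        t.join()

pipelined_commit([[1, 2, 3], [4, 5]])
```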

  15. Outline • Challenges • Salus • Pipelined commit • Active storage • Scalable end-to-end checks • Evaluation

  16. Active storage • Goal: a single node cannot corrupt data. • Well-known solution: BFT replication. Problem: it requires at least 2f+1 replicas. • Salus' goal: use only f+1 computing nodes. f+1 is enough to guarantee safety, but what about availability/liveness?

  17. Active storage: unanimous consent (diagram: a computing node in front of the storage nodes).

  18. Active storage: unanimous consent • All updates must be agreed to by all f+1 computing nodes. • Additional benefit: co-locating computing and storage saves network bandwidth. (A sketch of unanimous consent follows after slide 20.)

  19. Active storage: restore availability • What if something goes wrong? Problem: we may not know which computing node is faulty. • Solution: replace all computing nodes.

  20. Active storage: restore availability (continued) • Replacing every computing node is cheap because computing nodes hold no persistent state.
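A minimal sketch of unanimous consent, with illustrative names (`StorageNode.apply`, `certify`) and shared-key MACs standing in for whatever certificates Salus actually uses: a storage node applies an update only when every one of the f+1 computing nodes has certified it, and any disagreement causes the update to be rejected (which in Salus triggers replacement of the computing nodes).

```python
import hashlib
import hmac

F = 1                                              # tolerated faulty nodes; f+1 = 2
SECRET_KEYS = {0: b"key-cu-0", 1: b"key-cu-1"}     # illustrative per-node keys

def certify(node_id, update):
    """A computing node's certificate over an update (an HMAC stands in for a real MAC/signature)."""
    return hmac.new(SECRET_KEYS[node_id], update, hashlib.sha256).digest()

class StorageNode:
    """Sketch of a storage node that requires unanimous consent before applying an update."""

    def __init__(self):
        self.log = []

    def apply(self, update, certificates):
        # Require a valid certificate from every one of the f+1 computing nodes.
        if len(certificates) != F + 1:
            raise ValueError("need certificates from all f+1 computing nodes")
        for node_id, cert in certificates.items():
            if not hmac.compare_digest(cert, certify(node_id, update)):
                # Disagreement: reject the update; since we may not know which
                # computing node is faulty, all of them get replaced.
                raise ValueError(f"bad certificate from computing node {node_id}")
        self.log.append(update)

# Usage: all computing nodes agree, so the update is applied.
update = b"write block 42"
certs = {i: certify(i, update) for i in range(F + 1)}
StorageNode().apply(update, certs)
```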

  21. Discussion • Does Salus achieve BFT with f+1 replication? No: a fork can still happen under a particular combination of failures. • Is it worth using 2f+1 replicas to eliminate this case? No: forks can still happen as a result of other events (e.g., geo-replication).

  22. Outline • Challenges • Salus • Pipelined commit • Active storage • Scalable end-to-end checks • Evaluation

  23. Evaluation • Is Salus robust against failures? • Do the additional mechanisms hurt scalability?

  24. Is Salus robust against failures? • It always ensures barrier semantics. Experiment: crash the block driver. Result: requests after "holes" are discarded. • A single computing node cannot corrupt data. Experiment: inject errors into a computing node. Result: the storage nodes reject the corrupted updates and trigger replacement. • Block drivers never accept corrupted data. Experiment: inject errors into a computing/storage node. Result: the block driver detects the errors and retries on other nodes.

  25. Do the additional mechanisms hurt scalability?

  26. Conclusion • Pipelined commit: writes in parallel while providing ordering guarantees. • Active storage: eliminates single points of failure and consumes less network bandwidth. • High robustness ≠ low performance.

  27. Salus: scalable end-to-end verification • Goal: read safely from a single node. The client must be able to verify the reply; if it is corrupted, the client retries on another node. • Well-known solution: a Merkle tree. Problem: scalability. • Salus' solution: a single writer per volume, with the tree distributed among the servers.

  28. Scalable end-to-end verification: the client maintains only the top of the Merkle tree, while each server maintains the subtree covering its range. The client does not need to store anything persistently; it can rebuild the top tree from the servers (see the sketch below).
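A minimal sketch of the distributed verification idea, with hypothetical names (`RangeServer`, `verified_read`) and a chained hash standing in for a real Merkle subtree: each server keeps the hashes for its block range, the client keeps only one root per range, and a read is accepted only if it verifies against that root.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class RangeServer:
    """Sketch: one server stores the blocks and Merkle subtree for its range."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def subtree_root(self) -> bytes:
        # A real implementation keeps a full subtree; a chained hash over the
        # leaf hashes is enough to illustrate the idea.
        acc = b""
        for b in self.blocks:
            acc = h(acc + h(b))
        return acc

    def read(self, i):
        # Return the block plus the leaf hashes needed to recompute the range root.
        return self.blocks[i], [h(b) for b in self.blocks]

class Client:
    """Sketch: the client keeps only the top tree (one root hash per server range)."""

    def __init__(self, servers):
        self.servers = servers
        # Nothing here is persistent: the top tree can be rebuilt from the servers.
        self.top = [s.subtree_root() for s in servers]

    def verified_read(self, server_idx, block_idx):
        block, leaf_hashes = self.servers[server_idx].read(block_idx)
        if h(block) != leaf_hashes[block_idx]:
            raise ValueError("returned block does not match its leaf hash")
        acc = b""
        for leaf in leaf_hashes:
            acc = h(acc + leaf)
        if acc != self.top[server_idx]:
            # Corrupted reply: in Salus the client would retry another replica.
            raise ValueError("proof does not match the client's top tree")
        return block

servers = [RangeServer([b"a", b"b"]), RangeServer([b"c", b"d"])]
client = Client(servers)
assert client.verified_read(1, 0) == b"c"
```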

  29. Scalable and robust storage: scale brings more hardware, hence more failures, and more complex software.

  30. Start from "Scalable" and move toward "Robust".

  31. Pipelined commit: challenges • Key challenge: recovery. Ordering across transactions is easy when failure-free but complicates recovery. • Is 2PC slow? Locking? No: a single writer means no locking. Phase-2 network overhead? Small compared to phase 1. Phase-2 disk overhead? Eliminated, pushing the complexity to recovery.

  32. Scalable architecture • Computation nodes have no persistent storage; they handle data forwarding, garbage collection, etc. Examples: tablet server (Bigtable), region server (HBase), stream manager (WAS), … The data itself lives on the storage nodes.

  33. Salus: Overview • Block store interface (as in Amazon EBS): a volume is a fixed number of fixed-size blocks, with a single block driver per volume providing barrier semantics (see the sketch below). • Failure model: arbitrary failures, but no identity forging. • Key ideas: pipelined commit (write in parallel and in order), active storage (replicate computation nodes), scalable end-to-end checks (the client verifies replies).
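To make that interface concrete, here is a minimal sketch with hypothetical names (`Volume`, `BlockDriver`); the real Salus driver goes through pipelined commit rather than a local flush: a volume is a fixed number of fixed-size blocks, and the single driver may mark a write as a barrier so that no earlier write can be ordered after it.

```python
class Volume:
    """Sketch of the block-store interface: a fixed number of fixed-size blocks."""

    BLOCK_SIZE = 4096

    def __init__(self, num_blocks):
        self.blocks = [bytes(self.BLOCK_SIZE)] * num_blocks

class BlockDriver:
    """Sketch of the single per-volume block driver.

    A write marked barrier=True must not become durable before every
    earlier write issued through this driver."""

    def __init__(self, volume):
        self.volume = volume
        self.pending = []            # writes queued since the last barrier

    def write(self, block_no, data, barrier=False):
        assert len(data) == Volume.BLOCK_SIZE
        self.pending.append((block_no, data))
        if barrier:
            self.flush()

    def flush(self):
        # In Salus this batch would go through pipelined commit; here the
        # writes are simply applied in order to show the ordering contract.
        for block_no, data in self.pending:
            self.volume.blocks[block_no] = data
        self.pending.clear()

# Usage: writes to blocks 7 and 8 are ordered before anything issued later.
vol = Volume(num_blocks=1024)
drv = BlockDriver(vol)
drv.write(7, b"\x01" * Volume.BLOCK_SIZE)
drv.write(8, b"\x02" * Volume.BLOCK_SIZE, barrier=True)
```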

  34. Complex recovery • We still need to guarantee barrier semantics. • What if servers fail concurrently? • What if different replicas diverge?

  35. Supporting concurrent users • No perfect solution exists yet: • Give up linearizability: causal or Fork-Join-Causal consistency (Depot). • Give up scalability: SUNDR.

  36. How to handle storage node failures? • Computing nodes generate certificates when writing to storage nodes. • If a storage node fails, the computing nodes agree on the current state by looking for the longest valid prefix of certified updates, and then re-replicate the data (see the sketch below). • As long as one correct storage node is available, no data is lost.
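A rough sketch of that recovery rule, with an illustrative log layout and a shared-key MAC standing in for the certificates (neither is the paper's exact format): each surviving replica exposes its certified log, and recovery adopts the longest prefix whose certificates validate.

```python
import hashlib
import hmac

KEY = b"illustrative-shared-key"   # stand-in for the per-write certificates Salus uses

def cert(seq, update):
    """Certificate a computing node attaches to update number `seq` (sketch)."""
    return hmac.new(KEY, f"{seq}:".encode() + update, hashlib.sha256).digest()

def valid_prefix_length(log):
    """Length of the longest prefix that is contiguous and correctly certified."""
    n = 0
    for seq, update, c in log:
        if seq != n + 1 or not hmac.compare_digest(c, cert(seq, update)):
            break
        n = seq
    return n

def recover(replica_logs):
    """Pick the longest valid prefix among the surviving replicas as the
    agreed-upon current state, then re-replicate from it (re-replication omitted)."""
    best = max(replica_logs, key=valid_prefix_length)
    return best[:valid_prefix_length(best)]

# Example: replica B lost entry 2, so replica A's longer valid prefix wins.
a = [(1, b"w1", cert(1, b"w1")), (2, b"w2", cert(2, b"w2")), (3, b"w3", cert(3, b"w3"))]
b = [(1, b"w1", cert(1, b"w1")), (3, b"w3", cert(3, b"w3"))]
assert recover([a, b]) == a
```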

  37. Why barrier semantics? • Goal: provide atomicity. • Example (file system): to append to a file, we must modify a data block and the corresponding pointer to that block. It is bad if the pointer is modified but the data is not. • Solution: modify the data first and then the pointer, and attach a barrier to the pointer write (see the sketch below).
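A small illustration of the append example (hypothetical helper names; the on-disk layout is invented for the sketch): the data block is written first and the pointer update is marked as a barrier, so the pointer can never become durable before the data it points to.

```python
def append_to_file(write, data_block_no, pointer_block_no, data, new_pointer):
    """Sketch: append by writing the new data block first, then the pointer.

    `write(block_no, payload, barrier=False)` is the hypothetical block-driver
    call from the earlier sketch. Because the pointer write is a barrier, it
    cannot become durable before the data write that precedes it; a crash in
    between leaves the old pointer intact, so the file system stays consistent."""
    write(data_block_no, data)                           # step 1: the new data block
    write(pointer_block_no, new_pointer, barrier=True)   # step 2: the pointer, ordered after it

# Minimal stand-in driver: just records the order in which writes become durable.
durable_order = []
def toy_write(block_no, payload, barrier=False):
    durable_order.append(block_no)

append_to_file(toy_write, data_block_no=100, pointer_block_no=3,
               data=b"new bytes", new_pointer=b"-> block 100")
assert durable_order == [100, 3]   # the data is durable before the pointer
```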

  38. Cost of robustness (cost vs. robustness): crash tolerance < Salus < eliminating forks (2f+1 replication) < fully BFT (signatures and N-way programming).
