
DStore: An Easy-to-Manage Persistent State Store



Presentation Transcript


  1. DStore: An Easy-to-Manage Persistent State Store — Andy Huang and Armando Fox, Stanford University

  2. Outline • Project overview • Consistency guarantees • Failure detection • Benchmarks • Next steps and bigger picture

  3. Background: Scalable CHTs
  [Architecture diagram: frontends, app servers, and DBs connected over LANs]
  Cluster hash tables (CHTs)
  • Single-key-lookup data: Yahoo! user profiles, Amazon catalog metadata
  • Underlying storage layer: Inktomi (wordID → docID list, docID → document metadata); DDS/Ninja (atomic compare-and-swap)
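  The single-key-lookup workload above maps naturally onto a hash-partitioned store. Below is a minimal, illustrative sketch (not DStore's code) of how a cluster hash table might route get/put operations to the brick that owns a key; the Brick and ClusterHashTable names and the MD5-based partitioning are assumptions for illustration.

```python
# Minimal sketch of a cluster hash table: single-key lookups are routed to the
# brick that owns the key's hash partition. The Brick and ClusterHashTable
# classes are illustrative stand-ins, not DStore's implementation.
import hashlib


class Brick:
    """One storage node holding a partition of the hash table."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def put(self, key, value):
        self.store[key] = value


class ClusterHashTable:
    def __init__(self, bricks):
        self.bricks = bricks

    def _owner(self, key):
        # Hash the key to pick the brick responsible for it.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.bricks[int(digest, 16) % len(self.bricks)]

    def get(self, key):
        return self._owner(key).get(key)

    def put(self, key, value):
        self._owner(key).put(key, value)


cht = ClusterHashTable([Brick() for _ in range(4)])
cht.put("user:42", {"name": "Alice"})       # e.g., a user profile record
print(cht.get("user:42"))
```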

  4. DStore: An easy-to-manage CHT
  Challenges:
  • Failure detection: fast detection is at odds with accurate detection
  • Capacity planning: high scaling costs necessitate accurate load prediction
  Benefits of cheap recovery (predictably fast, with a predictably small impact on availability/performance):
  • Lowers the cost of acting on a false positive
  • Effective failure detection is not contingent on accuracy
  • Our online repartitioning algorithm lowers scaling cost
  • Reactive scaling adjusts capacity to match current load
  Manage like stateless frontends

  5. Cheap recovery: Principles and costs
  Techniques:
  • Quorums (write: send to all, wait for majority; read: read from a majority): no recovery code to freeze writes and copy missed updates
  • Single-phase writes: no locking and no transactional logging
  Costs:
  • Higher replication factor: 2N+1 bricks to tolerate N failures (vs. N+1 in ROWA)
  • Sacrifice some consistency: well-defined guarantees that provide consistent ordering
  Trade storage and consistency for cheap recovery
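  A hedged sketch of the quorum technique named on this slide, assuming timestamped single-phase writes: a write is sent to all 2N+1 bricks and succeeds once a majority acknowledge, while a read collects a majority of replies and returns the value with the newest timestamp. The Replica and QuorumStore names and the in-memory dictionaries are illustrative, not DStore's actual interfaces.

```python
# Sketch of quorum reads/writes with single-phase, timestamped writes.
# Assumption: 2N+1 replicas, so a majority is N+1; all names are illustrative.
import time


class Replica:
    def __init__(self):
        self.data = {}                      # key -> (timestamp, value)

    def write(self, key, ts, value):
        # Single-phase write: apply if newer; no locking, no transactional log.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)
        return True                         # acknowledgement

    def read(self, key):
        return self.data.get(key)


class QuorumStore:
    def __init__(self, replicas):
        self.replicas = replicas            # 2N+1 bricks tolerate N failures
        self.majority = len(replicas) // 2 + 1

    def put(self, key, value):
        ts = time.time()
        acks = sum(1 for r in self.replicas if r.write(key, ts, value))
        return acks >= self.majority        # SUCCESS only with a majority of acks

    def get(self, key):
        replies = [r.read(key) for r in self.replicas[: self.majority]]
        replies = [entry for entry in replies if entry is not None]
        if not replies:
            return None
        ts, value = max(replies, key=lambda entry: entry[0])
        return value                        # newest timestamp wins
```

  In a real deployment a read would query bricks until any majority responds; reading from the first N+1 replicas here is a simplification.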

  6. Nothing new under the sun, but…
  Technique           | Prior work                                                   | DStore
  CHT                 | Scalable performance                                         | Ease of management
  Quorums             | Availability during network partitions and Byzantine faults | Availability during failures and recovery
  Relaxed consistency | Availability and performance while nodes are unavailable    |
  Result              | High availability and performance (end goal)                | Cheap recovery (but that’s just the start…)

  7. Cheap recovery simplifies state management
  Challenge             | Prior work                                                           | DStore
  Failure detection     | Difficult to make fast and accurate                                  | Effective even if it is not highly accurate
  Online repartitioning | Relatively new area [Aqueduct]                                       | Duration and impact are predictably small
  Capacity planning     | Predict future load                                                  | Scale reactively based on current load
  Data reconstruction   | [RAID]                                                               | [Future work]
  Result                | State management is costly (administration- and availability-wise)  | Manage state with techniques used for stateless frontends

  8. Outline • Project overview • Consistency guarantees • Failure detection • Benchmarks • Next steps and bigger picture

  9. Consistency guarantees
  Usage model:
  • A client issues a request
  • The request is forwarded to a random Dlib
  • The Dlib issues quorum reads/writes on the bricks
  Assumption: clients share data, but otherwise act independently
  Guarantee: for a key k, DStore enforces a global order of operations that is consistent with the order seen by individual clients.
  C1 issues w1(k, vnew) to replace the current hash table entry (k, vold):
  • w1 returns SUCCESS: subsequent reads return vnew
  • w1 returns FAIL: subsequent reads return vold
  • w1 returns UNKNOWN (due to Dlib failure): two cases, below

  10. Case 1: Another user U2 performs a read
  [Timeline diagram: bricks B1–B3 hold (k1, vold); U1 issues w1(k1, vnew); U2 issues r1(k1) and r2(k1); delayed commit]
  • A Dlib failure can cause a partial write, violating the quorum property
  • U2’s r(k1) returns either vold (no user has read vnew yet) or vnew (no user will later read vold)
  • If the timestamps returned by the bricks differ, read-repair restores the majority invariant
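  One way to picture the read-repair step, building on the QuorumStore sketch from slide 5: if the timestamps returned by the bricks differ, the reader writes the newest (timestamp, value) pair back so a majority agree again. This is a sketch of the idea, not DStore's code.

```python
# Read-repair sketch: a partial write leaves bricks disagreeing on a key's
# timestamp; the reader re-propagates the newest value to restore the
# majority invariant. Builds on the Replica/QuorumStore sketch above.
def get_with_read_repair(store, key):
    replies = []
    for replica in store.replicas:
        entry = replica.read(key)
        if entry is not None:
            replies.append(entry)
        if len(replies) >= store.majority:
            break                               # a majority has answered
    if not replies:
        return None
    newest_ts, newest_value = max(replies, key=lambda entry: entry[0])
    if any(ts != newest_ts for ts, _ in replies):
        # Timestamps differ: the write reached only a minority of bricks.
        # Write the newest value back so future quorum reads agree.
        for replica in store.replicas:
            replica.write(key, newest_ts, newest_value)
    return newest_value
```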

  11. Case 2: U1 performs a read
  [Timeline diagram: bricks B1–B3 hold (k1, vold); U1 issues w1(k1, vnew) and then r1(k1)]
  • On U1’s r(k1), the write is immediately committed or aborted, so all future readers see either vold or vnew
  • A write-in-progress cookie can be used to detect partial writes and commit/abort them on the next read
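  A hedged sketch of the write-in-progress cookie idea: the client library records a cookie before issuing the quorum write, and if the outcome is unknown, the cookie is noticed on that client's next read of the key and the partial write is committed (or aborted) before a value is returned. The Dlib class and its cookie dictionary are illustrative stand-ins, not DStore's implementation.

```python
# Illustrative write-in-progress cookie: commit or abort a possibly partial
# write on the client's next read of that key. All names are assumptions
# made for this sketch.
class Dlib:
    def __init__(self, store):
        self.store = store                  # e.g., the QuorumStore sketch above
        self.cookies = {}                   # key -> value of a write in progress

    def put(self, key, value):
        self.cookies[key] = value           # leave a write-in-progress cookie
        ok = self.store.put(key, value)
        del self.cookies[key]               # write resolved: clear the cookie
        return ok

    def get(self, key):
        pending = self.cookies.pop(key, None)
        if pending is not None:
            # A previous write may have been partial: finish (commit) it now,
            # so every future reader sees either the old or the new value.
            self.store.put(key, pending)
        return self.store.get(key)
```

  In this sketch the pending write is always committed; aborting it instead would be equally valid, as long as the choice is made before the read returns.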

  12. Consistency guarantees
  C1 issues w1(k, vnew) to replace the current hash table entry (k, vold):
  • w1 returns SUCCESS: subsequent reads return vnew
  • w1 returns FAIL: subsequent reads return vold
  • w1 returns UNKNOWN (due to Dlib failure):
    - U1 reads: w1 is immediately committed or aborted
    - U2 reads: if vold is returned, no user has read vnew; if vnew is returned, no user will later read vold

  13. Two-phase commit vs. single-phase writes
  Property     | Two-phase commit                                     | Single-phase writes
  Consistency  | Sequential consistency                               | Consistent ordering
  Recovery     | Read log to complete in-progress transactions        | No special-case recovery
  Availability | Locking may cause requests to block during failures  | No locking
  Performance  | 2 synchronous log writes, 2 roundtrips               | 1 synchronous update, 1 roundtrip
  Other costs  | None                                                 | Read-repair (spreads out the cost of 2PC to make the common case faster); write-in-progress cookie (spreads out the responsibility of 2PC)

  14. Recovery behavior
  [Benchmark plots: recovery when running at 100% capacity and at a typical 60-70% of max utilization]
  Recovery is predictably fast, with a predictably small impact

  15. Application-generic failure detection
  Failure detection techniques (Tarzan algorithm, beacon listener, median absolute deviation) monitor operating statistics (CPU load, requests processed, etc.); anomalies above a threshold trigger a reboot.
  Simple detection techniques “work” because the resolution mechanism is cheap.
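  A minimal sketch of one of the detection techniques named above, median absolute deviation over per-brick operating statistics. The 3.0 threshold and the reboot() stub are assumptions for illustration, not values from the talk.

```python
# Anomaly detection over operating statistics via median absolute deviation:
# a brick whose metric sits too many MADs from the cluster median is flagged
# and rebooted. The threshold (3.0) and reboot() stub are assumptions.
import statistics


def median_absolute_deviation(values):
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


def anomalous_bricks(stats, threshold=3.0):
    """stats: {brick_id: metric}, e.g. requests processed in the last window."""
    values = list(stats.values())
    med = statistics.median(values)
    mad = median_absolute_deviation(values)
    if mad == 0:
        return []                           # all bricks look alike
    return [b for b, v in stats.items() if abs(v - med) / mad > threshold]


def reboot(brick_id):
    print(f"reboot {brick_id}")             # stand-in for the cheap treatment


# Because recovery is cheap, acting on a false positive is inexpensive,
# so a simple detector like this is "good enough".
for brick in anomalous_bricks({"b1": 990, "b2": 1010, "b3": 1005, "b4": 350}):
    reboot(brick)
```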

  16. Failure detection and repartitioning behavior
  [Benchmark plots: online repartitioning; fail-stutter fault; aggressive failure detection]
  • Online repartitioning: low scaling cost
  • Aggressive failure detection: low cost of acting on false positives

  17. Bigger picture: What is “self-managing”?
  • Indicator: brick performance is a sign of system health
  • Monitoring: tests for potential problems
  • Treatment: a low-impact resolution mechanism (reboot)

  18. Bigger picture: What is “self-managing”?
  Indicators: brick performance, system load, disk failures

  19. Bigger picture: What is “self-managing”?
  • Indicators (brick performance, system load, disk failures) are mapped to treatments (reboot, repartition, reconstruction) by simple detection mechanisms and policies
  • Key: low-cost mechanisms enable constant “recovery”
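  To make the indicator-to-treatment mapping concrete, here is a tiny hypothetical policy loop in the spirit of this slide: each indicator has a simple check, and each firing triggers its low-cost treatment. The metric names and stubs are assumptions, not part of the talk.

```python
# Hypothetical "constant recovery" loop: simple checks over indicators trigger
# low-cost treatments (reboot, repartition, reconstruction). All names here
# are illustrative placeholders.
POLICIES = [
    ("brick performance", "reboot",         lambda m: m["anomalous_bricks"]),
    ("system load",       "repartition",    lambda m: m["overloaded_partitions"]),
    ("disk failures",     "reconstruction", lambda m: m["failed_disks"]),
]


def manage(metrics, apply_treatment):
    for indicator, treatment, check in POLICIES:
        for target in check(metrics):       # cheap, possibly inaccurate check
            apply_treatment(treatment, target)


manage(
    {"anomalous_bricks": ["b4"], "overloaded_partitions": [], "failed_disks": []},
    apply_treatment=lambda action, target: print(f"{action}: {target}"),
)
```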
