Data-Centric Reconfiguration with Network-Attached Disks

Alex Shraer (Technion). Joint work with: J.P. Martin, D. Malkhi, M. K. Aguilera (MSR), and I. Keidar (Technion).

Presentation Transcript


  1. Data-Centric Reconfiguration with Network-Attached Disks • Alex Shraer (Technion) • Joint work with: J.P. Martin, D. Malkhi, M. K. Aguilera (MSR), and I. Keidar (Technion)

  2. Preview • The setting: data-centric replicated storage • Simple network-attached storage nodes • Our contributions: • First distributed reconfigurable R/W storage • Asynchronous vs. consensus-based reconfiguration • Allows adding and removing storage nodes dynamically

  3. Enterprise Storage Systems • Highly reliable customized hardware • Controllers, I/O ports may become a bottleneck • Expensive • Usually not extensible • Different solutions for different scale • Example (HP): high end: XP (1152 disks); mid-range: EVA (324 disks)

  4. Alternative – Distributed Storage • Made up of many storage nodes • Unreliable, cheap hardware • Failures are the norm, not an exception • Challenges: • Achieving reliability and consistency • Supporting reconfigurations

  5. Distributed Storage Architecture • [Figure: storage clients (dynamic, fault-prone) issue reads and writes over a LAN/WAN to cloud storage made up of storage nodes (fault-prone)] • Unpredictable network delays (asynchrony)

  6. A Case for Data-Centric Replication • Client-side code runs replication logic • Communicates with multiple storage nodes • Simple storage nodes (servers) • Can be network-attached disks • Not necessarily PCs with disks • Do not run application-specific code • Less fault-prone components • Simply respond to client requests • High throughput • Do not communicate with each other • If storage nodes communicate, their failures are likely to be correlated! • Oblivious to where other replicas of each object are stored • Scalable: the same storage node can be used for many replication sets • [Figure labels: not-so-thin client, thin storage node]
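
The division of labor on this slide can be sketched in a few lines of Python. This is a minimal illustration, not DynaDisk's code: `StorageNode`, `Client`, the single-writer timestamps, and the in-memory "network" are all assumptions made for the example. The read path follows the majority read and write-back pattern of ABD-style registers, with a fixed set of nodes and no reconfiguration.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Hypothetical passive node: stores (timestamp, value) pairs and only answers requests."""
    store: dict = field(default_factory=dict)
    up: bool = True

    def read(self, key):
        return self.store.get(key, (0, None))

    def write(self, key, ts, value):
        # Overwrite only with a strictly newer timestamp.
        if ts > self.read(key)[0]:
            self.store[key] = (ts, value)


class Client:
    """All replication logic runs here; nodes never contact each other."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.majority = len(nodes) // 2 + 1

    def _quorum(self):
        live = [n for n in self.nodes if n.up]
        if len(live) < self.majority:
            raise RuntimeError("no majority of storage nodes reachable")
        return live[: self.majority]

    def write(self, key, value):
        # Assumes a single writer, so timestamps need no client-id tie-breaker.
        quorum = self._quorum()
        ts = max(n.read(key)[0] for n in quorum) + 1
        for n in quorum:
            n.write(key, ts, value)

    def read(self, key):
        quorum = self._quorum()
        ts, value = max((n.read(key) for n in quorum), key=lambda tv: tv[0])
        for n in quorum:  # write back before returning
            n.write(key, ts, value)
        return value


nodes = [StorageNode() for _ in range(5)]
client = Client(nodes)
client.write("x", "Italy")
nodes[0].up = nodes[1].up = False  # two of five nodes fail
latest = client.read("x")          # still served by a majority
```

The nodes expose only `read` and `write`, so the same node could serve many replication sets without knowing about any of them; everything that needs coordination sits in the client.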

  7. Real Systems Are Dynamic • [Figure: storage nodes A through I on a LAN/WAN, with requests reconfig {–C, +F, …, +I} and reconfig {–A, –B}] • The challenge: maintain consistency, reliability, and availability

  8. Pitfall of Naïve Reconfiguration • [Figure: nodes A through E start from configuration {A, B, C, D}; reconfig {+E} and reconfig {–D} are issued concurrently and some messages are delayed, so nodes end up holding different configurations: {A, B, C, D, E} on some, {A, B, C} on others]

  9. Pitfall of Naïve Reconfiguration • [Figure: C, D, and E hold configuration {A, B, C, D, E}, while A and B hold {A, B, C}; write x “Spain” replaces X = “Italy”, 1 with X = “Spain”, 2 on C, D, and E; a later read x consults A and B, which still hold X = “Italy”, 1] • The read returns “Italy”! • Split brain!
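
The split brain comes down to a counting fact: majorities of two different configurations need not intersect. The sketch below enumerates majority quorums for the two configurations in the example; the helper names `majorities` and `disjoint_quorum_pairs` are made up for illustration.

```python
from itertools import combinations


def majorities(config):
    """All minimal majority quorums of a configuration."""
    size = len(config) // 2 + 1
    return [set(q) for q in combinations(sorted(config), size)]


def disjoint_quorum_pairs(conf1, conf2):
    """Pairs of majorities, one per configuration, that share no node."""
    return [(q1, q2)
            for q1 in majorities(conf1)
            for q2 in majorities(conf2)
            if not q1 & q2]


new_view = {"A", "B", "C", "D", "E"}  # held by the writer
old_view = {"A", "B", "C"}            # held by the reader

pairs = disjoint_quorum_pairs(new_view, old_view)
same_config_pairs = disjoint_quorum_pairs(new_view, new_view)
```

Here `pairs` contains the quorums {C, D, E} and {A, B}, the write and read sets from the example, while `same_config_pairs` is empty: within a single configuration any two majorities overlap, which is what a reconfiguration protocol has to preserve across configurations.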

  10. Reconfiguration Option 1: Centralized • Example of a manual reconfiguration notice: “Tomorrow Technion servers will be down for maintenance from 5:30am to 6:45am. Virtually Yours, Moshe Barak” • Can be automatic • E.g., Ursa Minor [Abd-El-Malek et al., FAST 05] • Downtime • Most solutions stop R/W while reconfiguring • Single point of failure • What if the manager crashes while changing the system?

  11. Reconfiguration Option 2: Distributed Agreement • Servers agree on next configuration • Previous solutions not data-centric • No downtime • In theory, might never terminate [FLP85] • In practice, we have partial synchrony so it usually works

  12. Reconfiguration Option 3: DynaStore [Aguilera, Keidar, Malkhi, S., PODC09] • Distributed & completely asynchronous • No downtime • Always terminates • Not data-centric

  13. In this work: DynaDisk, dynamic data-centric R/W storage • First distributed data-centric solution • No downtime • Tunable reconfiguration method • Modular design, coordination is separate from data • Allows easily setting/comparing the coordination method • Consensus-based vs. asynchronous reconfiguration • Many shared objects • Running a protocol instance per object is too costly • Transferring all state at once might be infeasible • Our solution: incremental state transfer • Built with an external (weak) location service • We formally state the requirements from such a service

  14. Location Service • Used in practice, ignored in theory • We formalize the weak external service as an oracle: • Not enough to solve reconfiguration • oracle.query() returns some “legal” configuration • If reconfigurations stop and oracle.query() is invoked infinitely many times, it eventually returns the last system configuration
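
A toy model may help make the oracle's guarantee concrete. Everything here is an assumption for illustration: the class name `WeakLocationOracle`, the `install` method, reading “legal” as “some configuration that was installed at some point”, and the `max_stale` bound. The bound in particular is stronger than the slide's requirement, which only promises the last configuration eventually.

```python
import random


class WeakLocationOracle:
    """Hypothetical weak location service: may answer with stale configurations."""

    def __init__(self, initial, max_stale=3, seed=0):
        self.history = [frozenset(initial)]
        self.max_stale = max_stale      # assumed bound on stale answers
        self._stale_left = 0
        self._rng = random.Random(seed)

    def install(self, config):
        """Record a new system configuration (called by the reconfiguring side)."""
        self.history.append(frozenset(config))
        self._stale_left = self.max_stale

    def query(self):
        """Return some installed configuration; the last one once answers settle."""
        if self._stale_left > 0:
            self._stale_left -= 1
            return self._rng.choice(self.history)
        return self.history[-1]


oracle = WeakLocationOracle({"A", "B", "C"})
oracle.install({"A", "B", "C", "D"})
answers = [oracle.query() for _ in range(10)]
```

A client cannot tell a stale answer from a current one, which is why such a service alone is not enough to solve reconfiguration: it only tells clients where to start looking.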

  15. The Coordination Module in DynaDisk • [Figure: storage devices A, B, and C in a configuration conf = {+A, +B, +C}; each device holds replicas of objects x, y, and z and a “next config” field, initially ⊥] • Distributed R/W objects: updated similarly to ABD • Distributed “weak snapshot” object • API: update(set of changes) → OK; scan() → set of updates

  16. Coordination with Consensus • update: [Figure: clients invoke reconfig({+D}) and reconfig({–C}); the proposals +D and –C go through consensus, and the chosen update +D is written as the next config on storage devices A, B, and C] • scan: read and write back the next config from a majority • Every scan returns +D or ⊥

  17. Weak Snapshot: Weaker than Consensus • No need to agree on the next configuration, as long as each process has a set of possible next configurations, and all such sets intersect • Intersection allows clients to converge and again use a single config • Non-empty intersection property of weak snapshot: • Every two non-empty sets returned by scan() intersect • Example pairs of (Client 1’s scan, Client 2’s scan): {+D} and {+D}; {–C} and {+D, –C}; {+D} and {–C} (the first two pairs intersect, the last does not) • [Figure label: Consensus]
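
The non-empty intersection property is easy to state as a predicate over scan results. The sketch below is only a checker for that property, not an implementation of weak snapshots; the function name is made up, and updates are written as plain strings such as "+D" and "-C" (ASCII minus).

```python
from itertools import combinations


def nonempty_intersection_holds(scans):
    """True if every two non-empty scan results share at least one update."""
    nonempty = [set(s) for s in scans if s]
    return all(a & b for a, b in combinations(nonempty, 2))


ok_same = nonempty_intersection_holds([{"+D"}, {"+D"}])
ok_overlap = nonempty_intersection_holds([{"-C"}, {"+D", "-C"}])
ok_with_empty = nonempty_intersection_holds([set(), {"+D"}, {"+D", "-C"}])
violated = not nonempty_intersection_holds([{"+D"}, {"-C"}])
```

Empty scans are ignored by the check, matching the wording of the property: only non-empty results have to overlap.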

  18. Coordination without Consensus • update: [Figure: clients invoke reconfig({+D}) and reconfig({–C}) against storage devices A, B, and C, each holding indexed “next config” proposals that start as ⊥; the operations shown are WRITE({–C}, 0), answered with OK, followed by CAS({–C}, ⊥, 0) and CAS({–C}, ⊥, 1)] • scan: read and write back proposals from a majority (twice)

  19. Tracking Evolving Configurations • With consensus: agree on the next configuration • Without consensus: usually a chain, sometimes a DAG • [Figure: from {A, B, C}, a weak-snapshot scan() that returns {+D} leads to {A, B, C, D}, and a scan() that returns {+D, –C} leads to {A, B, D}; the edges –C from {A, B, C, D} and +D from {A, B} also reach {A, B, D}] • Inconsistent updates are found and merged • All non-empty scans intersect
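
The convergence in the DAG can be checked with a small helper that applies a set of updates to a configuration. This is a sketch under assumptions: `apply_changes` and the string encoding of updates ("+D" adds D, "-C" removes C, with an ASCII minus) are invented for the example.

```python
def apply_changes(config, changes):
    """Apply updates such as "+D" (add D) or "-C" (remove C) to a configuration."""
    result = set(config)
    for change in changes:
        op, node = change[0], change[1:]
        if op == "+":
            result.add(node)
        elif op == "-":
            result.discard(node)
        else:
            raise ValueError(f"unknown change: {change!r}")
    return result


start = {"A", "B", "C"}
# Two clients see the updates in different orders...
via_add_first = apply_changes(apply_changes(start, {"+D"}), {"-C"})
via_remove_first = apply_changes(apply_changes(start, {"-C"}), {"+D"})
# ...and a third sees both at once.
merged_directly = apply_changes(start, {"+D", "-C"})
```

All three paths end at {A, B, D}, since adding one node and removing a different one commute; this is why clients that saw different but intersecting scans can still meet at a single configuration.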

  20. Consensus-based vs. Asynchronous Coordination • Two implementations of weak snapshots • Asynchronous • Partially synchronous (consensus-based) • Active Disk Paxos [Chockler, Malkhi, 2005] • Exponential backoff for leader election • Unlike asynchronous coordination, consensus-based might not terminate [FLP85] • Storage overhead • Asynchronous: vector of updates • vector size ≤ min(#reconfigs, #members in config) • Consensus-based: 4 integers and the chosen update • Per storage device and configuration
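
For readers unfamiliar with the backoff mentioned above, here is a generic exponential backoff schedule with random jitter. It is not taken from Active Disk Paxos or DynaDisk; the function name, the `base` and `cap` values, and the jitter scheme are all assumptions chosen for illustration.

```python
import random


def backoff_delays(attempts, base=0.05, cap=2.0, seed=None):
    """Delay (in seconds) before each retry: random in [0, min(cap, base * 2**attempt)]."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


delays = backoff_delays(8, seed=1)
```

The growing, randomized waits make it increasingly likely that one would-be leader runs uncontended. This only helps when the network eventually behaves, which is consistent with the slide's point that the consensus-based variant might not terminate.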

  21. Strong Progress Guarantees Are Not for Free • [Figure: latency plots comparing consensus-based and asynchronous (no consensus) coordination] • Slightly better, much more predictable reconfig latency when many reconfigs execute simultaneously • Significant negative effect on R/W latency • The same when there are no reconfigurations

  22. Future & Ongoing Work • Combine asynch. and partially-synch. coordination • Consider other weak snapshot implementations • E.g., using randomized consensus • Use weak snapshots to reconfigure other services • Not just for R/W

  23. Summary • DynaDisk – dynamic data-centric R/W storage • First decentralized solution • No downtime • Supports many objects, provides incremental reconfig • Uses one coordination object per config. (not per object) • Tunable reconfiguration method • We implemented asynchronous and consensus-based • Many other implementations of weak-snapshots possible • Asynchronous coordination in practice: • Works in more circumstances → more robust • But, at a cost – significantly affects ongoing R/W ops
