OceanStore: An Architecture for Global-Scale Persistent Storage
Introduction • Vision: ubiquitous computing devices • Goal: transparency • Where to store persistent information? • How to protect against system failures? • How to upgrade components without losing configuration info? • How to manage consistency?
Introduction • Requirements • Intermittent connectivity • Secure from theft and denial-of-service • Durable information • Automatic and reliable archival services • Information divorced from location • Geographically distributed servers • Caching close to clients • Information can migrate to wherever it is needed • Scale: 10^10 users, each with 10,000 files
OceanStore: A True Data Utility • Utility model: consumers pay a monthly fee in exchange for access to persistent storage • Highly available data from anywhere • Automatic replication for disaster recovery • Strong security • Providers would buy and sell capacity among themselves for mobile users • Deep archival storage: use excess of storage space to ease data management
Two Unique Goals • Use untrusted infrastructure • May crash without warning • Encrypted information in the infrastructure • Responsible party is financially responsible for the integrity of data • Support nomadic data • Data can be cached anywhere, anytime • Continuous introspective monitoring to manage caching and locality
System Overview • The fundamental unit in OceanStore: a persistent object • Named by a globally unique identifier (GUID) • Replicated and stored on multiple servers • Independent of the server (floating replicas) • Two mechanisms to locate a replica • Probabilistically probe neighboring machines • Slower deterministic algorithm
OceanStore Updates • Each update (or groups of updates) to an object creates a new version • Consistency is based on versioning • No need for backup • Pointers are permanent
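A minimal sketch of version-based updates, assuming a toy in-memory, append-only store. The class `VersionedObject` and its methods are hypothetical names for illustration and are not part of OceanStore.

```python
class VersionedObject:
    """Toy append-only object: every update yields a new version."""

    def __init__(self, data=b""):
        self._versions = [data]  # version 0

    def update(self, new_data):
        """Append a new version and return its version number."""
        self._versions.append(new_data)
        return len(self._versions) - 1

    def read(self, version=-1):
        """Read a specific version; the default is the latest one."""
        return self._versions[version]


obj = VersionedObject(b"draft")
v1 = obj.update(b"draft, revised")
```

Because old versions are never overwritten, a reference to version 0 stays valid after later updates, which is the sense in which no separate backup is needed.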
OceanStore Objects • An active object is the latest version of its data • An archival object is a permanent, read-only version of the object • Encoded with an erasure code • Any m out of n fragments can reconstruct the original data • Can support either weak or strong consistency models
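The "any m out of n fragments" property can be illustrated with the simplest possible erasure code: a single XOR parity fragment (m = 2, n = 3). This is only a sketch under that assumption; the slides do not specify the code used, and `encode` and `decode` are hypothetical helper names.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block):
    """Split an even-length block into 2 data fragments plus 1 parity fragment."""
    half = len(block) // 2
    a, b = block[:half], block[half:]
    return [a, b, xor_bytes(a, b)]


def decode(fragments):
    """Rebuild the block from any 2 of the 3 fragments.

    `fragments` maps a fragment index (0, 1 or 2) to its bytes.
    """
    if 0 in fragments and 1 in fragments:
        a, b = fragments[0], fragments[1]
    elif 0 in fragments:
        a = fragments[0]
        b = xor_bytes(a, fragments[2])  # recover b from a and parity
    else:
        b = fragments[1]
        a = xor_bytes(b, fragments[2])  # recover a from b and parity
    return a + b


frags = encode(b"OceanStore")
```

Practical archival codes tolerate the loss of many fragments rather than one, but the reconstruction idea is the same.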
Applications • Groupware: calendar, email, contact lists, distributed design tools • Allow concurrent updates • Provide ways to merge information and detect conflicts
Applications • Digital libraries • Require massive quantities of storage • Replication for durability and availability • Deep archival storage to survive disaster • Seamless migration of data to where it is needed • Sensor data aggregation and dissemination
Naming • GUID: pseudo-random fixed-length bit string • Naming facility • Decentralized • Self-certifying path names • GUID = hash(user key, file name) • Multiple roots in OceanStore • GUID of a server is a secure hash of its key • GUID of a data fragment is a secure hash of the data content
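The three naming rules above can be sketched as follows. SHA-1 is used here as an assumed secure hash, and the length prefix that separates key from name is an illustrative choice; the function names are hypothetical.

```python
import hashlib


def object_guid(owner_key, name):
    """GUID = hash(user key, file name)."""
    h = hashlib.sha1()
    # Length-prefix the key so (key, name) pairs cannot collide by concatenation.
    h.update(len(owner_key).to_bytes(4, "big"))
    h.update(owner_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()


def server_guid(server_key):
    """GUID of a server: secure hash of its key."""
    return hashlib.sha1(server_key).hexdigest()


def fragment_guid(content):
    """GUID of a data fragment: secure hash of its content."""
    return hashlib.sha1(content).hexdigest()
```

Because a fragment's GUID is derived from its content, anyone who retrieves the fragment can recompute the hash and check that the data has not been altered.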
Access Control • Reader restriction • Encrypt all data • Revocation • Delete all replicas • Encrypt all replicas with a new key • A server can use old keys to access cached old data
Access Control • Writer restriction • Writes are signed • Reads are restricted at clients • Writes are restricted at servers
Data Location and Routing • Objects can reside on any of the OceanStore servers • Use query routing to locate objects
Distributed Routing in OceanStore • Every object is identified by one or more GUIDs • Different replicas of the same object have the same GUID • OceanStore messages are labeled with • A destination GUID (built on top of IP) • A random number • A small predicate
Bloom Filters • Based on the idea of hill-climbing • If a query cannot be satisfied by a server, local information is used to route the query to a likely neighbor • Via a modified version of a Bloom filter
Bloom Filter • A Bloom filter • Represents a set S = {S1, …, Sn} • Is stored as an m-bit array, filter[m], initially all zeros • Uses r independent hash functions • h1…hr • for i = 1…n • for j = 1…r • filter[hj(Si)] = 1
Insertion Example • m = 6, r = 3 • To insert word x • h1(x) = 0 • h2(x) = 3 • h3(x) = 5 • filter[] = {1, 0, 0, 1, 0, 1}
Insertion Example • m = 6, r = 3 • To insert word y • h1(y) = 1 • h2(y) = 3 • h3(y) = 5 • filter[] = {1, 1, 0, 1, 0, 1}
Testing Example • filter[] = {1, 1, 0, 1, 0, 1} • Does x belong to the set? • filter[h1(x)] = filter[0] = 1 • filter[h2(x)] = filter[3] = 1 • filter[h3(x)] = filter[5] = 1 • All three bits are 1, so x is probably in the set • Does z belong to the set? • filter[h1(z)] = filter[2] = 0, so z is not in the set • filter[h2(z)] = filter[3] = 1 • filter[h3(z)] = filter[5] = 1
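The insertion and testing examples can be replayed in a few lines. This is a sketch that hard-codes the hash values from the slides in a lookup table (`HASHES`, a hypothetical stand-in for real hash functions h1…h3).

```python
M = 6  # filter size in bits

# Hash positions (h1, h2, h3) taken from the examples above.
HASHES = {"x": (0, 3, 5), "y": (1, 3, 5), "z": (2, 3, 5)}


def insert(filt, word):
    """Set every bit that the word hashes to."""
    for pos in HASHES[word]:
        filt[pos] = 1


def might_contain(filt, word):
    """True only if every hashed bit is set (may be a false positive)."""
    return all(filt[pos] == 1 for pos in HASHES[word])


filt = [0] * M
insert(filt, "x")
after_x = list(filt)  # snapshot after the first insertion
insert(filt, "y")
```

Here `might_contain(filt, "z")` is false because bit 2 was never set, while a word hashing to positions 0, 1 and 3 would be reported as present even though it was never inserted.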
False Positives • If any filter[hj(x)] = 0, x is definitely not in S • If every filter[hj(x)] = 1, x is probably in S • False positive rate depends on • Number of hash functions • Array size • Number of unique elements in S
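The dependence on these three quantities is usually summarized by the standard approximation p ≈ (1 − e^(−rn/m))^r, which assumes independent, uniformly distributed hash functions. The formula is not stated in the slides; the sketch below simply evaluates it, and `false_positive_rate` is a hypothetical helper name.

```python
import math


def false_positive_rate(m, r, n):
    """Approximate false-positive probability of a Bloom filter.

    m: array size in bits, r: number of hash functions,
    n: number of distinct elements inserted.
    """
    return (1.0 - math.exp(-r * n / m)) ** r
```

For a fixed number of elements, enlarging the array lowers the rate; the approximation is rough for very small filters such as the 6-bit example.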
Attenuated Bloom Filters • An attenuated Bloom filter of depth D is an array of D normal Bloom filters • ith Bloom filter is the union of all the Bloom filters for all of the nodes at a distance i • One filter per network edge
Attenuated Bloom Filters • Lookup 11010
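A lookup over attenuated Bloom filters can be sketched as below, assuming each filter is stored as an integer bit mask and the query is the bit pattern of the object's hashes (here 11010). The edge names and filter contents are made up for illustration.

```python
def first_matching_level(attenuated, query_bits):
    """Return the shallowest level whose filter has every query bit set.

    Level i summarizes the nodes at distance i along this edge.
    Returns None if no level matches.
    """
    for level, filt in enumerate(attenuated, start=1):
        if filt & query_bits == query_bits:
            return level
    return None


def choose_edge(edges, query_bits):
    """Pick the outgoing edge whose filter matches at the smallest distance."""
    best_edge, best_level = None, None
    for name, attenuated in edges.items():
        level = first_matching_level(attenuated, query_bits)
        if level is not None and (best_level is None or level < best_level):
            best_edge, best_level = name, level
    return best_edge, best_level


# Hypothetical depth-2 attenuated filters, one per network edge.
edges = {
    "to_A": [0b10010, 0b11011],
    "to_B": [0b01100, 0b00111],
}
result = choose_edge(edges, 0b11010)
```

With these values only the second-level filter on the edge to A contains all the query bits, so the query is forwarded toward A with an estimated distance of 2. If no edge matches, the query falls back to the slower deterministic algorithm.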
The Global Algorithm: Wide-Scale Distributed Data Location • Plaxton’s randomized hierarchical distributed data structure • Resolve one digit of the node id at a time
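Digit-by-digit resolution can be sketched as below. The sketch assumes equal-length hexadecimal node IDs that are matched from the rightmost digit, and it uses a global node list in place of the per-node neighbor tables a real deployment would keep; the node IDs are made up.

```python
def shared_suffix_len(a, b):
    """Number of trailing digits that two equal-length IDs have in common."""
    n = 0
    while n < len(a) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def route(nodes, start, dest):
    """Route from start to dest, matching more of dest's suffix on every hop."""
    path = [start]
    current = start
    while current != dest:
        have = shared_suffix_len(current, dest)
        # Nodes that match strictly more trailing digits than the current one.
        candidates = [n for n in nodes if shared_suffix_len(n, dest) > have]
        # Prefer the smallest improvement, ideally exactly one more digit.
        current = min(candidates, key=lambda n: shared_suffix_len(n, dest))
        path.append(current)
    return path


nodes = ["0325", "B4F8", "9098", "7598", "4598"]
path = route(nodes, "0325", "4598")
```

Each hop fixes at least one more digit, so the path length is bounded by the number of digits in an ID, which grows logarithmically with the size of the ID space.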
Achieving Locality • Each new replica only needs to traverse O(log(n)) hops to reach the root, where n is the number of the servers
Achieving Fault Tolerance • Avoid failures at roots • Each root GUID is hashed with a small number of different salt values • Make it difficult to target a single GUID for DoS attacks • If failures are detected, just jump to any node to reach the root • OceanStore continually monitors and repairs broken pointers
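Salting can be sketched as hashing the GUID together with each salt to obtain several independent root identifiers. SHA-1, the one-byte salt encoding, and the name `root_ids` are assumptions for illustration.

```python
import hashlib


def root_ids(guid, salts=(0, 1, 2)):
    """Derive one root identifier per salt value for the given GUID bytes."""
    return [hashlib.sha1(guid + bytes([s])).hexdigest() for s in salts]


roots = root_ids(b"example-object-guid")
```

An attacker would have to disable the roots for every salt to make the object unlocatable, and those roots land on unrelated nodes.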
Advantages of Distributed Information • Redundant paths to roots • Scalable with a combination of probabilistic and global algorithms • Easy to locate and recover failed components • Plaxton links form a natural substrate for admission controls and multicasts
Achieving Maintenance-Free Operation • Recursive node insertion and removal • Replicated roots • Use beacons to detect faults • Time-to-live fields to update routes • Second-chance algorithm to avoid false diagnoses of failed components • Avoid the cost of recovering lost nodes • Automatic reconstruction of data for failed servers
Update Model • Conflict resolution update model • Challenge: • Untrusted infrastructure • Access only to ciphertext
Update Format and Semantics • An update: a list of predicates, each associated with a list of actions • If any of the predicates evaluates to true, the actions associated with the earliest true predicate are applied atomically • Everything is logged
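The predicate/action semantics can be sketched as follows. The object is modeled as a plain dictionary, atomicity is approximated by applying actions to a copy, and the sketch assumes an update aborts when no predicate holds; `apply_update` and the example predicates are hypothetical.

```python
import copy


def apply_update(obj, update, log):
    """Apply the actions of the earliest true predicate, all-or-nothing.

    `update` is a list of (predicate, actions) pairs. Returns True on
    commit and False on abort; both outcomes are logged.
    """
    for predicate, actions in update:
        if predicate(obj):
            staged = copy.deepcopy(obj)  # work on a copy for atomicity
            for action in actions:
                action(staged)
            obj.clear()
            obj.update(staged)
            log.append("commit")
            return True
    log.append("abort")
    return False


def version_is_3(o):
    return o["version"] == 3


def append_block(o):
    o["blocks"].append("c")


def bump_version(o):
    o["version"] += 1


doc = {"version": 3, "blocks": ["a", "b"]}
log = []
update = [(version_is_3, [append_block, bump_version])]
first = apply_update(doc, update, log)
second = apply_update(doc, update, log)  # version is now 4, so this aborts
```

Replaying the same update a second time fails its version check, which is how a predicate such as compare-version detects a conflicting concurrent write.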
Extending the Model to Work over Ciphertext • Supported predicates • Compare version (unencrypted metadata) • Compare size (unencrypted metadata) • Compare block • Compare a hash of the encrypted block • Search • Returns only yes/no • Cannot be initiated by the server • Replace/insert/delete/append block
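The compare-block predicate can be evaluated without decrypting anything: the client submits the hash of the ciphertext it expects, and the server hashes the ciphertext it stores. A minimal sketch, assuming SHA-256 as the hash and made-up ciphertext bytes; `compare_block` is a hypothetical name.

```python
import hashlib


def compare_block(stored_blocks, index, expected_hash):
    """Server side: check a stored ciphertext block against a client-supplied hash."""
    actual = hashlib.sha256(stored_blocks[index]).hexdigest()
    return actual == expected_hash


# The server only ever sees opaque ciphertext (illustrative bytes).
stored = [b"\x8f\x02\x11", b"\xaa\x10\x7c"]

# Client side: hash of the ciphertext block the client believes is current.
expected = hashlib.sha256(b"\xaa\x10\x7c").hexdigest()
```

The server learns only whether the block matches, not what the plaintext is.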
Serializing Updates in an Untrusted Infrastructure • Use a small primary tier of replicas to serialize updates • Minimize communication • Meanwhile, a secondary tier of replicas optimistically propagates updates among its members • The final ordering from the primary tier is multicast to the secondary replicas
A Direct Path to Clients and Archival Storage • Updates flow directly from a client to the primary tier, where they are serialized and then multicast to the secondary servers • Updates are tightly coupled with archival • Archival fragments are generated at serialization time and distributed with updates
Efficiency of the Consistency Protocol • For updates > 4Kbytes, network overhead < 100% • Approximate latency per update < 1 second
Deep Archival Storage • Erasure encoded block fragments • Use small and widely distributed fragments to increase reliability • Administrative domains are ranked by their reliability and trustworthiness • Avoid locations with correlated failures
The OceanStore API • Session: a sequence of reads and writes to potentially different objects • Session guarantees: define the level of consistency • Updates • Callback: for user defined events (commit) • Façade: an interface to the conventional API • UNIX file system, transactional databases, WWW gateways
Introspection • Observation modules monitor the activity of a running system and track system behavior • Optimization modules adjust the computation • (Figure: cycle of computation, observation, optimization)
Uses of Introspection • Cluster recognition • Identify related files • Replica management • Adjust replication factors • Migrate floating replicas
Related Work • Space/time trade-offs in hash coding with allowable errors. In Communications of the ACM, 13(7), pp. 422-426, July 1970 • The Bayou architecture: Support for data sharing among mobile users. In Proc. of IEEE Workshop on Mobile Computing Systems and Applications, Dec 1994
Related Work • A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software Practice and Experience, 27(9), pp. 995-1012, September 1997 • Accessing nearby copies of replicated objects in a distributed environment. In Proc. of ACM SPAA, June 1997 • Search on encrypted data. IEEE SRSP, May 2000