Pond is a prototype of OceanStore, an Internet-scale cooperative file system that aims for durability, universal accessibility, and privacy on an untrusted infrastructure. It uses a two-tiered storage architecture in which powerful, well-connected hosts manage primary replicas while less powerful hosts, such as user workstations, contribute storage for secondary copies. Key features include versioned data objects that simplify consistency and allow recovery of earlier versions, decentralized location-independent routing through Tapestry, Byzantine agreement on updates, and proactive threshold signatures that allow inner ring membership to change.
POND: THE OCEANSTORE PROTOTYPE • S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, J. Kubiatowicz • U. C. Berkeley
Key Ideas • Versioning file system • Location independent routing • Uses hashes instead of addresses • Mapping is done through Tapestry • Byzantine update commitment • By nodes holding primary copies (inner ring) • Proactive threshold signatures allow inner ring membership updates
Key Ideas • Push-based update of other copies • Through an overlay multicast network • Copies are not permanent • Continuous archiving in erasure-coded form • Very reliable • Very slow access
Motivation • Find a better solution for long-term management of data • Enabling trends: • Near universal connectivity through high-bandwidth links • Very fast increase of disk storage capacity per unit cost
OceanStore • Internet-scale cooperative file system • Will provide • High durability • Universal accessibility • Will use a two-tiered storage system • Stores data objects
Two-tiered organization • Upper tier • Powerful, well-connected hosts • Serialize changes and archive results • Lower tier • Less powerful hosts • Can be user workstations • Provide storage resources
Two-tiered organization • [Figure: archive, primary replica (in inner ring), and three secondary replicas]
Basic requirements • OceanStore should • Let information be accessed from any location • Balance the tension between privacy and information sharing • Offer an easily understandable and usable model of data consistency • Guarantee data integrity
First basic assumption • Infrastructure cannot be trusted, except in aggregate • Hosts and routers can fail arbitrarily • Must consider • Passive failures: host snooping, … • Active failures: host injecting malicious messages, …
Second basic assumption • Infrastructure is continuously changing • Performance of communication paths varies • Resources enter and leave the network without warning • System should • Be at least self-organizing and self-repairing • Aim to be self-tuning
The challenge • Build a system that provides • An expressive user interface • High data availability • High data durability • High data privacy and integrity • All atop an untrusted and ever-changing base • More ambitious than FARSITE
The data model • OceanStore data object • Similar to a traditional file • Ordered sequence of read-only versions • Versioning • Simplifies consistency issues • Allows recovery of previous versions • Identical blocks are shared among versions
Data object implementation (I) • Each data object has an AGUID (Active Globally-Unique Identifier) • Secure hash of application-level name and public key of owner • Each version has a VGUID (Version GUID) • BGUID of root block of a version • Each block has a BGUID (Block GUID) • Secure hash of block contents
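The three identifiers above can be sketched in a few lines. This is a minimal illustration, not Pond's code: the choice of SHA-256, the way the name and the owner's key are concatenated, the root-block layout, and the names `guid`, `bguid`, `aguid`, and `vguid_for` are all assumptions made for the example.

```python
import hashlib
from typing import List


def guid(data: bytes) -> str:
    """Secure hash used as a GUID (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def bguid(block: bytes) -> str:
    """BGUID: secure hash of the block contents."""
    return guid(block)


def aguid(name: str, owner_key: bytes) -> str:
    """AGUID: secure hash of the application-level name and the owner's key."""
    return guid(name.encode("utf-8") + owner_key)


def vguid_for(data_blocks: List[bytes]) -> str:
    """VGUID: BGUID of a (hypothetical) root block listing the data BGUIDs."""
    root_block = "\n".join(bguid(b) for b in data_blocks).encode("ascii")
    return bguid(root_block)


# Hypothetical object name and key bytes, for illustration only.
object_id = aguid("/home/alice/notes.txt", b"alice-key-bytes")
v1 = vguid_for([b"hello ", b"world"])
v2 = vguid_for([b"hello ", b"there"])  # the first block is shared with v1
```

Because a BGUID depends only on block contents, the unchanged block `b"hello "` has the same BGUID in both versions, which is what lets identical blocks be shared among versions.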
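A data object • [Figure: an AGUID points to versions VGUIDi and VGUIDi+1; each version has a root block above indirect blocks and data blocks, with copy-on-write (COW) at the indirect-block and data-block levels]
Data object implementation (II) • AGUID, VGUID and BGUID are location-transparent • OceanStore relies on a lower-level service to map GUIDs into addresses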
Application-level consistency (I) • Updating an object means creating a new version • Updates are • Atomic • Represented as an array of potential actions each guarded by a predicate
Application-level consistency (II) • Actions can be • Appending data • Replacing bytes at a specific address • Predicates can be • Checking the latest version number of the object • Verifying values of bytes at a specific address
Application-level consistency (III) • Predicate and action model • Allows implementing multiple levels of consistency • Atomic transactions satisfying ACID properties for database applications • Weaker consistency for mailboxes
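The predicate-and-action model described in these slides could be sketched as follows. This is an illustrative toy, not Pond's API: the names `Version`, `version_is`, `bytes_equal`, `append`, `replace_bytes`, and `apply_update` are hypothetical, and the rule that the first guard that holds wins is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Version:
    number: int
    data: bytes


Predicate = Callable[[Version], bool]
Action = Callable[[bytes], bytes]


def version_is(n: int) -> Predicate:
    """Guard: the latest version number of the object is n."""
    return lambda v: v.number == n


def bytes_equal(offset: int, expected: bytes) -> Predicate:
    """Guard: the bytes at a specific address have the expected values."""
    return lambda v: v.data[offset:offset + len(expected)] == expected


def append(extra: bytes) -> Action:
    return lambda d: d + extra


def replace_bytes(offset: int, new: bytes) -> Action:
    return lambda d: d[:offset] + new + d[offset + len(new):]


def apply_update(versions: List[Version],
                 update: List[Tuple[Predicate, Action]]) -> bool:
    """Apply the first action whose guard holds; commit it as a new version.

    Returns False, leaving the object unchanged, when no guard holds.
    Older versions are never modified.
    """
    latest = versions[-1]
    for predicate, action in update:
        if predicate(latest):
            versions.append(Version(latest.number + 1, action(latest.data)))
            return True
    return False


history = [Version(1, b"abc")]
committed = apply_update(history, [(version_is(1), append(b"def"))])
stale = apply_update(history, [(version_is(1), append(b"!"))])
```

The second update is guarded by a version check that no longer holds, so it aborts. A database-style client can use such strict guards, while a mailbox can use a guard that always holds and simply append.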
A footnote • ACID properties of atomic transactions mean that atomic transactions • Are Atomic • Bring the database from one consistent state to another consistent state • Isolate their partial results until the transaction is completed • Guarantee the durability of final result
Virtualization through Tapestry • OceanStore messages are addressed with a GUID • Tapestry forwards these messages to a host containing a resource with that GUID • Fully decentralized service • Hosts can • Join Tapestry by supplying their GUID • Publish the GUIDs of the resources they have
Replication and consistency (I) • Each object has a single primary replica • Primary replica • Serializes and applies all updates • Creates a certificate (heartbeat) mapping the AGUID of the object to the VGUID of its latest version • Controls access to the object • …
Replication and consistency (II) • Heartbeat contains • An AGUID • A VGUID • A timestamp • A version sequence number • Getting the most recent version of object means getting its most recent heartbeat
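A heartbeat with the four fields listed above, and the "most recent heartbeat" lookup, might look like this. It is a sketch under stated assumptions: the field types and the helper `latest_heartbeat` are hypothetical, and verification of the heartbeat's signature is deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Heartbeat:
    aguid: str       # which object
    vguid: str       # its latest version
    timestamp: float
    sequence: int    # version sequence number


def latest_heartbeat(aguid: str,
                     heartbeats: Iterable[Heartbeat]) -> Optional[Heartbeat]:
    """Return the heartbeat with the highest sequence number for this object.

    A real client would first check each heartbeat's signature;
    that step is omitted in this sketch.
    """
    candidates = [h for h in heartbeats if h.aguid == aguid]
    return max(candidates, key=lambda h: h.sequence, default=None)


seen = [
    Heartbeat("obj-A", "version-1", 100.0, 1),
    Heartbeat("obj-B", "version-9", 105.0, 9),
    Heartbeat("obj-A", "version-2", 110.0, 2),
]
current = latest_heartbeat("obj-A", seen)
```

Here `current.vguid` names the version to fetch; an object with no known heartbeat yields `None`.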
The inner ring • Small set of co-operating servers that manage primary replicas • Implement a Byzantine fault-tolerant protocol to • Agree on all updates to an object • Digitally sign the result
Archival storage • Stores object versions that are not frequently accessed • Uses erasure codes • Each block • Partitioned into m fragments • Encoded into n > m fragments • Any subset of m fragments suffices to reconstitute the block
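The "any m of n fragments" property can be illustrated with the simplest possible erasure code: m data fragments plus one XOR parity fragment, so n = m + 1. This is a stand-in chosen for brevity, not the code the prototype uses, and the function names `encode` and `decode` are hypothetical.

```python
from functools import reduce
from typing import Dict, List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block: bytes, m: int) -> List[bytes]:
    """Split a block into m equal fragments and add one XOR parity fragment."""
    size = -(-len(block) // m)                 # ceiling division
    padded = block.ljust(size * m, b"\0")      # zero-pad the last fragment
    fragments = [padded[i * size:(i + 1) * size] for i in range(m)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]                # index m holds the parity


def decode(available: Dict[int, bytes], m: int, length: int) -> bytes:
    """Rebuild the block from any m of the m + 1 fragments."""
    missing = [i for i in range(m) if i not in available]
    if len(missing) > 1 or (missing and m not in available):
        raise ValueError("need at least m fragments")
    frags = dict(available)
    if missing:
        # XOR of all surviving fragments (data and parity) yields the lost one.
        others = [frags[i] for i in range(m + 1) if i != missing[0]]
        frags[missing[0]] = reduce(xor_bytes, others)
    return b"".join(frags[i] for i in range(m))[:length]


block = b"OceanStore archival block"
fragments = encode(block, m=4)                 # n = 5 fragments
survivors = {i: f for i, f in enumerate(fragments) if i != 2}
restored = decode(survivors, m=4, length=len(block))
```

A single parity fragment only survives one loss. Codes with a larger n relative to m tolerate more lost fragments at the price of more storage and slower reconstruction, which fits the "very reliable, very slow access" trade-off noted earlier.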
Caching of data objects • Retrieving data from the archive is slow • OceanStore also maintains copies of whole blocks • Secondary replicas • Heartbeats always come from the primary replica • Updates of secondary replicas are done through a dissemination tree
Path of an OceanStore update • [Figure: application, primary replica in inner ring, archive, and three secondary replicas]
Updating primary replicas (I) • Use a Byzantine fault-tolerant protocol • Tolerates up to f failures in a system made up of 3f + 1 hosts • Protocol authenticates messages with symmetric-key message authentication codes (MACs) • Faster than using public keys • Complicates the Byzantine agreement protocol
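The 3f + 1 sizing rule is easy to check with a little arithmetic. The helper names below are hypothetical, and the 2f + 1 quorum size is the usual figure for Castro-Liskov-style Byzantine agreement rather than something stated on the slide.

```python
def ring_size(f: int) -> int:
    """Smallest inner ring that tolerates f arbitrarily faulty hosts."""
    return 3 * f + 1


def max_faulty(n: int) -> int:
    """Largest number of faulty hosts that a ring of n hosts can tolerate."""
    return (n - 1) // 3


def quorum(f: int) -> int:
    """Matching replies needed so that correct hosts outnumber faulty ones."""
    return 2 * f + 1


for f in (1, 2, 3):
    print(f"f={f}: ring of {ring_size(f)} hosts, quorum of {quorum(f)}")
```

Note that growing a ring from 4 to 6 hosts does not buy any extra fault tolerance: `max_faulty(6)` is still 1, and a seventh host is needed to tolerate two failures.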
Updating primary replicas (II) • Solution was to use • Symmetric keys for all communications within the inner ring • Public keys to communicate with all other machines
Proactive threshold signatures • (listen to lecture)
Prototype software architecture • [Figure: component labels are Application, Client interface, Inner ring, Byzantine agreement, Dissemination tree/replicas, Archive, Tapestry, and Network (Java NBIO)]
The prototype • Written in Java