
OceanStore An Architecture for Global-scale Persistent Storage


Presentation Transcript


  1. OceanStoreAn Architecture for Global-scale Persistent Storage By John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao http://oceanstore.cs.berkeley.edu Presented by Yongbo Wang, Hailing Yu

  2. Ubiquitous Computing

  3. OceanStore Overview • A global-scale utility infrastructure • Internet-based, distributed storage system for information appliances such as computers, PDAs, cellular phones, … • It is designed to support 10^10 users, each having 10^4 data files (over 10^14 files in total)

  4. OceanStore Overview (cont) • Automatically recovers from server and network failures • Utilizes redundancy and client-side cryptographic techniques to protect data • Allows replicas of a data object to exist anywhere, at any time • Incorporates new resources • Adjusts to usage patterns

  5. OceanStore • Two unique design goals: • Ability to be constructed from an untrusted infrastructure • Servers may crash • Information can be stolen • Support of nomadic data • Data can be cached anywhere, anytime (promiscuous caching) • Data is separated from its physical location

  6. Underlying Technologies • Naming • Access control • Data Location and Routing • Data Update • Deep Archival Storage • Introspection

  7. Naming • Objects are identified by a globally unique identifier (GUID) • Different objects in OceanStore use different mechanisms to generate their GUIDs

  8. Underlying Technologies • Naming • Access control • Data Location and Routing • Data Update • Deep Archival Storage • Introspection

  9. Access Control • Reader Restriction • Encrypt the data that is not public • Distribute the encryption key to users having read permission • Writer Restriction • The owner of an object can define an access control list (ACL) for the object • All writes are verified by well-behaved servers and clients based on the ACL.
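
To make the reader-restriction idea concrete, here is a minimal sketch using symmetric encryption from the third-party `cryptography` package (Fernet). The helper names are hypothetical, and OceanStore's actual key-distribution machinery is more involved than this:

```python
# Minimal sketch of reader restriction: non-public data is stored encrypted,
# and only readers who were handed the key can decrypt it.
from cryptography.fernet import Fernet

def encrypt_object(plaintext: bytes) -> tuple[bytes, bytes]:
    """Encrypt object contents; the key is distributed only to permitted readers."""
    key = Fernet.generate_key()
    ciphertext = Fernet(key).encrypt(plaintext)
    return key, ciphertext

def read_object(key: bytes, ciphertext: bytes) -> bytes:
    """A reader holding the distributed key can recover the plaintext."""
    return Fernet(key).decrypt(ciphertext)

key, stored = encrypt_object(b"private document contents")
assert read_object(key, stored) == b"private document contents"
```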

  10. Underlying Technologies • Naming • Access control • Data Location and Routing • Data Update • Deep Archival Storage • Introspection

  11. Data Location and Routing • Provides necessary service to route messages to their destinations and to locate objects in the system • Works on top of IP

  12. Data Location and Routing • Each object in the system is identified by a globally unique identifier, GUID (a pseudo-random, fixed-length bit string) • An object’s GUID is a secure hash over the object’s contents • OceanStore uses a 160-bit SHA-1 hash, for which the probability that two out of 10^14 objects hash to the same value is approximately 1 in 10^20.
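
As a rough illustration of this GUID construction, a 160-bit contents-hashed identifier can be computed with Python's standard hashlib (illustrative only; as slide 7 notes, different OceanStore objects derive their GUIDs in different ways):

```python
# Sketch: deriving a 160-bit GUID as the SHA-1 hash of an object's contents.
import hashlib

def object_guid(contents: bytes) -> str:
    """Return the object's GUID as a 40-hex-digit (160-bit) SHA-1 digest."""
    return hashlib.sha1(contents).hexdigest()

print(object_guid(b"Uncle John's Band"))  # a fixed-length, pseudo-random-looking string
```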

  13. Data Location and Routing • In OceanStore system, entities that are accessed frequently are likely to reside close to where they are being used • Two-tiered approach: • First use a fast probabilistic algorithm • If necessary, use a slower but reliable hierarchical algorithm

  14. Probabilistic algorithm • Each server has a set of neighbors, chosen from servers closest to it in network latency • A server associates with each neighbor a probability of finding each object in the system through that neighbor • This association is maintained in constant space using an attenuated Bloom filter

  15. Bloom Filters • An efficient, lossy way of describing sets • A Bloom filter is a bit-vector of length w with a family of hash functions • Each hash function maps the elements of the represented set to an integer in [0, w) • To form a representation of a set, each set element is hashed, and the bits in the vector corresponding to the hash functions’ results are set

  16. Bloom Filters • To check if an element is in the set • Element is hashed • Corresponding bits in the filter are checked - If any of the bits are not set, the element is not in the set - If all bits are set, the element may be in the set • The element may not be in the set even if all of the hashed bits are set (false positive) • The false-positive rate of a Bloom filter is a function of its width, the number of hash functions, and the cardinality of the represented set
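
A minimal Bloom filter sketch matching the description above; the width and the salted-SHA-1 hash family are chosen arbitrarily for illustration:

```python
# Minimal Bloom filter sketch: a w-bit vector with n hash functions.
import hashlib

class BloomFilter:
    def __init__(self, width: int = 64, num_hashes: int = 3):
        self.width = width
        self.num_hashes = num_hashes
        self.bits = [False] * width

    def _positions(self, item: str):
        # Derive n hash values by salting SHA-1 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

f = BloomFilter()
f.add("Uncle John's Band")
assert f.might_contain("Uncle John's Band")   # always True once added
print(f.might_contain("Dark Star"))           # usually False; may be a false positive
```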

  17. A Bloom Filter: To check an object’s name against a Bloom filter summary, the name is hashed with n different hash functions (here, n=3) and bits corresponding to the result are checked

  18. Attenuated Bloom Filters • An attenuated Bloom filter of depth d is an array of d normal Bloom filters • For each neighbor link, an attenuated Bloom filter is kept • The k-th Bloom filter in the array is the merger of the Bloom filters of all nodes k hops away through any path starting with that neighbor link

  19. Attenuated Bloom Filter for the outgoing link AB: In F_AB, the document “Uncle John’s Band” would map to potential value 1/4 + 1/8 = 3/8.
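
Building on the Bloom filter sketch above, here is one plausible reading of the attenuated filter and its "potential value": a match at level k is weighted 1/2^k, so that a document visible at levels 2 and 3 yields 1/4 + 1/8 = 3/8 as in the figure. The exact weighting is an assumption for illustration, not taken from the paper:

```python
# Sketch of an attenuated Bloom filter of depth d: one Bloom filter per level,
# where level k summarizes documents reachable k hops away through this link.
# A match at level k contributes 1/2**k to the "potential value" (illustrative).
class AttenuatedBloomFilter:
    def __init__(self, depth: int = 3, width: int = 64, num_hashes: int = 3):
        self.levels = [BloomFilter(width, num_hashes) for _ in range(depth)]

    def add(self, item: str, hops: int) -> None:
        """Record an item stored `hops` hops away (levels are 1-indexed)."""
        self.levels[hops - 1].add(item)

    def potential(self, item: str) -> float:
        """Sum 1/2**k over every level k whose filter matches the item."""
        return sum(1.0 / 2 ** (k + 1)
                   for k, level in enumerate(self.levels)
                   if level.might_contain(item))

f_ab = AttenuatedBloomFilter()
f_ab.add("Uncle John's Band", hops=2)
f_ab.add("Uncle John's Band", hops=3)
print(f_ab.potential("Uncle John's Band"))  # 0.25 + 0.125 = 0.375
```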

  20. The Query Algorithm • The querying node examines the 1st level of each of its neighbors’ filters • If matches are found, the query is forwarded to the closest matching neighbor • If no filter matches, the querying node examines the next level of each filter at each step and forwards the query if a match is found

  21. The probabilistic query process: n1 is looking for object X, which is hashed to bits 0, 1, and 3.

  22. Probabilistic location and routing • A filter of depth d stores information about servers up to d hops from the server • If a query reaches a server d hops away from its source due to a false positive, it is not forwarded further • In this case, the probabilistic algorithm gives up and hands the query off to the deterministic algorithm (a sketch of the full query loop follows)
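
Putting the last three slides together, a rough sketch of the query loop, using a hypothetical Node structure and the attenuated-filter sketch above (the query depth is assumed to equal the filters' depth):

```python
class Node:
    """Hypothetical server: names of locally stored objects, plus an attenuated
    Bloom filter (as sketched above) for each outgoing neighbor link."""
    def __init__(self, name: str):
        self.name = name
        self.local_objects = set()
        self.neighbors = []   # list of (neighbor Node, AttenuatedBloomFilter) pairs

def probabilistic_query(start: Node, object_name: str, depth: int):
    """Follow matching filters for at most `depth` hops; return the node holding
    the object, or None to fall back to the deterministic (Tapestry) algorithm."""
    current = start
    for _ in range(depth):
        if object_name in current.local_objects:
            return current                         # replica found locally
        next_hop = None
        for level in range(depth):                 # examine shallower levels first
            for neighbor, filt in current.neighbors:
                if filt.levels[level].might_contain(object_name):
                    next_hop = neighbor
                    break
            if next_hop is not None:
                break
        if next_hop is None:
            return None                            # no filter matched anywhere
        current = next_hop
    if object_name in current.local_objects:
        return current
    return None              # false positives led us astray: use Tapestry instead
```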

  23. Deterministic location and routing Tapestry: OceanStore’s self-organizing routing and object location subsystem • IP overlay network with a distributed, fault tolerant architecture • A query is routed from node to node until the location of a replica is discovered

  24. Tapestry • A hierarchical distributed data structure • Every server is assigned a random and unique node-ID • The node-IDs are then used to construct a mesh of neighbor links

  25. Tapestry • Every node is connected to other nodes via neighbor links of various levels • Level-1 edges connect to a set of nodes closest in network latency with different values in the lowest digit of their node-IDs • Level-2 edges connect to the closest nodes that match in the lowest digit and differ in the second digit, and so on

  26. Tapestry • Each node has a neighbor map with multiple levels • For example, the 9th entry of the 4th level for node 325AE is the node closest to 325AE which ends in 95AE • Messages are routed to the destination ID digit by digit • ***8 => **98 => *598 => 4598
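
A toy sketch of this digit-by-digit suffix routing; the flat list of known node IDs stands in for Tapestry's real per-node neighbor maps, and surrogate routing for holes in the ID space is omitted:

```python
# Sketch of Tapestry-style suffix routing: at each hop, move to a node whose ID
# shares one more trailing digit with the destination than the current node does.
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits two node IDs have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def route(source: str, dest: str, all_nodes: list[str]) -> list[str]:
    """Route digit by digit toward dest, e.g. 0325 -> ***8 -> **98 -> *598 -> 4598."""
    path, current = [source], source
    while current != dest:
        matched = shared_suffix_len(current, dest)
        # Stand-in for the neighbor map: any known node extending the suffix by one digit.
        next_hop = next((n for n in all_nodes
                         if shared_suffix_len(n, dest) == matched + 1), None)
        if next_hop is None:
            break            # hole in the ID space; real Tapestry uses surrogate routing
        current = next_hop
        path.append(current)
    return path

nodes = ["0325", "B4F8", "9098", "7598", "4598"]
print(route("0325", "4598", nodes))   # ['0325', 'B4F8', '9098', '7598', '4598']
```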

  27. Neighbor Map for Tapestry node 0642

  28. Tapestry routing example: A potential path for a message originating at node 0325 destined for node 4598

  29. Tapestry • Each object is associated with a location root through a deterministic mapping function • To advertise an object o, the server s storing the object sends a publish message toward the object’s root, leaving location pointers at each hop

  30. Tapestry routing example: To publish an object, the server storing the object sends a publish message toward the object’s root (e.g. node 4598), leaving location pointers at each node

  31. Locating an object • To locate an object, a client sends a message toward the object’s root. When the message encounters a pointer, it routes directly to the object • It has been proven that Tapestry routes the request to an asymptotically optimal node (in terms of shortest-path network distance) containing a replica

  32. Tapestry routing example: To locate an object, node 0325 sends a message toward the object’s root (e.g. node 4598)
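
A toy sketch of publish and locate with location pointers, reusing route() from the routing sketch above: publishing walks toward the object's root and drops a pointer at each hop, and locating walks toward the root until it meets a pointer. The pointer table and all names are hypothetical:

```python
# Toy sketch of Tapestry publish/locate using location pointers.
location_pointers: dict[str, dict[str, str]] = {}   # node ID -> {object GUID: server ID}

def publish(server_id: str, object_guid: str, root_id: str, all_nodes: list[str]) -> None:
    """Walk from the storing server toward the object's root, leaving a pointer at each hop."""
    for hop in route(server_id, root_id, all_nodes):
        location_pointers.setdefault(hop, {})[object_guid] = server_id

def locate(client_id: str, object_guid: str, root_id: str, all_nodes: list[str]):
    """Walk toward the root; return the storing server as soon as a pointer is found."""
    for hop in route(client_id, root_id, all_nodes):
        server = location_pointers.get(hop, {}).get(object_guid)
        if server is not None:
            return server
    return None

nodes = ["0325", "B4F8", "9098", "7598", "4598"]
publish("B4F8", "guid-42", "4598", nodes)
print(locate("0325", "guid-42", "4598", nodes))   # finds the pointer part-way to the root
```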

  33. Data Location and Routing • Fault tolerance: • Tapestry uses redundant neighbor pointers when it detects a primary route failure • Uses periodic UDP probes to check link conditions • Tapestry deterministically chooses multiple root nodes for each object

  34. Data Location and Routing • Automatic repair: • Node insertions: • A new node needs the address of at least one existing node • It then starts advertising its services and the roles it can assume to the system through the existing node • Exiting nodes: • If possible, the exiting node runs a shutdown script to inform the system • In any case, neighbors will detect its absence and update routing tables accordingly

  35. Underlying Technologies • Naming • Access control • Data Location and Routing • Data Update • Deep Archival Storage • Introspection

  36. Updates • Updates are made by clients and all updates are logged • OceanStore allows concurrent updates • Serializing updates: • Since the infrastructure is untrusted, using a master replica will not work • Instead, a group of servers called the inner ring is responsible for choosing the final commit order

  37. Update commitment • The inner ring is a group of servers working on behalf of an object • It consists of a small number of highly-connected servers • Each object has an inner ring, which can be located through Tapestry

  38. Inner ring • An object’s inner ring: • Generates new versions of the object from client updates • Generates encoded, archival fragments and distributes them • Provides the mapping from the active GUID to the GUID of the most recent version of the object • Verifies a data object’s legitimate writers • Maintains an update history, providing an undo mechanism

  39. Update commitment • Each inner ring makes its decisions through a Byzantine agreement protocol • Byzantine agreement lets a group of 3n+1 servers reach an agreement whenever no more than n of them are faulty
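
The 3n+1 bound can be restated as: an inner ring of N servers tolerates at most ⌊(N−1)/3⌋ faulty members. A tiny helper makes the check explicit (a minimal sketch, not OceanStore code):

```python
# The slide's bound: 3n + 1 servers tolerate up to n Byzantine (faulty) members.
def max_faulty(ring_size: int) -> int:
    """Largest number of faulty servers an inner ring of this size can tolerate."""
    return (ring_size - 1) // 3

def satisfies_byzantine_assumption(ring_size: int, faulty: int) -> bool:
    """True if agreement is still guaranteed with `faulty` malicious servers."""
    return faulty <= max_faulty(ring_size)

print(max_faulty(4))                         # 1: a ring of 4 tolerates one faulty server
print(satisfies_byzantine_assumption(7, 3))  # False: 7 servers tolerate at most 2
```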

  40. Update commitment • Other nodes containing the data of that object are called secondary nodes • They do not participate in serialization protocol • They are organized into one or more multicast trees (dissemination trees)

  41. Path of an update: • After generating an update, a client sends it directly to the object’s inner ring • While the inner ring performs a Byzantine agreement to commit the update, secondary nodes propagate the update among themselves • The result of the update is multicast down the dissemination tree to all secondary nodes

  42. Cost of an update in bytes sent across the network, normalized to minimum cost needed to send the update to each of the replicas

  43. Update commitment • Fault tolerance: • Fault tolerance is guaranteed as long as fewer than one third of the servers in the inner ring are malicious • Secondary nodes do not participate in the Byzantine protocol, but receive consistency information

  44. Update commitment • Automatic repair: • Servers of the inner ring can be changed without affecting the rest of the system • Servers participating in the inner ring are altered continuously to maintain the Byzantine assumption

  45. Underlying Technologies • Naming • Access control • Data Location and Routing • Data Update • Deep Archival Storage • Introspection

  46. Deep Archival Storage • Each object is divided into m fragments and then encoded into n fragments, where n > m, using Reed-Solomon encoding • Any m of the n coded fragments are sufficient to reconstruct the original data • Rate of encoding: r = m/n • Storage overhead = 1/r = n/m
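
A small sketch of the rate and overhead bookkeeping and the "any m of n fragments suffice" property; it does not implement Reed-Solomon coding itself, and the fragment counts below are illustrative:

```python
# Erasure-coding bookkeeping from the slide: m original fragments encoded into
# n > m fragments; rate r = m/n; storage overhead = 1/r = n/m.
def encoding_rate(m: int, n: int) -> float:
    return m / n

def storage_overhead(m: int, n: int) -> float:
    return n / m

def recoverable(surviving_fragments: int, m: int) -> bool:
    """Any m of the n encoded fragments are enough to rebuild the object."""
    return surviving_fragments >= m

m, n = 16, 64                    # illustrative fragment counts
print(encoding_rate(m, n))       # 0.25
print(storage_overhead(m, n))    # 4.0: four times the original object's size
print(recoverable(20, m))        # True: 20 >= 16 fragments survived
```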

  47. Underlying Technologies • Naming • Access control • Data Location and Routing • Data Update • Deep Archival Storage • Introspection

  48. Introspection • It is impossible to manually administer millions of servers and objects • OceanStore contains introspection tools • Event monitoring • Event analysis • Self-adaptation

  49. Introspection • Introspective modules on servers observe and measure local network traffic • They automatically create, replace, and remove replicas in response to objects’ usage patterns
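
A toy sketch of such an introspective module: it counts reads per object over an observation period and adapts the local replica set when usage crosses thresholds. The thresholds, names, and structure are hypothetical, not OceanStore's actual mechanism:

```python
# Toy introspection sketch: observe per-object read counts, then adapt replicas.
from collections import Counter

CREATE_THRESHOLD = 100   # reads per period that justify creating a local replica
REMOVE_THRESHOLD = 1     # below this, an idle local replica may be dropped

class IntrospectiveModule:
    def __init__(self):
        self.read_counts = Counter()   # object GUID -> reads observed this period
        self.local_replicas = set()

    def observe_read(self, guid: str) -> None:
        self.read_counts[guid] += 1

    def adapt(self) -> None:
        """Periodically create replicas for hot objects and drop cold ones."""
        for guid, count in self.read_counts.items():
            if count >= CREATE_THRESHOLD:
                self.local_replicas.add(guid)        # would fetch and decode a replica
        for guid in list(self.local_replicas):
            if self.read_counts[guid] < REMOVE_THRESHOLD:
                self.local_replicas.discard(guid)    # reclaim space from an idle replica
        self.read_counts.clear()                     # start a new observation period

mod = IntrospectiveModule()
for _ in range(150):
    mod.observe_read("guid-42")      # heavy local read traffic for one object
mod.adapt()
print(mod.local_replicas)            # {'guid-42'}
```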
