
OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01


Presentation Transcript


  1. OceanStore Status and Directions. ROC/OceanStore Retreat 1/16/01. John Kubiatowicz, University of California at Berkeley

  2. Questions about ubiquitous information: • Where is persistent information stored? • Want: Geographic independence for availability, durability, and freedom to adapt to circumstances • How is it protected? • Want: Encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity • Can we make it indestructible? • Want: Redundancy with continuous repair and redistribution for long-term durability • Is it hard to manage? • Want: automatic optimization, diagnosis and repair

  3. Everyone’s Data, One Utility • Millions of servers, billions of clients… • 1000-YEAR durability (excepting fall of society) • Maintains Privacy, Access Control, Authenticity • Incrementally Scalable (“Evolvable”) • Self Maintaining! • Not quite peer-to-peer: • Utilizing servers in infrastructure • Some computational nodes more equal than others

  4. Want Automatic Maintenance • Can’t possibly manage billions of servers by hand! • System should: • Be fault-tolerant (High MTTF) • Repair itself (Low MTTR through adaptation) • Incorporate new elements • Can we guarantee data is available for 1000 years? • New servers added from time to time • Old servers removed from time to time • Everything just works • Many components with geographic separation • System not disabled by natural disasters • Can adapt to changes in demand and regional outages • Gain in stability through statistics

  5. OceanStore Assumptions • Untrusted Infrastructure: • The OceanStore is comprised of untrusted components • Only ciphertext within the infrastructure • Responsible Party: • Some organization (i.e. service provider) guarantees that your data is consistent and durable • Not trusted with content of data, merely its integrity • Mostly Well-Connected: • Data producers and consumers are connected to a high-bandwidth network most of the time • Exploit multicast for quicker consistency when possible • Promiscuous Caching: • Data may be cached anywhere, anytime

  6. This Talk: making it real! (Or: you will hear reality from my students)

  7. The Path of an OceanStore Update [Figure labels: Clients, Inner-Ring Servers, Second-Tier Caches, Multicast trees]

  8. Important Components: • Data Object: (Distribution-enabled data format) • Must support copy-on-write and versioning efficiently • Must allow sparse population of data in caches • Must smoothly interface with archive • Inner Ring: (Byzantine Agreement) • Check write access control • Serialize updates/resolve micro-conflicts • Sign result with Threshold Signature • Erasure code result and send fragments • Second Tier Server: (Promiscuous Caches) • Serve local clients • Tie itself into Dissemination tree • Apply updates that it receives through tree • Decision point for caching policies: tentative vs committed

  9. Implementation Framework [Figure labels: Thread Scheduler, Dispatch, Introspection Modules, Consistency, Location & Routing, Archival, Asynchronous Network, Asynchronous Disk, Java Virtual Machine, Operating System, Network] • Event-driven Implementation Model in Java • Divided into a sequence of communicating “stages” • Communication between stages in the form of “snoopable” messages • > 100,000 lines of Java, Comments, Test scripts • Substantially functioning!
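
The staged, event-driven model on this slide can be illustrated with a very small sketch. It is written in Python for brevity rather than in the Java of the prototype, and every name in it (`Dispatcher`, `consistency_stage`, `archival_stage`, the message types) is hypothetical rather than taken from the OceanStore code. It only shows the general idea: stages communicate through typed messages, and an observer can "snoop" on all traffic.

```python
from collections import defaultdict, deque

class Dispatcher:
    """Delivers typed messages to subscribed stages; snoopers see every message."""
    def __init__(self):
        self.handlers = defaultdict(list)  # message type -> stage callbacks
        self.snoopers = []                 # observers of all traffic
        self.queue = deque()

    def subscribe(self, msg_type, handler):
        self.handlers[msg_type].append(handler)

    def snoop(self, observer):
        self.snoopers.append(observer)

    def post(self, msg_type, payload):
        self.queue.append((msg_type, payload))

    def run(self):
        # Single-threaded event loop: drain the queue in FIFO order.
        while self.queue:
            msg_type, payload = self.queue.popleft()
            for observer in self.snoopers:
                observer(msg_type, payload)
            for handler in self.handlers[msg_type]:
                handler(self, payload)

archived = []  # what the (hypothetical) archival stage has received
seen = []      # message types observed by the snooper

def consistency_stage(dispatcher, update):
    # Hypothetical stage: accept an update and hand the result to archival.
    dispatcher.post("archive", update.upper())

def archival_stage(dispatcher, data):
    archived.append(data)

d = Dispatcher()
d.subscribe("update", consistency_stage)
d.subscribe("archive", archival_stage)
d.snoop(lambda msg_type, payload: seen.append(msg_type))
d.post("update", "write-1")
d.run()
```

Because the snooper is registered on the dispatcher rather than on a stage, monitoring code can watch the message flow without the stages knowing about it.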

  10. GUIDs for Naming [Figure labels: Foo, Bar, Baz, Myfile; each link is an AGUID; out-of-band “Root link”] • Unique, location independent identifiers: • Every version of every unique entity has a permanent Version-GUID (or VGUID): hash over content; versioning supports time-travel • Each object has a permanent (version-independent) Archival-GUID (or AGUID): • Signed Associations between AGUIDs and latest VGUIDs are produced by inner ring (called Heartbeats) • Naming hierarchy: • Users map from names to AGUIDs via hierarchy of OceanStore objects
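
A minimal sketch of the two identifier kinds, under stated assumptions: the slide says only that a VGUID is a hash over content, so the choice of SHA-256 here is an assumption, and the `ObjectHistory` class is hypothetical. The `latest` field stands in for a heartbeat; a real heartbeat is signed by the inner ring, which this sketch omits.

```python
import hashlib

def vguid(content: bytes) -> str:
    # Version GUID: a hash over the content (hash function is an assumption).
    return hashlib.sha256(content).hexdigest()

class ObjectHistory:
    """All versions of one object, reachable through a fixed AGUID."""
    def __init__(self, aguid: str):
        self.aguid = aguid
        self.versions = {}   # VGUID -> content
        self.latest = None   # unsigned stand-in for a heartbeat

    def commit(self, content: bytes) -> str:
        v = vguid(content)
        self.versions[v] = content
        self.latest = v      # AGUID now maps to the newest VGUID
        return v

    def read(self, v=None) -> bytes:
        # Without a VGUID, follow the heartbeat; with one, "time-travel".
        return self.versions[self.latest if v is None else v]

history = ObjectHistory("aguid-of-myfile")  # hypothetical identifier
v1 = history.commit(b"draft")
v2 = history.commit(b"final")
```

Reading through the AGUID returns the newest version, while reading by an older VGUID returns that version unchanged, which is the time-travel property the slide mentions.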

  11. Data Object Structure: All about flexibility and validation [Figure labels: Data B-Tree with indirect blocks and data blocks d1 through d9, plus d'8 and d'9; versions V7 through V10; Check Point == V6 and Check Point == V11; Log Object (set of log entries); Unit of Coding; Encoded Fragments: Unit of Archival Storage; Verification Tree; GUID of d1; GUID of d'8]

  12. Status: Data Object Development • Second-Tier Replica support: functional • Second-tier caches can hold multiple versions • Tie themselves into multicast trees • Several dissemination tree algorithms explored • Updates forwarded from inner ring through trees • Complete B-Tree object structure developed • Data blocks named with unforgeable hashes • Hashes can point to archival fragments/live blocks • Supports copy on write • Top block defines complete version • Missing blocks filled in from archive or other replicas • Update commits with distributed threshold signatures • Byzantine commitment not quite integrated into prototype • Traffic generator for testing
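
The combination of hash-named blocks, copy on write, and a top block that defines a complete version can be sketched as follows. This is a simplification under stated assumptions: it uses a single-level block list rather than a B-tree, SHA-256 as the hash, and an in-memory dictionary as a stand-in for replicas and the archive. All function names are hypothetical.

```python
import hashlib

store = {}  # block name (hash) -> block bytes; stand-in for replicas/archive

def put(block: bytes) -> str:
    # A block's name is the hash of its content, so it cannot be forged.
    name = hashlib.sha256(block).hexdigest()
    store[name] = block
    return name

def make_version(data_blocks):
    # The top block lists the names of the data blocks; its own hash
    # therefore identifies the complete version.
    names = [put(b) for b in data_blocks]
    return put("\n".join(names).encode())

def read_version(top: str):
    names = store[top].decode().split("\n")
    return [store[n] for n in names]

def update_block(top: str, index: int, new_block: bytes) -> str:
    # Copy on write: only the changed block and a new top block are stored;
    # every untouched block is shared with the old version.
    names = store[top].decode().split("\n")
    names[index] = put(new_block)
    return put("\n".join(names).encode())

v1 = make_version([b"a", b"b", b"c"])
v2 = update_block(v1, 1, b"B")
```

After the update, both versions remain readable, and the store holds six blocks in total: three original data blocks, one replacement block, and two top blocks.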

  13. Exploiting Law of Large Numbers for Durability
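
A worked example may help show why spreading an object over many fragments improves durability. The sketch below assumes that each fragment is lost independently with the same probability `p`, and that any `m` of the `n` fragments are enough to rebuild the object. The parameter values are hypothetical and are not taken from the talk.

```python
from math import comb

def loss_probability(n: int, m: int, p: float) -> float:
    """Probability that fewer than m of n fragments survive,
    when each fragment is lost independently with probability p."""
    return sum(comb(n, k) * (1 - p) ** k * p ** (n - k) for k in range(m))

# Both schemes below store twice the size of the original object.
replication = loss_probability(n=2, m=1, p=0.1)   # two full copies
erasure = loss_probability(n=32, m=16, p=0.1)     # any 16 of 32 fragments

print(f"2 replicas:        {replication:.3e}")
print(f"16-of-32 fragments: {erasure:.3e}")
```

With the same storage overhead, the fragmented scheme loses the object only if at least 17 of 32 fragments fail together, which under the independence assumption is far less likely than losing both replicas.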

  14. The Dissemination Process [Figure labels: Introspection, Human Input, Network Monitoring, Model Builder, Set Creator, Disseminator; flows labeled model, probe, set type, set, fragments]

  15. Achieving Low MTTR: Global Heartbeats • Trigger repair when level of redundancy too low • Continuous sweep (slowly over time)
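
The repair trigger can be sketched as a sweep over per-object fragment counts. This is only an illustration: the function name, the thresholds, and the idea of a ready-made table of surviving fragment counts are all assumptions, and the sketch says nothing about how the counts are gathered.

```python
def sweep(surviving, needed, repair_below):
    """Classify objects by how many of their fragments still survive.

    surviving    -- dict: object id -> number of fragments still reachable
    needed       -- fragments required to reconstruct an object
    repair_below -- trigger repair when the count drops below this level
    """
    repair, lost = [], []
    for guid, count in sorted(surviving.items()):
        if count < needed:
            lost.append(guid)      # too late: cannot be reconstructed
        elif count < repair_below:
            repair.append(guid)    # still recoverable: regenerate fragments
    return repair, lost

repair, lost = sweep(
    {"obj-a": 32, "obj-b": 20, "obj-c": 10},  # hypothetical counts
    needed=16,
    repair_below=24,
)
```

Setting `repair_below` well above `needed` leaves a margin, so that repair starts while enough fragments remain to rebuild the object.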

  16. Status: Archival Infrastructure • Archival Fragments generated by Inner Ring • Multi-stage-based implementation at inner ring • Storage servers hold fragments • Caching servers (2nd-tier replicas) hold data objects • Independence Analysis (mostly there) • Node discovery technique exists • Analysis of long-running reliability data • Dissemination-set creator: initial versions • Storage servers (Naïve but functional): • Initial implementation: cache + object store • Ongoing tuning efforts • Redesign in the works

  17. Location Independent Routing • Paradigm: Routing • Route messages to objects by GUID regardless of location • Fast, probabilistic search for “routing cache”: • Built from attenuated bloom filters • Approximation to gradient search • Redundant Plaxton Mesh used for underlying routing infrastructure: • Randomized data structure with locality properties • Redundant, insensitive to faults, and repairable • Amenable to continuous adaptation to adjust for: • Changing network behavior • Faulty servers • Denial of service attacks • Tomorrow: 3 talks on Routing
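
The "routing cache" built from attenuated Bloom filters can be sketched as follows. In this illustration, each outgoing link keeps an array of Bloom filters, where the filter at level i summarizes objects believed to be i hops away through that link; a query is forwarded along the link that reports the smallest distance. Filter sizes, hash choices, and all names are assumptions for the sketch.

```python
import hashlib

class BloomFilter:
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.field |= 1 << p

    def __contains__(self, key):
        # May return a false positive, never a false negative.
        return all((self.field >> p) & 1 for p in self._positions(key))

class AttenuatedFilter:
    """One Bloom filter per hop distance along a single outgoing link."""
    def __init__(self, depth):
        self.levels = [BloomFilter() for _ in range(depth)]

    def advertise(self, guid, hops):
        self.levels[hops - 1].add(guid)

    def distance(self, guid):
        # Smallest hop count at which the object may be found, or None.
        for hops, level in enumerate(self.levels, start=1):
            if guid in level:
                return hops
        return None

def next_hop(links, guid):
    """Pick the neighbor whose filters suggest the shortest distance."""
    best = None
    for neighbor, filt in links.items():
        d = filt.distance(guid)
        if d is not None and (best is None or d < best[1]):
            best = (neighbor, d)
    return best

links = {"north": AttenuatedFilter(3), "south": AttenuatedFilter(3)}
links["north"].advertise("guid-42", hops=3)
links["south"].advertise("guid-42", hops=1)
choice = next_hop(links, "guid-42")
```

Because Bloom filters can report false positives, a search guided this way is probabilistic; that is why the slide pairs it with a separate underlying routing infrastructure.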

  18. Status: Location Independent Routing • Basic Tapestry infrastructure is operational • Single-path static routing: works • Multi-path adaptive routing: mostly there • Dynamic Integration of new nodes: implemented • Network adaptation almost there (Patchwork) • Framework for Measurement of network properties • Periodic beacons measure loss and network latency • Exploitation of Differences in nodes: • Brocade backbone supplement to Tapestry: Improves routing • Differentiation in service experiments ongoing • Theoretical Results on Tapestry • Construction/Analysis of Dynamic Integration Algorithms • Voluntary/involuntary node deletion algorithms • View of Tapestry as data structure for solving nearest neighbor • Attenuated Bloom Filters are operational • Implemented and functional • Optimizes short-distance routing infrastructure!

  19. Introspection: The New Architectural Creed [Figure labels: Compute, Adapt, Monitor] • Using Moore’s law gains for something other than performance • Examples: • Online algorithmic validation • Model building for data rearrangement • Availability • Better prefetching • Extreme Durability (1000-year time scale?) • Use of erasure coding and continuous repair • Stability through Statistics • Use of redundancy to gain more predictable behavior • Systems version of Thermodynamics! • Continuous Dynamic Optimization of other sorts

  20. Status: Introspection • Development of OIL framework for introspection: this framework is operational • Collection facilities can observe all events in the system • Multiple aggregation models available • Example 1: Clustering for prefetching • Currently builds hidden Markov model of access patterns utilizing OIL framework • Almost there: • Use models to better prefetch objects • Placement of replicas assisted by Bloom filters (almost) • Example 2: Observation of network behavior • Framework for observation of network latencies • Adaptation of network topology: almost there • Example 3: Grammar building for prefetching • Experiment of introspection at processor level • Talk later today about this (Mark Whitney)
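
To make the prefetching idea concrete, here is a deliberately simpler stand-in: a first-order Markov chain over observed accesses, not the hidden Markov model the slide describes and not the OIL framework. The function names and the access trace are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(trace):
    """Count how often each object is followed by each other object."""
    model = defaultdict(Counter)
    for current, following in zip(trace, trace[1:]):
        model[current][following] += 1
    return model

def predict_next(model, current):
    """Most frequently observed successor, or None if never seen."""
    if current not in model:
        return None
    return model[current].most_common(1)[0][0]

# Hypothetical access trace of object names.
trace = ["a", "b", "c", "a", "b", "d", "a", "b", "c"]
model = build_model(trace)
candidate = predict_next(model, "b")
```

A cache could use the predicted successor as a prefetch candidate whenever the current object is accessed; a hidden Markov model would additionally infer unobserved states behind the trace.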

  21. Status: Medium Scale Test and Emulation • Two medium clusters from IBM SUR Grant • Each cluster 21 servers: • Each with two 1 GHz processors • One GByte of RAM, 73 GB of Disk • 1 GB Switch per cluster • MIRNET switch • Plan to have continuous OceanStore components running in approximately 1 month • Emulation technology: currently works • Able to simulate large-scale network by simulating network latencies • Multiple OceanStore nodes emulated/node

  22. Reality: Web Caching through OceanStore

  23. Day Dreams? (Becoming real) • NFS File system built in OceanStore (Exists) • Still have to integrate ACLs • Update to latest prototype • Windows Installable File system (Planning) • “USB Keys” hold cryptographic keys and personal identity • Automatic downloading and verification of filesystem • IMAP to OceanStore gateway (Planning) • Lotus Notes Domino Server • Exploring use of work flow on top of OceanStore

  24. OceanStore Conclusions • OceanStore: everyone’s data, one big utility • Global Utility model for persistent data storage • Very Soon: Working OceanStore cluster!!!! • Event-driven programming in Java • You will hear about components today and tomorrow • OceanStore assumptions: • Untrusted infrastructure with a responsible party • Mostly connected with conflict resolution • Continuous on-line optimization

  25. For more info: • OceanStore vision paper for ASPLOS 2000 “OceanStore: An Architecture for Global-Scale Persistent Storage” • OceanStore paper on Maintenance (IEEE IC): “Maintenance-Free Global Data Storage” • Both available on OceanStore web site: http://oceanstore.cs.berkeley.edu/
