
Pond and CFS


Presentation Transcript


  1. Pond and CFS CS599 Special Topics in OS and Distributed Storage Systems Professor Banu Ozden Jan 2004 Ho Chung

  2. Table of Contents • Part 1 Pond: • Overview: OceanStore and Pond • Pond Architecture • Techniques: Erasure Codes, Push-based update, Byzantine Agreement, Proactive Threshold Signature • Experimental Results • Part 2 CFS: • Overview and Design Goals • Chord Layer, DHash Layer, FS Layer • Experimental Results

  3. PART 1: Pond, the OceanStore Prototype

  4. OceanStore: Overview • Design Goal: Persistent Storage • Design criteria: • High durability • Universal availability • Privacy • Data integrity • Assumptions about the infrastructure: • Untrusted (e.g. hosts and routers can fail arbitrarily) • Dynamic (so the system must be self-organizing and self-repairing => self-tuning) • Support for nomadic data (how? promiscuous caching)

  5. OceanStore as an Application. Software stack, top to bottom: New Distributed Applications (OceanStore) -> Tapestry (routing messages & locating objects) -> Network (Java NBIO) -> Operating System

  6. Pond: Overview [Figure: archival servers (for durability), primary replicas forming the inner ring, and secondary replicas (soft state) serving clients such as a HotOS attendee and other researchers] Inner ring responsibilities: • Serialize concurrent writes • Enforce access control • Check updates

  7. Pond: Data Model • Versioning • Each data object in Pond has a version • Allows time travel • Each version of an object contains metadata, the actual data, and a pointer to the previous version • The entire stream of versions of a given data object is named by an AGUID • GUIDs • BGUID (block): secure hash of a block of data • VGUID (version): BGUID of the top block • AGUID (active): Hash(app-specified name || owner's public key) • The mapping from an AGUID to the latest VGUID may change over time
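
To make the naming scheme concrete, here is a minimal Java sketch, assuming SHA-1 as the secure hash; the class and method names are illustrative, not Pond's actual API:

```java
// Sketch of Pond-style GUID naming, assuming SHA-1 as the secure hash.
// Class and method names are illustrative only.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class GuidSketch {
    static byte[] sha1(byte[]... parts) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (byte[] p : parts) md.update(p);
        return md.digest();
    }

    // BGUID: secure hash of a block's contents.
    static byte[] bguid(byte[] block) throws Exception {
        return sha1(block);
    }

    // AGUID: hash of the application-specified name concatenated with the owner's public key.
    static byte[] aguid(String appName, byte[] ownerPublicKey) throws Exception {
        return sha1(appName.getBytes(StandardCharsets.UTF_8), ownerPublicKey);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        byte[] fakeOwnerKey = new byte[] {0x01, 0x02, 0x03};   // placeholder key bytes
        System.out.printf("BGUID bytes: %d, AGUID bytes: %d%n",
                bguid(data).length, aguid("/docs/report.txt", fakeOwnerKey).length);
    }
}
```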

  8. Pond - GUID [Figure: version structure of an object. VGUID i and VGUID i+1 each point to a root block (metadata M); the root block points to indirect blocks, which point to data blocks d1-d7. A new version is built copy-on-write, so VGUID i+1 shares unchanged blocks with VGUID i, adds new blocks d'6 and d'7, and keeps a backpointer to the previous version.]

  9. Pond: Architecture (1) • Virtualization of Resources • Resources are virtualized: they are not tied to particular hardware • DOLR (Decentralized Object Location & Routing) interface • Tapestry virtualizes resources • An object is addressed by a GUID, not an IP address • (e.g.) PublishObject(Object_GUID, App_ID) • Locality aware: • No restriction on the placement of objects • Queries find a nearby copy of the object with high probability

  10. Pond: Architecture (2) • Replication and Consistency • Each object has a single primary replica (the inner ring) • Heartbeat = certificate (AGUID, VGUID, timestamp, version) mapping an AGUID to its latest version • The inner ring enforces access control and serializes concurrent updates from multiple users • It uses a Byzantine-fault-tolerant protocol to agree on updates to the data object, and digitally signs the result

  11. Pond: Architecture (3) • High Durability for Archival Storage • Motivation: if we create 2 replicas of a data block, we tolerate one failure at the cost of 100% additional storage. Can we do better? Yes • Erasure codes: more durable than replication for the same space • After an update at the primary replica, all newly created blocks are erasure-coded and the fragments are stored [Figure: an encoder f splits a block into fragments A-H; the decoder f^-1 rebuilds the block from a subset of the fragments, e.g. A, E, D, H]

  12. Pond - Erasure Codes • A block is divided into m identically-sized fragments, which are then encoded into n fragments, where n > m • The original block can be reconstructed from any m fragments • Rate of encoding: r = m/n < 1 • Intuitively, erasure encoding gives higher fault tolerance per unit of storage than replication • Disadvantage? Expensive computation
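
A minimal illustration of the reconstruct-from-any-m property, using a toy m = 2, n = 3 XOR parity code (rate r = 2/3). Pond's archive uses a stronger code (Reed-Solomon style), so this sketch only demonstrates the idea, not the actual encoder:

```java
// Toy erasure code with m = 2, n = 3 (rate r = 2/3):
// fragments are {d1, d2, d1 XOR d2}; any 2 of the 3 recover the original block.
public class XorErasureSketch {
    // Encode a block (already split into two equal halves) into three fragments.
    static byte[][] encode(byte[] d1, byte[] d2) {
        byte[] parity = new byte[d1.length];
        for (int i = 0; i < d1.length; i++) parity[i] = (byte) (d1[i] ^ d2[i]);
        return new byte[][] {d1, d2, parity};
    }

    // Recover a missing fragment (index 'lost') from the two surviving ones.
    static byte[] recover(byte[][] frags, int lost) {
        int a = (lost + 1) % 3, b = (lost + 2) % 3;
        byte[] out = new byte[frags[a].length];
        for (int i = 0; i < out.length; i++) out[i] = (byte) (frags[a][i] ^ frags[b][i]);
        return out;
    }

    public static void main(String[] args) {
        byte[][] frags = encode("ABCD".getBytes(), "EFGH".getBytes());
        // Pretend fragment 0 (d1) was lost; rebuild it from d2 and the parity fragment.
        byte[] rebuilt = recover(frags, 0);
        System.out.println(new String(rebuilt));   // prints ABCD
    }
}
```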

  13. Pond: Caching Data Objects • Frequently-read objects? • Use whole-block caching (instead of erasure-coded fragments) • If the whole block is not cached at the local node, the node reads fragments, decodes them to reconstruct the block, and then caches the block • Cache replacement is LRU • To read the latest version of a document? • Use Tapestry to retrieve a heartbeat for the object from its primary replica
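
A minimal sketch of the LRU whole-block cache described above, using Java's LinkedHashMap in access order; the reconstruct-from-fragments step is a placeholder for the fragment fetch-and-decode path, not Pond's actual code:

```java
// Minimal LRU cache for reconstructed whole blocks, keyed by BGUID (a hex string here).
import java.util.LinkedHashMap;
import java.util.Map;

public class BlockCacheSketch {
    private final int capacity;
    private final Map<String, byte[]> cache;

    BlockCacheSketch(int capacity) {
        this.capacity = capacity;
        // accessOrder = true gives LRU eviction order.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                return size() > BlockCacheSketch.this.capacity;
            }
        };
    }

    byte[] read(String bguid) {
        byte[] block = cache.get(bguid);
        if (block == null) {
            block = reconstructFromFragments(bguid);  // fetch >= m fragments and decode
            cache.put(bguid, block);                  // cache the whole block
        }
        return block;
    }

    private byte[] reconstructFromFragments(String bguid) {
        return new byte[0];  // placeholder: real code would contact Tapestry/the archive
    }
}
```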

  14. Pond: Push-based Update • Update? • Push-based update of the secondary replicas of an object • Every time the primary replica applies an update to create a new version, it pushes the corresponding update and heartbeat down to the secondary replicas

  15. Pond: Byzantine Agreement • Example: 4 Byzantine generals; N >= 3f + 1 nodes are needed to tolerate f faults • 1st round: the commander (P1) sends a value to each of the lieutenants • 2nd round: each lieutenant sends the value it received to its peers • P3 is a faulty general • The oral-messages protocol needs O(N^(f+1)) messages [Figure: P1 sends 1:v to P2, P3, and P4; P2 and P4 relay 2:1:v and 4:1:v; faulty P3 relays inconsistent values 3:1:u and 3:1:w]
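
The decision step of this oral-messages example can be sketched as a simple majority vote over the values each lieutenant holds after the two rounds. This only illustrates why N >= 3f + 1 matters; it is not Pond's actual inner-ring agreement protocol:

```java
// Decision step for the 4-general, f = 1 oral-messages example on this slide:
// after round 2, each lieutenant holds one value per general and decides by majority.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MajorityVoteSketch {
    static String decide(List<String> received) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : received) counts.merge(v, 1, Integer::sum);
        String best = "RETREAT";                 // default value if there is no majority
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
        }
        return bestCount > received.size() / 2 ? best : "RETREAT";
    }

    public static void main(String[] args) {
        // P2's view: v from the commander, v relayed by P4, garbage u from faulty P3.
        System.out.println(decide(Arrays.asList("v", "v", "u")));   // majority -> "v"
    }
}
```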

  16. Pond: Authentication • Authentication in Byzantine agreement: use hybrid cryptography • MACs are used for all communication inside the inner ring • Public-key cryptography is used to communicate with all other machines • Secondary replicas can verify the authenticity of data received from other replicas without contacting the inner ring • (e.g.) Most read traffic can be satisfied by the secondary replicas

  17. Pond: Proactive Threshold Signature (1) • Goals: • Support flexibility in choosing the membership of the inner ring • Replace machines in the inner ring without changing the public key • PTS pairs a single public key with l private key shares. Each of the l servers uses its key share to generate a signature share, and any k correctly generated signature shares may be combined by any party to produce a full signature, where l = 3f + 1, k = f + 1, and f is the number of faulty hosts tolerated
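
Real proactive threshold signatures are too involved for a slide-sized example, but the k-of-l combination idea can be illustrated with a simpler, related primitive: Shamir secret sharing over a prime field, where any k shares reconstruct a secret and fewer reveal nothing. The sketch below is that substitute illustration only, not a threshold signature scheme:

```java
// Illustration of the k-of-l idea behind threshold schemes, using Shamir secret
// sharing over a prime field (NOT an actual proactive threshold signature scheme).
import java.math.BigInteger;
import java.security.SecureRandom;

public class KofLSketch {
    static final SecureRandom RNG = new SecureRandom();
    static final BigInteger P = BigInteger.probablePrime(128, RNG);   // field modulus

    // Split 'secret' into l shares of a degree-(k-1) polynomial; any k suffice.
    static BigInteger[][] split(BigInteger secret, int k, int l) {
        BigInteger[] coeff = new BigInteger[k];
        coeff[0] = secret;
        for (int i = 1; i < k; i++) coeff[i] = new BigInteger(127, RNG);
        BigInteger[][] shares = new BigInteger[l][];          // (x, f(x)) pairs
        for (int x = 1; x <= l; x++) {
            BigInteger y = BigInteger.ZERO;                   // Horner evaluation of f(x)
            for (int i = k - 1; i >= 0; i--)
                y = y.multiply(BigInteger.valueOf(x)).add(coeff[i]).mod(P);
            shares[x - 1] = new BigInteger[] {BigInteger.valueOf(x), y};
        }
        return shares;
    }

    // Lagrange interpolation at x = 0 using any k shares recovers the secret.
    static BigInteger combine(BigInteger[][] shares) {
        BigInteger secret = BigInteger.ZERO;
        for (int i = 0; i < shares.length; i++) {
            BigInteger num = BigInteger.ONE, den = BigInteger.ONE;
            for (int j = 0; j < shares.length; j++) {
                if (i == j) continue;
                num = num.multiply(shares[j][0].negate()).mod(P);
                den = den.multiply(shares[i][0].subtract(shares[j][0])).mod(P);
            }
            secret = secret.add(shares[i][1].multiply(num).multiply(den.modInverse(P))).mod(P);
        }
        return secret;
    }

    public static void main(String[] args) {
        BigInteger secret = new BigInteger(100, RNG);
        BigInteger[][] shares = split(secret, 2, 4);          // l = 3f+1 = 4, k = f+1 = 2
        BigInteger[][] anyTwo = {shares[1], shares[3]};       // any k shares will do
        System.out.println(combine(anyTwo).equals(secret));   // true
    }
}
```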

  18. Pond: Proactive Threshold Signature (2) [Figure: an inner ring with l = 4, k = 2, f = 1 (l = 3f + 1, k = f + 1). The public key PK is paired with private key shares SK1-SK4, which produce signature shares SS1-SS4. After a new node replaces a ring member, the key shares are refreshed to SK'1-SK'4 (signature shares SS'1-SS'4). NOTE: the public key does not change!]

  19. Pond: Prototype implementation • All major subsystems operational • Self-organizing Tapestry base • Primary replicas use Byzantine agreement • Secondary replicas self-organize into multicast tree • Erasure-coding archive • Staged Event-driven software architecture • Built on SEDA • 280K lines of Java (J2SE v1.3) • JNI libraries for cryptography, erasure coding

  20. Pond: Deployment on Planetlab • http://www.planet-lab.org • ~100 hosts, ~40 sites • Hosts are spread across North America, Europe, Australia, and New Zealand • Pond: up to 1000 virtual nodes • Using custom Perl scripts • 5 minute startup • Gives global scale for free

  21. Pond: Results (Latency) • Table 1. Latency breakdown of an update: for small updates, the majority of the time is spent computing the threshold signature share; with larger updates, the time to apply and archive the update dominates the signature time • Figure 1. Latency to read objects from the archive: the time to read an object increases with the number of blocks that must be retrieved

  22. Pond: Results (Throughput) • Table 2. Throughput in the wide area: the throughput for a distributed ring is limited by the wide-area bandwidth • Table 3. Results of the Andrew benchmark: OceanStore outperforms NFS by a factor of 4.3 in read-intensive phases, but write performance is worse by as much as a factor of 7.3

  23. Pond: Conclusion • Likes: • Supports a higher degree of consistency (via the Byzantine agreement protocol) • The idea of proactive threshold signatures • Dislikes: • Not suitable for write sharing • The idea of a Responsible Party (which chooses the hosts for the inner ring) • Complexity: data privacy, client updates, and durable storage all come at a cost in complexity (e.g. the Byzantine protocol, Plaxton tree, proactive threshold signatures, erasure encoding, etc.)

  24. PART 2: Wide-area Cooperative Storage with CFS. CFS: distributed read-only file storage

  25. CFS: Chord-based Distributed File Storage System [Figure: a CFS client and two CFS servers. The client runs FS (interprets blocks as files; presents a file-system interface to applications), DHash (storage layer: storage/retrieval, replication/caching of data blocks), and Chord (lookup layer: maintains routing tables used to find blocks); each server runs DHash over Chord]

  26. CFS – Design Goals • Efficiency and Scalability • (See Chord's algorithmic performance on the next slide) • Availability • Chord allows a client to always retrieve data (assuming the absence of network partitions, etc.) • Fault tolerance: replication & caching • Block-level storage: store blocks, NOT whole files (cf. PAST) • Block-level caching: cache along the lookup path • Whole-file caching: only if files are small • Load Balance • Virtual servers: spread blocks evenly over the available virtual servers (several per physical server) • Per-publisher Quotas • To avoid malicious injection of large quantities of data (cf. PAST) • Decentralization • cf. a CDN (e.g. Akamai) is managed by a central entity

  27. CFS: Chord Layer • Chord is a structured P2P lookup protocol • Chord maps keys onto a circular identifier space • Given a key, it maps the key onto a node (e.g. lookup(key) = IP address of the node responsible for the key) • Key idea: keep pointers (fingers) to nodes at exponentially spaced points around the ID space • Algorithmic performance • In an N-node network, each node maintains O(log N) entries in its routing table • A lookup requires O(log N) messages

  28. Chord: A Simple Lookup Protocol [Fig 2.1: a Chord ring of 10 nodes (N1, N8, N14, N21, N32, N38, N42, N48, N51, N56) storing 5 keys (K10, K24, K30, K38, K54); each key is stored at its successor node. Fig 2.2: node N8 performs lookup(K54) using only its successor list, hopping around the ring until it reaches N56, the successor of K54.]

  29. Chord: A Fast Lookup Protocol [Fig 2.3: the finger table of node N8, with entries at N8 +1, +2, +4, +8, +16, +32. Fig 2.4: N8 performs lookup(K54) using the finger table to accelerate the lookup.] • The i-th entry in the table at node n contains the ID of the first node that succeeds n by at least 2^(i-1) on the Chord ring
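
A minimal sketch of the finger-table rule on the 64-ID example ring from the figures; the ring membership is hard-coded and static, whereas a real Chord node builds and repairs its table dynamically:

```java
// Finger-table rule on a 6-bit (0..63) example ring:
// finger[i] of node n points at the first live node whose ID >= n + 2^(i-1) (mod 64).
import java.util.Arrays;
import java.util.TreeSet;

public class FingerTableSketch {
    static final int BITS = 6, SPACE = 1 << BITS;               // 64-slot example ring
    static final TreeSet<Integer> NODES =
            new TreeSet<>(Arrays.asList(1, 8, 14, 21, 32, 38, 42, 48, 51, 56));

    // First live node at or clockwise after 'id' on the ring.
    static int successor(int id) {
        Integer n = NODES.ceiling(id % SPACE);
        return n != null ? n : NODES.first();                   // wrap around the ring
    }

    public static void main(String[] args) {
        int n = 8;
        for (int i = 1; i <= BITS; i++) {
            int start = (n + (1 << (i - 1))) % SPACE;
            System.out.printf("finger[%d] of N%d: start=%d -> N%d%n",
                    i, n, start, successor(start));
        }
        System.out.println("lookup(K54) ends at N" + successor(54));  // -> N56
    }
}
```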

  30. CFS: Chord Layer - Server Selection • Goal • Reduce lookup latency by preferentially contacting nodes likely to be nearby in the underlying network • Cost metric: pick the candidate ni with the minimum C(ni) = di + H(ni) × davg • Notation: • H(ni) = an estimate of the # of Chord hops that would remain after contacting ni • di = latency to node ni as reported by node m (m = the previous hop) • davg = average latency of all the RPCs that the node has ever issued • log N = an estimate of the # of significant high bits in an ID • ones(x) = the number of bits set in x • H(ni) = ones((ni - id) >> (160 - log N)), i.e. the significant bits of the ID-space distance between ni and the target key id
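
A small sketch of the cost computation, following the slide's notation; the demo values (key, candidate IDs, latencies, log N) are made up for illustration:

```java
// Server-selection cost from this slide: among candidate next hops, pick the one
// minimizing C(ni) = di + H(ni) * dAvg, where H(ni) estimates the remaining Chord
// hops from the high bits of the ID-space distance to the key (160-bit IDs, as in Chord).
import java.math.BigInteger;

public class ServerSelectionSketch {
    static final BigInteger RING = BigInteger.ONE.shiftLeft(160);   // 2^160 ID space

    // H(ni): set bits among the top log2(N) bits of the ID-space distance (ni - id).
    static int estimatedHops(BigInteger ni, BigInteger id, int logN) {
        BigInteger dist = ni.subtract(id).mod(RING);
        return dist.shiftRight(160 - logN).bitCount();
    }

    // C(ni) = di + H(ni) * dAvg  (latencies in milliseconds).
    static double cost(BigInteger ni, BigInteger id, int logN, double di, double dAvg) {
        return di + estimatedHops(ni, id, logN) * dAvg;
    }

    public static void main(String[] args) {
        BigInteger id = BigInteger.ONE.shiftLeft(100);                      // target key
        BigInteger nClose = id.add(BigInteger.valueOf(12345));              // tiny ID distance
        BigInteger nFar = id.add(BigInteger.ONE.shiftLeft(159).subtract(BigInteger.ONE));
        int logN = 10;                                                      // ~1000-node network
        // The low-latency but ID-far candidate still loses here because many hops remain.
        System.out.println(cost(nClose, id, logN, 90.0, 60.0));   // 90.0
        System.out.println(cost(nFar, id, logN, 15.0, 60.0));     // 555.0
    }
}
```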

  31. CFS: Chord Layer - Node ID Authentication • When a new node wants to join, an existing node authenticates it • Chord ID = SHA-1(node's IP address || virtual node index) • Check whether the claimed IP address & virtual index hash to the claimed Chord ID • Why do this? If Chord nodes could use arbitrary IDs, an attacker could destroy chosen data by choosing a node ID just after the data's ID
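
A minimal sketch of this check, assuming a straightforward serialization of the IP address and virtual index (CFS's exact byte layout is not specified here):

```java
// Node-ID check: a joining node's claimed Chord ID must equal
// SHA-1(IP address || virtual node index). The serialization below is a toy choice.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class NodeIdCheckSketch {
    static byte[] chordId(String ipAddress, int virtualIndex) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(ipAddress.getBytes(StandardCharsets.UTF_8));
        sha1.update((byte) virtualIndex);           // toy encoding of the virtual index
        return sha1.digest();
    }

    // Accept the join only if the claimed ID hashes out from the claimed IP and index.
    static boolean authenticate(byte[] claimedId, String claimedIp, int claimedIndex)
            throws Exception {
        return Arrays.equals(claimedId, chordId(claimedIp, claimedIndex));
    }

    public static void main(String[] args) throws Exception {
        byte[] honest = chordId("192.0.2.7", 3);
        System.out.println(authenticate(honest, "192.0.2.7", 3));   // true
        System.out.println(authenticate(honest, "192.0.2.8", 3));   // false: IP doesn't match
    }
}
```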

  32. CFS: Block vs. Whole-file • ADV. of Block granularity • Well-suited to serve large & popular files • Network BW consumed for lookup is small (CFS also hides the block lookup latency by pre-fetching blocks) • Less work to achieve load balance • Allow flexible choice of format to client applications and different data structures can coexist • ADV. of Whole-file • Efficient to serve large & unpopular files • Decreases the # of msg required to fetch a file • Lower lookup costs (one lookup per file rather than per block)

  33. CFS: DHash Layer - Replication and Caching [Figure: the placement of block replicas and cached copies. A tick mark is the block's ID; the square is the server that is the successor of that ID; circles are the successor's immediate successors; triangles are servers along the lookup path.] • The block is stored at the successor of its ID (square) • The block is replicated at the successor's immediate successors (circles) • The block is cached at the servers along the lookup path (triangles)

  34. CFS: DHash Layer - Load Balance • Motivation: suppose every CFS server had exactly one ID; then every server would carry roughly the same storage burden. Is this what we want? No • Servers have different network and storage capacities • Thus, a uniform distribution doesn't produce a good load balance (due to heterogeneity) • A solution: virtual servers (sketched below) • ADV: allows the configuration to be adapted to each server's capacity • DISADV: introduces a potential increase in the # of hops in a Chord lookup • A quick remedy: allow virtual servers on the same physical server to look up entries in each other's tables
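
A minimal consistent-hashing sketch of the virtual-server idea; the 64-bit toy hash and server names are placeholders rather than CFS's 160-bit SHA-1 identifiers:

```java
// Virtual servers for load balance: each physical server joins the ring under several
// virtual IDs, so a more capable server can claim a larger share of the ID space.
import java.util.TreeMap;

public class VirtualServerSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>();   // virtual ID -> server

    // Toy stand-in for SHA-1, for illustration only.
    private static long hash(String s) {
        long h = 1125899906842597L;
        for (int i = 0; i < s.length(); i++) h = 31 * h + s.charAt(i);
        return h;
    }

    // More virtual servers per physical server => proportionally more of the ID space.
    void addServer(String server, int virtualServers) {
        for (int v = 0; v < virtualServers; v++) ring.put(hash(server + "#" + v), server);
    }

    String lookup(String blockId) {
        Long key = ring.ceilingKey(hash(blockId));
        return ring.get(key != null ? key : ring.firstKey());     // wrap around the ring
    }

    public static void main(String[] args) {
        VirtualServerSketch cfs = new VirtualServerSketch();
        cfs.addServer("small-host", 2);     // low-capacity machine: few virtual servers
        cfs.addServer("big-host", 8);       // high-capacity machine: many virtual servers
        System.out.println(cfs.lookup("block-42"));
    }
}
```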

  35. CFS: DHash Layer - Update and Delete • CFS allows only the publisher of a file to update it • CFS doesn't support an explicit delete • Publishers must periodically refresh their blocks if they want CFS to continue to store them

  36. CFS: FS Layer (1) • The file system is read-only as far as clients are concerned • The file system may be updated by its publisher • Key idea • WE WANT integrity and authenticity guarantees on public data, while serving many clients • HOW? • Use SFSRO (the SFS read-only file system) • A "self-certifying", read-only FS • File names contain public keys • ADVANTAGE of a read-only FS? • The distribution infrastructure is independent of the published content • Avoids cryptographic operations on servers and keeps that overhead on clients

  37. CFS: FS Layer (2) [Figure: a simple CFS file system. A signed root block, identified by the publisher's public key, points via H(D) to directory block D; the directory's <name, H(inode)> entries point to inode block F; the inode block points to data blocks B1 and B2 via H(B1) and H(B2).] • The public key is the root block's identifier • Data blocks and inode blocks are named by hashes of their contents • An update involves updating the root block to point to the new data • The root block includes a timestamp to prevent replay attacks • It also includes a finite time interval, so periodic refresh is needed for indefinite storage
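
A minimal sketch of content-addressed block naming as used by the FS layer; the in-memory map stands in for DHash, and root-block signing with the publisher's key is omitted:

```java
// SFSRO-style block naming as described on this slide: data and inode blocks are keyed
// by the SHA-1 of their contents, so readers can verify any fetched block by re-hashing.
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class ContentAddressedFsSketch {
    private final Map<String, byte[]> dhash = new HashMap<>();   // stand-in for DHash

    static String sha1Hex(byte[] data) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-1").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Store a data/inode block under the hash of its contents; return the key.
    String putBlock(byte[] contents) throws Exception {
        String key = sha1Hex(contents);
        dhash.put(key, contents);
        return key;
    }

    // Readers verify the integrity of any fetched block by re-hashing it.
    boolean verify(String key, byte[] contents) throws Exception {
        return key.equals(sha1Hex(contents));
    }

    public static void main(String[] args) throws Exception {
        ContentAddressedFsSketch fs = new ContentAddressedFsSketch();
        String dataKey = fs.putBlock("file contents".getBytes());
        // An inode block simply embeds the hashes of the blocks it points to.
        String inodeKey = fs.putBlock(("inode:" + dataKey).getBytes());
        System.out.println(fs.verify(dataKey, "file contents".getBytes()));  // true
        System.out.println(inodeKey);
    }
}
```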

  38. Some thoughts on implementation… • Why the Rabin public-key cryptosystem in SFSRO? Why NOT RSA or DSA? Because Rabin offers fast signature verification • How fast counts as cheap for digital signature verification? Far less than a typical network RTT (e.g. 82 μs)

  39. CFS: Implementation • CFS • 7K lines of C++ (including 3K lines for Chord) • Servers communicate over UDP with a C++ RPC package (provided by the SFS toolkit) • Why not TCP? Overhead of TCP connection setup • CFS runs on Linux, OpenBSD, and FreeBSD

  40. CFS: Results (1) • Lookup cost is O(log N) • Pre-fetching increases speed • Server selection increases speed

  41. CFS: Results (2) • With only 1 virtual server per real server, some servers store no blocks while others store more than the average • You can control storage space • Load balance: with multiple virtual servers per real server, the total fraction of the ID space that a server's virtual servers are responsible for is much more tightly clustered around the average

  42. CFS: Conclusions • Pros (or likes) • Simplicity • Aggressive load balancing (via virtual servers) • The algorithms guarantee data availability with high probability (e.g. tight bounds on lookup cost) • Cons (or dislikes) • Read-only storage system • No anonymity • No (keyword) search feature

  43. APPENDIX

  44. Appendix – P2P Comparisons (1) • Tapestry (UCB), Chord (MIT), Pastry (Microsoft), CAN (AT&T): all provide functionality to route messages to an object • Disadv. of CAN and Chord: overlay hops are chosen without regard to network proximity • Adv. of Tapestry & Pastry: construct locally optimal routing tables at initialization and maintain them, in order to reduce routing stretch • Adv. of Pastry: constrains the routing distance per overlay hop to achieve efficiency in point-to-point routing between overlay nodes

  45. Appendix - P2P Comparisons(2) • Adv. of Tapestry: locality-awareness • Number and location of object replicas are not fixed • Difference between Pastry and Tapestry is in object location. • While Tapestry helps the user or application locate the nearest copy of an object, • Pastry actively replicates the object and places replicas at random locations in the network.   • The result is that when a client searches for a nearby object, Tapestry would route through a few hops to the object, while Pastry might require the client to route to a distant replica of the object.

  46. Appendix - Bloom Filter (1) • Goal: to support membership queries • Given a set A = {a1, a2, … an} of n elements, the Bloom filter uses hash functions to decide whether a queried message is a member of the set • Factors: reject time, hash area size, allowable fraction of errors • Idea: examine only part of the message to recognize it as not matching a test message, and reduce the hash area size by allowing a small fraction of errors

  47. Appendix - Bloom Filter (2) • Initially, an m-bit vector v is set to all 0s • Choose k independent hash functions h1(), …, hk(), each with range {1 … m} (here k = 4) • For each element a ∈ A, the bits at positions h1(a), h2(a), …, hk(a) in v are set to 1 • Given a query for b, check the bits at positions h1(b), h2(b), …, hk(b) • If any of them is 0, then b is definitely not in the set A; if all are 1, b is probably in A (false positives are possible) [Figure: a Bloom filter with 4 hash functions]
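
A minimal Bloom filter sketch matching the slide's description; double hashing stands in for the k independent hash functions, and the parameters are arbitrary:

```java
// Minimal Bloom filter: an m-bit vector and k hash positions per element.
// A query answers "definitely not in the set" or "probably in the set".
import java.util.BitSet;

public class BloomFilterSketch {
    private final BitSet bits;
    private final int m, k;

    BloomFilterSketch(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive k positions from the element (double hashing stands in for k independent hashes).
    private int position(String element, int i) {
        int h1 = element.hashCode();
        int h2 = element.hashCode() * 31 + 17;
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String element) {
        for (int i = 0; i < k; i++) bits.set(position(element, i));
    }

    // false => definitely not in the set; true => probably in the set (false positives possible).
    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) if (!bits.get(position(element, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch bf = new BloomFilterSketch(1024, 4);
        bf.add("a1"); bf.add("a2");
        System.out.println(bf.mightContain("a1"));   // true
        System.out.println(bf.mightContain("zz"));   // almost certainly false
    }
}
```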
