
Replica Control for Peer-to-Peer Storage Systems

Explore replica control protocols and methods for maintaining consistency and availability in peer-to-peer storage systems, with a focus on optimistic and pessimistic approaches, write-ahead logs, version vectors, and primary-site and voting schemes.


Presentation Transcript


  1. Replica Control for Peer-to-Peer Storage Systems

  2. P2P • Peer-to-peer (P2P) has emerged as an important paradigm for sharing resources at the edges of the Internet. • The most widely exploited resource is storage, as typified by P2P music file sharing • Napster • Gnutella • Following the great success of P2P file sharing, a natural next step is to develop wide-area P2P storage systems that aggregate storage across the Internet.

  3. Replica Control Protocol • Replication: maintain multiple copies of some critical data to increase availability • Replica control protocol: guarantee a consistent view of the replicated data

  4. Replica Control Methods • Optimistic • Proceed optimistically with computation on the available subgroup and reconcile for consistency when nodes rejoin • Approaches: log, version vector, etc. • Pessimistic • Restrict computations under worst-case assumptions • Approaches: primary site, voting, etc.

  5. Write-ahead Log • Files are modified in place, but before any file block is changed, a record is written to a log stating which node is making the change, which file block is being changed, and what the old and new values are • Only after the log record has been written successfully is the change made to the file • The write-ahead log can be used for both undo (rollback) and redo
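
To make the mechanism concrete, here is a minimal Python sketch of the idea (the `LogRecord` fields and helper names are illustrative, not taken from any particular system): the log record, including the old and new values, is appended before the block is modified in place, so a failed update can later be undone or redone from the log.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    node_id: str      # which node is making the change
    block_no: int     # which file block is being changed
    old_value: bytes  # value before the change (for undo/rollback)
    new_value: bytes  # value after the change (for redo)

class WriteAheadLog:
    def __init__(self):
        self.records = []

    def append(self, record: LogRecord) -> None:
        # In a real system this record would be forced to stable storage
        # (e.g. fsync'd) before returning.
        self.records.append(record)

def write_block(log, blocks, node_id, block_no, new_value):
    # 1. Write the log record first ...
    log.append(LogRecord(node_id, block_no, blocks.get(block_no, b""), new_value))
    # 2. ... and only then modify the block in place.
    blocks[block_no] = new_value

def undo(log, blocks):
    # Roll back by restoring old values in reverse order.
    for rec in reversed(log.records):
        blocks[rec.block_no] = rec.old_value
```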

  6. Version Vector • Version vector for file f • An N-element vector, where N is the number of nodes on which f is stored • The i-th element is the number of updates done by node i • A vector V dominates V′ if • Every element of V >= the corresponding element of V′ • V and V′ conflict if neither dominates the other

  7. Version Vector • Consistency resolution • If V dominates V′, the replicas are inconsistent, but the inconsistency can be resolved by copying V's data to V′'s node • If V and V′ conflict, an update conflict is detected • Version vectors can detect only update conflicts; they cannot detect read-write conflicts
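
A small Python sketch of the dominance test and conflict detection described on these two slides (the helper names are illustrative):

```python
def dominates(v, w):
    """True if version vector v dominates w: every element of v >= the one in w."""
    return all(vi >= wi for vi, wi in zip(v, w))

def compare(v, w):
    if dominates(v, w) and dominates(w, v):
        return "identical"
    if dominates(v, w):
        return "v dominates"   # w is stale; copy v's data to w's node
    if dominates(w, v):
        return "w dominates"   # v is stale; copy w's data to v's node
    return "conflict"          # concurrent updates detected

# Example: f is replicated on three nodes.
print(compare([2, 1, 0], [1, 1, 0]))  # v dominates
print(compare([2, 0, 0], [1, 1, 0]))  # conflict
```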

  8. Primary Site Approach • Data are replicated on at least k+1 nodes (for k-resilience) • One node acts as the primary site (PS) • Every read request is served by the PS • Every write is applied at the PS and copied to all back-up sites • Any write request arriving at a back-up site is forwarded to the PS
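
A minimal sketch of the routing rules above, assuming simple in-memory nodes (the `Node` and `PrimarySite` classes are illustrative, not from any particular system):

```python
class Node:
    """A trivial in-memory replica."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def put(self, key, value):
        self.store[key] = value

class PrimarySite:
    """k-resilient primary-site replication: one primary plus k back-ups."""
    def __init__(self, primary, backups):
        self.primary = primary
        self.backups = list(backups)

    def read(self, key):
        # All reads are served by the primary site.
        return self.primary.get(key)

    def write(self, key, value):
        # Writes are applied at the primary and copied to every back-up;
        # a write received by a back-up would simply be forwarded here.
        self.primary.put(key, value)
        for b in self.backups:
            b.put(key, value)
```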

  9. PS Failure Handling • If a back-up fails, there is no interruption in service • If the PS fails, there are two cases • If the network is not partitioned • Choose a back-up node as the new primary • If checkpointing has been active, only a restart from the previous checkpoint is needed • If the network is partitioned • Only the partition containing the PS can make progress • The other partitions stop updating the data • It is necessary to distinguish between site failures and network partitions

  10. Voting Approach • V votes are distributed among the n replicas such that • Vw + Vr > V • Vw + Vw > V • A read must obtain Vr or more votes • A write must obtain Vw or more votes • Quorum systems are more general than voting
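
The two constraints can be checked mechanically; a small sketch, assuming one vote count per replica and fixed read/write thresholds (illustrative only):

```python
def valid_vote_assignment(votes, v_r, v_w):
    """Check the weighted-voting constraints from the slide.

    votes: list of vote counts, one per replica (total V = sum(votes)).
    v_r:   votes required to read.
    v_w:   votes required to write.
    """
    V = sum(votes)
    read_write_ok = v_w + v_r > V   # every read quorum meets every write quorum
    write_write_ok = v_w + v_w > V  # any two write quorums intersect
    return read_write_ok and write_write_ok

# Example: 5 replicas with one vote each.
print(valid_vote_assignment([1, 1, 1, 1, 1], v_r=2, v_w=4))  # True
print(valid_vote_assignment([1, 1, 1, 1, 1], v_r=2, v_w=3))  # False (2 + 3 = 5, not > 5)
```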

  11. Quorum Systems • Trees • Grid-based (array-based) • Torus • Hierarchical • Multi-column and so on…

  12. Quorum-Based Schemes (1/2) • n replicas, each with a version number • Read operation • Read-lock and access a read quorum • Return the value of the replica with the largest version number • Write operation • Write-lock and access a write quorum • Update all replicas in the quorum with the new value and version number = largest + 1
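
A minimal sketch of these read and write rules, representing each replica as a dictionary with a value and a version number (purely illustrative; locking is omitted):

```python
def quorum_read(read_quorum):
    """Access a read quorum and return the value of the replica
    with the largest version number."""
    latest = max(read_quorum, key=lambda r: r["version"])
    return latest["value"]

def quorum_write(write_quorum, new_value):
    """Access a write quorum and install new_value with
    version = largest version seen + 1 at every replica in the quorum."""
    new_version = max(r["version"] for r in write_quorum) + 1
    for r in write_quorum:
        r["value"] = new_value
        r["version"] = new_version

# Example: three replicas, majority quorums of size 2.
replicas = [{"value": "a", "version": 3},
            {"value": "a", "version": 3},
            {"value": "old", "version": 1}]
quorum_write(replicas[1:], "b")    # write quorum = replicas 1 and 2
print(quorum_read(replicas[:2]))   # read quorum = replicas 0 and 1 -> "b"
```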

  13. Quorum-Based Schemes (2/2) • One-copy equivalence: the set of replicas must behave as if there were only a single copy; this is the strictest consistency criterion • One-copy equivalence is guaranteed if we enforce • Write-write and write-read lock exclusion • The intersection property • A non-empty intersection between any pair of • A read quorum and a write quorum • Two write quorums
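
For a concrete quorum system, the intersection property can be checked by brute force; a small sketch over majority quorums (illustrative only):

```python
from itertools import combinations

def satisfies_intersection_property(read_quorums, write_quorums):
    """Every (read quorum, write quorum) pair and every pair of write
    quorums must share at least one replica."""
    rw_ok = all(r & w for r in read_quorums for w in write_quorums)
    ww_ok = all(w1 & w2 for w1 in write_quorums for w2 in write_quorums)
    return rw_ok and ww_ok

# Majority quorums over replicas {0, 1, 2}: any two majorities intersect.
majorities = [set(c) for c in combinations(range(3), 2)]
print(satisfies_intersection_property(majorities, majorities))  # True
```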

  14. Witnesses • A witness is a small entity that maintains enough information to identify the replicas that contain the most recent version of the data • The information could be a timestamp recording the time of the latest update • It could also be a version number, an integer incremented each time the data are updated

  15. Classification of P2P Storage Systems • Unstructured • "Replication Strategies for Highly Available Peer-to-peer Storage" • "Replication Strategies in Unstructured Peer-to-peer Networks" • Structured • Read-only: CFS, PAST • Read/write (mutable): LAR, Ivy, Oasis, Om, Eliot • Sigma (a mutual exclusion primitive)

  16. Ivy • Stores a set of logs with the aid of a distributed hash table • Ivy keeps one log per participant, recording all of that participant's updates, and maintains data consistency optimistically by resolving conflicts among the logs (i.e., consistency is maintained in a best-effort manner) • The logs must be kept indefinitely, and a participant must scan all the logs related to a file to find the up-to-date file data; thus Ivy is only suitable for small groups of participants

  17. Ivy Uses a DHT • The DHT provides • A simple API: put(key, value) and get(key) → value • Availability (replication) • Robustness (integrity checking) • Layered architecture (figure): Ivy (distributed application, put/get of data) on top of DHash (distributed hash table) on top of Chord (lookup service mapping a key to a node IP address)
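
As a rough stand-in for the put/get interface in the slide (not Ivy's or DHash's actual code, and with no Chord routing, replication, or integrity checking):

```python
class ToyDHT:
    """In-memory stand-in for the DHT put/get interface.
    A real DHash layer would route the key through Chord to the responsible
    node, replicate the block for availability, and verify its integrity."""

    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

dht = ToyDHT()
dht.put("log-head-alice", b"pointer to most recent log record")
print(dht.get("log-head-alice"))
```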

  18. Solution: Log Based • Update: Each participant maintains a log of changes to the file system • Lookup: Each participant scans all logs
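
A toy sketch of the log-based scheme (the record format is hypothetical; real Ivy log records carry richer file-system operations and are ordered with version vectors):

```python
class Participant:
    def __init__(self, name):
        self.name = name
        self.log = []   # append-only log of this participant's own updates

    def update(self, path, data):
        self.log.append({"path": path, "data": data, "seq": len(self.log)})

def lookup(path, participants):
    """Scan every participant's log and return the data found for path.
    (Simplified: real Ivy orders records from different logs with version vectors.)"""
    latest = None
    for p in participants:
        for rec in p.log:
            if rec["path"] == path:
                latest = rec["data"]
    return latest

alice, bob = Participant("alice"), Participant("bob")
alice.update("/doc.txt", b"v1")
bob.update("/doc.txt", b"v2")
print(lookup("/doc.txt", [alice, bob]))
```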

  19. Eliot • Eliot relies on a reliable, fault-tolerant, immutable P2P storage substrate (Charles) to store data blocks, and uses an auxiliary metadata service (MS) to store mutable metadata • It supports NFS-like consistency semantics; however, the traffic between the MS and the client is high under these semantics • It also supports AFS open-close consistency semantics; however, these semantics can cause lost updates • The MS is implemented as a conventional replicated database, which may not be well suited to dynamic P2P environments

  20. Oasis • Oasis is based on Gifford's weighted-voting quorum concept and allows dynamic quorum membership • It spreads versioned metadata along with the data replicas over the P2P network • To perform an operation on a data object, a client must first find metadata for the object and determine the total number of votes, the votes required for read/write operations, the replica list, and so on, in order to form a quorum • One drawback of Oasis is that if a node happens to use stale metadata, data consistency may be violated

  21. Om • Om is based on automatic replica regeneration and replica membership reconfiguration • Consistency is maintained by two quorum systems: a read-one-write-all quorum system for accessing replicas, and a witness-modeled quorum system for reconfiguration • Om can regenerate replicas from a single remaining replica; however, every write is first forwarded to the primary copy, which serializes all writes and uses a two-phase procedure to propagate the write to all secondary replicas • The drawbacks of Om are that (1) the primary replica may become a bottleneck, (2) the overhead of the two-phase procedure may be high, and (3) reconfiguration via the witness model has some probability of violating consistency

  22. Om and Pastry • PAST example, object key = 100 (figure: Pastry ring with node IDs 80, 90, 98, 99, 100, 101, 103, 104, 120)

  23. Om and Pastry • PAST example, object key = 100: replication (figure: the object is replicated on nodes near the key)

  24. Om and Pastry • PAST example, object key = 100: replica crash (figure: one replica node fails)

  25. Om and Pastry • PAST example, object key = 100: regeneration (figure: a new replica is regenerated on another node)

  26. Om: Normal-Case Operation • Read-one/write-all approach • Writes are serialized via the primary (figure: Pastry ring with node IDs 80–120; writes go to the primary replica, reads can be served by any single replica)

  27. Om: System Architecture Overview

  28. Witness Model • The witness model relies on the following limited view divergence property • Intuitively, the property says that two replicas are unlikely to have completely different views of the reachability of a set of randomly placed witnesses

  29. Witness Model • To exploit the limited view divergence property, all replicas logically organize the witnesses into an m × t matrix • The number of rows, m, determines the probability of intersection • The number of columns, t, protects against the failure of individual witnesses, so that each row contains at least one functioning witness with high probability
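
A small sketch of how such an m × t witness matrix might be laid out and checked for functioning witnesses per row (illustrative only, not Om's actual data structures):

```python
def witness_matrix(witnesses, m, t):
    """Arrange m*t witness IDs into m rows of t columns (assumes len(witnesses) == m*t)."""
    return [witnesses[i * t:(i + 1) * t] for i in range(m)]

def rows_with_reachable_witness(matrix, reachable):
    """For each row, a replica only needs one reachable witness; the number of
    rows m drives the probability that two replicas' views intersect."""
    return [any(w in reachable for w in row) for row in matrix]

# Example: 6 witnesses in a 3 x 2 matrix; witness "w3" is currently unreachable.
matrix = witness_matrix(["w0", "w1", "w2", "w3", "w4", "w5"], m=3, t=2)
print(rows_with_reachable_witness(matrix, reachable={"w0", "w1", "w2", "w4", "w5"}))
# [True, True, True] -> every row still has at least one functioning witness
```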

  30. Witness Model

  31. Sigma System Model

  32. Sigma System Model • Replicas are always available, but their internal states may be randomly reset (a failure-recovery model) • The number of clients is unpredictable; clients are not malicious and are fail-stop • Clients and replicas communicate via messages, which may be duplicated or lost but never forged

  33. Sigma • The Sigma protocol collects state from all replicas to achieve mutual exclusion • The basic idea is as follows: a node u wishing to win the mutual exclusion sends a timestamped request to each of the n (n = 3k+1) replicas and waits for replies. On receiving a request from u, a replica v inserts u's request into a local queue in timestamp order, takes as the current winner the node whose request is at the front of the queue, and replies to u with that winner's ID

  34. Sigma • When the number of replies received by u reaches m (m = 2k+1), u acts as follows: (1) if at least m replies name u as the winner, then u is the winner; (2) if at least m replies name some other node w (w ≠ u) as the winner, then w is the winner and u keeps waiting; (3) if no node is named the winner by at least m replies, u sends a YIELD message to cancel its request temporarily and then re-inserts the request after a random backoff • In this manner, one node can eventually be elected the winner even when the variance in communication delay is large • A drawback of the Sigma protocol is that a node must send requests to all replicas and obtain supporting replies from a large fraction (about 2/3) of them to win the mutual exclusion, which incurs a large overhead; the overhead grows even larger under high contention
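
A sketch of the decision rule for node u once replies have been collected, taking the threshold as at least m supporting replies (names and the backoff policy are illustrative, not the paper's exact formulation):

```python
import random

def decide(replies, m, my_id):
    """Decide u's next step from the winner IDs reported in the collected replies.
    replies: one node ID per reply, naming the requester that replica
    currently places at the front of its local queue."""
    counts = {}
    for winner in replies:
        counts[winner] = counts.get(winner, 0) + 1

    if counts.get(my_id, 0) >= m:
        return "WINNER"            # enough replicas back u: u wins the mutual exclusion
    if any(node != my_id and c >= m for node, c in counts.items()):
        return "WAIT"              # some other node has the quorum; keep waiting
    # No node has m supporting replies: yield, back off, then re-request.
    backoff = random.uniform(0.0, 1.0)
    return f"YIELD_AND_RETRY after {backoff:.2f}s"

# Example with k = 1: n = 4 replicas, threshold m = 3.
print(decide(["u", "u", "u", "w"], m=3, my_id="u"))   # WINNER
print(decide(["u", "w", "w", "v"], m=3, my_id="u"))   # YIELD_AND_RETRY ...
```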

  35. References • Ivy: A. Muthitacharoen, R. Morris, T. Gil, and B. Chen, "Ivy: A Read/Write Peer-to-peer File System," in Proc. of the Symposium on Operating Systems Design and Implementation (OSDI), 2002. • Eliot: C. Stein, M. Tucker, and M. Seltzer, "Building a Reliable Mutable File System on Peer-to-peer Storage," in Proc. of the Workshop on Reliable Peer-to-peer Distributed Systems (WRP2PDS), 2002. • Oasis: M. Rodrig and A. LaMarca, "Decentralized Weighted Voting for P2P Data Management," in Proc. of the 3rd ACM International Workshop on Data Engineering for Wireless and Mobile Access, pp. 85–92, 2003. • Om: H. Yu and A. Vahdat, "Consistent and Automatic Replica Regeneration," in Proc. of the First Symposium on Networked Systems Design and Implementation (NSDI '04), 2004. • Sigma: S. Lin, Q. Lian, M. Chen, and Z. Zhang, "A Practical Distributed Mutual Exclusion Protocol in Dynamic Peer-to-Peer Systems," in Proc. of the 3rd International Workshop on Peer-to-Peer Systems (IPTPS '04), 2004.

  36. Q&A
