Erasure Code Replication Presenter: W.K Lin (The Chinese University of Hong Kong)
Why do we need replication? • Storage devices can fail. • Replication increases data availability, e.g. RAID. • The basic idea of replication: • Store redundant data in several places so that at least one copy is likely to be reachable. • P2P systems often provide replication.
Server-less VoD Architecture • No centralized video server provides the video stream. • Each client in the system stores a subset of the video blocks. • The video blocks are stored using an erasure code. • Complete video playback does not require streaming from all peers. • Clients stream the video from other clients.
Some Terminology • Peers are the computers/storage devices that store the data. • Peer availability μ is the fraction of time that a peer is up/online. • File availability A is the probability that the file can be recovered from the replicated data. • Storage overhead S is the ratio of the storage used with replication to the storage required before replication.
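Whole File Replication • Whole file replication replicates the complete file. • If the storage overhead is S, then there are S copies of the file in the system. • File availability $A_w$: assuming peers are online independently, the file is available when at least one of the S copies is on an online peer, so $A_w = 1 - (1-\mu)^S$.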
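As a rough illustration of whole file replication: if peers are assumed to be online independently with probability μ, the file is available whenever at least one of the S copies is reachable, i.e. with probability 1 - (1 - μ)^S. The function names below are hypothetical and the independence assumption is ours.

```python
def whole_file_availability(mu, copies):
    """Probability that at least one of `copies` independent peers is online."""
    return 1.0 - (1.0 - mu) ** copies


def copies_needed(mu, target):
    """Smallest number of whole-file copies whose availability reaches `target`.

    Assumes 0 < mu < 1 and 0 < target < 1 (illustrative helper, not from the slides).
    """
    copies = 1
    while whole_file_availability(mu, copies) < target:
        copies += 1
    return copies


# Storage overhead S needed for 99% file availability at a few peer availabilities.
for mu in (0.3, 0.5, 0.7):
    print(f"mu={mu}: S={copies_needed(mu, 0.99)}")
```

The number of copies grows quickly as μ falls, which is the storage cost the next slides set out to reduce.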
Whole File Replication • It is not storage-efficient. (Figure adapted from: Replication Strategies for Highly Available Peer to Peer Networks, Ranjita Bhagwan et al.)
Erasure Code Replication • Instead of replicating the whole file, replicate a portion of the file. • Principle: • A file is divided into b blocks. • An erasure code adds redundancy to these b blocks, giving n blocks in total. • The n blocks are made dependent on each other: each block carries partial information about the other blocks. • Any b out of the n blocks are enough to recover the original file.
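The "any b out of n" property can be seen in the simplest possible erasure code, a single XOR parity block (n = b + 1), which is the idea behind RAID parity. This is only a toy sketch: the slides do not say which code is used, and codes that tolerate more than one lost block (for example Reed-Solomon) are more involved. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the b data blocks followed by one XOR parity block (n = b + 1)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def decode(received, b):
    """Recover the b data blocks from a dict {block index: block}.

    `received` must hold at least b of the n = b + 1 coded blocks;
    index b is the parity block.
    """
    if len(received) < b:
        raise ValueError("need at least b blocks to recover the file")
    missing = [i for i in range(b) if i not in received]
    if not missing:
        return [received[i] for i in range(b)]
    # Exactly one data block is missing, so the parity block is present:
    # XOR of all surviving blocks yields the lost block.
    restored = dict(received)
    restored[missing[0]] = xor_blocks(list(received.values()))
    return [restored[i] for i in range(b)]


data = [b"abcd", b"efgh", b"ijkl"]           # b = 3 blocks
coded = encode(data)                          # n = 4 blocks, S = 4/3
survivors = {i: blk for i, blk in enumerate(coded) if i != 1}  # block 1 lost
print(decode(survivors, b=3) == data)
```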
Erasure Code Replication • Storage overhead S = n/b, or n = S·b. • Since any b out of the S·b blocks suffice to recover the file, the file availability $A_b$ is: $A_b = \sum_{i=b}^{Sb} \binom{Sb}{i} \mu^i (1-\mu)^{Sb-i}$ • Notice that whole file replication is a special case of erasure code replication with b = 1.
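A minimal sketch of this availability calculation, assuming each of the n = S·b blocks sits on a different peer and peers are online independently with probability μ: the file is recoverable when at least b blocks are reachable, which is a binomial tail probability. The function name is hypothetical, and S·b is rounded to an integer.

```python
from math import comb


def erasure_availability(mu, overhead, b):
    """P(at least b of n = overhead * b blocks are on online peers)."""
    n = round(overhead * b)
    return sum(comb(n, i) * mu**i * (1.0 - mu) ** (n - i) for i in range(b, n + 1))


# With b = 1 the sum reduces to the whole file case, 1 - (1 - mu)**S.
mu, overhead = 0.5, 2
print(erasure_availability(mu, overhead, b=1), 1.0 - (1.0 - mu) ** overhead)
print(erasure_availability(mu, overhead, b=8))
```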
Erasure Code Replication • Erasure code replication is more storage-efficient. (Figure adapted from: Replication Strategies for Highly Available Peer to Peer Networks, Ranjita Bhagwan et al.)
Effectiveness of Erasure Code Replication • The effectiveness of erasure code replication is determined by two factors: • the combinatorial effect, i.e. $\binom{Sb}{b} \gg \binom{S}{1}$ • the peer availability factor $\mu^b(1-\mu)^{Sb-b}$ • Erasure code replication depends on S, b, and μ.
How Does Erasure Code Replication Perform? • File availability A ($A_w$ or $A_b$) for varying μ and S:
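A small sweep such as the following recomputes this quantity over a grid of μ and S. It is a sketch under the same assumptions as before (independent peers, one block per peer); the grid values, the choice b = 8, and the function names are illustrative and not taken from the slides.

```python
from math import comb


def availability(mu, overhead, b):
    """Binomial tail: at least b of n = overhead * b blocks are reachable."""
    n = round(overhead * b)
    return sum(comb(n, i) * mu**i * (1.0 - mu) ** (n - i) for i in range(b, n + 1))


def availability_table(mus, overheads, b):
    """Map each (mu, S) pair to the file availability for block count b."""
    return {(mu, s): availability(mu, s, b) for mu in mus for s in overheads}


mus = (0.2, 0.35, 0.5, 0.8)
overheads = (2, 3, 4)
whole_file = availability_table(mus, overheads, b=1)   # whole file replication
erasure = availability_table(mus, overheads, b=8)      # erasure code, b = 8

print("mu    S   A_w     A_b(b=8)")
for key in sorted(whole_file):
    print(f"{key[0]:<5} {key[1]:<3} {whole_file[key]:.4f}  {erasure[key]:.4f}")
```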
A Related Problem • The Lee and Liew paper “Parallel Communications for ATM Network Control and Management” points out a similar problem: • An information string is divided into b parts, then encoded into n parts. • Any b out of the n parts are enough to recover the original information. • Very similar to our problem! • They prove a necessary bound Sμ > 1 for reliable communication.
Erasure Code Bound (Sμ > 1) • The area above the curve defines the region where erasure code replication is preferred for large b.
Erasure Code Replication Sensitivity Analysis • A large b is needed in order to benefit from erasure code replication. • If the system operates near Sμ ≈ 1, a small fluctuation in a system parameter can push it below the bound and harm file availability.
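The role of the bound for large b can be checked numerically. The sketch below assumes independent peers and evaluates the binomial tail in log space (via `math.lgamma`) so that large n does not overflow; the parameter values and function names are illustrative only.

```python
import math


def log_binom_pmf(n, i, p):
    """log of C(n, i) * p**i * (1 - p)**(n - i), for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1.0 - p))


def availability(mu, overhead, b):
    """P(at least b of n = overhead * b blocks are reachable)."""
    n = round(overhead * b)
    total = sum(math.exp(log_binom_pmf(n, i, mu)) for i in range(b, n + 1))
    return min(1.0, total)  # guard against rounding slightly above 1


S = 2
for mu in (0.6, 0.4):  # S*mu = 1.2 (above the bound) and 0.8 (below it)
    row = [f"{availability(mu, S, b):.4f}" for b in (1, 10, 100, 400)]
    print(f"S*mu={S * mu:.1f}:", row)
```

Above the bound the availability climbs toward 1 as b grows; below it, a larger b makes things worse.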
Erasure Code Replication Sensitivity Analysis • The system is targeted to operate at S = 3, μ = 0.35. • Sμ = 1.05 > 1 • A 10% measurement error in μ: if the true μ is 0.315, then Sμ = 0.945 < 1.
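To make the sensitivity concrete, the following sketch compares the design point S = 3, μ = 0.35 with a true μ that is 10% lower (0.315), for whole file replication (b = 1) and for increasingly large b. Independent peers are assumed, and the values of b and the function names are illustrative choices, not from the slides.

```python
import math


def log_binom_pmf(n, i, p):
    """log of C(n, i) * p**i * (1 - p)**(n - i), for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1.0 - p))


def availability(mu, overhead, b):
    """P(at least b of n = overhead * b blocks are reachable)."""
    n = round(overhead * b)
    total = sum(math.exp(log_binom_pmf(n, i, mu)) for i in range(b, n + 1))
    return min(1.0, total)


S = 3
mu_design = 0.35               # S*mu = 1.05, just above the bound
mu_actual = mu_design * 0.9    # 10% overestimate: true S*mu = 0.945

for b in (1, 50, 500):
    planned = availability(mu_design, S, b)
    actual = availability(mu_actual, S, b)
    print(f"b={b:<4} planned={planned:.3f} actual={actual:.3f}")
```

With b = 1 the availability moves only from about 0.73 to about 0.68, whereas with b = 500 it falls from well above one half to well below it.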
Related Work I: • Markov chain for a simple birth/death model: Adapted from: Design and Analysis of a Fault-Tolerant Mechanism for a Server-Less Video-On-Demand System, Lee and Yeung
Related Work I: • Mean time to failure of the model: • Result:
Related Work II: • Another Markov model: • c: connected state, mean time to stay = λ • u: disconnected state, mean time to stay = μ • d: dead state • α: the probability of going to disconnected state d. Adapted from: Data Durability in Peer to Peer Storage Systems, Gil Utard, Antoine Vernois
Related Work II: Storage overhead S=3
Conclusion • Traditionally, erasure code replication has been very successful, e.g. RAID. • A strict bound, Sμ > 1, has to be satisfied for a system to gain from erasure code replication. • Erasure code replication is sensitive to system measurement errors. • This partly explains why erasure code replication is not seen in P2P systems.
Future Directions • Most analyses assume that all peers have the same availability level. • In a real system, peers may have different failure and recovery rates. • Replica distribution and discovery are open research problems: • How should replicas be placed and located if the peers have different availabilities? • If the system fails, how can the lost replicas be recovered from the system?
Appendix • Proof: Let X be a binomial random variable with mean $\mu' = Sb\mu$ and variance $\sigma^2 = Sb\mu(1-\mu)$.
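One standard way to carry this setup through to the Sμ > 1 condition is Chebyshev's inequality. The sketch below is an assumption about the method rather than the slides' own derivation; it takes X to count the reachable blocks, so that the file availability is $A_b = P(X \ge b)$, and covers the case Sμ > 1.

```latex
% Sketch via Chebyshev's inequality (assumed method, not taken from the slides).
% X ~ Binomial(Sb, \mu), \mu' = Sb\mu, \sigma^2 = Sb\mu(1-\mu), A_b = P(X \ge b).
\begin{align*}
1 - A_b &= P(X < b) \\
        &\le P\bigl(|X - \mu'| \ge \mu' - b\bigr)
           && \text{valid when } \mu' - b = b(S\mu - 1) > 0 \\
        &\le \frac{\sigma^2}{(\mu' - b)^2}
           && \text{(Chebyshev)} \\
        &= \frac{Sb\mu(1-\mu)}{b^2 (S\mu - 1)^2}
         = \frac{S\mu(1-\mu)}{b\,(S\mu - 1)^2}
         \longrightarrow 0 \quad (b \to \infty).
\end{align*}
```

So for Sμ > 1 the file availability tends to 1 as b grows, at a rate that degrades as Sμ approaches 1.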
Appendix • Similarly,