XORing Elephants: Novel Erasure Codes for Big Data Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, Dhruba Borthakur USC, UT Austin, and Facebook The 39th International Conference on Very Large Data Bases (VLDB), 2013
A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran UC Berkeley, Facebook 2013 IEEE ISIT (International Symposium on Information Theory) & The 5th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage 2013
Outline • Introduction • Locally Repairable Codes (LRC) • Performance evaluation results • Conclusion
Distributed storage systems • Numerous disk failures per day • Failures are the norm rather than the exception • Must introduce redundancy for reliability • Moving from replication to coding
Data distribution • Encode and distribute a data file to n storage nodes. Data File: “INC”
Data collector • Data collector can retrieve the whole file by downloading from any k storage nodes. “INC”
Three kinds of disk failures • Transient error due to noise corruption • repeat the disk access request • Disk sector error • partial failure • detected and masked by the operating system • Catastrophic error • total failure, for instance due to the disk controller • the whole disk is regarded as erased
Frequency of node failures • Goal: design new codes that have easier repair • [Figure: number of failed nodes over a single month in a 3000-node production cluster of Facebook] • 20 node failures/day * 15TB = 300TB • If 8% is RS coded, 588TB network traffic/day (average total network traffic: 2PB/day) • ~30% of network traffic is repair on a normal day
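The slide does not show how the 588TB figure is obtained. One reading that reproduces it, offered here as an assumption rather than the authors' stated calculation, is that a lost replicated block costs one block of repair traffic while a lost RS-coded block costs 13 blocks (all survivors of a (14,10) stripe). A minimal sketch under that assumption:

```python
# Back-of-the-envelope repair traffic estimate.
# ASSUMPTION (not stated on the slide): repairing a replicated block moves
# 1 block; repairing an RS(14,10) block moves 13 blocks.
FAILED_NODES_PER_DAY = 20
TB_PER_NODE = 15
RS_FRACTION = 0.08          # share of data that is RS coded
RS_BLOCKS_MOVED = 13        # assumed cost of repairing one RS block
TOTAL_NETWORK_TB = 2000     # 2PB/day average total network traffic


def daily_repair_traffic_tb(rs_fraction: float = RS_FRACTION) -> float:
    """Estimated repair traffic in TB/day for a given RS-coded fraction."""
    lost_tb = FAILED_NODES_PER_DAY * TB_PER_NODE
    replicated = lost_tb * (1 - rs_fraction)          # 1 block per lost block
    coded = lost_tb * rs_fraction * RS_BLOCKS_MOVED   # 13 blocks per lost block
    return replicated + coded


traffic = daily_repair_traffic_tb()
share = traffic / TOTAL_NETWORK_TB
print(f"{traffic:.0f} TB/day, {share:.0%} of network traffic")
```

Under the same assumption, `daily_repair_traffic_tb(0.5)` evaluates to 2100TB, more than the 2PB/day average total, which illustrates why repair is the bottleneck for wider RS adoption.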
Distributed storage system • Encode a data file and distribute it to n disks • (n,k) recovery property • The data file can be rebuilt from any k disks. • Repair • If a node fails, we regenerate a new node by connecting to and downloading data from any d surviving disks. • Aim at minimizing the repair bandwidth (Dimakis et al. 2007). • A coding scheme with the above properties is called a regenerating code. Dimakis, Godfrey, Wainwright and Ramchandran, Network coding for distributed storage systems, IEEE INFOCOM, 2007.
Repetition scheme • GFS: Replicate data 3 times • Gmail: Replicate data 21 times
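2x Repetition scheme • Divide the data file into 2 parts, A and B (1G each) • Four nodes store A, B, A, B; the data collector reads one copy of A and one copy of B • Cannot tolerate double disk failures (e.g. losing both copies of A)
Repair is easy for repetition-based systems • The new node copies A (1G) from the surviving replica • Repair bandwidth = 1G
Reed-Solomon code • Divide the file into 2 parts, A and B • Four nodes store A, B, A+B, A+2B; the data collector recovers A and B from any two nodes • It can tolerate double disk failures
Repair requires essentially decoding the whole file • The new node replacing A downloads 1G from each of two surviving nodes (e.g. B and A+B) • Repair bandwidth = 2G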
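The (4,2) example above can be sketched in a few lines. This is an illustrative toy, not the HDFS RAID implementation: it assumes each block is a single symbol in the prime field of size 257 (a hypothetical choice made only so that "2B" is well defined), and the function names are made up for this sketch.

```python
from itertools import combinations

P = 257  # small prime field, chosen for illustration only

# Node i stores c0*A + c1*B (mod P): A, B, A+B, A+2B.
COEFFS = [(1, 0), (0, 1), (1, 1), (1, 2)]


def encode(a, b):
    """Return the four stored symbols for data symbols a and b."""
    return [(c0 * a + c1 * b) % P for c0, c1 in COEFFS]


def decode(i, yi, j, yj):
    """Recover (A, B) from the symbols of any two distinct nodes i and j."""
    (p, q), (r, s) = COEFFS[i], COEFFS[j]
    det_inv = pow((p * s - q * r) % P, -1, P)  # inverse of the 2x2 determinant
    a = ((s * yi - q * yj) * det_inv) % P      # Cramer's rule
    b = ((p * yj - r * yi) * det_inv) % P
    return a, b


def repair(stored, lost):
    """Rebuild the symbol of node `lost`; returns (symbol, symbols downloaded)."""
    i, j = [n for n in range(4) if n != lost][:2]
    a, b = decode(i, stored[i], j, stored[j])  # decodes the whole file
    c0, c1 = COEFFS[lost]
    return (c0 * a + c1 * b) % P, 2


stored = encode(42, 99)
# Any 2 of the 4 nodes suffice, so any double failure is tolerated.
any_two_ok = all(decode(i, stored[i], j, stored[j]) == (42, 99)
                 for i, j in combinations(range(4), 2))
symbol, downloaded = repair(stored, lost=0)
```

Repairing one node downloads two full symbols, i.e. the whole file, which matches the 2G repair bandwidth on the slide when each block is 1G.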
Storing with an (n,k) MDS code • An (n,k) erasure code provides a way to: • Take k packets and generate n packets of the same size such that any k out of n suffice to reconstruct the original k • Optimal reliability for that given redundancy. Well known and frequently used, e.g. Reed-Solomon codes, array codes, LDPC and Turbo codes. • Each packet is stored at a different node, distributed in a network. • Example of an n=5, k=4 code: single parity P1 = 1+2+3+4
Current Hadoop architecture • 3x replication is the current HDFS default. • Very large storage overhead. • As data grows faster than infrastructure, 3x is too expensive.
Facebook introduced Reed-Solomon (HDFS RAID) • ‘Cold’ (i.e. rarely accessed) files are switched from 3-replication to (14,10) Reed-Solomon. • HDFS RAID uses a Reed-Solomon erasure code.
Limitations of classical codes • Currently only 8% of Facebook’s data warehouse is RS encoded (still a significant saving). • Our goal: move to 40-50% of the data being RS coded • Save petabytes • Bottleneck: code repair • Goal: design new codes that have easier repair
Repair metrics of interest • The number of bits communicated in the network during single node failures (Repair bandwidth) • Capacity known for two points only. My 3-year-old conjecture for intermediate points was just disproved. [ISIT13] • The number of bits read from disks during single node repairs (Disk IO) • Capacity unknown. • Only known technique is bounding by Repair Bandwidth • The number of nodes accessed to repair a single node failure (Locality) • Capacity computed [ISIT12]. Very few explicit code constructions known. • Real systems started using these codes recently. [UsenixATC12, VLDB13]
Some experiments • 100 machines on Amazon EC2 • 50 machines running HDFS RAID (Facebook version, (14,10) Reed-Solomon code) • 50 machines running our version (HDFS with LRC) • 50 files uploaded to the system, 640MB per file • Killing nodes and measuring network traffic, disk IO, CPU, etc. during node repairs
What do we observe? • LRC storage codes reduce bytes read by roughly 2.6x • Network bandwidth reduced by approximately 2x • We use 14% more storage. Similar CPU. • In several cases 30~40% faster repairs. A larger-scale study is ongoing. • Gains can be much more significant if larger codes are used (e.g. for archival storage systems).
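The 14% storage figure is consistent with the stripe sizes used in the evaluation: 14 blocks per stripe for HDFS-RS and 16 for HDFS-Xorbas, both holding 10 data blocks. A quick check (the variable and function names are only for this sketch):

```python
DATA_BLOCKS = 10
RS_STRIPE = 14    # 10 data blocks + 4 RS parity blocks
LRC_STRIPE = 16   # RS stripe + 2 stored local parity blocks


def overhead(stripe_blocks: int, data_blocks: int = DATA_BLOCKS) -> float:
    """Stored blocks per data block."""
    return stripe_blocks / data_blocks


extra = LRC_STRIPE / RS_STRIPE - 1
print(f"RS: {overhead(RS_STRIPE):.1f}x, LRC: {overhead(LRC_STRIPE):.1f}x, "
      f"extra storage: {extra:.0%}")
```

Both schemes remain well below the 3x overhead of the default replication.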
Erasure Coding: MDS • File or data object split into A and B (K=2) • N=3: (3,2) MDS code (single parity), RAID 5 mode: A, B, A+B • N=4: (4,2) MDS code, RAID 6 mode: A, B, A+B, A+2B
Erasure Coding vs. Replication • File or data object split into A and B • Replication: A, A, B, B • (4,2) MDS erasure code: A, B, A+B, A+2B • Low redundancy level, low storage cost
Improvement of MDS • Regenerating code • Repairs lost encoded fragments from existing encoded fragments. • A new class of erasure codes. • Reduces repair bandwidth. • Increases the number of surviving nodes contacted.
Reed-Solomon codes • Conventional repair: • Rebuild the whole file and reconstruct the lost data in the new node • Repair traffic = M • [Figure: n = 4, k = 2; file of size M split into A, B; Node 1: A; Node 2: B; Node 3: A+B; Node 4: A+2B; the new node downloads B and A+B to rebuild A]
Regenerating codes [Dimakis et al. ’10] • Repair in regenerating codes: • Downloads one chunk from each node (instead of the whole file) • Repair traffic = 0.75M: saves 25% for (n=4, k=2), with the same storage size • Using network coding: encode chunks in storage nodes • [Figure: n = 4, k = 2; file of size M split into A, B, C, D; Node 1: A, B; Node 2: C, D; Node 3: A+C, B+D; Node 4: A+D, B+C+D; to repair Node 1 the new node downloads C from Node 2, A+C from Node 3 and A+B+C from Node 4] • Alexandros G. Dimakis, Brighten Godfrey, Yunnan Wu, Martin J. Wainwright, Kannan Ramchandran: Network coding for distributed storage systems. IEEE Transactions on Information Theory 56(9): 4539-4551, 2010
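The repair in the (n=4, k=2) figure can be traced in code. The sketch below assumes the node layout read off the figure (Node 1: A, B; Node 2: C, D; Node 3: A+C, B+D; Node 4: A+D, B+C+D) and takes "+" to be bitwise XOR; the chunk size and helper names are arbitrary choices for illustration.

```python
import os

CHUNK = 4  # bytes per chunk, tiny for illustration


def xor(*chunks: bytes) -> bytes:
    """Bitwise XOR of equally sized chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


A, B, C, D = (os.urandom(CHUNK) for _ in range(4))
nodes = {
    1: (A, B),
    2: (C, D),
    3: (xor(A, C), xor(B, D)),
    4: (xor(A, D), xor(B, C, D)),
}


def repair_node1(nodes):
    """Rebuild (A, B) by downloading one chunk from each surviving node."""
    from2 = nodes[2][0]        # C, sent as stored
    from3 = nodes[3][0]        # A+C, sent as stored
    from4 = xor(*nodes[4])     # node 4 encodes locally: (A+D)+(B+C+D) = A+B+C
    a = xor(from3, from2)      # (A+C)+C = A
    b = xor(from4, from3)      # (A+B+C)+(A+C) = B
    traffic = len(from2) + len(from3) + len(from4)
    return (a, b), traffic


rebuilt, traffic = repair_node1(nodes)
file_size = 4 * CHUNK
print(traffic / file_size)  # fraction of the file moved during repair
```

Three chunks of size M/4 are transferred, which is the 0.75M repair traffic quoted on the slide; Node 4 must combine its two chunks before sending, which is the "encode chunks in storage nodes" step.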
MDS and Regenerating Code • MDS code: • High complexity • Uses random linear network coding • “Repair-by-transfer regenerating code”: • Lower complexity • Repair consists of adding two packets using bitwise exclusive OR (XOR)
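Regenerating codes • Theorem: It is possible to functionally repair an (n,k) MDS code by communicating $\frac{M}{k}\cdot\frac{n-1}{n-k}$ bits (connecting to all $n-1$ surviving nodes), as opposed to $M$ bits for naive repair • Further, there is a tradeoff between the storage per node and repair communication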
Minimum Distance • Distance d of a code: the minimum number of erasures after which data is lost. • Locality r: the number of blocks we have to access to reconstruct a single block. • Reed-Solomon (14,10) (n=14, k=10): d = n - k + 1 = 5 • Singleton (1964) showed a bound on the best distance possible: $d \le n - k + 1$ • Reed-Solomon codes achieve the Singleton bound (hence called MDS) • Easy lemma: any MDS code must have trivial locality r = k.
Definition 1 (Minimum Code Distance). The minimum distance d of a code of length n is equal to the minimum number of erasures of coded blocks after which the file cannot be retrieved. • Definition 2 (Block Locality). A (k, n-k) code has block locality r when each coded block is a function of at most r other coded blocks of the code. • Lemma 1. MDS codes with parameters (k, n-k) cannot have locality smaller than k.
Theorem 1. There exist (k, n-k, r) Locally Repairable Codes with logarithmic block locality r = log(k) and distance $d_{LRC} = n - (1 + \delta_k)k + 1$. Hence, any subset of $k(1 + \delta_k)$ coded blocks can be used to reconstruct the file, where $\delta_k = \frac{1}{\log(k)} - \frac{1}{k}$. • Corollary 1. For fixed code rate R = k/n, the distance of LRCs is asymptotically equal to that of (k, n-k)-MDS codes. • LRCs are constructed on top of MDS codes (and the most common choice will be a Reed-Solomon code).
Locality-distance tradeoff • Theorem: Any (n,k) code with locality r can have distance at most: $d \le n - k - \lceil k/r \rceil + 2$ • r = k (trivial locality) gives the Singleton bound. • Any non-trivial locality will hurt the fault tolerance of the storage system • Pyramid codes (Huang et al.) achieve this bound for message locality • (Gopalan, Huang, Simitci, Yekhanin) had shown this tradeoff for scalar linear codes. We generalize to arbitrary codes. • No explicit codes known for all-symbol locality. [Papailiopoulos, Dimakis, ‘Locally Repairable Codes’, ISIT 2012]
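The tradeoff can be made concrete with the standard form of the locality-distance bound, d ≤ n − k − ⌈k/r⌉ + 2, which reduces to the Singleton bound d ≤ n − k + 1 when r = k. The helper names below are made up for this sketch, and the values are upper bounds only; they say nothing about whether a particular code achieves them.

```python
def singleton_bound(n: int, k: int) -> int:
    """Largest possible distance of any (n, k) code."""
    return n - k + 1


def locality_bound(n: int, k: int, r: int) -> int:
    """Upper bound on the distance of an (n, k) code with locality r."""
    ceil_k_over_r = -(-k // r)  # integer ceiling of k / r
    return n - k - ceil_k_over_r + 2


# Trivial locality r = k recovers the Singleton bound, e.g. for (14, 10).
print(singleton_bound(14, 10), locality_bound(14, 10, 10))

# Smaller r lowers the bound for the same n and k.
for r in (10, 5, 2):
    print(r, locality_bound(16, 10, r))
```

For fixed n and k the bound shrinks as r decreases, which is the sense in which non-trivial locality costs fault tolerance.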
General locality result • Key idea: allow each block to be a little larger than M/k • For ε = 0, we recover the previous bound • For ε = 1/r, we can find explicit simple codes • Setting r = log n, we get near-MDS codes with logarithmic locality and a simple construction
Locally Repairable Codes (LRC) • A randomized and explicit family of codes that have logarithmic locality on all coded blocks and distance that is asymptotically equal to that of an MDS code. We call such codes (k, n - k, r) Locally Repairable Codes (LRC)
Locally Repairable Codes (LRC) • Figure 2: Locally repairable code implemented in HDFS-Xorbas. The four parity blocks P1, P2, P3, P4 are constructed with a standard RS code and the local parities provide efficient repair in the case of single block failures. • The main theoretical challenge is to choose the coefficients c_i to maximize the fault tolerance of the code.
Introduce “locality” to RS • A stripe of 5 file blocks X1, ..., X5 with coefficients C1, ..., C5 • A local parity block S1: X1C1+X2C2+X3C3+X4C4+X5C5 = S1
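How locally repairable codes work • Single block failure: for example, if block X3 is lost, it is rebuilt from the local parity S1 and the other four blocks of its group: X3 = C3^{-1}(S1 - X1C1 - X2C2 - X4C4 - X5C5)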
How locally repairable codes work • A stripe (10 file blocks X1, ..., X10) • RS code (4 parity blocks P1, ..., P4) • Local parities: X1C1+X2C2+X3C3+X4C4+X5C5 = S1 and X6C6+X7C7+X8C8+X9C9+X10C10 = S2
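A minimal sketch of local repair for the two data groups follows. It makes a simplifying assumption: all coefficients C1, ..., C10 are set to 1 and blocks are byte strings, so the field arithmetic reduces to bitwise XOR. The RS parities P1, ..., P4 are left out, and the function names are hypothetical.

```python
import os
from functools import reduce

BLOCK = 8  # bytes per block, tiny for illustration


def xor_blocks(blocks):
    """Bytewise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


data = [os.urandom(BLOCK) for _ in range(10)]  # X1..X10 (index 0 is X1)
s1 = xor_blocks(data[0:5])                     # local parity of X1..X5
s2 = xor_blocks(data[5:10])                    # local parity of X6..X10


def repair_data_block(lost, data, s1, s2):
    """Rebuild data[lost] from its group; returns (block, blocks read)."""
    group = range(0, 5) if lost < 5 else range(5, 10)
    parity = s1 if lost < 5 else s2
    helpers = [data[i] for i in group if i != lost] + [parity]
    return xor_blocks(helpers), len(helpers)


rebuilt, blocks_read = repair_data_block(2, data, s1, s2)  # X3 is lost
print(rebuilt == data[2], blocks_read)
```

Only 5 blocks are read to rebuild X3, whereas an MDS code with k = 10 has locality k and therefore needs at least 10.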
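How the parity blocks stay locally repairable 1. Set C5’ = C6’ = 1. 2. Choose the coefficients so that S1 + S2 + S3 = 0. 3. Define the local parity of the parity group as P1C1’+P2C2’+P3C3’+P4C4’ = S3. • Since S3 = -(S1 + S2), S3 is implied and need not be stored. • If P2 fails: P2 = (C2’)^{-1}(-S1 - S2 - P1C1’ - P3C3’ - P4C4’), which reads only 5 blocks (S1, S2, P1, P3, P4).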
Performance evaluation on Amazon EC2 and Facebook's cluster • HDFS Bytes Read: the total amount of data read by the jobs initiated for repair. • Network Traffic: the total amount of data sent from nodes in the cluster. • Repair Duration: the time interval between the starting time of the first repair job and the ending time of the last repair job.
Evaluation on Amazon EC2 • Setting • The metrics were measured during the 200-file experiment. • During the course of the experiment, we simulated 8 failure events. • A failure event consists of the termination of one or more DataNodes. • In our failure pattern, the first four failure events consisted of single DataNode terminations, the next two were terminations of triplets of DataNodes, and the last two were terminations of pairs of DataNodes. • In total, 3 experiments were performed on the above setup, successively increasing the number of files stored (50, 100, and 200 files). • Fig. 4 depicts the measurements from the last case (200 files).
Evaluation on Amazon EC2: HDFS Bytes Read • Each file is encoded as a single stripe of 14 blocks in HDFS-RS and 16 blocks in HDFS-Xorbas.
Evaluation on Amazon EC2: Network Traffic • Each file is encoded as a single stripe of 14 blocks in HDFS-RS and 16 blocks in HDFS-Xorbas.
Evaluation on Amazon EC2: Repair Duration • Each file is encoded as a single stripe of 14 blocks in HDFS-RS and 16 blocks in HDFS-Xorbas.
Evaluation on Facebook's clusters: Cluster network traffic • Since block repair depends only on blocks of the same stripe, using larger files that would yield more than one stripe would not affect our results. • An experiment involving arbitrary file sizes was run on Facebook’s clusters.