
Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure



Presentation Transcript


1. Workshop in Distributed Data & Structures * July 2004
Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure
Thomas J.E. Schwarz, TjSchwarz@scu.edu, http://www.cse.scu.edu/~tschwarz/homepage/thomas_schwarz.html
Rim Moussa, Rim.Moussa@dauphine.fr, http://ceria.dauphine.fr/rim/rim.html

2. Objective: Design, Implementation & Performance Measurements of LH*RS. Factors of interest: Parity Overhead, Recovery Performance.

3. Overview: Motivation; Highly-Available Schemes; LH*RS; Architectural Design; Hardware Testbed; File Creation; High Availability; Recovery; Conclusion; Future Work. (File creation, high availability and recovery are each covered by a scenario description and performance results.)

4. Motivation
• Information volume grows by about 30% per year
• Disk access and CPUs are a bottleneck
• Failures are frequent & costly (Source: Contingency Planning Research, 1996)

5. Requirements: we need Highly Available Networked Data Storage Systems offering Scalability, High Throughput and High Availability.

6. Scalable & Distributed Data Structure: dynamic file growth. [Diagram: an overloaded Data Bucket tells the Coordinator "I'm overloaded!"; the Coordinator answers "You split!"; inserts continue while records are transferred over the network to a new Data Bucket; Clients access the Data Buckets (DBs).]

7. SDDS (Cont'd): no centralized directory access. [Diagram: a Client sends a query over the network; an outdated client image may cause the query to be forwarded between Data Buckets (DBs), after which the Client receives an Image Adjustment Message.]

8. Solutions towards High Availability
• Data Replication: (+) good response time since mirrors are queried; (-) high storage cost (n times the data for n replicas).
• Parity Calculus: erasure-resilient codes are evaluated with regard to coding rate (parity volume / data volume), update penalty, group size used for data reconstruction, and complexity of coding & decoding.

9. Fault-Tolerant Schemes
• 1 server failure: simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96].
• More than 1 server failure: binary linear codes [Hellerstein et al., 94]; array codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04] → tolerate just 2 failures; Reed-Solomon codes: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00] → tolerate a large number of failures.
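For the 1-unavailability case above, here is a minimal sketch of the simple XOR parity calculus. Bucket contents are modeled as equal-length byte strings; all names are illustrative and not taken from the LH*g or LH*RS code.

```python
# Minimal sketch: XOR parity over a group of k data buckets (tolerates 1 failure).

def xor_parity(buckets):
    """Compute the XOR parity of a list of equal-length byte strings."""
    parity = bytearray(len(buckets[0]))
    for b in buckets:
        for i, byte in enumerate(b):
            parity[i] ^= byte
    return bytes(parity)

def recover_one(surviving_buckets, parity):
    """Recover the single missing bucket from the k-1 survivors and the parity."""
    return xor_parity(list(surviving_buckets) + [parity])

if __name__ == "__main__":
    group = [b"recA", b"recB", b"recC", b"recD"]               # k = 4 data buckets
    p = xor_parity(group)                                      # parity bucket
    assert recover_one(group[:2] + group[3:], p) == group[2]   # bucket 2 lost, then recovered
```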

  10. A Highly Available & Distributed Data Structure: LH*RS [Litwin & Schwarz, 00] [Litwin, Moussa & Schwarz, sub.]

11. LH*RS is an SDDS: Scalability, High Throughput, High Availability. Data distribution scheme based on Linear Hashing: LH*LH [Karlsson et al., 96], applied to the key field. Parity calculus: Reed-Solomon codes [Reed & Solomon, 63].
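A minimal sketch of linear-hashing addressing as an LH* client image could apply it to the key field; it assumes a single initial bucket and uses illustrative names, not the actual LH*LH code.

```python
# Linear-hashing addressing from a client image (i = file level, n = split pointer).
# Assumes one initial bucket; the image may be outdated.

def lh_address(key: int, i: int, n: int) -> int:
    """Map a key to a bucket number under linear-hashing state (i, n)."""
    a = key % (2 ** i)
    if a < n:                       # bucket a has already split at this level
        a = key % (2 ** (i + 1))    # rehash with the next-level function
    return a

# Example: level i = 3, split pointer n = 2, so buckets 0..9 exist.
print(lh_address(26, 3, 2))   # 26 mod 8 = 2 >= n: stays in bucket 2
print(lh_address(25, 3, 2))   # 25 mod 8 = 1 < n: rehashed, 25 mod 16 = 9
```

If the client's image (i, n) is outdated, the addressed bucket forwards the query and the client later receives an Image Adjustment Message, as on slide 7.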

12. LH*RS File Structure. [Diagram: an inserted record gets rank r in its Data Bucket; Data Bucket records hold (Key, Data Field); the group's Parity Buckets hold, for each rank r, (Rank, [Key List], Parity Field).]
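A hypothetical sketch of the two record layouts implied by slide 12; only the Key/Data and Rank/[Key List]/Parity structure comes from the slide, and the field names and types are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataRecord:          # stored in a Data Bucket, addressed by key
    key: int
    data: bytes            # non-key data field

@dataclass
class ParityRecord:        # stored in a Parity Bucket, addressed by rank
    rank: int                                        # insert rank r within the group
    keys: List[int] = field(default_factory=list)    # keys of the group's records of rank r
    parity: bytes = b""                              # parity field computed over those records
```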

  13. Architectural Design of LH*RS

14. Communication
• UDP, for speed: individual Insert/Update/Delete/Search queries, record recovery, service and control messages.
• TCP/IP, for better performance & reliability than UDP on large transfers: new PB creation, large update transfer (DB split), bucket recovery.
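A sketch of this communication split using plain sockets; the endpoint and message formats are placeholders, not the LH*RS wire protocol.

```python
import socket

DB_ADDR = ("192.168.0.10", 5000)   # hypothetical data-bucket endpoint

def send_query_udp(payload: bytes) -> None:
    """Individual insert/update/delete/search: one small datagram, no connection."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, DB_ADDR)

def send_bulk_tcp(buffer: bytes) -> None:
    """Large transfer (DB split buffer, bucket recovery): reliable TCP stream."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect(DB_ADDR)
        s.sendall(buffer)
```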

15. Bucket Architecture. [Diagram: each bucket has a TCP/IP port, send and receive UDP ports, and a multicast listening port; a TCP listening thread, a UDP listening thread and a multicast listening thread feed message queues; n working threads and a multicast working thread do the message processing; an acknowledgement-management thread maintains the sending credit and a window of not-yet-acknowledged messages (free zones, messages waiting for ack).]
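A simplified sketch of the listening-thread / message-queue / working-thread pattern in the diagram; thread counts, names and the processing stub are assumptions, and the acknowledgement management is omitted here (see the next slide).

```python
import queue
import threading

message_queue: "queue.Queue[bytes]" = queue.Queue()

def process(msg: bytes) -> None:
    """Placeholder for the real message processing (insert, search, split, ...)."""
    pass

def udp_listening_thread(sock) -> None:
    """Receives datagrams and hands them to the message queue."""
    while True:
        msg, _addr = sock.recvfrom(64 * 1024)
        message_queue.put(msg)

def working_thread(worker_id: int) -> None:
    """Drains the message queue: the '-Message processing-' box in the diagram."""
    while True:
        msg = message_queue.get()
        process(msg)
        message_queue.task_done()

def start_workers(n: int) -> None:
    for i in range(n):
        threading.Thread(target=working_thread, args=(i,), daemon=True).start()
```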

16. Architectural Design: Enhancements to SDDS2000 [B00, D01]
• Bucket architecture.
• TCP/IP connection handler: TCP/IP connections use passive OPEN, RFC 793 [ISI, 81]; TCP/IP implementation under the Win2K Server OS [McDonal & Barkley, 00]. Example, recovery of 1 DB: 6.7 s with the SDDS 2000 architecture vs. 2.6 s with the new architecture, a 60% improvement (hardware config.: 733 MHz machines, 100 Mbps network).
• Flow control and acknowledgement management: principle of "sending credit + message conservation until delivery" [Jacobson, 88] [Diène, 01].
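A sketch of the "sending credit + message conservation until delivery" principle: at most `credit` messages may be outstanding, and every sent message is kept until its acknowledgement arrives so it can be retransmitted. The class and its methods are illustrative, not the SDDS2000 implementation.

```python
class CreditSender:
    def __init__(self, send_fn, credit: int = 5):
        self.send_fn = send_fn          # e.g. a UDP send
        self.credit = credit            # max number of unacknowledged messages
        self.window = {}                # msg_id -> message waiting for ack

    def try_send(self, msg_id: int, msg: bytes) -> bool:
        if len(self.window) >= self.credit:
            return False                # no credit left: caller must wait for acks
        self.window[msg_id] = msg       # conserve the message until delivery
        self.send_fn(msg_id, msg)
        return True

    def on_ack(self, msg_id: int) -> None:
        self.window.pop(msg_id, None)   # delivery confirmed: free the slot

    def on_timeout(self, msg_id: int) -> None:
        if msg_id in self.window:       # not acked yet: retransmit the kept copy
            self.send_fn(msg_id, self.window[msg_id])
```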

17. Architectural Design (Cont'd): Coordinator multicast component, with a Blank PBs multicast group and a Blank DBs multicast group. Before: a pre-defined & static table of the IP addresses of the DBs and PBs. Now: a dynamic IP@ structure, updated when adding new/spare buckets (PBs/DBs) through a multicast probe.
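A sketch of how a blank (spare) bucket might sit on such a multicast group waiting for a coordinator probe; the group address and port are placeholders, not the actual configuration.

```python
import socket
import struct

BLANK_PB_GROUP, PORT = "239.1.1.1", 6000    # hypothetical "Blank PBs" multicast group

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(BLANK_PB_GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)   # join the group

probe, coordinator = sock.recvfrom(1024)    # blocks until e.g. a "Wanna join group g?" probe
# ... reply to the coordinator over UDP, then leave the group once selected:
sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, mreq)
```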

18. Hardware Testbed
• 5 machines (Pentium IV, 1.8 GHz; RAM: 512 MB)
• Ethernet network: max bandwidth of 1 Gbps
• Operating system: Windows 2K Server
• Tested configuration: 1 Client, a group of 4 Data Buckets, k Parity Buckets, k ∈ {0, 1, 2}

  19. LH*RS File Creation

20. File Creation
• Client operation: propagation of each Insert/Update/Delete on a data record to the Parity Buckets.
• Data Bucket split:
  - Splitting DB → its PBs: for the records that remain, N deletes (from the old rank) & N inserts (at the new rank); for the records that move, N deletes.
  - New DB → its PBs: N inserts (the moved records).
All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective Parity Buckets of the splitting DB & the new DB, as sketched below.
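A sketch of how the split updates could be batched per target, assuming each parity update is an (operation, rank, key) triple; the actual buffer format is not specified in the slides.

```python
def build_split_buffers(remaining, moved):
    """remaining: [(key, old_rank, new_rank)] for records kept by the splitting DB;
    moved: [(key, old_rank, new_rank_in_new_db)] for records shipped to the new DB."""
    old_db_buffer, new_db_buffer = [], []
    for key, old_rank, new_rank in remaining:        # records that stay: delete + re-insert
        old_db_buffer.append(("delete", old_rank, key))
        old_db_buffer.append(("insert", new_rank, key))
    for key, old_rank, new_rank in moved:            # records that move to the new DB
        old_db_buffer.append(("delete", old_rank, key))
        new_db_buffer.append(("insert", new_rank, key))
    # Each buffer is then sent in one TCP transfer to that DB's parity buckets.
    return old_db_buffer, new_db_buffer
```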

21. File Creation Perf. Experimental set-up: file of 25,000 data records; 1 data record = 104 B. [Charts: client sending credit = 1 and client sending credit = 5.] PB overhead: from k = 0 to k = 1, a performance degradation of 20%; from k = 1 to k = 2, a degradation of 8%.

22. File Creation Perf. Experimental set-up: file of 25,000 data records; 1 data record = 104 B. [Charts: client sending credit = 1 and client sending credit = 5.] PB overhead: from k = 0 to k = 1, a performance degradation of 37%; from k = 1 to k = 2, a degradation of 10%.

  23. LH*RS Parity Bucket Creation

24. PB Creation Scenario, searching for a new PB: the Coordinator multicasts "Wanna join group g?" [Sender IP@ + Entity#, Your Entity#] to the PBs connected to the Blank PBs multicast group.

25. PB Creation Scenario, waiting for replies: candidate PBs answer "I would"; each one starts UDP listening, TCP listening and its working threads, then waits for confirmation; if the time-out elapses, everything is cancelled. (The candidates are the PBs connected to the Blank PBs multicast group.)

26. PB Creation Scenario, PB selection: the Coordinator sends "You are selected" <UDP> to the chosen PB, which disconnects from the Blank PBs multicast group; the other candidates receive a cancellation.

27. PB Creation Scenario, auto-creation, query phase: the new PB sends "Send me your contents!" <UDP> to each Data Bucket of its group.

28. PB Creation Scenario, auto-creation, encoding phase: each Data Bucket of the group sends the requested buffer <TCP> to the new PB, which processes the buffers (encoding).
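A toy sketch of such an encoding step as a linear combination over GF(2^8); the field, the coefficients and the buffer layout are illustrative only and are not the actual LH*RS Reed-Solomon generator matrix.

```python
# Build GF(2^8) log/antilog tables (primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 = 0x11D).
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements via the log/antilog tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def encode_parity(data_buffers, coeffs):
    """Parity byte j = sum_i coeffs[i] * data_buffers[i][j], the sum being XOR."""
    length = len(data_buffers[0])
    parity = bytearray(length)
    for coeff, buf in zip(coeffs, data_buffers):
        for j in range(length):
            parity[j] ^= gf_mul(coeff, buf[j])
    return bytes(parity)

# One parity column over a group of 4 data-bucket buffers, with illustrative coefficients:
print(encode_parity([b"aaaa", b"bbbb", b"cccc", b"dddd"], [1, 2, 4, 8]).hex())
```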

29. PB Creation Perf. Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 × bucket size; file size = 2.5 × bucket size records. XOR encoding vs. RS encoding comparison. [Chart: encoding rate vs. bucket size, 0.608 / 0.686 / 0.640 / 0.659 MB/sec; processing time ≈ 74% of total time.]

30. PB Creation Perf. Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 × bucket size; file size = 2.5 × bucket size records. XOR encoding vs. RS encoding comparison. [Chart: encoding rate vs. bucket size, 0.618 / 0.713 / 0.674 / 0.673 MB/sec; processing time ≈ 74% of total time.]

31. PB Creation Perf., XOR encoding vs. RS encoding comparison, for bucket size = 50,000: XOR encoding rate 0.66 MB/sec, RS encoding rate 0.673 MB/sec. XOR provides a performance gain of 5% in processing time (0.02% in the total time).

  32. LH*RS Bucket Recovery

33. Buckets' Recovery, failure detection: the Coordinator sends "Are you alive?" <UDP> to the Parity Buckets and Data Buckets (some of which have failed).

34. Buckets' Recovery, waiting for replies: the alive buckets answer "I am alive" <UDP>; the failed Parity/Data Buckets stay silent.

35. Buckets' Recovery, searching for 2 spare DBs: the Coordinator multicasts "Wanna be a spare DB?" [Sender IP@, Your Entity#] to the DBs connected to the Blank DBs multicast group.

36. Buckets' Recovery, waiting for replies: candidate DBs answer "I would"; each one starts UDP listening, TCP listening and its working threads, then waits for confirmation; if the time-out elapses, everything is cancelled. (The candidates are the DBs connected to the Blank DBs multicast group.)

37. Buckets' Recovery, spare DBs selection: the Coordinator sends "You are selected" <UDP> to the chosen DBs, which disconnect from the Blank DBs multicast group; the other candidates receive a cancellation.

38. Buckets' Recovery, recovery manager determination: the Coordinator sends "Recover Buckets" [spares' IP@s + Entity#s; …] to a Parity Bucket, which becomes the Recovery Manager.

39. Buckets' Recovery, query phase: the Recovery Manager asks the alive buckets participating in the recovery (Parity Buckets and Data Buckets), "Send me records of rank in [r, r+slice-1]" <UDP>; the recovered buckets will be rebuilt on the spare DBs.

40. Buckets' Recovery, reconstruction phase: the alive buckets send the requested buffers <TCP> to the Recovery Manager, which runs the decoding process and sends the recovered records <TCP> to the spare DBs.
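A sketch of slice-by-slice reconstruction for the single-failure (XOR) case; `get_records` and the transport are placeholders, and the real decoder falls back to Reed-Solomon decoding when several buckets are lost.

```python
def recover_slices(survivors, bucket_size, slice_size):
    """survivors: objects with get_records(lo, hi) -> list of equal-length byte strings,
    one per rank in [lo, hi). Returns the records of the single missing bucket."""
    recovered = []
    for r in range(0, bucket_size, slice_size):
        hi = min(r + slice_size, bucket_size)
        columns = [s.get_records(r, hi) for s in survivors]   # query phase, one slice
        for rank_records in zip(*columns):                    # one rank at a time
            rec = bytearray(len(rank_records[0]))
            for other in rank_records:                        # XOR of all surviving records
                for i, byte in enumerate(other):
                    rec[i] ^= byte
            recovered.append(bytes(rec))                      # shipped to the spare DB
    return recovered
```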

41. DBs Recovery Perf. Experimental set-up: file of 125,000 records; bucket of 31,250 records ≈ 3.125 MB. XOR decoding vs. RS decoding comparison. [Chart (XOR): total time ≈ 0.72 s; varying the slice from 4% to 100% of the bucket contents hardly changes the total time.]

42. DBs Recovery Perf. Experimental set-up: file of 125,000 records; bucket of 31,250 records ≈ 3.125 MB. XOR decoding vs. RS decoding comparison. [Chart (RS): total time ≈ 0.85 s; varying the slice from 4% to 100% of the bucket contents hardly changes the total time.]

43. DBs Recovery Perf., XOR decoding vs. RS decoding comparison (file: 125,000 recs; bucket: 31,250 recs ≈ 3.125 MB): 1-DB recovery time with XOR: 0.720 sec; with RS: 0.855 sec. XOR provides a performance gain of 15% in total time.

44. DBs Recovery Perf., recover 2 DBs vs. 3 DBs (file: 125,000 recs; bucket: 31,250 recs ≈ 3.125 MB). [Chart (2 DBs): total time ≈ 1.2 s; varying the slice from 4% to 100% of the bucket contents hardly changes the total time.]

45. DBs Recovery Perf., recover 2 DBs vs. 3 DBs (file: 125,000 recs; bucket: 31,250 recs ≈ 3.125 MB). [Chart (3 DBs): total time ≈ 1.6 s; varying the slice from 4% to 100% of the bucket contents hardly changes the total time.]

46. Perf. Summary of Bucket Recovery
• 1 DB (3.125 MB) in 0.7 sec (XOR) → 4.46 MB/sec
• 1 DB (3.125 MB) in 0.85 sec (RS) → 3.65 MB/sec
• 2 DBs (6.250 MB) in 1.2 sec (RS) → 5.21 MB/sec
• 3 DBs (9.375 MB) in 1.6 sec (RS) → 5.86 MB/sec
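These rates are simply the recovered volume divided by the recovery time (using the more precise 0.855 s RS figure from slide 43):

```python
# Recovered volume / total recovery time, reproducing the MB/sec figures above.
for mb, sec in [(3.125, 0.7), (3.125, 0.855), (6.25, 1.2), (9.375, 1.6)]:
    print(f"{mb:6.3f} MB / {sec} s = {mb / sec:.2f} MB/sec")   # 4.46, 3.65, 5.21, 5.86
```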

47. Conclusion: the conducted experiments show that the encoding/decoding optimizations and the enhanced bucket architecture have a clear impact on performance, and that recovery performance is good. Finally, we improved the processing time of the RS decoding process by 4% to 8%; 1 DB is recovered in half a second.

48. Conclusion: LH*RS is a mature implementation (many optimization iterations); it is the only SDDS with scalable availability.

49. Future Work: a better parity-update propagation strategy to the PBs; investigation of faster encoding/decoding processes.

50. References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of ACM SIGMOD Conf., pp. 109-116, June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP), Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
[McDonal & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.
