Workshop in Distributed Data & Structures * July 2004
Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure
Thomas J.E. Schwarz TjSchwarz@scu.edu http://www.cse.scu.edu/~tschwarz/homepage/thomas_schwarz.html
Rim Moussa Rim.Moussa@dauphine.fr http://ceria.dauphine.fr/rim/rim.html
Objective
• Design, Implementation & Performance Measurements of LH*RS
• Factors of interest: Parity Overhead, Recovery Performance
Overview
• Motivation
• Highly-available schemes
• LH*RS
• Architectural Design, Hardware Testbed
• File Creation
• High Availability
• Recovery
• Conclusion, Future Work
(each experimental part with a scenario description and performance results)
Motivation
• Information volume grows by 30% per year
• Disk access and CPUs are bottlenecks
• Failures are frequent & costly (source: Contingency Planning Research, 1996)
Requirements
Needed: highly-available networked data storage systems offering
• Scalability
• High Throughput
• High Availability
Scalable & Distributed Data Structure
Dynamic file growth: when a Data Bucket reports "I'm overloaded!", the Coordinator replies "You split!"; inserts continue while records are transferred to a new bucket over the network.
[Diagram: Coordinator, Clients, and Data Buckets (DBs) connected by the network; an overloaded DB splits and transfers records.]
SDDS (Cont'd)
No centralized directory access: a Client sends its query straight to a Data Bucket; if the client's file image is outdated, the query is forwarded to the correct bucket and the client receives an Image Adjustment Message.
[Diagram: Client, network, and Data Buckets (DBs), showing a query, a forwarded query, and the image adjustment message.]
Solutions towards High Availability
• Data replication: (+) good response time, since mirrors are queried; (-) high storage cost (n times the data for n replicas)
• Parity calculus: erasure-resilient codes, evaluated with regard to:
  • Coding rate (parity volume / data volume)
  • Update penalty
  • Group size used for data reconstruction
  • Complexity of coding & decoding
Fault-Tolerant Schemes
• 1 server failure: simple XOR parity calculus, as in RAID systems [Patterson et al., 88] and the SDDS LH*g [Litwin et al., 96]
• More than 1 server failure:
  • Binary linear codes [Hellerstein et al., 94]
  • Array codes, which tolerate just 2 failures: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], the RDP scheme [Corbett et al., 04]
  • Reed-Solomon codes, which tolerate a large number of failures: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], Tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00]
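As an aside, a minimal sketch of the simple XOR parity calculus behind the 1-failure schemes above; the 4-bucket group and record contents are illustrative, not taken from LH*g or RAID:

```python
# Sketch: one XOR parity bucket per group tolerates any single failure.

def xor_parity(buckets: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length buckets."""
    parity = bytearray(len(buckets[0]))
    for bucket in buckets:
        for i, byte in enumerate(bucket):
            parity[i] ^= byte
    return bytes(parity)

def recover_missing(survivors: list[bytes], parity: bytes) -> bytes:
    """The missing bucket is the XOR of the parity and all survivors."""
    return xor_parity(survivors + [parity])

group = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # a group of 4 data buckets
parity = xor_parity(group)
assert recover_missing(group[1:], parity) == group[0]   # lose bucket 0
```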
A Highly Available & Distributed Data Structure: LH*RS [Litwin & Schwarz, 00] [Litwin, Moussa & Schwarz, sub.]
LH*RS: a Scalable, High-Throughput, Highly-Available SDDS
• Data distribution scheme based on Linear Hashing, LH*LH [Karlsson et al., 96], applied to the key field
• Parity calculus with Reed-Solomon codes [Reed & Solomon, 63]
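For illustration, a sketch of the classic Linear Hashing address computation that LH*LH applies to the record key; the file-state pair (level, split pointer) is what a client keeps in its image, and the names here are illustrative:

```python
# Sketch of the Linear Hashing address computation that LH*LH applies to
# the record key (file starting from 1 bucket); names are illustrative.

def lh_address(key: int, level: int, split_pointer: int) -> int:
    a = key % (2 ** level)                 # h_level(key)
    if a < split_pointer:                  # that bucket has already split
        a = key % (2 ** (level + 1))       # use h_{level+1}(key) instead
    return a

# Example: level 2, split pointer 1, so buckets 0..4 exist.
assert lh_address(8, 2, 1) == 0            # 8 mod 4 = 0 < 1, so 8 mod 8 = 0
assert lh_address(4, 2, 1) == 4            # 4 mod 4 = 0 < 1, so 4 mod 8 = 4
assert lh_address(6, 2, 1) == 2            # 6 mod 4 = 2, not below pointer
```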
LH*RS File Structure
• Data Buckets: records store (Key, Data Field); each insert is assigned a rank r (0, 1, 2, …) within its bucket.
• Parity Buckets: records store (Rank, [Key List], Parity Field), one per rank across the bucket group.
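A minimal sketch of the two record layouts, as in-memory Python dataclasses; the field names are illustrative, not taken from the LH*RS implementation:

```python
# Sketch of the two record layouts; field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class DataRecord:                # stored in a Data Bucket
    key: int
    data: bytes                  # the non-key data field

@dataclass
class ParityRecord:              # stored in a Parity Bucket
    rank: int                    # insert rank r within the bucket group
    keys: list = field(default_factory=list)  # keys of data records at rank r
    parity: bytes = b""          # parity field over those records' data fields
```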
Communication
• UDP, for speed:
  • Individual insert/update/delete/search queries
  • Record recovery
  • Service and control messages
• TCP/IP, for better performance & reliability than UDP on bulk transfers:
  • New PB creation
  • Large update transfer (DB split)
  • Bucket recovery
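A loopback sketch of the UDP side, using only the standard socket API; the port choice, message format, and ACK reply are illustrative, not the LH*RS wire protocol:

```python
# Loopback sketch: short queries as single UDP datagrams.
import socket, threading

srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))                 # a bucket's UDP listening port
port = srv.getsockname()[1]

def serve_one():
    query, client = srv.recvfrom(2048)     # one individual query per datagram
    srv.sendto(b"ACK " + query, client)    # short reply, no connection set-up

threading.Thread(target=serve_one, daemon=True).start()

cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.sendto(b"INSERT key=42", ("127.0.0.1", port))
print(cli.recvfrom(2048)[0])               # b'ACK INSERT key=42'
# Bulk transfers (new PB creation, split buffers, bucket recovery) would
# instead connect to the bucket's TCP listening port.
```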
Bucket Architecture
[Diagram: each bucket runs a TCP listening thread, a UDP listening thread, and a multicast listening thread; incoming messages are placed in message queues and processed by a pool of working threads; an acknowledgement-management thread tracks the sending credit and keeps not-yet-acknowledged messages in a window until delivery.]
Architectural Design
Enhancements to SDDS2000 [B00, D01]:
• Bucket architecture (above)
• TCP/IP connection handler: connections use a passive OPEN (RFC 793 [ISI, 81]; TCP/IP implementation under the Windows 2000 Server OS [MacDonald & Barkley, 00])
• Flow control & acknowledgement management, on the principle of "sending credit + message conservation until delivery" [Jacobson, 88] [Diène, 01]; a sketch follows
Example, recovery of 1 DB: 6.7 s with the SDDS-2000 architecture vs. 2.6 s with the new one, a 60% improvement (hardware: 733 MHz machines, 100 Mbps network).
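A minimal sketch of the "sending credit + message conservation until delivery" principle; names are illustrative, and a real implementation would also run a retransmission timer per message:

```python
# Sketch: at most `credit` messages in flight, each conserved until acked.

class CreditSender:
    def __init__(self, credit, send):
        self.credit = credit          # max messages in flight (the credit)
        self.send = send              # transport callback: send(msg_id, msg)
        self.in_flight = {}           # conserved until acknowledged
        self.queue = []               # waiting for a free credit
        self.next_id = 0

    def submit(self, msg):
        self.queue.append((self.next_id, msg))
        self.next_id += 1
        self._pump()

    def ack(self, msg_id):
        self.in_flight.pop(msg_id, None)   # delivered: safe to discard
        self._pump()                       # one credit was freed

    def retransmit(self):                  # on time-out
        for msg_id, msg in self.in_flight.items():
            self.send(msg_id, msg)

    def _pump(self):
        while self.queue and len(self.in_flight) < self.credit:
            msg_id, msg = self.queue.pop(0)
            self.in_flight[msg_id] = msg
            self.send(msg_id, msg)

sent = []
s = CreditSender(credit=2, send=lambda i, m: sent.append(i))
for m in (b"a", b"b", b"c"):
    s.submit(m)                            # b"c" must wait for credit
s.ack(0)                                   # frees a credit; b"c" goes out
assert sent == [0, 1, 2]
```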
Architectural Design (Cont'd)
Coordinator multicast component:
• Before: a pre-defined, static table of the IP addresses of DBs and PBs
• Now: blank-PBs and blank-DBs multicast groups, with a dynamic IP-address structure updated when new/spare buckets (PBs/DBs) are added, via a multicast probe
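A sketch of how a blank bucket could join its multicast group and wait for the coordinator's probe; the group address, port, and reply text are illustrative, and only standard socket options are used:

```python
# Sketch: join a multicast group and block until a probe datagram arrives.
import socket, struct

GROUP, PORT = "239.1.1.1", 4200            # illustrative group address/port
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

probe, coordinator = s.recvfrom(1024)      # the coordinator's multicast probe
s.sendto(b"I would", coordinator)          # volunteer, then await selection
```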
Hardware Testbed
• 5 machines (Pentium IV 1.8 GHz, 512 MB RAM)
• Ethernet network, max bandwidth 1 Gbps
• Operating system: Windows 2000 Server
• Tested configuration: 1 client, a group of 4 Data Buckets, k Parity Buckets with k ∈ {0, 1, 2}
LH*RS File Creation
File Creation
• Client operation: each insert/update/delete of a data record is propagated to the Parity Buckets.
• Data Bucket split:
  • PBs of the splitting DB: for the records that remain, N deletes from the old ranks & N inserts at the new ranks; for the records that move, N deletes
  • PBs of the new DB: N inserts (the moved records)
• All updates are gathered in one buffer per destination and transferred (TCP/IP) simultaneously to the respective Parity Buckets of the splitting DB & the new DB, as sketched below.
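A sketch of the buffers a split could generate for the parity buckets, following the slide's accounting of deletes and inserts; the function, callback, and field names are illustrative:

```python
# Sketch of the parity updates a Data Bucket split generates.

def split_parity_updates(records, stays):
    """records: list of (old_rank, record); stays(rec) is True if rec remains."""
    old_pb_buf, new_pb_buf = [], []        # one buffer per PB group
    rank_old, rank_new = 0, 0              # fresh ranks after the split
    for old_rank, rec in records:
        if stays(rec):                     # remaining record: re-ranked
            old_pb_buf.append(("DELETE", old_rank, rec.key))
            old_pb_buf.append(("INSERT", rank_old, rec))
            rank_old += 1
        else:                              # record moves to the new DB
            old_pb_buf.append(("DELETE", old_rank, rec.key))
            new_pb_buf.append(("INSERT", rank_new, rec))
            rank_new += 1
    # Each buffer is then sent in one TCP transfer to the respective PBs.
    return old_pb_buf, new_pb_buf
```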
File Creation Performance
Experimental set-up: file of 25,000 data records; 1 data record = 104 B; client sending credit = 1 and 5.
[Chart: file creation measurements for both sending credits.]
PB overhead: from k = 0 to k = 1, performance degradation of 20%; from k = 1 to k = 2, of 8%.
File Creation Performance (Cont'd)
Same set-up: file of 25,000 data records; 1 data record = 104 B; client sending credit = 1 and 5.
[Chart: file creation measurements for both sending credits.]
PB overhead: from k = 0 to k = 1, performance degradation of 37%; from k = 1 to k = 2, of 10%.
LH*RS Parity Bucket Creation
PB Creation Scenario: Searching for a New PB
The Coordinator multicasts "Wanna join group g?" [sender IP@ + entity #, your entity #] to the PBs connected to the blank-PBs multicast group.
PB Creation Scenario: Waiting for Replies
Candidate PBs answer "I would"; each starts its UDP listening, TCP listening, and working threads, then waits for confirmation; if the time-out elapses, all are cancelled.
PB Creation Scenario: Selection & Cancellation
The Coordinator notifies the chosen bucket "You are selected" (UDP); it disconnects from the blank-PBs multicast group, while the remaining candidates receive a cancellation.
PB Creation Scenario: Auto-Creation, Query Phase
The new PB asks every bucket of its Data Bucket group: "Send me your contents!" (UDP).
PB Creation Scenario: Auto-Creation, Encoding Phase
Each Data Bucket sends the requested buffer (TCP); the new PB processes the buffers and encodes its parity records. A sketch of the selection protocol follows.
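A sketch of the coordinator's side of the selection dialogue; the message shapes and the callbacks standing in for the multicast/UDP sends are illustrative:

```python
# Sketch of the coordinator's PB-selection step.

def select_new_pb(multicast_send, collect_replies, unicast_send, group, timeout):
    multicast_send({"msg": "JOIN?", "group": group})   # "Wanna join group g?"
    volunteers = collect_replies(timeout)              # the "I would" answers
    if not volunteers:
        return None                                    # time-out: nobody joins
    chosen, losers = volunteers[0], volunteers[1:]
    unicast_send(chosen, {"msg": "SELECTED", "group": group})
    for pb in losers:
        unicast_send(pb, {"msg": "CANCEL"})            # cancel the rest
    return chosen                                      # it now queries the DBs
```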
PB Creation Performance: XOR Encoding
Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 × bucket size; file size = 2.5 × bucket size records.
[Chart: XOR encoding rate vs. bucket size: 0.608, 0.686, 0.640, 0.659 MB/sec; processing time ≈ 74% of total time.]
PB Creation Performance: RS Encoding
Same set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 × bucket size; file size = 2.5 × bucket size records.
[Chart: RS encoding rate vs. bucket size: 0.618, 0.713, 0.674, 0.673 MB/sec; processing time ≈ 74% of total time.]
PB Creation Performance: XOR vs. RS Encoding
For bucket size = 50,000: XOR encoding rate 0.66 MB/sec, RS encoding rate 0.673 MB/sec. XOR provides a performance gain of 5% in processing time (0.02% in total time).
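To make the comparison concrete, a sketch of both calculi; LH*RS computes Reed-Solomon parity over a Galois field, but the field GF(2^8), its primitive polynomial 0x11D, and the coefficients below are illustrative. With all-1 coefficients the RS parity degenerates to plain XOR, which is what the comparison above exploits:

```python
# Sketch: RS parity over GF(2^8); all-1 coefficients give plain XOR.

EXP, LOG = [0] * 512, [0] * 256            # log/antilog tables, generator 2
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D                         # reduce modulo the primitive poly
for i in range(255, 512):
    EXP[i] = EXP[i - 255]                  # wrap, so gf_mul needs no modulo

def gf_mul(a: int, b: int) -> int:
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def rs_parity_byte(data: list[int], coeffs: list[int]) -> int:
    """One parity byte: XOR-sum of coeff * data over GF(2^8)."""
    p = 0
    for d, c in zip(data, coeffs):
        p ^= gf_mul(d, c)
    return p

group = [0x0A, 0x42, 0x77, 0x10]           # one byte from each of 4 DBs
assert rs_parity_byte(group, [1, 1, 1, 1]) == 0x0A ^ 0x42 ^ 0x77 ^ 0x10
```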
LH*RS Bucket Recovery
Buckets' Recovery: Failure Detection
The Coordinator probes the Parity Buckets and Data Buckets: "Are you alive?" (UDP).
Buckets' Recovery: Waiting for Replies
The surviving buckets answer "I am alive!" (UDP).
Buckets' Recovery: Searching for Spare DBs
The Coordinator multicasts "Wanna be a spare DB?" [sender IP@, your entity #] to the DBs connected to the blank-DBs multicast group (here, searching for 2 spare DBs).
Buckets' Recovery: Waiting for Replies
Candidate DBs answer "I would"; each starts its UDP listening, TCP listening, and working threads, then waits for confirmation; if the time-out elapses, all are cancelled.
Buckets' Recovery: Spare DBs Selection
The selected DBs are notified "You are selected" (UDP) and disconnect from the blank-DBs multicast group; the remaining candidates receive a cancellation.
Buckets' Recovery: Recovery Manager Determination
The Coordinator appoints one of the Parity Buckets as recovery manager: "Recover buckets" [spares' IP@s + entity #s; …].
Buckets' Recovery: Query Phase
The recovery manager asks the alive buckets (data and parity) participating in the recovery: "Send me records of rank in [r, r+slice-1]" (UDP).
Buckets' Recovery: Reconstruction Phase
The alive buckets return the requested buffers (TCP); the recovery manager runs the decoding process and sends the recovered records (TCP) to the spare DBs, as sketched below.
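A sketch of the slice-by-slice loop for the XOR case with a single failed DB; `fetch` and `send_to_spare` are illustrative stand-ins for the UDP query and the TCP transfer, and records are assumed equal-length byte strings:

```python
# Sketch of the recovery manager's slice-by-slice reconstruction loop.

def recover_bucket(alive_dbs, parity_bucket, bucket_size, slice_len, send_to_spare):
    r = 0
    while r < bucket_size:
        hi = min(r + slice_len, bucket_size)
        # Query phase: records of rank in [r, hi-1] from every alive bucket.
        slices = [db.fetch(r, hi) for db in alive_dbs]
        slices.append(parity_bucket.fetch(r, hi))
        # Decoding: at each rank the missing record is the XOR of the rest.
        recovered = []
        for records_at_rank in zip(*slices):
            data = bytearray(len(records_at_rank[0]))
            for rec in records_at_rank:
                for i, byte in enumerate(rec):
                    data[i] ^= byte
            recovered.append(bytes(data))
        send_to_spare(r, recovered)        # reconstruction: one TCP buffer
        r = hi
```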
DBs Recovery Performance: XOR Decoding
Experimental set-up: file of 125,000 records; bucket of 31,250 records (3.125 MB).
[Chart: recovery time vs. slice size, from 4% to 100% of bucket contents; the total time stays around 0.72 sec and does not vary much.]
DBs Recovery Performance: RS Decoding
Same set-up: file of 125,000 records; bucket of 31,250 records (3.125 MB).
[Chart: recovery time vs. slice size, from 4% to 100% of bucket contents; the total time stays around 0.85 sec and does not vary much.]
DBs Recovery Performance: XOR vs. RS Decoding
1-DB recovery time: 0.720 sec with XOR vs. 0.855 sec with RS; XOR provides a performance gain of 15% in total time.
DBs Recovery Performance: Recovering 2 DBs
Same set-up: file of 125,000 records; bucket of 31,250 records (3.125 MB).
[Chart: time to recover 2 DBs vs. slice size, from 4% to 100% of bucket contents; the total time stays around 1.2 sec and does not vary much.]
DBs Recovery Performance: Recovering 3 DBs
Same set-up: file of 125,000 records; bucket of 31,250 records (3.125 MB).
[Chart: time to recover 3 DBs vs. slice size, from 4% to 100% of bucket contents; the total time stays around 1.6 sec and does not vary much.]
Performance Summary of Bucket Recovery
• 1 DB (3.125 MB) in 0.7 sec (XOR): 4.46 MB/sec
• 1 DB (3.125 MB) in 0.85 sec (RS): 3.65 MB/sec
• 2 DBs (6.25 MB) in 1.2 sec (RS): 5.21 MB/sec
• 3 DBs (9.375 MB) in 1.6 sec (RS): 5.86 MB/sec
Conclusion
The conducted experiments show that the encoding/decoding optimization and the enhanced bucket architecture have a real impact on performance, yielding good recovery performance. Finally, we improved the processing time of the RS decoding process by 4% to 8%; 1 DB is recovered in about half a second.
Conclusion (Cont'd)
LH*RS is a mature implementation, the result of many optimization iterations, and the only SDDS with scalable availability.
Future Work
• A better strategy for propagating parity updates to the PBs
• Investigation of faster encoding/decoding processes
References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks (RAID), Proc. of the ACM SIGMOD Conf., pp. 109-116, June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
[MacDonald & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329, 1988.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.