
Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure



Presentation Transcript


1. Workshop in Distributed Data & Structures * July 2004
Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure
Thomas J.E. Schwarz, TjSchwarz@scu.edu, http://www.cse.scu.edu/~tschwarz/homepage/thomas_schwarz.html
Rim Moussa, Rim.Moussa@dauphine.fr, http://ceria.dauphine.fr/rim/rim.html

2. Objective: Design, Implementation & Performance Measurements of LH*RS. Factors of interest: Parity Overhead, Recovery Performance.

3. Overview: Motivation; Highly-Available Schemes; LH*RS; Architectural Design; Hardware Testbed; File Creation; High Availability; Recovery; Conclusion; Future Work. (File creation, high availability and recovery are each covered by a scenario description and performance results.)

4. Motivation
• Information volume grows by about 30% per year
• Disk access and CPUs are a bottleneck
• Failures are frequent & costly (Source: Contingency Planning Research, 1996)

5. Requirements: we need Highly Available Networked Data Storage Systems offering Scalability, High Throughput and High Availability.

6. Scalable & Distributed Data Structure: dynamic file growth. [Diagram: an overloaded Data Bucket tells the Coordinator "I'm overloaded!"; the Coordinator answers "You split!"; inserts continue while records are transferred over the network to a new Data Bucket; Clients access the Data Buckets (DBs).]

7. SDDS (Cont'd): no centralized directory access. [Diagram: a Client sends a query over the network; an outdated client image may cause the query to be forwarded between Data Buckets (DBs), after which the Client receives an Image Adjustment Message.]

8. Solutions towards High Availability
• Data Replication: (+) good response time since mirrors are queried; (-) high storage cost (n times the data for n replicas).
• Parity Calculus: erasure-resilient codes are evaluated with regard to coding rate (parity volume / data volume), update penalty, group size used for data reconstruction, and complexity of coding & decoding.

9. Fault-Tolerant Schemes
• 1 server failure: simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96].
• More than 1 server failure: binary linear codes [Hellerstein et al., 94]; array codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04] → tolerate just 2 failures; Reed-Solomon codes: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00] → tolerate a large number of failures.
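For the 1-unavailability case above, here is a minimal sketch of the simple XOR parity calculus. Bucket contents are modeled as equal-length byte strings; all names are illustrative and not taken from the LH*g or LH*RS code.

```python
# Minimal sketch: XOR parity over a group of k data buckets (tolerates 1 failure).

def xor_parity(buckets):
    """Compute the XOR parity of a list of equal-length byte strings."""
    parity = bytearray(len(buckets[0]))
    for b in buckets:
        for i, byte in enumerate(b):
            parity[i] ^= byte
    return bytes(parity)

def recover_one(surviving_buckets, parity):
    """Recover the single missing bucket from the k-1 survivors and the parity."""
    return xor_parity(list(surviving_buckets) + [parity])

if __name__ == "__main__":
    group = [b"recA", b"recB", b"recC", b"recD"]               # k = 4 data buckets
    p = xor_parity(group)                                      # parity bucket
    assert recover_one(group[:2] + group[3:], p) == group[2]   # bucket 2 lost, then recovered
```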

  10. A Highly Available & Distributed Data Structure: LH*RS [Litwin & Schwarz, 00] [Litwin, Moussa & Schwarz, sub.]

11. LH*RS is an SDDS: Scalability, High Throughput, High Availability. Data distribution scheme based on Linear Hashing: LH*LH [Karlsson et al., 96], applied to the key field. Parity calculus: Reed-Solomon codes [Reed & Solomon, 63].
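A minimal sketch of linear-hashing addressing as an LH* client image could apply it to the key field; it assumes a single initial bucket and uses illustrative names, not the actual LH*LH code.

```python
# Linear-hashing addressing from a client image (i = file level, n = split pointer).
# Assumes one initial bucket; the image may be outdated.

def lh_address(key: int, i: int, n: int) -> int:
    """Map a key to a bucket number under linear-hashing state (i, n)."""
    a = key % (2 ** i)
    if a < n:                       # bucket a has already split at this level
        a = key % (2 ** (i + 1))    # rehash with the next-level function
    return a

# Example: level i = 3, split pointer n = 2, so buckets 0..9 exist.
print(lh_address(26, 3, 2))   # 26 mod 8 = 2 >= n: stays in bucket 2
print(lh_address(25, 3, 2))   # 25 mod 8 = 1 < n: rehashed, 25 mod 16 = 9
```

If the client's image (i, n) is outdated, the addressed bucket forwards the query and the client later receives an Image Adjustment Message, as on slide 7.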

12. LH*RS File Structure. [Diagram: an inserted record gets rank r in its Data Bucket; Data Bucket records hold (Key, Data Field); the group's Parity Buckets hold, for each rank r, (Rank, [Key List], Parity Field).]
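A hypothetical sketch of the two record layouts implied by slide 12; only the Key/Data and Rank/[Key List]/Parity structure comes from the slide, and the field names and types are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataRecord:          # stored in a Data Bucket, addressed by key
    key: int
    data: bytes            # non-key data field

@dataclass
class ParityRecord:        # stored in a Parity Bucket, addressed by rank
    rank: int                                        # insert rank r within the group
    keys: List[int] = field(default_factory=list)    # keys of the group's records of rank r
    parity: bytes = b""                              # parity field computed over those records
```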

  13. Architectural Design of LH*RS

14. Communication
• UDP, for speed: individual Insert/Update/Delete/Search queries, record recovery, service and control messages.
• TCP/IP, for better performance & reliability than UDP on large transfers: new PB creation, large update transfer (DB split), bucket recovery.
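A sketch of this communication split using plain sockets; the endpoint and message formats are placeholders, not the LH*RS wire protocol.

```python
import socket

DB_ADDR = ("192.168.0.10", 5000)   # hypothetical data-bucket endpoint

def send_query_udp(payload: bytes) -> None:
    """Individual insert/update/delete/search: one small datagram, no connection."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, DB_ADDR)

def send_bulk_tcp(buffer: bytes) -> None:
    """Large transfer (DB split buffer, bucket recovery): reliable TCP stream."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect(DB_ADDR)
        s.sendall(buffer)
```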

15. Bucket Architecture. [Diagram: each bucket has a TCP/IP port, send and receive UDP ports, and a multicast listening port; a TCP listening thread, a UDP listening thread and a multicast listening thread feed message queues; n working threads and a multicast working thread do the message processing; an acknowledgement-management thread maintains the sending credit and a window of not-yet-acknowledged messages (free zones, messages waiting for ack).]
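A simplified sketch of the listening-thread / message-queue / working-thread pattern in the diagram; thread counts, names and the processing stub are assumptions, and the acknowledgement management is omitted here (see the next slide).

```python
import queue
import threading

message_queue: "queue.Queue[bytes]" = queue.Queue()

def process(msg: bytes) -> None:
    """Placeholder for the real message processing (insert, search, split, ...)."""
    pass

def udp_listening_thread(sock) -> None:
    """Receives datagrams and hands them to the message queue."""
    while True:
        msg, _addr = sock.recvfrom(64 * 1024)
        message_queue.put(msg)

def working_thread(worker_id: int) -> None:
    """Drains the message queue: the '-Message processing-' box in the diagram."""
    while True:
        msg = message_queue.get()
        process(msg)
        message_queue.task_done()

def start_workers(n: int) -> None:
    for i in range(n):
        threading.Thread(target=working_thread, args=(i,), daemon=True).start()
```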

16. Architectural Design: Enhancements to SDDS2000 [B00, D01]
• Bucket architecture.
• TCP/IP connection handler: TCP/IP connections use passive OPEN, RFC 793 [ISI, 81]; TCP/IP implementation under the Win2K Server OS [McDonal & Barkley, 00]. Example, recovery of 1 DB: 6.7 s with the SDDS 2000 architecture vs. 2.6 s with the new architecture, a 60% improvement (hardware config.: 733 MHz machines, 100 Mbps network).
• Flow control and acknowledgement management: principle of "sending credit + message conservation until delivery" [Jacobson, 88] [Diène, 01].
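A sketch of the "sending credit + message conservation until delivery" principle: at most `credit` messages may be outstanding, and every sent message is kept until its acknowledgement arrives so it can be retransmitted. The class and its methods are illustrative, not the SDDS2000 implementation.

```python
class CreditSender:
    def __init__(self, send_fn, credit: int = 5):
        self.send_fn = send_fn          # e.g. a UDP send
        self.credit = credit            # max number of unacknowledged messages
        self.window = {}                # msg_id -> message waiting for ack

    def try_send(self, msg_id: int, msg: bytes) -> bool:
        if len(self.window) >= self.credit:
            return False                # no credit left: caller must wait for acks
        self.window[msg_id] = msg       # conserve the message until delivery
        self.send_fn(msg_id, msg)
        return True

    def on_ack(self, msg_id: int) -> None:
        self.window.pop(msg_id, None)   # delivery confirmed: free the slot

    def on_timeout(self, msg_id: int) -> None:
        if msg_id in self.window:       # not acked yet: retransmit the kept copy
            self.send_fn(msg_id, self.window[msg_id])
```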

17. Architectural Design (Cont'd): Coordinator multicast component, with a Blank PBs multicast group and a Blank DBs multicast group. Before: a pre-defined & static table of the IP addresses of the DBs and PBs. Now: a dynamic IP@ structure, updated when adding new/spare buckets (PBs/DBs) through a multicast probe.
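A sketch of how a blank (spare) bucket might sit on such a multicast group waiting for a coordinator probe; the group address and port are placeholders, not the actual configuration.

```python
import socket
import struct

BLANK_PB_GROUP, PORT = "239.1.1.1", 6000    # hypothetical "Blank PBs" multicast group

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(BLANK_PB_GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)   # join the group

probe, coordinator = sock.recvfrom(1024)    # blocks until e.g. a "Wanna join group g?" probe
# ... reply to the coordinator over UDP, then leave the group once selected:
sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, mreq)
```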

18. Hardware Testbed
• 5 machines (Pentium IV, 1.8 GHz; RAM: 512 MB)
• Ethernet network: max bandwidth of 1 Gbps
• Operating system: Windows 2K Server
• Tested configuration: 1 Client, a group of 4 Data Buckets, k Parity Buckets, k ∈ {0, 1, 2}

  19. LH*RS File Creation

20. File Creation
• Client operation: propagation of each Insert/Update/Delete on a data record to the Parity Buckets.
• Data Bucket split:
  - Splitting DB → its PBs: for the records that remain, N deletes (from the old rank) & N inserts (at the new rank); for the records that move, N deletes.
  - New DB → its PBs: N inserts (the moved records).
All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective Parity Buckets of the splitting DB & the new DB, as sketched below.
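A sketch of how the split updates could be batched per target, assuming each parity update is an (operation, rank, key) triple; the actual buffer format is not specified in the slides.

```python
def build_split_buffers(remaining, moved):
    """remaining: [(key, old_rank, new_rank)] for records kept by the splitting DB;
    moved: [(key, old_rank, new_rank_in_new_db)] for records shipped to the new DB."""
    old_db_buffer, new_db_buffer = [], []
    for key, old_rank, new_rank in remaining:        # records that stay: delete + re-insert
        old_db_buffer.append(("delete", old_rank, key))
        old_db_buffer.append(("insert", new_rank, key))
    for key, old_rank, new_rank in moved:            # records that move to the new DB
        old_db_buffer.append(("delete", old_rank, key))
        new_db_buffer.append(("insert", new_rank, key))
    # Each buffer is then sent in one TCP transfer to that DB's parity buckets.
    return old_db_buffer, new_db_buffer
```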

21. File Creation Perf. Experimental set-up: file of 25,000 data records; 1 data record = 104 B. [Charts: client sending credit = 1 and client sending credit = 5.] PB overhead: from k = 0 to k = 1, a performance degradation of 20%; from k = 1 to k = 2, a degradation of 8%.

22. File Creation Perf. Experimental set-up: file of 25,000 data records; 1 data record = 104 B. [Charts: client sending credit = 1 and client sending credit = 5.] PB overhead: from k = 0 to k = 1, a performance degradation of 37%; from k = 1 to k = 2, a degradation of 10%.

  23. LH*RS Parity Bucket Creation

24. PB Creation Scenario, searching for a new PB: the Coordinator multicasts "Wanna join group g?" [Sender IP@ + Entity#, Your Entity#] to the PBs connected to the Blank PBs multicast group.

25. PB Creation Scenario, waiting for replies: candidate PBs answer "I would"; each one starts UDP listening, TCP listening and its working threads, then waits for confirmation; if the time-out elapses, everything is cancelled. (The candidates are the PBs connected to the Blank PBs multicast group.)

26. PB Creation Scenario, PB selection: the Coordinator sends "You are selected" <UDP> to the chosen PB, which disconnects from the Blank PBs multicast group; the other candidates receive a cancellation.

27. PB Creation Scenario, auto-creation, query phase: the new PB sends "Send me your contents!" <UDP> to each Data Bucket of its group.

28. PB Creation Scenario, auto-creation, encoding phase: each Data Bucket of the group sends the requested buffer <TCP> to the new PB, which processes the buffers (encoding).
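A toy sketch of such an encoding step as a linear combination over GF(2^8); the field, the coefficients and the buffer layout are illustrative only and are not the actual LH*RS Reed-Solomon generator matrix.

```python
# Build GF(2^8) log/antilog tables (primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 = 0x11D).
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements via the log/antilog tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def encode_parity(data_buffers, coeffs):
    """Parity byte j = sum_i coeffs[i] * data_buffers[i][j], the sum being XOR."""
    length = len(data_buffers[0])
    parity = bytearray(length)
    for coeff, buf in zip(coeffs, data_buffers):
        for j in range(length):
            parity[j] ^= gf_mul(coeff, buf[j])
    return bytes(parity)

# One parity column over a group of 4 data-bucket buffers, with illustrative coefficients:
print(encode_parity([b"aaaa", b"bbbb", b"cccc", b"dddd"], [1, 2, 4, 8]).hex())
```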

29. PB Creation Perf. Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 × bucket size; file size = 2.5 × bucket size records. XOR encoding vs. RS encoding comparison. [Chart: encoding rate vs. bucket size, 0.608 / 0.686 / 0.640 / 0.659 MB/sec; processing time ≈ 74% of total time.]

30. PB Creation Perf. Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 × bucket size; file size = 2.5 × bucket size records. XOR encoding vs. RS encoding comparison. [Chart: encoding rate vs. bucket size, 0.618 / 0.713 / 0.674 / 0.673 MB/sec; processing time ≈ 74% of total time.]

31. PB Creation Perf., XOR encoding vs. RS encoding comparison, for bucket size = 50,000: XOR encoding rate 0.66 MB/sec, RS encoding rate 0.673 MB/sec. XOR provides a performance gain of 5% in processing time (0.02% in the total time).

  32. LH*RS Bucket Recovery

33. Buckets' Recovery, failure detection: the Coordinator sends "Are you alive?" <UDP> to the Parity Buckets and Data Buckets (some of which have failed).

34. Buckets' Recovery, waiting for replies: the alive buckets answer "I am alive" <UDP>; the failed Parity/Data Buckets stay silent.

35. Buckets' Recovery, searching for 2 spare DBs: the Coordinator multicasts "Wanna be a spare DB?" [Sender IP@, Your Entity#] to the DBs connected to the Blank DBs multicast group.

36. Buckets' Recovery, waiting for replies: candidate DBs answer "I would"; each one starts UDP listening, TCP listening and its working threads, then waits for confirmation; if the time-out elapses, everything is cancelled. (The candidates are the DBs connected to the Blank DBs multicast group.)

37. Buckets' Recovery, spare DBs selection: the Coordinator sends "You are selected" <UDP> to the chosen DBs, which disconnect from the Blank DBs multicast group; the other candidates receive a cancellation.

38. Buckets' Recovery, recovery manager determination: the Coordinator sends "Recover Buckets" [spares' IP@s + Entity#s; …] to a Parity Bucket, which becomes the Recovery Manager.

39. Buckets' Recovery, query phase: the Recovery Manager asks the alive buckets participating in the recovery (Parity Buckets and Data Buckets), "Send me records of rank in [r, r+slice-1]" <UDP>; the recovered buckets will be rebuilt on the spare DBs.

40. Buckets' Recovery, reconstruction phase: the alive buckets send the requested buffers <TCP> to the Recovery Manager, which runs the decoding process and sends the recovered records <TCP> to the spare DBs.
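A sketch of slice-by-slice reconstruction for the single-failure (XOR) case; `get_records` and the transport are placeholders, and the real decoder falls back to Reed-Solomon decoding when several buckets are lost.

```python
def recover_slices(survivors, bucket_size, slice_size):
    """survivors: objects with get_records(lo, hi) -> list of equal-length byte strings,
    one per rank in [lo, hi). Returns the records of the single missing bucket."""
    recovered = []
    for r in range(0, bucket_size, slice_size):
        hi = min(r + slice_size, bucket_size)
        columns = [s.get_records(r, hi) for s in survivors]   # query phase, one slice
        for rank_records in zip(*columns):                    # one rank at a time
            rec = bytearray(len(rank_records[0]))
            for other in rank_records:                        # XOR of all surviving records
                for i, byte in enumerate(other):
                    rec[i] ^= byte
            recovered.append(bytes(rec))                      # shipped to the spare DB
    return recovered
```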

41. DBs Recovery Perf. Experimental set-up: file of 125,000 records; bucket of 31,250 records ≈ 3.125 MB. XOR decoding vs. RS decoding comparison. [Chart (XOR): total time ≈ 0.72 s; varying the slice from 4% to 100% of the bucket contents hardly changes the total time.]

42. DBs Recovery Perf. Experimental set-up: file of 125,000 records; bucket of 31,250 records ≈ 3.125 MB. XOR decoding vs. RS decoding comparison. [Chart (RS): total time ≈ 0.85 s; varying the slice from 4% to 100% of the bucket contents hardly changes the total time.]

43. DBs Recovery Perf., XOR decoding vs. RS decoding comparison (file: 125,000 recs; bucket: 31,250 recs ≈ 3.125 MB): 1-DB recovery time with XOR: 0.720 sec; with RS: 0.855 sec. XOR provides a performance gain of 15% in total time.

44. DBs Recovery Perf., recover 2 DBs vs. 3 DBs (file: 125,000 recs; bucket: 31,250 recs ≈ 3.125 MB). [Chart (2 DBs): total time ≈ 1.2 s; varying the slice from 4% to 100% of the bucket contents hardly changes the total time.]

45. DBs Recovery Perf., recover 2 DBs vs. 3 DBs (file: 125,000 recs; bucket: 31,250 recs ≈ 3.125 MB). [Chart (3 DBs): total time ≈ 1.6 s; varying the slice from 4% to 100% of the bucket contents hardly changes the total time.]

46. Perf. Summary of Bucket Recovery
• 1 DB (3.125 MB) in 0.7 sec (XOR) → 4.46 MB/sec
• 1 DB (3.125 MB) in 0.85 sec (RS) → 3.65 MB/sec
• 2 DBs (6.250 MB) in 1.2 sec (RS) → 5.21 MB/sec
• 3 DBs (9.375 MB) in 1.6 sec (RS) → 5.86 MB/sec
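These rates are simply the recovered volume divided by the recovery time (using the more precise 0.855 s RS figure from slide 43):

```python
# Recovered volume / total recovery time, reproducing the MB/sec figures above.
for mb, sec in [(3.125, 0.7), (3.125, 0.855), (6.25, 1.2), (9.375, 1.6)]:
    print(f"{mb:6.3f} MB / {sec} s = {mb / sec:.2f} MB/sec")   # 4.46, 3.65, 5.21, 5.86
```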

47. Conclusion: the conducted experiments show that the encoding/decoding optimizations and the enhanced bucket architecture have a clear impact on performance, and that recovery performance is good. Finally, we improved the processing time of the RS decoding process by 4% to 8%; 1 DB is recovered in half a second.

48. Conclusion: LH*RS is a mature implementation (many optimization iterations); it is the only SDDS with scalable availability.

49. Future Work: a better parity-update propagation strategy to the PBs; investigation of faster encoding/decoding processes.

50. References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of ACM SIGMOD Conf., pp. 109-116, June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP), Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
[McDonal & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.
