

1. Paris Dauphine University * CERIA Lab. * 4th October 2004. Contribution to the Design & Implementation of the Highly Available Scalable and Distributed Data Structure: LH*RS. Rim Moussa, Rim.Moussa@dauphine.fr, http://ceria.dauphine.fr/rim/rim.html. Thesis Supervisor: Prof. Witold Litwin. Examiners: Prof. Thomas J.E. Schwarz, Prof. Tore Risch. Jury President: Prof. Gérard Lévy. Thesis presentation in Computer Science * Distributed Databases.

2. Outline
  1. Issue
  2. State of the Art
  3. LH*RS Scheme
  4. LH*RS Manager
  5. Experimentations
  6. LH*RS File Creation
  7. Bucket Recovery
  8. Parity Bucket Creation
  9. Conclusion & Future Work

3. Facts…
  • Volume of information grows by ~30% per year.
  • Technology:
    – Network infrastructure >> Gilder's Law: bandwidth triples every year.
    – Evolution of PC storage & computing capacities >> Moore's Law: these double every 18 months.
  • Bottleneck: disk accesses & CPUs.
  ⇒ Need for distributed data storage systems. SDDSs (LH*, RP*, …) → high throughput.

4. Facts…
  • Multicomputers (PCs over a network) >> modular architecture, good price/performance trade-off.
  • Frequent & costly failures >> statistic published by Contingency Planning Research in 1996: the cost of one hour of service interruption for a brokerage application is $6.45 million.
  ⇒ Need for distributed & highly available data storage systems.

5. State of the Art
  Data Replication
    (+) Good response time; mirrors are functional.
    (−) High storage overhead (×n for n replicas).
  Parity Calculus
    Criteria to evaluate erasure-resilient codes:
    • Encoding rate (parity volume / data volume)
    • Update penalty (parity volumes)
    • Group size used for data reconstruction
    • Encoding & decoding complexity
    • Recovery capabilities

6. Parity Schemes
  1-Available Schemes
    • XOR parity calculus: RAID technology (levels 3, 4, 5, …) [PGK88], SDDS LH*g [L96], …
  k-Available Schemes
    • Binary linear codes [H94] → tolerate at most 3 failures.
    • Array codes: EVENODD [B94], X-code [XB99], RDP [C+04] → tolerate at most 2 failures.
    • Reed-Solomon codes: IDA [R89], RAID X [W91], FEC [B95], Tutorial [P97], LH*RS [LS00, ML02, MS04, LMS04] → tolerate k failures (k > 3 possible), …

7. Outline…
  1. Issue
  2. State of the Art
  3. LH*RS Scheme: LH*RS? SDDSs? Reed-Solomon codes? Encoding/decoding optimizations
  4. LH*RS Manager
  5. Experimentations

8. LH*RS?
  LH*RS [LS00] combines:
  • Scalability & high throughput — LH*: Scalable & Distributed Data Structure; distribution using Linear Hashing (LH*LH [KLR96]); LH*LH manager [B00].
  • High availability — parity calculus using Reed-Solomon codes [RS63].
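The distribution layer rests on Linear Hashing; as a minimal sketch (not the thesis code), the classic LH addressing rule that an LH* client evaluates against its file image looks like this, with i the file level and n the split pointer:

```python
# Classic Linear Hashing addressing rule, as in the LH* literature.
# h_i(key) = key mod 2^i; buckets below the split pointer n have
# already split and therefore use h_{i+1}.
def lh_address(key: int, i: int, n: int) -> int:
    a = key % (2 ** i)            # candidate bucket under h_i
    if a < n:                     # bucket a has already split ...
        a = key % (2 ** (i + 1))  # ... so re-hash with h_{i+1}
    return a
```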

9. SDDSs Principles (1): Dynamic File Growth
  [Diagram: a client sends insertions over the network; an OVERLOADED data bucket splits under the coordinator's control ("You split"), transferring records to a new data bucket.]

10. SDDSs Principles (2): No Centralized Directory Access
  [Diagram: a client with an outdated file image sends a query over the network; the addressed data bucket forwards the query to the correct bucket, and an Image Adjustment Message updates the client's image.]
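A hedged sketch of the image adjustment, assuming the rule from the LH* papers (the answering bucket's level j and address a travel back in the Image Adjustment Message); this is illustrative, not necessarily the manager's exact code:

```python
# LH* client image adjustment (rule A3 in the LH* literature): after a
# forward, the IAM carries the level j and address a of the bucket that
# finally served the query; the client refreshes its image (i', n').
def adjust_image(image: dict, j: int, a: int) -> None:
    image["i"] = j - 1
    image["n"] = a + 1
    if image["n"] >= 2 ** image["i"]:  # wrap: the image level grows
        image["n"] = 0
        image["i"] += 1
```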

11. Reed-Solomon Codes
  • Encoding: from m data symbols → calculus of n − m parity symbols.
  • Data representation → Galois Field:
    – Finite field of size q.
    – Closure property: addition, subtraction, multiplication, division.
  • In GF(2^w): (1) addition is XOR; (2) multiplication uses tables (gflog and antigflog):
    e1 * e2 = antigflog[ gflog[e1] + gflog[e2] ]
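The gflog/antigflog tables named on the slide are built once per field; a minimal sketch in GF(2^8) follows (the primitive polynomial 0x11D and the doubled antilog table are assumptions of this sketch, not taken from the thesis):

```python
# Log/antilog tables for GF(2^8). Addition in GF(2^w) is plain XOR.
GF_SIZE = 256                     # q = 2^8 elements
PRIM_POLY = 0x11D                 # x^8 + x^4 + x^3 + x^2 + 1, 2 is primitive

gflog = [0] * GF_SIZE             # gflog[e]: discrete log of e (e != 0)
antigflog = [0] * (2 * GF_SIZE)   # antigflog[i]: generator^i, doubled so
                                  # gf_mul never needs a 'mod (q - 1)'
b = 1
for i in range(GF_SIZE - 1):
    antigflog[i] = b
    gflog[b] = i
    b <<= 1                       # multiply by the generator x
    if b & GF_SIZE:
        b ^= PRIM_POLY            # reduce modulo the primitive polynomial
for i in range(GF_SIZE - 1, 2 * GF_SIZE):
    antigflog[i] = antigflog[i - (GF_SIZE - 1)]

def gf_mul(e1: int, e2: int) -> int:
    """e1 * e2 = antigflog[gflog[e1] + gflog[e2]], exactly as on the slide."""
    if e1 == 0 or e2 == 0:
        return 0
    return antigflog[gflog[e1] + gflog[e2]]
```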

12. RS Encoding
  Systematic encoding with the generator matrix (I_m | P), where I_m is the m×m identity and P is the m×(n−m) parity matrix of coefficients C_{i,j}:
    [S1 … Sm] × (I_m | P) = [S1 … Sm P1 … P_{n−m}]
  (1) Systematic encoding: the data symbols pass through unchanged.
  (2) Any m columns of (I_m | P) are linearly independent.
  Each parity symbol costs m GF multiplications and m − 1 XORs:
    P_j = (S1 · C_{1,j}) ⊕ (S2 · C_{2,j}) ⊕ … ⊕ (Sm · C_{m,j})
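A sketch of this encoding on top of the gf_mul above; the coefficient matrix C (m rows, n−m columns) is assumed given, with the independence condition on (I_m | C) ensured elsewhere:

```python
# Systematic RS encoding: data symbols are untouched, each parity symbol
# P_j is the GF inner product of the data with column j of C.
def encode_parity(data_symbols: list[int], C: list[list[int]]) -> list[int]:
    parity = []
    for j in range(len(C[0])):        # one pass per parity symbol P_j
        p = 0
        for i, s in enumerate(data_symbols):
            p ^= gf_mul(s, C[i][j])   # m GF multiplications, m-1 XORs
        parity.append(p)
    return parity
```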

13. RS Decoding (optimized)
  Take the m surviving ("OK") symbols and the m corresponding columns of (I_m | P); they form an m×m matrix H. Invert H by Gauss transformation, then multiply the m OK symbols by only the columns of H^{-1} corresponding to the lost symbols:
    [S1 S2 S3 S4 … Sm] = [m OK symbols] × H^{-1}
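A sketch of this decoding path, reusing the GF helpers above (Gauss-Jordan elimination over the field; the sketch computes the full H^{-1}, while the optimization keeps only the columns of the lost symbols):

```python
def gf_inv(e: int) -> int:
    return antigflog[(GF_SIZE - 1) - gflog[e]]   # e^-1 = g^(q-1-log e)

# Invert the m x m matrix H built from the columns of (I_m | P) that
# correspond to the m surviving symbols.
def invert(H: list[list[int]]) -> list[list[int]]:
    m = len(H)
    A = [row[:] + [int(i == j) for j in range(m)] for i, row in enumerate(H)]
    for col in range(m):
        piv = next(r for r in range(col, m) if A[r][col])   # non-zero pivot
        A[col], A[piv] = A[piv], A[col]
        f = gf_inv(A[col][col])
        A[col] = [gf_mul(x, f) for x in A[col]]             # scale pivot to 1
        for r in range(m):
            if r != col and A[r][col]:
                g = A[r][col]
                A[r] = [x ^ gf_mul(g, y) for x, y in zip(A[r], A[col])]
    return [row[m:] for row in A]

def decode(ok_symbols: list[int], H: list[list[int]]) -> list[int]:
    Hinv = invert(H)
    m = len(ok_symbols)
    out = []
    for i in range(m):                # in practice only the columns of the
        s = 0                         # lost symbols are needed (the slide's
        for k in range(m):            # optimization)
            s ^= gf_mul(ok_symbols[k], Hinv[k][i])
        out.append(s)
    return out
```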

14. Optimizations (1): Galois Field
  GF(2^8) → 1 symbol = 1 byte; GF(2^16) → 1 symbol = 2 bytes.
  (+) GF(2^16) vs. GF(2^8) halves the number of symbols → halves the number of GF operations.
  (−) Multiplication table size: GF(2^8): 0.768 KB; GF(2^16): 393.216 KB (512 × 0.768).
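The quoted sizes can be cross-checked directly, assuming a q-entry gflog table and a doubled (2q-entry) antilog table as in the GF(2^8) sketch above:

```python
# Worked table sizes: symbols are 1 byte in GF(2^8), 2 bytes in GF(2^16).
def table_bytes(w: int) -> int:
    q = 2 ** w
    symbol_bytes = w // 8
    return (q + 2 * q) * symbol_bytes    # gflog + doubled antigflog

print(table_bytes(8) / 1000)    # 0.768   KB
print(table_bytes(16) / 1000)   # 393.216 KB = 512 * 0.768
```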

15. Optimizations (2): Parity Matrix
  [Matrix sample in GF(2^16), hex: the first row and first column of P are all 0001; the remaining entries are e.g. eb9b, 2284, 9e44, d7f1, …]
  • 1st row of 1's: any update from the 1st data bucket is processed with XOR calculus → performance gain of 4% (case of PB creation, m = 4).
  • 1st column of 1's: encoding of the 1st parity bucket uses XOR calculus alone → gain in encoding & decoding.
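A hedged sketch of how the all-1 row and column pay off: whenever the coefficient is 1 (i.e. g^0), the GF multiplication is the identity and the update collapses to a plain XOR into the parity:

```python
# Fold one update symbol into one parity symbol. Updates from the first
# data bucket (first row of P) and all updates to the first parity bucket
# (first column of P) take the pure-XOR branch.
def fold_update(parity: int, delta: int, coef: int) -> int:
    if coef == 1:
        return parity ^ delta             # pure XOR calculus
    return parity ^ gf_mul(delta, coef)   # general RS path
```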

16. Optimizations (3): GF Multiplication
  Goal: reduce GF multiplication complexity: e1 * e2 = antigflog[ gflog[e1] + gflog[e2] ].
  • Encoding: log pre-calculus of the coefficients of the P matrix → improvement of 3.5%.
  • Decoding: log pre-calculus of the coefficients of the H^{-1} matrix and of the OK-symbols vector → improvement of 4% to 8%, depending on the number of buckets to recover.
  [Matrix sample: the logged matrix has a first row and first column of 0000, since log 1 = 0; remaining entries e.g. 5ab5, e267, 0dce, 784d, 2b66, …]
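A sketch of the log pre-calculus, assuming (as in the GF block above) that zero never appears as a parity-matrix coefficient: storing gflog of each coefficient once removes one table lookup from every multiplication in the encoding loop:

```python
# Pre-compute the logs of the parity matrix once (log 1 = 0, hence the
# slide's first row/column of 0000); each encode step then costs one gflog
# lookup, one addition and one antigflog lookup instead of a full gf_mul.
def precompute_logs(C: list[list[int]]) -> list[list[int]]:
    return [[gflog[c] for c in row] for row in C]

def encode_logged(data_symbols: list[int], logC: list[list[int]]) -> list[int]:
    parity = []
    for j in range(len(logC[0])):
        p = 0
        for i, s in enumerate(data_symbols):
            if s:                                  # skip zero data symbols
                p ^= antigflog[gflog[s] + logC[i][j]]
        parity.append(p)
    return parity
```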

17. LH*RS Parity Groups
  • Grouping concept: m = #data buckets, k = #parity buckets per group.
  • A record with key inserted at rank r in its data bucket is folded into the parity record of rank r in each of the group's k parity buckets.
  • Data bucket record: (key; data). Parity bucket record: (rank; [key list]; parity).
  A k-available group survives the failure of any k of its buckets.
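A hedged sketch of the rank mechanics; the record layout is read off the slide, gf_mul is the earlier GF sketch, and the byte-wise loop assumes GF(2^8) symbols (none of this is the thesis code):

```python
# One parity record per rank r guards the rank-r records of the group's m
# data buckets: it keeps their keys and the GF-encoded parity of their bytes.
def on_insert(parity_bucket: dict, rank: int, db_index: int,
              key: int, delta: bytes, coef: int, m: int = 4) -> None:
    rec = parity_bucket.setdefault(rank, {"keys": [None] * m,
                                          "parity": bytes(len(delta))})
    rec["keys"][db_index] = key                     # key list of rank r
    rec["parity"] = bytes(p ^ gf_mul(d, coef)       # fold the record in,
                          for p, d in zip(rec["parity"], delta))
```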

18. Outline…
  1. Issue
  2. State of the Art
  3. LH*RS Scheme
  4. LH*RS Manager: Communication, Gross Architecture
  5. Experimentations
  6. File Creation
  7. Bucket Recovery
  …

19. Communication: UDP
  Used for individual operations (insert, update, delete, search), record recovery and control messages.
  → Performance.

20. Communication: TCP/IP
  Used for large buffer transfers: new parity buckets, transfer of parity updates & records (bucket split), bucket recovery.
  → Performance & reliability.

21. Communication: "Multicast"
  Used for looking for new data/parity buckets.
  → Multipoint communication.

22. Architecture
  Enhancements to the SDDS2000 architecture:
  (1) TCP/IP connection handler: TCP/IP connections use passive OPEN (RFC 793 [ISI81]; TCP/IP under the Win2K Server OS [MB00]).
  Before, recovery of 1 bucket (3.125 MB): SDDS2000: 6.7 s; SDDS2000-TCP: 2.6 s (hardware config.: 733 MHz CPUs, 100 Mbps network) → improvement of 60%.
  (2) Flow control & message acknowledgement (FCMA): principle of "sending credit & message conservation until delivery" [J88, GRS97, D01].
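A minimal sketch of the FCMA principle under illustrative names (not the SDDS2000-TCP code): at most `credit` messages are in flight at a time, and each message is conserved until its acknowledgement arrives, so a time-out can retransmit it:

```python
# "Sending credit & message conservation until delivery" in miniature.
class CreditSender:
    def __init__(self, send, credit: int = 8):
        self.send = send          # transport callback: send(seq, msg)
        self.credit = credit      # the sending credit
        self.in_flight = {}       # seq -> message, kept until delivery
        self.backlog = []         # messages waiting for credit
        self.next_seq = 0

    def submit(self, msg) -> None:
        self.backlog.append(msg)
        self._pump()

    def _pump(self) -> None:
        while self.backlog and len(self.in_flight) < self.credit:
            seq, msg = self.next_seq, self.backlog.pop(0)
            self.in_flight[seq] = msg
            self.send(seq, msg)
            self.next_seq += 1

    def on_ack(self, seq: int) -> None:
        self.in_flight.pop(seq, None)   # delivered: release one credit
        self._pump()

    def on_timeout(self, seq: int) -> None:
        if seq in self.in_flight:       # still unacknowledged: retransmit
            self.send(seq, self.in_flight[seq])
```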

23. Architecture (2)
  (3) Dynamic IP addressing structure: new servers (data or parity) are tagged using multicast — the coordinator addresses a multicast group of blank parity buckets and a multicast group of blank data buckets.
  Before: a pre-defined, static table of IP addresses.

24. Architecture (3)
  [Diagram of a bucket's communication architecture: a TCP listening thread and a pool of working threads behind the TCP/IP port; a UDP listening thread and port; a UDP sending port; a multicast listening thread/port and a multicast working thread; message queues with free zones; an ACK structure holding messages waiting for acknowledgement, unacknowledged messages, and ACK-management threads.]

25. Experimentation
  • Performance evaluation: CPU time & communication time.
  • Experimental environment: 5 machines (Pentium IV 1.8 GHz, 512 MB RAM), 1 Gbps Ethernet network, O.S.: Win2K Server.
  • Tested configuration: 1 client, a group of 4 data buckets, k parity buckets (k = 0, 1, 2, 3).

26. Outline…
  1. Issue
  2. State of the Art
  3. LH*RS Scheme
  4. LH*RS Manager
  5. Experimentations
  6. File Creation: Parity Update, Performance
  7. Bucket Recovery
  8. Parity Bucket Creation

27. File Creation
  • Client operations: propagation of data record inserts/updates/deletes to the parity buckets.
    – Update: send only the Δ-record (the difference between old and new record).
    – Delete: management of free ranks within data buckets.
  • Data bucket split (N1 = #remaining records, N2 = #leaving records), see the sketch below:
    – Parity group of the splitting data bucket: N1 + N2 deletes + N1 inserts.
    – Parity group of the new data bucket: N2 inserts.
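A tiny sketch transcribing the slide's counts of the parity traffic one split generates (the remaining records are deleted and re-inserted because their ranks change; the moved records are inserted into the new bucket's group):

```python
# Parity operations triggered by a data-bucket split:
# N1 records remain in the splitting bucket, N2 records move away.
def split_parity_traffic(n1: int, n2: int) -> dict:
    return {"old_group": {"deletes": n1 + n2, "inserts": n1},
            "new_group": {"deletes": 0,       "inserts": n2}}
```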

28. Performance: File Creation Config.
  Client window = 1 / client window = 5; max bucket size = 10,000 records; file of 25,000 records; 1 record = 104 bytes.
  No difference between GF(2^8) and GF(2^16) (we don't wait for ACKs between DBs and PBs).

29. Performance (client window = 1)
  k = 0 → k = 1: performance degradation of 20%.
  k = 1 → k = 2: performance degradation of 8%.

30. Performance (client window = 5)
  k = 0 → k = 1: performance degradation of 37%.
  k = 1 → k = 2: performance degradation of 10%.

31. Outline…
  1. Issue
  2. State of the Art
  3. LH*RS Scheme
  4. LH*RS Manager
  5. Experimentations
  6. File Creation
  7. Bucket Recovery: Scenario, Performance
  8. Parity Bucket Creation

32. Scenario (1): Failure Detection
  The coordinator polls the group's parity and data buckets: "Are you alive?" (some buckets have failed).

33. Scenario (2): Waiting for Responses…
  The surviving parity and data buckets answer "OK"; the failed buckets stay silent.

34. Scenario (3): Searching for Spare Buckets…
  Coordinator → multicast group of blank data buckets: "Wanna be a spare?"

35. Scenario (4): Waiting for Replies…
  Candidate blank buckets answer "I would"; each launches UDP listening, TCP listening and its working threads, then waits for confirmation. If the time-out elapses, it cancels everything.

36. Scenario (5): Spare Selection
  The coordinator confirms the selected spares ("You are hired") and sends a cancellation to the others.

37. Scenario (6): Recovery Manager Selection
  The coordinator designates one parity bucket as recovery manager: "Recover the failed buckets."

38. Scenario (7): Query Phase
  Recovery manager → buckets participating in the recovery (parity and data buckets): "Send me the records of rank in [r, r + slice − 1]." The spare buckets stand by.

39. Scenario (8): Reconstruction Phase
  The participating buckets return the requested buffers; the recovery manager decodes them and ships the recovered slices to the spare buckets. The decoding phase runs in parallel with the query phase.
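A hedged sketch of this slice-by-slice pipeline; the names get_records, decode_slice and store are illustrative stand-ins, not the thesis API:

```python
# Slice-based recovery: pull one slice of ranks from every participating
# bucket (query phase), decode it (reconstruction phase), ship it to the
# spares; the generator-style loop lets decoding overlap the next query.
def recover(participants, spares, bucket_size, slice_size):
    for r in range(0, bucket_size, slice_size):
        hi = min(r + slice_size, bucket_size) - 1
        buffers = [b.get_records(r, hi) for b in participants]  # query phase
        recovered = decode_slice(buffers)                       # RS/XOR decoding
        for spare, data in zip(spares, recovered):
            spare.store(r, data)                                # ship the slice
```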

40. Performance: Config.
  • File info: file of 125,000 records; record size = 100 bytes; bucket size = 31,250 records ≈ 3.125 MB; group of 4 data buckets (m = 4), k-available with k = 1, 2, 3.
  • Decoding: GF(2^16); RS+ decoding (RS + log pre-calculus of H^{-1} and of the OK-symbols vector).
  • Recovery per slice (adaptive to PCs' storage & computing capacities).

41. Performance: Recovery of 1 DB (XOR)
  [Chart: total time ≈ 0.58 s; as the slice varies from 4% to 100% of a bucket's content, the total time is almost constant.]

42. Performance: Recovery of 1 DB (RS)
  [Chart: total time ≈ 0.67 s; as the slice varies from 4% to 100% of a bucket's content, the total time is almost constant.]

43. Performance: XOR vs. RS
  Time to recover 1 DB with XOR: 0.58 s; with RS: 0.67 s.
  XOR in GF(2^16) realizes a gain of 13% in total time (and 30% in CPU time).

44. Performance: Recovery of 2 DBs
  [Chart: total time ≈ 0.9 s; as the slice varies from 4% to 100% of a bucket's content, the total time is almost constant.]

45. Performance: Recovery of 3 DBs
  [Chart: total time ≈ 1.23 s; as the slice varies from 4% to 100% of a bucket's content, the total time is almost constant.]

46. Performance: Summary
  Time to recover f buckets < f × time to recover 1 bucket: the query phase is factorized (done once for all failed buckets); the extra cost per additional bucket is the decoding time & the time to send the recovered buffers.

47. Performance: GF(2^8)
  • XOR in GF(2^8) improves decoding performance by 60% compared to RS in GF(2^8).
  • RS/RS+ decoding in GF(2^16) realizes a gain of 50% compared to decoding in GF(2^8).

48. Outline…
  1. Issue
  2. State of the Art
  3. LH*RS Scheme
  4. LH*RS Manager
  5. Experimentations
  6. File Creation
  7. Bucket Recovery
  8. Parity Bucket Creation: Scenario, Performance

49. Scenario (1): Searching for a New Parity Bucket
  Coordinator → multicast group of blank parity buckets: "Wanna join group g?"

50. Scenario (2): Waiting for Replies…
  Candidate blank parity buckets answer "I would"; each launches UDP listening, TCP listening and its working threads, then waits for confirmation. If the time-out elapses, it cancels everything.
