Ch12 (continued) Replicated Data Management




  1. Ch12 (continued) Replicated Data Management

  2. Outline
  Serializability, Quorum-based protocol, Replicated Servers, Dynamic Quorum Changes, One-copy serializability, View-based quorums, Atomic Multicast, Virtual synchrony, Ordered Multicast, Reliable and Causal Multicast, Trans Algorithm, Group Membership, Transis Algorithm, Update Propagation

  3. Data replication
  Why replicate data?
  - Make data available in spite of the failure of some processors
  - Enable transactions (user-defined actions) to complete successfully even if some failures occur in the system, i.e. actions are resilient to failures of some nodes
  Problems to be solved: consistency, management of the replicas

  4. Data replication
  Intuitive representation of a replicated data system:
  - Transactions see logical data
  - The underlying system maps each operation on the logical data to operations on multiple copies
  [Figure: transactions issue logical operations on a logical data item d; the system maps each logical operation onto the replicas of d]

  5. Data replication
  Correctness criterion for the underlying system: to be correct, the mapping performed by the underlying system must ensure one-copy serializability.
  One-copy serializability: the concurrent execution of transactions on replicated data should be equivalent to some serial execution of the transactions on non-replicated data.

  6. Data replication
  Quorum-based protocol: ensure that any pair of conflicting accesses to the same data item access overlapping sets of sites. Here we discuss read/write quorums.
  - A data item d is stored at every processor p in P(d)
  - Every processor p in P(d) has a vote weight vp(d)
  - R(d): read threshold; W(d): write threshold
  - Read quorum of d: a subset P' of P(d) such that Σp∈P' vp(d) ≥ R(d)

  7. Data replication
  Quorum-based protocol (cont.)
  - Write quorum of d: a subset P' of P(d) such that Σp∈P' vp(d) ≥ W(d)
  - Total number of votes for d: V(d) = Σp∈P(d) vp(d)
  Quorums must satisfy the following two conditions.
  Condition 1: R(d) + W(d) > V(d)
  Intuitively, every read quorum of d intersects every write quorum of d. Hence a read and a write cannot be performed concurrently on d, and every read quorum can access the copy that reflects the latest update.

  8. Data replication
  Quorum-based protocol (cont.)
  Condition 2: 2*W(d) > V(d)
  Intuitively, any two write quorums of d intersect. Hence a write can be performed in at most one group.
  How do read and write operations work?
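To make the two conditions concrete, here is a minimal Python sketch (the function and variable names are illustrative, not from the slides) that validates a weighted-vote configuration and tests whether a subset of P(d) forms a read or write quorum.

```python
# Minimal sketch: checking the two static quorum conditions
# R(d) + W(d) > V(d) and 2*W(d) > V(d) for a weighted-vote configuration.

def total_votes(votes):
    """V(d): sum of the vote weights of all replicas of d."""
    return sum(votes.values())

def valid_quorum_thresholds(votes, R, W):
    """True iff the read/write thresholds satisfy both conditions."""
    V = total_votes(votes)
    cond1 = R + W > V      # every read quorum intersects every write quorum
    cond2 = 2 * W > V      # any two write quorums intersect
    return cond1 and cond2

def is_read_quorum(votes, subset, R):
    """A subset P' of P(d) is a read quorum if its votes sum to at least R(d)."""
    return sum(votes[p] for p in subset) >= R

def is_write_quorum(votes, subset, W):
    return sum(votes[p] for p in subset) >= W

# Hypothetical example: five replicas with one vote each, majority quorums.
votes = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1}
assert valid_quorum_thresholds(votes, R=3, W=3)
assert is_read_quorum(votes, {"A", "B", "C"}, R=3)
assert not is_write_quorum(votes, {"A", "B"}, W=3)
```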

  9. Data replication
  Quorum-based protocol: read operation
  Each replica of d has a version number; np(d) is the version number of the replica at processor p, initially zero. When a transaction T wants to read d, the following steps are performed:
  1. Broadcast a request for votes to P(d) (a remote processor q replies by sending a message with nq(d) and vq(d))
  2. Collect replies and construct P' until Σp∈P' vp(d) ≥ R(d)
  3. Lock all copies in P'
  4. Read the replica dp, p in P', with the highest version number
  5. Unlock the copies of d

  10. Data replication
  Quorum-based protocol: write operation
  Each replica of d has a version number; np(d) is the version number of the replica at processor p, initially zero. When a transaction T wants to write d, the following steps are performed:
  1. Broadcast a request for votes to P(d) (a remote processor q replies by sending a message with nq(d) and vq(d))
  2. Collect replies and construct P' until Σp∈P' vp(d) ≥ W(d)
  3. Lock all copies in P'
  4. Compute the new value d' of d
  5. Let max_n(d) be the highest version number read in step 2; for all p in P', write d' to the copy at p and set np(d) = max_n(d) + 1
  6. Unlock the copies of d
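As an illustration of the read and write steps above, here is a minimal single-process sketch; broadcast, locking, and unlocking are abstracted away, and vote collection simply iterates over in-memory replicas. All names (Replica, collect_quorum, quorum_read, quorum_write) are hypothetical.

```python
# Minimal sketch of the quorum read/write steps, run in one process.

class Replica:
    def __init__(self, vote):
        self.vote = vote          # v_p(d)
        self.version = 0          # n_p(d), initially zero
        self.value = None         # the local copy of d

def collect_quorum(replicas, threshold):
    """Collect replies until the gathered votes reach the threshold
    (stands in for the broadcast-and-reply phase)."""
    quorum, votes = [], 0
    for p, rep in replicas.items():
        quorum.append(p)
        votes += rep.vote
        if votes >= threshold:
            return quorum
    raise RuntimeError("not enough live replicas to form a quorum")

def quorum_read(replicas, R):
    quorum = collect_quorum(replicas, R)
    # read the replica with the highest version number
    newest = max(quorum, key=lambda p: replicas[p].version)
    return replicas[newest].value

def quorum_write(replicas, W, new_value):
    quorum = collect_quorum(replicas, W)
    max_n = max(replicas[p].version for p in quorum)
    for p in quorum:                  # install d' with version max_n + 1
        replicas[p].value = new_value
        replicas[p].version = max_n + 1

# Hypothetical example: three unit-weight replicas, R = W = 2.
replicas = {p: Replica(vote=1) for p in "ABC"}
quorum_write(replicas, W=2, new_value=42)
assert quorum_read(replicas, R=2) == 42
```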

  11. Data replication
  Replicated Servers: clients see logical servers; the underlying system maps each operation on a logical server to operations on multiple copies.
  [Figure: clients send logical requests to a logical server S and receive replies; the system maps each logical request onto the replicas of S]

  12. Data replication
  Replicated Servers: in the context of replicated data, one might consider that the system consists of servers and clients.
  - Servers: processors holding a copy of the data item
  - Clients: processors requesting operations on the data item
  Some approaches for replicating servers: active replication, primary site approach.

  13. Data replication
  Replicated Servers: active replication
  A copy of S is at every processor p in P(S).
  - All the replicas are simultaneously active and all replicas are equivalent
  - When a client C requests a service from S, C contacts any one of the replicas Sp, p in P(S); Sp acts as the coordinator for the transaction
  - To be fault tolerant, the client must contact all the replicas (plus other restrictions, e.g. the same set of requests and the same order of requests at all replicas)
  - In general, suitable for processor failures

  14. Data replication
  Replicated Servers: primary site approach
  - One replica is the primary copy and acts as coordinator for all transactions; all other replicas are backups (passive in general)
  - When a client C requests a service from S, C contacts the primary copy
  - If the primary fails, a new primary is elected
  - If a network partition occurs, only the partition containing the primary can be serviced

  15. Data replication
  Replicated Servers: primary site approach (cont.)
  - Read operation: if the requested operation is a read, the primary performs the operation and sends the result to the requester
  - Write operation: if the requested operation is a write, the primary makes sure that all the backups hold the most recent value of the data item
  - The primary might periodically checkpoint the state of the data item on the backups to reduce the computation overhead at the backups
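A minimal sketch of the primary site approach under these assumptions (the class names PrimaryServer and Backup are hypothetical, and the network is replaced by direct method calls): the primary answers reads locally and pushes every write to all backups before acknowledging.

```python
# Minimal sketch of the primary site approach: reads are served by the
# primary, writes are propagated to every backup before the reply.

class Backup:
    def __init__(self):
        self.value = None
    def apply_write(self, value):      # invoked by the primary only
        self.value = value

class PrimaryServer:
    def __init__(self, backups):
        self.value = None
        self.backups = backups
    def read(self):
        return self.value              # the primary answers reads directly
    def write(self, value):
        self.value = value
        for b in self.backups:         # keep every backup up to date
            b.apply_write(value)
        return "ok"

backups = [Backup(), Backup()]
primary = PrimaryServer(backups)
primary.write(7)
assert primary.read() == 7 and all(b.value == 7 for b in backups)
```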

  16. Data replication
  Dynamic Quorum Changes
  The quorum-based protocol we have seen is static:
  - A single processor failure can make a data item unavailable
  - If the system breaks into small groups, it may happen that no group can perform a write operation
  The dynamic quorum change algorithm avoids this (within certain limits). Idea: for a data item d, quorums are defined on the set of alive replicas of d. This introduces the notion of a view; each transaction executes in a single view, and views are changed sequentially.

  17. Data replication
  Dynamic Quorum Changes (cont.)
  - d: data item; P(d): processors at which a copy of d is stored; some processors in P(d) can fail
  - View: a view of d consists of the alive processors of P(d), denoted AR(d); a read quorum defined on AR(d); a write quorum defined on AR(d); and a unique name v(d) (view names are assumed to be totally ordered)

  18. Data replication
  Dynamic Quorum Changes (cont.)
  For a transaction Ti, v(Ti) denotes the view in which Ti executes. The idea behind view-based quorums is to ensure that if v(Ti) < v(Tj), then Ti comes before Tj in an equivalent serial execution.
  Problem: ensure serializability within a view and serializability between views. New conditions on quorums are necessary to satisfy these requirements.

  19. Data replication
  Dynamic Quorum Changes: new conditions for quorums
  - d: data item; v: a view of d
  - P(d,v): alive processors that store d in view v, with |P(d,v)| = n(d,v)
  - R(d,v): read threshold for d in view v
  - W(d,v): write threshold for d in view v
  - Ar(d): read accessibility threshold for d in all views (availability: d can be read in a view v as long as there are at least Ar(d) alive processors in view v)
  - Aw(d): write accessibility threshold for d in all views (availability)

  20. Data replication
  Dynamic Quorum Changes: new conditions for quorums (cont.)
  The thresholds must satisfy the following conditions:
  DQC1. R(d,v) + W(d,v) > n(d,v) /* within a view, read and write quorums intersect */
  DQC2. 2*W(d,v) > n(d,v) /* within a view, write quorums intersect; the nodes participating in an update form a majority of the view */
  DQC3. Ar(d) + Aw(d) > |P(d)| /* read accessibility and write accessibility intersect across all views */
  DQC4. Aw(d) ≤ W(d,v) ≤ n(d,v) /* ensures consistency of views (we’ll see later) */
  DQC5. 1 ≤ R(d,v) ≤ n(d,v) /* the minimum size of a read quorum is 1 */
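The five conditions can be checked mechanically for one view. Below is a minimal sketch; the parameter names mirror the slide's notation, but the numeric values in the asserts are a hypothetical configuration chosen only to exercise the checks, not the example used later in the slides.

```python
# Minimal sketch checking DQC1-DQC5 for one view of a data item d.

def valid_view_quorums(n_view, R_view, W_view, Ar, Aw, n_total):
    """n_view = n(d,v), R_view = R(d,v), W_view = W(d,v),
    Ar = Ar(d), Aw = Aw(d), n_total = |P(d)|."""
    dqc1 = R_view + W_view > n_view        # read/write quorums intersect in v
    dqc2 = 2 * W_view > n_view             # two write quorums intersect in v
    dqc3 = Ar + Aw > n_total               # accessibility thresholds intersect
    dqc4 = Aw <= W_view <= n_view          # keeps view changes consistent
    dqc5 = 1 <= R_view <= n_view
    return all((dqc1, dqc2, dqc3, dqc4, dqc5))

# Hypothetical configuration: |P(d)| = 5, Ar(d) = Aw(d) = 3, majority quorums.
assert valid_view_quorums(n_view=5, R_view=3, W_view=3, Ar=3, Aw=3, n_total=5)
# A read threshold of 1 with a write threshold of 3 violates DQC1 here.
assert not valid_view_quorums(n_view=5, R_view=1, W_view=3, Ar=3, Aw=3, n_total=5)
```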

  21. Data replication
  Dynamic Quorum Changes: restrictions on read and write operations
  A data item d can be
  - read in view v only if n(d,v) ≥ Ar(d), i.e. the number of alive replicas of d must be at least the read accessibility threshold
  - written in view v only if n(d,v) ≥ Aw(d), i.e. the number of alive replicas of d must be at least the write accessibility threshold
  These restrictions are imposed to ensure consistent changes of quorums.

  22. Data replication
  Dynamic Quorum Changes: how read and write operations work
  Similar to the static quorum-based protocol, except that:
  - Only processors in P(d,v) are contacted for votes (hence, for constructing the quorum P')
  - The version number of each replica becomes a pair (view_number, in_view_sequence_number)
  - If a processor p receives a request from a transaction Ti and v(Ti) is not the view p has for d, then p rejects the request
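A tiny sketch of the two bookkeeping rules just listed (the function names are illustrative): version numbers become (view_number, in_view_sequence_number) pairs, which compare lexicographically, and a replica rejects a request tagged with a view other than its current one.

```python
# Minimal sketch of view-tagged version numbers and view-mismatch rejection.

def newer(version_a, version_b):
    """Compare (view_number, in_view_sequence_number) pairs lexicographically."""
    return version_a > version_b

def handle_request(current_view, request_view):
    """A replica only serves requests issued in the view it currently holds."""
    return "serve" if request_view == current_view else "reject"

assert newer((2, 1), (1, 9))   # a later view dominates any in-view sequence number
assert handle_request(current_view=2, request_view=1) == "reject"
```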

  23. Data replication
  Dynamic Quorum Changes: installation of a new view
  We have claimed that views are changed sequentially. How is this achieved?
  A processor p in P(d) can initiate an update of the view for d because it is recovering, because a member of the view has failed, or because its version number for d is not current.

  24. Data replication
  Dynamic Quorum Changes: installation of a new view (cont.)
  The idea, assuming processor p is the one that wants to change the view:
  1. p determines whether the view it belongs to (the set of nodes with which p can communicate) satisfies the conditions for quorums (n(d,v) ≥ Ar(d) and n(d,v) ≥ Aw(d), ...). If this is not the case, p cannot change the view.
  2. p reads all copies of d in P(d,v)
  3. p takes the new copy from a replica with the highest version number
  4. p increments the view number
  5. p broadcasts the latest version to all members of P(d,v)
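A minimal single-process sketch of these five steps (all names are illustrative; a real implementation would need messaging and locking). The "reachable" set stands in for the processors the initiator can currently communicate with.

```python
# Minimal sketch of the view-installation steps.

def install_new_view(replicas, reachable, Ar, Aw):
    """replicas: {processor: {"view": int, "version": (view, seq), "value": ...}}
    reachable: processors the initiator can currently communicate with."""
    n = len(reachable)
    # Step 1: the prospective view must be both readable and writable.
    if n < Ar or n < Aw:
        return None                                   # cannot change the view
    # Steps 2-3: read all reachable copies, keep the one with the highest version.
    newest = max(reachable, key=lambda p: replicas[p]["version"])
    latest_value = replicas[newest]["value"]
    # Step 4: the new view gets the next view number.
    new_view = max(replicas[p]["view"] for p in reachable) + 1
    # Step 5: push the latest value and the new view to every member.
    for p in reachable:
        replicas[p].update(view=new_view,
                           version=(new_view, 0),
                           value=latest_value)
    return new_view

# Hypothetical run: D holds the most recent copy, E is unreachable.
replicas = {p: {"view": 0, "version": (0, 1), "value": "x0"} for p in "ABCDE"}
replicas["D"]["version"], replicas["D"]["value"] = (0, 2), "x1"
assert install_new_view(replicas, reachable={"A", "B", "C", "D"}, Ar=2, Aw=3) == 1
assert replicas["A"]["value"] == "x1"
```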

  25. Data replication
  Dynamic Quorum Changes: installation of a new view (cont.)
  Let v be the old view and v' the view after the change. We have
  - W(d,v) ≥ Aw(d)
  - n(d,v') ≥ Ar(d)
  - Ar(d) + Aw(d) > |P(d)|
  which together imply W(d,v) + n(d,v') > |P(d)|. That is, any write quorum of view v overlaps the membership of view v' when changing to v': a “consistent” change of view.

  26. Data replication
  Dynamic Quorum Changes: view changes handle network partitions
  Assume a data item d is replicated at five processors A, B, C, D, E, with Ar(d)=2 and Aw(d)=3.
  Initial view 0: P(d,0) = {A,B,C,D,E}, W(d,0)=5, R(d,0)=1.
  Assume that the system partitions as A B C D || E (node E cannot communicate with the others).
  If an update request for d arrives at any processor while the view has not been updated, the operation cannot be performed.

  27. Data replication
  View changes handle network partitions (cont.)
  Let view 1 be P(d,1) = {A,B,C,D}, W(d,1)=4, R(d,1)=1. In this view, partition {E} can still read d but cannot update it, while partition {A,B,C,D} can both read and write d.
  Now assume that D fails, giving partitions {E} and {A,B,C}. d can be read in both partitions, but to enable write operations the view must be updated again, e.g. P(d,2) = {A,B,C}, W(d,2)=3, R(d,2)=1.

  28. Data replication
  View change illustrated
  [Figure: timeline of transactions T1-T4 reading (r) and writing (w) d at replicas A-E across successive views: view 0 (W(d,0)=5, R(d,0)=1), view 1 (W(d,1)=4, R(d,1)=1), and view 2]
  The quorum-based algorithm serializes T2 before T3. Is there any notion of majority behind view changes?

  29. Outline
  Serializability, Quorum-based protocol, Replicated Servers, Dynamic Quorum Changes, One-copy serializability, View-based quorums, Atomic Multicast, Virtual synchrony, Ordered Multicast, Reliable and Causal Multicast, Trans Algorithm, Group Membership, Transis Algorithm, Update Propagation

  30. Atomic Multicast
  In many situations a one-to-many form of communication is useful, e.g. for maintaining replicated servers. Two forms of one-to-many communication are possible: broadcast and multicast.
  - Broadcast: the sender sends to all the nodes in the system
  - Multicast: the sender sends to a subset L of the nodes in the system
  We are interested in multicast, and we assume the sender sends a message m.

  31. Atomic Multicast
  A naïve algorithm for multicast: for each processor p in L, send m to p.
  Problem: the sender may fail after it has sent m to only some of the processors, so some members of L receive m while others do not. This is not acceptable in fault-tolerant systems.
  Multicast must be reliable: if one processor in L receives m, every alive processor in L must receive m.

  32. Atomic Multicast
  The naïve algorithm for multicast + the 2PC technique
  The 2PC technique can improve the reliability of the naïve algorithm. Idea:
  - Regard a multicast as a transaction (“all-or-nothing” property)
  - Distinguish the delivery of a message from the reception of that message
  [Figure: a message m is first received by a processor and only later delivered to the application (APP)]

  33. Atomic Multicast
  The naïve algorithm for multicast + the 2PC technique (cont.)
  Rule for delivery: deliver a message only when you know that the message will be delivered everywhere.
  Algorithm for the sender:
  1. Send m to every processor in L
  2. When you have received all acknowledgements, deliver m locally and tell every processor in L that it can deliver m
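A minimal sketch of this 2PC-flavoured multicast, with the network replaced by direct method calls (the class and function names are illustrative): reception (buffering m) is kept separate from delivery (handing m to the application), and delivery happens only after every acknowledgement has arrived.

```python
# Minimal sketch of the "multicast as a transaction" sender algorithm.

class Member:
    def __init__(self, name):
        self.name = name
        self.received = []                 # messages received but not delivered
        self.delivered = []                # messages handed to the application
    def receive(self, m):
        self.received.append(m)
        return "ack"                       # acknowledgement back to the sender
    def deliver(self, m):
        self.received.remove(m)
        self.delivered.append(m)

def multicast_2pc(sender, others, m):
    sender.receive(m)                      # local reception at the sender
    # Phase 1: send m to every other processor in L and collect acknowledgements.
    acks = [p.receive(m) for p in others]
    if len(acks) == len(others):
        # Phase 2: only now is delivery safe everywhere: deliver locally,
        # then tell every processor in L that it can deliver m.
        sender.deliver(m)
        for p in others:
            p.deliver(m)

p, q, r = Member("p"), Member("q"), Member("r")
multicast_2pc(p, [q, r], "m1")
assert all("m1" in x.delivered for x in (p, q, r))
```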

  34. Atomic Multicast
  The naïve algorithm for multicast + the 2PC technique (cont.)
  This technique may require a significant amount of work when a failed processor recovers. In addition, it inherits the vulnerability of 2PC.
  The main difficulty comes from the correctness criterion: how can a processor determine which nodes in L are up? Virtual synchrony takes this into account.

  35. Atomic Multicast
  Virtual synchrony accounts for the fact that it is difficult to determine exactly which processors are non-failed.
  - Processors are organized into groups that cooperate to perform a reliable multicast; each group corresponds to a multicast list (multicast within a group)
  - Group view: the current list of processors to receive a multicast message (plus some global properties)
  - Consistency of group views: a common view of the membership

  36. Atomic Multicast
  Virtual synchrony: an algorithm for performing reliable multicast is virtually synchronous if:
  1. In any consistent state, there is a unique group view on which all members of the group agree
  2. If a message m is multicast in group view v before a view change c, then either:
  2.1. No processor in v that participates in c ever receives m, or
  2.2. All processors in v that participate in c receive m before performing c

  37. Atomic Multicast
  Virtual synchrony illustrated
  [Figure: two executions of a view change c in which {A,B,C} participate. Case 2.1: none of A, B, C receives m before c (each has {}). Case 2.2: all of A, B, C receive m before c (each has {m}).]

  38. Atomic Multicast
  Virtual synchrony: view changes can be considered as checkpoints.
  Delivery list in virtual synchrony: between two consecutive “checkpoints” v and v', a set G of messages is multicast, and the sender of a message in G must be in v.
  Hence, if p is removed from the view, the remaining processors can consider that p has failed: there is a guarantee that from v' onward, no message from p will be delivered.

  39. Atomic Multicast
  Ordered multicasts: one might want a multicast that satisfies a specific order, e.g. causal order or total order.
  - Causal order (for causal multicast): if a processor p receives m1 and then multicasts m2, then every processor that receives both m1 and m2 should receive m1 before m2
  - Total order (for atomic multicast): if p receives m1 before m2, then every processor that receives both m1 and m2 should receive m1 before m2 (i.e. the same order of reception everywhere)

  40. Atomic Multicast
  Why causal multicast? Assume that the data item x is replicated and consider the following scenario:
  [Figure: p multicasts m1 (“set x to zero”); q receives m1 and then multicasts m2 (“increment x by 1”); r receives m2 before m1]
  m1 must be delivered before m2 everywhere; otherwise the replicas become inconsistent.

  41. Atomic Multicast
  Why total order for multicast? Assume that p sends m and then crashes. By some mechanism, q and r are informed about the crash of p, but q receives m before crash(p), while r receives crash(p) before m.
  [Figure: p multicasts m and then crashes; q observes m, then crash(p); r observes crash(p), then m]
  Total order is necessary; otherwise q and r might take different decisions.

  42. Atomic Multicast
  Why total order for multicast (cont.)? Assume that the data item x is a replicated queue and consider the following scenario:
  [Figure: m1 (“insert a into x”) and m2 (“delete a from x”) are multicast; the receivers see them in different orders]
  m1 and m2 must be delivered in the same order everywhere; otherwise the replicas of the queue become inconsistent.

  43. Atomic Multicast
  The Trans algorithm executes between two view changes, exploiting the guarantee provided by virtual synchrony; hence the algorithm works within one view.
  Mechanisms:
  - A combination of positive and negative acknowledgements for reliability
  - Piggybacking acknowledgements on the messages being multicast, which simplifies the detection of missed messages and minimizes the need for explicit acknowledgements

  44. Atomic Multicast
  Thanks to the piggybacked positive and negative acknowledgements, when a processor p receives a multicast message, p learns:
  - which messages it does not need to acknowledge
  - which messages it has missed and must request to be retransmitted

  45. Atomic Multicast
  The idea behind Trans is illustrated by the following scenario. Let L = [P, Q, R] be a delivery list for multicast.
  Step 1: P multicasts m1.
  Step 2: Q receives m1 and piggybacks a positive acknowledgement on the next message m2 that it multicasts (we write m2:ack(m1) to mean that m2 contains an ack for m1).
  Step 3: R receives m2 (i.e. m2:ack(m1)). Two cases arise for R upon receipt of m2:
  - Case 1: if R had received m1, it realizes that it does not need to send an acknowledgement for it, since Q has already acknowledged it
  - Case 2: if R had not received m1, then R learns (because of the ack(m1) attached to m2) that m1 is missing, and R requests a retransmission of m1 by attaching a negative acknowledgement for m1 to the next message it multicasts

  46. Atomic Multicast
  Trans: an invariant
  The protocol maintains the following invariant: a processor p multicasts an acknowledgement of a message m only if p has received m and all messages that causally precede m.
  Consequently, acknowledging m implicitly acknowledges the still-unacknowledged messages that causally precede m, so they need not be acknowledged separately.

  47. Atomic Multicast
  Trans: stable messages
  A message is said to be stable if it has reached all the processors in the group view. This is detectable because each receiver of a message multicasts an acknowledgement.
  Some assumptions:
  - All messages are uniquely identified by (processor_id, message_seq_number)
  - Each sender numbers its messages sequentially
  - A virtual synchrony layer is assumed

  48. Atomic Multicast
  Trans: variables used. Each processor maintains:
  - ack_list: the list of identifiers of messages for which this node has to send a positive acknowledgement
  - nack_list: the list of identifiers of messages for which this node has to send a negative acknowledgement
  - G: the causal DAG; it contains all messages that the processor has received but that are not yet stable, and (m, m') is an edge of G if message m acknowledges message m'
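A minimal sketch of this per-processor bookkeeping and of what happens when a message with piggybacked acks arrives: an ack either cancels a pending acknowledgement or reveals a missing message. Networking, retransmission, and delivery are omitted, and all names (TransProcessor, on_receive, next_multicast_acks) are illustrative.

```python
# Minimal sketch of the Trans bookkeeping on message receipt.

class TransProcessor:
    def __init__(self, name):
        self.name = name
        self.received = set()     # ids of messages received so far
        self.ack_list = set()     # ids this processor still has to ack
        self.nack_list = set()    # ids this processor must ask to be resent
        self.dag = set()          # causal DAG G as (m_id, acked_id) edges

    def on_receive(self, m_id, piggybacked_acks):
        """m_id = (processor_id, seq_number); piggybacked_acks = ids m acknowledges."""
        self.received.add(m_id)
        self.ack_list.add(m_id)                  # we owe an ack for m itself
        for acked in piggybacked_acks:
            self.dag.add((m_id, acked))
            # Case 1: someone else already acknowledged it, so we need not.
            self.ack_list.discard(acked)
            # Case 2: we never received it: the piggybacked ack reveals the gap.
            if acked not in self.received:
                self.nack_list.add(acked)

    def next_multicast_acks(self):
        """Acks and nacks to piggyback on the next message this processor sends
        (nacks stay pending until the retransmission actually arrives)."""
        acks, nacks = set(self.ack_list), set(self.nack_list)
        self.ack_list.clear()
        return acks, nacks

# The P, Q, R scenario: R never saw m1 and then receives m2:ack(m1) from Q.
r = TransProcessor("R")
r.on_receive(("Q", 1), piggybacked_acks={("P", 1)})   # m2 acknowledging m1
assert ("P", 1) in r.nack_list                        # R must request a retransmission
```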

  49. Atomic Multicast
  Trans: retransmission
  Using the information in the local DAG, a processor can determine which messages it should have received. For each such message, a negative acknowledgement is multicast to request a retransmission.

  50. Atomic Multicast
  Trans: variables (cont.)
  - m: message container (serves as the id of the message here)
  - m.message: application-level message (to be delivered to the application)
  - m.nacks: list of negative acknowledgements
  - m.acks: list of positive acknowledgements
  - L: destination list (maintained by an underlying algorithm)
