Explore the complexities of time in distributed systems, including external, internal, and logical notions of time. Topics covered include logical clocks, global clocks, clock synchronization, causality relationships, and vector timestamps.
 
                
Logical Clocks Ken Birman
Time: A major issue in distributed systems • We tend to casually use temporal concepts • Example: “p suspects that q has failed” • Implies a notion of time: first q was believed correct, later q is suspected faulty • Challenge: relating local notion of time in a single process to a global notion of time • Discuss this issue before developing practical tools for dealing with other aspects, such as system state
Time in Distributed Systems • Three notions of time: • Time seen by external observer. A global clock of perfect accuracy • Time seen on clocks of individual processes. Each has its own clock, and clocks may drift out of sync. • Logical notion of time: event a occurs before event b and this is detectable because information about a may have reached b.
External Time • The “gold standard” against which many protocols are defined • Not implementable: no system can avoid uncertain details that limit temporal precision! • Use of external time is also risky: many protocols that seek to provide properties defined by external observers are extremely costly and, sometimes, are unable to cope with failures
Time seen on internal clocks • Most workstations have reasonable clocks • Clock synchronization is the big problem (will visit topic later in course): clocks can drift apart and resynchronization, in software, is inaccurate • Unpredictable speeds a feature of all computing systems, hence can’t predict how long events will take (e.g. how long it will take to send a message and be sure it was delivered to the destination)
Logical notion of time • Has no clock in the sense of “real-time” • Focus is on definition of the “happens before” relationship: “a happens before b” if: • both occur at same place and a finished before b started, or • a is the send of message m, b is the delivery of m, or • a and b are linked by a chain of such events
Logical time as a time-space picture • [Figure: time-space diagram of processes p0, p1, p2, p3 with events a, b, c, d] • a, b are concurrent • c happens after a, b • d happens after a, b, c
Notation • Use “arrow” to represent happens-before relation • For previous slide: • a → c, b → c, c → d • hence, a → d, b → d • a, b are concurrent • Also called the “potential causality” relation
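As a rough illustration of this notation, the sketch below derives the full happens-before relation from the direct edges of the previous slide by taking a transitive closure. The function names `happens_before` and `concurrent` are invented for this example and do not come from the lecture.

```python
def happens_before(direct_edges):
    """Return the transitive closure of a set of direct (x, y) edges."""
    closure = set(direct_edges)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                # Chain x -> y and y -> z into x -> z.
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure


def concurrent(x, y, hb):
    """Two distinct events are concurrent if neither happens before the other."""
    return x != y and (x, y) not in hb and (y, x) not in hb


# Direct edges from the time-space picture: a -> c, b -> c, c -> d.
hb = happens_before({("a", "c"), ("b", "c"), ("c", "d")})
print(sorted(hb))
print(concurrent("a", "b", hb))
```

The closure adds a → d and b → d, and a, b come out as concurrent because no chain links them in either direction.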
Logical clocks • Proposed by Lamport to represent causal order • Write: LT(e) to denote logical timestamp of an event e, LT(m) for a timestamp on a message, LT(p) for the timestamp associated with process p • Algorithm ensures that if a → b, then LT(a) < LT(b)
Algorithm • Each process maintains a counter, LT(p) • For each event other than message delivery: set LT(p) = LT(p)+1 • When sending message m, set LT(m) = LT(p) • When delivering message m to process q, set LT(q) = max(LT(m), LT(q))+1
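The three rules above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the class name `LamportProcess` and its method names are hypothetical, and it assumes that a send counts as an ordinary event, so the counter ticks before the message is stamped.

```python
class LamportProcess:
    """One process with a logical clock LT(p), initially 0."""

    def __init__(self, name):
        self.name = name
        self.lt = 0

    def local_event(self):
        # Any event other than a delivery: LT(p) = LT(p) + 1.
        self.lt += 1
        return self.lt

    def send(self):
        # Assumption: the send is itself an event, so tick first,
        # then stamp the message with LT(m) = LT(p).
        self.lt += 1
        return self.lt

    def deliver(self, lt_m):
        # Delivery rule: LT(q) = max(LT(m), LT(q)) + 1.
        self.lt = max(lt_m, self.lt) + 1
        return self.lt


p, q = LamportProcess("p"), LamportProcess("q")
p.local_event()          # LT(p) becomes 1
lt_m = p.send()          # LT(p) becomes 2, and LT(m) = 2
print(q.deliver(lt_m))   # q jumps from 0 to max(2, 0) + 1
```

The delivery at q receives a timestamp larger than the send at p, which is the property LT(a) < LT(b) for a message edge.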
Illustration of logical timestamps • [Figure: time-space diagram of processes p0, p1, p2, p3 with each event labeled by its logical timestamp, ranging from 0 to 7]
Concurrent events • If a, b are concurrent, LT(a) and LT(b) may have arbitrary values! • Thus, logical time lets us determine that a potentially happened before b, but not that a definitely did so! • Example: processes p and q never communicate. Both will have events 1, 2, ... but even if LT(e)<LT(e’) e may not have happened before e’
Vector timestamps • Extend logical timestamps into a list of counters, one per process in the system • Again, each process keeps its own copy • Event e occurs at process p: p increments VT(p)[p] (p’th entry in its own vector clock) • q receives a message m from p: q sets VT(q)=max(VT(q),VT(m)) (element-by-element), where VT(m) is the value of VT(p) when m was sent
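A minimal sketch of these update rules follows. The class `VectorProcess` and its methods are hypothetical names. One assumption goes beyond the slide: the receipt of a message is treated as an event at the receiver, so the receiver increments its own entry after taking the element-by-element maximum.

```python
class VectorProcess:
    """Process number `index` in a fixed group of `n` processes."""

    def __init__(self, index, n):
        self.index = index
        self.vt = [0] * n

    def event(self):
        # An event at this process increments its own entry.
        self.vt[self.index] += 1
        return list(self.vt)

    def send(self):
        # A send is an event; the message carries a copy of the vector.
        return self.event()

    def receive(self, vt_m):
        # Element-by-element maximum of the local and message vectors.
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_m)]
        # Assumption: the receive is also an event at this process.
        self.vt[self.index] += 1
        return list(self.vt)


p0, p1 = VectorProcess(0, 4), VectorProcess(1, 4)
p0.event()               # [1, 0, 0, 0]
vt_m = p0.send()         # [2, 0, 0, 0]
print(p1.receive(vt_m))  # [2, 1, 0, 0]
```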
Illustration of vector timestamps • [Figure: time-space diagram of processes p0, p1, p2, p3 with events labeled by vector timestamps [1,0,0,0], [2,0,0,0], [2,1,1,0], [2,2,1,0], [0,0,1,0], [0,0,0,1]]
Vector timestamps accurately represent happens-before relation • Define VT(e)<VT(e’) if, • for all i, VT(e)[i]≤VT(e’)[i], and • for some j, VT(e)[j]<VT(e’)[j] • Example: if VT(e)=[2,1,1,0] and VT(e’)=[2,3,1,0] then VT(e)<VT(e’) • Notice that not all VT’s are “comparable” under this rule: consider [4,0,0,0] and [0,0,0,4]
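The comparison rule is easy to get wrong, so here is a small sketch of it. It assumes the first condition is "less than or equal in every entry" and the second is "strictly less in at least one entry". The helper names `vt_less` and `comparable` are invented for illustration.

```python
def vt_less(a, b):
    """True if a <= b in every entry and a < b in at least one entry."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def comparable(a, b):
    """True if the two vectors are equal or ordered one way or the other."""
    return a == b or vt_less(a, b) or vt_less(b, a)


print(vt_less([2, 1, 1, 0], [2, 3, 1, 0]))     # True
print(comparable([4, 0, 0, 0], [0, 0, 0, 4]))  # False
```

Vectors that are not comparable correspond to events with no causal chain between them in either direction.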
Vector timestamps accurately represent happens-before relation • Now can show that VT(e)<VT(e’) if and only if e → e’: • If e → e’, then there exists a chain e0 → e1 → ... → en on which vector timestamps increase “hop by hop” • If VT(e)<VT(e’), it suffices to look at VT(e’)[proc(e)], where proc(e) is the place that e occurred. By definition, we know that VT(e’)[proc(e)] is at least as large as VT(e)[proc(e)], and by construction, this implies a chain of events from e to e’
Examples of VT’s and happens-before • Example: suppose that VT(e)=[2,1,0,1] and VT(e’)=[2,3,0,1], so VT(e)<VT(e’) • How did e’ “learn” about the 3 and the 1? • Either these events occurred at the same place as e’, or • Some chain of send/receive events carried the values! • If VT’s are not comparable, the corresponding events are concurrent
Notice that vector timestamps require a static notion of system membership • For vector to make sense, must agree on the number of entries • Later will see that vector timestamps are useful within groups of processes • Will also find ways to compress them and to deal with dynamic group membership changes
What about “real-time” clocks? • Accuracy of clock synchronization is ultimately limited by uncertainty in communication latencies • These latencies are “large” compared with speed of modern processors (typical latency may be 35 µs to 500 µs, time for thousands of instructions) • Limits use of real-time clocks to “coarse-grained” applications
Interpretations of temporal terms • Understand now that “a happens before b” means that information can flow from a to b • Understand that “a is concurrent with b” means that there is no information flow between a and b • What about the notion of an “instant in time”, over a set of processes?
Neither clock is appropriate • Problem is that with both clocks, there can be many events that are concurrent with a given event • Leads to a philosophical question: • Event e has happened at process p • Which events are “really” simultaneous with e?
Perspectives on logical time • One view is based on intuition from physics • Imagine a time-space diagram • Cones of causality define past and future • “Now” is any consistent cut across the system, including no future events and no past events • Next Tuesday will see algorithms based on this
Causal notions of past, future • [Figure: time-space diagram of processes p0, p1, p2, p3 with events a, b, c, d, e, f, g]
Causal notions of past, future • [Figure: the same diagram with the regions labeled PAST and FUTURE]
Issues raised by time • Time is a tool • Typical uses of time? • To put events into some sort of order • Example: the order of updates on a replicated data item • With one item, logical time may make sense • With multiple items, consider VT with one element per item
Ways to extend time to a total order • Often extend a logical timestamp or vector timestamp with actual clock time when the event occurred and process id where it occurred • Combination breaks any possible ties • Or can use event “names”
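As a small illustration of the tie-breaking idea, the sketch below sorts events by logical timestamp, then by clock time, then by process id. The record type `Event`, its field names, and the sample values are all made up for this example.

```python
from typing import NamedTuple


class Event(NamedTuple):
    lt: int       # logical timestamp
    wall: float   # local clock reading when the event occurred
    pid: str      # id of the process where it occurred
    name: str     # label used only for printing


events = [
    Event(lt=2, wall=10.5, pid="p", name="e3"),
    Event(lt=1, wall=10.2, pid="q", name="e2"),
    Event(lt=1, wall=10.2, pid="p", name="e1"),
]

# The process id is unique per process, so the combined key never ties
# for two events at different processes.
ordered = sorted(events, key=lambda e: (e.lt, e.wall, e.pid))
print([e.name for e in ordered])  # ['e1', 'e2', 'e3']
```

The result is a total order that still respects the logical timestamps, because those are compared first.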
An example • Suppose we are broadcasting messages • Atomic broadcast is • Fault-tolerant: unless every process with a copy fails, the message is delivered everywhere (often expressed as all or nothing delivery) • Ordered: if p, q both receive m, n, either both receive m before n, or both receive n before m • How should we implement this policy?
Easy case • In many systems there is really just one source of broadcasts • Typically we see this pattern when there is really one reference copy of a replicated object and the replicas are viewed as cached copies • Accordingly we can use a FIFO ordered broadcast and reduce the problem to fault-tolerance • FIFO ordering simply requires a counter from sender
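The single-sender case can be sketched as follows. This only illustrates the FIFO ordering part, not fault-tolerance. It assumes the sender numbers its broadcasts 0, 1, 2, ...; the class name `FifoReceiver` is hypothetical.

```python
class FifoReceiver:
    """Delivers one sender's messages in the order of its counter."""

    def __init__(self):
        self.next_seq = 0    # counter value expected next
        self.pending = {}    # messages that arrived early: seq -> payload
        self.delivered = []

    def receive(self, seq, payload):
        if seq < self.next_seq:
            return  # duplicate of a message that was already delivered
        self.pending[seq] = payload
        # Deliver as long as the next expected message is available.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


r = FifoReceiver()
r.receive(1, "second")  # arrives early, so it is held back
r.receive(0, "first")   # fills the gap, so both are delivered
print(r.delivered)      # ['first', 'second']
```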
A more complex example • Sender-ordered multicast • Sender places a timestamp in the broadcast • Receiver waits until it has full set of messages • Orders them by logical timestamp, breaks ties with sender-id • Then delivers in this order • How can it tell when it has the “full set”?
A more complex example • [Figure: messages m and n are multicast concurrently] • Deliver m,n or n,m?
A more complex example • Solution implicitly depends upon membership • In fact, most distributed systems depend upon membership • Membership is “the most fundamental” idea in many systems for this reason • Receiver can simply wait until all members have sent one message • System ends up running in rounds, where each member contributes zero or one messages per round • Use a “null” message if you have nothing to send
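A rough sketch of the round-based scheme follows. It assumes a fixed, known, non-empty membership and FIFO channels from each sender, so the k-th message received from a sender belongs to round k. A payload of `None` stands for the "null" message. The class name `RoundReceiver` is hypothetical.

```python
from collections import deque


class RoundReceiver:
    """Delivers one message per member per round, ordered by (LT, sender id)."""

    def __init__(self, members):
        if not members:
            raise ValueError("membership must be non-empty")
        self.queues = {m: deque() for m in members}
        self.delivered = []

    def receive(self, sender, lt, payload=None):
        self.queues[sender].append((lt, sender, payload))
        # A round is complete once every member has contributed a message.
        while all(self.queues.values()):
            batch = sorted((q.popleft() for q in self.queues.values()),
                           key=lambda t: (t[0], t[1]))
            # Null messages only complete the round; they are not delivered.
            self.delivered.extend(p for _, _, p in batch if p is not None)


r = RoundReceiver(["p", "q", "r"])
r.receive("q", 1, "n")
r.receive("p", 1, "m")
r.receive("r", 1)       # null message from r completes the round
print(r.delivered)      # ['m', 'n']
```

Here m and n carry the same logical timestamp, so the tie is broken by sender id and every receiver that runs the same rule delivers m before n.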
Optimizations • We could agree in advance on “permission to send” • Now, perhaps only p, q have permission • We treat their messages in rounds but others must get permission before sending • Avoids all the null messages and ensures fairness if p, q send at same rate • Dolev: explored extensions for varied rates, gets quite elaborate…
Optimizations • In the limit, we end up with a token scheme • While holding the token, p has permission to send • If q requests the token p must release it (perhaps after a small delay) • Token carries the sequence number to use
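The token idea can be sketched as below, under simplifying assumptions: there are no failures, the token is handed over by a direct call, and sequence numbers start at 1. The class names `Token` and `Member` are invented for illustration.

```python
class Token:
    """The token carries the next sequence number to use."""

    def __init__(self):
        self.next_seq = 1


class Member:
    def __init__(self, name):
        self.name = name
        self.token = None  # None means this member has no permission to send

    def broadcast(self, payload):
        if self.token is None:
            raise RuntimeError(f"{self.name} must hold the token to send")
        seq = self.token.next_seq
        self.token.next_seq += 1
        return (seq, self.name, payload)

    def pass_token(self, other):
        if self.token is None:
            raise RuntimeError(f"{self.name} does not hold the token")
        other.token, self.token = self.token, None


p, q = Member("p"), Member("q")
p.token = Token()
print(p.broadcast("m"))  # (1, 'p', 'm')
p.pass_token(q)          # q requested the token, so p releases it
print(q.broadcast("n"))  # (2, 'q', 'n')
```

Because only the token holder assigns sequence numbers, receivers can deliver in sequence-number order without waiting for a full round.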
A more complex example • [Figure: token scheme in which m is sent with sequence number 1 and n with sequence number 2]
An example • Such solutions are expressed in many ways • With a ring: Chang and Maxemchuk; messages are like a “train” with new message tacked onto end and old ones delivered from front • Direct all-to-all broadcast • Like a token moving around the ring, but it carries the messages with it (inspired by FDDI) • Tree structured in various ways
More examples • Old Isis system uses logical clocks • Sender says “here is a message” • Receivers maintain logical clocks. Each proposes a delivery time • Sender gathers votes, picks maximum, says “commit delivery at time t” • Receivers deliver committed messages in timestamp order from front of a queue
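More examples • [Figure: messages m and n are multicast to p, q, r; each receiver proposes a delivery time tagged with its id: m:[1,p], n:[2,p] at p; n:[1,q], m:[2,q] at q; m:[1,r], n:[2,r] at r]
More examples • [Figure: the senders pick the maximum proposals and commit m:[2,q] and n:[2,r]]
More examples • [Figure: with both messages committed, p, q, and r each deliver m and then n]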
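The propose-and-commit scheme of the old Isis system can be sketched roughly as follows. This is a simplified illustration with no failure handling and no real network. The class name `IsisReceiver` and its methods are hypothetical, and timestamps are modeled as (clock, process id) pairs so that ties are broken by id.

```python
class IsisReceiver:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.queue = {}      # msg_id -> [timestamp, committed?]
        self.delivered = []

    def propose(self, msg_id):
        # Propose a delivery time from the local logical clock.
        self.clock += 1
        ts = (self.clock, self.pid)
        self.queue[msg_id] = [ts, False]
        return ts

    def commit(self, msg_id, ts):
        # Adopt the sender's chosen time and keep the local clock ahead of it.
        self.clock = max(self.clock, ts[0])
        self.queue[msg_id] = [ts, True]
        # Deliver from the front of the queue while the front is committed.
        while self.queue:
            front = min(self.queue, key=lambda m: self.queue[m][0])
            if not self.queue[front][1]:
                break
            del self.queue[front]
            self.delivered.append(front)


p, q, r = IsisReceiver("p"), IsisReceiver("q"), IsisReceiver("r")
# p and r see m first; q sees n first.
votes_m = [p.propose("m"), r.propose("m")]
votes_n = [q.propose("n")]
votes_m.append(q.propose("m"))
votes_n += [p.propose("n"), r.propose("n")]
final_m, final_n = max(votes_m), max(votes_n)  # (2, 'q') and (2, 'r')
for rcv in (p, r):
    rcv.commit("m", final_m)
    rcv.commit("n", final_n)
q.commit("n", final_n)  # commits can arrive in a different order at q
q.commit("m", final_m)
print(p.delivered, q.delivered, r.delivered)
```

A committed message is held back while an uncommitted message with a smaller proposed time sits in front of it, which is why all three receivers end up with the same order even though the commits arrive in different orders.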
More examples • Later versions of Isis used vector times • Membership is handled separately • Each message is assigned a vector time • Delivered in vector time order, with ties broken using process id of the sender
Totem and Transis • These systems represent time using partial order information • Message m arrives and includes ordering fields: • Deliver m after n and o • By transitivity, if n is after p, then m is after p • Break ties using process id number
Totem and Transis • [Figure: messages m, n, o, p linked by their “deliver after” ordering constraints]
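The hold-back idea behind explicit ordering fields can be sketched as below. This is not the actual Totem or Transis protocol: it only shows a receiver that delays a message until everything it must follow has been delivered, and that picks the lowest sender id among the messages that are ready at the same moment. The class name `PartialOrderReceiver` and the numeric sender ids are assumptions.

```python
class PartialOrderReceiver:
    def __init__(self):
        self.delivered = []
        self.waiting = {}  # msg_id -> (sender_id, set of predecessor ids)

    def receive(self, msg_id, sender_id, after=()):
        self.waiting[msg_id] = (sender_id, set(after))
        self._drain()

    def _drain(self):
        while True:
            done = set(self.delivered)
            # A message is ready once all of its predecessors are delivered.
            ready = [(sid, mid) for mid, (sid, preds) in self.waiting.items()
                     if preds <= done]
            if not ready:
                return
            _, mid = min(ready)  # tie broken by lowest sender id
            del self.waiting[mid]
            self.delivered.append(mid)


rcv = PartialOrderReceiver()
rcv.receive("m", 1, after=["n", "o"])  # held: needs n and o
rcv.receive("n", 2, after=["p"])       # held: needs p
rcv.receive("o", 3)                    # no constraints, delivered at once
rcv.receive("p", 4)                    # releases n, which then releases m
print(rcv.delivered)                   # ['o', 'p', 'n', 'm']
```

Note that m ends up after p even though m never names p directly, which is the transitivity point made on the slide.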
Things to notice • Time is just a programming tool • But membership and message atomicity are very fundamental • Waiting for m won’t work if m never arrives • And VT is only meaningful if we can agree on the meaning of the indices • With failures, these algorithms get surprisingly complicated: suppose p fails while sending m?
Major uses of time • To order updates on replicated data • To define versions of objects • To deal with processes that come and go in dynamic networked applications • Processes that joined earlier often have more complete knowledge of system state • Process that leaves and rejoins often needs some form of incrementing “incarnation number” • To prove correctness of complex protocols
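One possible reading of the "incarnation number" idea is sketched below: a process is identified by its name plus a counter that increases on every rejoin, so that messages from an earlier incarnation can be recognized as stale. The class name `Membership` and its methods are hypothetical, and a real system would need agreement among members on these values.

```python
class Membership:
    def __init__(self):
        self.incarnation = {}  # process name -> latest incarnation number

    def join(self, name):
        # Each (re)join gets a fresh, larger incarnation number.
        self.incarnation[name] = self.incarnation.get(name, 0) + 1
        return (name, self.incarnation[name])

    def is_current(self, process_id):
        name, number = process_id
        return self.incarnation.get(name) == number


group = Membership()
first = group.join("p")    # ('p', 1)
second = group.join("p")   # p left and rejoined: ('p', 2)
print(group.is_current(first), group.is_current(second))  # False True
```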