Efficient Dependency Tracking for Relevant Events in Shared Memory Systems

Anurag Agarwal (anurag@cs.utexas.edu), Vijay K. Garg (garg@ece.utexas.edu), PDS Lab, University of Texas at Austin

Outline • Motivation • Background • Chain Clock • Instances of Chain Clock • Experimental Results • Conclusion

Motivation • Dependencies between events are required to obtain global state information • Needed by applications such as monitoring and debugging • Vector clocks [Fidge 88, Mattern 89] • O(N) operations per timestamp for a system with N processes • Handle dynamic creation of processes poorly

Relevant Events • Events “useful” for the application • Predicate Detection • “There are no messages in the channel” [Figure: computation on processes p1 to p4]

Vector Clocks [Fidge 88, Mattern 89] • Assigns N-tuple (V) to every relevant event • e → f iff e.V < f.V (clock condition) • Process Pi : • V = (0, … , 0) • On an event e • If e is receive of message m: V = max (V, m.V) • If e is a relevant event: V[i] = V[i] + 1 • If e is a send of message m: m.V = V
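The three rules above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation; the names `VectorClockProcess` and `happened_before` are hypothetical, and processes are indexed from 0.

```python
class VectorClockProcess:
    """One process P_i in a system with a fixed number n of processes."""

    def __init__(self, i, n):
        self.i = i
        self.V = [0] * n  # V = (0, ..., 0)

    def receive(self, m_V):
        # Receive of message m: V = max(V, m.V), taken componentwise.
        self.V = [max(a, b) for a, b in zip(self.V, m_V)]

    def relevant(self):
        # Relevant event: V[i] = V[i] + 1; the event is stamped with V.
        self.V[self.i] += 1
        return tuple(self.V)

    def send(self):
        # Send of message m: m.V = V.
        return tuple(self.V)


def happened_before(u, v):
    # e.V < f.V: every component is <= and the vectors differ.
    return u != v and all(a <= b for a, b in zip(u, v))


p0, p1 = VectorClockProcess(0, 2), VectorClockProcess(1, 2)
e = p0.relevant()        # relevant event on p0
f = p1.relevant()        # relevant event on p1, concurrent with e
p1.receive(p0.send())    # p0 sends its clock to p1
g = p1.relevant()        # relevant event on p1 after the receive
```

Here `e` and `f` are concurrent, while both precede `g`. Every operation touches all n components, which is the O(N) cost noted in the motivation.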

Key Idea • Any chain in the computation poset can function as a process [Figure: events a to h on processes p1 to p4, regrouped into chains]

Chain Clocks • A component in the timestamp corresponds to a chain • Change “Rule II” in the vector clock algorithm • If e is a relevant event: V[e.c] = V[e.c] + 1, where e.c is the chain to which e is assigned • Theorem: Chain clocks guarantee the “clock condition” • Goal: Online decomposition of the poset into as few chains as possible

Outline • Motivation • Background • Chain Clock • Instances of Chain Clock • DCC • ACC • VCC • Experimental Results • Conclusion

Dynamic Chain Clocks (DCC) • Shared vector Z maintains up-to-date values of all components • Each process starts with empty vector • Rule II • e.c = j such that Z[j] = e.V[j] • Give preference to component last updated by Pi • V[e.c] = V[e.c] + 1

DCC: Example • If e is a receive of message m: V = max(V, m.V) • If e is a relevant event: e.c = i s.t. Z[i] = V[i]; V[e.c] = V[e.c] + 1; Z[e.c] = Z[e.c] + 1 • If e is a send of message m: m.V = V [Figure: timestamps on p1: (1), (1,1) = max{(1),(0,1)}, (2,1), (3,1); on p2: (0,1); on p3: (3,1), (3,2); with the values of V1, V2, V3 and Z shown at each step]
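The DCC rules can be sketched as follows. This is a single-threaded illustration under stated assumptions: the class name `DCCProcess` is hypothetical, missing components are treated as 0, a new component is created when no index satisfies Z[i] = V[i], and ties beyond the "last updated" preference are broken by the lowest index (the slides do not specify a tie-break). A real implementation would also have to synchronize access to the shared vector Z.

```python
def pad(vec, n):
    # Treat missing components as 0 (assumption for vectors of unequal length).
    return list(vec) + [0] * (n - len(vec))


class DCCProcess:
    def __init__(self, Z):
        self.Z = Z          # shared vector, the same list object for all processes
        self.V = []         # each process starts with an empty vector
        self.last = None    # component last updated by this process

    def receive(self, m_V):
        n = max(len(self.V), len(m_V))
        self.V = [max(a, b) for a, b in zip(pad(self.V, n), pad(m_V, n))]

    def relevant(self):
        Z = self.Z
        self.V = pad(self.V, len(Z))
        candidates = [j for j in range(len(Z)) if Z[j] == self.V[j]]
        if self.last in candidates:
            c = self.last               # prefer the component we updated last
        elif candidates:
            c = candidates[0]           # assumed tie-break: lowest index
        else:
            Z.append(0)                 # no usable chain: start a new component
            self.V.append(0)
            c = len(Z) - 1
        self.V[c] += 1
        Z[c] += 1
        self.last = c
        return tuple(self.V)

    def send(self):
        return tuple(self.V)


Z = []
p1, p2 = DCCProcess(Z), DCCProcess(Z)
t1 = p1.relevant()           # first component is created for p1
t2 = p2.relevant()           # p2 cannot reuse it, so a second one is created
p1.receive(p2.send())        # p1 merges p2's clock
t3 = p1.relevant()
t4 = p1.relevant()
```

The run follows the first steps of the example: p1 keeps extending the component it updated last, so its timestamps advance in the first component only.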

Problem • Number of processes can be much larger than the minimal number of chains [Figure: timestamps p1: (1); p2: (1,2), (0,1); p3: (0,1,1), (1,2,2); p4: (0,1,1,1), (1,2,2,2)]

Optimal Chain Decomposition • Antichain: Set of pairwise concurrent elements • Width: Maximum size of an antichain • Dilworth’s Theorem [1950] : A poset of width k can be partitioned into k chains and no fewer. • Requires knowledge of complete poset
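To make the definitions of antichain and width concrete, the sketch below computes the width of a small poset by brute force. It is only an illustration (exponential in the number of elements); the function name `width` and the divisibility example are assumptions chosen for the demonstration, not part of the talk.

```python
from itertools import combinations


def width(elements, lt):
    """Size of the largest antichain, found by exhaustive search."""
    def concurrent(a, b):
        return not lt(a, b) and not lt(b, a)

    for size in range(len(elements), 0, -1):
        for subset in combinations(elements, size):
            if all(concurrent(a, b) for a, b in combinations(subset, 2)):
                return size
    return 0


# Example poset: divisors of 12 ordered by divisibility.
def divides(a, b):
    return a != b and b % a == 0


w = width([1, 2, 3, 4, 6, 12], divides)
```

For this poset the largest antichains have two elements (for example {2, 3} or {4, 6}), and by Dilworth's theorem it can be covered by two chains such as 1, 2, 4, 12 and 3, 6.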

Online Chain Decomposition • Elements of poset presented in a total order consistent with the poset • Assign elements to chains as they arrive • Can be modeled as a game between • Bob : Presents elements • Alice : Assigns them to chains • Felsner [1997] : For a poset of width k, Bob can force Alice to use k(k+1)/2 chains

Chain Partitioning Algorithm (ACC) • Felsner gave an algorithm which meets the k(k+1)/2 bound • Our algorithm is simpler and more efficient • Sets of queues B1 … Bk : |Bi| = i • For an element z: • Insert into the first queue q in Bi with head < z • Swap queues in Bi and Bi-1 leaving q in its place [Figure: element z and queue sets B1, B2, B3]
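A possible reading of the insertion step is sketched below. It is an interpretation of the slide, not the authors' code: the name `make_chain_partitioner` is hypothetical, the sets are scanned from B1 upward, an empty queue is assumed to accept any element, and "swap" is taken to mean exchanging B(i-1) with Bi minus q.

```python
def make_chain_partitioner(k, lt):
    # B[i] holds i + 1 queues (0-indexed), k(k+1)/2 queues in total.
    B = [[[] for _ in range(i + 1)] for i in range(k)]

    def insert(z):
        for i in range(k):
            for q in B[i]:
                # Assumption: an empty queue accepts any element.
                if not q or lt(q[-1], z):
                    q.append(z)  # z becomes the new head of q
                    if i > 0:
                        # Exchange B[i-1] with B[i] minus q; q stays at level i.
                        rest = [r for r in B[i] if r is not q]
                        B[i], B[i - 1] = B[i - 1] + [q], rest
                    return
        raise ValueError("no queue accepts z; is the width larger than k?")

    def chains():
        return [q for level in B for q in level if q]

    return insert, chains


# Width-2 example: two independent sequences "a" and "b".
def lt(a, b):
    return a[0] == b[0] and a[1] < b[1]


events = [("a", 1), ("b", 1), ("a", 2), ("b", 2)]
insert, chains = make_chain_partitioner(2, lt)
for z in events:
    insert(z)
result = chains()
```

On this input the sketch uses three chains for a poset of width 2, which stays within the k(k+1)/2 bound even though an offline decomposition would need only two.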

Drawback of DCC and ACC • Require a shared data structure • Monitoring applications generally need a central server • Hybrid clocks • Multiple servers, each responsible for a subset of processes • Each server finds chains within its process group

Shared Memory System • Accesses to shared variables induce dependencies • Observation: Access events for a shared variable form a chain • Variable-based Chain Clocks (VCC) • Associate a component with every variable

VCC Application: Predicate Detection • Predicate: (x = 1) and (y = 1) • Only events changing x and y are relevant • Associate one component of the VCC with x and another with y • Initially: x = 0, y = 0 [Figure: computation with events y = 2, x = 0, x = 2, x = 1, y = 1, x = 1]
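The idea of one clock component per shared variable can be sketched as below. This is an assumed reading of VCC rather than the authors' implementation: the classes `Variable` and `Thread` are hypothetical, timestamps are dictionaries keyed by variable name, and accesses to a variable are assumed to be serialized (for example by a lock), so that they form a chain.

```python
class Variable:
    def __init__(self, name):
        self.name = name
        self.clock = {}  # timestamp of the last access to this variable


class Thread:
    def __init__(self):
        self.clock = {}

    def access(self, var):
        # The access depends on the thread's previous event and on the
        # previous access to the variable: merge both clocks.
        merged = dict(self.clock)
        for name, value in var.clock.items():
            merged[name] = max(merged.get(name, 0), value)
        # The event extends the chain of accesses to var.
        merged[var.name] = merged.get(var.name, 0) + 1
        self.clock = merged
        var.clock = dict(merged)
        return dict(merged)


def happened_before(u, v):
    keys = set(u) | set(v)
    return u != v and all(u.get(k, 0) <= v.get(k, 0) for k in keys)


x, y = Variable("x"), Variable("y")
t1, t2 = Thread(), Thread()
a = t1.access(x)   # t1 changes x
b = t2.access(y)   # t2 changes y, concurrent with a
c = t2.access(x)   # t2 changes x, after both a and b
```

With only x and y relevant, every timestamp has at most two components regardless of how many threads run, which is the appeal of VCC for predicates such as (x = 1) and (y = 1).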

Experiments • Setup • A multithreaded application • Each thread generates a sequence of events • Parameters: • Number of Processes • Number of Events • Probability of relevant event: a • Metrics • Number of components used • Execution time

Components Used Events = 100 a = 1%

Execution Time Events = 100 a = 1%

Effect of Relevancy Threads = 100 Events = 100

Conclusion • Generalized vector clocks to a class of algorithms called Chain Clocks • Dynamic Chain Clock (DCC) can provide substantial speedup and reduce memory requirements for applications • Antichain-based Chain Clock (ACC) meets the k(k+1)/2 lower bound for online chain decomposition

Questions?

Example: Poset of width 2 • For a poset of width 2, Bob can force Alice to use 3 chains [Figure: poset elements labeled with their chain numbers]

Happened Before Relation (→) [Lamport 78] • Distributed computation with N processes • Every process executes a series of events • Internal, send or receive event • e → f if there is a path from e to f • e║f if there is no path between e and f [Figure: computation on processes p1 and p2]

Future work • Lower bound for online chain decomposition when a decomposition into N chains is already known • Other chain decomposition strategies

Distributed System: Time vs Threads Events = 100 a = 1%

Distributed System: Events vs Time Threads = 100 a = 1%

Effect of Number of Events Threads = 100 a = 1%

Chain clock • Generalizes vector clocks • Reduces the time and memory overhead • Elegantly handles dynamic process creation