
Distributed Monitoring and Management

Presentation Transcript


  1. Distributed Monitoring and Management. Presented by: Ahmed Khurshid and Abdullah Al-Nayeem. CS 525, Spring 2009: Advanced Distributed Systems.

  2. Large Distributed Systems • Infrastructure: PlanetLab has 971 nodes in 485 sites • Application: Hadoop at Yahoo! runs on 4,000 nodes; Google probably has more than 450,000 servers worldwide (Wikipedia) • Not only the number of nodes: the data processed in commercial systems is also enormous, e.g., Facebook has over 10 billion uploaded pictures. Department of Computer Science, UIUC

  3. Monitoring and Management • Monitoring and management of both infrastructures and applications • Corrective measures against failure, attacks, etc. • Ensuring better performance, e.g. load balancing • What resources are managed? • Distributed application processes, objects (log files, routing table, etc.) • System resources: CPU utilization, free disk space, bandwidth utilization Department of Computer Science, UIUC

  4. Management and Monitoring Operations • Query current system status • CPU utilization, disk space, ... • Process progress rate, ... • Push software updates • Install the query program • Monitor dynamically changing state [Figure: a management query issued to a group of nodes n1–n6] Department of Computer Science, UIUC

  5. Challenges • Managing today’s large-scale systems is difficult. • A centralized solution doesn’t scale (no in-network aggregation). • Self-organization capability is becoming a necessity. • Responses are expected in seconds, not in minutes or hours. • Node failures cause inconsistent results (network partitions). • Brewer’s conjecture: it is impossible for a web service to provide all three of the following guarantees: Consistency, Availability, Partition tolerance (the CAP dilemma). Department of Computer Science, UIUC

  6. Astrolabe: A Robust and Scalable Technology for Distributed System Monitoring, Management, and Data Mining Robbert Van Renesse, Kenneth P. Birman, and Werner Vogels Presented by: Abdullah Al-Nayeem

  7. Overview • Astrolabe as an information management service. • Locates and collects the status of a set of servers. • Reports the summaries of this information (aggregation mechanism using SQL). • Automatically updates and reports any changed summaries. • Design principles: • Scalability through a hierarchy of resources • Robustness through a gossip protocol (p2p) • Flexibility through customization of queries (SQL) • Security through certificates Department of Computer Science, UIUC

  8. Astrolabe Zone Hierarchy [Figure: hosts n1–n8 grouped into zones such as /uiuc, /uiuc/cs, /cornell, and /berkeley] Department of Computer Science, UIUC

  9. Astrolabe Zone Hierarchy (2) • The zone hierarchy is determined by the administrators (less flexibility). • Assumption: zone names are consistent with the physical topology. • It is a virtual hierarchy: only the hosts in the leaf zones run an Astrolabe agent. [Figure: a zone tree rooted at /, with children /uiuc, /cornell, /berkeley; internal zones /uiuc/cs, /uiuc/ece, /cornell/cs, /berkeley/eecs; and leaf zones /uiuc/cs/n4, /uiuc/cs/n6, /uiuc/ece/n1, /cornell/n2, /cornell/cs/n3, /berkeley/eecs/n5, /berkeley/eecs/n7, /berkeley/eecs/n8] Department of Computer Science, UIUC

  10. Decentralized Hierarchical Database • An attribute list is associated with each zone. • This attribute list is called the Management Information Base (MIB). • Attributes include information on load, total free disk space, process information, etc. • Each internal zone has a relational table of the MIBs of its child zones. • The leaf zone is an exception (see the next slide). Department of Computer Science, UIUC
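As a rough illustration of this model (the class names Mib and Zone below are invented for this sketch and are not Astrolabe's actual API), a zone's MIB can be thought of as an attribute list, and an internal zone as a table of its children's MIBs:

    # Hypothetical sketch of the hierarchical database model described above.
    from dataclasses import dataclass, field

    @dataclass
    class Mib:
        zone: str                                        # e.g. "/uiuc/cs/n6"
        attributes: dict = field(default_factory=dict)   # e.g. {"load": 0.1, "disk_free_tb": 1.2}

    @dataclass
    class Zone:
        name: str
        child_mibs: dict = field(default_factory=dict)   # child zone name -> Mib

    leaf = Mib("/uiuc/cs/n6", {"load": 0.1, "disk_free_tb": 1.2})
    cs_zone = Zone("/uiuc/cs", {"/uiuc/cs/n6": leaf})    # internal zone holding its child's MIB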

  11. Decentralized Hierarchical Database (2) The agent at /uiuc/cs/n6 keeps a local copy of the MIB tables of every zone on its path to the root. [Figure: the agent's tables — the root table with rows for /uiuc, /cornell, and /berkeley; the /uiuc table with rows for /uiuc/cs and /uiuc/ece (Load values 0.1 and 0.3); and the leaf MIBs of /uiuc/cs/n4 and /uiuc/cs/n6 with system attributes (Disk, Load), process attributes (progress of Service A), and file attributes] Department of Computer Science, UIUC

  12. State Aggregation • Each zone aggregates the MIBs of its child zones using an SQL aggregation query, e.g. SELECT MIN(Load) AS Load. • Other aggregation functions include: MAX(attribute), SUM(attribute), AVG(attribute), FIRST(n, attribute). [Figure: the Load and Time attributes of /uiuc/cs/n4, /uiuc/cs, and /uiuc, showing the MIN(Load) result propagating up the hierarchy] Department of Computer Science, UIUC
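Read as code, the aggregation is just a fold over the child MIBs. A minimal Python sketch of the MIN(Load) case (illustrative only, not Astrolabe's implementation; the child values below are made up):

    # Compute a parent zone's Load attribute as the MIN of its children's Load
    # attributes, mirroring "SELECT MIN(Load) AS Load".
    def aggregate_min_load(child_mibs):
        """child_mibs: dict mapping child zone name -> attribute dict."""
        loads = [mib["load"] for mib in child_mibs.values() if "load" in mib]
        return {"load": min(loads)} if loads else {}

    children = {
        "/uiuc/cs":  {"load": 0.3},
        "/uiuc/ece": {"load": 0.7},   # value invented for the example
    }
    print(aggregate_min_load(children))   # {'load': 0.3}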

  13. State Merge – Gossip Protocol Each agent periodically contacts some other agent and exchanges the state associated with its MIBs; for each entry, the copy with the newer timestamp wins. [Figure: agents /uiuc/cs/n4 and /uiuc/cs/n6 before and after a gossip exchange; their /uiuc and /uiuc/cs entries (Load and Time values) converge to the versions with the latest timestamps] Department of Computer Science, UIUC
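A hedged sketch of that timestamp rule (the table layout and field names here are assumptions for illustration, not Astrolabe's wire format):

    # Merge two agents' MIB tables: for each zone, keep the entry with the
    # newer timestamp; entries only one side knows about are simply adopted.
    def merge_tables(mine, theirs):
        """mine/theirs: dict of zone name -> {'time': ..., plus other attributes}."""
        merged = dict(mine)
        for zone, entry in theirs.items():
            if zone not in merged or entry["time"] > merged[zone]["time"]:
                merged[zone] = entry
        return merged

    a = {"/uiuc": {"time": 130, "load": 0.3}}
    b = {"/uiuc": {"time": 110, "load": 0.5}, "/cornell": {"time": 90, "load": 0.2}}
    print(merge_tables(a, b))   # newer /uiuc entry wins; /cornell is adopted (values invented)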

  14. More about Astrolabe Gossip [Figure: the zone tree, highlighting the MIBs gossiped within /uiuc/cs] How does /uiuc/cs/n4 learn the MIB of /cornell? By gossiping with /cornell/n2 or /cornell/cs/n3. Department of Computer Science, UIUC

  15. More about Astrolabe Gossip (2) • Each zone dynamically elects a set of representative agents to gossip on its behalf. • Election can be based on the load of the agents or their longevity. • The MIB contains the list of representative agents. • An agent can represent multiple zones. • Each agent periodically gossips for a zone it represents: • It randomly picks another sibling zone and one of its representative agents. • It gossips the MIBs of all the sibling zones. • Gossip dissemination within a zone grows as O(log K), where K = number of child zones. Department of Computer Science, UIUC
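A small sketch of one such gossip round (the send_gossip callback and the data shapes are placeholders, not real Astrolabe interfaces):

    import random

    # One gossip step for a zone this agent represents: pick a random sibling
    # zone, pick one of its representatives, and exchange the sibling MIBs.
    def gossip_once(sibling_reps, sibling_mibs, send_gossip):
        """sibling_reps: dict of sibling zone name -> list of its representative agents."""
        target_zone = random.choice(list(sibling_reps))
        target_agent = random.choice(sibling_reps[target_zone])
        send_gossip(target_agent, sibling_mibs)          # the peer merges these by timestamp

    gossip_once({"/cornell": ["/cornell/n2"], "/berkeley": ["/berkeley/eecs/n5"]},
                {"/uiuc": {"time": 130, "load": 0.3}},
                send_gossip=lambda agent, mibs: print("gossip to", agent, mibs))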

  16. More about Astrolabe Gossip (3) [Figure: the zone tree, marking the representative agents elected for /uiuc and its sibling zones (e.g. /uiuc/cs/n4, /uiuc/ece/n1, /cornell/cs/n3, /berkeley/eecs/n5, /berkeley/eecs/n8); a representative of /uiuc gossips about the MIBs of /berkeley and /uiuc] Department of Computer Science, UIUC

  17. Example: P2P Caching of Large Objects • Query to locate a copy of the file game.db: • SELECT COUNT(*) AS file_count FROM files WHERE name = ‘game.db’ • SELECT FIRST(1, result) AS result, SUM(file_count) AS file_count WHERE file_count > 0 • The SQL query code is installed in an Astrolabe agent using an aggregation function certificate (AFC). • Each host installs an attribute ‘result’ with its host name in its leaf MIB; the second query aggregates the ‘result’ attribute and picks one host in each zone. Department of Computer Science, UIUC

  18. Example: P2P Caching of Large Objects (2) • The querying application introduces this new AFC into the management table of some Astrolabe agent. • Query propagation and aggregation of output: • An Astrolabe agent automatically evaluates the AFC for the leaf zone and recursively updates the tables of the ancestor zones. • A copy of the AFC is also included along with the result. • The query is evaluated recursively up to the root zone, as long as the policy permits. • AFCs are also gossiped to other agents (as part of the MIB). • A receiving agent scans the gossiped message and installs the new AFCs in its leaf MIBs. • Each AFC has an expiration period; until then, its query is evaluated frequently. Department of Computer Science, UIUC
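A simplified sketch of that bottom-up evaluation (the AFC is modeled here as a plain Python aggregation function over child results; real AFCs carry SQL plus a certificate, and the sibling values below are invented):

    # Evaluate an aggregation at the leaf zone, then re-evaluate it at each
    # ancestor zone over the sibling results, all the way up to the root.
    def evaluate_up_the_tree(leaf_result, ancestor_tables, aggregate):
        """ancestor_tables: one dict per ancestor zone, mapping sibling zone -> its result."""
        result, per_zone = leaf_result, [leaf_result]
        for siblings in ancestor_tables:             # e.g. /uiuc/cs, then /uiuc, then /
            values = list(siblings.values()) + [result]
            result = aggregate(values)
            per_zone.append(result)
        return per_zone

    # Toy run of the file_count part of the game.db query.
    print(evaluate_up_the_tree(1, [{"/uiuc/cs/n4": 0}, {"/uiuc/ece": 2}], aggregate=sum))
    # -> [1, 1, 3]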

  19. Membership • Removing failed or disconnected nodes: • Astrolabe also gossips about the membership information. • If a process (or host) fails, its MIB will eventually expire and be deleted. • Integrating new members: • Astrolabe relies on IP multicast to set up the initial contacts. • The gossip message is also broadcast on the local LAN (occasionally). • Astrolabe agents also occasionally contact a set of their relatives. Department of Computer Science, UIUC

  20. Simulation Results [Plot: effect of the branching factor of the zone hierarchy on gossip dissemination time] • Number of representative agents per zone = 1. • No failures. The smaller the branching factor of the zone hierarchy, the slower the gossip dissemination. Department of Computer Science, UIUC

  21. Simulation Results (2) [Plot: effect of the number of representative agents on gossip dissemination time] • Branching factor = 25. • No failures. The more representative agents per zone, the lower the gossip dissemination time in the presence of failures. Department of Computer Science, UIUC

  22. Discussion • Astrolabe is not meant to provide routing features similar to DHTs. • How is Astrolabe different from DHTs? • Astrolabe attributes are updated proactively and frequently. • Do you think this proactive management is better than the reactive one? Department of Computer Science, UIUC

  23. Moara: Flexible and Scalable Group-Based Querying System Steven Y. Ko (1), Praveen Yalagandula (2), Indranil Gupta (1), Vanish Talwar (2), Dejan Milojicic (2), Subu Iyer (2). (1) University of Illinois at Urbana-Champaign; (2) HP Labs, Palo Alto. ACM/IFIP/USENIX Middleware, 2008. Presented by Ahmed Khurshid

  24. Motivation What is the average memory utilization of machines running MySQL? [Figure: a pool of machines running Linux, Apache, and MySQL; the naïve approach queries every machine] The naïve approach consumes extra bandwidth and adds delay. Department of Computer Science, UIUC

  25. Motivation (cont.) What is the average memory utilization of machines running MySQL? [Figure: the same pool of machines; the better approach queries only the machines running MySQL] The better approach avoids unnecessary traffic. Department of Computer Science, UIUC

  26. Two Approaches [Figure: a spectrum from a single tree (no grouping), where query cost is high, to group-based trees, where group maintenance cost is high; Moara sits in between] Department of Computer Science, UIUC

  27. Moara Features • Moara maintains aggregation trees for different groups • Uses the FreePastry DHT for this purpose • Supports a query language having the following form • (query-attribute, aggregation function, group-predicate) • E.g. (Mem-Util, Average, MySQL = true) • Supports composite queries that target multiple groups using unions and intersections • Reduces bandwidth usage and response time by • Adaptively pruning branches of the DHT tree • Bypassing intermediate nodes that do not satisfy a given query • Only querying those nodes that are able to answer quickly without affecting the result of the query Department of Computer Science, UIUC

  28. Common Queries Department of Computer Science, UIUC

  29. Group Size and Dynamism [Plots: usage of HP’s utility computing environment by different jobs, and usage of PlanetLab nodes by different slices] The number of machines used by a job varies; most PlanetLab slices have fewer than 10 nodes. Department of Computer Science, UIUC

  30. Moara: Data and Query Model • Three-part query • Query attribute • Aggregation function • Group-predicate • Aggregation functions are partially aggregatable • To perform in-network aggregation • Composite queries can be constructed using “and” and “or” operators • E.g. (Linux=true and Apache=true) [Figure: a Moara agent stores a list of (attribute, value) pairs: (attribute1, value1), (attribute2, value2), ..., (attributeN, valueN)] Department of Computer Science, UIUC

  31. Scalable Aggregation • Moara employs a peer-to-peer in-network aggregation approach • Maintains a separate DHT tree for each group predicate • The hash of the group attribute is used to designate a node as the root of such a tree • Queries are first sent to the root node, which then propagates the query down the tree • Data coming from child nodes are aggregated before sending results to the parent node Department of Computer Science, UIUC
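A hedged sketch of this in-network aggregation (the tree layout and helper names are invented for illustration; Moara's real trees are built over FreePastry). Note that AVG is made partially aggregatable by carrying (sum, count) pairs rather than averaging averages:

    # Recursively aggregate an attribute over a query tree: each node combines
    # its own value with its children's partial aggregates before replying.
    def query_subtree(node, children_of, local_value):
        """Return (sum, count) of the attribute over the subtree rooted at node."""
        total, count = local_value(node), 1
        for child in children_of(node):
            s, c = query_subtree(child, children_of, local_value)
            total, count = total + s, count + c
        return total, count

    tree = {"root": ["a", "b"], "a": [], "b": []}      # toy tree
    mem = {"root": 0.2, "a": 0.5, "b": 0.8}            # invented Mem-Util values
    s, c = query_subtree("root", lambda n: tree[n], lambda n: mem[n])
    print(s / c)                                       # average memory utilization = 0.5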

  32. DHT Tree Steps • Take the hash of the group predicate. Let Hash(ServiceX) = 000. • Select the root based on the hash. • Use Pastry’s routing mechanism to join the tree • Similar to SCRIBE [Figure: a ring of node IDs 000 through 111; the tree for ServiceX is rooted at node 000] Department of Computer Science, UIUC
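As a rough illustration of the first two steps (SHA-1 is used here only as an example hash function; the slide's 3-bit IDs are a toy ID space):

    import hashlib

    # Map a group predicate string to an identifier in the DHT's ID space;
    # the node whose ID is closest to this value becomes the tree's root.
    def root_id(group_predicate, bits=3):
        digest = hashlib.sha1(group_predicate.encode()).digest()
        value = int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)
        return format(value, "0{}b".format(bits))

    print(root_id("ServiceX"))   # e.g. a 3-bit ID such as '000' in the toy ring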

  33. Optimizations • Prune out branches of the tree that do not contain any node satisfying a given query • Need to balance the cost of maintaining the group tree against the query resolution cost • Bypass internal nodes that do not satisfy a given query • While dealing with composite queries, select a minimal set of groups by rewriting the query into a more manageable form Department of Computer Science, UIUC

  34. Dynamic Maintenance prune (p) = a binary local state variable at every node, per attribute. [Figure: a tree rooted at A with nodes B through H; every node’s prune bit is false, so NO-PRUNE messages are sent up toward A] Department of Computer Science, UIUC

  35. Dynamic Maintenance prune (p) = a binary local state variable at every node. [Figure: node F’s prune bit becomes true, so a PRUNE message is sent to its parent, which records p(F) = true] Department of Computer Science, UIUC

  36. Dynamic Maintenance prune (p) = a binary local state variable at every node. [Figure: both of node C’s children, F and G, are now pruned, so C sets its own prune bit and sends a PRUNE message to A, which records p(C) = true] A high group churn rate causes more PRUNE/NO-PRUNE messages and may be more expensive than simply forwarding the query to all nodes. Department of Computer Science, UIUC
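A rough sketch of the pruning rule implied by these figures (variable and function names are mine, not Moara's): a node's subtree can be pruned only when the node itself does not satisfy the predicate and every child has reported PRUNE.

    # Recompute a node's prune bit and notify the parent only when it changes.
    def recompute_prune(satisfies_locally, child_prune_bits, old_prune, send_to_parent):
        new_prune = (not satisfies_locally) and all(child_prune_bits.values())
        if new_prune != old_prune:
            send_to_parent("PRUNE" if new_prune else "NO-PRUNE")
        return new_prune

    # Node C from the figure: C does not satisfy the predicate and both
    # children F and G are pruned, so C prunes and tells its parent A.
    recompute_prune(False, {"F": True, "G": True}, old_prune=False,
                    send_to_parent=print)              # prints: PRUNE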

  37. Adaptation Policy • Maintain two additional state variables at every node • sat • To track if a subtree rooted at this node should continue receiving queries for a given predicate • update • Denotes whether a node will update its prune variable or not • The following invariants are maintained • update = 1 AND sat = 1 => prune = 0 • update = 1 AND sat = 0 => prune = 1 • update = 0 => prune = 0 Department of Computer Science, UIUC
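The three invariants translate directly into a small decision function (illustrative only):

    # Derive the prune bit from the update and sat state variables,
    # following the invariants listed above.
    def prune_from(update, sat):
        if update == 1 and sat == 1:
            return 0      # subtree has satisfying nodes: keep receiving queries
        if update == 1 and sat == 0:
            return 1      # nothing below satisfies the predicate: prune
        return 0          # update == 0: never prune, regardless of sat

    assert (prune_from(1, 1), prune_from(1, 0), prune_from(0, 0)) == (0, 1, 0)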

  38. Adaptation Policy (cont.) • A node keeps updating its prune state (update = 1) when the overhead due to unrelated queries is higher than that of the group maintenance messages. • It stops updating (update = 0) when the group maintenance messages consume more bandwidth than the queries themselves. Department of Computer Science, UIUC

  39. Separate Query Plane • Used to bypass intermediate nodes that do not satisfy a given query • Reduces message complexity from O(m log N) to O(m), where • N = total number of nodes in the system • m = number of query-satisfying nodes • Uses two locally maintained sets at each node • updateSet – the list of nodes forwarded to the parent • qSet – the list of child nodes to which queries are forwarded • qSet is the union of all updateSets received from child nodes • Based on the size of the qSet and the SAT value, a node decides whether or not to remain in the tree Department of Computer Science, UIUC

  40. Separate Query Plane (cont.) Rule: remain in the tree if SAT is true and |qSet| ≥ threshold. [Figure: node B with children C, D, and E, where C and D satisfy the query and B and E do not. With threshold = 2, B’s qSet = {C, D} meets the threshold, so B stays in the tree and forwards updateSet = {B} to A. With threshold = 3, B removes itself and forwards updateSet = {C, D}, so A queries C and D directly] Department of Computer Science, UIUC
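A sketch of the updateSet/qSet bookkeeping (the control flow below is my reconstruction of the rule on these slides, not Moara's code; whether a locally satisfying node adds itself to its own qSet is an assumption here):

    # A node's qSet is the union of its children's updateSets (plus itself if
    # it satisfies the predicate). It stays on the query path only if its
    # subtree is SAT and the qSet is large enough; otherwise it exposes the
    # qSet to its parent so queries can bypass it.
    def compute_update_set(node, satisfies_locally, child_update_sets, threshold):
        q_set = set().union(*child_update_sets) if child_update_sets else set()
        if satisfies_locally:
            q_set = q_set | {node}
        if q_set and len(q_set) >= threshold:
            return {node}             # remain in the tree; parent forwards queries here
        return q_set                  # bypassed; parent contacts these nodes directly

    # Node B from the figure: B itself does not satisfy; C and D do, E does not.
    print(compute_update_set("B", False, [{"C"}, {"D"}, set()], threshold=2))  # -> {'B'}
    print(compute_update_set("B", False, [{"C"}, {"D"}, set()], threshold=3))  # -> C and D (B is bypassed)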

  41. Composite Queries • Moara does not maintain trees for composite queries • Answers composite queries by contacting one or more simple predicate trees • Example - 1 • (Free_Mem, Avg, ServiceX = true AND ServiceY = true) • Two trees – one for each service • Can be answered using a single tree (whichever promises to respond early) • Example - 2 • (Free_Mem, Avg, ServiceX = true OR ServiceY = true) • Two trees – one for each service • Need to query both trees Department of Computer Science, UIUC

  42. Composite Queries (cont.) • Moara selects a small cover that contains the trees required to answer a query • For example • cover(Q=“A”) = {A} if A is a predefined group • cover(Q=“A or B”) = cover(A) U cover(B) • cover(Q=“A and B”) = cover(A), cover(B), or (cover(A) U cover(B)) • Bandwidth is saved by • Rewriting a nested query to select a low-cost cover • Estimating query costs for individual trees • Using semantic information supplied by the user Department of Computer Science, UIUC
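A toy recursive sketch of these cover rules (the tie-break for "and", picking the smaller side, is a simplification; Moara's planner also uses cost estimates and query rewriting):

    # Compute a set of group trees sufficient to answer a composite query.
    # Queries are either a group name or a nested ("and"/"or", left, right) tuple.
    def cover(query):
        if isinstance(query, str):                  # a simple, predefined group
            return {query}
        op, left, right = query
        if op == "or":
            return cover(left) | cover(right)       # both sides must be queried
        if op == "and":
            return min(cover(left), cover(right), key=len)  # either side suffices
        raise ValueError("unknown operator: " + op)

    print(cover(("and", "ServiceX", ("or", "ServiceY", "ServiceZ"))))   # {'ServiceX'}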

  43. Finding Low-Cost Covers [Figure: the cover-selection algorithm; CNF = Conjunctive Normal Form; it gives a minimal-cost cover] Department of Computer Science, UIUC

  44. Performance Evaluation – Dynamic Maintenance • FreePastry simulator • 10,000 nodes • 2,000-sized group churn events • 500 events • Query for an attribute A with value ∈ {0, 1} [Plot: bandwidth usage at various query-to-churn ratios] Moara performs better than both extreme approaches. Department of Computer Science, UIUC

  45. Performance Evaluation – Separate Query Plane [Plot: bandwidth usage for different (group size, threshold) pairs] For higher threshold values, the query cost does not depend on the total number of nodes. Department of Computer Science, UIUC

  46. Emulab Experiments • 50 machines, 10 instances of Moara per machine • Fixed query rate [Plots: latency and bandwidth usage with static groups; average latency of dynamically changing groups compared with a static group of the same size] • Latency and message cost increase with group size. • Moara performs well even under high group-churn events. Department of Computer Science, UIUC

  47. PlanetLab Experiments • 200 PlanetLab nodes, one instance of Moara per node • 500 queries injected 5 seconds apart [Plot: Moara vs. a centralized aggregator] Moara responds faster than a centralized aggregator. Department of Computer Science, UIUC

  48. Discussion Points • Using DAG or Synopsis Diffusion instead of trees • Handling group churn in the middle of a query • Moara ensures eventual consistency • Effect of nodes that are unreachable • State maintenance overhead for different attributes • Computation overhead for maintaining DHT trees for different attributes • Using Moara in ad-hoc mobile wireless networks Department of Computer Science, UIUC

  49. Network Imprecision: A New Consistency Metric for Scalable Monitoring Navendu Jain†, Prince Mahajan⋆, Dmitry Kit⋆, Praveen Yalagandula‡, Mike Dahlin⋆, and Yin Zhang⋆ †Microsoft Research ⋆The University of Texas at Austin ‡HP Labs

  50. Motivation • Providing a consistency metric suitable for large-scale monitoring systems • Safeguarding accuracy despite node and network failures • Providing a level of confidence in the reported information • Efficiently tracking the number of nodes that fail to report status or that report status multiple times Department of Computer Science, UIUC
