
Scalable Monitoring & Autonomous Management of Cloud Environments

Idit Keidar, Technion



  1. Scalable Monitoring & Autonomous Management of Cloud Environments Idit Keidar, Technion

  2. Executive Summary • Goal: Scalable Monitoring and Autonomous Management of Cloud Environments • Approach: Distributed Local Computations • Combine theory and experimental work • Task 1: Robust aggregation • Task 2: Overcome (& understand impact of) loss, failures in gossip-based membership • Task 3 (long term): Local and adaptive self-organization

  3. Autonomous Self* Clouds • Complex autonomous decision making • Collaboratively computing functions [Figure: data centers exchanging messages: “The Haifa data center is too hot!”, “They’re going to turn on the sprinklers - need to back up”, “Let’s reduce power”]

  4. Centralized Solutions Don’t Cut It • Load • Communication costs • Delays • Fault-tolerance

  5. Classical Dist. Solutions Don’t Cut It • Global agreement before any output • Repeated invocations to adapt to changes • High latency, high load • By the time synchronization is done, input may have changed … the result is irrelevant • Frequent changes -> inconsistent snapshots • Synchronization typically relies on leader • difficult and costly to maintain

  6. Locality to the Rescue! • Nodes make local decisions based on communication with some proximate nodes • rather than the entire network • Infinitely scalable • Fast, low overhead, low power

  7. What is Locality? • Worst case view • Interesting problems have (a few) inherently global instances • Average case view • Requires an a priori distribution of the inputs • Our approach: be “as local as possible” • E.g., Veracity Radius of distributed aggregation [BKLSW’06]: how far does a node need to look in order to know the globally correct result?

  8. Task 1: Distributed Clustering for Robust Aggregation Years 1-2 With Ittay Eyal and Raphael Rom

  9. Clouds Need Monitoring • Load balancing storage/computation • Need to know load distributions • Ensuring a certain replication level • Need to know number of failures per object • Discovering problems – detecting anomalies • Isolated outliers (malfunctioning node) • Anomalous clusters • All nodes running some OS version are overloaded due to attack • Overheating area

  10. Aggregation Needs • Robustness to data errors • Ignore erroneous reports (outliers) • See Amazon S3’s recent crash caused by corrupt data being gossiped • Data is multi-dimensional • Physical location × Heat: Where is there a fire? • Cluster group × Load: Overloaded clusters? • Software version × Performance: What software are perturbed nodes running?

  11. Solution Requirements • Decentralized, tolerating crashes • Scalable, low cost • Clouds run 100,000s of machines • Machines are busy doing real work • Dynamic: deal with churn, value changes • All nodes learn the outcome • Data used for self-configuring/self-managing systems, so all nodes need to know the outcome in order to take appropriate actions

  12. Proposed Approach • Gossip-based diffusion • Crash robust, scalable • Constant size synopses represent data distribution as set of Gaussian clusters [Figure: samples taken and the estimated distribution, made up of Gaussian 1 and Gaussian 2]

  13. Merging Synopses • Gossiping nodes exchange synopses, merge them to improve accuracy [Figure: synopsis + synopsis = merged synopsis]
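
The merge step on slide 13 can be illustrated with a minimal sketch. The slides do not give the merge rule, so everything here is an assumption made for illustration: one-dimensional data, a synopsis stored as a list of weighted Gaussians, moment matching to combine two clusters, and a greedy "closest means first" rule to keep the synopsis at a constant size `k`. The names `Gaussian`, `merge`, `compress`, and `merge_synopses` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    """One cluster of a synopsis (hypothetical representation)."""
    weight: float  # number (or fraction) of samples the cluster summarizes
    mean: float
    var: float     # population variance of those samples


def merge(a: Gaussian, b: Gaussian) -> Gaussian:
    """Combine two clusters by matching the first two moments of their mixture."""
    w = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / w
    # E[x^2] of each cluster is var + mean^2; average them by weight.
    second = (a.weight * (a.var + a.mean ** 2)
              + b.weight * (b.var + b.mean ** 2)) / w
    return Gaussian(w, mean, second - mean ** 2)


def compress(clusters, k):
    """Assumed policy: merge the two clusters with the closest means until at most k remain."""
    clusters = sorted(clusters, key=lambda g: g.mean)
    while len(clusters) > k:
        i = min(range(len(clusters) - 1),
                key=lambda j: clusters[j + 1].mean - clusters[j].mean)
        clusters[i:i + 2] = [merge(clusters[i], clusters[i + 1])]
    return clusters


def merge_synopses(s1, s2, k):
    """Union two synopses received via gossip, then shrink back to constant size k."""
    return compress(list(s1) + list(s2), k)


if __name__ == "__main__":
    local = [Gaussian(1.0, 0.0, 0.0), Gaussian(1.0, 100.0, 0.0)]  # 100.0 plays the outlier
    remote = [Gaussian(1.0, 2.0, 0.0)]
    for g in merge_synopses(local, remote, k=2):
        print(g)
```

With `k=2` the two nearby readings collapse into one cluster while the far-away reading keeps its own cluster, which is the kind of outlier separation the robustness requirement on slide 10 asks for. A real protocol would also need a rule for splitting weights between gossip partners, which this sketch leaves out.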

  14. Preliminary Results - Robustness [Figure panels: sample distribution, regular aggregation, and robust aggregation, each shown with no crashes and with crashes]

  15. Estimating Distributions - Pareto [Figure: PDF and CDF]

  16. Estimating Distributions - Uniform [Figure: PDF and CDF]

  17. Multi-Dimensional Distributions [Figure: samples taken and the aggregated synopsis]

  18. Key Challenges • Test with real data • Analyze convergence properties • Understand locality • Deal with changing inputs

  19. Task 2: Fault- & Loss-Tolerant Gossip-Based Membership: Formal Analysis Years 1-2 With Maxim Gurevich

  20. Why Membership? • Each node needs to know some live nodes • In a dynamically changing system (churn) • Gossip partners • Random choices make gossip protocols work • Unstructured overlay networks • E.g., among super-peers • Random links provide robustness, expansion • Gathering statistics • Probe random nodes

  21. Desirable Properties • Each node has a local view (set of node ids) • Small views, e.g., logarithmic • Load balance of representation in views • Uniform sample: In every node’s view, all other nodes appear with equal probability • Spatial independence: No correlation among views of different nodes • Temporal independence: fast decay of correlation with past views
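
As a rough illustration of the load-balance property on slide 21, the sketch below counts how often each node id appears in other nodes' views (its in-degree). It is not from the presentation: the dictionary layout of `views` and the helper names `in_degrees` and `random_views` are assumptions, and the uniformly random views only stand in for whatever a membership protocol would actually produce.

```python
import random
from collections import Counter


def in_degrees(views):
    """Count, for every node, how many view entries across the system refer to it."""
    counts = Counter({node: 0 for node in views})  # include nodes nobody knows about
    for view in views.values():
        for member in view:
            counts[member] += 1
    return counts


def random_views(n, view_size, seed=0):
    """Idealized baseline: each node's view is a uniform sample of the other nodes."""
    rng = random.Random(seed)
    nodes = list(range(n))
    return {u: rng.sample([v for v in nodes if v != u], view_size) for u in nodes}


if __name__ == "__main__":
    views = random_views(n=1000, view_size=10)
    degrees = in_degrees(views)
    print("min in-degree:", min(degrees.values()))
    print("max in-degree:", max(degrees.values()))
```

A narrow spread between the minimum and maximum in-degree suggests good load balance. Checking uniformity and spatial or temporal independence would need statistics over many runs rather than a single snapshot.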

  22. Existing Work • Many protocols studied only empirically • Achieve good load balance • Induce spatial dependence • No bound on temporal dependence • A few analyzed theoretically • Uniformity, load balance, spatial indep. • Unrealistic assumptions • Atomic actions with bi-directional communication • No churn, failures, or message loss • No bounds on temporal dependence

  23. Our Goal • Bridge “Theory” and “Practice” • A practical protocol • Working despite message loss, churn, failures • No complex bookkeeping for atomic actions • Formally prove the 5 desirable properties • Should perfectly hold in good circumstances • Quantify how much they degrade due to adverse conditions – message loss, churn, etc.

  24. Send & Forget Membership • No bi-directional communication • Overcomes message loss • Simple • Amenable to formal analysis [Figure: views of nodes u, v, and w before a step, after u -> v, after loss, and after dup]
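
The slide gives no pseudocode, so the following is only one plausible reading of the figure labels (before, after u -> v, after loss, after dup), not the authors' protocol. In this sketch a node u picks two ids v and w from its view, sends (u, w) to v in a single one-way message, and forgets both ids; if u's view is already at a minimum size it keeps them instead (a "dup"). The function name, the thresholds `min_size` and `max_size`, and the `loss_prob` parameter are all hypothetical.

```python
import random


def send_and_forget_step(views, u, rng, min_size=2, max_size=8, loss_prob=0.0):
    """One assumed send-and-forget action by node u; `views` maps node id -> list of ids."""
    view = views[u]
    if len(view) < 2:
        return  # not enough entries to act
    v, w = rng.sample(view, 2)  # two entries of u's view (v is the message target)

    if len(view) > min_size:
        # Forget: u drops both ids without waiting for any acknowledgement.
        view.remove(v)
        view.remove(w)
    # Otherwise ("dup"): u keeps v and w, so the ids exist on both sides.

    if rng.random() < loss_prob:
        return  # message lost; nothing is retransmitted or rolled back

    target = views[v]
    for node in (u, w):
        # Skip self-references and respect the assumed view-size cap.
        if node != v and len(target) < max_size:
            target.append(node)


if __name__ == "__main__":
    rng = random.Random(0)
    views = {n: [m for m in range(6) if m != n][:4] for n in range(6)}
    for _ in range(100):
        send_and_forget_step(views, rng.randrange(6), rng, loss_prob=0.05)
    print(views)
```

Because the sender never waits for a reply, a lost message simply removes two ids from the system, and the duplication rule is what replenishes them. That trade-off is presumably what the parameter-setting questions on the next slide refer to.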

  25. Challenges • Setting parameters • View size, how often to dup? • Proving all 5 desirable properties w/out loss • Markov Analysis 1: In-degree distribution • Markov Analysis 2: Markov Chain of all reachable global states • stationary probability, mixing, membership properties • Quantify impact of loss, churn, failures • Bound dependencies, degree imbalance

  26. Task 3: Local and Adaptive Self-Organization and Topology Maintenance Years 2-3

  27. Decisions, Decisions, • Making autonomous decisions based on some function computation • E.g., optimization function for topology maintenance • Devise local distributed computations for these • Challenge 1: Prove instance-based locality • Challenge 2: Test with real data

  28. Summary (Repeated) • Goal: Scalable Monitoring and Autonomous Management of Cloud Environments • Approach: Distributed Local Computations • Combine theory and experimental work • Task 1: Robust aggregation • Task 2: Overcome (& understand impact of) loss, failures in gossip-based membership • Task 3 (long term): Local and adaptive self-organization
