MyOps An Operational Framework for PlanetLab Deployments
Outline • Objective of MyOps • Current status • Future ideas • Questions at any time
Objective : Close Operational Cycle • System - Provides service (slice) • Monitoring - Feedback from running system • Operator - Interpret feedback into tasks • Management - Control running system
Challenges: Break-down • System may not deliver service • Monitoring may not observe useful metrics • Operator may not know • how to interpret observations • how to control the system • what the service goals are • Management may not control system
Requirements for Operational Systems • Satisfy Minimal Conditions • Physical Integrity • Interconnectivity • Controllable • Provide a Service • Two requirements • Reliably reach the final condition • When failures occur, repair or report automatically • Two approaches in MyOps • Precise bootstrap stages (not discussed) • Operational monitoring & management in platform
Monitoring Types Open-loop monitoring • Identify the unknown • More information, fine-grained Operational monitoring (closed-loop) • Correctness • Less information, coarse-grained • Actionable
Management Types Open-loop management • Bootstrap/Deploy from the ground up • Inefficient, coarse-grained • No feedback Operational management (closed-loop) • Tweak the system to correct behavior • More efficient, fine-grained
Example • Observe: Node is Off-line • Control: Attempt to Power-on • Observe: Node is On-line but Failed to boot • Observe: Failed to boot Error • Control: Create ticket & Send email to local contact • Time passes • Control: Disable slice creation • Observe: Local contact responds • Observe: Node is Powered-on and Running • Control: Re-enable slice creation • Control: Close ticket
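The observe/control sequence above can be sketched as a small function that maps each observation to control actions. This is only an illustration, not MyOps code: the `Node` class, the observation and action names, and the 7-day grace period standing in for "Time passes" are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical per-node state tracked by the operator loop.
    name: str
    ticket_open: bool = False
    slice_creation_enabled: bool = True


def step(node, observation, days_unresolved=0, grace_days=7):
    """Map one observation to a list of control actions.

    grace_days is an assumed threshold; the slides only say "Time passes".
    """
    actions = []
    if observation == "offline":
        actions.append("power_on")
    elif observation == "boot_failed":
        if not node.ticket_open:
            node.ticket_open = True
            actions += ["create_ticket", "email_local_contact"]
        elif days_unresolved >= grace_days and node.slice_creation_enabled:
            node.slice_creation_enabled = False
            actions.append("disable_slice_creation")
    elif observation == "running":
        if not node.slice_creation_enabled:
            node.slice_creation_enabled = True
            actions.append("enable_slice_creation")
        if node.ticket_open:
            node.ticket_open = False
            actions.append("close_ticket")
    return actions


node = Node("node1.example.org")
trace = [
    step(node, "offline"),
    step(node, "boot_failed"),
    step(node, "boot_failed", days_unresolved=7),
    step(node, "running"),
]
```

Each call performs one pass of the cycle: an observation comes in and zero or more control actions go out, with the ticket and privilege state carried between passes.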
History of PlanetLab Operations Open-loop Monitoring with Open-loop Management • Collect fine-grained statistics using CoMon • Act with coarse-grained operations (e.g. Reinstall) • Manual bridge between the two Moving towards Closed-loop Operations • Collect targeted metrics • Take directed, problem-specific actions • Automate actions based on policy
PlanetLab Operations • Close the monitor/management cycle • Direct automation of common operations • Indirect through remote contacts and incentives
MyOps Architecture • Collection from Node • Translated by policy to Automated action
MyOps Architecture • Collection from Node • Send notice to Local contact to take action
MyOps Architecture • When there is no response • Indirect influence with incentives
Collection • Operational monitoring specific targets, such as: • Boot status, Filesystem status • DNS - internal and external • RPMs • System services, etc • Periodic collection • Coarse-grained collection at a human-timescale • Time-series of events and status
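One way to picture "periodic, coarse-grained collection" feeding a "time-series of events and status" is to keep only the status changes from each periodic sample. The sketch below assumes hourly samples and hypothetical status names (`boot`, `down`); it does not reflect the actual MyOps storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class StatusHistory:
    # (timestamp in seconds, status); an entry is stored only when status changes.
    events: List[Tuple[int, str]] = field(default_factory=list)

    def record(self, timestamp: int, status: str) -> None:
        """Store a periodic sample, collapsing repeats of the same status."""
        if not self.events or self.events[-1][1] != status:
            self.events.append((timestamp, status))

    def current(self) -> Optional[str]:
        return self.events[-1][1] if self.events else None

    def seconds_in_current(self, now: int) -> int:
        """How long the node has been in its current status."""
        return now - self.events[-1][0] if self.events else 0


history = StatusHistory()
samples = ["boot", "boot", "down", "down", "down", "boot"]  # one per hour (assumed)
for hour, status in enumerate(samples):
    history.record(hour * 3600, status)
```

Storing transitions rather than every sample keeps the series small and makes questions such as "how long has this node been down?" a single subtraction.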
Policy • Constraints over a time-series of events • To satisfy a constraint • Automated action • Send notice • Apply incentive • Policy defines • Preferred status of system • Frequency of actions • Magnitude of incentives
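A policy of this shape can be sketched as data (preferred status, notice frequency, incentive magnitudes) plus a function that turns the time a node has spent out of the preferred status into actions. The thresholds and all names below are assumptions chosen for illustration; the two incentive names are taken from the MyPLC incentives listed in this deck.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Policy:
    preferred_status: str = "online"
    notice_every_days: int = 7  # frequency of actions (assumed value)
    # Magnitude of incentives: (days out of preferred status, incentive), ascending.
    incentive_steps: Tuple[Tuple[int, str], ...] = (
        (14, "disable_slice_creation"),
        (28, "disable_running_slices"),
    )


def decide(policy: Policy, status: str, days_in_status: int,
           days_since_last_notice: Optional[int]):
    """Return the actions needed to push the node toward the preferred status."""
    if status == policy.preferred_status:
        return []  # constraint satisfied, nothing to do
    actions = []
    if (days_since_last_notice is None
            or days_since_last_notice >= policy.notice_every_days):
        actions.append("send_notice")
    # Apply only the strongest incentive whose threshold has been reached.
    reached = [name for days, name in policy.incentive_steps
               if days_in_status >= days]
    if reached:
        actions.append(reached[-1])
    return actions


policy = Policy()
```

For example, `decide(policy, "down", 15, 7)` yields a notice plus the first incentive, while a node already in the preferred status yields no actions at all.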
Automation • Automatic correction of common bootstrap problems • Communication errors with MyPLC • Corrupt filesystem repair • Retry when state is unknown • PCU Reboot • Reinstall • Automation Notices • Bad disk • Minimal hardware • Bad DNS • Bad node configuration
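The split between problems corrected automatically and problems that only generate a notice can be sketched as a simple routing table. The problem identifiers are hypothetical labels for the items on this slide, and treating any unrecognized state as "retry" is an assumption based on the "Retry when state is unknown" bullet.

```python
# Problems the slide lists as automatically correctable (labels are hypothetical).
AUTOMATED = {"myplc_communication_error", "corrupt_filesystem"}

# Problems that need a human, so automation only sends a notice.
NOTICE_ONLY = {"bad_disk", "minimal_hardware", "bad_dns", "bad_node_configuration"}


def route(problem: str) -> str:
    """Decide how a diagnosed bootstrap problem is handled."""
    if problem in AUTOMATED:
        return "automate"
    if problem in NOTICE_ONLY:
        return "notify"
    return "retry"  # state unknown: try again later
```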
Notices & Incentives • Notices are indirect paths to node management • Node down / online / specific problem (e.g. DNS, disk) • Site down / online • Privilege reduced / restored • PCU errors • The incentives on MyPLC • Sites 10 slices • Disable slice creation • Disable running slices
Validation of Notices & Incentives • [Figure: timeline divided into intervals A, B, C, D, E; event labels: Kernel Bug, Fix, Fix2, Notice Bug, Fix]
Future Ideas • Generalize Configuration • Collect from multiple sources • Expose policy • Act on multiple targets • Self-monitoring • Positive Incentives • Special access to services • Additional resources (Slices, Bandwidth, CPU, etc)