MyOps

MyOps: An Operational Framework for PlanetLab Deployments

Outline • Objective of MyOps • Current status • Future ideas • Questions at any time

Example of Feedback

Objective: Close the Operational Cycle • System - Provides service (slice) • Monitoring - Feedback from running system • Operator - Interprets feedback into tasks • Management - Controls running system

Challenges: Break-down • System may not deliver service • Monitoring may not observe useful metrics • Operator may not know: how to interpret observations; how to control the system; what the service goals are • Management may not control system

Requirements for Operational Systems • Satisfy Minimal Conditions • Physical Integrity • Interconnectivity • Controllable • Provide a Service • Two requirements • Reliably reach the final condition • When failures occur, repair or report automatically • Two approaches in MyOps • Precise bootstrap stages (not discussed) • Operational monitoring & management in platform

System: PlanetLab Slices

Monitoring Types
Open-loop monitoring • Identify the unknown • More information, fine-grained
Operational monitoring (closed-loop) • Correctness • Less information, coarse-grained • Actionable

Management Types
Open-loop management • Bootstrap/Deploy from the ground up • Inefficient, coarse-grained • No feedback
Operational management (closed-loop) • Tweak the system to correct behavior • More efficient, fine-grained

Example • Observe: Node is Off-line • Control: Attempt to Power-on • Observe: Node is On-line but Failed to boot • Observe: Failed to boot Error • Control: Create ticket & Send email to local contact • Time passes • Control: Disable slice creation • Observe: Local contact responds • Observe: Node is Powered-on and Running • Control: Re-enable slice creation • Control: Close ticket
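The observe/control sequence in the example above can be sketched as a small decision function. This is a minimal illustration only, not the MyOps implementation: the `Node` fields, the `control_step` function, and the action names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical observed state; field names are illustrative only.
    name: str
    online: bool = False
    booted: bool = False
    ticket_open: bool = False
    slice_creation: bool = True
    log: list = field(default_factory=list)

def control_step(node, waited_too_long=False):
    """One pass of observe -> control; returns the action chosen."""
    if not node.online:
        action = "power-on"                      # Observe: off-line
    elif not node.booted and not node.ticket_open:
        node.ticket_open = True                  # Observe: failed to boot
        action = "open-ticket-and-email"
    elif not node.booted and waited_too_long and node.slice_creation:
        node.slice_creation = False              # Time passes, no fix yet
        action = "disable-slice-creation"
    elif node.booted and not node.slice_creation:
        node.slice_creation = True               # Node is running again
        action = "enable-slice-creation"
    elif node.booted and node.ticket_open:
        node.ticket_open = False
        action = "close-ticket"
    else:
        action = "none"
    node.log.append(action)
    return action
```

Calling `control_step` repeatedly while the observed fields change reproduces the order of the slide: power-on, ticket and email, disable slice creation, re-enable slice creation, close ticket.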

History of PlanetLab Operations
Open-loop Monitoring with Open-loop Management • Collect fine-grained statistics using CoMon • Act with coarse-grained operations (e.g. Reinstall) • Manual bridge between the two
Moving towards Closed-loop Operations • Collect targeted metrics • Take directed, problem-specific actions • Automate actions based on policy

PlanetLab Operations • Close the monitor/management cycle • Direct automation of common operations • Indirect through remote contacts and incentives

MyOps Architecture • Collection from Node • Translated by policy to Automated action

MyOps Architecture • Collection from Node • Send notice to Local contact to take action

MyOps Architecture • When there is no response • Indirect influence with incentives

Collection • Operational monitoring of specific targets, such as: • Boot status, Filesystem status • DNS - internal and external • RPMs • System services, etc. • Periodic collection • Coarse-grained collection at a human timescale • Time-series of events and status
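One way to picture "time-series of events and status" is to collapse periodic samples into change events. The sketch below assumes a hypothetical `Sample` record and `to_events` helper; neither is taken from MyOps.

```python
from collections import namedtuple

# Hypothetical record for one periodic observation of one check on one node.
Sample = namedtuple("Sample", "time node check status")

def to_events(samples):
    """Keep only the samples where a (node, check) status changes."""
    last = {}
    events = []
    for s in sorted(samples, key=lambda s: s.time):
        key = (s.node, s.check)
        if last.get(key) != s.status:
            events.append(s)       # first sample or a status change
            last[key] = s.status
    return events
```

Storing only the changes keeps the series coarse-grained while still recording when each status began.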

Policy • Constraints over a time-series of events • To satisfy a constraint • Automated action • Send notice • Apply incentive • Policy defines • Preferred status of system • Frequency of actions • Magnitude of incentives
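A policy of this kind can be sketched as thresholds over how long a node has been in a non-preferred status. The thresholds, the `POLICY` table, the status string `"good"`, and the function names below are assumptions for illustration; the slides do not give actual values.

```python
from datetime import datetime, timedelta

# Hypothetical policy: (minimum time in a bad status, action to take).
# The durations are invented for this sketch.
POLICY = [
    (timedelta(days=0), "automated-action"),
    (timedelta(days=1), "send-notice"),
    (timedelta(days=7), "apply-incentive"),
]

def down_since(events):
    """events: time-ordered (datetime, status) pairs.
    Returns the start of the current non-'good' run, or None."""
    start = None
    for when, status in events:
        if status == "good":
            start = None
        elif start is None:
            start = when
    return start

def decide(events, now):
    """Pick the strongest action whose threshold has been reached."""
    start = down_since(events)
    if start is None:
        return None                # constraint already satisfied
    elapsed = now - start
    chosen = None
    for threshold, action in POLICY:
        if elapsed >= threshold:
            chosen = action
    return chosen
```

In this sketch the preferred status, the frequency of actions, and the magnitude of incentives would all live in the policy table rather than in the code.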

Automation • Automatic correction of common bootstrap problems • Communication errors with MyPLC • Corrupt filesystem repair • Retry when state is unknown • PCU Reboot • Reinstall • Automation Notices • Bad disk • Minimal hardware • Bad DNS • Bad node configuration
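The split between problems corrected automatically and problems that only produce a notice can be sketched as a routing table. The problem identifiers and the `route` function are hypothetical labels derived from the bullet text above, not names used by MyOps.

```python
# Hypothetical problem labels, grouped as on the slide.
AUTO_FIX = {"myplc-communication-error", "corrupt-filesystem"}
NOTICE_ONLY = {"bad-disk", "minimal-hardware", "bad-dns", "bad-node-configuration"}

def route(problem):
    """Decide how a detected problem is handled."""
    if problem in AUTO_FIX:
        return "automate"   # correct without human involvement
    if problem in NOTICE_ONLY:
        return "notify"     # needs someone at the site
    return "retry"          # state unknown: try again later
```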

Notices & Incentives • Notices are indirect paths to node management • Node down / online / specific problem (e.g. DNS, disk) • Site down / online • Privilege reduced / restored • PCU errors • The incentives on MyPLC • Sites 10 slices • Disable slice creation • Disable running slices
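The "privilege reduced / restored" notices suggest an escalation ladder over the incentives listed above. The sketch below is one possible reading; the level names and the `update_privilege` function are assumptions, and the step-by-step escalation order is inferred from the bullet order rather than stated in the slides.

```python
# Hypothetical privilege levels, ordered from least to most restrictive.
LEVELS = ["full", "no-slice-creation", "no-running-slices"]

def update_privilege(level, node_ok):
    """Return (new_level, notice) for one policy evaluation.
    notice is None when nothing changes."""
    idx = LEVELS.index(level)
    if node_ok:
        if idx == 0:
            return level, None
        return LEVELS[0], "privilege restored"
    if idx < len(LEVELS) - 1:
        return LEVELS[idx + 1], "privilege reduced"
    return level, None             # already at the strongest incentive
```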

Validation of Notices & Incentives [Figure: timeline divided into periods A, B, C, D, E; annotations: Kernel Bug, Fix, Fix2, Notice Bug, Fix]

Time to Restore Down Node (all issues)

Future Ideas • Generalize Configuration • Collect from multiple sources • Expose policy • Act on multiple targets • Self-monitoring • Positive Incentives • Special access to services • Additional resources (Slices, Bandwidth, CPU, etc)

Time to Reply (when there is a reply)
