MyOps

An Operational Framework for PlanetLab Deployments


Outline

  • Objective of MyOps

  • Current status

  • Future ideas

  • Questions at any time


Example of Feedback


Objective : Close Operational Cycle

  • System - Provides service (slice)

  • Monitoring - Feedback from running system

  • Operator - Interpret feedback into tasks

  • Management - Control running system


Challenges: Break-down

  • System may not deliver service

  • Monitoring may not observe useful metrics

  • Operator may not know

    • how to interpret observations

    • how to control the system

    • what the service goals are

  • Management may not control system


Requirements for Operational Systems

  • Satisfy Minimal Conditions

    • Physical Integrity

    • Interconnectivity

    • Controllable

    • Provide a Service

  • Two requirements

    • Reliably reach the final condition

    • When failures occur, repair or report automatically

  • Two approaches in MyOps

    • Precise bootstrap stages (not discussed)

    • Operational monitoring & management in platform


System: PlanetLab Slices


Monitoring Types

Open-loop monitoring

  • Identify the unknown

  • More information, fine-grained

Operational monitoring (closed-loop)

  • Correctness

  • Less information, coarse-grained

  • Actionable


Management Types

Open-loop management

  • Bootstrap/Deploy from the ground up

  • Inefficient, coarse-grained

  • No feedback

Operational management (closed-loop)

  • Tweak the system to correct behavior

  • More efficient, fine-grained


Example

  • Observe: Node is Off-Line

  • Control: Attempt to Power-On

  • Observe: Node is On-line but Failed to boot

  • Observe: Failed to boot Error

  • Control: Create ticket & Send email to local contact

  • Time passes

  • Control: Disable slice creation

  • Observe: Local contact responds

  • Observe: Node is Powered-on and Running

  • Control: Re-enable slice creation

  • Control: Close ticket
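
The observe/control sequence above can be sketched as a small rule table. This is only an illustration, not the MyOps implementation: the observation names, the action names, and the `GRACE_DAYS` threshold that stands in for "time passes" are all assumptions made for the example.

```python
from dataclasses import dataclass, field

GRACE_DAYS = 7  # assumed wait before applying an incentive; not from the slides


@dataclass
class NodeState:
    ticket_open: bool = False
    slices_enabled: bool = True
    log: list = field(default_factory=list)


def control(state, observation, days_down=0):
    """Map one observation to zero or more control actions (hypothetical rules)."""
    actions = []
    if observation == "offline":
        actions.append("power_on")
    elif observation == "boot_failed":
        if not state.ticket_open:
            # First report of the failure: involve the local contact.
            state.ticket_open = True
            actions.append("open_ticket_and_email_contact")
        elif days_down >= GRACE_DAYS and state.slices_enabled:
            # No repair after the grace period: apply the incentive.
            state.slices_enabled = False
            actions.append("disable_slice_creation")
    elif observation == "running":
        if not state.slices_enabled:
            state.slices_enabled = True
            actions.append("enable_slice_creation")
        if state.ticket_open:
            state.ticket_open = False
            actions.append("close_ticket")
    state.log.extend(actions)
    return actions


def replay():
    """Replay the trace from the slide and return the final node state."""
    state = NodeState()
    trace = [
        ("offline", 0),
        ("boot_failed", 0),
        ("boot_failed", GRACE_DAYS),
        ("running", 0),
    ]
    for observation, days_down in trace:
        control(state, observation, days_down)
    return state
```

Calling `replay()` walks the same steps as the slide: each observation either triggers a direct action, involves the local contact, or applies and later lifts an incentive.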


History of PlanetLab Operations

Open-loop Monitoring with Open-loop Management

  • Collect fine-grained statistics using CoMon

  • Act with coarse-grained operations (e.g. Reinstall)

  • Manual bridge between the two

Moving towards Closed-loop Operations

  • Collect targeted metrics

  • Take directed, problem-specific actions

  • Automate actions based on policy


PlanetLab Operations

  • Close the monitor/management cycle

  • Direct automation of common operations

  • Indirect through remote contacts and incentives


MyOps Architecture

  • Collection from Node

  • Translated by policy to Automated action


MyOps Architecture

  • Collection from Node

  • Send notice to Local contact to take action


MyOps Architecture

  • When there is no response

  • Indirect influence with incentives


Collection

  • Operational monitoring of specific targets, such as:

    • Boot status, Filesystem status

    • DNS - internal and external

    • RPMs

    • System services, etc

  • Periodic collection

    • Coarse-grained collection at a human-timescale

    • Time-series of events and status
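
One way to picture periodic collection is as a list of status samples from which events are derived whenever a value changes. The sketch below is hypothetical: `Sample`, `collect`, `events`, and the probe names are illustrative and are not taken from MyOps.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: int  # seconds since an arbitrary epoch
    node: str
    status: dict    # probe name -> observed value


def collect(node, probes, timestamp):
    """Run every probe once and record the results as one coarse-grained sample."""
    return Sample(timestamp, node, {name: probe() for name, probe in probes.items()})


def events(samples):
    """Derive events from a time-ordered list of samples.

    Returns a (timestamp, probe, old_value, new_value) tuple for every
    probe whose value differs from the previous sample.
    """
    changes = []
    for prev, cur in zip(samples, samples[1:]):
        for name, new in cur.status.items():
            old = prev.status.get(name)
            if old != new:
                changes.append((cur.timestamp, name, old, new))
    return changes
```

With probes for boot status and DNS sampled once an hour, a node that drops from `"boot"` to `"debug"` between two samples would yield a single event for the `boot` probe, which is the kind of input a policy can act on.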


Policy

  • Constraints over a time-series of events

  • To satisfy a constraint

    • Automated action

    • Send notice

    • Apply incentive

  • Policy defines

    • Preferred status of system

    • Frequency of actions

    • Magnitude of incentives
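
A policy of this shape could be expressed as data plus a small evaluator. The following is a minimal sketch under stated assumptions: the `Policy` fields, the escalation thresholds, and the action names are invented for illustration and do not describe the actual MyOps policy format.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds


@dataclass(frozen=True)
class Policy:
    preferred_status: str  # the status the system should be in
    min_interval: int      # minimum seconds between actions (frequency)
    steps: tuple           # (violation duration, action) pairs, ascending


def evaluate(policy, series, now, last_action_at=None):
    """Pick the action for a node, or None if nothing should be done.

    `series` is a time-ordered list of (timestamp, status) pairs.
    """
    if not series or series[-1][1] == policy.preferred_status:
        return None  # constraint already satisfied
    if last_action_at is not None and now - last_action_at < policy.min_interval:
        return None  # acted too recently
    # Find when the current run of non-preferred status began.
    violated_since = series[-1][0]
    for timestamp, status in reversed(series):
        if status == policy.preferred_status:
            break
        violated_since = timestamp
    duration = now - violated_since
    chosen = None
    for threshold, action in policy.steps:
        if duration >= threshold:
            chosen = action  # keep the strongest step that applies
    return chosen


EXAMPLE_POLICY = Policy(
    preferred_status="boot",
    min_interval=DAY,
    steps=((0, "automated_action"), (DAY, "send_notice"), (7 * DAY, "apply_incentive")),
)
```

Here the length of the violation selects the response, so the same policy yields an automated action at first, a notice after a day, and an incentive after a week, while `min_interval` limits how often any action is taken.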


Automation

  • Automatic correction of common bootstrap problems

    • Communication errors with MyPLC

    • Corrupt filesystem repair

    • Retry when state is unknown

    • PCU Reboot

    • Reinstall

  • Automation Notices

    • Bad disk

    • Minimal hardware

    • Bad DNS

    • Bad node configuration


Notices & Incentives

  • Notices are indirect paths to node management

    • Node down / online / specific problem (e.g. DNS, disk)

    • Site down / online

    • Privilege reduced / restored

    • PCU errors

  • The incentives on MyPLC

    • Sites 10 slices

    • Disable slice creation

    • Disable running slices


Validation of Notices & Incentives

[Figure: validation chart with periods labeled A through E and annotated events: Kernel Bug, Fix, Fix2, Notice Bug, Fix]


Time to Restore Down Node (all issues)


Future Ideas

  • Generalize Configuration

    • Collect from multiple sources

    • Expose policy

    • Act on multiple targets

  • Self-monitoring

  • Positive Incentives

    • Special access to services

    • Additional resources (Slices, Bandwidth, CPU, etc)


Time to Reply (when there is a reply)

