RM3G: Next Generation Recovery Manager

Steve Zhang and Armando Fox

Stanford University

Design Goals


  • Overall Goal: Manage the detection of and recovery from system failures
  • New in 3G: Focus on online Statistical Learning Theory (SLT) algorithms for application-generic failure detection
    • The previous generation used end-to-end and exception monitors
  • Avoid tying ourselves to any particular algorithm, and make new algorithms easy to plug in
    • Standardize the APIs for observation, analysis, and control of system components
    • Provide common services and abstractions to SLT algorithms
  • RM itself must also be resilient to failures
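
The plug-in goal above could be sketched roughly as follows. This is a minimal Python illustration, not the RM3G API: the `SLTPlugin` interface, the `register_plugin` registry, and the `ThresholdDetector` class are all hypothetical names, and the threshold check is a toy stand-in for a real SLT algorithm.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class SLTPlugin(ABC):
    """Hypothetical common interface that every detection algorithm implements."""

    @abstractmethod
    def observe(self, component: str, sample: Dict[str, float]) -> None:
        """Consume one sample from an observation point."""

    @abstractmethod
    def is_anomalous(self, component: str) -> bool:
        """Report whether the component currently looks faulty."""


# Hypothetical registry: the recovery manager looks algorithms up by name.
PLUGINS: Dict[str, Type[SLTPlugin]] = {}


def register_plugin(name: str):
    """Class decorator that adds an algorithm to the registry."""
    def decorator(cls: Type[SLTPlugin]) -> Type[SLTPlugin]:
        PLUGINS[name] = cls
        return cls
    return decorator


@register_plugin("threshold")
class ThresholdDetector(SLTPlugin):
    """Toy stand-in (not an SLT algorithm): flags memory above a fixed limit."""

    def __init__(self, limit_mb: float = 512.0) -> None:
        self.limit_mb = limit_mb
        self.last: Dict[str, float] = {}

    def observe(self, component: str, sample: Dict[str, float]) -> None:
        self.last[component] = sample.get("memory_mb", 0.0)

    def is_anomalous(self, component: str) -> bool:
        return self.last.get(component, 0.0) > self.limit_mb


detector = PLUGINS["threshold"]()
detector.observe("Comp A", {"memory_mb": 900.0})
detector.observe("Comp B", {"memory_mb": 100.0})
print(detector.is_anomalous("Comp A"), detector.is_anomalous("Comp B"))
```

Because the manager only depends on the abstract interface, swapping in a different algorithm would mean registering another class under a new name.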



RADS Architecture

[Figure: layered architecture diagram. Recoverable labels, in order of appearance: SLT Services; Overlay Network; Commodity Internet & IP networks]


Design Diagram

[Figure: design diagram. Recoverable labels: Comp A, Comp B, Comp C; SLT Processes (spawned by SLT Proc Srv); Ctrl/Obsrv point descriptors; Control policies; Observation Points; Control Points; SLT Plug-ins; Data Store Srv; SLT Select Srv; Ctrl Srv; Proc Srv; Name & Reg Srv]


Collaboration with ACME
  • ACME is an infrastructure for monitoring, analyzing, and controlling Internet-scale systems
    • Sensors = Observation Points
    • Actuators = Control Points
  • RM potentially benefits from two ACME features
    • An in-network aggregator combines data from sensors as they are routed through an overlay network
    • Configuration language that specifies under what conditions to trigger actuators
  • ACME could benefit from more powerful sensor data analysis using SLTs
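
To make the sensor/actuator pairing above concrete, here is a small Python sketch of aggregating sensor readings and firing an actuator when a condition holds. It does not use ACME's aggregator or its configuration language; `aggregate`, `maybe_trigger`, the node names, and the 0.9 threshold are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List


def aggregate(readings: Dict[str, float]) -> float:
    """Combine per-node sensor readings into one value (here, the mean)."""
    return mean(readings.values())


def maybe_trigger(
    readings: Dict[str, float],
    condition: Callable[[float], bool],
    actuator: Callable[[float], None],
) -> bool:
    """Fire the actuator if the aggregated reading satisfies the condition."""
    value = aggregate(readings)
    if condition(value):
        actuator(value)
        return True
    return False


# Hypothetical example: record an action when average load exceeds 0.9.
fired: List[float] = []
high_load = {"node1": 1.0, "node2": 1.0, "node3": 1.0}
low_load = {"node1": 0.25, "node2": 0.25, "node3": 0.25}

print(maybe_trigger(high_load, lambda v: v > 0.9, fired.append))
print(maybe_trigger(low_load, lambda v: v > 0.9, fired.append))
```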
Observation Points
  • We want to avoid requiring every component to be individually instrumented
    • Components may directly provide their own observation data if they wish (e.g. D-store and SSM provide their own data for monitoring with Pinpoint)
  • Several types of observation data can be collected in an application-generic way
    • The OS can provide application-level data (e.g. memory usage, number of open files) and system-level data (e.g. size of swap space, network ports in use)
    • Middleware can provide intra-application data (e.g. interactions between different components of an application)
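
As an illustration of application-generic observation, the sketch below reads a few process and host values without touching application code. The `ObservationPoint` descriptor and the chosen metrics are assumptions: the slides mention memory usage and open files, while this example uses portable standard-library stand-ins (CPU time, thread count, CPU count) so that it runs on any platform.

```python
import os
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ObservationPoint:
    """Hypothetical descriptor: a named metric plus a function that reads it."""
    name: str
    level: str  # "application" or "system"
    read: Callable[[], float]


def default_points() -> List[ObservationPoint]:
    """Metrics obtainable from the runtime/OS with no per-component instrumentation."""
    return [
        ObservationPoint("cpu_seconds", "application", time.process_time),
        ObservationPoint("thread_count", "application",
                         lambda: float(threading.active_count())),
        ObservationPoint("cpu_count", "system",
                         lambda: float(os.cpu_count() or 0)),
    ]


def sample(points: List[ObservationPoint]) -> Dict[str, float]:
    """Take one reading from every observation point."""
    return {p.name: p.read() for p in points}


print(sample(default_points()))
```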
SLT Data Services
  • Abstracts information from observation points
    • SLT algorithms are spawned for each component in the system as the component is instantiated
    • Observation data is stored by the SLT Data Server, possibly in a streaming database
  • Listens for feedback from SLT algorithms to adjust the data stream as necessary
    • Increase data sampling rate if anomaly is suspected
    • Stop reporting certain data if it is deemed to be irrelevant
  • Provides persistent data storage for SLT algorithms
    • Remember properties learned from previous analysis of observation data
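
The feedback loop described above might look like the following sketch. The `ObservationStream` class and its method names are hypothetical, the halving of the sampling interval is an arbitrary policy chosen for the example, and the in-memory `learned` dictionary only stands in for real persistent storage.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ObservationStream:
    """Hypothetical per-component data stream that SLT algorithms can tune."""
    interval_s: float = 10.0       # current time between samples
    min_interval_s: float = 1.0    # never sample faster than this
    muted: Set[str] = field(default_factory=set)
    learned: Dict[str, float] = field(default_factory=dict)  # stand-in for persistent storage

    def suspect_anomaly(self) -> None:
        """Feedback: sample more often (here, halve the interval, down to a floor)."""
        self.interval_s = max(self.min_interval_s, self.interval_s / 2)

    def mark_irrelevant(self, metric: str) -> None:
        """Feedback: stop reporting a metric the algorithm does not need."""
        self.muted.add(metric)

    def filter(self, sample: Dict[str, float]) -> Dict[str, float]:
        """Drop muted metrics before delivering a sample."""
        return {k: v for k, v in sample.items() if k not in self.muted}

    def remember(self, key: str, value: float) -> None:
        """Keep a property learned from earlier analysis."""
        self.learned[key] = value


stream = ObservationStream()
stream.suspect_anomaly()
stream.mark_irrelevant("cpu_count")
stream.remember("baseline_memory_mb", 240.0)
print(stream.interval_s, stream.filter({"cpu_count": 8.0, "memory_mb": 300.0}))
```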
Control Points
  • Assumes crash-only components
    • Components can be reliably restarted through external means (can’t rely on components restarting themselves cleanly)
  • Initially, only restart control points are supported
    • Instrument application server (JBoss) to restart applications and application components
    • OS can restart application servers
    • IP addressable power strips can restart entire nodes
  • Components can specify a custom control policy
    • Leverage ACME’s configuration language
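
The three restart control points listed above (application component, application server, whole node) suggest a simple escalation policy, sketched here in Python. The escalation order, the `recover` function, and the simulated restart actions are assumptions for illustration; no real JBoss, OS, or power-strip calls are made.

```python
from typing import Callable, Dict, List, Tuple

RestartAction = Callable[[], None]


def recover(
    levels: List[Tuple[str, RestartAction]],
    healthy: Callable[[], bool],
) -> str:
    """Try restarts from the cheapest level upward; return the level that worked."""
    for name, restart in levels:
        restart()          # restart is driven externally, as crash-only design assumes
        if healthy():
            return name
    return "unrecovered"


# Simulated system: it only becomes healthy after the second restart.
state: Dict[str, int] = {"restarts": 0}


def fake_restart() -> None:
    state["restarts"] += 1


def is_healthy() -> bool:
    return state["restarts"] >= 2


levels = [
    ("application component", fake_restart),
    ("application server", fake_restart),
    ("node", fake_restart),
]
result = recover(levels, is_healthy)
print(result)
```

Trying the narrowest restart first keeps recovery cheap when the fault is local, and the wider restarts act as a fallback.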
Future Work
  • “Master” SLT
    • Multiple SLTs are run for each component. Choosing which SLTs to believe is itself an interesting SLT problem.
  • Support additional types of control points
    • Multi-level settings that tune component parameters (e.g. filter level)
  • Support additional types of observation points
    • Use programming language techniques (e.g. source code transformation) to instrument applications in a generic way
  • Online SLT algorithms for anomaly detection are not mature
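
For the "Master" SLT item above, one classical option for deciding which detectors to believe is a weighted-majority vote, where detectors that turn out to be wrong lose weight. The slides do not say which approach the authors had in mind; the `MasterVoter` class, the detector names, and the `beta` penalty below are assumptions for illustration.

```python
from typing import Dict, Iterable


class MasterVoter:
    """Weighted-majority combiner over several detectors' anomaly votes."""

    def __init__(self, detectors: Iterable[str], beta: float = 0.5) -> None:
        self.weights: Dict[str, float] = {name: 1.0 for name in detectors}
        self.beta = beta  # multiplicative penalty for a wrong vote (0 < beta < 1)

    def decide(self, votes: Dict[str, bool]) -> bool:
        """Return True (anomaly) if the 'anomaly' votes outweigh the rest."""
        yes = sum(self.weights[n] for n, v in votes.items() if v)
        no = sum(self.weights[n] for n, v in votes.items() if not v)
        return yes > no

    def feedback(self, votes: Dict[str, bool], truth: bool) -> None:
        """Once the true outcome is known, shrink the weight of wrong detectors."""
        for name, vote in votes.items():
            if vote != truth:
                self.weights[name] *= self.beta


voter = MasterVoter(["a", "b", "c"], beta=0.25)
votes = {"a": True, "b": False, "c": False}
before = voter.decide(votes)       # "a" is outvoted two to one
voter.feedback(votes, truth=True)  # "b" and "c" were wrong and are down-weighted
after = voter.decide(votes)        # "a" alone now outweighs "b" and "c"
print(before, after)
```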