RM3G: Next Generation Recovery Manager

Steve Zhang and Armando Fox, Stanford University


Presentation Transcript


  1. RM3G: Next Generation Recovery Manager. Steve Zhang and Armando Fox, Stanford University

  2. Design Goals
     • Overall goal: manage the detection of and recovery from system failures
     • New in 3G: focus on online Statistical Learning Theory (SLT) algorithms for application-generic failure detection
       • The previous generation used end-to-end and exception monitors
     • Avoid tying ourselves to any particular algorithm, and make new algorithms easy to plug in
     • Standardize the APIs for observation, analysis, and control of system components
     • Provide common services and abstractions to SLT algorithms
     • The RM itself must also be resilient to failures
     (Diagram labels: SLTs, RM3G, Comp)
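
The plug-in goal above can be illustrated with a small registry. This is a minimal sketch, not the RM3G API: the names `SLTPlugin`, `REGISTRY`, `register`, and `ThresholdDetector` are hypothetical, and the fixed-threshold check is only a stand-in for a real SLT algorithm.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional, Type


class SLTPlugin(ABC):
    """Hypothetical interface every detection algorithm would implement."""

    @abstractmethod
    def observe(self, sample: Dict[str, float]) -> None:
        """Consume one observation sample."""

    @abstractmethod
    def is_anomalous(self) -> bool:
        """Report whether the latest data looks like a failure."""


# Name -> plug-in class. The recovery manager only depends on this table,
# so adding an algorithm does not require changing the manager itself.
REGISTRY: Dict[str, Type[SLTPlugin]] = {}


def register(name: str) -> Callable[[Type[SLTPlugin]], Type[SLTPlugin]]:
    """Class decorator that adds a plug-in to the registry under `name`."""
    def wrap(cls: Type[SLTPlugin]) -> Type[SLTPlugin]:
        REGISTRY[name] = cls
        return cls
    return wrap


@register("threshold")
class ThresholdDetector(SLTPlugin):
    """Stand-in detector (assumption): flags one metric above a fixed limit."""

    def __init__(self, metric: str = "mem_mb", limit: float = 512.0) -> None:
        self.metric = metric
        self.limit = limit
        self.last: Optional[float] = None

    def observe(self, sample: Dict[str, float]) -> None:
        self.last = sample.get(self.metric)

    def is_anomalous(self) -> bool:
        return self.last is not None and self.last > self.limit
```

A caller would look a detector up by name, for example `REGISTRY["threshold"]()`, and feed it samples without knowing which algorithm sits behind the interface.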

  3. RADS Architecture (diagram; labels: Client, Server, User, Operator, Distributed Middleware, SLT Services (RM3G), Application-Specific Overlay Network, PNE, Edge Network, Router, Commodity Internet & IP networks)

  4. Design Diagram (figure; labels: Comp A, Comp B, Comp C; SLT Processes Spawned by SLT Proc Srv; Ctrl/Obsrv point descriptors; Control policies; Observation Points; Control Points; SLT Plug-ins; Data Store Srv; SLT Select Srv; Ctrl Srv; RM Proc Srv; RMDB; Name & Reg Srv)

  5. Collaboration with ACME
     • Infrastructure for monitoring, analyzing, and controlling Internet-scale systems
       • Sensors = Observation Points
       • Actuators = Control Points
     • RM potentially benefits from two ACME features
       • An in-network aggregator combines data from sensors as they are routed through an overlay network
       • A configuration language that specifies under what conditions to trigger actuators
     • ACME could benefit from more powerful sensor data analysis using SLTs
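
The idea of triggering actuators under specified conditions can be sketched as condition/action pairs. This does not reproduce ACME's configuration language; `Rule`, `evaluate`, and the `load` metric are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Reading = Dict[str, float]


@dataclass
class Rule:
    """Hypothetical rule: fire `actuator` when `condition` holds."""
    condition: Callable[[Reading], bool]
    actuator: Callable[[], None]


def evaluate(rules: List[Rule], reading: Reading) -> int:
    """Run every rule against one sensor reading; return how many fired."""
    fired = 0
    for rule in rules:
        if rule.condition(reading):
            rule.actuator()
            fired += 1
    return fired


# Usage: record a restart request when the (assumed) load metric exceeds 0.9.
log: List[str] = []
rules = [Rule(lambda r: r["load"] > 0.9, lambda: log.append("restart"))]
evaluate(rules, {"load": 0.95})
```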

  6. Observation Points
     • We want to avoid requiring every component to be individually instrumented
       • Components may directly provide their own observation data if they wish (e.g. D-store and SSM provide their own data for monitoring with Pinpoint)
     • Several types of observation data can be collected in an application-generic way
       • The OS can provide application-level data (e.g. memory usage, number of files open) and system-level data (e.g. size of swap space, network ports used)
       • Middleware can provide intra-application data (e.g. interaction between different components of an application)
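
As an illustration of application-generic observation, the sketch below gathers a few OS-provided values using only the Python standard library. It samples the current process for simplicity, which is an assumption: a real observation point would report on other components. The function name `sample_process` and the dictionary keys are hypothetical.

```python
import os
import threading
import time

try:
    import resource  # available on Unix-like systems only
except ImportError:
    resource = None


def sample_process() -> dict:
    """Return one observation sample without any application-specific hooks."""
    obs = {
        "pid": os.getpid(),
        "timestamp": time.time(),
        "cpu_seconds": time.process_time(),   # CPU time used by this process
        "threads": threading.active_count(),  # live Python threads
    }
    if resource is not None:
        usage = resource.getrusage(resource.RUSAGE_SELF)
        # Peak resident set size; the unit is platform dependent
        # (kilobytes on Linux, bytes on macOS).
        obs["max_rss"] = usage.ru_maxrss
    return obs
```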

  7. SLT Data Services
     • Abstracts information from observation points
       • SLT algorithms are spawned for each component in the system as the components are instantiated
       • Observation data is stored by the SLT Data Server, possibly in a streaming database
     • Listens for feedback from SLT algorithms to adjust the data stream as necessary
       • Increase the data sampling rate if an anomaly is suspected
       • Stop reporting certain data if it is deemed irrelevant
     • Provides persistent data storage for SLT algorithms
       • Remember properties learned from previous analysis of observation data
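
The feedback loop described above can be sketched as a small stream object that an SLT algorithm adjusts. The class `ObservationStream`, the halving policy, and the default intervals are all assumptions made for illustration; the slides do not specify how the rate is changed.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ObservationStream:
    """Hypothetical per-component data stream tuned by SLT feedback."""
    interval_s: float = 10.0      # seconds between samples
    min_interval_s: float = 1.0   # fastest sampling rate allowed
    muted: Set[str] = field(default_factory=set)

    def suspect_anomaly(self) -> None:
        """Sample twice as often (assumed policy), bounded by the minimum."""
        self.interval_s = max(self.min_interval_s, self.interval_s / 2)

    def mark_irrelevant(self, metric: str) -> None:
        """Stop reporting a metric the algorithm found uninformative."""
        self.muted.add(metric)

    def filter(self, sample: Dict[str, float]) -> Dict[str, float]:
        """Drop muted metrics from a sample before it is delivered."""
        return {k: v for k, v in sample.items() if k not in self.muted}
```

Halving the interval is only one possible policy; the point is that the algorithm, not the component, decides how much data it needs.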

  8. Control Points
     • Assumes crash-only components
       • Components can be reliably restarted through external means (can’t rely on components restarting themselves cleanly)
     • Initially, only restart control points are supported
       • Instrument the application server (JBoss) to restart applications and application components
       • The OS can restart application servers
       • IP-addressable power strips can restart entire nodes
     • Components can specify a custom control policy
       • Leverage ACME’s configuration language
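
The three restart scopes listed above (component, application server, node) suggest an escalation order. The slides do not state such a policy, so the sketch below is an assumption: it tries the cheapest restart first and widens the scope only if the previous attempt did not restore health. `escalate` and the stub actions are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A restart action returns True if the component is healthy afterwards.
RestartAction = Callable[[], bool]


def escalate(levels: List[Tuple[str, RestartAction]]) -> Optional[str]:
    """Try restart scopes in order; return the name of the first that works."""
    for name, restart in levels:
        if restart():
            return name
    return None  # every scope failed; a human operator would be needed


# Usage with stub actions standing in for real control points.
attempts: List[str] = []


def stub(name: str, succeeds: bool) -> RestartAction:
    def action() -> bool:
        attempts.append(name)
        return succeeds
    return action


levels = [
    ("component", stub("component", False)),
    ("app_server", stub("app_server", True)),
    ("node", stub("node", True)),
]
recovered_at = escalate(levels)
```

Because the components are assumed to be crash-only, each action is an external restart rather than a request for the component to shut itself down.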

  9. Future Work
     • “Master” SLT
       • Multiple SLTs are run for each component. Choosing which SLTs to believe is itself an interesting SLT problem.
     • Support additional types of control points
       • Multiple-level settings that tune component parameters (e.g. filter level)
     • Support additional types of observation points
       • Use programming language techniques (e.g. source code transformation) to instrument applications in a generic way
     • Online SLT algorithms for anomaly detection are not mature
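
One well-known way to decide which of several detectors to believe is the weighted majority algorithm from online learning. The slides do not name a method for the “Master” SLT, so this is a swapped-in technique shown only as a sketch; the class name, detector names, and the penalty factor `beta` are assumptions.

```python
from typing import Dict, Iterable


class WeightedMajority:
    """Combine boolean detector votes, trusting detectors that were right."""

    def __init__(self, detectors: Iterable[str], beta: float = 0.5) -> None:
        self.weights: Dict[str, float] = {d: 1.0 for d in detectors}
        self.beta = beta  # multiplicative penalty for a wrong vote

    def predict(self, votes: Dict[str, bool]) -> bool:
        """Return True (anomaly) if the weighted 'yes' votes outweigh 'no'."""
        yes = sum(w for d, w in self.weights.items() if votes[d])
        no = sum(w for d, w in self.weights.items() if not votes[d])
        return yes > no

    def update(self, votes: Dict[str, bool], truth: bool) -> None:
        """Shrink the weight of every detector that disagreed with the truth."""
        for d in self.weights:
            if votes[d] != truth:
                self.weights[d] *= self.beta
```

This requires some ground truth after the fact (for example, whether a restart was actually needed), which is itself an assumption about what feedback the recovery manager can obtain.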