
ROC@Stanford Progress Report

ROC@Stanford Progress Report. Armando Fox with George Candea, James Cutler, Ben Ling, Andy Huang. Philosophical Direction: use only dynamic, observed behavior to determine recovery technique/policy; application-independent recovery techniques; specialize designs for fast recovery.


Presentation Transcript


  1. ROC@Stanford Progress Report
     Armando Fox with George Candea, James Cutler, Ben Ling, Andy Huang

  2. Philosophical Direction
     • Use only dynamic, observed behavior to determine recovery technique/policy
     • Application-independent recovery techniques
     • Specialize designs for fast recovery
     • Putting it all together: all software should be crash-only

  3. Dynamic, Observed Behavior
     • A priori fault models are suspect. Base recovery strategy only on dynamically observed behavior.
     • Behavior may change as system or workload evolves => addresses a key difference between Internet-oriented ROC systems and traditional mission-critical systems
     • Kinds of observations
       • PinPoint: use statistical analysis to determine which groups of components are correlated with observed external faults
       • Automatic failure-propagation inference: use fault injection and tracing to determine propagation paths and extent of different kinds of faults
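
The idea of correlating components with observed external faults can be illustrated with a minimal sketch. This is a deliberately simplified stand-in, not PinPoint's actual analysis: it only computes, per component, the fraction of requests touching that component that failed. The function name `failure_correlation`, the trace format, and the component names are all assumptions made for illustration.

```python
from collections import defaultdict


def failure_correlation(traces):
    """traces: iterable of (components_used, failed) pairs, one per request.

    Returns {component: fraction of requests using it that failed}.
    """
    used = defaultdict(int)
    failed_with = defaultdict(int)
    for components, failed in traces:
        for c in set(components):
            used[c] += 1
            if failed:
                failed_with[c] += 1
    return {c: failed_with[c] / used[c] for c in used}


# Hypothetical observed requests; component names are invented.
traces = [
    ({"web", "cart", "db"}, True),
    ({"web", "catalog", "db"}, False),
    ({"web", "cart"}, True),
    ({"web", "catalog"}, False),
]
scores = failure_correlation(traces)
suspect = max(scores, key=scores.get)  # component most correlated with failures
```

Only externally observed outcomes feed the score; no fault model of the individual components is assumed.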

  4. Making techniques application-generic
     • True application-generic recovery is hard [Lowell & Chen]
     • But that's because "generic" applications are too unconstrained
     • Idea: if an application uses a particular "rich runtime", that runtime may constrain application structure
     • Example: J2EE, a widely used enterprise application framework
       • Modular Java applications, well-defined component boundaries
       • Rich runtime system ("application server") provides services for deployment/undeployment, naming, load balancing, integration with Web servers & databases, etc.
     • Instrument the platform with generic methods for fault injection and recovery (e.g., using Recursive Restartability)
       • Generic mechanisms: timeouts, exception propagation
       • Parametrizable mechanisms: progress counters, application-level pings
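
A progress counter, one of the parametrizable mechanisms listed above, might look roughly like the following. This is a sketch under assumptions: the `ProgressMonitor` class, its `stall_limit` parameter, and the polling model are hypothetical and not taken from the slides or from any J2EE application server.

```python
class ProgressMonitor:
    """Flags a component as hung when its counter stops advancing."""

    def __init__(self, stall_limit):
        # Number of consecutive checks without progress that is tolerated.
        self.stall_limit = stall_limit
        self.last_seen = {}
        self.stalled = {}

    def check(self, name, counter):
        """Record the latest counter for `name`; return True if it looks hung."""
        if self.last_seen.get(name) == counter:
            self.stalled[name] = self.stalled.get(name, 0) + 1
        else:
            self.stalled[name] = 0
        self.last_seen[name] = counter
        return self.stalled[name] >= self.stall_limit


monitor = ProgressMonitor(stall_limit=2)
readings = [5, 6, 6, 6, 7]  # counter values reported on successive checks
verdicts = [monitor.check("orders", r) for r in readings]
```

The mechanism is generic because the runtime only compares opaque counters; the single application-specific parameter is how many stalled checks to tolerate.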

  5. Example: Automatic Failure Propagation Inference
     • When a failure occurs in a particular software component of an application, how far does it propagate?
       • i.e., what part(s) of the application must be recovered
     • Traditionally, failure propagation information is derived by hand
     • Our approach: modify J2EE application server to allow capture of failure-propagation information in any J2EE app
     • Automatic Failure-Propagation Inference (AFPI) for JBoss:
       + automatically and dynamically generates f-maps with no performance overhead
       + no application knowledge required
       + finds dependencies that other analyses might miss, omits "false" dependencies that don't result in actual failure propagation
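
The fault-injection approach to building an f-map can be sketched on a toy call graph. Everything here is an assumption for illustration: the `CALLS` table, the component names, and the `invoke` and `infer_f_map` helpers are hypothetical and do not reflect the JBoss instrumentation itself. The sketch injects a fault into one component at a time and records which other components are observed to fail.

```python
# Hypothetical application: caller -> [(callee, caller_tolerates_callee_failure)]
CALLS = {
    "frontend": [("cart", False), ("ads", True)],
    "cart": [("db", False)],
    "ads": [],
    "db": [],
}


def invoke(name, faulty, failed):
    """Run component `name`; add it to `failed` and return False if it fails."""
    ok = name != faulty  # the injected fault makes this component fail
    for callee, tolerated in CALLS[name]:
        callee_ok = invoke(callee, faulty, failed)
        if not callee_ok and not tolerated:
            ok = False  # failure propagates to the caller
    if not ok:
        failed.add(name)
    return ok


def infer_f_map(entry_points):
    """Inject a fault into each component in turn and observe who else fails."""
    f_map = {}
    for faulty in CALLS:
        failed = set()
        for entry in entry_points:
            invoke(entry, faulty, failed)
        f_map[faulty] = failed - {faulty}
    return f_map


f_map = infer_f_map(["frontend"])
```

In this toy, `frontend` calls `ads` but tolerates its failure, so the inferred map has no edge from `ads` to `frontend`, which is the kind of "false" dependency a static call graph would report.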

  6. Design for Fast Recovery
     • Recursive Restartability as a technique for recovery assumes...
       • For correctness: all components are independent and restartable (i.e., no data loss or other bad effects)
       • For performance: restarts are relatively fast
     • For stateless components, this is "easy"; what about stateful components?
       • Correctness: e.g., filesystems may suffer data loss if the OS is not cleanly shut down
       • Performance: e.g., commercial RDBMSs are crash-safe, but take a long time (minutes to hours) to recover
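
One way to picture Recursive Restartability is as a restart tree: restart the smallest subtree containing the suspect component, and escalate to the parent only if the failure persists. The sketch below assumes that reading; the `Node` class, the `recover` function, and the tree shape are hypothetical, and the restart and health-check callbacks are stubs.

```python
class Node:
    """A node in a hypothetical restart tree; leaves are restartable components."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def leaves(self):
        if not self.children:
            return [self.name]
        return [leaf for c in self.children for leaf in c.leaves()]


def recover(node, restart, is_healthy):
    """Restart the subtree at `node`; move up one level while the failure persists."""
    attempts = []
    while node is not None:
        restart(node.leaves())
        attempts.append(node.name)
        if is_healthy():
            break
        node = node.parent
    return attempts


a, b, c = Node("a"), Node("b"), Node("c")
ab = Node("ab", [a, b])
root = Node("root", [ab, c])

restarted = []


def restart(names):
    restarted.append(set(names))


def is_healthy():
    # Stub: pretend the fault clears only once "a" and "b" restart together.
    return {"a", "b"} <= restarted[-1]


attempts = recover(a, restart, is_healthy)
```

Cheap restarts are tried first, which is why the technique depends on each restart being fast and free of side effects.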

  7. Fast-Recovering State Stores
     • Isolate state exclusively in state store components; make all other "application logic" components stateless
     • Instead of building a general state store, specialize it for its intended use
     • Goal: identify combination of specializations that facilitates construction of a very-large-scale state store (O(10^3) requests/sec on O(10^6) entries) with near-zero recovery time
     • Possible axes for specialization…
       • Is state shared across clients or not? (user profile/session state vs. updating a message board)
       • How powerful must the query API be? (single-key lookup, free-text search, fully relational…)
       • What is the intended lifetime of state? (short/session, long/forever)

  8. Putting it together: crash-only software
     • Already assumed: software must be able to recover from a crash rapidly and correctly
     • But if it can do that… then why include separate code paths for "clean shutdown"?
     • All software should be crash-only; this makes it robust, easy to administer/upgrade, and amenable to RR as a recovery technique (among others)
     • Current explorations:
       • RR-ifying the platform (J2EE appserver) vs. individual applications
       • Improving ability to detect anomalies and failure correlations using path-based statistical analysis
       • Designing crash-only state stores for both session state and persistent state
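
The crash-only idea (no separate clean-shutdown path; startup is always recovery) can be sketched with a tiny single-key store. This is an illustrative assumption, not the state store design the slides refer to: the `CrashOnlyStore` class, its append-only JSON log format, and the file name are all hypothetical.

```python
import json
import os
import tempfile


class CrashOnlyStore:
    """Single-key store with no shutdown method: opening it always runs recovery."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._recover()
        self.log = open(path, "a", encoding="utf-8")

    def _recover(self):
        if not os.path.exists(self.path):
            return
        good = 0  # byte offset just past the last complete record
        with open(self.path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # torn final record from a crash mid-write
                try:
                    key, value = json.loads(raw)
                except ValueError:
                    break  # corrupt record: stop replaying here
                self.data[key] = value
                good += len(raw)
        os.truncate(self.path, good)  # discard anything after the last good record

    def put(self, key, value):
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # record is durable before we acknowledge it
        self.data[key] = value


path = os.path.join(tempfile.mkdtemp(), "store.log")
store = CrashOnlyStore(path)
store.put("user", "alice")

# Simulate a crash: the OS closes the descriptor and a half-written record remains.
store.log.close()
with open(path, "ab") as f:
    f.write(b'["half", ')

recovered = CrashOnlyStore(path)  # the only way to start is to recover
recovered.put("visits", 1)
reopened = CrashOnlyStore(path)
```

Because there is exactly one startup path, the recovery code is exercised on every start rather than only after rare failures.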

  9. Outrageous Opinions session tomorrow
     • Tomorrow after dinner: controversial ideas/opinions, open challenges, predicting the future, ...
     • Please sign up on easel (coming this afternoon)
     • ~5-8 minutes per person to pound the pulpit and stimulate later discussion
     • Retreat proceedings, slides, etc. (mostly) online
       • Internet keyword "retreat" :-) or http://retreat or 10.0.0.1
