
Detecting, Managing, and Diagnosing Failures with FUSE

Detecting, Managing, and Diagnosing Failures with FUSE. John Dunagan, Juhan Lee (MSN), Alec Wolman. WIP. Goals & Target Environment: improve the ability of large internet portals to gain insight into failures. Non-goals: masking failures; using machine learning to infer abnormal behavior.

Presentation Transcript


  1. Detecting, Managing, and Diagnosing Failures with FUSE • John Dunagan, Juhan Lee (MSN), Alec Wolman • WIP

  2. Goals & Target Environment • Improve the ability of large internet portals to gain insight into failures • Non-goals: • masking failures • using machine learning to infer abnormal behavior

  3. MSN Background • Messenger, www.msn.com, Hotmail, Search, many other “properties” • Large (> 100 million users) • Sources of Complexity: • multiple data-centers • large # of machines • complex internal network topology • diversity of applications and software infrastructure

  4. The Plan • Detecting, managing, and diagnosing failures • Review MSN’s current approaches • Describe our solution at a high level

  5. Detecting Failures • Monitor system availability with heartbeats • Monitor application availability & quality of service using synthetic requests • Customer complaints (telephone, email) • Problems: • These approaches provide limited coverage: it is harder to catch failures that don’t affect every request • Data on detected failures often lacks the detail needed to suggest a remedy: • which front end is flaky? • which app component caused the end-user failure?

  6. Managing Failures • Definition: • Ability to prioritize failures • Detect component service degradation • Characterize app stability • Capacity planning • When server “x” fails, what is the impact of this failure? • Better use of ops and engineering resources • Current approach: no systematic attempt to provide this functionality

  7. Our solution (in 2 steps): Detecting and Managing Failures • Step 1: Instrument applications to track user requests across the “service chain” • Each request is tagged with a unique id • Service chain is composed on-the-fly with help of app instrumentation • For each request: • Collect per-hop performance information • Collect per-request failure status • Centralized data collection
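As a rough illustration of Step 1, the sketch below tags a request with a unique id and has each hop report its timing and failure status to a central collector. It is not the MSN implementation: `HopRecord`, `Collector`, `run_hop`, and the two handlers are hypothetical names, and the "centralized" store is just an in-memory list.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class HopRecord:
    request_id: str    # unique id shared by every hop of one request
    hop: str           # name of the component that handled this hop
    elapsed_ms: float  # per-hop performance information
    ok: bool           # per-hop failure status


@dataclass
class Collector:
    """Hypothetical stand-in for the centralized data collection point."""
    records: list = field(default_factory=list)

    def report(self, record: HopRecord) -> None:
        self.records.append(record)


def run_hop(collector, request_id, hop, handler):
    """Run one hop of the service chain and report how it went."""
    start = time.perf_counter()
    ok = True
    try:
        return handler(request_id)
    except Exception:
        ok = False
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        collector.report(HopRecord(request_id, hop, elapsed_ms, ok))


collector = Collector()


def back_end(request_id):
    return "data"


def front_end(request_id):
    # The front end passes the same request id on to the next hop.
    return run_hop(collector, request_id, "back_end", back_end).upper()


request_id = uuid.uuid4().hex
result = run_hop(collector, request_id, "front_end", front_end)
```

Because every record carries the same `request_id`, the service chain for a request can be reassembled afterwards by filtering the collector's records on that id.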

  8. What kinds of failures? • We can handle: • Machine failures • Network connectivity problems • Most: • Misconfiguration • Application bugs • But not all: • Application errors where the app itself doesn’t detect that there is a problem

  9. Diagnosing Failures • Assigning responsibility to a specific hw or sw component • Insight into internals of a component • Cross-component interactions • Current approach: instrument applications • App-specific log messages • Problems: • High request rates => log rollover • Perceived overhead => detailed logging enabled during testing, disabled in production

  10. FUSE Background • FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred • Lack of a positive ack => failure
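The rule "lack of a positive ack => failure" can be illustrated with a toy, single-process model. This is not the distributed FUSE protocol from the OSDI paper: `FuseGroup` and its methods are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class FuseGroup:
    """Toy model: the group agrees on one bit, failed or not failed."""

    def __init__(self, members, timeout_s):
        self.timeout_s = timeout_s
        # Time of the last positive ack seen from each member.
        self.last_ack = {member: 0.0 for member in members}
        self.failed = False

    def ack(self, member, now):
        # Acks are ignored once the group has failed: failure is final.
        if not self.failed:
            self.last_ack[member] = now

    def signal_failure(self):
        # Any member may declare the whole group failed explicitly.
        self.failed = True

    def check(self, now):
        # A member that has been silent longer than the timeout fails the group.
        if any(now - t > self.timeout_s for t in self.last_ack.values()):
            self.failed = True
        return self.failed


group = FuseGroup(["front_end", "back_end"], timeout_s=2.0)
group.ack("front_end", now=1.0)
group.ack("back_end", now=1.0)
failed_early = group.check(now=2.5)   # both members acked within the timeout

group.ack("front_end", now=3.0)       # back_end stays silent
failed_late = group.check(now=3.5)    # back_end has been silent for 2.5 s

group.ack("back_end", now=3.6)        # too late, the group has already failed
failed_after_late_ack = group.check(now=3.7)
```

The point of the sketch is that no component has to report an error for the failure to be noticed; silence past the timeout is enough, and once the failure bit is set it is never cleared.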

  11. Step 2: Conditional Logging • Step 2: Implement “conditional logging” to significantly reduce the overhead of collecting detailed logs across different machines in the service chain • Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain • While fate is undecided: detailed log messages are stored in main memory • Common-case overhead of logging is vastly reduced • Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures • Quantity of data generated is manageable when most requests are successful
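A minimal sketch of the conditional logging idea, assuming a single process and an in-memory dictionary in place of durable storage. `ConditionalLogger` and its method names are hypothetical and are not taken from the FUSE implementation.

```python
from collections import defaultdict


class ConditionalLogger:
    """Buffer detailed logs per request; keep them only if the request fails."""

    def __init__(self):
        self._pending = defaultdict(list)  # request id -> buffered messages
        self.saved = {}                    # stand-in for durable log storage

    def log(self, request_id, message):
        # While the fate of the request is undecided, messages stay in memory.
        self._pending[request_id].append(message)

    def resolve(self, request_id, failed):
        # Once the fate is known: save logs for failures, discard the rest.
        messages = self._pending.pop(request_id, [])
        if failed:
            self.saved[request_id] = messages

    def pending_count(self):
        return len(self._pending)


logger = ConditionalLogger()
logger.log("req-1", "front end: received request")
logger.log("req-1", "back end: query finished")
logger.log("req-2", "front end: received request")
logger.log("req-2", "back end: connection refused")

logger.resolve("req-1", failed=False)  # success: detailed logs are dropped
logger.resolve("req-2", failed=True)   # failure: detailed logs are kept
```

After both requests are resolved, only the messages for `req-2` remain in `saved`, which is why the volume of retained data stays small when most requests succeed.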

  12. Example • [Figure: Client, Server1, Server2, Server3; an X marks a failure] • Benefits: • FUSE allows monitoring of real transactions. • All transactions, or a sampled subset to control overhead. • When a request fails, FUSE provides an audit trail • How far did it get? • How long did each step take? • Any additional application-specific context. • FUSE can be deployed incrementally.
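One way to monitor "a sampled subset" is to derive the sampling decision from the request id, so that every machine in the service chain reaches the same decision without coordinating. The slides do not say how sampling is done; the hash-based scheme and the `is_sampled` helper below are assumptions made for illustration.

```python
import zlib


def is_sampled(request_id: str, rate: float) -> bool:
    """Decide from the request id alone whether to trace this request.

    rate is the fraction of requests to trace, from 0.0 to 1.0.
    """
    # CRC32 is stable across processes and machines, unlike Python's hash().
    bucket = zlib.crc32(request_id.encode("utf-8")) % 10_000
    return bucket < int(rate * 10_000)


request_ids = [f"req-{i}" for i in range(1000)]
traced = [rid for rid in request_ids if is_sampled(rid, 0.1)]

# Any hop that sees the same id makes the same choice.
same_decision = is_sampled("req-7", 0.1) == is_sampled("req-7", 0.1)
```

A decision made independently at each hop (for example with a random number) would leave some chains only partly traced; keying the decision on the id avoids that.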

  13. Issues • Overload policy: need to handle bursts of failures without inducing more failures • How much effort to make apps FUSE enabled? • Are the right components FUSE enabled? • Identifying and filtering false positives • Tracking request flow is non-trivial with network load balancers

  14. Status • We’ve implemented FUSE for MSN, integrated with ASP.NET rendering engine • Testing in progress • Roll-out at end of summer

  15. Backups

  16. FUSE is Easy to Integrate • Example current code on Front End:

        ReceiveRequestFromClient(…) {
            …
            SendRequestToBackEnd(…);
        }

      Example code on Front End using FUSE:

        ReceiveRequestFromClient(…, FUSEinfo f) { // default value of f = null
            if ( f != null )
                JoinFUSEGroup( f );
            …
            SendRequestToBackEnd(…, f );
        }

      Current implementation is in C#, and consists of 2400 LOC
