Detecting, Managing, and Diagnosing Failures with FUSE

John Dunagan, Juhan Lee (MSN), Alec Wolman


Goals & Target Environment

  • Improve the ability of large internet portals to gain insight into failures

  • Non-goals:

    • masking failures

    • using machine learning to infer abnormal behavior

MSN Background

  • Messenger, Hotmail, Search, and many other “properties”

  • Large (> 100 million users)

  • Sources of Complexity:

    • multiple data-centers

    • large # of machines

    • complex internal network topology

    • diversity of applications and software infrastructure

The Plan

  • Detecting, managing, and diagnosing failures

    • Review MSN’s current approaches

    • Describe our solution at a high level

Detecting Failures

  • Monitor system availability with heartbeats

  • Monitor application availability & quality of service using synthetic requests

  • Customer complaints

    • Telephone, email


  • These approaches provide limited coverage: failures that don’t affect every request are harder to catch

  • Data on detected failures often lacks necessary detail to suggest a remedy:

    • which front end is flaky?

    • which app component caused end-user failure?

Managing Failures


  • Ability to prioritize failures

  • Detect component service degradation

  • Characterizing app-stability

  • Capacity planning

  • When server “x” fails, what is the impact of this failure?

    • Better use of ops and engineering resources

  • Current approach: no systematic attempt to provide this functionality

    Our solution (in 2 steps)

    Detecting and Managing Failures

    • Step 1: Instrument applications to track user requests across the “service chain”

      • Each request is tagged with a unique id

      • Service chain is composed on-the-fly with help of app instrumentation

      • For each request:

        • Collect per-hop performance information

        • Collect per-request failure status

      • Centralized data collection
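As a rough illustration of Step 1, the sketch below tags a request with a unique id at the front end, passes that id down the service chain, and reports per-hop timing and failure status to a central collector. The names `Collector`, `HopRecord`, `traced_hop`, `front_end`, and `back_end` are hypothetical, and the in-memory list only stands in for centralized data collection; the slides do not show how the actual instrumentation is written.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class HopRecord:
    request_id: str
    hop: str           # name of the component that handled this hop
    elapsed_s: float   # per-hop performance information
    failed: bool       # whether this hop saw the request fail


class Collector:
    """Stand-in for centralized data collection (assumption: in-memory list)."""

    def __init__(self):
        self.records = []

    def report(self, record):
        self.records.append(record)

    def request_failed(self, request_id):
        # A request counts as failed if any hop in its service chain failed.
        return any(r.failed for r in self.records if r.request_id == request_id)


@contextmanager
def traced_hop(collector, request_id, hop):
    """Time one hop and report it, marking it failed if the body raises."""
    start = time.perf_counter()
    failed = False
    try:
        yield
    except Exception:
        failed = True
        raise
    finally:
        elapsed = time.perf_counter() - start
        collector.report(HopRecord(request_id, hop, elapsed, failed))


def back_end(collector, request_id, ok):
    with traced_hop(collector, request_id, "back_end"):
        if not ok:
            raise RuntimeError("back end error")


def front_end(collector, ok=True):
    # The id is created at the point of entry and travels with the request,
    # so the service chain is composed on the fly as hops report in.
    request_id = uuid.uuid4().hex
    try:
        with traced_hop(collector, request_id, "front_end"):
            back_end(collector, request_id, ok)
    except RuntimeError:
        pass  # already recorded; a real front end would return an error page
    return request_id
```

Calling `front_end(collector, ok=False)` leaves two records for the same request id, one per hop, both marked failed, which is the kind of per-request data that the prioritization and impact questions on the "Managing Failures" slide would draw on.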

    What kinds of failures?

    We can handle:

    • Machine failures

    • Network connectivity problems


    • Misconfiguration

    • Application bugs

    But not all:

    • Application errors where app itself doesn’t detect that there is a problem

    Diagnosing Failures

    • Assigning responsibility to a specific hw or sw component

    • Insight into internals of a component

    • Cross component interactions

    • Current approach: instrument applications

      • App-specific log messages

    • Problems

      • High request rates => log rollover

      • Perceived overhead => detailed logging enabled during testing, disabled in production

    FUSE Background

    • FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred

      • Lack of a positive ack => failure
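The rule "lack of a positive ack => failure" can be pictured with a toy model. The sketch below is not the FUSE protocol; the `FailureGroup` class, its timeout, and its method names are assumptions made only to show the one-bit agreement: a group is either still alive or has failed, and silence from any member counts as failure.

```python
class FailureGroup:
    """Toy model of a failure-notification group (hypothetical, not FUSE itself)."""

    def __init__(self, members, timeout_s, now):
        self.timeout_s = timeout_s
        # Every member starts with an implicit ack at creation time.
        self.last_ack = {m: now for m in members}
        self.failed = False

    def has_failed(self, now):
        # Lack of a recent positive ack from any member is treated as failure.
        if any(now - t > self.timeout_s for t in self.last_ack.values()):
            self.failed = True
        return self.failed

    def ack(self, member, now):
        # A late ack cannot revive the group: once failed, always failed.
        if not self.has_failed(now):
            self.last_ack[member] = now

    def signal_failure(self):
        # A member that detects a problem itself can fail the group explicitly.
        self.failed = True
```

Time is passed in explicitly (`now`) so the behavior is deterministic; a real deployment would use a clock and network messages instead.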

    Step 2: Conditional Logging

    • Step 2: Implement “conditional logging” to significantly reduce the overhead of collecting detailed logs across different machines in the service chain

      • Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain

      • While the fate of a request is undecided, detailed log messages are stored in main memory

        • Common-case overhead of logging is vastly reduced

      • Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures

        • Quantity of data generated is manageable when most requests are successful
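A minimal sketch of conditional logging on a single machine, assuming requests are identified by the id from Step 1. The `ConditionalLogger` class and the list used as a "sink" are hypothetical stand-ins; the slides do not describe the real data structures.

```python
from collections import defaultdict


class ConditionalLogger:
    """Buffer detailed logs per request; persist them only if the request fails."""

    def __init__(self, sink):
        self._pending = defaultdict(list)  # request id -> buffered messages
        self._sink = sink                  # stand-in for durable log storage

    def log(self, request_id, message):
        # While the request's fate is undecided, keep details in main memory only.
        self._pending[request_id].append(message)

    def decide(self, request_id, failed):
        # Once the fate is known, save logs for failures and discard the rest.
        messages = self._pending.pop(request_id, [])
        if failed:
            self._sink.extend((request_id, m) for m in messages)

    def pending_count(self):
        return len(self._pending)
```

In this sketch `decide` would be driven by the agreed failure status for the service chain, so every machine keeps or discards the logs for a given request consistently.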

    Example

    • FUSE allows monitoring of real transactions.

      • All transactions, or a sampled subset to control overhead.

    • When a request fails, FUSE provides an audit trail

      • How far did it get?

      • How long did each step take?

      • Any additional application specific context.

    • FUSE can be deployed incrementally.
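Two of the points above lend themselves to a small sketch: choosing a sampled subset of transactions, and summarizing the audit trail of a failed request. Hashing the request id so that every machine makes the same sampling decision is an assumption of this sketch, not something the slides state, and `is_sampled` and `audit_trail` are hypothetical names.

```python
import hashlib


def is_sampled(request_id, rate):
    """Return True for roughly `rate` of all request ids (0.0 <= rate <= 1.0)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def audit_trail(hops):
    """Summarize a request from (component, elapsed_ms, ok) tuples in call order."""
    return {
        # How far did it get?
        "reached": [name for name, _, _ in hops],
        # How long did each step take?
        "elapsed_ms": {name: ms for name, ms, _ in hops},
        # First component that reported a problem, if any.
        "first_failure": next((name for name, _, ok in hops if not ok), None),
    }
```

Because the decision depends only on the id, a request that is sampled at the front end is also sampled at every later hop, without any extra coordination.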

    Issues

    • Overload policy: need to handle bursts of failures without inducing more failures

    • How much effort to make apps FUSE enabled?

    • Are the right components FUSE enabled?

    • Identifying and filtering false positives

    • Tracking request flow is non-trivial with network load balancers
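The overload issue is listed as open, so the following is only one possible policy, not the authors' design: cap how many failure logs are kept between flushes and count the rest, so that a burst of failures cannot exhaust memory or I/O and cause further failures. `BoundedFailureLog` and its capacity parameter are hypothetical.

```python
class BoundedFailureLog:
    """Keep at most `capacity` failure logs per flush interval; count the overflow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.saved = []
        self.dropped = 0

    def save(self, request_id, messages):
        if len(self.saved) < self.capacity:
            self.saved.append((request_id, list(messages)))
            return True
        # Over budget: drop the detail but keep a count so the burst stays visible.
        self.dropped += 1
        return False

    def flush(self):
        # Called periodically by whatever ships logs to the central collector.
        batch, self.saved = self.saved, []
        dropped, self.dropped = self.dropped, 0
        return batch, dropped
```

The trade-off is deliberate: during a burst the first few failures usually carry most of the diagnostic value, and the drop count still records how large the burst was.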

    Status

    • We’ve implemented FUSE for MSN, integrated with the ASP.NET rendering engine

    • Testing in progress

    • Roll-out at end of summer

    FUSE is Easy to Integrate

    Example current code on Front End:

    ReceiveRequestFromClient(…) {
        …
    }

    Example code on Front End using FUSE:

    ReceiveRequestFromClient(…, FUSEinfo f) { // default value of f = null
        if ( f != null ) JoinFUSEGroup( f );
        SendRequestToBackEnd(…, f );
    }

    The current implementation is in C# and consists of 2400 LOC.