Detecting, Managing, and Diagnosing Failures with FUSE

John Dunagan, Juhan Lee (MSN), Alec Wolman

WIP


Goals & Target Environment

  • Improve the ability of large internet portals to gain insight into failures

  • Non-goals:

    • masking failures

    • using machine learning to infer abnormal behavior


MSN Background

  • Messenger, www.msn.com, Hotmail, Search, many other “properties”

  • Large (> 100 million users)

  • Sources of Complexity:

    • multiple data-centers

    • large # of machines

    • complex internal network topology

    • diversity of applications and software infrastructure


The Plan

  • Detecting, managing, and diagnosing failures

    • Review MSN’s current approaches

    • Describe our solution at a high level


Detecting Failures

  • Monitor system availability with heartbeats

  • Monitor application availability & quality of service using synthetic requests

  • Customer complaints

    • Telephone, email

Problems:

  • These approaches provide limited coverage – harder to catch failures that don’t affect every request

  • Data on detected failures often lacks necessary detail to suggest a remedy:

    • which front end is flaky?

    • which app component caused end-user failure?


Managing Failures

Definition:

  • Ability to prioritize failures

  • Detect component service degradation

  • Characterize application stability

  • Capacity planning

  • When server “x” fails, what is the impact of this failure?

    • Better use of ops and engineering resources

  • Current approach: no systematic attempt to provide this functionality


Our solution (in 2 steps)

Detecting and Managing Failures

  • Step 1: Instrument applications to track user requests across the “service chain”

    • Each request is tagged with a unique id

    • Service chain is composed on-the-fly with help of app instrumentation

    • For each request:

      • Collect per-hop performance information

      • Collect per-request failure status

    • Centralized data collection
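
Step 1 can be pictured with a small Python sketch. This is an illustration only, not the MSN implementation (which the slides say is in C#): the names `HopRecord`, `Collector`, `new_request_id`, and `run_hop` are hypothetical, and an in-memory dictionary stands in for centralized data collection.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class HopRecord:
    """One hop's view of a request (all field names are hypothetical)."""
    request_id: str
    hop: str
    elapsed_ms: float
    ok: bool


@dataclass
class Collector:
    """Stand-in for centralized data collection: groups records by request id."""
    records: dict = field(default_factory=dict)

    def report(self, rec: HopRecord) -> None:
        self.records.setdefault(rec.request_id, []).append(rec)

    def chain(self, request_id: str) -> list:
        # The service chain is whatever hops actually reported for this id.
        return [r.hop for r in self.records.get(request_id, [])]


def new_request_id() -> str:
    """Tag each incoming request with a unique id."""
    return uuid.uuid4().hex


def run_hop(collector, request_id, hop, work):
    """Run one hop's work, recording its duration and failure status."""
    start = time.perf_counter()
    ok = True
    try:
        return work()
    except Exception:
        ok = False
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        collector.report(HopRecord(request_id, hop, elapsed_ms, ok))


# Usage: two hops handle the same tagged request; the second one fails.
collector = Collector()
rid = new_request_id()
run_hop(collector, rid, "front-end", lambda: "rendered")
try:
    run_hop(collector, rid, "back-end", lambda: 1 / 0)
except ZeroDivisionError:
    pass
```

Because every record carries the request id, the collector can rebuild the chain after the fact without knowing the topology in advance, which is one way to read "composed on-the-fly".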


What kinds of failures?

We can handle:

  • Machine failures

  • Network connectivity problems

Most:

  • Misconfiguration

  • Application bugs

But not all:

  • Application errors where app itself doesn’t detect that there is a problem


Diagnosing Failures

  • Assigning responsibility to a specific hw or sw component

  • Insight into internals of a component

  • Cross component interactions

  • Current approach: instrument applications

    • App-specific log messages

  • Problems

    • High request rates => log rollover

    • Perceived overhead => detailed logging enabled during testing, disabled in production


FUSE Background

  • FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred

    • Lack of a positive ack => failure
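
The "lack of a positive ack => failure" rule can be sketched as a toy model in Python. This is not the FUSE protocol itself: the class `FuseGroup`, its methods, and the timeout value are hypothetical, and treating failure as a one-way latch is an assumption made for the illustration. Time is passed in explicitly so the example is deterministic.

```python
class FuseGroup:
    """Toy model: every member must keep acknowledging, or the group fails."""

    def __init__(self, members, timeout_s, now):
        self.timeout_s = timeout_s
        self.last_ack = {m: now for m in members}
        self.failed = False

    def ack(self, member, now):
        # Assumption: once failure is declared it is never undone,
        # so late acks are ignored.
        if not self.failed:
            self.last_ack[member] = now

    def check(self, now):
        # Silence from any member for longer than the timeout counts as failure.
        if any(now - t > self.timeout_s for t in self.last_ack.values()):
            self.failed = True
        return self.failed


# Usage: the back end goes silent after t = 0.5 s.
g = FuseGroup(["front-end", "back-end"], timeout_s=1.0, now=0.0)
g.ack("front-end", 0.5)
g.ack("back-end", 0.5)
healthy = g.check(1.0)          # both acks are recent enough
g.ack("front-end", 1.2)
after_silence = g.check(2.0)    # back end has been silent for 1.5 s
g.ack("back-end", 2.1)          # too late: the group has already failed
still_failed = g.check(2.2)
```

The point of the sketch is that participants only ever need to agree on a single bit, failed or not failed, and that silence is enough to set it.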


Step 2: Conditional Logging

  • Step 2: Implement “conditional logging” to significantly reduce the overhead of collecting detailed logs across different machines in the service chain

    • Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain

    • While fate is undecided: detailed log messages are stored in main memory

      • Common-case overhead of logging is vastly reduced

    • Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures

      • Quantity of data generated is manageable when most requests are successful
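
Conditional logging as described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: `ConditionalLogger`, `log`, and `decide` are hypothetical names, and a plain list stands in for durable log storage.

```python
class ConditionalLogger:
    """Hold detailed log messages in memory until a request's fate is known."""

    def __init__(self, sink):
        self._pending = {}   # request_id -> messages buffered in main memory
        self._sink = sink    # stand-in for durable storage (here: a list)

    def log(self, request_id, message):
        # Common case: appending to an in-memory list, no I/O.
        self._pending.setdefault(request_id, []).append(message)

    def decide(self, request_id, failed):
        """Called once the fate of the service chain is decided."""
        messages = self._pending.pop(request_id, [])
        if failed:
            # Only failed requests pay the cost of being saved.
            for m in messages:
                self._sink.append((request_id, m))
        return len(messages)


# Usage: r1 succeeds and its logs are dropped; r2 fails and its logs are kept.
saved = []
logger = ConditionalLogger(saved)
logger.log("r1", "front-end: received")
logger.log("r1", "back-end: ok")
logger.log("r2", "front-end: received")
logger.log("r2", "back-end: timeout")
logger.decide("r1", failed=False)
logger.decide("r2", failed=True)
```

In a real deployment each machine in the chain would hold its own buffer, and the agreed failure status would tell every machine whether to save or discard.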


Example

[Diagram: Client, Server1, Server2, Server3, with an X marking a failure]

Benefits:

  • FUSE allows monitoring of real transactions.

    • All transactions, or a sampled subset to control overhead.

  • When a request fails, FUSE provides an audit trail

    • How far did it get?

    • How long did each step take?

    • Any additional application specific context.

  • FUSE can be deployed incrementally.
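
Two of these benefits lend themselves to a short Python sketch. The slides do not say how sampling or the audit trail are implemented, so everything here is an assumption: `is_sampled` hashes the request id so that every hop reaches the same sampling decision without coordination, and `audit_trail` formats hypothetical per-hop records of the form (name, elapsed milliseconds, ok).

```python
import hashlib


def is_sampled(request_id: str, rate: float) -> bool:
    """Decide deterministically whether to trace this request.

    Every hop that sees the same id computes the same answer.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")      # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def audit_trail(hops):
    """Summarize how far a request got and how long each step took.

    hops: list of (name, elapsed_ms, ok) in the order the request visited them.
    """
    lines = []
    for name, elapsed_ms, ok in hops:
        status = "ok" if ok else "FAILED"
        lines.append(f"{name}: {elapsed_ms:.1f} ms, {status}")
    return lines


# Usage: the request reached Server2 and failed there.
trail = audit_trail([("Server1", 2.0, True), ("Server2", 15.5, False)])
```

Hashing the id rather than drawing a random number at each hop avoids traces where only some of the hops recorded anything.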


Issues

  • Overload policy: need to handle bursts of failures without inducing more failures

  • How much effort to make apps FUSE enabled?

  • Are the right components FUSE enabled?

  • Identifying and filtering false positives

  • Tracking request flow is non-trivial with network load balancers


Status

  • We’ve implemented FUSE for MSN, integrated with ASP.NET rendering engine

  • Testing in progress

  • Roll-out at end of summer



FUSE is Easy to Integrate

Example current code on Front End:

    ReceiveRequestFromClient(…) {
        SendRequestToBackEnd(…);
    }

Example code on Front End using FUSE:

    ReceiveRequestFromClient(…, FUSEinfo f) { // default value of f = null
        if ( f != null ) JoinFUSEGroup( f );  // join only when the caller supplied FUSE info
        SendRequestToBackEnd(…, f );          // pass the FUSE info along the service chain
    }

The current implementation is in C# and consists of 2,400 LOC
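
The same integration pattern can be written as a runnable Python analogue. It mirrors the pseudocode above rather than the real C# API: the function names are transliterations, the two lists only record what happened for the sake of the example, and the string "req-42" is a made-up stand-in for the FUSE info object.

```python
joined = []      # groups this front end joined (illustration only)
forwarded = []   # requests sent on to the back end (illustration only)


def join_fuse_group(fuse_info):
    joined.append(fuse_info)


def send_request_to_back_end(payload, fuse_info=None):
    forwarded.append((payload, fuse_info))


def receive_request_from_client(payload, fuse_info=None):
    # The default of None keeps callers that know nothing about FUSE working.
    if fuse_info is not None:
        join_fuse_group(fuse_info)
    send_request_to_back_end(payload, fuse_info)


# Usage: one legacy caller, one FUSE-enabled caller.
receive_request_from_client("GET /inbox")
receive_request_from_client("GET /inbox", fuse_info="req-42")
```

The optional parameter is what makes incremental deployment plausible: components that have not been updated simply never pass the extra argument.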

