Detecting, Managing, and Diagnosing Failures with FUSE

Detecting, Managing, and Diagnosing Failures with FUSE

John Dunagan, Juhan Lee (MSN), Alec Wolman


Goals & Target Environment
  • Improve the ability of large internet portals to gain insight into failures
  • Non-goals:
    • masking failures
    • using machine learning to infer abnormal behavior
MSN Background
  • Messenger, Hotmail, Search, many other “properties”
  • Large (> 100 million users)
  • Sources of Complexity:
    • multiple data-centers
    • large # of machines
    • complex internal network topology
    • diversity of applications and software infrastructure
The Plan
  • Detecting, managing, and diagnosing failures
    • Review MSN’s current approaches
    • Describe our solution at a high level
Detecting Failures
  • Monitor system availability with heartbeats
  • Monitor application availability & quality of service using synthetic requests
  • Customer complaints
    • Telephone, email


  • These approaches provide limited coverage – harder to catch failures that don’t affect every request
  • Data on detected failures often lacks necessary detail to suggest a remedy:
    • which front end is flaky?
    • which app component caused end-user failure?
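The heartbeat-based detection mentioned above can be illustrated with a minimal sketch. This is a toy Python model, not MSN's actual monitoring code; the class and method names (`HeartbeatMonitor`, `failed_servers`) are hypothetical:

```python
import time

class HeartbeatMonitor:
    """Toy heartbeat monitor: a server is flagged as failed if no
    heartbeat arrives within `timeout` seconds. (Illustrative only;
    names are not from MSN's tooling.)"""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # server -> timestamp of last heartbeat

    def heartbeat(self, server, now=None):
        # Record that `server` reported in at time `now`.
        self.last_seen[server] = now if now is not None else time.time()

    def failed_servers(self, now=None):
        # A server whose last heartbeat is older than `timeout` is
        # considered failed.
        now = now if now is not None else time.time()
        return [s for s, t in self.last_seen.items()
                if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout=5.0)
monitor.heartbeat("frontend-1", now=100.0)
monitor.heartbeat("backend-1", now=103.0)
# At t=107, frontend-1 has been silent for 7s (> 5s) and is flagged.
print(monitor.failed_servers(now=107.0))  # ['frontend-1']
```

Note how this illustrates the coverage limitation in the slide: a machine that answers heartbeats but fails some fraction of real requests is never flagged.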
Managing Failures
  • Ability to prioritize failures
  • Detect component service degradation
  • Characterize app stability
  • Capacity planning
  • When server “x” fails, what is the impact of this failure?
    • Better use of ops and engineering resources
  • Current approach: no systematic attempt to provide this functionality
Our solution (in 2 steps)

Detecting and Managing Failures

  • Step 1: Instrument applications to track user requests across the “service chain”
    • Each request is tagged with a unique id
    • Service chain is composed on-the-fly with help of app instrumentation
    • For each request:
      • Collect per-hop performance information
      • Collect per-request failure status
    • Centralized data collection
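Step 1 can be sketched as follows: a request is tagged with a unique id at the front end, and each component on the service chain appends its per-hop performance and failure status to the request's context before forwarding it. This is a hedged Python sketch; the function names (`new_request_context`, `record_hop`, `request_failed`) and the dictionary layout are assumptions for illustration, not FUSE's actual wire format:

```python
import uuid

def new_request_context():
    """Tag a request with a unique id as it enters the front end."""
    return {"request_id": uuid.uuid4().hex, "hops": []}

def record_hop(ctx, component, latency_ms, ok=True):
    """Each participant appends its per-hop performance information
    and failure status, composing the service chain on the fly."""
    ctx["hops"].append({"component": component,
                        "latency_ms": latency_ms,
                        "ok": ok})

def request_failed(ctx):
    """Per-request failure status: the request failed if any hop did."""
    return any(not hop["ok"] for hop in ctx["hops"])

# A request traverses front end -> app server -> database:
ctx = new_request_context()
record_hop(ctx, "frontend", 3.2)
record_hop(ctx, "appserver", 11.5)
record_hop(ctx, "database", 47.0, ok=False)
print(request_failed(ctx))  # True
```

In the real system the accumulated per-hop records would be shipped to the centralized data collection point rather than kept in a local dictionary.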
What kinds of failures?

We can handle:
  • Machine failures
  • Network connectivity problems
  • Misconfiguration
  • Application bugs

But not all:
  • Application errors where the app itself doesn’t detect that there is a problem
Diagnosing Failures
  • Assigning responsibility to a specific hardware or software component
  • Insight into internals of a component
  • Cross component interactions
  • Current approach: instrument applications
    • App-specific log messages
  • Problems
    • High request rates => log rollover
    • Perceived overhead => detailed logging enabled during testing, disabled in production
FUSE Background
  • FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred
    • Lack of a positive ack => failure
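The "lack of a positive ack => failure" rule can be shown with a small sketch. This is an illustrative Python model of the abstraction only; the class and method names (`FuseGroup`, `end_of_round`) are hypothetical and do not reflect the OSDI 2004 API:

```python
class FuseGroup:
    """Minimal sketch of the FUSE abstraction: lightweight agreement
    on a single bit -- whether this group has failed. Members must
    keep producing positive acks; a missing ack from any member fails
    the whole group, and every member observes the same answer."""

    def __init__(self, members):
        self.members = set(members)
        self.failed = False

    def end_of_round(self, acks_received):
        """Called once per liveness round with the set of members that
        acked. Silence is never treated as success: lack of a positive
        ack => failure (so false positives are possible by design)."""
        if not self.members <= set(acks_received):
            self.failed = True
        return self.failed

group = FuseGroup({"frontend", "appserver", "database"})
group.end_of_round({"frontend", "appserver", "database"})  # all acked
print(group.failed)  # False
group.end_of_round({"frontend", "appserver"})  # database stayed silent
print(group.failed)  # True
```

The key design choice is that the group agrees only on this one bit, which is what keeps the mechanism lightweight enough to attach to every request.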
Step 2: Conditional Logging
  • Step 2: Implement “conditional logging” to significantly reduce the overhead of collecting detailed logs across different machines in the service chain
    • Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain
    • While fate is undecided: Detailed log messages stored in main memory
      • Common-case overhead of logging is vastly reduced
    • Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures
      • Quantity of data generated is manageable when most requests are successful
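The buffer-then-decide mechanism above can be sketched in a few lines. This is a hedged Python illustration, not the C# implementation; the names (`ConditionalLogger`, `resolve`) are made up for the example:

```python
class ConditionalLogger:
    """Sketch of conditional logging: detailed messages are buffered
    in main memory while a request's fate is undecided; on success the
    buffer is discarded, on failure it is flushed to durable storage."""

    def __init__(self):
        self.buffers = {}    # request_id -> in-memory log lines
        self.persisted = []  # stand-in for the durable failure log

    def log(self, request_id, message):
        # While fate is undecided, messages stay in main memory only.
        self.buffers.setdefault(request_id, []).append(message)

    def resolve(self, request_id, failed):
        # Fate decided: keep logs only for failed requests.
        lines = self.buffers.pop(request_id, [])
        if failed:
            self.persisted.extend(lines)
        # Successful requests: the buffer is simply dropped.

logger = ConditionalLogger()
logger.log("req-1", "frontend: received request")
logger.log("req-1", "appserver: timeout talking to database")
logger.log("req-2", "frontend: received request")
logger.resolve("req-1", failed=True)   # flushed to the durable log
logger.resolve("req-2", failed=False)  # discarded
print(len(logger.persisted))  # 2
```

When most requests succeed, almost every buffer is dropped, which is why the quantity of persisted data stays manageable even at high request rates.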

  • FUSE allows monitoring of real transactions.
    • All transactions, or a sampled subset to control overhead.
  • When a request fails, FUSE provides an audit trail:
    • How far did it get?
    • How long did each step take?
    • Any additional application-specific context.
  • FUSE can be deployed incrementally.

  • Overload policy: need to handle bursts of failures without inducing more failures
  • How much effort to make apps FUSE-enabled?
  • Are the right components FUSE-enabled?
  • Identifying and filtering false positives
  • Tracking request flow is non-trivial with network load balancers

  • We’ve implemented FUSE for MSN, integrated with the ASP.NET rendering engine
  • Testing in progress
  • Roll-out at end of summer
FUSE is Easy to Integrate

Example current code on Front End:

    ReceiveRequestFromClient(…) {
        SendRequestToBackEnd(…);
    }

Example code on Front End using FUSE:

    ReceiveRequestFromClient(…, FUSEinfo f) { // default value of f = null
        if ( f != null ) JoinFUSEGroup( f );
        SendRequestToBackEnd(…, f );
    }

The current implementation is in C# and consists of 2,400 lines of code.