
Detecting, Managing, and Diagnosing Failures with FUSE

John Dunagan, Juhan Lee (MSN), Alec Wolman

WIP

Goals & Target Environment
  • Improve the ability of large internet portals to gain insight into failures
  • Non-goals:
    • masking failures
    • using machine learning to infer abnormal behavior
MSN Background
  • Messenger, www.msn.com, Hotmail, Search, many other “properties”
  • Large (> 100 million users)
  • Sources of Complexity:
    • multiple data-centers
    • large # of machines
    • complex internal network topology
    • diversity of applications and software infrastructure
The Plan
  • Detecting, managing, and diagnosing failures
    • Review MSN’s current approaches
    • Describe our solution at a high level
Detecting Failures
  • Monitor system availability with heartbeats
  • Monitor application availability & quality of service using synthetic requests
  • Customer complaints
    • Telephone, email

Problems:

  • These approaches provide limited coverage – harder to catch failures that don’t affect every request
  • Data on detected failures often lacks the detail needed to suggest a remedy:
    • which front end is flaky?
    • which app component caused the end-user failure?
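The heartbeat monitoring described above can be sketched roughly as follows (in Python for brevity; the talk's implementation is C#). The class name, interval, and threshold are illustrative assumptions, not MSN's actual monitor:

```python
class HeartbeatMonitor:
    """Minimal sketch of heartbeat-based availability detection.

    A host that has been silent for more than `max_missed` heartbeat
    intervals is flagged as failed. All names and defaults here are
    assumptions for illustration.
    """

    def __init__(self, interval_s=5.0, max_missed=3):
        self.interval_s = interval_s
        self.max_missed = max_missed
        self.last_seen = {}  # host -> timestamp of its most recent heartbeat

    def record_heartbeat(self, host, now):
        self.last_seen[host] = now

    def failed_hosts(self, now):
        # A host is considered failed once its silence exceeds
        # max_missed * interval_s. This catches dead machines, but not
        # failures that only affect a fraction of requests -- the
        # coverage gap the slide points out.
        deadline = self.max_missed * self.interval_s
        return [h for h, t in self.last_seen.items() if now - t > deadline]
```

A host serving errors on 1% of requests still heartbeats on time, which is what motivates the per-request tracking in Step 1 below.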
Managing Failures

Definition:

    • prioritizing failures
    • detecting component service degradation
    • characterizing app stability
    • capacity planning
  • When server “x” fails, what is the impact of this failure?
    • Better use of ops and engineering resources
  • Current approach: no systematic attempt to provide this functionality
Our solution (in 2 steps)

Detecting and Managing Failures

  • Step 1: Instrument applications to track user requests across the “service chain”
    • Each request is tagged with a unique id
    • Service chain is composed on-the-fly with the help of app instrumentation
    • For each request:
      • Collect per-hop performance information
      • Collect per-request failure status
    • Centralized data collection
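A minimal sketch of this per-request instrumentation (Python for brevity; the actual implementation is C#). `RequestTracker`, its field names, and the per-hop record format are hypothetical:

```python
import uuid

class RequestTracker:
    """Sketch of Step 1: tag each request with a unique id and record
    per-hop performance and failure status across the service chain.
    These names are illustrative assumptions, not the MSN API.
    """

    def __init__(self):
        self.request_id = str(uuid.uuid4())  # unique id tags the request end-to-end
        self.hops = []  # per-hop records, destined for centralized collection

    def record_hop(self, component, elapsed_s, ok):
        """Record per-hop performance information and failure status."""
        self.hops.append({"component": component, "elapsed_s": elapsed_s, "ok": ok})

    def failed(self):
        """The request failed if any hop in the chain reported failure."""
        return any(not h["ok"] for h in self.hops)
```

Because every hop carries the same id, the centralized collector can stitch the hops back into one service chain per request.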
What kinds of failures?

We can handle:

  • Machine failures
  • Network connectivity problems

Most:

  • Misconfiguration
  • Application bugs

But not all:

  • Application errors where the app itself doesn’t detect that there is a problem
Diagnosing Failures
  • Assigning responsibility to a specific hw or sw component
  • Insight into internals of a component
  • Cross component interactions
  • Current approach: instrument applications
    • App-specific log messages
  • Problems
    • High request rates => log rollover
    • Perceived overhead => detailed logging enabled during testing, disabled in production
FUSE Background
  • FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred
    • Lack of a positive ack => failure
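The core FUSE rule (a missing positive ack is itself treated as failure) might be sketched like this; `FuseGroup`, its timeout, and the sticky failure flag are illustrative assumptions rather than the OSDI 2004 implementation:

```python
class FuseGroup:
    """Sketch of FUSE's lightweight agreement: members agree on only one
    thing, whether a failure has occurred. Absence of a positive ack
    within the timeout is itself counted as failure. Names and defaults
    are assumptions for illustration.
    """

    def __init__(self, members, timeout_s):
        self.timeout_s = timeout_s
        self.last_ack = {m: 0.0 for m in members}  # member -> last positive ack time
        self.failed = False

    def ack(self, member, now):
        """A member positively acknowledges that it is still healthy."""
        self.last_ack[member] = now

    def check(self, now):
        # Any member that has NOT positively acked within the timeout
        # makes the whole group agree that a failure occurred; once
        # signaled, the failure verdict never reverts.
        if any(now - t > self.timeout_s for t in self.last_ack.values()):
            self.failed = True
        return self.failed
```

The point of this weak agreement is that no member can silently stall the group: silence and explicit failure are treated identically.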
Step 2: Conditional Logging
  • Step 2: Implement “conditional logging” to significantly reduce the overhead of collecting detailed logs across different machines in the service chain
    • Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain
    • While a request’s fate is undecided: detailed log messages are stored in main memory
      • Common-case overhead of logging is vastly reduced
    • Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures
      • Quantity of data generated is manageable when most requests are successful
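The buffer-then-decide structure of conditional logging could be sketched as below; the class name and the `sink` callback standing in for centralized collection are assumptions:

```python
class ConditionalLogger:
    """Sketch of Step 2's conditional logging: detailed messages stay in
    main memory while a request's fate is undecided; on success they are
    discarded, on failure they are handed to `sink` (a stand-in for the
    real centralized collection mechanism).
    """

    def __init__(self, sink):
        self.sink = sink    # called with (request_id, lines) only on failure
        self.buffers = {}   # request_id -> in-memory detailed log lines

    def log(self, request_id, message):
        """Buffer a detailed message in main memory; nothing hits disk yet."""
        self.buffers.setdefault(request_id, []).append(message)

    def decide(self, request_id, success):
        """Fate decided: discard logs on success, persist them on failure."""
        lines = self.buffers.pop(request_id, [])
        if not success:
            self.sink(request_id, lines)
```

When most requests succeed, almost every buffer is simply dropped, which is what keeps both the logging overhead and the data volume manageable.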
Example

[Diagram: a request flows from Client through Server1, Server2, and Server3, with an X marking a failure along the chain.]

Benefits:

  • FUSE allows monitoring of real transactions.
    • All transactions, or a sampled subset to control overhead.
  • When a request fails, FUSE provides an audit trail
    • How far did it get?
    • How long did each step take?
    • Any additional application specific context.
  • FUSE can be deployed incrementally.
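For the “sampled subset” above, the sampling decision needs to be deterministic so that every hop in the service chain agrees on whether a given request is traced, without coordination. A sketch, where the CRC32 hash, bucket count, and function name are illustrative choices:

```python
import zlib

def should_trace(request_id, sample_rate):
    """Deterministic sampling sketch: hash the request id so all machines
    in the service chain make the same trace/no-trace decision for the
    same request. Hash and bucket count are illustrative assumptions.
    """
    bucket = zlib.crc32(request_id.encode("utf-8")) % 10_000
    return bucket < int(sample_rate * 10_000)
```

Any hop can evaluate this locally from the id carried with the request, so a sampled request is traced end-to-end or not at all.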
Issues
  • Overload policy: need to handle bursts of failures without inducing more failures
  • How much effort to make apps FUSE-enabled?
  • Are the right components FUSE-enabled?
  • Identifying and filtering false positives
  • Tracking request flow is non-trivial with network load balancers
Status
  • We’ve implemented FUSE for MSN, integrated with the ASP.NET rendering engine
  • Testing in progress
  • Roll-out at end of summer
FUSE is Easy to Integrate

Example current code on Front End:

    ReceiveRequestFromClient(…) {
        SendRequestToBackEnd(…);
    }

Example code on Front End using FUSE:

    ReceiveRequestFromClient(…, FUSEinfo f) {  // default value of f = null
        if (f != null) JoinFUSEGroup(f);
        SendRequestToBackEnd(…, f);
    }

The current implementation is in C# and consists of 2400 LOC.
