A Statistical Learning Approach to Diagnosing eBay’s Site

Mike Chen, Alice Zheng, Jim Lloyd,

Michael Jordan, Eric Brewer

[email protected]


Motivation

  • Fast failure detection and diagnosis are critical to high availability

    • But, exact root cause may not be required for many recovery techniques

  • Many potential causes of failures

    • Software bugs, hardware, configuration, network, database, etc.

    • Manual diagnosis is slow and inconsistent

  • Statistical approaches are ideal

    • Simultaneously examining many possible causes of failures

    • Robust to noise

Path-based Diagnosis


Challenges

  • Lots of (noisy) data

  • Near real-time detection and diagnosis

  • Multiple independent failures

  • Root cause might not be captured in logs

Talk Outline

  • Introduction

  • eBay’s infrastructure

  • 3 statistical approaches

  • Early results

eBay’s Infrastructure

  • 2 physical tiers

    • Web server/app server + DB

    • Migrating to Java (WebSphere) from C++

  • SuperCAL (Centralized Application Logging)

    • API for app developer to log anything to CAL

    • Runtime platform provides application-generic logging: cookie, host, URL, DB table(s), status, duration, etc.

    • Supports nested txns

    • A path can be identified via thread ID + host ID
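As a rough illustration of the last bullet, the sketch below groups log records into paths keyed by host ID plus thread ID. The record fields (`host`, `thread`, `name`) and the sample values are hypothetical; the slides do not describe the actual CAL record format.

```python
from collections import defaultdict

# Hypothetical log records; field names are assumptions, not the CAL schema.
records = [
    {"host": "h1", "thread": 7, "name": "URL:ViewFeedback"},
    {"host": "h1", "thread": 7, "name": "SQL:feedback_lookup"},
    {"host": "h2", "thread": 7, "name": "URL:LeaveFeedback"},
    {"host": "h1", "thread": 9, "name": "URL:MyFeedback"},
]


def group_into_paths(records):
    """Group records by (host ID, thread ID), preserving log order."""
    paths = defaultdict(list)
    for rec in records:
        paths[(rec["host"], rec["thread"])].append(rec["name"])
    return dict(paths)


paths = group_into_paths(records)
```

Thread ID 7 appears on both `h1` and `h2` here, which is why the key needs the host ID as well.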

SuperCAL Architecture

  • Stats

    • 2K app servers, 40 SuperCAL machines

    • 1B URLs/day

    • 1TB raw logs/day (150GB gzipped), 200Mbps peak

[Figure: SuperCAL architecture diagram. Labels: App Servers, LB, Switch, real-time msg bus, detection, diagnosis]

Failure Analysis

  • Summarize each transaction into a set of features and a class

  • What features are causing requests to fail?

    • Txn type, txn name, pool, host, version, DB, or a combination of these?

    • Different causes require different recovery techniques
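One way to picture the features-plus-class summary is a small record type. This is only a sketch: the Python field names, the `host42` and `db1` values, and the `"failed"`/`"success"` labels are assumptions made for illustration; only the list of feature kinds comes from the slide.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TxnSummary:
    # Features named on the slide; the Python names are assumptions.
    txn_type: str
    txn_name: str
    pool: str
    host: str
    version: str
    db: str
    # Class: whether the transaction failed.
    failed: bool


def to_row(txn):
    """Split a summary into (features dict, class label)."""
    row = asdict(txn)
    label = "failed" if row.pop("failed") else "success"
    return row, label


txn = TxnSummary("URL", "LeaveFeedback", "icgi2", "host42", "E293", "db1", True)
features, label = to_row(txn)
```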

3 Approaches

  • Machine learning

    • Decision trees

    • MinEntropy – eBay’s greedy variant of decision trees

  • Data mining

    • Association rules

Decision Trees

  • Classifiers developed in the statistical machine learning field

  • Example: go skiing tomorrow?

  • “Learning” => inferring the decision tree’s rules from data

[Figure: example decision tree for the skiing question, splitting on New snow / No new snow and on Cloudy / Sunny, with leaves Y and N]

Decision Trees

  • Feature selection

    • Look for the features that best separate the classes

    • Different algorithms use different metrics to measure “skewness” (e.g. C4.5 uses information gain)

  • The goal of a decision tree algorithm

    • Split nodes until leaves are “pure” enough or until no further split is possible

      • i.e. pure => all data points have the same class label

    • Use pruning heuristics to control over-fitting
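To make the feature-selection step concrete, here is a minimal sketch of information gain, the metric the slide mentions. The toy rows and labels are hypothetical, and the code is not taken from C4.5 itself.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the data on `feature`."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: every failure occurs in pool icgi2.
rows = [
    {"pool": "icgi1", "host": "a"},
    {"pool": "icgi1", "host": "b"},
    {"pool": "icgi2", "host": "a"},
    {"pool": "icgi2", "host": "b"},
]
labels = ["success", "success", "failed", "failed"]

gain_pool = information_gain(rows, labels, "pool")  # split yields pure leaves
gain_host = information_gain(rows, labels, "host")  # split separates nothing
```

Splitting on `pool` produces pure leaves, so it gets the full gain, while `host` gets none; a tree learner would therefore pick `pool` first.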


Decision Trees – Sample Output

  • Leaf counts are shown as (correct, incorrect)

  • Pool = icgi1

    | TxnName = LeaveFeedback: failed (8,1)

    | TxnName = MyFeedback: failed (205,3)

    Pool = icgi2

    | TxnName = Respond: failed (1)

    | TxnName = ViewFeedback: failed (3554,52)

  • Naïve diagnosis:

    • Pool=icgi1 and TxnName=LeaveFeedback

    • Pool=icgi1 and TxnName=MyFeedback

    • Pool=icgi2 and TxnName=Respond

    • Pool=icgi2 and TxnName=ViewFeedback

[Figure: the sample decision tree. Pool nodes icgi1 and icgi2; TxnName leaves LeaveFdbk (8), MyFdbk (205), Respond (1), ViewFdbk (3554)]

Feature Selection Heuristics

  • Ignore leaf nodes with no failed transactions

  • Problem: noisy leaves

    • keep the top N leaves, or ignore nodes with < M% failures

  • Problem: features may not be independent

    • drop ancestor nodes that are “subsumed” by the leaves

  • Rank by impact

    • sort the predicted causes by failure count
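The filtering and ranking heuristics can be sketched as below, using the failure counts from the sample output. Several things are assumptions: the tuple layout, the hypothetical `icgi3` success leaf, the `min_share_pct` parameter standing in for M, and the reading of "< M% failures" as a share of all failures. Path trimming is left out.

```python
# Leaves from the sample output as (path, predicted class, failure count).
# The "success" leaf is hypothetical, added to show the first heuristic.
leaves = [
    ({"Pool": "icgi1", "TxnName": "LeaveFeedback"}, "failed", 8),
    ({"Pool": "icgi1", "TxnName": "MyFeedback"}, "failed", 205),
    ({"Pool": "icgi2", "TxnName": "Respond"}, "failed", 1),
    ({"Pool": "icgi2", "TxnName": "ViewFeedback"}, "failed", 3554),
    ({"Pool": "icgi3"}, "success", 0),
]


def diagnose(leaves, min_share_pct=1.0):
    """Filter and rank leaves; min_share_pct plays the role of M."""
    # Ignore leaves with no failed transactions.
    failed = [leaf for leaf in leaves if leaf[1] == "failed" and leaf[2] > 0]
    total = sum(count for _, _, count in failed)
    if total == 0:
        return []
    # Noise filter: drop leaves holding less than M% of all failures
    # (one possible reading of "< M% failures").
    kept = [leaf for leaf in failed if 100.0 * leaf[2] / total >= min_share_pct]
    # Rank by impact: largest failure count first.
    return sorted(kept, key=lambda leaf: leaf[2], reverse=True)


ranked = diagnose(leaves)
```

With a 1% threshold, the leaves with 8 and 1 failures are treated as noise and the remaining two are ranked by failure count.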

[Figure: the sample decision tree. Pool nodes icgi1 and icgi2; TxnName leaves LeaveFdbk (8), MyFdbk (205), Respond (1), ViewFdbk (3554)]

MinEntropy

  • Entropy measures the randomness of data

    • E.g. if failure is evenly distributed (very random), then entropy is high

  • Rank features by the normalized entropy

    • Greedy approach searches for the leaf node with most failures

  • Always produces exactly one diagnosis

  • Deployed on the entire eBay site

    • Sends real-time alerts to ops

    • Pros: fast (<1s for 100K txns and scales linearly)

    • Cons: optimized for single faults
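The slides do not give the MinEntropy algorithm in detail, so the following is only a guess at the general idea, not eBay's implementation. It scores each feature by the entropy of the failed transactions across that feature's values, normalized by the maximum entropy for the number of values observed, and reports the most common value of the best-scoring feature. The function names, the normalization choice, and the sample data (including version `E292`) are all assumptions.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Entropy of `values` divided by its maximum, giving a 0..1 score."""
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    total = len(values)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts))


def rank_features(failed_txns, features):
    """Sort features so the most concentrated failures come first."""
    scores = {f: normalized_entropy([t[f] for t in failed_txns]) for f in features}
    return sorted(features, key=lambda f: scores[f])


def diagnose(failed_txns, features):
    """Return one (feature, value) pair: the top feature's most common value."""
    best = rank_features(failed_txns, features)[0]
    value, _ = Counter(t[best] for t in failed_txns).most_common(1)[0]
    return best, value


# Hypothetical failed transactions: failures cluster on one version.
failed_txns = [
    {"version": "E293", "host": "h1"},
    {"version": "E293", "host": "h2"},
    {"version": "E293", "host": "h3"},
    {"version": "E292", "host": "h4"},
]
result = diagnose(failed_txns, ["version", "host"])
```

Failures are spread evenly over hosts (high entropy) but concentrated on one version (lower entropy), so the sketch returns a single diagnosis naming that version.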

MinEntropy Example

Alert: Version E293 causing URL failures (not specific to any URL) in pool CGI1

Association Rules

  • Data mining technique to compute item sets

    • e.g. Shoppers who bought this item also shopped for …

  • Metrics

    • Confidence: (# of A & B) / # of A

      • Conditional probability of B given A

    • Support: (# of A & B)/total # of txns

  • Generates rules for all possible sets

    • e.g. machine=abc, txn=login => status=NullPointer (conf: 0.1, support: 0.02)

  • Applied to failure diagnosis

    • Find all rules that have a failed status on the right, then rank by confidence

    • Pros: looks at combinations of features

    • Cons: generates many rules
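The two metrics above can be computed directly from transaction counts. The sketch below assumes transactions are plain dictionaries and uses a hypothetical five-transaction dataset; it evaluates one rule rather than generating all of them.

```python
def support_and_confidence(txns, antecedent, consequent):
    """Compute (support, confidence) for the rule antecedent => consequent.

    txns is a list of dicts; antecedent and consequent are dicts of
    attribute/value pairs that must all match.
    """
    def matches(txn, cond):
        return all(txn.get(k) == v for k, v in cond.items())

    n_a = sum(1 for t in txns if matches(t, antecedent))
    n_ab = sum(
        1 for t in txns if matches(t, antecedent) and matches(t, consequent)
    )
    support = n_ab / len(txns)                   # (# of A & B) / total # of txns
    confidence = n_ab / n_a if n_a else 0.0      # (# of A & B) / # of A
    return support, confidence


# Hypothetical transactions.
txns = [
    {"TxnName": "LeaveFeedback", "Status": "Failed"},
    {"TxnName": "LeaveFeedback", "Status": "OK"},
    {"TxnName": "LeaveFeedback", "Status": "OK"},
    {"TxnName": "LeaveFeedback", "Status": "OK"},
    {"TxnName": "ViewFeedback", "Status": "OK"},
]
support, confidence = support_and_confidence(
    txns, {"TxnName": "LeaveFeedback"}, {"Status": "Failed"}
)
```

Here one of five transactions matches both sides of the rule and four match the left side, so support is 1/5 and confidence is 1/4.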

Association Rules – Sample Output

  • Sample output (rules containing failures):

    TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

    Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

    TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

    TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

  • Problem: features may not be independent

    • e.g. all LeaveFeedback txns are of type URL

    • Drop rules that are subsumed by more specific rules

  • Diagnosis: TxnName=LeaveFeedback

Experimental Setup

  • Dataset

    • About 1/8 of the whole site

    • 10 one-minute traces, 4 with 2 concurrent faults

      • total of 14 independent faults

    • True faults identified through post-mortems, ops chat logs, application logs, etc.

  • Metrics

    • Precision: (# of correctly identified faults) / (# of predicted faults)

    • Recall: (# of correctly identified faults) / (# of true faults)
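A small worked example of the two metrics, using the standard definitions: precision is the share of predicted faults that are true faults, and recall is the share of true faults that were predicted. The fault strings below are hypothetical and are not results from the experiments.

```python
def precision_recall(predicted, true_faults):
    """Precision and recall of a set of predicted faults."""
    predicted, true_faults = set(predicted), set(true_faults)
    hits = predicted & true_faults
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(true_faults) if true_faults else 0.0
    return precision, recall


# Hypothetical diagnosis for one trace with two concurrent faults.
true_faults = {"Pool=icgi2", "Version=E293"}
predicted = {"Pool=icgi2", "TxnName=Respond", "Host=h7", "DB=db3"}
precision, recall = precision_recall(predicted, true_faults)
```

One of the four predictions is correct (precision 0.25) and one of the two true faults is found (recall 0.5).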



Results: DBs in Dataset

  • True causes for DB-related failures are captured in the dataset

  • Variable number of DBs used by each txn

  • Feature selection heuristics

    • Ignore leaf nodes with no failed transactions

    • Noise filtering: ignore nodes with < M% failures (in this case, M = 10)

    • Path trimming: drop ancestor nodes subsumed by the leaf nodes



Results: DBs not in Dataset

  • True causes for DB-related failures are not captured in the dataset

  • C4.5 suffers from the unbalanced dataset

    • i.e. it produces a single rule that predicts every txn to be successful

What’s Next?

  • ROC curves

    • show tradeoff between precision and recall

  • Transient failures

    • Up-sample to balance dataset or use cost matrix

  • Some measure of the “confidence” of the prediction

  • More data points

    • Have 20hrs of logs that have failures
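The up-sampling option listed under transient failures could look roughly like the sketch below, which duplicates randomly chosen failed rows until the two classes are the same size. It assumes exactly two classes, and the function name, labels, and data are hypothetical.

```python
import random


def upsample_minority(rows, labels, minority="failed", seed=0):
    """Duplicate random minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_count = len(labels) - len(minority_idx)
    if not minority_idx:
        return list(rows), list(labels)
    shortfall = max(0, majority_count - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(shortfall)]
    new_rows = list(rows) + [rows[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_rows, new_labels


# Hypothetical unbalanced data: 9 successes, 1 failure.
rows = [{"pool": "icgi1"}] * 9 + [{"pool": "icgi2"}]
labels = ["success"] * 9 + ["failed"]
balanced_rows, balanced_labels = upsample_minority(rows, labels)
```

A cost matrix is the alternative named on the slide: rather than duplicating rows, misclassifying a failure is given a higher penalty during training.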

Open Questions

  • How to deal with multiple symptoms?

    • E.g. DB outage causing multiple types of requests to fail

    • Treat it as multiple failures?

  • Failure importance (count vs. rate)

    • Two failures may have similar failure count

    • Low volume and higher failure rate vs. high volume and lower failure rate
