CAPRI: A Common Architecture for Autonomous, Distributed Internet Fault Diagnosis using Probabilistic Relational Models

George J. Lee <[email protected]>

Advanced Network Architecture Group

Computer Science and Artificial Intelligence Lab

Massachusetts Institute of Technology

Automated Internet fault diagnosis is difficult

  • Knowledge, data, and reasoning are distributed
    • Agents need a common extensible language for expressing knowledge & data
  • Agents have incomplete information
    • Agents must perform probabilistic diagnosis when evidence is unavailable
  • Distributed diagnosis is costly
    • Agents must minimize probing and communication cost

[Figure: several diagnostic agents (DA = Diagnostic Agent), with labels Failure Report, Diagnosis, Data, Knowledge, and Reasoning]

We need a Common Architecture for Probabilistic Reasoning in the Internet (CAPRI)

Overview
  • An extensible language for expressing diagnostic data & knowledge
    • Based on Bayes nets and Probabilistic Relational Models
  • Distributed probabilistic reasoning while minimizing probing and communication cost
    • Trading off accuracy and cost
    • Incorporating past evidence
    • Propagating evidence to other agents
    • Simulations: accuracy vs. cost
  • Learning diagnostic knowledge for real-world diagnosis
    • Passive diagnosis of HTTP proxy connections
    • Evaluation: accuracy using learned knowledge
Bayes nets can express diagnostic data
  • Data = evidence about a particular failure
    • Diagnostic test results
    • Component status
  • Diagnosis without domain-specific knowledge
  • Allows distributed inference

[Figure: Bayes net for an IP path through A, B, C, and N, with nodes A-B Link, B-C Link, AN Path, BN Path, CN Path, and AN Probe, and evidence A-B Link=FAIL]

Probabilistic Relational Models (PRMs) can express diagnostic knowledge
  • Knowledge = shared knowledge about component and test classes
    • Class dependencies
    • Diagnostic tests
  • Agents generate Bayes net using PRM
  • Provided by experts or learned by agents
  • Extensible
    • New component and test classes
    • Subclassing (e.g. Wireless Link)

[Figure: PRM class diagram with classes Link (Status), IP Path (Status), and Ping Test (Result), connected by references First, Rest, and Path]
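To make "agents generate Bayes net using PRM" concrete, the following Python sketch unrolls class-level dependencies into instance-level nodes and edges for one path. It assumes the dependencies suggested by the diagram labels: an IP Path's Status depends on the Status of its First link and of its Rest path, and a Ping Test's Result depends on the Status of its Path. The `Node` type and the function names are hypothetical and are not part of CAPRI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    cls: str   # PRM class, e.g. "Link", "IPPath", "PingTest"
    name: str  # instance name, e.g. "A-B" or "AN"
    attr: str  # attribute, e.g. "Status" or "Result"


def unroll_path(hops):
    """Instantiate Bayes net nodes and edges for the IP path over `hops`."""
    nodes, edges = [], []

    def path_node(i):
        path = Node("IPPath", f"{hops[i]}{hops[-1]}", "Status")
        first = Node("Link", f"{hops[i]}-{hops[i + 1]}", "Status")
        nodes.extend([first, path])
        edges.append((first, path))      # IPPath.Status <- First.Status
        if i + 2 < len(hops):
            rest = path_node(i + 1)
            edges.append((rest, path))   # IPPath.Status <- Rest.Status
        return path

    top = path_node(0)
    return top, nodes, edges


def add_ping_test(path, nodes, edges):
    """Attach a PingTest instance whose Result depends on the path Status."""
    probe = Node("PingTest", path.name, "Result")
    nodes.append(probe)
    edges.append((path, probe))          # PingTest.Result <- Path.Status
    return probe


if __name__ == "__main__":
    top, nodes, edges = unroll_path(["A", "B", "C", "N"])
    add_ping_test(top, nodes, edges)
    for parent, child in edges:
        print(f"{parent.name} {parent.attr} -> {child.name} {child.attr}")
```

Because the dependencies are stated once per class, a new subclass such as a wireless link would only need its own conditional probabilities; the unrolling step stays the same.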

Probabilistic models enable agents to reduce diagnosis cost

Diagnosis cost = probing + communication cost

Diagnosis procedure:

  1. Receive failure report
  2. Construct Bayes net from PRM
  3. Incorporate current and past evidence using a Dynamic Bayes Net (DBN)
  4. Infer most probable explanation (MPE) for failure
  5. While mpe_confidence < confThresh:
    • Perform local tests or request diagnosis from other agents to maximize relevance/cost
  6. Propagate evidence to other agents
  7. Return diagnosis

Architectural points:

  • Agents can trade off accuracy vs. cost using a confidence threshold
  • Agents can infer current status from past evidence given a temporal failure model
  • Agents can reduce load and improve robustness by propagating evidence
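The confidence-threshold loop in the diagnosis procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the CAPRI algorithm. It assumes exactly one candidate component has failed, that each test checks one component without error, and that "relevance" can be approximated by the current belief that the tested component is the faulty one. The function `diagnose`, the component names, and all priors and costs are hypothetical.

```python
def normalize(belief):
    """Rescale beliefs to sum to 1 (left unchanged if they sum to 0)."""
    total = sum(belief.values())
    if total == 0:
        return dict(belief)
    return {c: p / total for c, p in belief.items()}


def diagnose(prior, test_cost, run_test, conf_thresh=0.9):
    """Single-fault diagnosis loop with a confidence threshold.

    prior:     component -> prior probability that it is the failed one
    test_cost: component -> cost of testing that component
    run_test:  callable(component) -> True if the component is found FAILED
    Returns (most probable component, its confidence, total test cost).
    """
    belief = normalize(prior)
    untested = set(belief)
    total_cost = 0.0
    while max(belief.values()) < conf_thresh and untested:
        # Pick the test with the best relevance/cost ratio.
        target = max(untested, key=lambda c: belief[c] / test_cost[c])
        untested.remove(target)
        total_cost += test_cost[target]
        if run_test(target):
            belief = {c: 1.0 if c == target else 0.0 for c in belief}
        else:
            belief[target] = 0.0
            belief = normalize(belief)
    mpe = max(belief, key=belief.get)
    return mpe, belief[mpe], total_cost


# Hypothetical priors and costs for the three candidate fault locations.
PRIOR = {"ISP": 0.2, "rest_of_path": 0.5, "destination": 0.3}
COST = {"ISP": 1.0, "rest_of_path": 5.0, "destination": 1.0}

if __name__ == "__main__":
    actual_fault = "rest_of_path"
    for thresh in (0.4, 0.9):
        print(thresh, diagnose(PRIOR, COST, lambda c: c == actual_fault, thresh))
```

With a threshold of 0.4 the agent answers from the prior alone at zero cost. With 0.9 it runs the two cheap tests and infers that the rest of the path has failed without paying for the expensive test.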

Minimizing cost for IP path diagnosis

[Figure: IP path from User through A, B, K, and N to Dest]

  • IP path diagnosis: is the failure in the ISP (A-B), the rest of the path (B-N), or the destination (N-Dest)?
  • Simulated topology of 6000 Autonomous Systems (ASes)
  • 1 DA per AS that can test links and destinations associated with that AS
  • All diagnostic agents have knowledge of prior link failure probabilities
  • Diagnostic agents are reachable up to the point of failure
  • Status of inter-AS links and destination hosts drawn from prior probabilities
  • Evidence collection and propagation follow DAs in the AS path

[Figure: DA 1 (User AS A), DA 2 (AS B), DA k (AS K), and DA n (Dest AS N) along the AS path, with labels Failure report, Evidence collection, Diagnosis, and Evidence propagation]

Agents can trade off accuracy and cost
  • 13 confidence thresholds, 500 users, 5 trials

[Figure: accuracy vs. cost plot with points labeled by confThresh values 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0]
Incorporating past evidence reduces probing costs
  • cache duration = number of past time steps of evidence to consider
  • Inter-AS link failures modeled as a Markov chain (Gilbert model)
  • 100 users, 5 trials, 30 time steps
  • >95% accuracy
Evidence propagation reduces probing and communication costs
  • 10,000 users
  • 5 trials
  • 1 time step
  • >95% accuracy

[Figure: plot with labels 50,000 failures and 100,000 failures]

Agents can learn probabilistic models for TCP overlay connection diagnosis

[Figure: TCP overlay path from User through Proxy to Server]

  1. Learn inter-AS TCP failure probabilities from PlanetSeer (28.3 million TCP connections from 196 hosts over 10 hours)
  2. Diagnose HTTP proxy connections on CoDeeN without using probes

[Figure: model diagram with Src AS, Dst AS, and Hour for each of TCP Conn. User-Proxy and TCP Conn. Proxy-Server, plus HTTP Proxy Conn. User-Server]
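One way to picture the learning step is as counting failures per (source AS, destination AS) pair in a connection log. The Python sketch below is a hypothetical illustration, not the method used in the talk. The record format, the Laplace smoothing with pseudo-count `alpha`, and the rule that a proxied connection fails when either TCP leg fails independently are all assumptions made for the example, and the hour variable from the model is omitted for brevity.

```python
from collections import defaultdict


def learn_failure_probs(records, alpha=1.0):
    """Estimate P(TCP connection fails | src AS, dst AS) from a log.

    records: iterable of (src_as, dst_as, failed) tuples
    alpha:   Laplace smoothing pseudo-count (an assumption of this sketch)
    """
    counts = defaultdict(lambda: [0, 0])  # pair -> [failures, total]
    for src, dst, failed in records:
        entry = counts[(src, dst)]
        entry[0] += int(failed)
        entry[1] += 1
    return {
        pair: (fails + alpha) / (total + 2 * alpha)
        for pair, (fails, total) in counts.items()
    }


def p_proxy_conn_fails(p_user_proxy, p_proxy_server):
    """P(proxied connection fails), assuming the two TCP legs fail independently."""
    return 1.0 - (1.0 - p_user_proxy) * (1.0 - p_proxy_server)


if __name__ == "__main__":
    # Made-up log entries with placeholder AS numbers.
    log = [(1, 2, True), (1, 2, False), (1, 2, False), (3, 4, False)]
    probs = learn_failure_probs(log)
    print(probs)
    print(p_proxy_conn_fails(probs[(1, 2)], probs[(3, 4)]))
```

Smoothing keeps rarely observed AS pairs from getting a probability of exactly 0 or 1, which matters when the learned values are later used for inference.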

Learned diagnostic knowledge improves accuracy
  • Accuracy: 80% vs. 53%
    • Train on hour x
    • Test on hour x + 1
  • Accuracy improves as training interval increases
    • Train on first x hours, test on hour x + 1
  • Accuracy remains high as training set age increases
    • Train on hour 1, test on hour x > 1
Benefits of CAPRI
  • An extensible language for diagnostic data and knowledge
    • Based on Bayes nets and PRMs
  • Distributed diagnosis while minimizing probing and communication cost
    • accuracy/cost tradeoff
    • incorporating past evidence
    • evidence propagation
  • Robustness to missing data
    • probabilistic inference using cached data
  • Ability to learn diagnostic knowledge
    • learn conditional failure probabilities using PRMs
Future Work
  • Costs and incentives
    • Learning the true network costs of diagnostic tests
    • Dynamically adjusting cost
    • Incentives for agents to reveal evidence
  • Intelligent routing of diagnostic queries
  • Temporal failure models
    • Learning temporal failure models
    • Predicting failure duration
  • Diagnosis using data from end users
Modeling dynamic networks
  • Model network component state as a Markov chain (Gilbert model)
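  • Dynamic Bayes net (DBN):

[Figure: two-state Markov chain over OK and FAIL with transition probabilities 0.03, 0.97, 0.71, and 0.29, unrolled into a DBN over states s1, s2, and s3]

\[
P(s_3 = \mathrm{OK} \mid s_1 = \mathrm{FAIL}) = P(s_3 = \mathrm{OK} \mid s_2 = \mathrm{OK})\,P(s_2 = \mathrm{OK} \mid s_1 = \mathrm{FAIL}) + P(s_3 = \mathrm{OK} \mid s_2 = \mathrm{FAIL})\,P(s_2 = \mathrm{FAIL} \mid s_1 = \mathrm{FAIL})
\]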
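The sum over the intermediate state s2 can be checked numerically with a short Python sketch. The transcript lists the four transition probabilities but does not say which arrow each belongs to. The assignment below (OK stays OK with 0.97 and fails with 0.03; FAIL stays FAIL with 0.71 and recovers with 0.29) is therefore an assumption. The names `TRANSITION`, `step`, and `predict` are hypothetical.

```python
# Assumed reading of the Markov chain diagram (not stated in the transcript):
# TRANSITION[current][next] = P(next state | current state)
TRANSITION = {
    "OK": {"OK": 0.97, "FAIL": 0.03},
    "FAIL": {"OK": 0.29, "FAIL": 0.71},
}


def step(dist):
    """Advance a distribution over {OK, FAIL} by one time step."""
    return {
        nxt: sum(dist[cur] * TRANSITION[cur][nxt] for cur in dist)
        for nxt in ("OK", "FAIL")
    }


def predict(start_state, steps):
    """Distribution over states `steps` time steps after observing start_state."""
    dist = {"OK": 0.0, "FAIL": 0.0}
    dist[start_state] = 1.0
    for _ in range(steps):
        dist = step(dist)
    return dist


if __name__ == "__main__":
    # P(s3 = OK | s1 = FAIL) = 0.97 * 0.29 + 0.29 * 0.71 = 0.4872
    print(round(predict("FAIL", 2)["OK"], 4))
```

Under this reading, a link observed as failed two time steps ago is OK now with probability 0.4872. This kind of calculation lets an agent weigh cached evidence against the cost of a fresh probe.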
