Fault Detection, Isolation, and Diagnosis in Multihop Wireless Networks

Lili Qiu, Paramvir Bahl, Ananth Rao, and Lidong Zhou

Microsoft Research

Presented by Maitreya Natu


Network Management

[Diagram: faults directory, root cause, healthy network, faulty network, corrective measure]


Tasks involved in Network Management

  • Continuously monitoring the functioning of the network

  • Collecting information about the nodes and the links

  • Removing inconsistencies and noise from the reported information

  • Analyzing the information

  • Taking appropriate actions to improve network reliability and performance


Challenges in wireless networks

  • Dynamic and unpredictable topology

    • Link errors due to fluctuating environmental conditions

    • Node mobility

  • Limited capacity

    • Scarcity of resources

  • Link attacks


Proposed framework

  • Reproduce, inside a simulator, the real-world events that took place in the network

  • Use online trace-driven simulation to detect faults and analyze their root causes


Network Management: Creating a Network Model

[Diagram (creating a network model): network model, types of faults, healthy network, faults directory]


Network Management: Fault Diagnosis

[Diagram (fault diagnosis): network model, types of faults, faults directory, faulty network, detected faults]


Network Management: What-if Analysis

[Diagram (what-if analysis): network model, types of faults, faults directory, detected faults, corrective measures]


Key issues

  • How to accurately reproduce, inside a simulator, what happened in the network

  • How to build fault diagnosis on top of a simulator to perform root cause analysis


Accurate modeling

  • Use real traces from the diagnosed network

    • Removes dependency on generic theoretical models

    • Captures nuances of the hardware, software and environment of the particular network

  • Collect good quality data

    • By developing a technique to effectively rule out erroneous data


Fault diagnosis

  • Performance data produced by the trace-driven simulation is used as the baseline

  • Any significant deviation from the baseline indicates a potential fault

  • The simulator selectively injects sets of suspected faults and searches for the set that best reproduces the observed performance

  • An efficient algorithm is designed to determine root causes


System Overview

[System diagram. Workflow: 1. Receive cleaned data (link RSS, link load, topology changes, routing updates)  2. Drive the traffic simulator  3. Compute expected performance (expected loss rate, throughput, noise)  4. Compare expected and observed performance  5. Discrepancy found  6. Search the faults directory (link/node failure, interference injection, error) for the set of faults that results in the best explanation  7. Report the cause of failure]


Why Simulation-Based Diagnosis?

  • Much better insights into the network behavior than any heuristic or theoretical technique

  • Highly customizable and applies to a large class of networks

  • Ability to perform what-if analysis

    • Helps to foresee the consequences of a corrective action

  • Recent advances in simulators have made possible their use for real-time analysis


Accurate modeling

[Diagram: network model, types of faults, healthy network, faults directory]


Current network models

  • Bayesian networks to map symptom-fault dependencies

  • Context Free Grammars

  • Correlation Matrix



Building confidence in simulator accuracy

  • Problem

    • Hard to accurately model the physical layer and the RF propagation

    • Traffic demands on the router are hard to predict


  • Solution

    • “after the fact” simulation

    • Agents periodically report information about the link conditions and traffic patterns, which is used to drive the simulator


Simulations when the RF condition of the link is good

  • Modeling the contention from flows within the interference and communication ranges

  • Modeling the overheads of the protocol stack, such as parity bits, MAC-layer back-off, IEEE 802.11 inter-frame spacing and ACK, and headers


Simulations with varying received signal strength

[Graph: measured vs. simulated throughput. Throughput matches the simulator's estimate closely when signal quality is good; the simulator's estimate deviates from the measured value when signal strength is poor.]


Why do simulation results deviate in the case of poor signal strength?

  • Lack of an accurate model of packet loss as a function of packet size, RSS, and ambient noise

    • Depends on the signal-processing hardware and the RF antenna within the wireless cards

  • Lack of an accurate model of auto-rate control

    • The sending rate is adjusted by WLAN cards based on the transmission conditions


How to model the auto-rate control done by WLAN cards?

  • Use trace-driven simulation (see the sketch below)

  • When auto-rate is in use

    • Collect the rate at which the wireless card is operating and provide the reported rate to the simulator

  • Otherwise

    • Data rate is known to the simulator
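To make the trace-driven handling of auto-rate concrete, here is a minimal Python sketch (not from the paper; the class name, field layout, and sample values are illustrative assumptions). It simply replays the PHY rates reported by the wireless card instead of trying to model the card's rate-control algorithm.

    # Hypothetical sketch: replay the card's reported rates instead of modelling them.
    import bisect

    class TraceDrivenRate:
        """Returns the PHY rate a link used at a given time, taken from trace samples."""
        def __init__(self, samples):
            # samples: list of (timestamp_seconds, rate_mbps) reported by the node's agent
            self.samples = sorted(samples)
            self.times = [t for t, _ in self.samples]

        def rate_at(self, t):
            # Use the most recent reported rate at or before time t.
            i = bisect.bisect_right(self.times, t) - 1
            return self.samples[max(i, 0)][1]

    # Example: the card reported 54 Mbps at first, then backed off to 24 Mbps at t = 10 s.
    link_rate = TraceDrivenRate([(0.0, 54.0), (10.0, 24.0)])
    print(link_rate.rate_at(3.0))   # 54.0
    print(link_rate.rate_at(12.5))  # 24.0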


How to accurately model packet loss as a function of packet size, RSS, and ambient noise?

  • Use offline analysis

  • Calibrate the wireless cards and create a database associating environmental factors with expected performance

    • E.g., a mapping from signal strength and noise to loss rate (illustrated by the lookup-table sketch below)
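A minimal sketch of the calibration-database idea, assuming a simple binned lookup table. The bin size, keys, and loss-rate values below are invented for illustration and are not the paper's measurements.

    # Hypothetical offline calibration table: (RSS bin, noise bin) -> expected loss rate.
    CALIBRATION = {
        (-60, -95): 0.01,
        (-75, -95): 0.05,
        (-85, -95): 0.30,
        (-85, -90): 0.55,
    }

    def expected_loss(rss_dbm, noise_dbm, bin_size=5):
        """Quantize the measured RSS and noise to calibration bins and look up the loss rate."""
        key = (bin_size * round(rss_dbm / bin_size),
               bin_size * round(noise_dbm / bin_size))
        # Fall back to a pessimistic default for conditions that were never calibrated.
        return CALIBRATION.get(key, 0.5)

    print(expected_loss(-84, -96))  # 0.30: weak signal on a quiet channel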


Experiment to model the loss rates due to poor signal strength

  • Collect another set of traces

    • Slowly send out packets

    • Place packet sniffers near both the sender and the receiver, and derive loss rate from the packet level trace

  • Seed the wireless link in the simulator with a Bernoulli loss rate that matches the loss rate observed in the real traces (see the sketch below)
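A minimal sketch, assuming the sniffer-derived loss rate is available per link: each simulated link drops packets independently (a Bernoulli process) with the measured probability.

    import random

    class BernoulliLink:
        """Drops each packet independently with the loss rate measured from the traces."""
        def __init__(self, measured_loss_rate, seed=None):
            self.p_drop = measured_loss_rate
            self.rng = random.Random(seed)

        def deliver(self, packet):
            # Return the packet if it survives the link, or None if it is dropped.
            return None if self.rng.random() < self.p_drop else packet

    link = BernoulliLink(measured_loss_rate=0.30, seed=1)
    delivered = sum(link.deliver(i) is not None for i in range(10_000))
    print(delivered / 10_000)  # close to 0.70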


Estimated and measured throughput when compensating for the loss rate due to poor signal strength

  • Even though the match is not perfect, this is not expected to be a problem, because:

    • Many routing protocols try to avoid the use of poor-quality links

    • Poor-quality links are used only when certain parts of the mesh network have poor connectivity to the rest of the network

    • In a well-engineered network, not many nodes depend on such bad links for routing

  • The loss rate and the measured throughput do not vary monotonically with the signal strength, due to the effect of auto-rate


Stability of channel conditions

  • How rapidly do channel conditions change, and how often should a trace be collected?


Temporal fluctuation in RSS

  • Fluctuation magnitude is not significant

  • The relative quality of signals across different numbers of walls remains stable


Stability of channel conditions

  • How rapidly do channel conditions change, and how often should a trace be collected?

    • When the environment is generally static, nodes may report only the average and standard deviation of the RSS to the manager every few minutes (a small aggregation sketch follows below)
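A small sketch of what such a report could look like on the agent side, using Welford's method to keep a running mean and standard deviation of the RSS. The field names are assumptions, not the system's wire format.

    import math

    class RssSummary:
        """Keeps only a running mean and standard deviation of RSS samples for one link."""
        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0  # sum of squared deviations (Welford's method)

        def add(self, rss_dbm):
            self.n += 1
            delta = rss_dbm - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (rss_dbm - self.mean)

        def report(self):
            std = math.sqrt(self.m2 / self.n) if self.n else 0.0
            return {"samples": self.n, "rss_mean": round(self.mean, 1), "rss_std": round(std, 1)}

    summary = RssSummary()
    for sample in (-71, -72, -70, -73, -71):
        summary.add(sample)
    print(summary.report())  # {'samples': 5, 'rss_mean': -71.4, 'rss_std': 1.0}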


Dealing with imperfect data

  • By neighborhood monitoring

    • Each node reports performance and traffic statistics for its incoming and outgoing links

    • And for other links in its communication range

      • Possible when node is in promiscuous mode

  • Thus multiple reports are sent for each link

  • Redundant reports can be used to detect inconsistency

  • Find the minimum set of nodes that can explain the inconsistencies in the reports (a greedy sketch follows below)
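The slide does not give the algorithm for finding that minimum set, so the sketch below uses a simple greedy cover heuristic as one plausible reading: each inconsistency is attributed to the set of nodes whose reports conflict, and the node involved in the most unexplained conflicts is blamed first. The inputs are invented for illustration.

    def greedy_suspects(inconsistencies):
        """inconsistencies: list of sets of node ids whose reports about a link conflict."""
        remaining = [set(conflict) for conflict in inconsistencies]
        suspects = set()
        while remaining:
            # Pick the node that appears in the most still-unexplained conflicts.
            counts = {}
            for conflict in remaining:
                for node in conflict:
                    counts[node] = counts.get(node, 0) + 1
            worst = max(counts, key=counts.get)
            suspects.add(worst)
            remaining = [c for c in remaining if worst not in c]
        return suspects

    # Node 3 disagrees with several neighbors about the same links, so blaming it
    # alone explains all three conflicts.
    print(greedy_suspects([{1, 3}, {2, 3}, {3, 4}]))  # {3}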


Summary

  • How to accurately model the real behavior?

    • Solution: Use trace-based simulation

  • Problem: Simulation results are good for strong signals but deviate for bad RF conditions

    • Need to model the autorate control

      • Use trace-driven data

    • Need to model the loss rate due to poor signal strength

      • Use offline analysis

  • How often should a trace be collected?

    • Very little data (the average and standard deviation of the RSS), at a fairly coarse time granularity, since channels are relatively stable

  • How to deal with imperfect data?

    • By neighborhood monitoring


Fault diagnosis

[Diagram (fault diagnosis): network model, types of faults, faults directory, faulty network, detected faults]


Current fault diagnosis approaches

  • AI techniques

    • Rule based systems

    • Neural networks

  • Model traversing techniques

    • Dependency graphs

    • Causality graphs

    • Bayesian networks


Fault Isolation and Diagnosis

  • Establish the expected performance in the simulation

  • Find difference between expected and observed performance

  • Search over the fault space to detect which set of faults can reproduce performance similar to what has been observed


Collecting data from traces

  • Trace data collection

    • Network topology

      • Each node reports its neighbor table and routing table

    • Traffic statistics

      • Each node maintains counters of traffic sent and received from immediate neighbors

    • Physical medium

      • Each node reports signal strength of wireless links to neighbors

    • Network performance

      • Includes both link and end-to-end performance, which can be measured through loss rate, delay, and throughput

      • The focus is on link-level performance (an illustrative per-node report structure follows below)
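To make the four categories above concrete, here is an illustrative shape for a per-node trace report; the class and field names are assumptions for this sketch, not the system's actual format.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class NodeReport:
        node_id: int
        neighbors: List[int] = field(default_factory=list)          # network topology
        routes: Dict[int, int] = field(default_factory=dict)        # destination -> next hop
        tx_counters: Dict[int, int] = field(default_factory=dict)   # packets sent, per neighbor
        rx_counters: Dict[int, int] = field(default_factory=dict)   # packets received, per neighbor
        rss_dbm: Dict[int, float] = field(default_factory=dict)     # physical medium, per neighbor
        link_loss: Dict[int, float] = field(default_factory=dict)   # link-level performance

    report = NodeReport(node_id=1,
                        neighbors=[2, 4],
                        routes={5: 4},
                        tx_counters={2: 1200, 4: 800},
                        rx_counters={2: 1100, 4: 780},
                        rss_dbm={2: -68.0, 4: -74.5},
                        link_loss={2: 0.08, 4: 0.03})
    print(report.link_loss[2])  # 0.08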


Simulating the network performance

  • Traffic load simulation

    • Link based traffic simulation

    • Adjust application sending rate to match the observed link-level traffic counts

  • Route simulation

    • Use actual routes taken by packets as input to the simulator

  • Wireless signal

    • Use real measurement of signal strength

  • Fault injection (a small wrapper sketch follows below)

    • Random packet dropping

    • External noise sources

    • MAC misbehavior
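A hedged sketch of the fault-injection idea: each candidate fault is a wrapper around a simulated link, so the search can turn faults on and off without changing the rest of the simulation. The classes and the way noise is represented are illustrative assumptions, not the simulator's actual interface.

    import random

    class IdealLink:
        def deliver(self, pkt):
            return pkt  # always delivers

    class RandomDropFault:
        """Injects random packet dropping with a chosen probability."""
        def __init__(self, link, drop_prob, seed=0):
            self.link, self.drop_prob, self.rng = link, drop_prob, random.Random(seed)
        def deliver(self, pkt):
            return None if self.rng.random() < self.drop_prob else self.link.deliver(pkt)

    class ExternalNoiseFault:
        """Injects an external noise source near the link."""
        def __init__(self, link, extra_noise_db):
            self.link, self.extra_noise_db = link, extra_noise_db
        def deliver(self, pkt):
            # A real simulator would lower the SINR and hence the delivery probability;
            # here the packet is only tagged, to keep the sketch short.
            pkt["noise_db"] = pkt.get("noise_db", -95) + self.extra_noise_db
            return self.link.deliver(pkt)

    faulty = RandomDropFault(ExternalNoiseFault(IdealLink(), extra_noise_db=6), drop_prob=0.2)
    print(faulty.deliver({"seq": 1}))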


Fault diagnosis algorithm

  • General approach

[Diagram: the simulator is run on the observed network settings without faults to produce the expected performance, and with a candidate fault set to reproduce the observed performance; the question is how to find that fault set. A toy version of this search loop is sketched below.]
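A toy version of the search loop implied by the diagram, with a stand-in simulator (the real system would run the trace-driven simulator for each candidate fault set): score each candidate by how closely its simulated performance matches the observed performance, and keep the best one.

    def explanation_error(observed, simulated):
        """Sum of absolute differences over the shared performance metrics."""
        return sum(abs(observed[m] - simulated[m]) for m in observed)

    def best_fault_set(network_settings, observed, candidate_fault_sets, simulate):
        scored = [(explanation_error(observed, simulate(network_settings, faults)), faults)
                  for faults in candidate_fault_sets]
        return min(scored, key=lambda pair: pair[0])[1]

    # Toy stand-in simulator: the simulated loss rate rises with the injected drop probability.
    def simulate(settings, faults):
        return {"loss_rate": settings["base_loss"] + faults.get("drop_prob", 0.0)}

    observed = {"loss_rate": 0.35}
    candidates = [{}, {"drop_prob": 0.1}, {"drop_prob": 0.3}]
    print(best_fault_set({"base_loss": 0.05}, observed, candidates, simulate))
    # -> {'drop_prob': 0.3}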


How to search the faults efficiently?

  • Different types of faults often change only one or a few metrics

    • E.g., random dropping only affects the link loss rate

  • Thus, the metrics for which observed and expected performance differ significantly are used to guide the search


Scenario where faults do not have strong interactions

  • Consider a large deviation from the expected performance as an anomaly

  • Use a decision tree to determine the type of fault (a sketch of such a tree follows below)

  • The fault type determines the metric used to quantify the performance difference

  • Locate faults by finding the set of nodes and links with a large difference between expected and observed performance
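A sketch of the decision-tree step, following the same branches as the example scenario later in the deck (increased sending rate, increased noise, increased loss); the metric names and the threshold are illustrative assumptions.

    def classify_fault(expected, observed, threshold=0.1):
        """Walk a fixed decision tree over the deviations between expected and observed metrics."""
        def increased(metric):
            return observed[metric] - expected[metric] > threshold

        if increased("sending_rate"):
            return "too-low contention window (MAC misbehavior)"
        if increased("noise"):
            return "external noise source"
        if increased("loss_rate"):
            return "packet dropping"
        return "normal"

    expected = {"sending_rate": 1.0, "noise": 0.0, "loss_rate": 0.05}
    observed = {"sending_rate": 1.0, "noise": 0.02, "loss_rate": 0.40}
    print(classify_fault(expected, observed))  # packet dropping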


Scenario where faults have strong interactions

  • Get the initial diagnosis set from the decision tree algorithm

  • Iteratively refine the fault set

    • Adjust the magnitudes of faults in the fault set

      • Translate difference in performance into change in faults’ magnitude

      • i.e., map the observed impact of a fault back to its magnitude

      • Remove faults whose magnitude becomes too small

    • Add new faults that can explain large differences between the expected and observed performance

  • Iterate until the change in the fault set is negligible (a toy refinement loop is sketched below)
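A toy refinement loop in the spirit of the bullets above (the "add new faults" step is omitted for brevity): each fault's magnitude is nudged in proportion to the remaining gap between observed and simulated performance, faults that become negligible are dropped, and iteration stops when the fault set barely changes. The stand-in simulator and all constants are assumptions.

    def refine_fault_set(initial_faults, observed, simulate, step=0.5,
                         min_magnitude=0.01, max_iters=20):
        faults = dict(initial_faults)  # e.g. {"drop@node1": 0.10, "noise@node3": 0.10}
        for _ in range(max_iters):
            simulated = simulate(faults)
            updated = {}
            for fault, magnitude in faults.items():
                # Map the residual performance difference back into a magnitude change.
                residual = observed[fault] - simulated[fault]
                new_mag = magnitude + step * residual
                if new_mag >= min_magnitude:       # drop faults that became negligible
                    updated[fault] = new_mag
            if all(abs(updated.get(f, 0.0) - faults.get(f, 0.0)) < 1e-3
                   for f in set(updated) | set(faults)):
                return updated
            faults = updated
        return faults

    # Toy stand-in simulator: each fault's observable metric simply equals its magnitude.
    observed = {"drop@node1": 0.30, "noise@node3": 0.0}
    print(refine_fault_set({"drop@node1": 0.10, "noise@node3": 0.10}, observed,
                           simulate=lambda f: {k: f.get(k, 0.0) for k in observed}))
    # -> roughly {'drop@node1': 0.30}: the noise fault is dropped, the drop fault converges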


Example scenario

[Topology diagram: a five-node multihop network, nodes numbered 1 through 5]


Example scenario

  • Observed performance:

    • Increased loss rate on links 1-4 and 1-2

    • No increase in the sending rate on 1-4 and 1-2

    • No increase in the noise experienced by the neighbors

[Decision-tree diagram: Increased sending rate? Yes -> too low CW. No -> Increased noise? Yes -> noise. No -> Increased loss? Yes -> packet drop. No -> normal.]


Example scenario (continued)

  • Walking the decision tree with these symptoms (no increased sending rate, no increased noise, increased loss) leads to the packet-drop branch

  • Inference: packet dropping at node 1


Accuracy of fault diagnosis

  • Correctness of the model

    • Complete information

    • Consistent information

    • Timely information

  • Correctness of the reported symptoms

    • Choosing the right threshold for reporting a symptom

    • Difference in the behavior of faults

    • Timely reporting of symptoms


System implementation

  • Windows XP

  • Agents run on every wireless node and report the collected information on demand

  • Managers collect and analyze information

  • Collected information is cast into performance counters supported by Windows

  • The manager is connected to a backend simulator; the collected information is converted into a script that drives the simulation

  • Testbed:

    • Multihop wireless testbed built using IEEE 802.11a cards

    • A commercially available network sniffer, AiroPeek, is used for data collection

    • Native 802.11 NICs provide a rich set of networking information


Evaluation: Data collection overhead

[Graphs: management traffic overhead, and performance of an FTP flow with and without data collection]

  • Data collection traffic has little effect on the FTP flow's performance

  • Overhead < 800 bits/s/node

  • No data cleaning: each link is reported only once

  • With data cleaning: each link is reported by all observers, for consistency checking


Data cleaning effectiveness

  • Coverage greater than 80% in all cases

  • Higher accuracy with a grid topology

  • Higher coverage when using history

  • Higher accuracy with denser networks

  • Higher accuracy with client-server traffic


Evaluation: Fault diagnosis

  • Detecting external noise

    • Symptom: significant difference in the noise level at nodes

    • Noise sources are correctly identified, with at most one or two false positives

    • The inference error in the magnitudes of the noise is within 4%

  • Detecting random dropping

    • Symptom: significant difference in loss rates on links

    • Less than 20% of faulty links are left undetected

    • No-effect faults are faulty links that send less than the threshold (250 packets) of data


Evaluation: Fault diagnosis

  • Detecting MAC misbehavior, and detecting combinations of all of the above faults

    • Symptom: significant discrepancy in throughput on links

    • Coverage is mostly around 80% or higher

    • False positives are within 2


What-if analysis

[Diagram (what-if analysis): network model, types of faults, faults directory, detected faults, corrective measures]


What-if analysis

[Diagram: topology, diagnosis, corrective measures]


Limitations

  • Limited by the accuracy of the simulator

  • The time to detect faults is acceptable for long-term faults but not for transient faults

  • The choice of traces used to drive the simulation has important implications

  • The focus has only been on faults that result in observable differences in behavior


Conclusion

  • Used trace data for modeling the network

  • Data collection techniques are presented to gather network information and detect deviations from the expected performance

  • A fault diagnosis algorithm is proposed to detect the root causes of failures

  • A scheme for what-if analysis is proposed to evaluate alternative network configurations for efficient network operation


Future work

  • Validation on a large test-bed

  • Performance analysis in the presence of mobility

  • Detecting malicious attacks

  • Diagnosis in the presence of incomplete network information

  • Investigating the potential of what-if analysis more deeply


References

  • L. Qiu, P. Bahl, A. Rao, and L. Zhou, "Fault Detection, Isolation, and Diagnosis in Multihop Wireless Networks," Microsoft Research Technical Report MSR-TR-2004-11, Dec. 2003.

  • M. Steinder and A. Sethi, "A Survey of Fault Localization Techniques in Computer Networks," Technical Report, CIS Dept., Univ. of Delaware, Feb. 2001.

  • M. Steinder, "Probabilistic Inference for Diagnosing Service Failures in Communication Systems," PhD thesis, Univ. of Delaware, 2003.


Questions

  • What is proposed solution to model the throughput when the signal strength is poor? In Table 2, the simulated throughput monotonically decreases with the loss rate while the measured throughput does not. Why?

  • What could be the causes of generation of false positives in the fault diagnosis results? When can the false positive ratio increase?

  • http://www.cis.udel.edu/~natu/861/861.html

