



Internet Routing (COS 598A)

Today: Detecting Anomalies Inside an AS

Jennifer Rexford

http://www.cs.princeton.edu/~jrex/teaching/spring2005

Tuesdays/Thursdays 11:00am-12:20pm


Outline

  • Traffic

    • SNMP link statistics

    • Packet and flow monitoring

  • Network topology

    • IP routers and links

    • Fault data, layer-2 topology, and configuration

    • Intradomain route monitoring

  • Interdomain routes

    • BGP route monitoring

    • Analysis of BGP update data

  • Conclusions


Why is Traffic Measurement Important?

  • Billing the customer

    • Measure usage on links to/from customers

    • Apply a billing model to generate a bill

  • Traffic engineering and capacity planning

    • Measure the traffic matrix (i.e., offered load)

    • Tune routing protocol or add new capacity

  • Denial-of-service attack detection

    • Identify anomalies in the traffic

    • Configure routers to block the offending traffic

  • Analyze application-level issues

    • Evaluate benefits of deploying a Web caching proxy

    • Quantify fraction of traffic that is P2P file sharing


Collecting Traffic Data: SNMP

  • Simple Network Management Protocol

    • Standard Management Information Base (MIB)

    • Protocol for querying the MIBs

  • Advantage: ubiquitous

    • Supported on all networking equipment

    • Multiple products for polling and analyzing data

  • Disadvantages: dumb

    • Coarse granularity of the measurement data

      • E.g., number of bytes/packets per interface per 5 minutes

    • Cannot express complex queries on the data

    • Unreliable delivery of the data using UDP
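As a rough illustration of how coarse SNMP counters are turned into link statistics, the sketch below computes average utilization from two polls of an interface octet counter (such as ifInOctets in the standard interface MIB). The function name, the readings, and the link capacity are hypothetical, and the sketch assumes the counter wraps at most once between polls.

```python
def link_utilization(prev_octets, curr_octets, interval_s, capacity_bps, counter_bits=32):
    """Average utilization (0..1) between two polls of an octet counter."""
    modulus = 1 << counter_bits
    # Counters wrap around; the modulo recovers the delta across a single wrap.
    delta_octets = (curr_octets - prev_octets) % modulus
    bits_per_second = delta_octets * 8 / interval_s
    return bits_per_second / capacity_bps


# Hypothetical readings taken 5 minutes (300 s) apart on a 100 Mb/s link.
util = link_utilization(prev_octets=1_000_000, curr_octets=376_000_000,
                        interval_s=300, capacity_bps=100_000_000)
print(f"{util:.1%}")
```

Because only one average per polling interval is available, short bursts inside the 5-minute window are invisible, which is the coarse granularity noted above.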


Collecting Traffic Data: Packet Monitoring

  • Packet monitoring

    • Passively collecting IP packets on a link

    • Recording IP, TCP/UDP, or application-layer traces

  • Advantages: details

    • Fine-grain timing information

      • E.g., can analyze the burstiness of the traffic

    • Fine-grain packet contents

      • Addresses, port numbers, TCP flags, URLs, etc.

  • Disadvantages: overhead

    • Hard to keep up with high-speed links

    • Often requires a separate monitoring device


Collecting Traffic Data: Flow Statistics

  • Flow monitoring (e.g., Cisco Netflow)

    • Statistics about groups of related packets (e.g., same IP/TCP headers and close in time)

    • Recording header information, counts, and time

  • Advantages: detail with less overhead

    • Almost as good as packet monitoring, except no fine-grain timing information or packet contents

    • Often implemented directly on the interface card

  • Disadvantages: trade-off between detail and overhead

    • Less detail than packet monitoring

    • Less ubiquitous than SNMP statistics
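The idea of a flow (packets with the same headers that are close in time) can be sketched as follows. This is a simplified illustration, not how any particular router implements flow export; the `Packet` fields, the 30-second idle timeout, and the record layout are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical packet summary: arrival time plus the header fields used as the flow key.
Packet = namedtuple("Packet", "time src dst proto sport dport size")


def aggregate_flows(packets, idle_timeout=30.0):
    """Group time-ordered packets into flow records keyed by header fields."""
    active = {}    # header key -> currently open flow record
    finished = []  # flow records closed by the idle timeout
    for p in packets:
        key = (p.src, p.dst, p.proto, p.sport, p.dport)
        flow = active.get(key)
        # A long gap since the last packet ends the old flow and starts a new one.
        if flow is not None and p.time - flow["end"] > idle_timeout:
            finished.append(active.pop(key))
            flow = None
        if flow is None:
            flow = {"key": key, "start": p.time, "end": p.time, "packets": 0, "bytes": 0}
            active[key] = flow
        flow["end"] = p.time
        flow["packets"] += 1
        flow["bytes"] += p.size
    return finished + list(active.values())


packets = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 6, 1234, 80, 100),
    Packet(1.0, "10.0.0.1", "10.0.0.2", 6, 1234, 80, 200),
    Packet(1.5, "10.0.0.3", "10.0.0.2", 17, 5353, 53, 60),
    Packet(100.0, "10.0.0.1", "10.0.0.2", 6, 1234, 80, 50),
]
flows = aggregate_flows(packets)
for f in flows:
    print(f["key"], f["packets"], f["bytes"])
```

Each record keeps header information, counts, and start/end times, but the per-packet timing and payload are gone, which matches the trade-off listed above.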


Using the Traffic Data in Network Operations

  • SNMP byte/packet counts: everywhere

    • Tracking link utilizations and detecting anomalies

    • Generating bills for traffic on customer links

    • Inference of the offered load (i.e., traffic matrix)

  • Packet monitoring: selected locations

    • Analyzing the small time-scale behavior of traffic

    • Troubleshooting specific problems on demand

  • Flow monitoring: selective, e.g., network edge

    • Tracking the application mix

    • Direct computation of the traffic matrix

    • Input to denial-of-service attack detection
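A minimal sketch of the direct traffic-matrix computation from edge flow data is shown below. It assumes each flow record carries the ingress router, the destination prefix, and a byte count, and that a separate table (here the hypothetical `egress_for_prefix`) says where traffic to each prefix leaves the network. The router names and prefixes are made up.

```python
from collections import defaultdict


def traffic_matrix(flow_records, egress_for_prefix):
    """Sum bytes per (ingress, egress) pair from edge flow records."""
    matrix = defaultdict(int)
    for ingress, dst_prefix, byte_count in flow_records:
        egress = egress_for_prefix.get(dst_prefix)
        if egress is None:
            continue  # no known exit point for this prefix; skip the record
        matrix[(ingress, egress)] += byte_count
    return dict(matrix)


# Hypothetical flow records: (ingress router, destination prefix, bytes).
flow_records = [
    ("NYC", "192.0.2.0/24", 1000),
    ("NYC", "198.51.100.0/24", 500),
    ("SFO", "192.0.2.0/24", 250),
    ("NYC", "192.0.2.0/24", 2000),
]
egress_for_prefix = {"192.0.2.0/24": "CHI", "198.51.100.0/24": "DAL"}
tm = traffic_matrix(flow_records, egress_for_prefix)
print(tm)
```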


Network Topology


IP Topology

  • Topology information

    • Routers

    • Links, and their capacities

      • Internal links inside the AS

      • Edge links connecting to neighboring domains

  • Ways to learn the topology

    • Inventory database

    • SNMP polling/traps

    • Traceroute

    • Route monitoring

    • Router configuration data


Below IP

  • Layer-2 paths

    • ATM virtual circuits

    • Frame Relay virtual circuits

  • Mapping to lower layers

    • Specific fibers

    • Shared optical amplifiers

    • Shared conduits

    • Physical length (propagation delay)

  • Information not visible to IP

    • Stored in an inventory database

    • Not necessarily generated/updated automatically


Intradomain Monitoring: OSPF Protocol

  • Link-state protocol

    • Routers flood Link State Advertisements (LSAs)

    • Routers compute shortest paths based on weights

    • Routers identify next-hop to reach other routers

[Figure: example network graph with link weights]
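The shortest-path and next-hop computation that each link-state router performs can be sketched with Dijkstra's algorithm. The topology and weights below are hypothetical (they are not the ones from the slide's figure), and the sketch ignores OSPF details such as areas and equal-cost multipath.

```python
import heapq


def shortest_paths(graph, source):
    """Return (dist, next_hop) for every router reachable from source."""
    dist = {source: 0}
    next_hop = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry; a shorter path was already found
        for nbr, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                # The first hop is the neighbor itself when leaving the source,
                # otherwise it is inherited from the node we came through.
                next_hop[nbr] = nbr if node == source else next_hop[node]
                heapq.heappush(heap, (nd, nbr))
    return dist, next_hop


# Hypothetical four-router topology with symmetric link weights.
graph = {
    "A": {"B": 2, "C": 1},
    "B": {"A": 2, "D": 3},
    "C": {"A": 1, "D": 5},
    "D": {"B": 3, "C": 5},
}
dist, next_hop = shortest_paths(graph, "A")
print(dist)
print(next_hop)
```

A route monitor that has collected the same LSAs can run the same computation offline to reconstruct the paths the routers are using.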


Intradomain Route Monitoring

  • Construct continuous view of topology

    • Detect when equipment goes up or down

    • Input to traffic-engineering and planning tools

  • Detect routing anomalies

    • Identify failures, LSA storms, and route flaps

    • Verify that LSA load matches expectations

    • Flag strange weight settings as misconfigurations

  • Analyze convergence delay

    • Monitor LSAs in multiple locations with go

    • Compare the times when LSAs arrive

  • Detect router implementation mistakes
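To make the convergence-delay analysis concrete, the sketch below compares when the same LSA arrives at different monitors. It assumes the monitors' clocks are synchronized and that each LSA has a comparable identifier. The monitor names, LSA identifiers, and times are hypothetical.

```python
def lsa_propagation_spread(arrivals):
    """arrivals: monitor -> {lsa_id: arrival_time}. Returns lsa_id -> (latest - earliest)."""
    times = {}
    for seen in arrivals.values():
        for lsa_id, t in seen.items():
            times.setdefault(lsa_id, []).append(t)
    # Only LSAs seen at two or more monitors can be compared.
    return {lsa_id: max(ts) - min(ts) for lsa_id, ts in times.items() if len(ts) > 1}


arrivals = {
    "monitor1": {"lsa1": 10.0, "lsa2": 20.0},
    "monitor2": {"lsa1": 10.5},
}
spread = lsa_propagation_spread(arrivals)
print(spread)
```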


Passive Collection of LSAs

  • OSPF is a flooding protocol

    • Every LSA sent on every participating link

    • Very helpful for simplifying the monitor

  • Can participate in the protocol

    • Shared media (e.g., Ethernet)

      • Join multicast group and listen to LSAs

    • Point-to-point links

      • Establish an adjacency with a router

  • … or passively monitor packets on a link

    • Tap a link and capture the OSPF packets


Reducing the Volume of Information

  • Prioritizing the messages

    • Router failure over router recovery

    • Link failure or weight change over a refresh

    • Informational messages about weight settings

  • Grouping related messages

    • Link failure: group messages for the two ends

    • Router failure: group the affected links

    • Common failure: group links failing close in time
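The grouping rules above can be sketched in two steps: merge the reports from the two ends of a link, then group links that fail close in time. The report format, the 5-second window, and the assumption that each link fails at most once in the analyzed period are illustrative choices, not taken from the slides.

```python
def group_failures(reports, window=5.0):
    """reports: list of (time, router, neighbor) link-down reports."""
    # Step 1: merge the reports from the two ends of the same link,
    # keeping the earliest report time for each link.
    links = {}
    for t, router, neighbor in sorted(reports):
        links.setdefault(frozenset((router, neighbor)), t)
    # Step 2: put links whose failures are close in time into one group.
    groups = []
    for link, t in sorted(links.items(), key=lambda item: item[1]):
        if groups and t - groups[-1]["last"] <= window:
            groups[-1]["links"].append(link)
            groups[-1]["last"] = t
        else:
            groups.append({"first": t, "last": t, "links": [link]})
    return groups


reports = [
    (10.0, "A", "B"), (10.2, "B", "A"),   # both ends of link A-B
    (11.0, "A", "C"), (11.1, "C", "A"),   # both ends of link A-C
    (60.0, "D", "E"),                     # unrelated later failure
]
groups = group_failures(reports)
for g in groups:
    print(g["first"], [sorted(link) for link in g["links"]])
```

In this example the two links attached to router A end up in one group, which an operator could read as a single underlying problem at or near that router.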


Anomalies Found in the Shaikh04 paper

  • Intermittent hardware problem

    • Router periodically losing OSPF adjacencies

    • Risk of network partition if 2nd failure occurred

  • External link flaps

    • Congestion on edge link causing lost messages

    • Lost adjacency leading to flapping routes

  • Configuration errors

    • Two routers assigned the same IP address

    • Inefficient config leading to duplicate LSAs

  • Vendor implementation bug

    • More frequent refreshing of LSAs than specified


Interdomain Route Monitoring


Motivation for BGP Monitoring

  • Visibility into external destinations

    • What neighboring ASes are telling you

    • How you are reaching external destinations

  • Detecting anomalies

    • Increases in number of destination prefixes

    • Lost reachability to some destinations

    • Route hijacking

    • Instability of the routes

  • Input to traffic-engineering tools

    • Knowing the current routes in the network

  • Workload for testing routers

    • Realistic message traces to play back to routers


BGP Monitoring: A Wish List

  • Ideally: knowing what the router knows

    • All externally-learned routes

    • Before policy has modified the attributes

    • Before a single best route is picked

  • How to achieve this

    • Special monitoring session on routers that tells everything they have learned

    • Packet monitoring on all links with BGP sessions

  • If you can’t do that, you could always do…

    • Periodic dumps of routing tables

    • BGP session to learn best route from router


Using Routers to Monitor BGP

  • Talk to operational routers using SNMP or telnet at command line

    • (-) BGP table dumps are expensive

    • (+) Table dumps show all alternate routes

    • (-) Update dynamics lost

    • (-) Restricted to interfaces provided by vendors

  • Establish a “passive” BGP session (eBGP or iBGP) from a workstation running BGP software

    • (+) BGP table dumps do not burden operational routers

    • (-) Receives only best routes from BGP neighbor

    • (+) Update dynamics captured

    • (+) Not restricted to interfaces provided by vendors


Collect BGP Data From Many Routers

[Figure: map of routers in U.S. cities (Seattle, Cambridge, Chicago, Detroit, New York, Kansas City, Philadelphia, Denver, San Francisco, St. Louis, Washington, D.C., Los Angeles, Dallas, Atlanta, San Diego, Phoenix, Austin, Orlando, Houston) and a Route Monitor]

BGP is not a flooding protocol


Detecting Important Routing Changes

  • Large volume of BGP update messages

    • Around 2 million/day, and very bursty

    • Too much for an operator to manage

  • Identify important anomalies

    • Lost reachability

    • Persistent flapping

    • Large traffic shifts

  • Not the same as root-cause analysis

    • Identify changes and their effects

    • Focus on mitigation, rather than diagnosis

    • Diagnose causes if they occur in/near the AS


Challenge #1: Excess Update Messages

  • A single routing change

    • Leads to multiple update messages

    • Affects routing decision at multiple routers

[Figure: network of routers (BR, E); BGP updates pass through “BGP Update Grouping” to produce events and persistent flapping prefixes]

Group updates for a prefix with inter-arrival < 70 seconds,

and flag prefixes with changes lasting > 10 minutes.
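A minimal sketch of this grouping rule follows, using the 70-second timeout and 10-minute duration stated above. The input format (a list of timestamped prefixes) and the function name are assumptions; a real monitor would also track which router and session each update came from.

```python
def group_updates(updates, timeout=70.0, flap_duration=600.0):
    """updates: iterable of (time_s, prefix). Returns (events, flapping_prefixes)."""
    open_events = {}  # prefix -> most recent event for that prefix
    events = []
    for t, prefix in sorted(updates):
        ev = open_events.get(prefix)
        if ev is not None and t - ev["end"] < timeout:
            # Inter-arrival below the timeout: same event.
            ev["end"] = t
            ev["count"] += 1
        else:
            ev = {"prefix": prefix, "start": t, "end": t, "count": 1}
            open_events[prefix] = ev
            events.append(ev)
    # Events lasting longer than flap_duration mark the prefix as persistently flapping.
    flapping = {e["prefix"] for e in events if e["end"] - e["start"] > flap_duration}
    return events, flapping


# Hypothetical trace: one prefix updated every 60 s for 11 minutes, one updated sporadically.
updates = [(t, "192.0.2.0/24") for t in range(0, 661, 60)]
updates += [(0, "198.51.100.0/24"), (30, "198.51.100.0/24"), (500, "198.51.100.0/24")]
events, flapping = group_updates(updates)
print(len(events), flapping)
```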


Determine “Event Timeout”

[Figure: cumulative distribution of BGP update inter-arrival time, including BGP beacon data; marked point at (70, 98%)]


Event Duration: Persistent Flapping

[Figure: complementary cumulative distribution of event duration; “Long Events” marked at (600, 0.1%)]


Detecting Persistent Flapping

  • Significant persistent flapping

    • 15.2% of all BGP update messages

    • … though a small number of destination prefixes

    • Surprising, especially since flap damping is used

  • Types of persistent flapping

    • Conservative flap-damping parameters (78.6%)

    • Protocol oscillations, e.g., MED oscillation (18.3%)

    • Unstable interface or BGP session (3.0%)


Example: Unstable eBGP Session

  • Flap damping parameters are session-based

  • Damping not implemented for iBGP sessions

[Figure: AT&T network (routers A, B, C, D, and E) with a peer, a customer, and prefix p]


Challenge #2: Identify Important Events

  • Major concerns of network operators

    • Changes in reachability

    • Heavy load of routing messages on the routers

    • Flow of the traffic through the network

Classify events by the type of impact they have on the network

[Figure: events pass through “Event Classification” to become “typed” events: No Disruption, Loss/Gain of Reachability, Internal Disruption, Single External Disruption, Multiple External Disruption]


Event Category – “No Disruption”

“No Disruption”: each of the border routers has no traffic shift

[Figure: AT&T network (routers A, B, C, D, and E) reaching prefix p via AS1 and AS2; no traffic shift]


Event Category – “Internal Disruption”

“Internal Disruption”: all of the traffic shifts are internal traffic shifts

[Figure: AT&T network (routers A, B, C, D, and E) reaching prefix p via AS1 and AS2; internal traffic shift]


Event Category – “Single External Disruption”

“Single External Disruption”: traffic at one exit point shifts to other exit points

[Figure: AT&T network (routers A, B, C, D, and E) reaching prefix p via AS1 and AS2; external traffic shift]
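The event categories can be sketched as a small classifier. This is only an illustration under stated assumptions: each border router's traffic shift is assumed to be already labeled as `None`, `"internal"`, or `"external"`, a change in reachability is passed in as two flags, and “multiple external disruption” is assumed to mean an external shift at more than one exit point (the slides do not define it). How the labels are derived from BGP data is outside this sketch.

```python
def classify_event(shifts, reachable_before=True, reachable_after=True):
    """shifts: border router -> None, "internal", or "external" (assumed labels)."""
    if reachable_before != reachable_after:
        return "loss/gain of reachability"
    kinds = [k for k in shifts.values() if k is not None]
    if not kinds:
        return "no disruption"            # no border router shifts traffic
    if all(k == "internal" for k in kinds):
        return "internal disruption"      # every shift is internal
    external = sum(1 for k in kinds if k == "external")
    # Assumption: one external shift is "single", more than one is "multiple".
    return "single external disruption" if external == 1 else "multiple external disruption"


print(classify_event({"A": None, "B": None}))
print(classify_event({"A": "internal", "B": None}))
print(classify_event({"A": "external", "B": "internal"}))
```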


Statistics on Event Classification


Challenge #3: Multiple Destinations

  • A single routing change

    • Affects multiple destination prefixes

Group events of same type that occur close in time

[Figure: “typed” events pass through “Event Correlation” to produce clusters]


Main Causes of Large Clusters

  • External BGP session resets

    • Failure/recovery of external BGP session

    • E.g., session to another large tier-1 ISP

    • Caused “single external disruption” events

    • Validated by looking at syslog reports on routers

  • Hot-potato routing changes

    • Failure/recovery of an intradomain link

    • E.g., leads to changes in IGP path costs

    • Caused “internal disruption” events

    • Validated by looking at OSPF measurements


Challenge #4: Popularity of Destinations

  • Impact of event on traffic

    • Depends on the popularity of the destinations

Weight the group of destinations by the traffic volume

[Figure: network of routers (BR, E); clusters and Netflow data pass through “Traffic Impact Prediction” to produce large disruptions]


Traffic Impact Prediction

  • Traffic weight

    • Per-prefix measurements from Netflow

    • 10% of prefixes account for 90% of traffic

  • Traffic weight of a cluster

    • The sum of “traffic weight” of the prefixes

  • Flag clusters with heavy traffic

    • A few large clusters have large traffic weight

    • Mostly session resets and hot-potato changes
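The traffic-weight computation can be sketched as a sum over the prefixes in each cluster. The per-prefix weights are assumed to be fractions of total traffic measured from Netflow. The cluster names, prefixes, weights, and the 5% flagging threshold are hypothetical.

```python
def cluster_traffic_weight(cluster_prefixes, prefix_weight):
    """Sum the per-prefix traffic weights; unmeasured prefixes count as zero."""
    return sum(prefix_weight.get(p, 0.0) for p in cluster_prefixes)


def flag_heavy_clusters(clusters, prefix_weight, threshold=0.05):
    """Return the names of clusters whose traffic weight exceeds the threshold."""
    return [name for name, prefixes in clusters.items()
            if cluster_traffic_weight(prefixes, prefix_weight) > threshold]


# Hypothetical per-prefix traffic fractions and event clusters.
prefix_weight = {"192.0.2.0/24": 0.30, "198.51.100.0/24": 0.10, "203.0.113.0/24": 0.001}
clusters = {
    "session-reset": ["192.0.2.0/24", "198.51.100.0/24"],
    "minor": ["203.0.113.0/24"],
}
heavy = flag_heavy_clusters(clusters, prefix_weight)
print(heavy)
```

Because a small share of prefixes carries most of the traffic, a cluster with few prefixes can still outweigh one with many, which is why the weighting uses traffic volume rather than prefix count.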


Conclusions

  • Network troubleshooting from the inside

    • Traffic, topology, and routing data

    • Easier to understand what’s going on

    • … though still challenging to collect/analyze data

  • Traffic measurement

    • SNMP, packet monitoring, and flow monitoring

  • Routing monitors

    • Track network state and identify anomalies

    • Intradomain monitor capturing LSAs

    • BGP monitor capturing BGP updates


Next Time: BGP Routing Table Size

  • Three papers

    • “On characterizing BGP routing table growth”

    • “An empirical study of router response to large BGP routing table load”

    • “A framework for interdomain route aggregation”

  • Review only of the first paper

    • Summary

    • Why accept

    • Why reject

    • Avenues for future work

  • Optional

    • Vannevar Bush on “As We May Think” (1945)

