
Internet Routing (COS 598A) Today: Detecting Anomalies Inside an AS

Jennifer Rexford

http://www.cs.princeton.edu/~jrex/teaching/spring2005

Tuesdays/Thursdays 11:00am-12:20pm



Outline

  • Traffic

    • SNMP link statistics

    • Packet and flow monitoring

  • Network topology

    • IP routers and links

    • Fault data, layer-2 topology, and configuration

    • Intradomain route monitoring

  • Interdomain routes

    • BGP route monitoring

    • Analysis of BGP update data

  • Conclusions



Why is Traffic Measurement Important?

  • Billing the customer

    • Measure usage on links to/from customers

    • Applying billing model to generate a bill

  • Traffic engineering and capacity planning

    • Measure the traffic matrix (i.e., offered load)

    • Tune routing protocol or add new capacity

  • Denial-of-service attack detection

    • Identify anomalies in the traffic

    • Configure routers to block the offending traffic

  • Analyze application-level issues

    • Evaluate benefits of deploying a Web caching proxy

    • Quantify fraction of traffic that is P2P file sharing



Collecting Traffic Data: SNMP

  • Simple Network Management Protocol

    • Standard Management Information Base (MIB)

    • Protocol for querying the MIBs

  • Advantage: ubiquitous

    • Supported on all networking equipment

    • Multiple products for polling and analyzing data

  • Disadvantages: dumb

    • Coarse granularity of the measurement data

      • E.g., number of bytes/packets per interface per 5 minutes

    • Cannot express complex queries on the data

    • Unreliable delivery of the data using UDP
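
Even these coarse counters support the uses above. Below is a minimal sketch of the utilization calculation, assuming a hypothetical poll_octets() helper that returns the interface's ifInOctets counter (e.g., via an SNMP GET); it is an illustration, not a production poller.

```python
# Sketch: estimate link utilization over one polling interval from two
# samples of a 32-bit SNMP byte counter (IF-MIB::ifInOctets).
import time

COUNTER_MAX = 2**32  # standard 32-bit SNMP counters wrap at 2^32


def link_utilization(poll_octets, capacity_bps, interval_s=300):
    """Return the fraction of link capacity used during one interval.

    poll_octets: hypothetical callable returning the current ifInOctets value.
    """
    first = poll_octets()
    time.sleep(interval_s)
    second = poll_octets()
    delta_bytes = (second - first) % COUNTER_MAX  # tolerate one counter wrap
    return (delta_bytes * 8) / (interval_s * capacity_bps)
```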



Collecting Traffic Data: Packet Monitoring

  • Packet monitoring

    • Passively collecting IP packets on a link

    • Recording IP, TCP/UDP, or application-layer traces

  • Advantages: details

    • Fine-grain timing information

      • E.g., can analyze the burstiness of the traffic

    • Fine-grain packet contents

      • Addresses, port numbers, TCP flags, URLs, etc.

  • Disadvantages: overhead

    • Hard to keep up with high-speed links

    • Often requires a separate monitoring device



Collecting Traffic Data: Flow Statistics

  • Flow monitoring (e.g., Cisco Netflow)

    • Statistics about groups of related packets (e.g., same IP/TCP headers and close in time)

    • Recording header information, counts, and time

  • Advantages: detail with less overhead

    • Almost as good as packet monitoring, except no fine-grain timing information or packet contents

    • Often implemented directly on the interface card

  • Disadvantages: trade-off detail and overhead

    • Less detail than packet monitoring

    • Less ubiquitous than SNMP statistics
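
A toy version of the flow abstraction is sketched below, assuming packets arrive as time-ordered (timestamp, src, dst, sport, dport, proto, size) records; real flow export such as NetFlow runs on the line card, but the grouping idea is the same.

```python
# Sketch: aggregate packets into flows keyed by the 5-tuple, exporting a
# flow after a period of inactivity (simplified NetFlow-style logic).
INACTIVE_TIMEOUT = 15.0  # seconds of silence before a flow is exported (assumed value)


def packets_to_flows(packets):
    """packets: time-ordered (ts, src, dst, sport, dport, proto, size) records."""
    active = {}    # 5-tuple -> [first_ts, last_ts, packet_count, byte_count]
    exported = []
    for ts, src, dst, sport, dport, proto, size in packets:
        # export flows that have been idle too long
        for key, rec in list(active.items()):
            if ts - rec[1] > INACTIVE_TIMEOUT:
                exported.append((key, rec))
                del active[key]
        key = (src, dst, sport, dport, proto)
        if key in active:
            rec = active[key]
            rec[1], rec[2], rec[3] = ts, rec[2] + 1, rec[3] + size
        else:
            active[key] = [ts, ts, 1, size]
    exported.extend(active.items())  # flush remaining flows at end of trace
    return exported
```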



Using the Traffic Data in Network Operations

  • SNMP byte/packet counts: everywhere

    • Tracking link utilizations and detecting anomalies

    • Generating bills for traffic on customer links

    • Inference of the offered load (i.e., traffic matrix)

  • Packet monitoring: selected locations

    • Analyzing the small time-scale behavior of traffic

    • Troubleshooting specific problems on demand

  • Flow monitoring: selective, e.g., network edge

    • Tracking the application mix

    • Direct computation of the traffic matrix

    • Input to denial-of-service attack detection
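
For example, flow records collected at the network edge allow direct computation of the traffic matrix. A minimal sketch, assuming each record carries (ingress router, egress router, byte count); in practice the egress point is inferred from the destination prefix and the routing data.

```python
from collections import defaultdict


def traffic_matrix(flow_records):
    """Sum bytes per (ingress, egress) pair to form the offered-load matrix."""
    matrix = defaultdict(int)
    for ingress, egress, byte_count in flow_records:
        matrix[(ingress, egress)] += byte_count
    return matrix
```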



Network Topology



IP Topology

  • Topology information

    • Routers

    • Links, and their capacities

      • Internal links inside the AS

      • Edge links connecting to neighboring domains

  • Ways to learn the topology

    • Inventory database

    • SNMP polling/traps

    • Traceroute

    • Route monitoring

    • Router configuration data



Below IP

  • Layer-2 paths

    • ATM virtual circuits

    • Frame Relay virtual circuits

  • Mapping to lower layers

    • Specific fibers

    • Shared optical amplifiers

    • Shared conduits

    • Physical length (propagation delay)

  • Information not visible to IP

    • Stored in an inventory database

    • Not necessarily generated/updated automatically



Intradomain Monitoring: OSPF Protocol

  • Link-state protocol

    • Routers flood Link State Advertisements (LSAs)

    • Routers compute shortest paths based on weights

    • Routers identify next-hop to reach other routers

[Figure: example topology in which routers flood LSAs and compute shortest paths from the configured link weights (weights between 1 and 5)]
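
To make the shortest-path step concrete, here is a minimal Dijkstra sketch of what each router computes from the flooded link weights; the topology and weights below are illustrative, not the figure's exact values.

```python
import heapq


def ospf_shortest_paths(graph, source):
    """Dijkstra over a weighted graph {node: {neighbor: weight}} -> distance map."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


# Illustrative four-router topology with symmetric weights
topology = {
    "A": {"B": 2, "C": 1},
    "B": {"A": 2, "C": 3, "D": 1},
    "C": {"A": 1, "B": 3, "D": 5},
    "D": {"B": 1, "C": 5},
}
print(ospf_shortest_paths(topology, "A"))  # {'A': 0, 'B': 2, 'C': 1, 'D': 3}
```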



Intradomain Route Monitoring

  • Construct continuous view of topology

    • Detect when equipment goes up or down

    • Input to traffic-engineering and planning tools

  • Detect routing anomalies

    • Identify failures, LSA storms, and route flaps

    • Verify that LSA load matches expectations

    • Flag strange weight settings as misconfigurations

  • Analyze convergence delay

    • Monitor LSAs in multiple locations

    • Compare the times when LSAs arrive

  • Detect router implementation mistakes



Passive Collection of LSAs

  • OSPF is a flooding protocol

    • Every LSA sent on every participating link

    • Very helpful for simplifying the monitor

  • Can participate in the protocol

    • Shared media (e.g., Ethernet)

      • Join multicast group and listen to LSAs

    • Point-to-point links

      • Establish an adjacency with a router

  • … or passively monitor packets on a link

    • Tap a link and capture the OSPF packets



Reducing the Volume of Information

  • Prioritizing the messages

    • Router failure over router recovery

    • Link failure or weight change over a refresh

    • Informational messages about weight settings

  • Grouping related messages

    • Link failure: group messages for the two ends

    • Router failure: group the affected links

    • Common failure: group links failing close in time
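
A minimal sketch of one such grouping rule: LSAs reporting the same link down, whether from one endpoint or both, are folded into a single link-failure report when they arrive within a short window (the LSA tuple format and the window size are assumptions).

```python
GROUP_WINDOW = 5.0  # seconds within which the two endpoints' LSAs are merged (assumed)


def group_link_failures(lsas):
    """lsas: time-ordered (timestamp, router_id, neighbor_id) adjacency-down reports."""
    last_seen = {}   # undirected link -> timestamp of the first report in the group
    failures = []
    for ts, router, neighbor in lsas:
        link = frozenset((router, neighbor))        # same link seen from either end
        if link in last_seen and ts - last_seen[link] <= GROUP_WINDOW:
            continue                                # duplicate report of the same event
        last_seen[link] = ts
        failures.append((tuple(sorted(link)), ts))
    return failures
```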



Anomalies Found in the Shaikh04 Paper

  • Intermittent hardware problem

    • Router periodically losing OSPF adjacencies

    • Risk of network partition if 2nd failure occurred

  • External link flaps

    • Congestion on edge link causing lost messages

    • Lost adjacency leading to flapping routes

  • Configuration errors

    • Two routers assigned the same IP address

    • Inefficient config leading to duplicate LSAs

  • Vendor implementation bug

    • More frequent refreshing of LSAs than specified



Interdomain Route Monitoring



Motivation for BGP Monitoring

  • Visibility into external destinations

    • What neighboring ASes are telling you

    • How you are reaching external destinations

  • Detecting anomalies

    • Increases in number of destination prefixes

    • Lost reachability to some destinations

    • Route hijacking

    • Instability of the routes

  • Input to traffic-engineering tools

    • Knowing the current routes in the network

  • Workload for testing routers

    • Realistic message traces to play back to routers



BGP Monitoring: A Wish List

  • Ideally: knowing what the router knows

    • All externally-learned routes

    • Before policy has modified the attributes

    • Before a single best route is picked

  • How to achieve this

    • A special monitoring session on routers that reports everything they have learned

    • Packet monitoring on all links with BGP sessions

  • If you can’t do that, you could always do…

    • Periodic dumps of routing tables

    • BGP session to learn best route from router



Using Routers to Monitor BGP

  • Establish a “passive” BGP session from a workstation running BGP software (eBGP or iBGP)

    • (+) BGP table dumps do not burden operational routers

    • (+) Update dynamics captured

    • (+) Not restricted to interfaces provided by vendors

    • (-) Receives only best routes from the BGP neighbor

  • Talk to operational routers using SNMP or telnet at the command line

    • (+) Table dumps show all alternate routes

    • (-) BGP table dumps are expensive

    • (-) Update dynamics lost

    • (-) Restricted to interfaces provided by vendors



Collect BGP Data From Many Routers

[Figure: a route monitor collects BGP data from routers in cities across the backbone (Seattle, Chicago, New York, Washington D.C., Los Angeles, and others)]

Unlike OSPF, BGP is not a flooding protocol, so the monitor must collect routes from many routers to see the network-wide state.



Detecting Important Routing Changes

  • Large volume of BGP update messages

    • Around 2 million/day, and very bursty

    • Too much for an operator to manage

  • Identify important anomalies

    • Lost reachability

    • Persistent flapping

    • Large traffic shifts

  • Not the same as root-cause analysis

    • Identify changes and their effects

    • Focus on mitigation, rather than diagnosis

    • Diagnose causes if they occur in/near the AS


[Figure: BGP updates collected from the border routers (BR) feed a grouping step that turns raw updates into events]

Challenge #1: Excess Update Messages

  • A single routing change

    • Leads to multiple update messages

    • Affects routing decision at multiple routers

Group updates for a prefix with inter-arrival times under 70 seconds into a single event, and flag prefixes whose changes last longer than 10 minutes as persistent flapping prefixes.
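
A minimal sketch of this grouping rule for a single prefix, assuming a time-ordered list of that prefix's update timestamps.

```python
EVENT_TIMEOUT = 70.0    # seconds: updates closer together than this form one event
FLAP_THRESHOLD = 600.0  # seconds: events longer than this count as persistent flapping


def group_updates(timestamps):
    """Group one prefix's update timestamps into events; flag long (flapping) ones.

    Returns a list of (start, end, is_flapping) tuples.
    """
    events = []
    for ts in sorted(timestamps):
        if events and ts - events[-1][1] < EVENT_TIMEOUT:
            events[-1][1] = ts          # extend the current event
        else:
            events.append([ts, ts])     # start a new event
    return [(start, end, (end - start) > FLAP_THRESHOLD) for start, end in events]
```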


Determine “Event Timeout”

[Figure: cumulative distribution of BGP update inter-arrival times; for BGP beacon prefixes, 98% of inter-arrival times fall within 70 seconds, motivating the 70-second event timeout]


Event Duration: Persistent Flapping

[Figure: complementary cumulative distribution of event duration; about 0.1% of events last longer than 600 seconds, and these long events are flagged as persistent flapping]



Detecting Persistent Flapping

  • Significant persistent flapping

    • 15.2% of all BGP update messages

    • … though a small number of destination prefixes

    • Surprising, especially since flap damping is used

  • Types of persistent flapping

    • Conservative flap-damping parameters (78.6%)

    • Protocol oscillations, e.g., MED oscillation (18.3%)

    • Unstable interface or BGP session (3.0%)



Example: Unstable eBGP Session

  • Flap-damping parameters are session-based

  • Damping not implemented for iBGP sessions

[Figure: an unstable eBGP session between a customer and AT&T generates repeated updates for prefix p, which propagate to a peer]


[Figure: an event classification step turns events into “typed” events: no disruption, internal disruption, single external disruption, multiple external disruption, or loss/gain of reachability]

Challenge #2: Identify Important Events

  • Major concerns of network operators

    • Changes in reachability

    • Heavy load of routing messages on the routers

    • Flow of the traffic through the network

Classify events by the type of impact they have on the network
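
A rough sketch of such a classifier, assuming each event records, per border router, whether its traffic shifted internally or externally, plus a flag for lost reachability; the actual decision rules in the paper are more detailed.

```python
def classify_event(shifts, lost_reachability):
    """shifts: per border router, None (no change), "internal", or "external"."""
    if lost_reachability:
        return "loss/gain of reachability"
    external = [s for s in shifts if s == "external"]
    internal = [s for s in shifts if s == "internal"]
    if not external and not internal:
        return "no disruption"
    if not external:
        return "internal disruption"
    if len(external) == 1:
        return "single external disruption"
    return "multiple external disruption"
```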



Event Category – “No Disruption”

[Figure: AT&T border routers reach prefix p via AS1 and AS2; no traffic shift occurs at any border router]

“No Disruption”: each of the border routers has no traffic shift



Event Category – “Internal Disruption”

[Figure: AT&T border routers reach prefix p via AS1 and AS2; traffic shifts only along internal paths]

“Internal Disruption”: all of the traffic shifts are internal to the AS



Event Type: “Single External Disruption”

[Figure: AT&T border routers reach prefix p via AS1 and AS2; an external traffic shift moves traffic from one exit point to others]

“Single External Disruption”: traffic at one exit point shifts to other exit points



Statistics on Event Classification


[Figure: an event correlation step groups “typed” events into clusters]

Challenge #3: Multiple Destinations

  • A single routing change

    • Affects multiple destination prefixes

Group events of the same type that occur close in time
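
A minimal clustering sketch, assuming each typed event is a (timestamp, type, prefix) tuple; the 60-second window is an assumption, not the paper's parameter.

```python
CLUSTER_WINDOW = 60.0  # seconds; illustrative value


def cluster_events(events):
    """events: iterable of (timestamp, event_type, prefix). Returns per-type clusters."""
    clusters = {}  # event_type -> list of [start, end, set_of_prefixes]
    for ts, etype, prefix in sorted(events):
        group = clusters.setdefault(etype, [])
        if group and ts - group[-1][1] <= CLUSTER_WINDOW:
            group[-1][1] = ts            # extend the most recent cluster of this type
            group[-1][2].add(prefix)
        else:
            group.append([ts, ts, {prefix}])
    return clusters
```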



Main Causes of Large Clusters

  • External BGP session resets

    • Failure/recovery of external BGP session

    • E.g., session to another large tier-1 ISP

    • Caused “single external disruption” events

    • Validated by looking at syslog reports on routers

  • Hot-potato routing changes

    • Failure/recovery of an intradomain link

    • E.g., leads to changes in IGP path costs

    • Caused “internal disruption” events

    • Validated by looking at OSPF measurements


[Figure: a traffic impact prediction step, driven by Netflow data, distills clusters into a short list of large disruptions]

Challenge #4: Popularity of Destinations

  • Impact of event on traffic

    • Depends on the popularity of the destinations

Weight each group of destinations by its traffic volume, measured from Netflow data



Traffic Impact Prediction

  • Traffic weight

    • Per-prefix measurements from Netflow

    • 10% of prefixes account for 90% of the traffic

  • Traffic weight of a cluster

    • The sum of the “traffic weights” of its prefixes

  • Flag clusters with heavy traffic

    • A few large clusters have large traffic weight

    • Mostly session resets and hot-potato changes
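
A minimal sketch of the weighting step, assuming per-prefix byte counts derived from Netflow and an operator-chosen threshold; the names and threshold are illustrative.

```python
def heavy_clusters(clusters, prefix_traffic, threshold_bytes):
    """Flag clusters whose member prefixes together carry heavy traffic.

    clusters: mapping cluster_id -> iterable of prefixes in the cluster.
    prefix_traffic: mapping prefix -> bytes observed (e.g., from Netflow).
    """
    flagged = []
    for cluster_id, prefixes in clusters.items():
        weight = sum(prefix_traffic.get(p, 0) for p in prefixes)
        if weight >= threshold_bytes:
            flagged.append((cluster_id, weight))
    return sorted(flagged, key=lambda item: -item[1])
```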



Conclusions

  • Network troubleshooting from the inside

    • Traffic, topology, and routing data

    • Easier to understand what’s going on

    • … though still challenging to collect/analyze data

  • Traffic measurement

    • SNMP, packet monitoring, and flow monitoring

  • Routing monitors

    • Track network state and identify anomalies

    • Intradomain monitor capturing LSAs

    • BGP monitor capturing BGP updates



Next Time: BGP Routing Table Size

  • Three papers

    • “On characterizing BGP routing table growth”

    • “An empirical study of router response to large BGP routing table load”

    • “A framework for interdomain route aggregation”

  • Review required only for the first paper

    • Summary

    • Why accept

    • Why reject

    • Avenues for future work

  • Optional

    • Vannevar Bush, “As We May Think” (1945)

