An Annotation Layer for Network Management

George Porter, Arne Baste, David Chu, Dilip Joseph, Randy H. Katz. NetRads Retreat - June 2005

Goal of today’s talk • Snapshot of our thinking in this area • Several open research problems remain: appropriateness of piggybacking, effectiveness of distributed observation, etc. • Your feedback appreciated

Outline • Motivating example: Discovering and protecting network service performance during stress • PNEs as A-Layer building block • Overview: Annotation layer as provider of component building block for network management • Revisit network service example with A-Layer • Research challenges, open issues, opportunities


Motivating Example: Network service slowdown/failure [Figure: network diagram with labels Client, Dist Tier, Server tier, R, IC, IS, FTP, NFS, Web, DNS] • Problem: • Users in the access tier complain of slow web access, can’t mount files, and get “DNS operation timed out” messages • This problem started today at 10am • Where to begin? • Network connectivity between users and outside seems ok • But name resolution is intermittent and slow • We need tools to figure out who is affected, who isn’t affected, the cause, and a solution.

Motivating Example: Network service slowdown/failure • Network connectivity to DNS? [ping, traceroute] • Are DNS requests making it to the server tier? • What is happening to the request completion rate (is it lower)? Vs. network path losses (i.e., is it the path or the service?) • DNS server CPU level is up • Localize the problem: • Only this user? Or other clients? • Just that server? What is happening to the DNS req/reply completion rate of other servers in that cluster? Correlations? Is this user anomalous? • So far: DNS overloaded, leading to timeouts on client end

[Figure: network diagram with labels Client, Dist Tier, Server tier, R, IC, IS, II, FTP, NFS, Web, SMTP, DNS] • Why is the service overloaded? • Is there an unusual number of requests from other sources? [deviation from the mean] • What is the status of requests to this service network-wide? How has it changed since before the first reports of the problem? • We discover that the number of DNS requests from access and ISP networks is unchanged (must be in server tier) • Other correlations? Yes, to SMTP traffic at ISP ingress • We suspect the endpoint of SMTP traffic, a spam appliance, as the cause of DNS performance loss • No unusual surges of DNS from access or ISP (from outside our enterprise network) • Thus originating inside the server tier • And correlated to SMTP traffic
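
The "deviation from the mean" check on this slide can be sketched as a simple z-score test. This is only an illustration: the function names, the threshold of 3 standard deviations, and the request-rate numbers are assumptions, not something the talk specifies.

```python
from statistics import mean, stdev

def deviation_from_mean(history, current):
    """Signed number of standard deviations between `current` and the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        if current == mu:
            return 0.0
        return float("inf") if current > mu else float("-inf")
    return (current - mu) / sigma

def is_unusual(history, current, threshold=3.0):
    """Flag a sample that lies more than `threshold` deviations from the mean (assumed cutoff)."""
    return abs(deviation_from_mean(history, current)) > threshold

# Hypothetical DNS requests/sec seen from two vantage points before the incident.
access_history = [100, 104, 98, 101, 97, 102, 99, 103]
server_tier_history = [20, 22, 19, 21, 20, 18, 21, 19]

print(is_unusual(access_history, 101))      # access-tier rate is unchanged
print(is_unusual(server_tier_history, 95))  # surge originating inside the server tier
```

With these made-up numbers the access-tier sample stays within normal variation while the server-tier sample is flagged, which mirrors the reasoning on the slide.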

• Eliminate false positives: testing this conjecture via experimental intervention • Temporarily b/w throttle SMTP traffic from ISP ingress • Test DNS latency from access network • Find that DNS latency goes down when SMTP volume goes down • We enact a new (but temporary) policy: • Redirect requests from access tier to secondary or tertiary DNS server (service separation for different users) • BW regulate SMTP traffic to keep DNS server CPU load from peaking • Access users’ service restored--their traffic is protected. • Problem localized and mitigated • Long-term solution: software upgrade, firmware upgrade, add dedicated DNS cache for appliance

Example Review • Localizing and identifying problem required • Network-wide visibility despite stressed links/servers • Path information (network connectivity, protocol request/reply completion information) • Finding changes in behavior (avg # requests/unit time, rate of change of traffic) • Finding correlations between traffic (traffic classes, volume, network level paths) • Experimental intervention (correlation to causation) • Enabling new policy (redirecting traffic to secondary server, BW throttling/fencing misbehaving flows)

Principles for network management • Network-wide visibility despite surges/overload/high loss rates • Low overhead • Path statistics gathering • Some protocol visibility (TCP, IP, services like DNS, NFS) • Need to discover • Changes to request-reply rate, completions, latency over time • Correlations between different flows, protocols, parts of the network • New policies (actions) • For experimental intervention (root cause discovery) • To protect good traffic • BW shaping, blocking, scheduling, fencing, selective drop • Security • Against non-operators using this infrastructure • Against DoS attacks

PNEs (Programmable Network Elements) and iBoxes • Inspection-and-action points • Deep, multiprotocol, packet inspection • No routing, just observation and marking • Actions: Selective drop, b/w fencing and shaping, notification of operators, query “points of observation” • Some protocol visibility to TCP, UDP, ‘good’ network service protocols like DNS/NFS • Per-flow session state and reverse path visibility • Per-flow and per-path simple statistics gathering (latencies, round trip times, requests/sec, address source and destinations)
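
One way to picture the per-flow statistics gathering listed above is a small flow table that matches replies to requests. This is a minimal sketch under assumptions: the class names, the flow key, and the use of a request id to pair requests with replies are all hypothetical and not taken from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class FlowStats:
    requests: int = 0
    replies: int = 0
    latencies: list = field(default_factory=list)
    pending: dict = field(default_factory=dict)  # request id -> time the request was seen

    def completion_rate(self):
        return self.replies / self.requests if self.requests else 0.0

    def mean_latency(self):
        return sum(self.latencies) / len(self.latencies) if self.latencies else None

class FlowTable:
    """Hypothetical per-flow session state kept by an iBox."""

    def __init__(self):
        self.flows = defaultdict(FlowStats)

    def on_request(self, flow, req_id, now):
        stats = self.flows[flow]
        stats.requests += 1
        stats.pending[req_id] = now

    def on_reply(self, flow, req_id, now):
        # Replies arrive on the reverse path; the caller passes the forward-direction key.
        stats = self.flows[flow]
        sent = stats.pending.pop(req_id, None)
        if sent is not None:
            stats.replies += 1
            stats.latencies.append(now - sent)

table = FlowTable()
dns_flow = ("10.0.0.5", "10.0.1.53", "udp", 53)  # made-up client, server, protocol, port
table.on_request(dns_flow, 1, now=0.00)
table.on_request(dns_flow, 2, now=0.10)
table.on_request(dns_flow, 3, now=0.20)   # this request never gets a reply
table.on_reply(dns_flow, 1, now=0.02)
table.on_reply(dns_flow, 2, now=0.14)
print(table.flows[dns_flow].completion_rate(), table.flows[dns_flow].mean_latency())
```

A falling completion rate with steady path statistics would point at the service rather than the path, which is the distinction the motivating example relies on.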

Annotation Layer • Explicit layer for iBox-to-iBox communication via packet annotations [Figure: chain of iBoxes; example annotation “url: X”] • Annotations: • Fixed size • Encoded to enable the de-annotation of packets • Multiple payload types based on any layer of the flow • Security field for authentication

A-Layer Annotation Design • Encode annotations in between IP and transport • Allow annotations to be stacked (multiple) • Annotations are removed by iBoxes before reaching endhosts • Motivation: start with large (but versatile) annotation format • When we discover the set of annotations that are most effective for network management, we can reduce the footprint to support that set
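
The design points above (fixed-size annotations, stacked between IP and transport, an authentication field, removable before the end host) can be sketched as a byte layout. The layout is an assumption made for illustration only: the 16-byte annotation, the 2-byte shim carrying the original IP protocol number and a count, the truncated HMAC tag, and the shared key are all hypothetical and not the talk's actual format.

```python
import hashlib
import hmac
import struct

# Hypothetical layout: type, flags, source iBox id, 8-byte payload, 4-byte auth tag.
ANNOTATION = struct.Struct("!BBH8s4s")
# Hypothetical shim placed right after the IP header: original protocol, annotation count.
SHIM = struct.Struct("!BB")
KEY = b"hypothetical-shared-key"

def make_annotation(kind, ibox_id, payload, key=KEY):
    body = struct.pack("!BBH8s", kind, 0, ibox_id, payload)  # payload is zero-padded to 8 bytes
    tag = hmac.new(key, body, hashlib.sha256).digest()[:4]
    return body + tag

def verify_annotation(raw, key=KEY):
    body, tag = raw[:-4], raw[-4:]
    expected = hmac.new(key, body, hashlib.sha256).digest()[:4]
    return hmac.compare_digest(tag, expected)

def annotate(ip_proto, transport_bytes, annotations):
    """Bytes that follow the IP header: shim, stacked annotations, then the transport segment."""
    return SHIM.pack(ip_proto, len(annotations)) + b"".join(annotations) + transport_bytes

def push(annotated, annotation):
    """Stack one more annotation on top of an already annotated packet."""
    proto, count = SHIM.unpack_from(annotated)
    return SHIM.pack(proto, count + 1) + annotation + annotated[SHIM.size:]

def de_annotate(annotated):
    """Strip the shim and all annotations, returning the original protocol and transport bytes."""
    proto, count = SHIM.unpack_from(annotated)
    start, end = SHIM.size, SHIM.size + count * ANNOTATION.size
    stack = [annotated[i:i + ANNOTATION.size] for i in range(start, end, ANNOTATION.size)]
    return proto, stack, annotated[end:]

pkt = annotate(17, b"UDPDATA", [make_annotation(1, 7, b"dns-slow")])
pkt = push(pkt, make_annotation(2, 9, b"rate"))
proto, stack, transport = de_annotate(pkt)
print(proto, len(stack), transport)
```

Recording the original protocol number in the shim is what makes de-annotation lossless in this sketch; the last iBox on the path can restore the packet exactly before it reaches the end host.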

Categories of annotations • SNMP proxy • Netflows • Alteon, Packeteer

iBox placement in an enterprise network [Figure: enterprise network with Internet Edge, Distribution Tier, Server Edge, and Access Edge; primary and secondary DNS servers, mail server, and spam appliance; labels R, S, II, IA, IS, A, B, C, D; hosts 10.0.0.1 ... 10.0.0.100 and 10.0.0.101 ... 10.0.0.255] • iBoxes at points of hierarchical division • These locations give iBoxes the ability to monitor and classify traffic flowing through them • iBoxes can also slow down, block, fence, and drop traffic to ease surges and protect “good” traffic from bad/ugly traffic

Routing to other iBoxes [Figure legend: “core” iBoxes and “edge” iBoxes] • Once we know which iBoxes exist, we need to know how to reach them so we can send them annotations • Requires building up this table at each iBox • Topology dependent • If a packet’s destination address doesn’t match an iBox in this table, we remove all annotations to ensure endhost correctness
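
The table lookup and the "strip annotations if no iBox matches" rule can be sketched as follows. The prefix-to-iBox mapping, the longest-prefix-match choice, and all names here are assumptions for illustration; the talk only says that each iBox builds a topology-dependent table.

```python
import ipaddress

class IBoxTable:
    """Hypothetical table mapping destination prefixes to the iBox that fronts them."""

    def __init__(self):
        self.entries = []  # list of (network, iBox name)

    def add(self, prefix, ibox):
        self.entries.append((ipaddress.ip_network(prefix), ibox))

    def lookup(self, dst):
        addr = ipaddress.ip_address(dst)
        matches = [(net, ibox) for net, ibox in self.entries if addr in net]
        if not matches:
            return None
        # Assumed policy: prefer the most specific matching prefix.
        return max(matches, key=lambda m: m[0].prefixlen)[1]

def forward(table, dst, annotations):
    """Return (downstream iBox, annotations to keep); strip everything if no iBox is on the path."""
    ibox = table.lookup(dst)
    if ibox is None:
        return None, []  # no iBox will remove annotations later, so remove them now
    return ibox, annotations

table = IBoxTable()
table.add("10.0.0.0/25", "C")    # made-up prefixes and iBox names
table.add("10.0.0.128/25", "D")
print(forward(table, "10.0.0.200", ["dns-report"]))
print(forward(table, "192.0.2.1", ["dns-report"]))
```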

Active vs Passive annotations [Figure: iBoxes labeled A, B, C, D, E] • When to send “active” annotations (i.e., a separate packet) vs when to passively annotate? • Available during high traffic (passive) vs expedient (active) • Associate timers with each queue • When packet arrives and an annotation is dequeued, we reset the timer • If the timer goes off, we generate a new dummy packet, annotate it, send it off to the right destination iBox, and reset the timer
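
The timer rule described above can be sketched with one queue per destination iBox and an explicit clock. The class, the dictionary packet representation, and the choice to send nothing when the timer fires on an empty queue are assumptions, not details given in the talk.

```python
from collections import deque

class AnnotationQueue:
    """Hypothetical per-destination queue of annotations waiting to be sent."""

    def __init__(self, dest_ibox, timeout, now=0.0):
        self.dest_ibox = dest_ibox
        self.timeout = timeout
        self.pending = deque()
        self.deadline = now + timeout

    def enqueue(self, annotation):
        self.pending.append(annotation)

    def on_packet(self, packet, now):
        """Passive path: piggyback one annotation on a data packet and reset the timer."""
        if self.pending:
            carried = packet.get("annotations", []) + [self.pending.popleft()]
            packet = dict(packet, annotations=carried)
            self.deadline = now + self.timeout
        return packet

    def on_tick(self, now):
        """Active path: if the timer expired, send a dummy packet carrying one annotation."""
        if now < self.deadline:
            return None
        self.deadline = now + self.timeout
        if not self.pending:
            return None  # assumed behaviour: nothing to report, so nothing is sent
        return {"dst": self.dest_ibox, "dummy": True, "annotations": [self.pending.popleft()]}

q = AnnotationQueue("C", timeout=1.0)
q.enqueue("a1")
q.enqueue("a2")
piggybacked = q.on_packet({"dst": "10.0.0.5"}, now=0.5)  # traffic available: passive
early = q.on_tick(now=1.0)                               # timer was reset to 1.5, nothing sent
dummy = q.on_tick(now=1.6)                               # no traffic arrived: active
print(piggybacked, early, dummy)
```

Under heavy traffic the passive path keeps resetting the timer, so no extra packets are generated; the dummy packet only appears when a path is idle.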

A-Layer as component building blocks for observe-analyse-act • Observe • Path statistics; req/reply completion rate, latency; new conn rate; connection age; protocol types/mixtures; their change over time • Analyse • Correlations; mean changing over time (chi-sq); PCA; experimental intervention (act, then observe) • Act • BW throttling, selective drop, packet scheduling, BW fencing
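
The slide mentions a chi-square test for detecting change over time without saying what it is applied to. As one possible reading, the sketch below applies a chi-square goodness-of-fit statistic to the protocol mixture seen in a time window against a baseline mixture. The protocol counts are made up, and 5.991 is the standard 5% critical value for 2 degrees of freedom.

```python
def chi_square(observed, baseline):
    """Chi-square statistic of observed per-protocol counts against a baseline mixture."""
    obs_total = sum(observed.values())
    base_total = sum(baseline.values())
    stat = 0.0
    for proto, base_count in baseline.items():
        expected = base_count / base_total * obs_total
        stat += (observed.get(proto, 0) - expected) ** 2 / expected
    return stat

CRITICAL_5PCT_DF2 = 5.991  # chi-square critical value, 2 degrees of freedom, alpha = 0.05

# Hypothetical packet counts per protocol at one iBox.
baseline = {"dns": 2000, "smtp": 1000, "web": 7000}
normal_window = {"dns": 200, "smtp": 100, "web": 700}
surge_window = {"dns": 200, "smtp": 400, "web": 400}

print(chi_square(normal_window, baseline) > CRITICAL_5PCT_DF2)
print(chi_square(surge_window, baseline) > CRITICAL_5PCT_DF2)
```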

Why distributed observe-analyse-act? • Centralized • More control, consistent information (but could be out of date) • Centralize policy (no need to cast policy over multiple nodes) • Distributed routing preferred over centralized approach • Similar motivation for iBoxes/A-Layer • Distributed • Quick distribution of information • Need for information throughout the network • Works during network partitions, provides visibility during surges when it is hard to get packets through • Up-to-date info, but might be inconsistent • But consistency is hard; could start bad feedback loops; need to elect leader

• Path-oriented connectivity and reachability • Network service monitoring • Are requests getting through? What is their rate? What has been happening to the DNS latency? Where are “DNS hotspots”? • iBoxes can store characteristics of paths through the network • Types of protocols they see, volume of protocols, rate of change of traffic, distribution of source/destination addresses seen, network errors, topology information • NetFlows as statistics gathering at a single point • Extract and share reports from this information • Annotate packets with iBox Source annotation to have access to inside-vs-outside/paths chosen and paths taken • Annotate packets with service reachability reports, link conditions, traffic rates and changes of traffic rates • Annotate packets with protocol reports that represent the mixture of protocols seen at various points throughout the network

• Relationship between traffic classes, correlations, anomalies • Discovering anomalies: iBoxes consuming annotations from other parts of the network need to be able to discover when good services lose performance • SLT problem of anomaly detection made easier with more information and visibility • Network data stored in vector form for rate, quantity, time domain • Discovering correlations: For good services that are degrading, finding correlations to anomalous traffic surges, flash traffic, etc. provides hints to cause of problem • Each iBox representing affected traffic needs annotations containing network-wide events capturing changes in traffic patterns • “Analysis” components of observe-analyse-act done from multiple network vantage points or centralized?
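
Finding which traffic class tracks a degrading service can be sketched with a plain Pearson correlation over time series stored as vectors. The talk does not name a specific correlation measure, so Pearson is an assumption here, and the series values and names are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (sx * sy)

def rank_correlations(target, candidates):
    """Sort candidate series by the strength of their correlation with the target."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical per-interval measurements reported through annotations.
dns_latency_ms = [20, 22, 60, 95, 30, 21]
candidates = {
    "smtp_at_isp_ingress": [5, 6, 40, 70, 12, 5],
    "web_from_access": [50, 48, 52, 49, 51, 50],
}
ranked = rank_correlations(dns_latency_ms, candidates)
print(ranked[0][0])
```

A strong correlation is only a hint at the cause, which is why the example follows it with an experimental intervention.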

• Experimental Intervention, protection of good traffic via policy actions • Experimental intervention: • Control annotations sent to iBox near source of surge to temporarily throttle • Annotations routed to iBox at ISP ingress to invoke new policy • The policy in the annotation relies on iBox actions of BW shaping, fencing, and TCP ack manipulation to reduce SMTP flow rate • Protection of good traffic: • Policy could include network-level redirection to channel good DNS requests from access networks to a secondary, backup DNS service • Marking traffic not affiliated with surge for protection elsewhere in the network closer to the service location
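
The BW shaping action used to throttle SMTP can be illustrated with a token bucket. The talk does not say which shaping mechanism iBoxes use, so the token bucket is a stand-in technique, and the rate and burst values are hypothetical.

```python
class TokenBucket:
    """Token-bucket rate limiter; an assumed stand-in for the iBox BW shaping action."""

    def __init__(self, rate_bytes_per_s, burst_bytes, now=0.0):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = now

    def allow(self, size, now):
        """Return True if a packet of `size` bytes may pass at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # caller may drop or queue the packet

# Hypothetical policy carried in a control annotation: limit SMTP to 1000 bytes/sec.
smtp_limit = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
print(smtp_limit.allow(1500, now=0.0))  # burst allowance covers the first packet
print(smtp_limit.allow(1500, now=0.5))  # only 500 bytes of tokens have accumulated
print(smtp_limit.allow(1500, now=1.5))  # bucket has refilled
```

Because the limiter is easy to install and remove, it suits the temporary, experimental use described above: throttle, observe DNS latency, then lift or keep the policy.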

Policy expression and deployment • When correlations discovered, what to do with them? • Initial efforts are to provide observation platform for visualization of network state • A-Layer/iBoxes as building blocks for operator interaction

“Above the network” services • Right now we envision iBoxes understanding well known network services • Open question as to visibility to higher level applications like web services, enterprise-specific apps • New policy complexity, new correlations and state management needed

Statistical visualization for operators • Open problem to aggregate distributed observations into coherent visualization for operators • Where does the visualization reside? • What are the right metrics/correlations/deviations from mean that are relevant? • How do actions relate to visualization?

SLT analysis • Choice of algorithm • Finding “interesting” correlations • Not being overloaded with too many correlations and events • Deviation from mean, finding patterns, what is normal operation for a protocol?

Managing distributed actions • Managing feedback loops • Providing coherent actions at the global scale based on iBoxes distributed throughout the network • Coordinating actions despite network surges and limited network access, path losses, etc.

Q & A

Q: What about the e2e argument? • Adding/removing annotations: • Annotations easy to remove • Packet paths not modified • Actions such as throttling, scheduling, dropping • Con: affects traffic in ways endhosts can detect • Pro: Provides “library” of components to enable new network services / management features • That’s how we build software • A-Layer gives enterprise operators control over their networks • As long as their applications are supported and work • Enterprise networks usually have white list of allowed apps, all other disallowed • Contrast this to ISPs

Q: What about per-flow state management? • Some routers can keep per-flow state (Netflows) • iBoxes can sample traffic • iBoxes not in correctness path--can act as ‘nops’ • Network traffic parallelizable, targeting 1 GigE • Can be merged into expandable network devices (see Cisco’s server cards that plug into routers)

Q: What about e2e security (IPsec)? • E2e security obscures protocol, but not path stats • Conceivable to discover request/response phases, infer completion rate; keep stats on # connections, flow rates • Statistically infer when a flow is starved for bandwidth; observe bandwidth over time; correlate with destination/source function (web server, mail server, etc.) • Correlations still work over encrypted traffic • Can still perform experiments by affecting flow X, observing flow Y

Q: Why annotate? (Why not send separate packets?) • Annotations are about path characteristics • Can bind to the flow they describe • Statistics follow paths where they are the most relevant • Marries per-path context with each packet of a particular flow (gives iBoxes info they need to throttle, fence, etc.) • As packet flow rate increases, more opportunity for visibility by piggybacking • Lower overhead during times of stress • Possible preference for fewer large packets over more small packets • Explicit sending of separate packets still ok • Especially for discovery, control, and policy distribution

Q: Why distributed? • Centralized statistics gathering easy in enterprise networks • But hard during times of stress/traffic spikes/flash traffic • Information might be needed in more than one place • “Act” operations to protect good traffic need timely info • Contrast to 5-min avgs common in SNMP • Raises difficulty, though • Election protocols, distributed consensus, negative feedback loops, management of iBoxes • Let’s experiment and see • Open research question as to benefit of distributed vs centralized network observation, analysis, and action/actuation
