IP Network Management

IP Network Management Richard Mortier Digital Communications II, CUCL

Overview • Introduction • Abstractions • IP network components • IP network management protocols • Pulling it all together • An alternative approach Digital Communications II, CUCL

Overview • Introduction • What’s it all about then? • Abstractions • IP network components • IP network management protocols • Pulling it all together • An alternative approach Digital Communications II, CUCL

What is network management? • One point-of-view: a large field full of acronyms • EMS, TMN, NE, CMIP, CMISE, OSS, AN.1, TL1, EML, FCAPS, ITU, ... • (Don’t ask me what all of those mean, I don’t care!) • From question.com: • In 1989, a random of the journalistic persuasion asked hacker Paul Boutin “What do you think will be the biggest problem in computing in the 90s?” Paul’s straight-faced response: “There are only 17,000 three-letter acronyms.” • We will ignore most of them  Digital Communications II, CUCL

What is network management? • Computer networks are considered to have three operating timescales • Data: packet forwarding [μs, ms] • Control: flows/connections [ secs, mins ] • Management: aggregates, networks [ hours, days ] • …so we’re concerned with “the network” rather than particular devices or protocols • Standardization is key! Digital Communications II, CUCL

Overview • Introduction • Abstractions • ISO FCAPS, TMN EMS, ATM • IP network components • IP network management protocols • Pulling it all together • An alternative approach Digital Communications II, CUCL

ISO FCAPS: functional separation • Fault • Recognize, isolate, correct, log faults • Configuration • Collect, store, track configurations • Accounting • Collect statistics, bill users, enforce quotas • Performance • Monitor trends, set thresholds, trigger alarms • Security • Identify, secure, manage risks Digital Communications II, CUCL

TMN EMS: administrative separation • Telecommunications Management Network • Element Management System • “...simple but elegant...” (!) • (my emphasis) • (often the two go together) • NEL: network elements (switches, transmission systems) • EML: element management (devices, links) • NML: network management (capacity, congestion) • SML: service management (SLAs, time-to-market) • BML: business management (RoI, market share, blah) Digital Communications II, CUCL

The B-ISDN reference model • Asynchronous Transfer Mode “cube” • See IAP lectures, maybe  • Plane management… • The whole network • …vs layer management • Specific layers • Topology • Configuration • Fault • Operations • Accounting • Performance management plane control plane user plane higher layers higher layers plane management layer management ATM adaptation layer ATM layer physical layer Digital Communications II, CUCL

Network management • Models of general communication networks • Tend to be quite abstract and exceedingly tedious! • Many practitioners still seem excited about OO programming, WIMP interfaces, etc • …probably because implementation is hard due to so many excessively long and complex standards! • My view: basic “need-to-know” requirements are • What should be happening? [ c ] • What is happening? [ f, p, a ] • What shouldn’t be happening? [ f, s ] • What will be happening? [ p, a ] Digital Communications II, CUCL

Network management • We’ll concentrate on IP networks • Still acronym city: ICMP, SNMP, MIB, RFC  • Sample size: 102 routers, 105 hosts • We’ll concentrate on the network core • Routers, not hosts • We’ll ignore “service management” • DNS, AD, file stores, etc Digital Communications II, CUCL

Overview • Introduction • Abstractions • IP network components • IP, networks, routers • IP network management protocols • Pulling it all together • An alternative approach Digital Communications II, CUCL

IP primer (you probably know all this) • Destination-routed packets – no connections • Time-to-live field: allow removal of looping packets • Routers forward packets based on routeing tables • Tables populated by routeing protocols • Routers and protocols operate independently • …although protocols aim to build consistent state • RFCs ~= standards • Often much looser semantics than e.g. ISO, ITU standards • Compare for example OSPF [RFC2327] and IS-IS [RFC1142, RFC1195], two link-state routeing protocols Digital Communications II, CUCL

So, how do you build an IP network? $1m? $2m? for a new, populated, backbone router! • Buy (lease) routers • Buy (lease) fibre • Connect them all together • Configure routers • Configure end-systems Wayleaves = $$$ Be a landowner! Correctly. For now. Mwuhahaha. Someone else’s can of worms. Digital Communications II, CUCL

Cisco CRS-1 Multi-shelf system Multiple router flavours A sample taxonomy • Core • OC-12 (622Mbps) and up (to OC-768 ~= 40Gbps) • Big, fat, fast, expensive • E.g. Cisco HFR, Juniper T-640 • HFR: 1.2Tbps each, interconnect up to 72 giving 92Tbps, start at $450k • Transit/Peering-facing • OC-3 and up, good GigE density • ACLs, full-on BGP, uRPF, accounting • Customer-facing • FR/ATM/… • Feature set as above, plus fancy queues, etc • Broadband aggregator • High scalability: sessions, ports, reconnections • Feature set as above • Customer-premises (CPE) • 100Mbps, maybe • NAT, DHCP, firewall, wireless, VoIP, … • Low cost, low-end, perhaps just software on a PC Digital Communications II, CUCL

Network design • Whose network? • ISPs, IXs, enterprise, campus • POPs, DCs • Many designs: flat, hierarchical, hybrids, multiple scales • Many constraints • Business • Backwards compatibility. Who to connect. Peering. • Technology • Power – directly (24x7 operation) and indirectly (cooling) • Port density vs. raw bandwidth • Software reliability • Hardware/software capability • Addressing schemes for scalability, summarization • Can’t run feature X with feature Y on vendor C in network size N • Connectivity/resiliency • “All core routers connect to at least 2 other core routers” • “All edge routers connect to at least 2 core routers” Digital Communications II, CUCL

Router configuration • Initialization • Name the router, setup boot options, setup authentication options • Configure interfaces • Loopback, ethernet, fibre, ATM • Subnet/mask, filters, static routes • Shutdown (or not), queueing options, full/half duplex • Configure routeing protocols (OSPF, BGP, IS-IS, …) • Process number, addresses to accept routes from, networks to advertise • Access lists, filters, ... • Numeric id, permit/deny, subnet/mask, protocol, port • Route-maps, matching routes rather than data traffic • Other configuration aspects: traps, syslog, etc • (Oh, and switch configuration is about as painful) Digital Communications II, CUCL

Router configuration fragments hostname FOOBAR ! boot system flash slot0:a-boot-image.bin boot system flash bootflash: logging buffered 100000 debugging logging console informational aaa new-model aaa authentication login default tacacs local aaa authentication login consoleport none aaa authentication ppp default if-needed tacacs aaa authorization network tacacs ! ip tftp source-interface Loopback0 no ip domain-lookup ip name-server 10.34.56.78 ! ip multicast-routing ip dvmrp route-limit 7000 ip cef distributed interface Loopback0 description router-1.network.corp.com ip address 10.65.21.43 255.255.255.255 ! interface FastEthernet0/0/0 description Link to New York ip address 10.65.43.21 255.255.255.128 ip access-group 175 in ip helper-address 10.65.12.34 ip pim sparse-mode ip cgmp ip dvmrp accept-filter 98 neighbor-list 99 full-duplex ! interface FastEthernet4/0/0 no ip address ip access-group 183 in ip pim sparse-mode ip cgmp shutdown full-duplex router ospf 2 log-adjacency-changes passive-interface FastEthernet0/0/0 passive-interface FastEthernet0/1/0 passive-interface FastEthernet1/0/0 passive-interface FastEthernet1/1/0 passive-interface FastEthernet2/0/0 passive-interface FastEthernet2/1/0 passive-interface FastEthernet3/0/0 network 10.65.23.45 0.0.0.255 area 1.0.0.0 network 10.65.34.56 0.0.0.255 area 1.0.0.0 network 10.65.43.0 0.0.0.127 area 1.0.0.0 access-list 24 remark Mcast ACL access-list 24 permit 239.255.255.254 access-list 24 permit 224.0.1.111 access-list 24 permit 239.192.0.0 0.3.255.255 access-list 24 permit 232.192.0.0 0.3.255.255 access-list 24 permit 224.0.0.0 0.0.0.255 access-list 1011 deny 0000.0000.0000 ffff.ffff.ffff ffff.ffff.ffff 0000.0000.0000 0xD1 2 eq 0x42 access-list 1011 permit 0000.0000.0000 ffff.ffff.ffff 0000.0000.0000 ffff.ffff.ffff tftp-server slot1:some-other-image.bin tacacs-server host 10.65.0.2 tacacs-server key xxxxxxxx rmon event 1 trap Trap1 description "CPU Utilization>75%" owner config rmon event 2 trap Trap2 description "CPU Utilization>95%" owner config Digital Communications II, CUCL

Router configuration • Lots of large, fragile text files • 00s/000s routers, 00s/000s lines per config • Errors are hard to find and have non-obvious results • Router configuration also editable on-line • Order matters! • How to keep track of them all? • Naming schemes, directory trees, CVS, ssh upload and atomic commit to router • Perhaps even a proper database • State of the art is pretty basic • Few tools to check consistency, design goals • Generally generate configurations from templates and have human-intensive process to control access to running configs • Topic of current research [Feamster et al] This counts as advanced! Digital Communications II, CUCL

Overview • Introduction • Abstractions • IP network components • IP network management protocols • ICMP, SNMP, NetFlow • Pulling it all together • An alternative approach Digital Communications II, CUCL

ICMP • Internet Control Message Protocol [RFC792] • IP protocol #1 • In-band “control” • Variety of message types • echo/echo reply [ PING (packet internet groper) ] • time exceeded [ TRACEROUTE ] • destination unreachable, redirect • source quench Digital Communications II, CUCL

Ping (Packet INternet Groper) • Test for liveness • …also used to measure (round-trip) latency • Send ICMP echo • Valid IP host [RFC1122, RFC1123] must reply with ICMP echo response • Subnet PING? • Useful but often not available/deprecated • “ACK” implosion could be a problem • RFCs ~= standards Digital Communications II, CUCL

Traceroute • Which route do my packets take to their destination? • Send UDP packets with increasing time-to-live values • Compliant IP host must respond with ICMP “time exceeded” • Triggers each host along path to so respond • Not quite that simple • One router, many IP addresses: which source address? • Router control processor, inbound or outbound interface • Asymmetric routes (return path != outbound path) • Routes change • Do we want full-mesh host-host routes anyway?! • Size of data set, amount of probe traffic • This is topology, what about load on links? Digital Communications II, CUCL

SNMP • Protocol to manage information tables at devices • Provides get, set, trap, notify operations • get, set: read, write values • trap: signal a condition (e.g. threshold exceeded) • notify: reliable trap • Complexity mostly in the MIB design • Some standard tables, but many vendor specific • Non-critical, so often tables populated incorrectly • Many tens of MIBs (thousands of lines) per device • Different versions, different data, different semantics • Yet another configuration tracking problem • Inter-relationships between MIBs Digital Communications II, CUCL

IPFIX • IETF working group • Export of flow based data out of IP network devices • Developing suitable protocol from Cisco NetFlow™ v9 • [RFC3954, RFC3955] • Statistics reporting • Setup template • Send data records matching template • Many variables • Packet/flow counters, rule matches, quite flexible Digital Communications II, CUCL

Overview • Introduction • Abstractions • IP network components • IP network management protocols • Pulling it all together • Network mapping, statistics gathering, control • An alternative approach Digital Communications II, CUCL

An hypothetical NMS • GUI around ICMP (ping, traceroute), SNMP, etc • Recursive host discovery • Broadcast ping, ARP, default gateway: start somewhere • Recursively SNMP query for known hosts/connected networks • Ping known hosts to test liveness • Iterate • Display topology: allow “drill-down” to particular devices • Configure and monitor known devices • Trap, Netflow™, syslog message destinations • Counter thresholds, CPU utilization threshold, fault reporting • Particular faults or fault patterns • Interface statistics and graphs (MRTG) Digital Communications II, CUCL

NOC, NOC. Calling AT&T… Digital Communications II, CUCL

What are they all looking at? http://www.stat.ee.ethz.ch/mrtg/ Digital Communications II, CUCL

An hypothetical NMS • All very straightforward? No, not really • A lot of software engineering: corner cases, traceroute interpretation, NATs, etc • Correctness • MIBs may contain rubbish • Can only view inside your network anyway • Tunnelled, encrypted protocols becoming prevalent • Efficiency • Rate pacing discovery traffic: ping implosion/explosion • SNMP overloading router CPUs • Using NMSs also not straightforward • How to setup “correct” thresholds? • How to decide when something “bad” has happened? • How to present (or even interpret) reams and reams of data? Digital Communications II, CUCL

Overview • Introduction • Abstractions • IP network components • IP network management protocols • Pulling it all together • An alternative approach • From the edges… Digital Communications II, CUCL

Anemone • An endsystem network management platform • Collect flow information from endsystems, and • Combine with topology information from routeing protocols • Endsystems have more information about their traffic network devices • No router support required • A platform to support many applications • Currently concentrating on “managed networks” • E.g. governments, enterprises, etc • High complexity, high value • High degree of endsystem control Digital Communications II, CUCL

Applications • Real-time and historical analysis • Current topology + ingress, egress flows gives global picture of network behaviour • Capacity planning, anomaly detection • Modelling “what if” scenarios • Plug into a simulator back-end • “What happens to the network if we move all our Exchange servers to a single data centre?” • Automatic configuration • Close the loop: enable network to meet dynamic SLAs • Reconfigure network to track e.g. time of day load changes Digital Communications II, CUCL

Challenges • How do you instrument an entire network? Do you need to instrument all endsystems? • What network information should be captured and stored? • How do you access data stores distributed across a large network (ca. 300,000 nodes)? 1% coverage gives 99.999% bytes and flows Flow data augmented with application and user context Use distributed query system Digital Communications II, CUCL

Summary • Introduction • What is network management? • Abstractions • ISO FCAPS, TMN EMS, ATM • IP network components • IP, networks, routers • IP network management protocols • ICMP, SNMP, etc • Pulling it all together • Outline of a network management system • An alternative approach: from the edges Digital Communications II, CUCL

The end • Questions • Answers? • http://www.cisco.com/ • http://www.routergod.com/ • http://www.ietf.org/ • http://ipmon.sprintlabs.com/pyrt/ • http://www.nanog.org/ Digital Communications II, CUCL

Backup slides • Internet routeing • OSPF • BGP Digital Communications II, CUCL

Internet routeing • Q: how to get a packet from node to destination? • A1: advertise all reachable destinations and apply a consistent cost function (distance vector) • A2: learn network topology and compute consistent shortest paths (link state) • Each node (1) discovers and advertises adjacencies; (2) builds link state database; (3) computes shortest paths • A1, A2: Forward to next-hop using longest-prefix-match Digital Communications II, CUCL

OSPF (~link state routeing) • Q: how to route given packet from any node to destination? • A: learn network topology; compute shortest paths • For each node • Discover adjacencies (~immediate neighbours); advertise • Build link state database (~network topology) • Compute shortest paths to all destination prefixes • Forward to next-hop using longest-prefix-match (~most specific route) Digital Communications II, CUCL

BGP (~path vector routeing) • Q: how to route given packet from any node to destination? • A: neighbours tell you destinations they can reach; pick cheapest option • For each node • Receive (destination, cost, next-hop) for all destinations known to neighbour • Longest-prefix-match among next-hops for given destination • Advertise selected (destination, cost+, next-hop') for all known destinations • Selection process is complicated • Routes can be modified/hidden at all three stages • General mechanism for application of policy Digital Communications II, CUCL

Digital Communications II, CUCL

Anemone: where are we now? • Three major components • Flow collection • Route collection • Distributed database • Building prototypes, simulating system Digital Communications II, CUCL

Data collection • Flow collection • Hosts track active flows • Using low overhead event posting infrastructure, ETW • Built prototype device driver provider & user-space consumer • Used packet traces for feasibility study on (client, server) • Peaks at (165, 5667) live and (39, 567) active flows per sec • Route collection • OSPF is link-state: passively collect link state adverts • Extension of my work at Sprint (for IS-IS and BGP); also been done at AT&T (NSDI’04 paper) Digital Communications II, CUCL

The distributed database • Logically contains • Traffic flow matrix (bandwidths), {srcs}×{dsts} • …each entry annotated with current route from src to dst • N.B. src/dst might be e.g. (IP end-point, application) • Large dynamic data set suggests aggregation • Related work • { distributed, continuous query, temporal } databases • Sensor networks • Potential starting points: Astrolabe or SDIMS (SIGCOMM’04) • Where/what/how much to aggregate? • Is data read- or write-dominated? • Which is more dynamic, flow or topology data? • Can the system successfully self-tune? Digital Communications II, CUCL

The distributed database • Construct traffic matrix from flow monitoring • Hosts can supply flows they source and sink • Only need a subset of this data to get complete traffic matrix • Construct topology from route collection • OSPF supplies topology → routes • Wish to be able to answer queries like • “Who are the top-10 traffic generators?” • Easy to aggregate, don’t care about topology • “What is the load on link l ?” • Can aggregate from hosts, but need to know routes • “What happens if we remove links {l…m} ?” • Interaction between traffic matrix, topology, even flow control Digital Communications II, CUCL

The distributed database • Building simulation model • OSPF data gives topology, event list, routes • Simple load model to start with (load ~ # subnets) • Precedence matrix (from SPF) reduces flow-data query set • Can we do as well/better than e.g. NetFlow? • Accuracy/coverage trade-off • How should we distribute the DB? • Just OSPF data? Just flow data? A mixture? • How many levels of aggregation? • How many nodes do queries touch? • What sort of API is suitable? • Example queries for sample applications Digital Communications II, CUCL

IP Network Management