NMS requirement/recommendations Belgrade, October 21 2009 Vidar Faltinsen, UNINETT

This talk reflects lessons learned over 19 years of NMS development at UNINETT, NTNU (the Norwegian University of Science and Technology) and other universities around Norway. Lessons learned from a number of committed people always aiming to improve network operations.

Our context • The network is complex • A lot of equipment • Heaps of traffic around the clock • No system is perfect • Errors will occur – incidents will hit us • Motto: be proactive and ahead • The user should not call you – you should be the first to know! • Keep in mind: If information is good… (posted at the right time, kept up to date)… …the user is (more) patient!

Avoid a monolithic NMS • Not an absolute rule, but be a sceptic • If the system is too massive it tends to set the agenda. • You should shape the system, not the other way around. • If too many resources must be invested in understanding the system… • …then even more resources must be put into adapting the system to your needs • The NMS has no intrinsic value… • …it should be a useful tool for you • But remember nothing is free – you must in any case invest in understanding what your tools actually do

Not one tool - a set of tools • Special-purpose tools with limited scope are good • Examples of tool categories: • inventory systems • trouble ticket systems • status monitors • measurements (and threshold monitors) • server/services focused • netflow analysis • security-focused • configuration tools • simulation • Tools should (ideally) not overlap • Have a well-defined single authority as the source for each of your data sets, i.e.: • the set of equipment (with attributes) we manage is defined in one place • similarly for our locations (with attributes), etc. • Autodetection is good • But in a controlled environment (be aware of weak SNMPv2 security)

Avoid complexity • A given tool should manage your whole domain • Avoid a hierarchy of managers if possible • SNMP polls can be done in parallel • Bandwidth is not a bottleneck • Throw ”iron” (CPU, memory, disk I/O, battery-backed disk controller) at NMS utilization problems • If necessary, move the database to a separate system, possibly also the web front end • …but consider redundancy (more later)
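The point that SNMP polls can be done in parallel can be sketched as follows. This is a minimal illustration, not a real poller: `poll_device` is a hypothetical stand-in that only simulates latency, and the device names and worker count are assumptions. A real implementation would replace the stub with an actual SNMP query.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def poll_device(host):
    """Hypothetical stand-in for one SNMP poll; replace with a real query."""
    time.sleep(0.05)  # simulate the network round trip
    return {"host": host, "reachable": True}


def poll_all(hosts, max_workers=32):
    """Poll all hosts concurrently and return the results keyed by host."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(poll_device, hosts)  # results come back in input order
        return {r["host"]: r for r in results}


if __name__ == "__main__":
    hosts = [f"switch-{i}" for i in range(100)]  # hypothetical device names
    start = time.monotonic()
    status = poll_all(hosts)
    elapsed = time.monotonic() - start
    print(len(status), "devices polled in", round(elapsed, 2), "s")
```

Because each poll mostly waits on the network, a single manager with a thread pool finishes far sooner than a sequential loop, which is one reason a hierarchy of managers is often unnecessary.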

Place your monitor strategically • A monitor placed in the periphery of your network is more likely to be cut off • place it in a central (network-wise) location • redundant network access (VRRP, HSRP…) • Redundant power, incl. a redundant source of power (UPS/ideally standby generator) • Monitor the monitor! • Use SMS for alarms in addition to email • Connect the SMS-sending device physically to the NMS
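One way to "monitor the monitor" is a heartbeat check run from an independent machine. The sketch below assumes the NMS records a timestamp at regular intervals and that a separate watchdog can read it; the function names and the 300-second limit are illustrative assumptions, not part of any particular product.

```python
def monitor_is_alive(last_heartbeat, now, max_age_seconds=300):
    """Return True if the NMS heartbeat is recent enough (times in seconds)."""
    return (now - last_heartbeat) <= max_age_seconds


def watchdog_message(last_heartbeat, now, max_age_seconds=300):
    """Return an alarm text if the heartbeat is stale, otherwise None."""
    if monitor_is_alive(last_heartbeat, now, max_age_seconds):
        return None
    age = int(now - last_heartbeat)
    return f"NMS heartbeat is {age} s old (limit {max_age_seconds} s)"
```

The watchdog should send its alarm over a path that does not depend on the NMS itself, for example its own SMS device.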

Classify your alarms • Think through: What are the most vital alarms? What is less important? • Make sure the most vital alarms actually reach you! • and do not drown in 10,000 other alarms… • or get stuck in an overloaded NMS… • Red and green lamps are good • in large environments, in a hierarchical display

Use a single event/alarm system • The set of tools/monitors you use should all report to one event/alarm system • e.g. using SNMP traps or email or… • The central event/alarm system should scale • coping with many events • make priorities / sort out important alarms • Correlate events – but be realistic • Detect ”in shadow” scenarios • Classify stateful alarms in pairs (down/up) • Suppress flapping alarms (line going up, down, up, down…) • Use hysteresis for threshold alarms. Set high and low thresholds. • Again: keep robustness. • Rather one alarm too many than missing an important one • Allow a flexible setup for alarm profiles • every person tends to have their own preferences… (but have a company policy) • alarms at night/weekend vs daytime • important alarms vs less important • alarms within vs outside the person’s scope of duty/responsibility
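Two of the ideas on this slide, hysteresis for threshold alarms and "in shadow" detection, can be sketched in a few lines. This is an illustration under assumptions: the class and function names, the threshold values, and the `parent_of` topology map (each device pointing to the next hop towards the NMS) are all hypothetical.

```python
class HysteresisAlarm:
    """Raise at the high threshold, clear only at the low threshold."""

    def __init__(self, high, low):
        if low >= high:
            raise ValueError("low threshold must be below high threshold")
        self.high = high
        self.low = low
        self.active = False

    def update(self, value):
        """Return 'raise', 'clear', or None for a new sample."""
        if not self.active and value >= self.high:
            self.active = True
            return "raise"
        if self.active and value <= self.low:
            self.active = False
            return "clear"
        return None  # no state change, so no new alarm


def in_shadow(device, parent_of, down):
    """True if a device upstream of `device` (towards the NMS) is also down."""
    seen = set()
    node = parent_of.get(device)
    while node is not None and node not in seen:
        if node in down:
            return True
        seen.add(node)  # guards against loops in the topology data
        node = parent_of.get(node)
    return False


if __name__ == "__main__":
    alarm = HysteresisAlarm(high=90, low=80)
    print([alarm.update(v) for v in [70, 85, 92, 88, 91, 79, 95]])
    # → [None, None, 'raise', None, None, 'clear', 'raise']

    parent_of = {"sw2": "sw1", "sw1": "gw", "gw": None}
    print(in_shadow("sw2", parent_of, down={"sw1", "sw2"}))  # → True
```

With a single threshold at 90, the samples 92, 88, 91 would have produced raise, clear, raise; the low threshold suppresses that flapping. A device in shadow is still recorded as unreachable, but its alarm can be given lower priority than the alarm for the device that actually failed.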

Redundant NMS • Single point of failure is never good • Complete redundancy is not realistic • Too expensive • Complexity may bite you • Three possible ways to go: • Monitor the monitor. Have a spare machine. Have backup. 24x7 guard on duty. Replace ASAP. • Do continuous live replication of the NMS machine to a hot spare. • Manually (with few steps) set the hot spare in operation (inherit the NMS IP address) • Use anycast combined with live replication • Secondary NMS automatically takes over when primary NMS dies

Without numbers you are nothing • When an incident occurs – do you have enough data to investigate – and actually pinpoint the cause? • Disk is cheap • Collect heaps of statistical data • Have a scheme for compressing data as it ages (RRD/Stager method) • Focus on good search tools, reports and visualisation methods to make traffic/statistical anomalies easy to detect • Isolation and classification of an error tends to consume most of the recovery time • Autodetection of thresholds and more complex anomaly detection is even better • Remember to moderate the total flow of alarms (classify alarms)
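Compressing data as it ages can be illustrated by averaging fine-grained samples into coarser time buckets. The sketch below shows only the general idea behind RRD-style consolidation; it is not how RRDtool or Stager is implemented, and the function name and sample values are assumptions.

```python
def consolidate(samples, step):
    """Average (timestamp, value) samples into buckets of `step` seconds."""
    buckets = {}
    for ts, value in samples:
        start = ts - (ts % step)  # start of the bucket this sample falls in
        buckets.setdefault(start, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


if __name__ == "__main__":
    five_minute = [(0, 10), (300, 20), (600, 30), (900, 40)]
    print(consolidate(five_minute, step=600))
    # → [(0, 15.0), (600, 35.0)]
```

Averaging hides short peaks, so it can be worth also keeping a maximum per bucket for older data if peak load matters in later investigations.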

Logs are gold, scripts as well • Log, log, log • Syslog is also a management system • Small (shell) scripts can be gold • A good idea can be only a few code lines away… • A culture that motivates creativity and allows continuous implementation of new scripts/add-ons will step by step improve the overall management process!

Commit to open source • Open source development works • Sharing ideas and running code widely improves the quality • Distributed contributions can speed up implementation • (Poorly documented) single person projects will eventually die

Adopt good naming standards • Do not underestimate the value of sound names for your equipment, rooms and locations • The name of the device should in itself give an idea of what the device is (does) and where it is placed • Example: mtfs-272-sw (a switch in area ”mtfs”, wiring closet ”272”) • Also use a thought-through naming standard for router interfaces and switch ports
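A consistent naming standard also lets tools extract meaning from a name. The sketch below parses names of the form shown in the example (`mtfs-272-sw`); the regular expression and the field names are assumptions about such a scheme, and only the `sw` suffix is taken from the slide.

```python
import re

# Assumed pattern: <area>-<wiring closet>-<device kind>
NAME_PATTERN = re.compile(
    r"^(?P<area>[a-z0-9]+)-(?P<closet>[a-z0-9]+)-(?P<kind>[a-z]+)$"
)

# Only "sw" comes from the slide; other suffixes would be site-specific.
KINDS = {"sw": "switch"}


def parse_device_name(name):
    """Split a device name into area, closet and kind; None if it does not match."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["kind"] = KINDS.get(fields["kind"], fields["kind"])
    return fields


if __name__ == "__main__":
    print(parse_device_name("mtfs-272-sw"))
    # → {'area': 'mtfs', 'closet': '272', 'kind': 'switch'}
```

Running such a check against the inventory is a cheap way to find devices that do not follow the standard.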

NMS Security • Restrict access to NMS to authorized crew only • both network access and physical access • Isolate management IP address of switches and base stations to dedicated subnets • Firmly restrict SNMP access to the network equipment – only from the NMS(es). • remember SNMP v2 security is weak • Be even more restrictive if you allow/use SNMP Write • consider SNMP v3 or Netconf
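The rule "SNMP access only from the NMS(es)" is enforced on the equipment itself, but the policy can be expressed and audited in a few lines. In this sketch the NMS address is a hypothetical value from a documentation range, and `snmp_allowed` is an illustrative helper, not a function of any SNMP library.

```python
import ipaddress

# Hypothetical NMS source address (documentation range, not a real host).
NMS_SOURCES = [ipaddress.ip_network("192.0.2.10/32")]


def snmp_allowed(source_ip, allowed=NMS_SOURCES):
    """Return True only if source_ip falls inside an allowed NMS network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in allowed)


if __name__ == "__main__":
    print(snmp_allowed("192.0.2.10"))  # → True
    print(snmp_allowed("192.0.2.99"))  # → False
```

A script like this could be run over logged SNMP request sources to spot devices whose access lists are too permissive.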

MIB requirements • Your network equipment should support: • RFC 3418: SNMPv2-MIB (system) • RFC 2863: IF-MIB (interfaces, incl. 64 bit counters) • RFC 4293: IP-MIB (IP-interfaces and ARP; IPv4 and IPv6) • RFC 4133: ENTITY-MIB (modules, optics, software, serial numbers) • Not supported by Juniper • RFC 4188: BRIDGE-MIB (bridge table) • RFC 4363: Q-BRIDGE-MIB (bridge table per vlan, vlan config) • Not supported by Cisco • RFC 3635: EtherLike-MIB (duplex) • RFC 2668: MAU-MIB (medium) • equipment support seems scarce (HP has support) • Your NMS should whenever possible use standard/IETF MIBs rather than vendor proprietary MIBs
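Why 64 bit counters matter can be shown with simple arithmetic: an octet counter of a given width wraps after 2^bits bytes, so on a fast link a 32 bit counter can wrap between two polls. The helper names below are illustrative assumptions; the 64 bit octet counters in IF-MIB are the `ifHC…` objects such as `ifHCInOctets`.

```python
def wrap_time_seconds(counter_bits, link_bps):
    """Seconds until an octet counter wraps at full line rate."""
    return (2 ** counter_bits) * 8 / link_bps


def counter_delta(previous, current, counter_bits):
    """Difference between two readings, correct if at most one wrap occurred."""
    return (current - previous) % (2 ** counter_bits)


if __name__ == "__main__":
    print(round(wrap_time_seconds(32, 1e9), 1))   # → 34.4 (seconds at 1 Gbit/s)
    print(counter_delta(4294967000, 704, 32))     # → 1000
```

At 1 Gbit/s a 32 bit counter wraps in about 34 seconds, well inside a typical five-minute polling interval, so more than one wrap can go unnoticed. A 64 bit counter on the same link takes centuries to wrap.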

Key points – in summary • Be proactive • Detect important alarms early • Inform the users • Log, log, log (snmp collect) • Use a number of tools • Adopt good naming standards • Value the engineer – small scripts are gold • Educate your crew! (in both NMS operations and procedures)