Grid Monitoring using Nagios and RRDtool
Presentation Transcript
Grid Monitoring using Nagios and RRDtool Ian Stokes-Rees Oxford Particle Physics
In a perfect world … • Individual node status • Is it up? • What is its load? • What is the memory and swap usage? • NFS and network load? • Are the partitions full? • Are applications and services running properly? • Amalgamated node status • Same info, but across groups of nodes
In a perfect world … • Historical information • Trends • Notification of service states • e.g. Storage down to 100 megs free = Warning • Storage down to 10 megs free = Critical • sshd no longer running = Failure • notify by email, pager, mobile • Easy access to monitoring information • web, email, digest, mobile
In a perfect world … • Avoidance of “Too many red flashing lights” • “Just the facts, ma’am” – only root cause failures should be reported, not the cascade of every downstream failure • also includes avoiding unnecessary checks • e.g. HTTP responding, therefore no need to ping • e.g. power outage, doesn’t ping, so don’t bother trying anything else • Other wish list requirements?
Aspects of Current Grid Monitoring • LDAP (Lightweight Directory Access Protocol) is the current foundation for MDS. Designed for frequent reads and infrequent writes. • MDS (Monitoring and Discovery Service) uses LDAP for maintaining static and dynamic system details. • R-GMA (Relational Grid Monitoring Architecture) is meant to address shortcomings of the LDAP-based MDS system by using a hierarchy of relational databases. Now being deployed. • GRIS (Grid Resource Information Service) stores details about the state of “the grid” (at least from the local node) • GIIS (Grid Index Information Service) ties together several GRISes • HBM (Heart Beat Monitor) monitors Globus services – seems to have died a quiet death
Existing Grid Monitoring Lacks… • Historical information for trends • Simple interface for accessing information • Automated response to changes in system state Here is where RRDtool and Nagios can contribute
RRDtool www.rrdtool.com • Round Robin Database for time series data storage • Command line based • From the author of MRTG • Made to be faster and more flexible than MRTG • Includes CGI and graphing tools, plus APIs • Solves the Historical Trends and Simple Interface problems
Define Data Sources (Inputs) • DS:speed:COUNTER:600:U:U • DS:fuel:GAUGE:600:U:U • DS = Data Source • speed, fuel = “variable” names • COUNTER, GAUGE = variable type (a COUNTER is an ever-increasing value stored as its rate of change; a GAUGE is stored as given) • 600 = heartbeat in seconds – the interval is recorded as UNKNOWN if no update arrives within this time • U:U = limits on minimum and maximum variable values (U means unknown, so any value is permitted)
Define Archives (Outputs) • RRA:AVERAGE:0.5:1:24 • RRA:AVERAGE:0.5:6:10 • RRA = Round Robin Archive • AVERAGE = consolidation function • 0.5 = up to 50% of the primary samples behind a consolidated point may be UNKNOWN • 1:24 = this RRA keeps each sample (an average over one 5 minute primary sample), 24 times (2 hours’ worth) • 6:10 = this RRA keeps an average over every six 5 minute primary samples (30 minutes), 10 times (5 hours’ worth) • Clear as mud! • It all depends on the original step size, which defaults to 5 minutes
RRDtool Database Format • RRD file created with --step 300 (5 minute input step size), holding three archives: • RRA 1:24 = recent data, stored once every 5 minutes for the past 2 hours • RRA 6:10 = medium length data, averaged to one entry per half hour for the last 5 hours • RRA 288:365 = old data, averaged to one entry per day for the last 365 days
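The archive arithmetic above can be checked with a few lines of shell. This is only a sketch: `rra_span_seconds` is a hypothetical helper name, not part of RRDtool, and it assumes the 300 second step quoted on the slide. The history covered by one RRA is step × samples per row × number of rows.

```shell
#!/bin/sh
# Hypothetical helper (not an RRDtool command): print how many seconds of
# history one RRA covers.
# Arguments: step in seconds, primary samples per row, number of rows.
rra_span_seconds() {
    step=$1
    per_row=$2
    rows=$3
    echo $(( step * per_row * rows ))
}

# The three archives from the slide, assuming the default 300 s step.
echo "1:24    covers $(( $(rra_span_seconds 300 1 24) / 3600 )) hours"
echo "6:10    covers $(( $(rra_span_seconds 300 6 10) / 3600 )) hours"
echo "288:365 covers $(( $(rra_span_seconds 300 288 365) / 86400 )) days"
```

With a different `--step`, the same RRA definitions would cover a proportionally different time span, which is why the step size matters so much.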
RRDtool Example • Monitoring a car – fuel in the tank plus odometer
    Time   Odometer  Fuel   Note
    12:05  12345 KM  7.0 L
    12:10  12357 KM  5.8 L
    12:15  12363 KM  5.2 L  STOP
    12:20  12363 KM  5.2 L
    12:25  12363 KM  5.2 L  RESTART
    12:30  12373 KM  4.2 L
    12:35  12383 KM  3.2 L
    12:40  12393 KM  2.2 L
    12:45  12399 KM  1.6 L
    12:50  12405 KM  9.0 L  REFUEL
    12:55  12411 KM  8.4 L
    13:00  12415 KM  8.0 L
    13:05  12420 KM  7.5 L
    13:10  12422 KM  7.3 L
    13:15  12423 KM  7.2 L
RRDtool Example • Create an RRD to store distance and fuel
    rrdtool create car.rrd --start 920804400 \
        DS:speed:COUNTER:600:U:U \
        DS:fuel:GAUGE:600:U:U \
        RRA:AVERAGE:0.5:1:24 \
        RRA:AVERAGE:0.5:6:10
• --start defines the earliest time the RRD accepts; every update must carry a later timestamp
RRDtool Example • Input data:
    rrdtool update car.rrd 920804700:12345:7.0 920805000:12357:5.8
    rrdtool update car.rrd 920805300:12363:5.2 920805600:12363:5.2
    rrdtool update car.rrd 920805900:12363:5.2 920806200:12373:4.2
    rrdtool update car.rrd 920806500:12383:3.2 920806800:12393:2.2
    rrdtool update car.rrd 920807100:12399:1.6 920807400:12405:9.0
    rrdtool update car.rrd 920807700:12411:8.4 920808000:12415:8.0
    rrdtool update car.rrd 920808300:12420:7.5 920808600:12422:7.3
    rrdtool update car.rrd 920808900:12423:7.2
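Because the odometer is fed into a COUNTER data source, RRDtool stores the rate of change between updates rather than the raw reading. The sketch below reproduces that calculation for two intervals of the car data. `counter_rate` is a hypothetical helper, and it is a simplification: it assumes updates land exactly on step boundaries (as they do here) and ignores counter wrap-around.

```shell
#!/bin/sh
# Hypothetical helper: per-second rate between two COUNTER readings.
# Arguments: previous reading, current reading, seconds between them.
# Simplified: no counter-wrap handling, no interpolation to step boundaries.
counter_rate() {
    awk -v prev="$1" -v cur="$2" -v secs="$3" \
        'BEGIN { printf "%.2f\n", (cur - prev) / secs }'
}

# 12:05 -> 12:10: the car covered 12 KM in 300 seconds.
counter_rate 12345 12357 300
# 12:15 -> 12:20: the car was stopped, so the rate is zero.
counter_rate 12363 12363 300
```

The fuel level, by contrast, is a GAUGE, so its readings (7.0, 5.8, ...) are stored as given.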
RRDtool Graphing • Now with data in the RRD, RRDtool can generate graphs:
    rrdtool graph speed.gif \
        --start 920804400 --end 920808000 \
        --vertical-label m/s \
        DEF:myspeed=car.rrd:speed:AVERAGE \
        DEF:myfuel=car.rrd:fuel:AVERAGE \
        CDEF:realspeed=myspeed,1000,* \
        LINE2:realspeed#FF0000 \
        LINE2:myfuel#00FF00
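The CDEF expression `myspeed,1000,*` is written in reverse Polish notation: operands first, operator last, so it multiplies the stored rate (KM per second) by 1000 to plot metres per second. The toy evaluator below illustrates how such an expression is read. `rpn` is a hypothetical helper that handles only the four basic operators on literal numbers; it is not RRDtool's own evaluator.

```shell
#!/bin/sh
# Hypothetical toy RPN evaluator for comma-separated expressions.
# Supports only +, -, * and / on literal numbers.
rpn() {
    printf '%s\n' "$1" | awk -F, '{
        sp = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "+" || $i == "-" || $i == "*" || $i == "/") {
                b = stack[sp--]          # right operand (pushed last)
                a = stack[sp--]          # left operand
                if ($i == "+")      r = a + b
                else if ($i == "-") r = a - b
                else if ($i == "*") r = a * b
                else                r = a / b
                stack[++sp] = r
            } else {
                stack[++sp] = $i + 0     # push a number
            }
        }
        print stack[sp]
    }'
}

# A stored rate of 0.04 KM/s becomes 40 m/s, as in CDEF:realspeed=myspeed,1000,*
rpn '0.04,1000,*'
```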
RRDtool Graphing Output • Much more interesting graphs possible • Multiple RRDs may be used as sources for variables • Auto-interpolation of points • Functions and calculations can be applied to variables • Legends, labels, and text can be inserted
Nagios www.nagios.org • Instantaneous service level monitoring • Web based interface • Somewhat complicated set of configuration files to manually edit • Automated notification of change in service level (email, phone, etc.) • Defines OK, WARNING, CRITICAL and UNKNOWN service states (hosts are UP, DOWN or UNREACHABLE)
Nagios Host Definitions • Define details about each node and their hierarchy in the network:
    define host{
        host_name               tbce01
        alias                   Testbed CE
        address                 163.1.243.105
        parents                 edg-testbed
        notifications_enabled   1
        process_perf_data       1
        check_command           check-host-alive
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
    }
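The `parents` entry is what lets a monitor tell a root cause from a downstream casualty: if a host's parent has failed too, the host is merely unreachable. The following is a toy illustration of that idea, not a description of how Nagios implements it. `root_causes` is a hypothetical helper, and `tbse01` is an invented second host added for the example.

```shell
#!/bin/sh
# Toy root-cause filter (hypothetical, not Nagios internals).
# $1: newline-separated "host parent" pairs
# $2: space-separated list of hosts that failed their check
# Prints only failed hosts whose parent did not also fail.
root_causes() {
    pairs=$1
    failed=" $2 "
    for host in $2; do
        parent=$(printf '%s\n' "$pairs" | awk -v h="$host" '$1 == h { print $2 }')
        if [ -n "$parent" ]; then
            case "$failed" in
                *" $parent "*) continue ;;   # parent failed too: not a root cause
            esac
        fi
        echo "$host"
    done
}

# tbce01 comes from the host definition; tbse01 is made up for illustration.
topology="tbce01 edg-testbed
tbse01 edg-testbed"

# All three hosts fail, but only the shared parent is reported.
root_causes "$topology" "edg-testbed tbce01 tbse01"
```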
Nagios Service Definitions • Define details about each service:
    define service{
        name                    ping
        check_command           check_ping!100.0,20%!500.0,60%
        contact_groups          linux-admins
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
    }
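In a `check_command` value, `!` separates the command name from its arguments, which Nagios passes to the command definition as `$ARG1$`, `$ARG2$` and so on. For `check_ping` these are conventionally the warning and critical thresholds (round-trip time in ms, packet loss in percent). The sketch below only illustrates the splitting; `split_check_command` is a hypothetical helper, not part of Nagios.

```shell
#!/bin/sh
# Hypothetical helper: split a Nagios check_command value on "!".
# Assumes the arguments contain no shell glob characters such as "*".
split_check_command() {
    old_ifs=$IFS
    IFS='!'
    set -- $1            # word-split on "!" into positional parameters
    IFS=$old_ifs
    echo "command=$1"
    shift
    n=1
    for arg in "$@"; do
        echo "ARG$n=$arg"
        n=$((n + 1))
    done
}

split_check_command 'check_ping!100.0,20%!500.0,60%'
```

Read this way, the service above warns at 100 ms or 20% loss and goes critical at 500 ms or 60% loss.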
Nagios Service and Host Polling • Pull model, where Nagios server executes command to fetch host or service status • Requires remote hosts and services to cooperate • NRPE installed on clients allows server to execute “plugins” to poll for information • Alternatively use existing client reporting mechanisms (ping, wget, http) • Server responsible for configuration of polling intervals and details to be polled
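A plugin is simply a program that prints one line of status text and reports its result through the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). As a minimal sketch, the function below applies the storage thresholds from the wish list (100 megs free = warning, 10 megs free = critical). `check_free_mb` is a hypothetical name, and a real plugin would measure the free space itself rather than take it as an argument.

```shell
#!/bin/sh
# Hypothetical plugin logic: classify free disk space given in MB.
# Arguments: free MB, warning threshold (default 100), critical threshold (default 10).
# Return codes follow the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
check_free_mb() {
    free_mb=$1
    warn_mb=${2:-100}
    crit_mb=${3:-10}
    case "$free_mb" in
        ''|*[!0-9]*)
            echo "DISK UNKNOWN - bad input '$free_mb'"
            return 3 ;;
    esac
    if [ "$free_mb" -le "$crit_mb" ]; then
        echo "DISK CRITICAL - ${free_mb} MB free"
        return 2
    elif [ "$free_mb" -le "$warn_mb" ]; then
        echo "DISK WARNING - ${free_mb} MB free"
        return 1
    fi
    echo "DISK OK - ${free_mb} MB free"
    return 0
}

check_free_mb 2048
```

With the default thresholds, a call such as `check_free_mb 50` would print a WARNING line and return 1.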
Nagios Service and Host Reporting • Push model, where services and hosts decide when to report status to Nagios server • push data when available/relevant • generally full access to node-local data • requires configuring every node independently • authentication of nodes at server • nodes need to know who to send data to
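In the push model, a node reports a result to Nagios as a passive check, written in the external command syntax `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return code;output`. The sketch below only formats such a line. `passive_result` is a hypothetical helper, the service name `disk` is invented for the example, and how the line reaches the server (for instance, which command file or transport is used) depends on the installation.

```shell
#!/bin/sh
# Hypothetical helper: format one passive service check result.
# Arguments: epoch timestamp, host name, service description,
#            return code (0-3), plugin output text.
passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Report a WARNING for a made-up "disk" service on host tbce01, stamped now.
passive_result "$(date +%s)" tbce01 disk 1 "DISK WARNING - 50 MB free"
```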
Finally, some other monitors • NWS (Network Weather Service): attempts to predict network utilisation from historical information • Ganglia: cluster monitoring system that provides aggregate graphs of cluster performance – Globus/EDG tie-ins underway • Map Center: EDG project to monitor Grid status and services • ActiveMap, GridPortal, and InfoPortal* appear to be inactive projects