
Nagios on Tier1 farm

Jonathan Wheeler, RAL Tier1 Fabric Team, 20th June 2008.


Presentation Transcript


  1. Nagios on Tier1 farm • Jonathan Wheeler, RAL Tier1 Fabric Team • 20th June 2008

  2. Overview • What we had before (Sure) • Introduction to Nagios and how it is configured for the farm • What might we do next

  3. Sure monitoring - 1 • Consists of a server and clients • Communication via the sysreq command • Requires scripts to be set up on each client to run checks and report results to the server

  4. Sure monitoring - 2 Three main tasks: • check host alive (active: using ping; passive: accepting heartbeat messages) • receive alarm messages • receive “backup started” and “backup finished” messages

  5. Sure monitoring - 3 Problems: • configuration not directly under Tier1 control • requires locally written and locally maintained scripts • limited view of farm alarms and state • alarms only visible on the server screen

  6. Introduction to Nagios • highly configurable • under active development (Nagios 2.11 legacy, Nagios 3.0.2 latest stable) • active user community (mailing list) • some commercial offerings • extensive documentation included with the installation • allows local extensions

  7. Introduction to Nagios – basics - 1 Nagios: • schedules test commands, for example: is the space used in the /var filesystem larger than the permitted limit? • accepts results as a return code (0 – OK, 1 – warning, 2 – critical, 3/-1 – unknown) and a single-line message
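As an illustration of the return-code convention on this slide, here is a minimal sketch of a plugin for the /var example. It is not one of the Tier1 plugins; the thresholds (80% and 90%) and the message wording are assumptions chosen for the example.

```python
import shutil

# Nagios plugin return codes, as listed on the slide.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn_pct, crit_pct):
    """Map a usage percentage to a (return code, one-line message) pair."""
    if used_pct >= crit_pct:
        return CRITICAL, f"DISK CRITICAL - {used_pct:.0f}% used"
    if used_pct >= warn_pct:
        return WARNING, f"DISK WARNING - {used_pct:.0f}% used"
    return OK, f"DISK OK - {used_pct:.0f}% used"


def check_disk(path="/var", warn_pct=80.0, crit_pct=90.0):
    """Check one filesystem; failure to read it is reported as UNKNOWN.

    The default thresholds are assumptions for this sketch.
    """
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    used_pct = 100.0 * usage.used / usage.total
    return classify(used_pct, warn_pct, crit_pct)


def main():
    """Print the single-line message and return the status code."""
    code, message = check_disk()
    print(message)
    return code

# A real plugin script would end with: raise SystemExit(main())
```

Nagios reads only the exit status and the first line of output, so the plugin keeps its whole report on one line.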

  8. Introduction to Nagios – basics - 2 Nagios (continued): • displays results via a Web interface to authorised users • sends notifications via e-mail, SMS, RSS, Morse code, jungle drums, etc. • may run an event handler, e.g. if a test fails, then put this batch node offline
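The event-handler idea can be sketched as a small decision function. Nagios can pass macros such as $SERVICESTATE$ and $SERVICESTATETYPE$ to a handler as arguments; what the handler then does is site-specific. The command name `set_node_offline` below is hypothetical, standing in for whatever the batch system provides, and acting only on HARD CRITICAL states is an assumption of this sketch.

```python
def choose_action(state, state_type, hostname):
    """Decide what the handler should run for one service state change.

    state      -- value of $SERVICESTATE$ (OK, WARNING, CRITICAL, UNKNOWN)
    state_type -- value of $SERVICESTATETYPE$ (SOFT or HARD)
    Returns an argument list to execute, or None when nothing should be done.
    """
    # Act only once Nagios has confirmed the failure (HARD state), so a
    # single failed attempt does not take a node out of service.
    if state == "CRITICAL" and state_type == "HARD":
        # "set_node_offline" is a hypothetical command for this sketch.
        return ["set_node_offline", hostname]
    return None
```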

  9. Introduction to Nagios – networked clients • Nagios server can use the check_nrpe command to run a test on a networked client • client must be running the nrpe process to accept and run check requests, and to return the results to the server • Nagios server can also use ssh or smtp to perform checks (little experience of this on Tier1)
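A rough sketch of how a check_nrpe call is put together follows. The `-H`, `-c` and `-t` options are standard check_nrpe options; the remote check name `check_var_disk` is a made-up example, and the plugin is assumed to be on the PATH.

```python
import subprocess


def nrpe_command(host, remote_check, timeout=None, plugin_path="check_nrpe"):
    """Build the argument list that asks a client's nrpe process to run a check."""
    argv = [plugin_path, "-H", host, "-c", remote_check]
    if timeout is not None:
        argv += ["-t", str(timeout)]
    return argv


def run_check(argv):
    """Run the command; return (exit status, first line of output)."""
    done = subprocess.run(argv, capture_output=True, text=True)
    lines = done.stdout.splitlines()
    return done.returncode, (lines[0] if lines else "")
```

For example, `nrpe_command("ganglia0430", "check_var_disk", timeout=30)` gives the arguments for a 30-second-limited check on one of the hosts defined later in the talk. The remote check name has to match a command defined in the client's nrpe configuration.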

  10. Single server, many clients [Diagram: one Nagios server, four Nagios clients]

  11. Introduction to Nagios – slave servers • Running scheduled checks and the web server puts a heavy load on the Nagios server • Tier1 uses master and slave servers: • master keeps all results, runs the web server and sends notifications • slaves schedule tests, run them and return results to the master (using the send_nsca command to the nsca daemon)
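To make the slave-to-master step more concrete: send_nsca reads one result per line from standard input, by default as tab-separated fields (host, service, return code, plugin output). The sketch below only formats such a line; the service name `var_disk` is an invented example.

```python
def nsca_line(host, service, return_code, output):
    """Format one passive service result in send_nsca's default input format."""
    # Tabs or newlines inside the plugin output would break the record,
    # so they are flattened to spaces.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean}\n"
```

A slave would pipe this line into something like `send_nsca -H <master> -c <config file>`; the master's nsca daemon then hands the result to Nagios as a passive check result.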

  12. Introduction to Nagios – “freshness” If a slave server has crashed: • master server checks whether tests have been run to schedule (freshness checking) • if a test is stale (test results not returned to schedule), master will run the test itself (force check)
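The freshness rule amounts to a simple age comparison. In Nagios it is switched on per service with the `check_freshness` and `freshness_threshold` directives; the sketch below only illustrates the decision, not Nagios's internal code, and the service names are invented.

```python
def is_stale(last_result_time, freshness_threshold, now):
    """True if no result has arrived within the threshold (all in seconds)."""
    return (now - last_result_time) > freshness_threshold


def checks_to_force(last_results, freshness_threshold, now):
    """Return the names of services whose results are stale, sorted by name.

    last_results maps a service name to the time its last result arrived.
    """
    return sorted(
        name for name, last in last_results.items()
        if is_stale(last, freshness_threshold, now)
    )
```

With a 600-second threshold, a service last heard from 700 seconds ago would be run by the master, while one heard from 200 seconds ago would be left to its slave.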

  13. Master and slave servers; many clients [Diagram: one master server, three slave servers, nine clients]

  14. Introduction to Nagios – clearing alarms If the check condition has been corrected and you want to clear the alarm before the next scheduled test: • can force a check (from master or slave) by issuing an appropriately formatted command to the server • scripts are available to do this
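One plausible form of the "appropriately formatted command" is the Nagios external command SCHEDULE_FORCED_SVC_CHECK, written to the server's command file. The sketch below is not the Tier1 script; the command file location varies by installation, so it is passed in as a parameter, and the service name in the usage note is invented.

```python
import time


def forced_check_command(host, service, now=None):
    """Format an external command asking Nagios to re-run one service check now."""
    stamp = int(time.time() if now is None else now)
    # Format: [timestamp] COMMAND;host;service;check_time
    return f"[{stamp}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{stamp}\n"


def submit(line, command_file):
    """Write the command to the Nagios command file (a named pipe).

    command_file is site-specific; no default is assumed here.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

For instance, `forced_check_command("ganglia0430", "var_disk")` produces a line that schedules an immediate re-check of that service.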

  15. Introduction to Nagios - configuration In our configuration Nagios knows about: • hosts • host groups • services (for checking) • contacts and contact groups • time periods (when tests are valid, when to send contact messages)

  16. Introduction to Nagios - configuration • Configuration is made simpler by extensive use of templates, for example: • define a template for a generic host • use it to define many other hosts, only changing parameters that are different (e.g. host name, address, group to which it belongs) • templates can be recursive (a template may itself use another template)

  17. # Generic host definition template
      define host{
          name                            generic-host      ; name of host template
          notifications_enabled           1                 ; Host notifications are enabled
          event_handler_enabled           1                 ; Host event handler is enabled
          flap_detection_enabled          1                 ; Flap detection is enabled
          process_perf_data               1                 ; Process performance data
          retain_status_information       1                 ; Retain status information
          retain_nonstatus_information    1                 ; Retain non-status information
          register                        0                 ; Template definition
          check_command                   check-host-alive
          max_check_attempts              10
          notification_interval           720
          notification_period             24x7
          notification_options            d,u,r
          }

  18. define host{
          use               generic-host
          host_name         ganglia0430
          parents           swt-5530-0
          alias             Ganglia Host
          hostgroups        aux-services
          contact_groups    thorne
          address           130.246.183.173
          }
      define host{
          use               generic-host
          host_name         shelob
          parents           swt-4400-1
          alias             CSF Webserver
          ……………
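How a host picks up values from a template through `use` can be sketched as a recursive merge. This is a simplified model, not Nagios's own resolution code: it handles a single `use` per object, and it assumes, as in Nagios, that `name` and `register` are not passed down. The dictionaries reuse a few values from the definitions above.

```python
def resolve(name, objects, _seen=None):
    """Return the effective settings of an object after following 'use' links."""
    seen = set() if _seen is None else _seen
    if name in seen:
        raise ValueError(f"template loop at {name!r}")
    seen.add(name)

    obj = objects[name]
    merged = {}
    parent = obj.get("use")
    if parent is not None:
        # Recursion: the parent may itself use another template.
        merged.update(resolve(parent, objects, seen))
        # Assumed not inherited: the template's own name and register flag.
        merged.pop("name", None)
        merged.pop("register", None)
    # The object's own settings override anything inherited.
    merged.update({k: v for k, v in obj.items() if k != "use"})
    return merged


objects = {
    "generic-host": {
        "name": "generic-host",
        "register": "0",
        "check_command": "check-host-alive",
        "max_check_attempts": "10",
        "notification_interval": "720",
    },
    "ganglia0430": {
        "use": "generic-host",
        "host_name": "ganglia0430",
        "address": "130.246.183.173",
    },
}
```

Calling `resolve("ganglia0430", objects)` gives the host its own name and address plus the check settings from generic-host, which is why each host definition only needs to list what differs.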

  19. Introduction to Nagios - plugins • Test scripts are known as plugins • Can be written in any suitable language: shell script, Perl, C, Pascal • About 60 standard plugins (available by RPM from Dag Wieers’ repository) • About 30+ locally written plugins • plus 14+ specially written for Castor

  20. Nagios links • Nagios home page: http://www.nagios.org/ • For locally written plugins: http://cvs.gridpp.rl.ac.uk/viewcvs/viewcvs.cgi/nagios/plugins/ • For GridPP information about Nagios: http://www.gridpp.ac.uk/wiki/Nagios
