Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Joe VanAndel, NCAR/EOL, 2012/3/29


Presentation Transcript


  1. Using OMD/Nagios to Monitor Complex Hardware/Software Systems • Joe VanAndel • NCAR/EOL • 2012/3/29

  2. Why is Monitoring Important?

  3. Why is Monitoring Important? • Software systems can be very complex: • networked data sources • multiple computers • long running daemons • Hardware (including computers) can fail

  4. Why is Monitoring Important (2)? • Someone is relying on your system to produce or process data. • Computers are better than people at monitoring - manual procedures are error prone and don’t cover 24x7. • Your staff may need to be notified out-of-hours if failures occur.

  5. Why is Monitoring Important to S-Pol? • S-Pol is a complex system of hardware and software - we need to detect problems so they can be corrected quickly. • Notifications allow unattended operation, so staff don’t have to stay on site 24x7. • We cannot afford to staff 3 shifts on field projects.

  6. What is OMD? • Open Monitoring Distribution (http://omdistro.org) • runs on Linux • Bundles Nagios with 16 useful utilities, including • check_mk - creates Nagios configurations for you! • rrdtool/rrdcached - store and retrieve time series data, supports graphing of performance data.

  7. Why use OMD? • complete package of monitoring tools • avoid the effort of compiling and integrating Nagios add-ons • Web based monitoring - from anywhere!

  8. Why use check_mk? • Automatically generates Nagios rules for each machine you monitor. • Lower overhead allows running more checks on more hosts. • Easy to create both hardware and software checks. • The S-Pol radar had 700 checks running on 14 hosts - we didn’t want to generate the Nagios configuration manually.

  9. check_mk architecture RRD is “Round Robin Database” which efficiently stores the output from check_mk. figure from http://mathias-kettner.de

  10. check_mk_agent

  11. Getting Started with OMD • install the RPM • $ omd create mysite # the monitoring instance • create scripts in /usr/lib/check_mk_agent/local • $ check_mk -I # run inventory • $ omd start mysite # start daemons. • open the check_mk URL in a browser.

  12. Writing a check is simple • write a C program, shell script, or Python script • query hardware or software status • output string(s) to stdout: "0 PgenTritonRaidStatus - OK" • run a check_mk inventory to • find your script • generate the Nagios configuration
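
A minimal sketch of such a check in Python, assuming the local-check line format shown on the slide (status code, service name, performance data or `-`, then free text). The helper name `local_check_line` and the hard-coded status are illustrative and do not come from the slides.

```python
#!/usr/bin/env python3
"""Sketch of a check_mk local check; helper names are illustrative."""

STATUS_TEXT = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def local_check_line(status, service, detail="", perfdata="-"):
    """Build one line: '<status> <service> <perfdata> <text>'."""
    if status not in STATUS_TEXT:
        raise ValueError(f"unsupported status code: {status}")
    if " " in service:
        # The service name is a single whitespace-delimited field.
        raise ValueError("service name must not contain spaces")
    text = STATUS_TEXT[status]
    if detail:
        text = f"{text} - {detail}"
    return f"{status} {service} {perfdata} {text}"


if __name__ == "__main__":
    # Assumption: a real check would query the hardware here
    # instead of reporting a fixed status.
    print(local_check_line(0, "PgenTritonRaidStatus"))
```

Placed in /usr/lib/check_mk_agent/local and made executable, a script like this would be picked up by the next inventory run.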

  13. /usr/lib/check_mk_agent/local/filecount

          #!/bin/bash
          # Report the number of entries in each directory as a check_mk local check.
          DIRS="/var/log /tmp"
          for dir in $DIRS
          do
              count=$(ls "$dir" | wc --lines)
              if [ "$count" -lt 50 ] ; then
                  status=0
                  statustxt=OK
              elif [ "$count" -lt 100 ] ; then
                  status=1
                  statustxt=WARNING
              else
                  status=2
                  statustxt=CRITICAL
              fi
              # Output fields: status, service name, performance data, status text
              echo "$status Filecount_$dir count=$count;50;100;0; $statustxt - $count files in $dir"
          done

  14. S-Pol monitoring • Radar hardware for S-Band & Ka-band: • antenna • transmitter • receiver • Klystron temperature • Container temperatures

  15. Hardware Monitoring Architecture

  16. Sixnet Controller

  17. Hardware monitoring • Sixnet controller communicates to measurement modules using RS-485 • monitors transmitter status • monitors antenna status • monitors transmitter temperature • Sixnet controller runs Linux, so adding a check_mk_agent was easy!

  18. What else? • Computer status: • cpu load, • disk space, • memory usage • radar software - tasks running, products being produced • fetching data: satellite images, soundings, forecast model output
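
One way to check that "products are being produced" or that fetched data keeps arriving is to look at the age of the newest file in a directory. The sketch below is only an illustration of that idea: the function names, the thresholds (600 s and 1800 s), and the service name are assumptions, not the checks actually used on S-Pol.

```python
import os
import time


def newest_file_age(directory, now=None):
    """Return seconds since the newest regular file was modified, or None if there are no files."""
    now = time.time() if now is None else now
    with os.scandir(directory) as entries:
        mtimes = [e.stat().st_mtime for e in entries if e.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def freshness_line(service, age, warn=600, crit=1800):
    """Format a local-check line from the age of the newest file (thresholds are assumed values)."""
    if age is None:
        return f"2 {service} - CRITICAL - no files found"
    if age < warn:
        status, text = 0, "OK"
    elif age < crit:
        status, text = 1, "WARNING"
    else:
        status, text = 2, "CRITICAL"
    perf = f"age={age:.0f};{warn};{crit}"
    return f"{status} {service} {perf} {text} - newest file is {age:.0f} s old"
```

A local script could then print `freshness_line("SatImages", newest_file_age(path))` for each data directory it watches, where `path` is whatever directory the site uses.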

  19. Implementation • installed OMD on a rack-mount Linux server • installed check_mk_agent on all monitored computers • wrote scripts, installed in /usr/lib/check_mk_agent/local

  20. Implementation (2) • Configured digital IO modules (controlled by an embedded Sixnet computer) to monitor S-Pol hardware • Wrote a program on the Sixnet that reported hardware status to check_mk_agent • Sent Ka-band status over the network and wrote software to create status files readable by check_mk scripts
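
The slides do not describe the status file format, so the following sketch assumes a simple hypothetical one (`key=value` lines, with `#` comments) to show how a check script might turn such a file into a check_mk status line. The key name, thresholds, and service name are all invented for illustration.

```python
def parse_status(text):
    """Parse 'key=value' lines into a dict, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def temperature_line(service, values, key, warn, crit):
    """Format a local-check line for one temperature reading (degrees C assumed)."""
    if key not in values:
        return f"3 {service} - UNKNOWN - {key} missing from status file"
    try:
        temp = float(values[key])
    except ValueError:
        return f"3 {service} - UNKNOWN - {key} is not a number"
    if temp < warn:
        status, text = 0, "OK"
    elif temp < crit:
        status, text = 1, "WARNING"
    else:
        status, text = 2, "CRITICAL"
    return f"{status} {service} temp={temp};{warn};{crit} {text} - {key} is {temp} C"
```

Reporting UNKNOWN (status 3) when a key is missing keeps a broken status writer from looking like healthy hardware.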

  21. Types of S-Pol checks • scripts/programs directly monitor hardware or software • hybrid scripts - process the output of an existing program, output check_mk status reports.
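
A hybrid script can be sketched as two small steps: run the existing program, then translate its output into check_mk status lines. The report format here (`<task> <state>` per line) and the command name `task_report` are hypothetical stand-ins, since the slides do not name the programs that were wrapped.

```python
import subprocess


def lines_from_report(report):
    """Translate '<task> <state>' lines into check_mk local-check lines."""
    out = []
    for raw in report.splitlines():
        fields = raw.split()
        if len(fields) != 2:
            continue  # skip anything that is not exactly two fields
        task, state = fields
        if state == "running":
            out.append(f"0 Task_{task} - OK - {task} is running")
        else:
            out.append(f"2 Task_{task} - CRITICAL - {task} is {state}")
    return out


def run_report(command):
    """Run the existing program and return its standard output as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    try:
        # Assumption: 'task_report' is a placeholder, not a real S-Pol tool.
        report = run_report(["task_report"])
    except FileNotFoundError:
        print("3 TaskReport - UNKNOWN - task_report not found")
    else:
        for line in lines_from_report(report):
            print(line)
```

Keeping the parsing separate from the subprocess call makes the translation logic easy to test against saved output.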

  22. Implementation (3) • configured GSM cell phone to send SMS messages • software from gnokii.org • bought local SIM • wrote script to limit frequency of SMS messages
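
The rate-limiting script itself is not shown in the slides. As a rough sketch of one possible approach, the function below allows at most `max_messages` within a sliding time window and keeps its state in a small JSON file, because a notification command typically runs as a fresh process each time. The function name, defaults, and state file are assumptions, and the sketch does no file locking.

```python
import json
import time


def allow_sms(state_path, max_messages=3, window_seconds=300, now=None):
    """Return True if another SMS may be sent now, recording the send time if so."""
    now = time.time() if now is None else now
    try:
        with open(state_path) as fh:
            sent = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        sent = []  # no usable state yet: start with an empty history
    # Keep only send times that still fall inside the window.
    sent = [t for t in sent if now - t < window_seconds]
    allowed = len(sent) < max_messages
    if allowed:
        sent.append(now)
    with open(state_path, "w") as fh:
        json.dump(sent, fh)
    return allowed
```

A notification wrapper would call `allow_sms(...)` first and hand the message to the SMS tool only when it returns True; suppressed messages are simply dropped in this sketch.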

  23. Sample Web Screens

  24. Challenges • learning how to create advanced checks with graphs • Avoiding false alarms (particularly after hours!) • limiting frequency of notifications - getting 20 text messages on your cell phone in 5 minutes is not helpful!

  25. How well did OMD/Nagios work? • The second shift only had to be on-site from 3:00PM to 8:00PM, rather than until 11:00PM. • Daytime: OMD/Nagios warned staff of problems on multiple occasions. • Off-hours: OMD/Nagios notified S-Pol staff of critical hardware/software failures on multiple occasions.

  26. 24x7 Operations without working 24x7 • Added SMS (text message) notifications to Nagios • Technicians and engineers carried cell phones • Nagios sent SMS when hardware or software problems occurred • Technicians and engineers accessed Nagios web pages via 3G modems on laptops

  27. FUTURE • Monitoring of diesel generators • Add remote control: • generator & transfer switch • reset of transmitter faults • reset of antenna faults

  28. Conclusion • Monitoring is important for any system, critical for complex or unattended operation • OMD/Nagios makes it easy to deploy monitoring • OMD/Nagios helped EOL maintain high data quality from S-Pol without requiring staff 24x7 on site. • Notifications via SMS and remote access to OMD’s web pages are very helpful.

  29. Acknowledgments • Ethan Galstad - Nagios chief developer • Mathias Kettner - check_mk • Fatima Dembele (summer intern) - prototyping • Paloma Gutierrez - hardware monitoring • Chris Burghart - Ka-band monitoring • Mike Dixon - Ka-band & HAWK monitoring

  30. Questions?
