Hobbit Monitor
Linuxforum 2007, March 3rd 2007
Henrik Storner <firstname.lastname@example.org>
Hobbit – Monitoring that works
Agenda
• Hobbit history
• What does “monitoring” mean?
• Demo / Screen shots
• Architecture: The Hobbit components
  • Hobbit server
  • Network server monitoring
  • Server monitoring with the clients
• Setting it up: Quick tour of the configuration
• Custom checks
• Hobbit at the CSC Copenhagen data center
• Future
Hobbit history
• In 2001, CSC's Managed Web Services division had no monitoring of websites, only box-monitoring with Unicenter/TNG.
• Enter Big Brother – the UI is great, but BB is written in Korn shell and is slooooow!
• bbgen toolkit (2002) eliminated some slow parts of BB, but kept the BB daemon.
• Hobbit (2005) is mostly compatible with BB but has a completely different architecture.
• Hobbit contains no BB parts.
What does “monitoring” mean?
• Availability: Can I access the website?
• Performance: ... without getting bored?
• Capacity: ... when we triple the number of users?
It is vital that you know which of these questions you must answer.
The main focus for Hobbit is availability (but there's a bit of the others thrown in).
A small demo
Hobbit overview
Hobbit architecture
• 1 Hobbit server holds all current data, i.e. the status of everything we monitor.
• Usually, the Hobbit server also hosts the web interface, and runs tasks which take care of storing history and trend data.
• 1+ servers perform network tests and report the results to the Hobbit server (this may be the same machine).
• Clients collect data from each monitored server, and send it to the Hobbit server for analysis.
The Hobbit server
• Stores current data in RAM only, and never does any slow disk I/O or process forking
• Historical and trend data are stored on disk
• Core hobbitd daemon feeds server tasks via an IPC “channel”, using shared memory
• Server tasks handle e.g. client data analysis, trend data updates, and alerting
• Some tasks can be distributed across multiple servers (e.g. client data analysis)
• Extensible – you can write your own tasks, e.g. a task that stores all measurements in a database.
Hobbit web interface
• Overview pages are static HTML, rebuilt once a minute with the current status data.
• Detailed status pages are dynamically generated
• “Critical Systems” view is dynamic
• Will probably switch to an all-dynamic setup
• The web UI is not particularly attractive or flexible, so web designers are welcome!
• Some customization is possible by modifying header and footer files
Network service monitoring
• Ping: Is the server alive?
• Connect: Will it accept new connections?
• Service: Is the network service running?
• Application: Is it working?
• It is easy to check that the service is running because it uses a standard protocol (e.g. HTTP)
• But the end-user only cares about the application!
Webserver says it's “200 OK”
Check the actual data returned!
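The point of this slide can be sketched in a few lines of shell: a “200 OK” only counts as green when the returned page also contains the expected text. This is a minimal illustration, not how Hobbit implements its content checks; `content_color` is a hypothetical helper name, and fetching the page (for example with an HTTP client) is deliberately left out.

```shell
#!/bin/bash
# Hypothetical helper (not part of Hobbit): decide a status color from an
# HTTP status code and the response body. The body must match an extended
# regular expression for the check to be green.
content_color() {
    local http_code=$1 body=$2 pattern=$3
    if [ "$http_code" != "200" ]; then
        echo red                      # the server itself reported a failure
    elif printf '%s\n' "$body" | grep -Eq -- "$pattern"; then
        echo green                    # 200 OK and the expected text is present
    else
        echo red                      # 200 OK, but the page content is wrong
    fi
}
```

For example, `content_color 200 "Internal error: database unavailable" 'Please.*userid'` prints `red` even though the web server answered with 200.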
Standard network tests
• ping
• FTP
• SSH
• SMTP
• POP
• IMAP(S)
• NNTP(S)
• LDAP(S) query
• HTTP(S) w/ content
• SSL certificate
• AJP13
• VNC
• clamd
• spamd
• cupsd
• rsync
• Oracle TNS listener
• add your own service definition
Server monitoring – Hobbit clients
• Usually a “Hobbit client” runs on the server.
• Clients are really dumb – they know how to collect some data, but they know nothing about interpreting the data they collect.
• Runs uptime, free, df, ps, netstat, mount, who ...
• Collects server-side statistics
• Scans server log files for new entries
• Collects data for directories and individual files
• The raw data is sent to the Hobbit server for analysis.
Standard server tests
• CPU load average
• System uptime
• System clock
• Memory usage
• Swap usage
• File system usage
• Process counts
• Network ports
• File attributes
• File data
• Directory sizes
• Log file data
Data can be graphed
Setting it up
• All configuration is kept on the Hobbit server
• All configuration files are text based
• Uses regular expressions a lot
• bb-hosts (lists the hosts Hobbit knows about, defines network service tests and the web page layout)
• hobbit-alerts.cfg (rules for sending alerts)
• hobbit-clients.cfg (rules for analyzing data from clients)
• client-local.cfg (instructions for client data collection)
bb-hosts

    #
    # Master configuration file for Hobbit
    #
    group My hosts
    127.0.0.1    localhost  # bbd http://localhost/
    192.168.1.1  demohost   # pop3 http://127.0.0.1/ smtp \
                 cont=Login;https://www/Login.php;Please.*userid
hobbit-alerts.cfg

    HOST=demohost SERVICE=http
        MAIL email@example.com TIME=W:0800:2200

    SERVICE=http
        MAIL firstname.lastname@example.org
        SCRIPT /usr/local/bin/smsalert +4512345678
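To make the `TIME=W:0800:2200` restriction concrete, here is a rough sketch of the decision it implies. It assumes that `W` means weekdays (Monday to Friday), that the two numbers are a start and end time in HHMM, and that the end time is exclusive; `in_weekday_window` is a hypothetical helper, not Hobbit's own code.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) when a day/time falls inside a
# weekday window such as W:0800:2200.
#   $1 = day of week, 1 (Monday) .. 7 (Sunday), as printed by `date +%u`
#   $2 = current time as HHMM, $3 = window start, $4 = window end
# Assumption: the end of the window is exclusive.
in_weekday_window() {
    local dow=$1 now=$2 start=$3 end=$4
    # Only Monday..Friday count as weekdays.
    [ "$dow" -ge 1 ] && [ "$dow" -le 5 ] || return 1
    # The [ builtin compares its operands as decimal integers, so the
    # leading zero in a value like 0800 is harmless here.
    [ "$now" -ge "$start" ] && [ "$now" -lt "$end" ]
}
```

A caller could then write `in_weekday_window "$(date +%u)" "$(date +%H%M)" 0800 2200 && echo "alert allowed"`.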
hobbit-clients.cfg

    HOST=*
        DISK / 80 90 EXHOST=backup.foo.com
    HOST=db.foo.com
        DISK %/data/ IGNORE
    HOST=%web[1-9]
        PROC apache MIN=4 MAX=20
        DIR /var/log/apache SIZE<100000 TRACK yellow
        FILE /var/www/index.html \
            MD5=dd2cf7192db28919203eef126943b
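The rules above are easier to read with the underlying decisions spelled out. The sketch below assumes that `DISK / 80 90` means yellow from 80% and red from 90%, and that a `PROC` count outside `MIN`..`MAX` turns the status red; the exact boundary handling and the helper names `disk_color` and `proc_color` are assumptions, not taken from the Hobbit source.

```shell
#!/bin/bash
# Hypothetical helpers illustrating threshold rules like those in
# hobbit-clients.cfg.

# disk_color PCT WARN PANIC: color for a filesystem that is PCT percent full.
disk_color() {
    local pct=$1 warn=$2 panic=$3
    if [ "$pct" -ge "$panic" ]; then
        echo red
    elif [ "$pct" -ge "$warn" ]; then
        echo yellow
    else
        echo green
    fi
}

# proc_color COUNT MIN MAX: color for the number of matching processes.
proc_color() {
    local count=$1 min=$2 max=$3
    if [ "$count" -lt "$min" ] || [ "$count" -gt "$max" ]; then
        echo red
    else
        echo green
    fi
}
```

With the sample rules, `disk_color 85 80 90` prints `yellow`, and `proc_color 3 4 20` prints `red` because fewer than four apache processes are running.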
client-local.cfg

    # This file tells clients what file/log data to report
    [linux]
    log:/var/log/messages
    [web1]
    dir:/var/log/apache
    file:/var/www/default.html:md5
Why server-side analysis?
• Managing configuration files on each monitored server is impossible when you have 2000 clients.
• Bulk configuration updates are much easier
• Configuration settings can apply to groups of hosts
• Adding new analysis tools only requires upgrading the Hobbit server – not all of the clients (provided they already collect the necessary data, of course)
• Having RAW data available is USEFUL.
• Only downside: Your Hobbit server must spend some CPU time analyzing the raw client data
Custom checks
• Custom checks normally check something, then send a red/yellow/green “status” message
• A check can run locally on a host as part of the client installation
• A check can run centrally and pull data from several hosts (e.g. grabbing data with SNMP)
• A check can run on the Hobbit server, using data that has already been collected (“combo-tests” or extra client-data analysis)
• Numeric data can be tracked in graphs
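Simple client-side check

    #!/bin/bash
    # Report the temperature as a "weather" status column.
    COLUMN=weather; COLOR=green
    DEGREES=$(/usr/local/bin/getweather temperature)
    # Go red at 30 degrees or more.
    if [ "$DEGREES" -ge 30 ]; then COLOR=red; fi
    # Send the status message to the Hobbit server.
    $BB $BBDISP \
      "status $MACHINE.$COLUMN $COLOR $(date) temperature=$DEGREES"
    exit 0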
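A possible variation on the check above separates the threshold decision from the message formatting, which makes both parts easy to try out without a Hobbit server. The yellow threshold of 25 degrees and the helper names `temp_color` and `build_status` are assumptions added for illustration; only the red threshold of 30 comes from the slide.

```shell
#!/bin/bash
# Hypothetical helpers for a client-side check.

# temp_color DEGREES [WARN] [CRIT]: map a whole-degree reading to a color.
# WARN defaults to 25 (an assumed value), CRIT to 30 (from the slide).
temp_color() {
    local degrees=$1 warn=${2:-25} crit=${3:-30}
    if [ "$degrees" -ge "$crit" ]; then
        echo red
    elif [ "$degrees" -ge "$warn" ]; then
        echo yellow
    else
        echo green
    fi
}

# build_status MACHINE COLUMN COLOR TEXT: format a one-line status message
# in the same shape as the script above, including the current date.
build_status() {
    local machine=$1 column=$2 color=$3 text=$4
    printf 'status %s.%s %s %s %s\n' \
        "$machine" "$column" "$color" "$(date)" "$text"
}
```

Inside a client script the two would be combined as `$BB $BBDISP "$(build_status "$MACHINE" weather "$(temp_color "$DEGREES")" "temperature=$DEGREES")"`.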
Server-side checks
• You can hook modules into all kinds of Hobbit data: status messages, data collected from Hobbit clients, and so on.
• E.g. Hobbit clients run “who” to report who is logged on.
• Monitoring for root logins on all servers takes only 62 lines of Perl (see hobbitd_rootlogin.pl in the source)
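The core of such a root-login check is small. The sketch below only shows the analysis step on `who`-style output; it does not reproduce hobbitd_rootlogin.pl, and how the client data reaches the script on the Hobbit server is not shown. `who_color` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read `who`-style output on standard input and print
# "red" if any line lists a session whose user name (first field) is root,
# otherwise "green".
who_color() {
    awk '$1 == "root" { found = 1 }
         END { print (found ? "red" : "green") }'
}
```

On a live system this could be tried with `who | who_color`.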
Windows, SNMP and other stuff
• Windows client: BBWin
  Note: Does not support central configuration
• SNMP add-on: Devmon
• Both are OSS, available on SourceForge.net
• Other add-ons available, e.g. for database monitoring.
• Add-ons for Big Brother (available from deadcat.net) can be used – but check licensing
Hobbit@CSC - Summary
• The Copenhagen data center is the largest CSC data center in EMEA, globally in the top 5.
• Hobbit/BBWin/BB clients on 90% of all servers.
• Hobbit is considered mission-critical.
• Lots of network tests, especially for Web and middleware systems (J2EE and LDAP)
• Web application monitoring done through customer-built “monitoring” web pages
Hobbit@CSC – Multiple views
• Multiple sets of web pages with Hobbit data:
  • One set grouped by account manager, then by account: Lets the account manager quickly see if his customers are running OK
  • One set grouped by sysadmin group, then by account: Lets the system administrators quickly see what servers need attention
  • One set for customers who want access to Hobbit
• The “Critical Systems” view is monitored 24x7
Hobbit@CSC - reports
• Availability reports pre-generated for daily, weekly and monthly availability
• Reports and detailed history available on-line for 3 months
• Monthly reports available for 12 months
• Graphs clean up automatically and provide data for 1½ years (1-day averages)
Hobbit@CSC: Statistics
• 3.800 hosts
• 28.000 statuses
• 9.500.000 updates/day = 111 updates/second
• 3.100 network tests
• 40.000 webpages/day
• 27.000 RRD files = ~160.000 RRD datasets
• 8.500 RRD graphs/day
• 1 Web / 2 net servers: Sun E220R server, 450 MHz UltraSPARC II, 1 GB RAM, 2x72 GB SCSI disk
• 1 RRD server: HP DL380, 3 GHz Xeon, 1 GB RAM, 2x72 GB SCSI disk
Future work
• Load balancing of Hobbit tasks: 4.3.0
  • Graph updates and viewing
  • History log storage
  • Client data analysis
  • Network checks
• High availability? Maybe not ... can be handled externally
• Re-design the web UI – any volunteers?
• Automated web checking of a full user session, perhaps using Mozilla or Konqueror
Questions ? The End