ASE133: Monitoring The Heterogeneous Environment


Presentation Transcript


  1. ASE133: Monitoring The Heterogeneous Environment Edward Barlow www.edbarlow.com Barlowedward@hotmail.com August 15-19, 2004

  2. THE PROBLEM • Many companies are no longer single-vendor • Dramatic increase in purchased products / packages • Products use different system software and databases • Business nightly batches • I have heard Sybase described as legacy – these systems should at least be stable • Application Monitoring • No generic mechanism – it must be hand crafted • Current Employer Stats • Oracle x 3 • Sybase x 20 (one 250 GB db) • SQL Server x 20 (one 700 GB db) • Large numbers of other Unix, NT, and Linux boxes • Numerous purchased applications with specific monitoring requirements • Extensive DR environment

  3. THE PROBLEM • My job is to keep the systems up • (but I like to do other stuff…) • DBAs are expensive – we need tools to be effective • Tools are often pricey and useless • Proactive Systems Management • A problem not encountered is a problem that need not be solved • Problems not prevented due to lack of information are unacceptable • I must report upwards for *every* outage – I hate saying the disk ran out of space, or other stuff that should never happen • Fundamentally, I don't like getting up at 3am for computer problems

  4. MY SOLUTION - AUTOMATION • Anything you do more than once should be automated • Perl allows easy automation of checks • A 100-line program – easy to write – simple • Ideal administration language – all DBAs should be conversant • Similar to both C and shell • When something breaks, I ensure it doesn't happen again – usually with about a half hour of effort • Fundamentally, if you keep the systems from running out of space (especially NT), and react immediately any time anything gives you an alarm (log files), you dramatically reduce the chances of any issue impacting production • Fundamentally, you should be the first to know a system component has gone down – especially the non-obvious components

  5. TALK OUTLINE • Thoughts On The Problem • My Solution History • What I Am Monitoring • The Solution • The Details

  6. PART #1 THOUGHTS ON THE PROBLEM • Distributed data is the problem – log files and system state live on many, many systems • It would be great to replicate this system state data to one spot so I can view/manage it easily • Stuff running on each box should be minimized • Use a DBMS only for data that needs to be redistributed so multiple people can use it • Distributed Viewing of Data • Real Time Alarms • Morning Review Reports • Paging • Email

  7. CATEGORIES OF DATA • Heartbeats • Events • Batch Jobs • Your own Monitoring Agents • Other (performance)

  8. HEARTBEATS • Traffic-light type data (red/yellow/green) • You don't care about history • Frequently updated • May have thresholds • My decision – the monitoring programs are responsible for them • Should be able to turn them off on a system-wide basis • You care about missed heartbeats

  9. EVENTS • Care about history • For 2-3 days only • MlpAlarmCleanupDB.pl • Major source is log files • My approach to log files is to frequently copy them to the alarm server and parse them there, ignoring common errors. The parser keeps the file position from the last parse so it restarts at that point – see the sketch below.
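A minimal sketch of that restartable log-parse idea (not the actual parser shipped with the toolkit – the log file name, offset file, and "common error" patterns here are illustrative assumptions):

    # Re-parse a log from where the previous run stopped.
    use strict;
    use warnings;

    my $logfile    = 'server.errorlog';           # hypothetical log
    my $offsetfile = "$logfile.offset";           # remembers our position

    # read the saved position from the last run (0 on the first run)
    my $pos = 0;
    if (open my $sfh, '<', $offsetfile) { $pos = 0 + <$sfh>; close $sfh; }

    open my $lfh, '<', $logfile or die "cannot open $logfile: $!";
    $pos = 0 if $pos > -s $logfile;               # file shrank: rotated, start over
    seek $lfh, $pos, 0;

    while (my $line = <$lfh>) {
        next if $line =~ /informational|harmless/i;   # ignore common errors
        print "EVENT: $line";                     # the real tool would call MlpEvent()
    }

    # save where we stopped, for the next run
    open my $ofh, '>', $offsetfile or die "cannot write $offsetfile: $!";
    print {$ofh} tell($lfh);
    close $ofh;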

  10. BATCH JOBS • Just like a heartbeat • You only care about the status of the last run • Not frequently updated • Frequency can be as low as once a week (or less) • Sets of batch jobs may be related • E.g. start of day • These differ on the 'subsystem' field – see later.

  11. AGENTS • Agents are the utilities that collect your data for you • I have probably 30 or so agents in my freeware collection • If you rely on monitoring for your administration, then a gap in monitoring is itself an issue. You must be aware when agents are not working, for whatever reason. • For agents that save heartbeats – use the last heartbeat time • For agents that save events – you may not have any db interaction, so you must generate an agent heartbeat • Agents must *never* hurt the system

  12. PERFORMANCE I keep this category open. Performance data could easily be collected via the same mechanisms. I have not had a need to implement any performance collectors – there seem to be multiple great performance apps out there that do a good job. Personally I am using a historical server graphing solution I wrote that seems quite nice. But that's a separate talk. Albeit a short talk… I have not packaged it up though.

  13. WHAT WE CAN DO GENERICALLY • Ftp files from Unix • My Net::myFTP module copies locally or remotely as needed • Files on NT are easy • Rcp seems deprecated, and many like ssh, but I simply don't like to run stuff on other boxes… • Use syslogd if possible • Server connections – DBI is the best solution, especially as it provides some MSSQL compatibility and total language independence • Yes… I finally ported all my stuff off dblib…
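A minimal DBI connection sketch (the server name and credentials are placeholders; DBD::Sybase speaks to both ASE and SQL Server, which is what makes DBI attractive here):

    use strict;
    use warnings;
    use DBI;

    # connect via DBD::Sybase - one API for ASE and MSSQL
    my $dbh = DBI->connect('dbi:Sybase:server=MYSERVER', 'monitor_user', 'secret',
                           { RaiseError => 1, PrintError => 0 });

    # a trivial sanity query that works on both vendors
    my $row = $dbh->selectrow_arrayref('select @@servername, getdate()');
    print "connected to $row->[0] at $row->[1]\n";
    $dbh->disconnect;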

  14. PART #2: MY SOLUTION HISTORY • Pass #1 (1992) Many individual monitors that generate emails • Pass #2 (1997) Webmonitor • An attempt to do this via dynamic web pages (hard to set up and configure – I consider this project to have failed). Some really cool stuff, but overall too complex • Pass #3 (1998) Server Documenter Utility • Builds static web pages every morning for morning review • Some of these pages refresh intra day • Complements Pass #1 • HTML is a good UI for this stuff • Relies on ftping EVERYTHING from remote servers locally… Data stored in flat files, including copies of all logs, space history, audit information, RUN scripts, interfaces… etc. • Pass #4 (2004) Integrated alarming application • Provides a simple dynamic UI to all events • Data stored in a database • Integrates everything

  15. PROS AND CONS • Individual monitors • Even with an integrated email & paging module, they don't always handle what you want – e.g. how do you handle error logs, or the fact that a disk is at 92%? • Good for notification of emergencies – i.e. heartbeat management • Server Documenter works great • I have been very happy with it • Routine management of 50+ servers in under an hour a day • But… it's not real time • Does an awesome job of cross-environment reporting and of keeping copies of system information in case you need to rebuild • This is the next pass • Real time integrated alarming…

  16. WHAT IS INVOLVED IN THE NEW APPROACH • SIMPLE COMPONENTS AS FOLLOWS • Perl Library (MlpAlarm.pm) • Heartbeat(), Event() functions • Reporting API – does all the work / selects etc. • Can be called from simple shell functions too (see the sketch below)… or you could write your own Java interface in 20 minutes… • Web Page • Uses the reporting API – all db connectivity is in the library file – no SQL calls in the page itself • Utilities • Daemon process for alarming • Batch job for event cleanup • Database setup scripts • A Database • ~100 MB and I have never had trouble
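A hedged sketch of driving the library from a shell script via a Perl one-liner – it assumes MlpAlarm.pm is on PERL5LIB and exports MlpHeartbeat(); adjust the import to match your install, and note the program/system names are placeholders:

    # in a shell script, after the nightly dump finishes:
    perl -MMlpAlarm -e 'MlpHeartbeat(
        -monitor_program => "nightly_backup",
        -system          => "HOSTA",
        -batchjob        => 1,
        -state           => "COMPLETED" );'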

  17. NOTES – RESTARTABLE • Program Flow:

    use lib 'D:/ADMIN_SCRIPTS/lib';
    use CommonFunc;
    exit() if Am_I_Alive();    # another copy is already running
    I_Am_Alive();              # stamp the lock file
    while (1) {
        # … do something …
        I_Am_Alive();
        sleep($n);
    }

• Run frequently in cron – cron restarts the program if it is down • Nothing more complicated than a lock file, and testing its file time, is required • Requires clocks to be mostly in sync (they normally are), or running everything from the same machine
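CommonFunc's internals are not shown in the talk, so the following is only a guess at the lock-file mechanism it describes; the lock path and the 10-minute staleness window are assumptions:

    use strict;
    use warnings;
    use File::Basename qw(basename);

    my $lockfile = '/tmp/' . basename($0) . '.lock';

    # true if another copy touched the lock file recently
    sub Am_I_Alive {
        return 0 unless -e $lockfile;
        my $age = time() - (stat $lockfile)[9];   # seconds since last touch
        return $age < 600;                        # stale after 10 minutes
    }

    # touch the lock file to say "I am still running"
    sub I_Am_Alive {
        open my $fh, '>', $lockfile or die "cannot write $lockfile: $!";
        print {$fh} $$;                           # record our pid, for humans
        close $fh;
    }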

  18. SO HOW DO YOU DO THIS • The conclusion I came to is that the most generic solution is an API – everything can be built on top of it. • I couldn't justify any interface other than dynamic CGI, for simplicity – the good news is that it's a single Perl script which should install with no sweat. • The database lives on some junk server – which is thereby now defined as production… but who cares about performance – it will be OK.

  19. NOTES – DATABASE DRIVEN • All configuration information stored in database • Behavior controlled by this information • All controlled via the single interface

  20. NOTES – NEVER FAIL • Your database will be down • Someone will run with the wrong version of perl… • ALARMING/MONITORING MUST NEVER FAIL PRODUCTION • But… I haven't implemented a failover solution – if the monitoring server is down, you are effectively out of luck.

  21. PART #3: WHAT I'M MONITORING • All the attached programs are included in the download • Some of them may make no sense in your environment, but most of them will • You can always just copy the code and adapt it to your needs • Write your own application-specific monitors – it's easy…

  22. MONITORING – UNIX • Error logs – syslogd – monitor.pl • Wonderful – errors already centralized • Unix, Cisco, Mail, etc. • Has its own configurable severity levels • I effectively tail the centralized syslogd files and use the built-in severity to store event data in the system for all kinds of stuff • Disk Space • Different disk space command on each system type • I have diskmon generate syslog messages – handled as above • portmonitor.pl • Pings systems/ports – tells me if any services are down (a minimal sketch follows) • Round robin
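portmonitor.pl itself is not reproduced here; this is a minimal TCP port check in the same spirit (the host and port list are placeholders):

    use strict;
    use warnings;
    use IO::Socket::INET;

    # services to probe - names and ports are examples only
    my %targets = ( 'dbhost1:5000' => 'Sybase ASE',
                    'dbhost1:5001' => 'Backup Server' );

    for my $hostport (sort keys %targets) {
        my $sock = IO::Socket::INET->new( PeerAddr => $hostport, Timeout => 5 );
        if ($sock) {
            print "OK   $targets{$hostport} ($hostport)\n";
            close $sock;
        } else {
            # the real monitor would raise an EMERGENCY heartbeat here
            print "DOWN $targets{$hostport} ($hostport)\n";
        }
    }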

  23. MONITORING – NT • NT is too distributed and the tools are too graphical • Servers don't seem to be as well monitored • Errorlog – pc_eventlog.pl • Read from Perl • Disk space – pc_diskspace.pl • Critical * critical * critical • 60% of my database troubles are from disk space issues • Autoextend trades one type of problem for another • NT MSDB Jobs – pc_scheduled_job_rpt.pl • NT log shipping cross-check – pc_backupreport.pl

  24. MONITORING – MORE UNIX • listen_cmfmsgs.pl – only my environment – listens for application-specific TIB error messages • feedmonitor.pl – watches that feed files arrive in a timely manner • cisco_logmon.pl – parses/reads Cisco log files collected with syslogd • dnsmonitor.pl – checks that all our DNS servers are working • reuters_monitor – checks our Reuters feeds • linemonitor.pl – checks our external connectivity

  25. MONITORING – DATABASE • Space – space_monitor.pl • Error logs (check_dbcc_output.pl, get_syb_error_logs.pl, get_bkup_srvr_logs.pl) • Blocks – check_for_blocks.pl (a minimal sketch follows) • Circular deadlocks (application specific) • This also looks for log suspend etc. • Number of users – check_sybase_num_users.pl • Check runnable users • Internal small things • Run via server_documenter.pl
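check_for_blocks.pl is not reproduced here; this is a minimal block-detection query in the same spirit (server and credentials are placeholders):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Sybase:server=MYSERVER', 'monitor_user', 'secret',
                           { RaiseError => 1 });

    # a non-zero "blocked" column in sysprocesses means that spid is waiting
    my $blocks = $dbh->selectall_arrayref(
        'select spid, blocked, cmd from master..sysprocesses where blocked != 0');

    for my $row (@$blocks) {
        my ($spid, $blocker, $cmd) = @$row;
        # the real tool would call MlpEvent() instead of printing
        print "spid $spid ($cmd) is blocked by spid $blocker\n";
    }
    $dbh->disconnect;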

  26. MONITORING - REPLICATION • check_sybase_repserver.pl • admin who_is_down • admin disk_space • admin health • Also checked via port_monitor.pl

  27. MONITORING – NOTES • Up/down is a side effect of other monitoring • All programs, if they can't connect, inform the db of that fact • The monitor server is lightly loaded – as is the db server • Double-check backups and log shipping • Backup scripts also use the alarming functions in the event of a problem • Read logs and save that stuff

  28. MONITORING - NOTES • Integrated with my Sybase backup scripts • Alarms on all my nightly stuff • Integrated reading of NT logs for SQL Server • Some checks, like blocks, run frequently – every 3-10 minutes • Space & log space is a heavy task – every hour or so • Error logs – every hour

  29. MONITORING - ORACLE • Oracle Logs • Oracle Backups • Oracle

  30. PART #4: THE SOLUTION

  31. BROWSER BASED INTERFACE • Configurable, Consistent browser based data viewer • Seems to do quite well • Must be placed in an executable directory on a web server

  32. SCREENS: ALL REPORTS

  33. ADMINISTRATION: ALARM ROUTING

  34. ADMINISTRATION : CONTAINER DEFINITION

  35. ADMINISTRATION : DEFINE PRODUCTION

  36. ADMINISTRATION : DELETE DATA

  37. ADMINISTRATION : REPORT SETUP

  38. ADMINISTRATION : CONTAINER DEFINITION

  39. ADMINISTRATION : CONTAINER DEFINITION

  40. ADMINISTRATION : IGNORE LIST

  41. PART #5: MlpAlarm.pm NAME MlpAlarm - Alarming And Monitoring Library DESCRIPTION This perl module provides a generic mechanism to manage alarms and monitoring. The tool is distributed as a perl module with several associated programs (a web based GUI, an alarm router, and some monitoring programs). The library contains several simple functions to monitor your systems and some back end functions used by the reporting user interface. Alarm data is stored in a database (Sybase or SQL Server).

  42. Library Notes • As with all Perl modules, self documenting via perldoc MlpAlarm.pm • Uses my DBIFunc.pm library – it's an OK wrapper • Must have the "use lib" line included to find it • All functions use a named-argument interface • -query=>'abc' • All functions check their own syntax and complain in English if required args are not found or if an illegal arg is passed • Use -debug=>1 if you have problems (good general practice)

  43. TABLE LAYOUT: HEARTBEAT

    create table Heartbeat (
        monitor_program   varchar(20)  not null,
        monitor_time      datetime     not null,
        last_ok_time      datetime     null,
        system            varchar(30)  not null,
        subsystem         varchar(50)  null,
        state             varchar(20)  null,
        message_text      varchar(256) null,
        document_url      varchar(256) null,
        reviewed_by       varchar(256) null,
        reviewed_time     datetime     null,
        reviewed_until    datetime     null,
        reviewed_severity varchar(30)  null,
        batchjob          varchar(20)  null,
        internal_state    char(1)      default 'I' null
    )

  44. TABLE LAYOUT: EVENT

    create table Event (
        monitor_program varchar(20)  not null,
        monitor_time    datetime     not null,
        system          varchar(30)  not null,
        subsystem       varchar(50)  null,
        event_time      datetime     not null,
        severity        varchar(20)  null,
        event_id        int          null,
        message_text    varchar(256) null,
        message_value   int          null,
        document_url    varchar(256) null,
        reviewed_by     varchar(256) null,
        reviewed_time   datetime     null,
        internal_state  char(1)      default 'I' not null
    )

  45. THE KEY – Program, System, Subsystem • What should the key be? • My best key is monitor_program, system, subsystem • System is often a host, but not always – it could be a business system • Subsystem (optional) completes the key

  46. CONTAINERS • Separate data visualization from internal grouping • Users should be able to look at rollups of systems (the normal case) and should also be able to get to other necessary data views • Containers are sets of systems • Auto-create based on program, or specify individually • You can then create containers that are geared to each user

  47. STATE/SEVERITY VALUES p1 The state and severity levels are key to the system. Message routing is based on a combination of the key (monitor_program/system/subsystem) and this value. The following are legitimate heartbeat states: * EMERGENCY = system down / critical failure * CRITICAL = serious problem * ALERT = non fatal error needing attention * ERROR = non fatal error possibly requiring administrator attention * WARNING = non fatal warning * INFORMATION = a simple message. Synonym for OK. * DEBUG = messages only of interest to developers

  48. STATE/SEVERITY VALUES p2 Additionally, batch jobs may also be * STARTED = the job has started * COMPLETED = the job has completed normally If a batch job aborts/fails, the status should not be COMPLETED (it should be one of the other states from above). Note that you can have the batch job submit many RUNNING heartbeats, with different message texts, to identify the exact position in the batch.

  49. MlpHeartbeat() Heartbeat message. A heartbeat message may require attention, but the system does not keep heartbeat history. Ping is an example of a heartbeat message. Required Arguments: -state, -monitor_program, -system -state => [STARTED, COMPLETED, RUNNING, EMERGENCY, CRITICAL, ALERT, ERROR, WARNING, INFORMATION, DEBUG] Optional Arguments: -debug -subsystem -message_text -batchjob -batchjob => 1 if a batch job – this makes a missing heartbeat every 10 minutes not an error -event_time => valid sybase datetime format – default=now
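A usage sketch based on the argument list above, wrapping a batch job in STARTED/RUNNING/COMPLETED heartbeats – it assumes MlpAlarm.pm exports MlpHeartbeat() and the database is configured; the program/system/subsystem names are placeholders:

    use lib 'D:/ADMIN_SCRIPTS/lib';
    use MlpAlarm;

    MlpHeartbeat( -monitor_program => 'nightly_batch',
                  -system          => 'TRADING',
                  -subsystem       => 'start_of_day',
                  -batchjob        => 1,
                  -state           => 'STARTED' );

    # … the batch work, optionally punctuated by RUNNING heartbeats …
    MlpHeartbeat( -monitor_program => 'nightly_batch',
                  -system          => 'TRADING',
                  -subsystem       => 'start_of_day',
                  -batchjob        => 1,
                  -state           => 'RUNNING',
                  -message_text    => 'positions loaded' );

    MlpHeartbeat( -monitor_program => 'nightly_batch',
                  -system          => 'TRADING',
                  -subsystem       => 'start_of_day',
                  -batchjob        => 1,
                  -state           => 'COMPLETED' );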

  50. MlpEvent() MlpEvent saves events, which are monitoring messages where history may be of interest. Application error logs are examples of events. Required arguments: -message_text=>[text string] -severity=>[EMERGENCY, CRITICAL, ALERT, ERROR, WARNING, INFORMATION, DEBUG] -system=>[system being monitored] -monitor_program=>[program name. Default is basename($0)]
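A usage sketch based on the argument list above – the system name and message text are placeholders:

    use lib 'D:/ADMIN_SCRIPTS/lib';
    use MlpAlarm;

    # record one parsed error-log line as an event
    MlpEvent( -monitor_program => 'get_syb_error_logs',
              -system          => 'SYBPROD1',
              -severity        => 'ERROR',
              -message_text    => 'Error 1105: cannot allocate space' );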
