1 / 25

Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

Lemon monitoring and Lemon Alarm System (sensors, exception, alarm). Ivan Fedorko 22/11/2010. Overview. Lemon overview Lemon agent and sensors How to write new sensor Exception sensor LAS. Lemon components. Measurement Repository. Lemon-web. RRD tool / Python. Repository Backend.

alia
Download Presentation

Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lemon monitoringandLemon Alarm System(sensors, exception, alarm) Ivan Fedorko 22/11/2010

  2. Overview • Lemon overview • Lemon agent and sensors • How to write new sensor • Exception sensor • LAS

  3. Lemon components Measurement Repository Lemon-web RRD tool / Python Repository Backend Application Server Apache/ PHP Oracle Database Node Monitoring User Interfaces TCP/UDP HTTP Monitoring Agent Local Cache SQL Lemon CLI (command line tool to access data) Web Browser Lemon-host-check (command line tool node exceptions) Sensor Sensor Sensor

  4. Lemon agent and sensors • Sensor: • A process or script which is connected to the lemon-agent via a bi-directional pipe and collects information on behalf of the agent. Sensors implement: • Metric Classes: • The equivalent to a class in OOP (Object Orientated Programming) • Metric Instance: • Is an instance (an object) of a metric class which has its own configuration data. • Metric ID: • A unique identifier associated with a particular metric instance of a particular metric class. Monitoring Agent Sensor Sensor Metric Class Metric Class Metric Class Metric Class Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance

  5. Lemon agent and sensors • MSA (Monitoring Sensor Agent) • forks sensors and communicate with them using custom protocol over a bi-directional “pipes” • configures metric instances of metric classes of a sensor and pulls for metrics • to configure: ncm-ncd --configure fmonagent • configuration:/etc/lemon/agent/ • log:/var/log/lemon-agent.log • checks on status of sensors • caches data locally ( e.g. /var/spool/lemon-agent/ ) Monitoring Agent Sensor Sensor Metric Class Supported Lemon Sensors http://lemon.web.cern.ch/lemon/doc/sensors.shtml Metric Class Metric Class Metric Class Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance

  6. Lemon agent and sensors /etc/lemon/agent/sensors/linux_CDB.conf Monitoring Agent Sensor Sensor Metric Class Metric Class Metric Class Metric Class Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance Class Instance

  7. Lemon agent and sensors Example of linux sensor Lemon API • What we can store: • StoreSample01: • agent: <metric_id> <timestamp> • user: <value> • StoreSample02: • agent: <metric_id> • user: <timestamp> <node> <value> • StoreSample03: • agent: • user: <node> <metric_id> <timestamp> <value> Reporting on behalf

  8. Create new sensor • Check instruction and example code • http://lemon.web.cern.ch/lemon/doc/howto/sensor_tutorial.shtml • Prepare your code and test • Test can be done localy • Prepare templates • For sensor  will be transformed to DB table definition • For metrics  create table/metric, metadata check on server • Ask for metric ID at lemon.support • Commit templates with ID to CDB and inform lemon.support • If new exception is introduced, should be alarmed?

  9. Sensor template example Somewhere in your node template Sensor template For configuration For db backend

  10. Exception sensor • Objective: • To run corrective action when the occupancy of the /tmp partition is greater then 80%. • Involved Metrics • With ID 9104 (system.partitionInfo) • Field 1 = mountname, field 5 = percentage occupancy • Correlation • Correlation((9104:1 eq '/tmp') && (9104:5 > 80)) • Actuator /usr/local/sbin/clean-tmp-partition -o 75

  11. Exception sensor • an officially supported Lemon sensor coded in C++ • developed in collaboration between CERN and BARC • implements the Lemon alarm protocol • has a correlation engine which allows it to evaluate 1 or more metrics to determine if a problem exists on a machine • supports reporting on behalf of other monitored entities • allows corrective actions (actuators) up to n-times or within a given time window • is the primary interface to inserting alarms into the Lemon framework. • Provides one and only one metric class “alarm.exception” • Full documentation at: • http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml

  12. Exception sensor Alarm evolution

  13. Exception metric name of the exception to be used on the web and later for operator's GUI "/system/monitoring/exception/_30010" = nlist( "name", "tmp_full", "descr", "tmp utilization exceeds limit", "active", true, "latestonly", false, "importance", 2, "correlation", "((9104:1 eq '/tmp') && (9104:5 > 80))", "actuator", nlist("execve", "/usr/local/sbin/clean-tmp-partition -o 75", "maxruns", 3, "timeout", 300, "window", 900, "active", true) ); short description to be also passed to GUI if false, ncm component will not include exception to config if false, value are stored in lemon DB archive • what is the alarm's importance? Not used now! • 0 - informative -> to be handled at convenience • 1 - low - 9/5 support - to be handled within working hours, e-mail outside working hours • 2 - high - 24/24 support - requires immediate action - PK or expert call

  14. Correlation • Basic format of a correlation is: [entity_name]:<metric_id>:<field_position>         <operator>         <reference_value> ... • Where, • entity_name • An optional parameter, used for reporting on behalf of other entities • The name of the entity (wildcards ‘*’ are supported) • metric_id • The id of the metric to check • field_position • The field to use within the metric. • Allows the correlation to extract a single value from a multi-valued metric • Operater • E.g. ==, !=, >, <, eq, ne, regex, !regex … • reference_value • A string or number used to compare the metric_id:field_position against

  15. Correlation • Use basic object for correlation <metric_id>:<field_position> 2. Combine exceptions 10004:1 > 600 && (10004:7 > 10 || (10004:8 > 150000 && 4109:3 eq 'i386') || (10004:8 > 600000 && 4109:3 regex '64') || 10007:2 > 50 || 10007:3 > 10 || 10007:4 > 0) 4. Join the metrics (9200:1 == 9208:1) • You can collect information on behalf, you can define exception on behalf [entity_name]]:<metric_id>:<field_position> e.g. (*:9501:5 != 200) && (*:9501:5 != 301) 5. "correlation", "4109:2 ne 'symlink('/system/kernel/version')'"

  16. Actuator • Information: • Run as forked processes. • Are connected to the sensor via a pipe. • All information written to stdout or stderr by the actuator is caught and recorded in the agents log file. • All actuator attempts are logged centrally and recorded locally in the agents log file. • Running shell style actuators: • The system call used to run actuator doesn’t provide shell style conveniences. • To use shell style syntax like *, &&, | etc you must define you actuator like this: • Actuator /bin/sh –c \\” /bin/echo ‘This is a demo message from $HOSTNAME’ \\”

  17. Actuator "/system/monitoring/exception/_30010" = nlist( "name", "tmp_full", "descr", "tmp utilization exceeds limit", "active", true, "latestonly", false, "importance", 2, "correlation", "((9104:1 eq '/tmp') && (9104:5 > 80))", "actuator", nlist("execve", "/usr/local/sbin/clean-tmp-partition -o 75", "maxruns", 3, "timeout", 300, "window", 900, "active", true) ); • The maximum number of times an actuator can run consecutively before a final alarm is generating • The maximum number of seconds that an actuator is allowed to run before being terminated by the sensor. • Time window to execute all maxruns of actuator • Actual value of correlation objects are accessible for actuator Actuator /bin/sh -c \\"/bin/echo '$act_value_01 $act_value_02' \\" Actuator /bin/sh -c \\"/bin/echo 'Died lemonmrd daemon $act_value_01 ' | /bin/mail -s 'Lemon RRD Daemon problem' 0041xyz@mail2sms.cern.ch\\"

  18. Exception config "minoccurs", 5 specifies how many time the exception should occur before rising the exception  “silent", yes An exception which is considered silent effectively sets the exception state to the value 2 for all transitions. The exception is disabled, no actuators will run and no alarms will be displayed on the LAS console “local", yes Technically this value has no affect within the sensor but is an instruction to the lemon-agent to not transmit data for this exception to the remote application servers. As remote transmission does not occur the outcome of the exception can never appear on LAS (Lemon Alarm System) and is only visible locally on the machine using lemon-host-check.

  19. LAS CDB SMS Event Management System LAS Business Logic PL/SQL Lemon-web LAS GUI Exception Metrics Lemon Oracle DB Operator Administrator

  20. LAS • Raw alarm • represents an processed exception on one of the monitored entities (reported by Lemon). Not all exceptions become an alarm(only one with code 000, 005, 135). And only from host not in maintenance. • L-alarm • can represent one or more alarms. It is an item visible on the operators screen in the LAS GUI. • every L-alarm must be acknowledged with created ticket in ITCM • states: active, inactive, acknowledge, inhibited • alarms may be grouped to L-alarm (by entities, exceptions, cluster rules…) • What is no_contact alarm? • Exception not evaluated on host! • Alarm means that data are not arriving from host to DB. • One of heartbeat metrics (6335, 6336, 9500, 10005) reported every (usually) 5 min • Procedure in db is checking if last entry not older than (usually) 10 min

  21. LAS

  22. LAS How to avoid alarm • disable (e.g. by lemon-host-check) • make exception local • make exception silent • set host to maintenance  actuator will run /etc/lemon/exceptions/state.conf

  23. Backup From now on backup

  24. Sensor parameters

  25. Smoothing

More Related