LAS for System Administrators

LAS overview. Miroslav Siket, Dennis Waldron, CERN-IT/FIO-FD. http://cern.ch/lemon

Presentation Transcript


  1. LAS for System Administrators: LAS overview. Miroslav Siket, Dennis Waldron (CERN-IT/FIO-FD), http://cern.ch/lemon

  2. LAS building blocks
     • Oracle DB server
       • Running LAS logic and storing LAS data (PL/SQL)
     • OraMon – application server
       • Inserting exceptions into the Oracle DB
     • Web server
       • Providing access to LAS data from the Oracle DB to the LAS GUI (business logic)
     • Remote monitoring – ping, http
     • SURE gateways for UIMON/AFS
     Lemon Tutorial

  3. LAS hardware
     • Two independent instances
       • Primary
         • Oracle DB and OraMon – lemondb1
         • Web server – lemonweb02
       • Secondary
         • Oracle DB and OraMon – lemondb2
         • Web server – lemonweb01
     • Remote monitoring machines
       • lxfsrk4104 (aliased as lemonmr & lemonr01)
       • lxservb01 (alias lemonr02)

  4. Oracle DB server check
     • Log in to the machine (lemondb1, lemondb2):
       > source ~oracle/.oraprofile.LEMON*
       > tnsping LEMON_A (LEMON_C for lemondb2)
     • Check the output of the previous command. Example: OK (0 ms)
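The output check on this slide could be scripted. The following is a minimal Python sketch, assuming the last non-empty line of the tnsping output looks like the slide's example, "OK (0 ms)". The function name `tnsping_ok` is hypothetical, and real tnsping output may be worded differently (for example "msec"), so treat the pattern as an assumption.

```python
import re

# Assumed status line format, e.g. "OK (0 ms)" or "OK (10 msec)".
_OK_RE = re.compile(r"^OK \((\d+) ms(?:ec)?\)$")

def tnsping_ok(output):
    """Return the reported round-trip time in ms, or None if the ping failed."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if not lines:
        return None
    match = _OK_RE.match(lines[-1])
    return int(match.group(1)) if match else None
```

The text passed to `tnsping_ok` would be whatever `tnsping LEMON_A` printed after sourcing the Oracle profile; a return value of None means the listener did not answer with an OK line.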

  5. OraMon check
     • Already checked by LAS GUI
     • lemon-host-check
     • ORAMON_WRONG procedure
     • Log file: /var/log/OraMon.log

  6. Apache web server check
     • Already checked by LAS GUI
     • lemon-host-check
     • HTTPD_WRONG procedure
     • Log file: /var/log/httpd/error_log

  7. Remote monitoring check
     • Runs as a sensor (remote) on remote monitoring machines
     • lemon-host-check
     • Agent log file: /var/log/edg-fmon-agent.log

  8. SURE gateways for UIMON/SURE
     • Runs as a sensor (suregateway) on remote monitoring machines
       • Agent process and log file
     • ISSUE: AFS machines
       • Uses the lemon-sure-multiplexer process as a gateway
       • lxfsrk4104 only
       • Check existence of the daemon, log file: /var/log/lemon-sure-multiplexer.log

  9. lemon-cli
     • Command line tool for extracting raw (un-interpreted) data from lemon.
     • Information can be extracted from the local cache (/var/spool/edg-fmon-agent) or from a remote server over SOAP (aliased as lemonmr, physical machine: lxfsrk4104).
     • Limitations
       • The local cache is limited to seven days' worth of history (purged every day by the agent).
       • Remote server queries are limited to 20,000 returned results. This limitation will be removed when the new lemon API is deployed (end Q4, begin Q1 2007).
     • The local cache contains much more information than is recorded at the server.
       • Why? Smoothing!
       • Smoothing is a mechanism which allows the agent to be selective about the information it sends to the central servers.
     • If the information you want is less than 7 days old, use the local cache!
     • Full documentation at: http://cern.ch/lemon/doc/components/lemon-cli.shtml
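To illustrate why the local cache can hold more than the server, here is a small Python sketch of one possible smoothing rule. The slides do not spell out the agent's actual rule, so this is an assumption: a sample is forwarded only when its value differs from the last forwarded one, or when `max_gap` seconds have passed since the last forwarded sample. The names `smooth` and `max_gap` are hypothetical.

```python
def smooth(samples, max_gap):
    """Return the subset of (timestamp, value) samples that would be forwarded.

    Assumed rule, not taken from the Lemon sources: forward a sample if the
    value changed, or if max_gap seconds passed since the last forwarded one.
    `samples` must be sorted by timestamp.
    """
    sent = []
    for ts, value in samples:
        if not sent:
            sent.append((ts, value))  # always forward the first sample
            continue
        last_ts, last_value = sent[-1]
        if value != last_value or ts - last_ts >= max_gap:
            sent.append((ts, value))
    return sent
```

Under this rule every sample still lands in the local cache, while only the returned subset reaches the central servers, which matches the slide's advice to prefer the local cache for recent data.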

  10. lemon-cli (II) - Examples
     • Resolving a metric id to a name
       • lemon-cli -m syslog
       • Displays all the metrics whose name contains 'syslog'
     • Referencing time periods (--end, --start), e.g.
       • 1h = 1 hour
       • 2d3h36m44s = 2 days, 3 hours, 36 minutes and 44 seconds
       • Also supports log file timestamps, e.g. Thu 02 Nov 2006 10:45:00 (no guarantees!)
     • If querying remotely, -n accepts the same node name expansion criteria as wassh, e.g.
       • lemon-cli -m 10005 -n lxb[0001-1000] --server
     • All alarms can be seen on the machine using
       • lemon-cli --class "alarm.exception"
       • 1 005, 1 135 and 1 000 are alarms
     • lemon-host-check interprets all the codes for you!
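The period notation and the node range notation on this slide can be made concrete with a short Python sketch. It is illustrative only: `parse_period` and `expand_nodes` are hypothetical helpers, not part of lemon-cli, and `expand_nodes` handles just a single `[low-high]` range, whereas wassh may accept richer expressions.

```python
import re

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

def parse_period(text):
    """Convert a period such as '2d3h36m44s' into a number of seconds."""
    parts = re.findall(r"(\d+)([dhms])", text)
    # Reject input that contains anything besides number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognised period: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)

def expand_nodes(pattern):
    """Expand 'lxb[0001-0003]' into node names, keeping the zero padding."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if not match:
        return [pattern]  # plain host name, nothing to expand
    prefix, low, high, suffix = match.groups()
    width = len(low)
    return [f"{prefix}{i:0{width}d}{suffix}"
            for i in range(int(low), int(high) + 1)]
```

For instance, `parse_period("2d3h36m44s")` gives 185804 seconds, and `expand_nodes("lxb[0001-1000]")` gives the thousand names lxb0001 through lxb1000.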

  11. lemon-host-check (I)
     • Aim: to provide a command line tool for viewing the status of all active alarms on a given machine by querying the edg-fmon-agent.
     • Uses the information recorded in the agent's local cache (requires /var/ to be writeable!).
     • Makes sure that the information reported to you is up to date (fresh!).
     • Checks that all sensors are running, and that one and only one agent process is running.
     • Must be logged in as root!
     • Full documentation at: http://cern.ch/lemon/doc/components/lemon-host-check.shtml

  12. lemon-host-check (II) - Examples
     • Check for active alarms on the machine
       • lemon-host-check
     • Disable the alarms "syslogd and klogd"
       • lemon-host-check --disable "30023,30032"
     • Show alarms even if they are disabled
       • lemon-host-check --force
     • Disable all alarms for the next 1 hour, 30 minutes and 23 seconds
       • lemon-host-check --disable-all --duration 1h30m23s "demo intervention"
     • View a list of all disabled alarms
       • lemon-host-check --list
     • Enable all alarms
       • lemon-host-check --enable-all
     • Some alarms are "hard" disabled! Making them visible again requires a CDB reconfiguration and an ncm-ncd --co fmonagent run.

  13. lemon-host-check (III)
     • Pre-alarms
       • Recent concept added to lemon.
       • Aims at dealing with transient alarms.
     • Real use case:
       • high_load (30008) has pre-alarm capabilities! When high load is detected on the machine, a pre-alarm is raised (not visible on LAS). If the alarm exists for more than 10 minutes, it becomes a proper alarm. This allows high load spikes on machines/clusters such as lxplus to be ignored.
       • Not visible by default in lemon-host-check.
     • Caution:
       • If you have a high_load alarm and restart the agent, the alarm will disappear! If the root problem hasn't been corrected, the alarm will resurface 10 minutes later (a new ITCM ticket).
       • Don't restart the agent unless you absolutely need to (reconfiguration, errors in the edg-fmon-agent.log, …).
       • If you have to restart, use 'lemon-host-check --show-all' afterwards. Note: make sure to check the status of the alarm! You need to ignore the disabled ones, if any.
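The pre-alarm behaviour, including the restart caveat, can be sketched as a tiny state machine in Python. This is not the agent's code: the class name `PreAlarm` and its interface are hypothetical, and the only figure taken from the slide is the 10 minute promotion delay quoted for high_load.

```python
PROMOTE_AFTER = 10 * 60  # seconds; the slide quotes 10 minutes for high_load

class PreAlarm:
    """Tracks how long a condition has been continuously active."""

    def __init__(self):
        # Kept in memory only in this sketch, so a restart forgets it.
        self.first_seen = None

    def update(self, now, condition_active):
        """Return 'ok', 'pre-alarm' or 'alarm' for the sample taken at `now`."""
        if not condition_active:
            self.first_seen = None  # a transient spike is forgotten
            return "ok"
        if self.first_seen is None:
            self.first_seen = now
        if now - self.first_seen > PROMOTE_AFTER:
            return "alarm"
        return "pre-alarm"
```

Creating a fresh `PreAlarm` object models an agent restart: the condition starts again as a pre-alarm and needs another 10 minutes before it is promoted, which is why the alarm disappears and then resurfaces.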

  14. lemon-host-check (IV)
     • Common errors:
     No monitoring agent process running / Too many monitoring agent processes running
       • service edg-fmon-agent restart
       • If that fails, contact project-elfms-lemon@cern.ch
     Possible false exception
       • lemon-host-check has given up (after 60 seconds) trying to get information from the agent on the machine. If it failed to find out whether an alarm was present for a particular exception, it assumes the worst case scenario: that an alarm does exist but may not be real (possibly false).
       • Why?
         • The agent may be too busy to answer lemon-host-check.
         • Some sensors may have failed to retrieve the necessary information.
       • Solution
         • Re-run lemon-host-check.
         • If it still fails, check /var/log/edg-fmon-agent.log for any errors about sensors or missing metrics. If they exist, run spma_wrapper.sh on the machine to get the latest sensor code, if any, then ncm-ncd --co fmonagent to reconfigure the agent.
         • Try again.
         • If it is still failing, contact the service manager and CC project-elfms-lemon@cern.ch
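The worst-case rule behind "possible false exception" can be written down in a few lines of Python. This is a sketch of the decision described on the slide, not lemon-host-check's implementation; the function name `classify` and the use of None for "no answer within the timeout" are assumptions made for illustration.

```python
def classify(agent_answer):
    """Map the agent's answer for one exception to a reported status.

    agent_answer: True if the agent reports the alarm, False if it reports
    no alarm, None if no answer arrived before the timeout.
    """
    if agent_answer is None:
        # Unknown state: assume the alarm exists, but flag it as uncertain.
        return "possible false exception"
    return "alarm" if agent_answer else "ok"
```

The point of the design is that a missing answer is never reported as "ok", so a busy agent or a failed sensor cannot hide a real alarm.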

  15. FAQ
     Are monitored machines running only Linux (e.g. SLC3/4, RHEL 3/4)?
       • Linux (lemon agent, ping, http check)
       • Solaris (lemon agent, UIMON)
       • Windows (ping, http)
     Is there any limitation that we should be aware of on the other OSs / platforms?
       • AFS machines have their own monitoring tools – no information available
       • UIMON monitored machines – running the UIMON process and multiplexer to send alarms to the suregateway sensor on remote monitoring machines
     We knew nodes' polling on SURE; what is implemented in Lemon?
       • Remote sensor on remote monitoring machines
     Is there any load balancing (DNS) and/or redundancy? Front-/backend part of the failover?
       • No, just two independent instances running in parallel.
       • In future (with RAC) there will be failover for OraMon and only one Oracle DB

  16. FAQ (II)
     What should we do in the case of a piquet call about a failure on these server(s)?
       • Operators' LAS procedures do not have any piquet actions defined. All other failures are standard OS/hw procedures that they already have. There is nothing LAS specific for them.
     How to interpret the correlation rules? Could you explain the syntax found in the Remedy ticket?
       • Full documentation with examples at http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml
       • Example: (lxs5013:9104:1[/tmp] eq /tmp) && (lxs5013:9104:5[90] > 80)
     LAS reduction rules and multi-host tickets: a direct mapping?
       • Several use cases:
         • e.g. 12 x spma_wrong on 12 nodes of cluster YYY
         • One LAS item if the number of machines reaches 51% of the active nodes in the cluster
         • Several LAS items if they appear in bursts and the alarm has already been reduced
         • Individual machine LAS items if below 51%
         • If new machines appear, there will be a new reduced LAS item for each set of them
     A means to detect when a node started to be "alarmed" and when this stopped.
       • /var/log/ncm/component-setodesiredstate.log* log file on the machine in question
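The 51% reduction rule can be expressed as a short Python sketch. It covers only the basic threshold case from the slide (one reduced item at or above 51% of active nodes, individual items below it) and ignores the burst handling. The function `las_items` and its return shape are hypothetical and do not reflect how LAS stores items.

```python
def las_items(alarm, alarmed_nodes, active_node_count, threshold=0.51):
    """Return the LAS items for one alarm type in one cluster.

    Each item is (alarm, nodes): a single item listing every node when the
    share of alarmed nodes reaches the threshold, otherwise one per node.
    """
    if active_node_count <= 0:
        raise ValueError("cluster has no active nodes")
    nodes = sorted(alarmed_nodes)
    if len(nodes) / active_node_count >= threshold:
        return [(alarm, tuple(nodes))]          # reduced: one item
    return [(alarm, (node,)) for node in nodes]  # individual items
```

With 12 nodes raising spma_wrong in a cluster of 20 active nodes (60%), this yields one reduced item; with the same 12 nodes in a cluster of 100 (12%), it yields 12 individual items.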

  17. FAQ (III)
     What should we expect from them if no alarm can be displayed any more at 3:00 AM and they have been called by the operator?
       • No piquet service is defined for LAS. If LAS does not work, operators have procedures for finding out the state of the LAS – check http://lemon.web.cern.ch/lemon/cern/las_procedures.shtml
     QUESTIONS?