
Lemon

Lemon: Computer Monitoring at CERN. Miroslav Siket, German Cancio, David Front, Maciej Stepniewski. Presented by Harry Renshall, CERN-IT/FIO-FS.


Presentation Transcript


  1. Lemon: Computer Monitoring at CERN
     Miroslav Siket, German Cancio, David Front, Maciej Stepniewski
     Presented by Harry Renshall, CERN-IT/FIO-FS

  2. Outline
     • Lemon
     • Structure
     • Deployment at CERN
     • Use cases
     • Alarms
     • Web visualization
     • Summary
     HEPiX 9-13/05/2005 Karlsruhe

  3. Lemon – LHC Era Monitoring
     • Lemon is a software package containing tools for monitoring the status and performance of computers:
       • Distributed monitoring system scalable to ~10k nodes
       • Provides active monitoring of software and hardware in the Computer Center on centrally managed clusters
       • Facilitates early error detection and problem prevention
       • Provides persistent storage of the monitoring data
       • Executes corrective actions and sends notifications
       • Offers a framework for creating further monitoring sensors
       • Most of the functionality is site independent
     • It is used at CERN by:
       • System administrators, service managers, cluster responsibles
       • Developers and service/data challenges
       • Managers and general users
     • Link: http://cern.ch/lemon

  4. Lemon - schema
     [Architecture diagram. Component labels: Nodes (Sensor, Sensor, Sensor, Monitoring Agent), Monitoring Repository, Repository backend (SQL), Correlation Engines, RRDTool / PHP, apache, Lemon CLI, Web browser, User. Link labels: TCP/UDP, SOAP, SOAP, HTTP.]

  5. Components
     • MSA – Monitoring Sensor Agent
       • Spawns multiple Monitoring Sensors (MS) to measure data at defined intervals and sends the data to the Monitoring Repository
     • MS – Monitoring Sensor
       • Uses a standard C++ or Perl API, so it is easy to write your own sensor
       • Several sensors exist for performance, process, hardware and software monitoring, grid VOs' job reporting, database monitoring, security and alarms (260 metrics in total)
     • MR – Monitoring Repository
       • Stores the full history of the data in an Oracle database, backed up to tape in Castor
       • A flat-file version is available as well (with most functionality preserved)
       • We run two of them on two independent machines with two databases with failover (aiming for High Availability with Oracle Real Application Cluster)
     • LRF – Lemon RRD Framework
       • Caches the data in an easily accessible form (RRD files) for web graphics
       • In connection with the Quattor Configuration DB, provides service and cluster overviews
       • RRD stands for Round Robin Database (time-aging data with predefined binning), developed by Tobias Oetiker at ETH Zurich (http://www.rrdtool.org)
     • LAG – Lemon Alarm Gateway
       • Generic gateway for alarms
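The agent/sensor split described on this slide can be sketched as follows. This is a minimal Python illustration, not Lemon's actual C++/Perl sensor API: the `Sensor` class, the `run_agent` function, the metric names and the intervals are all hypothetical, and the transfer to the Monitoring Repository is replaced by a `send` callback. Time is simulated so that the example runs instantly.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sensor:
    metric: str                # hypothetical metric name
    interval: int              # sampling period in (simulated) seconds
    read: Callable[[], float]  # takes one measurement


def run_agent(sensors: List[Sensor],
              send: Callable[[int, str, float], None],
              duration: int) -> None:
    """Sample every sensor at its own interval for `duration` simulated
    seconds and hand each sample to `send` (a stand-in for the repository)."""
    # Priority queue of (next due time, sensor index); earliest entry first.
    queue: List[Tuple[int, int]] = [(0, i) for i in range(len(sensors))]
    heapq.heapify(queue)
    while queue and queue[0][0] < duration:
        now, i = heapq.heappop(queue)
        sensor = sensors[i]
        send(now, sensor.metric, sensor.read())
        # Re-schedule the same sensor one interval later.
        heapq.heappush(queue, (now + sensor.interval, i))


# Two made-up sensors returning constant values.
samples: List[Tuple[int, str, float]] = []
run_agent(
    [Sensor("load_avg", 30, lambda: 0.5),
     Sensor("free_disk_gb", 60, lambda: 120.0)],
    send=lambda t, metric, value: samples.append((t, metric, value)),
    duration=120,
)
```

A real agent would sleep until the next due time and ship samples over TCP or UDP; the callback keeps the sketch independent of any transport.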

  6. Lemon at CERN
     • Lemon monitors about 2200 computers in ~100 clusters
     • On average it collects about 70 metrics from each host
     • Collecting about 1.5 GB/day
     • Part of the ELFms tools
       • Integrated with the Sure alarm system
       • Integrated with CDB for configuration
       • Leaf (LHC-Era Automated Fabric) for scheduling of interventions
     [Diagram labels: Node Configuration Management, Node Management]
     • Configuration
       • Derived from the Configuration Database (CDB)
       • Individual configuration per cluster/host
       • Hierarchical structure
       • Monitoring state is derived from CDB
       • Leaf tools allow scheduled downtimes, interventions, on-demand changes
     • Alarm system
       • Sure: legacy system receiving alarms from Lemon
       • Integration with the new LASER system (LHC alarm system) is ongoing

  7. Computer Center Overview
     • Entry page displays a status overview of the key services
     • Allows choosing the individual cluster, rack, host or other categories

  8. Use(ful) cases (I)
     [Figure: reboot occurrence history graph]
     • Kernel upgrade
       • Kernel version is "measured" on the boot of the machine
       • Automatic tools for upgrading the kernel on a cluster retrieve information from Lemon and schedule the reboot of a machine based on this info
       • Web interface allows monitoring of the progress
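The kernel-upgrade use case boils down to comparing the kernel version each host reported at boot with the target version. The sketch below is a minimal Python illustration under that assumption; the function names, host names and version strings are hypothetical, and the lookup in Lemon is replaced by a plain dictionary.

```python
from typing import Dict, List


def hosts_needing_reboot(running: Dict[str, str], target: str) -> List[str]:
    """Return the hosts whose measured kernel version differs from the target."""
    return sorted(host for host, version in running.items() if version != target)


def upgrade_progress(running: Dict[str, str], target: str) -> float:
    """Fraction of hosts already running the target kernel."""
    if not running:
        return 1.0
    done = sum(1 for version in running.values() if version == target)
    return done / len(running)


# Hypothetical kernel versions as "measured" at boot on a small cluster.
measured = {
    "node01": "2.4.21-27",
    "node02": "2.4.21-20",
    "node03": "2.4.21-27",
    "node04": "2.4.21-20",
}
to_reboot = hosts_needing_reboot(measured, target="2.4.21-27")
progress = upgrade_progress(measured, target="2.4.21-27")
```

An upgrade tool could schedule reboots for the hosts in `to_reboot`, while a web page could plot `progress` over time.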

  9. Use(ful) case (II)
     • Searching for a host
       • High load, network usage,…
       • Metric distributions allow identification of hosts with problematic performance
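One way to pick problematic hosts out of a metric distribution is a robust outlier cut. The slide does not say which method the Lemon web interface uses, so the following Python sketch is only an assumption: it flags hosts whose value lies more than `k` median absolute deviations above the cluster median. The host names and load values are made up.

```python
from statistics import median
from typing import Dict, List


def high_outliers(values: Dict[str, float], k: float = 3.0) -> List[str]:
    """Return hosts whose metric exceeds median + k * MAD for the cluster."""
    med = median(values.values())
    # Median absolute deviation: a spread estimate that one extreme host
    # cannot inflate, unlike the standard deviation.
    mad = median(abs(v - med) for v in values.values())
    threshold = med + k * mad
    return sorted(host for host, v in values.items() if v > threshold)


# Hypothetical load averages for five hosts in one cluster.
load = {
    "node01": 0.8,
    "node02": 1.1,
    "node03": 0.9,
    "node04": 12.5,
    "node05": 1.0,
}
suspects = high_outliers(load)
```

If all hosts report the same value the MAD is zero and nothing is flagged, which is the intended behaviour for a uniform cluster.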

  10. Integration of Web interface
      • Through various plug-ins, the web interface has been adapted to accommodate additional information/links that help manage the computer center
      • Examples:
        • Configuration database browser (browses external XML config files)
        • ITCM (Remedy) ticket: external error tracking database
        • CC tracker (synoptic view of the computer center) with XML-defined geometry
        • Alarm display
        • Metric information display
        • Raw data grapher (JPgraph)
      • External functionalities are customizable

  11. Computer Center display
      • Lemon Web Interface is interfaced with the Computer Center database of objects
      • Provides search of objects as well as listing
      • Interfaced through the XML-defined geometry of the computer center
      • Generic design

  12. Automatic recovery actions
      • Alarm Sensor
        • For defined values of measured metrics, an actuator is called with a predefined action
        • An example: ssh daemon dead, action /sbin/service sshd start
        • Definition: metric X, field Y != reference value Z => call actuator
          • If success, log only
          • Else call the action up to max times
        • Each occurrence is logged in the Monitoring Repository
        • Already about 70 predefined alarms with automatic recovery actions
        • After the first month of deployment, it reduced the number of problem tickets by half
      • Correlation engine
        • Allows wide definition of alarms and recovery actions (in development)
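The alarm rule on this slide (metric X, field Y != reference value Z => call actuator, retrying up to a maximum) can be sketched as below. This is a minimal Python illustration, not Lemon's alarm sensor: the `Alarm` class, `handle_alarm` function and metric name are hypothetical, the log is a plain list instead of the Monitoring Repository, and the actuator is a stand-in for running `/sbin/service sshd start`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alarm:
    metric: str                   # hypothetical metric name
    field: str                    # field of the metric to compare
    reference: object             # expected value of that field
    actuator: Callable[[], None]  # corrective action
    max_tries: int = 3            # upper bound on actuator calls


def handle_alarm(alarm: Alarm,
                 measure: Callable[[], Dict[str, object]],
                 log: List[str]) -> bool:
    """Call the actuator while the field differs from the reference value,
    at most max_tries times. Returns True if the metric ends up healthy."""
    tries = 0
    while measure()[alarm.field] != alarm.reference:
        if tries >= alarm.max_tries:
            log.append(f"{alarm.metric}: giving up after {tries} tries")
            return False
        tries += 1
        log.append(f"{alarm.metric}: calling actuator (try {tries})")
        alarm.actuator()
    if tries:
        log.append(f"{alarm.metric}: recovered after {tries} tries")
    return True


# Simulated ssh daemon that only comes back on the second start attempt.
state = {"running": 0, "attempts": 0}


def start_sshd() -> None:
    # Stand-in for executing `/sbin/service sshd start` on the node.
    state["attempts"] += 1
    if state["attempts"] >= 2:
        state["running"] = 1


log: List[str] = []
sshd_alarm = Alarm("sshd_daemon", "running", 1, start_sshd)
recovered = handle_alarm(sshd_alarm, lambda: {"running": state["running"]}, log)
```

Re-measuring after every actuator call is a design choice of this sketch; it keeps the success check tied to the metric rather than to the actuator's exit status.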

  13. Remedy Ticket tracking
      [Figure: ITCM (Remedy) tickets occurrence]
      • Error trending metric with values on the number of interventions/occurrences of problems
      • Several categories created by:
        • Hardware
        • Software
        • Clustered by contract type/cluster
      • Reports problems, whether scheduled or not and whether the system was rebooted
      • Allows tracking of interventions per type of problem
      • Web interface to show the trend

  14. Database (Oracle) Monitoring
      • In cooperation with the ADC group at CERN, we have developed a sensor for measuring performance entities in an Oracle database:
        • Number of logons, cursors, logical and physical I/O, user commits, index usage, parse statistics, …
      • Allows identification of bottlenecks and gives an overview of the stability of the system
      • Works on both the 9i and 10g versions of Oracle
      • Integration into services/RAC
      • Configuration of the service is integrated with Oracle Enterprise Repository

  15. Service challenges, GRID VOs
      • Lemon allows
        • Virtual clusters
          • Clusters defined on request by service managers
          • Or defined by scripts, updated dynamically on demand
          • Or defined for a specific purpose
          • An example: Atlas DC04 challenge, Network challenges,…
        • Clusters defined dynamically
          • An example: hosts running GRID jobs on the batch cluster belonging to the given Virtual Organization
      • Provides hooks in Lemon for defining any dynamic grouping of hosts
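A dynamically defined cluster can be thought of as a membership rule evaluated over the current host attributes. The slide does not describe how Lemon's hooks are written, so the Python sketch below is only an assumption of the idea: `virtual_cluster`, the attribute names and the host names are all hypothetical.

```python
from typing import Any, Callable, Dict, List

HostAttrs = Dict[str, Any]


def virtual_cluster(hosts: Dict[str, HostAttrs],
                    belongs: Callable[[HostAttrs], bool]) -> List[str]:
    """Return the hosts whose current attributes satisfy the membership rule."""
    return sorted(name for name, attrs in hosts.items() if belongs(attrs))


# Hypothetical snapshot: physical cluster and VOs with GRID jobs on each host.
hosts = {
    "batch001": {"cluster": "batch", "grid_vos": {"atlas", "cms"}},
    "batch002": {"cluster": "batch", "grid_vos": set()},
    "batch003": {"cluster": "batch", "grid_vos": {"atlas"}},
    "disk001": {"cluster": "disk", "grid_vos": {"atlas"}},
}

# Virtual cluster: batch hosts currently running jobs for the "atlas" VO.
atlas_batch = virtual_cluster(
    hosts,
    lambda a: a["cluster"] == "batch" and "atlas" in a["grid_vos"],
)
```

Re-evaluating the rule against a fresh snapshot gives the "updated dynamically on demand" behaviour without storing the membership anywhere.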

  16. Summary
      • Lemon provides monitoring information about the computers in the Computer Center at CERN
      • Thanks to its integration with Sure (alarm system), it allows fast and easy identification and repair of problems. We will convert to a new accelerator alarm system (LASER) this year. Lemon provides LAG (Lemon Alarm Gateway) to feed alarms into arbitrary alarm systems.
      • In connection with CDB, it allows an easier overview of services and visualisation of their performance
      • In connection with Remedy (ITCM, problem tracking), it allows an overview of the problems for a given service
      • It has been a useful tool for general monitoring of performance and also for system administrators in debugging problems
      • Lemon is also used and developed elsewhere: BARC institute in India, Accelerator department at CERN, CMS is adopting it for its online farm monitoring,…
      • Lemon is used for GridIce and can provide data to MonAlisa
