Lemon Monitoring Update

Miroslav Siket, German Cancio, Murthy Chandregiri, Rohitashva Sharma, Dennis Waldron (CERN-IT/FIO-FS). HEPIX Workshop, SLAC, Oct 10-14, 2005. http://cern.ch/lemon. Outline: where the last HEPIX left off on monitoring; usage overview of Lemon; security update on Lemon


Presentation Transcript


  1. Lemon Monitoring Update
     Miroslav Siket, German Cancio, Murthy Chandregiri, Rohitashva Sharma, Dennis Waldron (CERN-IT/FIO-FS)
     HEPIX Workshop, SLAC, Oct 10-14, 2005
     http://cern.ch/lemon

  2. Outline
     • Where the last HEPIX left off on monitoring
     • Usage overview of Lemon
     • Security update on Lemon
     • Alarm systems and integration with Lemon
     • New Lemon developments
     • Installation and configuration
     • Conclusion

  3. Last HEPIX follow-up
     No common monitoring solution in the HEP community:
     • Solutions are chosen according to the preferences of service managers at each site (Ganglia, Nagios, ranger, NGOP, Lemon, MonAlisa, …)
     • Infrastructures are not compatible, and gateways between them are difficult to build
     • Most do not include an alarm system or problem tracking
     • Most are not scalable; some do not provide service or other types of monitoring, are missing features, have no GUIs, …
     • No generally known solution: some sites build their own, some choose from the pool of existing ones
     • Multiple development lines, not much information available
     Will Lemon fill the gap?

  4. Lemon usage overview (Sep 2005)
     Usage:
     • LCG sites (180 sites with 1,100 nodes, multiple clusters per site, in production)
     • AB department at CERN (5 clusters with 100+ nodes, production and test)
     • Aachen: plans to use it on their cluster (about 20 nodes)
     • S3group (US company): 2 clusters (28 nodes)
     • Evaluating: several other institutes (IN2P3, INFN, …)
     • No feedback from the HEPIX community, or no plans towards Lemon (BNL)
     Platforms: i386, x86_64, ia64, Solaris
     Lemon parts used:
     • Agent, sensors (default set) + own sensors (GridIce)
     • Server (“flat-file” based); Oracle based server under evaluation by IN2P3
     • Web based status pages

  5. Overview cont…
     • Configuration management:
       • Yaim
       • Quattor
       • Home-made tools (scripts); some small sites configure manually
     • Problems:
       • A detailed manual on writing sensors, with examples, is needed
       • Restricted functionality of the file based server (enhancements under way)
     • Wishes:
       • Configuration GUIs, management
       • Enhanced web interface
       • Additional backend solutions (SQLite, MySQL)
       • Security (LCG requirement)
     • Comments:
       • Usually satisfied with the support/response of the Lemon team

  6. Security in monitoring
     Almost no monitoring system comes with authentication and confidentiality of monitoring data (some attempt it by obscuring their protocols or hiding implementation information).
     Requirements:
     • Authentication of source (i.e. prevent malicious parties from faking data and unwanted sources from flooding servers, limit the impact of compromised machines, …)
     • Encryption of data (some data could be confidential: failure rates, system status, security statistics, grid user statistics, …)
     • Access restrictions: who can read the data
     The new version of Lemon comes with SSL (RSA, DSA or X509) based authentication and the possibility of encrypting data between the sensors (sensor agent) and the server:
     • Authentication: signed data is transferred over the network and checked on the server against the public key of the machine
     • Encryption: data is encrypted using the public key of the server(s)
     • Access: XML based secure access

  7. Lemon security
     • A secure mechanism to copy the nodes' public keys to the server machine is needed
     • A secure network environment is still needed to avoid DoS attacks
     • Three default modes of operation to be supported:
       • No encryption, no authentication
       • Authentication
       • Authentication and encryption
     Diagram labels:
     • Node1: [rsa_encrypt(s.pub_key)] rsa_sign(n1.sec_key)
     • Node2: [rsa_encrypt(s.pub_key)] rsa_sign(n2.sec_key)
     • Node3: [rsa_encrypt(s.pub_key)] rsa_sign(n3.sec_key)
     • Server1: rsa_verify(metric,n3.pub_key) [rsa_decrypt(s.sec_key)]
     • Server2: rsa_verify(metric,n1.pub_key) [rsa_decrypt(s.sec_key)]
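
The authentication mode described on the security slides (node signs with its secret key, server verifies against the node's public key) can be illustrated with a toy sketch. This is not Lemon's implementation: it uses textbook RSA without padding and two small, well-known Mersenne primes so that it runs with the Python standard library only. The names `make_keypair`, `digest`, `accept` and the sample format are assumptions made for illustration; `rsa_sign` and `rsa_verify` only borrow the labels from the diagram.

```python
import hashlib

# Toy textbook-RSA parameters (two Mersenne primes). Illustration only:
# a real deployment needs a vetted crypto library, large keys and padding.
P = 2**61 - 1
Q = 2**89 - 1
E = 65537


def make_keypair(p=P, q=Q, e=E):
    """Return (public_key, secret_key) as the pairs (n, e) and (n, d)."""
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))  # modular inverse, Python 3.8+
    return (n, e), (n, d)


def digest(sample: bytes, n: int) -> int:
    """Hash a metric sample and reduce it into the range of the modulus."""
    return int.from_bytes(hashlib.sha256(sample).digest(), "big") % n


def rsa_sign(sample: bytes, sec_key) -> int:
    """Node side: sign the sample with the node's secret key."""
    n, d = sec_key
    return pow(digest(sample, n), d, n)


def rsa_verify(sample: bytes, signature: int, pub_key) -> bool:
    """Server side: check the signature against the node's public key."""
    n, e = pub_key
    return pow(signature, e, n) == digest(sample, n)


def accept(node: str, sample: bytes, signature: int, known_keys: dict) -> bool:
    """Accept a sample only from a node whose public key the server holds."""
    pub_key = known_keys.get(node)
    return pub_key is not None and rsa_verify(sample, signature, pub_key)
```

In this sketch the server rejects both tampered samples and samples from nodes whose public key was never copied to it, which is why the slide asks for a secure mechanism to distribute the keys.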

  8. Alarm systems and Lemon
     Requirements on the alarm system:
     • Scalability: 10k+ nodes, 250 alarms, possibly at high frequency (100+/min)
     • Horizontal alarm reduction: e.g. a no-contact alarm hides other alarms on the same node
     • Vertical reduction: e.g. 70% of nodes in cluster x show the same alarm -> alarm on x
     • History and tracking, priority, sorting
     • Associated help, possible actions (opening a ticket, …)
     • GUIs for users and for operators, system managers with ACL
     • Data mining possibility
     SURE alarm system:
     • Legacy system at the CERN Computing Centre (12 years)
     • Scalability problems, requires new GUIs (tcl/tk based)
     • Missing features
     • Special sensor that gathers information and sends it to the SURE system
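
The two reduction requirements listed for the alarm system can be sketched in a few lines. This is only an illustration of the idea, not code from SURE, LASER or Lemon: the alarm name `no_contact`, the `(node, alarm)` tuple format and both function names are assumptions.

```python
from collections import defaultdict

NO_CONTACT = "no_contact"  # hypothetical name for the no-contact alarm


def reduce_horizontal(alarms):
    """Hide every other alarm on a node that also raises a no-contact alarm.

    alarms: list of (node, alarm_name) tuples.
    """
    alarms = list(alarms)
    unreachable = {node for node, alarm in alarms if alarm == NO_CONTACT}
    return [
        (node, alarm)
        for node, alarm in alarms
        if node not in unreachable or alarm == NO_CONTACT
    ]


def reduce_vertical(alarms, clusters, threshold_percent=70):
    """Promote an alarm to cluster level when enough nodes show it.

    clusters: dict mapping cluster name -> set of node names.
    Returns (remaining node alarms, sorted list of (cluster, alarm_name)).
    """
    node_cluster = {n: c for c, nodes in clusters.items() for n in nodes}
    affected = defaultdict(set)
    for node, alarm in alarms:
        affected[(node_cluster[node], alarm)].add(node)
    # Integer arithmetic avoids floating-point surprises at exactly 70%.
    promoted = {
        key
        for key, nodes in affected.items()
        if len(nodes) * 100 >= threshold_percent * len(clusters[key[0]])
    }
    node_alarms = [
        (n, a) for n, a in alarms if (node_cluster[n], a) not in promoted
    ]
    return node_alarms, sorted(promoted)
```

Applying the horizontal step first keeps alarms from unreachable nodes out of the per-cluster counts; whether a production system should order the steps this way is a design choice the slides do not settle.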

  9. Alarm systems and Lemon
     LASER (The LHC Alarm SERvice Project):
     • Provides nice interfaces and most of the features
     • Under evaluation by our team
     • Currently there is a problem with the potential number of defined alarms (1M+)
     • The data is inserted into LASER from Lemon by LAG (Lemon Alarm Gateway)

  10. Lemon Alarm System
      • Alarm information is also available through the status web pages
      • We are considering building a web based alarm system on the existing Lemon infrastructure
      • The current implementation includes an overview of alarms with the possibility to choose global/cluster views, …

  11. Enhancements
      Modular configuration (xinit.d style):
      • /etc/lemon/agent/default.conf
      • /etc/lemon/agent/transport/*.conf (location of servers)
      • /etc/lemon/agent/sensors/*.conf (sensor definitions)
      • /etc/lemon/agent/metrics/*.conf (metric setup)
      DB monitoring enhanced (Oracle):
      • Added user, tablespaces, session and wait class monitoring
      XML data retrieval:
      • Data retrieval in XML from the application server over https
      • Authentication and controlled access
      • Working on C/C++, Java, Perl, Python and PHP APIs
      Exceptions:
      • Adding on behalf and multiple metrics correlation engine
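
To make the modular layout concrete, the sketch below walks the directory structure named on the enhancements slide and groups the configuration fragments by section. It only inspects the file layout; how the Lemon agent itself reads or orders these files is not stated in the slides, and the function name `list_config_fragments` and the sorted ordering are assumptions.

```python
from pathlib import Path

# Sub-directories named on the slide, each holding *.conf fragments.
SECTIONS = ("transport", "sensors", "metrics")


def list_config_fragments(root="/etc/lemon/agent"):
    """Group the configuration fragments found under root by section.

    Returns a dict with the keys "default", "transport", "sensors" and
    "metrics", each mapping to a sorted list of Path objects.
    """
    root = Path(root)
    fragments = {"default": []}
    default_conf = root / "default.conf"
    if default_conf.is_file():
        fragments["default"].append(default_conf)
    for section in SECTIONS:
        directory = root / section
        # A missing section directory simply yields an empty list.
        found = sorted(directory.glob("*.conf")) if directory.is_dir() else []
        fragments[section] = found
    return fragments
```

Calling `list_config_fragments()` on a host shows at a glance which servers, sensors and metrics are configured by separate files, which is the main benefit of splitting one large configuration file into fragments.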

  12. Installation and setup
      Simplified Lemon installation consists of three steps: server installation, client installation and web interface installation.
      1. Server installation:
      • Install the edg-fabricMonitoring-server rpm (“flat file” server)
      • Configure the receiving port in /etc/edg-fmon-server.conf
      • Start the server daemon
      2. Client installation:
      • Install the edg-fabricMonitoring-agent rpm (comes with a default metric configuration)
      • Configure the server and its port in /etc/edg-fmon-agent.conf
      • Start the client daemon on all monitored hosts

  13. Installation and setup (II)
      3. Web interface installation:
      • Install and start the apache server (with php) on your server
      • Install the rrdtool and lrf (lemon rrd framework) rpms
      • Configure your clusters in the clusters.conf file and start the lemonmrd daemon
      • Possible additional components:
        • Computer center synoptic view through an xml file
        • Problem tracking system integration (through a php plug-in to your DB/application)
        • Quattor CDB configuration view, through CDB xml profiles
        • Oracle based Repository (for very large installations with high scalability and increased functionality)
        • Other, new components are easy to add
      • View detailed instructions at: http://cern.ch/lemon/doc/installation/installation.html

  14. Conclusions
      • The Lemon team is working to satisfy requirements and provide a concise monitoring system
      • HEPIX community feedback would be welcome
      • An alarm system is emerging
      • Check for updates at http://cern.ch/lemon
