Update on Service Availability Monitoring

Presentation Transcript


  1. Update on Service Availability Monitoring Marian Babik, Paloma Fuente, et al. (CERN) Emir Imamagic (SRCE) Paschalis Korosoglou (AUTH)

  2. Overview • Recent changes and releases • SAM Update-20 • SAM Update-22 • Update-22 details and impact • Operations and support

  3. Update-20 Changes • Released: December 2012 • Last SAM release based on gLite • New features: • Operational Tools Monitoring - http://ops-monitor.cern.ch • Operational Tools Availability in MyEGI • Monthly reports in central MyEGI • Nagios configuration improvements • More information: http://cern.ch/go/bH6K

  4. Update-22 Changes • Planned: April 2013 • Integration of EMI probes • based on EMI/UMD • The following EMI probes were integrated: • CREAM, WMS, BDII, ARC, LFC, FTS, SRM, ARGUS, GLEXEC, WN, UNICORE • Complete repackaging of SAM • Improved yaim configuration

  5. Recent Activities • Required close collaboration with EMI and EGI JRA1 • Large-scale testing activity (with EMI) • https://twiki.cern.ch/twiki/bin/view/EMI/NagiosServerEMITestbed0022012 • SAM/Nagios probes WG (with EGI JRA1) • Meetings with EMI PTs • Evaluation of EMI probes (business logic) • Reported to EGI OMB

  6. Next Release in Detail (1) • Update-22 will not be backward compatible in packaging • Installation from base SL5 is expected (no upgrade path, no SL6 support) • Probe packages imported to SAM • Middleware from UMD • Considerable simplification of repository setup (just SAM, UMD and EPEL)

  7. Next Release in Detail (2) • Simplified yaim configuration: • new SAM_NAGIOS nodetype • SAM/Nagios configuration • Run-time optimizations • EMI NAGIOS nodetype provided by EMI • lightweight EMI-UI • environment setup for the probes • yaim -n NAGIOS -n SAM_NAGIOS

  8. Next Release in Detail (3) • Changes to metric names are needed: • org.sam.CREAMCE-JobSubmit -> emi.cream.CREAMCE-JobSubmit • A metric translation mechanism was implemented to handle the transition period • NGIs will send both new and old metrics at the same time • Status and Availability history will be kept in both local and central databases
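The renaming described on slide 8 can be pictured as a lookup table applied to incoming results. The following is only a minimal sketch, not the SAM implementation: the function names `translate_metric` and `merge_results`, the tuple layout of a result, and the merge policy are all assumptions made for illustration. Only the single mapping `org.sam.CREAMCE-JobSubmit -> emi.cream.CREAMCE-JobSubmit` comes from the slides.

```python
# Hypothetical translation table. Only this one entry appears in the slides;
# a real table would cover every renamed metric.
OLD_TO_NEW = {
    "org.sam.CREAMCE-JobSubmit": "emi.cream.CREAMCE-JobSubmit",
}


def translate_metric(name, table=OLD_TO_NEW):
    """Return the new name for an old metric; unknown names pass through unchanged."""
    return table.get(name, name)


def merge_results(results, table=OLD_TO_NEW):
    """Group results under the new metric name so history stays continuous.

    `results` is an iterable of (metric_name, timestamp, status) tuples
    (an assumed layout). When the old and new names both report for the
    same timestamp, the later entry in the input wins.
    """
    merged = {}
    for name, timestamp, status in results:
        key = translate_metric(name, table)
        merged.setdefault(key, {})[timestamp] = status
    # Return each metric's history ordered by timestamp.
    return {key: sorted(history.items()) for key, history in merged.items()}


if __name__ == "__main__":
    sample = [
        ("org.sam.CREAMCE-JobSubmit", 1, "OK"),        # old name, sent by an NGI
        ("emi.cream.CREAMCE-JobSubmit", 1, "OK"),      # same result under the new name
        ("emi.cream.CREAMCE-JobSubmit", 2, "CRITICAL"),
    ]
    print(merge_results(sample))
```

Keying the history on the new name means a site that sends both names during the transition period still produces a single status series per metric.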

  9. Impact on SAM • Probes are now part of the middleware (and developed by many different PTs) • Continuous coordination from JRA1 is crucial after the end of EMI • SAM release schedule now depends on PTs • Probes still shipped with SAM • But testing expected from PTs and middleware providers to ensure probes work with underlying middleware

  10. Impact on EGI SR • EGI Staged Rollout (SR) assumes an already tested, production-ready release • SAM can no longer guarantee this since it: • Lacks control over probes and probe-to-middleware interfaces • Is no longer competent to test if probes work correctly with underlying middleware • Is unable to ensure probes will work against production infrastructure • More complex testing needed

  11. Possible Options • SAM testing releases • Via dedicated testing repository • Process similar to EGI SR (lightweight) would be needed to evaluate a testing release • Once approved – SAM would release to SR • UMD adopts the probes and does the initial testing to ensure • Probes work with released middleware • Spots major issues early in the process and can block the release

  12. Operations and Support • SAM central services (since Sept. 2012) • 206 operational tickets • upgrades, generating reports, interventions, profile changes • 62 re-computations • GGUS (since Sept. 2012): • 117 GGUS tickets in 3rd level • 36 GGUS tickets in 2nd level

  13. Summary • SAM central services stable • Substantial improvements in adoption of EMI probes, operational tools monitoring and Nagios configuration features • Continuous support and bug-fixing • Near-term plans (MS710)

  14. Backup

  15. Near-term plans • Update-22 will conclude development work planned for EGI-InSPIRE • but SAM will continue to evolve • Until end of EGI-InSPIRE • Continuous support and bug-fixing • Maintenance and operations of the SAM central services • SAM central Oracle databases • SAM central services (MyEGI and API) • EGI monthly reports • Operational Monitoring and Availability

  16. Web API statistics - March • ~ 2.5M hits/month • ~ 60k hits/day • Top hosts querying the Web API: • mon-it.cnaf.infn.it (167k hits) • rocnagios.grid.sinica.edu.tw (110k hits) • rocmon-fzk.gridka.de (85k hits) • ngi-de-nagios.gridka.de (85k hits) • Failures (0.2%)
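To put the figures on slide 16 in proportion, the short calculation below derives each top host's share of the monthly traffic and the approximate number of failed requests. It is a rough sketch: the slide's numbers are rounded ("~"), so the results are estimates, and the variable and function names are illustrative only.

```python
# Figures as quoted on the slide (rounded, so all results are approximate).
MONTHLY_HITS = 2_500_000
FAILURE_RATE = 0.002  # 0.2%

top_hosts = {
    "mon-it.cnaf.infn.it": 167_000,
    "rocnagios.grid.sinica.edu.tw": 110_000,
    "rocmon-fzk.gridka.de": 85_000,
    "ngi-de-nagios.gridka.de": 85_000,
}


def share(hits, total=MONTHLY_HITS):
    """Fraction of the monthly total contributed by `hits`."""
    return hits / total


top_total = sum(top_hosts.values())
failed_requests = round(MONTHLY_HITS * FAILURE_RATE)

if __name__ == "__main__":
    for host, hits in top_hosts.items():
        print(f"{host}: {share(hits):.1%} of monthly hits")
    print(f"Top four hosts combined: {share(top_total):.1%}")
    print(f"Estimated failed requests: {failed_requests}")
```

Under these figures the four listed hosts account for a bit under a fifth of all requests, and a 0.2% failure rate corresponds to roughly five thousand failed requests in the month.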

  17. SAM Scope • SAM grid monitoring (SAM-Gridmon) • Central services (Web, API, availability) • SAM-Nagios • Monitoring platform supporting multiple configurations: • NGI-Nagios • VO-Nagios • Operations Tools-Nagios (ops-monitor)

  18. SAM Overview SAM regional instances • 40 regional instances • Hosting over 230 metrics • Monitoring over 4000 services

  19. Validation and deployment • SAM operates a nightly validation platform • Runs basic validation tests for each component • 12 VMs running all known configurations • SAM-Gridmon • SAM-Nagios • NGI Nagioses (NGI_IT, CERN, NGI_UK) • VO Nagios • Operated continuously • Installed/upgraded every 2 days to the latest SAM-Update (SVN)

  20. Validation and deployment • Upgrade of the preproduction line • CERN ROC • SAM central service (grid-monitoring-preprod) – became part of EGI testbed • Upgrade of the production line • SAM central service (grid-monitoring) • EGI SR • Upgrade of the production services • Tested by EAs • EGI SR report
