
Service Availability Monitoring

Service Availability Monitoring. Marian Babik, Wojciech Lapka, Paloma Fuente, Jacobo Tarragon, Robert Veznaver (CERN), Emir Imamagic (SRCE), Paschalis Korosoglou (AUTH), Anastasios Andronidis (AUTH). Agenda: Motivation, Usage, Capabilities, Architecture, Interfaces, Day to day.

Presentation Transcript


  1. Service Availability Monitoring Marian Babik, Wojciech Lapka, Paloma Fuente, Jacobo Tarragon, Robert Veznaver (CERN) Emir Imamagic (SRCE) Paschalis Korosoglou (AUTH) Anastasios Andronidis (AUTH)

  2. Agenda • Motivation • Usage • Capabilities • Architecture • Interfaces • Day to day

  3. Why SAM? • Understand and improve quality of services delivered by tiers and sites • Provide feedback to management and funding agencies on whether sites and tiers comply with the previously agreed SLA

  4. SAM today SAM - distributed monitoring framework for computing availability and reliability of sites and services. 262 metrics 4200 services monitored 10 VOs 40 SAM instances 700 000 metric results/day

  5. Use of SAM • WLCG • Experiments (ATLAS, CMS, LHCb, ALICE) • Management • EGI • VOs (Biomed, , , Gisela) • Management • Operations (COD, ROD) • Site managers

  6. Use of SAM 729 546 metric results/day 550 metric results/s

  7. Use of SAM 729 546 metric results/day 550 metric results/s

  8. SAM capabilities • Open source based • Nagios, ActiveMQ, Django • Framework for executing Nagios probes and aggregating metric results • existing probes for almost every grid middleware (EMI, gLite, Unicore, ARC, Desktop Grids, QCR) • Notification • Reporting • Web interface and Web API • Support for third-party monitoring systems • OSG

  9. How SAM works?

  10. MyWLCG – Web Interface

  11. MyWLCG - Web API • Exposing Web API to a number of clients: • Experiments dashboards • EGI dashboards • SLS • 3rd parties • Supporting XML, JSON • On average 2.0M hits/month

  12. SAM Day to Day • Coordination • Scope and effort management (roadmap, PoW), change management, communication plan • Development • Actively maintaining/improving 10 components • ~200k lines of code, +200 packages • 612 development tickets closed (last 9 months) • Validation and staged rollout • Validation infrastructure deployed • Running all SAM services on 10 nodes • Continuous validation of latest development release • EGI and WLCG staged rollout

  13. SAM Day to Day • Support • Direct support to WLCG VOs and EGI/OSG • SNOW: Grid Infrastructure Monitoring SE (2 FEs) • GGUS: 3rd level SAM/Nagios SU • 191 tickets closed (last 9 months) • Operations • Production and PreProduction infrastructures • Responsible for the operation of: • 2 SAM-Gridmon: central monitoring services • 8 VO SAM-Nagios: monitoring WLCG VO services • 1 OPS SAM-Nagios: monitoring the monitoring services! • 396 tickets closed (last 9 months)

  14. SAM Operations

  15. Summary • SAM is tracking availability and reliability of sites in order to understand and improve their QoS • SAM is an open-source based platform • SAM is used daily by WLCG and EGI to monitor sites, services and compute their availability • SAM will continue supporting WLCG and EGI in their day to day operations

  16. Contacts and References • Technical • tom-developers@cern • Support • SNOW (Grid infrastructure monitoring SE – SAM/Nagios FE) • Links • SAM documentation (http://cern.ch/go/Qq8w) • SAM internal docs (http://cern.ch/go/lnH7) • SAM central service • http://grid-monitoring.cern.ch/mywlcg/ • SAM CHEP 2012 papers: • SAM operations (http://cern.ch/go/SPt8) • SAM architecture (http://cern.ch/go/Mst6)

  17. Backup slides

  18. Challenges • Many requirements from both WLCG and EGI • Possible improvements • Technology evolution • Testing of services not defined in GOCDB/OIM • Definition/testing of meta-services • Generic mechanism for loading results from other monitoring systems • Regional availability computations

  19. MyWLCG – Web Interface

  20. MyWLCG – Web Interface

  21. MyWLCG – Web Interface

  22. MyWLCG – Web Interface

  23. Open Source Technologies • Many mature technologies improved by the community and successfully used at scale • Integrate them as pluggable tools in SAM • Nagios: monitoring platform • Push and pull model monitoring • Pluggable system for probes • Simple notification system • System well known by many system administrators • ActiveMQ: messaging infrastructure • Integration platform for Nagios instances • Standardized messaging protocol • High message throughput • Resiliency to network failures

  24. WLCG today Global collaboration of more than 170 computing centres in 36 countries, linking up national and international grid infrastructures.

  25. How SAM works?

  26. Configuration • Topological aggregation • one source for topology aggregating OSG, EGI and WLCG sources • Profile management • defines and manages metrics • Nagios Configuration Generator • bootstraps Nagios based on information about needed topology and defined metrics
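
The configuration step on this slide (topology plus metric profile in, Nagios configuration out) can be illustrated with a minimal sketch. This is not SAM's actual generator: the topology records, the profile mapping, the metric names, the `check_<metric>` command naming and the `generic-host` / `generic-service` template names are all assumptions made for illustration. Only the `define host { ... }` / `define service { ... }` object syntax and the directive names come from Nagios itself.

```python
def host_block(hostname):
    """Render one Nagios host object (template name is an assumption)."""
    return (
        "define host {\n"
        "    use        generic-host\n"
        f"    host_name  {hostname}\n"
        f"    address    {hostname}\n"
        "}"
    )


def service_block(hostname, metric):
    """Render one Nagios service object for a metric on a host."""
    return (
        "define service {\n"
        "    use                  generic-service\n"
        f"    host_name            {hostname}\n"
        f"    service_description  {metric}\n"
        # Hypothetical convention: one check command per metric.
        f"    check_command        check_{metric}\n"
        "}"
    )


def generate_config(topology, profile):
    """Combine topology (which services run where) with a profile
    (which metrics apply to each service flavour)."""
    blocks = []
    seen_hosts = set()
    for entry in topology:
        hostname = entry["hostname"]
        if hostname not in seen_hosts:
            # A host may expose several flavours; define it only once.
            seen_hosts.add(hostname)
            blocks.append(host_block(hostname))
        for metric in profile.get(entry["flavour"], []):
            blocks.append(service_block(hostname, metric))
    return "\n\n".join(blocks)


# Hypothetical input data, standing in for the aggregated topology
# and the metric profile described on the slide.
topology = [
    {"site": "EXAMPLE-SITE", "hostname": "ce01.example.org", "flavour": "CE"},
    {"site": "EXAMPLE-SITE", "hostname": "ce01.example.org", "flavour": "SRM"},
]
profile = {
    "CE": ["hypo.CE-JobSubmit", "hypo.CE-CertLifetime"],
    "SRM": ["hypo.SRM-Put"],
}

if __name__ == "__main__":
    print(generate_config(topology, profile))
```

Keeping the topology and the profile as separate inputs mirrors the split on the slide: the list of sites and hosts can change without touching the metric definitions, and vice versa.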

  27. How SAM works?

  28. Collection/Notification • Nagios - open-source monitoring platform • Provides the following benefits for SAM: • Push and pull model monitoring • Pluggable system for probes • Nagios exchange - with many existing probes • System well known by many system administrators • Basic notification system
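
To make the "pluggable system for probes" concrete, here is a minimal sketch of a Nagios-style probe. It follows the standard Nagios plugin conventions (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, one status line on standard output, optional performance data after a `|`). The check itself, a timed TCP connect with warning and critical thresholds, is an illustrative assumption and not one of the SAM probes.

```python
import socket
import sys
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(latency_s, warn_s, crit_s):
    """Map a connect time (or None on failure) to a Nagios status code."""
    if latency_s is None or latency_s >= crit_s:
        return CRITICAL
    if latency_s >= warn_s:
        return WARNING
    return OK


def measure_connect(host, port, timeout_s):
    """Return the TCP connect time in seconds, or None if it failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        return None


def format_result(code, latency_s):
    """Build the single status line Nagios reads from stdout."""
    if latency_s is None:
        return f"{LABELS[code]} - connection failed"
    return (f"{LABELS[code]} - connect time {latency_s:.3f}s"
            f" | time={latency_s:.3f}s")


def main(argv):
    if len(argv) != 3:
        print("UNKNOWN - usage: probe.py HOST PORT")
        return UNKNOWN
    try:
        port = int(argv[2])
    except ValueError:
        print("UNKNOWN - PORT must be an integer")
        return UNKNOWN
    # Thresholds are hard-coded here for brevity (assumed values).
    latency = measure_connect(argv[1], port, timeout_s=10.0)
    code = classify(latency, warn_s=1.0, crit_s=5.0)
    print(format_result(code, latency))
    return code


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Because the contract is only an exit code and a line of text, a probe like this can be written in any language and dropped into a Nagios instance without changes to the framework.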

  29. How SAM works?

  30. Transport/Filtering • ActiveMQ - open-source messaging and integration patterns server • Provides the following benefits for SAM: • integration platform for Nagios instances • standardized messaging protocol • high performance in terms of message throughput • resiliency to network failures Credits: Lionel, Massimo

  31. How SAM works?

  32. Storage/Aggregation • Relational storage of metric results • Oracle, MySQL • Aggregation • Status computation – state of services and sites at a given point in time calculated based on the received metric results from Nagios • Availability computation • Availability - fraction of time a service was up during the period the service was known • Reliability – fraction of time a service was up during the period the service was scheduled to be up
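
The two definitions on this slide can be turned into a short worked example. The sketch below assumes one particular reading of them: "the period the service was known" is the total period minus the time spent in UNKNOWN status, and "the period the service was scheduled to be up" additionally excludes scheduled downtime. The status names and the `Interval` type are hypothetical, not SAM's data model.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    hours: float
    status: str  # one of "UP", "DOWN", "SCHEDULED_DOWNTIME", "UNKNOWN"


def _hours(intervals, statuses):
    """Total hours spent in any of the given statuses."""
    return sum(i.hours for i in intervals if i.status in statuses)


def availability(intervals):
    """Fraction of the known period (everything except UNKNOWN) spent UP."""
    known = _hours(intervals, {"UP", "DOWN", "SCHEDULED_DOWNTIME"})
    return _hours(intervals, {"UP"}) / known if known else None


def reliability(intervals):
    """Fraction of the period the service was scheduled to be up
    (known time minus scheduled downtime) that it actually spent UP."""
    scheduled = _hours(intervals, {"UP", "DOWN"})
    return _hours(intervals, {"UP"}) / scheduled if scheduled else None


# One hypothetical day for a single service.
day = [
    Interval(18, "UP"),
    Interval(2, "DOWN"),
    Interval(3, "SCHEDULED_DOWNTIME"),
    Interval(1, "UNKNOWN"),
]

if __name__ == "__main__":
    print(f"availability = {availability(day):.3f}")  # 18 / 23
    print(f"reliability  = {reliability(day):.3f}")   # 18 / 20
```

In this example the service is up for 18 of the 23 known hours but for 18 of the 20 hours it was scheduled to be up, so reliability is higher than availability: announced downtime lowers the former only.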

  33. How SAM works?

  34. Visualization/Reporting • http://youtu.be/oG-1B6KaKnk

  35. Aggregation/Storage

  36. Aggregation/Storage
