New WLCG Grid Service Monitoring Displays

James Casey, CERN IT-GD. HEPIX, November 2007.

Presentation Transcript


  1. New WLCG Grid Service Monitoring Displays. James Casey, CERN IT-GD. HEPIX, November 2007.

  2. Overview
     • Service Monitoring in WLCG
     • Site Service Monitoring
     • Nagios
     • Central Monitoring
     • GridMap
     • Future work
     Nov 8th 2007 / HEPIX / New WLCG Grid Service Monitoring Displays

  3. WLCG Monitoring Working Groups
     • 3 groups created by Ian Bird, Oct’06
     • “... to help improve the reliability of the grid infrastructure ...”
     • “... provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service. ...”
     • “... stakeholders are site administrators, grid service managers and operations, VOs, Grid Project management”
     Groups and their topics (diagram labels):
     • System Management: Fabric management, Best Practices, Security, ...
     • Grid Services: Grid sensors, Transport, Metric Repositories, Views, ...
     • System Analysis: Application monitoring, ...

  4. Monitoring
     You can’t manage what you don’t measure...
     • accuracy and credibility
     • appropriate metrics - directly relevant to user experience; clearly defined and understood
     • measurement instrumentation - active, passive, collection intervals, alarms
     • data collection points - system element → service
     • real-time → historical
     [Diagram: Sensors/Agents → Transport → Repositories → Views. Other labels: Grid Monitoring, Presentation, automated decision making, manual decision making, Control]
     Slide by Max Böhm, EDS

  5. WLCG Grid Monitoring Landscape
     Columns: Domain / Monitoring / Tools in use
     • Grid Applications / Application monitoring / Experiment Dashboards, ...
     • Grid Middleware (central services, site services) / Grid Services monitoring / GStat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
     • Local resources (site) / Local monitoring / Lemon/SLS, Nagios, Ganglia, ...
     Slide by Max Böhm, EDS

  6. Grid Monitoring Landscape View
     [Diagram. Recoverable labels, grouped by layer:]
     • Apps layer: Ganga/Panda, AtlasProdDB, FileCatalog, ResourceBroker, Info System
     • Central Services / Grid Services: RB, RGMA, BDII, LFC, FTS, LB
     • Site Services: CE, SE
     • Fabric Resources: batch, CPUs, TBs
     [Monitoring tools and what they show:]
     • Experiment Dashboards (one per experiment/VO, e.g. ATLAS): VO jobs, data, site reliability
     • RTM: real time 3D job view
     • GStat: site status + graphs
     • SAM: submit test jobs
     • GridView: data transfer, job status, service availability
     • GridICE: fabric/job infos
     • LEMON: fabric infos
     • GOCDB: site registry
     [Transports and data sources named on the arrows: HTTP/XML pull, HTTP/XML push, HTTP/SOAP push, RGMA, MonALISA, LDAP, DB access, html, GOCDB, BDII]

  7. High-level Model
     See https://twiki.cern.ch/twiki/pub/LCG/GridServiceMonitoringInfo/0702-WLCG_Monitoring_for_Managers.pdf for details.
     [Diagram labels: LEMON, Nagios, R-GMA, GOCDB, SAM, SAME, GridView, GridIce, GridMap, Experiment Dashboard, HTTP, LDAP]

  8. Grid Site Monitoring Principles
     • Provide an easily extensible site monitoring system
     • Or be able to plug grid features into existing site monitoring
     • Should be able to provide (or augment) alarms at the site for the grid services
     • Don’t force a solution on the site administrators
     • Should work with any fabric monitoring system that provides basic functionality
     • Provide the specific plugins to deal with the Grid
     • Probes that work for Grid Services
     • Enable export of the data from the site into standard grid monitoring systems, e.g. SAM, GridView, GridICE, …
     • Avoid duplicate running of probes

  9. Purpose
     • Bring data from existing monitoring systems into the site monitoring tools
       • Service Availability Monitoring (SAM)
       • Network performance monitoring (NPM)
       • Experiment site blacklists (FCR tool)
       • Experiment dashboards, …
     • Decided to create a prototype based on Nagios
       • Due to existing take-up of Nagios in the community
     • Second stage will be to integrate with LEMON
       • As next most common solution
     • Based on questionnaire to community

  10. Nagios
      • Open source monitoring system
      • Widely used & actively developed
      • Host and service problem detection and recovery
      • Provides a set of basic plugins (sensors)
      • Easy to develop custom sensors
      • No components required on monitored entities
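To illustrate why custom sensors are easy to write: a Nagios plugin is just a program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is a minimal sketch, not one of the probes described in this talk. The script name, the `check_tcp` helper and its thresholds are assumptions made for illustration.

```python
import socket
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_tcp(host, port, warn_seconds=1.0, timeout=5.0):
    """Hypothetical sensor: try a TCP connect and return (exit_code, message)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        return CRITICAL, f"cannot connect to {host}:{port} ({exc})"
    if elapsed > warn_seconds:
        return WARNING, f"slow connect to {host}:{port}: {elapsed:.3f}s"
    return OK, f"connected to {host}:{port} in {elapsed:.3f}s"


def format_plugin_output(code, message, label="TCP"):
    """Build the single status line that Nagios shows for the service."""
    return f"{label} {STATUS_NAMES[code]} - {message}"


def main(argv):
    """Parse HOST and PORT, run the check, print the status line, return the exit code."""
    if len(argv) != 3:
        print(format_plugin_output(UNKNOWN, "usage: check_tcp_sketch.py HOST PORT"))
        return UNKNOWN
    try:
        port = int(argv[2])
    except ValueError:
        print(format_plugin_output(UNKNOWN, f"invalid port {argv[2]!r}"))
        return UNKNOWN
    code, message = check_tcp(argv[1], port)
    print(format_plugin_output(code, message))
    return code
```

To use it as a plugin, a script would end with `sys.exit(main(sys.argv))`, so that the return value becomes the process exit code Nagios reads.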

  11. Architecture
      [Diagram. Components: Site admins, Monitoring server, MyProxy, Site BDII, Site nodes (CE, SE, LFC, …)]
      [Labelled interactions: Issue alarms; Get site status; Get Nagios results; Get remote results; Get VOMS proxy; Refresh proxy; Probe descriptions; Get site’s & nodes information; Get nodes information; Live node checks; Service checks]

  12. Grid Extensions
      • Standard probes
        • provided by SRCE, CERN, OSG
      • Security facilities & services
        • CA distribution, Certificate lifetime, MyProxy
      • Monitoring & information services
        • R-GMA, BDII, MDS, GridICE
      • Job management services
        • Globus Gatekeeper, RB, WMS, WMProxy, Job matching
      • Data management services
        • GridFTP, SRM, DPNS, LFC, FTS
      • Remote gatherers
        • SAM & NPM
      • Nagios Config Generator (NCG), Publisher, Credential management

  13. Standard Components
      • Probe wrapper
        • enables integration of standardized probes
        • One probe can run in Nagios, LEMON, SAM, …
        • Grid Monitoring Probes Specification
        • https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringProbeSpecification
      • Publisher & remote gatherers
        • integration with other tools
        • Existing tools can just consume the data, e.g. SAM, GridView, Dashboards, …
        • Grid Monitoring Data Exchange Standard
        • https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringDataExchangeStandard
      Comments, contributions & probes welcome!
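The probe wrapper idea, one probe whose result can be consumed by several systems, can be sketched as a single result type plus one small adapter per consumer. This is only an illustration under stated assumptions: `ProbeResult`, `run_probe`, `to_nagios`, `to_record` and the record keys are hypothetical names, and they do not reproduce the Grid Monitoring Probes Specification or the Data Exchange Standard linked above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, Optional, Tuple

# Nagios exit codes; the status names are the usual Nagios ones.
NAGIOS_EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical common result produced by every probe."""
    service: str   # e.g. "SRM"
    status: str    # one of the keys of NAGIOS_EXIT_CODES
    summary: str   # one-line human-readable explanation


def run_probe(service: str, probe: Callable[[], ProbeResult]) -> ProbeResult:
    """Run a probe once; turn an unexpected exception into an UNKNOWN result."""
    try:
        return probe()
    except Exception as exc:  # a crashing probe must not crash the wrapper
        return ProbeResult(service, "UNKNOWN", f"probe raised {type(exc).__name__}")


def to_nagios(result: ProbeResult) -> Tuple[int, str]:
    """Adapter for Nagios: (exit code, status line)."""
    code = NAGIOS_EXIT_CODES.get(result.status, 3)
    return code, f"{result.service} {result.status} - {result.summary}"


def to_record(result: ProbeResult, host: str,
              timestamp: Optional[datetime] = None) -> Dict[str, str]:
    """Adapter for a publisher: a flat key/value record (keys are made up here)."""
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "host": host,
        "service": result.service,
        "status": result.status,
        "summary": result.summary,
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Because the probe runs once and only the adapters differ, the same measurement can be shown locally and exported, which is one way to avoid running probes twice.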

  14. [Screenshot slide; no text recovered]

  15. [Screenshot slide; no text recovered]

  16. [Screenshot slide; no text recovered]

  17. [Screenshot slide. Labels: SAM, Standard probes, NPM]

  18. [Screenshot slide; no text recovered]

  19. Current Status
      • Three sets of standard probes integrated
        • SRCE, CERN, OSG
      • RPMs in apt and yum repository
        • http://www.sysadmin.hep.ac.uk
      • Installation documentation on twiki
        • https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNagiosInstall
      • Mailing list for community support of sites
        • wlcg-monitoring-discuss@cern.ch
      • Will appear in upcoming gLite releases as packaged software
      • Will be bundled with “follow-up” documentation to help site admins understand what went wrong on probe failure
      New (early-access) volunteers welcome!

  20. New visualizations for the Grid?
      • Grid monitoring data is complex!
      • And there are many sites…
      • Current tools visualize data by sorted tables, bar charts, etc.
      • Difficult to present an easy-to-understand top-level view which provides
        • quick, action-oriented oversight and insight
        • help in understanding job failures and availability patterns
      Can new visualizations help?

  21. GridMap Visualization
      • Idea
        • visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
      • Example GridMap [figure: regions, each containing site rectangles]
      Size of rectangle is e.g.
        - size of site (#CPUs)
        - #running jobs
        - ...

  22. GridMap Visualization
      • Idea
        • visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
      • Example GridMap [figure; colour key: ok, degraded, down]
      Colour of rectangle is e.g.
        - SAM status of site / service
        - Availability of site / service
        - ...
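The two GridMap slides map one metric to rectangle size and another to rectangle colour. A minimal sketch of that mapping follows, using the simple slice-and-dice treemap layout: regions are laid out side by side, and the sites inside each region are stacked within it. The slides do not say which layout algorithm GridMap uses, so this is a stand-in. The names `slice_layout` and `gridmap` and the green/orange/red colours are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Rect:
    name: str
    x: float
    y: float
    w: float
    h: float
    colour: str


# Assumed colours for the three states in the slide's colour key.
STATUS_COLOURS = {"ok": "green", "degraded": "orange", "down": "red"}


def slice_layout(items: List[Tuple[str, float]], x: float, y: float,
                 w: float, h: float, side_by_side: bool):
    """Split the rectangle (x, y, w, h) into strips whose areas are proportional to size."""
    total = sum(size for _, size in items)
    strips = []
    offset = 0.0
    for name, size in items:
        share = size / total if total else 0.0
        if side_by_side:   # strips run left to right
            strips.append((name, x + offset, y, w * share, h))
            offset += w * share
        else:              # strips run top to bottom
            strips.append((name, x, y + offset, w, h * share))
            offset += h * share
    return strips


def gridmap(regions: Dict[str, Dict[str, float]], status: Dict[str, str],
            width: float, height: float) -> List[Rect]:
    """regions maps region -> {site: size metric, e.g. #CPUs}; status maps site -> state."""
    region_sizes = [(name, sum(sites.values())) for name, sites in regions.items()]
    rects = []
    for rname, rx, ry, rw, rh in slice_layout(region_sizes, 0.0, 0.0, width, height, True):
        sites = list(regions[rname].items())
        for sname, sx, sy, sw, sh in slice_layout(sites, rx, ry, rw, rh, False):
            # Sites with no known status get no colour.
            colour = STATUS_COLOURS.get(status.get(sname, ""), "none")
            rects.append(Rect(sname, sx, sy, sw, sh, colour))
    return rects
```

Each site's area ends up proportional to its size metric, so a large site that is down dominates the picture, while a small one stays visible without crowding the view. Slice-and-dice tends to produce thin strips when there are many sites; other treemap layouts trade simplicity for better aspect ratios.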

  23. Multiple Views
      • GridMaps can be used for top-level, geographical and VO views
      [Diagram labels: Global GridMap (Top-level View); Application Domain GridMap; Large-scale Federated Grid Services Infrastructure; VO Views, cross-location; Geographical Views (Federation, Partner, Site, etc.); Local GridMaps (next level of GridMaps); Alert; Corrective action; effect]

  24. Trends
      • Trends can be understood by looking at a sequence of GridMaps
      [Figure: Site Availability over time, one GridMap per day: 20 Sep 2007, 21 Sep 2007, 22 Sep 2007, 23 Sep 2007, 24 Sep 2007, 25 Sep 2007]

  25. More Views
      • Correlations of metrics can be discovered by switching between different views
      [Figure: Site Availability from different VO perspectives: OPS, Alice, Atlas, CMS, LHCb. Sites without colour do not support the VO.]
      [Figure: Status of different Site Services: Overall Site, CE, SE, SRM, site BDII]

  26. GridMap Prototype Architecture
      [Diagram: Grid sites → existing monitoring system(s) → GridMap Server → Web Browser (Title, view1, view2, view3)]
      GridMap Server
        - provides client side code and client supporting services
        - implements GridMap Layout Algorithm
        - retrieves and caches data from existing monitoring systems
        - POC implementation is based on Apache / Python
      GridMap View
        - Browser based Web 2.0 type client component
        - single interactive and responsive web page (no page reloads required, data is retrieved in the background)
        - fast switching between views possible
        - details of the site/service statuses are shown as a context sensitive Tooltip
        - POC implementation is based on HTML, lightweight JavaScript libraries, AJAX type communication pattern
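The server's "retrieves and caches data" role can be sketched as a small time-to-live cache placed in front of whatever function queries the existing monitoring systems. This is an illustration only. The class name `TTLCache`, the `fetch` callback and the 300-second lifetime in the usage note are assumptions, not details of the GridMap prototype.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Serve a cached value until it is older than ttl_seconds, then fetch it again."""

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch            # hypothetical: queries a monitoring system
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # still fresh: no upstream request
        value = self._fetch(key)       # missing or stale: ask the upstream system
        self._entries[key] = (now, value)
        return value
```

With something like `TTLCache(fetch_status, ttl_seconds=300)`, many browsers switching between views would trigger at most one upstream query per key every five minutes, which keeps the page responsive without loading the monitoring systems behind it.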

  27. GridMap Prototype View Component
      Link: http://gridmap.cern.ch
      [Annotated screenshot. Callouts:]
        - Drilldown into region by clicking on the title
        - Grid topology view (grouping)
        - Metric selection for size of rectangles
        - Metric selection for colour of rectangles
        - VO selection
        - Overall Site or Site Service selection
        - Show SAM status
        - Show GridView availability data
        - Description of current view
        - Context sensitive information
        - Colour Key

  28. GridMap Prototype: Link to Existing Tools
      • Clicking on a site opens a page with details in GridView/SAM
      [Screenshots: Site Detail, Availability, SAM Test Results]

  29. Conclusions
      • To improve reliability we need to:
        • Provide more information to site administrators
          • That relates to what users actually see when using their site
        • A lot of data is already gathered, so if possible don’t gather it again
          • Need to get it into the fabric monitoring system already used at a site
        • Nagios-based prototype validating the approach
          • Good feedback from early adopters
        • Improve the visualization
          • Too much data - especially for central monitoring (~250 sites)
          • New techniques help to compress information and bring useful information into view
      http://gridmap.cern.ch
      http://nagios-test.cern.ch/nagios (guest:guest)
