
GridPP Monitoring & Accounting

GridPP Monitoring & Accounting. Dave Kant, CCLRC e-Science Centre. Monitoring overview: How Many Jobs on the Grid?; LCG/EGEE Monitoring System; Putting it all together for GridPP; Future Plans.


Presentation Transcript


  1. GridPP Monitoring & Accounting
     Dave Kant, CCLRC e-Science Centre

  2. Monitoring Overview
     • Overview
     • How Many Jobs on the Grid?
     • LCG/EGEE Monitoring System
     • Putting it all together for GridPP
     • Future Plans
     EGEE’03, April 2005 - 2

  3. How Many Jobs on the Grid?
     • The question is a way to introduce the various tools that are in development in the LCG/EGEE Grid.
     • There are different sources for estimating the number of jobs:
       - Information System
       - Accounting System
       - Resource Brokers

  4. How Many Jobs on the Grid?
     • One source of information is the monitoring system based on R-GMA.
     • Tools which gather information and use the R-GMA backbone for data collection:
       - GIIS Monitor
       - APEL
       - Site Functional Tests
     • Tools which create reports:
       - RB Logging & Bookkeeping data mining
       - Accounting

  5. GIIS Monitor (http://goc.grid.sinica.edu.tw/gstat/)
     • GIIS Monitor developed by GOC Taipei (Min Tsai)
     • Tool to display and check information published by the site GIIS
     • Sanity checks and fault detection of the information system every 5 minutes
     • Provides an instantaneous snapshot of the number of jobs

  6. How Many Jobs on the Grid?
     • Another source of information is the accounting system which, like many sources, is not complete but covers most of the resources.
     • This is not the case for GridPP resources.
     • Accounting information is based on resource usage published by batch servers.

  7. How Many Jobs on the Grid?
     The latest source is a data mining tool which can be used to examine RB Logging and Bookkeeping information (via R-GMA) at the user level.
     https://lxn1192.cern.ch:9443/~judit/job-monitor.cgi

  8. How Many Jobs on the Grid?
     • A further source is based on the work of the EGEE QA Team.
     • They monitor several, but not all, resource brokers on LCG and create reports of their usage.
     • http://egee-jra2.web.cern.ch/EGEE-JRA2/index.html
     • Statistics based on aggregated information:
       - Job success and job throughput per VO and per RB
       - Grid efficiency (execution time vs waiting time)

  9. How Many Jobs on the Grid?

  10. How Many Jobs on the Grid?
      • Job duration, showing a dominance of dteam and LHCb jobs, which are relatively short-lived.

  11. Site Functional Tests
      • Installation and configuration of a site is quite a complicated procedure.
        - When there is a new release, sites don’t upgrade at the same time.
        - Upgrades don’t always go smoothly.
        - Unexpected things happen (who turned off the power?).
        - Day-to-day problems; robustness of service under load?
      • The SFT framework consists of a number of tests which probe a site to determine its operational status.
      • Coverage includes all certified sites in the EGEE/LCG infrastructure, uncertified sites (tested as part of the internal certification process performed by ROCs), sites that are part of the gLite Pre-Production Service, and all other sites that use LCG or gLite middleware.

  12. SFT
      • SFT runs every 3 hours and writes test results to a database using R-GMA
      • Site summaries and histories
      • SFT used by ROCs for certification
      • Grid–Ireland SFT

  13. GridPP Monitoring Map (http://map.gridpp.ac.uk/)
      • GPPMon is a lightweight test which sends a simple job to GridPP resources every hour.
      • Links hourly job submission test results to SFT, GSTAT, RSS feeds and accounting data.

  14. Future Plans for GPPMon
      • GPPMon, the GridPP monitor, is to be switched off.
      • SFT2 runs every 3 hours, and sites/ROCs can run these tests independently, so there is no real need for the GPPMon jobs.
      • The proposal is to link the GridPP monitoring map to the monitoring data in R-GMA and to make use of changes to the grid middleware, e.g. support for longitude and latitude in the Glue Schema (LCG 2.6).
      • Google Map

  15. Google Map (http://goc03.grid-support.ac.uk/googlemaps/gridpp.html)

  16. Accounting Overview
      This is a summary of the status of Accounting & Reporting following its deployment in LCG2_6.
      • Overview
      • APEL Design
      • What’s New?
      • LCG Accounting (OSG, NorduGrid, EGEE)
      • Issues

  17. Accounting Flow Diagram

  18. Accounting Home Page
      http://goc.grid-support.ac.uk//
      • 107 sites publishing data (Sep 02 2005)
      • Over 3.3 million job records
      • ~100K records per week (period June 1st – mid Aug 2005)

  19. What’s New?
      • Added GridPP View to the reporting interface
      • Requirements driven by GridPP
      • Global view of entire organisation
      • Tier-2 summaries
      • Detailed view at site level
      • CSV download of information
      • Toggle between normalised / un-normalised datasets
      http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.html

  20. GridPP Input
      • GridPP Metrics and Deployment Document (J. Coles)
        - Metric 10: Number of sites publishing accounting data at the end of the last quarter
        - Metric 11: KSI2K hours of CPU processing delivered (per VO) over the last quarter
      • We are looking for meaningful plots that allow important conclusions to be drawn without misleading people.
      • Is job efficiency meaningful?
      • Sites treat their data in different ways:
        - At the Tier-1, wall-clock times (WCT) are scaled because of the scheduler.
        - At other sites, only system time is scaled.
        - What about hyper-threading?
      • Perhaps we need to provide descriptive text against each plot to warn of such problems?
      • Spot potential problems in resource allocation
      • Identify trends
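Metric 11 is expressed in KSI2K hours. A minimal sketch of that normalisation follows, assuming each record carries raw CPU seconds and the SpecInt2000 rating of the node that ran the job. The field names and numbers are hypothetical and are not taken from the accounting schema.

```python
from collections import defaultdict


def ksi2k_hours(cpu_seconds, si2k_rating):
    """Convert raw CPU seconds to KSI2K hours (CPU hours x rating / 1000)."""
    return (cpu_seconds / 3600.0) * (si2k_rating / 1000.0)


def per_vo_totals(records):
    """Sum normalised CPU per VO over a list of job records."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["vo"]] += ksi2k_hours(rec["cpu_seconds"], rec["si2k"])
    return dict(totals)


# Hypothetical records: field names and values are illustrative only.
records = [
    {"vo": "atlas", "cpu_seconds": 7200, "si2k": 1000},
    {"vo": "atlas", "cpu_seconds": 3600, "si2k": 500},
    {"vo": "lhcb", "cpu_seconds": 3600, "si2k": 2000},
]
totals = per_vo_totals(records)
```

Whether the rating is applied to CPU time only or also to wall-clock time is exactly the kind of site-level difference the slide warns about, so a real report would need to state which convention each site uses.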

  21. GridPP View Screen Shots

  22. GridPP View Screen Shots
      • Atlas dominates in Tier1
      • Atlas and LHCb dominating
      • KSI2K delivered per Tier1/Tier2 per VO
      • Job Efficiency = CPUT/WCT
      • Why is Atlas efficiency at 60%?
      • Why is DZERO efficiency for MANHEP > 1?
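The efficiency ratio quoted on this slide, CPUT/WCT, can be sketched as below, together with a check for values above 1. The site names and numbers are hypothetical. One possible reason for a ratio above 1, assuming the scaling differences described on the GridPP Input slide, is that CPU time and wall-clock time were normalised differently.

```python
def job_efficiency(cpu_time, wall_time):
    """Return CPUT/WCT, or None when wall time is zero or negative."""
    if wall_time <= 0:
        return None
    return cpu_time / wall_time


def flag_suspect(rows, upper=1.0):
    """Return (site, vo, efficiency) for rows whose efficiency exceeds `upper`."""
    suspect = []
    for site, vo, cpu_time, wall_time in rows:
        eff = job_efficiency(cpu_time, wall_time)
        if eff is not None and eff > upper:
            suspect.append((site, vo, eff))
    return suspect


# Hypothetical monthly sums in seconds; not taken from the GridPP view.
rows = [
    ("SITE-A", "atlas", 600.0, 1000.0),
    ("SITE-B", "dzero", 1200.0, 1000.0),
    ("SITE-C", "lhcb", 10.0, 0.0),
]
suspect = flag_suspect(rows)
```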

  23. Tier2 View (NorthGrid)

  24. Site View (Lancaster)
      • Breakdown of data per VO per month showing Njobs, CPUT, WCT and record history
      • Total CPU usage per VO
      • Gantt chart
      • NB: Gaps across all VOs are consistent with scheduled downtimes in GocDB

  25. APEL in LCG 2.6
      • New version with better documentation
      • APEL supports PBS and LSF
      • Consists of a number of components:
        - Core module contains functionality common to all components
        - Plugin components provide log parsing functionality for PBS and LSF job managers
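To illustrate what a log-parsing plugin has to do, here is a minimal sketch that reads one PBS accounting "E" (job end) record. It is not APEL's own parser. It assumes the usual PBS accounting layout of `timestamp;record type;job id;key=value ...`, and the sample line, host name and user are made up.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_pbs_end_record(line):
    """Parse a PBS accounting 'E' record; return None for other record types."""
    parts = line.strip().split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    timestamp, _record_type, job_id, message = parts
    # The message is a space-separated list of key=value pairs.
    fields = dict(tok.split("=", 1) for tok in message.split() if "=" in tok)
    return {
        "end_time": timestamp,
        "job_id": job_id,
        "user": fields.get("user"),
        "group": fields.get("group"),
        "queue": fields.get("queue"),
        "cpu_seconds": hms_to_seconds(fields.get("resources_used.cput", "00:00:00")),
        "wall_seconds": hms_to_seconds(fields.get("resources_used.walltime", "00:00:00")),
    }


# Made-up sample record for illustration.
sample = (
    "04/15/2005 10:20:30;E;1234.ce.example.org;"
    "user=atlas001 group=atlas queue=short "
    "resources_used.cput=00:10:00 resources_used.walltime=00:12:30"
)
record = parse_pbs_end_record(sample)
```

An LSF plugin would need a different parser for its own log format, which is the reason for keeping the parsing in plugins and the shared logic in a core module.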

  26. Accounting Dissemination
      • CERN Courier
      • LCG Computing Newsletter (slightly more technical)
      • AHM 2005 (more technical still)

  27. APEL and gLite
      • Is APEL integrated in gLite?
        - Work is currently in progress.
        - We have ported the APEL code into the gLite CVS repository but need to understand functional differences, e.g. WMS and the use of Condor.
      • What about its development plan?
        - Future unclear given the presence of DGAS in gLite
        - Areas of possible development:
          Condor (easy or complicated)
          Reporting tool (GridICE will most likely provide this)

  28. LCG Accounting
      The project involves combining results from all three infrastructures and presenting an aggregated view.
      • Peer infrastructures in LCG:
        - Open Science Grid (Ruth Pordes, Philippe Canal, Matteo Melani)
        - NorduGrid (Per Oster)
        - EGEE
      • Currently, LHCView filters LHC VO data from EGEE accounting data.
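The LHC VO filtering mentioned above amounts to keeping only records from the four LHC experiments. A minimal sketch follows, assuming records are dictionaries with a `vo` field (a hypothetical layout, not the LHCView implementation):

```python
# VO names of the four LHC experiments, lower-cased for comparison.
LHC_VOS = {"alice", "atlas", "cms", "lhcb"}


def lhc_view(records):
    """Keep only accounting records whose VO is an LHC experiment."""
    return [rec for rec in records if rec["vo"].lower() in LHC_VOS]


# Hypothetical records for illustration.
records = [
    {"vo": "atlas", "njobs": 10},
    {"vo": "dteam", "njobs": 3},
    {"vo": "LHCb", "njobs": 7},
]
filtered = lhc_view(records)
```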

  29. Requirements
      Combine results from all three infrastructures …
      • Ideally: distributed queries to multiple databases
        - Each peer manages an accounting database
        - LHC VO filtering provided through a web services interface
      • Initial implementation: centralised collection
        - Peers publish data into a global database
        - Web services or direct MySQL inserts
      • Common problem: different Grid infrastructures may use different schemas. GGF defines a schema, but it is quite flexible. “Translators” may be needed to convert from one schema to another (these already exist).
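The centralised-collection option with schema "translators" can be sketched as follows. Both peer schemas, all field names, and the `collect` helper are hypothetical; they only illustrate mapping differently shaped records onto one common layout before they are stored in a global database.

```python
def from_peer_a(rec):
    """Translate a record from hypothetical peer schema A (durations in seconds)."""
    return {
        "site": rec["SiteName"],
        "vo": rec["VO"],
        "cpu_seconds": rec["CpuDuration"],
        "wall_seconds": rec["WallDuration"],
    }


def from_peer_b(rec):
    """Translate a record from hypothetical peer schema B (durations in hours)."""
    return {
        "site": rec["cluster"],
        "vo": rec["vo_name"],
        "cpu_seconds": round(rec["cpu_hours"] * 3600),
        "wall_seconds": round(rec["wall_hours"] * 3600),
    }


# One translator per peer infrastructure.
TRANSLATORS = {"peer_a": from_peer_a, "peer_b": from_peer_b}


def collect(published):
    """Convert (peer, record) pairs into the common layout for the global store."""
    return [TRANSLATORS[peer](rec) for peer, rec in published]


published = [
    ("peer_a", {"SiteName": "SITE-A", "VO": "atlas", "CpuDuration": 600, "WallDuration": 750}),
    ("peer_b", {"cluster": "SITE-B", "vo_name": "cms", "cpu_hours": 1.5, "wall_hours": 2.0}),
]
common = collect(published)
```

In the distributed-query variant the same translators would sit in front of each peer's own database instead of in front of a single global one.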
