
A lightweight Monitoring and Accounting system for LHCb DC'04 production


Presentation Transcript


  1. A lightweight Monitoring and Accounting system for LHCb DC'04 production V. Garonne R. Graciani Díaz J. J. Saborido Silva M. Sánchez García R. Vizcaya Carrillo

  2. Outline • Manifesto • Monitoring • Web interface • Internals • Accounting • Web interface • Internals • Outlook • URLs

  3. Manifesto • Monitoring and Accounting are tasks in DIRAC • DIRAC is a production grid for LHCb • The Monitoring reports the status of jobs while they are in the WMS (Workload Management System) • Instantaneous snapshot of the system • No historic records • The Accounting records the status of jobs after they leave the WMS • Provides historic records, accumulated statistics and the evolution of recorded variables with time • Main users: production and site managers

  4. Design choices • Monitoring • Job information stored centrally in the WMS • Information provided directly by the job and the WMS • Passive services: no pushing of information • No need for a common consumer API • Job and application state stored together • Accounting • Infrastructure separate from the monitoring • A job is never in both the Accounting and the Monitoring • Domain specific: LHCb production jobs

  5. Information Flow • [Diagram; labels: Users, Web interface (Monitoring), Web interface (Accounting), DIRAC, Job, Job heart-beat, Services & Agents, Cleaner Agent, WMS Job Database, Accounting Database backend; Read and Write arrows on both the Monitoring and the Accounting side]

  6. Monitoring Web Interface 1 • Interface to query the monitoring service • Clicking a JobId pops up a window with the job details

  7. Monitoring Web Interface 2 • [Plot: Running jobs by site] • The overview shows predefined plots for the production • Generated every few minutes • PyChart used as graphics engine • 100% Python • Supports SVG

  8. Monitoring Web Interface 3 • Job status by site and production id

  9. Monitoring Internals • It consists of an XML-RPC service exposing whatever parameters are known to DIRAC • Job parameters stored internally by DIRAC • Primary parameters • Execution site, job status, job owner, etc. • Fixed, centrally defined: fast access • Can be queried on • Secondary parameters • Number of steps, internal job state, etc. • Defined by the production job itself • Stored as key-value pairs • Slower access; cannot be queried on
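The split between primary and secondary parameters can be pictured as fixed, indexed columns next to a generic key-value table. The following is only a minimal sketch under that assumption, using an in-memory SQLite database; the table names, column names, helper functions, owner name and the site `DIRAC.USC.es` are hypothetical and are not taken from DIRAC.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative, not DIRAC's.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (              -- primary parameters: fixed columns
        job_id INTEGER PRIMARY KEY,
        site   TEXT NOT NULL,
        status TEXT NOT NULL,
        owner  TEXT NOT NULL
    );
    CREATE INDEX idx_jobs_site_status ON jobs (site, status);
    CREATE TABLE job_parameters (    -- secondary parameters: key-value pairs
        job_id INTEGER NOT NULL REFERENCES jobs (job_id),
        name   TEXT NOT NULL,
        value  TEXT NOT NULL,
        PRIMARY KEY (job_id, name)
    );
""")


def add_job(job_id, site, status, owner, **secondary):
    """Store the primary parameters as columns and the rest as key-value rows."""
    conn.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)", (job_id, site, status, owner))
    conn.executemany(
        "INSERT INTO job_parameters VALUES (?, ?, ?)",
        [(job_id, name, str(value)) for name, value in secondary.items()],
    )


def get_jobs(site, status):
    """Select jobs by primary parameters (served by the index)."""
    rows = conn.execute(
        "SELECT job_id FROM jobs WHERE site = ? AND status = ? ORDER BY job_id",
        (site, status),
    )
    return [row[0] for row in rows]


def get_job_parameter(job_id, name):
    """Look up one secondary parameter of one job; None if it was never set."""
    row = conn.execute(
        "SELECT value FROM job_parameters WHERE job_id = ? AND name = ?",
        (job_id, name),
    ).fetchone()
    return row[0] if row else None


add_job(1, "DIRAC.CERN.ch", "running", "lhcbprod", LocalBatchId="lsf-1001", Steps=3)
add_job(2, "DIRAC.CERN.ch", "done", "lhcbprod", LocalBatchId="lsf-1002")
add_job(3, "DIRAC.USC.es", "running", "lhcbprod")

running_at_cern = get_jobs("DIRAC.CERN.ch", "running")
batch_id = get_job_parameter(1, "LocalBatchId")
```

In this layout a selection on site and status touches only the indexed `jobs` table, while secondary values are fetched per job and per name, which is one way to read the "fast access" versus "slower access" distinction on the slide.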

  10. JMS basic API example

      # Python 2 (xmlrpclib); monitoring_url must hold the URL of the monitoring service
      from xmlrpclib import ServerProxy

      server = ServerProxy(monitoring_url)

      # Retrieve the list of jobs satisfying some conditions
      conditions = {'Status': 'running', 'Site': 'DIRAC.CERN.ch'}
      jobreq = server.getJobs(conditions)

      # Print some parameters for each job
      if jobreq['Status']:
          for jobid in jobreq['Value']:
              print server.getJobSite(jobid)
              print server.getJobParameter(jobid, 'LocalBatchId')

          # Bulk operations ("summary" avoids shadowing the built-in sum)
          summary = server.getJobsPrimarySummary(jobreq['Value'])

      Timings quoted on the slide: ~3 s to select 95 out of 50k jobs; ~40 s; ~0.7 s
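The example above ends with a bulk operation. One plausible reason such calls pay off is that they replace many XML-RPC round trips with a single one. The sketch below only counts round trips against an in-process stand-in; `FakeMonitoring` is hypothetical, it reuses the method names from the slide, and the return shapes (in particular that of `getJobsPrimarySummary`) are assumptions rather than the real service's behaviour.

```python
class FakeMonitoring:
    """In-process stand-in for the monitoring service (not the real server)."""

    def __init__(self, jobs):
        self._jobs = jobs   # {job_id: {"Site": ..., "Status": ...}}
        self.calls = 0      # each method call models one XML-RPC round trip

    def getJobs(self, conditions):
        self.calls += 1
        matches = [
            jid for jid, params in sorted(self._jobs.items())
            if all(params.get(key) == value for key, value in conditions.items())
        ]
        return {"Status": True, "Value": matches}

    def getJobSite(self, job_id):
        self.calls += 1
        return self._jobs[job_id]["Site"]

    def getJobsPrimarySummary(self, job_ids):
        # Assumed shape: one dict of primary parameters per job id.
        self.calls += 1
        return {str(jid): dict(self._jobs[jid]) for jid in job_ids}


# 95 matching jobs, echoing the selection size quoted on the slide.
jobs = {jid: {"Site": "DIRAC.CERN.ch", "Status": "running"} for jid in range(1, 96)}
server = FakeMonitoring(jobs)
selected = server.getJobs({"Status": "running", "Site": "DIRAC.CERN.ch"})["Value"]

# One call per job ...
server.calls = 0
sites_one_by_one = [server.getJobSite(jid) for jid in selected]
per_job_calls = server.calls

# ... versus a single bulk call for the whole list.
server.calls = 0
summary = server.getJobsPrimarySummary(selected)
bulk_calls = server.calls

print(per_job_calls, bulk_calls)
```

With a real network between client and server, each counted call would add latency, so the number of calls is a rough proxy for the cost of each approach.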

  11. Accounting Web Interface 1 • GUI for querying the Accounting • Shows results • As graphics • As table • As Excel sheet • Several types of report • Only a few shown here

  12. Accounting Web Interface 2 • Used resources by site

  13. Accounting Web Interface 3 • Used resources by event type • MB/job • CPU/job • Failed jobs • CPU vs. execution time • Input and output data vs. CPU

  14. Accounting Web Interface 4 • Produced data by production ID • Rates • Cumulative • Number of events • GB of output

  15. Accounting Web Interface 5 • WMS statistics on DIRAC's performance • Plots • Job execution time vs. WMS waiting time • Job execution time vs. WMS matching time • Granularity • Per site • Per production • Integral • Allows assessment of DIRAC's performance
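The three granularities listed above (per site, per production, integral) amount to grouping the same timing records by different keys. A minimal sketch of that grouping follows; the record fields, site names, production numbers and timing values are all made up for illustration and do not come from the accounting database.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical accounting records: field names and values are illustrative only.
records = [
    {"site": "SiteA", "production": 101, "exec_s": 3600, "wait_s": 120, "match_s": 4},
    {"site": "SiteA", "production": 102, "exec_s": 7200, "wait_s": 240, "match_s": 6},
    {"site": "SiteB", "production": 101, "exec_s": 1800, "wait_s": 600, "match_s": 2},
]


def summarise(records, key=None):
    """Average timings grouped by `key`, or over all records when key is None."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key] if key else "all"].append(rec)
    return {
        name: {
            "jobs": len(recs),
            "mean_exec_s": mean(r["exec_s"] for r in recs),
            "mean_wait_s": mean(r["wait_s"] for r in recs),
            "mean_match_s": mean(r["match_s"] for r in recs),
        }
        for name, recs in groups.items()
    }


by_site = summarise(records, "site")
by_production = summarise(records, "production")
integral = summarise(records)
```

Comparing the mean execution time with the mean waiting and matching times in each group gives a rough indication of how much overhead the WMS adds relative to the useful work.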

  16. Accounting Internals • Job and DIRAC statistics kept in a database • Site contribution • Data produced and used by jobs and steps • Timing for jobs, steps and DIRAC internals • Separate XML-RPC interfaces to populate and query the accounting tables • Both interfaces have restricted access • Jobs are moved to the accounting system by a cleaner agent after being validated
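The cleaner-agent step can be sketched as a validated move in which the insert into the accounting side and the delete from the monitoring side happen in one transaction, so a job never appears in both. This is only an illustration: it uses a single in-memory SQLite database with two tables, whereas the slides describe separate infrastructures, and the table names, columns and the `is_valid` rule are hypothetical (the slides do not say what validation involves).

```python
import sqlite3

# Hypothetical layout: one database with a monitoring-side and an accounting-side table.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE wms_jobs (job_id INTEGER PRIMARY KEY, site TEXT, status TEXT, cpu_s REAL);
    CREATE TABLE accounting_jobs (job_id INTEGER PRIMARY KEY, site TEXT, status TEXT, cpu_s REAL);
""")
db.executemany(
    "INSERT INTO wms_jobs VALUES (?, ?, ?, ?)",
    [
        (1, "SiteA", "done", 1200.0),
        (2, "SiteA", "running", 300.0),
        (3, "SiteB", "failed", 50.0),
        (4, "SiteB", "done", None),   # finished, but missing CPU time
    ],
)
db.commit()


def is_valid(row):
    """Placeholder validation: job has finished and reports a non-negative CPU time."""
    _job_id, _site, status, cpu_s = row
    return status in ("done", "failed") and cpu_s is not None and cpu_s >= 0


def clean(db):
    """Move validated jobs to the accounting table; return the ids that were moved."""
    moved = []
    rows = db.execute(
        "SELECT job_id, site, status, cpu_s FROM wms_jobs ORDER BY job_id"
    ).fetchall()
    for row in rows:
        if not is_valid(row):
            continue
        with db:  # insert and delete are committed together or rolled back together
            db.execute("INSERT INTO accounting_jobs VALUES (?, ?, ?, ?)", row)
            db.execute("DELETE FROM wms_jobs WHERE job_id = ?", (row[0],))
        moved.append(row[0])
    return moved


moved = clean(db)
still_monitored = [r[0] for r in db.execute("SELECT job_id FROM wms_jobs ORDER BY job_id")]
accounted = [r[0] for r in db.execute("SELECT job_id FROM accounting_jobs ORDER BY job_id")]
```

Jobs that are still running or that fail the validation stay on the monitoring side and are picked up on a later pass.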

  17. Accounting Usage • About 10 hits per day • Time to generate daily static reports: 8 min • 60-70% of the time querying the database • 30-40% of the time in the drawing package • Total: 169 kjobs • Server load < 0.2
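The percentages above translate into absolute times as follows. This is plain arithmetic on the slide's figures (8 minutes, 60-70%, 30-40%); the helper name `share` is only for illustration.

```python
report_minutes = 8
total_s = report_minutes * 60   # 480 s for the daily static reports


def share(total, low_pct, high_pct):
    """Return the (low, high) time range in seconds for a percentage range."""
    return (total * low_pct / 100, total * high_pct / 100)


query_s = share(total_s, 60, 70)   # time spent querying the database
draw_s = share(total_s, 30, 40)    # time spent in the drawing package

print("querying: %.0f-%.0f s, drawing: %.0f-%.0f s" % (query_s + draw_s))
```

So roughly 288 to 336 s of each run goes into database queries and 144 to 192 s into drawing.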

  18. Outlook • Monitoring page • Transactions in monitoring updates • Further optimisation (bulk operations...) • Search for a faster rendering package • Make the web page dynamic: fewer reloads • Accounting • New report types • Normalized CPU • Contribution by country • Rate by site, country, etc.

  19. URLs • Monitoring page • http://fpegaes1.usc.es/dmon/DC04/joblist.html • Mirror on: • http://lhcb02.usc.cesga.es/dmon/DC04/joblist.html • Direct link to overview pages • http://lhcb.ecm.ub.es/DC04/Monitoring • Accounting page • http://lhcb.ecm.ub.es/DC04/Accounting/
