Grid Accounting-Gratia



  1. Grid Accounting-Gratia Philippe Canal (FNAL), May 2007

  2. Outline • Project Definition • Architecture Overview • Probes • Collectors • Gratia Production • Test Stand • Gratia @ Fermilab • Reports: Graphs, Text, Plan • WLCG • Upcoming Challenges • Metrics • OSG Validation • Summary

  3. Grid Services / Accounting Project Definition
  • The main goal of the Grid Accounting Joint Project, which is also contributed as an OSG Activity, is to provide the stakeholders with a reliable and accurate set of views of Grid resource usage. The Grid Accounting Project:
  • has designed the schema for the accounting attributes,
  • is ensuring the necessary collectors and sensors are in place at the resource providers,
  • has defined and is deploying repository and access tools for the reporting and analysis of grid-wide accounting information.
  • The accounting system will determine a confidence level for the existing accounting information and will properly address and present erroneous or missing accounting data. It will also protect the privacy of the users and organizations involved. The initial goal is to track VO members' resource usage and to present that information in a consistent grid-wide view, focusing in particular on CPU and disk storage utilization.
  • For the Fermilab Computing Division, the Grid Accounting Project must provide the tools necessary to report accurately on the CPU usage of all the farm worker nodes installed at Fermilab.
  • Stakeholders: OSG, FNAL Computing Division, CMS

  4. FTEs/Month on Grid Accounting

  5. Strategic Planning
  • From the CD Strategic Plan: Further develop and integrate our management, measurement, planning and analysis processes and information systems to help us plan, execute and measure our work and progress on all of the above goals.
  • From the Grid Strategic Plan, progress indicators:
  • Development project milestones will be tracked and reported.
  • For experiment layers, compliance with standards will be tracked.
  • The following will be reported monthly, taking account of the experiment needs:
  • The % of jobs through grid interfaces per experiment.
  • The % of data through grid interfaces per experiment.
  • Number of sites offering turnkey grid access per experiment.
  • Efficiency of use of grid resources for Monte Carlo production.
  • Number of problems per week and the successful resolution of said problems.
  • Gratia is the cornerstone in achieving those results. The strategic and tactical plans should be updated accordingly.

  6. Architectural Overview [diagram] Each resource provider site runs probes that feed a collector; the collector writes to a repository of accounting records (data store) behind an access layer, exposed through a web-services API (WSAPI) to a statistical analyzer and a web presenter. The same probe -> collector -> repository -> presenter chain is replicated at the VO Center and the Grid Operation Center.
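The probe-to-collector handoff in the diagram can be sketched as below. This is a minimal illustration, not Gratia's real wire format: the element names loosely follow the OGF Usage Record style, and the endpoint URL, field set, and content type are all assumptions.

```python
# Hypothetical sketch of the probe -> collector handoff in the
# architecture diagram. Element names follow the OGF Usage Record
# style; the real Gratia record format and endpoint are richer.
import xml.etree.ElementTree as ET


def build_usage_record(site, vo, user, wall_seconds):
    """Build a minimal usage-record XML document for one finished job."""
    rec = ET.Element("UsageRecord")
    ET.SubElement(rec, "SiteName").text = site
    ET.SubElement(rec, "VOName").text = vo
    ET.SubElement(rec, "LocalUserId").text = user
    ET.SubElement(rec, "WallDuration").text = str(wall_seconds)
    return ET.tostring(rec, encoding="unicode")


def send_to_collector(record_xml, collector_url):
    """POST the record to the collector's web-service endpoint (illustrative URL)."""
    import urllib.request
    req = urllib.request.Request(
        collector_url, data=record_xml.encode(),
        headers={"Content-Type": "text/xml"})
    return urllib.request.urlopen(req)


xml_doc = build_usage_record("USCMS-FNAL-WC1-CE", "cms", "cmsprod", 3600)
```

In the real system the probe would batch and retry such uploads; the point here is only the direction of data flow shown on the slide.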

  7. Gratia Probes
  • Included in OSG 0.6.0: dCache; PBS, LSF, Sun Grid Engine, Condor 6.8 (non WS-GRAM).
  • Also available: RawCPU (psacct).
  • Already deployed at 45 OSG production sites.
  • Will contact more site administrators (in particular ATLAS Tier 2) later this week.
  • Next:
  • Disk storage. The main question will be "What are we measuring?"
  • Probe for Condor 6.8 with WS-GRAM. The question is "Where are the user log files?"
  • Packaging of the probe for Condor 6.9, then improvement (to be able to separate 'used' CPU from CPU 'lost' to evictions, etc.).
  • Display for dCache information.
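The core job of a batch-system probe such as the PBS one is to turn accounting-log records into usage records. A rough sketch, assuming a PBS-style semicolon-delimited "E" (job end) record; the attribute names shown are illustrative, and a real probe must handle many more fields defensively.

```python
# Rough sketch of what a batch-system probe does: parse one PBS-style
# accounting-log "E" (job end) record into the fields an accounting
# record needs. Attribute handling here is simplified for illustration.
def parse_pbs_end_record(line):
    timestamp, rec_type, jobid, attrs = line.split(";", 3)
    if rec_type != "E":  # only job-end records carry final usage
        return None
    fields = dict(kv.split("=", 1) for kv in attrs.split())
    h, m, s = map(int, fields["resources_used.walltime"].split(":"))
    return {
        "jobid": jobid,
        "user": fields["user"],
        "wall_seconds": h * 3600 + m * 60 + s,
    }


line = ("10/03/2007 14:23:01;E;1234.pbs.example.org;"
        "user=alice resources_used.walltime=01:02:03")
rec = parse_pbs_end_record(line)
```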

  8. Gratia Collector
  • Currently only deployed at Fermilab.
  • Wider deployment is waiting on:
  • Writing of proper install and usage documentation. The testers of this install will be FermiGrid and UCSD.
  • [ Implementation of the VOMS-based role authentication. ]
  • Also need some encryption of the DN.
  • Need to find the DN of users for PBS/LSF jobs:
  • Need the help of GRAM for that.
  • Cannot re-use the LCG solution for this; additional code needs to be written.
  • Will be available in a newer GRAM release via their auditing tool.
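One simple hypothetical approach to the "encryption of the DN" item is a keyed one-way hash, so reports can still group jobs by user without exposing the certificate DN itself. This is purely an illustration of the idea, not the scheme Gratia actually adopted; the site key and DN are made up.

```python
# Hypothetical DN-obscuring scheme: a keyed one-way hash gives a stable
# per-user token without revealing the certificate DN. Illustration
# only; not the mechanism Gratia actually implemented.
import hashlib
import hmac


def obscure_dn(dn, site_key):
    """Return a stable, non-reversible token for a user DN."""
    return hmac.new(site_key, dn.encode(), hashlib.sha256).hexdigest()


token = obscure_dn("/DC=org/DC=doegrids/OU=People/CN=Alice Example",
                   b"site-secret")
```

Because the hash is deterministic for a given key, the same user always maps to the same token, which is what per-user aggregation needs.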

  9. Gratia Production
  • Administration of the nodes and services is being transferred to FermiGrid:
  • Configuration of the nodes, including hardware upgrades.
  • Database backup; database spare machine.
  • Virtual machine administration.
  • Strict release mechanism.
  • Not yet transferred:
  • Monitoring of the probes and servers.
  • Data management, i.e. making sure the VOs/Groups are correct.

  10. Gratia Test Stand(s)
  • Need for more formal/detailed testing. 3 types:
  • Full 'Farm' install: testing the full chain Probes -> Collector -> Reports. Need to run all batch systems and known jobs.
  • Web services development: for testing new WS features. Needs a consistent data set.
  • Production clone: to test the install procedure and database schema upgrades.

  11. Gratia @ Fermilab
  • All OSG CEs have the appropriate probe installed.
  • RawCPU probe installed on USCMS-FNAL-WC1-CE and a few head nodes; ready to be deployed to more farms.
  • Ready to gather more input on which reports/graphs are needed.
  • Will set up a stakeholder meeting in the next few weeks.

  12. Graphs
  • Job count per: Site; VO; User; Site for a given VO; VO for a given site.
  • CPU used (wall clock or CPU time) per: Site; VO; User; Site for a given VO; VO for a given site.
  • See http://gratia.opensciencegrid.org:8880/gratia-reporting
  • See http://gratia-fermi.fnal.gov:8882/gratia-reporting
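The aggregations behind the "per Site / per VO" graphs are plain GROUP BY queries. A minimal sketch using an in-memory SQLite table; the real Gratia schema (MySQL) is far richer, so the table and column names here are illustrative only.

```python
# Minimal sketch of the aggregation behind the "job count per VO"
# graphs, on an in-memory SQLite table. Table and column names are
# illustrative; the real Gratia schema is richer.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE JobUsageRecord (site TEXT, vo TEXT, wall_hours REAL)")
db.executemany(
    "INSERT INTO JobUsageRecord VALUES (?, ?, ?)",
    [("FNAL_GPFARM", "cms", 2.0),
     ("FNAL_GPFARM", "dzero", 1.5),
     ("UCSDT2", "cms", 4.0)])

# Job count and wall-clock hours per VO, mirroring the per-VO graphs.
per_vo = db.execute(
    "SELECT vo, COUNT(*), SUM(wall_hours) FROM JobUsageRecord "
    "GROUP BY vo ORDER BY vo").fetchall()
```

The per-site and per-user views are the same query grouped on a different column, with an added WHERE clause for the "per Site for a given VO" variants.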

  13. [Graph] Date range: 2007-02-26 00:00:00 GMT - 2007-03-05 23:59:59 GMT

  14. [Graph] Date range: 2007-02-26 00:00:00 GMT - 2007-03-05 23:59:59 GMT

  15. Daily Reports
  • Report from the job-level Gratia DB: the main report, including # of jobs and wall duration, compared with the previous day.
  • Report from the daily-summary Gratia DB: report on 'legacy' sites (including Panda), compared with the previous day.
  • Job success rate: has been between 75% and 95% overall.
  • Fraction of resources used by the owner of the resource. Many issues: Who owns what? How are owners related to VOs? How to deal with Fermilab's subgroups? Does Minos 'own' any of the Fermilab worker nodes (for the purpose of this report)?
  • No good source of information on the (shared) ownership of the sites; the closest so far is the name of the Support Center. VORS seemingly contains the information.
  • This is trying to answer the metric: Do VOs utilize more resources than would be available to them without OSG?
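The "Job Success Rate" figure reduces to the fraction of jobs whose recorded exit status is zero. A trivial sketch; whether exit status is a reliable success signal is exactly the kind of data-quality question the later slides raise.

```python
# Sketch of the "Job Success Rate" daily-report figure: fraction of
# jobs with a zero exit status. Treating exit status 0 as "success"
# is an assumption about how the records are interpreted.
def success_rate(exit_codes):
    """Fraction of jobs that exited with status 0; None if no jobs."""
    if not exit_codes:
        return None
    ok = sum(1 for code in exit_codes if code == 0)
    return ok / len(exit_codes)


rate = success_rate([0, 0, 0, 1, 0])  # 4 of 5 jobs succeeded
```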

  16. Reporting
  • The requirements document is open for public comment and addition: https://twiki.grid.iu.edu/twiki/bin/view/Accounting/ReportRequirements
  • Also reviewing the current implementation and setting a course to make adding new reports easier.
  • Currently the DB is 'open' to arbitrary (read) SQL queries from the web site (and directly from selected developers). This can lead to (unintentional or not) denial of service. Need to implement the 'Role' lookup for permissions.
  • A couple of external groups are also looking at reports.

  17. Report Examples
  • Check where a particular VO is running and/or whether it is well configured. Used to check the new VO 'Engage'.
  • Check the length of jobs. Used by DZero.

  18. [Graph]

  19. WLCG Reporting
  • Started reporting usage. Uploads are done on a daily basis. The normalization factor is currently estimated.
  • Sites and/or VOs that should report to LCG: CMS Tier 1 and Tier 2; ATLAS Tier 1 and Tier 2 (not all of them are reporting to us yet).
  • Working with EGEE to ensure the data is properly propagated. As of today the data is visible at the EGEE GOCDB; it still needs to be propagated for the FNAL Tier 1 entry as well.

  20. Upcoming Challenges
  • Data quality: verify and understand the discrepancies between the numbers reported by Condor and the numbers reported by the RawCPU probe (psacct).
  • So far only anecdotal evidence of problems, often obscured by other issues (failures of GRAM-based collection, failures of psacct collection, weird overlaps). No clear reproducible pattern detected yet.
  • Implement a better estimate of normalized CPU used (for OSG). This requires a notion of the 'power' of the worker node, which could be either:
  • a performance index passed along with the usage record, or
  • a description of the CPU (better, since we can then change the index being used, e.g. from SpecInt 2000 to SpecInt 2006).
  • Could/should come from the GLUE schema; we already have the hostname of the worker node. The probe (near the batch system) or the collector (central place) needs to acquire the information. Should be able to use the results of the site survey just finished by OSG.
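The normalization described above amounts to scaling raw wall-clock time by a per-node benchmark rating relative to a reference machine. A sketch under stated assumptions: the SpecInt-style ratings and the reference value are made up, and the real choice of index is exactly the open question on the slide.

```python
# Sketch of CPU-time normalization: scale wall-clock hours by the
# worker node's benchmark rating relative to a reference node.
# The SpecInt-style values here are invented for illustration.
REFERENCE_SPECINT = 1000.0  # hypothetical rating of the reference node


def normalized_hours(wall_hours, node_specint):
    """Wall-clock hours expressed in reference-node hours."""
    return wall_hours * node_specint / REFERENCE_SPECINT


# A 10-hour job on a node rated twice the reference counts as 20
# normalized hours; on a node half as fast, as 5.
fast = normalized_hours(10.0, 2000.0)
slow = normalized_hours(10.0, 500.0)
```

Keeping a CPU description in the record, rather than a precomputed index, lets the reference benchmark be swapped later (e.g. SpecInt 2000 to SpecInt 2006) without re-collecting data, which is the argument the slide makes.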

  21. Monitoring the Accounting
  • Site status of the accounting probes.
  • Site administrators / the GOC can start taking advantage of the site status web page to ensure their sites are reporting as expected: http://gratia.opensciencegrid.org:8880/gratia-administration/monitor-status.html?probename=condor:cmslcgce.fnal.gov

  22. Accounting Project and Metrics
  • Extension of our charge to provide some of the OSG metrics.
  • Metrics include but are more than usage accounting (see Ruth's presentation). Other metrics will come from Operations, Users, etc.
  • With the OSG 0.6 release the Accounting Project will start to collect data from, and provide information to enable answering, some of these questions:
  • Site resources provided: from GIP/GLUE.
  • CPU utilization: by site, by VO.
  • Data transport from SRM/dCache-based Storage Elements (SEs). (Plan to add information from GridFTP-based SEs.)
  • Current data is incomplete due to the lack of deployment of probes. With the 0.6 release all sites MUST report accounting information.
  • In the next few months the validity of the data will be verified and the accuracy improved.

  23. Metrics Accounting Can Provide
  • We can get some idea of how efficiently OSG is using facilities from Gratia accounting data. Gratia provides information about utilization; GIP/GLUE provides a description of the facility and basic monitoring. From these we can answer questions similar to the following:
  • Facility capability: How much storage is available? Total computing power? Job slots available?
  • What is the availability of sites? What part of a site's facilities is available to OSG?
  • How many jobs were processed? How many jobs vs. slots available?
  • Do VOs utilize more resources than would be available to them without OSG?
  • How big are the jobs being submitted? Average size? Maximum?
  • What % of jobs fail? Due to user error? Due to Grid failure?

  24. OSG Validation DB
  • OSG must report validation information to WLCG.
  • OSG is also considering re-using the Gratia infrastructure to transmit and store this validation information.
