
Accounting in LCG

This article provides a brief overview of accounting in LCG with the APEL toolkit, including integration with OSG and gLite. It covers data collection, transport, high-level aggregation and reporting, and demos of accounting aggregation.



1. Accounting in LCG
   Dave Kant, CCLRC e-Science Centre

2. APEL in LCG/EGEE
   1. Quick overview
   2. The current state of play
   3. Integration with OSG
   4. Accounting in gLite

3. Overview
   • Data collection via sensors
   • Transport via RGMA
   • High-level aggregation and reporting via a graphical front-end (tables, pies, Gantt charts, metrics, tree views)

4. Component View of APEL
   Sensors (deployed at each site):
   • Process log files; map DNs to batch usage
   • Build accounting records: DN, CPU, WCT, SpecInt2000, etc. (a minimal record-building sketch follows below)
   • Account for grid usage (jobs) only
   • Support PBS, Sun Grid Engine, Condor, and LSF
   • Not REAL-TIME accounting
   Data transport:
   • Uses RGMA to send data to a central repository
   • 196 sites publishing; 7.7 million job records collected
   • Could use other transport protocols
   • Allows sites to control exports of DN information from the site
   Presentation (GOC and regional portal):
   • LHC view, EGEE view, GridPP view, site view
   • Reporting based on data aggregation
   • Metrics (e.g. time-integrated CPU usage)
   • Tables, pies, Gantt charts
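To make the record-building step concrete, here is a minimal sketch in Python. It is not the actual APEL sensor code: the field names, the dictionary schema of the batch-log entry, and the DN-mapping lookup are all assumptions based on the record fields listed above.

```python
# Hypothetical sketch of an APEL-style sensor record builder (not the
# real APEL code). It joins a batch-log usage entry with a gatekeeper
# DN mapping and attaches the site's published SpecInt2000 rating.

from dataclasses import dataclass

@dataclass
class JobRecord:
    dn: str            # grid user DN (from the gatekeeper mapping)
    local_job_id: str  # batch system job id
    cpu_seconds: int   # CPU time consumed
    wct_seconds: int   # wall-clock time
    spec_int2000: int  # site's published SpecInt2000 rating

def build_record(batch_entry: dict, dn_map: dict, site_si2k: int):
    """Return a record only for grid jobs, i.e. jobs whose local id
    appears in the DN mapping (APEL accounts for grid usage only)."""
    job_id = batch_entry["job_id"]
    dn = dn_map.get(job_id)
    if dn is None:
        return None  # local (non-grid) job: skipped
    return JobRecord(dn, job_id,
                     batch_entry["cpu"], batch_entry["wct"], site_si2k)
```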

5. Demos of Accounting Aggregation
   Global views of CPU resource consumption.
   • LHC View
     http://goc.grid-support.ac.uk/gridsite/accounting/tree/treeview.php
      Shows aggregation for each LHC VO
      Requirements driven by the RRB
      Tier-1 sites and countries are the entry points
      LHC VOs only
      All data normalised in units of 1000 · SI2000 · hours
   • GridPP View
     http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.php
      Shows aggregation for an organisation at Tier-1/Tier-2 level
   • EGEE View (new!)
     http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
      Regional views and detailed site-level reporting
      Active development by CESGA/RAL
      Pablo Rey Mayo, Javier Lopez, Dave Kant

6. VOs/LCG/EGEE Requirements
   • One-line summary: "How much was done, and who did it?"
   • High-level anonymous reporting (see the aggregation sketch below)
      How much resource has been provided to each VO
      Aggregation across VOs, countries, regions, grids, and organisations
      Granularity (time frame): weekly, quarterly, annually
   • Finer granularity at the user level
      If 10,000 CPU hours were consumed by the ATLAS VO, who are the users that submitted the work?
      Data privacy laws apply: a grid DN is personal information that could be used to target an individual.
      Who has access to this data, and how do you get it?
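As an illustration of the anonymous-reporting requirement, here is a hedged sketch (the record schema and field names are assumptions, not APEL's actual tables) that aggregates per-VO CPU totals while keeping DNs out of the published result:

```python
# Illustrative VO-level anonymous aggregation: raw records carry DNs,
# but the published summary exposes only per-VO totals.

from collections import defaultdict

def vo_summary(records):
    """records: iterable of dicts with 'vo', 'dn', 'cpu_hours' keys
    (an assumed schema). Returns {vo: total_cpu_hours} with no DNs."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["vo"]] += rec["cpu_hours"]
    return dict(totals)

records = [
    {"vo": "atlas", "dn": "/C=UK/O=eScience/CN=alice", "cpu_hours": 6000.0},
    {"vo": "atlas", "dn": "/C=UK/O=eScience/CN=bob",   "cpu_hours": 4000.0},
    {"vo": "cms",   "dn": "/C=UK/O=eScience/CN=carol", "cpu_hours": 1500.0},
]
print(vo_summary(records))  # {'atlas': 10000.0, 'cms': 1500.0}
```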

7. APEL Developments
   • Extending batch-system support (testing phase)
      Support for Condor and SGE. Both are being tested: SGE by CESGA and Condor by GridPP. Unofficial releases are available on the APEL home page.
      http://goc.grid-support.ac.uk/gridsite/accounting/sge.html
      http://goc.grid-support.ac.uk/gridsite/accounting/condor-prelim.html
   • Gap Publisher (testing phase)
      Provides sites with better tools to identify missing data and publish it into the archiver. The reporting system uses Gantt charts to identify gaps, and enhancements to the publisher module are being tested. (A sketch of the gap-detection idea follows below.)
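The gap detection behind those Gantt charts can be illustrated with a small sketch. This is purely illustrative, not the real Gap Publisher: given the set of days for which a site has published records, it reports the missing intervals.

```python
# Illustrative sketch of finding publication gaps: given the dates a
# site has published accounting data for, report the missing runs.

from datetime import date, timedelta

def find_gaps(published_days, start, end):
    """Yield (gap_start, gap_end) date ranges with no published data."""
    gap_start = None
    day = start
    while day <= end:
        if day not in published_days:
            gap_start = gap_start or day
        elif gap_start:
            yield (gap_start, day - timedelta(days=1))
            gap_start = None
        day += timedelta(days=1)
    if gap_start:
        yield (gap_start, end)

published = {date(2006, 3, d) for d in (1, 2, 3, 7, 8)}
for gap in find_gaps(published, date(2006, 3, 1), date(2006, 3, 10)):
    print(gap)  # gaps: 4-6 March and 9-10 March
```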

8. APEL Issues (1)
   • Normalisation (under investigation, CESGA/RAL)
      Recall that, in order to account for usage across heterogeneous compute farms, data are scaled to a common reference; in LCG the reference scale is 1K·SI2000.
      A job record's scale factor is SI2000_published_by_site / reference (see the sketch below).
      Some sites have a large number of job records where the site SI2000 is zero.
      Identify such sites via the reporting tools and provide a recipe to fix them.
   • APEL memory usage (important, will become urgent)
      Site databases are growing ever larger: APEL requires more memory in order to join records (the RAL Tier-1 requires 2 GB of RAM for a full build).
      Implement a scheme to reduce the number of redundant records used in the join process: flag rows used in a successful build and delete them, as they are no longer needed.
   • DN accounting?
      Should APEL account for local usage as well as grid usage?
      BNL recently sent us data that included both grid and local usage.
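The normalisation arithmetic from the first bullet, written out as a short sketch (the function and parameter names are illustrative; the scale factor itself is the one stated on the slide):

```python
# Sketch of the normalisation described above: CPU usage is scaled by
# the site's published SpecInt2000 rating against the 1K·SI2000
# reference. Names are illustrative, not APEL's.

REFERENCE_SI2K = 1000  # LCG reference scale: 1K · SI2000

def normalised_cpu_hours(cpu_hours: float, site_si2k: float) -> float:
    """Scale raw CPU hours to units of 1000 · SI2000 · hours.
    A site_si2k of zero is the pathological case noted above:
    all of that site's jobs normalise to zero and skew the reports."""
    return cpu_hours * (site_si2k / REFERENCE_SI2K)

# 10 CPU hours on 1500-SI2K nodes counts as 15 normalised hours;
# on a zero-rated site it counts as 0.
print(normalised_cpu_hours(10.0, 1500))  # 15.0
print(normalised_cpu_hours(10.0, 0))     # 0.0
```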

9. APEL Issues (2)
   • Handling large log files (under investigation)
      Condor history and SGE batch logs are very large (> 1 GB).
      Large logs are problematic: reading and storing records inline takes a large amount of memory, and application run time grows. We don't want to re-read data that was parsed on a previous run (efficiency; one possible approach is sketched below).
      Develop an efficient way to parse these logs? Ask batch-log providers to support log rotation? Or provide a recipe to site admins?
      A recipe for site admins only half-works, as events are lost: event data is split over multiple lines.
   • RGMA queries to the central repository
      Query response time is very slow; this prevents some sites from checking that continuous consumers are actually listening for data.
      Data would need to be archived from the central repository to another database in order to speed up such queries.
      Not an issue for the reporting front-end.
      Does not appear to be something sites urgently need (requested by IN2P3-CC).
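One possible approach to the re-reading problem, shown as an assumption-laden sketch rather than APEL's implementation: remember the byte offset reached on the previous run and resume from there, instead of re-parsing the whole file.

```python
# Illustrative incremental log reader: persist the byte offset reached
# on the last run and resume from it, so old data is never re-read.

import os

def read_new_lines(log_path: str, state_path: str):
    """Yield only lines appended to log_path since the last run."""
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = int(f.read().strip() or 0)
    # If the log was rotated or truncated, the saved offset is stale.
    if offset > os.path.getsize(log_path):
        offset = 0
    with open(log_path, "rb") as log:
        log.seek(offset)
        for raw in log:
            yield raw.decode(errors="replace")
        offset = log.tell()
    with open(state_path, "w") as f:
        f.write(str(offset))
```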

10. Integration with OpenScienceGrid
   • A few OSG sites have deployed a minimal LCG front-end to publish accounting data into the APEL database (GOCDB registration + APEL sensors + RGMA MON node)
      Successful deployment at the University of Indiana (PBS and Condor data published)
   • Due to (subtle) differences in the grid middleware, APEL's core library must be modified to build accounting records in the OSG environment.
      LCG: DN → local batch jobId mappings are encoded within three log files (LCG job manager)
      OSG: DN → local batch jobId mappings are in a single log file (Globus job manager?)
   • Main issues under consideration
      Currently there are THREE versions of the APEL core library, each sharing common batch-system plugins: the LCG production release, gLite 3 development, and OSG development
      Refactor the core library to create a new LCG/gLite/OSG plugin?
      A more sensible approach would be to use a *common* accounting file in BOTH gLite and OSG to provide the grid DN → local batch jobId mapping
      Need a common agreement on log rotation: prefer logname-YYYYMMDD.gz (static file name) to logname-1.gz (not static); see the sketch below
   • Very much in the early stages; we need some common agreements and more understanding of OSG middleware before proceeding.
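Why the slide prefers date-stamped rotation can be shown in a few lines. This is an illustrative sketch, not APEL code: with logname-YYYYMMDD.gz a file's name never changes once written, so a sensor can safely remember which files it has already processed; with logname-1.gz the same name points at different content after each rotation, so bookkeeping by name breaks.

```python
# Hypothetical bookkeeping over date-stamped rotated logs: stable file
# names make "already processed" tracking trivial and reliable.

import glob

def unprocessed_logs(pattern: str, processed: set) -> list:
    """Return rotated log files not yet accounted for, oldest first.
    'processed' holds the file names handled on earlier runs."""
    return sorted(f for f in glob.glob(pattern) if f not in processed)

processed = {"gatekeeper-20060301.gz", "gatekeeper-20060302.gz"}
todo = unprocessed_logs("gatekeeper-*.gz", processed)
```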

11. Accounting in gLite 3
   • In gLite, the BLAH daemon (provided by Condor) is used to mediate jobs between the WMS and the compute element. Consequently, the accounting information needed by APEL is no longer in the gatekeeper logs but is found elsewhere, e.g. in the local user's home directory. An accounting mapping file has been proposed by DGAS and implemented by the gLite middleware developers to simplify the process of building accounting records.
      For mapping grid-related information to the local job ID
      Independent of the submission procedure (WMS or not ...)
      No services or clients required on the WN
      Format (one line per job, daily log rotation; a parsing sketch follows below):
       timestamp=<submission time to LRMS> userDN=<user's DN> userFQAN=<user's FQAN> ceID=<CE ID> jobID=<grid job ID> lrmsID=<LRMS job ID> localUser=<uid>
   • Already implemented for BLAH (and CREAM); work in progress for LCG
   • Did not make it into gLite 3.0 – no accounting for the gLite CE
   • APEL development to begin in April (D. Kant)
   • Development and testing expected to take most of April
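A minimal sketch of parsing that mapping-file format. The field layout comes from the slide; the parsing code itself (and the example values) is illustrative. Note that values such as DNs contain '=' and spaces, so the sketch anchors on the known field names rather than splitting on whitespace.

```python
# Illustrative parser for the accounting mapping file described above:
# one line per job, fields given as key=value in a fixed order.

import re

_PATTERN = re.compile(
    r"timestamp=(?P<timestamp>.*?)\s+"
    r"userDN=(?P<userDN>.*?)\s+"
    r"userFQAN=(?P<userFQAN>.*?)\s+"
    r"ceID=(?P<ceID>.*?)\s+"
    r"jobID=(?P<jobID>.*?)\s+"
    r"lrmsID=(?P<lrmsID>.*?)\s+"
    r"localUser=(?P<localUser>\S+)\s*$"
)

def parse_mapping_line(line: str):
    """Return the record fields for one job, or None if malformed."""
    m = _PATTERN.match(line.strip())
    return m.groupdict() if m else None

line = ('timestamp=2006-03-01T12:00:00 userDN=/C=UK/O=eScience/CN=alice '
        'userFQAN=/atlas/Role=production ceID=ce.example.org:2119 '
        'jobID=https://lb.example.org/123 lrmsID=4567.pbs localUser=504')
print(parse_mapping_line(line)["userDN"])  # /C=UK/O=eScience/CN=alice
```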

12. DGAS
   • DGAS meets some requirements for privacy of user identity
      User job info is readable only by the user, the site manager, and the VO manager
   • DGAS cannot aggregate info across the whole Grid
   • Solution 1: DGAS sensors also publish anonymous data to the central APEL repository (sketched below)
      User details remain available in the DGAS HLR for the VO
   • Solution 2: a higher-level repository that all HLRs can publish into
      GGF Resource Usage Service – RHUL is working on an implementation
   • BUT DGAS is not in gLite 3.0
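Solution 1 amounts to stripping the identifying fields before publishing centrally while keeping the full record locally. A small sketch of that idea, with field names that are assumptions rather than DGAS's actual schema:

```python
# Illustrative "Solution 1": keep the full record (with user identity)
# for the local DGAS HLR, publish an anonymised copy centrally.

PRIVATE_FIELDS = {"userDN", "userFQAN", "localUser"}

def anonymise(record: dict) -> dict:
    """Return a copy safe for central, grid-wide aggregation."""
    return {k: v for k, v in record.items() if k not in PRIVATE_FIELDS}

full = {"userDN": "/C=UK/CN=alice", "vo": "atlas",
        "cpu_hours": 12.5, "site": "RAL-LCG2"}
keep_in_hlr = full                   # full detail stays in the VO's HLR
publish_centrally = anonymise(full)  # {'vo': 'atlas', 'cpu_hours': 12.5, 'site': 'RAL-LCG2'}
```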

13. Summary
   • We have a working accounting system
   • But work is still required
      to keep it working
      to meet (conflicting?) outstanding requirements for privacy and user information
