Consistency of Accounting Information with DGAS
Rosario Piro, Andrea Guarise, Riccardo Brunetti, Luciano Gaido, Giuseppe Patania, Paolo Veronesi
JRA1 All Hands Meeting, Espoo, June 18-20, 2007.
Job Classification
From the accounting point of view, executed jobs can be classified into the following categories:
We need to account for at least categories I-III!
Problems
• Accounting “grid” jobs (with grid-related info; cat. I) is mostly straightforward, although some ‘features’ of the job submission chain and of the underlying services make proper accounting difficult even in the trivial cases.
• Accounting “out-of-band”, “local VO” and “local” jobs (cat. II-IV) is a non-trivial task:
  • Risk of record duplication for certain site configurations, e.g. one LRMS head node for multiple CEs (sensors on the CEs read the same LRMS log file to get usage information). The DGAS HLR server checks incoming records for possible duplication! Many circumstances can result in record duplication, and each of them must be considered before accepting the insertion request for a Usage Record.
  • Using pool accounts to determine the VO is risky, e.g. wrong mapping of credentials to pool accounts can occur (real case: “/biomed/...” -> “cms003”). This is not a problem if the FQAN is available. Unfortunately many jobs are still submitted with plain, non-VOMS credentials; this practice should be strongly discouraged. DGAS now makes the use of pool accounts optional.
  • Using a mapping from local user and group accounts to VOs requires an appropriate and up-to-date configuration. DGAS allows site administrators to map their local users and/or groups to specific VOs, separately for each CE of the site.
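The shared-head-node case can be illustrated with a small sketch. It is not the DGAS implementation: the slides do not describe how the HLR server detects duplicates, so the record fields, the choice of key and the `RecordStore` class are all hypothetical. The assumption is that a job is identified by its LRMS host, LRMS job ID and start time, and not by the CE that reported it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    # All field names are hypothetical, chosen for illustration only.
    lrms_host: str   # LRMS head node that ran the job
    lrms_id: str     # job identifier assigned by the LRMS
    start_time: int  # job start, seconds since the epoch
    ce_id: str       # CE whose sensor reported the record
    cpu_time: int    # consumed CPU seconds

    def key(self):
        # The key deliberately excludes ce_id: two CEs sharing one LRMS
        # head node read the same log and report the same job twice.
        return (self.lrms_host, self.lrms_id, self.start_time)


class RecordStore:
    """In-memory stand-in for the HLR database (assumption)."""

    def __init__(self):
        self._records = {}

    def insert(self, record):
        """Store the record; return False if it is a duplicate."""
        k = record.key()
        if k in self._records:
            return False
        self._records[k] = record
        return True

    def __len__(self):
        return len(self._records)


store = RecordStore()
first = UsageRecord("lrms01.example.org", "12345", 1180000000, "ce01", 3600)
# Same job, read from the same LRMS log by the sensor on a second CE.
second = UsageRecord("lrms01.example.org", "12345", 1180000000, "ce02", 3600)
accepted_first = store.insert(first)
accepted_second = store.insert(second)
```

A real check would have to cover more circumstances than this single key (for example resubmitted or requeued jobs), which is why each case must be examined before accepting an insertion request.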
Important
A thorough and pedantic verification of accounting information is ESSENTIAL!
• Cross-check of accounting records with LRMS log files! How much information do we lose?
• Cross-check of local accounting records (on sites) with what ends up in the GOC DB!
• Make sure only true accounting information can end up in the GOC DB (can normal users publish fake accounting records in RGMA?)
DGAS (simplified) Workflow
[Figure: DGAS workflow diagram. Elements shown: job, CE, WN, Usage Record, Site HLR, VO/User HLR, L2 HLR, VO L2 HLR (records for a given VO), hierarchy of L2 HLRs, Usage Records from another site; steps numbered 1 to 3.]
DGAS2APEL workflow
• DGAS2APEL is a process that converts the Usage Records from the format adopted by DGAS to the one adopted by APEL. Converted records are then inserted in an RDBMS table known as LcgRecords.
• These records are then forwarded to the GOC by APEL itself via its ‘apel-publisher’ process, which uses RGMA as a high-level transport service toward the GOC.
[Figure: HLR DB -> dgas2apel -> LcgRecords -> apel-publisher -> R-GMA -> GOC DB]
Deployment status
• Site HLRs forwarding UR to L2 HLR: LNF, Pisa, Bari, Milano, Catania, Napoli, Torino.
• Sensors installed on 43 sites.
Consistency Checks for DGAS
• For INFN-Grid we have monitored the consistency of accounting information in DGAS.
  • This helped us to discover and solve problems we didn't even imagine ...
  • It helped us to end up with a more complete picture of resource usage by VOs.
• In our opinion the following set of checks is needed:
  • Comparison between data in the LRMS logs and on the Site HLR server.
    • To check whether DGAS is correctly collecting usage records.
  • Comparison between data on the Site HLR and data converted by DGAS2APEL.
    • To check whether DGAS2APEL correctly translates into the LcgRecords format all the records (and only those) that we plan to forward to the GOC.
  • Comparison between data on the Site HLR and data published via DGAS2APEL (conversion) + APEL Publisher (forwarding to the GOC).
    • To check that the information is correctly forwarded to the GOC by APEL Publisher and RGMA.
  • Comparison between data in the LRMS logs and APEL Parser + Publisher (without forwarding to the GOC).
    • Not strictly related to DGAS operation: to check whether APEL sensors are correctly collecting usage records.
LRMS logs vs. Site HLR (1)
In order to cross-check the information available in the plain LRMS log files with the filtered Usage Records on the Site HLR, the following methodology was adopted:
• A script parses the LRMS logs and inserts the information needed for the checks in a relational database, trying to reflect the way some of this information is filtered by the DGAS algorithms (for example, the start date of a job is not straightforward to determine, and this must be taken into account when performing the checks).
• A set of aggregates representing the same quantities is derived from both datasets (HLR and LRMS) and compared.
• If the cross-check script and queries are properly tuned the results should match, apart from minor differences due to small (but unavoidable) differences in the aggregation of the single records (rounding, boundary conditions, slight differences in the time partitioning of the datasets…).
• When significant differences are found, an in-depth analysis is performed to identify their causes. If a bug is found in DGAS it is fixed; if the problem is in the site configuration, the configuration is changed and the checks are performed again when new information is available.
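The aggregate comparison can be sketched as follows. This is only an illustration under stated assumptions, not the cross-check script used for INFN-Grid (which works on a relational database): records are plain dictionaries, the field names `vo`, `day` and `cpu_seconds` are hypothetical, and the LRMS logs are taken as the reference when computing the relative difference.

```python
from collections import defaultdict


def aggregate(records):
    """Sum job count and CPU seconds per (vo, day) bucket."""
    totals = defaultdict(lambda: [0, 0])
    for rec in records:
        bucket = (rec["vo"], rec["day"])
        totals[bucket][0] += 1
        totals[bucket][1] += rec["cpu_seconds"]
    return {bucket: tuple(values) for bucket, values in totals.items()}


def relative_diff_percent(reference, observed):
    """Absolute difference relative to the reference, in percent."""
    if reference == 0:
        return 0.0 if observed == 0 else float("inf")
    return abs(observed - reference) / reference * 100.0


def compare(lrms_records, hlr_records):
    """Compare the two datasets bucket by bucket, LRMS as reference."""
    lrms = aggregate(lrms_records)
    hlr = aggregate(hlr_records)
    report = {}
    for bucket in sorted(set(lrms) | set(hlr)):
        lrms_jobs, lrms_cpu = lrms.get(bucket, (0, 0))
        hlr_jobs, hlr_cpu = hlr.get(bucket, (0, 0))
        report[bucket] = {
            "jobs_diff_pct": relative_diff_percent(lrms_jobs, hlr_jobs),
            "cpu_diff_pct": relative_diff_percent(lrms_cpu, hlr_cpu),
        }
    return report


# Toy data: the HLR is missing one of four jobs for this VO and day.
lrms_records = [
    {"vo": "cms", "day": "2007-05-23", "cpu_seconds": 100} for _ in range(4)
]
hlr_records = lrms_records[:3]
report = compare(lrms_records, hlr_records)
```

Buckets present in only one dataset are reported too, since a missing bucket is itself a discrepancy worth analysing.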
LRMS logs vs. Site HLR (2)
Sites: HLR/LRMS logs
Legend: Green: x <= 0.25%; Yellow: 0.25% < x <= 1%; Red: x > 1%
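The colour legend can be written as a small classification function. This is a sketch only: the function name is hypothetical, and it assumes that x is a non-negative discrepancy expressed in percent.

```python
def classify_discrepancy(x_percent):
    """Map a discrepancy in percent to the colour used in the legend."""
    if x_percent < 0:
        raise ValueError("discrepancy must be non-negative")
    if x_percent <= 0.25:
        return "green"
    if x_percent <= 1.0:
        return "yellow"
    return "red"


colours = [classify_discrepancy(x) for x in (0.0, 0.25, 0.5, 1.0, 1.5)]
```

Both boundary values fall into the lower band, following the "<=" signs of the legend.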
LRMS logs vs. Site HLR (3)
Site HLR/LRMS logs: T1
Legend: Green: x <= 0.25%; Yellow: 0.25% < x <= 1%; Red: x > 1%
This cross-check was performed after the latest DGAS upgrade at the T1 site and covers the period from 2007-05-23 to 2007-06-03 (boundaries included). In this view the cross-checks were done for each of the major VOs. The results need no further comment.
Site HLR vs. DGAS2APEL (1)
HLR/DGAS2APEL consistency check in ‘Torino’
Legend: Green: x <= 0.25%; Yellow: 0.25% < x <= 1%; Red: x > 1%
This is a cross-check, for the period 01/05/2007 - 14/06/2007, of the information available in the HLR database against the LcgRecords table generated by DGAS2APEL, both local to the ‘Torino’ site.
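Besides comparing aggregates, the requirement that DGAS2APEL translate all the selected records, and only those, can be checked record by record. The sketch below assumes that each record carries a unique identifier present in both the HLR database and the LcgRecords table; the identifiers and the function name are hypothetical.

```python
def diff_record_sets(hlr_ids, lcg_ids):
    """Return the IDs lost in conversion and the IDs that should not be there."""
    hlr, lcg = set(hlr_ids), set(lcg_ids)
    return {
        "missing": sorted(hlr - lcg),     # selected on the HLR, not converted
        "unexpected": sorted(lcg - hlr),  # converted, but not selected on the HLR
    }


result = diff_record_sets(["job-1", "job-2", "job-3"], ["job-2", "job-3", "job-9"])
```

Two empty lists mean that the conversion step neither dropped nor added records for the period under test.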
What we have learnt
• The cross-checks for the sites were performed on the period September ’06 - January ’07. As can be seen, although the results were fairly good (average discrepancies were around 1%, mainly concentrated on a few sites), we used them as a starting point to analyse the records and find the source of the errors. A number of bugs were found and fixed in two subsequent releases of DGAS.
• Not all sites were affected by the bugs, since these usually involved only sites with complex configurations (or, as in the case of ‘Bari’, sites mainly running long-lasting jobs).
• The latest available release of DGAS is the one deployed at CNAF-T1 (using LSF), whose consistency checks are illustrated in the previous slides; it is being deployed all over INFNGrid.
• Note that the checks require a huge amount of work and are very time consuming. During the period of the checks, from September ’06 to January ’07, one of the DGAS developers was dedicated full time to them, and all the sites involved also spent a non-negligible amount of time on them.
• For this reason further checks are no longer performed systematically, but only on some sites after the deployment of a new release (for example the T1 checks illustrated in this talk), or when needed (as in the case of major changes in the site configuration).
Consistency Checks for APEL?
• In order to perform some consistency checks for APEL as well, we tried to set up the apel-pbs-log-parser and the apel-publisher on one production CE, so as to compare APEL accounting data with the LRMS and DGAS.
  • We configured the apel-pbs-log-parser and ran it manually.
  • We configured the apel-publisher so that it would not send data to the GOC:
    • <Republish>nothing</Republish>
• However, we didn’t manage to fill the LcgRecords table, since we continuously hit problems such as the following:
  • Unable to locate an available Registry Service
  • Read timed out to: https://grid009.to.infn.it:8443/RGMA/PrimaryProducerServlet/declareTable?connectionId=783683734&tableName=LcgRecords&predicate=&hrpSec=600&lrpSec=3600
  • No records joined (apparently failed to merge with the gatekeeper log files ??)
• In nearly one month of tests this made it impossible to compare the two systems.
• Even worse, the same errors occur many times when trying to publish data from the local DGAS2APEL LcgRecords table to the GOC (GGUS ticket 21637). Trying to track and fix these failures is frustrating and time consuming.
• The source of these errors seems to be RGMA, its standard configuration on the sites, or the way apel-publisher uses it. As far as we know, APEL is expected to be able to send LcgRecords to the GOC using transport mechanisms other than RGMA. Is it possible to agree on another transport mechanism and switch to it? (directly use MySQL? L2HLR at GOC?)
gLite Restructuring
Concerning the status of the code restructuring, the main activities are:
• Restructuring of the sensors (pushd/urcollector):
  • To achieve a better decoupling between the production of the UR on the CE (needed also for interoperability with OSG) and their forwarding to the HLR.
• Rewrite of DGAS2APEL in order to:
  • Drop the dependencies on perl-DBD and perl-DBI (a source of portability problems in the past).
  • Be able to run DGAS2APEL also on Second Level HLRs (L2HLRs) and not just on Site HLRs.
  • The C++ implementation allows for better performance and reuse of code already developed for the HLR, making the code easier to maintain. (Work 50% done.)
• Adoption of a common logging format:
  • The production release is already able to log via the SYSLOG facility.
  • Waiting for a proper definition of the logging format to complete the task.
Future plans
Once these activities are over, including full support via ETICS for the reference platforms, we plan to freeze the code as much as possible (i.e. just critical bug fixes) and proceed with a deeper restructuring, focusing on:
• Easier configuration: introduce as much automatic tuning of the configuration parameters as possible, in order to reduce the effort required of system managers.
• Code clean-up: remove obsolete and unused code to make the code easier to understand for new (and also old) developers.
• Database schema clean-up: many years of new features added on demand, without proper general planning, have resulted in a complex database schema that needs to be revised.
• Code profiling and optimization: in order to tune the (already good) performance, mainly in the query engines.
Web Interface to DGAS HLR: HLRMon (Work in progress)
HLRMon (1)
HLRMon, the web interface to DGAS, is being developed by: F. Pescarmona, S. Dalpra, F. Rosso, G. Misurelli, E. Fattibene, G. Patania
HLRMon (2)
• Shows accounting data in aggregate form.
  • A set of predefined aggregates is built using the data available on DGAS HLRs.
  • It is mainly intended as an interface toward Second Level HLRs.
• The user is identified by means of their certificate and is allowed to plot charts according to their own VO role.
• These pre-defined roles are currently available:
  • Normal User
  • VO Manager
  • Site Manager
  • ROC Manager
• The capability to completely customize the queries, as in the CLI interface, is foreseen (but special attention must be paid to authorization).
Conclusions
• DGAS is deployed on the Italian Production Grid. During the last year it has been thoroughly evaluated and was subject to a fast turnaround cycle of user-driven improvements.
• Our experience demonstrated that it is crucial to pedantically cross-check the information available in the relational databases with the raw source of this information. This allows for immediate discovery of configuration problems, bugs or undesired behaviour.
  • However, this task is very difficult and time consuming.
• The DGAS sensors and HLR server infrastructure have proven able to account job usage metrics with good levels of reliability and precision, up to the scale of the average output of a T1.
• We had many problems (with R-GMA??) using ‘apel-publisher’ to send the Usage Records produced by DGAS2APEL to the GOC repository. Are alternative transport mechanisms available? (directly use MySQL? L2HLR at GOC?)
• A full-featured web interface is in development, and a first public version will be presented shortly.
• Now that the core system has proven stable enough and offers all the required functionality, we plan to freeze the development of new features and concentrate on cleaning up the code and improving the overall user friendliness.
References
• Information on DGAS can be found at:
  • http://www.to.infn.it/grid/accounting
• Problems with DGAS can be reported to dgas-support[AT]to.infn.it