1 / 16

CREAM: ALICE Experience

This talk discusses the deployment status of CREAM-CE and the future version, including feedback and contributions from various sites and developers. It covers issues related to job status reporting and purging, as well as disk space management.

jconrad
Download Presentation

CREAM: ALICE Experience

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CREAM: ALICE Experience WLCG GDB Meeting, CERN 11th November 2009 Stefano Bagnasco (INFN-Torino), Jean-Michel Barbet (Subatech), Latchezar Betev (ALICE), Catalin Condurache (RAL), Sergio Fantinel (INFN-Legnaro), Stefano Lusso (INFN-Torino), Patricia Méndez Lorenzo (CERN, IT/GS), Francesco Noferini (INFN-CNAF), Derek Ross (RAL) and Massimo Sgaravatto (CREAM development team, INFN-Padova)

  2. Thanks • This talk includes the feedback and the contributions from • Subatech: Jean-Michel Barbet • INFN-Torino: Stefano Lusso and Stefano Bagnasco • INFN-CNAF: Francesco Noferini • RAL-LCG2: Catalin Condurache and Derek Ross • INFN-Legnaro: Sergio Fantinel • INFN-Padova and CREAM CE developers team: Massimo Sgaravatto CREAM: ALICE Experience

  3. CREAM-CE: Deployment status • Current CREAM-CE service • Production version: CREAM1.5 (glite-CREAM-3.1.20) • Deployed in production by the 6th of October (patch #3259 for SLC4/i386) • https://savannah.cern.ch/patch/?func=detailitem&item_id=3259#options • Features: • Important bug and security fixes (pointed by the GSVG) • http://www.gridpp.ac.uk/gsvg/advisories/advisory-55615.txt • http://www.gridpp.ac.uk/gsvg/advisories/advisory-55616.txt • Migration of sites to CREAM1.5 was highly encouraged by that time and ALICE fully support it for all sites providing this service for the experiment • Outlook of this talk: • During the last GDB (14/10/09) we made a list of all issues reported by the site admins in terms of CREAM-CE and based on the experiences gained with the ALICE production • Now (one month of operations later) we have collected the feedback from several sites already using CREAM1.5 CREAM: ALICE Experience

  4. CREAM-CE: Future version • Future CREAM-CE service • Production version: CREAM1.6 • Status ready for certification/certified (expected) by December 2009 • TASK #9734: • https://savannah.cern.ch/task/?9734 • PATCHES #3179 and #3209 • https://savannah.cern.ch/patch/?3179 Release 1.6 of CREAM CE for sl5_x86_64 • https://savannah.cern.ch/patch/?3209 YAIM-CREAM-CE for release 1.6 of CREAM CE • Features: • Many of the issues reported during the last GDB (and not included in CREAM1.5) will be now solved CREAM: ALICE Experience

  5. CREAM-CE: site admins and developers reports (I) • Purge issues: • ALICE REPORT: Wrong report of job status. CREAM’s vision of running jobs de-synchronized • ALICE REQUIREMENT: Method to purge jobs in a non terminal status • CREAM STATUS: • CREAM job status can be wrongly reported because of some misconfigurations or because of these two bugs in the BLAH Blparser candidates for CREAM1.6 • BUG #55078: « Possible final state not considered in BLParserPBS and BUpdaterPBS » • CURRENT STATUS: Integration Candidate included in patch #3179 • BUG #54949: « Some job can remain in running state when BLParser is restarted for both lsf and pbs » • CURRENT STATUS: Integration candidate included in patch #3179 • There is an specific bug which covers the ALICE requirement • BUG #55420: « Allow admin to purge CREAM jobs in a non terminal status » (Solution Status: in progress) • CURRENT STATUS: Integration Candidate included in patch #3179 • CURRENT RISK FOR ALICE: Low once the developers provided site admins with the corresponding purge script (very high before) CREAM: ALICE Experience

  6. CREAM-CE: site admins and developers feedback (I) • Purge issues: Site admin reports • Desynchronization issues has not been observed recently at sites running CREAM1.5 • Several sites have used the script created by the CREAM developers to purge manually the CREAM DB • Very good feedback on regard with this toolkit • It requires however a manual operation and the purge criteria variates from site to site CREAM: ALICE Experience

  7. CREAM-CE: site admins and developers report (II) • DISK SPACE issues: Areas to monitor and purge or clean • ALICE REPORT: The local mysql DB grown up to 2.5 GB • CREAM STATUS: Issue associated to mysql engine. While deleting entries from the DB, the relevant disk space is not released (therefore the CREAM DB does not decrease). But the space is reused when new data added in the DB • CURRENT RISK FOR ALICE: low • ALICE REPORT: purge of the input Sandboxes in /opt/glite/var/cream_sandbox • CREAM STATUS: Solved in CREAM1.5 • #48144: « Problems with purge in CREAM when the mapped group name is different than the VO name » • RISK FOR ALICE: none once sites upgrade to CREAM1.5 CREAM: ALICE Experience

  8. CREAM-CE: site admins and developers feedback (II) • Disk space issues: Site admins report • Grow up of the local mysql DB • Some tables in the DB still growing up • Purge of the input Sandboxes in /opt/glite/var/cream_sandbox area • Sandbox auto-purge procedure included in CREAM1.5 working fine now (after 10 days outputs are purged) • No further issues observed by the site admins on regards with the purge of the Sandbox after the migration to CREAM1.5 CREAM: ALICE Experience

  9. CREAM-CE: site admins and developers report (III) • DISK SPACE issues (cont.) • ALICE REPORT: issues regarding /opt/glite/var/log and /var/log • ALICE REQUIREMENT: Cleaning policy required for these files, otherwise files can grow forever • CREAM STATUS: policies exist for all these files and can be customized file by file: • Only the blah accounting log files are out of the CREAM developer’s control (files cannot be deleted before having been processed by the accounting system) • For /opt/glite/var/log/glite-ce-cream.log and /opt/glite/var/log/glite-ce-monitor.log, the policy is defined under /var/lib/tomcat5/webapps/ce-cream/WEB-IFN/classes/log4j.properties and the default values can be changed • Relevant info under: http://grid.pd.infn.it/cream/field.php?n=Main.KnownIssues • For /opt/glite/var/log/glite-xxxparser.log the policy is available under /opt/logrotate.d/glite-xxxparser • For /etc/logrotate.d/globus-gridftp manages the gridftp log files under /var/log • RISK FOR ALICE: low since the size is manageable by site admins CREAM: ALICE Experience

  10. CREAM-CE: site admins and developers report (IV) • DISK SPACE issues (cont.) • ALICE REPORT: issues regarding /opt/glite/var/cream/user_proxy • CREAM STATUS: bug reported and accepted not available in CREAM1.5 • #49497: « User proxies on CREAM do not get cleaned up » • CURRENT STATUS: Already solved, it will be included in CREAM1.6 (bug fix implementation still pending) • CREAM developers could increase the priority of this bug if needed • DISK SPACE issues: site admins report • No issues observed by the sites in the last month CREAM: ALICE Experience

  11. CREAM-CE: site admins and developers report (V) • LOAD issues (reported by Subatech): • ALICE REPORT: UNIX load going up to 5 (during start up or high rate of submission) • CREAM STATUS: problem reported by GRNET and the origin of the problem was a missed index in the CREAM DB • #52876: « The extra attribute table in the CREAM DB has no key/indexes defined » • CURRENT STATUS: solved in CREAM1.5 • RISK FOR ALICE: low once upgrading the CREAM version • ALICE REPORT: When tomcat restarted the system can take up to 15 min before submitting new jobs • CREAM STATUS: The slow start of CREAM is also due to the problems coming from jobs reported in wrong status • #51978: «CREAM can be slow to start» bug in progress • CURRENT STATUS: not included in CREAM1.5 but will be released in CREAM1.6 • RISK FOR ALICE: Purge actions should speed this start up and therefore decrease the risk for the experiment CREAM: ALICE Experience

  12. CREAM-CE: site admins and developers feedback (V) • Load issues: Site admins report • Grow up of the UNIX load • Reported by Subatech, still visible at the site • Load increases during automatic purge operations. Also visible during high job submission rates • Site admin report: At this site CREAM is running in a Vmware VM and the load might be due to lack of MySQL performance in such environment. Slow down of MySQL could increase the Unix load • CREAM-CE developers report: Issue tracked in bug #58103. the GRNET report “CREAM performance report”: very heavy queries are performed during purge operations • CURRENT STATUS: Fix already committed to CVS and will be released with the next CREAM1.6. Developers have not yet assessed the level of optimization of this fix to reduce the load • Report from Legnaro • After closing the queues the load increased without saturating the CPU (60% CPU load) for about 12h. The issues seems to come from the ALICE submissions which continued although the queues were closed. • Tomcat restart slows down the submission of jobs • Solved in CREAM1.6 • No further reports from the site admins CREAM: ALICE Experience

  13. Some other interesting feedback • We asked the site admins for: • Requirements for the system maintenance • Since the last update sites spend much less time monitoring CREAM-CE • Keeping control of the disk space basically and consistency between jobs reported by CREAM and the local batch system (Subatech) • In some cases, the baby-sitting of the site is almost negligible (Legnaro and CNAF) • Issues observed at RAL before the upgrade of the system seems to be gone after the deployment of CREAM1.5 • i.e.,Tomcat related issues already solved with this new version CREAM: ALICE Experience

  14. Some other interesting feedback (II) • We also asked the site admins for: • Monitoring applied to the system at the sites • In some cases (Subatech and RAL) the site is using Nagios with also specific probes: • gLite-LB-logd and tomcat daemons • User_nbfiles: number of files used for ALICE production • Inactive_jobs: jobs not consuming CPU • Open_file_desc: number of file descriptors used • Standard fabric (Ganglia) for Legnaro CREAM: ALICE Experience

  15. Some other interesting feedback (III) • In addition CNAF reported: • blparser is not automatically restarted at boot time (only tomcat). Blparser has to be restarted by hand in order to recover the queue info • Developers feedback: issue included in bug #56518 • CURRENT STATUS: Fix already committed and will be provided in CREAM1.6 • Finally INFN-Torino feedback • Running CREAM-CE since one day • Stefano Lusso has reported the useful added value of the script CheckCreamConf.pl used at the site to set variables: • http://grid.pd.infn.it/cream/field.php?n=Main.CheckYourCREAMCEConfiguration CREAM: ALICE Experience

  16. Summary • ALICE remarks again their high interest in the generalized deployment of CREAM-CE • Vibrant and very involved user community provides helpful feedback • Fantastic quality developer support and advice • ALICE and the sites involved want a fast version certification and deployment cycle • Time is very short, data is coming CREAM: ALICE Experience

More Related