
WLCG Service Report

WLCG Service Report. Andrea.Valassi@cern.ch, WLCG Management Board, 24th April 2012. Introduction. 5 busy weeks since the last MB report on March 20th. LHC beam commissioning and data taking (first stable beams on April 5).


Presentation Transcript


  1. WLCG Service Report. Andrea.Valassi@cern.ch, WLCG Management Board, 24th April 2012

  2. Introduction
     • 5 busy weeks since the last MB report on March 20th
     • LHC beam commissioning and data taking (first stable beams on April 5)
     • Busy but successful – smooth activity also over the Easter break
     • 5 Service Incident Reports received:
       • CASTOR name server stuck, 3 CMS files truncated, on Apr 4 (ALARM and SIR)
       • GGUS unreachable for some regions due to DNS update on Mar 20 (SIR)
       • RAID corruption (Adaptec 6445) at PIC on Mar 15, 1269 ATLAS files lost (SIR)
       • Defective Enstore LTO5 cartridge at PIC on Mar 9, one ATLAS file lost (SIR)
       • Database server upgrades to Oracle 11g at T0 and T1 in Q1 2012 (SIR)
     • 10 real GGUS ALARMS (7 for ATLAS, 1 for CMS, 2 for LHCb)
       • Five at CERN, two at KIT, one each at INFN, Taiwan and IN2P3
     • Many other issues reported at the daily meetings, most notably:
       • FTS upgrades to 2.2.8 and related issues at several sites
       • LHCb file corruption at IN2P3 (GGUS:80338)
       • Large fraction of short (<200s) pilot jobs at IN2P3 from ATLAS and LHCb
       • One node of the CMS online DB rebooted due to excessive load during data taking
       • GEANT network problem on April 13 (preliminary SIR)
       • New CVMFS client deployed to fix cache issue reported by LHCb

  3. GGUS summary (5 weeks)

  4. Support-related events since last MB
     • There were 11 real ALARM tickets since the 2012/03/20 MB (5 weeks):
       • 8 submitted by ATLAS (of which GGUS:81429 turned out to be a false, not a test, ALARM and is therefore not drilled here)
       • 1 by CMS
       • 2 by LHCb
     • Ticket closing is now automatic after 10 working days, as per EGI reporting requirements (ticket closing in CERN SNOW is also automatic, after only 3 working days).
     • The GGUS monthly release took place on 2012/03/20. Bugs related to the Remedy upgrade, which prevented email notifications and attachments from being delivered, were discovered and fixed thanks to the regular suite of test ALARMs. Details: Savannah:127010
     • Details follow…

  5. ATLAS ALARM-> INFN-T1 SRM can’t be contacted GGUS:80582

  6. ATLAS ALARM-> Taiwan Transfers to CALIBDISK fail GGUS:80586

  7. LHCb ALARM-> Tape recall rate very low at GridKa GGUS:80589

  8. ATLAS ALARM-> CERN-IN2P3 transfers not processed by FTS GGUS:80602

  9. CMS ALARM-> CERN Storage management system shows issues with file copying GGUS:80905 (SIR)

  10. LHCb ALARM-> FZK fails to download files to WNs GGUS:81028

  11. ATLAS ALARM-> IN2P3 transfer errors due to destination SRM auth GGUS:81286

  12. ATLAS ALARM-> CERN Raw data retrieval problem from Castor GGUS:81352

  13. ATLAS ALARM-> CERN Slow LSF response GGUS:81401

  14. ATLAS ALARM-> CERN LSF down GGUS:81445

  15. [Figure: reliability plots, with annotated events 1.1, 1.2, 1.3, 3.1, 3.2, 4.1]

  16. Analysis of the reliability plots: Week of 19/03/2012 – 25/03/2012
      Trans-VO events: [None]
      ATLAS
        1.1 IN2P3 (25/03). CreamCE tests failing on cccreamceli01 for the entire week and on ccreamceli06 for 50% of 25/03.
        1.2 NIKHEF (25/03). juk and stremsel.nikhef.nl failing CREAM-CE tests for ~35% of 25/03.
        1.3 SARA-MATRIX (21/03). creamce and creamce2.gina.sara.nl failing tests for ~35% and ~55% of 21/03, respectively.
      ALICE: [Nothing to report]
      CMS
        3.1 ASGC (24 & 25/03). srm2.grid.sinica.edu.tw failing VO Put tests on 24 & 25/03; cream03.grid.sinica.edu.tw failing JobSubmit tests from 0700 on 25/03 onwards.
        3.2 IN2P3-CC (22/03-23/03). cccreamceli05.in2p3.fr failing org.cms.WN-swinst tests for 13 hours, with service availability unknown for another 20 hours. cccreamceli07.in2p3.fr failing org.cms.WN-swinst tests for 9 hours, with service availability unknown for another 12 hours.
      LHCb
        4.1 CNAF (19/03). SRM-VOLs test failing from 0000 to 0900 on 19/03.

  17. [Figure: reliability plots, with annotated events 1.1, 3.1, 3.2, 4.1]

  18. Analysis of the reliability plots: Week of 26/03/2012 – 01/04/2012
      Trans-VO events: [None]
      ATLAS
        1.1 NIKHEF (26/03). JobSubmit tests cancelled/timed out; no ticket opened for it.
      ALICE: [Nothing to report]
      CMS
        3.1 IN2P3 (28 & 29/03). CREAM-CE test failures (SAV).
        3.2 ASGC (29 & 30/03). SRMv2 test failures (GGUS).
      LHCb
        4.1 RAL (30/03). DirectJobSubmit CREAM-CE test failures for ~3 hours.

  19. [Figure: reliability plots, with annotated event 4.1]

  20. Analysis of the reliability plots: Week of 02/04/2012 – 08/04/2012
      Trans-VO events: [None]
      ATLAS: [Nothing to report]
      ALICE: [Nothing to report]
      CMS: [Nothing to report]
      LHCb
        4.1 PIC (02/04-03/04). Annual power supply check. From 17h UTC on 02/04 the org.sam.CREAMCE-DirectJobSubmit SAM tests were cancelled, and from 2am UTC on 03/04 the SRM SAM tests org.lhcb.SRM-VOLsDir, org.lhcb.SRM-VOLs, and org.lhcb.SRM-VODe were failing. The failures disappeared at 17h UTC on 03/04, when the downtime finished.

  21. [Figure: reliability plots, with annotated events 1.1, 3.1]

  22. Analysis of the reliability plots: Week of 09/04/2012 – 15/04/2012
      Trans-VO events: [None]
      ATLAS
        1.1 TRIUMF (10/04-11/04). CREAM-CE and SRMv2 SAM/Nagios tests failed between 8am UTC on 10/04 and 6am on 11/04 due to an ongoing unscheduled downtime at TRIUMF-LCG2 caused by 2 site-wide power cuts.
      ALICE: [Nothing to report]
      CMS
        3.1 TW_ASGC (11/04-12/04). CREAM-CE and SRMv2 SAM/Nagios tests failed between 5pm UTC on 11/04 and 11am on 12/04 due to an ongoing unscheduled storage downtime.
      LHCb: [Nothing to report]

  23. [Figure: reliability plots, with annotated events 1.1, 1.2]

  24. Analysis of the reliability plots: Week of 16/04/2012 – 22/04/2012
      Trans-VO events: [None]
      ATLAS
        1.1 INFN-T1 (18/04). Storage test results degraded for 9 hrs during downtime for tape facility upgrade.
        1.2 NDGF-T1 (20/04). Storage test results degraded for 7 hrs due to an issue with dCache. GGUS:81447
      ALICE: [Nothing to report]
      CMS: [Nothing to report]
      LHCb: [Nothing to report]

  25. Conclusions
      • Business as usual – busy (again) but successful
      • First stable beams on April 5th
      • Upgrade to FTS 2.2.8 has been completed
        • Several issues with the 2.2.8 release have been reported by the sites
        • All such issues have been addressed by patches over FTS 2.2.8
        • These (as yet unreleased) patches will be included in the next EMI release
