AMOD Report June 24-30, 2013
Torre Wenaus, BNL
July 2, 2013
Activities
• Stable operations, utilization tapering off on the weekend – few pending tasks
• ~4.3M analysis jobs, 7M jobs total
• ~560 analysis users
• Ops issues in the week:
  • Recovering from BNL disk pool failure
  • Low disk space at many T1s, T2s
[Plot label: 140k]
Production & Analysis
[Plots: Production, Analysis; ~17k min – 37k max]
Data transfers
Tier 0, Central Services, ADC
• Mon: HC DB problem from previous weekend fixed with DB restart. DB connections saturated when session count grew with no release of connections. “A follow-up is being discussed.” GGUS:95033
• Ongoing issue “CERN-PROD: file transfer failure from T2 sites due to SECURITY_ERROR” closed because it had been resolved in early June (as pointed out by Maria in the WLCG meeting). GGUS:92166
• Smooth, incident-free interventions on Castor, Oracle production DBs, Bourricot, Tracer/Consistency Service
• Problems (curl SSL failure) using pandamon cloud/site control on lxplus (SL6 issue?); experts investigating
Tier 1
• Mon: FZK-LCG2 transfer failures, “all ATLAS jobs/transfers are forced onto the same disk cluster, because all other disks are full to the brim. Consequently, the load cannot get distributed anymore and we now observe higher failure rate.” GGUS:95021
• Mon-Thu: SARA-MATRIX problems with dest/source transfers. gPlazma service interruption, SRM problems fixed with restart. GGUS:95071
• Tue: FZK-LCG2 missing AOD file reported; they can find no trace, waiting for reply. GGUS:95092
• Tue-Wed: IN2P3-CC file transfer failures. SRM crashed during the night, fixed in the morning with restart. Site established auto recovery to avoid delays in such cases in the future. GGUS:95093
Tier 1
• Thu: BNL provided incident report on disk pool failure. Recovery worked on through the week
• Fri-Sun: RAL-LCG2 DDM errors due to Castor problems on Fri; downtime over the weekend, cloud set brokeroff. Downtime ended Sunday when problems were resolved; restored to production. GGUS:95160
• Sat: SARA-MATRIX storage errors due to full DATADISK, blacklisted. Space cleaned up over the weekend. GGUS:95175
• Mon 7/1: IN2P3-CC NO_SPACE_LEFT errors but no auto blacklisting; site not publishing that it is full. Inconsistency in SRM DB found, bad space calculation, fixed. GGUS:95204
• Several Tier 1s (and Tier 2s) over the week: low space. FZK, SARA, IN2P3
Other
• Clouds were running out of assigned tasks during the week. It would be very desirable to sustain a deeper to-do queue of tasks.
  • [This was the first item on the ‘Other’ slide in my last (Feb) AMOD report; it still applies]
• New manual whitelisting policy
  • Armen in last ADC weekly: “Consider an option of manual whitelisting (by expert shifter, AMOD), not reversible by SAAB. May be needed in some exceptional cases.”
  • Ueda has put this in place
  • “on” (whitelisting = ignore auto-exclusions) added as savannah site exclusion ticket option
  • dq2-set-location-status documentation for the “on” case added to the CentralizedSiteExclusion twiki
Thanks!
• Big thanks to the very attentive and effective ADCoS shifters