AMOD report 24 – 30 September 2012

Presentation Transcript


  1. AMOD report 24 – 30 September 2012, Fernando H. Barreiro Megino, CERN IT-ES

  2. Workload

  3. Data transfers
     • > 1M files a day
     • High number of transfer failures caused by a few NL T2s (see the sketch after this slide)
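
To pin down which sites dominate the failures, a simple aggregation over a transfer-monitoring export is usually enough. A minimal sketch of how one could rank destination sites by failed transfers; the transfers.csv file and its dst_site / state columns are assumptions for illustration, not details taken from the dashboards used here.

```python
# Count failed transfers per destination site from a monitoring dump.
# File name and column names (dst_site, state) are assumed, not from the report.
import csv
from collections import Counter

failures = Counter()
with open("transfers.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["state"] == "FAILED":
            failures[row["dst_site"]] += 1

# Worst offenders first; a few NL T2s would dominate this list.
for site, n in failures.most_common(10):
    print(f"{site:30s} {n:8d}")
```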

  4. Tue 25 - High load on PanDA Servers
     • Average time for DQ2+LFC registration increased dramatically, causing high load on the PanDA Servers
     • Some LFC timings in the logs indicated that the registration slowness was in DQ2
     • [Plots: CC writer 1 and CC writer 2; number of sessions open on the ADCR3 instance, mostly by the ATLAS_LFC_W user] (a session-count sketch follows this slide)
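
The session build-up can be checked directly from the database side. A minimal sketch, assuming read access to v$session on the ADCR3 instance via cx_Oracle; the account and DSN below are placeholders, not credentials from the report.

```python
# Count open sessions per database user on the ADCR3 instance.
# The connect string is a placeholder; reading v$session needs the right privileges.
import cx_Oracle

conn = cx_Oracle.connect("monitor_user", "secret", "adcr3-dsn")  # hypothetical account/DSN
cur = conn.cursor()
cur.execute(
    "SELECT username, COUNT(*) "
    "FROM v$session "
    "WHERE username IS NOT NULL "
    "GROUP BY username "
    "ORDER BY COUNT(*) DESC"
)
# During the incident, ATLAS_LFC_W would show up at the top of this list.
for username, n_sessions in cur:
    print(f"{username:20s} {n_sessions}")
```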

  5. Tue 25 - High load on PanDA Servers
     • Other observations that came up during the investigation:
        • Some improvements to the LFC client are going to be discussed during the “DB technical meeting on the LFC” on Wednesday 3rd Oct
        • PanDA server LFC registration should be activated for all sites in order to avoid individual registrations by the pilot
        • aCT registers in bursts without bulk methods: in the LFC logs we saw 4k accesses over one hour and only 7 accesses over another hour (see the registration sketch after this slide)
        • There were 2 SS machines serving the DE cloud (i.e. the same sites twice) with similar configuration
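
The point about bulk methods is essentially about catalogue round trips. A minimal sketch of the difference; register_file and register_files_bulk are hypothetical stand-ins for whatever the catalogue client exposes, not actual LFC or DQ2 API calls.

```python
# Contrast per-file registration (one round trip each) with a single bulk call.
# `client`, `register_file` and `register_files_bulk` are illustrative only,
# not real LFC or DQ2 method names.

def register_individually(client, files):
    # N files -> N catalogue round trips; bursts of these are what show up
    # as thousands of LFC accesses within a single hour.
    for f in files:
        client.register_file(f["lfn"], f["pfn"], f["guid"])

def register_in_bulk(client, files):
    # One call (or one session/transaction) carrying all entries:
    # the catalogue sees a handful of accesses instead of thousands.
    client.register_files_bulk(files)
```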

  6. Thu 27 - SS callbacks to dashboard piling up (SS-FR)
     • Initially we thought it was exclusively due to the CERN network intervention
     • After checking the logs we have seen slow callbacks before the intervention on different SS machines
     • D. Tuckett is checking the situation

  7. Other incidents and downtimes
     • Monday
        • New PanDA proxy had not been updated on the PanDA Monitor machines (Savannah: 97737)
        • INFN-T1 scheduled downtime for ~1 hour
     • Tuesday
        • RAL 6h upgrade to CASTOR 2.1.12-10. Alastair set the UK cloud brokeroff on the previous evening
     • Thursday
        • CERN network intervention to replace some switches. Services at risk were CASTOR, EOS, elog and dashboard. Smooth intervention - NTR.
     • Friday
        • BNL to ASGC transfer errors, being investigated by both sides during the weekend. ASGC FTS is blocked from accessing the BNL SRM and the routing path has been changed. (GGUS:86537)

  8. Other incidents and downtimes (2)
     • Sunday
        • PVSS DCS replication with large delays due to a high insertion rate. The DCS expert had to be called on Sunday
        • RAL had failing jobs due to put errors and transfer errors, including T0 export. Caused by a problem with the Stager databases and resolved late on Sunday evening (GGUS:86552)
     • Saturday
        • SS-SARA had CRITICAL errors. MySQL DB corruption? Problem to be understood by the DDM experts.

  9. Acknowledgements
     • Except for occasional highlights it has been a very quiet week
     • Thanks a lot to:
        • the ADCoS experts & shifters, and to the Comp@P1 shifter for the good work
        • the experts of the different components and sites for their quick reaction
        • Alessandro and Ueda for their support

  10. Backup slides

  11. NL transfer errors
