AMOD report 6 Feb – 12 Feb 2012

Fernando H. Barreiro Megino, CERN IT-ES-VOS

Presentation Transcript


  1. AMOD report 6 Feb – 12 Feb 2012 Fernando H. Barreiro Megino CERN IT-ES-VOS

  2. Overview: Analysis

  3. Overview: Production
     Claire Gwenlan: “[…] we are now on the tail end of MC11c […] the load is not going to be like what you've seen for the past few weeks/months […] Until… MC12… coming soon…”

  4. Overview: DDM
     ATLAS membership of the ddmadmin certificate expired on 11 Feb 2012, and transfer jobs were rejected or failed

  5. CERN and ADC
     • Sun 5th CERN-PROD_DATADISK: GGUS:78923
       • lcg-cr failures
       • Caused by latest EMI release on "preprod" WNs (10%)
       • Rolled back to LCG WN on Wed morning
     • Mon 6th Schedconfig failed to update
       • Set IT and TW clouds offline in Panda over the morning
       • Recovery from dump - only expert procedures available
       • Dedicated postmortem
     • Tue 7th ADCR & ATLR intervention:
       • Oracle security updates
       • Almost transparent. Unavailability of Panda & DDM for a few minutes at 9:00

  6. CERN and ADC: PandaMon issues
     • Voatlas140 & 141 out of production
       • 2 out of 6 servers out of production for a week to prevent session count overload errors
     • Wed 8th-Thu 9th curl control commands failing intermittently
     • Machines using large amount of swap space: alarm about voatlas180 using 50GB during Thu night
     [Figure: Utilization of swap space, 9th Feb and 10th Feb]
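
The swap alarm on slide 6 can be illustrated with a small check on the standard Linux `/proc/meminfo` format. This is only a sketch under assumptions: the function names and the alarm threshold are hypothetical, and the report does not say how the real alarm was produced or which threshold it used.

```python
def swap_used_kib(meminfo_text):
    """Return the swap in use, in KiB, from text in /proc/meminfo format."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # The first token after the colon is the numeric value.
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


def swap_alarm(meminfo_text, threshold_gib):
    """Return True if swap usage reaches threshold_gib (hypothetical threshold)."""
    used_gib = swap_used_kib(meminfo_text) / (1024 * 1024)
    return used_gib >= threshold_gib
```

On a live machine the input would be the contents of `/proc/meminfo`, for example `swap_alarm(open("/proc/meminfo").read(), threshold_gib=40)`.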

  7. ddmadmin certificate renewal (1)
     • ddmadmin is the robot certificate used to authenticate DDM and other ADCops agents
     • Yearly ddmadmin proxy expired 9th Feb
     • 23rd Jan (>2 weeks before) a campaign was started to renew the proxy on all DDM and ADCops machines
     • Some machines were forgotten
       • ddmusr01@voatlas125: Victor
       • ddmusr03@voatlas161: Functional Test subscription
       • ddmusr01@voatlas244: ADC monitoring collector
       • Maybe more
     → Need to compile a clear list of places where the ddmadmin proxy is installed

  8. ddmadmin certificate renewal (2)
     • The ATLAS membership of ddmadmin expired on Sat 11th Feb… and caught everybody by surprise
     • All FTS job submissions were rejected
     • A few hours after the problem was reported, the membership was renewed
     • Proxies are cached via proxy delegation, and it took several hours until the change was propagated to all services (FTS, SEs, …)
       • glite-delegation-destroy & init did not seem to have any effect
       • e.g. Hiro deleted all proxies from /tmp on all FTS agent hosts to speed up the recovery in the US cloud
     • RAL had to roll out the grid-mapfiles manually after the incident (GGUS:79137)

  9. ddmadmin certificate renewal (3)
     Need recovery procedures, a tested backup proxy and notifications about the proxy sent out to the AMOD mailing list
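
The follow-up actions on slides 7 to 9 (a clear list of proxy locations, plus notifications ahead of expiry) could be supported by a small check like the one below. This is a minimal sketch, not the ADC tooling: the `ProxyLocation` record, the `expiring` and `format_notice` functions, and the 14-day warning window are all hypothetical. Only the account@host names used in the usage note come from the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ProxyLocation:
    """One place where the ddmadmin proxy is installed (hypothetical record)."""
    account: str
    host: str
    purpose: str
    not_after: datetime  # expiry time of the proxy installed at this location


def expiring(inventory, now, warn_window=timedelta(days=14)):
    """Return locations whose proxy has expired or expires within warn_window,
    ordered by expiry time (soonest first)."""
    due = [loc for loc in inventory if loc.not_after - now <= warn_window]
    return sorted(due, key=lambda loc: loc.not_after)


def format_notice(loc, now):
    """Build a one-line message that could be sent to a mailing list."""
    remaining = loc.not_after - now
    if remaining <= timedelta(0):
        state = "EXPIRED"
    else:
        state = f"expires in {remaining.days} d"
    return f"{loc.account}@{loc.host} ({loc.purpose}): {state}"
```

An inventory would list entries such as `ProxyLocation("ddmusr01", "voatlas125", "Victor", not_after=...)`, with the expiry time read from the proxy actually installed on that host. Keeping the list in one place is the point: a machine missing from the inventory is exactly the case that was overlooked during the renewal campaign.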

  10. Tier1s
      • IN2P3-CC downtime Tue 7th
        • Maintenance and upgrade of the various services and servers
        • Affecting LFC, dCache, FTS, batch system, worker nodes, etc.
        • Complete cloud offline in Panda and DDM
        • Downtime for CE and SE extended until Wed 8th
      • SARA downtime Tue 7th
        • Replacement of 6620 SAN storage hardware and firmware updates
        • Affecting services such as SRM, dCache and UI
      • RAL downtime Wed 8th
        • Intervention on core network
        • Affecting all services (LFC, FTS, SE, CE…)
        • UK cloud set offline
      • Failing jobs at SARA on Thu 9th (GGUS:79089)
        • Not a site issue
        • Panda brokerage did not recognize NIKHEF-ELPROD_PHYS-TOP as NIKHEF location
        • Tadashi fixed it immediately
      • FZK transfer and staging failures on Sun 12th (GGUS:79145)
        • High load and full disks
      • INFN-MILANO-ATLASC SRM problems (GGUS:78998)
        • Recurring problem over many days: “failed to contact on remote SRM [httpg://t2cmcondor.mi.infn.it:8444/srm/managerv2]”
        • /etc/grid-security/vomsdir/atlas/vo.racf.bnl.gov.lsc was missing on the StoRM servers, which therefore rejected all proxies with VOMS extensions provided by the BNL VOMS server
        • Later problem with the fetch-crl cronjob
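
The INFN-MILANO-ATLASC problem on slide 10 came down to a missing `.lsc` file in the VOMS directory. A site could detect this kind of gap with a check along the following lines. This is a sketch under assumptions: the function name is hypothetical, and the list of VOMS server hosts has to be supplied by the caller, since the report names only `vo.racf.bnl.gov`.

```python
from pathlib import Path


def missing_lsc_files(vomsdir, vo, voms_hosts):
    """Return the expected <host>.lsc paths that are absent under vomsdir/vo."""
    vo_dir = Path(vomsdir) / vo
    expected = [vo_dir / f"{host}.lsc" for host in voms_hosts]
    return [path for path in expected if not path.is_file()]
```

For the case in the report, the call would be `missing_lsc_files("/etc/grid-security/vomsdir", "atlas", ["vo.racf.bnl.gov"])`. An empty result means the file is in place; it does not check the file's contents.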

  11. Thanks to ADC experts and ADCoS shifters for their support
      • BEWARE: No AMODs in the next weeks
