
AMOD Report Aug 20-26, 2012



  1. AMOD Report Aug 20-26, 2012 • Torre Wenaus, BNL • August 28, 2012

  2. Activities • Good week of datataking • LHC back to record luminosities • 7.7 x 10^33 /cm^2/s • 1.1 /fb collected • ~2M analysis jobs • ~500 analysis users

  3. Production & Analysis Activity • panglia service interruption • Sustained activity, production and analysis • ~35k peak concurrent analysis jobs, ~17k minimum

  4. Data Transfer Activity • (data transfer throughput plot; annotated clouds: TW, CA)

  5. Central Services, T0, ADC • Tue: T0 export failures to EOS. Auto restart (a ~weekly recent occurrence) was (unusually) very slow, resulting in failures. EOS software was patched to fix it and EOSATLAS moved to better hardware. Intervention Wed Aug 28 for full EOS software update. GGUS:85370 • Tue: Major svn problem began. AFS-based ATLAS repos inadvertently wiped by sysadmin with root privs. Need better backup/mirroring… • http://itssb.web.cern.ch/service-incident/major-incident-svn-repositories/21-08-2012 • Post mortem: https://twiki.cern.ch/twiki/bin/viewauth/PESgroup/SvnAfsRepoIncident21082012 • Tue: T0 bsub submit time spike, ~1hr, ~10sec avg submit time. Recurred all week. • Script found that was suspected to be the cause, but removing script cron hasn’t cured it • https://cern.service-now.com/service-portal/view-incident.do?n=INC156015 • Tue night: holding job accumulation. One Panda server way overloaded. How can load balancer let this happen?? Initial fix: increasing number of processes on affected server. Later (Thu): change LB from serving 5/8 to 7/8. • Wed: File missing on EOS (restored). Bigger issue is failure to detect transfer error. Transfer unsuccessful on EOS side and deleted after transfer by EOS; failure undetected on ATLAS side. Xrdcp checksum compare insufficient, not using the post-transfer checksum. Being worked. GGUS:85421
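
The EOS case above boils down to the source checksum never being compared against a checksum taken from the destination after the copy completed. A minimal sketch of that post-transfer comparison, assuming the adler32 convention and a hypothetical query_remote_checksum() helper standing in for whatever the destination storage (e.g. EOS via an xrootd query) actually exposes:

    import zlib

    def local_adler32(path, blocksize=1024 * 1024):
        """Adler32 of a local file, as a zero-padded hex string."""
        value = 1  # adler32 seed
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(blocksize), b""):
                value = zlib.adler32(block, value)
        return "%08x" % (value & 0xFFFFFFFF)

    def verify_transfer(local_path, remote_path, query_remote_checksum):
        """Compare the source checksum against one obtained from the destination
        AFTER the transfer finished; query_remote_checksum is a placeholder for
        the site-specific query. This is the comparison the xrdcp check missed."""
        src = local_adler32(local_path)
        dst = query_remote_checksum(remote_path)  # must be the post-transfer value
        if src != dst:
            raise RuntimeError("checksum mismatch: source %s vs destination %s" % (src, dst))
        return True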

  6. CS, T0, ADC (2) • Wed: DDM site service problems at a number of sites, for different reasons: TRIUMF (slow LFC registration), Lyon (glitch in ToA retrieval via AGIS), FZK (known Fetcher problem from slowness in retrieving Dashboard info used in source selection). • Alarm thresholds were recently lowered in SLS, so issues show a red alarm more readily. May need some tuning to convey the appropriate level of severity to shifters (e.g. in the Lyon and TRIUMF cases, site services continued to operate through the alarm periods). • Thu: T0 data export to CERN-PROD degraded by "Mapped user 'atlas003' is invalid" errors. Promptly addressed but the origin of the problem is not understood. GGUS:85455 • Thu: New diagnostic information added to PanDA server SLS • Cf. Tadashi’s talk • Fri: CERN-PROD transfer failure due to "No such file or directory", ticketed Fri evening, prompt response, dialogue underway sorting out the issue. GGUS:85488 • Sat/Sun: DDM central catalog load balancing not functioning correctly over the weekend -- the LB reported the correct machine to use, but that was not ultimately the machine used, leaving machines with no activity for long periods and triggering SLS alarms. SNOW ticket INC156560 (see the sketch after this slide) • Mon: T0: Morning spikes in lsf bsub time still an issue, 1 hr this morning with over 10 sec (15 sec peak) bsub times.
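
On the central catalog load-balancing problem, the symptom was backends sitting idle despite the LB nominally pointing at the right machine. A rough diagnostic sketch, assuming hypothetical alias and node names (the real DDM catalog alias and hosts differ) and that repeatedly sampling the alias is a fair proxy for what clients actually hit:

    import socket
    import time
    from collections import Counter

    # Hypothetical names; the real DDM central catalog alias and node list differ.
    LB_ALIAS = "atlas-ddm-catalog.example.cern.ch"
    EXPECTED_NODES = ["catalog01.example.cern.ch", "catalog02.example.cern.ch"]

    def sample_alias(alias=LB_ALIAS, samples=60, interval=10):
        """Resolve the LB alias repeatedly and count which backend IPs it hands out."""
        seen = Counter()
        for _ in range(samples):
            seen[socket.gethostbyname(alias)] += 1
            time.sleep(interval)
        return seen

    def idle_backends(seen, expected=EXPECTED_NODES):
        """Backends the alias never pointed at during the sampling window --
        the symptom behind the 'no activity for long periods' SLS alarms."""
        expected_ips = set(socket.gethostbyname(h) for h in expected)
        return sorted(expected_ips - set(seen))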

  7. Tier 1/Calib centers • Wed: PIC: Missing files at PIC_DATADISK, declared lost. GGUS:85426 • Wed: RAL-LCG2: A number of functional test errors from srm-atlas.gridpp.rl.ac.uk to a number of sites in the FR and IT cloud. Files used by FT were corrupted/lost. GGUS:85438 • Wed: NIKHEF-ELPROD: Transfers failing with 'file exists' errors. We (ATLAS) are using overwrite option in FTS 2.2.8, removal of existing files works elsewhere but not here. Site: it was the initial delete of pre-existing file that was failing because of a disk server problem. Very misleading error message. GGUS:85439 • Thu: INFN-T1_DATADISK full, free space was down to 1TB. Initial cleanup yielded 10TB. Removed from T0 export pending adequate space availability. • Cf. Ueda’s talk • Fri: Taiwan-LCG2: Castor problems, excluded from T0 export (~8% T0 export success rate at time of exclusion). Resolution reported later in the morning, restored T0 export Sat morning. GGUS:85461 • Fri: Staging failures at NDGF-T1, checksum mismatch, several hundred job failures. Still awaiting site response as of today (Tue 28). GGUS:85483 • Fri: Transfer failures to AGLT2 muon calibration center, promptly resolved. GGUS:85480
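
The NIKHEF "file exists" item above is really about overwrite being two operations (delete, then copy), with the error needing to say which one failed. A minimal sketch of that logic, with delete_remote()/copy_remote()/remote_exists() as hypothetical stand-ins for the actual FTS/SRM layer calls:

    def overwrite_copy(src, dst, copy_remote, delete_remote, remote_exists):
        """Copy src to dst, removing any pre-existing destination file first.
        copy_remote, delete_remote and remote_exists are placeholders for the real
        transfer-layer calls (FTS/SRM); the point is that a failure in the
        preliminary delete should not surface as 'file exists' on the copy."""
        if remote_exists(dst):
            try:
                delete_remote(dst)
            except Exception as exc:
                # The NIKHEF case: the delete itself failed (disk server problem),
                # so report that, rather than letting the copy fail misleadingly.
                raise RuntimeError("overwrite of %s failed: could not delete existing file: %s" % (dst, exc))
        return copy_remote(src, dst)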

  8. Tier 1/Calib centers (2) • Sat: Brief PIC SRM service interruption, overload on the PgSQL server, promptly resolved. GGUS:85491 • Sat: TRIUMF job failures due to LFC registration failures, a known issue of heavy LFC load; the load from the local LFC cleaning underway was reduced. Still some errors Sun. GGUS:85494 • Sat: IN2P3-CC jobs failing with "Staging input file failed". File server hardware problem, repaired Monday. GGUS:85498 • Sat: TRIUMF FTS failures on inbound CA transfers, "Not authorized to query request", due to proxy expiration. T0 export to CA not seriously impacted. Resolved Sat night. GGUS:85502 • Sun: SRM failures at TAIWAN-LCG2_DATADISK, T0 export degraded to ~40% success rate, promptly addressed, storage overheating. GGUS:85499 • Sun: T0 export failures to IFIC calibration center, 100% failures when reported, SRM problems promptly resolved (stuck disk server). GGUS:85504 • Mon: Taiwan transfer failures to TAIWAN-LCG2_DATATAPE and MCTAPE, including T0 export. Site reports tape service degraded, units shut off to cool the computer room pending completion of A/C repair. GGUS:85506

  9. Other • Added to ongoing issues tracked in WLCG ops: Problem seen in the PanDA pilot with JSON unicode when using python 2.6 and lfc_addreplicas() • Cannot be fully solved by the pilot due to a bug in the LFC API • Until this issue has been resolved, we cannot use python 2.6 in combination with LFC registrations • GGUS:84716 (see the sketch after this slide) • Not all shifters are able to read (some?) SNOW tickets; they should be able to • https://cern.service-now.com/service-portal/view-incident.do?n=INC156015 • FYI, up to this shift I used a jabber.org account for the chat room; it was very unstable this week, so I switched to jabber.cern.ch, which has been rock stable
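
For the python 2.6 JSON/LFC issue above, the underlying problem is that json.loads() hands back unicode strings, which the LFC bindings cannot accept until the API bug is fixed. A minimal python 2 sketch of the usual client-side workaround, converting everything back to byte strings before any lfc_addreplicas() call (the record content here is purely illustrative):

    import json

    def to_bytes(obj, encoding="utf-8"):
        """Recursively convert the unicode strings that json.loads() produces under
        python 2 into byte strings, which is what the LFC bindings expect."""
        if isinstance(obj, unicode):
            return obj.encode(encoding)
        if isinstance(obj, dict):
            return dict((to_bytes(k, encoding), to_bytes(v, encoding))
                        for k, v in obj.items())
        if isinstance(obj, list):
            return [to_bytes(i, encoding) for i in obj]
        return obj

    # Illustrative only: a replica record arriving as JSON text.
    record = to_bytes(json.loads('{"guid": "abc-123", "sfn": "srm://example.org/some/path"}'))
    # record now holds plain str values, safe to pass towards lfc.lfc_addreplicas(...)
    # once the underlying LFC API issue is resolved.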

  10. Thanks • Several T0 export issues, including on weekend, but no ALARMs – prompt effective responses from sites. Thanks! • Gold stars of the week to T1s with no tickets/issues: BNL, FZK • Excellent coverage by shifters -- thanks! AMOD’ding with shifters like Kai Leffhalm, Andrew Washbrook, Elena Oliver Garcia, Michal Svatos and others on duty is a pleasure • Thanks for off-hours help from many core ADC experts
