AMOD Report February 11-17 2013


  1. AMOD Report February 11-17 2013 Torre Wenaus, BNL February 19, 2013

  2. Activities • Datataking until the 14th • Tail end of high priority Moriond processing • MC not quite keeping the grid full • Sustained high levels of user analysis • ~1.1M production jobs (group, MC, validation, reprocessing) • ~3.5M analysis jobs • ~680 analysis users Torre Wenaus

  3. Sayonara T0 prompt reco Torre Wenaus

  4. Production & Analysis [plots: Production and Analysis jobs, ~24k min – 43k max] Torre Wenaus

  5. Data transfers Torre Wenaus

  6. Concurrent jobs, daily completed jobs Torre Wenaus

  7. Tier 0, Central Services, ADC • Hammercloud SLS grey through the week. HC OK, the problem is with SLS. SLS server being replaced. • The usual occasional T0 bsub submit time spikes • Tue: CMS usage visible in monitoring. ANALY_T2_CH_CERN queue had 306k jobs in transferring – not our issue, fortunately! Ale will take up monitoring changes with Valeri. Need VO attribute on site • Wed evening: T0 ALARM ticket submitted by T0 team, pending job accumulation. When reviewed by the LSF expert 90 min later, the accumulation was gone and the system had been operating normally. Was really below the threshold for an alarm ticket. GGUS:91501 • Thu: T2 transfer monitor plots briefly missing in part, password issue, quickly fixed • Fri: reported recurrence of the lcg-utils problem in the 3pm meeting at Rod’s request. First ticketed at TW Feb 6, appeared Feb 14 at RAL. Seen also by LHCb on SL6. lcg-cp with the --srm-timeout option sleeps for the full timeout period if completion takes more than ~2 sec (a timing sketch follows this slide). GGUS:91223 • Reported in the 3pm meeting to prompt some developer action, developer responded, ticket is now in ‘waiting for [our] reply’ Torre Wenaus
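A minimal sketch, not the GGUS:91223 reproducer, of how the lcg-cp symptom could be timed from a shifter session; the source and destination below are placeholders, and the only assumption is that lcg-cp with --srm-timeout is available on the PATH:

    # Sketch only: time an lcg-cp call to see whether it returns promptly or
    # sleeps for the full --srm-timeout period when the copy takes more than ~2 s.
    import subprocess
    import time

    SRC = "srm://some-se.example.org/atlas/datadisk/somefile"   # placeholder SURL
    DST = "file:///tmp/somefile"                                # placeholder destination
    TIMEOUT = 300                                               # seconds passed to --srm-timeout

    start = time.time()
    rc = subprocess.call(["lcg-cp", "--srm-timeout", str(TIMEOUT), SRC, DST])
    elapsed = time.time() - start

    # Reported symptom: even when the transfer itself completes, the command does
    # not return until roughly TIMEOUT seconds have elapsed.
    print("exit code %d, wall time %.1f s (srm-timeout %d s)" % (rc, elapsed, TIMEOUT))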

  8. Tier 1 Centers • CNAF, FZK, NIKHEF out of T0 export through the week, DATADISKs almost full • Thu am: Taiwan-LCG2: transfer errors to DATATAPE. Castor stager DB deadlock. Recovered in ~6hrs. GGUS:91505 • Fri am: BNL: fraction of transfers failing from several sources to BNL-OSG2_DATADISK. Fixed by rebalancing load across dCache pools. GGUS:91548 • Sat pm: transfer failures to Taiwan. Attributed by site to busy disk servers, OK again and ticket closed Sun night. GGUS:91581 • Sun pm: Source errors in transfers from TRIUMF-LCG2 and other CA sites. FTS cannot contact non-CA FTS servers. Resolved Mon pm. GGUS:91588 Torre Wenaus

  9. Tier 2 calibration centers • Mon am: IFIC-LCG2: resolved SRM server glitch causing transfer failures since Sun. GGUS:91327 • Sun am: IFIC-LCG2: SRM down, CALIBDISK failures of functional test transfers, all file transfers failing. Failure in one RAID group, taken offline, restored Lustre and SRM. GGUS:91586 Torre Wenaus

  10. Frequent bouts of T2 transfer stalls • FTS congestion was an issue during the week, e.g. this on a Moriond task with jobs stuck in transferring: http://savannah.cern.ch/support/?135872 • Tomas talked to Cedric: Rucio will use point-to-point FTS, so it shouldn't see this problem Torre Wenaus

  11. Other • Clouds were running out of assigned tasks during the week. Would be very desirable to sustain a deeper todo queue of tasks. • More clarity in monitoring and/or documentation needed for shifters on which sites are Tier 3s and how they should be treated • Shifter asked: how do we identify which sites are Tier 3s? • Answer offered by expert shifter: • Look at the number of WN in Panda. If it's 5-10, then it's a Tier 3 • Look at the space tokens. If there are only ~3, like Scratch, Localgroup, ... then it's a Tier 3. A Tier 2 is required to have more space tokens, like Datadisk etc. (a sketch of this heuristic follows this slide) • Is there someplace that makes it clear? Would be better if it were obvious – apparent in the same monitoring that leads shifters to conclude there's a site problem, which maybe should be treated differently (low priority if addressed at all?) if it's a Tier 3 • A suggestion from a shifter: add to https://twiki.cern.ch/twiki/bin/viewauth/Atlas/ADCoS#How_to_Submit_GGUS_Team_Tickets some words about how to handle Tier-3s Torre Wenaus
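A hedged sketch of the expert shifter's rule of thumb above; this is not an official ADC classifier, the thresholds simply mirror the slide, and the inputs are whatever the shifter reads off the Panda and DDM monitoring pages:

    # Rough Tier-3 fingerprint from the slide: few worker nodes in Panda and only
    # a handful of space tokens (no Datadisk). Thresholds are illustrative only.
    def looks_like_tier3(n_worker_nodes, space_tokens):
        few_nodes = n_worker_nodes <= 10                            # "5-10 WN in Panda"
        few_tokens = len(space_tokens) <= 3                         # e.g. only Scratch, Localgroup, ...
        has_datadisk = any("DATADISK" in t.upper() for t in space_tokens)
        return few_nodes and few_tokens and not has_datadisk

    # Example: 8 worker nodes and only scratch/local-group tokens -> likely a Tier 3
    print(looks_like_tier3(8, ["SCRATCHDISK", "LOCALGROUPDISK"]))   # True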

  12. Thanks • Thanks to all shifters and helpful experts! Torre Wenaus
