This report summarizes the status of the Atlas ADC Federated XRootD Working Group as of June 2014. Key topics include system stability, coverage rates, traffic levels, and failover strategies. New sites have been added, and ongoing improvements in monitoring and local setup changes are highlighted. The report also discusses upcoming meetings and tutorials, with a focus on enhancing user experience and system performance. Overall, the working group is making steady progress toward its objectives with an aim for a 95% coverage target.
FAX status report
Ilija Vukotic, on behalf of the atlas-adc-federated-xrootd working group
S&C week, June 2, 2014
Content
• Status
  • Coverage
  • Traffic
  • Failover
  • Overflow
• Changes in localSetupFAX
• Monitoring changes
  • Changes in the GLED collector and dashboard
  • Failover & overflow monitoring
  • FaxStatusBoard
• Meetings
  • Tutorial, 23-27 June: dedicated to instruction on xAOD and the new analysis model
  • ROOT I/O, 25-27 June
FAX topology
• Topology change in North America
  • Added East and West redirectors
  • Will serve the CA cloud
  • All hosted at BNL
• Will need an NL cloud redirector
FAX in Europe
To come:
• SARA
• Nikhef
• IL cloud: IL-TAU, Technion, Weizmann
FAX in North America
To come:
• TRIUMF (June?)
• McGill (end of June)
• SCINET (end of June)
• Victoria (~August)
FAX in Asia
To come:
• Beijing (~two weeks)
• Tokyo
• Australia (a few weeks)
Status
• Most sites running stably
• Glitches do happen, but are usually fixed within a few hours
• SSB issues solved
• New sites added:
  • IFAE
  • PIC
  • IN2P3-LPC
• In need of restart:
  • UNIBE-LHEP
Coverage
• Coverage is now tracked on an auto-updated Twiki page:
  https://twiki.cern.ch/twiki/bin/view/AtlasComputing/FaxCoverage
• Coverage is good (~85%), but we should aim for >95%!
• Info is fetched from http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary
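As an illustration of how such an auto-updated page can be produced, the sketch below periodically pulls the per-site job summary from the dashboard endpoint quoted above and compares it against a list of FAX-enabled sites. This is a minimal sketch only: the JSON field names ('summaries', 'site', 'completed') and the FAX_SITES list are assumed placeholders, not the actual dashboard schema.

```python
# Minimal sketch: estimate FAX coverage from the job dashboard summary.
# The response field names and the FAX_SITES list are assumed placeholders.
import json
from urllib.request import urlopen

DASHBOARD = ("http://dashb-atlas-job-prototype.cern.ch/"
             "dashboard/request.py/dailysummary")
FAX_SITES = {"MWT2", "AGLT2", "BNL-ATLAS", "SWT2_CPB", "OU_OCHEP_SWT2"}  # example subset

def coverage():
    data = json.load(urlopen(DASHBOARD))
    total = covered = 0
    for row in data["summaries"]:        # hypothetical structure
        njobs = row.get("completed", 0)
        total += njobs
        if row.get("site") in FAX_SITES:
            covered += njobs
    return 100.0 * covered / max(total, 1)

if __name__ == "__main__":
    print("FAX coverage estimate: %.1f%%" % coverage())
```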
Traffic
• Slowly increasing
• Record peak output broken
• Still small compared to what we expect will come
Failover
• Running stably
Overflow status
• The whole chain is ready
• All US queues have been set to allow 3 Gbps both delivered to and from sites
• Test tasks submitted to sites that don't have the data, so that transfertype=FAX is invoked
  • This does not test the JEDI decision making (the one based on the cost matrix)
• Waiting for actual jobs to check the full chain:
  • Users not yet instructed to use the JEDI client
  • Waiting for the JEDI monitor
Overflow tests
• The test is the hardest I/O case: 100% of events, all branches read, standard TTreeCache (TTC), no AsyncPrefetch (see the sketch after this list)
• Site-specific FDR datasets (10 DSs, 744 files, 2.7 TB)
• All source/destination combinations of US sites
• Everything submitted in 3 batches, but not all started simultaneously; affected by priority degradation
• Three input files per job
• If a site is copy2scratch, the pilot does xrdcp to scratch; otherwise jobs access files remotely
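For reference, a read pattern of this kind can be reproduced with a PyROOT loop like the one below. It is only a sketch: the xrootd URL, tree name ("CollectionTree") and cache size are placeholders rather than the values used in the actual tests.

```python
# Minimal sketch of the 'hardest I/O' read pattern: every event, every branch,
# a standard TTreeCache, and ROOT's asynchronous prefetching disabled.
# The xrootd URL, tree name and cache size are assumed placeholders.
import ROOT

ROOT.gEnv.SetValue("TFile.AsyncPrefetching", 0)   # no AsyncPrefetch

f = ROOT.TFile.Open(
    "root://atlas-xrd-us.usatlas.org//atlas/rucio/some.dataset/file.root")
tree = f.Get("CollectionTree")

tree.SetCacheSize(30 * 1024 * 1024)   # standard 30 MB TTreeCache (assumed value)
tree.AddBranchToCache("*", True)      # cache all branches
tree.SetBranchStatus("*", 1)          # read all branches

for i in range(tree.GetEntries()):    # 100% of the events
    tree.GetEntry(i)

f.Close()
```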
Overflow tests
• Error rate:
  • Total: 9188 jobs
  • Finished: 9052
  • Failed: 117 (1.3%)
    • 24: OU reading from OU (no FAX involved)
    • 66: reading from WT2 (corrupted files)
    • 27 (0.29%): actual FAX errors where SWT2 did not deliver the files; will be investigated
    • The rest are "Payload ran out of memory"
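The quoted percentages follow directly from the totals on this slide; a trivial check, using only the numbers above:

```python
# Quick check of the failure fractions quoted above (numbers from this slide).
total, failed, fax_errors = 9188, 117, 27
print("overall failure rate: %.1f%%" % (100.0 * failed / total))      # ~1.3%
print("FAX error rate:       %.2f%%" % (100.0 * fax_errors / total))  # ~0.29%
```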
Overflow tests
• Jobs reading from local scratch, for comparison (scout jobs marked separately in the plots)
• Direct access site, reading locally, per job:
  • 7.2 MB/s
  • 67% CPU efficiency
  • 71 ev/s
• Copy2scratch site, per job:
  • 11.0 MB/s
  • 97% CPU efficiency
  • 109 ev/s
Overflow tests
• Jobs reading from remote sources (plot annotations: "No saturation" / "Possibly a start of saturation")
• Direct access site, reading remotely, per job:
  • 4.2 MB/s
  • 43% CPU efficiency
  • 42 ev/s
• Direct access site, reading remotely, per job:
  • 3.5 MB/s
  • 29% CPU efficiency
  • 34 ev/s
Overflow tests
• MWT2 reading from OU and SWT2 simultaneously
• In aggregate reached 850 MB/s, the limit for MWT2 at that time
Cost matrix
• Source vs. destination cost matrix, served at http://1-dot-waniotest.appspot.com/
localSetupFAX
• Added command fax-ls (made by Shuwei YE)
  • Will finally replace isDSinFAX
  • He will move all the other tools to Rucio
• Change in fax-get-best-redirector
  • Previously, each invocation made three queries:
    • SSB, to get endpoints and their status
    • AGIS, to get the sites hosting the endpoints
    • AGIS, to get site coordinates
  • Each call returns hundreds of kB
  • Cannot scale to a large number of requests
  • Solution (see the sketch below):
    • A GoogleAppEngine servlet that every 30 minutes pulls the info from SSB and AGIS and serves it from memory
    • Information slimmed to what is actually needed: ~several kB
    • Requests now served in a few tens of ms
    • "Infinitely" scalable
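A minimal sketch of this caching pattern is shown below, in plain Python rather than the actual GoogleAppEngine servlet; the SSB/AGIS URLs, field names and slimming logic are hypothetical placeholders.

```python
# Minimal sketch of the caching idea behind the fax-get-best-redirector change:
# refresh a slimmed copy of the SSB/AGIS information every 30 minutes and serve
# all client requests from memory. URLs, field names and the slimming are
# hypothetical placeholders, not the production servlet code.
import json
import time
from urllib.request import urlopen

SSB_URL = "http://dashb-atlas-ssb.cern.ch/..."    # placeholder
AGIS_URL = "http://atlas-agis-api.cern.ch/..."    # placeholder
REFRESH_INTERVAL = 30 * 60                        # seconds

_cache = {"timestamp": 0, "endpoints": []}

def _refresh():
    ssb = json.load(urlopen(SSB_URL))
    agis = json.load(urlopen(AGIS_URL))
    # Keep only what clients actually need: endpoint, status, coordinates.
    _cache["endpoints"] = [
        {"endpoint": e["endpoint"],
         "status": e["status"],
         "lat": agis.get(e["site"], {}).get("lat"),
         "lon": agis.get(e["site"], {}).get("lon")}
        for e in ssb["endpoints"]
    ]
    _cache["timestamp"] = time.time()

def get_endpoints():
    """Serve a few-kB slimmed snapshot, refreshed at most every 30 minutes."""
    if time.time() - _cache["timestamp"] > REFRESH_INTERVAL:
        _refresh()
    return _cache["endpoints"]
```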
Monitoring: collector, dashboard
• Problem: support of multi-VO sites
• Meeting: Alex, Matevz, me
• Issues:
  • Site name
    • ATLAS reports it
    • CMS does not, or reports it badly; will be fixed
  • Requesting user's VO
    • ATLAS does it
    • CMS is not strict about it; US-CMS uses GUMS; will be fixed
• Proposal (illustrated by the routing sketch below):
  • During the summer Matevz develops an XrdMon that can handle multi-VO messages
  • Messages from multi-VO sites go to a special "mixed" AMQ; the dashboard splits the traffic according to the user's VO
  • Details: https://docs.google.com/document/d/1Syx3_vkwCfc5lj2lQzbUUrKT0Je238w6lcwVL7IY1GY/edit#
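To make the splitting step concrete, the fragment below sketches how a dashboard-side consumer could route records from a shared "mixed" queue by the requesting user's VO. The message fields and example values are hypothetical, not the actual XrdMon/AMQ schema.

```python
# Minimal sketch of splitting a shared 'mixed' monitoring stream by the
# requesting user's VO. The record fields ('vo', 'site', ...) are
# hypothetical placeholders, not the actual XrdMon/AMQ message schema.
from collections import defaultdict

KNOWN_VOS = {"atlas", "cms"}

def route_by_vo(records):
    """Group monitoring records from a multi-VO site by user VO."""
    per_vo = defaultdict(list)
    for rec in records:
        vo = rec.get("vo", "").lower()
        if vo not in KNOWN_VOS:
            vo = "unknown"   # e.g. sites not yet reporting the VO strictly
        per_vo[vo].append(rec)
    return per_vo

# Example: records from one mixed site end up in separate per-VO streams.
mixed = [{"vo": "ATLAS", "site": "MWT2", "bytes_read": 123456},
         {"vo": "CMS",   "site": "UNL",  "bytes_read": 654321}]
print({vo: len(recs) for vo, recs in route_by_vo(mixed).items()})
```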
Monitoring
• Failover
  • Not flexible enough
• Overflow
  • No monitoring yet
  • Need to compare jobs grouped by transfer type (see the sketch below)
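The comparison the overflow monitoring needs could look like the sketch below: group job records by transfer type (local access vs. transfertype=FAX) and compare failure rate and event rate. The job record fields and example values are hypothetical, not an existing monitoring schema.

```python
# Minimal sketch: group job records by transfer type and compare
# failure rate and event rate. The job record fields are hypothetical.
from collections import defaultdict

def compare_by_transfer_type(jobs):
    groups = defaultdict(lambda: {"n": 0, "failed": 0, "events": 0, "wall": 0.0})
    for job in jobs:
        g = groups[job.get("transfertype", "local")]
        g["n"] += 1
        g["failed"] += job.get("status") == "failed"
        g["events"] += job.get("nevents", 0)
        g["wall"] += job.get("wallclock_s", 0)
    return {t: {"jobs": g["n"],
                "failure_rate_pct": 100.0 * g["failed"] / max(g["n"], 1),
                "events_per_s": g["events"] / max(g["wall"], 1.0)}
            for t, g in groups.items()}

jobs = [{"transfertype": "FAX",   "status": "finished", "nevents": 30000, "wallclock_s": 900},
        {"transfertype": "local", "status": "finished", "nevents": 30000, "wallclock_s": 400},
        {"transfertype": "FAX",   "status": "failed",   "nevents": 0,     "wallclock_s": 100}]
print(compare_by_transfer_type(jobs))
```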