This report summarizes the status of the Atlas ADC Federated XRootD Working Group as of June 2014. Key topics include system stability, coverage rates, traffic levels, and failover strategies. New sites have been added, and ongoing improvements in monitoring and local setup changes are highlighted. The report also discusses upcoming meetings and tutorials, with a focus on enhancing user experience and system performance. Overall, the working group is making steady progress toward its objectives with an aim for a 95% coverage target.
FAX status report
Ilija Vukotic, on behalf of the atlas-adc-federated-xrootd working group
S&C week, June 2, 2014
Content
• Status
  • Coverage
  • Traffic
  • Failover
  • Overflow
• Changes in localSetupFAX
• Monitoring changes
  • Changes in the GLED collector and dashboard
  • Failover & overflow monitoring
  • FaxStatusBoard
• Meetings
  • Tutorial, 23-27 June: dedicated to instruction on xAOD and the new analysis model
  • ROOT I/O, 25-27 June
FAX topology
• Topology change in North America
  • Added East and West redirectors
  • Will serve the CA cloud
  • All hosted at BNL
• Will need an NL cloud redirector
FAX in Europe
To come:
• SARA
• Nikhef
• IL cloud: IL-TAU, Technion, Weizmann
FAX in North America
To come:
• TRIUMF (June?)
• McGill (end of June)
• SCINET (end of June)
• Victoria (~August)
FAX in Asia
To come:
• Beijing (~two weeks)
• Tokyo
• Australia (a few weeks)
Status
• Most sites running stably
• Glitches do happen, but are usually fixed within a few hours
• SSB issues solved
• New sites added:
  • IFAE
  • PIC
  • IN2P3-LPC
• In need of restart:
  • UNIBE-LHEP
Coverage
• Coverage is now tracked on an auto-updated Twiki page:
  https://twiki.cern.ch/twiki/bin/view/AtlasComputing/FaxCoverage
• Coverage is good (~85%), but we should aim for >95%!
• Info is fetched from http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary
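As an illustration of how such an auto-updated page can be produced, the sketch below periodically pulls the per-site job summary from the dashboard endpoint quoted above and compares it against a list of FAX-enabled sites. This is a minimal sketch only: the JSON field names ('summaries', 'site', 'completed') and the FAX_SITES list are assumed placeholders, not the actual dashboard schema.

```python
# Minimal sketch: estimate FAX coverage from the job dashboard summary.
# The response field names and the FAX_SITES list are assumed placeholders.
import json
from urllib.request import urlopen

DASHBOARD = ("http://dashb-atlas-job-prototype.cern.ch/"
             "dashboard/request.py/dailysummary")
FAX_SITES = {"MWT2", "AGLT2", "BNL-ATLAS", "SWT2_CPB", "OU_OCHEP_SWT2"}  # example subset

def coverage():
    data = json.load(urlopen(DASHBOARD))
    total = covered = 0
    for row in data["summaries"]:        # hypothetical structure
        njobs = row.get("completed", 0)
        total += njobs
        if row.get("site") in FAX_SITES:
            covered += njobs
    return 100.0 * covered / max(total, 1)

if __name__ == "__main__":
    print("FAX coverage estimate: %.1f%%" % coverage())
```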
Traffic
• Slowly increasing
• Record peak output broken
• Still small compared to what we expect will come
Failover
• Running stably
Overflow status
• The whole chain is ready
• All US queues have been set to allow 3 Gbps both delivered to and from sites
• Test tasks submitted to sites that don't have the data, so that transfertype=FAX is invoked
  • This does not test the JEDI decision making (the one based on the cost matrix)
• Waiting for actual jobs to check the full chain:
  • Users not yet instructed to use the JEDI client
  • Waiting for the JEDI monitor
Overflow tests
• The test is the hardest I/O case: 100% of events, all branches read, standard TTreeCache (TTC), no AsyncPrefetch (see the sketch after this list)
• Site-specific FDR datasets (10 DSs, 744 files, 2.7 TB)
• All source/destination combinations of US sites
• Everything submitted in 3 batches, but not all started simultaneously; affected by priority degradation
• Three input files per job
• If a site is copy2scratch, the pilot does xrdcp to scratch; otherwise jobs access files remotely
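For reference, a read pattern of this kind can be reproduced with a PyROOT loop like the one below. It is only a sketch: the xrootd URL, tree name ("CollectionTree") and cache size are placeholders rather than the values used in the actual tests.

```python
# Minimal sketch of the 'hardest I/O' read pattern: every event, every branch,
# a standard TTreeCache, and ROOT's asynchronous prefetching disabled.
# The xrootd URL, tree name and cache size are assumed placeholders.
import ROOT

ROOT.gEnv.SetValue("TFile.AsyncPrefetching", 0)   # no AsyncPrefetch

f = ROOT.TFile.Open(
    "root://atlas-xrd-us.usatlas.org//atlas/rucio/some.dataset/file.root")
tree = f.Get("CollectionTree")

tree.SetCacheSize(30 * 1024 * 1024)   # standard 30 MB TTreeCache (assumed value)
tree.AddBranchToCache("*", True)      # cache all branches
tree.SetBranchStatus("*", 1)          # read all branches

for i in range(tree.GetEntries()):    # 100% of the events
    tree.GetEntry(i)

f.Close()
```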
Overflow tests
• Error rate:
  • Total: 9188 jobs
  • Finished: 9052
  • Failed: 117 (1.3%)
    • 24: OU reading from OU (no FAX involved)
    • 66: reading from WT2 (corrupted files)
    • 27 (0.29%): actual FAX errors where SWT2 did not deliver the files; will be investigated
    • The rest are "Payload ran out of memory"
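The quoted percentages follow directly from the totals on this slide; a trivial check, using only the numbers above:

```python
# Quick check of the failure fractions quoted above (numbers from this slide).
total, failed, fax_errors = 9188, 117, 27
print("overall failure rate: %.1f%%" % (100.0 * failed / total))      # ~1.3%
print("FAX error rate:       %.2f%%" % (100.0 * fax_errors / total))  # ~0.29%
```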
Overflow tests
• Jobs reading from local scratch, for comparison (scout jobs marked separately in the plots)
• Direct access site, reading locally, per job:
  • 7.2 MB/s
  • 67% CPU efficiency
  • 71 ev/s
• Copy2scratch site, per job:
  • 11.0 MB/s
  • 97% CPU efficiency
  • 109 ev/s
Overflow tests
• Jobs reading from remote sources (plot annotations: "No saturation" / "Possibly a start of saturation")
• Direct access site, reading remotely, per job:
  • 4.2 MB/s
  • 43% CPU efficiency
  • 42 ev/s
• Direct access site, reading remotely, per job:
  • 3.5 MB/s
  • 29% CPU efficiency
  • 34 ev/s
Overflow tests
• MWT2 reading from OU and SWT2 simultaneously
• In aggregate reached 850 MB/s, the limit for MWT2 at that time
Cost matrix
• Source vs. destination cost matrix, served at http://1-dot-waniotest.appspot.com/
localSetupFAX
• Added command fax-ls (made by Shuwei YE)
  • Will finally replace isDSinFAX
  • He will move all the other tools to Rucio
• Change in fax-get-best-redirector
  • Previously, each invocation made three queries:
    • SSB, to get endpoints and their status
    • AGIS, to get the sites hosting the endpoints
    • AGIS, to get site coordinates
  • Each call returns hundreds of kB
  • Cannot scale to a large number of requests
  • Solution (see the sketch below):
    • A GoogleAppEngine servlet that every 30 minutes pulls the info from SSB and AGIS and serves it from memory
    • Information slimmed to what is actually needed: ~several kB
    • Requests now served in a few tens of ms
    • "Infinitely" scalable
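A minimal sketch of this caching pattern is shown below, in plain Python rather than the actual GoogleAppEngine servlet; the SSB/AGIS URLs, field names and slimming logic are hypothetical placeholders.

```python
# Minimal sketch of the caching idea behind the fax-get-best-redirector change:
# refresh a slimmed copy of the SSB/AGIS information every 30 minutes and serve
# all client requests from memory. URLs, field names and the slimming are
# hypothetical placeholders, not the production servlet code.
import json
import time
from urllib.request import urlopen

SSB_URL = "http://dashb-atlas-ssb.cern.ch/..."    # placeholder
AGIS_URL = "http://atlas-agis-api.cern.ch/..."    # placeholder
REFRESH_INTERVAL = 30 * 60                        # seconds

_cache = {"timestamp": 0, "endpoints": []}

def _refresh():
    ssb = json.load(urlopen(SSB_URL))
    agis = json.load(urlopen(AGIS_URL))
    # Keep only what clients actually need: endpoint, status, coordinates.
    _cache["endpoints"] = [
        {"endpoint": e["endpoint"],
         "status": e["status"],
         "lat": agis.get(e["site"], {}).get("lat"),
         "lon": agis.get(e["site"], {}).get("lon")}
        for e in ssb["endpoints"]
    ]
    _cache["timestamp"] = time.time()

def get_endpoints():
    """Serve a few-kB slimmed snapshot, refreshed at most every 30 minutes."""
    if time.time() - _cache["timestamp"] > REFRESH_INTERVAL:
        _refresh()
    return _cache["endpoints"]
```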
Monitoring: collector, dashboard
• Problem: support of multi-VO sites
• Meeting: Alex, Matevz, me
• Issues:
  • Site name
    • ATLAS reports it
    • CMS does not, or reports it badly; will be fixed
  • Requesting user's VO
    • ATLAS does it
    • CMS is not strict about it; US-CMS uses GUMS; will be fixed
• Proposal (illustrated by the routing sketch below):
  • During the summer Matevz develops an XrdMon that can handle multi-VO messages
  • Messages from multi-VO sites go to a special "mixed" AMQ; the dashboard splits the traffic according to the user's VO
  • Details: https://docs.google.com/document/d/1Syx3_vkwCfc5lj2lQzbUUrKT0Je238w6lcwVL7IY1GY/edit#
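To make the splitting step concrete, the fragment below sketches how a dashboard-side consumer could route records from a shared "mixed" queue by the requesting user's VO. The message fields and example values are hypothetical, not the actual XrdMon/AMQ schema.

```python
# Minimal sketch of splitting a shared 'mixed' monitoring stream by the
# requesting user's VO. The record fields ('vo', 'site', ...) are
# hypothetical placeholders, not the actual XrdMon/AMQ message schema.
from collections import defaultdict

KNOWN_VOS = {"atlas", "cms"}

def route_by_vo(records):
    """Group monitoring records from a multi-VO site by user VO."""
    per_vo = defaultdict(list)
    for rec in records:
        vo = rec.get("vo", "").lower()
        if vo not in KNOWN_VOS:
            vo = "unknown"   # e.g. sites not yet reporting the VO strictly
        per_vo[vo].append(rec)
    return per_vo

# Example: records from one mixed site end up in separate per-VO streams.
mixed = [{"vo": "ATLAS", "site": "MWT2", "bytes_read": 123456},
         {"vo": "CMS",   "site": "UNL",  "bytes_read": 654321}]
print({vo: len(recs) for vo, recs in route_by_vo(mixed).items()})
```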
Monitoring
• Failover
  • Not flexible enough
• Overflow
  • No monitoring yet
  • Need to compare jobs grouped by transfer type (see the sketch below)
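The comparison the overflow monitoring needs could look like the sketch below: group job records by transfer type (local access vs. transfertype=FAX) and compare failure rate and event rate. The job record fields and example values are hypothetical, not an existing monitoring schema.

```python
# Minimal sketch: group job records by transfer type and compare
# failure rate and event rate. The job record fields are hypothetical.
from collections import defaultdict

def compare_by_transfer_type(jobs):
    groups = defaultdict(lambda: {"n": 0, "failed": 0, "events": 0, "wall": 0.0})
    for job in jobs:
        g = groups[job.get("transfertype", "local")]
        g["n"] += 1
        g["failed"] += job.get("status") == "failed"
        g["events"] += job.get("nevents", 0)
        g["wall"] += job.get("wallclock_s", 0)
    return {t: {"jobs": g["n"],
                "failure_rate_pct": 100.0 * g["failed"] / max(g["n"], 1),
                "events_per_s": g["events"] / max(g["wall"], 1.0)}
            for t, g in groups.items()}

jobs = [{"transfertype": "FAX",   "status": "finished", "nevents": 30000, "wallclock_s": 900},
        {"transfertype": "local", "status": "finished", "nevents": 30000, "wallclock_s": 400},
        {"transfertype": "FAX",   "status": "failed",   "nevents": 0,     "wallclock_s": 100}]
print(compare_by_transfer_type(jobs))
```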