FAX status report


Presentation Transcript


  1. FAX status report Ilija Vukotic on behalf of the atlas-adc-federated-xrootd working group S&C week Jun 2, 2014

  2. Content • Status • Coverage • Traffic • Failover • Overflow • Changes in localSetupFAX • Monitoring changes • Changes in GLED collector, dashboard • Failover & overflow monitoring • FaxStatusBoard • Meetings • Tutorial – 23-27 June – dedicated to xAOD and the new analysis model • ROOTIO – 25-27 June

  3. FAX topology • Topology change in North America • added East and West redirectors • they will serve the CA cloud • all hosted at BNL • Will need an NL cloud redirector

  4. FAX in Europe • To come: SARA, Nikhef • IL cloud: IL-TAU, Technion, Weizmann

  5. FAX in North America • To come: TRIUMF (June?), McGill (end of June), SCINET (end of June), Victoria (~August)

  6. FAX in Asia • To come: Beijing (~two weeks), Tokyo, Australia (a few weeks)

  7. Status • Most sites running stably • Glitches do happen, but are usually fixed within a few hours • SSB issues solved • New sites added: IFAE, PIC, IN2P3-LPC • In need of restart: UNIBE-LHEP

  8. Coverage • Coverage is now tracked on an auto-updated Twiki page • https://twiki.cern.ch/twiki/bin/view/AtlasComputing/FaxCoverage • Coverage is good (~85%), but we should aim for >95%! • Info is fetched from http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary
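The coverage number is derived from the job dashboard's daily summary. Below is a minimal sketch of such a lookup, assuming the dailysummary endpoint can return per-site job counts as JSON; the query parameters and response field names are illustrative assumptions, not the actual ones used by the Twiki updater.

    # Sketch only: estimate FAX coverage as the fraction of completed jobs
    # that ran at FAX-enabled sites, using the dashboard daily summary.
    # Query parameters and response field names are assumptions.
    import requests

    DASHBOARD = "http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary"

    def coverage(fax_sites, params=None):
        data = requests.get(DASHBOARD, params=params or {}).json()
        total = at_fax = 0
        for row in data.get("summaries", []):    # assumed field name
            njobs = row.get("completed", 0)      # assumed field name
            total += njobs
            if row.get("site") in fax_sites:
                at_fax += njobs
        return 100.0 * at_fax / total if total else 0.0

    # Example: coverage({"MWT2", "AGLT2", "BNL-ATLAS"}) should come out near 85
    # with real data, matching the number quoted above.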

  9. Traffic • Slowly increasing • Maximum peak output record broken • Still small compared to what we expect will come

  10. Failover • Running stably

  11. Overflow status • The whole chain is ready • I have set all the US queues to allow 3 Gbps both to and from each site. • Test tasks submitted to sites that don’t have the data, so that transfertype=FAX is invoked. • This does not test the JEDI decision making (the part based on the cost matrix) • Waiting for actual jobs to check the full chain • Users not yet instructed to use the JEDI client • Waiting for the JEDI monitor

  12. Overflow tests • The test is the hardest I/O case – 100% of events, all branches read, standard TTC (TTreeCache), no AsyncPrefetch. • Site-specific FDR datasets (10 DSs, 744 files, 2.7 TB) • All source/destination combinations of US sites • All of it submitted in 3 batches, but not all started simultaneously; affected by priority degradation. • Three input files per job. • If a site is copy2scratch, the pilot does xrdcp to scratch; otherwise jobs access the files remotely.
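The last bullet is essentially a branch in how the pilot stages input: on a copy2scratch queue the file is copied to local scratch with xrdcp, otherwise the job reads it remotely over FAX. A simplified sketch of that decision follows; it is not the actual pilot code, and the redirector host and global-name path layout are assumptions.

    # Simplified sketch of the stage-in decision described above.
    # Not the actual pilot code; redirector host and path layout are assumptions.
    import os
    import subprocess

    REDIRECTOR = "root://glrd.usatlas.org/"      # example FAX redirector

    def prepare_input(lfn, scratch_dir, copy2scratch):
        remote = REDIRECTOR + lfn                # assumed global-name path
        if copy2scratch:
            local = os.path.join(scratch_dir, os.path.basename(lfn))
            subprocess.check_call(["xrdcp", remote, local])
            return local                         # job opens the local copy
        return remote                            # job reads directly over the WAN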

  13. Overflow tests • Error rate • Total: 9188 jobs • Finished: 9052 • Failed: 117 (1.3%) • 24 – OU reading from OU (no FAX involved) • 66 – reading from WT2 (files are corrupted) • 27 (0.29%) – actual FAX errors where SWT2 did not deliver the files; will be investigated. • The rest are “Payload run out of memory” errors
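The quoted percentages are simply the failure counts divided by the total number of submitted jobs; a quick check of the arithmetic:

    # Quick check of the quoted failure rates.
    total_jobs = 9188
    failed_total = 117
    failed_fax = 27    # SWT2 did not deliver the files

    print("overall failure rate: %.1f%%" % (100.0 * failed_total / total_jobs))  # ~1.3%
    print("FAX-specific rate:    %.2f%%" % (100.0 * failed_fax / total_jobs))    # ~0.29%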

  14. Overflow tests • Jobs reading from local scratch, for comparison (plots show scout jobs) • Direct access site, reading locally: 7.2 MB/s, 67% CPU efficiency, 71 ev/s per job • Copy2scratch site: 11.0 MB/s, 97% CPU efficiency, 109 ev/s per job

  15. Overflow tests • Jobs reading from remote sources (plots: one shows no saturation, the other possibly a start of saturation) • Direct access site, reading remotely: 4.2 MB/s, 43% CPU efficiency, 42 ev/s per job • Direct access site, reading remotely: 3.5 MB/s, 29% CPU efficiency, 34 ev/s per job

  16. Overflow tests • MWT2 reading from OU and SWT2 simultaneously • In aggregate reached 850 MB/s – limit for MWT2 at that time.

  17. Cost matrix (source vs. destination) • http://1-dot-waniotest.appspot.com/

  18. localSetupFAX • Added command fax-ls, made by Shuwei YE. • Will finally replace isDSinFAX • He will move all the other tools to Rucio • Change in fax-get-best-redirector • Each invocation does three queries: SSB to get endpoints and their status, AGIS to get the sites hosting the endpoints, AGIS to get site coordinates • Each call returns hundreds of kB • Can’t scale to a large number of requests • Solution: a GoogleAppEngine servlet that every 30 min takes the info from SSB and AGIS and serves it from memory • Information slimmed to what is actually needed: ~several kB • Requests now served in a few tens of ms • “Infinitely” scalable
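The App Engine service is essentially an in-memory cache of slimmed SSB/AGIS information, refreshed every 30 minutes. A minimal sketch of that idea follows; the fetch functions, the slimmed record layout, and the refresh handling are illustrative assumptions, not the actual servlet code.

    # Sketch of the caching idea behind the redirector-lookup service.
    # The fetch_* functions are hypothetical stand-ins for the SSB/AGIS queries.
    import json
    import time

    REFRESH_SECONDS = 30 * 60                # refresh at most every 30 minutes
    _cache = {"stamp": 0.0, "payload": "[]"}

    def fetch_ssb_endpoints():
        # Stand-in for the SSB query: endpoint -> status.
        return {"root://fax.example.org:1094/": "OK"}

    def fetch_agis_sites():
        # Stand-in for the AGIS queries: site -> endpoints and coordinates.
        return {"EXAMPLE_SITE": {"endpoints": ["root://fax.example.org:1094/"],
                                 "coords": [41.8, -87.6]}}

    def _refresh():
        endpoints = fetch_ssb_endpoints()
        sites = fetch_agis_sites()
        slimmed = [{"endpoint": ep, "status": status, "site": name,
                    "coords": info["coords"]}
                   for name, info in sites.items()
                   for ep, status in endpoints.items()
                   if ep in info["endpoints"]]
        _cache["payload"] = json.dumps(slimmed)   # a few kB instead of hundreds
        _cache["stamp"] = time.time()

    def get_redirector_info():
        # Served from memory; SSB/AGIS are re-queried only when the cache is stale.
        if time.time() - _cache["stamp"] > REFRESH_SECONDS:
            _refresh()
        return _cache["payload"]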

  19. Monitoring – collector, dashboard • Problem: support of multi-VO sites • Meeting: Alex, Matevz, me • Issues: • Site name: ATLAS reports it; CMS does not, or reports it badly – will be fixed • Requesting user’s VO: ATLAS reports it; CMS is not strict about it (US-CMS uses GUMS) – will be fixed • Proposal: during the summer Matevz develops an XrdMon that can handle multi-VO messages • Multi-VO sites send their messages to a special “mixed” AMQ; the dashboard splits the traffic according to the user’s VO. Details: https://docs.google.com/document/d/1Syx3_vkwCfc5lj2lQzbUUrKT0Je238w6lcwVL7IY1GY/edit#
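On the dashboard side, the proposal amounts to routing each record from the “mixed” stream by the VO of the requesting user. A minimal illustration of that routing is given below; the message fields and per-VO destinations are assumptions, not the agreed schema.

    # Illustration of splitting a mixed-VO monitoring stream by the user's VO.
    # Field names ('vo', 'site') and the destinations are assumptions.
    DESTINATIONS = {"atlas": "/topic/xrootd.atlas",
                    "cms": "/topic/xrootd.cms"}
    UNKNOWN = "/topic/xrootd.unknown"

    def route(message):
        # Return the per-VO destination for one monitoring record.
        vo = (message.get("vo") or "").lower()
        return DESTINATIONS.get(vo, UNKNOWN)

    # Example: a record from a multi-VO site ends up in the ATLAS stream.
    print(route({"site": "MIXED_SITE", "vo": "ATLAS", "read_bytes": 123456}))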

  20. Monitoring • Failover: not flexible enough • Overflow: no monitoring yet; need to compare jobs grouped by transfer type
