
Ramping up FAX and WAN direct access




  1. Ramping up FAX and WAN direct access Rob Gardner on behalf of the atlas-adc-federated-xrootd working group Computation and Enrico Fermi Institutes University of Chicago ADC Development Meeting February 3, 2014

  2. Examine the Layers – as in prior reports • New results at increasing scale and complexity; limit tests to renamed Rucio sites • Layers, by capability: Panda re-broker (future), HammerCloud functional, HammerCloud stress, WAN testing, Failover to Federation (production), Network cost matrix (continuous), SSB functional (continuous)

  3. The New Global Logical Filename • With Rucio we are no longer dependent on the LFC • Brings substantial gains in stability and scalability • Simplifies joining the Federation • Speeds up file lookups • Makes for much nicer looking gLFNs • New gLFN format: /atlas/rucio/scope:filename • The N2N recalculates the Rucio LFN: /rucio/scope/xx/yy/filename • Checks each space token at the site for such a path • Reducing the number of space-token paths will make this even more efficient
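The N2N recalculation above can be sketched in a few lines, assuming Rucio's standard deterministic hashing (the site-specific space-token prefix that the N2N prepends is omitted here):

```python
import hashlib

def glfn_to_rucio_path(scope: str, name: str) -> str:
    """Map a gLFN's scope:filename pair to the deterministic Rucio path.

    The two hash directories xx/yy are the first four hex digits of
    md5("scope:name"); the N2N then tries this path under each of the
    site's space-token prefixes.
    """
    h = hashlib.md5(f"{scope}:{name}".encode()).hexdigest()
    return f"/rucio/{scope}/{h[0:2]}/{h[2:4]}/{name}"
```

Because the path is a pure function of scope and name, no catalog lookup is needed, which is why dropping the LFC speeds up file lookups.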

  4. Summary of FAX Site Deployments • Standardized the deployment procedures • Goals are largely achieved: Twiki doc, FAX rpms in the WLCG repo, etc. • Software components • Xrootd release requirement and X509 support are mostly achieved • Rucio N2N deployment is in progress (dCache, DPM, Xrootd, Posix (GPFS, Lustre)) • ~60% of sites have deployed the N2N: sites are either cautious or delayed by a libcurl bug on the SL5 platform • Fix is ready, but we would still like to hear from the DPM team about their validation results • EOS has its own, functioning N2N plug-in • The redirection network has been stable since the switch to Rucio • Recommending a scalable FAX site configuration for Tier 1s • Use a small xrootd cluster instead of a single machine • Similar to multiple GridFTP doors • BNL and SLAC use this configuration

  5. Infrastructure: 10 redirectors

  6. Infrastructure: 44 SEs with Xrootd

  7. Active FAX sites

  8. Basic redirection functionality • Direct access from clients to sites • Redirection to non-local data (“upstream”) • Redirection from central redirectors to the site (“downstream”) • Waiting on: • Rucio-based gLFN → PFN mapper plugin • Storage software upgrades • Rucio renaming • Testing uses a host at CERN which runs a set of probes against the sites

  9. Regular Status Testing from the SSB • Functional tests run once per hour • Checks whether direct Xrootd access is working • Sends an email with details to cloud support and fax-ops on problem notification, and again when the problem is resolved
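A minimal sketch of such a functional probe, assuming the `xrdcp` client is on the path and using placeholder endpoint and gLFN values (the real SSB machinery also stores the history and drives the notification mails):

```python
import subprocess

def direct_access_ok(endpoint: str, glfn: str, timeout: int = 60) -> bool:
    """Probe one FAX endpoint with a direct copy to /dev/null.

    Any failure mode (non-zero exit, timeout, or a missing xrdcp
    client) counts as a failed probe.
    """
    try:
        result = subprocess.run(
            ["xrdcp", "-f", f"root://{endpoint}/{glfn}", "/dev/null"],
            capture_output=True, timeout=timeout)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
```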

  10. FAX Throughput

  11. Status of Cost Matrix • Submits jobs into the 20 largest ATLAS compute sites (continuously) • Measures average IO to each endpoint (an xrdcp of a 100 MB file) • Stores the result in the SSB, along with FTS and perfSONAR bandwidth data • Data is sent to Panda for use in WAN brokering decisions
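One cost-matrix style measurement can be sketched as follows; the endpoint, gLFN, and the assumption that the test file is exactly 100 MB are all illustrative, not the production tool:

```python
import subprocess
import time

def throughput_mbs(nbytes: int, elapsed_s: float) -> float:
    """Average IO rate in MB/s for nbytes copied in elapsed_s seconds."""
    return nbytes / (1024 * 1024) / elapsed_s

def copy_rate_mbs(endpoint: str, glfn: str) -> float:
    """Time one xrdcp of the (assumed 100 MB) test file and return MB/s."""
    start = time.time()
    subprocess.run(["xrdcp", "-f", f"root://{endpoint}/{glfn}", "/dev/null"],
                   check=True, capture_output=True)
    return throughput_mbs(100 * 1024 * 1024, time.time() - start)
```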

  12. Comparison of data used for cost matrix collection between a representative compute-site/storage-site pair.

  13. WAN performance map • Performance map for the selection of WAN links • Can be used as a rough control factor for WAN load • Track as we see network upgrades in the next year

  14. In Production: Failover-to-FAX • Two month window • Mix of PROD and ANALY • Failover rates are relatively modest • About 60k jobs, 60% recovered
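The failover decision itself is handled by the pilot; a toy sketch of the logic, with an illustrative global-redirector hostname, is:

```python
import os

def access_url(local_pfn: str, glfn: str,
               redirector: str = "glrd.usatlas.org:1094") -> str:
    """Pick the URL a job should open: local replica first, FAX fallback.

    If the local physical file name is unreadable, fall back to reading
    the gLFN through the FAX global redirector over the WAN.
    """
    if os.path.exists(local_pfn):
        return local_pfn
    return f"root://{redirector}/{glfn}"
```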

  15. Failover-to-FAX rate comparisons • The low rate of usage is a measure of the existing reliability of ATLAS storage sites • (Plot shows the number of failover jobs over time, with storage issues annotated)

  16. Failover-to-FAX rate comparisons • WAN failover IO is reasonable • Thus there is no penalty for a queue using WAN failover

  17. Failover-to-FAX enabled queue • Any Panda queue resource can be easily enabled to use FAX for the fallback case.

  18. WAN Direct Access Testing • Directly access remote FAX endpoints • Reveals an interesting WAN landscape • Relative WAN event rates and CPU efficiency are very good in DE (at the scale of tens of jobs) • The question is at what job scale one reaches diminishing returns (HammerCloud results from Friedrich Hoenig)

  19. WAN Load Test (200 job scale) • Using the HC framework in the DE cloud; SMWZ HWW • Some uncertainty in the number of concurrently running jobs (not directly controllable) • Indicates a reasonable opportunity for re-brokering

  20. Load Testing with Direct WAN IO • 744 files (~3.7 GB each), reading an FDR dataset over the WAN, TTC = 30 MB • Limited to 250 jobs in the test queue • “Deep read”: 10% of events, all 8k branches • Used most of the 10g connection

  21. FAX user tools • Useful for Tier 3 users or access to ATLAS data from non-grid clusters (e.g. cloud, campus cluster, etc.) • AtlasLocalRootBase package: localSetupFAX • Sets up dq2-client • Sets up grid middleware • Sets up xrootd client • Sets up an optimal FAX access point • Uses geographical distance from client IP to FAX endpoints • FAX tools • isDSinFAX.py • FAX-setRedirector.sh • FAX-get-gLFNs.sh • Removes need for redirector knowledge • Eases support
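The "optimal FAX access point" selection can be approximated by a nearest-endpoint pick. The hostnames and coordinates below are illustrative placeholders, and the real localSetupFAX tool geolocates the client IP rather than taking coordinates as input:

```python
import math

# Hypothetical (lat, lon) coordinates for a few regional redirectors.
REDIRECTORS = {
    "atlas-xrd-us.usatlas.org": (41.79, -87.60),  # Chicago area
    "atlas-xrd-eu.cern.ch":     (46.23,   6.05),  # CERN
    "atlas-xrd-de.cern.ch":     (48.26,  11.67),  # Munich area
}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def closest_redirector(client_latlon):
    """Return the redirector geographically closest to the client."""
    return min(REDIRECTORS, key=lambda r: haversine_km(client_latlon, REDIRECTORS[r]))
```

Picking the nearest access point keeps the first redirection hop short even when the data itself ends up being read from a distant site.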

  22. Conclusions, Lessons, To-do • Significant stability improvements for sites using the Rucio namespace mapper • Also, with the removal of LFC callouts, no redirector stability issues have been observed • Tier 1 Xrootd proxy stability issues • Have been observed under very large loads during stress tests (O(1000) clients), but with no impact on the backend SE • Adjustments were made, with success on re-test • Suggests a configuration for protecting Tier 1 storage • The WAN landscape is obviously diverse • Cost matrix captures capacities • Probes at 10g link scale indicate an appropriate WAN job level of < 500 jobs (typically 10% of CPU capacity) • Controlled load testing is on-going
