
FAX update





  1. FAX update 27th MAY 2013

  2. Status authentication Ilija Vukotic ivukotic@uchicago.edu

  3. With little HC testing traffic, we are back to Tier3 and functional-test traffic patterns. Looks like an HC stress test.

  4. Discussion points: • TIM debriefing • allowfax=true • Moving to 3.3.2 • Security • Response times • Rucio • Cost matrix • Caching • FAX tutorial for the June S&C week and the offline software tutorial • Understanding the EOS setup • Expansion

  5. TIM • A lot of important issues were discussed (mostly over meals). • While it is fine to report on issues (after all, it was a Technical Interchange Meeting), we should have discussed solutions. • One example: we all know ATLAS has no single analysis framework and that most users are unable to use TTC (TTreeCache) properly. We should have listed all the possible solutions, as this is an important and difficult question.

  6. Allowfax=true • The first FAX use case is fallback to FAX when the pilot cannot stage in input files. • It should have been a simple option to turn on: allowfax=true. • The pilot is ready, but no site except MWT2 has it on. • There was no campaign to enable it, as there are no simple instructions to follow. • Questions: • Is it useful for direct-access sites? Probably yes, as they still sometimes stage in for production jobs. • Are any other changes needed? • We need a simple instruction for site admins to apply. This really concerns ALL sites, not only FAX-enabled ones. • When the pilot comes with Rucio-format file names, can fallback work?
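
The fallback described on this slide is simple enough to sketch. This is a minimal illustration, not the pilot's actual code: `stage_in`, `local_stage` and `fax_copy` are hypothetical names, and in practice the FAX copy would be an xrdcp of the file's gLFN through a redirector.

```python
from typing import Callable

# Hypothetical stager signature (not the PanDA pilot API): takes a file name,
# returns True if the file was delivered to the worker node.
Stager = Callable[[str], bool]


def stage_in(lfn: str, local_stage: Stager, fax_copy: Stager, allowfax: bool) -> str:
    """Return which path delivered the file: 'local' or 'fax'; raise if both fail."""
    if local_stage(lfn):
        return "local"
    # Fallback only happens when the site has switched on allowfax=true.
    if allowfax and fax_copy(lfn):
        return "fax"
    raise OSError(f"stage-in failed for {lfn}")
```

The flag only gates the second attempt, so turning it on cannot change the behaviour of jobs whose normal stage-in already succeeds.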

  7. Moving to xrootd 3.3.2 • How many sites have upgraded? • Are the VOMS libraries ready? • Has anyone turned access authentication back on?

  8. WAITING Developments • FAX MONITOR • Organize a meeting with Gerd. • Jarka to add FAX-related mailing. • Julia: automatic validation of the monitoring chain • N2N • Rucio-enabled C++ version; DPM version? • Rucio DS listing tools.

  9. Response times Importance: if the redirector does not wait long enough for the connected endpoints to report, files that do exist can be reported as not found. In FDR tests this shows up as a file-copy error and is now the largest single cause of failed FDR jobs. Recent changes: unneeded 5 s delays at MWT2 and AGLT2 removed; EOS at 15 s, national redirectors at 30 s, EU at 45 s.
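
A toy model makes the failure mode concrete. It assumes (a simplification, not the actual xrootd/cmsd logic) that the redirector collects "I have it" replies until a fixed wait expires and reports "not found" if none arrived in time. The endpoint names and response times below are hypothetical.

```python
from typing import Dict, Optional


def locate(responses: Dict[str, Optional[float]], wait_s: float) -> Optional[str]:
    """Return the fastest endpoint that reports the file within wait_s, else None.

    responses maps endpoint name -> seconds until it answers "I have the file",
    or None if the endpoint does not hold the file (simplified model).
    """
    in_time = [(t, ep) for ep, t in responses.items() if t is not None and t <= wait_s]
    if not in_time:
        # Reported as "not found" even if a slower endpoint actually has the file.
        return None
    return min(in_time)[1]
```

For example, with `{"MWT2": None, "AGLT2": 20.0}` a 15 s wait returns `None` (a false "not found", i.e. an FDR copy error), while a 25 s wait returns `"AGLT2"`.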

  10. Response times • During TIM we discussed the most expensive LFC lookup. • Hiro feels we can remove it. • This would prevent getting a file with a gLFN that has the data container name instead of the dataset name before the file name. • Cedric promised to change “dq2-list-files -p”, currently the only way for users to get gLFNs, so that it no longer returns the data container name in the gLFN. • As soon as that is done I will remove the 4th lookup from the C++ and Java N2N and re-measure delays. • Nothing has been done yet on the SSB delay monitor.

  11. Move to Rucio • Java N2N supports Rucio. • Not all sites have “aprotocols”. • Need N2N for native xrootd and DPM xrootd doors. • A Rucio equivalent of “dq2-list-files -p” won't come soon. • Cedric promised to change dq2-list-files -p to return Rucio-formatted gLFNs for all the files that have been translated. This alone will significantly reduce the load on LFCs and make the transition smooth.
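
The reason Rucio-formatted gLFNs relieve the LFCs is that the local path can be computed rather than looked up. The sketch below assumes Rucio's deterministic "hash" path convention (`<scope>/<md5[0:2]>/<md5[2:4]>/<name>`, with the md5 taken over `scope:name`) and a gLFN of the form `/atlas/rucio/<scope>:<name>`; both, and the site prefix, are assumptions to check against the Rucio and N2N sources, not a description of the Java N2N.

```python
import hashlib


def rucio_relative_path(scope: str, name: str) -> str:
    """Assumed Rucio deterministic path: <scope>/<md5[0:2]>/<md5[2:4]>/<name>."""
    digest = hashlib.md5(f"{scope}:{name}".encode("utf-8")).hexdigest()
    if scope.startswith("user") or scope.startswith("group"):
        scope = scope.replace(".", "/")  # e.g. user.jdoe -> user/jdoe
    return f"{scope}/{digest[0:2]}/{digest[2:4]}/{name}"


def glfn_to_local(glfn: str, site_prefix: str) -> str:
    """Translate an assumed /atlas/rucio/<scope>:<name> gLFN with no catalog query."""
    marker = "/atlas/rucio/"
    if not glfn.startswith(marker):
        raise ValueError(f"not a Rucio gLFN: {glfn}")
    scope, name = glfn[len(marker):].split(":", 1)
    # site_prefix is a hypothetical per-site storage root.
    return f"{site_prefix.rstrip('/')}/rucio/{rucio_relative_path(scope, name)}"
```

The translation is pure string and hash work, so a miss costs nothing at the catalog level; only the storage itself has to be asked whether the file exists.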

  12. Cost matrix • To combine different sources of information in the cost matrix we have to understand them and make sure that all the sources tell the same story. • There is a huge difference between the perfSONAR and HC SSB cost matrices. • As presented at TIM, we know that a single xrdcp does not use all available bandwidth. • Normally xrdcp uses a single stream but requests 20 chunks of 4 MB simultaneously.
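
A back-of-the-envelope bound helps explain why a single xrdcp can under-use the link. This is a textbook model, not a measurement of xrdcp: it assumes each TCP stream moves at most one window per round trip, and that with a fixed number of outstanding chunk requests at most that many bytes are in transit per round trip. The 20 × 4 MB defaults come from the slide; RTT, window and link values are hypothetical inputs.

```python
def throughput_bound_mb_s(rtt_s: float, tcp_window_mb: float, streams: int,
                          link_mb_s: float, chunks_in_flight: int = 20,
                          chunk_mb: float = 4.0) -> float:
    """Upper bound on transfer rate (MB/s) as the minimum of three limits."""
    # Assumption: the TCP window, not disk or server, limits each stream.
    window_bound = streams * tcp_window_mb / rtt_s
    # Little's law: outstanding bytes / round-trip time. Defaults: 20 x 4 MB.
    pipeline_bound = chunks_in_flight * chunk_mb / rtt_s
    return min(link_mb_s, window_bound, pipeline_bound)
```

With an assumed 100 ms RTT, 4 MB window and 1000 MB/s link, one stream is capped at about 40 MB/s and ten streams at about 400 MB/s, while the 80 MB of outstanding requests only becomes the limit at roughly 800 MB/s. These numbers are illustrative only.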

  13. Cost matrix • Now tested with 10 streams. Client at lxplus.

  14. Cost matrix • Tested different ReadCacheSize values. According to man xrdcp, the default value is 4M. Evidently, setting the default changes something else as well.

  15. Cost matrix • From Lukasz: client 3.3.2 (xrdcopy, not xrdcp) can do 700 MB/s in one stream! • I've tried it. • A file from LRZ can't be transferred with xrdcopy (it always fails at 94-98%), while xrdcp shows no problems. • We should make sure everything works when we move to 3.3.2 (even the command-line syntax changed a bit). • We still have to finish performance measurements with 3.3.2.

  16. Caching • Started work on simulating cache behavior on historical data. • The historical data are the same export Wahid used for his TIM talk. • Preliminary figures (mostly MWT2 and Prague) show relatively high reuse. • Still, the data are probably not very representative of what is to come; for that we need users. • In parallel, will work on establishing a real-life cache at MWT2.
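
Replaying an access log through a simulated cache is one way to turn "relatively high reuse" into a number. The slide does not say which replacement policy or log format was used; the sketch assumes LRU eviction and a hypothetical log of `(file_name, size_bytes)` accesses, and reports the fraction of bytes served from cache.

```python
from collections import OrderedDict
from typing import Iterable, Tuple


def simulate_lru(accesses: Iterable[Tuple[str, int]], capacity_bytes: int) -> float:
    """Replay (file_name, size_bytes) accesses through an LRU cache; return byte hit ratio."""
    cache: "OrderedDict[str, int]" = OrderedDict()
    used = hit_bytes = total_bytes = 0
    for name, size in accesses:
        total_bytes += size
        if name in cache:
            hit_bytes += size
            cache.move_to_end(name)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # larger than the whole cache: never cached
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict least recently used
            used -= evicted_size
        cache[name] = size
        used += size
    return hit_bytes / total_bytes if total_bytes else 0.0
```

Sweeping `capacity_bytes` over a site's log gives a hit-ratio curve, which is a natural input when sizing a real cache such as the one planned at MWT2.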

  17. User engagement • S&C week: 10-16 June • Offline software tutorial: 24-28 June • Is 30 min enough? • Communicate with the organizers. • What about users close to sites that already have FAX? US Analysis Support team?

  18. Expansion • Working on PIC (Spanish T1), dCache 1.9.12.23 • During TIM we agreed on adding all 3 Australian sites (Sydney, Melbourne, Adelaide). • Will need an AU redirector. • AU is closest to the US west coast. • Make a us-west redirector and attach AU there? • Should we aim for the Lyon Tier1? French users are ideally positioned to use FAX.

  19. Reserve

  20. N2N and LFC • The LFC server is very fast on individual queries (~150 ms). • The number of queries grows quickly with the number of “misses”: • not starting from an endpoint (assumed to be the largest contributor) • true misses • Misses are much more expensive than hits: • N2N tests 4 possible gLFNs against LFC. • N2N does a very expensive “is this a DataContainer or a DataSet?” lookup.* Is it needed? • Is the LFC client library parallel or not? * Also supports this case: // trying in case the name space is using the old format. // current: /atlas/dq2/data11_7TeV/NTUP_TOP/f369_m812_p530_p577/data11_7TeV.00180309.physics_Egamma.merge.NTUP_TOP.f369_m812_p530_p577_tid367204_00/ // old: /atlas/dq2/data11_7TeV/NTUP_TOP/data11_7TeV.00180309.physics_Egamma.merge.NTUP_TOP.f369_m812_p530_p577_tid367204_00_sub021131151/
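
Why misses dominate can be shown by counting queries. The real N2N plugins are C++ and Java; this Python sketch only models the ordering: candidate gLFNs are tried one by one, a hit stops early, and a miss pays for every candidate. The candidate names and the `lfc_has` callback are hypothetical; the ~150 ms per query is the figure quoted on the slide.

```python
from typing import Callable, List, Optional, Tuple

LFC_QUERY_S = 0.150  # ~150 ms per individual LFC query, as quoted on the slide


def resolve(candidates: List[str],
            lfc_has: Callable[[str], bool]) -> Tuple[Optional[str], int]:
    """Try candidate names in order; return (first hit or None, LFC queries made).

    lfc_has is a hypothetical stand-in for one LFC lookup.
    """
    queries = 0
    for name in candidates:
        queries += 1
        if lfc_has(name):
            return name, queries  # a hit on the first candidate costs one query
    return None, queries  # a miss costs one query per candidate
```

With 4 candidates, a first-try hit costs about 0.15 s while a miss costs 4 × 0.15 s ≈ 0.6 s before the DataContainer/DataSet lookup is even counted. Every endpoint that does not hold the file pays that miss cost, which is why dropping the 4th candidate matters.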

  21. Delays: to do • Increase delays on all endpoints to 25 seconds. This will probably fix all the “copy errors” observed in FDR. • Check whether the expensive lookup can be dropped. • Hiro: can the pilot's function generate the gLFN with the dataset name? • Cedric: change dq2-list-files -p to return only dataset names in the path. • Get lookup times into the SSB monitor.
