CASTOR in ATLAS (T0) Luc GOOSSENS CERN, Geneva/Switzerland PH/ATC Group

Presentation Transcript


  1. CASTOR in ATLAS (T0) Luc GOOSSENS CERN, Geneva/Switzerland PH/ATC Group

  2. disclaimer
     • these slides only report the T0 experience
     • there's more CASTOR usage in ATLAS

  3. T0 test 25/9 -> 21/10
     • ~1.5 PB of data was moved using 2.6 million rfcps
     • only 13K errors (0.5%)
     • moved 0.5 PB in week 3 of the test!
     • that is 140% of the nominal rate
     • ~200,000 rfcps per day
     • CASTOR was the main source of problems
     • unexpectedly, because this was the 3rd time we ran the same test
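The figures on slide 3 can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slide; the implied nominal volume and rate are derived values, not figures from the talk, and the unit conventions (1 PB = 10^15 bytes, 1 MB = 10^6 bytes) are an assumption.

```python
# Figures quoted on slide 3 (approximate, as the slide itself marks them).
total_rfcps = 2_600_000
errors = 13_000
week3_pb = 0.5            # PB moved in week 3
week3_vs_nominal = 1.40   # "140% wrt nominal rate"

# Fraction of rfcp invocations that ended in an error.
error_rate = errors / total_rfcps

# Nominal weekly volume implied by the week-3 figure (derived, not on the slide).
nominal_pb_per_week = week3_pb / week3_vs_nominal

# Assumption: 1 PB = 10**15 bytes and 1 MB = 10**6 bytes.
seconds_per_week = 7 * 24 * 3600
nominal_mb_per_s = nominal_pb_per_week * 1e15 / seconds_per_week / 1e6

print(f"error rate: {error_rate:.1%}")
print(f"implied nominal volume: {nominal_pb_per_week:.2f} PB/week")
print(f"implied nominal rate: {nominal_mb_per_s:.0f} MB/s")
```

The error rate reproduces the 0.5% on the slide; the other two values are only as precise as the rounded inputs.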

  4. problems
     • usually found by our monitoring
     • e.g. service degradation over three days, ending with a complete stop

  5. problems (cont'd)
     • software/design bugs
     • stuck rfcps due to an LSF API bug
     • rfcps are serviced in random order
     • cannot safely ctrl-c an rfcp
     • 3 orders of magnitude difference in rfcp pending time [our hypothesis]
     • migration does not respect time order
     • recent unprocessed RAW files are purged from disk while plenty of old processed files stay on disk
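The last point on slide 5 describes the behaviour the authors did not want: recent unprocessed RAW files being purged while old processed files remain. As a minimal sketch of the ordering they seem to be asking for, the following selects purge candidates oldest-first and never touches unprocessed files. This is not CASTOR's garbage collector; `DiskFile`, `select_purge_candidates`, and the `processed` flag are hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskFile:
    name: str
    created: float   # creation time, seconds since some epoch
    size: int        # bytes
    processed: bool  # True once downstream processing has consumed the file


def select_purge_candidates(files, bytes_needed):
    """Return files to purge, oldest processed files first.

    Unprocessed files are never selected, even if that means freeing
    less than bytes_needed.
    """
    candidates = sorted((f for f in files if f.processed), key=lambda f: f.created)
    selected, freed = [], 0
    for f in candidates:
        if freed >= bytes_needed:
            break
        selected.append(f)
        freed += f.size
    return selected


files = [
    DiskFile("old_processed_a", created=100.0, size=2, processed=True),
    DiskFile("new_raw", created=300.0, size=2, processed=False),
    DiskFile("old_processed_b", created=200.0, size=2, processed=True),
]
to_purge = select_purge_candidates(files, bytes_needed=3)
print([f.name for f in to_purge])
```

In this toy input the unprocessed `new_raw` file survives, and the two processed files are chosen in creation order until enough space is freed.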

  6. … on data export
     • CASTOR ok, SRM not
     • some mysterious zero-length files
     • likely an FTS/SRM bug: failed transfer attempts cannot be deleted
     • strange combination of nsrm, rfrm, stager_rm, srm-advisory-delete needed to remove files and “free up” “locked” namespace entries
     • question: why so many user commands?
     • rfrm = nsrm + stager_rm ?
     • mapping of (Grid) users and groups to service classes…
     • … and security: protecting the namespace!!!
     • so far, thanks to Jan, Olof, Miguel for the manual interventions!
     • unclear error messages within the SRM layer
     • “Failed to retrieve TURL” when the source file simply doesn’t exist or has zero size
     • have seen the same message under other conditions (overload)
     • will need improvements to find free space:
     • parsing stager_qry output is undesirable
     • SRM v2.2 should solve this one!

  7. conclusions
     • we feel insufficient thought has been given to non-ideal CASTOR operation modes
     • e.g. when big backlogs exist
     • we are worried about the not-yet-exercised tape recall
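The backlog concern in the conclusions connects to the hypothesis on slide 5 that random-order servicing explains the wide spread in rfcp pending times. A toy single-server queue, sketched below, illustrates why that hypothesis is plausible: with a standing backlog, random selection lets some requests through immediately while others wait far longer than under oldest-first service. The load profile, the one-request-per-tick service model, and the function name `simulate` are all assumptions made for illustration and say nothing about how CASTOR or LSF actually schedule requests.

```python
import random


def simulate(arrivals, pick):
    """Serve one request per tick; `pick` chooses an index into the waiting list.

    arrivals[t] is the number of requests arriving at tick t. The waiting
    list holds arrival ticks in arrival order, so index 0 is the oldest.
    Returns the pending time (service tick minus arrival tick) per request.
    """
    waiting, waits, t = [], [], 0
    while t < len(arrivals) or waiting:
        if t < len(arrivals):
            waiting.extend([t] * arrivals[t])
        if waiting:
            arrived = waiting.pop(pick(waiting))
            waits.append(t - arrived)
        t += 1
    return waits


rng = random.Random(1)
# Hypothetical load: a burst builds a backlog, then arrivals match capacity.
arrivals = [5] * 200 + [1] * 2000

fifo = simulate(arrivals, lambda w: 0)                      # oldest first
rand = simulate(arrivals, lambda w: rng.randrange(len(w)))  # random order

for name, waits in (("fifo", fifo), ("random", rand)):
    mean = sum(waits) / len(waits)
    print(name, "min", min(waits), "max", max(waits), "mean", round(mean, 1))
```

Both disciplines serve the same number of requests with the same mean pending time in this model; what changes is the spread, which is the quantity the slide calls out.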