CASTOR in ATLAS (T0) Luc GOOSSENS CERN, Geneva/Switzerland PH/ATC Group

disclaimer
• these slides only report the T0 experience
• there's more CASTOR usage in ATLAS

T0 test 25/9 -> 21/10
• ~1.5 PB of data was moved using 2.6 million rfcps
• only 13K errors (0.5%)
• moved 0.5 PB in week 3 of the test!
• that is 140% w.r.t. the nominal rate
• ~200,000 rfcps per day
• CASTOR was the main source of problems
• unexpectedly, because this was the 3rd time we were doing the same test
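The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers quoted above; it assumes decimal petabytes (1 PB = 10^15 bytes) and a full seven-day week, neither of which the slide states, so the derived rates are indicative only.

```python
# Figures quoted on the slide.
total_rfcps = 2_600_000
failed_rfcps = 13_000
week3_fraction_of_nominal = 1.40

# Assumption (not stated on the slide): decimal petabytes, 7 full days.
week3_bytes = 0.5e15
SECONDS_PER_WEEK = 7 * 24 * 3600

error_rate = failed_rfcps / total_rfcps
week3_rate_mb_s = week3_bytes / SECONDS_PER_WEEK / 1e6
implied_nominal_mb_s = week3_rate_mb_s / week3_fraction_of_nominal

print(f"error rate:           {error_rate:.1%}")
print(f"week-3 average rate:  {week3_rate_mb_s:.0f} MB/s")
print(f"implied nominal rate: {implied_nominal_mb_s:.0f} MB/s")
```

Under these assumptions the quoted 0.5% error rate is consistent with 13K failures out of 2.6 million copies, and the week-3 volume corresponds to a sustained rate of several hundred MB/s.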

problems
• usually found by our monitoring
• e.g. service degradation over three days, ending with a complete stop

problems (cont'd)
• software/design bugs
• stuck rfcps due to an LSF API bug
• rfcps are serviced in random order
• cannot safely Ctrl-C an rfcp
• 3 orders of magnitude difference in rfcp pending time [our hypothesis]
• migration does not respect time order
• recent unprocessed RAW files are purged from disk while plenty of old processed files stay on disk
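The link between random-order servicing and a very wide spread in pending times is stated above as a hypothesis. A toy simulation can illustrate why it is plausible. This is a minimal sketch, not a model of the CASTOR or LSF scheduler: it assumes a constant backlog with one arrival and one service per tick, and the function name, backlog size and step count are all made up for illustration.

```python
import random
import statistics


def simulate(policy, backlog=1000, steps=20000, seed=0):
    """Return waits (in ticks) of requests served under the given policy.

    Toy model: a constant backlog, one arrival and one service per tick.
    policy is "fifo" (oldest first) or "random" (uniform pick from the queue).
    """
    rng = random.Random(seed)
    queue = [0] * backlog  # arrival ticks of the pre-existing backlog
    waits = []
    for tick in range(1, steps + 1):
        queue.append(tick)  # one new request arrives
        index = 0 if policy == "fifo" else rng.randrange(len(queue))
        arrived = queue.pop(index)  # one request is served
        if arrived > 0:  # ignore the pre-existing backlog
            waits.append(tick - arrived)
    return waits


for policy in ("fifo", "random"):
    waits = simulate(policy)
    print(policy, min(waits), statistics.median(waits), max(waits))
```

In this model every request served in arrival order waits exactly as long as the backlog is deep, while a random pick lets some requests through almost immediately and leaves others pending for many times the backlog depth. That spread grows with the backlog, which fits the concern about operation with big backlogs.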

… on data export
• CASTOR ok, SRM not
• some mysterious zero-length files
• likely an FTS/SRM bug: failed transfer attempts cannot be deleted
• strange combination of nsrm, rfrm, stager_rm, srm-advisory-delete needed to remove files and “free up” “locked” namespace entries
• question: why so many user commands?
• rfrm = nsrm + stager_rm ?
• mapping of (Grid) users, groups to service classes…
• … and security: protecting the namespace!!!
• so far, thanks to Jan, Olof, Miguel for the manual interventions!
• unclear error messages within the SRM layer
• “Failed to retrieve TURL” when the source file simply doesn’t exist or is zero size
• have seen the same message under other conditions (overload)
• will need improvements to find free space:
• parsing stager_qry output is undesirable
• SRM v2.2 should solve this one!

conclusions
• we feel insufficient thought has been given to non-ideal CASTOR operation modes
• e.g. when big backlogs exist
• we are worried about the not-yet-exercised tape recall
