
LHCb Status report June 08


Presentation Transcript


  1. LHCb Status report June 08

  2. Activities since February
  • Applications and Core Software
    • Preparation of applications for real data
    • Simulation with real geometry (from survey)
    • Certification of GEANT4 9.1
    • Alignment and calibration procedures in place
  • Production activities (aka DC06)
    • New simulations on demand
    • Continue stripping and re-processing
    • Dominated by data access problems (and how to deal with them)
  • Core Computing
    • Building on CCRC08 phase 1
    • Improved WMS and DMS
    • Introduced error recovery, failover mechanisms, etc.
    • Commission DIRAC3 for simulation and analysis
    • Only DM and reconstruction were exercised in February

  3. Sites configuration
  • Databases
    • ConditionsDB and LFC replicated (via 3D) to all Tier1s
    • Read-only LFC mirror available at all Tier1s
  • Site SE migrations
    • RAL: migration from dCache to Castor2
      • Very painful exercise (bad pool configurations)
      • Took over 8 months to complete, but… it’s over!
    • PIC: migration from Castor1 to dCache
      • Fully operational; Castor decommissioned since March
    • CNAF: migration of T0D1 and T1D1 to StoRM
      • Went very smoothly for migrating existing files from Castor2
  • SRM v2 spaces
    • All spaces needed for CCRC were in place
    • SRM v2 still not used for DC06 production (see later)

  4. DC06 production issues
  • Still using DIRAC2, i.e. SRM v1
    • No plan to backport SRM v2 to DIRAC2
    • We could update the LFC entries to srm-lhcb.cern.ch (v2) for reading, but…
      • Still need srm-durable-lhcb.cern.ch for T0D1 upload
      • Need an srm-get-metadata that works for SRM v2
    • When the SRM v2 and v1 endpoints are the same, no problem
    • DIRAC2 checked against StoRM: no problem
  • File access problems
    • As for CCRC, these dominate the re-processing problems
    • Castor sites: OK; would like to have rootd at RAL (problems known for 4 years with the rfio plugin in ROOT, alleviated with rootd)
    • Many (>7,000) files lost at CERN-Castor (mostly from autumn 2006), to be marked as such in the LFC (some irrecoverable)
    • dCache sites: no problem using the dcap protocol (PIC, GridKa); many problems with gsidcap (IN2P3, NL-T1)
    • How to deal with erratic errors (files randomly inaccessible)?

  5. CCRC’08: Summary of last week’s report (courtesy of Nick Brook)

  6. Planned tasks
  • May activities
    • Maintain the equivalent of 1 month of data taking
      • Assuming a 50% machine cycle efficiency
    • Run fake analysis activity in parallel to production-type activities
      • Analysis-type jobs were used for debugging throughout the period
      • GANGA testing ran at a low level during the last weeks

  7. Activities across the sites
  • Planned breakdown of processing activities (CPU needs) prior to CCRC08

  8. Pit -> Tier 0
  • Use of rfcp to copy data from the pit to CASTOR (a minimal sketch of this copy loop follows below)
    • rfcp is the approach recommended by IT
  • A file is sent every ~30 sec
  • Data remains on online disk until CASTOR migration
  • Rate to CASTOR: ~70 MB/s
  • In general ran smoothly, with two exceptions:
    • Stability problems with the online storage area, solved with a firmware update during CCRC
    • Internal issues with sending bookkeeping info
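The following is an illustrative sketch, not the LHCb online code, of what a pit -> Tier 0 copy loop of this kind looks like: watch a local staging area for finished RAW files and push each one to CASTOR with rfcp, at roughly one file every ~30 seconds. The directory paths and the pacing are assumptions made for the example; only the use of rfcp is taken from the slide.

```python
import glob
import os
import subprocess
import time

ONLINE_DIR = "/onlinedisk/lhcb/raw"            # assumed local staging area
CASTOR_DIR = "/castor/cern.ch/grid/lhcb/raw"   # assumed CASTOR destination

def copy_to_castor(local_path: str) -> bool:
    """Copy one RAW file to CASTOR with rfcp; return True on success."""
    dest = os.path.join(CASTOR_DIR, os.path.basename(local_path))
    result = subprocess.run(["rfcp", local_path, dest])
    return result.returncode == 0

def main() -> None:
    while True:
        for path in sorted(glob.glob(os.path.join(ONLINE_DIR, "*.raw"))):
            if copy_to_castor(path):
                # Per the slide, the file stays on online disk until CASTOR
                # tape migration is confirmed; that check is not shown here.
                print(f"copied {path}")
        time.sleep(30)  # roughly one file every ~30 seconds

if __name__ == "__main__":
    main()
```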

  9. Tier 0 -> Tier 1
  • FTS transfers from CERN to the Tier-1 centres (a sketch of the gating logic follows below)
  • Transfer of RAW only occurs once the data has migrated to tape and the checksum is verified
  • Rate out of CERN: ~35 MB/s averaged over the period
    • Peak rate far in excess of the requirement
  • In smooth running, sites matched the LHCb requirements
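A minimal sketch of the gating condition described on this slide: a RAW file is only handed to FTS for Tier-1 replication once it has a tape copy and its checksum matches the value recorded at the pit. The two check functions are passed in as callables because their real implementations (CASTOR/nameserver queries) are site-specific and not reproduced here; none of this is the actual DIRAC/FTS code.

```python
import time
from typing import Callable

def ready_for_export(
    lfn: str,
    has_tape_copy: Callable[[str], bool],    # e.g. a CASTOR query (placeholder)
    checksums_match: Callable[[str], bool],  # pit checksum vs CASTOR checksum (placeholder)
    poll_seconds: int = 60,
    max_polls: int = 120,
) -> bool:
    """Poll until the file has a tape copy, then require a checksum match."""
    for _ in range(max_polls):
        if has_tape_copy(lfn):
            return checksums_match(lfn)
        time.sleep(poll_seconds)
    return False

# Only files for which ready_for_export(...) returns True would be put into
# an FTS transfer job towards the Tier-1 sites.
```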

  10. Tier 0 -> Tier 1
  • To first order, all transfers eventually succeeded
  • Plot shows the efficiency on the 1st attempt; annotated incidents include:
    • Issue with UK certificates
    • Restart of the IN2P3 SRM endpoint
    • CERN outage
    • CERN SRM endpoint problems

  11. Reconstruction
  • Used SRM 2.2 SEs
  • LHCb space tokens are:
    • LHCb_RAW (T1D0)
    • LHCb_RDST (T1D0)
  • Data shares need to be preserved
    • Important for resource planning
  • Input: 1 RAW file; output: 1 rDST file (1.6 GB)
  • Reduced the number of events per reconstruction job from 50k to 25k (job duration ~12 hours on a 2.8 kSI2k machine), in order to fit within the available queues (a back-of-the-envelope check follows below)
  • Need queues at all sites that match our processing time
    • Alternative: reduce the file size!
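A back-of-the-envelope check of the numbers quoted on this slide, with Python used purely as a calculator. The slide gives ~12 h for 25k events on a 2.8 kSI2k machine; the implied per-event cost and the notion of a 24-hour batch queue limit below are derived or assumed for illustration, not quoted from the report.

```python
REF_EVENTS = 25_000
REF_WALL_HOURS = 12.0
REF_POWER_KSI2K = 2.8

# Implied per-event CPU cost, normalised to machine power: ~4.8 kSI2k.s/event.
ksi2k_seconds_per_event = REF_WALL_HOURS * 3600 * REF_POWER_KSI2K / REF_EVENTS
print(f"~{ksi2k_seconds_per_event:.1f} kSI2k.s per event")

def wall_hours(events: int, power_ksi2k: float) -> float:
    """Estimated wall-clock duration of one reconstruction job."""
    return events * ksi2k_seconds_per_event / power_ksi2k / 3600

# A 50k-event job on the same machine would run ~24 h, which is hard to fit
# into typical batch queues; halving to 25k events gives the quoted ~12 h.
for n_events in (50_000, 25_000):
    print(n_events, "events ->", round(wall_hours(n_events, 2.8), 1), "hours")
```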

  12. Reconstruction
  • After the data transfer the file should be online, as the job is submitted immediately, but…
  • LHCb pre-stages files and then checks the file status before submitting the pilot job, using gfal_ls (a sketch of this pattern follows below)
    • Pre-staging should ensure access availability from cache
  • Only issue was at NL-T1, with the reporting of the file status
    • Discussed last week during the Storage session (dCache version)
    • (A problem also developed at IN2P3 right at the end of CCRC08, on 31st May)
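A minimal sketch of the pre-stage / status-check pattern described above, assuming the status query shells out to gfal_ls as the slide mentions. The exact gfal_ls invocation and output interpretation depended on the middleware version in use, so treat surl_is_accessible() as an illustration of the pattern rather than the DIRAC implementation.

```python
import subprocess
import time

def surl_is_accessible(surl: str) -> bool:
    """Very rough check: a successful gfal_ls is taken as 'file reachable'.
    The real check also inspected the file locality (ONLINE vs NEARLINE),
    which is not reproduced here."""
    result = subprocess.run(
        ["gfal_ls", surl],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def wait_until_staged(surl: str, poll_seconds: int = 120, max_polls: int = 30) -> bool:
    """Poll after the pre-stage request until the file looks accessible,
    and only then allow the pilot job to be submitted."""
    for _ in range(max_polls):
        if surl_is_accessible(surl):
            return True
        time.sleep(poll_seconds)
    return False
```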

  13. Reconstruction
  • 41.2k reconstruction jobs submitted
  • 27.6k jobs proceeded to the Done state
  • Done/created ~67%

  14. Reconstruction
  • 27.6k reconstruction jobs in the Done state
  • 21.2k jobs processed 25k events
    • Done/25k events ~77% (the ratios on this and the previous slide are cross-checked below)
  • 3.0k jobs failed to upload the rDST to the local SE (only 1 attempt before trying failover)
    • Failover/25k events ~13%
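A quick cross-check of the two clean ratios quoted on slides 13 and 14 (Python as a calculator; the job counts are taken from the slides).

```python
submitted = 41_200       # reconstruction jobs submitted
done = 27_600            # jobs that reached the Done state
processed_25k = 21_200   # Done jobs that processed the full 25k events

print(f"Done/created      : {done / submitted:.0%}")      # ~67%
print(f"Processed 25k/Done: {processed_25k / done:.0%}")  # ~77%
```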

  15. Summary of reconstruction issues
  • File access problems
    • Random or permanent failures to open files using gsidcap
      • No problem observed with dcap
      • Request that IN2P3 and NL-T1 allow the dcap protocol for local read access
    • (Temporary?) solution: download the file to the WN (a sketch of this fallback follows below)
      • Had to get it back from CERN in some cases
    • Wrong file status returned by the dCache SRM after a put
      • Discussed and understood last week
      • Problem was that bringOnline was not doing anything
    • Questionable, however, whether all reads should happen in a shared cache pool with a disk-to-disk copy (even for TxD1 files)
  • Software area access problems
    • Site banned for a while until the problem is fixed
    • Creates inefficiency and unavailability, but is “easy” to recover from
  • Application crashes
    • Fixed within a day or so (new software release and deployment)
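A sketch of the fallback mentioned above: try to open the input file via the local access protocol first and, if that fails, download it to the worker node before starting the application. The open_via_protocol() and download_to_wn() helpers are hypothetical placeholders; the real logic lives inside DIRAC and is not reproduced here.

```python
from typing import Callable, Optional

def resolve_input(
    surl: str,
    open_via_protocol: Callable[[str], Optional[str]],  # e.g. build a dcap/gsidcap URL and test it (placeholder)
    download_to_wn: Callable[[str], Optional[str]],     # e.g. copy the file to local scratch (placeholder)
) -> Optional[str]:
    """Return a path/URL the application can read, or None if both routes fail."""
    access_url = open_via_protocol(surl)
    if access_url is not None:
        return access_url
    # Protocol access failed (e.g. gsidcap errors): fall back to a local copy.
    return download_to_wn(surl)
```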

  16. Reconstruction
  • CPU efficiency based on the ratio of CPU time to wall-clock time for running jobs (see the small example below)
  • Low efficiency at CNAF due to:
    • software area access
    • more jobs than cores on a WN…
  • Low efficiency at RAL & IN2P3 due to input data download
    • Resolved by tuning the timeout
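A small illustration of the efficiency figure used on this slide, with invented example times: CPU efficiency as CPU time divided by wall-clock time.

```python
def cpu_efficiency(cpu_seconds: float, wall_seconds: float) -> float:
    """Fraction of the wall-clock time actually spent on the CPU."""
    return cpu_seconds / wall_seconds

# A job that gets 9 h of CPU in a 12 h wall-clock slot (e.g. because the input
# download and software-area access ate the rest) is 75% efficient.
print(f"{cpu_efficiency(9 * 3600, 12 * 3600):.0%}")  # 75%
```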

  17. dCache Observations
  • Official LCG recommendation: 1.8.0-15p3
  • LHCb ran smoothly at half of the Tier-1 dCache sites
    • PIC: OK - version 1.8.0-12p6 (unsecure)
    • GridKa: OK - version 1.8.0-15p2 (unsecure)
    • IN2P3: problematic - version 1.8.0-12p6 (secure)
      • Seg faults - needed to ship a version of GFAL in order to run
      • Could this explain the CGSI-gSOAP problem?
    • NL-T1: problematic (secure)
      • Many versions deployed during CCRC to solve a number of issues: 1.8.0-14 -> 1.8.0-15p3 -> 1.8.0-15p4
      • “Failure to put data - empty file” -> “missing space token” problem -> incorrect metadata returned, NEARLINE issue

  18. Stripping
  • Stripping runs on rDST files
    • Input: 1 rDST file & its associated RAW file
    • Space tokens: LHCb_RAW & LHCb_rDST
  • DST files & ETC produced during the process are stored locally on T1D1 (additional storage class)
    • Space token: LHCb_M-DST
  • DST & ETC files are then distributed to all other computing centres on T0D1 (except CERN, T1D1)
    • Space tokens: LHCb_DST (LHCb_M-DST)
  • (A small sketch of this output-to-space-token mapping follows below)
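An illustrative summary, not DIRAC configuration, of the stripping data flow on this slide: which space token and storage class each file type ends up in, at the site that ran the stripping versus the other centres it is replicated to.

```python
# Mapping derived from the slide; the dictionary layout itself is just an
# illustration of the placement rules, not an LHCb data structure.
STRIPPING_PLACEMENT = {
    # inputs, read at the stripping site
    "RAW":  {"space_token": "LHCb_RAW",  "storage_class": "T1D0"},
    "rDST": {"space_token": "LHCb_rDST", "storage_class": "T1D0"},
    # outputs at the site that produced them (master copy)
    "DST/ETC (local master copy)": {"space_token": "LHCb_M-DST", "storage_class": "T1D1"},
    # outputs replicated to the other centres (CERN keeps a T1D1 copy instead)
    "DST/ETC (replicas)": {"space_token": "LHCb_DST", "storage_class": "T0D1"},
}

for item, placement in STRIPPING_PLACEMENT.items():
    print(f"{item:30s} -> {placement['space_token']} ({placement['storage_class']})")
```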

  19. Stripping
  • 31.8k stripping jobs were submitted
  • 9.3k jobs ran to “Done”
  • Major issues with the LHCb bookkeeping

  20. Stripping: T1-T1 transfers
  • Stripping reduction factor too small
  • Stripping limited to 4 Tier-1 centres (plots shown for CNAF, PIC, GridKa and RAL)

  21. Lessons learnt for DIRAC3
  • Improved error reporting in workflow & pilot logs
    • Careful checking of log files was required for detailed analysis
  • Full failover mechanism is in place but not yet deployed
    • Only CERN was used for CCRC08
  • Alternative forms of data access
    • Minor tuning of the timeout for downloading input data was required
    • 2 timeouts are needed: a total copy timeout & an activity timeout (see the sketch below)
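A sketch of the two-timeout idea mentioned above, applied to a generic copy command: a total copy timeout gives up if the whole transfer takes too long, and an activity timeout gives up earlier if the output file stops growing. The command, limits and polling interval are illustrative assumptions, not DIRAC settings.

```python
import os
import subprocess
import time
from typing import List

def download_with_timeouts(
    cmd: List[str],          # some copy command writing to local_path (assumed)
    local_path: str,
    copy_timeout: float = 3600.0,     # max total duration of the copy
    activity_timeout: float = 300.0,  # max time without the file growing
) -> bool:
    """Run the copy, enforcing both a total and an activity (stall) timeout."""
    proc = subprocess.Popen(cmd)
    start = time.time()
    last_size, last_change = -1, time.time()
    while proc.poll() is None:
        now = time.time()
        size = os.path.getsize(local_path) if os.path.exists(local_path) else 0
        if size != last_size:
            last_size, last_change = size, now
        if now - start > copy_timeout or now - last_change > activity_timeout:
            proc.kill()  # stalled or too slow: kill and let the caller retry or fail over
            return False
        time.sleep(10)
    return proc.returncode == 0
```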

  22. Summary
  • Data transfer during CCRC08 using FTS was successful
  • Still plagued by many issues associated with data access
    • Issues have improved since the February CCRC08, but…
    • 2 sites were problematic for large chunks of CCRC08 - 50% of LHCb resources!
    • Problems mainly associated with access through dCache
      • Commencing tests with xrootd
  • DIRAC3 tools improved significantly since February
    • Still need improved reporting of problems
  • LHCb bookkeeping remains a major concern
    • New version due prior to data taking
  • LHCb needs to implement better interrogation of log files

  23. Outlook
  • Continue CCRC-like exercises for testing new releases of DIRAC3
    • One or two 6-hour runs at a time
    • Features under test: full failover for file upload, LFC registration, BK registration, job status reporting (using VOBoxes)
  • Commission DIRAC3 fully for simulation
    • Easier than the processing workflows
  • Adapt Ganga for DIRAC3 submission
    • Delayed due to an accident involving the developer…
  • We would like to test a “generic” pilot agent mode of running, even in the absence of glexec
    • Certify the “time left” utility on all sites (a sketch of the idea follows below)
    • Assess this mode of running (full test of proxy handling)
    • We can limit it to running LHCb applications (no user scripts)
    • No security risk higher than for production jobs
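A sketch of what a “time left” utility has to decide, as mentioned above: before a pilot pulls another payload, compare the remaining time in the batch slot with the expected job duration. How the remaining time is obtained is batch-system specific, so it is passed in here; this is not the DIRAC utility, and the margin is an assumed value.

```python
def can_run_another_payload(
    slot_remaining_seconds: float,   # from the batch system (LSF, PBS, ...), placeholder input
    expected_job_seconds: float,     # e.g. ~12 h for a 25k-event reconstruction
    safety_margin_seconds: float = 1800.0,  # assumed room for output upload and cleanup
) -> bool:
    """True if the payload should comfortably finish inside the batch slot."""
    return slot_remaining_seconds > expected_job_seconds + safety_margin_seconds

# Example: 20 h left in the slot, 12 h job -> OK; 10 h left -> do not start.
print(can_run_another_payload(20 * 3600, 12 * 3600))  # True
print(can_run_another_payload(10 * 3600, 12 * 3600))  # False
```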
