
LHCb Status report June 08


Presentation Transcript


  1. LHCb Status report June 08

  2. Activities since February
  • Applications and Core Software
    • Preparation of applications for real data
    • Simulation with real geometry (from survey)
    • Certification of GEANT4 9.1
    • Alignment and calibration procedures in place
  • Production activities (aka DC06)
    • New simulations on demand
    • Continue stripping and re-processing
    • Dominated by data access problems (and how to deal with them)
  • Core Computing
    • Building on CCRC08 phase 1
    • Improved WMS and DMS
    • Introduced error recovery, failover mechanisms, etc.
    • Commission DIRAC3 for simulation and analysis
    • Only DM and reconstruction were exercised in February

  3. Sites configuration
  • Databases
    • ConditionsDB and LFC replicated (via 3D) to all Tier1s
    • Read-only LFC mirror available at all Tier1s
  • Site SE migrations
    • RAL: migration from dCache to Castor2
      • Very painful exercise (bad pool configurations)
      • Took over 8 months to complete, but… it’s over!
    • PIC: migration from Castor1 to dCache
      • Fully operational; Castor decommissioned since March
    • CNAF: migration of T0D1 and T1D1 to StoRM
      • Went very smoothly for migrating existing files from Castor2
  • SRM v2 spaces
    • All spaces needed for CCRC were in place
    • SRM v2 still not used for DC06 production (see later)

  4. DC06 production issues
  • Still using DIRAC2, i.e. SRM v1
    • No plan to backport SRM v2 to DIRAC2
    • We could update the LFC entries to srm-lhcb.cern.ch (v2) for reading, but…
      • Still need srm-durable-lhcb.cern.ch for T0D1 upload
      • Need an srm-get-metadata that works for SRM v2
    • When the SRM v2 and v1 endpoints are the same, no problem
    • DIRAC2 checked against StoRM: no problem
  • File access problems
    • As for CCRC, these dominate the re-processing problems
    • Castor sites: OK; would like to have rootd at RAL (problems known for 4 years with the rfio plugin in ROOT, alleviated with rootd)
    • Many (>7,000) files lost at CERN-Castor (mostly from autumn 2006), to be marked as such in the LFC (some irrecoverable)
    • dCache sites: no problem using the dcap protocol (PIC, GridKa); many problems with gsidcap (IN2P3, NL-T1)
    • How to deal with erratic errors (files randomly inaccessible)?

  5. CCRC’08: Summary of last week’s report (courtesy of Nick Brook)

  6. Planned tasks
  • May activities
    • Maintain the equivalent of 1 month of data taking
      • Assuming a 50% machine cycle efficiency
    • Run fake analysis activity in parallel to production-type activities
      • Analysis-type jobs were used for debugging throughout the period
      • GANGA testing ran at a low level during the last weeks

  7. Activities across the sites
  • Planned breakdown of processing activities (CPU needs) prior to CCRC08

  8. Pit -> Tier 0
  • Use of rfcp to copy data from the pit to CASTOR (a minimal sketch of this copy loop follows below)
    • rfcp is the approach recommended by IT
  • A file is sent every ~30 sec
  • Data remains on online disk until CASTOR migration
  • Rate to CASTOR: ~70 MB/s
  • In general ran smoothly, with two exceptions:
    • Stability problems with the online storage area, solved with a firmware update during CCRC
    • Internal issues with sending bookkeeping info
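The following is an illustrative sketch, not the LHCb online code, of what a pit -> Tier 0 copy loop of this kind looks like: watch a local staging area for finished RAW files and push each one to CASTOR with rfcp, at roughly one file every ~30 seconds. The directory paths and the pacing are assumptions made for the example; only the use of rfcp is taken from the slide.

```python
import glob
import os
import subprocess
import time

ONLINE_DIR = "/onlinedisk/lhcb/raw"            # assumed local staging area
CASTOR_DIR = "/castor/cern.ch/grid/lhcb/raw"   # assumed CASTOR destination

def copy_to_castor(local_path: str) -> bool:
    """Copy one RAW file to CASTOR with rfcp; return True on success."""
    dest = os.path.join(CASTOR_DIR, os.path.basename(local_path))
    result = subprocess.run(["rfcp", local_path, dest])
    return result.returncode == 0

def main() -> None:
    while True:
        for path in sorted(glob.glob(os.path.join(ONLINE_DIR, "*.raw"))):
            if copy_to_castor(path):
                # Per the slide, the file stays on online disk until CASTOR
                # tape migration is confirmed; that check is not shown here.
                print(f"copied {path}")
        time.sleep(30)  # roughly one file every ~30 seconds

if __name__ == "__main__":
    main()
```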

  9. Tier 0 -> Tier 1
  • FTS transfers from CERN to the Tier-1 centres (a sketch of the gating logic follows below)
  • Transfer of RAW only occurs once the data has migrated to tape and the checksum is verified
  • Rate out of CERN: ~35 MB/s averaged over the period
    • Peak rate far in excess of the requirement
  • In smooth running, sites matched the LHCb requirements
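A minimal sketch of the gating condition described on this slide: a RAW file is only handed to FTS for Tier-1 replication once it has a tape copy and its checksum matches the value recorded at the pit. The two check functions are passed in as callables because their real implementations (CASTOR/nameserver queries) are site-specific and not reproduced here; none of this is the actual DIRAC/FTS code.

```python
import time
from typing import Callable

def ready_for_export(
    lfn: str,
    has_tape_copy: Callable[[str], bool],    # e.g. a CASTOR query (placeholder)
    checksums_match: Callable[[str], bool],  # pit checksum vs CASTOR checksum (placeholder)
    poll_seconds: int = 60,
    max_polls: int = 120,
) -> bool:
    """Poll until the file has a tape copy, then require a checksum match."""
    for _ in range(max_polls):
        if has_tape_copy(lfn):
            return checksums_match(lfn)
        time.sleep(poll_seconds)
    return False

# Only files for which ready_for_export(...) returns True would be put into
# an FTS transfer job towards the Tier-1 sites.
```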

  10. Tier 0 -> Tier 1
  • To first order, all transfers eventually succeeded
  • Plot shows the efficiency on the 1st attempt; annotated incidents include:
    • Issue with UK certificates
    • Restart of the IN2P3 SRM endpoint
    • CERN outage
    • CERN SRM endpoint problems

  11. Reconstruction
  • Used SRM 2.2 SEs
  • LHCb space tokens are:
    • LHCb_RAW (T1D0)
    • LHCb_RDST (T1D0)
  • Data shares need to be preserved
    • Important for resource planning
  • Input: 1 RAW file; output: 1 rDST file (1.6 GB)
  • Reduced the number of events per reconstruction job from 50k to 25k (job duration ~12 hours on a 2.8 kSI2k machine), in order to fit within the available queues (a back-of-the-envelope check follows below)
  • Need queues at all sites that match our processing time
    • Alternative: reduce the file size!
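A back-of-the-envelope check of the numbers quoted on this slide, with Python used purely as a calculator. The slide gives ~12 h for 25k events on a 2.8 kSI2k machine; the implied per-event cost and the notion of a 24-hour batch queue limit below are derived or assumed for illustration, not quoted from the report.

```python
REF_EVENTS = 25_000
REF_WALL_HOURS = 12.0
REF_POWER_KSI2K = 2.8

# Implied per-event CPU cost, normalised to machine power: ~4.8 kSI2k.s/event.
ksi2k_seconds_per_event = REF_WALL_HOURS * 3600 * REF_POWER_KSI2K / REF_EVENTS
print(f"~{ksi2k_seconds_per_event:.1f} kSI2k.s per event")

def wall_hours(events: int, power_ksi2k: float) -> float:
    """Estimated wall-clock duration of one reconstruction job."""
    return events * ksi2k_seconds_per_event / power_ksi2k / 3600

# A 50k-event job on the same machine would run ~24 h, which is hard to fit
# into typical batch queues; halving to 25k events gives the quoted ~12 h.
for n_events in (50_000, 25_000):
    print(n_events, "events ->", round(wall_hours(n_events, 2.8), 1), "hours")
```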

  12. Reconstruction
  • After the data transfer the file should be online, as the job is submitted immediately, but…
  • LHCb pre-stages files and then checks the file status before submitting the pilot job, using gfal_ls (a sketch of this pattern follows below)
    • Pre-staging should ensure access availability from cache
  • Only issue was at NL-T1, with the reporting of the file status
    • Discussed last week during the Storage session (dCache version)
    • (A problem also developed at IN2P3 right at the end of CCRC08, on 31st May)
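A minimal sketch of the pre-stage / status-check pattern described above, assuming the status query shells out to gfal_ls as the slide mentions. The exact gfal_ls invocation and output interpretation depended on the middleware version in use, so treat surl_is_accessible() as an illustration of the pattern rather than the DIRAC implementation.

```python
import subprocess
import time

def surl_is_accessible(surl: str) -> bool:
    """Very rough check: a successful gfal_ls is taken as 'file reachable'.
    The real check also inspected the file locality (ONLINE vs NEARLINE),
    which is not reproduced here."""
    result = subprocess.run(
        ["gfal_ls", surl],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def wait_until_staged(surl: str, poll_seconds: int = 120, max_polls: int = 30) -> bool:
    """Poll after the pre-stage request until the file looks accessible,
    and only then allow the pilot job to be submitted."""
    for _ in range(max_polls):
        if surl_is_accessible(surl):
            return True
        time.sleep(poll_seconds)
    return False
```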

  13. Reconstruction
  • 41.2k reconstruction jobs submitted
  • 27.6k jobs proceeded to the Done state
  • Done/created ~67%

  14. Reconstruction
  • 27.6k reconstruction jobs in the Done state
  • 21.2k jobs processed 25k events
    • Done/25k events ~77% (the ratios on this and the previous slide are cross-checked below)
  • 3.0k jobs failed to upload the rDST to the local SE (only 1 attempt before trying failover)
    • Failover/25k events ~13%
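A quick cross-check of the two clean ratios quoted on slides 13 and 14 (Python as a calculator; the job counts are taken from the slides).

```python
submitted = 41_200       # reconstruction jobs submitted
done = 27_600            # jobs that reached the Done state
processed_25k = 21_200   # Done jobs that processed the full 25k events

print(f"Done/created      : {done / submitted:.0%}")      # ~67%
print(f"Processed 25k/Done: {processed_25k / done:.0%}")  # ~77%
```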

  15. Summary of reconstruction issues
  • File access problems
    • Random or permanent failures to open files using gsidcap
      • No problem observed with dcap
      • Request that IN2P3 and NL-T1 allow the dcap protocol for local read access
    • (Temporary?) solution: download the file to the WN (a sketch of this fallback follows below)
      • Had to get it back from CERN in some cases
    • Wrong file status returned by the dCache SRM after a put
      • Discussed and understood last week
      • Problem was that bringOnline was not doing anything
    • Questionable, however, whether all reads should happen in a shared cache pool with a disk-to-disk copy (even for TxD1 files)
  • Software area access problems
    • Site banned for a while until the problem is fixed
    • Creates inefficiency and unavailability, but is “easy” to recover from
  • Application crashes
    • Fixed within a day or so (new software release and deployment)
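A sketch of the fallback mentioned above: try to open the input file via the local access protocol first and, if that fails, download it to the worker node before starting the application. The open_via_protocol() and download_to_wn() helpers are hypothetical placeholders; the real logic lives inside DIRAC and is not reproduced here.

```python
from typing import Callable, Optional

def resolve_input(
    surl: str,
    open_via_protocol: Callable[[str], Optional[str]],  # e.g. build a dcap/gsidcap URL and test it (placeholder)
    download_to_wn: Callable[[str], Optional[str]],     # e.g. copy the file to local scratch (placeholder)
) -> Optional[str]:
    """Return a path/URL the application can read, or None if both routes fail."""
    access_url = open_via_protocol(surl)
    if access_url is not None:
        return access_url
    # Protocol access failed (e.g. gsidcap errors): fall back to a local copy.
    return download_to_wn(surl)
```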

  16. Reconstruction
  • CPU efficiency based on the ratio of CPU time to wall-clock time for running jobs (see the small example below)
  • Low efficiency at CNAF due to:
    • software area access
    • more jobs than cores on a WN…
  • Low efficiency at RAL & IN2P3 due to input data download
    • Resolved by tuning the timeout
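A small illustration of the efficiency figure used on this slide, with invented example times: CPU efficiency as CPU time divided by wall-clock time.

```python
def cpu_efficiency(cpu_seconds: float, wall_seconds: float) -> float:
    """Fraction of the wall-clock time actually spent on the CPU."""
    return cpu_seconds / wall_seconds

# A job that gets 9 h of CPU in a 12 h wall-clock slot (e.g. because the input
# download and software-area access ate the rest) is 75% efficient.
print(f"{cpu_efficiency(9 * 3600, 12 * 3600):.0%}")  # 75%
```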

  17. dCache Observations
  • Official LCG recommendation: 1.8.0-15p3
  • LHCb ran smoothly at half of the Tier-1 dCache sites
    • PIC: OK - version 1.8.0-12p6 (unsecure)
    • GridKa: OK - version 1.8.0-15p2 (unsecure)
    • IN2P3: problematic - version 1.8.0-12p6 (secure)
      • Seg faults - needed to ship a version of GFAL in order to run
      • Could this explain the CGSI-gSOAP problem?
    • NL-T1: problematic (secure)
      • Many versions deployed during CCRC to solve a number of issues: 1.8.0-14 -> 1.8.0-15p3 -> 1.8.0-15p4
      • “Failure to put data - empty file” -> “missing space token” problem -> incorrect metadata returned, NEARLINE issue

  18. Stripping
  • Stripping runs on rDST files
    • Input: 1 rDST file & its associated RAW file
    • Space tokens: LHCb_RAW & LHCb_rDST
  • DST files & ETC produced during the process are stored locally on T1D1 (additional storage class)
    • Space token: LHCb_M-DST
  • DST & ETC files are then distributed to all other computing centres on T0D1 (except CERN, T1D1)
    • Space tokens: LHCb_DST (LHCb_M-DST)
  • (A small sketch of this output-to-space-token mapping follows below)
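An illustrative summary, not DIRAC configuration, of the stripping data flow on this slide: which space token and storage class each file type ends up in, at the site that ran the stripping versus the other centres it is replicated to.

```python
# Mapping derived from the slide; the dictionary layout itself is just an
# illustration of the placement rules, not an LHCb data structure.
STRIPPING_PLACEMENT = {
    # inputs, read at the stripping site
    "RAW":  {"space_token": "LHCb_RAW",  "storage_class": "T1D0"},
    "rDST": {"space_token": "LHCb_rDST", "storage_class": "T1D0"},
    # outputs at the site that produced them (master copy)
    "DST/ETC (local master copy)": {"space_token": "LHCb_M-DST", "storage_class": "T1D1"},
    # outputs replicated to the other centres (CERN keeps a T1D1 copy instead)
    "DST/ETC (replicas)": {"space_token": "LHCb_DST", "storage_class": "T0D1"},
}

for item, placement in STRIPPING_PLACEMENT.items():
    print(f"{item:30s} -> {placement['space_token']} ({placement['storage_class']})")
```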

  19. Stripping
  • 31.8k stripping jobs were submitted
  • 9.3k jobs ran to “Done”
  • Major issues with the LHCb bookkeeping

  20. Stripping: T1-T1 transfers
  • Stripping reduction factor too small
  • Stripping limited to 4 Tier-1 centres (plots shown for CNAF, PIC, GridKa and RAL)

  21. Lessons learnt for DIRAC3
  • Improved error reporting in workflow & pilot logs
    • Careful checking of log files was required for detailed analysis
  • Full failover mechanism is in place but not yet deployed
    • Only CERN was used for CCRC08
  • Alternative forms of data access
    • Minor tuning of the timeout for downloading input data was required
    • 2 timeouts are needed: a total copy timeout & an activity timeout (see the sketch below)
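A sketch of the two-timeout idea mentioned above, applied to a generic copy command: a total copy timeout gives up if the whole transfer takes too long, and an activity timeout gives up earlier if the output file stops growing. The command, limits and polling interval are illustrative assumptions, not DIRAC settings.

```python
import os
import subprocess
import time
from typing import List

def download_with_timeouts(
    cmd: List[str],          # some copy command writing to local_path (assumed)
    local_path: str,
    copy_timeout: float = 3600.0,     # max total duration of the copy
    activity_timeout: float = 300.0,  # max time without the file growing
) -> bool:
    """Run the copy, enforcing both a total and an activity (stall) timeout."""
    proc = subprocess.Popen(cmd)
    start = time.time()
    last_size, last_change = -1, time.time()
    while proc.poll() is None:
        now = time.time()
        size = os.path.getsize(local_path) if os.path.exists(local_path) else 0
        if size != last_size:
            last_size, last_change = size, now
        if now - start > copy_timeout or now - last_change > activity_timeout:
            proc.kill()  # stalled or too slow: kill and let the caller retry or fail over
            return False
        time.sleep(10)
    return proc.returncode == 0
```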

  22. Summary
  • Data transfer during CCRC08 using FTS was successful
  • Still plagued by many issues associated with data access
    • Issues have improved since the February CCRC08, but…
    • 2 sites were problematic for large chunks of CCRC08 - 50% of LHCb resources!
    • Problems mainly associated with access through dCache
      • Commencing tests with xrootd
  • DIRAC3 tools improved significantly since February
    • Still need improved reporting of problems
  • LHCb bookkeeping remains a major concern
    • New version due prior to data taking
  • LHCb needs to implement better interrogation of log files

  23. Outlook
  • Continue CCRC-like exercises for testing new releases of DIRAC3
    • One or two 6-hour runs at a time
    • Features under test: full failover for file upload, LFC registration, BK registration, job status reporting (using VOBoxes)
  • Commission DIRAC3 fully for simulation
    • Easier than the processing workflows
  • Adapt Ganga for DIRAC3 submission
    • Delayed due to an accident involving the developer…
  • We would like to test a “generic” pilot agent mode of running, even in the absence of glexec
    • Certify the “time left” utility on all sites (a sketch of the idea follows below)
    • Assess this mode of running (full test of proxy handling)
    • We can limit it to running LHCb applications (no user scripts)
    • No security risk higher than for production jobs
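A sketch of what a “time left” utility has to decide, as mentioned above: before a pilot pulls another payload, compare the remaining time in the batch slot with the expected job duration. How the remaining time is obtained is batch-system specific, so it is passed in here; this is not the DIRAC utility, and the margin is an assumed value.

```python
def can_run_another_payload(
    slot_remaining_seconds: float,   # from the batch system (LSF, PBS, ...), placeholder input
    expected_job_seconds: float,     # e.g. ~12 h for a 25k-event reconstruction
    safety_margin_seconds: float = 1800.0,  # assumed room for output upload and cleanup
) -> bool:
    """True if the payload should comfortably finish inside the batch slot."""
    return slot_remaining_seconds > expected_job_seconds + safety_margin_seconds

# Example: 20 h left in the slot, 12 h job -> OK; 10 h left -> do not start.
print(can_run_another_payload(20 * 3600, 12 * 3600))  # True
print(can_run_another_payload(10 * 3600, 12 * 3600))  # False
```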
