
LHCb on the Grid


Presentation Transcript


1. LHCb on the Grid
Raja Nandakumar (with contributions from Greig Cowan)
GridPP21, 3rd September 2008

2. LHCb computing model
• CERN (Tier-0) is the hub of all activity
  • Full copy at CERN of all raw data and DSTs
  • All T1s have a full copy of the DSTs
• Simulation at all possible sites (CERN, T1, T2)
  • LHCb has used about 120 sites on 5 continents so far
• Reconstruction, stripping and analysis at T0 / T1 sites only
  • Some analysis may be possible at "large" T2 sites in the future
• Almost all the computing (except for development / tests) will be run on the grid
  • Large productions: production team
  • Ganga (DIRAC) grid user interface (see the sketch below)
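As a concrete illustration of the Ganga (DIRAC) user interface mentioned above, here is a minimal sketch of submitting an analysis job to the grid. It assumes a GangaLHCb installation and is meant to be run through the ganga interpreter (where Job, DaVinci and Dirac are predefined); the options file name is a placeholder and attribute names may differ between Ganga versions.

    # Minimal Ganga sketch (run with:  ganga submit_analysis.py).
    # Job, DaVinci and Dirac come from the Ganga GPI namespace;
    # 'myAnalysis.opts' is a hypothetical options file.
    j = Job()
    j.name = 'dst-analysis-test'
    j.application = DaVinci()                  # LHCb analysis application
    j.application.optsfile = 'myAnalysis.opts'
    j.backend = Dirac()                        # submit through the DIRAC WMS
    j.submit()
    print 'Submitted job %d, current status: %s' % (j.id, j.status)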

3. LHCb on the grid
• Small amount of activity over the past year
  • DIRAC3 has been under development
  • Physics groups have not asked for new productions
• The situation has changed recently...

4. LHCb on the grid
• DIRAC3
  • Nearing a stable production release
  • Extensive experience with CCRC08 and follow-up exercises
  • Used as THE production system for LHCb
  • The Ganga developers are now testing its interfaces
• Generic pilot agent framework
  • Critical problems found with the gLite WMS 3.0 and 3.1: mixing of VOMS roles under certain reasonably common conditions (toy example below)
  • Cannot have people with different VOMS roles!
  • Savannah bug #39641, being worked on by the developers
  • Waiting for this to be solved before restarting tests
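The VOMS-role mixing problem is easiest to see with a toy example. The sketch below is purely illustrative (it is not DIRAC code): a generic pilot is only supposed to pick up payloads whose required VOMS role matches the role in the pilot's own proxy, which is exactly the matching that breaks down if the WMS mixes roles.

    # Illustrative only -- not DIRAC code.
    def match_payload(pilot_proxy_role, waiting_payloads):
        """Return the first waiting payload the pilot may run, or None."""
        for payload in waiting_payloads:
            if payload['voms_role'] == pilot_proxy_role:
                return payload
        return None

    queue = [{'id': 101, 'voms_role': '/lhcb/Role=user'},
             {'id': 102, 'voms_role': '/lhcb/Role=production'}]
    # A production pilot must not run the user payload, and vice versa.
    print match_payload('/lhcb/Role=production', queue)   # -> payload 102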

5. DIRAC3 Production
• >90,000 jobs in past 2 months
• Real production activity and testing of gLite WMS

  6. DIRAC3 Job Monitor https://lhcbweb.pic.es/DIRAC/jobs/JobMonitor/display
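Besides the web monitor above, job states can also be queried from the DIRAC3 python client API. A rough sketch, assuming a DIRAC3 client installation and a valid grid proxy; the job IDs are placeholders and the exact return structure may differ between DIRAC releases.

    # Sketch: query job status through the DIRAC3 client API (illustrative).
    from DIRAC.Interfaces.API.Dirac import Dirac

    dirac = Dirac()
    result = dirac.status([1001, 1002, 1003])      # hypothetical job IDs
    if result['OK']:
        for job_id, info in result['Value'].items():
            print job_id, info.get('Status'), info.get('Site')
    else:
        print 'Status query failed:', result['Message']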

7. LHCb storage at RAL
• LHCb storage is primarily at the Tier-1s and CERN
• CASTOR is used as the storage system at RAL
  • Fully moved out of dCache in May 2008
  • One tape damaged; the file on it marked lost
• Was (more or less) stable until 20 Aug 2008
  • Has not been able to take a great load on the servers
  • Low upper limit (8) on LSF job slots on various CASTOR disk servers (see the arithmetic sketch below)
  • Too many jobs (>500) can come into the batch system; the affected service class then hangs
  • Temporarily fixed for now; needs to be monitored (probably by the shifter on duty?)
  • Increase the limit to >100 rfio jobs per server
  • Not all hardware can handle a limit of 200 jobs (they start using swap space)
  • Problem seen many times now over the last few months
• CASTOR is currently in downtime
  • This is worrying given how close we are to data taking
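To make the slot numbers above concrete, here is a back-of-the-envelope sketch of why a per-server cap of 8 concurrent LSF/rfio slots leaves a large backlog when >500 jobs arrive, and why raising it towards 100–200 helps. The server count and job numbers are illustrative, not RAL measurements.

    # Illustrative arithmetic only; numbers are not RAL measurements.
    def queued_jobs(incoming, slots_per_server, n_servers):
        """Jobs left waiting once every rfio/LSF slot is busy."""
        return max(0, incoming - slots_per_server * n_servers)

    for slots in (8, 100, 200):
        print 'slots/server = %3d  -> queued: %d' % (
            slots, queued_jobs(incoming=500, slots_per_server=slots, n_servers=4))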

8. LHCb at RAL
• Move to SRM v2 by LHCb
  • Needed so that RAL can retire the SRM v1 endpoints and hardware
  • Happens when DIRAC3 becomes the baseline for user analysis
  • DIRAC3 is already used for almost all production; Ganga is working on submitting through it
  • Also needs LHCb to rename files in the LFC (sketch below)
  • All space tokens, etc. have been set up
  • Target: turn off SRM v1 access by the end of September
  • User analysis currently uses SRM v1, since DIRAC2 does not support SRM v2
• Batch system:
  • Pausing of jobs during downtime? The status of this is not clear
  • For now, stop the batch system from accepting LHCb jobs a few hours before scheduled downtimes; no LHCb job should run for >24 hours
  • Announce the beginning and end of downtimes
  • Problems with broadcast tools; GGUS ticket opened by Derek Ross
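The LFC renaming mentioned above would in practice be driven by the DIRAC data management tools; the sketch below only illustrates the kind of operation involved, using the standard lfc-rename command-line client. The LFC host name and the old/new catalogue paths are placeholders.

    # Illustrative sketch of an LFC rename; host and paths are placeholders.
    import os
    import subprocess

    os.environ.setdefault('LFC_HOST', 'lfc-lhcb.example.org')   # assumed host

    def rename_in_lfc(old_path, new_path):
        """Rename one catalogue entry using the lfc-rename client."""
        subprocess.check_call(['lfc-rename', old_path, new_path])

    rename_in_lfc('/grid/lhcb/oldlayout/00001234_00000001_5.dst',
                  '/grid/lhcb/data/newlayout/00001234_00000001_5.dst')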

9. LHCb and CCRC08
• Planned tasks: test the LHCb computing model
• Raw data distribution from the pit to the T0 centre
  • Use of rfcp into CASTOR from the pit (T1D0)
• Raw data distribution from T0 to the T1 centres
  • Use of FTS (T1D0) – see the sketch below
• Reconstruction of raw data at CERN & the T1 centres
  • Production of rDST data (T1D0)
  • Use of SRM 2.2
• Stripping of data at CERN & the T1 centres
  • Input data: RAW & rDST (T1D0)
  • Output data: DST (T1D1)
  • Use of SRM 2.2
• Distribution of DST data to all other centres
  • Use of FTS
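As an illustration of the T0 to T1 replication step listed above, the sketch below submits a single file transfer with the gLite FTS command-line client. In reality these transfers are driven by the DIRAC data management system; the FTS endpoint and the source/destination SURLs here are placeholders.

    # Illustrative single-file FTS submission; endpoint and SURLs are placeholders.
    import subprocess

    fts = 'https://fts.example-t1.org:8443/glite-data-transfer-fts/services/FileTransfer'
    src = 'srm://srm.example-t0.org/castor/example.org/lhcb/data/2008/RAW/run1234_0001.raw'
    dst = 'srm://srm.example-t1.org/castor/example.org/lhcb/data/2008/RAW/run1234_0001.raw'

    job_id = subprocess.Popen(['glite-transfer-submit', '-s', fts, src, dst],
                              stdout=subprocess.PIPE).communicate()[0]
    print 'FTS job id:', job_id.strip()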

10. LHCb and CCRC08
(Plots: Reconstruction and Stripping)

11. LHCb CCRC08 Problems
• CCRC08 highlighted areas to be improved
• File access problems
  • Random or permanent failures to open files using gsidcap
  • Requested IN2P3 and NL-T1 to allow the dcap protocol for local read access
  • Now using xroot at IN2P3 – appears to be successful
  • Wrong file status returned by the dCache SRM after a put; bringOnline was not doing anything
• Software area access problems
  • Site banned for a while until the problem is fixed
• Application crashes
  • Fixed with a new software release and deployment
• Major issues with the LHCb bookkeeping, especially for stripping
• Lessons learned
  • Better error reporting in pilot logs and workflow
  • Alternative forms of data access needed in emergencies
  • Downloading of files to the WN (used at IN2P3, RAL) – fallback sketched below
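The "alternative forms of data access" lesson can be summarised as a fallback chain: try the site's direct-access protocols first and, if that fails, copy the file to the worker node and read it locally (the approach used at IN2P3 and RAL). The sketch below is illustrative; the opener function and the copy command stand in for whatever the job's application and DIRAC actually use.

    # Illustrative fallback chain for opening an input file on a worker node.
    import subprocess

    def try_open(turl):
        """Placeholder for the real opener (e.g. ROOT's TFile::Open)."""
        raise IOError('direct access unavailable in this sketch')

    def open_input(lfn, turls_by_protocol):
        # 1. Try direct-access protocols in order of preference.
        for protocol in ('dcap', 'xroot', 'gsidcap'):
            turl = turls_by_protocol.get(protocol)
            if turl:
                try:
                    return try_open(turl)
                except IOError:
                    print 'direct access via %s failed for %s' % (protocol, lfn)
        # 2. Last resort: download the file to the worker node (lcg-cp client).
        local = lfn.split('/')[-1]
        subprocess.check_call(['lcg-cp', '--vo', 'lhcb', 'lfn:' + lfn, 'file:' + local])
        return open(local, 'rb')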

12. LHCb Grid Operations
• A Grid Operations and Production team has been created

13. Communications
• LHCb sites
  • The grid operations team keeps track of problems
  • Reports to sites via GGUS and the eLogger
  • All posts are reported on lhcb-production@cern.ch – please subscribe if you want to know what is going on
• LHCb users
  • Mailing lists: lhcb-distributed-analysis@cern.ch (all problems directed here), plus specific lists for each LHCb application and Ganga
  • Ticketing systems (Savannah, GGUS) for DIRAC, Ganga and the applications – used by developers and "power" users
  • Software weeks provide training sessions for using the Grid tools
• Weekly distributed analysis meetings (starts Friday)
  • DIRAC, Ganga and core software developers along with some users
  • Aims to identify needs and coordinate release plans
• Operations eLogbook: http://lblogbook.cern.ch/Operations (RSS feed available)

14. Summary
• Concerned about CASTOR stability this close to data taking
• The DIRAC3 workload and data management system is now online
  • It has been extensively tested when running LHCb productions
  • Now being moved into the user analysis system; Ganga needs some additional development
• The grid operations team is working with sites, users and developers to identify and resolve problems quickly and efficiently
• LHCb is looking forward to the imminent switch-on of the LHC!

  15. Backup - CCRC08 Throughput
