
Results of the LHCb experiment Data Challenge 2004

Joël Closier, CERN / LHCb. CHEP’04.


Presentation Transcript


  1. Results of the LHCb experiment Data Challenge 2004 • Joël Closier, CERN / LHCb • CHEP’04

  2. The LHCb DC04 team • DIRAC: Andrei Tsaregorodtsev, Vincent Garonne, Ian Stokes-Rees • Production management: Joël Closier, Ricardo Graciani (LCG), Johan Blouw, Andrew Pickford … and the LHCb site managers • LHCb bookkeeping, monitoring and accounting: Markus Frank, Carmine Cioffi, Manuel Sanchez, Ruben Vizcaya • LCG-LHCb liaison: Flavia Donno, Roberto Santinelli • The LCG-GDA team: Ian Bird, Laurence Field, Maarten Litmaath, Markus Schulz, David Smith, Zdenek Sekera, Marco Serra… Result of LHCb DC04

  3. Outline • Aims of the LHCb Data Challenge 2004 • Production model • Performance of DC’04 • Lessons from DC’04 • Conclusions

  4. LHCb DC’04 aims • Main goal: gather information to be used for writing the LHCb computing Technical Design Report • Robustness test of the LHCb software and production system, using software as realistic as possible in terms of performance • Test of the LHCb distributed computing model, including distributed analyses (a realistic test of the analysis environment needs realistic analyses) • Incorporation of the LCG application area software into the LHCb production environment • Use of LCG resources (at least 50% of the production capacity) • 3 phases: Production (MC simulation and reconstruction); Stripping (event pre-selection); Analysis

  5. LHCb DC04 aims (cont’d) • Physics goals: HLT studies, consolidating efficiencies; background/signal studies, consolidating background estimates and background properties • Requires a quantitative increase in the number of signal and background events compared to DC03: 30 × 10^6 signal events; 15 × 10^6 specific background; 125 × 10^6 background (B inclusive + minimum bias, ratio 1:1.8)
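The event targets on slide 5 can be added up to see what the production was aiming for overall. The short sketch below only does that arithmetic; it assumes the quoted 1:1.8 ratio is ordered as B inclusive : minimum bias, which the slide does not state explicitly.

```python
# Event targets quoted on slide 5, in millions of events.
targets_m = {
    "signal": 30,
    "specific_background": 15,
    "inclusive_background": 125,  # B inclusive + minimum bias
}

total_m = sum(targets_m.values())

# Assumption: the 1:1.8 ratio reads as B inclusive : minimum bias.
ratio_b_inclusive, ratio_min_bias = 1.0, 1.8
unit = targets_m["inclusive_background"] / (ratio_b_inclusive + ratio_min_bias)
b_inclusive_m = unit * ratio_b_inclusive
min_bias_m = unit * ratio_min_bias

print(f"total target: {total_m} M events")
print(f"B inclusive: {b_inclusive_m:.1f} M, minimum bias: {min_bias_m:.1f} M")
```

Under that reading the targets sum to 170 M events, with roughly 44.6 M B-inclusive and 80.4 M minimum-bias events.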

  6. Production • Production done with the DIRAC system (Track 4 - Distributed Computing Services, id 377) • DIRAC is deployed at each site participating in DC’04 • Central services supporting the Data Challenge: production database; Workload Management System; monitoring, accounting; bookkeeping, AliEn File Catalog • Technologies used by the production services: C++, Python, XML-RPC; Oracle and MySQL databases

  7. Non-LCG site: DIRAC deployment (CE) • DIRAC JobAgent: check CE status; request a DIRAC task (JDL); install LHCb software if needed; submit the job to the local batch system • Execute task: check steps; upload results • DIRAC TransferAgent • LCG site: Input SandBox is a small bash script (~50 lines) • Check environment: site, hostname, CPU, memory, disk space… • Install DIRAC: download DIRAC tarball (~1 MB); deploy DIRAC on WN • Execute the job: request a DIRAC task (LHCb simulation job) • Execute task: check steps; upload results • Retrieval of SandBox • Analysis of retrieved Output SandBox • LHCb job
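The agent cycle on slide 7 (check the CE, request a task, install software if needed, execute the steps, upload results) can be sketched as a pull loop. This is an illustrative sketch only, not DIRAC code: `Task`, `TaskQueue` and `run_agent_once` are hypothetical names, and the central services are replaced by an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    """A unit of work handed out by the central queue (hypothetical shape)."""
    task_id: int
    steps: List[str]


@dataclass
class TaskQueue:
    """In-memory stand-in for the central Workload Management System."""
    pending: List[Task] = field(default_factory=list)

    def request_task(self) -> Optional[Task]:
        # The agent pulls work; the queue never pushes it to a site.
        return self.pending.pop(0) if self.pending else None


def run_agent_once(
    queue: TaskQueue,
    ce_available: Callable[[], bool],
    software_installed: Callable[[], bool],
    log: List[str],
) -> Optional[int]:
    """Run one agent cycle; return the executed task id, or None."""
    if not ce_available():
        log.append("CE not available: no task requested")
        return None
    task = queue.request_task()
    if task is None:
        log.append("no task waiting")
        return None
    if not software_installed():
        log.append("install LHCb software")
    for step in task.steps:
        log.append(f"execute step: {step}")
    log.append(f"upload results of task {task.task_id}")
    return task.task_id


queue = TaskQueue([Task(1, ["simulation", "reconstruction"])])
log: List[str] = []
run_agent_once(queue, lambda: True, lambda: False, log)
print("\n".join(log))
```

Because the agent asks for work only after it has checked its own environment, the same cycle fits both columns of the slide: a resident agent on a DIRAC site, or an agent delivered to an LCG worker node by the input sandbox script.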

  8. Strategy • Test sites: each site is tested with special and production-like jobs • Enable site: DIRAC Workload Management System • Always keep jobs in the queues • DIRAC: run the local agent continuously (via cron jobs, via runsv, or via daemon) • LCG: submit jobs continuously (via cron job on the User Interface) • PS: LCG is considered as a site from the DIRAC point of view

  9. Data Storage • All the output of the reconstruction phase (DSTs) is sent to CERN (as Tier-0) • Intermediate files are not kept • DSTs are also stored at one of our 5 Tier-1 centres: CNAF (Italy), Karlsruhe (Germany), Lyon (France), PIC (Spain), RAL (United Kingdom)

  10. DC’04 performance

  11. Phase 1 results • 186 M events produced; Phase 1 completed • 1.8 × 10^6/day with DIRAC alone • 3-5 × 10^6/day with LCG in action • Plot annotations: LCG paused, LCG restarted

  12. Daily performance • 5 million/day

  13. Sites involved • 20 DIRAC sites • 43 LCG sites (8 also DIRAC sites) • Used resources from non-LHCb countries, e.g. Hungary produced ~2 M events

  14. Simultaneous jobs (a snapshot)

  15. TIER storage

  16. DIRAC-LCG: events share • 50% of events were produced using LCG

  17. DIRAC – LCG: CPU share • 376 CPU·years • May: 88%:12%, 11% of DC’04 • Jun: 78%:22%, 25% of DC’04 • Jul: 75%:25%, 22% of DC’04 • Aug: 26%:74%, 42% of DC’04
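The monthly figures on slide 17 can be combined into an overall CPU share by weighting each month's split by its fraction of the total. The sketch below assumes each pair is ordered DIRAC:LCG, following the slide title, and that the "% of DC’04" column is each month's share of the total CPU time; neither is spelled out on the slide.

```python
# (month, DIRAC %, LCG %, % of total DC'04 CPU), as read from slide 17.
# Assumption: each split is ordered DIRAC:LCG, matching the slide title.
monthly = [
    ("May", 88, 12, 11),
    ("Jun", 78, 22, 25),
    ("Jul", 75, 25, 22),
    ("Aug", 26, 74, 42),
]

total_weight = sum(weight for _, _, _, weight in monthly)
dirac_share = sum(dirac * weight for _, dirac, _, weight in monthly) / total_weight
lcg_share = sum(lcg * weight for _, _, lcg, weight in monthly) / total_weight

total_cpu_years = 376  # quoted on the same slide
print(f"monthly weights sum to {total_weight}%")
print(f"DIRAC: {dirac_share:.1f}%  LCG: {lcg_share:.1f}%")
print(f"LCG CPU time: about {total_cpu_years * lcg_share / 100:.0f} CPU years")
```

With these assumptions the weights sum to 100%, and LCG accounts for about 43% of the CPU time (roughly 163 of the 376 CPU years), somewhat below its 50% share of produced events on slide 16.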

  18. LCG performance • 211k jobs submitted to LCG • After running: 113k done (successful), 34k aborted • LCG efficiency: 61%
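The transcript of slide 18 does not say which job count the 61% efficiency is measured against. The sketch below only checks the arithmetic with the numbers that survive in the transcript; the "implied denominator" is a back-calculation, not a figure from the slide.

```python
# Job counts as quoted on slide 18.
submitted = 211_000
done = 113_000
aborted = 34_000
quoted_efficiency = 0.61

# Two candidate definitions using only the quoted counts.
done_over_submitted = done / submitted
done_over_finished = done / (done + aborted)

# Back-calculated denominator that would reproduce the quoted 61%.
implied_denominator = done / quoted_efficiency

print(f"done / submitted        = {done_over_submitted:.1%}")
print(f"done / (done + aborted) = {done_over_finished:.1%}")
print(f"denominator implied by 61%: about {implied_denominator / 1000:.0f}k jobs")
```

Neither candidate reproduces 61%: the quoted figure corresponds to a denominator of about 185k jobs, so some category of jobs between "submitted" and "after running" appears to be missing from the transcript.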

  19. DC’04 lessons

  20. Lessons learnt: DIRAC • The concept of light, customizable and simple-to-deploy agents proved to be very effective • Easy update procedure: bug fixes to DIRAC tools propagate quickly • Application software installation triggered by a running job • Most of the central services were running on the same machine: too many processes, high loads • Improve server availability • Improve error handling and reporting

  21. Lessons learnt: LCG • Improve the Output SandBox upload/retrieval mechanism: it should also be available for failed and aborted jobs • Improve the reliability of CE status collection methods (timestamps?) • Add intelligence on the CE or RB to detect and avoid large numbers of aborted jobs on start-up, so that a misconfigured site does not become a black hole • Need to collect LCG log info, and a tool to navigate it (including different JobIDs) • Need a way to limit the CPU (and wall-clock) time: the LCG wrapper must issue appropriate signals to the user job to allow graceful termination • How-to manuals: clear instructions to site managers on the procedure to shut down a site (for maintenance and/or upgrade) • Problems with site configurations (LCG config, firewalls, gridFTP servers…)

  22. Conclusions • LHCb DC’04 Phase 1 is over • The production target was achieved: 186 M events in 424 CPU years; ~50% on LCG resources (75-80% in the last weeks) • LHCb strategy successful: submitting “empty” DIRAC agents to LCG has proven to be very flexible, allowing a success rate above that of LCG alone • Big room for improvement, both in DIRAC and in LCG • DIRAC needs to improve the reliability of its servers: a big step was already made during the DC • LCG needs to improve single-job efficiency: ~40% aborted jobs, ~10% did the work but failed from the LCG viewpoint • In both cases extra protections against external failures (network, unexpected shutdowns…) must be built in • Success due to dedicated support from the LCG team and DIRAC site managers
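The headline figures in the conclusions give a rough average CPU cost per event. The sketch below is plain arithmetic on the quoted totals, assuming a 365-day year; it also repeats the calculation with the 376 CPU years quoted on slide 17, since the transcript does not explain the difference between the two totals.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

events = 186e6  # produced events, as quoted in the conclusions


def cpu_seconds_per_event(cpu_years: float) -> float:
    """Average CPU seconds spent per produced event."""
    return cpu_years * SECONDS_PER_YEAR / events


per_event_424 = cpu_seconds_per_event(424)  # total from slide 22
per_event_376 = cpu_seconds_per_event(376)  # total from slide 17

print(f"424 CPU years: {per_event_424:.0f} s per event")
print(f"376 CPU years: {per_event_376:.0f} s per event")
```

Either total puts the average at roughly a minute of CPU time per event (about 72 s and 64 s respectively), without normalising for the different CPU types involved.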

  23. Other links • CHEP04 talks: • File-Metadata Management System for the LHCb Experiment (Track 4 - Distributed Computing Services), id 392, 27-Sep-2004 17:30 • DIRAC Workload Management System (Track 5 - Distributed Computing Systems and Experiences), id 365, 29-Sep-2004 10:00 • Grid Information and Monitoring System using XML-RPC and Instant Messaging for DIRAC (Track 4 - Distributed Computing Services), id 368, 29-Sep-2004 10:00 • DIRAC - The Distributed MC Production and Analysis for LHCb (Track 4 - Distributed Computing Services), id 377, 30-Sep-2004 18:10 • A Lightweight Monitoring and Accounting System for LHCb DC04 Production (Track 4 - Distributed Computing Services), id 388, 30-Sep-2004 17:30
