
LHCb Computing

Learn about the computing principles and infrastructure of the LHCb experiment, which is dedicated to studying CP violation and the matter-antimatter difference using the b quark. Explore the LHCb software, data processing and workload management, the resource requirements for 2008, and the DIRAC grid management software.


Presentation Transcript


  1. LHCb Computing. A. Tsaregorodtsev, CPPM, Marseille. 14 March 2007, Clermont-Ferrand

  2. LHCb in brief
  • Experiment dedicated to studying CP violation, which is responsible for the dominance of matter over antimatter
  • The matter-antimatter difference is studied using the b quark (beauty)
  • High-precision physics (the difference is tiny…)
  • Single-arm spectrometer; looks like a fixed-target experiment
  • Smallest of the 4 big LHC experiments, with ~500 physicists
  • Nevertheless, computing is also a challenge…

  3. LHCb basic computing principles
  • Raw data are shipped in real time to Tier-0
  • They are registered in the Grid (File Catalog), with the raw-data provenance kept in a query-enabled Bookkeeping database
  • Resilience is enforced by a second copy at the Tier-1s
  • Rate: ~2000 events/s × 35 kB → ~70 MB/s (see the worked check below)
  • 4 main trigger sources (with little overlap): b-exclusive, dimuon, D*, b-inclusive
  • All data processing, up to final ntuple or histogram production, is distributed
  • It is not even possible to reconstruct all data at Tier-0; processing takes place at the Tier-1 centres
  • LHCb runs jobs where the data are; all data are placed explicitly
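
  A quick sanity check of the quoted rate, using the event size given above:

      2000\ \mathrm{events/s} \times 35\ \mathrm{kB/event} = 70\,000\ \mathrm{kB/s} \approx 70\ \mathrm{MB/s}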

  4. LHCb data processing software
  [Diagram: the Gaudi-based application chain, from simulation (Gauss) and digitisation (Boole) through the trigger (Moore) and reconstruction (Brunel) to analysis (DaVinci), exchanging GenParts, MCParts, MCHits, Raw Data, (r)DST, DST and AOD via the event model / physics event model, with the Conditions Database alongside.]

  5. LHCb dataflow
  [Diagram: Raw data flow from the Online system to the Tier-0 MSS-SE and on to the Tier-1 MSS-SEs; reconstruction (Raw/Digi → rDST), stripping (rDST + Raw → DST) and analysis of the DSTs run at the Tier-1s, while simulation runs at the Tier-2s and other sites.]

  6. Computing Model

  7. The LHCb Tier-1s
  • 6 Tier-1s: CNAF (IT, Bologna), GridKa (DE, Karlsruhe), IN2P3 (FR, Lyon), NIKHEF (NL, Amsterdam), PIC (ES, Barcelona), RAL (UK, Didcot)
  • They contribute reconstruction, stripping and analysis
  • Each keeps copies on MSS of: Raw (2 copies shared), locally produced rDST, DST (2 copies), MC data (2 copies)
  • Each keeps copies on disk of: DST (7 copies)

  8. LHCb Computing: a few numbers
  Event size (kB), on persistent medium (not in memory):
      RAW:  TDR estimate 25,  current estimate 35
      rDST: TDR estimate 25,  current estimate 20
      DST:  TDR estimate 100, current estimate 110
  Event processing time (kSI2k.s), best estimates as of today:
      Reconstruction: TDR estimate 2.4, current estimate 2.4
      Stripping:      TDR estimate 0.2, current estimate 0.2
      Analysis:       TDR estimate 0.3, current estimate 0.3
  • Requirements for 2008 assume 4 × 10^6 seconds of beam

  9. Summary of resource needs for 2008
                            RAW    rDST   Stripped   Simulation   Analysis
  Data on tape 2008 (TB)    560    320    483        128          -
  Data on disk 2008 (TB)    76     43     775        375          114
  CPU needs in 2008 (MSI2k.yr):  Reconstruction 1.4, Stripping 0.5, Simulation 4.6, Analysis 0.5
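
  As a cross-check, the tape volumes for RAW and rDST follow from the rate and running time quoted on slides 3 and 8, assuming two tape copies of the RAW (Tier-0 plus the shared Tier-1 copy) and the equivalent of two rDST copies (e.g. one per reconstruction pass):

      2000\ \mathrm{events/s} \times 4\times 10^{6}\ \mathrm{s} = 8\times 10^{9}\ \mathrm{events}
      8\times 10^{9} \times 35\ \mathrm{kB} \approx 280\ \mathrm{TB};\quad \times\,2\ \mathrm{copies} \approx 560\ \mathrm{TB\ of\ RAW\ on\ tape}
      8\times 10^{9} \times 20\ \mathrm{kB} \approx 160\ \mathrm{TB};\quad \times\,2 \approx 320\ \mathrm{TB\ of\ rDST\ on\ tape}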

  10. DIRAC grid management software
  • DIRAC is the distributed data production and analysis system of the LHCb experiment
  • It includes both workload management and data management components
  • It uses LCG services whenever possible
  • It was originally developed for the MC data production tasks, with two goals:
  • Integrate all the heterogeneous computing resources available to LHCb
  • Minimize human intervention at LHCb sites
  • The resulting design led to an architecture based on a set of central services and a network of light distributed agents

  11. DIRAC Services, Agents and Resources
  [Diagram: clients (GANGA, the DIRAC API, the Production Manager, the job monitor, the BK query web page and the FileCatalog browser) talk to the central DIRAC services (Job Management Service, JobMonitorSvc, JobAccountingSvc, ConfigurationSvc, MessageSvc, FileCatalogSvc, BookkeepingSvc); agents run on the resources: LCG worker nodes, site gatekeepers and Tier-1 VO-boxes.]

  12. WMS Service
  • The DIRAC Workload Management System is itself composed of a set of central services, pilot agents and job wrappers
  • It realizes the PULL scheduling paradigm: pilot agents deployed on LCG worker nodes pull jobs from the central Task Queue
  • The central Task Queue makes it easy to apply the VO policies by prioritizing user jobs, using accounting information together with user identities, groups and roles
  • Job scheduling is late: a job goes to a resource only for immediate execution (sketched below)
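
  A minimal sketch of the pull paradigm described above. All names (TaskQueue, PilotAgent, the priorities) are illustrative for this sketch, not the actual DIRAC API:

      import itertools
      import queue

      class TaskQueue:
          """Central queue of waiting jobs, ordered by priority (highest first)."""
          def __init__(self):
              self._jobs = queue.PriorityQueue()
              self._counter = itertools.count()     # tie-breaker so jobs are never compared directly

          def add_job(self, priority, job):
              # PriorityQueue returns the smallest item first, so negate the priority
              self._jobs.put((-priority, next(self._counter), job))

          def pull_job(self):
              # Called by a pilot agent running on a worker node
              if self._jobs.empty():
                  return None
              _, _, job = self._jobs.get()
              return job

      class PilotAgent:
          """Runs on a worker node; pulls a job only when the node is ready to execute it."""
          def __init__(self, task_queue):
              self.task_queue = task_queue

          def run(self):
              while True:
                  job = self.task_queue.pull_job()   # late scheduling: fetch for immediate execution
                  if job is None:
                      break                          # nothing left to do, the pilot terminates
                  job()                              # execute the job wrapper

      # Usage: the queue is filled centrally; pilots deployed on worker nodes drain it
      tq = TaskQueue()
      tq.add_job(10, lambda: print("user analysis job"))
      tq.add_job(5, lambda: print("MC production job"))
      PilotAgent(tq).run()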

  13. DIRAC workload management
  [Diagram of the central services: the Job Receiver puts incoming jobs into the Job Database; Optimizers (Prioritizer, Data Optimizer, …) sort them into Task Queues; a Priority Calculator combines VOMS info, LHCb policy and quotas, and the Accounting Service with the job requirements and ownership to assign job priorities; the Match Maker hands jobs to the agents on the worker node resources, which are deployed by the Agent Director.]
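
  A hedged sketch of the matchmaking step in the diagram above: a pilot describes its resource, and the matcher returns the highest-priority waiting job whose requirements the resource satisfies. The field names (site, platform, priority) are assumptions for the sketch, not DIRAC's actual job description language:

      jobs = [
          {"id": 1, "priority": 8, "requirements": {"site": "LCG.CNAF.it"}},
          {"id": 2, "priority": 5, "requirements": {}},                      # runs anywhere
          {"id": 3, "priority": 9, "requirements": {"site": "LCG.RAL.uk"}},
      ]

      def matches(requirements, resource):
          """A job matches if every requirement is met by the resource description."""
          return all(resource.get(key) == value for key, value in requirements.items())

      def match_maker(jobs, resource):
          """Return the highest-priority waiting job that the resource can run, or None."""
          candidates = [j for j in jobs if matches(j["requirements"], resource)]
          return max(candidates, key=lambda j: j["priority"], default=None)

      pilot_resource = {"site": "LCG.CNAF.it", "platform": "slc4_ia32"}
      print(match_maker(jobs, pilot_resource))   # -> job 1, the best match for this site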

  14. VO-boxes
  • LHCb VO-boxes are machines offered by the Tier-1 sites to ensure the safety and efficiency of grid operations
  • The standard LCG software is maintained by the site managers; the LHCb software is maintained by the LHCb administrators
  • They handle the recovery of failed data transfers and bookkeeping operations
  • VO-boxes behave in a completely non-intrusive way, accessing the site grid services via standard interfaces
  • Their main advantage is geographical distribution
  • VO-boxes are now set up at all the Tier-1 centres
  • Any job can place requests on any VO-box in a round-robin way, for redundancy and load balancing (sketched below)
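
  A minimal sketch of the round-robin request placement mentioned above; the VO-box host names are made up for illustration:

      import itertools

      # Hypothetical Tier-1 VO-box host names (illustrative only)
      vo_boxes = ["vobox.gridka.de", "vobox.in2p3.fr", "vobox.cnaf.it",
                  "vobox.nikhef.nl", "vobox.pic.es", "vobox.ral.uk"]

      _next_box = itertools.cycle(vo_boxes)

      def place_request(request):
          """Send a failover/bookkeeping request to the next VO-box in round-robin order."""
          box = next(_next_box)
          print(f"queueing request {request!r} on {box}")   # stand-in for the real remote call
          return box

      for i in range(3):
          place_request(f"bookkeeping-update-{i}")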

  15. DM Components
  • The DIRAC Data Management tools are built on top of, or provide interfaces to, the existing services
  • The main components are:
  • Storage Element client and storage access plug-ins: SRM, GridFTP, HTTP, SFTP, FTP, …
  • Replica Manager, providing high-level operations: uploading, replication and registration; best-replica finding; failure retries with alternative data access methods (see the sketch below)
  • File Catalogs: LFC, Processing Database
  • High-level tools for automatic bulk data transfers: T0-T1 raw data distribution, T1-T1 reconstructed data distribution
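
  A rough sketch of the "failure retries with alternative data access methods" idea: try the available access plug-ins for a replica in order until one succeeds. The plug-in functions here are placeholders, not the real DIRAC storage plug-ins:

      # Placeholder access methods standing in for the real SRM/GridFTP/HTTP plug-ins
      def get_via_srm(replica):     raise IOError("SRM endpoint not responding")
      def get_via_gridftp(replica): return f"local copy of {replica} (via GridFTP)"
      def get_via_http(replica):    return f"local copy of {replica} (via HTTP)"

      ACCESS_METHODS = [("SRM", get_via_srm), ("GridFTP", get_via_gridftp), ("HTTP", get_via_http)]

      def get_file(replica):
          """Try each access method in turn; fall back to the next one on failure."""
          errors = []
          for name, method in ACCESS_METHODS:
              try:
                  return method(replica)
              except IOError as exc:
                  errors.append(f"{name}: {exc}")
          raise IOError("all access methods failed: " + "; ".join(errors))

      print(get_file("/lhcb/data/run1234/raw.dst"))   # SRM fails here, GridFTP succeeds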

  16. Bulk Data Management
  • Bulk asynchronous file replication:
  • Replication requests are set in the RequestDB
  • The Transfer Agent executes periodically: it obtains the 'Waiting' or 'Running' requests from the RequestDB, then submits and monitors FTS bulk transfer jobs (see the sketch below)
  [Diagram: on the LHCb/DIRAC DMS side, the Replica Manager and Transfer Agent work against the Request DB and the LCG File Catalog through an FC interface and a Transfer Manager interface; on the LCG machinery side, the File Transfer Service moves data over the transfer network between the Tier-0 SE and the Tier-1 SEs A, B and C.]
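
  A simplified sketch of one transfer-agent cycle. The RequestDB and FTS calls are mocked for the sketch; the real system talks to the LCG File Transfer Service:

      # Mocked stand-ins for the RequestDB and the FTS client (illustrative only)
      class RequestDB:
          def __init__(self, requests):
              self.requests = requests                      # list of dicts with 'status', 'files', ...
          def get_requests(self, statuses):
              return [r for r in self.requests if r["status"] in statuses]

      def submit_fts_job(source_se, target_se, files):
          print(f"submitting FTS job: {len(files)} files {source_se} -> {target_se}")
          return "fts-job-0001"                             # hypothetical FTS job identifier

      def fts_job_status(job_id):
          return "Done"                                     # pretend the transfer has finished

      def transfer_agent_cycle(request_db):
          """One periodic pass: pick up Waiting requests, submit FTS jobs, monitor Running ones."""
          for request in request_db.get_requests(["Waiting"]):
              request["fts_job"] = submit_fts_job(request["source"], request["target"], request["files"])
              request["status"] = "Running"
          for request in request_db.get_requests(["Running"]):
              if fts_job_status(request["fts_job"]) == "Done":
                  request["status"] = "Done"

      db = RequestDB([{"status": "Waiting", "source": "CERN", "target": "IN2P3",
                       "files": ["/lhcb/raw/run1.raw", "/lhcb/raw/run2.raw"]}])
      transfer_agent_cycle(db)   # submits the FTS job
      transfer_agent_cycle(db)   # finds it Done and closes the request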

  17. T0-T1 data transfer tests
  [Plot of the transfer test results.]

  18. Processing Database
  • A suite of Production Manager tools facilitates the routine production tasks: defining complex production workflows and managing large numbers of production jobs
  • Transformation Agents prepare data (re)processing jobs automatically as soon as the input files are registered in the Processing Database via a standard File Catalog interface (sketched below)
  • This minimizes human intervention and speeds up standard production
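
  A toy sketch of the transformation idea: newly registered files that match a transformation's input filter are grouped into jobs of a fixed size. The class name, filter and grouping policy are illustrative assumptions:

      class Transformation:
          """Turn newly registered files matching a filter into processing jobs."""
          def __init__(self, name, input_filter, files_per_job):
              self.name = name
              self.input_filter = input_filter      # e.g. a path prefix
              self.files_per_job = files_per_job
              self.pending = []                     # files waiting to be grouped into a job
              self.jobs = []

          def file_registered(self, lfn):
              """Called whenever a file is registered through the File Catalog interface."""
              if lfn.startswith(self.input_filter):
                  self.pending.append(lfn)
              while len(self.pending) >= self.files_per_job:
                  job_files = self.pending[:self.files_per_job]
                  self.pending = self.pending[self.files_per_job:]
                  self.jobs.append({"transformation": self.name, "input": job_files})

      reco = Transformation("Reco-2008", input_filter="/lhcb/data/2008/RAW/", files_per_job=2)
      for i in range(5):
          reco.file_registered(f"/lhcb/data/2008/RAW/run{i}.raw")
      print(f"{len(reco.jobs)} jobs created, {len(reco.pending)} file(s) still pending")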

  19. DIRAC production performance
  • Permanent production (no more dedicated Data Challenges)
  • Up to 10K simultaneous production jobs; the throughput is only limited by the capacity available on LCG
  • ~80 distinct sites accessed, through LCG or through DIRAC directly
  [Plots of the running-job counts, in total and at IN2P3.]

  20. CPU usage by country (since Dec. 2006)
  [Chart of CPU usage by country: CERN, Germany, UK, Spain, France, Italy, …]

  21. CPU usage since April 2006, by site
  • 50% of the production was done at the Tier-1s
  [Chart of CPU usage by site; labelled sites include RAL, CERN, CNAF and Manchester.]

  22. Status of various activities
  • The LHCb production system is working in a stable way for MC production; the pilot agent model screens the LCG inefficiencies
  • Data distribution and reprocessing are more difficult: unstable storage systems, flaws in the data access middleware
  • Many problems will be resolved (and new ones created) with the new SRM 2.2 release

  23. Status of various activities (continued)
  • User analysis is starting: reliable data access is a crucial point, and efficient job prioritization is a must since user and production jobs compete for the same resources
  • A full-chain test in June will involve DAQ, T0-T1 data distribution, distributed reconstruction, T1-T1 data distribution and final analysis
  • Automatic "real time" data movement and processing, close to real scale and involving all the T1 sites

  24. Conclusions
  • LHCb has proposed a Computing Model adapted to its specific needs (number of events, event size, low number of physics candidates)
  • Reconstruction, stripping and analysis resources are located at the Tier-1s (and possibly some Tier-2s with enough storage and CPU capacity)
  • CPU requirements are dominated by Monte Carlo production, assigned to Tier-2s and opportunistic sites; with DIRAC, even idle desktops/laptops could be used ;-) LHCb@home?
  • The requirements are modest compared to the other experiments
  • DIRAC, with its integrated WMS and DMS, is well suited and adapted to this computing model
  • LHCb's computing should be ready when the first data come

  25. LHCb software stack
  [Diagram of the stack: the applications (Gauss, Boole, Brunel, DaVinci, Panoramix, Moore) sit on the component projects (Lbcom, Rec, Phys, Online), the LHCb event model and the Gaudi framework, which in turn use the LCG software (POOL, SEAL, COOL, CORAL, Root, Geant4, GENSER, external libraries).]
  • Uses CMT for build and configuration (handling dependencies)
  • LHCb projects:
  • Applications: Gauss (simulation), Boole (digitisation), Brunel (reconstruction), Moore (HLT), DaVinci (analysis)
  • Algorithms: Lbcom (common packages), Rec (reconstruction), Phys (physics), Online
  • Event model: LHCb
  • Software framework: Gaudi
  • LCG Applications Area: POOL, Root, COOL
  • LCG/external: external software (boost, xerces, …) and middleware clients (lfc, gfal, …)

  26. Last month's activities
  • Record number of running jobs: 9654
  • Average of 7.5K running jobs in the last month
  • Temporary problems at PIC and RAL
  [Plot of running jobs per site: CERN, PIC, CNAF, GRIDKA, NIKHEF, RAL, IN2P3, and all sites combined.]

  27. Community Overlay Network
  [Diagram: the DIRAC WMS central services (Task Queue, monitoring, logging) overlaid on the grid, with pilot agents on the worker nodes connecting back to them.]
  • The DIRAC central services and pilot agents form a dynamic distributed system as easy to manage as an ordinary batch system
  • It gives a uniform view of the resources, independent of their nature: grids, clusters, PCs
  • Prioritization follows the VO policies and accounting
  • It is possible to reuse batch system tools, e.g. the Maui scheduler

  28. Reconstruction requirements
                        b-exclusive   Dimuon       D*           b-inclusive   Total
  Input fraction        0.1           0.3          0.15         0.45          1.0
  Number of events      8 × 10^8      2.4 × 10^9   1.2 × 10^9   3.6 × 10^9    8 × 10^9
  MSS storage (TB)      16            48           24           72            160
  CPU (MSI2k.yr)        0.15          0.45         0.23         0.68          1.52
  • 2 passes per year: 1 quasi real-time pass over a ~100-day period (2.8 MSI2k) and a re-processing over the 2-month shutdown period (4.3 MSI2k)
  • Make use of the Filter Farm at the pit (2.2 MSI2k); data are sent back to the pit
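
  The MSS storage row can be checked against the 20 kB rDST size quoted on slide 8 (i.e. it is consistent with storing the rDST output of one pass), e.g. for the total:

      8\times 10^{9}\ \mathrm{events} \times 20\ \mathrm{kB/event} = 1.6\times 10^{11}\ \mathrm{kB} = 160\ \mathrm{TB}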

  29. Stripping requirements
                              b-exclusive   Dimuon       D*           b-inclusive   Total
  Input fraction              0.1           0.3          0.15         0.45          1.00
  Reduction factor            10            5            5            100           9.57
  Event yield per stripping   8 × 10^7      4.8 × 10^8   2.4 × 10^8   3.6 × 10^7    8.4 × 10^8
  CPU (MSI2k.yr)              0.02          0.06         0.03         0.02          0.11
  Storage per stripping (TB)  9             26           13           4             52
  TAG (TB)                    1             2            1            4             8
  • Stripping runs 4 times per year, as a 1-month production outside of reconstruction
  • Stripping has at least 4 output streams
  • For "non-b" channels, only rDST + RAW are stored, i.e. 55 kB/event; for "b" channels, RAW + full DST, i.e. 110 kB/event
  • Output goes to disk SEs at all Tier-1 centres
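
  The storage figures follow from the event yields and the per-event sizes in the last bullets, for example:

      \text{dimuon: } 4.8\times 10^{8} \times 55\ \mathrm{kB} \approx 26\ \mathrm{TB}, \qquad
      \text{b-exclusive: } 8\times 10^{7} \times 110\ \mathrm{kB} \approx 9\ \mathrm{TB}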

  30. Simulation requirements
             Application   Nos. of events   CPU time/evt (kSI2k.s)   Total CPU (MSI2k.yr)
  Signal     Gauss         8 × 10^8         75                       1.9
             Boole         8 × 10^8         1                        0.03
             Brunel        8 × 10^7         2.4                      0.01
  Inclusive  Gauss         8 × 10^8         75                       1.9
             Boole         8 × 10^8         1                        0.03
             Brunel        8 × 10^7         2.4                      0.01
  Total                                                              3.87
  • Studies to measure the performance of the detector and of the event selection in particular regions of phase space
  • Large-statistics dimuon and D* samples are used for systematics, which reduces the Monte Carlo needs
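
  A cross-check of the first row, taking one year as roughly 3.15 × 10^7 seconds:

      8\times 10^{8} \times 75\ \mathrm{kSI2k.s} = 6\times 10^{10}\ \mathrm{kSI2k.s}
      \frac{6\times 10^{10}\ \mathrm{kSI2k.s}}{3.15\times 10^{7}\ \mathrm{s/yr}} \approx 1.9\times 10^{3}\ \mathrm{kSI2k.yr} = 1.9\ \mathrm{MSI2k.yr}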

  31. Simulation storage requirements
             Output   Nos. of events   Storage/evt (kB)   Total storage (TB)
  Signal     DST      8 × 10^7         400                32
             TAG      8 × 10^7         1                  0.1
  Inclusive  DST      8 × 10^7         400                32
             TAG      8 × 10^7         1                  0.1
  Total                                                   64
  • Simulation still dominates the LHCb CPU needs
  • The current event size for a Monte Carlo DST (with truth information) is ~400 kB/event
  • The total storage needed is 64 TB in 2008
  • The output is kept at CERN, with another 2 copies distributed over the Tier-1 centres
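
  For example, the signal DST line:

      8\times 10^{7}\ \mathrm{events} \times 400\ \mathrm{kB/event} = 3.2\times 10^{10}\ \mathrm{kB} = 32\ \mathrm{TB}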

  32. Analysis requirements
  Nos. of physicists performing analysis:          140
  Nos. of analysis jobs per physicist per week:    4
  Event size reduction factor after analysis:      5
  Number of "active" Ntuples:                      10
  2008 CPU needs (MSI2k.yr):                       0.31
  2008 disk storage (TB):                          80
  • User analysis is accounted for in the model as predominantly batch work: ~30k jobs/year, predominantly analysing ~10^6 events per job at a CPU cost of 0.3 kSI2k.s/event
  • Analysis needs grow linearly with time in the early phase of the experiment
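
  The job count and CPU need follow from the numbers above (one year ≈ 3.15 × 10^7 s), consistent with the quoted 0.31 MSI2k.yr:

      140\ \mathrm{physicists} \times 4\ \mathrm{jobs/week} \times 52\ \mathrm{weeks} \approx 2.9\times 10^{4}\ \mathrm{jobs/yr}
      2.9\times 10^{4} \times 10^{6}\ \mathrm{events} \times 0.3\ \mathrm{kSI2k.s} \approx 8.7\times 10^{9}\ \mathrm{kSI2k.s} \approx 0.3\ \mathrm{MSI2k.yr}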
