
LHCb Data Challenge 2004

LHCb Data Challenge 2004. A. Tsaregorodtsev, CPPM, Marseille. LCG-France Meeting, 22 July 2004, CERN. Goals of DC'04 – main goal: gather information to be used for writing the LHCb computing TDR/TP; robustness test of the LHCb software and production system.



  1. LHCb Data Challenge 2004
  A. Tsaregorodtsev, CPPM, Marseille
  LCG-France Meeting, 22 July 2004, CERN
  LCG-France, 22 July 2004, CERN

  2. Goals of DC'04
  • Main goal: gather information to be used for writing the LHCb computing TDR/TP
  • Robustness test of the LHCb software and production system
    • Using software as realistic as possible in terms of performance
  • Test of the LHCb distributed computing model
    • Including distributed analysis: a realistic test of the analysis environment needs realistic analyses
  • Incorporation of the LCG application area software into the LHCb production environment
  • Use of LCG resources as a substantial fraction of the production capacity

  3. DC 2004 phases
  • Phase 1 – MC data production
    • 180M events of different signals, background and minimum bias
    • Simulation + reconstruction
    • DSTs are copied to Tier1 centres
  • Phase 2 – Data reprocessing
    • Selection of various physics streams from the DSTs
    • Copy the selections to all Tier1 centres
  • Phase 3 – User analysis
    • User analysis jobs on DST data distributed over all the Tier1 centres

  4. Phase 1 – MC production

  5. DIRAC Services and Resources
  [Architecture diagram: user interfaces (production manager, GANGA UI, user CLI, job monitor, BK query web page, FileCatalog browser) talk to the DIRAC services – the DIRAC Job Management Service, BookkeepingSvc, FileCatalogSvc, JobMonitorSvc, InformationSvc, MonitoringSvc and JobAccountingSvc (with its AccountingDB). Agents connect the services to the DIRAC resources: DIRAC Storage (disk files served via gridftp, bbftp, rfio), DIRAC sites with their CEs (CE 1, CE 2, CE 3), and the LCG Resource Broker.]

  6. Software to be installed
  • Before an LHCb application can run on a worker node, the following software components must be installed:
    • the application software itself;
    • the software packages on which the application depends;
    • the necessary (file-based) databases;
    • the DIRAC software.
  • A single untar command installs everything in place
  • All the necessary libraries are included – no assumption is made about the availability of any software on the destination site (except a recent Python interpreter):
    • external libraries;
    • compiler libraries;
    • ld-linux.so
  • The same binary distribution runs on RH 7.1–9.0

  7. Software installation
  • Software repository:
    • web server (HTTP protocol)
    • LCG Storage Element
  • Installation in place, the DIRAC way:
    • by an Agent upon reception of a job with particular software requirements; OR
    • by a running job itself.
  • Installation in place, the LCG2 way:
    • a special kind of job running the standard DIRAC software installation utility

  8. Software installation in the job
  • A job may need extra software packages not in place on the CE:
    • a special version of the geometry;
    • user analysis algorithms.
  • Any number of packages can be installed by the job itself (up to all of them)
  • Packages are installed in the job's user space
    • Imitates the structure of the standard LHCb software directory tree with symbolic links

  9. 3rd-party components
  • Originally DIRAC aimed at producing the following components:
    • production database;
    • metadata and job provenance database;
    • workload management.
  • Expected 3rd-party components:
    • data management (file catalogue, replica management)
    • security services
    • information and monitoring services
  • Expectations of early delivery of the ARDA prototype components were not met

  10. File catalog service
  • The LHCb Bookkeeping was not meant to be used as a file (replica) catalog
    • Its main use is as a metadata and job provenance database
    • Its replica catalog is based on specially built views
  • The AliEn File Catalog was chosen to get a (full) set of the necessary functionality:
    • hierarchical structure:
      • logical organization of the data – optimized queries;
      • ACLs by directory;
      • metadata by directory;
      • file system paradigm;
    • robust, proven implementation;
    • easy to wrap as an independent service:
      • inspired by the ARDA RTAG work

  11. AliEn FileCatalog in DIRAC
  • The AliEn FC SOAP interface was not ready at the beginning of 2004
    • We had to provide our own XML-RPC wrapper
      • Compatible with the XML-RPC BK File Catalog
    • Using the AliEn command line, "alien -exec"
      • Ugly, but it works
  • Building a service on top of AliEn which is run by the lhcbprod AliEn user
    • Not really using the AliEn security mechanisms
  • Using AliEn version 1.32
  • So far in DC2004:
    • >100,000 files with >250,000 replicas
    • Very stable performance
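The "ugly, but it works" wrapper can be sketched like this: shell out to the AliEn command line and expose the result through a Python API (which the real service then served over XML-RPC). The class, its methods, and the `whereis` command format are assumptions for illustration; only the `alien -exec` invocation itself comes from the slide.

```python
import subprocess


class AliEnCatalogWrapper:
    """Sketch of wrapping the AliEn CLI behind a Python API, as DIRAC
    did before a SOAP interface existed.  The runner is injectable so
    the wrapper can be exercised without an AliEn installation."""

    def __init__(self, runner=None):
        self._run = runner or self._run_alien

    def _run_alien(self, command):
        # The slide's "alien -exec" invocation; output parsing below is
        # a guess at the shell's whitespace-separated replies.
        out = subprocess.run(["alien", "-exec", command],
                             capture_output=True, text=True, check=True)
        return out.stdout

    def list_replicas(self, lfn):
        """Return the storage elements holding replicas of an LFN
        (hypothetical 'whereis' command and output format)."""
        return self._run("whereis %s" % lfn).split()
```

Injecting the runner also explains how such a wrapper stays testable despite depending on an external binary.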

  12. File catalogs
  [Diagram: a DIRAC application or service holds two FileCatalog clients. The AliEn FC client talks over XML-RPC to the AliEn FileCatalog Service – an XML-RPC server wrapping the AliEn UI, with the AliEn FC backed by MySQL. The BK FC client talks over XML-RPC to the BK FileCatalog Service – an XML-RPC server in front of the LHCb BK DB, backed by ORACLE.]

  13. Data Production – 2004
  • Currently distributed data sets:
    • CERN: complete DST (copied directly from the production centres)
    • Tier1s: master copy of the DSTs produced at associated sites
  • DIRAC sites:
    • Bologna, Karlsruhe, Spain (PIC), Lyon, UK sites (RAL); all others go to CERN
  • LCG sites:
    • Currently only 3 grid (MSS) SE sites – CASTOR: Bologna, PIC, CERN
      • Bologna: ru, pl, hu, cz, gr, it
      • PIC: us, ca, es, pt, tw
      • CERN: elsewhere

  14. DIRAC DataManagement tools
  • DIRAC Storage Element:
    • IS description + server (bbftpd, sftpd, httpd, gridftpd, xmlrpcd, file, rfio, etc.)
    • Needs no special service installation on the site
    • Description in the Information Service: host, protocol, local path
  • ReplicaManager API for common operations:
    • copy(), copyDir(), get(), exists(), size(), mkdir(), etc.
  • Examples of usage:
    • dirac-rm-copyAndRegister <lfn> <fname> <size> <SE> <guid>
    • dirac-rm-copy dc2004.dst CERN_Castor_BBFTP
  • The Tier0 SE and Tier1 SEs are defined in the central IS

  15. Reliable Data Transfer
  • Any data transfer should be accomplished despite temporary failures of various services or networks:
    • multiple retries of failed transfers, with any necessary delay:
      • until the services are up and running;
      • not applicable for LCG jobs.
    • multiple retries of registration in the catalog.
  • Transfer Agent:
    • maintains a database of transfer requests;
    • transfers datasets or whole directories with log files;
    • retries transfers until success
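The retry loop at the heart of such a Transfer Agent can be sketched as below. The class and its callables are invented for illustration (the real agent sits on a Transfer DB and real SE protocols); what it demonstrates is the slide's guarantee: a temporarily unavailable SE only delays a transfer, it never loses one.

```python
class TransferAgent:
    """Sketch of slide 15's Transfer Agent: keep a queue of transfer
    requests and retry each on every pass until it succeeds."""

    def __init__(self, transfer, register):
        self.requests = []        # stands in for the Transfer DB
        self.transfer = transfer  # callable(src, dest); raises on failure
        self.register = register  # catalog registration, also retried

    def add_request(self, src, dest):
        self.requests.append((src, dest))

    def run_once(self):
        """One pass over the pending requests; failures stay queued."""
        still_pending = []
        for src, dest in self.requests:
            try:
                self.transfer(src, dest)
                self.register(dest)   # only register once the copy is safe
            except Exception:
                still_pending.append((src, dest))  # retry next pass
        self.requests = still_pending
```

Note how catalog registration happens inside the same try block, so a failed registration also stays in the queue, matching the "multiple retries of registration" bullet.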

  16. DIRAC DataManagement tools
  [Diagram: a job hands transfer requests to the Data Manager; the Data Optimizer feeds them into the Transfer DB, from which the Transfer Agent moves data between SE 1 and SE 2 via a local cache.]

  17. DIRAC DC2004 performance
  • In May–July:
    • Simulation + reconstruction
    • >80,000 jobs
    • ~75M events
    • ~25TB of data
      • Stored at the CERN, PIC, Lyon, CNAF and RAL Tier1 centres
    • >150,000 files in the catalogs
  • ~2000 jobs running continuously
    • Up to 3000 at peak

  18. DC2004 at CC/IN2P3
  • The main DIRAC development site
  • The CC/IN2P3 contribution is very weak:
    • production runs stably and continuously;
    • but the resources are very limited
  • HPSS performance is stable

  19. Note on BBFTP
  • A nice product
    • Stable, performant, complete, grid-enabled
  • Lightweight
    • Easy deployment of the statically linked executable
  • Good performance
    • Would be nice to have a parallelized, load-balancing server
  • Functionality not complete with respect to GRIDFTP:
    • remote storage management (ls(), size(), remove())
    • transfers between remote servers

  20. LCG experience

  21. Production jobs
  • Long jobs – 23 hours on average on a 2GHz PIV
  • Simulation + digitization + reconstruction steps
    • 5 to 10 steps in one job
  • No event input data
  • Output data – 1–2 output files of ~200MB
    • Stored to Tier1 and Tier0 SEs
  • Log files are copied to an SE at CERN
  • The AliEn and Bookkeeping catalogues are updated

  22. Using LCG resources
  • Different ways of scheduling jobs to LCG:
    • standard: jobs arrive via the RB;
    • direct: jobs go directly to a CE;
    • resource reservation.
  • Using the reservation mode for the DC2004 production:
    • agents are deployed to the WNs as LCG jobs
    • DIRAC jobs are fetched by the agents if the environment is OK
    • the agent steers the job execution, including data transfers and updates of the catalogs and bookkeeping.
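The reservation-mode flow above – the LCG job is only an agent, and a real DIRAC job is fetched only if the worker node checks out – reduces to a short decision function. Everything here is a named-for-illustration sketch, not DIRAC code:

```python
def pilot_agent(environment_ok, fetch_job, execute):
    """Sketch of slide 22's reservation mode: the LCG job lands on a
    worker node as a bare agent; only if the environment is sound does
    it pull a DIRAC job and steer its execution (payload, data
    transfers, catalog and bookkeeping updates)."""
    if not environment_ok():
        return None      # agent exits quietly; no DIRAC job is wasted
    job = fetch_job()    # late binding: the job is chosen at run time
    if job is None:
        return None      # nothing matching in the central queue
    return execute(job)
```

The key property is late binding: a broken worker node kills only a throwaway agent, never a production job, which is why DIRAC can simply submit more agents when some are lost.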

  23. Using LCG resources (2)
  • Using the DIRAC DataManagement tools:
    • DIRAC SE + gridftp + sftp
  • Starting to populate the RLS from the DIRAC catalogues:
    • for evaluation
    • for use with the LCG ReplicaManager

  24. Resource Broker I
  • No trivial use of the tools for large numbers of jobs, i.e. production
    • The command is re-authenticated for every job
    • Errors are produced with lists of jobs (e.g. retrieving non-terminated jobs)
  • Slow to respond with a few hundred jobs in the RB
    • e.g. 15 seconds for job scheduling
  • Ranking mechanism to provide an even distribution of jobs
    • The number of CPUs published is per site, not per user/VO – requesting free CPUs in the JDL doesn't help

  25. Resource Broker II
  • LCG, in general, does not advertise normalised time units
    • Solution: request CPU resources for the slowest CPU (500 MHz)
    • Problem: only very few sites have long enough queues
    • Solution: the DIRAC agent scales the CPU limit for the particular WN before requesting a job from DIRAC
    • Problem: some sites have normalised their units!
  • Jobs with infinite loops
    • A 3-day job in a week-long queue is killed by proxy expiry rather than by the CPU requirement
  • Jobs aborted with "proxy expired"
    • The RB was re-using old proxies!
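The agent-side scaling mentioned above amounts to one unit conversion. The function name and the linear-in-clock-speed assumption are mine; the 500 MHz reference is the slide's "slowest CPU":

```python
def scaled_cpu_limit(queue_seconds, wn_mhz, reference_mhz=500.0):
    """Sketch of slide 25's workaround: an LCG queue limit is in wall
    seconds of the local CPU, so before asking DIRAC for a job the
    agent converts it into reference-CPU (500 MHz) seconds, assuming
    CPU time scales linearly with clock speed."""
    return queue_seconds * wn_mhz / reference_mhz
```

So a 1-hour queue on a 2 GHz node is worth four reference-CPU hours of work, and the agent can safely request a correspondingly longer DIRAC job; the remaining trap, as the slide notes, is sites that had already normalised their published units.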

  26. Resource Broker III
  • Jobs cancelled by the RB but with the message "cancelled by user"
    • Due to a loss of communication between the RB and the CE – the job is rescheduled and killed on the original CE
    • Some jobs are not killed until they fail due to the inability to transfer data
    • DIRAC also re-schedules!
  • The RB lost control of the status of all jobs
  • The RB got "stuck" – not responding to any request – solved without loss of jobs

  27. Disk Storage
  • Jobs run in directories without enough space
    • Running jobs need ~2GB – a problem where a site has jobs sharing the same disk server rather than local WN space

  28. Reliable Data Transfer
  • In case of a data transfer failure, the data on LCG is lost: there is no retry mechanism if the destination SE is temporarily unavailable
  • Problems with the GRIDFTP server at CERN:
    • certificates not understood
    • refused connections

  29. Odds & Sods
  • The LDAP of the globus-mds server stops
    • OK – no jobs can be submitted to the site
    • BUT there are also problems with authentication of GridFTP transfers
  • Empty output sandbox
    • Tricky to debug!
  • Jobs cancelled by the retry count
    • Occurs on sites with many jobs running
    • DIRAC just submits more agents

  30. Conclusions

  31. Demand 2004
  • CPU:
    • 14 M UI hours (1.4 M UI hours consumed so far)
  • Storage:
    • HPSS: 20 TB
    • Disk: 2 TB
    • Accessible from the LCG grid

  32. Demand 2005
  • CPU:
    • ~15 M UI hours
  • Storage:
    • HPSS: 30 TB (~15 TB recycled)
    • Disk: 2 TB

  33. Tier2 centers
  • Feasible
    • Good network connectivity is essential
  • Limited functionality:
    • number crunching (production simulation type tasks)
  • Standard technical solution
    • Hardware (CPU + storage)
    • Cluster software
    • Central consultancy support
  • Housing space
    • Adequate rooms in the labs (cooling, electric power, etc.)

  34. Tier2 centers (2)
  • Local support
    • Staff to be found (a remote central watch tower?)
    • 24/7 or best-effort support
  • Serving the community
    • Regional
      • A possible financing source
      • Extra clients (security and resource-sharing policy issues)
    • National
      • A French grid (segment)?

  35. LHCb DC'04 Accounting

  36. Next Phases: Reprocessing and Analysis

  37. Data reprocessing and analysis
  • Preparing the data reprocessing phase:
    • Stripping – selecting events from the DST files into several output streams by physics groups
    • Scheduling jobs to the sites where the needed data are
      • Tier1s (CERN, Lyon, PIC, CNAF, RAL, Karlsruhe)
    • The workload management is capable of automatically scheduling a job to a site holding its data
    • Tools are being prepared to formulate reprocessing tasks.
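The data-driven scheduling step above boils down to a set intersection over the replica catalog. This helper is hypothetical (names and the catalog's lfn-to-sites mapping are assumptions), but it captures the constraint on the slide: a reprocessing job may only go to a Tier1 that holds replicas of all its input DSTs.

```python
def eligible_sites(input_lfns, replica_catalog, candidate_sites):
    """Sketch of data-driven scheduling for the stripping phase.

    replica_catalog maps each LFN to the set of sites holding a
    replica; a site qualifies only if it holds every input file.
    """
    sites = set(candidate_sites)
    for lfn in input_lfns:
        sites &= replica_catalog.get(lfn, set())
    return sorted(sites)
```

Returning an empty list then signals that the inputs must first be co-located (e.g. via the Transfer Agent) before the stripping job can be scheduled anywhere.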

  38. Data reprocessing and analysis (2)
  • User analysis:
    • Interfacing GANGA to submit jobs to DIRAC
  • Submitting user jobs to DIRAC sites:
    • Security concerns – jobs are executed by the agent account on behalf of the user
  • Submitting user jobs to LCG sites:
    • Through DIRAC, to have common job monitoring and accounting
    • Using user certificates to submit to LCG
    • No agent submission:
      • expecting a high failure rate
